# Example usage

To use `ols_regressor` in a project:

In [1]:
from ols_regressor.regressor import LinearRegressor
from ols_regressor.cross_validate import cross_validate
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

This guide demonstrates the `ols_regressor` package on a vehicle dataset gathered in Australia for the year 2023. The dataset contains recent car prices in Australia along with brands, models, vehicle types, and other features found in the Australian automotive market.

The goal of this study is to predict the market price of a vehicle from the basic information provided in the dataset. To do this, we use the `ols_regressor` package. As the name indicates, it fits a linear regression model with the Ordinary Least Squares (OLS) method. It also provides `predict` for price estimation and `score` for evaluating model performance, as well as a custom `cross_validate` function intended for hyperparameter tuning.

The package follows the design patterns of `scikit-learn`, with a few modifications, so it should feel familiar to users experienced with `scikit-learn`.

### Data Preprocessing

We have already carried out an initial round of preprocessing, which mainly consists of dropping columns that are not useful. The details can be found in the `data` folder of the GitHub repository.

The preprocessed data is stored in the `data/preprocessed_data.csv` file. We read it using the `read_csv` function provided by `pandas`.

In [13]:
data = pd.read_csv("../data/preprocessed_data.csv")
data.columns

Index(['Brand', 'UsedOrNew', 'Transmission', 'DriveType', 'FuelType',
       'BodyType', 'Doors', 'Seats', 'Engine_cylinder_number',
       'Engine_total_volume', 'ExteriorColour', 'Year', 'Kilometres', 'Price',
       'fuel_comsumption_liter', 'fuel_comsumption_km'],
      dtype='object')

The `Price` column is the response in this study, so we separate it from the remaining columns to create `X` and `y`.

The data are then split into training and test sets using the `train_test_split` function provided by `scikit-learn`.

In [5]:
X, y = data.drop(columns=["Price"]), data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Our data contain several types of features, but an OLS model requires numeric inputs. The code snippet below is a preprocessing step that applies a different transformer to each type of feature: categorical, ordinal, numeric, and features to be dropped. Let's break down what each part of the code does:

1. **Categorical Features**: 
   - `categorical_features` are variables that represent categories, such as the brand or type of fuel. These features are transformed using `OneHotEncoder`, which converts each category into its own 0/1 column. The `handle_unknown="ignore"` parameter ensures that a category not seen during fitting is encoded as all zeros instead of raising an error.

2. **Ordinal Features**: 
   - `ordinal_features` are categorical features with a clear ordering or ranking (e.g., number of doors in a car). These are transformed using `OrdinalEncoder`. The settings `handle_unknown="use_encoded_value", unknown_value=999` mean that a category not seen during fitting is assigned the value 999.

3. **Numeric Features**: 
   - `numeric_features` are continuous numbers (e.g., year of manufacture, kilometres driven). These are standardized using `StandardScaler`, which rescales each column to zero mean and unit variance so that features on very different scales become comparable.

4. **Dropping Features**: 
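   - `drop_features` specifies the features to be excluded from the model. In this case, `fuel_comsumption_liter` is passed to the `"drop"` transformer. Note that this column is also listed in `numeric_features`, so its standardized values still appear in the output. Columns that appear in none of the lists (here `fuel_comsumption_km`) are dropped by the column transformer's default `remainder="drop"`.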

5. **Column Transformer (`ct`)**:
   - `make_column_transformer` is used to apply these transformations to the appropriate columns in the dataset. It creates a single transformer object (`ct`) which applies all the specified transformations to the dataset in a streamlined manner.

6. **Transforming the Training Data (`X_train_encoded`)**:
   - Finally, `X_train_encoded` is created by fitting `ct` on `X_train` and transforming it. The result is a numeric array containing the one-hot encoded, ordinal encoded, and standardized features, ready for use in a machine learning model.

This preprocessing step ensures that the model we fit next receives purely numeric, consistently scaled inputs.

In [6]:
# Categorical features should be encoded with OneHotEncoder
categorical_features = ['Brand', 'UsedOrNew', 'Transmission', 'DriveType', 'FuelType',
       'BodyType', 'ExteriorColour']
# Ordinal features should be encoded with OrdinalEncoder
ordinal_features = ['Doors', 'Seats', 'Engine_cylinder_number']
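# Numeric features should be standardized with StandardScaler
numeric_features = ['Year', 'Kilometres', 'Engine_total_volume', 'fuel_comsumption_liter']
# Since this feature contains only 100 for all observations, we simply drop it.
# Note: it is also listed in numeric_features above, so its scaled values are
# still part of the output; columns in none of the lists (here
# 'fuel_comsumption_km') are dropped by the default remainder="drop".
drop_features = ['fuel_comsumption_liter']

# Build a column transformer based on the feature types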
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=999), ordinal_features),
    (StandardScaler(), numeric_features),
    ("drop", drop_features)
)
# fit the column transformer on the training data
X_train_encoded = ct.fit_transform(X_train)
X_train_encoded

array([[ 0.        ,  0.        ,  0.        , ...,  1.49214463,
         1.41067538,  1.51222474],
       [ 0.        ,  0.        ,  0.        , ..., -0.44649103,
        -0.66911493, -0.71054051],
       [ 0.        ,  0.        ,  0.        , ..., -0.51863016,
        -1.01574664, -1.7538793 ],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.00508252,
        -0.43802711, -0.61981539],
       [ 0.        ,  0.        ,  0.        , ...,  0.65793342,
        -0.43802711, -0.84662817],
       [ 0.        ,  0.        ,  0.        , ..., -0.2266645 ,
        -1.24683446,  0.37816084]])

In [7]:
X_test_encoded = ct.transform(X_test)
X_test_encoded

array([[ 0.        ,  0.        ,  0.        , ..., -1.07234979,
         0.48632413, -0.07546472],
       [ 0.        ,  0.        ,  0.        , ...,  0.61690914,
         0.71741194,  0.06062295],
       [ 0.        ,  0.        ,  0.        , ..., -1.22961588,
        -0.43802711,  0.55961106],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -1.22920882,
        -1.01574664, -0.34764006],
       [ 0.        ,  0.        ,  0.        , ..., -0.4935322 ,
        -0.43802711, -0.52909028],
       [ 0.        ,  0.        ,  0.        , ..., -0.08272971,
        -0.43802711, -0.34764006]])

### Fitting the model on the training data

The `fit` method of `LinearRegressor` calculates the coefficients of the linear regression model using the Ordinary Least Squares (OLS) method. It converts the input features and target values into NumPy arrays, augments the feature matrix with an intercept term, and computes the model coefficients using the OLS formula. The resulting coefficients are stored in the `coef` attribute; they are the weights that minimize the sum of squared differences between the predicted and actual target values.

The use of this function is demonstrated below.


In [8]:
model = LinearRegressor()
model.fit(X_train_encoded, y_train)

array([ 3.31243420e+04, -3.93475551e+04, -3.24646310e+04,  1.59195142e+05,
       -2.52374164e+04, -2.26803002e+04, -5.99355383e+04,  2.01100254e+05,
       -2.02462372e+04, -5.23111571e+04,  4.22082863e+04, -4.60241526e+04,
       -4.31256021e+04, -2.12256241e+04, -1.41643446e+04, -5.61456854e+04,
        1.46990564e+04, -4.78145775e+04, -1.41355982e+04,  4.71981281e+05,
       -4.21568755e+04, -3.55554893e+04, -5.88606161e+04, -5.23137175e+04,
       -6.34807611e+04, -1.50773697e+04, -5.60401972e+04, -9.49431876e+03,
       -5.27748547e+04, -1.11349964e+03, -4.11487549e+04, -3.26223863e+04,
        1.93720776e+04, -3.69586651e+04, -1.60671589e+04, -4.53015641e+04,
        2.06220654e+05, -3.02795247e+04,  5.11685045e+04, -2.09534816e+04,
       -4.06118740e+04, -3.87486786e+04, -5.02415150e+04,  3.00881145e+05,
       -6.54027568e+03, -2.28377052e+04,  5.18866121e+04, -4.98507563e+04,
       -5.64256853e+04,  4.87344179e+04, -3.78121378e+04,  3.34400983e+05,
       -1.58124468e+04, -

In [9]:
model.coef

array([ 3.31243420e+04, -3.93475551e+04, -3.24646310e+04,  1.59195142e+05,
       -2.52374164e+04, -2.26803002e+04, -5.99355383e+04,  2.01100254e+05,
       -2.02462372e+04, -5.23111571e+04,  4.22082863e+04, -4.60241526e+04,
       -4.31256021e+04, -2.12256241e+04, -1.41643446e+04, -5.61456854e+04,
        1.46990564e+04, -4.78145775e+04, -1.41355982e+04,  4.71981281e+05,
       -4.21568755e+04, -3.55554893e+04, -5.88606161e+04, -5.23137175e+04,
       -6.34807611e+04, -1.50773697e+04, -5.60401972e+04, -9.49431876e+03,
       -5.27748547e+04, -1.11349964e+03, -4.11487549e+04, -3.26223863e+04,
        1.93720776e+04, -3.69586651e+04, -1.60671589e+04, -4.53015641e+04,
        2.06220654e+05, -3.02795247e+04,  5.11685045e+04, -2.09534816e+04,
       -4.06118740e+04, -3.87486786e+04, -5.02415150e+04,  3.00881145e+05,
       -6.54027568e+03, -2.28377052e+04,  5.18866121e+04, -4.98507563e+04,
       -5.64256853e+04,  4.87344179e+04, -3.78121378e+04,  3.34400983e+05,
       -1.58124468e+04, -
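
The OLS computation described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the package's actual implementation: the function name `fit_ols` is hypothetical, placing the intercept first is an assumption, and `np.linalg.lstsq` is used in place of the closed-form $\hat{\beta} = (X^\top X)^{-1} X^\top y$ because it remains stable when columns are collinear (as happens with full one-hot encoding plus an intercept).

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients with the intercept first (ordering is an assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient acts as the intercept
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    # Least-squares solution of X_aug @ coef ~= y
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef

# Noise-free toy data generated from y = 1 + 2x
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0]
coef_demo = fit_ols(X_demo, y_demo)
print(coef_demo)  # approximately [1. 2.]
```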

### Cross Validation

For hyperparameter tuning, we often need both the training score and the validation score. Cross-validation estimates validation performance without touching the test set, which helps us detect overfitting. Our implementation is similar to that of `scikit-learn`. The `cross_validate` function accepts five arguments:

- `model`: The model to cross-validate
- `X`: The predictors of the training data. It should be a 2D NumPy array
- `y`: The response of the training data. It should be a 1D NumPy array
- `cv`: The number of folds used for cross-validation
- `random_state`: The random state used for shuffling in the cross-validation

Unlike the `scikit-learn` version, our `cross_validate` does not take a `return_train_score` argument. The training scores are always returned in the cross-validation results.

In [10]:
cv_results = cross_validate(model, X_train_encoded, y_train.to_numpy(), 5, 42)
pd.DataFrame(cv_results)

|   | train_score | test_score | fit_time | score_time |
|---|---|---|---|---|
| 0 | 0.669816 | 0.664145 | 0.026347 | 0.000695 |
| 1 | 0.688293 | 0.537996 | 0.007791 | 0.001042 |
| 2 | 0.666635 | 0.670863 | 0.072412 | 0.000649 |
| 3 | 0.715619 | 0.488649 | 0.021645 | 0.000488 |
| 4 | 0.665131 | 0.687616 | 0.045048 | 0.000465 |
| 5 | 0.672295 | -19.2097 | 0.017041 | 8.9e-05 |
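
To make the procedure concrete, here is a minimal sketch of shuffled k-fold cross-validation that records both training and validation scores. It is not the package's implementation: `kfold_scores` and `TinyOLS` are hypothetical names, and the fold-splitting strategy (`np.array_split` on a random permutation) is an assumption. This sketch always returns exactly `cv` rows, which may differ from how the package splits the data.

```python
import numpy as np

class TinyOLS:
    """Stand-in model with fit/score, used only to keep the sketch self-contained."""
    def fit(self, X, y):
        X_aug = np.column_stack([np.ones(len(X)), X])
        self.coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
        return self

    def score(self, X, y):
        pred = np.column_stack([np.ones(len(X)), X]) @ self.coef
        return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def kfold_scores(model, X, y, cv=5, random_state=None):
    """Shuffle the row indices, split them into cv folds, and score each fold."""
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(len(X)), cv)
    results = {"train_score": [], "test_score": []}
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model.fit(X[train_idx], y[train_idx])
        results["train_score"].append(model.score(X[train_idx], y[train_idx]))
        results["test_score"].append(model.score(X[val_idx], y[val_idx]))
    return results

# Synthetic data: a linear signal plus a small amount of noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
scores = kfold_scores(TinyOLS(), X_demo, y_demo, cv=5, random_state=42)
```

A large gap between `train_score` and `test_score` in a fold is the usual sign of overfitting.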


### Predicting with the Fitted Model

Now that the regression model has been fitted, we can use it to make predictions on unseen data. The `predict` method expects an array-like matrix `X` of shape `(n_samples, n_features)` and computes the predicted target values from the coefficients stored in the `coef` attribute. It returns an array containing the model's predictions for the provided input features.

The use of this function is demonstrated below.

In [11]:
model.predict(X_test_encoded)

array([55898.43408674, 75474.05804754, 67526.80375232, ...,
       44614.05367543, 33326.253816  , 23928.94731774])
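
Conceptually, prediction is a single matrix product between the intercept-augmented features and the coefficient vector. The sketch below illustrates this under the assumption that the intercept is stored as the first coefficient; `predict_ols` and the coefficient values are hypothetical and not taken from the package.

```python
import numpy as np

def predict_ols(X, coef):
    """Compute predictions as [1, X] @ coef, assuming the intercept is coef[0]."""
    X = np.asarray(X, dtype=float)
    # Add the same column of ones that was used when fitting
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    return X_aug @ coef

coef_demo = np.array([1.0, 2.0])        # hypothetical intercept and slope
X_new = np.array([[0.0], [5.0]])
preds = predict_ols(X_new, coef_demo)   # 1 + 2*0 = 1 and 1 + 2*5 = 11
print(preds)
```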

### Scoring the Fitted Model

Here we use the `score` method of the fitted model. It takes the features `X` of shape `(n_samples, n_features)` and the true target values `y` of shape `(n_samples,)`, and returns the coefficient of determination ($R^2$) of the model's predictions on `X`.

In [12]:
model.score(X_test_encoded, y_test)

0.5705212165277045
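
For reference, the coefficient of determination is commonly defined as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes it by hand, assuming the package follows this standard definition; `r2_manual` is a hypothetical helper name.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_manual(y_true, y_true)               # 1.0: predictions match exactly
baseline = r2_manual(y_true, np.full(4, 2.5))     # 0.0: always predicting the mean
worse = r2_manual(y_true, np.full(4, 10.0))       # negative: worse than the mean
```

A score of 1 means perfect predictions, 0 matches a model that always predicts the mean, and negative values mean the model does worse than that baseline, which is how a fold score such as -19.2 in the cross-validation results can arise.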