# Example usage

To use `ols_regressor` in a project:

In [1]:
from ols_regressor.regressor import LinearRegressor
from ols_regressor.cross_validate import cross_validate
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv("../data/preprocessed_data.csv")
data.head()

Unnamed: 0,Brand,UsedOrNew,Transmission,DriveType,FuelType,BodyType,Doors,Seats,Engine_cylinder_number,Engine_total_volume,ExteriorColour,Year,Kilometres,Price,fuel_comsumption_liter,fuel_comsumption_km
0,Ssangyong,DEMO,Automatic,AWD,Diesel,SUV,4,7,4,2.2,White,2022,5595.0,51990.0,8.7,100.0
1,MG,USED,Automatic,Front,Premium,Hatchback,5,5,4,1.5,Black,2022,16.0,19990.0,6.7,100.0
2,BMW,USED,Automatic,Rear,Premium,Coupe,2,4,4,2.0,Grey,2022,8472.0,108988.0,6.6,100.0
3,Mercedes-Benz,USED,Automatic,Rear,Premium,Coupe,2,4,8,5.5,White,2011,136517.0,32990.0,11.0,100.0
4,Renault,USED,Automatic,Front,Unleaded,SUV,4,5,4,1.3,Grey,2022,1035.0,34990.0,6.0,100.0


In [3]:
data["fuel_comsumption_km"].value_counts()

fuel_comsumption_km
100.0    16734
Name: count, dtype: int64

In [4]:
data.columns

Index(['Brand', 'UsedOrNew', 'Transmission', 'DriveType', 'FuelType',
       'BodyType', 'Doors', 'Seats', 'Engine_cylinder_number',
       'Engine_total_volume', 'ExteriorColour', 'Year', 'Kilometres', 'Price',
       'fuel_comsumption_liter', 'fuel_comsumption_km'],
      dtype='object')

In [5]:
X, y = data.drop(columns=["Price"]), data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [6]:
categorical_features = ['Brand', 'UsedOrNew', 'Transmission', 'DriveType', 'FuelType',
       'BodyType', 'ExteriorColour']
ordinal_features = ['Doors', 'Seats', 'Engine_cylinder_number']
numeric_features = ['Year', 'Kilometres', 'Engine_total_volume', 'fuel_comsumption_liter']
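# 'fuel_comsumption_km' is constant (always 100.0, see In [3]), so it is dropped;
# 'fuel_comsumption_liter' stays in numeric_features and is scaled.
drop_features = ['fuel_comsumption_km']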
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=999), ordinal_features),
    (StandardScaler(), numeric_features),
    ("drop", drop_features)
)
X_train_encoded = ct.fit_transform(X_train)
X_train_encoded

array([[ 0.        ,  0.        ,  0.        , ...,  1.49214463,
         1.41067538,  1.51222474],
       [ 0.        ,  0.        ,  0.        , ..., -0.44649103,
        -0.66911493, -0.71054051],
       [ 0.        ,  0.        ,  0.        , ..., -0.51863016,
        -1.01574664, -1.7538793 ],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.00508252,
        -0.43802711, -0.61981539],
       [ 0.        ,  0.        ,  0.        , ...,  0.65793342,
        -0.43802711, -0.84662817],
       [ 0.        ,  0.        ,  0.        , ..., -0.2266645 ,
        -1.24683446,  0.37816084]])

In [7]:
X_test_encoded = ct.transform(X_test)

### Fitting the model on the training data
The `fit` method of `LinearRegressor` estimates the coefficients of a linear regression model by Ordinary Least Squares (OLS). It converts the input features and target values to NumPy arrays, augments the feature matrix with an intercept term, and computes the coefficients with the OLS formula. The result is stored in the `coef` attribute (`self.coef` inside the class) and holds the weights that minimize the sum of squared differences between the predicted and actual target values.
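
The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the package's actual code: `ols_fit` is a hypothetical helper, the intercept is assumed to be stored as the first coefficient, and `np.linalg.lstsq` is used in place of the explicit formula $(X^TX)^{-1}X^Ty$ because it solves the same least-squares problem and is numerically more stable.

```python
import numpy as np

def ols_fit(X, y):
    """Hypothetical helper: return OLS coefficients, intercept first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Augment the feature matrix with a column of ones for the intercept.
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    # Solve min ||X_aug @ coef - y||^2 (the least-squares problem).
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef

# Toy data generated from y = 1 + 2x, so the fit should recover [1, 2].
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0]
coef_demo = ols_fit(X_demo, y_demo)
print(coef_demo)
```

On noise-free data such as this toy example, the recovered coefficients match the intercept and slope used to generate the targets, up to floating-point error.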

The use of this function is demonstrated below.




In [8]:
model = LinearRegressor()
model.fit(X_train_encoded, y_train)

array([ 3.74898465e+04, -1.26339999e+02,  7.11831732e+00,  4.04630506e+03,
        1.26282038e+03,  1.64789618e+03, -3.97321281e+02,  6.19317822e+03,
        1.03251630e+02, -5.59094871e+02,  2.09777756e+03, -4.66435234e+02,
       -3.67181323e+02,  2.03920137e+02,  1.58213603e+02, -6.28933543e+02,
        4.55525187e+02, -3.81343490e+02,  2.94984577e+02,  1.16212028e+04,
       -4.22283424e+02, -6.65019443e+02, -3.79681297e+02, -2.14315054e+03,
       -3.41974560e+02,  5.47612239e+02, -6.27018411e+02,  1.24860503e+03,
       -7.50508269e+02,  1.09905040e+03, -2.05695840e+03,  1.90157237e+01,
        4.99444840e+02, -1.11071173e+03,  2.13852138e+02, -2.21332856e+02,
        3.26349352e+03,  2.77743492e+02,  3.97406991e+03,  6.41584576e+02,
       -1.07343660e+03, -1.28275515e+03, -1.49486106e+03,  3.36430408e+03,
        3.14400603e+03,  1.03653225e+03,  8.37046169e+02, -2.28471859e+03,
       -5.11475395e+02,  2.52498439e+03, -1.29102625e+03,  9.11152765e+03,
        3.25951560e+03, -

In [9]:
model.coef

array([ 3.74898465e+04, -1.26339999e+02,  7.11831732e+00,  4.04630506e+03,
        1.26282038e+03,  1.64789618e+03, -3.97321281e+02,  6.19317822e+03,
        1.03251630e+02, -5.59094871e+02,  2.09777756e+03, -4.66435234e+02,
       -3.67181323e+02,  2.03920137e+02,  1.58213603e+02, -6.28933543e+02,
        4.55525187e+02, -3.81343490e+02,  2.94984577e+02,  1.16212028e+04,
       -4.22283424e+02, -6.65019443e+02, -3.79681297e+02, -2.14315054e+03,
       -3.41974560e+02,  5.47612239e+02, -6.27018411e+02,  1.24860503e+03,
       -7.50508269e+02,  1.09905040e+03, -2.05695840e+03,  1.90157237e+01,
        4.99444840e+02, -1.11071173e+03,  2.13852138e+02, -2.21332856e+02,
        3.26349352e+03,  2.77743492e+02,  3.97406991e+03,  6.41584576e+02,
       -1.07343660e+03, -1.28275515e+03, -1.49486106e+03,  3.36430408e+03,
        3.14400603e+03,  1.03653225e+03,  8.37046169e+02, -2.28471859e+03,
       -5.11475395e+02,  2.52498439e+03, -1.29102625e+03,  9.11152765e+03,
        3.25951560e+03, -


### Cross Validation
Model selection and hyperparameter tuning often need both a train score and a validation score. Cross-validation provides the validation score by evaluating the model on held-out folds, which helps to detect overfitting. Our implementation is similar to the one in `scikit-learn`. The `cross_validate` function accepts five arguments:

- `model`: The model to cross-validate.
- `X`: The predictors of the training data, as a 2D NumPy array.
- `y`: The response of the training data, as a 1D NumPy array.
- `cv`: The number of folds used for cross-validation.
- `random_state`: The random state used for the shuffling in the cross-validation.

Note that, unlike the `scikit-learn` version, our `cross_validate` does not take a `return_train_score` argument. The train scores are always returned in the cross-validation results.
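
To make the procedure concrete, here is a rough sketch of a k-fold loop that reports both train and test $R^2$ for every fold. It is not the package's implementation: `kfold_scores`, `add_intercept`, and `r2` are hypothetical helpers, the model is a plain least-squares fit, and the way the shuffled indices are split into folds is an assumption.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SSE/SST."""
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst

def add_intercept(X):
    """Prepend a column of ones to X."""
    return np.column_stack([np.ones(X.shape[0]), X])

def kfold_scores(X, y, cv=5, random_state=None):
    """Hypothetical k-fold loop returning train and test R^2 per fold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shuffle the row indices once, then cut them into cv folds.
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(len(y)), cv)
    results = {"train_score": [], "test_score": []}
    for i in range(cv):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(cv) if j != i])
        # Fit on the training folds only.
        coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)
        # Score on the training folds and on the held-out fold.
        results["train_score"].append(r2(y[train_idx], add_intercept(X[train_idx]) @ coef))
        results["test_score"].append(r2(y[val_idx], add_intercept(X[val_idx]) @ coef))
    return results

# Synthetic data: a linear signal plus a small amount of noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
scores = kfold_scores(X_demo, y_demo, cv=5, random_state=42)
print(scores)
```

Because the fit never sees the held-out fold, a test score far below the train score for the same fold is a sign of overfitting or of a problem in that fold's data.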

In [11]:
cv_results = cross_validate(model, X_train_encoded, y_train, 5, 42)
pd.DataFrame(cv_results)

  X_normalized = (X_np - X_np.mean(axis=0)) / X_np.std(axis=0)


Unnamed: 0,train_score,test_score,fit_time,score_time
0,,,0.063,0.001996
1,,,0.043,0.003003
2,,,0.039002,0.002001
3,,,0.034009,0.002964
4,0.182976,0.201663,0.03401,0.003987
5,0.18638,-123.351296,0.042041,0.0


### Predicting with the Fitted Model
Now that the regression model has been fitted, we can use it to make predictions on unseen data. The `predict` function in the `ols_regressor` package is designed for this purpose. It expects an array-like matrix `X` of shape (n_samples, n_features) and computes the predicted target values from the coefficients stored in the `self.coef` attribute. It returns an array containing the model's prediction for each row of `X`.
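
As a small illustration of the computation, prediction is a matrix-vector product plus the intercept. The helper `ols_predict` below is hypothetical, and it assumes the coefficient vector stores the intercept first, followed by one weight per feature; the package's internal layout may differ.

```python
import numpy as np

def ols_predict(X, coef):
    """Hypothetical helper: predictions from coefficients with intercept first."""
    X = np.asarray(X, dtype=float)
    coef = np.asarray(coef, dtype=float)
    # intercept + sum over features of (weight * feature value)
    return coef[0] + X @ coef[1:]

# Made-up coefficients: intercept 1, weights 2 and -1.
coef_demo = np.array([1.0, 2.0, -1.0])
X_new = np.array([[1.0, 1.0],
                  [0.0, 3.0]])
preds = ols_predict(X_new, coef_demo)
print(preds)  # [ 2. -2.]
```

The input must have the same number of columns, in the same order, as the matrix used for fitting, which is why the test set is passed through the already fitted column transformer before prediction.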

The use of this function is demonstrated below.

In [12]:
model.predict(X_test_encoded)

array([71242.39545227, 65086.71872127, 75088.73651052, ...,
       65159.56570948, 53953.6612666 , 48427.00252345])

### Scoring the Fitted Model

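Here we use the `score` function in the `ols_regressor` package. It takes the features `X` of shape (n_samples, n_features) and the true target values `y` of shape (n_samples,) as input (`X_test_encoded` and `y_test` in the call below), predicts $\hat{y}$ from `X`, and calculates the coefficient of determination $R^2$ of the prediction. With $y_i$ the true values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the true values, $R^2$ is computed in the following steps: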
$$
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
$$
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$
$$
R^2 = 1 - \frac{SSE}{SST}
$$
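
The three formulas above translate directly into NumPy. The function `r_squared` below is a hypothetical stand-alone helper that takes predictions directly, unlike `score`, which is called with the features; the numbers are made up so that the result can be checked by hand.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)         # sum of squared errors
    sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - sse / sst

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Reasonable predictions: SSE = 1.0, SST = 5.0, so R^2 = 0.8.
r2_good = r_squared(y_true, np.array([1.5, 1.5, 3.5, 3.5]))

# Predictions worse than always guessing the mean: SSE = 20.0, so R^2 = -3.0.
r2_bad = r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0]))

print(r2_good, r2_bad)
```

The second case shows that $R^2$ has no lower bound: whenever SSE exceeds SST, the score is negative, which is how a strongly negative test score can appear in cross-validation results.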

In [13]:
model.score(X_test_encoded, y_test)

0.18939667040126118