### Terminology
* equation of a straight line: y = mx + c
* straight lines
* feature coefficient (slope, gradient, m)
* bias coefficient (y-intercept, c)
* domain: x-axis, independent variable
* range: y-axis, dependent variable
* loss function, cost function, objective function, error function
* bias-variance tradeoff, overfitting, underfitting
* ordinary least squares method
* gradient descent method
* residual, error, squared error
* train data, test data
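
To make a few of these terms concrete, here is a minimal sketch that fits y = mx + c to a small made-up data set in two ways: the ordinary least squares closed form, and gradient descent on the mean squared error (the loss function). The data points, learning rate and step count are illustrative assumptions, not values from this notebook.

```python
# Toy data (made up for illustration): roughly y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
n = len(xs)

# Ordinary least squares: closed-form slope and intercept for one feature.
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
m_ols = s_xy / s_xx
c_ols = y_mean - m_ols * x_mean

# Gradient descent: start from zero and repeatedly step against the
# gradient of the mean squared error.
m_gd, c_gd = 0.0, 0.0
learning_rate = 0.05  # assumed; too large a value would diverge
n_steps = 5000        # assumed
for _ in range(n_steps):
    residuals = [y - (m_gd * x + c_gd) for x, y in zip(xs, ys)]
    grad_m = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    grad_c = -2.0 / n * sum(residuals)
    m_gd -= learning_rate * grad_m
    c_gd -= learning_rate * grad_c

print(f"OLS: m = {m_ols:.4f}, c = {c_ols:.4f}")
print(f"GD : m = {m_gd:.4f}, c = {c_gd:.4f}")
```

Both approaches minimise the same squared-error loss, so on this toy data they should agree closely; the closed form gets there in one step, while gradient descent approaches it iteratively.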

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

### Load Data

In [2]:
df = pd.read_csv("vw.csv")

display(df.head())
print(df.shape)

| | model | year | price | transmission | mileage | fuelType | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|
| 0 | T-Roc | 2019 | 25000 | Automatic | 13904 | Diesel | 49.6 | 2.0 |
| 1 | T-Roc | 2019 | 26883 | Automatic | 4562 | Diesel | 49.6 | 2.0 |
| 2 | T-Roc | 2019 | 20000 | Manual | 7414 | Diesel | 50.4 | 2.0 |
| 3 | T-Roc | 2019 | 33492 | Automatic | 4825 | Petrol | 32.5 | 2.0 |
| 4 | T-Roc | 2019 | 22900 | Semi-Auto | 6500 | Petrol | 39.8 | 1.5 |


(15157, 8)


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15157 entries, 0 to 15156
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         15157 non-null  object 
 1   year          15157 non-null  int64  
 2   price         15157 non-null  int64  
 3   transmission  15157 non-null  object 
 4   mileage       15157 non-null  int64  
 5   fuelType      15157 non-null  object 
 6   mpg           15157 non-null  float64
 7   engineSize    15157 non-null  float64
dtypes: float64(2), int64(3), object(3)
memory usage: 947.4+ KB


In [4]:
df.columns

Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'mpg',
       'engineSize'],
      dtype='object')

In [5]:
df["model"].value_counts()

 Golf               4863
 Polo               3287
 Tiguan             1765
 Passat              915
 Up                  884
 T-Roc               733
 Touareg             363
 Touran              352
 T-Cross             300
 Golf SV             268
 Sharan              260
 Arteon              248
 Scirocco            242
 Amarok              111
 Caravelle           101
 CC                   95
 Tiguan Allspace      91
 Beetle               83
 Shuttle              61
 Caddy Maxi Life      59
 Jetta                32
 California           15
 Caddy Life            8
 Eos                   7
 Caddy                 6
 Caddy Maxi            4
 Fox                   4
Name: model, dtype: int64

In [6]:
df["transmission"].value_counts()

Manual       9417
Semi-Auto    3780
Automatic    1960
Name: transmission, dtype: int64

In [7]:
df["fuelType"].value_counts()

Petrol    8553
Diesel    6372
Hybrid     145
Other       87
Name: fuelType, dtype: int64

#### Separating the feature and target variable

In [8]:
features = ['year', 'mileage', 'mpg', 'engineSize']  # numeric columns only
target = ["price"]  # a list, so y is a one-column DataFrame rather than a Series

x = df[features]
y = df[target]

print("Shape of x = ", x.shape)
print("Shape of y = ", y.shape)


Shape of x =  (15157, 4)
Shape of y =  (15157, 1)


### Create train and test data set

In [9]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(12125, 4) (3032, 4) (12125, 1) (3032, 1)
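
The split sizes above can be checked by hand. This small sketch assumes scikit-learn's usual behaviour for a fractional `test_size`: the test count is rounded up and the remaining rows go to the training set.

```python
import math

n_samples = 15157  # rows in df, from df.shape above
test_size = 0.2    # fraction passed to train_test_split

# Assumed rounding rule: test count rounded up, the rest is training data.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 12125 3032
```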


### Linear Regression

In [10]:
model = LinearRegression()
model = model.fit(x_train,y_train)

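The fitted model has the form y = m1*x1 + m2*x2 + m3*x3 + m4*x4 + c, where x1 to x4 are the four features (year, mileage, mpg, engineSize), m1 to m4 are their coefficients, and c is the intercept.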

In [11]:
c = model.intercept_
print(c)

[-2992133.6731303]


In [12]:
coefficients = model.coef_
print(coefficients)

[[ 1.48739376e+03 -8.48572114e-02 -8.58241212e+01  9.37268406e+03]]


In [13]:
coef_df = pd.DataFrame({"features":x.columns,
                      "coefficients": np.squeeze(coefficients)})

display(coef_df)

| | features | coefficients |
|---|---|---|
| 0 | year | 1487.393763 |
| 1 | mileage | -0.084857 |
| 2 | mpg | -85.824121 |
| 3 | engineSize | 9372.684065 |
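
To see how the intercept and coefficients combine into a price, here is a sketch that applies them by hand to the first row shown by `df.head()`. The numbers are copied from the printed outputs above, so they carry the display rounding; the dictionary names are only for illustration.

```python
# Values copied from the printed outputs above (rounded as displayed).
intercept = -2992133.6731303
coefficients = {
    "year": 1.48739376e+03,
    "mileage": -8.48572114e-02,
    "mpg": -8.58241212e+01,
    "engineSize": 9.37268406e+03,
}

# First row of df.head(): a 2019 T-Roc listed at 25000.
car = {"year": 2019, "mileage": 13904, "mpg": 49.6, "engineSize": 2.0}

# price = m1*x1 + m2*x2 + m3*x3 + m4*x4 + c
predicted_price = intercept + sum(
    coefficients[name] * car[name] for name in coefficients
)

print(round(predicted_price))
```

The result lands a little above 24,200, against a listed price of 25000. The large negative intercept is the model's value at year = 0, far outside the data, so it mainly offsets the year term rather than meaning anything on its own.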


### Prediction

In [14]:
y_pred = model.predict(x_test)
print(y_pred)

[[13865.0143481 ]
 [24971.69877535]
 [19559.80753035]
 ...
 [ 9592.74274532]
 [ 4825.24563734]
 [ 5633.54327228]]


In [15]:
print(y_test)

       price
7342   14450
10328  23950
14992  10495
8466    9990
10347  21998
...      ...
8211   17250
8401   10450
9810   10290
7872    7499
9399    7290

[3032 rows x 1 columns]
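
The rows of `y_pred` line up with the rows of `y_test`, so residuals can be read off pair by pair. A short sketch using the first three pairs printed above (values copied from the outputs):

```python
# First three predictions and the matching actual prices from the outputs above.
predicted = [13865.0143481, 24971.69877535, 19559.80753035]
actual = [14450, 23950, 10495]

# Residual = actual - predicted; the squared residual is what MSE averages.
residuals = [y_true - y_hat for y_true, y_hat in zip(actual, predicted)]
squared_errors = [r ** 2 for r in residuals]

for y_true, y_hat, r, sq in zip(actual, predicted, residuals, squared_errors):
    print(f"actual={y_true:>6}  predicted={y_hat:>10.2f}  "
          f"residual={r:>9.2f}  squared={sq:>14.2f}")
```

Squaring makes the third car, which is off by roughly 9,000, dominate the other two, which is why a few large misses can drive the MSE.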


### Prediction Error

In [16]:
MSE = mean_squared_error(y_test, y_pred)
print(MSE)

15140519.167912768


In [17]:
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))  # square root of MSE; avoids the `squared` argument, which newer scikit-learn releases no longer accept
print(RMSE)

3891.0820047787183


In [18]:
r2 = r2_score(y_test, y_pred)
print("r_squared = ", r2)

r_squared =  0.7448589068322593
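
For reference, the three metrics above (plus the mean absolute error, which is imported but not used) can be computed from their definitions. The sketch below uses made-up toy values, then checks that the notebook's own RMSE is the square root of its MSE.

```python
import math

# Toy values (made up for illustration).
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 10.0]
n = len(y_true)

residuals = [t - p for t, p in zip(y_true, y_hat)]

mse = sum(r ** 2 for r in residuals) / n   # mean squared error
rmse = math.sqrt(mse)                      # same units as the target
mae = sum(abs(r) for r in residuals) / n   # mean absolute error

# R^2 = 1 - (residual sum of squares / total sum of squares)
y_mean = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)

# The notebook's MSE and RMSE follow the same relation.
notebook_rmse = math.sqrt(15140519.167912768)
print(notebook_rmse)
```

Because RMSE is in the same units as price, the value of about 3891 above is easier to interpret than the MSE of about 15.1 million.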


### KPI: Key Performance Indicator (MSE, RMSE, ...)

References:

[1] A Gentle Introduction to Machine Learning: https://www.youtube.com/watch?v=Gv9_4yMHFhI&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&ab_channel=StatQuestwithJoshStarmer

[2] Linear Regression, Clearly Explained!!!: https://www.youtube.com/watch?v=nk2CQITm_eo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=10&ab_channel=StatQuestwithJoshStarmer

[3] Linear Regression scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

[4] Data Splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

[5] Mean Squared Error: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

[6] RMSE calculation: https://www.youtube.com/watch?v=zMFdb__sUpw&ab_channel=KhanAcademy

[7] Regression coefficients: https://statisticsbyjim.com/glossary/regression-coefficient/

[8] Machine Learning Quiz 01: Linear Regression: https://kawsar34.medium.com/machine-learning-quiz-01-a2fac2712a55

[9] Linear Regression Assumptions: https://www.statology.org/linear-regression-assumptions/

[10] Constant Variance: https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-linear-regression-model-mean

[11] Multiple Regression: https://www.youtube.com/watch?v=zITIFTsivN8&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=11&ab_channel=StatQuestwithJoshStarmer

[12] Linear Regression Simplified - Ordinary Least Square vs Gradient Descent: https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76