# Table of Contents

1. [Introduction](#INTRODUCTION)
    1. [TL-DR; Summary](#TL-DR;-Summary)
    1. [Linear Regression](#Linear-Regression)
    1. [Conditions of Application](#Conditions-of-application)
1. [Building the Model](#Building-the-Model)
    1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    1. [Initial Model](#Initial-Model)
    1. [Initial Diagnose](#Initial-Diagnose)
    1. [Final Diagnostics](#Final-Diagnostics)
        1. [Normality of the residuals](#Normality-of-the-residuals)
        1. [Homoscedasticity](#Homoscedasticity)
        1. [Outlier](#Outlier)
1. [Model Interpretation and Conclusions](#Model-Interpretation-and-Conclusions)
1. [Sources](#Sources)

# INTRODUCTION

The goal of the analysis and modelling in this notebook is to address the following task: <br>
**Create a multiple linear regression analysis of the “mtcars” data, then create a model to predict mpg (miles per gallon) using the best variable(s) available. Explain the diagnostic tests and draw a conclusion about the model!**
<br>

The data source can be accessed [here][1]. <br>

[1]: https://www.kaggle.com/datasets/ruiromanini/mtcars

## TL-DR; Summary

- From the multiple linear regression analysis of the mtcars data, we obtained the following model:
<center>$mpg = 9.62 - 3.92(wt) + 1.23(qsec) + 2.94(am)$</center>

- The diagnostic tests of the final model satisfy all the conditions of application.

- In conclusion, we can interpret the model as follows: <br>
Miles per (US) gallon (mpg) decreases with the weight of the car (wt, in 1000 lbs) and increases with the 1/4 mile time (qsec) and with a manual transmission (am; 0 = automatic, 1 = manual), each with the other predictors held fixed.
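
To make the equation concrete, here is a minimal sketch that plugs one car into the fitted model. The function name `predict_mpg` is hypothetical and the input values are illustrative assumptions; only the coefficients come from the summary above.

```python
def predict_mpg(wt: float, qsec: float, am: int) -> float:
    """Evaluate mpg = 9.62 - 3.92*wt + 1.23*qsec + 2.94*am."""
    return 9.62 - 3.92 * wt + 1.23 * qsec + 2.94 * am


# Illustrative inputs (assumed): a 2,620 lb car with a 16.46 s quarter-mile time
manual = predict_mpg(wt=2.62, qsec=16.46, am=1)
automatic = predict_mpg(wt=2.62, qsec=16.46, am=0)

print(round(manual, 2))              # 22.54
print(round(manual - automatic, 2))  # 2.94, the coefficient of am
```

The difference between the two predictions is exactly the `am` coefficient, which is how a dummy variable should be read: a shift in the predicted mpg, with `wt` and `qsec` held fixed.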

# Building the Model

We'll start building the model by importing the necessary libraries and the data into our notebook.

In [3]:
!pip install polars

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting polars
  Downloading polars-0.14.19-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.7 MB)
Installing collected packages: polars
Successfully installed polars-0.14.19


In [28]:
import logging
logging.captureWarnings(True)

# basic library
import pandas as pd
import numpy as np
import scipy.stats as stats
import polars as pl

# viz library
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


# ML library
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer, LabelEncoder, LabelBinarizer
from sklearn.impute import SimpleImputer

from sklearn import set_config

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
# from tensorflow.keras import callbacks

In [8]:
# Load the data and show the first 5 rows
valid_cols = ['Location', 'Year', 'Fuel_Type', 'Transmission',
              'Owner_Type', 'Seats', 'Price', 'km_per_unit_fuel', 'engine_num',
              'power_num', 'new_price_num', 'Brand', 'Model', 'price_log',
              'kilometers_driven_log']

# used_cars_clean = pd.read_csv("used_cars_clean.csv")[valid_cols]
# used_cars_clean.head()

used_cars_clean1 = pl.read_csv("data/used_cars_clean.csv")[valid_cols]
used_cars_clean1.head()

| Location | Year | Fuel_Type | Transmission | Owner_Type | Seats | Price | km_per_unit_fuel | engine_num | power_num | new_price_num | Brand | Model | price_log | kilometers_driven_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | str | str | f64 | f64 |
| "Mumbai" | 2010 | "CNG" | "Manual" | "First" | 5.0 | 1.75 | 26.6 | 998.0 | 58.16 | 5.51 | "maruti" | "wagon" | 0.559616 | 11.184421 |
| "Pune" | 2015 | "Diesel" | "Manual" | "First" | 5.0 | 12.5 | 19.67 | 1582.0 | 126.2 | 16.06 | "hyundai" | "creta" | 2.525729 | 10.621327 |
| "Chennai" | 2011 | "Petrol" | "Manual" | "First" | 5.0 | 4.5 | 18.2 | 1199.0 | 88.7 | 8.61 | "honda" | "jazz" | 1.504077 | 10.736397 |
| "Chennai" | 2012 | "Diesel" | "Manual" | "First" | 7.0 | 6.0 | 20.77 | 1248.0 | 88.76 | 11.27 | "maruti" | "ertiga" | 1.791759 | 11.373663 |
| "Coimbatore" | 2013 | "Diesel" | "Automatic" | "Second" | 5.0 | 17.74 | 15.2 | 1968.0 | 140.8 | 53.14 | "audi" | "a4" | 2.875822 | 10.613246 |


## Exploratory Data Analysis

Before we jump into the model, it's a good idea to take a look at our data so we know roughly what we're dealing with. To that end, we will do an exploratory data analysis (EDA).

In [9]:
# check null value

used_cars_clean1.null_count()

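| Location | Year | Fuel_Type | Transmission | Owner_Type | Seats | Price | km_per_unit_fuel | engine_num | power_num | new_price_num | Brand | Model | price_log | kilometers_driven_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |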


In [10]:
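# Arrange one histogram per feature on a grid with 3 columns
cols = 3
rows = -(-len(used_cars_clean1.columns) // cols)  # ceiling division, so no feature is dropped

# (col, row, feature) triples, filling the grid column by column
a = [
    (i // rows + 1, i % rows + 1, name)
    for i, name in enumerate(used_cars_clean1.columns)
]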

In [20]:
# let's visualize the distribution of the features of the cars

fig = make_subplots(
    rows=rows, cols=cols,
    # horizontal_spacing = 0.05,
    # vertical_spacing = 0.
    )

for i in a:
    fig.add_trace(
        go.Histogram(
            x=used_cars_clean1[i[2]],
            name=i[2],
            marker_color='lightblue',
        ),
        row=int(i[1]),
        col=int(i[0]),
    )
  
# Format and show fig
fig.update_layout(
    title_text="Features Distribution",
    height=1500, width=1200,
    margin=dict(l=50, r=10, t=50, b=10)
    )

fig.show()

## Pipelines Placeholder

In [None]:
## Feature engineering pipeline

## Modeling pipeline


## Initial Model

In [61]:
randomstate = 42

X = used_cars_clean1.drop(["Price", "price_log"])
y = used_cars_clean1.select(pl.col("Price"))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = randomstate)

X_train.head()

| Location | Year | Fuel_Type | Transmission | Owner_Type | Seats | km_per_unit_fuel | engine_num | power_num | new_price_num | Brand | Model | kilometers_driven_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | str | str | str | f64 | f64 | f64 | f64 | f64 | str | str | f64 |
| "Delhi" | 2011 | "Petrol" | "Manual" | "First" | 5.0 | 18.6 | 1199.0 | 79.4 | 9.675 | "chevrolet" | "beat" | 11.240526 |
| "Coimbatore" | 2014 | "Diesel" | "Manual" | "First" | 5.0 | 22.77 | 1498.0 | 98.59 | 11.685 | "ford" | "ecosport" | 11.076542 |
| "Kolkata" | 2018 | "Diesel" | "Manual" | "First" | 5.0 | 24.3 | 1248.0 | 88.5 | 10.13 | "maruti" | "vitara" | 7.972466 |
| "Delhi" | 2011 | "Diesel" | "Manual" | "First" | 8.0 | 12.99 | 2494.0 | 100.6 | 24.01 | "toyota" | "innova" | 11.918391 |
| "Hyderabad" | 2017 | "Diesel" | "Manual" | "First" | 5.0 | 22.95 | 1248.0 | 74.0 | 7.97 | "tata" | "bolt" | 11.497812 |


In [64]:
X_train1 = X_train.select(
    pl.col('engine_num')
)

In [120]:
# Define model

from tensorflow.keras import losses
from tensorflow.keras import optimizers

tf.random.set_seed(42)

In [141]:
model1 = Sequential()

model1.add(layers.BatchNormalization(input_shape = [1]))
# model1.add(layers.Normalization(input_shape = [1]))
model1.add(layers.Dense(1))
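
At inference time, a `BatchNormalization` layer followed by `Dense(1)` on a single feature reduces to a straight line in the input, so this network is in effect a simple linear regression on `engine_num`. The sketch below spells out that forward pass in plain Python. All parameter values are made up for illustration, the function name `forward` is hypothetical, and `eps=1e-3` assumes the Keras default for `BatchNormalization`. During training the layer normalizes with batch statistics instead of the moving averages used here.

```python
import math


def forward(x, moving_mean, moving_var, gamma, beta, w, b, eps=1e-3):
    """Inference-time output of BatchNormalization -> Dense(1) for one feature."""
    x_norm = gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta
    return w * x_norm + b


# Hypothetical parameter values, chosen only to illustrate the computation
params = dict(moving_mean=1600.0, moving_var=360000.0, gamma=1.0, beta=0.0, w=5.0, b=9.0)

# The same computation written as slope * x + intercept
scale = params["gamma"] / math.sqrt(params["moving_var"] + 1e-3)
slope = params["w"] * scale
intercept = params["w"] * (params["beta"] - scale * params["moving_mean"]) + params["b"]

for engine_cc in (998.0, 1600.0, 2494.0):
    print(engine_cc, forward(engine_cc, **params), slope * engine_cc + intercept)
```

Both columns of the printout agree up to floating-point rounding, which is the sense in which the model is linear in its input.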

In [142]:
adam = optimizers.Adam(learning_rate=0.001)

model1.compile(loss=losses.MeanSquaredError(), optimizer=adam, metrics=['MSE'])
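
The model is compiled with mean squared error as both the loss and the reported metric. For reference, here is a minimal pure-Python sketch of that quantity. The function name is hypothetical; the three prices are the first `Price` values from the data preview, while the predictions are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


prices = [1.75, 12.5, 4.5]       # first three Price values from the preview
predictions = [2.0, 10.0, 5.0]   # made-up predictions, for illustration only

mse = mean_squared_error(prices, predictions)
print(mse)  # 2.1875
```

Because the errors are squared, the single large miss (12.5 vs 10.0) contributes 6.25 of the 6.5625 total, which is why MSE is sensitive to expensive cars that the model underprices.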

In [143]:
model1.summary()

Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 batch_normalization_3 (Batc  (None, 1)                4         
 hNormalization)                                                 
                                                                 
 dense_11 (Dense)            (None, 1)                 2         
                                                                 
Total params: 6
Trainable params: 4
Non-trainable params: 2
_________________________________________________________________
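
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard Keras layout: `BatchNormalization` keeps four values per feature (a trainable scale and offset, plus a non-trainable moving mean and moving variance), and `Dense` has one weight per input-unit pair plus one bias per unit. The helper names are hypothetical.

```python
def batchnorm_params(n_features):
    """Return (trainable, non_trainable) parameter counts for BatchNormalization."""
    trainable = 2 * n_features      # gamma (scale) and beta (offset)
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


bn_trainable, bn_frozen = batchnorm_params(1)
dense = dense_params(1, 1)

print(bn_trainable + bn_frozen + dense)  # 6 total params
print(bn_trainable + dense)              # 4 trainable params
print(bn_frozen)                         # 2 non-trainable params
```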


In [144]:
EPOCH = 20
BATCH_SIZE = 32

history_model_1 = model1.fit(
    X_train1.to_numpy(), y_train.to_numpy(), 
    validation_split=0.2, epochs=EPOCH, batch_size=BATCH_SIZE, verbose=1
    )
    

Epoch 1/20
...
Epoch 20/20


In [147]:
history1 = pl.DataFrame(history_model_1.history)
# add a 1-based epoch column to use as the x-axis of the loss plot
history1 = history1.with_column(pl.Series(name="epoch", values=list(range(1, EPOCH + 1))))

In [146]:
fig = go.Figure()

fig.add_scatter(
    x=history1.select(pl.col("epoch")).to_series(), 
    y=history1.select(pl.col("loss")).to_series(), 
    name = "train",
    mode='lines'
    )

fig.add_scatter(
    x=history1.select(pl.col("epoch")).to_series(), 
    y=history1.select(pl.col("val_loss")).to_series(),
    name = "val",
    mode='lines'
)

# Format and show fig
fig.update_layout(
    title_text="Train and Val Loss",
    height=400, width=800,
    margin=dict(l=50, r=10, t=50, b=10)
    )

fig.show()