In statistics, an outlier is a data point that differs significantly from other observations.

anomalies = Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.

# Importing dependencies

In [64]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split  # (x,y)(test,train). 
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

import math
from sklearn.metrics import r2_score, mean_squared_error #mse 


        
import warnings
warnings.filterwarnings('ignore')

# Reading CSV

In [65]:
cars_data = pd.read_csv('data.csv')

In [66]:
cars_data.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [67]:
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5

In [68]:
cars_data.describe()

Unnamed: 0,Year,Engine HP,Engine Cylinders,Number of Doors,highway MPG,city mpg,Popularity,MSRP
count,11914.0,11845.0,11884.0,11908.0,11914.0,11914.0,11914.0,11914.0
mean,2010.384338,249.38607,5.628829,3.436093,26.637485,19.733255,1554.911197,40594.74
std,7.57974,109.19187,1.780559,0.881315,8.863001,8.987798,1441.855347,60109.1
min,1990.0,55.0,0.0,2.0,12.0,7.0,2.0,2000.0
25%,2007.0,170.0,4.0,2.0,22.0,16.0,549.0,21000.0
50%,2015.0,227.0,6.0,4.0,26.0,18.0,1385.0,29995.0
75%,2016.0,300.0,6.0,4.0,30.0,22.0,2009.0,42231.25
max,2017.0,1001.0,16.0,4.0,354.0,137.0,5657.0,2065902.0


**Insights**
* ***Year***: The minimum year in dataset is 1990, average year is 2010 and maximum year is 2017.
* ***Engine Hp***:: The minimum hp in dataset is 55, average hp is 249 and maximum hp is 1001.
* ***Engine Cylinders***: The minimum cylinders in dataset are 0, average cylinders are 5 and maximum cylindes are 16.
* ***Number of doors***: The minimum doors a car have are 2, on average, the number of doors are 3 and maximum doors are 4.
* ***Highway mpg***: The minimum highway mpg in dataset is 12, average highway mpg is 26.6 and maximum highway mpg is 354.
* ***City mpg***: The minimum city mpg in dataset is 7, average city mpg is 19.7 and maximum city mpg is 137.
* ***Popularity***: The minimum popularity of a car in dataset is 2, average car popularity is 1554.9 and maximum car popularity is 5657.
* ***Price***: The minimum car price in dataset is 2000, average car price is 40K and maximum car price is 2.06 million.

# Data Cleaning

In [69]:
cars_data.columns = cars_data.columns.str.lower().str.replace(" ", "_")
cars_data.columns

Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'driven_wheels',
       'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'popularity', 'msrp'],
      dtype='object')

In [70]:
cars_data.rename(columns = {'engine_fuel_type' : 'fuel_type', 'engine_hp' : 'hp', 'engine_cylinders' : 'cylinders', 'transmission_type' : 'transmission', 'driven_wheels' : 'drive', 'number_of_doors' : 'doors', 'market_category' : 'market', 'vehicle_size' : 'size', 'vehicle_style' : 'style', 'msrp' : 'price'}, inplace = True)

In [71]:
print('Number of duplicates are : ', cars_data.duplicated().sum())
cars_data = cars_data.drop_duplicates()

Number of duplicates are :  715


In [72]:
print('Number of missing values in each columns are below : ')
print(cars_data.isnull().sum())

Number of missing values in each columns are below : 
make               0
model              0
year               0
fuel_type          3
hp                69
cylinders         30
transmission       0
drive              0
doors              6
market          3376
size               0
style              0
highway_mpg        0
city_mpg           0
popularity         0
price              0
dtype: int64


> Let's drop the market column as it contains too much of the null values and also this feature doesn't have high importance regarding target feature which is price

In [73]:
cars_data.drop('market', axis = 1, inplace = True)   #not executing multiple times 


> Now i'll check what dataset columns have null values

In [27]:
null_values = cars_data[cars_data.isnull().any(axis = 1)]
null_values

Unnamed: 0,make,model,year,fuel_type,hp,cylinders,transmission,drive,doors,market,size,style,highway_mpg,city_mpg,popularity,price
87,Nissan,200SX,1996,regular unleaded,115.0,4.0,MANUAL,front wheel drive,2.0,,Compact,Coupe,36,26,2009,2000
91,Nissan,200SX,1997,regular unleaded,115.0,4.0,MANUAL,front wheel drive,2.0,,Compact,Coupe,35,25,2009,2000
93,Nissan,200SX,1998,regular unleaded,115.0,4.0,MANUAL,front wheel drive,2.0,,Compact,Coupe,35,25,2009,2000
203,Chrysler,300,2015,regular unleaded,300.0,6.0,AUTOMATIC,all wheel drive,4.0,,Large,Sedan,27,18,1013,37570
204,Chrysler,300,2015,regular unleaded,292.0,6.0,AUTOMATIC,rear wheel drive,4.0,,Large,Sedan,31,19,1013,31695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11794,Subaru,XT,1991,regular unleaded,145.0,6.0,MANUAL,all wheel drive,2.0,,Compact,Coupe,23,16,640,2000
11809,Toyota,Yaris iA,2017,regular unleaded,106.0,4.0,MANUAL,front wheel drive,4.0,,Compact,Sedan,39,30,2031,15950
11810,Toyota,Yaris iA,2017,regular unleaded,106.0,4.0,AUTOMATIC,front wheel drive,4.0,,Compact,Sedan,40,32,2031,17050
11867,GMC,Yukon,2015,premium unleaded (recommended),420.0,8.0,AUTOMATIC,rear wheel drive,4.0,,Large,4dr SUV,21,15,549,64520


> Fuel type, Hp, Cylinders and also doors have null values.
> * I'll fill the null values in fuel type with the mode as this columns is a categorical one.
> * Hp will be filled by 0 as the cars having null values are electric cars. So they don't have Hp in general
> * We know that electric cars doesn't have any cylinders so the null values will be again filled with 0.
> * It's fine to fill the null values in doors with mean as it's a numerical column.

In [74]:
cars_data['fuel_type'] = cars_data['fuel_type'].fillna('regular unleaded')

cars_data['hp'] = cars_data['hp'].fillna(0)

cars_data['cylinders'] = cars_data['cylinders'].fillna(0)

cars_data['doors'] = cars_data['doors'].fillna(cars_data['doors'].mean())


In [79]:
cars_data.head(5)

Unnamed: 0,make,model,year,fuel_type,hp,cylinders,transmission,drive,doors,size,style,highway_mpg,city_mpg,popularity,price
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Compact,Convertible,28,18,3916,34500


> Now let's seperate the numerical and categorical columns for using later.

In [82]:
num_col = cars_data.select_dtypes(include = [np.number])
num_col

Unnamed: 0,year,hp,cylinders,doors,highway_mpg,city_mpg,popularity,price
0,2011,335.0,6.0,2.0,26,19,3916,46135
1,2011,300.0,6.0,2.0,28,19,3916,40650
2,2011,300.0,6.0,2.0,28,20,3916,36350
3,2011,230.0,6.0,2.0,28,18,3916,29450
4,2011,230.0,6.0,2.0,28,18,3916,34500
...,...,...,...,...,...,...,...,...
11909,2012,300.0,6.0,4.0,23,16,204,46120
11910,2012,300.0,6.0,4.0,23,16,204,56670
11911,2012,300.0,6.0,4.0,23,16,204,50620
11912,2013,300.0,6.0,4.0,23,16,204,50920


In [83]:
cat_col = cars_data.select_dtypes(exclude = [np.number])
cat_col

Unnamed: 0,make,model,fuel_type,transmission,drive,size,style
0,BMW,1 Series M,premium unleaded (required),MANUAL,rear wheel drive,Compact,Coupe
1,BMW,1 Series,premium unleaded (required),MANUAL,rear wheel drive,Compact,Convertible
2,BMW,1 Series,premium unleaded (required),MANUAL,rear wheel drive,Compact,Coupe
3,BMW,1 Series,premium unleaded (required),MANUAL,rear wheel drive,Compact,Coupe
4,BMW,1 Series,premium unleaded (required),MANUAL,rear wheel drive,Compact,Convertible
...,...,...,...,...,...,...,...
11909,Acura,ZDX,premium unleaded (required),AUTOMATIC,all wheel drive,Midsize,4dr Hatchback
11910,Acura,ZDX,premium unleaded (required),AUTOMATIC,all wheel drive,Midsize,4dr Hatchback
11911,Acura,ZDX,premium unleaded (required),AUTOMATIC,all wheel drive,Midsize,4dr Hatchback
11912,Acura,ZDX,premium unleaded (recommended),AUTOMATIC,all wheel drive,Midsize,4dr Hatchback


> Printing each column's unique values to see if the data inside columns is fine or not. In short, this step will be taken to spot anomalies.

anomalies = the identification of rare events, items 

example = if large sums of money are spent one after another within one day and it is not your typical behavior, a bank can block your card.

In [None]:
for col in cat_col:
    print(col)
    print(cars_data[col].unique())
    print(cars_data[col].nunique())
    print('\n', "======================================", '\n')

> The transmission columns has a value 'UNKNOWN' which is clearly an anomoly. So i'll drop all of those cars who's transmission is UNKNOWN.

In [None]:
cars_data.drop(cars_data[cars_data['transmission']=='UNKNOWN'].index, axis='index', inplace = True)

# Handling Outliers

> Let's make a boxplot for each numerical variable to spot outliers using plotly

Box plots provide a quick visual summary of the variability of values in a dataset. They show the median, upper and lower quartiles, minimum and maximum values, and any outliers in the dataset. 

Outliers can reveal mistakes or unusual occurrences in data.

In [None]:
for i in num_col:
    fig = px.box(cars_data, x = cars_data[i])
    fig.update_traces(fillcolor = '#C9A26B')
    fig.show()

**Insights**
* ***Engine Hp***:: Any Hp above than 500 is an outlier.
* ***Engine Cylinders***: Cylinders above than 8 and lower than 3 are outliers.
* ***Highway mpg***: Any highway mpg below than 12 and higher than 42 is as an outlier.
* ***City mpg***: Any city mpg below than 7 and higher than 31 is as an outlier.
* ***Price***: Car prices lower than 2000 and upper than 70.9K is an outlier.

> Deleting the outliers below by dropping them from the original dataset

In [None]:
s1 = cars_data.shape
clean = cars_data[['hp', 'cylinders', 'highway_mpg', 'city_mpg', 'price']]
for i in clean.columns:
    qt1 = cars_data[i].quantile(0.25)
    qt3 = cars_data[i].quantile(0.75)
    iqr =  qt3 - qt1
    lower = qt1-(1.5*iqr)
    upper = qt3+(1.5*iqr)
    min_in = cars_data[cars_data[i]<lower].index
    max_in = cars_data[cars_data[i]>upper].index
    cars_data.drop(min_in, inplace = True)
    cars_data.drop(max_in, inplace = True)
s2 = cars_data.shape
outliers = s1[0] - s2[0]
print("Deleted outliers are : ", outliers)

> Checking the changes after outliers deletion from a dataset column by plotting the boxplot again

In [None]:
fig = px.box(cars_data, x = cars_data['hp'])
fig.update_traces(fillcolor = '#C9A26B')

In [None]:
cars_data.describe()

**Insights**
* ***Engine Hp***:: The minimum hp in dataset after outlier deletion is 0, average hp is 231 and maximum hp is 485.
* ***Engine Cylinders***: The minimum cylinders in dataset after outlier deletion are 3, average cylinders are 5 and maximum cylindes are 8.
* ***Highway mpg***: The minimum highway mpg in dataset after outlier deletion is 12, average highway mpg is 26.3 and maximum highway mpg is 42.
* ***City mpg***: The minimum city mpg in dataset after outlier deletion is 10, average city mpg is 19.1 and maximum city mpg is 31.
* ***Price***: The minimum car price in dataset after outlier deletion is 2000, average car price is around 29K and maximum car price is 70.9K.

# Univariate Analysis

In [None]:
for i in cars_data:
    fig = px.histogram(cars_data, x= i, color_discrete_sequence = ['#C9A26B'])
    fig.show()

**Insights**
* ***Make***:: Least count: **Maserati** (2), Maximum count: **Chevrolet** (1039).
* ***Model***: Least count: **Many models** (1), Maximum count: **Silverado 1500** (156).
* ***Year***: Least count **1990** (76): , Maximum count: **2016** (1756).
* ***Fuel type***: Least count: **Natural Gas** (2), Maximum count: **Regular Unleaded** (6380).
* ***Hp***:: Least count: **Many hps** (1), Maximum count: **200** (355).
* ***Cylinders***: Least count: **3** (11), Maximum count: **4** (4110).
* ***Transmission***: Least count: **Automated Manual** (316), Maximum count: **Automatic** (7097).
* ***Drive***: Least count: **four wheel drive** (1220), Maximum count: **front wheel drive** (4056).
* ***Doors***: Least count: **3** (356), Maximum count: **4** (7185).
* ***Size***:: Least count: **Large** (2241), Maximum count: **Compact** (3822).
* ***Style***:: Least count: **Convertible SUV** (28), Maximum count: **Sedan** (2448).
* ***Highway mpg***: Least count: **12** (1), Maximum count: **24** (752).
* ***City mpg***: Least count: **10** (5), Maximum count: **17** (1117).
* ***Popularity***: Least count: **238** (2), Maximum count: **1385** (1039).
* ***Price***: Least count: **Many prices** (1), Maximum count: **2000** (736).

# Bivariate Analysis

### According to HP

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['make'].unique(), y = cars_data.groupby('make').mean()['hp'].sort_values()[:5], marker_color = '#C9A25B'))
fig.update_layout(title_text = 'Top 5 car brands with highest Hp', xaxis_title = "Car Brand", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['model'].unique(), y = cars_data.groupby('model').mean()['hp'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car models with highest Hp', xaxis_title = "Car Model", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['fuel_type'].unique(), y = cars_data.groupby('fuel_type').mean()['hp'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car fuel types with highest Hp', xaxis_title = "Car Fuel Type", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['transmission'].unique(), y = cars_data.groupby('transmission').mean()['hp'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car transmissions according to Hp', xaxis_title = "Car Transmission", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['drive'].unique(), y = cars_data.groupby('drive').mean()['hp'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car drive according to Hp', xaxis_title = "Car Drive", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['size'].unique(), y = cars_data.groupby('size').mean()['hp'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car size according to Hp', xaxis_title = "Car Size", yaxis_title = "Hp")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['style'].unique(), y = cars_data.groupby('style').mean()['hp'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car style according to Hp', xaxis_title = "Car Style", yaxis_title = "Hp")
fig.show()

**Insights**
* ***Make***: The car brand of **Chrysler** has the highest Hp of **169.8**.
* ***Model***:: The car model of **190-Class** has the highest Hp of **74**.
* ***Fuel Type***: The car fuel type of **Diesel** has the highest Hp of **264.3**.
* ***Transmission***: Car transmission of **Manual** has the lowest Hp of **189.2** and car transmission of **Automated-Manual** has the highest hp of **245.6**.
* ***Drive***: Car drive of **Rear wheel drive** has the lowest Hp of **186.2** and car drive of **Four wheel drive** has the highest hp of **269**.
* ***Size***: Car size of **Compact** has the lowest Hp of **183.8** and car size of **Large** has the highest hp of **294**.
* ***Style***: Car style of **Coupe** has the lowest Hp of **136.2** and car style of **Passenger Van** has the highest hp of **301.2**.

### According to City mpg

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['make'].unique(), y = cars_data.groupby('make').mean()['city_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car brands with highest city mpg', xaxis_title = "Car Brand", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['model'].unique(), y = cars_data.groupby('model').mean()['city_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car models with highest city mpg', xaxis_title = "Car Model", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['fuel_type'].unique(), y = cars_data.groupby('fuel_type').mean()['city_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car fuel types with highest city mpg', xaxis_title = "Car Fuel Type", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['transmission'].unique(), y = cars_data.groupby('transmission').mean()['city_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car transmissions according to city mpg', xaxis_title = "Car Transmission", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['drive'].unique(), y = cars_data.groupby('drive').mean()['city_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car drive according to city mpg', xaxis_title = "Car Drive", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['size'].unique(), y = cars_data.groupby('size').mean()['city_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car size according to city mpg', xaxis_title = "Car Size", yaxis_title = "City Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['style'].unique(), y = cars_data.groupby('style').mean()['city_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car style according to city mpg', xaxis_title = "Car Style", yaxis_title = "City Mpg")
fig.show()

**Insights**
* ***Make***: The car brand of **Chrysler** has the highest city mpg of **16.6**.
* ***Model***:: The car model of **190-Class and some others** has the highest city mpg of **11**.
* ***Fuel Type***: The car fuel type of **Diesel** has the highest city mpg of **18.5**.
* ***Transmission***: Car transmission of **Manual** has the lowest city mpg of **18.6** and car transmission of **Automated-Manual** has the highest city mpg of **23.8**.
* ***Drive***: Car drive of **Rear wheel drive** has the lowest city mpg of **15.1** and car drive of **Four wheel drive** has the highest city mpg of **21.8**.
* ***Size***: Car size of **Compact** has the lowest city mpg of **15.8** and car size of **Large** has the highest city mpg of **21.3**.
* ***Style***: Car style of **Coupe** has the lowest city mpg of **12.5** and car style of **Passenger Van** has the highest city mpg of **24.3**.

### According to Highway mpg

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['make'].unique(), y = cars_data.groupby('make').mean()['highway_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car brands with highest highway mpg', xaxis_title = "Car Brand", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['model'].unique(), y = cars_data.groupby('model').mean()['highway_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car models with highest highway mpg', xaxis_title = "Car Model", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['fuel_type'].unique(), y = cars_data.groupby('fuel_type').mean()['highway_mpg'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 car fuel types with highest highway mpg', xaxis_title = "Car Fuel Type", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['transmission'].unique(), y = cars_data.groupby('transmission').mean()['highway_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car transmissions according to highway mpg', xaxis_title = "Car Transmission", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['drive'].unique(), y = cars_data.groupby('drive').mean()['highway_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car drive according to highway mpg', xaxis_title = "Car Drive", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['size'].unique(), y = cars_data.groupby('size').mean()['highway_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car size according to highway mpg', xaxis_title = "Car Size", yaxis_title = "Highway Mpg")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['style'].unique(), y = cars_data.groupby('style').mean()['highway_mpg'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car style according to highway mpg', xaxis_title = "Car Style", yaxis_title = "Highway Mpg")
fig.show()

**Insights**
* ***Make***: The car brand of **Chrysler** has the highest highay mpg of **23.2**.
* ***Model***:: The car model of **190-Class** has the highest highay mpg of **14.5**.
* ***Fuel Type***: The car fuel type of **Diesel** has the highest highay mpg of **25.9**.
* ***Transmission***: Car transmission of **Manual** has the lowest highay mpg of **25.6** and car transmission of **Automated-Manual** has the highest highay mpg of **32.4**.
* ***Drive***: Car drive of **Rear wheel drive** has the lowest highay mpg of **19.8** and car drive of **Four wheel drive** has the highest highay mpg of **30**.
* ***Size***: Car size of **Compact** has the lowest highay mpg of **22.1** and car size of **Large** has the highest highay mpg of **28.5**.
* ***Style***: Car style of **Coupe** has the lowest highay mpg of **16.5** and car style of **Passenger Van** has the highest highay mpg of **32.4**.

### According to Popularity

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['make'].unique(), y = cars_data.groupby('make').mean()['popularity'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 popular car brands', xaxis_title = "Car Brand", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['model'].unique(), y = cars_data.groupby('model').mean()['popularity'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 popular car models', xaxis_title = "Car Model", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['fuel_type'].unique(), y = cars_data.groupby('fuel_type').mean()['popularity'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 popular car fuel types', xaxis_title = "Car Fuel Type", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['transmission'].unique(), y = cars_data.groupby('transmission').mean()['popularity'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Popularity in car transmissions', xaxis_title = "Car Transmission", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['drive'].unique(), y = cars_data.groupby('drive').mean()['popularity'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Popularity in car drives', xaxis_title = "Car Drive", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['size'].unique(), y = cars_data.groupby('size').mean()['popularity'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Popularity in car sizes', xaxis_title = "Car Size", yaxis_title = "Popularity")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['style'].unique(), y = cars_data.groupby('style').mean()['popularity'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Popularity in car styles', xaxis_title = "Car Style", yaxis_title = "Popularity")
fig.show()

**Insights**
* ***Make***: The car brand of **Chrysler** has the highest popularity of **113**.
* ***Model***:: The car model of **190-Class and some others** has the highest popularity of **26**.
* ***Fuel Type***: The car fuel type of **Diesel** has the highest popularity of **1421.3**.
* ***Transmission***: Car transmission of **Manual** has the lowest popularity of **1536.5** and car transmission of **Automated-Manual** has the highest popularity of **1715.7**.
* ***Drive***: Car drive of **Rear wheel drive** has the lowest popularity of **1360** and car drive of **Four wheel drive** has the highest popularity of **1833.5**.
* ***Size***: Car size of **Compact** has the lowest popularity of **1425.9** and car size of **Large** has the highest popularity of **1989.8**.
* ***Style***: Car style of **Coupe** has the lowest popularity of **814.5** and car style of **Passenger Van** has the highest popularity of **3871.7**.

### According to Price

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['make'].unique(), y = cars_data.groupby('make').mean()['price'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 expensive car brands', xaxis_title = "Car Brand", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['model'].unique(), y = cars_data.groupby('model').mean()['price'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 expensive car models', xaxis_title = "Car Model", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['fuel_type'].unique(), y = cars_data.groupby('fuel_type').mean()['price'].sort_values()[:5], marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Top 5 expensive car Fuel Types', xaxis_title = "Car Fuel Type", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['transmission'].unique(), y = cars_data.groupby('transmission').mean()['price'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car transmissions according to price', xaxis_title = "Car Transmission", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['drive'].unique(), y = cars_data.groupby('drive').mean()['price'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car drive according to price', xaxis_title = "Car Drive", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['size'].unique(), y = cars_data.groupby('size').mean()['price'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car size according to price', xaxis_title = "Car Size", yaxis_title = "Price")
fig.show()

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x = cars_data['style'].unique(), y = cars_data.groupby('style').mean()['price'].sort_values(), marker_color = '#C9A26B'))
fig.update_layout(title_text = 'Car style according to price', xaxis_title = "Car Style", yaxis_title = "Price")
fig.show()

**Insights**
* ***Make***: The car brand of **Chrysler** has the highest pricec of around **20K**.
* ***Model***:: Expensive car models like **190-Class and some others** has the highest prices of around **2000**.
* ***Fuel Type***: Cars that use fuel type of **Diesel** has highest prices of around **39.1K**.
* ***Transmission***: Car transmission of **Manual** has prices around **18.3K** and car transmission of **Automated-Manual** has the prices around **36K**.
* ***Drive***: Car drive of **Rear wheel drive** has the prices around **23K** and car drive of **Four wheel drive** has the prices around **38.2K**.
* ***Size***: **Compact** car sizes has prices around **22.6K** and car size of **Large** has prices around of **35.9K**.
* ***Style***: Car style of **Coupe** has prices around **14.8K** and car style of **Passenger Van** has the prices around **37.1K**.

### Relation between price and other numerical variables

In [None]:
fig = px.scatter(cars_data, x = 'year', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* The amount of cylinders being used in cars have increased from year 2001.
* In year 2015, 2016 and 2017, it can be observed that most of the cars are using 4 number of cylinders. Perhaps this step is taken to reduce carbon emissions. 
* It can also be seen that the car prices have rapidly increased in 2001. They were low in previous years. It might be due to inflation or something like that. Furthermore, the prices kept increasing since year 2001.

In [None]:
fig = px.scatter(cars_data, x = 'hp', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* There is a positive relation between hp and price. It can be seen that as the hp increases, the car prices also increases.
* As the hp increases more than 290, most of the cars are likely to have 8 cylinders. 
* Cars having hp less than 180 are more likely to have 4 cylinders.

In [None]:
fig = px.scatter(cars_data, x = 'cylinders', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* The price of 4, 6 and 8 number of cylinders can be as low as 2000 and as high as 70K.

In [None]:
fig = px.scatter(cars_data, x = 'doors', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* Most of the cars having 2 doors are more likely to have 4 cylinders on average. 
* Most of the cars having 4 doors and price above than 48K are more likely to have 8 cylinders.

In [None]:
fig = px.scatter(cars_data, x = 'city_mpg', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* Cars that are giving a city mpg of 10 and 11 have only 8 cylinders.
* Cars that are giving a city mpg of 17 have mostly 6 cylinders.
* Cars with 4 cylinders have a really good mpg inside a city.
* The prices of cars with different city mpg varies. It can be as low as 2000 and as high as 70K.

In [None]:
fig = px.scatter(cars_data, x = 'highway_mpg', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* Cars that are giving a highway mpg of 12, 13, 14, 15, 16, 17 and 18 have mostly 8 cylinders.
* Cars that are giving a highway mpg from 19 to 27 have mostly 6 cylinders.
* Cars giving a highway mpg 28 or more than 28 have mostly 4 cylinders.
* The prices of cars with different highway mpg varies. It can be as low as 2000 and as high as 70K.

In [None]:
fig = px.scatter(cars_data, x = 'popularity', y = 'price', color = 'cylinders')
fig.show()

**Insights**
* Most of the cars having popularity of 204 with 6 cylinders have prices starting 35K and goes as high as 57K.
* Cars having popularity of 549 with 8 cylinders have prices starting from 42K and goes as high as 70.2K
* Cars having popularity of 1385 and 8 cylinders have prices higher than 44K.
* Cars having popularity of 5657 and prices above than 21K are more likely to have 6 cylinders.

# Preprocessing

> Making the dummies of for the machine learning process. This is very important as machine learning does not deal with categorical features. Therefore, i used pandas for the purpose of encoding these categorical features.

In [None]:
cat_features = ['make', 'model', 'fuel_type', 'transmission', 'drive', 'size', 'style']
cars_data = pd.get_dummies(cars_data, columns = cat_features)

> Splitting the dataset into train and test split. I'll keep the test size of 0.2.

In [None]:
X = cars_data.drop('price', axis = 1)
y = cars_data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

> Now i'll do normalization or standardize my data by using Standard Scaler.

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Modeling

> I've decided to use random forest regressor model for my current dataset. Let's make a model at first and then make a pipeline. Later i'll fit the model and then use it for prediction. 

In [None]:
rfr = RandomForestRegressor(n_estimators = 40)
rfr_algo = make_pipeline(rfr)

rfr_algo.fit(X_train, y_train)
rfr_pred = rfr_algo.predict(X_test)

print('R2 Score is : ', r2_score(y_test, rfr_pred))
print('Mean squared error is : ', math.sqrt(mean_squared_error(y_test, rfr_pred)))

> Let's visualise our model's performace to see how it worked.

In [None]:
plt.figure(figsize=(10,10))
plt.ylabel("Predicted Value")
sns.regplot(y_test, rfr_pred, fit_reg=True, scatter_kws={"s": 100})

**Conclusion**
* The model performed really well with the r2 score of 0.95. 
* The mean square error is also 3331.

