In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

In [16]:
df = pd.read_csv("../data/data.csv")
# Display the first 5 rows
df.head(5)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


## 3. Checking the types of data

Here we check the data types because the MSRP (the price of the car) is sometimes stored as a string. In that case, we would have to convert the string to an integer before we could plot it. In this dataset, MSRP is already stored as an integer (`int64`), so no conversion is needed.

In [17]:
df.dtypes

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object
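
If the price column had been loaded as strings, a conversion along the following lines could be used. This is a minimal sketch rather than part of the original analysis: the `sample` frame and its formatted values (dollar signs, thousands separators, stray spaces) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: prices stored as strings (not taken from data.csv)
sample = pd.DataFrame({"MSRP": ["$46,135", "40650", " 36,350 "]})

# Strip dollar signs, commas, and whitespace, then parse the digits.
cleaned = sample["MSRP"].str.replace(r"[$,\s]", "", regex=True)

# errors="raise" makes any value that still fails to parse stop the cell
# instead of silently becoming NaN.
sample["MSRP"] = pd.to_numeric(cleaned, errors="raise").astype("int64")

print(sample.dtypes)
```

Passing `errors="coerce"` instead would turn unparseable values into `NaN`, which is useful for inspecting bad rows but prevents a direct cast to `int64`.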

## 4. Identifying Columns of Interest

In the original EDA, columns such as `Transmission Type` and `Driven_Wheels` were included, but those columns were never used in any of the analysis. We are interested in identifying relationships between columns. For example, how has fuel consumption changed over the years? And how does popularity affect pricing? The cell below drops the columns we do not need for these questions.

In [18]:
df = df.drop(['Engine Fuel Type', 'Driven_Wheels', 'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style'], axis=1)
df.head(5)

Unnamed: 0,Make,Model,Year,Engine HP,Engine Cylinders,Transmission Type,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,26,19,3916,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,28,19,3916,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,28,20,3916,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,28,18,3916,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,28,18,3916,34500
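
The two questions above could be approached roughly as follows. This is only a sketch on a tiny made-up frame (`toy` and its values are hypothetical, not results from the dataset); it assumes the column names shown in the output above.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "highway MPG": [24, 26, 28, 30],
    "Popularity": [100, 200, 300, 400],
    "MSRP": [20000, 25000, 30000, 35000],
})

# Fuel consumption over the years: average highway MPG per model year
mpg_by_year = toy.groupby("Year")["highway MPG"].mean()

# Popularity vs. pricing: Pearson correlation between the two columns
pop_price_corr = toy["Popularity"].corr(toy["MSRP"])

print(mpg_by_year)
print(pop_price_corr)
```

On the real `df`, the same two calls would give a per-year trend that can be plotted as a line chart and a single correlation coefficient to complement a scatter plot.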


## 5. Labelling Data Columns

For readability, we rename the columns to shorter, clearer labels.

In [19]:
df = df.rename(columns={
    'Engine HP': 'Horsepower', 'Engine Cylinders': 'Cylinders',
    'Transmission Type': 'Transmission', 'highway MPG': 'MPG-HW',
    'city mpg': 'MPG-CTY', 'Popularity': 'Rank', 'MSRP': 'Price',
})
df.head(5)

Unnamed: 0,Make,Model,Year,Horsepower,Cylinders,Transmission,MPG-HW,MPG-CTY,Rank,Price
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,26,19,3916,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,28,19,3916,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,28,20,3916,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,28,18,3916,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,28,18,3916,34500
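
One thing to watch when renaming: by default, `rename` silently ignores keys that do not match an existing column, and this dataset mixes casing (`highway MPG` vs. `city mpg`). The sketch below, on a hypothetical one-row frame, shows how `errors="raise"` can surface such a mismatch; the misspelled key `'city MPG'` is deliberate.

```python
import pandas as pd

# Hypothetical one-row frame with the dataset's mixed-case column names
toy = pd.DataFrame({"highway MPG": [26], "city mpg": [19]})

# Default behaviour: the misspelled key 'city MPG' is ignored,
# so the 'city mpg' column keeps its original name.
silent = toy.rename(columns={"highway MPG": "MPG-HW", "city MPG": "MPG-CTY"})
print(list(silent.columns))

# With errors="raise", a key that matches no column raises KeyError.
caught = False
try:
    toy.rename(columns={"city MPG": "MPG-CTY"}, errors="raise")
except KeyError:
    caught = True
print(caught)
```

Adding `errors="raise"` to the rename call in the cell above would be one way to guard against a typo in the mapping.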
