# Exploratory Data Analysis

This notebook explores the car feature dataset to uncover trends that help predict car
prices. It will:

- rank the utility of features using mutual information
- explore the relationship of each variable with price

In [None]:
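from eda import *  # project helpers (make_mi_scores, plot_* functions)
import pandas as pd  # explicit import; pd is used below to load the data
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()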

In [None]:
df = pd.read_pickle('car_data.pkl')
df.head()

In [None]:
X = df.copy()
y = X.pop('price')

## Get mutual information scores

In [None]:
mi_scores = make_mi_scores(X, y)
mi_scores[::3]

Key variables like manufactured year and mileage look like they have an effect on price, as expected.
By this metric, variables like doors and seats have little to no effect on price.
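
The `make_mi_scores` helper comes from the `eda` module and its implementation is not shown in this notebook. As a rough sketch only, a helper of this kind could be built on scikit-learn's `mutual_info_regression`. The function name `sketch_mi_scores` and the synthetic data below are assumptions for illustration, not the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def sketch_mi_scores(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.Series:
    """Hypothetical stand-in for make_mi_scores; not the eda module's code."""
    X = X.copy()
    # Integer-encode text columns so the estimator can treat them as discrete.
    for col in X.select_dtypes(["object", "category"]).columns:
        X[col], _ = X[col].factorize()
    # Treat integer-typed columns as discrete and everything else as continuous.
    discrete_mask = np.array([pd.api.types.is_integer_dtype(t) for t in X.dtypes], dtype=bool)
    scores = mutual_info_regression(
        X, y, discrete_features=discrete_mask, random_state=random_state
    )
    return pd.Series(scores, index=X.columns, name="MI Scores").sort_values(ascending=False)


# Synthetic data for illustration only: price depends on year but not on doors.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": rng.integers(2005, 2023, size=500),
    "doors": rng.integers(2, 6, size=500),
})
demo_price = 1000.0 * (demo["year"] - 2000) + rng.normal(0, 500, size=500)
demo_scores = sketch_mi_scores(demo, demo_price)
print(demo_scores)
```

A score near zero suggests the feature carries little information about price on its own. It does not rule out usefulness in combination with other features.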

In [None]:
plot_mi_scores(mi_scores)

The price distribution shows that prices range up to £200,000 at the extreme upper end. However, most of the price data is below £60,000, as seen on the right side of the figure below. The data is also right-skewed.

In [None]:
plot_price_historgam_subplot(df)
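
The right skew described above can also be checked numerically. The sketch below uses synthetic log-normal prices (an assumption for illustration, not the car dataset) to show how skewness and the share of prices under £60,000 might be measured, and how a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed prices for illustration only.
rng = np.random.default_rng(1)
demo_price = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=5000), name="price")

raw_skew = demo_price.skew()            # positive for a right-skewed sample
log_skew = np.log1p(demo_price).skew()  # skewness after a log transform
share_below_60k = (demo_price < 60_000).mean()

print(f"skew raw={raw_skew:.2f}, log={log_skew:.2f}, share<60k={share_below_60k:.1%}")
```

On the real data, the same three lines could be applied to `df['price']`. Whether to model the log of price is a separate decision that this notebook does not make.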

Plotting some of the numerical variables against price, we can see the trends below.
1. Car length is ranked first in the mutual information scores and, as seen in the chart, there is a positive correlation between car length and price.
2. Price seems to increase with boot space, but more so with the seats-down measurement. This is also confirmed by the mutual information scores, as the seats-down measurement was ranked higher than the seats-up measurement.
3. Price increases with the wheelbase. Some figures are incorrect: some wheelbase data points are at 0 mm, which is not physically possible.
4. Height, manufactured year and engine power all have strong positive correlations with price.
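
One possible way to handle the impossible 0 mm wheelbase values is to count them and mark them as missing. This is a minimal sketch on made-up rows, and the column name `wheelbase` is an assumption about the dataset's schema.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; "wheelbase" is an assumed column name.
demo = pd.DataFrame({
    "wheelbase": [2600, 0, 2750, 0, 2900],
    "price": [9000, 7000, 12000, 8000, 20000],
})

# Count the physically impossible zero measurements.
n_zero = int((demo["wheelbase"] == 0).sum())

# Treat the zeros as missing so they do not distort plots or correlations.
cleaned = demo.assign(wheelbase=demo["wheelbase"].replace(0, np.nan))

print(n_zero, int(cleaned["wheelbase"].isna().sum()))
```

Marking the values as missing keeps the rows available for other columns. Dropping or imputing them would be alternative choices.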

In [None]:
df_sample = df.sample(frac=.4, random_state=1)
plot_eda_subplot1(df_sample)

1. Engine torque, mileage, fuel tank capacity and CO2 emissions seem to have strong correlations with price and will likely be useful in price prediction.
2. The relationship between annual tax and price is unclear. Tax is ranked low on the mutual information scores.
3. Price decreases with an increase in number of owners. Number of owners could be closely correlated with mileage.
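
The suspected link between number of owners and mileage could be tested with a rank correlation. The sketch below uses synthetic data in which owner count is constructed to grow with mileage, so it illustrates only the calculation, not a finding about the car dataset. The column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
mileage = rng.uniform(1_000, 150_000, size=1000)
# Assumption for the demo: owner count tends to grow with mileage.
owners = 1 + rng.poisson(mileage / 40_000)
demo = pd.DataFrame({"mileage": mileage, "owners": owners})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = demo["mileage"].rank().corr(demo["owners"].rank())
print(f"Spearman rho = {rho:.2f}")
```

A rank correlation is used here because owner count is a small integer and the relationship need not be linear.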

In [None]:
plot_eda_subplot2(df_sample)

1. Price increases with top speed and engine size.
2. Total seller reviews do not show a clear trend with price.
3. Combined, urban and extra urban ratings all seem to have a slight negative correlation with price, although it is not very clear. These variables may not be good predictors of price.

In [None]:
plot_eda_subplot3(df_sample)

1. Longitude, doors, seats and number of photos do not have a clear correlation with price, so they will probably not be useful for price prediction.
2. Seller rating is positively correlated with price. However, not all sellers have ratings, as some are private. This can lead to a lot of data loss if this variable is used.
3. Price increases with number of valves and cylinders.
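
The data loss from missing seller ratings can be quantified before deciding whether to keep the variable. This is a small sketch on made-up rows. The column name `seller_rating` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; private sellers have no rating.
demo = pd.DataFrame({
    "seller_rating": [4.8, np.nan, 4.5, np.nan, np.nan, 4.9],
    "price": [15000, 6000, 11000, 5500, 7000, 18000],
})

# Share of rows with no rating, and rows left if those are dropped.
missing_share = demo["seller_rating"].isna().mean()
rows_kept = len(demo.dropna(subset=["seller_rating"]))

print(f"missing: {missing_share:.0%}, rows kept: {rows_kept} of {len(demo)}")
```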

In [None]:
plot_eda_subplot4(df_sample)

In [None]:
plot_seats(df_sample)

1. Figures 1 and 2 show the top ten makes and models in terms of price. They include luxury makes such as Maserati and Porsche.
2. Automatic transmission cars have a higher median price than manuals.
3. Fuel type seems to have an effect on price. Diesel plug-in hybrids have the highest median price.
4. Body type does not seem to have much effect on price, with the exception of the van, which has a higher median than the other body types.
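
Median comparisons like these can be reproduced as a table with a group-by. The sketch below uses made-up rows and an assumed `transmission` column, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
demo = pd.DataFrame({
    "transmission": ["Automatic", "Manual", "Automatic", "Manual", "Automatic"],
    "price": [20000, 8000, 30000, 10000, 25000],
})

# Median price per category, highest first.
median_by_transmission = (
    demo.groupby("transmission")["price"].median().sort_values(ascending=False)
)
print(median_by_transmission)
```

The same pattern would apply to fuel type or body type by changing the grouping column.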

In [None]:
plot_eda_subplot5(df_sample)

## Conclusions

Based on the mutual information scores and visualisations, the following
columns will not be used in model building as they are unlikely to help in predicting price:

- number of photos
- number of reviews
- longitude and latitude
- doors
- seats
- imported
- region 
- sellers rating
- ulez
- cylinders
- valves
- fuel type
- body type
- annual tax
- seller segment
- county
- town
- total reviews
- combined 
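
Dropping the listed columns might look like the sketch below. The column names are hypothetical, since the dataset's actual column names are not shown here. `errors="ignore"` is used so that a name absent from the frame does not raise.

```python
import pandas as pd

# Made-up frame for illustration; real column names may differ.
demo = pd.DataFrame({
    "price": [9000, 12000],
    "mileage": [60000, 30000],
    "doors": [5, 3],
    "seats": [5, 4],
})

# Hypothetical names standing in for the columns listed above.
cols_to_drop = ["doors", "seats", "num_photos"]

# Names not present in the frame (here "num_photos") are skipped silently.
reduced = demo.drop(columns=cols_to_drop, errors="ignore")
print(list(reduced.columns))
```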


1. Some of the wheelbase data is incorrect, as there are zero values, which do not make sense.
2. The insurance group variable will be dropped, as it is likely influenced by price and so will not be available at the time of prediction.
3. The combined rating will be dropped, as its data points show less spread against price than the urban and extra urban ratings.
4. The number of owners is likely correlated with mileage, so it will be discarded.
5. Make, model and trim all have high cardinality, so if used they will need to be encoded.
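
The notebook does not say which encoding will be used for the high-cardinality columns. As one option among several, the sketch below measures cardinality and applies frequency encoding on made-up rows. The technique and the column name `make` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({"make": ["Ford", "BMW", "Ford", "Audi", "Ford", "BMW"]})

# Number of distinct values per column.
cardinality = demo.nunique()

# Frequency encoding: replace each category with its share of the rows.
make_freq = demo["make"].map(demo["make"].value_counts(normalize=True))

print(cardinality["make"], make_freq.round(3).tolist())
```

Frequency encoding keeps a single numeric column per feature, unlike one-hot encoding, which would add one column per distinct make, model or trim.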
