## Exploratory Data Analysis 

- The dataset is the Car Features and MSRP dataset from Kaggle.
- The dataframe `car_df1`, which is used to predict each carmaker's origin from the features available in the dataset, is restricted to model years in the most recent decade (2010-present).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pickle
import functions as fn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [None]:
car_df = pd.read_csv('data/cardataset.zip')

In [None]:
car_df.shape

In [None]:
car_df.head()

In [None]:
car_df.isna().sum()

In [None]:
car_df.info()

In [None]:
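# Take an explicit copy so the in-place cleaning steps below modify car_df1 itself
# rather than a slice of car_df (which triggers SettingWithCopyWarning)
car_df1 = car_df.loc[car_df['Year'] > 2009].copy()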

Data cleaning process:

- Remove the rows with missing values in any of the following features:
    - `Number of Doors`
    - `Engine HP`
    - `Market Category`
- Fill in `0` for the missing `Engine Cylinders` values of electric cars
- Remove the feature `Number of Doors`, as it is considered irrelevant for building the prediction model
- Remove the duplicate rows in the dataframe `car_df1`
- Remove the 12 rows with the `Vehicle Style` values `2dr SUV`, `Convertible SUV`, and `Cargo Minivan`, as they are considered noise for building the prediction model
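
The steps above can be sketched on a toy dataframe. The rows below are invented purely for illustration and are not taken from the Kaggle data; the sketch follows the order of the list, which is one reasonable ordering rather than the only one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for car_df1; every value here is made up for illustration.
toy = pd.DataFrame({
    'Make': ['A', 'A', 'B', 'C', 'D', 'E', 'F'],
    'Engine HP': [200.0, 200.0, np.nan, 150.0, 120.0, 300.0, 250.0],
    'Engine Cylinders': [4.0, 4.0, 6.0, np.nan, 4.0, 8.0, 6.0],
    'Number of Doors': [4.0, 4.0, 2.0, 4.0, np.nan, 2.0, 2.0],
    'Market Category': ['Luxury', 'Luxury', 'Performance', 'Hatchback',
                        'Crossover', np.nan, 'Crossover'],
    'Vehicle Style': ['Sedan', 'Sedan', 'Coupe', '4dr Hatchback',
                      '4dr SUV', 'Coupe', '2dr SUV'],
})

# 1. Drop rows with a missing value in any of the three listed features.
clean = toy.dropna(subset=['Number of Doors', 'Engine HP', 'Market Category'])

# 2. Fill missing cylinder counts with 0, touching only that column.
clean = clean.fillna({'Engine Cylinders': 0})

# 3. Drop the irrelevant feature, then 4. drop exact duplicate rows.
clean = clean.drop(columns='Number of Doors').drop_duplicates()

# 5. Filter out the rare vehicle styles.
rare_styles = ['2dr SUV', 'Convertible SUV', 'Cargo Minivan']
clean = clean[~clean['Vehicle Style'].isin(rare_styles)]

print(clean)
```

Chaining non-in-place calls like this avoids modifying a slice of another dataframe, which is one common source of `SettingWithCopyWarning`.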

In [None]:
car_df1.isna().sum()

In [None]:
car_df1.dropna(subset=['Number of Doors'], axis=0, inplace=True)

In [None]:
car_df1.dropna(subset=['Engine HP'], axis=0, inplace=True)
car_df1.isna().sum()

In [None]:
# Fill only the `Engine Cylinders` column; a row-wide fillna would also
# overwrite other missing features (e.g. `Market Category`) with 0
car_df1['Engine Cylinders'] = car_df1['Engine Cylinders'].fillna(0)
car_df1.isna().sum()

In [None]:
car_df1.loc[car_df1['Market Category'].isna(), 'Make'].value_counts()

In [None]:
car_df1.info()

In [None]:
car_df1.dropna(axis=0, inplace=True)

In [None]:
car_df1.info()

In [None]:
car_df1.duplicated().sum()

In [None]:
car_df1.drop_duplicates(inplace=True)
car_df1.info()

In [None]:
car_df1.drop(columns='Number of Doors', inplace=True)
car_df1.info()

In [None]:
car_df1['Vehicle Style'].value_counts()

In [None]:
# Keep every row whose vehicle style is not one of the rare styles
rare_styles = ['2dr SUV', 'Convertible SUV', 'Cargo Minivan']
car_df1 = car_df1.loc[~car_df1['Vehicle Style'].isin(rare_styles)]
car_df1.info()

Create a new column `Origin` and assign each row the carmaker origin that matches its brand (the `Make` feature) in the `car_df1` dataframe.

In [None]:
with open('make_origin.pickle', 'rb') as f:
    make_origin_dict = pickle.load(f)

In [None]:
car_df1['Origin'] = car_df1['Make'].apply(lambda m: make_origin_dict[m])
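
A direct dictionary lookup raises `KeyError` as soon as a make is absent from the mapping. The sketch below shows one way to list such makes up front with `Series.map`. The mapping and the make names are hypothetical placeholders, not the contents of `make_origin.pickle`.

```python
import pandas as pd

# Hypothetical excerpt of a make-to-origin mapping (assumption for illustration;
# the real mapping is loaded from make_origin.pickle).
example_origin = {'Toyota': 'Japan', 'BMW': 'Germany', 'Ford': 'USA'}

makes = pd.Series(['Toyota', 'BMW', 'Ford', 'Unknown Make'])

# Series.map returns NaN for keys missing from the dict instead of raising.
origin = makes.map(example_origin)

# Collect the makes that did not receive an origin.
unmapped = sorted(makes[origin.isna()].unique())
print(unmapped)
```

An empty `unmapped` list would indicate that every make in the column has an entry in the mapping.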

In [None]:
car_df1.drop(['Make', 'Model'], axis=1, inplace=True)

Apply the function to split the attributes in the `Market Category` feature so that each attribute is treated as an individual binary feature.

In [None]:
car_df1 = fn.market_columns(car_df1)
car_df1.info()
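
The implementation of `fn.market_columns` is not shown in this notebook. As a rough idea of what such a split can look like, here is a minimal sketch. It assumes the attributes are comma-separated with no surrounding spaces, and `split_market_category` is a hypothetical stand-in, not the actual helper.

```python
import pandas as pd

def split_market_category(df, column='Market Category', sep=','):
    """Hypothetical stand-in: replace `column` with one 0/1 column per attribute."""
    # str.get_dummies splits each string on `sep` and builds indicator columns.
    dummies = df[column].str.get_dummies(sep=sep)
    return pd.concat([df.drop(columns=column), dummies], axis=1)

# Toy rows invented for illustration.
toy = pd.DataFrame({
    'MSRP': [30000, 45000, 22000],
    'Market Category': ['Luxury,Performance', 'Luxury', 'Hatchback'],
})

result = split_market_category(toy)
print(result)
```

A row tagged with several attributes gets a `1` in each corresponding column, so no information is lost relative to the original string.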

Split the dataset into train and test subsets for building the predictive model.

In [None]:
X = car_df1.drop('Origin', axis=1)
y = car_df1['Origin']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
X_train_all = fn.onehotencode(X_train)

In [None]:
X_test_all = fn.onehotencode(X_test)
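
How `fn.onehotencode` works is not shown here. If an encoder is applied to the train and test subsets independently, a category that occurs in only one of them can produce mismatched columns. The sketch below uses toy data and `pd.get_dummies` (an assumption, not necessarily what the helper uses) to show one way of aligning the test columns to the training columns.

```python
import pandas as pd

# Toy subsets invented for illustration; 'Large' and 'Midsize' never appear in test.
train = pd.DataFrame({'Vehicle Size': ['Compact', 'Midsize', 'Large'],
                      'MSRP': [20000, 30000, 40000]})
test = pd.DataFrame({'Vehicle Size': ['Compact', 'Compact'],
                     'MSRP': [21000, 19000]})

train_enc = pd.get_dummies(train, columns=['Vehicle Size'], dtype=int)

# Encode the test set, then force it onto the training columns:
# categories absent from the test set become all-zero columns,
# and categories unseen during training are dropped.
test_enc = (pd.get_dummies(test, columns=['Vehicle Size'], dtype=int)
              .reindex(columns=train_enc.columns, fill_value=0))

print(list(test_enc.columns))
```

Comparing `X_train_all.columns` with `X_test_all.columns` is a quick way to check whether this alignment is needed in practice.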

In [None]:
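# Sort the unique origins so the integer codes are the same on every run
# (the iteration order of a set of strings is not guaranteed to be stable)
origins = sorted(set(make_origin_dict.values()))

origin_code = {origin: code for code, origin in enumerate(origins)}

y_train = y_train.map(origin_code)
y_test = y_test.map(origin_code)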
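
To read model predictions back as origin names, the integer codes need an inverse lookup. A minimal sketch, using a hypothetical mapping in place of the one loaded from `make_origin.pickle`:

```python
# Hypothetical mapping (assumption for illustration only).
example_origin = {'Toyota': 'Japan', 'Honda': 'Japan', 'BMW': 'Germany', 'Ford': 'USA'}

# Sorted unique labels give the same codes on every run.
origins = sorted(set(example_origin.values()))
origin_code = {origin: code for code, origin in enumerate(origins)}

# Inverse lookup: integer code back to origin name.
code_origin = {code: origin for origin, code in origin_code.items()}

predicted_codes = [1, 0, 2]  # stand-in for model output
predicted_origins = [code_origin[c] for c in predicted_codes]
print(predicted_origins)
```

Storing `origin_code` alongside the trained model keeps the encoding and decoding consistent across sessions.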