# AUTO1 GROUP Data Science Challenge

Author: Kai Chen
Date:   May, 2018

Please take a look at the dataset in the file “Auto1-DS-TestData.csv” (see https://archive.ics.uci.edu/ml/datasets/Automobile for information on the features and other attributes) and answer the following questions:

### Question 1 (10 Points)
List as many use cases for the dataset as possible.

### Question 2 (10 Points)
Auto1 has a similar, yet much larger, dataset.
Pick one of the use cases you listed in Question 1 and describe how building a statistical model based on the dataset could best be used to improve Auto1’s business.

### Question 3 (20 Points)
Implement the model you described in question 2 in R or Python. The code has to retrieve the data, train and test a statistical model, and report relevant performance criteria. 

When submitting the challenge, send us the link for a Git repository containing the code for analysis and the potential pre-processing steps you needed to apply to the dataset. You can use your own account at github.com or create a new one specifically for this challenge if you feel more comfortable.

Ideally, we should be able to replicate your analysis from your submitted source code, so please state explicitly the versions of the tools and packages you are using (R, Python, etc.).


### Question 4 (60 Points)
A. Explain each of your design choices (e.g., preprocessing, model selection, hyperparameters, evaluation criteria). Compare and contrast your choices with alternative methodologies.

B. Describe how you would improve the model in Question 3 if you had more time.

In [26]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from pandas.api.types import is_string_dtype, is_numeric_dtype
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import math
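Because the challenge asks for explicit tool and package versions, a cell along the following lines could record them at the top of the notebook. This is a minimal sketch: the `versions` dictionary is a name introduced here for illustration, and the scikit-learn version could be added in the same way via `sklearn.__version__`.

```python
import sys

import numpy as np
import pandas as pd

# Collect interpreter and package versions (the dict name is illustrative)
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
}

for name, version in versions.items():
    print("{}: {}".format(name, version))
```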

## Step 1. Data Exploration

In [27]:
df_raw = pd.read_csv('Auto1-DS-TestData.csv', low_memory=False)

In [28]:
df_raw.shape

(205, 26)

In [29]:
display(df_raw.head(3))

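| Unnamed: 0 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |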


## Step 2. Data preparation

* Convert '?' to NaN
* Convert the columns ("normalized-losses", "bore", "stroke", "horsepower", "peak-rpm", "price") to continuous (numeric) variables
* Convert all string-type variables to categorical variables
* Handle missing values:
    - For categorical variables, nothing needs to be done, because pandas encodes a missing value as -1 in a categorical column's codes.

    - For continuous variables, replace NA with the mean or median (the median is used here), then add a `<col>_na` column to indicate which rows had missing values.
* Convert categorical variables to their numeric representations.
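The statement in the list above that missing categorical values are encoded as -1 can be checked on a toy column. This is a small sketch with made-up values, not data from the challenge file; it also shows the effect of adding 1 to the codes, which moves the missing-value code to 0.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are illustrative)
fuel = pd.Series(["gas", "diesel", np.nan, "gas"]).astype("category")

codes = fuel.cat.codes   # categories are sorted: diesel -> 0, gas -> 1, NaN -> -1
shifted = codes + 1      # shift so that the missing value becomes 0

print(codes.tolist())    # [1, 0, -1, 1]
print(shifted.tolist())  # [2, 1, 0, 2]
```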


In [30]:
# convert '?' to NaN
df_raw = df_raw.replace('?', np.nan)

In [31]:
# extract all string-type columns
cols_str = []
for col in df_raw:
    if is_string_dtype(df_raw[col]):
        cols_str.append(col)
print(cols_str)

['normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'engine-type', 'num-of-cylinders', 'fuel-system', 'bore', 'stroke', 'horsepower', 'peak-rpm', 'price']


In [32]:
# convert following columns to continuous variables based on data description
# normalized-losses, bore, stroke, horsepower, peak-rpm, price
cols = ["normalized-losses", "bore", "stroke", "horsepower", "peak-rpm", "price"]
for col in cols:
    df_raw[col] = pd.to_numeric(df_raw[col], errors='raise')
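An alternative to replacing '?' and converting columns afterwards is to let `pd.read_csv` treat '?' as missing while parsing, via its `na_values` parameter, so that numeric columns are read as numbers directly. The sketch below uses a tiny in-memory CSV with illustrative contents rather than the challenge file.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (contents are illustrative)
csv_text = "make,horsepower,price\nalfa-romero,111,13495\naudi,?,?\n"

# na_values tells the parser to treat '?' as a missing value while reading
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(toy.dtypes)
print(toy.isnull().sum())
```

With this approach the `horsepower` and `price` columns come out as floats with NaN in the second row, so no separate `pd.to_numeric` pass would be needed for them.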

In [33]:
# convert all string-type variables to categorical variables
for col in df_raw:
    if is_string_dtype(df_raw[col]):
        df_raw[col] = df_raw[col].astype('category').cat.as_ordered()

In [34]:
# Handle missing values:
# - For categorical variables, nothing needs to be done,
#   because pandas encodes a missing value as -1 in a categorical column's codes.
# - For continuous variables, replace NA with the column median,
#   then add a <col>_na column to flag which rows had missing values.

for col in df_raw:
    if is_numeric_dtype(df_raw[col]):
        col_vals = df_raw[col]
        if col_vals.isnull().any():
            df_raw[col+'_na'] = col_vals.isnull()
            df_raw[col] = col_vals.fillna(col_vals.median())
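The effect of the imputation step is easier to see on a toy frame. The following sketch uses made-up horsepower values; the frame name `toy` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy numeric column with one missing entry (values are illustrative)
toy = pd.DataFrame({"horsepower": [100.0, np.nan, 120.0, 200.0]})

col = "horsepower"
toy[col + "_na"] = toy[col].isnull()            # flag rows that were missing
toy[col] = toy[col].fillna(toy[col].median())   # median of 100, 120, 200 is 120

print(toy)
```

Note that the loop in the notebook applies the same treatment to `price`, which later serves as the prediction target. Dropping rows with a missing target, and computing medians on the training split only, are common alternatives worth comparing.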

In [35]:
# Convert categorical variables to their numeric representations.
# Adding 1 shifts the codes so that a missing value (code -1) becomes 0.
for col in df_raw:
    if str(df_raw[col].dtype) == "category":
        df_raw[col] = df_raw[col].cat.codes + 1

In [36]:
display(df_raw.head(5))

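| Unnamed: 0 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | peak-rpm | city-mpg | highway-mpg | price | normalized-losses_na | bore_na | stroke_na | horsepower_na | peak-rpm_na | price_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 115.0 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 88.6 | ... | 5000.0 | 21 | 27 | 13495.0 | True | False | False | False | False | False |
| 1 | 3 | 115.0 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 88.6 | ... | 5000.0 | 21 | 27 | 16500.0 | True | False | False | False | False | False |
| 2 | 1 | 115.0 | 1 | 2 | 1 | 2 | 3 | 3 | 1 | 94.5 | ... | 5000.0 | 19 | 26 | 16500.0 | True | False | False | False | False | False |
| 3 | 2 | 164.0 | 2 | 2 | 1 | 1 | 4 | 2 | 1 | 99.8 | ... | 5500.0 | 24 | 30 | 13950.0 | False | False | False | False | False | False |
| 4 | 2 | 164.0 | 2 | 2 | 1 | 1 | 4 | 1 | 1 | 99.4 | ... | 5500.0 | 18 | 22 | 17450.0 | False | False | False | False | False | False |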


In [37]:
print(df_raw.describe())

        symboling  normalized-losses        make   fuel-type  aspiration  \
count  205.000000         205.000000  205.000000  205.000000  205.000000   
mean     0.834146         120.600000   13.195122    1.902439    1.180488   
std      1.245307          31.805105    6.274831    0.297446    0.385535   
min     -2.000000          65.000000    1.000000    1.000000    1.000000   
25%      0.000000         101.000000    9.000000    2.000000    1.000000   
50%      1.000000         115.000000   13.000000    2.000000    1.000000   
75%      2.000000         137.000000   20.000000    2.000000    1.000000   
max      3.000000         256.000000   22.000000    2.000000    2.000000   

       num-of-doors  body-style  drive-wheels  engine-location  wheel-base  \
count    205.000000  205.000000    205.000000       205.000000  205.000000   
mean       1.424390    3.614634      2.326829         1.014634   98.756585   
std        0.514867    0.859081      0.556171         0.120377    6.021776   
min

## Step 3. Predict Continuous Price

In [24]:
print(df_raw.shape)

(205, 32)


In [39]:
X = df_raw.drop('price', axis=1)
y = df_raw['price']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 99)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((164, 31), (41, 31), (164,), (41,))

In [40]:
clf = RandomForestRegressor(n_jobs=-1)
clf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
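The fitted model above uses 10 trees and no fixed `random_state`, so repeated runs can give slightly different scores, and with only 205 rows a single 80/20 split is itself noisy. One possible way to get a more stable estimate is k-fold cross-validation with a seeded forest. The sketch below is an assumption-laden illustration: the data come from `make_regression` as a synthetic stand-in for the prepared features and price, and `n_estimators=50` and `random_state=99` are illustrative choices, not settings from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared features and target (illustrative only)
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0,
                               random_state=99)

# Fixing random_state makes the forest reproducible across runs
model = RandomForestRegressor(n_estimators=50, random_state=99)

# 5-fold cross-validation: each fold serves once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=99)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")

print(scores.round(3))
print("mean R^2: {:.3f}".format(scores.mean()))
```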

In [41]:
def rmse(preds, actuals):
    return math.sqrt(((preds-actuals)**2).mean())
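As a quick sanity check of the `rmse` helper, it can be run on a small hand-computable example. The arrays below are made-up values chosen so that the arithmetic is easy to follow.

```python
import math

import numpy as np


def rmse(preds, actuals):
    # Root of the mean squared difference between predictions and targets
    return math.sqrt(((preds - actuals) ** 2).mean())


preds = np.array([2.0, 4.0, 6.0])
actuals = np.array([1.0, 4.0, 8.0])

# Errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3
print(round(rmse(preds, actuals), 3))  # 1.291
```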

In [45]:
rmse_train = rmse(clf.predict(X_train), y_train)
rmse_val = rmse(clf.predict(X_val), y_val)
score_train = clf.score(X_train, y_train)
score_val = clf.score(X_val, y_val)

print('train rmse: {}'.format(round(rmse_train, 3)))
print('validation rmse: {}'.format(round(rmse_val, 3)))
print('train score: {}'.format(round(score_train, 3)))
print('validation score: {}'.format(round(score_val, 3)))

train rmse: 1338.293
validation rmse: 2577.038
train score: 0.972
validation score: 0.858
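The "score" reported above is the coefficient of determination, R², which is what `score` returns for scikit-learn regressors. The gap between the train value (0.972) and the validation value (0.858) suggests some overfitting. As a minimal sketch of what the metric measures, R² can be computed by hand on made-up numbers; the helper name `r_squared` is introduced here for illustration.

```python
import numpy as np


def r_squared(preds, actuals):
    # 1 minus the ratio of residual variation to total variation around the mean
    ss_res = ((actuals - preds) ** 2).sum()
    ss_tot = ((actuals - actuals.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


actuals = np.array([10.0, 20.0, 30.0, 40.0])
preds = np.array([12.0, 18.0, 33.0, 37.0])

# ss_res = 4 + 4 + 9 + 9 = 26, ss_tot = 225 + 25 + 25 + 225 = 500
print(round(r_squared(preds, actuals), 3))  # 0.948
```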


## Step 4. Predict Categorical Symboling