# Overview
The purpose of this notebook is to prepare and analyze the data for the chosen modeling task: predicting car make from technical attributes (no price or insurance information).

## Modeling Task
The motivating objective for this modeling task is to compare the efficacy of two different approaches for characterising "feature importance" for Random Forest classifiers. A similar analysis is demonstrated here: https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html
I will perform a similar analysis, but on the 1985 Auto Imports database.
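
As a rough sketch of the comparison in mind, the snippet below computes both kinds of importance for one fitted forest. It assumes the two approaches are the ones compared in the linked scikit-learn example (impurity-based importance and permutation importance), and it uses synthetic data from `make_classification` as a stand-in, not the Auto Imports data. The variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Auto Imports data): 3 classes, 6 features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Approach 1: impurity-based importance, derived from the fitted trees
# (and therefore from the training data only).
impurity_importance = forest.feature_importances_

# Approach 2: permutation importance, measured here on held-out data by
# shuffling one feature at a time and recording the drop in accuracy.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
permutation_importance_mean = perm.importances_mean

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={impurity_importance[i]:.3f}, "
          f"permutation={permutation_importance_mean[i]:.3f}")
```

The two rankings need not agree, which is the point of the experiment.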

Upon a cursory reading of the description of the data set (https://www.openml.org/d/9), my intuition is that many of the technical attributes are probably strongly correlated due to constraints of physics, engineering, manufacturing, etc. For example:

15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
16. num-of-cylinders: eight, five, four, six, three, twelve, two.
17. engine-size: continuous from 61 to 326.
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
19. bore: continuous from 2.54 to 3.94.
20. stroke: continuous from 2.07 to 4.17.
21. compression-ratio: continuous from 7 to 23.
22. horsepower: continuous from 48 to 288.
23. peak-rpm: continuous from 4150 to 6600.

If, in fact, many of these features are correlated, it may have interesting impacts on the experiment.  It may be necessary to take measures to reduce the number of correlated features.
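
One way to check this intuition is to list the feature pairs whose rank correlation exceeds a threshold. The sketch below is only an illustration: `highly_correlated_pairs` is a hypothetical helper, the 0.8 threshold is arbitrary, and the toy frame is synthetic (its `horsepower` column is deliberately built from `engine-size`), so the printed values say nothing about the real data.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, rho) for each pair with |Spearman rho| >= threshold."""
    corr = df.corr(method="spearman")
    cols = list(corr.columns)
    pairs = []
    # Visit only the upper triangle so that each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            rho = float(corr.iloc[i, j])
            if abs(rho) >= threshold:
                pairs.append((cols[i], cols[j], rho))
    return pairs


# Synthetic toy data (NOT the Auto Imports data); the dependence of
# horsepower on engine-size is made up for the purpose of the demo.
rng = np.random.default_rng(0)
engine_size = rng.uniform(61, 326, size=200)
toy = pd.DataFrame({
    "engine-size": engine_size,
    "horsepower": 0.8 * engine_size + rng.normal(0, 5, size=200),
    "peak-rpm": rng.uniform(4150, 6600, size=200),
})

pairs = highly_correlated_pairs(toy)
for col_a, col_b, rho in pairs:
    print(f"{col_a} ~ {col_b}: rho={rho:.2f}")
```

On the real data, the same helper could be applied to the numeric columns of the modeling data frame once it has been built.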

I'm specifically interested in demonstrating a multi-class classifier. Therefore, I will use the "make" attribute (attribute 3) as the response variable.

For model features, I will initially consider all features other than the monetary- and insurance-related ones (1. symboling; 2. normalized-losses; 26. price).

## TODO
- Consider custom feature encoding
- Transform data into supervised learning structure: response vector + design matrix
- Make partitions for Train, Test, Validation (note, there are only 205 rows in whole data set)
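
A three-way partition could be sketched with two stratified calls to `train_test_split`, as below. This is only one possible approach: the toy arrays, the `toy_` names and the 60/20/20 proportions are assumptions for illustration. With only 205 rows in the real data, the resulting partitions would be very small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in (NOT the Auto Imports data): 200 rows, 4 balanced classes.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5))
toy_y = np.repeat(np.arange(4), 50)

# Step 1: hold out 20% of all rows as the test set.
toy_X_rest, toy_X_test, toy_y_rest, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

# Step 2: split the remaining 80% into train (60% of all rows) and
# validation (20% of all rows); 0.25 * 0.8 = 0.2.
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X_rest, toy_y_rest, test_size=0.25, stratify=toy_y_rest, random_state=0)

print(len(toy_y_train), len(toy_y_val), len(toy_y_test))
```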

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

In [2]:
DEFAULT_FIGSIZE = (14, 14)

# Load Data

In [3]:
full_data = fetch_openml(data_id=9)

In [4]:
# - The OpenML target for this data set is 'symboling'; prepend it as the
#   first column so that all attributes end up in a single array
data_values = np.concatenate(
    (np.reshape(full_data['target'], (-1, 1)), full_data['data']),
    axis=1)
# - Build the matching list of column names ('symboling' first)
all_cols = ['symboling'] + list(full_data['feature_names'])

In [5]:
full_df = pd.DataFrame(data_values, columns=all_cols).infer_objects()

In [6]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    object 
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    float64
 3   fuel-type          205 non-null    float64
 4   aspiration         205 non-null    float64
 5   num-of-doors       203 non-null    float64
 6   body-style         205 non-null    float64
 7   drive-wheels       205 non-null    float64
 8   engine-location    205 non-null    float64
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    float64
 14  engine-type        205 non-null    float64
 15  num-of-cylinders   205 non-null    float64
 16  engine-size        205 non

# Modeling Data Analysis and Curation

- Given the goals of the experiment, let's simplify our lives by removing all rows from classes that are severely undersampled

In [7]:
response_col = 'make'
# - Easier to list the features to be ignored than to list those to be included
ignore_feature_cols = [response_col, 'symboling', 'normalized-losses', 'price']
feature_cols = list(set(full_df.columns) - set(ignore_feature_cols))
for col in feature_cols:
    print(col)

engine-location
num-of-doors
body-style
peak-rpm
aspiration
fuel-type
width
horsepower
bore
highway-mpg
length
fuel-system
engine-type
curb-weight
height
city-mpg
drive-wheels
stroke
num-of-cylinders
wheel-base
compression-ratio
engine-size


## Remove "make" values that have too few examples

In [8]:
min_examples_count = 9

In [9]:
response_value_counts = full_df[response_col].value_counts()
response_value_counts

19.0    32
12.0    18
8.0     17
5.0     13
11.0    13
18.0    12
20.0    12
13.0    11
21.0    11
4.0      9
9.0      8
2.0      8
14.0     7
1.0      7
17.0     6
15.0     5
6.0      4
3.0      3
0.0      3
7.0      3
16.0     2
10.0     1
Name: make, dtype: int64

In [10]:
valid_response_values = response_value_counts[response_value_counts >= min_examples_count].keys().tolist()
valid_response_values

[19.0, 12.0, 8.0, 5.0, 11.0, 18.0, 20.0, 13.0, 21.0, 4.0]

In [22]:
modeling_df = full_df[full_df[response_col].isin(valid_response_values)]
print(modeling_df.shape)

(148, 26)


## Remove rows that have NaN values for any feature cols

In [23]:
modeling_df = modeling_df[~modeling_df[feature_cols].isna().any(axis=1)]
print(modeling_df.shape)

(142, 26)


In [24]:
modeling_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 21 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          142 non-null    object 
 1   normalized-losses  129 non-null    float64
 2   make               142 non-null    float64
 3   fuel-type          142 non-null    float64
 4   aspiration         142 non-null    float64
 5   num-of-doors       142 non-null    float64
 6   body-style         142 non-null    float64
 7   drive-wheels       142 non-null    float64
 8   engine-location    142 non-null    float64
 9   wheel-base         142 non-null    float64
 10  length             142 non-null    float64
 11  width              142 non-null    float64
 12  height             142 non-null    float64
 13  curb-weight        142 non-null    float64
 14  engine-type        142 non-null    float64
 15  num-of-cylinders   142 non-null    float64
 16  engine-size        142 no

# Create Train and Test Data for Modeling

In [30]:
X_train, X_test, y_train, y_test = train_test_split(
    modeling_df[feature_cols], 
    modeling_df[response_col], 
    test_size=0.33, 
    stratify=modeling_df[response_col], 
    random_state=100)

In [26]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 34 to 62
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   engine-location    95 non-null     float64
 1   num-of-doors       95 non-null     float64
 2   body-style         95 non-null     float64
 3   peak-rpm           95 non-null     float64
 4   aspiration         95 non-null     float64
 5   fuel-type          95 non-null     float64
 6   width              95 non-null     float64
 7   horsepower         95 non-null     float64
 8   bore               95 non-null     float64
 9   highway-mpg        95 non-null     float64
 10  length             95 non-null     float64
 11  fuel-system        95 non-null     float64
 12  engine-type        95 non-null     float64
 13  curb-weight        95 non-null     float64
 14  height             95 non-null     float64
 15  city-mpg           95 non-null     float64
 16  drive-wheels       95 non-n

In [27]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 95 to 65
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   engine-location    47 non-null     float64
 1   num-of-doors       47 non-null     float64
 2   body-style         47 non-null     float64
 3   peak-rpm           47 non-null     float64
 4   aspiration         47 non-null     float64
 5   fuel-type          47 non-null     float64
 6   width              47 non-null     float64
 7   horsepower         47 non-null     float64
 8   bore               47 non-null     float64
 9   highway-mpg        47 non-null     float64
 10  length             47 non-null     float64
 11  fuel-system        47 non-null     float64
 12  engine-type        47 non-null     float64
 13  curb-weight        47 non-null     float64
 14  height             47 non-null     float64
 15  city-mpg           47 non-null     float64
 16  drive-wheels       47 non-n

In [28]:
y_train.value_counts()

19.0    22
12.0    12
11.0     9
5.0      9
20.0     8
8.0      8
18.0     8
21.0     7
13.0     7
4.0      5
Name: make, dtype: int64

In [29]:
y_test.value_counts()

19.0    10
12.0     6
20.0     4
5.0      4
8.0      4
11.0     4
18.0     4
13.0     4
21.0     4
4.0      3
Name: make, dtype: int64

#### Observations
- This is such a small data set that, after filtering out very low-occurrence classes and doing a train-test split, there are very few samples left for the test set (3 to 10 per class). This will make it quite difficult to interpret the results of the experiment.
  - Specifically, I'm hoping to interpret the difference in Precision-Recall metrics among a set of models (Random Forest models with different features being ignored)
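
To illustrate how coarse per-class metrics become with so few test samples, the sketch below computes per-class precision and recall with `precision_recall_fscore_support`. The labels are made up for the demo and do not come from any model in this notebook; it also assumes precision and recall are the metrics of interest.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for three classes with 3, 2 and 3 test samples.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)

for label, p, r, n in zip([0, 1, 2], precision, recall, support):
    print(f"class {label}: precision={p:.2f}, recall={r:.2f}, n={n}")
```

With three test samples in a class, a single misclassification moves that class's recall by a third, so small differences between models will be hard to distinguish from noise.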