This exercise uses the Auto data set, available as Auto.csv from the MY591 website or directly from https://www.statlearning.com/s/Auto.csv. The data describe characteristics of a number of different cars and include the following variables:


```
mpg = miles per gallon
cylinders = number of cylinders
displacement = engine displacement
horsepower = engine horsepower
weight = weight of the car in lbs
acceleration = time in seconds for the car to go from 0-60mph
year = model year (last two digits)
origin = origin of the car (1 = American, 2 = European, 3 = Japanese)
name = name of car
```

We will build a decision tree to predict whether a car is luxury or not, using a class label derived from its mpg.

# 1. Load the Auto data

In [24]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
AutoData = pd.read_csv('/content/drive/MyDrive/ISCH 370 Labs/Auto.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 2. Check whether there are any null values in the dataset. If so, clean the data by removing the rows (tuples) with null values.

In [25]:
AutoData.isnull().sum()

Unnamed: 0,0
mpg,0
cylinders,0
displacement,0
horsepower,0
weight,0
acceleration,0
year,0
origin,0
name,0
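
The Auto data has no nulls, so no cleaning is needed here. For reference, the following is a minimal sketch of what the removal step could look like on a frame that does contain nulls. The `toy` frame and its values are made up for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for AutoData; the values are invented for illustration.
toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 27.0],
    "cylinders": [8, 4, 4],
})

null_counts = toy.isnull().sum()  # per-column count of missing values
cleaned = toy.dropna()            # drop every row that contains at least one null

print(null_counts)
print(len(toy), "->", len(cleaned))
```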


# 3. Feature Engineering

3.1 Check the data type of each column and identify the columns that are not numeric.

In [26]:
AutoData.dtypes

Unnamed: 0,0
mpg,float64
cylinders,int64
displacement,float64
horsepower,object
weight,int64
acceleration,float64
year,int64
origin,int64
name,object


3.2 The horsepower column has object type. Because it is one of the features that can be relevant to mpg, it should be converted to a numeric type. Convert it to int64.

In [29]:
AutoData['horsepower'] = AutoData['horsepower'].astype('int64')
AutoData['horsepower']

Unnamed: 0,horsepower
0,130
1,165
2,150
3,150
4,140
...,...
392,86
393,52
394,84
395,79


3.3 Step 3.2 raises an error such as ValueError: invalid literal for int() with base 10: '?'. This happens because some records have "?" as their horsepower value. To fix this, use this sample code:



```
df=df.loc[df['horsepower'].str.isnumeric()==True]
```

By doing this, you remove those records having non-numeric values for horsepower.

After this, run the code in 3.2 again to convert the data type of horsepower



In [28]:
AutoData=AutoData.loc[AutoData['horsepower'].str.isnumeric()==True]
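
An alternative to filtering with str.isnumeric() is to let pandas coerce unparseable entries to NaN and then drop them. This is a sketch on a made-up three-row column, not the lab's required solution. It assumes the only bad entries are placeholders such as "?".

```python
import pandas as pd

# Toy column mimicking horsepower as read from the CSV: strings, with '?' for missing.
toy = pd.DataFrame({"horsepower": ["130", "?", "86"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")
toy = toy.dropna(subset=["horsepower"])
toy["horsepower"] = toy["horsepower"].astype("int64")

print(toy)
```

Unlike str.isnumeric(), this approach also keeps values such as "130.5" or "-5" if they ever appear, so the two filters are not strictly equivalent.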

# 4. Generate Classifiers for classification.

In this example, we want to predict if a car is luxury or economy (i.e., indicated by its mpg), given the other features.

4.1 Generate the target variable: use pd.qcut to convert mpg to categorical values with the labels Economy and Luxury, and put these values in a new column "car_type". Print the dataframe to show the change.

In [30]:
AutoData['car_type'] = pd.qcut(AutoData['mpg'], q=2, labels=['Economy','Luxury'])
print(AutoData)

      mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0    18.0          8         307.0         130    3504          12.0    70   
1    15.0          8         350.0         165    3693          11.5    70   
2    18.0          8         318.0         150    3436          11.0    70   
3    16.0          8         304.0         150    3433          12.0    70   
4    17.0          8         302.0         140    3449          10.5    70   
..    ...        ...           ...         ...     ...           ...   ...   
392  27.0          4         140.0          86    2790          15.6    82   
393  44.0          4          97.0          52    2130          24.6    82   
394  32.0          4         135.0          84    2295          11.6    82   
395  28.0          4         120.0          79    2625          18.6    82   
396  31.0          4         119.0          82    2720          19.4    82   

     origin                       name car_type  
0         1  
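
To see how pd.qcut assigns the two labels, here is a small sketch on six made-up mpg values. With q=2 the cut point is the median, and the labels are applied from the lowest bin upward, so the first label in the list goes to the lower-mpg half.

```python
import pandas as pd

mpg = pd.Series([15.0, 18.0, 22.0, 27.0, 31.0, 44.0])  # invented values

# q=2 splits at the median; labels are assigned in ascending bin order.
car_type = pd.qcut(mpg, q=2, labels=["Economy", "Luxury"])

print(mpg.median())
print(car_type.tolist())
```

This matches the output above, where the cars with 15 to 18 mpg receive the first label in the list.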

4.2 Prepare the features and target variables. For features, use all the columns in the dataframe except car_type, mpg, and name. car_type is the target and mpg is the variable it was derived from, so neither should be included in the features. name has no predictive power because it only identifies the car.

In [31]:
target = AutoData['car_type']
features = AutoData.drop(['mpg','car_type','name'], axis=1)
print(target)
print(features)

0      Economy
1      Economy
2      Economy
3      Economy
4      Economy
        ...   
392     Luxury
393     Luxury
394     Luxury
395     Luxury
396     Luxury
Name: car_type, Length: 392, dtype: category
Categories (2, object): ['Economy' < 'Luxury']
     cylinders  displacement  horsepower  weight  acceleration  year  origin
0            8         307.0         130    3504          12.0    70       1
1            8         350.0         165    3693          11.5    70       1
2            8         318.0         150    3436          11.0    70       1
3            8         304.0         150    3433          12.0    70       1
4            8         302.0         140    3449          10.5    70       1
..         ...           ...         ...     ...           ...   ...     ...
392          4         140.0          86    2790          15.6    82       1
393          4          97.0          52    2130          24.6    82       2
394          4         135.0          84    2295  

4.3 Split the data into a training set and a test set using train_test_split(). Three arguments need to be specified: the features, the target, and the test set size (test_size).

In [32]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(features,target,test_size=0.3,random_state=1)
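
The sketch below uses placeholder arrays with the same number of rows as the cleaned data (392) to show two properties of the call: test_size=0.3 yields 118 test rows and 274 training rows, and fixing random_state makes the split reproducible. The arrays `X` and `y` are stand-ins, not the lab's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 392 rows, the row count of the cleaned Auto data.
X = np.arange(392 * 2).reshape(392, 2)
y = np.arange(392) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
# Same seed, same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```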

4.4 Build a decision tree model with criterion="entropy", max_depth=3, use the decision tree to predict the car_type in the test set.

In [33]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3,criterion='entropy')
clf = clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)
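
A shallow tree is easy to inspect as text. The following sketch fits a tree with the same settings on synthetic data and prints its rules with export_text. The synthetic data and the feature names `f0` to `f3` are assumptions for illustration. On the lab data you would pass the fitted clf and list(features.columns) instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the Auto features; names are hypothetical.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=0)
tree.fit(X, y)

# Render the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=names)
print(rules)
```

Setting random_state on the classifier, as done here, also removes run-to-run variation from tie-breaking between equally good splits.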

4.5 Compute the accuracy, precision, recall and f1 score of the decision tree. Set Luxury as the positive label.

The result looks like this (the values you get should be close to, but not necessarily the same as, those shown, since they depend on the random split):


```
Accuracy: 0.8220338983050848
Precision is:  0.8235294117647058
Recall is:  0.8615384615384616
f1 score is:  0.8421052631578948
```



In [34]:
from sklearn import metrics
print('Accuracy:',metrics.accuracy_score(Y_test,Y_pred))
print('Precision:',metrics.precision_score(Y_test,Y_pred,pos_label='Luxury'))
print('Recall:',metrics.recall_score(Y_test,Y_pred,pos_label='Luxury'))
print('F1 Score:',metrics.f1_score(Y_test,Y_pred,pos_label='Luxury'))

Accuracy: 0.923728813559322
Precision: 0.8793103448275862
Recall: 0.9622641509433962
F1 Score: 0.918918918918919


4.6 Print out the confusion matrix.

The result looks like this (the values you get should be close to, but not necessarily the same as, those shown, since they depend on the random split):



```
        Luxury	Economy
Luxury	41	12
Economy	9	56
```



In [35]:
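from sklearn.metrics import confusion_matrix
# Pass the labels explicitly so that rows and columns match the names used below;
# by default sklearn sorts the labels, which would put Economy first.
class_labels = ['Luxury','Economy']
cm = confusion_matrix(Y_test,Y_pred,labels=class_labels)
cm_df = pd.DataFrame(cm,index=class_labels,columns=class_labels)
cm_df

Unnamed: 0,Luxury,Economy
Luxury,51,2
Economy,7,58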
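
The four scores from Step 4.5 can be recomputed by hand from the confusion matrix counts. In this sketch, rows are read as actual classes and columns as predicted classes, with Luxury as the positive class. The counts 51, 7, 2 and 58 are the ones consistent with the scores printed in Step 4.5 for the 118-car test set. The variable names are illustrative.

```python
# Counts with Luxury as the positive class:
tp = 51  # Luxury cars predicted Luxury
fp = 7   # Economy cars predicted Luxury
fn = 2   # Luxury cars predicted Economy
tn = 58  # Economy cars predicted Economy

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```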


4.7 Use the splitting result from Step 4.3, build a KNN classifier, use the default values for n_neighbors and weights.

In [36]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf = clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)

4.8 Use the settings from Step 4.5 to print the accuracy, precision, recall, and f1 score.

The result looks like this (the values you get should be close to, but not necessarily the same as, those shown, since they depend on the random split):



```
Accuracy: 0.864406779661017
Precision is:  0.9016393442622951
Recall is:  0.8461538461538461
f1 score is:  0.873015873015873
```



In [37]:
print('Accuracy:',metrics.accuracy_score(Y_test,Y_pred))
print('Precision:',metrics.precision_score(Y_test,Y_pred,pos_label='Luxury'))
print('Recall:',metrics.recall_score(Y_test,Y_pred,pos_label='Luxury'))
print('F1 Score:',metrics.f1_score(Y_test,Y_pred,pos_label='Luxury'))

Accuracy: 0.864406779661017
Precision: 0.8245614035087719
Recall: 0.8867924528301887
F1 Score: 0.8545454545454545
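
KNeighborsClassifier uses Euclidean distance by default, so features with large numeric ranges dominate the neighbor search. The sketch below takes two rows from the feature table printed in Step 4.2 (index 0 and index 392) and measures how much of their squared distance comes from weight alone.

```python
import numpy as np

# Columns: cylinders, displacement, horsepower, weight, acceleration, year, origin
car_a = np.array([8, 307.0, 130, 3504, 12.0, 70, 1])  # row 0 in Step 4.2
car_b = np.array([4, 140.0, 86, 2790, 15.6, 82, 1])   # row 392 in Step 4.2

squared_diffs = (car_a - car_b) ** 2
weight_share = squared_diffs[3] / squared_diffs.sum()  # index 3 is weight

print(round(weight_share, 3))
```

For this pair, weight contributes roughly 94% of the squared distance, which is why putting the features on a common scale can change the KNN result.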


4.9 Scale the data and then classify. Use StandardScaler to scale the feature data, then reuse the code from Steps 4.3, 4.7, and 4.8 to build a new KNN classifier and evaluate its performance.

The result looks like this (the values you get should be close to, but not necessarily the same as, those shown, since they depend on the random split):



```
Accuracy: 0.9322033898305084
Precision is:  0.9508196721311475
Recall is:  0.9206349206349206
f1 score is:  0.9354838709677418
```



In [38]:
from sklearn.preprocessing import StandardScaler
sscaler = StandardScaler()
numerics = ['int16','int32','int64','float16','float32','float64']
new_features = features.select_dtypes(include = numerics)
ss_features = sscaler.fit_transform(new_features)
new_features = pd.DataFrame(ss_features,columns = new_features.columns)

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(new_features,target,test_size=0.3,random_state=1)

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf = clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)

print('Accuracy:',metrics.accuracy_score(Y_test,Y_pred))
print('Precision:',metrics.precision_score(Y_test,Y_pred,pos_label='Luxury'))
print('Recall:',metrics.recall_score(Y_test,Y_pred,pos_label='Luxury'))
print('F1 Score:',metrics.f1_score(Y_test,Y_pred,pos_label='Luxury'))


Accuracy: 0.9491525423728814
Precision: 0.9433962264150944
Recall: 0.9433962264150944
F1 Score: 0.9433962264150944
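
The cell above fits the scaler on all rows before splitting, so the test rows influence the scaling statistics. A common variant is to fit the scaler on the training rows only, which a pipeline does automatically. This is a sketch on synthetic data rather than a replacement for the lab solution, and the scores it produces are not comparable to those above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Auto features and target.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# fit() scales using training statistics only; score() reuses them on the test set.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(accuracy)
```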

