# Week 8 Lab
In this lab, we will compare the performance of the DecisionTreeClassifier with the KNeighborsClassifier when applied to the Node welcome survey.

In [1]:
# Import pandas, numpy, DecisionTreeClassifier, KNN, train_test_split, and accuracy_score
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Read in the CSV from the URL:

In [2]:
# Read in the csv from the url
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-8/lab/survey-results.csv'
df = pd.read_csv(url)

Take a look at the data to identify the features and the target. We are trying to identify which students are pursuing a Data Science related major (Computer Science or Statistics). The DS_Major column indicates whether or not the student is pursuing a Data Science major.

In [3]:
# Take a look at the first 5 rows
df.head()

Unnamed: 0,Timestamp,Index,Use_responses,First_name,Last_name,Nickname,College,Year,Major,DS_Major,New_to_CS,Python,R,SQL,SAS_STATA,Other_languages,Statistics,DS_Topics,Hobbies
0,2/14/2021 10:16,1,Sure!,Susannah,White,Susie,UVA College of Arts and Sciences,2nd Year,Pre Commerce,No,Nope,Kinda know it,Kinda know it,Never touched it,Never touched it,Kinda know it,Kinda know it,I’m interested in data wrangling and using dat...,"Dance, baking"
1,2/14/2021 10:25,2,"Yes, but anonymized",PII,PII,,UVA Darden,Graduate school,MBA,No,It's been a while,Kinda know it,Never touched it,Kinda know it,Never touched it,Love it,Kinda know it,"Learning the new languages, formalizing my lim...",Outdoors activities
2,2/14/2021 11:01,3,Sure!,Jackson,Wolf,,N/a,2nd Year,"CS, Statistics",Yes,Nope,Love it,Never touched it,Never touched it,Never touched it,Love it,Love it,,"Video games, tv, sports, anime"
3,2/14/2021 11:22,4,"Yes, but anonymized",PII,PII,,UVA College of Arts and Sciences,4th Year,"History, Chinese Language & Literature",No,It's been a while,Kinda know it,Never touched it,Never touched it,Never touched it,Never touched it,Never touched it,,"Ultimate Frisbee, Biking (Road and Mountain), ..."
4,2/14/2021 12:07,5,Sure!,Paul,Andrews,,UVA College of Arts and Sciences,3rd Year,Mathematics,No,It's been a while,Kinda know it,Never touched it,Kinda know it,Never touched it,Never touched it,Kinda know it,I just want to get more experience with data s...,"I'm a dancer, I enjoy trying different cuisine..."


Map the 'Yes' and 'No' in the DS_Major column to be 1 for 'Yes' and 0 for 'No'.

In [4]:
# Convert DS_Major to be 1 and 0
df['DS_Major'] = df['DS_Major'].map({'Yes':1, 'No':0})

Drop the columns that will not be useful for this prediction.

In [5]:
# Drop unimportant columns from the DataFrame
unimportant_columns = ['Timestamp', 'Index', 'Use_responses', 'First_name', 'Last_name', 'Nickname', 'Hobbies', 'DS_Topics']
cleaned = df.drop(columns=unimportant_columns)
cleaned.head()

Unnamed: 0,College,Year,Major,DS_Major,New_to_CS,Python,R,SQL,SAS_STATA,Other_languages,Statistics
0,UVA College of Arts and Sciences,2nd Year,Pre Commerce,0,Nope,Kinda know it,Kinda know it,Never touched it,Never touched it,Kinda know it,Kinda know it
1,UVA Darden,Graduate school,MBA,0,It's been a while,Kinda know it,Never touched it,Kinda know it,Never touched it,Love it,Kinda know it
2,N/a,2nd Year,"CS, Statistics",1,Nope,Love it,Never touched it,Never touched it,Never touched it,Love it,Love it
3,UVA College of Arts and Sciences,4th Year,"History, Chinese Language & Literature",0,It's been a while,Kinda know it,Never touched it,Never touched it,Never touched it,Never touched it,Never touched it
4,UVA College of Arts and Sciences,3rd Year,Mathematics,0,It's been a while,Kinda know it,Never touched it,Kinda know it,Never touched it,Never touched it,Kinda know it


One-hot encode the categorical variables using get_dummies from pandas.

In [6]:
# One-hot encode categorical variables
one_hot = pd.get_dummies(cleaned)
one_hot.head()

Unnamed: 0,DS_Major,College_N/a,College_UVA,College_UVA College of Arts and Sciences,College_UVA Darden,Year_1st Year,Year_2nd Year,Year_3rd Year,Year_4th Year,Year_Graduate school,...,SQL_Never touched it,SAS_STATA_Kinda know it,SAS_STATA_Love it,SAS_STATA_Never touched it,Other_languages_Kinda know it,Other_languages_Love it,Other_languages_Never touched it,Statistics_Kinda know it,Statistics_Love it,Statistics_Never touched it
0,0,0,0,1,0,0,1,0,0,0,...,1,0,0,1,1,0,0,1,0,0
1,0,0,0,0,1,0,0,0,0,1,...,0,0,0,1,0,1,0,1,0,0
2,1,1,0,0,0,0,1,0,0,0,...,1,0,0,1,0,1,0,0,1,0
3,0,0,0,1,0,0,0,0,1,0,...,1,0,0,1,0,0,1,0,0,1
4,0,0,0,1,0,0,0,1,0,0,...,0,0,0,1,0,0,1,1,0,0
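
To see what get_dummies does on a smaller scale, here is a minimal sketch using a made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the survey.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    'Year': ['1st Year', '2nd Year', '1st Year'],
    'DS_Major': [1, 0, 1],
})

# String columns are expanded into one indicator column per category;
# numeric columns such as DS_Major pass through unchanged
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

Each category of Year becomes its own column named `Year_<category>`, which matches the naming pattern in the output above. Note that pandas 2.0 and later return True/False in the indicator columns by default rather than the 1/0 shown in this lab; passing `dtype=int` to get_dummies should restore the 1/0 form.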


Now, separate the data into X and y. Remember that the target is whether or not the student is pursuing a DS related major.

In [7]:
# Separate into X and y
X = one_hot.drop('DS_Major', axis=1)
y = one_hot['DS_Major']

Use train_test_split to split the data into training and testing. Use a test_size of 0.2, and set random_state to 42.

In [8]:
# Train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
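
As a quick check of what test_size=0.2 means in practice, the sketch below splits a hypothetical 15-row array (`X_toy` and `y_toy` are stand-ins, not the survey data). The later outputs in this lab show 3 test labels and 12 training labels, which is consistent with a 15-row dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 15 rows, 2 features
X_toy = np.arange(30).reshape(15, 2)
y_toy = np.array([0, 1] * 7 + [1])

# Same settings as the lab: 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With only a handful of test rows, each prediction moves the accuracy by a large step, so the scores in this lab should be read as rough indications rather than stable estimates.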

We will start by fitting a decision tree classifier to the training data.

In [9]:
# Fit a DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
clf_dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Use the classifier to predict on the testing data and compare the predictions to the actual labels.

In [10]:
# Get predictions on the test set and the true labels as an array
predicted = clf_dt.predict(X_test)
actual = np.array(y_test)

print('Predicted: ',predicted)
print('Actual:    ',actual)

Predicted:  [1 0 1]
Actual:     [1 0 0]


Next, get the accuracy score for the DecisionTreeClassifier.

In [11]:
# Get the accuracy score (accuracy_score takes the true labels first)
acc = accuracy_score(actual, predicted)
acc

0.6666666666666666
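
Accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it by hand from the two arrays printed above (the variable names `predicted_dt`, `actual_dt`, `manual_acc`, and `sk_acc` are introduced here for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Values copied from the decision tree output above
predicted_dt = np.array([1, 0, 1])
actual_dt = np.array([1, 0, 0])

# Element-wise comparison gives [True, True, False]; the mean is 2/3
manual_acc = np.mean(predicted_dt == actual_dt)

# accuracy_score(y_true, y_pred) should give the same number
sk_acc = accuracy_score(actual_dt, predicted_dt)

print(manual_acc, sk_acc)
```

Two of the three test predictions are correct, which is where the 0.67 comes from.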

1. What are the most important features for the DecisionTreeClassifier? Hint: Use feature_importances_

Bonus: Create a DataFrame where each row holds a feature and its importance, then keep only the features with an importance greater than 0.

In [12]:
# Build a DataFrame of importances indexed by feature name, then keep the nonzero ones
importances = pd.DataFrame({'Importance':clf_dt.feature_importances_}, index=X.columns)
importances[importances.Importance > 0]

Unnamed: 0,Importance
New_to_CS_Nope,0.5
Statistics_Love it,0.5
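
For a tree that makes at least one split, the importances are normalized to sum to 1, which is why the two values above add up to 1.0. The sketch below shows the same pattern on a made-up dataset and sorts the result so the most important feature comes first. The column names `informative` and `noise` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label equals the 'informative' column,
# while 'noise' carries no information about the label
X_toy = pd.DataFrame({'informative': [0, 0, 1, 1], 'noise': [0, 1, 0, 1]})
y_toy = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name and sort descending
toy_importances = pd.Series(
    tree.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(toy_importances[toy_importances > 0])
```

Features the tree never splits on get an importance of exactly 0, so filtering on `> 0` lists only the features the tree actually used.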


Now, we will train a KNeighborsClassifier and compare its performance to the DecisionTreeClassifier. Fit a KNN to the training data.

In [13]:
# Fit a KNN to the training data
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

Predict on the testing data, and compare the results to the actual labels.

In [14]:
# Get predictions
predicted = clf_knn.predict(X_test)
actual = np.array(y_test)

print('Predicted: ',predicted)
print('Actual:    ',actual)

Predicted:  [1 1 1]
Actual:     [1 0 0]


In [15]:
# Get the accuracy score
acc = accuracy_score(actual, predicted)
acc

0.3333333333333333

2. Which classifier performed better -- KNN or DT?

3. What is the accuracy of a KNN on the training data when n_neighbors is set to 1? Why?

In [16]:
# Fit a KNN with n_neighbors=1 to the training data
clf_knn = KNeighborsClassifier(n_neighbors=1)
clf_knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

In [17]:
# Get the predicted values
predicted = clf_knn.predict(X_train)
actual = np.array(y_train)

print('Predicted: ',predicted)
print('Actual:    ',actual)

Predicted:  [0 1 1 1 0 1 0 1 1 1 0 1]
Actual:     [0 1 1 1 0 1 0 1 1 1 0 1]


In [18]:
# Get the accuracy score
acc = accuracy_score(actual, predicted)
acc

1.0
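
One likely explanation for the perfect score: with n_neighbors=1, the nearest neighbor of each training point is the point itself at distance 0, so the classifier returns that point's own label. The sketch below illustrates this on a hypothetical four-point dataset (`X_toy`, `y_toy`, and `knn_1` are illustrative names). It assumes no two rows share identical features with different labels; if they did, training accuracy could fall below 1.0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: four distinct points
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 1, 1, 0])

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)

# For each training point, look up its single nearest neighbor
distances, indices = knn_1.kneighbors(X_toy)

# Mean accuracy on the training data itself
train_acc = knn_1.score(X_toy, y_toy)

print(indices.ravel(), distances.ravel(), train_acc)
```

Because the model has effectively memorized the training set, this score says little about how it will perform on new data, so test accuracy is the more meaningful comparison.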