## HW 2

Dennis Wang

MA 707 - Machine Learning

Find a dataset (from sklearn, seaborn, https://datasetsearch.research.google.com/, or anywhere else), choose features and a categorical target, impute missing entries if needed, split 80/20 into training and test, then fit the following models on your training data and report their accuracies on the test data: random forest, Gaussian Naive Bayes, SVM. Try this once without rescaling your data, once when normalizing it, and once when standardizing it (remember when you rescale to only do it on the feature matrix X not the target vector y, and also to fit the rescale on your training data then use it to transform the test data).

NOTES: 

(1) You may need to convert numerical features to categorical or categorical features to numerical depending on the method---you decide what should be done in this regard and do it. 

(2) For methods that involve hyperparameters, just pick and try a few values by hand and use whatever gives the best test accuracy (you don't have to do a thorough grid search, though you can if you want).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
titanic = sns.load_dataset('titanic')

### Deal with Missing Values

In [3]:
def impute_age(cols):
    """Return the passenger's age, or the rounded mean age of their class if it is missing."""
    # Index by label; positional indexing (cols[0]) on a labeled Series is deprecated in pandas
    age = cols['age']
    pclass = cols['pclass']

    if pd.isnull(age):
        # mean() skips NaN by default, so this is the average age for the passenger's class
        return round(titanic.loc[titanic['pclass'] == pclass, 'age'].mean())
    return age

In [4]:
titanic['age'] = titanic[['age', 'pclass']].apply(impute_age, axis = 1)
titanic['age'].isna().sum()

0
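The same class-wise imputation can also be sketched without a row-wise `apply`, using `groupby(...).transform`. The toy frame below is made up for illustration (it is not the Titanic data); only the column names `age` and `pclass` are taken from the cells above, and `age_imputed` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "age": [40.0, 30.0, np.nan, 20.0, np.nan, 10.0, 22.0, np.nan],
})

# Mean age within each class, broadcast back to every row (NaNs are skipped)
class_mean_age = toy.groupby("pclass")["age"].transform("mean").round()

# Keep observed ages; fill only the missing ones with the class mean
toy["age_imputed"] = toy["age"].fillna(class_mean_age)
```

Observed ages are left untouched, and each missing age receives the rounded mean of its own class.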

In [5]:
titanic = titanic.dropna()

### Vectorization of Dummy Variables

In [6]:
titanic = pd.get_dummies(titanic, columns = ['sex', 'embarked'], drop_first = True)

In [7]:
titanic.head()

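| | survived | pclass | age | sibsp | parch | fare | class | who | adult_male | deck | embark_town | alive | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | First | woman | False | C | Cherbourg | yes | False | 0 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | First | woman | False | C | Southampton | yes | False | 0 | 0 | 1 |
| 6 | 0 | 1 | 54.0 | 0 | 0 | 51.8625 | First | man | True | E | Southampton | no | True | 1 | 0 | 1 |
| 10 | 1 | 3 | 4.0 | 1 | 1 | 16.7 | Third | child | False | G | Southampton | yes | False | 0 | 0 | 1 |
| 11 | 1 | 1 | 58.0 | 0 | 0 | 26.55 | First | woman | False | C | Southampton | yes | True | 0 | 0 | 1 |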


### Split 80/20 into training and test

In [8]:
X = titanic[['pclass','age','sibsp','parch','fare', 'sex_male', 'embarked_Q', 'embarked_S']]
y = titanic['survived']

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

### Fitting Models - No rescaling

#### Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [12]:
rfc = RandomForestClassifier(n_estimators = 200)
rfc.fit(X_train, y_train)
rfc_predictions = rfc.predict(X_test)

#### Gaussian Naive Bayes

In [13]:
from sklearn.naive_bayes import GaussianNB

In [14]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_predictions = gnb.predict(X_test)

#### SVM

In [15]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

In [16]:
param_grid = {'C': [10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]}
grid = GridSearchCV(svm.SVC(), param_grid = param_grid, return_train_score = True)
grid.fit(X_train, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [10, 1, 0.1, 0.01, 0.001, 0.0001, 1e-05]},
             return_train_score=True)

In [17]:
print(f"Best mean cross-validation score: {grid.best_score_}")
print(f"Best parameters: {grid.best_params_}")
print(f"Test-set score: {grid.score(X_test, y_test):.3f}")

Best mean cross-validation score: 0.66875
Best parameters: {'C': 10}
Test-set score: 0.732


In [18]:
svm_model = svm.SVC(C = grid.best_params_.get('C'))
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_test)

#### Model Evaluation - No Rescaling

In [19]:
from sklearn.metrics import confusion_matrix, classification_report

In [20]:
print('Random Forest')
print(confusion_matrix(y_test, rfc_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, rfc_predictions))

Random Forest
[[ 5  2]
 [ 7 27]]


              precision    recall  f1-score   support

           0       0.42      0.71      0.53         7
           1       0.93      0.79      0.86        34

    accuracy                           0.78        41
   macro avg       0.67      0.75      0.69        41
weighted avg       0.84      0.78      0.80        41



In [21]:
print('Gaussian Naive Bayes')
print(confusion_matrix(y_test, gnb_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, gnb_predictions))

Gaussian Naive Bayes
[[ 6  1]
 [ 9 25]]


              precision    recall  f1-score   support

           0       0.40      0.86      0.55         7
           1       0.96      0.74      0.83        34

    accuracy                           0.76        41
   macro avg       0.68      0.80      0.69        41
weighted avg       0.87      0.76      0.78        41



In [22]:
print('SVM')
print(confusion_matrix(y_test, svm_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, svm_predictions))

SVM
[[ 0  7]
 [ 4 30]]


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.81      0.88      0.85        34

    accuracy                           0.73        41
   macro avg       0.41      0.44      0.42        41
weighted avg       0.67      0.73      0.70        41



### Fitting Models - Normalized

In [23]:
from sklearn.preprocessing import Normalizer

In [24]:
normalizer = Normalizer()
normalizer.fit(X_train) # fit to data, not the target class
norm_features = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

X_tr_norm = pd.DataFrame(norm_features, columns = X_train.columns)
X_tr_norm.head()

| | pclass | age | sibsp | parch | fare | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.009973 | 0.558513 | 0.0 | 0.009973 | 0.829376 | 0.0 | 0.0 | 0.0 |
| 1 | 0.006346 | 0.368053 | 0.0 | 0.0 | 0.929783 | 0.0 | 0.0 | 0.0 |
| 2 | 0.089013 | 0.741773 | 0.029671 | 0.029671 | 0.663392 | 0.0 | 0.0 | 0.0 |
| 3 | 0.008504 | 0.425177 | 0.008504 | 0.0 | 0.90499 | 0.008504 | 0.0 | 0.0 |
| 4 | 0.014178 | 0.666352 | 0.014178 | 0.014178 | 0.745098 | 0.0 | 0.0 | 0.014178 |
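Worth noting when reading the table above: `Normalizer` rescales each row (sample) to unit length rather than rescaling each column, and its `fit` step does not learn anything from the training data. If "normalizing" is instead taken to mean squeezing each feature into [0, 1], `MinMaxScaler` would be the usual choice. A minimal sketch of the difference on made-up numbers (the array `X_toy` is hypothetical, and the results in this notebook were produced with `Normalizer`, not with `MinMaxScaler`):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical feature matrix: 3 samples, 2 features on very different scales
X_toy = np.array([[3.0, 40.0],
                  [1.0, 20.0],
                  [2.0, 0.0]])

# Normalizer works row by row: every sample ends up with Euclidean norm 1
row_normalized = Normalizer().fit_transform(X_toy)
row_norms = np.linalg.norm(row_normalized, axis=1)

# MinMaxScaler works column by column: every feature ends up in [0, 1]
col_scaled = MinMaxScaler().fit_transform(X_toy)
col_mins = col_scaled.min(axis=0)
col_maxs = col_scaled.max(axis=0)
```

As with `StandardScaler`, a `MinMaxScaler` would be fit on the training features and then used to transform the test features.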


#### Random Forest

In [25]:
rfc.fit(X_tr_norm, y_train)
rfc_norm_predictions = rfc.predict(X_test_norm)

#### Gaussian Naive Bayes

In [26]:
gnb.fit(X_tr_norm, y_train)
gnb_norm_predictions = gnb.predict(X_test_norm)

#### SVM

In [27]:
grid.fit(X_tr_norm, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [10, 1, 0.1, 0.01, 0.001, 0.0001, 1e-05]},
             return_train_score=True)

In [28]:
print(f"Best mean cross-validation score: {grid.best_score_}")
print(f"Best parameters: {grid.best_params_}")
print(f"Test-set score: {grid.score(X_test_norm, y_test):.3f}")  # score on the normalized test set

Best mean cross-validation score: 0.6625
Best parameters: {'C': 10}
Test-set score: 0.805


In [29]:
svm_model = svm.SVC(C = grid.best_params_.get('C'))
svm_model.fit(X_tr_norm, y_train)
svm_norm_predictions = svm_model.predict(X_test_norm)

#### Model Evaluation - Normalization

In [30]:
print('Random Forest - Normalized')
print(confusion_matrix(y_test, rfc_norm_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, rfc_norm_predictions))

Random Forest - Normalized
[[ 4  3]
 [ 6 28]]


              precision    recall  f1-score   support

           0       0.40      0.57      0.47         7
           1       0.90      0.82      0.86        34

    accuracy                           0.78        41
   macro avg       0.65      0.70      0.67        41
weighted avg       0.82      0.78      0.79        41



In [31]:
print('Gaussian Naive Bayes - Normalized')
print(confusion_matrix(y_test, gnb_norm_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, gnb_norm_predictions))

Gaussian Naive Bayes - Normalized
[[ 1  6]
 [ 3 31]]


              precision    recall  f1-score   support

           0       0.25      0.14      0.18         7
           1       0.84      0.91      0.87        34

    accuracy                           0.78        41
   macro avg       0.54      0.53      0.53        41
weighted avg       0.74      0.78      0.76        41



In [32]:
print('SVM - Normalized')
print(confusion_matrix(y_test, svm_norm_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, svm_norm_predictions))

SVM - Normalized
[[ 0  7]
 [ 1 33]]


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.82      0.97      0.89        34

    accuracy                           0.80        41
   macro avg       0.41      0.49      0.45        41
weighted avg       0.68      0.80      0.74        41



### Fitting Models - Scaled

In [33]:
from sklearn.preprocessing import StandardScaler

In [34]:
scaler = StandardScaler()
scaler.fit(X_train) # fit to data, not the target class
scaled_features = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_tr_scaled = pd.DataFrame(scaled_features, columns = X_train.columns)
X_tr_scaled.head()

| | pclass | age | sibsp | parch | fare | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.412751 | 1.303612 | -0.660979 | 0.797861 | 0.147497 | -1.119608 | -0.138233 | -1.308382 |
| 1 | -0.412751 | 1.433991 | -0.660979 | -0.589723 | 0.977495 | -1.119608 | -0.138233 | -1.308382 |
| 2 | 3.063047 | -0.717252 | 1.101632 | 0.797861 | -0.648935 | -1.119608 | -0.138233 | -1.308382 |
| 3 | -0.412751 | 0.912477 | 1.101632 | -0.589723 | 0.452272 | 0.89317 | -0.138233 | -1.308382 |
| 4 | -0.412751 | 0.71691 | 1.101632 | 0.797861 | -0.253393 | -1.119608 | -0.138233 | 0.764303 |


#### Random Forest

In [35]:
rfc.fit(X_tr_scaled, y_train)
rfc_scaled_predictions = rfc.predict(X_test_scaled)

#### Gaussian Naive Bayes

In [36]:
gnb.fit(X_tr_scaled, y_train)
gnb_scaled_predictions = gnb.predict(X_test_scaled)

#### SVM

In [37]:
grid.fit(X_tr_scaled, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [10, 1, 0.1, 0.01, 0.001, 0.0001, 1e-05]},
             return_train_score=True)

In [38]:
print(f"Best mean cross-validation score: {grid.best_score_}")
print(f"Best parameters: {grid.best_params_}")
print(f"Test-set score: {grid.score(X_test_scaled, y_test):.3f}")  # score on the standardized test set

Best mean cross-validation score: 0.725
Best parameters: {'C': 1}
Test-set score: 0.805


In [39]:
svm_model = svm.SVC(C = grid.best_params_.get('C'))
svm_model.fit(X_tr_scaled, y_train)
svm_scaled_predictions = svm_model.predict(X_test_scaled)
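One way to keep the "fit the rescaler on training data only" rule inside the cross-validation folds as well is to wrap the scaler and the classifier in a `Pipeline` and grid-search over the pipeline. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in (an assumption, not the Titanic data), the names ending in `_demo` are hypothetical, and the `C` grid is the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# The scaler is re-fit on the training part of every CV fold,
# so the held-out part of a fold never influences the scaling
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid_demo = {"svc__C": [10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]}

grid_demo = GridSearchCV(pipe, param_grid=param_grid_demo)
grid_demo.fit(X_tr, y_tr)

# score() sends the raw test features through the fitted scaler before predicting
test_accuracy = grid_demo.score(X_te, y_te)
```

With this layout the raw `X_te` is passed to `score`, since the pipeline applies the scaling itself.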

#### Model Evaluation - Scaled

In [40]:
print('Random Forest - Scaled')
print(confusion_matrix(y_test, rfc_scaled_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, rfc_scaled_predictions))

Random Forest - Scaled
[[ 5  2]
 [ 8 26]]


              precision    recall  f1-score   support

           0       0.38      0.71      0.50         7
           1       0.93      0.76      0.84        34

    accuracy                           0.76        41
   macro avg       0.66      0.74      0.67        41
weighted avg       0.84      0.76      0.78        41



In [41]:
print('Gaussian Naive Bayes - Scaled')
print(confusion_matrix(y_test, gnb_scaled_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, gnb_scaled_predictions))

Gaussian Naive Bayes - Scaled
[[ 6  1]
 [ 9 25]]


              precision    recall  f1-score   support

           0       0.40      0.86      0.55         7
           1       0.96      0.74      0.83        34

    accuracy                           0.76        41
   macro avg       0.68      0.80      0.69        41
weighted avg       0.87      0.76      0.78        41



In [42]:
print('SVM - Scaled')
print(confusion_matrix(y_test, svm_scaled_predictions))  # argument order is (y_true, y_pred)
print('\n')
print(classification_report(y_test, svm_scaled_predictions))

SVM - Scaled
[[ 6  1]
 [ 7 27]]


              precision    recall  f1-score   support

           0       0.46      0.86      0.60         7
           1       0.96      0.79      0.87        34

    accuracy                           0.80        41
   macro avg       0.71      0.83      0.74        41
weighted avg       0.88      0.80      0.82        41



### Conceptual Questions

#### From Sep 15 lecture:

>Describe in words the "backwards elimination" process for feature selection

When selecting which features to include in a model, backward elimination starts with a model that includes all the features. At each step, try removing each remaining feature in turn, refit, and eliminate the feature whose removal gives the model with the best accuracy score. Keep eliminating features and fitting new models until removing a feature no longer gives a better accuracy score.
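The procedure described above can be sketched as a small greedy loop. This is a minimal illustration, not a library routine: `backward_elimination`, `score_fn`, and `toy_score` are hypothetical names, and in practice `score_fn` would fit a model on the given feature subset and return something like a cross-validated accuracy.

```python
from typing import Callable, List, Sequence


def backward_elimination(features: Sequence[str],
                         score_fn: Callable[[List[str]], float]) -> List[str]:
    """Greedily drop features for as long as dropping one improves the score."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > 1:
        # Score every candidate model that omits exactly one remaining feature
        candidates = [
            (score_fn([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        candidate_score, feature_to_drop = max(candidates)
        if candidate_score <= best_score:
            break  # no single removal helps any more
        selected.remove(feature_to_drop)
        best_score = candidate_score
    return selected


# Hypothetical scoring function: "a" and "b" help, every extra feature costs a little
def toy_score(subset: List[str]) -> float:
    useful = {"a": 0.30, "b": 0.20}
    return 0.5 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


kept = backward_elimination(["a", "b", "noise1", "noise2"], toy_score)
```

With the toy score, the two noise features are removed one at a time and the loop stops once removing `a` or `b` would lower the score.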

>Explain why some categorical variables need to be vectorized (one-hot encoding) whereas others can be label encoded (just assigned a different number for each value)

If there is an implicit order to the categorical variable, e.g. the darkness level of a crab shell sorted as Dark, Medium Dark, Medium, etc., then it can be assigned a different number for each value and the number would correspond to that implicit order. However, if there is no implicit order to the categorical variable, such as the state that a package is shipped to, then there is no way to judge whether one state is greater than another (e.g. NJ > MA), and thus the variable will need to be vectorized.
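A short pandas sketch of the two encodings on made-up data. The frame `df_demo` and the particular ordering of the darkness levels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: shell darkness is ordered, shipping state is not
df_demo = pd.DataFrame({
    "darkness": ["Dark", "Medium", "Medium Dark", "Dark"],
    "state": ["NJ", "MA", "NJ", "NY"],
})

# Label (ordinal) encoding: the integer codes follow the stated order
darkness_order = ["Medium", "Medium Dark", "Dark"]
df_demo["darkness_code"] = pd.Categorical(
    df_demo["darkness"], categories=darkness_order, ordered=True
).codes

# One-hot encoding: one indicator column per state, no order implied
state_dummies = pd.get_dummies(df_demo["state"], prefix="state")
```

Here `darkness_code` preserves the ordering (a larger code means a darker shell), while each row of `state_dummies` has exactly one active indicator.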



#### k-NN:

>How does k-NN label new data points based on the training data points (both for classification and for regression)?

In classification, we take the majority label among the k nearest training points (the neighbors) and use that as the prediction for our test data point. In regression, we take the mean of the target values of the k neighbors and use that as our prediction.

>What effect does the hyperparameter k have?

The hyperparameter k is the number of closest training points that are considered when labeling a test data point. Smaller values of k make the model more complex and more sensitive to the local region around the test point, but increase the risk of overfitting. Therefore, if we think the local structure of the data is important, a smaller k is better. Larger values of k are more representative of the overall dataset but less sensitive to local regions of the data, which can lead to underfitting.
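The labeling rule can be sketched in a few lines of NumPy. This is a minimal illustration assuming Euclidean distance; `knn_predict` and the toy arrays are hypothetical, and ties in the vote are broken in favor of the smallest label.

```python
import numpy as np


def knn_predict(X_train, y_train, x_new, k, task="classification"):
    """Predict for one point from its k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    neighbor_targets = y_train[nearest]
    if task == "classification":
        labels, counts = np.unique(neighbor_targets, return_counts=True)
        return labels[np.argmax(counts)]         # majority label
    return neighbor_targets.mean()               # regression: average target


# Hypothetical toy data: two clusters on a line
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

label = knn_predict(X_toy, y_class, np.array([1.5]), k=3)
value = knn_predict(X_toy, y_value, np.array([1.5]), k=3, task="regression")
```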


#### Naive Bayes:

>How does Bayes formula help with classification if we've already estimated the probability distribution for each class?

Say we have two classes, `Men` and `Women`, and an observation `Disease`, and we have estimated the class distributions P(Disease | Men) and P(Disease | Women). We cannot simply say that because the probability of `Disease` is higher for `Men` than it is for `Women`, the label should be `Men`. Doing so assumes that the dataset is balanced between the `Men` and `Women` labels, when it could be heavily skewed towards one or the other. A probability of `Disease` that is higher for `Men` than for `Women` doesn't mean much if `Men` make up 1% of the dataset and `Women` 99%.

A much more useful quantity is the flipped probability: P(Men | Disease) and P(Women | Disease). Bayes' formula says P(class | Disease) is proportional to P(Disease | class) times P(class), so it incorporates the fact that the dataset may be imbalanced and properly turns the class distributions into a classifier: we predict the class with the larger flipped probability.
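A small worked example of this flip, using made-up numbers: the 1% / 99% class shares come from the text above, while the two values of P(Disease | class) are assumptions chosen for illustration.

```python
# Hypothetical numbers for the Men/Women example
prior = {"Men": 0.01, "Women": 0.99}          # P(class): share of each label
likelihood = {"Men": 0.30, "Women": 0.10}     # P(Disease | class), assumed values

# Bayes' formula: P(class | Disease) = P(Disease | class) * P(class) / P(Disease)
evidence = sum(likelihood[c] * prior[c] for c in prior)   # P(Disease)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

predicted_label = max(posterior, key=posterior.get)
```

Even though `Disease` is three times as likely for `Men` in this example, the predicted label is `Women`, because `Women` make up almost all of the data.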

>Explain how the class distributions are estimated in Gaussian Naive Bayes

In Gaussian Naive Bayes, all of our variables/features are numerical and we assume the distribution of each class to be Gaussian with no covariance between features. We then compute the mean and variance of each feature within each class (the means give where each Gaussian is centered), which gives us the entire class distribution.
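The estimation step can be sketched with NumPy as follows. This is a minimal illustration on made-up data, not scikit-learn's implementation; the function names and arrays are hypothetical, and no variance smoothing is applied.

```python
import numpy as np

# Hypothetical training data: 2 features, 2 classes
X_toy = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0],
                  [7.0, 30.0], [9.0, 34.0]])
y_toy = np.array([0, 0, 0, 1, 1])


def fit_gaussian_nb(X, y):
    """Per class: prior, per-feature mean, per-feature variance (no covariances)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0))
    return params


def log_posterior_unnormalized(x, prior, mean, var):
    # Sum of independent 1-D Gaussian log-densities plus the log prior
    log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    return np.log(prior) + log_pdf.sum()


params = fit_gaussian_nb(X_toy, y_toy)
x_new = np.array([2.5, 13.0])
scores = {c: log_posterior_unnormalized(x_new, *p) for c, p in params.items()}
predicted = max(scores, key=scores.get)
```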

>Very roughly, what are Gaussian Mixture Model (GMM) and Kernel Density Estimator (KDE)?

A Gaussian Mixture Model (GMM) drops the assumption that each class is a single Gaussian with no covariance. Instead, it models the distribution as a weighted sum (mixture) of several Gaussians, each with its own mean and covariance (so the predictors can be related), which lets it fit shapes that a single Gaussian cannot.

Kernel Density Estimator (KDE) works by building the distribution directly from the data. Instead of assuming that each class is Gaussian, we replace each data point with a small Gaussian centered on it. Where these Gaussians overlap, their values add up, and the (normalized) sum of all of them is the estimated distribution.
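A rough one-dimensional sketch of the KDE idea in NumPy. The function name, the sample values, and the bandwidth of 0.5 are assumptions for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussians of width `bandwidth`, one centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)   # averaging keeps the total area equal to 1


# Hypothetical 1-D samples with two clumps
samples = np.array([1.0, 1.2, 0.8, 5.0, 5.3])
grid = np.linspace(-5.0, 12.0, 1701)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# Riemann-sum check that the estimate integrates to (approximately) 1
area = density.sum() * (grid[1] - grid[0])
```

The estimated density is high near the two clumps of samples and low in the gap between them; a smaller bandwidth would make it bumpier, a larger one smoother.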

>Describe Naive Bayes classification when all predictors are discrete (this is the frequency-based version), including what the "Naive" part of it means

In frequency-based Naive Bayes classification, the predictors are discrete. You can view the class-conditional probability as the fraction of that class's training points that have those particular values for the predictors. For example, if you're using the number of appearances of the keywords `Cat`, `Dog`, and `Mouse` in an email to differentiate spam vs. not spam, then to estimate P( (1,1,1) | not spam) you would go through all the not-spam emails in your dataset, count those that have exactly one mention of each keyword, and divide by the total number of emails marked as not spam.

The problem is that there's a good chance no training email exactly matches your vector, e.g. a vector that has exactly 2 mentions of `Cat`, 5 mentions of `Dog`, and 7 mentions of `Mouse` may not appear at all, in which case the estimated probability would be zero.

As such, we pretend that all of our predictors are conditionally independent given the class, so the joint probability becomes a product of single-predictor probabilities, each of which can be estimated from plenty of training points. This is where the "Naive" part comes from: it's not true, but it's close enough to being true that it's still useful. In our example, we're pretending that our words are not related. We're claiming that the number of times the word `Cat` appears in an email has nothing to do with the appearances of `Dog` or `Mouse` in that same email, which is very likely not true in reality, but we do so regardless.

This works because in real life, where the variables (keyword appearances in an email) are dependent, seeing one keyword will push us towards a label, and another keyword may also push us towards that same label. Because these keywords are actually dependent, the true combined "push effect" is not as strong as it would be if they were conditionally independent. Naive Bayes therefore tends to be overconfident in its probabilities, but it usually still pushes towards the correct label.
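A minimal frequency-count sketch of this idea. The six training emails are made up, the predictors are simplified to presence (1) or absence (0) of each keyword rather than counts, and no smoothing is applied; all names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training emails: (has_cat, has_dog, has_mouse) -> label
emails = [
    ((1, 1, 0), "not spam"), ((1, 0, 0), "not spam"), ((0, 1, 1), "not spam"),
    ((1, 1, 0), "not spam"), ((1, 0, 1), "spam"), ((0, 1, 1), "spam"),
]

class_counts = Counter(label for _, label in emails)
# value_counts[label][j][v] = number of `label` emails whose predictor j equals v
value_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in emails:
    for j, v in enumerate(features):
        value_counts[label][j][v] += 1


def naive_score(features, label):
    """P(label) * product over j of P(x_j | label), each estimated by a frequency."""
    score = class_counts[label] / len(emails)
    for j, v in enumerate(features):
        score *= value_counts[label][j][v] / class_counts[label]
    return score


# The vector (1, 1, 1) never occurs in the training emails,
# yet the per-predictor frequencies still give a usable score for each class
query = (1, 1, 1)
scores = {label: naive_score(query, label) for label in class_counts}
prediction = max(scores, key=scores.get)
```

Counting exact matches of `(1, 1, 1)` would give zero for both classes here, while the product of per-keyword frequencies still separates them.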

#### Decision trees and random forest:

>Explain the basic idea of a decision tree and what kind of decision boundary it produces

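A decision tree works by repeatedly splitting your data along the axis of one predictor at a time. Each split is a decision point in the tree (a threshold on a single predictor), so the resulting decision boundary is made up of axis-aligned pieces that carve the feature space into rectangular regions.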
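A single split of this kind can be sketched as a search over features and thresholds. This is a simplified illustration that scores splits by misclassification count (real implementations typically use Gini impurity or entropy); `best_axis_split` and the toy data are hypothetical.

```python
import numpy as np


def best_axis_split(X, y):
    """Find the single (feature, threshold) split with the fewest misclassifications."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints between sorted values
        for t in thresholds:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Each side predicts its majority class; count the points it gets wrong
            errors = sum(len(side) - np.bincount(side).max() for side in (left, right))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best


# Hypothetical data: the class depends only on feature 1
X_toy = np.array([[5.0, 1.0], [1.0, 2.0], [4.0, 3.0],
                  [2.0, 7.0], [6.0, 8.0], [3.0, 9.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
errors, feature, threshold = best_axis_split(X_toy, y_toy)
```

A full tree would apply the same search recursively to the points on each side of the chosen threshold.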

>How are random forests related to decision trees?

A random forest is an ensemble method: it trains many different decision trees, each on a random sample (with replacement) of our training data, and each considering only a random subset of the features at every split. To combine the results of the trees, we average their predictions if we're doing regression, or take the majority label if we're doing classification.
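A rough sketch of the bagging-and-voting idea, built from individual scikit-learn decision trees. It illustrates the concept only and is not how `RandomForestClassifier` is implemented internally; the synthetic data from `make_classification` and the choice of 25 trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Classification: majority vote across trees (for regression, average instead)
votes = np.stack([t.predict(X_demo) for t in trees])    # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)    # valid for 0/1 labels and an odd tree count
train_accuracy = (forest_pred == y_demo).mean()
```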

#### SVM:

>Explain the basic idea and how the cost parameter C is used and what it does (no need to explain anything about kernels or radial basis functions)

The goal of SVM is to choose the best linear decision boundary to separate our data points. It does this by choosing the line that maximizes the distance (the margin) to the closest data points, known as support vectors. The cost parameter C determines how heavily we penalize data points that enter the margin or cross the line. If C is large, we can't afford to let many points enter the margin or cross the line, so the margin becomes very thin, which may lead to a model that doesn't generalize to new data very well. If C is small, we can afford to let many points cross the margin, which results in a wider margin. The problem with this is that the model may then not properly capture the trends/shape of the data.
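The trade-off can be made concrete with the usual soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below compares two hand-picked candidate boundaries on made-up one-dimensional data rather than running an actual SVM solver; the function name, the data, and both candidates are assumptions for illustration.

```python
import numpy as np


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # > 0 for points inside the margin or misclassified
    return 0.5 * w @ w + C * hinge.sum()


# Hypothetical 1-D data with one negative point (x = 0.5) sitting close to the positives
X_toy = np.array([[-3.0], [-2.0], [0.5], [1.0], [2.0], [3.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

# Two candidate boundaries: a wide margin that lets x = 0.5 violate it,
# and a narrow margin squeezed between x = 0.5 and x = 1.0
wide = (np.array([1.0]), 0.0)      # margin width 2 / ||w|| = 2.0
narrow = (np.array([4.0]), -3.0)   # margin width 2 / ||w|| = 0.5

preferred = {}
for C in (0.1, 100.0):
    obj_wide = soft_margin_objective(*wide, X_toy, y_toy, C)
    obj_narrow = soft_margin_objective(*narrow, X_toy, y_toy, C)
    preferred[C] = "wide" if obj_wide < obj_narrow else "narrow"
```

With the small C the wide margin has the lower objective despite the violation, while with the large C the violation becomes too expensive and the narrow margin wins.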