# MA 707 - HW 1

### Dennis Wang

## Part A

Load one of the built-in data sets from seaborn or sklearn that includes at least one categorical variable, choose a collection of feature variables (including vectorization of the categorical variable, and vectorize any other categorical feature variables you use as well) and choose a target variable (categorical or numerical, either is fine), then format your data as a supervised learning problem for these features/target and print these data frames/series. Choose a supervised learning method (you don't have to know how it works, we'll get to them next week) then do 5-fold cross validation and report the average accuracy (if your target is categorical) or some measure of error (if your target is numerical) of your method across the five trials, and also the standard deviation of the accuracies/errors.


In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

In [2]:
titanic = sns.load_dataset('titanic')

In [3]:
titanic.head()

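| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |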


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


### Dealing with missing values

In [5]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

### Impute Age (based on pclass) and drop NA's elsewhere

In [6]:
def impute_age(cols):
    """Return the row's age, or the rounded mean age of its passenger class if age is missing."""
    # Label-based access; positional cols[0] on a Series is deprecated in recent pandas
    age = cols['age']
    pclass = cols['pclass']

    if pd.isnull(age):
        # mean() skips NaN by default, so this is the average observed age for that class
        return round(titanic.loc[titanic['pclass'] == pclass, 'age'].mean())
    return age

In [7]:
titanic['age'] = titanic[['age', 'pclass']].apply(impute_age, axis = 1)
titanic['age'].isna().sum()

0
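
The same per-class imputation can also be written without a row-wise `apply`. The following is only a sketch on a made-up six-row frame (the `toy` frame and the `age_filled` column are illustrative names, not part of the homework data); it fills each missing age with the rounded mean age of that row's class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two passenger classes, one missing age in each
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, np.nan, 26.0],
})

# Mean age of each row's own class (NaN is skipped by default), rounded
class_mean_age = toy.groupby("pclass")["age"].transform("mean").round()

# Fill only the missing ages; observed ages are left untouched
toy["age_filled"] = toy["age"].fillna(class_mean_age)
print(toy)
```

On the real data the same two lines would operate on `titanic` directly, which avoids recomputing the class mean for every missing row.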

In [8]:
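# dropna() removes every row that has a missing value in any column. Because deck is
# missing for 688 rows, only 201 of the 891 rows remain after this step.
titanic = titanic.dropna()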
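
Since `dropna()` with no arguments looks at every column, a column that never enters the model (such as `deck` here) still decides which rows are kept. One possible alternative, sketched on a made-up frame, is to restrict the check with the `subset` parameter. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing but is not used as a feature
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "embarked": ["S", "C", None, "S"],
    "deck": [np.nan, "C", np.nan, np.nan],
})

# Drops a row if ANY column is missing
dropped_any = toy.dropna()

# Drops a row only if one of the listed columns is missing
dropped_subset = toy.dropna(subset=["age", "embarked"])

print(len(dropped_any), len(dropped_subset))
```

Whether to keep the extra rows is a modeling choice; using `subset` would change the sample size and therefore every score reported in this notebook.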

### Vectorization of categorical variables

In [9]:
titanic = pd.get_dummies(titanic, columns = ['sex', 'embarked'], drop_first = True)

In [10]:
titanic.head()

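| | survived | pclass | age | sibsp | parch | fare | class | who | adult_male | deck | embark_town | alive | alone | sex_male | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | First | woman | False | C | Cherbourg | yes | False | 0 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | First | woman | False | C | Southampton | yes | False | 0 | 0 | 1 |
| 6 | 0 | 1 | 54.0 | 0 | 0 | 51.8625 | First | man | True | E | Southampton | no | True | 1 | 0 | 1 |
| 10 | 1 | 3 | 4.0 | 1 | 1 | 16.7 | Third | child | False | G | Southampton | yes | False | 0 | 0 | 1 |
| 11 | 1 | 1 | 58.0 | 0 | 0 | 26.55 | First | woman | False | C | Southampton | yes | True | 0 | 0 | 1 |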
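
To see what `drop_first = True` does, here is a small sketch on a made-up frame with the same two categorical columns (the `toy` frame and its values are assumptions; `dtype=int` is passed only so the output is 0/1 regardless of pandas version).

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.93, 8.05],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# drop_first=True removes the alphabetically first level of each column
# ('female' and 'C'), which then acts as the baseline category
encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A row with `embarked_Q = 0` and `embarked_S = 0` therefore stands for the dropped level `C`, which is why the Cherbourg passenger in the table above has zeros in both columns.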


### Logistic Regression with 5-fold cross validation

In [11]:
X = titanic[['pclass','age','sibsp','parch','fare', 'sex_male', 'embarked_Q', 'embarked_S']]
y = titanic['survived']

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
logmodel = LogisticRegression(max_iter = 275)

In [14]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logmodel, X, y, cv=5)
print(scores)

[0.80487805 0.8        0.75       0.675      0.775     ]


In [15]:
import numpy as np

In [16]:
print((np.mean(scores), np.std(scores)))

(0.7609756097560976, 0.047242523620913565)
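
As a check on the arithmetic, the sketch below recomputes the two summary numbers from the five fold accuracies printed above. It also shows that `np.std` defaults to the population formula (`ddof=0`); the sample standard deviation (`ddof=1`) is slightly larger. Which of the two to report is a convention, not something the assignment specifies.

```python
import numpy as np

# The five fold accuracies printed above
scores = np.array([0.80487805, 0.8, 0.75, 0.675, 0.775])

mean_acc = scores.mean()
pop_std = scores.std()            # ddof=0: divides by n (NumPy's default)
sample_std = scores.std(ddof=1)   # ddof=1: divides by n - 1

print(f"mean={mean_acc:.4f}, std(ddof=0)={pop_std:.4f}, std(ddof=1)={sample_std:.4f}")
```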


## Part B
Now split your data into three parts: 80% training, 10% validation, 10% testing. Pick 3 different supervised learning methods (you may have to Google or browse sklearn documentation to find the names of ones you can use) and train all three on your training data (default hyperparameter values are fine, and don't worry if you don't understand the method yet) then compute their accuracy/error scores on the validation data. Pick the method that got the best score. Then combine your training data with your validation and retrain this one method on that 90% data then compute its score on the remaining 10% test data.

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
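# Note: the second split takes 10% of the remaining 90%, so the proportions are
# 81% training, 9% validation, 10% testing rather than exactly 80/10/10
X_trainval, X_test, y_trainval, y_test = train_test_split(X,y, test_size = 0.1, random_state = 42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size = 0.1, random_state = 42)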
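
Applying `test_size = 0.1` twice holds out 10% of the data for testing and then 10% of the remaining 90% (that is, 9%) for validation. One way to get an exact 80/10/10 split is to pass absolute row counts instead of fractions. This is a sketch on a stand-in array of 1000 row indices; `indices` and `n_holdout` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the row indices of a 1000-row data set
indices = np.arange(1000)

# An integer test_size is interpreted as an absolute number of rows
n_holdout = len(indices) // 10

trainval, test = train_test_split(indices, test_size=n_holdout, random_state=42)
train, val = train_test_split(trainval, test_size=n_holdout, random_state=42)

print(len(train), len(val), len(test))
```

Using counts rather than a fraction such as `1/9` also sidesteps any floating-point rounding in how the split size is computed.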

### Logistic Regression, KNN, and SVM

#### Logistic Regression

In [19]:
log_model = LogisticRegression(max_iter = 300)
log_model.fit(X_train, y_train)
log_predictions = log_model.predict(X_val)

#### KNN

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [21]:
param_grid = {'n_neighbors': np.arange(1,40)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid = param_grid, return_train_score = True)
grid.fit(X_train,y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39])},
             return_train_score=True)

In [22]:
print(f"Best mean cross-validation score: {grid.best_score_}")
print(f"Best parameters: {grid.best_params_}")
print(f"Test-set score: {grid.score(X_test, y_test):.3f}")

Best mean cross-validation score: 0.6606060606060606
Best parameters: {'n_neighbors': 34}
Test-set score: 0.810


In [23]:
knn_model = KNeighborsClassifier(n_neighbors = grid.best_params_.get('n_neighbors'))
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_val)
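
Refitting KNN by hand with the best `n_neighbors` works, but with the default `refit=True` the grid search has already done this and exposes the result as `best_estimator_`. The sketch below uses synthetic data from `make_classification` (not the Titanic features) and a shortened, made-up parameter grid to show that the two routes give the same predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

grid_demo = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": [1, 5, 15]})
grid_demo.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ has already been refit
# on all of the data passed to fit(), using the best parameters
best_knn = grid_demo.best_estimator_

# Refitting by hand with the same parameter gives the same predictions
manual_knn = KNeighborsClassifier(n_neighbors=grid_demo.best_params_["n_neighbors"])
manual_knn.fit(X_demo, y_demo)
same = (best_knn.predict(X_demo) == manual_knn.predict(X_demo)).all()
print(same)
```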

#### SVM

In [24]:
from sklearn import svm

In [25]:
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
svm_predictions = svm_model.predict(X_val)

### Model Evaluation

In [26]:
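from sklearn.metrics import classification_report, confusion_matrix
# Both functions are documented as f(y_true, y_pred). The cells below pass the
# predictions first, so each matrix has predictions as rows and true labels as
# columns, the precision and recall columns are swapped relative to the usual
# reading, and "support" counts predicted labels rather than true labels.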

In [27]:
print('Logistic Regression')
print(confusion_matrix(log_predictions, y_val))
print('\n')
print(classification_report(log_predictions, y_val))

Logistic Regression
[[ 3  0]
 [ 1 14]]


              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       1.00      0.93      0.97        15

    accuracy                           0.94        18
   macro avg       0.88      0.97      0.91        18
weighted avg       0.96      0.94      0.95        18



In [28]:
print('KNN')
print(confusion_matrix(knn_predictions, y_val))
print('\n')
print(classification_report(knn_predictions, y_val))

KNN
[[ 1  0]
 [ 3 14]]


              precision    recall  f1-score   support

           0       0.25      1.00      0.40         1
           1       1.00      0.82      0.90        17

    accuracy                           0.83        18
   macro avg       0.62      0.91      0.65        18
weighted avg       0.96      0.83      0.88        18



In [29]:
print('SVM')
print(confusion_matrix(svm_predictions, y_val))
print('\n')
print(classification_report(svm_predictions, y_val))

SVM
[[ 0  0]
 [ 4 14]]


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.78      0.88        18

    accuracy                           0.78        18
   macro avg       0.50      0.39      0.44        18
weighted avg       1.00      0.78      0.88        18



  _warn_prf(average, modifier, msg_start, len(result))


Question: I'm guessing this warning occurs because the SVM never predicted class 0, so a metric for that class has a zero denominator and gets reported as 0.00? Not sure.
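
The truncated `_warn_prf` line appears to be the tail of scikit-learn's undefined-metric warning, which is raised when a precision or recall has a zero denominator. The sketch below rebuilds the situation from the SVM output above (4 true zeros, 14 true ones, every prediction equal to 1) and is only an illustration: it shows the documented `(y_true, y_pred)` argument order, what swapping the arguments does to the matrix, and the `zero_division` parameter that sets the reported value explicitly.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Labels reconstructed from the SVM output above: 4 true zeros, 14 true ones,
# and a model that predicts 1 for every row
y_true = [0] * 4 + [1] * 14
y_pred = [1] * 18

# Documented order is (y_true, y_pred): rows are true labels, columns are predictions
cm = confusion_matrix(y_true, y_pred)

# Swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_pred, y_true)

# Class 0 is never predicted, so its precision is 0/0; zero_division=0
# reports 0.0 for that case without emitting the warning
precision_class0 = precision_score(y_true, y_pred, pos_label=0, zero_division=0)

print(cm)
print(cm_swapped)
print(precision_class0)
```

The swapped matrix is the one printed in the SVM cell, which is consistent with the predictions having been passed as the first argument there.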

### Combine training data with validation and retrain our logistic regression model (the best on validation) on that 90% data then compute its score on the remaining 10% test data.

In [30]:
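log_test_model = LogisticRegression(max_iter = 200)
log_test_model.fit(X_trainval, y_trainval)
# Caution: this predicts with log_model (fit on X_train only), so the report below
# does not score the retrained log_test_model; log_test_model.predict(X_test) would
log_test_predictions = log_model.predict(X_test)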

In [31]:
print(confusion_matrix(log_test_predictions, y_test))
print('\n')
print(classification_report(log_test_predictions, y_test))

[[ 1  0]
 [ 2 18]]


              precision    recall  f1-score   support

           0       0.33      1.00      0.50         1
           1       1.00      0.90      0.95        20

    accuracy                           0.90        21
   macro avg       0.67      0.95      0.72        21
weighted avg       0.97      0.90      0.93        21
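
For reference, the whole select-then-retrain workflow can be sketched in a few lines. This version runs on synthetic data from `make_classification` (not the Titanic features), and the names `candidates`, `val_scores` and `final_model` are illustrative. The point it shows is that the final test score should come from the model refit on training plus validation data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=50, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=50, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "svm": SVC(),
}

# Step 1: fit each candidate on the training set, score it on the validation set
val_scores = {name: model.fit(X_tr, y_tr).score(X_va, y_va)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

# Step 2: refit the winner on training + validation, then score it once on test.
# The test score comes from this refit model, not from the step-1 fit.
final_model = candidates[best_name].fit(X_tv, y_tv)
test_accuracy = final_model.score(X_te, y_te)

print(best_name, val_scores[best_name], test_accuracy)
```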



### Conceptual Questions

#### (a) When choosing hyperparameters (or when choosing from among different supervised learning methods) why do we split our data into three pieces (training, validation, testing) instead of the usual two (training, testing)?

If we want to tune hyperparameters or choose among models, we need a validation set on which to compare the candidates, each of which was trained on the training set. However, because those choices were made to do well on the validation set, the validation score is optimistically biased, so we can't use that same set to measure the final model's performance without risking an overfit estimate. Therefore a separate hold-out set, the test set, which played no part in training or model selection, is used to provide an unbiased estimate of our model's performance.

#### (b) Why in the graph of accuracy as a function of model complexity does the training accuracy keep increasing but the testing accuracy goes up then back down?

Because we fit the model to the training data, the training accuracy tends to keep increasing as the model becomes more complex: a more flexible model can follow the training data ever more closely and eventually fits its noise, i.e. it overfits. Testing accuracy rises at first, while the added complexity captures real structure that a simpler model misses, but it then starts to decrease because the complex model represents the shape of the training data too closely and has trouble generalizing to new data.
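
A small experiment makes this concrete. The sketch below is an illustration on synthetic, deliberately noisy data, with decision-tree depth chosen here as the complexity knob (neither the data nor the model comes from the homework). It prints training and testing accuracy for increasingly deep trees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy stand-in data
# (flip_y assigns a random label to 20% of the samples)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=4,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Tree depth is the complexity knob; None lets the tree grow until it fits every point
results = {}
for depth in [1, 3, 6, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

The unrestricted tree memorizes the training set, so its training accuracy is perfect, while its testing accuracy is limited by the label noise it has also memorized.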

#### (c) Why in the graph of accuracy as a function of training set size do the training and testing accuracies get closer together as the training set size increases? Will they always "converge" (meet each other) as the training size goes to infinity?

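As you get more data points, it becomes more difficult for a given model to fit all of them at once, and hence the training accuracy decreases as the training set size increases. With more data points, however, the model is pinned down by the real trends rather than by the quirks of one small sample, so it generalizes better outside the training data, hence stronger testing/validation accuracy. The two curves therefore move toward each other. There is no guarantee that the accuracies will ever converge even as training size goes to infinity: for example, a 1-nearest-neighbor classifier scores 100% on its own training set at any size (barring duplicate points with different labels), while its testing accuracy on noisy data stays below that.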
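
scikit-learn's `learning_curve` helper computes exactly these two curves. The sketch below is an illustration on synthetic data (the data set, the model and the three training sizes are assumptions chosen for the example); it prints the gap between mean training and mean testing accuracy at each size.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data with some label noise
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, flip_y=0.1,
                                     random_state=0)

# Integer train_sizes are absolute numbers of training rows
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=[50, 200, 700], cv=5,
)

# Average over the 5 folds, then look at the train/test gap at each size
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print(n, round(float(g), 3))
```

The gap usually shrinks as the training size grows, but how far it shrinks depends on the model and on the noise in the data.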

#### (d) Explain what "stratified" means in cross-validation.

"Stratified" means that the folds of our KFold cross validation are built so that each fold has approximately the same distribution of the target variable as the full data set. Instead of each testing fold being a single contiguous segment of the data, its rows are taken from across the data set in proportion to the frequency of each target class. This matters because our data is often sorted according to the target variable, so taking a single segment would likely produce testing data that does not accurately represent the distribution of our target variable.

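The difference between plain and stratified folds is easy to see on a toy target that is sorted by class. The sketch below uses a made-up label vector (80 zeros followed by 20 ones) and compares the fraction of positives in each testing fold under `KFold` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target sorted by class: 80 zeros followed by 20 ones (20% positives)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # features are irrelevant for the split itself

# Fraction of positives in each testing fold
plain = [y_demo[test].mean() for _, test in KFold(n_splits=5).split(X_demo)]
strat = [y_demo[test].mean()
         for _, test in StratifiedKFold(n_splits=5).split(X_demo, y_demo)]

print(plain)
print(strat)
```

With plain `KFold` the first four testing folds contain no positives and the last contains only positives, while every stratified fold keeps the overall 20% rate. When `cross_val_score` is given an integer `cv` and a classifier, as in Part A, scikit-learn uses stratified folds by default.
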
#### (e) In the above Python problem 2(b), was the accuracy/error score you got at the end on the test data better or worse than the first time you trained and computed it on the validation data? Why do you think that was the case?

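It was slightly worse: about 0.90 accuracy on the test data versus 0.94 on the validation data. I think this is because logistic regression was picked precisely because it scored best on the validation data, so that score is somewhat optimistic, and the model was not as effective in generalizing to test data that played no part in the selection. With only 18 validation rows and 21 test rows, the gap is also just one extra misclassified passenger, so part of it may be noise.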