# Neural Networks #


In [3]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ParameterGrid

import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [4]:
!wget -nc http://quadro.ist.berkeley.edu:1331/lab_4_training.csv
!wget -nc http://quadro.ist.berkeley.edu:1331/lab_4_test.csv

df_train = pd.read_csv('./lab_4_training.csv')
df_test = pd.read_csv('./lab_4_test.csv')
df_train.head()

--2024-10-02 03:42:25--  http://quadro.ist.berkeley.edu:1331/lab_4_training.csv
Resolving quadro.ist.berkeley.edu (quadro.ist.berkeley.edu)... 169.229.194.98
Connecting to quadro.ist.berkeley.edu (quadro.ist.berkeley.edu)|169.229.194.98|:1331... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79177 (77K) [text/csv]
Saving to: ‘lab_4_training.csv’


2024-10-02 03:42:25 (773 KB/s) - ‘lab_4_training.csv’ saved [79177/79177]

--2024-10-02 03:42:25--  http://quadro.ist.berkeley.edu:1331/lab_4_test.csv
Resolving quadro.ist.berkeley.edu (quadro.ist.berkeley.edu)... 169.229.194.98
Connecting to quadro.ist.berkeley.edu (quadro.ist.berkeley.edu)|169.229.194.98|:1331... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26519 (26K) [text/csv]
Saving to: ‘lab_4_test.csv’


2024-10-02 03:42:25 (529 KB/s) - ‘lab_4_test.csv’ saved [26519/26519]



Unnamed: 0.1,Unnamed: 0,gender,age,year,eyecolor,height,miles,brothers,sisters,computertime,exercise,exercisehours,musiccds,playgames,watchtv
0,577,male,20,third,hazel,72.0,180.0,0,0,5.0,No,0.0,100.0,10.0,10.0
1,677,male,19,second,hazel,72.0,120.0,1,1,16.0,Yes,9.0,70.0,3.0,5.0
2,1738,male,20,second,brown,63.0,55.0,1,2,15.0,Yes,4.5,15.0,4.0,13.0
3,1355,male,20,third,green,78.0,200.0,0,0,10.0,Yes,9.0,20.0,10.0,10.0
4,891,female,19,second,green,67.0,280.0,2,0,4.0,Yes,2.0,164.0,0.0,2.0


In [5]:
df_test.head()

Unnamed: 0.1,Unnamed: 0,gender,age,year,eyecolor,height,miles,brothers,sisters,computertime,exercise,exercisehours,musiccds,playgames,watchtv
0,1303,male,20,second,green,73.0,210.0,0,1,10.0,Yes,5.0,50.0,1.0,15.0
1,36,male,20,third,other,71.0,90.0,1,0,15.0,Yes,4.0,10.0,0.0,1.0
2,489,male,22,fourth,hazel,75.0,200.0,0,1,1.0,Yes,2.0,150.0,1.0,10.0
3,1415,male,19,second,brown,72.0,35.0,2,2,20.0,Yes,5.0,100.0,0.0,7.0
4,616,male,22,fourth,hazel,71.0,15.0,2,1,10.0,Yes,7.0,10.0,0.0,5.0


***
Calculate a baseline accuracy measure using the majority class, assuming a target variable of `gender`. The majority class is the most common value of the target variable in a particular dataset.

In [20]:
majority_class_train = df_train['gender'].mode()[0]
print(majority_class_train)
accuracy_train = (df_train['gender'] == majority_class_train).mean()
print(accuracy_train)

female
0.5427852348993288
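
The baseline above is measured on the training set, while most later cells report test-set accuracy. A like-for-like baseline learns the majority class from the training labels and scores that constant prediction on the test labels. The sketch below shows the idea on small made-up frames; `toy_train` and `toy_test` are hypothetical stand-ins, not the lab data.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical values, not the lab data).
toy_train = pd.DataFrame({"gender": ["female", "female", "male", "female", "male"]})
toy_test = pd.DataFrame({"gender": ["male", "female", "male", "male"]})

# Learn the majority class from the training labels only.
majority_class = toy_train["gender"].mode()[0]

# Score that constant prediction on both splits.
baseline_train_acc = (toy_train["gender"] == majority_class).mean()
baseline_test_acc = (toy_test["gender"] == majority_class).mean()

print(majority_class, baseline_train_acc, baseline_test_acc)
```

With the real data, the same two lines applied to `df_train` and `df_test` would give the number to beat for each test accuracy reported below.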


   
Choose a NN implementation (e.g., scikit-learn) and specify which one you chose. Be sure the implementation allows you to modify the number of hidden layers and hidden nodes per layer.

NOTE: When possible, specify the logsig (`sigmoid`/`logistic`) function as the transfer function (another word for activation function) and use Levenberg-Marquardt backpropagation. scikit-learn's `MLPClassifier` supports `activation='logistic'` but does not offer Levenberg-Marquardt; its solvers are `lbfgs`, `sgd`, and `adam`, so this notebook uses `lbfgs`, a quasi-Newton method, in its place.

The implementation chosen here is scikit-learn's `MLPClassifier`. The network below has a single hidden layer of 10 nodes and uses only the `height` feature, standardized with `StandardScaler`, to predict `gender`. The `gender` labels are integer-encoded with `LabelEncoder`, and accuracy is first measured on the training set.

In [22]:
X_train = df_train[['height']].values
Y_train = df_train['gender'].values
le = LabelEncoder()
y_train = le.fit_transform(Y_train)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                    solver='lbfgs',max_iter=100,random_state=42)
mlp.fit(X_train_scaled, y_train)
y_train_prediction= mlp.predict(X_train_scaled)
train_acc = accuracy_score(y_train, y_train_prediction)
print(train_acc)

0.8439597315436241
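
To make "a single 10 node hidden layer with a logistic activation" concrete, the sketch below writes out the forward pass of such a network in NumPy. The weights are random and untrained (hypothetical values, not the coefficients fitted above), and the single logistic output unit is an assumption that matches how a two-class problem is usually set up.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Random, untrained weights purely for illustration (hypothetical values).
W1 = rng.normal(size=(1, 10))   # 1 input feature (height) -> 10 hidden nodes
b1 = rng.normal(size=10)
W2 = rng.normal(size=(10, 1))   # 10 hidden nodes -> 1 output unit
b2 = rng.normal(size=1)

# Three standardized heights, shape (n_samples, 1).
X = np.array([[-1.0], [0.0], [1.5]])

hidden = logistic(X @ W1 + b1)       # hidden activations, shape (3, 10)
prob = logistic(hidden @ W2 + b2)    # probability of class 1, shape (3, 1)
pred = (prob >= 0.5).astype(int).ravel()

print(hidden.shape, prob.shape, pred)
```

Training with `lbfgs` amounts to searching for the values of `W1`, `b1`, `W2`, and `b2` that minimize the classification loss on the training set.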


Use the trained model from above to predict the test set. The label encoder and scaler fitted on the training set are reused rather than refitted.

In [23]:
X_test = df_test[['height']].values
Y_test = df_test['gender'].values
y_test = le.transform(Y_test)
X_test_scaled = scaler.transform(X_test)
y_test_prediction = mlp.predict(X_test_scaled)
test_acc = accuracy_score(y_test, y_test_prediction)
print(test_acc)

0.8542713567839196


### ANSWER: 0.8542713567839196

Take the log of the `height` feature in both the training and test sets, refit the same network on the log-transformed values, and report training and test accuracy.

In [24]:
X_train_log = np.log(df_train[['height']].values)
X_test_log = np.log(df_test[['height']].values)
mlp.fit(X_train_log, y_train)
y_log_train_prediction = mlp.predict(X_train_log)
y_log_test_prediction = mlp.predict(X_test_log)
log_train_acc = accuracy_score(y_train, y_log_train_prediction)
log_test_acc = accuracy_score(y_test, y_log_test_prediction)
print(log_train_acc)
print(log_test_acc)

0.8439597315436241
0.8542713567839196
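
The accuracies match the standardized-height run exactly. One plausible explanation is that the log is strictly increasing, so any decision threshold on `height` has an equivalent threshold on `log(height)` that splits the samples the same way. The sketch below checks this on made-up numbers; the heights and the threshold are hypothetical.

```python
import numpy as np

heights = np.array([60.0, 63.5, 66.0, 68.0, 71.0, 75.0])  # hypothetical heights
threshold = 67.0                                           # hypothetical cut-off

# Classify by comparing against the threshold on the raw scale...
pred_raw = heights > threshold
# ...and on the log scale, using the log of the same threshold.
pred_log = np.log(heights) > np.log(threshold)

print(np.array_equal(pred_raw, pred_log))
```

This only shows that a single-threshold rule is unaffected by a monotone transform; it does not prove that the fitted network is exactly such a rule.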


Accuracy on the test set using only the categorical features `year`, `eyecolor`, and `exercise`, each transformed into binary one-hot columns.

In [25]:
categoricals = ['year', 'eyecolor', 'exercise']
onehot_encoder = OneHotEncoder()
X_train_cat = onehot_encoder.fit_transform(df_train[categoricals])
X_test_cat = onehot_encoder.transform(df_test[categoricals])
mlp_2 = MLPClassifier(hidden_layer_sizes=(10,),
                      activation='logistic', solver='lbfgs',
                      max_iter=50,random_state=42)
mlp_2.fit(X_train_cat, y_train)
y_test_pred = mlp_2.predict(X_test_cat)
acc = accuracy_score(y_test, y_test_pred)
print(acc)

0.5527638190954773
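
For readers unfamiliar with one-hot encoding, the sketch below applies `OneHotEncoder` to a tiny made-up frame and prints the resulting binary columns. The `toy` frame is hypothetical; `.toarray()` is used only to turn the encoder's sparse result into a dense array for display.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical frame with two categorical columns.
toy = pd.DataFrame({
    "eyecolor": ["hazel", "brown", "green", "brown"],
    "exercise": ["No", "Yes", "Yes", "Yes"],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()  # dense copy for display

# categories_ holds, per input column, the sorted values that define
# the order of the output columns.
print(encoder.categories_)
print(encoded)
```

Each row contains exactly one 1 per original column, so with two categorical columns every row sums to 2.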


Accuracy on the test set using the original `height` values (no pre-processing) and `eyecolor` as a one-hot.

In [26]:
X_train_height = df_train[['height']]
X_test_height = df_test[['height']]
onehot_encoder = OneHotEncoder(sparse_output=False)
X_train_eyecolor = pd.DataFrame(
    onehot_encoder.fit_transform(df_train[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_test_eyecolor = pd.DataFrame(
    onehot_encoder.transform(df_test[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_train_set = pd.concat([X_train_height, X_train_eyecolor], axis = 1)
X_test_set = pd.concat([X_test_height, X_test_eyecolor], axis = 1)
le = LabelEncoder()
y_train = le.fit_transform(df_train['gender'].values)
y_test = le.transform(df_test['gender'].values)

mlp_height = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                           solver='lbfgs', max_iter=100, random_state=42)
mlp_height.fit(X_train_set, y_train)
y_test_prediction_height = mlp_height.predict(X_test_set)
height_acc = accuracy_score(y_test, y_test_prediction_height)
print(height_acc)

0.8592964824120602


Accuracy on the test set using the log of `height` values (applied to both training and testing sets) and `eyecolor` as a one-hot.

In [27]:
X_train_height_log = np.log(df_train[['height']])
X_test_height_log = np.log(df_test[['height']])
onehot_encoder = OneHotEncoder(sparse_output=False)
X_train_eyecolor_log = pd.DataFrame(
    onehot_encoder.fit_transform(df_train[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_test_eyecolor_log = pd.DataFrame(
    onehot_encoder.transform(df_test[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_train_set_log = pd.concat([X_train_height_log, X_train_eyecolor_log],
                            axis = 1)
X_test_set_log = pd.concat([X_test_height_log, X_test_eyecolor_log], axis = 1)
le = LabelEncoder()
y_train = le.fit_transform(df_train['gender'].values)
y_test = le.transform(df_test['gender'].values)

mlp_height_log = MLPClassifier(hidden_layer_sizes = (10,),
                               activation = 'logistic',solver = 'lbfgs',
                               max_iter=100, random_state =42)
mlp_height_log.fit(X_train_set_log, y_train)
y_test_prediction_height_log = mlp_height_log.predict(X_test_set_log)
height_acc = accuracy_score(y_test, y_test_prediction_height_log)
print(height_acc)

0.8693467336683417


Accuracy on the test set using the Z-score of `height` values and `eyecolor` as a one-hot.

In [28]:
height_mean = df_train['height'].mean()
height_std = df_train['height'].std()
X_train_height_z = (df_train[['height']] - height_mean) / height_std
X_test_height_z = (df_test[['height']] - height_mean) / height_std
onehot_encoder = OneHotEncoder(sparse_output=False)
X_train_eyecolor_z = pd.DataFrame(
    onehot_encoder.fit_transform(df_train[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_test_eyecolor_z = pd.DataFrame(
    onehot_encoder.transform(df_test[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_train_set_z = pd.concat([X_train_height_z, X_train_eyecolor_z], axis = 1)
X_test_set_z = pd.concat([X_test_height_z, X_test_eyecolor_z], axis = 1)
le = LabelEncoder()
y_train = le.fit_transform(df_train['gender'].values)
y_test = le.transform(df_test['gender'].values)

mlp_height_z = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                             solver='lbfgs', max_iter=100, random_state=42)
mlp_height_z.fit(X_train_set_z, y_train)
y_test_prediction_height_z = mlp_height_z.predict(X_test_set_z)
height_acc = accuracy_score(y_test, y_test_prediction_height_z)
print(height_acc)

0.8693467336683417
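
The Z-score in this cell is computed by hand with pandas, whereas the height-only model earlier used `StandardScaler`. The two are not numerically identical: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below shows the constant factor between them on made-up heights (hypothetical values).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

heights = pd.DataFrame({"height": [62.0, 65.0, 69.0, 71.0, 74.0]})  # hypothetical

# Manual z-score: pandas .std() uses ddof=1 by default.
col = heights["height"]
z_pandas = ((col - col.mean()) / col.std()).to_numpy()

# StandardScaler uses the population standard deviation (ddof=0).
z_sklearn = StandardScaler().fit_transform(heights).ravel()

# The two versions differ only by the constant factor sqrt(n / (n - 1)).
n = len(heights)
print(np.allclose(z_pandas * np.sqrt(n / (n - 1)), z_sklearn))
```

For a training set of several hundred rows the factor is very close to 1, so the choice is unlikely to matter much here, but it is worth using one method consistently.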


Repeat for the `watchtv` and `eyecolor` features.

Accuracy on the test set using the original `watchtv` values (no pre-processing) and `eyecolor` as a one-hot.

In [29]:
X_train_tv = df_train[['watchtv']]
X_test_tv = df_test[['watchtv']]
X_train_set_tv = pd.concat([X_train_tv, X_train_eyecolor], axis = 1)
X_test_set_tv = pd.concat([X_test_tv, X_test_eyecolor], axis = 1)
mlp_watchtv = MLPClassifier(hidden_layer_sizes = (10,), activation = 'logistic',
                           solver = 'lbfgs', max_iter=100, random_state =42)
mlp_watchtv.fit(X_train_set_tv, y_train)
y_test_prediction_tv = mlp_watchtv.predict(X_test_set_tv)
tv_acc = accuracy_score(y_test, y_test_prediction_tv)
print(tv_acc)

0.5376884422110553


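Accuracy on the test set using the log of `watchtv` values (applied to both training and testing sets) and `eyecolor` as a one-hot. Because the log is undefined at zero, rows with a `watchtv` of 0 are dropped from both sets first, so this accuracy is computed on a subset of the test set used in the previous cell.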

In [30]:
df_train_log_tv = df_train[df_train['watchtv'] > 0].copy()
df_test_log_tv = df_test[df_test['watchtv'] > 0].copy()


X_train_tv_log = np.log(df_train_log_tv[['watchtv']])
X_test_tv_log = np.log(df_test_log_tv[['watchtv']])

onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown ='ignore')
X_train_eyecolor_log = pd.DataFrame(
    onehot_encoder.fit_transform(df_train_log_tv[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))
X_test_eyecolor_log = pd.DataFrame(
    onehot_encoder.transform(df_test_log_tv[['eyecolor']]),
    columns=onehot_encoder.get_feature_names_out(['eyecolor']))


X_train_logtv = pd.concat([X_train_tv_log.reset_index(drop=True),
                           X_train_eyecolor_log.reset_index(drop=True)],
                          axis = 1)
X_test_logtv = pd.concat([X_test_tv_log.reset_index(drop=True),
                          X_test_eyecolor_log.reset_index(drop=True)],
                         axis = 1)

le = LabelEncoder()
y_train_logtv = le.fit_transform(df_train_log_tv['gender'].values)
y_test_logtv = le.transform(df_test_log_tv['gender'].values)

mlp_logTV = MLPClassifier(hidden_layer_sizes=(10,),
                          activation='logistic', solver='lbfgs',
                          max_iter=100, random_state=42)
mlp_logTV.fit(X_train_logtv, y_train_logtv)
y_pred_logtv = mlp_logTV.predict(X_test_logtv)

tv_log_acc = accuracy_score(y_test_logtv, y_pred_logtv)
print(tv_log_acc)


0.5128205128205128
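
Dropping the zero rows changes which test samples are scored. An alternative, which is not what this notebook does, is `np.log1p`, which computes log(1 + x), is defined at zero, and keeps every row. The sketch below compares the two on a made-up column (hypothetical values).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"watchtv": [0.0, 2.0, 5.0, 10.0]})  # hypothetical

# Approach used in the cell above: drop rows where the log is undefined.
filtered = toy[toy["watchtv"] > 0]
log_filtered = np.log(filtered["watchtv"])

# Alternative: log(1 + x) maps 0 to 0, so no rows are dropped.
log1p_all = np.log1p(toy["watchtv"])

print(len(log_filtered), len(log1p_all))
```

With `log1p` the model would be evaluated on the full test set, which makes its accuracy directly comparable to the unfiltered runs.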


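Accuracy on the test set using the Z-score of `watchtv` values and `eyecolor` as a one-hot. The cell reuses the filtered rows (`watchtv` > 0) and the one-hot columns from the log cell above, so the result is comparable to the log result rather than to the unfiltered original-values result.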

In [31]:
scaler = StandardScaler()
X_train_tv_z = scaler.fit_transform(df_train_log_tv[['watchtv']])
X_test_tv_z = scaler.transform(df_test_log_tv[['watchtv']])

X_train_zscore = pd.concat([pd.DataFrame(X_train_tv_z, columns=['watchtv']),
                            X_train_eyecolor_log.reset_index(drop=True)],
                           axis = 1)
X_test_zscore = pd.concat([pd.DataFrame(X_test_tv_z, columns=['watchtv']),
                           X_test_eyecolor_log.reset_index(drop=True)],
                          axis = 1)

mlp_zscore= MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                          solver='lbfgs',max_iter=100, random_state=42)
mlp_zscore.fit(X_train_zscore, y_train_logtv)

y_pred_z = mlp_zscore.predict(X_test_zscore)


zscore_acc_2 = accuracy_score(y_test_logtv, y_pred_z)
print(zscore_acc_2)


0.49743589743589745


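Combining the features (`year`, `eyecolor`, `exercise`, `height`, `watchtv`).

NN accuracy on the test set using the single 10 node hidden layer. The categorical features are one-hot encoded, `height` is Z-scored, `watchtv` is log-transformed, and rows with a `watchtv` of 0 are dropped.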

In [33]:
df_train_6a = df_train[df_train['watchtv'] > 0].copy()
df_test_6a = df_test[df_test['watchtv'] > 0].copy()

categoricals = ['year', 'eyecolor', 'exercise']
onehot_encoder = OneHotEncoder(sparse_output=False)
X_train_cat = pd.DataFrame(
    onehot_encoder.fit_transform(df_train_6a[categoricals]),
    columns=onehot_encoder.get_feature_names_out(categoricals))
X_test_cat = pd.DataFrame(
    onehot_encoder.transform(df_test_6a[categoricals]),
    columns=onehot_encoder.get_feature_names_out(categoricals))

height_scaler = StandardScaler()
X_train_6a = pd.DataFrame(height_scaler.fit_transform(df_train_6a[['height']]),
                          columns=['height_z'])
X_test_6a = pd.DataFrame(height_scaler.transform(df_test_6a[['height']]),
                         columns=['height_z'])

X_train_tv_log = np.log(df_train_6a[['watchtv']])
X_test_tv_log = np.log(df_test_6a[['watchtv']])

X_train_comb = pd.concat([X_train_cat.reset_index(drop=True),
                          X_train_6a.reset_index(drop=True),
                          X_train_tv_log.reset_index(drop=True)], axis =1)

X_test_comb = pd.concat([X_test_cat.reset_index(drop=True),
                         X_test_6a.reset_index(drop=True),
                         X_test_tv_log.reset_index(drop=True)], axis =1)

le = LabelEncoder()
y_train_combined = le.fit_transform(df_train_6a['gender'].values)
y_test_combined = le.transform(df_test_6a['gender'].values)

mlp_comb = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                         solver='lbfgs', max_iter=100, random_state=42)
mlp_comb.fit(X_train_comb, y_train_combined)

y_comb_pred = mlp_comb.predict(X_test_comb)
combined_acc = accuracy_score(y_test_combined, y_comb_pred)

print(combined_acc)

0.8179487179487179
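
The same combination of transforms can be expressed more compactly with `Pipeline` and `ColumnTransformer`, which fit every preprocessing step on the training data only and reapply it to the test data. The sketch below is one possible arrangement, not the notebook's method. It runs on synthetic data: the `toy` frame, its distributions, and the 150/50 split are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the lab data (hypothetical values).
gender = rng.choice(["female", "male"], size=n)
toy = pd.DataFrame({
    "gender": gender,
    "height": np.where(gender == "male", 70.0, 64.5) + rng.normal(0, 2.5, n),
    "watchtv": rng.integers(1, 20, n).astype(float),  # strictly positive
    "year": rng.choice(["first", "second", "third", "fourth"], size=n),
    "eyecolor": rng.choice(["brown", "blue", "green", "hazel"], size=n),
    "exercise": rng.choice(["Yes", "No"], size=n),
})
train, test = toy.iloc[:150], toy.iloc[150:]

# One transformer per group of columns, mirroring the manual steps above.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     ["year", "eyecolor", "exercise"]),
    ("zscore", StandardScaler(), ["height"]),
    ("log", FunctionTransformer(np.log), ["watchtv"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("mlp", MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                          solver="lbfgs", max_iter=100, random_state=42)),
])

features = ["year", "eyecolor", "exercise", "height", "watchtv"]
model.fit(train[features], train["gender"])
test_acc = model.score(test[features], test["gender"])
print(test_acc)
```

A pipeline also removes the need for manual `reset_index` and `concat` calls, since the column transformer assembles the feature matrix itself.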


***Improving test set prediction accuracy***

In [34]:
mlp_comb_bonus = MLPClassifier(hidden_layer_sizes=(50, 50),
                               activation='identity', solver='lbfgs',
                               alpha=0.0001, max_iter=50, random_state=42)
mlp_comb_bonus.fit(X_train_comb, y_train_combined)

y_comb_pred_bonus = mlp_comb_bonus.predict(X_test_comb)
combined_acc_bonus = accuracy_score(y_test_combined, y_comb_pred_bonus)

print(combined_acc_bonus)
print(combined_acc_bonus - combined_acc)

0.8512820512820513
0.033333333333333326
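
The model above uses `activation='identity'` in its hidden layers. With an identity activation each hidden layer is a linear map, and a stack of linear maps is itself a single linear map, so extra layers add parameters without adding new kinds of decision boundary. The NumPy sketch below verifies this on random weights (hypothetical values, not the fitted coefficients).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 samples, 8 input features (hypothetical)

# Two hidden layers of 50 units with identity activation, then one output unit.
W1, b1 = rng.normal(size=(8, 50)), rng.normal(size=50)
W2, b2 = rng.normal(size=(50, 50)), rng.normal(size=50)
W3, b3 = rng.normal(size=(50, 1)), rng.normal(size=1)

# Output score before the final logistic squashing.
deep_out = ((X @ W1 + b1) @ W2 + b2) @ W3 + b3

# The same map collapsed into one weight matrix and one bias.
W = W1 @ W2 @ W3
b = (b1 @ W2 + b2) @ W3 + b3
single_out = X @ W + b

print(np.allclose(deep_out, single_out))
```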


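I was able to increase the test accuracy by about 3.3 percentage points, from 0.818 to 0.851. I changed the hidden layers from (10,) to (50, 50), switched the hidden-layer activation from logistic to identity, and decreased the maximum iterations from 100 to 50. I also set `alpha`, the strength of the L2 regularization term, explicitly to 0.0001; this equals the scikit-learn default, so the amount of regularization is the same as in the earlier models. I arrived at these settings by experimenting with the model and comparing test accuracy while making educated choices about what to change. More and larger hidden layers normally allow the network to represent more complex patterns at a higher risk of overfitting, which L2 regularization helps to control. With the identity activation, however, the hidden layers are linear, so this particular network behaves like a linear classifier, and the gain most likely comes from the change of activation and iteration limit rather than from the added capacity.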
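
Hyperparameters such as these can also be searched systematically with `GridSearchCV`, which scores each combination by cross-validation on the training data and leaves the test set untouched until the end. The sketch below is a minimal example under assumed settings: the synthetic `X` and `y`, the grid values, and `cv=3` are all illustrative choices, not the configuration used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical): class 1 is shifted by 1.5.
y = rng.integers(0, 2, size=120)
X = y[:, None] * 1.5 + rng.normal(size=(120, 3))

# Candidate settings to compare (illustrative values).
param_grid = {
    "hidden_layer_sizes": [(10,), (50, 50)],
    "activation": ["logistic", "identity"],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=100, random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After the search, `search.best_estimator_` is refitted on all of the training data and can be scored once on the held-out test set.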