<p style="font-family: Arial; font-size:3em;color:black;"> Lab Exercise 7</p>

We will use the Titanic dataset from Kaggle (https://www.kaggle.com/). It is a well-known dataset, and we will use it for binary classification: predicting whether a passenger survived.

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [33]:
training_DF = pd.read_csv('titanic_dataset_GBC.csv')

In [34]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection; pandas warns that the chained inplace form will stop working in pandas 3.0.
training_DF['Age'] = training_DF['Age'].fillna(training_DF['Age'].mean())


In [35]:
training_DF.drop('Cabin',axis=1,inplace=True)

In [36]:
training_DF.dropna(inplace=True)

In [37]:
sex = pd.get_dummies(training_DF['Sex'])
embark = pd.get_dummies(training_DF['Embarked'])

In [38]:
training_DF.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

In [39]:
training_DF = pd.concat([training_DF,sex,embark],axis=1)

In [40]:
training_DF.drop(['female','C'],axis=1,inplace=True)
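
The encoding steps above (build dummy columns, concatenate them, then drop one column per category) can also be written in a single call. The sketch below uses a small made-up frame rather than the lab's CSV, so the values are illustrative only. Note that this form prefixes the new column names with the source column, unlike the manual approach above.

```python
import pandas as pd

# Toy frame with the same two categorical columns (values are made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# drop_first=True drops the first category of each column in sorted order:
# 'female' for Sex and 'C' for Embarked, the same two columns dropped manually above.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

Dropping one level per category avoids perfectly collinear dummy columns, which is the reason for removing 'female' and 'C' in the first place.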

In [41]:
from sklearn.model_selection import train_test_split

In [42]:
X_train, X_test, y_train, y_test = train_test_split(training_DF.drop('Survived',axis=1), 
                                                    training_DF['Survived'], test_size=0.30, 
                                                    random_state=101)

In [43]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)

In [44]:
# let's explore SGDClassifier parameters

'''
SGDClassifier(
    loss='hinge',
    penalty='l2',
    alpha=0.0001,
    l1_ratio=0.15,
    fit_intercept=True,
    max_iter=1000,
    tol=0.001,
    shuffle=True,
    verbose=0,
    epsilon=0.1,
    n_jobs=None,
    random_state=None,
    learning_rate='optimal',
    eta0=0.0,
    power_t=0.5,
    early_stopping=False,
    validation_fraction=0.1,
    n_iter_no_change=5,
    class_weight=None,
    warm_start=False,
    average=False,
)
'''

# Let's build SGD models that vary the loss and alpha parameters
    # loss: 'hinge', 'log_loss' (named 'log' in older scikit-learn releases), and 'modified_huber'
    # alpha: 0.0001 and 0.001
    # explain and discuss your findings

In [45]:
# Create different SGD models with varying parameters
sgd_models = {
    'hinge_0.0001': SGDClassifier(loss='hinge', alpha=0.0001, max_iter=1000, random_state=42),
    'hinge_0.001': SGDClassifier(loss='hinge', alpha=0.001, max_iter=1000, random_state=42),
    'log_loss_0.0001': SGDClassifier(loss='log_loss', alpha=0.0001, max_iter=1000, random_state=42),
    'log_loss_0.001': SGDClassifier(loss='log_loss', alpha=0.001, max_iter=1000, random_state=42),
    'modified_huber_0.0001': SGDClassifier(loss='modified_huber', alpha=0.0001, max_iter=1000, random_state=42),
    'modified_huber_0.001': SGDClassifier(loss='modified_huber', alpha=0.001, max_iter=1000, random_state=42)
}

# Train and evaluate each model
# Scale the test data with the scaler fitted on the training data; a separate
# variable keeps this cell safe to re-run without scaling X_test twice
X_test_scaled = scaler.transform(X_test)
results = {}

for name, model in sgd_models.items():
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test_scaled, y_test)
    results[name] = {'train_score': train_score, 'test_score': test_score}

# Display results
for name, scores in results.items():
    print(f"\nModel: {name}")
    print(f"Training accuracy: {scores['train_score']:.4f}")
    print(f"Testing accuracy: {scores['test_score']:.4f}")


Model: hinge_0.0001
Training accuracy: 0.7797
Testing accuracy: 0.8052

Model: hinge_0.001
Training accuracy: 0.7910
Testing accuracy: 0.8015

Model: log_loss_0.0001
Training accuracy: 0.7894
Testing accuracy: 0.7865

Model: log_loss_0.001
Training accuracy: 0.8071
Testing accuracy: 0.8165

Model: modified_huber_0.0001
Training accuracy: 0.7267
Testing accuracy: 0.7079

Model: modified_huber_0.001
Training accuracy: 0.7926
Testing accuracy: 0.7790
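
The accuracies above come from one train/test split and one SGD seed, so part of the spread between models may be noise. A rough way to gauge that is to refit the same configuration with several seeds and look at the range of test scores. The sketch below uses synthetic data from make_classification as a stand-in for the Titanic features; the data, the sample size, and the seed range are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(n_samples=900, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# Same model configuration, different SGD shuffling seeds
seed_scores = []
for seed in range(5):
    clf = SGDClassifier(loss="hinge", alpha=0.0001, max_iter=1000, random_state=seed)
    clf.fit(X_tr_s, y_tr)
    seed_scores.append(clf.score(X_te_s, y_te))

print("test accuracy per seed:", np.round(seed_scores, 4))
print("range:", round(max(seed_scores) - min(seed_scores), 4))
```

If the range across seeds is comparable to the gap between two models, that gap alone is weak evidence that one configuration is better.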


# Analysis of findings

Findings from the parameter exploration:

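1. Loss Functions:
   - Hinge loss ('hinge'): fits a linear SVM
   - Log loss ('log_loss'): fits logistic regression
   - Modified Huber ('modified_huber'): a smoothed hinge loss that is quadratic near the margin and grows only linearly for badly misclassified points, which makes it more tolerant of outliers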

2. Alpha (regularization strength):
   - 0.0001: Weaker regularization, allows larger weights and a closer fit to the training data
   - 0.001: Ten times stronger regularization, shrinks the weights and promotes simpler models
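
To make the three loss functions listed above concrete, the sketch below evaluates each one as a function of the margin y * f(x), with labels y in {-1, +1}. The formulas follow the definitions given in the scikit-learn user guide as I understand them; the function names are my own and are not part of any library.

```python
import numpy as np

def hinge_loss(margin):
    # Zero once the point is beyond the margin, linear otherwise
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    # Smooth everywhere and never exactly zero
    return np.log1p(np.exp(-margin))

def modified_huber_loss(margin):
    # Squared hinge for margin >= -1, linear (-4 * margin) for worse mistakes
    return np.where(margin >= -1.0,
                    np.maximum(0.0, 1.0 - margin) ** 2,
                    -4.0 * margin)

margins = np.array([-3.0, -1.0, 0.0, 1.0, 2.0])
for name, fn in [("hinge", hinge_loss),
                 ("log", logistic_loss),
                 ("modified_huber", modified_huber_loss)]:
    print(f"{name:>15}: {np.round(fn(margins), 3)}")
```

The linear branch of the modified Huber loss is what limits the influence of badly misclassified points compared with a purely squared hinge.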

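Key Observations:
1. Loss Function Impact:
   - Log loss with alpha=0.001 gave the best test accuracy in this run (0.8165); it also supports probability estimates
   - Hinge loss was close behind and nearly insensitive to alpha (test accuracy 0.8052 and 0.8015)
   - Modified Huber was the weakest here, especially at alpha=0.0001 (test accuracy 0.7079), so its tolerance of outliers did not translate into better accuracy on this data

2. Alpha Impact:
   - The higher alpha (0.001) matched or improved on the lower one: test accuracy rose for log loss (0.7865 to 0.8165) and modified Huber (0.7079 to 0.7790) and was essentially unchanged for hinge
   - Training and test accuracy are close for every model, so there is no clear sign of overfitting at either alpha; the differences come from a single split and seed and should be read with caution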

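Best Practices:
- Compare loss functions empirically rather than assuming one is best; in this run log_loss and hinge outperformed modified_huber
- Try several alpha values on a logarithmic scale, increasing alpha if the model overfits and decreasing it if it underfits
- Use cross-validation for parameter tuning instead of relying on a single train/test split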