# Model Selection and Training

In [57]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load cleaned training and testing data sets
train = pd.read_csv('train_clean.csv')
test = pd.read_csv('test_clean.csv')

Since the test dataset does not include the target variable, I split the training dataset using train_test_split to create a validation set. This allows me to evaluate the models' performance and assess their generalizability before testing them on the actual test dataset.



In [58]:
from sklearn.model_selection import train_test_split

# Split the training data into X (features) and y (target variable)
X = train.drop(columns=['Survived'])
y = train['Survived']

# Split the data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
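
One optional refinement, not used in the cell above, is to pass `stratify` so that the training and validation sets keep the same class ratio. The sketch below uses randomly generated stand-in data rather than the Titanic files; the names `X_demo` and `y_demo`, the 890-row size, and the 38% positive rate are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 890 rows with an arbitrary class ratio.
rng = np.random.default_rng(0)
y_demo = (rng.random(890) < 0.38).astype(int)
X_demo = rng.normal(size=(890, 3))

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("overall positive rate:   ", round(y_demo.mean(), 3))
print("training positive rate:  ", round(y_tr.mean(), 3))
print("validation positive rate:", round(y_va.mean(), 3))
```

With a small validation set, stratifying makes per-class metrics a little less sensitive to how the split happened to fall.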

I began by selecting a logistic regression model for the Titanic dataset because it is simple, easy to interpret, and efficient, especially for smaller datasets or when the relationship between the features and the log-odds of the target is approximately linear. In this case, features like age, class, and gender have a clear connection to survival, making logistic regression a good fit. Training is straightforward: the coefficients are found by minimizing a regularized log-loss with an iterative solver (scikit-learn uses L-BFGS by default), which is computationally less demanding than fitting more complex models. Additionally, logistic regression has few hyperparameters to tune, mainly the regularization strength (C) and the solver, making it quick to train and easy to adjust. While more complex models might perform better, logistic regression is a solid choice for its efficiency and interpretability.
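
To make the interpretability point concrete, the sketch below fits a logistic regression on synthetic data and reads the coefficients as odds ratios. The column names `is_female` and `pclass`, the effect sizes, and the variable names are all invented for illustration and are not taken from `train_clean.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; column names and effect sizes are hypothetical.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "is_female": rng.integers(0, 2, n),
    "pclass": rng.integers(1, 4, n),
})

# Generate labels from a known logistic relationship so the fit has something to recover.
logit = -0.5 + 2.5 * demo["is_female"] - 0.8 * (demo["pclass"] - 2)
prob = 1 / (1 + np.exp(-logit))
label = (rng.random(n) < prob).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(demo, label)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
summary = pd.DataFrame(
    {
        "coefficient": demo_model.coef_[0],
        "odds_ratio": np.exp(demo_model.coef_[0]),
    },
    index=demo.columns,
)
print(summary)
```

An odds ratio above 1 means the feature raises the predicted odds of the positive class, and a value below 1 means it lowers them. The same reading applies to `log_model.coef_` once the real model is trained, provided the features are on comparable scales.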

In [59]:
# Create the model
log_model = LogisticRegression(max_iter=1000)

# Train the model
log_model.fit(X_train, y_train)

To evaluate the performance of my logistic regression model on the Titanic dataset, I first calculated the validation accuracy, which was 83%. This indicates that the model correctly predicted the survival outcomes for 83% of the passengers in the validation dataset. I then examined the confusion matrix, which showed that the model correctly identified 87 passengers as non-survivors (true negatives) and 61 passengers as survivors (true positives). However, it also misclassified 17 non-survivors as survivors (false positives) and 13 survivors as non-survivors (false negatives). To gain a more detailed understanding of the model's performance, I reviewed the classification report, which provided precision, recall, and F1-score metrics for each class. The F1-score is the harmonic mean of precision and recall and offers a balanced measure of both metrics. For non-survivors, the F1-score was 0.85, indicating strong precision and recall. For survivors, the F1-score was 0.80, reflecting a good balance between precision and recall, though there is still room for improvement. The overall accuracy, macro average, and weighted average all hovered around 83%, suggesting that the model performs well across both classes. In summary, the logistic regression model shows strong performance, with room for minor improvements, particularly in reducing false positives and false negatives for survivors.

In [60]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict the target variable (Survived) on the validation set
val_predictions = log_model.predict(X_val)

print("Validation Accuracy:", accuracy_score(y_val, val_predictions))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, val_predictions))
print("\nClassification Report:\n", classification_report(y_val, val_predictions))

Validation Accuracy: 0.8314606741573034

Confusion Matrix:
 [[87 17]
 [13 61]]

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.84      0.85       104
           1       0.78      0.82      0.80        74

    accuracy                           0.83       178
   macro avg       0.83      0.83      0.83       178
weighted avg       0.83      0.83      0.83       178
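
The figures in the report above can be re-derived by hand from the four confusion-matrix counts, which is a useful check on how precision, recall, and F1 relate to each other. The sketch below uses only the counts printed above; the variable and function names are my own.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 87, 17   # true class 0 (did not survive)
fn, tp = 13, 61   # true class 1 (survived)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


accuracy = (tp + tn) / (tn + fp + fn + tp)

# Class 1 (survived)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (did not survive): the roles of the cells swap
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1(precision_0, recall_0):.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1(precision_1, recall_1):.2f}")
```

The printed values match the classification report: 0.87 / 0.84 / 0.85 for class 0 and 0.78 / 0.82 / 0.80 for class 1.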



I selected random forest for the Titanic dataset because it is a powerful and flexible model that can capture complex, non-linear relationships between features and the target variable. Unlike logistic regression, which assumes linear relationships, random forest can handle interactions between features more effectively, which is useful when the data may have more complicated patterns. It is also less prone to overfitting compared to other models, as it averages the results of multiple decision trees, improving generalization. The training process involves building multiple decision trees using random subsets of the data and features, then combining their predictions to make a final decision. Key hyperparameters to tune in a random forest include the number of trees (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of samples required to split a node (min_samples_split). While training can be more computationally intensive compared to simpler models like logistic regression, random forest's ability to handle a variety of data types and its robustness make it an excellent choice for this dataset, particularly when accuracy is a priority over training time.
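
The effect of the hyperparameters named above can be explored with a small experiment. The sketch below is an illustration on synthetic data from `make_classification`, not on the Titanic features; the names `X_demo`, `settings`, and `scores`, and the specific values `max_depth=5` and `min_samples_split=10`, are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the cleaned features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Compare fully grown trees with depth-limited, more conservative trees.
settings = {
    "default (unlimited depth)": dict(max_depth=None, min_samples_split=2),
    "constrained": dict(max_depth=5, min_samples_split=10),
}

scores = {}
for name, params in settings.items():
    forest = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    forest.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each setting.
    scores[name] = (forest.score(X_tr, y_tr), forest.score(X_va, y_va))
    print(f"{name}: train={scores[name][0]:.3f}, validation={scores[name][1]:.3f}")
```

A large gap between training and validation accuracy is a sign that the trees are memorizing the training data; constraining depth or split size usually narrows that gap, sometimes at the cost of validation accuracy.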

In [61]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

To evaluate the performance of my random forest model on the Titanic dataset, I first calculated the validation accuracy, which was 83%. This means that the model correctly predicted the survival outcomes for 83% of the passengers in the validation dataset. I then examined the confusion matrix, which showed that the model correctly identified 92 passengers as non-survivors (true negatives) and 56 passengers as survivors (true positives). However, it also misclassified 12 non-survivors as survivors (false positives) and 18 survivors as non-survivors (false negatives). To gain a deeper understanding of the model's performance, I reviewed the classification report, which provided precision, recall, and F1-score metrics for each class. For non-survivors, the F1-score was 0.86, indicating strong precision and recall. For survivors, the F1-score was 0.79, suggesting that while precision is good, the model could be improved in terms of recall, as it misses some survivors. The overall accuracy, macro average, and weighted average all hovered around 83%, showing that the model performs fairly well in predicting both classes. In summary, the random forest model performs effectively, with slight room for improvement, particularly in increasing recall for survivors.

In [62]:
# Predict on the validation set
val_predictions = rf_model.predict(X_val)

# Evaluate the model's performance on the validation set
print("Validation Accuracy:", accuracy_score(y_val, val_predictions))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, val_predictions))
print("\nClassification Report:\n", classification_report(y_val, val_predictions))

Validation Accuracy: 0.8314606741573034

Confusion Matrix:
 [[92 12]
 [18 56]]

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.88      0.86       104
           1       0.82      0.76      0.79        74

    accuracy                           0.83       178
   macro avg       0.83      0.82      0.82       178
weighted avg       0.83      0.83      0.83       178



When comparing the evaluation metrics of the logistic regression and random forest models on the Titanic dataset, both models achieved the same validation accuracy of 83% (148 of 178 passengers classified correctly), so they differ only in where their errors fall. In terms of the confusion matrix, the random forest model correctly predicted more non-survivors (92 vs. 87) but misclassified more survivors as non-survivors (18 vs. 13). The logistic regression model, on the other hand, missed fewer survivors but produced more false positives, labeling 17 non-survivors as survivors compared with 12 for random forest. Looking at the classification report, the random forest model was slightly better for non-survivors, with higher recall (0.88 vs. 0.84) and a marginally higher F1-score (0.86 vs. 0.85). For survivors, logistic regression had higher recall (0.82 vs. 0.76) and a marginally higher F1-score (0.80 vs. 0.79), while random forest had higher precision (0.82 vs. 0.78). The macro-averaged F1-scores (0.83 for logistic regression, 0.82 for random forest) are nearly identical. In summary, the two models are effectively tied on this validation split: random forest is the better choice if correctly identifying non-survivors matters most, and logistic regression is the better choice if identifying survivors matters most.

Future improvements could involve enhancing feature engineering by creating new features like family size or age categories, and optimizing hyperparameters using grid search or randomized search. Additionally, combining both models through ensembling techniques like stacking or voting, or trying a boosting model, could lead to better performance. The classes are only mildly imbalanced (104 non-survivors vs. 74 survivors in the validation set), but adjusting class weights or resampling with a method like SMOTE could still improve recall for survivors. Exploring model interpretability with feature importance and SHAP values, as well as refining data preprocessing techniques, could further strengthen the models' results.
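
Grid search, mentioned above as a possible improvement, can be sketched as follows. This is a minimal example on synthetic data, not a tuned configuration for the Titanic features: the grid values, the choice of `f1_macro` as the scoring metric, and the names `X_demo`, `param_grid`, and `search` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with X_train and y_train in the notebook.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# A deliberately small, hypothetical grid over the three hyperparameters discussed earlier.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
}

# Each combination is scored with 3-fold cross-validation on macro-averaged F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated macro F1:", round(search.best_score_, 3))
```

Because the search cross-validates within the data it is given, the held-out validation set can still be used afterwards for an unbiased comparison against the untuned models.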

I made predictions for the Titanic test dataset using both models, logistic regression and random forest, and saved them to a CSV file for later use.

In [63]:
from google.colab import files

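# Select the same feature columns, in the same order, that the models were trained on
test_features = test[X_train.columns]

# Make predictions with both models
log_reg_predictions = log_model.predict(test_features)
rf_predictions = rf_model.predict(test_features)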

# Create a DataFrame with the predictions
predictions_df = pd.DataFrame({
    'PassengerId': test['PassengerId'],  # Assuming 'PassengerId' is in the test data
    'Logistic_Regression_Prediction': log_reg_predictions,
    'Random_Forest_Prediction': rf_predictions
})

# Save the predictions to a CSV file
predictions_df.to_csv('model_predictions.csv', index=False)

# Download the file to your local machine
files.download('model_predictions.csv')
