# Titanic Logistic Regression Project Documentation




#  1 Project Overview:

#  2 Data Preprocessing:
We cleaned the dataset and prepared it for analysis by handling missing values, encoding categorical variables, and scaling features.
# 2.1    Loading and getting an overview of dataset
# 2.2.1  Calculate coefficient for 'Age'
# 2.2.2  Building Random Forest model for predicting missing 'Age' values

#  3 Initial Model Training:
We trained an initial Logistic Regression model to establish a baseline performance.

#  4 Hyperparameter Tuning:
 We used GridSearchCV to fine-tune the hyperparameters ('C', 'penalty', 'solver') of our Logistic Regression model, optimizing for the ROC AUC score.
# 4.1 Calculating: 'C' , 'Penalty' , 'Solver' , 'ROC AUC'
# 4.2 Model Retraining:
 Using the best hyperparameters, we retrained our Logistic Regression model and evaluated its performance.

# 5 Feature Engineering:
 We created new interaction features based on the most significant variables ('Pclass', 'Sex_male', 'Age'), converted them into dummy variables, and incorporated them into our model.
# 5.1 Creating New Features
# 5.2 Retraining the Model with added features

# 6 Final Model Evaluation:
We retrained the model with the new features and compared its performance metrics against the baseline.

# 1 Project Overview
   Objective: Predict survival on the Titanic using logistic regression.

   Dataset: Titanic dataset from Kaggle.

# 2 Preprocessing Data
# 2.1 Loading and getting an overview of the dataset

In [None]:
import pandas as pd

# Load the Titanic dataset
from google.colab import files
uploaded = files.upload()

# Assuming the uploaded file is named 'train.csv'
data = pd.read_csv('train.csv')

# Display the first few rows of the dataset
data.head()


Saving train.csv to train (1).csv


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
features = data.columns.tolist()
features

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [None]:
df= data.isnull().sum()
df

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


# 2.1 Conclusion:
The 'Age' column has 177 null values, about 20% of the 891 rows, so dropping those rows would discard too much data; we fill them instead. Mean imputation is the simplest option, but first we check how strongly 'Age' influences the target. The plan: drop the rows with null 'Age', fit a logistic regression, and inspect the 'Age' coefficient. If the coefficient is large, filling 20% of the column with a single mean value would bias the model, so I will use a more advanced imputation method instead.
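The "about 20%" figure can be checked from the counts that `data.info()` printed above. This is a minimal sketch that uses only those two numbers (891 rows, 714 non-null 'Age' values); the variable names are illustrative.

```python
# Counts taken from the data.info() output above
total_rows = 891
non_null_age = 714

missing_age = total_rows - non_null_age
missing_share = missing_age / total_rows

print(f"Missing 'Age' values: {missing_age}")
print(f"Share of rows affected: {missing_share:.1%}")
```

On a loaded DataFrame, `data['Age'].isnull().mean()` should give the same share directly.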

# 2.2 Handling missing values:

# 2.2.1 Calculate coefficient for 'Age'



In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
data = pd.read_csv('train.csv')

# Drop rows with missing Age values
data_dropped = data.dropna(subset=['Age'])

# Preprocess and encode
data_dropped = pd.get_dummies(data_dropped, columns=['Sex', 'Embarked'], drop_first=True)

# Define features and target
X = data_dropped.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
y = data_dropped['Survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Extract the coefficient for Age
coefficients = model.coef_[0]
features = X.columns
age_coefficient = coefficients[features.get_loc('Age')]

print(f'coefficients are: {coefficients}')
print(f'Coefficient for Age: {age_coefficient}')


coefficients are: [-0.99733451 -0.69903574 -0.30150528 -0.06404177  0.06219443 -1.3191294
 -0.12008892 -0.18118393]
Coefficient for Age: -0.6990357386036544


# 2.2.1 conclusion

With a standardized coefficient of -0.699, 'Age' is one of the strongest predictors of survival in this model; only 'Sex_male' (-1.319) and 'Pclass' (-0.997) have larger magnitudes. Filling the missing 'Age' values with the mean would introduce bias, because it would assign a single value to about 20% of the data (177 out of 891 rows) and distort one of the strongest predictors of the target. To avoid this, I considered more advanced imputation methods (KNN, Random Forest, and MICE) and chose Random Forest. I built a Random Forest model to predict the missing ages and then filled in all the missing 'Age' values with its predictions.
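To illustrate why mean imputation differs from the model-based alternatives, here is a hedged sketch on a small synthetic frame (not the Titanic data). It compares scikit-learn's `SimpleImputer`, `KNNImputer`, and `IterativeImputer` (a MICE-style imputer); the Random Forest approach is the one built in section 2.2.2. The column values, sample size, and `n_neighbors=5` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Small synthetic frame (hypothetical values, not the Titanic data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=50),
    'Fare': rng.uniform(5, 100, size=50),
    'Age': rng.uniform(1, 70, size=50),
})
# Blank out 10 ages to imitate missing values
toy.loc[toy.sample(10, random_state=0).index, 'Age'] = np.nan
missing_mask = toy['Age'].isnull()

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'knn': KNNImputer(n_neighbors=5),
    'iterative (MICE-style)': IterativeImputer(random_state=0),
}

results = {}
for name, imputer in imputers.items():
    filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
    results[name] = filled
    # Mean imputation repeats one value; the other methods vary by row
    imputed_ages = filled.loc[missing_mask, 'Age']
    print(f'{name}: {imputed_ages.nunique()} distinct imputed ages')
```

Mean imputation collapses every missing age to one value, which is the flattening effect the conclusion above is concerned about.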

# 2.2.2 Building Random Forest model for predicting missing 'Age' values

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Load the Titanic dataset
data = pd.read_csv('train.csv')

# Preprocess and encode categorical features
data = pd.get_dummies(data, columns=['Sex', 'Embarked'], drop_first=True)

# Define features for predicting Age
known_age = data[data['Age'].notnull()]
unknown_age = data[data['Age'].isnull()]

# Remove irrelevant columns
known_age = known_age.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
unknown_age = unknown_age.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])

X_known_age = known_age.drop(columns=['Age'])
y_known_age = known_age['Age']
X_unknown_age = unknown_age.drop(columns=['Age'])

# Convert categorical columns to numeric
X_known_age = X_known_age.astype(float)
X_unknown_age = X_unknown_age.astype(float)

# Train the Random Forest Regressor
model_age = RandomForestRegressor(random_state=42)
model_age.fit(X_known_age, y_known_age)

# Predict missing Age values
predicted_ages = model_age.predict(X_unknown_age)

# Fill the missing Age values
data.loc[data['Age'].isnull(), 'Age'] = predicted_ages

# Verify the imputation
print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Sex_male     891 non-null    bool   
 11  Embarked_Q   891 non-null    bool   
 12  Embarked_S   891 non-null    bool   
dtypes: bool(3), float64(2), int64(5), object(3)
memory usage: 72.3+ KB
None


# 2.2.2 conclusion
The 'Age' column now has 891 non-null values; the 177 previously null entries were filled with predictions from the Random Forest model. Next, we build a logistic regression model to predict our target variable, 'Survived'.

# 3. Building Logistic Regression Model for Predicting 'Survived'

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assuming 'data' is your preprocessed dataframe
# Define features and target
X = data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
y = data['Survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Check model performance
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'ROC AUC Score: {roc_auc}')
print(f'Classification Report: \n{classification_rep}')


Accuracy: 0.8100558659217877
ROC AUC Score: 0.8877734877734877
Classification Report: 
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179



# 3 Conclusion: Performance of my model is:

Accuracy: 0.810

ROC AUC Score: 0.8877

Classification Report:

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179


# 4 Gaining Performance by Fine-tuning Hyperparameters


# 4.1 Calculating: 'C' , 'Penalty' , 'Solver' , 'ROC AUC'

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2', 'l1'],
    'solver': ['liblinear']  # 'liblinear' supports both the l1 and l2 penalties
}

# Create a grid search object
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Display the best parameters
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best ROC AUC Score: {grid_search.best_score_}')


Best parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
Best ROC AUC Score: 0.843545250398123


#  Conclusion of Part 4.1
The grid search selected C=0.1, penalty='l2', and solver='liblinear', with a mean cross-validated ROC AUC of 0.8435. This score comes from 5-fold cross-validation on the training split, so it is not directly comparable with the ROC AUC of 0.8878 measured on the held-out test split in Part 3. Whether the tuned parameters improve on the baseline is checked by retraining and evaluating on the test split in Part 4.2.
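Beyond the single best score, `GridSearchCV` stores the result for every parameter combination in `cv_results_`, which shows how close the alternatives are. The sketch below runs the same grid on synthetic data from `make_classification`; the data is a stand-in for the scaled training set, not the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

# One row per parameter combination, ranked by mean cross-validated ROC AUC
results = pd.DataFrame(search.cv_results_)
columns = ['param_C', 'param_penalty', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score').head())
```

If the top few rows differ by less than their `std_test_score`, the choice among them is unlikely to matter much on new data.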

#  4.2 Retraining the Model
We train our Logistic Regression model using the optimized hyperparameters (C=0.1, penalty='l2', solver='liblinear') calculated in the previous step.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assuming 'data' is your preprocessed dataframe
# Define features and target
X = data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
y = data['Survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model with optimized parameters
model = LogisticRegression(C=0.1, penalty='l2', solver='liblinear', random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Check model performance
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'ROC AUC Score: {roc_auc}')
print(f'Classification Report: \n{classification_rep}')


Accuracy: 0.8100558659217877
ROC AUC Score: 0.8871299871299871
Classification Report: 
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179



#  Conclusion of Part 4.2
The retraining of our Logistic Regression model using the optimized hyperparameters (C=0.1, penalty='l2', solver='liblinear') resulted in the following performance metrics:

Accuracy: 0.81

ROC AUC Score: 0.8871

Classification Report:

Precision: 0.83 for class 0, 0.79 for class 1

Recall: 0.86 for class 0, 0.74 for class 1

F1-Score: 0.84 for class 0, 0.76 for class 1

Compared with the baseline in Part 3, accuracy is unchanged at 0.81 and the ROC AUC is essentially the same (0.8871 versus 0.8878), so tuning did not improve on the default settings here. The ROC AUC remains high, which indicates that the model separates the two classes well. Precision and recall for class 0 (did not survive) are slightly higher than for class 1 (survived), so the model is somewhat better at predicting the former.
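As an alternative to copying the best parameters into a new `LogisticRegression` by hand, the fitted search object exposes `best_estimator_`, which scikit-learn refits on the full training data when `refit=True` (the default). A minimal sketch on synthetic data, with a reduced grid chosen only for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']},
    cv=5,
    scoring='roc_auc',
)
search.fit(X_tr, y_tr)

# Already refit on all of X_tr with the best parameters
best_model = search.best_estimator_
test_auc = roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1])

print(f'Best parameters: {search.best_params_}')
print(f'Held-out ROC AUC: {test_auc:.3f}')
```

This avoids transcription mistakes between the search output and the retraining cell.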

#  5 Enhancing Performance through Feature Engineering
Upon examining the coefficients, we observe that three features ('Pclass', 'Sex_male', 'Age') have the largest absolute values. We create new features based on them and convert the categorical ones into dummy variables.
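The ranking can be reproduced from the coefficients printed in section 2.2.1. The sketch below pairs those values with feature names; the name order is an assumption inferred from the columns that remain after the `drop` call in that cell, and it agrees with the reported 'Age' coefficient at position 1.

```python
import pandas as pd

# Feature order inferred from the columns left after the drop in section 2.2.1 (assumption)
feature_names = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
                 'Sex_male', 'Embarked_Q', 'Embarked_S']
# Coefficients copied from the printed output in section 2.2.1
coefficients = [-0.99733451, -0.69903574, -0.30150528, -0.06404177,
                0.06219443, -1.3191294, -0.12008892, -0.18118393]

coef_table = pd.Series(coefficients, index=feature_names)

# Sort by absolute value so strong negative effects rank alongside positive ones
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)
print(ranked)
```

Because the features were standardized before fitting, the magnitudes are comparable across features.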

# 5.1 Creating New Features
We generate the 'Pclass_Sex', 'Pclass_cat', and 'Age*Sex' features:

In [None]:
data['Pclass_Sex'] = data['Pclass'] * data['Sex_male']
data

In [None]:
data['Pclass_cat'] = pd.cut(data['Pclass'], bins=[0, 1, 2, 3], labels=['First', 'Second', 'Third'])
data = pd.get_dummies(data, columns=['Pclass_cat'], drop_first=True)
data

In [None]:
data['Age*Sex'] = data['Age'] * data['Sex_male']
data['Age*Sex']

We create three features ('Pclass_Sex', 'Pclass_cat', 'Age*Sex') and add them to the data.
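A small worked example may make the interaction terms concrete. Multiplying by the boolean 'Sex_male' column zeroes the term for female passengers and keeps the original value for male passengers. The four rows below are made up for illustration.

```python
import pandas as pd

# Four hypothetical passengers (not taken from the dataset)
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Sex_male': [True, True, False, False],
    'Age': [40.0, 22.0, 30.0, 8.0],
})

# Same constructions as in the cells above
toy['Pclass_Sex'] = toy['Pclass'] * toy['Sex_male']
toy['Age*Sex'] = toy['Age'] * toy['Sex_male']

print(toy)
```

The two female rows get 0 in both new columns, so these terms only let the model fit a separate class and age effect for men.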

In [None]:
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teen', 'Adult', 'Senior']
data['AgeGroup'] = pd.cut(data['Age'], bins=bins, labels=labels)
data = pd.get_dummies(data, columns=['AgeGroup'], drop_first=True)
data.head()


We split the 'Age' feature into four groups ('Child', 'Teen', 'Adult', 'Senior'), convert them to dummy variables, and drop the first one to avoid multicollinearity.
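The bin edges deserve a quick check, because `pd.cut` treats intervals as right-inclusive by default: an age of exactly 12 falls in 'Child' and exactly 18 in 'Teen'. The sketch below applies the same bins and labels to a handful of hand-picked ages.

```python
import numpy as np
import pandas as pd

# Hand-picked ages, including the bin edges 12, 18, and 60
ages = pd.Series([5, 12, 15, 18, 40, 60, 75])
bins = [0, 12, 18, 60, np.inf]
labels = ['Child', 'Teen', 'Adult', 'Senior']

# Intervals are (0, 12], (12, 18], (18, 60], (60, inf]
groups = pd.cut(ages, bins=bins, labels=labels)

# drop_first=True removes the 'Child' column, which becomes the reference group
dummies = pd.get_dummies(groups, prefix='AgeGroup', drop_first=True)

print(pd.concat([ages.rename('Age'), groups.rename('AgeGroup'), dummies], axis=1))
```

A row with all three dummy columns set to False therefore represents a child.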

In [None]:
features2 = data.columns.tolist()
features2

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Sex_male',
 'Embarked_Q',
 'Embarked_S',
 'Pclass_Sex',
 'Pclass_cat_Second',
 'Pclass_cat_Third',
 'Age*Sex',
 'AgeGroup_Teen',
 'AgeGroup_Adult',
 'AgeGroup_Senior']

The dataset now has 20 columns; after dropping the four identifier columns and the target, 15 of them are used as features.

# 5.2 Retraining the Model with added features

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Convert boolean dummy variables to float
boolean_columns = ['Sex_male', 'Embarked_Q', 'Embarked_S', 'Pclass_cat_Second', 'Pclass_cat_Third',
                   'AgeGroup_Teen', 'AgeGroup_Adult', 'AgeGroup_Senior']
data[boolean_columns] = data[boolean_columns].astype(float)

# Define features and target
X = data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
y = data['Survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Check model performance
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'ROC AUC Score: {roc_auc}')
print(f'Classification Report: \n{classification_rep}')


Accuracy: 0.8044692737430168
ROC AUC Score: 0.8656370656370656
Classification Report: 
              precision    recall  f1-score   support

           0       0.81      0.88      0.84       105
           1       0.80      0.70      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



#  Conclusion of Part 5
After performing feature engineering, the new features generated from 'Pclass', 'Sex_male', and 'Age' were incorporated into our model. The retraining of the model with these added features resulted in the following performance metrics:

Accuracy: 0.80

ROC AUC Score: 0.8656

Classification Report:

Precision: 0.81 for class 0, 0.80 for class 1

Recall: 0.88 for class 0, 0.70 for class 1

F1-Score: 0.84 for class 0, 0.75 for class 1




# 6 Final Model Evaluation:
Compared with our initial model, the feature-engineered model performs slightly worse on the test split: accuracy falls from 0.810 to 0.804 and the ROC AUC from 0.8878 to 0.8656. The interaction terms and new categories therefore did not improve this model; 'Pclass_cat' and 'AgeGroup' re-encode information already carried by 'Pclass' and 'Age'. Of the three models, the baseline from Part 3 and the tuned model from Part 4.2 perform best. Feature engineering can still be valuable, but each new feature should be validated against the baseline before it is kept.
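For a side-by-side view, the test-split metrics printed in Parts 3, 4.2, and 5.2 can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals; the row labels are illustrative.

```python
import pandas as pd

# Metrics copied from the printed outputs of Parts 3, 4.2, and 5.2
reported = pd.DataFrame(
    {
        'accuracy': [0.8101, 0.8101, 0.8045],
        'roc_auc': [0.8878, 0.8871, 0.8656],
    },
    index=['3 baseline', '4.2 tuned', '5.2 feature-engineered'],
)

# Difference in ROC AUC relative to the baseline model
baseline_auc = reported.loc['3 baseline', 'roc_auc']
reported['roc_auc_change_vs_baseline'] = reported['roc_auc'] - baseline_auc

print(reported.round(4))
```

All three models were scored on the same 179-row test split (`random_state=42`), so the rows are directly comparable.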

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## © 2024 Ali M Shafiei. All rights reserved.
