<a href="https://colab.research.google.com/github/bamideleadedeji/Cookie-Review-Transformation/blob/master/feature_engineering_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 Feature Engineering Example in Python (Titanic Dataset)
This notebook demonstrates common feature engineering techniques using the Titanic dataset.

In [11]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [12]:
# Step 2: Load dataset (Titanic dataset as example)
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 🧹 Step 3: Handle Missing Values

In [13]:
# Fill missing 'Age' with median.
# Assign the result back to the column: calling fillna(..., inplace=True) on
# df['Age'] acts on an intermediate object and stops working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing 'Embarked' with mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop 'Cabin' due to too many missing values
df = df.drop(columns=['Cabin'])
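
A quick way to confirm that the imputation did what was intended is to compare per-column missing counts before and after. The sketch below is not part of the original notebook; it uses a tiny hypothetical frame (`toy`) that mimics the three columns handled in this step, and it assigns results back to the column rather than relying on `inplace=True`.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame mimicking the three columns handled in this step.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Count missing values per column before any imputation.
missing_before = toy.isna().sum()

# Median for the numeric column, most frequent value for the categorical one.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop the column that is mostly empty.
toy = toy.drop(columns=["Cabin"])

# Count missing values again; every remaining column should now be complete.
missing_after = toy.isna().sum()
print(missing_before)
print(missing_after)
```

Running the same `isna().sum()` check on the real `df` would show whether any other column still contains gaps.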


### 🛠️ Step 4: Create New Features

In [14]:
# Create a new feature 'FamilySize'
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Create a new feature 'IsAlone'
df['IsAlone'] = 1
df.loc[df['FamilySize'] > 1, 'IsAlone'] = 0

# Extract title from name
df['Title'] = df['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())
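
To make the title extraction concrete, here is a small sketch of the same split logic applied to a few names. The helper name `extract_title` is hypothetical, and the last sample name is included as an assumed illustration of a multi-word title rather than something shown in the preview above.

```python
def extract_title(name: str) -> str:
    """Return the text between the first comma and the next period."""
    return name.split(",")[1].split(".")[0].strip()


sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    # Assumed example of a name whose title spans two words.
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(name) for name in sample_names]
print(titles)
```

With this logic a multi-word title comes out as `the Countess`, not `Countess`, so a grouping list that contains only `'Countess'` (as in Step 5) would likely leave such a row outside the `Rare` bucket. Checking `df['Title'].value_counts()` before grouping is a cheap safeguard.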

### 🔣 Step 5: Encode Categorical Features

In [15]:
# Encode 'Sex'
le_sex = LabelEncoder()
df['Sex'] = le_sex.fit_transform(df['Sex'])

# Encode 'Embarked'
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Simplify and encode 'Title'
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                   'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
                                  'Rare')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')

df['Title'] = LabelEncoder().fit_transform(df['Title'])
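
It can help to see which integer each category receives. The following sketch, using a short hypothetical list of values, inspects the mapping through the encoder's `classes_` attribute; `LabelEncoder` assigns codes in sorted order of the labels.

```python
from sklearn.preprocessing import LabelEncoder

# Short hypothetical sample of the 'Sex' column.
encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because the codes follow alphabetical order, they carry no meaning beyond identity. For a nominal column such as `Title`, one-hot encoding (as done for `Embarked`) may be a safer choice for linear models, which would otherwise treat the codes as ordered.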

### ⚖️ Step 6: Scale Numerical Features

In [16]:
scaler = StandardScaler()
num_features = ['Age', 'Fare', 'FamilySize']
df[num_features] = scaler.fit_transform(df[num_features])
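
As a sanity check on what `StandardScaler` computes, the sketch below compares its output with the manual formula `(x - mean) / std`, where the standard deviation is the population version (`ddof=0`). It uses only the five `Fare` values from the preview, so the scaled numbers will differ from those in the notebook, which fits on the full column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five 'Fare' values from the preview, as a single-column 2D array.
fares = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

scaler = StandardScaler()
scaled = scaler.fit_transform(fares)

# Manual z-score; np.std uses ddof=0 by default, matching StandardScaler.
manual = (fares - fares.mean(axis=0)) / fares.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

After scaling, the column has mean 0 and unit variance, which is why the preview in the final step shows negative `Age` and `Fare` values.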

### 📦 Final Step: Drop Unused Columns & Preview

In [17]:
df.drop(columns=['Name', 'Ticket', 'PassengerId'], inplace=True)
df.head()

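| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | FamilySize | IsAlone | Title | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | -0.565736 | 1 | 0 | -0.502445 | 0.05916 | 0 | 2 | False | True |
| 1 | 1 | 1 | 0 | 0.663861 | 1 | 0 | 0.786845 | 0.05916 | 0 | 3 | False | False |
| 2 | 1 | 3 | 0 | -0.258337 | 0 | 0 | -0.488854 | -0.560975 | 1 | 1 | False | True |
| 3 | 1 | 1 | 0 | 0.433312 | 1 | 0 | 0.42073 | 0.05916 | 0 | 3 | False | True |
| 4 | 0 | 3 | 1 | 0.433312 | 0 | 0 | -0.486337 | -0.560975 | 1 | 2 | False | True |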


In [19]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=df)

MessageError: Error: credential propagation was unsuccessful

In [20]:
from google.colab import auth
auth.authenticate_user()

MessageError: Error: credential propagation was unsuccessful

# Task
Save the dataframe `df` to a CSV file and provide instructions on how to upload it to Google Sheets.

## Save dataframe to csv

### Subtask:
Generate code to save the current state of the `df` DataFrame to a CSV file in the Colab environment.


**Reasoning**:
Save the dataframe `df` to a CSV file named `titanic_processed.csv` without including the index.



In [21]:
df.to_csv('titanic_processed.csv', index=False)

## Provide upload instructions

### Subtask:
Explain how the user can download the generated CSV file and upload it to Google Sheets.
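
One possible way to fetch the file from Colab is the `files.download` helper in `google.colab`, which asks the browser to save the file locally. The sketch below assumes the notebook may also be run outside Colab, so the import is guarded; `demo_df` is a hypothetical stand-in for the processed `df`.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the processed DataFrame.
demo_df = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

csv_path = Path("titanic_processed.csv")
demo_df.to_csv(csv_path, index=False)

try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import files

    files.download(str(csv_path))
except ImportError:
    print(f"Not running in Colab; {csv_path} was written to the working directory.")
```

Once the file is on the local machine, it can be imported into a spreadsheet through the Google Sheets import dialog.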


## Summary:

### Data Analysis Key Findings

* The DataFrame `df` was successfully saved to a CSV file named `titanic_processed.csv` in the Colab environment, excluding the index.

### Insights or Next Steps

* Download the generated `titanic_processed.csv` file from the Colab environment, for example from the Files pane in the left sidebar.
* Upload the downloaded file to Google Sheets, for example with File > Import > Upload in a new spreadsheet.


# Task
Split the data and train a few classification models.

## Split the data

### Subtask:
Split the `df` DataFrame into features (X) and the target variable (y), and then split these into training and testing sets.


**Reasoning**:
Separate the target column `Survived` from the remaining feature columns, then hold out 20% of the rows as a test set, fixing `random_state` so the split is reproducible.



In [22]:
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
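
To illustrate what an 80/20 split yields, the sketch below runs `train_test_split` on synthetic data. The row count of 891 is an assumption about the Titanic file, the labels are artificial, and the `stratify` argument is an optional addition that the notebook does not use; it keeps the class ratio similar in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumed number of rows in the Titanic file

# Synthetic features and alternating 0/1 labels, for illustration only.
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.array([0, 1] * (n_rows // 2) + [0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # optional: preserve the class balance in both sets
)

print(X_tr.shape, X_te.shape)
print(round(float(y_tr.mean()), 3), round(float(y_te.mean()), 3))
```

With 891 rows and `test_size=0.2`, scikit-learn rounds the test set up to 179 rows and leaves 712 for training.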

## Import models

### Subtask:
Import the necessary libraries for the classification models you want to train (e.g., Logistic Regression, Decision Tree, Random Forest).


**Reasoning**:
Import the required classification model classes from scikit-learn.



In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Train models

### Subtask:
Train each of the selected classification models on the training data.


**Reasoning**:
Instantiate and train the Logistic Regression, Decision Tree, and Random Forest models using the training data.



In [24]:
# Instantiate and train Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Instantiate and train Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Instantiate and train Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

## Evaluate models

### Subtask:
Evaluate the performance of each trained model on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score).


**Reasoning**:
Generate predictions for the test set with each model, then report accuracy, precision, recall, and F1-score, treating `Survived = 1` as the positive class.



In [25]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the testing data
lr_pred = lr_model.predict(X_test)
dt_pred = dt_model.predict(X_test)
rf_pred = rf_model.predict(X_test)

# Calculate and print the same four metrics for each model
predictions = {
    "Logistic Regression": lr_pred,
    "Decision Tree": dt_pred,
    "Random Forest": rf_pred,
}
for model_name, y_pred in predictions.items():
    print(f"{model_name} Performance:")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score: {f1_score(y_test, y_pred):.4f}\n")

Logistic Regression Performance:
Accuracy: 0.8045
Precision: 0.7826
Recall: 0.7297
F1-score: 0.7552

Decision Tree Performance:
Accuracy: 0.7877
Precision: 0.7368
Recall: 0.7568
F1-score: 0.7467

Random Forest Performance:
Accuracy: 0.8324
Precision: 0.8056
Recall: 0.7838
F1-score: 0.7945
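
The four metrics are all derived from the confusion-matrix counts. As a worked example, the sketch below uses counts that are consistent with the reported Logistic Regression figures; they were inferred from the rounded metrics and are an assumption, since the notebook does not print a confusion matrix.

```python
# Counts inferred from the reported Logistic Regression metrics (assumed,
# not printed by the notebook): true/false positives and negatives.
tp, fp, fn, tn = 54, 15, 20, 90

# Share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Of the passengers predicted to survive, the share who actually did.
precision = tp / (tp + fp)

# Of the passengers who actually survived, the share the model found.
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
```

These counts sum to 179 test rows and reproduce the four Logistic Regression values shown above. Calling `sklearn.metrics.confusion_matrix(y_test, lr_pred)` on the real predictions would confirm or correct them.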


## Summary:

### Data Analysis Key Findings
*   The data was split into training and testing sets with a test size of 20%.
*   Three classification models (Logistic Regression, Decision Tree, and Random Forest) were trained on the training data.
*   The Random Forest model scored highest on all four metrics on the testing data: Accuracy 0.8324, Precision 0.8056, Recall 0.7838, and F1-score 0.7945.
*   Logistic Regression came second on Accuracy (0.8045), Precision (0.7826), and F1-score (0.7552), but had the lowest Recall (0.7297).
*   The Decision Tree model had the lowest Accuracy (0.7877), Precision (0.7368), and F1-score (0.7467), although its Recall (0.7568) was higher than that of Logistic Regression.

### Insights or Next Steps
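*   The Random Forest model appears to be the most promising for predicting survival based on the evaluated metrics, though the comparison rests on a single 80/20 split, so the gaps between models may not hold on other splits.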
*   Further analysis could involve hyperparameter tuning for the Random Forest model to potentially improve its performance, or exploring other advanced classification algorithms.
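
As one possible starting point for the tuning step, here is a minimal sketch using `GridSearchCV` with a Random Forest. It runs on synthetic data from `make_classification` so that it is self-contained; the parameter grid, the F1 scoring choice, and the three folds are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X_train / y_train.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # assumed metric; accuracy or recall are equally valid choices
    cv=3,          # 3-fold cross-validation on the training data
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 4))
```

In the notebook, `X_train` and `y_train` would replace the synthetic arrays, and the held-out test set would be used only once, to score the model refit with `search.best_params_`.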
