# Welcome to My Titanic Data Analysis Notebook!

Hey there, amazing folks! 🚢📊

I hope you're doing great because I'm about to dive headfirst into a data analysis adventure with you. And here's the thing – it's my very first time working on a data analysis problem. Yep, you read that right! So, I'm a bit like a kid in a candy store, buzzing with excitement and ready to explore the fascinating world of data.

I've heard the Titanic dataset is a fantastic place to start, and I'm genuinely thrilled to take you along on this journey with me. But, just so you know, I'm still learning the ropes here. So, be nice to me, okay? 😅

In this notebook, we're going to:

1. **Explore the Data**: We'll roll up our sleeves and get to know our Titanic dataset – who's on board, what they're like, and maybe even hear some incredible stories along the way.

2. **Preprocess the Data**: We'll clean things up, handle missing values, and make sure our data is all set for our analysis.

3. **Choose Our Weapon**: Since we're embarking on this adventure together, I've decided to try out some cool machine learning models: HistGradientBoostingClassifier, CatBoost, and XGBoost. We'll explore their strengths and see which one helps us crack the Titanic challenge.

4. **Make Predictions**: With our chosen model, we'll predict who survived the Titanic journey and who didn't. It's a bit like being a detective trying to solve a century-old mystery.

5. **Share Our Findings**: Finally, we'll wrap it all up and share our discoveries, because that's what this adventure is all about – learning and sharing the excitement of data analysis!

So, whether you're an experienced data wizard or someone just as new to this as I am, I'm glad you're here. Let's make some data magic together! 🎩✨

Now, without further ado, let's set sail on this Titanic adventure! 🌊🚢


In [5]:
# first, import the necessary libraries
import pandas as pd

from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score

import warnings
warnings.filterwarnings("ignore")

In [6]:
# read the Source Data
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

# Checking the first few rows for an initial glimpse.

In [7]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
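
Before cleaning anything, it helps to look at what's missing and how survival differs between groups. Here's a tiny sketch of that first look. It uses only the five rows shown above (typed in by hand as a toy frame, so the numbers say nothing about the full dataset), and the variable names are my own.

```python
import pandas as pd

# Toy frame built from the five rows shown above (not the full dataset).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Survival rate per group: the mean of a 0/1 column is the share of survivors.
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

On the real data, the same two calls on `train_data` should show which columns need filling and which features look promising.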


# Creating a list of features we'll use for analysis and modeling.

In [8]:
# create a list with features
list_of_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# This function is for performing data cleaning operations.

In [11]:
def clean_data(data):
    # Placeholder: returns an unchanged copy for now; the real cleaning steps come in a later cell
    cleaned_data = data.copy()

    return cleaned_data

# here we apply the function to data

In [12]:
# applying a function to data
train_data = clean_data(train_data)
test_data = clean_data(test_data)

# Displaying the column names of the train_data DataFrame.

In [18]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

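# Function to transform data by filling missing numeric values with medians and labelling missing 'Embarked' values as 'U' (unknown).

In [19]:
def clean_data(data):
    """Fill missing numeric values with the column median and missing 'Embarked' with 'U'."""
    data = data.copy()  # work on a copy so the caller's DataFrame is left untouched

    columns = ['Age', 'SibSp', 'Parch', 'Fare']
    for col in columns:
        # replace missing entries with the column median and assign the result back
        data[col] = data[col].fillna(data[col].median())

    # keep every row: a missing port becomes 'U', which the encoding step maps to 3
    data['Embarked'] = data['Embarked'].fillna('U')
    return data

# apply the finished function to both datasets
train_data = clean_data(train_data)
test_data = clean_data(test_data)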

# Checking the first few rows of the train_data DataFrame after data transformation.

In [20]:
# check transform data
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Displaying column names.

In [21]:
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# Converting 'Sex' values to numerical format: 'male' to 0, 'female' to 1.

In [25]:
sex_codes = {'male': 0, 'female': 1}
train_data['Sex'] = train_data['Sex'].map(sex_codes)
test_data['Sex'] = test_data['Sex'].map(sex_codes)

# Converting 'Embarked' values to numerical format in train_data and test_data:
# 'S' to 0, 'C' to 1, 'Q' to 2, 'U' to 3.

In [26]:
embarked_codes = {'S': 0, 'C': 1, 'Q': 2, 'U': 3}
train_data['Embarked'] = train_data['Embarked'].map(embarked_codes)
test_data['Embarked'] = test_data['Embarked'].map(embarked_codes)
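
One way to double-check an encoding like the one above is to build it from a dictionary and then look for values that had no code. This is just a sketch on a made-up column: the `"X"` value and the names `codes`, `encoded`, and `unmapped` are hypothetical and only there to show the check.

```python
import pandas as pd

# Hypothetical column with one unexpected value ("X") and one missing entry.
embarked = pd.Series(["S", "C", "Q", None, "X"])
codes = {"S": 0, "C": 1, "Q": 2, "U": 3}

# map() turns every value without a code into NaN.
encoded = embarked.map(codes)

# Values that were present in the data but have no code.
unmapped = embarked[encoded.isna() & embarked.notna()].unique()
print(list(unmapped))
```

If the list comes back non-empty on the real column, the dictionary is missing a category.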

# Creating feature matrix X and target vector y from train_data.

In [27]:
X = train_data[list_of_features]
y = train_data['Survived']

# HistGradientBoostingClassifier

So, check this out: we're going to use the HistGradientBoostingClassifier for the Titanic dataset, and here's why it's pretty cool:

- It has native support for categorical features (through its `categorical_features` option), so one-hot encoding isn't strictly required. In this notebook we still encode 'Sex' as 0/1 ourselves.
- It handles missing values on its own, which is handy for a column like 'Age'.
- Remember those 'Fare' and 'Age' values that kind of stick out? The model bins each feature into a histogram and splits on thresholds, so extreme values matter much less than they would for a linear model.
- It's all about teamwork. Gradient boosting builds many small decision trees one after another, each one correcting the mistakes of the ones before it.
- Trees can capture non-linear effects and interactions between features (say, class and sex together) without us spelling them out.
- Big datasets? No problem. The histogram trick is what makes it fast on tens of thousands of rows or more. Titanic only has 891 training rows, so speed isn't really a concern here.
- Like any model, it can overfit, so we check it on a held-out validation set before trusting it on passengers it has never seen.
- It picks which features to split on as it goes, so uninformative features tend to get used less.

So, we're firing up this model, training it, and getting ready to make some predictions. Let's do this!


In [None]:
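# HistGradientBoostingClassifier is a stable estimator since scikit-learn 1.0;
# only older versions need `from sklearn.experimental import enable_hist_gradient_boosting` first
from sklearn.ensemble import HistGradientBoostingClassifier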


# Define the list of features
list_of_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

In [None]:
# Extract features (X) and target variable (y)
X = train_data[list_of_features]
y = train_data['Survived']

# Extract the same features from the Kaggle test data
X_test = test_data[list_of_features]

In [None]:
# Split the training data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, test_size=0.15, random_state=1)
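
The `stratify=y` argument is what keeps the share of survivors roughly equal in the training and validation parts. Here's a small sketch of that effect on toy labels (100 made-up samples with 40% positives, not the Titanic data); the `_tr` and `_va` names are just for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 100 samples, 40% positive (not the Titanic data).
y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.15, random_state=1
)

# Both parts should show roughly the same positive rate as the full set.
print(len(y_tr), round(y_tr.mean(), 3))
print(len(y_va), round(y_va.mean(), 3))
```

Without `stratify`, a small validation set can end up with noticeably more or fewer survivors than the training set purely by chance.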

In [None]:
# Create and fit the HistGradientBoostingClassifier model
hist_gradient_boosting_model = HistGradientBoostingClassifier()
hist_gradient_boosting_model.fit(X_train, y_train)

# Evaluate the HistGradientBoostingClassifier model on the validation set
hist_gradient_boosting_valid_accuracy = hist_gradient_boosting_model.score(X_valid, y_valid)
print("HistGradientBoostingClassifier Validation Accuracy:", round(hist_gradient_boosting_valid_accuracy, 3))

# Make predictions on the Kaggle test set
hist_gradient_boosting_pred = hist_gradient_boosting_model.predict(X_test)


HistGradientBoostingClassifier Validation Accuracy: 0.799


# XGBoost

Hey there! If you're like me, you're trying to figure out which machine learning model to use for the Titanic dataset. Let's break it down in simple terms:

1. Learning Speed and Efficiency:
   - XGBoost: A heavily optimized gradient boosting library. It builds trees in parallel across CPU cores and also offers a histogram-based tree method.
   - HistGradientBoostingClassifier: This one's no slouch either. It groups each feature into a small number of bins (a histogram) before looking for splits, which is much cheaper than checking every distinct value.

2. Handling Large Datasets:
   - XGBoost: It's built to scale to large datasets.
   - HistGradientBoostingClassifier: It's got "Hist" in its name for a reason. That histogram binning is what lets it cope with big datasets too.
   - Titanic itself is tiny (891 training rows), so either model trains quickly here and speed won't decide the winner.

3. Choosing Your Model:
   So, which one to pick? On a dataset this small, the practical answer is to try both and compare them on the same held-out data, which is what the cells below do. They also fit a CatBoostClassifier (another gradient boosting library) with default settings as an extra point of comparison.

Remember, I'm no expert, but I hope this helps you decide which model to try first. Happy coding!


In [33]:

# Define the list of features
list_of_features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

# Extract features (X) and target variable (y)
X = train_data[list_of_features]
y = train_data['Survived']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.15, random_state=1)


In [34]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((757, 6), (134, 6), (757,), (134,))

In [35]:
# Split the training data once more to carve out a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, stratify=y_train, test_size=0.15, random_state=1)

# Fit a CatBoost model with default settings; eval_set lets it track the validation loss
model = CatBoostClassifier(verbose=False)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])



<catboost.core.CatBoostClassifier at 0x7ebb9939dc60>

In [36]:
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("F1 Score:", round(f1_score(y_test, y_pred), 3))
print("Recall:", round(recall_score(y_test, y_pred), 3))
print("Precision:", round(precision_score(y_test, y_pred), 3))

Accuracy: 0.806

F1 Score: 0.723

Recall: 0.667

Precision: 0.791
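
To see how these four numbers hang together, here's a sketch that rebuilds them from confusion-matrix counts. The counts are my own reconstruction, not something the notebook printed: they assume the 134-row test split holds 51 survivors, and they are one set of counts consistent with the reported metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts inferred from the reported metrics (an assumption, not notebook output).
tp, fp, fn, tn = 34, 9, 17, 74

# Rebuild label vectors that produce exactly these counts.
y_true_demo = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred_demo = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

acc = round(accuracy_score(y_true_demo, y_pred_demo), 3)    # (tp + tn) / total
prec = round(precision_score(y_true_demo, y_pred_demo), 3)  # tp / (tp + fp)
rec = round(recall_score(y_true_demo, y_pred_demo), 3)      # tp / (tp + fn)
f1 = round(f1_score(y_true_demo, y_pred_demo), 3)           # 2*tp / (2*tp + fp + fn)
print(acc, prec, rec, f1)
```

Read this way, the lower recall means the model misses about a third of the actual survivors, while most of the passengers it does flag as survivors really survived.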


In [37]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

[0]	validation_0-logloss:0.58681

[1]	validation_0-logloss:0.54045

[2]	validation_0-logloss:0.50987

[3]	validation_0-logloss:0.49295

[4]	validation_0-logloss:0.48760

[5]	validation_0-logloss:0.48591

[6]	validation_0-logloss:0.47612

[7]	validation_0-logloss:0.47584

[8]	validation_0-logloss:0.47436

[9]	validation_0-logloss:0.47702

[10]	validation_0-logloss:0.47853

[11]	validation_0-logloss:0.48057

[12]	validation_0-logloss:0.47166

[13]	validation_0-logloss:0.47456

[14]	validation_0-logloss:0.47184

[15]	validation_0-logloss:0.47475

[16]	validation_0-logloss:0.47627

[17]	validation_0-logloss:0.47465

[18]	validation_0-logloss:0.47396

[19]	validation_0-logloss:0.47609

[20]	validation_0-logloss:0.47697

[21]	validation_0-logloss:0.47599

[22]	validation_0-logloss:0.47634

[23]	validation_0-logloss:0.47402

[24]	validation_0-logloss:0.48046

[25]	validation_0-logloss:0.47911

[26]	validation_0-logloss:0.47839

[27]	validation_0-logloss:0.47826

[28]	validation_0-logloss:0.47
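
The log-loss in the log above stops improving fairly early, which is the situation early stopping is meant for. XGBoost has an `early_stopping_rounds` option for this (where exactly it is passed depends on the installed version). The sketch below doesn't call XGBoost at all: it just replays a simple patience rule in plain Python on the logged values, with a patience of 10 picked by me for illustration.

```python
# Validation log-loss values copied from the log above (rounds 0 to 27;
# the truncated final entry is left out).
logloss = [
    0.58681, 0.54045, 0.50987, 0.49295, 0.48760, 0.48591, 0.47612,
    0.47584, 0.47436, 0.47702, 0.47853, 0.48057, 0.47166, 0.47456,
    0.47184, 0.47475, 0.47627, 0.47465, 0.47396, 0.47609, 0.47697,
    0.47599, 0.47634, 0.47402, 0.48046, 0.47911, 0.47839, 0.47826,
]

def early_stop(values, patience):
    """Return (best_round, stop_round) for a simple patience rule."""
    best_round = 0
    for i, value in enumerate(values):
        if value < values[best_round]:
            best_round = i                 # new best score, reset the wait
        elif i - best_round >= patience:
            return best_round, i           # waited long enough, stop here
    return best_round, len(values) - 1     # never triggered, ran to the end

best_round, stop_round = early_stop(logloss, patience=10)
print(best_round, stop_round)
```

Under this rule, the extra rounds after the best one add trees without lowering the validation loss.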

In [38]:
xgb_valid_accuracy = round(xgb.score(X_valid, y_valid), 3)  # validation accuracy, stored rather than displayed
pred_xgb = xgb.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, pred_xgb), 3))
print("F1 Score:", round(f1_score(y_test, pred_xgb), 3))
print("Recall:", round(recall_score(y_test, pred_xgb), 3))
print("Precision:", round(precision_score(y_test, pred_xgb), 3))

Accuracy: 0.821

F1 Score: 0.765

Recall: 0.765

Precision: 0.765


In [39]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': xgb.predict(test_data[list_of_features])})
output.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 1 |
| 4 | 896 | 0 |
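
The last step for sharing the predictions is usually writing them to a CSV file. Here's a minimal sketch, assuming a two-column file is what's wanted. The file name `submission.csv` and the temporary folder are my own choices, and the frame holds only the five rows shown above as a stand-in for the full `output`.

```python
import os
import tempfile

import pandas as pd

# The five predictions shown above, used as a stand-in for the full frame.
output_demo = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 0, 0, 1, 0],
})

# "submission.csv" is an assumed file name; a temporary folder keeps the sketch tidy.
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
output_demo.to_csv(path, index=False)  # index=False avoids an extra unnamed column

saved = pd.read_csv(path)
print(saved.columns.tolist())
```

Reading the file back is a quick way to confirm that only the two expected columns were written.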
