# Introduction

#### Overview of the Titanic dataset

The Titanic dataset is a famous dataset in machine learning and data analysis, containing data on the passengers aboard the Titanic and their survival status. It consists of a training dataset with 891 observations and a test dataset with 418 observations. The goal is to predict the survival of passengers in the test dataset based on the variables in the training dataset.

# Import libraries
To begin, let's import the necessary libraries that we'll be using throughout this notebook:

In [None]:
# Data Manipulation Libraries
import pandas as pd
import numpy as np

# Data Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

# Loading the dataset

In this section, we load the Titanic dataset into the notebook. The dataset is stored in two CSV files, one for the training data and one for the test data. We use pandas to read each file into a dataframe that we can then explore and prepare for machine learning modeling.

In [None]:
# reading the train data
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")

# reading the test data
test = pd.read_csv("/kaggle/input/titanic/test.csv")

# Understanding the Variables

Before we can begin analyzing the Titanic dataset, it's important to understand what each variable represents. The dataset contains the following variables:

PassengerId: A unique identifier for each passenger.

Survived: Whether or not the passenger survived (0 = No, 1 = Yes).

Pclass: The passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).

Name: The name of the passenger.

Sex: The gender of the passenger.

Age: The age of the passenger in years. Fractional values are included for infants.

SibSp: The number of siblings/spouses aboard the Titanic.

Parch: The number of parents/children aboard the Titanic.

Ticket: The ticket number for the passenger.

Fare: The fare paid by the passenger.

Cabin: The cabin number for the passenger (if available).

Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

# Data Exploration and Preparation

In [None]:
# we use method head() to show the first 5 rows
train_df.head()


In [None]:
test.head()


In [None]:
print(test.shape)
print(train_df.shape)

As shown above, the test set has 418 records and 11 columns, while the training set has 891 records and 12 columns. The extra column in the training set is the target, Survived.


In [None]:
train_df.describe()

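The describe() call above summarizes each numeric column of train_df: count, mean, standard deviation, minimum, quartiles, and maximum. Each statistic is shown as a row and each variable as a column.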


In [None]:
# Count the duplicated rows in the train data

train_df.duplicated().sum()

In [None]:
# Count the null values per column in the train data

train_df.isna().sum()

In [None]:
# Count the duplicated rows in the test data

test.duplicated().sum()

In [None]:
# Count the null values per column in the test data
test.isna().sum()
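
Raw null counts are easier to compare across columns when expressed as a share of the rows. The sketch below shows one way to do that on a small made-up frame; `toy_missing` and its values are illustrative assumptions, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df / test
toy_missing = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() gives a boolean frame; the column mean is the fraction of nulls
missing_share = toy_missing.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high share of nulls are candidates for dropping, while columns with a small share can often be imputed.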

In [None]:
# The number of men who survived
train_df[train_df['Sex']=='male']['Survived'].sum()

In [None]:
# The number of women who survived
train_df[train_df['Sex']=='female']['Survived'].sum()

The code below uses Seaborn to plot, for each of the columns 'Sex', 'Embarked', 'Pclass', 'SibSp', and 'Parch', the number of passengers who survived and who did not survive the disaster.

A loop repeats the plot for each column in the list ['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch'].

Seaborn's 'countplot' function draws one group of bars per value of the specified column (such as sex or ticket class). Passing hue='Survived' splits each group by outcome, with a distinct color for passengers who survived and those who did not.

In [None]:

for column_name in ['Sex','Embarked','Pclass', 'SibSp', 'Parch']:
    print(column_name)
    sns.countplot(data=train_df, x=column_name, hue='Survived')
    plt.show()
    print("")
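
The counts behind each plot can also be read as a table. The following is a minimal sketch using `pd.crosstab` on a made-up frame; `toy_counts_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_counts_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Rows: values of the column; columns: Survived = 0 / 1 (same numbers a countplot draws)
counts = pd.crosstab(toy_counts_df["Sex"], toy_counts_df["Survived"])
print(counts)

# Share of survivors within each category
rates = toy_counts_df.groupby("Sex")["Survived"].mean()
print(rates)
```

The survival rate per category is often easier to compare than raw counts when the categories have very different sizes.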

In [None]:
sns.histplot(train_df['Age'])

The 'histplot' function from Seaborn is used to draw the histogram. This function displays the frequency distribution of the 'Age' column, i.e., the number of passengers in each age group.

The histogram can be used to identify patterns in the age distribution of the passengers in the 'train_df' dataset, such as the most common age group or the presence of outliers.

In [None]:
# calculate the mean age of male passengers in the 'train_df' dataset.
mean_male = train_df[train_df['Sex']=='male']['Age'].mean()
mean_male

In [None]:
# calculate the mean age of female passengers in the 'train_df' dataset.
mean_female = train_df[train_df['Sex']=='female']['Age'].mean()
mean_female

As we have seen, the mean age of male passengers differs from the mean age of female passengers, so the missing ages are filled separately for each sex.

In the code below, the 'fillna' function fills the missing values in the 'Age' column of male and female passengers with 'mean_male' and 'mean_female', respectively.

This is done by first selecting the rows of 'train_df' where 'Sex' is 'male' using the condition 'train_df['Sex']=='male''. Then, for these rows, the missing values in the 'Age' column are filled with 'mean_male'. The same steps are repeated for female passengers with 'mean_female'.

In [None]:
train_df.loc[train_df['Sex']=='male', 'Age'] = train_df[train_df['Sex']=='male']['Age'].fillna(value=mean_male)

In [None]:
train_df.loc[train_df['Sex']=='female', 'Age'] = train_df[train_df['Sex']=='female']['Age'].fillna(value=mean_female)
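
The two per-sex fills can also be written as a single step with `groupby(...).transform('mean')`. This is an alternative sketch, not the approach used in this notebook; `toy_age_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_age_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Mean age per sex, aligned row by row with the original frame
group_mean = toy_age_df.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages with the mean of the passenger's own group
toy_age_df["Age"] = toy_age_df["Age"].fillna(group_mean)
print(toy_age_df)
```

This form scales to more than two groups without adding one line per group.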

In [None]:
train_df.isna().sum()

As we have seen, there are no remaining null values in the 'Age' column.

In [None]:
# Dropping some unimportant features
train_df.drop(['PassengerId','SibSp','Parch','Ticket','Cabin','Name'], axis=1, inplace= True)

In [None]:
train_df.head()

In [None]:
train_df.dropna(inplace=True)

The 'dropna' function is used to remove any rows in the 'train_df' dataset that contain missing values (i.e., NaN values). The 'inplace=True' parameter is used to modify the 'train_df' dataset directly, rather than returning a new modified dataset.
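
Because 'dropna' removes whole rows, it can help to restrict it to specific columns and to report how many rows were lost. Below is a minimal sketch on a made-up frame; `toy_drop_df` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_drop_df = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 13.00],
    "Embarked": ["S", np.nan, "Q", "S"],
})

rows_before = len(toy_drop_df)

# subset= limits the null check to the listed columns
toy_drop_df = toy_drop_df.dropna(subset=["Embarked"])

print(f"dropped {rows_before - len(toy_drop_df)} of {rows_before} rows")
```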

In [None]:
train_df.isna().sum()

As we have seen above, there are no remaining null values.

In [None]:
train_df.replace({'female':0,'male':1},inplace=True)

The 'replace' function is used to replace the values 'female' and 'male' in the 'train_df' dataset with 0 and 1, respectively. This is done by passing a dictionary with the keys 'female' and 'male' and their corresponding values 0 and 1 to the 'replace' function. The 'inplace=True' parameter is used to modify the 'train_df' dataset directly, rather than returning a new modified dataset.
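
Calling 'replace' on the whole dataframe would also rewrite the strings 'female' and 'male' if they appeared in any other column. A narrower alternative, shown here as a sketch on made-up data (`toy_sex_df` is hypothetical), is to map only the 'Sex' column.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df
toy_sex_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Map only the 'Sex' column; other columns are left untouched
toy_sex_df["Sex"] = toy_sex_df["Sex"].map({"female": 0, "male": 1})

# Labels missing from the mapping become NaN, so confirm none slipped through
assert toy_sex_df["Sex"].notna().all()
print(toy_sex_df)
```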



In [None]:
train_df = pd.get_dummies(train_df,columns=['Embarked'],prefix='Embarked')

The 'get_dummies' function from pandas is used to create dummy variables for the 'Embarked' column in the 'train_df' dataset. The dataframe is passed as the first argument, and the column to encode is passed through the 'columns' parameter. The 'prefix' parameter sets the prefix 'Embarked', which is joined to each category with an underscore to form the new column names.

The resulting 'train_df' dataset now contains new columns 'Embarked_C', 'Embarked_Q', and 'Embarked_S', with indicator values (0/1, or False/True in recent pandas versions) showing whether a passenger embarked from the corresponding port.
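
Two details are worth checking when dummy variables are created separately for the training and test data: the dtype of the new columns and whether both frames end up with the same columns. The sketch below uses made-up frames (`train_toy` and `test_toy` are hypothetical) and is one possible way to handle both.

```python
import pandas as pd

# Hypothetical toy frames; the test frame happens to contain only port 'S'
train_toy = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Fare": [7.90, 26.00], "Embarked": ["S", "S"]})

# dtype=int requests 0/1 columns regardless of the pandas default
train_enc = pd.get_dummies(train_toy, columns=["Embarked"], prefix="Embarked", dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["Embarked"], prefix="Embarked", dtype=int)

# Give the test frame the training frame's columns, in the same order;
# dummy columns absent from the test data are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

Without the reindex step, a category that is missing from the test data would leave the test frame one column short of what the model was trained on.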

In [None]:
train_df.head()

### Split train data

In [None]:
# split data into x and y
x = train_df.drop('Survived',axis =1)
y = train_df['Survived']
x

This code standardizes the values of the 'Age' and 'Fare' columns in the 'x' dataframe using the StandardScaler from the scikit-learn library.

In [None]:
# Select numerical columns
num_cols = ['Age', 'Fare']

# Create scaler object
scaler = StandardScaler()

# Fit scaler on selected columns
scaler.fit(x[num_cols])

# Transform selected columns with scaler
x[num_cols] = scaler.transform(x[num_cols])

In [None]:
# split the train data into a training part and a held-out test part
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)
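
In this notebook the scaler is fitted on all of 'x' before the split, so the held-out rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates that ordering on randomly generated data; every name prefixed with `toy_` or suffixed with `_toy` is an assumption and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical random data standing in for x and y
rng = np.random.default_rng(0)
x_toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=50),
    "Fare": rng.uniform(5, 500, size=50),
    "Pclass": rng.integers(1, 4, size=50),
})
y_toy = pd.Series(rng.integers(0, 2, size=50))
toy_num_cols = ["Age", "Fare"]

# Split first, then copy so the assignments below do not touch x_toy
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, random_state=10)
x_tr, x_te = x_tr.copy(), x_te.copy()

# Fit on the training part only, then apply the same transform to both parts
toy_scaler = StandardScaler()
x_tr[toy_num_cols] = toy_scaler.fit_transform(x_tr[toy_num_cols])
x_te[toy_num_cols] = toy_scaler.transform(x_te[toy_num_cols])
```

With this ordering, the held-out scores are not influenced by statistics computed from the held-out rows.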

# Modeling

In [None]:
# Initialize the models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'XGBClassifier': XGBClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier(),
}

# Train and evaluate each model using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
    
    # Fit the model to the full training set and make predictions on the test set
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    
    # Evaluate the model on the test set
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    print(f"Accuracy: {acc:.3f}")
    print(f"Precision: {prec:.3f}")
    print(f"Recall: {rec:.3f}")
    print(f"F1-score: {f1:.3f}")
    print()
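
For a classifier, an integer 'cv' in 'cross_val_score' already produces stratified folds, but without shuffling. Passing a 'StratifiedKFold' object (imported at the top of the notebook) makes the shuffling and the seed explicit. The following is a sketch on synthetic data from `make_classification`, which stands in for the Titanic features; the names `X_toy`, `y_cls_toy`, and `toy_cv` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (assumption, not the Titanic data)
X_toy, y_cls_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Shuffled, seeded, stratified folds so the split is reproducible
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

toy_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_cls_toy, cv=toy_cv, scoring="accuracy"
)
print(f"accuracy: {toy_scores.mean():.3f} +/- {toy_scores.std():.3f}")
```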

After comparing the accuracy of several models, we found that the GradientBoostingClassifier model had the highest accuracy.

So I will use it to make predictions on the test set.

In [None]:
gbc = GradientBoostingClassifier()
scores = cross_val_score(gbc, x_train, y_train, cv=5, scoring='accuracy')
print(f"{gbc} accuracy: {scores}")

# Fit the model to the full training set and make predictions on the test set
gbc.fit(x_train, y_train)
y_pred = gbc.predict(x_test)

# Evaluate the model on the test set
acc = accuracy_score(y_test, y_pred)
print(acc)
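
Accuracy alone does not show which kind of error a model makes. The 'confusion_matrix' and 'classification_report' functions imported at the top of the notebook give a per-class view. The sketch below uses made-up labels; `y_true_toy`, `y_hat_toy`, and the class names are illustrative assumptions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (0 = did not survive, 1 = survived)
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat_toy = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_hat_toy)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_true_toy, y_hat_toy, target_names=["Died", "Survived"]))
```

In the notebook, the same two calls could be applied to 'y_test' and 'y_pred'.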

# Test data preparation

In [None]:
# show the first 5 rows
test.head()

In [None]:
# fill the null values in the Age column with the mean age for each sex
mean_male_test = test[test['Sex']=='male']['Age'].mean()
mean_female_test = test[test['Sex']=='female']['Age'].mean()
test.loc[test['Sex']=='male', 'Age'] = test[test['Sex']=='male']['Age'].fillna(value=mean_male_test)
test.loc[test['Sex']=='female', 'Age'] = test[test['Sex']=='female']['Age'].fillna(value=mean_female_test)

# fill the null values in the Fare column with the median fare
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

# Dropping some unimportant features
test.drop(['SibSp','Parch','Ticket','Cabin','Name'], axis=1, inplace=True)




In [None]:
test.isna().sum()

As we have seen above, there are no remaining null values.

In [None]:
# convert string values to numeric by replacing 'female' and 'male' with 0 and 1
test.replace({'female':0,'male':1},inplace=True)

In [None]:
# use one-hot encoding to convert the string values in 'Embarked'
test = pd.get_dummies(test,columns=['Embarked'],prefix='Embarked')

In [None]:
# Reuse the scaler fitted on the training data; refitting it on the test set
# would scale the test features differently from the features the model was trained on
test[num_cols] = scaler.transform(test[num_cols])

In [None]:
test.head()

In [None]:
# Store the PassengerId column in a separate variable

PassengerId = test['PassengerId']

# drop PassengerId column from the test set
test.drop('PassengerId',axis=1,inplace=True)

# Predictions and Submission

In [None]:
# Generate predictions for the test data using GradientBoostingClassifier
test_pred = gbc.predict(test)


Create a submission file with the PassengerId column and the predicted survival outcomes.


In [None]:
submission = pd.DataFrame({'PassengerId': PassengerId, 'Survived': test_pred})

In [None]:
submission.to_csv('submission.csv', index=False)
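
Before uploading, a few quick checks on the submission frame can catch format problems early. This is a sketch on a made-up frame; `submission_toy`, the IDs, and the predictions are hypothetical, and the checks reflect the two-column layout built above.

```python
import pandas as pd

# Hypothetical IDs and predictions standing in for PassengerId and test_pred
toy_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_preds = [0, 1, 0]

submission_toy = pd.DataFrame({"PassengerId": toy_ids, "Survived": toy_preds})

# Expected layout: exactly these two columns, one row per passenger
assert list(submission_toy.columns) == ["PassengerId", "Survived"]
assert submission_toy["PassengerId"].is_unique

# Predictions should be 0 or 1, with no missing values anywhere
assert submission_toy["Survived"].isin([0, 1]).all()
assert not submission_toy.isna().any().any()
print(submission_toy.shape)
```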