<a href="https://www.kaggle.com/code/alirizaercan/titanic-data-science-explained-and-details?scriptVersionId=167048015" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction
***This notebook works through the Titanic dataset, a standard starting point for anyone learning data science. I will walk through the code from a beginner's point of view. If you are ready, let's go!***


**The notebook is divided into parts that together cover the whole data science life cycle. If you want to be a successful data scientist, analyst, or engineer, you shouldn't skip any of these steps!**

<font color = 'darkblue'>
Content:

1. [Problem Definition and Project Planning](#1)
2. [Import Libraries](#2)   
3. [Load and Check Data](#3)
4. [Understand Dataset](#4)
5. [Variable Description](#5)
6. [Exploratory Data Analysis](#6)   
    * [Univariate Variable Analysis(EDA)](#7)
7. [Basic Data Analysis](#8)
8. [Data Cleaning](#9)
    * [Outlier Detection](#10)
    * [Missing Values](#11)
9. [Feature Engineering](#12)
10. [Modeling](#13)
11. [Prediction and Submission](#14)

<a id = "1"></a><br>
# Problem Definition and Project Planning

***Problem Definition:***

The problem at hand is to build a predictive model that can answer the question: "What sorts of people were more likely to survive the sinking of the Titanic?" The Titanic struck an iceberg during its maiden voyage and sank on April 15, 1912, and a significant number of passengers lost their lives. The challenge is to analyze passenger data, including attributes such as name, age, gender, and socio-economic class, and to predict which passengers were more likely to survive based on these characteristics.

**Project Planning:**

* ***Understanding the Objective:***

The problem is stated in the problem definition above. Our goal is to produce the best **submission.csv** file we can.

* ***Data Explanation:***

    ***Data Split:***

    ***The dataset is divided into two groups:***

    ***Training set (train.csv)***: Used to build machine learning models, with the ground truth provided for each passenger.

    ***Test set (test.csv)***: Used to evaluate model performance on unseen data, with no ground truth provided.

    ***Target Variable:***

    *Survival:* Binary variable indicating whether a passenger survived (1) or not (0).

    ***Key Features:***

    * *Pclass:* Ticket class representing socio-economic status (1st = Upper, 2nd = Middle, 3rd = Lower).
    * *Sex:* Gender of the passenger.
    * *Age:* Age of the passenger in years.
    * *SibSp:* Number of siblings/spouses aboard the Titanic.
    * *Parch:* Number of parents/children aboard the Titanic.
    * *Ticket:* Ticket number.
    * *Fare:* Passenger fare.
    * *Cabin:* Cabin number.
    * *Embarked:* Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

    ***Variable Notes:***

    * *Pclass:* A proxy for socio-economic status (SES).
    * *Age:* Fractional if less than 1, or in the form of xx.5 if estimated.
    * *SibSp and Parch:* Define family relations, specifying siblings, spouses, parents, and children.
    * *Some children traveled only with a nanny, resulting in Parch=0 for them.*
   
* ***Import Libraries:***
    *We will import the necessary libraries.*
* ***Load and Check Data:***
    *We will read the CSV files and take a first look at the data.*
* ***Understand Dataset:***
    *We will get to know the dataset by inspecting the dataframe.*
* ***Variable Description:***
    *We will review the variables by type: categorical or numerical.*
* ***Exploratory Data Analysis (EDA):***
    *We will explore the dataset in more depth, focusing on univariate analysis. We will also visualize the data, because plots make it much easier to understand.*
* ***Basic Data Analysis:***
    *We will run some basic analyses to see how individual features relate to survival.*
* ***Data Cleaning:***
    *We will clean the data. This is the most important step in data science and is often said to take about 80% of the work in a project, so you cannot overlook it. Where necessary, we will remove duplicates, handle missing values, fix data types, detect and remove outliers, and validate the result.*
* ***Feature Engineering:***
    *We will build new features to improve the solution. This is where you can show your creative side.*
* ***Modeling:***
    *We will compare several machine learning models and choose the one that fits our data best. A well-chosen model gives a better score.*
* ***Submission File:***
    *We will create our submission file.*
    
**Now we can start the code, following the steps in the plan. Try to understand every step: if you do, you will be well on your way to becoming a very good data scientist. Let's start!**

<a id = "2"></a><br>
# Import Libraries
We need a few libraries for this project, so we import them first. We have not chosen a model yet, so the machine learning libraries are imported later. For now we import NumPy, pandas, `matplotlib.pyplot`, seaborn, `Counter`, and `warnings`. Their roles in data science are:

**NumPy:**
Provides efficient numerical computation capabilities for arrays and matrices.

**Pandas:**
Offers high-performance, easy-to-use data structures and data analysis tools for labeled data.

**Matplotlib:**
Creates various static, animated, and interactive visualizations for data exploration and communication.

**Seaborn:**
Builds upon Matplotlib to create high-level statistical graphics with a focus on aesthetics and ease of use.

**from collections import Counter:**
Creates a dictionary-like object (Counter) that counts the occurrences of elements in an iterable (like a list or string).

**warnings:**
Controls how Python handles warning messages.
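
To make the role of `Counter` concrete, here is a minimal sketch of how it can count repeated row indices, which is how it is used later for outlier detection. The index values and the names `flagged` and `repeated` are invented for this illustration.

```python
from collections import Counter

# Hypothetical row indices flagged as outliers, one entry per feature in which
# the row looked extreme (values invented for illustration).
flagged = [27, 88, 27, 159, 27, 88]

# Counter maps each index to the number of times it appears.
counts = Counter(flagged)

# Keep only the rows that were flagged in more than two features.
repeated = [idx for idx, n in counts.items() if n > 2]
print(counts[27], repeated)
```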

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
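# Matplotlib 3.6 renamed the bundled seaborn styles with a "seaborn-v0_8-" prefix.
whitegrid = "seaborn-v0_8-whitegrid"
plt.style.use(whitegrid if whitegrid in plt.style.available else "seaborn-whitegrid")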

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")



import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id = "3"></a><br>
# Load and Check Data
We will load and check data in this step.

In [None]:
train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
test_PassengerId = test_df["PassengerId"]

We load the training and test sets into `train_df` and `test_df`, and keep the test-set `PassengerId` values in `test_PassengerId` because we will need them for the submission file.

In [None]:
train_df

<a id = "4"></a><br>
# Understand Dataset
A few short commands give a first impression of the dataset.

We can see the columns of train dataset:

In [None]:
train_df.columns

The cells below show the first 10 and the last 10 rows:

In [None]:
train_df.head(10)

In [None]:
train_df.tail(10)

The cell below shows summary statistics for the numerical columns:

In [None]:
train_df.describe()

<a id = "5"></a><br>
# Variable Description
In this step we get to know the variables in the dataset. If we don't understand them, we can't get good results!

*  **PassengerId:** unique id number to each passenger
*  **Survived:** whether the passenger survived (1) or died (0)
*  **Pclass:** passenger class
*  **Name:** name
*  **Sex:** gender of passenger 
*  **Age:** age of passenger 
*  **SibSp:** number of siblings/spouses
*  **Parch:** number of parents/children 
*  **Ticket:** ticket number 
*  **Fare:** amount of money spent on ticket
*  **Cabin:** cabin number
*  **Embarked:** port where passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

The `.info()` method gives more detail about the variables, including their data types:

In [None]:
train_df.info()


* int64(5): PassengerId, Survived, Pclass, SibSp, Parch

* float64(2): Age, Fare

* object(5): Name, Sex, Ticket, Cabin, Embarked


<a id = "6"></a><br>
# Exploratory Data Analysis (EDA) 
Exploratory Data Analysis (EDA) lets us understand the data in more depth. In this step we focus on univariate analysis and visualize each variable. First we separate the variables into categorical and numerical:


**Categorical Variables:** Survived, Pclass, Name, Sex, SibSp, Parch, Ticket, Cabin, Embarked

**Numerical Variables:** PassengerId, Age, Fare

<a id = "7"></a><br>
## Univariate Variable Analysis

First, let's define univariate analysis:
Univariate analysis is a fundamental statistical technique used to explore and understand the distribution of a single variable within a dataset. It focuses on summarizing the data, identifying patterns, and describing the characteristics of that single variable.
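
As a small illustration of this definition, the sketch below summarizes one categorical and one numerical column on their own. The toy values are invented for the example and are not taken from `train.csv`.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from train.csv).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Fare": [7.25, 71.28, 8.05, 13.00, 26.55],
})

# Categorical variable: count how often each category occurs.
sex_counts = toy["Sex"].value_counts()

# Numerical variable: summarize location and spread.
fare_summary = toy["Fare"].describe()

print(sex_counts)
print(fare_summary[["mean", "50%", "max"]])
```

Each summary looks at a single column at a time, which is what makes the analysis univariate.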

We separated the variables above, so we can start with the categorical ones:


### Categorical Variable

We visualize the categorical variables with a bar plot. Some of them (Name, Ticket and Cabin) are nearly unique for each passenger, so a bar plot would be unreadable; we put those in a separate list, `categorical_2`, and only print their value counts.

In [None]:
def bar_plot(variable):
    """
        input: variable ex: "Sex"
        output: bar plot & value count
    """
    # get feature
    var = train_df[variable]
    # count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (12,4))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
categorical_1 = ["Survived","Pclass","Sex","SibSp", "Parch","Embarked"]
for c in categorical_1:
    bar_plot(c)

In [None]:
categorical_2 = ["Name","Ticket","Cabin"]
for c in categorical_2:
    print("{} \n".format(train_df[c].value_counts()))

### Numerical Variable

We visualize the numerical variables with histograms.

In [None]:
def plot_hist(var):
    plt.figure(figsize = (12,4))
    plt.hist(train_df[var], bins = 50)
    plt.xlabel(var)
    plt.ylabel("Frequency")
    plt.title("{} Distribution with Hist".format(var))
    plt.show()

In [None]:
numeric = ["Fare", "Age","PassengerId"]
for n in numeric:
    plot_hist(n)

<a id = "8"></a><br>
# Basic Data Analysis

Basic data analysis is the foundation for extracting insights from raw data. Here we compare the survival rate across the values of four features:

* Pclass - Survived
* Sex - Survived
* SibSp - Survived
* Parch - Survived
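
Each pairing listed above is compared the same way, with a grouped mean of the 0/1 `Survived` column. The sketch below shows why that mean can be read as a survival rate; the toy values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates)
```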

**Pclass - Survived**

In [None]:
train_df[["Pclass","Survived"]].groupby(["Pclass"], as_index = False).mean().sort_values(by="Survived",ascending = False)

**Sex - Survived**

In [None]:
train_df[["Sex","Survived"]].groupby(["Sex"], as_index = False).mean().sort_values(by="Survived", ascending = False)

**SibSp - Survived**

In [None]:
train_df[["SibSp","Survived"]].groupby(["SibSp"], as_index = False).mean().sort_values(by="Survived", ascending = False)

**Parch - Survived**

In [None]:
train_df[["Parch","Survived"]].groupby(["Parch"], as_index = False).mean().sort_values(by="Survived",ascending = False)

<a id = "9"></a><br>
# Data Cleaning

Data cleaning is the most important step in the data science life cycle; in many projects it is said to take about 80% of the work, so we give it the attention it deserves. We will apply the following steps:

* **Outlier Detection**

    We will use the IQR test for outlier detection.
    
* **Missing Values**

    We will find and fill missing values.
    
Along the way we will also visualize some values, including a correlation matrix.

<a id = "10"></a><br>
## Outlier Detection

Next we detect outliers. The dataset contains some extreme values, and skipping this step can hurt your model's score, so don't skip it!

There are many methods for outlier detection, such as the IQR test and the z-score. We will use the IQR test on this dataset: a value is flagged when it lies more than 1.5 times the IQR below the first quartile or above the third quartile, where IQR = Q3 - Q1.
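
Before applying the IQR test to the dataset, here is a minimal worked sketch of it on a single toy array. The fares are invented for illustration, and the variable names are only for this example.

```python
import numpy as np

# Toy fares, invented for illustration.
fares = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 100.0])

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Values further than 1.5 * IQR outside the quartiles count as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(q1, q3, lower, upper, outliers)
```

Only the fare of 100 falls outside the fences here; the other values sit well inside them.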

In [None]:
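def detect_outliers(df, features):
    """Return the indices of rows that are IQR outliers in more than two of `features`."""
    outlier_indices = []

    for c in features:
        # 1st and 3rd quartiles; nanpercentile skips missing values (e.g. in Age),
        # whereas np.percentile would return NaN and flag nothing for that column
        Q1 = np.nanpercentile(df[c], 25)
        Q3 = np.nanpercentile(df[c], 75)
        # interquartile range
        IQR = Q3 - Q1
        # outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)

    # count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    # keep rows flagged in more than two features
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers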

In [None]:
train_df.loc[detect_outliers(train_df,["Age","SibSp","Parch","Fare"])]

We found some outliers and need to drop them from the dataset:

In [None]:
train_df = train_df.drop(detect_outliers(train_df,["Age","SibSp","Parch","Fare"]),axis = 0).reset_index(drop = True)

We have detected the outliers and dropped them, so we can move on to the next step!

<a id = "11"></a><br>
## Missing Values

Next we check whether the dataset has missing values. If it does, we locate them and then fill them. Many machine learning models cannot be fitted on data with missing values, so we have to handle them. Let's find and fill them!

### Find Missing Values

In [None]:
train_df_len = len(train_df)
train_df = pd.concat([train_df,test_df],axis = 0).reset_index(drop = True)

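We first concatenated the training and test sets so that both are cleaned together (`train_df_len` marks where the training rows end). The columns that contain missing values are: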

In [None]:
train_df.columns[train_df.isnull().any()]

And the number of missing values in each column:

In [None]:
train_df.isnull().sum()

### Fill Missing Values

We found some missing values. We need to fill them. Let's fill!

In [None]:
train_df[train_df["Embarked"].isnull()]

The rows above are the ones with a missing "Embarked" value. To choose a fill value, we plot "Fare" for each port of embarkation with a boxplot.

In [None]:
train_df.boxplot(column="Fare",by = "Embarked")
plt.show()

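We fill the gaps based on the boxplot: we compare the fare paid by the passengers with a missing port to the fare distribution of each port and use the closest match, "C".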

In [None]:
train_df["Embarked"] = train_df["Embarked"].fillna("C")
train_df[train_df["Embarked"].isnull()]

Next we fill the "Age" column.

In [None]:
train_df[train_df["Age"].isnull()]

We need to see how "Age" relates to other features, so we plot a correlation matrix:

In [None]:
sns.heatmap(train_df[["Age","SibSp","Parch","Pclass"]].corr(), annot = True)
plt.show()

We fill the gaps based on these correlations: for each passenger with a missing age, we use the median age of passengers with the same SibSp, Parch and Pclass values, and fall back to the overall median age when there is no such group.

In [None]:
index_nan_age = list(train_df["Age"][train_df["Age"].isnull()].index)
for i in index_nan_age:
    # median age of passengers with the same SibSp, Parch and Pclass as passenger i
    age_pred = train_df["Age"][((train_df["SibSp"] == train_df.iloc[i]["SibSp"]) &(train_df["Parch"] == train_df.iloc[i]["Parch"])& (train_df["Pclass"] == train_df.iloc[i]["Pclass"]))].median()
    # fallback: overall median age
    age_med = train_df["Age"].median()
    # .loc avoids chained assignment; after reset_index the labels equal the positions
    if not np.isnan(age_pred):
        train_df.loc[i, "Age"] = age_pred
    else:
        train_df.loc[i, "Age"] = age_med

In [None]:
train_df[train_df["Age"].isnull()]

We have filled all missing values in the 'Age' column, so we can move on to the Feature Engineering step!
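
The row-by-row loop used for "Age" can also be written in vectorized form. The following is a hedged sketch on a toy frame (values invented for illustration), not a drop-in replacement: it computes every median from the originally observed ages, whereas the loop recomputes medians as values are filled in, so results can differ slightly.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 0],
    "Parch":  [0, 0, 0, 2, 2, 0],
    "Age":    [40.0, 50.0, np.nan, 4.0, np.nan, np.nan],
})

# Median age within each (SibSp, Parch, Pclass) group, aligned to the rows;
# the result is NaN for a group in which every age is missing.
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Use the group median where available, otherwise the overall median.
toy["Age"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())
print(toy["Age"].tolist())
```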

<a id = "12"></a><br>
# Feature Engineering

In this step we build new features to improve the solution. Feature engineering is where you can show your creative side, so let's do that here!

## Title & Is Married

**Title** is created by extracting the prefix that precedes the given names in the **Name** feature. According to the graph below, many titles occur only a few times. Some of those titles don't seem correct and need to be replaced. **Miss, Mrs, Ms, Mlle, Lady, Mme, the Countess, Dona** titles are replaced with **Miss/Mrs/Ms** because all of them are female. **Mlle, Mme and Dona** are non-English forms of address rather than English titles; they are extracted as titles because the **Name** feature is split on the comma and the period. **Dr, Col, Major, Jonkheer, Capt, Sir, Don and Rev** titles are replaced with **Dr/Military/Noble/Clergy** because those passengers have similar characteristics. **Master** is a *unique* title. It is given to male passengers below age 26. They have the highest survival rate among all males.

*Is_Married* is a binary feature based on the *Mrs* title. The *Mrs* title has the highest survival rate among the female titles. It needs its own feature because all female titles are grouped together in **Title**.
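
As an illustration of the title extraction, the sketch below uses `str.extract` with a regular expression, which is a different technique from the split-based extraction used in this notebook. The names are invented for the example and follow the "Surname, Title. Given names" layout of the **Name** column.

```python
import pandas as pd

# Names in the "Surname, Title. Given names" layout; invented for illustration.
names = pd.Series([
    "Smith, Mr. John",
    "Jones, Mrs. Anna (Mary Brown)",
    "Doe, Master. Tim",
    "Roe, the Countess. of (Jane Poe)",
])

# Capture everything between the comma and the first period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Binary flag that is 1 only for the "Mrs" title.
is_married = (title == "Mrs").astype(int)
print(title.tolist(), is_married.tolist())
```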

In [None]:
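# "Surname, Title. Given names" -> keep the text between the comma and the first period
train_df['Title'] = train_df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
train_df['Is_Married'] = 0
# single .loc call instead of chained assignment, so the write reaches train_df
train_df.loc[train_df['Title'] == 'Mrs', 'Is_Married'] = 1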

In [None]:
train_df.head()

In [None]:
fig, axs = plt.subplots(nrows=2, figsize=(20, 20))
sns.barplot(x=train_df['Title'].value_counts().index, y=train_df['Title'].value_counts().values, ax=axs[0])

axs[0].tick_params(axis='x', labelsize=10)
axs[1].tick_params(axis='x', labelsize=15)

for i in range(2):    
    axs[i].tick_params(axis='y', labelsize=15)

axs[0].set_title('Title Feature Value Counts', size=20, y=1.05)

train_df['Title'] = train_df['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
train_df['Title'] = train_df['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')

sns.barplot(x=train_df['Title'].value_counts().index, y=train_df['Title'].value_counts().values, ax=axs[1])
axs[1].set_title('Title Feature Value Counts After Grouping', size=20, y=1.05)

plt.show()

## Family Size
**Family Size** is created by adding **SibSp, Parch and 1**. *SibSp* is the count of siblings and spouses, and *Parch* is the count of parents and children. These columns are added together to find the total size of each family; the extra *1* accounts for the passenger themselves.

In [None]:
train_df['Family_Size'] = train_df['SibSp'] + train_df['Parch'] + 1
family_map = {1: 'Alone', 2: 'Small', 3: 'Small', 4: 'Small', 5: 'Medium', 6: 'Medium', 7: 'Large', 8: 'Large', 11: 'Large'}
train_df['Family_Size_Grouped'] = train_df['Family_Size'].map(family_map)
train_df.head()

In [None]:
g = sns.catplot(x="Family_Size_Grouped", y="Survived", data=train_df, kind="bar")
g.set_ylabels("Survival")
plt.show()

In [None]:
sns.countplot(x = "Family_Size_Grouped", data = train_df)
plt.show()

In [None]:
g = sns.catplot(x = "Family_Size", y = "Survived", data = train_df, kind = "bar")
g.set_ylabels("Survival")
plt.show()

As the graphs show, passengers in small families had a better chance of survival.

## Ticket
There are too many unique **Ticket** values to analyze, so grouping them up by their frequencies makes things easier.

How is this feature different from **Family_Size**? Many passengers travelled in groups. Those groups included friends, nannies, maids, etc. They weren't counted as family, but they shared the same ticket.

Why not group tickets by their prefixes? If the prefixes in the **Ticket** feature have any meaning, it is already captured in the **Pclass** or **Embarked** features, because that is the only logical information that can be derived from the prefix.
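
The frequency grouping can be sketched on a toy column as follows. The ticket values are invented for illustration; `groupby(...).transform("count")` is the same pandas call the notebook applies to the real data.

```python
import pandas as pd

# Toy tickets, invented for illustration.
toy = pd.DataFrame({"Ticket": ["A1", "B2", "A1", "C3", "A1", "B2"]})

# transform("count") returns one value per row: the size of that row's ticket group.
toy["Ticket_Frequency"] = toy.groupby("Ticket")["Ticket"].transform("count")
print(toy)
```

Because `transform` keeps the original row order and length, the result can be assigned straight back as a new column.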

In [None]:
train_df['Ticket_Frequency'] = train_df.groupby('Ticket')['Ticket'].transform('count')

In [None]:
train_df.head()

## Feature Transformation
Next we transform the features. Many models cannot work with categorical values directly, so we convert them to numbers.
### Label Encoding Non-Numerical Features
**Embarked, Sex, Title and Family_Size_Grouped** are object type, so they are converted to numerical type with **LabelEncoder**, which labels the classes from 0 to n-1. The code below also passes the continuous **Age** and **Fare** features through **LabelEncoder**, which replaces each distinct value with its rank among the sorted unique values. Models need the object-type features in numerical form to learn from them.
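
To show what **LabelEncoder** does to a single column, here is a minimal sketch on a toy list of ports (values invented for illustration). It assumes scikit-learn is installed, as it is for the rest of this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration.
ports = ["S", "C", "Q", "S", "C"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(encoder.classes_), codes.tolist())

# inverse_transform maps codes back to the original labels.
print(list(encoder.inverse_transform([0, 1, 2])))
```

Note that the codes follow the sorted order of the labels, not the order in which they appear in the data.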

First, we need to import the tools for feature transformation:

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

In [None]:
non_numeric_features = ['Embarked', 'Sex', 'Title', 'Family_Size_Grouped','Age', 'Fare']

label_encoder = LabelEncoder()

for column in non_numeric_features:
    train_df[column] = label_encoder.fit_transform(train_df[column])

### One-Hot Encoding the Categorical Features
The categorical features (**Pclass, Sex, Embarked, Title, Family_Size_Grouped**) are converted to one-hot encoded features with **OneHotEncoder**. **Age** and **Fare** features are not converted because, unlike the others, they are ordinal.
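
For intuition about what one-hot encoding produces, the sketch below uses pandas' `get_dummies`, which is a different tool from the **OneHotEncoder** used in this notebook. The toy column is invented for illustration.

```python
import pandas as pd

# Toy column, invented for illustration.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; each column is named after its category.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked").astype(int)
print(dummies)
```

Every row has exactly one 1, in the column of its own category, so no artificial ordering is imposed on the categories.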

In [None]:
cat_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'Family_Size_Grouped']
one_hot_encoder = OneHotEncoder()
encoded_features = one_hot_encoder.fit_transform(train_df[cat_features]).toarray()

# Build the column names from the encoder's own category order (categories_ is sorted);
# train_df[column].unique() is in order of appearance and would mislabel the columns.
column_names = []
for column, categories in zip(cat_features, one_hot_encoder.categories_):
    names = [f"{column}_{label}" for label in categories]
    column_names.extend(names)

one_hot_encoded_df = pd.DataFrame(encoded_features, columns=column_names)

train_df = pd.concat([train_df, one_hot_encoded_df], axis=1)

In [None]:
train_df.head(20)

## Drop Passenger ID and Cabin
We drop the 'PassengerId', 'Cabin', 'Name' and 'Ticket' columns. The models do not need them.

In [None]:
train_df.drop(labels = ["PassengerId", "Cabin", "Name", "Ticket"], axis = 1, inplace = True)

In [None]:
train_df.head()

In [None]:
train_df.columns

One last step remains before modeling: splitting the data.

<a id = "13"></a><br>
# Modeling
Now we fit models to our dataset, since all the preprocessing is complete. First we separate the combined data back into the training and test sets and hold out part of the training set for validation.

We need to import some libraries for machine learning models!

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

## Train - Test Split

In [None]:
# rows after train_df_len are the competition test set; they have no "Survived" labels
test = train_df[train_df_len:].drop("Survived", axis=1)

In [None]:
test

In [None]:
train = train_df[:train_df_len]
X_train = train.drop("Survived", axis = 1)
y_train = train["Survived"].astype(int)  # concatenating with the test set turned the labels into floats
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.33, random_state = 42)
print("X_train",len(X_train))
print("X_test",len(X_test))
print("y_train",len(y_train))
print("y_test",len(y_test))
print("test",len(test_df))

In [None]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'XGBoost': xgb.XGBClassifier()
}

for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f'{model_name}: Accuracy = {accuracy:.4f}')

We choose Logistic Regression, because it gave the best accuracy_score among the models compared above.
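
A single hold-out split can make a model comparison sensitive to which rows land in the validation part. As a hedged sketch of one way to get a steadier estimate, the example below runs stratified 5-fold cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in, and the `max_iter=1000` setting is an assumption for the example, not a value used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; in the notebook this would be the training features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Five stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and spread across folds give a sense of how much the accuracy of one model varies from split to split.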

The code below trains Logistic Regression on the Titanic training data:

In [None]:
logistic_regression = LogisticRegression()
logistic_regression = logistic_regression.fit(X_train, y_train)
print(accuracy_score(y_test, logistic_regression.predict(X_test)))  # (y_true, y_pred) order

<a id = "14"></a><br>
# Prediction and Submission
We predict survival for the test set and write the result to a .csv file for submission. That completes the notebook!

In [None]:
test_survived = pd.Series(logistic_regression.predict(test), name = "Survived").astype(int)
results = pd.concat([test_PassengerId, test_survived],axis = 1)
results.to_csv("submission.csv",header=True, index = False)

I hope this notebook is helpful on your data science journey!

> Respects,
> Ali Riza Ercan

                                                                   