# **Predicting Passenger Survival**

**Table of Contents**
* **About Data**
* **Import Libraries**
* **Setting Notebook**
* **Load Datasets**
* **Explore Datasets**
* **Split Data**
* **Data Cleaning**
* **Data Visualization**
* **Choose Model**
* **Fit Models**
* **Evaluate Model**
* **Fine-Tune The Model**

## **About Data**
RMS Titanic was a British passenger liner, operated by the White Star Line, which sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg. For more information, click <a href='https://en.wikipedia.org/wiki/Titanic' style='text-decoration:none'>here</a>.

<pre style="font-family: 'Brush Script MT', cursive, serif;">
<h3 style='font-size: 12'>Definition Of Feature Columns</h3>
<b>Survived:</b> Whether the passenger survived
0 = No
1 = Yes
<b>Embarked:</b> Port of embarkation
C = Cherbourg
Q = Queenstown
S = Southampton
<b>Pclass:</b> Ticket class
A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
<b>Sex:</b> Passenger sex
<b>Age:</b> Age in years
Age is fractional if less than 1.
If the age is estimated, it is in the form of xx.5
<b>SibSp:</b> Number of siblings / spouses aboard the Titanic
The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
<b>Parch:</b> Number of parents / children aboard the Titanic
The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch = 0 for them.
<b>Ticket:</b> Ticket number
<b>Fare:</b> Fare paid for a ticket
<b>Cabin:</b> Cabin number
</pre>
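
The coded columns above can be turned into readable labels with plain dictionary lookups. The following is a minimal sketch of that idea; the three sample rows and the constant names (`SURVIVED_LABELS`, `EMBARKED_LABELS`, `PCLASS_LABELS`) are made up for illustration and are not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

# Code-to-label mappings taken from the feature definitions above
SURVIVED_LABELS = {0: "No", 1: "Yes"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
PCLASS_LABELS = {1: "Upper", 2: "Middle", 3: "Lower"}

# Hypothetical rows for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# Replace each code with its label; codes missing from a mapping become NaN
decoded = sample.assign(
    Survived=sample["Survived"].map(SURVIVED_LABELS),
    Embarked=sample["Embarked"].map(EMBARKED_LABELS),
    Pclass=sample["Pclass"].map(PCLASS_LABELS),
)
print(decoded)
```

Decoding like this is mainly useful for plots and tables; models generally still need the numeric or encoded form.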

## **Import Libraries**

In [None]:
!pip install numpy pandas matplotlib seaborn scikit-learn xgboost lightgbm
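
Before running the imports, it can help to confirm that the packages are actually visible to the kernel. This is a small sketch using only the standard library; `REQUIRED_MODULES` and `find_missing_modules` are hypothetical names introduced here, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names (not PyPI names) of the packages this notebook relies on
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "xgboost", "lightgbm"]


def find_missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

If the list is not empty, the install command probably ran against a different Python environment than the one the kernel uses.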

In [1]:
import os, warnings
from typing import List, Tuple
# Data manipulation tools
import numpy as np
import pandas as pd
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Data split into train and validation
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    cross_val_predict, 
    GridSearchCV, 
    GroupKFold, 
    RandomizedSearchCV)

# Data Imputation tool
from sklearn.impute import KNNImputer, SimpleImputer, MissingIndicator

# Machine learning model
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier, 
    BaggingClassifier, 
    ExtraTreesClassifier, 
    GradientBoostingClassifier, 
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
    )
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC
from sklearn.tree import DecisionTreeClassifier

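# Model evaluation tools
# Note: plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# ConfusionMatrixDisplay and RocCurveDisplay are their replacements
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    auc,
    r2_score,
    recall_score,
)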


## **Setting Notebook**

In [None]:
plt.rc("figure", autolayout=True, figsize=(5, 6))
plt.rc("axes", titlesize=12, titleweight=30, labelsize=10, labelweight=20)


## **Load Datasets**

In [None]:
def load_data(train_path: str, test_path: str, index_col: str = None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Load the train and test CSV files into pandas dataframes
    and return them as a tuple
    Args:
        train_path (str): path of the train CSV file
        test_path (str): path of the test CSV file
        index_col (str, optional): column name to use as the index. Defaults to None.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: the train and test dataframes, in that order
    """
    # Read the train CSV into a dataframe
    train_df = pd.read_csv(train_path, index_col=index_col)

    # Read the test CSV into a dataframe
    test_df = pd.read_csv(test_path, index_col=index_col)

    return train_df, test_df
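
To see what the `index_col` argument changes, here is a small self-contained sketch that reads the same CSV text twice with `pd.read_csv`. The two rows are made up and only mimic a subset of the Titanic column layout.

```python
import io

import pandas as pd

# Two made-up rows with a subset of the Titanic columns
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# Without index_col, PassengerId stays an ordinary column
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col, PassengerId becomes the row index and leaves the columns
indexed = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(plain.columns.tolist())
print(indexed.columns.tolist())
print(indexed.index.tolist())
```

Using `PassengerId` as the index keeps it out of the feature columns, so it is not fed to a model as if it were a predictor.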

In [None]:
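# Local paths of the train and test CSV files
local_train_path = "../datasets/train.csv"
local_test_path = "../datasets/test.csv"

try:
    # Load both files, using PassengerId as the index column
    datasets = load_data(local_train_path, local_test_path, "PassengerId")

    # Assign train and test dataframe
    train_df, test_df = datasets

except FileNotFoundError as err:
    # Fail loudly: every later cell needs train_df and test_df to exist
    raise FileNotFoundError(
        f"Could not find {local_train_path!r} or {local_test_path!r}"
    ) from err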
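
If the notebook has to run in more than one environment, one option is to look for the data in several candidate folders before loading. The sketch below assumes that approach; `first_existing_dir` is a hypothetical helper, and `./datasets` is a made-up second location (only `../datasets` appears in this notebook).

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing_dir(candidates: Iterable[str], required=("train.csv", "test.csv")) -> Optional[Path]:
    """Return the first directory that contains every required file, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if all((directory / name).is_file() for name in required):
            return directory
    return None


# Hypothetical candidate locations, checked in order
data_dir = first_existing_dir(["../datasets", "./datasets"])
if data_dir is None:
    print("No dataset directory found; download train.csv and test.csv first.")
else:
    print(f"Using data from {data_dir}")
```

The returned directory could then be joined with the file names and passed to a loader such as `load_data`.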

## **Explore Datasets**

### **Train And Test Dataframe**

#### **Train**

In [None]:
# Check out the first five rows of the train dataframe
train_df.head()


#### **Test**

In [None]:
# And the first five rows of the test dataframe
test_df.head()

### **Data Information**

#### **Train**

In [None]:
# View the data information
train_df.info()

#### **Test**

In [None]:
# View the data information
test_df.info()

### **Data Description**

#### **Train**

In [None]:
# Data summary analysis
train_df.describe(include="all").T

#### **Test**

In [None]:
# Data summary analysis
test_df.describe(include="all").T

## **Split Data**

*Separate the predictive features from the target variable*

In [None]:
X = train_df.drop("Survived", axis="columns")
y = train_df["Survived"]

In [None]:
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

*Split the dataset into training and validation sets*

In [None]:
train_X, valid_X, train_y, valid_y = train_test_split(X, y, train_size=.80)

In [None]:
print(f"train_X shape: {train_X.shape}")
print(f"train_y shape: {train_y.shape}")
print(f"valid_X shape: {valid_X.shape}")
print(f"valid_y shape: {valid_y.shape}")
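
The split above is random on every run and does not control the class balance. As a hedged variation, the sketch below shows how `random_state` and `stratify` could be passed to `train_test_split` to make the split repeatable and keep the survival ratio equal in both parts. The data is synthetic and the `*_demo`, `tr_*` and `va_*` names are introduced here only to avoid clashing with the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (values are made up)
X_demo = pd.DataFrame({"Fare": range(100)})
y_demo = pd.Series([0] * 60 + [1] * 40, name="Survived")

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42, stratify=y_demo
)

print(f"train positives: {tr_y.mean():.2f}")
print(f"valid positives: {va_y.mean():.2f}")
```

Whether stratification matters here depends on how imbalanced the real target is; it is shown as an option, not as the method this notebook uses.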

## **Data Cleaning**

### **Checking For Missing Values (NaN)**

In [None]:
def check_nan(data: pd.DataFrame) -> pd.DataFrame:
    """
    Count the missing values (NaN) in each column of the data
    Args:
        data (pd.DataFrame): dataframe to check for missing values

    Returns:
        pd.DataFrame: one row per column, with the column name (Columns),
        the number of missing values (TotalNaN), and the fraction of rows
        that are missing (Percent, between 0 and 1)
    """
    # Get the sum of missing values in each column
    total_nan = data.isna().sum()

    # Fraction of missing values in each column (0 to 1), rounded to 3 decimals
    percent = (total_nan / data.shape[0]).round(3)

    # Construct a dataframe of missing values
    check_missing_df = pd.DataFrame({
        "Columns": data.columns,
        "TotalNaN": total_nan,
        "Percent": percent
    }).reset_index(drop=True)

    return check_missing_df


def plot_missing_values(data: pd.DataFrame, width: int = 8, height: int = 5) -> None:
    """
    Plot Missing values in data
    Args:
        data (pd.DataFrame): To plot columns with missing values
        width (int, optional): figure width. Defaults to 8.
        height (int, optional): figure height. Defaults to 5.
    """

    # Check for missing values in the data
    check_missing_df = check_nan(data=data)

    # Initialize the figure and axes
    fig, ax = plt.subplots(1, 1, figsize=(width, height))

    # Plot one horizontal bar per column
    barh = ax.barh(y=check_missing_df.Columns, width=check_missing_df.TotalNaN, edgecolor="black", color="red")
    # Add bar labels
    ax.bar_label(barh, padding=1.5)
    # Hide all four spines
    for side in ("right", "top", "left", "bottom"):
        ax.spines[side].set_visible(False)

    # Remove x, y Ticks
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')

    # Add padding between axes and labels
    ax.xaxis.set_tick_params(pad = 5)
    ax.yaxis.set_tick_params(pad = 10)

    # Add x, y gridlines
    ax.grid(visible = True, color ='grey',
        linestyle ='-.', linewidth = 0.5,
        alpha = 0.2)

    # Invert the y-axis so the first column is listed at the top
    ax.invert_yaxis()
    
    # Add Plot Title
    ax.set_title('Chart Shows Missing Value(NaN) In Data',
             loc ='left', size=11)
    
    ax.set_xlabel("TotalNaN", size=9)
    ax.set_ylabel("Columns" , size=9)
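
To make the output of the missing-value check concrete, here is a condensed sketch run on a tiny hand-made frame. `missing_summary` is a hypothetical stand-in that follows the same logic as `check_nan`, and the values in `toy` are invented so the counts can be verified by eye. Note that the `Percent` column holds a fraction between 0 and 1.

```python
import numpy as np
import pandas as pd


def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Condensed stand-in for check_nan: count and share of NaN per column."""
    total_nan = data.isna().sum()
    return pd.DataFrame({
        "Columns": data.columns,
        "TotalNaN": total_nan.to_numpy(),
        "Percent": (total_nan / len(data)).round(3).to_numpy(),
    })


# Made-up values: Age has 2 gaps, Cabin has 3, Fare has none
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

summary = missing_summary(toy)
print(summary)
```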


#### **Train**

In [None]:
# Check missing values in train data
check_nan(train_df)

In [None]:
# Plot missing values in the data
plot_missing_values(train_df)

#### **Test**

In [None]:
# Check missing values in test data
check_nan(test_df)

In [None]:
# Plot missing values in the test data
plot_missing_values(test_df)