# Building Boosting algorithms in Python, step by step, using Titanic Data

### Project of Adaboost CatBoost XG Boost Titanic Survival ML

### To clear the notebook's outputs, run this command in the Anaconda Prompt (search for it in the Windows Start menu)

jupyter nbconvert --clear-output --inplace "Project 2 ML Adaboost CatBoost XG Boost Titanic Survival.ipynb"

### Inside a Jupyter notebook cell, run this instead

!jupyter nbconvert --clear-output --inplace "Project 2 ML Adaboost CatBoost XG Boost Titanic Survival.ipynb"

### From a terminal where Python is on the PATH, run this

python -m nbconvert --clear-output --inplace "Project 2 ML Adaboost CatBoost XG Boost Titanic Survival.ipynb"


# PROBLEM STATEMENT

The sinking of the Titanic on April 15th, 1912 is one of the most infamous tragedies in history. The Titanic sank during her maiden voyage after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. The number of survivors was low because there were not enough lifeboats for everyone on board. While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others, such as women, children, and the upper class. This case study analyzes what sorts of people were likely to survive this tragedy. The dataset includes the following:

- Pclass:	Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex:    Sex	
- Age:    Age in years	
- SibSp:	# of siblings / spouses aboard the Titanic	
- Parch:	# of parents / children aboard the Titanic	
- Ticket:	Ticket number	
- Fare:	Passenger fare	
- Cabin:	Cabin number	
- Embarked:	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton


- Target class: Survived: Survival	(0 = No, 1 = Yes)


## IMPORT LIBRARIES
We are importing all the required libraries

## Need to run this

!pip install pandas numpy matplotlib seaborn

In [None]:
#!pip install pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## IMPORT DATASET

In [None]:
# read the data using pandas dataframe
train = pd.read_csv('Project 2 ML titanic_train Algo Ada Cat XG Boost.csv')

## EDA part

In [None]:
# Show the data head!
train.head()

# Survived is the dependent variable; the remaining columns are the features

The data has been imported correctly.
#### Observation :  The PassengerId, Name, and Ticket columns carry no predictive information in their raw form, so we will drop them. Survived is our dependent variable. We also assume that the port where a person boarded the ship (the Embarked column) should not affect their chance of survival, so it will be dropped as well.

In [None]:
# Survived is the dependent variable; drop the columns that will not be used as features
train.drop(["PassengerId","Name","Ticket","Embarked"],axis=1,inplace=True)
train.head(2)

# EXPLORE/VISUALIZE DATASET

In [None]:
# Let's count the number of survivors and non-survivors
train['Survived'].value_counts()

#### Observation :  Here 1 means survived and 0 means died. The data is reasonably balanced: neither class heavily outnumbers the other.

In [None]:
#Number of people travelling by different class
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', data = train)
plt.show()
# A count plot draws one bar per category, which is why we use it here

#### Observation :  Maximum people were travelling by 3rd class

In [None]:
plt.figure(figsize=[10,5])
# Split the bar for each class by Survived (1 = survived, 0 = died)
sns.countplot(x = 'Pclass', hue = 'Survived', data=train)
plt.show()

#### Observation :  Most people travelling in 1st class survived; the outcome is almost even for 2nd class (though marginally more people died), and most of the people travelling in 3rd class died. This chart shows that the class had some impact on whether a person would survive or not.
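
Raw counts can be hard to compare when the classes have very different sizes, so it can help to look at the survival rate within each class. The following is a minimal sketch on a small hypothetical table (`toy` and its values are made up for illustration and are not the Titanic data); the same `groupby` call could be applied to `train`.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per class
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

A rate per class makes the comparison independent of how many passengers travelled in each class.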

## Now the same analysis will be made on other independent variables

In [None]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'SibSp', hue = 'Survived', data=train)
plt.show()
# Comparing an independent variable with the dependent variable like this is called bivariate analysis

#### Observation :  Bar chart indicating the number of survivors by number of siblings/spouses aboard. If you have 1 sibling or spouse aboard (SibSp = 1), you have a higher chance of survival compared to being alone (SibSp = 0)

In [None]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Parch', hue = 'Survived', data=train)
plt.show()

#### Observation :  Bar chart indicating the number of survivors by Parch status (how many parents/children were aboard). If you have 1, 2, or 3 such family members (Parch = 1, 2, or 3), you have a higher chance of survival compared to being alone (Parch = 0)

In [None]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Sex', hue = 'Survived', data=train)
plt.show()

#### Observation :   Bar chart indicating the number of survivors by sex. If you are female, you have a higher chance of survival compared to males.

#### Women and children were given first preference for safety, hence it makes sense that gender will help to predict the chances of survival.

In [None]:
plt.figure(figsize=(20,5))
sns.countplot(x = 'Age', hue = 'Survived', data=train)
plt.show()

# A count plot is a poor choice for the Age column: it draws one bar per distinct age,
# so the bars overlap and the picture is hard to read.
# A histogram, which groups ages into bins, gives a much clearer view of this column.

#### Observation :   Bar chart indicating the number of survivors by age. Babies and young children appear to have had a higher chance of survival.

#### Women and children were given first preference for safety, hence it makes sense that age will help to predict the chances of survival.

In [None]:
# Age Histogram 
train['Age'].hist(bins = 40) 
# The number of bins is a free choice; 40 is used here as a starting point
plt.show()
# A histogram groups Age into contiguous intervals (bins) instead of drawing one bar per distinct value.

# Use a histogram for continuous (numerical) values;
# count plots and bar plots are meant for categorical values.

# The x axis shows age; the y axis shows the frequency (number of passengers) in each bin.

#### Observation :   The histogram shows that most passengers were between roughly 20 and 30 years old
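
To make the idea of binning concrete, here is a small sketch of what a histogram computes before anything is drawn. The `ages` array and the ten-year bin edges are hypothetical values chosen for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical ages, NOT the Titanic data
ages = np.array([2, 4, 19, 22, 24, 25, 28, 31, 35, 47, 62])

# Count how many values fall into each ten-year interval
counts, edges = np.histogram(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70])

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```

Each printed count corresponds to the height of one bar in the plotted histogram.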

In [None]:
plt.figure(figsize=(80,40))
sns.countplot(x = 'Fare', hue = 'Survived', data=train)
plt.show()

#### Observation :   Bar chart indicating the number of survivors by fare. If you paid a higher fare, you had a higher chance of survival

In [None]:
# Fare Histogram 
train['Fare'].hist(bins = 40)
plt.show()

#### Observation :   Most people paid a low fare. Only a handful of people paid a high fare.

#### This completes the data exploration part (EDA)

# PREPARE THE DATA FOR TRAINING/ DATA CLEANING 

In [None]:
# number of missing values by variables
train.isnull().sum()

#### Observation: There are missing values for Age and Cabin

In [None]:
# percentage of missing values by variables
train.isnull().mean()*100

#### Observation: Missing values are shown as percentages. About 77% of the values in Cabin and 19.8% of the values in Age are missing.
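
The percentage works because `isnull()` yields True/False per cell and the mean of a boolean column is the fraction of True values. A minimal sketch on a hypothetical four-row frame (the values in `toy` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare":  [7.25, 71.28, 8.05, 53.10],
})

# Fraction of missing cells per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```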

In [None]:
# Let's visualize where values are missing in each variable (row labels on the y axis are hidden)
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
plt.show()

#### Observation: Since a very high percentage of values in Cabin are missing, this variable is not going to help us in the model. We will drop it from the dataset

In [None]:
# Dropping the Cabin column
train.drop('Cabin',axis=1,inplace=True)

In [None]:
train.head()

In [None]:
# Let's view the missing values in the data one more time (row labels on the y axis are hidden)
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
plt.show()

#### Observation: About 19.8% of the values in Age are missing. We can neither drop the column entirely nor keep the missing values, so we will replace them with an average. However, replacing every missing value with the overall average of Age would be misleading.

#### The overall mean of Age would be:

In [None]:
#Mean of total Age
train.Age.mean()

#### Observation: If we check the average age separately for each sex (male and female), we can see that the values differ.

In [None]:
# Compare ages for males and females; males are somewhat older on average (the centre line of a box plot is the median, not the mean)
plt.figure(figsize=(15, 10))
sns.boxplot(x='Sex', y='Age',data=train)
plt.show()

#### Hence, we should replace the missing Age values for females with the average age of females, and the missing Age values for males with the average age of males.

In [None]:
#Shows the missing values of Age for male
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"].head()

In [None]:
#Shows the average age for male
train.loc[train["Sex"] == "male","Age"].mean()

In [None]:
#Shows the missing values of Age for female
train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"].head()

In [None]:
#Shows the average age for female
train.loc[train["Sex"] == "female","Age"].mean()

In [None]:
# Replace missing ages for males and females with the average age of males and females respectively
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"] = train.loc[train["Sex"] == "male","Age"].mean()
# Equivalent, with the male mean hard-coded:
#train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"] = 30.72

train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"] = train.loc[train["Sex"] == "female","Age"].mean()

# Equivalent, with the female mean hard-coded:
#train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"] = 27.91

In [None]:
#Check again for missing values 
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
plt.show()
# now there are no missing values

#### Now there are no missing values in the dataset. We have completed the data cleaning.
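
The per-sex imputation can also be written in one step with `groupby(...).transform("mean")`, which returns the group mean aligned to every row. This is only a sketch of an alternative, shown on a hypothetical toy frame (`toy` is invented for illustration); it is not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 10.0],
})

# Mean age of each row's own group (NaN values are skipped when averaging)
group_mean = toy.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages with the mean of the matching group
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

The advantage of this form is that it scales to any number of groups without repeating one `loc` assignment per category.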

## Create Dummy variables
### We need to create the dummy variables for all the categorical variables.

In [None]:
# With two categories (female and male) and drop_first=True, the first category alphabetically (female) becomes the base
# The resulting Sex_male column is 1 for male and 0 for female
train.head()

#### Observation :  We can see that here we have only one categorical variable, i.e. Sex. We will create the dummy variable as shown below:

In [None]:
train = pd.get_dummies(data=train, columns=['Sex'],drop_first=True)
# Dummy creation is a form of encoding:
# most machine learning models only work with numerical values

In [None]:
train.head()
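
To see what `drop_first=True` does, here is a small sketch on a hypothetical one-column frame (`toy` is invented for illustration). The categories are ordered alphabetically, so `female` is dropped and only a `Sex_male` indicator remains.

```python
import pandas as pd

# Hypothetical toy data, NOT the Titanic file
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# drop_first=True removes the first category (female), which becomes the base level
encoded = pd.get_dummies(data=toy, columns=["Sex"], drop_first=True)
print(encoded)
```

Depending on the pandas version, the indicator column is printed as True/False or as 1/0; both carry the same information.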

##  Data split
### We will now split the data into dependent (y) and independent variable (X)

In [None]:
# Let's drop the target column before we do the train/test split
X = train.drop('Survived',axis=1)
y = train['Survived']

#### Now we will split the data into a training set (80% of the data) and a test set (the remaining 20%), which will be kept aside for later use.

## Need to run this
!pip install scikit-learn xgboost catboost lightgbm

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
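
As a quick sanity check of what an 80/20 split produces, the sketch below splits a hypothetical array of 50 rows. The `stratify` argument is an addition that the cell above does not use; it is included here only to show one way of keeping the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 rows, 30 of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

# stratify=y_toy is an assumption for illustration, not part of the original split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy
)

print(len(X_tr), len(X_te))   # sizes of the training and test parts
print(y_te.mean())            # share of class 1 in the test part
```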

##  Adaptive Boosting
Adaptive Boosting, most commonly known as AdaBoost, grows decision trees sequentially as weak learners and penalizes incorrectly predicted samples by assigning them a larger weight after each round of prediction. This way, the algorithm learns from previous mistakes. The final prediction is a weighted majority vote (or a weighted median for regression problems). After each classifier is trained, AdaBoost assigns a weight to every training item: misclassified items receive a higher weight, so that they appear in the training subset of the next classifier with higher probability. Each classifier is also assigned a weight based on its accuracy, so that more accurate classifiers have more impact on the final outcome.
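
The re-weighting described above can be sketched for a single boosting round. This follows the classic discrete AdaBoost formulas with labels in {-1, +1}; the labels and predictions are hypothetical, and scikit-learn's `AdaBoostClassifier` may differ in its internal details.

```python
import numpy as np

# Hypothetical labels and one weak learner's predictions (one mistake, at index 1)
y_true = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])

# Every sample starts with the same weight
w = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner
miss = y_weak != y_true
err = np.sum(w[miss])

# Learner weight: the lower the error, the larger alpha
alpha = 0.5 * np.log((1 - err) / err)

# Misclassified samples (y_true * y_weak == -1) are scaled up, correct ones scaled down
w_new = w * np.exp(-alpha * y_true * y_weak)
w_new /= w_new.sum()   # renormalize so the weights sum to 1

print(alpha)
print(w_new)
```

After this round the single misclassified sample carries as much weight as the four correctly classified ones combined, which is what forces the next weak learner to focus on it.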

In [None]:
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
classifier_ada = AdaBoostClassifier(random_state = 0)
classifier_ada.fit(X_train, y_train)

#  MODEL TESTING
## Once the model is trained, we will use it to predict the test data.

In [None]:
y_predict_test = classifier_ada.predict(X_test)

## Now we will check the classification report (precision, recall, and accuracy) of the model

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test))

#### Overall accuracy is 82% and precision for 0 and 1 are 87% and 73% respectively. 

##  Cat Boosting
CatBoost is an algorithm for gradient boosting on decision trees. It is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and in other companies, including CERN, Cloudflare, Careem taxi. It is open-source and can be used by anyone.
CatBoost is a boosted decision tree machine learning algorithm. It works in the same way as other gradient boosted algorithms such as XGBoost, but it supports categorical variables out of the box, often reaches good accuracy without extensive parameter tuning, and also offers GPU support to speed up training.

## Need to run this

!pip3 install catboost

In [None]:
from catboost import CatBoostClassifier
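# iterations is the main hyper-parameter here: the maximum number of boosting rounds (trees)
# eval_metric='Accuracy' reports accuracy on the eval_set during training
# verbose=500 prints a progress line every 500 iterations
# Note: the test set is passed as eval_set, so if the best iteration is selected on it, the reported test accuracy will be optimistic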
clf = CatBoostClassifier(iterations=10000, eval_metric = 'Accuracy', verbose = 500)
clf.fit(X_train, y_train, eval_set = (X_test, y_test))

In [None]:
y_pred = clf.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

#### Overall accuracy is 87% and precision for 0 and 1 are 87% and 85% respectively. 
#### CatBoost has given us a noticeably better result than AdaBoost on this test split

##  XGBoost


In [None]:
import xgboost as xgb
from xgboost import XGBClassifier

In [None]:
classifier =XGBClassifier()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

In [None]:
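from sklearn.metrics import classification_report, confusion_matrix
# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))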

#### Overall accuracy is 85% and precision for 0 and 1 are 79% and 77% respectively. 
#### CatBoost seems to have given us the best result

### Recall of 1 - the percentage of all actual 1s in the data that the model correctly identified.
### Precision of 1 - the percentage of all 1s predicted by the model that are actually 1.
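
These two definitions can be checked by hand from a confusion matrix. The sketch below uses hypothetical label lists (`y_true_toy` and `y_pred_toy` are invented and unrelated to the models above) and compares the manual calculation with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, NOT model output from this notebook
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_1 = tp / (tp + fp)   # of everything predicted as 1, how much really is 1
recall_1 = tp / (tp + fn)      # of everything that really is 1, how much was found

print(precision_1, precision_score(y_true_toy, y_pred_toy))
print(recall_1, recall_score(y_true_toy, y_pred_toy))
```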

In [None]:
# In many cases XGBoost gives the highest score, but in this case CatBoost does.