### The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

**In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).**

#### Data Dictionary

* **survival**: Survival (0 = No, 1 = Yes)
* **pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* **sex**: Sex
* **age**: Age in years
* **sibsp**: Number of siblings / spouses aboard the Titanic
* **parch**: Number of parents / children aboard the Titanic
* **ticket**: Ticket number
* **fare**: Passenger fare
* **cabin**: Cabin number
* **embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

* **pclass**: A proxy for socio-economic status (SES)
  * **1st = Upper**
  * **2nd = Middle**
  * **3rd = Lower**

* **age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

* **sibsp**: The dataset defines family relations in this way:
  * **Sibling** = brother, sister, stepbrother, stepsister
  * **Spouse** = husband, wife (mistresses and fiancés were ignored)

* **parch**: The dataset defines family relations in this way:
  * **Parent** = mother, father
  * **Child** = daughter, son, stepdaughter, stepson
  * Some children travelled only with a nanny, therefore parch=0 for them.

* **The target variable is `Survived`** (0 = did not survive, 1 = survived)

###### Evaluation Metric: Your score is the percentage of passengers you correctly predict. This is known as accuracy.
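
To make the metric concrete, here is a minimal sketch of how accuracy is computed. The `accuracy` helper and the five example labels are made up for illustration; they are not part of the competition data.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Hypothetical labels for five passengers (1 = survived, 0 = did not survive)
example_true = [1, 0, 0, 1, 1]
example_pred = [1, 0, 1, 1, 0]

# Three of the five predictions match, so the score is 3 / 5
example_accuracy = accuracy(example_true, example_pred)
print(example_accuracy)
```

This is the same quantity that `accuracy_score` from `sklearn.metrics` returns later in the notebook.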

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',100)
import scipy.stats as stats
import statsmodels.api  as sma
from statsmodels.api import OLS
import statsmodels.formula.api as sfa
from sklearn.metrics import accuracy_score,roc_auc_score,roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier,StackingClassifier,VotingClassifier
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split,GridSearchCV,StratifiedKFold,cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm  import LGBMClassifier


* **Load the Train and Test Data**

In [2]:
train=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\train.csv")
test=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\test.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\saxen\\ML PROJECTS\\TITANIC\\train.csv'
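
The `FileNotFoundError` above comes from an absolute path that only exists on one machine. One possible way around this, sketched under the assumption that the CSV files sit in a folder next to the notebook, is to build the paths with `pathlib`. `DATA_DIR` and `load_titanic` are hypothetical names introduced only for this example.

```python
from pathlib import Path

import pandas as pd

# Assumption: a hypothetical folder that holds train.csv and test.csv
DATA_DIR = Path("data")

def load_titanic(data_dir=DATA_DIR):
    """Read train.csv and test.csv from data_dir, failing early with a clear message."""
    data_dir = Path(data_dir)
    missing = [name for name in ("train.csv", "test.csv") if not (data_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"Missing {missing} in {data_dir.resolve()}")
    return pd.read_csv(data_dir / "train.csv"), pd.read_csv(data_dir / "test.csv")
```

With the files in place, `train, test = load_titanic()` would give the same two frames the cell above tries to create.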

In [None]:
train.shape,test.shape

In [None]:
train.head()

In [None]:
test.head()

###### Combine the Train and Test Dataset for further Analysis

In [None]:
combine=pd.concat([train,test],ignore_index=True)

In [None]:
combine.tail()

In [None]:
combine.shape

In [None]:
combine.info()

In [None]:
combine.isnull().sum()/combine.shape[0]
# More than 77% data for cabin is Missing

In [None]:
# Number of unique category in  each Features
combine.nunique()

In [None]:
combine.describe()

In [None]:
cat_cols=combine.select_dtypes(include=object).columns
num_cols=combine.select_dtypes(include=np.number).columns

In [None]:
cat_cols
# Most records in the Name column are unique labels, so I will not consider this column for further visualization
# Most records in Cabin and Ticket are also unique

In [None]:
num_cols
# Survived, Pclass, SibSp and Parch are categorical features stored as numbers

In [None]:
cat_cols=['Sex','Embarked','Survived','Pclass','SibSp','Parch']
num_cols=['Age','Fare']

# Univariate Analysis


#### Numerical Columns

In [None]:
plt.figure(figsize=(20,10),dpi=500)
nrows=1
ncols=2
rep=1
for i in num_cols:
    plt.subplot(nrows,ncols,rep)
    sns.histplot(combine[i],kde=True)  # distplot is deprecated in recent seaborn releases
    rep=rep+1
    plt.title(i,fontdict={'fontsize':10})
plt.tight_layout()
plt.show()

#### Observations from the Above Distribution Plot
* The distribution of Age looks approximately normal, and most passengers fall in the 20 to 60 age range.
* A few passengers are close to 80 years old.
* The distribution of Fare is highly right (positively) skewed: a few passengers paid very high fares, and there is a tall peak close to zero fare.
* My hypothesis is that passengers whose fare is zero may have been staff or workers on the Titanic.

In [None]:
plt.figure(figsize=(20,10),dpi=500)
nrows=1
ncols=2
rep=1
for i in num_cols:
    plt.subplot(nrows,ncols,rep)
    sns.boxplot(x=combine[i])
    plt.title('Skewness of %s is %.2f'%(i,combine[i].skew()))
    rep=rep+1
plt.tight_layout()
plt.show()

### Categorical Columns

In [None]:
# combine.Sex.value_counts().plot(kind="bar")

In [None]:
plt.figure(figsize=(20,10),dpi=500)
nrows=2
ncols=3
rep=1
for i in cat_cols:
    plt.subplot(nrows,ncols,rep)
    sns.countplot(x=combine[i])
    rep=rep+1
    
plt.tight_layout()
plt.show()

#### Observations from the Above Plot
* Most passengers are male, so I expect most of the passengers who did not survive to be male.
* Most passengers started their journey from Southampton.
* More passengers died than survived.
* Most passengers travelled in Pclass 3, so my hypothesis is that most of the passengers who died belonged to Pclass 3.
* Most passengers travelled with no siblings or spouse, or with just one; the maximum number of siblings/spouses travelling with a passenger is 8.
* The maximum number of parents/children travelling with a passenger is 9, and most passengers travelled without any.




## Bivariate Analysis

### Numeric vs Categorical (Target)

In [None]:
plt.figure(figsize=(30,10),dpi=500)
nrows=1
ncols=2
rep=1
for i in num_cols:
    plt.subplot(nrows,ncols,rep)
    sns.boxplot(x=combine.Survived,y=combine[i])
    rep=rep+1
plt.tight_layout()
plt.show()

#### Observations from the Above Plot:
* Younger passengers have a higher chance of survival than older passengers.
* Passengers who paid a higher fare have a higher chance of survival.

### Categorical vs Categorical (Target)

In [None]:
#Pclass vs Survived
pd.crosstab(combine.Pclass,combine.Survived)

In [None]:
pd.crosstab(combine.Pclass,combine.Survived).plot(kind='bar')
plt.title('Pclass Vs Survived')
plt.show()
# Class 3 passengers account for the most deaths compared to the other classes, and class 1 passengers have
# higher survival rates...

In [None]:
sns.boxplot(x =combine.Survived,y=combine.Pclass)
plt.show()

In [None]:
pd.crosstab(combine.Sex,combine.Survived).plot(kind='bar')
plt.title('Gender vs Survived')
plt.show()

# Female passengers survived the most compared to male passengers

In [None]:
# Sibsp vs Survived
pd.crosstab(combine.SibSp,combine.Survived).plot(kind='bar')
plt.title('Sibsp vs Survived')
plt.show()
# Passengers with 0 or 1 siblings/spouses aboard survived the most

In [None]:
# Parch vs Survived
pd.crosstab(combine.Parch,combine.Survived).plot(kind='bar')
plt.title('Parch vs Survived')
plt.show()
# Solo travellers and passengers with two parents/children aboard survived the most

In [None]:
# Embarked vs Survived
pd.crosstab(combine.Embarked,combine.Survived).plot(kind='bar')
plt.title('Embarked Vs Survived')
plt.show()
# Passengers who embarked at Cherbourg have a higher chance of survival

In [None]:
combine.groupby('Embarked')['Survived'].value_counts(normalize=True)
# Passengers who embarked at Cherbourg survived at the highest rate, approximately 55%

In [None]:
combine.groupby(['Embarked','Pclass'])['Survived'].value_counts(normalize=True)

## Missing Values

In [None]:
combine.isnull().sum()

In [None]:
(combine.isnull().sum()/combine.shape[0])*100
# Approx 21% of the Age data is missing
# Approx 78% of the Cabin data is missing
# The missing values shown for Survived come from the test dataset, which has no target variable,
# so we will drop this column from the test dataset later

In [None]:
combine.Age.describe()

In [None]:
combine.groupby(['Pclass','Sex'])['Age'].describe()

* **Let's split the passenger names: each name contains a salutation (title), so I can extract it and look at the age distribution per title.**

In [None]:
combine.head(1)

In [None]:
combine.Name[0].split(", ")[1].split(". ")[0]

In [None]:
# Extracting Salutation from each Name
title=[]
for i in combine.Name:
    title.append(i.split(', ')[1].split('.')[0])
    

In [None]:
combine['Title']=pd.Series(title)
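
The loop above splits each name on `", "` and then on `"."`. As an alternative sketch, the same extraction can be done in one vectorized step with `Series.str.extract`. The three sample names below only illustrate the "Surname, Title. Given names" layout; the regular expression is my own choice, not something prescribed by the dataset.

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" layout of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the next full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Applied to the notebook's frame, this would read `combine['Title'] = combine.Name.str.extract(...)` with the same pattern.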

In [None]:
combine.head(1)

In [None]:
# Let's see the distribution of salutations in the dataset
combine.Title.value_counts(normalize=True)

In [None]:
combine.Title.value_counts(normalize=True).plot(kind='bar')
plt.title('Distribution of Salutations in the Dataset')
plt.show()
# The plot shows that most passengers carry the title Mr (adult men), followed by Miss, Mrs and Master

In [None]:
# Now lets check the Age distribution with Respect to Title
combine.groupby('Title')['Age'].describe()
# The Age distribution per title makes more sense

In [None]:
combine.Title.unique()
# There are 18 unique titles in total

In [None]:
title_ignore=['Don', 'Rev', 'Dr', 'Mme',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer', 'Dona']
len(title_ignore)

* **Binning the title column from 18 categories down to 6: most passengers carry one of the titles Mr, Mrs, Miss, Master or Ms, and the remaining titles are grouped into an Others category.**

In [None]:
def ignore(x):
    if x in title_ignore:
        return ('Others')
    else:
        return (x)

In [None]:
combine['Titles']=combine.Title.apply(ignore)

In [None]:
combine.head()

In [None]:
combine.groupby('Titles')['Age'].describe()

In [None]:
pd.crosstab(combine.Titles,combine.Survived).plot(kind='bar')
plt.show()
# Passengers titled Mr account for the most deaths compared to the other categories

In [None]:
# Fill missing Age values with the median Age of each Titles group
# transform('median') keeps the original row index, so the result aligns with combine
combine['Age']=combine['Age'].fillna(combine.groupby('Titles')['Age'].transform('median'))
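
To show what group-wise median imputation does, here is a small sketch on a toy frame. The values are made up; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up ages
toy = pd.DataFrame({
    "Titles": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, np.nan, 40.0, 4.0, 6.0, np.nan],
})

# transform('median') returns one value per row, aligned to the original index
group_median = toy.groupby("Titles")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

The missing `Mr` age becomes 35.0 (the median of 30 and 40) and the missing `Miss` age becomes 5.0 (the median of 4 and 6), while the known ages are left untouched.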

In [None]:
combine.head(1)

In [None]:
combine.groupby('Titles')['Age'].describe()

In [None]:
combine.isnull().sum()

In [None]:
# Dealing with the Missing values in Fare columns

In [None]:
combine.loc[combine.Fare.isnull(),'Fare']=combine.loc[(combine.Titles=='Mr')&(combine.Pclass==3)&(combine.Embarked=='S'),'Fare'].median()

In [None]:
combine.loc[(combine.Titles=='Mr')&(combine.Pclass==3)&(combine.Embarked=='S')]['Fare'].median()

In [None]:
combine.isnull().sum()

In [None]:
# Dealing with missing values of Cabin

In [None]:
combine.Cabin.value_counts()

In [None]:
combine.Cabin.unique()

In [None]:
cabin_avbl=['C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148', 'B45', 'B36', 'A21', 'D34', 'A9', 'C31', 'B61', 'C53',
       'D43', 'C130', 'C132', 'C55 C57', 'C116', 'F', 'A29', 'C6', 'C28',
       'C51', 'C97', 'D22', 'B10', 'E45', 'E52', 'A11', 'B11', 'C80',
       'C89', 'F E46', 'B26', 'F E57', 'A18', 'E60', 'E39 E41',
       'B52 B54 B56', 'C39', 'B24', 'D40', 'D38', 'C105']

In [None]:
len(cabin_avbl)
# In total 187 cabin values are available

In [None]:
def available(x):
    if x in cabin_avbl:
        return ('Cabin Available')
    else:
        return('Cabin Not_Available')

In [None]:
# As 77% of the data in the Cabin column is missing, we derive a new column from it: Cabin_Avalability
combine['Cabin_Avalability']=combine.Cabin.apply(available)
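
The flag above is built by checking each value against a hand-typed list of cabins. A shorter sketch of the same idea, assuming that "available" simply means "not missing", uses `notna()`. The toy `cabin` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy Cabin column with made-up entries; NaN marks a missing cabin
cabin = pd.Series(["C85", np.nan, "B57 B59", np.nan])

# Label each row by whether a cabin value is present
cabin_flag = np.where(cabin.notna(), "Cabin Available", "Cabin Not_Available")
print(cabin_flag.tolist())
```

This avoids maintaining the list by hand, at the cost of treating any non-missing string as a valid cabin.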

In [None]:
combine.head()

In [None]:
pd.crosstab(combine.Cabin_Avalability,combine.Survived).plot(kind='bar')
plt.show()
# Inference: passengers with a cabin recorded survived more often than passengers without one

In [None]:
combine.Cabin_Avalability.value_counts()

In [None]:
combine.head(1)

* **About 77% of the Cabin column is missing. A common rule of thumb is to drop any column with more than 60% missing values, but instead of dropping it I extracted a new column, Cabin_Avalability, and checked its relationship with the target, which makes more sense.**

In [None]:
combine.isnull().sum()

In [None]:
new_data=combine.drop(['Name','PassengerId','Ticket','Cabin','Title'],axis=1)
# Drop the unnecessary columns

In [None]:
new_data.head()

In [None]:
# Family
# Combine all the attributes like,SibSp,Parch
new_data['Family']=new_data.SibSp+new_data.Parch+1

In [None]:
new_data.head(1)

In [None]:
pd.crosstab(new_data.Family,new_data.Survived).plot(kind='bar')
plt.title('Family Vs Survived')
plt.show()
# People travelling alone and couples have a high chance of survival

In [None]:
# Binning Family
def fam(x):
    if x>=5:
        return ('Large_Family')
    elif (x>=3):
        return ('Small_Family')
    elif (x==2):
        return ('Couples')
    else:
        return ('Singles')

In [None]:
new_data['Family_Cat']=new_data.Family.apply(fam)
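
The `fam` function maps family size to four bins: 1 is Singles, 2 is Couples, 3 to 4 is Small_Family, and 5 or more is Large_Family. As a sketch, the same bins can be expressed with `pd.cut`; the sample sizes are made up, and note that `pd.cut` returns a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Made-up family sizes (SibSp + Parch + 1 is always at least 1)
family_size = pd.Series([1, 2, 3, 4, 5, 11])

# Right-closed intervals: (0, 1], (1, 2], (2, 4], (4, inf)
family_cat = pd.cut(
    family_size,
    bins=[0, 1, 2, 4, np.inf],
    labels=["Singles", "Couples", "Small_Family", "Large_Family"],
)
print(family_cat.tolist())
```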

In [None]:
new_data.head(1)

In [None]:
pd.crosstab(new_data.Family_Cat,new_data.Survived).plot(kind='bar')
plt.title('Family_Cat Vs Survived')
plt.show()
# Couples and Small_Family passengers have a high chance of survival

In [None]:
new_data.head(1)

In [None]:
# Fare Per Person
new_data['Fare_Per_Head']=new_data.Fare/new_data.Family

In [None]:
new_data.head(1)

In [None]:
new_data[new_data.Fare==0]

In [None]:
new_data.isnull().sum()

In [None]:
new_data.loc[(new_data.Pclass==3)&(new_data.Titles=='Mr')&(new_data.Cabin_Avalability=='Cabin Not_Available')&
       (new_data.Family_Cat=='Singles'),'Fare'].median()

In [None]:
new_data.loc[new_data.Fare.isnull(),'Fare']=new_data.loc[(new_data.Pclass==3)&(new_data.Titles=='Mr')&(new_data.Cabin_Avalability=='Cabin Not_Available')&
       (new_data.Family_Cat=='Singles'),'Fare'].median()

In [None]:
new_data.isnull().sum()

In [None]:
new_data[new_data.Embarked.isnull()]

In [None]:
new_data.loc[
            (new_data.Sex=='female')&(new_data.Family_Cat=='Singles')&(new_data.Pclass==1),'Embarked'].mode()[0]

In [None]:
new_data.loc[new_data.Embarked.isnull(),'Embarked']='C'

In [None]:
new_data.Survived.value_counts()

In [None]:
new_data['Magic_1']=new_data.groupby(['Sex','Embarked','Titles','Cabin_Avalability'])['Pclass'].transform('count')

In [None]:
new_data['Magic_2']=new_data.groupby(['Pclass','Embarked','Titles','Cabin_Avalability','Family_Cat'])['Fare'].transform('median')

In [None]:
new_data.head(1)

In [None]:
new_data.corr(numeric_only=True)  # restrict to numeric columns; object columns cannot be correlated
# Magic_2, Fare_Per_Head, Fare, Pclass and Magic_1 have good correlation with the target

In [None]:
sns.boxplot(x=new_data.Survived,y=new_data.Magic_1)
plt.show()

In [None]:
sns.boxplot(x=new_data.Survived,y=new_data.Magic_2)
plt.show()

In [None]:
new_data.head()

#### Lets perform Statistical Test with Target for all the features

In [None]:
new_data.nunique()

In [None]:
new_data.select_dtypes(include=np.number).columns

In [None]:
new_data.select_dtypes(include=object).columns

In [None]:
num_cols=['Age','Fare','Fare_Per_Head', 'Magic_1', 'Magic_2','Family']
# 'Title' was dropped when new_data was created, and the sibling/spouse column is spelled 'SibSp'
cat_cols=['Sex', 'Embarked', 'Titles', 'Cabin_Avalability',
       'Family_Cat','Pclass','SibSp']

#### Numeric vs Target

In [None]:
# Ho: Feature is not significant
# Ha: Feature is significant
for i in num_cols:
    sample1=new_data.groupby(['Survived'])[i].apply(list)[0]
    sample2=new_data.groupby(['Survived'])[i].apply(list)[1]
    ttest,pvalue=stats.ttest_ind(sample1,sample2)
    print(i,'---------',(pvalue))
# Except Family all features are significant with the Target

##### Categorical vs Target

In [None]:
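# Ho: Feature is independent of the target
# Ha: Feature is associated with the target
for i in cat_cols:
    if i not in new_data.columns:
        continue  # skip names that are not columns of new_data
    table=pd.crosstab(new_data[i],new_data.Survived)  # contingency table for the current feature
    teststats,pvalue,dof,expected=stats.chi2_contingency(table)
    print(i,'-------------',pvalue)
# Features with a p-value below 0.05 are statistically significant with the target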
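
The chi-square test needs one contingency table per feature. The sketch below illustrates this on a made-up frame: `Sex` is constructed to be strongly related to `Survived`, while the hypothetical `Coin` column is constructed to be independent of it. None of these counts come from the Titanic data.

```python
import pandas as pd
from scipy import stats

# Made-up data: 50 males (10 survive), 50 females (38 survive)
toy = pd.DataFrame({
    "Sex": ["male"] * 50 + ["female"] * 50,
    "Survived": [0] * 40 + [1] * 10 + [0] * 12 + [1] * 38,
    "Coin": ["H", "T"] * 50,  # alternates regardless of the outcome
})

pvalues = {}
for col in ["Sex", "Coin"]:
    # Build the table from the current column, not from a fixed one
    table = pd.crosstab(toy[col], toy["Survived"])
    chi2, pvalue, dof, expected = stats.chi2_contingency(table)
    pvalues[col] = pvalue
    print(col, pvalue)
```

`Sex` gets a very small p-value, while `Coin` does not reject independence, which is the contrast the test is meant to reveal.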

In [None]:
# Drop the Insignificant columns from the Above Analysis
new_data.drop(columns=['Family'],inplace=True)

In [None]:
# Split the train and test datasets
train.shape,test.shape

In [None]:
newtrain=new_data.loc[0:train.shape[0]-1,:].copy()  # copy so later edits do not touch new_data

In [None]:
newtrain.head()

In [None]:
new_test=new_data.loc[train.shape[0]:,:].copy()  # copy so later edits do not touch new_data

In [None]:
new_test.head()

In [None]:
newtrain.shape,new_test.shape

In [None]:
new_test.drop(columns='Survived',inplace=True)

In [None]:
new_test.head()

In [None]:
newtrain.shape,new_test.shape

## Scaling

In [None]:
newtrain.head()

In [None]:
# Convert the Tgt into Int
newtrain['Survived']=newtrain.Survived.astype('int')

In [None]:
newtrain.head(2)

In [None]:
# Scale Age, Fare, Fare_Per_Head, Magic_1 and Magic_2
cols=['Age','Fare','Fare_Per_Head','Magic_1','Magic_2']
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
# Fit once on the training rows, then apply the same transformation to the test rows
newtrain[cols]=sc.fit_transform(newtrain[cols])
new_test[cols]=sc.transform(new_test[cols])
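
Scaling is fitted on the training rows and then reused for the test rows, so both sets share the same mean and standard deviation. A minimal sketch with made-up numbers (the frames `train_part` and `test_part` are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; only the column names mirror the notebook
train_part = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 30.0]})
test_part = pd.DataFrame({"Age": [30.0, 50.0], "Fare": [20.0, 40.0]})

cols = ["Age", "Fare"]
scaler = StandardScaler()

train_scaled = train_part.copy()
test_scaled = test_part.copy()
train_scaled[cols] = scaler.fit_transform(train_part[cols])  # learns mean and std from train
test_scaled[cols] = scaler.transform(test_part[cols])        # reuses the train statistics
print(test_scaled)
```

A test age of 30 equals the training mean, so it maps to 0 after scaling; a test age of 50 lies above every training value and maps to a positive number.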

In [None]:
sns.histplot(combine.Fare,kde=True)

In [None]:
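# Use a fresh scaler for this plot so the scaler fitted on the training data is not overwritten
sns.histplot(StandardScaler().fit_transform(new_test[['Fare']]).ravel(),kde=True)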

In [None]:
combine.describe()

In [None]:
newtrain.describe()

In [None]:
new_test.describe()

In [None]:
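# dtype=int gives 0/1 columns instead of booleans, which statsmodels handles more reliably
dummytrain=pd.get_dummies(newtrain,drop_first=True,dtype=int)
dummytest=pd.get_dummies(new_test,drop_first=True,dtype=int)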
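
Encoding train and test separately only works if both contain the same categories. One possible safeguard, sketched here with made-up frames, is to fix the category list from the training data before calling `get_dummies`, so both frames end up with identical columns. `train_part` and `test_part` are hypothetical names.

```python
import pandas as pd

# Made-up frames: the test part happens to contain only one port
train_part = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Age": [22.0, 38.0, 26.0]})
test_part = pd.DataFrame({"Embarked": ["S", "S"], "Age": [34.0, 47.0]})

# Use the training categories for both frames
categories = sorted(train_part["Embarked"].unique())
for part in (train_part, test_part):
    part["Embarked"] = pd.Categorical(part["Embarked"], categories=categories)

train_dummies = pd.get_dummies(train_part, drop_first=True, dtype=int)
test_dummies = pd.get_dummies(test_part, drop_first=True, dtype=int)
print(list(test_dummies.columns))
```

Without the shared categories, `drop_first=True` would drop `S` from the test frame here and leave it with no `Embarked` columns at all.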

In [None]:
dummytrain.shape,dummytest.shape

In [None]:
dummytrain.head()

In [None]:
dummytest.head()

* **Data is ready for modelling**

### Lets Build a Base Model

In [None]:
X=dummytrain.drop(columns=['Survived'])
y=dummytrain.Survived

In [None]:
# 70:30 train test split
xtrain,xtest,ytrain,ytest=train_test_split(X,y,random_state=12,stratify=y,test_size=0.30)

In [None]:
xtrain.shape,ytrain.shape,xtest.shape,ytest.shape

In [None]:
xtrain_c=sma.add_constant(xtrain)
xtest_c=sma.add_constant(xtest)

In [None]:
# OLS on a 0/1 target is a linear probability model; its predictions are thresholded at 0.5 below
base_model=OLS(ytrain,xtrain_c).fit()

In [None]:
print(base_model.summary())

In [None]:
y_pred_logit=base_model.predict(xtest_c)

In [None]:
y_pred_logit=pd.Series(np.where(y_pred_logit>0.50,1,0))

In [None]:
y_pred_logit.value_counts()

In [None]:
# Lets check the accuracy of Base Model
print('Accuracy:',accuracy_score(ytest,y_pred_logit))

In [None]:
# Lets check the classification report
from sklearn.metrics import classification_report,f1_score,confusion_matrix
print('Classification_Report:\n',classification_report(ytest,y_pred_logit))
print('F1_score:',f1_score(ytest,y_pred_logit))

###### Observation from the Base Model: the accuracy comes out to approximately 82% and the F1 score is approximately 77%

#### Let's Apply Different Machine Learning Models

In [None]:
def base_models():
    models=dict()
    models['Logistic Regression']=LogisticRegression()
    models['Decision Tree']=DecisionTreeClassifier()
    models['Random Forest']=RandomForestClassifier()
    models['Naive Bayes']=GaussianNB()
    models['KNN']=KNeighborsClassifier()
    models['Ada Boost']=AdaBoostClassifier()
    models['Xgboost']=XGBClassifier()
    models['Catboost']=CatBoostClassifier()
    models['Light gbm']=LGBMClassifier()
    models['GBM']=GradientBoostingClassifier()
    return models

In [None]:
def evaluation_score(model):
    Cv=StratifiedKFold(n_splits=5,shuffle=True,random_state=12)
    score=cross_val_score(estimator=model,X=X,y=y,scoring='accuracy',cv=Cv,error_score='raise',n_jobs=-1)
    return score

In [None]:
models=base_models()
result,names=list(),list()
for name,model in models.items():
    finalscore=evaluation_score(model)
    result.append(finalscore)
    names.append(name)
    # Report the statistics of the current model only, not of every model scored so far
    print('Model:',name,'Mean_Score:',np.mean(finalscore),'Std:',np.std(finalscore))
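
The loop above scores each model with 5-fold stratified cross-validation. The sketch below reproduces that pattern on synthetic data from `make_classification` (an assumption for the sake of a self-contained example, not the Titanic features) with two of the models, keeping a per-model mean and standard deviation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=12)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=12),
}

summary = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    # Store the statistics of this model's five fold scores
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Keeping the fold scores per model is also what the box plot in the next cell relies on.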

In [None]:
plt.boxplot(result,labels=names,showmeans=True)
plt.xticks(rotation=90)
plt.axhline(y=0.82)
plt.ylabel('accuracy score')
plt.title('Model Performance')
plt.show()

In [None]:
# Let's stack the top 4 models: Logistic Regression, XGBoost and LightGBM as base learners, CatBoost as the final estimator

In [None]:
base=[('log_reg',LogisticRegression()),('xgb',XGBClassifier()),('lgbm',LGBMClassifier())]
final=[('catboost',CatBoostClassifier())]

In [None]:
Cv=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

In [None]:
stack=StackingClassifier(estimators=base,final_estimator=CatBoostClassifier(),cv=Cv) 

In [None]:
model_stack=stack.fit(X,y)

In [None]:
y_pred_stack=model_stack.predict(dummytest)

In [None]:
y_pred_stack

In [None]:
pd.Series(y_pred_stack).value_counts()

In [None]:
# Load submission File
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")
submission.head()

In [None]:
submission['Survived']=y_pred_stack

In [None]:
submission.head()

In [None]:
submission.to_csv('Titanic_stacking_final.csv',index=False) #0.76315,Rank 12287
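
The submission steps (load the sample file, overwrite `Survived`, write a CSV) are repeated for every model below. As a sketch, they could be wrapped in a small helper; `make_submission` is a hypothetical function, and the passenger IDs in the example call are made up.

```python
import pandas as pd

def make_submission(passenger_ids, predictions):
    """Build a two-column frame in the PassengerId/Survived layout."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("passenger_ids and predictions must have the same length")
    return pd.DataFrame({
        "PassengerId": list(passenger_ids),
        # Cast to int so the file contains 0/1 rather than 0.0/1.0
        "Survived": pd.Series(list(predictions)).astype(int).to_numpy(),
    })

demo = make_submission([892, 893, 894], [0.0, 1.0, 0.0])
print(demo)
```

In the notebook this might be used as `make_submission(test['PassengerId'], y_pred_stack).to_csv('Titanic_stacking_final.csv', index=False)`.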

#### Cat Boost

In [None]:
cat=CatBoostClassifier()

In [None]:
y_pred_cat=cat.fit(X,y).predict(dummytest)

In [None]:
pd.Series(y_pred_cat).value_counts()

In [None]:
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")
submission.head()

In [None]:
submission['Survived']=y_pred_cat

In [None]:
submission.to_csv('Titanic_cat_final.csv',index=False)  #0.77033

#### Logistic Regression

In [None]:
log=LogisticRegression()

In [None]:
y_pred_log=log.fit(X,y).predict(dummytest)

In [None]:
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")
submission.head()

In [None]:
submission['Survived']=y_pred_log

In [None]:
submission.to_csv('Titanic_log_final.csv',index=False)  #0.77272

#### Gradient Boosting

In [None]:
gbm=GradientBoostingClassifier()

In [None]:
y_pred_gbm=gbm.fit(X,y).predict(dummytest)

In [None]:
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")
submission.head()

In [None]:
submission['Survived']=y_pred_gbm

In [None]:
submission.to_csv('Titanic_gbm_final.csv',index=False)  #0.77272

##### Logistic Regression gave the highest score among the untuned models (0.77272, tied with Gradient Boosting), so let's tune it

In [None]:
Cv=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

In [None]:
log_reg = LogisticRegression(max_iter=1000)
param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'solver': ['liblinear', 'saga']}
grid_search = GridSearchCV(log_reg, param_grid, cv=Cv, n_jobs=-1)

grid_search.fit(X, y)

print("Best hyperparameters: ", grid_search.best_params_)
print("Accuracy score: ", grid_search.best_score_)


In [None]:
log_reg=LogisticRegression(C=10,penalty='l1',solver='liblinear')

In [None]:
y_pred_tun_log=log_reg.fit(X,y).predict(dummytest)

In [None]:
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")

In [None]:
submission['Survived']=y_pred_tun_log

In [None]:
submission.to_csv('Titanic_logtun_final.csv',index=False)  #0.77511

### Let's Tune the CatBoost Model

In [None]:
cat=CatBoostClassifier()

In [None]:
Cv=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

In [None]:
param_grid = {'learning_rate': [0.01, 0.05, 0.1],
              'depth': [3, 5, 7],
              'iterations': [100, 200, 300]}
grid_search = GridSearchCV(cat, param_grid, cv=Cv, n_jobs=-1)

grid_search.fit(X, y)
print("Best hyperparameters: ", grid_search.best_params_)
print("Accuracy score: ", grid_search.best_score_)


In [None]:
#Best hyperparameters:  {'depth': 3, 'iterations': 200, 'learning_rate': 0.1}
#Accuracy score:  0.8495951289937856
cat=CatBoostClassifier(depth=3,iterations=200,learning_rate=0.1)

In [None]:
y_pred_tun_cat=cat.fit(X,y).predict(dummytest)

In [None]:
y_pred_tun_cat

In [None]:
submission=pd.read_csv(r"C:\Users\saxen\ML PROJECTS\TITANIC\gender_submission.csv")

In [None]:
submission['Survived']=y_pred_tun_cat

In [None]:
submission.to_csv('Titanic_final.csv',index=False)  #Score: 0.75358

* **After fitting different machine learning models, I found that the tuned Logistic Regression model gives the highest accuracy compared to the others (0.77511), so I will use it as my final model.**
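
For reference, the submission scores noted in the comments next to each `to_csv` call can be collected and ranked in one place. The model labels in this sketch are my own shorthand for the corresponding sections.

```python
import pandas as pd

# Scores copied from the comments beside each to_csv call in this notebook
scores = pd.Series({
    "Stacking": 0.76315,
    "CatBoost": 0.77033,
    "Logistic Regression": 0.77272,
    "Gradient Boosting": 0.77272,
    "Tuned Logistic Regression": 0.77511,
    "Tuned CatBoost": 0.75358,
})

ranked = scores.sort_values(ascending=False)
best_model = ranked.index[0]
print(ranked)
```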


  `**END**`
