- Let's do exploratory data analysis (EDA) on the [Titanic Dataset](https://www.kaggle.com/c/titanic/data)!
- We'll use the pandas, seaborn and matplotlib libraries for Python.
- One way to bring the dataset here is to download it from Kaggle and upload it. Another way is to download it directly in Colab using an API token. [(reference)](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/)

In [None]:
## Kaggle data to Colab: https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/
# Download directly from Kaggle using the API. Remember to create the kaggle.json token file from your Kaggle account and upload it first.
! pip install kaggle
! mkdir -p ~/.kaggle  # -p avoids an error if the folder already exists
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
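!kaggle competitions download -c titanic # this API command is shown on the competition's data page
!unzip -o titanic.zip  # the download usually arrives as titanic.zip; this extracts train.csv and test.csv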

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

- The dataset has 11 features; the goal is to predict whether a passenger 'Survived' or not.

In [None]:
# Let's load the train dataframe...
train = pd.read_csv('/content/train.csv')
train.head() # returns the first five rows

In [None]:
train.head(10) # head() also accepts the number of rows to show

In [None]:
train.shape

The *info* and *describe* functions give detailed information and summary statistics about the data.

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.isnull()

In [None]:
# number of Null values per column
train.isnull().sum()
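
Raw counts are easier to judge as a share of all rows. The sketch below uses a made-up miniature frame (not the real Titanic data) to show one way of getting the fraction of missing values per column; on the notebook's data the same expression would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Made-up miniature frame, only to illustrate the calculation
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans, so mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high share (like Cabin here) are candidates for dropping, while moderately incomplete ones can be filled.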

In [None]:
## Let's visualize the null values (need to get rid of them!)
sns.heatmap(train.isnull())

In [None]:
# this heatmap function offers some additional features (cmap)
sns.heatmap(train.isnull(), yticklabels=False, cmap='viridis' )

In [None]:
# can get rid of the color-bar if needed
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis' )

- To conclude, we see that the Cabin value is missing for most of the rows. A good amount of data is also missing from the Age column.
- Let's remove the Cabin column and fill the missing values of the Age column.

In [None]:
# Drop Cabin column
train.drop('Cabin',axis=1, inplace = True) 
train.head()
# axis=1 refers to columns. Learn more: https://www.w3resource.com/pandas/dataframe/dataframe-drop.php
# inplace=True modifies the DataFrame itself instead of returning a modified copy. Learn more: https://www.ritchieng.com/pandas-inplace-parameter/
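
As a small illustration of what `inplace` changes, the sketch below drops a column from a made-up two-row frame without `inplace`. The original frame keeps the column until the result is assigned back.

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({"Age": [22, 38], "Cabin": [None, "C85"]})

dropped = toy.drop(columns=["Cabin"])  # returns a new frame
print(list(toy.columns))      # the original still contains Cabin
print(list(dropped.columns))  # the copy does not

# Assigning the result back has the same effect as inplace=True
toy = toy.drop(columns=["Cabin"])
print(list(toy.columns))
```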

In [None]:
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis' ) # status after dropping the cabin column

In [None]:
sns.countplot(x='Survived', data=train)

In [None]:
# we can also add some styles to the graph...
sns.set_style('whitegrid')  # reference: https://seaborn.pydata.org/generated/seaborn.set_style.html
sns.countplot(x='Survived', data=train)

In [None]:
# let's observe the relation of the Survived variable with other variables
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Survived', col='Pclass', kind='count', data=train) # column names are case sensitive
# the share of survivors shrinks from class 1 to class 3

In [None]:
# there are many other styles.. explore!
sns.set_style('darkgrid')
sns.countplot(x='Sex', hue='Survived', data=train)
# females were given priority while saving passengers. Thus more female than male passengers survived

In [None]:
sns.set_style('whitegrid') 
sns.countplot(x='Survived',hue='Pclass',data=train) 
# Passengers of class 3 died the most :(
# Learn more about countplots: https://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
## survival vs dead per Pclass
sns.set_style('whitegrid') 
sns.countplot(x='Pclass', hue='Survived', data=train)
plt.title('Survived vs Dead according to passenger class')
plt.show()

In [None]:
# these types of operations can be useful: the sum of the 0/1 Survived column is the number of survivors per sex
train.groupby(['Sex']).Survived.sum()

In [None]:
train.groupby(['Sex', 'Survived'])['Survived'].count()

In [None]:
## check out the crosstab function!
pd.crosstab([train.Sex,train.Survived],train.Pclass,margins=True).style.background_gradient(cmap='summer_r')

In [None]:
# let's change the view a little bit. which one looks better? well, it depends on the context!
pd.crosstab([train.Sex,train.Pclass],train.Survived,margins=True).style.background_gradient(cmap='summer_r')
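
The crosstabs above show counts. If rates are easier to compare, `pd.crosstab` accepts `normalize='index'` to turn each row into proportions. The sketch below uses six made-up passengers rather than the real data and also shows the equivalent `groupby` form.

```python
import pandas as pd

# Six made-up passengers, only for illustration
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Each row sums to 1: share of non-survivors (0) and survivors (1) per sex
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)

# Because Survived is coded 0/1, its mean per group is the survival rate
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```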

In [None]:
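# factorplot is now catplot; kind='point' reproduces the old default, and x/y must be passed by keyword
sns.catplot(x='Pclass', y='Survived', kind='point', data=train)
plt.show()

# passengers of class 3 survived the least! (Money.....)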

In [None]:
# how does it vary with gender?
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=train)
plt.show()

- Factor plots and crosstabs make categorical variables easy to visualize.
- Looking at the two plots, the survival rate of women in Class 1 is about 97%, as only 3 out of 94 women died. Women were clearly given priority during the rescue irrespective of class: even in Class 1 the survival rate for men is much lower.
- Survival also differs clearly between the classes, so Pclass is an important feature as well.
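
A quick check of the arithmetic behind the first-class figure, using only the two counts quoted above (94 women, 3 of whom died):

```python
# Counts quoted in the text for women travelling in Class 1
women_class1_total = 94
women_class1_died = 3

survival_rate = (women_class1_total - women_class1_died) / women_class1_total
print(f"{survival_rate:.1%}")  # 91 of 94 survived
```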

In [None]:
sns.violinplot(x='Sex', y='Age', hue='Survived', data=train)

In [None]:
# the split parameter is handy: it draws both hue levels in one violin
sns.violinplot(x='Sex', y='Age', hue='Survived', data=train, split=True)

In [None]:
# let's observe the SibSp column (SibSp = number of siblings/spouses aboard)
sns.countplot(x='SibSp', data=train)

# seems like most of the passengers travelled without siblings or a spouse

In [None]:
sns.countplot(x='SibSp', hue = 'Survived', data= train)

In [None]:
sns.histplot(train['Age'].dropna(), kde=False, color='green') # kde=False shows counts only (distplot is deprecated; histplot replaces it)

In [None]:
sns.histplot(train['Age'].dropna(), kde=True, color='green') # kde=True overlays a density curve

In [None]:
# histplot is similar to the histogram function of pandas/matplotlib...
train['Age'].hist(bins=30, color='darkred', alpha=0.5)

In [None]:
train['Fare'].hist(bins=40, color='green')

In [None]:
## There appears to be a relation between the Pclass and Age columns.
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')
# we can find the average age for each Pclass, so missing ages can be replaced with the average of the passenger's class.
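
The per-class averages mentioned above can be computed with `groupby` instead of being read off the boxplot. The sketch below uses made-up ages, so the numbers are illustrative only; on the notebook's data the same call would run on `train`.

```python
import numpy as np
import pandas as pd

# Made-up ages, chosen only to make the means easy to verify by hand
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 36.0, 30.0, np.nan, 20.0, 28.0],
})

# mean() skips NaN by default, so missing ages do not distort the averages
mean_age = toy.groupby("Pclass")["Age"].mean()
print(mean_age)
```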

In [None]:
train.head()

In [None]:
# Dealing with categorical values. Need to convert them into numbers
Embarked = pd.get_dummies(train['Embarked'], drop_first=True) # C is dropped: Q=0 and S=0 together mean C, so one column is saved
Embarked

In [None]:
gender = pd.get_dummies(train['Sex'], drop_first=True) # only the 'male' column is kept: male=0 means female. drop_first=True saves one column
gender
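
To see what `drop_first=True` does, here is a minimal sketch on four made-up port values. `dtype=int` is used only so the result prints as 0/1 instead of booleans.

```python
import pandas as pd

# Four made-up embarkation ports
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# Categories are sorted (C, Q, S) and the first one, C, is dropped
dummies = pd.get_dummies(ports, drop_first=True, dtype=int)
print(dummies)

# The 'C' passenger (second row) is encoded as Q=0, S=0
print(dummies.iloc[1].tolist())
```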

In [None]:
# Name and Ticket are not used here to predict survival.
# Also, as we created dummies for Sex and Embarked, we have to drop the original columns from the dataframe and then concat the dummies.
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)
train.head()

In [None]:
train = pd.concat([train, gender, Embarked], axis=1)
train.head()
# The categorical values are now replaced. Once the missing Age values are filled, the dataset is ready to pass into a model for prediction

In [None]:
sns.heatmap(train.isnull(), cmap='viridis')

In [None]:
# let's fix the missing values of the 'Age' feature.
# policy: replace each missing age with the approximate average age of the passenger's Pclass
# credit: https://github.com/mohantyaditya/Exploratory-Data-Analysis-Titanic/blob/master/TitanicEda.ipynb
def fix_missing_age(cols):
    # access by label; positional access like cols[0] is deprecated for labelled Series in recent pandas
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        if pclass==1:
            return 38
        elif pclass==2:
            return 29
        else:
            return 24
    else:
        return age

In [None]:
train['Age'] = train[['Age', 'Pclass']].apply(fix_missing_age, axis=1)
sns.heatmap(train.isnull(), cmap='viridis')
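
The function above uses hard-coded averages. As an alternative sketch (not the notebook's method), the class means can be computed from the data and used to fill the gaps in one vectorized step with `groupby(...).transform`. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, np.nan, 30.0, np.nan, 20.0, 28.0],
})

# transform('mean') returns, for every row, the mean age of that row's class
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# fillna aligns on the index, so each gap receives its own class mean
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

This avoids the row-by-row `apply` and keeps the fill values in sync with the data if rows change.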

In [None]:
# let's check the current status. (remember, in the beginning there were lots of missing values!)
train.info()

## Building a Logistic Regression Model


In [None]:
train.head()

In [None]:
# Need to drop the Survived column from the features since it's the class label.
train.drop('Survived', axis=1).head()  ## returns a copy; train itself is unchanged because inplace is not set

In [None]:
train['Survived'].head() ## selecting a column also leaves train unchanged

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train.drop('Survived', axis=1), 
                                                    train['Survived'],
                                                    test_size=0.3,
                                                    random_state=101)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
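
To make the effect of `test_size=0.3` concrete, the sketch below splits ten made-up samples. The `stratify=y` argument is an optional addition not used in the notebook; it asks scikit-learn to keep the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features and balanced 0/1 labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# 30% of 10 rows go to the test part, the remaining 7 to the training part
print(X_tr.shape, X_te.shape)
```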

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression(max_iter=1000)  # a higher max_iter helps the solver converge on unscaled features
logmodel.fit(x_train, y_train)

In [None]:
predictions = logmodel.predict(x_test)

In [None]:
predictions

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

In [None]:
conf_mat = confusion_matrix(y_test, predictions)
conf_mat

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
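
The numbers in the classification report can be derived from the confusion matrix. The sketch below uses hypothetical counts (not the notebook's results), laid out like scikit-learn's `confusion_matrix`: rows are actual classes, columns are predicted classes.

```python
# Hypothetical counts, only for illustration
tn, fp = 50, 10   # actual 0: predicted 0 / predicted 1
fn, tp = 15, 25   # actual 1: predicted 0 / predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted survivors, how many survived
recall = tp / (tp + fn)                     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```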

- Not so bad! You might want to explore more feature engineering and the separate test set (`test.csv` in the Kaggle download); some suggestions for feature engineering are listed at the end.

- This was all for today. Although we filled the missing values of the Age column, there are opportunities to do a lot more 'feature engineering' to uncover deeper patterns.

- There are lots of notebooks on the Kaggle website where you can find even deeper 'data analysis' of this dataset. Please explore!
- A few of them are: [notebook by MANAV SEHGAL](https://www.kaggle.com/startupsci/titanic-data-science-solutions), [notebook by ASHWINI SWAIN](https://www.kaggle.com/ash316/eda-to-prediction-dietanic/notebook). (Find similar resources [here](https://www.kaggle.com/c/titanic/code): go to the 'code' tab and sort by 'most votes', 'hotness' or one of the other options.)

## ***Some suggestions***:
- Try grabbing the title (Dr., Mr., Mrs., etc.) from the name as a feature
- Maybe the Cabin letter could be a feature
- Is there any info you can get from the ticket?
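
A minimal sketch of the first two suggestions, assuming names follow the "Surname, Title. Given names" pattern seen in the dataset and that the first character of a cabin value is its letter. The helper names `extract_title` and `cabin_letter` are hypothetical, introduced only for this illustration.

```python
import re

def extract_title(name):
    """Return the text between the comma and the first period, e.g. 'Mr'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else None

def cabin_letter(cabin):
    """Return the first character of a cabin string, or 'Unknown' if it is missing."""
    return cabin[0] if isinstance(cabin, str) and cabin else "Unknown"

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
print([extract_title(n) for n in names])
print(cabin_letter("C85"), cabin_letter(None))
```

On a DataFrame that still has the Name and Cabin columns, these could be applied with `train['Name'].apply(extract_title)` and `train['Cabin'].apply(cabin_letter)`, before those columns are dropped.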

**Resources**:
- [EDA on Titanic Dataset - Medium](https://medium.datadriveninvestor.com/step-by-step-exploratory-data-analysis-of-titanic-dataset-2d0fb09b0e86)
- [EDA on Titanic Dataset - Jamil Moughal](https://www.kaggle.com/mjamilmoughal/eda-of-titanic-dataset-with-python-analysis)
- [EDA + Logistic Regression on Titanic](https://github.com/krishnaik06/EDA1/blob/master/EDA.ipynb)