<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 2.2.1 

# Data

> The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

> One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this lab, we'll explore this dataset to find insights.

[Titanic Dataset](https://www.kaggle.com/c/titanic/data)

# Data Dictionary

| Variable |                                 Definition | Key                                            |
|----------|-------------------------------------------:|------------------------------------------------|
| Survived | Survival                                   | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| SibSp    | # of siblings / spouses aboard the Titanic |                                                |
| Parch    | # of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                              |                                                |
| Fare     | Passenger fare                             |                                                |
| Cabin    | Cabin number                               |                                                |
| Embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

# Loading Modules

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

# Loading Dataset

In [2]:
# Read Titanic Dataset
titanic_csv = '../../../DATA/titanic.csv'
titanic = pd.read_csv(titanic_csv)

FileNotFoundError: [Errno 2] No such file or directory: '../../../DATA/titanic.csv'
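
The `FileNotFoundError` above means the relative path did not resolve from the notebook's working directory. A minimal sketch of a guard, assuming a hypothetical helper `load_csv_checked` (not part of the lab), that reports the absolute path it tried before handing over to `pd.read_csv`:

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(csv_path):
    """Read a CSV, failing early with the absolute path that was tried.

    Hypothetical helper for illustration; point it at wherever your
    copy of the dataset lives.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"No CSV found at {path.resolve()}")
    return pd.read_csv(path)
```

With the helper defined, `titanic = load_csv_checked(titanic_csv)` would replace the plain `pd.read_csv` call.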

# Explore Dataset

## Head

In [None]:
# Check Head
titanic.head()

## Tail

In [None]:
# Check Tail
titanic.tail()

## Shape

In [None]:
# Check rows, columns
titanic.shape

## Check Types of Data

In [None]:
# Check DataTypes
titanic.info()

## Check Null Values

In [None]:
titanic.isnull().sum()
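
Raw counts are easier to judge as a share of all rows. A small sketch on invented stand-in data (the `toy` frame is an assumption, not the real dataset):

```python
import numpy as np
import pandas as pd

# Invented stand-in data, not the real dataset
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives booleans, so mean() is the fraction of missing rows per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `titanic` would rank the columns by how much of each is missing.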

In [None]:
titanic_long = pd.melt(titanic, id_vars='PassengerId')
titanic_long.head()

In [None]:
pd.pivot(titanic_long, index='PassengerId', columns='variable').droplevel(level=0, axis=1)
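
The two cells above reshape the table from wide to long with `melt` and back again with `pivot`. A minimal round trip on an invented two-row frame (`toy` is an assumption for illustration) makes the shapes easier to see:

```python
import pandas as pd

# Invented stand-in data
toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.28],
})

# Wide -> long: one row per (PassengerId, variable) pair
toy_long = pd.melt(toy, id_vars='PassengerId')

# Long -> wide: each distinct 'variable' becomes a column again
toy_wide = toy_long.pivot(index='PassengerId', columns='variable', values='value')

print(toy_long)
print(toy_wide)
```

Passing `values='value'` keeps the column index flat, so no `droplevel` step is needed in this variant.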

## Fill Null Values

Are there any null values in any column?

- Identify those columns
- Fill those null values using your own logic
    - State the logic behind every step

### Age

In [None]:
titanic[titanic['Age'].isna()]

So, 177 rows have missing `Age` values. We can fill them with the median age of each `Sex` group (`male` and `female`).

In [None]:
# Fill missing Age with the median Age of the passenger's Sex group
# (assigning the result back avoids chained in-place modification)
titanic['Age'] = titanic['Age'].fillna(titanic.groupby(by=['Sex'])['Age'].transform("median"))
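
To see why `transform("median")` lines up with the original rows, here is a small sketch on invented data (the `toy` frame and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Invented stand-in data
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# transform('median') returns one value per row: the median of that row's group
group_median = toy.groupby('Sex')['Age'].transform('median')

# fillna aligns on the index, so each gap receives its own group's median
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

Because the result of `transform` has the same index as the input, no merge or loop is needed to match medians to rows.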

### Cabin

In [None]:
titanic[titanic['Cabin'].isna()]

In [None]:
titanic['Cabin'].value_counts()

The full cabin names vary too much to be useful. Keep only the leading deck letter of `Cabin` and drop the numbers.

In [None]:
# Keep only the first character (the deck letter); leave missing values untouched
titanic['Cabin'] = titanic['Cabin'].apply(lambda x: x[:1] if isinstance(x, str) else x)
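
As an alternative sketch, the pandas string accessor does the same slicing and leaves missing values as `NaN` without a type check. The `cabins` Series below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented sample of cabin strings, including a missing value
cabins = pd.Series(['C85', 'E46', np.nan, 'B57 B59'])

# .str slicing propagates NaN, so no isinstance check is needed
decks = cabins.str[:1]
print(decks.tolist())
```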

In [None]:
# The conditional expression used in the lambda above, applied to a sample value
x = 'C85'
x[:1] if isinstance(x, str) else x

In [None]:
# Check Cabin
titanic['Cabin'].value_counts()

In [None]:
titanic.groupby(by=['Pclass', 'Cabin']).agg({'Cabin': 'count'}).unstack().plot(kind='bar', figsize=(10,8));

The plot shows that Cabin `A`, `B`, `C` & `T` appear only in Pclass `1`.

In [None]:
cabin_map = {
    'A': 1
    , 'B': 2
    , 'C': 3
    , 'D': 4
    , 'E': 5
    , 'F': 6
    , 'G': 7
    , 'T': 8
}
titanic['Cabin'] = titanic['Cabin'].map(cabin_map)

In [None]:
# Fill missing Cabin codes with the mean code of the passenger's Pclass
titanic['Cabin'] = titanic['Cabin'].fillna(titanic.groupby(by=['Pclass'])['Cabin'].transform("mean"))

In [None]:
# Remove Decimal Numbers
titanic['Cabin'] = np.round(titanic['Cabin'], decimals=0)
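
The three cabin steps (map letters to codes, fill gaps with the class mean, round) can be traced on a small invented frame. The `toy` data and the name `deck_codes` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Invented stand-in data
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3, 3],
    'Cabin': ['A', 'C', 'C', np.nan, 'G', np.nan],
})
deck_codes = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8}

# 1) letters -> numeric codes (NaN stays NaN)
toy['Cabin'] = toy['Cabin'].map(deck_codes)

# 2) fill gaps with the mean code of the same Pclass
class_mean = toy.groupby('Pclass')['Cabin'].transform('mean')
toy['Cabin'] = toy['Cabin'].fillna(class_mean)

# 3) round back to a whole-number code
toy['Cabin'] = toy['Cabin'].round(0)
print(toy)
```

Note that averaging the codes treats the deck letters as an ordered numeric scale, which is a modelling assumption rather than a property of the data.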

In [None]:
# Check Cabin
titanic['Cabin'].value_counts()

### Embarked

In [None]:
titanic[titanic['Embarked'].isna()]

In [None]:
titanic['Embarked'].value_counts(normalize=True)

About 72% of passengers embarked at `S`, so we fill the 2 rows with missing values with `S`.

In [None]:
titanic['Embarked'] = titanic['Embarked'].fillna('S')
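
Rather than hard-coding the port, the most frequent value can be looked up first. A sketch on an invented Series (`ports` is an assumption):

```python
import numpy as np
import pandas as pd

# Invented sample of embarkation ports with two gaps
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])

# mode() ignores NaN and returns a Series; take its first entry
most_common = ports.mode()[0]
ports_filled = ports.fillna(most_common)
print(most_common, ports_filled.isna().sum())
```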

# Describe

In [None]:
titanic.describe(include='all').T

# Relationship between Features and Survival

Find the relationship between each categorical feature and `Survived`.

**Describe your findings.**

In [None]:
def bar_charts(df, feature):
    '''
    Inputs:
    df: Dataset
    feature: Name of Feature to Check With Survived
    '''
    _agg = {
        'PassengerId': 'count'
    }
    _groupby = ['Survived', feature]

    df_feature = df.groupby(by=_groupby).agg(_agg)
    # Find the percentage of people survived
    # df_feature = df_feature.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
    
    ax = df_feature.unstack().plot(kind='bar', figsize=(15,6))
    plt.legend(list(df_feature.index.levels[1].unique()))
    plt.xlabel('Survived')
    plt.xticks(np.arange(2), ('No', 'Yes'))
    plt.show();
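
The commented-out line in `bar_charts` hints at percentages instead of raw counts. One way to get them, sketched on invented data (the `toy` frame is an assumption), is `pd.crosstab` with row normalisation:

```python
import pandas as pd

# Invented stand-in data
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1],
})

# normalize='index' divides each row by its total, giving shares per category
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index') * 100
print(rates)
```

Each row then sums to 100, so categories of very different sizes can be compared directly.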

## Pclass

In [None]:
bar_charts(titanic, 'Pclass')

- The chance of survival is highest for passengers in Pclass 1.

## Sex

In [None]:
bar_charts(titanic, 'Sex')

- About 80% of male passengers died.

## Parch

Parch = number of parents or children travelling with each passenger.

In [None]:
bar_charts(titanic, 'Parch')

- The chance of survival is lower for passengers travelling without parents or children.

## SibSp

In [None]:
bar_charts(titanic, 'SibSp')

## Embarked

In [None]:
bar_charts(titanic, 'Embarked')

# Feature Engineering

Create some new features from existing features.

## Fare Class

In [None]:
def create_fare_class(x):
    if x > 30:
        fare_class = 1
    elif x > 20 and x <= 30:
        fare_class = 2
    elif x > 10 and x <= 20:
        fare_class = 3
    else:
        fare_class = 4
    return fare_class

In [None]:
titanic['FareClass'] = titanic['Fare'].apply(create_fare_class)
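
The same fare bands can be expressed with `pd.cut`, shown here as an alternative sketch on invented fares (the `fares` Series is an assumption). The bin edges mirror the thresholds in `create_fare_class`:

```python
import numpy as np
import pandas as pd

# Invented fares chosen to sit on and around the band edges
fares = pd.Series([7.25, 10.0, 10.01, 20.0, 25.0, 30.0, 30.5, 512.33])

# Right-closed bins: (-inf, 10], (10, 20], (20, 30], (30, inf)
fare_class = pd.cut(
    fares,
    bins=[-np.inf, 10, 20, 30, np.inf],
    labels=[4, 3, 2, 1],
).astype(int)
print(fare_class.tolist())
```

One difference to keep in mind: `pd.cut` leaves a missing fare as `NaN` (and the `astype(int)` step would then fail), whereas `create_fare_class` places it in class 4.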

In [None]:
bar_charts(titanic, 'FareClass')

- Passengers who paid more were more likely to survive.

## Age Class

In [None]:
titanic['Age'].value_counts()

In [None]:
def create_age_class(x):
    if x > 60:
        age_class = 5
    elif x > 35 and x <= 60:
        age_class = 4
    elif x > 25 and x <= 35:
        age_class = 3
    elif x > 16 and x <= 25:
        age_class = 2
    else:
        age_class = 1
    return age_class

In [None]:
titanic['AgeClass'] = titanic['Age'].apply(create_age_class)

In [None]:
bar_charts(titanic, 'AgeClass')

# Statistical Overview

In [None]:
from scipy import stats

## Correlation

In [None]:
# numeric_only=True (pandas 1.5+) skips text columns such as Name and Sex
titanic.corr(numeric_only=True)
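
A full correlation matrix is hard to scan; often only the column for `Survived` matters. A sketch on invented data (the `toy` frame is an assumption, and `numeric_only` assumes pandas 1.5 or later):

```python
import pandas as pd

# Invented stand-in data with one text column
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 0, 1, 0],
    'Pclass': [1, 1, 3, 3, 2, 3],
    'Fare': [80.0, 60.0, 8.0, 7.5, 20.0, 9.0],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'male'],
})

# numeric_only=True skips text columns such as 'Sex'
corr_with_survived = (
    toy.corr(numeric_only=True)['Survived']
    .drop('Survived')      # the self-correlation is always 1
    .sort_values()
)
print(corr_with_survived)
```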

# [BONUS] Hypothesis Testing
---
The usual process of null hypothesis testing consists of four steps.

1. Formulate the null hypothesis H_0 (commonly, that the observations are the result of pure chance) and the alternative hypothesis H_a (commonly, that the observations show a real effect combined with a component of chance variation).

2. Identify a test statistic that can be used to assess the truth of the null hypothesis.

3. Compute the p-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis.

4. Compare the p-value to an acceptable significance value alpha (sometimes called an alpha value). If p<=alpha, the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is supported.
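
The four steps can be sketched end to end on invented data. The arrays `group_a` and `group_b`, the seed, and the chosen means are all assumptions for illustration, and `stats.ttest_ind` is used here as one possible test statistic:

```python
import numpy as np
from scipy import stats

# Step 1: H0 says both groups share the same mean; Ha says they differ.
rng = np.random.default_rng(0)                        # seeded so the run is repeatable
group_a = rng.normal(loc=1.0, scale=1.0, size=200)    # invented data
group_b = rng.normal(loc=0.0, scale=1.0, size=200)    # invented data

# Steps 2 and 3: the two-sample t statistic and its two-sided p-value
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 4: compare against the chosen significance level
alpha = 0.05
reject_null = p_value <= alpha
print(t_stat, p_value, reject_null)
```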

### Define Hypothesis

> Formulate the null hypothesis H_0 (commonly, that the observations are the result of pure chance) and the alternative hypothesis H_a (commonly, that the observations show a real effect combined with a component of chance variation).

    Null Hypothesis (H0): There is no difference in the survival rate between the young and old passengers.

    Alternative Hypothesis (HA): There is a difference in the survival rate between the young and old passengers.

### Collect Data

Next step is to collect data for each population group. 

Collect two sets of data, one with passengers older than 35 and another with passengers aged 35 or younger. The sample sizes should ideally be the same, but they can differ. Let's say that each sample has 100 passengers.

In [None]:
titanic_young = titanic[titanic['Age'] <= 35].sample(100, random_state=42)
titanic_old = titanic[titanic['Age'] > 35].sample(100, random_state=42)

In [None]:
titanic_young['Survived'].value_counts()

In [None]:
titanic_old['Survived'].value_counts()

In [None]:
N = 100
a = titanic_young['Survived']
b = titanic_old['Survived']

### Set alpha (Let alpha = 0.05)

> Identify a test statistic that can be used to assess the truth of the null hypothesis.

In [None]:
alpha = 0.05

### Calculate point estimate

In [None]:
a = titanic_young['Survived']
b = titanic_old['Survived']

In [None]:
## Sample variance of each group (ddof=1 divides by N-1)
var_a = a.var(ddof = 1)
var_b = b.var(ddof = 1)

## Pooled standard deviation (this form relies on both samples having the same size)
s = np.sqrt((var_a + var_b)/2)

### Calculate test statistic

In [None]:
t = (a.mean() - b.mean()) / (s * np.sqrt(2 / N))  # t-statistic
t
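
Written out, the cells above compute the pooled standard deviation and the equal-sample-size t statistic, where $\bar{a}$ and $\bar{b}$ are the sample means (here, the survival rates of the two groups), $s_a^2$ and $s_b^2$ are the sample variances with `ddof=1`, and $N$ is the size of each sample:

$$
s = \sqrt{\frac{s_a^2 + s_b^2}{2}}, \qquad t = \frac{\bar{a} - \bar{b}}{s\,\sqrt{2/N}}
$$

This form assumes both samples have the same size $N$ and similar variances; the statistic is then compared against a t distribution with $2N - 2$ degrees of freedom.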

### Find the p-value

> Compute the p-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis.

In [None]:
## Degrees of freedom for two samples of size N
df = 2*N - 2

# One-tailed p-value: area in the tail beyond the observed t
if (t > 0):
    p = 1 - stats.t.cdf(t, df = df)
else:
    p = stats.t.cdf(t, df = df)

In [None]:
print("t = " + str(t))
print("p = " + str(2*p))
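
The branch on the sign of `t` can be avoided with the survival function. The sketch below uses invented, seeded data (`x`, `y` and the seed are assumptions) and checks the manual result against `stats.ttest_ind`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                    # invented data, seeded
x = rng.normal(loc=0.3, scale=1.0, size=100)
y = rng.normal(loc=0.0, scale=1.0, size=100)

n = len(x)                                        # both samples have the same size
pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
t_stat = (x.mean() - y.mean()) / (pooled_sd * np.sqrt(2 / n))

# sf is the upper-tail area (1 - cdf); abs() removes the need to branch on sign
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=2 * n - 2)

# Cross-check against SciPy's equal-variance two-sample test
t_ref, p_ref = stats.ttest_ind(x, y)
print(np.isclose(t_stat, t_ref), np.isclose(p_two_sided, p_ref))
```

Doubling the tail area gives the two-sided p-value, which is the one that matches a "there is a difference" alternative hypothesis.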

### Interpret results

> Compare the p-value to an acceptable significance value alpha (sometimes called an alpha value). If p<=alpha, the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is supported.

In [None]:
def print_sig(p_value, alpha):
    # Reject H0 when p <= alpha, matching the rule quoted above
    if p_value <= alpha:
        print("We reject our null hypothesis.")
    else:
        print("We fail to reject our null hypothesis.")

In [None]:
# p holds one tail only; double it for the two-sided hypothesis
print_sig(2*p, alpha)

In [None]:
## Cross Checking with the internal scipy function
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))

print_sig(p2, alpha)

### Another example with random distributions

In [None]:
## Import the packages
import numpy as np
from scipy import stats


## Define 2 random distributions
#Sample Size
N = 100
#Gaussian distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
#Gaussian distributed data with mean = 0 and var = 1
b = np.random.randn(N)

## Calculate the Standard Deviation
#Calculate the variance to get the standard deviation

#For the unbiased sample variance we divide by N-1, hence the parameter ddof = 1
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

#std deviation
s = np.sqrt((var_a + var_b)/2)
s

## Calculate the t-statistics
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))

## Compare with the critical t-value
#Degrees of freedom
df = 2*N - 2

#One-tailed p-value: area in the tail beyond the observed t
if (t > 0):
    p = 1 - stats.t.cdf(t, df = df)
else:
    p = stats.t.cdf(t, df = df)

print("t = " + str(t))
print("p = " + str(2*p))
# The samples are random and unseeded, so the exact numbers change on every run.
# With a true mean difference of 2 and N = 100, the two-sided p-value should fall
# far below alpha = 0.05, so we reject the null hypothesis: the difference between
# the two sample means is statistically significant.


## Cross Checking with the internal scipy function
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))



---



> > > > > > > > > © 2022 Institute of Data


---