# Final Homework

> **Course:** Data Mining

> **Author:** Enes Kemal Ergin

> **Date:** 05/01/2017

This homework uses the Titanic dataset from Kaggle: [link](https://www.kaggle.com/c/titanic/data)

About the dataset:

    VARIABLE DESCRIPTIONS:
    survival        Survival
                    (0 = No; 1 = Yes)
    pclass          Passenger Class
                    (1 = 1st; 2 = 2nd; 3 = 3rd)
    name            Name
    sex             Sex
    age             Age
    sibsp           Number of Siblings/Spouses Aboard
    parch           Number of Parents/Children Aboard
    ticket          Ticket Number
    fare            Passenger Fare
    cabin           Cabin
    embarked        Port of Embarkation
                    (C = Cherbourg; Q = Queenstown; S = Southampton)

    SPECIAL NOTES:
    Pclass is a proxy for socio-economic status (SES)
     1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

    Age is in Years; Fractional if Age less than One (1)
     If the Age is Estimated, it is in the form xx.5

    With respect to the family relation variables (i.e. sibsp and parch)
    some relations were ignored.  The following are the definitions used
    for sibsp and parch.

    Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
    Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
    Parent:   Mother or Father of Passenger Aboard Titanic
    Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

    Other family relatives excluded from this study include cousins,
    nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
    only with a nanny, therefore parch=0 for them.  As well, some
    travelled with very close friends or neighbors in a village, however,
    the definitions do not support such relations.
    

Step 0 : Data Preparation
---

Read the data and clean it where necessary.

In [10]:
# Import the pandas library 
import pandas as pd

In [11]:
# Read the CSV file from the path and store it in df
df = pd.read_csv('./eneskemal_HW.csv')
# Show the first 5 rows of the data
df.head()
# Show the last 5 rows of the data
# df.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [12]:
# Count non-null values per column; a count below 891 means the column has missing values
df.count(0)

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [13]:
# Drop the Ticket, Cabin, and Name columns (axis=1 selects columns rather than rows)
df = df.drop(['Ticket','Cabin','Name'], axis=1)
# Remove rows that still contain missing values
df = df.dropna()

# Now our data is cleaned and ready!

Step 1 : Data Information
---

Generate information about your dataset: the number of rows and columns, the names and data types of the columns, and the memory usage of the dataset.

> *Hint: Pandas data frame info() function.*

In [15]:
# Show the general information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB


Step 2 : Descriptive Statistics
---

Generate descriptive statistics for all columns (input and output) of your dataset. Descriptive statistics for numerical columns include the count, mean, standard deviation (std), minimum, 25th percentile (Q1), 50th percentile (Q2, median), 75th percentile (Q3), and maximum of each column. For categorical columns, determine the distinct values and their frequencies in each column.

> *Hint: Pandas, data frame describe() function.*

In [19]:
# Descriptive information of the numerical columns
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 | 712.0 |
| mean | 448.589888 | 0.404494 | 2.240169 | 29.642093 | 0.514045 | 0.432584 | 34.567251 |
| std | 258.683191 | 0.491139 | 0.836854 | 14.492933 | 0.930692 | 0.854181 | 52.938648 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 222.75 | 0.0 | 1.0 | 20.0 | 0.0 | 0.0 | 8.05 |
| 50% | 445.0 | 0.0 | 2.0 | 28.0 | 0.0 | 0.0 | 15.64585 |
| 75% | 677.25 | 1.0 | 3.0 | 38.0 | 1.0 | 1.0 | 33.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 5.0 | 6.0 | 512.3292 |


In [20]:
# Categorical descriptive info for Sex column
df['Sex'].describe()

count      712
unique       2
top       male
freq       453
Name: Sex, dtype: object

In [27]:
# Categorical descriptive info for the Embarked column, followed by its distinct values
print(df['Embarked'].describe())
print("Embarked values available: ", df['Embarked'].unique())

count     712
unique      3
top         S
freq      554
Name: Embarked, dtype: object
Embarked values available:  ['S' 'C' 'Q']
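
`describe()` reports only the most frequent value of a categorical column, while the task asks for the frequency of every distinct value. A minimal sketch of one way to get the full table is `value_counts()`. The `sample` frame below is a stand-in built from the five rows printed by `df.head()` in Step 0, so the block runs without the CSV file; the names `sample`, `categorical_cols`, and `frequencies` are illustrative only.

```python
import pandas as pd

# Stand-in data: the Sex and Embarked values of the five rows shown by df.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

categorical_cols = ["Sex", "Embarked"]

# value_counts() returns every distinct value with its frequency, most frequent first.
frequencies = {col: sample[col].value_counts() for col in categorical_cols}

for col in categorical_cols:
    print(frequencies[col], end="\n\n")
```

In the notebook, the same dictionary comprehension can be applied to the cleaned `df` instead of `sample`.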


Step 3 : Analysis of the Output Column
---

If the output column is numerical, calculate the IQR (interquartile range, Q3 - Q1) and the range (the difference between the maximum and minimum values). If the output column is categorical, determine whether it is nominal or ordinal, and explain why. Is there a class imbalance problem? (Check whether there is a big difference between the frequencies of the distinct values in the categorical output column.)
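
The output column here is `Survived`, which holds the labels 0 (No) and 1 (Yes), so the categorical branch of the task applies. One reasonable reading is that the two labels name outcomes rather than rank them, which would make the column nominal. The sketch below checks the class balance. To stay runnable without the CSV file, it rebuilds the column from the `describe()` output in Step 2 (count 712, mean 0.404494), which is an assumption of this example; in the notebook, `df['Survived'].value_counts()` gives the counts directly.

```python
import pandas as pd

# Reconstructed from df.describe(): 712 rows with a Survived mean of 0.404494,
# i.e. about 0.404494 * 712 = 288 survivors and 424 non-survivors.
n_total = 712
n_survived = round(0.404494 * n_total)
survived = pd.Series([1] * n_survived + [0] * (n_total - n_survived), name="Survived")

counts = survived.value_counts()                # absolute frequency per class
shares = survived.value_counts(normalize=True)  # relative frequency per class
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size

print(counts)
print(shares.round(3))
print("Majority-to-minority ratio:", round(imbalance_ratio, 2))
```

A ratio of roughly 1.5 to 1 suggests a mild imbalance rather than a severe one, although what counts as a problem depends on the model and metric used later.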


Step 4 : Box Plots
---

Generate box plots of all numerical columns and generate pie plots for all categorical columns. 

> *Hint: Pandas, Matplotlib, Seaborn, Bokeh libraries*
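
A minimal sketch of this step with pandas and Matplotlib follows. It assumes that `PassengerId` is an identifier and is therefore left out of the numerical columns, and it uses a stand-in `sample` frame built from the five rows printed by `df.head()` in Step 0 so that it runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: the five rows shown by df.head() in Step 0.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

numeric_cols = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical_cols = ["Sex", "Embarked"]

# One box plot per numerical column, each on its own axis because the scales differ.
fig_box, box_axes = plt.subplots(1, len(numeric_cols), figsize=(15, 4))
for ax, col in zip(box_axes, numeric_cols):
    sample.boxplot(column=col, ax=ax)

# One pie plot per categorical column, sized by the frequency of each value.
fig_pie, pie_axes = plt.subplots(1, len(categorical_cols), figsize=(8, 4))
for ax, col in zip(pie_axes, categorical_cols):
    sample[col].value_counts().plot.pie(ax=ax, autopct="%1.1f%%")
    ax.set_ylabel("")
    ax.set_title(col)
```

In a notebook the figures render inline; in a script, call `plt.show()` afterwards. Replace `sample` with the cleaned `df` to plot the full data.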

Step 5 : Distribution of Columns
---

Generate a probability density function (PDF) plot or a histogram for each numerical input and output column.

> *Hint: Pandas, Matplotlib, Seaborn, Bokeh libraries*
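
The histogram variant can be sketched with `DataFrame.hist()`, which draws one histogram per column. As an assumption of this example, `PassengerId` is skipped because it is an identifier, and the stand-in `sample` frame holds only the five rows printed by `df.head()` in Step 0, so the shapes it produces say nothing about the real distributions.

```python
import pandas as pd

# Stand-in data: the numerical values of the five rows shown by df.head().
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

hist_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]

# hist() returns a 2-D array of Axes, one per column (unused grid cells stay empty).
axes = sample[hist_cols].hist(bins=10, figsize=(10, 8))
```

For a density estimate instead of a histogram, `Series.plot.kde()` is an option, but it needs SciPy installed.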


Step 6 : Pairwise Plot
---

Generate a pairwise scatter plot of all numerical input and output columns.

> *Hint: Seaborn pairplot() function*


Step 7 : Cross-Correlation of Input Columns
---

Generate the cross-correlation matrix for the input columns. Use the Pearson correlation coefficient.

> *Hint: Pandas corr() function*
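
A minimal sketch of the correlation matrix with `DataFrame.corr()` follows. It assumes the numerical input columns are `Pclass`, `Age`, `SibSp`, `Parch`, and `Fare`. The stand-in `inputs` frame uses the five rows printed by `df.head()` in Step 0 and omits `Parch`, which is constant in those rows and would only produce NaN entries; coefficients from five rows are illustrative, not conclusions about the dataset.

```python
import pandas as pd

# Stand-in data: numerical input values of the five rows shown by df.head().
inputs = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pearson is the default method; it is spelled out here to match the task.
corr_matrix = inputs.corr(method="pearson")
print(corr_matrix.round(3))
```

The result is a square, symmetric DataFrame with ones on the diagonal, indexed by column name on both axes.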


Step 8 : Identify Correlated Columns
---

Identify the pairs of input columns whose Pearson coefficient is greater than or equal to 0.8.

> *Hint: Pandas corr() function*
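
One way to list such pairs is to keep the upper triangle of the correlation matrix and filter it. The sketch below compares the absolute value against the threshold so that strong negative correlations are reported as well; that choice is an assumption of this example, not something the task states. The stand-in `inputs` frame uses the five rows printed by `df.head()` in Step 0 (without `Parch`, which is constant in them), so the pair it reports is an artifact of the tiny sample.

```python
import numpy as np
import pandas as pd

# Stand-in data: numerical input values of the five rows shown by df.head().
inputs = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

THRESHOLD = 0.8
corr_matrix = inputs.corr(method="pearson")

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(upper_mask).stack().dropna()

# Pairs whose absolute Pearson coefficient reaches the threshold.
strong_pairs = pairs[pairs.abs() >= THRESHOLD]
print(strong_pairs)
```

To follow the task wording literally and keep only positive correlations, replace `pairs.abs()` with `pairs`.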

Step 9 : Cross-Correlation Heatmap
---

Generate a heatmap plot of the cross-correlation matrix of the input columns.

> *Hint: Pandas, Seaborn heatmap() function*
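
A minimal sketch with Seaborn's `heatmap()` follows. The color limits are fixed to the range of the Pearson coefficient, from -1 to 1, and the colormap and number format are arbitrary choices. The stand-in `inputs` frame uses the five rows printed by `df.head()` in Step 0 and omits `Parch`, which is constant in those rows.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data: numerical input values of the five rows shown by df.head().
inputs = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

corr_matrix = inputs.corr(method="pearson")

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes each coefficient into its cell; vmin/vmax pin the color scale.
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True, ax=ax)
ax.set_title("Pearson correlation of input columns")
```

On the cleaned data, `corr_matrix` would be computed from the input columns of `df` instead of `inputs`.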

Step 10 : Output versus Input Plot
---

Select one of the numerical input columns in your dataset and generate a scatter plot of the output column versus that input column. If the output column is categorical, generate a box plot of the input column for each distinct value of the output column instead. For example, if the output has three distinct categorical values, draw three box plots of the input column, one per value.

> *Hint: check the examples for the Pandas and Matplotlib plot(), scatter(), groupby(), and get_group() functions*
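
Because `Survived` is categorical, the box-plot branch applies here. The sketch below picks `Age` as the input column, which is an arbitrary choice for illustration, and draws one box per value of `Survived`. It uses a stand-in `sample` frame built from the five rows printed by `df.head()` in Step 0.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: Survived and Age of the five rows shown by df.head().
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# groupby() splits the rows by output value; get_group() retrieves one split.
groups = sample.groupby("Survived")
age_survivors = groups.get_group(1)["Age"]
age_non_survivors = groups.get_group(0)["Age"]

# by="Survived" draws one box of Age for each distinct value of the output.
fig, ax = plt.subplots(figsize=(6, 4))
sample.boxplot(column="Age", by="Survived", ax=ax)
ax.set_ylabel("Age")
```

On the cleaned data, the same call is `df.boxplot(column='Age', by='Survived')`, which yields two boxes, one for each value of `Survived`.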
