# Titanic Project

* Importing Libraries

In [1]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots built on top of matplotlib
from pandas_profiling import ProfileReport

%matplotlib inline  

* Data Access

In [2]:
df = pd.read_csv("../train.csv", sep=",")

In [3]:
df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Emanuel, Miss. Virginia Ethel",male,,,,347082.0,,C23 C25 C27,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


Rows with missing values can be dropped. A missing value in the Survived column (our target) is a problem, because such a row cannot be used for training, so those rows are dropped:
df = df.dropna(subset=["Survived"], axis=0)


# Data Cleaning or Data Wrangling

* Identify and handle missing values.
* Data formatting: standardize the values into the same format, unit, or convention.
* Data Normalization: to bring all data into a similar range for more useful comparison (centering, scaling).
* Data binning: creates bigger categories from a set of numerical values (useful for comparison between groups of data).
* Handle categorical variables: how to convert categorical values into numeric variables to make statistical modeling easier.

## Identify and handle missing values


Missing values may be represented as "?", "N/A", 0, or just a blank cell.

1. Check whether the person or group that collected the data can go back and find what the actual value should be.
2. Drop the data: you can drop either the whole variable or just the single data entry with the missing value. If only a few observations have missing data, dropping those particular entries is usually best.
3. Replace the data: this is better in that no data is wasted, but it is less accurate, since the missing value is replaced with a guess of what it should be.
    a. Replace missing values with the average value of the entire variable.
    b. Replace missing values with the most frequent value.
    c. Sometimes the person who gathered the data knows something additional about the missing data and can suggest a better guess.
4. Leave the missing data as missing data.

### SHAPE

Number of rows and columns

In [7]:
df.shape

(891, 12)

### DROP
If there is a missing value in our target variable, we must drop that row. In this case the target is the Survived column.
The dropna method takes three arguments of interest here: subset, axis, and inplace:
* subset: the list of columns to check for missing values
* axis: axis=0 drops rows, axis=1 drops columns
* inplace: inplace=True modifies the DataFrame itself instead of returning a new one
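As a minimal sketch of how these arguments behave, the example below uses a small made-up DataFrame (the `toy` frame and its values are hypothetical, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Survived": [1, np.nan, 0],
    "Age": [22.0, 38.0, np.nan],
    "Pclass": [3, 1, 3],
})

# axis=0 with subset: drop only the rows where "Survived" is missing
rows_kept = toy.dropna(subset=["Survived"], axis=0)

# axis=1: drop every column that contains at least one missing value
cols_kept = toy.dropna(axis=1)

# dropna returns a new DataFrame; toy itself is unchanged unless the
# result is assigned back or inplace=True is passed
print(len(toy), len(rows_kept), list(cols_kept.columns))
```

Note that without `inplace=True` or an assignment such as `toy = toy.dropna(...)`, the call has no lasting effect on the original frame.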

In [9]:
df = df.dropna(subset=["Survived"], axis=0)  # assign back, otherwise the result is discarded
# reset index, in case we drop rows
#df.reset_index(drop=True, inplace=True)
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### MISSING DATA
You can check whether a value is missing by calling the method isnull(), which returns a boolean result:
True stands for a missing value, and False stands for a value that is present.


In [10]:
missing_data = df.isnull() # True and False dataset

# Count missing values in each column

for column in missing_data.columns.values.tolist():   # list of the column names
    print(column)
    print(missing_data[column].value_counts())
    print("")

PassengerId
False    891
Name: PassengerId, dtype: int64

Survived
False    891
Name: Survived, dtype: int64

Pclass
False    891
Name: Pclass, dtype: int64

Name
False    891
Name: Name, dtype: int64

Sex
False    891
Name: Sex, dtype: int64

Age
False    714
True     177
Name: Age, dtype: int64

SibSp
False    891
Name: SibSp, dtype: int64

Parch
False    891
Name: Parch, dtype: int64

Ticket
False    891
Name: Ticket, dtype: int64

Fare
False    891
Name: Fare, dtype: int64

Cabin
True     687
False    204
Name: Cabin, dtype: int64

Embarked
False    889
True       2
Name: Embarked, dtype: int64
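The per-column loop above can also be condensed into a single call. The sketch below assumes a small made-up frame (`toy` is hypothetical) and uses `isnull().sum()` to count missing values per column and `isnull().mean()` to get the missing fraction:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# True counts as 1 and False as 0, so the sum is the number of missing values
missing_counts = toy.isnull().sum()

# The mean of the boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

The fraction is handy when deciding whether a column such as Cabin has too many gaps to be worth imputing.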



### REPLACE

We can use the method replace to replace missing data with a new value: <br>
df.replace(missing_value, new_value) 


In [11]:
# Age: replace missing values with the average age.
age_avg = df["Age"].astype("float").mean(axis=0)
print("age average: ", age_avg)
# Assign the result back: replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions
df["Age"] = df["Age"].replace(np.nan, age_avg)

# Embarked: replace missing values with the mode (most frequent value).
embarked_mode = df["Embarked"].value_counts().idxmax()
print(embarked_mode)
df["Embarked"] = df["Embarked"].replace(np.nan, embarked_mode)

#df.head(3)
df.info()

age average:  29.69911764705882
S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
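As an alternative sketch, not the approach used in this notebook, `fillna` accepts a dictionary mapping column names to fill values, so several columns can be imputed in one call. The `toy` frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# One fill value per column: the mean for Age, the mode for Embarked
fill_values = {
    "Age": toy["Age"].mean(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna returns a new DataFrame, so the result is assigned back
toy = toy.fillna(value=fill_values)
print(toy)
```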


## Data Formatting

Bringing data into a common standard of expression that allows users to make meaningful comparisons.

* We can set the proper type of a variable using the method astype() <br>
    Convert data types: <br>
    df[["Survived"]] = df[["Survived"]].astype("int") 
* Change variable definition: <br>
    df["col"] = df["col"]/100 <br>
    df.rename(columns={"col":"col/100"}, inplace=True)
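A minimal sketch of both operations on a made-up frame (the `toy` data and the choice of dividing Fare by 100 are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: Survived stored as strings, Fare in hundredths
toy = pd.DataFrame({
    "Survived": ["0", "1", "1"],
    "Fare": [725.0, 7128.0, 792.0],
})

# Convert the data type of a column
toy["Survived"] = toy["Survived"].astype("int")

# Change the variable definition, then rename the column to reflect it
toy["Fare"] = toy["Fare"] / 100
toy = toy.rename(columns={"Fare": "Fare/100"})

print(toy.dtypes)
print(toy)
```

Renaming right after the conversion keeps the column name in sync with the unit it now holds.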

## Data normalization

For example, suppose we are studying two variables x and y. The range of x is [0,100] and the range of y is [10000,2000000]. If we fit a linear regression, y will tend to influence the result more simply because of its larger values, which does not necessarily mean it is more important. Data normalization brings both variables to a similar range so that they are judged equally. <br>

Three data normalization techniques are: <br>
* simple feature scaling: df["col"] = df["col"]/df["col"].max() # divide by the maximum. Range [0,1] for non-negative data
* min-max scaling: df["col"] = (df["col"]-df["col"].min())/(df["col"].max()-df["col"].min()) # Range [0,1]
* z-score (standard score): df["col"] = (df["col"]-df["col"].mean())/df["col"].std() # mean 0, standard deviation 1; values typically fall within [-3,3]
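The three formulas can be compared side by side on a small made-up series (the `fare` values below are hypothetical):

```python
import pandas as pd

# Hypothetical values, chosen so the minimum is not zero
fare = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="Fare")

# Divide by the maximum: the largest value becomes 1
simple_scaled = fare / fare.max()

# Subtract the minimum and divide by the range: values span exactly [0, 1]
min_max = (fare - fare.min()) / (fare.max() - fare.min())

# Subtract the mean and divide by the standard deviation
z_score = (fare - fare.mean()) / fare.std()

print(pd.DataFrame({"simple": simple_scaled, "min_max": min_max, "z": z_score}))
```

Dividing by the maximum leaves the smallest value above 0 when the data does not start at 0, whereas the min-max form always maps the smallest value to 0.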

## Binning

Convert numerical values into categories. For example, Age: Child, Teen, Adult and Senior. <br>

In [12]:
bins = [0,12,18,65,80]
group_ages = 'Child','Teen','Adult','Senior'
df["Cat_age"] = pd.cut(df["Age"], bins, labels=group_ages, include_lowest = True)
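To see how the bin edges are treated, the sketch below applies the same bins and labels to a few made-up ages (the `ages` values are hypothetical). By default `pd.cut` uses right-closed intervals, so an age of exactly 12 falls into Child and 18 into Teen; `include_lowest=True` makes the first interval include 0 as well.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([0.42, 12, 13, 18, 30, 65, 66, 80], name="Age")

bins = [0, 12, 18, 65, 80]
labels = ["Child", "Teen", "Adult", "Senior"]

# Intervals are [0, 12], (12, 18], (18, 65], (65, 80]
cat_age = pd.cut(ages, bins, labels=labels, include_lowest=True)

# Count how many ages fall in each bin, in bin order
print(cat_age.value_counts(sort=False))
```

Ages above the last edge (here 80) would become NaN, so the last bin edge should cover the maximum value in the data.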

## Converting categorical values into numeric variables 

Most statistical models cannot take objects or strings as input; for model
training they accept only numbers.

Solutions:
1. One-hot encoding technique:
Add dummy variables. For example, Embarked takes the values S, Q and C.
    dummy variables: Emb_S is 1 if the passenger boarded at S, and 0 otherwise
                     Emb_Q is 1 if the passenger boarded at Q, and 0 otherwise
                     Emb_C is 1 if the passenger boarded at C, and 0 otherwise
    pd.get_dummies(df["Embarked"])
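A minimal sketch of how the Emb_S, Emb_Q and Emb_C names can be produced directly, assuming the `prefix` and `dtype` parameters of `pd.get_dummies` and a made-up series (`embarked` is hypothetical, not the notebook's column):

```python
import pandas as pd

# Hypothetical sample of embarkation ports
embarked = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# prefix="Emb" names the columns Emb_C, Emb_Q, Emb_S;
# dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(embarked, prefix="Emb", dtype=int)

print(dummies)
```

Each row has exactly one 1 across the dummy columns, which marks the port where that passenger boarded.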

In [13]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Cat_age'],
      dtype='object')

In [14]:
# Dummy variables
Emb_1 = pd.get_dummies(df["Embarked"])
Emb_1.head()

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [15]:
# concatenate data frame "df" and the dummy variables "Emb_1" column-wise
df = pd.concat([df, Emb_1], axis=1)
# drop original column "Embarked" from "df"
df.drop("Embarked", axis = 1, inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Cat_age,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,Adult,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,Adult,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,Adult,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,Adult,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,Adult,0,0,1
