Write a Python script to:
1. Read in the Titanic data (titanic_traning.csv), which is available in the homework folder. You may use read_csv().
2. Assess the severity of the missing-value and data-inconsistency problems. Specifically, generate a summary of the missing values and inconsistent values for each feature.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("titanic_traning.csv")

## Create an empty data frame to store the counts of missing and inconsistent values.

In [3]:
features = df.columns  # Extract the column (feature) names of the data frame
# Create an all-zero data frame to store the counts of missing and inconsistent values.
# The first column (ID) is skipped; dtype=object lets the percentage cells hold formatted strings.
df_m_i = pd.DataFrame(0,
                      columns = ["Missing Values (MV)", "% of MV (MV/n)", "Inconsistency Values (IV)", "% of IV (IV/n)"],
                      index = features[1:],
                      dtype = object)

## Count the number and percentage of missing values in each feature (column).

In [4]:
for feature in features[1:]:  # skip the ID column
    n_of_missing_value = df[feature].isna().sum()
    df_m_i.loc[feature, "Missing Values (MV)"] = n_of_missing_value
    df_m_i.loc[feature, "% of MV (MV/n)"] = "{:.2%}".format(n_of_missing_value / df.shape[0])
    
df_m_i

          Missing Values (MV)  % of MV (MV/n)  Inconsistency Values (IV)  % of IV (IV/n)
pclass                      0           0.00%                          0               0
sex                         0           0.00%                          0               0
age                       188          20.52%                          0               0
sibsp                       0           0.00%                          0               0
parch                       0           0.00%                          0               0
fare                        9           0.98%                          0               0
embarked                    0           0.00%                          0               0
survived                    0           0.00%                          0               0
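
The per-column loop above can also be written without an explicit loop. The following is a minimal sketch, not the notebook's own method: `missing_summary` is a hypothetical helper name, and the toy data frame uses made-up values in place of the Titanic file.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column (hypothetical helper)."""
    counts = frame.isna().sum()   # number of NaN cells in each column
    share = frame.isna().mean()   # mean of booleans = fraction missing
    return pd.DataFrame({
        "Missing Values (MV)": counts,
        "% of MV (MV/n)": share.map("{:.2%}".format),
    })


# Toy data standing in for the Titanic file (values are made up).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, np.nan],
    "fare": [7.9, 8.1, np.nan, 13.0],
    "sex": ["male", "female", "male", "female"],
})
summary = missing_summary(toy)
print(summary)
```

Taking the mean of the boolean mask gives the missing fraction directly, so the row count never has to be divided by hand.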


## Fill the missing values: use the mean for numerical features (the mode would be used for categorical features, but none of them have missing values here).

In [5]:
# Fill the missing values in "age" with its mean (rounded to the nearest integer) since it is a numerical feature.
# mean() skips NaN by default; assigning the result back avoids chained inplace modification.
df["age"] = df["age"].fillna(np.round(df["age"].mean()))
# Fill the missing values in "fare" with its mean (rounded to one decimal place) since it is a numerical feature
df["fare"] = df["fare"].fillna(np.round(df["fare"].mean(), 1))
df.isna().sum()  # Check that no missing values remain

ID          0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
survived    0
dtype: int64
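
Only numerical columns needed filling in this data set, so the mode rule for categorical features is never exercised above. The following is a minimal sketch of how both rules might be combined, assuming every column has at least one non-missing value. `impute_simple` is a hypothetical helper and the toy values are made up. Unlike the cell above, it does not round the mean.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: mean for numeric columns, mode otherwise."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            fill = out[col].mean()          # NaN is skipped by default
        else:
            fill = out[col].mode().iloc[0]  # first mode if there are ties
        out[col] = out[col].fillna(fill)
    return out


# Toy data (made up) with one numeric and one categorical gap.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embarked": ["S", None, "S"],
})
filled = impute_simple(toy)
print(filled)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the data before and after imputation.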

## Check each feature for inconsistent values using value_counts().

In [6]:
for i in range(1, len(features)):
    print(df[features[i]].value_counts())
    print()

3    495
1    216
2    205
Name: pclass, dtype: int64

male      581
female    331
Male        2
Female      2
Name: sex, dtype: int64

29.0    209
24.0     34
22.0     31
21.0     30
30.0     28
       ... 
55.5      1
24.5      1
36.5      1
74.0      1
60.5      1
Name: age, Length: 94, dtype: int64

0    617
1    229
2     26
3     16
4     15
8      8
5      5
Name: sibsp, dtype: int64

0    705
1    113
2     81
3      7
5      5
9      2
4      2
6      1
Name: parch, dtype: int64

7.9     77
7.8     73
13.0    50
8.1     41
26.0    36
        ..
79.7     1
25.6     1
77.3     1
93.5     1
4.0      1
Name: fare, Length: 198, dtype: int64

S             649
C             175
Q              89
Queenstown      3
Name: embarked, dtype: int64

0    563
1    353
Name: survived, dtype: int64
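
Reading the value_counts() output by eye works for a handful of columns. The following is a rough sketch of how labels that differ only in letter case or surrounding whitespace might be flagged automatically. `case_variants` is a hypothetical helper, and the two series are small made-up samples using the labels seen above.

```python
import pandas as pd


def case_variants(series: pd.Series) -> dict:
    """Group distinct labels that collapse to the same stripped, lower-cased key."""
    groups = {}
    for label in series.dropna().unique():
        key = str(label).strip().lower()
        groups.setdefault(key, []).append(str(label))
    # Keep only keys that are spelled in more than one way.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


sex = pd.Series(["male", "female", "Male", "Female", "male"])
embarked = pd.Series(["S", "C", "Q", "Queenstown"])

sex_variants = case_variants(sex)
embarked_variants = case_variants(embarked)
print(sex_variants)
print(embarked_variants)
```

This check catches "Male" versus "male" but not "Queenstown" versus "Q", since an abbreviation and its full name do not share a normalized key. Inspecting value_counts() is still needed for that kind of inconsistency.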



### The "sex" and "embarked" columns contain inconsistent values. Record how many there are, then fix them to make the data consistent.

In [7]:
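# Rows whose "sex" label is capitalized ("Male"/"Female") instead of lowercase
sex_inconsistent = df["sex"].isin(["Male", "Female"])
df_m_i.loc["sex", "Inconsistency Values (IV)"] = sex_inconsistent.sum()
df_m_i.loc["sex", "% of IV (IV/n)"] = "{:.2%}".format(sex_inconsistent.sum() / df.shape[0])
# Replace the capitalized labels with the lowercase ones used by the rest of the column
df.loc[df["sex"] == "Male", "sex"] = "male"
df.loc[df["sex"] == "Female", "sex"] = "female"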

In [8]:
# Rows where the port is spelled out ("Queenstown") instead of abbreviated ("Q")
embarked_inconsistent = df["embarked"] == "Queenstown"
df_m_i.loc["embarked", "Inconsistency Values (IV)"] = embarked_inconsistent.sum()
df_m_i.loc["embarked", "% of IV (IV/n)"] = "{:.2%}".format(embarked_inconsistent.sum() / df.shape[0])
df.loc[embarked_inconsistent, "embarked"] = "Q"
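
The fixes above hard-code each bad label in its own statement. The following is a minimal sketch of a table-driven alternative, where the corrections are kept in one dictionary and the number of affected rows is counted before replacing. `CANONICAL` and `fix_labels` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical correction table: column -> {bad label: canonical label}
CANONICAL = {
    "sex": {"Male": "male", "Female": "female"},
    "embarked": {"Queenstown": "Q"},
}


def fix_labels(frame: pd.DataFrame, mapping: dict):
    """Return a corrected copy and the number of rows changed per column."""
    out = frame.copy()
    counts = {}
    for col, repl in mapping.items():
        # Count the rows that carry a bad label before they are replaced.
        counts[col] = int(out[col].isin(list(repl)).sum())
        out[col] = out[col].replace(repl)
    return out, counts


toy = pd.DataFrame({
    "sex": ["male", "Male", "female", "Female", "male"],
    "embarked": ["S", "Queenstown", "Q", "C", "S"],
})
fixed, counts = fix_labels(toy, CANONICAL)
print(counts)
print(fixed)
```

Keeping the corrections in a single dictionary means a newly discovered bad label is a one-line change, and the same counts can feed the summary table.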

### Examine the data again

In [9]:
for i in range(1, len(features)):
    print(df[features[i]].value_counts())
    print()

3    495
1    216
2    205
Name: pclass, dtype: int64

male      583
female    333
Name: sex, dtype: int64

29.0    209
24.0     34
22.0     31
21.0     30
30.0     28
       ... 
55.5      1
24.5      1
36.5      1
74.0      1
60.5      1
Name: age, Length: 94, dtype: int64

0    617
1    229
2     26
3     16
4     15
8      8
5      5
Name: sibsp, dtype: int64

0    705
1    113
2     81
3      7
5      5
9      2
4      2
6      1
Name: parch, dtype: int64

7.9     77
7.8     73
13.0    50
8.1     41
26.0    36
        ..
79.7     1
25.6     1
77.3     1
93.5     1
4.0      1
Name: fare, Length: 198, dtype: int64

S    649
C    175
Q     92
Name: embarked, dtype: int64

0    563
1    353
Name: survived, dtype: int64



### Print the summary of missing and inconsistent values.

In [10]:
df_m_i

          Missing Values (MV)  % of MV (MV/n)  Inconsistency Values (IV)  % of IV (IV/n)
pclass                      0           0.00%                          0               0
sex                         0           0.00%                          4           0.44%
age                       188          20.52%                          0               0
sibsp                       0           0.00%                          0               0
parch                       0           0.00%                          0               0
fare                        9           0.98%                          0               0
embarked                    0           0.00%                          3           0.33%
survived                    0           0.00%                          0               0


### Store the cleaned data to a .csv file.

In [11]:
df.to_csv("cleaned_data.csv", index = False)
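
As a quick sanity check on the export step, the sketch below writes a small made-up frame to an in-memory buffer (standing in for cleaned_data.csv) and reads it back. It also illustrates why `index = False` is passed: without it, the unnamed row index is written as an extra column that read_csv() labels "Unnamed: 0". The variable names here are illustrative only.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data (values are made up).
cleaned = pd.DataFrame({
    "ID": [1, 2, 3],
    "sex": ["male", "female", "male"],
    "embarked": ["S", "Q", "C"],
})

# Round trip without the index, as in the cell above.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Round trip with the default index=True for comparison.
buffer_with_index = io.StringIO()
cleaned.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```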