# Libraries

In [None]:
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

# Challenge: "Titanic ML Competition"

## Challenge:

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

## Question:

What sorts of people were more likely to survive?

## Source:
https://www.kaggle.com/c/titanic/data

# Data Summary

In [None]:
#Load data
raw_train_data = pd.read_csv("../data/raw/train.csv")
raw_test_data = pd.read_csv("../data/raw/test.csv")

In [None]:
#Review shape
print("*"*10)
print(f'train shape: {raw_train_data.shape}')
print(f'test shape: {raw_test_data.shape}')
print("*"*10)

#Review columns
print(f'train columns: {raw_train_data.columns}')
print(f'test columns: {raw_test_data.columns}')
print("*"*10)

#Review data structure info
#info() prints its report directly and returns None, so it is not wrapped in print()
print('train info:')
raw_train_data.info()
print('test info:')
raw_test_data.info()
print("*"*10)

#Review summary
print(f'train summary:\n {raw_train_data.describe()}')
print(f'test summary:\n {raw_test_data.describe()}')
print("*"*10)

In [None]:
#Review some rows
raw_train_data.head(5)

According to the data description (https://www.kaggle.com/c/titanic/data), the dataset contains categorical features (Pclass, Sex, Embarked, Cabin), numerical features (Fare, Parch, SibSp, Age), and identification features (PassengerId, Name, Ticket). The target, Survived, is a binary categorical variable.

There are also some missing values, which we will examine in the EDA.
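As a first look at the missing values mentioned above, the sketch below counts them per column. It runs on a small hypothetical frame (`toy`) rather than the real data, and `missing_summary` is an illustrative helper that is not part of this notebook. In practice it could be applied to `raw_train_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_train_data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def missing_summary(df):
    """Return the count and percentage of missing values per column.

    Only columns with at least one missing value are kept,
    sorted from most to least missing.
    """
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Reporting the percentage next to the raw count makes it easier to compare the train and test sets, which have different numbers of rows.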

In [None]:
cat_features = ['Cabin', 'Embarked', 'Sex', 'Pclass']
num_features = ['SibSp', 'Parch', 'Fare', 'Age']
id_features = ['PassengerId', 'Name', 'Ticket']
target = 'Survived'
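A quick sanity check on this grouping is to confirm that every column belongs to exactly one group. The sketch below is one possible way to do that. `check_column_groups` is a hypothetical helper, and `all_columns` is written out by hand from the Kaggle data description. In the notebook it could be replaced by `list(raw_train_data.columns)`.

```python
# Feature groups as defined in the cell above.
cat_features = ['Cabin', 'Embarked', 'Sex', 'Pclass']
num_features = ['SibSp', 'Parch', 'Fare', 'Age']
id_features = ['PassengerId', 'Name', 'Ticket']
target = 'Survived'

# Assumption: the train columns, typed by hand from the data description.
all_columns = [
    'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
    'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked',
]


def check_column_groups(columns, groups):
    """Return (unassigned, overlapping, unknown) for a column grouping.

    unassigned:  columns that belong to no group
    overlapping: columns that belong to more than one group
    unknown:     grouped names that are not actual columns
    """
    seen = {}
    for group_name, cols in groups.items():
        for col in cols:
            seen.setdefault(col, []).append(group_name)
    unassigned = [c for c in columns if c not in seen]
    overlapping = {c: g for c, g in seen.items() if len(g) > 1}
    unknown = [c for c in seen if c not in columns]
    return unassigned, overlapping, unknown


groups = {
    'categorical': cat_features,
    'numerical': num_features,
    'identification': id_features,
    'target': [target],
}
unassigned, overlapping, unknown = check_column_groups(all_columns, groups)
print(unassigned, overlapping, unknown)
```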

# Data Integrity

In [None]:
#Create a dataset without the identification features (for now)
train_data = raw_train_data.drop(id_features, axis=1)
test_data = raw_test_data.drop(id_features, axis=1)

#Create the Dataset according to deepchecks format
train_deepchecks = Dataset(
    train_data, 
    cat_features = cat_features,
    label = target
)

In [None]:
#Run the data integrity test
integrity_suite = data_integrity()
integrity_results = integrity_suite.run(train_deepchecks)
integrity_results.save_as_html("../docs/data_integrity_report_train.html")

The data integrity evaluation shows that:
- We need to review the importance of the "Cabin" feature because it is highly correlated with other features.

- Some rows (4.83% of the data) have identical feature values but different survival labels. This may be caused by missing values in some columns, so a missing-values analysis is important.

- There are duplicate rows, but this is expected because we dropped the identification columns and many passengers can share the same characteristics.
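The last two findings can be cross-checked directly in pandas. The sketch below uses a small hypothetical frame (`toy`) with made-up values. In the notebook, `train_data` and `target` would take its place. It counts every row in a conflicting group, which is an assumption and may differ from how deepchecks computes its percentage.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data (values are made up).
toy = pd.DataFrame({
    "Pclass": [3, 3, 1, 3, 1],
    "Sex": ["male", "male", "female", "male", "female"],
    "Age": [None, None, 38.0, 22.0, 38.0],
    "Survived": [0, 1, 1, 0, 1],
})
label = "Survived"
features = [c for c in toy.columns if c != label]

# Group rows by their feature values. dropna=False keeps groups whose
# keys contain missing values, which would otherwise be dropped.
stats = toy.groupby(features, dropna=False)[label].agg(
    n_rows="size",       # rows sharing these feature values
    n_labels="nunique",  # distinct labels among those rows
)

# Conflicting labels: same features, more than one distinct label.
conflicts = stats[stats["n_labels"] > 1]
conflict_share = conflicts["n_rows"].sum() / len(toy)

# Exact duplicates: identical features and identical label.
duplicate_rows = toy[toy.duplicated(keep=False)]

print(f"rows in conflicting groups: {conflict_share:.2%}")
print(f"rows that are exact duplicates: {len(duplicate_rows)}")
```

In this toy example the conflicting group is the one with a missing `Age`, which mirrors the hypothesis above that missing values may be hiding the differences between those passengers.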