In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import zipfile
#from tensorflow.python.lib.io import file_io
import tensorflow_data_validation as tfdv
#from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics

tf.get_logger().setLevel('ERROR')


## Reading in data

In [2]:
zip_file = zipfile.ZipFile('../data/raw_data/titanic.zip')

# Map each CSV file name in the archive to a DataFrame of its contents
dfs = {
    text_file.filename: pd.read_csv(zip_file.open(text_file.filename))
    for text_file in zip_file.infolist()
    if text_file.filename.endswith('.csv')
}

df_gender = dfs['gender_submission.csv']
df_test = dfs['test.csv']
df_train = dfs['train.csv']

The "gender" submission is an example submission that assumes only women survive. We won't need that file.
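
The archive-reading step can be tried without the Kaggle download. The sketch below builds a tiny in-memory zip (the file names and contents are made up for illustration) and loads every `.csv` member into a dictionary keyed by file name. It uses the standard-library `csv` module rather than `pd.read_csv`, so rows come back as dictionaries of strings instead of DataFrames.

```python
import csv
import io
import zipfile

# Build a small in-memory archive; names and contents are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "PassengerId,Survived\n1,0\n2,1\n")
    archive.writestr("test.csv", "PassengerId\n3\n")
    archive.writestr("README.txt", "not a table")

# Read every CSV member back, skipping anything else in the archive.
tables = {}
with zipfile.ZipFile(buffer) as archive:
    for member in archive.infolist():
        if not member.filename.endswith(".csv"):
            continue
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            tables[member.filename] = list(csv.DictReader(text))

print(sorted(tables))
```

Filtering on the `.csv` suffix keeps stray files in the archive from being parsed as tables.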

## Data Validation

Before we touch anything, it is a good idea to hold out part of the training set as an evaluation set.

In [4]:
df_train, df_eval = train_test_split(df_train, test_size=0.1, random_state=72)
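
To make the role of `test_size` and `random_state` concrete, here is a minimal sketch of a seeded hold-out split in plain Python. It is not scikit-learn's implementation, and the helper name `holdout_split` is hypothetical. It only illustrates the idea: shuffle the row indices with a fixed seed, then set aside a fraction of them.

```python
import random


def holdout_split(rows, test_size, seed):
    """Shuffle row indices with a fixed seed and hold out a fraction of the rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_eval = round(len(rows) * test_size)
    eval_idx = set(indices[:n_eval])
    train = [row for i, row in enumerate(rows) if i not in eval_idx]
    held_out = [row for i, row in enumerate(rows) if i in eval_idx]
    return train, held_out


rows = list(range(100))  # stand-in for 100 passengers
train, held_out = holdout_split(rows, test_size=0.1, seed=72)
print(len(train), len(held_out))
```

Fixing the seed means the same rows land in the evaluation set on every run, which keeps the statistics computed later reproducible.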

We'll begin by generating statistics for the training data.
Before calculating any stats, it is a good idea to exclude irrelevant features such as **PassengerId** and **Name**.

In [5]:
features_to_remove = ['PassengerId', 'Name']
approved_cols = [col for col in df_train.columns if (col not in features_to_remove)]
stats_options = tfdv.StatsOptions(feature_allowlist=approved_cols)

Generating stats for each of the datasets:

In [6]:
train_stats = tfdv.generate_statistics_from_dataframe(df_train, stats_options=stats_options)
eval_stats = tfdv.generate_statistics_from_dataframe(df_eval, stats_options=stats_options)

Visualizing the data:

In [22]:
df_train

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 239 | 240 | 0 | 2 | Hunt, Mr. George Henry | male | 33.0 | 0 | 0 | SCO/W 1585 | 12.2750 | | S |
| 297 | 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
| 119 | 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.0 | 4 | 2 | 347082 | 31.2750 | | S |
| 618 | 619 | 1 | 2 | Becker, Miss. Marion Louise | female | 4.0 | 2 | 1 | 230136 | 39.0000 | F4 | S |
| 721 | 722 | 0 | 3 | Jensen, Mr. Svend Lauritz | male | 17.0 | 1 | 0 | 350048 | 7.0542 | | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | | S |
| 74 | 75 | 1 | 3 | Bing, Mr. Lee | male | 32.0 | 0 | 0 | 1601 | 56.4958 | | S |
| 46 | 47 | 0 | 3 | Lennon, Mr. Denis | male | | 1 | 0 | 370371 | 15.5000 | | Q |
| 787 | 788 | 0 | 3 | Rice, Master. George Hugh | male | 8.0 | 4 | 1 | 382652 | 29.1250 | | Q |


In [7]:
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

A few observations from the visualizations:
- About 1/5th of the ages are missing
- Most passengers did not have a sibling or spouse aboard (SibSp)
- Most passengers did not have a parent or child aboard (Parch)
- About 3/4ths of the examples have no cabin number

Moreover, the distributions of the training data and the eval data seem mostly the same.
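
Missing-value fractions like the ones above are easy to sanity-check by hand. The sketch below uses only the Age and Cabin values from the nine rows visible in the `df_train` preview, so the numbers describe that small sample and not the full training set. The helper `missing_fraction` is a hypothetical name.

```python
# Age and Cabin values copied from the nine rows shown in the df_train preview;
# None marks a missing entry.
ages = [33.0, 2.0, 2.0, 4.0, 17.0, 4.0, 32.0, None, 8.0]
cabins = [None, "C22 C26", None, "F4", None, None, None, None, None]


def missing_fraction(values):
    """Return the share of entries that are missing (None)."""
    return sum(value is None for value in values) / len(values)


print(f"Age missing:   {missing_fraction(ages):.3f}")
print(f"Cabin missing: {missing_fraction(cabins):.3f}")
```

On the full DataFrame, `df_train.isna().mean()` gives the same kind of per-column fraction in one call.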

### Some steps we may wish to take later on:
- Bucketizing ages could work nicely
- A binary indicator of whether Cabin is missing could be sufficient, since so many Cabin values are missing
- Some knowledge of which cabins were more "preferred" could help us predict survival
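
As a rough sketch of the first two ideas, the snippet below bucketizes ages and derives a cabin-missing indicator in plain Python. The cut points in `AGE_BOUNDARIES` are an assumption chosen only for illustration, and both function names are hypothetical. Nothing here has been tuned against the data.

```python
from bisect import bisect_right

# Hypothetical cut points: buckets are <13, 13-17, 18-39, 40-59, 60+.
AGE_BOUNDARIES = [13, 18, 40, 60]


def age_bucket(age):
    """Return the index of the age bucket, or -1 when the age is missing."""
    if age is None:
        return -1
    return bisect_right(AGE_BOUNDARIES, age)


def cabin_missing(cabin):
    """Return 1 when no cabin is recorded, otherwise 0."""
    return int(cabin is None or cabin == "")


print(age_bucket(17.0), age_bucket(33.0), age_bucket(None))
print(cabin_missing(None), cabin_missing("F4"))
```

Giving missing ages their own bucket keeps those rows usable without imputing a value.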

### Next, let's infer the schema of the training data.

In [8]:
schema = tfdv.infer_schema(train_stats)

In [9]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'Survived' | INT | required | | - |
| 'Pclass' | INT | required | | - |
| 'Sex' | STRING | required | | 'Sex' |
| 'Age' | FLOAT | optional | single | - |
| 'SibSp' | INT | required | | - |
| 'Parch' | INT | required | | - |
| 'Ticket' | BYTES | required | | - |
| 'Fare' | FLOAT | required | | - |
| 'Cabin' | BYTES | optional | single | - |
| 'Embarked' | STRING | optional | single | 'Embarked' |


| Domain | Values |
|---|---|
| 'Sex' | 'female', 'male' |
| 'Embarked' | 'C', 'Q', 'S' |
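
To build some intuition for what a schema like this records, here is a toy analogue in plain Python. It is not TFDV's inference algorithm, and `infer_toy_schema` and `max_domain_size` are hypothetical names. It derives a type, a presence flag, and a small value domain for string columns from three rows taken from the `df_train` preview.

```python
def infer_toy_schema(rows, max_domain_size=5):
    """Derive a type, presence, and (for small string columns) a domain per column."""
    schema = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [value for value in values if value is not None]
        if all(isinstance(value, int) for value in present):
            kind = "INT"
        elif all(isinstance(value, (int, float)) for value in present):
            kind = "FLOAT"
        else:
            kind = "STRING"
        entry = {
            "type": kind,
            "presence": "required" if len(present) == len(values) else "optional",
        }
        if kind == "STRING":
            distinct = sorted(set(present))
            if len(distinct) <= max_domain_size:
                entry["domain"] = distinct
        schema[name] = entry
    return schema


# Three rows from the df_train preview (Hunt, Allison, Lennon); None marks a missing Age.
sample_rows = [
    {"Pclass": 2, "Sex": "male", "Age": 33.0, "Embarked": "S"},
    {"Pclass": 1, "Sex": "female", "Age": 2.0, "Embarked": "S"},
    {"Pclass": 3, "Sex": "male", "Age": None, "Embarked": "Q"},
]
toy_schema = infer_toy_schema(sample_rows)
for name, entry in toy_schema.items():
    print(name, entry)
```

Because the sample is so small, the toy result differs from the real schema in places. For instance, Embarked comes out as required here simply because none of the three rows lack it.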


Now we should validate the eval statistics against the schema inferred from the training data.

In [10]:
anomalies = tfdv.validate_statistics(eval_stats, schema)

tfdv.display_anomalies(anomalies)

Hooray! The evaluation set is similar enough to the training data that there are no anomalies.

Next, let's check the test data.

In [11]:
test_options = tfdv.StatsOptions(schema=schema, 
                                 infer_type_from_schema=True, 
                                 feature_allowlist=approved_cols)

In [12]:
test_stats = tfdv.generate_statistics_from_dataframe(df_test, stats_options=test_options)

In [13]:
tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats,
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

And checking for anomalies in the test set:

In [14]:
anomalies = tfdv.validate_statistics(test_stats, schema)

tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'Survived' | Column dropped | Column is completely missing |
| 'Fare' | Multiple errors | The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. The feature was present in fewer examples than expected: minimum fraction = 1.000000, actual = 0.997608 |


It appears that, unlike in the training data, Fare is missing from some of the test observations. And of course, the test data doesn't have 'Survived' as a feature. We'll need to handle these two issues.

First, dealing with Fare: we want to remove 'required' from the presence of Fare. To do this, we'll set min_fraction to 0.0 and set the value_count min and max to 1, so that Fare may be absent but must hold exactly one value whenever it is present.

In [16]:
fare = tfdv.get_feature(schema, 'Fare')
fare.presence.min_fraction = 0.0
fare.value_count.min = 1
fare.value_count.max = 1
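
The presence check behind the Fare anomaly comes down to comparing a fraction against a threshold. The sketch below is a plain-Python illustration, not TFDV's code, and `presence_anomaly` is a hypothetical helper. The row counts are an assumption: 418 test rows with one missing Fare, which is consistent with the reported fraction of 0.997608.

```python
def presence_anomaly(present_count, total_count, min_fraction):
    """Return a message when a feature appears in too few examples, else None."""
    actual = present_count / total_count
    if actual < min_fraction:
        return (
            "present in fewer examples than expected: "
            f"minimum fraction = {min_fraction:.6f}, actual = {actual:.6f}"
        )
    return None


# Assumption: 418 test rows, exactly one of them without a Fare.
strict = presence_anomaly(417, 418, min_fraction=1.0)
relaxed = presence_anomaly(417, 418, min_fraction=0.0)
print(strict)
print(relaxed)
```

With the threshold at 1.0 a single missing value is enough to trigger the message, and with 0.0 the check can no longer fail.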

In [17]:
anomalies = tfdv.validate_statistics(test_stats, schema)

tfdv.display_anomalies(anomalies)

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'Survived' | Column dropped | Column is completely missing |


Now, dealing with the label column:

In [18]:
schema.default_environment.append('TRAINING')
schema.default_environment.append('TESTING')

In [19]:
tfdv.get_feature(schema, 'Survived').not_in_environment.append('TESTING')

Finding anomalies now that we've specified that the label is not in the 'TESTING' environment:

In [20]:
anomalies = tfdv.validate_statistics(test_stats, schema, environment='TESTING')

And displaying the anomalies:

In [21]:
tfdv.display_anomalies(anomalies)

Hooray! No anomalies.
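
The effect of environments can be pictured with a small plain-Python model. This is a toy analogue and not how TFDV represents schemas. The dictionary layout and the helper names `expected_features` and `missing_columns` are hypothetical. A feature is expected in an environment unless that environment is listed in the feature's `not_in_environment`.

```python
def expected_features(schema, environment):
    """List the features that should be present in the given environment."""
    return [
        name
        for name, spec in schema.items()
        if environment not in spec.get("not_in_environment", [])
    ]


def missing_columns(schema, columns, environment):
    """List the expected features that the dataset does not provide."""
    return [
        name
        for name in expected_features(schema, environment)
        if name not in columns
    ]


# Toy schema: only the label is excluded from the TESTING environment.
toy_schema = {
    "Survived": {"not_in_environment": ["TESTING"]},
    "Fare": {},
    "Sex": {},
}
test_columns = {"Fare", "Sex"}

print(missing_columns(toy_schema, test_columns, "TRAINING"))
print(missing_columns(toy_schema, test_columns, "TESTING"))
```

Validating the test columns as TRAINING reports the label as missing, while validating them as TESTING reports nothing. That mirrors the two `validate_statistics` results above.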

## Exploring Features

- Look at Cabin numbers for example
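
One possible starting point for the Cabin exploration is to pull out the leading letter of each cabin entry. The sketch below assumes that this letter identifies the deck, which has not been verified here. `cabin_decks` is a hypothetical helper, and the sample values are the two non-empty Cabin entries from the `df_train` preview.

```python
def cabin_decks(cabin):
    """Return the sorted, distinct leading letters of a cabin string."""
    if not cabin:
        return []  # missing or empty Cabin value
    return sorted({token[0] for token in cabin.split() if token[0].isalpha()})


# Values taken from the df_train preview, plus a missing entry.
for value in ["C22 C26", "F4", None]:
    print(repr(value), cabin_decks(value))
```

Returning a list handles entries such as `C22 C26`, where one passenger is associated with several cabins.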