<a href="https://colab.research.google.com/github/betoval/learning-tensorflow/blob/master/titanic-estimator-2.5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

In [172]:
#we use pandas to load the csv files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
#showing some of the data
train.head(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Note that some values of the 'Age' feature are missing (pandas displays them as "NaN"), for example in the row for Mr. James Moran.

In [173]:
total = train.isnull().sum().sort_values(ascending=False)
total.head()

Cabin       687
Age         177
Embarked      2
Fare          0
Ticket        0
dtype: int64

In [174]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Here we see that 177 values of Age, 687 of Cabin, and 2 of Embarked are missing. As a first attempt, we will replace the missing Age values with the mean age and the missing Embarked values with S, which is the most common value.
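The imputation idea described above can be sketched on a toy table. This is only an illustration: `toy_train`, `toy_test`, `age_fill` and `embarked_fill` are hypothetical names and the values are made up. It computes the fill values from the training data instead of typing them in by hand, then applies the same values to both tables.

```python
import pandas as pd

# Toy stand-ins for the Titanic columns (made-up values, not the real data).
toy_train = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 40.0],
    "Embarked": ["Q", None],
})

# mean() ignores missing values: (22 + 38 + 30) / 3 = 30.0
age_fill = toy_train["Age"].mean()
# mode() returns the most frequent value(s); take the first one.
embarked_fill = toy_train["Embarked"].mode()[0]

# Fill each column with its own replacement value.
fill_values = {"Age": age_fill, "Embarked": embarked_fill}
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(age_fill, embarked_fill)
print(toy_train.isnull().sum().sum(), toy_test.isnull().sum().sum())
```

Computing the fill values once from the training rows keeps the training and test tables consistent with each other.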

In [175]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can see that the data has 11 features plus Survived, which is the column we want to predict.

Below, we can examine the statistics of the data using `pd.DataFrame.describe`

In [176]:
train.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The count row shows only 714 'Age' values out of 891, so 177 are missing. We need to take care of that.

Of course, not all features are useful. In fact, we don't need PassengerId, Ticket, or Name, because they don't tell us anything about the survival rate. The Cabin feature could be useful, but we will drop it because most of its values are missing (687 of 891).

In [177]:
#show train columns
train.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

In [178]:
#show test columns
test.columns.values

array(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

In [179]:
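#obtain the mean age to replace the missing values
#numeric_only=True skips text columns such as Name (recent pandas versions raise an error otherwise)
train.mean(axis=0, numeric_only=True)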

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

In [180]:
#features
train_x = train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin','Survived'],
                     axis=1)
#label
train_y = train['Survived']
#test dataset, doesn't include the label
test_x = test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'],axis=1)
#train_y only includes 0 and 1 (dead or alive)
print(train_y)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64


In [0]:
#fill the missing Age values with the mean (29.7), rounded to 30
train_x['Age'] = train_x['Age'].fillna(30)

In [0]:
#fill Embarked with S, the most common value
#assign the result back instead of using inplace=True on a column selection
train_x["Embarked"] = train_x["Embarked"].fillna("S")

In [183]:
#train_x is the "complete" train data
train_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Sex       891 non-null    object 
 2   Age       891 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Embarked  891 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB


In [184]:
#indeed, the test_x dataset doesn't include Survived
test_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Sex       418 non-null    object 
 2   Age       332 non-null    float64
 3   SibSp     418 non-null    int64  
 4   Parch     418 non-null    int64  
 5   Fare      417 non-null    float64
 6   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 23.0+ KB


In [185]:
#column means of the test data; numeric_only=True skips Sex and Embarked
test_x.mean(axis=0, numeric_only=True)

Pclass     2.265550
Age       30.272590
SibSp      0.447368
Parch      0.392344
Fare      35.627188
dtype: float64

In [0]:
#fill the missing test values with the rounded test means: Age (30.27 -> 30), Fare (35.63 -> 36)
test_x['Age'] = test_x['Age'].fillna(30)
test_x['Fare'] = test_x['Fare'].fillna(36)

In [187]:
test_x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Sex       418 non-null    object 
 2   Age       418 non-null    float64
 3   SibSp     418 non-null    int64  
 4   Parch     418 non-null    int64  
 5   Fare      418 non-null    float64
 6   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 23.0+ KB


features = dataframe that includes all the data needed to determine whether a passenger survives.

labels = dataframe containing the information we want to predict. In this example, "Survived" is the label.

In general, a dataset is split into three parts:

1. Training

2. Validation

3. Test

In this example, the Test dataset is already provided, so we only need to split
the train dataset into Train and Validation (the latter is sometimes called the hold-out or development set).
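What an 80/20 shuffle split does can be sketched in plain Python. The helper `shuffle_split` is hypothetical and is not scikit-learn's implementation; rounding the validation size up is an assumption of this sketch.

```python
import math
import random

def shuffle_split(n_rows, val_fraction=0.2, seed=1):
    """Shuffle the row indices once and cut them into (train, validation)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed gives a repeatable split
    n_val = math.ceil(val_fraction * n_rows)  # validation size, rounded up
    return indices[n_val:], indices[:n_val]

# 891 is the number of rows in the train dataset.
train_idx, val_idx = shuffle_split(891)
print(len(train_idx), len(val_idx))  # 712 179
```

Every row ends up in exactly one of the two parts, so the validation rows are never seen during training.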

In [0]:
#sampling 80% for train data and 20% for validation
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, 
                                                  test_size=0.2, 
                                                  random_state=1)

#create input function
train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(x=train_x,
                                                               y=train_y,
                                                               num_epochs=None,
                                                               batch_size=100,
                                                               shuffle=True)

#validation function
val_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(x=val_x,
                                                             y=val_y,
                                                             num_epochs=1,
                                                             batch_size=len(val_x),
                                                             shuffle=False)

**Feature Columns**

In this example, we have categorical columns (Sex, Embarked) and numerical columns (Age, Fare, Parch, Pclass, SibSp).

In [0]:
#the explicit, column-by-column implementation of feature columns
Sex = tf.feature_column.categorical_column_with_vocabulary_list("Sex", ["male", "female"])
Embarked = tf.feature_column.categorical_column_with_vocabulary_list("Embarked", ["S", "C", "Q"])
Age = tf.feature_column.numeric_column("Age")
Fare = tf.feature_column.numeric_column("Fare")
Parch = tf.feature_column.numeric_column("Parch")
Pclass = tf.feature_column.numeric_column("Pclass")
SibSp = tf.feature_column.numeric_column("SibSp")

ft_columns = [tf.feature_column.indicator_column(Sex), tf.feature_column.indicator_column(Embarked), Age, Fare, Parch, Pclass, SibSp]
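As a rough picture of what wrapping a categorical column in an indicator column produces, here is a plain-Python one-hot encoding. The helper `one_hot` and the sample `row` are hypothetical, and the ordering of the resulting inputs is arbitrary here; TensorFlow may order them differently.

```python
def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of value in vocabulary, 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

sex_vocab = ["male", "female"]
embarked_vocab = ["S", "C", "Q"]
numeric_names = ["Age", "Fare", "Parch", "Pclass", "SibSp"]

print(one_hot("female", sex_vocab))   # [0.0, 1.0]
print(one_hot("C", embarked_vocab))   # [0.0, 1.0, 0.0]

# A sample passenger row (values taken from the first row of the table).
row = {"Sex": "male", "Embarked": "S", "Age": 22.0, "Fare": 7.25,
       "Parch": 0, "Pclass": 3, "SibSp": 1}

# Categorical values become one-hot vectors; numeric values pass through.
inputs = (one_hot(row["Sex"], sex_vocab)
          + one_hot(row["Embarked"], embarked_vocab)
          + [float(row[name]) for name in numeric_names])
print(len(inputs))  # 10
```

Under these assumptions each passenger is described by 2 + 3 + 5 = 10 numbers.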



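The same feature columns can also be built in loops:

```
feature_columns = []
#numeric features are used as they are
num_cols = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp']
for num_name in num_cols:
  feature_columns.append(tf.feature_column.numeric_column(num_name))

#categorical features take their vocabulary from the training data
categorical_cols = ['Sex', 'Embarked']
for ft_name in categorical_cols:
  vocabulary = train_x[ft_name].unique()
  cat_column = tf.feature_column.categorical_column_with_vocabulary_list(ft_name, vocabulary)
  #a DNN needs dense inputs, so wrap the categorical column in an indicator column
  feature_columns.append(tf.feature_column.indicator_column(cat_column))
```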



**Instantiate the Estimator**

We are going to use a DNN (deep neural network) classifier. This will train the model to classify each passenger into one of two possible cases: survived or did not survive.

hidden_units = number of hidden nodes per layer. For example, [30,10] means that we have two hidden layers, the first one with 30 nodes and the second one with 10 nodes.

n_classes = number of label classes (2 here)
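To get a feel for how `hidden_units` sets the size of the network, the sketch below counts the weights and biases of a fully connected stack. The helper `count_dense_parameters` is hypothetical. It assumes 10 inputs (2 one-hot values for Sex, 3 for Embarked, and 5 numeric columns) and a single output logit for the two-class case; both are assumptions of this sketch, not values reported by TensorFlow.

```python
def count_dense_parameters(n_inputs, hidden_units, n_outputs):
    """Count weights and biases of fully connected layers chained in order."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

small = count_dense_parameters(10, [30, 10], 1)
larger = count_dense_parameters(10, [35, 25, 30], 1)
print(small)   # 651
print(larger)  # 2096
```

With only a few hundred training rows, a network of a couple of thousand parameters can overfit easily, which is one reason to watch the validation accuracy.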

In [190]:
model = tf.estimator.DNNClassifier(feature_columns=ft_columns,
                                   hidden_units=[35,25,30], 
                                   n_classes=2, 
                                   optimizer='Adam')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpdhm56b6k', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


**Train the model**

In [191]:
model.train(input_fn=train_input_fn, steps=500)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpdhm56b6k/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 4.0775847, step = 0
INFO:tensorflow:global_step/sec: 276.207
INFO:tensorflow:loss = 0.593145, step = 100 (0.365 sec)
INFO:tensorflow:global_step/sec: 329.127
INFO:tensorflow:loss = 0.63201535, step = 200 (0.302 sec)

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x7f5744d5e7b8>

**Evaluate the model**

In [192]:
#accuracy on the validation split (val_x, val_y), not on the test data
accuracy_score = model.evaluate(input_fn=val_input_fn)["accuracy"]
print("\nValidation accuracy: {0:f}%\n".format(accuracy_score*100))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-04-29T03:08:38Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpdhm56b6k/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 0.64712s
INFO:tensorflow:Finished evaluation at 2020-04-29-03:08:38
INFO:tensorflow:Saving dict for global step 500: accuracy = 0.7821229, accuracy_baseline = 0.59217876, auc = 0.8378134, auc_precision_recall = 0.8029717, average_loss = 0.4829482, global_step = 500, label/mean = 0.40782124, loss = 0.4829482, precision = 0.7741935, predic
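The `accuracy` and `accuracy_baseline` values in the log above can be read as follows: accuracy is the fraction of correct predictions, and the baseline is the accuracy of always predicting the more common class. The sketch uses made-up toy labels and hypothetical helper names; only the two logged ratios come from the output.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class (0 or 1)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Toy example (made-up values).
toy_labels = [0, 0, 0, 1, 1]
toy_predictions = [0, 0, 1, 1, 1]
print(accuracy(toy_labels, toy_predictions))  # 0.8
print(majority_baseline(toy_labels))          # 0.6

# Values from the log: label/mean is the share of survivors in the validation data.
label_mean = 0.40782124
baseline_from_log = 1 - label_mean  # always predicting "did not survive"
print(round(baseline_from_log, 8))  # 0.59217876
```

Read this way, the model's 78.2% validation accuracy is an improvement over the 59.2% obtained by always predicting that a passenger did not survive.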

**Predict on the test data**

In [0]:
test_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(x=test_x,
                                                              num_epochs=1,
                                                              batch_size=len(test_x),
                                                              shuffle=False)

In [194]:
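#class names indexed by the predicted class id (0 = died, 1 = survived)
LABEL =['YOU DIED', 'YOU LIVED']
predictions = model.predict(input_fn=test_input_fn)
for i, predict in enumerate(predictions):
    #class_ids holds the predicted class; probabilities holds one value per class
    label_ = predict['class_ids'][0]
    probs = predict['probabilities'][label_]
    print(f'Prediction \t{LABEL[label_]} ({100*probs} %) ')
    #stop after the first 11 passengers
    if i==10: break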

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpdhm56b6k/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Prediction 	YOU DIED (89.34611082077026 %) 
Prediction 	YOU DIED (66.63805842399597 %) 
Prediction 	YOU DIED (88.42379450798035 %) 
Prediction 	YOU DIED (90.118807554245 %) 
Prediction 	YOU DIED (56.92726373672485 %) 
Prediction 	YOU DIED (81.64664506912231 %) 
Prediction 	YOU LIVED (70.5467700958252 %) 
Prediction 	YOU DIED (84.70030426979065 %) 
Prediction 	YOU LIVED (69.19453740119934 %) 
Prediction 	YOU DIED (84.987723827362