<a href="https://colab.research.google.com/github/chunxu8894/tensorflow-tutorials/blob/master/Load_CSV_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import functools

import numpy as np
import tensorflow as tf

In [2]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("test.csv", TEST_DATA_URL)

Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv


## 2. Load data

- load this using pandas and pass the Numpy arrays to Tensorflow.

- if you need to scale up to a large set of files, or need a loader that integrates with Tensorflow and tf.data the use the tf.data.experimental.make_csv_dataset function:

the only column you need to identify explicity is the one with the value that the model is intended to predict


In [3]:
LABEL_COLUMN = 'survived'
LABELS = [0,1]


In [4]:
def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True,
      **kwargs
  )
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

In [5]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key, value.numpy()))

In [6]:
show_batch(raw_train_data)

sex                 : [b'male' b'male' b'male' b'male' b'male']
age                 : [28. 17. 28. 45. 28.]
n_siblings_spouses  : [0 1 0 0 0]
parch               : [0 1 0 0 0]
fare                : [ 0.      7.2292  7.7292 26.55    0.    ]
class               : [b'Second' b'Third' b'Third' b'First' b'Second']
deck                : [b'unknown' b'unknown' b'unknown' b'B' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Queenstown' b'Southampton' b'Southampton']
alone               : [b'y' b'n' b'y' b'y' b'y']


In [7]:
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names = CSV_COLUMNS)
show_batch(temp_dataset)

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [ 4.  6. 64. 35. 18.]
n_siblings_spouses  : [1 0 1 1 0]
parch               : [1 1 4 1 0]
fare                : [ 11.1333  33.     263.      20.25     8.3   ]
class               : [b'Third' b'Second' b'First' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'C' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'y']


## 3. Data preprocessing

A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model

Tensorflow has a built-in system for describing common input conversions: __tf.feature_column__,

You can preprocess your data using any tool you like (like __ntlk__ or __sklearn__) and just pass the processed output to Tensorflow

The primary advantage of doing the preprocessing inside your model is that when you export the model it include the preprocessing. This way you can pass the raw data directly to your model


## 4. Continuous data

if your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:


In [8]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0,0.0,0.0,0.0,0.0]

temp_dataset = get_dataset(train_file_path, select_columns = SELECT_COLUMNS, column_defaults = DEFAULTS)
show_batch(temp_dataset)

age                 : [16. 34. 28. 22. 20.]
n_siblings_spouses  : [0. 1. 0. 0. 1.]
parch               : [0. 1. 0. 0. 1.]
fare                : [ 9.5    14.4    56.4958  7.25   15.7417]


In [9]:
example_batch, labels_batch = next(iter(temp_dataset))

Here's a simple function that will pack together all the columns:

In [10]:
def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

In [11]:
packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())

[[28.      0.      0.      6.95  ]
 [20.      0.      0.      8.05  ]
 [19.      0.      0.     10.1708]
 [28.      0.      0.     26.55  ]
 [45.      0.      0.      8.05  ]]

[0 0 0 1 1]


if you have mixed data types you want to separate out these simple-numeric fields. The __tf.feature_column__ api can handle them, but this incurs some overhear and should be avoided unless really necessary. Switch back to the  mixed dataset:
  

In [12]:
show_batch(raw_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [20.  4. 28. 31. 28.]
n_siblings_spouses  : [0 0 0 0 2]
parch               : [0 2 0 0 0]
fare                : [ 7.8542 22.025   0.      7.8542 21.6792]
class               : [b'Third' b'Third' b'Second' b'Third' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']
alone               : [b'y' b'n' b'y' b'y' b'n']


In [13]:
example_batch, labels_batch = next(iter(temp_dataset))

In [14]:
class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features
    return features, labels

In [15]:
NUMERIC_FEATURES = ['age', 'n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(PackNumericFeatures(NUMERIC_FEATURES))

In [16]:
show_batch(packed_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'female']
class               : [b'Third' b'First' b'Second' b'Third' b'First']
deck                : [b'unknown' b'D' b'unknown' b'unknown' b'B']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Queenstown' b'Southampton']
alone               : [b'y' b'n' b'y' b'y' b'n']
numeric             : [[ 28.       0.       0.       7.8958]
 [ 51.       1.       0.      77.9583]
 [ 36.       0.       0.      10.5   ]
 [ 28.       0.       0.       7.75  ]
 [ 15.       0.       1.     211.3375]]


In [17]:
example_batch, labels_batch = next(iter(packed_train_data))

## 5. Data Normalization
Continous data should always be normalized


In [18]:
import pandas as pd

desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()

In [19]:
desc

Unnamed: 0,age,n_siblings_spouses,parch,fare
count,627.0,627.0,627.0,627.0
mean,29.631308,0.545455,0.379585,34.385399
std,12.511818,1.15109,0.792999,54.59773
min,0.75,0.0,0.0,0.0
25%,23.0,0.0,0.0,7.8958
50%,28.0,0.0,0.0,15.0458
75%,35.0,1.0,0.0,31.3875
max,80.0,8.0,5.0,512.3292


In [20]:
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

In [21]:
def normalize_numeric_data(data, mean, std):
  return (data-mean)/std

Now create a numeric column. The __tf.feature_columns.numeric_column__ API accepts a normalize_fn argument, which will be run on each batch

Bind the MEAN and STD to the normalizer fn using __functions.partial__

In [24]:
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)
numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7f4bb90207b8>, mean=array([29.63130781,  0.54545455,  0.37958533, 34.38539856]), std=array([12.51181763,  1.1510896 ,  0.79299921, 54.5977305 ])))

In [25]:
example_batch['numeric']

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[ 28.    ,   1.    ,   0.    ,  82.1708],
       [ 28.    ,   1.    ,   0.    , 133.65  ],
       [ 58.    ,   0.    ,   0.    ,  29.7   ],
       [ 46.    ,   0.    ,   0.    ,  79.2   ],
       [ 21.    ,   0.    ,   0.    ,  77.9583]], dtype=float32)>

In [26]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[-0.13038135,  0.39488277, -0.4786705 ,  0.8752269 ],
       [-0.13038135,  0.39488277, -0.4786705 ,  1.8181086 ],
       [ 2.2673519 , -0.47385937, -0.4786705 , -0.08581672],
       [ 1.3082585 , -0.47385937, -0.4786705 ,  0.8208143 ],
       [-0.6898524 , -0.47385937, -0.4786705 ,  0.7980716 ]],
      dtype=float32)

## 6. Categorical data
Some of the columns in the CSV data are categorical columns. That is the content should be one of a limited set of options

Use the __tf.feature_column__ API to create a collection with a __tf.feature_column.indicator_column__ for each categorical column

In [27]:
CATEGORIES= {
    'sex': ['male','female'],
    'class': ['First','Second','Third'],
    'deck': ['A','B','C','D','E','F','G','H','I','J'],
    'embark_town': ['Cherbourg', 'Southhampton', 'Queentown'],
    'alone': ['y','n']
}

In [29]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

In [30]:
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queentown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [32]:
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

[0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0.]


This will be become part of data processing input later when you build the model

## 7. Combined preprocessing layer

Add the two feature column collections and pass them to a __tf.keras.layers.DenseFeatures__ to create an input layer that will extract and preprocess both input types:


In [33]:
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)

In [36]:
print(preprocessing_layer(example_batch).numpy()[0])

[ 0.          1.          1.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          1.          0.          0.
 -0.13038135  0.39488277 -0.4786705   0.8752269   1.          0.        ]


## 8. Build the model

Build a __tf.keras.Sequential__, starting with the __preprocessing_layer__

In [37]:
model = tf.keras.Sequential(
    [
     preprocessing_layer,
     tf.keras.layers.Dense(128, activation='relu'),
     tf.keras.layers.Dense(128, activation='relu'),
     tf.keras.layers.Dense(1),
    ]
)

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

In [38]:
train_data = packed_train_data.shuffle(500)
test_data = packed_test_data


In [39]:
model.fit(train_data, epochs=40)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<tensorflow.python.keras.callbacks.History at 0x7f4bb83870f0>

In [41]:
test_loss, test_accuracy = model.evaluate(test_data)
print(f'\n\n Test Loss {test_loss}, Test Accuracy {test_accuracy}')



 Test Loss 0.5485135912895203, Test Accuracy 0.810606062412262


In [42]:
predictions = model.predict(test_data)

#show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  prediction = tf.sigmoid(prediction).numpy()
  print('Predicted survival: {:.2%}'.format(prediction[0]), " | Actual outcome: ", ("Survived " if bool(survived) else "DIED"))


Predicted survival: 53.31%  | Actual outcome:  Survived 
Predicted survival: 6.88%  | Actual outcome:  DIED
Predicted survival: 1.38%  | Actual outcome:  Survived 
Predicted survival: 80.48%  | Actual outcome:  DIED
Predicted survival: 77.38%  | Actual outcome:  Survived 
