##### Copyright 2019 The TensorFlow Authors.



In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Load CSV with tf.data

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/beta/tutorials/load_data/text"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/load_data/csv.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/r2/tutorials/load_data/csv.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/r2/tutorials/load_data/csv.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial provides an example of how to load CSV data from a file into a `tf.data.Dataset`.

The data used in this tutorial are taken from the Titanic passenger list. We'll try to predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [2]:
print(tf.__version__)

2.0.0-beta1


In [8]:
TRAIN_DATA_URL = "file:///home/crab179/PycharmProjects/machine-learning-project/data/UAI_Data/train_July.csv"
TEST_DATA_URL = "file:///home/crab179/PycharmProjects/machine-learning-project/data/UAI_Data/test_Aug.csv"

train_file_path = tf.keras.utils.get_file("train_July.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("test_Aug.csv", TEST_DATA_URL)

Downloading data from file:///home/crab179/PycharmProjects/machine-learning-project/data/UAI_Data/train_July.csv
Downloading data from file:///home/crab179/PycharmProjects/machine-learning-project/data/UAI_Data/test_Aug.csv


In [9]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

## Load data

So we know what we're doing, lets look at the top of the CSV file we're working with.

In [13]:
!head {train_file_path}

id,driver_id,member_id,create_date,create_hour,status,estimate_money,estimate_distance,estimate_term,start_geo_id,end_geo_id
583411b46a31bcc5d12d4402c928a146,3e69e17a6e5a726fe44d71896bee4f32,6b4d6e4992191fe96b9f27921520d551,2017-07-01,00,2,140.00,20099.00,18.00,6d7827e8dcfa09497954a31e6f7e6ee6,85e49ded1fa70a7bfa01ab0212a6e538
396b6e317f915352d3a19f61d2657c46,034f5860624827a65191a9be919fbb3d,c7c93facfd1b10d4e75ff14f479484e2,2017-07-01,00,2,78.00,9000.00,18.00,27d75f17e61587172fe7a6827bbaa198,f5dc996f6aa097f7a84a9bcfe58ed55c
c0badb35d04b00b06c54a285abde6e1b,d41d8cd98f00b204e9800998ecf8427e,8325f79b82f697dcce557b4a08f2ae5d,2017-07-01,00,1,86.23,10323.00,20.00,f92dfcc31699ad56d967a57673b8fc65,8c269e40d177f46840aff30baeb25e29
9c67ee57c2217c3b2211a66b120d77b2,e4c4e24edd254bb81fc6e3fe7a1a5dd4,bee163f2587d01a9fd9070be4c1e24fc,2017-07-01,00,1,81.88,14197.00,27.00,92e1e8020813ef939183e345626b442a,f80c4ceeb36264b42e34d6c4c2cb9b4c
fbd6734ac4938fab06546db06de9b3a9,d41d8cd98f00b204e9800998

As you can see, the columns in the CSV are labeled. We need the list later on, so let's read it out of the file.

In [17]:
# CSV columns in the input file.
with open(train_file_path, 'r') as f:
    names_row = f.readline()


CSV_COLUMNS = names_row.rstrip('\n').split(',')
print(CSV_COLUMNS)

['id', 'driver_id', 'member_id', 'create_date', 'create_hour', 'status', 'estimate_money', 'estimate_distance', 'estimate_term', 'start_geo_id', 'end_geo_id']


In [16]:
# CSV columns in the input file.
with open(test_file_path, 'r') as f:
    names_row = f.readline()


CSV_COLUMNS = names_row.rstrip('\n').split(',')
print(CSV_COLUMNS)

['id', 'driver_id', 'member_id', 'create_date', 'create_hour', 'status', 'estimate_money', 'estimate_distance', 'estimate_term', 'start_geo_id', 'end_geo_id']


 The dataset constructor will pick these labels up automatically.

If the file you are working with does not contain the column names in the first line, pass them in a list of strings to  the `column_names` argument in the `make_csv_dataset` function.

```python

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

dataset = tf.data.experimental.make_csv_dataset(
     ...,
     column_names=CSV_COLUMNS,
     ...)
  
```


This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) `select_columns` argument of the constructor.


```python

drop_columns = ['fare', 'embark_town']
columns_to_use = [col for col in CSV_COLUMNS if col not in drop_columns]

dataset = tf.data.experimental.make_csv_dataset(
  ...,
  select_columns = columns_to_use, 
  ...)

```

We also have to identify which column will serve as the labels for each example, and what those labels are.

In [19]:
LABELS = [0, 1]
LABEL_COLUMN = 'survived'

FEATURE_COLUMNS = [column for column in CSV_COLUMNS if column != LABEL_COLUMN]

Now that these constructor argument values are in place,  read the CSV data from the file and create a dataset. 

(For the full documentation, see `tf.data.experimental.make_csv_dataset`)


In [9]:
def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

W0628 04:41:16.166034 140680645007104 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/data/experimental/ops/readers.py:498: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.


Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (12 in this case).

It might help to see this yourself.

In [10]:
examples, labels = next(iter(raw_train_data)) # Just the first batch.
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)

EXAMPLES: 
 OrderedDict([('sex', <tf.Tensor: id=170, shape=(12,), dtype=string, numpy=
array([b'female', b'female', b'male', b'male', b'female', b'male',
       b'female', b'male', b'male', b'female', b'female', b'male'],
      dtype=object)>), ('age', <tf.Tensor: id=162, shape=(12,), dtype=float32, numpy=
array([24., 22., 30., 28., 42., 28., 17., 32., 28., 24., 28.,  4.],
      dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: id=168, shape=(12,), dtype=int32, numpy=array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 4], dtype=int32)>), ('parch', <tf.Tensor: id=169, shape=(12,), dtype=int32, numpy=array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], dtype=int32)>), ('fare', <tf.Tensor: id=167, shape=(12,), dtype=float32, numpy=
array([14.5  ,  7.75 , 16.1  , 82.171, 26.   ,  7.896, 12.   ,  7.896,
        8.458, 13.   , 16.1  , 31.275], dtype=float32)>), ('class', <tf.Tensor: id=164, shape=(12,), dtype=string, numpy=
array([b'Second', b'Third', b'Third', b'First', b'Second', b'Third',
       b'Second',

## Data preprocessing

### Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

In the CSV, these options are represented as text. This text needs to be converted to numbers before the model can be trained. To facilitate that, we need to create a list of categorical columns, along with a list of the options available in each column.

In [11]:
CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}


Write a function that takes a tensor of categorical values, matches it to a list of value names, and then performs a one-hot encoding.

In [12]:
def process_categorical_data(data, categories):
  """Returns a one-hot encoded tensor representing categorical values."""
  
  # Remove leading ' '.
  data = tf.strings.regex_replace(data, '^ ', '')
  # Remove trailing '.'.
  data = tf.strings.regex_replace(data, r'\.$', '')
  
  # ONE HOT ENCODE
  # Reshape data from 1d (a list) to a 2d (a list of one-element lists)
  data = tf.reshape(data, [-1, 1])
  # For each element, create a new list of boolean values the length of categories,
  # where the truth value is element == category label
  data = tf.equal(categories, data)
  # Cast booleans to floats.
  data = tf.cast(data, tf.float32)
  
  # The entire encoding can fit on one line:
  # data = tf.cast(tf.equal(categories, tf.reshape(data, [-1, 1])), tf.float32)
  return data

To help you visualize this, we'll take a single category-column tensor from the first batch, preprocess it, and show the before and after state.

In [13]:
class_tensor = examples['class']
class_tensor

<tf.Tensor: id=164, shape=(12,), dtype=string, numpy=
array([b'Second', b'Third', b'Third', b'First', b'Second', b'Third',
       b'Second', b'Third', b'Third', b'Second', b'Third', b'Third'],
      dtype=object)>

In [14]:
class_categories = CATEGORIES['class']
class_categories

['First', 'Second', 'Third']

In [15]:
processed_class = process_categorical_data(class_tensor, class_categories)
processed_class

<tf.Tensor: id=189, shape=(12, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.]], dtype=float32)>

Notice the relationship between the lengths of the two inputs and the shape of the output.

In [16]:
print("Size of batch: ", len(class_tensor.numpy()))
print("Number of category labels: ", len(class_categories))
print("Shape of one-hot encoded tensor: ", processed_class.shape)

Size of batch:  12
Number of category labels:  3
Shape of one-hot encoded tensor:  (12, 3)


### Continuous data

Continuous data needs to be normalized, so that the values fall between 0 and 1. To do that, write a function that multiplies each value by 1 over twice the mean of the column values.

The function should also reshape the data into a two dimensional tensor.


In [17]:
def process_continuous_data(data, mean):
  # Normalize data
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])

To do this calculation, you need the column means. You would obviously need to compute these in real life, but for this example we'll just provide them.

In [18]:
MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

Again, to see what this function is actually doing, we'll take a single tensor of continuous data and show it before and after processing.

In [19]:
age_tensor = examples['age']
age_tensor

<tf.Tensor: id=162, shape=(12,), dtype=float32, numpy=
array([24., 22., 30., 28., 42., 28., 17., 32., 28., 24., 28.,  4.],
      dtype=float32)>

In [20]:
process_continuous_data(age_tensor, MEANS['age'])

<tf.Tensor: id=198, shape=(12, 1), dtype=float32, numpy=
array([[0.405],
       [0.371],
       [0.506],
       [0.472],
       [0.709],
       [0.472],
       [0.287],
       [0.54 ],
       [0.472],
       [0.405],
       [0.472],
       [0.067]], dtype=float32)>

### Preprocess the data

Now assemble these preprocessing tasks into a single function that can be mapped to each batch in the dataset. 



In [21]:
def preprocess(features, labels):
  
  # Process categorial features.
  for feature in CATEGORIES.keys():
    features[feature] = process_categorical_data(features[feature],
                                                 CATEGORIES[feature])

  # Process continuous features.
  for feature in MEANS.keys():
    features[feature] = process_continuous_data(features[feature],
                                                MEANS[feature])
  
  # Assemble features into a single tensor.
  features = tf.concat([features[column] for column in FEATURE_COLUMNS], 1)
  
  return features, labels



Now apply that function with `tf.Dataset.map`, and shuffle the dataset to avoid overfitting.

In [22]:
train_data = raw_train_data.map(preprocess).shuffle(500)
test_data = raw_test_data.map(preprocess)

And let's see what a single example looks like.

In [23]:
examples, labels = next(iter(train_data))

examples, labels

(<tf.Tensor: id=365, shape=(12, 24), dtype=float32, numpy=
 array([[1.   , 0.   , 0.472, 0.   , 0.   , 0.115, 0.   , 0.   , 1.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 1.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.472, 0.917, 0.   , 1.195, 1.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 1.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 0.321, 0.   , 0.   , 0.153, 0.   , 1.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 1.   , 0.   ],
        [0.   , 1.   , 0.607, 0.   , 2.634, 1.032, 1.   , 0.   , 0.   ,
         0.   , 1.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 1.   ],
        [1.   , 0.   , 0.591, 0.   , 0.   , 0.386, 1.   , 0.   , 0.   ,
         0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
         0.  

The examples are in a  two dimensional arrays of 12 items each (the batch size). Each item represents a single row in the original CSV file. The labels are a 1d tensor of 12 values.

## Build the model

This example uses the [Keras Functional API](https://www.tensorflow.org/beta/guide/keras/functional) wrapped in a `get_model` constructor to build up a simple model. 

In [24]:
def get_model(input_dim, hidden_units=[100]):
  """Create a Keras model with layers.

  Args:
    input_dim: (int) The shape of an item in a batch. 
    labels_dim: (int) The shape of a label.
    hidden_units: [int] the layer sizes of the DNN (input layer first)
    learning_rate: (float) the learning rate for the optimizer.

  Returns:
    A Keras model.
  """

  inputs = tf.keras.Input(shape=(input_dim,))
  x = inputs

  for units in hidden_units:
    x = tf.keras.layers.Dense(units, activation='relu')(x)
  outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

  model = tf.keras.Model(inputs, outputs)
 
  return model

The `get_model` constructor needs to know the input shape of your data (not including the batch size).

In [25]:
input_shape, output_shape = train_data.output_shapes

input_dimension = input_shape.dims[1] # [0] is the batch size

## Train, evaluate, and predict

Now the model can be instantiated and trained.

In [26]:
model = get_model(input_dimension)
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

model.fit(train_data, epochs=20)

Epoch 1/20


W0628 04:41:17.067102 140680645007104 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7ff2682225c0>

Once the model is trained, we can check its accuracy on the `test_data` set.

In [27]:
test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))

     22/Unknown - 0s 7ms/step - loss: 0.4435 - accuracy: 0.7879

Test Loss 0.4434814710508693, Test Accuracy 0.7878788113594055


Use `tf.keras.Model.predict` to infer labels on a batch or a dataset of batches.

In [28]:
predictions = model.predict(test_data)

# Show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))



Predicted survival: 89.33%  | Actual outcome:  SURVIVED
Predicted survival: 11.27%  | Actual outcome:  DIED
Predicted survival: 41.55%  | Actual outcome:  SURVIVED
Predicted survival: 80.50%  | Actual outcome:  DIED
Predicted survival: 24.32%  | Actual outcome:  DIED
Predicted survival: 98.60%  | Actual outcome:  SURVIVED
Predicted survival: 8.59%  | Actual outcome:  DIED
Predicted survival: 44.37%  | Actual outcome:  DIED
Predicted survival: 93.19%  | Actual outcome:  SURVIVED
Predicted survival: 11.09%  | Actual outcome:  DIED
