In [120]:
import functools
import numpy as np
import pandas as pd
import os
import tensorflow as tf
print("TensorFlow version: ",tf.version.VERSION)

TensorFlow version:  2.2.0


In [18]:
# Define data path
cwd = os.getcwd()
train_file_path = os.path.join(cwd, "train.csv")
test_file_path = os.path.join(cwd, "test.csv")

In [None]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Inspect CSV file

In [51]:
with open(test_file_path) as f:
    for _ in range(3):  # first 3 lines
        print(f.readline())

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q

893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S



Specify label column:

In [52]:
LABEL_COLUMN = 'Survived'
LABELS = [0, 1]

Read the CSV data from the file and create a dataset. 

In [53]:
def get_dataset(file_path, **kwargs):
    "Loads a CSV file into a dataset"
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=file_path,
        batch_size=5,
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        **kwargs
        )

    return dataset

In [54]:
raw_train_data = get_dataset(train_file_path)

In [65]:
def show_batch(dataset):
    # take one batch of 5 and print it
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key,value.numpy()))
        print("Label:", label)

Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).
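
To make the column-based layout concrete, here is a minimal pure-Python sketch (no TensorFlow involved) that regroups a few row records into one sequence per column. The two rows are copied from the `test.csv` preview above, restricted to four columns; the helper name `rows_to_columns` is hypothetical and exists only for this illustration.

```python
# Two rows taken from the test.csv preview above (a subset of the columns).
rows = [
    {"PassengerId": 892, "Pclass": 3, "Sex": "male", "Age": 34.5},
    {"PassengerId": 893, "Pclass": 3, "Sex": "female", "Age": 47.0},
]

def rows_to_columns(rows):
    """Group row records into one list per column (hypothetical helper)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

batch = rows_to_columns(rows)
for key, values in batch.items():
    # Every column holds as many elements as there are rows in the batch.
    print("{:12s}: {}".format(key, values))
```

Each value in `batch` plays the role of one column tensor: its length equals the batch size, and element `i` of every column belongs to the same example.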

In [66]:
show_batch(raw_train_data)

PassengerId         : [143 544 853 741 349]
Pclass              : [3 2 3 1 3]
Name                : [b'Hakkarainen, Mrs. Pekka Pietari (Elin Matilda Dolck)'
 b'Beane, Mr. Edward' b'Boulos, Miss. Nourelain'
 b'Hawksford, Mr. Walter James' b'Coutts, Master. William Loch "William"']
Sex                 : [b'female' b'male' b'female' b'male' b'male']
Age                 : [24. 32.  9.  0.  3.]
SibSp               : [1 1 1 0 1]
Parch               : [0 0 1 0 1]
Ticket              : [b'STON/O2. 3101279' b'2908' b'2678' b'16988' b'C.A. 37671']
Fare                : [15.85   26.     15.2458 30.     15.9   ]
Cabin               : [b'' b'' b'' b'D45' b'']
Embarked            : [b'S' b'S' b'C' b'S' b'S']
Label: tf.Tensor([1 1 0 1 1], shape=(5,), dtype=int32)


The columns in the CSV are named, and the dataset constructor picks these names up automatically. If the file you are working with does not contain the column names in its first line, pass them as a list of strings to the `column_names` argument of the `make_csv_dataset` function.

This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) `select_columns` argument of the constructor.

In [69]:
SELECT_COLUMNS = ['Survived', 'Age', 'SibSp', 'Pclass', 'Cabin', 'Sex']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

Pclass              : [3 1 3 3 1]
Sex                 : [b'male' b'male' b'female' b'female' b'male']
Age                 : [ 1. 26. 29.  2. 40.]
SibSp               : [5 0 1 3 0]
Cabin               : [b'' b'C148' b'G6' b'' b'']
Label: tf.Tensor([0 1 0 0 0], shape=(5,), dtype=int32)


## Data preprocessing

A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: `tf.feature_column`. See [this tutorial](../keras/feature_columns) for details.


You can preprocess your data using any tool you like (such as [nltk](https://www.nltk.org/) or [sklearn](https://scikit-learn.org/stable/)), and just pass the processed output to TensorFlow.


The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.

### Continuous data

If your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:

In [73]:
SELECT_COLUMNS = ['Survived', 'Age', 'SibSp', 'Pclass', 'Fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)

Pclass              : [3. 3. 2. 3. 3.]
Age                 : [29. 11. 19.  0. 44.]
SibSp               : [1. 0. 0. 0. 0.]
Fare                : [10.4625 18.7875 10.5     7.2292  8.05  ]
Label: tf.Tensor([0 0 1 1 0], shape=(5,), dtype=int32)


In [98]:
example_batch, labels_batch = next(iter(temp_dataset)) 

In [101]:
example_batch

OrderedDict([('Pclass',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 3., 1., 1., 1.], dtype=float32)>),
             ('Age',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([40., 17., 54., 38., 34.], dtype=float32)>),
             ('SibSp',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>),
             ('Fare',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([153.4625,   8.6625,  51.8625, 227.525 ,  26.55  ], dtype=float32)>)])

Here's a simple function that will pack together all the columns:

In [99]:
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), label

In [100]:
# Map pack function to dataset
packed_dataset = temp_dataset.map(pack)

In [102]:
for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())

[[  1.      40.       0.     153.4625]
 [  3.      17.       0.       8.6625]
 [  1.      54.       0.      51.8625]
 [  1.      38.       0.     227.525 ]
 [  1.      34.       0.      26.55  ]]

[1 1 0 0 0]
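
The effect of `axis=-1` is easiest to see on small arrays. The sketch below uses NumPy's `np.stack` as a stand-in for `tf.stack` (assuming the two behave the same way for this case), with the column values copied from the `example_batch` shown above.

```python
import numpy as np

# Per-column values, copied from the example_batch shown above.
columns = {
    "Pclass": np.array([1., 3., 1., 1., 1.], dtype=np.float32),
    "Age": np.array([40., 17., 54., 38., 34.], dtype=np.float32),
    "SibSp": np.array([0., 0., 0., 0., 0.], dtype=np.float32),
    "Fare": np.array([153.4625, 8.6625, 51.8625, 227.525, 26.55], dtype=np.float32),
}

# Stacking on the last axis turns four vectors of shape (5,) into one
# (5, 4) matrix: each row is one passenger, each column one feature.
packed = np.stack(list(columns.values()), axis=-1)
print(packed.shape)
print(packed[0])
```

Stacking on `axis=0` instead would give a `(4, 5)` array with one row per feature, which is not the per-example layout a model expects.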


If you have mixed datatypes, you may want to separate out these simple numeric fields. The `tf.feature_column` API can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:

In [103]:
show_batch(raw_train_data)

PassengerId         : [137 521 289 517 727]
Pclass              : [1 1 2 2 2]
Name                : [b'Newsom, Miss. Helen Monypeny' b'Perreault, Miss. Anne'
 b'Hosono, Mr. Masabumi' b'Lemore, Mrs. (Amelia Milley)'
 b'Renouf, Mrs. Peter Henry (Lillian Jefferys)']
Sex                 : [b'female' b'female' b'male' b'female' b'female']
Age                 : [19. 30. 42. 34. 30.]
SibSp               : [0 0 0 0 3]
Parch               : [2 0 0 0 0]
Ticket              : [b'11752' b'12749' b'237798' b'C.A. 34260' b'31027']
Fare                : [26.2833 93.5    13.     10.5    21.    ]
Cabin               : [b'D47' b'B73' b'' b'F33' b'']
Embarked            : [b'S' b'S' b'S' b'S' b'S']
Label: tf.Tensor([1 1 1 1 1], shape=(5,), dtype=int32)


So define a more general preprocessor that selects a list of numeric features and packs them into a single column:

In [115]:
class PackNumericFeatures():
    
    def __init__(self, column_names):
        self.names = column_names
        
    def __call__(self, features, labels):
        numeric_features = [features.pop(name) for name in self.names]
        numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
        numeric_features = tf.stack(numeric_features, axis=-1)
        features['numeric'] = numeric_features

        return features, labels

    

In [116]:
NUMERIC_FEATURES = ['Pclass','Age','SibSp', 'Parch', 'Fare']

packed_train_data = raw_train_data.map(PackNumericFeatures(NUMERIC_FEATURES))

In [117]:
show_batch(packed_train_data)

PassengerId         : [712 676 209 343  44]
Name                : [b'Klaber, Mr. Herman' b'Edvardsson, Mr. Gustaf Hjalmar'
 b'Carr, Miss. Helen "Ellen"' b'Collander, Mr. Erik Gustaf'
 b'Laroche, Miss. Simonne Marie Anne Andree']
Sex                 : [b'male' b'male' b'female' b'male' b'female']
Ticket              : [b'113028' b'349912' b'367231' b'248740' b'SC/Paris 2123']
Cabin               : [b'C124' b'' b'' b'' b'']
Embarked            : [b'S' b'S' b'Q' b'S' b'C']
numeric             : [[ 1.      0.      0.      0.     26.55  ]
 [ 3.     18.      0.      0.      7.775 ]
 [ 3.     16.      0.      0.      7.75  ]
 [ 2.     28.      0.      0.     13.    ]
 [ 2.      3.      1.      2.     41.5792]]
Label: tf.Tensor([0 0 1 0 1], shape=(5,), dtype=int32)


In [134]:
example_batch, labels_batch = next(iter(packed_train_data)) 

#### Data Normalization

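Continuous data should generally be normalized, so that features with a large range (such as `Fare`) do not dominate features with a small one (such as `Pclass`).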

In [121]:
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

           Pclass         Age       SibSp       Parch        Fare
count  891.000000  714.000000  891.000000  891.000000  891.000000
mean     2.308642   29.699118    0.523008    0.381594   32.204208
std      0.836071   14.526497    1.102743    0.806057   49.693429
min      1.000000    0.420000    0.000000    0.000000    0.000000
25%      2.000000   20.125000    0.000000    0.000000    7.910400
50%      3.000000   28.000000    0.000000    0.000000   14.454200
75%      3.000000   38.000000    1.000000    0.000000   31.000000
max      3.000000   80.000000    8.000000    6.000000  512.329200


In [122]:
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

In [124]:
print(MEAN, STD)

[ 2.30864198 29.69911765  0.52300786  0.38159371 32.20420797] [ 0.83607124 14.52649733  1.10274343  0.80605722 49.6934286 ]


In [123]:
def normalize_numeric_data(data, mean, std):
  # Standardize: subtract the mean, then divide by the standard deviation
  return (data-mean)/std
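
As a quick sanity check of the formula, the sketch below applies the same `(data - mean) / std` computation with NumPy to a single row of raw values, using the `MEAN` and `STD` values printed above. The row is the first row of the raw `example_batch['numeric']` output shown further down. The lowercase names `mean`, `std`, `row`, and `z` are used only for this illustration.

```python
import numpy as np

def normalize_numeric_data(data, mean, std):
    # Standardize: subtract the mean, then divide by the standard deviation.
    return (data - mean) / std

# Per-column statistics, copied from the MEAN and STD printout above.
# Column order: Pclass, Age, SibSp, Parch, Fare.
mean = np.array([2.30864198, 29.69911765, 0.52300786, 0.38159371, 32.20420797])
std = np.array([0.83607124, 14.52649733, 1.10274343, 0.80605722, 49.6934286])

# One raw row in the same column order.
row = np.array([1.0, 0.0, 0.0, 0.0, 221.7792])

# Broadcasting applies each column's mean and std to the matching entry.
z = normalize_numeric_data(row, mean, std)
print(np.round(z, 4))
```

The result should agree, up to float32 rounding, with the first row of the `DenseFeatures` output later in this section. Note that a missing `Age`, which appears as `0.` in these batches, ends up about two standard deviations below the mean.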

Now create a numeric column. The `tf.feature_column.numeric_column` API accepts a `normalizer_fn` argument, which will be run on each batch.

Bind the `MEAN` and `STD` to the normalizer fn using [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial).

In [129]:
# We need to use partial here to pass the two variables to the function
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)
numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column
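
If `functools.partial` is unfamiliar: it returns a new callable with some arguments already filled in, which fits here because the normalizer is handed the data only. A minimal sketch with plain floats follows, using the `Age` mean and standard deviation from the `describe()` table above; the name `normalize_age` is hypothetical.

```python
import functools

def normalize_numeric_data(data, mean, std):
    # Standardize: subtract the mean, then divide by the standard deviation.
    return (data - mean) / std

# Bind mean and std up front; the resulting callable takes only `data`.
normalize_age = functools.partial(
    normalize_numeric_data, mean=29.699118, std=14.526497)

print(normalize_age(38.0))
# Equivalent to calling the original function with all three arguments.
print(normalize_numeric_data(38.0, mean=29.699118, std=14.526497))
```

A `lambda data: normalize_numeric_data(data, MEAN, STD)` would work in the same way; `partial` simply makes the bound arguments explicit.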

When you train the model, include this feature column to select and normalize this block of numeric data:

In [132]:
example_batch['numeric']

<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[  1.    ,   0.    ,   0.    ,   0.    , 221.7792],
       [  3.    ,   0.    ,   0.    ,   0.    ,   7.3125],
       [  3.    ,  26.    ,   1.    ,   0.    ,   7.8542],
       [  3.    ,   0.    ,   1.    ,   0.    ,  16.1   ],
       [  3.    ,  36.    ,   1.    ,   1.    ,  24.15  ]], dtype=float32)>

In [133]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[-1.5652277 , -2.044479  , -0.47427887, -0.47340772,  3.8148906 ],
       [ 0.8269129 , -2.044479  , -0.47427887, -0.47340772, -0.50090545],
       [ 0.8269129 , -0.25464624,  0.43255043, -0.47340772, -0.4900046 ],
       [ 0.8269129 , -2.044479  ,  0.43255043, -0.47340772, -0.32407117],
       [ 0.8269129 ,  0.433751  ,  0.43255043,  0.767199  , -0.16207795]],
      dtype=float32)

### Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the `tf.feature_column` API to create a collection with a `tf.feature_column.indicator_column` for each categorical column.


In [135]:
CATEGORIES = {
    'Sex': ['male', 'female'],
    'Embarked' : ['C', 'S', 'Q'],
}


In [136]:
categorical_columns = []
for feature, vocab in CATEGORIES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))

In [137]:
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='Embarked', vocabulary_list=('C', 'S', 'Q'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

In [139]:
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
categorical_layer(example_batch).numpy()[0]

array([0., 1., 0., 0., 1.], dtype=float32)
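
Conceptually, an indicator column one-hot encodes each value against its vocabulary. The pure-Python sketch below illustrates the idea only; it is not how TensorFlow implements it, and the helper `one_hot` is hypothetical. It assumes that a value missing from the vocabulary maps to all zeros, which is what the `default_value=-1, num_oov_buckets=0` settings printed above suggest.

```python
CATEGORIES = {
    'Sex': ['male', 'female'],
    'Embarked': ['C', 'S', 'Q'],
}

def one_hot(value, vocab):
    """Return a one-hot list for `value`; all zeros if it is not in `vocab`."""
    return [1.0 if value == item else 0.0 for item in vocab]

print(one_hot('female', CATEGORIES['Sex']))
print(one_hot('S', CATEGORIES['Embarked']))
# An empty (missing) value matches no vocabulary entry.
print(one_hot('', CATEGORIES['Embarked']))
```

The five-element row printed above is the concatenation of two such encodings, one of width 3 (`Embarked`) and one of width 2 (`Sex`). The order in which `DenseFeatures` concatenates them is not shown in this notebook, so check it before relying on positions.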

### Combined preprocessing layer

Add the two feature column collections and pass them to a `tf.keras.layers.DenseFeatures` to create an input layer that will extract and preprocess both input types:

In [141]:
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)
print(preprocessing_layer(example_batch).numpy()[0])

[ 0.          1.          0.          0.          1.          0.8269129
 -0.66768456 -0.47427887 -0.47340772 -0.47373885]


### Next Step

A next step would be to build a `tf.keras.Sequential` model, starting with the `preprocessing_layer`.