In [165]:
# Import dependencies 
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import apache_beam as beam

# Print version for easy debugging 
print "Library versions \n Tensorflow version: {} \n DataFlow version: {}".format(tf.__version__, beam.__version__)

Library versions 
 Tensorflow version: 1.11.0 
 DataFlow version: 2.10.0


<h2>Setup environment variables</h2>

Store local paths and filenames in variables for easy reuse.

In [167]:
import os

# Store the root directory of the project
CWD = os.getcwd() # path to this notebook on the local filesystem
ROOT, _ = os.path.split(CWD) # one level up, the root directory of the project

# Save the paths to the raw data in variables
DATA_DIR = os.path.join(ROOT,'raw_data/')
DATA_FILE_NAME = 'true_car_listings.csv'
STATS_FILE_NAME = 'stats.tfrecord'

DATA_PATH = os.path.join(DATA_DIR,DATA_FILE_NAME)
STATS_PATH = os.path.join(DATA_DIR,STATS_FILE_NAME) # path to store statistics 

PROJECTID = None 
STAGING_BUCKET = None
REGION = 'europe-west1'

# Check the root folder
print "Root project folder is: {}".format(ROOT)

Root project folder is: /Users/evanderknaap/Documents/Projects/tfvalidate


## Split train & test data in an Apache Beam pipeline 
We use a random number generator to split the data into a training set and a test set. Because the split is random, every run produces different files; at this point we don't care that the files cannot be reused.

In [192]:
import random
from apache_beam.options.pipeline_options import GoogleCloudOptions

class PipeOptions(GoogleCloudOptions):

  @classmethod
  def _add_argparse_args(cls, parser):
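    parser.add_argument('--split_prob',
                        type=float,  # parse command-line values as a number, not a string
                        help='probability that a row is assigned to the train set',
                        default=0.8)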

def train_eval_fn(data_row, num_partitions):
    """Partitions data into train and evaluate sets based on a split probability.

    Args:
        data_row: string of input data
        num_partitions: number of splits in the data, 2 in this case
    Returns:
        The partition index for the row: 0 for train, 1 for evaluate
    """
    # Sample a number between 0 and 1
    sample = random.uniform(0, 1)

    # Check if the number is smaller than the defined threshold
    if sample <= options.split_prob:
        return 0 # for train
    else:
        return 1 # for evaluate

# Build and run the pipeline
options = PipeOptions()

with beam.Pipeline(options=options) as p:
    # skip_header_lines takes a count of lines, so pass 1 for the single header row
    raw_data = p | 'ReadCSV' >> beam.io.ReadFromText(DATA_PATH, skip_header_lines=1)
    partitioned_data = raw_data | 'Split in train and test' >> beam.Partition(train_eval_fn, 2)

    train_data = partitioned_data[0]
    test_data = partitioned_data[1]

    _ = train_data | 'Write train data' >> beam.io.WriteToText(os.path.join(DATA_DIR, 'train_data.csv'))
    _ = test_data | 'Write test data' >> beam.io.WriteToText(os.path.join(DATA_DIR, 'test_data.csv'))

usage: ipykernel_launcher.py [-h] --split_prob SPLIT_PROB
                             [--dataflow_endpoint DATAFLOW_ENDPOINT]
                             [--project PROJECT] [--job_name JOB_NAME]
                             [--staging_location STAGING_LOCATION]
                             [--temp_location TEMP_LOCATION] [--region REGION]
                             [--service_account_email SERVICE_ACCOUNT_EMAIL]
                             [--no_auth NO_AUTH]
                             [--template_location TEMPLATE_LOCATION]
                             [--label LABELS] [--update]
ipykernel_launcher.py: error: argument --split_prob is required


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
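
If the split ever needs to be reproducible, one option is to derive the partition from a hash of the row instead of a random draw. The sketch below is an assumption, not part of the original pipeline: `stable_partition` is a hypothetical helper, and MD5 is used only as a convenient stable hash. Identical rows always land in the same partition.

```python
import hashlib


def stable_partition(data_row, num_partitions=2, split_prob=0.8):
    """Return 0 (train) or 1 (evaluate) based on a hash of the row.

    Hypothetical alternative to a random draw: the same row always
    maps to the same partition, so reruns produce the same split.
    """
    # hashlib needs bytes; encode text rows, leave byte strings as they are
    if not isinstance(data_row, bytes):
        data_row = data_row.encode('utf-8')
    digest = hashlib.md5(data_row).hexdigest()
    # Map the first 8 hex digits to a fraction in [0, 1)
    fraction = int(digest[:8], 16) / float(16 ** 8)
    return 0 if fraction < split_prob else 1
```

Because the function takes the element first and the number of partitions second, it has the shape `beam.Partition` expects, for example `beam.Partition(stable_partition, 2)`.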


<h2>Compute statistics on data on local machine</h2>

Load the TFDV dependencies; this may take some time.

In [156]:
# Import TensorFlow Data Validation (TFDV)
import tensorflow_data_validation as tfdv

print "TFDV version: {}".format(tfdv.version.__version__)

TFDV version: 0.11.0


Next, we point TFDV to the location of our raw data and compute the statistics on our local machine using an Apache Beam pipeline. The statistics are stored as a protocol buffer in a .tfrecord file at the statistics path. We'll get a warning if a .tfrecord file already exists there, which you can choose to overwrite.

In [154]:
# Compute statistics from the raw data and store the stats as a tfrecord
tfdv.generate_statistics_from_csv(data_location=DATA_PATH, output_path=STATS_PATH)



datasets {
  num_examples: 852122
  features {
    name: "City"
    type: STRING
    string_stats {
      common_stats {
        num_non_missing: 852122
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 85212.2
          }
          

We notice a few interesting things already:
- There is a huge spread in **mileage**. Average mileage is $52.5K$, while the std dev is $42.0K$.
- Outliers in **mileage** of $2.86M$ skew the picture. We might need to exclude those to get a better feel for the distribution.
- The spread in **price** is quite large. The std dev is half of the mean price of $21.5K$. The most expensive car is $500K$, making it hard to judge the distribution.
- Apparently, most second-hand cars are sold when they are just 1 **year** old: $2017$.
- **Vin** numbers show a unit linear distribution, indicating they are unique to each car. The slight increase in slope indicates there are some duplicates.
- There are $58$ different **makes**, of which Ford is the most popular. About $85\%$ of listings belong to about $25$ of them, so they are quite concentrated. The tail is made up of more exotic cars like Porsches.
- **Models** are also quite concentrated: $80\%$ of the $1000$ listings fall within the first $400$ models.
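
The concentration figures for makes and models can also be checked numerically instead of being read off the Facets charts. The following is a minimal sketch using only the standard library; `top_k_share` is a hypothetical helper and the `makes` list is toy data, not values from the listings file.

```python
from collections import Counter


def top_k_share(values, k):
    """Fraction of all items covered by the k most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(k))
    return covered / float(total)


# Toy data for illustration only
makes = ['Ford'] * 5 + ['Chevrolet'] * 3 + ['Porsche'] + ['Lotus']
print(top_k_share(makes, 2))  # 0.8: the two most common makes cover 8 of 10 rows
```

Applied to the real `Make` column with `k=25`, the same function would give the share that the bullet above estimates at about $85\%$.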

In [168]:
# Load the statistics from file, so they don't have to be recomputed every time
stats_proto = tfdv.load_statistics(STATS_PATH)

# Visualize using facets
tfdv.visualize_statistics(stats_proto)

## Infer a schema

When loading the data, we need to define a schema to convert the data into Tensors. We can use TFDV to infer a first schema automatically. This schema is then used to check whether new data conforms to it.

In [172]:
schema = tfdv.infer_schema(statistics=stats_proto)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'City' | BYTES | required | | - |
| 'Mileage' | INT | required | | - |
| 'Make' | STRING | required | | 'Make' |
| 'Vin' | BYTES | required | | - |
| 'State' | STRING | required | | 'State' |
| 'Year' | INT | required | | - |
| 'Model' | BYTES | required | | - |
| 'Price' | INT | required | | - |


| Domain | Values |
|---|---|
| 'Make' | 'AM', 'Acura', 'Alfa', 'Aston', 'Audi', 'BMW', 'Bentley', 'Buick', 'Cadillac', 'Chevrolet', 'Chrysler', 'Dodge', 'FIAT', 'Ferrari', 'Fisker', 'Ford', 'Freightliner', 'GMC', 'Genesis', 'Geo', 'HUMMER', 'Honda', 'Hyundai', 'INFINITI', 'Isuzu', 'Jaguar', 'Jeep', 'Kia', 'Lamborghini', 'Land', 'Lexus', 'Lincoln', 'Lotus', 'MINI', 'Maserati', 'Maybach', 'Mazda', 'McLaren', 'Mercedes-Benz', 'Mercury', 'Mitsubishi', 'Nissan', 'Oldsmobile', 'Plymouth', 'Pontiac', 'Porsche', 'Ram', 'Rolls-Royce', 'Saab', 'Saturn', 'Scion', 'Subaru', 'Suzuki', 'Tesla', 'Toyota', 'Volkswagen', 'Volvo', 'smart' |
| 'State' | ' AK', ' AL', ' AR', ' AZ', ' Az', ' CA', ' CO', ' CT', ' Ca', ' DC', ' DE', ' FL', ' Fl', ' GA', ' Ga', ' HI', ' IA', ' ID', ' IL', ' IN', ' KS', ' KY', ' LA', ' MA', ' MD', ' ME', ' MI', ' MN', ' MO', ' MS', ' MT', ' Md', ' NC', ' ND', ' NE', ' NH', ' NJ', ' NM', ' NV', ' NY', ' OH', ' OK', ' OR', ' Oh', ' PA', ' RI', ' SC', ' SD', ' TN', ' TX', ' UT', ' VA', ' VT', ' Va', ' WA', ' WI', ' WV', ' WY', ' ga' |
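
The inferred `State` domain contains entries such as ' AZ' and ' Az' that differ only in case, and every value carries a leading space. The sketch below illustrates, in plain Python, what a domain check does and how a normalization step would collapse these variants. It is not the TFDV API: `find_domain_violations`, the three-state domain, and the sample values are all hypothetical.

```python
def find_domain_violations(values, domain, normalize=None):
    """Return the values that fall outside the allowed domain.

    normalize is an optional function applied to each value before
    the membership check; the original value is what gets reported.
    """
    violations = []
    for value in values:
        candidate = normalize(value) if normalize else value
        if candidate not in domain:
            violations.append(value)
    return violations


# Hypothetical cleaned domain and raw samples, for illustration only
clean_domain = {'AZ', 'CA', 'GA'}
raw_states = [' AZ', ' Az', ' ga', ' CA', ' XX']

# Without normalization every raw value fails because of the leading space
print(find_domain_violations(raw_states, clean_domain))

# Stripping whitespace and upper-casing leaves only the unknown code
print(find_domain_violations(raw_states, clean_domain,
                             normalize=lambda s: s.strip().upper()))
```

Cleaning the raw column this way before inferring the schema would likely give a smaller `State` domain without the case variants.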


In [None]:
# We make some changes to the schema. We want to treat prices as float. 