Please ensure you have run all previous notebooks in sequence before running this.

# Data Ingestion

In this notebook we cover basic data engineering. We will bring a small sample data set (never start with BIG DATA) into our cluster and "play" with the data.

In [3]:
import os
import urllib.request  # the request submodule must be imported explicitly

# Download data-scientist-model.csv from GitHub. This file has 4875 rows.
basedataurl = "https://raw.githubusercontent.com/davew-msft/MLOps-E2E/master/data"
datafile = "data-scientist-model.csv"
datafile_dbfs = os.path.join("/dbfs", datafile)

if os.path.isfile(datafile_dbfs):
    print("found {} at {}".format(datafile, datafile_dbfs))
else:
    print("downloading {} to {}".format(datafile, datafile_dbfs))
    # Build the URL with "/" explicitly; os.path.join is meant for filesystem paths.
    urllib.request.urlretrieve("{}/{}".format(basedataurl, datafile), datafile_dbfs)
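
The comment in the cell above says the file has 4875 rows. A minimal sketch of a sanity check that could run after the download, using only the standard library. `count_csv_rows` is a hypothetical helper, not part of the original notebook, and whether the figure of 4875 includes the header line is not stated, so the sketch returns the header and the data-row count separately.

```python
import csv

def count_csv_rows(path):
    """Return (header, number_of_data_rows) for a CSV file with a header line."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)               # first line is treated as the header
        n_rows = sum(1 for row in reader if row)  # completely blank lines are skipped
    return header, n_rows
```

In the notebook this could be called as `header, n_rows = count_csv_rows(datafile_dbfs)` and the result compared against the expected row count.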

In [4]:
# Create a Spark dataframe out of the csv file.
data_all = sqlContext.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)
print("({}, {})".format(data_all.count(), len(data_all.columns)))
data_all.printSchema()

In [5]:
data_all.head(5)

In [6]:
display(data_all.limit(5))

MaritalStatus,Gender,YearlyIncome,TotalChildren,NumberChildrenAtHome,Education,HouseOwnerFlag,NumberCarsOwned,CommuteDistance,Region,Age,BikeBuyer
1,1,40000,0,0,1,0,2,10,1,39,0
1,1,30000,3,0,1,0,2,2,2,66,0
1,1,30000,3,0,1,0,2,2,2,65,0
1,2,80000,2,0,1,1,2,10,2,61,0
1,2,80000,2,0,1,1,2,10,2,61,0


# What are we doing? 

BikeBuyer is a binary column: did this consumer ultimately buy a bike?  This is the *label*.

The other columns are *candidate* feature columns that may be predictive of the label.
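
To make the label/feature distinction concrete, here is a small sketch in plain Python (not Spark) that separates the five rows shown in the `display()` output above into feature vectors and labels. The names `sample_csv`, `feature_names`, `X`, and `y` are illustrative and do not appear in the original notebook.

```python
import csv
import io

# The five rows shown in the display() output above.
sample_csv = """MaritalStatus,Gender,YearlyIncome,TotalChildren,NumberChildrenAtHome,Education,HouseOwnerFlag,NumberCarsOwned,CommuteDistance,Region,Age,BikeBuyer
1,1,40000,0,0,1,0,2,10,1,39,0
1,1,30000,3,0,1,0,2,2,2,66,0
1,1,30000,3,0,1,0,2,2,2,65,0
1,2,80000,2,0,1,1,2,10,2,61,0
1,2,80000,2,0,1,1,2,10,2,61,0
"""

label = "BikeBuyer"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Every column except the label is a candidate feature.
feature_names = [name for name in rows[0] if name != label]

X = [[int(row[name]) for name in feature_names] for row in rows]  # feature vectors
y = [int(row[label]) for row in rows]                              # labels

print(len(feature_names))  # 11 candidate features
print(y)                   # [0, 0, 0, 0, 0]
```

All five sample rows have `BikeBuyer = 0`, so this preview alone says nothing about how balanced the label is across the full data set.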

Let's see if we can build a predictive model against this data set.  

Generally we need to split a data set into training and validation sets, so that the model is evaluated on rows it did not see during training.  A 75/25 split is a common starting point.
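
A minimal sketch, in plain Python rather than Spark, of what a seeded random 75/25 split looks like. Each row is assigned independently to the training set with probability 0.75, so the resulting sizes are close to, but usually not exactly, 75% and 25%. `random_split` is a hypothetical helper for illustration, and the integers stand in for the 4875 rows mentioned earlier. It is not a reimplementation of Spark's `randomSplit`.

```python
import random

def random_split(rows, train_fraction=0.75, seed=123):
    """Assign each row independently to train or test (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(4875))  # stand-in for the rows of the data set
train, test = random_split(rows)
print(len(train), len(test))  # sizes are approximate, not exactly 75/25
```

Fixing the seed matters for reproducibility: rerunning the split with the same seed and the same input returns the same two subsets.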

# Data Preparation

In [9]:
# Choose feature columns and the label column.
label = "BikeBuyer"
# Sorted list so the widget choices appear in a stable order.
xvals_all = sorted(set(data_all.columns) - {label})

#dbutils.widgets.remove("xvars_multiselect")
dbutils.widgets.removeAll()

dbutils.widgets.multiselect('xvars_multiselect', 'MaritalStatus', xvals_all)
xvars_multiselect = dbutils.widgets.get("xvars_multiselect")
xvars = xvars_multiselect.split(",")

print("label = {}".format(label))
print("features = {}".format(xvars))

data = data_all.select([*xvars, label])

# Split data into train and test.
train, test = data.randomSplit([0.75, 0.25], seed=123)

print("train ({}, {})".format(train.count(), len(train.columns)))
print("test ({}, {})".format(test.count(), len(test.columns)))
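
The widget value arrives as a single comma-separated string, which the cell above splits on `","`. One thing to watch for: in Python, `"".split(",")` returns `[""]`, so an empty value would produce a single empty column name. A minimal sketch of a more defensive parser follows. `parse_multiselect` is a hypothetical helper, the `allowed` set is a shortened example, and whether the widget can actually return an empty string is an assumption.

```python
def parse_multiselect(value, allowed):
    """Split a comma-separated widget value into valid column names (hypothetical helper)."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError("unknown columns: {}".format(unknown))
    if not names:
        raise ValueError("select at least one feature column")
    return names

allowed = {"MaritalStatus", "Gender", "YearlyIncome", "Age"}  # shortened example
print(parse_multiselect("MaritalStatus,Age", allowed))  # ['MaritalStatus', 'Age']
```

Failing early with a clear message is easier to debug than a column-resolution error raised later by `select`.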

# Data Persistence

In [11]:
# Write the train and test data sets to intermediate storage
train_data_path = "BikeBuyerTrain"
test_data_path = "BikeBuyerTest"

# Reuse the path variables instead of repeating the string literals.
train_data_path_dbfs = os.path.join("/dbfs", train_data_path)
test_data_path_dbfs = os.path.join("/dbfs", test_data_path)

train.write.mode('overwrite').parquet(train_data_path)
test.write.mode('overwrite').parquet(test_data_path)
print("train and test datasets saved to {} and {}".format(train_data_path_dbfs, test_data_path_dbfs))
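
The cell above writes through Spark to one path and reports a `/dbfs/...` path. A minimal sketch of the mapping between the two forms, assuming the usual Databricks convention that a `dbfs:/` location is visible on the driver's local filesystem under `/dbfs/`. `to_local_dbfs_path` is a hypothetical helper. How a relative path such as `"BikeBuyerTrain"` resolves is not stated in this notebook, so the sketch rejects relative paths rather than guess.

```python
import posixpath

def to_local_dbfs_path(spark_path):
    """Map a 'dbfs:/...' URI or absolute DBFS path to its '/dbfs/...' form (hypothetical helper)."""
    if spark_path.startswith("dbfs:"):
        spark_path = spark_path[len("dbfs:"):]
    if not spark_path.startswith("/"):
        raise ValueError("expected an absolute DBFS path, got {!r}".format(spark_path))
    return posixpath.join("/dbfs", spark_path.lstrip("/"))

print(to_local_dbfs_path("dbfs:/BikeBuyerTrain"))  # /dbfs/BikeBuyerTrain
```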

In [12]:
dbutils.notebook.exit("success")

success