<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminary-Data-Analysis" data-toc-modified-id="Preliminary-Data-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminary Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#The-Dataset" data-toc-modified-id="The-Dataset-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>The Dataset</a></span></li><li><span><a href="#Missing-Data" data-toc-modified-id="Missing-Data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Missing Data</a></span></li><li><span><a href="#Dropping-Features" data-toc-modified-id="Dropping-Features-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Dropping Features</a></span></li><li><span><a href="#Filling-In-Data" data-toc-modified-id="Filling-In-Data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Filling In Data</a></span></li><li><span><a href="#Merging-Data" data-toc-modified-id="Merging-Data-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Merging Data</a></span></li><li><span><a href="#Categorical-Data" data-toc-modified-id="Categorical-Data-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Categorical Data</a></span></li><li><span><a href="#Bar-Plot" data-toc-modified-id="Bar-Plot-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Bar Plot</a></span></li><li><span><a href="#Interpreting-Plots" data-toc-modified-id="Interpreting-Plots-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Interpreting Plots</a></span></li></ul></li><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Processing</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a 
href="#Splitting-Datasets" data-toc-modified-id="Splitting-Datasets-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Splitting Datasets</a></span></li><li><span><a href="#Integer-Features" data-toc-modified-id="Integer-Features-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Integer Features</a></span></li></ul></li></ul></div>

# Preliminary Data Analysis


## Introduction

In this case study, you will be completing a real industry machine learning project. The project involves making sales predictions (sales forecasting) for various stores of a large retail corporation, based on a variety of different features. Along the way, you will also learn techniques for efficient data processing and model training.

A. Industry ML projects

Machine learning projects in industry consist of five main parts: data analysis, data processing, creating a model, training the model, and interpreting results. Data analysis is used to confirm that machine learning can actually find useful trends in the dataset to make the requisite predictions. Oftentimes, people skip data analysis and end up confused about why their machine learning model doesn't work well, when in reality the dataset is not conducive to making predictions.

After data analysis and data processing, we create the appropriate machine learning model given the application and dataset. In most cases, a regular multi-layer perceptron (MLP) will suffice, but sometimes we use a convolutional neural network (CNN) for images/videos or a recurrent neural network (RNN) for text-based data. Creating the model is the easiest part of an industry ML project, since you just need to reproduce the code for one of these well-known models.

Finally, after training the model on the dataset, we use it on test data to confirm that the model works. The results obtained from using the model on the test data will be interpreted and reported to the project's supervisors. This is where we can confirm to the supervisors that the model indeed works for the given application, and can be deployed to production.

B. Sales forecasting

One of the most common uses of machine learning in industry is to predict retail sales for consumer-based corporations. Our case study will have you mimic the work of an ML engineer at a large retail corporation, where you'll be creating a model to predict the weekly sales of stores, based on a large training dataset.

In this section of the case study, you'll be performing data analysis to get a feel for the dataset and also confirm that there are enough trends in the dataset to predict weekly sales. You'll also perform a bit of data processing to organize the raw dataset, although the majority of the data processing is done in the Data Pipeline Creation section.


## The Dataset

Learn about the retail dataset used for the project.

Chapter Goals:

- Learn about the retail dataset used in this case study
- Read the separate data files that comprise the dataset

A. Starting off the project

Let's say you're a machine learning engineer at a large retail corporation, and your supervisor just gave you this dataset and said,

"I want a system that can make future sales predictions for these 45 stores. We want to know whether they'll make enough money to justify keeping them. Here's a dataset containing the past sales of these stores."

This is a pretty short and vague description of the project, which is normally the type of description you'd get from a manager or supervisor. Luckily, we can learn more about the project by looking through the dataset.

Industry data usually comes in CSV files, XLSX spreadsheets, JSON data files, or can be accessed from a database using SQL. In this case, our dataset comes in three CSV files: weekly_sales.csv, features.csv, and stores.csv. We'll use the pandas library's pd.read_csv function to read each CSV file into a DataFrame.

For more on pandas and data processing, check out the Machine Learning for Software Engineers course on Educative.

B. Understanding the dataset

Your supervisor gave you some basic details about the dataset. The weekly_sales.csv file contains rows detailing the sales (in dollars) for the departments of each store in a given week. We use this file to train a machine learning model to make future weekly sales predictions for each store's department.


In [1]:
"""
import pandas as pd

train_df = pd.read_csv('weekly_sales.csv')

print(train_df)

"""

After taking a look at the data using pandas, you confirm that the weekly_sales.csv file does indeed match your supervisor's description. There's also an additional column called 'IsHoliday', which is True if the row's week contains a holiday and False otherwise.

The features.csv file contains potentially useful features, with values given on a weekly basis for each store. These features include a given week's national unemployment rate and the temperature of the region that the store is located in. The stores.csv file contains information about each of the 45 stores, specifically the type of store and the size of the store.

We'll take a deeper look at the features and stores CSV files in later chapters.


In [2]:
"""
def read_dataframes():
    train_df = pd.read_csv('weekly_sales.csv')
    features_df = pd.read_csv('features.csv')
    stores_df = pd.read_csv('stores.csv')
    return train_df, features_df, stores_df

"""

## Missing Data

Learn how to merge feature datasets and find missing data in features.

Chapter Goals:
- Merge the two DataFrames containing the feature data
- Learn how to identify features with missing data

A. Merging the features

Both the features_df and stores_df DataFrames contain feature data, 
i.e. data related to the stores or weeks that correspond to the rows in train_df. 
Remember each feature is a column of the DataFrame, 
and that each row in a given column is an entry for that feature. 
Let's take a look at the specific features contained in each DataFrame.


In [None]:
"""
general_features = features_df.columns

print(general_features)
print('General Features: {}\n'.format(general_features.tolist()))

store_features = stores_df.columns
print('Store Features: {}'.format(store_features.tolist()))

"""

The code above shows the features (i.e. the columns) of the features_df and stores_df DataFrames. The tolist function converts the features from an Index to a list.

You'll notice that both these DataFrames share the 'Store' feature, which is just the ID of the store for a given row in the DataFrame. Since both these DataFrames contain useful data, we can make things easier for ourselves by merging the two DataFrames into one.

We do this by using the merge function and merging the DataFrames based on the 'Store' feature.


In [3]:
"""
merged_features = features_df.merge(stores_df, on='Store')

"""

The code above merges the two DataFrames. The new DataFrame (merged_features) contains all the features from features_df and stores_df. It has a total of 8190 rows.
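To see why the merged DataFrame keeps one row per store-week entry, here is a minimal sketch on made-up data. The two small DataFrames and all of their values are hypothetical stand-ins for features_df and stores_df; only the column names come from the case study. The `validate` argument is an optional safety check, not something the course code uses.

```python
import pandas as pd

# Hypothetical stand-in for features_df: one row per store per week
toy_features = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Date': ['2018-04-19', '2018-04-26', '2018-04-19', '2018-04-26'],
    'Temperature': [61.0, 64.5, 58.2, 60.1],
})

# Hypothetical stand-in for stores_df: one row per store
toy_stores = pd.DataFrame({
    'Store': [1, 2],
    'Type': ['A', 'B'],
    'Size': [150000, 90000],
})

# validate='many_to_one' raises an error if a store ID appears twice in toy_stores
toy_merged = toy_features.merge(toy_stores, on='Store', validate='many_to_one')

print(toy_merged.shape)             # same number of rows as toy_features
print(toy_merged.columns.tolist())  # weekly columns followed by store columns
```

Because each store ID appears only once in the store-level table, the merge adds columns without adding rows.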

B. Finding missing data

Using the newly merged DataFrame, we can figure out which features contain missing data. 
This is a crucial step in the data analysis, since we need to perform data processing on any features that have missing row values (i.e. only some of the rows will have entries for that feature).

The pandas library will represent missing values with an NA in the DataFrame. 
We use the pd.isna function combined with the any function to check which columns contain missing values.


In [4]:
"""
na_values = pd.isna(merged_features) # Boolean DataFrame
na_features = na_values.any() # Boolean Series
print(na_features)

"""


In [5]:
"""
Store           False
Date            False
Temperature     False
Fuel_Price      False
MarkDown1        True
...
"""


The 'CPI' and 'Unemployment' features contain missing values, along with each 'MarkDown' feature. We'll discuss how to handle these missing values in the next chapter.


## Dropping Features

Drop features from the dataset that have too many missing data values.

Chapter Goals:
- Figure out exactly how many missing values are in each feature
- Drop the features that contain too many missing values

A. Counting the missing values

In the previous chapter, we figured out that each of the 'MarkDown' features, along with the 'CPI' and 'Unemployment' features, contains missing values. We now want to figure out how many missing values each of these features has, i.e. how many rows of the combined feature DataFrame don't contain a value for the particular feature.

This can be done by counting the number of True values for each feature's column in the boolean DataFrame.


In [6]:
"""
print(len(na_values))
print(sum(na_values['MarkDown1']))
print(sum(na_values['CPI']))
"""

Since each feature's column contains True (equivalent to 1) or False (equivalent to 0), we just take the column's sum to count the number of True, i.e. missing values.
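As a small illustration of the same counting idea, the sketch below sums the boolean mask for every column at once. The DataFrame and its values are made up for the example; `DataFrame.isna` and `DataFrame.sum` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy_df = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'CPI': [210.5, np.nan, 190.2, np.nan],
    'MarkDown1': [np.nan, np.nan, np.nan, 1200.0],
})

# True counts as 1 and False as 0, so summing the mask counts missing values
na_counts = toy_df.isna().sum()
print(na_counts)
```

This gives one count per column, which is convenient when there are many features to check.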


B. Dropping unusable features

The numbers of missing values in the five 'MarkDown' features are 4158, 5269, 4577, 4726, and 4140, respectively. Since each 'MarkDown' feature is missing in over half of the DataFrame's 8190 rows, we'll consider these features unusable and drop them from the dataset.


In [7]:
"""
markdowns = [
    'MarkDown1',
    'MarkDown2',
    'MarkDown3',
    'MarkDown4',
    'MarkDown5'
]
merged_features = merged_features.drop(columns=markdowns)
print(merged_features.columns.tolist())

"""

Both the 'CPI' and 'Unemployment' features contain only 585 missing values. This is significantly less than the total number of rows in the DataFrame (8190), so we can still use these features. We'll discuss how to deal with the missing values in the next chapter.
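The "over half missing" rule can also be written as an explicit threshold. The following is a sketch on made-up data, not the course's code: the toy DataFrame is hypothetical and the name `MAX_MISSING_FRACTION` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'CPI': [210.5, 210.7, 190.2, np.nan],           # 25% missing
    'MarkDown1': [np.nan, np.nan, np.nan, 1200.0],  # 75% missing
})

MAX_MISSING_FRACTION = 0.5  # assumed cutoff, mirroring the "over half" rule

# The mean of a boolean column is the fraction of True (missing) entries
missing_fraction = toy_df.isna().mean()
to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()
toy_df = toy_df.drop(columns=to_drop)

print(to_drop)
print(toy_df.columns.tolist())
```

With this cutoff, 'MarkDown1' is dropped while 'CPI' is kept for imputation.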


## Filling In Data

Fill in missing data for features that only have a few missing values.

Chapter Goals:

- Find the rows that contain missing values for 'CPI' and 'Unemployment'
- Fill in the missing values using previous row values

A. Finding the missing values

We previously noted that both the 'CPI' and 'Unemployment' features contain 585 missing values. We'll find the row indexes containing these missing values by first converting the feature columns in the na_values boolean DataFrame to integers, i.e. 0 and 1.

We then use the nonzero function to find the locations of the 1's, which correspond to the True values.


In [8]:
"""
import numpy as np  # NumPy library

na_cpi_int = na_values['CPI'].astype(int)
na_indexes_cpi = na_cpi_int.to_numpy().nonzero()[0]
na_une_int = na_values['Unemployment'].astype(int)
na_indexes_une = na_une_int.to_numpy().nonzero()[0]

print(np.array_equal(na_indexes_cpi, na_indexes_une))

"""

The row indexes are located in the na_indexes_cpi and na_indexes_une NumPy arrays, which you can see contain the exact same row indexes (sorted in ascending order). Now let's take a closer look at the exact rows that contain the missing values.
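As a side note, NumPy treats True as nonzero, so the integer conversion is optional. The sketch below, using a made-up Series, compares the integer-based approach with `np.flatnonzero` applied directly to the boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing values at positions 1, 3, and 4
toy_cpi = pd.Series([210.5, np.nan, 190.2, np.nan, np.nan])
na_mask = toy_cpi.isna()

# Approach from this chapter: convert to integers, then find the nonzero positions
via_int = na_mask.astype(int).to_numpy().nonzero()[0]

# Alternative: flatnonzero works on the boolean array directly
via_bool = np.flatnonzero(na_mask.to_numpy())

print(via_bool)
print(np.array_equal(via_int, via_bool))
```

Both approaches return positional indexes, which is what `iloc` expects.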


In [9]:
"""
na_indexes = na_indexes_cpi
na_rows = merged_features.iloc[na_indexes]
print(na_rows['Date'].unique())  # missing value weeks

print(merged_features['Date'].unique()[-13:])  # final 13 weeks

print(na_rows.groupby('Store').count()['Date'].unique())

"""

In the code above, we found that the missing values for 'CPI' and 'Unemployment' correspond to only 13 weeks, which are in fact the final 13 weeks of the entire dataset. Furthermore, the final line of code shows that each store contains 13 weeks with missing values.

Since there are 45 stores, and 13 weeks missing per store, that gives us the total of 585 rows with missing values.


B. Filling in the values

The 'CPI' and 'Unemployment' features correspond to the national consumer price index and unemployment rate. These values have minimal change on a week-to-week basis, so we can fill in the missing values using the final 'CPI' and 'Unemployment' values for the week of '2018-04-26' (the final week without a missing value).


In [11]:
"""
print(na_indexes[0])  # first missing value row index
print()

first_missing_row = merged_features.iloc[169]
print(first_missing_row[['Date','CPI','Unemployment']])
print()

final_val_row = merged_features.iloc[168]
print(final_val_row[['Date','CPI','Unemployment']])
print()

cpi_final_val = merged_features.at[168, 'CPI']
une_final_val = merged_features.at[168, 'Unemployment']
merged_features.at[169, 'CPI'] = cpi_final_val
merged_features.at[169, 'Unemployment'] = une_final_val

new_row = merged_features.iloc[169]
print(new_row[['Date','CPI','Unemployment']])
print()

"""

Since the row indexes in na_indexes are sorted in ascending order, we can fill in the missing values using a for loop through na_indexes (more details in the coding exercise at the end of this chapter).


C. Imputation variants

Filling in missing data with substituted values is known as imputation. The method we used for our dataset replaced missing values with values from closely related observations. However, there are many other forms of imputation, such as replacing the feature's missing data with the feature's mean, median, or mode.

There are also more advanced imputation methods such as k-Nearest Neighbors (filling in missing values based on similarity scores from the kNN algorithm) and MICE (applying multiple chained imputations, assuming the missing values are randomly distributed across observations).

In most industry cases these advanced methods are not required, since the data is either perfectly cleaned or the missing values are scarce. Nevertheless, the advanced methods could be useful when dealing with open source datasets, since these tend to be more incomplete.
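To make two of the simpler variants concrete, here is a minimal sketch on made-up data. It shows a per-store forward fill (carrying the last known value forward, similar in spirit to the approach in this chapter) and a median fill. The toy DataFrame is hypothetical, and this is an alternative formulation rather than the course's own solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the final weeks of each store are missing 'CPI'
toy_df = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2, 2],
    'CPI': [210.0, 211.0, np.nan, 190.0, np.nan, np.nan],
})

# Forward fill within each store, so values never leak across stores
ffilled = toy_df.groupby('Store')['CPI'].ffill()

# Median imputation: replace every missing value with the column median
median_filled = toy_df['CPI'].fillna(toy_df['CPI'].median())

print(ffilled.tolist())
print(median_filled.tolist())
```

Forward fill suits slowly changing, time-ordered values such as 'CPI', while mean or median fills ignore the ordering of the rows.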


In [12]:
"""
def impute_data(merged_features, na_indexes_cpi, na_indexes_une):
    for i in na_indexes_cpi:
        merged_features.at[i, 'CPI'] = merged_features.at[i - 1, 'CPI']
    for i in na_indexes_une:
        merged_features.at[i, 'Unemployment'] = merged_features.at[i - 1, 'Unemployment']

"""

## Merging Data

Merge the main training dataset with its corresponding feature data.

Chapter Goals:
- Create the final dataset by merging the training and features DataFrames

A. The final dataset

For the same organizational reasons we had in merging the features and stores DataFrames, we'll now merge the training and combined features DataFrames.

Remember that the features DataFrame contains potentially useful features listed weekly by store, and the stores DataFrame contains the type and size of each store.


In [13]:
"""
train_df = pd.read_csv('weekly_sales.csv')
print(train_df.columns.tolist())

# Merged and imputed stores + features
print(merged_features.columns.tolist())
"""

The code above shows that the two DataFrames share the features 'Store', 'Date', and 'IsHoliday'. Therefore, we merge the DataFrames on these three features.
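Rather than comparing the two printed lists by eye, the shared columns can be computed. The sketch below uses two empty DataFrames whose column lists are assumptions chosen to resemble train_df and merged_features; the real files may contain additional columns.

```python
import pandas as pd

# Assumed column layouts, for illustration only
toy_train = pd.DataFrame(
    columns=['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'])
toy_features = pd.DataFrame(
    columns=['Store', 'Date', 'Temperature', 'IsHoliday', 'Type', 'Size'])

# Keep the order of the training columns while filtering to the shared ones
shared = [col for col in toy_train.columns if col in toy_features.columns]
print(shared)
```

The resulting list can be passed directly as the `on` argument of `merge`.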

While the 'Date' feature is useful in the sense that it allows us to identify important values for a given week, like unemployment rate or CPI, it's not used directly in training a machine learning model. Therefore, we drop it from the final dataset.


In [14]:
"""
features = ['Store', 'Date', 'IsHoliday']
final_dataset = train_df.merge(merged_features, on=features)
final_dataset = final_dataset.drop(columns=['Date'])

print(final_dataset.columns.tolist())
"""

## Categorical Data

Learn about categorical data and how it is used in a dataset.

Chapter Goals:
- Analyze the categorical features in the dataset

A. The dataset format

When using the dataset to train a machine learning model, each feature needs to be an integer, float, or string type. The float data is the numeric data, i.e. the data that can be quantified and analyzed using operations like mean or standard deviation. The string data is categorical, meaning that each string represents some unique category for the feature. Integer data can be either numeric (e.g. kilometer distance) or categorical (e.g. year of birth).

In the final dataset we're using, the categorical features are the 'Store', 'Type', and 'Dept' features. Since we already know there are 45 stores, labeled from 1 to 45, we just need to investigate the 'Type' and 'Dept' features.


In [16]:
"""
print(final_dataset['Type'].unique())
print(final_dataset['Dept'].unique())
"""

There are only three categories of stores shown in the 'Type' feature: 'A', 'B', and 'C'. There are 81 store departments, and each is a positive integer less than 100. We'll discuss how to process and use categorical and numeric features in a machine learning model later in this course.
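When a categorical feature has many categories, counting them is easier than reading the printed array. A minimal sketch on made-up data, using the standard pandas methods `nunique` and `value_counts`:

```python
import pandas as pd

# Hypothetical toy data with the same categorical column names
toy_df = pd.DataFrame({
    'Type': ['A', 'B', 'A', 'C', 'A', 'B'],
    'Dept': [1, 1, 2, 2, 5, 5],
})

n_types = toy_df['Type'].nunique()           # number of distinct store types
n_depts = toy_df['Dept'].nunique()           # number of distinct departments
type_counts = toy_df['Type'].value_counts()  # rows per store type

print(n_types, n_depts)
print(type_counts)
```

The per-category row counts also reveal whether some categories are rare, which is worth knowing before training a model.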


In [17]:
"""
import pandas as pd

final_dataset['IsHoliday'] = final_dataset['IsHoliday'].astype(int)

"""

## Scatter Plot

Learn how to create scatter plots comparing numerical features to sales.

Chapter Goals:
- Create a scatter plot to visualize the relationships between dataset features and weekly sales

A. Plotting data

The most important part of interpreting data is being able to visualize trends and correlations between dataset features. We do this through data plots, which allow us to easily discover interesting patterns in the dataset and decide whether the dataset is feasible for training a machine learning model.

We create the data plots using the pyplot API of Matplotlib. One of the most common plots for data analysis is the 2-D scatter plot. It's used for plotting the relationship between a numeric dependent feature (Y-axis) and a numeric independent feature (X-axis). Since we're trying to predict weekly sales for stores, we'll make plots with the 'Weekly_Sales' feature as the dependent feature.


In [18]:
"""
import matplotlib.pyplot as plt

plot_df = final_dataset[['Weekly_Sales', 'Temperature']]
rounded_temp = plot_df['Temperature'].round(0)  # nearest integer
plot_df = plot_df.groupby(rounded_temp).mean()
plot_df.plot.scatter(x='Temperature', y='Weekly_Sales')
plt.show()

"""

In the code, we rounded the temperature to the nearest integer and took the average weekly sales at each integer temperature value (degrees Fahrenheit).

We'll do more detailed analysis of the feature scatter plots in later chapters, but you can tell from the above scatter plot that sales are pretty consistent around $160,000 weekly, although they tend to drop at the high and low ends of the temperature spectrum.

For organizational purposes, and to present a more professional plot to the project supervisor, we should label and title the scatter plots appropriately.


In [19]:
"""
import matplotlib.pyplot as plt

plot_df = final_dataset[['Weekly_Sales', 'Temperature']]
rounded_temp = plot_df['Temperature'].round(0)  # nearest integer
plot_df = plot_df.groupby(rounded_temp).mean()
plot_df.plot.scatter(x='Temperature', y='Weekly_Sales')
plt.title('Temperature vs. Weekly Sales')
plt.xlabel('Temperature (Fahrenheit)')
plt.ylabel('Avg Weekly Sales (Dollars)')
plt.show()

"""

## Bar Plot

Learn how to create bar plots comparing categorical features to sales.

Chapter Goals:
- Create a bar plot to visualize the correlation between dataset features and weekly sales

A. Categorical feature plots

For categorical features, it doesn't make sense to create a scatter plot. Instead, when we're using a categorical feature as the independent variable (with a numeric feature as the dependent variable), we'll make a bar plot.

When using weekly sales as the dependent variable, we'll create bar plots that show the average weekly sale amount for each category in a categorical feature.


In [20]:
"""
plot_df = final_dataset[['Weekly_Sales', 'Type']]
plot_df = plot_df.groupby('Type').mean()
plot_df.plot.bar()
plt.title('Store Type vs. Weekly Sales')
plt.xlabel('Type')
plt.ylabel('Avg Weekly Sales (Dollars)')
plt.show()

"""

The above bar plot shows that stores of type A have significantly higher average weekly sales than stores of type B or C.


## Interpreting Plots

Interpret scatter and bar plots in the context of the project.

Chapter Goals:
- Analyze plots of the dataset's features with respect to weekly sales and usage in a machine learning model

A. Analyzing within context

After creating the dataset plots, we should analyze them to determine whether they're useful in the context of our problem. For us, this means deciding whether there's enough correlation between the dataset features and weekly sales to train a machine learning model.

The main thing we're looking for in the plots is non-uniform distributions. A uniform distribution means that the weekly sales are identical regardless of the data feature's value. A non-uniform distribution, such as a normal distribution or a multi-modal distribution, shows that the dataset feature can potentially be used by a machine learning model to predict the sales.

It can also be beneficial to plot multiple features at once against the weekly sales. Sometimes, when plotting multiple features, we can find trends that would not have been found from the single feature's plot.
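One way to look at two features at once is to group by both and arrange the averages in a small table. The sketch below uses made-up sales figures; the choice of 'Type' and 'IsHoliday' as the pair of features is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'Type': ['A', 'A', 'B', 'B', 'A', 'B'],
    'IsHoliday': [0, 1, 0, 1, 0, 1],
    'Weekly_Sales': [20000.0, 26000.0, 12000.0, 12500.0, 22000.0, 13500.0],
})

# Average weekly sales for each (Type, IsHoliday) pair;
# unstack moves IsHoliday into the columns
grouped = toy_df.groupby(['Type', 'IsHoliday'])['Weekly_Sales'].mean().unstack()
print(grouped)
```

Calling `grouped.plot.bar()` on a table like this draws one group of bars per store type, with one bar per holiday flag, which makes interactions between the two features easier to spot.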



# Data Processing


## Introduction

In this section of the course you will be performing data processing on the final dataset from the Preliminary Data Analysis section. Specifically, you'll be building the input pipeline for training and evaluating the machine learning model.

A. Additional data processing

In the Preliminary Data Analysis section, we performed data analysis on the retail dataset and concluded that the correlation between the dataset's features and weekly sales is strong enough to predict weekly store sales.

The files in that chapter were small enough to use relatively simple techniques. If the files were larger and more resource intensive, we would have used the techniques laid out in the Efficient Data Processing Techniques section.

After informing the project supervisor that the prediction task is viable, we begin working on the machine learning model.

However, before writing any actual machine learning code, we know that we need to continue processing the data to create an efficient input pipeline. The input pipeline represents how the data will be passed into the model for each step of training or evaluation. Since training the model requires thousands of steps, it is important that the input pipeline is as efficient as possible.

The final dataset we created was stored in a pandas DataFrame. Since the DataFrame is not the most efficient data storage for the input pipeline, we'll need to perform additional processing to create a more efficient solution.


## Splitting Datasets

Split the overall project's dataset into training and evaluation sets.

Chapter Goals:
- Learn about training and evaluation sets
- Split the project's final dataset into training and evaluation sets

A. Training and evaluation

There are two main components in creating a machine learning model: training and evaluation. Training is the foundation of machine learning, but evaluation is just as important. Model evaluation gives us a concrete idea of just how good the model is after training, and it allows us to compare the performances for different configurations of the model.

B. Set proportions

The question now becomes how much of the data we use for training and how much we use for evaluation. We should use a lot more data for training (the training set) compared to the data for evaluation (the evaluation set). The exact amount is up to the machine learning engineer to decide.

Using more data in training would potentially improve the model's performance, but it would limit us in how accurate our evaluation is due to the limited evaluation set size. On the other hand, having a larger evaluation set would give us more confidence in our evaluation process' accuracy, but it might limit the amount and diversity of the data in training.

In our case study, we choose a 90-10 split, meaning that the training set comprises 90% of the final dataset while the evaluation set comprises 10%. Since the overall dataset is pretty large, a 10% evaluation set still gives us a good representation of the overall dataset. Therefore, we can afford to put 90% of the dataset in training to maximize the model's performance.

C. Removing systematic trends

Before splitting the final dataset into training and evaluation sets, we need to randomly shuffle it. This is because the dataset is currently sorted by date and store, which is a systematic trend that we need to remove.

If we don't remove this trend, each training step will have data that is too similar to adjacent training steps, since they will likely be for adjacent weeks of the same store. This is an artificial characteristic that doesn't appear in real life, so it would negatively impact the model for real life predictions.
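The shuffle-then-split idea can be sketched on a small, made-up DataFrame. The `random_state` argument is an addition for reproducibility and is an assumption of this sketch, not part of the course's function.

```python
import pandas as pd

# Hypothetical toy data, sorted by store and week like the real dataset
toy_df = pd.DataFrame({
    'Store': [1] * 10 + [2] * 10,
    'Week': list(range(10)) * 2,
})

# sample(frac=1) returns all rows in random order; a fixed seed makes it repeatable
shuffled = toy_df.sample(frac=1, random_state=0)

# 90-10 split: the first tenth is for evaluation, the rest for training
eval_size = len(shuffled) // 10
eval_set = shuffled.iloc[:eval_size]
train_set = shuffled.iloc[eval_size:]

print(len(train_set), len(eval_set))
```

Because the split happens after shuffling, both sets draw rows from across all stores and weeks rather than from one contiguous block.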


In [21]:
"""
def split_train_eval(final_dataset):
    final_dataset = final_dataset.sample(frac=1)
    eval_size = len(final_dataset) // 10
    eval_set = final_dataset.iloc[:eval_size]
    train_set = final_dataset.iloc[eval_size:]
    return train_set, eval_set
"""


## Integer Features

Learn about the integer features used in the dataset.

Chapter Goals:
- Add the integer features of a DataFrame's row to a feature dictionary

A. Using Example objects

Each row of the final pandas DataFrame from the Data Analysis Lab contains the feature data for one data observation, i.e. the feature data for one store's sales in a particular week. To optimize the input pipeline, we want to convert each DataFrame row into a TensorFlow Example object. By using Example objects in the input pipeline, we're able to efficiently feed the data into a machine learning model.

After converting a DataFrame row to a TensorFlow Example, the row's integer valued features will be represented by Int64List TensorFlow Feature objects. From the analysis of our dataset, we know that the features with integer values are 'Store', 'Dept', 'IsHoliday', and 'Size'.


In [22]:
"""
import tensorflow as tf

def add_int_features(dataset_row, feature_dict):
    # Integer-valued features identified during data analysis
    int_vals = ['Store', 'Dept', 'IsHoliday', 'Size']
    for feature_name in int_vals:
        # Cast to a plain Python int before wrapping it in an Int64List
        list_val = tf.train.Int64List(value=[int(dataset_row[feature_name])])
        feature_dict[feature_name] = tf.train.Feature(int64_list=list_val)

"""