<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminary-Data-Analysis" data-toc-modified-id="Preliminary-Data-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminary Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#The-Dataset" data-toc-modified-id="The-Dataset-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>The Dataset</a></span></li><li><span><a href="#Missing-Data" data-toc-modified-id="Missing-Data-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Missing Data</a></span></li><li><span><a href="#Dropping-Features" data-toc-modified-id="Dropping-Features-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Dropping Features</a></span></li><li><span><a href="#Filling-In-Data" data-toc-modified-id="Filling-In-Data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Filling In Data</a></span></li></ul></li></ul></div>

# Preliminary Data Analysis


## Introduction

In this case study, you will be completing a real industry machine learning project. The project involves making sales predictions (sales forecasting) for various stores of a large retail corporation, based on a variety of different features. Along the way, you will also learn techniques for efficient data processing and model training.

A. Industry ML projects

Machine learning projects in industry consist of five main parts: data analysis, data processing, creating a model, training the model, and interpreting results. Data analysis is used to confirm that the dataset contains trends a machine learning model can use to make the required predictions. People often skip data analysis and end up confused about why their machine learning model doesn't work well, when in reality the dataset is not conducive to making predictions.

After data analysis and data processing, we create the appropriate machine learning model given the application and dataset. In most cases, a regular multi-layer perceptron (MLP) will suffice, but sometimes we use a convolutional neural network (CNN) for images or videos, or a recurrent neural network (RNN) for text-based data. Creating the model is often the easiest part of an industry ML project, since you mostly need to reproduce the code for one of these well-known models.

Finally, after training the model on the dataset, we use it on test data to confirm that the model works. The results obtained from using the model on the test data will be interpreted and reported to the project's supervisors. This is where we can confirm to the supervisors that the model indeed works for the given application, and can be deployed to production.

B. Sales forecasting

One of the most common uses of machine learning in industry is to predict retail sales for consumer-based corporations. Our case study will have you mimic the work of an ML engineer at a large retail corporation, where you'll be creating a model to predict the weekly sales of stores, based on a large training dataset.

In this section of the case study, you'll be performing data analysis to get a feel for the dataset and also confirm that there are enough trends in the dataset to predict weekly sales. You'll also perform a bit of data processing to organize the raw dataset, although the majority of the data processing is done in the Data Pipeline Creation section.


## The Dataset

Learn about the retail dataset used for the project.

Chapter Goals:

- Learn about the retail dataset used in this case study
- Read the separate data files that comprise the dataset

A. Starting off the project

Let's say you're a machine learning engineer at a large retail corporation, and your supervisor just gave you this dataset and said,

"I want a system that can make future sales predictions for these 45 stores. We want to know whether they'll make enough money to justify keeping them. Here's a dataset containing the past sales of these stores."

This is a pretty short and vague description of the project, which is normally the type of description you'd get from a manager or supervisor. Luckily, we can learn more about the project by looking through the dataset.

Industry data usually comes in CSV files, XLSX spreadsheets, JSON data files, or can be accessed from a database using SQL. In this case, our dataset comes in three CSV files: weekly_sales.csv, features.csv, and stores.csv. We'll use the pandas library's pd.read_csv function to read each CSV file into a DataFrame.

For more on pandas and data processing, check out the Machine Learning for Software Engineers course on Educative.

B. Understanding the dataset

Your supervisor gave you some basic details about the dataset. The weekly_sales.csv file contains rows detailing the sales (in dollars) for the departments of each store in a given week. We use this file to train a machine learning model to make future weekly sales predictions for each store's department.


In [1]:
"""
import pandas as pd

train_df = pd.read_csv('weekly_sales.csv')

print(train_df)

"""

After taking a look at the data using pandas, you confirm that the weekly_sales.csv file does indeed match your supervisor's description. There's also an additional column called 'Holiday', which is True if the row's week has a holiday, otherwise it's False.

The features.csv file contains potentially useful features, with values given on a weekly basis for each store. These features include a given week's national unemployment rate and the temperature of the region that the store is located in. The stores.csv file contains information about each of the 45 stores, specifically the type of store and the size of the store.

We'll take a deeper look at the features and stores CSV files in later chapters.


In [2]:
"""
def read_dataframes():
    train_df = pd.read_csv('weekly_sales.csv')
    features_df = pd.read_csv('features.csv')
    stores_df = pd.read_csv('stores.csv')
    return train_df, features_df, stores_df

"""

## Missing Data

Learn how to merge feature datasets and find missing data in features.

Chapter Goals:
- Merge the two DataFrames containing the feature data
- Learn how to identify features with missing data

A. Merging the features

Both the features_df and stores_df DataFrames contain feature data, i.e. data related to the stores or weeks that correspond to the rows in train_df. Remember that each feature is a column of the DataFrame, and that each row in a given column is an entry for that feature. Let's take a look at the specific features contained in each DataFrame.


In [None]:
"""
general_features = features_df.columns

print(general_features)
print('General Features: {}\n'.format(general_features.tolist()))

store_features = stores_df.columns
print('Store Features: {}'.format(store_features.tolist()))

"""

The code above shows the features (i.e. the columns) of the features_df and stores_df DataFrames. The tolist function converts the features from an Index to a list.

You'll notice that both these DataFrames share the 'Store' feature, which is just the ID of the store for a given row in the DataFrame. Since both these DataFrames contain useful data, we can make things easier for ourselves by merging the two DataFrames into one.

We do this by using the merge function and merging the DataFrames based on the 'Store' feature.


In [3]:
"""
merged_features = features_df.merge(stores_df, on='Store')

"""

The code above merges the two DataFrames. The new DataFrame (merged_features) contains all the features from features_df and stores_df. It has a total of 8190 rows.
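To see why the merged DataFrame keeps one row per weekly entry of features_df, here is a minimal sketch on made-up data. The toy values and the reduced set of columns are assumptions for illustration, not the real dataset. The `validate='many_to_one'` argument is an optional pandas check that each 'Store' ID appears only once in the store table.

```python
import pandas as pd

# Toy stand-ins for features_df and stores_df (values are invented).
toy_features = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Date': ['2010-02-05', '2010-02-12', '2010-02-05', '2010-02-12'],
    'Temperature': [42.3, 38.5, 40.2, 38.5],
})
toy_stores = pd.DataFrame({
    'Store': [1, 2],
    'Type': ['A', 'B'],
    'Size': [151315, 202307],
})

# Each weekly row picks up the Type and Size of its store.
# validate='many_to_one' raises an error if a store ID is duplicated
# in toy_stores, which would otherwise silently multiply rows.
toy_merged = toy_features.merge(toy_stores, on='Store', validate='many_to_one')

print(toy_merged.columns.tolist())
print(len(toy_merged) == len(toy_features))
```

Because every store ID in the toy feature table also appears in the toy store table, the default inner join leaves the row count unchanged.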

B. Finding missing data

Using the newly merged DataFrame, we can figure out which features contain missing data. 
This is a crucial step in the data analysis, since we need to perform data processing on any features that have missing row values (i.e. only some of the rows will have entries for that feature).

The pandas library represents missing values as NA in the DataFrame (displayed as NaN for numeric columns).
We use the pd.isna function combined with the any function to check which columns contain missing values.


In [4]:
"""
na_values = pd.isna(merged_features) # Boolean DataFrame
na_features = na_values.any() # Boolean Series
print(na_features)

"""


In [5]:
"""
Store           False
Date            False
Temperature     False
Fuel_Price      False
MarkDown1        True
...
"""


The 'CPI' and 'Unemployment' features contain missing values, along with each 'MarkDown' feature. We'll discuss how to handle these missing values in the following chapters.


## Dropping Features

Drop features from the dataset that have too many missing data values.

Chapter Goals:
- Figure out exactly how many missing values are in each feature
- Drop the features that contain too many missing values

A. Counting the missing values

In the previous chapter, we figured out that each of the 'MarkDown' features, along with the 'CPI' and 'Unemployment' features, contains missing values. We now want to figure out how many missing values each of these features has, i.e. how many rows of the combined feature DataFrame don't contain a value for the particular feature.

This can be done by counting the number of True values for each feature's column in the boolean DataFrame.


In [6]:
"""
print(len(na_values))
print(sum(na_values['MarkDown1']))
print(sum(na_values['CPI']))
"""

Since each feature's column contains True (equivalent to 1) or False (equivalent to 0), we just take the column's sum to count the number of True, i.e. missing values.
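The same count can be taken for every column at once. The sketch below uses a small made-up DataFrame (not the real data) together with the pandas `isna`, `sum`, and `mean` methods. Taking the mean of a boolean column gives the fraction of rows that are missing.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; NaN marks a missing entry.
toy = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'MarkDown1': [np.nan, np.nan, np.nan, 10.5],
    'CPI': [211.1, 211.2, np.nan, 210.9],
})

toy_na = toy.isna()            # Boolean DataFrame, like na_values
na_counts = toy_na.sum()       # number of missing values per column
na_fractions = toy_na.mean()   # fraction of rows missing per column

print(na_counts)
print(na_fractions)
```

Looking at fractions rather than raw counts makes it easier to compare columns against the total number of rows.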


B. Dropping unusable features

The numbers of missing values in the five 'MarkDown' features are 4158, 5269, 4577, 4726, and 4140, respectively. Since each of the 'MarkDown' features is missing in over half of the DataFrame's 8190 rows, we'll consider these features unusable and therefore drop them from the dataset.


In [7]:
"""
markdowns = [
    'MarkDown1',
    'MarkDown2',
    'MarkDown3',
    'MarkDown4',
    'MarkDown5'
]
merged_features = merged_features.drop(columns=markdowns)
print(merged_features.columns.tolist())

"""

The 'CPI' and 'Unemployment' features each contain only 585 missing values. This is far fewer than the total number of rows in the DataFrame (8190), so we can still use these features. We'll discuss how to deal with the missing values in the next chapter.
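The "missing in over half of the rows" rule can also be applied programmatically instead of listing columns by hand. This is only a sketch on toy data: the 0.5 threshold mirrors the reasoning above, and the helper name `drop_sparse_columns` is a hypothetical one introduced for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_na_fraction=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold.

    Hypothetical helper for illustration; returns the reduced DataFrame
    and the list of dropped column names.
    """
    na_fraction = df.isna().mean()
    sparse = na_fraction[na_fraction > max_na_fraction].index.tolist()
    return df.drop(columns=sparse), sparse

# Toy data with invented values.
toy = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'MarkDown1': [np.nan, np.nan, np.nan, 10.5],  # 75% missing
    'CPI': [211.1, 211.2, np.nan, 210.9],         # 25% missing
})

cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
print(cleaned.columns.tolist())
```

A fixed threshold is a judgment call; in practice the cutoff would depend on how important the feature is and how the remaining gaps can be filled.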


## Filling In Data

Fill in missing data for features that only have a few missing values.

Chapter Goals:

- Find the rows that contain missing values for 'CPI' and 'Unemployment'
- Fill in the missing values using previous row values

A. Finding the missing values

We previously noted that both the 'CPI' and 'Unemployment' features contain 585 missing values. We'll find the row indexes containing these missing values by first converting the feature columns in the na_values boolean DataFrame to integers, i.e. 0 and 1.

We then use the nonzero function to find the locations of the 1's, which correspond to the True values.
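As a small illustration of what nonzero returns, the sketch below uses a made-up boolean Series rather than the real data. It also shows `np.flatnonzero`, a NumPy function that gives the same positions directly from the boolean values, so the integer conversion is optional.

```python
import numpy as np
import pandas as pd

# Toy boolean column: True marks a missing value.
toy_na = pd.Series([False, True, False, True, True])

# Approach described above: convert to 0/1, then find nonzero positions.
# nonzero returns a tuple with one array per dimension, hence the [0].
idx_from_int = toy_na.astype(int).to_numpy().nonzero()[0]

# Alternative: flatnonzero works on the boolean array directly.
idx_from_bool = np.flatnonzero(toy_na.to_numpy())

print(idx_from_int)
print(np.array_equal(idx_from_int, idx_from_bool))
```

Note that these are positional indexes, which match the DataFrame's row labels only when it uses the default integer index.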


In [8]:
"""
import numpy as np  # NumPy library

na_cpi_int = na_values['CPI'].astype(int)
na_indexes_cpi = na_cpi_int.to_numpy().nonzero()[0]
na_une_int = na_values['Unemployment'].astype(int)
na_indexes_une = na_une_int.to_numpy().nonzero()[0]

print(np.array_equal(na_indexes_cpi, na_indexes_une))

"""