# <font color = red>Introduction to Business Analytics:<br>Using Python for Better Business Decisions</font>

<br>
    <center><img src="http://dataanalyticscorp.com/wp-content/uploads/2018/03/logo.png"></center>
<br>
Taught by: 

* **Walter R. Paczkowski**, Ph.D. 

    * My Affiliations: [Data Analytics Corp.](http://www.dataanalyticscorp.com/) and [Rutgers University](https://economics.rutgers.edu/people/teaching-personnel)
    * [Email Me With Questions](mailto:walt@dataanalyticscorp.com)
    * [Learn About Me](http://www.dataanalyticscorp.com/)
    * [See My LinkedIn Profile](https://www.linkedin.com/in/walter-paczkowski-a17a1511/)
    * [See My Books](https://www.amazon.com/-/e/B084KK4SF5?ref_=pe_1724030_132998070)    

## Slide Set-up

This code sets up the presentation slides.

In [None]:
##
## Slide code
##
from IPython.display import Image, display
def slide(what):
    display( Image( "../Slides/BA_Page_" + what + ".png", width = 50, height = 50, retina = True ) )

## Contents

1. [**_Helpful Background_**](#Helpful-Background)
    1. [About this Notebook](#About-this-Notebook)
    2. [Helpful Online Tutorials](#Helpful-Online-Tutorials)
    3. [Helpful Must-Read Book](#Helpful-Must-Read-Book)
2. [**_Lesson I Introduction to Business Analytics_**](#Lesson-I-Introduction-to-Business-Anaytics)
3. [**_Lesson II Simple Analytics: Understanding and Preparing Your Data_**](#Lesson-II-Simple-Analytics:-Understanding-and-Preparing-Your-Data)
    1. [II.1 Documenting Your Data and Workflow: Best Practices](#II.1-Documenting-Your-Data-and-Workflow:-Best-Practices)
        1. [II.1.1 Documenting Your Data](#II.1.1-Documenting-Your-Data)
        2. [Exercise II.1](#Exercise-II.1)
        3. [II.1.2 Documenting Your Workflow](#II.1.2-Documenting-Your-Workflow)
        4. [II.1.3 Documenting Your Code](#II.1.3-Documenting-Your-Code)
    2. [II.2 Importing Python Packages](#II.2-Importing-Python-Packages)
        1. [II.2.1 Introduction to Python Packages](#II.2.1-Introduction-to-Python-Packages)
        2. [II.2.2 Loading Packages](#II.2.2-Loading-Packages)
        3. [II.2.3 Accessing a Function in a Package](#II.2.3-Accessing-a-Function-in-a-Package)
    3. [II.3 Importing Your Data Into Pandas](#II.3-Importing-Your-Data-Into-Pandas)
        1. [II.3.1 Set Data Path](#II.3.1-Set-Data-Path)
        2. [II.3.2 Importing Data](#II.3.2-Importing-Data)
    4. [II.4 Checking Your Data](#II.4-Checking-Your-Data)
    5. [II.5 Manipulating Your Data](#II.5-Manipulating-Your-Data)
        1. [II.5.1 Creating Variables](#II.5.1-Creating-Variables) 
        2. [Exercise II.2](#Exercise-II.2)
        3. [Exercise II.3](#Exercise-II.3)
        4. [Exercise II.4](#Exercise-II.4)
        5. [Exercise II.5](#Exercise-II.5)
        6. [II.5.2 Merge or Join DataFrames](#II.5.2-Merge-or-Join-DataFrames)
        7. [Exercise II.6](#Exercise-II.6)
    6. [II.6 Summary Statistics for Your Data](#II.6-Summary-Statistics-for-Your-Data)            
        1. [Exercise II.7](#Exercise-II.7)
    7. [II.7 What is Next?](#II.7-What-is-Next?)
4. [**_Lesson III Data Visualization for Insight_**](#Lesson-III-Data-Visualization-for-Insight)
    1. [III.1 Look at the Distribution of Your Data](#III.1-Look-at-the-Distribution-of-Your-Data)
        1. [III.1.1 Histograms](#III.1.1-Histograms)
            1. [Exercise III.1](#Exercise-III.1)
        2. [III.1.2 Boxplots](#III.1.2-Boxplots)
            1. [Exercise III.2](#Exercise-III.2) 
    2. [III.2 Look for Relationships in Your Data](#III.2-Look-for-Relationships-in-Your-Data)
        1. [III.2.1 Transformation for Better Interpretation](#III.2.1-Transformation-for-Better-Interpretation)
        2. [III.2.2 Enhancing the Scatter Plot](#III.2.2-Enhancing-the-Scatter-Plot)
        3. [III.2.3 Working with *Large-N* Data](#III.2.3-Working-with-Large-N-Data)    
            1. [III.2.3.1 Random Sampling](#III.2.3.1-Random-Sampling)
            2. [Exercises III.3](#Exercise-III.3)
            3. [III.2.3.2 Contour Plot](#III.2.3.2-Contour-Plot)
            4. [III.2.3.3 Hex Bin Plot](#III.2.3.3-Hex-Bin-Plot)
            5. [III.2.3.4 Lowess Curve](#III.2.3.4-Lowess-Curve)
            6. [Exercises III.4](#Exercise-III.4)
    3. [III.3 Look for Trends in Your Data](#III.3-Look-for-Trends-in-Your-Data)         
    4. [III.4 Look for Patterns in Your Data](#III.4-Look-for-Patterns-in-Your-Data)
    5. [III.5 Look for Anomalies in Your Data](#III.5-Look-for-Anomalies-in-Your-Data) 
    6. [III.6 What is Next?](#III.6-What-is-Next?)        
5. [**_Lesson IV Predictive Modeling: Introduction to Machine Learning_**](#Lesson-IV-Predictive-Modeling:-Introduction-to-Machine-Learning)
    1. [IV.1 Comparing and Contrasting Prediction and Forecasting](#IV.1-Comparing-and-Contrasting-Prediction-and-Forecasting)
    2. [IV.2 Steps for Predictive Modeling](#IV.2-Steps-for-Predictive-Modeling)
        1. [IV.2.1 Steps for Predictive Modeling: Train/Test Split Data](#IV.2.1-Steps-for-Predictive-Modeling:-Train/Test-Split-Data)
            1. [Exercise IV.1](#Exercise-IV.1)
            2. [Exercise IV.2](#Exercise-IV.2)
        2. [IV.2.2 Steps for Predictive Modeling: Train a Model](#IV.2.2-Steps-for-Predictive-Modeling:-Train-a-Model)
           1. [Case I Continuous Dependent Variable: OLS Regression](#Case-I-Continuous-Dependent-Variable:-OLS-Regression)
              1. [Exercise IV.3](#Exercise-IV.3)           
              2. [Case I Analyze the Results](#Case-I-Analyze-the-Results)
              3. [Exercise IV.4](#Exercise-IV.4)
              4. [Case I Predict with the Model](#Case-I-Predict-with-the-Model)
              5. [Exercise IV.5](#Exercise-IV.5)
           2. [Case II Binary Dependent Variable: Logistic Regression](#Case-II-Binary-Dependent-Variable:-Logistic-Regression)
              1. [Case II Create Your Data](#Case-II-Create-Your-Data)
              2. [Case II Train a Model](#Case-II-Train-a-Model)
              3. [Case II Predict with the Model](#Case-II-Predict-with-the-Model)
           3. [Case III Constants: Decision Trees](#Case-III-Constants:-Decision-Trees)
              1. [Case III Check Model Accuracy](#Case-III-Check-Model-Accuracy)
              2. [Case III Display the Tree](#Case-III-Display-the-Tree)
              3. [Exercise IV.6](#Exercise-IV.6)
6. [**_Lesson V Summary and Wrap-up_**](#Lesson-V-Summary-and-Wrap\-up)
7. [**_Contact Information_**](#Contact-Information)
8. [**_Appendix_**](#Appendix)
    1. [Appendix I.1 Jupyter Notebooks: Overview](#Appendix-I.1-Jupyter-Notebooks:-Overview)
    2. [Appendix I.2 Different Ways to Import Data Into Pandas](#Appendix-I.2-Different-Ways-to-Import-Data-Into-Pandas)
    3. [Appendix I.3 Some Additional Information on Checking Your Data](#Appendix-I.3-Some-Additional-Information-on-Checking-Your-Data)
        1. [Task \#1 Display the First Few Records of Your DataFrame](#Task-\#1-Display-the-First-Few-Records-of-Your-DataFrame)
        2. [Task \#2 Check the Shape of Your DataFrame](#Task-\#2-Check-the-Shape-of-Your-DataFrame)
        3. [Task \#3 Check the Column Names in Your DataFrame](#Task-\#3-Check-the-Column-Names-in-Your-DataFrame)
        4. [Task \#4 Check for Missing Data in Your DataFrame](#Task-\#4-Check-for-Missing-Data-in-Your-DataFrame)
    4. [Appendix I.4 Miscellaneous Pandas DataFrame Column Manipulations](#Appendix-I.4-Miscellaneous-Pandas-DataFrame-Column-Manipulations)
        1. [Deleting Columns](#Deleting-Columns)
    5. [Appendix I.5 Correlation Analysis](#Appendix-I.5-Correlation-Analysis)
    6. [Appendix II.1 Data Visualization](#Appendix-II.1-Data-Visualization)
        1. [Additional Histogram Methods](#Additional-Histogram-Methods)
        2. [Additional Boxplot Methods](#Additional-Boxplot-Methods)
        3. [Additional Scatter Plot Methods](#Additional-Scatter-Plot-Methods)
           1. [Categorical Variable](#Categorical-Variable)
           2. [Panel Plots](#Panel-Plot)
           3. [Combining Scatter Plots and Histograms](#Combining-Scatter-Plots-and-Histograms)
           4. [Pairwise Scatter Plots](#Pairwise-Scatter-Plots)
           5. [Contour Plots with Density Functions](#Contour-Plots-with-Density-Functions)
        4. [Additional Time Series Plot Methods](#Additional-Time-Series-Plot-Methods)
    7. [Appendix III.1 Extra Material for Predictive Modeling](#Appendix-III.1-Extra-Material-for-Predictive-Modeling)
        1. [Check OLS Model for Multicollinearity](#Check-OLS-Model-for-Multicollinearity)
        2. [Case I Model Portfolio](#Case-I-Model-Portfolio)
        3. [Case II Model Portfolio](#Case-II-Model-Portfolio)
    8. [Appendix Complete Data Dictionary](#Appendix-Complete-Data-Dictionary)
9. [**_Exercise Solutions_**](#Exercise-Solutions)

## Helpful Background

[Back to Contents](#Contents)

### About this Notebook

[Back to Contents](#Contents)

This notebook accompanies the PDF presentation

> ***Business Analytics: Using Python for Better Business Decisions***

by Walter R. Paczkowski, Ph.D. (2020).  There is more content and commentary in this notebook than in the presentation deck.  Nonetheless, the two complement each other and so should be studied together.  Every effort has been made to use the same key slide titles in the presentation deck and this notebook to make studying easier.  For your convenience, most of the presentation deck slides have been incorporated into this notebook.

### Helpful Online Tutorials

[Back to Contents](#Contents)

* <a href="https://docs.python.org/3/tutorial/" target="_parent">Python Tutorial</a>

* <a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html" target="_parent">Pandas Tutorial</a>

* <a href="https://seaborn.pydata.org/tutorial.html" target="_parent">Seaborn Tutorial</a>

* <a href="https://www.statsmodels.org/stable/index.html" target="_parent">Statsmodels Tutorial</a>


### Helpful Must-Read Book

[Back to Contents](#Contents)

* <a href="https://www.amazon.com/gp/product/1491957662/ref=as_li_tl?ie=UTF8&tag=quantpytho-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=1491957662&linkId=8c3bf87b221dbcd8f541f0db20d4da83" target="_parent">Main Pandas go-to book: </a> *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython* (2nd Edition) by Wes McKinney.


## Lesson I Introduction to Business Anaytics

[Back to Contents](#Contents)


In [None]:
slide( '003' )

In [None]:
slide( '005' )

In [None]:
slide( '007' )

In [None]:
slide( '009' )

In [None]:
slide( '010' )

In [None]:
slide( '011' )

In [None]:
slide( '012' )

## Lesson II Simple Analytics: Understanding and Preparing Your Data

[Back to Contents](#Contents)


In [None]:
slide( '014' )

In [None]:
slide( '016' )

In [None]:
slide( '017' )

### II.1 Documenting Your Data and Workflow: Best Practices

[Back to Contents](#Contents)


In [None]:
slide( '019' )

#### II.1.1 Documenting Your Data

[Back to Contents](#Contents)

The first task in any data analysis is data documentation in a *Data Dictionary*.

A data dictionary contains *metadata* which are data about the data.  Metadata can be anything that helps you understand the data you're using.  Based on [Wikipedia](http://en.wikipedia.org/wiki/Metadata), metadata are information about the distinct data items, such as:

> 1. means of creation;
> 2. purpose of the data;
> 3. time and date of creation;
> 4. creator/author/keeper of the data;
> 5. placement on a network (electronic form);
> 6. where the data were created;
> 7. what standards were used to create the data; and 
> 8. etc.

I'll restrict the metadata to:

> 1. Variable name;
> 2. Possible values or value ranges;
> 3. Source; and 
> 4. Mnemonic.

The mnemonic is the label used in data files and statistical and modeling output.  

| Variable                  | Values                                 | Source       | Mnemonic     |
|---------------------------|----------------------------------------|--------------|--------------|
| Order Number              | Nominal Integer                        | Order Sys    | Onum         |
| Customer ID               | Nominal                                | Customer Sys | CID          | 
| Transaction Date          | MM/DD/YYYY                             | Order Sys    | Tdate        | 
| Product Line ID           | Five rooms of house                    | Product Sys  | Pline        |
| Product Class ID          | Item in line                           | Product Sys  | Pclass       |
| Units Sold                | Number of units per order              | Order Sys    | Usales       |
| Product Returned?         | Yes/No                                 | Order Sys    | Return       |
| Amount Returned           | Number of units                        | Order Sys    | returnAmount |
| Material Cost/Unit        | \$US cost of material                  | Product Sys  | Mcost        |
| List Price                | \$US list                              | Price Sys    | Lprice       |
| Dealer Discount           | \% discount to dealer (decimal)        | Sales Sys    | Ddisc        |
| Competitive Discount      | \% discount for competition (decimal)  | Sales Sys    | Cdisc        |
| Order Size Discount       | \% discount for size (decimal)         | Sales Sys    | Odisc        |
| Customer Pickup Allowance | \% discount for pickup (decimal)       | Sales Sys    | Pdisc        |

A complete data dictionary is [here](#Appendix-Complete-Data-Dictionary).

#### Exercise II.1

[Back to Contents](#Contents)

We will soon import customer-specific data that have four columns: 

> 1. CID: the customer ID;
> 2. State: the 50 US states plus Washington, DC;
> 3. ZIP: the 5-digit US ZIP (postal) code; and
> 4. Region: the marketing region which corresponds to the four US Census Regions (Midwest, Northeast, South, and West).

Create a Data Dictionary assuming these data come from the marketing department.  Use the labels *CID, State, ZIP*, and *Region* as the mnemonics.  Enter the Data Dictionary in a Markdown cell.

[See Solution](#Solution-II.1)

Enter the Data Dictionary here:

| Variable                     | Values                              | Source       | Mnemonic          |
|------------------------------|-------------------------------------|--------------|-------------------|
|                              |                                     |              |                   |

#### II.1.2 Documenting Your Workflow

[Back to Contents](#Contents)

Documenting your workflow is as important as documenting your data.  This documentation will enable you to reproduce your work and make it easier for a colleague to follow what you did.  The *Jupyter notebook* paradigm is the best platform for this documentation.  You will see many examples of this throughout this course. See [here](#Appendix-I.1-Jupyter-Notebooks:-Overview) for an overview of Jupyter notebooks.

#### II.1.3 Documenting Your Code

[Back to Contents](#Contents)

Always add comments to your code so that you can later recall why you did something.  Comments are added using a hash or pound sign (*#*).  I usually use two pound signs.  Anything following a pound sign is treated as a comment and is thus ignored.  You can add a comment anywhere on a line.  

Adding comments can be viewed as writing a piece of prose.  This is referred to as *literate programming.*  See [here](https://en.wikipedia.org/wiki/Literate_programming) for a discussion on literate programming.
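
As a small illustration of these commenting conventions, here is a minimal sketch with made-up values.  The names *list_price*, *discount*, and *pocket_price* are hypothetical and are not part of the course data.

```python
##
## Calculate a discounted price.  The double pound sign marks a
## block comment, following the convention used in this notebook.
##
list_price = 100.0   # a comment can also follow code on the same line
discount = 0.15      # discount rate expressed as a decimal
##
## Everything after a pound sign is ignored by Python
##
pocket_price = list_price * ( 1 - discount )
print( 'Pocket price: {}'.format( pocket_price ) )
```

The block comments explain why a step is done; the end-of-line comments explain what a single value means.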

### II.2 Importing Python Packages

[Back to Contents](#Contents)

Python is a powerful programming language that allows you to perform all standard programming operations in a clear and consistent manner.  One strength, embraced by Python programmers, is a coding style that emphasizes readable code; indentation is the primary way to accomplish this.  Another strength is its very wide array of *packages* (also called *modules* or *libraries*).  Packages perform analysis or data manipulation operations.  Each package provides a specialized set of tools, so a package can be viewed as a container of functions.  Sometimes a package contains smaller, more specialized subpackages, so a large package can be a container for smaller ones.  You will see how to access and use packages and subpackages in this and other lessons.

Pandas is a data manipulation and graphing package with a lot of capabilities.  It will be used extensively in these lessons.  Seaborn is a scientific graphing package that is intuitive to use.  Although Pandas has visualization methods, Seaborn is preferred because of its quality, extent, and easier syntax.  Both packages use Matplotlib for base graphing functions.  Statsmodels has an array of statistical modeling functions, only a few of which will be used in these lessons.  Numpy and Matplotlib are base packages for Pandas, Seaborn, and Statsmodels.  Except for a few functions, Numpy and Matplotlib will not be used directly.

#### II.2.1 Introduction to Python Packages

[Back to Contents](#Contents)


In [None]:
slide( '022' )

#### II.2.2 Loading Packages

[Back to Contents](#Contents)

You have to load a package before you can use it.  Loading is done using an *import* command.  A short alias for the package (for example, *pd* for Pandas) is assigned with the *as* keyword when you import the package.  I recommend loading all the basic packages at once at the beginning of your notebook so you do not have to search for them.

In [None]:
##
## ===> Data Management <===
##
import numpy as np
import pandas as pd
##
## ===> Visualization <===
##
import seaborn as sns
import matplotlib.pyplot as plt
##
## Set the seaborn grid style.  The dot between the seaborn alias,
## "sns", and the set() function connects or "chains" the alias and the method.
##
sns.set()
##
## Set an option for the number of Pandas columns to display.  Eight in this case.
## 
pd.set_option( 'display.max_columns', 8 )
##
## ===> Modeling <===
##
## Import the train_test_split function from sklearn
##
from sklearn.model_selection import train_test_split
##
## For modeling, notice the new import command for
## the formula API and the summary option
##
import statsmodels.api as sm
import statsmodels.formula.api as smf 
##
## Import the r2_score function from the sklearn metrics package
##
from sklearn.metrics import r2_score
##
## Import confusion functions for classification
##
from sklearn.metrics import confusion_matrix, classification_report
##
## Import decision tree classifier functions
##
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
from sklearn.tree import export_graphviz
##
from sklearn.preprocessing import LabelEncoder
##
## Some additional packages are needed to plot a decision tree:
## - graphviz
## - pydotplus
## Both packages may have to be installed before they can be used.  
## Use the operating system to do this.
##
import os
import sys
##!{sys.executable} -m pip install graphviz
##!{sys.executable} -m pip install pydotplus
##
## Tell Python where the graphviz executables are located; then load
## the package.  Adjust this Windows path to match your installation.
##
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
##
## Load the following packages.  StringIO comes from the standard
## library; sklearn.externals.six was removed from recent
## scikit-learn releases.
##
from io import StringIO
from IPython.display import Image
import graphviz
import pydotplus


**_Code Explanation_**

This code block loads the necessary Python packages for this course.  I recommend setting options, such as those for graphs and print, as was done here.  

#### II.2.3 Accessing a Function in a Package

[Back to Contents](#Contents)

A function in a package can be accessed by telling Python the package where the function is located and, of course, the name of the function.  These two operations are done with one statement by *chaining* the package name and the function name.  The chain is formed by connecting the package name and the function name by inserting a dot between the two.  Usually, the package alias is used for improved readability.  An example of a chained command is:

> *pd.read_csv( 'lesson1.csv' )*

where "*pd*" is the alias for Pandas and "*read_csv*" is a Pandas function that reads a *CSV* file ("lesson1.csv" in this example).  Notice the dot (".") between the alias and the function name.  The dot is the chaining operator.
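
The dot notation can be tried without a data file.  The following is a minimal sketch that reads a made-up, in-memory *CSV*; the names *csv_text* and *df_demo* are hypothetical.  It also shows the same notation reaching a function inside a subpackage.

```python
import io
import pandas as pd
##
## A small in-memory CSV stands in for a file on disk
##
csv_text = 'CID,Usales\n1,10\n2,20\n'
##
## Package alias (pd), then a dot, then the function name (read_csv)
##
df_demo = pd.read_csv( io.StringIO( csv_text ) )
##
## The same dot notation reaches a function inside a subpackage:
## pd.api.types is a subpackage and is_numeric_dtype is a function in it
##
check = pd.api.types.is_numeric_dtype( df_demo[ 'Usales' ] )
print( df_demo )
print( 'Usales is numeric: {}'.format( check ) )
```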

### II.3 Importing Your Data Into Pandas

[Back to Contents](#Contents)

Before you can begin any work, you must first import and examine the structure of your data.  This structure is a rectangular array or matrix or, in Pandas terminology, a *DataFrame*.  When you import your data using Pandas, the imported data immediately goes into a DataFrame.  This is very convenient because Seaborn and StatsModels functions recognize these DataFrames.

Pandas provides a set of very flexible import functions.  Which one you should use depends on your data format.  Some typical formats and relevant functions are:

| Data Format | Pandas Import Function |
|-------------|------------------------|
| CSV         | read_csv               |
| Excel       | read_excel             |
| Clipboard   | read_clipboard         |
| JSON        | read_json              |
| SAS         | read_sas               |
| HDF5        | read_hdf               |

The *HDF5* format is especially important for many Business Analytics applications where the datasets are large and complex.  *HDF* stands for *Hierarchical Data Format*.  "*HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.*"  I will not use this format in this course.  See [here](https://portal.hdfgroup.org/display/support) for information on *HDF5* and [here](https://www.kaggle.com/diegovicente/a-short-introduction-to-hdf5-files) for a short introduction to Pandas and *HDF5*.

I will first import a *CSV* formatted file.  The package alias must be "chained" with the *read_csv* import function, otherwise Python will not know where to find the read function. 

When you import data, Pandas must be able to find the file.  If the data file is in the same directory as the notebook, then a path is unnecessary since a bare file name is looked up relative to the current working directory, which is normally the notebook's directory.  Otherwise, you have to specify the path.


#### II.3.1 Set Data Path

[Back to Contents](#Contents)

It is best practice to define paths in one location.  This makes error finding and changes easier.  Define the data path as shown here; keep the format *r'path'* if necessary.  Remember, if your data are in the same directory as your notebook, then a path is not necessary.

In [None]:
##
## Set data path
##
path = r'../Data/'

**_Code Explanation_**

The path is specified using the string literal: *r'../Data/'*.  The *r* at the beginning of the string tells the Python interpreter to treat this string as raw or literal text that is not to be changed.  Without the *r*, the interpreter could treat any backslashes as escape characters which would change the meaning of the string.  Even though forward slashes are used in this code block, it is good practice to avoid issues and use the *r*.
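
The effect of the *r* prefix is easiest to see with a backslash.  This is a minimal sketch using a made-up Windows-style path (*C:\new_data* is hypothetical and is not the course data path):

```python
##
## Without the r prefix, \n is interpreted as a newline character
##
plain = 'C:\new_data'
##
## With the r prefix, the backslash and the n are kept as typed
##
raw = r'C:\new_data'
print( len( plain ), len( raw ) )
print( raw )
##
## With forward slashes only, the prefix makes no difference
##
same = ( r'../Data/' == '../Data/' )
print( same )
```

The plain string is one character shorter because the backslash and the *n* collapse into a single newline character.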

#### II.3.2 Importing Data

[Back to Contents](#Contents)

An example of data import is shown below.  Several more are shown in the Appendix [here](#Appendix-I.2-Different-Ways-to-Import-Data-Into-Pandas).

In [None]:
##
## Specify a path to the CSV data and import or read the data.
##
file = 'orders.csv'
##
## Import the data.  The parse_dates argument says to 
## treat Tdate as a date object.
##
df_orders = pd.read_csv( path + file, parse_dates = [ 'Tdate' ] )

**_Code Explanation_**

The *CSV* file is specified as a string and then the path and file strings are concatenated using the "+" symbol.  The *parse_dates* argument says to treat the variable *Tdate* as a date variable.  This variable is the date of a transaction.  Without this argument, Pandas would import *Tdate* as ordinary text; with it, the column is stored as a datetime type which supports date arithmetic and date-based selection.
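
To see what *parse_dates* changes, here is a minimal sketch that imports a made-up, two-row stand-in for the orders file both ways.  The data and the names *csv_text*, *df_text*, and *df_date* are hypothetical.

```python
import io
import pandas as pd
##
## A made-up, two-row stand-in for the orders file
##
csv_text = 'Onum,Tdate\n1,01/15/2020\n2,02/20/2020\n'
##
## Import once without and once with parse_dates
##
df_text = pd.read_csv( io.StringIO( csv_text ) )
df_date = pd.read_csv( io.StringIO( csv_text ), parse_dates = [ 'Tdate' ] )
##
## Compare the data types of the two versions of Tdate
##
print( df_text.Tdate.dtype )
print( df_date.Tdate.dtype )
##
## Date parts, such as the month, are available for the parsed version
##
months = df_date.Tdate.dt.month.tolist()
print( months )
```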

For future reference, count the number of unique *CID*s.

In [None]:
##
## How many unique CIDs are available?
##
data = len( df_orders.CID.unique() )
print( 'Number of unique CIDs: {}'.format( data ) )

**_Code Explanation_**

The *unique()* method extracts the unique *CID*s from the orders DataFrame.  The *len* function then counts the number of unique *CID*s.  
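
Pandas also has a *nunique* method that counts unique values directly.  The sketch below, using a made-up Series named *cid*, shows that the two approaches differ when a value is missing: *unique* keeps the missing value as one of the unique entries while *nunique* skips it by default.

```python
import numpy as np
import pandas as pd
##
## A made-up customer ID column with one repeat and one missing value
##
cid = pd.Series( [ 101, 102, 101, np.nan, 103 ] )
##
## Count using len and unique, as in the cell above
##
n_len = len( cid.unique() )
##
## Count using nunique, which drops missing values by default
##
n_unique = cid.nunique()
print( 'len( unique() ): {}\nnunique(): {}'.format( n_len, n_unique ) )
```

If a column has no missing values, as is expected for a customer ID, both approaches give the same count.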

**_Interpretation_**

There are 779 unique *CID*s in this data set.  You will see this number quite often in succeeding lessons.

### II.4 Checking Your Data

[Back to Contents](#Contents)


In [None]:
slide( '029' )

In [None]:
##
## Task 1: Display first few records.
##
df_orders.head().style.set_caption( 'Orders Data' )

**_Code Explanation_**

The "head" method is attached to a DataFrame when you create it.  The default is to display the first five records.  You could use *n = 10* as an argument to display the first 10 records: *df.head( n = 10 )*.  You could display the last five records using the *tail* method.  The default is also five which could be changed as for the *head* method.  For both *head* and *tail*, the number of columns displayed is set using *pd.set_option( 'display.max_columns', 8 )* as was done in the package loading section above.  The *head* and *tail* methods are chained to the DataFrame name.
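
The *head* and *tail* options described here can be tried on a small made-up DataFrame (*df_demo* is hypothetical):

```python
import pandas as pd
##
## A made-up DataFrame with 20 records
##
df_demo = pd.DataFrame( { 'x': range( 20 ) } )
##
## Default: first five records
##
first_five = df_demo.head()
##
## First ten records and last three records
##
first_ten = df_demo.head( n = 10 )
last_three = df_demo.tail( n = 3 )
print( last_three )
```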

In [None]:
##
## Task 2: Check the shape of the data
##
print( 'Number of rows: {rows}\nNumber of columns: {cols}'.format( 
        rows = df_orders.shape[ 0 ], cols = df_orders.shape[ 1 ] ) )

**_Code Explanation_**

*shape* is an attribute of the DataFrame so it does not require parentheses; it does not have any arguments.  Functions and methods have arguments (which may be defaults) so parentheses are required.  The *shape* attribute is chained to the DataFrame name.

**_Interpretation_**

The *shape* attribute returns the number of rows and the number of columns in that order.  The *orders* DataFrame has 70,270 *rows* or *records* or *observations* and 14 *columns* or *variables* or *features*.
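
The difference between an attribute and a method can be sketched with a small made-up DataFrame (*df_demo* is hypothetical).  Because *shape* is a pair of numbers, it can also be unpacked into two names in one statement.

```python
import pandas as pd
##
## A made-up DataFrame with three rows and two columns
##
df_demo = pd.DataFrame( { 'x': [ 1, 2, 3 ], 'y': [ 4, 5, 6 ] } )
##
## shape is an attribute: no parentheses.  Unpack rows and columns.
##
n_rows, n_cols = df_demo.shape
##
## head is a method: parentheses are required even with the defaults
##
top = df_demo.head()
print( 'Number of rows: {}\nNumber of columns: {}'.format( n_rows, n_cols ) )
```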

In [None]:
##
## Task 3: Check the column names
##
print( 'The column labels in the DataFrame:\n{}'.format( df_orders.columns ) )

**_Code Explanation_**

*columns* is an attribute of the DataFrame.  As an attribute, it does not require parentheses since an attribute is not callable so it has no arguments.

**_Interpretation_**

When checking the column names, be sure there are no white spaces before and after the name.  White spaces can (and will) cause problems because your tendency will be to write a column name without the leading and trailing white spaces; Python will then not recognize the name.  If you see leading and trailing white spaces, you can remove them using the following:

> *df_orders.columns = df_orders.columns.str.strip()*

where *str* is the Pandas string accessor, which applies a string method to every column name at once.  It is part of Pandas, so nothing extra has to be imported.  You may also want to convert the column names to all lower case:

> *df_orders.columns = df_orders.columns.str.lower()*

You could do both at once using:

> *df_orders.columns = df_orders.columns.str.strip().str.lower()*

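Notice the use of *str* twice in this last expression: each string method returns a new set of column labels, so the accessor has to be invoked again before the next string method.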
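
Here is a minimal sketch of this clean-up on a made-up DataFrame whose column names deliberately contain stray white spaces (*df_demo* is hypothetical):

```python
import pandas as pd
##
## Made-up column names with leading and trailing white spaces
##
df_demo = pd.DataFrame( { ' CID ': [ 1, 2 ], 'Tdate  ': [ 'a', 'b' ] } )
print( df_demo.columns.tolist() )
##
## Strip the white spaces and convert to lower case in one statement
##
df_demo.columns = df_demo.columns.str.strip().str.lower()
print( df_demo.columns.tolist() )
```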

In [None]:
##
## Task 4: Check for missing data
##
df_orders.info()

**_Code Explanation_**

*info* is a method chained to the DataFrame.  It displays the number of non-missing records for each column plus the data types: object (i.e., text string), floating point numbers, integers, and datetime.

**_Interpretation_**

There are 14 columns: 10 have 70,270 non-missing values while the last four have fewer than 70,270, so they have missing values.  For example, *Ddisc* (the dealer discount based on the Data Dictionary) has 70,262 non-missing records so 8 are missing. 

See the Appendix [here](#Appendix-I.3-Some-Additional-Information-on-Checking-Your-Data) for more information about checking your data.
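
A direct count of the missing values per column is sometimes easier to read than the *info* display.  The following is a minimal sketch on made-up data (*df_demo* is hypothetical) that chains the *isna* and *sum* methods:

```python
import numpy as np
import pandas as pd
##
## A made-up DataFrame with one missing dealer discount
##
df_demo = pd.DataFrame( { 'Lprice': [ 10.0, 12.0, 11.0 ],
                          'Ddisc': [ 0.10, np.nan, 0.05 ] } )
##
## isna flags each missing value as True; sum counts the Trues by column
##
n_missing = df_demo.isna().sum()
print( n_missing )
```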

This dataset is moderately large.  If memory becomes an issue with very large datasets, then an argument for *read_csv* can be used to read in chunks of data.  The argument is *chunksize = XXX* where *XXX* is the number of records to read in each chunk.  This is an advanced topic.  See [here](https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize) for some discussion.
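
As a rough sketch of how chunked reading works, the example below processes a made-up, five-record *CSV* two records at a time.  The data and the names *csv_text*, *total*, and *n_chunks* are hypothetical; with a real file you would pass the file path instead of the in-memory text.

```python
import io
import pandas as pd
##
## A made-up, five-record CSV held in memory
##
csv_text = 'CID,Usales\n1,10\n2,20\n3,30\n4,40\n5,50\n'
##
## With chunksize, read_csv returns an iterator of small DataFrames
## instead of one large DataFrame
##
total = 0
n_chunks = 0
for chunk in pd.read_csv( io.StringIO( csv_text ), chunksize = 2 ):
    total += chunk[ 'Usales' ].sum()
    n_chunks += 1
print( 'Total unit sales: {}\nNumber of chunks: {}'.format( total, n_chunks ) )
```

Only one chunk is held in memory at a time, so a running total like this one can be calculated for a file that is too large to load at once.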

### II.5 Manipulating Your Data

[Back to Contents](#Contents)


In [None]:
slide( '032' )

#### II.5.1 Creating Variables

[Back to Contents](#Contents)


In [None]:
slide( '034' )

In [None]:
##
## Calculate total discount.
##
## Discounts are sometimes called "leakages" so the total is 
## the total leakage.
##
## Note: use "axis = 1" in the sum() function to sum across columns.
## The sum() method skips missing values by default, so the
## summation works even with missing values.
##
lst = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
df_orders[ 'Tdisc' ] = df_orders[ lst ].sum( axis = 1 )
##
## Display only the discounts
##    Create a list of what to print. 
##
lst.append( 'Tdisc' )
df_orders[ lst ].head().style.set_caption( 'Orders Data for Discounts' )

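**_Code Explanation_**

The *sum* method has an axis argument that specifies the axis the function is to be applied on.  *axis = 0* specifies summing along the rows for each column (i.e., sum down a column) and *axis = 1* specifies summing along the columns in each row (i.e., sum across a row).  By default, *sum* skips missing values, so a total is returned even when some of the discounts in a row are missing.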

**_Interpretation_**

Notice the *NaN* values.  *NaN* stands for *Not a Number*.  These are missing values.
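
The sketch below uses a made-up, two-row DataFrame (*df_demo* is hypothetical) to show the two axis settings and to contrast the *sum* method with the "+" operator when a value is missing:

```python
import numpy as np
import pandas as pd
##
## Made-up discounts; the second dealer discount is missing
##
df_demo = pd.DataFrame( { 'Ddisc': [ 0.10, np.nan ],
                          'Odisc': [ 0.05, 0.02 ] } )
##
## axis = 0 sums down each column; axis = 1 sums across each row
##
by_column = df_demo.sum( axis = 0 )
by_row = df_demo.sum( axis = 1 )
##
## The + operator does not skip missing values: the result is NaN
## for any row with a missing discount
##
by_plus = df_demo[ 'Ddisc' ] + df_demo[ 'Odisc' ]
print( by_row )
print( by_plus )
```

This is why the total discount is calculated with the *sum* method rather than by adding the four discount columns with "+".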

#### Exercise II.2

[Back to Contents](#Contents)

The *Pocket Price* is the list price less total discounts or total leakages.  It is the amount the business "pockets" and is the amount the customer actually pays.  The pocket price formula is

> $Pprice = Lprice \times (1  - Tdisc)$.

The *Tdisc* variable was created above.  Calculate the pocket price for the *df_orders* DataFrame and display the first five records for the list price and pocket price.

[See Solution](#Solution-II.2)

In [None]:
##
## Enter code here
##


#### Exercise II.3

[Back to Contents](#Contents)

Calculate total revenue as 

> $Rev = Usales \times Pprice$ 

using the *df_orders* DataFrame.

[See Solution](#Solution-II.3)

In [None]:
##
## Enter code here
##


#### Exercise II.4

[Back to Contents](#Contents)

*Contribution* and *contribution margin* are two values financial analysts often examine.  Contribution is comparable to what economists call *profit* but is more restricted in that it just refers to a product without considering any fixed or overhead costs.  Contribution is

> $Con = Revenue - Material~Cost$ 

and contribution margin is 

> $CM = \dfrac{Con}{Revenue}$.

Calculate both quantities using the *df_orders* DataFrame.

[See Solution](#Solution-II.4)

In [None]:
##
## Enter code here
##


#### Exercise II.5

[Back to Contents](#Contents)

Some products are returned so another revenue number, *revenue net of returns*, is more meaningful and revealing for business decisions.  Net revenue is

> $Net~Revenue = (Unit~Sales - Returns) \times Pocket~Price$.

Calculate net revenue and call it *netRev*.  Also calculate the loss in revenue due to the returns.  The calculation is

> $lostRev = Rev - netRev$. 

Use the *df_orders* DataFrame.  

[See Solution](#Solution-II.5)

In [None]:
##
## Enter code here
##


#### II.5.2 Merge or Join DataFrames 

[Back to Contents](#Contents)

It is not unusual to have data in two (or more) tables so you will need to *merge* or *join* them to get all the data you need for an analysis.  For our problem, a second data table has information on each customer and this second table must be merged with the orders table.  The merge is done on the customer *ID* (*CID*).  There are many types of joins but we will only use an *inner join* in the examples.  *Inner join* is the default.

In [None]:
slide( '037' )

In [None]:
##
## Import a second DataFrame on the customers
##
file = 'customers.csv'
df_cust = pd.read_csv( path + file )
df_cust.head().style.set_caption( 'Customer DataFrame' )

In [None]:
##
## Do an inner join using CID as the link
##
df_orders_cust = pd.merge( df_orders, df_cust, on = 'CID' )
##
df_orders_cust.head().style.set_caption( 'Orders-Customers DataFrame' )

**_Code Explanation_**

The merge function takes two arguments: the *left* table and the *right* table to merge or join.  The tables are in that order.  A third argument specifies what to merge on.  There are several options for the *on* variable.  In this example, the *on* variable is just the common key in each table: *CID*.

An alternative form for the merge statement is:

> *df_orders_cust = df_orders.merge( df_cust, on = 'CID' )*

An *inner join* is the default.  

See the Pandas documentation <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html" target="_parent">here</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html" target="_parent">here</a> for extensive discussion with examples about this topic.
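
The join behavior described above can be checked on a small example.  The following is a minimal sketch on made-up toy data: the tables *orders_toy* and *cust_toy* and their values are hypothetical, not the course data.  It compares the default inner join with a left join and uses the *validate* and *indicator* arguments of *pd.merge* to flag rows that have no match.

```python
import pandas as pd

# Hypothetical toy tables; the values are made up for illustration only.
orders_toy = pd.DataFrame( { 'CID': [ 1, 1, 2, 4 ], 'Usales': [ 10, 20, 30, 40 ] } )
cust_toy = pd.DataFrame( { 'CID': [ 1, 2, 3 ], 'Region': [ 'South', 'West', 'Midwest' ] } )

# Inner join (the default): keeps only CIDs found in both tables.
inner = pd.merge( orders_toy, cust_toy, on = 'CID', how = 'inner' )

# Left join: keeps every order.  The indicator column shows where each row
# came from; validate = 'm:1' raises an error if a CID is duplicated in
# the customer table.
left = pd.merge( orders_toy, cust_toy, on = 'CID', how = 'left',
                 validate = 'm:1', indicator = True )

print( inner.shape )
print( left[ [ 'CID', '_merge' ] ] )
```

In this toy case the order for *CID* 4 has no customer record, so the inner join drops it while the left join keeps it and marks it *left_only*.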

In [None]:
## 
## Check the shape of the new DataFrame against 
## that of the orders and customers DataFrames
##
print( 'Shape of the orders DataFrame: {}\n'.format( df_orders.shape ) )
print( 'Shape of the customers DataFrame: {}\n'.format( df_cust.shape ) )
print( 'Shape of the merged DataFrame: {}\n'.format( df_orders_cust.shape ) )
##
## Find the number of unique CIDs
##
data = len( df_orders_cust.CID.unique() )
print( 'Number of unique CIDs in merged DataFrame: {}'.format( data ) )

**_Interpretation_**

*df_orders* has 21 columns, *df_cust* has 4 columns, while the merged *df_orders_cust* has 24.  The difference of one column is the *CID* which is in both and is the linking variable; it's only included once.  Notice that the number of unique *CID*s is 779 as before.

##### Exercise II.6

[Back to Contents](#Contents)

There is a third data set: a marketing data set that contains information for each customer about their loyalty program membership, a buyer rating provided by the sales force, and their customer satisfaction rating based on an annual customer satisfaction survey.  The marketing data are in a *csv* file named *marketing.csv*.  You have to:

> 1. import the marketing data into a DataFrame (name it *df_marketing*) and
> 2. merge the *df_orders_cust* DataFrame and this marketing DataFrame.

The *CID* is the same in both data sets so it is the linking variable.  Name the final merged DataFrame *df*.

[See Solution](#Solution-II.6)

In [None]:
##
## Enter code here to import the marketing data
## Name this imported data df_marketing
##


In [None]:
##
## Enter code here to merge the df_orders_cust and df_marketing DataFrames
## Name the new merged DataFrame df
##


In [None]:
##
## Enter code here to check the shape of df
##


In [None]:
##
## Enter code here to check number of unique CIDs in df
##


### II.6 Summary Statistics for Your Data

[Back to Contents](#Contents)

Summary statistics are a mainstay for starting any analysis.  Pandas has all the usual descriptive statistics.  One method, *describe()*, will display the essential ones.

In [None]:
##
## Example: "describe" is a method attached to the DataFrame so it requires ().
## Round to 1 decimal place for readability (more decimal places are
## unnecessary, anyway).
##
## Display the descriptive statistics for the discounts.
##    First create a list of variables to display.
##
lst = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc', 'Tdisc' ]
df[ lst ].describe().round( 1 ).style.set_caption( 'Descriptive Statistics' )

**_Code Explanation_**

The *round* function is chained to the *describe* method.  An alternative way to round is shown next. 

The above report for the descriptive statistics is a challenge to read.  I prefer to have the statistics as the columns.  This can easily be done once you recognize that the report is just a matrix.  Matrices can be transposed which could help you read the report more easily.  Use the *T* attribute to transpose a matrix.

In [None]:
##
## Example of transposed matrix and alternative
## round function use.
##
lst = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc', 'Tdisc' ]
round( df[ lst ].describe().T, 1 ).style.set_caption( 'Descriptive Statistics' )

**_Interpretation_**

From the *Five Number Summary* (min/25%/50%/75%/max), you can determine the skewness of your data.

> 1. Symmetric: $(75\% - 50\%) = (50\% - 25\%)$
> 2. Right Skewed: $(75\% - 50\%) > (50\% - 25\%)$
> 3. Left Skewed: $(75\% - 50\%) < (50\% - 25\%)$

**QUESTION** What is the skewness for the Dealer Discount (*Ddisc*)?

**ANSWER**: right skewed
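
The three quartile rules above can be written as a small helper.  This is only a sketch: the function name *quartile_skew* and the data are hypothetical, and with real data an exact tie is rare, so you may want to treat nearly equal differences as symmetric.

```python
import pandas as pd

def quartile_skew( series ):
    """Classify skewness by comparing the two halves of the interquartile range."""
    q25, q50, q75 = series.quantile( [ 0.25, 0.50, 0.75 ] )
    upper, lower = q75 - q50, q50 - q25
    if upper > lower:
        return 'right skewed'
    if upper < lower:
        return 'left skewed'
    return 'symmetric'

# Hypothetical data, made up for illustration: a long right tail.
x = pd.Series( [ 1, 2, 3, 4, 10, 20, 50 ] )
print( quartile_skew( x ) )
```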

#### Exercise II.7

[Back to Contents](#Contents)

Using your merged orders/customers/marketing DataFrame, *df*, create a summary statistics display.  What is the skewness of the Total Discount (*Tdisc*)?

[See Solution](#Solution-II.7)

In [None]:
##
## Enter code here
##


### II.7 What is Next?

[Back to Contents](#Contents)

In Lesson III, I will show you how to do some basic graphing or *visualization* of your data.  This may seem more like scientific visualization than business visualization.  The latter is usually *infographics*, which are not useful for gaining insight and, hence, for extracting Rich Information.  The former, scientific visualization, is the tool for extracting Rich Information. 


In [None]:
slide( "042" )

## Lesson III Data Visualization for Insight

[Back to Contents](#Contents)


In [None]:
slide( '044' )


Specifically, you will learn to use:

> 1. histograms;
> 2. boxplots;
> 3. scatter plots;
> 4. contour plots; and
> 5. hex bin plots

to visualize your data.  The focus is on *scientific visualization* rather than *infographics visualization*.      

**Case Study Problem**:
<br><br>
The product manager wanted to know about unit sales and discounts by:

> 1. overall market;
> 2. marketing region;
> 3. customer loyalty; and
> 4. buyer rating.

In [None]:
slide( '046' )

In [None]:
slide( '047' )

In [None]:
slide( '049' )

In [None]:
slide( '051' )

### III.1 Look at the Distribution of Your Data

[Back to Contents](#Contents)


In [None]:
slide( '053' )

#### III.1.1 Histograms

[Back to Contents](#Contents)


In [None]:
slide( '054' )

You can use a histogram to examine the distribution of unit sales and the total discount.  Notice in the following display that a smooth line is overlaid.  This is a *kernel density estimate* (*KDE*).  You will see this again shortly.

In [None]:
slide( '055' )

In [None]:
##
## Histogram of unit sales
##
ax = sns.distplot( df.Usales )
ax.set( title = 'Unit Sales Distribution', xlabel = 'Unit Sales', 
       ylabel = 'Proportions' );

**_Code Explanation_**

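Plotting a histogram is very easy.  The Seaborn *distplot* function is used with the argument set to the variable of interest.  The function returns the plot's axes object, which is saved in a variable called "ax".  Settings such as the title and axis labels are then applied through this variable's *set* method.  Note that *distplot* is deprecated in recent Seaborn releases (0.11 and later) in favor of *histplot* and *displot*, so you may see a warning.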

**_Interpretation_**

The distribution is highly skewed to the right, which distorts the impression of the data.  Taking the natural log pulls in the long right tail and makes the distribution closer to normal.  For this reason, you should use a log transformation when you model unit sales.  The next graph shows that the distribution (on a log scale) is fairly normal.

**_Recommendation_**
    
Use the Numpy *log1p* function.  This returns the natural log of one plus the argument: $np.log1p( x ) = log_e(1 + x)$.  The reason for using this function is to handle cases where $x = 0$: $log(0)$ is undefined, so you cannot use it, but $log( 1 + 0 ) = log( 1 ) = 0$, which is a meaningful number.
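
A quick check of this recommendation, using made-up values that include a zero.  It also shows that *np.expm1* reverses the transformation, which matters when you convert logged values back to the original scale.

```python
import numpy as np

x = np.array( [ 0.0, 1.0, 9.0 ] )   # made-up values, including a zero

logged = np.log1p( x )              # log( 1 + x ); defined at x = 0
restored = np.expm1( logged )       # exp( y ) - 1 undoes log1p

print( logged )
print( restored )
```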

In [None]:
##
## Plot the natural log of unit sales
## A KDE curve is included by default
##
ax = sns.distplot( np.log1p( df.Usales ) )
ax.set( title = 'Unit Sales Distribution: Log Scale', 
       xlabel = 'Unit Sales (Natural Log)',
       ylabel = 'Proportions' );

**_Interpretation_**

The natural log transformation changed the distribution to a more normal looking distribution.  Normality is preferred for statistical analysis for a host of reasons.

#### Exercise III.1

[Back to Contents](#Contents)

Examine the distribution of pocket price using a histogram.  What can you conclude?  Redo using a log transformation.  Now what do you conclude?

[See Solution](#Solution-III.1)

In [None]:
##
## Enter code here for pocket price.
##


In [None]:
##
## Enter code here for transformed pocket price.
##


#### III.1.2 Boxplots

[Back to Contents](#Contents)

Boxplots are among the most useful visualization tools for examining distributions.

In [None]:
slide( '057' )

In [None]:
slide( '058' )

In [None]:
##
## Display the boxplot for total discounts
##
ax = sns.boxplot( y = 'Tdisc', data = df )
ax.set( title = 'Distribution of Total Discount', ylabel = 'Total Discount' );

**_Code Explanation_**

Notice that the Seaborn boxplot function only has an argument for the y-axis.  In this case, the x-axis is understood.  This gives a vertical chart as shown.  However, if you change the "y" to "x", the boxplot will be horizontal: 

> *sns.boxplot( x = 'Tdisc', data = df )*

produces a horizontal chart.  You can also use the argument *orient = "h"* or *orient = "v"* if the entire DataFrame is used.  See [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html) for examples.

**_Interpretation_**

The Total Discount is symmetrically distributed.  This is evident by an almost mirror image above and below the center line inside the box.  The center line is the median.  This boxplot is for the entire market.  But what about regions?

In [None]:
##
## Total discount distribution by regions
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', data = df )
ax.set( title = 'Distribution of Total Discount by Region', ylabel = 'Total Discount', 
       xlabel = 'Marketing Regions' );

**_Code Explanation_**

In this drill-down of the total discounts by marketing regions, the Seaborn boxplot function now has two axis arguments: 

> 1. y-axis; and
> 2. x-axis (*Region* in this case).

**_Interpretation_**

Notice that discounts are the lowest in the Southern Region while the Midwest has a large number of very low discounts.  Also, the dispersion of the discounts in the Southern Region is small relative to that in the other three regions.  Let us drill down on the discounts to verify the differences for the Southern Region.

In [None]:
##
## Drill down on the discounts in the Southern Region
##
## Select the discounts for the Southern Region
##
lst = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
df_south = df.loc[ df.Region == 'South', lst ]
##
## Melt the data from wide- to long-form.
##
df_melt = pd.melt( df_south )
##
## Get summary statistics
##
grp = df_melt.groupby( 'variable' ).describe()
print( 'Summary statistics for the discounts:\n{}'.format( grp ) )
##
## Use a boxplot to examine the distributions.
##
ax = sns.boxplot( x = 'variable', y = 'value', data = df_melt )
ax.set( title = 'Discount Distribution\nSouthern Marketing Region', 
        xlabel = 'Type of Discount',
        ylabel = 'Discount Amount')
##
## Reset the tick labels to more meaningful labels.  The labels must follow
## the column order in lst: Ddisc, Cdisc, Odisc, Pdisc.
##
ax.set_xticklabels( [ 'Dealer', 'Competitive', 'Order\nSize', 'Pickup' ] );

**_Code Explanation_**

The statement *df_melt = pd.melt( df_south )* was used.  The *melt* function in the Pandas package stacks the columns of a DataFrame into a new DataFrame.  The column names of the original DataFrame become the values of a single new variable (named *variable*) and the values in each original column become the values of a second new variable (named *value*).  For this example, there are four discounts, so four columns, in *df_south*.  These four are stacked, or *melted*, to create a new DataFrame with two columns: *variable* and *value*.  The melting takes a DataFrame that is in *wide form* and converts it to one in *long form*.  The long form is needed for the boxplots.
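
The wide-to-long conversion is easier to see on a tiny table.  This sketch uses a hypothetical two-column DataFrame with made-up discount values.

```python
import pandas as pd

# Hypothetical wide-form data: one column per discount type (values made up).
wide = pd.DataFrame( { 'Ddisc': [ 0.10, 0.12 ], 'Odisc': [ 0.05, 0.04 ] } )

# Long form: one row per (column name, value) pair.
long = pd.melt( wide )
print( long )
```

Two columns of two rows each become four rows, with the former column names stored in *variable*.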

**_Interpretation_**

Notice that the dealer discount tends to be the largest while the order discount has the most variation.

#### Exercise III.2

[Back to Contents](#Contents)

Check the distribution of the pocket price by marketing region, loyalty program membership, and buyer rating. What do you conclude?  A complete Data Dictionary is [here](#Appendix-Complete-Data-Dictionary).

[See Solution](#Solution-III.2)

In [None]:
##
## Enter code here for pocket price by region.
##


In [None]:
##
## Enter code here for pocket price by loyalty program.
##


In [None]:
##
## Enter code here for pocket price by buyer rating.
##


### III.2 Look for Relationships in Your Data

[Back to Contents](#Contents)

Scatter plots are the workhorse of statistical displays because they allow you to see relationships -- sometimes.  Properly drawn, they can provide a wealth of insight into: 

> - relationships;
> - trends;
> - patterns; and
> - anomalies

of two continuous variables.  They can be supplemented with histograms on the margins to show distributions.

In [None]:
slide( '061' )

In [None]:
slide( '062' )

#### III.2.1 Transformation for Better Interpretation

[Back to Contents](#Contents)

Since one objective from the product manager is to estimate a price elasticity, you should graph unit sales and pocket price.  You saw earlier that unit sales are right skewed but that a log transform shifts the distribution to a more normal one.  You should take the log of pocket price as well as unit sales.  This is a very common transformation in empirical demand analysis because the slope of a line relating log sales to log price is the elasticity.
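
To see why the slope is the elasticity, consider a simple log-log demand line.  The symbols $Q$ (unit sales), $P$ (pocket price), $\beta_0$, and $\beta_1$ are notation introduced here for illustration only.  The sketch uses plain natural logs; with *log1p* the result holds approximately when the values are large relative to 1.

> $\ln Q = \beta_0 + \beta_1 \ln P$
>
> $\dfrac{d \ln Q}{d \ln P} = \dfrac{dQ / Q}{dP / P} = \beta_1$

The middle term is the percentage change in sales divided by the percentage change in price, which is the definition of the price elasticity.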

In [None]:
slide( "063" )

In [None]:
##
## Transform unit sales and pocket price
##
df[ 'log_Pprice' ] = np.log1p( df.Pprice )
df[ 'log_Usales' ] = np.log1p( df.Usales )
##
## Display the unlogged and logged data
##
lst = [ 'Pprice', 'log_Pprice', 'Usales', 'log_Usales' ]
df[ lst ].head().style.set_caption( 'Sales and Price Data' )

In [None]:
##
## Plot the logged data
## Use the Seaborn "relplot" function
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', xlabel = 'Log Pocket Price', 
       ylabel = 'Log Unit Sales' );

**_Interpretation_**

A negative relationship is evident -- as it should be.  But the large number of plot points makes it slightly difficult to see.

#### III.2.2 Enhancing the Scatter Plot

[Back to Contents](#Contents)

In [None]:
##
## Replot the logged data with a regression line added. 
## Use the Seaborn "regplot" function.
##
## Warning -- this will take a few seconds
##
## Note: 
##   The plot element colors can be set:
##     b:blue, g:green, r:red, c:cyan,
##     m:magenta, y:yellow, k:black, w:white.
##
ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                 scatter_kws = { 'color':'black' },
                 line_kws ={ 'color':'yellow' } )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' );

**_Code Explanation_**

The Seaborn *regplot* function is used to add a regression line to the scatter plot.  To help distinguish between plotting points and the regression line, the *scatter_kws={"color": "black"}, line_kws={"color": "yellow"}* arguments are used.  The points are specified as black and the line as yellow.  The default is for both to be the same color.

**_Interpretation_**

The regression line shows a negative relationship between price and sales.

**_Appendix_**

Some additional graphs are in the Appendix [here](#Additional-Scatter-Plot-Methods).

#### III.2.3 Working with *Large-N* Data

[Back to Contents](#Contents)

The scatter plots are dense, making it difficult to see patterns. Options are to use a:

> 1. random sample;
> 2. contour plot;
> 3. hex bin plot; or
> 4. Lowess smooth.

#### III.2.3.1 Random Sampling

[Back to Contents](#Contents)

In [None]:
slide( '065' )

In [None]:
##
## Draw a random sample of size n = 500
## Put the sample in a new DataFrame.
##
smpl = df.sample( n = 500, random_state = 1234, replace = False )
##
## Plot the data using the random sample
##
ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = smpl )
ax.set( title = 'Unit Sales vs. Pocket Price\nRandom Sample\nn = 500', 
       ylabel = 'Log Unit Sales', xlabel = 'Log Pocket Price' );

**_Code Explanation_**

The Pandas DataFrame method *sample* is used to draw a random sample without replacement of size $n = 500$.  The random seed is set at 1234 so the same sample would be drawn each time the cell is run.
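
The effect of the seed can be verified directly.  This is a small sketch on a hypothetical DataFrame, *toy*, that stands in for the course data.

```python
import pandas as pd

# Hypothetical DataFrame standing in for the course data.
toy = pd.DataFrame( { 'x': range( 100 ) } )

# Same seed, same sample; the original index labels are kept.
s1 = toy.sample( n = 5, random_state = 1234, replace = False )
s2 = toy.sample( n = 5, random_state = 1234, replace = False )

print( s1.index.equals( s2.index ) )
print( s1.index.is_unique )
```

Because the sample is drawn without replacement, no row can appear twice, so the index of the sample is unique.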

**_Interpretation_**

The negative relationship between unit sales and price is evident.

#### Exercise III.3

[Back to Contents](#Contents)

Create a random sample of $n = 1000$ and plot unit sales vs. total discounts (*Tdisc*).  What do you conclude?

[See Solution](#Solution-III.3)

#### III.2.3.2 Contour Plot

[Back to Contents](#Contents)

In [None]:
slide( '066' )

In [None]:
slide( '067' )

In [None]:
##
## Contour plot with marginal distributions
## Sample
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', kind = 'kde', data = smpl );

**_Code Explanation_**

Seaborn's *jointplot* is used.  The *kind = 'kde'* argument requests a *kernel density plot*, which draws the contours.

**_Interpretation_**

The dark spot in the middle shows the concentration of the data points.  The negative relationship between sales and price is evident.

#### III.2.3.3 Hex Bin Plot

[Back to Contents](#Contents)

In [None]:
slide( '068' )

In [None]:
##
## Hex binning
## Sample
##
## Note: A white background is best for this 
## Note: 
##   The plot element colors can be set: 
##     b:blue, g:green, r:red, c:cyan,
##     m:magenta, y:yellow, k:black, w:white.
##
with sns.axes_style( 'white' ):
    ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', 
                       kind = 'hex', color = 'k', data = smpl );

#### III.2.3.4 Lowess Curve

[Back to Contents](#Contents)

An alternative is to fit a *Lowess Smooth* to the data.  *Lowess* stands for *Locally Weighted Scatterplot Smoothing*.  It is a local regression fit that approaches the *OLS* line for very smooth fits.  See <a href="https://en.wikipedia.org/wiki/Local_regression" target="_parent">here</a> for a description.  

In [None]:
##
## Fit a Lowess Smooth with the scatter turned off
## Use the sample data
##
ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', lowess = True, scatter = False, data = smpl )
ax.set( title = 'Unit Sales vs. Pocket Price\nRandom Sample\nn = 500\nWith Lowess Smooth', 
       ylabel = 'Log Unit Sales', xlabel = 'Log Pocket Price' );

#### Exercise III.4

[Back to Contents](#Contents)

Study the relationship between Total Discount (*Tdisc*) and the pocket price (*Pprice*).  Use a random sample of $n = 200$, a Lowess smooth, and omit the scatter points.  Let the pocket price be on the vertical axis.  What can you conclude?

[See Solution](#Solution-III.4)

In [None]:
##
## Enter code here.
##


### III.3 Look for Trends in Your Data

[Back to Contents](#Contents)

Trends are identified using line graphs, usually with time on the X-axis. 

In [None]:
slide( '071' )

In [None]:
##
## Subset the date indicator and the Dealer discount
##
lst = [ 'Tdate', 'Ddisc' ]
data = df[ lst ].copy()
##
## Reset the index to the date
##
data.set_index( 'Tdate', inplace = True )
data.head()

**_Code Explanation_**

The subset DataFrame containing *Tdate* and *Ddisc*, *data*, is reindexed using *Tdate*.  *Tdate* was converted to a *DateTime* variable when the orders data were originally imported. 

**_Interpretation_**

The data for *Ddisc* are by year-month-day.  Notice that there are missing values indicated by *NaN*.  

In [None]:
##
## Group the data by months and calculate the 
## mean discount for each month.
##
grp = data.resample( 'M' ).mean()
grp.head().style.set_caption( 'Grouped Data' )

**_Code Explanation_**

The *data* DataFrame is *resampled* to monthly data, using the mean of the values in each month.  Basically, *resample* aggregates the data by the datetime index.  See <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html" target="_parent">here</a> for documentation on *resample*.
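
Conceptually, resampling to months is a group-by on the calendar month of the index.  The sketch below shows this idea with a plain *groupby* on hypothetical daily data (the values are made up).  It is an alternative way to see the aggregation, not what *resample* does internally, and its row labels are monthly periods rather than the timestamps that *resample* returns.

```python
import pandas as pd

# Hypothetical daily data with a DatetimeIndex (values made up).
idx = pd.to_datetime( [ '2020-01-05', '2020-01-20', '2020-02-10', '2020-02-25' ] )
toy = pd.DataFrame( { 'Ddisc': [ 0.10, 0.20, 0.30, 0.50 ] }, index = idx )

# Group rows by calendar month and average within each month.
monthly = toy.groupby( toy.index.to_period( 'M' ) ).mean()
print( monthly )
```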


**_Interpretation_**

Each value is the mean of values for the indicated month.

In [None]:
##
## Use Pandas' plot function.
## It automatically uses the time index for the X-axis.
##
ax = grp.plot( y = 'Ddisc', legend = False )
ax.set( title = 'Dealer Discount\nMonthly', ylabel = 'Dealer Discount', xlabel = 'Months' );

**_Interpretation_**

Pandas does not connect points across a missing value, so a gap in the line marks a period with no data.  This gives a better representation of time series data.

### III.4 Look for Patterns in Your Data

[Back to Contents](#Contents)

Patterns are identified using a variety of visual displays.  So all the graph types discussed will help identify patterns.

In [None]:
slide( '074' )

### III.5 Look for Anomalies in Your Data

[Back to Contents](#Contents)

In [None]:
slide( '076' )

In [None]:
##
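## Categorical plot: boxplot variant
## Note: the variable is passed by keyword (y) because recent Seaborn
## releases do not accept positional variable arguments.
##
ax = sns.catplot( y = 'Tdisc', kind = 'box', orient = 'v', data = df_orders )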
ax.set( title = 'Total Discount\nOutliers', ylabel = 'Total Discount', xlabel = '' );

**_Interpretation_**

There are some clear outliers:

> 1. A number of points are very low.
> 2. Only one or two points are very high.
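
A boxplot, by default, draws a point individually when it lies more than 1.5 times the interquartile range beyond the box.  The same rule can be applied in code to list the outliers.  This is a sketch on made-up values, not the course data.

```python
import pandas as pd

# Hypothetical discount values with one low and one high extreme (made up).
x = pd.Series( [ 0.02, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.60 ] )

q1, q3 = x.quantile( 0.25 ), x.quantile( 0.75 )
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws individually.
outliers = x[ ( x < lower ) | ( x > upper ) ]
print( outliers )
```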

### III.6 What is Next?

[Back to Contents](#Contents)

In Lesson IV, I will show you how to build three predictive models:

> 1. *OLS*;
> 2. Logit; and
> 3. Decision trees.

I'll discuss these in the next lesson.

In [None]:
slide( "078" )

## Lesson IV Predictive Modeling: Introduction to Machine Learning

[Back to Contents](#Contents)


In [None]:
slide( '080' )

### IV.1 Comparing and Contrasting Prediction and Forecasting

[Back to Contents](#Contents)

In [None]:
slide( '082' )

### IV.2 Steps for Predictive Modeling

[Back to Contents](#Contents)


In [None]:
slide( '084' )

#### IV.2.1 Steps for Predictive Modeling: Train/Test Split Data

[Back to Contents](#Contents)



In [None]:
slide( '086' )

In [None]:
slide( '087' )

In [None]:
slide( '088' )

The data are split into two parts using *sklearn*.  Each part has an *X* variable array and a *y* vector (the upper case *X* and lower case *y* are conventional).  The *X* array is a Pandas DataFrame of the *X* variables.  The *y* vector is a Pandas Series.

In [None]:
##
## Create the X and y data for splitting.  Notice the cases for the variable names.
##
y = df[ 'Usales' ]
##
lst = [ 'Pprice', 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc', 'Region', 'buyerRating' ]
X = df[ lst ]
##
## Split the data.  The default is 3/4 train.
## Note: the train_test_split function was loaded in the packages section
##
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.25,
                                                    random_state = 42 ) 

**_Code Explanation_**

The dependent and independent variables need to be separated from the main DataFrame before the train/test split can be done.  The index from the main DataFrame is preserved.  The first three lines of code do this.  The *train_test_split* function randomly divides the data, keeping the indexes aligned.  The *random_state = 42* argument sets the random seed.  Four data sets are returned which are (in order): 

> *X_train*
>
> *X_test*
>
> *y_train*
>
> *y_test*.
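
The idea of a random split that keeps the original index labels can be sketched with Pandas alone.  This is not how *train_test_split* is implemented; it is only an illustration of the same idea on a hypothetical DataFrame, *toy*, with made-up values.

```python
import pandas as pd

# Hypothetical data standing in for X and y (values made up).
toy = pd.DataFrame( { 'Pprice': range( 8 ), 'Usales': range( 10, 18 ) } )

# Randomly take 75% of the rows for training; the rest are for testing.
train_toy = toy.sample( frac = 0.75, random_state = 42 )
test_toy = toy.drop( train_toy.index )

print( sorted( train_toy.index ), sorted( test_toy.index ) )
```

Every row lands in exactly one of the two sets, and each row keeps the index label it had in the original DataFrame.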

In [None]:
##
## Display some data
##
print( 'Sample sizes:\n\tX: {}, y: {}'.format( X_train.shape[ 0 ], y_train.shape[ 0 ] ) )
print( '\nTraining X Data:\n{}'.format( X_train.head() ) )
print( '\nTraining y Data:\n{}'.format( y_train.head() ))

**_Interpretation_**

Notice that the sample sizes for the *X* and *y* training data sets are the same.  Also note the indexes for the two training data sets.  These are index values from the main DataFrame, *df*.

In [None]:
## 
## Merge the X and y training data for 
## model training.  Do an inner join on the indexes.
##
## Rename the y variable: Usales
##
yy = pd.DataFrame( { 'Usales':y_train } )
train = yy.merge( X_train, left_index = True, right_index = True )
print( 'Training Data Set:\n{}'.format( train.head() ) )

**_Code Explanation_**

The *X* and *y* training data sets are merged on the indexes.  Recall that the indexes were preserved when the *X*  and *y* data sets were created.  This is why the indexes can be used as the link.
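
An index-based merge can also be written with the DataFrame *join* method, which aligns on index labels by default.  The sketch below uses hypothetical pieces, *X_toy* and *y_toy*, with made-up values and a shuffled index.

```python
import pandas as pd

# Hypothetical training pieces that share (shuffled) index labels.
X_toy = pd.DataFrame( { 'Pprice': [ 2.5, 3.0, 2.8 ] }, index = [ 7, 2, 5 ] )
y_toy = pd.Series( [ 100, 80, 90 ], index = [ 7, 2, 5 ], name = 'Usales' )

# join aligns on the index labels, like merge with left_index/right_index.
joined_toy = y_toy.to_frame().join( X_toy, how = 'inner' )
print( joined_toy )
```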

#### Exercise IV.1

[Back to Contents](#Contents)

Merge the X and y testing data sets for predicting.

[See Solution](#Solution-IV.1)

In [None]:
##
## Enter code here.
##


In [None]:
##
## Add log Usales and log Pprice to the training data
## The log is based on the Numpy function log1p
## Note: log1p( x ) = log( 1 + x )
##
train[ 'log_Usales' ] = np.log1p( train.Usales )
train[ 'log_Pprice' ] = np.log1p( train.Pprice )
print( 'Training Data Set:\n{}'.format( train.head() ) )
print( '\nTraining Data Set Shape:\n{}'.format( train.shape ) )

**_Code Explanation_**

Logged terms are added because the Data Visualization showed that logs induce normality.

#### Exercise IV.2

[Back to Contents](#Contents)

Add log Usales and log Pprice to the testing data set.

[See Solution](#Solution-IV.2)

In [None]:
##
## Enter code here.
##


#### IV.2.2 Steps for Predictive Modeling: Train a Model

[Back to Contents](#Contents)

I will cover three predictive models:

> 1. *OLS*
> 2. Logit
> 3. Decision trees

In [None]:
slide( '091' )

In [None]:
slide( '092' )

In [None]:
slide( '093' )

In [None]:
slide( '094' )

In [None]:
slide( '096' )

In [None]:
slide( '097' )

##### Case I Continuous Dependent Variable: *OLS* Regression

[Back to Contents](#Contents)

Model unit sales as a function of the pocket price to get a price elasticity.  Recall that you are using log terms and that the estimated coefficient for log price is the elasticity.

**Recommendation**:  Use formulas to specify the model.  You need the *statsmodels.formula* api for this.  See the package loading section [here](#II.2.2-Load-Packages).

In [None]:
slide( '099' )

In [None]:
slide( '100' )

In [None]:
## 
## OLS
##
## There are four steps for estimating a model:
##
##   1. define a formula (i.e., the specific model to estimate)
##   2. instantiate the model (i.e., specify it)
##   3. fit the model
##   4. summarize the fitted model
##
## ===> Step 1: Define a formula <===
##
## The formula uses a “~” to separate the left-hand side from the right-hand side
## of a model and a “+” to add columns to the right-hand side.  A “-” sign (not 
## used here) can be used to remove columns from the right-hand side (e.g.,
## remove or omit the constant term which is always included by default). 
##
formula = 'log_Usales ~ log_Pprice + Ddisc + Odisc + Cdisc + Pdisc + C( Region )'
##
## Since Region is categorical, you must create dummies for the regions.  You
## do this using 'C( Region )' to indicate that Region is categorical.
##
## ===> Step 2: Instantiate the OLS model <===
##
mod = smf.ols( formula, data = train )
##
## ===> Step 3: Fit the instantiated model <===
##      Recommendation: number your fitted models
##
reg01 = mod.fit() 
##
## ===> Step 4: Summarize the fitted model <===
##
print( reg01.summary() )

**_Code Explanation_**

The modeling follows four steps as shown above.  Regardless of the software you use, these same four steps are followed.  Some software combines them while others, such as statsmodels, require that each be stated explicitly.

**_Interpretation_**

The price elasticity is the coefficient for the logged price variable (i.e., log_Pprice): -1.7.  If price falls by 1%, unit sales rise by 1.7%.  This indicates that the demand for blinds is highly elastic.  This should be expected since furniture is a competitive business and blinds are very competitive.  Revenue will also change.  If price falls, revenue increases.  The percentage change in revenue is approximately $(1 + elasticity)$ times the percentage change in price.  So for a 1% fall in price, revenue will rise about 0.7% (= $[1 + (-1.7)] \times [-1\%]$). 
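
The revenue arithmetic can be checked numerically.  This sketch assumes a constant-elasticity demand curve, which is what a log-log model implies, and uses the elasticity quoted in the text.  The variable names are illustrative.

```python
elasticity = -1.7        # estimate quoted in the text
price_change = -0.01     # a 1% price cut

# Constant-elasticity demand: Q is proportional to P ** elasticity,
# so revenue P * Q is proportional to P ** ( 1 + elasticity ).
exact = ( 1 + price_change ) ** ( 1 + elasticity ) - 1

# First-order approximation:
# % change in revenue = ( 1 + elasticity ) * % change in price.
approx = ( 1 + elasticity ) * price_change

print( round( 100 * exact, 2 ), round( 100 * approx, 2 ) )
```

Both numbers are close to 0.7%, so the simple rule is a good approximation for small price changes.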

The discounts and regions seem to have no effect, but this can be tested as shown below.  Also note that the $R^2 = 0.20$ which is very low.  

The Jarque-Bera Test is a test for normality of the disturbance term.  It is a "goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution. $\ldots$ The null hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero.  $\ldots$ If it is far from zero, it signals the data do not have a normal distribution."  So the Null Hypothesis is $H_0: Normality$.  (Source: <a href="https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test" target="_parent">see here</a>)  In this case, the Null is rejected.  The Omnibus Test is an alternative test of normality with the same Null.  It also indicates that the Null must be rejected.
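
To make the statistic concrete, here is a minimal sketch that computes it by hand as $JB = \dfrac{n}{6}\left(S^2 + \dfrac{(K - 3)^2}{4}\right)$, where $S$ is the sample skewness and $K$ is the sample kurtosis.  The function name *jarque_bera_stat* and the simulated data are hypothetical; the regression summary reports this statistic for the model residuals.

```python
import numpy as np

def jarque_bera_stat( resid ):
    """Jarque-Bera statistic: n/6 * ( S**2 + ( K - 3 )**2 / 4 )."""
    resid = np.asarray( resid, dtype = float )
    n = resid.size
    dev = resid - resid.mean()
    m2 = np.mean( dev ** 2 )
    skew = np.mean( dev ** 3 ) / m2 ** 1.5
    kurt = np.mean( dev ** 4 ) / m2 ** 2
    return n / 6.0 * ( skew ** 2 + ( kurt - 3.0 ) ** 2 / 4.0 )

rng = np.random.default_rng( 0 )
normal_draws = rng.normal( size = 5000 )          # simulated, roughly normal
skewed_draws = rng.exponential( size = 5000 )     # simulated, strongly right skewed

print( jarque_bera_stat( normal_draws ), jarque_bera_stat( skewed_draws ) )
```

The statistic is small for the simulated normal data and very large for the skewed data, which is the pattern that leads to rejecting the Null.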

#### Exercise IV.3

[Back to Contents](#Contents)

Estimate a new OLS model by adding the buyer rating to the above model. Name your model *regE01*.  Interpret your results.  Is the buyer rating important for sales?

**Hint**: Buyer rating is categorical so you have to create dummies for the rating.

[See Solution](#Solution-IV.3)

In [None]:
##
## Enter code here
##


###### Case I Analyze the Results

[Back to Contents](#Contents)

Quantities of interest can be extracted directly from the fitted model. Type *dir( reg01 )* for a full list.

Since the product manager wanted to know about a region effect, you should do an F-test of all the coefficients for the regions to determine if they are all zero, meaning that the dummies as a group do nothing.  This is a <u>joint</u> test of significance.  The test statistic is:

> $F_C = \dfrac{\left(SSR_U - SSR_R\right)/(df_U - df_R)}{SSE_U/(n - p - 1)} = \dfrac{\left(SSE_R - SSE_U\right)/(df_U - df_R)}{SSE_U/(n - p - 1)}$

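where "U" indicates the *unrestricted* or *full* model with the Region dummies and "R" indicates the *restricted* model without the Region dummies.  Here, $SSR$ is the regression sum of squares, $SSE$ is the error sum of squares, $df$ is the model degrees of freedom (so $df_U - df_R$ is the number of restrictions tested), $n$ is the sample size, and $p$ is the number of independent variables in the unrestricted model.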
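
The test statistic can be computed by hand from two least squares fits.  The sketch below uses simulated data and Numpy only; all names and values are hypothetical, and the data are generated with no true group effect.  It uses the $SSE$ form of the formula.

```python
import numpy as np

rng = np.random.default_rng( 1 )
n = 200
x = rng.normal( size = n )
group = rng.integers( 0, 3, size = n )            # simulated 3-level category
y = 1.0 + 2.0 * x + rng.normal( size = n )        # no true group effect

def sse( X, y ):
    """Error sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq( X, y, rcond = None )
    resid = y - X @ beta
    return float( resid @ resid )

const = np.ones( n )
# Two dummies for three levels; level 0 is the omitted base.
dummies = np.column_stack( [ group == 1, group == 2 ] ).astype( float )

X_r = np.column_stack( [ const, x ] )             # restricted: no dummies
X_u = np.column_stack( [ const, x, dummies ] )    # unrestricted: with dummies

q = X_u.shape[ 1 ] - X_r.shape[ 1 ]               # number of restrictions
# The denominator degrees of freedom equal n - p - 1 when a constant is included.
F = ( ( sse( X_r, y ) - sse( X_u, y ) ) / q ) / ( sse( X_u, y ) / ( n - X_u.shape[ 1 ] ) )
print( F )
```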

In [None]:
##
## Specify the joint (Null) hypothesis that the regions are the same;
## i.e., there is no region effect.
##
hypothesis = ' ( C(Region)[T.Northeast] = 0, C(Region)[T.South] = 0, C(Region)[T.West] = 0 ) '
print( 'Null Hypothesis:\n\t{}'.format( hypothesis ) )
print( '\nAlternative Hypothesis:\n\tAt least one is not zero')
##
## Run an F-test 
##
f_test = reg01.f_test( hypothesis )
##
## Retrieve the p-value
##
pval = round( float( f_test.pvalue ), 2 )
##
## Print results
##
print( '\np-value for F-Test: {}'.format( pval ) )
if pval < 0.05:
    print( '\nSignificant so reject H0' )
else:
    print( '\nInsignificant so do not reject H0' )

**_Code Explanation_**

Notice that there are only three regions specified even though there are four: one is omitted as the base.  Also notice that the three hypotheses are specified as *C(Region)[T.XX] = 0* where *XX* is the region name.  The *T* stands for *Treatment* which is the *R* and *Python* term for a dummy variable encoding of a categorical variable.  There are other forms of encoding.  See [here](https://www.statsmodels.org/dev/examples/notebooks/generated/contrasts.html) for documentation on encoding a categorical variable in Statsmodels.
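
Dummy encoding with an omitted base can be seen with the Pandas *get_dummies* function.  This is only a sketch of the same idea on made-up region labels; it is not how the formula interface builds the dummies internally.

```python
import pandas as pd

# Hypothetical region labels (made up for illustration).
regions = pd.Series( [ 'Midwest', 'South', 'West', 'Northeast', 'South' ], name = 'Region' )

# drop_first = True omits the first level (alphabetically, Midwest) as the base.
dummies = pd.get_dummies( regions, drop_first = True )
print( dummies )
```

A row for the base region has zeros in all three dummy columns, which is why only three coefficients appear for four regions.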

**_Output Interpretation_**

There are several returned values for the F-test.  Only the p-value is important.

**_Interpretation_**

The Null Hypothesis is that there is no region effect.  The p-value is 0.32 so the Null Hypothesis is not rejected: there is no Region effect.

#### Exercise IV.4

[Back to Contents](#Contents)

Test the Null Hypothesis that all the buyer rating estimated parameters are zero.  That is, there is no difference among the ratings.

[See Solution](#Solution-IV.4)

In [None]:
##
## Enter code here
##


###### Case I Predict with the Model

[Back to Contents](#Contents)

Predict unit sales.  Recognize that sales are in (natural) log terms.  You will convert back to unit sales in "normal" terms later.

In [None]:
##
## Calculate predicted log of unit sales, the dependent variable.
##
## Note: the inverse of the log is needed; use np.expm1( x )
## since log1p was used: np.expm1 = exp(x) - 1.
##
log_pred = reg01.predict( test )  ## test is the testing data from Exercises IV.1 and IV.2
y_pred = np.expm1( log_pred )
##
##
## Combine into one temporary DataFrame for convenience
##
data = pd.DataFrame( { 'y_test':y_test, 'y_logPred':log_pred, 'y_pred':y_pred } )
data.info()

Use the sklearn metrics function *r2_score* to check the fit of actual vs. predicted values.  From the sklearn User Guide:

> "*The r2_score function computes R², the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.*"
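
To make the quoted properties concrete, here is a minimal sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand with NumPy.  The helper name *r_squared* and the data are hypothetical; this is not the sklearn implementation.

```python
import numpy as np

def r_squared( y_true, y_hat ):
    ## 1 - (residual sum of squares)/(total sum of squares)
    y_true = np.asarray( y_true, dtype = float )
    y_hat = np.asarray( y_hat, dtype = float )
    ss_res = np.sum( ( y_true - y_hat )**2 )
    ss_tot = np.sum( ( y_true - y_true.mean() )**2 )
    return 1.0 - ss_res/ss_tot

y = np.array( [ 1.0, 2.0, 3.0, 4.0 ] )                   ## made-up actual values
r2_perfect = r_squared( y, y )                           ## perfect predictions
r2_constant = r_squared( y, np.full( 4, y.mean() ) )     ## always predict the mean
r2_bad = r_squared( y, y[ ::-1 ] )                       ## predictions in reverse order
print( r2_perfect, r2_constant, r2_bad )
```

The three cases show the best possible score, the score of a constant model, and a negative score for predictions that are worse than the constant model.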

In [None]:
##
## Display the r2 score.  But first drop any NaN data.
##
data.dropna( inplace = True )
print( 'r2 Score:\n{}'.format( round( r2_score( data.y_test, data.y_pred ), 3 ) ) )

This is not very good.

You can also graph the actual vs predicted values.  Sometimes, however, the number of data points is too large to plot so a random sample may be needed.  This is our case.

In [None]:
##
## Draw a random sample of 500 observations without replacement
## from the data DataFrame.
##
smpl = data.sample( n = 500, random_state = 1 )
##
## Plot the data
##
ax = sns.regplot( x = 'y_test', y = 'y_pred', scatter = True, data = smpl )
ax.set( title = 'Actual vs Predicted Unit Sales\nRandom Sample of 500', 
       ylabel = 'Predicted Sales', xlabel = 'Actual Sales' );

Predict unit sales for different settings of the variables.  This is *scenario* or *what-if* analysis.

In [None]:
##
## Specify scenario values to use for prediction
##
## Create a dictionary (i.e., a dict)
##
data = {
         'Pprice': [ 2.50 ],
         'Ddisc': [ 0.03 ],
         'Odisc': [ 0.05 ],
         'Cdisc': [ 0.03 ],
         'Pdisc': [ 0.03 ],
         'Region': [ 'West' ]
        }
##
## Create a DataFrame using the dict
##
scenario = pd.DataFrame.from_dict( data )
##
## Insert a log price column after the Pprice variable
##
scenario.insert( loc = 1, column = 'log_Pprice',
                value = np.log1p( scenario.Pprice ) )
##
## Display the settings and the predicted unit sales
##
print( 'Scenario settings:\n{}'.format( scenario ) )
##
## Create a prediction
##
log_pred = reg01.predict( scenario )
y_pred = np.expm1( log_pred )
print( '\nPredicted Unit Sales: \n{}'.format( round( y_pred, 0 ) ) )

#### Exercise IV.5

[Back to Contents](#Contents)

Create a scenario with the following settings for your model with buyer Rating (regE01):

> - 'Pprice': [ 2.50 ]
> - 'Ddisc': [ 0.03 ]
> - 'Odisc': [ 0.03 ]
> - 'Cdisc': [ 0.03 ]
> - 'Pdisc': [ 0.03 ]
> - 'buyerRating': [ 'Poor' ]
> - 'Region': [ 'South' ]

[See Solution](#Solution-IV.5)

In [None]:
##
## Enter code here
##


##### Case II Binary Dependent Variable: Logistic Regression

[Back to Contents](#Contents)

In [None]:
slide( '103' )

In [None]:
slide( '104' )

In [None]:
slide( '105' )

###### Case II Create Your Data

[Back to Contents](#Contents)

Customer satisfaction is part of the DataFrame.  Satisfaction is measured on a five-point scale: *1 = Not at All Satisfied*, *5 = Very Satisfied*.  

First, look at the frequency count of satisfaction.  There is a problem, however: you cannot use the same data as before since satisfaction is by customer while the data used so far are by transaction.  The satisfaction rating is in the customer DataFrame, *df_cust*.  Region, which will be included in the model, is in the marketing DataFrame, *df_marketing*.  You need to first find the mean price and mean discounts by customer from the transactions DataFrame and then merge this new DataFrame with the customer and marketing DataFrames.  So, there are several steps:

> 1. Extract the pocket price and discounts -- include the *CID*
> 2. Group by the *CID* and calculate the means by *CID*
> 3. Merge with the customer and marketing DataFrames
> 4. Recode the satisfaction values in the merged file so that 1 represents the top-two values (called *top-two box* or *T2B*) and 0 represents all other values.  The *T2B* is *Very Satisfied*.
> 5. Train a model with *T2B* satisfaction as a function of the pocket price, discounts, and Region.

The Customer Satisfaction variable is documented in the complete Data Dictionary [here](#Appendix-Complete-Data-Dictionary).
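
The first four steps can be sketched on a tiny made-up example before applying them to the real data.  The DataFrames *trans* and *cust* and their values are hypothetical stand-ins for the transactions and customer DataFrames; the column names follow the ones used in this lesson.

```python
import pandas as pd

##
## Made-up transactions: several rows per customer ID (CID)
##
trans = pd.DataFrame( { 'CID': [ 1, 1, 2, 2, 3 ],
                        'Pprice': [ 2.0, 4.0, 3.0, 5.0, 6.0 ] } )
##
## Made-up customer data: one row per CID
##
cust = pd.DataFrame( { 'CID': [ 1, 2, 3 ],
                       'buyerSatisfaction': [ 5, 3, 4 ] } )
##
## Steps 1 - 2: calculate the mean by CID; reset_index keeps CID as a column
##
grp = trans.groupby( 'CID' ).mean().reset_index()
##
## Step 3: merge on the CID
##
toy = grp.merge( cust, on = 'CID' )
##
## Step 4: recode so that ratings of 4 or 5 are 1 (T2B) and all others are 0
##
toy[ 'sat_t2b' ] = ( toy.buyerSatisfaction >= 4 ).astype( int )
print( toy )
```

The result has one row per customer, which is the level at which satisfaction is measured.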

In [None]:
## 
## ===> Step 1: Extract the pocket price and discounts -- include the CID <===
##
lst = [ 'CID', 'Pprice', 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
data = df[ lst ].copy()
##
## Set the index to the CID
##
data.set_index( 'CID', inplace = True )
print( 'Number of rows: {rows}\nNumber of columns: {cols}'.format( 
        rows = data.shape[ 0 ], cols = data.shape[ 1] ) )

In [None]:
##
## ===> Step 2: Group by CID and calculate the means by CID <===
##
grp = data.groupby( 'CID' ).mean()
print( 'Number of rows: {rows}\nNumber of columns: {cols}'.format( 
        rows = grp.shape[ 0 ], cols = grp.shape[ 1 ] ) )

In [None]:
##
## Look at the head of grp
##
grp.head().style.set_caption( 'Mean Price and Discounts' )

In [None]:
##
## ===> Step 3: Merge with the customer and marketing DataFrames <===
##              Merge on the CID
##
df_sat = pd.merge( pd.merge( grp, df_cust, on = 'CID' ), df_marketing, on = 'CID' )
##
## Alternative merge:
## df_sat = grp.merge( df_cust, on = 'CID' ).merge( df_marketing, on = 'CID' )
## 
print( 'Number of rows: {rows}\nNumber of columns: {cols}'.format( 
        rows = df_sat.shape[ 0 ], cols = df_sat.shape[ 1 ] ) )

In [None]:
##
## Check the columns
##
df_sat.columns

In [None]:
##
## Do a quick check of the satisfaction distribution.
##
## Use the Series value_counts() method.  Sort by the
## scale values 1 - 5 using sort_index().
##
data = df_sat.buyerSatisfaction.value_counts( normalize = True ).sort_index()
pd.DataFrame( data ).style.set_caption( 'Buyer Satisfaction' ).\
bar( align = 'mid', color = 'red' ).format( '{:.1%}' )

**_Code Explanation_**

The *normalize = True* argument divides the count for each satisfaction level by the sum of the counts to give the proportion of *CID*s at each level.  These proportions should sum to 1.0. 
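
A small made-up example shows the difference between the raw counts and the normalized proportions.  The ratings below are invented for illustration.

```python
import pandas as pd

ratings = pd.Series( [ 5, 4, 4, 3, 5, 5, 1, 4 ] )   ## made-up satisfaction ratings
##
## Raw counts and proportions, both sorted by the scale value
##
counts = ratings.value_counts().sort_index()
props = ratings.value_counts( normalize = True ).sort_index()
print( counts )
print( props )
```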

In [None]:
##
## ===> Step 4: Recode the scale values so that 1 is the top-two values <===
## (called "top-two box" or "T2B") and 0 is all other values.  
## The "T2B" is "Very Satisfied".
##
## Define a lambda function for the recoding
##
df_sat[ 'sat_t2b' ] = df_sat.buyerSatisfaction.apply( lambda x: 1 if ( x >= 4 ) else 0 )
##
data = df_sat[ 'sat_t2b' ].value_counts( normalize = True ).round( 3 )
data = pd.DataFrame( data )
data.rename( index = { 1:'T2B', 0:'B3B' }, inplace = True )
pd.DataFrame( data ).style.set_caption( 'Buyer Satisfaction' ).\
bar( align = 'mid', color = 'red' ).format( '{:.1%}' )

Model *T2B* satisfaction as a function of the pocket price, discounts, and Region.  First, create training and testing DataFrames as before but with *sat_t2b* as the *y* variable.

In [None]:
##
## ===> Step 5: Split the Data. <===
##
## Create the X and y data for splitting
##
y = df_sat[ 'sat_t2b' ]
lst = [ 'Pprice', 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc', 'Region' ]
X = df_sat[ lst ]
##
## Split the data.  Use about 1/3 for testing (the sklearn default is 0.25).
##
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.33, 
                                                    random_state = 42 )

In [None]:
##
## Display some data: training data
##
print("Sample sizes: \nX Training: {}, y Training: {}\n".format( X_train.shape[ 0 ], y_train.shape[ 0 ] ) )
print( '\nX Training Data: \n{}'.format( X_train.head() ) )
print( '\ny Training Data: \n{}'.format( y_train.head() ))

In [None]:
##
## Display some data: testing data
##
print("Sample sizes: \nX Testing: {}, y Testing: {}\n".format( X_test.shape[ 0 ], y_test.shape[ 0 ] ) )
print( '\nX Testing Data: \n{}'.format( X_test.head() ) )
print( '\ny Testing Data: \n{}'.format( y_test.head() ))

**_Interpretation_**

Notice that the respective sample sizes sum to 779 (= 521 Train + 258 Test) which is the total unique *CID*s from before.
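
The two sample sizes can be checked with simple arithmetic.  The sketch assumes that *train_test_split* rounds the testing count up when *test_size* is a fraction and gives the remainder to training; that is my understanding of its behavior, so treat it as an assumption.

```python
import math

n_samples = 779     ## total unique CIDs
test_size = 0.33    ## fraction requested for testing
##
## Assumed rounding rule: round the testing count up
##
n_test = math.ceil( test_size * n_samples )
n_train = n_samples - n_test
print( 'Train: {}, Test: {}'.format( n_train, n_test ) )
```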

In [None]:
## 
## Merge the two training sets
##
yy = pd.DataFrame( { 'sat_t2b':y_train } )
train = yy.merge( X_train, left_index = True, right_index = True )
train.head().style.set_caption( 'Training Data' )

In [None]:
##
## Check the shape of the merged training data
##
print( 'Training Sample size:\n {}'.format( train.shape[ 0 ] ) )

In [None]:
## 
## Merge the two testing sets
##
yy = pd.DataFrame( { 'sat_t2b':y_test } )
test = yy.merge( X_test, left_index = True, right_index = True )
test.head().style.set_caption( 'Testing Data' )

In [None]:
##
## Check the shape of the merged testing data
##
print( 'Testing Sample size:\n{}'.format( test.shape[ 0 ] ) )

###### Case II Train a Model

[Back to Contents](#Contents)

In [None]:
##
## Train a logit model
##
## ===> Step 1: Define a formula <===
##
formula = 'sat_t2b ~ Pprice + Ddisc + Odisc + Cdisc + Pdisc + C( Region )'
##
## ===> Step 2: Instantiate the logit model <===
##
mod = smf.logit( formula, data = train )
##
## ===> Step 3: Fit the instantiated model <===
##
logit01 = mod.fit()
##
## ===> Step 4: Summarize the fitted model <===
##
print( logit01.summary() )

###### Case II Predict with the Model

[Back to Contents](#Contents)

The prediction process is the same as discussed for *Case I* above, but now you can compare actuals and predicted using a *confusion matrix*.

In [None]:
##
## Make predictions
##
predictions = logit01.predict( test )  ## test is the testing dataset
predictions_nominal = [ 0 if x < 0.5 else 1 for x in predictions ]
rpt = classification_report( y_test, predictions_nominal, digits = 3 )
print( 'Logit Model Classification Report:\n{}'.format( rpt ) )

**_Interpretation_**

To quote from [here](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8):

> *The precision is the ratio $tp/(tp + fp)$ where $tp$ is the number of true positives and $fp$ the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.*
>
> *The recall is the ratio $tp/(tp + fn)$ where $tp$ is the number of true positives and $fn$ the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.*
>
> *The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.*
>
> *The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.*
>
> *The support is the number of occurrences of each class in y_test.*

Also see [here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) for more detail.

For binary classification, the count of **true negatives** ($tn$), **false negatives** ($fn$), **true positives** ($tp$), and **false positives** ($fp$) can be found from a *confusion matrix*.

In [None]:
##
## Create a confusion matrix
##
cm = confusion_matrix( y_test, predictions_nominal ).ravel()
##
## Labels in the order returned by ravel(): tn, fp, fn, tp
##
lbl = [ 'True Negative', 'False Positive', 'False Negative', 'True Positive' ]
##
## Display the confusion matrix in a DataFrame
##
df_confusion = pd.DataFrame( list( zip( lbl, cm ) ), columns = [ 'Label', 'Value' ] )
df_confusion[ 'Percent' ] = round( ( df_confusion.Value/df_confusion.Value.sum() )*100, 1 )
df_confusion.style.set_caption( 'Confusion Matrix' ).bar( align = 'mid', color = 'red' )

In [None]:
##
## Plot the confusion values
##
ax = sns.barplot( y = df_confusion.Label, x = df_confusion.Percent )
ax.set( title = 'Percent of Sample\nby Confusion Labels',
        xlabel = 'Percent Confusion', ylabel = '' );

**_Interpretation_**

There were:

> 1 true negative
>
> 81 false positives
>
> 3 false negatives
>
> 173 true positives.

Alternative plot of the confusion matrix.

In [None]:
##
## Create labels
##
lbl = ['Not Satisfied', 'Satisfied']
##
## Create the confusion matrix
##
cm = confusion_matrix( y_test, predictions_nominal )
data = pd.DataFrame( data = cm, index = lbl, columns = lbl )
print( 'Confusion Matrix: \n{}'.format( data ) )
##
## Plot the confusion matrix
##
sns.set( font_scale = 1.4 )   #for label size
##
ax = sns.heatmap( cm/cm.sum(), annot = True, annot_kws = { "size": 16 } )  # font size
ax.set( title = 'Confusion Matrix for the Classifier', xlabel = 'Predicted',
       ylabel = 'True' )
ax.set_xticklabels( lbl )
ax.set_yticklabels( lbl );

**_Interpretation_**

67% of the cases were predicted correctly.
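
The 67% figure, and the precision and recall for the *Satisfied* class, can be reproduced from the four counts reported above.  This is plain arithmetic on those counts; the variable names are just for illustration.

```python
##
## Counts reported for the confusion matrix
##
tn, fp, fn, tp = 1, 81, 3, 173
total = tn + fp + fn + tp
##
## Standard summary measures
##
accuracy = ( tp + tn )/total
precision = tp/( tp + fp )
recall = tp/( tp + fn )
f1 = 2*precision*recall/( precision + recall )
specificity = tn/( tn + fp )    ## share of actual negatives predicted correctly
print( 'Accuracy: {:.3f}\nPrecision: {:.3f}\nRecall: {:.3f}\nF1: {:.3f}\nSpecificity: {:.3f}'.format(
        accuracy, precision, recall, f1, specificity ) )
```

The very low specificity suggests the model predicts *Satisfied* for almost every customer, so the accuracy is close to the share of satisfied customers in the testing data (176 of 258).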

##### Case III Constants: Decision Trees

[Back to Contents](#Contents)

Decision Trees can handle continuous or discrete dependent variables.  They are an alternative to *OLS* and logistic regression: you don't have to specify a "model" *per se*.  They also have the advantage that a visual display, a *tree*, is produced which is easier for management and clients to understand than complex regression output and statistics.  You will only look at a discrete case; a continuous case is the same.

In [None]:
slide( '108' )

In [None]:
slide( '109' )

In [None]:
slide( '110' )

In [None]:
slide( '111' )

In [None]:
slide( '112' )

In [None]:
slide( '113' )

In [None]:
##
## Instantiate LabelEncoder for Region
##
labelencoder = LabelEncoder()
##
## Convert "Region" to integers: the decision tree must have all numerics
## Note 1: use the LabelEncoder function for this
## Note 2: "Region" will be encoded in alphanumeric order:
##
##          0: Midwest
##          1: Northeast
##          2: South
##          3: West
##
## Encode the Region labels in the training and testing data sets using Label Encoder
## First copy the DataFrames
##
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()
y_train_copy = y_train.copy()
y_test_copy = y_test.copy()
##
## Label encode
##
print( '\nTraining Data Before Encoding Region:\n \n{}'.format( X_train_copy.head() ) )
## Fit the encoder on the training data and encode Region
X_train_copy[ 'Region_encoded' ] = labelencoder.fit_transform( X_train_copy[ 'Region' ] )
print( '\nTraining Data After Encoding Region:\n{}'.format( X_train_copy.head() ) )
##
## drop Region since Region_encoded will be used
##
X_train_copy = X_train_copy.drop( columns = ['Region'] )
print( '\nTraining Data After Dropping Region:\n{}'.format( X_train_copy.head() ) )
##
## Apply the fitted label encoder to the test DataFrame.  Use transform(),
## not fit_transform(), so the testing codes match the training codes.
##
print( '\n\n' )
print( '\nTesting Data Before Encoding Region:\n \n{}'.format( X_test_copy.head() ) )
X_test_copy[ 'Region_encoded' ] = labelencoder.transform( X_test_copy[ 'Region' ] )
print( '\nTesting Data After Encoding Region:\n{}'.format( X_test_copy.head() ) )
X_test_copy = X_test_copy.drop( columns = [ 'Region' ] )
print( '\nTesting Data After Dropping Region:\n{}'.format( X_test_copy.head() ) )
##
## Instantiate the tree
## Specify:
##    - Depth: 3 levels
##    - Minimum sample size for a leaf: 5
##
dtree = tree.DecisionTreeClassifier( random_state = 0, max_depth = 3, 
                                    min_samples_leaf = 5 )
##
## Fit the tree
##
dtree.fit( X_train_copy, y_train_copy )

###### Case III Check Model Accuracy

[Back to Contents](#Contents)

In [None]:
##
## Check accuracy scores
##
print( "Accuracy on training data: {:.3f}".format( dtree.score( X_train_copy, y_train_copy ) ) )
print( "Accuracy on testing data: {:.3f}".format( dtree.score( X_test_copy, y_test_copy ) ) )

These are good scores.

###### Case III Display the Tree

[Back to Contents](#Contents)

In [None]:
##
## Displaying a tree is a slight challenge!
## There are four steps:
##
## ===> Step 1: Create a placeholder for all the plotting points.
##
dot_data = StringIO()
##
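## ===> Step 2: Extract the feature names for the node labels
##      (X_train has the same column order as X_train_copy, with
##      Region in place of Region_encoded)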
##
feature_names = [ i for i in X_train.columns ]
##
## ===> Step 3: Export the plotting data to the placeholder
##
export_graphviz( dtree, out_file = dot_data,  
                filled = True, rounded = True,
                special_characters = True,
                class_names = [ 'Not Satisfied', 'Satisfied' ],
                feature_names = feature_names ,
                proportion  = True
               )
##
## ===> Step 4: Create the display
##
graph = pydotplus.graph_from_dot_data( dot_data.getvalue() )  
Image( graph.create_png() )

**_Interpretation_**

For the right node on the third level, "$Region \le 2.5$" is interpreted as *Region* having a value less than or equal to 2.5.  Since $Region = 0$, $Region = 1$, and $Region = 2$ meet this criterion and $Region = 3$ is the West, then if *Region* is Midwest/Northeast/South, go to the left; otherwise, go to the right for the West. 
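
The mapping between the integer codes and the region names can be recovered from a fitted label encoder.  The sketch below fits a fresh encoder on made-up region values, so the names *encoder*, *mapping*, *left*, and *right* are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

regions = [ 'West', 'South', 'Midwest', 'Northeast', 'South' ]   ## made-up values
encoder = LabelEncoder().fit( regions )
##
## classes_ holds the labels in sorted order; the position is the integer code
##
mapping = { code: str( name ) for code, name in enumerate( encoder.classes_ ) }
##
## Apply the split rule Region <= 2.5
##
left = [ name for code, name in mapping.items() if code <= 2.5 ]
right = [ name for code, name in mapping.items() if code > 2.5 ]
print( 'Codes: {}\nLeft branch: {}\nRight branch: {}'.format( mapping, left, right ) )
```

Keep in mind that the codes only reflect the sorted order of the labels; the tree treats them as ordered numbers, so a split can only separate regions that are adjacent in that ordering.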

#### Exercise IV.6

[Back to Contents](#Contents)

Interpret the decision tree.

## Lesson V Summary and Wrap-up

[Back to Contents](#Contents)

In [None]:
slide( '116' )

## Contact Information

[Back to Contents](#Contents)

If you have any questions after this course, please do not hesitate to contact me.

In [None]:
slide( '118' )

## Appendix

[Back to Contents](#Contents)

This Appendix contains material extra to this lesson, material that you may want to review to solidify your understanding and knowledge about working with Python and Pandas.

This Appendix covers:

1. Jupyter Notebook: Overview;
2. Importing data into Pandas;
3. Checking your Data;
4. Manipulating columns of a Pandas DataFrame;
5. Correlation Analysis;
6. Data Visualization; and 
7. OLS Modeling.


### Appendix I.1 Jupyter Notebooks: Overview

[Back to Contents](#Contents)

Jupyter is a sophisticated programming tool sometimes called an *ecosystem*.  It is an ecosystem because it will handle a number of programming languages, Python being one of them.  The languages are handled by _kernels_.  Other kernels are R, Julia, and Fortran, to mention a few.  There are over 100 kernels supported by Jupyter.  Jupyter was originally created to handle Julia, Python, and R.  In fact, the name "Jupyter" is a contraction of **JU**lia, **PYT**hon, and **R**.

The paradigm for Jupyter is a lab notebook used in the physical sciences for laboratory experiment documentation.  The Jupyter notebook consists of cells where text and programming code are entered and executed or *run*.  There are two basic cells:

1. Code cell; and 
2. Markdown cell.

A code cell is where programming code is entered and executed.  The results are displayed in an output area that appears immediately below the code cell.  To organize the code and result, the pair is labeled with a marker for the code and a marker for the result.  The code marker is In[] and the result marker is Out[].  Inside the square brackets are numbers representing the sequence in which code is executed.  The numbers will vary, and even be out of order, if a code cell is executed several times but a following code cell is not re-executed.  The sequencing numbers can be reset by restarting the kernel and rerunning the cells.  This is done by clicking on *Kernel* on the main toolbar and selecting *Restart & Run All*.  

The markdown cell is where documentation is entered.  This cell, in fact, is a markdown cell.  It is recommended that you use as many markdown cells as possible to document your work.  You will see many examples of this below.

A code cell is the default.  You make a code cell a markdown cell by using the drop-down menu list on the main toolbar.  Of course, you can change a markdown cell to a code cell using the same drop-down list.  You could also use keyboard short-cuts.  See *Help/Keyboard Shortcuts* on the main menu bar. 

You can easily insert/delete cells and move them from one location in your notebook to another.  To insert a cell above the current cell, select the cell as the base for the insert and click on the colored bar on the left until it turns blue.  Then type *A* to insert a cell above the current cell.  Type *B* to insert one below it.  Type *DD* (two *D*'s) to delete the current cell.  To move a cell, select the cell you want to move, click the colored bar on the left, and use the up-arrow and down-arrow icons on the main menu toolbar.  

### Appendix I.2 Different Ways to Import Data Into Pandas

[Back to Contents](#Contents)

You could also define the path without the file name if you plan to use multiple files from the same directory.  You just have to define the general path and then concatenate the file name.  Here is an example.  Note:

1. the "+" which instructs Python to concatenate the two strings;
2. the quotes around the file name; and
3. the forward slash as the last symbol in the path definition.
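
The string concatenation can be checked without reading any file.  The sketch also shows *pathlib* from the Python standard library as an alternative that supplies the separator for you; the alternative is my suggestion, not something this course relies on.

```python
from pathlib import Path

##
## General path; note the forward slash as the last symbol
##
path = r'../Data/furniture/final data files/'
##
## Concatenate the path and the file name with "+"
##
full_name = path + 'orders.csv'
##
## Alternative: pathlib inserts the separator itself
##
alt_name = ( Path( path ) / 'orders.csv' ).as_posix()
print( full_name )
print( alt_name )
```

Either string can be passed to *pd.read_csv*.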

In [None]:
##
## Specify a path to the CSV data and concatenate the file name.
##
## Not run
##
## path = r'../Data/furniture/final data files/'
## df_orders = pd.read_csv( path + 'orders.csv' )
## df_orders.head()

Use the following form if your data are in the same directory as your notebook.

In [None]:
##
## Do not specify a path if the CSV file is in the same directory
## as your notebook.
##
## Not run
##
## df_orders = pd.read_csv( 'orders.csv' )

If your data are in an Excel file, use *pd.read_excel( filename, 'sheetname' )*.  The sheet name could be a character string or a sheet number as an integer.

In [None]:
##
## Example: Import an Excel file
##
## Not run
##
## path = r'../Data/furniture/final data files/'
## df_orders = pd.read_excel( path + 'orders.xlsx', 'furniture' )
## df_orders.head()

### Appendix I.3 Some Additional Information on Checking Your Data

[Back to Contents](#Contents)

Look at:

1. the head of your data;
2. the shape of your DataFrame; 
3. a list of column names; and
4. the missingness of your data.

#### Task \#1 Display the First Few Records of Your DataFrame

[Back to Contents](#Contents)

In [None]:
##
## Example: list the head of the imported data
## n = 5 is the default
##
df_orders.head( )

You should immediately notice that several discounts have missing values indicated by *NaN* (*Not a Number*).

#### Task \#2 Check the Shape of Your DataFrame

[Back to Contents](#Contents)

In [None]:
##
## Example: check the shape of the imported data
## Note 1: the order of the returned shape is always #rows, #columns
## Note 2: there is no () for this command because the shape
## is a DataFrame attribute
##
print( "Shape of the DataFrame: {}".format( df_orders.shape ) )

There are 70,270 rows or observations and 15 columns or variables.

#### Task \#3 Check the Column Names in Your DataFrame

[Back to Contents](#Contents)

In [None]:
##
## Example: list the column names
##
df_orders.columns

In [None]:
##
## Remove white spaces in the column names
## White spaces are not an issue here.  This is 
## just illustrative.
##
df_orders.columns = df_orders.columns.str.strip()
df_orders.columns

#### Task \#4 Check for Missing Data in Your DataFrame

[Back to Contents](#Contents)

In [None]:
##
## Use the DataFrame's info() method.  It returns the number
## of non-missing values for each variable.  The counts are the
## same for all variables only if no data are missing.
##
df_orders.info()

The above listing indicates that the four discounts each have missing values. 

You can count the number of missing values using the *isnull()* method and chaining the *sum()* function.  The *isnull()* method returns a Boolean variable so the *sum()* function just adds 0 and 1 values.  Use a nice print statement for clarity.

In [None]:
##
## Sum the Boolean variable returned by isnull()
## Let us check the Order Discount (Odisc).
##
x = df_orders.Odisc.isnull().sum()
print( 'Missing count for Odisc: {}'.format( x ) )

You can check the proportion of each variable that is missing rather than the sum. Proportions are more meaningful. Use the *mean()* function for this.  Since *isnull()* returns a Boolean, the mean is just the proportion.
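
A tiny made-up DataFrame shows why the sum gives a count and the mean gives a proportion: each *True* is treated as 1 and each *False* as 0.

```python
import numpy as np
import pandas as pd

##
## Made-up discounts with some missing values
##
toy = pd.DataFrame( { 'Odisc': [ 0.05, np.nan, 0.03, np.nan ],
                      'Pdisc': [ 0.02, 0.01, np.nan, 0.04 ] } )
n_missing = toy.isnull().sum()    ## count of missing values per column
p_missing = toy.isnull().mean()   ## proportion of missing values per column
print( n_missing )
print( p_missing )
```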

In [None]:
##
## Check for missing values only for the discounts.
## Chain the mean() function to the isnull() method.
## Note: Do this for first 20 records for illustration only.
##
lst = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
df_orders[ lst ].iloc[ :20 ].isnull().mean()

Create a heatmap of missing data using the *isnull()* method on the entire DataFrame.  Use the transpose attribute, *T*, for a more readable chart.

In [None]:
##
## Note: the "cbar = False" argument turns off the color bar
## Note: Do this for first 20 records for illustration only
##
lst = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
ax = sns.heatmap( df_orders[ lst ].iloc[ :20, : ].isnull().T, cbar = False )
ax.set( title = 'Heatmap of Missing Data' );

### Appendix I.4 Miscellaneous Pandas DataFrame Column Manipulations

[Back to Contents](#Contents)

#### Deleting Columns

[Back to Contents](#Contents)

You can easily delete unwanted columns with the *drop* method.  You must specify if you want the DataFrame replaced or not; the default is to not replace in which case you must assign a new name to the modified DataFrame.  If you drop a column, it is good practice to not replace your DataFrame so that you preserve your original data.

The *drop* method can be used to drop rows or columns, so you have to tell it which one.  This is done with the *axis* argument.  The DataFrame, as a simple rectangular array, is said to have two *axes*: the row axis and the column axis.  Since in mathematics the size of a matrix is conventionally specified as $\#rows \times \#columns$ (rows always come before columns), the DataFrame axes are designated as 0 and 1 for rows and columns, respectively, since 0 comes before 1.  Specifying *axis = 0* in the *drop* method says to drop a row while specifying *axis = 1* says to drop a column. The default is *axis=0*. 
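
The effect of the *axis* argument can be seen on a small made-up DataFrame.  Neither call uses *inplace = True*, so the original DataFrame is preserved.

```python
import pandas as pd

toy = pd.DataFrame( { 'obs': [ 1, 2, 3 ], 'Pprice': [ 2.5, 3.0, 4.5 ] } )   ## made-up data
##
## axis = 1 drops a column by name; axis = 0 drops a row by index label
##
no_col = toy.drop( 'obs', axis = 1 )
no_row = toy.drop( 0, axis = 0 )
print( no_col )
print( no_row )
```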

In [None]:
##
## Drop a column and replace the DataFrame.  Notice that axis = 1 is used
## to drop a column.
## 
## Not Run
##
## df_orders.drop( 'obs', axis = 1, inplace = True )
## df_orders.head()

### Appendix I.5 Correlation Analysis

[Back to Contents](#Contents)

There is a correlation method, *corr*, attached to a DataFrame.

In [None]:
##
## Example: display a correlation matrix of the discounts
##
lst = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
df[ lst ].corr().round( 3 )

**_Interpretation_**

The correlation matrix has a 1.0 in the cells along its main diagonal (the diagonal running from top left to bottom right).  The off-diagonal cells have the pair-wise correlations.  Notice that the correlation matrix is symmetric around the main diagonal: the top portion (called the *upper triangle*) matches the bottom portion (called the *lower triangle*).
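
Both properties, the ones on the main diagonal and the symmetry, can be verified numerically.  The sketch uses randomly generated data, not the course data.

```python
import numpy as np
import pandas as pd

##
## Made-up data: 100 rows of three random variables
##
rng = np.random.default_rng( 42 )
toy = pd.DataFrame( rng.normal( size = ( 100, 3 ) ), columns = [ 'a', 'b', 'c' ] )
cor = toy.corr()
##
## Check the main diagonal and the symmetry
##
diag_is_one = np.allclose( np.diag( cor.values ), 1.0 )
is_symmetric = np.allclose( cor.values, cor.values.T )
print( 'Diagonal all 1.0: {}\nSymmetric: {}'.format( diag_is_one, is_symmetric ) )
```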

Rounding to three decimal places is usually sufficient.

The correlations are all very low indicating that the discounts are not linearly associated.

A *heatmap* is sometimes more effective for displaying a correlation matrix. 

In [None]:
##
## Example: plot the correlation matrix as a heatmap
##
lst = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
cor = df[ lst ].corr()
ax = sns.heatmap( cor )
ax.set( title = 'Heatmap of the Correlation Matrix' );

**_Interpretation_**

The cells along the main diagonal are all white which, by the color bar on the right, indicates they are all 1.0 as they should be.  All other cells are black indicating that the correlations are all close to 0.0.

### Appendix II.1 Data Visualization

[Back to Contents](#Contents)

This Appendix contains material extra to this lesson, material that you may want to review to solidify your understanding and knowledge about working with Python, Pandas, and Seaborn for Data Visualization.

This Appendix covers:

1. Additional Histogram Methods;
2. Additional Boxplot Methods;
3. Additional Scatter Plot Methods; and
4. Additional Time Series Plot Methods.

#### Additional Histogram Methods

[Back to Contents](#Contents)

You can add a *rug plot* to the bottom of the histogram to show each observation.  This is helpful to show where the data are for each bar in the histogram.  This, of course, is not practical for large data sets since the rug would just be a dense, black bar at the bottom of the graph.  
<br>
You can also remove the *KDE* curve for a better visualization of the distribution.

In [None]:
##
## Add a rug and remove the KDE
##
ax = sns.distplot( np.log1p( df.Usales), kde = False, rug = True )
ax.set( title = "Unit Sales Distribution: Log Scale", 
       xlabel = 'Unit Sales (Natural Log)', 
       ylabel = 'Count' );

You can display just the *KDE* curve for a cleaner view of the distribution.

In [None]:
##
## KDE only
##
ax = sns.distplot( np.log1p( df.Usales), hist = False )
ax.set( title = "Unit Sales Distribution: Log Scale", 
       xlabel = 'Unit Sales (Natural Log)', 
       ylabel = 'Density' );

#### Additional Boxplot Methods 

[Back to Contents](#Contents)

You can examine the discounts by the customer loyalty status.

In [None]:
##
## Total discount distribution by regions and Loyalty Program
## members
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', hue = 'loyaltyProgram', data = df )
ax.set( title = 'Distribution of Total Discount by Region \n and \n Loyalty Program',
       ylabel = 'Total Discount' );

In [None]:
##
## Another view of total discount distribution by Regions and Loyalty Program
## members
##
ax = sns.catplot( x = 'Tdisc', y = 'loyaltyProgram', row = 'Region',
                kind = 'box', orient = 'h', height = 1.5, aspect = 4,
                data = df )
ax.set( xlabel = 'Total Discount', ylabel = 'Loyalty Program\nMember' );

It should be disturbing that the discounts are the same whether a customer is in the loyalty program or not.  Members should have bigger discounts.  What about how they are rated?

In [None]:
##
## Total discount distribution by regions and buyer rating
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', hue = 'buyerRating', data = df )
ax.set( title = 'Distribution of Total Discount by Region \n and \n Buyer Rating', 
       ylabel = 'Total Discount' );

Loyalty and good ratings are not rewarded.

#### Additional Scatter Plot Methods 

[Back to Contents](#Contents)

This Appendix section covers:

1. Categorical Variable
2. Panel Plot
3. Combining Scatter Plots and Histograms
4. Pairwise Scatter Plots
5. Contour Plots and Density Functions

##### Categorical Variable

[Back to Contents](#Contents)

You can add a third variable that is categorical to show relationships across groups.  This is done with the "hue" argument which colors the points by category.

In [None]:
##
## Add Loyalty Program membership
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'loyaltyProgram', 
                 data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', 
       ylabel = 'Log Unit Sales' );

In [None]:
##
## Add Region
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'Region', data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' );

##### Panel Plot

[Back to Contents](#Contents)

In [None]:
##
## Add Loyalty Program membership
## A less cluttered view with panels
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'loyaltyProgram', 
                 col = 'Region', col_wrap = 2,
                 data = df )
ax.set( xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' );

**_Interpretation_**

Notice the gap between 17 and 19 in the Northeast.

##### Combining Scatter Plots and Histograms

[Back to Contents](#Contents)

You can combine scatter plots with histograms for each variable.

In [None]:
##
## Add histograms to the margins
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = df );

##### Pairwise Scatter Plots

[Back to Contents](#Contents)

You can also plot multiple variables in pair-wise combinations.

In [None]:
##
## Use the Seaborn pairwise function
## Full sample
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
##
## We know there are missing values for the discounts.
## Missing values are not handled well with Seaborn histograms.
## So drop all records with any missing data.
##
tmp = df[ x ].copy()
tmp.dropna( inplace = True )
sns.pairplot( tmp[ x ] );
##
## Warning -- this will take a few minutes
##

**_Interpretation_**

Unfortunately, this particular plot is clearly not useful because the data set is large; we have a case of *Large-N*.  So how is this handled?  Try a random sample as in the next example.

In [None]:
##
## Pairwise plot
##
## Random sample, n = 500
## Drop records with missing discounts, as in the full-sample plot
##
smpl = df.sample( n = 500, random_state = 1234 )
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
sns.pairplot( smpl[ x ].dropna() );

**_Interpretation_**

This is not much better.  Maybe a smaller sample will work.  You can try this on your own.  A contour or hex bin plot might be better.

##### Contour Plots with Density Functions

[Back to Contents](#Contents)

In [None]:
##
## Contour plot with marginal distributions
## Random sample, n = 500
##
## Warning -- this will take a minute
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = smpl, kind = 'kde' );

**_Interpretation_**

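The scatter points are replaced by density contours, and the marginal histograms by density curves.  Hex binning, shown next, is another way to display the same information.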

In [None]:
##
## Hex binning
##
## Random sample, n = 500
##
## Note: A white background is best for this 
## Note: The plot element colors can be set:
##   b:blue, g:green, r:red, c:cyan,
##   m:magenta, y:yellow, k:black, w:white.
##
## Warning -- this will take a minute
##
with sns.axes_style( 'white' ):
    ax = sns.jointplot(x = 'log_Pprice', y = 'log_Usales', data = smpl, 
                       kind = 'hex', color = 'k' );

In [None]:
##
## Add a regression line
##
## Full data sample
##
## Warning -- this will take a minute
##
with sns.axes_style("white"):
    g = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                      kind = 'hex', color = 'k',
                      joint_kws={'gridsize':40, 'bins':'log'} )
    ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                     ax = g.ax_joint, scatter = False, color = "yellow" )
    ax.set( xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' );

### Additional Time Series Plot Methods 

[Back to Contents](#Contents)

In [None]:
##
## Time series plot for Southern Region
##
lst = [ 'Tdate', 'Ddisc' ]
data = df.loc[ df.Region == 'South', lst ].copy()
##
## Reset the index to the date
##
data.Tdate = pd.to_datetime( data.Tdate )
data.set_index( 'Tdate', inplace = True )
grp = data.resample( 'M' ).mean()
##
## Create a Month variable from the index
##
grp[ 'x' ] = grp.index
grp[ 'Month' ] = grp.x.dt.month
print( grp.head() )
##
ax = grp.plot( y = 'Ddisc' , legend = False )
ax.set( title = 'Dealer Discount\nMonthly\nSouthern Region', ylabel = 'Dealer Discount', xlabel = 'Months' );

### Appendix III.1 Extra Material for Predictive Modeling

[Back to Contents](#Contents)

#### Check OLS Model for Multicollinearity

[Back to Contents](#Contents)

Multicollinearity is a major issue with high-dimensional datasets.  A high level of multicollinearity can negatively impact estimation results.  It can be checked for using:

1. a correlation matrix; or
2. a variance inflation factor (VIF) measure.  A rule-of-thumb is that any $VIF > 10$ indicates a problem.

This material assumes that the OLS model for Case I was estimated [here](#Case-I-Continuous-Dependent-Variable:-OLS-Regression).

In [None]:
##
## Create the correlation matrix
## 
## Subset the design matrix to eliminate the first column of 1s
## the iloc method says to find the location of columns based on 
## their integer locations (i.e., 0, 1, 2, etc.)
## the term in brackets says to find all rows (the : ) and all 
## columns from the first to the end (1: )
##
data = reg01.model.data.orig_exog.iloc[ :, 1: ] 
corr_matrix = data.corr()
corr_matrix

In [None]:
##
## Graph the correlation matrix
##
sns.heatmap( corr_matrix ).set_title( 'Heatmap of the Correlation Matrix' );

In [None]:
## 
## A fancy version of the heatmap
## Based on: https://stackoverflow.com/questions/39409866/correlation-heatmap
##
cmap = sns.diverging_palette( 5, 250, as_cmap = True )
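##
## Note: Styler.set_precision was removed in pandas 2.0; format( precision = ... )
## is its replacement (pandas 1.3+).  There is no trailing semicolon so that the 
## styled table is displayed.
##
corr_matrix.style.background_gradient( cmap, axis = 1 ).format( precision = 1 )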

In [None]:
##
## Calculate VIFs
##
## The VIFs are the diagonal elements of the inverted correlation
## matrix of the independent variables.
##
## Subset the design matrix to eliminate the first column of 1s.
## The iloc method says to find the location of columns based on their 
## integer locations (i.e., 0, 1, 2, etc.) the term in brackets says 
## to find all rows (the : ) and all columns from the first to the end (1: ).
##
## Create the correlation matrix
##
data = reg01.model.data.orig_exog.iloc[ :, 1: ]
corr_matrix = data.corr()
##
## Invert the correlation matrix and extract the main diagonal
##
vif = np.diag( np.linalg.inv( corr_matrix ) ) 
##
## Zip the variable names and the VIFs
##
indepvars = [ i for i in data.columns ]
xzip = zip( indepvars, vif ) 
##
## Display the zip matrix.
##
pd.DataFrame( list( xzip ), columns = [ 'Variable', 'VIF' ] )

**_Interpretation_**

The *VIF*s are all below 10 so there is no problem.  $VIF > 10$ is a rule-of-thumb for indicating multicollinearity.
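
To see why the diagonal of the inverted correlation matrix gives the *VIF*s, here is a minimal sketch on simulated data.  The variables `x1`, `x2`, `x3` and the helper `vif_aux` are made up for illustration and are not part of the course data.  The sketch compares the shortcut with the textbook definition $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ comes from regressing variable $j$ on the other independent variables.

```python
import numpy as np

## Simulated independent variables (illustrative only)
rng = np.random.default_rng( 42 )
n = 500
x1 = rng.normal( size = n )
x2 = rng.normal( size = n )
x3 = x1 + 0.1 * rng.normal( size = n )   ## nearly collinear with x1
X = np.column_stack( [ x1, x2, x3 ] )

## Shortcut: diagonal of the inverted correlation matrix
corr = np.corrcoef( X, rowvar = False )
vif_inv = np.diag( np.linalg.inv( corr ) )

## Definition: regress column j on the other columns, then 1 / ( 1 - R2 )
def vif_aux( X, j ):
    y = X[ :, j ]
    others = np.delete( X, j, axis = 1 )
    A = np.column_stack( [ np.ones( len( y ) ), others ] )
    beta, *_ = np.linalg.lstsq( A, y, rcond = None )
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / ( 1 - r2 )

vif_reg = np.array( [ vif_aux( X, j ) for j in range( X.shape[ 1 ] ) ] )

print( 'VIF (inverse correlation): {}'.format( np.round( vif_inv, 2 ) ) )
print( 'VIF (auxiliary regressions): {}'.format( np.round( vif_reg, 2 ) ) )
```

The two sets of values should agree.  In this simulation `x1` and `x3` are built to be nearly collinear, so their *VIF*s should land well above the rule-of-thumb of 10 while the one for `x2` stays near 1.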

#### Case I Model Portfolio

[Back to Contents](#Contents)

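A model portfolio collects the estimates and fit statistics for one or more models in a single table, which makes the models easy to summarize and compare.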

In [None]:
##
## Import some packages
##
from statsmodels.iolib.summary2 import summary_col
from statsmodels.stats.api import anova_lm
##
## Create a variable to hold the model names; this is a list.
## Note: the range() function specifies 1 - 2 but the "2" is
## not included.
##
model_names = [ 'Model ' + str( i ) for i in range( 1, 2 ) ]
##
## Create a variable to hold the statistics to print; this is a dictionary.
##
info_dict = { '\nn': lambda x: "{0:d}".format( int( x.nobs ) ),
              'R2 Adjusted': lambda x: "{:0.3f}".format( x.rsquared_adj ),
              'AIC': lambda x: "{:0.2f}".format( x.aic ),
              'F': lambda x: "{:0.2f}".format( x.fvalue ),
}
##
## Create the portfolio summary table.
##
summary_table = summary_col( [ reg01 ],
            float_format = '%0.2f',
            model_names = model_names,
            stars = True, 
            info_dict = info_dict 
)
summary_table.add_title( 'Summary Table for Living Room Blinds Sales' )
print( summary_table )


#### Case II Model Portfolio

[Back to Contents](#Contents)

In [None]:
model_names = [ 'Model ' + str( i ) for i in range( 1, 2 ) ]
##
## Create a variable to hold the statistics to print; this is a dictionary.
##
info_dict = { '\nn': lambda x: "{0:d}".format( int( x.nobs ) ),
}
##
## Create the portfolio summary table.
##
summary_table = summary_col( [ logit01 ],
            float_format = '%0.2f',
            model_names = model_names,
            stars = True, 
            info_dict = info_dict 
)
summary_table.add_title( 'Summary Table for Living Room Blinds Sales' )
print( summary_table )

### Appendix Complete Data Dictionary

[Back to Contents](#Contents)

| Variable                  | Values                                 | Source       | Mnemonic     |
|---------------------------|----------------------------------------|--------------|--------------|
| Order Number              | Nominal Integer                        | Order Sys    | Onum         |
| Customer ID               | Nominal                                | Customer Sys | CID          | 
| Transaction Date          | MM/DD/YYYY                              | Order Sys   | Tdate        | 
| Product Line ID           | Five rooms of house                    | Product Sys  | Pline        |
| Product Class ID          | Item in line                           | Product Sys  | Pclass       |
| Units Sold                | Number of units per order              | Order Sys    | Usales       |
| Product Returned?         | Yes/No                                 | Order Sys    | Return       |
| Amount Returned           | Number of units                        | Order Sys    | returnAmount |
| Material Cost/Unit        | \$US cost of material                  | Product Sys  | Mcost        |
| List Price                | \$US list                              | Price Sys    | Lprice       |
| Dealer Discount           | \% discount to dealer (decimal)        | Sales Sys    | Ddisc        |
| Competitive Discount      | \% discount for competition (decimal)  | Sales Sys    | Cdisc        |
| Order Size Discount       | \% discount for size (decimal)         | Sales Sys    | Odisc        |
| Customer Pickup Allowance | \% discount for pickup (decimal)       | Sales Sys    | Pdisc        |
| Customer's State          | 50 US states + DC                      | Marketing Sys| State        |
| ZIP Code                  | 5-digit US ZIP (postal) code           | Marketing Sys| ZIP          |
| Marketing Region of Customer | Four US Census Regions              | Marketing Sys| Region       |
| Member of Loyalty Program | Nominal: Yes/No                        | Marketing Sys| loyaltyProgram |
| Rating of Customer        | Nominal: Poor/Good/Excellent           | Marketing Sys| buyerRating |
| Customer Satisfaction Rating | 5 Point Likert Scale: 5 = Very Sat.  | Marketing Sys| buyerSatisfaction |
| Total Discount            | \%                                     | Calculated: Sum of Discounts | Tdisc    |
| Pocket Price              | \$US                                   | Calculated: $Lprice \times (1  - Tdisc)$| Pprice     |
| Revenue                   | \$US  | Calculated: $Usales \times Pprice$ | Rev |
| Net Revenue               | \$US  | Calculated: $(Usales - returnAmount) \times Pprice$  | netRev |
| Lost Revenue              | \$US  | Calculated: $Rev - netRev$    | lostRev |
| Profit Contribution       | \$US  | Calculated: $Rev - Mcost$    | Con |
| Contribution Margin       | \%    | Calculated: $\dfrac{Con}{Rev}$ | CM |


## Exercise Solutions

[Back to Contents](#Contents)

### Solution II.1

Create a table in Markdown mode.

[Return to Exercise II.1](#Exercise-II.1)

| Variable                     | Values                              | Source       | Mnemonic          |
|------------------------------|-------------------------------------|--------------|-------------------|
| Customer ID                  | Nominal                             | Customer Sys | CID               | 
| Customer's State             | 50 US states + DC                   | Marketing Sys| State             |
| ZIP Code                     | 5-digit US ZIP (postal) code        | Marketing Sys| ZIP               |
| Marketing Region of Customer | Four US Census Regions              | Marketing Sys| Region            |


### Solution II.2

[Return to Exercise II.2](#Exercise-II.2)

The *Pocket Price* is the list price less total discounts or total leakages.  It is the amount the business "pockets" and is the amount the customer actually pays.  The pocket price formula is $Pprice = Lprice \times (1  - Tdisc)$.  The *Tdisc* variable was created above.  Calculate the pocket price for the *df_orders* DataFrame and display the first five records for the list price and pocket price.

In [None]:
df_orders[ 'Pprice' ] = df_orders.Lprice*( 1 - df_orders.Tdisc )
##
## Display just the components of Pprice
##
lst = [ 'Lprice', 'Tdisc', 'Pprice' ]
df_orders[ lst ].head()

### Solution II.3

[Return to Exercise II.3](#Exercise-II.3)

Calculate total revenue as $Rev = Usales \times Pprice$ using the *df_orders* DataFrame.

In [None]:
##
## Multiply Unit Sales and Pocket Price
##
df_orders[ 'Rev' ] = df_orders.Usales * df_orders.Pprice
##
## Create a list of unit sales, pocket price, and revenue
##
lst = [ 'Usales', 'Pprice', 'Rev' ]
df_orders[ lst ].head()

### Solution II.4

[Return to Exercise II.4](#Exercise-II.4)

*Contribution* and *contribution margin* are two values financial analysts often examine.  Contribution is comparable to what economists call *profit* but is more restricted in that it just refers to a product without considering any fixed or overhead costs.  Contribution is $Con = Revenue - Material~Cost$ and contribution margin is $CM = \dfrac{Con}{Revenue}$.  Calculate both quantities using the *df_orders* DataFrame.

In [None]:
##
## Contribution: Subtract Material Cost (Mcost from the Data Dictionary) from Revenue
##
df_orders[ 'Con' ] = df_orders.Rev - df_orders.Mcost
##
## Contribution Margin: Divide Contribution by Revenue
##
df_orders[ 'CM' ] = df_orders.Con/df_orders.Rev
##
## Create a list to display
##
lst = [ 'Usales', 'Pprice', 'Mcost', 'Rev', 'Con', 'CM' ]
df_orders[ lst ].head( )

### Solution II.5

[Return to Exercise II.5](#Exercise-II.5)

Some products are returned so another revenue number, *revenue net of returns*, is more meaningful and revealing for business decisions.  Net revenue is
<br><br>
$Net~Revenue = (Unit~Sales - Returns) \times Pocket~Price$.
<br><br>
Calculate net revenue and call it 'netRev'.  Also calculate the loss in revenue due to the returns.  The calculation is $lostRev = Rev - netRev$.  Use the *df_orders* DataFrame.   

In [None]:
##
## Net Revenue: Subtract the amount returned (returnAmount from the Data Dictionary) and 
## multiply by the Pocket Price
##
df_orders[ 'netRev' ] = ( df_orders.Usales - df_orders.returnAmount )*df_orders.Pprice
##
## Lost Revenue: Total revenue less the net revenue
##
df_orders[ 'lostRev' ] = df_orders.Rev - df_orders.netRev
##
## Create a list to display
##
lst = [ 'Rev', 'netRev', 'lostRev' ]
df_orders[ lst ].head()

### Solution II.6

[Return to Exercise II.6](#Exercise-II.6)

There is a third data set: a marketing data set that contains information for each customer on their loyalty program membership, a buyer rating provided by the sales force, and their customer satisfaction rating based on an annual customer satisfaction survey.  The marketing data are in a *csv* file named *marketing.csv*.  You have to:

> 1. import the marketing data into a DataFrame (name it df_marketing) and
> 2. merge the *df_orders_cust* DataFrame and this marketing DataFrame.

The *CID* is the same in both data sets so it is the linking variable.  Name the final merged DataFrame *df*.
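
Before merging the real data, it can help to see what a merge on a linking variable does with a toy example.  This is only a sketch: the two small DataFrames below (`orders` and `marketing`) are invented for illustration and are not the course data.  It uses two optional `pd.merge` arguments, `validate` and `indicator`, to check that the key behaves as expected.

```python
import pandas as pd

## Toy data (illustrative only): several orders per customer,
## one marketing record per customer
orders = pd.DataFrame( { 'CID': [ 1, 1, 2, 3 ], 
                         'Usales': [ 10, 5, 7, 3 ] } )
marketing = pd.DataFrame( { 'CID': [ 1, 2, 4 ], 
                            'loyaltyProgram': [ 'Yes', 'No', 'Yes' ] } )
##
## Inner join (the default): keep only CIDs found in both DataFrames.
## validate raises an error if CID is not unique in the right DataFrame.
##
merged = pd.merge( orders, marketing, on = 'CID', how = 'inner', 
                  validate = 'many_to_one' )
print( merged )
##
## Outer join with an indicator column to see which CIDs did not match
##
check = pd.merge( orders, marketing, on = 'CID', how = 'outer', 
                 indicator = True )
print( check[ '_merge' ].value_counts() )
```

An inner join silently drops customers that appear in only one of the two data sets, so checking the shape and the number of unique *CID*s after the merge, as the solution does, is a sensible habit.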

In [None]:
##
## Import the marketing data
## Name this imported data df_marketing
##
df_marketing = pd.read_csv( path + 'marketing.csv' )
df_marketing.head()

In [None]:
##
## Enter code here to merge the df_orders_cust and df_marketing DataFrames
## Name the new merged DataFrame df
##
df = pd.merge( df_orders_cust, df_marketing, on = 'CID' )
df.head()

In [None]:
##
## Enter code here to check the shape of df
##
df.shape

In [None]:
##
## Enter code here to check number of unique CIDs in df
##
data = df.CID.nunique()
print( 'Number of unique CIDs: {}'.format( data ) )

### Solution II.7

[Return to Exercise II.7](#Exercise-II.7)

Using your merged orders/customers DataFrame, *df*, create a summary statistics display.  What is the skewness of the Total Discount (Tdisc)?

In [None]:
df.describe().T

Total Discount is slightly left skewed since the distance between Q1 and the median is bigger than the distance between Q3 and the median.
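
The quartile comparison can be turned into a quick calculation.  The sketch below uses simulated, deliberately left-skewed data (not the course data) and the quartile-based skewness measure $\dfrac{(Q3 - Q2) - (Q2 - Q1)}{Q3 - Q1}$, sometimes called Bowley skewness.  The variable names are illustrative.

```python
import numpy as np
import pandas as pd

## Simulated left-skewed data (illustrative only)
rng = np.random.default_rng( 1234 )
x = pd.Series( -rng.exponential( size = 10_000 ) )
##
## Quartiles and their distances from the median
##
q1, q2, q3 = x.quantile( [ 0.25, 0.50, 0.75 ] )
lower = q2 - q1      ## distance from Q1 to the median
upper = q3 - q2      ## distance from the median to Q3
##
## Quartile-based skewness: negative means left skewed
##
bowley = ( upper - lower ) / ( q3 - q1 )
##
## Moment-based skewness from pandas for comparison
##
moment_skew = x.skew()
print( 'Q1 to median: {:0.3f}'.format( lower ) )
print( 'Median to Q3: {:0.3f}'.format( upper ) )
print( 'Quartile skewness: {:0.3f}'.format( bowley ) )
print( 'Moment skewness: {:0.3f}'.format( moment_skew ) )
```

On your merged DataFrame you could apply the same steps to `df.Tdisc`, using the quartiles reported by `describe()`.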

### Solution III.1

[Return to Exercise III.1](#Exercise-III.1)

Examine the distribution of pocket price using a histogram.  What can you conclude?  Redo using a log transformation.  Now what do you conclude?

In [None]:
ax = sns.distplot( df.Pprice )
ax.set( title = "Pocket Price Distribution", 
       xlabel = 'Pocket Price',
       ylabel = 'Proportions' );

In [None]:
ax = sns.distplot( np.log1p( df.Pprice) )
ax.set( title = "Pocket Price Distribution\nLog Scale", 
       xlabel = 'Pocket Price (Natural Log)',
       ylabel = 'Proportions' );

### Solution III.2

[Return to Exercise III.2](#Exercise-III.2)

Check the distribution of the pocket price by marketing region, loyalty program membership, and buyer rating. What do you conclude?  A complete Data Dictionary is [here](#Appendix-Complete-Data-Dictionary).

In [None]:
##
## Enter code here for pocket price by region.
##
ax = sns.boxplot( x = "Region", y = "Pprice", data = df )
ax.set( title = 'Pocket Price Distribution\nMarketing Region', 
       xlabel = 'Marketing Region',
      ylabel = 'Pocket Price');

In [None]:
##
## Enter code here for pocket price by loyalty program.
##
ax = sns.boxplot( x = "loyaltyProgram", y = "Pprice", data = df )
ax.set( title = 'Pocket Price Distribution\nLoyalty Program Membership', 
       xlabel = 'Loyalty Program Membership',
      ylabel = 'Pocket Price');

In [None]:
##
## Enter code here for pocket price by buyer rating.
##
ax = sns.boxplot( x = "buyerRating", y = "Pprice", data = df, order = [ 'Poor', 'Good', 'Excellent' ] )
ax.set( title = 'Pocket Price Distribution\nBuyer Rating', 
       xlabel = 'Buyer Rating',
      ylabel = 'Pocket Price');

### Solution III.3

Create a random sample of $n = 1000$ and plot unit sales vs. total discounts (*Tdisc*).  What do you conclude?

[Return to Exercise III.3](#Exercise-III.3)

In [None]:
##
## Draw a random sample of size n = 1000
## Put the sample in a new DataFrame.
##
smpl = df.sample( n = 1000, random_state = 1234 )
##
## Plot the data using the random sample
##
ax = sns.regplot( x = 'Tdisc', y = 'Usales', data = smpl )
ax.set( title = 'Unit Sales vs. Total Discounts\nRandom Sample\nn = 1000', 
       ylabel = 'Unit Sales', xlabel = 'Total Discounts' );

### Solution III.4

[Return to Exercise III.4](#Exercise-III.4)

Study the relationship between Total Discount (*Tdisc*) and the pocket price (*Pprice*).  Use a random sample of $n = 200$, a Lowess smooth, and omit the scatter points.  Let the pocket price be on the vertical axis.  What can you conclude?

In [None]:
##
## Create a sample of n = 200
##
smpl = df.sample( n = 200, random_state = 1234 )
##
## Plot
##
ax = sns.regplot( x = 'Tdisc', y = 'Pprice', data = smpl, lowess = True, scatter = False )
ax.set( title = 'Pocket Price vs Total Discount\nRandom Sample\nn = 200\nWith Lowess Smooth', 
       xlabel = 'Total Discount', ylabel = 'Pocket Price' );

### Solution IV.1

Merge the X and y testing data sets for predicting.

[Return to Exercise IV.1](#Exercise-IV.1)

In [None]:
## 
## Merge the X and y testing data sets for predicting.
## Use an inner join on the indexes.
##
## Rename the y variable Usales.
##
yy = pd.DataFrame( { 'Usales':y_test } )
test = yy.merge( X_test, left_index = True, right_index = True )
print( 'Testing Data Set:\n{}'.format( test.head() ) );

### Solution IV.2

Add log Usales and log Pprice to the testing data set.

[Return to Exercise IV.2](#Exercise-IV.2)
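
The log variables in this notebook are created with `np.log1p`, which calculates $\ln(1 + x)$ and so is defined when $x = 0$.  Its inverse is `np.expm1`, which is why predictions on the log scale are converted back with `np.expm1` rather than `np.exp`.  A minimal sketch with made-up unit sales values:

```python
import numpy as np

## Made-up unit sales, including a zero (illustrative only)
usales = np.array( [ 0.0, 1.0, 25.0, 400.0 ] )
##
## log1p( x ) is ln( 1 + x ), so a zero maps to zero
##
log_usales = np.log1p( usales )
##
## expm1 undoes log1p: exp( z ) - 1
##
back = np.expm1( log_usales )
print( 'Log scale: {}'.format( np.round( log_usales, 3 ) ) )
print( 'Back-transformed: {}'.format( back ) )
```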

In [None]:
##
## Repeat for the testing data
##
test[ 'log_Usales' ] = np.log1p( test.Usales )
test[ 'log_Pprice' ] = np.log1p( test.Pprice )
print( 'Testing Data Set:\n{}'.format( test.head() ) )
print( '\nTesting Data Set Shape:\n{}'.format( test.shape ) )

### Solution IV.3

[Return to Exercise IV.3](#Exercise-IV.3)

Estimate a new OLS model by adding the buyer rating to the above model. Name your model regE01.  Interpret your results.  Is the buyer rating important for sales?

Hint: Buyer rating is categorical so you have to create dummies for the rating.  
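
To see what "creating dummies" means, here is a small sketch with pandas on a made-up rating column (not the course data).  One level is dropped to serve as the base; with three ratings you get two dummy columns.  The formula interface's `C( )` does the equivalent for you and, by default, should also use the first level in sorted order as the base.

```python
import pandas as pd

## Made-up ratings (illustrative only)
ratings = pd.Series( [ 'Poor', 'Good', 'Excellent', 'Good', 'Poor' ], 
                    name = 'buyerRating' )
##
## One dummy per level, dropping the first level (Excellent) as the base
##
dummies = pd.get_dummies( ratings, drop_first = True, dtype = int )
print( pd.concat( [ ratings, dummies ], axis = 1 ) )
```

A record rated Excellent has zeros in both dummy columns, so the estimated parameters for Good and Poor are interpreted relative to Excellent.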

In [None]:
## 
## OLS
##
## There are four steps for estimating a model:
##
##   1. define a formula (i.e., the specific model to estimate)
##   2. instantiate the model (i.e., specify it)
##   3. fit the model
##   4. summarize the fitted model
##
## ===> Step 1: Define a formula <===
##
## The formula uses a “~” to separate the left-hand side from the right-hand side
## of a model and a “+” to add columns to the right-hand side.  A “-” sign (not 
## used here) can be used to remove columns from the right-hand side (e.g.,
## remove or omit the constant term which is always included by default). 
##
formula = 'log_Usales ~ log_Pprice + Ddisc + Odisc + Cdisc + Pdisc + C( Region ) + C( buyerRating )'
##
## Since Region and buyerRating are categorical, you must create dummies for 
## them.  You do this using 'C( Region )' and 'C( buyerRating )' to indicate 
## that each is categorical.
##
## ===> Step 2: Instantiate the OLS model <===
##
mod = smf.ols( formula, data = train )
##
## ===> Step 3: Fit the instantiated model <===
##      Recommendation: number your models
##      This numbering includes an "E" for "Exercise"
##
regE01 = mod.fit()
##
## ===> Step 4: Summarize the fitted model <===
##
print( regE01.summary() )


**_Interpretation_**

The buyer rating is highly insignificant so this variable is not important and can be omitted in a next iteration of estimation.

### Solution IV.4

[Return to Exercise IV.4](#Exercise-IV.4)

Test the Null Hypothesis that all the buyer rating estimated parameters are zero.  That is, there is no difference among the ratings.

In [None]:
##
## F-test for the buyer ratings
##
hypothesis = ' ( C(buyerRating)[T.Good] = 0, C(buyerRating)[T.Poor] = 0 ) '
##
## Run an F-test and retrieve the p-value
##
f_test = regE01.f_test( hypothesis )
pval = round( float( f_test.pvalue ), 2 )
##
## Print results
##
print( 'p-value for F-Test: {}'.format( pval ) )
if pval < 0.05:
    print( '\nSignificant so reject H0' )
else:
    print( '\nInsignificant so do not reject H0' )

**_Interpretation_**

Buyer Rating is highly insignificant (i.e., do not reject the Null Hypothesis) because the p-value is 0.68 which is greater than 0.05.  We already know this result.
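
The idea behind this F-test can be sketched by hand: estimate the model with and without the rating dummies and compare the sums of squared residuals (SSR).  The example below uses simulated data in which the rating has no true effect; all names are illustrative.  It is a sketch of the classical restricted vs. unrestricted comparison, which under the default (nonrobust) OLS covariance should give the same F statistic as the `f_test` method.

```python
import numpy as np

## Simulated data (illustrative only): the rating has no true effect
rng = np.random.default_rng( 0 )
n = 300
x = rng.normal( size = n )
rating = rng.integers( 0, 3, size = n )   ## three rating levels: 0, 1, 2
y = 1.0 - 0.5 * x + rng.normal( size = n )

## Helper: sum of squared residuals from a least squares fit
def ssr( A, y ):
    beta, *_ = np.linalg.lstsq( A, y, rcond = None )
    resid = y - A @ beta
    return float( resid @ resid )

## Restricted model: no rating dummies.  Unrestricted: two dummies.
ones = np.ones( n )
d1 = ( rating == 1 ).astype( float )
d2 = ( rating == 2 ).astype( float )
X_r = np.column_stack( [ ones, x ] )
X_u = np.column_stack( [ ones, x, d1, d2 ] )
ssr_r, ssr_u = ssr( X_r, y ), ssr( X_u, y )
##
## F = [ ( SSR_r - SSR_u ) / q ] / [ SSR_u / ( n - k ) ]
## q = number of restrictions, k = parameters in the unrestricted model
##
q = 2
k = X_u.shape[ 1 ]
F = ( ( ssr_r - ssr_u ) / q ) / ( ssr_u / ( n - k ) )
print( 'F statistic: {:0.3f}'.format( F ) )
```

A small F statistic means the dummies reduce the SSR very little, which is what an insignificant result is telling you.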

### Solution IV.5

Create a scenario with the following settings for your model with buyer rating (*regE01*):

- 'Pprice': [ 2.50 ]
- 'Ddisc': [ 0.03 ]
- 'Odisc': [ 0.03 ]
- 'Cdisc': [ 0.03 ]
- 'Pdisc': [ 0.03 ]
- 'buyerRating': [ 'Poor' ]
- 'Region': [ 'South' ]

[Return to Exercise IV.5](#Exercise-IV.5)

In [None]:
##
## Specify scenario values to use for prediction
##
## Create a dict
##
data = {
         'Pprice': [ 2.50 ],
         'Ddisc': [ 0.03 ],
         'Odisc': [ 0.03 ],
         'Cdisc': [ 0.03 ],
         'Pdisc': [ 0.03 ],
         'buyerRating': [ 'Poor' ],
         'Region': [ 'South' ]
        }
##
## Create a DataFrame using the dict
##
scenario = pd.DataFrame.from_dict( data )
##
## Insert a log price column after the Pprice variable
##
scenario.insert( loc = 1, column = 'log_Pprice',
                value = np.log1p( scenario.Pprice ) )
##
## Display the scenario settings
##
print( 'Scenario settings:\n{}'.format( scenario ) )
##
## Create a prediction and convert it back from the log scale
##
log_pred = regE01.predict( scenario )
y_pred = np.expm1( log_pred )
print( '\nPredicted Unit Sales: \n{}'.format( round( y_pred, 0 ) ) )