<font color = red>Introduction to Business Analytics:<br>Using Python for Better Business Decisions</font>
=======
<br>
    <center><img src="http://dataanalyticscorp.com/wp-content/uploads/2018/03/logo.png"></center>
<br>
Taught by: 

* Walter R. Paczkowski, Ph.D. 

    * My Affiliations: [Data Analytics Corp.](http://www.dataanalyticscorp.com/) and [Rutgers University](https://economics.rutgers.edu/people/teaching-personnel)
    * [Email Me With Questions](mailto:walt@dataanalyticscorp.com)
    * [Learn About Me](http://www.dataanalyticscorp.com/)
    * [See My LinkedIn Profile](https://www.linkedin.com/in/walter-paczkowski-a17a1511/)
    

# <font color = blue> Lesson \#2:<br>Data Visualization for Insight</font>

In this lesson, you will learn:

1. some fundamentals for visualizing your data; and
2. how to interpret basic graphs common in Business Analytics.

Specifically, you will learn to use:

1. histograms;
2. boxplots;
3. scatter plots;
4. contour plots; and
5. hex bin plots

to visualize your data.  The focus is on scientific visualization rather than infographics visualization.      

**Case Study Problem**:
<br><br>
The product manager wanted to know about unit sales and discounts by:

1. Overall Market
2. Marketing Region
3. Customer Loyalty
4. Buyer Rating

## <font color = black> Reset the Data from Lesson 1 </font>

Resetting the data will ensure that the work you did in Lesson 1 is available in this lesson.

In [None]:
##
## Load packages
##
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
##
## Import the data.  The parse_dates argument says to 
## treat Tdate as a date object.
##
file = r'../data/orders.csv'
df_orders = pd.read_csv( file, parse_dates = [ 'Tdate' ] )
pd.set_option('display.max_columns', 8)
##
## Initial Calculations
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
df_orders[ 'Tdisc' ] = df_orders[ x ].sum( axis = 1 )
##
df_orders[ 'Pprice' ] = df_orders.Lprice*( 1 - df_orders.Tdisc )
##
df_orders[ 'Rev' ] = df_orders.Usales * df_orders.Pprice
##
df_orders[ 'Con' ] = df_orders.Rev - df_orders.Mcost
df_orders[ 'CM' ] = df_orders.Con/df_orders.Rev
##
df_orders[ 'netRev' ] = ( df_orders.Usales - df_orders.returnAmount )*df_orders.Pprice
df_orders[ 'lostRev' ] = df_orders.Rev - df_orders.netRev
##
##
## Import a second DataFrame on the customers
##
file = r'../data/customers.csv'
df_cust = pd.read_csv( file )
##
## Do an inner join using CID as the link
##
df = pd.merge( df_orders, df_cust, on = 'CID' )
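
A quick way to see what the inner join does is to try it on a toy pair of tables.  The sketch below uses made-up rows; only the *CID* key name comes from the lesson, and the *toy_orders* and *toy_cust* names are hypothetical.  An inner join keeps only the *CID* values found in both tables.  The optional *validate = 'many_to_one'* argument is a guard that raises an error if a customer appears more than once in the customer table.

```python
import pandas as pd

##
## Hypothetical toy tables; only the CID key name comes from the lesson
##
toy_orders = pd.DataFrame( { 'CID': [ 1, 1, 2, 4 ], 'Usales': [ 10, 12, 7, 3 ] } )
toy_cust = pd.DataFrame( { 'CID': [ 1, 2, 3 ], 'Region': [ 'South', 'West', 'Midwest' ] } )
##
## Inner join: the order for CID 4 (no customer record) is dropped
## and customer 3 (no orders) never appears
##
toy = pd.merge( toy_orders, toy_cust, on = 'CID', how = 'inner', validate = 'many_to_one' )
print( toy )
##
## An outer join with indicator = True shows where each row came from
##
check = pd.merge( toy_orders, toy_cust, on = 'CID', how = 'outer', indicator = True )
print( check[ '_merge' ].value_counts() )
```

Comparing the row counts before and after the join on the real data is a simple check that no orders were silently lost.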

## <font color = black> Look at the Distribution of Your Data </font>

The first step in any analysis is to examine the distribution of your data. A histogram is the simplest way to begin.

### <font color = black> Histograms </font>

You can use a histogram to examine the distribution of unit sales and the total discount.  Notice in the following display that a smooth line is overlaid on the bars.  This is a *kernel density estimate* (*KDE*).  You will see this again shortly.

In [None]:
##
## Histogram of unit sales
##
ax = sns.distplot( df.Usales )
ax.set( title = "Unit Sales Distribution", xlabel = 'Unit Sales', 
       ylabel = 'Proportions' )

**_Code Explanation_**

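Plotting a histogram is very easy.  The Seaborn *distplot* function is called with the variable of interest as its argument.  The plot it returns is saved in a variable called "ax".  The title and axis labels are then set by calling *ax.set*.  Note that Seaborn releases from 0.11 onward deprecate *distplot* in favor of *histplot* and *displot*, so you may see a warning.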
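
The smooth line drawn over the histogram is a kernel density estimate.  The sketch below shows the idea with Numpy only: a small Gaussian bump is centered on each observation and the bumps are averaged.  The data are simulated, the *kde_sketch* function is a hypothetical helper, and the bandwidth is a common rule of thumb chosen here as an assumption; Seaborn chooses its own bandwidth.

```python
import numpy as np

def kde_sketch( data, grid, bandwidth ):
    """Average of Gaussian bumps, one centered on each observation."""
    data = np.asarray( data, dtype = float )
    grid = np.asarray( grid, dtype = float )
    z = ( grid[ :, None ] - data[ None, : ] )/bandwidth
    bumps = np.exp( -0.5*z**2 )/np.sqrt( 2.0*np.pi )
    return bumps.mean( axis = 1 )/bandwidth

##
## Simulated right-skewed data standing in for unit sales
##
rng = np.random.default_rng( 0 )
toy_sales = rng.lognormal( mean = 3.0, sigma = 0.5, size = 200 )
##
## Rule-of-thumb bandwidth (an assumption, not Seaborn's choice)
##
bandwidth = 1.06*toy_sales.std()*len( toy_sales )**( -1/5 )
grid = np.linspace( toy_sales.min() - 5*bandwidth,
                    toy_sales.max() + 5*bandwidth, 500 )
density = kde_sketch( toy_sales, grid, bandwidth )
##
## A density should integrate to (approximately) one
##
area = density.sum()*( grid[ 1 ] - grid[ 0 ] )
print( round( area, 3 ) )
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual observations more closely.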

**_Interpretation_**

The distribution is highly skewed to the right, which makes it hard to see the bulk of the data.  Taking the natural log pulls in the long right tail and makes the distribution more symmetric, so you should use a log transformation when you model unit sales.  The next graph shows that the distribution, on a log scale, is fairly normal.

**_Recommendation_**
    
Use the Numpy *log1p* function.  This returns the natural log of one plus the argument: $\mathrm{log1p}( x ) = \log_e(1 + x)$.  The reason for using this function is to handle cases where $x = 0$: $\log(0)$ is undefined, but $\log( 1 + 0 ) = \log( 1 ) = 0$, so you still have a meaningful number.
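
A minimal check of this behavior, using a few made-up values, is sketched below.  It also shows Numpy's *expm1* function, which reverses *log1p* and is handy when you need to convert logged values back to the original units.

```python
import numpy as np

x = np.array( [ 0.0, 1.0, 9.0, 99.0 ] )
##
## log1p( x ) is the natural log of ( 1 + x ), so x = 0 maps to 0
##
logged = np.log1p( x )
print( logged )
##
## expm1 is the inverse of log1p
##
restored = np.expm1( logged )
print( restored )
```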

In [None]:
##
## Plot the natural log of unit sales
## A KDE curve is included by default
##
ax = sns.distplot( np.log1p( df.Usales) )
ax.set( title = "Unit Sales Distribution: Log Scale", 
       xlabel = 'Unit Sales (Natural Log)',
       ylabel = 'Proportions' )

**_Interpretation_**

The natural log transformation changed the distribution to a more normal looking distribution.  Normality is preferred for statistical analysis for a host of reasons.

### <font color = black> Boxplots </font>

Boxplots are among the most useful visualization tools for examining distributions because they show the median, the quartiles, and potential outliers in one compact display.
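
To make the parts of a boxplot concrete, the sketch below computes them by hand for a small made-up series.  It assumes the default whisker rule of 1.5 times the interquartile range, which is the default for Seaborn's *boxplot*; the variable names are hypothetical.

```python
import pandas as pd

toy = pd.Series( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 100 ] )
##
## The box: first quartile, median, third quartile
##
q1, median, q3 = toy.quantile( [ 0.25, 0.50, 0.75 ] )
iqr = q3 - q1
##
## The fences: points beyond these are drawn individually as outliers
##
lower_fence = q1 - 1.5*iqr
upper_fence = q3 + 1.5*iqr
##
## The whiskers end at the most extreme points inside the fences
##
inside = toy[ ( toy >= lower_fence ) & ( toy <= upper_fence ) ]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = toy[ ( toy < lower_fence ) | ( toy > upper_fence ) ]

print( 'Box: {} | {} | {}'.format( q1, median, q3 ) )
print( 'Whiskers: {} to {}'.format( whisker_low, whisker_high ) )
print( 'Outliers: {}'.format( list( outliers ) ) )
```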

In [None]:
##
## Total discount distribution with boxplots
## Northeast Region
##
## Print the distribution of the region counts, normalized to sum to 1.0
##
print( 'Region Distribution: \n{}'.format( df.Region.value_counts( normalize = True ) ) )
##
## Display the boxplot for the Northeast
##
ax = sns.boxplot( y = 'Tdisc', data = df[ df.Region == 'Northeast' ] )
ax.set( title = 'Distribution of Total Discount', ylabel = 'Total Discount' )

**_Code Explanation_**

Notice that the Seaborn *boxplot* function is called here with only a y-axis argument, so there is a single box and no x-axis variable is needed.  This gives a vertical chart as shown.  If you change the "y" to "x", the boxplot is drawn horizontally instead: *sns.boxplot( x = 'Tdisc', data = df[ df.Region == 'Northeast' ] )*.

**_Interpretation_**

The Total Discount is symmetrically distributed.  This is evident from the almost mirror image above and below the center line inside the box.  The center line is the median.  This boxplot is for the Northeast only.  But what about the other segments of the overall market?

In [None]:
##
## Total discount distribution by regions
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', data = df )
ax.set( title = 'Distribution of Total Discount by Region', ylabel = 'Total Discount', 
       xlabel = 'Marketing Regions' )

**_Code Explanation_**

In this drill-down of the total discounts by marketing regions, the Seaborn boxplot function now has two axis arguments: 

1. y-axis; and
2. x-axis (*Region* in this case).

**_Interpretation_**

Notice that discounts are the lowest in the Southern Region while the Midwest has a large number of very low discounts.  Also, the dispersion of the discounts in the Southern Region is small relative to that in the other three regions.  Let us drill down on the discounts to verify the differences for the Southern Region.
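
A visual impression like this can be backed up with numbers.  The sketch below computes the quartiles and interquartile range by group on made-up data; the *Region* and *Tdisc* column names follow the lesson, but the values are hypothetical and chosen only to illustrate a tight versus a wide distribution.

```python
import pandas as pd

##
## Made-up discounts: tightly packed in the South, spread out in the West
##
toy = pd.DataFrame( {
    'Region': [ 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West' ],
    'Tdisc': [ 0.10, 0.11, 0.12, 0.13, 0.10, 0.20, 0.30, 0.40 ]
} )
##
## Quartiles by region, one row per region
##
q = toy.groupby( 'Region' ).Tdisc.quantile( [ 0.25, 0.50, 0.75 ] ).unstack()
q.columns = [ 'Q1', 'Median', 'Q3' ]
##
## The interquartile range is the height of each box
##
q[ 'IQR' ] = q.Q3 - q.Q1
print( q )
```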

In [None]:
##
## Drill down on the discounts in the Southern Region
##
## Select the discounts for the Southern Region
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]  ## same order as the tick labels set below
df_south = df.loc[ df.Region == 'South', x ]
##
## Use a boxplot to examine the distributions.
##
ax = sns.boxplot(x = "variable", y = "value", data = pd.melt( df_south ) )
ax.set( title = 'Discount Distribution\nSouthern Marketing Region', 
       xlabel = 'Type of Discount',
      ylabel = 'Discount Amount')
##
## Reset the tick labels to more meaningful labels
##
ax.set_xticklabels( [ 'Dealer', 'Order\nSize', 'Competitive', 'Pickup' ] )

**_Interpretation_**

Notice that the dealer discount tends to be the largest while the order discount has the most variation.
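
The *pd.melt* call in the cell above reshapes the discount columns from wide to long form so that a single boxplot call can draw one box per discount type.  The sketch below shows the reshaping on two made-up rows.  The melted *variable* column keeps the original column order, and the boxes are drawn in the order the categories appear, so tick labels set by hand need to follow that same order.

```python
import pandas as pd

##
## Two made-up rows of discounts in wide form
##
wide = pd.DataFrame( { 'Ddisc': [ 0.10, 0.12 ], 'Odisc': [ 0.05, 0.06 ],
                       'Cdisc': [ 0.03, 0.04 ], 'Pdisc': [ 0.02, 0.02 ] } )
##
## Long form: one row per ( discount type, value ) pair
##
long = pd.melt( wide )
print( long )
##
## The order in which the discount types appear
##
order = list( long.variable.unique() )
print( order )
```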

### <font color = blue> Exercises </font>

####  <font color = black> Exercise \#2.1 </font>

Check the Customer Loyalty and Buyer Rating counts and proportions.

In [None]:
##
## Enter code here: Customer Loyalty
##


In [None]:
##
## Enter code here: Buyer Rating
##


####  <font color = black> Exercise \#2.2 </font> 

Examine the Midwestern region.  This is more complicated since there are missing values in the Midwest.  First use *df.dropna( axis = 0, inplace = True )* to remove them.

Hint: Use the Pandas *dropna* method with the *inplace = True* argument.

In [None]:
##
## Example
## Drop all rows with at least one missing value
## This example uses a temporary DataFrame
##
x = [ 'Tdisc', 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
tmp = df.loc[ df.Region == 'Midwest', x ].copy()  ## copy so df itself is not changed
##
## Before
##
print( '\nBefore:\n' )
tmp.info()  ## info() prints its own report and returns None
##
## After
##
tmp.dropna( inplace = True )  ## axis = 0 is the default 
print( '\nAfter:\n' )
tmp.info()

In [None]:
##
## Enter code here.  Insert cells below this if needed.
##


#### <font color = black> Exercise \#2.3 </font>

Examine the distribution of net revenue by region, loyalty program, and buyer rating using boxplots.  What can you conclude?

In [None]:
##
## Enter code here for net revenue by region.
##


In [None]:
##
## Enter code here for net revenue by loyalty program.
##


In [None]:
##
## Enter code here for net revenue by buyer rating.
##


## <font color = black> Look for Relationships in Your Data </font> 

Scatter plots are the workhorse of statistical displays because they allow you to see relationships -- sometimes.  Properly drawn, they can provide a wealth of insight into: 

- relationships;
- trends;
- patterns; and
- anomalies

of two continuous variables.  They can be supplemented with histograms on the margins to show distributions.

### <font color = black> Transformation for Better Interpretation </font>

Since one objective of the product manager is to estimate a price elasticity, you should graph unit sales against Pocket Price.  We noticed earlier that unit sales were right skewed but that using a log transform shifted the distribution to a more normal one.  We should take the log of unit sales as well as pocket price.  This is a very common transformation in empirical demand analysis because, when both variables are in logs, the slope of the fitted line is the price elasticity.
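
The reason the slope is the elasticity can be seen from the log-log demand equation $\ln Q = \alpha + \beta \ln P$, where $\beta = d \ln Q / d \ln P$ is the percentage change in quantity for a one percent change in price.  The sketch below checks this on simulated data with a known elasticity of $-1.5$; all names and values are hypothetical.  It uses the plain *np.log* so the slope is exactly the elasticity.  With *log1p* the slope is only approximately the elasticity, and the approximation is good when prices and unit sales are well above 1.

```python
import numpy as np

rng = np.random.default_rng( 42 )
##
## Simulate a constant-elasticity demand curve: Q = A * P^beta * error
##
true_elasticity = -1.5
price = rng.uniform( 5.0, 50.0, size = 2000 )
noise = rng.normal( 0.0, 0.1, size = price.size )
units = 1000.0*price**true_elasticity*np.exp( noise )
##
## Fit a straight line to the logged data
##
slope, intercept = np.polyfit( np.log( price ), np.log( units ), deg = 1 )
print( 'Estimated elasticity: {:.2f}'.format( slope ) )
```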

In [None]:
##
## Transform unit sales and pocket price
##
df[ 'log_Pprice' ] = np.log1p( df.Pprice )
df[ 'log_Usales' ] = np.log1p( df.Usales )
##
## Display the unlogged and logged data
##
x = [ 'Pprice', 'log_Pprice', 'Usales', 'log_Usales' ]
df[ x ].head()

In [None]:
##
## Plot the logged data
## Use the Seaborn "relplot" function
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', xlabel = 'Log Pocket Price', 
       ylabel = 'Log Unit Sales' )

**_Interpretation_**

A negative relationship is evident -- as it should be.  But the large number of plot points makes it slightly difficult to see.

### <font color = black> Enhancing the Scatter Plot </font>

In [None]:
##
## Replot the logged data with a regression line added. 
## Use the Seaborn "regplot" function.
##
## Warning -- this will take a few seconds
##
## Note: The plot element colors can be set:
##   b:blue, g:green, r:red, c:cyan,
##   m:magenta, y:yellow, k:black, w:white.
##
ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                 scatter_kws={"color": "black"}, line_kws={"color": "yellow"} )
##                 color = 'y' )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' )

**_Code Explanation_**

The Seaborn *regplot* function is used to add a regression line to the scatter plot.  To help distinguish between plotting points and the regression line, the *scatter_kws={"color": "black"}, line_kws={"color": "yellow"}* arguments are used.  The points are specified as black and the line as yellow.  The default is for both to be the same color.

**_Interpretation_**

The regression line shows a negative relationship between price and sales.

### <font color = black> Adding a Categorical Variable </font>

You can add a third variable that is categorical to show relationships across groups.  This is done with the "hue" argument, which colors the points by category.

In [None]:
##
## Add Loyalty Program membership
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'loyaltyProgram', 
                 data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', 
       ylabel = 'Log Unit Sales' )

### <font color = black> Working with *Large-N* Data </font>

The scatter plots are dense, making it difficult to see patterns. Options are to use a:

1. random sample;
2. contour plot; or
3. hex bin plot.

#### Random Sampling

In [None]:
##
## Draw a random sample of size n = 500
## Put the sample in a new DataFrame.
##
smpl = df.sample( n = 500 )
##
## Plot the data using the random sample
##
ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = smpl )
ax.set( title = 'Unit Sales vs. Pocket Price\nRandom Sample\nn = 500', 
       ylabel = 'Log Unit Sales', xlabel = 'Log Pocket Price' )

**_Code Explanation_**

The Pandas DataFrame method *sample* is used to draw a random sample of size $n = 500$.
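
Each call to *sample* draws a different set of rows, so the plot changes from run to run.  If you want a reproducible sample, one option is the *random_state* argument, sketched below on a made-up DataFrame.  The seed value 42 is arbitrary.

```python
import pandas as pd

toy = pd.DataFrame( { 'row': range( 1000 ) } )
##
## The same seed gives the same sample every time
##
smpl_a = toy.sample( n = 5, random_state = 42 )
smpl_b = toy.sample( n = 5, random_state = 42 )
print( smpl_a.equals( smpl_b ) )
##
## You can also sample a fraction of the rows instead of a fixed n
##
smpl_c = toy.sample( frac = 0.01, random_state = 42 )
print( len( smpl_c ) )
```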

**_Interpretation_**

The negative relationship between unit sales and price is evident.

#### Contour Plot

In [None]:
##
## Contour plot with marginal distributions
## Full sample
##
## Warning -- this will take a minute
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = df, kind = "kde" )

**_Code Explanation_**

Seaborn's *jointplot* function is used.  The *kind = "kde"* argument requests a *kernel density* plot, which is drawn as contours.

**_Interpretation_**

The dark spot in the middle shows the concentration of the data points.  The negative relationship between sales and price is evident.

#### Hex Bin Plot

In [None]:
##
## Hex binning
## Full sample
##
## Note: A white background is best for this 
## Note: The plot element colors can be set: 
##   b:blue, g:green, r:red, c:cyan,
##   m:magenta, y:yellow, k:black, w:white.
##
## Warning -- this will take a minute
##
with sns.axes_style( 'white' ):
    ax = sns.jointplot(x = 'log_Pprice', y = 'log_Usales', data = df, 
                       kind="hex", color = 'k' )
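
The idea behind a hex bin plot is to divide the plotting area into cells, count the points in each cell, and shade each cell by its count.  The sketch below illustrates the counting step with Numpy's *histogram2d*, which uses rectangular rather than hexagonal cells, so it is an analogy and not what Seaborn does internally.  The data are simulated and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng( 1 )
n = 10000
##
## Simulated logged price and sales with a negative relationship
##
toy_log_price = rng.normal( 3.0, 0.3, size = n )
toy_log_sales = 6.0 - 1.5*toy_log_price + rng.normal( 0.0, 0.4, size = n )
##
## Count the points falling in each cell of a 20 x 20 grid
##
counts, x_edges, y_edges = np.histogram2d( toy_log_price, toy_log_sales, bins = 20 )
print( counts.shape )
##
## The fullest cell is where the plot would be darkest
##
i, j = np.unravel_index( counts.argmax(), counts.shape )
print( 'Fullest cell holds {} points'.format( int( counts[ i, j ] ) ) )
```

Because only the counts are drawn, the display stays readable no matter how many points there are.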

### <font color = blue> Exercises </font>

#### <font color = black> Exercise \#2.4 </font>

Study the relationship between any two variables of your choice.  What can you conclude?

In [None]:
##
## Enter code here.  Insert cells below this if needed.
##


## <font color = black> Look for Trends in Your Data </font> 

Trends are identified using line graphs, usually with time on the X-axis. 

In [None]:
##
## Subset the date indicator and the Dealer Discount
##
x = [ 'Tdate', 'Ddisc' ]
tmp = df_orders[ x ].copy()
## 
## Set Tdate to a Datetime variable
##
tmp.Tdate = pd.to_datetime( tmp.Tdate )
##
## Reset the index to the date
##
tmp.set_index( 'Tdate', inplace = True )
tmp.head()

**_Code Explanation_**

The subset DataFrame containing *Tdate* and *Ddisc*, *tmp*, is reindexed using *Tdate*.  But *Tdate* is first converted to a DateTime variable using *pd.to_datetime*. 

**_Interpretation_**

The data for *Ddisc* are by day and month.  Notice that there are missing values indicated by *NaN*.  

In [None]:
##
## Group the data by months and calculate the 
## mean discount for each month.
##
grp = tmp.resample( 'M' ).mean()
grp.head()

**_Code Explanation_**

The *tmp* DataFrame is *resampled* to monthly data, with each monthly value calculated as the mean of the values in that month.  Basically, *resample* aggregates the data by the datetime index.  See <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html" target="_parent">here</a> for documentation on *resample*.
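
The sketch below shows *resample* on a handful of made-up daily values so you can check the arithmetic by hand.  It uses the 'MS' alias, which labels each month by its first day; the lesson's 'M' alias labels each month by its last day, and recent Pandas releases prefer the spelling 'ME' for that.  Notice that missing values are skipped when the mean is calculated and that a month with no data at all comes back as *NaN*.

```python
import numpy as np
import pandas as pd

##
## Made-up daily discounts: one missing value and no data for March
##
dates = pd.to_datetime( [ '2020-01-05', '2020-01-20', '2020-02-10',
                          '2020-02-25', '2020-04-01' ] )
toy = pd.DataFrame( { 'Ddisc': [ 0.10, 0.20, 0.30, np.nan, 0.50 ] }, index = dates )
##
## Monthly means, labeled by the first day of each month
##
monthly = toy.resample( 'MS' ).mean()
print( monthly )
```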


**_Interpretation_**

Each value is the mean of values for the indicated month.

In [None]:
##
## Uses Pandas' plot function.
## It automatically uses the time index for the X-axis.
##
ax = grp.plot( y = 'Ddisc' , legend = False )
ax.set( title = 'Dealer Discount\nMonthly', ylabel = 'Dealer Discount', xlabel = 'Months' )

**_Interpretation_**

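Pandas does not connect points across a missing value, so a gap in the line marks a month with no data.  Pandas also uses the datetime index for the X-axis automatically, which makes its plot function convenient for time series data.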

## <font color = black> Look for Patterns in Your Data </font> 

Patterns are identified using a variety of visual displays.  So all the graph types discussed will help identify patterns.

## <font color = black> Look for Anomalies in Your Data </font> 

Boxplots are especially good for this because they draw outlying points individually.  You can also see odd data points in histograms, scatter plots, and time series plots.

In [None]:
##
## Categorical plot: boxplot
##
ax = sns.catplot( y = 'Tdisc', kind = 'box', data = df_orders )
ax.set( title = 'Total Discount\nOutliers', ylabel = 'Total Discount', xlabel = "")

**_Interpretation_**

There are some clear outliers:

1. A number of points are very low.
2. Only one or two points are very high.
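
You can also count the outliers rather than judge them by eye.  The sketch below defines a hypothetical helper, *count_outliers*, that applies the 1.5 times interquartile range rule to a made-up list of discounts.  The rule and the toy values are assumptions for illustration; other outlier rules are possible.

```python
import pandas as pd

def count_outliers( values, whis = 1.5 ):
    """Return the number of points below and above the boxplot fences."""
    values = pd.Series( values ).dropna()
    q1, q3 = values.quantile( [ 0.25, 0.75 ] )
    iqr = q3 - q1
    low = int( ( values < q1 - whis*iqr ).sum() )
    high = int( ( values > q3 + whis*iqr ).sum() )
    return low, high

##
## Made-up total discounts: two very low values and one very high value
##
toy_tdisc = [ 0.01, 0.02, 0.20, 0.21, 0.22, 0.23,
              0.24, 0.25, 0.26, 0.27, 0.28, 0.60 ]
low, high = count_outliers( toy_tdisc )
print( 'Low outliers: {}, high outliers: {}'.format( low, high ) )
```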

## <font color = black> What's Next? </font>

In Lesson 3, I will show you how to build three predictive models:

1. *OLS*;
2. Logit; and
3. Decision trees.

I'll discuss these in the next lesson.
<br><br><br>
<font color = red size = "+3"><b> Five Minute Break </b></font>

# Appendix

This Appendix contains extra material for this lesson that you may want to review to solidify your understanding of working with Python, Pandas, and Seaborn for Data Visualization.

This Appendix covers:

1. Additional Histogram Methods;
2. Additional Boxplot Methods;
3. Additional Scatter Plot Methods; and
4. Additional Time Series Plot Methods.

## Appendix 2.1: Additional Histogram Methods

You can add a *rug plot* to the bottom of the histogram to show each observation.  This is helpful to show where the data are for each bar in the histogram.  This, of course, is not practical for large data sets since the rug would just be a dense, black bar at the bottom of the graph.  
<br><br>
You can also remove the *KDE* curve for a better visualization of the distribution.

In [None]:
##
## Add a rug and remove the KDE
##
ax = sns.distplot( np.log1p( df.Usales), kde = False, rug = True )
ax.set( title = "Unit Sales Distribution: Log Scale", 
       xlabel = 'Unit Sales (Natural Log)', 
       ylabel = 'Count' )

You can display just the *KDE* curve for a cleaner view of the distribution.

In [None]:
##
## KDE only
##
ax = sns.distplot( np.log1p( df.Usales), hist = False )
ax.set( title = "Unit Sales Distribution: Log Scale", 
       xlabel = 'Unit Sales (Natural Log)', 
       ylabel = 'Proportions' )

This looks very much like a normal distribution.  This will be important for *OLS* modeling, whose standard inference relies on a normality assumption.

## Appendix 2.2: Additional Boxplot Methods 

You can examine the discounts by the customer loyalty status.

In [None]:
##
## Total discount distribution by regions and Loyalty Program
## members
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', hue = 'loyaltyProgram', data = df )
ax.set( title = 'Distribution of Total Discount by Region \n and \n Loyalty Program',
       ylabel = 'Total Discount' )

In [None]:
##
## Another view of total discount distribution by Regions and Loyalty Program
## members
##
ax = sns.catplot(x = 'Tdisc', y = 'loyaltyProgram', row = 'Region',
                kind = 'box', orient = 'h', height = 1.5, aspect = 4,
                data = df )
ax.set(  xlabel = 'Total Discount', ylabel = 'Loyalty Program\nMember'  )

It should be disturbing that the discounts are the same whether a customer is in the loyalty program or not.  Members should have bigger discounts.  What about how they are rated?

In [None]:
##
## Total discount distribution by regions and buyer rating
##
ax = sns.boxplot( x = 'Region', y = 'Tdisc', hue = 'buyerRating', data = df )
ax.set( title = 'Distribution of Total Discount by Region \n and \n Buyer Rating', 
       ylabel = 'Total Discount' )

Loyalty and good ratings are not rewarded.
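
A conclusion like this is worth confirming with a table.  The sketch below compares median discounts for members and non-members by region on made-up data.  The *Region*, *loyaltyProgram*, and *Tdisc* column names follow the lesson, but the 'Yes' and 'No' codes and all the values are hypothetical.

```python
import pandas as pd

##
## Made-up data in which members and non-members get the same discounts
##
toy = pd.DataFrame( {
    'Region': [ 'South', 'South', 'South', 'South', 'West', 'West', 'West', 'West' ],
    'loyaltyProgram': [ 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No' ],
    'Tdisc': [ 0.10, 0.12, 0.10, 0.12, 0.20, 0.24, 0.21, 0.23 ]
} )
##
## Median discount by region (rows) and membership (columns)
##
medians = toy.groupby( [ 'Region', 'loyaltyProgram' ] ).Tdisc.median().unstack()
##
## A gap near zero means members are not getting bigger discounts
##
medians[ 'gap' ] = medians.Yes - medians.No
print( medians )
```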

## Appendix 2.3: Additional Scatter Plot Methods 

### <font color = black> Categorical Variable </font>

In [None]:
##
## Add Region
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'Region', data = df )
ax.set( title = 'Unit Sales vs. Pocket Price\nLog Scales', 
       xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' )

### <font color = black> Panel Plot </font>

In [None]:
##
## Add Loyalty Program membership
## A less cluttered view with panels
##
## Warning -- this will take a few seconds
##
ax = sns.relplot( x = 'log_Pprice', y = 'log_Usales', hue = 'loyaltyProgram', 
                 col = 'Region', col_wrap = 2,
                 data = df )
ax.set( xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' )

**_Interpretation_**

Notice the gap between 17 and 19 in the Northeast.

### <font color = black> Combining Scatter Plots and Histograms </font>

You can combine scatter plots with histograms for each variable.

In [None]:
##
## Add histograms to the margins
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = df )

### <font color = black> Pairwise Scatter Plots </font>

You can also plot multiple variables in pair-wise combinations.

In [None]:
##
## Use the Seaborn pairwise function
## Full sample
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
##
## We know there are missing values for the discounts.
## Missing values are not handled well with Seaborn histograms.
## So drop all records with any missing data.
##
tmp = df[ x ].copy()
tmp.dropna( inplace = True )
sns.pairplot( tmp[ x ] )
##
## Warning -- this will take a few minutes
##

**_Interpretation_**

Unfortunately, this particular plot is clearly not useful because the data set is large; we have a case of *Large-N*.  So how is this handled?  Try a random sample as in the next example.

In [None]:
##
## Pairwise plot
##
## Random sample, n = 500 (previously drawn)
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
sns.pairplot( smpl[ x ] )

**_Interpretation_**

This is not much better.  Maybe a smaller sample will work.  You can try this on your own.  A contour or hex bin plot might be better.
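
When the pairwise plots are too cluttered to read, a correlation matrix is a numeric complement you can try.  The sketch below uses simulated discounts with the lesson's column names; the values, and the built-in relationship between *Ddisc* and *Odisc*, are made up for illustration.  The Pandas *corr* method ignores missing values pair by pair, so you do not need to drop rows first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng( 7 )
n = 500
##
## Simulated discounts; Odisc is built to move with Ddisc
##
base = rng.uniform( 0.05, 0.15, size = n )
toy = pd.DataFrame( {
    'Ddisc': base,
    'Odisc': 0.5*base + rng.normal( 0.0, 0.01, size = n ),
    'Cdisc': rng.uniform( 0.00, 0.10, size = n ),
    'Pdisc': rng.uniform( 0.00, 0.05, size = n )
} )
##
## Insert a few missing values to mimic the real data
##
toy.loc[ 0:9, 'Cdisc' ] = np.nan
##
## Pairwise Pearson correlations
##
corr = toy.corr()
print( corr.round( 2 ) )
```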

### <font color = black> Contour Plots with Density Functions </font>

In [None]:
##
## Contour plot with marginal distributions
## Random sample, n = 500
##
## Warning -- this will take a minute
##
ax = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = smpl, kind = "kde" )

**_Interpretation_**

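A different contour plot is produced because the density is now estimated from only the 500 sampled points rather than from the full data set.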

In [None]:
##
## Hex binning
##
## Random sample, n = 500
##
## Note: A white background is best for this 
## Note: The plot element colors can be set:
##   b:blue, g:green, r:red, c:cyan,
##   m:magenta, y:yellow, k:black, w:white.
##
## Warning -- this will take a minute
##
with sns.axes_style( 'white' ):
    ax = sns.jointplot(x = 'log_Pprice', y = 'log_Usales', data = smpl, 
                       kind="hex", color = 'k' )

In [None]:
##
## Add a regression line
##
## Full data sample
##
## Warning -- this will take a minute
##
with sns.axes_style("white"):
    g = sns.jointplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                      kind = 'hex', color = 'k',
                      joint_kws={'gridsize':40, 'bins':'log'} )
    ax = sns.regplot( x = 'log_Pprice', y = 'log_Usales', data = df, 
                     ax = g.ax_joint, scatter = False, color = "yellow" )
    ax.set( xlabel = 'Log Pocket Price', ylabel = 'Log Unit Sales' )

## Appendix 2.4: Additional Time Series Plot Methods 

In [None]:
##
## Time series plot for Southern Region
##
x = [ 'Tdate', 'Ddisc' ]
tmp = df.loc[ df.Region == 'South', x ]
##
## Reset the index to the date
##
tmp.Tdate = pd.to_datetime( tmp.Tdate )
tmp.set_index( 'Tdate', inplace = True )
grp = tmp.resample( 'M' ).mean()
##
## Create a Month variable from the index
##
grp['x'] = grp.index
grp['Month'] = grp.x.dt.month
print(grp.head())
##
ax = grp.plot( y = 'Ddisc' , legend = False )
ax.set( title = 'Dealer Discount\nMonthly\nSouthern Region', ylabel = 'Dealer Discount', xlabel = 'Months' )