<font color = red>Introduction to Business Analytics:<br>Using Python for Better Business Decisions</font>
=======
<br>
    <center><img src="http://dataanalyticscorp.com/wp-content/uploads/2018/03/logo.png"></center>
<br>
Taught by: 

* Walter R. Paczkowski, Ph.D. 

    * My Affiliations: [Data Analytics Corp.](http://www.dataanalyticscorp.com/) and [Rutgers University](https://economics.rutgers.edu/people/teaching-personnel)
    * [Email Me With Questions](mailto:walt@dataanalyticscorp.com)
    * [Learn About Me](http://www.dataanalyticscorp.com/)
    * [See My LinkedIn Profile](https://www.linkedin.com/in/walter-paczkowski-a17a1511/)
    

About this Notebook
-----------------------------

This notebook accompanies the PDF presentation ***Business Analytics: Using Python for Better Business Decisions*** by Walter R. Paczkowski, Ph.D. (2019).  There is more content and commentary in this notebook than in the presentation deck.  Nonetheless, the two complement each other and so should be studied together.  Every effort has been made to use the same key slide titles in the presentation deck and this notebook, which should make it easier to study them side by side.

Exercises and Solutions
------------------------------------

Exercises are interspersed throughout the lessons.  Their solutions are in a separate *Solutions Manual*.  There are also Appendices to some lessons that contain extra exercises.  You can view these as "homework problems."  Their solutions are also in the *Solutions Manual*. 

Helpful Online Tutorials
---------------------------------

* <a href="http://docs.python.org/2/tutorial/" target="_parent">Python Tutorial</a>

* <a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html" target="_parent">Pandas Tutorial</a>

* <a href="https://seaborn.pydata.org/tutorial.html" target="_parent">Seaborn Tutorial</a>

* <a href="https://www.statsmodels.org/stable/index.html" target="_parent">Statsmodels Tutorial</a>


Helpful/Must-Read Book
-----------------------------------
* <a href="https://www.amazon.com/gp/product/1491957662/ref=as_li_tl?ie=UTF8&tag=quantpytho-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=1491957662&linkId=8c3bf87b221dbcd8f541f0db20d4da83" target="_parent">Main Pandas go-to book: </a> *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython* (2nd Edition) by Wes McKinney.


# <font color = blue> Lesson \#1:<br>Simple Analytics: Understanding and Preparing Your Data </font>

In this lesson, you will learn:

1. how to document your data;
2. some fundamentals of Jupyter notebooks for documenting analysis workflow;
3. how to import Python packages;
4. how to import data into a Pandas DataFrame; and
5. how to manipulate data.

## <font color = black> Documenting Your Data and Workflow: Best Practices </font>

There are two things to do to document your work:  

1. document your data; and
2. document your workflow.

A _Data Dictionary_ is used for the first and a _Jupyter notebook_ is used for the second.

### <font color = black> Documenting Your Data </font>

The first task in any data analysis is data documentation, preferably in the form of a *Data Dictionary*.

#### <font color = black> Data Dictionary </font>

A data dictionary contains *metadata* which are data about the data.  Metadata can be anything that helps you understand the data you're using.  Based on [Wikipedia](http://en.wikipedia.org/wiki/Metadata), metadata are information about the distinct data items, such as:

1. means of creation;
2. purpose of the data;
3. time and date of creation;
4. creator/author/keeper of the data;
5. placement on a network (electronic form);
6. where the data were created;
7. what standards were used to create the data.

I'll restrict the metadata to:

1. Variable name;
2. Possible values or value ranges;
3. Source;
4. Date received; and
5. Mnemonic.

The mnemonic is the label used in data files and statistical and modeling output.  

| Variable                  | Values                                 | Source       | Date Received | Mnemonic     |
|---------------------------|----------------------------------------|--------------|---------------|--------------|
| Order Number              | Nominal Integer                        | Order Sys    | 02/01/2019    | Onum         |
| Customer ID               | Nominal                                | Customer Sys | 02/01/2019    | CID          | 
| Transaction Date          | MM/DD/YYYY                              | Order Sys    | 02/01/2019    | Tdate        | 
| Product Line ID           | Five rooms of house                    | Product Sys  | 02/01/2019    | Pline        |
| Product Class ID          | Item in line                           | Product Sys  | 02/02/2019    | Pclass       |
| Units Sold                | Number of units per order              | Order Sys    | 02/01/2019    | Usales       |
| Product Returned?         | Yes/No                                 | Order Sys    | 02/01/2019    | Return       |
| Amount Returned           | Number of units                        | Order Sys    | 02/01/2019    | returnAmount |
| Material Cost/Unit        | \$US cost of material                  | Product Sys  | 02/01/2019    | Mcost        |
| List Price                | \$US list                              | Price Sys    | 02/01/2019    | Lprice       |
| Dealer Discount           | \% discount to dealer (decimal)        | Sales Sys    | 02/01/2019    | Ddisc        |
| Competitive Discount      | \% discount for competition (decimal)  | Sales Sys    | 02/01/2019    | Cdisc        |
| Order Size Discount       | \% discount for size (decimal)         | Sales Sys    | 02/01/2019    | Odisc        |
| Customer Pickup Allowance | \% discount for pickup (decimal)       | Sales Sys    | 02/01/2019    | Pdisc        |
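
A data dictionary such as the one above can itself be kept as a small Pandas DataFrame so that it travels with the analysis.  The following is a minimal sketch, assuming only three of the rows above; the name `data_dict` and the lookup by mnemonic are illustrative choices and are not part of the lesson's data files.

```python
import pandas as pd

##
## Illustrative subset of the data dictionary (three rows only).
##
data_dict = pd.DataFrame(
    {
        'Variable': [ 'Order Number', 'Customer ID', 'Dealer Discount' ],
        'Values': [ 'Nominal Integer', 'Nominal', '% discount to dealer (decimal)' ],
        'Source': [ 'Order Sys', 'Customer Sys', 'Sales Sys' ],
        'Mnemonic': [ 'Onum', 'CID', 'Ddisc' ],
    }
).set_index( 'Mnemonic' )
##
## Look up the meaning of a mnemonic seen in statistical output.
##
print( data_dict.loc[ 'Ddisc', 'Variable' ] )
```

Indexing by the mnemonic is one possible design; it makes it quick to translate a label in model output back to its full variable name.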

### <font color = black> Documenting Your Workflow </font>

Documenting your workflow is as important as documenting your data.  This documentation will enable you to reproduce your work and make it easier for a colleague to follow what you did.  The *Jupyter notebook* paradigm is the best platform for this documentation.

# <font color = black> Jupyter Notebooks: Overview </font>

Jupyter is a sophisticated programming tool sometimes called an *ecosystem*.  It is an ecosystem because it will handle a number of programming languages, Python being one of them.  The languages are handled by _kernels_.  Other kernels include R, Julia, and Fortran, to mention a few.  There are over 100 kernels supported by Jupyter.  Jupyter was originally created to handle Julia, Python, and R.  In fact, the name "Jupyter" is a contraction for **JU**lia, **PYT**hon, and **R**.

The paradigm for Jupyter is a lab notebook used in the physical sciences for laboratory experiment documentation.  The Jupyter notebook consists of cells where text and programming code are entered and executed or *run*.  There are two basic cells:

1. Code cell; and 
2. Markdown cell.

A code cell is where programming code is entered and executed.  The results are displayed in another cell that appears immediately below a code cell.  To organize the code and result, the pair of cells is labeled with a marker for the code and the result.  The code marker is In[] and the result marker is Out[].  Inside the square brackets are numbers representing the sequence in which code is executed.  The numbers will vary, and even be out of order, if a code cell is executed several times but a following code cell is not reexecuted.  The sequencing numbers can be reset by restarting the kernel.  This is done by clicking on *Kernel* on the main toolbar and selecting *Restart & Run All*.  

The markdown cell is where documentation is entered.  This cell, in fact, is a markdown cell.  It is recommended that you use as many markdown cells as possible to document your work.  You will see many examples of this below.<br>

A code cell is the default.  You make a code cell a markdown cell by using the drop-down menu list on the main toolbar.  Of course, you can change a markdown cell to a code cell using the same drop-down list.  You could also use keyboard short-cuts.  See *Help/Keyboard Shortcuts* on the main menu bar. 

You can easily insert/delete cells and move them from one location in your notebook to another.  To insert a cell above the current cell, select the cell as the base for the insert and click on the colored bar on the left until it turns blue.  Then type *A* to insert a cell above the current cell.  Type *B* to insert one below it.  Type *DD* (two *D*'s) to delete the current cell.  To move a cell, select the cell you want to move, click the colored bar on the left, and use the up-arrow and down-arrow icons on the main menu toolbar.  

# <font color = black> Importing Python Packages </font>

Python is a powerful programming language that allows you to perform all standard programming operations in a clear and consistent manner.  Its strength, adhered to by Python programmers, is a coding format that emphasizes readable code.  Indentation is the primary way to accomplish this.  Its strength is also based on a very wide array of *packages* or *modules* or *libraries*.  Packages perform analysis or data manipulation operations.  There are many packages, each one providing a special set of analysis tools so a package can be viewed as a container of functions.  Sometimes a package contains smaller, more specialized packages so a grand package could be a container for smaller ones.  You will see how to access and use packages and subpackages in this and other lessons.

Pandas is a data manipulation and graphing package with a lot of capabilities.  It will be used extensively in these lessons.  Seaborn is a scientific graphing package that is intuitive to use.  Although Pandas has visualization methods, Seaborn is preferred because of its quality, extent, and easier syntax.  Both packages use Matplotlib for base graphing functions.  Statsmodels has an array of statistical modeling functions, only a few of which will be used in these lessons.  Numpy and Matplotlib are base packages for Pandas, Seaborn, and Statsmodels.  Except for a few functions, Numpy and Matplotlib will not be used directly.

## <font color = black> Introduction to Python Packages </font>

Analysis functions you will use, such as a function to create a graph or estimate a regression model, require that you specify the package they come from.  Otherwise, Python will not know where to look for the function.  You can either specify the whole package name or an alias to the package -- the alias is easier: there are fewer characters to type!

I will use only a few packages in these lessons.  The packages and their typical, almost standard, aliases are:

| Package      | Use                           | Typical Alias |
|--------------|-------------------------------|---------------|
| Pandas       | Basic data manipulation       | pd            |
| Seaborn      | Scientific data visualization | sns           |
| Statsmodels  | Regression analysis           | sm            |
| Scikit-Learn | Machine Learning              | sklearn       |
| Numpy        | Basic data manipulation       | np            |
| Matplotlib   | Basic graphing library        | plt           |

## <font color = black> Loading Packages </font>

You have to load a package before you can use it.  Loading is done using an *import* command.  The alias is assigned when you import the package.  I recommend loading all the basic packages at once at the beginning of your notebook.

In [None]:
##
## Import standard packages
##
import numpy as np
import pandas as pd
##
## Import data visualization packages
##
import seaborn as sns
import matplotlib.pyplot as plt
##
## Matplotlib graphs can be drawn in the Jupyter notebook; the next 
## command says to do just that.
##
%matplotlib inline
##
## Set the seaborn grid style.  The dot between the seaborn alias,
## "sns", and the set() function connects or "chains" the alias and the function.
##
sns.set()
##
## Set an option for the number of Pandas columns to display.  Eight in this case.
## 
pd.set_option( 'display.max_columns', 8 )

**_Code Explanation_**

This code block loads the necessary Python packages for this course.  I recommend setting options, such as those for graphs and print, as was done here.  

## <font color = black> Accessing a Function in a Package </font>

A function in a package can be accessed by telling Python the package where the function is located and, of course, the name of the function.  These two operations are done with one statement by *chaining* the package name and the function name.  The chain is formed by connecting the package name and the function name by inserting a dot between the two.  Usually, the package alias is used for improved readability.  An example of a chained command is:
<br><br>
***pd.read_csv( 'lesson1.csv' )***
<br><br>
where "pd" is the alias for Pandas and "read_csv" is a Pandas function that reads a *CSV* file ("lesson1.csv" in this example).  Notice the dot (".") between the alias and the function name.  The dot is the chaining operator.
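
The chaining described above can be sketched as follows.  This is only an illustration: a short in-memory string stands in for a CSV file on disk, and the column names and values are made up.  It shows that the full package name and the alias reach the same function.

```python
import io
import pandas
import pandas as pd

##
## A tiny in-memory CSV stands in for a file on disk (hypothetical data).
##
csv_text = "CID,Usales\n101,5\n102,3\n"
##
## The full package name and the alias are chained to the same function.
##
df_full = pandas.read_csv( io.StringIO( csv_text ) )
df_alias = pd.read_csv( io.StringIO( csv_text ) )
##
print( df_alias )
print( df_full.equals( df_alias ) )
```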

# <font color = black> Importing Your Data Into Pandas </font>

Before you can begin any work, you must first import and examine the structure of your data.  This structure is a rectangular array or matrix or, in Pandas terminology, a *DataFrame*.  When you import your data using Pandas, the imported data immediately go into a DataFrame.  This is very convenient because Seaborn and Statsmodels functions recognize these DataFrames.

Pandas provides a set of very flexible import functions.  Which one you should use depends on your data format.  Some typical formats and relevant functions are:

| Data Format | Pandas Import Function |
|----------|---------------|
| CSV       | read_csv       |
| Excel     | read_excel     |
| Clipboard | read_clipboard |
| SQL       | read_sql       |
| JSON      | read_json      |
| SAS       | read_sas       |

Some of these will be demonstrated below.

I will first import a *CSV* formatted file.  The package alias must be "chained" with the *read_csv* import function, otherwise Python will not know where to find the read function. 

When you import data, Pandas must be able to find the file.  If the data file is in the same directory as the notebook, then the file name alone is sufficient because a relative path is resolved from the notebook's working directory, which is normally the directory containing the notebook.  Otherwise, you have to specify the path.

An example of data import is shown below.  Several more are shown in the Appendix.

Define the data path as shown here; keep the format *r'path'* if necessary.  Remember, if your data are in the same directory as your notebook, then a path is not necessary.

In [None]:
##
## Specify a path to the CSV data and import or read the data.
##
file = r'../data/orders.csv'
##
## Import the data.  The parse_dates argument says to 
## treat Tdate as a date object.
##
df_orders = pd.read_csv( file, parse_dates = [ 'Tdate' ] )

**_Code Explanation_**

The CSV file is specified using the string literal *r'../data/orders.csv'*.  The *"r"* at the beginning of the string tells the Python interpreter to treat this string as raw text that is not to be changed.  Without the *"r"*, the interpreter would treat any backslashes as escape characters, which would change the meaning of the string.  Even though forward slashes are used in this code block, it is good practice to use the *"r"* and avoid such issues.
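
The effect of the *"r"* prefix can be seen with a short sketch.  The Windows-style path below is hypothetical and is used only because it contains backslashes followed by *n* and *t*, which Python would otherwise read as newline and tab characters.

```python
##
## Hypothetical Windows-style path, used only to show the difference.
##
plain = 'data\new\table.csv'     ## \n and \t become a newline and a tab
raw = r'data\new\table.csv'      ## backslashes are kept literally
##
## The two strings have different lengths because each escape
## sequence collapses to a single character in the plain string.
##
print( len( plain ), len( raw ) )
print( raw )
```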

# <font color = black> Cleaning and Wrangling Your Data </font>

A best practice is to check your data after you import them.  Check:

1. the first few records (i.e., the "head");
2. the shape which is the number of rows and columns;
3. the column names; and 
4. missing values.

In [None]:
##
## Task 1: Display first few records.
##
df_orders.head()

**_Code Explanation_**

The "head" method is attached to a DataFrame when you create it.  The default is to display the first five records.  You could use *n = 10* as an argument to display the first 10 records: *df_orders.head( n = 10 )*.  You could display the last five records using the *tail* method.  The default is also five which could be changed as for the *head* method.  For both *head* and *tail*, the number of columns displayed is set using *pd.set_option( 'display.max_columns', 8 )* as was done in the package loading section above.  The *head* and *tail* methods are chained to the DataFrame name.
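
The *head* and *tail* variations just described can be sketched with a small, made-up DataFrame; the name `df_demo` and its values are illustrative only.

```python
import pandas as pd

##
## Small illustrative DataFrame with 12 rows (hypothetical values).
##
df_demo = pd.DataFrame( { 'Onum': range( 1, 13 ), 'Usales': range( 10, 130, 10 ) } )
##
first_five = df_demo.head()           ## default is n = 5
first_ten = df_demo.head( n = 10 )    ## first 10 records
last_three = df_demo.tail( n = 3 )    ## last 3 records
##
print( last_three )
```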

In [None]:
##
## Task 2: Check the shape of the data
##
df_orders.shape

**_Code Explanation_**

*shape* is an attribute of the DataFrame so it does not require parentheses; it does not have any arguments.  Functions and methods have arguments (which may be defaults) so parentheses are required.  The *shape* attribute is chained to the DataFrame name.

**_Interpretation_**

The *shape* attribute returns the number of rows and the number of columns in that order.  The *orders* DataFrame has 70,270 *rows* or *records* or *observations* and 14 *columns* or *variables* or *features*.
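
Because *shape* is a pair of numbers in the order rows then columns, it can be unpacked into two names.  A minimal sketch with a made-up DataFrame (`df_demo`, `n_rows`, and `n_cols` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame( { 'CID': [ 101, 102, 103 ], 'Usales': [ 5, 3, 8 ] } )
##
## shape is a (rows, columns) tuple so it can be unpacked.
##
n_rows, n_cols = df_demo.shape
print( 'Rows: {}, Columns: {}'.format( n_rows, n_cols ) )
```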

In [None]:
##
## Task 3: Check the column names
##
df_orders.columns

**_Code Explanation_**

*columns* is an attribute of the DataFrame.

**_Interpretation_**

When checking the column names, be sure there are no white spaces before and after the name.  White spaces can (and will) cause problems because your tendency will be to write a column name without the leading and trailing white spaces; Python will then not recognize the name.  If you see leading and trailing white spaces, you can remove them using the following:

*df_orders.columns = df_orders.columns.str.strip()*

where *str* is the Pandas string accessor, which applies Python's string methods to each column name.  You may also want to convert the column names to all lower case:

*df_orders.columns = df_orders.columns.str.lower()*

You could do both at once using:

*df_orders.columns = df_orders.columns.str.strip().str.lower()*

Notice the use of *str* twice.
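
Here is a small sketch of the combined clean-up, assuming made-up column names with stray white space and mixed case (the DataFrame `df_demo` is illustrative only):

```python
import pandas as pd

##
## Hypothetical column names with stray white space and mixed case.
##
df_demo = pd.DataFrame( { ' CID ': [ 101 ], 'Tdate  ': [ '02/01/2019' ], ' Usales': [ 5 ] } )
##
## Strip the white space and convert to lower case in one statement.
##
df_demo.columns = df_demo.columns.str.strip().str.lower()
print( list( df_demo.columns ) )
```

Keep in mind that converting to lower case changes the names you must type afterwards, so the mnemonics in the Data Dictionary would need to match.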

In [None]:
##
## Task 4: Check for Missing
##
df_orders.info()

**_Code Explanation_**

*info* is a method chained to the DataFrame.  It returns the number of non-missing records for each column plus data types: object (i.e., text string), floating point numbers, integers, and datetime.

**_Interpretation_**

There are 14 columns with 10 having 70,270 non-missing values while the last four have fewer than 70,270 so they have missing values.  For example, *Ddisc* (the dealer discount based on the Data Dictionary) has 70,262 records so 8 are missing. 
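
The missing count quoted above is just the number of rows less the non-missing count (70,270 - 70,262 = 8).  The same arithmetic can be sketched on a small, made-up DataFrame; `df_demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

##
## Hypothetical discounts with two missing values in Ddisc.
##
df_demo = pd.DataFrame(
    { 'Ddisc': [ 0.10, np.nan, 0.12, np.nan ], 'Cdisc': [ 0.05, 0.04, 0.06, 0.05 ] }
)
##
## count() gives the non-missing count per column, the same
## number that info() reports.
##
non_missing = df_demo.count()
missing = len( df_demo ) - non_missing
print( missing )
```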

# <font color = black> Manipulating Your Data </font>

Sometimes you need to clean the columns in your DataFrame.  The columns are *variables* or *features*.  Cleaning could involve deleting, printing/displaying, or creating variables.  Other times you have to *merge* or *join* your DataFrame with another DataFrame to have a more complete data set for your analysis.

## <font color = black> Creating Variables </font>

You can create new variables using standard arithmetic notation.  Here is an example of adding the four discounts to get the total discount.  Each discount variable can be accessed as an *attribute* of the DataFrame using dot notation (e.g., *df_orders.Ddisc*) or, as in this example, by passing a list of column names in square brackets.

In [None]:
##
## Calculate total discount.
##
## Discounts are sometimes called "leakages" so the total is 
## the total leakage.
##
## Note: use "axis = 1" in the sum() function to sum across columns.
## This allows you to do the summation even with missing values.
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
df_orders[ 'Tdisc' ] = df_orders[ x ].sum( axis = 1 )
##
## Display only the discounts
##    Create a list of what to print. 
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc', 'Tdisc' ]
df_orders[ x ].head()

**_Code Explanation_**

The *sum* method has an axis argument that specifies the axis the function is to be applied on.  *axis = 0* specifies summing along the rows for each column (i.e., sum down a column) and *axis = 1* specifies summing along the columns in each row.
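
The two axis settings, and how *sum* treats missing values, can be sketched on a made-up DataFrame.  The `min_count` argument is an addition not used in this lesson; it is shown only as one way to keep a row total missing when every discount in the row is missing.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    { 'Ddisc': [ 0.10, np.nan, np.nan ], 'Odisc': [ 0.05, 0.02, np.nan ] }
)
##
col_totals = df_demo.sum( axis = 0 )     ## one total per column
row_totals = df_demo.sum( axis = 1 )     ## one total per row; NaN is skipped
##
## With min_count = 1, a row with no valid values returns NaN
## instead of 0.
##
row_totals_strict = df_demo.sum( axis = 1, min_count = 1 )
print( row_totals_strict )
```

By default a row in which all discounts are missing sums to 0, which may or may not be what you want for a total discount.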

**_Interpretation_**

Notice the *NaN* values.  *NaN* stands for *Not a Number*.  These are missing values.

## <font color = blue> Exercises </font>

### <font color = black> Exercise \#1.1 </font>

The *Pocket Price* is the list price less total discounts or total leakages.  It is the amount the business "pockets" and is the amount the customer actually pays.  The pocket price formula is $Pprice = Lprice \times (1  - Tdisc)$.  Calculate the pocket price and display the first five records for the list price and pocket price.

In [None]:
##
## Enter code here
##


## <font color = black> Merge or Join DataFrames </font>

It is not unusual to have data in two (or more) tables so you will need to merge or join them to get all the data you need for an analysis.  For our problem, a second data table has information on each customer and this second table must be merged with the orders table.  The merge is done on the customer *ID* (*CID*).  There are many types of joins but we will only use an *inner join* in the examples.  *Inner join* is the default.

In [None]:
##
## Import a second DataFrame on the customers
##
file = r'../data/customers.csv'
df_cust = pd.read_csv( file )
df_cust.head()

In [None]:
##
## Do an inner join using CID as the link
##
df = pd.merge( df_orders, df_cust, on = 'CID' )
##
df.head()

**_Code Explanation_**

The merge function takes two arguments: the *left* table and the *right* table to merge or join.  The tables are in that order.  A third argument specifies what to merge on.  There are several options for the *on* variable.  In this example, the *on* variable is just the common key in each table: *CID*.

An alternative form for the merge statement is:

*df = df_orders.merge( df_cust, on = 'CID' )*

An *inner join* is the default.  

See the Pandas documentation <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html" target="_parent">here</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html" target="_parent">here</a> for extensive discussion with examples about this topic.
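
To see what an inner join keeps and drops, here is a minimal sketch with two made-up tables; `orders`, `customers`, and their values are illustrative stand-ins for the lesson's DataFrames.  A left join is included only for contrast.

```python
import pandas as pd

##
## Hypothetical miniature versions of the two tables.
##
orders = pd.DataFrame( { 'CID': [ 1, 1, 2, 4 ], 'Usales': [ 5, 3, 8, 2 ] } )
customers = pd.DataFrame( { 'CID': [ 1, 2, 3 ], 'Region': [ 'East', 'West', 'South' ] } )
##
## Inner join: only CIDs present in both tables survive.
##
inner = pd.merge( orders, customers, on = 'CID', how = 'inner' )
##
## Left join: every order is kept; unmatched customer columns are NaN.
##
left = pd.merge( orders, customers, on = 'CID', how = 'left' )
##
print( inner.shape, left.shape )
```

In this sketch the order for customer 4 disappears from the inner join because there is no matching customer record, so comparing row counts before and after a merge is a useful check.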

In [None]:
## 
## Check the shape of the new DataFrame against 
## that of the orders and customers DataFrames
##
print( "Shape of the orders DataFrame: {}\n".format( df_orders.shape ) )
print( "Shape of the customers DataFrame: {}\n".format( df_cust.shape ) )
print( "Shape of the new DataFrame: {}".format( df.shape ) )

**_Code Explanation_**

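The string *format* method is used inside the *print* function: each pair of braces, *{}*, is a placeholder that is replaced by the corresponding argument of *format*.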
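
A short sketch of the placeholder substitution follows.  The shape tuple is typed in by hand for illustration, and the f-string on the second line is an alternative syntax (available in Python 3.6 and later) that is not used in this lesson.

```python
##
## Hypothetical shape tuple, typed in for illustration only.
##
shape = ( 70270, 16 )
##
## The {} placeholder is replaced by the argument of format().
##
msg_format = "Shape of the orders DataFrame: {}".format( shape )
msg_fstring = f"Shape of the orders DataFrame: {shape}"
##
print( msg_format )
print( msg_format == msg_fstring )
```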

**_Interpretation_**

*df_orders* has 16 columns, *df_cust* has 7 columns, while the merged *df* has 22.  The difference of one column is the *CID* which is in both and is the linking variable; it's only included once.

## <font color = blue> Exercises </font>

### <font color = black> Exercise \#1.2 </font>

Merge your order DataFrame and the customer DataFrame.  Name the merged DataFrame *df*.

In [None]:
##
## Enter code here
##


# <font color = black> Summary Statistics for Your Data </font>

Summary statistics are a mainstay for starting any analysis.  Pandas has all the usual descriptive statistics.  One method, *describe()*, will display the essential ones.

In [None]:
##
## Example: "describe" is a method attached to the DataFrame so it requires ().
## Round to 1 decimal place for readability (more decimal places are
## unnecessary, anyway).
##
## Display the descriptive statistics for the discounts.
##    First create a list of variables to display
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc', 'Tdisc' ]
df[ x ].describe().round( 1 )

**_Code Explanation_**

The *round* method is chained to the *describe* method.  An alternative way to round is shown next. 

The above report for the descriptive statistics is a challenge to read.  I prefer to have the statistics as the columns.  This can easily be done once you recognize that the report is just a matrix.  Matrices can be transposed which could help you read the report more easily.  Use the "*T*" attribute to transpose a matrix.

In [None]:
##
## Example of transposed matrix and alternative
## round function use.
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc', 'Tdisc' ]
round( df[ x ].describe().T, 1 ) 

**_Interpretation_**

From the *Five Number Summary* (min/25%/50%/75%/max), you can determine the skewness of your data.

1. Symmetric: $(75\% - 50\%) = (50\% - 25\%)$
2. Right Skewed: $(75\% - 50\%) > (50\% - 25\%)$
3. Left Skewed: $(75\% - 50\%) < (50\% - 25\%)$
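
The three rules above can be sketched as a small helper.  The function name `quartile_skew` and the sample values are hypothetical.  The comparison uses exact equality, so with real data the "symmetric" case will rarely occur and a tolerance may be preferable.

```python
import pandas as pd

def quartile_skew( s ):
    """Classify skewness by comparing the two halves of the interquartile range."""
    q25, q50, q75 = s.quantile( [ 0.25, 0.50, 0.75 ] )
    upper, lower = q75 - q50, q50 - q25
    if upper > lower:
        return 'right skewed'
    if upper < lower:
        return 'left skewed'
    return 'symmetric'

##
## Hypothetical values with a long right tail.
##
s = pd.Series( [ 1, 2, 3, 10, 20 ] )
print( quartile_skew( s ) )
```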

**QUESTION** What is the skewness for the Dealer Discount (*Ddisc*)?

**ANSWER**: right skewed

## <font color = blue> Exercises </font>

###  <font color = black> Exercise \#1.3 </font>

Using your merged orders/customers DataFrame, *df*, create a summary statistics display.  What is the skewness of the Total Discount (Tdisc)?

In [None]:
##
## Enter code here
##


## <font color = black> What's Next? </font>

In Lesson 2, I will show you how to do some basic graphing or *visualization* of your data.  This may seem more like scientific visualization than business visualization.  The latter usually means *infographics*, which are not useful for gaining insight and, hence, Rich Information.  The former, scientific visualization, is the tool for extracting Rich Information. 
<br><br><br>
<font color = red, size = "+3"><b> Five Minute Break </b></font>

# <font color = blue>Appendix</font>

This Appendix contains material extra to this lesson, material that you may want to review to solidify your understanding and knowledge about working with Python and Pandas.

This Appendix covers:

1. Importing data into Pandas;
2. Checking your data;
3. Manipulating columns of a Pandas DataFrame; and
4. Correlation analysis.

Homework exercises with solutions (in the Solutions Manual) are included.

## Appendix 1.1: Different Ways to Import Data Into Pandas

You could also define the path without the file name if you plan to use multiple files from the same directory.  You just have to define the general path and then concatenate the file name.  Here is an example.  Note:

1. the "+" which instructs Python to concatenate the two strings;
2. the quotes around the file name; and
3. the forward slash as the last symbol in the path definition.
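
As an alternative to string concatenation, the standard-library *pathlib* module can join a directory and a file name without a trailing slash.  This is a sketch under assumptions: `PurePosixPath` is used so that the printed result is the same on every operating system, whereas `pathlib.Path` would be the usual choice in practice.  The Pandas read functions accept path objects as well as strings.

```python
from pathlib import PurePosixPath

##
## Directory and file names matching the examples in this lesson.
##
path = PurePosixPath( '../data' )
##
## The "/" operator joins path components; no trailing slash is needed.
##
orders_file = path / 'orders.csv'
customers_file = path / 'customers.csv'
##
print( orders_file )
```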

In [None]:
##
## Specify a path to the CSV data and concatenate the file name.
##
## Not run
##
## path = r'../data/'
## df_orders = pd.read_csv( path + 'orders.csv' )
## df_orders.head()

Use the following form if your data are in the same directory as your notebook.

In [None]:
##
## Do not specify a path if the CSV file is in the same directory
## as your notebook.
##
## Not run
##
## df_orders = pd.read_csv( 'orders.csv' )

If your data are in an Excel file, use *pd.read_excel( filename, 'sheetname' )*.  The sheet name could be a character string or a sheet position as an integer (counting from 0).

In [None]:
##
## Example: Import an Excel file
##
## Not run
##
## path = r'../data/'
## df_orders = pd.read_excel( path + 'orders.xlsx', 'furniture' )
## df_orders.head()

## Appendix 1.2: Some Additional Information on Checking Your Data

Look at:

1. the head of your data;
2. the shape of your DataFrame; 
3. a list of column names; and
4. the missingness of your data.

### Task \#1: Display the first few records of your DataFrame

In [None]:
##
## Example: list the head of the imported data
## n = 5 is the default
##
df_orders.head( )

You should immediately notice that several discounts have missing values indicated by *NaN* (*Not a Number*).

### Task \#2: Check the shape of your DataFrame

In [None]:
##
## Example: check the shape of the imported data
## Note 1: the order of the returned shape is always #rows, #columns
## Note 2: there is no () for this command because the shape
## is a DataFrame attribute
##
print( "Shape of the DataFrame: {}".format( df_orders.shape ) )

There are 70,270 rows or observations and 15 columns or variables.

### Task \#3: Check the column names in your DataFrame

In [None]:
##
## Example: list the column names
##
df_orders.columns

### Task \#4: Check for missing data in your DataFrame

In [None]:
##
## Use the DataFrame's information content.  The info()
## method returns the number of non-missing rows for
## each variable.  The numbers are the same only if nothing is missing.
##
df_orders.info()

The above listing indicates that the four discounts each have missing values. 

You can count the number of missing values using the *isnull()* method and chaining the *sum()* function.  The *isnull()* method returns a Boolean variable so the *sum()* function just adds 0 and 1 values.  Use a nice print statement for clarity.

In [None]:
##
## Sum the Boolean variable returned by isnull()
## Let us check the Order Discount (Odisc).
##
x = df_orders.Odisc.isnull().sum()
print( 'Missing count for Odisc: {}'.format( x ) )

You can check the proportion of each variable that is missing rather than the sum. Proportions are more meaningful. Use the *mean()* function for this.  Since *isnull()* returns a Boolean, the mean is just the proportion.

In [None]:
##
## Check for missing values only for the discounts.
## Chain the mean() function to the isnull() method.
## Note: Do this for first 20 records for illustration only.
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
df_orders[ x ].iloc[ :20 ].isnull().mean()

Create a heatmap of missing data using the *isnull()* method on the entire DataFrame.  Use the transpose attribute, *T*, for a more readable chart.

In [None]:
##
## Note: the "cbar = False" argument turns off the color bar
## Note: Do this for first 20 records for illustration only
##
x = [ 'Ddisc', 'Odisc', 'Cdisc', 'Pdisc' ]
ax = sns.heatmap( df_orders[ x ].iloc[ :20, : ].isnull().T, cbar = False )
ax.set( title = 'Heatmap of Missing Data' )

## Appendix 1.3: Miscellaneous Pandas DataFrame Column Manipulations

### <font color = black> Deleting Columns </font>

You can easily delete unwanted columns with the *drop* method.  You must specify if you want the DataFrame replaced or not; the default is to not replace in which case you must assign a new name to the modified DataFrame.  If you drop a column, it is good practice to not replace your DataFrame so that you preserve your original data.

The *drop* method can be used to drop rows or columns, so you have to tell it which one.  This is done with the *axis* argument.  The DataFrame, as a simple rectangular array, is said to have two *axes*: the row axis and the column axis.  Since in mathematics the size of a matrix is conventionally specified as $\#rows \times \#columns$ (rows always come before columns), the DataFrame axes are designated as 0 and 1 for rows and columns, respectively, since 0 comes before 1.  Specifying *axis = 0* in the *drop* method says to drop a row while specifying *axis = 1* says to drop a column.  The default is *axis = 0*. 
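
Both axis settings can be sketched on a small, made-up DataFrame without replacing the original.  The names `df_demo`, `df_cols`, and `df_rows` are illustrative only.

```python
import pandas as pd

df_demo = pd.DataFrame(
    { 'obs': [ 1, 2, 3 ], 'CID': [ 101, 102, 103 ], 'Usales': [ 5, 3, 8 ] }
)
##
## Drop a column; the original DataFrame is not replaced.
##
df_cols = df_demo.drop( 'obs', axis = 1 )
##
## Drop a row by its index label (axis = 0 is the default).
##
df_rows = df_demo.drop( 0, axis = 0 )
##
print( df_cols.columns.tolist(), df_rows.shape, df_demo.shape )
```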

In [None]:
##
## Drop a column and replace the DataFrame.  Notice that axis = 1 is used
## to drop a column.
## 
## Not Run
##
## df_orders.drop( 'obs', axis = 1, inplace = True )
## df_orders.head()

### <font color = black> Printing or Displaying Specific Columns </font>

You can print specific columns by referring to their column indexes.  You can also print rows by chaining the *head()* function to the DataFrame name.  The default is the first five rows of your selection.

In this example, I will print the specific columns 2, 3, and 4.  What you would count as columns 2 - 4, Python counts as 1 - 3 because of the offset from the beginning.  Note that the column listing is left inclusive but right exclusive; in math notation, it is the half-open interval $[a, b)$.  So telling Python to print columns 1 - 3 is done with *columns[ 1:4 ]* which says to include column 1 and up to but excluding column 4.  Remember that column 1 is the first offset from the beginning so you see it as the second column.
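
The offset counting can be checked on an empty, made-up DataFrame whose five column names are borrowed from the Data Dictionary (`df_demo` and `selected` are illustrative names):

```python
import pandas as pd

##
## Hypothetical five-column DataFrame with no rows.
##
df_demo = pd.DataFrame( columns = [ 'Onum', 'CID', 'Tdate', 'Pline', 'Pclass' ] )
##
## Positions 1, 2, and 3 are selected; position 4 is excluded.
##
selected = df_demo.columns[ 1:4 ]
print( list( selected ) )
```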

In [None]:
##
## Example: 
##
## Notice that columns 1, 2, and 3 BUT NOT 4
## are displayed
##
df_orders[ df_orders.columns[ 1:4 ] ].head()

You can also create a list, which is in square brackets, to select the columns and then use that list.

In [None]:
##
## Create a list of columns
##
## Not Run
##
##x = [ 'Tdate', 'State', 'Region' ]
##df_orders[ x ].head()

### <font color = black> Creating Variables </font>

The following are exercises to create variables.

### <font color = black> Exercise \#A 1.3.1 </font>

Calculate total revenue as $Rev = Usales \times Pprice$.  Use the df_orders DataFrame.

In [None]:
##
## Enter code here
##


### <font color = black> Exercise \#A 1.3.2 </font>

*Contribution* and *contribution margin* are two values financial analysts often examine.  Contribution is comparable to what economists call *profit* but is more restricted in that it just refers to a product without considering any fixed or overhead costs.  Contribution is $Con = Revenue - Material~Cost$ and contribution margin is $CM = \dfrac{Con}{Revenue}$.  Calculate both quantities and display the first 5 records of unit sales, pocket price, material cost, revenue, contribution, and contribution margin.  Use the df_orders DataFrame.

In [None]:
##
## Enter code here
##


### <font color = black> Exercise \#A 1.3.3 </font>

Some products are returned so another revenue number, *revenue net of returns*, is more meaningful and revealing for business decisions.  Net revenue is
<br><br>
$Net~Revenue = (Unit~Sales - Returns) \times Pocket~Price$.
<br><br>
Calculate net revenue and call it 'netRev'.  Also calculate the loss in revenue due to the returns and call it 'lostRev'.  Display the first five records of the DataFrame using just gross revenue, net revenue, and the lost revenue due to returns.  Use the *df_orders* DataFrame.  

In [None]:
##
## Enter code here
##


### <font color = black> Exercise \#A 1.3.4 </font>

Display the descriptive statistics for gross revenue, net revenue, contribution, and contribution margin.  Round the answers to two decimal places.  Use the *df_orders* DataFrame. 

In [None]:
##
## Enter code here
##


## Appendix 1.4: Correlation Analysis

There is a correlation method, *corr()*, attached to a DataFrame.

In [None]:
##
## Example: display a correlation matrix of the discounts
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
df[ x ].corr().round( 3 )

**_Interpretation_**

The correlation matrix has a 1.0 in the cells along its main diagonal (the diagonal running from top left to bottom right).  The off-diagonal cells have the pair-wise correlations.  Notice that the correlation matrix is symmetric around the main diagonal: the top portion (called the *upper triangle*) matches the bottom portion (called the *lower triangle*).
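
The two properties just described, a unit main diagonal and symmetry, can be verified with a short sketch.  The data are simulated for illustration only and are not the lesson's discounts.

```python
import numpy as np
import pandas as pd

##
## Simulated data, for illustration only.
##
rng = np.random.default_rng( 42 )
df_demo = pd.DataFrame( rng.normal( size = ( 100, 3 ) ), columns = [ 'a', 'b', 'c' ] )
##
cor = df_demo.corr().to_numpy()
##
## Symmetric: the matrix equals its transpose.
## Unit diagonal: each variable correlates perfectly with itself.
##
is_symmetric = np.allclose( cor, cor.T )
has_unit_diagonal = np.allclose( np.diag( cor ), 1.0 )
print( is_symmetric, has_unit_diagonal )
```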

Rounding to three decimal places is usually sufficient.

The correlations are all very low indicating that the discounts are not linearly associated.

A *heatmap* is sometimes more effective for displaying a correlation matrix. 

In [None]:
##
## Example: plot the correlation matrix as a heatmap
##
x = [ 'Ddisc', 'Cdisc', 'Odisc', 'Pdisc' ]
cor = df[ x ].corr()
ax = sns.heatmap( cor )
ax.set( title = 'Heatmap of the Correlation Matrix' )

**_Interpretation_**

The cells along the main diagonal are all white which, by the color bar on the right, indicates they are all 1.0 as they should be.  All other cells are black indicating that the correlations are all close to 0.0.

###  <font color = black> Exercise \#A 1.4.1 </font>

Create a correlation matrix and corresponding heatmap for gross revenue, net revenue, contribution, and contribution margin.  What would you tell your product manager about the correlations?  Use the *df_orders* DataFrame.  

In [None]:
##
## Enter code here
##
