# Practical Statistics for Data Scientists
## Exploratory Data Analysis

This notebook is the Python equivalent of the R code for Chapter 1 of the book <a href="http://shop.oreilly.com/product/0636920048992.do">Practical Statistics for Data Scientists</a> by Peter Bruce and Andrew Bruce. This <a href="https://github.com/andrewgbruce/statistics-for-data-scientists">GitHub</a> repository has the complete R code for the book.

The authors note that the book aims to be a "desk reference" for the key statistical concepts relevant to data science, explaining why each concept is important and why it was chosen.

The data used in the book has been curated by the authors and made available on <a href="https://drive.google.com/drive/folders/0B98qpkK5EJemYnJ1ajA1ZVJwMzg">Google Drive</a> and <a href="https://www.dropbox.com/sh/clb5aiswr7ar0ci/AABBNwTcTNey2ipoSw_kH5gra?dl=0">Dropbox</a>.

<b>NOTE:</b>
The data for creating the contingency table was downloaded from the Lending Club <a href="https://www.lendingclub.com/info/download-data.action">website</a> and covers the years 2007-2011. Please see the screenshot below.
<img src="../img/lending_club.png" height="200" width="400">


In [62]:
import numpy as np
import pandas as pd


In [63]:
# Read the data from a .csv file.
# This file is a trimmed version of the original
# CSV file from the Lending Club website. It keeps only the
# grade, sub_grade and loan_status columns.

loanDataDF = pd.read_csv("../data/lc_Stats_2007_2011.csv")
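
The notebook does not show how the trimmed file was produced. A minimal sketch of one way to keep only the three columns is the `usecols` parameter of `pd.read_csv`. The in-memory CSV below, including the `loan_amnt` column and its values, is a hypothetical stand-in for the full Lending Club export and is not the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the full Lending Club export (made-up rows).
raw_csv = io.StringIO(
    "loan_amnt,grade,sub_grade,loan_status\n"
    "5000,B,B2,Fully Paid\n"
    "2500,C,C4,Charged Off\n"
)

# Columns the notebook says are kept in the trimmed file.
keep = ["grade", "sub_grade", "loan_status"]

# usecols tells read_csv to load only the listed columns.
trimmed = pd.read_csv(raw_csv, usecols=keep)
print(trimmed)
```

Writing `trimmed` back out with `to_csv` would then give a smaller file that can be read as in the cell above.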



In [64]:
# Check the type of the data structure holding the data
type(loanDataDF)

pandas.core.frame.DataFrame

In [65]:
# Top 10 rows/records of the data
loanDataDF.head(10)

Unnamed: 0,grade,sub_grade,loan_status
0,B,B2,Fully Paid
1,C,C4,Charged Off
2,C,C5,Fully Paid
3,C,C1,Fully Paid
4,B,B5,Fully Paid
5,A,A4,Fully Paid
6,C,C5,Fully Paid
7,E,E1,Fully Paid
8,F,F2,Charged Off
9,B,B5,Charged Off


In [66]:
# Bottom 10 rows/records of the data
loanDataDF.tail(10)

Unnamed: 0,grade,sub_grade,loan_status
42528,C,C4,Does not meet the credit policy. Status:Fully ...
42529,B,B2,Does not meet the credit policy. Status:Fully ...
42530,B,B3,Does not meet the credit policy. Status:Fully ...
42531,B,B5,Does not meet the credit policy. Status:Fully ...
42532,B,B4,Does not meet the credit policy. Status:Charge...
42533,C,C1,Does not meet the credit policy. Status:Fully ...
42534,B,B4,Does not meet the credit policy. Status:Fully ...
42535,B,B3,Does not meet the credit policy. Status:Fully ...
42536,A,A5,Does not meet the credit policy. Status:Fully ...
42537,A,A3,Does not meet the credit policy. Status:Fully ...


In [67]:
# Get the data types of the features/attributes in the data
loanDataDF.dtypes

grade          object
sub_grade      object
loan_status    object
dtype: object

<br>

## Two Variables - Both Categorical

### Contingency Table

In [70]:
loanDataCrossTab = pd.crosstab(loanDataDF['grade'], loanDataDF['loan_status'], margins=True)

In [71]:
print(loanDataCrossTab)

loan_status  Charged Off  Does not meet the credit policy. Status:Charged Off  \
grade                                                                           
A                    602                                                  8     
B                   1433                                                 85     
C                   1356                                                148     
D                   1130                                                197     
E                    725                                                158     
F                    323                                                 93     
G                    101                                                 72     
All                 5670                                                761     

loan_status  Does not meet the credit policy. Status:Fully Paid  Fully Paid  \
grade                                                                         
A                              

<b>NOTE:</b><br>
The columns in the contingency table with the “Does not meet the credit policy” wording represent loans that were funded by investors and issued by Lending Club prior to 2010, but that would not qualify for listing today.<br>
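
If a narrower table is easier to read, one option is to strip the “Does not meet the credit policy” prefix so that each loan falls under either “Fully Paid” or “Charged Off”, and to keep a separate flag for the legacy loans. This is a sketch only: the `sample` frame and the `meets_policy` and `status_simple` column names are hypothetical, and the rows are made up to mirror the status strings shown above.

```python
import pandas as pd

# Made-up rows that mirror the loan_status strings in the table above.
sample = pd.DataFrame({
    "grade": ["A", "B", "B", "C"],
    "loan_status": [
        "Fully Paid",
        "Does not meet the credit policy. Status:Fully Paid",
        "Does not meet the credit policy. Status:Charged Off",
        "Charged Off",
    ],
})

prefix = "Does not meet the credit policy. Status:"

# Flag loans that carry the legacy prefix, then strip it (literal match, no regex).
sample["meets_policy"] = ~sample["loan_status"].str.startswith(prefix)
sample["status_simple"] = sample["loan_status"].str.replace(prefix, "", regex=False)

# Contingency table on the simplified status, with row and column totals.
simple_tab = pd.crosstab(sample["grade"], sample["status_simple"], margins=True)
print(simple_tab)
```

Keeping the `meets_policy` flag means the distinction between current and legacy loans is not lost when the statuses are merged.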

The grade categories range from A to G, with A being the "lowest risk" and F and G being the "high risk" grades in terms of paying off the loan. A loan becomes “charged off” when there is no longer a reasonable expectation of further payments. Charge-off typically occurs by the time a loan is 150 days past due.
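
Raw counts make it hard to compare risk across grades, because the grades hold different numbers of loans. A minimal sketch of one way to compare them is to pass `normalize="index"` to `pd.crosstab`, which turns each row into proportions that sum to 1. The `toy` frame below is made-up illustration data, not Lending Club figures.

```python
import pandas as pd

# Made-up data: 4 loans in grade A (1 charged off), 4 in grade F (2 charged off).
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "F", "F", "F", "F"],
    "loan_status": ["Fully Paid"] * 3 + ["Charged Off"]
                   + ["Fully Paid"] * 2 + ["Charged Off"] * 2,
})

# normalize="index" divides each count by its row total,
# so every grade's row sums to 1.
row_share = pd.crosstab(toy["grade"], toy["loan_status"], normalize="index")
print(row_share.round(2))
```

Applied to `loanDataDF`, the same call would give the share of each loan status within every grade, which is the quantity the risk ordering of the grades refers to.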


<br>