<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Statistics-in-Data-Science,-A-Review" data-toc-modified-id="Statistics-in-Data-Science,-A-Review-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Statistics in Data Science, A Review</a></span><ul class="toc-item"><li><span><a href="#Statistics-in-Python" data-toc-modified-id="Statistics-in-Python-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Statistics in Python</a></span><ul class="toc-item"><li><span><a href="#Statistics-Packages-for-Python" data-toc-modified-id="Statistics-Packages-for-Python-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Statistics Packages for Python</a></span><ul class="toc-item"><li><span><a href="#Importing-the-packages" data-toc-modified-id="Importing-the-packages-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span>Importing the packages</a></span></li><li><span><a href="#Testing-the-packages" data-toc-modified-id="Testing-the-packages-1.1.1.2"><span class="toc-item-num">1.1.1.2&nbsp;&nbsp;</span>Testing the packages</a></span></li></ul></li></ul></li><li><span><a href="#Types-of-Statistics:-Descriptive-&amp;-Inferential" data-toc-modified-id="Types-of-Statistics:-Descriptive-&amp;-Inferential-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Types of Statistics: Descriptive &amp; Inferential</a></span><ul class="toc-item"><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Descriptive Statistics</a></span></li><li><span><a href="#Inferential-Statistics" data-toc-modified-id="Inferential-Statistics-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Inferential Statistics</a></span><ul class="toc-item"><li><span><a href="#A-Closer-look-at-Statistical-Inference" data-toc-modified-id="A-Closer-look-at-Statistical-Inference-1.2.2.1"><span class="toc-item-num">1.2.2.1&nbsp;&nbsp;</span>A Closer look at Statistical 
Inference</a></span></li></ul></li><li><span><a href="#Types-of-Data" data-toc-modified-id="Types-of-Data-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Types of Data</a></span></li><li><span><a href="#Graphing-and-Visualizing-Data:" data-toc-modified-id="Graphing-and-Visualizing-Data:-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Graphing and Visualizing Data:</a></span><ul class="toc-item"><li><span><a href="#Graphing-Qualitative-Data" data-toc-modified-id="Graphing-Qualitative-Data-1.2.4.1"><span class="toc-item-num">1.2.4.1&nbsp;&nbsp;</span>Graphing Qualitative Data</a></span><ul class="toc-item"><li><span><a href="#Cleaning-up-the-data" data-toc-modified-id="Cleaning-up-the-data-1.2.4.1.1"><span class="toc-item-num">1.2.4.1.1&nbsp;&nbsp;</span>Cleaning up the data</a></span></li></ul></li></ul></li></ul></li></ul></li></ul></div>

# Statistics in Data Science, A Review

__Purpose:__ The purpose of this section is to perform a brief but comprehensive review of the field of Statistics as applied in Data Science. We will cover the most important introductory topics of Statistics, starting with the types of statistics, graphing and summarizing data, and comparing samples vs. populations, then moving on to concepts such as the Central Limit Theorem, estimators, and the Law of Large Numbers, and ending with confidence intervals and hypothesis testing. 

__At the end of this section we will have reviewed how to do the following with Python:__
> 1. Understand and define Statistics as well as differentiate between Descriptive and Inferential Statistics.
> 2. Graph and summarize data using measures of Central Tendency and Variation 
> 3. Understand the characteristics of some common distributions such as Normal, Binomial, Uniform, and Exponential 
> 4. Work with bi-variate data, calculate summary statistics such as Correlation, and distinguish Correlation from Causation 
> 5. Understand the differences between working with samples and populations 
> 6. Define the Central Limit Theorem and Law of Large Numbers 
> 7. Perform Hypothesis Testing 
> 8. Calculate Confidence Intervals 

## Statistics in Python

__Overview:__ 
- Python has a wide range of useful functions to perform Statistical routines. These functions are found in the following two Modules: 
> 1. __[`scipy.stats`](https://docs.scipy.org/doc/scipy-0.18.1/reference/stats.html):__ The `stats` module in the SciPy Package offers many Statistical functions such as mean, zscore, correlation as well as other sub-modules for Continuous Distributions, Multivariate Distributions, and Discrete Distributions
> 2. __[`numpy`](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.statistics.html):__ The `numpy` package itself offers many Statistical Functions such as Order Statistics, Averages and Variances, Correlating, and Histograms

__Helpful Points:__
1. It is not necessary to access a sub-package within the `numpy` package like we did with `numpy.linalg`. Instead, we can simply execute the function directly from NumPy: `np.func_name`

__Practice:__ Import the Statistics Modules in Python 

### Statistics Packages for Python

#### Importing the packages

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import math 
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
%whos

Variable   Type      Data/Info
------------------------------
autopep8   module    <module 'autopep8' from '<...>e-packages\\autopep8.py'>
json       module    <module 'json' from 'c:\\<...>\lib\\json\\__init__.py'>
math       module    <module 'math' (built-in)>
np         module    <module 'numpy' from 'c:\<...>ges\\numpy\\__init__.py'>
pd         module    <module 'pandas' from 'c:<...>es\\pandas\\__init__.py'>
plt        module    <module 'matplotlib.pyplo<...>\\matplotlib\\pyplot.py'>
random     module    <module 'random' from 'c:<...>ython37\\lib\\random.py'>
sns        module    <module 'seaborn' from 'c<...>s\\seaborn\\__init__.py'>
stats      module    <module 'scipy.stats' fro<...>ipy\\stats\\__init__.py'>


#### Testing the packages

In [3]:
stats.describe([1,2,3,4,5]) # scipy stats sub-package

DescribeResult(nobs=5, minmax=(1, 5), mean=3.0, variance=2.5, skewness=0.0, kurtosis=-1.3)

In [5]:
stats.describe([1,2,3,4,5]).mean # checking just the mean with scipy stats

3.0

In [4]:
np.mean([1,2,3,4,5]) # using numpy package

3.0
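
One detail in the `describe` output above is worth a closer look: the reported variance is 2.5, which is the sample variance (dividing by n - 1). NumPy's `np.var` divides by n unless told otherwise, so the two packages can disagree on the same data. A minimal sketch of the difference, reusing the same five numbers:

```python
import numpy as np
from scipy import stats

data = [1, 2, 3, 4, 5]

# scipy's describe() uses ddof=1 by default (sample variance)
summary = stats.describe(data)

# numpy's var() uses ddof=0 by default (population variance)
population_var = np.var(data)

# passing ddof=1 makes numpy agree with scipy
sample_var = np.var(data, ddof=1)

print(summary.variance, population_var, sample_var)
```

Which version is appropriate depends on whether the data is treated as a whole population or as a sample from one, a distinction this section returns to later.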

## Types of Statistics: Descriptive & Inferential

### Descriptive Statistics
Descriptive Statistics provides a summary of a set of data and its properties. It is important to understand that descriptive statistics simply collects and records metrics, nothing more. 
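
As a small illustration of what "summarizing" means here, the sketch below computes a few common descriptive metrics with NumPy. The numbers are made up purely for illustration and do not come from any dataset in this notebook.

```python
import numpy as np

# Hypothetical measurements, chosen only for illustration
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(values)                     # central tendency
median = np.median(values)                 # central tendency, less sensitive to outliers
std = np.std(values)                       # variation (population standard deviation)
value_range = values.max() - values.min()  # variation (spread between extremes)

print(mean, median, std, value_range)
```

Each of these numbers describes only the eight values in hand; none of them says anything about data we did not collect.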

### Inferential Statistics

The purpose of inferential statistics, as the name indicates, is to use data from a sample of a population in order to make inferences about the population itself.

#### A Closer look at Statistical Inference

The world we live in is a complex, random, and uncertain process composed of many other processes. It is also a massive data-generating machine.

Data represent traces of those real-world processes, and exactly which traces we gather are decided by our data collection or sampling method.

We as scientists take interest in processes. An important step in our work is separating our process of interest from our data collection method. After separating the process from the data collection, we can see clearly that there are two sources of randomness and uncertainty. In particular, these are the uncertainty and randomness of the process itself, and the uncertainty of our collection methods.

Part of the implication of this is that you can't just go around collecting data then immediately understand the processes of the world.

To begin gaining this understanding, you need to simplify those captured traces into something more comprehensible, something that captures it all in a much more concise way. That something could be mathematical models or functions of the data, known as statistical estimators.

More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
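
The ideas above can be sketched in a few lines of simulation. This is only a toy model under stated assumptions: the "process" is assumed to be a normal distribution with mean 50 and standard deviation 10, and the "collection method" is assumed to be a simple random sample. Neither choice comes from real data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is repeatable

# Stand-in for the real-world process: normally we never observe all of this
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Stand-in for our data collection method: a simple random sample of 1,000 traces
sample = rng.choice(population, size=1_000, replace=False)

# The sample mean is an estimator of the (usually unknown) population mean
population_mean = population.mean()
sample_mean = sample.mean()

# The standard error quantifies the uncertainty that sampling adds
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The sample mean lands close to the population mean but not exactly on it, and a different seed would give a different estimate. That gap is the sampling uncertainty described above, layered on top of the randomness of the process itself.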

### Types of Data

__Overview:__ 
- In general, we can distinguish between two types of data:
> 1. __[Qualitative Data](https://en.wikipedia.org/wiki/Qualitative_property):__ Qualitative Data (also called categorical data) consists of discrete categories, also known as levels. It can be further classified as nominal (unordered categories) or ordinal (ordered categories).
> 2. __[Quantitative Data](https://en.wikipedia.org/wiki/Quantitative_research):__ Quantitative Data is also known as Numerical Data because it contains numbers as counts or measurements 
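
In pandas, a rough first pass at this distinction is to split columns by dtype. The sketch below uses a toy DataFrame whose column names and values are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

# Toy rows with illustrative (hypothetical) column names and values
toy_df = pd.DataFrame({
    "city": ["PITTSBURGH", "PITTSBURGH", "PITTSBURGH"],                   # qualitative
    "heating": ["Central Heat", "Central Heat with AC", "Central Heat"],  # qualitative
    "living_area": [1200.0, 1761.0, 3225.0],                              # quantitative (measurement)
    "half_baths": [0, 0, 1],                                              # quantitative (count)
})

# Numeric dtypes are candidates for quantitative data; everything else for qualitative
quantitative_cols = toy_df.select_dtypes(include="number").columns.tolist()
qualitative_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(quantitative_cols)
print(qualitative_cols)
```

The dtype is only a hint. A numeric code such as a zip code or a municipality code is stored as a number but still labels a category, so it should be treated as qualitative.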

### Graphing and Visualizing Data:

__Overview:__ 
- Depending on the type of data, there are different types of graphs that we can use to explore the properties of the data 
> 1. __Qualitative Data:__ Qualitative Data can be summarized using the following methods:
>> a. Frequency Table <br>
>> b. Pie Charts <br>
>> c. Bar Charts 
> 2. __Quantitative Data:__ Quantitative Data can be summarized using the following methods: 
>> a. Histograms <br>
>> b. Scatter Plots <br>
>> c. Line Graph 
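
Before drawing anything, it can help to see the numbers that sit behind two of these graphs. The sketch below builds a frequency table for a qualitative variable and histogram bin counts for a quantitative one. The values are small illustrative samples, and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative samples
heating = pd.Series([
    "Central Heat with AC", "Central Heat with AC", "Central Heat with AC",
    "Central Heat", "Central Heat",
])
living_area = np.array([1761.0, 1275.0, 888.0, 1200.0, 3225.0])

# Qualitative: a frequency table counts how often each category occurs
frequency_table = heating.value_counts()

# Quantitative: a histogram counts how many values fall into each bin
counts, bin_edges = np.histogram(living_area, bins=3)

print(frequency_table)
print(counts, bin_edges)
```

To turn these into charts, `frequency_table.plot(kind="bar")` draws a bar chart of the categories and `plt.hist(living_area, bins=3)` draws the histogram with matplotlib.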

__Helpful Points:__
1. There is a lot of debate, and many best practices have been shared, regarding the type of graph that should be chosen. Feel free to read about this online; [here](https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization) is one such resource. We'll cover this in more depth later when we specifically cover Visualization.

__Practice:__ Examples of Graphing Data in Python 

In [6]:
import pandas as pd
import numpy as np
from scipy import stats

My hometown is Pittsburgh, PA, which is located in Allegheny County. Let's play with an Allegheny County housing dataset!

In [10]:
# read in our data and the key for interpreting some of the values

ach_df = pd.read_csv('data/AlleghenyHousing/assessments.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [13]:
# reading this will raise a UnicodeDecodeError unless we tell pandas how to decode the string values

ach_key_df = pd.read_csv('data/AlleghenyHousing/assessment_dictionary.csv', encoding="ISO-8859-1")
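
To see why the `encoding` argument matters, here is a self-contained sketch that does not touch the real dictionary file. The CSV content below is made up for illustration; the only assumption is that it contains a non-ASCII character saved as ISO-8859-1, which is the situation the comment above describes.

```python
import io
import pandas as pd

# A tiny, made-up CSV whose text contains a non-ASCII character
raw_bytes = "FIELD,DESCRIPTION\nUSEDESC,Café or restaurant\n".encode("ISO-8859-1")

# Decoding those bytes as UTF-8 (the pandas default) fails
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Telling pandas the actual encoding lets it read the text correctly
key_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")

print(utf8_failed, key_df.loc[0, "DESCRIPTION"])
```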

In [15]:
#ach_key_df

We've gotten some very useful feedback from reading in `assessments.csv`: pandas reported mixed datatypes in some of our columns. It's probably a good guess that there are missing datapoints and therefore a mixture of numerical data and NaN values. We'll find out!
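
One way to check a guess like this is to count the missing values in each column. The sketch below uses a tiny in-memory CSV rather than the real file; the column names are borrowed from the housing data, but the rows are illustrative.

```python
import io
import pandas as pd

# A tiny stand-in CSV with gaps; the rows are illustrative, not the real data
csv_text = """PROPERTYUNIT,FIREPLACES,FINISHEDLIVINGAREA
UNIT 603,,1761.0
UNIT 604,,1275.0
,0.0,1200.0
,0.0,
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

# Count the missing (NaN) entries in every column
missing_per_column = toy_df.isna().sum()

print(missing_per_column)
print(toy_df.dtypes)
```

Note that a numeric column with gaps is still read as `float64`, with NaN standing in for the missing entries. On the real data, `ach_df.isna().sum()` would give the same per-column count.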

In [9]:
# view the first ten lines of the data
ach_df.head(10)

Unnamed: 0,PARID,PROPERTYHOUSENUM,PROPERTYFRACTION,PROPERTYADDRESS,PROPERTYCITY,PROPERTYSTATE,PROPERTYUNIT,PROPERTYZIP,MUNICODE,MUNIDESC,...,HALFBATHS,HEATINGCOOLING,HEATINGCOOLINGDESC,FIREPLACES,BSMTGARAGE,FINISHEDLIVINGAREA,CARDNUMBER,ALT_ID,TAXYEAR,ASOFDATE
0,0001G00224060300,151.0,,FORT PITT BLVD,PITTSBURGH,PA,UNIT 603,15222.0,101,1st Ward - PITTSBURGH,...,0.0,B,Central Heat with AC,,,1761.0,1.0,,2018,01-SEP-18
1,0001G00224060400,151.0,,FORT PITT BLVD,PITTSBURGH,PA,UNIT 604,15222.0,101,1st Ward - PITTSBURGH,...,0.0,B,Central Heat with AC,,,1275.0,1.0,,2018,01-SEP-18
2,0001G00224060500,151.0,,FORT PITT BLVD,PITTSBURGH,PA,UNIT 605,15222.0,101,1st Ward - PITTSBURGH,...,1.0,B,Central Heat with AC,,,888.0,1.0,,2018,01-SEP-18
3,0002M00222000000,40.0,,VAN BRAAM ST,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,0.0,2,Central Heat,0.0,0.0,1200.0,1.0,,2018,01-SEP-18
4,0002M00223000000,42.0,,VAN BRAAM ST,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,0.0,2,Central Heat,0.0,0.0,3225.0,1.0,,2018,01-SEP-18
5,0002M00224000000,1638.0,,FORBES AVE,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,,,,,,,,,2018,01-SEP-18
6,0002M00225000000,1632.0,,FORBES AVE,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,,,,,,,,,2018,01-SEP-18
7,0002M00227000000,1628.0,,FORBES AVE,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,,,,,,,,,2018,01-SEP-18
8,0002M00228000000,1631.0,,TUSTIN ST,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,,,,,,,,,2018,01-SEP-18
9,0002M00229000000,1626.0,,FORBES AVE,PITTSBURGH,PA,,15219.0,101,1st Ward - PITTSBURGH,...,,,,,,,,,2018,01-SEP-18


As we can see this data isn't very clean. There are mixed types in some columns and a whole lot of data that we simply don't want or need. We can handle this as we go.

This is probably a pretty big dataset. Maybe we should see how many rows we have in our dataframe.

In [18]:
len(ach_df)

578221

Almost 600,000 rows of data and 86 columns. That's a lot of data, which is a good thing. It does mean, though, that we may have a lot of cleaning to do, and that some of our operations may take a minute or so to complete, depending on how fast our machine is.
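
Note that `len` reports only the number of rows. The `.shape` attribute returns rows and columns together, as this short sketch on a toy DataFrame (with made-up values) shows:

```python
import pandas as pd

# Toy DataFrame with made-up values, standing in for the real data
toy_df = pd.DataFrame({
    "PARID": ["a", "b", "c"],
    "TAXYEAR": [2018, 2018, 2018],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy_df.shape

print(n_rows, n_cols)
```

Running `ach_df.shape` on the housing data would report both dimensions in one call.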

In [17]:
# let's convert the ASOFDATE column to datetime

ach_df['ASOFDATE'] = pd.to_datetime(ach_df['ASOFDATE'])
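
On a dataset this size, it may help to tell pandas the date layout explicitly rather than letting it work the format out. The sketch below assumes every value looks like `01-SEP-18` (day, abbreviated English month name, two-digit year), as in the rows shown earlier, and runs on a small stand-in Series rather than the real column.

```python
import pandas as pd

# Stand-in for the ASOFDATE column, using the layout seen in the data preview
dates = pd.Series(["01-SEP-18", "01-SEP-18"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")

print(parsed)
```

An explicit format also makes malformed values fail loudly instead of being silently misread. The `%b` directive depends on the month names of the active locale, so this sketch assumes an English locale.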

#### Graphing Qualitative Data

Let's check the datatypes of each column.

In [25]:
ach_df.dtypes

PARID                         object
PROPERTYHOUSENUM             float64
PROPERTYFRACTION              object
PROPERTYADDRESS               object
PROPERTYCITY                  object
PROPERTYSTATE                 object
PROPERTYUNIT                  object
PROPERTYZIP                  float64
MUNICODE                       int64
MUNIDESC                      object
SCHOOLCODE                     int64
SCHOOLDESC                    object
LEGAL1                        object
LEGAL2                        object
LEGAL3                        object
NEIGHCODE                     object
NEIGHDESC                     object
TAXCODE                       object
TAXDESC                       object
TAXSUBCODE                    object
TAXSUBCODE_DESC               object
OWNERCODE                      int64
OWNERDESC                     object
CLASS                         object
CLASSDESC                     object
USECODE                        int64
USEDESC                       object
L

##### Cleaning up the data

There are two cases 