# Statistics Using Python

```python
# import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# apply seaborn's default style to matplotlib plots
sns.set()
```

## Exploratory Data Analysis (EDA)
- It's best practice to do quick exploration prior to advanced analysis
    - viewing structure
    - checking for NaNs
    - calculating summary stats
    - plotting histograms
    - plotting boxplots to check for outliers, percentiles, and median
        - line is the median, box spans the IQR, whiskers extend to the most extreme data points within 1.5 × IQR of the box edges (so they stop at the data's min/max if that is closer), dots are outliers
    - bee swarm plots to check distribution
        - avoids binning bias from histograms
        - however, they get crowded and hard to read with too many data points
    - empirical cumulative distribution function (ECDF)
        - shows every data point with no binning, so plot it early when examining the data
        - often worth overlaying multiple columns (or splitting by category) on one plot when comparing groups
        - helper function that returns the x, y values for an ECDF:
        
```python
def ecdf(data):
    """Return x, y data for an ECDF receiving a 1d array, list, or series."""

    # number of data points
    n = len(data)

    # x data: the sorted values
    x = np.sort(data)

    # y data: evenly spaced fractions from 1/n to 1
    y = np.arange(1, n + 1) / n

    # return x and y
    return x, y
```
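
As a quick check of what `ecdf` returns, here is a minimal sketch on a made-up five-point sample (the values and the name `sample` are illustrative assumptions, not data from these notes). The function is repeated in compact form only so the block runs on its own.

```python
import numpy as np

def ecdf(data):
    """Return x, y data for an ECDF (same logic as the function above)."""
    n = len(data)
    return np.sort(data), np.arange(1, n + 1) / n

# Hypothetical sample: five measurements in arbitrary units
sample = np.array([3.1, 1.2, 4.8, 2.5, 3.9])
x, y = ecdf(sample)

# Each y value is the fraction of observations less than or equal to x
for xi, yi in zip(x, y):
    print(f"{xi:.1f} -> {yi:.1f}")
```

When plotting, markers without connecting lines (for example `plt.plot(x, y, marker=".", linestyle="none")`) are a common choice, since each marker is one observation.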

#### Summary Stats
- Good idea to check quick summary stats
    - `np.mean()`
    - `np.median()` (50th percentile)
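    - `np.percentile(data, q)`
        - `q` is a single percentile or a list of percentiles (values from 0 to 100)
        - interquartile range (IQR) is the 75th percentile minus the 25th percentile
    - `np.var()` variance (population variance by default; pass `ddof=1` for the sample variance)
    - `np.std()` standard deviation (same `ddof` default as `np.var()`)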
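
A minimal sketch of these calls on a small made-up array (the data and variable names are illustrative assumptions). It also shows how the `ddof` argument changes the variance.

```python
import numpy as np

# Hypothetical data, chosen so the results are easy to check by hand
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = np.mean(data)
median = np.median(data)

# Pass a list to get several percentiles at once
q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1
pop_var = np.var(data)
sample_var = np.var(data, ddof=1)
pop_std = np.std(data)

print(mean, median, iqr, pop_var, sample_var, pop_std)
```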

#### Simple Correlation
- Covariance
    - `np.cov(arr1, arr2)`
        - returns a 2d covariance matrix (entries `[0,1]` and `[1,0]` hold the covariance of the two arrays; the diagonal holds each array's variance)
        - `cov_mat = np.cov(arr1, arr2)` then `real_co_var = cov_mat[0,1]`
        - divides by n - 1 by default (sample covariance); pass `ddof=0` to divide by n
    - mean of the product of the x and y distances from their means
        - sum((x - x_mean) * (y - y_mean)) / number of observations
        - positive covariance means a positive correlation, negative means negative corr
    - function:
    
```python
def co_var(x, y):
    """Compute covariance between two arrays."""
    
    # Compute covariance matrix: cov_mat
    cov_mat = np.cov(x, y)

    # Return entry [0,1]
    return cov_mat[0,1]
```
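
To connect the formula to `np.cov`, the following sketch computes the covariance by hand on made-up data (`x`, `y`, and the result names are illustrative assumptions) and compares it with NumPy. One detail worth knowing: `np.cov` divides by n - 1 by default, so it matches the "divide by the number of observations" formula only when `ddof=0` is passed.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Mean of the product of the distances from the means (divides by n)
manual_cov = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov with ddof=0 uses the same normalization
numpy_cov_pop = np.cov(x, y, ddof=0)[0, 1]

# Default np.cov divides by n - 1 instead (sample covariance)
numpy_cov_sample = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov_pop, numpy_cov_sample)
```

For large samples the two normalizations give nearly the same value; the difference matters mostly for small n.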

- Pearson Correlation Coefficient
    - `np.corrcoef(arr1, arr2)`
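        - returns a 2d correlation matrix (entry `[0,1]` has the data of interest; the diagonal is always 1)
        - see covariance above for data extraction
    - written ρ (lowercase Greek rho, which looks a bit like a p) for a population and r for a sample
    - covariance / (np.std(x) * np.std(y))
        - variability due to covariance / independent variability
        - use the same normalization in numerator and denominator (e.g. covariance with `ddof=0` and the default `np.std()`)
    - dimensionless (no units)
    - ranges from -1 (perfect negative linear correlation) through 0 (no linear correlation) to 1 (perfect positive linear correlation)
    - the closer to -1 or 1, the more tightly the points cluster around a straight line and the stronger the correlation
    - function: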
    
```python
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0,1]
```
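
As a sanity check on the definition, this sketch rebuilds the coefficient from the covariance and standard deviations and compares it with `np.corrcoef`. The data and variable names are illustrative assumptions; the covariance uses `ddof=0` so that it matches the default normalization of `np.std()`.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance divided by the product of the standard deviations
cov_xy = np.cov(x, y, ddof=0)[0, 1]
manual_r = cov_xy / (np.std(x) * np.std(y))

# NumPy's built-in result for comparison
numpy_r = np.corrcoef(x, y)[0, 1]

# Rescaling and shifting one variable leaves r unchanged (it is dimensionless)
rescaled_r = np.corrcoef(100 * x + 7, y)[0, 1]

print(manual_r, numpy_r, rescaled_r)
```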