In [1]:
# Basic imports
import numpy as np
import pandas as pd
from scipy import stats
# Data visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Pre-Processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

import random 

In [2]:
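random.seed(42)
# Note: this seeds only Python's built-in random module. NumPy draws
# (np.random.normal below) need np.random.seed(42) to be reproducible.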

TOPICS:
- Z-Score
- PDF/SF
- Confidence Intervals
       - upper/lower
       - interpret it
- Null/Alt Hypothesis
- Type I/ II Errors
- Calculating P-Value
- Bayes
- Linear Regression
- Correlation Matrices

A Z-Score tells you how many standard deviations a value is away from the mean: z = (x - mean) / std

In [5]:
# Let's transform the normal distribution centered on 5
# with a standard deviation of 2 into a standard normal

# Generating our data
normal_dist = np.random.normal(loc=5, scale=2, size=1000)

np.mean(normal_dist)

5.0013550774889

In [6]:
# Here, let's standardize by hand
# (x - mean) / std
z_dist = [(x - np.mean(normal_dist)) / np.std(normal_dist)
          for x in normal_dist]

np.mean(z_dist)

5.1514348342607266e-17

Standardizing a Distribution in Pandas

In [None]:
#Calculate the z-score for that row's HourlyRate
(sample_row['HourlyRate'].values[0] - df['HourlyRate'].mean()) / df['HourlyRate'].std()

In [None]:
# Standardize the column
mu = df['HourlyRate'].mean()
sigma = df['HourlyRate'].std()
standardized_rate = [(x-mu)/sigma for x in df['HourlyRate']]
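As a cross-check on the by-hand standardization above, here is a minimal sketch using `scipy.stats.zscore`. The `rates` values are made up for illustration and are not the `HourlyRate` column. One thing to watch: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `np.std` and `zscore` default to the population version (`ddof=0`), so the `ddof` has to match for the results to agree.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical hourly rates, for illustration only
rates = pd.Series([40.0, 55.0, 62.0, 48.0, 75.0, 51.0])

# By hand, the pandas way: .std() uses ddof=1 (sample standard deviation)
z_by_hand = (rates - rates.mean()) / rates.std()

# scipy equivalent: pass ddof=1 so it matches pandas
z_scipy = stats.zscore(rates, ddof=1)

print(np.allclose(z_by_hand, z_scipy))
```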

## Confidence Intervals

Margin of Error = Critical Value * Sample Standard Error

### Critical Value

To find the critical value you first need 𝛼.
<br>
𝛼 = 1 − Confidence Level
<br>
So, if you pick a 95% confidence level, then 𝛼 = 1 - .95 = .05
<br>
BUT because the interval is two-sided, 𝛼 is split evenly between the two tails:
.05/2 = .025

This is the share of "acceptable" error in each tail.
<br>
Why does this matter? Because this tail area determines the critical value: the point on the distribution that leaves 2.5% in the upper tail, so that 95% of the probability lies between the negative and positive critical values.
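A minimal sketch of turning a confidence level into a critical value with `scipy.stats`. The sample size `n = 30` is an assumed value for illustration. The z critical value applies when the population standard deviation is known, and the t critical value when it is estimated from the sample.

```python
from scipy import stats

confidence_level = 0.95
alpha = 1 - confidence_level   # 0.05
tail_area = alpha / 2          # 0.025 in each tail

# z critical value: the point with 2.5% of the standard normal above it
z_crit = stats.norm.ppf(1 - tail_area)

# t critical value for a sample of size n (n - 1 degrees of freedom)
n = 30  # hypothetical sample size
t_crit = stats.t.ppf(1 - tail_area, df=n - 1)

print(round(z_crit, 3), round(t_crit, 3))
```

The t critical value is slightly larger than the z critical value, which makes the interval wider to account for the extra uncertainty in a small sample.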

### Calculating the Confidence Interval

In [None]:
# WRITTEN OUT VERSION

#80%-confidence interval
n = 30
x_bar = 2000
s = 200

alpha80 = 1 - .80

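#Want confidence on both sides of the curve, so split alpha across the two tails
# 1 - (alpha80/2) = 0.9

#calculate t-value (critical value) with n-1 degrees of freedom
t_value80 = stats.t.ppf(1 - alpha80/2, n-1)

#calculate t-margin of error: critical value * standard error (s / sqrt(n))
margin_error80 = t_value80 * s/(n**0.5)
#RAISED TO 0.5 IS THE SQUARE ROOT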

#calculate 80%-intervals
conf_int80 = (x_bar - margin_error80, x_bar + margin_error80)

#print out results
print(conf_int80)

CHECK THE ABOVE USING THE VERSION BELOW

In [1]:
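# Of course, there's also: USE THIS TO CHECK HARD CODED WORK ON CODE CHALLENGE
# The first argument is the confidence level. Older SciPy names it `alpha`,
# newer SciPy names it `confidence`; passing it positionally works in both.
stats.t.interval(0.95,
                 loc = sample_mean,
                 scale = stats.sem(sample['HourlyRate']),
                 df=n-1)


#Despite the old name `alpha`, the first argument is the confidence level (0.95), not the significance level
#Scale is the sem (standard error of the mean)

#THIS GIVES US THE CONFIDENCE INTERVAL AT 95% Confidence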

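A self-contained sketch of that check, reusing the summary numbers from the written-out 80% example (`n = 30`, `x_bar = 2000`, `s = 200`). The confidence level is passed positionally, on the assumption that this is the form accepted across SciPy versions.

```python
from scipy import stats

# Same summary numbers as the written-out 80% example
n, x_bar, s = 30, 2000, 200
standard_error = s / n ** 0.5

# By hand: critical value times standard error
t_crit = stats.t.ppf(1 - (1 - 0.80) / 2, df=n - 1)
by_hand = (x_bar - t_crit * standard_error, x_bar + t_crit * standard_error)

# Built-in: confidence level passed positionally
built_in = stats.t.interval(0.80, df=n - 1, loc=x_bar, scale=standard_error)

print(by_hand)
print(built_in)
```

The two intervals should agree to within floating-point error. If they do not, the usual culprits are a hard-coded standard deviation or forgetting to halve 𝛼.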

## HYPOTHESIS TESTING

This is at heart what hypothesis testing is: *"Does our sample come from the population or is it a special set?"*

Defining a _threshold value_ $\alpha$ (called the **significance level** or **False Positive Rate**) helps to decide whether we believe that the sample is from the same underlying population or not.

### Steps of a Hypothesis Test

1. State the null hypothesis and the alternative hypothesis
2. Specify significance level ($\alpha$)
3. Calculate test statistic (z-statistic, t-statistic, etc.)
4. Calculate p-value
5. Interpret p-value (reject or fail to reject the null hypothesis) 
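The five steps can be sketched as a one-sample t-test. The data are randomly generated for illustration, and the null value `mu_0 = 50` and `alpha = 0.05` are assumed choices, not values from the lecture.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 40 observations, for illustration only
rng = np.random.default_rng(42)
sample = rng.normal(loc=52, scale=10, size=40)

# 1. H0: the population mean is 50.  H1: the population mean is not 50.
mu_0 = 50
# 2. Significance level
alpha = 0.05
# 3. Test statistic, computed by hand
t_stat = (sample.mean() - mu_0) / stats.sem(sample)
# 4. Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=len(sample) - 1)
# 5. Decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(t_stat, p_value, decision)

# Cross-check against scipy's built-in one-sample t-test
check = stats.ttest_1samp(sample, popmean=mu_0)
print(check.statistic, check.pvalue)
```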


Calculating P-Value

For a right-tailed $z$-test (a positive z-score), we can use the CDF of the normal distribution to find this probability (`p = 1 - scipy.stats.norm.cdf(z_score)`). Shortcut: `p = scipy.stats.norm.sf(z_score)`. The survival function `sf` is defined as `1 - cdf`, so the two expressions give the same p-value. For a two-tailed test, double the tail area: `p = 2 * scipy.stats.norm.sf(abs(z_score))`.

If $p \lt \alpha$, we reject the null hypothesis.

If $p \geq \alpha$, we fail to reject the null hypothesis.
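A short sketch of the tail calculations for a z-test. The value `z_score = 2.1` is a made-up test statistic.

```python
from scipy import stats

z_score = 2.1  # hypothetical test statistic

p_right = stats.norm.sf(z_score)               # P(Z >= z), same as 1 - cdf(z)
p_left = stats.norm.cdf(z_score)               # P(Z <= z)
p_two_sided = 2 * stats.norm.sf(abs(z_score))  # both tails

print(p_right, p_left, p_two_sided)
```

Which of the three to report depends on the alternative hypothesis: "greater than" uses the right tail, "less than" the left tail, and "not equal" both tails.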

EDA for multiple groups before ANOVA Testing to compare multiple samples

In [None]:
# find the mean and std for every group

df.groupby('Column Name').agg(['mean', 'std'])

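F-Statistic:
The F-statistic is the ratio of the variance between the group means to the variance within the groups. For fixed degrees of freedom, the higher the F-statistic, the lower the p-value, so the more likely you are to reject the null hypothesis that all group means are equal.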
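A minimal sketch of that relationship using `scipy.stats.f_oneway` on three randomly generated groups. The data, and the shift of 5 applied to the third group, are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Three hypothetical groups drawn from the same distribution
group_a = rng.normal(10, 2, 30)
group_b = rng.normal(10, 2, 30)
group_c = rng.normal(10, 2, 30)

# Groups with similar means
f_similar, p_similar = stats.f_oneway(group_a, group_b, group_c)

# Shift one group far from the others: between-group variance rises
f_shifted, p_shifted = stats.f_oneway(group_a, group_b, group_c + 5)

print(f_similar, p_similar)
print(f_shifted, p_shifted)
```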

Chi-Square Tests: about frequencies (counts) for discrete variables (categorical data).

Rule of thumb: use when the expected frequency in each category is at least 5.

Goodness of Fit Test: 1 categorical variable, which could have subclasses. Compares the observed counts (Experiment/Variation) against the expected counts (Control). It does not look for a relationship between variables.

Test of Independence: 2 or more categorical variables. Each category may have subclasses.
Tests whether there is a relationship between the categorical variables.
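Both tests can be sketched with `scipy.stats`. The die-roll counts and the 2x3 table below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Goodness of fit: hypothetical die rolled 120 times, expecting 20 per face
observed = np.array([18, 22, 16, 25, 19, 20])
expected = np.full(6, 20)
chi2_gof, p_gof = stats.chisquare(f_obs=observed, f_exp=expected)

# Test of independence: hypothetical 2x3 contingency table of counts
table = np.array([[30, 20, 10],
                  [20, 25, 35]])
chi2_ind, p_ind, dof, expected_counts = stats.chi2_contingency(table)

print(chi2_gof, p_gof)
print(chi2_ind, p_ind, dof)
# Rule-of-thumb check on the expected counts
print((expected_counts >= 5).all())
```

For the goodness of fit test the observed and expected counts must sum to the same total. For the independence test, `chi2_contingency` computes the expected counts from the row and column totals.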

### Multiple Linear Regression


Interpreting coefficients: <br>
Intercept: The predicted value of Y (the dependent variable) when every X (independent variable) is 0, aka the baseline. It equals the mean of the target only when the predictors are centered on 0.<br>
X: For every increase of 1 in X, the target variable changes by that X's coefficient, holding the other predictors constant. <br>
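To make the intercept interpretation concrete, here is a small sketch that fits the same simulated data twice with plain NumPy least squares, once on raw predictors and once on mean-centered predictors. The data-generating equation and the helper `fit_ols` are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: y = 3 + 2*x1 - 1*x2 + noise
x1 = rng.normal(10, 2, 200)
x2 = rng.normal(5, 1, 200)
y = 3 + 2 * x1 - 1 * x2 + rng.normal(0, 0.5, 200)

def fit_ols(predictors, target):
    """Least-squares fit; returns [intercept, coef_1, coef_2, ...]."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

raw = fit_ols(np.column_stack([x1, x2]), y)
centered = fit_ols(np.column_stack([x1 - x1.mean(), x2 - x2.mean()]), y)

print(raw)       # intercept is the predicted y when x1 = x2 = 0
print(centered)  # intercept equals the mean of y; slopes are unchanged
print(y.mean())
```

Centering changes only the intercept. The slopes keep the same "per 1 unit of X" meaning in both fits.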


In [None]:
import statsmodels.api as sm

X = df.drop("target", axis =1)
#Add constant
X = sm.add_constant(X)

#Create Target Variable
y = df["target"]

# Instantiate model
model = sm.OLS(endog = y, exog = X)

#Fit model - estimates the intercept and coefficients from the data by ordinary least squares
model = model.fit()
model.summary()

#Transform or Scale 


#Evaluate

SCALING (Not necessary for a standard Linear Regression Model)<br>
With standardized predictors, each coefficient is the change in the dependent variable for an increase of 1 standard deviation in that independent variable, holding the others constant. This puts the coefficients on a comparable scale.

In [None]:
#Standardizing/Normalizing the Independent Variables is equivalent to getting the Z-Scores of every variable

(X - X.mean()) / np.std(X)

#Can use SKLearn Standard Scaler

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(X_train)
X_standardized_train = ss.transform(X_train)
X_standardized_test = ss.transform(X_test)

#CHECKING STANDARD SCALER WORKED - train MEANS should be ~0 and train STD ~1.
#Test values will be close but not exact, because the scaler was fit on the training data only.
X_standardized_train.mean(axis = 0)

X_standardized_train.std(axis = 0)

X_standardized_test.mean(axis = 0)

X_standardized_test.std(axis = 0)

SKLearn's Standard Scaler- Standardizing a distribution (going to be very important for Machine Learning)

NEVER FIT ON TEST OR VALIDATION DATA. ONLY FIT ON TRAINING DATA
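A runnable sketch of that rule with a train/test split. The feature matrix `X` and target `y` are randomly generated stand-ins, and the 25% test size is an assumed choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical feature matrix and target, for illustration only
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training mean and std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))  # ~0 and ~1
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))    # close, not exact
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.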

In [None]:
# Importing StandardScaler from the preprocessing module
from sklearn.preprocessing import StandardScaler

# Need to instantiate our scaler
scaler = StandardScaler()

# Fitting our scaler (note how we need to make the column into a dataframe)
scaler.fit(df[['HourlyRate']])

# Grabbing the transformed values out as scaled_rate
scaled_rate = scaler.transform(df[['HourlyRate']])

Error Metrics

In [1]:
#Uses the functions imported at the top:
#from sklearn.metrics import mean_squared_error, mean_absolute_error
predictions = lr.predict(wine_preds_st_scaled)

#MEAN ABSOLUTE ERROR - in the same units as my target variable
mean_absolute_error(wine_target, predictions)

#MEAN SQUARED ERROR - in squared units of my target variable
mse = mean_squared_error(wine_target, predictions)

#ROOT MEAN SQUARED ERROR - in the same units as my target variable
#(np.sqrt avoids the `squared=False` argument, which recent scikit-learn releases deprecate)
np.sqrt(mse)
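To show what each metric computes, here is a sketch that calculates them by hand with NumPy and compares against scikit-learn. The `y_true` and `y_pred` arrays are made-up values, not the wine data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average squared error, penalizes large misses
rmse = np.sqrt(mse)             # back in the units of the target

print(mae, mse, rmse)
print(mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred))
```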

### Avoiding Multicollinearity

A further assumption for multiple linear regression is that the predictors/x variables are independent of one another, i.e. not strongly correlated with each other.

#### VIEWING CORRELATION AND USING A HEATMAP (from Feature Selection Lecture)

In [None]:
# Use the .corr() DataFrame method to find out about the
# correlation values between all pairs of variables!

df.corr()

In [None]:
sns.set(rc={'figure.figsize':(14, 14)})

# Use the .heatmap function to depict the relationships visually!
sns.heatmap(df.corr(),annot=True);

#Want to check that 1 predictive variable is not correlated with another predictive variable. Need to be independent.
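Besides reading the heatmap by eye, the strongly correlated predictor pairs can be listed programmatically. This is a sketch on simulated predictors, and the `0.8` cutoff is an assumed threshold rather than a fixed rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical predictors: x2 is nearly a copy of x1, x3 is unrelated
x1 = rng.normal(size=300)
predictors = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=300),
    "x3": rng.normal(size=300),
})

corr = predictors.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cutoff; choose one that suits the problem
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```

Any pair that shows up here is a candidate for dropping or combining one of its two predictors before fitting the regression.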

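A predictor/X variable with the largest correlation with the target/Y variable does not necessarily have the largest coefficient: raw coefficients depend on each predictor's units, and correlated predictors share explanatory power. Only when the predictors are standardized and uncorrelated with each other does the largest correlation imply the largest coefficient, meaning a change of 1 standard deviation in that X goes with the largest change in Y.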