<a href="https://colab.research.google.com/github/gauravtailor8520/Machine-Learning/blob/main/Regression_Cheat_Sheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Resources

[List of Regression and Classification Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)

[Cheat Sheets](https://github.com/bundickm/CheatSheets)

#Definitions
**[Linear Regression](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html)** - Linear Regression is a statistical model that seeks to describe the relationship between some y variable and one or more x variables ("Line of Best Fit").  A linear regression model that involves multiple x variables is known as Multiple Regression.
- X - Independent Variable, predictor variable, explanatory variable, regressor, covariate
- Y - Response variable, predicted variable, measured variable, explained variable, outcome variable
- $\beta_0$ - "Beta Naught" or "Beta Zero", the intercept value. This is how much of y would exist if X were zero. This is sometimes represented by the letter "a" but I hate that. So it's "Beta 0" during my lecture.
- $\beta_1$ - "Beta One" The primary coefficient of interest. This value is the slope of the line that is estimated by "minimizing the sum of the squared errors/residuals" - We'll get to that.
- $\epsilon$ - "Epsilon" The "error term", random noise, things outside of our model that affect y.

<center><img src="https://cdn-images-1.medium.com/max/1600/1*_TqRJ9SmwFzRigJhMiN2uw.png" width="400"/></center>

**Mean Absolute Error (MAE)** - Measures the average magnitude of the errors in a set of forecasts, without considering their direction. It measures accuracy for continuous variables.

**Mean Squared Error (MSE)** - Measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

**Root Mean Squared Error (RMSE)** - The RMSE is a quadratic scoring rule which measures the average magnitude of the error. The difference between forecast and corresponding observed values are each squared and then averaged over the sample. Finally, the square root of the average is taken. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable.
- The MAE and the RMSE can be used together to diagnose the variation in the errors in a set of forecasts. The RMSE will always be larger than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample. If RMSE = MAE, then all the errors are of the same magnitude.
- RMSE has the same units as the quantity being estimated

**Coefficient of Determination ($R^2$)** - The proportion of the variance in the dependent variable that is predictable from the independent variable(s).

**Residual Error** - The difference between an observed value in the dataset and the value the model predicts for it, such as the vertical distance from a point to the regression line.
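
The metric definitions above can be checked by hand on a tiny example. This is a minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are illustrative, not from any dataset used here); it computes each metric with NumPy directly from the residuals.

```python
import numpy as np

# Made-up observed values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 6.0, 5.0, 11.0])

residuals = y_true - y_pred              # observed minus predicted
mae = np.mean(np.abs(residuals))         # average error magnitude
mse = np.mean(residuals ** 2)            # average squared error
rmse = np.sqrt(mse)                      # back in the units of y

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse, 'R2:', r2)
```

Here the RMSE (about 1.58) is larger than the MAE (1.5) because the errors are not all the same size.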

**Bias** - The difference between an estimator's expected value and the true value. High bias is caused by erroneous assumptions. High Bias = Underfitting

**Variance** - The variability of model predictions for a given data point or value, which tells us how spread out the predictions are. High variance is a sensitivity to small fluctuations in the training set. High Variance = Overfitting
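
One way to see underfitting and overfitting side by side is to fit polynomials of increasing degree to the same noisy data. The sketch below uses synthetic data (a sine curve plus noise) and degrees chosen only for illustration; none of it comes from the datasets used elsewhere in this sheet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus noise (illustrative only)
x_train = rng.uniform(0, 1, size=(30, 1))
y_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(0, 0.3, size=30)
x_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * x_test[:, 0]) + rng.normal(0, 0.3, size=200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 9):
    # Higher degree = more flexible model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(x_test))
    print(degree, train_mse[degree], test_mse[degree])
```

The training error can only go down as the degree grows, because each model contains the simpler ones. The straight line typically has high error on both sets (high bias), while the most flexible fit tends to have the lowest training error without a matching improvement on the test set (high variance).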

In [None]:
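#Linear Regression Model and Errors
#Assumes X_train, X_test, y_train, y_test and a single column name `feature` already exist
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#Fit on the training data only, then predict on the held-out test data
model = LinearRegression()
model.fit(X_train[[feature]], y_train)
X_test = X_test.copy()
X_test['prediction'] = model.predict(X_test[[feature]])

mse = mean_squared_error(y_true=y_test, y_pred=X_test['prediction'])
print('Feature:',feature)
print('MAE:',mean_absolute_error(y_true=y_test, y_pred=X_test['prediction']))
print('MSE:',mse)
print('RMSE:',np.sqrt(mse))
print('R2_Score:',r2_score(y_test, X_test['prediction']))

#Regression line (red) over the actual test values
X_test.plot(x=feature,y='prediction',kind='line',color='r')
plt.scatter(x=X_test[feature],y=y_test)
plt.show()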


**[Polynomial Regression](https://en.wikipedia.org/wiki/Polynomial_regression)** -  A form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Just as multiple regression is an extension of the bivariate case, Polynomial Regression is a special case of multiple regression (each power of x is treated as its own feature) and can be used to fit data with many curved shapes. [Additional Explanation](https://stats.stackexchange.com/questions/92065/why-is-polynomial-regression-considered-a-special-case-of-multiple-linear-regres)
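
To make the "still a linear model" point concrete, here is a minimal sketch on noise-free, made-up data generated from $y = 1 + 2x + 3x^2$. The data and variable names are illustrative assumptions; the point is only that an ordinary `LinearRegression` recovers the curve once $x^2$ is supplied as an extra column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free illustrative data generated from y = 1 + 2x + 3x^2
x = np.linspace(-3, 3, 25).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# include_bias=False because LinearRegression fits its own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)           # columns: x, x^2

model = LinearRegression().fit(x_poly, y)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
```

The fitted intercept and coefficients match the 1, 2 and 3 used to generate the data, up to floating point error.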

In [None]:
#Plot features to search for candidates to create polynomials from
import matplotlib.pyplot as plt
import seaborn as sns

for feature in features:
    sns.residplot(x=X[feature], y=df[target], lowess=True, line_kws=dict(color='r'))
    plt.show()

In [None]:
#Creating polynomial features
from sklearn.preprocessing import PolynomialFeatures

polynomial_features= PolynomialFeatures(degree=2)
X_poly = polynomial_features.fit_transform(X)

In [None]:
#Create polynomial features and add to the data frame
def poly_features(df, feature, degrees):
  for degree in range(2, degrees+1):#squaring, going up to however many degrees
    name = f'{feature} ** {degree}'#give feature a name
    df[name] = df[feature] ** degree#raise it to however many degrees
  return df

##Log Transformation of Target and Features

Many features are normally distributed and many machine learning algorithms also tend to assume a normal distribution. However there are also attributes that have a log-normal distribution. In such a distribution, most of the observations have smaller values and few take on higher values. A feature could have a positive skew where the tail of the distribution is on the right or a negative skew where its tail is on the left.

The question one needs to ask is should a transform be applied to the given distribution and if there is a need then whether a log-transform or some other transform is more appropriate. As an example, consider this distribution which shows a positive skew. Skew for this particular distribution is 1.88 and is calculated using the `pandas.DataFrame` function skew (`df.skew()`). A value of 1.88 indicates a high level of positive skew.

<center><img src="https://storage.googleapis.com/kaggle-forum-message-attachments/356970/9896/Original.png" width="400"/></center>

Let's apply the square root transformation to this distribution. It can be seen that this has helped a little.

<center><img src="https://storage.googleapis.com/kaggle-forum-message-attachments/356970/9897/Square_root_transformation.png" width="400"/></center>

Now let's try applying the log transformation instead. The distribution generated after applying this transformation follows the normal distribution better.

<center><img src="https://storage.googleapis.com/kaggle-forum-message-attachments/356970/9898/Log_transformation.png" width="400"/></center>

The log transformation works well on some distributions because many quantities in nature are log-normally distributed. Rainfall is one example: most of the time there is little to moderate rainfall, with higher amounts on fewer occasions. Stock prices are also often described using a log-normal distribution.

A transformation can be applied to the dependent and independent variables to achieve a more normal distribution. Whether it would be beneficial to apply such a transformation has to be determined based on the distribution.

Transforming our values in this way won't stop our model from generating predictions (predictions of a log-transformed target are on the log scale and need to be exponentiated to get back to the original units), but it will change the interpretation of all of our coefficients. Coefficients go from having a unit-change interpretation (a raw value change in y for a 1 unit increase in x) to having a percentage-terms, elasticity-type interpretation.
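
As a rough illustration of the effect on skew, the sketch below draws made-up data from a log-normal distribution (the column name `price` and the distribution parameters are assumptions for the example) and compares skew before and after taking the log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Illustrative positively skewed data drawn from a log-normal distribution
df = pd.DataFrame({'price': rng.lognormal(mean=3.0, sigma=1.0, size=2000)})

# np.log requires strictly positive values; np.log1p is a common
# alternative when the column contains zeros
df['log_price'] = np.log(df['price'])

skew_before = df['price'].skew()
skew_after = df['log_price'].skew()
print('Skew before:', skew_before)
print('Skew after:', skew_after)

# Values on the log scale are converted back to original units with np.exp
restored = np.exp(df['log_price'])
```

The skew of the transformed column should sit close to zero, and `np.exp` recovers the original values.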

In [None]:
#Calculate skew
df.skew()

In [None]:
#Histogram with KDE to visually inspect skew (use any feature or the target column)
sns.histplot(df[feature], kde=True);

##Interaction Terms

An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.

Let's look at an example where we are trying to estimate the level of satisfaction that a person would have when eating some kind of food with a condiment on it.

$satisfaction_i = \beta_0 + \beta_1 food_i + \beta_2condiment_i + \epsilon$

Imagine that we have two foods: Ice Cream and Hot Dogs, and we also have two condiments: hot fudge and mustard.

$\beta_1$ in this example is trying to capture the effect on satisfaction of eating hot dogs vs eating ice cream, and $\beta_2$ is trying to capture the effect of putting hot fudge vs mustard on your food.

$\beta_2$ is a little more problematic in this scenario. If someone were to come up to you and ask if you preferred hot fudge or mustard on your food, how would you answer?

You would probably say something like "It depends on what the food is." This means that the effect of our x variables on y (satisfaction) depends on the combination of food and condiment. I don't know about you guys, but I wouldn't be as satisfied if I had hot fudge on my hot dog or mustard on my ice cream.

An interaction term is something that we add to our regression to account for these "It Depends" moments between two x variables. We do this by multiplying the two of them together or interacting them with each other to capture the implications of the different combinations taking place.

$satisfaction_i = \beta_0 + \beta_1 food_i + \beta_2condiment_i + \beta_3(food_i \times condiment_i) + \epsilon$
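
A minimal sketch of this regression, using made-up satisfaction scores for the four food and condiment combinations (the numbers and the 0/1 encoding are assumptions chosen for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up satisfaction scores for the four food/condiment combinations
# food: 0 = ice cream, 1 = hot dog; condiment: 0 = hot fudge, 1 = mustard
data = pd.DataFrame({
    'food':         [0, 0, 1, 1],
    'condiment':    [0, 1, 0, 1],
    'satisfaction': [9, 2, 1, 8],
})

# The interaction term is the product of the two variables
data['food_x_condiment'] = data['food'] * data['condiment']

features = ['food', 'condiment', 'food_x_condiment']
model = LinearRegression().fit(data[features], data['satisfaction'])
coefs = dict(zip(features, model.coef_))
print('Intercept:', model.intercept_)
print(coefs)
```

With these scores the condiment coefficient is negative (mustard hurts ice cream), while the interaction coefficient is large and positive, capturing that mustard on a hot dog is a good combination. Without the interaction column the model could not represent that "it depends" pattern.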

http://statisticsbyjim.com/regression/interaction-effects/

##Removing Outliers

To remove outliers via the 1.5*Interquartile-Range method, the first step is to calculate the IQR for each variable.

The IQR is the difference between the 25th and 75th percentiles of the feature.

Find the IQR and multiply it by 1.5

Then add the 1.5IQR to the 3rd quartile (75th percentile). Anything above that range is an outlier. Subtract 1.5IQR from the 1st quartile (25th percentile). Anything below that value is also an outlier.

You want to minimize outliers in your dataset, so remove them by dropping observations that contain outliers in key features.

Typically you will want to remove outliers before doing anything else with your dataset. We haven't focused on this strongly yet in the class, but coefficients get strongly biased by outliers, so if you want to really have accurate predictions, remove outliers before you begin your feature engineering and modeling.
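
The 1.5*IQR steps above can be sketched as follows. The helper name `remove_outliers_iqr`, its `k` parameter, and the tiny example frame are hypothetical, introduced only for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows whose values in `columns` all lie within k*IQR of the quartiles."""
    q1 = df[columns].quantile(0.25)      # 1st quartile per column
    q3 = df[columns].quantile(0.75)      # 3rd quartile per column
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is kept only if every listed column is inside its bounds
    mask = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[mask]

# Made-up data: 100 is an obvious outlier in column x
example = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100],
                        'y': [10, 11, 12, 13, 14, 15]})
cleaned = remove_outliers_iqr(example, ['x', 'y'])
print(cleaned)
```

Only the row containing 100 is dropped; the remaining five rows are returned unchanged.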


In [None]:
#Removing Outliers via z-scores (not the 1.5*IQR rule), df must contain all numeric features
import numpy as np
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

##Parameter Estimation

OLS and Gradient-Descent based methods of linear regression modeling both seek to estimate parameters that "minimize the sum of the squared error."

An estimated regression coefficient represents the mean change in our response variable (y) given a one unit change in the predictor. But because it is an estimate, there is a certain confidence interval around our estimate of the coefficient. The confidence interval is vital to our interpretation of regression coefficients.

Suppose we fit a regression model and calculate its coefficients and substitute them into the equation:
$$\hat{y} = 0.42 + 2.05x$$

$\hat{\beta}_1$ represents the slope of the regression line, but it's just an estimate for the slope of the regression line, and as an estimate has an associated confidence interval.

Let's say that we calculated the 95% confidence interval for $\hat{\beta}_1$ and it came out to be $(1.9 , 2.2)$. This means that we can only be 95% confident that the average effect of x on y is within this range. While we can calculate a 95% confidence interval for any estimated parameter, we usually won't refer to the potential spread of parameter estimates by its confidence interval. We'll usually refer to how wide or how narrow the spread is by referring to what's called the "Standard Error."

**Standard Error (SE)** - The SE of a coefficient estimate is the estimated standard deviation of the error in measuring it. So the coefficient itself is the estimated mean effect of x on y and the Standard Error is the estimated standard deviation of our coefficient. We use standard errors to calculate the confidence interval.
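
To show how a standard error turns into a confidence interval, here is a minimal NumPy sketch on synthetic data built around the example line $\hat{y} = 0.42 + 2.05x$. The sample size, noise level and seed are assumptions, the standard-error formula is the usual one for constant-variance errors, and the interval uses the normal critical value 1.96 as a large-sample approximation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data built around the example line y = 0.42 + 2.05x
n = 200
x = rng.uniform(0, 10, size=n)
y = 0.42 + 2.05 * x + rng.normal(0, 1.0, size=n)

# OLS estimate: beta_hat = (X'X)^-1 X'y, with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])
xtx_inv = np.linalg.inv(X.T @ X)
beta_hat = xtx_inv @ X.T @ y

# Standard errors: sqrt of the diagonal of sigma^2 (X'X)^-1,
# where sigma^2 is estimated from the residuals
residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / (n - X.shape[1])
std_err = np.sqrt(sigma2 * np.diag(xtx_inv))

# Approximate 95% interval using the normal critical value 1.96
lower = beta_hat - 1.96 * std_err
upper = beta_hat + 1.96 * std_err
print('Slope estimate:', beta_hat[1])
print('Standard error:', std_err[1])
print('95% CI:', (lower[1], upper[1]))
```

A smaller standard error gives a narrower interval, which is what "more precise" means in the definitions that follow.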

**Accuracy** - A regression coefficient that is "Accurate" is centered around its "true" value. When we say that a coefficient is more accurate we mean that we suspect that it better represents ground truth.

**Precision** - A regression coefficient that is "Precise" has a small standard error. It has a tighter confidence interval as well.

<center><img src="https://camo.githubusercontent.com/c9b217df86bbd37a0322427f0803bafb9c00f010/68747470733a2f2f7777772e646e61736f6674776172652e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031352f30372f746172676574732e706e67" width="400"/></center>

##Gauss Markov Assumptions
[Regression Diagnostics Assignment](https://github.com/bundickm/DS-Unit-2-Sprint-2-Regression/blob/master/module3-regression-diagnostics/Model_Diagnostics_Assignment.ipynb)

There are 6 Gauss Markov assumptions (also called conditions) that are required for OLS to be BLUE (the "Best Linear Unbiased Estimator").

1) Well Defined: $X^{T}X$ is invertible (No perfect multicollinearity), $|X^{T}X| \neq 0$

2) Linearity: the model must be linear in the parameters we are estimating with OLS.

3) Random: our data must have been randomly sampled from the population.

4) Non-Collinearity: the regressors (x vars) being calculated aren’t perfectly (or highly) correlated with each other.
- When two variables are close to being a linear combination of each other we call this **collinearity**, or say they have high levels of collinearity. If there are three or more variables all with significant levels of collinearity we call this "multicollinearity", but the two terms are used interchangeably.
- High levels of collinearity in a dataset are bad because they increase standard errors and therefore make estimates of our coefficients less precise. Very high levels of collinearity (nearing perfect multicollinearity) can cause standard errors to grow drastically.

5) Exogeneity: the regressors (x vars) aren’t correlated with the error term.

- Omitted Variables Bias (Ice Cream Sales and Burglaries)
- Instrumental Variables: A regression of earnings on education would be biased because both education and earnings are influenced by natural ability. We use an additional "Instrumental Variable" that is correlated with years of schooling but isn't correlated with ability in order to estimate the effect of years of schooling on earnings. (Quarter of birth - Angrist and Krueger)

6) Homoskedasticity: no matter what the values of our regressors might be, the variance of the error is constant.

<center><img src="https://camo.githubusercontent.com/ad626f774612725ad436ab03956af1b45cd3b6d8/68747470733a2f2f7777772e7265736561726368676174652e6e65742f70726f66696c652f416b686d61645f4661757a792f7075626c69636174696f6e2f3331393039313737352f6669677572652f666967322f41533a35323638323630303838343633333640313530323631363531363036372f4578616d706c652d6f662d686f6d6f7363656461737469636974792d496465616c6c792d726573696475616c732d6172652d72616e646f6d6c792d7363617474657265642d61726f756e642d302d7468652e706e67" width="400"/></center>

We test for high levels of collinearity by calculating the **Variance Inflation Factor** (VIF) of each variable in the dataset.

From Wikipedia:

    "In statistics, the variance inflation factor (VIF) is the ratio of variance in a model with multiple terms, divided by the variance of a model with one term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity." VIF Wikipedia

As a rule of thumb any variable that has a VIF > 10 needs to be dealt with (probably dropped from your model). If you see a VIF greater than 10 it is likely that two x variables are highly correlated. We can use the correlation matrix to check levels of correlation between our independent variables.
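
For intuition about where the number comes from, the VIF of feature $j$ can be written as $1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing feature $j$ on all the other features. The sketch below computes it that way on synthetic data; the helper `vif_by_hand` and the three made-up columns are assumptions for illustration, not a replacement for the `statsmodels` function.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features: x2 is almost a copy of x1, x3 is unrelated
n = 500
x1 = rng.normal(size=n)
features = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.1, size=n),
    'x3': rng.normal(size=n),
})

def vif_by_hand(frame):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    result = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        result[col] = 1.0 / (1.0 - r2)
    return pd.Series(result)

vif = vif_by_hand(features)
print(vif)
```

The two nearly identical columns land far above the rule-of-thumb threshold of 10, while the unrelated column stays close to 1.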

In [None]:
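#Calculating VIF
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#add_constant adds the intercept column the VIF calculation expects
X = sm.add_constant(X)
vif = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
pd.Series(vif, X.columns)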

In [None]:
#View correlation matrix
df.corr()

In [None]:
#Correlation heatmap
corrmat = df.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

In [None]:
#Restricted heatmap example

k = 10 #number of variables for heatmap
#'feature' is a placeholder: use the column whose strongest correlations you want to see
cols = corrmat.nlargest(k, 'feature')['feature'].index
cm = np.corrcoef(df[cols].values.T)

sns.set(font_scale=1.25)
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
            annot_kws={'size': 10}, yticklabels=cols.values,
            xticklabels=cols.values)
plt.show()

###Addressing Heteroskedasticity

If heteroskedasticity exists in our dataset it will damage our standard errors and make our estimates less precise. Any challenges that damage the reliability of standard errors also damage the reliability of confidence intervals and hypothesis tests. Therefore, these challenges that damage standard errors also damage a whole host of statistical tools.

Dealing with heteroskedasticity is pretty straightforward: we simply employ what are called "robust standard errors". If you are suspicious of heteroskedasticity in your dataset and you intend to interpret the coefficients of your model, you should run the regression using robust standard errors the majority of the time.

In [None]:
#Model with Robust Standard Errors
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X))
results = model.fit(cov_type='HC3')
print(results.summary())

##Decision Tree

**Decision trees** look at one variable at a time and use if-then statements to define patterns in data. These statements are called **forks**, and they split the data into two branches based on some value.

That value between the branches is called a **split point**. Observations whose value falls to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree's version of a boundary. At the best split, the results of each branch should be as homogeneous (or pure) as possible.

Additional forks will add new information that can increase a tree's prediction accuracy. You can fork a tree until there is a single value for each leaf to obtain 100% accuracy on the training data; however, this results in high variance and overfitting. The ultimate branches of the tree are called leaf nodes.
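
The overfitting risk can be seen by comparing a depth-limited tree with one that is allowed to keep forking. This is a sketch on synthetic data (a noisy sine curve, with sizes and seeds chosen arbitrarily), so the exact scores are not meaningful beyond the comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic noisy data (illustrative only)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for depth in (3, None):                  # None lets the tree fork until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('max_depth =', depth,
          'train R^2:', scores[depth][0], 'test R^2:', scores[depth][1])
```

The unrestricted tree fits the training data perfectly, but its test score falls short of its training score because it has also memorized the noise.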

In [None]:
!pip install graphviz
!apt-get install graphviz

In [None]:
import graphviz
from IPython.display import display
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz

#supporting function for tree_disp
def viztree(decision_tree, feature_names):
  """Visualize a decision tree"""
  dot_data = export_graphviz(decision_tree, out_file=None, feature_names=feature_names,
                             filled=True, rounded=True)
  return graphviz.Source(dot_data)

#Display a regressor decision tree
def regres_tree_disp(X,y,features,depth=3):
  tree = DecisionTreeRegressor(max_depth=depth)
  tree.fit(X,y)
  print('R^2 Score:', tree.score(X, y))
  display(viztree(tree, feature_names=features))

#Display a classifier decision tree
def class_tree_disp(X,y,features,depth=3):
  tree = DecisionTreeClassifier(max_depth=depth)
  tree.fit(X,y)
  #score() on a classifier returns mean accuracy, not R^2
  print('Accuracy:', tree.score(X, y))
  display(viztree(tree, feature_names=features))

#Additional Useful Code Snippets

In [None]:
#Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.20, random_state=42)

In [None]:
#Pairplot of multiple Features versus Target
import seaborn as sns

#y_vars and x_vars expect lists of column names
sns.pairplot(df,y_vars=[target],x_vars=features);

In [None]:
#Box plot for looking at spread of categorical features
ax = sns.boxplot(x=feature, y=target, data=df)

In [None]:
#convert categorical variable into dummy
df = pd.get_dummies(df)