## EDA notes

__Univariate analysis:__ an analysis involving one single explanatory variable.\
__Multivariate analysis:__ an analysis involving multiple explanatory variables. \
__Dependent variable for regression analysis:__ the variable(s) whose values we will predict. \
Regression analysis ---> continuous variables \
Classification algorithms ---> categorical/discrete variables\
`sns.pairplot()` \
`sns.heatmap()`

```python
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(40, 20))
sns.heatmap(wine_data.corr(), annot=True)
```

```python
# seaborn.regplot(): scatterplot with a fitted regression line
# Call regplot with parameters: x = 'density', y = 'fixed.acidity'
sns.regplot(x="density", y="fixed.acidity", data=wine_data)
```

Using double brackets keeps the column header with the data and returns a DataFrame rather than a pandas Series: \
`X = wine_data[["density"]]` ---> X is a DataFrame\
vs\
`X = wine_data["density"]` ---> X is a pandas Series
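
The difference between the two indexing forms can be checked directly. This is a minimal sketch; the small `wine_data` frame below is a made-up stand-in for the real dataset, and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the wine data; the values are invented.
wine_data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9970],
    "fixed.acidity": [7.4, 7.8, 7.8],
})

X_df = wine_data[["density"]]     # list of column names -> DataFrame, 2-D
X_series = wine_data["density"]   # single column name -> Series, 1-D

print(type(X_df).__name__, X_df.shape)
print(type(X_series).__name__, X_series.shape)
```

scikit-learn estimators expect a 2-D feature matrix, which is why the double-bracket form is the convenient one for `X`.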

__sklearn:__ \
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)`

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
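
A short sketch of how the split and `LinearRegression` fit together. The data here is synthetic (an assumption, generated from a known line plus noise) so that the recovered slope can be compared with the true one; `random_state` is added only to make the split repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))           # 2-D feature matrix, one column
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2_test = model.score(X_test, y_test)           # R^2 on the held-out rows

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
print("test R^2:", r2_test)
```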

__Multiple linear regression:__ including more than one explanatory variable in a linear regression analysis

***
## Lab 4: Harvard CS109

* Linear regression is used to model and predict continuous outcomes with normal random errors. 
* There are many different types of regression models, and each is typically defined by the distribution of its prediction errors (called "residuals"), which depends on the type of data being modeled. 
* Logistic regression is used to model binary outcomes.
* Poisson regression is used to predict counts.
***
<div class="span5 alert alert-info">

<p> Given a dataset containing predictor variables $X$ and outcome/response variable $Y$, linear regression can be used to: </p>
<ul>
  <li> Build a <b>predictive model</b> to predict future values of $\hat{Y}$, using new data $X^*$ where $Y$ is unknown.</li>
  <li> Model the <b>strength of the relationship</b> between each independent variable $X_i$ and $Y$</li>
    <ul>
      <li> Many times, only a subset of independent variables $X_i$ will have a linear relationship with $Y$</li>
      <li> Need to figure out which $X_i$ contributes most information to predict $Y$ </li>
    </ul>
   <li>In many cases, it is the first-pass prediction algorithm for continuous outcomes. </li>
</ul>
</div>


***
Linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, predictors) and a dependent variable $Y$.

This method assumes that each predictor $X$ is **linearly** related to the dependent variable $Y$. The most basic linear regression model contains one independent variable $X$; we'll call this the simple model. 

$$ Y = \beta_0 + \beta_1 X + \epsilon$$

where $\epsilon$ is considered as an unobservable random variable that adds noise to the linear relationship.

In linear regression, $\epsilon$ is assumed to be normally distributed with a mean of 0.

In other words, the predictions $\hat{Y}$ do not systematically overshoot or undershoot the true $Y$: roughly as many fall above $Y$ as below it, so the error is zero on average. 

The residuals $\epsilon$ are also assumed to be "i.i.d.": independent and identically distributed. Independence means that the residuals are not correlated -- the residual from one prediction has no effect on the residual from another prediction. Correlated errors are common in time series and spatial analyses.

__Multiple linear regression__ is when you have more than one independent variable, and the estimation involves matrices:
* $X_1$, $X_2$, $X_3$, $\ldots$
    
There are many ways to estimate the coefficients, the most common being __least squares.__
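
As a rough illustration of least squares in matrix form, the sketch below simulates data from a known model and recovers the coefficients with `np.linalg.lstsq`. The true coefficients (1, 2, -3) and the noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(0, 0.5, size=n)                # mean-zero normal noise

# Assumed true model: Y = 1 + 2*X1 - 3*X2 + eps
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Least squares picks beta_hat to minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print("estimated coefficients:", beta_hat)
print("mean residual:", residuals.mean())
```

Because the design matrix contains an intercept column, the fitted residuals average to zero up to floating-point error, which matches the mean-zero assumption on $\epsilon$.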

__*Having redundant features in our regression models will lead to poorly fitting (and unstable) models.*__

## Exploratory Data Analysis for Linear Relationships

__`.describe()` for basic summary statistics of a dataframe.__

We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). 

`sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)`

### Fitting Linear Regression using `statsmodels`
Statsmodels is a great Python library for a lot of basic and inferential statistics.\
It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians.\
The version of least-squares we will use in statsmodels is called ordinary least-squares (OLS). There are many other versions of least-squares such as partial least squares (PLS) and weighted least squares (WLS).

We can interpret the RM coefficient (9.1021) by first noticing that the p-value (under P>|t|) is essentially zero. This means that the number of rooms, RM, is a statistically significant predictor of PRICE. 

The regression coefficient for RM of 9.1021 means that on average, each additional room is associated with an increase of \\$9,100  in house price net of the other variables. The confidence interval gives us a range of plausible values for this average change, about ( \\$8,279, \\$9,925 ), definitely not chump change.

The formula describes the general structure of a regression call. For `statsmodels` (`ols` or `logit`) calls you need a pandas dataframe whose column names appear in your formula. In the example below you need a dataframe that includes the columns named (`Outcome`, `X1`, `X2`, ...), but you don't need to build a new dataframe for every regression; use the same dataframe with all of the columns in it. The structure is very simple:

`Outcome ~ X1`

But of course we want to be able to handle more complex models. For example, multiple regression is done like this:

`Outcome ~ X1 + X2 + X3`

In general, a formula for an OLS multiple linear regression is

`Y ~ X1 + X2 + ... + Xp`

This is the very basic structure but it should be enough to get you through the homework. Things can get much more complex. You can force statsmodels to treat variables as categorical with the `C()` function, call numpy functions to transform data such as `np.log` for extremely-skewed data, or fit a model without an intercept by including `- 1` in the formula.
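
The formula variations mentioned above can be sketched as follows. The dataframe is synthetic and the column names (`Outcome`, `X1`, `X2`, `group`) are placeholders, not columns from the housing data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic dataframe (an assumption for illustration)
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.uniform(1, 10, size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
df["Outcome"] = (1.5 + 2.0 * df["X1"] + 0.8 * np.log(df["X2"])
                 + rng.normal(0, 0.5, size=n))

# Simple regression: one predictor plus an implicit intercept
simple = smf.ols("Outcome ~ X1", data=df).fit()

# Multiple regression with a log transform and a categorical variable
multiple = smf.ols("Outcome ~ X1 + np.log(X2) + C(group)", data=df).fit()

# "- 1" removes the intercept from the model
no_intercept = smf.ols("Outcome ~ X1 - 1", data=df).fit()

print(multiple.params)       # fitted coefficients
print(multiple.conf_int())   # 95% confidence intervals by default
```

The same `df` is reused for all three fits; only the formula string changes.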

__ceiling effect__

Y = "target" = dependent variable = response variable = Boston housing prices\
X = all other features = independent variables = predictors = explanatory variables

Remember: Notation may vary across resources, particularly the use of $R$ and $E$ in $RSS/ESS$ and $MSR/MSE$. In some resources, E = explained and R = residual. In other resources, E = error and R = regression (explained). This is a very important distinction that requires looking at the formula to determine which naming scheme is being used.

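Given the MSR and MSE, we can now determine whether the model we just fit is statistically significant as a whole. We use an $F$-test for this. The null hypothesis is that all of the $\beta$ coefficients (other than the intercept) are zero, that is, none of the predictors has any effect on $Y$. The alternative is that at least one $\beta$ coefficient is nonzero, although in a multiple regression the test does not tell us which one. The test statistic is the ratio of the mean square explained by the regression to the mean squared error of the residuals.\
Once we compute the $F$-statistic, we can use the $F$-distribution with $K-1$ and $N-K$ degrees of freedom (numerator and denominator, respectively, with $K$ here counting every fitted coefficient including the intercept) to get a p-value.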
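
A worked sketch of the overall $F$-test on synthetic data. It assumes the naming scheme in which MSR is the regression (explained) mean square and MSE is the residual mean square, and it counts `p` predictors excluding the intercept, so the degrees of freedom are `p` and `n - p - 1`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 200, 3                                   # n observations, p predictors
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, 0.0, -0.5]) + rng.normal(0, 1.0, size=n)

# Fit by least squares with an explicit intercept column
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta_hat

ess = np.sum((fitted - y.mean()) ** 2)          # explained sum of squares
rss = np.sum((y - fitted) ** 2)                 # residual sum of squares
msr = ess / p                                   # mean square, regression
mse = rss / (n - p - 1)                         # mean squared error

f_stat = msr / mse
p_value = stats.f.sf(f_stat, p, n - p - 1)      # upper-tail probability

print("F =", f_stat, "p-value =", p_value)
```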

Warning! The $F$-statistic mentioned in this section is NOT the same as the F1-measure or F1-value discussed in Unit 7.

The $F$-statistic can also be used to compare two nested models, that is, two models trained on the same dataset where one of the models contains a subset of the variables of the other model. The full model contains $K$ variables and the reduced model contains a subset of these $K$ variables. This allows us to add additional variables to a base model and then test if adding the variables helped the model fit.

$$F = \frac{\left( \frac{RSS_{reduced} - RSS_{full}}{DF_{reduced} - DF_{full}} \right)}{\left( \frac{RSS_{full}}{DF_{full}} \right)}$$

where $DF_x = N - K_x - 1$ and $K_x$ is the number of variables in model $x$.
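
The nested-model formula can be evaluated by hand from two `statsmodels` fits and cross-checked against the built-in `compare_f_test`. The data and column names below are synthetic assumptions; `X3` is deliberately unrelated to `Y`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(4)
n = 250
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["X1", "X2", "X3"])
df["Y"] = 1.0 + 2.0 * df["X1"] + 0.7 * df["X2"] + rng.normal(0, 1.0, size=n)

reduced = smf.ols("Y ~ X1", data=df).fit()
full = smf.ols("Y ~ X1 + X2 + X3", data=df).fit()

# .ssr is the residual sum of squares; .df_resid is N - K - 1
df_diff = reduced.df_resid - full.df_resid
f_stat = ((reduced.ssr - full.ssr) / df_diff) / (full.ssr / full.df_resid)
p_value = stats.f.sf(f_stat, df_diff, full.df_resid)

# Cross-check against the built-in nested-model test
f_builtin, p_builtin, _ = full.compare_f_test(reduced)

print("by hand: ", f_stat, p_value)
print("built in:", f_builtin, p_builtin)
```

A small p-value here says the added variables jointly improve the fit; it does not say which of them is responsible.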

### Akaike Information Criterion (AIC)
Another statistic for comparing two models is AIC, which is based on the likelihood function and takes into account the number of variables in the model.

$$AIC = 2 K - 2 \log_e{L}$$

where $L$ is the likelihood of the model. AIC is meaningless in the absolute sense, and is only meaningful when compared to AIC values from other models. Lower values of AIC indicate better fitting models.

`statsmodels` provides the AIC in its output.
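
To make the formula concrete, here is a sketch that computes AIC by hand for least-squares fits with normal errors. The helper `gaussian_aic` is hypothetical (not a library function), and counting $K$ as the number of design-matrix columns is an assumption; some references also count the error variance as a parameter, which shifts every AIC by the same constant.

```python
import numpy as np

def gaussian_aic(design, y):
    """AIC = 2K - 2 ln(L) for a least-squares fit with normal errors.

    K is taken to be the number of columns in the design matrix.
    """
    n, k = design.shape
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)
    # Maximized Gaussian log-likelihood, with sigma^2 estimated as RSS / n
    log_l = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * k - 2 * log_l

rng = np.random.default_rng(5)
n = 200
x1, x2, noise_col = rng.normal(size=(3, n))     # noise_col is unrelated to y
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(0, 1.0, size=n)
ones = np.ones(n)

aic_small = gaussian_aic(np.column_stack([ones, x1]), y)
aic_true = gaussian_aic(np.column_stack([ones, x1, x2]), y)
aic_extra = gaussian_aic(np.column_stack([ones, x1, x2, noise_col]), y)

print("missing x2:     ", aic_small)
print("true predictors:", aic_true)
print("plus noise col: ", aic_extra)
```

Only the differences between these values matter. Adding a column can never raise the likelihood penalty-free fit, so an extra variable can increase AIC by at most 2, the size of its penalty.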

## Part 5: Evaluating the Model via Model Assumptions and Other Issues
***
Linear regression makes several assumptions. It is always best to check that these assumptions are valid after fitting a linear regression model.

<div class="span5 alert alert-danger">
<ul>
  <li>**Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$. This can be verified with a scatterplot of each $X$ vs. $Y$ and plotting correlations among $X$. Nonlinearity can sometimes be resolved by [transforming](https://onlinecourses.science.psu.edu/stat501/node/318) one or more independent variables, the dependent variable, or both. In other cases, a [generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model) or a [nonlinear model](https://en.wikipedia.org/wiki/Nonlinear_regression) may be warranted.</li>
  <li>**Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of X. We can check this by plotting each $X$ against $Y$ and verifying that there is no "funnel" shape showing data points fanning out as $X$ increases or decreases. Some techniques for dealing with non-constant variance include weighted least squares (WLS), [robust standard errors](https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors), or variance stabilizing transformations.
    </li>
  <li> **Normal distribution for errors**.  The $\epsilon$ terms we discussed at the beginning are assumed to be normally distributed. This can be checked with a plot of fitted values vs. residuals, verifying that there is no pattern, and with a quantile plot.
  $$ \epsilon_i \sim N(0, \sigma^2)$$
Sometimes the responses $Y$ may not be normally distributed at a given value of $X$, e.g. they may be skewed positively or negatively. </li>
<li> **Independent errors**.  The observations are assumed to be obtained independently.
    <ul>
        <li>e.g. Observations across time may be correlated</li>
    </ul>
</li>
</ul>  

</div>

There are some other issues that are important to investigate with linear regression models.

<div class="span5 alert alert-danger">
<ul>
  <li>**Correlated Predictors:** Care should be taken to make sure that the independent variables in a regression model are not too highly correlated. Correlated predictors typically do not majorly affect prediction, but do inflate standard errors of coefficients, making interpretation unreliable. Common solutions are dropping the least important variables involved in the correlations, using regularization, or, when many predictors are highly correlated, considering a dimension reduction technique such as principal component analysis (PCA).</li>
  <li>**Influential Points:** Data points that have undue influence on the regression model. These points can be high leverage points or outliers. Such points are typically removed and the regression model rerun.</li>
</ul>
</div>
