# Simple Linear Regression

## Explanation of parameters
### normalize = False
- **Standardization:** subtracting the mean and dividing by the standard deviation **(this is one type of normalization)**
- **Normalization:** the meaning depends on context; here, we subtract the mean and divide by the L2-norm of the inputs
- **Note:** the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2; in newer versions, scale the inputs yourself before fitting
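
To make the difference between the two definitions concrete, here is a minimal NumPy sketch. The data is made up, and it assumes the L2-norm is taken over the mean-centered values; it does not call sklearn's `normalize` parameter.

```python
import numpy as np

# Illustrative data (made up for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Both transformations start by subtracting the mean
centered = x - x.mean()

# Standardization: divide by the standard deviation
standardized = centered / x.std()

# Normalization (in the sense used here): divide by the L2-norm.
# Assumption: the norm is computed on the centered values.
normalized = centered / np.linalg.norm(centered)

print(standardized)
print(normalized)
```

Both results have mean 0; the standardized values have standard deviation 1, while the normalized values have L2-norm 1.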

### copy_X = True
- When this is `True`, sklearn copies the inputs before fitting, so the original data is left untouched
- This is a safety net against normalization and other transformations overwriting the original data
- Recall: when using statsmodels we would occasionally create copies of dataframes ourselves, but sklearn does this automatically

### fit_intercept = True
- In statsmodels we had to add a constant manually (`x = sm.add_constant(x1)`); this parameter takes care of that
- If you don't want an intercept, set this parameter to `False`
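
A small sketch of the effect of `fit_intercept`, using made-up data that follows y = 2x + 1 exactly. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * x.ravel() + 1

# With an intercept (the default): no need to add a constant column
with_intercept = LinearRegression(fit_intercept=True).fit(x, y)

# Without an intercept: the fitted line is forced through the origin
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)

print(with_intercept.intercept_, with_intercept.coef_)
print(no_intercept.intercept_, no_intercept.coef_)
```

With the intercept, the model recovers the slope and the constant; without it, the slope has to absorb the missing constant, so the fit is worse on this data.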

### n_jobs = 1
- This is used to parallelize routines
- By default only one CPU is used
- If you have a job with lots of data and more than one CPU available, you can set this parameter to be higher

![image.png](attachment:image.png)

# Multiple Linear Regression

## Feature Selection (F-regression)
- Feature selection simplifies models, improves speed, and prevents a series of unwanted issues arising from having too many features (inputs)
- We have done this already in statsmodels when we examined the p-values of the variables to determine if they were significant
- If a variable has a p-value > 0.05 we can disregard it

### feature_selection.f_regression
- F-regression creates simple linear regressions of each feature and the dependent variable
- In the example above we would have one simple regression of GPA on SAT, and one of GPA on Rand 1,2,3
- Then the method would calculate the F-statistic for each of those regressions and return the respective p-values
- If there were 50 features, 50 simple regressions would be created
- **Note:** For a simple linear regression, the p-value of the F-stat is the p-value of the only independent variable, so this method is exactly what we need
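
The procedure above can be sketched with `sklearn.feature_selection.f_regression`. The data here is synthetic and only stands in for the SAT / Rand 1,2,3 example: one feature drives the target and the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (not the SAT/GPA dataset)
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)   # feature that actually drives y
noise = rng.normal(size=n)         # feature unrelated to y
y = 3 * informative + rng.normal(scale=0.5, size=n)

X = np.column_stack([informative, noise])

# One simple regression per column; returns F-statistics and p-values
f_stats, p_values = f_regression(X, y)

print(f_stats.round(2))
print(p_values.round(3))
```

Each entry of `p_values` corresponds to one column of `X`, so features whose p-value exceeds the chosen threshold (0.05 in these notes) are candidates for removal.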

## Feature Scaling (Standardisation)
- The most common problem when working with numerical data is **difference in magnitudes**
- Feature scaling is a solution to this problem
- This is the process of transforming the data we are working with to a standard scale
- This means subtracting the mean and dividing by the standard deviation, which results in a standardised variable with mean 0 and standard deviation 1
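
A minimal sketch of standardisation with sklearn's `StandardScaler`. The two columns are made-up values chosen to have very different magnitudes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up inputs: the first column is in the thousands, the second in single digits
X = np.array([
    [1700.0, 1.0],
    [1800.0, 2.0],
    [1900.0, 3.0],
    [2000.0, 2.0],
])

scaler = StandardScaler()

# fit() learns each column's mean and standard deviation,
# transform() subtracts the mean and divides by the standard deviation
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

When new data arrives later, calling `scaler.transform` on it reuses the mean and standard deviation learned from the original data.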

# Overfitting and Underfitting
## Overfitting
- The model has focused on the particular training set so much that it has 'missed the point'
- High train accuracy
- Low test accuracy
- Difficult to spot

## Underfitting
- The model has not captured the underlying logic of the data
- Does not have strong predictive power
- Low train accuracy
- Low test accuracy
- Easy to spot, as no relationship will be found


![image.png](attachment:image.png)
- A good model would not be perfect but would be very close to the actual relationship

## Solution to avoid overfitting
- Split the initial dataset into two parts, training and test (e.g. 80/20); fit the model on the training part and check its performance on the test part
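
A sketch of an 80/20 split using `train_test_split`, followed by a comparison of train and test scores. The data is synthetic, and the `random_state` value is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear relationship plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 2 * X.ravel() + rng.normal(scale=0.5, size=100)

# 80% of the observations for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only
model = LinearRegression().fit(X_train, y_train)

# R-squared on each part; a large gap between them hints at overfitting
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(train_r2, test_r2)
```

A high training score paired with a much lower test score is the pattern described under overfitting; low scores on both point to underfitting.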