# Data Science Fundamentals
## Lesson 1: Linear Regression
Last Updated on August 17, 2021  |  Created by Brandi Beals

Linear **regression** is a machine learning technique in the **supervised** category. This category requires labeled data, which means the data set used to train a model contains examples the model can learn from. Labeled data is typically historical, so the answers are already known. Our goal is to use this historical knowledge to create a model that accurately predicts the label for data the model hasn't seen before.

![Types of Machine Learning](https://www.kindpng.com/picc/m/158-1585451_coding-deep-learning-for-beginners-machine-learning-algorithms.png)

In a data set used for supervised learning, there are **independent variables** that we hope will do a relatively good job of predicting our labeled **dependent variable**. A regression problem focuses on predicting a continuous numerical value, one that can fall anywhere along a range (e.g. numbers with decimals) rather than into a fixed set of categories. Further, linear regression assumes a linear relationship (as opposed to a non-linear relationship) between the independent and dependent variables, which determines the type of math used behind the scenes.
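
The linear relationship described above is commonly written as the equation below. The symbols are conventional notation rather than anything defined in this lesson: $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients the model learns, and $\varepsilon$ is the error the model cannot explain.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$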

The math (i.e. algorithms) used to train a model relies on a variety of assumptions. Ensuring these assumptions are met is one of the most important things you must do. In this lesson we will follow a standard machine learning process:
- [Data Wrangling](#Data-Wrangling)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
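- [Feature Engineering](#Feature-Engineering)
- [Split Data](#Split-Data)
- [Create Model](#Create-Model)
- [Make Predictions](#Make-Predictions)
- [Evaluate Performance](#Evaluate-Performance)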
- [Resources](#Resources)

### Import Packages
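
The lesson does not name a specific set of packages, so the following is only a minimal sketch that assumes a common Python stack (NumPy, pandas, and scikit-learn). A plotting library would normally be imported here as well for the visual steps.

```python
# Assumed packages; the lesson itself does not prescribe a stack.
import numpy as np    # numerical arrays and random number generation
import pandas as pd   # tabular data structures

from sklearn.linear_model import LinearRegression     # the model
from sklearn.model_selection import train_test_split  # train/test splitting
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# A seeded generator keeps any random steps reproducible.
rng = np.random.default_rng(42)
```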

### Data Wrangling
- get the data (size and shape)
- understand the features (units of measurement, descriptive statistics)
- clean the data if needed (feature names, data types)
- transform into a different shape if needed (reshaping)
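
The wrangling steps above might look something like the sketch below. The data set is synthetic and the column names (`size_sqft`, `age_years`, `price`) are hypothetical, chosen only to illustrate each step.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with messy column names and a numeric column stored as text.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "Size (sqft)": rng.uniform(500, 3500, size=100),
    "Age Years": rng.integers(0, 50, size=100).astype(str),
    "Price": rng.uniform(100_000, 500_000, size=100),
})

# Get the data: size and shape.
print(raw.shape)

# Understand the features: data types and descriptive statistics.
print(raw.dtypes)
print(raw.describe())

# Clean the data: consistent feature names and correct data types.
df = raw.rename(columns={
    "Size (sqft)": "size_sqft",
    "Age Years": "age_years",
    "Price": "price",
})
df["age_years"] = df["age_years"].astype(int)

# Reshape if needed: wide format to long format (one row per feature value).
long_df = df.melt(id_vars="price", var_name="feature", value_name="value")
print(long_df.head())
```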

### Exploratory Data Analysis
- visualize data (boxplot, distribution plots)
- identify relationships (scatterplot, pairs plot)
- test for multicollinearity (correlation plot, variance inflation factor)
- test for linear relationship (t-test, ANOVA)
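
As a rough illustration of the relationship and multicollinearity checks above, the sketch below computes a correlation matrix and variance inflation factors (VIF) on synthetic data. The variables `x1`, `x2`, and `x3` are hypothetical, and `x2` is deliberately built to be correlated with `x1`. The VIF is computed here as the diagonal of the inverse correlation matrix of the predictors, which avoids an extra dependency. Plots are left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is constructed to be strongly correlated with x1.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})

# Identify relationships: pairwise Pearson correlations.
corr = df.corr()
print(corr.round(2))

# Test for multicollinearity: the VIF of each predictor equals the matching
# diagonal entry of the inverse of the predictors' correlation matrix.
features = ["x1", "x2", "x3"]
predictor_corr = df[features].corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(predictor_corr)), index=features)
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely explained by the others. In this sketch `x1` and `x2` should stand out while `x3` stays close to 1.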

### Feature Engineering
- create new features that might be valuable
- transform features if needed (one-hot encoding, log transformation)
- scale data (normalization, standardization)
- handle dirty data (outliers, missing values)
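
The sketch below walks through one possible version of each feature engineering step on a tiny made-up data set. The columns (`size_sqft`, `bedrooms`, `neighborhood`, `price`) are hypothetical, and median imputation, the 1.5 × IQR cap, and standardization are just one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical data set with one missing value and one extreme value.
df = pd.DataFrame({
    "size_sqft": [850.0, 1200.0, np.nan, 2400.0, 1500.0, 9000.0],
    "bedrooms": [2, 3, 3, 4, 3, 5],
    "neighborhood": ["north", "south", "south", "east", "north", "east"],
    "price": [150_000.0, 210_000.0, 190_000.0, 380_000.0, 260_000.0, 900_000.0],
})

# Handle missing values: impute with the median.
df["size_sqft"] = df["size_sqft"].fillna(df["size_sqft"].median())

# Handle outliers: cap values that fall outside 1.5 * IQR of the quartiles.
q1, q3 = df["size_sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
df["size_sqft"] = df["size_sqft"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Create a new feature that might be valuable.
df["sqft_per_bedroom"] = df["size_sqft"] / df["bedrooms"]

# Transform a feature: log transformation to reduce skew.
df["log_size"] = np.log(df["size_sqft"])

# One-hot encode the categorical feature; dropping the first level avoids
# perfectly collinear dummy columns in a linear model.
encoded = pd.get_dummies(df, columns=["neighborhood"], drop_first=True, dtype=int)

# Scale data: standardization to mean 0 and standard deviation 1.
encoded["size_z"] = (
    encoded["size_sqft"] - encoded["size_sqft"].mean()
) / encoded["size_sqft"].std()

print(encoded.round(2))
```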

### Split Data
- divide the data set into a training set and a testing set (e.g. a 70/30 split)
- separate independent and dependent variables
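
A minimal sketch of the split, assuming scikit-learn's `train_test_split` and a synthetic data set with hypothetical columns `x1`, `x2`, and `y`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data set (illustrative only).
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 3.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=100)

# Separate independent variables (X) from the dependent variable (y).
X = df[["x1", "x2"]]
y = df["y"]

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```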

### Create Model
- use only training data in this step
- fit a benchmark model to improve upon with iterations
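
One way to fit a first benchmark model is sketched below, assuming scikit-learn's `LinearRegression` and synthetic data generated from known coefficients (intercept 3, slopes 2 and -1), so the fitted values can be sanity-checked. The training score recorded at the end is the number later iterations would try to beat.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship (illustrative only).
rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3.0 + 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the benchmark model on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# Inspect what the model learned and record a score to improve upon.
print("intercept:", model.intercept_)
print("coefficients:", dict(zip(X.columns, model.coef_)))
benchmark_r2 = model.score(X_train, y_train)
print("training R^2:", benchmark_r2)
```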

### Make Predictions
- use only testing data in this step
- make predictions using the model
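
Continuing with the same assumptions (scikit-learn and a synthetic data set), predictions on the held-out testing data could be generated as follows. The `comparison` table is a hypothetical convenience for viewing actual and predicted values side by side.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a fitted model (illustrative only).
rng = np.random.default_rng(4)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3.0 + 2.0 * X["x1"] - 1.0 * X["x2"] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict labels for the testing data, which the model has not seen.
y_pred = model.predict(X_test)

# Place actual and predicted values side by side for inspection.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
print(comparison.head())
```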

### Evaluate Performance
- calculate error metrics (MAE, MSE, RMSE, MAPE)
- calculate model comparison metrics (AIC, BIC, R²)
- visualize the residuals (residual plot, Q-Q plot, histogram of errors)
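
The error metrics above can be computed directly from their definitions. The sketch below uses small, made-up arrays of actual and predicted values so the arithmetic is easy to check by hand. scikit-learn's `sklearn.metrics` module provides equivalents such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import numpy as np

# Hypothetical actual and predicted values from a testing set.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# Residuals: actual minus predicted.
residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                  # mean absolute error
mse = np.mean(residuals ** 2)                     # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
mape = 100 * np.mean(np.abs(residuals / y_true))  # mean absolute percentage error

# R^2: share of the variance in y_true explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%  R2={r2:.3f}")
```

Note that MAPE is undefined when any actual value is zero, so it suits strictly positive targets best.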

### Resources
The following webpages will help further your knowledge and understanding of linear regression.

https://www.ibm.com/cloud/learn/data-labeling

https://towardsdatascience.com/a-checklist-for-linear-regression-bd7b3e47ea91

https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b

https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-trend-lines/v/fitting-a-line-to-data

https://www.unite.ai/what-is-linear-regression/

https://machinelearningmastery.com/simple-linear-regression-tutorial-for-machine-learning/

https://learn.datacamp.com/courses/introduction-to-linear-modeling-in-python