# Data Science Fundamentals
## Lesson 1: Linear Regression
Last Updated on August 17, 2021  |  Created by Brandi Beals

Linear **regression** is a **supervised** machine learning technique. Supervised learning requires labeled data: the data set used to train a model contains examples, each paired with a known answer, that the model can learn from. Labeled data is typically historical, so the answers are already known. Our goal is to use this historical knowledge to create a model that accurately predicts the label for data the model hasn't seen before.

![Types of Machine Learning](https://www.kindpng.com/picc/m/158-1585451_coding-deep-learning-for-beginners-machine-learning-algorithms.png)

In a data set used for supervised learning, there are **independent variables** that we hope will do a relatively good job of predicting our labeled **dependent variable**. A regression problem focuses on predicting a continuous numerical value, one that can fall anywhere along a range (e.g., a price with decimals) rather than into a fixed set of categories. Further, linear regression assumes a linear relationship (as opposed to a non-linear one) between the independent and dependent variables, which determines the type of math used behind the scenes.
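
As a reference point, the linear relationship described above is commonly written as the equation below. The symbols are notation introduced here, not taken from the lesson: $y$ is the dependent variable, $x_1, \dots, x_p$ are the independent variables, $\beta_0, \dots, \beta_p$ are the coefficients the model learns, and $\varepsilon$ is the error the line cannot explain.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$

Ordinary least squares, one common way to fit this model, chooses the coefficients that minimize the sum of squared differences between the predicted and actual labels.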

The math (i.e. algorithms) used to train a model relies on a variety of assumptions. Ensuring these assumptions are met is one of the most important things you must do. In this lesson we will follow a standard machine learning process:
- [Data Wrangling](#Data-Wrangling)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Feature Engineering](#Feature-Engineering)
- [Split Data](#Split-Data)
- [Create Model](#Create-Model)
- [Make Predictions](#Make-Predictions)
- [Evaluate Performance](#Evaluate-Performance)

### Import Packages

In [58]:
import yfinance as yf                        # https://pypi.org/project/yfinance/
import pandas_datareader as pdr              # https://pandas-datareader.readthedocs.io/en/latest/
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data Wrangling
- get the data (size and shape)
- understand the features (units of measurement, descriptive statistics)
- clean the data if needed (feature names, data types)
- transform into a different shape if needed (reshaping)

In [46]:
# Amazon Stock Price and Volume
# this package is currently experiencing an issue
# fin = yf.download("AMZN", start='2018-01-02', end='2020-10-28')

amzn = pd.read_csv('Prices_AMZN.csv')        # read in csv file since we can't get programmatically (bummer)
amzn['Date'] = pd.to_datetime(amzn['Date'])  # convert Date column to an actual DateTime data type
amzn = amzn.set_index('Date')                # set Date column as index
print(amzn.shape)                            # understand the shape of this dataframe (rows x columns)
amzn.head()                                  # view the first few rows of data

(712, 7)


| Date | Open | High | Low | Close | Adj Close | Volume | Ticker |
|---|---|---|---|---|---|---|---|
| 2018-01-02 | 1172.0 | 1190.0 | 1170.51001 | 1189.01001 | 1189.01001 | 2694500 | AMZN |
| 2018-01-03 | 1188.300049 | 1205.48999 | 1188.300049 | 1204.199951 | 1204.199951 | 3108800 | AMZN |
| 2018-01-04 | 1205.0 | 1215.869995 | 1204.660034 | 1209.589966 | 1209.589966 | 3022100 | AMZN |
| 2018-01-05 | 1217.51001 | 1229.140015 | 1210.0 | 1229.140015 | 1229.140015 | 3544700 | AMZN |
| 2018-01-08 | 1236.0 | 1253.079956 | 1232.030029 | 1246.869995 | 1246.869995 | 4279500 | AMZN |


In [49]:
print(amzn.head(1))                          # view first row in data set (take note of minimum date)
print(amzn.tail(1))                          # view last row in data set (take note of maximum date)
print(amzn.describe(include=['number']))     # get summary statistics for all numeric fields (look for oddities)

              Open    High         Low       Close   Adj Close   Volume Ticker
Date                                                                          
2018-01-02  1172.0  1190.0  1170.51001  1189.01001  1189.01001  2694500   AMZN
                   Open        High          Low        Close    Adj Close  \
Date                                                                         
2020-10-28  3249.300049  3264.02002  3162.469971  3162.780029  3162.780029   

             Volume Ticker  
Date                        
2020-10-28  5588300   AMZN  
              Open         High          Low        Close    Adj Close  \
count   712.000000   712.000000   712.000000   712.000000   712.000000   
mean   1969.124102  1991.580660  1943.836798  1968.780154  1968.780154   
std     514.032337   522.356915   503.879182   512.847289   512.847289   
min    1172.000000  1190.000000  1170.510010  1189.010010  1189.010010   
25%    1670.687500  1689.852478  1642.250000  1665.530029  1665.530029 

In [20]:
# US/Euro Exchange Rate
# https://fred.stlouisfed.org/series/DEXUSEU
mkt = pdr.get_data_fred('DEXUSEU')           # get data from API
mkt.columns = ['Exchange Rate']              # rename column so it makes sense
print(mkt.shape)                             # understand the shape of this dataframe
mkt.head()                                   # view the first few rows of data

(1300, 1)


| DATE | Exchange Rate |
|---|---|
| 2016-08-22 | 1.1314 |
| 2016-08-23 | 1.1308 |
| 2016-08-24 | 1.1256 |
| 2016-08-25 | 1.1274 |
| 2016-08-26 | 1.1237 |


In [57]:
print(mkt.head(1))                           # view first row in data set (take note of minimum date)
print(mkt.tail(1))                           # view last row in data set (take note of maximum date)
print(mkt.describe(include=['number']))      # get summary statistics for all numeric fields (look for oddities)

            Exchange Rate
DATE                     
2016-08-22         1.1314
            Exchange Rate
DATE                     
2021-08-13         1.1796
       Exchange Rate
count    1242.000000
mean        1.146482
std         0.048324
min         1.037500
25%         1.111700
50%         1.139900
75%         1.183175
max         1.248800
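
One oddity worth a closer look: `mkt.shape` reported 1300 rows, while `describe()` counts only 1242 values. Because `describe()` skips missing values, this suggests some dates carry no exchange rate. The sketch below shows one way to confirm that kind of gap. It uses a small made-up frame (the `rates` name and its values are assumptions for illustration, not the FRED data).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the exchange-rate frame (values are made up)
dates = pd.date_range("2021-01-04", periods=6, freq="B")
rates = pd.DataFrame(
    {"Exchange Rate": [1.2250, np.nan, 1.2300, 1.2270, np.nan, 1.2150]},
    index=dates,
)

n_rows = rates.shape[0]                          # counts every row, including NaN
n_valid = int(rates["Exchange Rate"].count())    # counts only non-missing values
n_missing = int(rates["Exchange Rate"].isna().sum())

print(n_rows, n_valid, n_missing)
print(rates[rates["Exchange Rate"].isna()])      # list the dates with no value
```

If the row count and the non-missing count disagree, the missing rows need a decision (drop or fill) before modeling.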


In [55]:
data = amzn.merge(mkt, how='inner', left_index=True, right_index=True)  # join the two tables together
data.head()                                                             # view the first few rows of data

|  | Open | High | Low | Close | Adj Close | Volume | Ticker | Exchange Rate |
|---|---|---|---|---|---|---|---|---|
| 2018-01-02 | 1172.0 | 1190.0 | 1170.51001 | 1189.01001 | 1189.01001 | 2694500 | AMZN | 1.205 |
| 2018-01-03 | 1188.300049 | 1205.48999 | 1188.300049 | 1204.199951 | 1204.199951 | 3108800 | AMZN | 1.203 |
| 2018-01-04 | 1205.0 | 1215.869995 | 1204.660034 | 1209.589966 | 1209.589966 | 3022100 | AMZN | 1.2064 |
| 2018-01-05 | 1217.51001 | 1229.140015 | 1210.0 | 1229.140015 | 1229.140015 | 3544700 | AMZN | 1.2039 |
| 2018-01-08 | 1236.0 | 1253.079956 | 1232.030029 | 1246.869995 | 1246.869995 | 4279500 | AMZN | 1.1973 |
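
To make the behavior of the inner join concrete, here is a minimal sketch on two toy frames. The `prices` and `fx` names and all values are made up for illustration. It shows two things: dates present in only one frame are dropped, and a missing value on a matching date survives the join.

```python
import numpy as np
import pandas as pd

# Toy frames with made-up values; only the index overlap matters here
prices = pd.DataFrame(
    {"Close": [100.0, 101.5, 102.0]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)
fx = pd.DataFrame(
    {"Exchange Rate": [1.12, np.nan, 1.11]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)

# Inner join on the index keeps only dates found in BOTH frames
merged = prices.merge(fx, how="inner", left_index=True, right_index=True)
print(merged)
```

Only 2020-01-02 and 2020-01-03 appear in the result, and the row for 2020-01-02 still holds a missing exchange rate, so an inner join alone does not remove missing values.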


### Exploratory Data Analysis
- visualize data (boxplot, distribution plots)
- identify relationships (scatterplot, pairs plot)
- test for multicollinearity (correlation plot, variance inflation factor)
- test for linear relationship (t-test, ANOVA)
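
The multicollinearity checks listed above can be sketched with pandas and NumPy alone. The data below is synthetic (an assumption for illustration): `high` is built to track `open` closely, while `volume` is independent. The `vif` helper is a hypothetical function written for this example; it computes the variance inflation factor as 1 / (1 - R²) from regressing one feature on the others.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: "high" tracks "open" closely, "volume" (in millions) is independent
open_ = rng.normal(100.0, 10.0, n)
high = open_ + rng.normal(1.0, 0.5, n)
volume = rng.normal(3.0, 0.5, n)
features = pd.DataFrame({"open": open_, "high": high, "volume": volume})

print(features.corr().round(3))          # pairwise Pearson correlations


def vif(df: pd.DataFrame, column: str) -> float:
    """Variance inflation factor: 1 / (1 - R^2) from regressing one feature on the rest."""
    y = df[column].to_numpy()
    others = df.drop(columns=column).to_numpy()
    X = np.column_stack([np.ones(len(df)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)


vifs = {col: vif(features, col) for col in features.columns}
print(vifs)
```

A frequently cited rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. In this sketch `open` and `high` land far above that range, while `volume` stays close to 1.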

### Feature Engineering
- create new features that might be valuable
- transform features if needed (one-hot encoding, log transformation)
- scale data (normalization, standardization)
- handle dirty data (outliers, missing values)
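
The steps above could look like the following sketch. The values are made up, and the derived column names (`Prev Close`, `Log Volume`, `Volume (std)`) are assumptions chosen for illustration. Forward-filling is only one of several ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Made-up daily values; column names follow the merged table in Data Wrangling
df = pd.DataFrame(
    {
        "Close": [100.0, 102.0, 101.0, 105.0, 104.0],
        "Volume": [2_000_000, 2_500_000, 1_800_000, 4_000_000, 2_200_000],
        "Exchange Rate": [1.20, np.nan, 1.21, 1.19, 1.20],
    },
    index=pd.date_range("2020-01-06", periods=5, freq="B"),
)

# Handle missing values: carry the last known exchange rate forward
df["Exchange Rate"] = df["Exchange Rate"].ffill()

# New feature: the previous day's close (the first row has no previous day)
df["Prev Close"] = df["Close"].shift(1)

# Log transformation to compress large values of volume
df["Log Volume"] = np.log(df["Volume"])

# Standardization: subtract the mean and divide by the standard deviation
df["Volume (std)"] = (df["Volume"] - df["Volume"].mean()) / df["Volume"].std()

df = df.dropna()    # drop the row that lacks a previous close
print(df.round(3))
```

In a full workflow, the mean and standard deviation used for scaling would normally be computed from the training set only, so that no information from the testing set leaks into the model.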

### Split Data
- divide the data set into a training set and testing set (70/30)
- separate independent and dependent variables
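
A minimal sketch of a 70/30 split follows, using made-up data. One assumption is added here: because the rows are ordered by date, the split is chronological (earliest 70% for training) rather than random. A random split, for example with scikit-learn's `train_test_split`, is a common alternative when row order carries no meaning.

```python
import numpy as np
import pandas as pd

# Made-up data: 10 trading days, two features and one label
dates = pd.date_range("2020-01-06", periods=10, freq="B")
df = pd.DataFrame(
    {
        "Prev Close": np.arange(100.0, 110.0),
        "Exchange Rate": np.linspace(1.10, 1.19, 10),
        "Close": np.arange(101.0, 111.0),
    },
    index=dates,
)

split_at = int(round(len(df) * 0.7))     # first 70% of rows go to training
train, test = df.iloc[:split_at], df.iloc[split_at:]

feature_cols = ["Prev Close", "Exchange Rate"]   # independent variables
label_col = "Close"                              # dependent variable

X_train, y_train = train[feature_cols], train[label_col]
X_test, y_test = test[feature_cols], test[label_col]

print(X_train.shape, X_test.shape)
```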

### Create Model
- use only training data on this step
- fit a benchmark model to improve upon with iterations
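
The lesson does not say which benchmark it has in mind, so the sketch below makes an assumption: a plain ordinary least squares fit plays the role of the first model, and a "predict the training mean" baseline serves as a sanity check. The training data is synthetic, generated from a known line so the recovered coefficients can be compared with the true ones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training data from a known line: y = 3 + 2*x1 - 1*x2 + noise
X_train = rng.normal(size=(100, 2))
y_train = 3.0 + 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + rng.normal(scale=0.1, size=100)

# Naive baseline: always predict the training mean
baseline = y_train.mean()
baseline_mse = np.mean((y_train - baseline) ** 2)

# Ordinary least squares: a leading column of ones makes the first coefficient the intercept
A = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
model_mse = np.mean((y_train - A @ coef) ** 2)

print("intercept and slopes:", coef.round(2))
print("baseline MSE:", round(float(baseline_mse), 3), "model MSE:", round(float(model_mse), 3))
```

Only the training arrays are touched here. Later iterations (new features, transformations) can be judged by whether they beat this first fit.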

### Make Predictions
- use only testing data on this step
- make predictions using the model
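
Once a linear model is fitted, a prediction is the intercept plus each coefficient multiplied by its feature value. The sketch below illustrates that arithmetic. The `intercept` and `coefficients` values are hypothetical stand-ins for a model fitted on the training set, and the testing rows are made up.

```python
import pandas as pd

# Hypothetical coefficients, standing in for a model fitted on the training set
intercept = 5.0
coefficients = pd.Series({"Prev Close": 0.98, "Exchange Rate": -2.0})

# Made-up testing features the model has not seen
X_test = pd.DataFrame(
    {"Prev Close": [108.0, 109.0, 110.0], "Exchange Rate": [1.17, 1.18, 1.19]},
    index=pd.date_range("2020-01-15", periods=3, freq="B"),
)

# Prediction = intercept + sum of (coefficient * feature value) for each row
y_pred = intercept + X_test[coefficients.index].dot(coefficients)
print(y_pred.round(2))
```

Selecting `X_test[coefficients.index]` keeps the feature columns in the same order as the coefficients, which guards against silently mismatched columns.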

### Evaluate Performance
- calculate error metrics (MAE, MSE, RMSE, MAPE)
- calculate model comparison metrics (AIC, BIC, R²)
- visualize residuals (residual plot, Q-Q plot, histogram of errors)
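
The error metrics listed above, along with R², can be computed directly with NumPy. The sketch below uses four made-up actual and predicted values so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up actual and predicted values for four test rows
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

residuals = y_true - y_pred                          # errors: -10, 10, -30, 20

mae = np.mean(np.abs(residuals))                     # mean absolute error
mse = np.mean(residuals ** 2)                        # mean squared error
rmse = np.sqrt(mse)                                  # root mean squared error
mape = np.mean(np.abs(residuals / y_true)) * 100     # mean absolute percentage error
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} MAPE={mape:.2f}% R2={r2:.3f}")
```

For these values MAE is 17.5, MSE is 375, MAPE is 7.5%, and R² is 0.97. MAPE divides by the actual values, so it is undefined when any actual value is zero.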

### Resources
The following webpages will help further your knowledge and understanding of linear regression.
- https://www.ibm.com/cloud/learn/data-labeling
- https://towardsdatascience.com/a-checklist-for-linear-regression-bd7b3e47ea91
- https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b
- https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/introduction-to-trend-lines/v/fitting-a-line-to-data
- https://www.unite.ai/what-is-linear-regression/
- https://machinelearningmastery.com/simple-linear-regression-tutorial-for-machine-learning/
- https://learn.datacamp.com/courses/introduction-to-linear-modeling-in-python