This project recreates a linear regression algorithm from scratch using two Python libraries: pandas for data handling and numpy for matrix operations. It then compares the result with the linear regression implementation from scikit-learn. The dataset is a two-column salary dataset with 30 rows. The independent variable "YearsExperience" is the number of years of job experience. The dependent variable "Salary" is the salary earned.
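As a rough illustration of the idea, here is a minimal sketch of a from-scratch fit with numpy. It assumes the closed-form ordinary least squares solution, which may differ from the approach taken in the notebook. The function names (`fit_linear_regression`, `predict`, `fit_with_sklearn`) are hypothetical, and the data is made up for the example rather than taken from the salary dataset.

```python
import numpy as np


def fit_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (assumed method)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # Design matrix: a column of ones (intercept term) next to the feature column.
    X = np.column_stack([np.ones_like(x), x])
    # Solve the least-squares problem; lstsq avoids explicitly inverting X.T @ X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    intercept, slope = beta
    return float(intercept), float(slope)


def predict(x, intercept, slope):
    """Apply the fitted line to new inputs."""
    return intercept + slope * np.asarray(x, dtype=float)


def fit_with_sklearn(x, y):
    """Reference fit with scikit-learn, for comparison with the scratch version."""
    from sklearn.linear_model import LinearRegression  # imported only when called

    features = np.asarray(x, dtype=float).reshape(-1, 1)  # sklearn expects 2-D X
    model = LinearRegression().fit(features, np.asarray(y, dtype=float))
    return float(model.intercept_), float(model.coef_[0])


# Made-up illustrative data (NOT the salary dataset): exactly y = 25000 + 9000 * x.
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = 25000.0 + 9000.0 * years

intercept, slope = fit_linear_regression(years, salary)
```

Calling `fit_with_sklearn(years, salary)` on the same arrays should return values very close to `intercept` and `slope`, which is the kind of comparison this project makes.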
Recreating the linear regression algorithm has several benefits:
- understanding what happens under the hood of the algorithm
- understanding whether, and why, linear regression is a good fit for predicting on a given dataset
- being able to customize the model
For a detailed explanation of the theory behind this computation, check out the accompanying article on Medium.
You can find the code for this project here.
File overview:
- LinearRegressionScratch.ipynb - the full code for this project
To follow this project, please install the following locally:
- Python 3.8+
- Python packages
  - pandas
  - numpy
  - scikit-learn
The data used for this implementation is a salary dataset originally published on Kaggle.
You can download the file we'll use in this project here:
- Salary_Data.csv - the salary data that we use in this project.
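Once the file is downloaded, loading it might look like the sketch below. It assumes `Salary_Data.csv` sits in the working directory and has the two columns named above, "YearsExperience" and "Salary". The helper name `load_salary_data` is hypothetical and not taken from the notebook.

```python
import pandas as pd


def load_salary_data(path="Salary_Data.csv"):
    """Read the salary CSV and return (years_experience, salary) as numpy arrays."""
    df = pd.read_csv(path)
    # Fail early with a clear message if the expected columns are absent.
    missing = {"YearsExperience", "Salary"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    x = df["YearsExperience"].to_numpy(dtype=float)
    y = df["Salary"].to_numpy(dtype=float)
    return x, y
```

A call such as `x, y = load_salary_data()` then yields two one-dimensional arrays that can be passed to a regression routine.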