- i wanted to kill two birds with one stone and do machine learning in golang
- at least now i understand why we use python for data science :D
Ah, regression. The statistical process of modelling the relationship between a dependent variable and one or more independent variables, enabling us to predict new values. Note that regression techniques are generally concerned with predicting continuous values, as opposed to a discrete set of categories.
This brings us to possibly the most fundamental model, linear regression, expressed with the battle tested equation:
y = mx + b
Which describes a line with gradient m and y-intercept b.
One way of actually computing m and b is with the ordinary least squares method:
- Randomise values for both m and b to create an example line
- Find the vertical distance between the example line and each point in the dataset (these distances are called 'errors'): eᵢ = yᵢ − (mxᵢ + b)
- Sum the squares of these errors: SSE = Σ eᵢ² (a small Go sketch of both steps follows this list)
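Concretely, here's a minimal sketch of those two steps in Go (the function and variable names are my own):

```go
package main

import "fmt"

// sse returns the sum of squared errors of the line y = m*x + b
// against the points (xs[i], ys[i]).
func sse(m, b float64, xs, ys []float64) float64 {
	var sum float64
	for i := range xs {
		e := ys[i] - (m*xs[i] + b) // the error for point i
		sum += e * e
	}
	return sum
}

func main() {
	xs := []float64{1, 2, 3}
	ys := []float64{2, 4, 7}
	// Line y = 2x: errors are 0, 0, 1, so SSE is 1.
	fmt.Println(sse(2, 0, xs, ys))
}
```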
Now, we iteratively adjust the values of m and b to minimize this sum. A ubiquitous optimization technique for finding a local minimum is called gradient descent, but that's a topic for another day :)
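Still, to make the loop concrete, here's a rough Go sketch of the whole fitting process using gradient descent on the mean squared error. The learning rate and step count are arbitrary picks of mine, I start m and b at zero rather than random values to keep it short, and the gradient formulas are the "another day" part, so take them on faith here:

```go
package main

import "fmt"

// fit adjusts m and b with plain gradient descent on the
// mean squared error. lr is the learning rate.
func fit(xs, ys []float64, lr float64, steps int) (m, b float64) {
	n := float64(len(xs))
	for step := 0; step < steps; step++ {
		var gradM, gradB float64
		for i := range xs {
			e := ys[i] - (m*xs[i] + b) // the error for point i
			gradM += -2 * e * xs[i] / n
			gradB += -2 * e / n
		}
		// Step both parameters downhill along the gradient.
		m -= lr * gradM
		b -= lr * gradB
	}
	return m, b
}

func main() {
	xs := []float64{1, 2, 3, 4}
	ys := []float64{3, 5, 7, 9} // exactly y = 2x + 1
	m, b := fit(xs, ys, 0.05, 5000)
	fmt.Printf("m ≈ %.3f, b ≈ %.3f\n", m, b)
}
```

With this toy data it settles near m = 2, b = 1, the line the points were generated from.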
The accuracy and performance of linear regression depend on its assumptions:
- Linearity: there is a linear relationship between the dependent variable and the independent variable(s)
- Normality: the residuals (the errors above) are normally distributed
- No multicollinearity: your independent variables should not be predictors of each other, almost by definition (a quick check follows this list)
- No auto-correlation: a fancy way of saying your observations should not depend on their own past values, i.e. they are not points in a time series; Tesla's share price, for example
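As a rough sanity check for the multicollinearity point, you can compute the pairwise Pearson correlation between candidate independent variables. This sketch, and the 0.9 threshold, are just my own rule of thumb:

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient between xs and ys.
func pearson(xs, ys []float64) float64 {
	n := float64(len(xs))
	var sumX, sumY float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
	}
	meanX, meanY := sumX/n, sumY/n
	var cov, varX, varY float64
	for i := range xs {
		dx, dy := xs[i]-meanX, ys[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	return cov / math.Sqrt(varX*varY)
}

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{2, 4, 6, 8.1} // nearly a multiple of a
	if r := pearson(a, b); math.Abs(r) > 0.9 {
		fmt.Printf("r = %.3f: these predictors look collinear\n", r)
	}
}
```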
Pitfalls:
- Extrapolating beyond the range of the observed data can quickly become very inaccurate
- Extreme outliers can throw off the model; because the errors are squared, a single distant point can dominate the sum

