# Tools

scikit-learn install directions: https://scikit-learn.org/stable/install.html

Activate the venv with `sklearn-venv\Scripts\activate`

(Made a folder on desktop called github_ml_course, venv is in this folder)

In VS Code Jupyter notebooks, press `M` (in command mode) to switch a cell to Markdown and `Y` to switch it to code.

### Scikit-learn
"Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities"

### Types of Regression
- Linear regression: good when you are seeking a numeric value. Example: estimate a height given an age.
- Logistic regression: good when you are seeking a category assignment. Example: determine whether a meal should be considered vegan.

### Diabetes Dataset


![image.png](attachment:image.png)

This shows a simple linear regression (1 independent variable and 1 dependent variable)
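
A minimal sketch of fitting a simple linear regression like this with scikit-learn's bundled diabetes data. Which feature the plot above uses is not recorded in these notes, so picking BMI (column 2) is an assumption made only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load the diabetes data bundled with scikit-learn (no download needed).
X, y = load_diabetes(return_X_y=True)

# Keep a single feature so this is a simple linear regression.
# Column 2 is BMI in this dataset's feature order (an arbitrary choice here).
X_bmi = X[:, [2]]  # double brackets keep the 2-D shape scikit-learn expects

model = LinearRegression()
model.fit(X_bmi, y)

slope = model.coef_[0]
intercept = model.intercept_
print(f"y is approximately {slope:.1f} * bmi + {intercept:.1f}")
```

The fitted `slope` and `intercept` define the straight line; `model.predict` applies that line to new values.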

### What is regression?

Regression establishes a relationship between features and labels/targets.
In regression, 'training a model' means finding a function that can use the independent variable(s) to calculate the dependent variable.
Essentially, find a function $f$ such that $f(x) = y$.

Linear regression uses training data to find a linear function of x to predict y. 

To evaluate how well the model works, you can measure the difference between the predicted and actual values.
In statistics, the difference between the predicted and actual label is often called *error*.
Data science uses slightly different terminology: we compare a predicted value ($\hat{y}$) with an observed value ($y$), and the difference between them is called the *residual*. The residuals for all validation predictions can be summarized to calculate the overall *loss* in the model as a metric of predictive performance.

One loss metric is mean squared error (MSE): square the individual residuals, sum the squares, and divide by the number of predictions. Squaring makes every term non-negative, so positive and negative residuals cannot cancel out, and it gives more weight to larger residuals.

Given $y$ and $\hat{y}$, the residual is $(y-\hat{y})$ and the squared error is $(y-\hat{y})^2$.
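
Written out for $n$ validation predictions (the index $i$ and the count $n$ are notation added here, not from the course), the mean of the squared errors is:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

For example, with observed values $(3, 5, 7)$ and predictions $(2, 5, 9)$, the residuals are $(1, 0, -2)$, the squared errors are $(1, 0, 4)$, and $\text{MSE} = 5/3 \approx 1.67$.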

In general, a lower MSE means less loss, but MSE is expressed in squared units of the label, so the number itself is hard to interpret except in comparison with another MSE. To make it easier to interpret, take the square root of the MSE to get the root mean squared error (RMSE), which is in the same units as the label.
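
A small sketch of the calculation in plain Python. The numbers are made up for illustration and do not come from any dataset in these notes.

```python
import math

# Made-up observed labels and model predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]

# Residual for each prediction: observed minus predicted.
residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]

# MSE: square each residual, sum the squares, divide by the count.
mse = sum(r ** 2 for r in residuals) / len(residuals)

# RMSE: square root of MSE, back in the same units as the label.
rmse = math.sqrt(mse)

print(f"MSE  = {mse:.4f}")   # MSE  = 0.3750
print(f"RMSE = {rmse:.4f}")  # RMSE = 0.6124
```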

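Another common metric is $R^2$ (R-squared, or the coefficient of determination). It measures the proportion of the variance in $y$ that the model explains. For a simple linear regression it equals the squared correlation between $x$ and $y$. On the training data of a linear regression it falls between 0 and 1, with 1 being perfect prediction; on held-out data it can be negative when the model predicts worse than always guessing the mean.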
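
A sketch of $R^2$ computed from its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The data is made up, and the variable names `ss_res` and `ss_tot` are just labels chosen here.

```python
# Made-up observed labels and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: error left over after using the model.
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))

# Total sum of squares: error from always predicting the mean label.
ss_tot = sum((y - mean_y) ** 2 for y in y_true)

r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")  # R^2 = 0.925
```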

## Experimenting with Regression Models

- Linear regression is the simplest type of regression.
	- no limit to # of features used
- Decision trees: a step-by-step approach to predicting a variable
	- may split the data based on one feature and then assess from there
- Ensemble algorithms
	- build many trees and combine them, as in a random forest
	- ensemble algorithms combine multiple base estimators to produce a better model than any single one
		- can be done by applying an aggregate function to a collection of base models (bagging)
			- kind of like running multiple algorithms in parallel
			- mainly reduces variance rather than bias
			- good when you have a lot of features and want to reduce variance
		- can also be done by building a sequence of models that build on one another to improve predictive power (boosting)
			- like running multiple algorithms in series
			- mainly reduces bias rather than variance
			- good when you have a small number of features and want to reduce bias
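
A rough sketch comparing one bagging-style and one boosting-style ensemble from scikit-learn. Using the bundled diabetes data, a 70/30 split, and 100 trees per model are all choices made here for illustration, not settings from the course.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # Bagging: many trees trained independently, predictions averaged.
    "random forest (bagging)": RandomForestRegressor(n_estimators=100, random_state=0),
    # Boosting: trees trained in sequence, each correcting the previous ones.
    "gradient boosting (boosting)": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # RMSE is the square root of the mean squared error.
    rmse_by_model[name] = mean_squared_error(y_test, predictions) ** 0.5
    print(f"{name}: RMSE = {rmse_by_model[name]:.1f}")
```

Which of the two does better depends on the data and the settings, so the printed numbers are a comparison for this split only.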

## Improving Models with Hyperparameters

Training is iterative: fit the model, then compare its predictions with the expected labels. If the predictions are accurate enough, consider the model trained. If not, adjust the model and loop again.
Hyperparameters are settings chosen before training that govern how a model is fit, rather than values learned from the data.
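
A sketch of that fit, compare, adjust loop for a single hyperparameter. The choice of a decision tree, its `max_depth` setting, and the candidate values are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
# Try several tree depths; None means the tree may grow without a depth limit.
for max_depth in [1, 2, 3, 5, 8, None]:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    results[max_depth] = rmse
    print(f"max_depth={max_depth}: validation RMSE = {rmse:.1f}")

# Keep the setting with the lowest validation RMSE.
best_depth = min(results, key=results.get)
print(f"Best max_depth on this split: {best_depth}")
```

scikit-learn's `GridSearchCV` automates this kind of search with cross-validation; the manual loop is shown only because it mirrors the description above.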

## Preprocessing

Preprocessing data can help the algorithm understand the data better.
This could be
- changing categorical data to numerical data, by various methods
- scaling: scale numerical values down to a value between 0 and 1, so that features with very different ranges are comparable for the model
- using categories as features with one-hot vectors
	- Say you have 3 categories: car, bike, train. A one-hot vector is all 0s except for a single 1, which marks the chosen category. If car is the first element, bike the second, and train the third, then car is (1,0,0), bike is (0,1,0), and train is (0,0,1).
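
The car, bike, train example can be sketched in plain Python. The helper name `one_hot` is made up here; scikit-learn's `OneHotEncoder` does the same job on real data.

```python
categories = ["car", "bike", "train"]  # position in this list fixes the vector slot


def one_hot(value, categories):
    """Return a list with a 1 in the slot for `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


for value in categories:
    print(value, one_hot(value, categories))
# car [1, 0, 0]
# bike [0, 1, 0]
# train [0, 0, 1]
```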

### Scaling Numeric Features
Normalizing numeric features so they're on the same scale prevents features with much larger values from producing coefficients that unevenly affect the predictions.
This can be done a few ways.
1. if you know the minimum and maximum values a feature can take, you can scale between those, with the min being 0 and the max being 1
2. you can scale between the highest and lowest values that appear in the data
3. you can standardize using the mean and standard deviation: subtract the mean and divide by the standard deviation, which keeps the shape of the distribution but centers it at 0 with a standard deviation of 1 (values are not confined to the range 0 to 1)
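
A plain-Python sketch of the three approaches. The feature values and the assumed known bounds of 0 and 100 are made up; the third approach is shown as z-score standardization (subtract the mean, divide by the standard deviation), which produces values centered on 0 rather than confined to 0 to 1.

```python
import statistics

values = [12.0, 15.0, 20.0, 33.0, 50.0]  # made-up feature values


def min_max_scale(values, lower, upper):
    """Map `lower` to 0 and `upper` to 1, linearly in between."""
    return [(v - lower) / (upper - lower) for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# 1. Known theoretical bounds (assumed here to be 0 and 100).
scaled_known = min_max_scale(values, 0.0, 100.0)
print(scaled_known)

# 2. Bounds observed in the data itself.
scaled_observed = min_max_scale(values, min(values), max(values))
print(scaled_observed)

# 3. Mean and standard deviation.
standardized = standardize(values)
print(standardized)
```

In scikit-learn, `MinMaxScaler` covers the second approach and `StandardScaler` the third.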

### Encoding Categorical Variables
ML works best with numeric features, not words, so it can be helpful to encode categorical variables as numbers. You can use ordinal encoding or, as described above, one-hot vector encoding.

![image.png](attachment:image.png)

Categorical data, ordinal encoding, and one-hot vector encoding.
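
A short sketch of ordinal encoding. The size categories are hypothetical, chosen because they have a natural order; ordinal codes imply an ordering, so for unordered categories such as car, bike, and train, one-hot encoding avoids suggesting one.

```python
# Hypothetical ordered category: sizes with a natural low-to-high order.
size_order = ["small", "medium", "large"]

# Map each label to its position in the order: small -> 0, medium -> 1, large -> 2.
ordinal_code = {label: index for index, label in enumerate(size_order)}

observations = ["medium", "small", "large", "medium"]
encoded = [ordinal_code[label] for label in observations]
print(encoded)  # [1, 0, 2, 1]
```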

These transformations can be done with scikit-learn pipelines, which let you specify a sequence of preprocessing steps that ends in an algorithm. The entire pipeline can then be fit to the data, so the model essentially includes the preprocessing steps as well as the regression algorithm.

At some point, to test my knowledge, I can try the challenge listed here:
https://learn.microsoft.com/en-us/training/modules/train-evaluate-regression-models/9-summary