# Intro

## Intro to Machine Learning

### Supervised
Input data is used to predict a label for that data.
- Example: using transaction data to label certain transactions as fraudulent

### Unsupervised
Clustering data based on common characteristics.
- No labels, they just get clumped together
- Grouping customer segments
- Grouping docs by topic

## Why Linear Regression

### Interpretability
LinReg models have clear levers for how changes in the inputs might impact the result, although how clear depends on the number of conditions.
- LinReg is about isolating those components.

### Computational Efficiency
Compared with deep learning, LinReg or Logistic Regression (LogReg) can find a solution faster and with fewer resources.

### Usefulness in Controlled Situations
When used in a real-world scenario, it's often for experimental work, especially where a high level of control over variables is needed.
- Due to their simplicity, linreg/logreg models have limitations in the kinds of relationships b/w vars that can be captured.
- Still enable control over most of the vars and focus on answering specific questions.
- Lots of assumptions that might make results unactionable (assumptions are somewhat simplifying the problem)

# Regression

## Intro to LinReg

### Simple LinReg
Linear comparison of two quantitative variables. We just care about the relationship between the two.
- hours of studying vs test grades
- Often plotted as a scatter plot
- Variables X (explanatory var) and Y (response variable)
  - x is used to predict the response (we have control over this)
  - y is the thing we are interested in predicting
- We usually try to draw a line to fit the data

### Scatter Plots
- Used to view relationship between two variables
- Strength (closeness of points) and direction (positive or negative) of relationship
- **Correlation Coefficient** - strength and direction of a linear relationship (denoted by lowercase $r$)
  - Always between -1 and 1. The closer to -1 or 1, the stronger the relationship; the closer to 0, the weaker. A negative value tells you it's a negative relationship while a positive value tells you it's a positive relationship

  ![image.png](attachment:image.png)

  ![image-2.png](attachment:image-2.png)

  - If you apply Pearson's Correlation Coefficient (PCC) to a symmetric quadratic relationship (plot 3), it will be close to 0 even though the variables are clearly related. PCC does not accommodate quadratics; it only measures LINEAR relationships.
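A quick sketch of the point above, using made-up data (the `x` grid and both `y` series are assumptions for illustration, not the data in the plots). It compares Pearson's r for a perfectly linear relationship and for a quadratic that is symmetric around zero.

```python
import numpy as np

# Hypothetical inputs: an x grid that is symmetric around 0.
x = np.linspace(-5, 5, 101)

y_linear = 2 * x + 1      # perfectly linear in x
y_quadratic = x ** 2      # perfectly determined by x, but not linearly

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two inputs.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"r for linear data:    {r_linear:.3f}")
print(f"r for quadratic data: {r_quadratic:.3f}")
```

The quadratic case only lands near zero because the x values are symmetric around the vertex; a lopsided range of x would give a nonzero r.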

### Correlation Coeff

Equation below as well as definitions for strong, moderate, and weak relationships

![image-3.png](attachment:image-3.png)
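For reference, here is a minimal sketch of the usual sample formula for Pearson's r, written out by hand. The exact form in the image may be arranged differently, and the `hours`/`grades` numbers and the helper name `pearson_r` are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical data: hours studied vs test grade.
hours = [1, 2, 3, 4, 5]
grades = [52, 60, 71, 75, 88]

r = pearson_r(hours, grades)
print(f"r = {r:.4f}")
```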

## Fitting a Line

A line is defined by its intercept (the value of y when x = 0) and its slope (rise/run).

![image-4.png](attachment:image-4.png)

$\beta_1$ denotes the parameter (the true population slope) and $b_1$ denotes the statistic (the slope estimated from the sample)

![image-5.png](attachment:image-5.png)

$\hat{y}_1$ ("y hat") labels the point on the line that has the same x value as the observed pair $(x_1, y_1)$. The observed $y_1$ is usually not on the line; the hat means the value is predicted.
- This is the predicted value of the response from the line

### Least Squares Algorithm
Minimizes the sum of the squared vertical distances from the line to points (in order to draw a line of best fit)
- For each of the datapoints in the set, get the difference between actual and predicted, square it, then sum it all together.

![image-6.png](attachment:image-6.png)

- This sum of squared differences is the function we want to minimize
- The resulting line usually does a good job in most scenarios
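The steps above can be sketched with the standard closed-form solution for simple linear regression. The data is made up for illustration, and the helper name `sse_for` is hypothetical. The code computes the slope and intercept, the predicted values, and the sum of squared errors, so it can be compared against another candidate line.

```python
import numpy as np

# Hypothetical data: hours studied (x) vs test grade (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 71.0, 75.0, 88.0])

def sse_for(intercept, slope):
    """Sum of squared vertical distances between the points and a given line."""
    predicted = intercept + slope * x
    return float(((y - predicted) ** 2).sum())

# Closed-form least squares estimates for one explanatory variable.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x        # predicted response for each observed x
sse = sse_for(b0, b1)      # the quantity least squares minimizes

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {sse:.2f}")
print(f"SSE for a slightly steeper line: {sse_for(b0, b1 + 0.5):.2f}")
```

Any other intercept/slope pair gives a larger SSE on this data, which is what "line of best fit" means here.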
 

## Intro to StatsModels

Lib for performing LinReg and much more

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('./datasets/house_price_area_only.csv')
df.head()

| Unnamed: 0 | price | area |
|---|---|---|
| 0 | 598291 | 1188 |
| 1 | 1744259 | 3512 |
| 2 | 571669 | 1134 |
| 3 | 493675 | 1940 |
| 4 | 1101539 | 2208 |


In [None]:
df['intercept'] = 1  # add a column of 1s for the intercept (sm does not automatically do this)

lm = sm.OLS(df['price'], df[['intercept', 'area']])  # linear model
results = lm.fit()  # fit the model with a line
results.summary()  # get a summary of the regression results

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.678 |
| Model: | OLS | Adj. R-squared: | 0.678 |
| Method: | Least Squares | F-statistic: | 12690.0 |
| Date: | Wed, 25 Jun 2025 | Prob (F-statistic): | 0.0 |
| Time: | 21:37:14 | Log-Likelihood: | -84517.0 |
| No. Observations: | 6028 | AIC: | 169000.0 |
| Df Residuals: | 6026 | BIC: | 169100.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9587.8878 | 7637.479 | 1.255 | 0.209 | -5384.303 | 2.46e+04 |
| area | 348.4664 | 3.093 | 112.662 | 0.000 | 342.403 | 354.530 |

| | | | |
|---|---|---|---|
| Omnibus: | 368.609 | Durbin-Watson: | 2.007 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 349.279 |
| Skew: | 0.534 | Prob(JB): | 1.43e-76 |
| Kurtosis: | 2.499 | Cond. No. | 4930.0 |
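The `coef` column of the summary defines the fitted line, so a prediction is just intercept plus slope times area. The sketch below hard-codes the two coefficients from the output and applies them to the five rows shown by `df.head()`. The function name `predict_price` is made up for illustration.

```python
# Coefficients copied from the regression summary.
B0 = 9587.8878   # intercept
B1 = 348.4664    # slope: predicted change in price per one-unit increase in area

def predict_price(area):
    """Predicted price (y hat) from the fitted line for a given area."""
    return B0 + B1 * area

# The five rows shown by df.head().
areas = [1188, 3512, 1134, 1940, 2208]
prices = [598291, 1744259, 571669, 493675, 1101539]

for area, price in zip(areas, prices):
    predicted = predict_price(area)
    residual = price - predicted   # actual minus predicted
    print(f"area={area:>5}  actual={price:>9}  predicted={predicted:>12.2f}  residual={residual:>12.2f}")
```

With a fitted `results` object, the same values should be available through `results.predict(...)`; the hand-written version just makes the arithmetic visible.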


### Interpreting the results

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-5.png](attachment:image-5.png)

These are the p-values. In regression, p-values are always given for testing whether the parameter for the intercept or the slope is equal to zero under the null hypothesis.

For the alternative, Python and other software by default compute p-values using "not equal to" (a two-sided test). These p-values give a quick glimpse of whether or not a particular variable is useful for predicting the response.

The p-value on the intercept isn't as useful as the one for area. The p-value for area suggests that area is statistically significant in relating to the price.

We can do hypothesis tests against the coefficients in the linear model. These help us figure out if the linear relationship between the var and the response is statistically significant.
- The hypothesis test for the intercept isn't useful in most cases.


*However, the hypothesis test for each x-variable is a test of if that population slope is equal to zero vs. an alternative where the parameter differs from zero. Therefore, if the slope is different than zero (the alternative is true), we have evidence that the x-variable attached to that coefficient has a statistically significant linear relationship with the response. This in turn suggests that the x-variable should help us in predicting the response (or at least be better than not having it in the model).*
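To see where the `t` and `P>|t|` columns come from, the sketch below recomputes them from the `coef` and `std err` values in the summary. It assumes a standard normal approximation for the p-value, which is close to the t distribution here only because there are about 6,000 residual degrees of freedom. statsmodels itself uses the t distribution, so small differences are expected. The helper name `two_sided_p_normal` is made up.

```python
import math

# Values copied from the coefficient table in the summary.
coef = {"intercept": 9587.8878, "area": 348.4664}
std_err = {"intercept": 7637.479, "area": 3.093}

def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Test statistic for H0: parameter = 0 is the estimate divided by its standard error.
t_stats = {name: coef[name] / std_err[name] for name in coef}
p_values = {name: two_sided_p_normal(t) for name, t in t_stats.items()}

# Approximate 95% interval for the slope: estimate +/- 1.96 standard errors.
ci_area = (coef["area"] - 1.96 * std_err["area"],
           coef["area"] + 1.96 * std_err["area"])

for name in coef:
    print(f"{name}: t = {t_stats[name]:.3f}, p = {p_values[name]:.3f}")
print(f"approx. 95% CI for area: ({ci_area[0]:.3f}, {ci_area[1]:.3f})")
```

The large t for area (and a p-value that rounds to 0) is the evidence against a zero slope, while the intercept's p-value of roughly 0.21 gives no such evidence.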

### Does the line fit well?
![image-4.png](attachment:image-4.png)

This part of the output is useful for understanding how well the line fits through the data
- R-Squared = closer to 1 is a better fit
  - in simple linear regression, it is just the square of the correlation coefficient
  - "the amount of variability in the response (y) that can be explained with the x-var"
  - With R-squared = 0.678, this reads as "67.8% of the variability in price is explained by the area of the house"; the rest comes from variables not in the model
  - Using this value directly from the output **can be misleading**
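The claim that R-squared is the square of the correlation coefficient (for a single x-variable) can be checked numerically. This is a minimal sketch on synthetic data: the areas, prices, noise level, and seed are all made up, so the resulting numbers will not match the housing output.

```python
import numpy as np

# Hypothetical data: areas and noisy prices (not the real dataset).
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200)
y = 10_000 + 350 * x + rng.normal(0, 200_000, size=200)

# Least squares fit with one explanatory variable.
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = ((y - y_hat) ** 2).sum()      # variability left unexplained by the line
sst = ((y - y.mean()) ** 2).sum()   # total variability in the response
r_squared = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]

print(f"R-squared from the fit: {r_squared:.4f}")
print(f"r squared:              {r ** 2:.4f}")
```

Going the other way, an R-squared of 0.678 with a positive slope corresponds to r of about 0.82 between price and area.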

# Multiple Linear Regression

# Logistic Regression