# Class 14: Introduction to Machine Learning 1 — General

1. Come in. Sit down. Open Teams.
2. Make sure your notebook from last class is saved.
3. Open up the Jupyter Lab server.
4. Open up the Jupyter Lab terminal.
5. Activate Conda: `module load anaconda3/2022.05`
6. Activate the shared virtual environment: `source activate /courses/PHYS7332.202510/shared/phys7332-env/`
7. Run `python3 git_fixer2.py`
8. GitHub:
    - git status (figure out what files have changed)
    - git add ... (add the file that you changed, aka the `_MODIFIED` one(s))
    - git commit -m "your changes"
    - git push origin main

## Goals of today's class:
1. Get a sense of the kinds of tasks machine learning can be used for
2. Go through a quick overview of machine learning solutions; test out some for ourselves on real data.
3. Discuss how we can evaluate the efficacy of our machine learning solutions.

## What is machine learning?
(This chapter borrows heavily from Chapter 16 of Bagrow & Ahn)

Machine learning, in the broadest sense, refers to building models that can predict things; we build these models using a bunch of math and a lot of data. In Class 14 and Class 15, we'll be going over "regular" (i.e. non-network) ML and "network" ML (i.e. ML on graphs) techniques, respectively.

At this point (2024), machine learning algorithms are pretty ubiquitous. Computing, including GPU computing, is cheap enough that even mildly successful ML algorithms can make a significant difference in a company's revenue compared to baseline (although the idea that there is an ML-free baseline is increasingly laughable). You may have noticed ML at work when you get uncannily targeted ads on social media (or worse, weirdly damning ones -- why, oh why, does Facebook think I need this giant acrylic crab table?). 

![crab ad 1](images/crab_ad.jpeg)
![crab ad 2](images/crab_ad_2.jpeg)

You'll also see ML in face/fingerprint ID technology on your phones, text prediction in any application where you produce text, and Microsoft Excel trying to parse everything as a date (or worse, extrapolate from your data).

![maruary](images/maruary.jpg)
([source](https://www.reddit.com/r/tumblr/comments/g6vr3h/maruary/))


### Unsupervised vs. Supervised ML
Let's break down some of the common tasks ML can do. The first division we'll introduce will be **supervised** versus **unsupervised** machine learning. With **supervised** ML, we have a bunch of data points and the ground truth corresponding to each of them. For example, I might have a bunch of data on students' academic work as well as their final grades. Or I might have data on customers' browsing behavior along with information on what they bought/didn't buy. In **unsupervised** machine learning, I don't have that ground truth "answer" available, or I choose not to use it. Instead, I use a different set of algorithms to find patterns or commonalities within the dataset. 

## Supervised ML Problems
### Regression vs. Classification
Regression refers to the task of predicting a continuous/numeric output, while classification means we're predicting a discrete output. 

When we do machine learning for either case, we often want to think about a **loss function**, which measures how far our model's predictions are from the data we're using to train it (lower loss means a better fit). Loss functions are usually differentiable, so we can minimize them algorithmically (for example, by following their gradient) given training data, but some are messier to deal with than others.

We've seen a linear regression before: given a matrix of data points **X** and a vector of dependent variables **y**, we want to find the best coefficient vector $\beta$ such that $y = \beta X + \epsilon$, where $\epsilon$ is a random error term. Formally, if we're using mean squared error, the loss function for such a regression is $\frac{1}{N} \sum_{i=1}^{N}(y_i - \beta x_i)^2$, where $\beta x_i$ is the model's prediction for data point $x_i$.
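
To make the mean squared error loss concrete, here is a minimal sketch in plain Python for a single feature. The toy data and the helper names `mse` and `fit_line` are made up for illustration, and the intercept is written out explicitly rather than folded into $\beta$.

```python
# Toy data, invented for illustration: y is roughly 2 * x + 1 plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def mse(beta, intercept, xs, ys):
    """Mean squared error of the line y = beta * x + intercept."""
    residuals = [y - (beta * x + intercept) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    beta = cov / var
    return beta, y_mean - beta * x_mean


beta_hat, intercept_hat = fit_line(xs, ys)
print("fitted slope and intercept:", beta_hat, intercept_hat)
print("MSE at the fitted line:", mse(beta_hat, intercept_hat, xs, ys))
print("MSE at a hand-picked line (2, 1):", mse(2.0, 1.0, xs, ys))
```

The fitted line can never have a larger mean squared error on this data than any hand-picked line, which is exactly what "minimizing the loss" means here.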

We can also run a **logistic regression**, which works well for classification tasks. Logistic regressions are very similar to linear regressions, but instead of fitting to some dependent variable $y$ as a linear combination of its independent variables $X$, we model the **log-odds** of some event as the linear combination of independent variables. The log-odds (or "logit") of an event with probability $p$ is defined as $\log\frac{p}{1-p}$. This function neatly maps probabilities, which range from 0 to 1, onto the whole range of real numbers. It is also the inverse of the **sigmoid** function, which is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. For each data point, we relate the log-odds to a linear regression problem once again:
$$\log\frac{p}{1-p} = \beta x_i.$$

We define the model's calculated probability for a data point $x_i$, using $\beta$ as our coefficient vector once again, to be 
$$\hat{p}(x_i) = \frac{e^{\beta x_i}}{1 + e^{\beta x_i}}.$$
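
As a quick sanity check on the claim that the logit and the sigmoid are inverses, here is a small sketch using only the standard library. The function names `logit` and `sigmoid` and the sample probabilities are chosen for illustration.

```python
import math


def logit(p):
    """Log-odds of an event with probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def sigmoid(x):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))


for p in (0.1, 0.5, 0.9):
    z = logit(p)
    # sigmoid(logit(p)) should recover p, up to floating-point error
    print(f"p={p}  logit={z:.4f}  sigmoid(logit)={sigmoid(z):.4f}")

# The two common ways of writing the sigmoid agree:
z = 1.5
print(sigmoid(z), math.exp(z) / (1 + math.exp(z)))
```

Applying `sigmoid` to a linear score $\beta x_i$ is how a logistic regression turns that score into a predicted probability.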


Here, we use a log-likelihood or cross-entropy loss. The cross-entropy between two distributions $p$ and $q$ is defined as 
$$H(p, q) = -\sum_{x} p(x) \log q(x).$$

Cross-entropy is useful for quantifying how much two distributions diverge, or how "surprising" an estimate $q(x)$ is given the "ground truth" distribution $p$. For binary labels $y_i \in \{0, 1\}$, the log-likelihood of our model's calculated probabilities is

$$ l = \sum_{i=1}^{N} y_i \log\hat{p}(x_i) + \sum_{i=1}^{N} (1 - y_i)\log(1-\hat{p}(x_i)).$$

The cross-entropy loss is the negative of this quantity, $-l$, so maximizing the log-likelihood and minimizing the cross-entropy loss are the same thing.
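
A small sketch of how the sum above behaves, using made-up labels and predicted probabilities (none of these numbers come from the datasets below). The helper `log_likelihood` and the three sets of predictions are hypothetical, chosen to show that confident wrong answers are penalized much more than hedged ones.

```python
import math


def log_likelihood(y_true, p_hat):
    """Sum of y*log(p) + (1-y)*log(1-p) over all data points."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_hat)
    )


y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]  # high probability on the true class
hedging = [0.5, 0.5, 0.5, 0.5]          # no commitment either way
confident_wrong = [0.1, 0.9, 0.1, 0.9]  # high probability on the wrong class

for name, p_hat in [
    ("confident and right", confident_right),
    ("hedging", hedging),
    ("confident and wrong", confident_wrong),
]:
    # The loss we would minimize is the negative log-likelihood.
    print(name, -log_likelihood(y_true, p_hat))
```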


Now let's try loading two datasets, one for classification and one for regression, and trying out logistic and linear regression with `scikit-learn` in Python.

In [135]:
import pandas as pd

df_water = pd.read_csv('data/water_potability.csv')
df_housing = pd.read_csv('data/vietnam_housing_dataset.csv')

In [136]:
print(df_water.Potability.mean()) # how often is our water potable to begin with?
df_water.head(5)

0.3901098901098901


| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


In [151]:
df_water = df_water.dropna() # drop every row that has at least one NaN value

# make a numpy matrix of independent variables
x_water = df_water.copy()
del x_water['Potability']

x_water = x_water.to_numpy()
# make our y-values
y_water = df_water['Potability'].to_numpy().astype('float')

In [138]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_water_scaled = sc.fit_transform(x_water)
reg = LogisticRegression(random_state=0).fit(x_water_scaled, y_water)
reg.score(x_water_scaled, y_water) # mean accuracy, measured on the same data we trained on

0.5977125808055693

In [139]:
from sklearn.metrics import confusion_matrix

# predictions are passed first here, so rows are predicted classes and columns are true classes
confusion_matrix(reg.predict(x_water_scaled), y_water)

array([[1198,  807],
       [   2,    4]])
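
To see what this confusion matrix says about the classifier, here is a short sketch that recomputes a few summary numbers from the four counts printed above. Because the predictions were passed as the first argument, rows are read as predicted classes and columns as true classes; the variable names are made up for illustration.

```python
# Counts copied from the confusion matrix printed above.
# Rows: predicted class (0, 1). Columns: true class (0, 1).
pred0_true0, pred0_true1 = 1198, 807
pred1_true0, pred1_true1 = 2, 4

total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1

# Fraction of all samples that were classified correctly.
accuracy = (pred0_true0 + pred1_true1) / total

# Accuracy of a "model" that always predicts class 0 (not potable).
true0 = pred0_true0 + pred1_true0
majority_baseline = true0 / total

# Of the samples that really are potable, how many did we catch?
recall_potable = pred1_true1 / (pred0_true1 + pred1_true1)

# Of the samples we called potable, how many really were?
precision_potable = pred1_true1 / (pred1_true0 + pred1_true1)

print("accuracy:", accuracy)
print("always-predict-0 baseline:", majority_baseline)
print("recall on potable water:", recall_potable)
print("precision on potable water:", precision_potable)
```

The accuracy matches the score reported by `reg.score`, but it is only slightly above the always-predict-0 baseline, and the model labels almost nothing as potable. This is one reason to look past a single accuracy number when evaluating a classifier.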

In [145]:
# replace missing values with -1 as a placeholder
df_housing = df_housing.fillna(-1)
# one-hot encode the categorical columns
dummies_balcony = pd.get_dummies(df_housing['Balcony direction'], prefix='balcony')
dummies_legal = pd.get_dummies(df_housing['Legal status'], prefix='legal')
dummies_furniture = pd.get_dummies(df_housing['Furniture state'], prefix='furniture')
dummies_direction = pd.get_dummies(df_housing['House direction'], prefix='direction')
# note: dummies_direction is not part of the concat below, so house direction is not used as a feature
df_housing = pd.concat((df_housing, dummies_balcony, dummies_legal, dummies_furniture), axis=1)

x_housing = df_housing.copy()
for col in ['Address', 'Legal status', 'House direction', 'Balcony direction', 'Furniture state', 'Price']:
    del x_housing[col]

x_housing = x_housing.to_numpy()
y_housing = df_housing['Price'].to_numpy()

In [147]:
from sklearn.linear_model import LinearRegression

sc = StandardScaler()

x_housing = sc.fit_transform(x_housing)
reg = LinearRegression().fit(x_housing, y_housing)
reg.score(x_housing, y_housing) # for regressors, .score returns R^2 (here on the training data)

0.11892188800638648

### Decision Trees
There are other ways to solve these types of problems, and we'll talk about two of them today: decision trees/random forests and neural networks.



In [154]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(x_water, y_water)
print(clf.score(x_water, y_water)) # accuracy on the same data the tree was fit on
preds = clf.predict(x_water)
confusion_matrix(preds, y_water)

1.0


array([[1200,    0],
       [   0,  811]])
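
A perfect score on the rows the tree was fit on does not tell us how it will do on water samples it has never seen. One common way to check is to hold out part of the data. The sketch below uses a synthetic dataset from `make_classification` as a stand-in (it is not the water data), and the split size, noise level, and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 9 features, with 30% of labels randomly reassigned
# so that the classes are noisy and cannot be separated perfectly.
X, y = make_classification(
    n_samples=2000,
    n_features=9,
    n_informative=4,
    flip_y=0.3,
    random_state=0,
)

# Keep 25% of the rows aside; the tree never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf_demo = DecisionTreeClassifier(random_state=0)
clf_demo.fit(X_train, y_train)

train_acc = clf_demo.score(X_train, y_train)
test_acc = clf_demo.score(X_test, y_test)
print("accuracy on training rows:", train_acc)
print("accuracy on held-out rows:", test_acc)
```

An unrestricted tree can keep splitting until it memorizes the training rows, so the training accuracy says little by itself; the held-out accuracy is the more honest estimate. The same split-then-score pattern could be applied to `x_water` and `y_water`.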

In [155]:
from sklearn.tree import export_text

export_text(clf)

'|--- feature_4 <= 260.92\n|   |--- feature_2 <= 18346.62\n|   |   |--- feature_0 <= 7.95\n|   |   |   |--- feature_4 <= 229.41\n|   |   |   |   |--- feature_0 <= 6.59\n|   |   |   |   |   |--- class: 0.0\n|   |   |   |   |--- feature_0 >  6.59\n|   |   |   |   |   |--- class: 1.0\n|   |   |   |--- feature_4 >  229.41\n|   |   |   |   |--- class: 0.0\n|   |   |--- feature_0 >  7.95\n|   |   |   |--- class: 1.0\n|   |--- feature_2 >  18346.62\n|   |   |--- feature_0 <= 6.06\n|   |   |   |--- feature_3 <= 4.72\n|   |   |   |   |--- class: 1.0\n|   |   |   |--- feature_3 >  4.72\n|   |   |   |   |--- feature_3 <= 9.20\n|   |   |   |   |   |--- class: 0.0\n|   |   |   |   |--- feature_3 >  9.20\n|   |   |   |   |   |--- class: 1.0\n|   |   |--- feature_0 >  6.06\n|   |   |   |--- feature_5 <= 555.35\n|   |   |   |   |--- feature_0 <= 11.53\n|   |   |   |   |   |--- class: 1.0\n|   |   |   |   |--- feature_0 >  11.53\n|   |   |   |   |   |--- class: 0.0\n|   |   |   |--- feature_5 >  555.35

In [157]:
from sklearn.tree import DecisionTreeRegressor

clf = DecisionTreeRegressor()
clf.fit(x_housing, y_housing)
print(clf.score(x_housing, y_housing)) # R^2 on the same data the tree was fit on

0.9366664364650129


In [158]:
from sklearn.tree import export_text

export_text(clf)

'|--- feature_4 <= 0.37\n|   |--- feature_4 <= -1.32\n|   |   |--- feature_0 <= -0.71\n|   |   |   |--- feature_0 <= -1.01\n|   |   |   |   |--- feature_0 <= -1.09\n|   |   |   |   |   |--- feature_1 <= 0.72\n|   |   |   |   |   |   |--- feature_19 <= -0.83\n|   |   |   |   |   |   |   |--- feature_1 <= -0.39\n|   |   |   |   |   |   |   |   |--- feature_0 <= -1.20\n|   |   |   |   |   |   |   |   |   |--- value: [1.22]\n|   |   |   |   |   |   |   |   |--- feature_0 >  -1.20\n|   |   |   |   |   |   |   |   |   |--- feature_0 <= -1.19\n|   |   |   |   |   |   |   |   |   |   |--- value: [1.95]\n|   |   |   |   |   |   |   |   |   |--- feature_0 >  -1.19\n|   |   |   |   |   |   |   |   |   |   |--- feature_3 <= -0.47\n|   |   |   |   |   |   |   |   |   |   |   |--- value: [1.80]\n|   |   |   |   |   |   |   |   |   |   |--- feature_3 >  -0.47\n|   |   |   |   |   |   |   |   |   |   |   |--- value: [1.80]\n|   |   |   |   |   |   |   |--- feature_1 >  -0.39\n|   |   |   |   |   |   |

### Neural Networks






## Unsupervised ML & Friends

### Embeddings
### Clustering