<img width="300px" src="images/learning-tree-logo.svg" alt="Learning Tree logo" />

# Module 1: Introduction to Machine Learning

In this module, we cover

- Working with notebooks (e.g. JupyterLab)
- Artificial Intelligence (AI) and Machine Learning (ML)
- Supervised and unsupervised learning
- Overview of regression and classification problems
- Tabular ML vs Deep Learning
- Hands-on experience with executing a notebook to perform a simple ML task

The [notebooks](https://github.com/decisionmechanics/lt539j) for the course are available on GitHub. Clone or download them to follow along.

In this notebook, we make use of the following third-party packages.

```bash
pip install jupyterlab numpy 'polars[all]' scikit-learn
```

## Working with notebooks

Notebooks are a popular tool for working with data. They allow code, text and data to be combined in flexible ways.

Popular notebook technologies/tools are

- [JupyterLab](https://jupyterlab.readthedocs.io/en/latest/)
- [Marimo](https://marimo.io/)
- [Quarto](https://quarto.org)
- [DataSpell](https://www.jetbrains.com/dataspell/)

Alternatives to notebooks are IDEs, such as

- [PyCharm](https://www.jetbrains.com/pycharm/)
- [Visual Studio Code](https://code.visualstudio.com)

Most Python IDEs can work with notebooks, so there's a crossover between the two technologies.

### Setting up a virtual environment

Virtual environments aren't essential, but they allow us to isolate projects from each other.

![Python environment](https://imgs.xkcd.com/comics/python_environment.png)

https://xkcd.com/1987

To create and activate a virtual environment in Mac/Linux use

```bash
python3 -m venv venv
source ./venv/bin/activate
```

To create and activate a virtual environment in Windows use

```bash
py -m venv venv
.\venv\Scripts\activate
```

You should see the terminal prompt change to show you are operating in a virtual environment.

Check the current version of Python.

```bash
python --version
```
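
The same check can be made from Python itself. The sketch below relies on `sys.prefix` and `sys.base_prefix` differing inside an environment created with `venv`; the helper name `in_virtual_environment` is our own, chosen for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Virtual environment active: {in_virtual_environment()}")
```

If the second line reports `False` after activation, the terminal is probably still picking up a different Python installation.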

To deactivate the virtual environment run the `deactivate` script.

```bash
deactivate
```

### JupyterLab

JupyterLab is the most popular notebook technology. As such, it's the one we will use throughout this course.

To install JupyterLab (with a virtual environment active) use

```bash
pip install jupyterlab
```

To start JupyterLab, navigate to the folder containing your notebooks (or the folder where you'd like any new notebooks to be) and run

```bash
jupyter lab
```

This starts the Jupyter server and opens JupyterLab in a web browser window.

![JupyterLab](images/module1-jupyterlab.png)

Create a virtual environment, install JupyterLab and use it to open the notebook for Module 1 (i.e. this notebook).

When you `pip install` JupyterLab, add a few additional libraries at the same time.


```bash
pip install jupyterlab numpy 'polars[all]' scikit-learn
```

### Working with cells

Notebooks are built from cells. The two most important types of cells are

- code cells (containing executable Python code)
- markdown cells (containing text, titles, images, etc.)

### Python code

Code cells are evaluated and the output is displayed below the cell. There's no need to use `print` to view results.

JupyterLab is able to format tabular data and display charts.
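
As a small illustration, the cell below ends with a bare expression. In a notebook, the value of that last expression is displayed beneath the cell; run as an ordinary script, nothing would be shown without `print`. The variable names and values are made up for this example.

```python
# Example measurements (illustrative values, not course data).
flipper_lengths_mm = [181, 186, 195, 193]

mean_flipper_length_mm = sum(flipper_lengths_mm) / len(flipper_lengths_mm)

# A bare expression on the last line of a cell is shown as the cell's output.
mean_flipper_length_mm
```

Only the last expression in a cell is displayed this way, so use `print` for anything you want to see from earlier lines.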

### Markdown

[Markdown](https://www.markdownguide.org/cheat-sheet/) is a lightweight text-formatting syntax.

It can be used to format

- headings
- links
- images
- lists
- block quotes
- code
- bold text
- italic text
- strikethrough text
- horizontal rules

### Shortcuts

JupyterLab can be controlled from the UI, but power users tend to use shortcut keys. Right-click on a cell to display a context menu complete with shortcut keys.

Common shortcut keys (used in command mode, i.e. when a cell is selected but not being edited) are

- `a`: insert cell above the selected cell
- `b`: insert cell below the selected cell
- `d`, `d`: delete the selected cell
- `m`: change to markdown cell
- `y`: change to code cell
- `z`: undo changes
- `1`-`6`: change cell to markdown heading
- `Shift`+`Enter`: execute selected cell

### Execution order

Python cells are only executed when you run them. If you run them out of order, you can get confusing results. If in doubt, run all cells from the start.
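
To see why execution order matters, the sketch below imitates a notebook kernel with a shared dictionary and a hypothetical `run_cell` helper. The three "cells" are invented for this example; the point is that re-running one cell changes state that later cells depend on.

```python
# Shared namespace standing in for the notebook kernel's memory.
kernel_namespace = {}


def run_cell(source: str) -> None:
    """Execute a cell's source code against the shared namespace."""
    exec(source, kernel_namespace)


cell_1 = "price = 100"
cell_2 = "price = price * 1.2"
cell_3 = "result = round(price)"

# Running the cells top to bottom gives the intended answer.
for cell in (cell_1, cell_2, cell_3):
    run_cell(cell)
result_in_order = kernel_namespace["result"]

# Re-running cell 2 (without re-running cell 1) applies the increase twice.
run_cell(cell_2)
run_cell(cell_3)
result_after_rerun = kernel_namespace["result"]

print(result_in_order, result_after_rerun)
```

The two results differ even though no cell was edited, which is exactly the kind of confusion that restarting and running all cells removes.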

### Marimo

Marimo is an alternative Python notebook environment to JupyterLab. Features include

- Reactive execution (cells are re-run automatically, as required)
- Notebooks are stored as Python files
- More developer-oriented experience (e.g. GitHub Copilot, autocomplete, code formatting)

To install Marimo use

```bash
pip install marimo
```

To launch Marimo, navigate to the folder containing your (Marimo) notebooks and use

```bash
marimo edit
```

![Marimo](images/module1-marimo.png)

### Containers

Containers provide a much greater degree of isolation than virtual environments, including isolation of the file system and operating system configuration.

[Docker](https://docs.docker.com) is the best-known tool for building container images and running containers.

To launch JupyterLab as a Docker container use

```bash
docker run --name jupyterlab -p 8898:8888 -d --rm quay.io/jupyter/datascience-notebook start-notebook.py --NotebookApp.token='my-token'   
```

Then browse to http://127.0.0.1:8898/lab?token=my-token.

To stop the container use

```bash
docker stop jupyterlab
```

### Installing packages

There are three ways of installing packages in JupyterLab.

1. `pip install` the packages in the virtual environment before launching JupyterLab.
2. Launch an integrated terminal in JupyterLab and use `pip install` to install the packages.
3. Execute `!pip install ...` in a notebook code cell (`!` launches a shell command).
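
Whichever route you use, it can be handy to confirm what the running kernel can actually see. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is our own, and the package list simply mirrors the packages named earlier in this module.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for package_name in ("jupyterlab", "numpy", "polars", "scikit-learn"):
    print(f"{package_name}: {installed_version(package_name) or 'not installed'}")
```

A package reported as not installed here, despite a successful `pip install`, usually means it was installed into a different environment from the one running the notebook.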

## Artificial Intelligence and Machine Learning

Artificial intelligence (AI) has been around for decades, going through many different phases. Today it's almost synonymous with Machine Learning (ML).

> A year spent in artificial intelligence is enough to make one believe in God.

-- Alan Perlis

### What's the difference between AI and ML?

![AI vs ML](images/module1-ai-vs-ml.svg)

### How would you define intelligence?

- What's your definition of "intelligence"?
- Are animals intelligent? All animals? Or only some?
- How would you determine if an alien was an intelligent lifeform?
- Do we even need a definition of intelligence?
- Are you more intelligent than modern AI?

### History of AI

![History of AI](images/module1-history-of-ai.svg)

### Definitions of AI

The term "Artificial Intelligence" was coined by John McCarthy in 1955.

> ...the science and engineering of making intelligent machines...

-- John McCarthy

> ...the science of making machines do things that would require intelligence if done by men.
    
-- Marvin Minsky, AI pioneer

> ...the science of making machines smart
                
-- Demis Hassabis, DeepMind

> ...a computerized system that exhibits behavior that is commonly thought of as requiring intelligence.

-- US Government

> ...anything a machine does to respond to its environment to maximize its chances of success.

-- Steven Struhl, author

> ...the study of agents that receive percepts from the environment and perform actions.

-- Peter Norvig and Stuart Russell, computer scientists

> ...whatever hasn’t been done yet

-- Larry Tesler, computer scientist

> When you're fundraising, it's AI  
> When you're hiring, it's ML  
> When you're specifying, it's linear regression  
> When you're implementing, it's if-then  
> When you're debugging, it's printf()  

-- Baron Schwartz

### The economics of prediction

Deep Learning (neural networks) has been around for decades. Why has AI suddenly taken off?

![The economics of prediction](images/module1-economics-of-prediction.png)

### AI as Intelligent Automation

Another way to look at AI is as automated decision-making.

![Intelligent Automation](images/module1-intelligent-automation.png)

### Artificial General Intelligence

- Displays a similar type of general intelligence to humans
  - We’ll know it when we see it
- Maybe requires consciousness/sentience
- Ability to learn
- Not tied to specific tasks
- Reacts appropriately to novel situations
- Acts based on current context and its relationship with its environment
- Stages of AI are
  - Narrow AI
  - General AI
  - Super AI
- After 80 years, we are still at the first stage

### What AI is _not_

- Alive
- Robots and androids
- An existential threat
  - No-one is building Skynet
- Independent
  - Humans are very much in the loop
- A single technology
  - It's a collection of specialisations
- Intelligent
  - At least not in the way we'd recognise intelligence in others and other species

### Where does Generative AI struggle?

Generative AI tends to struggle with tasks that are novel or involve reasoning.

Let's try asking AI some hard questions...

### Applications of AI

- Manufacturing robots
- Self-driving cars
- Smart assistants
- Healthcare management
- Automated financial investing
- Virtual travel booking agent
- Social media monitoring
- Marketing chatbots
- Language translation
- Image recognition

### Ethical considerations

> AI doesn’t have to be evil to destroy humanity---if AI has a goal and humanity just happens to come in the way, it will destroy humanity as a matter of course without even thinking about it, no hard feelings.  

-- Elon Musk

![Trolley problem](images/module1-trolley-problem.png)

https://neal.fun/absurd-trolley-problems/

## Machine Learning

In ML we use computers to learn rules/algorithms from data.

It's basically statistical correlation, or pattern matching.

It turns out that this can be very effective, when executed at scale, but it does have limitations.

![Machine Learning](https://imgs.xkcd.com/comics/machine_learning.png)

https://xkcd.com/1838

### Supervised learning

In supervised learning, the machine learns by looking at historical outcomes.

![Supervised learning](images/module1-supervised-learning.png)
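
As a minimal sketch of the idea, the code below learns a single threshold rule from labelled history and then applies it to new cases. The data is made up, and the helper names (`learn_threshold`, `predict`) are our own; real models learn far richer rules, but the pattern of "fit on past outcomes, predict on new inputs" is the same.

```python
# Labelled historical examples: (credit_score, loan_was_repaid). Made-up values.
history = [
    (420, False), (480, False), (530, False),
    (610, True), (660, True), (720, True),
]


def learn_threshold(examples):
    """Find the score threshold that classifies the most examples correctly."""
    scores = sorted(score for score, _ in examples)
    # Candidate thresholds sit midway between neighbouring scores.
    candidates = [(low + high) / 2 for low, high in zip(scores, scores[1:])]

    def correct_count(threshold):
        return sum((score >= threshold) == outcome for score, outcome in examples)

    return max(candidates, key=correct_count)


threshold = learn_threshold(history)


def predict(score):
    """Apply the learned rule to a new, unseen applicant."""
    return score >= threshold


print(threshold, predict(700), predict(450))
```

Nobody wrote the threshold by hand: it was derived from the historical outcomes, which is what distinguishes this from a hard-coded rule.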

### Regression

Regression analysis predicts numeric values (magnitudes).

![Regression](images/module1-regression.png)
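
The simplest case is fitting a straight line through the data. The sketch below implements ordinary least squares for one feature in plain Python; the `fit_line` helper and the floor-area and price figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: floor area (square metres) and sale price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 190, 230, 270]

slope, intercept = fit_line(areas, prices)

# Predict the price of a property with an area not seen in the data.
predicted_price = slope * 100 + intercept
print(slope, intercept, predicted_price)
```

The output is a magnitude (a price) rather than a category, which is the distinction between regression and classification.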

### Classification

Classification algorithms attempt to assign observations to one of a number of classes.

![Penguin scatterplot](images/module1-penguin-scatterplot.png)

In [None]:
import matplotlib.pyplot as plt
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

penguin_df = pl.read_csv("data/penguins.csv").drop_nulls()

features_df = penguin_df.select(
    ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
)

target_df = penguin_df.select("species")

SEED = 123

feature_train_df, feature_test_df, target_train_df, target_test_df = train_test_split(
    features_df, target_df, test_size=0.30, random_state=SEED
)

classifier = DecisionTreeClassifier(max_depth=3, random_state=SEED).fit(
    feature_train_df, target_train_df
)

class_names = sorted(penguin_df["species"].unique())

plt.figure(figsize=(10, 10), dpi=300)
plot_tree(
    classifier, feature_names=features_df.columns, class_names=class_names, filled=True
)
plt.show()
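
Each node in the plotted tree reports a `gini` value. Assuming scikit-learn's default `criterion="gini"` (the cell above does not override it), this is the Gini impurity of the training samples reaching that node. The sketch below computes it by hand; the helper name and the example labels are our own.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # One minus the sum of squared class proportions.
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


pure_node = ["Gentoo"] * 4
mixed_node = ["Adelie", "Adelie", "Chinstrap", "Gentoo"]

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.625
```

A value of 0 means every sample at the node belongs to one class; the tree chooses splits that reduce impurity as much as possible.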

### Unsupervised learning

In unsupervised learning, the machine looks for patterns in the data without being given labelled outcomes.

![Unsupervised learning](images/module1-unsupervised-learning.png)

### Cluster analysis

k-means clustering is a popular unsupervised learning technique. It partitions the observations into k groups by assigning each one to its nearest cluster centre.

![k-means](images/module1-k-means.png)

In [None]:
import numpy as np
from sklearn.cluster import KMeans

k_means = KMeans(3, n_init="auto", random_state=SEED).fit(features_df.to_numpy())

markers = ("o", "s", "v")
species_map = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}

for cluster_id in (0, 1, 2):
    cluster_df = penguin_df.with_columns(
        pl.Series(k_means.labels_.tolist()).alias("cluster_id"),
    ).filter(
        pl.col("cluster_id") == cluster_id,
    )

    species_ids = np.array([species_map[species] for species in cluster_df["species"]])
    marker = markers[cluster_id]
    label = f"Cluster {cluster_id}"

    plt.scatter(
        cluster_df["bill_length_mm"],
        cluster_df["bill_depth_mm"],
        c=species_ids,
        cmap="viridis",
        edgecolor=None,
        alpha=0.5,
        marker=marker,
        label=label,
    )

plt.legend()
plt.show()
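
To show what happens inside k-means, here is a minimal sketch of the assign-then-update loop on one-dimensional data. It is not scikit-learn's implementation (which, among other things, chooses starting centroids more carefully); the function name, starting centroids and data are assumptions made for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values with a fixed number of assign/update iterations."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up flipper lengths (mm) forming two obvious groups.
flipper_lengths = [180, 182, 185, 210, 214, 218]
centroids, clusters = k_means_1d(flipper_lengths, centroids=[180, 185])
print(centroids, clusters)
```

Even with two poorly placed starting centroids, the loop settles on the two natural groups after a couple of iterations. No species labels are used at any point.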

## Deep Learning

Deep learning models (neural networks) represent the state of the art in ML. They are powerful, but computationally expensive.

![Deep learning](images/module1-deep-learning.png)

## Building an automated loan approval model

Let's use a [loan approval dataset](https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset) from Kaggle to build a loan approval classifier.

We'll use a decision tree classifier.

Load and review the data.

In [None]:
loan_approval_df = pl.read_csv("data/loan-approval-dataset.csv")

loan_approval_df

The CIBIL Score is a credit score.

There are some categorical features that we'll need to convert to [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)).

In [None]:
loan_approval_with_dummies_df = loan_approval_df.to_dummies(
    ["education", "self_employed"], drop_first=True
)

loan_approval_with_dummies_df
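
To make the idea of dummy variables concrete, the sketch below one-hot encodes a single categorical column by hand. The `one_hot_encode` helper and the example values are our own; unlike a library routine, this sketch always treats the alphabetically first category as the one to drop.

```python
def one_hot_encode(values, drop_first=True):
    """Turn a list of categories into 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline: it is implied
        # whenever all remaining indicators are 0.
        categories = categories[1:]
    return {
        category: [int(value == category) for value in values]
        for category in categories
    }


education = ["Graduate", "Not Graduate", "Graduate"]

print(one_hot_encode(education))
print(one_hot_encode(education, drop_first=False))
```

With two categories, a single indicator column carries all the information, which is why dropping the first column loses nothing.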

Do we have a reasonable number of examples for each target class?

In [None]:
loan_approval_df.get_column("loan_status").value_counts()

For this simple example, we'll use all the features to predict the loan approval, apart from `loan_id` and `loan_status` (the target).

In [None]:
loan_approval_features_df = loan_approval_with_dummies_df.drop("loan_id", "loan_status")

loan_approval_target_df = loan_approval_df.select("loan_status")

SEED = 123

(
    loan_approval_feature_train_df,
    loan_approval_feature_test_df,
    loan_approval_target_train_df,
    loan_approval_target_test_df,
) = train_test_split(
    loan_approval_features_df,
    loan_approval_target_df,
    test_size=0.30,
    random_state=SEED,
)
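
Conceptually, a train/test split shuffles the rows and holds a fraction back for evaluation. The sketch below shows that idea in plain Python; it is not scikit-learn's exact procedure, and the `split_rows` helper is hypothetical.

```python
import random


def split_rows(rows, test_size=0.30, seed=123):
    """Shuffle rows reproducibly, then hold back a fraction for testing."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    test_count = round(len(shuffled) * test_size)
    return shuffled[test_count:], shuffled[:test_count]


rows = list(range(10))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=SEED` above: the same rows land in the test set every time the notebook is run.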

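Scale the features so they are on comparable scales. Fit the scaler on the training data only, then apply the same fitted transformation to the test data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

loan_approval_feature_train_scaled_df = scaler.fit_transform(
    loan_approval_feature_train_df
)

# Reuse the training mean and standard deviation; refitting on the test set
# would scale the two sets inconsistently.
loan_approval_feature_test_scaled_df = scaler.transform(
    loan_approval_feature_test_df
)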
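
Standard scaling subtracts a feature's mean and divides by its standard deviation. The sketch below does this by hand for one made-up feature, learning the statistics from the training portion and reusing them on the test portion. The helper names are our own; the population standard deviation is used because that matches what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of a feature."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardise values using previously learned statistics."""
    return [(value - mean) / std for value in values]


# Made-up incomes, split into training and test portions.
train_incomes = [20_000, 40_000, 60_000]
test_incomes = [40_000, 80_000]

# Statistics are learned from the training data only.
mean, std = fit_scaler(train_incomes)

train_scaled = transform(train_incomes, mean, std)
# The test data is transformed with the training statistics.
test_scaled = transform(test_incomes, mean, std)
print(train_scaled, test_scaled)
```

After scaling, the training values have a mean of 0, and a test income equal to the training mean maps to exactly 0.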

Generate the classifier, limiting the maximum depth of the tree to 4.

In [None]:
loan_approval_classifier = DecisionTreeClassifier(max_depth=4, random_state=SEED).fit(
    loan_approval_feature_train_scaled_df, loan_approval_target_train_df
)

In [None]:
loan_approval_class_names = sorted(loan_approval_df["loan_status"].unique())

plt.figure(figsize=(10, 10), dpi=300)
plot_tree(
    loan_approval_classifier,
    feature_names=loan_approval_features_df.columns,
    class_names=loan_approval_class_names,
    filled=True,
)
plt.show()

Evaluate the model.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

loan_approval_predicted = loan_approval_classifier.predict(
    loan_approval_feature_test_scaled_df
)

In [None]:
accuracy_score(loan_approval_target_test_df, loan_approval_predicted)

In [None]:
print(classification_report(loan_approval_target_test_df, loan_approval_predicted))
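
The report above includes precision and recall alongside accuracy. As a rough guide to what those numbers mean, the sketch below computes all three by hand for one class. The labels are made up (they are not results from the loan model), the `evaluate` helper is our own, and it assumes the positive class appears at least once in both lists.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(a == positive and p == positive for a, p in pairs)
    false_positives = sum(a != positive and p == positive for a, p in pairs)
    false_negatives = sum(a == positive and p != positive for a, p in pairs)

    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Precision: of the cases predicted positive, how many really were?
    precision = true_positives / (true_positives + false_positives)
    # Recall: of the truly positive cases, how many did we find?
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, precision, recall


# Made-up labels, purely for illustration.
actual = ["Approved", "Approved", "Rejected", "Rejected", "Approved"]
predicted = ["Approved", "Rejected", "Rejected", "Approved", "Approved"]

accuracy, precision, recall = evaluate(actual, predicted, positive="Approved")
print(accuracy, precision, recall)
```

Accuracy alone can mislead when one class dominates, which is why it is worth checking the class counts and the per-class figures as well.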