<img width="300px" src="images/learning-tree-logo.svg" alt="Learning Tree logo" />

# Module 2: Python for Machine Learning

In this module, we cover

- Introduction to the Python programming language
- Basics of Python programming
- Popular data science libraries
- Hands-on exploration of a realistic dataset

The [notebooks](https://github.com/decisionmechanics/lt539j) for the course are available on GitHub. Clone or download them to follow along.

In this notebook, we make use of the following third-party packages.

```bash
pip install jupyterlab numpy 'polars[all]' ydata-profiling
```

## Variables and data types

Python is a strongly, dynamically typed language. There's no need to declare the types of variables before you use them.

In [None]:
meaning_of_life = 42  # int
pi = 3.14  # float
capital_of_france = "Paris"  # str
two_is_even = True  # bool

Variables can be reassigned---even to values of different types.

In [None]:
meaning_of_life = "Living each day to the max!"
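As a minimal sketch of what "strongly, dynamically typed" means in practice: a name can be rebound to a value of a different type, but Python won't silently coerce mismatched types inside an expression. The variable names here are illustrative only.

```python
# Dynamic typing: the same name can refer to values of different types over time.
answer = 42
type_before = type(answer).__name__  # 'int'

answer = "forty-two"
type_after = type(answer).__name__  # 'str'

# Strong typing: mixing incompatible types raises an error rather than guessing.
try:
    combined = "1" + 1
except TypeError:
    combined = "1" + str(1)  # convert explicitly instead

print(type_before, type_after, combined)
```

The explicit `str(1)` conversion is what makes the second attempt succeed.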

Values can be printed.

In [None]:
print(meaning_of_life)
print(meaning_of_life, pi)
print("The meaning of life is:", meaning_of_life)

When working with notebooks, the results of evaluating expressions (such as individual variables) are automatically displayed, so you can often avoid using `print` explicitly.

This is particularly useful when working with tables and charts.

In [None]:
meaning_of_life

## Expressions and operators

Python supports the standard arithmetic operators on integers and floating-point values.

In [None]:
a = 3
b = 4

a + b  # 7
b / a  # 1.333...
a**b  # 81
b // a  # 1
b % a  # 1

Python also supports a number of assignment operators.

In [None]:
a = 1  # 1
a += 2  # 3
a *= 3  # 9
a -= 4  # 5
a //= 2  # 2
a **= 2  # 4
a

Operations on integers return integers (apart from true division). Operations on floating-point values return floating-point values.

In [None]:
m = 2.5
n = 1.5

m + n  # 4.0

We can convert between types.

In [None]:
int(4.0)
float(4)
str(4)
float("4.5")

The built-in `math` module provides more capabilities, such as trigonometric functions.

In [None]:
import math

math.sqrt(2)  # 1.414...
math.factorial(6)  # 720
math.sin(30.0 * math.pi / 180.0)  # ~0.5

Comparison operators create expressions that yield Boolean values. 

In [None]:
a = 1
b = 2
c = 3

a == b  # False
a != b  # True
a < b  # True
a >= b  # False
a < b < c  # True

d = False

not d  # True

Logical operators can be used to form more complex expressions.

In [None]:
age = 17
wallet_contains = 10

buy_beer = age >= 18 and wallet_contains >= 7
obtain_ticket = age <= 12 or wallet_contains >= 10

There are also a number of operations available for strings.

In [None]:
"1" + "1"  # '11'
"1" * 4  # '1111'
"abcdefghij"[2:4]  # 'cd'
"abcdefghij"[:5]  # 'abcde'
"abcdefghij"[5:]  # 'fghij'
"abcdefghij"[-1]  # 'j'
"abcdefghij"[::-1]  # 'jihgfedcba'
" 1 ".strip()  # '1'

There's a built-in `datetime` module for working with dates/times.

In [None]:
import datetime

xmas_2024 = datetime.datetime(2024, 12, 25)
now = datetime.datetime.now()

(xmas_2024 - now).days
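The cell above depends on the current time, so its result changes from run to run. The sketch below uses two fixed dates to show the same subtraction reproducibly, along with adding a `timedelta` and formatting a date as text. The variable names are illustrative.

```python
import datetime

# Fixed dates keep the result reproducible (datetime.now() changes on every run).
start = datetime.date(2024, 1, 1)
xmas = datetime.date(2024, 12, 25)

days_until_xmas = (xmas - start).days  # subtracting two dates gives a timedelta
week_later = start + datetime.timedelta(days=7)  # a timedelta can be added to a date
label = xmas.strftime("%Y-%m-%d")  # format a date as text

print(days_until_xmas, week_later.isoformat(), label)
```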

## Collections

Python has powerful built-in collection types.

- Lists
- Tuples
- Sets
- Dictionaries

### Lists

Lists are the workhorses of Python data structures. They are similar to arrays in other programming languages.

In [None]:
primes = [2, 3, 5, 7, 11]
primes.append(13)
primes += [17]
primes

Lists can be sliced.

In [None]:
primes[:2]  # [2, 3]
primes[-2:]  # [13, 17]

We can also use sequence unpacking (destructuring, pattern matching).

In [None]:
[first_prime, second_prime, *other_primes, last_prime] = primes

print(first_prime, second_prime, last_prime)

Lists are mutable.

In [None]:
odd_primes = list(primes)
odd_primes[0] = 0
odd_primes.remove(0)
odd_primes

Lists can be heterogeneous.

In [None]:
[
    42,
    "cat",
    datetime.datetime.now(),
    3.14,
    False,
    [2, 3, 5, 7, 11],
]

### Tuples

Tuples are similar to lists, but immutable.

In [None]:
primes = (2, 3, 5, 7, 11)
# primes[0] = 0  # uncommenting this line raises a TypeError

### Sets

Sets allow us to perform set-theoretic operations.

In [None]:
mammals = {"cat", "dog", "platypus", "kangaroo", "echidna", "dog"}
mammals

Note that duplicates are removed: an element is either in a set or it isn't.

In [None]:
egg_layers = {"duck", "echidna", "crocodile", "platypus", "sea turtle"}

egg_laying_mammals = mammals & egg_layers
animals = mammals | egg_layers
live_birth_mammals = mammals - egg_layers
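The cell above doesn't display its results, so here is a small sketch that prints them. It repeats the two sets so that it runs on its own, and sorts each result because sets have no defined order.

```python
mammals = {"cat", "dog", "platypus", "kangaroo", "echidna"}
egg_layers = {"duck", "echidna", "crocodile", "platypus", "sea turtle"}

# Sets are unordered, so sort them to get a stable display order.
print(sorted(mammals & egg_layers))  # intersection: members of both sets
print(sorted(mammals - egg_layers))  # difference: mammals that don't lay eggs
print(len(mammals | egg_layers))  # union: number of distinct animals
print("cat" in mammals)  # membership test
```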

### Dictionaries

Dictionaries are associative arrays---also known as hash maps.

You add and retrieve values via custom keys.

In [None]:
capitals = {
    "France": "Paris",
    "Germany": "Berlin",
    "Italy": "Rome",
    "Spain": "Madrid",
    "UK": "London",
}

capitals.get("France")
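The following sketch shows the other common dictionary operations: adding an entry, looking up with and without a default, testing for a key and deleting an entry. It uses a separate, illustrative dictionary (`more_capitals`) so the `capitals` dictionary above is left unchanged.

```python
more_capitals = {"Portugal": "Lisbon", "Austria": "Vienna"}

more_capitals["Ireland"] = "Dublin"  # add a new entry
lisbon = more_capitals["Portugal"]  # square brackets raise KeyError if the key is missing
unknown = more_capitals.get("Atlantis", "unknown")  # .get returns a default instead
has_austria = "Austria" in more_capitals  # membership tests check the keys
del more_capitals["Austria"]  # remove an entry

print(lisbon, unknown, has_austria, sorted(more_capitals))
```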

We can get a list of the keys and/or values from a dictionary.

In [None]:
keys = list(capitals.keys())
values = list(capitals.values())

keys, values

## Control structures

The two major types of control structures are decision-making statements and loops.

Decision-making statements are if/then and match statements.

Loops are for and while loops.

Note that Python uses whitespace semantically. It's important to ensure you use it consistently.

### If/Then

In [None]:
age = 21

if age < 1:
    print("You are a baby")
elif age < 5:
    print("You are a toddler")
elif age < 18:
    print("You are a child")
else:
    print("You are an adult")

### Match

In [None]:
iso_country_code = "GBR"

match iso_country_code:
    case "DEU":
        print("Germany")
    case "ESP":
        print("Spain")
    case "FRA":
        print("France")
    case "GBR":
        print("United Kingdom")
    case "ITA":
        print("Italy")

The `match` statement is very powerful. The above example is only its most basic use. 

### For loops

In [None]:
for prime in primes:
    print(prime**2)

In [None]:
for country, capital in capitals.items():
    print(f"The capital of {country} is {capital}")

In [None]:
for i in range(10, 0, -1):
    print(f"{i}...")

print("Blast off!")

### List comprehensions

List comprehensions are syntactic sugar for simple for loops that return values.

In [None]:
primes_squared = [prime**2 for prime in primes]

primes_squared

Python also supports set and dictionary comprehensions.
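A short sketch of both forms, plus a list comprehension with a filter condition. The `words` list is made up for illustration.

```python
words = ["python", "polars", "numpy", "pandas", "python"]

unique_lengths = {len(word) for word in words}  # set comprehension: duplicates collapse
length_by_word = {word: len(word) for word in words}  # dictionary comprehension
long_words = [word for word in words if len(word) > 5]  # list comprehension with a filter

print(unique_lengths, length_by_word, long_words)
```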

### While loops

In [None]:
import random

while (number := random.randint(1, 10)) != 10:
    print(number)

We can also skip a cycle or terminate a loop early.

In [None]:
for i in range(1, 11):
    if i == 2:
        continue

    if i == 7:
        break

    print(i)

## Functions

Functions allow us to package statements up into named units, making the code more manageable and facilitating reuse.

In [None]:
def print_greeting(name="Chris"):
    print(f"Hi, {name}")


print_greeting("Alex")
print_greeting()

Functions can return values.

In [None]:
def fahrenheit_to_celsius(deg_f):
    return (deg_f - 32) * 5 / 9


fahrenheit_to_celsius(32), fahrenheit_to_celsius(212)

## Working with files

There are many ways of working with files. The simplest approach is usually to read the entire file into memory and then process it.

In [None]:
with open("data/shark-incidents.csv") as f:
    rows = f.readlines()

len(rows)

`with` allows us to use a context manager, which will close the file automatically when the block finishes. This is a best practice.

Similarly, we can write data out to a file.

In [None]:
data = "\n".join(str(prime) for prime in primes)

with open("temp/primes.txt", "w") as f:
    f.write(data)

data
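The sketch below is a self-contained round trip that doesn't rely on the `data` or `temp` folders: it writes a file to a temporary directory, confirms that the `with` block closed it, and reads the values back one line at a time. The file name and variable names are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "primes.txt"

    with open(path, "w") as f:
        f.write("2\n3\n5\n7\n11\n")

    closed_after_write = f.closed  # True: the file was closed on leaving the block

    with open(path) as f:
        # Iterating over a file yields one line at a time, without loading it all.
        numbers_read = [int(line) for line in f]

print(closed_after_write, numbers_read)
```

Reading line by line is a reasonable alternative to `readlines()` when a file is too large to hold in memory comfortably.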

## Importing and installing packages

Python comes with a large standard library of modules (e.g. `math`, `random`) that are available without installing anything.

You can also install third-party libraries from [PyPI](https://pypi.org) using `pip`.

```bash
pip install numpy
```

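You need to be cautious when installing third-party packages. Anyone can publish to PyPI, so only install well-known packages. Double-check the spelling of package names: malicious packages are sometimes published under names that are common misspellings of popular ones.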

Once a third-party package is installed, you can then import it to use it.

In [None]:
import numpy as np

You can also import individual items from a package.

In [None]:
from random import randint

randint(1, 10)

## Vectorised calculations using NumPy

Python is a relatively slow programming language. It compensates for this by having an ecosystem of highly optimised libraries for performing key tasks.

One very popular library is [NumPy](https://numpy.org). NumPy makes it easy to work with large, multi-dimensional arrays and matrices. It's vectorised, meaning that you can perform operations on entire vectors or matrices using a single operation.

To install NumPy use

```bash
pip install numpy
```

It must be imported to use it.

In [None]:
import numpy as np

In [None]:
primes = np.array([2, 3, 5, 7, 11])

primes**2

In [None]:
a = np.arange(1, 10)
m = np.reshape(a, (3, -1))

m + 10
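To make "vectorised" more concrete, the sketch below computes the same squares with a plain Python list comprehension and with a single NumPy expression. It also adds a row to every row of a matrix, which NumPy calls broadcasting, and totals each column. The variable names are illustrative.

```python
import numpy as np

numbers = list(range(1, 6))
squares_loop = [number**2 for number in numbers]  # pure Python: one element at a time

array = np.array(numbers)
squares_vectorised = array**2  # NumPy: the whole array in one operation

matrix = np.arange(1, 7).reshape(2, 3)  # [[1, 2, 3], [4, 5, 6]]
shifted = matrix + np.array([10, 20, 30])  # broadcasting: the row is added to each matrix row
column_totals = matrix.sum(axis=0)  # sum down each column

print(squares_vectorised, shifted, column_totals, sep="\n")
```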

NumPy has many functions. Some of the most useful allow us to sample random numbers from known distributions. These can be used to power complex simulations.

In [None]:
np.random.normal(loc=100, scale=16, size=10)
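As a small sketch of a simulation, the code below draws a large sample from the same distribution and estimates the share of values more than two standard deviations above the mean. It uses `np.random.default_rng` with a seed, rather than `np.random.normal`, so that the run is reproducible; the seed value and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeding makes the run reproducible
scores = rng.normal(loc=100, scale=16, size=100_000)

# Comparing an array with a number gives an array of Booleans;
# its mean is the fraction of True values.
share_above_132 = (scores > 132).mean()

print(round(scores.mean(), 1), round(share_above_132, 3))
```

For a normal distribution the share should come out at a little over 2%, and the estimate gets closer to that as the sample size grows.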

## Working with tabular data using Polars

Polars is a library for working with tabular data---such as Excel worksheets or CSV files. Tabular data is very common in data analysis and ML work.

Pandas is another popular Python library for working with tabular data. Polars is typically faster, can work with larger data files and has a more consistent API.

The Python ecosystem is increasingly adding support for Polars. If you find a library that requires a Pandas dataframe, you can convert back and forth between Polars and Pandas with a single method (`to_pandas()` in one direction, `pl.from_pandas()` in the other).

To install Polars use

```bash
pip install "polars[all]"
```

It must be imported to use it.

In [None]:
import polars as pl

### Reading data

### Data wrangling

There are six data manipulation verbs we need to master to be able to perform basic data wrangling.

- Select (columns/features)
- Filter (rows/observations)
- Sort (rows/observations)
- Mutate (columns/features)
- Aggregate (rows/observations)
- Group (rows/observations)

We'll illustrate these verbs using GDP data from the World Bank.

In [None]:
gdp_df = pl.read_csv("data/world-bank-gdp.csv")

gdp_df

### Select

Select the country name, year and GDP columns.

In [None]:
gdp_df.select(["country_name", "year", "gdp"])

### Filter

Get the GDPs for 2023.

In [None]:
gdp_df.filter(pl.col("year") == 2023, pl.col("gdp").is_not_null())

### Sort

Sort in order of GDP (for 2023).

In [None]:
(gdp_df.filter(pl.col("year") == 2023, pl.col("gdp").is_not_null()).sort("gdp"))

### Mutate

Calculate GDP per capita (for 2023).

In [None]:
gdp_per_capita_2023_df = gdp_df.filter(
    pl.col("year") == 2023, pl.col("gdp").is_not_null()
).with_columns((pl.col("gdp") / pl.col("population")).alias("gdp_per_capita"))

gdp_per_capita_2023_df.sort("gdp_per_capita")

### Aggregate

Calculate the global average GDP per capita (for 2023).

In [None]:
(gdp_per_capita_2023_df.select("gdp_per_capita").mean())

### Group

Calculate average GDP per capita (for 2023) _per region_.

In [None]:
(
    gdp_per_capita_2023_df.group_by("region")
    .agg(pl.col("gdp_per_capita").mean())
    .sort("gdp_per_capita")
)

### Styling tables

Styling Polars tables is done using a [Great Tables](https://posit-dev.github.io/great-tables/articles/intro.html) object.

In [None]:
import polars.selectors as cs
from great_tables import loc, style

(
    gdp_per_capita_2023_df.sample(n=20)
    .style.tab_header(title="GDP by country", subtitle="Based on 2023 values")
    .tab_spanner("Income", cs.starts_with("gdp"))
    .cols_label(gdp="GDP", gdp_per_capita="GDP/capita")
    .tab_stub(rowname_col="country_name")
    .fmt_number(["gdp", "gdp_per_capita"], decimals=0)
    .tab_style(
        style.fill("yellow"),
        loc.body(
            rows=pl.col("gdp_per_capita") > 30_000,
        ),
    )
    .tab_style(
        style.text(weight="bold"),
        loc.body(columns="country_code"),
    )
    .tab_source_note(source_note="This is only a subset of the countries.")
    .cols_label(
        country_code="Country code",
        region="Region",
        income_group="Income group",
        year="Year",
        population="Population",
    )
)

## Data visualisation

Polars has built-in support for creating [Altair](https://altair-viz.github.io/) charts.

You can also create charts using [Matplotlib](https://matplotlib.org/) or [Plotly](https://plotly.com/python/) from Polars dataframes.

There are dozens of types of charts we can generate using these libraries. For now, we'll demonstrate four common types.

- Bar chart
- Pie chart
- Scatterplot
- Line/Time series chart

Import the `altair` package (installed as part of `polars[all]`) to customise the charts.

In [None]:
import altair as alt

### Bar chart

What's the average GDP per region?

In [None]:
(
    gdp_df.filter(pl.col("year") == 2023, pl.col("gdp").is_not_null())
    .sort("gdp")
    .group_by("region")
    .agg(pl.col("gdp").mean())
    .sort("gdp", descending=True)
    .plot.bar(
        x="gdp",
        y=alt.Y("region", sort=None),
    )
)

### Pie chart

What’s the average GDP per region (removing North America as an outlier)?

In [None]:
(
    gdp_df.filter(
        pl.col("year") == 2023,
        pl.col("gdp").is_not_null(),
        pl.col("region") != "North America",
    )
    .sort("gdp")
    .group_by("region")
    .agg(pl.col("gdp").mean())
    .sort("gdp")
    .plot.arc(
        theta="gdp",
        color="region",
        order=alt.Order("gdp", sort="descending"),
    )
)

### Scatterplot

What's the relationship between population and GDP (for the UK)?

In [None]:
(
    gdp_df.filter(pl.col("country_code") == "GBR")
    .plot.point(
        x=alt.X("population").scale(zero=False),
        y="gdp",
    )
    .properties(
        width=500,
    )
)

### Line/Time series chart

How has the GDP of the UK changed over time?

In [None]:
(
    gdp_df.filter(pl.col("country_code") == "GBR")
    .plot.line(
        x="year",
        y="gdp",
    )
    .properties(
        width=500,
    )
)

## Reviewing data quality

We can obtain a data quality report.

In [None]:
from ydata_profiling import ProfileReport

report = ProfileReport(df=gdp_df.to_pandas(), title="GDP data profile")
report.to_file("temp/gdp_profile_report.html")

## Wrangling shark incident data

Let's explore the shark incident data.

In [None]:
shark_incident_df = pl.read_csv(
    "data/shark-incidents.csv", infer_schema_length=None
).filter(~pl.all_horizontal(pl.all().is_null()))

shark_incident_df

Which species of sharks are responsible for the majority of the fatal attacks?

In [None]:
(
    shark_incident_df.rename(
        {
            "Shark.common.name": "shark_species",
        }
    )
    .filter(
        pl.col("shark_species").is_not_null(),
        pl.col("Victim.injury") == "fatal",
    )
    .group_by("shark_species")
    .agg(
        pl.len().alias("fatal_incidents"),
    )
    .sort("fatal_incidents", descending=True)
    .plot.bar(
        x="fatal_incidents",
        y=alt.Y("shark_species", sort=None),
    )
)

Which is the largest shark in the dataset?

In [None]:
(
    shark_incident_df.select(["Shark.common.name", "Shark.length.m"])
    .drop_nulls()
    .sort("Shark.length.m", descending=True)
    .head(1)
)

What are the average lengths of the different species of shark?

In [None]:
(
    shark_incident_df.select(["Shark.common.name", "Shark.length.m"])
    .drop_nulls()
    .group_by("Shark.common.name")
    .agg(pl.col("Shark.length.m").mean())
    .sort("Shark.length.m", descending=True)
)

How has the number of incidents changed over the years?

In [None]:
(
    shark_incident_df.with_columns(
        pl.col("Incident.year").round_sig_figs(3).alias("decade"),
    )
    .group_by("decade")
    .agg(pl.len().alias("incidents"))
    .sort("decade")
    .plot.line(
        x=alt.X("decade", title="year"),
        y=alt.Y("incidents", title="# incidents"),
    )
    .properties(
        width=500,
    )
)
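The cell above uses `round_sig_figs(3)` to turn a four-digit year into a decade, which rounds to the nearest ten: 1987 is counted under 1990, not 1980. The plain-Python sketch below contrasts that with flooring to the start of the calendar decade. The sample years are arbitrary, and years ending in 5 are left out because the handling of exact ties can differ between tools.

```python
years = [1980, 1984, 1987, 1991]

# Rounding to the nearest ten, as three significant figures does for four-digit years.
nearest_decade = [round(year, -1) for year in years]

# Flooring to the start of the calendar decade.
calendar_decade = [year // 10 * 10 for year in years]

print(nearest_decade)
print(calendar_decade)
```

If calendar decades are preferred, an expression along the lines of `pl.col("Incident.year") // 10 * 10` should give the flooring behaviour in Polars, though that is an assumption rather than something taken from the course code.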

Is there a relationship between the depth of the water and the depth at which the incident occurred?

In [None]:
(
    shark_incident_df.rename(
        {
            "Total.water.depth.m": "water_depth",
            "Depth.of.incident.m": "incident_depth",
        }
    )
    .plot.point(
        x="water_depth",
        y="incident_depth",
    )
    .properties(
        width=500,
    )
)

What's the correlation?

In [None]:
(
    shark_incident_df.rename(
        {
            "Total.water.depth.m": "water_depth",
            "Depth.of.incident.m": "incident_depth",
        }
    )
    .select(pl.corr("water_depth", "incident_depth"))
    .item()
)

Produce a data quality report.

In [None]:
report = ProfileReport(df=shark_incident_df.to_pandas(), title="Shark incident data profile")
report.to_file("temp/shark_incident_profile_report.html")