## Setup

Run this cell to import libraries and set display options.

If any import fails, install with `pip install numpy pandas matplotlib seaborn pyarrow`.

In [2]:
# Imports and options
import sys
import math
import statistics as stats

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.precision', 3)
pd.set_option('display.max_rows', 10)
sns.set_theme(context='talk', style='whitegrid')

DATA_PARQUET = 'gtbc_20250919.parquet'

print(sys.version)
print('numpy', np.__version__, 'pandas', pd.__version__)

3.8.8 (default, Apr 13 2021, 12:59:45) 
[Clang 10.0.0 ]
numpy 1.22.0 pandas 2.0.3


## Part 1 — Python refresher

Goal: warm up with core Python. Do each task first in plain Python. We'll then revisit with NumPy/Pandas.

Topics:
- Variables and types
- Functions and docstrings
- Conditionals
- Loops and comprehensions
- Basic algorithmic thinking

### Exercise 1: Variables and simple computations

1) Create variables `celsius_values = [12.5, 14.2, 10.1, None, 16.8, 20.0]` and `country_names = ["France", "Germany", "Denmark", "Iceland", "Spain", "Italy"]`.

2) Convert each non-missing Celsius float value to Fahrenheit using a for-loop and append to a new list.

3) Compute the mean of the Fahrenheit list ignoring missing values. Use only the standard library.

In [3]:
celsius_values = [12.5, 14.2, 10.1, None, 16.8, 20.0]
country_names = ["France", "Germany", "Denmark", "Iceland", "Spain", "Italy"]

# 1.2 Convert non-missing values
# Formula: f = c * 9/5 + 32
celsius_float_values = []
farenheit_vals = []
for c in celsius_values:
    if c is not None:
        c = float(c)
        celsius_float_values.append(c)
        farenheit_vals.append(c * 9/5 + 32)

# 1.3 Compute mean ignoring missing values
if farenheit_vals:
    mean_vals = sum(farenheit_vals) / len(farenheit_vals)
    print(f"Mean: {mean_vals}")
else:
    print("No values provided")



Mean: 58.496
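
The standard library's `statistics` module (imported in the setup cell as `stats`) offers an alternative to summing by hand. A minimal sketch, assuming Python 3.8+ for `fmean`; the variable names are illustrative:

```python
import statistics

celsius_values = [12.5, 14.2, 10.1, None, 16.8, 20.0]

# Convert only the non-missing readings
fahrenheit = [c * 9 / 5 + 32 for c in celsius_values if c is not None]

# fmean raises StatisticsError on an empty sequence, so guard against it
mean_f = statistics.fmean(fahrenheit) if fahrenheit else None
print(mean_f)
```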


### Exercise 2: Functions and conditionals

Write a function `classify_temp(c)` that:
- Returns `None` if `c` is `None`.
- Returns `'cold'` if `c < 5`, `'mild'` if `5 <= c < 15`, and `'warm'` otherwise.

In [5]:
def classify_temp(c):
    """Return 'cold', 'mild' or 'warm' for a Celsius value; None stays None."""
    if c is None:
        return None
    if c < 5:
        return 'cold'
    elif c < 15:
        return 'mild'
    else:
        return 'warm'

labels = []
for c in celsius_values:
    labels.append(classify_temp(c))
print(labels)

['mild', 'mild', 'mild', None, 'warm', 'warm']
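
The interval boundaries are the easiest place to slip. A small check, restating the function so the snippet is self-contained (the body shown is one possible implementation of the specification above, not the only one):

```python
def classify_temp(c):
    """Label a Celsius reading; None stays None."""
    if c is None:
        return None
    if c < 5:
        return 'cold'
    if c < 15:
        return 'mild'
    return 'warm'

# Boundary values: 5 belongs to 'mild', 15 belongs to 'warm'
checks = {4.9: 'cold', 5: 'mild', 14.9: 'mild', 15: 'warm', None: None}
for value, expected in checks.items():
    assert classify_temp(value) == expected, (value, expected)
```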


### Exercise 3: Loops, comprehensions, and dicts

1) Using a list comprehension, recompute the Fahrenheit values, ignoring `None`.

2) Create a dictionary mapping `country_names` to `celsius_values` (skip `None`).

3) From that dict, build a new dict of `country -> label` using `classify_temp`.

In [6]:
# 3.1 List comprehension: convert and filter in one expression
fahrenheit_comp = [c * 9/5 + 32 for c in celsius_values if c is not None]

# 3.2 zip() pairs each country with its own value; skip pairs whose value is None
# (filtering celsius_values before zipping would shift values onto the wrong countries)
country_celsius = {name: c for name, c in zip(country_names, celsius_values) if c is not None}

# 3.3 items() yields (key, value) pairs
country_to_label = {country: classify_temp(c) for country, c in country_celsius.items()}

print(country_to_label)

{'France': 'mild', 'Germany': 'mild', 'Denmark': 'mild', 'Spain': 'warm', 'Italy': 'warm'}
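
`zip` pairs elements by position, so dropping `None` from one list before zipping moves every later value onto the wrong country. A short illustration using the lists from Exercise 1 (`shifted` and `aligned` are illustrative names):

```python
country_names = ["France", "Germany", "Denmark", "Iceland", "Spain", "Italy"]
celsius_values = [12.5, 14.2, 10.1, None, 16.8, 20.0]

# Filtering first shortens the list, so later countries receive a neighbour's value
shifted = dict(zip(country_names, [c for c in celsius_values if c is not None]))

# Zipping first keeps each country with its own reading; missing ones are then skipped
aligned = {name: c for name, c in zip(country_names, celsius_values) if c is not None}

print(shifted['Iceland'])     # 16.8, which is really Spain's reading
print('Iceland' in aligned)   # False: Iceland has no reading
print(aligned['Spain'])       # 16.8
```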


In [7]:
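# Scratch: list repetition vs. array construction
values = [1, 2, 3]
values * 3  # repeats the list: [1, 2, 3, 1, 2, 3, 1, 2, 3]
c_list = [12.5, 14.2, 10.1, None, 16.8, 20.0]
# The extra brackets make this a 2-D array of shape (1, 6); None becomes nan
arr_c = np.array([c_list], dtype=float)
arr_c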

array([[12.5, 14.2, 10.1,  nan, 16.8, 20. ]])

## Part 2 — NumPy essentials

Why NumPy?
- Homogeneous n-dimensional arrays with explicit dtypes
- Vectorized operations, broadcasting, and views vs copies
- Efficient boolean masking and reduction operations
- Performance relative to pure Python loops
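
Two of the points above, views vs copies and broadcasting, are easier to see on a tiny array. A minimal sketch with made-up values:

```python
import numpy as np

a = np.arange(6, dtype=float)      # [0. 1. 2. 3. 4. 5.]

# Basic slicing returns a view: writing through it changes `a`
view = a[:3]
view[0] = 99.0

# Boolean masking returns a copy: writing to it leaves `a` untouched
copy = a[a > 2]
copy[0] = -1.0

# Broadcasting: a (2, 1) column combines with a (3,) row into a (2, 3) grid
grid = np.array([[0.0], [10.0]]) + np.array([1.0, 2.0, 3.0])

print(a)
print(grid.shape)
```

`np.shares_memory(view, a)` is a quick way to confirm whether two arrays refer to the same buffer.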

### NumPy: creating arrays and dtypes

- From Python lists: `np.array([...], dtype=float)`
- Inspect: `arr.shape`, `arr.ndim`, `arr.dtype`

In [8]:
# Build arrays from our earlier list, handling None as np.nan
c_list = [12.5, 14.2, 10.1, None, 16.8, 20.0]
arr_c = np.array(c_list, dtype=float)
arr_f = arr_c * 9/5 + 32

arr_c, arr_f, arr_c.dtype, arr_c.shape

(array([12.5, 14.2, 10.1,  nan, 16.8, 20. ]),
 array([54.5 , 57.56, 50.18,   nan, 62.24, 68.  ]),
 dtype('float64'),
 (6,))
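
The explicit `dtype=float` matters here. A short sketch of what happens without it, and of how an extra pair of brackets changes the shape (variable names are illustrative):

```python
import numpy as np

c_list = [12.5, 14.2, 10.1, None, 16.8, 20.0]

as_object = np.array(c_list)               # no dtype given: falls back to object
as_float = np.array(c_list, dtype=float)   # None is converted to nan
nested = np.array([c_list], dtype=float)   # extra brackets add a dimension

print(as_object.dtype, as_float.dtype)
print(as_float.shape, as_float.ndim)
print(nested.shape, nested.ndim)
```

Object arrays store Python references, so they lose most of the speed benefit of vectorized arithmetic.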

### Vectorization and masking

- Vectorization replaces explicit Python loops
- Boolean masks select elements: `mask = ~np.isnan(arr)`

In [9]:
mask = ~np.isnan(arr_c)
mask

array([ True,  True,  True, False,  True,  True])

In [10]:
mean_c_np = np.mean(arr_c[mask])
mean_c_np

14.719999999999999
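
NumPy also ships NaN-aware reductions that skip missing values without an explicit mask. A minimal comparison on the same values:

```python
import numpy as np

arr_c = np.array([12.5, 14.2, 10.1, np.nan, 16.8, 20.0])

plain_mean = np.mean(arr_c)                      # nan: one missing value propagates
masked_mean = np.mean(arr_c[~np.isnan(arr_c)])   # explicit boolean mask
nan_mean = np.nanmean(arr_c)                     # built-in NaN-skipping reduction

print(plain_mean, masked_mean, nan_mean)
```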

In [11]:
# Caution: comparisons with nan are False, so the missing value falls through to 'warm'
labels_np = np.where(arr_c < 5, 'cold', np.where(arr_c < 15, 'mild', 'warm'))
labels_np

array(['mild', 'mild', 'mild', 'warm', 'warm', 'warm'], dtype='<U4')
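
Because every comparison with `nan` evaluates to `False`, the nested `np.where` labels the missing entry `'warm'`. One way to keep missing values distinct is `np.select` with an explicit `np.isnan` condition. This is a sketch; the `'missing'` label is an assumption, not part of the exercise:

```python
import numpy as np

arr_c = np.array([12.5, 14.2, 10.1, np.nan, 16.8, 20.0])

# Conditions are checked in order; the first one that matches wins
conditions = [np.isnan(arr_c), arr_c < 5, arr_c < 15]
choices = ['missing', 'cold', 'mild']
labels = np.select(conditions, choices, default='warm')

print(labels)
```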

### Performance comparison: Python loop vs NumPy vectorization

We will compute Fahrenheit for a large list to see the speed difference. Use `%timeit`.

In [12]:
# Generate large synthetic data
rng = np.random.default_rng(42)
large_c = rng.normal(loc=12.0, scale=8.0, size=2_000_000)
large_c_list = large_c.tolist()

print(large_c_list[:10])


[14.437736638035451, 3.6801271500760357, 18.003609566451658, 19.524517731129713, -3.6082815092306912, 1.5825639451014553, 13.022723225338282, 9.470059261251343, 11.86559073996569, 5.17564857941136]


In [13]:
%%timeit -n 3 -r 3
# Python loop timing
out = []
for v in large_c_list:
    out.append(v * 9/5 + 32)

295 ms ± 13.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)


In [14]:
%%timeit -n 3 -r 3
# NumPy vectorized timing
out_np = large_c * 9/5 + 32

The slowest run took 5.28 times longer than the fastest. This could mean that an intermediate result is being cached.
17.3 ms ± 14.3 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
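
With only three runs, the ± 14.3 ms spread on the NumPy timing is nearly as large as its mean, so the ratio between the two results is only a rough estimate. Outside a notebook, where `%%timeit` is unavailable, the standard-library `timeit` module gives comparable measurements. A minimal sketch on a smaller array (the size and repeat counts are arbitrary choices), which also checks that both approaches agree:

```python
import timeit
import numpy as np

rng = np.random.default_rng(42)
small_c = rng.normal(loc=12.0, scale=8.0, size=200_000)
small_c_list = small_c.tolist()

def loop_version():
    # Element-by-element conversion in the Python interpreter
    return [v * 9 / 5 + 32 for v in small_c_list]

def numpy_version():
    # One vectorized expression over the whole array
    return small_c * 9 / 5 + 32

# Take the best of several repeats to reduce noise from other processes
t_loop = min(timeit.repeat(loop_version, number=3, repeat=3)) / 3
t_numpy = min(timeit.repeat(numpy_version, number=3, repeat=3)) / 3
print(f"loop: {t_loop * 1e3:.1f} ms, numpy: {t_numpy * 1e3:.1f} ms")

assert np.allclose(loop_version(), numpy_version())
```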


## Part 3 — Pandas with Global Temperatures

Goals:
- Load the dataset (Parquet), parse dates, handle missing values
- Explore: head/tail/info/describe, unique countries
- Aggregations: groupby country/year, pivot tables
- Time series: resampling
- Visualization

In [15]:
# Load the temperature dataset (a Parquet file) and get the number of rows and columns
df = pd.read_parquet(DATA_PARQUET)
df.shape

FileNotFoundError: [Errno 2] No such file or directory: 'gtbc_20250919.parquet'
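
The error above means the file is not in the notebook's working directory. A small guard can make that easier to diagnose. This is a sketch: `load_temperatures` is a hypothetical helper, not part of the notebook, and it assumes the file should sit at the given relative path:

```python
from pathlib import Path

import pandas as pd

def load_temperatures(path):
    """Hypothetical helper: read a Parquet file, with a clearer error if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_parquet(path)
```

Calling `load_temperatures(DATA_PARQUET)` would then either return the DataFrame or report where Python was looking.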

In [None]:
# Display the first rows
df.head()

In [None]:
# Display the last rows
df.tail()

In [None]:
# Get the list of columns
cols = df.columns
for c in cols:
    print(c)

### Quick EDA
- `df.info()` to inspect types and missingness
- Basic stats: `df.describe()`
- Countries count and sample
- Time coverage

In [None]:
# Inspect data types and missing values
df.info()

In [None]:
# Basic stats
df.describe()

In [None]:
# Details about countries (missing values excluded from the count)
countries = df['Country'].dropna().unique()
print('Number of countries:', len(countries))
print(countries[:10])

In [None]:
# Convert dt into datetime format
df['dt'] = pd.to_datetime(df['dt'])
df.dtypes

In [None]:
# Time coverage
print('Date range:', df['dt'].min(), '→', df['dt'].max())
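
The goals for this part mention resampling, which needs a datetime index. A minimal sketch on synthetic daily data (not the temperature dataset) that averages by calendar month:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 60 days from 2000-01-01 (January plus leap-year February)
dates = pd.date_range('2000-01-01', periods=60, freq='D')
daily = pd.Series(np.arange(60, dtype=float), index=dates)

# 'MS' groups by calendar month and labels each group by the month's first day
monthly = daily.resample('MS').mean()
print(monthly)
```

For a DataFrame such as `df`, the same call would typically follow `df.set_index('dt')`, applied per country if several countries share the index.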

### Data preparation
- Remove rows with missing values
- Add year column

In [None]:
# Basic cleaning: drop rows missing a temperature or a country
print('Before:', df.shape)
df = df.dropna(subset=['AverageTemperature', 'Country'])
print('After:', df.shape)

In [None]:
# Add year column
df['year'] = df['dt'].dt.year
df.head()
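
Since the cells above have no recorded output, here is the same preparation on a tiny synthetic frame. The rows are made up; only the column names match the ones this notebook uses:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'dt': ['1900-01-01', '1900-02-01', '1901-01-01', '1901-02-01'],
    'AverageTemperature': [1.5, np.nan, 2.0, 3.0],
    'Country': ['France', 'France', None, 'Spain'],
})

toy['dt'] = pd.to_datetime(toy['dt'])

# Drop rows missing either a temperature or a country, then derive the year
toy = toy.dropna(subset=['AverageTemperature', 'Country']).copy()
toy['year'] = toy['dt'].dt.year
print(toy)
```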

### Aggregations: average temperature by country and year

Compute annual means per country and pivot for a matrix view.

In [None]:
# Compute annual means per country
df_annual = df.groupby(['Country','year'], as_index=False)['AverageTemperature'].mean()
print(df_annual.shape)
df_annual.head()

In [None]:
# Pivot table
df_pivot = df_annual.pivot(index='year', columns='Country', values='AverageTemperature')
print(df_pivot.shape)
df_pivot.head()
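
To make the long-to-wide reshaping concrete, here is the same `groupby` and `pivot` on a small synthetic frame (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['France', 'France', 'France', 'Spain', 'Spain'],
    'year': [1900, 1900, 1901, 1900, 1901],
    'AverageTemperature': [10.0, 12.0, 11.0, 15.0, 17.0],
})

# Long format: one row per (Country, year)
toy_annual = toy.groupby(['Country', 'year'], as_index=False)['AverageTemperature'].mean()

# Wide format: years as rows, countries as columns
toy_pivot = toy_annual.pivot(index='year', columns='Country', values='AverageTemperature')
print(toy_pivot)
```

`pivot` requires each (index, column) pair to be unique, which the preceding `groupby` guarantees.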

### Visualization examples
- Country trend lines
- Top-N hottest/coldest countries (average over period)

In [None]:
# Plot a few country trends
countries_to_plot = ['France','Germany','Denmark','Iceland','Spain','Italy']
# Select columns for the countries to plot
sel = df_pivot[countries_to_plot]

plt.figure(figsize=(12,8))
for c in countries_to_plot:
    # Access the column directly using the country name
    s = sel[c]
    s.plot(label=c)
plt.legend()
plt.title('Annual Average Temperature by Country')
plt.xlabel('Year')
plt.ylabel('°C')
plt.show()

In [None]:
# Top-N countries by average temperature after 1900
df_pivot_post_1900 = df_pivot[df_pivot.index >= 1900]
topn = df_pivot_post_1900.mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(6,6))
sns.barplot(x=topn.values, y=topn.index, orient='h')
plt.title('Top 10 Warmest Countries (avg since 1900)')
plt.xlabel('Average °C')
plt.ylabel('Country')
plt.show()

## Part 4 — Comparative tasks

Solve the same problems using:
1) Pure Python (lists/dicts/loops)
2) NumPy arrays and/or Pandas DataFrames

Then compare code clarity and performance.

### Task A: Average annual temperature by country since 1900

Implement with:
- Pure Python (from a list of tuples)
- Pandas `groupby`

Compare speed for a moderate subset.

In [None]:
# Build a moderate subset (1900 onwards) as a list of tuples for pure Python
# Filter on the 'year' column: df_annual has a default integer index, not a year index
df_annual = df_annual[df_annual['year'] >= 1900][['Country','year','AverageTemperature']]
tuples = list(df_annual.itertuples(index=False, name=None))  # (Country, year, temp)

In [None]:
%%timeit -n 3 -r 3
# Pure Python aggregation
sums = {}
counts = {}
for country, year, temp in tuples:
    key = (country, year)
    sums[key] = sums.get(key, 0.0) + float(temp)
    counts[key] = counts.get(key, 0) + 1
avg_py = {k: sums[k]/counts[k] for k in sums}

In [None]:
%%timeit -n 3 -r 3
# Pandas aggregation
avg_pd = (df_annual.groupby(['Country','year'])['AverageTemperature'].mean())
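
Before comparing timings, it helps to confirm that both approaches return the same numbers. A minimal sketch on made-up rows, using `collections.defaultdict` as one alternative to the two-dictionary version above:

```python
from collections import defaultdict

import pandas as pd

rows = [('France', 1900, 10.0), ('France', 1900, 12.0),
        ('France', 1901, 11.0), ('Spain', 1900, 15.0)]

# Pure Python: accumulate [sum, count] per (country, year) key
acc = defaultdict(lambda: [0.0, 0])
for country, year, temp in rows:
    acc[(country, year)][0] += temp
    acc[(country, year)][1] += 1
avg_py = {key: total / n for key, (total, n) in acc.items()}

# Pandas: the same aggregation as a groupby
toy = pd.DataFrame(rows, columns=['Country', 'year', 'AverageTemperature'])
avg_pd = toy.groupby(['Country', 'year'])['AverageTemperature'].mean()

# The Series index holds (Country, year) tuples, matching the dict keys
assert avg_py == avg_pd.to_dict()
```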

### Task B: Classify temperatures (cold/mild/warm) at scale

- Pure Python: loop and if/elif
- NumPy: `np.where`
- Pandas: `pd.cut` or `np.select` on a Series

In [None]:
# Prepare a series of temperatures (capped at the number of available rows)
temps = df['AverageTemperature'].dropna()
s = temps.sample(n=min(300_000, len(temps)), random_state=42)

In [None]:
%%timeit -n 3 -r 3
# Pure Python
vals = s.to_list()
labels_py = []
for v in vals:
    if v < 5:
        labels_py.append('cold')
    elif v < 15:
        labels_py.append('mild')
    else:
        labels_py.append('warm')

In [None]:
%%timeit -n 3 -r 3
# NumPy
arr = s.to_numpy()
labels_np = np.where(arr < 5, 'cold', np.where(arr < 15, 'mild', 'warm'))

In [None]:
%%timeit -n 3 -r 3
# Pandas (right=False gives left-closed bins, matching `v < 5` and `v < 15`)
labels_pd = pd.cut(s, bins=[-np.inf,5,15,np.inf], labels=['cold','mild','warm'], right=False)
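
The task list also names `np.select`, and the three approaches only agree if the bin edges are handled the same way. A small sketch on boundary values (made-up data) comparing `np.select` with a left-closed `pd.cut`:

```python
import numpy as np
import pandas as pd

s = pd.Series([-3.0, 5.0, 14.9, 15.0, 22.0])
arr = s.to_numpy()

# np.select: conditions are evaluated in order, first match wins
labels_select = np.select([arr < 5, arr < 15], ['cold', 'mild'], default='warm')

# pd.cut: right=False makes bins left-closed, i.e. [5, 15) for 'mild'
labels_cut = pd.cut(s, bins=[-np.inf, 5, 15, np.inf],
                    labels=['cold', 'mild', 'warm'], right=False)

print(list(labels_select))
print(list(labels_cut))
```

With the default `right=True`, `pd.cut` would place 5.0 in `'cold'` and 15.0 in `'mild'`, which differs from the loop's `v < 5` and `v < 15` tests.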

## Wrap-up
- When to choose pure Python vs NumPy vs Pandas
- Measuring performance with `%timeit`
- Readability and maintainability considerations

References:
- NumPy user guide: `https://numpy.org/doc/stable/user/index.html`
- Pandas user guide: `https://pandas.pydata.org/docs/user_guide/index.html`