# Theory for Exercises 02 (Fall 2025)

Last modified: Tue Sep 23 before session. The order of things may be quite random, but that's minor.

Theme: data wrangling with pandas (+ some basic statistics stuff for future exercises)

During our Teams sessions, we'll take a look at the following:

- The [Wes McKinney book](https://wesmckinney.com/book/) is excellent. Please start with chapter five.
- [Mika's material](https://varpha.pages.labranet.jamk.fi/dan-spring2024) (see *Materials*) is good as well. Please report any AI translation weirdness to our Teams channel.
- [Official pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)
- A nice [cleaning guide](https://medium.com/data-science/how-to-clean-your-data-in-python-8f178638b98d). There are no real recipes: you always have to decide for yourself what to do with the "dirty" parts of the data.




## On pandas

A pandas dataframe is a "table" that consists of Series objects ("columns") aligned by Index ("row"). The Index can be almost anything immutable: integers, strings, dates, intervals, etc.

A Series is the pandas version of a list. It has a built-in Index and allows mixed data types in its values.
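
To make the above concrete, here is a minimal sketch with made-up data (the names `tmin`, `tmax` and the weekday labels are just for illustration). It shows that columns are aligned by Index label rather than by position, and that a Series with mixed values falls back to the generic `object` dtype.

```python
import pandas as pd

# Two Series with string labels as their Index (illustrative data).
tmin = pd.Series([-3.0, 1.5, 4.0], index=["mon", "tue", "wed"], name="tmin")
tmax = pd.Series([9.0, 2.5, 6.0], index=["wed", "mon", "tue"], name="tmax")

# Building a DataFrame aligns the columns by Index label, not by position.
df = pd.DataFrame({"tmin": tmin, "tmax": tmax})
print(df)

# A Series may hold mixed types; its dtype is then the generic "object".
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```

Note how `tmax` was given in a different label order, yet each value ends up on the row with the matching label.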

Method chaining is the recommended way of coding in pandas; see [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
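
A small sketch of what method chaining can look like, with hypothetical data (the `city` and `temp` columns are made up). This is one possible style, not the only one.

```python
import pandas as pd

# Hypothetical data, for illustration only.
raw = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "temp": [1.0, 3.0, None, 10.0, 14.0],
})

# Each step returns a new object, so the steps read from top to bottom.
summary = (
    raw
    .dropna(subset=["temp"])                          # drop rows with a missing temp
    .assign(temp_f=lambda d: d["temp"] * 9 / 5 + 32)  # add a Fahrenheit column
    .groupby("city")["temp_f"]                        # one group per city
    .mean()                                           # mean Fahrenheit temp per city
    .sort_values(ascending=False)                     # warmest city first
)
print(summary)
```

The wrapping parentheses let each method sit on its own line without backslashes.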

Harri's phone browser bookmarks on pandas are in `public/exrc_02`. You may put the URLs into freedium.cfd (or better still, subscribe to Medium; $5/mo. is maybe not too much).


## A helper cell
Probably not even needed.

In [None]:
import ipywidgets as widgets
from IPython.display import clear_output

# notebook analogue of sys.exit() is now raise StopExecution
class StopExecution(Exception):
    def _render_traceback_(self):
        pass

# let's make a function of it
def myexit():
    raise StopExecution

# an input function in a similar manner (we may not need it but anyway)
def myinput(msg='press enter to continue'):
    input(f'{msg} ')
    clear_output()


## On problem 1 (and general philosophy)

Here's an excerpt from the 2023 model solutions. (Just a few lines; the code won't run.) Back then, the problem 1 profiles were in UK format, which differs a bit from our US format profiles. Anyway.

An important theme on this course is *being human in the age of AI* or something like that. We will talk about this during the Tue Sep 23 session. This model solution excerpt is a good example.

**Note: This is by no means the only way, or *the* recommended way. This is just a kind of nice way that this particular student came up with.**

```python
# ------------------ MORE SANITY CHECKS ---------------------------
# What about the address?
split_addresses = df.address.map(lambda x: x.split('\n'))

# What is the maximum number of address splits?
print(f"Maximum number of address splits: {split_addresses.map(lambda x: len(x)).max()}\n")
# Result: 4

# At this point I'm making an assumption that the first part may be a specifier
# (like a flat number), second part is street name, third part city and last one postal code.
# -------------------SANITY CHECK ENDS ----------------------------

df['street_address'] = df.address.map(lambda x: x.split('\n')[-3])
df['state'] = df.address.map(lambda x: x.split('\n')[-2])
df['postal_code'] = df.address.map(lambda x: x.split('\n')[-1])

# Sample data
print(df[['street_address', 'state', 'postal_code']].head())

# Instruction: Keep also the ssn, username, sex, mail and birthdate columns. Drop all the other columns.
columns_to_keep = ['first_name', 'last_name', 'street_address', 'state', 'postal_code', 'ssn', 'username', 'sex', 'mail', 'birthdate']
df = df[columns_to_keep]

# Results so far
df.head()
```

## Problem 3 hints

This is also from the 2023 model solutions. Just a few lines, but the idea is: wrangle in whatever ways suit you best, until you are happy! **Again, the below is by no means optimal, and there is always more than one way to do things.**

```python

# harri's addition: running 'eval' on anything is questionable, but literal_eval is okay! it takes a string and converts it into the corresponding python datatype.
from ast import literal_eval

df=pd.read_csv('data/XXXXX_prob03_epl.csv')

# read the object data from the dataframe
df['fullTime']=df['fullTime'].apply(lambda x: literal_eval(x))
df['homeScore'] = df['fullTime'].apply(lambda x: x['homeTeam'])

# Build Home metrics - True = 1 and false =0
df['homeWins']=df['homeScore']>df['awayScore']

# ... (many lines snipped)

# apply game dummy to get count of games - both home and away can use the same column
df['Games']=1

# build home results dataset with home team and home metrics - rename all columns to same schema
home_results=df[['Games','homeTeam','homeWins','homedraws','homeLoss','homePoints','homeScore','awayScore']]\
.rename(columns={"homeTeam":"Team","homeWins":"Wins","homedraws":"Draws","homeLoss":"Defeats","homePoints":"Points", "homeScore":"TeamGoals", "awayScore":"OppGoals"})

# ... (many lines snipped)

# group results and sort on three metrics points, goal differential and overall goals
standings_grouped = standings.groupby(standings['Team']).sum().sort_values(by=(['Points','GoalDiff', 'TeamGoals']), ascending=False)

# and so on!

```
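
To see what `literal_eval` does in isolation, here is a minimal sketch on a made-up two-row dataframe. The `homeTeam` key appears in the excerpt above; the `awayTeam` key and the values are assumptions for illustration.

```python
from ast import literal_eval
import pandas as pd

# Hypothetical column where each cell is a dict stored as a string.
df_demo = pd.DataFrame({"fullTime": ["{'homeTeam': 2, 'awayTeam': 1}",
                                     "{'homeTeam': 0, 'awayTeam': 0}"]})

# literal_eval parses Python literals only; it never runs arbitrary code.
parsed = df_demo["fullTime"].apply(literal_eval)

# Spread the dict keys into their own columns.
scores = pd.json_normalize(parsed.tolist())
print(scores)

# Anything that is not a plain literal is rejected.
try:
    literal_eval("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", err)
```

The last part is the reason why `literal_eval` is okay where `eval` is questionable: a function call inside the string raises an error instead of being executed.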


## About Problem 5 

Problem 5 is meant to be hard!

Let's say we want to produce a thing like [this](https://student.labranet.jamk.fi/~varpha/data_analytics/statfi_gui_result.png) from our data obtained in Round 1 Problem 5.

In [None]:
import pandas as pd
import json

statfi_data_path = '/home/varpha/dan/public/exrc_02/data/data_statfin_kihi_pxt_13zt.px.json'

# let's read the data into a plain python dict first
with open(statfi_data_path) as handle:
    statfi_data = json.load(handle)

# some plain python things one can try to get an initial grip on the json data
'''
print(type(statfi_data))
print(list(statfi_data.keys()))
print(list(statfi_data.values()))
'''

# then convert the data into a pandas dataframe, see
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
df = pd.json_normalize(statfi_data, record_path = 'data')

# some pandas things one can try to get an initial grip on the data

print(df.head(5))
'''
print(df.shape)
print(df.index)
print(df.columns)
print(df.info())
print(df.describe())
'''

'''
Of the above, df.head(5) was the most important for me in this case.
It revealed that we had two columns in the data: key and values,
and that they both contained lists. We want to get rid of the lists.

Below is one way to do it.

'''

key_exploded = df['key'].apply(pd.Series)
# print(key_exploded.head())

values_exploded = df['values'].apply(pd.Series)
# print(values_exploded.head())

df = pd.concat([key_exploded, values_exploded], axis=1)
# print(df.head())

How to advance from here: see e.g.

- [Reshaping and pivot tables User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

There are many good Medium blog posts as well. (And Substack posts, and ...)


## On Problem 2

Easier than Problem 5!

In [None]:
import pandas as pd

df = pd.read_csv('/home/varpha/dan/private/x1234/exrc_02/data/x1234_prob02_weather.csv')

df.index = df.Time

df = df.drop('Time', axis=1)

df = df.pivot(columns='ParameterName', values='ParameterValue')

df = df.drop(['TG_PT12H_min', 'rrday', 'tday', 'snow'], axis=1)

df = df.dropna()

# print(df)

# without the _= part, the below multiline comment
# would be spammed as output (jupyter notebook feature)
_ = '''

my_mean = ((df['tmin'] + df['tmax'])/2).mean()

my_std = ((df['tmin'] + df['tmax'])/2).std()

print(my_mean)
print(my_std)
'''

Your task in problem 2a is now to investigate how far `(tmin + tmax)/2` (on each row) is from its `avg` (over all rows). Here "far away" is measured in terms of the `std` of `(tmin + tmax)/2`. (For example: *this row is 1.2 stds away from the avg*.)

## About big data

Let's consider a student question: *How to deal with data bigger than the computer memory?*

First, here's a quick Python script that generates some fake data into a csv file:

```python
# %pip install --user faker
import random
import faker
from datetime import datetime, timedelta
from csv import DictWriter

FILENAME = 'my_fake_data.csv'
ROWS = 20000

fake = faker.Faker()

def random_timestamp():
    dt = datetime.now() - timedelta(days=random.randint(0, 60), seconds=random.randint(0, 86400))
    return dt.strftime('%Y-%m-%d %H:%M:%S')

def fake_computer_log_row():
    row = {
        'timestamp': random_timestamp(),
        'user': fake.user_name(),
        'commit_hash': fake.sha1(),
        'cpu_usage': random.uniform(0, 100),
        'memory_usage': random.uniform(0, 100),
        'error_code': random.choice([None, 200, 404, 500])
    }
    return row


#### main ####

# newline='' is what the csv module docs advise (avoids blank rows on Windows)
with open(FILENAME, 'w', newline='') as handle:
    
    fieldnames = ['timestamp', 'user', 'commit_hash', 'cpu_usage', 'memory_usage', 'error_code']
    writer = DictWriter(handle, fieldnames=fieldnames)

    writer.writeheader()

    rows_written = 0
    while (rows_written < ROWS):
        writer.writerow(fake_computer_log_row())
        rows_written += 1
```

Then imagine that this data is so big that it doesn't fit into your computer memory. People may start installing heavy frameworks right away, but most often that's unnecessary. (One of the heavy frameworks is Dask, and I don't have much positive to say about it. If you really need to take the framework route, consider Polars or Vaex instead.) Often a *filter* is enough, because you don't really need to read all the data into memory at once. Here's an example filter in Python:

```python

# open two filehandles: one for reading and one for writing
with open('my_fake_data.csv') as read_handle, \
    open('my_filtered_data.csv', 'w') as write_handle:

    # read one line at a time
    while myline := read_handle.readline():
        
        '''
        
        - take only the data that you need (cols 0, 3, 4, say)
        
        - entry refers to a single csv entry (a "cell")
        
        '''
        
        # this is now a list
        essential_data = [entry for i, entry in enumerate(myline.split(',')) if i in [0,3,4]]
        
        # write your essential data list to the disk as a csv row  
        write_handle.write(','.join(essential_data) + '\n')
```

This filter only reads one line at a time into your computer memory. Then you can start working with the filtered data file.
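
If you would rather stay within pandas, `read_csv` has `usecols` and `chunksize` parameters that serve a similar purpose. Below is a minimal sketch, not a recommendation over the plain filter. The tiny demo file and its contents are made up so that the code runs on its own; with real data you would point `path` at your big file.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    # Stand-in for a large file: a tiny csv with made-up rows.
    path = os.path.join(tmpdir, "my_small_demo.csv")
    pd.DataFrame({
        "timestamp": ["2025-09-01 10:00:00", "2025-09-01 10:05:00", "2025-09-02 11:00:00"],
        "user": ["a", "b", "c"],
        "cpu_usage": [10.0, 30.0, 50.0],
        "memory_usage": [20.0, 40.0, 60.0],
    }).to_csv(path, index=False)

    total = 0.0
    count = 0

    # chunksize makes read_csv return an iterator of DataFrames;
    # usecols skips the unneeded columns already at parse time.
    with pd.read_csv(path, usecols=["timestamp", "cpu_usage"], chunksize=2) as reader:
        for chunk in reader:
            total += chunk["cpu_usage"].sum()
            count += len(chunk)

mean_cpu = total / count
print(mean_cpu)
```

Only one chunk (here two rows) is held in memory at a time, and the running totals are enough to compute the overall mean.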
        

## Some basic statistics

In general, we should have some intuition about the following:

- frequencies etc
    - coin toss
    - probability mass (discrete) vs density (continuous)
    - normal distribution
- types of variables
    - categorical
    - ordinal
    - numerical
- mean, std, median, percentiles
- normalized data (*standard coordinates*):
    - subtract the mean, divide by std
    - --> mean = 0, std = 1

We'll get back to these later during our course, but let's do something already here.
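
As a first taste, here is a minimal sketch of the mean, median, a percentile and the standard coordinates for a made-up series of numbers. The data is invented for illustration.

```python
import pandas as pd

# Made-up measurements, for illustration only.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Basic descriptive statistics.
print(values.mean(), values.median(), values.quantile(0.9))

# Standard coordinates: subtract the mean, divide by the std.
# pandas uses the sample std (ddof=1) by default; numpy's np.std uses ddof=0.
z = (values - values.mean()) / values.std()

# After the transformation the mean is 0 and the std is 1 (up to rounding).
print(z.mean(), z.std())
```

Whichever `ddof` convention you pick, use the same one for the division and for the check afterwards.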

## Coin tosses, part 1/2 (theoretical probabilities)

In [None]:

import numpy as np
from scipy import stats as st
import pandas as pd
import matplotlib.pyplot as plt

n = 40 # number of coin tosses
p = 0.5 # probability of heads (say)

# theoretical probabilities for obtaining i heads for 0 <= i <= n
heads_probabilities = np.array([st.binom.pmf(i, n, p) for i in range(n+1)])

# the sum must be one (because we're talking about probabilities here)
# print(sum(heads_probabilities))

# initialize plot
fig, ax = plt.subplots()

# bar plot of the probabilities
ax.bar(range(n+1), heads_probabilities)

# next, let's compare with the normal distribution
# with the same mean and std as our binomial distribution

# calculate the mean and std of our distribution
mean = st.binom.mean(n,p)
std = st.binom.std(n,p)

# create an "x-axis" for the "continuous" plot
x = np.linspace(0,n,1000) # 1000 is the mesh size

# plot the normal distribution curve
ax.plot(x, st.norm.pdf(x, mean, std))

# one should always make clear legends in plots
# (that wasn't done here)...
# but that will be part of exercises 3 

## Coin tosses, part 2/2 (simulation probabilities)

In [None]:

# clear and initialize plot again
fig.clear()
fig, ax = plt.subplots()

# this is a coin toss simulation
# imports, n and p: see previous cell (assume executed)
toss_experiments = 10000
heads_simulation = st.binom.rvs(n, p, size=toss_experiments)

# wrangle the simulation outcomes into a probability distribution
# there could be a more elegant way (one line) but who cares
myseries = pd.Series(heads_simulation)
mydict = myseries.value_counts().to_dict()
myvalues = np.array([mydict.get(i, 0) for i in range(n+1)])
heads_probabilities = myvalues/sum(myvalues)

# plot as before
ax.bar(range(n+1), heads_probabilities)

# the mean and std should now be calculated from our simulation
mean = np.mean(heads_simulation)
std = np.std(heads_simulation)
x = np.linspace(0,n,1000)

# plot the normal distribution with mean & std from above
ax.plot(x, st.norm.pdf(x, mean, std))


Below is some gpt-4-1106-preview output to fix the "who cares" part above.


The code snippet simulates the outcomes of coin tosses and then converts the results into a probability distribution. To make this process more elegant, you can use the `numpy` library's `bincount` function, which counts the number of occurrences of each value in an array of non-negative integers. The `normalize` parameter in `value_counts` can also be used to directly obtain probabilities. Here's a refined version of the code:

```python
import numpy as np
import pandas as pd
from scipy.stats import binom

# Parameters for the binomial distribution
n = 1  # number of trials (1 for a coin toss)
p = 0.5  # probability of success (heads)

# Simulate coin toss experiments
toss_experiments = 10000
heads_simulation = binom.rvs(n, p, size=toss_experiments)

# Use numpy's bincount to get the count of each outcome
outcome_counts = np.bincount(heads_simulation)

# Convert counts to probabilities
heads_probabilities = outcome_counts / toss_experiments
```

This code does away with the need for a `pandas` Series and dictionary conversion, resulting in a more efficient and elegant solution. The `heads_probabilities` array now contains the probabilities of each outcome (0 for tails and 1 for heads) directly.
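
A caveat on the output above: it silently sets `n = 1`, whereas our simulation used `n = 40`, and it mentions the `normalize` parameter without using it. Also, with `n = 40` a plain `np.bincount` may return fewer than `n + 1` entries if the largest outcomes never occur. Below is a sketch of how both ideas could look with our original parameters. The `random_state=0` seed is my addition, for reproducibility only.

```python
import numpy as np
import pandas as pd
from scipy import stats as st

n = 40    # tosses per experiment, as in the simulation cell
p = 0.5   # probability of heads
toss_experiments = 10000
heads_simulation = st.binom.rvs(n, p, size=toss_experiments, random_state=0)

# minlength pads with zeros so that every outcome 0..n gets an entry,
# including those that never occurred in the simulation.
probs_bincount = np.bincount(heads_simulation, minlength=n + 1) / toss_experiments

# The pandas route: normalize=True gives relative frequencies,
# and reindex fills in the outcomes that are missing.
probs_pandas = (
    pd.Series(heads_simulation)
    .value_counts(normalize=True)
    .reindex(range(n + 1), fill_value=0.0)
    .to_numpy()
)

print(np.allclose(probs_bincount, probs_pandas))
```

Both arrays have length `n + 1`, so either one can be passed to `ax.bar(range(n+1), ...)` as before.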

### Central Limit Theorem

me: Please give a python simulation example of the central limit theorem.

gpt-4-1106-preview: Sure, here's a Python simulation that demonstrates the central limit theorem.

In this simulation, we generate random numbers from an exponential distribution, which is not a normal distribution. We then take 1000 samples of size 50 from this data and calculate the mean of each sample. The histogram of the sample means should approximate a normal distribution, as demonstrated by the central limit theorem. The green curve in the plot represents a normal distribution with the same mean and standard deviation as the histogram of the sample means.

In [None]:

import numpy as np
import matplotlib.pyplot as plt

# Generate random numbers from a non-normal distribution
np.random.seed(0)
data = np.random.exponential(scale=5, size=1000)

# Plot histogram of original data
plt.hist(data, bins=50, density=True, color='blue', alpha=0.5, label='Original Data')

# Calculate mean and standard deviation of original data
mean = np.mean(data)
std = np.std(data)

# Generate 1000 samples of size 50 from the original data
samples = [np.random.choice(data, size=50, replace=True) for i in range(1000)]
sample_means = [np.mean(sample) for sample in samples]

# Plot histogram of sample means
plt.hist(sample_means, bins=50, density=True, color='red', alpha=0.5, label='Sample Means')

# Calculate mean and standard deviation of sample means
mean_sample_means = np.mean(sample_means)
std_sample_means = np.std(sample_means)

# Plot normal distribution with mean and standard deviation of sample means
x = np.linspace(mean_sample_means - 3 * std_sample_means, mean_sample_means + 3 * std_sample_means, 1000)

# here the ai uses the formula but one could use st.norm.pdf(x,mean,std) // Harri
y = 1 / (std_sample_means * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((x - mean_sample_means) / std_sample_means)**2)

plt.plot(x, y, color='green', label='Normal Distribution')

plt.legend()
plt.show()

## On AI

(to be continued at some point)

The AI companies always have a separate developer platform, where you pay by usage only, and if you are smart, you can get away with very little cost, and still have access to all the latest and best models. For example, [platform.openai.com](https://platform.openai.com) is different from chatgpt.com (different user account). Similarly, [console.anthropic.com](https://console.anthropic.com) is different from claude.com. Etc!

**If you experiment with these, remember to use billing guards!**

Harri uses [https://github.com/sigoden/aichat](https://github.com/sigoden/aichat); a comparable GUI tool is [https://github.com/Mintplex-Labs/anything-llm](https://github.com/Mintplex-Labs/anything-llm).