# Round 3, Fall 2025

Last updated: Tue Sep 30 during class

Natural order of things:

- round 4: regression
- round 5: classification
- round 6: time series.

But we will have time series next (round 4), and that's why we have the weekend! The positive side is that rounds 5 and 6 will be easy after round 4.


## Jupyter

There's also the "plain" view (i.e. notebook instead of lab): after login, you should see `lab?` at the end of your URL. Replace that with `tree`. This may help with some visualizations, which can work better in the plain notebook environment.

See also: [15 Killer Jupyter Hacks](https://medium.com/geekculture/15-best-jupyter-tips-tricks-754831bba408)


## On GitLab

There is the [git](https://git-scm.com/doc) command (originally authored by Linus Torvalds in 2005) that has "nothing" to do with [gitlab](https://gitlab.com) or [github](https://github.com) or [gitea](https://gitea.com)... Well, the latter ones are built around the command, all right. But one could use just the command too. Anyway, it's good to get used to this git stuff if you work with computers and data.

## More on Git & JupyterHub

Git functionality is built into our JupyterHub. Feel free to play with git repositories (see Tue Oct 1 recording). A few okay repositories can be found under `public/exrc_03`: [one about data stuff](https://github.com/GeostatsGuy/DataScienceInteractivePython) and [another about math stuff](https://github.com/xandhiller/maths_notebooks).

## On saving private tokens

I (Harri) do it more or less as described [here](https://python.plainenglish.io/python-dotenv-how-to-manage-environment-variables-in-python-fc64d6987cfa). Except that I typically use `dotenv_values` instead of `load_dotenv`: the latter loads the content of the `.env` file in my environment, while the former just reads the values from the `.env` file.

Anyway, as the test user `x1234` on the hub, I would first save the tokens in a text file
```
/home/x1234/.env
```
and make sure that the file isn't readable by anyone else (Terminal: `chmod 600 /home/x1234/.env`).

This `.env` file would contain a line like

```
GITLAB_TOKEN=akjhfaufhqwp8y9347kfhblvb
```

Then in the python code, I would use the following lines:

```python
%pip install --user python-dotenv

from dotenv import dotenv_values
import gitlab

mytoken = dotenv_values('/home/x1234/.env')['GITLAB_TOKEN']

gl = gitlab.Gitlab(
    'https://gitlab.labranet.jamk.fi',
    private_token=mytoken
)

# authenticate
gl.auth()

```
The <tt>dotenv</tt> library could have better error messages... You can also try this way without it:

```python
import gitlab

with open('/home/x1234/.env') as handle:
    myenv = handle.read().splitlines()

# split only at the first '=' so that a token containing '=' stays intact
mytoken = [line.split('=', 1)[1].strip() for line in myenv if line.startswith('GITLAB_TOKEN=')][0]

gl = gitlab.Gitlab(
    'https://gitlab.labranet.jamk.fi',
    private_token=mytoken
)

# authenticate
gl.auth()

```
Finally, the name of the env file doesn't matter. It's just customary to use <tt>~/.env</tt>... but the hub doesn't show dotfiles in its graphical user interface... so you would need to fill the dotfile using the command line (maybe rename an existing file) or python.
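Since the hub's graphical interface doesn't help with dotfiles, here is a minimal sketch of filling such a file from Python. The helper name `write_env_file`, the temporary demo directory and the placeholder token are made up for illustration; on the hub you would point it at your own path (e.g. `/home/x1234/.env`).

```python
import os
import stat
import tempfile
from pathlib import Path


def write_env_file(path, values):
    """Write KEY=VALUE lines to `path`, readable and writable by the owner only."""
    path = Path(path)
    # create (or truncate) the file with mode 600 from the start
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w') as handle:
        for key, value in values.items():
            handle.write(f'{key}={value}\n')
    # enforce the mode also when the file already existed (same as chmod 600)
    os.chmod(path, 0o600)
    return path


# demo in a temporary directory (hypothetical path and placeholder token)
demo_dir = tempfile.mkdtemp()
env_path = write_env_file(Path(demo_dir) / '.env', {'GITLAB_TOKEN': 'paste-your-token-here'})

print(env_path.read_text())
print(oct(stat.S_IMODE(env_path.stat().st_mode)))
```

Creating the file with the restrictive mode right away (instead of writing first and running `chmod` afterwards) avoids a short moment where others could read the token.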

Our round 03 topic is closely related to the concept of exploratory data analysis (EDA), see e.g. [this blog text](https://medium.com/epfl-extension-school/advanced-exploratory-data-analysis-eda-with-python-536fa83c578a).

## Some basic statistics (copied from round 02 theory, then extended a bit)

In general, we should have some intuition about the following:

- frequencies etc
    - coin toss
    - probability mass (discrete) vs density (continuous)
    - normal distribution
- types of variables
    - categorical
    - ordinal
    - numerical
- mean, std, median, percentiles
    - median and percentiles (for ordinal) $\Leftrightarrow$ mean and std (for numerical)
    - percentile: the value below which a percentage of data falls ([link](https://www.mathsisfun.com/data/percentiles.html))
    - percentiles (including median) can be used for numerical data too
    - percentiles are not as sensitive to skewness as mean & std 
- normalized data (*standard coordinates*):
    - subtract the mean, divide by std (for each data item)
    - --> mean = 0, std = 1
    - a unified scale is often good for comparing distributions
    - on the other hand, normalizing can be bad or misleading, because it (usually) assumes a normal-like distribution and obscures:
      - the absolute scale of a distribution (e.g. income) 
      - the skewness of the distribution (e.g. income)
    - however, normalizing is a general idea ("make the scale of each distribution uniform") and as such can be done for many distributions.
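
The points above can be tried out with a small sketch. It uses only Python's built-in `statistics` module, and the income numbers are made up (one large value creates the skewness).

```python
import statistics as stats

# made-up sample: incomes in kEUR/year, with one large value to create skewness
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 41, 250]

mean = stats.mean(incomes)
std = stats.pstdev(incomes)  # population standard deviation
median = stats.median(incomes)
# quartiles = 25th, 50th and 75th percentiles
q1, q2, q3 = stats.quantiles(incomes, n=4, method='inclusive')

print(f'mean = {mean:.1f}, std = {std:.1f}')
print(f'median = {median}, quartiles = {q1}, {q2}, {q3}')

# standard coordinates: subtract the mean, divide by std (for each data item)
standardized = [(x - mean) / std for x in incomes]
print(f'standardized mean = {stats.mean(standardized):.3f}')
print(f'standardized std = {stats.pstdev(standardized):.3f}')
```

Note how the single large income pulls the mean above the 75th percentile, while the median stays in the middle of the bulk of the data. After normalizing, both the absolute scale and this skewness are harder to see.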


## Boxplot demo
Also known as a whisker plot or box-and-whisker plot:

```python
import matplotlib.pyplot as plt

# sample data
data = [20, 25, 30, 22, 28, 34, 29, 31, 35, 27, 23, 26]

# create a default box plot
plt.boxplot(data)

# add a title and labels (always a good idea)
plt.title('Box-and-Whisker Plot')
plt.ylabel('Data Values')

# clean misleading x-tick
# https://stackoverflow.com/questions/12998430/how-to-remove-xticks-from-a-plot
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # no label below the figure

# display the plot
plt.show()
```
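
To connect the picture with the percentiles discussed earlier, the numbers behind the plot can be computed by hand. This is a sketch that assumes matplotlib's default settings (whiskers reach the most extreme data points within 1.5 times the interquartile range of the box) and the `'inclusive'` quantile method of the `statistics` module, which should correspond to the usual linear interpolation between data points.

```python
import statistics as stats

# same sample data as in the box plot demo
data = [20, 25, 30, 22, 28, 34, 29, 31, 35, 27, 23, 26]

# the box: 25th percentile, median, 75th percentile
q1, median, q3 = stats.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# the whiskers: the most extreme data points within 1.5 * IQR of the box
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
low_whisker = min(x for x in data if x >= low_limit)
high_whisker = max(x for x in data if x <= high_limit)

# anything outside the limits would be drawn as a separate point (outlier)
outliers = [x for x in data if x < low_limit or x > high_limit]

print(f'box: {q1} .. {q3}, median: {median}')
print(f'whiskers: {low_whisker} .. {high_whisker}')
print(f'outliers: {outliers}')
```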


## Getting started with Problem 5

```python
from dotenv import dotenv_values
import gitlab

mytoken = dotenv_values('/home/x1234/.env')['GITLAB_TOKEN']

gl = gitlab.Gitlab(
    'https://gitlab.labranet.jamk.fi',
    private_token=mytoken
)

# authenticate
gl.auth()

# there are so many projects that reading them into
# the memory at once (with get_all=True) seems to take ages,
# so let's use an iterator instead
projects = gl.projects.list(iterator=True)

# then we can iterate through the projects;
# this tells me that I have at least 1000 projects
# -- and that's something already!
counter = 0
for project in projects:
    if counter < 1000:
        print(counter)
        counter += 1
    else:
        break
```

## On Problem 2

The original form of Problem 2 (Exercise 5.29 in [this book](https://link.springer.com/book/10.1007/978-3-319-64410-3)) is:

> *An airline runs a regular flight with 10 seats on it. The probability that a passenger turns up for the flight is 0.95. What is the smallest number of seats the airline should sell to ensure that the probability the flight is full (i.e. 10 or more passengers turn up) is bigger than 0.99? You'll need to write a simple simulation; estimate the probability by counting.*

One can just copy-paste that to an AI and get a decent answer. But the idea of the problem was mainly to recognize that the phrase

> *the probability that a passenger turns up for the flight is 0.95*

is an expression for a coin-toss-type phenomenon that we dealt with in exrc_02_theory. So let's start with a similar line as in there:

```python
from scipy import stats as st

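# n = number of coin tosses, p = probability of heads (say), size = the number of experiments
# (example values only; size must be passed by keyword, because the third positional argument of rvs is loc)
n, p, size = 10, 0.95, 50
passenger_simulation = st.binom.rvs(n=n, p=p, size=size)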
```

In solving a given math problem, one often spends a lot of time trying to refine two things over and over again:

- what is asked
- what is known.

The first one for us is: *How many seats should the airline sell?*

One also infers from the problem statement that the problem is about overbooking, i.e. that the expected answer is more than 10 seats... or is it? What happens if the airline sells exactly 10 seats? That's a good question to investigate, because it feels simpler than the original problem, and because it leads us to refine the latter thing above (what is known).

> *What is the probability that the flight is full, if the airline sells exactly 10 seats?*

This is a question that can be calculated theoretically, without any simulation. Reformulate:

> *What is the probability that one gets 10 heads with 10 coin tosses (p = 0.95)?*

The answer is $0.95^{10} \approx 0.599$, which can also be calculated like this (cf. exrc_02_theory):

```python
from scipy import stats as st
display(st.binom.pmf(k=10,n=10,p=0.95))
```
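
For comparison, the same number can be sketched with the standard library only, using the binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. The helper name `binom_pmf` is made up for this note.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n coin tosses (probability of heads p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


# 10 sold seats, all 10 passengers turn up
print(binom_pmf(k=10, n=10, p=0.95))

# with k = n the formula reduces to p**n
print(0.95**10)
```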

Okay. This leads us to suspect that the answer to the actual problem can be calculated theoretically in a similar manner... but in the problem statement we were asked to use simulation and to estimate the probability by counting (whatever that means).

We are led to the following two questions:

1. How to get the above result (0.599) using simulation (and estimate the probability by counting)?
2. How to answer the actual problem without simulation?

The latter one seems easier so let's try that first (please see the comment below the code before you run the code):

```python
from scipy import stats as st

p = 0.95

# start with this (we saw above that 10 wasn't enough)
sold_seats = 11

# start an infinite loop (which we will break out of when satisfied)
while True:

    # this line should be understandable, given the things we've done so far
    flight_full_probability = st.binom.pmf(n = sold_seats, k = 10, p = p)

    # break out when satisfied (0.99 was given in the problem statement)
    if flight_full_probability > 0.99:
        break

    # in case we didn't break out, increase the number of sold seats
    sold_seats += 1

# this should be the answer
display(sold_seats)

# we can also display the obtained probability
display(flight_full_probability)
```

Running this yields a never-ending loop (please interrupt the kernel soon), because the "understandable" line above gives the (decreasing) probability that *exactly* 10 seats are filled, while we want the (increasing) probability that *at least* 10 seats are filled.

Here's a corrected line. (One could also use <tt>cdf</tt> instead of summing the <tt>pmf</tt> results but who cares.)
```python
flight_full_probability = sum(st.binom.pmf(k=i, n=sold_seats, p=p) for i in range(10,sold_seats+1))
```

We obtain that the theoretical answer to the problem is 13 (probability = 0.9968970038319012).
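
As a cross-check, the whole search can also be sketched without scipy. The helper names `binom_pmf` and `flight_full_probability` are made up here; the logic is the same "at least 10" sum as in the corrected line.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n coin tosses (probability of heads p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def flight_full_probability(sold_seats, p=0.95, seats=10):
    """Probability that at least `seats` of the `sold_seats` passengers turn up."""
    return sum(binom_pmf(k, sold_seats, p) for k in range(seats, sold_seats + 1))


# increase the number of sold seats until the flight is full with probability > 0.99
sold_seats = 10
while flight_full_probability(sold_seats) <= 0.99:
    print(sold_seats, flight_full_probability(sold_seats))
    sold_seats += 1

print(sold_seats, flight_full_probability(sold_seats))
```

Printing the intermediate values also shows that the "at least 10" probability really does increase with the number of sold seats, so the loop is guaranteed to stop.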

Good! Then, what about question 1 above, i.e. obtaining the theoretical answer 0.599 with 10 sold seats using simulation? Well, let's again mimic the exrc_02_theory code. We can start with the following:

```python
from scipy import stats as st

# start with something relatively small
experiments = 50

simulation = st.binom.rvs(n=10, p=0.95, size=experiments)

display(simulation)
```

Instead of just displaying the simulation, we could count the cases where (at least) 10 seats were filled, and also display the induced probability:
```python

# note: scipy returns numpy arrays, so the comparison is elementwise and this is valid syntax
count = sum(simulation >= 10)
display(f'count (out of {experiments}): {count}')
display(f'induced probability: {count / experiments}')
```
Now increase the number of experiments until your induced probability is about the same as the theoretical one. In general, one increases the number of experiments until the results don't significantly change anymore.
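
To watch this convergence without rerunning cells by hand, here is a sketch for the case of exactly 10 sold seats only. It swaps scipy's `binom.rvs` for the standard library's `random` module (each passenger is one biased coin toss); the helper name `simulate_full_flights`, the fixed seed and the list of experiment counts are arbitrary choices.

```python
import random


def simulate_full_flights(experiments, p=0.95, seed=0):
    """Count the simulated flights (10 seats sold) where at least 10 passengers turn up."""
    rng = random.Random(seed)
    full = 0
    for _ in range(experiments):
        # one experiment: each of the 10 passengers turns up with probability p
        turned_up = sum(rng.random() < p for _ in range(10))
        if turned_up >= 10:
            full += 1
    return full


theoretical = 0.95**10
for experiments in (50, 500, 5000, 50000):
    count = simulate_full_flights(experiments)
    print(f'{experiments:>6} experiments: induced probability = {count / experiments:.4f}')
print(f'theoretical: {theoretical:.4f}')
```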

I'll leave the rest of the actual problem to you!