In [1]:
from IPython.display import Markdown, display
display(Markdown("header.md"))

<div>
    <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
    <span>
        <h1 style="padding-bottom:5px;"> AI Booster Week 02 - Python for Data Science </h1>
        <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS) <br/>
         Paris | © Antoine SCHERRER
    </span>
</div>

Please make sure you have a working installation of Jupyter Notebook / Jupyter Lab, with Python 3.6+ up and running.

## Naming conventions

Since we will implement functions that are already available in the Python standard library or in other libraries, you must *prefix* every function name with `msds_`.

For instance, the function implementing the `mean` function should be named `msds_mean`.

For every function you write, you will also need to write a test function, which should be named `test_msds_[function_name]`.

For instance, the test function for the mean will be: `test_msds_mean`.
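As an illustration of the convention, here is a minimal sketch of such a pair. The body of `msds_mean` and the values used in the test are assumptions chosen for the example, not a reference solution.

```python
def msds_mean(values):
    """Return the arithmetic mean of a non-empty iterable of numbers."""
    data = list(values)  # accept any iterable, including generators
    if not data:
        raise ValueError("msds_mean requires at least one value")
    return sum(data) / len(data)


def test_msds_mean():
    # Hand-checked cases: (1 + 2 + 3 + 4) / 4 = 2.5, and a single value is its own mean.
    assert msds_mean([1, 2, 3, 4]) == 2.5
    assert msds_mean(iter([10])) == 10


test_msds_mean()
```

Calling the test function directly, as on the last line, is enough in a notebook cell; a test runner such as `pytest` would also pick it up by its `test_` prefix.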

All function names should be in snake case (no camel case!).

When creating classes, follow these rules:
 - class names should be in camel case
 - method names should be in snake case
 - attribute names should be in 

## Exercise difficulty

Every exercise will be prefixed with an indication of its difficulty:
 - [easy]: for very easy exercises
 - [moderate]: for intermediate-level exercises
 - [advanced]: for advanced students

Advanced exercises are not mandatory.


## Session 03 - Bivariate statistics - Practice

## Qualitative data


### [moderate] Compute expected frequency matrix

Given two qualitative data sets (iterables), compute the expected frequency matrix (as counts) under the assumption that the two data sets are independent.

Let's define:
 - $c$ as the number of distinct values in the $X$ dataset
 - $l$ as the number of distinct values in the $Y$ dataset
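Under independence, the usual expected count for cell $(i, j)$ is $E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}$, where $n_{i\cdot}$ is the number of observations with the $i$-th value of $Y$, $n_{\cdot j}$ the number with the $j$-th value of $X$, and $n$ the total number of observations. The sketch below assumes this is the formula meant here. The function name `msds_expected_frequencies`, its return format and the sample data are hypothetical choices for illustration.

```python
from collections import Counter


def msds_expected_frequencies(x, y):
    """Return (x_levels, y_levels, matrix) of expected counts under independence.

    The matrix has l rows (distinct values of y) and c columns (distinct values of x).
    """
    x, y = list(x), list(y)
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not x:
        raise ValueError("x and y must not be empty")
    n = len(x)
    x_counts = Counter(x)  # marginal counts of X
    y_counts = Counter(y)  # marginal counts of Y
    x_levels = list(x_counts)  # distinct values, in order of first appearance
    y_levels = list(y_counts)
    matrix = [
        [y_counts[y_value] * x_counts[x_value] / n for x_value in x_levels]
        for y_value in y_levels
    ]
    return x_levels, y_levels, matrix


# Made-up sample: 3 "a" and 1 "b" in X, 2 "u" and 2 "v" in Y.
x_levels, y_levels, expected = msds_expected_frequencies(
    ["a", "a", "a", "b"], ["u", "u", "v", "v"]
)
print(x_levels, y_levels, expected)
```

The expected counts always sum to $n$, which makes a convenient sanity check in the associated test function.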


### [moderate] Compute $\chi^2$ statistics

Using the formula from the course, compute the expected frequency matrix and the actual frequency matrix (from Session_03!).

Then compute the $\chi^2$ quantity.

Also compute the $\phi$ and $V_{\text{cramer}}$ statistics.
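A possible sketch of these three quantities, assuming the common definitions $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, $\phi = \sqrt{\chi^2 / n}$ and $V_{\text{cramer}} = \sqrt{\chi^2 / (n \cdot \min(l - 1, c - 1))}$ (check them against the course formulas). The function names and the observed table are hypothetical, and the table is passed as a list of rows of counts.

```python
import math


def msds_chi2(observed):
    """Chi-squared statistic of a contingency table given as a list of rows.

    Assumes every row and column total is non-zero (otherwise an expected
    count would be zero and the division below would fail).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    return chi2


def msds_phi(chi2, n):
    """Phi coefficient: square root of chi-squared divided by the sample size."""
    return math.sqrt(chi2 / n)


def msds_cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V: phi rescaled by the smaller table dimension minus one."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Made-up 2 x 2 table of observed counts (n = 60).
table = [[10, 20], [20, 10]]
chi2 = msds_chi2(table)
print(chi2, msds_phi(chi2, 60), msds_cramers_v(chi2, 60, 2, 2))
```

For a 2 x 2 table, $\min(l - 1, c - 1) = 1$, so $\phi$ and $V_{\text{cramer}}$ coincide.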

Apply your results to various pairs of qualitative variables from datasets you have already explored.

### [moderate] Study how smoking relates to lung cancer

Wynder and Graham's case-control study of smoking and lung cancer is a historically important study that compared the smoking histories of 605 cases with lung cancer to those of 780 controls without cancer. Data on average tobacco use during the past 20 years were classified as follows:
 - 5 = Chain smoker (35 cigarettes or more per day for at least 20 years)
 - 4 = Excessive smoker (21-34 cigarettes per day for more than 20 years)
 - 3 = Heavy smoker (16-20 cigarettes per day for more than 20 years)
 - 2 = Moderately heavy smoker (10-15 cigarettes per day for more than 20 years)
 - 1 = Light smoker (1-9 cigarettes per day for more than 20 years)
 - 0 = Non-smoker (less than 1 cigarette per day for more than 20 years)

If the patient smoked for less than 20 years, the amount of smoking was reduced in proportion to its duration.

This is the contingency table from the study:

```
CT = {
        '5': {'YES': 123,'NO': 64}, 
        '4': {'YES': 186,'NO': 98}, 
        '3': {'YES': 213,'NO': 274}, 
        '2': {'YES': 61,'NO': 147}, 
        '1': {'YES': 14,'NO': 82}, 
        '0': {'YES': 8,'NO': 115}, 
    }
```

Source: https://www.scielosp.org/pdf/bwho/v83n2/v83n2a15.pdf

Use this contingency table to determine $\chi^2$, $\phi$ and Cramer's V statistics for testing the independence between these 2 variables.
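One way to start is to turn the nested dictionary into a matrix of observed counts and check its margins against the study description (605 cases, 780 controls). The sketch below does this and then applies the common definitions of $\chi^2$, $\phi$ and Cramer's V, which are assumed to match the course formulas. The variable names are illustrative.

```python
import math

CT = {
    "5": {"YES": 123, "NO": 64},
    "4": {"YES": 186, "NO": 98},
    "3": {"YES": 213, "NO": 274},
    "2": {"YES": 61, "NO": 147},
    "1": {"YES": 14, "NO": 82},
    "0": {"YES": 8, "NO": 115},
}

levels = list(CT)          # smoking categories, one row each
outcomes = ["YES", "NO"]   # lung cancer status, one column each
observed = [[CT[level][outcome] for outcome in outcomes] for level in levels]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(levels))
    for j in range(len(outcomes))
)
phi = math.sqrt(chi2 / n)
cramers_v = math.sqrt(chi2 / (n * min(len(levels) - 1, len(outcomes) - 1)))

print(f"n={n}, column totals={col_totals}")
print(f"chi2={chi2:.2f}, phi={phi:.3f}, V={cramers_v:.3f}")
```

Because the table has only two columns, $\min(l - 1, c - 1) = 1$ and Cramer's V equals $\phi$ here. The $\chi^2$ value is to be compared with a $\chi^2$ distribution with $(6 - 1)(2 - 1) = 5$ degrees of freedom.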

## Quantitative data


### [easy] Correlation between weights and heights

Analyze the `weights_heights.csv` dataset using the previous functions to evaluate the correlation between weights and heights.
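If the previous functions include a Pearson correlation coefficient, the analysis reduces to one call on the two columns. The sketch below assumes the standard definition $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The function name and the sample values are made up, since the column names of `weights_heights.csv` are not given here.

```python
import math


def msds_correlation(x, y):
    """Pearson correlation coefficient of two equally long numeric iterables."""
    x, y = list(x), list(y)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    # A constant column has zero spread, so the coefficient is undefined.
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariation / math.sqrt(spread_x * spread_y)


# Made-up heights (cm) and weights (kg), not taken from the dataset.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 91]
print(msds_correlation(heights, weights))
```

A value close to 1 indicates a strong increasing linear relationship, a value close to -1 a strong decreasing one, and a value near 0 no linear relationship.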


### [moderate] Correlation matrix

Write a function that computes the correlation matrix of a dataframe.
It is a symmetric matrix that contains the correlation coefficient for each pair of columns.

Use your implementation on the `wine.csv` dataset.
Validate your implementation by comparing your results with those from the `statistics` or `statsmodels` packages.
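A minimal sketch of such a function, assuming Pearson correlation and numeric columns only. It is written against a plain mapping of column names to values so that it runs without extra packages. A pandas DataFrame should behave the same way, since `list(df)` yields the column names and `df[name]` a column, but that is an assumption to verify on `wine.csv`. All names and sample values below are hypothetical.

```python
import math


def msds_correlation(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)


def msds_correlation_matrix(table):
    """Return (names, matrix) where matrix[i][j] correlates columns i and j."""
    names = list(table)
    columns = [list(table[name]) for name in names]
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        # Compute the upper triangle only and mirror it: the matrix is symmetric.
        for j in range(i, size):
            value = msds_correlation(columns[i], columns[j])
            matrix[i][j] = value
            matrix[j][i] = value
    return names, matrix


# Made-up table with three numeric columns.
sample = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]}
names, matrix = msds_correlation_matrix(sample)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

With pandas available, `DataFrame.corr()` (Pearson by default) gives another reference to compare against.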

### [moderate] Visualization of correlation matrix

Write a function that draws a heatmap based on the correlation matrix computed before. 
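One possible approach uses matplotlib's `imshow`, with the colour scale fixed to the range of a correlation coefficient. This is only a sketch: the function name, its `(names, matrix)` input format and the colour map are assumptions, and other libraries would work equally well.

```python
import matplotlib.pyplot as plt


def msds_plot_correlation_heatmap(names, matrix, ax=None):
    """Draw a correlation matrix as a heatmap and return the Axes used."""
    if ax is None:
        _, ax = plt.subplots()
    # Fix the colour scale to [-1, 1] so that plots are comparable across datasets.
    image = ax.imshow(matrix, vmin=-1, vmax=1, cmap="coolwarm")
    positions = range(len(names))
    ax.set_xticks(positions)
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticks(positions)
    ax.set_yticklabels(names)
    ax.figure.colorbar(image, ax=ax, label="correlation")
    return ax
```

In a notebook, calling `msds_plot_correlation_heatmap(names, matrix)` in a cell is enough to display the figure; in a script, follow it with `plt.show()`.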

### [advanced] Auto-correlation function for time series data

When data corresponds to the variation of some quantity over time, it is called a time series.

To study time series, one can plot the auto-correlation function, which measures how strongly values separated by a given time lag are correlated.

The autocorrelation function is expected to decay rapidly, unless the signal exhibits a particular property called long-range dependence.

Write a function that computes the auto-correlation function of a given time series (use `A1H.csv` and refer to the associated paper for a description of the data: http://www3.dsi.uminho.pt/pcortez/data/itraffic.html).

For the definition of the formula, you can refer to: https://real-statistics.com/time-series-analysis/stochastic-processes/autocorrelation-function/

Only compute for lags between 0 and 200.

Compare the observed autocorrelation function with the one you obtain for a random normal variable.
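A sketch of one common sample definition, $r_k = \frac{\sum_{i=1}^{n-k} (y_i - \bar{y})(y_{i+k} - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, applied to simulated normal noise for the comparison. Check this definition against the linked reference before relying on it. The function name, the noise length and the seed are arbitrary choices, and the series is assumed not to be constant.

```python
import random


def msds_autocorrelation(series, max_lag=200):
    """Return the sample autocorrelation for lags 0..max_lag as a list."""
    data = list(series)
    n = len(data)
    mean = sum(data) / n
    deviations = [value - mean for value in data]
    denominator = sum(d * d for d in deviations)  # zero only for a constant series
    max_lag = min(max_lag, n - 1)  # a lag cannot exceed the series length
    acf = []
    for lag in range(max_lag + 1):
        numerator = sum(deviations[i] * deviations[i + lag] for i in range(n - lag))
        acf.append(numerator / denominator)
    return acf


# Baseline for comparison: independent draws from a standard normal distribution.
random.seed(0)
noise = [random.gauss(0, 1) for _ in range(2000)]
noise_acf = msds_autocorrelation(noise, max_lag=200)
print(noise_acf[:5])
```

For independent noise the values at non-zero lags should stay close to 0, so a slowly decaying autocorrelation in the traffic data would stand out against this baseline.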

## Object-oriented programming

### [advanced] Convert all your functions and organize them in classes using OOP