# MDI220
# Statistics

# Project 

This is the project for the course on statistics.

You must fill in this notebook and upload it to eCampus, **including figures**. Please make sure that it runs **without errors**.

You can work in teams but the final notebook, including text and code, must be **yours**. Any copy-pasting across students is strictly forbidden.

Please provide **concise answers** and **concise code**, with comments when appropriate.

## Your name:

## Imports

Please do **not** import any other library.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from scipy.stats import norm, chi2, gaussian_kde

In [None]:
import seaborn as sns

## Data

We consider the **daily electric power consumption** (in MW) in France in the period 2013-2023. The dataset is available on **eCampus**. 

Detailed information on this dataset is available [here](https://www.data.gouv.fr/fr/datasets/consommation-quotidienne-brute-regionale/).

In [None]:
# do not modify this cell
filename = 'power_consumption.csv'
df = pd.read_csv(filename)

In [None]:
df.head()

In [None]:
regions = list(df.region.unique())

In [None]:
regions

In [None]:
# average consumption per region
df[["region", "consumption"]].groupby("region").mean().astype(int)

In [None]:
# average consumption in France (MW)
df[["region", "consumption"]].groupby("region").mean().sum().astype(int)

In [None]:
# selection of a region
region = "Bretagne"
df_region = df[df.region == region]

In [None]:
# density (with kde = kernel density estimation)
sns.kdeplot(data=df_region, x="consumption", label="Data", color='blue', fill=True)
plt.title(region)
plt.xlabel("Consumption (MW)")
plt.legend() 
plt.show()

## A. Parameter estimation

Throughout the project, the daily power consumptions are assumed to be i.i.d., with a Gaussian distribution.

We use the following estimators for the mean and the variance, given $n$ observations $x_1,\ldots,x_n$:
$$
\hat \mu = \frac 1 n \sum_{i=1}^n x_i, \quad \hat{\sigma^2} = \frac 1 {n-1}\sum_{i=1}^n (x_i - \hat \mu)^2
$$
The corresponding estimate of the standard deviation is:
$$
\hat \sigma = \sqrt{\hat{\sigma^2}}
$$
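
As a sanity check of these formulas, the following minimal sketch computes both estimators with NumPy and draws a Gaussian sample of the same size. The array `x` is synthetic (its mean, scale, and seed are arbitrary assumptions, not values from the dataset); with the real data, `x` would be the consumption column of the selected region.

```python
import numpy as np

# Synthetic stand-in for the consumption values (assumption: not the real data).
rng = np.random.default_rng(0)
x = rng.normal(loc=3000.0, scale=500.0, size=1000)
n = len(x)

# Unbiased estimators: sample mean and variance with the 1/(n-1) factor.
mu_hat = x.mean()
sigma2_hat = x.var(ddof=1)      # ddof=1 gives the 1/(n-1) normalisation
sigma_hat = np.sqrt(sigma2_hat)

# New data drawn from the fitted Gaussian model, same number of samples.
x_generated = rng.normal(loc=mu_hat, scale=sigma_hat, size=n)
```

Note that `np.var` and `np.std` default to `ddof=0` (the biased $1/n$ version), whereas the pandas methods `Series.var` and `Series.std` default to `ddof=1`.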

Consider the region Bretagne. 

1. Show that the estimators for the mean and the variance are unbiased.

Your answer:

2. Give the values obtained for the mean and the standard deviation using these estimators.

3. Generate new data using the corresponding Gaussian model, with the same number of samples.

4. Plot the *kde* (kernel density estimate) of real data and the *kde* of generated data on the same figure.

5. Do the same for the days corresponding to winter (from December 22 to March 21), after updating the model.

6. In which case does the Gaussian model seem more appropriate?
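
For the winter selection in question 5, one possible approach is a boolean mask built from the month and the day. The helper `is_winter` below is hypothetical, and the example dates are hand-made; it only illustrates the boundary handling (December 22 and March 21 both included).

```python
import pandas as pd

def is_winter(dates: pd.Series) -> pd.Series:
    """Boolean mask for days from December 22 to March 21 (inclusive)."""
    month, day = dates.dt.month, dates.dt.day
    return (
        ((month == 12) & (day >= 22))
        | month.isin([1, 2])
        | ((month == 3) & (day <= 21))
    )

# Small hand-made example around the season boundaries.
dates = pd.Series(pd.to_datetime(
    ["2022-12-21", "2022-12-22", "2023-01-15", "2023-03-21", "2023-03-22"]
))
mask = is_winter(dates)
```

With the real dataframe, something like `df_region[is_winter(pd.to_datetime(df_region["date"]))]` would select the winter days, assuming the date column is named `date` (check `df.head()` for the actual name).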

## B. Goodness of fit

We would like to confirm and quantify the observation of part A.

We propose the following metric to measure the dissimilarity between two probability density functions (pdfs) $f$ and $g$ (with respect to the Lebesgue measure):
$$
d(f, g) = \frac 1 2 \int |f(x) - g(x)| \mathrm dx.
$$
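
In practice the integral has to be approximated numerically. The sketch below uses a plain Riemann sum on a uniform grid, which is only one possible choice; the function name `dissimilarity`, the grid margins, and the synthetic sample are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def dissimilarity(f, g, grid):
    """Approximate d(f, g) = 1/2 * integral |f - g| by a Riemann sum on a uniform grid."""
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(np.abs(f(grid) - g(grid))) * dx

# Synthetic sample (assumption: stands in for the real consumption values).
rng = np.random.default_rng(0)
x = rng.normal(loc=3000.0, scale=500.0, size=2000)

kde = gaussian_kde(x)                                  # callable density estimate
model = norm(loc=x.mean(), scale=x.std(ddof=1)).pdf    # fitted Gaussian density

# Grid wide enough to cover essentially all the mass of both densities.
grid = np.linspace(x.min() - 1500.0, x.max() + 1500.0, 5000)
d = dissimilarity(kde, model, grid)
```

The grid must extend well beyond the data range: if it truncates the tails of either density, the approximation underestimates the integral.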

1. Show that $d(f,g) \in [0,1]$, with $d(f,g)=0$ if and only if $f=g$ almost everywhere (for the Lebesgue measure) and $d(f,g)=1$ if and only if $f$ and $g$ have disjoint supports.

Your answer:

2. For the region Bretagne, compute the dissimilarity between the *kde* of real data and the density of the Gaussian model.

3. Do the same for the region Bretagne in winter and check your conclusion of part A.

4. Do the same for all regions. Give the result as a single dataframe.

5. Which region is best fitted by a Gaussian model in winter?

6. For this region and this season, plot the *kde* of real data and the *kde* of generated data on the same figure, as in part A.

## C. Bayesian statistics

We would like to include prior knowledge on the estimation, using Bayesian statistics.

Consider the region Bretagne in winter. We focus on the mean $\mu$, assuming the standard deviation is known and equal to $\sigma=400$MW. We assume a Gaussian prior on $\mu$ with mean $\mu_0=3500$MW and standard deviation $\sigma_0=500$MW. 
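
For reference, the standard conjugate update for a Gaussian mean with known variance can be sketched as follows. The function name `gaussian_posterior` is hypothetical and the toy numbers are chosen only so that the result can be checked by hand; they are not the project's data.

```python
import numpy as np

def gaussian_posterior(x, sigma, mu0, sigma0):
    """Posterior mean and std of mu for a N(mu0, sigma0^2) prior and N(mu, sigma^2) data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Precisions (inverse variances) add up: prior precision + n * data precision.
    precision = 1.0 / sigma0**2 + n / sigma**2
    post_var = 1.0 / precision
    # Posterior mean: precision-weighted average of the prior mean and the data.
    post_mean = post_var * (mu0 / sigma0**2 + x.sum() / sigma**2)
    return post_mean, np.sqrt(post_var)

# Toy check with hand-computable numbers (not the project's data).
post_mean, post_std = gaussian_posterior([1.0, 2.0, 3.0], sigma=1.0, mu0=0.0, sigma0=1.0)
```

In the toy case the posterior precision is $1 + 3 = 4$, so the posterior standard deviation is $0.5$ and the posterior mean is $6/4 = 1.5$.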

1. Give the posterior distribution of $\mu$, using the parameters $\sigma$, $\mu_0$, $\sigma_0$ and the $n$ observations $x_1,\ldots,x_n$.

Your answer:

2. Give the parameters of the posterior distribution obtained for each year from 2013 to 2023, considered independently, in a single dataframe.

3. Plot the density of the posterior distribution in 2023 and the *kde* of real data in 2023 on the same plot.

4. Discuss the results, comparing to those obtained in part A.

Your answer:

## D. Hypothesis testing

Consider the region Bretagne. While the standard deviation is equal to 400MW in winter, you would like to test the hypothesis that it was 500MW in 2023. The mean is assumed to be known and equal to 3200MW.

1. Propose a statistical test at level $\alpha$.

Your answer:

2. Provide the result of this test for $\alpha=1\%$.
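
One possible reading of this setting takes $\sigma = 400$ as the null hypothesis and a larger standard deviation as the alternative, which leads to a one-sided chi-squared test on the sum of squared deviations from the known mean. The sketch below follows that reading; the function name `variance_test` and the synthetic sample are assumptions, and other formulations of the hypotheses are possible.

```python
import numpy as np
from scipy.stats import chi2

def variance_test(x, mu, sigma_null, alpha):
    """One-sided chi-squared test of H0: sigma = sigma_null against a larger sigma, mean known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Under H0, sum((x_i - mu)^2) / sigma_null^2 follows a chi-squared law with n degrees of freedom.
    stat = np.sum((x - mu) ** 2) / sigma_null**2
    threshold = chi2.ppf(1 - alpha, df=n)
    return stat, threshold, bool(stat > threshold)

# Synthetic winter-like sample drawn with sigma = 500 (assumption: not the real data).
rng = np.random.default_rng(0)
x = rng.normal(loc=3200.0, scale=500.0, size=90)
stat, threshold, reject = variance_test(x, mu=3200.0, sigma_null=400.0, alpha=0.01)
```

Because the mean is known, the statistic has $n$ degrees of freedom rather than $n-1$.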

## E. Testing independence

Consider the power consumption in Bretagne and Provence-Alpes-Côte d'Azur in winter. 

1. Visualize the joint distribution of power consumption in these 2 regions using seaborn.

2. Do you think the power consumption in these two regions is independent?

Your answer:

3. Propose a chi-squared ($\chi^2$) test for the independence of the power consumption in these two regions at level $\alpha$.

Your answer:

4. Give the result of this test for $\alpha=1\%$. 
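
Since the consumption is continuous, a chi-squared independence test first requires grouping the values into classes. The sketch below bins each variable into `k` quantile classes and compares observed and expected counts; the function name `chi2_independence`, the choice `k=3`, and the synthetic pair of samples are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def chi2_independence(x, y, k=3, alpha=0.01):
    """Chi-squared independence test after binning x and y into k quantile classes each."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Bin edges at empirical quantiles, so each marginal class holds about n/k points.
    x_edges = np.quantile(x, np.linspace(0, 1, k + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, k + 1))
    observed, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    n = observed.sum()
    # Expected counts under independence: product of the marginals divided by n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    stat = np.sum((observed - expected) ** 2 / expected)
    dof = (k - 1) ** 2
    threshold = chi2.ppf(1 - alpha, df=dof)
    return stat, dof, bool(stat > threshold)

# Synthetic, strongly dependent pair (assumption: stands in for the two regions).
rng = np.random.default_rng(0)
x = rng.normal(3500.0, 400.0, size=900)
y = 1.5 * x + rng.normal(0.0, 100.0, size=900)
stat, dof, reject = chi2_independence(x, y)
```

The number of classes is a trade-off: more classes capture finer dependence but lower the expected count per cell, which weakens the chi-squared approximation.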

## F. Confidence intervals

Consider the region Bretagne in winter. 

1. Provide a $95\%$ confidence interval for the mean power consumption.

2. Give the result of a Student test at level $5\%$ for the null hypothesis that the mean consumption in 2023 is equal to 3100MW.
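
As a rough illustration of the interval in question 1, the sketch below uses a Gaussian quantile, since only `norm` and `chi2` are imported in this notebook. This is a large-sample approximation of the Student interval, not the exact one; the function name `mean_confidence_interval` and the synthetic sample are assumptions.

```python
import numpy as np
from scipy.stats import norm

def mean_confidence_interval(x, level=0.95):
    """Approximate confidence interval for the mean, using a Gaussian quantile."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat = x.mean()
    # Standard error of the mean, with the unbiased (ddof=1) standard deviation.
    se = x.std(ddof=1) / np.sqrt(n)
    # Two-sided quantile; for small n a Student quantile with n-1 degrees of freedom is wider.
    q = norm.ppf(0.5 + level / 2)
    return mu_hat - q * se, mu_hat + q * se

# Synthetic sample (assumption: stands in for the winter consumption values).
rng = np.random.default_rng(0)
x = rng.normal(3500.0, 400.0, size=900)
low, high = mean_confidence_interval(x)
```

A two-sided test at level $5\%$ of a given mean value is equivalent to checking whether that value falls outside the corresponding $95\%$ interval.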