<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60">

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych. Operational Programme Digital Poland 2014-2020
<hr>

<img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'>


Project co-financed by the European Union under the European Regional Development Fund,
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation".   
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>

# Statistical machine learning - Notebook 7, version for students
**Authors: Dorota Celińska-Kopczyńska, Michał Ciach**  


## Description


In today's class, we will work with the diagnostics of a linear model, i.e., we will check whether the assumptions of linear regression are satisfied. We will also discuss the individual and joint significance of the parameters in our model and look for outlying observations.

The assumptions we will check can be summarized as the LINE rule:    

- **L**inear trend,
- **I**ndependent residuals (lack of autocorrelation),
- **N**ormally distributed residuals,
- **E**qual variance of residuals for all values of independent variables (homoskedasticity).

We will check them visually by creating and analyzing the following diagnostic plots:   

- The residual value vs the fitted value,
- The square root of the absolute value of standardized residuals vs the fitted value,
- The histogram of residuals.

The first plot is used to check whether the relationship between the response (the dependent variable) and the predictors (the independent variables) is linear, and as a very rough check of whether the residuals are uncorrelated. We expect the values to be distributed symmetrically around the line $y=0$. However, as stated in the lecture, this plot may be misleading when the random disturbances are non-spherical. That is why we encourage you to also perform the Ramsey RESET test.  

The second plot is used to check homoskedasticity, i.e., whether the variance of the residuals is the same for all values of the independent variables. We expect the values to be distributed evenly around a straight horizontal line.    

The histogram is used to visualize the distribution of residuals.  
You can also use a qq-plot in this case if you know how to create and interpret it.
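
The quantities behind the three plots can be sketched as follows. This is only a minimal illustration on synthetic data: the data-generating line, the sample size and all variable names are assumptions made for the example and are not part of the course data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that satisfies the LINE assumptions (illustrative only).
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta_hat
residuals = y - fitted

# Standardized residuals: centre, then scale by the sample standard deviation.
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)
sqrt_abs_standardized = np.sqrt(np.abs(standardized))

# Plot 1: residuals vs fitted             -> points scattered evenly around y = 0
# Plot 2: sqrt(|standardized|) vs fitted  -> no trend, roughly horizontal band
# Plot 3: histogram of residuals          -> roughly bell-shaped
counts, bin_edges = np.histogram(residuals, bins=20)
```

With `plotly.express` imported as `px`, the arrays can be passed directly to `px.scatter(x=fitted, y=residuals)`, `px.scatter(x=fitted, y=sqrt_abs_standardized)` and `px.histogram(x=residuals)`.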

Finally, inspect either the influence plot or the leverage-resid2 plot,  [implemented in `statsmodels`](https://www.statsmodels.org/dev/examples/notebooks/generated/regression_plots.html).  
Both plots are used to detect outliers that highly influence the model parameters.
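
Both plots are built from the leverage of each observation and a scaled version of its residual. As a rough sketch of what they summarize, the snippet below computes leverage and Cook's distance by hand with `numpy`. The data and the planted outlier are hypothetical, and the formulas are the standard textbook ones rather than a copy of the `statsmodels` internals.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=0.5, size=n)

# Plant one observation far from the rest in both x and y (hypothetical outlier).
x[0], y[0] = 30.0, 0.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Leverage h_ii is the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
# With a thin QR decomposition X = QR we get H = Q Q^T, so h_ii = sum_j Q_ij^2.
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

# Internally studentized residuals and Cook's distance.
studentized = residuals / np.sqrt(sigma2_hat * (1.0 - leverage))
cooks_d = studentized**2 * leverage / (p * (1.0 - leverage))

most_influential = int(np.argmax(cooks_d))
```

An observation is worth a closer look when it combines high leverage with a large residual, which is what a large Cook's distance indicates.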

In [4]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
!gdown https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
!gdown https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH

Downloading...
From: https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
To: /content/BDL municipality incomes 2015-2020.csv
100% 228k/228k [00:00<00:00, 107MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
To: /content/BDL municipality area km2 2015-2020.csv
100% 180k/180k [00:00<00:00, 121MB/s]
Downloading...
From: https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH
To: /content/BDL municipality population 2015-2020.csv
100% 222k/222k [00:00<00:00, 122MB/s]


## Data & library imports

In [3]:
import pandas as pd
import plotly.express as px
import numpy as np
import statsmodels.api as sm
from scipy.linalg import svd

In [5]:
income = pd.read_csv('BDL municipality incomes 2015-2020.csv', sep=';', dtype={'Code': 'str'})
population = pd.read_csv('BDL municipality population 2015-2020.csv', sep='\t', dtype={'Code': 'str'})
area = pd.read_csv('BDL municipality area km2 2015-2020.csv', sep='\t', dtype={'Code': 'str'})

In [6]:
voivodeship_names = {
    '02': 'Dolnośląskie',
    '04': 'Kujawsko-pomorskie',
    '06': 'Lubelskie',
    '08': 'Lubuskie',
    '10': 'Łódzkie',
    '12': 'Małopolskie',
    '14': 'Mazowieckie',
    '16': 'Opolskie',
    '18': 'Podkarpackie',
    '20': 'Podlaskie',
    '22': 'Pomorskie',
    '24': 'Śląskie',
    '26': 'Świętokrzyskie',
    '28': 'Warmińsko-mazurskie',
    '30': 'Wielkopolskie',
    '32': 'Zachodniopomorskie'
}

In [7]:
code_list = [s[:2] for s in income["Code"]]
name_list = [voivodeship_names[code] for code in code_list]
income['Voivodeship'] = name_list

## Diagnostics when assumptions are satisfied

**Exercise 1.** In this exercise, we will inspect the diagnostics of a model for which the assumptions of linear regression are satisfied. We will work with the generated data from the file `simulated_regression.csv`.
We generated a data set with a given multidimensional distribution and correlation matrix. The variables are not highly correlated with each other. We consider a model in which y is the explained variable, and A, B, C and the constant are the independent ones.

Our data are generated according to the linear model: $y = 0.4A + 0.5B + 0.5C + \varepsilon$

Inspect the descriptive statistics of the data set.
Visualize the relationships between dependent and independent variables.
Perform a regression of y on the constant, A, B, and C. Compare the resulting estimates (and their statistical significance) with the true values.

Compute the fitted values and their confidence intervals using the `get_prediction()` method of the fitted `statsmodels` model.  
Use them to compute the residuals $\hat{\epsilon}_i = Y_i - X_i\hat{\beta}$.   
Calculate the standardized residuals $(\hat{\epsilon}_i - \text{mean}(\hat{\epsilon}))/\text{sd}(\hat{\epsilon})$.

Conduct the diagnostics of the model. Inspect the diagnostic plots and the results of the Ramsey RESET test. Note that we work with a model that satisfies the assumptions (we know the ground truth because we work with simulated data!).
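
The idea behind the RESET test is to add powers of the fitted values to the regression and check with an F-test whether they improve the fit; if they do, the linear specification is suspect. The sketch below is a hand-written version under the assumption of homoskedastic errors. The helper `reset_test` and the synthetic data are hypothetical; `statsmodels` also provides `statsmodels.stats.diagnostic.linear_reset`, which may differ in defaults and details.

```python
import numpy as np
from scipy import stats


def reset_test(X, y, max_power=3):
    """Ramsey RESET F-test: do powers of the fitted values add explanatory power?"""
    n = X.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    rss_restricted = np.sum((y - fitted) ** 2)

    # Augment the design matrix with fitted^2, ..., fitted^max_power.
    powers = np.column_stack([fitted**k for k in range(2, max_power + 1)])
    X_aug = np.column_stack([X, powers])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    rss_unrestricted = np.sum((y - X_aug @ beta_aug) ** 2)

    q = powers.shape[1]            # number of added regressors
    df_resid = n - X_aug.shape[1]  # residual degrees of freedom
    f_stat = ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_resid)
    p_value = stats.f.sf(f_stat, q, df_resid)
    return f_stat, p_value


rng = np.random.default_rng(3)
n = 300
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])

# Correctly specified linear model vs a quadratic relationship fitted as linear.
y_linear = 1.0 + 2.0 * x + rng.normal(size=n)
y_quadratic = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)

f_lin, p_lin = reset_test(X, y_linear)
f_quad, p_quad = reset_test(X, y_quadratic)
```

A small p-value, as expected for `y_quadratic`, suggests that the linear functional form is misspecified.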

## Diagnostics when assumptions are not satisfied

**Exercise 2.**  

In this exercise, we will predict the income of a municipality in 2020 based on its population and voivodeship.

Create a data frame with the territorial code, income, population and voivodeships of municipalities in 2020 by using `pd.merge` to perform a join with the `Code` variable as the key. Remove rows with missing values.   
Use the `pd.get_dummies()` function to encode the voivodeship for each municipality with dummy variables.   
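
The join and the dummy encoding can be sketched on a toy example. The frames below are invented for illustration: the column names `income_2020` and `population_2020`, the codes and the values are hypothetical, and the real BDL files may be laid out differently.

```python
import pandas as pd

# Toy frames with hypothetical column names and values.
income_toy = pd.DataFrame({
    "Code": ["0201011", "0201022", "1465011", "3003011"],
    "Voivodeship": ["Dolnośląskie", "Dolnośląskie", "Mazowieckie", "Wielkopolskie"],
    "income_2020": [1.2e8, 3.5e7, 2.1e10, None],
})
population_toy = pd.DataFrame({
    "Code": ["0201011", "0201022", "1465011", "3003011"],
    "population_2020": [38000, 14000, 1790000, 9000],
})

# Inner join on the territorial code, then drop rows with missing values.
merged = pd.merge(income_toy, population_toy, on="Code", how="inner").dropna()

# One 0/1 column per voivodeship; dtype=float keeps the design matrix numeric.
dummies = pd.get_dummies(merged["Voivodeship"], dtype=float)
design = pd.concat([merged[["population_2020"]], dummies], axis=1)
```

Note that the dummy columns of each row sum to one, which is worth keeping in mind when deciding whether the model should also contain an intercept.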

Estimate the model and inspect its summary.  
Are the variables jointly significant according to the F-test?  
Are all individual variables significant according to the t-test?
What are the interpretations of the parameters?  
Can you use a model with an intercept in this exercise? Why / why not? If yes, what is its interpretation?

Conduct the diagnostics of the model. Decide which assumptions are satisfied to an appropriate degree.  
If you detect an outlying observation, remove it from the data set, run the calculations and diagnostics again and check if it improves the model fit.  

If you detect heteroskedasticity (non-constant variance of residuals), transforming the data may help.  
You may transform both the dependent and independent variable.  
Transforming the latter changes the functional relationship between the variables (i.e. whether they are linearly related), while transforming the former changes both the relationship and the structure of the residual variance.  
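
As a hedged illustration of the effect of a transformation, the sketch below uses synthetic data with multiplicative noise, so that the spread of the response grows with the predictor. The data, the helper `spread_trend` and the choice of the logarithm are assumptions made for this example; a different transformation may suit the municipality data better.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data with multiplicative noise: the spread of y grows with x.
n = 400
x = rng.uniform(1_000, 100_000, size=n)
y = 5.0 * x * np.exp(rng.normal(scale=0.3, size=n))


def spread_trend(x_col, y_col):
    """Correlation between |residual| and fitted value after a simple OLS fit."""
    X = np.column_stack([np.ones_like(x_col), x_col])
    beta, *_ = np.linalg.lstsq(X, y_col, rcond=None)
    fitted = X @ beta
    return np.corrcoef(fitted, np.abs(y_col - fitted))[0, 1]


# Raw scale: the absolute residuals grow with the fitted values.
trend_raw = spread_trend(x, y)

# Log-log scale: the noise becomes additive with constant variance.
trend_log = spread_trend(np.log(x), np.log(y))
```

A clearly positive `trend_raw` together with a `trend_log` close to zero indicates that taking logarithms of both variables stabilizes the residual variance in this synthetic example.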

Estimate the average error in PLN that you would make if you used your model to predict the income of a municipality from its population.

## Linear regression and SVD

The estimator of $\beta$ is given by the equation
$$\hat{\beta} = (X^TX)^{-1}X^TY.$$
From a computational point of view, using this equation is inefficient and can lead to numerical errors.  
A more efficient approach is based on the [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) of the $X$ matrix, given by $X = U\Sigma V^T$.  
We have already used the SVD decomposition to implement the Principal Component Analysis in one of the previous notebooks.   

**Exercise 3.** In this exercise, we will implement a linear regression method using the SVD decomposition and create a linear regression model to predict the income of a municipality in 2020 based on its population and voivodeship. Use the SVD decomposition of $X$ to obtain a more efficient formula for $\hat{\beta}$.  

Implement a function that takes a numpy vector `Y` of values of the dependent variable and a numpy array `X` of independent variables and, using the SVD decomposition, computes and returns the estimated regression parameters.
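
One possible approach, assuming $X$ has full column rank: substituting $X = U\Sigma V^T$ (thin SVD) into the estimator gives $X^TX = V\Sigma^2V^T$, and therefore $\hat{\beta} = V\Sigma^{-1}U^TY$. The sketch below implements this formula with `np.linalg.svd`; the function name `svd_regression` and the synthetic data are hypothetical. `scipy.linalg.svd` accepts the same `full_matrices` argument and returns the factors in the same order.

```python
import numpy as np


def svd_regression(X, Y):
    """Least-squares coefficients via the thin SVD X = U diag(s) V^T.

    Assumes X has full column rank, so every singular value is nonzero.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # beta_hat = V diag(1/s) U^T Y
    return Vt.T @ ((U.T @ Y) / s)


rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

beta_svd = svd_regression(X, Y)

# Reference solution from the normal equations, for comparison only.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)
```

Dividing by the singular values replaces the explicit inversion of $X^TX$, whose condition number is the square of that of $X$.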

Compare the results of your implementation with the one from `statsmodels`. You can find the relevant documentation [here](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html).  

