# Spurious correlations

Import libraries and functions.

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from zipfile import ZipFile
import warnings
warnings.filterwarnings("ignore")
import functools as ft
import ipywidgets as widgets
from ipywidgets import Layout
from ipywidgets import interact, interact_manual
import plotly.express as px
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from pandas.api.types import is_numeric_dtype

However, an interesting phenomenon arises: some correlations have a high coefficient and a convincing plot, yet they make no sense in the analysis. These are called **spurious correlations**. Here are some examples:


<img src="https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/chart%20(2).jpeg" width="800" height="400" />

<img src="https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/chart%20(1).jpeg" width="800" height="400" />


Therefore, we have to be careful with our results, because correlation does not imply causation: two variables may look very similar purely by chance.
So, after some thought and experimentation, we have developed a method that we think will allow us to find out whether a correlation has happened by chance or whether there is a real relationship.

This method consists of the following:


First, we classified the indicators into groups, each of which can be one of the following: *A&D*, *Agriculture*, *Demography*, *Economy*, *Employment*, *Environment*, *Equality*, *Exports*, *Health*, *Mortality* or *Principal*. Moreover, within each group we assigned each variable a level, *primary* or *secondary*, depending on its relevance. For example, we considered the *Population in the largest city* more relevant than the *Rural population*, so the former is *primary* and the latter *secondary*, while both belong to the *Demography* group.

With this setup, we can state our hypothesis:

"It is assumed that a correlation among the primary indicators can be caused by randomness. However, if this correlation also appears in a secondary indicator for at least X% of the countries in which it appears for the primary indicators (Pareto's rule), we can suppose that randomness is not affecting that group. Furthermore, the first condition has to hold for Y% of the secondary indicators to rule out coincidence."

This hypothesis can be applied at a global level (all countries) or to the different regions.

For example, if X and Y are both 80% and a primary indicator is repeated 20 times, each secondary indicator must be repeated at least 16 times. And if there are 10 secondary indicators, this has to happen for at least 8 of them.
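The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the helper names `min_repetitions` and `min_indicators` are hypothetical, and rounding the number of secondary indicators up with `math.ceil` is an assumption that reflects the "at least" wording. The first helper mirrors the `round(count * limit)` rule used later in the notebook.

```python
import math

def min_repetitions(primary_count: int, x: float) -> int:
    """Minimum times a secondary indicator must repeat (same rounding as the notebook)."""
    return round(primary_count * x)

def min_indicators(n_secondary: int, y: float) -> int:
    """Minimum number of secondary indicators that must pass (assumption: round up)."""
    # Round first so that floating-point noise does not push ceil() one step too far.
    return math.ceil(round(n_secondary * y, 9))

print(min_repetitions(20, 0.8))  # primary indicator repeated 20 times, X = 80%
print(min_indicators(10, 0.8))   # 10 secondary indicators, Y = 80%
```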

Finally, we end up with two possible errors of 20% each, which combined (20% × 20%) leave us with a 4% margin of error, lower than the widely used 5% threshold.
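Written out for thresholds $X$ and $Y$, and assuming the two sources of error are independent (the multiplication relies on this assumption), the combined margin is:

$$
(1 - X)(1 - Y) = (1 - 0.8)(1 - 0.8) = 0.2 \times 0.2 = 0.04 = 4\%
$$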

**This will only work if the data has been collected by independent sources using different collection methods. Therefore, this step has only been developed for this dataset (WDI), which we have checked comes from different sources and is gathered in different ways.**


In [None]:
limita=widgets.FloatSlider( value=0.8, min=0, max=1.0, step=0.05, description='% of primary indicators:', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
limitb=widgets.FloatSlider( value=0.8, min=0, max=1.0, step=0.05, description='% of secondary indicators:', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
display(limita,limitb)

FloatSlider(value=0.8, continuous_update=False, description='% of primary indicators:', max=1.0, step=0.05)

FloatSlider(value=0.8, continuous_update=False, description='% of secondary indicators:', max=1.0, step=0.05)

In [None]:
a=(1-limita.value)*(1-limitb.value)*100
print('The margin of error in this combination is:',a)

The margin of error in this combination is: 3.9999999999999982


Once we have selected both percentages and are satisfied with the margin of error, we can put our method into action. First, we filter the primary indicators and get the minimum number of times that the secondary indicators have to be repeated.

In [None]:
# Minimum repetitions required per group, derived from the primary indicators.
selected_p=categories.loc[categories['Level']=='primary']
minprimary=selected_p.groupby('Group').min()
minprimary['Min']=round(minprimary['Number of times repeated']*limita.value)
minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)
minprimary


| Group | Min |
|---|---|
| A&D | 13.0 |
| Agriculture | 10.0 |
| Demography | 24.0 |
| Economy | 26.0 |
| Employment | 11.0 |
| Environment | 13.0 |
| Equality | 14.0 |
| Exports | 28.0 |
| Health | 9.0 |
| Internet | 27.0 |
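The cell above relies on a `categories` table that is defined elsewhere in the notebook. As a minimal, self-contained sketch of the same step, the example below builds a small stand-in table. The indicator names, the counts and the variable `limit_a` (playing the role of `limita.value`) are all hypothetical; only the column names are taken from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `categories` table (made-up counts).
categories = pd.DataFrame({
    "Indicator": ["econ_1", "econ_2", "demo_1", "demo_2"],
    "Group": ["Economy", "Economy", "Demography", "Demography"],
    "Level": ["primary", "primary", "primary", "secondary"],
    "Number of times repeated": [50, 45, 30, 12],
})
limit_a = 0.8  # plays the role of limita.value

primary = categories[categories["Level"] == "primary"]

# Smallest repetition count among each group's primary indicators,
# scaled by the chosen percentage and rounded.
min_per_group = (
    (primary.groupby("Group")["Number of times repeated"].min() * limit_a)
    .round()
    .rename("Min")
)
print(min_per_group)
```

Selecting the count column before calling `.min()` is a design choice of this sketch: it avoids taking a column-wise minimum over the text columns and then dropping them again.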


In [None]:
grouplist=minprimary.index.to_list()

Then, we test whether the required repetitions are met.

- H_0: the data is correlated, and the correlation has not happened by randomness.
- H_1: the data is correlated due to randomness.

**If the number of times the secondary indicator is repeated < minimum per group, then H_0 is denied and H_1 is accepted.**
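The rule can be sketched on a toy table as follows. The indicator names and counts are hypothetical, and the `Min` column is assumed to have been merged in already. Note that this sketch implements the rule exactly as stated in the text (denied only when the count is below the minimum), whereas the notebook cell that follows uses a strict `> 0` on the difference, so an indicator whose count equals the minimum is treated differently there.

```python
import numpy as np
import pandas as pd

# Hypothetical secondary indicators with the per-group minimum already attached.
secondary = pd.DataFrame({
    "Indicator": ["ind_a", "ind_b", "ind_c"],
    "Group": ["Economy", "Economy", "Health"],
    "Number of times repeated": [30, 20, 9],
    "Min": [26, 26, 9],
})

# H_0 is denied when the count falls below the group minimum.
secondary["H_0"] = np.where(
    secondary["Number of times repeated"] < secondary["Min"],
    "Denied",
    "Not Discarded",
)
print(secondary[["Indicator", "H_0"]])
```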

In [None]:
# Keep the secondary indicators and attach each group's minimum repetition count.
secondary=final.loc[final['Level']=='secondary']
secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')
secondaryp=secondary.loc[:,['Group','Min']]
Global_Count=secondaryp.groupby('Group').count()
Global_Count.rename(columns={'Min':'Global Count'},inplace=True)
# Strict comparison: an indicator whose count equals Min is marked 'Denied'.
secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
seco=secondary.groupby(['H_0','Group']).count()
sec=seco.loc['Not Discarded']
secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)
continentlist=final['Continent'].unique()
namescontinents=['European', 'North African', 'Asian', 'Pair', 'Persian', 'South African', 'Latino-American']
finalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
# 'No' when the share of surviving secondary rows is strictly greater than limitb.value.
finalcount['Does it have some global casuallity implied?']=np.where(finalcount['Secondary Count']/finalcount['Global Count']>limitb.value,'No', 'Yes')
finalcount['% of count (Global)']=finalcount['Secondary Count']/finalcount['Global Count']*100
finalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)
finalcount

| Group | Does it have some global casuallity implied? | % of count (Global) |
|---|---|---|
| Agriculture | No | 100.0 |
| Demography | No | 100.0 |
| Economy | Yes | 71.039604 |
| Employment | No | 100.0 |
| Environment | No | 100.0 |
| Equality | No | 100.0 |
| Exports | No | 82.379863 |
| Health | No | 100.0 |
| Internet | Yes | 37.190083 |
| Mortality | No | 100.0 |


As we can see, at the global level we do not need to worry about coincidence for the groups marked **No**. For the remaining groups, correlation can still be a useful basis for decision making, provided that we carefully analyze the variables and find some real relationship between them.
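The group-level percentage can also be computed in a more compact way. The sketch below is an alternative formulation under stated assumptions, not the notebook's own code: the `results` table, its rows and the variable `limit_b` (playing the role of `limitb.value`) are hypothetical. It takes the mean of a boolean column per group, which gives the surviving share directly.

```python
import pandas as pd

# Hypothetical per-row test results (one row per secondary indicator occurrence).
results = pd.DataFrame({
    "Group": ["Economy", "Economy", "Economy", "Health", "Internet"],
    "H_0": ["Not Discarded", "Denied", "Not Discarded", "Not Discarded", "Denied"],
})
limit_b = 0.8  # plays the role of limitb.value

# The mean of a boolean column is the fraction of True values in each group.
share = (results["H_0"] == "Not Discarded").groupby(results["Group"]).mean()

summary = pd.DataFrame({
    "% of count": share * 100,
    "Chance implied?": share.gt(limit_b).map({True: "No", False: "Yes"}),
})
print(summary)
```

One difference worth noting: a group in which no row survives stays in this summary at 0%, whereas selecting only the `'Not Discarded'` rows and then merging with the default inner join would appear to drop such a group from the result.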

In [None]:
# Repeat the global test for each region and append its two columns to finalcount.
for i in range(0,len(continentlist)):
    apfinal=final.loc[final['Continent']==continentlist[i]]
    
    selected_p=categories.loc[categories['Level']=='primary']
    minprimary=selected_p.groupby('Group').min()
    minprimary['Min']=round(minprimary['Number of times repeated']*limita.value)
    minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)

    grouplist=minprimary.index.to_list()

    secondary=apfinal.loc[apfinal['Level']=='secondary']
    secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')

    secondaryp=secondary.loc[:,['Group','Min']]
    Global_Count=secondaryp.groupby('Group').count()
    Global_Count.rename(columns={'Min':'Global Count'},inplace=True)

    secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
    seco=secondary.groupby(['H_0','Group']).count()
    sec=seco.loc['Not Discarded']
    secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
    secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)

    apfinalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
    apfinalcount['Does it have some '+namescontinents[i]+' casuallity implied?']=np.where(apfinalcount['Secondary Count']/apfinalcount['Global Count']>limitb.value,'No', 'Yes')
    apfinalcount['% of count ('+namescontinents[i]+')']=apfinalcount['Secondary Count']/apfinalcount['Global Count']*100
    apfinalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)
    finalcount=pd.merge(finalcount,apfinalcount, left_on='Group',right_on='Group')

finalcount

| Group | Does it have some global casuallity implied? | % of count (Global) | Does it have some European casuallity implied? | % of count (European) | Does it have some North African casuallity implied? | % of count (North African) | Does it have some Asian casuallity implied? | % of count (Asian) | Does it have some Pair casuallity implied? | % of count (Pair) | Does it have some Persian casuallity implied? | % of count (Persian) | Does it have some South African casuallity implied? | % of count (South African) | Does it have some Latino-American casuallity implied? | % of count (Latino-American) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Demography | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Economy | Yes | 71.039604 | Yes | 75.799087 | Yes | 71.812081 | Yes | 72.794118 | Yes | 70.348837 | Yes | 66.968326 | Yes | 70.542636 | Yes | 66.666667 |
| Employment | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Environment | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Equality | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Exports | No | 82.379863 | Yes | 77.142857 | No | 80.701754 | No | 86.842105 | No | 84.0 | No | 86.075949 | No | 83.333333 | No | 83.333333 |
| Health | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |
| Internet | Yes | 37.190083 | Yes | 47.619048 | Yes | 35.294118 | Yes | 26.315789 | Yes | 42.857143 | Yes | 36.842105 | Yes | 37.5 | Yes | 28.571429 |
| Mortality | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 | No | 100.0 |


Finally, we can look in more detail at the different regions that we defined before. For the groups marked **No**, we do not need to worry about coincidence. For the remaining groups, correlation can still be a useful basis for decision making, provided that we carefully analyze the variables and find some real relationship between them.
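As a possible alternative to looping over the regions and merging one pair of columns at a time, the per-region percentages can be obtained with a single grouped computation. This is a sketch under assumptions: the `results` table and the region labels `R1` and `R2` are hypothetical, and the `H_0` column is assumed to have been computed beforehand for every row.

```python
import pandas as pd

# Hypothetical per-row test results with a region label.
results = pd.DataFrame({
    "Continent": ["R1", "R1", "R1", "R2", "R2", "R2"],
    "Group": ["Economy", "Economy", "Health", "Economy", "Health", "Health"],
    "H_0": ["Not Discarded", "Denied", "Not Discarded",
            "Not Discarded", "Not Discarded", "Not Discarded"],
})

survived = results["H_0"] == "Not Discarded"

# Percentage of surviving rows for every (group, region) pair,
# with one column per region.
pct = (
    survived.groupby([results["Group"], results["Continent"]])
    .mean()
    .mul(100)
    .unstack("Continent")
)
print(pct)
```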