# Group 7
## Questions
1. How many of the world's 1-year-old children today have been vaccinated against some disease?
2. How many have been vaccinated against more than one disease?
3. How has the rate of vaccination for different diseases changed over time?
4. Are there country characteristics that predict vaccination levels, or trends in vaccination levels?

## Datasets
1. The data on immunization coverage among 1-year-olds is provided by the World Health Organization (https://www.who.int/data/gho/gho-search?indexCatalogue=ghosearchindex&searchQuery=immunization%20coverage%20among%201-year-olds&wordsMode=AllWords)
2. ... to be continued for country characteristics

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os
import typing

In [2]:
# Mount google drive if notebook is running in colab
try:
  from google.colab import drive
  drive.mount("/content/drive", force_remount=True)
  IN_COLAB = True
except ImportError:
  # google.colab is only importable inside Colab
  IN_COLAB = False

In [3]:
if IN_COLAB:
  base_data_path = os.path.join(os.curdir, 'drive', 'MyDrive', 'DOPP', 'data')
else:
  base_data_path = os.path.join(os.curdir, 'data')

In [4]:
mapping = [
    ('WHS4_100', "diphtheria_pertussis_tetanus"),
    ('WHS4_129', "haemophilus_influenzae"),
    ('WHS4_117', "hepatitisB"),
    ('WHS8_110', "measles"),
    ('WHS4_544', "polio"),
    ('ROTAC', "rotavirus"),
    ('PCV3', "streptococcus_pneumoniae"),
    ('WHS4_543', "tuberculosis"),
]


def diseaseNameToVaccineIndicator(disease: str) -> str:
    for e in mapping:
        if e[1] == disease:
            return e[0]

    raise ValueError("Invalid disease")


def vaccineIndicatorToDiseaseName(indicator: str) -> str:
    for e in mapping:
        if e[0] == indicator:
            return e[1]

    raise ValueError("Invalid indicator")
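
The two lookup functions above scan the `mapping` list on every call. As a minimal sketch of an alternative, the same pairs can be held in two dictionaries. The names `INDICATOR_TO_DISEASE`, `DISEASE_TO_INDICATOR` and `disease_to_indicator` are hypothetical and not part of the notebook.

```python
# Same indicator codes and disease names as the `mapping` list above.
mapping = [
    ('WHS4_100', "diphtheria_pertussis_tetanus"),
    ('WHS4_129', "haemophilus_influenzae"),
    ('WHS4_117', "hepatitisB"),
    ('WHS8_110', "measles"),
    ('WHS4_544', "polio"),
    ('ROTAC', "rotavirus"),
    ('PCV3', "streptococcus_pneumoniae"),
    ('WHS4_543', "tuberculosis"),
]

# Hypothetical helper dictionaries built once from the list.
INDICATOR_TO_DISEASE = dict(mapping)
DISEASE_TO_INDICATOR = {disease: code for code, disease in mapping}


def disease_to_indicator(disease: str) -> str:
    """Dictionary-based equivalent of diseaseNameToVaccineIndicator."""
    try:
        return DISEASE_TO_INDICATOR[disease]
    except KeyError:
        raise ValueError(f"Invalid disease: {disease!r}") from None


print(disease_to_indicator("measles"))
print(INDICATOR_TO_DISEASE["ROTAC"])
```

The dictionaries also make it easy to rename the indicator columns for plots, for example with `immunization_data.rename(columns=INDICATOR_TO_DISEASE)`.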


In [5]:
immunization_data_path = os.path.join(base_data_path, 'immunization')
immunization_data_directory = glob.glob(os.path.join(immunization_data_path, '*.csv'))

immunization_data_list = []
for filename in sorted(immunization_data_directory):
    df = pd.read_csv(filename, sep=',')    # value is immunization coverage among 1-year-olds in (%)
    indicator = df['IndicatorCode'][0]
    df.rename(columns={'Period': 'year', 'Location': 'country', 'Value': indicator}, inplace=True)
    df = df[['year', 'country', indicator]].copy()
    df['country'] = df['country'].str.lower()
    df.set_index(['year', 'country'], inplace=True)
    immunization_data_list.append(df)

immunization_data = pd.concat(immunization_data_list, axis=1)

immunization_data[diseaseNameToVaccineIndicator('diphtheria_pertussis_tetanus')] = immunization_data[diseaseNameToVaccineIndicator('diphtheria_pertussis_tetanus')].astype(float)
immunization_data[diseaseNameToVaccineIndicator('measles')] = immunization_data[diseaseNameToVaccineIndicator('measles')].astype(float)

display(immunization_data[immunization_data.index.get_level_values('country') == 'austria'])

year,country,WHS4_100,WHS4_129,WHS4_117,WHS8_110,WHS4_544,ROTAC,PCV3,WHS4_543
2021,austria,85.0,85.0,85.0,95.0,85.0,61.0,,
2020,austria,85.0,85.0,85.0,95.0,85.0,61.0,,
2019,austria,85.0,85.0,85.0,95.0,85.0,61.0,,
2018,austria,85.0,85.0,85.0,94.0,85.0,61.0,,
2017,austria,90.0,90.0,90.0,96.0,90.0,61.0,,
2016,austria,87.0,87.0,87.0,95.0,87.0,61.0,,
2015,austria,93.0,93.0,93.0,96.0,93.0,61.0,,
2014,austria,98.0,98.0,98.0,96.0,98.0,61.0,,
2013,austria,95.0,95.0,95.0,92.0,95.0,61.0,,
2012,austria,92.0,92.0,92.0,88.0,92.0,61.0,,


## To do
1. How many of the world's 1-year-old children today have been vaccinated against some disease?
2. How many have been vaccinated against more than one disease?
3. How has the rate of vaccination for different diseases changed over time?

For now, Austria alone is enough; if time permits, we can add other countries. Seaborn produces particularly nice plots for anyone who wants to use it, but it is not required; matplotlib is sufficient.
1. Plot the current coverage percentage for each disease (i.e. the latest year we have, in this case 2021) for Austria.
2. This is essentially answered by 1; it is unclear what else would belong here.
3. Plot the coverage percentage for each disease over time for Austria.

Once that is done, we can add other countries. France or Italy would probably be good candidates.
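
Tasks 1 and 3 could be sketched roughly as follows. The frame `austria` is a small stand-in whose values are copied from the Austria table above (three indicators, 2017 to 2021); in the notebook it would be replaced by the Austria slice of `immunization_data`. The variable names are assumptions made for this illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the Austria slice of `immunization_data`
# (values copied from the table above, three indicators only).
austria = pd.DataFrame(
    {
        "WHS4_100": [90.0, 85.0, 85.0, 85.0, 85.0],  # diphtheria_pertussis_tetanus
        "WHS8_110": [96.0, 94.0, 95.0, 95.0, 95.0],  # measles
        "ROTAC": [61.0, 61.0, 61.0, 61.0, 61.0],     # rotavirus
    },
    index=pd.Index([2017, 2018, 2019, 2020, 2021], name="year"),
)

# Task 1: coverage per indicator in the latest available year.
latest_year = austria.index.max()
latest = austria.loc[latest_year]

fig, (ax_latest, ax_trend) = plt.subplots(1, 2, figsize=(12, 4))
latest.plot.bar(ax=ax_latest)
ax_latest.set_title(f"Austria, coverage in {latest_year}")
ax_latest.set_ylabel("Coverage among 1-year-olds (%)")
ax_latest.set_ylim(0, 100)

# Task 3: coverage per indicator over time.
austria.sort_index().plot(ax=ax_trend, marker="o")
ax_trend.set_title("Austria, coverage over time")
ax_trend.set_ylabel("Coverage among 1-year-olds (%)")
ax_trend.set_ylim(0, 100)

fig.tight_layout()
```

In a notebook the figure is rendered automatically at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.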

## Are there country characteristics that predict vaccination levels, or trends in vaccination levels?
We test correlations between selected country characteristics and vaccination levels, and ask whether vaccination levels can be predicted accurately from them. We will use logistic regression. Country characteristics used (Austria): education level, infant mortality rate, (GDP per capita), vaccine mandates.

We are looking at 8 diseases, one vaccine indicator each.

## TO DO
Refactor into functions. Implement for at least 3 countries: Austria, France, Italy, Germany, Spain? With more data, the prediction and correlation would be more meaningful.
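
As a starting point for this refactoring, the repeated country and year filtering could be wrapped in one function. This is a minimal sketch: `select_country_period` is a hypothetical name, and the toy frame mixes Austria values from the table above with a placeholder row for France that is not real data.

```python
import pandas as pd


def select_country_period(data: pd.DataFrame, country: str,
                          start: int, end: int) -> pd.DataFrame:
    """Rows of a (year, country)-indexed frame for one country, start <= year <= end."""
    years = data.index.get_level_values("year")
    countries = data.index.get_level_values("country")
    mask = (countries == country.lower()) & (years >= start) & (years <= end)
    return data[mask].droplevel("country").sort_index()


# Toy frame in the same layout as `immunization_data`.
toy = pd.DataFrame(
    {"WHS8_110": [92.0, 96.0, 95.0, 50.0]},  # last value is a placeholder, not real data
    index=pd.MultiIndex.from_tuples(
        [(2013, "austria"), (2014, "austria"), (2021, "austria"), (2014, "france")],
        names=["year", "country"],
    ),
)

# The same call then works for every country in the list.
per_country = {
    country: select_country_period(toy, country, 2014, 2020)
    for country in ["austria", "france"]
}
print(per_country["austria"])
```

Note that the education data in this notebook identifies Austria as `at` rather than `austria`, so a real implementation would also need a mapping between the two naming schemes.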

In [6]:
mortality_data_path = os.path.join(base_data_path, 'country_characteristics')
education_data_path = os.path.join(base_data_path, 'country_characteristics')
mandates_data_path = os.path.join(base_data_path, 'country_characteristics')

In [7]:
characteristics_list = []
# First, the infant mortality rates for Austria between 2014 and 2020
df = pd.read_csv(os.path.join(mortality_data_path, "infant_mortality_rate.csv"), sep=',')

indicator = df['IndicatorCode'][0].lower()
df.rename(columns={'Period': 'year', 'Location': 'country','FactValueNumeric': indicator}, inplace=True) 
df = df[['year', 'country','Dim1ValueCode', indicator]].copy()
df['country'] = df['country'].str.lower()
df = df[df['Dim1ValueCode'] == 'BTSX'].copy()
df = df.drop(labels='Dim1ValueCode', axis=1)
df.set_index(['year', 'country'], inplace=True)

mortality_at_data = df[(df.index.get_level_values('country') == 'austria') & (df.index.get_level_values('year') >= 2014)].copy()
mortality_at_data = mortality_at_data.droplevel(1)
mortality_at_data = mortality_at_data.loc[mortality_at_data.index.repeat(8)]
mortality_at_data = mortality_at_data.reset_index()
characteristics_list.append(mortality_at_data)

# Next, the education level for Austria: share of the 15-64 age group with attainment ED3-8 (see the filter below), between 2014 and 2020
df = pd.read_csv(os.path.join(education_data_path, "education.csv"), sep=',')
df.rename(columns={'TIME_PERIOD': 'year', 'geo': 'country','OBS_VALUE': 'level'}, inplace=True) 
df = df[['year', 'country', 'age', 'sex', 'isced11', 'level']].copy()
df['country'] = df['country'].str.lower()
df = df[(df['sex'] == 'T') & (df['age'] == 'Y15-64') & (df['isced11'] == 'ED3-8')].copy()
df = df.drop(labels=['sex', 'age', 'isced11'], axis=1)
df.set_index(['year', 'country'], inplace=True)

education_at_data = df[(df.index.get_level_values('country') == 'at') & (df.index.get_level_values('year') >= 2014) & (df.index.get_level_values('year') < 2021)].copy()
education_at_data = education_at_data.droplevel(1)
education_at_data = education_at_data.loc[education_at_data.index.repeat(8)]
education_at_data = education_at_data.reset_index()
characteristics_list.append(education_at_data)

# Finally, the vaccine mandates for Austria between 2014 and 2020
df = pd.read_csv(os.path.join(mandates_data_path, "mandates.csv"), sep=',')
df['country'] = df['country'].str.lower()
df.set_index(['year', 'country'], inplace=True)

mandates_at_data = df[(df.index.get_level_values('country') == 'austria') & (df.index.get_level_values('year') >= 2014) & (df.index.get_level_values('year') < 2021)].copy()
mandates_at_data = mandates_at_data.droplevel(1)
mandates_at_data = mandates_at_data.reset_index()
mandates_at_data.rename(columns={'year': 'year_m'}, inplace=True) 
characteristics_list.append(mandates_at_data)


In [8]:
#Data for building our prediction and correlation model
immunization_at_data = immunization_data[(immunization_data.index.get_level_values('country') == 'austria')].reset_index()
immunization_at_data = immunization_at_data.loc[(immunization_at_data['year'] >= 2014) & (immunization_at_data['year'] < 2021)]
immunization_at_data = immunization_at_data.drop(labels='country', axis=1)
immunization_at_data = immunization_at_data.sort_values(by="year")
cols = immunization_at_data.columns[1:]

characteristics_data = pd.concat(characteristics_list, axis=1, ignore_index=False)
characteristics_data = characteristics_data.drop(labels=['year'], axis=1)
characteristics_data = characteristics_data.sort_values(by='year_m')

for el in cols:
    df_el = immunization_at_data[['year', el]].dropna()
    for index, row in df_el.iterrows():
        characteristics_data.loc[(characteristics_data['IndicatorCode'] == el.upper()) & (characteristics_data['year_m'] == int(row['year'])), 'immunization'] = row[el]

display(characteristics_data.head())


index,mdg_0000000001,level,year_m,IndicatorCode,recommended,mandatory,funded,immunization
55,3.1,81.3,2014,ROTAC,1,0,1,61.0
53,3.1,81.3,2014,PCV3,1,0,1,
52,3.1,81.3,2014,WHS8_110,1,0,1,96.0
54,3.1,81.3,2014,WHS4_544,1,0,1,98.0
6,2.97,79.7,2014,WHS4_543,0,0,0,
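
In the preview above, rows that share `year_m` = 2014 carry different mortality and education values (3.1 / 81.3 versus 2.97 / 79.7). This suggests that the positional `pd.concat(..., axis=1)` may be pairing rows from different years. A minimal sketch of joining on the year explicitly is shown below. The frames are small stand-ins: the coverage values are taken from the Austria table, while the mortality values and their assignment to years are placeholders chosen for illustration.

```python
import pandas as pd

# Stand-in: one mortality value per year (placeholder values and years).
mortality = pd.DataFrame({"year": [2015, 2014], "mdg_0000000001": [3.0, 3.1]})

# Stand-in: one mandates row per (year, indicator).
mandates = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015],
    "IndicatorCode": ["ROTAC", "WHS8_110", "ROTAC", "WHS8_110"],
    "recommended": [1, 1, 1, 1],
})

# Stand-in: wide coverage table as in `immunization_at_data`.
immunization_wide = pd.DataFrame({
    "year": [2014, 2015],
    "ROTAC": [61.0, 61.0],
    "WHS8_110": [96.0, 96.0],
})

# Reshape coverage to one row per (year, indicator).
immunization_long = immunization_wide.melt(
    id_vars="year", var_name="IndicatorCode", value_name="immunization"
)

# Join on keys instead of relying on row position.
merged = (
    mandates
    .merge(mortality, on="year", how="left", validate="many_to_one")
    .merge(immunization_long, on=["year", "IndicatorCode"], how="left")
)
print(merged)
```

With key-based merges, the row order of the input files no longer matters, and the `melt` step takes over the role of the `iterrows` loop.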


## Implementing Logistic Regression

In [9]:
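# Correlation between the numeric characteristics and immunization coverage
corr_data = characteristics_data.drop(labels='year_m', axis=1).copy()
corr_data = corr_data[corr_data['IndicatorCode'] != 'PCV3']

# IndicatorCode is a string column, so restrict the correlation to numeric columns
display(corr_data.corr(numeric_only=True))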




variable,mdg_0000000001,level,recommended,mandatory,funded,immunization
mdg_0000000001,1.0,0.586746,0.059904,,0.059904,0.113063
level,0.586746,1.0,0.711584,,0.711584,0.127547
recommended,0.059904,0.711584,1.0,,1.0,
mandatory,,,,,,
funded,0.059904,0.711584,1.0,,1.0,
immunization,0.113063,0.127547,,,,1.0
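
The `mandatory` row and column are entirely NaN. That is what a Pearson correlation returns for a column with zero variance, because the computation divides by the standard deviation. The preview rows show `mandatory` = 0 throughout; that this holds for every row is an assumption. A minimal sketch on made-up stand-in values:

```python
import pandas as pd

# Made-up stand-in values (NOT the notebook's data); `mandatory` is constant.
toy = pd.DataFrame({
    "level": [79.7, 80.5, 81.3, 82.0],
    "mandatory": [0, 0, 0, 0],
    "immunization": [85.0, 87.0, 93.0, 98.0],
})

corr = toy.corr()
print(corr)

# Columns with at most one distinct value carry no information for correlation.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]
print(constant_cols)
```

Dropping such columns before computing correlations or fitting a model keeps the output free of all-NaN rows.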


In [10]:
from sklearn.linear_model import LogisticRegression

# Features: all remaining characteristics; target: immunization coverage (%)
X = corr_data.drop(labels=['immunization', 'IndicatorCode'], axis=1).to_numpy()
# Missing coverage values are filled with 0
y = corr_data['immunization'].fillna(0).to_numpy()

# Hold out the first two rows for testing and train on the rest
training_X = X[2:]
testing_X = X[0:2]
training_y = y[2:]
testing_y = y[0:2]

model = LogisticRegression(random_state=0).fit(training_X, training_y)
print(f"True immunization: {testing_y}")
print(model.predict(testing_X))


True immunization: [61. 96.]
[85. 85.]
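
`LogisticRegression` is a classifier, so each distinct coverage value is treated as a separate class label. That may be why both predictions land on the same training value (85). If coverage is to be predicted as a continuous percentage, a regression model is the more usual choice. The following is a minimal sketch with `LinearRegression`, a swapped-in technique rather than the notebook's method, on made-up stand-in data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up stand-in data (NOT from the notebook's CSV files).
# Columns: [infant mortality rate, education level in %].
X = np.array([
    [3.4, 79.2],
    [3.3, 79.9],
    [3.2, 80.1],
    [3.1, 80.8],
    [3.0, 81.0],
    [2.9, 81.6],
])
# Made-up coverage percentages, treated as a continuous target.
y = np.array([98.0, 93.0, 87.0, 90.0, 85.0, 85.0])

# Train on the first four rows, hold out the last two.
X_train, X_test = X[:4], X[4:]
y_train, y_test = y[:4], y[4:]

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)

print(f"True coverage:      {y_test}")
print(f"Predicted coverage: {predictions}")
print(f"Mean absolute error: {mae:.2f}")
```

With so few rows, any such model is only illustrative; this is another reason to include more countries, as noted in the to-do list.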


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
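
The warning above suggests raising `max_iter` or scaling the features. A minimal sketch of both, using scikit-learn's `make_pipeline` and `StandardScaler`, is shown below. The feature values and targets are made-up placeholders, not the notebook's data, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data (NOT the notebook's data).
# Columns: [infant mortality rate, education level in %, recommended flag].
X = np.array([
    [3.4, 79.2, 1],
    [3.3, 79.9, 1],
    [3.2, 80.1, 0],
    [3.1, 80.8, 1],
    [3.0, 81.0, 1],
    [2.9, 81.6, 0],
])
# Coverage values used as class labels, as in the cell above.
y = np.array([98.0, 93.0, 85.0, 93.0, 85.0, 85.0])

# Scale the features to zero mean and unit variance, then fit with a higher iteration limit.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=0),
)
model.fit(X, y)

predicted = model.predict(X)
print(predicted)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then applied unchanged to any test rows passed to `predict`.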
