# Analysing the COVID-19 pandemic in Bosnia and Herzegovina

The analysis will be performed on a dataset gathered from the <a href="https://www.who.int/">WHO</a> website. The first part is data cleaning, which is the most important part of data analysis: as the saying goes, garbage in, garbage out.

The next part contains visualizations that give a clearer picture of the situation, so that we can apply statistical methods later.

Once data cleaning and visualization are finished, we continue with data modeling, using the data to predict whether the situation will improve in the future.

When all of this is said and done, we draw conclusions and suggest what can be done in the future to improve the situation.

## Table of Contents

* [Importing the necessary Libraries](#importing-the-necessary-libraries)
* [Data import and exploration](#data-import-and-exploration)
* [Data preprocessing](#data-preprocessing)
* [Custom imputer](#custom-imputer)

## Importing the necessary Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import cufflinks as cf

import chart_studio.plotly as py
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.impute import SimpleImputer

init_notebook_mode(connected = True)
cf.go_offline()
sns.set()

## Data import and exploration

In [2]:
rawData = pd.read_excel(os.path.join("../dataSet/rawData/", "mbih.xlsx"), engine='openpyxl')

In [3]:
rawData.head()

Unnamed: 0,date,total_cases,new_cases,population
0,2020-03-05,2,2,3280815
1,2020-03-06,2,0,3280815
2,2020-03-07,3,1,3280815
3,2020-03-08,3,0,3280815
4,2020-03-09,3,0,3280815


In [4]:
rawData.info() # check the dtype and the non-null count of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308 entries, 0 to 307
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         308 non-null    object
 1   total_cases  308 non-null    int64 
 2   new_cases    308 non-null    int64 
 3   population   308 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 9.8+ KB


In [5]:
rawData.describe() # quick summary statistics for each numeric column

Unnamed: 0,total_cases,new_cases,population
count,308.0,308.0,308.0
mean,28331.162338,368.155844,3280815.0
std,35076.420847,466.37897,0.0
min,2.0,0.0,3280815.0
25%,2347.0,26.0,3280815.0
50%,13267.0,218.5,3280815.0
75%,36564.75,485.5,3280815.0
max,113392.0,1953.0,3280815.0


## Data preprocessing

In [6]:
rawData['date'] = pd.to_datetime(rawData['date'])  # parse the date strings into datetime64 values
rawData.head()

Unnamed: 0,date,total_cases,new_cases,population
0,2020-03-05,2,2,3280815
1,2020-03-06,2,0,3280815
2,2020-03-07,3,1,3280815
3,2020-03-08,3,0,3280815
4,2020-03-09,3,0,3280815


In [7]:
bihdata = pd.read_excel(os.path.join("../dataSet/rawData/", "bih.xlsx"), engine='openpyxl')
bihdata.head()

Unnamed: 0,Datum,Potvrđeni slučajevi,Broj testiranih,Broj smrtnih slučajeva,Broj oporavljenih osoba,Broj aktivnih slučajeva,Broj osoba pod nadzorom
0,30.12.2020,110985,511940,4050,77225,29710,0
1,29.12.2020,110454,509067,4024,76802,29628,0
2,28.12.2020,109911,505681,3976,76121,29814,0
3,27.12.2020,109691,503906,3953,75717,30021,0
4,26.12.2020,109330,502063,3923,75124,30283,0


In [8]:
# bihdata is sorted newest-first, so each daily count is the cumulative value
# minus the next row's (previous day's) value. The oldest row has no
# predecessor and is left out. DataFrame.append was removed in pandas 2.0.
tested = pd.DataFrame({
    "Datum": bihdata.iloc[:-1, 0].astype(str),
    "Broj testiranih dnevno": bihdata.iloc[:, 2].diff(-1).iloc[:-1],
})

In [9]:
# Daily recoveries: cumulative value minus the next row's (previous day's)
# value, since bihdata is sorted newest-first. The oldest row is left out.
arrayNegative = pd.DataFrame({
    "Datum": bihdata.iloc[:-1, 0].astype(str),
    "Broj oporavljenih osoba": bihdata.iloc[:, 4].diff(-1).iloc[:-1],
})

In [10]:
# Daily deaths: cumulative value minus the next row's (previous day's) value,
# since bihdata is sorted newest-first. Unlike the other daily tables, the
# oldest row is kept here with a daily value of 0.
dailyDeaths = bihdata.iloc[:, 3].diff(-1)
dailyDeaths.iloc[-1] = 0
died = pd.DataFrame({
    "Datum": bihdata.iloc[:, 0].astype(str),
    "Broj smrtnih slučajeva": dailyDeaths,
})
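
The three cells above all turn a cumulative column into daily counts. As a quick check of that idea, here is a minimal sketch on the five rows printed by `bihdata.head()`, assuming the rows stay sorted newest-first. The helper name `daily_from_cumulative` is hypothetical and not part of the notebook.

```python
import pandas as pd

# First five rows of bihdata as printed above (newest date first)
sample = pd.DataFrame({
    "Datum": ["30.12.2020", "29.12.2020", "28.12.2020", "27.12.2020", "26.12.2020"],
    "Broj testiranih": [511940, 509067, 505681, 503906, 502063],
    "Broj smrtnih slučajeva": [4050, 4024, 3976, 3953, 3923],
})

def daily_from_cumulative(frame, column):
    """Hypothetical helper: daily increments of a cumulative column sorted newest-first."""
    # diff(-1) subtracts the next row (the previous day) from each row;
    # the oldest row has nothing to subtract from it and is dropped
    daily = frame[column].diff(-1).dropna().astype(int)
    return pd.DataFrame({
        "Datum": frame.loc[daily.index, "Datum"],
        column + " dnevno": daily,
    })

tested_sample = daily_from_cumulative(sample, "Broj testiranih")
died_sample = daily_from_cumulative(sample, "Broj smrtnih slučajeva")
print(tested_sample)
print(died_sample)
```

Each result has one row fewer than the input, because the oldest day has no earlier cumulative value to compare against.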

In [11]:
rawData['date'] = rawData['date'].dt.strftime('%d.%m.%Y')

In [12]:
fullDataFrame = pd.merge(left=rawData, left_on='date', how = 'left',
         right=arrayNegative[['Broj oporavljenih osoba', 'Datum']], right_on='Datum').drop('Datum', axis = 1)

In [13]:
fullDataFrame = pd.merge(left = fullDataFrame, left_on = 'date', how = 'left',
                        right = tested[['Datum', 'Broj testiranih dnevno']], right_on = 'Datum').drop('Datum', axis = 1)

In [14]:
fullDataFrame = pd.merge(left = fullDataFrame, left_on = 'date', how = 'left',
                        right = died[['Datum', 'Broj smrtnih slučajeva']], right_on = 'Datum').drop('Datum', axis = 1)

In [15]:
fullDataFrame.head()

Unnamed: 0,date,total_cases,new_cases,population,Broj oporavljenih osoba,Broj testiranih dnevno,Broj smrtnih slučajeva
0,05.03.2020,2,2,3280815,,,
1,06.03.2020,2,0,3280815,,,
2,07.03.2020,3,1,3280815,,,
3,08.03.2020,3,0,3280815,,,
4,09.03.2020,3,0,3280815,,,
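
All three merged columns are empty in the first rows. One way to tell whether a date is simply absent from the right-hand table or whether the date strings failed to match is a left merge with `indicator=True`. The sketch below uses made-up rows; the frames `left` and `right` and their values are illustrative only.

```python
import pandas as pd

# Illustrative frames only: 'left' mimics rawData, 'right' mimics one daily table
left = pd.DataFrame({"date": ["05.03.2020", "06.03.2020", "07.03.2020"],
                     "new_cases": [2, 0, 1]})
right = pd.DataFrame({"Datum": ["07.03.2020", "08.03.2020"],
                      "Broj testiranih dnevno": [120, 150]})

# indicator=True adds a '_merge' column: 'both' when the key matched,
# 'left_only' when the right table had no row for that date
checked = pd.merge(left, right, how="left", left_on="date", right_on="Datum",
                   indicator=True).drop("Datum", axis=1)

unmatched = checked.loc[checked["_merge"] == "left_only", "date"].tolist()
print(checked)
print("dates without a match:", unmatched)
```

If the unmatched list contains dates that do exist in the right-hand table, the two date columns are probably formatted differently.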


In [16]:
missingData = pd.read_excel(os.path.join("../dataSet/cleanData", "missingDataValues.xlsx"), engine = "openpyxl")

In [17]:
fullDataFrame = pd.read_excel(os.path.join("../dataSet/cleanData", "missingData.xlsx"), engine = "openpyxl")

In [18]:
fullDataFrame

Unnamed: 0,date,total_cases,new_cases,population,Oporavljeni,Testirani,Smrtni sl.
0,05.03.2020,2,2,3280815,,,
1,06.03.2020,2,0,3280815,,,
2,07.03.2020,3,1,3280815,,,
3,08.03.2020,3,0,3280815,,,
4,09.03.2020,3,0,3280815,,,
...,...,...,...,...,...,...,...
303,02.01.2021,112143,0,3280815,,,
304,03.01.2021,112645,502,3280815,,,
305,04.01.2021,112645,0,3280815,,,
306,05.01.2021,113392,747,3280815,,,


In [19]:
fig = go.Figure()

fig.add_trace(go.Bar(x = missingData["Column Name"], y = missingData["Available Data"],
                     marker_color = "#001024", name = "Available Data"))
fig.add_trace(go.Bar(x = missingData["Column Name"], y = missingData["Missing Data"],
                     marker_color = "#FF800B", name = "Missing Data"))

fig.update_layout(barmode='group', xaxis_tickangle=-45, title = "Missing Data for each Column of the Data Set", hovermode="x unified")
fig.show()


In [20]:
fullDataFrame.isnull().sum()

date            0
total_cases     0
new_cases       0
population      0
Oporavljeni    77
Testirani      77
Smrtni sl.     76
dtype: int64

In [21]:
missingData

Unnamed: 0,Column Name,Available Data,Missing Data,Missing Pct
0,date,308,0,0.0
1,total_cases,308,0,0.0
2,new_cases,308,0,0.0
3,population,308,0,0.0
4,Oporavljeni,231,77,0.333333
5,Testirani,231,77,0.333333
6,Smrtni sl.,232,76,0.327586
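
A table like `missingData` can also be derived directly from a data frame instead of being loaded from a spreadsheet. The sketch below assumes `Missing Pct` should mean missing rows divided by all rows. The values loaded above instead equal missing divided by available (for example 77 / 231), so the percentages here will not match them. `missing_summary` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Hypothetical helper: per-column counts of present and missing values."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "Column Name": frame.columns,
        "Available Data": (len(frame) - missing).to_numpy(),
        "Missing Data": missing.to_numpy(),
        # Assumption: share of missing rows out of all rows
        "Missing Pct": (missing / len(frame)).to_numpy(),
    })

# Tiny illustrative frame, not the real data set
demo = pd.DataFrame({"date": ["a", "b", "c", "d"],
                     "Testirani": [np.nan, 10.0, 12.0, np.nan]})
summary = missing_summary(demo)
print(summary)
```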


In [22]:
def MissingDataPlot(dataFrame):
    colors = ["#FF800B", "#001024"]
    names = ["Missing Values", "Present Values"]

    columns = list(dataFrame["Column Name"])

    # One pie-type cell per column, all in a single row
    specs = [{"type": "pie"} for _ in columns]
    fig = make_subplots(rows=1, cols=len(columns), specs=[specs], subplot_titles=columns)

    # Hand-tuned horizontal offsets that centre each percentage label
    # inside its donut; any index past 4 moves by 0.147
    steps = {0: 0.0, 1: 0.15, 2: 0.15, 3: 0.18, 4: 0.19}
    position = 0.024

    # Use the dataFrame argument rather than the global missingData
    for index in range(len(dataFrame)):
        avail = dataFrame.iloc[index, 1]  # "Available Data"
        miss = dataFrame.iloc[index, 2]   # "Missing Data"
        pct = dataFrame.iloc[index, 3]    # "Missing Pct"

        # Values are ordered to match names: missing first, present second
        fig.add_trace(go.Pie(labels=names, values=[miss, avail], textinfo="none", hole=.8),
                      row=1, col=index + 1)
        fig.update_traces(hoverinfo="label+value", marker=dict(colors=colors), col=index + 1)

        position += steps.get(index, 0.147)
        fig.add_annotation(x=position, y=0.5, text="{:.2%}".format(pct),
                           font_size=15, showarrow=False)

    fig.show()

In [23]:
MissingDataPlot(missingData)

## Custom imputer

In [24]:
missingData = pd.read_excel(os.path.join("../dataSet/cleanData", "missingData.xlsx"), engine = "openpyxl")

In [25]:
pd.options.display.max_rows = 1000
missingData

Unnamed: 0,date,total_cases,new_cases,population,Oporavljeni,Testirani,Smrtni sl.
0,05.03.2020,2,2,3280815,,,
1,06.03.2020,2,0,3280815,,,
2,07.03.2020,3,1,3280815,,,
3,08.03.2020,3,0,3280815,,,
4,09.03.2020,3,0,3280815,,,
5,10.03.2020,5,2,3280815,,,
6,11.03.2020,7,2,3280815,,,
7,12.03.2020,11,4,3280815,,,
8,13.03.2020,13,2,3280815,,,
9,14.03.2020,18,5,3280815,,,


In [44]:
testingData = missingData.copy()

In [33]:
# Integer averages over rows 29-34, the window right after the rows that are back-filled below
startTestedAvg = int(missingData.iloc[29:35, 5].mean())  # "Testirani"
startDiedAvg = int(missingData.iloc[29:35, 6].mean())    # "Smrtni sl."

In [43]:
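# Back-fill the first 29 rows: zero recoveries, and the window averages for tests and deaths
testingData.iloc[:29, 4] = 0                # "Oporavljeni"
testingData.iloc[:29, 5] = startTestedAvg   # "Testirani"
testingData.iloc[:29, 6] = startDiedAvg     # "Smrtni sl."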

NameError: name 'startTestedAvg' is not defined

In [40]:
testingData.iloc[, 4]

585.0

In [42]:
testingData

Unnamed: 0,date,total_cases,new_cases,population,Oporavljeni,Testirani,Smrtni sl.
0,05.03.2020,2,2,3280815,,585.0,
1,06.03.2020,2,0,3280815,,585.0,
2,07.03.2020,3,1,3280815,,585.0,
3,08.03.2020,3,0,3280815,,585.0,
4,09.03.2020,3,0,3280815,,585.0,
5,10.03.2020,5,2,3280815,,585.0,
6,11.03.2020,7,2,3280815,,585.0,
7,12.03.2020,11,4,3280815,,585.0,
8,13.03.2020,13,2,3280815,,585.0,
9,14.03.2020,18,5,3280815,,585.0,
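
The back-fill used in this section (zero for recoveries, an early-window average for the other columns) can be wrapped in a small function, so that the steps run in a fixed order and do not depend on the order in which cells were executed. This is a sketch under the assumption that the missing values sit at the start of each column. `fill_leading` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_leading(series, window=6, constant=None):
    """Hypothetical helper: fill the NaN values that precede the first observation.

    Uses `constant` if given, otherwise the integer mean of the `window`
    rows starting at the first observed value.
    """
    first_valid = series.first_valid_index()
    if first_valid is None:  # nothing observed, so nothing to base a fill on
        return series.copy()
    start = series.index.get_loc(first_valid)
    if constant is None:
        constant = int(series.iloc[start:start + window].mean())
    filled = series.copy()
    filled.iloc[:start] = constant
    return filled

# Illustrative values only
tested_demo = pd.Series([np.nan, np.nan, 500.0, 600.0, 700.0])
by_average = fill_leading(tested_demo, window=2)
by_constant = fill_leading(tested_demo, constant=0)
print(by_average.tolist())
print(by_constant.tolist())
```

Applied to the notebook's columns, the `constant=0` form would correspond to `Oporavljeni` and the averaged form to `Testirani` and `Smrtni sl.`.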
