# Analysing the COVID-19 pandemic in Bosnia and Herzegovina

The analysie will be preformed on a dataset gathered from the <a href="https://www.who.int/">WHO</a> website. The first part of this analysis will be data cleaning, wich is the most important part of data analysis. You know how they say it if the data is not clean we get garbage in and garbage out.

The next part will contain visualizations to get a more understanding picture of the situation so we can preform some statistical methods later. 

After we finished the data cleaning and visualization process will continue on data modeling so we can make predictions in the later part when we will actualy use our data to make predictions on how the situation will improve or not in the future.

When all of this is set and done we will make the conclusion and suggest how things can be done in the future to improve the situation.

## Table of Contetn's

* [Importing the nececery Libraries](#importing-the-nececery-libraries)
* [Data import and exploration](#data-import-and-exploration)

## Importing the nececery Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

sns.set()

## Data import and exploration

In [2]:
rawData = pd.read_excel(os.path.join("../dataSet/rawData/", "mbih.xlsx"), engine='openpyxl')

In [3]:
rawData.head()

Unnamed: 0.1,Unnamed: 0,date,total_cases,new_cases,population,population_density,median_age,aged_65_older
0,6393,2020-03-05,2,2,3280815,68.496,42.5,16.569
1,6394,2020-03-06,2,0,3280815,68.496,42.5,16.569
2,6395,2020-03-07,3,1,3280815,68.496,42.5,16.569
3,6396,2020-03-08,3,0,3280815,68.496,42.5,16.569
4,6397,2020-03-09,3,0,3280815,68.496,42.5,16.569


In [4]:
rawData.info() # checking the datatype of each column, the null valuse

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          301 non-null    int64  
 1   date                301 non-null    object 
 2   total_cases         301 non-null    int64  
 3   new_cases           301 non-null    int64  
 4   population          301 non-null    int64  
 5   population_density  301 non-null    float64
 6   median_age          301 non-null    float64
 7   aged_65_older       301 non-null    float64
dtypes: float64(3), int64(4), object(1)
memory usage: 18.9+ KB


In [5]:
rawData.describe() # fast overview of statistical methods for each column

Unnamed: 0.1,Unnamed: 0,total_cases,new_cases,population,population_density,median_age,aged_65_older
count,301.0,301.0,301.0,301.0,301.0,301.0,301.0
mean,6543.0,26374.262458,368.72093,3280815.0,68.496,42.5,16.569
std,87.035433,33015.025192,467.075392,0.0,0.0,0.0,0.0
min,6393.0,2.0,0.0,3280815.0,68.496,42.5,16.569
25%,6468.0,2321.0,27.0,3280815.0,68.496,42.5,16.569
50%,6543.0,12296.0,220.0,3280815.0,68.496,42.5,16.569
75%,6618.0,32845.0,453.0,3280815.0,68.496,42.5,16.569
max,6693.0,110985.0,1953.0,3280815.0,68.496,42.5,16.569


## Data preprocessing

In [6]:
rawData.drop(columns = "Unnamed: 0", inplace = True)
rawData.head()

Unnamed: 0,date,total_cases,new_cases,population,population_density,median_age,aged_65_older
0,2020-03-05,2,2,3280815,68.496,42.5,16.569
1,2020-03-06,2,0,3280815,68.496,42.5,16.569
2,2020-03-07,3,1,3280815,68.496,42.5,16.569
3,2020-03-08,3,0,3280815,68.496,42.5,16.569
4,2020-03-09,3,0,3280815,68.496,42.5,16.569


In [7]:
bihdata = pd.read_excel(os.path.join("../dataSet/rawData/", "bih.xlsx"), engine='openpyxl')
bihdata.drop(columns = "Unnamed: 0", inplace = True)
bihdata.head()

Unnamed: 0,Datum,Potvrđeni slučajevi,Broj testiranih,Broj smrtnih slučajeva,Broj oporavljenih osoba,Broj aktivnih slučajeva
0,29.12.2020,110454,509067,4024,76802,29628.0
1,28.12.2020,109911,505681,3976,76121,29814.0
2,27.12.2020,109691,503906,3953,75717,30021.0
3,26.12.2020,109330,502063,3923,75124,30283.0
4,25.12.2020,108891,499883,3901,74667,30323.0


In [8]:
for index in range(0, len(bihdata["Broj testiranih"])):    
    if index == len(bihdata["Broj testiranih"]) - 2:
        i, j = index, len(bihdata["Broj testiranih"]) - 1
        print(f"{bihdata.iloc[index, 0]}\t {bihdata.iloc[i, 2] - bihdata.iloc[j, 2]}")
        break
    else:
        i, j = index, index + 1
        print(f"{bihdata.iloc[index, 0]}\t {bihdata.iloc[i, 2] - bihdata.iloc[j, 2]}")

29.12.2020	 3386
28.12.2020	 1775
27.12.2020	 1843
26.12.2020	 2180
25.12.2020	 2958
24.12.2020	 3261
23.12.2020	 3054
22.12.2020	 3339
21.12.2020	 1843
20.12.2020	 2375
19.12.2020	 2840
18.12.2020	 3289
17.12.2020	 3624
16.12.2020	 3555
15.12.2020	 3319
14.12.2020	 1286
13.12.2020	 2022
12.12.2020	 2941
11.12.2020	 3360
10.12.2020	 3979
09.12.2020	 9431
08.12.2020	 3596
07.12.2020	 1754
06.12.2020	 2262
05.12.2020	 2815
04.12.2020	 4158
03.12.2020	 3810
02.12.2020	 3477
01.12.2020	 3428
30.11.2020	 1553
29.11.2020	 1698
28.11.2020	 3252
27.11.2020	 3180
26.11.2020	 2757
25.11.2020	 3929
24.11.2020	 3578
23.11.2020	 1449
22.11.2020	 1969
21.11.2020	 3823
20.11.2020	 4139
19.11.2020	 3614
18.11.2020	 4549
17.11.2020	 3725
16.11.2020	 2390
15.11.2020	 2334
14.11.2020	 3773
13.11.2020	 3854
12.11.2020	 5014
11.11.2020	 5579
10.11.2020	 5375
09.11.2020	 2366
08.11.2020	 2797
07.11.2020	 4442
06.11.2020	 4694
05.11.2020	 5171
04.11.2020	 4456
03.11.2020	 2816
02.11.2020	 1618
01.11.2020	 29

In [9]:
arrayNegative = pd.DataFrame(columns = ["Datum", "Broj oporavljenih osoba"])
for index in range(0, len(bihdata["Broj oporavljenih osoba"])):    
    if index == len(bihdata["Broj testiranih"]) - 2:
        i, j = index, len(bihdata["Broj testiranih"]) - 1
        print(f"{bihdata.iloc[index, 0]}\t {bihdata.iloc[i, 4] - bihdata.iloc[j, 4]}")
        arrayNegative = arrayNegative.append(
            {"Datum": str(bihdata.iloc[index, 0]), "Broj oporavljenih osoba": bihdata.iloc[i, 4] - bihdata.iloc[j, 4]},
            ignore_index = True)
        break
    else:
        i, j = index, index + 1
        arrayNegative = arrayNegative.append(
            {"Datum": str(bihdata.iloc[index, 0]), "Broj oporavljenih osoba": bihdata.iloc[i, 4] - bihdata.iloc[j, 4]},
            ignore_index = True)
        print(f"{str(bihdata.iloc[index, 0])}\t {bihdata.iloc[i, 4] - bihdata.iloc[j, 4]}")
        
        

29.12.2020	 681
28.12.2020	 404
27.12.2020	 593
26.12.2020	 457
25.12.2020	 771
24.12.2020	 747
23.12.2020	 552
22.12.2020	 1049
21.12.2020	 654
20.12.2020	 274
19.12.2020	 371
18.12.2020	 754
17.12.2020	 1250
16.12.2020	 596
15.12.2020	 1163
14.12.2020	 728
13.12.2020	 469
12.12.2020	 1556
11.12.2020	 1141
10.12.2020	 1270
09.12.2020	 1112
08.12.2020	 722
07.12.2020	 863
06.12.2020	 591
05.12.2020	 489
04.12.2020	 1495
03.12.2020	 1083
02.12.2020	 1243
01.12.2020	 955
30.11.2020	 1290
29.11.2020	 411
28.11.2020	 1067
27.11.2020	 1221
26.11.2020	 570
25.11.2020	 685
24.11.2020	 1765
23.11.2020	 1344
22.11.2020	 623
21.11.2020	 877
20.11.2020	 1885
19.11.2020	 1145
18.11.2020	 1226
17.11.2020	 2187
16.11.2020	 867
15.11.2020	 245
14.11.2020	 870
13.11.2020	 801
12.11.2020	 921
11.11.2020	 508
10.11.2020	 1039
09.11.2020	 283
08.11.2020	 285
07.11.2020	 684
06.11.2020	 281
05.11.2020	 300
04.11.2020	 233
03.11.2020	 589
02.11.2020	 373
01.11.2020	 111
31.10.2020	 213
30.10.2020	 290
29.1

In [10]:
arrayNegative.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Datum                    230 non-null    object
 1   Broj oporavljenih osoba  230 non-null    object
dtypes: object(2)
memory usage: 3.7+ KB


In [11]:
arrayNegative

Unnamed: 0,Datum,Broj oporavljenih osoba
0,29.12.2020,681
1,28.12.2020,404
2,27.12.2020,593
3,26.12.2020,457
4,25.12.2020,771
...,...,...
225,07.04.2020,1
226,06.04.2020,8
227,05.04.2020,0
228,04.04.2020,3


In [12]:
rawData['date'] = rawData['date'].astype('datetime64')

In [13]:
rawData['date'] = rawData['date'].dt.strftime('%d.%m.%Y')

In [14]:
rawData

Unnamed: 0,date,total_cases,new_cases,population,population_density,median_age,aged_65_older
0,05.03.2020,2,2,3280815,68.496,42.5,16.569
1,06.03.2020,2,0,3280815,68.496,42.5,16.569
2,07.03.2020,3,1,3280815,68.496,42.5,16.569
3,08.03.2020,3,0,3280815,68.496,42.5,16.569
4,09.03.2020,3,0,3280815,68.496,42.5,16.569
...,...,...,...,...,...,...,...
296,26.12.2020,109330,439,3280815,68.496,42.5,16.569
297,27.12.2020,109691,361,3280815,68.496,42.5,16.569
298,28.12.2020,109911,220,3280815,68.496,42.5,16.569
299,29.12.2020,110454,543,3280815,68.496,42.5,16.569


In [25]:
pd.options.display.max_rows = 1000
pd.merge(left=rawData, left_on='date', how = 'left',
         right=arrayNegative, right_on='Datum')


Unnamed: 0,date,total_cases,new_cases,population,population_density,median_age,aged_65_older,Datum,Broj oporavljenih osoba
0,05.03.2020,2,2,3280815,68.496,42.5,16.569,,
1,06.03.2020,2,0,3280815,68.496,42.5,16.569,,
2,07.03.2020,3,1,3280815,68.496,42.5,16.569,,
3,08.03.2020,3,0,3280815,68.496,42.5,16.569,,
4,09.03.2020,3,0,3280815,68.496,42.5,16.569,,
5,10.03.2020,5,2,3280815,68.496,42.5,16.569,,
6,11.03.2020,7,2,3280815,68.496,42.5,16.569,,
7,12.03.2020,11,4,3280815,68.496,42.5,16.569,,
8,13.03.2020,13,2,3280815,68.496,42.5,16.569,,
9,14.03.2020,18,5,3280815,68.496,42.5,16.569,,
