# Analysing the COVID-19 pandemic in Bosnia and Herzegovina

The analysis will be performed on a dataset gathered from the <a href="https://www.who.int/">WHO</a> website. The first part is data cleaning, which is the most important part of any data analysis. As the saying goes: garbage in, garbage out.

The next part will contain visualizations that give a clearer picture of the situation, so that we can apply statistical methods later.

Once data cleaning and visualization are finished, we will continue with data modeling and use our data to predict whether the situation will improve in the future.

When all of this is said and done, we will draw conclusions and suggest what can be done in the future to improve the situation.

## Table of Contents

* [Importing the necessary libraries](#importing-the-necessary-libraries)
* [Data import and exploration](#data-import-and-exploration)

## Importing the necessary libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

## Data import and exploration

In [2]:
rawData = pd.read_excel(os.path.join("../dataSet/rawData/", "bih.xlsx"), engine='openpyxl')

In [3]:
rawData.head(5)

| | Unnamed: 0 | Datum | Potvrđeni slučajevi | Broj testiranih | Broj smrtnih slučajeva | Broj oporavljenih osoba | Broj aktivnih slučajeva |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 29.12.2020 | 110454 | 509067 | 4024 | 76802 | 29628.0 |
| 1 | 1 | 28.12.2020 | 109911 | 505681 | 3976 | 76121 | 29814.0 |
| 2 | 2 | 27.12.2020 | 109691 | 503906 | 3953 | 75717 | 30021.0 |
| 3 | 3 | 26.12.2020 | 109330 | 502063 | 3923 | 75124 | 30283.0 |
| 4 | 4 | 25.12.2020 | 108891 | 499883 | 3901 | 74667 | 30323.0 |


In [4]:
rawData.info()  # check each column's dtype and non-null count

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               231 non-null    int64  
 1   Datum                    231 non-null    object 
 2   Potvrđeni slučajevi      231 non-null    int64  
 3   Broj testiranih          231 non-null    int64  
 4   Broj smrtnih slučajeva   231 non-null    int64  
 5   Broj oporavljenih osoba  231 non-null    int64  
 6   Broj aktivnih slučajeva  150 non-null    float64
dtypes: float64(1), int64(5), object(1)
memory usage: 12.8+ KB
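
The `info()` output lists `Datum` as `object`, meaning the dates are stored as plain strings. A minimal sketch of converting them follows, assuming every entry uses the day.month.year layout seen in the `head()` output. The `sample` frame is a small stand-in built from those printed rows, not the full file.

```python
import pandas as pd

# Small sample copied from the rawData.head() output above.
sample = pd.DataFrame({
    "Datum": ["29.12.2020", "28.12.2020", "27.12.2020"],
    "Potvrđeni slučajevi": [110454, 109911, 109691],
})

# Parse the day.month.year strings into proper datetime values.
sample["Datum"] = pd.to_datetime(sample["Datum"], format="%d.%m.%Y")

# The raw file lists the newest day first, so sort chronologically.
sample = sample.sort_values("Datum").reset_index(drop=True)

print(sample.dtypes)
print(sample)
```

With a real datetime column, later steps such as resampling or plotting over time can work on the dates directly.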


In [5]:
rawData.describe()  # quick summary statistics for each numeric column

| | Unnamed: 0 | Potvrđeni slučajevi | Broj testiranih | Broj smrtnih slučajeva | Broj oporavljenih osoba | Broj aktivnih slučajeva |
|---|---|---|---|---|---|---|
| count | 231.0 | 231.0 | 231.0 | 231.0 | 231.0 | 150.0 |
| mean | 115.0 | 31268.471861 | 200085.865801 | 3633.891775 | 19292.558442 | 15362.16 |
| std | 66.828138 | 35227.672634 | 153225.880837 | 5023.575148 | 22381.495402 | 13442.102423 |
| min | 0.0 | 513.0 | 3983.0 | 141.0 | 15.0 | 236.0 |
| 25% | 57.5 | 2489.5 | 64109.5 | 436.0 | 1817.0 | 2579.25 |
| 50% | 115.0 | 17029.0 | 168861.0 | 1086.0 | 10881.0 | 6772.0 |
| 75% | 172.5 | 51887.0 | 319022.0 | 3749.0 | 28380.5 | 31201.5 |
| max | 230.0 | 110454.0 | 509067.0 | 24465.0 | 76802.0 | 34452.0 |


## Data preprocessing

In [6]:
preprocessing = rawData.copy()

In [7]:
preprocessing.head()

| | Unnamed: 0 | Datum | Potvrđeni slučajevi | Broj testiranih | Broj smrtnih slučajeva | Broj oporavljenih osoba | Broj aktivnih slučajeva |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 29.12.2020 | 110454 | 509067 | 4024 | 76802 | 29628.0 |
| 1 | 1 | 28.12.2020 | 109911 | 505681 | 3976 | 76121 | 29814.0 |
| 2 | 2 | 27.12.2020 | 109691 | 503906 | 3953 | 75717 | 30021.0 |
| 3 | 3 | 26.12.2020 | 109330 | 502063 | 3923 | 75124 | 30283.0 |
| 4 | 4 | 25.12.2020 | 108891 | 499883 | 3901 | 74667 | 30323.0 |


In [8]:
preprocessing.columns

Index(['Unnamed: 0', 'Datum', 'Potvrđeni slučajevi', 'Broj testiranih',
       'Broj smrtnih slučajeva', 'Broj oporavljenih osoba',
       'Broj aktivnih slučajeva'],
      dtype='object')
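
The column labels are in Bosnian. For readers who prefer English labels, one option is to rename the columns. The mapping below is a hypothetical translation that is not part of the source file, and the snake_case names are an arbitrary choice. The one-row `sample` frame is a stand-in built from the first row of the `head()` output.

```python
import pandas as pd

# Hypothetical English names (a translation, not taken from the dataset).
english_names = {
    "Datum": "date",
    "Potvrđeni slučajevi": "confirmed_cases",
    "Broj testiranih": "tested",
    "Broj smrtnih slučajeva": "deaths",
    "Broj oporavljenih osoba": "recovered",
    "Broj aktivnih slučajeva": "active_cases",
}

# One-row stand-in built from the first row of the head() output.
sample = pd.DataFrame([{
    "Datum": "29.12.2020",
    "Potvrđeni slučajevi": 110454,
    "Broj testiranih": 509067,
    "Broj smrtnih slučajeva": 4024,
    "Broj oporavljenih osoba": 76802,
    "Broj aktivnih slučajeva": 29628.0,
}])

# rename() returns a new frame; labels missing from the mapping are left untouched.
renamed = sample.rename(columns=english_names)
print(renamed.columns.tolist())
```

The rest of this notebook keeps the original Bosnian labels, so the rename is optional.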

## Preprocessing the DataFrame

The first task is to drop every column that contains missing values. After that, we delete the remaining columns that are not useful for the analysis.

In [9]:
preprocessing = preprocessing.dropna(axis=1)  # drop every column that contains at least one missing value
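
Note that `dropna(axis=1)` uses `how="any"` by default, so it removes a column as soon as a single value is missing. `Broj aktivnih slučajeva` has 150 non-null values out of 231, so it is only partly empty, yet it is dropped entirely. The sketch below contrasts the two modes on a tiny illustrative frame. The position of the `NaN` and the `empty` column are assumptions made for the example and are not taken from the data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Potvrđeni slučajevi": [110454, 109911, 694],
    # Partly missing; which row is NaN here is illustrative only.
    "Broj aktivnih slučajeva": [29628.0, 29814.0, np.nan],
    # Hypothetical column with no values at all.
    "empty": [np.nan, np.nan, np.nan],
})

# Count the missing values per column before deciding what to drop.
print(sample.isna().sum())

# Default how="any": drops every column with at least one NaN.
any_missing = sample.dropna(axis=1)

# how="all": drops only columns in which every value is NaN.
all_missing = sample.dropna(axis=1, how="all")

print(any_missing.columns.tolist())
print(all_missing.columns.tolist())
```

Whether losing the partly filled column is acceptable depends on whether active cases are needed later in the analysis.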

In [10]:
preprocessing

| | Unnamed: 0 | Datum | Potvrđeni slučajevi | Broj testiranih | Broj smrtnih slučajeva | Broj oporavljenih osoba |
|---|---|---|---|---|---|---|
| 0 | 0 | 29.12.2020 | 110454 | 509067 | 4024 | 76802 |
| 1 | 1 | 28.12.2020 | 109911 | 505681 | 3976 | 76121 |
| 2 | 2 | 27.12.2020 | 109691 | 503906 | 3953 | 75717 |
| 3 | 3 | 26.12.2020 | 109330 | 502063 | 3923 | 75124 |
| 4 | 4 | 25.12.2020 | 108891 | 499883 | 3901 | 74667 |
| ... | ... | ... | ... | ... | ... | ... |
| 226 | 226 | 06.04.2020 | 694 | 6429 | 17818 | 29 |
| 227 | 227 | 05.04.2020 | 648 | 5820 | 20006 | 21 |
| 228 | 228 | 04.04.2020 | 619 | 5218 | 19651 | 21 |
| 229 | 229 | 03.04.2020 | 567 | 4628 | 21919 | 18 |


In [11]:
preprocessing.columns

Index(['Unnamed: 0', 'Datum', 'Potvrđeni slučajevi', 'Broj testiranih',
       'Broj smrtnih slučajeva', 'Broj oporavljenih osoba'],
      dtype='object')
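
The preview above hints at a data quality problem. In the April rows, `Broj smrtnih slučajeva` is far larger than in the December rows, although a running total should never decrease. A minimal sketch of a sanity check follows, assuming the column is meant to be cumulative. The `sample` frame holds a few non-consecutive rows copied from the preview, not the full dataset.

```python
import pandas as pd

# Non-consecutive rows copied from the preprocessing preview above.
sample = pd.DataFrame({
    "Datum": ["29.12.2020", "28.12.2020", "06.04.2020", "05.04.2020", "04.04.2020"],
    "Broj smrtnih slučajeva": [4024, 3976, 17818, 20006, 19651],
})
sample["Datum"] = pd.to_datetime(sample["Datum"], format="%d.%m.%Y")
sample = sample.sort_values("Datum").reset_index(drop=True)

deaths = sample["Broj smrtnih slučajeva"]

# A cumulative total should never go down from one date to the next.
print(deaths.is_monotonic_increasing)

# List the dates on which the reported total is lower than on the previous date.
drops = sample[deaths.diff() < 0]
print(drops)
```

Rows flagged this way are worth inspecting by hand before any modeling, since the check only shows that something is inconsistent, not what the correct value is.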

In [12]:
preprocessing.drop(
    [
        'iso_code', 'continent', 'location', 'total_cases_per_million', 'new_cases_per_million','aged_70_older',
        'gdp_per_capita', 'extreme_poverty', 'cardiovasc_death_rate','diabetes_prevalence', 'female_smokers', 'male_smokers',
        'handwashing_facilities', 'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index'
    ], axis = 1)

KeyError: "['iso_code' 'continent' 'location' 'total_cases_per_million'\n 'new_cases_per_million' 'aged_70_older' 'gdp_per_capita'\n 'extreme_poverty' 'cardiovasc_death_rate' 'diabetes_prevalence'\n 'female_smokers' 'male_smokers' 'handwashing_facilities'\n 'hospital_beds_per_thousand' 'life_expectancy' 'human_development_index'] not found in axis"
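
None of the labels passed to `drop` in the cell above appear in `preprocessing.columns`, which is why pandas raises the `KeyError`. In addition, `drop` returns a new DataFrame, so its result has to be assigned for the change to persist. One possible way to make the step robust is sketched below, assuming `Unnamed: 0` is the column that is actually unwanted. The `sample` frame and the shortened `unwanted` list are stand-ins for illustration.

```python
import pandas as pd

# Stand-in built from the first rows and columns shown earlier.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Datum": ["29.12.2020", "28.12.2020"],
    "Potvrđeni slučajevi": [110454, 109911],
})

# Mix of one label that exists and two that do not (shortened for illustration).
unwanted = ["Unnamed: 0", "iso_code", "continent"]

# Keep only the labels that exist, so drop() cannot raise a KeyError.
present = [col for col in unwanted if col in sample.columns]
cleaned = sample.drop(columns=present)

# Alternative: let pandas skip missing labels silently.
cleaned_alt = sample.drop(columns=unwanted, errors="ignore")

print(cleaned.columns.tolist())
```

Filtering the list first makes it easy to log which labels were missing, while `errors="ignore"` is shorter but hides typos in column names.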