# Data Mining Project
## Konstantinos Georgiou
Dataset: [COVID-19 World Vaccination Progress](https://www.kaggle.com/gpreda/covid-world-vaccination-progress)

## Information About The Dataset

This dataset contains daily vaccination data for each country. The data are collected almost daily from this website using this code. At the time of writing (2/27), the dataset had 4,380 rows of vaccination data for 112 unique countries, stored in CSV format.

It has 15 columns in total, including the country name, the daily vaccinations, the daily vaccinations per million people on that date, and the source of each record.

## Questions To Be Answered

- Can you identify countries that faced bottlenecks on their daily vaccination rates?
- Can you cluster together countries that faced similar bottlenecks? In what sense are they related?
- Can you enrich the data with more info (country location, GDP, etc) to achieve better results on the previous question?
- Can you track down the bottlenecks and find patterns in how they propagate from day to day from one cluster to another?
- Can you predict future bottlenecks on some clusters based on these patterns?

## Some details
- To set up this project on any machine, just run `make install`. More details are in the [Readme](README.md).
- To download the latest version of the dataset (new rows are added every day), run `make download_dataset`
- The dataset is in the `datasets/covid-world-vaccinations-progress` directory
- The [data mining directory](data_mining) contains three custom packages:
    - Configuration: for handling the yml configuration
    - ColorizedLogger: For formatted logging that saves output in log files
    - timeit: a context manager and decorator for timing functions and code blocks
- The project was created from my **Cookiecutter** template project: [https://github.com/drkostas/starter](https://github.com/drkostas/starter)
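
As an illustration of the "context manager and decorator" idea behind the `timeit` package, here is a minimal sketch built on the standard library's `contextlib.ContextDecorator`. The class name `SimpleTimer` and its `label` parameter are hypothetical and are not the API of the project's own `timeit` package.

```python
import time
from contextlib import ContextDecorator


class SimpleTimer(ContextDecorator):
    """Hypothetical timer usable both as a context manager and as a decorator."""

    def __init__(self, label="block"):
        self.label = label
        self.elapsed = None  # filled in when the timed block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label} took {self.elapsed:.4f} s")
        return False  # never swallow exceptions raised inside the block


# Usage as a context manager
with SimpleTimer("sum of squares") as timer:
    total = sum(i * i for i in range(1000))


# Usage as a decorator
@SimpleTimer("slow_add")
def slow_add(a, b):
    time.sleep(0.01)
    return a + b


result = slow_add(2, 3)
```

Inheriting from `ContextDecorator` means one class covers both use cases, so the timing logic is written only once.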

### Load Libraries and configuration
Configuration path: `confs/covid.yml`

In [3]:
from data_mining import timeit, ColorizedLogger, Configuration

In [4]:
import numpy as np
import pandas as pd 

In [5]:
# Load the configuration
conf_obj = Configuration(config_src='confs/covid.yml')
covid_conf = conf_obj.get_config('covid-progress')[0]
data_path = covid_conf['properties']['data_path']
log_path = covid_conf['properties']['log_path']

2021-04-08 15:06:59 Config       INFO     Configuration file loaded successfully from path: /home/drkostas/GDrive/Projects/UTK/COSC526-Project/confs/covid.yml
2021-04-08 15:06:59 Config       INFO     Configuration Tag: project
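
The access pattern in the cell above (`get_config('covid-progress')[0]`, then `['properties']['data_path']`) suggests a list of entries under each section name. The sketch below mirrors that pattern on a plain dictionary. The layout is an assumption inferred only from the keys the notebook reads, and both paths are placeholders; the real `confs/covid.yml` may look different.

```python
# Hypothetical layout, inferred only from the keys the notebook reads;
# the real confs/covid.yml may contain more (or differently named) fields.
example_config = {
    "covid-progress": [
        {
            "type": "csv",
            "properties": {
                "data_path": "path/to/dataset.csv",  # placeholder path
                "log_path": "path/to/logfile.log",   # placeholder path
            },
        }
    ]
}

# Mirrors conf_obj.get_config('covid-progress')[0] on a plain dict
section = example_config["covid-progress"][0]
data_path = section["properties"]["data_path"]
log_path = section["properties"]["log_path"]
print(section["type"], data_path, log_path)
```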


### Setup logging, Load the Dataset

In [6]:
# Setup the Logger
logger = ColorizedLogger(logger_name='JupyterMain', color='yellow')
ColorizedLogger.setup_logger(log_path=log_path, debug=False)

2021-04-08 15:07:00 FancyLogger  INFO     Logger is set. Log file path: /home/drkostas/GDrive/Projects/UTK/COSC526-Project/logs/covid_progress.log


In [7]:
# Load the dataset
if covid_conf['type'] == 'csv':
    covid_orig_df = pd.read_csv(data_path)
    logger.info("Dataset loaded.")
else:
    logger.error('Data type not supported!')


2021-04-08 15:07:16 JupyterMain  INFO     Dataset loaded.


In [8]:
# Print Columns info
logger.info(f"Dataframe shape: {covid_orig_df.shape}")
covid_orig_df.info()

2021-04-08 15:07:19 JupyterMain  INFO     Dataframe shape: (9576, 15)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9576 entries, 0 to 9575
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              9576 non-null   object 
 1   iso_code                             9576 non-null   object 
 2   date                                 9576 non-null   object 
 3   total_vaccinations                   5772 non-null   float64
 4   people_vaccinated                    5167 non-null   float64
 5   people_fully_vaccinated              3608 non-null   float64
 6   daily_vaccinations_raw               4816 non-null   float64
 7   daily_vaccinations                   9393 non-null   float64
 8   total_vaccinations_per_hundred       5772 non-null   float64
 9   people_vaccinated_per_hundred        5167 non-null   float64
 10  people_fully_vaccinated_per_hundred  3608 non-null   float64
 11  daily_vaccinations_per_million
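
The non-null counts above show that several columns are sparse: `total_vaccinations`, for example, is present in 5772 of 9576 rows, so roughly 40% of its values are missing. A compact way to see the missing share of every column at once is sketched below on a tiny stand-in frame with made-up values. The name `toy_df` is hypothetical; the same expression should apply to `covid_orig_df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with made-up values, not the real dataset
toy_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [0.0, np.nan, 10.0, np.nan],
    "daily_vaccinations": [np.nan, 5.0, 7.0, 8.0],
})

# isna() marks missing cells; the column-wise mean is the missing fraction
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```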

In [9]:
# Print the statistics of the Dataframe
covid_orig_df.describe()

| | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million |
|---|---|---|---|---|---|---|---|---|---|
| count | 5772.0 | 5167.0 | 3608.0 | 4816.0 | 9393.0 | 5772.0 | 5167.0 | 3608.0 | 9393.0 |
| mean | 3223423.0 | 2406510.0 | 1076638.0 | 111540.3 | 67052.04 | 11.486305 | 8.540521 | 4.171486 | 2835.473651 |
| std | 11981540.0 | 8122148.0 | 4359236.0 | 402694.7 | 264546.8 | 19.689933 | 13.053528 | 8.765672 | 4994.852975 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 45894.25 | 41770.0 | 19991.25 | 2972.25 | 973.0 | 0.87 | 0.86 | 0.44 | 361.0 |
| 50% | 309125.5 | 272323.0 | 125149.0 | 14539.5 | 5979.0 | 4.35 | 3.52 | 1.595 | 1398.0 |
| 75% | 1522704.0 | 1157205.0 | 551108.2 | 60570.0 | 28245.0 | 13.1925 | 9.95 | 3.91 | 3474.0 |
| max | 161688400.0 | 104213500.0 | 59858150.0 | 7185000.0 | 5190143.0 | 180.78 | 95.85 | 84.93 | 118759.0 |


In [10]:
# Print the first two rows
covid_orig_df.head(n=2)

| | country | iso_code | date | total_vaccinations | people_vaccinated | people_fully_vaccinated | daily_vaccinations_raw | daily_vaccinations | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | daily_vaccinations_per_million | vaccines | source_name | source_website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2021-02-22 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | Oxford/AstraZeneca | Government of Afghanistan | http://www.xinhuanet.com/english/asiapacific/2... |
| 1 | Afghanistan | AFG | 2021-02-23 | NaN | NaN | NaN | NaN | 1367.0 | NaN | NaN | NaN | 35.0 | Oxford/AstraZeneca | Government of Afghanistan | http://www.xinhuanet.com/english/asiapacific/2... |
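
The `info()` output lists `date` as an `object` column, meaning the dates are stored as strings. For any day-to-day analysis it is usually convenient to convert them to real timestamps and sort each country chronologically. A minimal sketch on a made-up frame follows; `toy_df` is a hypothetical stand-in, and the `%Y-%m-%d` format is assumed from the two rows shown above.

```python
import pandas as pd

# Made-up stand-in rows, deliberately out of chronological order
toy_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "date": ["2021-02-23", "2021-02-22"],
    "daily_vaccinations": [1367.0, None],
})

# Parse the strings into datetime64 values (format assumed from the sample rows)
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%Y-%m-%d")

# Order rows chronologically within each country
toy_df = toy_df.sort_values(["country", "date"]).reset_index(drop=True)
print(toy_df.dtypes)
```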


In [28]:
# Print Value counts for selected columns
logger.info(f"Country Value Counts:\n{covid_orig_df['country'].value_counts()}")
logger.info(f"Vaccines Value Counts:\n{covid_orig_df['vaccines'].value_counts()}")
logger.info(f"Date Value Counts:\n{covid_orig_df['date'].value_counts()}")

2021-04-04 23:58:34 Main         INFO     Country Value Counts:
Northern Ireland    111
Scotland            111
Canada              111
England             111
United Kingdom      111
                   ... 
Palestine             5
Mali                  4
Bahamas               2
Brunei                2
Laos                  1
Name: country, Length: 166, dtype: int64
2021-04-04 23:58:34 Main         INFO     Vaccines Value Counts:
Moderna, Oxford/AstraZeneca, Pfizer/BioNTech                                          2255
Pfizer/BioNTech                                                                       1435
Oxford/AstraZeneca                                                                    1331
Oxford/AstraZeneca, Pfizer/BioNTech                                                   1154
Pfizer/BioNTech, Sinovac                                                               426
Sputnik V                                                                              38
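
The `vaccines` column stores a comma-separated list of vaccine names per row, so `value_counts()` counts combinations rather than individual vaccines. One possible way to count each vaccine separately is to split the strings and explode the resulting lists, sketched here on a made-up frame. `toy_df` is a hypothetical stand-in, and the `", "` separator is assumed from the output above.

```python
import pandas as pd

# Made-up stand-in rows that reuse vaccine names seen in the output above
toy_df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "vaccines": [
        "Moderna, Oxford/AstraZeneca, Pfizer/BioNTech",
        "Pfizer/BioNTech",
        "Pfizer/BioNTech, Sinovac",
    ],
})

per_vaccine = (
    toy_df["vaccines"]
    .str.split(", ")   # one list of vaccine names per row
    .explode()         # one row per (original row, vaccine) pair
    .value_counts()    # how many rows mention each vaccine
)
print(per_vaccine)
```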

In [None]:
# Drop unnecessary columns
