# Data Visualization on Coronavirus data
The purpose of this project is to learn more about web scraping, data cleaning, data modeling, data analysis and data visualization using Bokeh.

I have written easy-to-understand markdown notes for anyone who wants to follow the project.

I started by getting the data from [worldometers.info](https://www.worldometers.info/coronavirus/) and cleaning it up using pandas.

After the cleanup, I modeled the data by continent for easier analysis.

# Getting started and importing the necessary libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re

url = "https://www.worldometers.info/coronavirus/"

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

## Get the whole data table

In [2]:
# this gets the whole data table

table = soup.find("table", attrs={'id':'main_table_countries_today'})
# print(table)


In [3]:
# from the table, this extracts the column headings

table_head = table.thead.find_all('tr')
# print(table_head)

In [4]:
# this gets the data rows from the table body

table_data = table.tbody.find_all('tr')
# print(table_data)

## Saving the column headers into a Python variable

In [5]:
headings = []

for th in table_head[0].find_all("th"):
    a = th.text.replace('\n', ' ').strip()
    headings.append(a)

    print(a)


Country,Other
TotalCases
NewCases
TotalDeaths
NewDeaths
TotalRecovered
ActiveCases
Serious,Critical
Tot Cases/1M pop
Deaths/1M pop
TotalTests
Tests/ 1M pop
Continent


## Understanding the data
This step prints the scraped rows to show how the data is laid out before it is saved into a pandas DataFrame. It is `very important`.

In [6]:
for tr in table_data:
    for what in tr.find_all('td'):
        print(what.text)
    print("\n___________________")


North America

1,186,679
+1,605
69,657
+138
188,999
928,023
16,479




North America

___________________

Europe

1,373,444

135,235

503,728
734,481
18,773




Europe

___________________

Asia

520,063
+13
18,610

265,564
235,889
5,726




Asia

___________________

South America

178,621
+1,807
8,807
+105
60,095
109,719
9,802




South America

___________________

Oceania

8,347
+15
112
+1
7,070
1,165
34




Australia/Oceania

___________________

Africa

39,785

1,638

13,078
25,069
126




Africa

___________________



721

15

645
61
4






___________________
World
3,307,660
+3,440
234,074
+244
1,039,179
2,034,407
50,944
424
30.0


All

___________________
USA
1,095,023

63,856 

152,324
878,843
15,226
3,308
193
6,391,887
19,311
North America

___________________
Spain
239,639

24,543 

137,984
77,112
2,676
5,125
525
1,455,306
31,126
Europe

___________________
Italy
205,463

27,967 

75,945
101,551
1,694
3,398
463
1,979,217
32,735
Europe

___________________
UK
171,253

26

In [7]:
# now that we understand the data, we can structure the data

data = []
for tr in table_data:
    t_row = {}
    # Each table row is stored in the form of
    # t_row = {'Country/Other': '', 'TotalCases': '', 'NewCases': ''...} 

    # find all td's in tr and zip it with headings

    for td, th in zip(tr.find_all("td"), headings): 
        t_row[th] = td.text.replace('\n', '').strip()
    data.append(t_row)
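
The pairing step in the cell above can be tried without any scraping. The following is a minimal sketch on hypothetical header and cell lists (the values are copied from the printed output, the short `UK` row is made up to mirror the truncated print). It shows that `zip()` silently stops at the shorter sequence, so a row with missing cells yields a dictionary with fewer keys.

```python
# Hypothetical header and cell values standing in for the scraped table
headings = ["Country,Other", "TotalCases", "NewCases", "TotalDeaths"]
rows = [
    ["USA", "1,095,023", "", "63,856 "],
    ["Spain", "239,639", "", "24,543 "],
    ["UK", "171,253"],  # a short row: zip() stops at the shorter sequence
]

data = []
for cells in rows:
    # Pair each cell with its heading and strip stray whitespace
    t_row = {th: td.replace("\n", "").strip() for td, th in zip(cells, headings)}
    data.append(t_row)

print(data[0])
print(data[2])
```

Rows with fewer keys become missing values once the list of dictionaries is passed to `pd.DataFrame`, which is one reason the check in the next step matters.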


## Check
Check that the data is structured as it should be. `This is also very important`.

In [8]:
# for t in data:
#     for k, v in t.items():
#         print(v)
#     print("_________\n")

## Transform data
Convert the data into a pandas DataFrame, then start the cleanup and manipulation.

In [9]:
df = pd.DataFrame(data)

# modify data to ease data manipulation (you need to first view data) 
df.columns = df.columns.str.replace('Serious,', '').str.replace(',Other','').str.replace(' ','')
df.set_index('Country', inplace=True)
df.rename(index={"":"Others"}, inplace=True)

# view data again to check if changes were made correctly 
df.head(10)

Country,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,Critical,Tot Cases/1Mpop,Deaths/1Mpop,TotalTests,Tests/1Mpop,Continent
North America,1186679,1605.0,69657,138.0,188999,928023,16479,,,,,North America
Europe,1373444,,135235,,503728,734481,18773,,,,,Europe
Asia,520063,13.0,18610,,265564,235889,5726,,,,,Asia
South America,178621,1807.0,8807,105.0,60095,109719,9802,,,,,South America
Oceania,8347,15.0,112,1.0,7070,1165,34,,,,,Australia/Oceania
Africa,39785,,1638,,13078,25069,126,,,,,Africa
Others,721,,15,,645,61,4,,,,,
World,3307660,3440.0,234074,244.0,1039179,2034407,50944,424.0,30.0,,,All
USA,1095023,,63856,,152324,878843,15226,3308.0,193.0,6391887.0,19311.0,North America
Spain,239639,,24543,,137984,77112,2676,5125.0,525.0,1455306.0,31126.0,Europe
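
The column renaming used in the cell above can be checked in isolation. This is a small sketch on an illustrative subset of the scraped headings; `toy` is a hypothetical empty frame, and `regex=False` is passed explicitly so that each call is a plain substring replacement regardless of the pandas version.

```python
import pandas as pd

# A few of the scraped headings, used here as an illustrative subset
headings = ["Country,Other", "TotalCases", "Serious,Critical", "Tests/ 1M pop"]
toy = pd.DataFrame(columns=headings)

# regex=False makes each replacement a plain substring replacement
toy.columns = (
    toy.columns
    .str.replace("Serious,", "", regex=False)
    .str.replace(",Other", "", regex=False)
    .str.replace(" ", "", regex=False)
)

print(list(toy.columns))
```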


In [10]:
# more views to understand data better
df.tail(10)

Country,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,Critical,Tot Cases/1Mpop,Deaths/1Mpop,TotalTests,Tests/1Mpop,Continent
Bhutan,7,,,,5.0,2,,9.0,,10045.0,13018.0,Asia
Yemen,6,,2.0,,1.0,3,,0.2,0.07,120.0,4.0,Asia
British Virgin Islands,6,,1.0,,3.0,2,,198.0,33.0,,,North America
St. Barth,6,,,,6.0,0,,607.0,,,,North America
Western Sahara,6,,,,5.0,1,,10.0,,,,Africa
Caribbean Netherlands,5,,,,,5,,191.0,,110.0,4195.0,North America
Anguilla,3,,,,3.0,0,,200.0,,,,North America
Comoros,1,,,,,1,,1.0,,,,Africa
Saint Pierre Miquelon,1,,,,,1,,173.0,,,,North America
China,82874,12.0,4633.0,,77642.0,599,38.0,58.0,3.0,,,Asia


In [11]:
# check for missing values and the datatype of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 222 entries, North America to China
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   TotalCases       222 non-null    object
 1   NewCases         222 non-null    object
 2   TotalDeaths      222 non-null    object
 3   NewDeaths        222 non-null    object
 4   TotalRecovered   222 non-null    object
 5   ActiveCases      222 non-null    object
 6   Critical         222 non-null    object
 7   Tot Cases/1Mpop  222 non-null    object
 8   Deaths/1Mpop     222 non-null    object
 9   TotalTests       222 non-null    object
 10  Tests/1Mpop      222 non-null    object
 11  Continent        222 non-null    object
dtypes: object(12)
memory usage: 11.3+ KB


# Cleaning data
As scraped, the data is not usable for analysis because every column has the `object` (string) datatype. Before the datatypes can be converted, the data must be cleaned.
* I dropped the columns that are not used in this visualization
* I cleaned the data with a regular expression (`regexp`) by finding patterns in the data

In [12]:
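# Remove every character except letters, digits, '.' and '_'
# (regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching)
for column in df.columns:
    df[column] = df[column].str.replace(r"[^\.a-zA-Z0-9_]", '', regex=True)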

use_column = ['TotalCases', 'NewCases', 'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'ActiveCases', 'Critical', 'TotalTests', 'Continent']
df = df[use_column]
    
# print out data to see if it works
# df
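
To see what the pattern in the cell above keeps and removes, here is a minimal sketch on a handful of sample strings. The values are illustrative; the `N/A` entry in particular is an assumed example of how a missing marker could end up as `NA` after cleaning.

```python
import pandas as pd

raw = pd.Series(["1,186,679", "+1,605", "0.07", "North America", "N/A"])

# Keep only letters, digits, '.' and '_'; everything else is removed
cleaned = raw.str.replace(r"[^\.a-zA-Z0-9_]", "", regex=True)

print(cleaned.tolist())
```

Note that spaces are removed as well, which is why the continent labels later appear as `NorthAmerica` and `SouthAmerica`.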

# Modifying Data and Datatype conversion
To convert the datatype, I can either:
* convert all missing values `(na)` to zero and convert the columns to the int datatype, or
* leave the missing values and convert the columns to a float datatype `(best practice)`, or
* convert all missing values to zero and convert the columns to float

`Important Note`: Missing values must be made uniform before the data can be cleaned properly, so the data has to be understood well enough first. In this dataset, missing values appear as `"NA"`, `None` or empty strings.
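
The three options listed above can be compared side by side on a toy column. This is a sketch under the assumption that the only missing markers are `"NA"` and the empty string; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

column = pd.Series(["1605", "NA", "", "13"])  # toy column with both missing markers

# Make the missing markers uniform first
uniform = column.replace(["NA", ""], np.nan)

as_float = pd.to_numeric(uniform)        # keep NaN, float dtype
as_int = as_float.fillna(0).astype(int)  # zeros instead of NaN, int dtype
as_float_zero = as_float.fillna(0)       # zeros instead of NaN, float dtype

print(as_float.tolist())
print(as_int.tolist())
print(as_float_zero.tolist())
```

Keeping `NaN` preserves the difference between "no data reported" and a genuine zero, whereas filling with zero makes sums and plots simpler at the cost of that distinction.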


In [13]:
# a["Continent"] = a.replace("" ,"Others") 
df = df.replace("NA", np.nan)
df = df.replace("", np.nan)
df.fillna(value={"Continent":"Others"}, inplace=True)
df.fillna(0, inplace=True)
df.head(10)

Country,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,Critical,TotalTests,Continent
North America,1186679,1605,69657,138,188999,928023,16479,0,NorthAmerica
Europe,1373444,0,135235,0,503728,734481,18773,0,Europe
Asia,520063,13,18610,0,265564,235889,5726,0,Asia
South America,178621,1807,8807,105,60095,109719,9802,0,SouthAmerica
Oceania,8347,15,112,1,7070,1165,34,0,AustraliaOceania
Africa,39785,0,1638,0,13078,25069,126,0,Africa
Others,721,0,15,0,645,61,4,0,Others
World,3307660,3440,234074,244,1039179,2034407,50944,0,All
USA,1095023,0,63856,0,152324,878843,15226,6391887,NorthAmerica
Spain,239639,0,24543,0,137984,77112,2676,1455306,Europe


In [14]:

df.tail(10)

Country,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,Critical,TotalTests,Continent
Bhutan,7,0,0,0,5,2,0,10045,Asia
Yemen,6,0,2,0,1,3,0,120,Asia
British Virgin Islands,6,0,1,0,3,2,0,0,NorthAmerica
St. Barth,6,0,0,0,6,0,0,0,NorthAmerica
Western Sahara,6,0,0,0,5,1,0,0,Africa
Caribbean Netherlands,5,0,0,0,0,5,0,110,NorthAmerica
Anguilla,3,0,0,0,3,0,0,0,NorthAmerica
Comoros,1,0,0,0,0,1,0,0,Africa
Saint Pierre Miquelon,1,0,0,0,0,1,0,0,NorthAmerica
China,82874,12,4633,0,77642,599,38,0,Asia


## Convert columns datatype
* Depending on the datatype needed, you can convert to `int` or `float` and the data will work fine
* I converted the datatype to `float` here as it is easy to use
* It is important to catch errors here because not every column can be converted (`Continent` holds text)

In [15]:
for column in df.columns:
    try:
        df[column] = df[column].astype("float")
    except Exception as e:
        print("Column datatype cannot be converted\n") 
    # a[column] =  df[column].map(lambda x: re.sub(r'[^\.a-zA-Z0-9_]', '', x))

Column datatype cannot be converted
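
The message above does not say which column failed. A possible variation, sketched here on a hypothetical three-column frame, catches the more specific `ValueError` and records the column name, so it is clear that only the text column was skipped.

```python
import pandas as pd

toy = pd.DataFrame({
    "TotalCases": ["7", "6"],
    "TotalDeaths": ["0", "2"],
    "Continent": ["Asia", "Asia"],
})

skipped = []
for column in toy.columns:
    try:
        toy[column] = toy[column].astype("float")
    except ValueError as err:
        # Text columns such as Continent cannot be parsed as numbers
        skipped.append(column)
        print(f"{column}: left as text ({err})")

print(toy.dtypes)
```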



In [16]:
# Check if it worked
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 222 entries, North America to China
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   TotalCases      222 non-null    float64
 1   NewCases        222 non-null    float64
 2   TotalDeaths     222 non-null    float64
 3   NewDeaths       222 non-null    float64
 4   TotalRecovered  222 non-null    float64
 5   ActiveCases     222 non-null    float64
 6   Critical        222 non-null    float64
 7   TotalTests      222 non-null    float64
 8   Continent       222 non-null    object 
dtypes: float64(8), object(1)
memory usage: 15.6+ KB


## Save Cleaned data

In [17]:
# df.to_csv("coronadata.csv")

## Preparing data for analysis

In [18]:
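# The data is now clean but not yet suitable for analysis:
# the continent summary rows are listed in the Country column
df.reset_index(inplace=True)

# collect the summary rows (continents, Others, World) that come before the first country
con = []
for name in df["Country"]:
    if name == "USA":
        break
    con.append(name)

# collect the column names up to and including "Critical"
col = []
for name in df.columns:
    col.append(name)
    if name == "Critical":
        break

# keep only the country rows
all_country = df[~df["Country"].isin(con)]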
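
The same split can also be expressed by position instead of by name. The sketch below is an alternative, not the approach used in this notebook: it assumes, as the cell above does, that `USA` is the first real country row, and it works on a hypothetical toy frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Europe", "Asia", "World", "USA", "Spain", "China"],
    "TotalCases": [1373444.0, 520063.0, 3307660.0, 1095023.0, 239639.0, 82874.0],
})

# Position of the first real country row (assumed to be "USA", as in the cell above)
first_country = toy["Country"].eq("USA").to_numpy().argmax()

summary_rows = toy.iloc[:first_country]
country_rows = toy.iloc[first_country:]

print(summary_rows["Country"].tolist())
print(country_rows["Country"].tolist())
```

If `USA` were missing from the table, `argmax()` would return 0 and every row would be treated as a country, so a real pipeline would probably want an explicit check for that case.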

# I modelled the data into different tables using the continents
* This is just to have a separate table for each continent

In [19]:
continent = df[ df["Country"].isin(con[:-1]) ]
continent = continent[col]

north_america = all_country[ all_country["Continent"] == "NorthAmerica"]

europe = all_country[ all_country["Continent"] == "Europe"]

asia = all_country[ all_country["Continent"] == "Asia"]

south_america = all_country[ all_country["Continent"] == "SouthAmerica"]

oceania = all_country[ all_country["Continent"] == "AustraliaOceania"]

africa = all_country[ all_country["Continent"] == "Africa"]

others = all_country[ all_country["Continent"] == "Others"]
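
As an alternative to one filter per continent, the per-continent tables could be built in a single pass with `groupby`. This is a sketch on a hypothetical toy frame; the dictionary name `by_continent` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["USA", "Spain", "Italy", "China", "Yemen"],
    "TotalCases": [1095023.0, 239639.0, 205463.0, 82874.0, 6.0],
    "Continent": ["NorthAmerica", "Europe", "Europe", "Asia", "Asia"],
})

# One DataFrame per continent, keyed by the continent label
by_continent = {name: group for name, group in toy.groupby("Continent")}

print(sorted(by_continent))
print(by_continent["Europe"]["Country"].tolist())
```

The separate named variables used above are easier to read in a notebook; the dictionary form is mainly useful when the same plot has to be repeated for every continent.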


# The data is now ready for analysis
* The data has been cleaned and is ready for analysis
* Import the necessary libraries for the analysis and visualization

In [20]:
%matplotlib inline
from math import pi

import csv
import matplotlib.pyplot as plt
from math import fsum
from urllib.request import urlopen
from bs4 import BeautifulSoup

from bokeh.plotting import figure,output_file, output_notebook, show
from bokeh.models import ColumnDataSource, ranges, LabelSet, HoverTool
from bokeh.layouts import gridplot
from bokeh.transform import cumsum

tools = "wheel_zoom, reset, save"


# Display plots in the Jupyter notebook
output_notebook()

blue    = '#008fd5'
red     = '#fc4f30'
yellow  = '#e5ae37'
green   = '#6d904f'
skye    = '#33D1FF'
pink    = "#FF338A"
lgreen  = '#99FF33'

colours = [blue, red, yellow,green, skye, pink, lgreen]

width = 600
heigth = 400

In [21]:
# select the columns and set the index in one step to avoid modifying a slice in place
pies = all_country[["Country", 'ActiveCases','TotalRecovered', 'TotalDeaths']].set_index('Country')



# output_file("pie.html")


# column totals in order: ActiveCases, TotalRecovered, TotalDeaths
col_sum = pies.sum(axis=0, skipna=True).tolist()

active       = col_sum[0]
recovered    = col_sum[1]
death        = col_sum[2] 

#  Calculating percentage
sum_all = fsum(col_sum)
aper    = "{0:.2f}%".format((active/sum_all)*100)
rper    = "{0:.2f}%".format((recovered/sum_all)*100)
dper    = "{0:.2f}%".format((death/sum_all)*100)




percent = {"value": [aper, rper, dper]}
# Setting data
x = { 
    'Active - ' + aper: active,
    'Recovered - ' + rper: recovered,
    'Deaths - ' + dper: death   
}

colors = [blue, green,  red]

data = pd.Series(x).reset_index(name='value').rename(columns={'index':'Case'})
# data['label'] = percent
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = colors[:len(x)]

source = ColumnDataSource(data)

pie = figure(plot_height=350, title="COVID-19 cases overview", toolbar_location=None,
           tools=tools,
           x_range=(-0.5, 1.0))

pie.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True),
        end_angle=cumsum('angle'),
        line_color="white",
        fill_color = "color",
        legend_field='Case', 
        source=source)

hover = HoverTool()
hover.tooltips = """
    <div>
        <h3>@Case</h3>
        <h3>Total: @value</h3>
    </div>

"""

pie.axis.axis_label=None
pie.axis.visible=False
pie.grid.grid_line_color = None
pie.add_tools(hover)

# show(pie)
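
The percentage labels and wedge angles used for the pie chart come down to a few lines of arithmetic. The sketch below reproduces that calculation with the standard library only. It uses the `World` row from the scraped table as sample input, whereas the cell above sums the country rows, so the exact figures may differ slightly.

```python
from math import fsum, pi

# World totals from the scraped table shown earlier in this notebook
totals = {"Active": 2034407, "Recovered": 1039179, "Deaths": 234074}

total = fsum(totals.values())
labels = []
angles = []
for name, value in totals.items():
    share = value / total
    labels.append(f"{name} - {share * 100:.2f}%")
    angles.append(share * 2 * pi)  # wedge size in radians

print(labels)
print(angles)
```

Because each angle is a share of the full circle, the angles add up to `2 * pi`, which is what lets `cumsum('angle')` place the wedges end to end.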

# DATA VISUALISATION FOR THE TOTAL, DEATH and RECOVERED CASES

* Here, I prepare the data for each case type to be plotted
* I only use the top 20 countries for each chart
* I printed the data out to check it
                    

# Total Case

In [22]:
""" Total cases discovered """
# sort the data by total cases in ascending order
con_tc = all_country.sort_values(by='TotalCases')

# picking the last 20 countries
con_tc = con_tc.tail(20)

# Plot Axis
source = ColumnDataSource(con_tc)
x_tc = con_tc['Country']

# setting plot; y_range is used for x_tc because the data is categorical data
tc = figure(y_range=x_tc, 
            title="Total COVID19 cases",
            tools=tools,
            plot_width=width, 
            plot_height=heigth)

tc.hbar(y="Country", 
        right="TotalCases",
        height=0.5,
        hover_color=skye,
        left=0, 
        color="navy",
        source=source)

hover = HoverTool(
        tooltips=[
            ("Country", '@Country'),
            ("Total", "@TotalCases")
        ]
)


tc.add_tools(hover)

# show(tc)
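
The "sort ascending, then take the tail" selection used above can also be written with `nlargest`. This is a sketch on a hypothetical five-row frame with a top 3 instead of a top 20. The result is reversed so that the largest value comes last, which matches the order produced by `sort_values(...).tail(...)`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Spain", "Yemen", "USA", "Italy", "Bhutan"],
    "TotalCases": [239639.0, 6.0, 1095023.0, 205463.0, 7.0],
})

# Top 3 by TotalCases, then reversed so the largest ends up last
top = toy.nlargest(3, "TotalCases").iloc[::-1]

print(top["Country"].tolist())
```

The order matters because the country names are passed to `y_range` as categories, and with a categorical y axis the last category is drawn at the top of the chart.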


# Total Death

In [23]:
""" Total Death """
# sort the data by total deaths in ascending order
con_td = all_country.sort_values(by='TotalDeaths')

# picking the last 20 countries
con_td = con_td.tail(20)

# Plot Axis
source = ColumnDataSource(con_td)
x_td = con_td['Country']

# setting plot; y_range is used for x_td because the data is categorical data
td = figure(y_range=x_td, 
            tools=tools,
            title="Total COVID19 Deaths",
            plot_width=width, 
            plot_height=heigth)

td.hbar(y="Country", 
        right="TotalDeaths",
        height=0.5, 
        left=0, 
        color="red",
        hover_color=pink,        
        source=source)

hover = HoverTool(
        tooltips=[
            ("Country", '@Country'),
            ("Total Death", "@TotalDeaths")
        ]
)


td.add_tools(hover)

# show(td)

# Total Recovered

In [24]:
""" Total Recovered """
# sort the data by total recovered in ascending order
con_tr = all_country.sort_values(by='TotalRecovered')


# picking the last 20 countries
con_tr = con_tr.tail(20)

# Plot Axis
source = ColumnDataSource(con_tr)
x_tr = con_tr['Country']

# setting plot; y_range is used for x_tr because the data is categorical data
tr = figure(y_range=x_tr, 
            title="Total COVID19 Recovered cases",
            tools=tools,
            plot_width=width, 
            plot_height=heigth)

tr.hbar(y="Country", 
        right="TotalRecovered",
        height=0.5, 
        left=0, 
        color="green",
        hover_color=skye,
        source=source)

hover = HoverTool(
        tooltips=[
            ("Country", '@Country'),
            ("Total Recovered", "@TotalRecovered")
        ]
)


tr.add_tools(hover)

# show(tr)

# Displaying the graphical output

In [25]:
# output_file("corona_dashboard.html")


grid = gridplot([  [pie], [tc], [td], [tr]  ])
show(grid)

# Getting insight into the continent data

In [26]:
continent

Unnamed: 0,Country,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,Critical
0,North America,1186679.0,1605.0,69657.0,138.0,188999.0,928023.0,16479.0
1,Europe,1373444.0,0.0,135235.0,0.0,503728.0,734481.0,18773.0
2,Asia,520063.0,13.0,18610.0,0.0,265564.0,235889.0,5726.0
3,South America,178621.0,1807.0,8807.0,105.0,60095.0,109719.0,9802.0
4,Oceania,8347.0,15.0,112.0,1.0,7070.0,1165.0,34.0
5,Africa,39785.0,0.0,1638.0,0.0,13078.0,25069.0,126.0
6,Others,721.0,0.0,15.0,0.0,645.0,61.0,4.0


In [28]:
# con_data = continent[['Country', 'TotalCases', 'TotalDeaths', 'TotalRecovered']]


# Plot Axis
source = ColumnDataSource(continent)
x_tr = continent['Country']

# setting plot; x_range is used for x_tr because the continent names are categorical data
tcd = figure(x_range=x_tr, 
            title="Continental Death Case",
            tools=tools,
            plot_width=width, 
            plot_height=heigth)

tcd.vbar(x="Country", 
        top="TotalDeaths",
        width=0.70, 
    
        color="red",
        hover_color=skye,
        source=source)

hover = HoverTool(
        tooltips=[
            ("Country", '@Country'),
            ("Total Death", "@TotalDeaths")
        ]
)


tcd.add_tools(hover)

show(tcd)