![ ](https://www.pon-cat.com/application/files/7915/8400/2602/home-banner.jpg)

# <center> **Data Visualization and Exploratory Data Analysis** </center>

Visualization is an important part of data analysis. Presenting information visually makes it easier to perceive, which helps you spot additional patterns, compare the relative sizes of quantities, and quickly communicate the key aspects of the data.

Let's start with a little "memo" that should always be kept in mind when creating any graphs.

### <center> How to visualize data and make everyone hate you </center>

1. Chart **titles** are unnecessary. It is always clear from the graph what data it describes.
2. Under no circumstances label the **axes** of the graph. Let the others check their intuition!
3. **Units** are optional. What difference does it make whether the quantity was measured in people or in liters!
4. The smaller the **text** on the graph, the sharper the viewer's eyesight.
5. Try to fit all the **information** that you have in the dataset into one chart, with full titles, transcripts, and footnotes. The more text, the more informative!
6. Whenever possible, use as many 3D and special effects as you can. There will be less visual distortion than in 2D.

As an example, consider the COVID-19 pandemic. Let's use a dataset of regularly updated coronavirus statistics, which is publicly available on Kaggle: https://www.kaggle.com/imdevskp/corona-virus-report?select=covid_19_clean_complete.csv

The main libraries for visualization in Python that we need today are **matplotlib, seaborn, plotly**. 

In [1]:
# Install the required packages
!pip install plotly-express
!pip install nbformat==4.2.0
!pip install plotly



In [2]:
import matplotlib.pyplot as plt #  the most popular library for making the plots
%matplotlib inline 
import numpy as np
import seaborn as sns
import pandas as pd
import pickle # for loading serialized Python objects (the country codes file)
import plotly
import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

%config InlineBackend.figure_format = 'svg' #  graphs in svg look sharper

# Change the default plot size
plt.rcParams['figure.figsize'] = 7, 5

import warnings
warnings.filterwarnings('ignore')

We read the data and look at the number of countries in the dataset and what time period it covers.

In [3]:
data = pd.read_csv('./data/covid_19_clean.csv')

In [4]:
data.head(10)

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0,0,Africa
5,,Antigua and Barbuda,17.0608,-61.7964,2020-01-22,0,0,0,0,Americas
6,,Argentina,-38.4161,-63.6167,2020-01-22,0,0,0,0,Americas
7,,Armenia,40.0691,45.0382,2020-01-22,0,0,0,0,Europe
8,Australian Capital Territory,Australia,-35.4735,149.0124,2020-01-22,0,0,0,0,Western Pacific
9,New South Wales,Australia,-33.8688,151.2093,2020-01-22,0,0,0,0,Western Pacific


How many countries are there in this table?

In [5]:
data['Country/Region'].nunique()

187

In [6]:
data.shape

(49068, 10)

In [7]:
data.describe()

Unnamed: 0,Lat,Long,Confirmed,Deaths,Recovered,Active
count,49068.0,49068.0,49068.0,49068.0,49068.0,49068.0
mean,21.43373,23.528236,16884.9,884.17916,7915.713,8085.012
std,24.95032,70.44274,127300.2,6313.584411,54800.92,76258.9
min,-51.7963,-135.0,0.0,0.0,0.0,-14.0
25%,7.873054,-15.3101,4.0,0.0,0.0,0.0
50%,23.6345,21.7453,168.0,2.0,29.0,26.0
75%,41.20438,80.771797,1518.25,30.0,666.0,606.0
max,71.7069,178.065,4290259.0,148011.0,1846641.0,2816444.0


In [20]:
float(4.290259e+06)

4290259.0

In [12]:
data.describe(include=['object'])

Unnamed: 0,Province/State,Country/Region,Date,WHO Region
count,14664,49068,49068,49068
unique,78,187,188,6
top,Zhejiang,China,2020-04-17,Europe
freq,188,6204,261,15040


How many cases were confirmed on average across all reports? Measures of central tendency:

In [13]:
data

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.939110,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.153300,20.168300,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.033900,1.659600,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.506300,1.521800,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.202700,17.873900,2020-01-22,0,0,0,0,Africa
...,...,...,...,...,...,...,...,...,...,...
49063,,Sao Tome and Principe,0.186400,6.613100,2020-07-27,865,14,734,117,Africa
49064,,Yemen,15.552727,48.516388,2020-07-27,1691,483,833,375,Eastern Mediterranean
49065,,Comoros,-11.645500,43.333300,2020-07-27,354,7,328,19,Africa
49066,,Tajikistan,38.861000,71.276100,2020-07-27,7235,60,6028,1147,Europe


In [17]:
data.iloc[-60:]

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
49008,,Sudan,12.8628,30.2176,2020-07-27,11424,720,5939,4765,Eastern Mediterranean
49009,,Suriname,3.9193,-56.0278,2020-07-27,1483,24,925,534,Americas
49010,,Sweden,60.128161,18.643501,2020-07-27,79395,5700,0,73695,Europe
49011,,Switzerland,46.8182,8.2275,2020-07-27,34477,1978,30900,1599,Europe
49012,,Taiwan*,23.7,121.0,2020-07-27,462,7,440,15,Western Pacific
49013,,Tanzania,-6.369028,34.888822,2020-07-27,509,21,183,305,Africa
49014,,Thailand,15.870032,100.992541,2020-07-27,3297,58,3111,128,South-East Asia
49015,,Togo,8.6195,0.8248,2020-07-27,874,18,607,249,Africa
49016,,Trinidad and Tobago,10.6918,-61.2225,2020-07-27,148,8,128,12,Americas
49017,,Tunisia,33.886917,9.537499,2020-07-27,1455,50,1157,248,Eastern Mediterranean


In [15]:
data['Confirmed'].mode()

0    0
dtype: int64

In [18]:
data['Confirmed'].median()

168.0

In [19]:
data['Confirmed'].mean()

16884.90425531915
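
The mode, median, and mean above differ by orders of magnitude because the distribution of `Confirmed` is strongly right-skewed. Here is a minimal sketch of the same effect on a small made-up Series (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical right-skewed sample: many small reports and one very large one
confirmed = pd.Series([0, 0, 0, 4, 30, 168, 1500, 4_000_000])

mean_value = confirmed.mean()          # pulled up by the single extreme value
median_value = confirmed.median()      # middle of the sorted values
mode_value = confirmed.mode().iloc[0]  # most frequent value

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
```

When the mean is far above the median, as here, the median is usually the more representative "typical" value.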

What is the average number of confirmed cases across all countries in this table?

In [22]:
data[data['Country/Region'] == 'India'].iloc[:60]

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
129,,India,20.593684,78.96288,2020-01-22,0,0,0,0,South-East Asia
390,,India,20.593684,78.96288,2020-01-23,0,0,0,0,South-East Asia
651,,India,20.593684,78.96288,2020-01-24,0,0,0,0,South-East Asia
912,,India,20.593684,78.96288,2020-01-25,0,0,0,0,South-East Asia
1173,,India,20.593684,78.96288,2020-01-26,0,0,0,0,South-East Asia
1434,,India,20.593684,78.96288,2020-01-27,0,0,0,0,South-East Asia
1695,,India,20.593684,78.96288,2020-01-28,0,0,0,0,South-East Asia
1956,,India,20.593684,78.96288,2020-01-29,0,0,0,0,South-East Asia
2217,,India,20.593684,78.96288,2020-01-30,1,0,0,1,South-East Asia
2478,,India,20.593684,78.96288,2020-01-31,1,0,0,1,South-East Asia


In [23]:
data

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
0,,Afghanistan,33.939110,67.709953,2020-01-22,0,0,0,0,Eastern Mediterranean
1,,Albania,41.153300,20.168300,2020-01-22,0,0,0,0,Europe
2,,Algeria,28.033900,1.659600,2020-01-22,0,0,0,0,Africa
3,,Andorra,42.506300,1.521800,2020-01-22,0,0,0,0,Europe
4,,Angola,-11.202700,17.873900,2020-01-22,0,0,0,0,Africa
...,...,...,...,...,...,...,...,...,...,...
49063,,Sao Tome and Principe,0.186400,6.613100,2020-07-27,865,14,734,117,Africa
49064,,Yemen,15.552727,48.516388,2020-07-27,1691,483,833,375,Eastern Mediterranean
49065,,Comoros,-11.645500,43.333300,2020-07-27,354,7,328,19,Africa
49066,,Tajikistan,38.861000,71.276100,2020-07-27,7235,60,6028,1147,Europe


What is the average number of confirmed cases in every country in total (on the last available date in our table)?

In [25]:
max(data['Date'])

'2020-07-27'

In [26]:
# data[data['Date'] == '2020-07-27']
data[data['Date'] == max(data['Date'])]['Confirmed'].mean()

63143.620689655174

In [27]:
data[data['Date'] == max(data['Date'])]['Confirmed'].mode()

0       12
1       14
2       18
3       24
4       62
5       86
6       99
7      114
8      203
9    10621
dtype: int64

In [28]:
data[data['Date'] == max(data['Date'])]['Confirmed'].median()

1879.0
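
Note that the statistics above are computed over rows, and some countries occupy several rows (one per province). A sketch of how one might aggregate to country totals before averaging, using a made-up miniature table (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the dataset: country "A" is split into two provinces
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "C"],
    "Province/State": ["A1", "A2", None, None],
    "Date": ["2020-07-27"] * 4,
    "Confirmed": [100, 300, 50, 10],
})

last_day = toy[toy["Date"] == toy["Date"].max()]

# Mean over rows: provinces are counted as if they were countries
mean_per_row = last_day["Confirmed"].mean()

# Mean over countries: sum the provinces first, then average
country_totals = last_day.groupby("Country/Region")["Confirmed"].sum()
mean_per_country = country_totals.mean()

print(mean_per_row, mean_per_country)
```

The two numbers differ whenever at least one country is reported by province, so it is worth deciding which of them the question actually asks for.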

What is the maximum number of confirmed cases in every country?

In [34]:
data.groupby('Country/Region')['Confirmed'].agg(['max', 'mean'])

Unnamed: 0_level_0,max,mean
Country/Region,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,36263,10299.946809
Albania,4880,1046.287234
Algeria,27973,6275.292553
Andorra,907,502.148936
Angola,950,120.542553
...,...,...
West Bank and Gaza,10621,1241.813830
Western Sahara,10,4.792553
Yemen,1691,357.340426
Zambia,4552,688.409574


In [37]:
data.groupby('Country/Region')['Confirmed'].max().mean()

87615.74331550802

In [38]:
data.groupby('Country/Region')['Confirmed'].max()

Country/Region
Afghanistan           36263
Albania                4880
Algeria               27973
Andorra                 907
Angola                  950
                      ...  
West Bank and Gaza    10621
Western Sahara           10
Yemen                  1691
Zambia                 4552
Zimbabwe               2704
Name: Confirmed, Length: 187, dtype: int64

In [41]:
data.groupby('Country/Region')['Confirmed'].max().sort_values(ascending=False)[:10]

Country/Region
US                4290259
Brazil            2442375
India             1480073
Russia             816680
South Africa       452529
Mexico             395489
Peru               389717
Chile              347923
United Kingdom     300111
Iran               293606
Name: Confirmed, dtype: int64

More info on groupby: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

* **mean()**: Compute mean of groups

* **sum()**: Compute sum of group values

* **size()**: Compute group sizes

* **count()**: Compute count of group

* **std()**: Standard deviation of groups

* **var()**: Compute variance of groups

* **sem()**: Standard error of the mean of groups

* **describe()**: Generates descriptive statistics

* **first()**: Compute first of group values

* **last()**: Compute last of group values

* **nth()** : Take nth value, or a subset if n is a list

* **min()**: Compute min of group values

* **max()**: Compute max of group values

You can compute several statistics at once (mean, median, prod, sum, std,
var) with `agg`, for both DataFrame and Series:

In [42]:
data.groupby('Country/Region')['Confirmed'].agg(['mean', 'median', 'std'])

Unnamed: 0_level_0,mean,median,std
Country/Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,10299.946809,1407.0,13458.792417
Albania,1046.287234,695.0,1264.372255
Algeria,6275.292553,3191.5,7339.328093
Andorra,502.148936,734.5,371.460824
Angola,120.542553,25.0,214.148815
...,...,...,...
West Bank and Gaza,1241.813830,341.0,2513.959434
Western Sahara,4.792553,6.0,4.172042
Yemen,357.340426,1.0,548.353443
Zambia,688.409574,84.0,986.737387
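
If you want readable column names or different statistics for different columns, pandas also supports named aggregation. A small sketch on made-up data (the `toy` frame and the output column names are assumptions):

```python
import pandas as pd

# Hypothetical cumulative counts for two countries over three days
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Confirmed": [0, 10, 50, 5, 5, 20],
    "Deaths": [0, 1, 2, 0, 0, 1],
})

# Named aggregation: new_column=(source_column, function)
summary = toy.groupby("Country/Region").agg(
    peak_confirmed=("Confirmed", "max"),
    median_confirmed=("Confirmed", "median"),
    peak_deaths=("Deaths", "max"),
)
print(summary)
```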


In [43]:
data.pivot_table(columns='WHO Region', index='Date', values='Confirmed', aggfunc='sum')

WHO Region,Africa,Americas,Eastern Mediterranean,Europe,South-East Asia,Western Pacific
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2020-01-22,0,1,0,0,2,552
2020-01-23,0,1,0,0,3,650
2020-01-24,0,2,0,2,5,932
2020-01-25,0,2,0,3,8,1421
2020-01-26,0,6,0,3,9,2100
...,...,...,...,...,...,...
2020-07-23,656696,8294228,1439937,3216701,1625727,277192
2020-07-24,677376,8460627,1453830,3239712,1679154,280946
2020-07-25,694057,8609554,1467209,3259047,1732350,284973
2020-07-26,711035,8709755,1478334,3277229,1786304,289139
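
A pivot table like the one above is essentially a `groupby` on two keys followed by moving one key into the columns. A minimal sketch on made-up data (the `toy` frame is an assumption) that builds the same wide table both ways:

```python
import pandas as pd

# Hypothetical reports: two European rows and one African row per day
toy = pd.DataFrame({
    "Date": ["2020-01-22"] * 3 + ["2020-01-23"] * 3,
    "WHO Region": ["Europe", "Europe", "Africa"] * 2,
    "Confirmed": [1, 2, 0, 3, 4, 1],
})

# Variant 1: pivot_table does the grouping and reshaping in one call
wide_pivot = toy.pivot_table(index="Date", columns="WHO Region",
                             values="Confirmed", aggfunc="sum")

# Variant 2: group by both keys, then move "WHO Region" into the columns
wide_groupby = (toy.groupby(["Date", "WHO Region"])["Confirmed"]
                   .sum()
                   .unstack("WHO Region"))

print(wide_pivot)
print(wide_groupby)
```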


In [45]:
data[data['Active'] > 0]

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
48,Anhui,China,31.825700,117.226400,2020-01-22,1,0,0,1,Western Pacific
49,Beijing,China,40.182400,116.414200,2020-01-22,14,0,0,14,Western Pacific
50,Chongqing,China,30.057200,107.874000,2020-01-22,6,0,0,6,Western Pacific
51,Fujian,China,26.078900,117.987400,2020-01-22,1,0,0,1,Western Pacific
53,Guangdong,China,23.341700,113.424400,2020-01-22,26,0,0,26,Western Pacific
...,...,...,...,...,...,...,...,...,...,...
49063,,Sao Tome and Principe,0.186400,6.613100,2020-07-27,865,14,734,117,Africa
49064,,Yemen,15.552727,48.516388,2020-07-27,1691,483,833,375,Eastern Mediterranean
49065,,Comoros,-11.645500,43.333300,2020-07-27,354,7,328,19,Africa
49066,,Tajikistan,38.861000,71.276100,2020-07-27,7235,60,6028,1147,Europe


In [46]:
data[data['WHO Region'] == 'Western Pacific']['Country/Region'].unique()

array(['Australia', 'Brunei', 'Cambodia', 'China', 'Fiji', 'Japan',
       'South Korea', 'Malaysia', 'Mongolia', 'New Zealand',
       'Papua New Guinea', 'Philippines', 'Singapore', 'Taiwan*',
       'Vietnam', 'Laos'], dtype=object)

In [49]:
avg_confirmed = data[data['Date'] == max(data['Date'])]['Confirmed'].mean()

In [50]:
avg_confirmed

63143.620689655174

In [51]:
data[(data['WHO Region'] == 'Western Pacific') & (data['Confirmed'] > avg_confirmed)]

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
8152,Hubei,China,30.975600,112.270700,2020-02-22,64084,2346,15299,46439,Western Pacific
8413,Hubei,China,30.975600,112.270700,2020-02-23,64084,2346,15343,46395,Western Pacific
8674,Hubei,China,30.975600,112.270700,2020-02-24,64287,2495,16748,45044,Western Pacific
8935,Hubei,China,30.975600,112.270700,2020-02-25,64786,2563,18971,43252,Western Pacific
9196,Hubei,China,30.975600,112.270700,2020-02-26,65187,2615,20969,41603,Western Pacific
...,...,...,...,...,...,...,...,...,...,...
48465,,Philippines,12.879721,121.774017,2020-07-25,78412,1897,25752,50763,Western Pacific
48607,Hubei,China,30.975600,112.270700,2020-07-26,68135,4512,63623,0,Western Pacific
48726,,Philippines,12.879721,121.774017,2020-07-26,80448,1932,26110,52406,Western Pacific
48868,Hubei,China,30.975600,112.270700,2020-07-27,68135,4512,63623,0,Western Pacific


In [52]:
data[(data['WHO Region'] == 'Western Pacific') & (data['Confirmed'] > avg_confirmed)]['Country/Region'].unique()

array(['China', 'Philippines'], dtype=object)

In [53]:
some_countries = ['China', 'Singapore', 'Philippines', 'Japan']

In [55]:
data[data['Country/Region'].isin(some_countries)]

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
48,Anhui,China,31.825700,117.226400,2020-01-22,1,0,0,1,Western Pacific
49,Beijing,China,40.182400,116.414200,2020-01-22,14,0,0,14,Western Pacific
50,Chongqing,China,30.057200,107.874000,2020-01-22,6,0,0,6,Western Pacific
51,Fujian,China,26.078900,117.987400,2020-01-22,1,0,0,1,Western Pacific
52,Gansu,China,35.751800,104.286100,2020-01-22,0,0,0,0,Western Pacific
...,...,...,...,...,...,...,...,...,...,...
48886,Yunnan,China,24.974000,101.487000,2020-07-27,190,2,186,2,Western Pacific
48887,Zhejiang,China,29.183200,120.093400,2020-07-27,1270,1,1268,1,Western Pacific
48944,,Japan,36.204824,138.252924,2020-07-27,31142,998,21970,8174,Western Pacific
48987,,Philippines,12.879721,121.774017,2020-07-27,82040,1945,26446,53649,Western Pacific
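
The filters above combine boolean masks with `&` and `isin`. As a sketch of an alternative spelling, the same condition can be written with `DataFrame.query` (the `toy` frame, its values, and the threshold are made up for illustration):

```python
import pandas as pd

# Hypothetical last-day snapshot
toy = pd.DataFrame({
    "Country/Region": ["China", "Japan", "Italy", "Philippines"],
    "WHO Region": ["Western Pacific", "Western Pacific", "Europe", "Western Pacific"],
    "Confirmed": [70000, 30000, 250000, 80000],
})
threshold = 60000

# Boolean masks: each condition in parentheses, combined with & (and) or | (or)
by_mask = toy[(toy["WHO Region"] == "Western Pacific") & (toy["Confirmed"] > threshold)]

# query: column names with spaces go in backticks, Python variables are prefixed with @
by_query = toy.query('`WHO Region` == "Western Pacific" and Confirmed > @threshold')

# isin: keep the rows whose value is in a given list
subset = toy[toy["Country/Region"].isin(["China", "Japan"])]

print(by_mask["Country/Region"].tolist())
```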


Let's make a small report:

In [56]:
data = pd.read_csv('./data/covid_19_clean.csv')

print("Number of countries: ", data['Country/Region'].nunique())
print(f"Day from {min(data['Date'])} till {max(data['Date'])}, overall {data['Date'].nunique()} days.")

data['Date'] = pd.to_datetime(data['Date'], format = '%Y-%m-%d')


display(data[data['Country/Region'] == 'Russia'].tail())

Number of countries:  187
Day from 2020-01-22 till 2020-07-27, overall 188 days.


Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active,WHO Region
47948,,Russia,61.52401,105.318756,2020-07-23,793720,12873,579295,201552,Europe
48209,,Russia,61.52401,105.318756,2020-07-24,799499,13026,587728,198745,Europe
48470,,Russia,61.52401,105.318756,2020-07-25,805332,13172,596064,196096,Europe
48731,,Russia,61.52401,105.318756,2020-07-26,811073,13249,599172,198652,Europe
48992,,Russia,61.52401,105.318756,2020-07-27,816680,13334,602249,201097,Europe


The coronavirus pandemic is a clear example of exponential growth. To demonstrate this, let's plot the total number of confirmed cases and deaths over time. We will use a **Line Chart**, which can show the dynamics of one or several indicators and makes it easy to see how a value changes over time.

In [None]:
# Line chart

ax = data[['Confirmed', 'Deaths', 'Date']].groupby('Date').sum().plot(title='Title')
ax.set_xlabel("X axes")
ax.set_ylabel("Y axes");
 
# TODO
# Change the title and axes names

The graph above shows aggregate figures for the whole world. Let's select the 10 most affected countries (based on the last day in the dataset) and show the number of registered cases for each of them on a single **Line Chart**. This time, let's try the **plotly** library.

In [None]:
# Preparation steps for the table

# Extract the top 10 countries by the number of confirmed cases
df_top = data[data['Date'] == max(data.Date)]
df_top = df_top.groupby('Country/Region', as_index=False)['Confirmed'].sum()
df_top = df_top.nlargest(10,'Confirmed')

# Extract trend across time
df_trend = data.groupby(['Date','Country/Region'], as_index=False)['Confirmed'].sum()
df_trend = df_trend.merge(df_top, on='Country/Region')
df_trend.rename(columns={'Country/Region' : 'Countries', 
                         'Confirmed_x':'Cases',
                         'Date' : 'Dates'}, 
                inplace=True)

In [None]:
# Plot a graph
# px stands for plotly_express
px.line(df_trend, 
        title='Increased number of cases of COVID-19',
        x='Dates', 
        y='Cases', 
        color='Countries')

Let's take the logarithm of this column so that countries with very different case counts can be compared on one graph.

In [None]:
# Add a column with the natural logarithm of the case counts
df_trend['ln(Cases)'] = np.log(df_trend['Cases'] + 1) # add 1 so that log(0) is never taken

px.line(df_trend, 
        x='Dates', 
        y='ln(Cases)', 
        color='Countries', 
        title='COVID-19 total cases growth for the top 10 worst affected countries (Logarithmic Scale)')
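
Why is the logarithm useful here? If cases grow exponentially, the logarithm of the case count grows linearly, and the slope of that line is the growth rate. A minimal sketch on synthetic data (the starting value and the growth rate are assumptions, not estimates from the dataset):

```python
import numpy as np

days = np.arange(30)
growth_rate = 0.2                        # hypothetical daily growth rate
cases = 10 * np.exp(growth_rate * days)  # synthetic exponential curve

# On the log scale the curve becomes a straight line: ln(cases) = ln(10) + rate * day
log_cases = np.log(cases)
slope, intercept = np.polyfit(days, log_cases, 1)

print(f"estimated growth rate: {slope:.3f}, estimated initial cases: {np.exp(intercept):.1f}")
```

On a log-scale chart, a straight segment therefore means a constant growth rate, and a flattening curve means the growth is slowing down.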

What interesting conclusions can you draw from this graph?

Try to build similar graphs for deaths and active cases.

In [None]:
# TODO

Another popular chart is the **Pie chart**. Most often, it is used to show how a whole is divided into parts (ratios).

In [None]:
# Pie chart

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
labels_donut = [country for country in df_top['Country/Region']]
fig.add_trace(go.Pie(labels=labels_donut, hole=.4, hoverinfo="label+percent+name", 
                     values=[cases for cases in df_top.Confirmed], 
                     name="Ratio", ), 1, 1)
labels_pie = [country for country in df_top['Country/Region']]
fig.add_trace(go.Pie(labels=labels_pie, pull=[0, 0, 0.2, 0], 
                     values=[cases for cases in df_top.Confirmed], 
                     name="Ratio"), 1, 2)

fig.update_layout(
    title_text="Donut & Pie Chart: Distribution of COVID-19 cases among the top-10 affected countries",
    # Add annotations in the center of the donut pies.
    annotations=[dict(text=' ', x=0.5, y=0.5, font_size=16, showarrow=False)],
    colorway=['rgb(69, 135, 24)', 'rgb(136, 204, 41)', 'rgb(204, 204, 41)', 
              'rgb(235, 210, 26)', 'rgb(209, 156, 42)', 'rgb(209, 86, 42)', 'rgb(209, 42, 42)', ])
fig.show()

In the line graphs above, we visualized cumulative country-level information on the number of detected cases. Now, let's plot a daily trend chart by calculating the difference between the current value and the previous day's value.
For this purpose, we will use a histogram (**Histogram**). Also, let's add annotations for key events, for example, the lockdown dates in Wuhan (China), Italy, and the UK.

In [None]:
# Histogram

def add_daily_diffs(df):
    # 0 because the previous value is unknown
    df.loc[0,'Cases_daily'] = 0
    df.loc[0,'Deaths_daily'] = 0
    for i in range(1, len(df)):
        df.loc[i,'Cases_daily'] = df.loc[i,'Confirmed'] - df.loc[i - 1,'Confirmed']
        df.loc[i,'Deaths_daily'] = df.loc[i,'Deaths'] - df.loc[i - 1,'Deaths']
    return df

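# Select several columns with a list (double brackets); a bare tuple is not supported in recent pandas
df_world = data.groupby('Date', as_index=False)[['Deaths', 'Confirmed']].sum()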
df_world = add_daily_diffs(df_world)

fig = go.Figure(data=[
    go.Bar(name='The number of cases',
           marker={'color': 'rgb(0,100,153)'},
           x=df_world.Date, 
           y=df_world.Cases_daily),
    go.Bar(name='The number of deaths', x=df_world.Date, y=df_world.Deaths_daily)
])

fig.update_layout(barmode='overlay', title='Daily confirmed cases and deaths from COVID-19 across the world',
                 annotations=[dict(x='2020-01-23', y=1797, text="Lockdown (Wuhan)", 
                                   showarrow=True, arrowhead=1, ax=-100, ay=-200),
                              dict(x='2020-03-09', y=1797, text="Lockdown (Italy)", 
                                   showarrow=True, arrowhead=1, ax=-100, ay=-200),
                              dict(x='2020-03-23', y=19000, text="Lockdown (UK)", 
                                   showarrow=True, arrowhead=1, ax=-100, ay=-200)])
fig.show()
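
The row-by-row loop in `add_daily_diffs` works, but the same day-over-day difference can be computed in one vectorized step with `Series.diff`. A sketch on a small stand-in table (the `toy_world` frame is an assumption with illustrative cumulative totals):

```python
import pandas as pd

# Stand-in for df_world: cumulative confirmed cases for five consecutive days
toy_world = pd.DataFrame({
    "Date": pd.date_range("2020-01-22", periods=5),
    "Confirmed": [555, 654, 941, 1434, 2118],
})

# diff() subtracts the previous row; the first row has no predecessor and becomes NaN,
# which we replace with 0, as the loop-based version does
toy_world["Cases_daily"] = toy_world["Confirmed"].diff().fillna(0).astype(int)

print(toy_world)
```

Besides being shorter, the vectorized version does not depend on the index labels being 0, 1, 2, ... as the `df.loc[i - 1, ...]` lookups do.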

In [None]:
# Save 
plotly.offline.plot(fig, filename='my_beautiful_histogram.html', show_link=False)

A histogram is often mistaken for a bar chart because the two look alike, but these charts serve different purposes. A histogram shows how data is distributed over a continuous interval or a specific period of time: the intervals (or time periods) lie along the horizontal axis and the frequency along the vertical axis. A bar chart, in contrast, compares values across discrete categories.
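
To make the difference concrete, here is a sketch of the numbers behind each chart type, without any plotting (all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Histogram: a continuous variable is cut into intervals, and values are counted per interval
daily_new_cases = np.array([3, 8, 12, 15, 21, 22, 27, 35, 38, 49])  # hypothetical values
counts, bin_edges = np.histogram(daily_new_cases, bins=5, range=(0, 50))

# Bar chart: one bar per discrete category; here the bar height is the number of rows
regions = pd.Series(["Europe", "Africa", "Europe", "Americas", "Europe"])
bar_heights = regions.value_counts()

print(counts, bin_edges)
print(bar_heights)
```

In the histogram the order and width of the intervals carry meaning, while the bars of a bar chart can be reordered freely.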

Let's build the **Bar Chart** now. It can be vertical or horizontal; let's choose the latter.
Let's build the graph only for the top 20 countries by mortality. We will calculate this statistic as the ratio of the number of deaths to the number of confirmed cases for each country.

For some countries in the dataset, statistics are presented for each region (for example, for the provinces of China or the states of Australia). For such countries, we will keep only one (maximum) value. Alternatively, one could calculate the average over the regions and use it as the indicator for the country.

In [None]:
# Bar chart

# .copy() so that the new columns are added to an independent table, not to a view of `data`
df_mortality = data.query('(Date == "2020-07-17") & (Confirmed > 100)').copy()
df_mortality['mortality'] = (df_mortality['Deaths'] / df_mortality['Confirmed']).round(3)
df_mortality.sort_values('mortality', ascending=False, inplace=True)
# Keep the maximum mortality rate for countries for which statistics are provided for each region.
df_mortality.drop_duplicates(subset=['Country/Region'], keep='first', inplace=True)

fig = px.bar(df_mortality[:20].iloc[::-1],
             x='mortality', 
             y='Country/Region',
             labels={'mortality': 'Death rate', 'Country/Region': 'Country'},
             title='Death rate: top-20 affected countries on 2020-07-17', 
             text='mortality', 
             height=800, 
             orientation='h') # horizontal
fig.show()

# TODO: color the bars on a heat-map scale (based on the mortality rate)
# To do this, add the argument color='mortality'
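
Keeping the maximum regional value can overstate a country's mortality when one small region has an unusually high rate. As a sketch of a different option (not the approach used in the cell above), one could sum deaths and confirmed cases per country and take the ratio of the totals; the `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical snapshot: country "A" is reported as two regions
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "B"],
    "Province/State": ["A1", "A2", None],
    "Confirmed": [1000, 200, 500],
    "Deaths": [10, 20, 25],
})

# Sum the regions, then compute one ratio per country
totals = toy.groupby("Country/Region")[["Confirmed", "Deaths"]].sum()
totals["mortality"] = (totals["Deaths"] / totals["Confirmed"]).round(3)

# For comparison: the maximum regional ratio, as in the "keep the maximum" approach
regional_max = (toy["Deaths"] / toy["Confirmed"]).groupby(toy["Country/Region"]).max()

print(totals)
print(regional_max)
```

For country "A" the ratio of totals is 0.025, while the maximum regional ratio is 0.1, so the choice of aggregation visibly changes the ranking.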

**Heat Maps** are quite useful for visualizing correlation matrices between features. When there are many features, such a graph lets you quickly assess which features are highly correlated and which have no linear relationship.

In [None]:
# Heat map
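# Correlation is defined only for numeric columns, so select them explicitly
sns.heatmap(data.select_dtypes('number').corr(), annot=True, fmt='.2f', cmap='cividis'); # try another colormap, e.g. 'RdBu'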
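
Each cell of that heat map is a Pearson correlation coefficient. A minimal sketch that computes one coefficient by hand and compares it with `DataFrame.corr` (the `toy` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Confirmed": [10, 20, 30, 40, 50],
    "Deaths": [1, 2, 2, 4, 5],
    "Country": ["A", "B", "C", "D", "E"],  # non-numeric column, excluded below
})

numeric = toy.select_dtypes("number")
corr_matrix = numeric.corr()  # Pearson correlation by default

# Pearson r by definition: covariance divided by the product of the standard deviations
x, y = numeric["Confirmed"], numeric["Deaths"]
dx, dy = x - x.mean(), y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_matrix)
print(manual_r)
```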

The scatter plot helps to find the relationship between two indicators. To build it, you can use pairplot, which displays a histogram for each variable on the diagonal and a scatter plot for each pair of variables off the diagonal.

In [None]:
# Pairplot
sns_plot = sns.pairplot(data[['Deaths', 'Confirmed']])
sns_plot.savefig('pairplot.png') # save

A **pivot table** can automatically group and aggregate your data.

In [None]:
# Pivot table

plt.figure(figsize=(12, 4))
df_new = df_mortality.iloc[:10].copy()  # copy so the assignments below do not touch df_mortality
df_new['Confirmed'] = df_new['Confirmed'].astype(int)  # np.int was removed from NumPy; use the built-in int
df_new['binned_fatalities'] = pd.cut(df_new['Deaths'], 3)  # three equal-width bins
confirmed_by_fatality_bin = df_new.pivot_table(
                        index='binned_fatalities', 
                        columns='Country/Region', 
                        values='Confirmed', 
                        aggfunc='sum').fillna(0).astype(int)
sns.heatmap(confirmed_by_fatality_bin, annot=True, fmt=".1f", linewidths=0.7, cmap="viridis");
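
The rows of this pivot table come from `pd.cut`, which splits a numeric column into intervals. A small sketch of how the binning behaves (the `deaths` values and the explicit bin edges are made up for illustration):

```python
import pandas as pd

deaths = pd.Series([5, 40, 120, 300, 900, 1000])  # hypothetical values

# An integer asks for that many equal-width bins between the min and the max
equal_width = pd.cut(deaths, 3)

# Explicit edges and labels give bins that are easier to read on a chart
labeled = pd.cut(deaths, bins=[0, 100, 500, 1000], labels=["low", "medium", "high"])

print(equal_width.cat.codes.tolist())  # bin number for each value
print(labeled.tolist())
```

With equal-width bins, skewed data tends to pile up in the first bin and leave others empty, which is why explicit edges are often the better choice.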

In [None]:
# Geo

# file with abbreviations 
with open('./data/countries_codes.pkl', 'rb') as file:
    countries_codes = pickle.load(file)
    
df_map = data.copy()
df_map['Date'] = data['Date'].astype(str)
df_map = df_map.groupby(['Date','Country/Region'], as_index=False)[['Confirmed','Deaths']].sum()
df_map['iso_alpha'] = df_map['Country/Region'].map(countries_codes)
df_map['ln(Confirmed)'] = np.log(df_map.Confirmed + 1)
df_map['ln(Deaths)'] = np.log(df_map.Deaths + 1)

px.choropleth(df_map, 
              locations="iso_alpha", 
              color="ln(Confirmed)", 
              hover_name="Country/Region",
              hover_data=["Confirmed"],
              animation_frame="Date",
              color_continuous_scale=px.colors.sequential.OrRd,
              title = 'Total Confirmed Cases growth (Logarithmic Scale)')
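
Countries whose names are missing from the code dictionary get `NaN` in `iso_alpha` and silently disappear from the map, so it can be worth checking for them. A sketch with a made-up dictionary (the `codes` mapping and the country list are assumptions, not the contents of `countries_codes.pkl`):

```python
import pandas as pd

codes = {"France": "FRA", "Japan": "JPN"}  # hypothetical name -> ISO alpha-3 mapping
countries = pd.Series(["France", "Japan", "Taiwan*", "Japan"])

# map() returns NaN for every name that is not a key of the dictionary
iso_alpha = countries.map(codes)
missing = countries[iso_alpha.isna()].unique()

print("Countries without a code:", list(missing))
```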

What important information did the new graph provide (visualization by time and geolocation)? Can the following questions be answered from the graph?
* In which country did the spread of the coronavirus start?
* Which countries are most affected by the pandemic?
* Which hemisphere accounts for the majority of cases? What hypotheses can be formulated about the relationship between temperature and the rate of spread of the virus?

What other observations can you make from the graph?

### **Recommended materials**

1. Matplotlib documentation https://matplotlib.org/3.2.1/tutorials/index.html 
2. Seaborn documentation https://seaborn.pydata.org/tutorial.html
3. Plotly https://plotly.com/python/ 
4. [Kaggle COVID19-Explained through Visualizations](https://www.kaggle.com/anshuls235/covid19-explained-through-visualizations/#data)
5. Open Data Science lecture on these topics:
https://www.youtube.com/watch?v=fwWCw_cE5aI&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX&index=2

https://www.youtube.com/watch?v=WNoQTNOME5g&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX&index=3

### **Additional libraries**:
* Bokeh
* ggplot
* geoplotlib
* pygal
