In [2]:
import numpy as np
import pandas as pd

In [6]:
!pwd

/Users/Aja/cse583/project


## Create the Scatter Map

### Loading data

As mentioned before, we’ll work with COVID-19 pandemic data. We’ll use the dataset from Johns Hopkins University, which is updated daily during the crisis. It is available on opendatasoft.

In [19]:
# Load toy dataset from Johns Hopkins COVID19 data for US
df = pd.read_csv("../project/toydata.csv", sep=';')

# Display first 5 lines
df.head()

Unnamed: 0,Zone,Sub Zone,Category,Date,Count,Location
0,Malaysia,,Deaths,2020-05-12,109,"4.210484,101.975766"
1,Malaysia,,Deaths,2020-06-01,115,"4.210484,101.975766"
2,Malaysia,,Deaths,2020-06-10,118,"4.210484,101.975766"
3,Malaysia,,Deaths,2020-09-19,130,"4.210484,101.975766"
4,Malaysia,,Deaths,2020-06-23,121,"4.210484,101.975766"


The data is easy to understand: each row gives a daily count for one zone (country), in one of three categories (Deaths, Confirmed, or Recovered), together with the zone's GPS coordinates.

### Processing data

This dataset has to be transformed to fit the Mapbox inputs. Let’s be clear about the input needed.

The purpose of a scatter map is to plot bubbles on a map, with variable sizes and, optionally, variable colors. In today’s example we want:

- A single bubble per country
- Bubble latitude: latitude of the specified country
- Bubble longitude: longitude of the specified country
- Bubble size: number of confirmed cases
- Bubble color: ratio of recovered persons
- Bubble hover: a summary of the country’s situation

Let’s get this information in different columns.

In [34]:
def process_pandemic_data(df):

    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()

    # Columns renaming
    df.columns = [col.lower() for col in df.columns]

    # Create a zone per zone/subzone
    df['zone'] = df['zone'].apply(str) + ' ' + df['sub zone'].apply(lambda x: str(x).replace('nan', ''))

    # Extracting latitude and longitude as floats
    df['lat'] = df['location'].apply(lambda x: float(x.split(',')[0]))
    df['lon'] = df['location'].apply(lambda x: float(x.split(',')[1]))

    # Saving countries positions (latitude and longitude per subzones)
    country_position = df[['zone', 'lat', 'lon']].drop_duplicates(['zone']).set_index(['zone'])

    # Pivoting per category: one row per (date, zone), one column per category
    df = pd.pivot_table(df, values='count', index=['date', 'zone'], columns=['category'])
    df.columns = ['confirmed', 'deaths', 'recovered']

    # Merging locations after pivoting
    df = df.join(country_position)

    # Filling nan values with 0
    df = df.fillna(0)

    # Compute bubble sizes (np.log(0) is -inf, which is mapped to 0)
    df['size'] = df['confirmed'].apply(lambda x: (np.sqrt(x/100) + 1) if x > 500 else (np.log(x) / 2 + 1)).replace(-np.inf, 0)

    # Compute bubble color
    df['color'] = (df['recovered']/df['confirmed']).fillna(0).replace(np.inf, 0)

    return df

- First step: extract the ``Location`` column into latitude and longitude
- Second step: get a tidy dataset by pivoting the single ``category`` column, which holds 3 rows per key (``date`` & ``zone``), into three columns: ``confirmed``, ``deaths`` & ``recovered``.
- Third step: pivoting the table drops the location information, so we join it back from the ``country_position`` dataset.
- Fourth step: define the size of each bubble
- Fifth step: define the color of each bubble
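
To make the fourth and fifth steps concrete, here is a minimal sketch of the size and color rules on a few hand-made rows. The helper name `bubble_size` and the counts in `toy` are assumptions made up for illustration; the rule itself mirrors the expressions in `process_pandemic_data`, with the zero case handled explicitly instead of through `np.log(0)`.

```python
import numpy as np
import pandas as pd


def bubble_size(confirmed):
    """Square-root scaling above 500 cases, log scaling below, 0 for no cases."""
    if confirmed <= 0:
        return 0.0
    if confirmed > 500:
        return float(np.sqrt(confirmed / 100) + 1)
    return float(np.log(confirmed) / 2 + 1)


# Hypothetical counts, invented for illustration only
toy = pd.DataFrame({"confirmed": [0, 1, 10000], "recovered": [0, 0, 2500]})

toy["size"] = toy["confirmed"].apply(bubble_size)

# 0/0 yields NaN and x/0 yields inf; both are mapped to a ratio of 0
toy["color"] = (toy["recovered"] / toy["confirmed"]).fillna(0).replace(np.inf, 0)

print(toy)
```

The two scaling regimes keep small outbreaks visible while preventing the largest ones from covering the whole map.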

In [35]:
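# Lower-case the column names so that 'count', 'date', 'zone' and 'category' exist
df.columns = [col.lower() for col in df.columns]

# Pivoting per category
df2 = pd.pivot_table(df, values='count', index=['date', 'zone'], columns=['category'])
df2.columns = ['confirmed', 'deaths', 'recovered']

# Like so:
df2.head()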

The result is a DataFrame with a multi-index (``date``, ``zone``), which makes filtering very convenient.
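
As a small illustration of this reshaping, the sketch below pivots a hand-made long-format frame into the wide, multi-indexed layout. The names `long_df` and `wide_df` and every number in them are invented for the example and are not taken from the Johns Hopkins data.

```python
import pandas as pd

# Hand-made long-format rows: one row per (date, zone, category).
# All counts are invented for illustration.
long_df = pd.DataFrame({
    "date": ["2020-05-12"] * 3 + ["2020-05-13"] * 3,
    "zone": ["Malaysia"] * 6,
    "category": ["Confirmed", "Deaths", "Recovered"] * 2,
    "count": [100, 5, 40, 120, 6, 50],
})

# One row per (date, zone) and one column per category
wide_df = pd.pivot_table(long_df, values="count", index=["date", "zone"], columns=["category"])
wide_df.columns = [c.lower() for c in wide_df.columns]

print(wide_df)
```

Note that `pd.pivot_table` aggregates with the mean by default, so duplicate (date, zone, category) rows would be averaged rather than raising an error.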

For example, to extract the data for a single day:

In [23]:
day = '2020-05-12'
# Select on the first index level of the pivoted frame; the result is indexed by zone
df_day = df2.xs(day)
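
A day slice of this kind carries everything the bubble specification asks for. The sketch below shows one way the columns could be mapped onto a `scattermapbox` trace, written as a plain dict so that it runs without Plotly installed. The `df_day` frame here is a hand-made stand-in with invented values, and the hover text format is an assumption, not something prescribed by the dataset.

```python
import pandas as pd

# Hand-made stand-in for a processed day slice; all values are invented
df_day = pd.DataFrame(
    {
        "lat": [4.210484],
        "lon": [101.975766],
        "confirmed": [6000.0],
        "recovered": [4500.0],
        "size": [8.75],
        "color": [0.75],
    },
    index=pd.Index(["Malaysia"], name="zone"),
)

# Hover text: a short per-country summary (format chosen for illustration)
hover = [
    f"{zone}<br>Confirmed: {row.confirmed:.0f}<br>Recovered: {row.recovered:.0f}"
    for zone, row in df_day.iterrows()
]

# Plain-dict description of a scattermapbox trace
trace = dict(
    type="scattermapbox",
    lat=df_day["lat"].tolist(),
    lon=df_day["lon"].tolist(),
    mode="markers",
    marker=dict(
        size=df_day["size"].tolist(),    # bubble size
        color=df_day["color"].tolist(),  # recovered ratio, between 0 and 1
        cmin=0,
        cmax=1,
        showscale=True,
    ),
    text=hover,
    hoverinfo="text",
)

print(trace["text"][0])
```

A dict like this can be passed to `plotly.graph_objects.Figure(data=[trace])`; the map style and any Mapbox access token are configured in the figure layout and are not shown here.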