# Digital Narratives of COVID-19: Frequency Visualization

In this notebook we demonstrate how to use *coveet*, a Python script that retrieves basic statistics (most frequent words, bigrams, trigrams, top users, hashtags) from our curated COVID-19 database. We also visualize the results using a bar chart race plot. Please feel free to modify this notebook or, if you would like to preserve this version, make a copy of it by clicking "File" > "Make a Copy...".

To follow along, we recommend running the script portions piecemeal, in order.

__Author:__

* Jerry Bonnell, [j.bonnell@miami.edu](mailto:j.bonnell@miami.edu)


## 0. Setting Up

Before we get started, let's set up the notebook by importing the libraries we need.

In [3]:
import pandas as pd
from datetime import datetime, timedelta
import plotly.graph_objects as go
import random

from coveet import days_to_df
from bar_race import get_bar_race_plot

## 1. Querying the Database

In this part, we show how to (programmatically) query the database using coveet based on some criteria. As an example, let's obtain all English bigrams in Florida between April 27 and May 3, the week in which the total number of cases in the U.S. crossed the one million mark. Later on in the notebook, we'll visualize these results.
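For readers new to the term, a bigram is a pair of adjacent words. The sketch below counts bigrams with the standard library only. It is an illustration of the idea, not coveet's implementation: the function name `count_bigrams` and the sample texts are made up for this example, and tokenization here is a plain whitespace split.

```python
from collections import Counter


def count_bigrams(texts):
    """Count adjacent word pairs across a list of texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


# Made-up sample texts, not taken from the collection.
sample = ["covid testing site opens", "new covid testing rules"]
top = count_bigrams(sample).most_common(1)
print(top)
```

Here "covid testing" is the only bigram that appears in both texts, so it ranks first.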

We create two datetime objects with the corresponding start (April 27) and end (May 3) dates. We then call the coveet function *days_to_df()* with the criteria we are interested in (the GitHub repository describes the available parameters in more detail) and request only the top 10 results by setting top_n=10. Setting metric=2 asks for a bigram search. The function returns the results as a pandas DataFrame, which we store in a variable called *df*.

In [5]:
start = datetime(year=2020, month=4, day=27)
end = datetime(year=2020, month=5, day=3)
df = days_to_df(lang=['en'], geo=['fl'], start_date=start, end_date=end, metric=2, top_n=10)

querying date 2020-04-27 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-04-28 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-04-29 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-04-30 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-05-01 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-05-02 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
querying date 2020-05-03 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=2
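The log above shows one query per day, with both endpoints included. As a rough sketch of how such a day-by-day range can be enumerated with `timedelta` (independent of coveet; the helper name `days_between` is an assumption made for this example):

```python
from datetime import datetime, timedelta


def days_between(start, end):
    """Return every day from start to end, inclusive (hypothetical helper)."""
    n_days = (end - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]


start = datetime(year=2020, month=4, day=27)
end = datetime(year=2020, month=5, day=3)
days = days_between(start, end)
for day in days:
    print(day)
```

For the week used in this notebook, the list contains seven dates, matching the seven queries in the log.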


We can view a snapshot of the data by examining the dataframe. Note that each row corresponds to a day and each column corresponds to a popular bigram. The value in each cell denotes the number of occurrences of that bigram.

In [7]:
df

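| Unnamed: 0 | covid pandemic | home values | covid home | impact might | might covid | covid testing | covid patients | due covid | south florida | covid crisis | ... | western intel | dossier reveals | response covid | may day | day global | global action | action giving | giving unity | unity response | covid jorge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 31.0 | 30.0 | 29.0 | 29.0 | 21.0 | 19.0 | 19.0 | 16.0 | 16.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 45 | 0.0 | 0.0 | 0.0 | 0.0 | 22.0 | 12.0 | 0.0 | 16.0 | 16.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 42 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 20.0 | 26.0 | 14.0 | 20.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 41 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.0 | 14.0 | 20.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 48 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 25.0 | 21.0 | 14.0 | 14.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 16 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 15.0 | 16.0 | 0.0 | 0.0 | ... | 8.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 13.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |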
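Because each row is a day and each column a bigram, ordinary pandas operations apply directly. The sketch below rebuilds four of the columns shown above as a toy DataFrame (named `toy` here so it does not overwrite *df*) and computes weekly totals. Attaching calendar dates to the rows assumes that row *i* is the *i*-th day of the query window, which the notebook implies but the output does not state explicitly.

```python
import pandas as pd

# Four columns copied from the snapshot above; the real df has many more.
toy = pd.DataFrame({
    "covid pandemic": [37, 45, 42, 41, 48, 16, 15],
    "covid testing": [21, 22, 16, 0, 11, 8, 0],
    "covid patients": [19, 12, 20, 0, 25, 15, 0],
    "due covid": [19, 0, 26, 17, 21, 16, 0],
})

# Assumption: row i corresponds to April 27 plus i days.
toy.index = pd.date_range("2020-04-27", periods=len(toy), freq="D")

# Sum each column over the week and rank the bigrams by total count.
weekly_totals = toy.sum().sort_values(ascending=False)
print(weekly_totals)

# Counts for a single day.
print(toy.loc[pd.Timestamp("2020-04-29")])
```

The same two operations should work on the full *df*, though the date index would need to match however many rows the query returned.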


## 2. Bar Chart Race

Now comes the fun part :-) Using the returned dataframe, we can visualize how the top bigrams change over the course of the week. We pass in the variable *df* to a function *get_bar_race_plot()* (the details of which are in the bar_race module on GitHub), and obtain an interactive plot using __plotly__.

In [8]:
fig = get_bar_race_plot(df, 'COVID-19 top bigrams of ')
fig.show()
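Conceptually, a bar chart race draws one ranked bar chart per time step and animates between them. The sketch below shows only the ranking step, using pandas: for each row (day) it keeps the *k* largest nonzero columns in descending order. This is not the code in the bar_race module; the helper name `top_k_per_row` and the toy data are assumptions for illustration.

```python
import pandas as pd


def top_k_per_row(df, k):
    """For each row, list the k largest nonzero (column, value) pairs (hypothetical helper)."""
    frames = []
    for _, row in df.iterrows():
        # Drop zero counts, then keep the k largest values in descending order.
        ranked = row[row > 0].nlargest(k)
        frames.append(list(ranked.items()))
    return frames


# Toy data: two days, three bigrams.
toy = pd.DataFrame({
    "covid pandemic": [37, 45],
    "covid testing": [21, 22],
    "home values": [31, 0],
})
frames = top_k_per_row(toy, k=2)
for frame in frames:
    print(frame)
```

Each entry of `frames` holds the labels and bar lengths for one animation step; a plotting library would then render one bar chart per entry.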

Let's demonstrate a few more visualizations. This time, let's visualize the top users during that same week in Florida. 

In [11]:
start = datetime(year=2020, month=4, day=27)
end = datetime(year=2020, month=5, day=3)
df = days_to_df(lang=['en'], geo=['fl'], start_date=start, end_date=end, metric='users', top_n=10)
fig = get_bar_race_plot(df, 'COVID-19 top users of ')
fig.show()

querying date 2020-04-27 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-04-28 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-04-29 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-04-30 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-05-01 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-05-02 00:00:00 with lang=['en'], loc=['fl']
getting top users
querying date 2020-05-03 00:00:00 with lang=['en'], loc=['fl']
getting top users


For instance, we can see that a significant share of the tweets comes from news outlets like wsvn and TheMiamiTimes. Let's now compare the top English words in Florida with the top Spanish words over the same dates.

In [12]:
start = datetime(year=2020, month=4, day=27)
end = datetime(year=2020, month=5, day=3)
df = days_to_df(lang=['en'], geo=['fl'], start_date=start, end_date=end, metric=1, top_n=10)
fig = get_bar_race_plot(df, 'COVID-19 top English words of ')
fig.show()

querying date 2020-04-27 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-04-28 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-04-29 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-04-30 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-05-01 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-05-02 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1
querying date 2020-05-03 00:00:00 with lang=['en'], loc=['fl']
getting ngrams n=1


In [14]:
start = datetime(year=2020, month=4, day=27)
end = datetime(year=2020, month=5, day=3)
df = days_to_df(lang=['es'], geo=['fl'], start_date=start, end_date=end, metric=1, top_n=10)
fig = get_bar_race_plot(df, 'COVID-19 top Spanish words of ')
fig.show()

querying date 2020-04-27 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-04-28 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-04-29 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-04-30 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-05-01 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-05-02 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1
querying date 2020-05-03 00:00:00 with lang=['es'], loc=['fl']
getting ngrams n=1


Feel free to add your own visualizations below!