# Digital Narratives of COVID-19: Frequency Visualization

In this notebook we demonstrate how to use *coveet*, a Python script that retrieves basic statistics (most frequent words, bigrams, trigrams, top users, hashtags) from our curated database of COVID-19 tweets. We also visualize the results using a bar chart race plot. Please feel free to modify this notebook or, if you would like to preserve this version, make a copy of it by clicking "File" > "Make a Copy...".

To follow along, we recommend running the code cells one at a time, in order.

__Author:__

* Jerry Bonnell, [j.bonnell@miami.edu](mailto:j.bonnell@miami.edu)


## 0. Setting Up

Before we get started, let's set up the notebook by importing libraries we need.

In [13]:
import pandas as pd
from datetime import datetime, timedelta
import plotly.graph_objects as go
import random

from bar_race import get_bar_race_plot

## 1. Querying the database API + filtering stopwords

__NOTE__ Documentation for the coveet tool is available on the [project GitHub](https://github.com/dh-miami/narratives_covid19/tree/master/scripts/freq_analysis).

In this part, we show how to query the database with coveet using a few criteria (location, language, and date range). As an example, let's obtain all English bigrams in Florida between April 27 and May 3, 2020, the week in which the total number of cases in the U.S. crossed the one million mark. We'll visualize these results using a bar chart race plot.

We assume that the user has prepared a list of stopwords in English and Spanish. In this example, one file contains stopwords for English and another for Spanish.
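
The stopword file format that coveet expects is covered in its documentation. As a minimal sketch of what stopword filtering does, assuming one stopword per line in each file, the idea could look like the following. The helper names `load_stopwords` and `remove_stopwords` are hypothetical and are not part of coveet.

```python
from pathlib import Path


def load_stopwords(*paths):
    """Read stopwords from text files, assuming one word per line (an assumption)."""
    words = set()
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            word = line.strip().lower()
            if word:
                words.add(word)
    return words


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Illustration with an in-memory set instead of real files.
demo_stopwords = {"the", "of", "on", "la", "de"}
demo_tokens = "the impact of covid19 on home values".split()
filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

With real files, the set would come from something like `load_stopwords("stopwords_en.txt", "stopwords_es.txt")`.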

Let's obtain all relevant data (tweets and hashtags) corresponding to this range of dates using the *query* function provided by coveet.

In [51]:
!python3 coveet.py query -g fl -l en -d 2020-04-27 2020-05-03 -stopwords stopwords_en.txt stopwords_es.txt

Namespace(date=[datetime.datetime(2020, 4, 27, 0, 0), datetime.datetime(2020, 5, 3, 0, 0)], func=<function handle_query at 0x117548e60>, geo=['fl'], lang=['en'], stopwords=['stopwords_en.txt', 'stopwords_es.txt'])
wrote df to dhcovid_2020-4-27_to_2020-5-3_en_fl.csv!


This just wrote a CSV file, which can be viewed using Excel, Numbers, or `pandas` (a personal favorite :-).

## 2. Retrieving top n-grams and hashtags

Let's now use the *nlp* function from coveet to obtain the top 5 bigrams per day (`-n 2 -t 5`), using the CSV we created above as input.

In [34]:
!python3 coveet.py nlp -n 2 -t 5 -f dhcovid_2020-4-27_to_2020-5-3_en_fl.csv

Namespace(file='dhcovid_2020-4-27_to_2020-5-3_en_fl.csv', func=<function handle_nlp at 0x114df4050>, hashtags=False, ngram=2, top=5, users=False)
   impact might  home values  home might  ...  day global  giving global       date
0            29           27          27  ...           0              0 2020-04-27
1             0            0           0  ...           0              0 2020-04-28
2             0            0           0  ...           0              0 2020-04-29
3             0            0           0  ...           0              0 2020-04-30
4             0            0           0  ...           0              0 2020-05-01
5             0            0           0  ...           0              0 2020-05-02
6             0            0           0  ...          13             13 2020-05-03

[7 rows x 33 columns]
wrote freq df to dhcovid_2020-4-27_to_2020-5-3_en_fl_2.csv!


When implementing bigrams and trigrams, I treat the entire tweet as the context for a word, rather than only its immediate neighbors, so words that appear in the same tweet can form a bigram or trigram even when they are not adjacent. I think this can yield more (and more interesting) results.

Let's now load this CSV using `pandas` and convert the date column to datetime values, which makes working with dates easier.
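
To make the whole-tweet idea concrete, here is a minimal sketch that pairs every word in a tweet with every other word in the same tweet. It is not coveet's actual code: the function names are hypothetical, and the choice to de-duplicate and alphabetically sort the words is an assumption inferred from column names in the output above, such as "distancing social" and "florida south".

```python
from collections import Counter
from itertools import combinations


def tweet_bigrams(tokens):
    """Return every unordered pair of distinct words in one tweet.

    Words are de-duplicated and sorted first, so each pair has a single
    canonical spelling (e.g. "home values", never "values home").
    """
    unique_words = sorted(set(tokens))
    return [" ".join(pair) for pair in combinations(unique_words, 2)]


def count_bigrams(tweets):
    """Count whole-tweet bigrams over a list of tokenized tweets."""
    counts = Counter()
    for tokens in tweets:
        counts.update(tweet_bigrams(tokens))
    return counts


# Two made-up tokenized tweets, already stripped of stopwords.
tweets = [
    ["home", "values", "might", "impact"],
    ["home", "values", "florida"],
]
counts = count_bigrams(tweets)
print(counts["home values"])
```

A tweet with k distinct words contributes k(k-1)/2 pairs under this scheme, which is why it produces more bigrams than counting adjacent words only.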

In [40]:
df_bigram = pd.read_csv('dhcovid_2020-4-27_to_2020-5-3_en_fl_2.csv', index_col=0)
df_bigram['date'] = pd.to_datetime(df_bigram['date'])

We can view a snapshot of the data by examining the dataframe. Note that each row corresponds to a day and each column corresponds to a popular bigram. The value in each cell denotes the number of occurrences of that bigram.

In [41]:
df_bigram

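|   | impact might | home values | home might | home impact | might values | florida south | 2020 april | distancing social | coronavirus pandemic | cases us | ... | deceived world | deceived intel | deceived dossier | deceived leaked | global response | global may | 5 global | day global | giving global | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 29 | 27 | 27 | 27 | 27 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2020-04-27 |
| 1 | 0 | 0 | 0 | 0 | 0 | 15 | 14 | 12 | 11 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2020-04-28 |
| 2 | 0 | 0 | 0 | 0 | 0 | 14 | 14 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2020-04-29 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2020-04-30 |
| 4 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2020-05-01 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 9 | 9 | 0 | 0 | 0 | 0 | 0 | 2020-05-02 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 14 | 13 | 13 | 13 | 13 | 2020-05-03 |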
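
Because each row is a day and each column is a bigram, a one-line `pandas` reduction is enough to pull out the leading bigram per day. The sketch below uses a small excerpt of the table above (three bigrams, three days) so that it runs on its own; `df_demo` and `summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# A small excerpt of the table above (three bigrams, three days).
df_demo = pd.DataFrame({
    "impact might": [29, 0, 0],
    "florida south": [0, 15, 14],
    "2020 april": [0, 14, 14],
    "date": pd.to_datetime(["2020-04-27", "2020-04-28", "2020-04-29"]),
})

counts = df_demo.set_index("date")    # dates become the row labels
top_bigram = counts.idxmax(axis=1)    # column name with the largest count in each row
top_count = counts.max(axis=1)        # the count itself

summary = pd.DataFrame({"top_bigram": top_bigram, "count": top_count})
print(summary)
```

When two bigrams tie on a given day, `idxmax` reports whichever column comes first.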


Let's use the nlp function to also obtain top hashtags from the same CSV file. 

In [42]:
!python3 coveet.py nlp -hashtags -t 5 -f dhcovid_2020-4-27_to_2020-5-3_en_fl.csv

Namespace(file='dhcovid_2020-4-27_to_2020-5-3_en_fl.csv', func=<function handle_nlp at 0x11f313050>, hashtags='h', ngram=False, top=5, users=False)
   covid19  coronavirus  covid  ...  pandemic  givingtuesdaynow       date
0      109           28     13  ...         0                 0 2020-04-27
1      112           30     26  ...         0                 0 2020-04-28
2      111           32     27  ...         0                 0 2020-04-29
3      102           34     31  ...        10                 0 2020-04-30
4      139           34     29  ...         0                12 2020-05-01
5      105           20     27  ...         0                 0 2020-05-02
6       75           14     21  ...         0                17 2020-05-03

[7 rows x 10 columns]
wrote freq df to dhcovid_2020-4-27_to_2020-5-3_en_fl_h.csv!


In [43]:
df_htags = pd.read_csv('dhcovid_2020-4-27_to_2020-5-3_en_fl_h.csv', index_col=0)
df_htags['date'] = pd.to_datetime(df_htags['date'])
df_htags

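|   | covid19 | coronavirus | covid | health | miami | acscovid19 | staysafe | pandemic | givingtuesdaynow | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 109 | 28 | 13 | 13 | 9 | 0 | 0 | 0 | 0 | 2020-04-27 |
| 1 | 112 | 30 | 26 | 0 | 0 | 23 | 7 | 0 | 0 | 2020-04-28 |
| 2 | 111 | 32 | 27 | 13 | 0 | 10 | 0 | 0 | 0 | 2020-04-29 |
| 3 | 102 | 34 | 31 | 13 | 0 | 0 | 0 | 10 | 0 | 2020-04-30 |
| 4 | 139 | 34 | 29 | 0 | 0 | 17 | 0 | 0 | 12 | 2020-05-01 |
| 5 | 105 | 20 | 27 | 10 | 0 | 17 | 0 | 0 | 0 | 2020-05-02 |
| 6 | 75 | 14 | 21 | 9 | 0 | 0 | 0 | 0 | 17 | 2020-05-03 |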


Looks great!
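
As a quick sanity check, the daily counts can be summed into weekly totals. The sketch below copies the values from the hashtag table above into a standalone dataframe (`df_demo` is an illustrative name) so that it runs without the CSV file. One caveat, which is an inference from the table rather than something the coveet documentation states here: each day has exactly five nonzero entries, so a 0 most likely means "not in that day's top 5" rather than "never used", and the totals for less frequent hashtags are therefore lower bounds.

```python
import pandas as pd

# Daily hashtag counts copied from the table above.
df_demo = pd.DataFrame({
    "covid19": [109, 112, 111, 102, 139, 105, 75],
    "coronavirus": [28, 30, 32, 34, 34, 20, 14],
    "covid": [13, 26, 27, 31, 29, 27, 21],
    "health": [13, 0, 13, 13, 0, 10, 9],
    "miami": [9, 0, 0, 0, 0, 0, 0],
    "acscovid19": [0, 23, 10, 0, 17, 17, 0],
    "staysafe": [0, 7, 0, 0, 0, 0, 0],
    "pandemic": [0, 0, 0, 10, 0, 0, 0],
    "givingtuesdaynow": [0, 0, 0, 0, 12, 0, 17],
    "date": pd.date_range("2020-04-27", periods=7),
})

# Sum each hashtag column over the week and rank from most to least frequent.
weekly_totals = df_demo.drop(columns="date").sum().sort_values(ascending=False)
print(weekly_totals.head(3))
```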

## 3. Bar Chart Race

Now comes the fun part :-) Using the bigram dataframe, we can visualize how the top bigrams change over the course of the week. We pass the variable `df_bigram` to the function `get_bar_race_plot()` (the details of which are in the `bar_race` module on GitHub) and obtain an interactive plot using `plotly`.

In [45]:
fig = get_bar_race_plot(df_bigram, 'COVID-19 top bigrams of ')
fig.show()

Let's demonstrate a few more visualizations. This time, let's visualize the top users during that same week in Florida. 

In [9]:
print("unsupported for now...")
#start = datetime(year=2020, month=4, day=27)
#end = datetime(year=2020, month=5, day=3)
#df = days_to_df(lang=['en'], geo=['fl'], start_date=start, end_date=end, metric='users', top_n=10)
#fig = get_bar_race_plot(df, 'COVID-19 top users of ')
#fig.show()

unsupported for now...


In the top-users plot, for instance, we can see that a significant share of the tweets comes from news outlets like wsvn and TheMiamiTimes.

Let's now have a look at the top hashtags. 

In [48]:
fig = get_bar_race_plot(df_htags, 'COVID-19 top fl en hashtags ')
fig.show()

Feel free to add your own visualizations below! Or, if you prefer, you can tweak the parameters in the above cells and re-run them to get different visualizations. 

## 4. Exporting the visualization

`plotly` figures provide a convenience method, `write_html()`, for exporting a visualization to an HTML file, which can be opened in a browser and embedded in a website. Let's try it out with our last visualization.


In [50]:
# `fig` is the hashtag bar chart race created above; save it as a standalone HTML page.
fig.write_html("bar_race.html")