# Digital Narratives of COVID-19: Concordance Views

A *concordance view* shows every occurrence of a given word together with some surrounding context. In this notebook we demonstrate how to retrieve concordance views from the DHCOVID corpus, using the Python script `coveet.py` as a querying and tidying tool. We present the results as tables, with each row labeled by its geographic area.

Please feel free to modify this notebook or, if you would like to preserve this version, make a copy of it by clicking "File" > "Make a Copy..."

To follow along, we recommend running the script portions piecemeal, in order.

Author:
* Jerry Bonnell, j.bonnell@miami.edu, University of Miami

## 0. Setting Up

Before we get started, let us set up the notebook by installing and importing the libraries we need. The `requirements.txt` file lists all the packages to install on your computer for this notebook.

In [None]:
!pip3 install -r requirements.txt  # with conda, try "conda install --file requirements.txt" instead

In [1]:
import pandas as pd
import numpy as np
import re
from collections import defaultdict
from tqdm import tqdm
import matplotlib.pyplot as plt
from plotnine import *
import vaderSentiment.vaderSentiment as vader
from sklearn.manifold import TSNE
from IPython.display import set_matplotlib_formats

import seaborn as sns
sns.set()
from pylab import rcParams
set_matplotlib_formats('svg')
rcParams['figure.figsize'] = 13, 16  # setting figure size
plt.rcParams.update({'font.size': 3})  # setting font size
pd.set_option('display.max_colwidth', None)

As the running example for this notebook, we will obtain concordance views for various search words from queried tweets written between May 8, 2020 and May 14, 2020. 

## 1. Querying

__NOTE__ For a more detailed explanation of the querying and tidying phases used in the analysis pipeline, please see the Jupyter notebook `coveet_frequency.ipynb`, which is also in this folder.

We use the `coveet.py` tool to query tweets from all location-language pairs written between May 8, 2020 and May 14, 2020. We will *not* use the tidying component in this notebook: concordance views are meant for our own reading, and any tidying (e.g., filtering stopwords or lemmatizing) would strip away the context we want to study.

Let us apply the `query` mode.

In [None]:
!python coveet.py query -g fl ar co ec es mx pe -l en es -d 2020-05-08 2020-05-14

If the Python you are using is coming from a conda environment (here called `blueberry`), use the following incantation instead:

In [None]:
!conda activate blueberry; python coveet.py query -g fl ar co ec es mx pe -l en es -d 2020-05-08 2020-05-14

Let us load the queried CSV into a `pandas` DataFrame.

In [3]:
df = pd.read_csv('dhcovid_2020-5-8_2020-5-14_en_es_ar_co_ec_es_fl_mx_pe.csv', index_col=0)

As a quick preprocessing step, let us filter out any rows whose text field is empty (i.e., `NaN`). This usually means that the tweet consisted entirely of hashtags, which are stored in a separate column.

In [5]:
df = df.dropna(subset=["text"])
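
To see what this step does, here is a minimal sketch on a made-up two-row frame (the name `toy` and its contents are illustrative and not part of the corpus). It assumes the same `text` and `hashtags` columns as the queried CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no text, only hashtags.
toy = pd.DataFrame({
    "text": ["stay safe everyone", np.nan],
    "hashtags": [np.nan, "#covid19 #quedateencasa"],
})

# Count how many rows would be dropped before dropping them.
n_missing = toy["text"].isna().sum()
print(f"rows with empty text: {n_missing}")

# Keep only rows whose text field is present; other columns may still be NaN.
cleaned = toy.dropna(subset=["text"])
print(cleaned)
```

Passing `subset=["text"]` matters here: without it, `dropna` would also discard every row whose `hashtags` field is empty.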

Let us inspect what the data frame looks like.

In [6]:
df

Unnamed: 0,date,lang,geo,text,hashtags
0,2020-05-08,en,fl,my heart is with those who tested positive for covid19 at this keys senior living facility thank you to the first responders and frontline medical staff treating our neighbors please stay safe,
1,2020-05-08,en,fl,the probable causes of death are the same over and over again pneumonia acute respiratory distress syndrome complications from covid19 each persons story though is a little different often in heartbreaking ways,
2,2020-05-08,en,fl,the chem trails of the blue angels cure covid19 that why the government is wasting all that money on them instead of something actually useful,
3,2020-05-08,en,fl,its a lot higher then whats being reported unemployment soars to 147 job losses reach 205 million in april as coronavirus pandemic spreads,
4,2020-05-08,en,fl,intel reports deep state china use covid19 for antitrump and gates population control dog,
...,...,...,...,...,...
224732,2020-05-14,es,es,1 no todos llevan mascarilla hay muchos que no llevan nada de proteccion 2 cuando el 8m no habia tanta conciencia del covid19 3 lo que estan haciendo va en contra de muchas cosas y no solo morales,
224733,2020-05-14,es,es,luego me flipa la gente tan inculta que hay que tiene el valor de seguir diciendo que es una gripe por eso nos tienen en casa se nota quien no ha tenido cerca a alguien con covid19 y habla desde la ignorancia y su mente cuadrada,
224734,2020-05-14,es,es,el desafio de elon musk y tesla a las ordenes de salud de covid19 y su denuncia contra el condado de alameda y las amenazas de abandonar california crean,
224735,2020-05-14,es,es,en los 144 años de historia de damm la empresa ha demostrado fortaleza y flexibilidad para adaptarse a entornos complicados que segun su presidente ejecutivo demetrio carceller arce la han reforzado para crecer responsable y sosteniblemente,


## 2. Concordance Views

I defer to the NLTK documentation for a definition: "a *concordance* view shows us every occurrence of a given word, together with some context." The context is usually defined by a window of some number of characters. Given the short and atomic nature of tweets, it would be fair to consider the full tweet as context for the concordance view. We would also like to display the associated date of that tweet.
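
For comparison, the character-window style of concordance can be sketched in a few lines. This is an illustration under assumptions, not the approach used in this notebook: the helper `kwic` and its `width` parameter are hypothetical names, and the sample sentence is one tweet taken from the query results.

```python
import re

def kwic(text, word, width=30):
    """Return (left, match, right) triples for each whole-word hit of `word`.

    `width` is the number of characters of context kept on each side.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    views = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        views.append((left, m.group(), right))
    return views

sample = "cientificos japoneses aseguran haber desarrollado un anticuerpo contra el covid19"
views = kwic(sample, "haber", width=20)
for left, hit, right in views:
    # Right-align the left context so the search word lines up in a column.
    print(f"{left:>20} [{hit}] {right}")
```

Because tweets are short, the rest of this notebook skips the window and keeps the whole tweet as context.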

Finding concordances together with their associated dates is straightforward thanks to `pandas` and the query CSV we have loaded into the variable `df`. The query CSV is suitable for this task because stopwords have not been removed yet, so we can still study the full context.

With `df` at hand, we can filter the rows to include only those that match the given word. This filter `filt` can be as simple as a single word (tweets where `haber` appears) or as advanced as a logical expression (tweets where both `nuevo` and `america` appear). I show examples of both.

In [16]:
filt = lambda text: 'haber' in text  # a single word
# filt = lambda text: 'nuevo' in text and 'america' in text      # a logical expression 
# filt = lambda text: 'trump' in text and 'china' not in text    # another one to try

In [17]:
df_concord = df[df["text"].apply(filt)]  # keep rows whose text passes the filter

In [18]:
df_concord

Unnamed: 0,date,lang,geo,text,hashtags
4615,2020-05-14,en,fl,developer daniel chaberman isnt letting covid19 stop him from breaking ground on the third phase of atlantic village delivering retail restaurants offices to hallandale beach hes opening the 2nd phase in the middle of the pandemic,
4616,2020-05-14,en,fl,developer daniel chaberman isnt letting covid19 stop him from breaking ground on the third phase of atlantic village delivering retail restaurants offices to hallandale beach hes opening the 2nd phase in the middle of the pandemic,
5444,2020-05-08,es,ar,covid19 resolucion 40820 mtyes a traves de esta resolucion se dispuso que aquellos empleadores que hubiesen efectuado el pago total o parcial de haberes correspondientes al mes de abril previo a la percepcion por parte de sus trabajadores,
6605,2020-05-08,es,ar,cientificos japoneses aseguran haber desarrollado un anticuerpo contra el covid19,
6875,2020-05-08,es,ar,angeles azules asi llamo hugo ficca a los medicos que lo cuidaron y alentaron a luchar contra el covid19 hoy ya tiene el alta despues de haber estado internado 13 dias en terapia intensiva por,
...,...,...,...,...,...
223770,2020-05-14,es,es,la gente comparando el vih con el covid19 pero como puede haber tanta gente solo con las neuronas suficientes para no mearse encima,
224028,2020-05-14,es,es,esta noche tendrian que haber comenzado las ferias de de talavera de la reina suspendidas por la pandemia del asi estaba el recinto ferial en un dia como este y asi esta hoy,#sanisidro #covid19
224211,2020-05-14,es,es,los que dicen que es tan irresponsable como el 8m ojo que no digo que no hubiera que haberlo cancelado les recuerdo que en españa ese dia habia 10 fallecidos por covid19 que se supiera y a dia de hoy hay mas de 27 mil,
224416,2020-05-14,es,es,la primera fase del estudio nacional seroepidemiologico revela que mas de 30000 habitantes de la provincia podrian haber contraido el coronavirus mas informacion,
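
Note that Python's `in` operator tests for substrings, which is why the table above also contains `chaberman`, `haberes`, and `haberlo`. If only the exact word is wanted, one option is a regular expression with word boundaries. The sketch below is an assumption-laden illustration: `word_filter` and `toy` are hypothetical names, and the rows are shortened from the output above.

```python
import re
import pandas as pd

def word_filter(word):
    """Build a filter that matches `word` only as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return lambda text: pattern.search(text) is not None

# Hypothetical toy frame with texts shortened from the concordance output.
toy = pd.DataFrame({"text": [
    "developer daniel chaberman isnt letting covid19 stop him",
    "cientificos japoneses aseguran haber desarrollado un anticuerpo",
    "el pago total o parcial de haberes correspondientes al mes de abril",
]})

filt_word = word_filter("haber")
toy_concord = toy[toy["text"].apply(filt_word)]
print(toy_concord)
```

In the notebook itself, `filt = word_filter('haber')` could replace the substring filter before `df_concord` is built. Whether inflected forms such as `haberes` should count as matches depends on the research question, so the substring filter may still be the better choice in some cases.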


If we wish to write this concordance dataframe to a file, we can do that easily.

In [None]:
df_concord.to_csv("concordance_view.csv")
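
To look at the concordance one geographic area at a time, the frame can be split on its `geo` column. The following is a minimal sketch that assumes the same column names as the query CSV; `toy_concord` is a hypothetical stand-in for `df_concord`, with texts shortened from the output above.

```python
import pandas as pd

# Hypothetical stand-in for df_concord, using the same column names.
toy_concord = pd.DataFrame({
    "date": ["2020-05-08", "2020-05-08", "2020-05-14"],
    "lang": ["es", "es", "es"],
    "geo": ["ar", "ar", "es"],
    "text": [
        "cientificos japoneses aseguran haber desarrollado un anticuerpo contra el covid19",
        "hoy ya tiene el alta despues de haber estado internado 13 dias",
        "podrian haber contraido el coronavirus mas informacion",
    ],
})

# One small table per geographic area, keeping only the date and the tweet.
by_geo = {geo: group[["date", "text"]] for geo, group in toy_concord.groupby("geo")}

# Number of matching tweets per area.
counts = toy_concord.groupby("geo").size()

for geo, table in by_geo.items():
    print(f"--- {geo} ({counts[geo]} tweets) ---")
    print(table.to_string(index=False))
```

Each per-area table can be written out with `to_csv` in the same way as the full concordance.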