<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

## Introduction to Computational Social Science methods with Python

# Session 2: Data handling and visualization

**Data** has two components: content and structure. Plain text data is unstructured, but its content can also be represented in a structured way. Data representations reside in a continuum of structuration. The rectangular table (also called dataframe or spreadsheet) is the most frequent data format in the social sciences because it is perfectly suited to manage the highly structured (survey) data that social scientists have typically worked on. Hierarchical data formats like JSON and HTML are semi-structured, and text data is unstructured (<a href='#destination5'>Weidmann 2022</a>, ch. 3). The practice of Computational Social Science primarily revolves around **Digital Behavioral Data**, the traces of behavior left by uses of or harnessed by digital technology. The so-called data life cycle consists of three steps: data collection, processing, and analysis. In this process, data changes its face from a raw state to a state in which it is ready for analysis. **Data processing** subsumes the steps in which this transformation takes place (<a href='#destination5'>Weidmann 2022</a>, ch. 1).

**Data management** refers to the practices by which we stay in control of data as a resource. Data is best managed with a focus on practical questions. Since data processing is about bringing data into a form that permits answering those questions, it is strategically advantageous also to focus data management on data processing. Computational data processing workflows are beneficial because they fully document the many steps from data collection to data analysis, they are convenient (like your favorite spreadsheet software could never be), they are replicable (nowadays in high demand by scholarly journals), they can be scaled up (necessary for [big data](https://en.wikipedia.org/wiki/Big_data)), and they offer the needed flexibility in the face of semi-structured or unstructured data sources (<a href='#destination5'>Weidmann 2022</a>, p. 7–9).

The **R** language and software environment for statistical computing and graphics is very popular in the social sciences, also because it provides the [Tidyverse](https://www.tidyverse.org/), a collection of mutually adapted packages for tabular data structures, their manipulation (*e.g.*, merging, aggregating), and producing appealing graphics (<a href='#destination5'>Weidmann 2022</a>, ch. 7). We argue that **Python** does not need to hide behind R in this regard. The [Pandas](https://pandas.pydata.org/) library for managing tables truly is a "fast, powerful, flexible and easy to use open source data analysis and manipulation tool" that, when combined with the [Seaborn](https://seaborn.pydata.org/) statistical data visualization library, leaves nothing to be desired.

Pandas can also be used in a way that mimics the functionality of **relational databases**. These are systems where the columns of a table are split into multiple tables such that redundancies are eliminated. Relational databases are often used in research when the data is either large in volume or rich in content because they ensure consistency and speed up data processing (<a href='#destination5'>Weidmann 2022</a>, part 3). The public [TweetsKB](https://data.gesis.org/tweetskb/) corpus of annotated tweets (<a href='#destination4'>Fafalios *et al.*, 2018</a>), as well as its offspring, the [TweetsCOV19](https://data.gesis.org/tweetscov19/) corpus (<a href='#destination3'>Dimitrov *et al.*, 2020</a>), are examples where the data is explicitly modeled relationally and can serve as illustrations of meaningful data management.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to manage your data, keep it tidy, and visualize it while keeping a focus on your research questions. In subsession **2.1**, we will have a deep look at the 2-dimensional table as the fundamental data structure we will work with throughout all sessions. You will experience how you can use the Pandas library to handle tables and mimic a relational database in such a way that your data gets ready for analysis. You will see what it means that relational databases eliminate redundancy and ensure consistency. The TweetsCOV19 dataset will function as an example that will reappear repeatedly in this and subsequent sessions. In subsession **2.2**, we will introduce the NumPy and SciPy libraries. NumPy allows working with n-dimensional tables called arrays, which are typically needed in data processing. SciPy enables you to efficiently process and analyze huge matrices (*i.e.*, 2-dimensional numerical tables) with many zeros or missing values (which is often the case). Finally, in subsession **2.3**, you will learn how to use the Matplotlib and Seaborn libraries to explore data visually.
</div>

<div class='alert alert-block alert-danger'>
<b>Caution</b>

This Jupyter Notebook demonstrates a workflow that consists of a **sequence of processing steps**. In this process, tables are created, changed, and deleted. Hence, the notebook must be executed from top to bottom. Going back up from a certain code cell and trying to execute a cell that precedes it may not work.
</div>

## 2.1. Managing data with Pandas

<img src='images/pandas.png' style='height: 100px; float: right; margin-left: 10px'>

[Pandas](https://pandas.pydata.org/) is Python's package for data management and processing using 2-dimensional tables. It allows you to work with any kind of observational or statistical data set, including matrices. Column entries can be heterogeneous (*i.e.*, a single column can contain text, numerical values, or even lists). Pandas is also well-equipped to handle time series data, as we will see. We start with some illustrative toy examples before entering the almost-big-data world using the TweetsCOV19 dataset.

### 2.1.1. Toy examples

#### Data and structure

In subsections 3.2 to 3.4, <a href='#destination5'>Weidmann 2022</a> discusses data, data processing, and the benefit of relational databases using toy examples and the R language. Here, we adapt these examples to Python. Keep in mind that data = content + structure. Furthermore, consider the following two pieces of data. They have (almost) the same content but a different structure. `sdb` represents unstructured data, whereas `tdb` is structured:

In [None]:
sdb = 'Switzerland is a country with 8.3 million inhabitants, and its capital is Bern. Another country is Austria; its capital is Vienna and the population is 8.7 million.'
sdb

In [None]:
import pandas as pd
pd.__version__

In [None]:
tdb = pd.DataFrame(data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']], columns=['country', 'population', 'capital'])
tdb

`tdb` is a Pandas table called a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The columns of a DataFrame are called [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). A DataFrame contains labeled axes (rows and columns). Axis 0 (the rows) is called the **index**, and its labels consist of integers from $0$ to $n-1$ by default, where $n$ is the number of rows:

In [None]:
tdb.index

When no names are given, the **columns** (axis 1) are also labeled in such a way, but using text labels makes the table much more readable:

In [None]:
tdb.columns

Similarly, the index can be a list of text labels or unordered integers.

Rows, columns, or cells can be extracted by specifying their [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)ations (labels). For example, to extract the capital of the second row (Austria):

In [None]:
tdb.loc[1, 'capital']
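
Since the index can also hold text labels, the same lookup can be keyed by country name. The following is a minimal sketch that rebuilds the toy table under the hypothetical name `toy` so that it does not interfere with `tdb`. It uses `set_index()` to promote a column to the index and contrasts label-based `loc` with position-based `iloc`:

```python
import pandas as pd

# Hypothetical copy of the toy table, kept separate from `tdb`
toy = pd.DataFrame(
    data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']],
    columns=['country', 'population', 'capital']
)

# Promote the 'country' column to the index so rows carry text labels
toy_by_country = toy.set_index('country')

# Label-based lookup: row label first, column label second
capital_by_label = toy_by_country.loc['Austria', 'capital']

# Position-based lookup of the same cell (second row, second remaining column)
capital_by_position = toy_by_country.iloc[1, 1]

print(capital_by_label, capital_by_position)
```

Both lookups address the same cell; `loc` tends to be the more readable choice whenever the labels are meaningful.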

Data can be filtered with a condition. For example, to select all rows where the country is Switzerland:

In [None]:
tdb[tdb['country'] == 'Switzerland']
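
Filters can also combine several conditions. As a small sketch on a hypothetical copy of the toy table (named `toy` here), each condition is wrapped in parentheses and the conditions are joined with `&` (and) or `|` (or):

```python
import pandas as pd

# Hypothetical copy of the toy table, kept separate from `tdb`
toy = pd.DataFrame(
    data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']],
    columns=['country', 'population', 'capital']
)

# Each comparison yields a boolean Series; & combines them element-wise
mask = (toy['population'] > 8.5) & (toy['capital'] != 'Bern')

# Rows where the mask is True are kept
selected = toy[mask]
print(selected)
```

The parentheses are needed because `&` binds more tightly than the comparison operators in Python.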

To add a new column:

In [None]:
tdb['area'] = [41, 83]
tdb

The values in the 'area' column are of the integer data type:

In [None]:
tdb['area'].dtype

To add a new row, we must first create a DataFrame containing the new row. Then we can [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)enate the two dataframes on the index axis:

In [None]:
new_row = ['Liechtenstein', 0.038 , 'Vaduz', 0.16]
tdb_new_row = pd.DataFrame(data=[new_row], columns=tdb.columns)
tdb = pd.concat(objs=[tdb, tdb_new_row], axis=0)
tdb

Note that the 'area' column is now a continuous numerical variable:

In [None]:
tdb['area'].dtype

But also note that concatenation results in the old indexes being used (the third row also has index 0). To reset the index and drop the old values:

In [None]:
tdb = tdb.reset_index(drop=True)
tdb
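
As an alternative to resetting the index afterwards, `concat()` accepts an `ignore_index` parameter that renumbers the rows in the same step. The sketch below uses hypothetical table names (`toy`, `toy_new_row`) so that `tdb` is left untouched:

```python
import pandas as pd

# Hypothetical two-row table and a one-row table with the same columns
toy = pd.DataFrame(
    data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']],
    columns=['country', 'population', 'capital']
)
toy_new_row = pd.DataFrame(
    data=[['Liechtenstein', 0.038, 'Vaduz']],
    columns=toy.columns
)

# ignore_index=True discards the old row labels and assigns 0..n-1
toy_all = pd.concat(objs=[toy, toy_new_row], axis=0, ignore_index=True)
print(toy_all.index.tolist())
```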

To remove the column 'area', we can [`drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) it (use `axis=1`) and put the result `inplace` of the original table:

In [None]:
tdb.drop(labels=['area'], axis=1, inplace=True)
tdb

Alternatively, you could have used `del` which also works for whole dataframes:

In [None]:
# del tdb['area']

Rows cannot be removed with `del`. Use the `drop()` method with `axis=0` instead:

In [None]:
tdb.drop(labels=[2], axis=0, inplace=True)
tdb

#### Wide vs. long structure

A rule in data management states that tables should grow long, not wide. Consider this `bad_table` of two countries' population sizes in three years:

In [None]:
bad_table = pd.DataFrame(data=[['Switzerland', 4.7, 5.3, 6.2], ['Austria', 6.9, 7.1, 7.5]], columns=['country', 'pop1950', 'pop1960', 'pop1970'])
bad_table

Though appealing to the eye, this table is computationally bad, as can be demonstrated by trying to compute the average population size over all countries and years. It is relatively easy to compute the mean for each year by selecting the corresponding columns.

In [None]:
bad_table[['pop1950', 'pop1960', 'pop1970']].mean(axis=0)

But computing the overall mean requires selecting columns and taking the mean of the year means:

In [None]:
bad_table[['pop1950', 'pop1960', 'pop1970']].mean().mean()

A `good_table` is long, not wide:

In [None]:
good_table = pd.DataFrame(data=[['Switzerland', 1950, 4.7], ['Switzerland', 1960, 5.3], ['Switzerland', 1970, 6.2], ['Austria', 1950, 6.9], ['Austria', 1960, 7.1], ['Austria', 1970, 7.5]], columns=['country', 'year', 'population'])
good_table

Computing the overall mean is a simple operation on one column...

In [None]:
good_table['population'].mean()

and the year means can be obtained via aggregation (using the [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method) without having to specify any year columns:

In [None]:
# Select the numeric column explicitly; the text column 'country' cannot be averaged
good_table.groupby('year')['population'].mean().reset_index()
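
A wide table does not have to be retyped to become long. As a hedged sketch, the `melt()` method can reshape it. The names `wide` and `long` are hypothetical, and stripping the 'pop' prefix assumes that all year columns follow the `popYYYY` naming pattern used above:

```python
import pandas as pd

# Hypothetical copy of the wide table
wide = pd.DataFrame(
    data=[['Switzerland', 4.7, 5.3, 6.2], ['Austria', 6.9, 7.1, 7.5]],
    columns=['country', 'pop1950', 'pop1960', 'pop1970']
)

# Turn the three year columns into (year, population) pairs, one per row
long = wide.melt(id_vars='country', var_name='year', value_name='population')

# 'pop1950' -> 1950 (assumes the 'popYYYY' naming pattern)
long['year'] = long['year'].str.replace('pop', '', regex=False).astype(int)

# Order rows by country and year for readability
long = long.sort_values(by=['country', 'year']).reset_index(drop=True)
print(long)
```

After reshaping, the overall mean is again a simple operation on the single 'population' column.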

#### Multiple tables

What does it mean that relational databases eliminate redundancies and ensure consistency? Consider a copy of the `good_table` with an additional column that contains a country's capital (copying makes sure that any changes made to `good_table2` do not affect the original `good_table`):

In [None]:
good_table2 = good_table.copy()
good_table2.loc[0:2, 'capital'] = 'Bern'
good_table2.loc[3:5, 'capital'] = 'Vienna'
good_table2

Clearly, this table contains redundant information because country-capital pairs are always the same. This data format also potentially yields a consistency problem. If, for example, one wants to refer to capital names not in English but in the respective national language, one must replace each occurrence of 'Vienna' (English) with 'Wien' (German). But if you miss a single occurrence, your table becomes inconsistent. You can evade both problems if you split `good_table2` into two tables: one containing the population sizes...

In [None]:
populations = good_table2[['country', 'year', 'population']].copy()
populations

and one containing the capitals:

In [None]:
capitals = good_table2[['country', 'capital']].drop_duplicates().reset_index(drop=True)
capitals

This way, you have eliminated all redundancies and ensured consistency because you need to replace 'Vienna' with 'Wien' in one place only.
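
Splitting tables does not lose information, because the tables can be recombined on their shared column whenever needed. The following minimal sketch uses hypothetical stand-ins (`populations_toy`, `capitals_toy`) for the two tables, changes the capital name in one place, and joins the tables with `merge()`:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables
populations_toy = pd.DataFrame(
    data=[['Switzerland', 1950, 4.7], ['Switzerland', 1960, 5.3], ['Switzerland', 1970, 6.2],
          ['Austria', 1950, 6.9], ['Austria', 1960, 7.1], ['Austria', 1970, 7.5]],
    columns=['country', 'year', 'population']
)
capitals_toy = pd.DataFrame(
    data=[['Switzerland', 'Bern'], ['Austria', 'Vienna']],
    columns=['country', 'capital']
)

# The name is changed in exactly one cell ...
capitals_toy.loc[capitals_toy['capital'] == 'Vienna', 'capital'] = 'Wien'

# ... and the change propagates to every matching row when the tables are joined
combined = pd.merge(left=populations_toy, right=capitals_toy, on='country', how='left')
print(combined)
```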

Next, we will take Pandas and relational database thinking to the next level.

### 2.1.2. TweetsCOV19

[Twitter](https://en.wikipedia.org/wiki/Twitter) is a microblogging service that is very influential among politicians and journalists. Though stagnating over the past years, the number of monthly active users was 238 million in the second quarter of 2022 (<a href='#destination6'>Wikipedia, 2022</a>). Since January 2013, researchers at [L3S](https://www.l3s.de/) and [GESIS](https://www.gesis.org/) have been collecting a 1% random sample of all Twitter transactions (tweets), detecting sentiments, and extracting named entities, user mentions, hashtags, as well as URLs, and making those publicly available as the [TweetsKB](https://data.gesis.org/tweetskb/) corpus (<a href='#destination4'>Fafalios *et al.*, 2018</a>). By August 2022, the corpus had grown to about 3 billion tweets. In the following, we will store the content of a small fraction of those tweets, one month of the [TweetsCOV19](https://data.gesis.org/tweetscov19/) corpus (<a href='#destination3'>Dimitrov *et al.*, 2020</a>), in multiple Pandas dataframes that make reference to the TweetsKB data structure.

#### Ontologies in practice

|<img src='images/model.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 1**: Data structure used to build the TweetsKB corpus ([source](https://data.gesis.org/tweetskb/#Data-model))</em>|

The data structure depicted in ***figure 1*** used to build the TweetsKB corpus is relational and uses several standardized ontologies. **Relational** means that each piece of content belongs to a class, and classes have properties that can either describe a class attribute or link to another class. We will shortly see that classes are candidates for tables. Classes and properties are drawn from **ontologies** which are vocabularies for modeling data and, in our particular tweets case, online community data. These vocabularies are developed and maintained by the [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) research community, aiming to make internet data machine-readable.

|<img src='images/TweetsKB_model_example.jpg' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 2**: Example of how a tweet is encoded using the data structure</em>|

***Figure 2*** is an example of how a tweet is encoded using this abstract data structure. In other words, the figure depicts how the content of a tweet is modeled as machine-readable data. Starting with the central element, a **tweet** is modeled as belonging to the [Post](https://www.w3.org/Submission/sioc-spec/#term_Post) class, which is defined as an "article or message that can be posted to a Forum" in the [SIOC](http://sioc-project.org/) ontology. A tweet has a [has_creator](https://www.w3.org/Submission/sioc-spec/#term_has_creator) property which relates a tweet to a user. A **user** is modeled as belonging to the [User](https://www.w3.org/Submission/sioc-spec/#term_User) class (in figure 1, the class is called "UserAccount"), which is defined as a "User account in an online community site." The "tweet1" instance of Post as well as the "usr1" instance of User have [id](http://rdfs.org/sioc/spec/#term_id) properties that link to the actual [literal](https://www.w3.org/TR/rdf-schema/#ch_literal) values of the **tweet id** (<span style='font-family:Courier'>9565121266</span>) and (encrypted) **user name** (<span style='font-family:Courier'>2356912</span>) variables. Below, we will create separate Pandas tables for the Post and User classes, among others. Just like in the above example of populations and capitals, this is how redundancy is eliminated.

Starting from such an understanding of separate tables for tweets and users, we can discuss to which of the two tables the other variables that come with the data belong. The **timestamp**, **number of retweets** (number of users that forward the tweet), and **number of favorites** (number of users that like the tweet) clearly are attributes of tweets. In the example of *figure 2*, "tweet1" is liked by $12$ users, a statistic that is modeled using the [InteractionCounter](https://schema.org/InteractionCounter) for the [LikeAction](https://schema.org/LikeAction) of the [Schema.org](https://schema.org/) vocabulary. The **number of followers** (number of other users that follow a user) and **number of friends** (number of other users a user follows) seem to be attributes of users at first glance. But since they are measured at the time of tweet creation, they are better also attributed to tweets. While the Twitter API delivers these variables, the following variables have been obtained by the corpus creators by processing the tweet content. The sentiment or emotional content of a tweet is modeled by using the [Onyx](https://www.gsi.upm.es/ontologies/onyx/) ontology, which is "designed to annotate and describe the emotions expressed by user-generated content". The [SentiStrength](http://sentistrength.wlv.ac.uk/) algorithm results in **positive sentiment** (1 means low and 5 means high) and **negative sentiment** (-1 means low and -5 means high) variables. Though the sentiment expresses the mind state of a user, it is expressed in language and is, hence, a tweet attribute.

The dataset producers have also annotated tweets by extracting four different kinds of **facts** (communicative symbols) from tweet texts: named entities (universally recognized semantic concepts), user **mentions** (words starting with <span style='font-family:Courier'>@</span>), **hashtags** (words starting with <span style='font-family:Courier'>#</span>), and **URLs** (addresses of web pages). Since URLs are often too detailed, we will also extract the **TLDs** (top-level domains) from URLs. To identify **named entities**, the [FEL](https://github.com/yahoo/FEL) algorithm matches parts of the tweet **text** to Wikipedia pages as universally identifiable resources and provides a **confidence** score to what extent the match is trustworthy (0 means high and -3 means low confidence). In the example of *figure 2*, the text snippet "<span style='font-family:Courier'>Federer</span>" has been matched to the Wikipedia resource [Roger_Federer](https://de.wikipedia.org/wiki/Roger_Federer) with an average confidence of $-1.54$.

|<img src='images/TweetsCOV19_ext_erd.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 3**: Entity relationship diagram to organize Pandas tables</em>|

It is clear that named entities, mentions, hashtags, URLs, and TLDs cannot be tweet attributes, as that would create an immense amount of redundancy. Hence, each of these five facts gets its own table. ***Figure 3*** shows the entity relationship diagram into which we will transform the tweets data. The diagram in *figure 3* mirrors the TweetsKB data structure. We will construct the tables shown in the figure by following the rules of [database normalization](https://en.wikipedia.org/wiki/Database_normalization), which we have introduced in section 2.1.1, in the most basic way. Entities are classes in the above sense and are not to be confused with named entities. We will create **entity tables** for the seven entities discussed so far: `tweets`, `users`, `named_entities`, `mentions`, `hashtags`, `urls`, and `tlds`. Each table has a primary key (PK) that uniquely identifies the entity instances in a table. We will use the index of Pandas dataframes as primary keys. Six of those tables have a column called 'tweets', which is the number of tweets a user has created or the number of times a fact has been selected in a tweet.

In addition, we will create five **relationship tables** that put tweets into a relationship to facts (named entities, mentions, hashtags, URLs, and TLDs). Relationship tables are depicted using dashed lines in *figure 3*. They just contain entity identifiers (indices) that are now called foreign keys (FK). We will shortly see that relationship tables can be directly used in data analysis. One of the five tables is an exception: The `tweets_named_entities` table has two more columns – the text that was used to name the named entity and the confidence score – because these are true attributes of the relationship between tweets and named entities. Finally, users and tweets are linked in the tweets table via the 'user_idx' column because a tweet is created by one and only one user.

#### Structuring TweetsCOV19

We will be working with the May 2020 dump of the TweetsCOV19 corpus. Download this [file](https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz) and put it into the '../data/TweetsCOV19' folder. The [description](https://data.gesis.org/tweetscov19/#Dataset) of the dataset says that each row contains variables of a tweet instance, there are twelve variables (columns), and variables are separated by a tab character ('\t'). In other words, the data is delivered as a table. Furthermore, the description says that sentiment scores, named entity metadata, etc. are concatenated. In other words, the delivered table is wide in selected columns. Our job will be to transform this table into the multiple tables of *figure 3*. Before reading the full data, it is a good idea to look at the first rows to check if the file contains column names and if there are any peculiarities. Using UTF-8 encoding is recommended since it allows for coding all the different characters used in tweets:

In [None]:
head = pd.read_csv(
    filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_052020.tsv.gz', 
    sep = '\t', 
    nrows = 5, 
    encoding = 'utf-8'
)
head

Get the number of rows and columns:

In [None]:
head.shape

Knowing that the file does not contain column names and that the separator indeed creates twelve columns, we can read the whole file.

<div class='alert alert-block alert-danger'>
<b>Caution</b>

The file we are about to load is almost 200 MB large in compressed format. When loaded into memory as a dataframe, it consumes almost 1 GB. Since we are about to create many new tables from it, which all require significant amounts of memory, you can quickly reach the limits of the machine you are working on. In fact, if we work with the whole file and run the notebook all until the end, it will consume 4.2 GB. This is too much if, for example, you are executing this notebook on mybinder.org, which gives you 2 GB of memory.
</div>

To reduce the memory load, take a sample from the already-sampled file. Setting the `seed()` of the random library will create results that are exactly reproducible. `p` is the sample fraction to load. It is set to 25% to use less than 2 GB of memory. Increase it if you have more memory:

In [None]:
import random

In [None]:
random.seed(42)
p = .25

In [None]:
tweets = pd.read_csv(
    filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_052020.tsv.gz', 
    sep = '\t', 
    header = None, 
    skiprows = lambda i: i > 0 and random.random() > p, 
    quoting = 3, 
    encoding = 'utf-8'
)
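
To see how sampling via `skiprows` behaves, the following sketch applies the same idea to a small in-memory table instead of the real file. The text content, the helper `read_sample()`, and the variable names are hypothetical. Because the callable returns `False` for row 0, the first row is always kept, and re-seeding the generator before each read yields the same sample:

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for the tab-separated file: 1,000 rows, no header
text = ''.join(f'{i}\tvalue_{i}\n' for i in range(1000))

def read_sample(p, seed):
    '''Reads roughly a fraction p of the rows; row 0 is always kept.'''
    random.seed(seed)  # re-seeding makes the draw reproducible
    return pd.read_csv(
        io.StringIO(text),
        sep='\t',
        header=None,
        # A row is skipped when the callable returns True
        skiprows=lambda i: i > 0 and random.random() > p
    )

sample_a = read_sample(p=0.25, seed=42)
sample_b = read_sample(p=0.25, seed=42)

print(len(sample_a))               # roughly a quarter of the 1,000 rows
print(sample_a.equals(sample_b))   # identical samples for identical seeds
```

Note that the sample size is itself random: each row is kept independently with probability `p`, so the number of rows only approximates `p` times the file length.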

<div class='alert alert-block alert-danger'>
<b>Caution</b>

Setting the `quoting` parameter of the [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to the value `3` (the value of `csv.QUOTE_NONE` in Python's standard library) means that no quoting symbols (*e.g.*, quotation marks) are treated as enclosing the content of cells. As a result, such a symbol can itself be part of a cell's content. In the TweetsCOV19 dataset, some hashtags actually contain quotation marks. Not setting the parameter to `3` would result in the file being read incorrectly.
</div>
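
The effect of the `quoting` parameter can be illustrated on a tiny in-memory example. The two-row text below is hypothetical. With the default setting, quotation marks around a cell are interpreted as quoting symbols and removed. With `csv.QUOTE_NONE` (which equals `3`), they are kept as content:

```python
import csv
import io

import pandas as pd

# Hypothetical two-row, tab-separated text; one cell is wrapped in quotation marks
text = '1\t"covid"\n2\tlockdown\n'

# Default: quotation marks are treated as quoting symbols and stripped
default = pd.read_csv(io.StringIO(text), sep='\t', header=None)

# QUOTE_NONE: quotation marks are ordinary characters and stay in the cell
literal = pd.read_csv(io.StringIO(text), sep='\t', header=None, quoting=csv.QUOTE_NONE)

print(default.loc[0, 1])
print(literal.loc[0, 1])
```

Writing `quoting=csv.QUOTE_NONE` instead of `quoting=3` is equivalent and may be easier to read.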

Since the table does not have column names, we use those from the data [description](https://data.gesis.org/tweetscov19/#Dataset):

In [None]:
tweets.columns = ['tweet_id', 'user', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'named_entities', 'sentiments', 'mentions', 'hashtags', 'urls']

In [None]:
tweets

There are 480 thousand tweets (1.9 million for the full sample). Look at the last five columns to see how multiple annotations are stored in single cells.

#### Creating the `users` table

Besides the 'user' name, the `users` table should also contain the number of tweets the user has created as well as the maximum number of followers and friends. Aggregate the data using `groupby()` with the `size()` method to count the number of rows (i.e., the number of tweets)...

In [None]:
users = tweets.groupby(by='user').size().reset_index(name='tweets')
users.head()

and then with the `.max()` method to get the maximum number of followers and friends:

In [None]:
users_ff = tweets.groupby(by='user')[['followers', 'friends']].max().reset_index()
users_ff.columns = ['user', 'followers_max', 'friends_max']
users_ff.head()

Since these dataframes are both ordered alphabetically and have the same length, we can simply add the two columns from `users_ff` to the `users` table:

In [None]:
users[['followers_max', 'friends_max']] = users_ff[['followers_max', 'friends_max']]
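
The two aggregation steps can also be combined into one. The following sketch is an alternative, not the approach used in this notebook: it uses Pandas' named aggregation in `agg()` on a hypothetical toy table (`toy_tweets`), so that counting and taking maxima happen in a single `groupby()` and no column assignment by position is needed:

```python
import pandas as pd

# Hypothetical toy version of the tweets table
toy_tweets = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'user': ['a', 'b', 'a', 'c'],
    'followers': [10, 5, 12, 7],
    'friends': [3, 8, 2, 1]
})

# Each keyword defines an output column as (input column, aggregation function)
toy_users = toy_tweets.groupby(by='user').agg(
    tweets=('tweet_id', 'size'),
    followers_max=('followers', 'max'),
    friends_max=('friends', 'max')
).reset_index()
print(toy_users)
```

A single aggregation avoids relying on two separately computed dataframes having identical row order.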

Sort the dataframe in descending order by the number of tweets, maximum number of followers, and maximum number of friends (in that order):

In [None]:
users = users.sort_values(by=['tweets', 'followers_max', 'friends_max'], ascending=False).reset_index(drop=True)

Finally, reorder the columns:

In [None]:
users = users[['user', 'tweets', 'followers_max', 'friends_max']]

The index will function as a unique user identifier:

In [None]:
users

Note that 'user' names are encrypted for privacy reasons in the original dataset. There are 350 thousand distinct users (1.1 million in the full dataset), and the most active one has created 1,989 tweets (in the full dataset). Indeed, it is not an error that some users have tens of millions of followers [and more](https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts). An interesting observation is that the most active users also have many followers.

The index values of this table are unique identifiers for the users in the dataset (the primary keys). The effect of sorting is that the most active users have small index values, which aids computational purposes, as you will see. To not waste memory, it is good practice to delete dataframes we do not need anymore:

In [None]:
del users_ff

#### Creating the `tweets` table

We will proceed by refining the existing `tweets` table. Sorting tweets by date and time is straightforward. For handling such data, Pandas provides the 'datetime' data type. It is perfectly suited for handling time series data as it allows many ways to manipulate dates and times. For now, we will simply transform the 'timestamp' values from 'string' [`to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html). Set the `format` of the original string to spare Pandas figuring it out itself (and save time):

In [None]:
tweets['timestamp'] = pd.to_datetime(tweets['timestamp'], format='%a %b %d %X %z %Y')

Knowing that the timezone is Coordinated Universal Time (UTC), remove it:

In [None]:
print(tweets['timestamp'].dt.tz)
tweets['timestamp'] = tweets['timestamp'].dt.tz_localize(tz=None)
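
The parsing steps can be tried on a couple of strings first. The two timestamps below are hypothetical examples written in the layout that the format string describes. The sketch spells out the time as `%H:%M:%S`, which is what `%X` stands for in the default locale:

```python
import pandas as pd

# Hypothetical timestamps in the 'weekday month day time offset year' layout
raw = pd.Series([
    'Fri May 01 00:00:03 +0000 2020',
    'Thu Apr 30 23:59:58 +0000 2020'
])

# Parse with an explicit format; %z reads the '+0000' UTC offset
parsed = pd.to_datetime(raw, format='%a %b %d %H:%M:%S %z %Y')

# Drop the timezone information, keeping the clock time unchanged
naive = parsed.dt.tz_localize(tz=None)

# Datetime columns can be compared directly with date strings
in_may = naive[naive >= '2020-05-01']
print(in_may)
```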

Then remove tweets from April 2020 (since they are not complete) and sort the dataframe:

In [None]:
tweets = tweets[tweets['timestamp'] >= '2020-05-01']
tweets = tweets.sort_values(by=['timestamp']).reset_index(drop=True)
tweets.head()

After sorting, the index is stable and acts as a unique tweet identifier.

The first change the table needs is to replace the 'user' name with the 'user_idx' index from the `users` table. Since we will repeat this operation for other tables, we define an `add_index()` function. Following best Python practice, what it does is described in the function itself:

In [None]:
def add_index(source, target, entity):
    '''
    Inserts the index of a source dataframe into a target dataframe as a column.
    
    Parameters:
        source : Pandas DataFrame
            Dataframe whose index is to be inserted.
        target : Pandas DataFrame
            Dataframe into which the index is inserted.
        entity : String
            Name of the entity that is identified by the index. Will be given an '_idx' suffix and then inserted into the target dataframe.
    
    Returns:
        The target dataframe with the inserted column.
    '''
    _ = source.copy()
    _[entity + '_idx'] = _.index
    df = pd.merge(left=target, right=_[[entity + '_idx', entity]], on=entity)
    del df[entity]
    return df

In [None]:
tweets = add_index(source=users, target=tweets, entity='user')
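
What this operation achieves can be seen on toy data. The sketch below is an alternative formulation with a plain lookup dictionary and `map()`, not the `add_index()` function itself. The tables `users_toy` and `tweets_toy` are hypothetical:

```python
import pandas as pd

# Hypothetical entity table: the index (0, 1, 2) is the primary key
users_toy = pd.DataFrame({'user': ['u_b', 'u_a', 'u_c'], 'tweets': [5, 3, 1]})

# Hypothetical table that still refers to users by name
tweets_toy = pd.DataFrame({
    'tweet_id': [101, 102, 103, 104],
    'user': ['u_a', 'u_b', 'u_b', 'u_c']
})

# Build a name -> index lookup from the entity table
lookup = {name: idx for idx, name in users_toy['user'].items()}

# Replace each name by the corresponding index (the foreign key)
tweets_toy['user_idx'] = tweets_toy['user'].map(lookup)
tweets_toy = tweets_toy.drop(columns='user')
print(tweets_toy)
```

Unlike an inner merge, `map()` keeps rows whose name is missing from the lookup and fills them with a missing value, which can be useful for spotting inconsistencies.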

After reordering the columns, the 'user_idx' column is at the right position:

In [None]:
tweets = tweets[['tweet_id', 'user_idx', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'named_entities', 'sentiments', 'mentions', 'hashtags', 'urls']]
tweets.head()

#### Creating the `named_entities` and `tweets_named_entities` tables

Next, we process the facts. In general, we proceed by, first, extracting relationship tables from the `tweets` table and, second, deriving the entity tables from the relationship tables. We start with the most complicated case of **named entities**. The 'named_entities' column of the `tweets` table contains ';'-separated 3-tuples, each of which contains ':'-separated values for 'text', 'named_entity', and 'confidence'. In the process of normalization, the first step is to transform cell content into lists of 3-tuples. Again, we define a custom `to_list()` function...

In [None]:
def to_list(cell, pat):
    '''
    Function to be applied to individual cells of a dataframe column. Transforms concatenated cell content into a list.
    
    Parameters:
        cell : String or float
            Cell content. Cells without values contain 'null;' or NaN (a float).
        pat : String
            Pattern that separates the cell values.
    
    Returns:
        A list of the separated values, or [''] if the cell has no values.
    '''
    # Cells without annotations become a list with one empty string
    if cell == 'null;' or isinstance(cell, float):
        return ['']
    return cell.split(pat)

that we can now `apply` cell by cell to the 'named_entities' column:

In [None]:
tweets['named_entities'] = tweets['named_entities'].apply(to_list, pat=';')

An example cell with a list of multiple 3-tuples now looks like this:

In [None]:
tweets['named_entities'][2]

Next, we create the `tweets_named_entities` relationship table. The reason for having created a list of 3-tuples is that a wide table – the subtable with just the 'named_entities' column – can be easily and quickly transformed into a long table using the [`explode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) method:

In [None]:
tweets_named_entities = tweets[['named_entities']].explode(column='named_entities')
del tweets['named_entities']
tweets_named_entities.head()

Remove rows without 3-tuples (concatenation artifacts) and split the 3-tuples into three columns by splitting the string at ':':

In [None]:
tweets_named_entities = tweets_named_entities[tweets_named_entities['named_entities'] != '']
tweets_named_entities = tweets_named_entities['named_entities'].str.split(pat=':', expand=True)
tweets_named_entities.head()

At this point, the index consists of the unique tweet identifiers because we used `explode()` on a `tweets` subtable.
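
The whole chain of splitting, exploding, filtering, and splitting again can be followed on a three-row toy column. The cell contents below are hypothetical and only imitate the ';'- and ':'-separated layout described above (the Federer triple is borrowed from *figure 2*):

```python
import pandas as pd

# Hypothetical cells in the 'text:named_entity:confidence;' layout
toy = pd.DataFrame({'named_entities': [
    'federer:Roger_Federer:-1.54;tennis:Tennis:-2.1;',
    'null;',
    'bern:Bern:-0.5;'
]})

# Step 1: turn each cell into a list of 3-tuple strings
toy['named_entities'] = toy['named_entities'].apply(
    lambda cell: [''] if cell == 'null;' else cell.split(';')
)

# Step 2: one row per list element; the index keeps the original row number
exploded = toy[['named_entities']].explode(column='named_entities')

# Step 3: drop the empty strings left over from trailing ';' and 'null;'
exploded = exploded[exploded['named_entities'] != '']

# Step 4: split each 3-tuple string into three columns
triples = exploded['named_entities'].str.split(pat=':', expand=True)
print(triples)
```

The repeated index value for the first row shows how one tweet with two named entities becomes two rows that still point to the same tweet.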

<div class='alert alert-block alert-info'>
<b>Insight: Data quality</b>

Part of what we call "getting a feeling for your data" is actually to look at it. For example, here is the right stage in the data processing pipeline to check how well the named-entity-recognition algorithm worked because the `tweets_named_entities` table still contains the 'text' from which a named entity was recognized as well as the name of the 'named_entity'. Pandas is actually not the best tool to read tables as it truncates the display of long tables by default. You can cycle through your columns using `loc[]`, but it is more convenient to use a spreadsheet editor for the task. Therefore, we export the distinct rows of the `tweets_named_entities` table to an Excel file. The first column is the tweet index, and you can look up the 'tweet_id' via `tweets.loc[<tweet index without angle brackets>, :]`. You can read the actual tweet of any user (*e.g.*, with the 'tweet id' $1262933787131904001$) by visiting https://twitter.com/anyuser/status/1262933787131904001.

Going through the rows, you will notice that the text snippet <span style='font-family:Courier'>democrat</span> is matched to the [Democratic_Party_(United_States)](https://en.wikipedia.org/wiki/Democratic_Party_(United_States)) named entity with a confidence of -1.81. Even though the matching is mostly right in our particular context, the algorithm will also mismatch a lot of discourse about democracy itself. The problem also exists the other way around. The text <span style='font-family:Courier'>dems</span>, which is frequently used to name the Democratic Party of the US, is matched to [Defensively_equipped_merchant_ship](https://en.wikipedia.org/wiki/Defensively_equipped_merchant_ship) with a slightly weaker confidence of -2.03. The idea is to use this score to filter out bad matchings. However, setting the filter to -2.00 would result in missing the correct matching of <span style='font-family:Courier'>us democratic party</span>, which has a score of -2.06. As you go through the spreadsheet, you will notice more mistakes. All these point to the general problem of fully automated data processing.
</div>

Before the Excel file is saved, the target directory is created if it does not exist:

In [None]:
import os

In [None]:
directory = 'results'
os.makedirs(directory, exist_ok=True)

In [None]:
# The `encoding` argument is omitted because recent pandas versions no longer accept it in `to_excel()`
tweets_named_entities.drop_duplicates().to_excel(
    excel_writer = 'results/tweets_named_entities_dropped_duplicates.xlsx', 
    engine = 'xlsxwriter'
)

Continuing with creating `tweets_named_entities`, if we now reset the index without dropping the old one, the tweet index values will be added as the first column:

In [None]:
tweets_named_entities = tweets_named_entities.reset_index(drop=False)
tweets_named_entities.columns = ['tweet_idx', 'text', 'named_entity', 'confidence']
tweets_named_entities.head()

Drop duplicate relationships because we are not interested in multiple occurrences of a named entity in one tweet:

In [None]:
tweets_named_entities = tweets_named_entities.drop_duplicates().reset_index(drop=True)

The `named_entities` entity table with the desired 'tweets' column is created from the `tweets_named_entities` relationship table, again using a dedicated function:

In [None]:
def create_entity_table(relationship_table, entity):
    '''
    Creates an entity table from a relationship table via aggregation.
    
    Parameters:
        relationship_table : Pandas DataFrame
            Dataframe that contains tweet entity relationships.
        entity : String
            Name of the entity column in the relationship table that contains the entities.
    
    Returns:
        An entity table sorted descendingly by an additional 'tweets' column giving the number of tweets that contain an entity.
    '''
    df = relationship_table.groupby(entity).size().reset_index(name='tweets')
    df = df.sort_values(by=['tweets', entity], ascending=[False, True]).reset_index(drop=True)
    return df

In [None]:
named_entities = create_entity_table(relationship_table=tweets_named_entities, entity='named_entity')
named_entities.head()

Given this entity table, add its index to the corresponding `tweets_named_entities` relationship table, reorder the columns to obtain the desired design, and change the data type of the 'confidence' scores from string to a rounded float:

In [None]:
tweets_named_entities = add_index(source=named_entities, target=tweets_named_entities, entity='named_entity')
tweets_named_entities = tweets_named_entities[['tweet_idx', 'named_entity_idx', 'text', 'confidence']]
tweets_named_entities['confidence'] = tweets_named_entities['confidence'].astype(float).round(4)
tweets_named_entities.head()

#### Creating the other relationship and entity tables

The data of the remaining facts (mentions, hashtags, URLs, and TLDs) is easier to normalize because the relationship tables only contain foreign keys. Without having to process metadata from named entity recognition, we can create the relationship tables via a general function:

In [None]:
def create_relationship_table(entity, to_lower_case, drop_duplicates=True, source=tweets):
    '''
    Creates a relationship table for a given entity from the `tweets` table.
    
    Parameters:
        source : Pandas DataFrame, default `tweets`
            Table that contains the entity column.
        entity : String
            Name of the entity column in the source table that contains the entities. The column must contain an object data type list of entity names.
        to_lower_case : Boolean
            Whether entity names should be reduced to lower case.
        drop_duplicates : Boolean, default True
            Whether duplicate relationships should be removed.
    
    Returns:
        A relationship table linking tweet indices to entity names.
    '''
    df = source[[entity + 's']].explode(column=entity + 's')
    df = df[df[entity + 's'] != '']
    df = df.reset_index()
    df.columns = ['tweet_idx', entity]
    if to_lower_case == True:
        df[entity] = df[entity].str.lower()
    if drop_duplicates == True:
        df = df.drop_duplicates().reset_index(drop=True)
    return df

The processing pipeline is the same for all four facts:

1. Transform the entity column in the `tweets` table to a list
2. Create the relationship table from the entity column in the `tweets` table, dropping duplicate rows by default
3. Delete the entity column in the `tweets` table
4. Create the entity table from the relationship table
5. Add the entity index to the relationship table
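The five steps can be sketched on a toy table. This is an illustration under assumptions, not the notebook's implementation: `toy_to_list()` and the index mapping in step 5 are hypothetical stand-ins for the `to_list()` and `add_index()` helpers, whose exact behavior may differ.

```python
import pandas as pd

def toy_to_list(cell, pat):
    # Hypothetical stand-in for to_list(): split a string into a list of names
    return cell.split(pat)

toy_tweets = pd.DataFrame({'mentions': ['realDonaldTrump WHO', '', 'realdonaldtrump']})

# Step 1: transform the entity column to lists
toy_tweets['mentions'] = toy_tweets['mentions'].apply(toy_to_list, pat=' ')

# Step 2: create the relationship table (lower case, no duplicates)
rel = toy_tweets[['mentions']].explode(column='mentions')
rel = rel[rel['mentions'] != ''].reset_index()
rel.columns = ['tweet_idx', 'mention']
rel['mention'] = rel['mention'].str.lower()
rel = rel.drop_duplicates().reset_index(drop=True)

# Step 3: delete the entity column in the source table
del toy_tweets['mentions']

# Step 4: create the entity table by counting tweets per entity
ent = rel.groupby('mention').size().reset_index(name='tweets')
ent = ent.sort_values(by=['tweets', 'mention'], ascending=[False, True]).reset_index(drop=True)

# Step 5 (assumed behavior): replace entity names by the entity table's index
name_to_idx = pd.Series(ent.index, index=ent['mention'])
rel['mention_idx'] = rel['mention'].map(name_to_idx)
rel = rel[['tweet_idx', 'mention_idx']]
print(ent)
print(rel)
```

Because names are reduced to lower case in step 2, the two spellings of the same account end up as one row in the entity table.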

In the case of **mentions**, in step 2, we must transform all capital (upper case) characters to lower case. This is because user names on Twitter are not case-sensitive. In other words, when a user named "realDonaldTrump" already exists, no new user will be allowed with the name "realdonaldtrump". Since user mentions are extracted as words starting with <span style='font-family:Courier'>@</span>, but tweet creators often use upper and lower cases as they wish, not transforming upper to lower case would result in the same mentioned user getting more than a single unique identifier.

In [None]:
tweets['mentions'] = tweets['mentions'].apply(to_list, pat=' ') # Step 1
tweets_mentions = create_relationship_table(entity='mention', to_lower_case=True) # Step 2
del tweets['mentions'] # Step 3
mentions = create_entity_table(relationship_table=tweets_mentions, entity='mention') # Step 4
tweets_mentions = add_index(source=mentions, target=tweets_mentions, entity='mention') # Step 5

In [None]:
mentions.head()

Donald Trump is the most mentioned user by far, followed by the prime minister of India, Narendra Modi, both with his official and private accounts.

In the case of **hashtags**, do the same transformation from upper to lower case to prevent synonymous hashtags from getting different indices:

In [None]:
tweets['hashtags'] = tweets['hashtags'].apply(to_list, pat=' ') # Step 1
tweets_hashtags = create_relationship_table(entity='hashtag', to_lower_case=True) # Step 2
del tweets['hashtags'] # Step 3
hashtags = create_entity_table(relationship_table=tweets_hashtags, entity='hashtag') # Step 4
tweets_hashtags = add_index(source=hashtags, target=tweets_hashtags, entity='hashtag') # Step 5

In [None]:
hashtags.head()

**URLs** are case-sensitive. Hence, set `to_lower_case=False` in step 2. To create the tables related to **TLDs**, postpone step 5:

In [None]:
tweets['urls'] = tweets['urls'].apply(to_list, pat=':-:') # Step 1
tweets_urls = create_relationship_table(entity='url', to_lower_case=False) # Step 2
del tweets['urls'] # Step 3
urls = create_entity_table(relationship_table=tweets_urls, entity='url') # Step 4

At this point, `tweets_urls` has all the information to create the TLD-related tables. Create `tweets_tlds` as a copy of `tweets_urls` and extract the TLDs (pseudo step 2):

In [None]:
tweets_tlds = tweets_urls.copy()
tweets_tlds['tld'] = tweets_tlds['url'].str[8:].str.split(pat='/').str[0]
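Slicing with `str[8:]` relies on every URL starting with the eight characters of `https://`. Whether the dataset contains other schemes is not stated here, so the following sketch only illustrates what would happen if it did, and shows `urllib.parse.urlparse()` from the standard library as a possible alternative. The URLs are made up.

```python
import pandas as pd
from urllib.parse import urlparse

# Hypothetical URLs, one of them with the shorter 'http://' scheme
toy_urls = pd.Series(['https://www.example.org/a/b', 'http://example.com/page', 'https://t.co/xyz'])

# Slicing approach: correct for 'https://', cuts one character too many for 'http://'
sliced = toy_urls.str[8:].str.split(pat='/').str[0]

# Parsing approach: the network location is extracted independently of the scheme
parsed = toy_urls.apply(lambda url: urlparse(url).netloc)

print(pd.DataFrame({'url': toy_urls, 'sliced': sliced, 'parsed': parsed}))
```

If all URLs in a table are known to use `https://`, both approaches give the same result and the slicing version is faster.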

Now, finish step 5 for URLs...

In [None]:
tweets_urls = add_index(source=urls, target=tweets_urls, entity='url') # Step 5

and steps 4 and 5 for TLDs (step 3 is not necessary since no such column ever existed):

In [None]:
tlds = create_entity_table(relationship_table=tweets_tlds, entity='tld') # Step 4
tweets_tlds = add_index(source=tlds, target=tweets_tlds, entity='tld') # Step 5

In [None]:
urls.head()

In [None]:
tlds.head()

As expected, the detail of URLs hides which TLDs are most popular.

#### Sentiment data

Positive and negative sentiment scores are stored as a string in `tweets['sentiments']`. To construct the desired three columns from it, we only have to split and expand the string (line 1), transform the scores into integers (lines 2–3), create the average score (line 4), ...

In [None]:
tweets_sentiments = tweets['sentiments'].str.split(pat=' ', expand=True)
tweets_sentiments[0] = tweets_sentiments[0].astype('int')
tweets_sentiments[1] = tweets_sentiments[1].astype('int')
tweets_sentiments[2] = tweets_sentiments[[0, 1]].mean(axis=1)
tweets_sentiments.head()

and append these columns to the `tweets` table:

In [None]:
tweets[['sentiment_pos', 'sentiment_neg', 'sentiment_avg']] = tweets_sentiments
del tweets['sentiments']
tweets.head()

As the data is structured now, it is not fully normalized because the `tweets` table is still wide regarding sentiment scores. In a way, it is like the `bad_table` in the toy example above, which did not allow us to compute an average easily. It is worth discussing whether or not it is necessary to fully normalize the data and store the three sentiment scores in a relationship table and the sentiment category in yet another table. For most tasks, working with the three columns in the `tweets` table will certainly be easier. However, as we will see in the following subsection, it is beneficial to normalize our data all the way through. With `tweets_sentiments`, we already have the first step for the relationship table. Use the [`melt()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html) function, which makes this wide subtable long ([`ignore_index=False`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html) keeps the tweet index, which `reset_index()` then turns into a column). With the few steps in the following cell, we have already completed steps 2, 3, and 5 from the pipeline:





In [None]:
tweets_sentiments = pd.melt(tweets_sentiments, ignore_index=False).reset_index()
tweets_sentiments.columns = ['tweet_idx', 'sentiment_idx', 'score']
tweets_sentiments.head()
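What `melt()` does to the index can be seen on a toy table. This is a minimal sketch with made-up scores and made-up index values (100 and 101), assuming the same column labels 0, 1, and 2 as in `tweets_sentiments`:

```python
import pandas as pd

# Hypothetical wide table: columns 0/1/2 = positive/negative/average score
wide = pd.DataFrame({0: [2, 1], 1: [-1, -3], 2: [0.5, -1.0]}, index=[100, 101])

# ignore_index=False keeps the original index; reset_index() turns it into a column
long = pd.melt(wide, ignore_index=False).reset_index()
long.columns = ['tweet_idx', 'sentiment_idx', 'score']
print(long)
```

Each tweet now appears once per sentiment category, so the former column labels act as a foreign key to a table of sentiment labels.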

The only thing left to do is to create the `sentiments` table, which contains the labels for the 'sentiment_idx' in `tweets_sentiments`. We have to make this manually as we have not introduced these labels yet:

In [None]:
sentiments = pd.DataFrame(data=['positive', 'negative', 'average'], columns=['sentiment'])
sentiments

### 2.1.3. Saving the tables to file(s)

#### Multiple TSV files

Now that we have a set of linked tables, how do we store them? A simple way is to save each table to a file with tab-separated values (TSV) using [`to_csv(sep='\t')`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html):

In [None]:
directory = '../data/TweetsCOV19/TweetsCOV19_tables'
os.makedirs(directory, exist_ok=True)

In [None]:
#tweets.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets.tsv', sep='\t', index=False, encoding='utf-8')
#users.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/users.tsv', sep='\t', index=False, encoding='utf-8')
#tweets_named_entities.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets_named_entities.tsv', sep='\t', index=False, encoding='utf-8')
#tweets_mentions.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets_mentions.tsv', sep='\t', index=False, encoding='utf-8')
#tweets_hashtags.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets_hashtags.tsv', sep='\t', index=False, encoding='utf-8')
#tweets_urls.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets_urls.tsv', sep='\t', index=False, encoding='utf-8')
#tweets_tlds.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tweets_tlds.tsv', sep='\t', index=False, encoding='utf-8')
#named_entities.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/named_entities.tsv', sep='\t', index=False, encoding='utf-8')
#mentions.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/mentions.tsv', sep='\t', index=False, encoding='utf-8')
#hashtags.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/hashtags.tsv', sep='\t', index=False, encoding='utf-8')
#urls.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/urls.tsv', sep='\t', index=False, encoding='utf-8')
#tlds.to_csv(path_or_buf='../data/TweetsCOV19/TweetsCOV19_tables/tlds.tsv', sep='\t', index=False, encoding='utf-8')

Note that, in the above cell, we set `index=False`. That means the index is not written to file.
This is not a problem because it is a row counter starting with 0 in any table. Such an index is automatically created if `index_col=None` when reading these files:

In [None]:
pd.read_csv(
    filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/tweets.tsv', 
    sep = '\t', 
    index_col = None, 
    encoding = 'utf-8'
)
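That the index survives such a round trip can be checked without touching the file system. The following sketch writes a made-up table to an in-memory text buffer instead of a file; the buffer and the toy table are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical entity table with the default 0-based index
table = pd.DataFrame({'hashtag': ['covid19', 'vaccine'], 'tweets': [3, 1]})

# Write tab-separated values without the index ...
buffer = io.StringIO()
table.to_csv(path_or_buf=buffer, sep='\t', index=False)

# ... and read them back; index_col=None creates a fresh 0-based row counter
buffer.seek(0)
restored = pd.read_csv(filepath_or_buffer=buffer, sep='\t', index_col=None)
print(restored)
```

The round trip only preserves the index because it is a plain row counter; a table with a meaningful index would need `index=True` when writing.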

#### SQL

One benefit of a relational database is that it can be stored in one file. Hence, another possibility is to take the next step on the relational database path and store all tables in an SQL database. [SQLAlchemy](https://www.sqlalchemy.org/) has emerged as the standard Python library for working with databases in SQL-like ways. To store the twelve tables from *figure 3*, all you need to do before exporting the tables is to create a so-called "engine" that manages connections through which you create, modify, and query the database. In this case, we create an SQLite database (other SQL dialects are possible):

In [None]:
import sqlalchemy
sqlalchemy.__version__

In [None]:
engine = sqlalchemy.create_engine(url='sqlite:///../data/TweetsCOV19/TweetsCOV19.sql')

Use the [`to_sql()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) method, configured to use the engine just created, to add the tables to the database (the following lines just add the first five rows of each table; to add all rows remove `.head()`; a table is replaced if it already exists):

In [None]:
tweets.head().to_sql(name='tweets', con=engine, if_exists='replace', index=False)
users.head().to_sql(name='users', con=engine, if_exists='replace', index=False)
tweets_named_entities.head().to_sql(name='tweets_named_entities', con=engine, if_exists='replace', index=False)
tweets_mentions.head().to_sql(name='tweets_mentions', con=engine, if_exists='replace', index=False)
tweets_hashtags.head().to_sql(name='tweets_hashtags', con=engine, if_exists='replace', index=False)
tweets_urls.head().to_sql(name='tweets_urls', con=engine, if_exists='replace', index=False)
tweets_tlds.head().to_sql(name='tweets_tlds', con=engine, if_exists='replace', index=False)
named_entities.head().to_sql(name='named_entities', con=engine, if_exists='replace', index=False)
mentions.head().to_sql(name='mentions', con=engine, if_exists='replace', index=False)
hashtags.head().to_sql(name='hashtags', con=engine, if_exists='replace', index=False)
urls.head().to_sql(name='urls', con=engine, if_exists='replace', index=False)
tlds.head().to_sql(name='tlds', con=engine, if_exists='replace', index=False)

To read a dataframe from the database, also use the engine, but now in the [`read_sql()`](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) method:

In [None]:
pd.read_sql(sql='tweets', con=engine).head()

#### Excel

It is also possible to store all tables as sheets in an Excel file. However, the limit is 1,048,576 rows and 16,384 columns per sheet, so Excel is not an option for the full dataset. But this is how the tables can be exported in principle:

In [None]:
with pd.ExcelWriter(path='../data/TweetsCOV19/TweetsCOV19.xlsx', engine='xlsxwriter') as writer:
    tweets.head().to_excel(writer, sheet_name='tweets')
    users.head().to_excel(writer, sheet_name='users')
    tweets_named_entities.head().to_excel(writer, sheet_name='tweets_named_entities')
    tweets_mentions.head().to_excel(writer, sheet_name='tweets_mentions')
    tweets_hashtags.head().to_excel(writer, sheet_name='tweets_hashtags')
    tweets_urls.head().to_excel(writer, sheet_name='tweets_urls')
    tweets_tlds.head().to_excel(writer, sheet_name='tweets_tlds')
    named_entities.head().to_excel(writer, sheet_name='named_entities')
    mentions.head().to_excel(writer, sheet_name='mentions')
    hashtags.head().to_excel(writer, sheet_name='hashtags')
    urls.head().to_excel(writer, sheet_name='urls')
    tlds.head().to_excel(writer, sheet_name='tlds')

To read a sheet from that file:

In [None]:
pd.read_excel(io='../data/TweetsCOV19/TweetsCOV19.xlsx', sheet_name='tweets', index_col=0)

## 2.2. Handling data with Pandas, NumPy, and SciPy

In the previous subsection, you have learned how to create and save multiple tables from the raw TweetsCOV19 table that, taken together, mimic a relational database. In practice, particularly if your data is not as rich as this dataset, it may not be necessary to do that. However, even if you can live with some amount of redundancy, it certainly is a way to keep your data tidy and consistent. Having the TweetsCOV19 data in the relational database structure, we will next query it to obtain some first data descriptions. At some point, when working with the data, even Pandas will reach its limits. It is then time for other libraries to help or take over. We will see how [NumPy](https://numpy.org/) and [SciPy](https://scipy.org/), two fundamental libraries for scientific computing, can be used to store data in n-dimensional arrays and to compute the co-occurrence of hashtags in tweets using sparse matrices.

### 2.2.1. Querying the database just created

Querying is a term from the relational database world and it means to extract information from the database. **Filtering tables** is a simple query.

#### Which tweet named-entity relationships can we be confident about?

Recall from the discussion above that matches with a confidence below a threshold (*e.g.*, -2) may be too unreliable.

In [None]:
tweets_named_entities[tweets_named_entities['confidence'] >= -2.].head()

#### Which are the most popular tweets by unpopular users?

This question is open to interpretation. If we think of popular tweets as those with many retweets and unpopular users as those with few followers, then a possible answer can be:

In [None]:
tweets[(tweets['retweets'] >= 1000) & (tweets['followers'] <= 10)]

Such queries are useful to identify tweets for reading. For unsampled tweets, the first two are:

- https://twitter.com/anyuser/status/1262661210647887872
- https://twitter.com/anyuser/status/1265468721440514048

The third was probably deleted by the user.

Since we have eliminated redundancy (*i.e.*, split up data in multiple tables), we also need to link tables to answer certain questions. Linking tables goes by the name of **joining tables**. We will introduce the use of Pandas' [`join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) method step by step using this question:

#### Which are the most popular tweets by unproductive users?

The logic of joining is to link two tables on their indices or a column that exists in both tables. If no column name is specified, Pandas joins on the index by default. You can build a query step by step. Choose a dataframe from which you want to start. In our case, this is the table of popular `tweets`:

In [None]:
popular_tweets = tweets[tweets['retweets'] >= 1000]
popular_tweets.head()

The other dataframe that contains information to answer the question is the table of unproductive (*i.e.*, one tweet only) users:

In [None]:
unproductive_users = users[users['tweets'] == 1]

To be able to join these tables on an index, we have stored the 'user_idx' in the `tweets` – and, hence, `popular_tweets` – table. Set this column as the index of the `popular_tweets` table, join the `unproductive_users` table, and sort the result by the number of `retweets`. Adding `[[]]` in line 2 has the effect that no columns of the `unproductive_users` table are actually added to the result, and `how='inner'` in line 3 has the effect that not all popular tweets are shown but only those whose creators are unproductive:

In [None]:
popular_tweets.set_index(keys='user_idx').join(
    other = unproductive_users[[]], 
    how = 'inner'
).sort_values(by='retweets', ascending=False).head()
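The effect of `[[]]` and `how='inner'` is easier to see on toy tables. The numbers below are made up, and the toy `users` table is assumed to be indexed by the user index, as in the notebook:

```python
import pandas as pd

# Hypothetical tables: four tweets by three users
toy_tweets = pd.DataFrame({'user_idx': [0, 1, 1, 2], 'retweets': [1500, 2000, 30, 1200]})
toy_users = pd.DataFrame({'tweets': [1, 2, 1]})  # row index = user index

popular = toy_tweets[toy_tweets['retweets'] >= 1000]
unproductive = toy_users[toy_users['tweets'] == 1]

# unproductive[[]] has an index but no columns, so the join acts as a pure filter;
# how='inner' keeps only the tweets whose user index appears in both tables
result = popular.set_index(keys='user_idx').join(
    other = unproductive[[]], 
    how = 'inner'
).sort_values(by='retweets', ascending=False)
print(result)
```

The popular tweet by user 1 is dropped because that user has written two tweets, and no column from the users table appears in the result.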

The above query only uses one join. Other queries can also require multiple joins:

#### How often is a given hashtag used over time?

In this example, the result must be constructed from two tables linked by a relationship table, so we need two joins. There are several ways to build the query. You can start with either table. Here, we start with the relationship table. First, join the `tweets` table with just the 'timestamp' column to the `tweets_hashtags` table via the tweet index (lines 1–2). Second, join the `hashtags` table, filtered to only contain hashtags from `hashtag_list`, to the resulting table via the hashtag index (lines 3–6). Line 7 shows how the date can be extracted from the timestamp (in other words, how hours, minutes, and seconds can be removed). Line 9 aggregates the data and counts the number of tweets that use a given hashtag on a given day:

In [None]:
hashtag_list = ['coronavirus', 'covid19', 'hydroxychloroquine', 'vaccine']

In [None]:
days_hashtags_long = tweets_hashtags.set_index(keys='tweet_idx').join(
    other = tweets[['timestamp']]
).set_index(keys='hashtag_idx').join(
    other = hashtags[hashtags['hashtag'].isin(hashtag_list)][['hashtag']], 
    how = 'inner'
)
days_hashtags_long['day'] = days_hashtags_long['timestamp'].dt.date
del days_hashtags_long['timestamp']
days_hashtags_long = days_hashtags_long.groupby(by=['day', 'hashtag']).size().reset_index(name='tweets')
days_hashtags_long.head()

<div class='alert alert-block alert-warning'>
<b>Additional resources</b>

We have set up the TweetsCOV19 dataset in a relational database way and even stored it as an SQL database. It is also possible to use SQL commands to retrieve information from this dataset. Consult the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) of Pandas' `read_sql()` method and the SQLAlchemy [tutorials](https://docs.sqlalchemy.org/en/14/tutorial/index.html) as entry points. See the following code cell for instant success (it requires the engine from above).
</div>

In [None]:
pd.read_sql(sql='SELECT hashtag, tweets FROM hashtags', con=engine).head()
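Joins and aggregations can also be written in SQL. The following sketch counts hashtag uses per day with two joins, in the spirit of the Pandas query above. It is deliberately simplified: it uses Python's built-in `sqlite3` module with an in-memory database instead of the SQLAlchemy engine, and the toy tables carry explicit key columns ('tweet_idx', 'hashtag_idx') and a ready-made 'day' column, which are assumptions that do not match the stored tables exactly.

```python
import sqlite3
import pandas as pd

# Hypothetical toy tables with explicit key columns
toy_tweets = pd.DataFrame({'tweet_idx': [0, 1, 2], 'day': ['2020-05-01', '2020-05-01', '2020-05-02']})
toy_hashtags = pd.DataFrame({'hashtag_idx': [0, 1], 'hashtag': ['covid19', 'vaccine']})
toy_tweets_hashtags = pd.DataFrame({'tweet_idx': [0, 0, 1, 2], 'hashtag_idx': [0, 1, 0, 1]})

# In-memory SQLite database that vanishes when the connection is closed
con = sqlite3.connect(':memory:')
toy_tweets.to_sql(name='tweets', con=con, index=False)
toy_hashtags.to_sql(name='hashtags', con=con, index=False)
toy_tweets_hashtags.to_sql(name='tweets_hashtags', con=con, index=False)

# Two joins via the relationship table, then aggregation per day and hashtag
query = '''
SELECT t.day, h.hashtag, COUNT(*) AS tweets
FROM tweets_hashtags AS th
JOIN tweets AS t ON th.tweet_idx = t.tweet_idx
JOIN hashtags AS h ON th.hashtag_idx = h.hashtag_idx
GROUP BY t.day, h.hashtag
ORDER BY t.day, h.hashtag
'''
result = pd.read_sql(sql=query, con=con)
con.close()
print(result)
```

Note that tables written with `index=False` do not contain the row index, so joining the stored tables in SQL would require writing the index as a key column first.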

### 2.2.2. Using NumPy to store data in n-dimensional arrays

<img src='images/numpy.png' style='height: 100px; float: right; margin-left: 10px'>

Pandas offers Series and DataFrames as native data structures, but its mathematical operations rely on the corresponding NumPy data structures: vectors and matrices. NumPy is purely numerical. The basic differences between Pandas and NumPy are:

- NumPy provides n-dimensional data structures called **arrays**, not just matrices (2-dimensional data structures).
- NumPy does not allow items to be lists.
- NumPy does not have metadata (*i.e.*, the indices of arrays are not labeled).

One use of NumPy is to store data in numerical form as part of a data processing pipeline. We will now go through two examples where data is stored in 2-dimensional and 3-dimensional arrays, respectively. For the **2-dimensional example**, consider that you want to create an array where the rows are days, the columns are hashtags, and the cells give the number of times that a hashtag is used in a day. We have just created the necessary `days_hashtags_long` table in the last querying example. We can make this table wide by using the [`pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) method, which imitates the corresponding procedure in the Excel spreadsheet software:

In [None]:
days_hashtags_wide = days_hashtags_long.pivot(index='day', columns='hashtag', values='tweets')
days_hashtags_wide.head()

We can transform the table into a NumPy array to get what we want. Appending `[:5]` will only show the first five rows:

In [None]:
days_hashtags_wide_array = days_hashtags_wide.to_numpy()
days_hashtags_wide_array[:5]

What we see is the dataframe stripped of its metadata.
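One detail of this transformation is worth checking on a toy table: day and hashtag combinations that do not occur in the long table become NaN in the wide table and in the array. The sketch below uses made-up counts; replacing NaN by 0 is shown as one possible treatment, not as a step of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical long table: 'vaccine' is not used on day 'd2'
toy_long = pd.DataFrame({
    'day': ['d1', 'd1', 'd2'], 
    'hashtag': ['covid19', 'vaccine', 'covid19'], 
    'tweets': [5, 2, 7]
})

# Long to wide: rows are days, columns are hashtags
toy_wide = toy_long.pivot(index='day', columns='hashtag', values='tweets')

# Stripping the labels leaves a float array with NaN for the missing combination
toy_array = toy_wide.to_numpy()
print(toy_array)

# Optional: interpret missing combinations as zero uses
toy_array_filled = toy_wide.fillna(0).to_numpy()
print(toy_array_filled)
```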

For the **3-dimensional example**, obviously, we cannot use Pandas all the way since its DataFrames are limited to two dimensions. But we can still use it to process the data before we store it in an array. Consider that you want to create an array where the first dimension is days, the second dimension is mentioned users, the third dimension is sentiment categories (positive, negative, and average), and the cells give the mean sentiment scores of tweets in which users are mentioned in a day. Now we benefit from having normalized the data all the way through because the array can naturally be populated from the normalized tables. The full information we need is stored in five tables, so we would need four joins. However, since we use all three sentiment categories (and all three indices) from the `sentiments` table, we will spare the fourth join to the `sentiments` table and use the 'sentiment_idx' for the third dimension instead. Note that we do not `drop` the 'tweet_idx' in the first join (lines 1–2) because we will also need it in line 6 to prepare the third join (lines 6–8). Line 9 extracts the date. Line 11 computes the mean sentiment scores:

In [None]:
mention_list = ['realdonaldtrump', 'who', 'breitbartnews', 'cnn']

In [None]:
days_mentions_sentiments_long = tweets_mentions.set_index(keys='tweet_idx', drop=False).join(
    other = tweets[['timestamp']]
).set_index('mention_idx').join(
    other = mentions[mentions['mention'].isin(mention_list)][['mention']], 
    how = 'inner'
).set_index(keys='tweet_idx').join(
    other = tweets_sentiments.set_index(keys='tweet_idx')
).reset_index(drop=True)
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['timestamp'].dt.date
del days_mentions_sentiments_long['timestamp']
days_mentions_sentiments_long = days_mentions_sentiments_long.groupby(['day', 'mention', 'sentiment_idx']).mean().round(4).reset_index()
days_mentions_sentiments_long.head()

From this table, we now create the array, and the first three columns are the three dimensions. Since NumPy indexing is purely numerical, we must represent the days and mentions in this table by identifiers that start with 0 and are contiguous (just like the 'sentiment_idx'). In other words, we must make the first two variables categorical (if you know R, you will recognize the 'category' data type as the counterpart of R's factor). Transform the first two columns using `astype('category')`:

In [None]:
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['day'].astype('category')
days_mentions_sentiments_long['mention'] = days_mentions_sentiments_long['mention'].astype('category')

Before replacing the category labels by their numerical codes, we save them:

In [None]:
day_categories = days_mentions_sentiments_long['day'].cat.categories
day_categories

In [None]:
mention_categories = days_mentions_sentiments_long['mention'].cat.categories
mention_categories

It is useful to also store the sentiment categories in a variable:

In [None]:
sentiment_categories = sentiments['sentiment'].tolist()
sentiment_categories

Now we can replace the categories by their codes:

In [None]:
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['day'].cat.codes
days_mentions_sentiments_long['mention'] = days_mentions_sentiments_long['mention'].cat.codes

In [None]:
days_mentions_sentiments_long.head()
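How categories and codes relate can be verified on a toy Series. This is a small sketch with made-up values: the categories of strings are sorted alphabetically, the codes are positions in that sorted list, and the saved categories allow translating codes back to labels.

```python
import pandas as pd

# Hypothetical column of mentioned users
toy_mentions = pd.Series(['who', 'cnn', 'who', 'breitbartnews']).astype('category')

# Save the labels before replacing them by codes
toy_categories = toy_mentions.cat.categories
toy_codes = toy_mentions.cat.codes
print(list(toy_categories))
print(toy_codes.tolist())

# The saved categories translate codes back to labels
toy_labels = toy_categories[toy_codes.to_numpy()].tolist()
print(toy_labels)
```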

We have arrived at a purely numerical table where the first three columns are integers and the last column contains floats. This is a data structure that NumPy can work with. To transform the table into an array:

In [None]:
days_mentions_sentiments_long_array = days_mentions_sentiments_long.to_numpy()
days_mentions_sentiments_long_array[:5]

To make this long array wide, first, create an empty container (array) with a 3-dimensional `shape` where the first dimension is as long as there are days, the second dimension is as long as there are mentions, and the third dimension is as long as there are sentiment categories. Note that `days_mentions_sentiments_long_array` contains float variables even though the first three columns were integers in the dataframe. This is because an array (of the `ndarray` class) only permits one data type and the last column contains floats. To handle this situation, we use a trick: we create a container for generic Python objects (`dtype='object'`), which can hold integers and floats side by side. After initialization, each cell is set to Not a Number (NaN).

In [None]:
import numpy as np
np.__version__

In [None]:
days_mentions_sentiments_wide_array = np.empty(shape=(len(day_categories), len(mention_categories), len(sentiments)), dtype='object')
days_mentions_sentiments_wide_array[:] = np.nan

Then fill this array by using the first three columns of `days_mentions_sentiments_long_array` as indices – Pandas actually took over indexing from NumPy – for the wide array and filling the cells from the fourth column. Note that we cast the first three columns from 'float' to 'int' because only integers can serve as indices, and that we explicitly keep the last column as 'float'. Appending `[:5]` now means that the matrices of mentions and sentiments for the first five days are shown:

In [None]:
days_mentions_sentiments_wide_array[days_mentions_sentiments_long_array[:, 0].astype('int'), days_mentions_sentiments_long_array[:, 1].astype('int'), days_mentions_sentiments_long_array[:, 2].astype('int')] = days_mentions_sentiments_long_array[:, 3].astype('float')
days_mentions_sentiments_wide_array[:5]

We can slice an n-dimensional array any way we want. The matrix of days and sentiments for the first mentioned user 'breitbartnews' is:

In [None]:
print(mention_categories[0])
days_mentions_sentiments_wide_array[:, 0, :]
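The filling and slicing logic can be reproduced on a tiny array. Note that this sketch swaps in a different container: it uses `np.full()` with a float data type instead of the object-typed `np.empty()` container used above, which is a possible variant when all cell values are numeric. The long array is made up.

```python
import numpy as np

# Hypothetical long array: columns are day code, mention code, sentiment code, score
toy_long_array = np.array([
    [0, 0, 0, 1.5], 
    [0, 1, 2, -0.5], 
    [1, 0, 1, -2.0]
])

# Container of 2 days x 2 mentions x 3 sentiment categories, initially all NaN
toy_wide_array = np.full(shape=(2, 2, 3), fill_value=np.nan)

# The first three columns become integer indices, the fourth column fills the cells
idx = toy_long_array[:, :3].astype('int')
toy_wide_array[idx[:, 0], idx[:, 1], idx[:, 2]] = toy_long_array[:, 3]

# Slice: days x sentiments matrix for the first mention
toy_slice = toy_wide_array[:, 0, :]
print(toy_slice)
```

Cells that no row of the long array points to remain NaN, which marks day and mention combinations without any tweet.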

### 2.2.3. Using SciPy to handle sparse data

In many cases, we want to work with matrices and arrays mathematically. For example, to compute the logarithm of each cell in a Pandas Series, we must apply a NumPy function:

In [None]:
np.log10(users['followers_max'].replace(to_replace=0, value=np.nan))

NumPy's [`log10()`](https://numpy.org/doc/stable/reference/generated/numpy.log10.html) is a so-called [universal function](https://numpy.org/doc/stable/reference/ufuncs.html) that goes through the vector item by item. Universal functions are unaware of **data sparsity**. Data is sparse when it contains many zeros or NaNs, depending on the context. `log10()` tries to take the logarithm of each cell in a vector even if only one out of 1 million cells is larger than 0. Unawareness of data sparsity can easily cause your computer to run out of memory, for example, when algebraic operations like matrix multiplication are performed. Matrix multiplication is required to obtain **co-occurrence matrices**, for example, of hashtags in tweets. The raw TweetsCOV19 dataset is quite sparse, as you can tell by the many 'null;' entries. Besides eliminating redundancy, our transformation into a relational database has also made the data completely dense (*i.e.*, unsparse). The zeros that remain, like 0 retweets, actually are pieces of information.

<img src='images/scipy.png' style='height: 100px; float: right; margin-left: 10px'>

Pandas is not well prepared to handle sparsity. While it offers [data structures for efficiently storing sparse data](https://pandas.pydata.org/docs/user_guide/sparse.html), sparse data processing is hardly developed. This is where SciPy, the other fundamental library for scientific computing, comes in. SciPy is the standard library for handling [sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html) data, and we will now see how we can obtain the hashtag co-occurrence matrix for the TweetsCOV19 dataset.

To be efficient for different kinds of operations, SciPy offers different sparse matrix formats. While choosing the right format is not so important in our case, it is quite important when data gets really big. COOrdinate matrices are fast for constructing sparse matrices. To construct the occurrence matrix `TH` (T for tweets, H for hashtags), the [`coo_matrix()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html) constructor takes as input a (cells, (rows, columns)) triplet.

<div class='alert alert-block alert-info'>
<b>Insight</b>

It is one of the big benefits of **relationship tables**, as we are using them here, that they only contain the contiguous identifiers which are required by matrix manipulation routines like in SciPy. In the current example, all necessary information for constructing the sparse occurrence matrix lies in the `tweets_hashtags` relationship table.
</div>

`cells` is just a vector of 1s because each hashtag is counted only once per tweet (as a result of the `create_relationship_table()` function):

In [None]:
rows = tweets_hashtags['tweet_idx']
cols = tweets_hashtags['hashtag_idx']
cells = [1]*len(tweets_hashtags)

In [None]:
from scipy.sparse import coo_matrix

In [None]:
TH = coo_matrix((cells, (rows, cols)), shape=(len(tweets), len(hashtags)))
TH

The technical summary is:

In [None]:
TH.__dict__

Easy to read:

In [None]:
print(TH)

The index pairs of a sparse matrix can be accessed via:

In [None]:
print('Tweet/row indices:', TH.nonzero()[0])
print('Hashtag/column indices:', TH.nonzero()[1])
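The (cells, (rows, columns)) input can be easier to follow on a small example. The sketch below builds an occurrence matrix from a hypothetical relationship table with three tweets and four hashtags; the table and all `toy_` names are invented for illustration.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical relationship table: 3 tweets (rows) using 4 hashtags (columns)
toy_tweets_hashtags = pd.DataFrame({
    'tweet_idx':   [0, 0, 1, 2, 2, 2],
    'hashtag_idx': [0, 1, 1, 0, 2, 3],
})

toy_rows = toy_tweets_hashtags['tweet_idx']
toy_cols = toy_tweets_hashtags['hashtag_idx']
toy_cells = [1] * len(toy_tweets_hashtags)  # one entry per tweet-hashtag pair

toy_TH = coo_matrix((toy_cells, (toy_rows, toy_cols)), shape=(3, 4))
print(toy_TH.toarray())  # dense view, only sensible for small matrices
```

Each row of the relationship table becomes exactly one stored cell of the matrix.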

#### Getting the co-occurrence matrix

Given an occurrence matrix $TH$ with tweet indices as rows and hashtag indices as columns, the co-occurrence matrix of hashtags co-occurring in tweets is $Co=HT\cdot TH$ where $HT$ is the transpose of $TH$ (*i.e.*, with hashtag indices as rows and tweet indices as columns) and $\cdot$ means that $HT$ is [multiplied](https://en.wikipedia.org/wiki/Matrix_multiplication) by $TH$ (<a href='#destination1'>Batagelj & Cerinšek, 2013</a>). To do fast vector operations (*e.g.*, matrix multiplication), it is recommended to transform the matrix into a Compressed Sparse Row (CSR) matrix ([`csr_matrix()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)) before multiplication:

In [None]:
TH = TH.tocsr()

If the COO matrix had contained duplicate entries (*i.e.*, several cells with the same row and column index), they would have been summed during the conversion.
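The summing of duplicates during conversion can be checked on a hypothetical toy matrix in which one cell is deliberately listed twice:

```python
from scipy.sparse import coo_matrix

# Hypothetical toy data: the cell (0, 1) is listed twice
dup = coo_matrix(([1, 1, 1], ([0, 0, 1], [1, 1, 0])), shape=(2, 2))
print('Stored entries in COO:', dup.nnz)

dup_csr = dup.tocsr()  # duplicates are summed here
print('Stored entries in CSR:', dup_csr.nnz)
print(dup_csr.toarray())
```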

To get the co-occurrence matrix `Co` (where `TH.T` is the transpose of `TH` and [`dot()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.dot.html) is the method for matrix multiplication):

In [None]:
Co = TH.T.dot(other=TH)
Co

Note that the resulting matrix is in the Compressed Sparse Column (CSC) format.

Sparse matrices can be accessed via indices like arrays. Use [`todense()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.todense.html) to display the sparse matrix in wide (but dense), not long format. The first five rows and columns show that the matrix is symmetric (*i.e.*, has redundant information in the upper and lower triangular portions) and that the diagonal contains the counts of how often a hashtag is used in all tweets:

In [None]:
Co[:5, :5].todense()
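Both properties, symmetry and usage counts on the diagonal, can be verified on a small example. The occurrence matrix below is hypothetical (three tweets, four hashtags) and only serves to make the matrix product traceable by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical occurrence matrix: 3 tweets (rows) by 4 hashtags (columns)
toy_TH = csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
]))

toy_Co = toy_TH.T.dot(toy_TH).toarray()
print(toy_Co)

# The matrix equals its transpose ...
print((toy_Co == toy_Co.T).all())
# ... and its diagonal equals the column sums of the occurrence matrix
column_sums = np.asarray(toy_TH.sum(axis=0)).ravel()
print(toy_Co.diagonal(), column_sums)
```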

To eliminate the redundancy, use the [`triu()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.triu.html) function to extract just the matrix's upper triangular portion, including the diagonal. This operation is fastest when the matrix is in the COO format. Hence, transform it with `.tocoo()` first:

In [None]:
from scipy.sparse import triu

In [None]:
Co = triu(Co.tocoo())
Co

In the SciPy version we are using, the diagonal cannot be removed, only set to 0 (fastest in COO format). The cell is commented out because we want to keep the diagonal:

In [None]:
#Co.setdiag(values=0)
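If the diagonal is not needed at all, one option that may be worth trying is the `k` parameter of `triu()`, which selects the diagonal from which the upper triangle starts. The sketch below uses a hypothetical 3×3 co-occurrence matrix; whether this fits your workflow depends on whether you want to keep the usage counts.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

# Hypothetical symmetric co-occurrence matrix of 3 hashtags
toy_Co = csr_matrix(np.array([
    [2, 1, 1],
    [1, 2, 0],
    [1, 0, 1],
]))

# k=0 (the default) keeps the main diagonal, k=1 starts one diagonal above it
with_diagonal = triu(toy_Co.tocoo(), k=0)
without_diagonal = triu(toy_Co.tocoo(), k=1)

print(with_diagonal.nnz, without_diagonal.nnz)
print(without_diagonal.toarray())
```

With `k=1`, the diagonal cells are never stored, so they cannot show up later as explicit zeros.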

The sparse matrix can be saved as a Pandas `hashtag_cooccurrences` table by transforming the row indexes (line 3), column indexes (line 4), and cells (line 5) to Series and concatenating them (if the diagonal values have been set to 0 they still show up as rows with NaN indices):

In [None]:
hashtag_cooccurrences = pd.concat(
    objs=[
        pd.Series(Co.nonzero()[0]), 
        pd.Series(Co.nonzero()[1]), 
        pd.Series(Co.data)
    ], 
    axis=1
)
hashtag_cooccurrences.columns = ['hashtag_idx_i', 'hashtag_idx_j', 'cooccurrence']
hashtag_cooccurrences

#### Getting the normalized co-occurrence matrix

You can also compute normalized co-occurrence scores. The idea is to give a hashtag a smaller weight the more it shares the "attention" it gets with other hashtags in a tweet. For example, if three hashtags are used in a tweet, each hashtag gets an attention or relationship weight of 1/3. These weights can be obtained via row normalization (of CSR matrices). For that purpose, `TH` must be normalized and the result stored as `N`. The [scikit-learn](https://scikit-learn.org/) library's [`normalize()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html) function can be used for this task:

In [None]:
type(TH)

In [None]:
from sklearn.preprocessing import normalize

In [None]:
N = normalize(TH, norm='l1', axis=1)
N

In [None]:
print(N)

The normalized co-occurrence matrix of hashtags co-occurring in tweets is $Cn=HT\cdot N$ (<a href='#destination1'>Batagelj & Cerinšek, 2013</a>):

In [None]:
Cn = TH.T.dot(other=N)
Cn = triu(Cn.tocoo())
#Cn.setdiag(values=0)
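To see what the normalization does, here is a sketch on a hypothetical occurrence matrix. Instead of scikit-learn, it divides each row by its sum using plain SciPy, which gives the same result as the `l1` norm as long as all cells are non-negative. It assumes that every tweet uses at least one hashtag (otherwise a row sum would be 0).

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Hypothetical occurrence matrix: 3 tweets (rows) by 4 hashtags (columns)
toy_TH = csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
], dtype=float))

# Divide each row by its sum (the l1 norm for non-negative cells)
row_sums = np.asarray(toy_TH.sum(axis=1)).ravel()
toy_N = diags(1 / row_sums).dot(toy_TH)

# Normalized co-occurrence matrix
toy_Cn = toy_TH.T.dot(toy_N)
print(toy_N.toarray().round(3))
print(toy_Cn.toarray().round(3))
```

In the third toy tweet, each of the three hashtags receives a weight of 1/3, and these weights are what gets summed up in `toy_Cn`.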

Attach the normalized co-occurrence scores to the `hashtag_cooccurrences` table (line 1), remove the diagonal rows if you want (line 2), and sort the table (line 3):

In [None]:
hashtag_cooccurrences['cooccurrence_norm'] = pd.Series(data=Cn.data.round(4))
#hashtag_cooccurrences = hashtag_cooccurrences[hashtag_cooccurrences['cooccurrence'] > 0]
hashtag_cooccurrences = hashtag_cooccurrences.sort_values(by=['hashtag_idx_i', 'hashtag_idx_j']).reset_index(drop=True)
hashtag_cooccurrences

Networks can be constructed directly from such co-occurrence tables, as we will see in [Session 7: Network analysis](). For now, we just save the table to a file:

In [None]:
hashtag_cooccurrences.to_csv(path_or_buf='../data/TweetsCOV19/hashtag_cooccurrences.tsv', sep='\t', index=False, encoding='utf-8')

## 2.3. Exploring the data visually

<img src='images/matplotlib.png' style='height: 50px; float: right; margin-left: 10px'>

We have claimed that Python does not need to hide behind R and the tidyverse. We hope that the previous subsections have demonstrated Python's appeal for the data management and processing steps. Now, you will learn how you can explore the data visually and produce publication-ready figures. [Matplotlib](https://matplotlib.org/) is Python's basic library for creating visualizations. Matplotlib gives you many options regarding kinds of plots and how to style them, but achieving what you want can be cumbersome. The [Seaborn](https://seaborn.pydata.org/) library is an easier-to-use interface "for drawing attractive and informative statistical graphics" that builds on Matplotlib and integrates closely with Pandas data structures.

If you plot the number of tweets over time using plain Matplotlib, you will notice a downward trend and a weekly rhythm:

In [None]:
import matplotlib.pyplot as plt

In [None]:
tweets_over_time = tweets.groupby(tweets['timestamp'].dt.date).size()

In [None]:
plt.figure(figsize=[12, 3])
plt.plot(tweets_over_time)
plt.ylabel('Frequency')
plt.show()

<img src='images/seaborn.png' style='width: 100px; float: right; margin-left: 10px'>

With `set_theme()` from Seaborn, you can set a visual style that will be used even if you do plain Matplotlib plotting. You can choose from five styles: 'darkgrid', 'whitegrid', 'dark', 'white', and 'ticks'.

In [None]:
import seaborn as sns

In [None]:
sns.set_theme(style='darkgrid')

In [None]:
plt.figure(figsize=[12, 2])
plt.plot(tweets_over_time)
plt.ylabel('Frequency')
plt.show()
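A trend and a weekly rhythm like the ones noted above can also be separated numerically. The following sketch is not part of the original workflow; it uses made-up daily counts and a centered 7-day rolling mean, which averages over exactly one weekly cycle and therefore leaves only the trend.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts for May 2020: a linear decline plus a weekend dip
days = pd.date_range('2020-05-01', '2020-05-31', freq='D')
trend = np.linspace(1000, 700, len(days))
weekend_dip = np.where(days.dayofweek >= 5, -150, 0)
toy_counts = pd.Series(trend + weekend_dip, index=days)

# A centered 7-day mean always covers one Saturday and one Sunday
toy_trend = toy_counts.rolling(window=7, center=True).mean()
print(toy_trend.dropna().round(1).head())
```

The smoothed series could be passed to `plt.plot()` in the same way as `tweets_over_time`.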

The TweetsCOV19 webpage shows [plots](https://data.gesis.org/tweetscov19/#Statistics) about the frequency development of selected hashtags up until April 2020. Above, we have collected the usage statistics for hashtags listed in `hashtag_list` and stored them in `days_hashtags_wide_array`. We can easily create figures for May 2020 from that 2-dimensional array. The following cell loops through the array by iterating through the `hashtag_indices` defined in line 1. Since the array that holds the y values is stripped of labels, we must take the x labels from the corresponding `days_hashtags_wide` dataframe:

In [None]:
hashtag_indices = [0, 1]

plt.figure(figsize=[12, 2])
for hashtag_index in hashtag_indices:
    plt.plot(days_hashtags_wide.index, days_hashtags_wide_array[:, hashtag_index], label='#' + hashtag_list[hashtag_index])
plt.legend()
plt.ylabel('Frequency')
plt.show()

It is equally simple to plot data that lives in 3-dimensional arrays. The TweetsCOV19 webpage also shows time trends of the mean sentiment category scores of tweets in which prominent Twitter users are mentioned up until April 2020. We have retrieved the May 2020 data for mentions in `mention_list` and stored it in `days_mentions_sentiments_wide_array`. This time, we make a first loop through all mentions from `mention_categories` (line 1) and a second loop through all sentiments from `sentiment_categories` (line 3). We use the datetime objects stored in `day_categories` as x values and draw the y values from the array:

In [None]:
for mention_index in range(len(mention_categories)):
    plt.figure(figsize=[12, 2])
    for sentiment_index in range(len(sentiment_categories)):
        plt.plot(day_categories, days_mentions_sentiments_wide_array[:, mention_index, sentiment_index], label=sentiment_categories[sentiment_index])
    plt.legend()
    plt.title('@' + mention_categories[mention_index])
    plt.ylabel('Sentiment')
    plt.show()

Except for @who, the average sentiment is slightly negative. This means that the language of tweets that mention these users tends to be laden with negative emotions – it does not mean that negative sentiments are voiced about the mentioned users.
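The indexing used in the nested loop above selects one time series per mention and sentiment category. A toy array with invented dimensions (3 days, 2 mentions, 2 sentiment categories) shows how the three axes are addressed:

```python
import numpy as np

# Hypothetical array: 3 days x 2 mentions x 2 sentiment categories
toy_array = np.arange(12).reshape(3, 2, 2)

# All days for the first mention and the second sentiment category
series = toy_array[:, 0, 1]
print(series)

# All days and both sentiment categories for the first mention
first_mention = toy_array[:, 0, :]
print(first_mention.shape)
```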

#### Distributions

Social media data is known to often be very skewed (*i.e.*, not normally distributed). Indeed, we have already seen that some users have tens of millions of followers. For quantitative analysis, especially for the kinds of analyses performed in [Session 9: Statistics & supervised machine learning](), it is very important to know how variables are distributed. Boxplots can be a first step to assessing distributions. In the following, we are interested in the distributions of the 'followers_max' and 'friends_max' variables in the `users` table. To produce Seaborn boxplots, `melt()` the subtable with those two columns (*i.e.*, make it long), ...

In [None]:
followers_friends = pd.melt(users[['followers_max', 'friends_max']])
followers_friends.head()

then plot:

In [None]:
plt.figure(figsize=[3, 3])
sns.boxplot(x='variable', y='value', data=followers_friends)
plt.yscale('log')

Both variables are extremely skewed (note the logarithmic y-axis). Often such variables are transformed into their logarithm to make them behave better. Add 1 before taking the log: since the logarithm of 1 is 0, users with a value of 0 will keep it instead of producing an undefined result:

In [None]:
log_users = users[['followers_max', 'friends_max']].copy()
log_users = np.log10(log_users[['followers_max', 'friends_max']] + 1).round(4)
log_users.columns = ['log_followers_max', 'log_friends_max']
log_users.head()
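The effect of adding 1 before taking the logarithm can be seen on a few made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.Series([0, 9, 99, 999])  # hypothetical follower counts

# Without the shift, the logarithm of 0 is negative infinity
with np.errstate(divide='ignore'):
    plain = np.log10(toy)

# With the shift, a count of 0 is mapped to 0
shifted = np.log10(toy + 1)

print(plain.tolist())
print(shifted.tolist())
```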

Seaborn's [`histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) creates histograms and allows you to add [kernel density estimates](https://en.wikipedia.org/wiki/Kernel_density_estimation). In line 3, we take a random sample from the data because density estimation takes quite long:

In [None]:
plt.figure(figsize=[3, 3])
sns.histplot(
    data=log_users.sample(n=10000, random_state=42), 
    bins=20, 
    kde=True
)
plt.show()

Both logged variables look normally distributed. That means it is a good hypothesis that the untransformed variables are [lognormally](https://en.wikipedia.org/wiki/Log-normal_distribution) distributed. We can test this hypothesis with the [powerlaw](https://github.com/jeffalstott/powerlaw) library. First, we fit a number of candidate functions to the whole range (`xmin=1`) of the data using [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation):

In [None]:
import powerlaw
powerlaw.__version__

In [None]:
fit_followers_max = powerlaw.Fit(data=users['followers_max'], xmin=1)
fit_friends_max = powerlaw.Fit(data=users['friends_max'], xmin=1)

We plot two of these candidate functions, the lognormal and a power law:

In [None]:
plt.figure(figsize=[3, 3])
fig = fit_followers_max.plot_pdf(marker='o', linestyle='', label='data')
fit_followers_max.lognormal.plot_pdf(linestyle='--', ax=fig, label='lognormal')
fit_followers_max.power_law.plot_pdf(linestyle='--', ax=fig, label='power_law')
plt.legend()
plt.xlabel('followers_max')
plt.ylabel('PDF')
plt.show()

In [None]:
plt.figure(figsize=[3, 3])
fig = fit_friends_max.plot_pdf(marker='o', linestyle='', label='data')
fit_friends_max.lognormal.plot_pdf(linestyle='--', ax=fig, label='lognormal')
fit_friends_max.power_law.plot_pdf(linestyle='--', ax=fig, label='power_law')
plt.legend()
plt.xlabel('friends_max')
plt.ylabel('PDF')
plt.show()

The lognormal seems to be a better fit for the data in both cases. We can test this using loglikelihood ratios. `distribution_compare()` returns the ratio as the first value and its p-value as the second value. In both cases, the ratio is extremely large and significantly different from 0. A large, positive, and significant ratio means that the first distribution, in these cases the lognormal distribution, is a better fit to the data:

In [None]:
fit_followers_max.distribution_compare('lognormal', 'power_law')

In [None]:
fit_friends_max.distribution_compare('lognormal', 'power_law')

We have found that, for both variables, the lognormal distribution describes the data much better than a power law.
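To make the idea of a loglikelihood ratio more tangible, the following sketch computes one by hand with NumPy. It is a simplified illustration, not a reimplementation of the powerlaw library: the data is synthetic (drawn from a lognormal), `xmin` is simply set to the smallest observation, the lognormal is not truncated at `xmin`, and no significance test is performed.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data drawn from a lognormal distribution
x = rng.lognormal(mean=5, sigma=2, size=5000)
log_x = np.log(x)
n = len(x)

# Lognormal: the maximum likelihood estimates are the mean and
# standard deviation of log(x)
mu, sigma = log_x.mean(), log_x.std()
ll_lognormal = np.sum(
    -np.log(x * sigma * np.sqrt(2 * np.pi)) - (log_x - mu) ** 2 / (2 * sigma ** 2)
)

# Continuous power law on [xmin, inf): alpha = 1 + n / sum(log(x / xmin))
xmin = x.min()
alpha = 1 + n / np.sum(np.log(x / xmin))
ll_power_law = n * np.log((alpha - 1) / xmin) - alpha * np.sum(np.log(x / xmin))

# A positive ratio favors the first distribution (here: the lognormal)
R = ll_lognormal - ll_power_law
print('alpha:', round(alpha, 3))
print('Loglikelihood ratio R:', round(R, 1))
```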

<div class='alert alert-block alert-warning'>
<b>Additional resources</b>

These results from fitting functions to the data mean that the variables are very unlikely to follow **power law distributions**. Knowing about power law behavior is important because, depending on their exponent, power laws do not have a finite variance or even a finite mean, which is statistically problematic. To learn about the importance of power law distributions, consult <a href='#destination2'>Clauset *et al.* (2009)</a>.
</div>
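The role of the exponent can be made explicit with a short derivation. Assume a continuous power law with exponent $\alpha>1$ and lower bound $x_{\min}>0$, that is, $p(x)=\frac{\alpha-1}{x_{\min}}\left(\frac{x}{x_{\min}}\right)^{-\alpha}$ for $x\geq x_{\min}$. Its $m$-th moment is:

$$
\langle x^m\rangle=\int_{x_{\min}}^{\infty}x^m\,p(x)\,dx=\frac{\alpha-1}{\alpha-1-m}\,x_{\min}^{m}\qquad\text{if }\alpha>m+1,
$$

and the integral diverges otherwise. Hence, the mean ($m=1$) is finite only for $\alpha>2$ and the variance (which requires $m=2$) only for $\alpha>3$.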

#### Bivariate relationships

Identifying bivariate relationships or correlations is another part of data exploration and can be done visually. Seaborn's [`jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) function creates joint and marginal views on two variables. Several `kind`s of views are available ('scatter', 'kde', 'hist', 'hex', 'reg', and 'resid' in recent Seaborn versions). Here, we show histograms (line 5): 

In [None]:
plot = sns.jointplot(
    data = log_users.sample(n=10000, random_state=42), 
    x = 'log_followers_max', 
    y = 'log_friends_max', 
    kind = 'hist', 
    joint_kws = dict(bins=40), 
    marginal_kws = dict(bins=20)
)
plot.fig.set_figwidth(3)
plot.fig.set_figheight(3)
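A visual impression of a bivariate relationship can be complemented by a correlation coefficient. The sketch below uses synthetic variables with a built-in positive relationship (the names `log_a` and `log_b` are invented) and the `corr()` method of Pandas dataframes; the same call could be applied to a table like `log_users`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical logged variables with a built-in positive relationship
log_a = rng.normal(loc=3, scale=1, size=1000)
log_b = 0.5 * log_a + rng.normal(loc=1, scale=0.5, size=1000)
toy_log = pd.DataFrame({'log_a': log_a, 'log_b': log_b})

# Pearson measures linear association, Spearman measures rank association
pearson = toy_log.corr(method='pearson')
spearman = toy_log.corr(method='spearman')
print(pearson.round(2))
print(spearman.round(2))
```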

If you have many variables whose relationships you want to explore, Seaborn offers the [`pairplot()`]() function. The diagonal of such a plot will be filled with the univariate distribution, and the kind of view can be set separately for the univariate (`diag_kind` parameter) and bivariate cases. This time, we plot the relationships for the four numerical variables of the `tweets` table (excluding the sentiment scores). Since they are very skewed, we log them first:

In [None]:
log_tweets = tweets[['followers', 'friends', 'retweets', 'favorites']].copy()
log_tweets = np.log10(log_tweets[['followers', 'friends', 'retweets', 'favorites']] + 1)
log_tweets.columns = ['log_followers', 'log_friends', 'log_retweets', 'log_favorites']

In [None]:
plot = sns.pairplot(
    data = log_tweets.sample(n=10000, random_state=42), 
    height = 2, 
    kind = 'hist', 
    diag_kind = 'hist', 
    plot_kws = dict(bins=40), # bins of the bivariate histograms off the diagonal
    diag_kws = dict(bins=20) # bins of the univariate histograms on the diagonal
)
plot.fig.set_figwidth(6)
plot.fig.set_figheight(6)

This concludes Session 2: Data handling & visualization. Some of the datasets created here will be used in later sessions. Now that we have an idea of how to manage, process, and explore our data in a research pipeline, we move to the first step of the data life cycle. [Session 3: API Harvesting]() and [Session 4: Web scraping]() are dedicated to data collection.

## References

<a id='destination1'></a>
Batagelj, V. & Cerinšek, M. (2013). On bibliographic networks. *Scientometrics* 96:845–864. https://doi.org/10.1007/s11192-012-0940-1.

<a id='destination2'></a>
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. *SIAM Review* 51:661–703. https://doi.org/10.1137/070710111.

<a id='destination3'></a>
Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., & Dietze, S. (2020). TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In: *CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management* (p. 2991–2998). https://doi.org/10.1145/3340531.3412765.

<a id='destination4'></a>
Fafalios, P., Iosifidis, V., Ntoutsi, E., & Dietze, S. (2018). TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets. In: *The Semantic Web. ESWC 2018. Lecture Notes in Computer Science*, vol 10843. Springer, Cham. https://doi.org/10.1007/978-3-319-93417-4_12.

<a id='destination5'></a>
Weidmann, N. B. (2022). *Data Management for Social Scientists: From Files to Databases*. Cambridge University Press.

<a id='destination6'></a>
Wikipedia (2022). Twitter. https://en.wikipedia.org/wiki/Twitter. Retrieved 01.12.2022.

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: Haiko Lietz

Contributors: Pouria Mirelmi & N. Gizem Bacaksizlar Turbic

Acknowledgements: ...

Version date: 18. January 2023

License: ...
</div>
