<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

#### Notes to be removed before publication

Reviewers: Olya & Helena

## Introduction to Computational Social Science methods with Python

# Session 2: Data management and visualization

**Data** has two components: content and structure. Plain text data is unstructured, but its content can also be represented in a structured way. Data representations lie on a continuum of structure. The rectangular table (also called dataframe or spreadsheet) is the most common data format in the social sciences because it is fully structured. Hierarchical data formats like JSON and HTML are semi-structured, and text data is unstructured (<a href='#weidmann_2022'>Weidmann 2022</a><a id='#weidmann_2022'></a>, ch. 3). In the practice of Computational Social Science, everything revolves around data. The so-called research life cycle consists of three steps: data collection, data processing, and data analysis. During this cycle, data changes its face: it is transformed from a raw state to a state in which it is ready for analysis. **Data processing** subsumes the steps in which this transformation takes place (Weidmann 2022, ch. 1).

**Data management** refers to the practices by which we stay in control of data as a resource. Data is best managed when the focus is on practical questions. Since data processing is the central step of the research life cycle towards answering those questions, it is strategically advantageous to also focus data management on data processing. Computational data processing workflows are advantageous because they fully document all the many steps from data collection to data analysis, they are convenient (like your favorite spreadsheet software could never be), they are replicable (nowadays in high demand by scholarly journals), they can be scaled up (necessary for [big data](https://en.wikipedia.org/wiki/Big_data)), and they offer the needed flexibility in the face of semi-structured or unstructured data sources (Weidmann 2022, pp. 7–9).

The **R** language and environment for statistical computing and graphics is very popular in the social sciences, also because it provides the [Tidyverse](https://www.tidyverse.org/), a collection of mutually adapted packages for tabular data structures, their manipulation (e.g., merging, aggregating), and producing appealing graphics (Weidmann 2022, ch. 7). We argue that **Python** does not need to hide behind R in this regard. The [Pandas](https://pandas.pydata.org/) library for managing tables truly is a "fast, powerful, flexible and easy to use open source data analysis and manipulation tool" that, when combined with the [Seaborn](https://seaborn.pydata.org/) statistical data visualization library, leaves nothing to be desired.

Pandas can also be used in a way that mimics the functionality of **relational databases**. These are systems where the columns of a table are split into multiple tables such that all redundancies are eliminated. Relational databases are often used in research when the data is either large in volume or rich in content because they ensure consistency and speed up data processing (Weidmann 2022, part 3). The public [TweetsKB](https://data.gesis.org/tweetskb/) corpus of annotated tweets (Fafalios *et al.* 2018) as well as its offspring, the [TweetsCOV19](https://data.gesis.org/tweetscov19/) corpus (Dimitrov *et al.* 2020), are examples where the data is explicitly modeled relationally and can serve as illustrations of meaningful data management.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how you can manage your data, keep it tidy, and visualize it while keeping a focus on your research questions. In subsession **2.1**, we will have a deep look at the rectangular table. You will experience how you can use the Pandas library to handle tables and mimic a relational database in such a way that your data gets ready for analysis. You will see what it means that relational databases eliminate redundancy and ensure consistency. The TweetsCOV19 dataset will function as an example that will show up repeatedly in this and subsequent sessions. In subsession **2.2**, you will learn how to use the Matplotlib and Seaborn libraries to visualize data. In subsession **2.3**, we will introduce the SciPy and NumPy libraries. SciPy enables you to efficiently process and analyze very large matrices (i.e., 2-dimensional numerical tables) with many zeros or missing values (which is often the case). NumPy allows you to work with n-dimensional tables called arrays, which are typically needed in data processing. Finally, in subsession **2.4**, you will learn how to read and write data from and to files. There, we will also work with semi-structured and unstructured data.
</div>

## 2.1. Using Pandas to relationally manage Twitter data

<img src='images/pandas.png' style='height: 100px; float: right; margin-left: 10px'>

[Pandas](https://pandas.pydata.org/) is Python's package for data management and processing using 2-dimensional tables. It allows you to work with any kind of observational or statistical data set, including matrices. Column entries can be heterogeneous (i.e., a single column can contain text, numbers, even lists). Pandas is also well-equipped to handle time series data, as we will see. We start with some illustrative toy examples before we enter the almost-big-data world using the TweetsCOV19 dataset.

### 2.1.1. Toy examples

In subsections 3.2 to 3.4, Weidmann (2022) discusses data, data processing, and the benefit of relational databases using toy examples and the R language. Here, we adapt these examples to Python. Recall that data = content + structure, and consider the following two pieces of data. The two have (almost) the same content but different structure. `sdb` is unstructured, `tdb` is structured:

In [1]:
sdb = 'Switzerland is a country with 8.3 million inhabitants, and its capital is Bern. Another country is Austria; its capital is Vienna and the population is 8.7 million.'
sdb

'Switzerland is a country with 8.3 million inhabitants, and its capital is Bern. Another country is Austria; its capital is Vienna and the population is 8.7 million.'

In [2]:
import pandas as pd

In [3]:
tdb = pd.DataFrame(data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']], columns=['country', 'population', 'capital'])
tdb

Unnamed: 0,country,population,capital
0,Switzerland,8.3,Bern
1,Austria,8.7,Vienna


`tdb` is a Pandas table called a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The columns of a DataFrame are called [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). A DataFrame contains labeled axes (rows and columns). Axis 0 (the rows) is called the **index**, and its labels consist of integers from $0$ to $n-1$ by default where $n$ is the number of rows:

In [4]:
tdb.index

RangeIndex(start=0, stop=2, step=1)

When no names are given, the **columns** (axis 1) are also labeled in such a way, but using text labels makes the table much more readable:

In [5]:
tdb.columns

Index(['country', 'population', 'capital'], dtype='object')

Similarly, the index can be a list of text labels.

Rows, columns, or cells can be extracted by specifying their locations (labels). For example, to extract the capital of Austria:

In [6]:
tdb.loc[1, 'capital']

'Vienna'
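As a minimal sketch of a text-labeled index, assume that we are free to use the country names as row labels. The names `toy` and `toy_by_country` are ours and serve illustration only. The same cell can then be addressed without knowing its row number:

```python
import pandas as pd

# Same content as tdb, under a separate name so that tdb stays untouched
toy = pd.DataFrame(data=[['Switzerland', 8.3, 'Bern'], ['Austria', 8.7, 'Vienna']], columns=['country', 'population', 'capital'])

# Use the 'country' column as row labels (returns a new dataframe)
toy_by_country = toy.set_index('country')

# Label-based lookup: row 'Austria', column 'capital'
capital_of_austria = toy_by_country.loc['Austria', 'capital']
print(capital_of_austria)
```

Whether a text index is preferable depends on the task. In the relational setup developed below, we keep the default integer index because it will serve as a compact identifier.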

To select all rows where the country is Switzerland:

In [7]:
tdb[tdb['country'] == 'Switzerland']

Unnamed: 0,country,population,capital
0,Switzerland,8.3,Bern


To add a new column:

In [8]:
tdb['area'] = [41, 83]
tdb

Unnamed: 0,country,population,capital,area
0,Switzerland,8.3,Bern,41
1,Austria,8.7,Vienna,83


The values in the <span style='font-family:Courier'>area</span> column are integers:

In [9]:
tdb['area'].dtype

dtype('int64')

To add a new row, we must first create a DataFrame containing the new row. Then we can [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)enate the two dataframes on the index axis:

In [10]:
new_row = ['Liechtenstein', 0.038 , 'Vaduz', 0.16]
tdb_new_row = pd.DataFrame(data=[new_row], columns=tdb.columns)
tdb = pd.concat(objs=[tdb, tdb_new_row], axis=0)
tdb

Unnamed: 0,country,population,capital,area
0,Switzerland,8.3,Bern,41.0
1,Austria,8.7,Vienna,83.0
0,Liechtenstein,0.038,Vaduz,0.16


Note that <span style='font-family:Courier'>area</span> is now a continuous variable:

In [11]:
tdb['area'].dtype

dtype('float64')

But also note that concatenation keeps the original index labels, so the label 0 now occurs twice. To reset the index and drop the old values:

In [12]:
tdb = tdb.reset_index(drop=True)
tdb

Unnamed: 0,country,population,capital,area
0,Switzerland,8.3,Bern,41.0
1,Austria,8.7,Vienna,83.0
2,Liechtenstein,0.038,Vaduz,0.16
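If the old row labels carry no meaning, the two steps can probably be merged into one. The following sketch uses the `ignore_index` parameter of `pd.concat()`; the `toy` names are ours and only rebuild the example data so that the block runs on its own:

```python
import pandas as pd

toy = pd.DataFrame(data=[['Switzerland', 8.3, 'Bern', 41], ['Austria', 8.7, 'Vienna', 83]], columns=['country', 'population', 'capital', 'area'])
toy_new_row = pd.DataFrame(data=[['Liechtenstein', 0.038, 'Vaduz', 0.16]], columns=toy.columns)

# ignore_index=True discards the old row labels and numbers the rows 0..n-1
toy_concat = pd.concat(objs=[toy, toy_new_row], axis=0, ignore_index=True)
print(toy_concat.index)
```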


To remove the <span style='font-family:Courier'>area</span> column we can `drop` it and put the result `inplace` of the original table:

In [13]:
tdb.drop(labels=['area'], axis=1, inplace=True)
tdb

Unnamed: 0,country,population,capital
0,Switzerland,8.3,Bern
1,Austria,8.7,Vienna
2,Liechtenstein,0.038,Vaduz


Alternatively, you could have used `del` which also works for whole dataframes:

In [14]:
# del tdb['area']

Rows cannot be removed with `del`. Use the `drop()` method instead (or select the rows you want to keep, as shown above):

In [15]:
tdb.drop(labels=[2], axis=0, inplace=True)
tdb

Unnamed: 0,country,population,capital
0,Switzerland,8.3,Bern
1,Austria,8.7,Vienna


#### Wide vs. long structure

A rule in data management states that tables should grow long, not wide. Consider this `bad_table` of two countries' population sizes in three years:

In [16]:
bad_table = pd.DataFrame(data=[['Switzerland', 4.7, 5.3, 6.2], ['Austria', 6.9, 7.1, 7.5]], columns=['country', 'pop1950', 'pop1960', 'pop1970'])
bad_table

Unnamed: 0,country,pop1950,pop1960,pop1970
0,Switzerland,4.7,5.3,6.2
1,Austria,6.9,7.1,7.5


Though appealing to the eye, this table is computationally bad as can be demonstrated by trying to compute the average population size over all countries and years. It is fairly easy to compute the mean for each year by selecting the corresponding columns...

In [17]:
bad_table[['pop1950', 'pop1960', 'pop1970']].mean(axis=0)

pop1950    5.80
pop1960    6.20
pop1970    6.85
dtype: float64

but computing the overall mean requires selecting columns and taking the mean of the year means (which equals the overall mean only because every year column holds the same number of values):

In [18]:
bad_table[['pop1950', 'pop1960', 'pop1970']].mean().mean()

6.283333333333334

A `good_table` is long not wide:

In [19]:
good_table = pd.DataFrame(data=[['Switzerland', 1950, 4.7], ['Switzerland', 1960, 5.3], ['Switzerland', 1970, 6.2], ['Austria', 1950, 6.9], ['Austria', 1960, 7.1], ['Austria', 1970, 7.5]], columns=['country', 'year', 'population'])
good_table

Unnamed: 0,country,year,population
0,Switzerland,1950,4.7
1,Switzerland,1960,5.3
2,Switzerland,1970,6.2
3,Austria,1950,6.9
4,Austria,1960,7.1
5,Austria,1970,7.5


Computing the overall mean is a simple operation on one column...

In [20]:
good_table['population'].mean()

6.283333333333334

and the year means can be obtained via aggregation (using the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method) without having to specify any year columns:

In [21]:
good_table.groupby('year').mean().reset_index()

Unnamed: 0,year,population
0,1950,5.8
1,1960,6.2
2,1970,6.85
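In practice, data often arrives in the wide format and has to be reshaped. The following is a sketch of how this could be done with the `melt()` method, assuming that the year is encoded in the column names as in `bad_table`. The names `wide` and `long` are ours:

```python
import pandas as pd

wide = pd.DataFrame(data=[['Switzerland', 4.7, 5.3, 6.2], ['Austria', 6.9, 7.1, 7.5]], columns=['country', 'pop1950', 'pop1960', 'pop1970'])

# Stack the three year columns into one row per country-year pair
long = wide.melt(id_vars='country', var_name='year', value_name='population')

# Turn labels like 'pop1950' into the integer 1950
long['year'] = long['year'].str.replace('pop', '', regex=False).astype(int)

long = long.sort_values(['country', 'year']).reset_index(drop=True)
print(long)
```

The result has the same three columns as `good_table`, so the single-column mean and the `groupby()` aggregation apply to it unchanged.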


#### Multiple tables

What does it mean that relational databases eliminate redundancies and ensure consistency? Consider a copy of the `good_table` with an additional column that contains a country's capital (copying makes sure that changes made to one table do not affect the other):

In [22]:
good_table2 = good_table.copy()
good_table2.loc[0:2, 'capital'] = 'Bern'
good_table2.loc[3:5, 'capital'] = 'Vienna'
good_table2

Unnamed: 0,country,year,population,capital
0,Switzerland,1950,4.7,Bern
1,Switzerland,1960,5.3,Bern
2,Switzerland,1970,6.2,Bern
3,Austria,1950,6.9,Vienna
4,Austria,1960,7.1,Vienna
5,Austria,1970,7.5,Vienna


Clearly, this table contains redundant information because country-capital pairs are always the same. Moreover, this data format potentially yields a consistency problem. If, for example, you want to refer to capital names not in English but in the respective national language, you must replace each occurrence of <span style='font-family:Courier'>Vienna</span> by <span style='font-family:Courier'>Wien</span>. If you miss a single occurrence, your table becomes inconsistent. You can evade both problems if you split `good_table2` into two tables: one containing the population sizes...

In [23]:
populations = good_table2[['country', 'year', 'population']].copy()
populations

Unnamed: 0,country,year,population
0,Switzerland,1950,4.7
1,Switzerland,1960,5.3
2,Switzerland,1970,6.2
3,Austria,1950,6.9
4,Austria,1960,7.1
5,Austria,1970,7.5


and one containing the capitals:

In [24]:
capitals = good_table2[['country', 'capital']].drop_duplicates().reset_index(drop=True)
capitals

Unnamed: 0,country,capital
0,Switzerland,Bern
1,Austria,Vienna


This way you have eliminated all redundancies and you have ensured consistency because you need to replace <span style='font-family:Courier'>Vienna</span> by <span style='font-family:Courier'>Wien</span> in one place only.
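Splitting the table loses nothing because the combined view can be rebuilt whenever it is needed. The following sketch illustrates this with a shortened version of the two tables. The names `pop`, `cap`, and `joined` are ours:

```python
import pandas as pd

pop = pd.DataFrame(data=[['Switzerland', 1950, 4.7], ['Switzerland', 1960, 5.3], ['Austria', 1950, 6.9], ['Austria', 1960, 7.1]], columns=['country', 'year', 'population'])
cap = pd.DataFrame(data=[['Switzerland', 'Bern'], ['Austria', 'Vienna']], columns=['country', 'capital'])

# One edit in the small table is enough ...
cap.loc[cap['capital'] == 'Vienna', 'capital'] = 'Wien'

# ... because the combined table is rebuilt on demand by joining on 'country'
joined = pd.merge(left=pop, right=cap, on='country', how='left')
print(joined)
```

Here, 'country' plays the role of a key that links the two tables.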

Next, we will take Pandas and relational database thinking to the next level.

### 2.1.2. Getting the TweetsCOV19 dataset ready for analysis

[Twitter](https://en.wikipedia.org/wiki/Twitter) is a microblogging service that is very influential among politicians and journalists. Though stagnating over the past years, the number of monthly active users was at 238 million in the second quarter of 2022 (Wikipedia 2022). Since January 2013, researchers at [L3S](https://www.l3s.de/) and [GESIS](https://www.gesis.org/) have been collecting a 1% random sample of all Twitter posts (tweets), detecting sentiments, and extracting named entities, user mentions, hashtags, as well as URLs, and made those publicly available as the [TweetsKB](https://data.gesis.org/tweetskb/) corpus (Fafalios *et al.* 2018). By August 2022, the corpus had grown to about 3 billion tweets. **In the following**, we will store the content of a small fraction of those tweets, one month of the [TweetsCOV19](https://data.gesis.org/tweetscov19/) corpus (Dimitrov *et al.* 2020), in multiple Pandas dataframes that make reference to the TweetsKB data structure.

|<img src='images/model.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 1**: Data structure used to build the TweetsKB corpus ([source](https://data.gesis.org/tweetskb/#Data-model))</em>|

The data structure depicted in ***figure 1*** used to build the TweetsKB corpus is relational and uses several standardized ontologies. **Relational** means that each piece of content belongs to a class, and classes have properties which can either describe a class attribute or link to another class. We will shortly see that classes are candidates for tables. Classes and properties are drawn from **ontologies** which are vocabularies for modeling data and, in our particular tweets case, online community data. These vocabularies are developed and maintained by the [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) research community which aims at making internet data machine-readable.

|<img src='images/TweetsKB_model_example.jpg' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 2**: Example how a tweet is encoded using the data structure</em>|

***Figure 2*** is an example of how a tweet is encoded using this abstract data structure. In other words, the figure depicts how the content of a tweet is modeled as machine-readable data. Starting with the central element, a **tweet** is modeled as belonging to the [Post](https://www.w3.org/Submission/sioc-spec/#term_Post) class which is defined as an "article or message that can be posted to a Forum" in the [SIOC](http://sioc-project.org/) ontology. A tweet has a [has_creator](https://www.w3.org/Submission/sioc-spec/#term_has_creator) property which links a tweet to a user. A **user** is modeled as belonging to the [User](https://www.w3.org/Submission/sioc-spec/#term_User) class (in figure 1, the class is called UserAccount) which is defined as a "User account in an online community site." The "tweet1" instance of Post as well as the "usr1" instance of User have [id](http://rdfs.org/sioc/spec/#term_id) properties which link to the actual [literal](https://www.w3.org/TR/rdf-schema/#ch_literal) values of the **tweet id** (<span style='font-family:Courier'>9565121266</span>) and (encrypted) **user name** (<span style='font-family:Courier'>2356912</span>) variables. Below, we will create separate Pandas tables for the Post and User classes. This is – just like in the above example of populations and capitals – how redundancy is eliminated.

Starting from such an understanding of separate tables for tweets and users, we can discuss to which one some of the other variables belong which come with the data. The **timestamp**, **number of retweets** (number of users that forward the tweet), and **number of favorites** (number of users that like the tweet) clearly are attributes of tweets. In the example of *figure 2*, "tweet1" is liked by $12$ users, a statistic that is modeled using the [InteractionCounter](https://schema.org/InteractionCounter) for the [LikeAction](https://schema.org/LikeAction) of the [Schema.org](https://schema.org/) vocabulary. The **number of followers** and **number of friends** (number of users a user follows) seem to be attributes of users at first glance. But since they are measured at the time of tweet creation, they are better also attributed to tweets. While these variables are delivered by the Twitter API, the following variables have been obtained by the corpus creators by processing the tweet content. The sentiment or emotional content of a tweet is modeled by using the [Onyx](https://www.gsi.upm.es/ontologies/onyx/) ontology which is "designed to annotate and describe the emotions expressed by user-generated content". The [SentiStrength](http://sentistrength.wlv.ac.uk/) algorithm results in **positive sentiment** (1 means low and 5 means high) and **negative sentiment** (-1 means low and -5 means high) variables. Though the sentiment expresses the mind state of a user, it is expressed in language and is, hence, a tweet attribute.

The dataset producers have also annotated tweets by extracting four different kinds of entities from tweet texts: named entities (universally recognized semantic concepts), user **mentions** (words starting with <span style='font-family:Courier'>@</span>), **hashtags** (words starting with <span style='font-family:Courier'>#</span>), and **URLs** (addresses of web pages). Since URLs are often too detailed, we will also extract the **top level domains** (TLDs) from URLs. To identify **named entities**, the [FEL](https://github.com/yahoo/FEL) algorithm matches parts of the tweet **text** to Wikipedia pages as universally identifiable resources and provides a **confidence** score to what extent the match is trustworthy (0 means low and -3 means high confidence). In the example of *figure 2*, the text snippet <span style='font-family:Courier'>Federer</span> has been matched to the Wikipedia resource [Roger_Federer](https://de.wikipedia.org/wiki/Roger_Federer) with an average confidence of $-1.54$.

|<img src='images/TweetsCOV19_ext_erd.png' style='float: none; width: 640px'>|
|:--|
|<em style='float: center'>**Figure 3**: Entity relationship diagram to organize Pandas tables</em>|

It is clear that named entities, mentions, hashtags, URLs, and TLDs cannot be tweet attributes as that would create an immense amount of redundancy. Hence, they each become tables. ***Figure 3*** shows the entity relationship diagram into which we will transform the tweets data. The diagram mirrors the TweetsKB data structure, and we will construct the tables shown in the figure by following the rules of [database normalization](https://en.wikipedia.org/wiki/Database_normalization). Entities are classes in the above sense and not to be confused with named entities. We will create **entity tables** for the seven entities discussed so far: `tweets`, `users`, `named_entities`, `mentions`, `hashtags`, `urls`, and `tlds`. Each table has a primary key (PK) which uniquely identifies the entity instances in a table. We will use the dataframe index as primary keys. Six of those tables have a column called 'tweets' which is the number of times the entity has been selected in a tweet.

In addition, we will create five **relationship tables** that put tweets into relationship to named entities, mentions, hashtags, URLs, and TLDs. Relationship tables are depicted using dashed lines in *figure 3*. They just contain entity identifiers (indices) that are now called foreign keys (FK). We will shortly see that relationship tables can be directly used in data analysis. One of the five tables is an exception: The `tweets_named_entities` table has two more columns – the text that was used to name the named entity and the confidence score – because these are true attributes of the relationship between tweets and named entities. Finally, users and tweets are linked in the tweets table via the 'user_idx' column because a tweet is created by one and only one user.
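To make the distinction between entity tables and relationship tables concrete, here is a toy sketch with hypothetical data (two tweets and their space-separated hashtags). It is not the procedure used on the real corpus, and all names starting with `toy_` as well as `pairs` and `hashtag_to_idx` are ours:

```python
import pandas as pd

# Hypothetical toy data: two tweets, identified by their index (primary key)
toy_tweets = pd.DataFrame(data={'hashtags': ['COVID19 StayHome', 'COVID19']})

# One row per (tweet, hashtag) pair; the old index becomes the 'tweet_idx' column
pairs = toy_tweets['hashtags'].str.split(' ').explode().reset_index()
pairs.columns = ['tweet_idx', 'hashtag']

# Entity table: one row per distinct hashtag with the number of tweets using it
toy_hashtags = pairs.groupby('hashtag').size().reset_index(name='tweets')
toy_hashtags = toy_hashtags.sort_values('tweets', ascending=False).reset_index(drop=True)

# Relationship table: replace the hashtag text by the entity table's index (foreign key)
hashtag_to_idx = {hashtag: idx for idx, hashtag in toy_hashtags['hashtag'].items()}
toy_tweets_hashtags = pd.DataFrame(data={'tweet_idx': pairs['tweet_idx'], 'hashtag_idx': pairs['hashtag'].map(hashtag_to_idx)})

print(toy_hashtags)
print(toy_tweets_hashtags)
```

The relationship table contains nothing but integers. Each hashtag's text is stored exactly once, in the entity table.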

#### Structuring TweetsCOV19

We will be working with the May 2020 dump of the TweetsCOV19 corpus. Download this [file](https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz) and put it into the `/a_introduction/data` folder. The [description](https://data.gesis.org/tweetscov19/#Dataset) of the dataset says that each row contains variables of a tweet instance, there are twelve variables (columns), and variables are separated by a tab character ('\t'). In other words, the data is delivered as a table. Furthermore, the description says that sentiment scores, named entity metadata, etc. are concatenated. In other words, the delivered table is wide in selected columns. Our job will be to transform this table into the multiple tables of *figure 3*. Before reading the full data, it is a good idea to look at the first rows to check if the file contains column names and if there are any peculiarities. Using UTF-8 encoding is recommended since it allows for coding many different characters:

In [25]:
head = pd.read_csv('data/TweetsCOV19_052020.tsv.gz', sep='\t', nrows=5, encoding='utf-8')
head

Unnamed: 0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0.1,null;,1 -1,null;.1,Opinion Next2blowafrica thoughts,null;.2
0,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
1,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
2,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
3,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
4,1255982235662024704,491a98bbc105806cb67f46f5e3f3d888,Thu Apr 30 22:07:54 +0000 2020,52,46,0,0,god forbid:God_Forbid:-1.2640735877261988;covi...,2 -4,Danartman BishopStika,null;,https://www.dailymail.co.uk/health/article-826...


In [26]:
head.shape

(5, 12)

Knowing that the file does not contain column names and that the separator indeed creates twelve columns, we can read the whole file:

In [27]:
tweets = pd.read_csv('data/TweetsCOV19_052020.tsv.gz', sep='\t', header=None, quoting=3, encoding='utf-8')

<div class='alert alert-block alert-danger'>
<b>Danger</b>

Setting the `quoting` parameter of the [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to the value `3` (the value of Python's `csv.QUOTE_NONE` constant) means that quotation marks are not treated as enclosing the content of cells but are read as ordinary cell content. In the TweetsCOV19 dataset, some hashtags actually contain quotation marks. Not setting the parameter to `3` would result in the file being read incorrectly.
</div>
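The effect of the `quoting` parameter can be tried out on a tiny example. The following sketch reads a hypothetical two-row, tab-separated sample (not taken from the dataset) in which one cell starts with a quotation mark. The names `sample` and `unquoted` are ours:

```python
import csv
import io
import pandas as pd

# Hypothetical sample: the second cell of the first row starts with a quotation mark
sample = 'id1\t"quoted\nid2\tplain\n'

# csv.QUOTE_NONE is the named constant behind the value 3
unquoted = pd.read_csv(io.StringIO(sample), sep='\t', header=None, quoting=csv.QUOTE_NONE)
print(unquoted)
```

With `csv.QUOTE_NONE`, the quotation mark stays part of the cell and both rows are read as they are. Writing the constant instead of `3` also makes the intention easier to see.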

From the [description](https://data.gesis.org/tweetscov19/#Dataset) we draw the labels for the column names:

In [28]:
tweets.columns = ['tweet_id', 'user', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'named_entities', 'sentiment', 'mentions', 'hashtags', 'urls']

In [29]:
tweets

Unnamed: 0,tweet_id,user,timestamp,followers,friends,retweets,favorites,named_entities,sentiment,mentions,hashtags,urls
0,1255980348229529601,fa5fd446e778da0acba3504aeab23da5,Thu Apr 30 22:00:24 +0000 2020,29697,24040,0,0,null;,1 -1,null;,Opinion Next2blowafrica thoughts,null;
1,1255981220640546816,547501e9cc84b8148ae1b8bde04157a4,Thu Apr 30 22:03:52 +0000 2020,799,1278,4,6,null;,1 -1,null;,null;,null;
2,1255981244560683008,840ac60dab55f6b212dc02dcbe5dfbd6,Thu Apr 30 22:03:58 +0000 2020,586,378,1,2,null;,2 -1,null;,null;,https://www.bbc.com/news/uk-england-beds-bucks...
3,1255981472285986816,37c68a001198b5efd4a21e2b68a0c9bc,Thu Apr 30 22:04:52 +0000 2020,237,168,0,0,null;,1 -1,null;,null;,https://lockdownsceptics.org/2020/04/30/latest...
4,1255981581354905600,8c3620bdfb9d2a1acfdf2412c9b34e06,Thu Apr 30 22:05:18 +0000 2020,423,427,0,0,i hate u:I_Hate_U:-1.8786140035817729;quaranti...,1 -4,null;,null;,null;
...,...,...,...,...,...,...,...,...,...,...,...,...
1922400,1267207472424660992,ae1b1e6bf2a30cd0e1047ddd0baf5ad0,Sun May 31 21:32:59 +0000 2020,15,45,0,0,spotify:Spotify:-0.9407337067771776;wifi:Wi-Fi...,2 -1,null;,null;,null;
1922401,1267207883487354881,0e4323d01d164b9eb6e33f35564c7e25,Sun May 31 21:34:37 +0000 2020,43,931,0,0,china:China:-2.113921624336916;death penalty:C...,1 -2,null;,null;,null;
1922402,1267209309559173122,00fc2c96e4012e27a6eee351723ab461,Sun May 31 21:40:17 +0000 2020,256,451,0,0,null;,2 -1,null;,null;,null;
1922403,1267212987938545667,0f99a3b8b0d490f062215575d074518b,Sun May 31 21:54:54 +0000 2020,1467,1505,0,0,omg:OMG_%28Usher_song%29:-2.580063760606172;,2 -1,lsddrq,null;,null;


There are 1.9 million tweets. Look at the 'named_entities' column to see how multiple annotations are stored in single cells.

#### Creating the `users` table

Besides the 'user' name, the `users` table should also contain the number of tweets the user has created as well as the maximum numbers of followers and friends. Aggregate the data using the `groupby()` method followed by the `size()` method to count the number of rows (number of tweets)...

In [30]:
users = tweets.groupby('user').size().reset_index(name='tweets')
users

Unnamed: 0,user,tweets
0,00000998260226834ffdbdf98ff33eb7,1
1,000016e54a4dc155432ebad949c2546e,2
2,00001c34da8eab17b175a9e049078b72,1
3,00001d45dd97d52b5accb3333e3790e3,1
4,00003291a067882da356e7f963d3dca8,1
...,...,...
1122828,ffffd2b829300cc638eb4c78c0fc1882,2
1122829,ffffd8b7f90bd8937218b42ee841dc20,1
1122830,ffffda7501c5f86d5ae850ca7a9fbd1f,1
1122831,ffffeca2c4676546be82c9bf9df9c322,1


and then with the `max()` method to get the maximum numbers of followers and friends:

In [31]:
users_ff = tweets.groupby('user')[['followers', 'friends']].max().reset_index()
users_ff.columns = ['user', 'followers_max', 'friends_max']
users_ff

Unnamed: 0,user,followers_max,friends_max
0,00000998260226834ffdbdf98ff33eb7,1852,1482
1,000016e54a4dc155432ebad949c2546e,6953,992
2,00001c34da8eab17b175a9e049078b72,341,350
3,00001d45dd97d52b5accb3333e3790e3,854,3012
4,00003291a067882da356e7f963d3dca8,104,805
...,...,...,...
1122828,ffffd2b829300cc638eb4c78c0fc1882,2987,2553
1122829,ffffd8b7f90bd8937218b42ee841dc20,366,817
1122830,ffffda7501c5f86d5ae850ca7a9fbd1f,1924,925
1122831,ffffeca2c4676546be82c9bf9df9c322,335,521


Since both dataframes are sorted by 'user' (`groupby()` sorts the group keys by default) and have the same length, we can simply add the two columns from `users_ff` to complete the `users` table:

In [32]:
users[['followers_max', 'friends_max']] = users_ff[['followers_max', 'friends_max']]
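The two aggregations could also be written as a single `groupby()` call using named aggregation, which avoids relying on the row order of two separate results. The following is a sketch on a hypothetical miniature of the tweets table, not the code used in this session. The `toy_` names are ours:

```python
import pandas as pd

# Hypothetical miniature of the tweets table
toy_tweets = pd.DataFrame(data={
    'user': ['b', 'a', 'b'],
    'followers': [10, 5, 12],
    'friends': [3, 7, 2],
})

# One pass: count the rows and take the column maxima per user
toy_users = toy_tweets.groupby('user').agg(
    tweets=('followers', 'size'),
    followers_max=('followers', 'max'),
    friends_max=('friends', 'max'),
).reset_index()
print(toy_users)
```

Each keyword names an output column and pairs an input column with an aggregation function. For `'size'`, the choice of input column does not matter because rows are counted.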

Sort the dataframe in descending order by the number of tweets, maximum number of followers, and maximum number of friends (in that order):

In [33]:
users = users.sort_values(['tweets', 'followers_max', 'friends_max'], ascending=False).reset_index(drop=True)

Finally, reorder the columns:

In [34]:
users = users[['user', 'tweets', 'followers_max', 'friends_max']]

The index will function as a unique user identifier:

In [35]:
users

Unnamed: 0,user,tweets,followers_max,friends_max
0,7513717dba8b208fe06799dcc54e59e2,1989,21985402,1116
1,2435a45b85628172c5a47122144a7c67,1533,48306390,1109
2,090264f1888056a96f32ccb7d91ba4e7,1380,3789573,266
3,4ff35e52034daec0251f7b3370969a1a,1341,4149109,0
4,bf4571b94429c5b18e0a219c197a56a4,1165,6146072,28
...,...,...,...,...
1122828,ffc76538fff4d743c01ecb345ceb8ca3,1,0,0
1122829,ffc9987b2f6a2271fb60ec4c0b5fb99e,1,0,0
1122830,ffd8897b8d99c5cfd7494b68359f3639,1,0,0
1122831,ffeb62654f94d36a4075ca408ecf0089,1,0,0


Note that 'user' names are encrypted for privacy reasons in the original dataset. There are 1.1 million distinct users, and the most active one has created 1,989 tweets. Indeed, it is not an error that some users have tens of millions of followers [and more](https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts). An interesting observation is that the most active users also have many followers.

The index values of this table are unique identifiers for the users in the dataset (the primary keys). The effect of sorting is that the most active users have small index values which aids computational purposes, as you will see.

To not waste memory, it is good practice to delete dataframes we do not need anymore:

In [36]:
del users_ff

#### Creating the `tweets` table

Sorting tweets by date and time is straightforward. For handling such data, Pandas provides the 'datetime' data type. It is perfectly suited for handling time series data as it allows for manipulating dates and times in many ways. For now, we will simply convert the 'timestamp' values from strings to datetimes with [`to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html). Set the `format` of the original string to spare Pandas from figuring it out itself (and save time), ...

In [37]:
tweets['timestamp'] = pd.to_datetime(tweets['timestamp'], format='%a %b %d %X %z %Y')

then sort:

In [38]:
tweets = tweets.sort_values(['timestamp']).reset_index(drop=True)

After sorting, the index is stable and acts as a unique tweet identifier.
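To check a format string before applying it to millions of rows, it helps to try it on a single value. The sketch below parses one timestamp in the dataset's format. As an assumption on our side, it spells out the time part as `%H:%M:%S` instead of `%X`, because in Python's format codes `%X` stands for the locale's time representation. The names `raw` and `parsed` are ours:

```python
import pandas as pd

# One timestamp string in the format used by the dataset
raw = pd.Series(data=['Thu Apr 30 22:00:24 +0000 2020'])

# Explicit hours, minutes, and seconds; '%z' reads the UTC offset
parsed = pd.to_datetime(raw, format='%a %b %d %H:%M:%S %z %Y')

# Datetime columns expose their components via the .dt accessor
print(parsed.dt.year[0], parsed.dt.hour[0], parsed.dt.day_name()[0])
```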

The table needs two changes. First, the 'user' name must be replaced by a user index called 'user_idx' from the `users` table. Since we will repeat this operation for other tables, we define an `add_index()` function. Following best Python practice, what it does is described in its docstring:

In [39]:
def add_index(source, target, entity):
    '''
    Inserts the index of a source dataframe into a target dataframe as a column.
    
    Parameters:
        source : Pandas DataFrame
            Dataframe whose index is to be inserted.
        target : Pandas DataFrame
            Dataframe into which the index is inserted.
        entity : String
            Name of the entity that is identified by the index. Will be given an '_idx' suffix and then inserted into the target dataframe.
    
    Returns:
        The target dataframe with the inserted column.
    '''
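    # Work on a copy so that the source dataframe is left unchanged
    lookup = source.copy()
    # Expose the index (the primary key) as a regular column
    lookup[entity + '_idx'] = lookup.index
    # Look up the key of each target row by matching on the entity column (inner join)
    df = pd.merge(left=target, right=lookup[[entity + '_idx', entity]], on=entity)
    # The entity column is now redundant
    del df[entity]
    return df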

In [40]:
tweets = add_index(source=users, target=tweets, entity='user')

In [41]:
tweets.columns

Index(['tweet_id', 'timestamp', 'followers', 'friends', 'retweets',
       'favorites', 'named_entities', 'sentiment', 'mentions', 'hashtags',
       'urls', 'user_idx'],
      dtype='object')
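
The idea of replacing a name by an index via a merge can be tried out on two tiny tables. This is a sketch with invented data (`users_demo`, `tweets_demo`), not the notebook's function; unlike `add_index()`, it uses a left join so that the row order of the target table is kept.

```python
import pandas as pd

# Hypothetical toy tables
users_demo = pd.DataFrame({'user': ['alice', 'bob', 'carol']})
tweets_demo = pd.DataFrame({'tweet_id': [10, 11, 12],
                            'user': ['bob', 'alice', 'bob']})

# Turn the index of the entity table into a regular column
lookup = users_demo.reset_index().rename(columns={'index': 'user_idx'})

# Attach the index to each tweet, then drop the now redundant name
merged = tweets_demo.merge(lookup, on='user', how='left').drop(columns='user')
print(merged)
```

With a left join, a tweet whose user is missing from the entity table would receive a missing value instead of being silently removed, which makes such gaps easier to spot.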

Second, the positive and negative sentiment scores are concatenated in a single 'sentiment' column, so that column must be split. The `pat` parameter states that the scores are separated by whitespace, and `expand=True` causes two columns to be created. These are then appended to the `tweets` table...

In [42]:
tweets_sentiment = tweets['sentiment'].str.split(pat=' ', expand=True)
tweets[['sentiment_pos', 'sentiment_neg']] = tweets_sentiment[[0, 1]]
del tweets_sentiment

which is then cleaned up...

In [43]:
del tweets['sentiment']

and reordered to match the column order in *figure 3*:

In [44]:
tweets = tweets[['tweet_id', 'user_idx', 'timestamp', 'followers', 'friends', 'retweets', 'favorites', 'named_entities', 'mentions', 'hashtags', 'urls', 'sentiment_pos', 'sentiment_neg']]

In [45]:
tweets.head()

Unnamed: 0,tweet_id,user_idx,timestamp,followers,friends,retweets,favorites,named_entities,mentions,hashtags,urls,sentiment_pos,sentiment_neg
0,1255980248728035329,17517,2020-04-30 22:00:00+00:00,120440,69187,78,90,null;,null;,VirusFreeVoting,https://time.com/5829264/wisconsin-primary-cor...,1,-2
1,1257477612663758849,17517,2020-05-05 01:10:00+00:00,120571,69195,10,13,democrat:Democratic_Party_%28United_States%29:...,ChristyforCA25,DemCastCA VoteByMay12,https://app.speechifai.tech/s/udM9GCKZQV-XDg6B...,2,-2
2,1258811014168088578,17517,2020-05-08 17:28:28+00:00,120679,69204,12,45,null;,null;,null;,null;,2,-1
3,1259829866821750784,17517,2020-05-11 12:57:01+00:00,120785,69220,3,2,null;,null;,COVID19,https://www.erinbromage.com/post/the-risks-kno...,1,-1
4,1260938742044545026,17517,2020-05-14 14:23:17+00:00,120841,69227,91,61,texas:Texas:-2.304388836099511;,null;,null;,https://amp.cnn.com/cnn/2020/05/13/politics/te...,1,-2
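
The splitting step can be reproduced on a few invented values. The sketch below additionally converts the two parts to integers, an optional step that is an assumption on our side: `str.split()` always returns strings, so numeric operations on the scores require a conversion at some point.

```python
import pandas as pd

# Hypothetical scores in the same 'positive negative' layout
demo = pd.DataFrame({'sentiment': ['1 -2', '2 -1', '3 -4']})

# Split on whitespace into two columns and convert the strings to integers
parts = demo['sentiment'].str.split(pat=' ', expand=True).astype(int)
demo[['sentiment_pos', 'sentiment_neg']] = parts[[0, 1]]

# Remove the original concatenated column
demo = demo.drop(columns='sentiment')
print(demo)
```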


#### Creating the `named_entities` and `tweets_named_entities` tables

In general, we proceed by, first, extracting relationship tables from the `tweets` table and, second, deriving the entity tables from the relationship tables. We start with the most complicated case of **named entities**. The 'named_entities' column of the `tweets` table contains ';'-separated 3-tuples each of which contains ':'-separated values for 'text', 'named_entity', and 'confidence'. In the process of normalization, the first step is to transform cell content into lists of 3-tuples. Again, we define a custom function that we can apply later on:

In [46]:
def to_list(cell, pat):
    '''
    Function to be applied to individual cells of a dataframe column. Transforms concatenated cell content into a list.
    
    Parameters:
        cell : String or float
            Cell content. Missing content arrives either as the string 'null;' or as a float (NaN).
        pat : String
            Pattern that separates the cell values.
    
    Returns:
        A list of the separated values. Missing content yields [''], a list holding one empty string.
    '''
    if cell == 'null;' or isinstance(cell, float):
        cell = ['']
    else:
        cell = cell.split(pat)
    return cell

In [47]:
tweets['named_entities'] = tweets['named_entities'].apply(to_list, pat=';')

An example cell with a list of multiple 3-tuples now looks like this:

In [48]:
tweets['named_entities'][5]

['conspiracy theories:Conspiracy_theory:-1.847672162378126',
 'amazon:Amazon_%28company%29:-2.435477582284357',
 'anti vaxxer:Vaccine_hesitancy:-2.7801236851138458',
 'nonfiction:Nonfiction:-2.9818246193808142',
 '']

Next we create the `tweets_named_entities` relationship table. The reason for having created a list of 3-tuples is that a wide table – the subtable with just the 'named_entities' column – can be easily and quickly transformed into a long table using the `explode()` method:

In [49]:
tweets_named_entities = tweets[['named_entities']].explode(column='named_entities')
del tweets['named_entities']
tweets_named_entities.head()

Unnamed: 0,named_entities
0,
1,democrat:Democratic_Party_%28United_States%29:...
1,
2,
3,


Remove rows without 3-tuples (concatenation artifacts) and split the 3-tuples into three columns:

In [50]:
tweets_named_entities = tweets_named_entities[tweets_named_entities['named_entities'] != '']
tweets_named_entities = tweets_named_entities['named_entities'].str.split(pat=':', expand=True)
tweets_named_entities.head()

Unnamed: 0,0,1,2
1,democrat,Democratic_Party_%28United_States%29,-1.8068639862560123
4,texas,Texas,-2.304388836099511
5,conspiracy theories,Conspiracy_theory,-1.847672162378126
5,amazon,Amazon_%28company%29,-2.435477582284357
5,anti vaxxer,Vaccine_hesitancy,-2.7801236851138458


At this point, the index consists of the unique tweet identifiers because we used `explode()` on a `tweets` subtable. If we now reset the index without dropping the old one, the tweet index values will be added as the first column:

In [51]:
tweets_named_entities = tweets_named_entities.reset_index(drop=False)
tweets_named_entities.columns = ['tweet_idx', 'text', 'named_entity', 'confidence']
tweets_named_entities.head()

Unnamed: 0,tweet_idx,text,named_entity,confidence
0,1,democrat,Democratic_Party_%28United_States%29,-1.8068639862560123
1,4,texas,Texas,-2.304388836099511
2,5,conspiracy theories,Conspiracy_theory,-1.847672162378126
3,5,amazon,Amazon_%28company%29,-2.435477582284357
4,5,anti vaxxer,Vaccine_hesitancy,-2.7801236851138458
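
The whole chain from concatenated cells to a long relationship table can be condensed into a few lines. The following sketch runs on three invented cells and uses the vectorized `str.split()` instead of the `to_list()` function, so missing values are filtered slightly differently than above.

```python
import pandas as pd

# Hypothetical cells in the same ';'- and ':'-separated layout
demo = pd.DataFrame({'named_entities': ['null;',
                                        'a:A:-1.5;b:B:-2.0;',
                                        'c:C:-0.5;']})

# Wide to long: one list per cell, then one row per list element
demo['named_entities'] = demo['named_entities'].str.split(';')
long = demo[['named_entities']].explode('named_entities')

# Drop the artifacts left behind by the trailing ';' and by 'null;'
long = long[~long['named_entities'].isin(['', 'null'])]

# Split each 3-tuple into columns and turn the old index into a column
long = long['named_entities'].str.split(':', expand=True).reset_index()
long.columns = ['tweet_idx', 'text', 'named_entity', 'confidence']
print(long)
```

Because `explode()` repeats the index label of the original row for every list element, the old index identifies the tweet each 3-tuple came from.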


Part of what we call "getting a feeling for your data" is to actually look at it. This is the right stage in the data processing pipeline to check how well the named-entity-recognition algorithm worked because the `tweets_named_entities` table still contains the 'text' from which a named entity was recognized as well as the name of the 'named_entity'. The head of the table already shows that the text snippet "democrat" is matched to the [Democratic_Party_(United_States)](https://en.wikipedia.org/wiki/Democratic_Party_(United_States)) named entity. This bias is unfortunate because it probably assigns a lot of discourse about democracy itself to the US political party.

SELECT THRESHOLD

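By default, Pandas truncates the display of long dataframes (the limit is set by the `display.max_rows` option, which is 60 unless changed). To look at more data, you can specify which rows to look at (*e.g.*, the first 10 rows; note that label-based slicing with `.loc` includes the end label):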

In [52]:
tweets_named_entities.loc[0:9, :]

Unnamed: 0,tweet_idx,text,named_entity,confidence
0,1,democrat,Democratic_Party_%28United_States%29,-1.8068639862560123
1,4,texas,Texas,-2.304388836099511
2,5,conspiracy theories,Conspiracy_theory,-1.847672162378126
3,5,amazon,Amazon_%28company%29,-2.435477582284357
4,5,anti vaxxer,Vaccine_hesitancy,-2.7801236851138458
5,5,nonfiction,Nonfiction,-2.9818246193808142
6,6,massachusetts senate,Massachusetts_Senate,-0.8234743294594018
7,6,john velis,John_Velis,-1.7046133376887849
8,6,dems,Defensively_equipped_merchant_ship,-2.028841864813329
9,6,democratic,Democratic_Party_%28United_States%29,-2.03008145419841


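Row 8 reveals an outright error: the text snippet "dems", a common abbreviation for the Democrats, is matched to the named entity Defensively_equipped_merchant_ship. To check the context, look at the tweet itself. The 'tweet_idx' of that row is 6, so the tweet is stored in row 6 of the `tweets` table, and its 'tweet_id' can be inserted into a URL of the following form to view the tweet in a browser:

https://twitter.com/anyuser/status/1262933787131904001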

In [53]:
tweets.loc[6, :]

tweet_id               1262933787131904001
user_idx                             17517
timestamp        2020-05-20 02:30:53+00:00
followers                           121095
friends                              69302
retweets                               740
favorites                             2324
mentions                             null;
hashtags                         DemCastMA
urls                                 null;
sentiment_pos                            2
sentiment_neg                           -1
Name: 6, dtype: object
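
The URL of a tweet can be assembled from its 'tweet_id' alone. This is a minimal sketch that assumes the URL pattern shown above still resolves; the 'anyuser' part is the placeholder taken from that URL.

```python
# 'tweet_id' of row 6, copied from the output above
tweet_id = 1262933787131904001

# Insert the identifier into the URL pattern
url = f'https://twitter.com/anyuser/status/{tweet_id}'
print(url)
```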

The `named_entities` entity table is created from the `tweets_named_entities` relationship table, again using a dedicated function:

In [54]:
def create_entity_table(relationship_table, entity):
    '''
    Creates an entity table from a relationship table via aggregation.
    
    Parameters:
        relationship_table : Pandas DataFrame
            Dataframe that contains tweet entity relationships.
        entity : String
            Name of the entity column in the relationship table that contains the entities.
    
    Returns:
        An entity table sorted descendingly by an additional 'tweets' column giving the number of tweets that selected an entity.
    '''
    df = relationship_table.groupby(entity).size().reset_index(name='tweets')
    df = df.sort_values(['tweets', entity], ascending=[False, True]).reset_index(drop=True)
    return df

In [55]:
named_entities = create_entity_table(relationship_table=tweets_named_entities, entity='named_entity')
named_entities.head()

Unnamed: 0,named_entity,tweets
0,Coronavirus_disease_2019,141238
1,Quarantine,71016
2,China,58896
3,Social_distancing,39550
4,Twitter,37516
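
What the aggregation does is easiest to see on a small invented relationship table. The sketch repeats the two steps of the function (count, then sort) on six made-up rows.

```python
import pandas as pd

# Hypothetical relationship table: one row per tweet-entity pair
rel = pd.DataFrame({'tweet_idx': [0, 0, 1, 2, 2, 3],
                    'named_entity': ['China', 'Quarantine', 'China',
                                     'China', 'Twitter', 'Quarantine']})

# Count the rows per entity
counts = rel.groupby('named_entity').size().reset_index(name='tweets')

# Most frequent first; ties are broken alphabetically
counts = counts.sort_values(['tweets', 'named_entity'],
                            ascending=[False, True]).reset_index(drop=True)
print(counts)
```

Note that `size()` counts rows. If the same entity occurs twice in one tweet, it is counted twice; counting distinct tweets would instead require something like `rel.groupby('named_entity')['tweet_idx'].nunique()`.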


Given this entity table, add its index to the corresponding `tweets_named_entities` relationship table, reorder the columns to obtain the desired design, and change the data type of the 'confidence' scores from string to a rounded float:

In [56]:
tweets_named_entities = add_index(source=named_entities, target=tweets_named_entities, entity='named_entity')
tweets_named_entities = tweets_named_entities[['tweet_idx', 'named_entity_idx', 'text', 'confidence']] # Reorder columns
tweets_named_entities['confidence'] = tweets_named_entities['confidence'].astype(float).round(4) # Change data type and round score
tweets_named_entities.head()

Unnamed: 0,tweet_idx,named_entity_idx,text,confidence
0,1,21,democrat,-1.8069
1,6,21,democratic,-2.0301
2,1208,21,democratic,-2.0301
3,1336,21,democratic,-2.0301
4,1453,21,democratic,-2.0301


#### Creating all other tables

The data of the other entities (mentions, hashtags, and URLs) is easier to normalize because the relationship tables only contain foreign keys. Without having to process metadata from named entity recognition, we can create the relationship tables via a general function:

In [57]:
def create_relationship_table(entity, to_lower_case, source=tweets):
    '''
    Creates a relationship table for a given entity from the `tweets` table.
    
    Parameters:
        entity : String
            Singular name of the entity (e.g., 'mention'). The source table must contain a column with the plural name (entity + 's') that holds lists of entity names.
        to_lower_case : Boolean
            Whether entity names should be reduced to lower case.
        source : Pandas DataFrame
            Set to the `tweets` table by default. The default is bound when the function is defined, so it follows in-place changes to `tweets` but not a later reassignment of the name.
    
    Returns:
        A relationship table linking tweet indices to entity names.
    '''
    df = source[[entity + 's']].explode(column=entity + 's')
    df = df[df[entity + 's'] != '']
    df = df.reset_index()
    df.columns = ['tweet_idx', entity]
    if to_lower_case:
        df[entity] = df[entity].str.lower()
    return df

The processing pipeline is the same for all three entities:

1. Transform the entity column in the `tweets` table to a list
2. Create the relationship table from the entity column in the `tweets` table
3. Delete the entity column in the `tweets` table
4. Create the entity table from the relationship table
5. Add the entity index to the relationship table

In the case of **mentions**, in step 2, we must transform all capital (upper case) characters to lower case. This is because user names on Twitter are not case-sensitive. In other words, when a user named "realDonaldTrump" already exists, no new user will be allowed with the name "realdonaldtrump". Since user mentions are extracted as words starting with @ but tweet creators often use upper and lower cases as they wish, not transforming upper to lower case would result in the same mentioned user getting more than a single unique identifier.

In [58]:
tweets['mentions'] = tweets['mentions'].apply(to_list, pat=' ') # Step 1
tweets_mentions = create_relationship_table(entity='mention', to_lower_case=True) # Step 2
del tweets['mentions'] # Step 3
mentions = create_entity_table(relationship_table=tweets_mentions, entity='mention') # Step 4
tweets_mentions = add_index(source=mentions, target=tweets_mentions, entity='mention') # Step 5

In [59]:
mentions.head()

Unnamed: 0,mention,tweets
0,realdonaldtrump,38571
1,pmoindia,6565
2,narendramodi,6450
3,jaketapper,5971
4,youtube,5731


Donald Trump is the most mentioned user by far, followed by the Prime Minister of India, Narendra Modi, with both his official and his personal account.
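
The effect of the lower-case transformation on the number of unique identifiers can be checked directly. The spellings below are invented variants of two user names.

```python
import pandas as pd

# Hypothetical mentions with inconsistent capitalization
raw = pd.Series(['realDonaldTrump', 'realdonaldtrump', 'PMOIndia',
                 'pmoindia', 'RealDonaldTrump'])

# Without normalization, every spelling counts as a separate user
n_raw = raw.nunique()

# After lower-casing, the spellings collapse into two users
n_lower = raw.str.lower().nunique()

print(n_raw, n_lower)
```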

In the case of **hashtags**, do the same transformation from upper to lower case to prevent synonymous hashtags getting different indices:

In [60]:
tweets['hashtags'] = tweets['hashtags'].apply(to_list, pat=' ') # Step 1
tweets_hashtags = create_relationship_table(entity='hashtag', to_lower_case=True) # Step 2
del tweets['hashtags'] # Step 3
hashtags = create_entity_table(relationship_table=tweets_hashtags, entity='hashtag') # Step 4
tweets_hashtags = add_index(source=hashtags, target=tweets_hashtags, entity='hashtag') # Step 5

In [61]:
hashtags.head()

Unnamed: 0,hashtag,tweets
0,covid19,87900
1,coronavirus,40597
2,covid_19,12676
3,lockdown,12164
4,stayhome,11594


**URLs** are case-sensitive. Hence, set `to_lower_case=False` in step 2. To create the tables related to **TLDs**, postpone step 5:

In [62]:
tweets['urls'] = tweets['urls'].apply(to_list, pat=':-:') # Step 1
tweets_urls = create_relationship_table(entity='url', to_lower_case=False) # Step 2
del tweets['urls'] # Step 3
urls = create_entity_table(relationship_table=tweets_urls, entity='url') # Step 4

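At this point, `tweets_urls` has all the information needed to create the TLD-related tables. Create `tweets_tlds` as a copy of `tweets_urls` and extract the TLDs (pseudo step 2). Note that what is extracted here is the full host name of each URL (*e.g.*, 'www.youtube.com'), which is broader than a top-level domain in the strict sense ('com'); we keep the label TLD to match the table names: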

In [63]:
tweets_tlds = tweets_urls.copy()
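# Strip the scheme ('http://' or 'https://'), then keep everything before the first '/'
tweets_tlds['tld'] = tweets_tlds['url'].str.replace(r'^https?://', '', regex=True).str.split(pat='/').str[0]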

Now, finish step 5 for URLs...

In [64]:
tweets_urls = add_index(source=urls, target=tweets_urls, entity='url') # Step 5

and steps 4 and 5 for TLDs (step 3 is not necessary since no such column ever existed):

In [65]:
tlds = create_entity_table(relationship_table=tweets_tlds, entity='tld') # Step 4
tweets_tlds = add_index(source=tlds, target=tweets_tlds, entity='tld') # Step 5

In [66]:
urls.head()

Unnamed: 0,url,tweets
0,https://www.twittascope.com/?sign=5,556
1,https://api.whatsapp.com/send?phone=9190393567...,371
2,http://rebrand.ly/work-2020,286
3,https://www.twittascope.com/?sign=6,271
4,https://redcross.give.asia/campaign/essentials...,260


In [67]:
tlds.head()

Unnamed: 0,tld,tweets
0,twitter.com,35578
1,www.youtube.com,26419
2,www.instagram.com,12247
3,www.theguardian.com,8915
4,www.nytimes.com,7081


As expected, aggregating the URLs reveals which sites are most popular, something that the detail of the individual URLs hides.
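
An alternative to string operations for extracting the host part of a URL is `urlparse()` from Python's standard library, which handles 'http://' and 'https://' alike. The sketch below applies it to three invented URLs; it is a possible variant, not what the notebook does.

```python
from urllib.parse import urlparse

# Hypothetical URLs with different schemes
urls_demo = ['https://www.youtube.com/watch?v=abc',
             'http://rebrand.ly/work-2020',
             'https://twitter.com/anyuser/status/1']

# The 'netloc' attribute holds the host part of the URL
hosts = [urlparse(u).netloc for u in urls_demo]
print(hosts)

# Cutting a fixed number of characters only suits one scheme:
# 8 characters match 'https://' but 'http://' has only 7
naive = [u[8:].split('/')[0] for u in urls_demo]
print(naive)
```

In a dataframe, the same function could be applied to a column with `.apply(lambda u: urlparse(u).netloc)`.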

In [None]:
#set(mentions['mention'].apply(list).sum())
#
#{'!',
# '"',
# '#',
# '$',
# '%',
# '&',
# "'",
# '(',
# ')',
# '*',
# '+',
# ',',
# '-',
# '.',
# '/',
# '0',
# '1',
# '2',
# '3',
# '4',
# '5',
# '6',
# '7',
# '8',
# '9',
# ':',
# ';',
# '=',
# '?',
# '[',
# '\\',
# ']',
# '^',
# '_',
# '`',
# 'a',
# 'b',
# 'c',
# 'd',
# 'e',
# 'f',
# 'g',
# 'h',
# 'i',
# 'j',
# 'k',
# 'l',
# 'm',
# 'n',
# 'o',
# 'p',
# 'q',
# 'r',
# 's',
# 't',
# 'u',
# 'v',
# 'w',
# 'x',
# 'y',
# 'z',
# '|',
# '}',
# '~',
# '£',
# '´',
# '·',
# '×',
# 'à',
# 'é',
# 'ó',
# 'ś',
# 'ʼ',
# 'α',
# 'σ',
# 'у',
# 'є',
# 'َ',
# 'अ',
# 'ग',
# 'न',
# 'ब',
# 'भ',
# 'व',
# 'ा',
# 'த',
# 'ன',
# 'ப',
# 'ம',
# 'ர',
# 'ற',
# 'ல',
# 'ழ',
# 'வ',
# 'ி',
# 'ு',
# 'ோ',
# '்',
# '\u200b',
# '\u200d',
# '–',
# '—',
# '‘',
# '’',
# '“',
# '”',
# '…',
# '\u202f',
# '‼',
# '⁉',
# '\u2060',
# '\u2063',
# '\u2069',
# '₹',
# '▪',
# '▶',
# '☀',
# '☕',
# '☭',
# '☹',
# '♀',
# '♂',
# '♡',
# '♥',
# '♨',
# '⚔',
# '⚡',
# '⛪',
# '✅',
# '✈',
# '✊',
# '✌',
# '✍',
# '✔',
# '✖',
# '✧',
# '✨',
# '❓',
# '❗',
# '❣',
# '❤',
# '➡',
# '⠀',
# '⤵',
# '⬆',
# '⬇',
# '⭐',
# '》',
# 'か',
# 'よ',
# 'ら',
# 'り',
# '傅',
# '大',
# '妈',
# '威',
# '扬',
# '柴',
# '神',
# '虎',
# '谷',
# '차',
# '️',
# '！',
# '，',
# '｜',
# '･',
# 'ﾟ',
# '𝐀',
# '𝐂',
# '𝐍',
# '𝑑',
# '𝑒',
# '𝑖',
# '𝑙',
# '𝑜',
# '𝑡',
# '𝓪',
# '𝓬',
# '𝓮',
# '𝓯',
# '𝓱',
# '𝓲',
# '𝓴',
# '𝓵',
# '𝓶',
# '𝓸',
# '𝗘',
# '𝗢',
# '𝗧',
# '𝗩',
# '𝘩',
# '🆚',
# '🇦',
# '🇧',
# '🇨',
# '🇩',
# '🇪',
# '🇬',
# '🇭',
# '🇮',
# '🇱',
# '🇳',
# '🇴',
# '🇸',
# '🇹',
# '🇺',
# '🇿',
# '🌄',
# '🌅',
# '🌈',
# '🌍',
# '🌎',
# '🌏',
# '🌟',
# '🌪',
# '🌷',
# '🌹',
# '🌺',
# '🌻',
# '🍀',
# '🍆',
# '🍊',
# '🍍',
# '🍒',
# '🍪',
# '🍳',
# '🍴',
# '🍷',
# '🍺',
# '🎂',
# '🎉',
# '🎊',
# '🎓',
# '🎙',
# '🎟',
# '🎡',
# '🎤',
# '🎥',
# '🎧',
# '🎨',
# '🎶',
# '🎼',
# '🏁',
# '🏆',
# '🏠',
# '🏡',
# '🏥',
# '🏪',
# '🏻',
# '🏼',
# '🏽',
# '🏾',
# '🏿',
# '🐍',
# '🐎',
# '🐐',
# '🐒',
# '🐓',
# '🐘',
# '🐲',
# '🐵',
# '🐶',
# '👀',
# '👆',
# '👇',
# '👈',
# '👉',
# '👊',
# '👌',
# '👍',
# '👎',
# '👏',
# '👑',
# '👨',
# '👩',
# '👺',
# '👽',
# '💃',
# '💋',
# '💌',
# '💍',
# '💔',
# '💕',
# '💖',
# '💗',
# '💘',
# '💙',
# '💚',
# '💛',
# '💜',
# '💞',
# '💥',
# '💦',
# '💪',
# '💫',
# '💯',
# '💻',
# '📖',
# '📚',
# '📢',
# '📣',
# '📰',
# '📱',
# '📷',
# '📺',
# '📻',
# '📽',
# '🔊',
# '🔥',
# '🔨',
# '🔬',
# '🕴',
# '🕺',
# '🖕',
# '🖤',
# '🗞',
# '🗽',
# '😀',
# '😁',
# '😂',
# '😅',
# '😆',
# '😇',
# '😈',
# '😉',
# '😊',
# '😍',
# '😎',
# '😔',
# '😘',
# '😛',
# '😜',
# '😠',
# '😡',
# '😢',
# '😤',
# '😥',
# '😭',
# '😱',
# '😲',
# '😷',
# '😻',
# '🙂',
# '🙄',
# '🙆',
# '🙇',
# '🙈',
# '🙌',
# '🙏',
# '🚀',
# '🚂',
# '🚉',
# '🚨',
# '🚩',
# '🚫',
# '🛫',
# '🤎',
# '🤐',
# '🤔',
# '🤖',
# '🤗',
# '🤘',
# '🤙',
# '🤠',
# '🤡',
# '🤣',
# '🤥',
# '🤦',
# '🤪',
# '🤫',
# '🤬',
# '🤭',
# '🤮',
# '🤴',
# '🥞',
# '🥰',
# '🥳',
# '🥴',
# '🥺',
# '🦁',
# '🦄',
# '🦅',
# '🦉',
# '🦋',
# '🦠',
# '🧑',
# '🧠',
# '🧡',
# '🩸',
# '🪦',
# '🪱'}

## 2.2. Exploring the data visually



### 2.2.2. Working with a single dataframe

In [None]:
# read/save
# describe()
# changing index and column names
# grouping
# using and resetting the index
# categorize series: categories and codes
# matrix to edgelist and vice versa
# zip
# columns into dict
# datetime
# ...

### 2.2.3. Working with multiple dataframes

In [None]:
# merge split concat etc
# ...

## 2.3. NumPy

- https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html
- https://www.pythonlikeyoumeanit.com/module_3.html

In [None]:
# read/save
# relationship to pandas
# ...

# 2. Numpy  [<a href='#destination2'>3</a>] <a id='destination2_'></a>

<img src='images/numpy.png' style='height: 120px; float: right; margin-left: 40px' >


NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. Numpy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The Numpy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The Numpy library contains multidimensional array and matrix data structures (you’ll find out more about this later in this notebook). It provides **ndarray**, a homogeneous n-dimensional array object, with methods to efficiently operate on it. Numpy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.

## 2.2 Numpy arrays

An array is the central data structure of the NumPy library. It is a grid of values that carries information about the raw data, how to locate an element, and how to interpret an element. Arrays look very much like Python lists, but they are faster and more compact.

One way we can initialize Numpy arrays is from python lists, using nested lists for two- or higher-dimensional data.

For example:

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

We can access the elements in the array using square brackets, just like with lists. Indexing in NumPy also starts at 0.

In [None]:
print(a[0])

Besides creating an array from a sequence of elements, you can easily create an array filled with 0’s or 1's:

In [None]:
np.zeros(2)

In [None]:
np.ones(2)

You can also create an array with a range of elements:

In [None]:
np.arange(4)
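
The creation functions accept a few more arguments that are often useful. The sketch below shows a shape tuple, an explicit data type, a start/stop/step range, and evenly spaced values with `np.linspace()`; the variable names are chosen for illustration.

```python
import numpy as np

# A shape tuple creates a two-dimensional array of zeros (2 rows, 3 columns)
zeros_2d = np.zeros((2, 3))

# The default data type is float; dtype=int yields integers instead
ones_int = np.ones(4, dtype=int)

# arange(start, stop, step) excludes the stop value
evens = np.arange(2, 10, 2)

# linspace(start, stop, num) includes the stop value
grid = np.linspace(0, 1, 5)

print(zeros_2d.shape, ones_int, evens, grid)
```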

### 2.2.1 Concatenating and sorting elements


**Sorting** the elements of an array is simple with `np.sort()`. You can specify the axis, kind, and order when you call the function. Take the array `arr` for example:

In [None]:
arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

You can quickly sort the numbers in ascending order with:

In [None]:
np.sort(arr)

**Concatenating** two arrays `a` and `b` could be done like this:

In [None]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

np.concatenate((a, b))

Or, if you start with these arrays:

In [None]:
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

You can concatenate them with:

In [None]:
np.concatenate((x, y), axis=0)
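
The `axis` argument works the same way for both operations. As a small sketch with invented arrays: concatenating along `axis=1` appends columns instead of rows (the shapes must match along the other axis, hence the transpose), and sorting a 2D array along an axis sorts each column or each row separately.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

# y.T has shape (2, 1), so it can be appended to x as a third column
side_by_side = np.concatenate((x, y.T), axis=1)

m = np.array([[3, 1], [2, 4]])

# axis=0 sorts within each column, axis=1 within each row
sorted_columns = np.sort(m, axis=0)
sorted_rows = np.sort(m, axis=1)

print(side_by_side)
print(sorted_columns)
print(sorted_rows)
```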

### 2.2.2 The shape and size of an array


`ndarray.ndim` will tell you the number of axes, or dimensions, of the array.

`ndarray.size` will tell you the total number of elements of the array. This is the product of the elements of the array’s shape.

`ndarray.shape` will display a tuple of integers that indicate the number of elements stored along each dimension of the array. If, for example, you have a 2-D array with 2 rows and 3 columns, the shape of your array is (2, 3).

For example, if you create this array:

In [None]:
array_example = np.array([[[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0 ,1 ,2, 3],
                           [4, 5, 6, 7]]])

You will have:

In [None]:
print(f'array_example.ndim: {array_example.ndim}')
print(f'array_example.size: {array_example.size}')
print(f'array_example.shape: {array_example.shape}')

#### Reshaping arrays:

Using `arr.reshape()` will give a new shape to an array without changing the data. Just remember that when you use the reshape method, the array you want to produce needs to have the same number of elements as the original array.

For example:

In [None]:
a = np.arange(6)
print(f'a:\n {a}')

reshaped_a = a.reshape(3, 2)
print (f'\nreshaped_a:\n {reshaped_a}')
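
Two details of reshaping are worth trying out: one dimension can be given as `-1`, in which case NumPy infers it from the number of elements, and a shape that does not match the number of elements raises a `ValueError`. A minimal sketch:

```python
import numpy as np

a = np.arange(6)

# -1 lets NumPy infer the second dimension: 6 elements / 2 rows = 3 columns
two_rows = a.reshape(2, -1)
print(two_rows.shape)

# 4 x 2 would need 8 elements, so this request fails
try:
    a.reshape(4, 2)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```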

### 2.2.3 Indexing and slicing

You can index and slice Numpy arrays in the same ways you can slice Python lists:

In [None]:
data = np.array([1, 2, 3])

In [None]:
data[1]

In [None]:
data[0:2]

In [None]:
data[1:]

In [None]:
data[-2:]

If you want to select values from your array that fulfill certain conditions, it’s straightforward with Numpy.

For example, if you start with this array:

In [None]:
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

You can easily print all of the values in the array that are less than 5:

In [None]:
print(a[a < 5])

You can also select, for example, numbers that are equal to or greater than 5, and use that condition to index an array:

In [None]:
five_up = (a >= 5)
print(a[five_up])

Or you can select elements that satisfy two conditions using the & and | operators:

In [None]:
c = a[(a > 2) & (a < 11)]
print (c)
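
The `|` operator (element-wise "or") and the `~` operator (element-wise "not") work in the same way as `&`. In the sketch below, note that each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Values that are smaller than 3 OR larger than 10
extremes = a[(a < 3) | (a > 10)]

# Values that are NOT even
odd = a[~(a % 2 == 0)]

print(extremes)
print(odd)
```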

### 2.2.4 Basic array operations

Once you’ve created your arrays, you can start to work with them. Let’s say, for example, that you’ve created two arrays, one called “data” and one called “ones”.

You can add the arrays together with the plus sign:

In [None]:
data = np.array([1, 2])
ones = np.ones(2, dtype=int)

data + ones

You can, of course, do more than just addition:

In [None]:
data - ones
# data * data
# data / data

Basic operations are simple with Numpy. If you want to find the sum of the elements in an array, you’d use sum(). This works for 1D arrays, 2D arrays, and arrays in higher dimensions:

In [None]:
a = np.array([1, 2, 3, 4])

a.sum()

To add the rows or the columns in a 2D array, you would specify the axis.

If you start with this array:

In [None]:
b = np.array([[1, 1], [2, 2]])

You can sum over the rows, which collapses axis 0 and yields one sum per column, with:

In [None]:
b.sum(axis=0)

You can sum over the columns, which collapses axis 1 and yields one sum per row, with:

In [None]:
b.sum(axis=1)
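
Which axis to use is easier to remember with a non-square array, where the shape of the result shows which axis has been collapsed. The array `c` below is invented for illustration.

```python
import numpy as np

# 2 rows, 3 columns: [[0, 1, 2], [3, 4, 5]]
c = np.arange(6).reshape(2, 3)

# axis=0 collapses the rows: one sum per column, result has shape (3,)
column_sums = c.sum(axis=0)

# axis=1 collapses the columns: one sum per row, result has shape (2,)
row_sums = c.sum(axis=1)

print(column_sums, column_sums.shape)
print(row_sums, row_sums.shape)
```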

### 2.2.5 Matrices

You can pass python lists of lists to create a 2-D array (or *matrix*) to represent them in Numpy.

Indexing and slicing operations are useful when you’re manipulating matrices:

In [None]:
data = np.array([[1, 2], [3, 4], [5, 6]])

data[0, 1]

In [None]:
data[1:3]

In [None]:
data[0:2, 0]

You can aggregate matrices the same way you aggregate vectors:

In [None]:
data.max()
# data.min()

In [None]:
data.sum()

You can aggregate all the values in a matrix and you can aggregate them across columns or rows using the axis parameter. To illustrate this point, let’s look at a slightly modified dataset:

In [None]:
data = np.array([[1, 2], [5, 3], [4, 6]])

data.max(axis=0)

In [None]:
data.max(axis=1)

### 2.2.6 Getting unique items and counts

You can find the unique elements in an array easily with np.unique.

For example, if you start with this array:

In [None]:
a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])

You can use `np.unique` to print the unique values in your array:

In [None]:
unique_values = np.unique(a)
print(unique_values)

To get the indices of unique values in a Numpy array (an array of first index positions of unique values in the array), just pass the `return_index` argument in `np.unique()` as well as your array:

In [None]:
unique_values, indices_list = np.unique(a, return_index=True)
print(indices_list)
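
To also obtain how often each unique value occurs, `np.unique()` accepts a `return_counts` argument. A minimal sketch on the same values as above:

```python
import numpy as np

a = np.array([11, 11, 12, 13, 14, 15, 16, 17, 12, 13, 11, 14, 18, 19, 20])

# Unique values (sorted) and the number of occurrences of each
values, counts = np.unique(a, return_counts=True)

# Pair them up in a dictionary for easy lookup
frequency = dict(zip(values.tolist(), counts.tolist()))
print(frequency)
```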

This also works with 2D arrays. [Here](https://numpy.org/doc/stable/user/absolute_beginners.html#how-to-get-unique-items-and-counts) is how.

### 2.2.7 Reversing an array

Numpy’s `np.flip()` function allows you to flip, or reverse, the contents of an array along an axis. When using `np.flip()`, specify the array you would like to reverse and the axis. If you don’t specify the axis, Numpy will reverse the contents along all of the axes of your input array.

#### Reversing a 1D array

If you begin with a 1D array like `arr`, you can reverse it with `flip` like this:

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

reversed_arr = np.flip(arr)

reversed_arr

#### Reversing a 2D array

A 2D array works much the same way. If you start with a 2D array like `arr_2d`, you can reverse the content in all of its rows and all of its columns like this:

In [None]:
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

reversed_arr = np.flip(arr_2d)

reversed_arr

You can easily reverse only the rows or columns with:

In [None]:
reversed_arr_rows = np.flip(arr_2d, axis=0)

reversed_arr_columns = np.flip(arr_2d, axis=1)
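
Flipping along one axis gives the same result as slicing with a step of -1 along that axis. The sketch below checks this for the array from above.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# axis=0 reverses the order of the rows
flipped_rows = np.flip(arr_2d, axis=0)

# axis=1 reverses the order of the columns
flipped_columns = np.flip(arr_2d, axis=1)

# The same results expressed as slices
same_rows = np.array_equal(flipped_rows, arr_2d[::-1])
same_columns = np.array_equal(flipped_columns, arr_2d[:, ::-1])

print(flipped_rows)
print(flipped_columns)
```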

## 2.4. SciPy

In [None]:
# sparse matrices
# matrix multiplication

## 2.5. Data visualization with Seaborn & Matplotlib

SUGGESTION: TEACH HOW TO WITH SEABORN, USE MATPLOTLIB WHERE SEABORN DOES NOT OFFER METHODS

- https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html

# X. References

[<a href='#destination1_'>1</a>] https://docs.python.org/3/howto/unicode.html <a id='destination1'></a>

[<a href='#destination1_'>2</a>] https://www.makeuseof.com/how-to-include-emojis-in-your-python-code/ <a id='destination1'></a>

[<a href='#destination2_'>3</a>] https://numpy.org/doc/stable/user/absolute_beginners.html <a id='destination2'></a>

[<a href='#destination3_'>4</a>] http://pandas.pydata.org/docs/index.html <a id='destination3'></a>

## 1.1 Reading/writing text files

In the first notebook, we went through some basics of Python. In this section, you will learn more about handling files, with a focus on reading text data for further analysis.

Let's say you want to create a text file and write something to it. You can do that with the following lines of code:

In [None]:
my_file = open("test.txt", "w")
my_file.write("Some text")
my_file.close()

The first line opens (and, if necessary, creates) a text file named *test.txt*; the second argument of `open` ("w") tells Python that we want to *write* to it. Note that the "w" mode overwrites any existing content.
In the second line, we write "Some text" to the file. With the third line, we close the file, which ensures that what we have written is saved.

If we want to read the text in the file, we can use the same three lines with minor changes:

In [None]:
my_file = open("test.txt", "r")
text = my_file.read()
my_file.close()

In [None]:
print (text)

By changing the second argument of `open` to "r", we tell python that we want to *read* what is in the file. In the second line, we save what we have read in the `text` variable, and then we close the file.

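We can also do the above operations in a better way using a `with` statement, which closes the file automatically at the end of the indented block, even if an error occurs in between.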

For writing to a file:

In [None]:
with open("test.txt", "w") as my_file:
    my_file.write("Some new text")

For reading the file:

In [None]:
with open("test.txt", "r") as my_file:
    text = my_file.read()
    
print (text)
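
Two further options of `open()` are often needed when working with text data: the "a" mode, which appends to a file instead of overwriting it, and the `encoding` argument, which makes explicit how characters are stored. The sketch below writes to a temporary directory so that no existing file is touched; the path and the file content are made up.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# 'w' creates the file (or overwrites an existing one)
with open(path, 'w', encoding='utf-8') as my_file:
    my_file.write('first line\n')

# 'a' appends to the end of the file
with open(path, 'a', encoding='utf-8') as my_file:
    my_file.write('second line\n')

# Iterating over a file object yields one line at a time
with open(path, 'r', encoding='utf-8') as my_file:
    lines = [line.rstrip('\n') for line in my_file]

print(lines)
```

Reading line by line avoids loading a large file into memory all at once.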

## 1.2 Reading text from PDF files


<img src='images/pdf.png' style='height: 120px; float: right; margin-left: 50px' >

For analysing text data, we may require reading data from pdf files. In order to do that in an efficient way, we can use **PyPDF2** library. It is a free and open source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It has many other useful features for working with pdf files, but our emphasis here is on reading data from pdf files. For more information on all the capabilities you can take a look at [its documentation](https://pypdf2.readthedocs.io/en/latest/).

You can install the library with `pip`:

In [None]:
#pip install PyPDF2

### 1.2.1 Reading Metadata

We'll begin by reading the metadata of a PDF file. We use the *Generative Adversarial Networks* paper by Ian Goodfellow, saved as *GAN.pdf*, as an example. You can use any other file of your choice.

In [None]:
from PyPDF2 import PdfReader

In [None]:
reader = PdfReader("GAN.pdf")

In [None]:
meta = reader.metadata
meta

In [None]:
print(meta.author)
print(meta.creator)
print(meta.producer)
print(meta.subject)
print(meta.title)

Please note that all of the above values could be `None` for your own pdf files. You can also write metadata in your pdf files, [here](https://pypdf2.readthedocs.io/en/latest/user/metadata.html) is how.

### 1.2.2 Extracting Text from a PDF

For accessing the pages of the pdf file, you can do the following:

*Note: the `reader` variable is defined in the previous section.*

In [None]:
pages = reader.pages
len(pages)

In [None]:
print(pages[0].extract_text())

More information on the `extract_text()` method can be found [here](https://pypdf2.readthedocs.io/en/latest/modules/PageObject.html#PyPDF2._page.PageObject.extract_text).

#### Using a visitor


You can use *visitor-functions* to control which part of a page you want to process and extract. The visitor-functions you provide will get called for each operator or for each text fragment.

The function provided in the `visitor_text` argument of the `extract_text()` method has five arguments: text, current transformation matrix, text matrix, font dictionary, and font size. In most cases the x and y coordinates of the current position are in index 4 and 5 of the current transformation matrix.

The font-dictionary may be None in case of unknown fonts. If not None it may e.g. contain key “/BaseFont” with value “/Arial,Bold”.

Warning: In complicated documents the calculated positions might be wrong.

The function provided in argument `visitor_operand_before` has four arguments: operand, operand-arguments, current transformation matrix and text matrix.

#### Example: Ignoring header and footer

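In this example, we read the text of page 1 (with index = 0) but ignore the header and the footer: only text whose y coordinate lies between 30 and 670 is kept. PDF coordinates are measured from the bottom of the page, so the footer sits at y ≤ 30 and the header at y ≥ 670.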

In [None]:
page = reader.pages[0]

parts = []


def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 30 and y < 670:
        parts.append(text)


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

### 1.2.3 Extracting Images

Every page of a PDF document can contain an arbitrary number of images. The image names are not guaranteed to be unique, which is why the code below prefixes each file name with a counter.
The following piece of code extracts the images of the second page of the PDF file and saves them in the current directory.

In [None]:
page = reader.pages[1]
count = 0

for image_file_object in page.images:
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
        count += 1
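When images are extracted from several pages, a counter that restarts on every page is no longer enough to keep file names apart. One possible approach, sketched here with a hypothetical helper `image_output_path` (the directory name and the image name `Im1.png` are assumptions for illustration), is to build the file name from the page number and the image index:

```python
from pathlib import Path


def image_output_path(out_dir, page_number, image_index, original_name):
    """Build a file path that stays unique across pages and images."""
    suffix = Path(original_name).suffix  # keep the original extension, e.g. ".png"
    return Path(out_dir) / f"page{page_number:03d}_img{image_index:02d}{suffix}"


# "Im1.png" is a made-up image name used only for this illustration.
print(image_output_path("extracted_images", 2, 0, "Im1.png"))
```

Inside the extraction loop, the returned path could be passed to `open(..., "wb")` in place of `str(count) + image_file_object.name`.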

### 1.2.4 Reading PDF Annotations

We can also access the annotations that may be present in our PDF files. The *GAN.pdf* file contains three annotations: the first page has a sticky note with some text and a highlighted passage with an attached comment, and the second page has another sticky note. We can access these texts as follows:

#### Sticky notes:

In [None]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Text":
                print(annot.get_object()["/Contents"])

#### Highlighted text:

In [None]:
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Highlight":
                print(annot.get_object()["/Contents"])
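The two loops above differ only in the subtype they look for, and both raise a `KeyError` when an annotation has no `/Contents` entry, which the PDF format allows. A minimal sketch of a shared helper follows. `collect_annotation_texts` and `FakeAnnotation` are hypothetical names introduced here, and the fake pages only stand in for `reader.pages` so that the example runs without a PDF file.

```python
def collect_annotation_texts(pages, subtype):
    """Return (page_number, contents) for every annotation of the given subtype."""
    found = []
    for page_number, page in enumerate(pages, start=1):
        if "/Annots" not in page:
            continue
        for annot in page["/Annots"]:
            obj = annot.get_object()
            if obj.get("/Subtype") == subtype:
                # "/Contents" is optional, so fall back to None instead of failing
                found.append((page_number, obj.get("/Contents")))
    return found


class FakeAnnotation:
    """Hypothetical stand-in for an annotation reference, for illustration only."""

    def __init__(self, data):
        self._data = data

    def get_object(self):
        return self._data


fake_pages = [
    {"/Annots": [FakeAnnotation({"/Subtype": "/Text", "/Contents": "A sticky note"}),
                 FakeAnnotation({"/Subtype": "/Highlight"})]},
    {},  # a page without annotations
    {"/Annots": [FakeAnnotation({"/Subtype": "/Text", "/Contents": "Another note"})]},
]

print(collect_annotation_texts(fake_pages, "/Text"))
print(collect_annotation_texts(fake_pages, "/Highlight"))
```

With a real file, the call would be `collect_annotation_texts(reader.pages, "/Text")`.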

## 1.3 Unicode and emoji handling  [<a href='#destination1'>1, 2</a>] <a id='destination1_'></a>

Today’s programs need to be able to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python’s string type uses the **Unicode** Standard for representing characters, which lets Python programs work with all these different possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.

In order to handle all kinds of text data that include characters in different languages, emojis etc, we need to be familiar with this standard.

<img src='images/emojis.png' style='height: 200px; float: left; margin-left: 115px'>

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values, the [actual number assigned](https://www.unicode.org/versions/Unicode15.0.0/#Summary) is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).
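The relation between a character and its code point can be explored with Python's built-in `ord()` and `chr()` functions and the standard `unicodedata` module. The following short sketch uses the chess knight mentioned above:

```python
import unicodedata

knight = "\u265E"  # four hexadecimal digits after a lowercase \u

print(knight)                    # the character itself
print(ord(knight))               # its code point as a decimal integer
print(hex(ord(knight)))          # the same value in hexadecimal
print(f"U+{ord(knight):04X}")    # the notation used in the Unicode standard
print(unicodedata.name(knight))  # the official character name
print(chr(0x265E) == knight)     # chr() goes from code point back to character
```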

Like every character, every emoji has a unique Unicode code point assigned to it. To use such a code point in a Python string, write it as an escape sequence: a backslash, an uppercase **U**, and exactly eight hexadecimal digits. For a five-digit code point this means replacing the **+** with **000** and prefixing the result with **\\**.

For example, **U+1F605** is written as **\U0001F605**. Here, **+** is replaced with **000** and the code is prefixed with **\\**.
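Instead of rewriting the notation by hand, the conversion can also be done in code. The following is a minimal sketch, where `from_codepoint_notation` is a hypothetical helper name: it strips the `U+` prefix, reads the rest as a hexadecimal number, and passes it to `chr()`.

```python
def from_codepoint_notation(notation):
    """Convert a string such as 'U+1F605' into the character it names."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"expected something like 'U+1F605', got {notation!r}")
    return chr(int(notation[2:], 16))


sweat = from_codepoint_notation("U+1F605")
print(sweat)
print(sweat == "\U0001F605")  # eight hex digits after an uppercase \U
print(from_codepoint_notation("U+265E") == "\u265E")  # four hex digits after a lowercase \u
```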

Here are some examples:

In [None]:
print("grinning face: \U0001F600")
print("beaming face with smiling eyes: \U0001F601")
print("grinning face with sweat: \U0001F605")
print("rolling on the floor laughing: \U0001F923")
print("face with tears of joy: \U0001F602")
print("slightly smiling face: \U0001F642")
print("smiling face with halo: \U0001F607")
print("smiling face with heart-eyes: \U0001F60D")
print("zipper-mouth face: \U0001F910")
print("unamused face: \U0001F612")

### 1.3.1 Extracting all emojis from the text

You can easily extract all the emojis from the text using Python. It can be done using regular expressions. First, you need to install *regex* using `pip`:

In [None]:
#pip install regex

In [None]:
import regex

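You can use the `regex.findall()` function to find all the emojis in the text. The pattern below matches every character that is not a word character, whitespace, a comma, or a period: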

In [None]:
text = 'We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string'

emojis = regex.findall(r'[^\w\s,.]', text)

In [None]:
emojis
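A pattern that matches everything except word characters, whitespace, commas and periods will also pick up other punctuation such as `!` or `?`. An alternative is to list the emoji code point ranges explicitly. The sketch below uses the standard library's `re` module and assumes that the four listed ranges cover the emojis of interest (Unicode defines more than these). It also counts how often each emoji occurs.

```python
import re
from collections import Counter

# Assumption: these ranges are sufficient for the sample text; they are not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "]"
)

sample = "We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string!"

found = EMOJI_PATTERN.findall(sample)
counts = Counter(found)

print(found)                  # the exclamation mark is not included
print(counts.most_common(1))  # the most frequent emoji and its count
```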

### 1.3.2 Removing emojis from the text in Python

You can remove all emojis from the text with the help of regular expressions in Python:

In [None]:
import regex

text = 'We 😍 want 😇 to 😅 extract 😁 every 😀 emoji 😒 in 😁 this 😂 string'

print(text)

In [None]:
# Function to remove emojis from text:

def removeEmoji(text):
    # Character class covering several Unicode blocks that contain emojis
    regex_pattern = regex.compile("["
    "\U0001F600-\U0001F64F" # emoticons
    "\U0001F300-\U0001F5FF" # symbols & pictographs
    "\U0001F680-\U0001F6FF" # transport & map symbols
    "\U0001F1E0-\U0001F1FF" # flags (iOS)
    "]+", flags = regex.UNICODE)
    # Replace every run of matching characters with an empty string
    return regex_pattern.sub(r'', text)

In [None]:
removeEmoji(text)
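Two details are worth keeping in mind. First, the four ranges above do not include every emoji: "rolling on the floor laughing" (U+1F923), for example, lies in the block U+1F900 to U+1F9FF. Second, removing an emoji that stood between two spaces leaves a double space behind. The following sketch addresses both under stated assumptions: the added range is one possible extension rather than a complete list, `remove_emoji_and_tidy` is a hypothetical name, and the standard library's `re` module is used instead of `regex`.

```python
import re

EMOJI_RUNS = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs (assumed addition)
    "]+"
)


def remove_emoji_and_tidy(text):
    """Remove emojis in the ranges above, then collapse the leftover whitespace."""
    without_emoji = EMOJI_RUNS.sub("", text)
    return " ".join(without_emoji.split())


print(remove_emoji_and_tidy("We 😍 want 🤣 this 😂 string"))
```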

## References

Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., & Dietze, S. (2020). TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In: CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2991–2998). https://doi.org/10.1145/3340531.3412765

Fafalios, P., Iosifidis, V., Ntoutsi, E., & Dietze, S. (2018). TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets. In: The Semantic Web. ESWC 2018. Lecture Notes in Computer Science, vol 10843. Springer, Cham. https://doi.org/10.1007/978-3-319-93417-4_12

<a href='#weidmann_2022'>Weidmann, N. B. (2022)</a><a id='#weidmann_2022'></a>. *Data Management for Social Scientists: From Files to Databases*. Cambridge University Press.

Wikipedia (2022). Twitter. https://en.wikipedia.org/wiki/Twitter. Retrieved 01.12.2022.

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: Haiko Lietz

Contributors: Pouria Mirelmi & N. Gizem Bacaksizlar Turbic

Acknowledgements: ...

Version date: XX. December 2022

License: ...
</div>