<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session A3: Scientific computing and data visualization

In the previous [Session A2: Data management with Pandas](2_data_management_with_pandas.ipynb), we introduced the data analysis package Pandas. We use it to manage 2-dimensional tables, the fundamental data structure we will work with throughout all sessions. While Pandas is powerful, it has limits when it comes to **scientific computing** (*i.e.*, working with data mathematically). Two additional Python packages serve that purpose. The first is [NumPy](https://numpy.org/). NumPy expands the structured data universe from 1 (vectors) and 2 (dataframes or matrices) to $n$ dimensions and provides mathematical functions to work with these so-called **arrays** (<a href='sundnes_introduction_2020'>Sundnes, 2020</a>, ch. 6). NumPy has a fundamental role in data processing pipelines. Pandas builds on NumPy in the way dataframes can be addressed via indices.

However, NumPy and, hence, Pandas are unaware of (*i.e.*, not well prepared to handle) **data sparsity**. Data is sparse when it contains many zeros, missing values, or NaN (Not a Number) values, depending on the context. Consider for example the use of hashtags on Twitter: a tweet-hashtag matrix will be very sparse because many different hashtags are used in total but only up to a handful of them are used in a single tweet. In Computational Social Science, sparse data is the rule, not the exception. While Pandas offers [data structures for efficiently storing sparse data](https://pandas.pydata.org/docs/user_guide/sparse.html), its routines for processing sparse data are hardly developed. This is where SciPy, the other fundamental library for scientific computing, comes in. [SciPy](https://scipy.org/) is the standard library for handling sparse data. Beyond sparse data routines, SciPy provides tools for integration, optimization, statistics, and handling spatial data, among others.

Scientific computing is more than just data wrangling; it is data wrangling for the purpose of producing scientific knowledge. We can better understand what that means when we think about how knowledge is produced in the interplay of exploration and confirmation. The **Exploratory Data Analysis** (EDA) paradigm states that this process consists of the three steps to

1. imagine an initial model of the problem at hand,
2. produce an initial answer to the research question, and
3. assess the initial model and answer via data analysis.

This process is a loop because the third step brings one back to the first where the initial model is refined. In the beginning, the loop is exploratory in nature, which means, data analysis does not yet involve hypothesis testing or predicting unseen data. It rather consists of looking at trends, distributions, and bivariate relationships. Knowledge production, thus, begins with seeking confirmation via exploration (<a href='mclevey_doing_2022'>McLevey, 2022</a>, ch. 7.4).

**Data visualization** is essential in EDA. In the words of McLevey (<a href='mclevey_doing_2022'>2022</a>, ch. 7.5):

> Creating good graphs has little to do with Python and everything to do with the decisions you make about what to show and how to show it. Some of these decisions are high level, like the kind of graph and how it should be structured. Other decisions are low level, like selecting colours and shapes.

Effective visualization basically means to avoid typical mistakes regarding aesthetics, substantive data problems, and being inattentive to the realities of human perception (<a href='mclevey_doing_2022'>McLevey, 2022</a>, ch. 7.5). [Matplotlib](https://matplotlib.org/) is Python's basic library for creating visualizations. Matplotlib gives you many options regarding kinds of plots and how to style them, but achieving what you want can be cumbersome. The [Seaborn](https://seaborn.pydata.org/) library is an easier-to-use interface "for drawing attractive and informative statistical graphics" that builds on Matplotlib and integrates closely with Pandas data structures.

The **R** language and software environment for statistical computing and graphics is very popular in the social sciences, also because it provides the [Tidyverse](https://www.tidyverse.org/), a collection of mutually adapted packages for tabular data structures, their manipulation (*e.g.*, merging, aggregating), and producing appealing graphics (<a href='weidmann_data_2022'>Weidmann, 2022</a>, ch. 7). We argue that **Python** does not need to hide behind R in this regard. Pandas, when combined with the [Seaborn](https://seaborn.pydata.org/) statistical data visualization library, with NumPy and SciPy in the loop, leaves nothing to be desired.

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn basic steps of scientific computing and data visualization. The TweetsCOV19 dataset will continue to function as the example. 

In subsessions **A3.1** and **A3.2**, we will introduce the NumPy and SciPy libraries. NumPy allows working with $n$-dimensional tables called arrays, which are typically needed in data processing. SciPy enables you to efficiently process and analyze huge matrices (*i.e.*, 2-dimensional numerical tables) with many zeros or missing values (which is often the case). Finally, in subsession **A3.3**, you will learn how to use the Matplotlib and Seaborn libraries to explore data visually.
</div>

## A3.1. Using NumPy to store data in $n$-dimensional arrays

<img src='images/numpy.png' style='height: 100px; float: right; margin-left: 10px'>

Pandas offers Series and DataFrames as native data structures, but its mathematical operations rely on the corresponding NumPy data structures: vectors and matrices. NumPy is purely numerical. The basic differences between Pandas and NumPy are:

- NumPy provides $n$-dimensional data structures called arrays, not just vectors and matrices.
- NumPy does not allow items to be lists.
- NumPy does not have metadata (*i.e.*, the indices of arrays are not labeled).
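
The differences listed above can be illustrated with a minimal sketch. The table, its labels, and its values are toy assumptions and are not taken from TweetsCOV19:

```python
import numpy as np
import pandas as pd

# A labeled 2-dimensional table in Pandas (toy data)
df = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=['day_1', 'day_2'],
    columns=['hashtag_a', 'hashtag_b']
)

# Converting to NumPy keeps the values but drops the row and column labels
matrix = df.to_numpy()

# Stacking two matrices gives a 3-dimensional array
cube = np.stack([matrix, matrix * 10])

print(matrix.ndim, matrix.shape)
print(cube.ndim, cube.shape)

# Cells are addressed by position only: second matrix, first row, second column
print(cube[1, 0, 1])
```

Once the labels are gone, it is up to us to remember which position stands for which day or hashtag. This is why the label vectors are saved separately later in this subsession.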

One use of NumPy is to store data in numerical form as part of a data processing pipeline. We will now go through two examples where data is stored in 2-dimensional and 3-dimensional arrays, respectively. For the **2-dimensional example**, consider that you want to create an array where the rows are days, the columns are hashtags, and the cells give the number of times that a hashtag is used in a day. We have created the necessary `days_hashtags_long` table in the previous session.

In [None]:
import pandas as pd

In [None]:
days_hashtags_long = pd.read_csv(filepath_or_buffer='../data/TweetsCOV19/days_hashtags_long.tsv', sep='\t', index_col=None, encoding='utf-8')
days_hashtags_long

We can make this table wide by using the [`pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) method, which imitates the corresponding procedure in the Excel spreadsheet software:

In [None]:
days_hashtags_wide = days_hashtags_long.pivot(index='day', columns='hashtag', values='tweets')
days_hashtags_wide.head()

We can transform the table into a NumPy array to get what we want. Appending `[:5]` to an array corresponds to appending `head()` to a dataframe: it will only show the first five rows:

In [None]:
days_hashtags_wide_array = days_hashtags_wide.to_numpy()
days_hashtags_wide_array[:5]

What we see is the dataframe stripped of its metadata.
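
To see the long-to-wide step in isolation, here is a small sketch on a made-up table. It reuses the column names 'day', 'hashtag', and 'tweets' from above, but the values are invented:

```python
import pandas as pd

# Toy long table (values are made up)
toy_long = pd.DataFrame({
    'day': ['2020-05-01', '2020-05-01', '2020-05-02'],
    'hashtag': ['covid19', 'vaccine', 'covid19'],
    'tweets': [10, 3, 7]
})

# One row per day, one column per hashtag
toy_wide = toy_long.pivot(index='day', columns='hashtag', values='tweets')

# Day-hashtag combinations that are absent from the long table become NaN
toy_array = toy_wide.to_numpy()
print(toy_wide)
print(toy_array)
```

Note that the missing combination turns the whole array into floats because NaN is a float value.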

For the **3-dimensional example**, obviously, we cannot use Pandas all the way since it only supports matrices. But we can still use it to process the data before we store it in an array. Consider that you want to create an array where the first dimension is days, the second dimension is mentioned users, the third dimension is sentiment categories (positive, negative, and average), and the cells give the mean sentiment scores of tweets in which users are mentioned in a day. First, load the necessary tables...

In [None]:
tweets = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/tweets.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8', parse_dates = ['timestamp'])
users = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/users.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
tweets_mentions = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/tweets_mentions.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
tweets_hashtags = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/tweets_hashtags.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
tweets_sentiments = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/tweets_sentiments.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
mentions = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/mentions.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
hashtags = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/hashtags.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')
sentiments = pd.read_csv(filepath_or_buffer = '../data/TweetsCOV19/TweetsCOV19_tables/sentiments.tsv.gz', sep = '\t', index_col = None, encoding = 'utf-8')

and prepare a list of mentions to analyze:

In [None]:
mention_list = ['realdonaldtrump', 'who', 'breitbartnews', 'cnn']

In [Session A2: Data management with Pandas](2_data_management_with_pandas.ipynb), we have normalized the sentiment information, which means, we have also created `tweets_sentiments` and `sentiments` tables even though this seemed to be a little more data processing than necessary. But now we benefit from having normalized the data all the way through because arrays can naturally be **populated from normalized tables**. The full information we need is stored in five tables, so we would need four joins. However, since we use all three sentiment categories (and all three indices) from the `sentiments` table, we will spare the fourth join to the `sentiments` table and use the `sentiment_idx` for the third dimension instead. Note that we do not `drop` the 'tweet_idx' in the first join (lines 1–2) because we will also need it in line 6 to prepare the third join (lines 6–8). Line 9 extracts the date. Line 11 computes the mean sentiment scores:

In [None]:
days_mentions_sentiments_long = tweets_mentions.set_index(keys='tweet_idx', drop=False).join(
    other = tweets[['timestamp']]
).set_index('mention_idx').join(
    other = mentions[mentions['mention'].isin(mention_list)][['mention']], 
    how = 'inner'
).set_index(keys='tweet_idx').join(
    other = tweets_sentiments.set_index(keys='tweet_idx')
).reset_index(drop=True)
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['timestamp'].dt.date
del days_mentions_sentiments_long['timestamp']
days_mentions_sentiments_long = days_mentions_sentiments_long.groupby(['day', 'mention', 'sentiment_idx']).mean().round(4).reset_index()
days_mentions_sentiments_long.head()

From this table, we now create the array, and the first three columns are the three dimensions. Since NumPy indexing is purely numerical, we must represent the days and mentions in this table by identifiers that start with 0 and are contiguous (just like the 'sentiment_idx'). In other words, we must make the first two variables categorical (if you know R, the 'category' data type corresponds to R's 'factor'). Transform the first two columns using `astype('category')`:

In [None]:
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['day'].astype('category')
days_mentions_sentiments_long['mention'] = days_mentions_sentiments_long['mention'].astype('category')

Before replacing the category labels by their numerical codes, we save them:

In [None]:
day_categories = days_mentions_sentiments_long['day'].cat.categories
day_categories

In [None]:
mention_categories = days_mentions_sentiments_long['mention'].cat.categories
mention_categories

It is useful to also store the sentiment categories in a variable:

In [None]:
sentiment_categories = sentiments['sentiment'].tolist()
sentiment_categories

Now we can replace the categories by their codes:

In [None]:
days_mentions_sentiments_long['day'] = days_mentions_sentiments_long['day'].cat.codes
days_mentions_sentiments_long['mention'] = days_mentions_sentiments_long['mention'].cat.codes

In [None]:
days_mentions_sentiments_long.head()
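
The relation between category labels and codes can be sketched on a toy Series (the values are chosen for illustration only):

```python
import pandas as pd

toy = pd.Series(['who', 'cnn', 'who', 'breitbartnews']).astype('category')

# Categories are the sorted unique labels; codes are positions in that list
labels = toy.cat.categories
codes = toy.cat.codes

print(list(labels))
print(list(codes))

# The saved labels translate the codes back into the original values
restored = labels[codes.to_numpy()]
print(list(restored))
```

This is why we saved `day_categories` and `mention_categories` before overwriting the columns: without them, the codes could not be translated back.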

We have arrived at a purely numerical table where all cells that represent the dimensions are integers. This is a data structure that NumPy can work with. To transform the table into an array:

In [None]:
days_mentions_sentiments_long_array = days_mentions_sentiments_long.to_numpy()
days_mentions_sentiments_long_array[:5]

To make this long array wide, first create an empty container (array) with a 3-dimensional `shape` where the first dimension is as long as there are days, the second dimension is as long as there are mentions, and the third dimension is as long as there are sentiment categories. Note that `days_mentions_sentiments_long_array` contains floats even though the first three columns were integers in the dataframe. This is because an array (of the `ndarray` class) permits only one data type and the last column contains floats. For the container, we use `dtype='object'`, which lets each cell hold an arbitrary Python object, integers and floats included. `np.empty()` does not put meaningful values into the container, so we then set each cell to Not a Number (NaN).

In [None]:
import numpy as np
np.__version__

In [None]:
days_mentions_sentiments_wide_array = np.empty(shape=(len(day_categories), len(mention_categories), len(sentiments)), dtype='object')
days_mentions_sentiments_wide_array[:] = np.nan

Then fill this array by using the first three columns of `days_mentions_sentiments_long_array` as indices – recall that Pandas actually took over indexing from NumPy – for the wide array and filling the cells from the fourth column. Note that we transform the first three columns from 'float' to 'int' because indices must be integers, and we cast the last column explicitly to 'float'. Appending `[:5]` now means that the matrices of mentions and sentiments for the first five days are shown:

In [None]:
days_mentions_sentiments_wide_array[
    days_mentions_sentiments_long_array[:, 0].astype('int'), 
    days_mentions_sentiments_long_array[:, 1].astype('int'), 
    days_mentions_sentiments_long_array[:, 2].astype('int')
] = days_mentions_sentiments_long_array[:, 3].astype('float')
days_mentions_sentiments_wide_array[:5]
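
The filling step relies on NumPy's integer array indexing. The following sketch shows the mechanism on a toy long array. As a deliberate variation, it uses `np.full()` with a float container instead of the `dtype='object'` container from above; all names and values are made up:

```python
import numpy as np

# Toy long array: columns are day code, mention code, sentiment code, score
toy_long_array = np.array([
    [0, 0, 0, 0.5],
    [0, 1, 2, -1.0],
    [1, 0, 1, 2.0]
])

# Container for 2 days x 2 mentions x 3 sentiments, pre-filled with NaN
toy_wide_array = np.full(shape=(2, 2, 3), fill_value=np.nan)

# The three index vectors are read in parallel: row k of the long array
# addresses cell (day[k], mention[k], sentiment[k]) of the wide array
day_idx = toy_long_array[:, 0].astype('int')
mention_idx = toy_long_array[:, 1].astype('int')
sentiment_idx = toy_long_array[:, 2].astype('int')
toy_wide_array[day_idx, mention_idx, sentiment_idx] = toy_long_array[:, 3]

print(toy_wide_array)
```

Cells that are not addressed by any row of the long array stay NaN, which makes missing day-mention-sentiment combinations easy to spot.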

We can slice an $n$-dimensional array any way we want. For example, the matrix of days and sentiments for the first mentioned user 'breitbartnews' is:

In [None]:
print(mention_categories[0])
days_mentions_sentiments_wide_array[:, 0, :]
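
How a slice reduces the number of dimensions is easiest to see from the shapes. A sketch on a toy array with the same axis order (days, mentions, sentiments) and made-up contents:

```python
import numpy as np

# Toy array: 4 days x 2 mentions x 3 sentiments, filled with 0..23
toy = np.arange(24).reshape(4, 2, 3)

by_mention = toy[:, 0, :]    # all days and sentiments for the first mention
by_day = toy[0, :, :]        # all mentions and sentiments for the first day
by_sentiment = toy[:, :, 2]  # all days and mentions for the third sentiment
single_cell = toy[1, 0, 2]   # second day, first mention, third sentiment

print(by_mention.shape, by_day.shape, by_sentiment.shape)
print(single_cell)
```

Fixing one index with an integer removes that dimension, while `:` keeps it.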

## A3.2. Using SciPy to handle sparse data

In many cases, we want to work with matrices and arrays mathematically. For example, to compute the logarithm of each cell in a Pandas Series, we must apply a NumPy function:

In [None]:
np.log10(users['followers_max'].replace(to_replace=0, value=np.nan))

NumPy's [`log10()`](https://numpy.org/doc/stable/reference/generated/numpy.log10.html) is a so-called [universal function](https://numpy.org/doc/stable/reference/ufuncs.html) that goes through the vector item by item. Universal functions are unaware of data sparsity. `log10()` tries to take the logarithm of each cell in a vector even if only one out of 1 million cells is larger than 0. Unawareness of data sparsity can easily cause your computer to run out of memory, for example, when algebraic operations like matrix multiplication are performed. Matrix multiplication is required to obtain **co-occurrence matrices**, for example, of hashtags in tweets. The raw TweetsCOV19 dataset is quite sparse, as you can tell by the many 'null;' entries. Besides eliminating redundancy, our transformation into a relational database has also made the data completely dense (*i.e.*, unsparse). A zero, like 0 retweets, actually is a piece of information.

<img src='images/scipy.png' style='height: 100px; float: right; margin-left: 10px'>

We will now see how we can obtain the hashtag co-occurrence matrix for the TweetsCOV19 dataset. To be efficient for different kinds of operations, SciPy offers different sparse matrix formats. While choosing the right format is not so important in our case, it is quite important when data gets really big. COOrdinate matrices are fast for constructing sparse matrices. To construct the occurrence matrix `TH` (T for tweets, H for hashtags), the [`coo_matrix()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html) constructor takes as input a (cells, (rows, columns)) triplet.

<div class='alert alert-block alert-info'>
<b>Insight</b>

It is one of the big benefits of **relationship tables**, as we have introduced them in [Session A2: Data management with Pandas](2_data_management_with_pandas.ipynb), that they only contain the contiguous identifiers which are required by matrix manipulation routines like in SciPy. In the current example, all necessary information for constructing the sparse occurrence matrix lies in the `tweets_hashtags` relationship table.
</div>

`cells` is just a vector of 1s; each hashtag is used only once in a tweet (as a result of the `create_relationship_table()` function):

In [None]:
rows = tweets_hashtags['tweet_idx']
cols = tweets_hashtags['hashtag_idx']
cells = [1]*len(tweets_hashtags)

In [None]:
from scipy.sparse import coo_matrix

In [None]:
TH = coo_matrix((cells, (rows, cols)), shape=(len(tweets), len(hashtags)))
TH

You can see that the matrix is quite large (1916440 tweets x 281296 hashtags), yet it requires little memory. The technical summary is:

In [None]:
TH.__dict__

A more readable summary is:

In [None]:
print(TH)

The index pairs of a sparse matrix can be accessed via:

In [None]:
print(f'Tweet/row indices: {TH.nonzero()[0]}')
print(f'Hashtag/column indices: {TH.nonzero()[1]}')
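
The construction can be retraced on a toy relationship table. The sketch below uses made-up indices and then estimates how much memory a dense version of the full matrix would need, assuming 8 bytes per cell and the shape reported above:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy relationship table: 4 tweets, 3 hashtags, 5 tweet-hashtag pairs
toy_rows = np.array([0, 0, 1, 2, 3])
toy_cols = np.array([0, 1, 1, 2, 0])
toy_cells = np.ones(len(toy_rows), dtype='int64')

toy_TH = coo_matrix((toy_cells, (toy_rows, toy_cols)), shape=(4, 3))

# Only the 5 stored entries take up memory, not all 4 x 3 cells
print(toy_TH.nnz)
print(toy_TH.toarray())

# Rough size of a dense int64 version of the 1916440 x 281296 matrix
dense_bytes = 1916440 * 281296 * 8
print(f'{dense_bytes / 1e12:.1f} TB')
```

A dense matrix of that shape would need several terabytes, which is why the sparse representation is not just a convenience here.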

#### Getting the co-occurrence matrix

Given an occurrence matrix $TH$ with tweet indices as rows and hashtag indices as columns, the co-occurrence matrix of hashtags co-occurring in tweets is $Co=HT\cdot TH$ where $HT$ is the transpose of $TH$ (*i.e.*, with hashtag indices as rows and tweet indices as columns) and $\cdot$ means that $HT$ is [multiplied](https://en.wikipedia.org/wiki/Matrix_multiplication) by $TH$ (<a href='batagelj_on_2013'>Batagelj & Cerinšek, 2013</a>). To do fast vector operations (*e.g.*, matrix multiplication), it is recommended to transform the matrix into the Compressed Sparse Row (CSR) format ([`csr_matrix()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)) before multiplication:

In [None]:
TH = TH.tocsr()

If the COO matrix had contained duplicate entries, they would have been summed during the conversion.
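
The summing of duplicates can be checked on a tiny matrix in which one cell is deliberately listed twice (toy values):

```python
from scipy.sparse import coo_matrix

# Cell (0, 1) is listed twice on purpose
dup = coo_matrix(([1, 1, 1], ([0, 0, 1], [1, 1, 0])), shape=(2, 2))
dup_csr = dup.tocsr()

print(dup.nnz)      # number of entries stored in the COO format
print(dup_csr.nnz)  # number of entries stored after the conversion
print(dup_csr.toarray())
```

In our case this does not matter because the relationship table lists each tweet-hashtag pair only once.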

To get the co-occurrence matrix `Co` (where `TH.T` is the transpose of `TH` and [`dot()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.dot.html) is the method for matrix multiplication):

In [None]:
Co = TH.T.dot(other=TH)
Co

Note that the resulting matrix is in the Compressed Sparse Column (CSC) format.

Sparse matrices can be accessed via indices like arrays. Use [`todense()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.todense.html) to display the sparse matrix in wide (but dense), not long format. Do not do this for the whole matrix unless you want to test if your computer has enough memory (or if you are patient enough). The first five rows and columns show that the matrix is symmetric (*i.e.*, has redundant information in the upper and lower triangular portions) and that the diagonal contains the counts of how often a hashtag is used in all tweets:

In [None]:
Co[:5, :5].todense()
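
A worked toy example may help to read such a matrix. The occurrence matrix below is invented; it describes three tweets and three hashtags:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy occurrence matrix: 3 tweets (rows) x 3 hashtags (columns)
toy_TH = csr_matrix(np.array([
    [1, 1, 0],   # tweet 0 uses hashtags 0 and 1
    [1, 1, 1],   # tweet 1 uses all three hashtags
    [0, 0, 1]    # tweet 2 uses hashtag 2 only
]))

toy_Co = toy_TH.T.dot(toy_TH)
print(toy_Co.toarray())
```

Hashtags 0 and 1 co-occur in two tweets, so the corresponding off-diagonal cells are 2. Each hashtag is used in two tweets, so the diagonal is 2 throughout.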

To eliminate the redundancy, use the [`triu()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.triu.html) function to extract just the matrix's upper triangular portion, including the diagonal. This operation is fastest when the matrix is in the COO format. Hence, transform it with `.tocoo()` first:

In [None]:
from scipy.sparse import triu

In [None]:
Co = triu(Co.tocoo())
Co

In the SciPy version we are using, the diagonal cannot be removed, only set to 0 (fastest in COO format). The cell is commented out because we want to keep the diagonal:

In [None]:
#Co.setdiag(values=0)
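
As an aside, `triu()` accepts an offset parameter `k`. The sketch below, on a toy symmetric matrix, suggests that `k=1` is another way to obtain the upper triangle without the diagonal. We have not applied it to the session's data, so treat it as an option to try rather than as part of the pipeline:

```python
import numpy as np
from scipy.sparse import coo_matrix, triu

sym = coo_matrix(np.array([
    [2, 2, 1],
    [2, 2, 1],
    [1, 1, 2]
]))

upper_with_diag = triu(sym)          # k=0 is the default: keeps the diagonal
upper_without_diag = triu(sym, k=1)  # k=1 starts one step above the diagonal

print(upper_with_diag.toarray())
print(upper_without_diag.toarray())
print(upper_with_diag.nnz, upper_without_diag.nnz)
```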

The sparse matrix can be saved as a Pandas `hashtag_cooccurrences` table by transforming the row indexes (line 3), column indexes (line 4), and cells (line 5) to Series and concatenating them (if the diagonal values have been set to 0 they still show up as rows with NaN indices):

In [None]:
hashtag_cooccurrences = pd.concat(
    objs=[
        pd.Series(Co.nonzero()[0]), 
        pd.Series(Co.nonzero()[1]), 
        pd.Series(Co.data)
    ], 
    axis=1
)
hashtag_cooccurrences.columns = ['hashtag_idx_i', 'hashtag_idx_j', 'cooccurrence']
hashtag_cooccurrences
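
A COO matrix also exposes its stored entries directly through the `row`, `col`, and `data` attributes. The following sketch builds the same kind of long table from these attributes for a toy matrix. This is an alternative to the `nonzero()` approach above, not the procedure used in this session:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix, triu

toy_Co = triu(coo_matrix(np.array([
    [2, 2, 1],
    [2, 2, 1],
    [1, 1, 2]
])))

# The three vectors are aligned: entry k sits at (row[k], col[k]) with value data[k]
toy_table = pd.DataFrame({
    'hashtag_idx_i': toy_Co.row,
    'hashtag_idx_j': toy_Co.col,
    'cooccurrence': toy_Co.data
})
print(toy_table)
```

Because all three vectors have the same length by construction, explicitly stored zeros would stay attached to their indices.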

#### Getting the normalized co-occurrence matrix

You can also compute normalized co-occurrence scores. The idea is to give a hashtag a smaller weight the more it shares the "attention" it gets with other hashtags in a tweet. For example, if three hashtags are used in a tweet, each hashtag gets an attention or relationship weight of 1/3. These weights can be obtained via row normalization (of CSR matrices). For that purpose, `TH` must be normalized and the result stored as `N`. The [scikit-learn](https://scikit-learn.org/) library's [`normalize()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html) function can be used for this task:

In [None]:
type(TH)

In [None]:
from sklearn.preprocessing import normalize

In [None]:
N = normalize(TH, norm='l1', axis=1)
N

In [None]:
print(N)

The normalized co-occurrence matrix of hashtags co-occurring in tweets is $Cn=HT\cdot N$ (<a href='batagelj_on_2013'>Batagelj & Cerinšek, 2013</a>):

In [None]:
Cn = TH.T.dot(other=N)
Cn = triu(Cn.tocoo())
#Cn.setdiag(values=0)
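
To see what the normalization does to the numbers, here is a toy sketch. Instead of scikit-learn's `normalize()`, it performs the L1 row normalization by hand with a diagonal matrix, which assumes that no row is empty (every toy tweet uses at least one hashtag):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

toy_TH = csr_matrix(np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1]
], dtype='float64'))

# L1 row normalization by hand: divide each row by its sum
# (this would divide by zero for tweets without hashtags)
row_sums = np.asarray(toy_TH.sum(axis=1)).ravel()
toy_N = diags(1 / row_sums).dot(toy_TH)

toy_Cn = toy_TH.T.dot(toy_N)
print(toy_N.toarray().round(2))
print(toy_Cn.toarray().round(2))
```

In this toy case, each row of `toy_Cn` sums to the number of tweets in which the hashtag is used: the normalization redistributes a hashtag's occurrences over the hashtags it appears with, rather than counting every pair fully.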

Attach the normalized co-occurrence scores to the `hashtag_cooccurrences` table (line 1), remove the diagonal rows if you want (line 2), and sort the table (line 3):

In [None]:
hashtag_cooccurrences['cooccurrence_norm'] = pd.Series(data=Cn.data.round(4))
#hashtag_cooccurrences = hashtag_cooccurrences[hashtag_cooccurrences['cooccurrence'] > 0]
hashtag_cooccurrences = hashtag_cooccurrences.sort_values(by=['hashtag_idx_i', 'hashtag_idx_j']).reset_index(drop=True)
hashtag_cooccurrences

Networks can be constructed directly from such co-occurrence tables, as we will see in [Session D1: Network analysis](). For now, we just save the table to a file:

In [None]:
import os

directory = 'results'
if not os.path.exists(directory):
    os.makedirs(directory)

In [None]:
hashtag_cooccurrences.to_csv(path_or_buf='results/hashtag_cooccurrences.tsv.gz', sep='\t', index=False, encoding='utf-8', compression='gzip')

## A3.3. Exploring the data visually

<img src='images/matplotlib.png' style='height: 50px; float: right; margin-left: 10px'>

We have claimed that Python does not need to hide behind R and the tidyverse. We hope that the previous and this session have demonstrated Python's appeal for the data management and processing steps. Now, you will learn how you can explore the data visually and produce publication-ready figures.

#### Trends

Plotting the number of tweets over time using plain Matplotlib, you will notice a downward trend and a weekly rhythm:

In [None]:
import matplotlib.pyplot as plt

In [None]:
tweets_over_time = tweets.groupby(tweets['timestamp'].dt.date).size()

In [None]:
plt.figure(figsize=[12, 2])
plt.plot(tweets_over_time)
plt.ylabel('Frequency')
plt.show()
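
The aggregation behind `tweets_over_time` can be retraced on a handful of made-up timestamps:

```python
import pandas as pd

toy_tweets = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2020-05-01 08:00', '2020-05-01 21:30',
        '2020-05-02 12:15',
        '2020-05-03 09:00', '2020-05-03 10:00', '2020-05-03 23:59'
    ])
})

# .dt.date drops the time of day, so tweets from the same day share a group key
toy_over_time = toy_tweets.groupby(toy_tweets['timestamp'].dt.date).size()
print(toy_over_time)
```

The result is a Series with one row per day, which Matplotlib can plot directly because the index supplies the x values.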

<img src='images/seaborn.png' style='width: 100px; float: right; margin-left: 10px'>

With `set_theme()` from Seaborn, you can set a visual style that will be used even if you do plain Matplotlib plotting. You can choose from five styles: 'darkgrid', 'whitegrid', 'dark', 'white', and 'ticks'.

In [None]:
import seaborn as sns

In [None]:
sns.set_theme(style='darkgrid')

In [None]:
plt.figure(figsize=[12, 2])
plt.plot(tweets_over_time)
plt.ylabel('Frequency')
plt.show()

Now we will continue working with the arrays we have created in subsection A3.1, and we will do so in the spirit of Exploratory Data Analysis. The TweetsCOV19 webpage shows [plots](https://data.gesis.org/tweetscov19/#Statistics) about the frequency development of selected hashtags up until April 2020. We have stored the usage statistics for hashtags listed in `hashtag_list` in `days_hashtags_wide_array`. We can easily create figures for May 2020 from that 2-dimensional array. The following cell loops through the array by iterating through the `hashtag_indices` listed in line 1. Since the array that holds the y values is stripped of labels, we must take the x labels from the corresponding `days_hashtags_wide` dataframe:

In [None]:
days_hashtags_wide.index

In [None]:
hashtag_indices = [0, 1]
hashtag_list = ['coronavirus', 'covid19', 'hydroxychloroquine', 'vaccine']

plt.figure(figsize=[12, 2])
for hashtag_index in hashtag_indices:
    plt.plot(days_hashtags_wide.index, days_hashtags_wide_array[:, hashtag_index], label='#' + hashtag_list[hashtag_index])
plt.legend()
plt.xticks(rotation=45, ha='right')
plt.ylabel('Frequency')
plt.show()

The plot also reveals a downward trend and a weekly rhythm.

It is equally simple to plot data that lives in 3-dimensional arrays. The TweetsCOV19 webpage also shows time trends of the mean sentiment category scores of tweets in which prominent Twitter users are mentioned up until April 2020. We have retrieved the May 2020 data for mentions in `mention_list` and stored it in `days_mentions_sentiments_wide_array`. This time we make a first loop through all mentions from `mention_categories` (line 1) and a second loop through all sentiments from `sentiment_categories` (line 3). We use datetime objects stored in `day_categories` as x values and draw the y values from the array:

In [None]:
for mention_index in range(len(mention_categories)):
    plt.figure(figsize=[12, 2])
    for sentiment_index in range(len(sentiment_categories)):
        plt.plot(day_categories, days_mentions_sentiments_wide_array[:, mention_index, sentiment_index], label=sentiment_categories[sentiment_index])
    plt.legend()
    plt.title('@' + mention_categories[mention_index])
    plt.ylabel('Sentiment')
    plt.show()

Except for @who, the average sentiment is slightly negative. This means that the language of tweets that mention these users tends to be laden with negative emotions – it does not mean that negative sentiments are voiced about the mentioned users.

Consult Sundnes (<a href='sundnes_introduction_2020'>2020</a>, ch. 6) for more examples on how to combine NumPy and Matplotlib.

#### Distributions

Social media data is known to often be very skewed (*i.e.*, the mean does not represent the data). Indeed, we have already seen that some users have tens of millions of followers. For quantitative analysis, especially for the kinds of analyses performed in [Session D5: Statistics & supervised machine learning](), it is very important to know how variables are distributed. Boxplots can be a first step to assessing distributions. In the following, we are interested in the distributions of the 'followers_max' and 'friends_max' variables in the `users` table. To produce Seaborn boxplots, `melt()` the subtable with those two columns (*i.e.*, make it long), ...

In [None]:
followers_friends = pd.melt(users[['followers_max', 'friends_max']])
followers_friends.head()

then plot:

In [None]:
plt.figure(figsize=[3, 3])
sns.boxplot(x='variable', y='value', data=followers_friends)
plt.yscale('log')

Both variables are extremely skewed (note the logarithmic y-axis). Often such variables are transformed into their logarithm to make them behave better (add 1 before taking the log, then users with a value of 0 will keep it):

In [None]:
log_users = users[['followers_max', 'friends_max']].copy()
log_users = np.log10(log_users[['followers_max', 'friends_max']] + 1).round(4)
log_users.columns = ['log_followers_max', 'log_friends_max']
log_users.head()
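
The effect of adding 1 before taking the logarithm can be checked on a few toy counts:

```python
import numpy as np

toy_counts = np.array([0, 9, 99, 999])

# Adding 1 maps a count of 0 to log10(1) = 0 instead of minus infinity
toy_logged = np.log10(toy_counts + 1)
print(toy_logged)

# The transformation is reversible
toy_restored = 10 ** toy_logged - 1
print(toy_restored)
```

Keep in mind that the shift matters most for small counts: 0 and 9 followers end up one unit apart, just like 99 and 999.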

Seaborn's [`histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html) creates histograms and allows adding [kernel density estimates](https://en.wikipedia.org/wiki/Kernel_density_estimation). In line 3, we take a random sample from the data because density estimation takes quite long:

In [None]:
plt.figure(figsize=[3, 3])
sns.histplot(
    data=log_users.sample(n=10000, random_state=42), 
    bins=20, 
    kde=True
)
plt.show()

Both logged variables look normally distributed. That means it is a good hypothesis that the untransformed variables are [lognormally](https://en.wikipedia.org/wiki/Log-normal_distribution) distributed. We can test this hypothesis with the [powerlaw](https://github.com/jeffalstott/powerlaw) library.

In [None]:
# If you are running this session in Google Colab, install this package
#!pip install powerlaw==1.5

In [None]:
import powerlaw
powerlaw.__version__

First, we fit a number of candidate functions to the whole range (`xmin=1`) of the data using [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation):

In [None]:
fit_followers_max = powerlaw.Fit(data=users['followers_max'], xmin=1)
fit_friends_max = powerlaw.Fit(data=users['friends_max'], xmin=1)

We plot two of these candidate functions, the lognormal and a power law:

In [None]:
plt.figure(figsize=[3, 3])
fig = fit_followers_max.plot_pdf(marker='o', linestyle='', label='data')
fit_followers_max.lognormal_positive.plot_pdf(linestyle='-', ax=fig, label='lognormal')
fit_followers_max.power_law.plot_pdf(linestyle='--', ax=fig, label='power_law')
plt.legend()
plt.xlabel('followers_max')
plt.ylabel('PDF')
plt.show()

In [None]:
plt.figure(figsize=[3, 3])
fig = fit_friends_max.plot_pdf(marker='o', linestyle='', label='data')
fit_friends_max.lognormal_positive.plot_pdf(linestyle='-', ax=fig, label='lognormal')
fit_friends_max.power_law.plot_pdf(linestyle='--', ax=fig, label='power_law')
plt.legend()
plt.xlabel('friends_max')
plt.ylabel('PDF')
plt.show()

The lognormal seems to be a better fit for the data in both cases. We can test this using log-likelihood ratios. `distribution_compare()` returns two values: the log-likelihood ratio and its p-value. In both cases, the ratio (the first value in the brackets) is extremely large, and the p-value (the second value in the brackets) indicates that it differs significantly from 0. A large, significant positive ratio means that the first distribution, in these cases the lognormal distribution, is a better fit to the data:

In [None]:
fit_followers_max.distribution_compare('lognormal_positive', 'power_law')

In [None]:
fit_friends_max.distribution_compare('lognormal_positive', 'power_law')

We have found that the lognormal describes both variables much better than a power law does.

<div class='alert alert-block alert-warning'>
<b>Additional resources</b>

These results from fitting functions to the data mean that the variables are certainly not **power-law distributed**. Knowing about power-law behavior is important because, depending on their exponent, power laws do not have a finite variance or even a finite mean, which is statistically problematic. To learn about the importance of power-law distributions, consult Clauset *et al.* (<a href='clauset_powerlaw_2009'>2009</a>).
</div>

#### Bivariate relationships

Identifying bivariate relationships or correlations is another part of EDA. Seaborn's [`jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) function creates joint and marginal views on two variables. The `kind` parameter selects among several views (*e.g.*, 'scatter', 'kde', 'hist', or 'hex'). Here, we show histograms (line 5):

In [None]:
plot = sns.jointplot(
    data = log_users.sample(n=10000, random_state=42), 
    x = 'log_followers_max', 
    y = 'log_friends_max', 
    kind = 'hist', 
    joint_kws = dict(bins=40), 
    marginal_kws = dict(bins=20)
)
plot.fig.set_figwidth(3)
plot.fig.set_figheight(3)

If you have many variables whose relationships you want to explore, Seaborn offers the [`pairplot()`]() function. The diagonal of such a plot will be filled with the univariate distribution, and the kind of view can be set separately for the univariate (`diag_kind` parameter) and bivariate cases. This time, we plot the relationships for the four numerical variables of the `tweets` table (excluding the sentiment scores). Since they are very skewed, we log them first:

In [None]:
log_tweets = tweets[['followers', 'friends', 'retweets', 'favorites']].copy()
log_tweets = np.log10(log_tweets[['followers', 'friends', 'retweets', 'favorites']] + 1)
log_tweets.columns = ['log_followers', 'log_friends', 'log_retweets', 'log_favorites']

In [None]:
plot = sns.pairplot(
    data = log_tweets.sample(n=10000, random_state=42), 
    height = 2, 
    kind = 'hist', 
    diag_kind = 'hist'
)
plot.fig.set_figwidth(6)
plot.fig.set_figheight(6)

This concludes Session A3: Scientific computing and data visualization. Now that we have an idea of how to manage, process, and explore our data in a research pipeline, we move to the first step of the data life cycle. [Session B1: API Harvesting]() and [Session B2: Web scraping]() are dedicated to data collection.

## Commented references

<a id='batagelj_on_2013'></a>
Batagelj, V. & Cerinšek, M. (2013). On bibliographic networks. *Scientometrics* 96:845–864. https://doi.org/10.1007/s11192-012-0940-1. *A systematic treatise about various ways matrices can be normalized.*

<a id='clauset_powerlaw_2009'></a>
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). "Power-law distributions in empirical data". *SIAM Review* 51:661–703. https://doi.org/10.1137/070710111. *Review about the meaning of power laws and how to identify them using maximum likelihood estimation.*

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on data visualization. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

<a id='sundnes_introduction_2020'></a>
Sundnes, J. (2020). *Introduction to Scientific Programming with Python*. Springer. https://doi.org/10.1007/978-3-030-50356-7. *An openly accessible introduction covering the basic functionalities of Python. The [website](https://sundnes.github.io/python_intro/) offers the code used in the book.*

<a id='weidmann_data_2022'></a>
Weidmann, N. B. (2022). *Data Management for Social Scientists: From Files to Databases*. Cambridge University Press. *A fresh account of data transformation using R.*

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: Haiko Lietz

Contributors: Pouria Mirelmi & N. Gizem Bacaksizlar Turbic

Acknowledgements: Olga Zagovora

Version date: 29. March 2023

License: ...
</div>