# Practice Session 04: Networks from text

In this session we will learn to construct a network from a set of implicit relationships. The relationships that we will study are between accounts in Twitter, a micro-blogging service.

We will create two networks: one directed and one undirected.

* In the **directed mention network**, we will say that there is a link of weight *w* from account *x* to account *y* if account *x* has re-tweeted (re-posted) or mentioned account *y* a total of *w* times.

* In the **undirected co-mention network**, we will say that there is a link of weight *w* between accounts *x* and *y*, if both accounts have been mentioned together in *w* tweets.

The input material you will use is a file named `CovidLockdownCatalonia.json.gz` available in the [data/](data/) directory. This is a gzip-compressed file, which you can de-compress using the `gunzip` command. The file contains about 35,500 messages ("tweets") posted between March 13th, 2020, and March 14th, 2020, containing a hashtag or keyword related to COVID-19, and posted by a user declaring a location in Catalonia.

The tweets are in a format known as [JSON](https://en.wikipedia.org/wiki/JSON#Example). Python's JSON library takes care of translating it into a dictionary.

**How was this file obtained?** This file was obtained from [CrisisNLP](https://crisisnlp.qcri.org/covid19), a website that provides collections of COVID-19 tweets. However, it only provides the identifier of each tweet, known as a tweet-id.

To recover the entire tweet, a process commonly known as *re-hydration* is needed: you query a Twitter API with the tweet-id and obtain the full tweet in return. This can be done with a little bit of programming or using software such as [twarc](https://github.com/DocNow/twarc#dehydrate).

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

# 0. Code snippets you may need

## 0.1. Iterating through tweets on disk

We do not need to uncompress this file (it is about 236 MB uncompressed, but only 31 MB compressed).

```python
with gzip.open(COMPRESSED_INPUT_FILENAME, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        author = tweet["user"]["screen_name"]
        message = tweet["full_text"]
        print("%s: '%s'" % (author, message))
```

If instead you want to open the uncompressed file, first `gunzip` it and then do:

```python
with io.open(INPUT_FILENAME, "r", encoding="utf-8") as input_file:
```

The rest of the code stays the same.

*Tip*: place all the `import` commands in a single cell at the top of your notebook.
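To try the reading loop without the full dataset, here is a minimal self-contained sketch. It assumes the input has one JSON document per line, as the loop above implies. The two sample tweets, the account names, and the temporary file path are hypothetical and only carry the two fields used in this session.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample data: two tweets with only the fields used in this session.
sample_tweets = [
    {"user": {"screen_name": "alice_bcn"}, "full_text": "RT @Jordi: stay home"},
    {"user": {"screen_name": "bob_gi"}, "full_text": "Thanks @Xavier and @Jordi"},
]

# Hypothetical location for the sample file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")

# Write one JSON document per line, mimicking the layout of the input file.
with gzip.open(sample_path, "wt", encoding="utf-8") as output_file:
    for tweet in sample_tweets:
        output_file.write(json.dumps(tweet) + "\n")

# Read the file back one line at a time, without loading it all in memory.
authors = []
with gzip.open(sample_path, "rt", encoding="utf-8") as input_file:
    for line in input_file:
        tweet = json.loads(line)
        authors.append(tweet["user"]["screen_name"])

print(authors)
```

Because the file object is iterated line by line, only one tweet is held in memory at any time.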

## 0.2. Extracting mentions

What we need now is a function to extract mentions, so that if we give it, for instance, `RT @Jordi: check this post by @Xavier`, it returns the list `["Jordi", "Xavier"]`.

This is such a function:

```python
def extract_mentions(text):
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)
```

Note that you will need an `import re` command at the beginning of the file, together with the other imports. You may need to execute the cell that contains the import by pressing `Shift-Enter` on it.
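A quick way to see what this regular expression does and does not capture is to run it on a few strings. The sketch below repeats the function so it runs on its own; the test strings are made up for illustration.

```python
import re

def extract_mentions(text):
    # Same pattern as above: "@" followed by 5 to 20 letters, digits or underscores.
    return re.findall("@([a-zA-Z0-9_]{5,20})", text)

# The example from the text.
example = extract_mentions("RT @Jordi: check this post by @Xavier")

# Handles shorter than 5 characters are not matched by this pattern.
short = extract_mentions("hello @abc")

# Anything that looks like "@word" is matched, including part of an e-mail address.
email = extract_mentions("mail me at user@example.com")

print(example, short, email)
```

Note also that the function returns names exactly as written, so `@Jordi` and `@jordi` would become two different nodes unless you decide to normalize case yourself.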

You can now print all the links between accounts by doing:

```python
mentions = extract_mentions(message)
for mention in mentions:
    print("%s mentioned %s" % (author, mention))
```

## 0.3. Counting mentions

To count how many times each mention happens, you will keep a dictionary:

```python
mentions_counter = {}
```

Each key in the dictionary will be a tuple `(author, mention)` where `author` is the username of the person who writes the message, and `mention` the username of someone who is mentioned in the message. To update the dictionary, use this code while you are reading the input file:

```python
for mention in mentions:
    key = (author, mention)
    if key in mentions_counter:
        mentions_counter[key] += 1
    else:
        mentions_counter[key] = 1
```
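As an alternative sketch, `collections.Counter` from the standard library returns 0 for missing keys, which removes the need for the `if`/`else`. The `observations` list below is hypothetical and stands in for the `(author, mentions)` pairs you would obtain while reading the file.

```python
from collections import Counter

# Hypothetical (author, mentions) pairs, one per tweet.
observations = [
    ("alice_bcn", ["Jordi", "Xavier"]),
    ("alice_bcn", ["Jordi"]),
    ("bob_gi", ["Jordi"]),
]

mentions_counter = Counter()
for author, mentions in observations:
    for mention in mentions:
        # Missing keys start at 0, so no membership test is needed.
        mentions_counter[(author, mention)] += 1

print(mentions_counter)
```

Either approach produces the same counts; a `Counter` is a dictionary subclass, so code that iterates over the keys works unchanged.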

## 0.4. Writing a CSV file

To write a CSV file, assuming you created the ``mentions_counter`` data structure above:

```python
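# newline="" lets the csv module control line endings, as its documentation recommends
with io.open(OUTPUT_FILENAME, "w", encoding="utf-8", newline="") as output_file: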
    writer = csv.writer(output_file, delimiter='\t', quotechar='"')
    writer.writerow(["Source", "Target", "Weight"])
    for key in mentions_counter:
        author = key[0]
        mention = key[1]
        weight = mentions_counter[key]
        writer.writerow([author, mention, weight])
```
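To preview what the resulting file looks like without touching the disk, the same writer can be pointed at an in-memory buffer. This is only an illustrative sketch: the counts are hypothetical, and `io.StringIO` replaces the real output file.

```python
import csv
import io

# Hypothetical counts, in the same shape as mentions_counter above.
mentions_counter = {("alice_bcn", "Jordi"): 2, ("bob_gi", "Xavier"): 1}

# An in-memory text buffer standing in for the output file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar='"')
writer.writerow(["Source", "Target", "Weight"])
for (author, mention), weight in mentions_counter.items():
    writer.writerow([author, mention, weight])

# One header row, then one tab-separated row per edge.
rows = buffer.getvalue().splitlines()
print(rows)
```

The header names `Source`, `Target` and `Weight` are the column names you will see when importing the file into Cytoscape.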

## 0.5. Iterating through all co-mentions

Suppose the mentions in a tweet are in the list ``mentions``; then you can iterate through all pairs of co-mentioned accounts like this:

```python
for mention1 in mentions:
    for mention2 in mentions:
        if mention1 < mention2:
            key = (mention1, mention2)
```
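The comparison `mention1 < mention2` ensures each unordered pair is produced in a single, consistent order. One thing to watch for: if the same account appears twice in a tweet, the nested loop yields the same pair more than once. The sketch below is one possible variation, not the required solution. It removes duplicates first and uses `itertools.combinations`; the `tweets_mentions` list and the `co_mentions_counter` name are hypothetical.

```python
from itertools import combinations

# Hypothetical mention lists, one per tweet; the first has a repeated mention.
tweets_mentions = [
    ["Xavier", "Jordi", "Jordi"],
    ["Jordi", "Xavier", "Montse"],
]

co_mentions_counter = {}
for mentions in tweets_mentions:
    # Deduplicate and sort so each pair appears once per tweet, in a fixed order.
    unique_sorted = sorted(set(mentions))
    for mention1, mention2 in combinations(unique_sorted, 2):
        key = (mention1, mention2)
        co_mentions_counter[key] = co_mentions_counter.get(key, 0) + 1

print(co_mentions_counter)
```

With this approach the weight of a pair equals the number of tweets in which both accounts appear, regardless of how many times each is mentioned within a tweet.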

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. The directed mentions network

Create the **directed mention network**, which has a weighted edge (source, target, weight) if user *source* mentioned user *target* at least once; with *weight* indicating the number of mentions.

Create two files: one containing all edges, and one containing only the edges having *weight* greater than or equal to 2.
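The filtering step can be sketched as a dictionary comprehension over the counts. The data and the `MIN_WEIGHT` and `filtered_counter` names below are hypothetical; writing the result to disk is left to your own code.

```python
# Hypothetical counts, in the same shape as the mentions dictionary from section 0.3.
mentions_counter = {
    ("alice_bcn", "Jordi"): 2,
    ("alice_bcn", "Xavier"): 1,
    ("bob_gi", "Jordi"): 3,
}

MIN_WEIGHT = 2  # threshold taken from the exercise statement

# Keep only the edges whose weight reaches the threshold.
filtered_counter = {
    key: weight
    for key, weight in mentions_counter.items()
    if weight >= MIN_WEIGHT
}

print(filtered_counter)
```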

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
import io
import json
import gzip
import csv
import re
```

```python
# Leave only one of these two following lines, then remove this comment
INPUT_FILENAME = "CovidLockdownCatalonia.json"
COMPRESSED_INPUT_FILENAME = "CovidLockdownCatalonia.json.gz"

OUTPUT_ALL_EDGES_FILENAME = "CovidLockdownCatalonia.csv"
OUTPUT_FILTERED_EDGES_FILENAME = "CovidLockdownCatalonia-min-weight-filtered.csv"
```

<font size="+1" color="red">Replace this cell with your code to create the directed mention network; your code can span multiple cells</font>

## Mentions network visualization

Open the filtered edge file in Cytoscape by importing its CSV file. You may have to set the delimiter to "Tab" in the advanced options when importing.

The file is large so if you want to see all details while zooming out you may have to set ``View > Always show Graphic Details``. Note this makes the program run slower.

Keep only the largest connected component, deleting the rest of the nodes (you can hold shift while drawing a rectangle, to select some nodes).

Style the network:

* Run "Tools > Analyze Network ..." (as a directed graph)
* Style nodes by setting their size proportional to their in-degree
* Style edges by setting their width and color (darker=more) using the *weight* attribute.

Run the ClusterMaker2 plug-in to create a clustering (affinity propagation clustering) of this graph using the *weight* edge attribute. Color nodes according to their cluster, using a discrete mapping. Note that if you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator that you can fine-tune later.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the image as mentions.png and replace this cell with \!\[Mentions graph\]\(mentions.png\) to display your graph.</font>

Look at the Results Panel of the network analyzer. There is interesting information here, particularly the node degree distribution, but you can also find information such as characteristic path length, average degree, etc.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the degree distribution as mentions-degree-distribution.png and replace this cell with \!\[Mentions graph degree distribution\]\(mentions-degree-distribution.png\) to display it.</font>

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. What type of graph is it? Who are the largest-degree nodes in the graph? Is there something interesting in the network analysis results? What else can you say about this graph?</font>

# 2. The undirected co-mention network

```python
OUTPUT_CO_MENTIONS_FILENAME = "CovidLockdownCatalonia-co-mentions.csv"
```

The **undirected co-mention network** connects two accounts if they are both mentioned in the same tweet. The weight of the edge is the number of tweets in which the accounts are co-mentioned.

Create new code to generate the co-mention network by modifying the previous code (make a copy of those cells so you can keep your old code, too). Remember that your code should not try to load all tweets in memory; it should just iterate through the tweets on disk.

Write this to the file ``OUTPUT_CO_MENTIONS_FILENAME`` using the CSV writer, similarly to what we did before.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to create the undirected co-mentions network; your code can span multiple cells</font>

## Co-mentions network visualization

Style the network so that line widths are larger for edges with large weights, and node sizes are larger for nodes with large degrees. Remember you need to run the network analyzer first.

Use ``Layout > Prefuse Force Directed Layout > All Nodes > Weight`` to create a layout by edge weight.

Run the ClusterMaker2 plug-in to create a clustering of this graph using the *Weight* attribute as weight. Use the resulting clusters to color the nodes.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Save the image as co_mentions.png and replace this cell with \!\[Co-mentions graph\]\(co_mentions.png\) to display it.</font>

<font size="+1" color="red">Replace this cell by a brief commentary, in your own words, of what you see in this graph. What type of graph is it? Who are the largest-degree nodes in the graph? How does it compare against the directed mention graph, e.g., with respect to the nodes of larger degree or that are more central? Is there something interesting in the network analysis results? What else can you say about this graph?</font>

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/networks-science-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook
* The mentions ``.csv`` file
* The co-mentions ``.csv`` file

## Extra points available

For more learning and extra points, create a network of hashtags (e.g., #COVID19), in which two hashtags are connected by an edge if they appear in the same tweet. Draw only the top hashtags, those that appear in many tweets. Include in your zip file the .csv file and in this notebook your code and the drawing of the network.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: hashtags graph included</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>