In this notebook, we label users as supporters of Clinton or Trump, build the graph of comments in `/r/politics` among the labelled users, and analyze it according to the labels and the sentiment of the comments.

The notebook is divided into two parts: the first computes the graphs, the second analyzes them.
The second part can be executed without the first if the necessary processed data files are already available.

The main outputs are:

*First part:*
1. A file with the labelled users (merged with their geolocation from Balsamo et al., WebConf 2019).
2. A file for each graph.

*Second part:*
3. The interaction matrix, with the number of edges for each combination of labels for each graph.
4. The average sentiment for each combination of labels for `r/politics`.

#### Imports

In [1]:
%matplotlib inline

from collections import Counter
import json
from glob import glob

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm  # replaces the deprecated `tqdm_notebook` alias

#### Input and output paths

You can download these files from [https://files.pushshift.io/reddit/](https://files.pushshift.io/reddit/).

In [2]:
posts_path = '../data/' #'/data/big/reddit/submissions/2016/RS_2016-*.bz2'
comments_path = '../data/' #'/data/big/reddit/comments/2016/RC_2016-*.bz2'

In [3]:
OUTPUT_PATH = '../data/processed/'

#### Definition of the home communities

In [4]:
SUBREDDIT_HOME_TRUMP = {'The_Donald'}
SUBREDDIT_HOME_CLINTON = {'hillaryclinton', 'HillaryForAmerica'}

# Label Trump and Clinton users using posts

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

**This part can be skipped.**
The output can be recovered just by reloading the files.

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

Extra imports, just for this part:

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from pyspark import SparkContext, SparkConf, SQLContext
sc = SparkContext()

In [None]:
# Ensure there is one file per month
assert len(glob(comments_path)) == 12
assert len(glob(posts_path)) == 12

#### Obtain all the users that posted in any of the home communities:

In [None]:
posts_rdd = sc.textFile(posts_path).map(json.loads)
trump_posts_rdd   = posts_rdd.filter(lambda x: 'subreddit' in x and x['subreddit'] in SUBREDDIT_HOME_TRUMP)
clinton_posts_rdd = posts_rdd.filter(lambda x: 'subreddit' in x and x['subreddit'] in SUBREDDIT_HOME_CLINTON)

In [None]:
trump_ncom_avgscore, clinton_ncom_avgscore = [
    rdd
     .filter(lambda x: x['author'] not in {'[deleted]', })
     .map(lambda x: (x['author'], x['score']))
     .groupByKey() # Result: author -> [post_score_0, ..., post_score_N]
     .map(lambda x: (x[0], len(x[1]), sum(x[1]) / len(x[1])))
     # Result: author, number of posts, average score
                                              
    for rdd in (trump_posts_rdd, clinton_posts_rdd)
]
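
The aggregation above can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming a hypothetical list of toy posts (`toy_posts`) and a hypothetical helper (`posts_and_avg_score`); it mirrors the `groupByKey` step but is not the code used to produce the results.

```python
from collections import defaultdict

# Hypothetical toy posts; only the fields used by the aggregation are included.
toy_posts = [
    {"author": "alice", "score": 10},
    {"author": "alice", "score": 2},
    {"author": "bob", "score": -3},
    {"author": "[deleted]", "score": 5},
]

def posts_and_avg_score(posts):
    """Return {author: (number_of_posts, average_score)}, skipping deleted authors."""
    scores_by_author = defaultdict(list)
    for post in posts:
        if post["author"] != "[deleted]":
            scores_by_author[post["author"]].append(post["score"])
    return {
        author: (len(scores), sum(scores) / len(scores))
        for author, scores in scores_by_author.items()
    }

stats = posts_and_avg_score(toy_posts)
```

Each author ends up with a pair (number of posts, average score), which corresponds to `x[1]` and `x[2]` in the Spark tuples.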

#### We need to disambiguate the intersection and to remove trolls: let's look at the reddit scores

In [None]:
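# Keep only the authors whose average post score is at least 1 (a simple troll filter).
# Note: the value stored for each author is x[1], the number of posts, not the average score.
u2trumpscore = dict(trump_ncom_avgscore  .filter(lambda x: x[2] >= 1).map(lambda x: (x[0], x[1])).collect())
u2clintscore = dict(clinton_ncom_avgscore.filter(lambda x: x[2] >= 1).map(lambda x: (x[0], x[1])).collect())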

In [None]:
intersection = set(u2clintscore.keys()) & set(u2trumpscore.keys())

For the users in the intersection, how are scores distributed?

In [None]:
dt_hc_scores = np.array([(u2trumpscore[u], u2clintscore[u]) for u in list(intersection)])

par = plt.hist2d(*dt_hc_scores.T, bins=np.arange(1, 10, 1), norm=matplotlib.colors.LogNorm())

plt.xlabel("Trump score")
plt.ylabel("Clinton score")
plt.colorbar()

Very few users have a high score in _both_, so we assign each user to the side where their score is strictly higher (users with equal scores are left unlabelled):

In [None]:
rep = {u for u, score in u2trumpscore.items() if (u not in u2clintscore or u2clintscore.get(u) < score)}
dem = {u for u, score in u2clintscore.items() if (u not in u2trumpscore or u2trumpscore.get(u) < score)}

In [None]:
len(rep), len(dem), len(rep & dem)
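
The labelling rule can be checked on a small example. This is a minimal sketch, assuming hypothetical toy dictionaries (`toy_trump`, `toy_clinton`) and a hypothetical helper (`label_users`). It uses `.get(u, 0)` instead of the explicit membership test, which gives the same result as long as all stored values are positive.

```python
# Hypothetical toy inputs: author -> value stored for each side
# (in the code above this value is x[1], the number of posts).
toy_trump = {"u1": 5, "u2": 1, "u3": 2, "u4": 3}
toy_clinton = {"u2": 4, "u3": 2, "u5": 1}

def label_users(trump_counts, clinton_counts):
    """Assign each user to the side where their value is strictly higher."""
    rep = {u for u, n in trump_counts.items() if n > clinton_counts.get(u, 0)}
    dem = {u for u, n in clinton_counts.items() if n > trump_counts.get(u, 0)}
    return rep, dem

toy_rep, toy_dem = label_users(toy_trump, toy_clinton)
```

Here `u3` has the same value on both sides and receives no label, so the two sets are disjoint by construction.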

#### Save these labels, together with Duilio's geolocation data

In [None]:
authors = pd.DataFrame([(u, 'R') for u in rep] + [(u, 'D') for u in dem], columns=['author', 'label'])

In [None]:
author_location_original = pd.read_csv("../data/raw/author_locations_16_17_new_opiates.csv.gz")

In [None]:
author_label_state = pd.merge(authors, author_location_original[['author', 'state']], how='left', on='author')

In [None]:
author_label_state.to_csv("../data/processed/author-label-state.csv.bz2", compression='bz2', index=False)

In [None]:
authors_set = set(author_label_state.author)

# Build the graph of comments in `/r/politics` for the labelled users

We also compute and save the sentiment of each comment (that is, the sentiment expressed by the child when replying to the parent).

In [None]:
print(comments_path)

comments_rdd = sc.textFile(comments_path).map(json.loads)

In [None]:
def create_comment_graph_for_subreddit(comments_rdd, authors_set, subreddits_list, graph_output_path):
    """
    Take all the comments in the given RDD, and select only those
    in the given list of subreddits AND between authors included in authors_set.
    Saves this as a graph with parent_author, child_author, child_sentiment. 
    
    TAKES A VERY LONG TIME!
    """
    
    rdd_selected_comments = comments_rdd.filter(
        lambda x: 
                'author' in x.keys() and
                x['author'] in authors_set and
                'subreddit' in x.keys() and
                x['subreddit'] in subreddits_list
        )

    rdd_selected_parent_author = (rdd_selected_comments
        .filter(lambda x: 'parent_id' in x)
        .map(lambda x: (x['parent_id'].replace('t1_',''), (x['author'], x['body'])))
    )

    # Result: parent_id -> (author, body)

    analyzer = SentimentIntensityAnalyzer()
    edges_rdd = (rdd_selected_comments
        .map(lambda x: (x['id'], x['author']))
        .join(rdd_selected_parent_author) # Result: parent_id -> [ parent_author, (child_author, child_body) ]
        .map(lambda x: (x[1][0], x[1][1][0], analyzer.polarity_scores(x[1][1][1])['compound']))
    )

    # Result: parent_author, child_author, child_sentiment

    edges_rdd.map(lambda x: ','.join(str(d) for d in x)).repartition(1).saveAsTextFile(
        graph_output_path + "-folder",
        compressionCodecClass='org.apache.hadoop.io.compress.BZip2Codec')
    df = pd.read_csv(graph_output_path + "-folder/part-00000.bz2",
           names=['parent', 'child', 'sentiment'])
    df.to_csv(graph_output_path, index=False)

In [None]:
create_comment_graph_for_subreddit(comments_rdd, authors_set, {'politics'}, 
                                   OUTPUT_PATH + 'parent_child_sentiment_edges_politics.csv.bz2')

The graph is saved as (parent, child, sentiment) triplets.
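
The parent-child join at the core of `create_comment_graph_for_subreddit` can be sketched in plain Python. This is a minimal illustration, assuming hypothetical toy comments (`toy_comments`) and a hypothetical helper (`build_edges`). It keeps the child body where the Spark code computes the VADER compound score, so it runs without extra dependencies.

```python
# Hypothetical toy comments with the fields used by the join.
toy_comments = [
    {"id": "c1", "parent_id": "t3_post", "author": "alice", "body": "First!"},
    {"id": "c2", "parent_id": "t1_c1", "author": "bob", "body": "I agree."},
    {"id": "c3", "parent_id": "t1_c2", "author": "alice", "body": "Thanks."},
    {"id": "c4", "parent_id": "t1_zz", "author": "carol", "body": "Orphan reply."},
]

def build_edges(comments):
    """Return (parent_author, child_author, child_body) for replies to known comments."""
    author_by_id = {c["id"]: c["author"] for c in comments}
    edges = []
    for child in comments:
        parent_id = child["parent_id"].replace("t1_", "")
        # Replies to submissions (t3_...) or to comments outside the selection
        # have no matching parent and are dropped, as in an inner join.
        if parent_id in author_by_id:
            edges.append((author_by_id[parent_id], child["author"], child["body"]))
    return edges

toy_edges = build_edges(toy_comments)
```

Only comment-to-comment replies whose parent is also in the selection produce an edge; top-level comments and replies to unselected comments are dropped.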

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

**From here on, the notebook can be executed without the previous part**, just by reloading the files:

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

# Count interactions in the `r/politics` graph by label

In [5]:
edges = pd.read_csv(OUTPUT_PATH + "parent_child_sentiment_edges_politics.csv.bz2")

author_label_state = pd.read_csv("../data/processed/author-label-state.csv.bz2")

Label each edge:

In [6]:
author2label = dict(author_label_state[['author', 'label']].values)

In [7]:
def attach_labels_to_graph_dataframe(edges, author2label):
    edges['lparent'] = edges.parent.map(author2label.get)
    edges['lchild'] = edges.child.map(author2label.get)
    return edges

In [8]:
edges = attach_labels_to_graph_dataframe(edges, author2label)

**How many edges do we have for each label?**

In [9]:
edges.groupby('lparent').parent.count()

lparent
D    246838
R    469927
Name: parent, dtype: int64

In [10]:
edges.groupby('lchild').parent.count()

lchild
D    247052
R    469713
Name: parent, dtype: int64

**The interaction matrix for `/r/politics`!**

In [11]:
edges.groupby(['lparent', 'lchild']).child.count()

lparent  lchild
D        D          69800
         R         177038
R        D         177252
         R         292675
Name: child, dtype: int64

As joint probabilities P(lparent, lchild):

In [12]:
edges.groupby(['lparent', 'lchild']).child.count() / len(edges)

lparent  lchild
D        D         0.097382
         R         0.246996
R        D         0.247294
         R         0.408328
Name: child, dtype: float64
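
The normalisation can be reproduced by hand from the counts printed above. The sketch below recomputes the joint probabilities and, as an additional view that is not part of the original analysis, the probability of the child label given the parent label. The variable names are illustrative.

```python
# Edge counts copied from the interaction matrix above: (lparent, lchild) -> count.
counts = {
    ("D", "D"): 69800,
    ("D", "R"): 177038,
    ("R", "D"): 177252,
    ("R", "R"): 292675,
}
total = sum(counts.values())

# Joint probability of each (parent label, child label) pair.
joint = {pair: n / total for pair, n in counts.items()}

# Number of edges per parent label (row sums of the matrix).
parent_totals = {}
for (lparent, _), n in counts.items():
    parent_totals[lparent] = parent_totals.get(lparent, 0) + n

# Conditional probability of the child label given the parent label.
child_given_parent = {
    (lparent, lchild): n / parent_totals[lparent]
    for (lparent, lchild), n in counts.items()
}
```

The row sums in `parent_totals` match the per-label edge counts obtained earlier with `edges.groupby('lparent')`.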

**Let's look also at the average sentiment!**

In [13]:
edges.groupby(['lparent', 'lchild']).sentiment.mean()

lparent  lchild
D        D         0.057470
         R         0.007168
R        D         0.011035
         R         0.012603
Name: sentiment, dtype: float64

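A possible sign of emotional contagion: DD is by far the most positive combination (0.057), while RR (0.013) is close to the two cross-label combinations (0.007 and 0.011). For both labels, replies to the other group are on average less positive than replies within the same group.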
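
The grouped average can be reproduced on a toy edge list with plain Python. This is a minimal sketch, assuming hypothetical labels and edges (`toy_labels`, `toy_edges`) and a hypothetical helper (`mean_sentiment_by_label_pair`); the sentiment values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: author labels and (parent, child, sentiment) edges.
toy_labels = {"alice": "D", "bob": "R", "carol": "D"}
toy_edges = [
    ("alice", "bob", 0.5),
    ("alice", "bob", -0.1),
    ("bob", "carol", 0.2),
    ("carol", "alice", 0.8),
]

def mean_sentiment_by_label_pair(edges, labels):
    """Return {(parent_label, child_label): mean child sentiment}."""
    sentiments = defaultdict(list)
    for parent, child, sentiment in edges:
        sentiments[(labels[parent], labels[child])].append(sentiment)
    return {pair: sum(vals) / len(vals) for pair, vals in sentiments.items()}

toy_means = mean_sentiment_by_label_pair(toy_edges, toy_labels)
```

As in the notebook, the sentiment attached to an edge is that of the child comment, so the pair `("D", "R")` averages what `R` users wrote when replying to `D` users.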