# Parsing and Cleaning PHEME Rumor Dataset Events

Although the Twitter API returns data at the level of individual tweets, we analyze and predict at the thread level, where the metadata and reply-graph structure are richer.


This notebook cleans the tweet-level data generated by `lib/pheme_parsing.py` and aggregates it into tabular thread-level data. It also provides a useful sanity check after making modifications to `lib/pheme_parsing.py`.

## Instructions
1. Update the variable `event` in the cell below with one of the following events:
    1. germanwings-crash
    1. ferguson
    1. ottawashooting
    1. sydneysiege
    1. charliehebdo
1. Run all the cells in this notebook to generate thread-level CSV files in the `data/threads` directory.

In [1]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import numpy as np
import networkx as nx
from functools import reduce
from lib.util import fetch_tweets

event = "germanwings-crash"  # Change this value to clean a different PHEME event

## Parsing and Cleaning Data

We've included a `lib/pheme_parsing.py` script for extracting data from directories of JSON files into tabular, tweet-level CSV files.

The original PHEME dataset consists of JSON files organized into directories by event and category (rumor or non-rumor). The cached CSV files saved in `data/tweets` are tabular versions of this data and include all the fields listed below.
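As a rough illustration of that conversion, the sketch below walks a directory tree of tweet JSON files and emits one flat row per tweet. It is not the code in `lib/pheme_parsing.py`: the function name `tweet_rows`, the directory names (`rumours`, `source-tweet`, `reactions`), and the few fields it keeps are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def tweet_rows(event_dir):
    """Yield one flat dict per tweet JSON file (hypothetical helper).

    Assumed layout:
    <event>/<category>/<thread_id>/<source-tweet|reactions>/<tweet_id>.json
    """
    event_dir = Path(event_dir)
    for path in sorted(event_dir.glob("*/*/*/*.json")):
        category, thread_id, kind = path.parts[-4], path.parts[-3], path.parts[-2]
        with path.open(encoding="utf-8") as fh:
            tweet = json.load(fh)
        yield {
            "event": event_dir.name,
            "thread": thread_id,
            "tweet_id": str(tweet["id"]),
            "is_rumor": int(category == "rumours"),
            "is_source_tweet": int(kind == "source-tweet"),
            "retweet_count": tweet.get("retweet_count", 0),
        }


# Build a tiny throwaway event directory and flatten it
with tempfile.TemporaryDirectory() as tmp:
    event_dir = Path(tmp) / "toy-event"
    files = {
        "rumours/100/source-tweet/100.json": {"id": 100, "retweet_count": 7},
        "rumours/100/reactions/101.json": {"id": 101, "retweet_count": 0},
    }
    for rel, tweet in files.items():
        target = event_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(tweet), encoding="utf-8")
    rows = list(tweet_rows(event_dir))
```

Each row carries the thread id taken from the directory name, so replies can later be grouped with their source tweet.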

At the tweet level we've extracted 58 features.

In [2]:
data = fetch_tweets(event)

## Tweet Level Features

| Name/Column       | Description                   | Type   | Notes  |
|-------------------|-------------------------------|--------| ------ |
| Adjective         | Number of Adjectives          | `int`  |   |
| Adverb            | Number of Adverbs             | `int`  |   |
| Noun              | Number of Nouns               | `int`  |   |
| Pronoun           | Number of Pronouns            | `int`  |   |
| Verb              | Number of Verbs               | `int`  |   |
| capitalratio      | Ratio of capital letters      | `float`|   |
| contentlength     | Length of content             | `int`  |   |
| created           | Datetime Tweet was created    | `float`  | Unix epoch time; values in this data are in milliseconds |
| has_exclaim       | Text has an exclamation mark | `int` | |
| has_place         | Tweet has place location | `int` | |
| has_quest         | Tweet has question mark | `int` | |
| has_quest_or_exclaim | Tweet has question mark or exclamation mark | `int` | |
| has_url_in_text   | Does the tweet have a url in the text | `int` | Either 0 for False or 1 for True |
| is_rumor          | Was this classified as rumor  | "bool" (`int`) | *Classification done by journalists* |
| thread            | Source tweet id               | `str`  |                                                   |
| in_reply_tweet    | Tweet ID in reply to          | `str`  |                                                   |
| event             | Name of the PHEME event       | `str`  | Corresponds to event in the PHEME dataset         |
| tweet_id          | Unique ID for tweet           | `str`  | This field is the ID referenced in `in_reply_tweet`     |
| is_source_tweet   | Is this the source tweet of its thread | "bool" (`int`) |                                                   |
| in_reply_user     | User ID in reply to           | `str`  |                                                   |
| user_id           | Twitter User's ID             | `str`  | This field is the ID referenced in `in_reply_user` |
| tweet_length      | Number of characters in tweet | `int`  |                                                   |
| urls_count        | Number of URLS in tweet       | `int`  |                                                   |
| hashtags_count    | Number of hashtags in tweet   | `int`  |                                                   |
| retweet_count     | Times the tweet was retweeted | `int`  |                                                   |
| favorite_count    | Number of times favorited     | `int`  |                                                   |
| mentions_count    | Number of users mentioned     | `int`  |                                                   |
| is_truncated      | Is this tweet truncated       | "bool" (`int`) | Did User type > 140 characters. [See Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates) |
| has_smile_emoji      | Does Tweet contain "😊"?      | "bool" (`int`) | 😊 is the smile emoji |
| user.tweets_count    | User's tweet total, currently | `int`  | |
| user.verified        | Is Twitter user verified?     | "bool" (`int`) |                                                   |
| user.followers_count | Total number of followers  | `int` | |
| user.listed_count    | Number of lists for this user | `int` | | 
| user.friends_count   | Count of user's Friends | `int` | |
| user.time_zone       | Timezone of the user's Twitter account | `str` | |
| user.desc_length     | Length of the user's biographic description | `int` | |
| user.has_bg_img      | Does user have a profile background image?  | "bool" (`int`) | |
| user.default_pic     | Does the user have the default profile picture | "bool" (`int`) | |
| user.created_at      | Date and time Twitter account was activated | `datetime` | |
| user.utc_dist        | TK | `int` | See [this blog post time and the Twitter API](https://zacharyst.com/2017/04/05/assigning-the-correct-time-to-a-twee) |

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4489 entries, 0 to 4488
Data columns (total 59 columns):
is_rumor                4489 non-null int64
thread                  4489 non-null object
in_reply_tweet          4489 non-null object
event                   4489 non-null object
tweet_id                4489 non-null object
is_source_tweet         4489 non-null int64
in_reply_user           4489 non-null object
user_id                 4489 non-null object
tweet_length            4489 non-null int64
symbol_count            4489 non-null int64
user_mentions           4489 non-null int64
urls_count              4489 non-null int64
media_count             4489 non-null int64
hashtags_count          4489 non-null int64
retweet_count           4489 non-null int64
favorite_count          4489 non-null int64
mentions_count          4489 non-null int64
is_truncated            4489 non-null int64
created                 4489 non-null float64
has_smile_emoji         4489 non-null int64
sensi

Unlike previous projects, we list every feature we used along with the code that extracts it. We hope this notebook is helpful to anyone doing feature engineering with this dataset in the future.

In [4]:
data.head()

Unnamed: 0,is_rumor,thread,in_reply_tweet,event,tweet_id,is_source_tweet,in_reply_user,user_id,tweet_length,symbol_count,...,sentimentscore,Noun,Verb,Adjective,Pronoun,FirstPersonPronoun,SecondPersonPronoun,ThirdPersonPronoun,Adverb,has_url_in_text
0,0,580333314415919104,,germanwings-crash,580333314415919104,1,,7587032,135,0,...,0.0,8,0,2,0,0,0,0,0,1
1,0,580333314415919104,5.803333144159191e+17,germanwings-crash,580333739445764096,0,7587032.0,716785466,63,0,...,-0.195,4,1,1,0,0,0,0,0,0
2,0,580333314415919104,5.803344525026181e+17,germanwings-crash,580372174659313665,0,2668764799.0,288730262,121,0,...,0.0,6,2,4,0,0,0,0,0,0
3,0,580333314415919104,5.803721746593137e+17,germanwings-crash,580380418337345537,0,288730262.0,2668764799,56,0,...,0.2,5,2,0,0,0,0,0,0,0
4,0,580333314415919104,5.803333144159191e+17,germanwings-crash,580334092207722496,0,7587032.0,367907778,139,0,...,-0.6,8,1,2,0,0,0,0,0,1


## Boolean Features

We chose to keep columns whose values are either 1 or 0 as integers, instead of casting them to Boolean types in Python, to facilitate reading and writing CSV files.
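One plausible motivation, which the notebook does not spell out, is that 0/1 integers survive a CSV round trip unambiguously even when the file is read without pandas. The sketch below uses only the standard library; the column names `as_int` and `as_bool` are made up for the illustration.

```python
import csv
import io

# Write the same two flags once as integers and once as Python booleans
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["as_int", "as_bool"])
writer.writerow([0, False])
writer.writerow([1, True])

# Read the rows back: every CSV field comes back as a string
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Integers parse back cleanly
ints = [int(r["as_int"]) for r in rows]

# A naive bool() cast is wrong: any non-empty string, including "False", is truthy
naive_bools = [bool(r["as_bool"]) for r in rows]

# With 0/1 integers the mean is directly the share of rows where the flag is set
share_true = sum(ints) / len(ints)
```

Keeping the flags numeric also means that aggregating them by mean or sum yields ratios and counts without an extra cast.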

Depending on the dataset, some boolean columns have only one unique value.

As a sanity check, the cell below converts these boolean columns to type `bool` and describes them.

In [5]:
bool_columns = ["is_rumor", "is_source_tweet", "is_truncated", 
                "has_smile_emoji", "user.verified", "user.has_bg_img", 
                "user.default_pic", "sensitive", "has_place", "has_coords", "user.notifications"]

data[bool_columns].astype(bool).describe(include="bool")

Unnamed: 0,is_rumor,is_source_tweet,is_truncated,has_smile_emoji,user.verified,user.has_bg_img,user.default_pic,sensitive,has_place,has_coords,user.notifications
count,4489,4489,4489,4489,4489,4489,4489,4489,4489,4489,4489
unique,2,2,1,2,2,2,2,2,2,2,1
top,True,False,False,False,False,True,False,False,False,False,False
freq,2494,4020,4489,4487,4109,3992,2827,4455,4228,4363,4489


Some columns in some PHEME events have only one unique value for all tweets. Instead of dropping them, we'll just be aware of them because they may vary across PHEME datasets.

In [6]:
for col in data.columns:
    # Flag constant columns; NaN counts as a value, matching `unique()`
    if data[col].nunique(dropna=False) == 1:
        print("Warning, column `%s` only has one unique value \"%s\"" % (col, data[col].iloc[0]))



## Thread Level Data

Every tweet belongs to a thread, indexed by the tweet id of the source tweet. We'll aggregate this tweet-level data into thread-level and use this dataset for prediction.
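To make the aggregation step concrete, here is a minimal sketch on a toy table, assuming pandas is available. The data and the variable names `tweets` and `toy` are invented for the example; only the pattern (group by `thread`, apply several aggregations per column, then flatten the resulting two-level column names) mirrors what this notebook does.

```python
import pandas as pd

# Toy tweet-level table: two threads, "a" with three tweets and "b" with two
tweets = pd.DataFrame({
    "thread": ["a", "a", "a", "b", "b"],
    "tweet_id": ["a", "a1", "a2", "b", "b1"],
    "retweet_count": [10, 0, 2, 4, 0],
    "is_rumor": [1, 1, 1, 0, 0],
})

# One row per thread; each column can receive several aggregations
toy = tweets.groupby("thread").agg({
    "retweet_count": ["mean", "sum"],
    "is_rumor": ["max"],
    "tweet_id": ["count"],
})

# The result has two-level column names such as ("retweet_count", "mean");
# join the levels into flat names like "retweet_count_mean"
toy.columns = ["_".join(col) for col in toy.columns.to_flat_index()]
toy = toy.rename(columns={"is_rumor_max": "is_rumor",
                          "tweet_id_count": "thread_length"})
```

After the rename, `toy` has the columns `retweet_count_mean`, `retweet_count_sum`, `is_rumor`, and `thread_length`, indexed by thread.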

Before aggregating, here are some summary statistics of our tweet-level data.

In [13]:
data.describe()

Unnamed: 0,is_rumor,is_source_tweet,tweet_length,symbol_count,user_mentions,urls_count,media_count,hashtags_count,retweet_count,favorite_count,...,sentimentscore,Noun,Verb,Adjective,Pronoun,FirstPersonPronoun,SecondPersonPronoun,ThirdPersonPronoun,Adverb,has_url_in_text
count,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,...,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0
mean,0.55558,0.104478,90.788372,0.0,1.506572,0.185787,0.127423,0.298062,25.290933,9.049677,...,-0.015889,5.637113,1.897082,0.873914,0.417688,0.10871,0.087325,0.291156,0.0,0.239474
std,0.496957,0.305913,39.920378,0.0,1.020736,0.400827,0.33415,0.693938,163.969869,70.400399,...,0.289531,3.006968,1.630293,0.979131,0.774458,0.348459,0.340967,0.594175,0.0,0.42681
min,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,56.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,97.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,5.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,130.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,...,0.0,8.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,151.0,0.0,12.0,2.0,2.0,8.0,4388.0,2541.0,...,1.0,22.0,9.0,7.0,8.0,3.0,4.0,5.0,0.0,1.0


In [8]:
def agg_tweets_by_thread(df):
    
    shared = lambda x: 1 - len(set(x)) / len(x)
    shared.__name__ = "shared"

    # Named pandas aggregations; "var" is the sample variance (ddof=1)
    funcs = ["mean", "sum", "var"]
    agg_props = {
        "favorite_count": funcs,
        "user_mentions": funcs,
        "media_count": funcs,
        "sensitive": funcs,
        "has_place": funcs,
        "has_coords": funcs,
        "retweet_count": funcs,
        "hashtags_count": funcs + [shared],
        "urls_count": funcs,
        "user.tweets_count": funcs,
        "is_rumor": max,
        "tweet_id": len,
        "user.has_bg_img": funcs,
        "has_quest": funcs,
        "has_exclaim": funcs,
        "has_quest_or_exclaim": funcs,
        "user.default_pic": funcs,
        "has_smile_emoji": funcs,
        "user.verified": funcs,
        "user.name_length": funcs,
        "user.handle_length": funcs,
        "user.profile_sbcolor": funcs,
        "user.profile_bgcolor": funcs,
        
        "hasperiod": funcs,
        "number_punct": funcs,
        "negativewordcount" : funcs,
        "positivewordcount" : funcs,
        "capitalratio" : funcs,
        "contentlength" : funcs,
        "sentimentscore" : funcs,
        "Noun" : funcs,
        "Verb" : funcs,
        "Adjective" : funcs,
        "Pronoun" : funcs,
        "Adverb": funcs,
    }
    rename = {
        "tweet_id": "thread_length"
    }

    def g(x):
        # Graph features of the user-to-user reply graph within one thread:
        # the number of connected components and the diameter of the largest one
        d = []
        thread_tweets = list(x["tweet_id"])
        G = nx.from_pandas_edgelist(df[df.tweet_id.isin(thread_tweets)], "user_id", "in_reply_user")
        # nx.connected_component_subgraphs was removed in networkx 2.4;
        # take the largest connected component as a subgraph view instead
        Gc = G.subgraph(max(nx.connected_components(G), key=len))
        d.append(nx.number_connected_components(G))
        d.append(nx.diameter(Gc))
        return pd.Series(d, index=["component_count", "largest_cc_diameter"])
    
    # Step 0: Build graph-based features
    graph = df.groupby("thread").apply(g)
    
    # Step 1: Build simple aggregate features
    agg = df.groupby("thread")\
        .agg(agg_props)\
        .rename(columns=rename)
    
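    # Flatten the two-level column names, e.g. ("favorite_count", "mean") -> "favorite_count_mean"
    agg.columns = ["_".join(x) for x in agg.columns.to_flat_index()]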
    agg = agg.rename(columns={"is_rumor_max": "is_rumor", "thread_length_len": "thread_length"})
    
    # Step 2: Builds some features off the source tweet, which has tweet_id == thread            
    src = df[df["is_source_tweet"] == 1][["thread",
                                          "user.followers_count", 
                                          "user.listed_count",
                                          "user.verified",
                                          "created",
                                          "user.created_at",
                                          "user.tweets_count"]] \
                         .rename(columns={"user.followers_count": "src.followers_count",
                                          "user.listed_count": "src.listed_count",
                                          "user.verified": "src.user_verified",
                                          "user.created_at": "src.created_at",
                                          "user.tweets_count": "src.tweets_total"})
    
    # Step 3: Build features off of the reply tweets
    def f(x):
        d = []
        
        # Get various features from the distribution of times of reply tweet
        d.append(min(x["created"]))
        d.append(max(x["created"]))
        d.append(np.var(x["created"]))
                
        return pd.Series(d, index=["first_resp", "last_resp","resp_var"])
        
    replies = df[df["is_source_tweet"] == 0] \
        .groupby("thread") \
        .apply(f)

    # Merge all per-thread feature tables on the thread id; `graph` was built in Step 0
    dfs = [agg, src, replies, graph]
    thrd_data = reduce(lambda left, right: pd.merge(left, right, on="thread"), dfs)
    
    # Step 4: Add miscellaneous features
    # Remember timestamps increase as time progresses
    # src.created_at < created < first_resp < last_resp
    thrd_data["time_to_first_resp"] = thrd_data["first_resp"] - thrd_data["created"]
    thrd_data["time_to_last_resp"] = thrd_data["last_resp"] - thrd_data["created"]
    
    return thrd_data
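The two graph features, `component_count` and `largest_cc_diameter`, can be hard to picture from the code alone. The sketch below recomputes both on a toy reply graph with plain breadth-first search and no dependencies. The notebook itself uses networkx; the helper names and the user names here are hypothetical, and the sketch assumes at least one edge.

```python
from collections import deque


def reply_graph(edges):
    """Undirected adjacency map from (user, replied_to_user) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, start):
    """Shortest hop count from `start` to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def graph_features(edges):
    """Return (number of components, diameter of the largest component)."""
    adj = reply_graph(edges)
    seen, components = set(), []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            components.append(component)
    largest = max(components, key=len)
    # Diameter: the longest shortest path between any two nodes of the component
    diameter = max(max(bfs_distances(adj, n).values()) for n in largest)
    return len(components), diameter


# alice, carol and dave form one reply chain through bob; erin and frank talk separately
edges = [("alice", "bob"), ("carol", "bob"), ("dave", "carol"), ("erin", "frank")]
component_count, largest_cc_diameter = graph_features(edges)
```

Here the chain alice, bob, carol, dave is the largest component and its two ends are three hops apart, so the thread has two components and a largest-component diameter of 3.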

In [17]:
thrds = agg_tweets_by_thread(data)
thrds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 405 entries, 0 to 404
Columns: 116 entries, thread to time_to_last_resp
dtypes: float64(75), int64(40), object(1)
memory usage: 370.2+ KB


### Thread Level Feature Overview

**Bold features** represent high performing features identified in C. Buntain and J. Golbeck, ["Automatically Identifying Fake News in Popular Twitter Threads"](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8118443&isnumber=8118402) 

This table only describes the dimensions we aggregated. Our actual thread-level data includes many of these dimensions aggregated by mean, sum, and variance.


| Name                | Description                               | Type    | Notes |
| ---                 | ---                                       | ---     | ----- |
| thread              | Tweet ID of the source tweet              | `str`   | |
| favorite_count      | Normalized favorite total                 | `float` | |
| retweet_count       | Normalized retweet total                  | `float` | |
| **hashtags_count**  | Normalized hashtag total                  | `float` | |
| urls_count          | URL total normalized by thread length     | `float`  | |
| user.tweets_count   | Total tweets by thread users              | `float` | |
| event               | Name of PHEME event                       | `str`  | |
| is_rumor            | Either rumor or nonrumor                  | `bool` | |
| thread_length       | Number of tweets in the thread            | `int`  | |
| user.has_bg_img     | Ratio of users who have bg image          | `float`| |
| user.default_pic    | Ratio of users with default profile pic   | `float`| |
| **has_smile_emoji** | Number of smile emojis in the thread      | `int`  | 😊 is the smile emoji |
| user.verified       | Count of verified users in the thread normalized by thread length     | `float`  | |
| **src.followers_count** | The number of followers of the original poster of the thread. | `int` | |
| src.listed_count    | How many lists did source user belong to | `int` | |
| src.user_verified   | Was the source user verified | `int` | |
| src.tweets_total    | How many tweets had the source user issued to that point | `int` | |
| resp_var            | The variance in the timestamps of responses to the source tweet | `float` | |
| src_age             | Difference in src user's creation and tweet creation            | `int`   | Same unit as `created` (epoch milliseconds in this data) |
| time_to_first_resp  | The difference between tweet creation datetime and 1st reply    | `int`   | Same unit as `created` (epoch milliseconds in this data) |
| time_to_last_resp   | The difference between tweet creation datetime and last reply   | `int`   | Same unit as `created` (epoch milliseconds in this data) |
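The timing features in the last rows of the table reduce to simple arithmetic on timestamps. Here is a small worked sketch with made-up values; the numbers are hypothetical, and the results are in whatever unit the `created` column uses.

```python
from statistics import pvariance

# Hypothetical epoch timestamps for one thread, all in the same unit as `created`
source_created = 1_000_000
reply_created = [1_036_000, 1_100_000, 1_590_000]

# Earliest and latest reply, and the spread of the reply timestamps
first_resp = min(reply_created)
last_resp = max(reply_created)
resp_var = pvariance(reply_created)  # population variance, like np.var's default

# Delay between the source tweet and its first and last replies
time_to_first_resp = first_resp - source_created
time_to_last_resp = last_resp - source_created
```

A thread with a single reply has `resp_var` equal to 0 and identical `time_to_first_resp` and `time_to_last_resp`.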

In [10]:
thrds.head()

Unnamed: 0,thread,favorite_count_mean,favorite_count_sum,favorite_count_var,user_mentions_mean,user_mentions_sum,user_mentions_var,media_count_mean,media_count_sum,media_count_var,...,created,src.created_at,src.tweets_total,first_resp,last_resp,resp_var,component_count,largest_cc_diameter,time_to_first_resp,time_to_last_resp
0,580317556516483072,5.0,40,167.714286,2.0,16,8.285714,0.125,1,0.125,...,1427193000000.0,1389095000000.0,14408,1427193000000.0,1427194000000.0,28107390000.0,3,2,36000.0,590000.0
1,580317998147325952,0.666667,2,1.333333,0.666667,2,0.333333,0.0,0,0.0,...,1427194000000.0,1250783000000.0,87411,1427194000000.0,1427194000000.0,1600000000.0,2,2,101000.0,181000.0
2,580318020192571392,0.5,1,0.5,0.5,1,0.5,0.0,0,0.0,...,1427194000000.0,1178640000000.0,22201,1427198000000.0,1427198000000.0,0.0,2,1,4476000.0,4476000.0
3,580318210609696769,4.5,9,40.5,1.0,2,2.0,0.0,0,0.0,...,1427194000000.0,1238071000000.0,13875,1427194000000.0,1427194000000.0,0.0,2,1,193000.0,193000.0
4,580318669483413504,1.8,9,16.2,0.8,4,0.2,0.0,0,0.0,...,1427194000000.0,1254230000000.0,5426,1427198000000.0,1427218000000.0,51941050000000.0,2,2,4258000.0,24336000.0


In [11]:
thrds.shape

(405, 116)

In [12]:
fn = "data/threads/%s.csv" % event
thrds.to_csv(fn, index=False)
"Wrote data to %s" % fn

'Wrote data to data/threads/germanwings-crash.csv'