# Parsing and Cleaning PHEME RNR Dataset Events

This notebook cleans the PHEME data and aggregates it into thread-level data. It also serves as a sanity check on the parsed features. Run all the cells in this notebook to generate thread-level CSV files in the `data/threads` directory.

In [1]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import numpy as np
import networkx as nx
from functools import reduce
from lib.util import fetch_tweets

## Parsing and Cleaning Data
This step takes the raw PHEME rumor dataset and saves it in tabular format as a CSV file. The original PHEME dataset consists of JSON files organized into directories by event and category (rumor or non-rumor). The `fetch_tweets` helper from `lib.util` parses the data, saves it as a CSV file (if necessary), and loads it into this notebook as a Pandas DataFrame from the "cached" CSV file.

In [2]:
gw = fetch_tweets("germanwings-crash")
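
The parse-once, load-from-cache flow described above can be sketched as follows. This is not the actual `lib.util.fetch_tweets` implementation, which is not shown in this notebook; `load_cached`, `parse_fn`, and the cache directory are hypothetical names used only for illustration.

```python
import os
import tempfile

import pandas as pd


def load_cached(event, parse_fn, cache_dir):
    """Return an event's tweets as a DataFrame, parsing only on a cache miss.

    Hypothetical helper: `parse_fn(event)` stands in for whatever walks the
    raw PHEME JSON files and returns a list of flat dicts (one per tweet).
    """
    path = os.path.join(cache_dir, "%s.csv" % event)
    if not os.path.exists(path):
        # Cache miss: parse the raw data once and save it as CSV
        pd.DataFrame(parse_fn(event)).to_csv(path, index=False)
    # Always load from the cached CSV so every run goes through the same path
    return pd.read_csv(path)


# Toy usage with a stand-in parser and a temporary cache directory
calls = []


def fake_parse(event):
    calls.append(event)
    return [{"tweet_id": 1, "event": event}, {"tweet_id": 2, "event": event}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("toy-event", fake_parse, tmp)
    second = load_cached("toy-event", fake_parse, tmp)
```

The second call never invokes the parser, which is the point of keeping a "cached" CSV next to the raw JSON.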

## Tweet Level Features

| Name/Column       | Description                   | Type   | Notes  |
|-------------------|-------------------------------|--------| ------ |
| is_rumor          | Was this classified as rumor  | "bool" (`int`) | *Classification done by journalists* |
| thread            | Source tweet id               | `str`  |                                                   |
| in_reply_tweet    | Tweet ID in reply to          | `str`  |                                                   |
| event             | Name of the PHEME event       | `str`  | Corresponds to event in the PHEME dataset         |
| tweet_id          | Unique ID for tweet           | `str`  | This field is the ID referenced in `in_reply_tweet`     |
| is_source_tweet   | Is this the thread's source tweet | "bool" (`int`) | The source tweet has `tweet_id` equal to `thread` |
| in_reply_user     | User ID in reply to           | `str`  |                                                   |
| user_id           | Twitter User's ID             | `str`  | This field is the ID referenced in `in_reply_user` |
| tweet_length      | Number of characters in tweet | `int`  |                                                   |
| urls_count        | Number of URLS in tweet       | `int`  |                                                   |
| hashtags_count    | Number of hashtags in tweet   | `int`  |                                                   |
| retweet_count     | Times the tweet was retweeted | `int`  |                                                   |
| favorite_count    | Number of times favorited     | `int`  |                                                   |
| mentions_count    | Number of users mentioned     | `int`  |                                                   |
| is_truncated      | Is this tweet truncated       | "bool" (`int`) | Did the user type more than 140 characters? [See Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates) |
| created              | Datetime Tweet was created    | `datetime` | |
| has_smile_emoji      | Does Tweet contain "😊"?      | "bool" (`int`) | 😊 is the smile emoji |
| user.tweets_count    | User's tweet total, currently | `int`  | |
| user.verified        | Is Twitter user verified?     | "bool" (`int`) |                                                   |
| user.followers_count | Total number of followers  | `int` | |
| user.listed_count    | Number of public lists the user is a member of | `int` | |
| user.friends_count   | Number of accounts the user follows | `int` | |
| user.time_zone       | Timezone of the user's Twitter account | `str` | |
| user.desc_length     | Length of the user's biographic description | `int` | |
| user.has_bg_img      | Does user have a profile background image?  | "bool" (`int`) | |
| user.default_pic     | Does the user have the default profile picture? | "bool" (`int`) | |
| user.created_at      | Date and time Twitter account was activated | `datetime` | |
| user.utc_dist        | TK | `int` | See [this blog post on time and the Twitter API](https://zacharyst.com/2017/04/05/assigning-the-correct-time-to-a-twee) |
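
To make a few of the columns above concrete, here is a minimal sketch of how such counts could be derived from a single tweet object. It assumes the Twitter API v1.1 layout (`text`, `id_str`, and an `entities` dict with `hashtags`, `urls`, and `user_mentions` lists). The function name `tweet_features` and the toy tweet are made up for illustration, and the notebook's actual parser may compute these columns differently.

```python
def tweet_features(tweet):
    """Flatten one v1.1-style tweet dict into a few tweet-level features."""
    entities = tweet.get("entities", {})
    text = tweet["text"]
    return {
        "tweet_id": tweet["id_str"],
        # Python code points, which may differ from Twitter's own counting
        "tweet_length": len(text),
        "hashtags_count": len(entities.get("hashtags", [])),
        "urls_count": len(entities.get("urls", [])),
        "mentions_count": len(entities.get("user_mentions", [])),
        # Booleans are stored as 0/1 integers, as in the table above
        "has_smile_emoji": int("😊" in text),
    }


# Made-up tweet used only to exercise the function
toy_tweet = {
    "id_str": "1",
    "text": "Stay safe 😊 #news",
    "entities": {
        "hashtags": [{"text": "news"}],
        "urls": [],
        "user_mentions": [],
    },
}
row = tweet_features(toy_tweet)
```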

## Germanwings Crash

In [3]:
gw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4489 entries, 0 to 4488
Data columns (total 58 columns):
Adjective               4489 non-null int64
Adverb                  4489 non-null int64
Noun                    4489 non-null int64
Pronoun                 4489 non-null int64
Verb                    4489 non-null int64
capitalratio            4489 non-null float64
contentlength           4489 non-null int64
created                 4489 non-null float64
event                   4489 non-null object
favorite_count          4489 non-null int64
has_coords              4489 non-null int64
has_exclaim             4489 non-null int64
has_place               4489 non-null int64
has_quest               4489 non-null int64
has_quest_or_exclaim    4489 non-null int64
has_smile_emoji         4489 non-null int64
has_url_in_text         4489 non-null int64
hasemark                4489 non-null int64
hashtags_count          4489 non-null int64
hasperiod               4489 non-null int64
hasqmark

The `.head` method returns the first five rows of the dataframe.

In [4]:
gw.head()

Unnamed: 0,Adjective,Adverb,Noun,Pronoun,Verb,capitalratio,contentlength,created,event,favorite_count,...,user.name_length,user.notifications,user.profile_bgcolor,user.profile_sbcolor,user.time_zone,user.tweets_count,user.utc_dist,user.verified,user_id,user_mentions
0,0,0,9,0,1,0.051546,13,1427194000000.0,germanwings-crash,10,...,6,0,11453380,16777215,Madrid,107042,0.0,1,8330472,0
1,1,0,8,0,1,0.066038,14,1427195000000.0,germanwings-crash,6,...,7,0,12639981,12639981,,2076,,0,2307392966,1
2,0,0,2,0,0,0.0,5,1427195000000.0,germanwings-crash,0,...,13,0,0,0,,701,,0,2535310842,2
3,0,0,6,0,1,0.0,10,1427195000000.0,germanwings-crash,1,...,13,0,0,0,,701,,0,2535310842,2
4,0,0,4,0,2,0.129412,9,1427194000000.0,germanwings-crash,30,...,12,0,16777215,13421772,London,11447,,1,92771309,0
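
The `created` values above are stored as floats. Judging by their magnitude they look like Unix epoch timestamps in milliseconds; that is an assumption, not something the notebook states. Under that assumption, a short sketch of converting them to timezone-aware datetimes for inspection:

```python
import pandas as pd

# Two values copied from the `created` column shown above
toy = pd.DataFrame({"created": [1427194000000.0, 1427195000000.0]})

# Assumption: the floats are milliseconds since the Unix epoch (UTC)
toy["created_dt"] = pd.to_datetime(toy["created"], unit="ms", utc=True)

# Both timestamps fall on the day of the Germanwings crash
first_day = toy["created_dt"].iloc[0].strftime("%Y-%m-%d")  # "2015-03-24"

# Differences between converted values are ordinary Timedeltas
gap_seconds = (toy["created_dt"].iloc[1] - toy["created_dt"].iloc[0]).total_seconds()
```

Keeping the raw numeric column is still convenient for the variance and difference features computed later; the datetime view is mainly useful for eyeballing the data.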


## Boolean Columns

The `describe` method gives summary information about each column in the dataframe. Each of these columns, except `is_truncated`, should have two unique values.

As a sanity check, the cell below converts these boolean columns to type `bool` and describes them.

In [5]:
bool_columns = ["is_rumor", "is_source_tweet", "is_truncated", 
                "has_smile_emoji", "user.verified", "user.has_bg_img", 
                "user.default_pic", "sensitive", "has_place", "has_coords", "user.notifications"]

gw[bool_columns].astype(bool).describe(include="bool")

Unnamed: 0,is_rumor,is_source_tweet,is_truncated,has_smile_emoji,user.verified,user.has_bg_img,user.default_pic,sensitive,has_place,has_coords,user.notifications
count,4489,4489,4489,4489,4489,4489,4489,4489,4489,4489,4489
unique,2,2,1,2,2,2,2,2,2,2,1
top,True,False,False,False,False,True,False,False,False,False,False
freq,2494,4020,4489,4487,4109,3992,2827,4455,4228,4363,4489
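
One caveat with the check above is that `astype(bool)` coerces any non-zero value to `True`, so a stray `2` would go unnoticed. A complementary check, sketched here on a made-up frame rather than on `gw`, verifies that each supposedly boolean column contains only 0 or 1 before casting:

```python
import pandas as pd

# Toy data; the 2 in user.verified is deliberately invalid
toy = pd.DataFrame({
    "is_rumor": [1, 0, 1],
    "is_truncated": [0, 0, 0],
    "user.verified": [0, 1, 2],
})

# True for each column whose values are all 0 or 1
is_binary = toy.isin([0, 1]).all()

# Columns that would be silently "fixed" by astype(bool)
bad_columns = list(is_binary[~is_binary].index)
```

On the real data the same two lines could be applied to `gw[bool_columns]`, assuming all of those columns are present.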


Some columns in some PHEME events have only one unique value across all tweets. Such columns carry no information, so we drop them wherever they occur.

In [6]:
for col in gw.columns:
    if len(gw[col].unique()) == 1:
        gw.drop(col, inplace=True, axis = 1)
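
The loop above can also be written without explicit iteration. The sketch below uses a toy frame and `DataFrame.nunique(dropna=False)`, which counts `NaN` as a value just as `len(col.unique())` does, so an all-missing column is treated as constant in both versions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "event": ["e", "e", "e"],                  # constant
    "retweet_count": [0, 3, 1],                # varies
    "all_missing": [np.nan, np.nan, np.nan],   # constant (all NaN)
})

# Columns with at most one distinct value, counting NaN as a value
constant_cols = toy.columns[toy.nunique(dropna=False) <= 1]

# Returns a new frame instead of modifying `toy` in place
trimmed = toy.drop(columns=constant_cols)
```

Returning a new frame rather than using `inplace=True` leaves the original available for comparison, which can help when checking which columns were dropped for a given event.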

In [7]:
gw.describe()

Unnamed: 0,Adjective,Noun,Pronoun,Verb,capitalratio,contentlength,created,favorite_count,has_coords,has_exclaim,...,user.has_bg_img,user.listed_count,user.location,user.name_length,user.profile_bgcolor,user.profile_sbcolor,user.tweets_count,user.utc_dist,user.verified,user_mentions
count,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,...,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,2969.0,4489.0,4489.0
mean,0.873914,5.637113,0.417688,1.896859,0.085311,12.709289,1427281000000.0,9.049677,0.028069,0.093785,...,0.889285,1814.252172,0.724215,10.434172,9096543.0,11178110.0,27548.61,3.905945,0.084651,1.506572
std,0.979131,3.006968,0.774458,1.630484,0.083078,6.407709,113922500.0,70.400399,0.165187,0.291562,...,0.313814,11458.251488,0.446959,2.663997,5702912.0,5620359.0,93527.55,3.112538,0.278393,1.020736
min,0.0,0.0,0.0,0.0,0.0,1.0,1427193000000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,3.0,0.0,1.0,0.035398,7.0,1427198000000.0,0.0,0.0,0.0,...,1.0,2.0,0.0,8.0,1710879.0,11061240.0,1810.0,1.0,0.0,1.0
50%,1.0,5.0,0.0,2.0,0.067961,13.0,1427205000000.0,0.0,0.0,0.0,...,1.0,10.0,1.0,10.0,12639980.0,12639980.0,7685.0,5.0,0.0,1.0
75%,1.0,8.0,1.0,3.0,0.112903,18.0,1427405000000.0,1.0,0.0,0.0,...,1.0,46.0,1.0,12.0,12639980.0,15658730.0,25681.0,6.0,0.0,2.0
max,7.0,22.0,8.0,9.0,0.773333,30.0,1427919000000.0,2541.0,1.0,1.0,...,1.0,163464.0,1.0,15.0,16777220.0,16777220.0,4420429.0,12.0,1.0,12.0


## Thread Level Features

* **Bold features** represent high performing features identified in C. Buntain and J. Golbeck, ["Automatically Identifying Fake News in Popular Twitter Threads"](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8118443&isnumber=8118402)
* Features that are normalized are normalized by thread length


| Name                | Description                               | Type    | Notes |
| ---                 | ---                                       | ---     | ----- |
| thread              | Tweet ID of the source tweet              | `str`   | |
| favorite_count      | Normalized favorite total                 | `float` | |
| retweet_count       | Normalized retweet total                  | `float` | |
| **hashtags_count**  | Normalized hashtag total                  | `float` | |
| urls_count          | URL total normalized by thread length     | `float`  | |
| user.tweets_count   | Total tweets by thread users              | `float` | |
| event               | Name of PHEME event                       | `str`  | |
| is_rumor            | Either rumor or nonrumor                  | `bool` | |
| thread_length       | Number of tweets in the thread            | `int`  | |
| user.has_bg_img     | Ratio of users who have bg image          | `float`| |
| user.default_pic    | Ratio of users with default profile pic   | `float`| |
| **has_smile_emoji** | Number of smile emojis in the thread      | `int`  | 😊 is the smile emoji |
| user.verified       | Count of verified users in the thread normalized by thread length     | `float`  | |
| **src.followers_count** | The number of followers of the original poster of the thread. | `int` | |
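| src.listed_count    | `user.listed_count` of the source tweet's author | `int` | |
| src.user_verified   | `user.verified` of the source tweet's author | `int` | |
| src.tweets_total    | `user.tweets_count` of the source tweet's author | `int` | |
| resp_var            | The variance in the timestamps of responses to the source tweet | `float` | |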
| src_age             | Difference in src user's creation and tweet creation            | `int`   | Measured in seconds |
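| time_to_first_resp  | The difference between tweet creation datetime and 1st reply    | `int`   | Same units as `created` (the values shown in this notebook appear to be milliseconds) |
| time_to_last_resp   | The difference between tweet creation datetime and last reply   | `int`   | Same units as `created` (the values shown in this notebook appear to be milliseconds) |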
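
As a small worked example of the table above, the sketch below builds three thread-level features from a made-up tweet table: thread length, a favorite total normalized by thread length (which is simply the per-thread mean), and the time to the first reply. The toy values and the use of pandas named aggregation are illustrative choices, not the notebook's own implementation.

```python
import pandas as pd

# Toy tweets: thread "a" has a source and two replies, thread "b" one reply
toy = pd.DataFrame({
    "thread": ["a", "a", "a", "b", "b"],
    "is_source_tweet": [1, 0, 0, 1, 0],
    "created": [0, 4000, 10000, 1000, 3000],
    "favorite_count": [6, 0, 3, 2, 0],
})

per_thread = toy.groupby("thread").agg(
    thread_length=("created", "size"),
    # sum / thread_length is the mean, i.e. the normalized total
    favorite_count=("favorite_count", "mean"),
)

# Creation time of each thread's source tweet, indexed by thread
src_created = toy[toy["is_source_tweet"] == 1].set_index("thread")["created"]

# Earliest reply per thread; subtraction aligns on the thread index
first_reply = toy[toy["is_source_tweet"] == 0].groupby("thread")["created"].min()
per_thread["time_to_first_resp"] = first_reply - src_created
```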

In [8]:
def agg_tweets_by_thread(df):
    """Aggregate a tweet-level DataFrame into one row of features per thread."""
    
    shared = lambda x: 1 - len(set(x)) / len(x)
    shared.__name__ = "shared"

    funcs = [np.mean, sum, np.var]
    agg_props = {
        "favorite_count": funcs,
        "user_mentions": funcs,
        "media_count": funcs,
        "sensitive": funcs,
        "has_place": funcs,
        "has_coords": funcs,
        "retweet_count": funcs,
        "hashtags_count": funcs + [shared],
        "urls_count": funcs,
        "user.tweets_count": funcs,
        "is_rumor": max,
        "tweet_id": len,
        "user.has_bg_img": funcs,
        "has_quest": funcs,
        "has_exclaim": funcs,
        "has_quest_or_exclaim": funcs,
        "user.default_pic": funcs,
        "has_smile_emoji": funcs,
        "user.verified": funcs,
        "user.name_length": funcs,
        "user.handle_length": funcs,
        "user.profile_sbcolor": funcs,
        "user.profile_bgcolor": funcs,
        
        
        "hasqmark": funcs,
        "hasemark": funcs,
        "hasperiod": funcs,
        "number_punct": funcs,
        "negativewordcount" : funcs,
        "positivewordcount" : funcs,
        "capitalratio" : funcs,
        "contentlength" : funcs,
        "sentimentscore" : funcs,
        "Noun" : funcs,
        "Verb" : funcs,
        "Adjective" : funcs,
        "Pronoun" : funcs,
        #"Adverb": funcs, #was dropped!
    }
    rename = {
        "tweet_id": "thread_length",
        "has_url":"url_proportion",
    }

    def g(x):
        # Add size of largest user-to-user conversation component in each thread        
        d = []
        thread_tweets = list(x["tweet_id"])
        G = nx.from_pandas_edgelist(df[df.tweet_id.isin(thread_tweets)], "user_id", "in_reply_user")
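        # connected_component_subgraphs was removed in NetworkX 2.4;
        # take the largest component's node set and view it as a subgraph
        Gc = G.subgraph(max(nx.connected_components(G), key=len))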
        d.append(nx.number_connected_components(G))
        d.append(nx.diameter(Gc))
        return pd.Series(d, index=["component_count", "largest_cc_diameter"])
    
    # Step 0: Build graph-based features
    graph = df.groupby("thread").apply(g)
    
    # Step 1: Build simple aggregate features
    agg = df.groupby("thread")\
        .agg(agg_props)\
        .rename(columns=rename)
    
    agg.columns = [ "_".join(x) for x in agg.columns.ravel() ]
    agg = agg.rename(columns={"is_rumor_max": "is_rumor", "thread_length_len": "thread_length"})
    
    # Step 2: Builds some features off the source tweet, which has tweet_id == thread            
    src = df[df["is_source_tweet"] == 1][["thread",
                                          "user.followers_count", 
                                          "user.listed_count",
                                          "user.verified",
                                          "created",
                                          "user.created_at",
                                          "user.tweets_count"]] \
                         .rename(columns={"user.followers_count": "src.followers_count",
                                          "user.listed_count": "src.listed_count",
                                          "user.verified": "src.user_verified",
                                          "user.created_at": "src.created_at",
                                          "user.tweets_count": "src.tweets_total"})
    
    # Step 3: Build features off of the reply tweets
    def f(x):
        d = []
        
        # Get various features from the distribution of times of reply tweet
        d.append(min(x["created"]))
        d.append(max(x["created"]))
        d.append(np.var(x["created"]))
                
        return pd.Series(d, index=["first_resp", "last_resp","resp_var"])
        
    replies = df[df["is_source_tweet"] == False] \
        .groupby("thread") \
        .apply(f)


    # Merge the per-thread feature tables; graph features were built in Step 0
    dfs = [agg, src, replies, graph]
    thrd_data = reduce(lambda left, right: pd.merge(left,right, on="thread"), dfs)
    
    # Step 4: Add miscellaneous features
    # Remember timestamps increase as time progresses
    # src.created_at < created < first_resp < last_resp
    thrd_data["time_to_first_resp"] = thrd_data["first_resp"] - thrd_data["created"]
    thrd_data["time_to_last_resp"] = thrd_data["last_resp"] - thrd_data["created"]
    
    return thrd_data
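
The two graph features built in Step 0 are easier to follow on a tiny example. The sketch below uses made-up user IDs and the same NetworkX calls as the function above, except that the largest component is obtained with `nx.connected_components` plus `G.subgraph`, which works on current NetworkX releases.

```python
import networkx as nx
import pandas as pd

# Toy reply edges: u4 -> u3 -> u2 -> u1 form one conversation chain,
# and u6 -> u5 is a separate exchange within the same thread
toy = pd.DataFrame({
    "user_id":       ["u2", "u3", "u4", "u6"],
    "in_reply_user": ["u1", "u2", "u3", "u5"],
})

# Undirected user-to-user graph, one edge per reply
G = nx.from_pandas_edgelist(toy, "user_id", "in_reply_user")

# Number of separate conversations among users
component_count = nx.number_connected_components(G)

# Longest shortest path within the largest conversation
largest = G.subgraph(max(nx.connected_components(G), key=len))
largest_cc_diameter = nx.diameter(largest)
```

Here the chain of four users gives a diameter of 3 and the side exchange adds a second component, so the thread would get `component_count = 2` and `largest_cc_diameter = 3`.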

In [9]:
gw_thrds = agg_tweets_by_thread(gw)
gw_thrds.columns

Index(['thread', 'user.profile_sbcolor_mean', 'user.profile_sbcolor_sum',
       'user.profile_sbcolor_var', 'user.name_length_mean',
       'user.name_length_sum', 'user.name_length_var', 'retweet_count_mean',
       'retweet_count_sum', 'retweet_count_var',
       ...
       'created', 'src.created_at', 'src.tweets_total', 'first_resp',
       'last_resp', 'resp_var', 'component_count', 'largest_cc_diameter',
       'time_to_first_resp', 'time_to_last_resp'],
      dtype='object', length=119)
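
Column names such as `retweet_count_mean` come from flattening the two-level column index that `groupby(...).agg(...)` returns when it is given lists of functions. A minimal sketch on toy data, using `MultiIndex.to_flat_index()` as one way to get the `(column, function)` tuples:

```python
import pandas as pd

toy = pd.DataFrame({
    "thread": ["a", "a", "b"],
    "retweet_count": [1, 3, 5],
    "is_rumor": [1, 1, 0],
})

# Lists of functions produce MultiIndex columns like ("retweet_count", "mean")
agg = toy.groupby("thread").agg({
    "retweet_count": ["mean", "sum"],
    "is_rumor": ["max"],
})

# Join each (column, function) pair into a single flat name
agg.columns = ["_".join(col) for col in agg.columns.to_flat_index()]

# Drop the function suffix where it adds nothing
agg = agg.rename(columns={"is_rumor_max": "is_rumor"})
```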

In [10]:
gw_thrds.describe()

Unnamed: 0,user.profile_sbcolor_mean,user.profile_sbcolor_sum,user.profile_sbcolor_var,user.name_length_mean,user.name_length_sum,user.name_length_var,retweet_count_mean,retweet_count_sum,retweet_count_var,hasqmark_mean,...,created,src.created_at,src.tweets_total,first_resp,last_resp,resp_var,component_count,largest_cc_diameter,time_to_first_resp,time_to_last_resp
count,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0,...,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0,405.0
mean,11581580.0,122067800.0,27842950000000.0,10.462247,114.041975,6.960173,25.957137,262.135802,16013.174265,0.183988,...,1427252000000.0,1255751000000.0,68351.703704,1427254000000.0,1427293000000.0,971230200000000.0,2.439506,2.565432,2610360.0,41240930.0
std,2637869.0,105488900.0,23098040000000.0,1.238673,98.893786,4.902113,38.335612,501.952143,58490.520678,0.319193,...,92934280.0,65125340000.0,67962.053233,95030210.0,135747700.0,4148492000000000.0,0.855628,1.427728,22431480.0,84991590.0
min,1805712.0,6664410.0,0.0,5.0,10.0,0.0,0.918919,25.0,25.971014,0.0,...,1427193000000.0,1167702000000.0,481.0,1427193000000.0,1427194000000.0,0.0,1.0,1.0,4000.0,48000.0
25%,9992795.0,40478440.0,8431753000000.0,9.666667,36.0,3.595833,8.153846,47.0,401.785714,0.0,...,1427196000000.0,1213374000000.0,14441.0,1427197000000.0,1427202000000.0,172359000000.0,2.0,2.0,77000.0,2507000.0
50%,11814740.0,82251630.0,25413810000000.0,10.5,79.0,6.333333,15.571429,104.0,1562.892105,0.111111,...,1427199000000.0,1241688000000.0,49113.0,1427201000000.0,1427215000000.0,6242125000000.0,2.0,2.0,165000.0,8068000.0
75%,13467430.0,182754200.0,40446740000000.0,11.3,178.0,8.867754,28.727273,220.0,4692.25,0.272727,...,1427328000000.0,1285848000000.0,113480.0,1427328000000.0,1427371000000.0,100418000000000.0,3.0,3.0,542000.0,37038000.0
max,16777220.0,812713200.0,140737500000000.0,14.0,778.0,29.666667,341.333333,4417.0,583230.695076,4.75,...,1427487000000.0,1423252000000.0,520062.0,1427625000000.0,1427919000000.0,4.198625e+16,8.0,11.0,428419000.0,723608000.0


In [11]:
fn = "data/threads/germanwings-crash.csv"
gw_thrds.to_csv(fn, index=False)
"Wrote data to %s" % fn

'Wrote data to data/threads/germanwings-crash.csv'
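
When the thread-level CSV is read back, pandas will infer the `thread` column as an integer unless told otherwise, while the feature tables above list it as `str`. The sketch below shows a round trip on a toy frame with made-up IDs, written to a temporary directory, that keeps the IDs as strings:

```python
import os
import tempfile

import pandas as pd

# Made-up thread IDs, kept as strings as in the feature tables
toy = pd.DataFrame({
    "thread": ["580319983676313601", "580320000000000002"],
    "thread_length": [12, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "toy-threads.csv")
    toy.to_csv(fn, index=False)
    # Force the ID column back to str instead of letting pandas infer int64
    restored = pd.read_csv(fn, dtype={"thread": str})
```

The same `dtype` argument could be passed when loading `data/threads/germanwings-crash.csv`, assuming downstream code expects string IDs.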