# Parsing and Cleaning PHEME RNR Dataset Events

This notebook performs data-cleaning and aggrigation into thread-level data. It also provides a useful sanity check. Run all the cells in this notebook to generate thread-level CSV files in the `data/threads` directory.

In [1]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import numpy as np
import time
from functools import reduce
from lib.util import fetch_tweets, to_unix_tmsp

In [2]:
# Load dependencies for this Jupyter Notebook
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.decomposition import PCA,SparsePCA,KernelPCA
from sklearn.manifold import TSNE, Isomap
import matplotlib.pyplot as plt

#Train and Test preprocessing
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

#Classifiers:
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

## Parsing and Cleaning Data
This step takes the raw PHEME rumor dataset and saves it tabular format as CSV file. The original PHEME dataset consists of JSON files organized into directories by event and category (rumor or non-rumor). These three functions below parse the data, save it as a CSV file (if necessary), and load it into this notebook as a Pandas DataFrame from the "cached" CSV file.

In [31]:
gw = fetch_tweets("germanwings-crash")

##  Tweet Level Features

| Name/Column       | Description                   | Type   | Notes  |
|-------------------|-------------------------------|--------| ------ |
| is_rumor          | Was this classified as rumor  | `bool` | *Classification done by journalists* |
| thread            | Source tweet id               | `str`  |                                                   |
| in_reply_tweet    | Tweet ID in reply to          | `str`  |                                                   |
| event             | Name of the PHEME event       | `str`  | Corresponds to event in the PHEME dataset         |
| tweet_id          | Unique ID for tweet           | `str`  | This field is the ID referenced in `in_reply_tweet`     |
| is_source_tweet   | Was this classified as rumor  | `bool` |                                                   |
| in_reply_user     | User ID in reply to           | `str`  |                                                   |
| user_id           | Twitter User's ID             | `str`  | This field is the ID referenced in `in_reply_user` |
| tweet_length      | Number of characters in tweet | `int`  |                                                   |
| urls_count        | Number of URLS in tweet       | `int`  |                                                   |
| hashtags_count    | Number of hashtags in tweet   | `int`  |                                                   |
| retweet_count     | Times the tweet was retweeted | `int`  |                                                   |
| favorite_count    | Number of times favorited     | `int`  |                                                   |
| mentions_count    | Number of users mentioned     | `int`  |                                                   |
| is_truncated      | Is this tweet truncated       | `bool` | Did User type > 140 characters. [See Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates) |
| created              | Datetime Tweet was created    | `datetime` | |
| has_smile_emoji      | Does Tweet contain ""?        | `bool` | 😊 is the smile emoji |
| user.tweets_count    | User's tweet total, currently | `int`  | |
| user.verified        | Is Twitter user verified?     | `bool` |                                                   |
| user.followers_count | Total number of followers  | `int` | |
| user.listed_count    | ?? | `int` | | 
| user.friends_count   | ?? | `int` | |
| user.time_zone       | Timezone of the user's Twitter account | `str` | |
| user.desc_length     | Length of the user's biographic description | `int` |
| user.has_bg_img      | Does user have a profile background image?  | `bool` |
| user.default_pric    | Does the user have the default profile picture | `bool` |
| user.created_at      | Date and time Twitter account was activated | `datetime` | |

## Germanwings Crash

In [32]:
gw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4489 entries, 0 to 4488
Data columns (total 27 columns):
is_rumor                4489 non-null bool
thread                  4489 non-null object
in_reply_tweet          4489 non-null object
event                   4489 non-null object
tweet_id                4489 non-null object
is_source_tweet         4489 non-null bool
in_reply_user           4489 non-null object
user_id                 4489 non-null object
tweet_length            4489 non-null int64
urls_count              4489 non-null int64
hashtags_count          4489 non-null int64
retweet_count           4489 non-null int64
favorite_count          4489 non-null int64
mentions_count          4489 non-null int64
is_truncated            4489 non-null bool
created                 4489 non-null datetime64[ns, UTC+00:00]
has_smile_emoji         4489 non-null bool
user.tweets_count       4489 non-null int64
user.verified           4489 non-null bool
user.followers_count    4489 non-nul

The `.head` method prints out the 5 first rows in the dataframe

In [33]:
gw.head()

Unnamed: 0,is_rumor,thread,in_reply_tweet,event,tweet_id,is_source_tweet,in_reply_user,user_id,tweet_length,urls_count,...,user.tweets_count,user.verified,user.followers_count,user.listed_count,user.friends_count,user.time_zone,user.desc_length,user.has_bg_img,user.default_pic,user.created_at
0,False,580319983676313601,,germanwings-crash,580319983676313601,True,,8330472,98,1,...,107042,True,179430,3550,74,Madrid,115,True,False,2007-08-21 14:03:19+00:00
1,False,580319983676313601,5.803199836763136e+17,germanwings-crash,580322851850461184,False,8330472.0,2307392966,109,1,...,2076,False,988,7,1782,,121,True,True,2014-01-23 23:26:57+00:00
2,False,580319983676313601,5.803228518504612e+17,germanwings-crash,580323127089082368,False,2307392966.0,2535310842,36,0,...,701,False,62,1,121,,34,False,False,2014-05-30 15:39:18+00:00
3,False,580319983676313601,5.803231270890824e+17,germanwings-crash,580325737619685377,False,2535310842.0,2535310842,72,0,...,701,False,62,1,121,,34,False,False,2014-05-30 15:39:18+00:00
4,False,580321203757387776,,germanwings-crash,580321203757387776,True,,92771309,85,1,...,11447,True,20839,369,1354,London,108,True,False,2009-11-26 15:08:39+00:00


The `describe` method will give summary information about each column in the dataframe. Each of these columns, except `is_truncated` should have two unique values.

In [34]:
gw.describe(include="bool")

Unnamed: 0,is_rumor,is_source_tweet,is_truncated,has_smile_emoji,user.verified,user.has_bg_img,user.default_pic
count,4489,4489,4489,4489,4489,4489,4489
unique,2,2,1,2,2,2,2
top,True,False,False,False,False,True,False
freq,2494,4020,4489,4487,4109,3992,2827


In [35]:
gw.describe()

Unnamed: 0,tweet_length,urls_count,hashtags_count,retweet_count,favorite_count,mentions_count,user.tweets_count,user.followers_count,user.listed_count,user.friends_count,user.desc_length
count,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0,4489.0
mean,90.788372,0.185787,0.298062,25.290933,9.049677,1.506572,27548.61,187391.3,1814.252172,1576.522834,82.602584
std,39.920378,0.400827,0.693938,163.969869,70.400399,1.020736,93527.55,1345395.0,11458.251488,8865.247468,56.157491
min,5.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,56.0,0.0,0.0,0.0,0.0,1.0,1810.0,119.0,2.0,184.0,30.0
50%,97.0,0.0,0.0,0.0,0.0,1.0,7685.0,477.0,10.0,448.0,88.0
75%,130.0,0.0,0.0,0.0,1.0,2.0,25681.0,2288.0,46.0,1210.0,135.0
max,151.0,2.0,8.0,4388.0,2541.0,12.0,4420429.0,25303090.0,163464.0,453460.0,160.0


## Thread Level Features

* **Bold features** represent high performing features identified in C. Buntain and J. Golbeck, ["Automatically Identifying Fake News in Popular Twitter Threads"](http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8118443&isnumber=8118402)
* Features that are normalized are normalized by thread length


| Name                | Description                               | Type    | Notes |
| ---                 | ---                                       | ---     | ----- |
| thread              | Tweet ID of the source tweet              | `str`   | |
| favorite_count      | Normalized favorite total                 | `float` | |
| retweet_count       | Normlaized retweet total                  | `float` | |
| **hashtags_count**  | Normlaized hashtag total                  | `float` | |
| urls_count          | URL total normalized by thread length     | `float`  | |
| user.tweets_count   | Total tweets by thread users              | `float` | |
| event               | Name of PHEME event                       | `str`  | |
| is_rumor            | Either rumor or nonrumor                  | `bool` | |
| thread_length       | Number of tweets in the thread            | `int`  | |
| user.has_bg_img     | Ratio of users who have bg image          | `float`| |
| user.default_pic    | Ratio of users with default profile pic   | `float`| |
| **has_smile_emoji** | Number of smile emojis in the thread      | `int`  | 😊 is the smile emoji |
| user.verified       | Count of verified users in the thread normalized by thread length     | `float`  | |
| **src.followers_count** | The number of followers of the original poster of the thread. | `int` | |
| src.listed_count    | ??? | `int` | |
| src.user_verified   | ??? | `int` | |
| src.tweets_total    | ??? | `int` | |
| reply_var           | The variance in the timestamps of responses to the source tweet | `float` |
| src_age             | Difference in src user's creation and tweet creation            | `int`   | Measured in seconds |
| time_to_first_resp  | The difference between tweet creation datetime and 1st reply    | `int`   | Measured in seconds |
| time_to_last_resp   | The difference between tweet creation datetime and last reply   | `int`   | Measured in seconds |

In [50]:

def agg_tweets_by_thread(df):

    # Returns the proportion of True/1 values in col
    normal_sum = lambda col : np.sum(col) / len(col)
    agg_props = {
        "favorite_count": normal_sum,
        "retweet_count": normal_sum,
        "hashtags_count": normal_sum,    
        "urls_count": normal_sum,
        "user.tweets_count": normal_sum,        
        "event": max,
        "is_rumor": max,
        "tweet_id": len,
        "user.has_bg_img": normal_sum,
        "user.default_pic": normal_sum,
        "has_smile_emoji": normal_sum,
        "user.verified": normal_sum,
    }
    rename = {
        "tweet_id": "thread_length",
        "has_url":"url_proportion",
    }
    agg = df.groupby("thread").agg(agg_props).rename(columns=rename)
    src = df[df["is_source_tweet"] == True][["thread", 
                                          "user.followers_count", 
                                          "user.listed_count",
                                          "user.verified",
                                          "created",
                                          "user.created_at",
                                          "user.tweets_count"]] \
                         .rename(columns={"user.followers_count": "src.followers_count",
                                          "user.listed_count": "src.listed_count",
                                          "user.verified": "src.user_verified",
                                          "user.created_at": "src.created_at",
                                          "user.tweets_count": "src.tweets_total"})
    
    def f(x):
        d = []
        d.append(min(x["created"]))
        d.append(max(x["created"]))
        d.append(np.var(to_unix_tmsp(x["created"])))
        return pd.Series(d, index=["first_resp", "last_resp","resp_var"])
        
    replies = df[df["is_source_tweet"] == False] \
        .groupby("thread") \
        .apply(f)

    dfs = [agg, src, replies]
    thrd_data = reduce(lambda left, right: pd.merge(left,right, on="thread"), dfs)
    
    thrd_data["src_age"] = thrd_data["created"] - src["src.created_at"]
    thrd_data["time_to_first_resp"] = thrd_data["first_resp"] - thrd_data["created"]
    thrd_data["time_to_last_resp"] = thrd_data["last_resp"] - thrd_data["created"]
    
    return thrd_data

In [51]:
gw_thrds = agg_tweets_by_thread(gw)
gw_thrds.head()

Unnamed: 0,thread,urls_count,is_rumor,thread_length,hashtags_count,favorite_count,user.has_bg_img,retweet_count,user.default_pic,user.verified,...,src.user_verified,created,src.created_at,src.tweets_total,first_resp,last_resp,resp_var,src_age,time_to_first_resp,time_to_last_resp
0,580317556516483072,0.125,True,8,0.125,5.0,1.0,72.25,0.375,0.0,...,False,2015-03-24 10:37:41+00:00,2014-01-07 11:38:00+00:00,14408,2015-03-24 10:38:17+00:00,2015-03-24 10:47:31+00:00,28107390000.0,2771 days 20:34:22,00:00:36,00:09:50
1,580317998147325952,0.0,True,3,0.0,0.666667,1.0,13.333333,0.333333,0.0,...,False,2015-03-24 10:39:27+00:00,2009-08-20 15:42:13+00:00,87411,2015-03-24 10:41:08+00:00,2015-03-24 10:42:28+00:00,1600000000.0,2201 days 10:01:11,00:01:41,00:03:01
2,580318020192571392,0.5,True,2,0.0,0.5,1.0,13.5,0.0,0.0,...,False,2015-03-24 10:39:32+00:00,2007-05-08 16:02:23+00:00,22201,2015-03-24 11:54:08+00:00,2015-03-24 11:54:08+00:00,0.0,2201 days 10:01:16,01:14:36,01:14:36
3,580318210609696769,0.0,True,2,1.0,4.5,1.0,23.0,0.0,0.5,...,True,2015-03-24 10:40:17+00:00,2009-03-26 12:33:47+00:00,13875,2015-03-24 10:43:30+00:00,2015-03-24 10:43:30+00:00,0.0,2201 days 10:02:01,00:03:13,00:03:13
4,580318669483413504,0.6,True,5,0.6,1.8,1.0,12.2,0.2,0.0,...,False,2015-03-24 10:42:07+00:00,2009-09-29 13:06:54+00:00,5426,2015-03-24 11:53:05+00:00,2015-03-24 17:27:43+00:00,51941050000000.0,1943 days 19:33:28,01:10:58,06:45:36


In [52]:
fn = "data/threads/germanwings-crash.csv"
gw_thrds.to_csv(fn)
"Wrote data to %s" % fn

'Wrote data to data/threads/germanwings-crash.csv'

### Convert times to integers of second and separate is_rumor tag with data:

In [59]:
gw_thrds_rumortags=gw_thrds["is_rumor"]
gw_thrds_without_rumor_tag=gw_thrds.drop(['is_rumor'],axis=1)

print(gw_thrds_without_rumor_tag.columns.values)
gw_thrds_without_rumor_tag=gw_thrds_without_rumor_tag.drop(['event'],axis=1)

gw_thrds_without_rumor_tag["created"]=to_unix_tmsp(gw_thrds["created"])
gw_thrds_without_rumor_tag["src.created_at"]=to_unix_tmsp(gw_thrds["src.created_at"])
gw_thrds_without_rumor_tag["first_resp"]=to_unix_tmsp(gw_thrds["first_resp"])
gw_thrds_without_rumor_tag["src_age"]=to_unix_tmsp(gw_thrds["src_age"])
gw_thrds_without_rumor_tag["last_resp"]=to_unix_tmsp(gw_thrds["last_resp"])
gw_thrds_without_rumor_tag["time_to_first_resp"]=to_unix_tmsp(gw_thrds["time_to_first_resp"])
gw_thrds_without_rumor_tag["time_to_last_resp"]=to_unix_tmsp(gw_thrds["time_to_last_resp"])



['thread' 'urls_count' 'thread_length' 'hashtags_count' 'favorite_count'
 'user.has_bg_img' 'retweet_count' 'user.default_pic' 'user.verified'
 'event' 'has_smile_emoji' 'user.tweets_count' 'src.followers_count'
 'src.listed_count' 'src.user_verified' 'created' 'src.created_at'
 'src.tweets_total' 'first_resp' 'last_resp' 'resp_var' 'src_age'
 'time_to_first_resp' 'time_to_last_resp']


### Used functions:

In [60]:
def convertTrueFalseTo01(X):
    X[X==True]=1.0
    X[X==False]=0.0
    X[X=='True']=1.0
    X[X=='False']=0.0
    return X

def standardize_cols(X, mu=None, sigma=None):
    # Standardize each column with mean 0 and variance 1
    n_rows, n_cols = X.shape

    if mu is None:
        mu = np.mean(X, axis=0)

    if sigma is None:
        sigma = np.std(X, axis=0)
        sigma[sigma < 1e-8] = 1.

    return (X - mu) / sigma, mu, sigma


### Data Preprocessing:

In [93]:
gw_thrds_values=gw_thrds_without_rumor_tag.values
n,d=gw_thrds_values.shape
#gw_thrds_values=convertTrueFalseTo01(gw_thrds_values[:,1:d])
n,d=gw_thrds_values.shape

gw_thrds_rumortags_values=convertTrueFalseTo01(gw_thrds_rumortags.values)
print(gw_thrds_values)
gw_thrds_values,_,_=standardize_cols(gw_thrds_values.astype(float))

n,d=gw_thrds_values.shape
print(gw_thrds_values.shape)

[['580317556516483072' 0.125 8 ... 239488462000.0 36000.0 590000.0]
 ['580317998147325952' 0.0 3 ... 190202471000.0 101000.0 181000.0]
 ['580318020192571392' 0.5 2 ... 190202476000.0 4476000.0 4476000.0]
 ...
 ['581479017770979329' 0.0 21 ... 190479279000.0 222000.0 19073000.0]
 ['581546828954411008' 0.037037037037037035 27 ... 190495447000.0 77000.0
  66210000.0]
 ['581550667753504768' 0.06666666666666667 30 ... 190496362000.0
  25389000.0 102964000.0]]
(405, 23)




### PCA:

In [70]:
model=PCA(n_components=2)
model.fit(gw_thrds_values)
Z_PCA=model.transform(gw_thrds_values)
plt.figure()
plt.title("PCA")
plt.scatter(Z_PCA[:,0],Z_PCA[:,1],c=gw_thrds_rumortags_values)
plt.show()

### TSNE:

In [71]:
model=TSNE(n_components=2)
Z_TSNE=model.fit_transform(gw_thrds_values)
plt.figure()
plt.title("TSNE")
plt.scatter(Z_TSNE[:,0],Z_TSNE[:,1],c=gw_thrds_rumortags_values)
plt.show()

### Isomap:

In [78]:
model=Isomap(n_components=2,n_neighbors=4)
Z_Isomap=model.fit_transform(gw_thrds_values)
plt.figure()
plt.title("Isomap")
plt.scatter(Z_Isomap[:,0],Z_Isomap[:,1],c=gw_thrds_rumortags_values)
plt.show()

### SparsePCA:

In [67]:
model=SparsePCA(n_components=2,normalize_components=True)
model.fit(gw_thrds_values)
Z_PCA=model.transform(gw_thrds_values)
plt.figure()
plt.title("SparsePCA")
plt.scatter(Z_PCA[:,0],Z_PCA[:,1],c=gw_thrds_rumortags_values)
plt.show()

### KernelPCA:

In [69]:
model=KernelPCA(n_components=2)
model.fit(gw_thrds_values)
Z_PCA=model.transform(gw_thrds_values)
plt.figure()
plt.title("KernelPCA")
plt.scatter(Z_PCA[:,0],Z_PCA[:,1],c=gw_thrds_rumortags_values)
plt.show()

## Running some classifiers
### Train and Test data separation:

In [79]:
X_train, X_test, y_train, y_test = train_test_split(gw_thrds_values, gw_thrds_rumortags_values, test_size=0.25, random_state=45)
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train=le.transform(y_train)
y_test=le.transform(y_test)
print(X_train.shape,X_test.shape,y_train.shape)
print('y_train bincount:', np.bincount(y_train)/np.sum(np.bincount(y_train)))
print('y_test bincount:', np.bincount(y_test)/np.sum(np.bincount(y_test)))

(303, 22) (102, 22) (303,)
y_train bincount: [0.49834983 0.50165017]
y_test bincount: [0.50980392 0.49019608]


### SVM.SVC:

In [80]:
def test_model(model):
    model.fit(X_train,y_train)
    y_test_hat=model.predict(X_test)
    print('train error:', np.mean(model.predict(X_train)==y_train))
    print('test error:', np.mean(y_test_hat==y_test))

In [82]:
model = svm.SVC(gamma='scale', kernel='linear', C=1)
test_model(model)

train error: 0.735973597359736
test error: 0.6176470588235294


In [90]:
model = KNeighborsClassifier(n_neighbors=5)
test_model(model)

train error: 0.7755775577557755
test error: 0.5294117647058824


In [84]:
model=DecisionTreeClassifier(random_state=0)
test_model(model)

train error: 1.0
test error: 0.6078431372549019


In [85]:
model=RandomForestClassifier(n_estimators=100, max_depth=4, random_state=4, max_features=2)
test_model(model)

train error: 0.8679867986798679
test error: 0.6862745098039216


In [86]:
model=AdaBoostClassifier(n_estimators=100)
test_model(model)

train error: 0.9900990099009901
test error: 0.6764705882352942


In [87]:
model=GaussianProcessClassifier(1.0 * RBF(1.0))
test_model(model)

train error: 0.7953795379537953
test error: 0.5686274509803921
