<p><span style="font-size: 36pt; font-family: georgia, palatino, serif; color: #800000;">Learning Topical Social Sensors</span></p>

<h1><span style="color: #000080;"><strong>How useful is Twitter to you for finding the right information?</strong></span></h1>

![caption](https://github.com/demoonism/TwitterSensor/blob/master/Screenshot/search.JPG?raw=true)

<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 20pt;"><em><strong>We can do better than this!</strong> </em></span></span></p>

<p style="text-align: left;"><strong>In this project, we are aiming to train a classifier to identify targeted information on Twitter with high precision. </strong></p>
<p style="text-align: left;"><strong>For example, if you are interested in:</strong></p>
<p style="text-align: left;"><em><strong>&bull; Global social issues</strong></em><br /><em><strong>&bull; Politics in the Pacific Northwest</strong></em><br /><em><strong>&bull; Public transit in New York City</strong></em></p>
<p style="text-align: left;"><strong>The classifier would serve as a "sensor" to identify topical tweets based on your tailored interests!</strong></p>

<h1><span style="color: #000080;"><strong>Challenges</strong></span></h1>

<p style="text-align: left;"><strong>(1) &nbsp;Billions of potential features, thousands of useful ones (Hashtags, users, mentions, terms, locations)</strong></p>
<p style="text-align: left;"><strong>(2) &nbsp;Need a lot of labeled data to learn feature weights well</strong></p>

<h1><span style="color: #000080;"><strong>Solution</strong></span></h1>

<p><span style="font-size: 12pt;"><strong>(1) Careful feature engineering and feature selection using Apache Spark.</strong></span></p>
<p><span style="font-size: 10pt;"><strong>We performed feature selection and transformation with Apache Spark on a standalone server with eight 1TB hard disks, two 20-core CPUs (40 threads) and 256GB of RAM. </strong></span></p>
<p><span style="font-size: 12pt;"><strong>(2)</strong> <strong>Hashtags!</strong>&nbsp;</span></p>
<p><span style="font-size: 10pt;"><strong>Hashtags&nbsp;originated on IRC, were&nbsp;adopted later (and perhaps most famously) on Twitter, and&nbsp;now appear on other social media platforms such as Instagram,&nbsp;Tumblr, and Facebook. They usually serve as surrogates for topics. Therefore, for each topic,&nbsp;we leverage a (small)&nbsp;set of user-curated topical hashtags to efficiently provide&nbsp;a large number of supervised topic labels for social media&nbsp;content.&nbsp;</strong></span></p>
<p><span style="font-size: 10pt;"><strong>We used 4 independent annotators to query the Twitter search API to identify candidate hashtags for each topic. A&nbsp;hashtag is assigned to a topic set if 3 out of 4 annotators agree on the assignment.</strong></span></p>
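
The agreement rule above can be sketched in plain Python. This is only an illustration: the function name `agreed_hashtags`, the input layout (one set of candidate hashtags per annotator), and reading "3 out of 4" as "at least 3" are assumptions, not part of the original project code.

```python
from collections import Counter


def agreed_hashtags(annotations, min_votes=3):
    """Keep the hashtags proposed by at least `min_votes` annotators.

    annotations: list of sets, one set of candidate hashtags per annotator
    (hypothetical layout, chosen for illustration).
    """
    # Each annotator contributes at most one vote per hashtag.
    votes = Counter(tag for tags in annotations for tag in set(tags))
    return {tag for tag, n in votes.items() if n >= min_votes}


# Four annotators proposing candidates for one topic.
candidates = [
    {"sandy", "storm"},
    {"sandy", "storm", "nyc"},
    {"sandy"},
    {"storm", "sandy"},
]
topic_set = agreed_hashtags(candidates)
```
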
<p><span style="font-size: 10pt;"><strong>For example, for the topic "Natural Disaster", the set of hashtags is ["sandy", "drought", "storm", "hurricane", "tornado" .... etc]. If a tweet contains one or more of the pre-determined hashtags, we say it is "topical" for that particular topic, and it is labeled 1 (0 otherwise). We will revisit this in the feature selection section.</strong></span></p>
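
As a minimal sketch of this labeling rule (the helper name `topical_label` is hypothetical, and the tweet's hashtags are assumed to be available as a list of lowercase strings):

```python
def topical_label(tweet_hashtags, topic_hashtags):
    """Return 1 if the tweet shares at least one hashtag with the topic set, else 0."""
    return 1 if set(tweet_hashtags) & set(topic_hashtags) else 0


# A small subset of the "Natural Disaster" hashtags quoted above.
natural_disaster = {"sandy", "drought", "storm", "hurricane", "tornado"}

label_a = topical_label(["sandy", "nyc"], natural_disaster)
label_b = topical_label(["thuglife"], natural_disaster)
label_c = topical_label([], natural_disaster)
```
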
<p><span style="font-size: 10pt;">&nbsp;</span></p>
<p><span style="font-size: 18pt; color: #ff0000;"><strong>Catch!</strong></span></p>
<p><strong><span style="font-size: 10pt;">Hashtags are part of our features. Wouldn't the classifier simply learn to memorize the hashtags?</span></strong></p>
<p><strong><span style="font-size: 10pt;">To make the classifier generalize, we remove the training hashtags from the validation and test sets, so that it has to predict from the other learned features rather than from memorized hashtags. This is illustrated further in the Temporal Split section later.</span></strong></p>

<h1><span style="color: #000080;"><strong>Now that we have labeled data, which features could be useful for predicting topicality?</strong></span></h1>

![caption](https://github.com/demoonism/TwitterSensor/blob/master/Screenshot/twt.JPG?raw=true)

<p style="text-align: left;"><span style="font-size: 18pt; color: #000080;"><strong>Why might these tweet features be useful?</strong> </span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><strong>&bull; Users: who tweets on the topic?</strong></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>Tweets from the weather channel might be a good indicator for Natural Disasters</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><strong>&bull; Hashtags: What hashtags co-occur with the topic?</strong></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>#teaparty could imply LGBT rights</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><strong>&bull; Mentions:</strong></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>@Redcross might be relevant to Natural Disaster</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><strong>&bull; Locations:</strong></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>The Philippines, where many natural disasters have happened in the last few years, is a decent indicator of the Natural Disaster topic</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><strong>&bull; Terms:</strong></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong> Word features are strong indicators of a particular topic</strong></em></span></span></p>

<h1><span style="color: #000080;"><strong>Implementation</strong></span></h1>

<p><strong><span style="font-size: 10pt;">The original Twitter data were collected over 2 years and amount to over 2TB of compressed data, consisting of hundreds of millions of tweets.</span></strong></p>
<p><strong><span style="font-size: 10pt;">How do we go from the raw data to an efficient classifier?</span></strong></p>
<p><strong><span style="font-size: 10pt;">The following three-step process serves as an end-to-end pipeline for ETL and ML training.</span></strong></p>
<p>&nbsp;</p>

<p><span style="color: #000080;"><strong><span style="font-family: georgia, palatino, serif; font-size: 24pt;">Step One: Pre-Processing</span></strong></span></p>

<p><span style="font-size: 13.3333px;"><strong>Each valid tweet crawled from the server is a JSON object with over 100 attributes. An example is shown below:</strong></span></p>

<p><span style="font-size: 10pt;"><strong>Sample Tweet</strong></span></p>
<p><span style="font-size: 8pt;"><strong>{</strong>"created_at":"Thu Jan 31 12:58:06 +0000 2013",</span><br /><span style="font-size: 8pt;"> "id":296965581582786560,</span><br /><span style="font-size: 8pt;"> "id_str":"296965581582786560",</span><br /><span style="font-size: 8pt;"> "text":"Im ready for whatever",</span><br /><span style="font-size: 8pt;"> "source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e",</span><br /><span style="font-size: 8pt;"> "truncated":false,</span><br /><span style="font-size: 8pt;"> "in_reply_to_status_id":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_status_id_str":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_user_id":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_user_id_str":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_screen_name":null,</span><br /><span style="font-size: 8pt;"> "user":{</span><br /><span style="font-size: 8pt;"> "id":1059349532,</span><br /><span style="font-size: 8pt;"> "id_str":"1059349532",</span><br /><span style="font-size: 8pt;"> "name":"Don Dada",</span><br /><span style="font-size: 8pt;"> "screen_name":"ImDatNiggaBD",</span><br /><span style="font-size: 8pt;"> "location":"South Side Of Little Rock",</span><br /><span style="font-size: 8pt;"> "url":null,</span><br /><span style="font-size: 8pt;"> "description":"Weed Smoker (Kush)",</span><br /><span style="font-size: 8pt;"> "protected":false,</span><br /><span style="font-size: 8pt;"> "followers_count":109,</span><br /><span style="font-size: 8pt;"> "friends_count":110,</span><br /><span style="font-size: 8pt;"> "listed_count":0,</span><br /><span style="font-size: 8pt;"> "created_at":"Fri Jan 04 02:37:28 +0000 2013",</span><br /><span style="font-size: 8pt;"> "favourites_count":14,</span><br /><span style="font-size: 8pt;"> "utc_offset":null,</span><br /><span style="font-size: 8pt;"> "time_zone":null,</span><br /><span 
style="font-size: 8pt;"> "geo_enabled":false,</span><br /><span style="font-size: 8pt;"> "verified":false,</span><br /><span style="font-size: 8pt;"> "statuses_count":1312,</span><br /><span style="font-size: 8pt;"> "lang":"en",</span><br /><span style="font-size: 8pt;"> "contributors_enabled":false,</span><br /><span style="font-size: 8pt;"> "is_translator":false,</span><br /><span style="font-size: 8pt;"> "profile_background_color":"C0DEED",</span><br /><span style="font-size: 8pt;"> "profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png",</span><br /><span style="font-size: 8pt;"> "profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png",</span><br /><span style="font-size: 8pt;"> "profile_background_tile":false,</span><br /><span style="font-size: 8pt;"> "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3184813228\/d6d3a95d902f088f412cf1bd90c126c7_normal.jpeg",</span><br /><span style="font-size: 8pt;"> "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3184813228\/d6d3a95d902f088f412cf1bd90c126c7_normal.jpeg",</span><br /><span style="font-size: 8pt;"> "profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/1059349532\/1359068332",</span><br /><span style="font-size: 8pt;"> "profile_link_color":"0084B4",</span><br /><span style="font-size: 8pt;"> "profile_sidebar_border_color":"C0DEED",</span><br /><span style="font-size: 8pt;"> "profile_sidebar_fill_color":"DDEEF6",</span><br /><span style="font-size: 8pt;"> "profile_text_color":"333333",</span><br /><span style="font-size: 8pt;"> "profile_use_background_image":true,</span><br /><span style="font-size: 8pt;"> "default_profile":true,</span><br /><span style="font-size: 8pt;"> "default_profile_image":false,</span><br /><span style="font-size: 8pt;"> "following":null,</span><br /><span style="font-size: 8pt;"> "follow_request_sent":null,</span><br /><span style="font-size: 8pt;"> 
"notifications":null},</span><br /><span style="font-size: 8pt;"> "geo":null,</span><br /><span style="font-size: 8pt;"> "coordinates":null,</span><br /><span style="font-size: 8pt;"> "place":null,</span><br /><span style="font-size: 8pt;"> "contributors":null,</span><br /><span style="font-size: 8pt;"> "retweet_count":0,</span><br /><span style="font-size: 8pt;"> "entities":{"hashtags":[],</span><br /><span style="font-size: 8pt;"> "urls":[],</span><br /><span style="font-size: 8pt;"> "user_mentions":[]},</span><br /><span style="font-size: 8pt;"> "favorited":false,</span><br /><span style="font-size: 8pt;"> "retweeted":false,</span><br /><span style="font-size: 8pt;"> "lang":"en"<strong>}</strong></span></p>

<p><span style="font-size: 13.3333px;"><strong>Obviously, not all attributes are relevant to our analysis. In the context of this paper, the only relevant fields for our features are:</strong></span></p>
<p><span style="color: #0000ff;"><em><strong>Hashtags, From_User, Create_Time, Location, Mentions</strong></em></span></p>
<p><span style="font-size: 13.3333px;"><strong>Moreover, the raw text is quite dirty. We need to perform some data cleaning in order to get clean features.</strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>Since this step is fairly involved and independent of the analysis here, I keep it in a separate notebook. </strong></span></p>
<blockquote>
<p><span style="font-size: 13.3333px;"><strong>Spark-Twt-PreProcessing.ipynb</strong></span></p>
</blockquote>
<p><span style="font-size: 13.3333px;"><strong> You should be able to follow along with it as an independent module.</strong></span></p>

<p><span style="font-size: 13.3333px;"><strong>The resulting data looks like this:</strong></span></p>

<p><strong>Processed-tweet:</strong></p>
<p><strong>{</strong>u'Create_time': 1359737884.0,<br /> u'from_id': 87151732,<br /> u'from_user': u'ishiPTI',<br /> u'hashtag': u'thuglife',<br /> u'location': u'loc_lakeshore',<br /> u'mention': u'BushraShekhani',<br /> u'term': u'I am ready for whatever',<br /> u'tweet_id': 297312861586325504<strong>}</strong></p>

<p><span style="font-size: 13.3333px;"><strong>Now that we have a small (sort of) and clean dataset to work with, it is time to move on to Spark to perform some real analysis.</strong></span></p>

<p><span style="color: #000080;"><strong><span style="font-family: georgia, palatino, serif; font-size: 24pt;">Step Two: Feature Extraction</span></strong></span></p>

<p><span style="font-size: 13.3333px;"><strong>We need to turn the raw JSON data into a feature matrix. There are two key requirements here: </strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>1. Data processing must be extremely efficient since we only have 40 cores and 256GB of RAM.</strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>2. The resulting matrix must be sparse to facilitate the training step&nbsp;later.</strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>These are achieved through the following pipeline. </strong></span></p>

In [1]:
## Notebook property setup.
## Spark SQL
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit, monotonically_increasing_id, explode
from pyspark.sql import functions as F
## Spark ML
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer, IDF, StopWordsRemover, CountVectorizer, VectorAssembler

## Helper
import sys
import time
import os.path
import json
from datetime import datetime
from operator import add
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
## Enable inline graphs
%matplotlib inline

import preprocessor as p
import string

## Display precision for pandas dataframe
pd.set_option('display.precision', 10)

## Helper function to report the run time of a Spark operation.
def getTime(start):
    sec = time.time() - start
    m, s = divmod(sec, 60)
    h, m = divmod(m, 60)
    print('Spark operation takes - %d:%02d:%02d which is %d seconds in total' % (h,m,s,sec))
    
# Load a JSON object; if a line is invalid, substitute an empty dict (which has len() == 0).
def loadJson(d):
    try:
        js = json.loads(d)
    except Exception:
        # Covers ValueError from malformed JSON as well as any other parsing failure.
        js = {}
    return js

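# Lowercase a token and delete punctuation.
# Note: translate(None, chars) is the Python 2 form of str.translate.
def translating(x):
    return x.encode('utf-8').lower().translate(None, string.punctuation)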

def Cleansing(d):
    # Clean the raw tweet text, then drop colons and lowercase it.
    txt = p.clean(d['term'].encode('ascii', 'ignore')).replace(":", "").lower()

    # Missing or blank fields are replaced by explicit placeholder tokens.
    if d['location'] is None or d['location'].strip(' ') == '':
        loc_term = "empty_location"
    else:
        loc_term = 'loc_' + "_".join(map(translating, d['location'].strip(' ').split(" ")))

    if txt is None or txt.strip(' ') == '':
        terms = "empty_tweet"
    else:
        terms = txt.encode('utf-8').translate(None, string.punctuation).strip(' ')

    if d['hashtag'] is None or d['hashtag'].strip(' ') == '':
        hashtags = "empty_hashtag"
    else:
        hashtags = d['hashtag']

    if d['mention'] is None or d['mention'].strip(' ') == '':
        mentions = "empty_mention"
    else:
        mentions = d['mention']
        
    processed = {"from_user":d['from_user'],
                 "from_id":d['from_id'],
                 "tweet_id":d['tweet_id'],
                 "hashtag":hashtags,
                 "term": terms,
                 "location":loc_term,
                 "mention":mentions,
                 "create_time":d['create_time']
                }
    return processed
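
The helpers above rely on Python 2 string semantics (`translate(None, string.punctuation)`), which do not carry over to Python 3. The following is a rough Python 3 sketch of the location normalization only; the names `strip_punct_lower` and `location_feature` are illustrative and not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punct_lower(token):
    """Lowercase a token and remove ASCII punctuation."""
    return token.lower().translate(_PUNCT_TABLE)


def location_feature(location):
    """Turn a free-text location into a single 'loc_...' token."""
    if location is None or location.strip(" ") == "":
        return "empty_location"
    parts = location.strip(" ").split(" ")
    return "loc_" + "_".join(strip_punct_lower(part) for part in parts)


example = location_feature("South Side Of Little Rock")
```

Joining the words with underscores keeps a multi-word location as one token, so a whitespace tokenizer later treats it as a single feature.
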

<p><span style="font-size: 18px;"><strong>Reading DATA </strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>After preprocessing, the data are saved as JSON files with one tweet per line. We need to load and parse these data into DataFrames. Note that the path passed to sc.textFile can be a single file, a directory, or a comma-separated list of paths; the Spark context creates partitions automatically. Here the pre-processed data are stored in two directories.</strong></span></p>

In [2]:
# full
data_Eng = sc.textFile("/mnt/1e69d2b1-91a9-473c-a164-db90daf43a3d/Eng_Json/,/mnt/2b53fde0-61da-4eeb-a038-9910540ff9ad/Eng_Json/")

<p><span style="font-size: 13.3333px;"><strong>Taking a look at the (parsed) first line of our input files </strong></span></p>

In [3]:
data = data_Eng.map(loadJson)
data.take(1)

[{u'create_time': 1380034531.0,
  u'from_id': u'330743066',
  u'from_user': u'maaddieeeb',
  u'hashtag': u'',
  u'location': u'',
  u'mention': u'RickyPDillon tyleroakley',
  u'term': u'RT @RickyPDillon: @tyleroakley you literally travel the world can i have your life',
  u'tweet_id': u'382458268779421696'}]
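
Since `loadJson` maps an invalid line to an empty dict, such records can be dropped before building a DataFrame. The sketch below shows the idea on a plain Python list; the names `load_json` and `parse_valid` are illustrative. On the RDD itself, a filter along the lines of `data.filter(lambda d: len(d) > 0)` would play the same role (not run here).

```python
import json


def load_json(line):
    """Parse one line of JSON; return an empty dict if the line is invalid."""
    try:
        return json.loads(line)
    except (ValueError, TypeError):
        return {}


def parse_valid(lines):
    """Parse every line and keep only non-empty JSON objects."""
    parsed = (load_json(line) for line in lines)
    return [d for d in parsed if isinstance(d, dict) and len(d) > 0]


records = parse_valid(['{"tweet_id": "1"}', "not json", "", "{}"])
```
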

<h1><span style="color: #000080;"><strong>Turning to dataframe</strong></span></h1>


<p><span style="font-size: 13.3333px;"><strong>An RDD (Resilient Distributed Dataset) is more of a black-box dataset: Spark cannot easily optimize it, because the operations that can be performed on it are largely unconstrained. (Available in Spark since 1.0) </strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. A DataFrame therefore carries additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query. (Added in 1.3) </strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>In summary, you can write traditional map-reduce style code on both RDDs and DataFrames, but DataFrames also support SQL commands and built-in analytical functions. For performance reasons, we turn our RDD into a DataFrame first.</strong></span></p>

In [4]:
## Define Dataframe schema.
schema = StructType([StructField('create_time', DoubleType(), False),
                     StructField('from_id', StringType(), False),
                     StructField('from_user', StringType(), False),
                     StructField('hashtag', StringType(), True),
                     StructField('location', StringType(), True),
                     StructField('mention', StringType(), True),
                     StructField('term', StringType(), True),
                     StructField('tweet_id', StringType(), False)
                    ])
df = sqlContext.createDataFrame(data, schema)

<p><span style="font-size: 13.3333px;"><strong>The input is shown in tabular form (DataFrame) below. Note that the hashtag field happens to be empty for the first few records. </strong></span></p>

In [6]:
df.show(5)

+-------------+----------+------------+-------+--------------------+--------------------+--------------------+------------------+
|  create_time|   from_id|   from_user|hashtag|            location|             mention|                term|          tweet_id|
+-------------+----------+------------+-------+--------------------+--------------------+--------------------+------------------+
|1.380034531E9| 330743066|  maaddieeeb|       |                    |RickyPDillon tyle...|RT @RickyPDillon:...|382458268779421696|
|1.380034532E9| 993373555|deanojames95|       |          Birmingham|spinky1996 deanoj...|RT @spinky1996: @...|382458272940183552|
|1.380034532E9|1378264339|  JonahsEyes|       |3/6 closet drake ...|                    |WHY AM I SITTING ...|382458272940589057|
|1.380034532E9| 603247328|emilybalzano|       |                    |                    |3 people followed...|382458272974135296|
|1.380034532E9|1336196430|Bah7ar_Fa7al|       |             Bahrain|            TellyApp|I

In [8]:
df.select("location").distinct().count()

23937570

<p><span style="font-size: 13.3333px;"><strong>Now we need to transform the textual features into a sparse vector to be piped into our learning algorithm later.</strong></span></p>

<h1><span style="color: #000080;"><strong>Vectorizing user, hashtag, location, mention, term into feature vectors</strong></span></h1>

<p><span style="font-size: 13.3333px;"><strong>We use the same thresholds as described in the paper. Note that the threshold applies to document frequency (DF), not term frequency (TF).</strong></span></p>
<table style="height: 51px; margin-left: auto; margin-right: auto;" width="30">
<tbody>
<tr>
<td><strong>Feature</strong></td>
<td><strong>Threshold</strong></td>
</tr>
<tr>
<td>From_User</td>
<td>159</td>
</tr>
<tr>
<td>Hashtag</td>
<td>159</td>
</tr>
<tr>
<td>Mention</td>
<td>159</td>
</tr>
<tr>
<td>Location</td>
<td>50</td>
</tr>
<tr>
<td>term</td>
<td>50</td>
</tr>
</tbody>
</table>
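
To make the DF versus TF distinction concrete, here is a small plain-Python sketch (not Spark code; `document_frequency` and `vocabulary` are illustrative names). A token repeated many times inside a single tweet has a high term frequency but a document frequency of 1, so it does not pass a document-frequency threshold of 2.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a token at most once
    return df


def vocabulary(docs, min_df):
    """Keep tokens that appear in at least `min_df` different documents."""
    df = document_frequency(docs)
    return sorted(token for token, n in df.items() if n >= min_df)


docs = [
    ["storm", "storm", "storm"],
    ["storm", "flood"],
    ["flood"],
    ["rain", "rain", "rain"],  # TF of "rain" is 3, but its DF is only 1
]
kept = vocabulary(docs, min_df=2)
```
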

<p><span style="font-size: 13.3333px;"><strong>In this section, we vectorize each feature according to the count threshold above. </strong></span></p>

In [None]:
term_tokenizer = Tokenizer(inputCol="term", outputCol="words")
term_remover = StopWordsRemover(inputCol=term_tokenizer.getOutputCol(), outputCol="filtered")
term_cv = CountVectorizer(inputCol=term_remover.getOutputCol(), outputCol="term_features", minDF=50)

hashtag_tokenizer = Tokenizer(inputCol="hashtag", outputCol="tags")
hashtag_cv = CountVectorizer(inputCol=hashtag_tokenizer.getOutputCol(), outputCol="hashtag_features", minDF=159)

mention_tokenizer = Tokenizer(inputCol="mention", outputCol="mentions")
mention_cv = CountVectorizer(inputCol=mention_tokenizer.getOutputCol(), outputCol="mention_features", minDF=159)

user_tokenizer = Tokenizer(inputCol="from_user", outputCol="users")
user_cv = CountVectorizer(inputCol=user_tokenizer.getOutputCol(), outputCol="user_features", minDF=159)

loc_tokenizer = Tokenizer(inputCol="location", outputCol="locs")
loc_cv = CountVectorizer(inputCol=loc_tokenizer.getOutputCol(), outputCol="loc_features", minDF=50)

pipeline = Pipeline(stages=[term_tokenizer,term_remover,term_cv,hashtag_tokenizer,hashtag_cv,mention_tokenizer, \
                            mention_cv,user_tokenizer, user_cv, loc_tokenizer, loc_cv])

<p><span style="font-size: 13.3333px;"><strong>Fit the pipeline and transform the original DataFrame into a set of feature vectors.</strong></span></p>

In [None]:
loading = time.time()

model = pipeline.fit(df)
Train_X = model.transform(df)

getTime(loading)


<p><span style="font-size: 13.3333px;"><strong>We can now compare our stats with the original paper</strong></span></p>

<p><span style="font-size: 18px;"><strong>Original </strong></span></p>
![caption](https://github.com/demoonism/TwitterSensor/blob/master/Screenshot/featurecount.JPG?raw=true)

<p><span style="font-size: 18px;"><strong> New </strong></span></p>

In [11]:
### To do...

<p><span style="font-size: 13.3333px;"><strong>Now we need to concatenate the feature vectors to obtain a feature matrix. Note that we still keep the tweet id in the output because we want to keep a mapping to the original tweet for manual examination</strong></span></p>

In [None]:
assembler = VectorAssembler(inputCols = ["term_features","hashtag_features","mention_features","user_features","loc_features"], outputCol="features")
transformed = assembler.transform(Train_X).select("tweet_id","features","create_time")
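
Conceptually, the assembler places the per-feature vectors side by side, so the indices of each block are shifted by the total size of the blocks before it. The sketch below illustrates that index arithmetic with plain Python data; representing a sparse vector as a `(size, {index: value})` pair is an assumption made for illustration and is not Spark's implementation.

```python
def concat_sparse(vectors):
    """Concatenate sparse vectors given as (size, {index: value}) pairs."""
    offset = 0
    merged = {}
    for size, entries in vectors:
        for index, value in entries.items():
            merged[offset + index] = value  # shift by the width of earlier blocks
        offset += size
    return offset, merged


term_vec = (3, {0: 1.0})      # 3 term features, the first one is active
hashtag_vec = (2, {1: 2.0})   # 2 hashtag features, the second one is active
total_size, combined = concat_sparse([term_vec, hashtag_vec])
```
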

<p><span style="font-size: 13.3333px;"><strong>We also get the text for each feature by printing out the vocabulary. This is useful for manual inspection.</strong></span></p>

In [None]:
#http://stackoverflow.com/questions/32285699/how-to-get-word-details-from-tf-vector-rdd-in-spark-ml-lib
terms_meta = model.stages[2]

In [None]:
terms_meta.vocabulary[:5]

In [None]:
Hashtag_meta = model.stages[4]

In [None]:
Hashtag_meta.vocabulary[:5]

In [None]:
# Dirty but efficient way to get a list of all feature names. Useful later when we select and examine features.
featurelist = model.stages[2].vocabulary+model.stages[4].vocabulary+model.stages[6].vocabulary+model.stages[8].vocabulary+model.stages[10].vocabulary
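
The vocabularies are concatenated in the same order as the assembler's input columns (term, hashtag, mention, user, location), so a position in `featurelist` corresponds to the same position in the assembled vector. A hypothetical helper such as `build_lookup` below could additionally report which block an index belongs to; the toy vocabularies stand in for the real ones.

```python
from bisect import bisect_right
from itertools import accumulate


def build_lookup(blocks):
    """blocks: list of (block_name, vocabulary) pairs in assembler order."""
    names = [name for name, _ in blocks]
    vocabs = [vocab for _, vocab in blocks]
    # starts[i] is the first global index of block i; starts[-1] is the total size.
    starts = [0] + list(accumulate(len(vocab) for vocab in vocabs))

    def lookup(index):
        if not 0 <= index < starts[-1]:
            raise IndexError("feature index out of range")
        block = bisect_right(starts, index) - 1
        return names[block], vocabs[block][index - starts[block]]

    return lookup


lookup = build_lookup([("term", ["storm", "flood"]), ("hashtag", ["sandy"])])
first = lookup(0)
last = lookup(2)
```
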

<h1><span style="color: #000080;"><strong>Temporal Split</strong></span></h1>

<p><span style="font-size: 13.3333px;"><strong>Now that we have our feature matrix, it is time to establish the training, validation and test sets for training the classifier.</strong></span></p>
<p><span style="font-size: 13.3333px;"><strong>To ensure that our classifier generalizes to a wide range of features and does not simply remember past hashtags, we perform a temporal split and exclude training hashtags from the validation and test sets.</strong></span></p>

![caption](https://github.com/demoonism/TwitterSensor/blob/master/Screenshot/Capture.JPG?raw=true)
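
One possible reading of this split is sketched below in plain Python. The cutoff timestamps, the tweet layout (dicts with `create_time` and a list of `hashtags`), and the function names are assumptions made for illustration; the exact procedure used in the project may differ.

```python
def temporal_split(tweets, train_end, val_end):
    """Split tweets by timestamp: [.., train_end), [train_end, val_end), [val_end, ..)."""
    train, val, test = [], [], []
    for tweet in tweets:
        if tweet["create_time"] < train_end:
            train.append(tweet)
        elif tweet["create_time"] < val_end:
            val.append(tweet)
        else:
            test.append(tweet)
    return train, val, test


def drop_hashtags(tweets, hashtags_to_drop):
    """Return copies of the tweets with the given hashtags removed."""
    return [
        dict(t, hashtags=[h for h in t["hashtags"] if h not in hashtags_to_drop])
        for t in tweets
    ]


tweets = [
    {"create_time": 1.0, "hashtags": ["sandy"]},
    {"create_time": 5.0, "hashtags": ["sandy", "haiyan"]},
    {"create_time": 9.0, "hashtags": ["napaquake"]},
]
train, val, test = temporal_split(tweets, train_end=3.0, val_end=7.0)

# Hashtags used to label the training period are hidden from held-out tweets.
train_hashtags = {"sandy"}
val_clean = drop_hashtags(val, train_hashtags)
```
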

In [12]:
# Use a separate name so the term tokenizer defined earlier is not overwritten.
hashtag_splitter = Tokenizer(inputCol="hashtag", outputCol="each_hashtag")
hashtags_df = hashtag_splitter.transform(df)

hashtag =  hashtags_df.select("tweet_id","create_time","each_hashtag")
hash_exploded = hashtag.withColumn('each_hashtag', explode('each_hashtag'))

In [13]:
hash_exploded.select("each_hashtag").distinct().count()

13573239
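
The tokenize-then-explode step turns one row per tweet into one row per (tweet, hashtag) pair. A plain-Python sketch of the same reshaping is shown below; `explode_hashtags` is an illustrative name. One difference to keep in mind: this sketch drops empty hashtag fields entirely, whereas the Spark tokenizer may emit an empty token for them.

```python
def explode_hashtags(rows):
    """Yield one (tweet_id, create_time, hashtag) triple per hashtag.

    rows: iterable of (tweet_id, create_time, hashtag_field), where
    hashtag_field is a whitespace-separated string of hashtags.
    """
    for tweet_id, create_time, hashtag_field in rows:
        for tag in hashtag_field.lower().split():
            yield tweet_id, create_time, tag


rows = [
    ("1", 10.0, "Sandy NYC"),
    ("2", 11.0, ""),
]
exploded = list(explode_hashtags(rows))
```
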

In [None]:
## Define topical hashtag list
topic_dict = {
    "soccer":{"soccer", "football", "worldcup", "sports", "futbol", "fifa", "mls", "worldcup2014", "epl", "sportsroadhouse", "sport", "adidas", "messi", "usmnt", "arsenal", "manchesterunited", "nike", "ronaldo", "manutd", "fifaworldcup", "foot", "ussoccer", "sportsbetting", "realmadrid", "aleague", "chelsea", "manchester", "cr7", "footballnews", "championsleague", "youthsoccer", "eplleague", "barcelona", "brazil2014", "soccerproblems", "premierleague", "brasil2014", "soccerlife", "cristianoronaldo", "uefa", "fifa2014", "beckham", "fifa14", "neymar", "fussball", "soccergirls", "barca", "manchestercity", "league", "fútbol", "halamadrid", "bayern", "women", "lfc", "goalkeeper", "everton", "bayernmunich", "soccerprobs", "league1", "juventus", "nufc", "mcfc", "cristiano", "eurosoccercup", "platini", "socce", "mancity", "torontofc", "dortmund", "derbyday", "fifa15", "liverpool", "league2", "ilovesoccer", "fcbarcelona", "maradona", "intermilan", "futebol", "soccergirlprobs", "soccersixfanplayer", "realfootball", "gunners", "confederationscup", "worldcupproblems", "ballondor", "collegesoccer", "rooney", "flagfootball", "realsaltlake", "lionelmessi", "usavsportugal", "europaleague", "soccernews", "uefachampionsleague", "psg", "gobrazil", "uslpro", "wc2014", "suarez", "bvb", "soccerprobz", "worldcupqualifiers", "torres", "footbal", "balotelli", "nashville", "inter", "milano", "cardiff", "jleague", "nwsl", "ozil", "worldcup2014brazil", "nycfc", "mess", "soccernation", "pelé", "tottenham", "ligue1", "landondonovan", "atletico", "worldcup14", "torino", "soccerislife", "fernandotorres", "ronaldinho", "goldenball", "wembley", "brazilvscroatia", "collegefootball", "elclassico", "footba", "fifa13", "soccersunday", "englandsoccercup", "usasoccer", "womensfootball", "fcbayern", "fifaworldcup2014", "usavsgermany", "neymarjr", "soccersucks", "arturovidal", "zidane", "ballislife", "usavsger", "mlscup", "worldcupfinal", "ajax", "soccerball", "lovesoccer", "euro2013", "soccergame", 
"premiereleague", "mu", "lionel", "soccermanager", "mundial2014", "portugalvsgermany", "soccerseason", "mondiali2014", "davidbeckham", "redbulls", "argvsned", "selecao", "usavsmex", "soccergirlproblems", "soccerlove", "2014worldcup", "soccergrlprobs", "germanyvsargentina", "zlatan", "napoli", "muller", "confederations_cup", "championsleaguefinal", "worldcuppredictions", "clasico", "liverpoolvsrealmadrid", "mundialsub17", "worldcupbrazil", "leaguechamps", "arsenalfans", "germanyvsalgeria", "netherlandsvsargentina", "belvsusa", "bravsned", "mexicovsusa", "englandvsuruguay", "germanyvsbrazil", "brazilvsnetherlands", "gervsarg", "engvsita", "brazilvsgermany", "englandvsitaly", "espvsned", "crcvsned", "ghanavsusa", "francevsswitzerland", "argentinavsgermany", "spainvsnetherlands", "usavscan", "worldcupbrazil2014", "brazil2014worldcup", "fifaworldcupbrazil", "worldcup2018", "championleague"},
    "Natr_Disaster":{"sandy", "drought", "storm", "hurricane", "tornado", "hurricanesandy", "earthquake", "arthur", "julio", "manuel", "flood", "hurricanes", "quakelive", "hurricaneseason", "hurricaneseason", "hurricanepride", "quake", "hurricanekatrina", "katrina", "floodwarning", "eqnz", "bertha", "tsunami", "tsunamimarch", "hurricanekid", "drought3", "hurricanenia", "hurricanenation", "cholera", "hurricanefly", "drought13", "laquake", "typhoon", "tsunami2004", "ukstorm", "hurricaneforever", "quakecon2013", "prayforchina", "quakecon", "manuelpellegrini", "flood2013", "prayforthephilippines", "hurricanepreparedness", "hurricaneharbor", "typhoons", "hurricane13", "abfloods", "ukfloods", "hurricaneweek", "typhoonmaring", "odile", "hurricaneprep", "phailin", "earthquakeph", "visayasquake", "haiyan", "typhoonyolanda", "typhoonhaiyan", "typhoonaid", "typhoonjet", "corkfloods", "laearthquake", "quakecon2014", "flood2014", "prayforchile", "chileearthquake", "serbiafloods", "tsunamihitsfaisalabad", "hurricanearthur", "tsunami4nayapakistan", "typhoonglenda", "hurricanebertha", "hurricaneiselle", "napaquake", "napaearthquake", "hurricanemarie", "kashmirfloods", "hurricaneodile", "hurricanegonzalo", "hurricaneana", "haiyan1year", "typhoonhagupit", "typhoonruby"},
    "health": {"health","uniteblue","ebola","healthcare","depression","hiv","cdc","crisis","obesity","aids","nurse","flu","alert","publichealth","bandaid30","malaria","disease","fever","antivirus","virus","lagos","unsg","sierraleone","ebolaresponse","ebolaoutbreak","chanyeolvirusday","aids2014","vaccine","mer","homeopathy","msf","allergy","nih","humanitarianheroes","stopthespread","dengue","flushot","epidemic","ebolainatlanta","tuberculosis","westafrica","quarantine","ebolavirus","viruses","kacihickox","emory","meningitis","ebolaczar","enterovirus","pandemic","stopebola","chikungunya","eplague","childhoodobesity","plague","allergyseason","coronavirus","healthworkers","endebola","ebolaqanda","obola","h1n1","aidsfree","factsnotfear","ebolafacts","chickenpox","birdflu","ebolainnyc","dallasebola","ebolachat","eboladallas","childobesity","healthsystems","aidsday","truedepressioniswhen","askebola","depressionawareness","ambervinson","depressionhurts","ninapham","nursesfightebola","mickeyvirus","rotavirus","blackdeath","theplague","fluvaccine","thomasericduncan","plagueinc","stomachvirus","seasonaldepression","mercervirus","beatdepression","aidswalk","depressionproblems","aidswalkny","westnilevirus","depressionkills","smallpox","blackplague","depressionawarenessweek","epoxy","teendepression","fluushots","delpoxespn","virushsality","deepdepression","theamebavirus","fightdepression","plagues","flubug","aidswalkla","bubonicplague","winterdepression","poliovirus","skarfluvvirus","aidsday2014","allergyattack","allergyawarenessweek","pox","spox","theligerplague","thevirus"},
    "Social_issue": {"blacklivesmatter","ferguson","icantbreathe","ericgarner","alllivesmatter","mikebrown","shutitdown","antoniomartin","fergusondecision","nypdlivesmatter","millionsmarchnyc","justice4all","justiceformikebrown","handsupdontshoot","moa","policelivesmatter","berkeleyprotests","thisstopstoday","tamirrice","nojusticenopeace","racism","aurarosser","michaelbrown","thesystemisbroken","blackxmas","policebrutality","deblasio","fergusonoctober","wecantbreathe","justiceforericgarner","every28hours","racist","stoptheparade","enoughisenough","justice","johncrawford","bodycameras","dcferguson","millionsmarch","whereisjustice","blacktwitter","london","police","yamecanse","boston","india","bluelivesmatter","protest","whitelivesmatter","newyork","tcot","justiceforall","equality4all","handsup","whitesilence","economicjustice","solidarity","handsupwalkout","crimingwhilewhite","dontshoot","whiteprivilege","ows","teaparty","wallst","occupy","occupy","p2","tcot","anonymous","teaparty","occupywallstreet","uniteblue","tlot","ferguson","occupylove"},
    "Celebrity_death":{"jamesavery","freshprince","unclephil","freshprinceofbelair","rip","ripjamesavery","thefreshprinceofbelair","robinwilliams","nelsonmandela","philipseymourhoffman","paulwalker","mandela","prayforap","madiba","mayaangelou","rippaulwalker","riprobinwilliams","ripnelsonmandela","ripcorymonteith","ripmandela","ripjoanrivers","riptalia","riplilsnupe","ripleerigby"}
}

<h1><span style="color: #000080;"><strong>(Side notes) Saving intermediate data</strong></span></h1>

<p><span style="font-size: 13.3333px;"><strong>If we want to save intermediate data for any topic, we can do so with the following steps (using natural disasters as the example):</strong></span></p>

In [None]:
disaster_ids = hash_exploded.select(hash_exploded.tweet_id).where(hash_exploded.each_hashtag.isin(topic_dict["Natr_Disaster"])).distinct().cache()

In [None]:
df_disaster = df.join(disaster_ids,\
                                 df.tweet_id == disaster_ids.tweet_id,\
                                 "inner").select(df.create_time,\
                                                 df.from_id,\
                                                 df.from_user,\
                                                 df.hashtag,\
                                                 df.location,\
                                                 df.mention,\
                                                 df.tweet_id,\
                                                 df.term)

In [None]:
workdir = "/mnt/4e8ba653-f2f0-4e18-a51e-458026833dee/final_parquet"

In [None]:
# Parquet
df_disaster.write.save(workdir+"/Natrual_Disaster_new", format="parquet")

In [None]:
# json
df_disaster.write.json(workdir+"/Natrual_Disaster_json_final")

<h1><span style="color: #000080;"><strong>Hashtag Birthday</strong></span></h1>

<p><span style="font-size: 13.3333px;"><strong>A hashtag's birthday is the first timestamp at which that hashtag appears in the tweet corpus between 2013 and 2014. We determine it by finding the minimum "create time" for each hashtag.</strong></span></p>

In [None]:
df_birthday = hash_exploded.join(disaster_ids,\
                                 hash_exploded.tweet_id == disaster_ids.tweet_id,\
                                 "inner").select(hash_exploded.create_time,\
                                                 hash_exploded.each_hashtag,\
                                                 hash_exploded.tweet_id)

In [None]:
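## Find the "birthday", i.e. the earliest time at which each hashtag appears.
## (An extra column of 1 marks the hashtag as topical; it is used in a join later.)
## df_birthday names the column create_time; the aggregate is renamed so that
## later cells can keep referring to "min(creat_time)".

Ordered_Hashtag_set = df_birthday.\
                      groupby("each_hashtag").\
                      agg({"create_time": "min"}).\
                      withColumnRenamed("min(create_time)", "min(creat_time)").\
                      orderBy("min(creat_time)", ascending=True).\
                      withColumn("topical", lit(1))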


In [None]:
## Count the topical hashtags (the length of the ordered birthday list).
loading = time.time()
time_span = Ordered_Hashtag_set.count()
getTime(loading)

In [None]:
# Get id of the corresponding time split (50% and 60%).
train_val_split_Ht = np.floor(np.multiply(time_span, 0.5)).astype(int)
val_test_split_Ht =  np.floor(np.multiply(time_span, 0.6)).astype(int)

In [None]:
# Converting to Pandas for random row access.

pd_Ordered_Hashtag_set = Ordered_Hashtag_set.toPandas()

In [None]:
# Locate the timestamps at the 50% and 60% cutoff points. They are used later to divide D.

train_val_time = pd_Ordered_Hashtag_set.iloc[train_val_split_Ht]['min(creat_time)']
val_test_time = pd_Ordered_Hashtag_set.iloc[val_test_split_Ht]['min(creat_time)']

In [None]:
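# Split hashtags into H_train, H_valid, H_test by birthday, then collect the ids of
# the tweets that carry them. Ordered_Hashtag_set is keyed by hashtag and has no
# tweet_id column, so it is first joined back to df_birthday on each_hashtag.
# The topical column is kept because the labeling joins below rely on it.

Birthday_tweets = df_birthday.select("tweet_id", "each_hashtag").\
                              join(Ordered_Hashtag_set, "each_hashtag")

Train_ids = Birthday_tweets.where(col("min(creat_time)") <= train_val_time).\
                            select("tweet_id", "topical").distinct()

Valid_ids = Birthday_tweets.where((col("min(creat_time)") > train_val_time) & (col("min(creat_time)") <= val_test_time)).\
                            select("tweet_id", "topical").distinct()

Test_ids = Birthday_tweets.where(col("min(creat_time)") > val_test_time).\
                           select("tweet_id", "topical").distinct()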
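
The birthday computation and the 50%/60% split above can be illustrated on toy data in plain Python. This is a minimal sketch, not the Spark pipeline itself: the hashtags, timestamps, and the helper names `hashtag_birthdays` and `split_by_birthday` are made up for illustration.

```python
import math

# Toy (hashtag, create_time) observations; the values are illustrative only.
observations = [
    ("flood", 5), ("flood", 2), ("quake", 9), ("storm", 1),
    ("storm", 7), ("tsunami", 4), ("wildfire", 12), ("quake", 3),
]

def hashtag_birthdays(rows):
    """Return (hashtag, earliest create_time) pairs sorted by birthday."""
    first_seen = {}
    for tag, ts in rows:
        if tag not in first_seen or ts < first_seen[tag]:
            first_seen[tag] = ts
    return sorted(first_seen.items(), key=lambda kv: kv[1])

def split_by_birthday(ordered, train_frac=0.5, valid_frac=0.6):
    """Split hashtags at the birthdays found at the given fractions of the ordered list."""
    n = len(ordered)
    train_cut = ordered[int(math.floor(n * train_frac))][1]
    valid_cut = ordered[int(math.floor(n * valid_frac))][1]
    train = [tag for tag, ts in ordered if ts <= train_cut]
    valid = [tag for tag, ts in ordered if train_cut < ts <= valid_cut]
    test = [tag for tag, ts in ordered if ts > valid_cut]
    return train, valid, test

ordered = hashtag_birthdays(observations)
train_tags, valid_tags, test_tags = split_by_birthday(ordered)
print(train_tags, valid_tags, test_tags)
```

Because the split is made on hashtag birthdays rather than on random rows, hashtags in the validation and test groups are always "born" later than those in the training group.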

<p><span style="font-size: 13.3333px;"><strong>Now that we have identified the ids to be used in the training, validation and test sets, we can join these ids with our feature set to obtain the corresponding data sets.</strong></span></p>

![caption](https://github.com/demoonism/TwitterSensor/blob/master/Screenshot/remove_twit.JPG?raw=true)

<h1><span style="color: #000080;"><strong>Train-Valid-Test split</strong></span></h1>

#### Training Labeling

In [None]:
Training_set = transformed.select("tweet_id","features").where(col("creat_time") <= train_val_time)

Training_set_labled = Training_set.join(Train_ids, Training_set.tweet_id == Train_ids.tweet_id, "left").\
                           select(Training_set.tweet_id, Training_set.features, F.when(Train_ids.topical == 1, 1.0).otherwise(0.0).alias("label"))

In [None]:
loading = time.time()

tr_pos_sample = Training_set_labled.where(col("label") == 1.0).count()

getTime(loading)

In [None]:
tr_pos_sample

#### Validation Labeling

In [None]:
Raw_Validation_set = transformed.select("tweet_id","features").where((col("creat_time") > train_val_time) & (col("creat_time") <= val_test_time))

tr_hashtags_in_vals  = Raw_Validation_set.\
                       join(Train_ids, Raw_Validation_set.tweet_id == Train_ids.tweet_id, "inner").\
                       select(Raw_Validation_set.tweet_id)                        

Validation_set_staging =  Raw_Validation_set.\
                          join(tr_hashtags_in_vals, Raw_Validation_set.tweet_id == tr_hashtags_in_vals.tweet_id, "left_outer").\
                          toDF("tweet_id","features","new_id")
# Caution: this step caused a serious bug. A direct select here would drop the null rows, so toDF is used to rename the columns instead.

Validation_set =  Validation_set_staging.select(col("tweet_id"),col("features")).where(col("new_id").isNull())


Validation_set_labled = Validation_set.join(Valid_ids, Validation_set.tweet_id == Valid_ids.tweet_id, "left").\
                           select(Validation_set.tweet_id, Validation_set.features, F.when(Valid_ids.topical == 1, 1.0).otherwise(0.0).alias("label"))

In [None]:
val_pos_sample = Validation_set_labled.where(col("label") == 1.0).count()

In [None]:
val_pos_sample

#### Test Labeling

In [None]:
Raw_Test_set = transformed.select("tweet_id","features").where(col("creat_time") > val_test_time)

tr_hashtags_in_test  = Raw_Test_set.\
                       join(Train_ids, Raw_Test_set.tweet_id == Train_ids.tweet_id, "inner").\
                       select(Raw_Test_set.tweet_id)

Test_set_staging =  Raw_Test_set.\
                          join(tr_hashtags_in_test, Raw_Test_set.tweet_id == tr_hashtags_in_test.tweet_id, "left_outer").\
                          toDF("tweet_id","features","new_id")

Test_set =  Test_set_staging.select(col("tweet_id"),col("features")).where(col("new_id").isNull())


Test_set_labled = Test_set.join(Test_ids, Test_set.tweet_id == Test_ids.tweet_id, "left").\
                           select(Test_set.tweet_id, Test_set.features, F.when(Test_ids.topical == 1, 1.0).otherwise(0.0).alias("label"))

In [None]:
te_pos_sample = Test_set_labled.where(col("label") == 1.0).count()

In [None]:
te_pos_sample
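
The three labeling cells above follow one pattern: restrict tweets to a time window, drop validation or test tweets that carry a training hashtag, and label what remains as 1.0 when its id is in the topical id set. A minimal plain-Python sketch of that pattern follows; the function name `label_split`, the toy ids, and the cutoff values are assumptions made for illustration.

```python
def label_split(tweets, window, topical_ids, exclude_ids=frozenset()):
    """Label tweets inside a time window, skipping ids listed in exclude_ids.

    tweets: iterable of (tweet_id, create_time) pairs
    window: (start_exclusive, end_inclusive); None means unbounded on that side
    """
    start, end = window
    labeled = {}
    for tweet_id, ts in tweets:
        if start is not None and ts <= start:
            continue
        if end is not None and ts > end:
            continue
        if tweet_id in exclude_ids:
            continue  # anti-join: the tweet carries a training hashtag
        labeled[tweet_id] = 1.0 if tweet_id in topical_ids else 0.0
    return labeled

tweets = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]
train_ids = {1, 4}   # tweets carrying a training hashtag
valid_ids = {3, 4}   # tweets carrying a validation hashtag

train = label_split(tweets, (None, 25), train_ids)
valid = label_split(tweets, (25, 55), valid_ids, exclude_ids=train_ids)
print(train, valid)
```

In Spark, the "left outer join, then keep the null rows" idiom used above can also be written as a join with the `"left_anti"` join type, which returns only the left rows that have no match on the right.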

## Sampling data to balance label

In [None]:
# Keep every positive sample and downsample the negatives to about 5% to form the final training set.

Input = Training_set_labled.sampleBy("label", fractions={0.0: 0.05, 1.0: 1}, seed=0) 

In [None]:
Input.count()
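
The effect of sampling by label can be sketched in plain Python: every row is kept independently with the probability listed for its label, so the resulting counts are approximate rather than exact. The helper `sample_by` and the toy rows are hypothetical and only mimic the behavior described above.

```python
import random

def sample_by(rows, fractions, seed=0):
    """Keep each (label, payload) row with the probability listed for its label."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[0], 0.0)]

# 20 positives and 2000 negatives, mimicking a heavily imbalanced training set.
rows = [(1.0, "pos-%d" % i) for i in range(20)] + [(0.0, "neg-%d" % i) for i in range(2000)]

sample = sample_by(rows, {0.0: 0.05, 1.0: 1.0})
n_pos = sum(1 for label, _ in sample if label == 1.0)
n_neg = len(sample) - n_pos
print(n_pos, n_neg)
```

All positives survive because their fraction is 1.0, while roughly one negative in twenty is kept, which brings the two classes much closer in size.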

<p><span style="color: #000080;"><strong><span style="font-family: georgia, palatino, serif; font-size: 24pt;">Step Three: Training Classifier</span></strong></span></p>

## Feature Selection

In [None]:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
## Note that the 100 here is just a dummy variable. The actual number of features to use will be part of the grid search.

# ChiSqSelector is an estimator, not a DataFrame, so it has no cache() method.
selector = ChiSqSelector(numTopFeatures=100, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
model = selector.fit(Input)
result = model.transform(Input)

In [None]:
result.where(col("label") == 1.0).show(20)
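
Chi-square selection ranks features by how strongly they depend on the label. The sketch below computes the Pearson chi-square statistic for binary features on toy data; the feature names and values are invented, and this is an illustration of the statistic rather than Spark's implementation.

```python
def chi_square(feature, label):
    """Pearson chi-square statistic for two equal-length binary (0/1) sequences."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = observed[f][0] + observed[f][1]
            col_total = observed[0][y] + observed[1][y]
            expected = row_total * col_total / float(n)
            if expected > 0:
                stat += (observed[f][y] - expected) ** 2 / expected
    return stat

labels       = [1, 1, 1, 1, 0, 0, 0, 0]
has_hashtag  = [1, 1, 1, 1, 0, 0, 0, 0]   # perfectly aligned with the label
has_stopword = [1, 0, 1, 0, 1, 0, 1, 0]   # independent of the label

scores = {
    "has_hashtag": chi_square(has_hashtag, labels),
    "has_stopword": chi_square(has_stopword, labels),
}
top_feature = max(scores, key=scores.get)
print(scores, top_feature)
```

A selector with `numTopFeatures=1` on this toy data would keep `has_hashtag`, since the label-independent feature scores zero.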

<h1><span style="color: #000080;"><strong>Train logistic regression and Hyperparameter Tuning</strong></span></h1>

<p><span style="font-size: 13.3333px;"><strong>We tune two hyperparameters for the logistic regression, namely the number of features and the L2 penalty.</strong></span></p>
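
To give a sense of the search cost, the sketch below enumerates a two-parameter grid as a Cartesian product and counts the model fits that 3-fold cross-validation would need. The candidate values are copied from the grid used in this notebook; the dictionary layout is only an illustration, not Spark's internal representation.

```python
from itertools import product

num_top_features = [10, 100, 1000, 10000, 50000, 1000000, 5000000]
reg_params = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.8,
              1, 1.5, 2.0, 2.5, 10, 100]
num_folds = 3

# One entry per (numTopFeatures, regParam) combination.
grid = [{"numTopFeatures": n, "regParam": r}
        for n, r in product(num_top_features, reg_params)]

# Each combination is fitted once per fold during cross-validation.
total_fits = len(grid) * num_folds
print(len(grid), total_fits)
```

The grid grows multiplicatively with every added parameter, so trimming either list reduces the number of fits proportionally.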

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

blor = LogisticRegression(maxIter=5, featuresCol='selectedFeatures', labelCol='label')
# BinaryClassificationEvaluator scores with the area under the ROC curve by default.
evaluator = BinaryClassificationEvaluator()
# To tune numTopFeatures, put the selector in the pipeline and fit on Input instead of result:
#pipeline = Pipeline(stages=[selector, blor])
pipeline = Pipeline(stages=[blor])

# With the default elasticNetParam of 0.0, regParam is the L2 penalty.
# selector is not a stage of the active pipeline, so its numTopFeatures values
# only take effect with the commented-out pipeline above.
# Alternative, narrower regParam grid: [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.3]
paramGrid = ParamGridBuilder() \
    .addGrid(selector.numTopFeatures, [10, 100, 1000, 10000, 50000, 1000000, 5000000]) \
    .addGrid(blor.regParam, [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.8, 1, 1.5, 2.0, 2.5, 10, 100]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(result)


In [None]:
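from pyspark.mllib.evaluation import MulticlassMetrics

# Score the data with the best cross-validated model, then pair each prediction with its label.
predictions = cvModel.transform(result)
predictionAndLabels = predictions.select('prediction', 'label').rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = MulticlassMetrics(predictionAndLabels)
metrics.confusionMatrix().toArray()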

In [None]:
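# The probability column is produced by the fitted model, so score the data first.
a = cvModel.transform(result).select("probability").toDF("a")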

In [None]:
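from pyspark.ml.feature import VectorSlicer

# Split the two-element probability vector in column "a" into one column per class.
# The slicer definitions are assumed from the column names used below.
slicer1 = VectorSlicer(inputCol="a", outputCol="0_prob", indices=[0])
slicer2 = VectorSlicer(inputCol="a", outputCol="1_prob", indices=[1])

staging = slicer1.transform(a)
output =  slicer2.transform(staging)

output.select("a", "0_prob", "1_prob").orderBy('1_prob', ascending=False).show()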

In [None]:
#As of Spark 2.0 ml and mllib API are no longer compatible and the latter one is going towards deprecation and removal. If you still need this you'll have to convert ml.Vectors to mllib.Vectors.

In [None]:
from pyspark.mllib import linalg as mllib_linalg
from pyspark.ml import linalg as ml_linalg

def as_old(v):
    if isinstance(v, ml_linalg.SparseVector):
        return mllib_linalg.SparseVector(v.size, v.indices, v.values)
    if isinstance(v, ml_linalg.DenseVector):
        return mllib_linalg.DenseVector(v.values)
    raise ValueError("Unsupported type {0}".format(type(v)))

# Evaluation

In [None]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics, RankingMetrics
from pyspark.mllib.regression import LabeledPoint

# Compute scores on the validation set and give every example an index to use as its id.
# (For an mllib LogisticRegressionModel, call model.clearThreshold() first so that
# predict returns probabilities instead of 0/1 labels.)
k = 100
scored = Valid_RDD.map(lambda lp: (float(model.predict(lp.features)), float(lp.label))).zipWithIndex()

# Ids of the k highest-scoring examples, and ids of all truly topical examples.
Pred = [idx for (_, idx) in scored.takeOrdered(k, key=lambda x: -x[0][0])]
Truth = scored.filter(lambda x: x[0][1] == 1.0).map(lambda x: x[1]).collect()
predictionAndLabels = sc.parallelize([(Pred, Truth)])

# Instantiate metrics object
## RankingMetrics ONLY takes an RDD of (predicted ranking, ground truth) list pairs
metrics = RankingMetrics(predictionAndLabels)
print("Precision @ k = %s" % metrics.precisionAt(k))

#print("Mean Average precision = %s" % metrics.meanAveragePrecision)


In [None]:
model.predictAll()

In [None]:
predictionAndLabels.take(5)
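
For reference, precision at k can also be computed directly from (score, label) pairs without Spark. The sketch below sorts by score, keeps the top k, and divides the number of topical examples among them by k. The function name and the toy scores are illustrative assumptions.

```python
def precision_at_k(scored, k):
    """Fraction of the k highest-scoring examples whose label is 1.0.

    scored: list of (score, label) pairs, label 1.0 for topical tweets.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    hits = sum(1 for _, label in top if label == 1.0)
    return hits / float(k)

scored = [(0.95, 1.0), (0.90, 0.0), (0.80, 1.0), (0.40, 1.0), (0.10, 0.0)]
p_at_3 = precision_at_k(scored, 3)
p_at_5 = precision_at_k(scored, 5)
print(p_at_3, p_at_5)
```

Dividing by k, rather than by the number of examples actually returned, penalizes rankings that contain fewer than k items.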