<p><span style="font-size: 36pt; font-family: georgia, palatino, serif; color: #800000;">Learning Topical Social Sensors</span></p>

# How useful is Twitter to you for finding the right information?

![caption](search.jpg)

<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><em><strong>We can do better than this!</strong> </em></span></span></p>

<p style="text-align: left;"><strong>In this project, we are aiming to train a classifier to identify targeted information on Twitter with high precision. </strong></p>
<p style="text-align: left;"><strong>For example, if you are interested in:</strong></p>
<p style="text-align: left;"><em><strong>&bull; Global social issues</strong></em><br /><em><strong>&bull; Politics in the Pacific Northwest</strong></em><br /><em><strong>&bull; Public transit in New York City</strong></em></p>
<p style="text-align: left;"><strong>The classifier would serve as a "sensor" to identify topical tweets based on your tailored interests!</strong></p>

# Challenges

<p style="text-align: left;"><strong>(1) &nbsp;Billions of potential features, thousands of useful ones (Hashtags, users, mentions, terms, locations)</strong></p>
<p style="text-align: left;"><strong>(2) &nbsp;Need a lot of labeled data to learn feature weights well</strong></p>

# Solution

<p><span style="font-size: 12pt;">(1)<strong> Careful feature engineering and feature selection using Apache Spark.</strong></span></p>
<p><span style="font-size: 10pt;"><strong><em>We performed feature transformation and selection with Apache Spark on a standalone server with eight 1 TB hard disks, a 20-core CPU (40 threads), and 256 GB of RAM.</em> </strong></span></p>
<p><span style="font-size: 12pt;">(2) <strong>Hashtags!</strong>&nbsp;</span></p>
<p><span style="font-size: 10pt;"><strong><em>Hashtags originated on IRC, were later (and perhaps most famously) adopted on Twitter, and now appear on other social media platforms such as Instagram, Tumblr, and Facebook. They usually serve as surrogates for topics. Therefore, for each topic, we leverage a (small) set of user-curated topical hashtags to efficiently provide a large number of supervised topic labels for social media content.</em></strong></span></p>
<p><span style="font-size: 10pt;"><strong><em>We used 4 independent annotators to query the Twitter search API to identify candidate hashtags for each topic. A&nbsp;hashtag is assigned to a topic set if at least 3 of the 4 annotators agree on the assignment.</em></strong></span></p>
<p><span style="font-size: 10pt;"><strong><em>For example, for the topic "Soccer", the set of hashtags is [........]</em></strong></span></p>
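The agreement rule can be sketched in a few lines of plain Python. This is only an illustration: the input format (annotator name mapped to a set of proposed hashtags), the function name `agreed_hashtags`, and the placeholder tags are assumptions, not the annotation tooling actually used.

```python
from collections import Counter

def agreed_hashtags(votes, min_agree=3):
    """Return hashtags proposed by at least `min_agree` annotators.

    `votes` maps an annotator name to the set of hashtags that annotator
    proposed for one topic (hypothetical input format).
    """
    counts = Counter(tag for tags in votes.values() for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n >= min_agree)

# Placeholder votes from four annotators, for illustration only.
votes = {
    "annotator_1": {"tag_a", "tag_b", "tag_c"},
    "annotator_2": {"tag_a", "tag_b"},
    "annotator_3": {"tag_a", "tag_b", "tag_c"},
    "annotator_4": {"tag_a", "tag_d"},
}
print(agreed_hashtags(votes))
```

Here `tag_a` (4 votes) and `tag_b` (3 votes) are kept, while `tag_c` (2 votes) and `tag_d` (1 vote) are rejected.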
<p><span style="font-size: 10pt;">&nbsp;</span></p>
<p><span style="font-size: 18pt;"><strong>Catch!</strong></span></p>
<p><em><strong><span style="font-size: 10pt;">Hashtags are part of our feature set, so wouldn't the classifier simply learn to memorize them?</span></strong></em></p>
<p><em><strong><span style="font-size: 10pt;">To make the classifier generalize, we remove training hashtags from the validation and test sets, so that it predicts from the learned features rather than from memorized hashtags. This is illustrated further in the Temporal Split and Train-Valid-Test split sections later.</span></strong></em></p>

# Now that we have labeled data, what features could be useful for predicting topicality?

![caption](twt.jpg)

<p style="text-align: left;"><span style="font-size: 18pt;"><strong>Why might these tweet features be useful?</strong> </span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><em><strong>&bull; Users: Who tweets on the topic?</strong></em></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>The Weather Channel for Natural Disasters</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><em><strong>&bull; Hashtags: What hashtags co-occur with the topic?</strong></em></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>#teaparty for LGBT rights</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><em><strong>&bull; Mentions:</strong></em></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>@Redcross for Natural Disaster</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><em><strong>&bull; Locations:</strong></em></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>Philippines for Natural Disaster</strong></em></span></span></p>
<p style="text-align: left;"><br /><span style="font-size: 10pt;"><em><strong>&bull; Terms:</strong></em></span><br /><span style="font-size: 10pt;"> <em><strong>-</strong></em> <span style="text-decoration: underline;"><em><strong>Protest for Human Caused Disaster</strong></em></span></span></p>

# Implementation

<p><strong><span style="font-size: 10pt;">The original Twitter data were collected over 2 years and amount to over 2 TB compressed. They consist of hundreds of millions of lines of tweets.</span></strong></p>
<p><strong><span style="font-size: 10pt;">How do we go from the raw data to an efficient classifier?</span></strong></p>
<p><strong><span style="font-size: 10pt;">The following three-step process serves as an end-to-end pipeline for ETL and ML training.</span></strong></p>
<p>&nbsp;</p>

<p><span style="font-family: georgia, palatino, serif; font-size: 24pt; color: #800000;">Step One: Pre-Processing</span></p>

###### <p><em>Each valid tweet looks like this:</em></p>

<p><span style="font-size: 10pt;"><strong>Example</strong></span></p>
<p><span style="font-size: 8pt;"><strong>{</strong>"created_at":"Thu Jan 31 12:58:06 +0000 2013",</span><br /><span style="font-size: 8pt;"> "id":296965581582786560,</span><br /><span style="font-size: 8pt;"> "id_str":"296965581582786560",</span><br /><span style="font-size: 8pt;"> "text":"Im ready for whatever",</span><br /><span style="font-size: 8pt;"> "source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e",</span><br /><span style="font-size: 8pt;"> "truncated":false,</span><br /><span style="font-size: 8pt;"> "in_reply_to_status_id":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_status_id_str":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_user_id":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_user_id_str":null,</span><br /><span style="font-size: 8pt;"> "in_reply_to_screen_name":null,</span><br /><span style="font-size: 8pt;"> "user":{</span><br /><span style="font-size: 8pt;"> "id":1059349532,</span><br /><span style="font-size: 8pt;"> "id_str":"1059349532",</span><br /><span style="font-size: 8pt;"> "name":"Don Dada",</span><br /><span style="font-size: 8pt;"> "screen_name":"ImDatNiggaBD",</span><br /><span style="font-size: 8pt;"> "location":"South Side Of Little Rock",</span><br /><span style="font-size: 8pt;"> "url":null,</span><br /><span style="font-size: 8pt;"> "description":"Weed Smoker (Kush)",</span><br /><span style="font-size: 8pt;"> "protected":false,</span><br /><span style="font-size: 8pt;"> "followers_count":109,</span><br /><span style="font-size: 8pt;"> "friends_count":110,</span><br /><span style="font-size: 8pt;"> "listed_count":0,</span><br /><span style="font-size: 8pt;"> "created_at":"Fri Jan 04 02:37:28 +0000 2013",</span><br /><span style="font-size: 8pt;"> "favourites_count":14,</span><br /><span style="font-size: 8pt;"> "utc_offset":null,</span><br /><span style="font-size: 8pt;"> "time_zone":null,</span><br /><span 
style="font-size: 8pt;"> "geo_enabled":false,</span><br /><span style="font-size: 8pt;"> "verified":false,</span><br /><span style="font-size: 8pt;"> "statuses_count":1312,</span><br /><span style="font-size: 8pt;"> "lang":"en",</span><br /><span style="font-size: 8pt;"> "contributors_enabled":false,</span><br /><span style="font-size: 8pt;"> "is_translator":false,</span><br /><span style="font-size: 8pt;"> "profile_background_color":"C0DEED",</span><br /><span style="font-size: 8pt;"> "profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png",</span><br /><span style="font-size: 8pt;"> "profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png",</span><br /><span style="font-size: 8pt;"> "profile_background_tile":false,</span><br /><span style="font-size: 8pt;"> "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3184813228\/d6d3a95d902f088f412cf1bd90c126c7_normal.jpeg",</span><br /><span style="font-size: 8pt;"> "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3184813228\/d6d3a95d902f088f412cf1bd90c126c7_normal.jpeg",</span><br /><span style="font-size: 8pt;"> "profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/1059349532\/1359068332",</span><br /><span style="font-size: 8pt;"> "profile_link_color":"0084B4",</span><br /><span style="font-size: 8pt;"> "profile_sidebar_border_color":"C0DEED",</span><br /><span style="font-size: 8pt;"> "profile_sidebar_fill_color":"DDEEF6",</span><br /><span style="font-size: 8pt;"> "profile_text_color":"333333",</span><br /><span style="font-size: 8pt;"> "profile_use_background_image":true,</span><br /><span style="font-size: 8pt;"> "default_profile":true,</span><br /><span style="font-size: 8pt;"> "default_profile_image":false,</span><br /><span style="font-size: 8pt;"> "following":null,</span><br /><span style="font-size: 8pt;"> "follow_request_sent":null,</span><br /><span style="font-size: 8pt;"> 
"notifications":null},</span><br /><span style="font-size: 8pt;"> "geo":null,</span><br /><span style="font-size: 8pt;"> "coordinates":null,</span><br /><span style="font-size: 8pt;"> "place":null,</span><br /><span style="font-size: 8pt;"> "contributors":null,</span><br /><span style="font-size: 8pt;"> "retweet_count":0,</span><br /><span style="font-size: 8pt;"> "entities":{"hashtags":[],</span><br /><span style="font-size: 8pt;"> "urls":[],</span><br /><span style="font-size: 8pt;"> "user_mentions":[]},</span><br /><span style="font-size: 8pt;"> "favorited":false,</span><br /><span style="font-size: 8pt;"> "retweeted":false,</span><br /><span style="font-size: 8pt;"> "lang":"en"<strong>}</strong></span></p>

###### <p><em>Obviously, not all of this data is relevant to our analysis. According to the paper, the only relevant fields for our features are</em></p>
<p><em><strong>{Hashtags}, {From_User}, {Create_Time}, {Location}, {Mentions}</strong></em></p>

###### To filter out irrelevant data and keep our input clean, we use a two-step process:
1. Run Pre-processing.py to parse all data.
2. Run filterEng.py to filter out all non-English tweets.
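The two scripts are not reproduced here, so the following is only a minimal sketch of what the parse and filter steps might look like for one raw line. It assumes the tweet-level `lang` field is used to detect English and that hashtags are lowercased; the function name `parse_tweet` and the sample tweet are hypothetical, and any location normalization done by the real scripts is omitted.

```python
import json
from datetime import datetime

def parse_tweet(line):
    """Parse one raw tweet JSON line; return a flat record or None to skip it."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return None  # malformed line
    if not isinstance(tweet, dict) or "text" not in tweet or "user" not in tweet:
        return None  # not a tweet payload
    if tweet.get("lang") != "en":
        return None  # assumption: English is detected via the "lang" field
    entities = tweet.get("entities") or {}
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "Create_time": created.timestamp(),  # seconds since the Unix epoch
        "from_id": tweet["user"]["id"],
        "from_user": tweet["user"]["screen_name"],
        "hashtag": " ".join(h["text"].lower() for h in entities.get("hashtags", [])),
        "location": tweet["user"].get("location") or "",
        "mention": " ".join(m["screen_name"] for m in entities.get("user_mentions", [])),
        "term": tweet["text"],
        "tweet_id": tweet["id"],
    }

# A made-up raw tweet containing only the fields the sketch reads.
raw = json.dumps({
    "created_at": "Thu Jan 31 12:58:06 +0000 2013",
    "id": 296965581582786560,
    "text": "Im ready for whatever",
    "lang": "en",
    "user": {"id": 1, "screen_name": "example_user", "location": "Little Rock"},
    "entities": {"hashtags": [{"text": "Ready"}], "urls": [], "user_mentions": []},
})
record = parse_tweet(raw)
print(record)
```

Returning `None` for skipped lines lets the caller drop them with a single filter, which maps naturally onto a `map` followed by a `filter` over the input lines.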

###### The resulting data looks like this:

<p><strong>Processed-tweet:</strong></p>
<p><strong>{</strong>u'Create_time': 1359737884.0,<br /> u'from_id': 87151732,<br /> u'from_user': u'ishiPTI',<br /> u'hashtag': u'',<br /> u'location': u'loc_dha_lahore_cantt_',<br /> u'mention': u'BushraShekhani',<br /> u'term': u'I am ready for whatever',<br /> u'tweet_id': 297312861586325504<strong>}</strong></p>

###### <p><em>Now that we have a small (sort of) and clean dataset to work with, it is time to move on to Spark to perform some real analysis.</em></p>

<p><span style="font-size: 24pt; color: #800000; font-family: georgia, palatino, serif;">Step Two: Feature Processing</span></p>


###### We need to turn the raw JSON data into a feature matrix. There are two key requirements: (1) data processing must be extremely efficient, and (2) the resulting matrix must be sparse. Both are achieved through the following pipeline.

In [None]:
## Notebook property setup.

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, lit, monotonically_increasing_id, explode

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.param import Param, Params
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer, IDF, StopWordsRemover, CountVectorizer, VectorAssembler

import sys
import time
import os.path
import json
#import simplejson as json
from datetime import datetime
from operator import add

## Helper function to track the run time of a Spark operation.

def getTime(start):
    sec = time.time() - start
    m, s = divmod(sec, 60)
    h, m = divmod(m, 60)
    print('Spark operation takes - %d:%02d:%02d which is %d seconds in total' % (h,m,s,sec))

In [None]:
from pyspark.sql import functions as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
## Enable inline graphs
%matplotlib inline

## Display precision for pandas dataframe
pd.set_option('display.precision', 10)

## Reading DATA

###### After preprocessing, the data are saved in json.gz format. We need to load and parse them into a Spark RDD. Note that the path passed to sc.textFile can be either a single file or a directory; the Spark context creates partitions automatically.
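For intuition, here is a single-machine sketch of what loading does conceptually: accept a file or a directory, read each gzip-compressed file line by line, and parse every line as JSON. It is not how Spark is implemented, and the helper name `read_records` is hypothetical.

```python
import gzip
import json
import os

def read_records(path):
    """Yield one parsed JSON record per line from a .gz file or a directory of them."""
    if os.path.isdir(path):
        files = sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.endswith(".gz")
        )
    else:
        files = [path]
    for file_path in files:
        # "rt" decompresses and decodes to text in one step.
        with gzip.open(file_path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)
```

Because it is a generator, records are produced lazily, which loosely mirrors the fact that Spark does not read anything until an action such as `take` is called.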

In [None]:
# full
#data_raw = sc.textFile('/mnt/66e695cd-1a0c-4e3b-9a50-55e01b788529/Tweet_Output/Sample')
data_raw = sc.textFile('/mnt/66e695cd-1a0c-4e3b-9a50-55e01b788529/Tweet_Output/small_sample')
data = data_raw.map(lambda line: json.loads(line))


###### Taking a look at the (parsed) first line of our input files

In [None]:
sample = data.take(1)

## Turning to dataframe

###### An RDD (Resilient Distributed Dataset) is largely a black box to Spark: the operations that can be performed on it are so unconstrained that Spark cannot optimize them. (Available since Spark 1.0)

###### A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case. Because of its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query. (Added in Spark 1.3)

###### In short, you can write traditional map-reduce style code on both RDDs and DataFrames, but DataFrames also support SQL commands and built-in analytical functions. For performance reasons, we convert our RDD into a DataFrame first.

In [None]:
## Define Dataframe schema.
schema = StructType([StructField('HashTag_Birthday', DoubleType(), False),
                     StructField('from_id', IntegerType(), False),
                     StructField('from_user', StringType(), False),
                     StructField('hashtag', StringType(), True),
                     StructField('location', StringType(), True),
                     StructField('mention', StringType(), True),
                     StructField('term', StringType(), True),
                     StructField('tweet_id', StringType(), False)                     
                    ])
df = sqlContext.createDataFrame(data, schema)


##### The input is shown in tabular form (DataFrame) below. Note that the hashtag field happens to be null for the first few records.

In [None]:
df.show(5)

###### Now we need to transform the textual features into sparse vectors to be piped into our learning algorithm later.
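To make "sparse vector" concrete, here is a tiny pure-Python sketch of count vectorization with a capped vocabulary. It only mimics the idea behind a count vectorizer and is not Spark's implementation; `fit_vocab`, `to_sparse`, and the sample documents are made up for illustration.

```python
from collections import Counter

def fit_vocab(docs, vocab_size):
    """Map the `vocab_size` most frequent tokens to column indices."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:vocab_size])}

def to_sparse(doc, vocab):
    """Return a sparse vector as {column index: count}; absent columns are zero."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    return {vocab[tok]: n for tok, n in counts.items()}

docs = ["flood warning issued", "flood relief flood", "match tonight"]
vocab = fit_vocab(docs, vocab_size=3)
vectors = [to_sparse(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Only the non-zero entries are stored, so memory grows with the number of tokens per tweet rather than with the vocabulary size. That is what keeps a feature space with a huge number of columns tractable.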

## Vectorizing user, hashtag, location, mention, term

In [None]:
term_tokenizer = Tokenizer(inputCol="term", outputCol="words")
term_remover = StopWordsRemover(inputCol=term_tokenizer.getOutputCol(), outputCol="filtered")
term_cv = CountVectorizer(inputCol=term_remover.getOutputCol(), outputCol="term_features", vocabSize=500, minTF= 2, minDF=1)
#term_pipeline = Pipeline(stages=[term_tokenizer, term_remover,term_cv])


hashtag_tokenizer = Tokenizer(inputCol="hashtag", outputCol="tags")
hashtag_cv = CountVectorizer(inputCol=hashtag_tokenizer.getOutputCol(), outputCol="hashtag_features", vocabSize=500, minTF = 2, minDF=1)
#hashtag_pipeline = Pipeline(stages=[hashtag_tokenizer,hashtag_cv])

mention_tokenizer = Tokenizer(inputCol="mention", outputCol="mentions")
mention_cv = CountVectorizer(inputCol=mention_tokenizer.getOutputCol(), outputCol="mention_features", vocabSize=100, minTF= 2, minDF=1)
#mention_pipeline = Pipeline(stages=[mention_tokenizer, mention_cv])

user_tokenizer = Tokenizer(inputCol="from_user", outputCol="users")
user_cv = CountVectorizer(inputCol=user_tokenizer.getOutputCol(), outputCol="user_features", vocabSize=100, minTF= 2, minDF=1)

loc_tokenizer = Tokenizer(inputCol="location", outputCol="locs")
loc_cv = CountVectorizer(inputCol=loc_tokenizer.getOutputCol(), outputCol="loc_features", vocabSize=100, minTF= 2, minDF=1)

pipeline = Pipeline(stages=[term_tokenizer,term_remover,term_cv,hashtag_tokenizer,hashtag_cv,mention_tokenizer, \
                            mention_cv,user_tokenizer, user_cv, loc_tokenizer, loc_cv])

In [None]:
loading = time.time()

model = pipeline.fit(df)
Train_X = model.transform(df)

getTime(loading)


In [None]:
feat = Train_X.select("tweet_id","term_features","hashtag_features","mention_features","user_features","loc_features")

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols = ["term_features","hashtag_features","mention_features","user_features","loc_features"], outputCol="features")
transformed = assembler.transform(Train_X).select("tweet_id","features","HashTag_Birthday")

## Temporal Split
##### To ensure our classifier generalizes to a wide range of features and does not simply remember past hashtags, we perform a temporal split that excludes training hashtags from validation and test.

![caption](Capture.jpg)
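Before the Spark version, the logic of the temporal split can be sketched on a handful of tuples. This is a simplified illustration under stated assumptions: tweets are `(tweet_id, timestamp, hashtags)` tuples, at least one topical hashtag occurs in the data, and the 50% and 60% cut points are taken over the hashtags ordered by first appearance. The function name and the toy data are hypothetical.

```python
def temporal_split(tweets, topical, train_frac=0.5, valid_frac=0.6):
    """Split tweets by time and label them using disjoint hashtag sets."""
    # Birthday = earliest timestamp at which each topical hashtag appears.
    birthday = {}
    for _, ts, tags in tweets:
        for tag in tags:
            if tag in topical:
                birthday[tag] = min(ts, birthday.get(tag, ts))
    ordered = sorted(birthday, key=lambda t: (birthday[t], t))
    t_train = birthday[ordered[int(len(ordered) * train_frac)]]
    t_valid = birthday[ordered[int(len(ordered) * valid_frac)]]

    h_train = {t for t in ordered if birthday[t] <= t_train}
    h_valid = {t for t in ordered if t_train < birthday[t] <= t_valid}
    h_test = {t for t in ordered if birthday[t] > t_valid}

    splits = {"train": [], "valid": [], "test": []}
    for tweet_id, ts, tags in tweets:
        tags = set(tags)
        if ts <= t_train:
            splits["train"].append((tweet_id, int(bool(tags & h_train))))
        elif tags & h_train:
            continue  # later tweet carrying a training hashtag: drop it
        elif ts <= t_valid:
            splits["valid"].append((tweet_id, int(bool(tags & h_valid))))
        else:
            splits["test"].append((tweet_id, int(bool(tags & h_test))))
    return splits

topical = {"t1", "t2", "t3", "t4", "t5"}
tweets = [
    (1, 10, ["t1"]), (2, 20, ["t2"]), (3, 25, ["other"]), (4, 30, ["t3"]),
    (5, 40, ["t4"]), (6, 45, ["t1"]), (7, 50, ["t5"]), (8, 55, []),
]
splits = temporal_split(tweets, topical)
print(splits)
```

Tweet 6 is posted after the training cutoff but carries the training hashtag `t1`, so it is removed. This is the step that stops the classifier from being rewarded for memorizing hashtags.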

In [None]:
term_tokenizer = Tokenizer(inputCol="hashtag", outputCol="each_hashtag")
hashtags_df = term_tokenizer.transform(df)

hashtag =  hashtags_df.select("tweet_id","HashTag_Birthday","each_hashtag")
hash_exploded = hashtag.withColumn('each_hashtag', explode('each_hashtag'))

In [None]:
## Define topical hashtag list
title = sqlContext.createDataFrame(\
[("soccer",1,["princessandgino","migikahitnaatkahitpa","royalmigiending","royalmigiendgame","thankyoumikayandgino","asahi","jipped","news","litbus_anime","ff","mink","lol","happybdayharrystyles","gameinsight","androidgames","teamheat","teamnosleep","android","supportlocalband"])],["topics","topical","hashtags"]\
                                  )

In [None]:
## Join hashtag DF with the original DF to obtain all topical tweets for a particular topic

title_exploded = title.withColumn('hashtags', explode('hashtags'))

Hashtag_set = hash_exploded.join(title_exploded,\
                                 hash_exploded.each_hashtag == title_exploded.hashtags,\
                                 "right").select(hash_exploded.tweet_id,\
                                                 hash_exploded.HashTag_Birthday,\
                                                 hash_exploded.each_hashtag)
## Right join to obtain all topical tweets.

In [None]:
## Find out the "birthday", or the earliest appearing time of each hashtag. 
## (add an extra column of 1 to mark as topical, will be used in a join later)

Ordered_Hashtag_set = Hashtag_set.\
                      groupby("each_hashtag").\
                      agg({"Hashtag_Birthday": "min"}).\
                      orderBy('min(Hashtag_Birthday)', ascending=True).\
                      withColumn("topical", lit(1))


In [None]:
## Count the distinct topical hashtags (the length of the ordered hashtag list).
loading = time.time()
time_span = Ordered_Hashtag_set.count()
getTime(loading)

In [None]:
# Get id of the corresponding time split (50% and 60%).

train_val_split_Ht = np.floor(np.multiply(time_span, 0.5)).astype(int)
val_test_split_Ht =  np.floor(np.multiply(time_span, 0.6)).astype(int)

In [None]:
# Converting to Pandas for random row access.

pd_Ordered_Hashtag_set = Ordered_Hashtag_set.toPandas()

In [None]:
# Locate the timestamps of the 50% and 60% cutoff points. They will be used later to divide D.

train_val_time = pd_Ordered_Hashtag_set.iloc[train_val_split_Ht]['min(Hashtag_Birthday)']
val_test_time = pd_Ordered_Hashtag_set.iloc[val_test_split_Ht]['min(Hashtag_Birthday)']

In [None]:
# Split Hashtags into H_train, H_valid, H_test

train_hashtags = Ordered_Hashtag_set.select("each_hashtag", "topical").\
                                     where(col("min(Hashtag_Birthday)") <= train_val_time)
    
valid_hashtags = Ordered_Hashtag_set.select("each_hashtag", "topical").\
                                     where((col("min(Hashtag_Birthday)") > train_val_time) & (col("min(Hashtag_Birthday)") <= val_test_time))
    
test_hashtags = Ordered_Hashtag_set.select("each_hashtag", "topical").\
                                     where(col("min(Hashtag_Birthday)") > val_test_time )

In [None]:
Train_ids = train_hashtags.join(Hashtag_set,\
                                 train_hashtags.each_hashtag == Hashtag_set.each_hashtag,\
                                 "inner").select(Hashtag_set.tweet_id,\
                                                 train_hashtags.topical)
Valid_ids = valid_hashtags.join(Hashtag_set,\
                                 valid_hashtags.each_hashtag == Hashtag_set.each_hashtag,\
                                 "inner").select(Hashtag_set.tweet_id,\
                                                 valid_hashtags.topical)
Test_ids = test_hashtags.join(Hashtag_set,\
                                 test_hashtags.each_hashtag == Hashtag_set.each_hashtag,\
                                 "inner").select(Hashtag_set.tweet_id,\
                                                 test_hashtags.topical)


###### Now that we have identified the ids to be used in the training, validation, and test sets, we can join them with our feature set to obtain the corresponding data sets.

![caption](remove_twit.jpg)

# Train-Valid-Test split

#### Training Labeling

In [None]:
Training_set = transformed.select("tweet_id","features").where(col("HashTag_Birthday") <= train_val_time)

Training_set_labled = Training_set.join(Train_ids, Training_set.tweet_id == Train_ids.tweet_id, "left").\
                           drop("tweet_id").\
                           select(Training_set.features, F.when(Train_ids.topical == 1, 1).otherwise(0).alias("target"))

In [None]:
loading = time.time()

pos_sample = Training_set_labled.where(col("target") == 1).count()

getTime(loading)

#### Validation Labeling

In [None]:
Raw_Validation_set = transformed.select("tweet_id","features").where((col("HashTag_Birthday") > train_val_time) & (col("HashTag_Birthday") <= val_test_time))

tr_hashtags_in_vals  = Raw_Validation_set.\
                       join(Train_ids, Raw_Validation_set.tweet_id == Train_ids.tweet_id, "inner").\
                       select(Raw_Validation_set.tweet_id)                        

Validation_set_staging =  Raw_Validation_set.\
                          join(tr_hashtags_in_vals, Raw_Validation_set.tweet_id == tr_hashtags_in_vals.tweet_id, "left_outer").\
                          toDF("tweet_id","features","new_id")
# Caution: a direct select after this join would drop the null rows, so rename the columns with toDF instead.

Validation_set =  Validation_set_staging.select(col("tweet_id"),col("features")).where(col("new_id").isNull())


Validation_set_labled = Validation_set.join(Valid_ids, Validation_set.tweet_id == Valid_ids.tweet_id, "left").\
                           drop(Validation_set.tweet_id).\
                           select(Validation_set.features, F.when(Valid_ids.topical == 1, 1).otherwise(0).alias("target"))

In [None]:
pos_sample = Validation_set_labled.where(col("target") == 1).count()

#### Test Labeling

In [None]:
Raw_Test_set = transformed.select("tweet_id","features").where(col("HashTag_Birthday") > val_test_time)

tr_hashtags_in_test  = Raw_Test_set.\
                       join(Train_ids, Raw_Test_set.tweet_id == Train_ids.tweet_id, "inner").\
                       select(Raw_Test_set.tweet_id)

Test_set_staging =  Raw_Test_set.\
                          join(tr_hashtags_in_test, Raw_Test_set.tweet_id == tr_hashtags_in_test.tweet_id, "left_outer").\
                          toDF("tweet_id","features","new_id")

Test_set =  Test_set_staging.select(col("tweet_id"),col("features")).where(col("new_id").isNull())


Test_set_labled = Test_set.join(Test_ids, Test_set.tweet_id == Test_ids.tweet_id, "left").\
                           drop("tweet_id").\
                           select(Test_set.features, F.when(Test_ids.topical == 1, 1).otherwise(0).alias("target"))

## Sampling data to balance label

In [None]:
# Downsample the negatives (keep about 50%) and keep all positives to form the final training set.

Input = Training_set_labled.sampleBy("target", fractions={0: 0.5, 1: 1}, seed=0) 
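The sampling step is a per-label Bernoulli draw: each row is kept independently with the probability listed for its label. Here is a minimal plain-Python sketch of that idea; `sample_by` is a hypothetical stand-in, not the Spark function, and the rows are synthetic.

```python
import random

def sample_by(rows, fractions, seed=0):
    """Keep each (features, label) row with the probability given for its label.

    Labels missing from `fractions` are treated as having fraction 0.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[1], 0.0)]

# Synthetic, heavily imbalanced data: 1000 negatives and 50 positives.
rows = [("neg_%d" % i, 0) for i in range(1000)] + [("pos_%d" % i, 1) for i in range(50)]
balanced = sample_by(rows, fractions={0: 0.5, 1: 1.0})

kept_neg = sum(1 for _, label in balanced if label == 0)
kept_pos = sum(1 for _, label in balanced if label == 1)
print(kept_neg, kept_pos)
```

The number of negatives kept is only approximately half, because every row is drawn independently, while a fraction of 1 keeps every positive.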

<p><span style="color: #800000; font-size: 24pt; font-family: georgia, palatino, serif;">Step Three: Training Classifier</span></p>

# Train logistic regression

In [None]:
# As of Spark 2.0, the ml and mllib APIs are no longer compatible, and mllib is heading toward deprecation and removal. To keep using mllib, ml vectors must be converted to mllib vectors.

In [None]:
from pyspark.mllib import linalg as mllib_linalg
from pyspark.ml import linalg as ml_linalg

def as_old(v):
    if isinstance(v, ml_linalg.SparseVector):
        return mllib_linalg.SparseVector(v.size, v.indices, v.values)
    if isinstance(v, ml_linalg.DenseVector):
        return mllib_linalg.DenseVector(v.values)
    raise ValueError("Unsupported type {0}".format(type(v)))

In [None]:
from pyspark.sql import Row
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.param import Param, Params
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel

TrainingRDD=Input.rdd.map(lambda row: LabeledPoint(row.target, as_old(row.features)))


In [None]:
Validation_set_labled.show(4)

In [None]:
Valid_RDD = Validation_set_labled.rdd.map(lambda row: LabeledPoint(row.target, as_old(row.features)))
Test_RDD = Test_set_labled.rdd.map(lambda row: LabeledPoint(row.target, as_old(row.features)))

In [None]:
Valid_RDD.take(5)

In [None]:
TrainingRDD.take(5)

In [None]:
model = LogisticRegressionWithLBFGS.train(TrainingRDD)

In [None]:
model

# Hyperparameter Tuning

In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# The features are already assembled, so the estimator to tune is the logistic regression itself.
lr = LogisticRegression(featuresCol="features", labelCol="target")

# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 7 values for lr.regParam, this grid has 7 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.3]) \
    .build()

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(labelCol="target"),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation on the balanced training set, and choose the best set of parameters.
cvModel = crossval.fit(Input)

# Make predictions on the validation set. cvModel uses the best model found.
prediction = cvModel.transform(Validation_set_labled)
selected = prediction.select("target", "probability", "prediction")
for row in selected.take(10):
    print(row)

# Evaluation

In [None]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.regression import LabeledPoint
#from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics

# Output raw probabilities instead of 0/1 labels so that tweets can be ranked.
model.clearThreshold()

# Compute (score, label) pairs on the validation set
predictionAndLabels = Valid_RDD.map(lambda lp: (float(model.predict(lp.features)), float(lp.label)))

# Precision @ k: fraction of topical tweets among the k highest-scored tweets.
k = 100
top_k = predictionAndLabels.takeOrdered(k, key=lambda x: -x[0])
print("Precision @ %d = %s" % (k, sum(label for _, label in top_k) / float(max(len(top_k), 1))))

# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)
print("Area under PR = %s" % metrics.areaUnderPR)
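Independent of Spark, precision at k is simply the share of truly topical tweets among the k tweets the classifier scores highest. Here is a minimal sketch with made-up scores; the function name `precision_at_k` is an assumption.

```python
def precision_at_k(scored, k):
    """Fraction of positive labels among the k highest-scored items.

    `scored` is a list of (score, label) pairs with label in {0, 1}.
    """
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(label for _, label in top) / float(len(top)) if top else 0.0

# Made-up (score, label) pairs for illustration.
scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 1), (0.2, 0)]
print(precision_at_k(scored, 3))
```

With these pairs, the top three items carry labels 1, 0, and 1, so precision at 3 is two thirds. A high value at small k matters most here, since a "sensor" is judged by the quality of the tweets it surfaces first.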


In [None]:
model.predict(Valid_RDD.map(lambda lp: lp.features)).take(5)

In [None]:
predictionAndLabels.take(5)

In [None]:
sc.version

## Feature Selection

In [None]:
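from pyspark.ml.classification import RandomForestClassifier

# Fit a random forest on the balanced training set and rank features by importance.
rf = RandomForestClassifier(featuresCol="features", labelCol="target")
rf_model = rf.fit(Input)
importances = rf_model.featureImportances.toArray()

# Indices of the 200 most important features
selected_features = np.argsort(importances)[::-1][:200]
print(importances[selected_features])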