# Following trends on Twitter

[1. Abstract](#Abstract)

[2. Exploratory data analysis](#Exploratory-data-analysis)

[3. Topic Modeling using LDA](#A-very-simple-Topic-Modeling-using-LDA)

[4. Example of the pipeline](#Example-of-the-pipeline-that-we-will-follow-for-the-LDA-algorithm)

[5. Milestone 3](#Milestone-3:-the-data-story)

# Abstract


We realized that it would be hard to achieve the goal stated in milestone 1 (detecting fake news). The first problem is that we could not find a workable definition of fake news. The second is that the Twitter dataset is not what we expected: for example, it does not contain the number of times a tweet has been retweeted, the geographical location, or the number of likes. We therefore decided to go in a different, more feasible direction: following how trends are created and spread on Twitter, and trying to find patterns between trends and users.

In [1]:
import numpy as np
import json
import re
from pyspark.sql import *
from pyspark import SparkContext, SQLContext
from pyspark.ml.feature import *
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.sql import functions as F
import pickle
import string

#### Read data

> Each data entry has 5 fields:
     - language: language of the user 
     - id: id of the user
     - date: date when the tweet was published
     - username: username of the user
     - content: the tweet
     
For the moment we keep only rows that have all 5 fields, so we do not have to deal with missing values.
     

In [2]:
sqlContext = SQLContext(sc)
data = sc.textFile("/datasets/tweets-leon")

# Exploratory data analysis

> 1. We will first clean the data and select only the subset that is useful for this project:
    - keep only the tweets that have all 5 fields
    - remove URLs from the content
    - remove emojis
    - remove punctuation 
    - remove stopwords
    - apply lemmatization 
    - keep only English, Spanish and French tweets
    - ...

In [4]:
frist_tweet = data.first()
frist_tweet

u'en\t345963923251539968\tSat Jun 15 18:00:01 +0000 2013\tLetataleta\tRT @silsilfani: the world is not a wish-granting machine. dont be surprised when everything always end up disappointing.'
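The five tab-separated fields of a raw line can be unpacked into named fields, which also lets the date be parsed instead of sliced as a string. The following is a minimal sketch, not part of our pipeline: the `Tweet` type and the `parse_tweet` helper are hypothetical names, the date format string is inferred from the sample above, and the code is written for Python 3 (the `%z` directive is not available in Python 2's `strptime`).

```python
from collections import namedtuple
from datetime import datetime

# Field names follow the "Read data" list; the Tweet type itself is hypothetical.
Tweet = namedtuple("Tweet", ["language", "id", "date", "username", "content"])

def parse_tweet(line):
    """Split one raw line into a Tweet, or return None unless it has exactly 5 fields."""
    fields = line.split("\t")
    if len(fields) != 5:
        return None
    language, tweet_id, date, username, content = fields
    # Dates look like 'Sat Jun 15 18:00:01 +0000 2013' (format inferred from the sample)
    parsed_date = datetime.strptime(date, "%a %b %d %H:%M:%S %z %Y")
    return Tweet(language, tweet_id, parsed_date, username, content)

sample = ("en\t345963923251539968\tSat Jun 15 18:00:01 +0000 2013\tLetataleta\t"
          "RT @silsilfani: the world is not a wish-granting machine.")
tweet = parse_tweet(sample)
print(tweet.language, tweet.date.year, tweet.username)
```

With a parsed date, selecting a year or a month becomes a comparison on `tweet.date` rather than a slice of the last four characters.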

In [5]:
"""Choose English tweets that have exactly 5 components 
         (language, id, date, username, content)"""

def selection_tweet(tweet):
    contents = tweet.split("\t")
    # Keep the tweet only if it is in English and has all 5 fields
    return contents[0] == 'en' and len(contents) == 5

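In [6]:
"""Remove URLs and punctuation"""

# Identity table: translate() is only used to delete characters (Python 2 str API)
table = string.maketrans("","")
def punctuation(s):
    # Drop URLs first, while "http..." is still intact
    s = re.sub(r"http\S+", "", s)
    # Delete every ASCII punctuation character
    return s.translate(table, string.punctuation)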

In [7]:
"""Encode and clean a tweet"""

def encode_tweet(tweet):
    # Encode every field as UTF-8
    encoded = [t.encode("utf8") for t in tweet.split("\t")]
    # Remove URLs and punctuation from the content
    encoded[4] = punctuation(encoded[4])
    # Remove words of one or two characters
    encoded[4] = ' '.join([x for x in encoded[4].split(' ') if len(x) > 2])
    return encoded

In [8]:
encode_tweet(frist_tweet)

['en',
 '345963923251539968',
 'Sat Jun 15 18:00:01 +0000 2013',
 'Letataleta',
 'silsilfani the world not wishgranting machine dont surprised when everything always end disappointing']

Later on, using the filter function as done below, we will select only a useful subset of the data.

In [None]:
"""remove urls"""

#TODO

In [None]:
"""remove emojis"""

#TODO
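One possible way to fill in this step is a regular expression over Unicode code point ranges. This is only a sketch under assumptions: the ranges below are our own choice and cover the common emoji blocks without being exhaustive, `remove_emojis` is a hypothetical helper name, and the code expects Python 3 unicode strings, so it would have to run before the UTF-8 encoding done in `encode_tweet`.

```python
import re

# Assumed code point ranges: common emoji blocks, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector that follows some emoji
    "]+"
)

def remove_emojis(text):
    """Delete every character that falls in the ranges above."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("good morning \U0001F600\U0001F31E world"))
```

Accented letters are outside these ranges, so Spanish and French text is left untouched.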

In [None]:
"""remove punctuation"""

#TODO

In [None]:
"""Lemmatization"""

#TODO

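In [9]:
"""select a subset of the data"""

# selection_tweet keeps English tweets only, so the Spanish and French
# subsets are taken from the complete (5-field) tweets before it is applied
complete = data.filter(lambda x : len(x.split("\t")) == 5)
es_data = complete.filter(lambda x : x.split("\t")[0]=='es')
fr_data = complete.filter(lambda x : x.split("\t")[0]=='fr')

data = data.filter(selection_tweet)
en_data = data

# The date field ends with the year, e.g. '... +0000 2013'
data_2012 = data.filter(lambda tweet : 
                        tweet.split("\t")[2][-4:] == '2012')
data_2013 = data.filter(lambda tweet : 
                        tweet.split("\t")[2][-4:] == '2013')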


In [None]:
some_fr_tweets = fr_data.take(5)
some_fr_tweets

In [None]:
some_fr_tweets = [encode_tweet(tweet) for tweet in some_fr_tweets]

In [None]:
print 'Some french tweets:'
for ind, t in enumerate(some_fr_tweets):
    print ind + 1,')User name:',t[3]
    print '         Tweets:', ((t[4]))
    print '         at:', t[2]
    print 

# A very simple Topic Modeling using LDA

To familiarize ourselves with the dataset, we started with a very simple approach to topic extraction, using the Latent Dirichlet Allocation (LDA) algorithm. We first build the tf-idf matrix from our data and then pass it to the LDA method. We also need to specify the number of topics to extract from the dataset, α (the parameter of the Dirichlet prior on the per-document topic distributions) and β (the parameter of the Dirichlet prior on the per-topic word distributions). We will determine these values in the next milestone. 
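To make the tf-idf step concrete, here is a small pure-Python sketch of the weighting on toy documents. It is an illustration, not the code we run on the cluster: `tf_idf` is a hypothetical helper, the toy documents are made up, and we assume the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)) with raw counts as term frequency, which is how we understand Spark's `IDF` to work.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    m = len(documents)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed to match Spark's IDF)
    idf = {term: math.log((m + 1.0) / (count + 1.0)) for term, count in df.items()}
    # Term frequency is the raw count of the term in the document
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in documents]

# Made-up toy documents, already tokenized and stripped of stop words
docs = [["spark", "heard", "spark"], ["wish", "java"], ["spark", "java", "neat"]]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that appears in every document gets weight 0, so very common words contribute nothing to the vectors handed to LDA.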

In [None]:

"""Data in english and encode UTF-8"""
en_data = data.filter(selection_tweet).map(encode_tweet)

"""Take only ID and CONTENT of a tweet"""
tweets = en_data.map(lambda tweet : Row(id=tweet[1], sentence=tweet[4]))

"""Create DF"""
df_tweets = sqlContext.createDataFrame(tweets)

df_tweets.show(3)

In [30]:
"""Tokenization"""
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="raw", pattern="\\W")
regexTokenized = regexTokenizer.transform(df_tweets)

regexTokenized.show(3)

+------------------+--------------------+--------------------+
|                id|            sentence|                 raw|
+------------------+--------------------+--------------------+
|345963923251539968|RT @silsilfani: t...|[rt, silsilfani, ...|
|345963923297673217|RT @WhosThisHoe: ...|[rt, whosthishoe,...|
|345963923259924480|Can't stand peopl...|[can, t, stand, p...|
+------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
# A Spark DataFrame cannot be pickled; write it out as Parquet instead
regexTokenized.write.parquet("regexTokenized")

In [31]:
"""Remove Stop-words"""
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
removed_stopwords = remover.transform(regexTokenized)
# A Spark DataFrame cannot be pickled; write it out as Parquet instead
removed_stopwords.write.parquet("removed_stopwords")
removed_stopwords.show(3)

+------------------+--------------------+--------------------+--------------------+
|                id|            sentence|                 raw|            filtered|
+------------------+--------------------+--------------------+--------------------+
|345963923251539968|RT @silsilfani: t...|[rt, silsilfani, ...|[rt, silsilfani, ...|
|345963923297673217|RT @WhosThisHoe: ...|[rt, whosthishoe,...|[rt, whosthishoe,...|
|345963923259924480|Can't stand peopl...|[can, t, stand, p...|[t, stand, people...|
+------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
"""apply lemmatization and remove punctuation"""


Warning: computing TF-IDF and LDA takes a long time (about 6 hours on the cluster), and we do not yet know why. Could you take a look at our code to see whether there is a problem with it? Because of this, the computation lives in a separate Python file (LDA.py) that we submit to the cluster. 

In [None]:
"""TF-IDF"""

cv = CountVectorizer(inputCol="filtered", outputCol="vectors")
count_vectorizer_model = cv.fit(removed_stopwords)
tf = count_vectorizer_model.transform(removed_stopwords)

idf = IDF(inputCol="vectors", outputCol="tfidf")
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)

tfidf.show(3)

In [None]:
"""Topics extraction with LDA"""

nbTopics=100
n_terms=15

corpus = tfidf.select(F.col('id').cast("long"), 'tfidf').rdd.map(lambda x: [x[0], x[1]])
ldaModel = LDA.train(corpus, k=nbTopics)


topics = ldaModel.describeTopics(maxTermsPerTopic=n_terms)
vocabulary = count_vectorizer_model.vocabulary

"""Store result"""
with open("topics.pickle", "wb") as f:
    pickle.dump(topics, f)
with open("vocabulary.pickle", "wb") as f:
    pickle.dump(vocabulary, f)   

In [None]:
"""Load result computed from cluster"""

with open("topics.pickle", "rb") as f:
    topics = pickle.load(f)
    
with open("vocabulary.pickle", "rb") as f:
    vocabulary = pickle.load(f)

In [None]:
# Each topic is a pair (term indices, term weights)
for topic, (words, scores) in enumerate(topics):
    print("topic {} : ".format(topic))
    for word, score in zip(words, scores):
        print("{} - {}".format(vocabulary[word], score))

In [None]:
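# sc.textFile reads lines of text, so it cannot decode pickle files and would
# overwrite the topics and vocabulary loaded with pickle.load above.
# Kept for reference only:
# topics = sc.textFile("/user/khau/topics.pickle")
# vocabulary = sc.textFile("/user/khau/vocabulary.pickle")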

In [73]:
import matplotlib.pyplot as plt

def plot_topics(topics, vocabulary, nbTopics, n_terms):
    """Draw the top n_terms words of the first nbTopics topics, sized by weight."""
    topics = topics[:nbTopics]
    max_score = max(max(x[1]) for x in topics)  # largest term weight overall
    nbColsPlot = 4
    nbRows = int(np.ceil(len(topics) / float(nbColsPlot)))
    MAGIC_NUMBER = 50  # largest font size
    fontsize_init = MAGIC_NUMBER / max_score

    for topic in range(len(topics)):
        # One subplot per topic, laid out on a grid of nbColsPlot columns
        plt.subplot(nbRows, nbColsPlot, topic + 1)
        plt.ylim(0, n_terms + 0.5)
        plt.xticks([]) 
        plt.yticks([])
        plt.title('Topic #{}'.format(topic+1))
        words = topics[topic][0][:n_terms]
        scores = topics[topic][1][:n_terms]
        for word in range(len(words)):
            font_size = min(fontsize_init*scores[word], MAGIC_NUMBER)
            plt.text(0.05, n_terms-word-0.5, vocabulary[words[word]], fontsize=font_size) 
    plt.tight_layout()
    plt.show()
    
plot_topics(topics, vocabulary, nbTopics=20, n_terms=10)

# Example of the pipeline that we will follow for the LDA algorithm

In [19]:
# random dataframe 
sentenceDataFrame = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
    (3, "I want a coffee before going to bed "),
    (4, "Today is a big day !!!")
], ["id", "sentence"])

# tokenization
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
regexTokenized = regexTokenizer.transform(sentenceDataFrame)
#with open("regexTokenized.pickle", "wb") as f:
#    pickle.dump(regexTokenized, f)
# remove stop words
regexTokenized.save('aaa')
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
removed = remover.transform(regexTokenized)

removed.save('ooo')
#with open("removed_stopwords.pickle", "wb") as f:
#    pickle.dump(removed, f)
# create the tf-idf matrix
cv = CountVectorizer(inputCol="filtered", outputCol="vectors")
count_vectorizer_model = cv.fit(removed)
tf = count_vectorizer_model.transform(removed)
vocabulary = count_vectorizer_model.vocabulary
with open("bbb.pickle", "wb") as f:
    pickle.dump(vocabulary, f)   

idf = IDF(inputCol="vectors", outputCol="tfidf")
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)

tfidf.save('ccc')
#with open("tfidf.pickle", "wb") as f:
#    pickle.dump(tfidf, f)

# initialize parameters
nbTopics=3
n_terms=3

corpus = tfidf.select(F.col('id').cast("long"), 'tfidf').rdd.map(lambda x: [x[0], x[1]])
ldaModel = LDA.train(corpus, k=nbTopics)
# save the model; mllib models take the SparkContext as first argument
ldaModel.save(sc, 'ddd')

# extracting topics from the model trained in this cell
topics = ldaModel.describeTopics(maxTermsPerTopic=n_terms)
#with open("topics.pickle", "wb") as f:
#    pickle.dump(topics, f)

# Each topic is a pair (term indices, term weights)
for topic, (words, scores) in enumerate(topics):
    print("topic {} : ".format(topic))
    for word, score in zip(words, scores):
        print("{} - {}".format(vocabulary[word], score))
        
plot_topics(topics, vocabulary, nbTopics=3, n_terms=2)        

AttributeError: 'CountVectorizerModel' object has no attribute 'save'

In [18]:
#sqlContext.createDataFrame('/user/khau/regexTokenized')
a = sqlContext.read.load('/user/khau/regexTokenized')
a.show(3)

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|
|  1|I wish Java could...|[i, wish, java, c...|
|  2|Logistic,regressi...|[logistic, regres...|
+---+--------------------+--------------------+
only showing top 3 rows



# Milestone 3: the data story

In [None]:
"""Play around with the parameters of the LDA algorithm in order to find the optimal values for α and β. """

#TODO

In [None]:
"""repeat the same technique for Spanish and French"""

#TODO

In [None]:
"""algorithm for detecting the top trends"""

#TODO
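One simple starting point for this step is to count hashtags per day and keep the most frequent ones. The sketch below is an assumption about how we might begin, not a final algorithm: `top_hashtags_per_day` is a hypothetical helper, the `(day, content)` input shape and the sample tweets are made up for illustration, and the code is written for Python 3. It must run on the raw content, because `punctuation` strips the `#` character.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags_per_day(tweets, n=3):
    """tweets: iterable of (day, content) pairs. Returns {day: [(hashtag, count), ...]}."""
    counts = defaultdict(Counter)
    for day, content in tweets:
        # Count each hashtag once per tweet, ignoring case
        counts[day].update(set(tag.lower() for tag in HASHTAG.findall(content)))
    return {day: counter.most_common(n) for day, counter in counts.items()}

# Made-up sample tweets, for illustration only
tweets = [
    ("2013-06-15", "Watching the game #NBAFinals"),
    ("2013-06-15", "#nbafinals what a shot! #Heat"),
    ("2013-06-16", "Happy #FathersDay"),
]
trends = top_hashtags_per_day(tweets)
print(trends)
```

On the cluster the same counting could be expressed with `map` and `reduceByKey` on the RDD; comparing the counts of consecutive days would then show when a hashtag starts to trend.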

In [None]:
"""trying to find patterns between trends"""

#TODO

In [None]:
"""analyse the data per user 
    ex. which topics he tweets most about ? 
        does it change over time ? """

#TODO

In [None]:
"""find top users that were the most mentioned by somebody else"""

#TODO
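Mentions can be extracted from the raw content with a regular expression and counted per mentioned user. This is a minimal sketch under assumptions: `top_mentioned` is a hypothetical helper, the second sample tweet is made up, and the code is written for Python 3. As with hashtags, it has to run before `punctuation`, which removes the `@` character.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def top_mentioned(tweets, n=10):
    """tweets: iterable of (username, content) pairs. Self-mentions are ignored."""
    counts = Counter()
    for username, content in tweets:
        # Count each mentioned user once per tweet, ignoring case
        for mentioned in set(m.lower() for m in MENTION.findall(content)):
            if mentioned != username.lower():
                counts[mentioned] += 1
    return counts.most_common(n)

tweets = [
    ("Letataleta", "RT @silsilfani: the world is not a wish-granting machine."),
    # Made-up tweet, for illustration only
    ("someone", "@silsilfani so true @someone"),
]
ranking = top_mentioned(tweets)
print(ranking)
```

Note that retweets (`RT @user:`) count as mentions here; whether they should is a choice we still have to make.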

In [None]:
"""visualization of the result(per language, per month ...)"""

#TODO