The task is to break hashtags down into proper words, for example #GoldenGlobe -> Golden, Globe.
Let's first load the data into a list. Each line (tweet) is an item in the list. Further on, we will treat each tweet as a document made up of words.

In [2]:
def loadTweetsInList(filename):
    #read all lines; the with block closes the file once we are done
    with open(filename, "rU") as tweets:
        message = tweets.readlines()
    return message

In [3]:
tweets = loadTweetsInList("../as.txt")

In [4]:
tweets[0:10]

['American Sniper\n',
 "If American Sniper doesn't make you proud to be an American, you probably need to check your pulse. Merica. http://t.co/d2bkKzbfWA\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/8Q93NiAWJJ\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/AQAtAo3vep\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/CYdwIZxcSW\n",
 "Fav if you're going to see the movie American Sniper! http://t.co/DxWI1Sq6Jw\n",
 'I just watched American Sniper. You have to see what Bradley Cooper does in this movie. His performance is next level.\n',
 'Did you know #Jesseventura sued fallen seal estate #AmericanSniper for $1.8M. You worthless piece of \xf0\x9f\x92\xa9. Shameless.\n',
 "American Sniper kind of reminds me of the movie that's showing in the third act of Inglorious Basterds.\n",
 'Did y\xe2\x80\x99all see "American Sniper"? What a powerful movie. Bradley Cooper isn\xe2\x80\x99t too hard on the eyes, either

Let's drop the first item since it is just a title.

In [5]:
tweets = tweets[1:]

Now, let's tokenize each string in this list. The result should be a list of lists.

In [6]:
import re
import string
from nltk.corpus import stopwords

def cleanAndTokenizeTweet(tweet):
    
    #keep only printable ASCII characters; this gets rid of emoji and text in other languages
    tweet = filter(lambda x: x in string.printable, tweet)
    
    #remove newline characters
    tweet = re.sub("\n","", tweet)
    
    #replace every run of characters other than letters, digits and '#' with a space
    tweet_no_punct = re.sub('[^a-zA-Z0-9#]+', ' ', tweet)
    
    words = tweet_no_punct.split()
    
    #remove stopwords 
    stop_words = set(stopwords.words('english'))
    stop_words.add("http")
    stop_words.add("the")
    meaningful_words = [w for w in words if w not in stop_words]
    
    #remove words of 2 characters or fewer
    long_words = [w for w in meaningful_words if len(w) > 2]
    
    return long_words


def getWordsFromTweets(tweets):
    tokenized_tweets = []
    for tweet in tweets:
        words = cleanAndTokenizeTweet(tweet)
        tokenized_tweets.append(words)
    return tokenized_tweets

In [7]:
tokenizedtweets = getWordsFromTweets(tweets)

In [8]:
tokenizedtweets[0]

['American',
 'Sniper',
 'doesn',
 'make',
 'proud',
 'American',
 'probably',
 'need',
 'check',
 'pulse',
 'Merica',
 'd2bkKzbfWA']

Let's write a function for splitting a hashtag into meaningful words.

On close observation, hashtags take one of these forms - 
1. Camel case
2. Separated by underscores
3. Lowercase word sequence

In cases 1 and 2, extracting the words is simple, and the function hashtagToWords below handles them with a split and two regular expressions. Case 3 is not trivial, so we will use a different approach for it.

To split hashtags that are all lowercase, such as "imissmyfriends", we can look up candidate substrings in a dictionary of valid words and decide the split positions with dynamic programming, as in the word break problem: http://www.geeksforgeeks.org/dynamic-programming-set-32-word-break-problem/
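As a minimal sketch of the dictionary lookup idea (this is not the code used in this notebook), the function below lists every way to split a lowercase string into words from a small dictionary. The names `all_word_breaks` and `toy_dictionary` are made up for illustration, and the dictionary holds only the handful of words the example needs; a real run would need a full English word list.

```python
def all_word_breaks(text, dictionary):
    """Return every way to split text into words found in dictionary."""
    # splits[i] holds all segmentations of the prefix text[:i]
    splits = [[] for _ in range(len(text) + 1)]
    splits[0] = [[]]  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            # extend every segmentation of text[:start] with a valid word
            if word in dictionary and splits[start]:
                for prefix in splits[start]:
                    splits[end].append(prefix + [word])
    return splits[len(text)]


# Hypothetical toy dictionary, just large enough for the example.
toy_dictionary = {"i", "miss", "my", "friends", "friend", "s"}

for candidate in all_word_breaks("imissmyfriends", toy_dictionary):
    print(candidate)
```

Even this tiny dictionary yields more than one valid split ("friends" versus "friend" + "s"), which is why the candidates have to be ranked somehow.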

But how do we decide which split is the best? We need some baseline corpus to tell us which words are more frequent or appear together. Then we can rank all splits by their probability. As it happens, there is already an algorithm for doing this, the Viterbi algorithm. Let's see how it works.
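The actual implementation lives in viterbi.py and is not reproduced here. As a rough sketch of the idea only, assume a unigram model in which the probability of a split is the product of the probabilities of its words. The function name `segment_with_unigrams`, the `max_word_len` parameter and the counts in `toy_counts` are all hypothetical; real counts would come from a large corpus, and a real implementation would likely work with log probabilities to avoid underflow on long strings.

```python
def segment_with_unigrams(text, counts, max_word_len=20):
    """Split text into the most probable word sequence under a unigram model."""
    total = float(sum(counts.values()))
    # best_prob[i] is the probability of the best split of text[:i];
    # last_start[i] is where the final word of that split begins.
    best_prob = [1.0] + [0.0] * len(text)
    last_start = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word_prob = counts.get(text[start:end], 0) / total
            candidate = best_prob[start] * word_prob
            if candidate > best_prob[end]:
                best_prob[end] = candidate
                last_start[end] = start
    # Walk the back-pointers from the end to recover the words.
    # If no split exists, this returns the whole string with probability 0.
    words = []
    end = len(text)
    while end > 0:
        start = last_start[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, best_prob[len(text)]


# Made-up word counts, for illustration only.
toy_counts = {"i": 50, "miss": 5, "my": 20, "friends": 5, "friend": 10, "s": 1, "is": 9}

words, prob = segment_with_unigrams("imissmyfriends", toy_counts)
print(words)
print(prob)
```

With these counts the split ending in "friends" beats the one ending in "friend" + "s", because every extra word multiplies the probability by another factor smaller than one.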

In [9]:
#uncomment to see code, or look in viterbi.py
# %load ../viterbi.py

In [10]:
from viterbi import viterbi_segment
segments, probs = viterbi_segment("imissmyfriends")

In [11]:
print "words:"+str(segments)
print "probability:"+str(probs)

words:['i', 'miss', 'my', 'friends']
probability:2.02391305979e-13


In [12]:
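#splits before a capitalized word or a run of digits: "AmericanSniper" -> "American Sniper"
first_cap_re = re.compile('(.)([A-Z][a-z]+|[0-9]+)')
#splits between a lowercase letter or digit and a following capital: "fooBAR" -> "foo BAR"
all_cap_re = re.compile('([a-z0-9])([A-Z])')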
            
def hashtagToWords(word):
    words = []
    if "_" in word:
        words = word.split("_")
    elif all(x.isupper() for x in word):
        words.append(word)
    elif any(x.isupper() for x in word):
        s1 = first_cap_re.sub(r'\1 \2', word)
        words = all_cap_re.sub(r'\1 \2', s1).split()
    else:
        segs, prob = viterbi_segment(word)
        words.extend(segs)
    
    hashtag_formatted = cleanAndTokenizeTweet(' '.join(words).lower())
    
    return hashtag_formatted

In [13]:
print hashtagToWords("AmericanSniper")
print hashtagToWords("IMAX")
print hashtagToWords("BradleyCooper's")
print hashtagToWords("AmericanSniper:")
print hashtagToWords("Oscars2015")
print hashtagToWords("Birdman")
print hashtagToWords("The12DaysOfSummer")
print hashtagToWords("boxoffice")
print hashtagToWords("imissthis")

['american', 'sniper']
['imax']
['bradley', 'cooper']
['american', 'sniper']
['oscars', '2015']
['birdman']
['days', 'summer']
['box', 'office']
['miss']


*Awesome!!!* We are done with task 1, and we have tested the function on all the cases. Let's move on to task 2.

In [14]:
#function to replace hashtags with split words.
def getFormattedTweets(tokenizedtweets):
    formatted_tweets = []

    for item in tokenizedtweets:
        templist = []
        for word in item:
            if word.startswith('#'):
                templist.extend(hashtagToWords(word[1:]))
            else:
                templist.append(word)
        
        formatted_tweets.append(' '.join(templist).lower())
    
    return formatted_tweets

In [15]:
formatted_tweets = getFormattedTweets(tokenizedtweets)
formatted_tweets[0]

'american sniper doesn make proud american probably need check pulse merica d2bkkzbfwa'

In [16]:
bradley_cooper_tweets = filter(lambda x: "bradley cooper" in x, formatted_tweets)
clint_eastwood_tweets = filter(lambda x: "clint eastwood" in x, formatted_tweets)
chris_kyle_tweets = filter(lambda x: "chris kyle" in x, formatted_tweets)
print "Bradley Cooper tweets:"+ str(len(bradley_cooper_tweets))
print "Clint Eastwood tweets:"+ str(len(clint_eastwood_tweets))
print "Chris Kyle tweets:"+ str(len(chris_kyle_tweets))

Bradley Cooper tweets:4494
Clint Eastwood tweets:3514
Chris Kyle tweets:5366


In [17]:
chris_kyle_tweets[:5]

['ultimate respect chris kyle 1974 2013 rip solider american sniper xu3woykitv',
 'american sniper chris kyle widow this movie kids remember dad 3xoenmufuz lftadfk4nt',
 'retweet support chris kyle 100 american sniper t6dnbzhlyl',
 'intvd bradley cooper clint eastwood taya kyle sienna miller jason hall today chris kyle american sniper 8dipuwx5vq',
 'american sniper chris kyle widow how coping xbysnjfryz 5mwacveorz']

There are a number of approaches to figuring out the top trending topics in a given set of tweets. We will explore only one of them, which is very simple. Here are the steps - 

1. Get the text corpus of tweets matching a given keyword.
2. Vectorize the corpus using either word count or TF-IDF. (We will choose whichever gives the best results.)
3. Club together synonyms and similar words/grams.
4. Get the top 5 words/grams with the highest TF or TF-IDF value.

Let's consider only 1-grams and 2-grams of words.
For finding similar words and phrases, we will use the WordNet dictionary.
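Steps 2 and 4 can be sketched in plain Python as follows. This is not the `tfidf` module used in the next cell; the function names `unigrams_and_bigrams` and `top_terms` and the three sample documents are made up for illustration, and the formula idf = log(N / df) is just one common variant of inverse document frequency.

```python
import math
from collections import Counter


def unigrams_and_bigrams(text):
    """Return the 1-grams and 2-grams of a whitespace-tokenized string."""
    words = text.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def top_terms(documents, n=5):
    """Rank terms by their TF-IDF score summed over all documents."""
    term_counts = [Counter(unigrams_and_bigrams(doc)) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    scores = Counter()
    for counts in term_counts:
        for term, tf in counts.items():
            idf = math.log(len(documents) / float(doc_freq[term]))
            scores[term] += tf * idf
    return scores.most_common(n)


# Hypothetical sample documents, already cleaned and lowercased.
docs = ["bradley cooper best actor",
        "bradley cooper great performance",
        "bradley cooper best actor oscar"]

for term, score in top_terms(docs):
    print(term, score)
```

Under this variant a term that occurs in every document, such as "bradley cooper" here, gets an IDF of zero and drops out of the ranking, whereas a plain word count would put it at the top.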

In [18]:
import tfidf 
import numpy as np
#build term frequency-inverse document frequency matrix
def getTFIDF(input_tweets):
    td_dict, vocab = tfidf.tc(input_tweets, tokenizer=tfidf.get_bigrams)
    td = tfidf.to_sparse_matrix(td_dict, vocab).toarray()
    idf = tfidf.to_vector(tfidf.idf_from_tc(td_dict), vocab)
    idc = tfidf.idc_from_tc(td_dict)
    tf_by_idf = tfidf.tf_mul_idf(td, idf)
    
    #sum all tf-idf for a word and sort to get most important
    tfmatrix = tf_by_idf.sum(axis=1)
    sortedmatrix = np.argsort(tfmatrix, axis=0)
    inv_vocab = tfidf.inverse_vocab(vocab)
    
    results = [(inv_vocab[x], tfmatrix[x]) for x in sortedmatrix[-100:]]
    results_dict = dict(results)
    return results_dict, idc

Well, we have got some interesting results. Some observations - 

1. Some concepts are similar and hence the results are repetitive. For example, best, actor and best actor all carry the same meaning.
2. We have ignored synonyms till now. But a lot of synonyms in the results can be clubbed together to represent a topic. For example, amazing, great, sensational, etc.
3. One more thing. Clint Eastwood, Chris Kyle and Bradley Cooper happen to be the most used words across all tweets, so our results always show them as important.

Let's tackle these issues one by one.
First, for every group of similar concepts, we will keep the bigram and assign it the average of the values in the group.
For example,
'chris kyle, chris, kyle' will come together as ('chris kyle', (520+565+651)/3)
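The averaging in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name `grouped_score` is hypothetical, and the three values are assigned to 'chris kyle', 'chris' and 'kyle' in the order they appear in the example.

```python
def grouped_score(bigram, scores):
    """Average a bigram's score with the scores of its two words, when present.

    Assumes the bigram itself is a key of scores.
    """
    parts = [bigram] + bigram.split()
    present = [scores[p] for p in parts if p in scores]
    return sum(present) / float(len(present))


# Values from the example, assigned in the order they are listed.
scores = {"chris kyle": 520, "chris": 565, "kyle": 651}
print(grouped_score("chris kyle", scores))
```

The three values sum to 1736, so the grouped entry ends up at roughly 578.67.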

In [19]:
bradley_dict, bradley_idc = getTFIDF(bradley_cooper_tweets)
clint_dict, clint_idc = getTFIDF(clint_eastwood_tweets)
kyle_dict, kyle_idc = getTFIDF(chris_kyle_tweets)



In [20]:
def getFinalResultDict(in_dict, in_idc):
    final_dict = {}
    bigram_words = []

    for key, value in in_dict.iteritems():
        one_grams = key.split()
        fsum = value
        counter = 1
        if len(one_grams)==2:
            if one_grams[0] in in_dict:
                fsum = fsum + in_dict[one_grams[0]]
                counter = counter + 1
            if one_grams[1] in in_dict:
                fsum = fsum + in_dict[one_grams[1]]
                counter = counter + 1
            final_dict[key] = (fsum / counter, in_idc[key])
            bigram_words.append(one_grams[0])
            bigram_words.append(one_grams[1])

    for key, value in in_dict.iteritems():
        one_grams = key.split()
        if len(one_grams)==1:
            if one_grams[0] not in bigram_words:
                final_dict[key] = (value, in_idc[key])
    
    return final_dict

In [22]:
bradley_final_dict = getFinalResultDict(bradley_dict, bradley_idc)
print "Most frequent Topics for Bradley Cooper"
sorted( ((v[1],k) for k,v in bradley_final_dict.iteritems()), reverse=True)[:20]

Most frequent Topics for Bradley Cooper


[(4021, 'american sniper'),
 (520, 'chris kyle'),
 (484, 'cooper american'),
 (413, 'americansniper'),
 (395, 'clint eastwood'),
 (372, 'sniper bradley'),
 (271, 'film'),
 (228, 'good'),
 (210, 'performance'),
 (196, 'great'),
 (195, 'navy seal'),
 (187, 'the'),
 (158, 'premiere'),
 (149, 'new'),
 (149, 'cooper sensational'),
 (141, 'portrayal navy'),
 (141, 'eastwood bradley'),
 (138, 'bradley coopers'),
 (130, 'see american'),
 (126, 'watch')]

In [23]:
clint_final_dict = getFinalResultDict(clint_dict, clint_idc)
print "Most frequent Topics for Clint Eastwood"
sorted( ((v[1],k) for k,v in clint_final_dict.iteritems()), reverse=True)[:10]

Most frequent Topics for Clint Eastwood


[(3304, 'american sniper'),
 (2040, 'clint eastwood'),
 (1462, 'clint eastwoods'),
 (1386, 'review clint'),
 (1383, 'movie american'),
 (1376, 'movie review'),
 (1373, 'sniper movie'),
 (1364, 'book interpretation'),
 (1363, 'eastwoods book'),
 (400, 'bradley cooper')]

In [None]:
kyle_final_dict = getFinalResultDict(kyle_dict, kyle_idc)
print "Most frequent Topics for Chris Kyle"
sorted( ((v[1],k) for k,v in kyle_final_dict.iteritems()), reverse=True)[:10]

Just for fun, let's find out how unsupervised topic extraction using Non-Negative Matrix Factorization from scikit-learn works.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_topics = 10
n_top_words = 5

vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                             stop_words='english')
#named tfidf_matrix so it does not shadow the tfidf module imported earlier
tfidf_matrix = vectorizer.fit_transform(bradley_cooper_tweets)
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf_matrix)
feature_names = vectorizer.get_feature_names()

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

Topic #0:
seal sensational navy portrayal chris
()
Topic #1:
response claims anti coopers american
()
Topic #2:
best oscar actor oscars 2015
()
Topic #3:
talks sniper american veterans news
()
Topic #4:
eastwood clint featurette new talk
()
Topic #5:
good amazing job looks film
()
Topic #6:
movie great americansniper star awesome
()
Topic #7:
performance eerie inside news american
()
Topic #8:
sniper american watching film premiere
()
Topic #9:
kyle chris taya americansniper miller
()


The results are pretty impressive, and each group of words clearly represents an idea. But this is not what the task asks for, so we will not use this method. It is shown for illustration purposes only.