The task is to break hashtags down into their component words, e.g. #GoldenGlobe -> Golden, Globe.
Let's first load the data into a list, where each line (tweet) is one item. Later, we will treat each tweet as a document made up of words.

In [1]:
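def loadTweetsInList(filename):
    # "rU" opens the file in universal-newline mode (Python 2);
    # the with-block closes the file once the lines are read
    with open(filename, "rU") as tweets:
        return tweets.readlines()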

In [2]:
tweets = loadTweetsInList("../as.txt")

In [3]:
tweets[0:10]

['American Sniper\n',
 "If American Sniper doesn't make you proud to be an American, you probably need to check your pulse. Merica. http://t.co/d2bkKzbfWA\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/8Q93NiAWJJ\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/AQAtAo3vep\n",
 "Retweet if you're going to see the movie American Sniper http://t.co/CYdwIZxcSW\n",
 "Fav if you're going to see the movie American Sniper! http://t.co/DxWI1Sq6Jw\n",
 'I just watched American Sniper. You have to see what Bradley Cooper does in this movie. His performance is next level.\n',
 'Did you know #Jesseventura sued fallen seal estate #AmericanSniper for $1.8M. You worthless piece of \xf0\x9f\x92\xa9. Shameless.\n',
 "American Sniper kind of reminds me of the movie that's showing in the third act of Inglorious Basterds.\n",
 'Did y\xe2\x80\x99all see "American Sniper"? What a powerful movie. Bradley Cooper isn\xe2\x80\x99t too hard on the eyes, either

Let's drop the first item, since it's just a title.

In [4]:
tweets = tweets[1:]

Now, let's tokenize each string in this list. The result should be a list of lists.

In [5]:
import re
import string
from nltk.corpus import stopwords

# build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def cleanAndTokenizeTweet(tweet):

    # keep only printable ASCII characters; this drops emoji and non-English text
    tweet = "".join(ch for ch in tweet if ch in string.printable)

    # remove newline characters
    tweet = re.sub("\n", "", tweet)

    words = tweet.split()

    # remove stopwords (case-sensitive, so capitalized words like "You" are kept)
    meaningful_words = [w for w in words if w not in STOP_WORDS]

    # remove words of 2 characters or fewer
    long_words = [w for w in meaningful_words if len(w) > 2]

    return long_words


def getWordsFromTweets(tweets):
    tokenized_tweets = []
    for tweet in tweets:
        words = cleanAndTokenizeTweet(tweet)
        tokenized_tweets.append(words)
    return tokenized_tweets

In [6]:
tokenizedtweets = getWordsFromTweets(tweets)

In [7]:
tokenizedtweets[0]

['American',
 'Sniper',
 "doesn't",
 'make',
 'proud',
 'American,',
 'probably',
 'need',
 'check',
 'pulse.',
 'Merica.',
 'http://t.co/d2bkKzbfWA']

In [8]:
hashtags=[]
for item in tokenizedtweets:
    for word in item:
        if word.startswith('#'):
            hashtags.append(word)

In [9]:
hashtags[:10]

['#Jesseventura',
 '#AmericanSniper',
 '#AmericanSniper',
 '#AmericanSniper',
 '#AmericanSniper',
 '#The12DaysOfContent',
 '#AmericanSniper',
 '#AmericanSniper',
 '#AmericanSniper',
 '#AmericanSniper']

Let's write a function for splitting a hashtag into meaningful words.

A close look at the hashtags shows that they take one of these forms - 
1. Camel case
2. Separated by underscores
3. Lowercase word sequence

In cases 1 and 2, extracting the words is simple, and the function hashtagToWords handles them directly. Case 3 is not trivial, so we will use a different approach for it.

To split an all-lowercase hashtag such as "imissmyfriends", we can look up candidate substrings in a dictionary of valid words and choose the split positions, as in the word break problem: http://www.geeksforgeeks.org/dynamic-programming-set-32-word-break-problem/
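As a rough illustration of the dictionary-lookup idea, here is a minimal sketch that lists every way to split a string into known words. The function name `word_break` and the toy `vocabulary` are made up for this example; a real run would need a proper word list.

```python
def word_break(text, vocabulary):
    """Return every way to split `text` into words from `vocabulary`."""
    # splits[i] holds all segmentations of the prefix text[:i]
    splits = [[] for _ in range(len(text) + 1)]
    splits[0] = [[]]  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            # extend every segmentation of text[:start] with the new word
            if piece in vocabulary:
                for prefix in splits[start]:
                    splits[end].append(prefix + [piece])
    return splits[len(text)]

# Hypothetical toy vocabulary, for illustration only
vocabulary = {"i", "miss", "my", "friends", "friend", "s"}
candidates = word_break("imissmyfriends", vocabulary)
for candidate in candidates:
    print(candidate)
```

With this toy vocabulary there is more than one valid split (ending in "friends" or in "friend" + "s"), which is why some way of ranking the candidates is needed.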

But how do we decide which split is best? We need a baseline corpus that tells us which words are more frequent or tend to appear together, so that we can rank all splits by probability. As it happens, there is already an algorithm for this: the Viterbi algorithm. Let's see how it works.
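The contents of viterbi.py are not reproduced here, so the following is only a sketch of how a Viterbi-style segmenter could work, not necessarily the code used in this notebook. It assumes a unigram model: the probability of a split is the product of its word probabilities. The name `viterbi_segment_sketch`, the `max_word_len` parameter, and the numbers in `word_prob` are all invented for illustration.

```python
def viterbi_segment_sketch(text, word_prob, max_word_len=20):
    """Return (words, probability) of the most probable segmentation of text."""
    # best_prob[i]: probability of the best segmentation of text[:i]
    # last_start[i]: start index of the final word in that segmentation
    best_prob = [1.0] + [0.0] * len(text)
    last_start = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            # unknown words get probability 0, so they are never chosen
            prob = best_prob[start] * word_prob.get(text[start:end], 0.0)
            if prob > best_prob[end]:
                best_prob[end] = prob
                last_start[end] = start
    # walk the back-pointers from the end to recover the words
    words = []
    end = len(text)
    while end > 0:
        start = last_start[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, best_prob[len(text)]

# Hypothetical unigram probabilities; real values would come from a corpus
word_prob = {"i": 0.02, "miss": 0.001, "my": 0.01,
             "friends": 0.0005, "friend": 0.0008, "s": 0.0001}
sketch_words, sketch_prob = viterbi_segment_sketch("imissmyfriends", word_prob)
```

Because each position keeps only its best-scoring prefix, the search does not have to enumerate every split. If no segmentation into known words exists, this sketch returns the whole string as a single word with probability 0.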

In [10]:
#uncomment to see code, or look in viterbi.py
# %load ../viterbi.py

In [11]:
from viterbi import viterbi_segment
segments, probs = viterbi_segment("imissmyfriends")

In [12]:
print "words:"+str(segments)
print "probability:"+str(probs)

words:['i', 'miss', 'my', 'friends']
probability:2.02391305979e-13


In [13]:
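# matches any character followed by a capitalized word or a run of digits
first_cap_re = re.compile('(.)([A-Z][a-z]+|[0-9]+)')
# matches a lowercase letter or digit followed by a capital letter
all_cap_re = re.compile('([a-z0-9])([A-Z])')

def hashtagToWords(word):
    words = []
    if "_" in word:
        # case 2: separated by underscores
        words = word.split("_")
    elif all(x.isupper() for x in word):
        # all caps (e.g. an acronym): keep as a single word
        words.append(word)
    elif any(x.isupper() for x in word):
        # case 1: camel case, so insert spaces at the word boundaries
        s1 = first_cap_re.sub(r'\1 \2', word)
        words = all_cap_re.sub(r'\1 \2', s1).split()
    else:
        # case 3: all lowercase, so fall back to Viterbi segmentation
        segs, prob = viterbi_segment(word)
        words.extend(segs)
    return words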

In [14]:
print hashtagToWords("AmericanSniper")
print hashtagToWords("IMAX")
print hashtagToWords("BradleyCooper's")
print hashtagToWords("AmericanSniper:")
print hashtagToWords("Oscars2015")
print hashtagToWords("Birdman")
print hashtagToWords("The12DaysOfSummer")
print hashtagToWords("boxoffice")
print hashtagToWords("imissthis")

['American', 'Sniper']
['IMAX']
['Bradley', "Cooper's"]
['American', 'Sniper:']
['Oscars', '2015']
['Birdman']
['The', '12', 'Days', 'Of', 'Summer']
['box', 'office']
['i', 'miss', 'this']


*Awesome!!!* We are done with task 1, and we have tested the function on camel-case, all-caps, and lowercase hashtags. Let's move on to task 2.

In [15]:
# replace each hashtag with its split words
formatted_tweets = []

for item in tokenizedtweets:
    templist = []
    for word in item:
        if word.startswith('#'):
            templist.extend(hashtagToWords(word[1:]))
        else:
            templist.append(word)
    formatted_tweets.append(templist)

In [16]:
formatted_tweets[6]

['Did',
 'know',
 'Jesseventura',
 'sued',
 'fallen',
 'seal',
 'estate',
 'American',
 'Sniper',
 '$1.8M.',
 'You',
 'worthless',
 'piece',
 'Shameless.']

There are a number of approaches for finding the top trending topics in a given set of tweets. We will explore just one of them, which is very simple. Here are the steps - 

1. Get the text corpus of tweets matching a given keyword.
2. Vectorize the corpus using either word counts or TF-IDF. (We will choose whichever gives the best results.)
3. Group together synonyms and similar words/n-grams.
4. Get the top 5 words/n-grams with the highest TF or TF-IDF value.

Let's consider only 1-grams and 2-grams.
For finding similar words and phrases, we will use the WordNet dictionary.
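The counting part of steps 2 and 4 can be sketched with plain term frequencies. This is only an illustration of the word-count variant: the helper `top_ngrams` and the `sample_docs` tokens are made up here, and the synonym grouping of step 3 is not covered.

```python
from collections import Counter

def top_ngrams(tokenized_docs, n, k=5):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    counts = Counter()
    for tokens in tokenized_docs:
        # slide a window of n tokens over each document
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)

# Hypothetical tokenized tweets, shaped like formatted_tweets
sample_docs = [
    ["American", "Sniper", "powerful", "movie"],
    ["saw", "American", "Sniper", "yesterday"],
    ["Bradley", "Cooper", "American", "Sniper"],
]
top_unigrams = top_ngrams(sample_docs, 1)
top_bigrams = top_ngrams(sample_docs, 2)
```

N-grams are counted within each tweet rather than over one flattened list, so a bigram never spans the boundary between two tweets.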

In [20]:
import nltk
import itertools
totallist = list(itertools.chain.from_iterable(formatted_tweets))


In [29]:
text = nltk.Text(totallist[:500])

In [30]:
type(text)

nltk.text.Text

In [31]:
text.collocations()

American Sniper; you're going; Bradley Cooper; going see; Retweet
you're; american sniper; 'American Sniper'; January 16th?; box office;
come January; didn't come; (to use; see movie; Sniper didn't; American
Sniper!; American Sniper.; thought American; movie American; saw
American; see American
