# Analysis of Twitter Data
## Tweets Processing with NLTK Library

### Term co-occurrences
Sometimes we are interested in terms that occur together, mainly because context gives better insight into the meaning of a term, which supports applications such as word-sense disambiguation and semantic similarity. We discussed bigrams in the previous lab; here we extend the context of a term to the whole tweet.
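
To make the difference between the two contexts concrete, here is a minimal sketch. The token list is a made-up example, and `itertools.combinations` is used only for illustration; it is not part of the lab code.

```python
from itertools import combinations

# Hypothetical, already-tokenized tweet used only for illustration
tokens = ["python", "makes", "data", "analysis", "fun"]

# Bigrams: each term is paired only with its immediate neighbour
bigrams = list(zip(tokens, tokens[1:]))

# Whole-tweet context: every unordered pair of terms in the tweet
pairs = list(combinations(tokens, 2))

print(len(bigrams))  # 4
print(len(pairs))    # 10
```

The pair ("python", "fun") appears only in the whole-tweet list, because the two terms are not adjacent.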

We can *refactor* the code from the previous lab in order to capture the co-occurrences. We build a co-occurrence matrix com such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:

In [6]:
import json
import re
############################# from the old code of the previous lab
# ;P
from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str, # the emoticon strings defined above
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=True):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
############################# End old code here 
# new code begins here

from collections import defaultdict
com = defaultdict(lambda : defaultdict(int))
# Read the JSON data set, one tweet per line
fname = "tweets.json"
with open(fname, 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which json.loads cannot decode
        tweet = json.loads(line)
        # substitute non-ASCII characters, e.g. \u2026, with a space
        tw = re.sub(r'[^\x00-\x7F]+', ' ', tweet['text'])
        terms_only = [term for term in preprocess(tw)
                      if term not in stop
                      and not term.startswith(('#', '@'))]

        # Build co-occurrence matrix
        for i in range(len(terms_only)-1):
            for j in range(i+1, len(terms_only)):
                w1, w2 = sorted([terms_only[i], terms_only[j]])
                if w1 != w2:
                    com[w1][w2] += 1
print("Done.")

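While building the co-occurrence matrix, we don't want to count the same term pair twice, since com[A][B] and com[B][A] would hold the same value. The inner for loop therefore starts from i+1, so that only a triangular matrix is built, and sorted puts each pair in alphabetical order, so a pair is always stored under the same key regardless of the order in which its terms appear in the tweet.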
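
The effect of the triangular layout can be checked on a toy input. The sketch below repeats the same nested loop on two hypothetical token lists (`toy_tweets` and `toy_com` are illustrative names, not part of the lab code):

```python
from collections import defaultdict

# Hypothetical token lists standing in for preprocessed tweets
toy_tweets = [
    ["data", "python", "analysis"],
    ["python", "data"],
]

toy_com = defaultdict(lambda: defaultdict(int))
for terms in toy_tweets:
    for i in range(len(terms) - 1):
        for j in range(i + 1, len(terms)):
            # sorted() stores every pair under its alphabetically first term
            w1, w2 = sorted([terms[i], terms[j]])
            if w1 != w2:
                toy_com[w1][w2] += 1

print(toy_com["data"]["python"])    # 2: counted once per tweet, in either order
print("data" in toy_com["python"])  # False: the mirrored cell is never filled
```

One consequence of this design is that a lookup has to sort the two terms first, otherwise it may hit the empty half of the matrix.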

For each term, we then extract its 10 most frequent co-occurring terms, creating a list of tuples of the form ((term1, term2), count):

In [None]:
import operator
com_max = []
# For each term, look for the most common co-occurring terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1),
                          reverse=True)[:10]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))
# Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:10])
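
As an alternative sketch (not the approach used in the cell above), the nested dictionaries can be flattened into a single `collections.Counter` keyed by the term pair, and `most_common` then does the ranking. The counts in `toy_com` are invented for illustration. Up to ties, this should yield the same leading pairs as the two-step sort.

```python
from collections import Counter

# Hypothetical co-occurrence counts in the same nested layout as com
toy_com = {
    "analysis": {"data": 3, "python": 1},
    "data": {"python": 5, "science": 2},
}

# Flatten the nested dictionaries into one counter keyed by (term1, term2)
pair_counts = Counter()
for t1, row in toy_com.items():
    for t2, n in row.items():
        pair_counts[(t1, t2)] = n

top = pair_counts.most_common(2)
print(top)  # [(('data', 'python'), 5), (('analysis', 'data'), 3)]
```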

We could also look for a specific term and extract its most frequent co-occurrences. We simply need to modify the main loop to include an extra counter, for example:

In [None]:
import sys
from collections import Counter

# lowercase the query, because preprocess() lowercases the terms it returns
search_word = raw_input('Enter your query: ').lower() # input a term query
count_search = Counter()

fname = "c:\\Program Files\\Anaconda2\\tweets_bigData_dataAnalytic.json"
count = 0
with open(fname, 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which json.loads cannot decode
        count = count + 1
        # progress indicator: a dot every 500 tweets, a line break every 35000
        if count%500 == 0:
            sys.stdout.write('.')
        if count%35000 == 0:
            sys.stdout.write('\n')
        tweet = json.loads(line)
        # substitute non-ASCII characters, e.g. \u2026, with a space
        tw = re.sub(r'[^\x00-\x7F]+', ' ', tweet['text'])
        terms_only = [term for term in preprocess(tw)
                      if term not in stop and not term.startswith(('#', '@'))]
        if search_word in terms_only:
            count_search.update(terms_only)
print("\nCo-occurrence for %s:" % search_word)
print(count_search.most_common(10))
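
Note that the counter in the loop above is updated with every term of a matching tweet, so the search word itself is counted too and will normally come out on top. A minimal sketch of a variant that leaves the query out, using hypothetical token lists and a hypothetical query in place of the real data set:

```python
from collections import Counter

# Hypothetical token lists standing in for preprocessed tweets
toy_tweets = [
    ["python", "data", "analysis"],
    ["python", "data"],
    ["java", "data"],
]

query = "python"  # hypothetical search term
neighbours = Counter()
for terms in toy_tweets:
    if query in terms:
        # Count every other term in the tweet, skipping the query itself
        neighbours.update(t for t in terms if t != query)

print(neighbours.most_common(2))  # [('data', 2), ('analysis', 1)]
```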