# Analysis of Twitter Data
## Tweets Processing with NLTK Library

### Tokenize a Tweet Text
Now, let’s see an example, using the popular *NLTK* library to tokenise a fictitious tweet:

In [None]:
from nltk.tokenize import word_tokenize
 
tweet = 'RT @toto How #BigData and CRM are Shaping Modern Marketing :D https://t.co/TgUYSUp9jT'
print(word_tokenize(tweet))

We can see some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK:

*@-mentions*, *emoticons*, *URLs* and *#hash-tags* are not recognised as "single tokens". 

The following code will propose a pre-processing chain that will consider these aspects of the language.

In [None]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str, # the emoticon strings defined above
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=True):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
 
tweet = 'RT @toto How #BigData and CRM are Shaping Modern Marketing :D https://t.co/TgUYSUp9jT'
print(preprocess(tweet))
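
The order of the patterns in *regex_str* matters: Python's *re* module tries the alternatives of a `|` group from left to right and keeps the first one that matches at the current position. The following is a minimal sketch of that behaviour, using three simplified patterns rather than the full list above; the names `patterns_specific_first` and `patterns_generic_first` are hypothetical and only serve the illustration.

```python
import re

# Simplified, hypothetical patterns (a subset of the idea behind regex_str)
mention = r'@\w+'   # @-mentions
word = r'\w+'       # plain words
other = r'\S'       # any single non-space character

# Specific patterns first, catch-all last (the ordering used in regex_str)
patterns_specific_first = re.compile('|'.join([mention, word, other]))
# Same patterns, but the catch-all comes before the @-mention pattern
patterns_generic_first = re.compile('|'.join([word, other, mention]))

text = 'RT @toto :D'
tokens_good = patterns_specific_first.findall(text)
tokens_bad = patterns_generic_first.findall(text)
print(tokens_good)  # ['RT', '@toto', ':', 'D']
print(tokens_bad)   # ['RT', '@', 'toto', ':', 'D']
```

With the catch-all placed too early, the `@` is consumed as a single character and the mention is split in two, which is why the most specific patterns should come first.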

### Counting Terms
Assuming we have already collected a list of tweets, now we will do a simple word count. 

In this way, we can observe which terms are most commonly used in the data set. In this example, we’ll use the set of tweets, so *the most frequent words should correspond to the topics we discuss* (not necessarily, but bear with me for a couple of paragraphs).

We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described previously, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. 

In order to keep track of the frequencies while we are processing the tweets, we can use *collections.Counter()* which internally is a dictionary (term: count) with some useful methods like *most_common()*:

In [None]:
import json
from collections import Counter
 
fname = 'tweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 10 most frequent words
    most10_freqWords = count_all.most_common(10)
    print(most10_freqWords)

As we can see, the most frequent words (or should we say, tokens), are not exactly meaningful. ;D
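
To make the behaviour of *collections.Counter()* more concrete, here is a small sketch that does not need the tweets file. The token lists in `tweets_tokens` are hypothetical stand-ins for the output of preprocess() on two tweets.

```python
from collections import Counter

# Hypothetical token lists standing in for two already-tokenised tweets
tweets_tokens = [
    ['rt', '@toto', 'how', '#bigdata', 'and', 'crm'],
    ['#bigdata', 'and', 'marketing'],
]

counts = Counter()
for tokens in tweets_tokens:
    counts.update(tokens)  # adds 1 for every occurrence of every token

print(counts['#bigdata'])  # 2
print(counts['missing'])   # 0, unseen keys do not raise KeyError
top_two = counts.most_common(2)  # list of (term, count) pairs
print(top_two)
```

Note that update() adds to the existing counts rather than replacing them, which is what allows the counter to accumulate across tweets inside the loop.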

### Removing stop-words
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called "*stop-words*". In the example above, we can see common stop-words such as '*to*' and '*the*', together with noise tokens such as '*\u2026*' (the horizontal ellipsis character, which is not a word at all). 

Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list for English stop-words).

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like "RT" (used for re-tweets) and "via" (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

In [None]:
from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

We can now substitute the variable **terms_all** in the first example with something like:

In [None]:
import json
from collections import Counter

from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

fname = 'tweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        #terms_all = [term for term in preprocess(tweet['text'])]
        terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
        # Update the counter
        #count_all.update(terms_all)
        count_all.update(terms_stop)
    # Print the first 10 most frequent words
    print(count_all.most_common(10))
    # compare with the previous one
    print("Old ones: " + str(most10_freqWords))
    # note that you can add tokens such as \u2026, de, la, ... to the stop-word list
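
As the last comment in the cell suggests, the stop-word list can be extended with any token that turns out to be noise. The sketch below shows one way to do this; it is an illustration under assumptions, not the notebook's own code. The list `basic_stop` is a deliberately tiny, hypothetical replacement for stopwords.words('english') so that the example runs without NLTK, and `extra_stop` and `stop_set` are names introduced here.

```python
import string

# Hypothetical, deliberately tiny stop-word list used only for illustration;
# in the notebook this role is played by stopwords.words('english')
basic_stop = ['the', 'to', 'and', 'a', 'of']
# Extra noise tokens: Twitter conventions plus the ellipsis character
extra_stop = ['rt', 'via', '\u2026']

# A set makes each "term not in stop" check fast
stop_set = set(basic_stop) | set(string.punctuation) | set(extra_stop)

tokens = ['rt', '@toto', 'how', 'to', 'use', '#bigdata', ':', '\u2026']
filtered = [t for t in tokens if t not in stop_set]
print(filtered)  # ['@toto', 'how', 'use', '#bigdata']
```

Using a set instead of a list does not change the result; it only avoids scanning the whole list for every token, which helps on larger collections of tweets.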

### More term filters
Besides stop-word removal, we can further customise the list of terms / tokens we are interested in. Here you have some examples that you can embed in the first fragment of code:

In [None]:
import json
from collections import Counter

from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

fname = 'tweets.json'
with open(fname, 'r') as f:
    count_all_single = Counter()
    count_all_hash = Counter()
    count_all_only = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms, excluding stopwords
        terms_all = [term for term in preprocess(tweet['text'])\
              if term not in stop]

        # Count terms only once, equivalent to "Document Frequency"
        terms_single = set(terms_all)

        # Count hashtags only
        terms_hash = [term for term in preprocess(tweet['text'])\
              if term.startswith('#')]

        # Count terms only (no hashtags, no mentions)
        # Mind the ((double brackets)): startswith() accepts a tuple
        # (not a list) when we want to check several prefixes at once
        terms_only = [term for term in preprocess(tweet['text'])\
              if term not in stop and\
              not term.startswith(('#', '@'))] 
        
        # Update the counter
        count_all_single.update(terms_single)
        count_all_hash.update(terms_hash)
        count_all_only.update(terms_only)
    # Print the first 10 most frequent words
    print("DF (all terms): " + str(count_all_single.most_common(10)))
    print("Hashtag: " + str(count_all_hash.most_common(10)))
    print("Term only: " + str(count_all_only.most_common(10)))
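
The difference between counting every occurrence and counting each term once per tweet (the "Document Frequency" mentioned in the comments above) is easier to see on a tiny example. This is a sketch on hypothetical data; `tweets_tokens`, `term_freq` and `doc_freq` are names introduced only for illustration.

```python
from collections import Counter

# Hypothetical tokenised tweets; 'data' is repeated inside the first one
tweets_tokens = [
    ['data', 'data', 'data', 'science'],
    ['data', 'marketing'],
    ['science', 'marketing'],
]

term_freq = Counter()  # counts every occurrence
doc_freq = Counter()   # counts each term at most once per tweet
for tokens in tweets_tokens:
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # set() removes repeats within a tweet

print(term_freq['data'])  # 4
print(doc_freq['data'])   # 2
```

A term repeated many times in a single tweet inflates the plain count, while the document frequency only tells us in how many tweets it appears.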

### Tuples of adjacent tokens
While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. *bigrams*).

In [None]:
from nltk import bigrams 
 
terms_bigram = bigrams(terms_stop)

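The *bigrams()* function from *NLTK* takes a list of tokens and produces tuples of adjacent tokens. In NLTK 3 it returns a generator rather than a list, so wrap it in *list()* if you need to inspect or reuse the result. 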

Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.
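
To see what "tuples of adjacent tokens" means in practice, the idea can be sketched with the standard library alone. The helper `ngrams_sketch` below is hypothetical and is not NLTK's implementation; it only mirrors the pairing behaviour described above and generalises it to sequences of n tokens.

```python
def ngrams_sketch(tokens, n=2):
    """Return the list of n-tuples of adjacent tokens (hypothetical helper)."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['to', 'be', 'or', 'not', 'to', 'be']
print(ngrams_sketch(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(ngrams_sketch(tokens, 6))
# [('to', 'be', 'or', 'not', 'to', 'be')]
```

Because the tuples are hashable, they can be passed directly to Counter.update(), exactly like single terms.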

So after counting and sorting the bigrams, this is the result:

In [None]:
import json
from collections import Counter
from nltk import bigrams 
from nltk.corpus import stopwords
import string
 
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

fname = 'tweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
        terms_bigram = bigrams(terms_stop)
        # Update the counter
        count_all.update(terms_bigram)
    # Print the 10 most frequent bigrams
    print(count_all.most_common(10))