### The Backstory and the Setup

You just signed up for PyDataLondon and you are super excited about it! Since you hear that measuring Twitter sentiment is all the rage these days (be it for speculating in the stock market, or identifying a viral product), you decide that you also want in. Let's try to apply some NLP (natural language processing) goodness to analyze #PyDataLondon tweets!

In [None]:
# grab the data that we've downloaded for you
# again, don't be worried if you don't understand this part- it's just to set you up for the main parts
import pickle

with open('./datasets/twitter_data.pkl', 'rb') as pickled_file:
    tweets_list = pickle.load(pickled_file)
# quick sanity check
print(len(tweets_list))

In [None]:
# let's see what a tweet looks like
tweets_list[0].keys()

In [None]:
import pandas as pd
tweets_series = pd.Series(tweets_list)
tweets = tweets_series.apply(lambda x: x['text'].lower())
# print out the first 5 tweets just as a sanity check
tweets.head()

### A Detour

If you are pressed for time, skip to the next section (Back on Track) and come back here later.

For those of you who tried collecting your own twitter stream and didn't filter it as aggressively, you may notice that importing it and later manipulating it may have taken a long time. Exactly how memory intensive is this?

In [None]:
twitter_char_limit = 140
worst_case_utf8_bytes_per_char = 4
ram = twitter_char_limit * worst_case_utf8_bytes_per_char * len(tweets)

print('Series could take up to {:.2f} MB of memory'.format(ram / 1024. / 1024.))

Thankfully, UTF-8 is a variable-length encoding. Loosely in the spirit of [Huffman Coding](https://en.wikipedia.org/wiki/Huffman_coding), the most common characters get the shortest codes (plain ASCII takes a single byte), which cuts down the average number of bytes per char significantly. Also, most tweets don't use the full 140 characters.
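
To make the variable-length point concrete, here is a small sketch using only Python's built-in `str.encode`. The sample characters are arbitrary picks for illustration, not taken from the tweet data.

```python
# how many bytes does UTF-8 need for different kinds of characters?
samples = {
    'a': 'plain ASCII letter',
    'é': 'accented Latin letter',
    '€': 'euro sign',
    '😀': 'emoji',
}

# encode each character on its own and count the resulting bytes
utf8_byte_counts = {char: len(char.encode('utf-8')) for char in samples}

for char, description in samples.items():
    print('{!r} ({}): {} byte(s)'.format(char, description, utf8_byte_counts[char]))
```

Mostly-English tweets should therefore sit much closer to 1 byte per character than to the worst case of 4.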

Let's take a quick detour to look at the number of characters in a tweet.

In [None]:
tweet_lengths = tweets.apply(lambda text: len(text))

Let's plot a histogram of the character lengths!

In [None]:
import matplotlib
import matplotlib.pyplot as plt
# use new pretty plots
matplotlib.style.use('ggplot')
# get notebook to show graphs
%matplotlib inline

# because data scientists hate charts with no labels :D
plt.ylabel('frequency')
plt.xlabel('number of characters in tweet')
tweet_lengths.hist()

In [None]:
# what's the average number of characters?
import numpy as np
tweet_lengths.mean()

In [None]:
average_chars_per_tweet = tweet_lengths.mean()
average_utf8_bytes_per_char = 2  # rough guess, not a measured value
ram = average_chars_per_tweet * average_utf8_bytes_per_char * len(tweets)

print('Expect series to take about {:.2f} MB of memory'.format(ram / 1024. / 1024.))
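
Both estimates assume the strings sit in memory as UTF-8. CPython actually stores each string with 1, 2 or 4 bytes per character, depending on the widest character it contains, plus a fixed per-object overhead. A rough way to measure this directly is `sys.getsizeof`. The sketch below uses a made-up list called `sample_tweets` (an assumption for illustration) in place of the real data.

```python
import sys

# hypothetical stand-in for the real tweets; swap in `tweets.values` in the notebook
sample_tweets = [
    'rt @someone: loving #pydatalondon',
    'big data and machine learning talks today',
    'café ☕ before the next session',
]

# getsizeof reports the in-memory size of each str object, including its header
sizes_in_bytes = [sys.getsizeof(text) for text in sample_tweets]
total_bytes = sum(sizes_in_bytes)

for text, size in zip(sample_tweets, sizes_in_bytes):
    print('{:3d} chars -> {:3d} bytes in memory'.format(len(text), size))
print('Total: {:.6f} MB'.format(total_bytes / 1024. / 1024.))
```

On the real data, `tweets.memory_usage(deep=True)` should report a comparable figure straight from pandas.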

Here is another way to roughly check how much memory is being used:

Pandas can store objects in HDF5 files (similar to pickle in Python).
You can then load the object back from the file into memory later.

In [None]:
# make sure you have `pip3.5 install --user tables`
tweets.to_hdf('datasets/tweets.h5', key='tweets')
# you can access that same file and take the data back out
data = pd.read_hdf('datasets/tweets.h5', key='tweets')
data.describe()

In [None]:
# OMG what is this magic- you can access the linux commandline toolkit from the notebook
!ls -lh datasets

Notice what happens to the file size of tweets.h5 if you run the store command more than once.

In [None]:
for ii in range(3):
    with pd.HDFStore('datasets/tweets.h5') as store:
        store.put('tweets', tweets)
        print(store)
    !ls -lh datasets/tweets.h5

In [None]:
# in fact, look at what happens even if you delete the object
with pd.HDFStore('datasets/tweets.h5') as store:
    store.remove('tweets')
    print(store)
!ls -lh datasets/tweets.h5

HDF5 is not a read-only, immutable file system, but notice that rewriting or deleting an object does not shrink the file: the space is not reclaimed. This write-once, append-mostly behaviour is typical of, and extremely important for, a lot of distributed file systems (Hadoop's [HDFS](https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/), Google's [GFS](https://en.wikipedia.org/wiki/Google_File_System), etc.).

### Back on Track

Well. After that huge detour, let's go back to analyzing the tweets. We are going to use a technique called [word vectors](http://www.eecs.qmul.ac.uk/~dm303/static/eecs_open14/eecs_open14.pdf), and try to find out which words are most commonly used with other words.

In [None]:
from collections import defaultdict

word_count = defaultdict(int)

for tweet in tweets.values:
    for word in tweet.split():
        word_count[word] += 1

print('{} unique words'.format(len(word_count)))

In [None]:
# let's show off another python standard library feature
from collections import Counter

words = Counter(word_count)
print(words.most_common(10))

If you were asked to find the best chart to visualize word counts, what would your choice be?

I know what my choice is.

Here's a cool little non-standard library that you should be able to install with a single command.
Python is amazing.

In [None]:
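# pip3.5 install --user wordcloud
import matplotlib.pyplot as plt  # repeated here in case you skipped the detour
from wordcloud import WordCloud
# recent wordcloud versions expect a dict of word -> frequency; a Counter is a dict
wordcloud = WordCloud(width=800, height=600).generate_from_frequencies(words)
plt.imshow(wordcloud)
plt.axis("off")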

Word clouds are so coool. In view of that, let's make the picture take up the whole screen, so we can stare at it __IN ALL ITS GLORY__ :D

In [None]:
def enlarge(multiplier=2):
    # if you want to understand more about this function, refer to the data visualization notebook
    params = plt.gcf()
    original_width, original_height = params.get_size_inches()
    new_size = (original_width * multiplier, original_height * multiplier)
    params.set_size_inches(new_size)

enlarge()
plt.imshow(wordcloud)
plt.axis("off")

### Ahem
Let's get back on track again... Too much chart porn is bad for you after all.

First, let's do some long overdue data cleanup that we spotted from the word cloud. We probably don't care about retweets, prepositions etc. And on that note, we also probably don't care about the words which only occur a couple times.

In [None]:
exclude_words = [
    'rt', 'to', 'for', 'the', 'with', 'at', 'via', 'on', 'if', 'by', 'how', 'are', 'this',
    'do', 'into', 'or', '-', 'you', 'is', 'a', 'i', 'it', 'in', 'and', 'of', 'from'
]
word_count_filtered = {k: v for k, v in word_count.items() if k not in exclude_words}
words = pd.DataFrame.from_dict(word_count_filtered, orient='index').rename(columns={0: 'frequency'})
words.head()

In [None]:
limit = 30
shortened_list = words[words.frequency > limit]
print(
    'If we limit the words to any word that occurs more than {} times, '
    'we are left with {} words (from {} words)'.format(
        limit,
        len(shortened_list), len(words)
    )
)

Let's calculate the collocation/co-occurrence frequency:

i.e., if this word is in the tweet, how often are these other words also in the tweet?

In [None]:
# First, let's create a DataFrame filled with zeros
occurrence_frequency = pd.DataFrame(0, index=shortened_list.index.values, columns=shortened_list.index.values)
# sanity check again
occurrence_frequency.iloc[:5, :5]

In [None]:
# next, let's remove all the unnecessary words
cleaned_tweets = tweets.apply(lambda tweet: [word for word in tweet.split() if word in occurrence_frequency.index])

In [None]:
# a triple for-loop to add up and fill in the counts for each word vis-a-vis other words
for word_list in cleaned_tweets.values:
    for word in word_list:
        for other_word in word_list:
            # .loc writes into the DataFrame itself; chained df[a][b] may update a copy
            occurrence_frequency.loc[other_word, word] += 1

In [None]:
occurrence_frequency.head()

Great! Now we have everything set up and we are ready to look at the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between different words.

We are thinking of each word as an n-dimensional vector (where each dimension is the co-occurrence frequency for another specific word). The cosine similarity basically looks and says, "hey `word_a` co-occurs a lot with `word_b` but does not appear with `word_c`. Oh hey, `word_d` also co-occurs a lot with `word_b` but not with `word_c`. I guess that `word_a` and `word_d` must be quite similar then."
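
Here is a tiny worked example of that intuition in plain Python. The vectors are made-up co-occurrence counts, and `cosine_similarity`, `word_e` and `word_f` are names introduced just for this sketch. In the notebook itself scipy does the heavy lifting.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up co-occurrence counts against (word_b, word_c, word_e)
word_a = [10, 0, 1]   # often with word_b, never with word_c
word_d = [8, 0, 2]    # same pattern as word_a
word_f = [0, 9, 1]    # the opposite pattern

sim_a_d = cosine_similarity(word_a, word_d)
sim_a_f = cosine_similarity(word_a, word_f)
print('word_a vs word_d: {:.2f}'.format(sim_a_d))
print('word_a vs word_f: {:.2f}'.format(sim_a_f))
```

Only the direction of the vectors matters, not their length, so a rare word and a frequent word can still come out as similar if they keep the same company.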

In [None]:
from scipy.spatial.distance import pdist, squareform
cosine_distances = squareform(pdist(occurrence_frequency, metric='cosine'))
cosine_distances.shape

In [None]:
cosine_distances[:5,:5]

You can see that the distance between any word and itself is 0.
Let's flip it around for a second and look at similarity instead.

In [None]:
# scipy's cosine distance is 1 - cosine similarity, so subtract from 1 to get the similarity back
cosine_similarities_array = 1 - cosine_distances
similarity = pd.DataFrame(
    cosine_similarities_array, 
    index=occurrence_frequency.index, 
    columns=occurrence_frequency.index
)
similarity.head()

Now you can see that any word is 100% similar to itself.

Well that is great and all, but how would you visualize word similarity?

It turns out that scikit-learn has just the tool for us:

In [None]:
from sklearn import manifold
# http://scikit-learn.org/stable/modules/manifold.html#multidimensional-scaling
mds = manifold.MDS(n_components=2, dissimilarity='precomputed')
words_in_2d = mds.fit_transform(cosine_distances)
words_in_2d[:5]

[MDS](https://en.wikipedia.org/wiki/Multidimensional_scaling) allows us to go from the n by n matrix down to a more manageable lower-dimension representation of the n words. In this case, we choose a 2-d representation, which allows us to...

In [None]:
# make a bubble chart

counts = [word_count[word] for word in occurrence_frequency.index.values]
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts)

In [None]:
enlarge()
important_words = words[words.frequency > 80].index.values
for word in important_words:
    idx = occurrence_frequency.index.get_loc(word)
    plt.annotate(word, xy=words_in_2d[idx], xytext=(0,0), textcoords='offset points')
plt.scatter(x=words_in_2d[:,0], y=words_in_2d[:,1], s=counts, alpha=0.3)

That's cool! You can see there are:
- a cluster with monty + python
- a cluster of (I'm guessing) Spanish words
- a cluster of data science / big data / machine learning / data analytics, which weirdly also contains @kirkdborne. Checking his twitter, it turns out he posts a lot about data science!

If you've gotten to here, a big congratulations on finishing the hardest tutorial of the bunch!

If you still have time, here are a couple of suggestions for you to work on:

- Try to write your own code to download tweets. Ask me to reference the code I used. [Here](http://adilmoujahid.com/posts/2014/07/twitter-analytics/) is another tutorial that is quite comprehensive. You will have to set up a Twitter developer's account, create an app and get an API token first.
- Try to use what we have developed so far to create your own search algorithm, e.g. search for all the tweets that have to do with machine learning (and it knows to show anything related to data science, big data, data analytics, etc.).
- This was definitely a case where we kept bumping up against resource limits. The triple for loop that fills in the occurrence_frequency counts is a killer: given n tweets of roughly k words each, it does (very very roughly) n*k^2 single-cell DataFrame updates, i.e. a [computation complexity](https://en.wikipedia.org/wiki/Big_O_notation) of O(nk^2), compared to most of the other stuff we did, which was mainly O(kn). Each of those updates is also slow on its own. Can we rewrite the code to make it better?
- For the last scatter plot we just generated, use a clustering algorithm to color the points, so that we can see the clusters that we just observed more clearly.
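
As one possible starting point for the third suggestion, the sketch below counts co-occurring pairs in a plain `Counter` keyed by `(word, other_word)` instead of updating DataFrame cells one at a time. The `toy_tweets` list is made up for illustration. In the notebook you would feed in `cleaned_tweets.values` instead. The pair loop is still quadratic in the length of each tweet, so this only removes the per-cell pandas overhead.

```python
from collections import Counter
from itertools import product

# made-up, already-cleaned tweets (lists of words) standing in for cleaned_tweets
toy_tweets = [
    ['python', 'data', 'science'],
    ['python', 'monty'],
    ['data', 'science'],
]

pair_counts = Counter()
for word_list in toy_tweets:
    # every ordered pair, including a word with itself, like the triple loop does
    pair_counts.update(product(word_list, repeat=2))

print(pair_counts[('data', 'science')])
print(pair_counts[('python', 'monty')])
```

The counts can then be loaded into a DataFrame in one go, for example with something like `pd.Series(pair_counts).unstack(fill_value=0)`.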