# Working with Tweets

In this notebook, we will delve into the analysis of tweet contents.

We consider the dataset of tweets from Elon Musk, SpaceX and Tesla founder, and ask the following questions:
* What is Elon most actively tweeting about?
* Who is Elon most frequently referring to?

We will explore how to work with the contents of tweets.

In [None]:
# imports

import os, codecs
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Let's start with some basics (or a refresher) on working with text in Python. A text is a sequence of discrete symbols: words or, more generally, tokens.
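
As a minimal sketch of this idea, the snippet below turns a string into a list of tokens by splitting on whitespace. The sentence is made up for illustration, and whitespace splitting is only the simplest possible tokenizer, not the one used later in this notebook.

```python
# A text is a string; splitting it yields a sequence (list) of tokens.
text = "Falcon 9 has landed on the droneship"  # made-up example sentence

# Naive tokenization: split on whitespace.
tokens = text.split()

print(tokens)       # the list of tokens
print(len(tokens))  # how many tokens the text contains
print(tokens[0])    # sequences support indexing ...
print(tokens[-2:])  # ... and slicing
```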

## Import the dataset
Let us load Elon Musk's tweets dataset into memory.

<img src="images/elon_loop.jpeg" width="400px" height="400px">

In [None]:
# import the dataset using Pandas, and create a data frame

root_folder = "data"
df_elon = pd.read_csv(os.path.join(root_folder, "elonmusk_tweets.csv"), encoding="utf8", sep=",")
df_elon['text'] = df_elon['text'].str[1:] #remove the starting 'b' from every tweet

In [None]:
df_elon.head(10)

In [None]:
df_elon.tail(5)

In [None]:
df_elon.shape # (number of rows, number of columns)

In [None]:
df_elon["text"].tolist()[:10] #convert a column to a list

## Working with tweet contents

In [None]:
# import some of the most popular libraries for NLP in Python
import nltk
import string
#import sklearn

In [None]:
#nltk.download('punkt')

A typical NLP pipeline might look like the following:
    
<img src="images/spacy_pipeline.png" width="600px" height="600px">

* Tokenization: split a text into tokens.
* Filtering: remove some of the tokens if not needed (e.g., punctuation). If and how to remove is task dependent.
* Tagger, parser: syntactic structure.
* NER (Named Entity Recognition): find named entities.
* ...
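
The steps listed above can be sketched as a chain of functions, each consuming the output of the previous one. Everything here is a toy stand-in written for illustration (the function names, the example sentence, and the "capitalized token" rule are assumptions); real libraries such as spaCy or NLTK implement these stages very differently.

```python
import string

def tokenize(text):
    # Simplest possible tokenizer: split on whitespace.
    return text.split()

def filter_tokens(tokens):
    # Strip surrounding punctuation, then drop tokens that became empty.
    stripped = [t.strip(string.punctuation) for t in tokens]
    return [t for t in stripped if t]

def tag_capitalized(tokens):
    # Toy stand-in for a tagger/NER step: flag tokens that start with an uppercase letter.
    return [(t, t[0].isupper()) for t in tokens]

def run_pipeline(text, steps):
    # Feed the output of each step into the next one.
    result = text
    for step in steps:
        result = step(result)
    return result

result = run_pipeline(
    "Tesla opens a new factory, near Berlin!",  # made-up sentence
    [tokenize, filter_tokens, tag_capitalized],
)
print(result)
```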

### Tokenization: splitting a text into constituent tokens.

In [None]:
# NLTK provides us with a tokenizer for tweets

from nltk.tokenize import TweetTokenizer, word_tokenize
tknzr = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)

In [None]:
example_tweet = df_elon.text[1]
print(example_tweet)

A tokenizer takes a string and outputs a list of tokens.

We compare here two tokenizers: one for general English texts, and one specialized for tweets.

In [None]:
tkz1 = tknzr.tokenize(example_tweet)
print(tkz1)
print("\n======\n")
tkz2 = word_tokenize(example_tweet)
print(tkz2)

**Question**: can you spot what the Twitter tokenizer does differently from the standard one?

### Filtering unnecessary tokens

In [None]:
string.punctuation

In [None]:
# some more pre-processing

def filter_twt(tweet):
    
    # remove punctuation and short words and urls
    tweet = [t for t in tweet if t not in string.punctuation and len(t) > 3 and not t.startswith("http") and not t.startswith("www")]
    return tweet

def tokenize_and_string(tweet):
    
    tkz = tknzr.tokenize(tweet)
    
    tkz = filter_twt(tkz)
    
    return " ".join(tkz)

In [None]:
print(tkz1)
print("======")
print(filter_twt(tkz1))
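
To make the filtering rule easier to inspect, here is a small sketch that applies the same conditions to a hand-written token list. The tokens, including the shortened URL, are made up, and the function is copied here only so the snippet runs on its own. Note that the `len(t) > 3` condition also drops legitimate short words such as "car".

```python
import string

def filter_twt(tweet):
    # Same rule as the notebook's filter_twt, repeated so this snippet is self-contained:
    # drop punctuation, tokens of 3 characters or fewer, and URLs.
    return [t for t in tweet
            if t not in string.punctuation
            and len(t) > 3
            and not t.startswith("http")
            and not t.startswith("www")]

# Made-up tokens, as a tweet tokenizer might produce them.
toy_tokens = ["@SpaceX", "launch", "is", "go", "!", "https://t.co/abc", "Mars", "car", ":)"]

kept = filter_twt(toy_tokens)
print(kept)
```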

In [None]:
df_elon["clean_text"] = df_elon["text"].apply(tokenize_and_string)

In [None]:
df_elon.head(5)

In [None]:
# save cleaned up version

#df_elon.to_csv(os.path.join(root_folder,"df_elon.csv"), index=False)

### Building a dictionary with word occurrences

We want to build a dictionary of unique tokens, containing the number of times they appear in the corpus.

In [None]:
from collections import Counter

all_tokens = list()
for tweet in df_elon["clean_text"].tolist():
    all_tokens.extend(tweet.split())

c = Counter(all_tokens)
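
The same counting logic can be tried on a tiny, made-up corpus to see how `Counter` behaves. This is a sketch with hypothetical tweets, not output from the actual dataset.

```python
from collections import Counter

# Three made-up, already cleaned tweets.
clean_tweets = ["@SpaceX launch today", "launch window opens", "@SpaceX landing"]

toy_tokens = []
for tweet in clean_tweets:
    toy_tokens.extend(tweet.split())

counts = Counter(toy_tokens)

print(counts["launch"])       # number of occurrences of one token
print(counts["rocket"])       # a missing token counts as 0 instead of raising KeyError
print(counts.most_common(2))  # the two most frequent tokens with their counts
```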

#### Questions

* Find the tokens most used by Elon.
* Find the Twitter users most referred to by Elon (hint: user handles start with `@`).

In [None]:
[d for d in c.most_common(1000) if not d[0].startswith('@')][:10]

In [None]:
[d for d in c.most_common(1000) if d[0].startswith('@')][:10]

## Data visualization

The `pandas` API integrates with the plotting functionality of the `matplotlib` library.

This seamless integration, which is very nice, hides some of the complexity of `matplotlib` from users.

However, since there are cases where advanced customization is needed, it is useful to learn the high-level plotting functions of `pandas` and `seaborn`, and also to know how to customize plots further with `matplotlib` directly.

Very useful [`matplotlib` cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Let's plot the number of tweets mentioning one of the top 10 tokens over time.

In [None]:
# convert the created_at column to datetime

df_elon.created_at = pd.to_datetime(df_elon.created_at)

In [None]:
df_elon["year"] = df_elon.created_at.dt.year

In [None]:
df_elon.head()

In [None]:
# count the number of tweets containing a certain word (or user name)

which_word = '@SpaceX' #Tesla

df_elon["word_in_tweet"] = df_elon.clean_text.apply(lambda x: which_word in x)
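
One caveat worth keeping in mind: `which_word in x` on a string is a substring test, so it also matches longer tokens that merely contain the word. The sketch below contrasts it with an exact token match on a made-up cleaned tweet (the handle `@SpaceXFans` is hypothetical).

```python
which_word = "@SpaceX"
clean_text = "thanks @SpaceXFans great photo"  # made-up cleaned tweet

# Substring test: True, because "@SpaceX" is a prefix of "@SpaceXFans".
substring_match = which_word in clean_text

# Token test: False, because no whitespace-separated token equals "@SpaceX".
token_match = which_word in clean_text.split()

print(substring_match, token_match)
```

Whether the stricter token test is preferable depends on the question being asked; for short or common words it usually avoids accidental matches.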

In [None]:
d = df_elon.groupby('year').word_in_tweet.agg('sum')
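
Summing a boolean column per group counts the `True` values, because `True` is treated as 1 and `False` as 0. A minimal sketch on a made-up frame (the column names mirror the notebook, the values are invented):

```python
import pandas as pd

# Toy data: one row per tweet, with the year and whether the word occurs.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016],
    "word_in_tweet": [True, False, True, True, False],
})

# Summing booleans per year gives the number of matching tweets in that year.
per_year = toy.groupby("year")["word_in_tweet"].sum()
print(per_year)
```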

In [None]:
sns.barplot(x=d.index, y=d.values, color="skyblue")  # recent seaborn versions require keyword arguments

In [None]:
sns.barplot(x=d.index, y=d.values, color="skyblue")  # recent seaborn versions require keyword arguments
plt.xlabel("Year", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.title("Number of tweets mentioning %s"%which_word, fontsize=14)
plt.tight_layout()
plt.savefig("stuff/elon_plot.pdf")

Another question: how many tweets are posted per month over time? To answer it, we need to use the timestamp as the index and group by month.

In [None]:
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

In [None]:
df_elon.index = pd.to_datetime(df_elon['created_at'],format='%m/%d/%y %I:%M%p')

In [None]:
df_elon.groupby(pd.Grouper(freq='M')).agg('count')['id'][-10:]

In [None]:
df_elon.groupby(pd.Grouper(freq='M')).agg('count')['id'].plot()
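
As an alternative sketch, monthly counts can also be computed without changing the index, by grouping on the month period of the timestamp column. The data below is made up. One difference to be aware of: with this approach, months without any tweets simply do not appear in the result, whereas a time-based grouper typically reports them with a count of 0.

```python
import pandas as pd

# Toy data: four made-up tweets with their creation timestamps.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": pd.to_datetime(["2017-01-03", "2017-01-20", "2017-02-11", "2017-04-01"]),
})

# Convert each timestamp to its calendar month and count tweets per month.
per_month = toy.groupby(toy["created_at"].dt.to_period("M"))["id"].count()
print(per_month)
```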

**Remark**: there is much more to this than plotting. Take a look at the [Seaborn](https://seaborn.pydata.org/examples/index.html) or [Matplotlib](https://matplotlib.org/gallery.html) galleries for some compelling examples.

---

### Anatomy of a plot (OPT)

In [None]:
# first we create the figure, which is the 
# container where all plots reside

fig = plt.figure(figsize=(10, 10))

ax1 = fig.add_subplot(2, 2, 1)
plt.plot(np.random.randn(50).cumsum(), 'k--')

ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)
plt.show()

In [None]:
%matplotlib inline

# first we create the figure, which is the
# container where all plots reside
fig = plt.figure(figsize=(10, 10))

ax1 = fig.add_subplot(2, 2, 1)
plt.plot(np.random.randn(50), 'k--')

ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)

Each plot resides within a `Figure` object.

Each subplot resides within an `AxesSubplot` object.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,1].plot(np.random.randn(50), 'r--')
axes[0,1].plot(np.random.randn(50), 'b--')
axes[1,1].plot(np.random.randn(50), 'k--')
axes[1,0].plot(np.random.randn(50), '.')
axes[0,0].plot(np.random.randn(50), 'y-')
fig.set_size_inches(10, 10)
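
The figure/axes relationship can also be seen by calling methods on each axes object explicitly, rather than relying on `plt` to pick the current one. This is a small sketch with made-up data; titles and values are chosen only for illustration.

```python
import matplotlib.pyplot as plt

# One figure containing two subplots side by side.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each axes object has its own plotting and labelling methods.
axes[0].plot([1, 2, 3], [1, 4, 9], "k--")
axes[0].set_title("left")

axes[1].bar(["a", "b"], [3, 5], color="skyblue")
axes[1].set_title("right")

fig.tight_layout()
print(len(fig.axes))  # number of subplots held by the figure
```

Addressing axes explicitly tends to scale better than the `plt.plot` style once a figure holds several subplots, since it is always clear which subplot is being modified.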

---

### Exercise 1.

* Plot the top n words together in a single figure, and show their trends over time.
* Do the same for the top n users mentioned.

In [None]:
# Your code here

---