**Preparation:** Run the following cell (select it and press `CTRL+Enter` or `CMD+Enter`) so that long output lines wrap and are easier to read.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

Whenever you want to add extra cells, you can do that by clicking `+Code` in the upper-left corner, or by pressing `Esc` (stop editing the cell) and then `A` (add a cell above) or `B` (add a cell below).

# TRUMP TWEETS

President Trump was well known for communicating a lot through Twitter. His tweets got varying levels of engagement, measured through either *Favorites* or *Retweets*. Can we understand what got people engaged, based on the contents of the tweets?

We'll use a subset of the data set prepared by Brendan of https://www.thetrumparchive.com/ , with tweets ranging from 2015-06-15 (the day he announced his candidacy) to 2021-01-08 (the day his account was suspended).

1. **Download the file using the following code**

In [None]:
# The following code downloads the file from GitHub:
!wget https://raw.githubusercontent.com/amjassem/DxU-Intro-to-Text-Analytics/main/Data/trump_tweets.csv

This time our data is in **.csv** format (the less fancy cousin of .xls), which you're most likely familiar with. It's basically a table of tweets and their metadata.

The `pandas` package is very convenient for working with tables.

2. **Read the data and see the first few rows**

In [None]:
import pandas as pd

# Read columns 1, 5, 6 and 7. Parse the dates
tweetData = pd.read_csv('trump_tweets.csv', usecols=[1, 5, 6, 7], parse_dates=[3])
# Print top 20 rows
tweetData.head(20)

## Task 1: Clean the texts

Tweets are often messy to analyze, so before we move forward we need to clean them up a bit.

Things to consider:
* Removing URLs
* Removing account handles (@account) and emails
* Removing strings that contain numbers
* Removing hashtags?

We might use `re.sub(pattern, '', string)` for this. You can use the [regex cheatsheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/).

URLs can be found fairly reliably using the pattern `(http|ftp|https)[^\s]*`, which looks for http, ftp, or https followed by any number of non-whitespace characters.
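As a quick illustration of `re.sub` with this pattern, here is a minimal sketch on a made-up string (the sample text and the variable names `sample`, `url_pattern`, and `cleaned` are assumptions for illustration, not part of the data set):

```python
import re

# Hypothetical sample text, not taken from the data set
sample = "Great rally tonight! https://example.com/abc Thank you!"

# Raw string so that \s reaches the regex engine unchanged
url_pattern = r'(http|ftp|https)[^\s]*'

# Replace every match of the URL pattern with an empty string
cleaned = re.sub(url_pattern, '', sample)

print(cleaned)
```

Note that removing a match leaves the surrounding spaces in place, so the cleaned text can contain double spaces where the URL used to be.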

How about finding account handles?

3. **Finish the function below such that it removes:**
  * URLs
  * account handles
  * e-mails
  * strings containing numbers

In [None]:
import re

def CleanText(text):

  # Removes URLs
  text = re.sub(r'(http|ftp|https)[^\s]*', '', text)
  # Removes account handles
  text = re.sub('', '', text)
  # Removes emails
  text = re.sub('', '', text)
  # Remove strings with numbers (e.g. "401k")
  text = re.sub('', '', text)

  return text

# Test on a couple of tweets
sel = [40, 95, 976, 32140]

for i in sel:
  print("===================")
  print(tweetData.text[i])
  print(CleanText(tweetData.text[i]))

## Task 2: Vectorization

In the case of the book reviews, we performed tokenization ourselves.

If we don't need any special rules, or if we have already made all the "custom" changes (like above), we can use a package for this. This is often much faster to compute.

We'll run `CountVectorizer()`. Notice that the vocabulary is still a bit messy. It can usually be improved by requiring that a term appear in a minimum number of tweets before it is included, using the `min_df=` argument of `CountVectorizer()`.

You can find out more options (arguments) by running `?CountVectorizer`.

4. **First run the code below**
  * Then set a minimum count with `min_df=`, re-run the code, and see how the vocabulary changes.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Get the text of the tweets
texts = [CleanText(text) for text in tweetData.text]

# The object for vectorizing the corpus
vecCounts = CountVectorizer(stop_words='english')
# Vectorize the corpus
counts = vecCounts.fit_transform(texts)

# Print the vocabulary
vocabulary = vecCounts.get_feature_names_out()
print('Number of terms:', len(vocabulary))
print(vocabulary[0:100])

## Task 3: Topic modelling

Let's run a topic model to try to find out what the tweets are about.

Select the number of topics `n_components=` and the number of iterations `max_iter=` (for now, let's choose a small number, e.g. 10; in an actual analysis it's best to iterate until convergence).

By setting `verbose=1` we can see the progress of the estimation.
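To get a feel for what the model returns, here is a minimal sketch on a made-up four-document corpus (the texts and all `toy_` names are assumptions for illustration; `random_state=0` is only there to make the run repeatable). `fit_transform` returns one row per document and one column per topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the data set
toy_texts = [
    "jobs economy taxes jobs",
    "economy taxes growth",
    "border wall security",
    "security border immigration",
]
toy_counts = CountVectorizer().fit_transform(toy_texts)

# 2 topics and 10 iterations, chosen arbitrarily for the toy example
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=0)
toy_proportions = toy_lda.fit_transform(toy_counts)

# Rows are documents, columns are topics; each row sums to 1
print(toy_proportions.shape)
print(toy_proportions.sum(axis=1))

# Topic-term weights: one row per topic, one column per vocabulary term
print(toy_lda.components_.shape)
```

With so little text the estimated topics are not meaningful; the point is only the shape and meaning of the outputs.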

5. **Fill in the missing code and run it**

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from tqdm.notebook import tqdm

# Initialize the model
lda = LatentDirichletAllocation(____, _____, verbose=1)
# Estimate the parameters
proportions = lda.fit_transform(____)

Let's see what topics we estimated. A convenient way to do this is to create a wordcloud for each topic.

6. **Run the following code.**
  * Do the topic distributions have a clear interpretation?

In [None]:
from matplotlib import pyplot as plt
import numpy as np
from wordcloud import WordCloud

nTerm = 150

# Initialize the subplots
nTop = lda.n_components
fig, axs = plt.subplots(nTop, figsize=(8, 4*nTop))

for k in range(nTop):
  
  # Find top terms for each topic
  sel = np.argsort(-lda.components_[k])[0:nTerm]
  topic = [(vocabulary[i], lda.components_[k, i]) for i in sel]
  topic = dict(topic)

  # Create a wordcloud and plot it
  wordcloud = WordCloud(prefer_horizontal=1).generate_from_frequencies(topic)
  axs[k].set_title("Topic " + str(k))
  axs[k].imshow(wordcloud)
  axs[k].axis("off")

## Task 4: Predict retweets

Can we perhaps use the estimated topics to predict the number of retweets/favorites? Try estimating a regression model.

First, however, it makes sense to transform the retweet counts (the code below takes the logarithm of the count plus one).
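As a small illustration of the log(x + 1) transform, here is a sketch on made-up counts (the numbers and the names `toy_retweets` and `log_retweets` are assumptions, not values from the data set). Adding 1 keeps the logarithm defined for tweets with zero retweets, and the transform compresses very large counts.

```python
import numpy as np

# Hypothetical retweet counts, not taken from the data set
toy_retweets = np.array([0, 9, 99, 9999])

# log(x + 1) is defined at x = 0 and compresses large values
log_retweets = np.log(toy_retweets + 1)

# np.log1p computes the same quantity
same = np.allclose(log_retweets, np.log1p(toy_retweets))

print(log_retweets)
print(same)
```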

7. **Build a model of your choice using the methods we've used for book reviews.**

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm

retweets = np.log(tweetData.retweets + 1)

# Build a suitable model

Once you've selected a suitable model, you can redo the wordclouds and include the estimated coefficient for each topic in the title:
* `axs[k].set_title("Topic " + str(k) + " Coef.: " + "{:.2f}".format(clf.coef_[k]))` (assuming your fitted model is called `clf` and its coefficients are in the same order as the topics)
* You can also try printing the topics in order of their coefficient values.
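One possible way to order the topics by coefficient is `np.argsort`, sketched below on made-up numbers (`toy_coef` is a hypothetical stand-in for the coefficients of your own fitted model, which may be laid out differently):

```python
import numpy as np

# Hypothetical coefficients for 4 topics, standing in for a fitted model's coef_
toy_coef = np.array([0.12, -0.40, 0.75, 0.03])

# Topic indices ordered from the largest to the smallest coefficient
order = np.argsort(-toy_coef)

for k in order:
    print("Topic " + str(k) + " Coef.: " + "{:.2f}".format(toy_coef[k]))
```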