# CPSC 330 hw6

In [None]:
import numpy as np
import pandas as pd
import string

from sklearn.linear_model import Ridge
from sklearn.feature_selection import RFE, RFECV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

For `nltk` and `gensim`, see Lecture 13 for install instructions:

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
from gensim.models import Word2Vec, KeyedVectors, FastText

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.students.cs.ubc.ca/cpsc330-2019w-t2/home/blob/master/docs/homework_instructions.md). 

## The dataset

In this assignment we'll be looking at a [dataset of Donald Trump's tweets](https://www.kaggle.com/austinreese/trump-tweets), which were collected about 6 weeks ago (Jan 20th, 2020). You should start by downloading the dataset. As usual, please do not commit it to your repos.

In [None]:
df = pd.read_csv("trumptweets.csv")

In [None]:
df_train, df_test = train_test_split(df)
df_train.head()

In [None]:
df_train.shape

In this first part of the assignment, we'll try to predict the number of retweets. 

## Exercise 1

The columns in our dataset are as follows:

In [None]:
df_train.columns

#### 1(a)
rubric={points:1}

Our target column is `retweets`. Aside from the target column, there is a column here that we cannot use because it would be "cheating". What is this column? Briefly justify your answer.

#### 1(b)
rubric={points:1}

In addition to the column identified in part (a), there are one or more columns that are useless in our prediction task. What are these column(s)? Briefly justify your answer.

#### 1(c)
rubric={points:10}

Parts (a) and (b) aside, we're actually only going to look at the tweets themselves for this assignment, i.e. only the `content` column. Your tasks:

- Use `CountVectorizer` to create features from the text data and use `Ridge` as your model. 
- You will have to find reasonable hyperparameter values. 
- Use cross-validation on `df_train`. Report your train and cross-validation scores after a bit of tuning. 
- You should log-transform the targets. You can report all your scores as the output of `.score()` for the log-transformed data.
- Don't violate the Golden Rule!

Benchmark: you should be able to achieve a cross-validation $R^2$ score of at least $0.5$ (on the log-transformed data). 
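
To make the moving parts concrete, here is a minimal sketch of the kind of setup described above, run on a tiny made-up dataset rather than the real tweets. The names `toy_tweets` and `toy_retweets` and the hyperparameter values are placeholders chosen for illustration, not recommended settings. Putting `CountVectorizer` inside a `Pipeline` means the vocabulary is re-learned on each training fold, which is one way to respect the Golden Rule.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for df_train["content"] and df_train["retweets"]
toy_tweets = [
    "great rally tonight", "fake news again", "thank you ohio",
    "great great crowd", "news conference at noon", "thank you all",
    "big rally in ohio", "fake polls", "great news today", "thank you",
]
toy_retweets = np.array([120, 900, 45, 300, 10, 60, 150, 700, 400, 30])

# log1p computes log(1 + y), so a tweet with 0 retweets still gets a finite target
y_log = np.log1p(toy_retweets)

# The vectorizer lives inside the pipeline, so it is fit only on the training folds
pipe = Pipeline([
    ("vec", CountVectorizer(max_features=1000)),  # placeholder value
    ("ridge", Ridge(alpha=1.0)),                  # placeholder value
])

cv_results = cross_validate(pipe, toy_tweets, y_log, cv=5, return_train_score=True)
mean_train_r2 = cv_results["train_score"].mean()
mean_cv_r2 = cv_results["test_score"].mean()
print(f"train R^2: {mean_train_r2:.3f}, cross-validation R^2: {mean_cv_r2:.3f}")
```

With only ten toy tweets the scores themselves are meaningless; the point is the structure, which should carry over to `df_train` with a hyperparameter search wrapped around the pipeline.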

#### 1(d)
rubric={points:2}

What are the most important features according to your `Ridge` model from part (c)?

#### 1(e)
rubric={points:4}

In part (c) you set up feature extraction from text to vectors using `CountVectorizer`. You may have tweaked some hyperparameters that control the number of features, such as `max_features`. Next, we will apply feature selection using sklearn's `RFECV`. But first - what is the point of applying feature selection here? If we wanted fewer features, why not just change the hyperparameters of `CountVectorizer`, for example by reducing `max_features`? 

#### 1(f)
rubric={points:5}

Use sklearn's `RFECV` to remove unnecessary features from your approach in part (c). You are welcome to use the `step` hyperparameter for speed.

1. Did your training error go up or down? Is this what you expected?
2. Did your test error go up or down? Is this what you expected?
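
If you have not used `RFECV` before, the sketch below shows its basic mechanics on synthetic numeric data (not the tweet features). The feature names `word_0`, `word_1`, ... are made up, and the values of `step` and `cv` are arbitrary. After fitting, `support_` is a boolean mask over the original features, so negating it gives the features that were removed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

# Synthetic regression problem: 30 features, only 5 of which carry signal
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=1.0, random_state=0
)
# Made-up names standing in for the vocabulary of a CountVectorizer
feature_names = np.array([f"word_{i}" for i in range(X.shape[1])])

# step=5 drops five features per elimination round instead of one, which is faster
selector = RFECV(Ridge(alpha=1.0), step=5, cv=3)
selector.fit(X, y)

kept = feature_names[selector.support_]      # features RFECV decided to keep
removed = feature_names[~selector.support_]  # features RFECV eliminated
print("kept:", len(kept), "removed:", len(removed))
print("some removed features:", removed[:5])
```

In a text pipeline, the feature names would come from the fitted vectorizer rather than from a hand-built array.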

#### 1(g)
rubric={points:2}

Print out a few of the features that were removed by `RFECV`.

## Exercise 2

Let's now try to use pre-trained word embeddings.

As we saw in class, using pre-trained word embeddings is very common in NLP. These embeddings are created by training a model like [Word2vec](https://en.wikipedia.org/wiki/Word2vec) on a huge corpus of text. In this exercise we will explore a [GloVe](https://nlp.stanford.edu/projects/glove/) model which was trained on a **different** Twitter dataset (not the Trump tweets). Note that the GloVe model is case-sensitive and only has representations of lower-case words.

Your tasks are:
- Download [GloVe embeddings for Twitter](http://nlp.stanford.edu/data/glove.twitter.27B.zip). This is a large file (the compressed file is ~1.42 GB). **Please do not commit this file!!** 
- Unzip the downloaded file. For this exercise we'll be using `glove.twitter.27B/glove.twitter.27B.100d.txt`. The file has words and their corresponding embeddings.
- Convert the GloVe embeddings to the Word2vec format using the following command. More details [here](https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/scripts/glove2word2vec/index.html).

```
python -m gensim.scripts.glove2word2vec -i "glove.twitter.27B.100d.txt" -o "glove.twitter.27B.100d.w2v.txt"
```

Then, the following line of code should run (it may take a few minutes).

In [None]:
glove_twitter_model = KeyedVectors.load_word2vec_format('glove.twitter.27B.100d.w2v.txt', binary=False)
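
For context on the conversion step: the Word2vec text format is the same as the GloVe text format except for one extra header line giving the number of words and the vector dimension. The sketch below mimics that on a tiny made-up file. The helper `add_word2vec_header` is hypothetical and written only for illustration; for the real file you should use the `glove2word2vec` command above.

```python
import os
import tempfile

def add_word2vec_header(glove_path, w2v_path):
    """Copy a GloVe text file, prepending the '<n_words> <n_dims>' header line."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()  # fine for a toy file; stream line by line for a large one
    n_words = len(lines)
    n_dims = len(lines[0].split()) - 1  # the first field on each line is the word itself
    with open(w2v_path, "w", encoding="utf-8") as f:
        f.write(f"{n_words} {n_dims}\n")
        f.writelines(lines)
    return n_words, n_dims

# Tiny made-up GloVe-style file: one word per line, followed by its vector
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy_glove.w2v.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("donald 0.1 0.2 0.3\nboss 0.4 0.5 0.6\n")

n_words, n_dims = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)  # 2 3
```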

#### 2(a)
rubric={points:1}

Compare the _vocabulary size_ (number of words) of `glove_twitter_model` and your `CountVectorizer` representation from Exercise 1. Which is larger?


In [None]:
len(glove_twitter_model.vocab)  # gensim 3.x; in gensim 4+ use len(glove_twitter_model.key_to_index)

#### 2(b)
rubric={points:3}

Our pre-trained `glove_twitter_model` gets us a representation of _words_:

In [None]:
glove_twitter_model.get_vector("donald") # equivalent to .transform()

However, we want a representation of _tweets_, which contain multiple words. The most straightforward way to obtain this is by _averaging_ the embeddings of the words in the document. Below we provide a function `get_average_embeddings`, which preprocesses the input text and then returns the average embedding of all the words in it. 

In [None]:
def get_average_embeddings(text, model = glove_twitter_model):
    """
    Returns the average embedding of the given text
    using the given trained model (e.g., Word2vec, FastText). 
    
    Arguments
    ---------
    text : (str) input text 
    model : (gensim.models) model to use to get embeddings
    
    Returns
    -------
    feat_vect : (numpy.ndarray) the average embedding vector of the given text
    """
    n_features = model.vector_size
    
    stop_words = list(set(stopwords.words('english')))
    punctuation = string.punctuation
    stop_words += list(punctuation)
    stop_words.extend(['``','’', '`','br','"',"”", "''", "'s"])        
        
    preprocessed = []    
    tokenized = word_tokenize(text)
    for token in tokenized:
        token = token.lower()
        if token not in stop_words:
            preprocessed.append(token)

    feat_vect = np.zeros(n_features, dtype='float64')
    #index2word_set = set(model.wv.index2word)    
    nwords = 0 

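    for word in preprocessed:
        # Skip tokens that are missing from the model's vocabulary,
        # and only count a word once its vector has been found
        try:
            vector = model[word]
        except KeyError:
            continue
        feat_vect = np.add(feat_vect, vector)
        nwords += 1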
    if nwords == 0:
        nwords = 1
    feat_vect = np.divide(feat_vect, nwords)     
    return feat_vect

In [None]:
get_average_embeddings("My name is the Donald")
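
To see the averaging idea in isolation, here is a stripped-down sketch that uses a made-up dictionary of 3-dimensional vectors in place of the GloVe model. The names `toy_vectors` and `average_known_vectors` are hypothetical, and this is a simplified illustration rather than a replacement for the function above: it skips all preprocessing and averages only over tokens that have a vector.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" standing in for the pre-trained model
toy_vectors = {
    "donald": np.array([1.0, 0.0, 2.0]),
    "name": np.array([3.0, 2.0, 0.0]),
}

def average_known_vectors(tokens, vectors, n_features=3):
    """Average the vectors of the tokens that appear in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No token has a vector: fall back to the zero vector
        return np.zeros(n_features)
    return np.mean(known, axis=0)

# "covfefe" has no vector here, so only "name" and "donald" are averaged
avg = average_known_vectors(["name", "donald", "covfefe"], toy_vectors)
print(avg)  # [2. 1. 1.]
```

Note that the output has 3 entries no matter how many tokens go in, because the result lives in the embedding space rather than in a one-column-per-word space.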

In part (a) you compared the number of words in each vocabulary. Now, compare the _length of the representation_ for this pre-trained model vs. the `CountVectorizer` approach. Briefly discuss. (The pedagogical goal here is for you to distinguish between the number of words and the length of the vector.)

#### 2(c)
rubric={points:5}

In Exercise 1 you used `CountVectorizer` to generate features, which were then fed into a regression model. We can do the same here with the features from the pre-trained model (the code may take a few minutes to run):

In [None]:
X_train_glove_tweets = [get_average_embeddings(tweet) for tweet in df_train["content"]]
X_test_glove_tweets  = [get_average_embeddings(tweet) for tweet in df_test["content"]]

What sort of regression scores can you get with these features instead? Compare with your scores from Exercise 1 and briefly discuss.

Benchmark: even after abandoning `Ridge`, I wasn't able to get $R^2$ scores above $0.5$ on the log-transformed targets (although I got close). Maybe you will do better!

## Exercise 3

Now we're done with trying to predict the number of retweets. Our next task will be trying to find similar tweets to a query tweet using nearest neighbours, like the product recommendations we discussed in Lecture 15.

#### 3(a) 
rubric={points:5}

Using scikit-learn's `NearestNeighbors` on the word count features from `CountVectorizer`, find the 5 most similar tweets to this made-up tweet:

In [None]:
query_tweet = "I am Donald and I am THE BOSS around here #me #sogreat #boss"

Use Euclidean distance and the same `CountVectorizer` hyperparameters you used in Exercise 1.
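
As a reminder of how the API fits together, here is a minimal sketch on four made-up tweets. The variables `toy_tweets` and `toy_query` are placeholders, and the default `CountVectorizer` settings are used only to keep the example short. The vectorizer is fit on the corpus, and the query is then passed through `.transform()` so that it lands in the same feature space.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up corpus standing in for df_train["content"]
toy_tweets = [
    "the boss is here",
    "great rally tonight",
    "i am the boss",
    "thank you ohio",
]
toy_query = "i am the boss around here"

# Learn the vocabulary from the corpus only
vec = CountVectorizer()
X_counts = vec.fit_transform(toy_tweets)

nn = NearestNeighbors(n_neighbors=2, metric="euclidean")
nn.fit(X_counts)

# The query must be transformed with the already-fitted vectorizer
distances, indices = nn.kneighbors(vec.transform([toy_query]))
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {toy_tweets[idx]}")
```

Words in the query that are not in the learned vocabulary (here, "around") are simply ignored by `.transform()`.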

#### 3(b)
rubric={points:2}

Repeat part (a) but using cosine similarity instead of Euclidean distance. 
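
In `NearestNeighbors`, switching metrics is a one-argument change (`metric="cosine"`). The sketch below uses three made-up count vectors to show how the two metrics can rank the same points differently; the arrays `X_toy` and `query` are invented for illustration. Cosine distance depends only on the direction of the vectors, whereas Euclidean distance also depends on their length.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up count vectors: rows 0 and 1 point in the same direction,
# but row 1 is a much "longer" document
X_toy = np.array([
    [1.0, 1.0, 0.0],
    [5.0, 5.0, 0.0],
    [1.0, 0.0, 1.0],
])
query = np.array([[2.0, 2.0, 0.0]])

# Rank all three rows by distance to the query under each metric
nn_euclid = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X_toy)
nn_cosine = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X_toy)

euclid_dist, euclid_idx = nn_euclid.kneighbors(query)
cosine_dist, cosine_idx = nn_cosine.kneighbors(query)

print("Euclidean order:", euclid_idx[0], "distances:", euclid_dist[0].round(3))
print("cosine order:   ", cosine_idx[0], "distances:", cosine_dist[0].round(3))
```

Under Euclidean distance the long row 1 ends up farthest from the query, while under cosine distance it is essentially at distance 0 because it points in the same direction as the query.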

#### 3(c)
rubric={points:5}

In lecture we talked about how Euclidean distance resulted in less popular items being recommended than with cosine similarity. What is the analog of "popularity" here? Are your results from parts (a) and (b) consistent with this notion?

#### 3(d)
rubric={points:3}

Repeat parts (a) and (b) but this time with the pre-trained GloVe embeddings from Exercise 2.

#### 3(e)
rubric={points:1}

Did changing the distance metric from Euclidean to cosine have the same sort of effect with the `CountVectorizer` embeddings as with the pre-trained GloVe embeddings? Briefly discuss. 

Note: this question is quite difficult and I'm not sure I really gave you the necessary tools to answer it. As a compromise, I made it worth only 1 point. If you can't figure it out, I would just move on.

#### 3(f)
rubric={points:2}

Our first approach, using `CountVectorizer` features, should only retrieve similar tweets if they have some words in common with the query tweet. Is this also true for the pre-trained embedding approach? Briefly discuss.

## Exercise 4: Very short answer questions
rubric={points:16}

Each question is worth 2 points.

1. After running feature selection with `RFE`, `rfe.ranking_` tells you the order in which the features were removed. Why could this order be different from the order of the original feature importances, ranked from least to most important?
2. What are 2 differences between using TFIDF vs. calling `StandardScaler` on the output of `CountVectorizer`?
3. True or False (and briefly explain): converting a numpy array to a scipy sparse matrix will always reduce its size in memory.
4. Why is word tokenization hard - can't you just separate words by searching for the space character (" ") and splitting on that?
5. In Lecture 13 we saw our pre-trained word embedding model output an analogy that reinforced a gender stereotype. Give an example of how using such a model could cause harm in the real world.
6. In Lecture 14 we discussed how neural networks are sort of like `Pipeline`s, in the sense that they involve multiple sequential transformations of the data, finally resulting in the prediction. Why was this property useful when it came to transfer learning?
7. In Lecture 14 we talked about [this blog post](https://medium.com/@jrzech/what-are-radiological-deep-learning-models-actually-learning-f97a546c5b98). Why is it a problem that the algorithm learned to detect the word "PORTABLE"? If that helps get a high accuracy, aren't we happy?
8. In Lecture 15 I claimed that supervised KNN is "interpretable". If you used `KNeighborsClassifier(n_neighbors=3)` to make a prediction, how would you explain/interpret the prediction to your boss?