# Lab 04 | PPOL564 
# COMPARING TWEETS

Let's play around with cosine similarity some more. This time we'll use `sklearn`'s implementation of the method.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Data

The data come from [Kaggle](https://www.kaggle.com/benhamner/clinton-trump-tweets). The dataset provides ~3000 tweets from Hillary Clinton and Donald Trump, the two major-party nominees in the 2016 U.S. presidential election.

![](https://www.kaggle.io/svf/377009/a6e6d9eeb0a7158b8f31498b1274c30b/clinton_vs_trump_retweets_and_favorites.png)

In [2]:
tweets = pd.read_csv("tweets.csv")
tweets.head()

| | handle | time | text |
|---|---|---|---|
| 0 | HillaryClinton | 2016-09-28T00:22:34 | The question in this election: Who can put the... |
| 1 | HillaryClinton | 2016-09-27T23:08:41 | If we stand together, there's nothing we can't... |
| 2 | HillaryClinton | 2016-09-27T22:30:27 | Both candidates were asked about how they'd co... |
| 3 | realDonaldTrump | 2016-09-27T22:13:24 | Join me for a 3pm rally - tomorrow at the Mid-... |
| 4 | HillaryClinton | 2016-09-27T21:35:28 | This election is too important to sit out. Go ... |


## Clean Text

Let's focus on Hillary Clinton's tweets. `sklearn`'s vectorizer expects a collection of strings, so we pull the `text` column out as a plain Python list.

In [3]:
hillary_tweets = tweets.query('handle == "HillaryClinton"').text.values.tolist()

# Print some of the tweets
hillary_tweets[:3]

['The question in this election: Who can put the plans into action that will make your life better? https://t.co/XreEY9OicG',
 "If we stand together, there's nothing we can't do. \n\nMake sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy",
 "Both candidates were asked about how they'd confront racial injustice. Only one had a real answer. https://t.co/sjnEokckis"]

In [4]:
# Clean URLs from the text
hillary_tweets = [re.sub(r'http\S+', '', tweet).strip() for tweet in hillary_tweets]

# Clean digits from the text
hillary_tweets = [re.sub(r'\d+', '', tweet).strip() for tweet in hillary_tweets]

# Print some of the tweets
hillary_tweets[:3]

['The question in this election: Who can put the plans into action that will make your life better?',
 "If we stand together, there's nothing we can't do. \n\nMake sure you're ready to vote:",
 "Both candidates were asked about how they'd confront racial injustice. Only one had a real answer."]
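
The two substitutions above can be wrapped into a single helper so the same cleaning can be reused on other text. This is only a minimal sketch: the name `clean_tweet`, the example string, and the extra whitespace-collapsing step are assumptions for illustration and are not part of the lab's code.

```python
import re

def clean_tweet(text):
    """Strip URLs and digits, then tidy whitespace (hypothetical helper)."""
    text = re.sub(r'http\S+', '', text)  # drop URLs, as in the cell above
    text = re.sub(r'\d+', '', text)      # drop digits, as in the cell above
    text = re.sub(r'\s+', ' ', text)     # assumption: also collapse newlines and repeated spaces
    return text.strip()

# Made-up example string (not taken from the dataset)
example = "Make sure you're ready to vote:\n\nhttps://t.co/abc123 by Nov 8"
cleaned = clean_tweet(example)
print(cleaned)
```

Note that the lab's own cleaning leaves embedded newlines such as `\n\n` in place; collapsing them is a design choice and does not affect the word counts produced later.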

## Create the Document Term Matrix

This time we'll use `sklearn` to generate the document term matrix rather than build it from scratch.

In [5]:
# instantiate the object with auto english stop word detection
count_vectorizer = CountVectorizer(stop_words='english')

# "fit" the model (i.e. count the number of times words appear in each tweet)
sparse_dtm = count_vectorizer.fit_transform(hillary_tweets)

# Sparse document term matrix 
sparse_dtm 

<2629x4631 sparse matrix of type '<class 'numpy.int64'>'
	with 22385 stored elements in Compressed Sparse Row format>

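`CountVectorizer()` returns the document term matrix as a sparse matrix, which stores only the non-zero entries. This saves a lot of memory when most entries are zero, as they are here: each tweet uses only a handful of the 4,631 words in the vocabulary.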
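
To get a feel for how sparse such a matrix is, we can compare the number of stored entries to the total number of cells. In the output above, 22,385 of 2,629 × 4,631 cells are stored, which is roughly 0.18%. The sketch below does the same calculation on three made-up documents; the documents and variable names are assumptions for illustration, and stop-word removal is left off to keep the toy vocabulary predictable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for illustration, not from the tweet data)
docs = ["vote vote today", "ready vote", "jobs plan"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # SciPy sparse matrix, one row per document

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # share of cells that are actually stored

print(sorted(vec.vocabulary_))       # the learned vocabulary
print(X.toarray())                   # a dense view is fine at this size
print(f"{X.nnz} stored of {n_rows * n_cols} cells ({density:.1%})")
```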

But note that we can convert this back to the document term matrix that we're more familiar with. 

In [6]:
# Turn the sparse matrix into a dense matrix
dtm = sparse_dtm.todense()

# Convert to pandas data frame. Set the columns to the words (as we did before).
# Note: get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead.
hillary_tweets_dtm = pd.DataFrame(dtm, columns=count_vectorizer.get_feature_names_out())
hillary_tweets_dtm.head()

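| | ________ | _bxddxss | abandon | abandoned | abbott | abierta | abilities | ability | able | abolish | ... | youtube | zandi | zero | zika | zip | zones | él | única | únicos | ⁰⁰ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |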


## Refresh on cosine similarity 

Recall from class how we computed the cosine between two vectors:

In [7]:
a = np.array([1,2])
b = np.array([2,1])

# Manually calculate the cosine (as we did in class)
(a.dot(b)/(np.linalg.norm(a)*np.linalg.norm(b))).round(1)

0.8
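
Written out by hand for these two vectors, with $\theta$ denoting the angle between $a$ and $b$ (the symbol is introduced here only for illustration), the same calculation is:

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{(1)(2) + (2)(1)}{\sqrt{1^2 + 2^2}\,\sqrt{2^2 + 1^2}}
= \frac{4}{\sqrt{5}\,\sqrt{5}}
= \frac{4}{5} = 0.8
$$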

Let's now gut-check the `sklearn` implementation.

In [8]:
# Compute Cosine Similarity using sklearn
M = np.vstack([a,b]) # Input must be a matrix
M

array([[1, 2],
       [2, 1]])

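The output of sklearn's `cosine_similarity` is laid out like a correlation matrix, but with cosines: entry (i, j) is the cosine between row i and row j of the input, so the diagonal is always 1.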

In [9]:
cosine_similarity(M,M)

array([[1. , 0.8],
       [0.8, 1. ]])
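
One way to see where this matrix comes from: if every row is first scaled to unit length, the dot product between any two rows is exactly their cosine, so a single matrix product yields all pairwise cosines at once. The sketch below checks this against `sklearn` on the same toy matrix; the variable names are assumptions for illustration, and this is not a claim about how `sklearn` is implemented internally.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

M = np.array([[1, 2],
              [2, 1]])

# Scale each row to unit length; dot products of unit vectors are cosines
row_norms = np.linalg.norm(M, axis=1, keepdims=True)
M_unit = M / row_norms
manual = M_unit @ M_unit.T

print(manual.round(1))
print(np.allclose(manual, cosine_similarity(M)))  # compare with sklearn's result
```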

## Similarity between Hillary Clinton's Tweets

Let's now take this idea and apply it to all of Hillary Clinton's tweets.

In [10]:
cos_mat = cosine_similarity(hillary_tweets_dtm,hillary_tweets_dtm)
cos_mat

array([[1.        , 0.16903085, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.16903085, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [11]:
# Drop the diagonal (because of course a document has perfect similarity with itself)
np.fill_diagonal(cos_mat, np.nan) # let's set these all to missing

Which tweets are the most similar?

In [14]:
# Let's look at the similarity between tweets by setting some arbitrary threshold

cosine_sim_threshold = .8 # Here the threshold is a cosine of exactly .80 (so pretty similar!)

# Scan through the cosine similarity matrix and find the row, col indices 
# for those entries that meet the threshold. (Note we're rounding to two decimals 
# so that values can match the threshold exactly.)
highly_similar_tweets = np.where(cos_mat.round(2) == cosine_sim_threshold) 

# How many tweet pairs are we talking about? The matrix is symmetric, so each
# pair shows up twice, as (i, j) and (j, i); hence the division by 2.
n_pairs = len(highly_similar_tweets[0])/2

# Let's print off a little message
print(f"{'----'*25}\n\t\t\tThere are {n_pairs} tweet pairs that share a cosine similarity of {cosine_sim_threshold}\n{'----'*25}")

# Let's loop through the indices and print out the similar pairs. 
for i in range(len(highly_similar_tweets[0])):
    pos1 = highly_similar_tweets[0][i] # Move through the row positions
    pos2 = highly_similar_tweets[1][i] # move through the column positions

    print(f'''
        {hillary_tweets[pos1]}
        
        {hillary_tweets[pos2]}
        
        ----
    ''')

----------------------------------------------------------------------------------------------------
			There are 7.0 tweet pairs that share a cosine similarity of 0.8
----------------------------------------------------------------------------------------------------

        “The bottom line is that we can not afford suddenly to treat this like a reality show.” —@POTUS
        
        "The bottom line is that we cannot afford suddenly to treat this like a reality show.” —@POTUS on the media’s coverage of Donald Trump
        
        ----
    

        "The bottom line is that we cannot afford suddenly to treat this like a reality show.” —@POTUS on the media’s coverage of Donald Trump
        
        “The bottom line is that we can not afford suddenly to treat this like a reality show.” —@POTUS
        
        ----
    

        "I believe there has never been a man or woman more qualified than Hillary Clinton to serve as our president." —@POTUS
        
        “There has never b
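
Because the similarity matrix is symmetric, the loop above prints every pair twice, once as (i, j) and once as (j, i). One way to list each pair only once is to look only at the entries above the diagonal. The sketch below illustrates the idea on a small made-up matrix that stands in for `cos_mat`; the names `sim` and `unique_pairs`, the values, and the switch from "exactly equal to" to "at least" the threshold are all assumptions for illustration.

```python
import numpy as np

# Toy similarity matrix standing in for cos_mat (values are made up)
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.9],
                [0.1, 0.9, 1.0]])

threshold = 0.8

# k=1 skips the diagonal, so self-pairs and mirrored duplicates are excluded
rows, cols = np.triu_indices_from(sim, k=1)
keep = sim[rows, cols] >= threshold  # assumption: "at least" rather than "exactly" the threshold

unique_pairs = [(int(r), int(c)) for r, c in zip(rows[keep], cols[keep])]
print(unique_pairs)
```

Since the diagonal is never visited, this variant also works without first filling the diagonal with `np.nan`.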

# Now you try with Trump's Tweets!