# U4 L5 P1 - Unsupervised Learning Capstone

**Emile Badran - 24/July/2018**

In this project, we'll conduct Latent Semantic Analysis of a data set consisting of 1.6 million tweets. The goal is to predict a hashtag based on the contents of a tweet.

We'll start by creating a "bag of words" and "TF-IDF" feature sets and running supervised machine learning models to predict a tweet's hashtag. We'll then use unsupervised clustering techniques to see whether they can consistently group tweets with the same hashtags into the same cluster.

Finally, we'll calculate the cosine similarity of tweets and see if the most similar tweets have the same hashtag.

### Acknowledgements

We'll use a Kaggle data set consisting of 1.6 million tweets, labeled as [0 = negative, 2 = neutral, 4 = positive].

More information about the data set can be found [here](http://help.sentiment140.com/for-students/). The original research paper can be found [here](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

Link to download the data set:
https://www.kaggle.com/kazanova/sentiment140/home

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 1000)
import re
import nltk
from nltk.corpus import stopwords
from nltk.corpus import gutenberg
import spacy
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report

In [2]:
# Instantiate the SpaCy module
nlp = spacy.load('en')

In [3]:
# Read the data set file into a data frame
twDF = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin1',
                       usecols=[5], names=['text'])

print(twDF.shape)
twDF.head(3)

(1600000, 1)


Unnamed: 0,text
0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds


### Find the top hashtags

Not all tweets have hashtags. We'll start by finding the top five hashtags, cleaning the data set, and creating our features.

In [4]:
# Convert tweets to lowercase
twDF.text = twDF.text.str.lower()

# Split tweets on whitespace into lists of tokens (punctuation stays attached to words). We'll need this to find the top hashtags
twDF['splitted'] = twDF.text.apply(lambda x: x.split())

# Create a list of words from all tweets
wordlist = []
twDF.splitted.apply(lambda x: [wordlist.append(i) for i in x])

# Convert lists to more efficient numpy arrays
word_array = np.asarray(wordlist)

# Delete those lists to free up memory
del wordlist

In [5]:
# Find the most common hashtags
top_hashtags = Counter([word for word in word_array if word.startswith('#')]).most_common(11)
top_hashtags

[('#followfriday', 2288),
 ('#fb', 1765),
 ('#squarespace', 867),
 ('#ff', 822),
 ('#seb-day', 498),
 ('#iranelection', 485),
 ('#', 472),
 ('#musicmonday', 397),
 ('#1', 391),
 ('#fail', 343),
 ('#asot400', 320)]

The #followfriday and #ff hashtags are the same. The other four hashtags that we'll try to predict are #squarespace, #iranelection, #musicmonday and #asot400.

In [12]:
# Instantiate a list of top hashtags while removing those that are too generic
top_tags = ['#ff', '#squarespace', '#iranelection', '#musicmonday', '#asot400']

# Copy tweeted hashtags (when they exist) to a separate column
twDF['hashtags'] = twDF.text.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

# Extract the first hashtag from the lists of hashtags
twDF.hashtags = twDF.hashtags.apply(lambda x: x[0] if x!=[] else '')

# Convert #followfriday tags to #ff, as they are the same
twDF.hashtags = twDF.hashtags.apply(lambda x: '#ff' if x == '#followfriday' else x)

# Filter a dataframe with only tweets with the top hashtags
twDF = twDF[twDF.hashtags.isin(top_tags)][['text','hashtags', 'splitted']]
twDF.reset_index(inplace=True, drop=True)

# Inspect the frequency of the most common hashtags
twDF.hashtags.value_counts()

#ff              2985
#squarespace      822
#iranelection     436
#musicmonday      387
#asot400          316
Name: hashtags, dtype: int64
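To make the hashtag pattern easier to follow, here is a minimal standalone sketch that applies the same regular expression with Python's `re` module. The helper name `first_hashtag` and the sample tweets are made up for illustration; they are not part of the data set.

```python
import re

# Same pattern as in the cell above: a '#' that follows whitespace or the
# start of the string, then everything up to the next whitespace or the end.
HASHTAG_PATTERN = r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)'

def first_hashtag(tweet):
    """Return the first hashtag in a tweet, or '' if there is none."""
    tags = re.findall(HASHTAG_PATTERN, tweet)
    return tags[0] if tags else ''

# Hypothetical tweets, for illustration only
print(first_hashtag('great set tonight #asot400 #trance'))
print(first_hashtag('#followfriday @someone'))
print(first_hashtag('ticket a#1 is sold'))
```

A `#` in the middle of a word (as in the third tweet) is not treated as a hashtag, because the pattern requires whitespace or the start of the string right before it.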

We'll also create a separate feature containing the tweets **without their hashtags**, and another feature with their numerical classes:

1. #asot400 
2. #musicmonday
3. #ff
4. #squarespace
5. #iranelection
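The class numbers follow the order in which each hashtag first appears in the data frame, which is how `pd.factorize` assigns codes when sorting is left off. The plain-Python sketch below mimics that behavior; `factorize_in_order` is a hypothetical helper written for this illustration, not a pandas function.

```python
def factorize_in_order(labels):
    """Assign integer codes starting at 1, in order of first appearance."""
    codes = {}
    result = []
    for label in labels:
        if label not in codes:
            codes[label] = len(codes) + 1
        result.append(codes[label])
    return result, codes

# Hypothetical sequence of hashtags, for illustration only
classes, mapping = factorize_in_order(['#asot400', '#musicmonday', '#asot400', '#ff'])
print(classes)
print(mapping)
```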

In [15]:
# Remove the hashtags from the tweets and store the remaining words in a separate column
twDF['unhashed'] = twDF.text.replace(to_replace=r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)',
                                                     value='', regex=True)

# Create a column with numerical hashtag classes
twDF['hash_classes'] = pd.factorize(twDF['hashtags'])[0] + 1

# Remove newlines and other extra whitespace by splitting and rejoining
twDF['tokenized'] = twDF.unhashed.apply(lambda x: ' '.join(x.split()))

# Create SpaCy tokens from words
twDF['tokenized'] = twDF.tokenized.apply(lambda x: nlp(x))
twDF.head(1)

Unnamed: 0,text,hashtags,splitted,unhashed,hash_classes,tokenized
0,i wish i could watch the video feed...but the buffering sucks! #asot400,#asot400,"[i, wish, i, could, watch, the, video, feed...but, the, buffering, sucks!, #asot400]",i wish i could watch the video feed...but the buffering sucks!,1,"(i, wish, i, could, watch, the, video, feed, ..., but, the, buffering, sucks, !)"


Our resulting dataset has nearly five thousand tweets.

In [16]:
twDF.shape

(4946, 6)

## Bag-of-words and parts of speech model

Words are converted to SpaCy tokens to create the bag-of-words model with the 800 most common words. SpaCy helps us filter punctuation marks and stop words, which don't carry much information. Also, words are converted to their "lemmas" (their dictionary base forms) before they're counted.
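As a rough illustration of the idea, the sketch below builds a small bag-of-words matrix with `collections.Counter`. It assumes the tweets are already lists of lemmas with punctuation and stop words removed; the function name `bag_of_words` and the toy tweets are hypothetical.

```python
from collections import Counter

def bag_of_words(tweets, vocab_size):
    """Count occurrences of the most common words in each tweet.

    `tweets` is a list of token lists (already lemmatized and filtered).
    Returns the vocabulary and one count vector per tweet.
    """
    totals = Counter(token for tweet in tweets for token in tweet)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for tweet in tweets:
        row = [0] * len(vocab)
        for token in tweet:
            if token in index:
                row[index[token]] += 1
        vectors.append(row)
    return vocab, vectors

# Toy data, for illustration only
tweets = [['thank', 'follow'], ['love', 'song', 'love'], ['thank', 'love']]
vocab, vectors = bag_of_words(tweets, vocab_size=2)
print(vocab)
print(vectors)
```

Looking words up in a dictionary keeps each lookup cheap, which may matter when the vocabulary has hundreds of entries.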

In [17]:
# Create an array of tokens from all tweets
tokenlist = []
twDF.tokenized.apply(lambda x: [tokenlist.append(i) for i in x])
token_array = np.asarray(tokenlist)
del tokenlist

In [18]:
# Instantiate a counter dictionary with the most common words
top_words = Counter([token.lemma_ for token in token_array
             if not token.is_punct
             and not token.is_stop]).most_common(800)

# Extract the most common words into a list
common_words = [item[0] for item in top_words]
common_words[:5]

['thank', 'be', 'love', 'not', 'follow']

In [19]:
# Create columns for every top word
for i in common_words:
    twDF[i] = 0

# Create wordcount features in the data frame
# Process each row, counting the occurrence of words in each tweet
for i, sentence in enumerate(twDF.tokenized):

    # Convert the sentence to lemmas, then filter out punctuation, stop words, and uncommon words
    words = [token.lemma_
             for token in sentence
             if (not token.is_punct
                 and not token.is_stop
                 and token.lemma_ in common_words)]

    # Populate the row with word counts
    for word in words:
        twDF.loc[i, word] += 1

    # This counter is just to make sure the kernel didn't hang
    if i % 800 == 0:
        print("Processing row {}".format(i))

Processing row 0
Processing row 800
Processing row 1600
Processing row 2400
Processing row 3200
Processing row 4000
Processing row 4800


We'll also create a series of features that count the "Parts of Speech" in each tweet (e.g., the number of nouns, verbs, and numerals).

In [20]:
# Add columns with empty values for parts of speech types
pos_types = ['PROPN', 'ADV', 'NOUN', 'ADJ', 'VERB', 'CCONJ', 'PRON', 'NUM',
        'X', 'INTJ', 'DET', 'ADP', 'PUNCT', 'PART', 'SYM', 'SPACE']
for i in pos_types:
    twDF[i] = 0

In [21]:
# Create POS count features in the data frame
# Process each row, counting the occurrence of parts of speech in each tweet
for i, sentence in enumerate(twDF.tokenized):

    # Extract the part-of-speech tag of every token in the tweet
    POSs = [token.pos_ for token in sentence]

    # Populate the row with part-of-speech counts
    for POS in POSs:
        twDF.loc[i, POS] += 1

    # This counter is just to make sure the kernel didn't hang
    if i % 800 == 0:
        print("Processing row {}".format(i))

Processing row 0
Processing row 800
Processing row 1600
Processing row 2400
Processing row 3200
Processing row 4000
Processing row 4800


In [22]:
twDF.head(1)

Unnamed: 0,text,hashtags,splitted,unhashed,hash_classes,tokenized,thank,be,love,not,...,PRON,NUM,X,INTJ,DET,ADP,PUNCT,PART,SYM,SPACE
0,i wish i could watch the video feed...but the buffering sucks! #asot400,#asot400,"[i, wish, i, could, watch, the, video, feed...but, the, buffering, sucks!, #asot400]",i wish i could watch the video feed...but the buffering sucks!,1,"(i, wish, i, could, watch, the, video, feed, ..., but, the, buffering, sucks, !)",0,0,0,0,...,2,0,0,0,2,0,2,0,0,0


Now our data frame has 822 columns: the 6 original columns plus 816 features (800 word counts and 16 part-of-speech counts).

In [23]:
twDF.shape

(4946, 822)

### Testing the BOW + POS model

For our first attempt at predicting hashtags, we'll run a few supervised machine learning techniques:

- Random Forest
- Logistic Regression
- K-Nearest Neighbors
- Gradient Boosting

In [24]:
# Split the data set to train and test samples
Y = twDF.hash_classes
X = twDF.iloc[:,6:]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

In [25]:
# Instantiate and train the Random Forest Classifier model
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9871924502864846

Cross validation test scores: [0.70953101 0.71363636 0.69604863]


In [26]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression()
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.8955173576002696

Cross validation test scores: [0.75189107 0.74393939 0.74620061]


In [27]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.8500168520390967

Cross validation test scores: [0.7397882  0.73636364 0.70364742]


In [28]:
# Instantiate and train the Nearest Neighbor model
knn = KNeighborsClassifier(n_neighbors=20)
train = knn.fit(X_train, y_train)

# Inspect the results
print('Training set score:', knn.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.6521739130434783

Cross validation test scores: [0.61270802 0.60909091 0.60334347]


### Model tuning

We'll tune the top scoring model (in this case, Logistic Regression) by randomly testing several parameter combinations.

In [31]:
# Define parameter combinations
params = {'C': [.1, 1, 10],
          'solver': ['liblinear', 'newton-cg', 'saga'],
          'warm_start': [True, False]}

# Instantiate RandomizedSearchCV to test all possible parameter combinations 
random_CV = RandomizedSearchCV(lr, params, n_iter=18, cv=3, scoring='f1_micro')

# Run RandomizedSearchCV
random_CV.fit(X_train, y_train)

# Inspect the best parameter combination
random_CV.best_params_

{'C': 1, 'solver': 'liblinear', 'warm_start': True}

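It turns out that the best *C* and *solver* values are also the defaults in the scikit-learn version used here. The *warm_start* setting has no effect with the liblinear solver, so the tuned model is equivalent to the default one.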
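The parameter grid above contains 3 x 3 x 2 = 18 combinations, which equals `n_iter=18`. Since every parameter is given as a list, the randomized search samples without replacement and ends up trying each combination once. The short sketch below simply enumerates the grid to confirm the count; the variable name `combinations` is ours.

```python
from itertools import product

params = {'C': [.1, 1, 10],
          'solver': ['liblinear', 'newton-cg', 'saga'],
          'warm_start': [True, False]}

# Build one dictionary per parameter combination
combinations = [dict(zip(params, values)) for values in product(*params.values())]

print(len(combinations))
print(combinations[0])
```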

In [32]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression(C=1, solver='liblinear', warm_start=True)
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.8955173576002696

Cross validation test scores: [0.75189107 0.74393939 0.74620061]


Since we have five classes, we'll generate a confusion matrix to visualize whether hashtags have been consistently predicted:

In [33]:
# Test and inspect the results from the top performing model
y_pred = lr.predict(X_test)

# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_test)

print('Confusion Matrix | Logistic Regression Predictions')

ct

Confusion Matrix | Logistic Regression Predictions


| Predicted \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 42 | 3 | 3 | 7 | 7 |
| 2 | 12 | 73 | 9 | 4 | 2 |
| 3 | 52 | 74 | 1136 | 86 | 48 |
| 4 | 21 | 18 | 31 | 213 | 17 |
| 5 | 5 | 4 | 6 | 10 | 96 |


We'll also view selected classification scoring methods:

- **Precision**: The ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Precision is the classifier's ability not to assign a class to samples that belong to another class. The best value is 1 and the worst value is 0.

- **Recall**: The ratio tp / (tp + fn), where fn is the number of false negatives. Recall is the classifier's ability to find all the samples of each class. The best value is 1 and the worst value is 0.

- **f1-Score**: The harmonic mean of precision and recall, which reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to it: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. In the multi-class case, the "avg / total" row of the report is the average of the per-class scores, weighted by the number of samples in each class.
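These definitions can be checked by hand against the Logistic Regression confusion matrix shown earlier, where rows are predicted classes and columns are actual classes. The sketch below is a plain-Python illustration; the helper `class_scores` is a name chosen here, not a scikit-learn function.

```python
# Confusion matrix from the Logistic Regression output above:
# rows are predicted classes 1-5, columns are actual classes 1-5.
confusion = [
    [42,  3,    3,   7,  7],
    [12, 73,    9,   4,  2],
    [52, 74, 1136,  86, 48],
    [21, 18,   31, 213, 17],
    [ 5,  4,    6,  10, 96],
]

def class_scores(matrix, k):
    """Return precision, recall and F1 for the class at index k."""
    tp = matrix[k][k]
    predicted = sum(matrix[k])               # tp + fp (row total)
    actual = sum(row[k] for row in matrix)   # tp + fn (column total)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(5):
    p, r, f = class_scores(confusion, k)
    print('class {}: precision={:.2f} recall={:.2f} f1={:.2f}'.format(k + 1, p, r, f))
```

For class 1, for example, only 42 of the 132 actual samples were found, which gives a recall of about 0.32.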

In [34]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          1       0.68      0.32      0.43       132
          2       0.73      0.42      0.54       172
          3       0.81      0.96      0.88      1185
          4       0.71      0.67      0.69       320
          5       0.79      0.56      0.66       170

avg / total       0.78      0.79      0.77      1979



The Logistic Regression model was able to adequately predict the two classes with the greatest number of samples (class3 = #ff | class4 = #squarespace).

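The model had low recall scores for classes 1 (#asot400) and 2 (#musicmonday), the two hashtags with the fewest tweets in the data set. In other words, it under-predicted those classes: many of their tweets were misclassified, mostly as class 3 (#ff).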

## Clustering with BOW

We'll now use three clustering methods and see if they can group tweets with the same hashtags.

### K-Means

K-means is a method to cluster data points with similar variances. The algorithm tries to choose means (called centroids) that minimize inertia. The formula for inertia is:

$\sum(\bar{x}_c - x_i)^2$

Inertia is the sum of the squared differences between the centroid of a cluster (the mean $\bar{x}_c$) and the data points in the cluster ($x_i$).  The goal is to define cluster means so that the distance between a cluster mean and all the data points within the cluster is as small as possible.
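As a small worked example of the formula, the sketch below computes inertia for a toy set of 2-D points under two different cluster assignments. The function name `inertia` and the points are illustrative assumptions, not part of the notebook's pipeline.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total

# Toy points, for illustration only
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
tight = inertia(points, [0, 0, 1, 1])   # nearby points share a cluster
mixed = inertia(points, [0, 1, 0, 1])   # distant points share a cluster
print(tight, mixed)
```

The assignment that groups nearby points has the much lower inertia, which is the quantity k-means tries to minimize.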

In [35]:
# Instantiate and fit the K-Means clustering model
y_pred = KMeans(n_clusters=5).fit_predict(X_train)

In [90]:
# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_train)

ct


| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 110 | 96 | 880 | 244 | 83 |
| 1 | 27 | 34 | 305 | 156 | 90 |
| 2 | 0 | 0 | 2 | 0 | 0 |
| 3 | 25 | 56 | 440 | 69 | 70 |
| 4 | 22 | 29 | 173 | 33 | 23 |


There's really no consistency between k-means clusters and hashtags. 
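One common way to put a number on this is cluster purity: the share of samples that belong to the majority class of their cluster. The sketch below computes it from the k-means crosstab above and compares it with the share of the largest class. This metric is our addition for illustration and was not part of the original analysis.

```python
# K-means crosstab from above: rows are clusters 0-4, columns are hashtag classes 1-5
crosstab = [
    [110, 96, 880, 244, 83],
    [27, 34, 305, 156, 90],
    [0, 0, 2, 0, 0],
    [25, 56, 440, 69, 70],
    [22, 29, 173, 33, 23],
]

total = sum(sum(row) for row in crosstab)

# Purity: each cluster is credited with its most frequent class
purity = sum(max(row) for row in crosstab) / total

# Baseline: share of the single largest class across all clusters
class_totals = [sum(row[k] for row in crosstab) for k in range(5)]
majority_share = max(class_totals) / total

print(round(purity, 3), round(majority_share, 3))
```

Class 3 is the majority in every cluster, so the purity is exactly the share of class 3 in the training set. The clusters do no better than labeling every tweet #ff.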

### Mean Shift

Mean shift is a clustering method that works by first calculating the probability that a data point will be present at any point in the n-dimensional space defined by the number of features. The surface of probabilities is called a kernel density surface.

A kernel function $K(x_i - x)$ is used to determine the weight of nearby points. The weighted mean of the density in the window determined by K is:

<img src='Screen Shot 2018-07-23 at 9.22.43 PM.png' width=250>

Mean-shift is an iterative algorithm. At each iteration, each data point is shifted toward the nearest group of points (or cluster means). If a data point is already closest to its cluster mean, it stays where it is.

Once all data points have reached the point where they are at their nearest mean, and all further shifts (if any) are smaller than a given threshold, the algorithm terminates. The data points are then assigned a "cluster" based on their positions.
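The iteration can be sketched in one dimension with a flat kernel, where every point inside the window gets the same weight. This is a simplified illustration under those assumptions, not scikit-learn's implementation; `shift_to_mode` and the sample points are hypothetical.

```python
def shift_to_mode(start, points, bandwidth, tol=1e-6, max_iter=100):
    """Repeatedly move to the mean of the points within `bandwidth`
    of the current position (flat kernel) until the shift is below `tol`."""
    x = start
    for _ in range(max_iter):
        window = [p for p in points if abs(p - x) <= bandwidth]
        new_x = sum(window) / len(window)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Toy 1-D data with two obvious groups, for illustration only
points = [1.0, 2.0, 3.0, 10.0, 11.0]
modes = [round(shift_to_mode(p, points, bandwidth=2.5), 3) for p in points]
print(modes)
```

Points that converge to the same position are assigned to the same cluster, so this toy data set ends up with two clusters.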

In [97]:
# Declare and fit the model
ms = MeanShift(bin_seeding=True)
ms.fit(X_train)

# Extract cluster assignments for each data point
labels = ms.labels_

# Coordinates of the cluster centers
cluster_centers = ms.cluster_centers_

# Count our clusters
n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters: {}".format(n_clusters_))

Number of estimated clusters: 9


In [98]:
y_pred = ms.predict(X_test)

# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_test)

ct

| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 128 | 158 | 1134 | 292 | 154 |
| 1 | 0 | 2 | 6 | 2 | 3 |
| 3 | 2 | 4 | 17 | 13 | 6 |
| 4 | 1 | 2 | 6 | 6 | 3 |
| 5 | 1 | 6 | 21 | 7 | 4 |
| 6 | 0 | 0 | 1 | 0 | 0 |


Mean shift estimated nine clusters on the training set, but the test tweets fell into only six of them, and almost all landed in one large cluster.

### Affinity Propagation

Affinity Propagation is based on defining exemplars for data points. An exemplar is a data point similar enough to another data point that one could conceivably be represented by the other. Similarity between points is interpreted to mean how well-suited a given data point is to be an exemplar of another data point (and vice-versa).

In [38]:
# Declare the model and fit it in one statement
af = AffinityPropagation().fit(X_train)

# Pull the number of clusters and cluster assignments for each data point
cluster_centers_indices = af.cluster_centers_indices_
n_clusters_ = len(cluster_centers_indices)
labels = af.labels_

print('Estimated number of clusters: {}'.format(n_clusters_))

Estimated number of clusters: 157


Affinity propagation estimated 157 clusters, far more than our five hashtags, and was incapable of grouping tweets with the same hashtag.

## Singular Value Decomposition (SVD)

Similar to PCA, SVD is a mathematical method that is used to reduce the number of dimensions (or features) in a model.

We'll reduce the number of features from 816 (800 word counts plus 16 part-of-speech counts) to 150, and run the Gradient Boosting model again to see if it brings a considerable change in model accuracy, and especially if it helps improve our clusters.

In [104]:
# Reduce the number of variables with SVD
svd = TruncatedSVD(150)
lsa = make_pipeline(svd, Normalizer(copy=False))

# Declare and fit SVD
X_lsa = lsa.fit_transform(twDF.iloc[:,6:])

# Inspect the variance explained by the resulting SVD components
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

Percent variance captured by all components: 94.87921970404707


In [105]:
# Preview the shape of the current data frame
print('Data frame with all features:', twDF.shape)

# Store the SVD components in a data frame indexed by hashtag and hashtag class
df_lsa = pd.DataFrame(X_lsa, index=[twDF.hashtags, twDF.hash_classes])

# Inspect the resulting dataframe shape
print('Data frame after SVD:', df_lsa.shape)

Data frame with all features: (4946, 822)
Data frame after SVD: (4946, 150)


In [106]:
# Split the data set to train and test samples
Y = twDF.hash_classes
X_train, X_test, y_train, y_test = train_test_split(df_lsa, Y, test_size=0.4, random_state=0)

In [107]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier(loss='deviance', n_estimators=100, warm_start=True,
                                         max_depth=5)
train = clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9976407145264578

Cross validation test scores: [0.71860817 0.7030303  0.7112462 ]


In [108]:
# Check the solution against the data.
print('Comparing classes:')
print(classification_report(y_test, y_pred))

Comparing classes:
             precision    recall  f1-score   support

          1       0.36      0.16      0.22       132
          2       0.72      0.28      0.41       172
          3       0.76      0.96      0.85      1185
          4       0.74      0.61      0.67       320
          5       0.79      0.40      0.53       170

avg / total       0.73      0.74      0.71      1979



As expected, SVD was able to reduce the number of dimensions in our model without losing much variance. The accuracy of gradient boosting predictions dropped by only approximately two percentage points.

### Clustering with SVD

Now let's see what matters most - if SVD helps improve our clusters. We'll also do some parameter tuning.

### K-Means

We've increased the *n_init* parameter, which sets how many times the k-means algorithm is run with different centroid seeds. The model keeps the run with the lowest inertia, that is, the smallest within-cluster sum of squared distances. The maximum number of iterations per run (*max_iter*) was increased too.

In [120]:
# Declare and fit the model
y_pred = KMeans(n_clusters=5, n_init=200, max_iter=1500).fit_predict(X_train)

In [121]:
# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_train)

ct

| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 16 | 53 | 497 | 32 | 50 |
| 1 | 49 | 35 | 307 | 198 | 79 |
| 2 | 42 | 51 | 308 | 70 | 38 |
| 3 | 26 | 43 | 301 | 67 | 38 |
| 4 | 51 | 33 | 387 | 135 | 61 |


In [110]:
# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_train)

ct

| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 48 | 29 | 376 | 116 | 56 |
| 1 | 42 | 52 | 310 | 71 | 38 |
| 2 | 28 | 42 | 312 | 74 | 39 |
| 3 | 14 | 53 | 487 | 30 | 50 |
| 4 | 52 | 39 | 315 | 211 | 83 |


Despite the reduced number of variables and model parameter tuning, K-Means continues to be unable to distinguish clusters with similar hashtags.

### Mean Shift
There aren't many parameters to tune in Mean Shift. We'll let the model estimate the bandwidth with the *sklearn.cluster.estimate_bandwidth* function (the default when no bandwidth is given). When *bin_seeding* is set to "False", every data point is used as an initial seed for the cluster centers.

In [122]:
# Declare and fit the model.
ms = MeanShift(bin_seeding=False)
y_pred = ms.fit_predict(X_train)

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters: {}".format(n_clusters_))

Number of estimated clusters: 5


In [123]:
# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_train)

ct

| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 172 | 211 | 1767 | 490 | 265 |
| 1 | 7 | 4 | 22 | 7 | 1 |
| 2 | 2 | 0 | 4 | 2 | 0 |
| 3 | 1 | 0 | 6 | 0 | 0 |
| 4 | 2 | 0 | 1 | 3 | 0 |


In [111]:
# Declare and fit the model.
ms = MeanShift(bin_seeding=True)
y_pred = ms.fit_predict(X_train)

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters: {}".format(n_clusters_))

Number of estimated clusters: 4


Once again, Mean Shift grouped most samples into a single large cluster, leaving only a handful of samples in each of the other clusters.

### Spectral Clustering

Spectral Clustering is based on quantifying similarity between data points. The method defines a similarity matrix of n x n dimensions, where n is the number of data points in the dataset. The matrix is made up of indices of similarity for every pairwise combination of data points. The eigenvectors of a matrix derived from these similarities (the graph Laplacian) then give a low-dimensional representation of the data, and the clusters are found in that reduced space.
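To give a feel for the similarity matrix, the sketch below computes RBF affinities, exp(-gamma * squared distance), for three toy points using the same `gamma=3` as the model below. The helper `rbf_affinity` and the points are hypothetical, and the sketch covers only the affinity step, not the eigenvector computation.

```python
import math

def rbf_affinity(points, gamma):
    """n x n matrix with entries exp(-gamma * squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [[math.exp(-gamma * sq_dist(a, b)) for b in points] for a in points]

# Toy points: two close together and one far away
points = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]
A = rbf_affinity(points, gamma=3)

for row in A:
    print(['{:.4f}'.format(value) for value in row])
```

Nearby points get an affinity close to 1, while distant points get an affinity close to 0. A larger `gamma` makes the affinity fall off faster with distance.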

In [137]:
# Declare and fit the model.
sc = SpectralClustering(n_clusters=5, affinity='rbf', gamma=3)
y_pred = sc.fit_predict(X_train)

In [138]:
# Generate a confusion matrix
ct = pd.crosstab(y_pred, y_train)

ct

| Cluster \ hash_classes | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 6 | 4 | 367 | 11 | 5 |
| 1 | 30 | 45 | 377 | 96 | 56 |
| 2 | 28 | 59 | 292 | 54 | 65 |
| 3 | 39 | 53 | 310 | 69 | 33 |
| 4 | 81 | 54 | 454 | 272 | 107 |


Several parameter combinations were tested, including:

- {affinity: rbf} and {gamma: [.1, 1, 5, 10]}
- {affinity: nearest_neighbors} and {n_neighbors: [10, 30, 50, 100]}

Spectral clustering was also unable to create clusters with the same hashtags.

## TF-IDF Model

TF-IDF creates unique weights for each sentence that combine the *term frequency* (how often a word appears within an individual document) and the *IDF* (which gives more weight to words that occur less often).

The tf_idf score will be highest for a term that occurs a lot within a small number of sentences, and lowest for a word that occurs in most or all sentences.
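The weighting can be sketched in a few lines of plain Python. The version below assumes the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The function `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 normalization, one dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {word: tf[word] * idf[word] for word in tf}
        norm = math.sqrt(sum(w ** 2 for w in weights.values()))
        vectors.append({word: w / norm for word, w in weights.items()})
    return idf, vectors

# Toy documents, for illustration only
docs = [['love', 'music'], ['love', 'election'], ['love', 'music', 'music']]
idf, vectors = tfidf(docs)
print(idf)
print(vectors[1])
```

Here "love" appears in every document and gets the lowest IDF, while "election" appears in only one and gets the highest.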

We'll tune the TF-IDF model and run our supervised machine learning models to predict hashtags. We'll then attempt to create clusters and see if they perform better with this model.

In [54]:
# Instantiate Sklearn's TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_df=.7, # drop words that occur in more than 70% of the tweets
                             min_df=5, # only use words that appear in at least five tweets
                             stop_words='english',
                             use_idf=True, # use inverse document frequencies in the weighting
                             norm=u'l2', # scale each tweet's vector to unit length so that longer and shorter tweets are treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

#Applying the vectorizer
vectorized = vectorizer.fit_transform(twDF.unhashed)

In [55]:
# Splitting into training and test sets
Y = twDF.hash_classes
X_train, X_test, y_train, y_test = train_test_split(vectorized, Y, test_size=0.4, random_state=0)

The results from TfidfVectorizer are stored in a compressed sparse row (CSR) format, which keeps only the non-zero values. When a row is printed, each non-zero entry is shown as a (row, column) index tuple followed by the value itself.
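For intuition, here is a minimal plain-Python sketch of the CSR idea: three flat lists hold the non-zero values, their column indices, and the position where each row starts. The helpers `to_csr` and `row_entries` are hypothetical and only mimic the layout; they are not SciPy functions.

```python
def to_csr(dense):
    """Convert a list of rows into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # where the next row starts
    return data, indices, indptr

def row_entries(csr, i):
    """Return the (column, value) pairs stored for row i."""
    data, indices, indptr = csr
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

# Toy matrix, for illustration only
dense = [[0, 0, 0.8, 0.6],
         [0, 0, 0, 0],
         [1.0, 0, 0, 0]]
csr = to_csr(dense)
print(csr)
print(row_entries(csr, 0))
```

Because tweets are short, most entries of the TF-IDF matrix are zero, so storing only the non-zero values saves a lot of memory.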

In [56]:
# Inspect how the data is stored in Compressed Sparse Row Format
for i in X_train[4]:
    print(i)

  (0, 50)	0.8017848761672317
  (0, 60)	0.5976127612003419


## Using TF-IDF and supervised machine learning to predict Twitter hashtags

In [58]:
# Instantiate and train the Logistic Regression model
lr = LogisticRegression()
train = lr.fit(X_train, y_train)

# Inspect the results
print('Training set score:', lr.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.8257499157398045

Cross validation test scores: [0.70347958 0.69545455 0.70972644]


In [59]:
# Instantiate and train the Nearest Neighbor model
knn = KNeighborsClassifier(n_neighbors=30)
train = knn.fit(X_train, y_train)

# Inspect the results
print('Training set score:', knn.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.6215032018874284

Cross validation test scores: [0.60060514 0.6030303  0.60942249]


In [60]:
# Instantiate and train the Random Forest Classifier model
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9605662285136501

Cross validation test scores: [0.72314675 0.71212121 0.72340426]


In [61]:
# Instantiate and train the Gradient Boosting Classifier model
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Inspect the results
print('Training set score:', clf.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(clf, X_test, y_test))

Training set score: 0.8446241995281429

Cross validation test scores: [0.73373676 0.71666667 0.7218845 ]


In [62]:
# Define parameter combinations
params = {'n_estimators': [100, 200],
          'criterion': ['gini', 'entropy'],
          'warm_start': [True, False]}

# Instantiate RandomizedSearchCV to test all possible parameter combinations 
random_CV = RandomizedSearchCV(rfc, params, n_iter=8, cv=3, scoring='f1_micro')

# Run RandomizedSearchCV
random_CV.fit(X_train, y_train)

# Inspect the best parameter combination
random_CV.best_params_

{'criterion': 'gini', 'n_estimators': 100, 'warm_start': True}

In [63]:
# Instantiate and train the Random Forest Classifier model
# Note: these parameter values differ from the best_params_ printed above
rfc = ensemble.RandomForestClassifier(criterion='entropy', n_estimators=200, warm_start=False)
train = rfc.fit(X_train, y_train)

# Inspect the results
print('Training set score:', rfc.score(X_train, y_train))
print('\nCross validation test scores:', cross_val_score(train, X_test, y_test))

Training set score: 0.9686552072800809

Cross validation test scores: [0.75037821 0.6969697  0.72796353]


In [64]:
# Visualize selected classification metrics for the Gradient Boosting predictions (y_pred was set in cell 61)
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          1       0.80      0.30      0.43       132
          2       0.81      0.31      0.45       172
          3       0.73      0.98      0.84      1185
          4       0.85      0.46      0.60       320
          5       0.88      0.47      0.61       170

avg / total       0.77      0.75      0.72      1979



Our supervised machine learning models had a similar result with the *TF-IDF* and the *Bag of Words* features. The latter still had a slightly higher performance.

Now let's see how our clusters come out.

## Clustering with TF-IDF

### K-Means

In [65]:
# Declare and fit the model
y_pred = KMeans(n_clusters=5).fit_predict(X_train)

In [66]:
# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred, y_train))

Comparing k-means clusters against the data:
hash_classes    1    2     3    4    5
row_0                                 
0             168  167  1020  490  255
1               3    3   365    3    1
2               6    2   196    0    0
3               6   42    28    9   10
4               1    1   191    0    0


Slightly better, but still not good enough. Clusters 1, 2 and 4 consist almost entirely of class 3 tweets, but cluster 0 still holds the majority of every class, including 1,020 of the 1,800 class 3 tweets.

## Clustering based on TF-IDF Cosine Similarity

Our last clustering attempt will be with Cosine Similarity.

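We'll use Sklearn's *cosine_similarity* function to calculate the cosine of the angle between the TF-IDF vectors of every pair of tweets, and create a similarity matrix where each row holds the cosine values between that row's tweet and all other tweets.

In general, cosine similarity ranges from -1.0 to 1.0. Because TF-IDF weights are never negative, the values here fall between 0 and 1.0, where 1.0 means the two vectors point in the same direction (for instance, two tweets with identical term weights) and 0 means the tweets share no terms.

Since we have 4946 rows, our similarity matrix will have the same number of columns (4946 x 4946).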
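As a small illustration of the definition, here is a minimal sketch that computes cosine similarity by hand with NumPy. The three vectors are hypothetical term weights over a three-word vocabulary, and the helper `cosine` is not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical term-weight vectors over a three-word vocabulary
tweet_a = np.array([1.0, 2.0, 0.0])
tweet_b = np.array([2.0, 4.0, 0.0])   # same direction as tweet_a, twice the length
tweet_c = np.array([0.0, 0.0, 3.0])   # shares no terms with tweet_a

print('a vs b:', cosine(tweet_a, tweet_b))
print('a vs c:', cosine(tweet_a, tweet_c))
```

The first pair scores 1.0 (up to floating-point rounding) even though the vectors differ in length, because only the direction matters. The second pair scores 0 because the vectors have no non-zero terms in common.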

In [68]:
# Generate a cosine similarity matrix and store it in a dataframe
sim_mtx = pd.DataFrame(data = cosine_similarity(vectorized))

sim_mtx.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4936,4937,4938,4939,4940,4941,4942,4943,4944,4945
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.250451,0.36547,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.732314,0.0,0.0,0.0,0.0,...,0.0,0.0,0.265646,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.181986,0.0,0.0,0.0,0.0,0.0,0.0


### K-means

In [69]:
# Declare and fit the clustering model
cos_sim_clusters = KMeans(n_clusters=5).fit_predict(sim_mtx)

In [70]:
# Generate a confusion matrix to inspect the results
print('Comparing the assigned categories to the ones in the data:')
pd.crosstab(cos_sim_clusters, twDF.hash_classes)

Comparing the assigned categories to the ones in the data:


hash_classes    1    2     3    4    5
row_0                                 
0               4    1   280    2    0
1               7    1   229    1    0
2             299  381  1760  814  434
3               5    3   647    5    2
4               1    1    69    0    0


So we've run out of options and haven't been able to generate useful clusters with unsupervised methods.

## Assigning hashtags based on Cosine Similarity

Nevertheless, as a final experiment, we'll see if tweets with high similarity also have the same hashtag.

In [72]:
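# Create a dictionary where the key is a tweet's row index, and the value is the index of the
# tweet with the highest cosine value (in other words, the most similar tweet).
# nlargest(2) normally returns the tweet itself (cosine 1.0) followed by its closest
# neighbour, so we keep the second entry.
most_similar = {}
for i in sim_mtx.columns:
    most_similar[i] = sim_mtx[i].nlargest(2).index[1]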
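
The loop above relies on each tweet's self-similarity being ranked first by `nlargest`. When two tweets have identical vectors, the tie at 1.0 can place the tweet itself in second position, so it ends up listed as its own nearest neighbour. A hedged alternative, shown here on a toy matrix (`toy_sim` is hypothetical), masks the diagonal before taking the arg-max:

```python
import numpy as np

# Toy symmetric similarity matrix; rows 0 and 1 are exact duplicates
toy_sim = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

masked = toy_sim.copy()
np.fill_diagonal(masked, -np.inf)   # a tweet can never be its own nearest neighbour

nearest = masked.argmax(axis=1)     # position of the most similar *other* tweet per row
print(nearest)
```

With the notebook's data, the same idea could be applied to `sim_mtx.values`, assuming the DataFrame keeps its default integer index so that positions and labels coincide.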

In [73]:
# Generate a list with the hash classes of the most similar Tweets
y_similar = []
for i in most_similar.values():
    y_similar.append(twDF.loc[i,'hash_classes'])

In [74]:
# Check how often the original tweet and its most similar tweet have the same hashtag
print(classification_report(twDF.hash_classes, y_similar))

             precision    recall  f1-score   support

          1       0.22      0.48      0.30       316
          2       0.43      0.38      0.41       387
          3       0.83      0.76      0.79      2985
          4       0.56      0.51      0.54       822
          5       0.54      0.53      0.54       436

avg / total       0.69      0.65      0.67      4946



It turns out that cosine similarity is reasonably effective at predicting at least the #ff hashtag, while the remaining classes score much lower (F1-scores between 0.30 and 0.54). Moving forward, we can group together tweets with high cosine similarities (e.g., values above 0.7) and see if hashtags are consistent within groups.
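
One way to sketch that follow-up, assuming a similarity matrix and an array of hashtag classes like the ones above: collect every pair of distinct tweets whose similarity exceeds the threshold and measure how often both tweets share a class. The toy values and the names `toy_sim`, `toy_classes` and `threshold` are hypothetical.

```python
import numpy as np

# Hypothetical 4-tweet similarity matrix and hashtag classes
toy_sim = np.array([
    [1.0, 0.8,  0.1,  0.0],
    [0.8, 1.0,  0.75, 0.0],
    [0.1, 0.75, 1.0,  0.3],
    [0.0, 0.0,  0.3,  1.0],
])
toy_classes = np.array([3, 3, 4, 1])
threshold = 0.7

# Look only at the upper triangle so each pair is counted once and the
# diagonal (self-similarity) is skipped
rows, cols = np.triu_indices_from(toy_sim, k=1)
is_similar = toy_sim[rows, cols] > threshold

pair_rows, pair_cols = rows[is_similar], cols[is_similar]
same_class = toy_classes[pair_rows] == toy_classes[pair_cols]

print('Pairs above threshold:', len(same_class))
print('Share with matching hashtag class:', same_class.mean())
```

In this toy case two pairs pass the threshold and one of them shares a class. On the real matrix, a high share would indicate that strongly similar tweets tend to carry the same hashtag.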

### Visualizing most similar tweets

As a convenience, we'll create a DataFrame to visualize the most similar tweets.

In [75]:
# Create a dataframe - assign column names
sim_df = pd.DataFrame(columns=['original_tweet','most_similar'])

# Copy the original tweets to our new dataframe
sim_df['original_tweet'] = twDF.text

In [76]:
# Now copy the "original tweets'" most similar tweets
for key, value in most_similar.items():
    sim_df.loc[key, 'most_similar'] = twDF.loc[value, 'text']

In [77]:
sim_df.head()

Unnamed: 0,original_tweet,most_similar
0,i wish i could watch the video feed...but the buffering sucks! #asot400,my video feed went down again.. #asot400
1,@donaxvariabilis omt love you! opening it in a new tab now. im suckered i missed daniel kandi!! #asot400,@bobbymonkz yup replacing daniel kandi since he couldnt get a passport unfortunatly #asot400
2,ouch following the #asot400 in tweetdeck exceeded my tweet limit!,@jprigent i am following your sis #followfriday
3,i'm gonna be so sad when this is over #asot400,gonna do this again: #musicmonday craigslist by @alyankovic
4,want mooooooooore #asot400,i want beer now #squarespace


## Conclusion

It's hard to say exactly why the clustering methods performed so poorly on the Twitter data set. Among all attempts, clustering with TF-IDF proved the most effective, but was still inadequate.

In the cells below, we compare the proportion of stopwords per tweet against sentences in Alice in Wonderland. Tweets have a smaller proportion of stopwords than this sample of literary prose: about 34% of the words in a tweet are stopwords, compared with about 46% in Alice in Wonderland. This suggests that tweets contain enough content words to build and learn from our tweet vectors.

Nevertheless, several issues complicate matters. For one, our Twitter data set was written by hundreds of thousands of users. Tweets also contain many punctuation marks and username handles, which were all stripped from the input variables. There are also many misspellings, and the sentences generally aren't well formed.

The supervised machine learning methods, on the other hand, were successful in predicting hashtags. The main reason is that we have labeled data to train the models, which is not the case with the unsupervised clusters.

In [78]:
# Build the stopword set once; set membership tests are much faster than
# calling stopwords.words("english") for every word
english_stopwords = set(stopwords.words("english"))

# Create a list with the number of stopwords in each tweet
stopword_count = twDF.splitted.apply(
    lambda x: sum(1 for w in x if w in english_stopwords)
).tolist()

# Create a list with the number of words in each tweet
word_count = twDF.splitted.apply(len).tolist()

# Inspect the average number of stopwords and words per tweet
print('Average number of stopwords per Tweet:', np.mean(stopword_count))
print('Average number of words per Tweet:', np.mean(word_count))

Average number of stopwords per Tweet: 4.308329963606955
Average number of words per Tweet: 12.812778002426203


In [79]:
# Import the Alice In Wonderland text and perform some cleaning
alice = gutenberg.raw('carroll-alice.txt')

# This pattern matches all text between square brackets.
pattern = r"\[.*?\]"
alice = re.sub(pattern, "", alice)
alice = re.sub(r'CHAPTER .*', '', alice)
alice = ' '.join(alice.split())
alice_doc = nlp(alice)

# Initial exploration of sentences.
alice_sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(alice_sentences)))

# Create a flat list with the number of stopwords in each sentence from Alice in Wonderland
english_stopwords = set(stopwords.words("english"))
alice_stopword_count = [
    sum(1 for w in sent.text.split() if w.lower() in english_stopwords)
    for sent in alice_sentences
]

# Create a list with the number of words in each sentence from Alice in Wonderland
alice_word_count = [len(i.text.split()) for i in alice_sentences]

# Compare the proportion of stopwords in Alice in Wonderland sentences and in tweets
print('Proportion of stopwords in sentences in Alice in Wonderland:', np.mean(alice_stopword_count) / np.mean(alice_word_count))
print('Proportion of stopwords in Tweets:', np.mean(stopword_count) / np.mean(word_count))

Alice in Wonderland has 1678 sentences.
Proportion of stopwords in sentences in Alice in Wonderland: 0.46135105204872645
Proportion of stopwords in Tweets: 0.33625260367354665
