## Assignment 2: Text Scraping & Clustering

This notebook is the second part of COMP41680 Assignment 2 and covers corpus exploration. The commented-out nltk.download call at the bottom of the list of imports was needed during implementation to run the lemmatizer function. Uncomment it if the WordNet data is not already installed. A stemming tokenizer is also defined and can be used instead if issues occur.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import os
import nltk
import numpy as np
import scipy.cluster.hierarchy as hac
from sklearn.cluster import KMeans
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS
import matplotlib as mpl
import pandas as pd
import mpld3
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import ward, dendrogram
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import display, HTML
#nltk.download('wordnet')
%matplotlib inline 


In [None]:
def stem_tokenizer(text):
    # use the standard scikit-learn tokenizer first
    standard_tokenizer = CountVectorizer().build_tokenizer()
    tokens = standard_tokenizer(text)
    # then use NLTK to perform stemming on each token
    stemmer = PorterStemmer()
    stems = []
    for token in tokens:
        stems.append( stemmer.stem(token) )
    return stems

def lemma_tokenizer(text):
    # use the standard scikit-learn tokenizer first
    # (the vectorizer calls this function once per document, so the argument is a single text)
    standard_tokenizer = CountVectorizer().build_tokenizer()
    tokens = standard_tokenizer(text)
    # then use NLTK to perform lemmatisation on each token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemma_tokens = []
    for token in tokens:
        lemma_tokens.append( lemmatizer.lemmatize(token) )
    return lemma_tokens

### Part 1: Read and Process Text
Each text file contains two strings: the first is the title and the second is the article. The files are stored in a directory named data, located in the same directory as this Jupyter notebook. These text files will now be read in. Each article is appended to a list of strings named docs, and each title is read into a separate list of strings named titles.

In [None]:
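DATA_DIR = "data/"
# Initialise the lists once, outside the loop, so they are not reset for each directory visited
docs = []
titles = []
for root, dirs, files in os.walk(DATA_DIR):
    for file in files:
        # os.path.join works whether or not root ends with a path separator
        with open(os.path.join(root, file), 'r') as file_input:
            raw_document = file_input.readlines()
            # line 0 is the title, line 1 is the article text
            docs.append(raw_document[1])
            titles.append(raw_document[0].replace('\n', ''))

# Print some info to see what was read in
print("root:", DATA_DIR)
print("\nFirst 2 docs:\n", docs[0:2])
print("\nFirst 10 titles:\n", titles[0:10])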
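
The reading step assumes that every file has the title on its first line and the article on its second. A minimal sketch of a more defensive reader, which returns None for files that do not match that layout instead of raising an IndexError, is shown below. The helper name read_article, the UTF-8 encoding and the sample file are assumptions made for illustration only.

```python
import os
import tempfile

def read_article(path):
    """Return (title, article) from a two-line text file, or None if the file is too short.

    Hypothetical helper: assumes the title is on line 1 and the article on line 2.
    """
    with open(path, "r", encoding="utf-8") as file_input:
        lines = file_input.read().splitlines()
    if len(lines) < 2:
        return None
    return lines[0].strip(), lines[1].strip()

# Demonstrate on a throwaway file so the sketch runs without the data directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("Example title\nExample article body.\n")
    sample = read_article(sample_path)

print(sample)
```

In the loop above, a None result could simply be skipped so that docs and titles stay aligned.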

### Part 2: Process Text and Create Document Term Matrix
The articles will now be pre-processed and the document-term matrix created. Pre-processing converts all characters to lower case and removes terms that occur in more than 80% of documents or in fewer than 5 documents. English stop words are also removed. Unigrams, bigrams and trigrams are used, and lemmatization is applied. The maximum number of features is set to 200,000.

In [None]:
vectorizer =  TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=5, stop_words='english',
                                 use_idf=True, tokenizer=lemma_tokenizer, ngram_range=(1,3))
tfidf_matrix = vectorizer.fit_transform(docs)
tfidf_list = tfidf_matrix.todense().tolist()
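# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use it instead
df_tfidf = pd.DataFrame(tfidf_list, index=titles, columns=vectorizer.get_feature_names_out())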

print("Number of terms in tfidf matrix: ", df_tfidf.shape[1])
print("Number of documents in tfidf matrix: ", df_tfidf.shape[0])
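
The effect of the max_df and min_df settings can be illustrated without scikit-learn. The sketch below counts, for each term, the number of documents it appears in, then keeps only the terms inside the allowed range. The function name prune_by_document_frequency and the toy documents are hypothetical, and the thresholds are scaled down to suit five documents.

```python
def prune_by_document_frequency(tokenized_docs, min_df=5, max_df=0.8):
    """Keep terms found in at least min_df documents and in at most max_df (a fraction) of them."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Toy corpus: "the" is in every document, "goal" and "loan" are in only one
toy_docs = [
    ["market", "bank", "the"],
    ["market", "match", "the"],
    ["match", "goal", "the"],
    ["bank", "loan", "the"],
    ["market", "the"],
]
kept = prune_by_document_frequency(toy_docs, min_df=2, max_df=0.8)
print(kept)
```

Terms that are too common carry little information for separating topics, and terms that are too rare add noise and columns without helping the clustering.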

In [None]:
# Display the pandas dataframe which represents the tfidf matrix
display(df_tfidf)

### Part 3: Summarise the Corpus

To explore the corpus, the most common terms and the most highly weighted terms were found and plotted. The total number of terms is also shown.

In [None]:
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist[0])

The distances are reduced to two components because the points are plotted on a two-dimensional plane. dissimilarity is set to "precomputed" because a distance matrix is provided directly, and random_state is set so that the plot is reproducible.

In [None]:
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)
xs, ys = pos[:, 0], pos[:, 1]
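
With dissimilarity set to "precomputed", the input is treated as a symmetric distance matrix. Because 1 - cosine_similarity is computed in floating point, entries that should be exactly zero can come out as tiny positive or negative values. A small sketch of tidying such a matrix with NumPy is shown below. The helper name clean_distance_matrix and the noisy example are hypothetical, and whether this clean-up is actually needed for this corpus is an assumption.

```python
import numpy as np

def clean_distance_matrix(dist):
    """Return a symmetric, non-negative copy of dist with an exactly zero diagonal."""
    cleaned = np.clip(np.asarray(dist, dtype=float), 0.0, None)  # remove tiny negatives
    cleaned = (cleaned + cleaned.T) / 2.0                          # enforce symmetry
    np.fill_diagonal(cleaned, 0.0)                                 # distance to self is zero
    return cleaned

# Illustrative matrix with floating-point noise on the diagonal and off-diagonal
noisy = np.array([[-2e-16, 0.8],
                  [0.8000000001, 1e-16]])
cleaned = clean_distance_matrix(noisy)
print(cleaned)
```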

In [None]:
#set up colors per clusters using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a'}

#set up cluster names using a dict
cluster_names = {0: 'Technology', 
                 1: 'Finance', 
                 2: 'Health', 
                 3: 'Sport'}

In [None]:
#define custom toolbar location
class TopToolbar(mpld3.plugins.PluginBase):
    """Plugin for moving toolbar to top of figure"""

    JAVASCRIPT = """
    mpld3.register_plugin("toptoolbar", TopToolbar);
    TopToolbar.prototype = Object.create(mpld3.Plugin.prototype);
    TopToolbar.prototype.constructor = TopToolbar;
    function TopToolbar(fig, props){
        mpld3.Plugin.call(this, fig, props);
    };

    TopToolbar.prototype.draw = function(){
      // the toolbar svg doesn't exist
      // yet, so first draw it
      this.fig.toolbar.draw();

      // then change the y position to be
      // at the top of the figure
      this.fig.toolbar.toolbar.attr("x", 150);
      this.fig.toolbar.toolbar.attr("y", 400);

      // then remove the draw function,
      // so that it is not called again
      this.fig.toolbar.draw = function() {}
    }
    """
    def __init__(self):
        self.dict_ = {"type": "toptoolbar"}

In [None]:
#define custom css to format the font and to remove the axis labeling
css = """
text.mpld3-text, div.mpld3-tooltip {
  font-family:Arial, Helvetica, sans-serif;
}

g.mpld3-xaxis, g.mpld3-yaxis {
display: none; }

svg.mpld3-figure {
margin-left: -200px;}
"""

### Part 5: K-Means Clustering

Clusters are created using the k-means algorithm and then displayed. The clusters are grouped in a dataframe, which allows their content to be investigated.

In [None]:
# Create the kmeans clusters
km = KMeans(n_clusters=num_clusters)
kmeans = km.fit(tfidf_matrix)
clusters_kmeans = km.labels_.tolist()

# Display 10 of the cluster values
print(clusters_kmeans[0:10])
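
K-means numbers its clusters arbitrarily, so the fixed mapping in cluster_names (0 as 'Technology', and so on) only holds if the labels happen to come out in that order. One way to check is to list the highest-weighted terms of each cluster centroid. The sketch below shows the idea on toy data. The function name top_terms_per_cluster and the toy centroids are hypothetical.

```python
def top_terms_per_cluster(centroids, terms, n_terms=3):
    """Return, for each centroid, the n_terms terms with the largest centroid weight."""
    result = {}
    for cluster_id, centroid in enumerate(centroids):
        # Rank term indices by their weight in this centroid, largest first
        ranked = sorted(range(len(terms)), key=lambda i: centroid[i], reverse=True)
        result[cluster_id] = [terms[i] for i in ranked[:n_terms]]
    return result

toy_terms = ["bank", "goal", "loan", "match", "software"]
toy_centroids = [
    [0.7, 0.0, 0.6, 0.1, 0.0],   # finance-like centroid
    [0.0, 0.8, 0.0, 0.5, 0.1],   # sport-like centroid
]
summary = top_terms_per_cluster(toy_centroids, toy_terms, n_terms=2)
print(summary)
```

On the fitted model, the inputs would be km.cluster_centers_ and the vectorizer's list of terms, and the printed terms could then be used to confirm or correct the names in cluster_names.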

In [None]:
#create data frame that has the result of the MDS plus the cluster numbers and titles
df_kmeans = pd.DataFrame(dict(x=xs, y=ys, label=clusters_kmeans, title=titles)) 

#group by cluster
groups_kmeans = df_kmeans.groupby('label')
for index in range(len(cluster_names)):
    display(groups_kmeans.get_group(index)[0:10])


In [None]:
# Plot is interactive
fig, ax = plt.subplots(figsize=(14,8)) #set plot size
ax.margins(0.15) # Optional, just adds 15% padding to the autoscaling

#iterate through groups to layer the plot
#note that the cluster_names and cluster_colors dicts are looked up with 'name' to return the appropriate label/color
for name, group in groups_kmeans:
    points = ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, 
                     label=cluster_names[name], mec='none', 
                     color=cluster_colors[name])
    labels = [i for i in group.title]
    
    #set tooltip using points, labels and the already defined 'css'
    tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels, voffset=10, hoffset=10, css=css)
    #connect tooltip to fig
    mpld3.plugins.connect(fig, tooltip, TopToolbar())    
    
    #set tick marks as blank
    ax.axes.get_xaxis().set_ticks([])
    ax.axes.get_yaxis().set_ticks([])
    
    #set axis as blank
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)

    
ax.legend(numpoints=1) #show legend with only one dot

mpld3.display() #show the plot


### Part 6: Agglomerative Clustering

In [None]:
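# Get indices of terms ordered by IDF weight, highest first
# (a high IDF means the term appears in few documents, not that it is frequent)
indices = np.argsort(vectorizer.idf_)[::-1]
print("indices of the 10 most highly weighted words:",indices[0:10])

# Get all terms (get_feature_names() was removed in scikit-learn 1.2)
terms = vectorizer.get_feature_names_out()

# Order terms by IDF weight
top_terms = [terms[i] for i in indices]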

# Display number of terms in corpus and most highly weighted words in corpus
print("\nNumber of terms in document term matrix: ", len(terms))
print("\nTop 10 terms in Corpus:")
print(top_terms[0:10])
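
Sorting vectorizer.idf_ in descending order ranks terms by inverse document frequency, so the terms at the top are the rarest ones that survived the min_df filter, not the most frequent ones. With scikit-learn's default smoothing, the weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. A short sketch of that formula follows. The function name smoothed_idf and the corpus size are illustrative assumptions.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as computed by TfidfVectorizer with smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # illustrative corpus size, not the size of the assignment corpus
for df in (5, 100, 800):
    # The rarer the term (smaller df), the larger its IDF weight
    print(df, round(smoothed_idf(n_docs, df), 3))
```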


In [None]:

plt.figure(figsize=(16,10))
df_tfidf.astype(bool).sum(axis=0).sort_values(ascending=False)[:50].plot.bar()
# plt.bar(list(range(len(terms))),)
plt.title('Frequency of most common words')

In [None]:
print(len(terms))
# print(len(df_tfidf.astype(bool).sum(axis=0).sort()))
plt.figure(figsize=(16,10))
df_tfidf.sum(axis=0).sort_values(ascending=False)[:50].plot.bar()
# plt.bar(list(range(len(terms))),)
plt.title('Words with highest weight')

### Part 4: Pre-processing before Clustering

The cosine similarities were calculated and multidimensional scaling was used to reduce the data to two dimensions so that it could be visualised. Some CSS and JavaScript were also included in order to display interactive plots. More information can be found at the source: http://brandonrose.org/clustering

In [None]:
num_clusters = 4

The cosine similarity is used to measure the similarity between documents. Because tf-idf weights are non-negative, it ranges between 0 and 1, where a value of 1 indicates documents with identical term weightings and a value of 0 indicates documents which share no terms in common. To demonstrate this, the cosine similarity between documents 0 and 2 can be seen to be 0.2, while the cosine similarity between documents 0 and 779 can be seen to be 0.82.

In [None]:
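# cosine_similarity returns a 1x1 array here, so index [0, 0] to get a plain number for formatting
print( "cos(D0,D2) = %.2f" % cosine_similarity(tfidf_matrix[0], tfidf_matrix[2])[0, 0])
print( "cos(D0,D779) = %.2f" % cosine_similarity(tfidf_matrix[0], tfidf_matrix[779])[0, 0])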

print("Titles of documents 0, 2 and 779:")
print(titles[0])
print(titles[2])
print(titles[779])
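
The two extremes of the cosine similarity range can be checked by hand on small vectors. The sketch below is a plain-Python version of the measure, written only for illustration. The function name cosine and the toy vectors are hypothetical, and scikit-learn's cosine_similarity is what the notebook actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Same direction (one vector is a multiple of the other): similarity close to 1
same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared non-zero terms: similarity is 0
no_shared_terms = cosine([1.0, 0.0, 0.0], [0.0, 3.0, 4.0])
print(same_direction, no_shared_terms)
```

Only the direction of the vectors matters, so a long article and a short one with the same relative term weights are still treated as highly similar.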