<a href="https://colab.research.google.com/github/steve-wilson/ds32019/blob/master/01_Text_Processsing_Basics_DS3Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Text Analysis for User Generated Content @ [DS3](https://www.ds3-datascience-polytechnique.fr/)

# Part 1: Text Processing Basics

[-> Next: Noisy Text Processing](https://colab.research.google.com/drive/1VlRz-wKYmsQ4gRHb02uLav8RodpvsCNG)

Dates: June 27-28, 2019

Facilitator: [Steve Wilson](https://steverw.com)

(To edit this notebook: File -> Open in Playground Mode)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.


In [0]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# counting and data management
import collections
# operating system utils
import os
# regular expressions
import re
# additional string functions
import string
# system utilities
import sys
# request() will be used to load web content
import urllib.request


# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

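# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# newer NLTK releases (3.9+) look for 'punkt_tab' instead
nltk.download('punkt_tab')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# newer NLTK releases (3.9+) look for this name instead
nltk.download('averaged_perceptron_tagger_eng')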
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')

# numpy: matrix library for Python
import numpy as np

# scipy: scientific operations
# works with numpy objects
import scipy
import scipy.sparse
import scipy.spatial.distance

# matplotlib (and pyplot) for visualizations
import matplotlib
import matplotlib.pyplot as plt

# sklearn for basic machine learning operations
import sklearn
import sklearn.cluster
import sklearn.datasets
import sklearn.feature_extraction.text
import sklearn.manifold

# wordcloud tool
!pip install wordcloud
from wordcloud import WordCloud

# for checking object memory usage
!pip install pympler
from pympler import asizeof

!pip install spacy
import spacy

# Downloading data
# ----------------
if not os.path.exists("aclImdb"):
    !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xzf aclImdb_v1.tar.gz

print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---
## A - Basic Text Preprocessing



### Built-in Python functions

- Basic Python functions provide a good starting place.

- First, we should try to split a sentence into individual words:

In [0]:
text = "École polytechnique (also known as EP or X) (English: " + \
       "Polytechnic School), is a French public institution of higher "+ \
       "education and research in Palaiseau, a suburb southwest of Paris."

# We can split on all whitespace with split()
words = text.split()
print("WORDS:",words)

- It is fairly straightforward to do things like remove punctuation, lowercase, or access individual letters:

In [0]:
# for the first 10 words
for word in words[:10]:
    
    # print the string "word:", the word itself, 
    # and end with a vertical bar character instead of a newline
    print("word:", word, end=' | ')
    
    # strip removes characters at the beginning and end of a string
    # string.punctuation contains: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    print("no punctuation:", word.strip(string.punctuation), end=' | ')
    
    # lower() and upper() change case
    print("lowercase:", word.lower(), end=' | ')
    
    # characters in strings can be indexed just like items in lists
    print("first letter:", word[0].upper())

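- Note that `strip()` only cleans the two ends of a string, so punctuation inside a token survives. Below is a minimal sketch of removing every punctuation character with `str.translate()`. The helper name `remove_punctuation` is ours (not part of the notebook), and it only handles ASCII punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# (Assumption: only ASCII punctuation matters; curly quotes are untouched.)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(token):
    """Remove ASCII punctuation anywhere in the token, not just at the ends."""
    return token.translate(PUNCT_TABLE)

token = "(U.K.),"
stripped = token.strip(string.punctuation)   # only the ends are cleaned
cleaned = remove_punctuation(token)          # interior periods are removed too
print(stripped, "|", cleaned)
```

Whether interior punctuation should go depends on the task: removing it also turns "data-driven" into "datadriven".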
- How about dealing with multiple sentences?

In [0]:
# From https://en.wikipedia.org/wiki/Data_science

text =  'Data science is a "concept to unify statistics, data analysis, machine ' + \
        'learning and their related methods" in order to "understand and analyze ' + \
        'actual phenomena" with data. '

text += 'It employs techniques and theories drawn from many fields within the ' + \
        'context of mathematics, statistics, computer science, and information ' + \
        'science. '

text += 'Turing award winner Jim Gray imagined data science as a "fourth paradigm" ' + \
        'of science (empirical, theoretical, computational and now data-driven) ' + \
        'and asserted that "everything about science is changing because of the ' + \
        'impact of information technology" and the data deluge. '

text += 'In 2015, the American Statistical Association identified database ' + \
        'management, statistics and machine learning, and distributed and ' + \
        'parallel systems as the three emerging foundational professional communities.'

# We could try splitting on the period character...
sentences = text.split('.')
print('.\n'.join(sentences))

- When might this not work?

In [0]:
# Try this:
text =  "Dr. Martin registered the domain name drmartin.com before moving to the " + \
        "U.K. in January. "
text += "During that time, 1.6 million users visited her website... it was very " + \
        "unexpected and caused a server to crash."
sentences = text.split('.')
print('.\n'.join(sentences))

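- A slightly smarter rule is still easy to fool. The sketch below splits only when sentence-final punctuation is followed by whitespace and a capital letter. It is an illustrative heuristic of our own (the name `naive_sentence_split` is hypothetical), not how any library does it.

```python
import re

def naive_sentence_split(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

sample = "Dr. Martin moved to the U.K. in January. She got 1.6 million visits."
for sentence in naive_sentence_split(sample):
    print(repr(sentence))
```

The rule no longer breaks on "1.6" or on "U.K. in", but it still cuts the text after "Dr." because the next word is capitalized.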
### Introducing the Natural Language Toolkit (NLTK)

- NLTK is a very handy library for basic text processing operations.

- We can split sentences in a much smarter way:

In [0]:
sentences = nltk.sent_tokenize(text)
print('\n'.join(sentences))

- **What else can we do with NLTK?**
- Smarter word tokenization:

In [0]:
sentence_words = nltk.word_tokenize(sentences[0])
print("Words:",' '.join(sentence_words))

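- To see why a tokenizer helps, compare `split()` with a small regular-expression tokenizer that separates punctuation from words. This is a rough sketch under our own assumptions (the name `simple_tokenize` is hypothetical), and it is much cruder than a real tokenizer.

```python
import re

def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Dr. Martin's site (drmartin.com) crashed!"
print(sentence.split())
print(simple_tokenize(sentence))
```

Splitting on whitespace leaves punctuation glued to words, while this regex goes too far and breaks up "drmartin.com" and "Martin's". A library tokenizer applies more careful rules for cases like these.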
- Finding word stems:

In [0]:
# Add the words from the 2nd sentence
sentence_words += nltk.word_tokenize(sentences[1])

# Stemming
stemmer = nltk.stem.PorterStemmer()
stems = [stemmer.stem(word) for word in sentence_words]
print(stems)

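- Stemming is rule-based suffix stripping, which is why stems are not always real words. The toy stemmer below illustrates the idea with a handful of made-up rules. It is not the Porter algorithm, and the names `SUFFIXES` and `toy_stem` are hypothetical.

```python
# Suffixes tried in order; a deliberately tiny, hypothetical rule list.
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["visited", "moving", "users", "crashes", "was"]:
    print(w, "->", toy_stem(w))
```

Here "moving" becomes "mov", which is not a word but still groups together forms that share the same prefix.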
- Labeling words with their part-of-speech, and even finding their lemmas:

In [0]:
# Part-of-speech tagging
pos_tags = nltk.pos_tag(sentence_words)
print("Parts of speech:",pos_tags)

# Lemmatization
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'
    
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for (word,pos) in pos_tags]
print("Lemmas:", ' '.join(lemmas))

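- `lookup_pos()` above treats every tag that is not a verb as a noun. The WordNet lemmatizer also accepts `'a'` (adjective) and `'r'` (adverb), so one possible extension is sketched below. The function name `penn_to_wordnet` is ours, and whether the extra categories improve the lemmas depends on your text.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, ...
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS
    return "n"       # nouns and everything else

for tag in ["NN", "VBD", "JJR", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

It could be passed to the lemmatizer in place of `lookup_pos`, for example `lemmatizer.lemmatize(word, penn_to_wordnet(pos))`.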
- Sometimes, it is helpful to remove "stopwords", like "a, the, I, do," and others.
    - It's worth thinking about whether or not these words are important in your application.
    - These kinds of words do carry a lot of important information!

In [0]:
# Stopword (non-content word) removal
# (the stopword list is lowercase, so compare lowercased tokens)
stop_words = set(nltk.corpus.stopwords.words('english'))
content_words = [word for word in sentence_words if word.lower() not in stop_words]
removed_stop_words = [word for word in sentence_words if word.lower() in stop_words]
print("Content words:", ' '.join(content_words))
print("Removed Stop words:", ' '.join(removed_stop_words))

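- Here is a small illustration of how much information stopwords can carry. The stopword set below is a tiny, hypothetical list for demonstration only (NLTK's English list is much longer), and `remove_stop_words` is our own helper name.

```python
# A tiny, hypothetical stopword list for illustration only.
TOY_STOP_WORDS = {"to", "be", "or", "not", "the", "is", "a", "this"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words("To be or not to be".split()))
print(remove_stop_words("This movie is not good".split()))
```

The first sentence disappears entirely, and the second loses its negation, which would matter a great deal for sentiment analysis.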
- Let's look at a simple plot of the word frequencies in our sample text.

In [0]:
# Get word frequencies
frequencies = nltk.probability.FreqDist(sentence_words)

# Plot the frequencies
frequencies.plot(15,cumulative=False)
plt.show()

### Putting it together: Creating a Word Cloud
- Now, it's your turn to try out some of the techniques we've covered.

1. First, run the code block below labeled "Run this code first" to perform some setup.
2. Then, modify the code marked "Exercise 1" to convert a document into **preprocessed lemma frequencies**.
    - There is a sample solution below. It's hidden for now, but you can take a peek when you are ready.
3. Finally, run the code labeled "build a word cloud" to see the result.

In [0]:
#@title Run this code first: Wordcloud function and loading the document (double-click to view) {display-mode: "form"}


# Draw a wordcloud!
# Inputs:
#   freq_dist: a dictionary (or FreqDist) mapping strings to their counts
#   colormap: name of a matplotlib colormap, or None for the default
def draw_wordcloud(freq_dist, colormap):
    
    #TODO add a few corpus specific checks here to make sure people have done casing, lemmatization, punct removal
    uniq_count = len(freq_dist.keys())
    print("Building a word cloud with",uniq_count,"unique words...")
    wc = WordCloud(colormap=colormap, width=1500, 
                   height=1000).generate_from_frequencies(freq_dist)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    
print("draw_wordcloud() function is ready to use.")

# Load the contents of the book "The Wonderful Wizard of Oz" 
#   by L. Frank Baum (from project Gutenberg)
document = urllib.request.urlopen("http://www.gutenberg.org/cache/epub/55/pg55.txt").read().decode('utf-8')

print('"The Wonderful Wizard of Oz" full text is loaded.')

**Exercise 1**

Write your code here. Make sure to click the "run" button when you're finished.

In [0]:
# Convert text to a FreqDist object (a dictionary-like mapping from
# each lemma in the text to its frequency).
# All stopwords should be removed.
# Inputs:
#   text: a string as input, possibly containing multiple sentences.
def text_to_lemma_frequencies(text):
    
# ------------- Exercise 1 -------------- #

    # write your preprocessing code here

    # replace this return statement with your own
    return nltk.probability.FreqDist(["Hello", "world", "hello", "world."])
# ---------------- End ------------------ #

    
# quick test (do not modify this)
test_doc = "This is a test. Does this work?"
result = text_to_lemma_frequencies(test_doc)
passed = result == nltk.probability.FreqDist(["test","work"])
if passed:
    print ("Test passed!")
else:
    print("Test did not pass yet.")
    if type(result) == type(nltk.probability.FreqDist(["a"])):
        print("got these words:", result.keys(),\
              "\nwith these counts:", result.values())
    else:
        print("Did not return a FreqDist object.")

Now, let's **build a word cloud** for the book "[The Wonderful Wizard of Oz](http://www.gutenberg.org/cache/epub/55/pg55.txt)."

In [0]:
# Get the word frequency distribution
freq_dist = text_to_lemma_frequencies(document)

# Use default colormap
colormap = None
# Bonus: try out some other matplotlib colormaps
#colormap = "spring" # see more here: https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html

# Call the function to draw the word cloud
draw_wordcloud(freq_dist, colormap)

In [0]:
#@title Sample Solution (double-click to view) Run to load sample solution. {display-mode: "form"}

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set([])
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'
    
# quick test:
test_doc = "This is a test. Does this work?"
result = text_to_lemma_frequencies(test_doc)
passed = result == nltk.probability.FreqDist(["test","work"])
if passed:
    print ("Test passed!")
else:
    print("Test did not pass yet.")
    if type(result) == type(nltk.probability.FreqDist(["a"])):
        print("got these words:", result.keys(),\
              "\nwith these counts:", result.values())
    else:
        print("Did not return a FreqDist object.")

### Bonus: Zipf's Law

- Let's check the frequency distribution over the top N words in the book.

In [0]:
top_n_words = 100
freq_dist.plot(top_n_words, cumulative=False)
plt.show()

- You've just observed (a "Wizard of Oz" version of) [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law)  at work!

- Remember that we've also removed stopwords. 

- _Try this_: 
    - Load the sample `text_to_lemma_frequencies()` function, then run the code below to see what this looks like with stopwords.

    - Pay attention to how the y-axis is different from the example above.

    - Compare the result to [this example](https://phys.org/news/2017-08-unzipping-zipf-law-solution-century-old.html).

In [0]:
freq_dist = text_to_lemma_frequencies(document, remove_stop_words=False)
top_n_words = 100
freq_dist.plot(top_n_words, cumulative=False)
plt.show()

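- Zipf's Law says that a word's frequency is roughly proportional to 1/rank, so rank * frequency should stay roughly constant. The sketch below tabulates that product. The function name `rank_frequency_table` is ours, and the counts are synthetic numbers chosen to follow the law exactly. Because a `FreqDist` is a kind of `Counter`, the same function should also accept `freq_dist`, where the products will only be approximately constant.

```python
import collections

def rank_frequency_table(counts, top_n=5):
    """Return (rank, word, frequency, rank * frequency) for the top_n words."""
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Synthetic counts that follow Zipf's law exactly: frequency = 1200 / rank.
ideal = collections.Counter({"the": 1200, "and": 600, "to": 400, "of": 300, "a": 240})
for row in rank_frequency_table(ideal):
    print(row)
```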
---
## B - Corpus-level Processing

In [0]:
#@title Skipped part A? Run this cell to load code needed moving forward. {display-mode: "form"}

print("Make sure that you have run 'Initial Setup'!")
# Setup from part A

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set([])
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'
    
print("Otherwise, you're now ready for part B.")

### Matrix Representations

- Representing documents as vectors of words gets us one step closer to using traditional data science approaches.

- However, never forget that we're still working with language data!

- **How do we get a corpus matrix?**


- First, we'll load a small corpus into memory:

In [0]:
# from the Stanford Movie Reviews Data: 
# http://ai.stanford.edu/~amaas/data/sentiment/

# we downloaded this during our initial Setup
movie_review_dir = "aclImdb/train/unsup/"
movie_review_files = os.listdir(movie_review_dir)
n_movie_reviews = []
n = 50
for txt_file_path in sorted(movie_review_files, \
                            key=lambda x:int(x.split('_')[0]))[:n]:
    full_path = os.path.join(movie_review_dir, txt_file_path)
    with open(full_path, 'r', encoding='utf-8') as txt_file:
        n_movie_reviews.append(txt_file.read())
            
print("Loaded",len(n_movie_reviews),"movie reviews from the Stanford IMDB " + \
      "corpus into memory.")

- Start by getting a bag-of-words representation for each review.
- Then, create a mapping between the full vocabulary and columns for our matrix.

In [0]:
review_frequency_distributions = []

# process each review, one at a time
for review in n_movie_reviews:
    
    # let's use our function from before
    frequencies = text_to_lemma_frequencies(review)
    review_frequency_distributions.append(frequencies)

# use a dictionary for faster lookup
vocab2index = {}
latest_index = 0
for rfd in review_frequency_distributions:
    for token in rfd.keys():
        if token not in vocab2index:
            vocab2index[token] = latest_index
            latest_index += 1
    
print("Built vocab lookup for vocab of size:",len(vocab2index))

- Given the frequencies and this index lookup, we can build a frequency matrix (as a numpy array).

In [0]:
# make an all-zero numpy array with shape n x v
# n = number of documents
# v = vocabulary size
corpus_matrix = np.zeros((len(review_frequency_distributions), len(vocab2index)))

# fill in the numpy array
for row, rfd in enumerate(review_frequency_distributions):
    for token, frequency in rfd.items():
        column = vocab2index[token]
        corpus_matrix[row][column] = frequency

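- To make the three steps concrete (bag of words, vocabulary lookup, matrix), here is the same idea on a toy corpus small enough to check by hand. The two documents are made up, tokens come from a plain `split()`, and a list of lists stands in for the numpy array.

```python
import collections

docs = ["good movie good plot", "bad movie"]

# One bag of words (token -> count) per document.
bags = [collections.Counter(doc.split()) for doc in docs]

# Assign each new token the next free column index.
vocab2index = {}
for bag in bags:
    for token in bag:
        if token not in vocab2index:
            vocab2index[token] = len(vocab2index)

# Fill an n x v matrix, represented here as a list of rows.
matrix = [[0] * len(vocab2index) for _ in docs]
for row, bag in enumerate(bags):
    for token, count in bag.items():
        matrix[row][vocab2index[token]] = count

print(vocab2index)
for row in matrix:
    print(row)
```

Each row is a document and each column is a word, so the column for "movie" is the only one that is nonzero in both rows.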
In [0]:
# get some basic information about our matrix
def print_matrix_info(m):
    print("Our corpus matrix is",m.shape[0],'x',m.shape[1])
    # percentage of entries that are nonzero (low density = high sparsity)
    print("Density is:",round(float(100 * np.count_nonzero(m))/ \
                           (m.shape[0] * m.shape[1]),2),"%")

print_matrix_info(corpus_matrix)

- Now that we've seen how this works, let's see how some existing Python functions can do the heavy lifting for us.
- scikit-learn has some useful feature extraction methods:

In [0]:
# we can get a similar corpus matrix with just 3 lines of code
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
sklearn_corpus_data = vectorizer.fit_transform(n_movie_reviews)
sklearn_corpus_matrix = sklearn_corpus_data.toarray()

# get the feature names (1:1 mapping to the columns in the matrix)
# (scikit-learn versions before 1.0 call this get_feature_names())
print("First 10 features:",vectorizer.get_feature_names_out()[:10])
print()

# let's check out the matrix
print_matrix_info(sklearn_corpus_matrix)

- These matrices are typically _very_ sparse.
- It's worth considering [different representations](https://docs.scipy.org/doc/scipy/reference/sparse.html) if memory is a concern.
    - Save space by only storing nonzero entries.

In [0]:
# E.g., using a CSR matrix representation
csr_corpus_matrix = scipy.sparse.csr_matrix(corpus_matrix)
print("Original matrix: using", asizeof.asizeof(corpus_matrix)/1000,"kB")
print("CSR matrix: using", asizeof.asizeof(csr_corpus_matrix)/1000,"kB")

- There will be a trade-off between memory usage and speed of operations.
    - Consider the strengths and weaknesses of each representation.
        - e.g., CSR has fast row-level operations, but slow column-level operations.

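- Here is why row and column access differ in CSR. The format stores only the nonzero values, row by row, in three arrays. The pure-Python sketch below uses the same array names as `scipy.sparse.csr_matrix` (`data`, `indices`, `indptr`), but the helper functions are our own simplified illustration, not scipy's implementation.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))   # where the next row starts in data/indices
    return data, indices, indptr

def get_row(csr, i):
    """Row access: one slice of the stored entries."""
    data, indices, indptr = csr
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))

def get_column(csr, j):
    """Column access: every stored entry has to be checked."""
    data, indices, indptr = csr
    column = {}
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == j:
                column[i] = data[k]
    return column

dense = [[2, 0, 0, 1], [0, 0, 3, 0], [0, 4, 0, 5]]
csr = to_csr(dense)
print(csr)
print(get_row(csr, 2), get_column(csr, 3))
```

Reading a row only needs the two `indptr` offsets, while reading a column walks through all stored entries.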
### Document Retrieval and Similarity

- With this matrix, it's very easy to find all documents containing a specific word.

In [0]:
search_term = "funny"
if search_term in vocab2index:
    search_index = vocab2index[search_term]
    matches = [i for i in range(corpus_matrix.shape[0]) \
           if corpus_matrix[i][search_index]!=0]

    # list the documents that contain the search term
    print("These documents contain '"+search_term+"':",matches)
    print()

    # show excerpt where this word appears
    # (search the lowercased text, since our lemmas were lowercased)
    example_doc = n_movie_reviews[matches[0]]
    example_location = max(example_doc.lower().find(search_term), 0)
    start,end = max(example_location-30,0), min(example_location+30,len(example_doc))
    print('For example: "...',example_doc[start:end],'..."')
    
else:
    print(search_term,"isn't in the sample corpus.")

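- Scanning a matrix column works, but a common alternative for retrieval is an inverted index, which maps each word directly to the documents that contain it. Below is a minimal sketch with made-up documents and whitespace tokenization. The name `build_inverted_index` is ours.

```python
import collections

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = collections.defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in doc.lower().split():
            index[token].add(doc_id)
    return index

docs = ["A funny movie", "Not funny at all", "A sad movie"]
index = build_inverted_index(docs)
print(sorted(index["funny"]))
print(sorted(index["movie"]))
```

Looking up a word is then a single dictionary access, and documents that contain two words are the intersection of two sets.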
- We can even use the notion of vector representations to compute the similarity between two documents.

    - (we'll talk about more advanced ways to approach this task later in the tutorial)

In [0]:
example_docs =[ "My dog likes to eat vegetables",\
                "Your dog likes to eat fruit",\
                "The computer is offline",\
                "A computer shouldn't be offline" ]

vectorizer = sklearn.feature_extraction.text.CountVectorizer()
example_data = vectorizer.fit_transform(example_docs)
example_matrix = example_data.toarray()

sim_0_1 = 1-scipy.spatial.distance.cosine(example_matrix[0],example_matrix[1])
sim_2_3 = 1-scipy.spatial.distance.cosine(example_matrix[2],example_matrix[3])
sim_0_2 = 1-scipy.spatial.distance.cosine(example_matrix[0],example_matrix[2])

print("Similarity between 0 and 1:",round(sim_0_1,2))
print("Similarity between 2 and 3:",round(sim_2_3,2))
print("Similarity between 0 and 2:",round(sim_0_2,2))

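- Cosine similarity is the dot product of two vectors divided by the product of their lengths. A from-scratch sketch is shown below. The five-word vocabulary in the comment is a simplified, hypothetical one, so the number differs from the `CountVectorizer` output above.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Count vectors over the vocabulary [dog, eat, fruit, likes, vegetables]
doc_a = [1, 1, 0, 1, 1]
doc_b = [1, 1, 1, 1, 0]
print(round(cosine_similarity(doc_a, doc_b), 2))
```

The two documents share three of their four words, giving 3 / (2 * 2) = 0.75. Documents with no words in common score 0.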
- We can do the same thing with our corpus of movie reviews:

In [0]:
# choose a document, and find the most "similar" other document in the corpus
reference_doc = 0
ref_doc_vec = corpus_matrix[reference_doc]
sim_to_ref_doc = []
for row in corpus_matrix:
    sim_to_ref_doc.append(1-scipy.spatial.distance.cosine(ref_doc_vec,row))
    
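print("similarity scores:",sim_to_ref_doc)
# find the best match, ignoring the reference document itself
other_docs = [i for i in range(len(sim_to_ref_doc)) if i != reference_doc]
most_similar = max(other_docs, key=lambda i: sim_to_ref_doc[i])
print(n_movie_reviews[reference_doc])
print("is most similar to")
print(n_movie_reviews[most_similar])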

### Putting it together: Simple Document Clustering

- Let's apply the document-to-matrix idea to do some simple clustering.
- First, let's load a dataset that should exhibit some natural groupings based on topic.
    - [20news](http://qwone.com/~jason/20Newsgroups/) is a classic NLP dataset for document classification.

In [0]:
# load 20 newsgroups dataset - just 100 texts from 2 categories
categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.baseball']
newsgroups_train_all = sklearn.datasets.fetch_20newsgroups(subset='train',\
                                                 categories=categories)
newsgroups_train = newsgroups_train_all.data[:100]
newsgroups_labels = newsgroups_train_all.target[:100]

print("Loaded",len(newsgroups_train),"documents.")
print("Label distribution:",collections.Counter(newsgroups_labels))

**Exercise 2**

- Now, write a function that creates a corpus matrix from a list of strings containing documents.
    - We can use the `text_to_lemma_frequencies` function that you wrote earlier as a starting point!

In [0]:
# ------------- Exercise 2 -------------- #
def docs2matrix(document_list):
    
    # this should be a nice starting point
    lemma_freqs = [text_to_lemma_frequencies(doc) for doc in document_list]

    # change this to return a 2d numpy array
    return None

# -------------     End    -------------- #

# quick test with first 10 documents
X = docs2matrix(newsgroups_train[:10])
if type(X) != type(np.zeros([3,3])):
    print("Did not return a 2d numpy matrix.")
elif X.shape[0] != 10:
    print("number of rows should be 10, but is",X.shape[0])
else:
    print("Created a matrix with shape:",X.shape)

In [0]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

def docs2matrix(document_list):
    
    # use the vocab2index idea from before
    vocab2index = {}
    
    # this should be a nice starting point
    lemma_freqs = [text_to_lemma_frequencies(doc) for doc in document_list]

    latest_index = 0
    for lf in lemma_freqs:
        for token in lf.keys():
            if token not in vocab2index:
                vocab2index[token] = latest_index
                latest_index += 1
    
    # create the zeros matrix
    corpus_matrix = np.zeros((len(lemma_freqs), len(vocab2index)))
    
    for row, lf in enumerate(lemma_freqs):
        for token, frequency in lf.items():
            column = vocab2index[token]
            corpus_matrix[row][column] = frequency
    
    # return the 2d numpy array
    return corpus_matrix


# quick test with first 10 documents
X = docs2matrix(newsgroups_train[:10])
if type(X) != type(np.zeros([3,3])):
    print("Did not return a 2d numpy matrix.")
elif X.shape[0] != 10:
    print("number of rows should be 10, but is",X.shape[0])
else:
    print("Created a matrix with shape:",X.shape)

- Let's visualize the data in 2 dimensions.
    - We'll use [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) to do the dimensionality reduction.
    - Each color (red and blue) will represent one of the "ground truth" clusters.

In [0]:
# show corpus in 2d

X = docs2matrix(newsgroups_train)
print("Created a matrix with shape:",X.shape)
tsne = sklearn.manifold.TSNE(n_components=2, random_state=1)
X_2d = tsne.fit_transform(X)
colors = ['r', 'b']
target_ids = range(len(categories))
for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[newsgroups_labels == target, 0], X_2d[newsgroups_labels == target, 1], c=c, label=label)

- The groups have a fair degree of overlap. Can kmeans clustering recover them correctly?

In [0]:
# Do kmeans clustering
# ("lloyd" is standard k-means; scikit-learn versions before 1.1 call it "full")

kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0, algorithm="lloyd").fit(X)
clusters = kmeans.labels_

for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[clusters == target, 0], X_2d[clusters == target, 1], c=c, label=label)

# our own purity function
def compute_average_purity(clusters, labels):
    # group the true labels by assigned cluster
    cluster_labels = collections.defaultdict(list)
    for cluster, label in zip(clusters, labels):
        cluster_labels[cluster].append(label)
    # purity of a cluster = share of its members with the most common label
    cluster_purities = {}
    for cluster, members in cluster_labels.items():
        most_common_count = collections.Counter(members).most_common()[0][1]
        cluster_purities[cluster] = float(most_common_count)/len(members)
    avg_purity = sum(cluster_purities.values())/len(cluster_purities)
    print("Average cluster purity:",avg_purity)
    return avg_purity
    
avg_purity = compute_average_purity(clusters, newsgroups_labels)

- That didn't work as well as we'd like it to.
- It's time to introduce better features than just word frequencies.
    - TF-IDF to the rescue!
    


### TF-IDF

- Some words are less important when making distinctions between documents in a corpus.
- How can we determine the "less important" words?
    - Using term frequency * inverse document frequency (TF-IDF), we make the assumption that words that appear in *many documents* are *less informative* overall.
    - Therefore, we weight each term by the inverse of the number of documents in which it appears.
    - We can define $\operatorname{tfidf}(t,d,D) = \operatorname{tf}(t,d) \cdot \log\frac{|D|}{|\{d \in D : t \in d\}|}$, where
        - $t$ is a term (token) in a corpus
        - $d$ is a document in the corpus
        - $D$ is the corpus itself, containing documents, which, in turn, contain tokens
        - $\operatorname{tf}(t,d)$ is the frequency of $t$ in $d$ (typically normalized at the document level).
- sklearn has another vectorizer that takes care of this for us: the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
    - It behaves just like the CountVectorizer() that we saw before, except it computes tfidf scores instead of counts!

- Of course we can just use the TfidfVectorizer, but what would it look like to implement this ourselves?

In [0]:
# assume input matrix contains term frequencies
def tfidf_transform(mat):
    
    # convert matrix of counts to matrix of normalized frequencies
    normalized_mat = mat / np.transpose(mat.sum(axis=1)[np.newaxis])
    
    # compute IDF scores for each word given the corpus:
    # log(number of documents / number of documents containing the word)
    docs_using_terms = np.count_nonzero(mat,axis=0)
    idf_scores = np.log(mat.shape[0]/docs_using_terms)
    
    # compute tfidf scores
    tfidf_mat = normalized_mat * idf_scores
    return tfidf_mat

tfidf_X = tfidf_transform(X)
print("Counts:",X[0][0:10])
print("TFIDF scores:",tfidf_X[0][0:10])

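- To check the formula by hand, here is the same computation on a three-document toy corpus in plain Python. The documents are made up, `tf` is normalized by document length, and `idf` assumes the term occurs in at least one document (otherwise it would divide by zero).

```python
import math

docs = [["ball", "game", "game"], ["disk", "game"], ["disk", "drive"]]

def tf(term, doc):
    """Term frequency normalized by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(|D| / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

for term in ["ball", "game"]:
    print(term, round(tfidf(term, docs[0], docs), 3))
```

In the first document, "game" occurs twice and "ball" once, yet "ball" gets the higher score (about 0.366 versus 0.27) because it appears in only one of the three documents.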
- What happens if we use tfidf instead of just counts or frequencies?

In [0]:
# show corpus in 2d

#X = docs2matrix(newsgroups_train)
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups_train).toarray()
print("Created a matrix with shape:",X.shape)
tsne = sklearn.manifold.TSNE(n_components=2, random_state=1)
X_2d = tsne.fit_transform(X)
colors = ['r', 'b']
target_ids = range(len(categories))
for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[newsgroups_labels == target, 0], X_2d[newsgroups_labels == target, 1], c=c, label=label)

- These groups appear to have a bit more separation.
- How well can kmeans recover the original groups now?

In [0]:
# Do kmeans clustering with TF-IDF matrix
# ("lloyd" is standard k-means; scikit-learn versions before 1.1 call it "full")
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0, algorithm="lloyd").fit(X)
clusters = kmeans.labels_

for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[clusters == target, 0], X_2d[clusters == target, 1], c=c, label=label)
    
avg_purity = compute_average_purity(clusters, newsgroups_labels)

### Bonus: SpaCy
- If you have extra time, check out the [SpaCy 101 tutorial](https://spacy.io/usage/spacy-101)!
    - SpaCy is less research focused, but after you have a good grasp on the core concepts, it can provide a powerful set of NLP tools, and it is definitely worth knowing about.
        - It is also often faster to run than NLTK.
        - (we will time our nltk version first, for reference)

In [0]:
%timeit docs2matrix(newsgroups_train)

In [0]:
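# Example preprocessing with SpaCy
def text_to_lemma_frequencies(text):
    # 'en_core_web_sm' is the small English model; older spaCy versions
    # also accepted the shortcut 'en'. If it is missing, install it with:
    #   !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # token.lemma_ is the lemma string (token.lemma is its integer ID)
    words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return collections.Counter(words)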

In [0]:
# Example document matrix building 
X = docs2matrix(newsgroups_train)
print("Created a matrix with shape:",X.shape)

In [0]:
%timeit docs2matrix(newsgroups_train)

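- Why so slow?
    - We reload the SpaCy model on every call, and the default pipeline runs components (such as the parser and named entity recognizer) that we don't need here.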

In [0]:
# load the model once, without the components we don't need
NLP = spacy.load('en_core_web_sm',disable=['ner','parser'])
def text_to_lemma_frequencies(text):    
    doc = NLP(text)
    words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return collections.Counter(words)

In [0]:
%timeit docs2matrix(newsgroups_train)

- That's all of the basic text processing that we're going to cover for now.

- [-> Next: Noisy Text Processing](https://colab.research.google.com/drive/1VlRz-wKYmsQ4gRHb02uLav8RodpvsCNG)