<a href="https://colab.research.google.com/github/ajaykrishna2013/NLP/blob/main/w3_Quiz_NLP_NLU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color=brown><i><b>Copyright Information:</b></i> All course material is copyrighted by the author and Stanford University with all rights reserved. The material may not be reproduced or distributed without an explicit written permission of the author. Stanford University also has a policy that one should know before audio/video recording. Do not post material online.</font>

# **w3 Quiz. NLP and NLU**

Make a copy of this Colab Notebook with the **starter code** below and continue building your solution **in Colab** (not another Python environment) to ensure an exact environment and matching solutions.

Prof. Melnikov's video in module 3 demonstrates several ways to **tokenize sentences** and **words**. Here we evaluate their effectiveness in reducing the dimensionality of the vocabulary, while maintaining the "quality" of the tokens. We define "quality" with a binary function with *high* quality for words found in some commonly accepted vocabulary (such as [**Brown corpus**](https://www.nltk.org/book/ch02.html) from **NLTK library**). Other words are considered as *low* quality. This will require Python **set operations**, since we need to check whether a given word is in the vocabulary set or not.
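The binary "quality" function described above can be sketched with a plain set membership test. This is a minimal illustration, assuming a tiny hand-made vocabulary `SsRefVcb` as a stand-in for the Brown vocabulary; the tokens and the helper name `is_high_quality` are hypothetical.

```python
# Hypothetical stand-in for the Brown vocabulary set (SsBrownVcb in the starter code).
SsRefVcb = {"the", "rocket", "launch", "orbit"}

def is_high_quality(sWord, SsVcb):
    """Binary quality: True (high) if the word is in the reference vocabulary, else False (low)."""
    return sWord in SsVcb  # set membership is an average O(1) lookup

# Made-up tokens: two of them are not proper words.
LsTokens = ["the", "rocket", "launch", "launche", "orbit", "orbitin"]
LbQuality = [is_high_quality(w, SsRefVcb) for w in LsTokens]
nPctHigh = sum(LbQuality) / len(LbQuality) * 100  # percent of high-quality tokens
print(LbQuality, nPctHigh)
```

With the real data, `SsVcb` would be `SsBrownVcb`, and the test is case sensitive unless the tokens are normalized first.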

Most evaluations here are done on the first 100 posts from each newsgroup of the [**20 Newsgroups corpus**](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) in the **Scikit-Learn library**. We then use an average metric (the **median**, in this case) to compute the final quantity, which you will submit to the Canvas Quiz.

This quiz is a sequence of small projects, requiring computation of values. Do not round any values. If stuck for 10-15 minutes, raise a question in the w3 Discussion Q&A. See Syllabus for **Posting Guidelines**. In particular, describe your situation and what you have tried **without posting your code**.

To ease code readability and avoid confusion, we use the following prefix convention for key variables:
1. `s`=string, `n`=number, `b`=Boolean
1. `Ls`=list of strings, `Ln`=list of numbers, `Ss`=set of strings, `As`=NumPy array of strings, `Ds`=dictionary of string values
1. `LLs`=list of lists of strings, `LTs`=list of tuples of strings, etc.
1. `df`=Pandas dataframe or series

<p><font color=gray><i>Hint</i>: Refer to videos in Module 3 and some Python refresher videos of Corey Schafer.</font>

In [1]:
!pip -q install contractions   # quietly install contractions package
# allows multiple outputs from a single Colab code cell:
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

import sys, matplotlib.pylab as plt, re, platform, matplotlib
import numpy as np, pandas as pd, nltk, sklearn, spacy, unicodedata, contractions 
from collections import Counter
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
tmp = nltk.download(['brown', 'stopwords','punkt','wordnet'], quiet=True) # See https://www.nltk.org/book/ch02.html
LsStopWords = nltk.corpus.stopwords.words('english')
from nltk.corpus import brown

# Increase viewable area of Pandas tables, NumPy arrays, plots
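# Fully qualified option names: short names such as 'max_rows' are ambiguous in recent pandas versions
pd.set_option('display.max_rows', 5, 'display.max_columns', 500, 'display.max_colwidth', 1, 'display.precision', 2)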
np.set_printoptions(linewidth=10000, precision=4, edgeitems=20, suppress=True)
plt.rcParams['figure.figsize'] = [16, 4]

def LoadNews(cat=['sci.space'], TopN=100):
    '''Function to load a string of news posts for the specified categories. Returns: TopN concatenated news'''
    Rem = ('headers', 'footers', 'quotes')  # remove these fields from result set
    bunch = fetch_20newsgroups(categories=cat, subset='test', shuffle=False, remove=Rem)
    return '\n'.join(bunch.data[:TopN])  # save first 100 posts concatenated as a single string.

# See doc: https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset
# We preload string variables containing concatenated news posts 
sNews = LoadNews(['sci.space'])   # a string. news from space ;)  
LsTgtNames = list(fetch_20newsgroups().target_names)   # names of 20 newsgroups

pso = nltk.stem.PorterStemmer()       # instantiates Porter Stemmer object
wlo = nltk.stem.WordNetLemmatizer()   # instantiates WordNet lemmatizer object
SsBrownVcb = set(brown.words())       # Vocabulary of 56057 words in Brown Corpus

# store sentence tokenizers' results as a list of lists of strings:
nlp = spacy.load('en_core_web_sm')
LLsST =  [sNews.split('. ')] \
    + [nltk.sent_tokenize(sNews)] \
    + [nltk.tokenize.PunktSentenceTokenizer().tokenize(sNews)] \
    + [[n.text for n in nlp(sNews).sents]]

# store word tokenizers' results as a list of lists of strings
LLsWT = [sNews.split()] \
    + [nltk.RegexpTokenizer(pattern=r"\s+", gaps=True ).tokenize(sNews)] \
    + [nltk.RegexpTokenizer(pattern=r"\s+", gaps=True ).tokenize(sNews)] \
    + [nltk.WhitespaceTokenizer().tokenize(sNews)] \
    + [nltk.RegexpTokenizer(pattern=r"\w+", gaps=False).tokenize(sNews)] \
    + [nltk.word_tokenize(sNews)] \
    + [nltk.TreebankWordTokenizer().tokenize(sNews)] \
    + [[t.text for t in nlp(sNews)]] \
    + [nltk.tokenize.toktok.ToktokTokenizer().tokenize(sNews)] \
    + [nltk.WordPunctTokenizer().tokenize(sNews)]



Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [2]:
# Example of news article
pprint(sNews[0:100])

("I'm afraid I was not able to find the GIFs... is the list \n"
 'updated weekly, perhaps, or am I just mis')


In [3]:
# Examples of parsed sentences by different tokenizers
[{i:LLsST[j][i][:100] for i in [0,1]} for j in range(len(LLsST))]

[{0: "I'm afraid I was not able to find the GIFs..",
  1: 'is the list \nupdated weekly, perhaps, or am I just missing something?\n\nThe forces and accelerations '},
 {0: "I'm afraid I was not able to find the GIFs... is the list \nupdated weekly, perhaps, or am I just mis",
  1: 'The forces and accelerations involved in doing a little bit of orbital\nmaneuvering with HST aboard a'},
 {0: "I'm afraid I was not able to find the GIFs... is the list \nupdated weekly, perhaps, or am I just mis",
  1: 'The forces and accelerations involved in doing a little bit of orbital\nmaneuvering with HST aboard a'},
 {0: "I'm afraid I was not able to find the GIFs... is the list \nupdated weekly, perhaps, or am I just mis",
  1: 'The forces and accelerations involved in doing a little bit of orbital\nmaneuvering with HST aboard a'}]

In [4]:
# Examples of parsed words by different tokenizers
[LLsWT[i][0:9] for i in range(len(LLsWT))]

[["I'm", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find', 'the'],
 ["I'm", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find', 'the'],
 ["I'm", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find', 'the'],
 ["I'm", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find', 'the'],
 ['I', 'm', 'afraid', 'I', 'was', 'not', 'able', 'to', 'find'],
 ['I', "'m", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find'],
 ['I', "'m", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find'],
 ['I', "'m", 'afraid', 'I', 'was', 'not', 'able', 'to', 'find'],
 ['I', "'", 'm', 'afraid', 'I', 'was', 'not', 'able', 'to'],
 ['I', "'", 'm', 'afraid', 'I', 'was', 'not', 'able', 'to']]

## **P1. Sentence Tokenizers**

Compute the mean (average) length of a sentence (as a count of characters) for each tokenizer and save it to `LnMeanSentLen`. Compute the difference between maximal mean length and minimal mean length.

<p><font color="gray"><i>Takeaway:</i> 
Notice the drastic difference in sentence lengths among these tokenizers. Wow! Investigate the shortest and longest sentences. Are they parsed correctly? What sentence separator can be used to tokenize these correctly? Which sentence tokenizer appears most/least reliable? Which one is fastest (if you tried timing these)? Are there tuning parameters for poorly performing parsers to improve their tokenization results? How would you pre-process the corpus to improve sentence tokenization?
<br><i>Hints</i>: The answer is in [100, 200] interval. It might be easier if you use list comprehensions and <code>np.mean()</code> function</font>

In [5]:
# Example. LnMeanSentLen=[6.0, 14.0]. Then max - min of these values yields 8. 
LLsST_ = [['5char','7  char'],['5char','7  char','a sentence with 30 characters ']]
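The toy example above can be checked with a short sketch. It uses the standard-library `statistics.mean` only to stay dependency-free; `np.mean()` from the hint would be used the same way. The trailing-underscore names are illustrative and not part of the starter code.

```python
from statistics import mean

# Toy input from the example cell: one list of sentences per tokenizer.
LLsST_ = [['5char', '7  char'],
          ['5char', '7  char', 'a sentence with 30 characters ']]

# Mean sentence length (in characters) for each tokenizer.
LnMeanSentLen_ = [mean(len(s) for s in Ls) for Ls in LLsST_]

# Spread between the largest and the smallest mean length.
nDiff_ = max(LnMeanSentLen_) - min(LnMeanSentLen_)
print(LnMeanSentLen_, nDiff_)  # means 6 and 14, difference 8
```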

In [6]:
num_st_tokenizers = len(LLsST)
LnMeanSetD = {}
for tkn_num in range(num_st_tokenizers):
  num_st = len(LLsST[tkn_num])
  LnMeanSetD[tkn_num] = []
  for st_idx in range(num_st):
    LnMeanSetD[tkn_num].append(len(LLsST[tkn_num][st_idx]))

In [7]:
LnMeanSetLen = {}
max_, min_ = (float("-inf"), float("inf"))
for i in range(len(LnMeanSetD)):
  LnMeanSetLen[i] = np.mean(LnMeanSetD[i])
  max_, min_ = max(max_, LnMeanSetLen[i]), min(min_, LnMeanSetLen[i])
LnMeanSetLen, max_, min_, max_ - min_

({0: 212.84369449378332,
  1: 136.66975666280416,
  2: 137.19651162790697,
  3: 79.57915567282322},
 212.84369449378332,
 79.57915567282322,
 133.2645388209601)

## **P2. Word Tokenizers**

Consider a list of lists of word strings, <code>LLsWT</code>, created above. <b>Compute</b> <code>abs(nMaxWordLength-nMaxWordCount)</code>

1. `nMaxWordCount` = max count of tokens across all tokenizers.
1. `nMaxWordLength` = max character length among all words from all tokenizers.

--------------

<font color=gray><i>Takeaway:</i> Are the longest and shortest word tokens meaningful? If not, should we avoid these or pre-process these in some way?
<br><i>Hint:</i> The answer is in the [23000, 24000] interval. It is easier to process <code>LLsWT</code> via loops, list comprehensions or Pandas dataframes.
</font>

In [8]:
# Example: nMaxWordCount=9 and nMaxWordLength=6. The absolute difference is |9-6| or 3
LLsWT_ =  [['I', "'m", 'afraid'], ['I', "'", 'm', 'afraid', 'I', 'was', 'not', 'able', 'to']]
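A minimal sketch of the two quantities on the toy input above, using generator expressions over the list of lists. The trailing-underscore names are illustrative only.

```python
# Toy input from the example cell: one token list per tokenizer.
LLsWT_ = [['I', "'m", 'afraid'],
          ['I', "'", 'm', 'afraid', 'I', 'was', 'not', 'able', 'to']]

# Largest number of tokens produced by any single tokenizer.
nMaxWordCount_ = max(len(Ls) for Ls in LLsWT_)

# Longest token (in characters) over all tokens from all tokenizers.
nMaxWordLength_ = max(len(w) for Ls in LLsWT_ for w in Ls)

nAbsDiff_ = abs(nMaxWordLength_ - nMaxWordCount_)
print(nMaxWordCount_, nMaxWordLength_, nAbsDiff_)  # 9 6 3
```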

In [9]:
num_st_tokenizers = len(LLsWT)
LnWordLenSet = {}
LnNumWordSet = {}
for tkn_num in range(num_st_tokenizers):
  num_st = len(LLsWT[tkn_num])
  LnWordLenSet[tkn_num] = []
  LnNumWordSet[tkn_num] = len(LLsWT[tkn_num])
  for wd_idx in range(num_st):
    LnWordLenSet[tkn_num].append(len(LLsWT[tkn_num][wd_idx]))
  

In [10]:
MaxWordLenSet = {i: max(LnWordLenSet[i]) for i in range(len(LnWordLenSet.keys()))}
print(MaxWordLenSet)
nMaxWordLength = max(MaxWordLenSet.values())
print(nMaxWordLength)
nMaxWordCount = max(LnNumWordSet.values())
print(nMaxWordCount)
print('diff', abs(nMaxWordCount - nMaxWordLength))

{0: 77, 1: 77, 2: 77, 3: 77, 4: 18, 5: 36, 6: 36, 7: 155, 8: 77, 9: 77}
155
23816
diff 23661


## **P3. Word Tokenizers - Largest Word**

Consider a list of lists <code>LLsWT</code> created above. <b>Compute</b> the length of the longest token containing <a href=http://sticksandstones.kstrom.com/appen.html>ASCII</a> letters, i.e. any of a-z or A-Z.

<p><i>Takeaway:</i> What additional preprocessing would you include to avoid semantic-less word tokens in your resulting vocabulary?

-----------

<font color=gray><p><i>Hint:</i> The answer is in the [0, 200] interval. You may need to flatten a list of lists of strings into just a list of strings for convenience. Then remove words that do not contain any ASCII letters. This can be done with the `search()` function from the `re` module and the `[a-zA-Z]` pattern.

In [11]:
# Example: The longest word token containing `[a-zA-Z]` is `'here.'` and has length 5.
LLsWT_ = [['I', "'", 'm', 'here', '. (2021)'], ['I\'m', 'here.', '(2021)']]
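The hint can be sketched on the toy input above: flatten, keep only tokens with at least one ASCII letter, then take the longest. Note that `re.search()` looks for a letter anywhere in the token, while `re.match()` would only test its first character. The intermediate names are illustrative.

```python
import re

# Toy input from the example cell.
LLsWT_ = [['I', "'", 'm', 'here', '. (2021)'], ["I'm", 'here.', '(2021)']]

# Flatten the list of lists into a single list of tokens.
LsFlat = [w for Ls in LLsWT_ for w in Ls]

# Keep tokens that contain at least one ASCII letter anywhere.
LsWithLetters = [w for w in LsFlat if re.search(r'[a-zA-Z]', w)]

# Longest surviving token and its length.
sLongest = max(LsWithLetters, key=len)
print(sLongest, len(sLongest))  # here. 5
```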

In [12]:
comp = re.compile(r'[a-zA-Z]')
max_len, max_word = 0, ''
for tkn_list in LLsWT:
  for word in tkn_list:
    # NOTE: match() anchors at the start of the token, so tokens that begin with a
    # non-letter are skipped here; search() would accept a letter anywhere in the token.
    if comp.match(word) and len(word) > max_len:
      max_word, max_len = word, len(word)
      

print('Largest Word', max_word, max_len)

Largest Word nic.funet.fi:/pub/astro/general/astroftp.txt 44


## **P4. Dimension Reduction: Lower Casing**

Here we compare the effectiveness of lower casing of a text corpus. Consider news text from `LoadNews(['rec.motorcycles'])` with original and lower casing. Compute the **percentage decrease in vocabulary size** of word tokens with and without lowercase pre-processing. Use `nltk.WordPunctTokenizer()` object to parse into word tokens. Then do the same for *each newsgroup element* of <code>LsTgtNames</code> list. Submit the median of these quantities.

If original and processed vocab sizes are <code>a</code> and <code>b</code>, then we want <code>(a-b)/a*100</code> as the percent decrease in vocab size.

**Toy example 1:** A string <code>"r you R user?"</code> has a vocabulary (i.e. unique word tokens) of 4 words; and its lower-case counterpart <code>"r you r user?"</code> has a vocabulary of 3 words (because "R" was replaced with "r"). So, the percentage decrease in vocabulary is (4-3)/4*100=25%. Submit 25
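The toy example can be reproduced with a few lines. This sketch assumes plain whitespace splitting, which matches the count of 4 words above; `nltk.WordPunctTokenizer()` would also emit `?` as its own token and give different counts. The function name `pct_vocab_decrease` is illustrative.

```python
def pct_vocab_decrease(sText, fTokenize=str.split):
    """Percent decrease in vocabulary size after lower casing the text."""
    nOrig = len(set(fTokenize(sText)))           # a: unique tokens in the original text
    nLower = len(set(fTokenize(sText.lower())))  # b: unique tokens after lower casing
    return (nOrig - nLower) / nOrig * 100        # (a-b)/a*100

print(pct_vocab_decrease("r you R user?"))  # 25.0
```

Passing a different tokenizer as `fTokenize` (for example the `tokenize` method of a tokenizer object) keeps the rest of the computation unchanged.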

**Example 2:** You should observe these intermediate results. Here the rows are ordered by `%improvement`:

|group|#orig_words|#lowcase_words|%improvement|
|-----|-----------|--------------|--------|
|rec.motorcycles|2712|2484|8.41|
|comp.sys.mac.hardware|2384|2183|8.43|

<p><font color="gray"><i>Takeaway:</i> Notice a healthy reduction in vocabulary size across these newsgroups by lower casing the text. Also, a median is a better measure of centrality (or average) than mean because the former is robust to extreme values (or outliers). FYI: <code>nltk.WordPunctTokenizer()</code> is fast, but offers poorer quality than spaCy's <code>nlp</code> model.
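The robustness of the median can be seen on a small made-up list. The percentages below are hypothetical (they are not quiz results); the last value plays the role of an outlier.

```python
from statistics import mean, median

# Hypothetical percent decreases for five newsgroups; 40.0 is an artificial outlier.
LnPct = [8.4, 8.6, 8.9, 9.1, 40.0]

nMean = mean(LnPct)      # pulled upward by the single outlier
nMedian = median(LnPct)  # middle value, unaffected by how extreme the outlier is
print(nMean, nMedian)
```

Replacing 40.0 by 400.0 would change the mean substantially while leaving the median at 8.9.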

-------

<font color=gray><i>Hint:</i> First write your code for a single newsgroup; then wrap it into a function; then call this function for every newsgroup. It's simpler to apply lower casing before tokenization. Note that a vocabulary is always a container of unique words.

<font color=gray>Just like a real vocabulary or dictionary, a digital vocabulary also contains unique words only. A big challenge in NLP is handling (storing and processing) huge vocabularies. In classical NLP we normalize words to reduce the vocabulary size with a minimal "loss of information". So, if your vocabulary contains 1 million words, then each word is essentially a 1 million dimensional one-hot vector (of zeros and a single 1 identifying the unique position of the word). If we reduce the vocabulary to 100K words, then each word is a 100K dimensional one-hot vector. Naturally, with shorter vector representations we can fit more words in compute memory and do more NLP magic. Pre-processing helps us reduce the vocabulary size.
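The one-hot representation mentioned above can be sketched for a three-word vocabulary. The vocabulary and the helper name `one_hot` are illustrative; a real vocabulary would simply make each vector much longer.

```python
# Hypothetical tiny vocabulary; sorting fixes a unique position for every word.
LsVocab = sorted({"the", "rocket", "launch"})   # ['launch', 'rocket', 'the']
DnIndex = {w: i for i, w in enumerate(LsVocab)}  # word -> position

def one_hot(sWord):
    """Vector of zeros with a single 1 at the word's position in the vocabulary."""
    LnVec = [0] * len(LsVocab)
    LnVec[DnIndex[sWord]] = 1
    return LnVec

print(one_hot("rocket"))  # [0, 1, 0]
```

The vector length equals the vocabulary size, which is why shrinking the vocabulary shortens every word's representation.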


In [13]:
from collections import Counter

In [14]:
df_improvement = pd.DataFrame(columns=['group', 'orig_words', '#lowercase_words', '%improvement'])
pd.set_option('display.max_rows', 85)

In [15]:
def compute_improvement(news_group):
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news)
  org_word_freq = Counter(token_list)
  num_org_word = len(org_word_freq)
  news_lower = news.lower()
  token_list_lower = nltk.WordPunctTokenizer().tokenize(news_lower)
  lower_word_freq = Counter(token_list_lower)
  num_lower_word = len(lower_word_freq)
  return {'group': news_group, 
          'orig_words': num_org_word, 
          '#lowercase_words': num_lower_word,
          '%improvement': ((num_org_word - num_lower_word)/num_org_word) * 100}


In [16]:
# DataFrame.append was removed in pandas 2.0; collect the row dicts and build the frame once
result = [compute_improvement(news_group) for news_group in LsTgtNames]
df_improvement = pd.DataFrame(result)

In [17]:
df_improvement.sort_values(by='%improvement')

| |group|orig_words|#lowercase_words|%improvement|
|--|--|--|--|--|
|8|rec.motorcycles|2712|2484|8.41|
|4|comp.sys.mac.hardware|2384|2183|8.43|
|19|talk.religion.misc|4450|4068|8.58|
|7|rec.autos|2935|2679|8.72|
|10|rec.sport.hockey|4026|3669|8.87|
|17|talk.politics.mideast|5704|5188|9.05|
|9|rec.sport.baseball|3499|3174|9.29|
|15|soc.religion.christian|5101|4614|9.55|
|18|talk.politics.misc|5255|4751|9.59|
|0|alt.atheism|4928|4444|9.82|


In [18]:
np.median(df_improvement['%improvement'])

9.856739800925276

In [19]:
moto_news = LoadNews(['comp.sys.mac.hardware'])
print(moto_news[:150])
token_list = nltk.WordPunctTokenizer().tokenize(moto_news)
org_word_freq = Counter(token_list)
print("# original Words", len(org_word_freq))
print('Most Common',org_word_freq.most_common()[:100])
print(f'Freq of "After"', org_word_freq.get('After'))






Don't forget the LAMG (Los Angeles Macintosh Group) BBS! It's the BBS for
the largest Mac-only user group in the country now that BMUG is
multi-pl
# original Words 2384
Most Common [('.', 443), ('the', 400), (',', 288), ('to', 223), ('a', 206), ('I', 205), ("'", 151), ('-', 150), ('and', 143), ('of', 127), ('it', 114), ('that', 105), ('is', 103), ('you', 88), ('(', 86), ('with', 85), ('on', 78), ('?', 76), ('in', 68), ('have', 68), ('for', 66), ('can', 62), ('s', 53), ('t', 52), ('be', 50), ('The', 48), (')', 47), ('"', 47), ('not', 47), ('are', 45), ('but', 41), ('this', 41), ('or', 37), (':', 37), ('as', 35), ('Apple', 32), ('/', 32), ('was', 31), ('at', 31), ('an', 30), ('one', 29), ('$', 28), ('Mac', 26), ('my', 26), ('from', 26), ('any', 26), ('when', 26), ('if', 26), ('has', 26), ('there', 25), ('do', 25), ('monitor', 24), ('get', 24), ('about', 24), ('all', 24), ('your', 23), ('will', 23), ('which', 23), ('!', 22), ('up', 22), ('just', 22), ('out', 22), ('It', 21), (').', 21

In [20]:
moto_news_lower = moto_news.lower()
print(moto_news_lower[:150])
token_list_lower = nltk.WordPunctTokenizer().tokenize(moto_news_lower)
lower_word_freq = Counter(token_list_lower)
print("# Lowercase Words", len(lower_word_freq))
print('Most Common',lower_word_freq.most_common()[:100])
print(f'Freq of "after"', lower_word_freq.get('after'))





don't forget the lamg (los angeles macintosh group) bbs! it's the bbs for
the largest mac-only user group in the country now that bmug is
multi-pl
# Lowercase Words 2183
Most Common [('the', 449), ('.', 443), (',', 288), ('to', 224), ('a', 215), ('i', 208), ("'", 151), ('-', 150), ('and', 144), ('it', 137), ('of', 127), ('is', 110), ('that', 106), ('you', 97), ('(', 86), ('with', 85), ('on', 78), ('?', 76), ('in', 73), ('can', 71), ('have', 69), ('for', 68), ('s', 59), ('this', 58), ('t', 53), ('not', 51), ('be', 50), ('are', 49), (')', 47), ('"', 47), ('if', 47), ('but', 42), ('or', 42), ('mac', 38), ('any', 38), ('as', 38), (':', 37), ('apple', 36), ('at', 35), ('was', 32), ('/', 32), ('my', 31), ('so', 31), ('do', 31), ('has', 31), ('one', 30), ('an', 30), ('there', 28), ('monitor', 28), ('when', 28), ('$', 28), ('from', 27), ('all', 27), ('what', 26), ('they', 24), ('get', 24), ('about', 24), ('will', 24), ('we', 24), ('your', 23), ('which', 23), ('!', 22), ('does', 22), ('up',

## **P5. Dimension Reduction: Contraction Expansion (CE)**

Similar to P4, measure the percent decrease in vocabulary size across multiple news groups due to the application of CE with the <code>contractions</code> package (without modifying). Apply CE before tokenization. See w3 Colab notebook.

**Example:** You should observe these intermediate results. Here the rows are ordered by `%improvement`:


|group|#orig_words|#CE_words|%improvement|
|--|--|--|--|
|talk.politics.mideast|5704|5685|0.33|
|soc.religion.christian|5101|	5083|	0.35|

<p><font color="gray"><i>Takeaway:</i> As expected, the CE is not as effective, but is complementary to lower casing and other techniques. Typically, we create pre-processing pipelines, where the order matters. For example, given "I'm here", CE+parsing yields <code>['I','am','here']</code>, while parsing+CE yields <code>['I am', 'here']</code>.
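The effect of pipeline order can be sketched with a toy expansion map. The two-entry dictionary `DsToyMap` and the helper `expand` are hypothetical stand-ins; they are not the `contractions` package, whose `contractions.fix()` is used in the actual solution.

```python
# Hypothetical mini-map of contractions (NOT the contractions package).
DsToyMap = {"I'm": "I am", "don't": "do not"}

def expand(sText):
    """Replace every contraction found in the toy map with its expansion."""
    for sShort, sLong in DsToyMap.items():
        sText = sText.replace(sShort, sLong)
    return sText

sText = "I'm here"
LsCEThenParse = expand(sText).split()               # expand first, then tokenize
LsParseThenCE = [expand(t) for t in sText.split()]  # tokenize first, then expand each token
print(LsCEThenParse)  # ['I', 'am', 'here']
print(LsParseThenCE)  # ['I am', 'here']
```

Expanding after tokenization leaves a multi-word token (`'I am'`) in the vocabulary, which is one reason the quiz applies CE before tokenization.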

---------------------

<font color="gray">Notice that we are measuring the percent decrease in vocabulary size due to application of contraction expansion only. The previous preprocessing of lower casing is not used here, since it would make it more difficult to measure the effect of expansion alone.</font>


In [21]:
cMap = contractions.contractions_dict


In [22]:
moto_news = LoadNews(['soc.religion.christian'])
#print(moto_news[:150])
token_list = nltk.WordPunctTokenizer().tokenize(moto_news)
org_word_freq = Counter(token_list)
print("# original Words", len(org_word_freq))
#print('Most Common',org_word_freq.most_common()[:100])


moto_news_exp = contractions.fix(moto_news)
#print(moto_news_exp[:150])
token_list_exp = nltk.WordPunctTokenizer().tokenize(moto_news_exp)
exp_word_freq = Counter(token_list_exp)
print("# CE Words", len(exp_word_freq))
#print('Most Common',exp_word_freq.most_common()[:100])

# original Words 5101
# CE Words 5083


In [23]:
df_improvement_exp = pd.DataFrame(columns=['group', 'orig_words', '#CE_words', '%improvement'])
pd.set_option('display.max_rows', 85)

In [24]:
def compute_exp_improvement(news_group):
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news)
  org_word_freq = Counter(token_list)
  num_org_word = len(org_word_freq)
  
  news_exp = contractions.fix(news)
  token_list_exp = nltk.WordPunctTokenizer().tokenize(news_exp)
  exp_word_freq = Counter(token_list_exp)
  num_exp_word = len(exp_word_freq)
  return {'group': news_group, 
          'orig_words': num_org_word, 
          '#CE_words': num_exp_word,
          '%improvement': ((num_org_word - num_exp_word)/num_org_word) * 100}

In [25]:
# DataFrame.append was removed in pandas 2.0; collect the row dicts and build the frame once
result = [compute_exp_improvement(news_group) for news_group in LsTgtNames]
df_improvement_exp = pd.DataFrame(result)

In [26]:
df_improvement_exp.sort_values(by='%improvement')

| |group|orig_words|#CE_words|%improvement|
|--|--|--|--|--|
|17|talk.politics.mideast|5704|5685|0.33|
|15|soc.religion.christian|5101|5083|0.35|
|14|sci.space|4929|4910|0.39|
|18|talk.politics.misc|5255|5232|0.44|
|6|misc.forsale|3827|3809|0.47|
|19|talk.religion.misc|4450|4428|0.49|
|0|alt.atheism|4928|4901|0.55|
|5|comp.windows.x|4163|4139|0.58|
|1|comp.graphics|4971|4942|0.58|
|13|sci.med|5095|5065|0.59|


In [27]:
np.median(df_improvement_exp['%improvement'])

0.6265741128351531

## **P6. Dimension Reduction: Stopwords Removal**

Similar to P4, measure the percent decrease in vocabulary size across multiple news groups due to the removal of stop words. Use <code>LsStopWords</code> defined above. Remove stopwords after tokenization. See w3 Colab.

**Example:** You should observe these intermediate results. Here the rows are ordered by `%improvement`:


|group|	#orig_words|	#important_words|	%improvement|
|--|--|--|--|
|talk.politics.mideast|	5704|	5568|	2.38|
|comp.graphics|	4971|	4842|	2.60|

<font color=gray><i>Takeaway:</i> Are you surprised by the average percent of stop words in these corpora?
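A toy sketch of the measurement, assuming a four-word stop list `SsToyStop` (hypothetical; the quiz uses `LsStopWords` from NLTK) and whitespace tokenization. Because the vocabulary holds unique words only, a stop word is removed at most once no matter how often it occurs in the text.

```python
# Hypothetical mini stop list (the quiz uses NLTK's LsStopWords instead).
SsToyStop = {"the", "is", "a", "to"}

LsTokens = "the rocket is ready to launch the rocket".split()

SsVocab = set(LsTokens)         # unique tokens: the, rocket, is, ready, to, launch
SsKept = SsVocab - SsToyStop    # set difference drops the stop words
nPct = (len(SsVocab) - len(SsKept)) / len(SsVocab) * 100
print(len(SsVocab), len(SsKept), nPct)  # 6 3 50.0
```

On a real corpus the vocabulary has thousands of entries while the stop list is short, so the percent decrease in vocabulary size is far smaller than in this toy case.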

In [55]:
def tokenize_news(sNews):
  LLWTs = [sNews.split()] \
      + [nltk.RegexpTokenizer(pattern=r"\s+", gaps=True ).tokenize(sNews)] \
      + [nltk.RegexpTokenizer(pattern=r"\s+", gaps=True ).tokenize(sNews)] \
      + [nltk.WhitespaceTokenizer().tokenize(sNews)] \
      + [nltk.RegexpTokenizer(pattern=r"\w+", gaps=False).tokenize(sNews)] \
      + [nltk.word_tokenize(sNews)] \
      + [nltk.TreebankWordTokenizer().tokenize(sNews)] \
      + [[t.text for t in nlp(sNews)]] \
      + [nltk.tokenize.toktok.ToktokTokenizer().tokenize(sNews)] \
      + [nltk.WordPunctTokenizer().tokenize(sNews)]

  return LLWTs

In [79]:
SsStopWords = set(LsStopWords)  # build the set once instead of once per token

def remove_stop_words(tokens):
  return [str(w) for w in tokens if str(w) not in SsStopWords]

In [80]:
moto_news = LoadNews(['comp.graphics'])
# moto_news[:100]
# type(moto_news)
# token_list = nltk.WordPunctTokenizer().tokenize(moto_news)
token_lists = tokenize_news(moto_news)
#token_list = nltk.tokenize.PunktSentenceTokenizer().tokenize(moto_news)
#token_list
token_lists_wo_stop_words = []
for token_list in token_lists:
  news_wo_stop_words = remove_stop_words(token_list)
  token_lists_wo_stop_words.append(news_wo_stop_words)

In [81]:
for i in  range(len(token_lists_wo_stop_words)):
  org_word_freq = Counter(token_lists[i])
  word_freq_wo_stop_words = Counter(token_lists_wo_stop_words[i])
  print("# of original Words:{}, # important_words: {} %improvement: {}".format(len(org_word_freq), len(word_freq_wo_stop_words),((len(org_word_freq) - len(word_freq_wo_stop_words)) / len(org_word_freq)) * 100))


# of original Words:6139, # important_words: 6013 %improvement: 2.0524515393386547
# of original Words:6139, # important_words: 6013 %improvement: 2.0524515393386547
# of original Words:6139, # important_words: 6013 %improvement: 2.0524515393386547
# of original Words:6139, # important_words: 6013 %improvement: 2.0524515393386547
# of original Words:4824, # important_words: 4695 %improvement: 2.674129353233831
# of original Words:4978, # important_words: 4858 %improvement: 2.4106066693451185
# of original Words:5350, # important_words: 5231 %improvement: 2.2242990654205608
# of original Words:5005, # important_words: 4889 %improvement: 2.317682317682318
# of original Words:5338, # important_words: 5210 %improvement: 2.397901835893593
# of original Words:4971, # important_words: 4842 %improvement: 2.595051297525649


In [86]:
df_improvement_stop = pd.DataFrame(columns=['group', 'orig_words', '#important_words', '%improvement'])
pd.set_option('display.max_rows', 85)

In [87]:
def compute_improvement_wo_stop_words(news_group):
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news)
  org_word_freq = Counter(token_list)
  num_org_word = len(org_word_freq)
  
  token_list_wo_stop = remove_stop_words(token_list)
  token_list_wo_stop_freq = Counter(token_list_wo_stop)
  num_word_wo_stop = len(token_list_wo_stop_freq)
  return {'group': news_group, 
          'orig_words': num_org_word, 
          '#important_words': num_word_wo_stop,
          '%improvement': ((num_org_word - num_word_wo_stop)/num_org_word) * 100}

In [88]:
# DataFrame.append was removed in pandas 2.0; collect the row dicts and build the frame once
result = [compute_improvement_wo_stop_words(news_group) for news_group in LsTgtNames]
df_improvement_stop = pd.DataFrame(result)

In [89]:
df_improvement_stop.sort_values(by='%improvement')

| |group|orig_words|#important_words|%improvement|
|--|--|--|--|--|
|17|talk.politics.mideast|5704|5568|2.38|
|1|comp.graphics|4971|4842|2.6|
|13|sci.med|5095|4960|2.65|
|18|talk.politics.misc|5255|5114|2.68|
|14|sci.space|4929|4795|2.72|
|15|soc.religion.christian|5101|4961|2.74|
|0|alt.atheism|4928|4786|2.88|
|6|misc.forsale|3827|3711|3.03|
|5|comp.windows.x|4163|4034|3.1|
|19|talk.religion.misc|4450|4309|3.17|


In [90]:
np.median(df_improvement_stop['%improvement'])

3.2236139252164304

## **P7. Dimension Reduction: Normalizing Accented Characters (NAC)**

Similar to P4, measure the percent decrease in vocabulary size across multiple news groups due to the removal of accent marks as we did in Module 3. Use <code>LsStopWords</code> defined above to retrieve smaller vocabularies. First do NAC, then tokenize. See w3 Colab. 

**Example:** You should observe these intermediate results. Here the rows are ordered by `%improvement`:


|group|	#orig_words|	#NAC_tokens|	%improvement|
|--|--|--|--|
|alt.atheism|	4928|	4928|	0.0|

<font color=gray><i>Takeaway:</i> Does the answer surprise you? Why do you think you are seeing this percentage reduction?
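A minimal sketch of accent normalization with the standard `unicodedata` module, on a made-up string. The helper name `strip_accents` and the toy text are illustrative; whitespace splitting is assumed for brevity.

```python
import unicodedata

def strip_accents(sText):
    """Decompose accented characters (NFKD), then drop everything that is not ASCII."""
    return unicodedata.normalize('NFKD', sText).encode('ascii', 'ignore').decode('ascii')

sToy = "café naïve cafe"  # made-up text with two accented words
nOrig = len(set(sToy.split()))                # 'café', 'naïve', 'cafe'
nNAC = len(set(strip_accents(sToy).split()))  # 'cafe', 'naive'
nPct = (nOrig - nNAC) / nOrig * 100
print(strip_accents(sToy), nOrig, nNAC, nPct)
```

The vocabulary only shrinks when an accented word collapses onto an unaccented word that is already present; text that is already pure ASCII passes through unchanged.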

In [35]:
moto_news = LoadNews(['alt.atheism'])
normalized_text = unicodedata.normalize('NFKD', moto_news).encode('ascii', 'ignore').decode('utf-8', 'ignore')
token_list = nltk.WordPunctTokenizer().tokenize(normalized_text)
len(Counter(token_list))

4928

In [36]:
df_improvement_nac = pd.DataFrame(columns=['group', 'orig_words', '#NAC_tokens', '%improvement'])
pd.set_option('display.max_rows', 85)

def compute_improvement_nac(news_group):
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news)
  org_word_freq = Counter(token_list)
  num_org_word = len(org_word_freq)
  
  normalized_text = unicodedata.normalize('NFKD', news).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  token_list_nac = nltk.WordPunctTokenizer().tokenize(normalized_text)
  token_list_nac_freq = Counter(token_list_nac)
  num_word_nac = len(token_list_nac_freq)
  return {'group': news_group, 
          'orig_words': num_org_word, 
          '#NAC_tokens': num_word_nac,
          '%improvement': ((num_org_word - num_word_nac)/num_org_word) * 100}

In [37]:
# DataFrame.append was removed in pandas 2.0; collect the row dicts and build the frame once
result = [compute_improvement_nac(news_group) for news_group in LsTgtNames]
df_improvement_nac = pd.DataFrame(result)

df_improvement_nac.sort_values(by='%improvement')

| |group|orig_words|#NAC_tokens|%improvement|
|--|--|--|--|--|
|0|alt.atheism|4928|4928|0.0|
|17|talk.politics.mideast|5704|5704|0.0|
|16|talk.politics.guns|3895|3895|0.0|
|15|soc.religion.christian|5101|5101|0.0|
|14|sci.space|4929|4929|0.0|
|13|sci.med|5095|5095|0.0|
|12|sci.electronics|3274|3274|0.0|
|11|sci.crypt|3772|3772|0.0|
|10|rec.sport.hockey|4026|4026|0.0|
|9|rec.sport.baseball|3499|3499|0.0|


In [38]:
np.median(df_improvement_nac['%improvement'])

0.0

## **P8. Dimension Reduction: Porter Stemmer vs WordNet Lemmatizer**

Similar to P4, we compare the effectiveness of stemmers and lemmatizers in reducing the vocabulary size. 

Apply the stemmer <code>pso</code> (defined above) to the tokenized words of each newsgroup (as we did in the Module 3 video and Jupyter Notebook). Then compute the median percent decrease in vocabulary size. Then apply the lemmatizer <code>wlo</code> (defined above) in the same way. Finally, submit the larger of the two median percentage decrease values.

**Example:** You should observe these intermediate results. Here the rows are ordered by `%improvement`:

|group|#orig_words|#pso_words|%improvement|
|-|-|-|-|
|rec.sport.hockey|4026|3271|18.75|
|rec.motorcycles|2712|2158|20.43|

|group|#orig_words|#wlo_words|%improvement|
|-|-|-|-|
|misc.forsale|3827|3708|3.11|
|rec.sport.hockey|4026|3897|3.20|

<font color=gray><i>Takeaway:</i> Notice the more aggressive normalizer. Do you think it produces higher quality words (that can still be found in a common English dictionary)?

In [39]:
df_improvement_stemmer = pd.DataFrame(columns=['group', '#orig_words', '#stem_lem_words', '%improvement', 'stemmer'])
pd.set_option('display.max_rows', 85)

def compute_improvement_w_stemming(news_group, stem_lem, stemmer_name):
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news)
  org_word_freq = Counter(token_list)
  num_org_word = len(org_word_freq)
  
  if stemmer_name == 'pso':
    token_list_stemmed = [stem_lem.stem(w) for w in token_list]
  else:
    token_list_stemmed = [stem_lem.lemmatize(w) for w in token_list]
  token_list_stemmed_freq = Counter(token_list_stemmed)
  num_word_wo_stop = len(token_list_stemmed_freq)
  return {'stemmer': stemmer_name,
          'group': news_group, 
          '#orig_words': num_org_word, 
          '#stem_lem_words': num_word_wo_stop,
          '%improvement': ((num_org_word - num_word_wo_stop)/num_org_word) * 100}

In [40]:
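stemmer_dict = {'pso': pso, 'wlo': wlo}
# DataFrame.append was removed in pandas 2.0; collect the row dicts and build the frame once
result = [compute_improvement_w_stemming(news_group, stemmer, name)
          for name, stemmer in stemmer_dict.items()
          for news_group in LsTgtNames]
df_improvement_stemmer = pd.DataFrame(result, columns=['group', '#orig_words', '#stem_lem_words', '%improvement', 'stemmer'])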

In [103]:
df_improvement_stemmer_grp = df_improvement_stemmer.groupby(['stemmer'])
#df_improvement_stemmer.sort_values(by=['%improvement'], ascending=False)

In [110]:
#rec.sport.hockey	4026	3271	18.75
#rec.sport.hockey	4026	3897	3.20
df_pso = df_improvement_stemmer_grp.get_group('pso')
np.median(df_pso['%improvement'])
#df_pso

25.65413661583169

In [109]:
df_wlo = df_improvement_stemmer_grp.get_group('wlo')
np.median(df_wlo['%improvement'])
#df_wlo

5.425927181355558

In [101]:
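# rec.motorcycles 2712 2158 20.43
# Filter on the 'group' column: the groupby object above is keyed by 'stemmer', not by newsgroup
df_improvement_stemmer[df_improvement_stemmer['group'] == 'rec.motorcycles']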

| |group|#orig_words|#stem_lem_words|%improvement|stemmer|
|--|--|--|--|--|--|
|8|rec.motorcycles|2712|2158|20.43|pso|
|28|rec.motorcycles|2712|2583|4.76|wlo|


In [102]:
# misc.forsale 3827	3708	3.11
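# The groupby above is keyed on 'stemmer', so filter rows by newsgroup directly
df_improvement_stemmer[df_improvement_stemmer['group'] == 'misc.forsale']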

| |group|#orig_words|#stem_lem_words|%improvement|stemmer|
|--|--|--|--|--|--|
|6|misc.forsale|3827|3014|21.24|pso|
|26|misc.forsale|3827|3708|3.11|wlo|


In [42]:
# Show every newsgroup's rows, ordered by how much the vocabulary shrank
df_improvement_stemmer.sort_values(by=['%improvement'], ascending=False)

## **P9. Dimension Reduction: Measure Quality of Stems/Lemmas**

Here we quantitatively evaluate the quality of the vocabulary produced by **stemming** and **lemmatization**. Many stemmed and lemmatized tokens are not proper English words. We assess whether stems and lemmas are spelled correctly by checking whether they appear in the 56K-word vocabulary of the **Brown corpus**, stored in the <code>SsBrownVcb</code> set object created above.

<p>Given one newsgroup corpus, say <code>'misc.forsale'</code>, compute <code>vocab_orig</code>, the set of tokens from the original text, and <code>vocab_pso</code>, the set of stemmed tokens produced by the <code>pso</code> object. Then compute the set of newly formed tokens, <code>new_tokens_pso</code>, that were not in the original set. How many of these new stems are in the Brown vocabulary? Compute the percentage of new stems that are also found in the Brown corpus (out of all new stems from <code>pso</code>). Call this quality metric <code>accuracy_pso</code>.

<p>Now compute <code>accuracy_wlo</code> in the same way, but using the lemmatizer object <code>wlo</code> created above.

<p>Compute the absolute difference, <code>abs(accuracy_wlo-accuracy_pso)</code>, and call it <code>abs_acc_diff</code>.

<p>Finally, compute the median of all <code>abs_acc_diff</code> metrics across all newsgroups in <code>LsTgtNames</code> list.

<p><font color="gray"><i>Takeaway:</i> Notice the drastic difference in quality between the two techniques.

---------------

<i>Hint:</i> You will need to use set operations here. Sets are very efficient for testing whether the members of one group belong to another group. See the Corey Schafer video on sets if you need a refresher. </font>

<hr>

**Toy example 1:** Say you have the sentence "NLP is awesomely exciting", which tokenizes to <font color=blue>["NLP", "is", "awesomely", "exciting"]</font>, which in turn stems to (for example) <font color=blue>["NLP", "is", "awesome", "excit"]</font>. Now there are 2 new words and, suppose, only one of them is correct: <font color=blue>"awesome"</font> is a new word that is correct and <font color=blue>"excit"</font> is incorrect.

<p>So only 50% of the new words are correct. By contrast, we would expect 100% of the new words from lemmatization to be correct. How do we determine whether a word is correct? We could spell-check, but that is slow. We could also check whether it is in some dictionary of words that we consider correct, such as the Oxford Dictionary, Wikipedia, etc. In practice, we often just use the Brown vocabulary as that ground-truth set of words. So, in this exercise we need to compute the percentage of new words found in the Brown corpus (out of all new stems from pso).

<p>Further clarification:

1. Let S := the set of new words resulting from stemming (i.e. new stems).
1. Let B := the set of all words in the Brown vocabulary (these are unique words).
1. Let BS := the intersection of B and S, that is, all new words found in the Brown corpus.
1. len(BS) is the size of this intersection, i.e. the count of elements in the BS set.
1. We need to find len(BS)/len(S) as a percentage.
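
The steps above can be sketched on Toy example 1 with plain Python sets. This is only an illustration: the helper name `pct_in_vocab` is hypothetical, the stems are the example's assumed output rather than real stemmer output, and the small set `B` stands in for <code>SsBrownVcb</code>. The guard for an empty S is an added assumption, since the ratio is undefined when a normalizer creates no new tokens.

```python
def pct_in_vocab(new_tokens, vocab):
    """Percent of new tokens that also appear in the reference vocabulary."""
    if not new_tokens:           # no new tokens: the ratio is undefined
        return None
    return len(new_tokens & vocab) / len(new_tokens) * 100

tokens = ["NLP", "is", "awesomely", "exciting"]
stems = ["NLP", "is", "awesome", "excit"]        # illustrative stems from Toy example 1
B = {"NLP", "is", "awesome", "exciting"}         # tiny stand-in for SsBrownVcb

S = set(stems) - set(tokens)    # new stems: {'awesome', 'excit'}
BS = B & S                      # new stems found in the vocabulary: {'awesome'}
print(pct_in_vocab(S, B))       # 50.0
```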

<p>Finally, we compute the absolute difference between the two quality metrics, repeat this for all newsgroups, and then pick the median for submission.
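
A minimal sketch of this last aggregation step, assuming the per-newsgroup accuracies have already been computed. The two pairs are the example rows from this problem's intermediate results, so the printed value illustrates the mechanics only and is not the quiz answer; the variable names are hypothetical.

```python
from statistics import median

# (accuracy_pso, accuracy_wlo) per newsgroup: two example rows only
accuracies = {
    "misc.forsale": (30.07, 80.49),
    "rec.motorcycles": (26.52, 79.00),
}

# Absolute gap between lemmatizer and stemmer accuracy for each newsgroup
abs_acc_diff = {
    group: abs(acc_wlo - acc_pso)
    for group, (acc_pso, acc_wlo) in accuracies.items()
}

# The median across newsgroups is the final quantity
median_diff = median(abs_acc_diff.values())
print(round(median_diff, 2))
```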

You should observe these intermediate results. Here the rows are ordered by `abs_acc_diff`:

|newsgroups|acc_diff_pso|acc_diff_wlo|abs_acc_diff|
|--|--|--|--|
|misc.forsale|30.07|80.49|50.42|
|rec.motorcycles|26.52|79.00|52.48|

In [52]:
df_stemmer_compare = pd.DataFrame(columns=['newsgroups', 'acc_diff_pso', 'acc_diff_wlo', 'abs_acc_diff'])
pd.set_option('display.max_rows', 85)
from collections import defaultdict

In [53]:
def compute_improvement_w_stemming(news_group):
  """Return, per normalizer, the set of tokens it creates that were not in the original text."""
  news = LoadNews([news_group])
  token_list = nltk.WordPunctTokenizer().tokenize(news.lower())
  set_of_orig_words = set(token_list)

  stemmer_dict = {'pso': pso, 'wlo': wlo}
  token_list_stemmed = defaultdict(set)
  for name, stemmer in stemmer_dict.items():
    if name == 'pso':
      token_list_from_stemming = [stemmer.stem(w) for w in token_list]
    else:
      token_list_from_stemming = [stemmer.lemmatize(w) for w in token_list]
    token_list_stemmed[name] = set(token_list_from_stemming)

  # Set difference keeps only the tokens introduced by stemming/lemmatization
  new_words_per_stemmer = defaultdict(set)
  for name in stemmer_dict.keys():
    new_words_per_stemmer[name] = token_list_stemmed[name] - set_of_orig_words

  return new_words_per_stemmer

In [92]:
# Collect the new-token sets for every newsgroup, as the task requires
results = {}
for news_group in LsTgtNames:
  results[news_group] = compute_improvement_w_stemming(news_group)

In [96]:
bs1 = results['misc.forsale']['pso'].intersection(SsBrownVcb)
bs2 = results['rec.motorcycles']['pso'].intersection(SsBrownVcb)


In [97]:
(len(bs1)/len(results['misc.forsale']['pso'])) * 100

22.122762148337596

In [48]:
rows = []
for news_group, result in results.items():
  # New tokens that are also in the Brown vocabulary
  bs_pso = result['pso'].intersection(SsBrownVcb)
  bs_wlo = result['wlo'].intersection(SsBrownVcb)
  acc_diff_pso = (len(bs_pso)/len(result['pso'])) * 100
  acc_diff_wlo = (len(bs_wlo)/len(result['wlo'])) * 100
  rows.append({
      'newsgroups': news_group,
      'acc_diff_pso': acc_diff_pso,
      'acc_diff_wlo': acc_diff_wlo,
      'abs_acc_diff': abs(acc_diff_wlo - acc_diff_pso)
      })
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df_stemmer_compare = pd.DataFrame(rows, columns=['newsgroups', 'acc_diff_pso', 'acc_diff_wlo', 'abs_acc_diff'])

In [49]:
df_stemmer_compare

| |newsgroups|acc_diff_pso|acc_diff_wlo|abs_acc_diff|
|--|--|--|--|--|
|0|alt.atheism|20.18|82.02|61.84|
|1|comp.graphics|16.9|75.0|58.1|
|2|comp.os.ms-windows.misc|20.34|81.32|60.98|
|3|comp.sys.ibm.pc.hardware|22.66|84.88|62.22|
|4|comp.sys.mac.hardware|23.29|86.59|63.29|
|5|comp.windows.x|17.09|79.81|62.71|
|6|misc.forsale|22.12|78.26|56.14|
|7|rec.autos|21.82|84.31|62.5|
|8|rec.motorcycles|24.52|78.64|54.12|
|9|rec.sport.baseball|23.84|73.64|49.8|


In [50]:
np.median(df_stemmer_compare['abs_acc_diff'])

59.68398086237613
