# Text preprocessing

**Data URL**

In [1]:
DATA_URL = "https://www.gutenberg.org/files/913/913-0.txt"

Loading *NLTK*'s 'wordnet'

In [2]:
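! pip install -q nltk
import nltk

# Tokenizer tables required by recent NLTK releases.
nltk.download('punkt_tab')

# Download the tokenizer and WordNet only if they are not available locally.
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('punkt')
    nltk.download('wordnet')

# Location of the WordNet archive, which is unpacked in a later cell.
wordnet_zip_path = '/root/nltk_data/corpora/wordnet.zip'
wordnet_dir = '/root/nltk_data/corpora/'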

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [3]:
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# WordNet-based lemmatizer used in the tasks below.
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
lemmatizer

<WordNetLemmatizer>

Downloading the text for the task via `urllib`

In [4]:
import urllib.request

# urllib.request.URLopener is deprecated; urlopen is the supported entry point.
with urllib.request.urlopen(DATA_URL) as resource:
    charset = resource.headers.get_content_charset()
    raw_text = resource.read()
print("Charset", charset)

# Fall back to UTF-8 when the server does not declare a charset.
raw_text = raw_text.decode(charset or "utf-8")

raw_text[:100]

Charset utf-8


'\ufeffThe Project Gutenberg EBook of A Hero of Our Time, by M. Y. Lermontov\r\n\r\nThis eBook is for the use '
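The output above starts with `\ufeff`, a byte order mark left over from decoding. A minimal sketch of a decoding helper that applies the same charset fallback and also drops that mark; the name `decode_body` is hypothetical and not part of the original notebook:

```python
from typing import Optional


def decode_body(body: bytes, charset: Optional[str] = None) -> str:
    """Decode downloaded bytes, falling back to UTF-8 and dropping a leading BOM."""
    text = body.decode(charset or "utf-8")
    # "\ufeff" is the byte order mark seen at the start of the raw text.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text


# Toy payload (assumption): UTF-8 BOM followed by the first words of the book header.
sample = decode_body(b"\xef\xbb\xbfThe Project Gutenberg EBook", "utf-8")
print(repr(sample))
```

Whether to strip the mark here or leave it for the later `isalpha` filter is a design choice; the tasks below work either way as long as the expected counts still match.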

Removing the book ending (Gutenberg legal information)

In [5]:
import re

# Line breaks are flattened to spaces first, so ".*" can match to the end of the text.
clean_pattern = re.compile("End of the Project Gutenberg EBook.*")
cleaner_text = clean_pattern.sub("", raw_text.replace("\n", " ").replace("\r", " "))
cleaner_text[-100:]

'ases, equivalent to: “Men are fools, fortune is  blind, and life is not worth a straw.”]            '
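Why the line breaks are replaced before the substitution can be seen on a toy string. The sketch below uses made-up text rather than the real book, and also shows `re.DOTALL` as an alternative that keeps the line breaks; that alternative is an illustration, not what the notebook does:

```python
import re

# Toy stand-in for the downloaded book (not the real text).
sample = (
    "Last line of the novel.\r\n"
    "End of the Project Gutenberg EBook of A Hero\r\n"
    "License text..."
)

footer = re.compile(r"End of the Project Gutenberg EBook.*")

# Without flattening, "." stops at the first "\n", so the license text survives.
partial = footer.sub("", sample)

# Flattening line breaks first lets ".*" run to the end of the string.
flattened = footer.sub("", sample.replace("\n", " ").replace("\r", " "))

# Alternative: keep the line breaks and let "." match them via re.DOTALL.
dotall = re.sub(r"End of the Project Gutenberg EBook.*", "", sample, flags=re.DOTALL)

print(repr(partial))
print(repr(flattened))
print(repr(dotall))
```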

Splitting the text into tokens with a little help from [NLTK](https://nltk.org/).

In [6]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Unpack WordNet so the lemmatizer can read the corpus files.
! unzip /root/nltk_data/corpora/wordnet.zip -d /root/nltk_data/corpora/;

tokens = word_tokenize(cleaner_text)
"A total of %d 'tokens'" % len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Archive:  /root/nltk_data/corpora/wordnet.zip
   creating: /root/nltk_data/corpora/wordnet/
  inflating: /root/nltk_data/corpora/wordnet/lexnames  
  inflating: /root/nltk_data/corpora/wordnet/data.verb  
  inflating: /root/nltk_data/corpora/wordnet/index.adv  
  inflating: /root/nltk_data/corpora/wordnet/adv.exc  
  inflating: /root/nltk_data/corpora/wordnet/index.verb  
  inflating: /root/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /root/nltk_data/corpora/wordnet/data.adj  
  inflating: /root/nltk_data/corpora/wordnet/index.adj  
  inflating: /root/nltk_data/corpora/wordnet/LICENSE  
  inflating: /root/nltk_data/corpora/wordnet/citation.bib  
  inflating: /root/nltk_data/corpora/wordnet/noun.exc  
  inflating: /root/nltk_data/corpora/wordnet/verb.exc  
  inflating: /root/nltk_data/corpora/wordnet/README  
  inflating: /root/nltk_data/corpora/wordnet/index.sense  
  inflating: /root/nltk_data/corpora/wordnet/data.noun  
  inflating: /root/nltk_data/corpora/wordnet/data.adv  


"A total of 72566 'tokens'"

Next we **lemmatize the tokens**. For better results the text should first be PoS-tagged (NLTK can do this too; see [the book and the docs](https://www.nltk.org/book/ch05.html)), because `WordNetLemmatizer` works best when it is given PoS tags; without one it treats every word as a noun. To keep things short and simple, we skip tagging here.
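For reference, PoS-aware lemmatization needs a mapping from the Penn Treebank tags produced by `nltk.pos_tag` to the single-letter PoS codes WordNet uses. A minimal sketch of such a mapping; the helper name `penn_to_wordnet` is hypothetical, and the commented usage assumes the NLTK tagger data is installed (its resource name varies across NLTK versions):

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet PoS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Usage with NLTK (not run here, requires the tagger data):
#   tagged = nltk.pos_tag(tokens)
#   lemmas = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
#             for word, tag in tagged]
print(penn_to_wordnet("VBD"), penn_to_wordnet("JJ"), penn_to_wordnet("NNS"))
```

Using this would change the lemmas, so the check values quoted in the tasks below would no longer be expected to match.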



---


## Task #1
Using `str.isalpha` from Python's standard library, modify the code below to remove all non-letter tokens from the text.

---



In [12]:
from tqdm.notebook import tqdm

# Lemmatize every token, then keep only the purely alphabetic lemmas.
lemmas = [lemma for lemma in (lemmatizer.lemmatize(token) for token in tqdm(tokens)) if lemma.isalpha()]
lemmas[-10:]

  0%|          | 0/72566 [00:00<?, ?it/s]

['fortune', 'is', 'blind', 'and', 'life', 'is', 'not', 'worth', 'a', 'straw']
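What `str.isalpha` keeps and drops is easiest to see on a handful of tokens. The list below is a toy example resembling `word_tokenize` output, not taken from the actual run:

```python
# Toy tokens (assumption): words, punctuation, a contraction piece, a number.
sample_tokens = ["Men", "are", "fools", ",", "n't", "1840", "half-dead", "straw", "”"]

# str.isalpha() is True only when every character in the string is a letter.
letters_only = [token for token in sample_tokens if token.isalpha()]
print(letters_only)
```

Note that the filter also drops hyphenated words and contraction fragments such as `n't`, since they contain non-letter characters.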

Let us check that everything is correct: the total corpus size should be 59087.

In [13]:
len(lemmas)

59087

Make sure these numbers match before continuing with the assignment!


---

## Task #2

Using `lemmas`, the English NLTK stopwords (`nltk.corpus.stopwords`) and `nltk.FreqDist`, compute **the fraction of stopwords** among the top-100 most frequent words in the text.

E.g. if there were just **3 stopwords in 100** most frequent words in the text, the answer would be **0.03**

**Googling how to work with NLTK is encouraged.**

---



In [24]:
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

nltk.download("stopwords")
freq_dist = FreqDist(lemmas)
STOPWORDS = set(stopwords.words("english"))

# Stopwords among the 50 most frequent lemmas (the task itself asks about the top 100).
top_freq_lemmas = sorted(freq_dist.items(), key=lambda x: x[1], reverse=True)[:50]
number_intersection_keys = len(set(map(lambda x: x[0], top_freq_lemmas)) & STOPWORDS)
print(number_intersection_keys)

# Number of distinct lemmas that occur more than 20 times.
frequent_lemmas = {lemma: count for lemma, count in freq_dist.items() if count > 20}
print(len(frequent_lemmas))

stopwords.words("english")[:5]

41
382


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a', 'about', 'above', 'after', 'again']

Check yourself: the stopword rate among the 100 most frequent words in the text should be 0.66.
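One possible way to compute the requested fraction is sketched below on toy data. It uses `collections.Counter` so that it runs without NLTK; `nltk.FreqDist` is a `Counter` subclass and offers the same `most_common` method. The function name `stopword_fraction` and the toy inputs are hypothetical:

```python
from collections import Counter


def stopword_fraction(words, stopword_set, top_n=100):
    """Fraction of the top_n most frequent words that are stopwords."""
    top_words = [word for word, _ in Counter(words).most_common(top_n)]
    if not top_words:
        return 0.0
    return sum(word in stopword_set for word in top_words) / len(top_words)


# Toy corpus (assumption): five distinct words with different frequencies.
toy_words = ["the"] * 5 + ["hero"] * 4 + ["of"] * 3 + ["time"] * 2 + ["a"]
toy_stopwords = {"the", "of", "a"}

# Top 4 words are "the", "hero", "of", "time"; two of them are stopwords.
fraction = stopword_fraction(toy_words, toy_stopwords, top_n=4)
print(fraction)
```

With the notebook's data this would be called as `stopword_fraction(lemmas, STOPWORDS)`. NLTK's English stopwords are lower-case, so whether to lower-case the lemmas first is a choice the task leaves open.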



---


## Task #3
Compute how many tokens occur in the text **strictly more than** 50 times.

---



Check yourself: 149
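A minimal sketch of the counting step, again on toy data and with `collections.Counter` standing in for `nltk.FreqDist`; the helper name `count_frequent` is hypothetical:

```python
from collections import Counter


def count_frequent(words, threshold=50):
    """Number of distinct words that occur strictly more than threshold times."""
    counts = Counter(words)
    return sum(1 for count in counts.values() if count > threshold)


# Toy corpus (assumption): "a" occurs 51 times, "b" exactly 50, "c" only 3.
toy = ["a"] * 51 + ["b"] * 50 + ["c"] * 3
print(count_frequent(toy))
```

Only `"a"` passes the strict comparison here, which is the boundary case the task wording stresses. On the notebook's data the call would be `count_frequent(lemmas)`.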