Study the attached file and prepare a python notebook titled: Building your vocabulary.

This task involves the following objectives:
Understand the concepts of Tokenization, Stemming, and Lemmatization
Learn to use existing libraries
Learn to create your own python implementation of the task.
The notebook must have 3 parts:

Part 1:
Detailed notes (definitions, types, examples, etc.) on the important concepts given in the tut_document.
Code implementation as given in the tut_document

Part 2:
As in the previous assignment, perform Tokenization, Stemming, and Lemmatization to create two vocabularies using NLTK and spaCy.
The dataset for the above task is the text on this webpage: https://www.gutenberg.org/cache/epub/69875/pg69875-images.html
Note: The content of the webpage must not be copy-pasted into a txt file to be used as input. You must fetch the text directly from the web page using "BeautifulSoup", and make a detailed note on "BeautifulSoup" in the same way as asked in Part 1.
Store the vocabularies in a suitable data structure and find the difference between them.
Part 3:
Read the webpage data as in Part 2.
Write Python code to perform sentence segmentation and word tokenization
Remove digits/punctuation
Perform lowercasing
Create a vocabulary of your own.
Compare your vocabulary with the two vocabularies created above.
What is the major drawback of your vocabulary, apart from the lack of stemming and lemmatization?
I am providing enough time for this task to be completed. Use your Lab and Tutorial sessions, as well as the weekend, to complete the task.

Submissions will be evaluated on the following qualities:
Completeness
Correctness
Organization
Documentation
Punctuality.

## PART 1



In [None]:
"""
* Tokenization
  Tokenization is the first step in any NLP pipeline. It is the process of breaking natural language
  text into its smallest meaningful units, called tokens (words, punctuation marks, numbers, etc.).
  A tokenizer breaks unstructured data and natural language text into chunks of information that can be
  considered as discrete elements or lexical units.

* Stemming
  Stemming is the process of reducing a word to its root (stem), usually by stripping suffixes. The stem
  need not be a dictionary word. There can be over-stemming (a word is truncated too much, so unrelated
  words end up with the same stem) and under-stemming (words that share a root are reduced to
  different stems).

* Lemmatization
  Lemmatization is the process of finding the dictionary form (lemma) of a word, e.g. "was" -> "be".

* Bag of words
  A bag of words represents a text by which vocabulary words occur in it (or how often), ignoring word
  order. The code cells below build one-hot vectors and binary bag-of-words vectors in this way.
"""
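
To make the difference between stemming and lemmatization concrete, here is a minimal sketch in plain Python. The suffix list in `toy_stem` and the `TOY_LEMMAS` table are hypothetical illustrations; they are not the Porter stemmer or the WordNet lemmatizer that NLTK provides.

```python
# Toy suffix-stripping stemmer (illustrative only, not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemma lookup table (hypothetical; a real lemmatizer uses a full dictionary).
TOY_LEMMAS = {"ran": "run", "running": "run", "was": "be", "mice": "mouse"}


def toy_lemma(word):
    """Return the dictionary form if it is in the table, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["building", "moved", "news", "running", "ran"]:
    print(w, "->", toy_stem(w), "|", toy_lemma(w))
```

With these toy rules, "news" becomes "new" (over-stemming), while "running" and "ran" end up with the different stems "runn" and "ran" (under-stemming). The lemma table maps both to "run".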

In [None]:
>>> import numpy as np
>>> # Assumed input: the sentence whose tokens match the 7 x 7 array and
>>> # the DataFrame columns shown below.
>>> sentence = "Retained anachronistic and non-standard spellings as printed."
>>> token_sequence = str.split(sentence)
>>> vocab = sorted(set(token_sequence))
>>> ', '.join(vocab)
'Retained, anachronistic, and, as, non-standard, printed., spellings'
>>> num_tokens = len(token_sequence)
>>> vocab_size = len(vocab)
>>> onehot_vectors = np.zeros((num_tokens,
...                            vocab_size), int)
>>> for i, word in enumerate(token_sequence):
...     onehot_vectors[i, vocab.index(word)] = 1
>>> ' '.join(vocab)
'Retained anachronistic and as non-standard printed. spellings'
>>> onehot_vectors


array([[1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0]])
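
The link between one-hot vectors and a bag of words can be sketched without NumPy: adding up the one-hot rows of a sentence, column by column, gives its bag-of-words count vector. The example sentence below is hypothetical and chosen so that one word repeats.

```python
sentence = "the cat sat on the mat"  # hypothetical example sentence
tokens = sentence.split()
vocab = sorted(set(tokens))

# One row per token, one column per vocabulary word
onehot = [[1 if word == v else 0 for v in vocab] for word in tokens]

# Summing the rows column by column gives the bag-of-words count vector
bow_counts = [sum(column) for column in zip(*onehot)]

print(vocab)
print(bow_counts)
```

Word order is lost in `bow_counts`; only the number of occurrences per vocabulary word remains.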

In [None]:
>>> import pandas as pd
>>> pd.DataFrame(onehot_vectors, columns=vocab)

   Retained  anachronistic  and  as  non-standard  printed.  spellings
0         1              0    0   0             0         0          0
1         0              1    0   0             0         0          0
2         0              0    1   0             0         0          0
3         0              0    0   0             1         0          0
4         0              0    0   0             0         0          1
5         0              0    0   1             0         0          0
6         0              0    0   0             0         1          0


In [None]:
>>> df = pd.DataFrame(onehot_vectors, columns=vocab)
>>> df[df == 0] = ''
>>> df

   Retained  anachronistic  and  as  non-standard  printed.  spellings
0         1
1                        1
2                             1
3                                               1
4                                                                    1
5                                 1
6                                                         1


In [None]:
>>> sentences = """Thomas Jefferson began building Monticello at the\
...   age of 26.\n"""
>>> sentences += """Construction was done mostly by local masons and\
...   carpenters.\n"""
>>> sentences += "He moved into the South Pavilion in 1770.\n"
>>> sentences += """Turning Monticello into a neoclassical masterpiece\
...   was Jefferson's obsession."""
>>> corpus = {}
>>> for i, sent in enumerate(sentences.split('\n')):
...     corpus['sent{}'.format(i)] = dict((tok, 1) for tok in
...         sent.split())
>>> df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
>>> df[df.columns[:10]]

       Thomas  Jefferson  began  building  Monticello  at  the  age  of  26.
sent0       1          1      1         1           1   1    1    1   1    1
sent1       0          0      0         0           0   0    0    0   0    0
sent2       0          0      0         0           0   0    1    0   0    0
sent3       0          0      0         0           1   0    0    0   0    0


In [None]:
df = df.T
print(df.sent0.dot(df.sent1))
print(df.sent0.dot(df.sent2))
print(df.sent0.dot(df.sent3))

0
1
1
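
The dot product of two binary bag-of-words vectors counts the tokens the two sentences share. The sketch below repeats that computation with plain lists and sets instead of pandas; the helper names `binary_bow` and `dot` are introduced here only for illustration.

```python
sentences = [
    "Thomas Jefferson began building Monticello at the age of 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

token_sets = [set(s.split()) for s in sentences]
vocab = sorted(set().union(*token_sets))


def binary_bow(tokens):
    """1 if the vocabulary word occurs in the sentence, else 0."""
    return [1 if word in tokens else 0 for word in vocab]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vectors = [binary_bow(t) for t in token_sets]

# Overlap of the first sentence with each of the others
overlaps = [dot(vectors[0], v) for v in vectors[1:]]
shared = [sorted(token_sets[0] & t) for t in token_sets[1:]]

print(overlaps)
print(shared)
```

The shared tokens are "the" and "Monticello". "Jefferson" and "Jefferson's" do not match, because whitespace splitting leaves the possessive attached.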


## PART 2

In [None]:
"""
Fetch the web page and extract its text.
The text of every <p> element is appended to an initially empty string
to build the article text.

"""

In [None]:
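import urllib.request

import bs4 as bs

# Download the raw HTML of the book page
raw_html = urllib.request.urlopen("https://www.gutenberg.org/cache/epub/69875/pg69875-images.html")
raw_html = raw_html.read()

# Parse the HTML (the 'lxml' parser requires the lxml package)
article_html = bs.BeautifulSoup(raw_html, 'lxml')

# Collect every paragraph (<p>) element
article_paragraphs = article_html.find_all('p')

# Concatenate the paragraph texts; no separator is added between paragraphs
article_text = ''

for para in article_paragraphs:
    article_text += para.text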

In [None]:
article_text

'Title: The windfairies, and other talesAuthor: Mary De MorganIllustrator: Olive J. CockerellRelease Date: January 24, 2023 [EBook #69875]Language: EnglishOriginal Publication: United Kingdom: United Kingdom: Seeley & Co.,1900..Credits: Brian Wilsden, Tim Lindell and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive/American Libraries.)[Pg i]“Indeed,” said the Duke, “I should not have thought you so very\r\npretty.”[Vain Kesta, p. 43.][Pg ii][Pg iii][Pg 1-2][Pg 3]\nThere was once a\r\nwindmill which stood on the downs by the sea, far from any town or\r\nvillage, and in which the miller lived alone with his little daughter.\r\nHis wife had died when the little girl, whose name was Lucilla, was a\r\nbaby, and so the miller lived by himself with his child, of whom he was\r\nvery proud. As her father was busy with his work, and as little Lucilla\r\nhad no other children to play with, she wa
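
Tokens such as 'talesAuthor' in the output above arise because the paragraph texts are concatenated with nothing between them. A minimal sketch of the effect follows, using a hypothetical list of paragraph strings that stands in for the text of each `<p>` element.

```python
# Hypothetical paragraph texts, standing in for the text of each <p> element
paragraphs = [
    "Title: The windfairies, and other tales",
    "Author: Mary De Morgan",
    "Illustrator: Olive J. Cockerell",
]

glued = "".join(paragraphs)        # same effect as repeated +=
separated = "\n".join(paragraphs)  # keeps a boundary between paragraphs

print("talesAuthor:" in glued.split())
print("talesAuthor:" in separated.split())
```

Joining with a newline (or a space) keeps the last word of one paragraph apart from the first word of the next, so the merged tokens do not enter the vocabulary.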

In [None]:
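"""
Using NLTK to split the article into sentences; the vocabulary is built from these sentences below.

"""
import nltk
# sent_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once with nltk.download
corpus = nltk.sent_tokenize(article_text)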

In [None]:
corpus

['Title: The windfairies, and other talesAuthor: Mary De MorganIllustrator: Olive J. CockerellRelease Date: January 24, 2023 [EBook #69875]Language: EnglishOriginal Publication: United Kingdom: United Kingdom: Seeley & Co.,1900..Credits: Brian Wilsden, Tim Lindell and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive/American Libraries.',
 ')[Pg i]“Indeed,” said the Duke, “I should not have thought you so very\r\npretty.”[Vain Kesta, p.',
 '43.',
 '][Pg ii][Pg iii][Pg 1-2][Pg 3]\nThere was once a\r\nwindmill which stood on the downs by the sea, far from any town or\r\nvillage, and in which the miller lived alone with his little daughter.',
 'His wife had died when the little girl, whose name was Lucilla, was a\r\nbaby, and so the miller lived by himself with his child, of whom he was\r\nvery proud.',
 'As her father was busy with his work, and as little Lucilla\r\nhad no other children 

In [None]:
wordfreq_1 = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq_1.keys():
            wordfreq_1[token] = 1
        else:
            wordfreq_1[token] += 1
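
The same word-to-frequency mapping can be built with `collections.Counter` from the standard library. The token list below is a hypothetical stand-in for the output of a tokenizer.

```python
from collections import Counter

# Hypothetical token list, standing in for tokenizer output
tokens = ["the", "miller", "and", "the", "windmill", "and", "the", "sea"]

# Manual counting, equivalent to the if/else loop
manual = {}
for token in tokens:
    manual[token] = manual.get(token, 0) + 1

# Counter does the same bookkeeping in one call
counted = Counter(tokens)

print(counted.most_common(2))
print(dict(counted) == manual)
```

`Counter` is a `dict` subclass, so it can be used wherever the plain frequency dictionary is used, and `most_common` gives the highest-frequency tokens directly.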

In [None]:
wordfreq_1

{'Title': 1,
 ':': 38,
 'The': 119,
 'windfairies': 24,
 ',': 3929,
 'and': 2515,
 'other': 49,
 'talesAuthor': 1,
 'Mary': 1,
 'De': 1,
 'MorganIllustrator': 1,
 'Olive': 1,
 'J.': 2,
 'CockerellRelease': 1,
 'Date': 1,
 'January': 1,
 '24': 2,
 '2023': 1,
 '[': 234,
 'EBook': 1,
 '#': 1,
 '69875': 1,
 ']': 234,
 'Language': 1,
 'EnglishOriginal': 1,
 'Publication': 1,
 'United': 2,
 'Kingdom': 2,
 'Seeley': 1,
 '&': 4,
 'Co.,1900': 1,
 '..': 1,
 'Credits': 1,
 'Brian': 1,
 'Wilsden': 1,
 'Tim': 1,
 'Lindell': 1,
 'the': 2640,
 'Online': 1,
 'Distributed': 1,
 'Proofreading': 1,
 'Team': 1,
 'at': 220,
 'https': 1,
 '//www.pgdp.net': 1,
 '(': 3,
 'This': 9,
 'file': 1,
 'was': 514,
 'produced': 1,
 'from': 163,
 'images': 1,
 'generously': 1,
 'made': 59,
 'available': 1,
 'by': 129,
 'Internet': 1,
 'Archive/American': 1,
 'Libraries': 1,
 '.': 1030,
 ')': 3,
 'Pg': 232,
 'i': 1,
 '“': 906,
 'Indeed': 8,
 '”': 906,
 'said': 322,
 'Duke': 19,
 'I': 692,
 'should': 98,
 'not': 271,
 'h

In [None]:
"""
Using spaCy to tokenize the article and collect the lemma of every token.

"""

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(article_text)
tokens = []
for token in doc:
    tokens.append(token.lemma_)


In [None]:
"""
Building the dictionary of word to frequency

"""

wordfreq_2 = {}

for token in tokens:
    if token not in wordfreq_2:
        wordfreq_2[token] = 1
    else:
        wordfreq_2[token] += 1

In [None]:
wordfreq_2

{'title': 1,
 ':': 34,
 'the': 2764,
 'windfairie': 25,
 ',': 3923,
 'and': 2604,
 'other': 67,
 'talesAuthor': 1,
 'Mary': 1,
 'De': 1,
 'MorganIllustrator': 1,
 'Olive': 1,
 'J.': 3,
 'CockerellRelease': 1,
 'Date': 1,
 'January': 1,
 '24': 1,
 '2023': 1,
 '[': 159,
 'EBook': 1,
 '#': 1,
 '69875]language': 1,
 'EnglishOriginal': 1,
 'Publication': 1,
 'United': 2,
 'Kingdom': 2,
 'Seeley': 1,
 '&': 4,
 'Co.': 1,
 ',1900': 1,
 '..': 1,
 'credit': 1,
 'Brian': 1,
 'Wilsden': 1,
 'Tim': 1,
 'Lindell': 1,
 'Online': 1,
 'Distributed': 1,
 'Proofreading': 1,
 'Team': 1,
 'at': 243,
 'https://www.pgdp.net': 1,
 '(': 3,
 'this': 122,
 'file': 1,
 'be': 1494,
 'produce': 1,
 'from': 165,
 'image': 1,
 'generously': 1,
 'make': 117,
 'available': 1,
 'by': 146,
 'Internet': 1,
 'Archive': 1,
 '/': 1,
 'american': 1,
 'libraries.)[pg': 1,
 'i]“Indeed': 1,
 '"': 1122,
 'say': 389,
 'Duke': 22,
 'I': 950,
 'should': 100,
 'not': 352,
 'have': 705,
 'think': 141,
 'you': 576,
 'so': 255,
 'very':

In [None]:
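"""
Finding the difference between the two dictionaries
built using NLTK and spaCy.

"""

# Symmetric difference of (token, count) pairs: a pair appears in the result
# if the token is missing from one vocabulary or if its count differs.
difference = set(wordfreq_1.items()) ^ set(wordfreq_2.items())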

In [None]:
difference

{('LTD', 1),
 ('foot', 8),
 ('tighten', 4),
 ('lamb', 6),
 ('down', 118),
 ('saving', 2),
 ('implore', 2),
 ('60', 1),
 ('three', 8),
 ('gloves', 1),
 ('waved', 2),
 ('flood', 2),
 ('minute', 6),
 ('stole', 6),
 ('newspapers', 1),
 ('when——', 1),
 ('delight', 4),
 ('staggered', 1),
 ('murmured', 1),
 ('rained', 1),
 ('short?”“ay', 1),
 ('”“indeed', 2),
 ('comer', 1),
 ('ask', 71),
 ('fell', 19),
 ('“see', 1),
 ('Lucilla', 78),
 ('clad', 2),
 ('twang', 1),
 ('shelf', 6),
 ('longs', 2),
 ('strange', 13),
 ('Co.', 1),
 ('what', 136),
 ('trumpet', 4),
 ('surely', 14),
 ('money', 31),
 ('froze', 1),
 ('”It', 1),
 ('”', 906),
 ('duchess!”away', 1),
 ('meets!”“why', 1),
 ('disappeared', 14),
 ('See', 11),
 ('"', 1122),
 ('stood', 33),
 ('again!the', 1),
 ('wanting', 2),
 ('tongue', 2),
 ('”“Much', 1),
 ('ready', 4),
 ('tricking', 1),
 ('rubbed', 5),
 ('direction', 2),
 ('you?”“No', 1),
 ('teach', 35),
 ('shriek', 2),
 ('a-crying', 1),
 ('execute', 2),
 ('celebrated', 1),
 ('159]“the', 1),
 ('
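
Because the symmetric difference is taken over (token, count) pairs, a token that occurs in both vocabularies with different counts also shows up in the result. A sketch with two hypothetical miniature vocabularies separates the cases by comparing keys instead.

```python
# Hypothetical miniature vocabularies
freq_a = {"the": 3, "was": 2, "foot": 1}
freq_b = {"the": 4, "be": 2, "foot": 1}

# Pair-based difference: mixes missing tokens with changed counts
pair_diff = set(freq_a.items()) ^ set(freq_b.items())

# Key-based comparison: tokens present in only one vocabulary
only_a = freq_a.keys() - freq_b.keys()
only_b = freq_b.keys() - freq_a.keys()

# Tokens present in both, but with different counts
count_changed = {
    word for word in freq_a.keys() & freq_b.keys()
    if freq_a[word] != freq_b[word]
}

print(sorted(pair_diff))
print(only_a, only_b, count_changed)
```

Comparing keys answers "which words does one tokenizer produce that the other does not", while `count_changed` isolates words on which the two merely disagree about frequency.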

## PART 3


In [None]:
"""
Sentence tokenization using NLTK.

"""

sentences = nltk.tokenize.sent_tokenize(article_text)

sentences

['Title: The windfairies, and other talesAuthor: Mary De MorganIllustrator: Olive J. CockerellRelease Date: January 24, 2023 [EBook #69875]Language: EnglishOriginal Publication: United Kingdom: United Kingdom: Seeley & Co.,1900..Credits: Brian Wilsden, Tim Lindell and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive/American Libraries.',
 ')[Pg i]“Indeed,” said the Duke, “I should not have thought you so very\r\npretty.”[Vain Kesta, p.',
 '43.',
 '][Pg ii][Pg iii][Pg 1-2][Pg 3]\nThere was once a\r\nwindmill which stood on the downs by the sea, far from any town or\r\nvillage, and in which the miller lived alone with his little daughter.',
 'His wife had died when the little girl, whose name was Lucilla, was a\r\nbaby, and so the miller lived by himself with his child, of whom he was\r\nvery proud.',
 'As her father was busy with his work, and as little Lucilla\r\nhad no other children 
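
For comparison with the NLTK output, sentence segmentation can also be sketched by hand with a regular expression. The function name `split_sentences` and the rule "split on whitespace that follows '.', '!' or '?'" are assumptions made for this sketch; it is far simpler than NLTK's tokenizer.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    text = " ".join(text.split())  # collapse \r\n and repeated spaces
    return re.split(r"(?<=[.!?])\s+", text)


print(split_sentences("He moved in 1770. It was cold!\r\nWas it?"))
print(split_sentences("See p. 43."))
```

The second call shows the weakness of the rule: the abbreviation "p." is treated as the end of a sentence.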

In [None]:
"""
Word tokenization using nltk.

"""

words=nltk.tokenize.word_tokenize(article_text)

words

['Title',
 ':',
 'The',
 'windfairies',
 ',',
 'and',
 'other',
 'talesAuthor',
 ':',
 'Mary',
 'De',
 'MorganIllustrator',
 ':',
 'Olive',
 'J.',
 'CockerellRelease',
 'Date',
 ':',
 'January',
 '24',
 ',',
 '2023',
 '[',
 'EBook',
 '#',
 '69875',
 ']',
 'Language',
 ':',
 'EnglishOriginal',
 'Publication',
 ':',
 'United',
 'Kingdom',
 ':',
 'United',
 'Kingdom',
 ':',
 'Seeley',
 '&',
 'Co.,1900',
 '..',
 'Credits',
 ':',
 'Brian',
 'Wilsden',
 ',',
 'Tim',
 'Lindell',
 'and',
 'the',
 'Online',
 'Distributed',
 'Proofreading',
 'Team',
 'at',
 'https',
 ':',
 '//www.pgdp.net',
 '(',
 'This',
 'file',
 'was',
 'produced',
 'from',
 'images',
 'generously',
 'made',
 'available',
 'by',
 'The',
 'Internet',
 'Archive/American',
 'Libraries',
 '.',
 ')',
 '[',
 'Pg',
 'i',
 ']',
 '“',
 'Indeed',
 ',',
 '”',
 'said',
 'the',
 'Duke',
 ',',
 '“',
 'I',
 'should',
 'not',
 'have',
 'thought',
 'you',
 'so',
 'very',
 'pretty.',
 '”',
 '[',
 'Vain',
 'Kesta',
 ',',
 'p',
 '.',
 '43',
 '.'

In [None]:
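"""
Removing punctuation (digits are removed and lowercasing is done in the cells below).

"""

import string

# string.punctuation holds ASCII punctuation only, so curly quotes such as “ and ” are kept
punct_list = list(string.punctuation)
def remove_punctuation(text):
    # Replace every listed punctuation character with a space
    for punc in punct_list:
        if punc in text:
            text = text.replace(punc, ' ')
    return text.strip()

punc_rem_text = remove_punctuation(article_text)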

In [None]:
punc_rem_text

'Title  The windfairies  and other talesAuthor  Mary De MorganIllustrator  Olive J  CockerellRelease Date  January 24  2023  EBook  69875 Language  EnglishOriginal Publication  United Kingdom  United Kingdom  Seeley   Co  1900  Credits  Brian Wilsden  Tim Lindell and the Online Distributed Proofreading Team at https   www pgdp net  This file was produced from images generously made available by The Internet Archive American Libraries   Pg i “Indeed ” said the Duke  “I should not have thought you so very\r\npretty ” Vain Kesta  p  43   Pg ii  Pg iii  Pg 1 2  Pg 3 \nThere was once a\r\nwindmill which stood on the downs by the sea  far from any town or\r\nvillage  and in which the miller lived alone with his little daughter \r\nHis wife had died when the little girl  whose name was Lucilla  was a\r\nbaby  and so the miller lived by himself with his child  of whom he was\r\nvery proud  As her father was busy with his work  and as little Lucilla\r\nhad no other children to play with  she wa
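
The curly quotes “ and ” survive in the output above because `string.punctuation` contains ASCII characters only. One possible way to cover them, sketched here under the assumption that every character in a Unicode punctuation or symbol category should become a space, is to test `unicodedata.category`. The function name `remove_punctuation_unicode` is hypothetical.

```python
import unicodedata


def remove_punctuation_unicode(text):
    """Replace every Unicode punctuation (P*) or symbol (S*) character with a space."""
    return "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    )


sample = "“Indeed,” said the Duke & Co."
print(remove_punctuation_unicode(sample).split())
```

This treats apostrophes as punctuation too, so a word such as "gentleman’s" would be split into two tokens.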

In [None]:
"""
Lowercasing

"""

lower_case_punc_rem_text=punc_rem_text.lower()

In [None]:
lower_case_punc_rem_text

'title  the windfairies  and other talesauthor  mary de morganillustrator  olive j  cockerellrelease date  january 24  2023  ebook  69875 language  englishoriginal publication  united kingdom  united kingdom  seeley   co  1900  credits  brian wilsden  tim lindell and the online distributed proofreading team at https   www pgdp net  this file was produced from images generously made available by the internet archive american libraries   pg i “indeed ” said the duke  “i should not have thought you so very\r\npretty ” vain kesta  p  43   pg ii  pg iii  pg 1 2  pg 3 \nthere was once a\r\nwindmill which stood on the downs by the sea  far from any town or\r\nvillage  and in which the miller lived alone with his little daughter \r\nhis wife had died when the little girl  whose name was lucilla  was a\r\nbaby  and so the miller lived by himself with his child  of whom he was\r\nvery proud  as her father was busy with his work  and as little lucilla\r\nhad no other children to play with  she wa

In [None]:
"""
Removing digits from text

"""

final = ''.join((x for x in lower_case_punc_rem_text if not x.isdigit()))

print(final)


title  the windfairies  and other talesauthor  mary de morganillustrator  olive j  cockerellrelease date  january     ebook   language  englishoriginal publication  united kingdom  united kingdom  seeley   co    credits  brian wilsden  tim lindell and the online distributed proofreading team at https   www pgdp net  this file was produced from images generously made available by the internet archive american libraries   pg i “indeed ” said the duke  “i should not have thought you so very
pretty ” vain kesta  p     pg ii  pg iii  pg    pg  
there was once a
windmill which stood on the downs by the sea  far from any town or
village  and in which the miller lived alone with his little daughter 
his wife had died when the little girl  whose name was lucilla  was a
baby  and so the miller lived by himself with his child  of whom he was
very proud  as her father was busy with his work  and as little lucilla
had no other children to play with  she was alone nearly all day 
and had to 

In [None]:
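"""
Building the dictionary of word to frequency

"""
# Note: the tokens come from the raw article_text, not from the cleaned text in `final`,
# so punctuation, digits, and capital letters are still present in this vocabulary.
tokens_1 = article_text.split()
wordfreq_3 = {}

for token in tokens_1:
    if token not in wordfreq_3:
        wordfreq_3[token] = 1
    else:
        wordfreq_3[token] += 1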

In [None]:
wordfreq_3

{'Title:': 1,
 'The': 63,
 'windfairies,': 13,
 'and': 2469,
 'other': 43,
 'talesAuthor:': 1,
 'Mary': 1,
 'De': 1,
 'MorganIllustrator:': 1,
 'Olive': 1,
 'J.': 3,
 'CockerellRelease': 1,
 'Date:': 1,
 'January': 1,
 '24,': 1,
 '2023': 1,
 '[EBook': 1,
 '#69875]Language:': 1,
 'EnglishOriginal': 1,
 'Publication:': 1,
 'United': 2,
 'Kingdom:': 2,
 'Seeley': 1,
 '&': 4,
 'Co.,1900..Credits:': 1,
 'Brian': 1,
 'Wilsden,': 1,
 'Tim': 1,
 'Lindell': 1,
 'the': 2638,
 'Online': 1,
 'Distributed': 1,
 'Proofreading': 1,
 'Team': 1,
 'at': 219,
 'https://www.pgdp.net': 1,
 '(This': 1,
 'file': 1,
 'was': 502,
 'produced': 1,
 'from': 154,
 'images': 1,
 'generously': 1,
 'made': 50,
 'available': 1,
 'by': 124,
 'Internet': 1,
 'Archive/American': 1,
 'Libraries.)[Pg': 1,
 'i]“Indeed,”': 1,
 'said': 214,
 'Duke,': 7,
 '“I': 58,
 'should': 95,
 'not': 250,
 'have': 239,
 'thought': 71,
 'you': 476,
 'so': 179,
 'very': 159,
 'pretty.”[Vain': 1,
 'Kesta,': 19,
 'p.': 1,
 '43.][Pg': 1,
 'ii][
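
Entries such as 'windfairies,' and 'Title:' appear because the tokens are split from the raw text. A minimal sketch of a vocabulary built from lowercased, letters-only tokens follows; `build_vocabulary` is a hypothetical helper, and the pattern `[a-z]+` assumes plain ASCII English text.

```python
import re
from collections import Counter


def build_vocabulary(text):
    """Lowercase the text and count runs of ASCII letters as words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


sample = "Title: The windfairies, and other tales [Pg 3] “Indeed,” said the Duke."
vocab = build_vocabulary(sample)

print(sorted(vocab))
print(vocab["the"])
```

Digits, brackets, and quotes never enter the vocabulary, and "The" and "the" are counted together. Accented letters and apostrophes are also dropped by this pattern, which is a limitation of the sketch.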

In [None]:
"""
Difference between own vocabulary and the vocabulary created by spaCy

"""

difference_1 = set(wordfreq_3.items()) ^ set(wordfreq_2.items())

difference_1

{('you?”“Of', 1),
 ('Limited,', 1),
 ('quiet,', 1),
 ('drove', 1),
 ('midst', 3),
 ('LTD', 1),
 ('worm,', 1),
 ('long,', 5),
 ('wife;', 7),
 ('see,', 6),
 ('KING’S', 1),
 ('tighten', 4),
 ('does;', 1),
 ('songs,', 1),
 ('book.”—Bristol', 1),
 ('lamb', 6),
 ('triangles,', 1),
 ('fire', 9),
 ('down', 118),
 ('Instead', 1),
 ('saving', 2),
 ('trance,', 1),
 ('mine', 6),
 ('complexion', 2),
 ('implore', 2),
 ('nonsense', 1),
 ('could.', 1),
 ('possible', 2),
 ('head.“That', 1),
 ('asked', 47),
 ('trouble.”Next', 1),
 ('know,”', 1),
 ('due.', 1),
 ('pile', 3),
 ('spoke', 9),
 ('high,', 3),
 ('right,', 1),
 ('waved', 2),
 ('voice!', 1),
 ('hands.“I', 1),
 ('like?”', 1),
 ('flood', 2),
 ('minute', 6),
 ('stole', 6),
 ('rightly', 2),
 ('angrily', 1),
 ('delight', 4),
 ('thud,', 1),
 ('girl?”“I', 1),
 ('GREEN:', 1),
 ('staggered', 1),
 ('rained', 1),
 ('short?”“ay', 1),
 ('”“indeed', 2),
 ('comer', 1),
 ('ship.', 1),
 ('ask', 71),
 ('needed,', 1),
 ('“see', 1),
 ('Lucilla', 78),
 ('clad', 2),
 

In [None]:
"""
Difference between own vocabulary and the vocabulary created by NLTK

"""

difference_2 = set(wordfreq_3.items()) ^ set(wordfreq_1.items())

difference_2

{('you?”“Of', 1),
 ('Limited,', 1),
 ('quiet,', 1),
 ('midst', 3),
 ('worm,', 1),
 ('long,', 5),
 ('wife;', 7),
 ('see,', 6),
 ('KING’S', 1),
 ('foot', 8),
 ('does;', 1),
 ('songs,', 1),
 ('book.”—Bristol', 1),
 ('triangles,', 1),
 ('fire', 9),
 ('declare', 1),
 ('Instead', 1),
 ('trance,', 1),
 ('mine', 6),
 ('complexion', 2),
 ('nonsense', 1),
 ('could.', 1),
 ('60', 1),
 ('possible', 2),
 ('three', 8),
 ('head.“That', 1),
 ('asked', 47),
 ('trouble.”Next', 1),
 ('know,”', 1),
 ('due.', 1),
 ('pile', 3),
 ('spoke', 9),
 ('high,', 3),
 ('gloves', 1),
 ('right,', 1),
 ('voice!', 1),
 ('hands.“I', 1),
 ('like?”', 1),
 ('newspapers', 1),
 ('rightly', 2),
 ('when——', 1),
 ('angrily', 1),
 ('thud,', 1),
 ('girl?”“I', 1),
 ('GREEN:', 1),
 ('murmured', 1),
 ('ship.', 1),
 ('fell', 19),
 ('needed,', 1),
 ('pardon', 1),
 ('128', 1),
 ('gentleman’s', 3),
 ('parts', 1),
 ('rejoice', 1),
 ('trumpet', 4),
 ('52]Once', 1),
 ('hole,', 10),
 ('money', 31),
 ('stories.”—World.', 1),
 ('chap,”', 1),
 (

In [None]:
"""

Major drawback of my vocabulary, apart from the lack of stemming and lemmatization

The ".split()" method splits the text on whitespace only. It does not separate digits or
punctuation from words, so a word with attached punctuation or special characters
(e.g. 'windfairies,' or '“I') is counted as a different vocabulary entry from the plain word.


"""