<a href="https://www.kaggle.com/code/faressayah/nlp-with-spacy-nltk-gensim?scriptVersionId=117957129" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 📌 Notebook Goals
> - Understand Basic NLP Topics (Tokenization, Stemming, Lemmatization, Stop words).
> - Spacy for Vocabulary Matching.
> - Gensim for topic modeling and semantic similarity
> - General discussion of what Natural Language Processing is.

# 📚 What is Spacy?
> - Spacy is an open source Natural Language Processing Library designed to effectively handle NLP tasks with the most efficient implementation of common algorithms.
> - For many NLP tasks, Spacy only has one implementation method, choosing the most efficient algorithm currently available. This means you often don't have the option to choose other algorithms.

# 📝 What is NLTK?
> - NLTK (Natural Language Toolkit) is a very popular open-source library. Initially released in 2001, it is much older than spaCy (released in 2015). It also provides many functionalities, but its implementations are generally less efficient.

# 💪🏻 NLTK vs Spacy
> - For many common NLP tasks, Spacy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations. However, Spacy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

# 📚 spaCy Basics

**spaCy** (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).


## ✔️ Working with spaCy
- There are a few key steps for working with spaCy:
> 1. Loading the language library.
> 2. Building a Pipeline Object
> 3. Using Tokens
> 4. Parts-of-Speech Tagging
> 5. Understanding Token Attributes
___
## 🔪 Tokenization

The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. 

![tokenization.png](attachment:tokenization.png)

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`

> Notice that tokens are pieces of the original text. That is, we don't see any conversion to word stems or lemmas (base forms of words) and we haven't seen anything about organizations/places/money etc. Tokens are the basic building blocks of a Doc object - everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.
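
The prefix, suffix, infix and exception rules listed above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the `PREFIXES`, `SUFFIXES`, `INFIX_RE` and `EXCEPTIONS` rule sets and the `toy_tokenize` helper are hypothetical, and spaCy's real tokenizer uses much richer, language-specific rules.

```python
import re

# Hypothetical, tiny rule sets for illustration only (not spaCy's rules).
PREFIXES = ("$", "(", "“", "¿")
SUFFIXES = ("km", ")", ",", ".", "!", "”")
INFIX_RE = re.compile(r"(--|\.\.\.|[-/])")
EXCEPTIONS = {"St.", "U.S."}


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    # Exceptions are kept whole before any punctuation rule is applied.
    if chunk in EXCEPTIONS:
        return [chunk]
    # Peel one prefix off the front, then process the remainder.
    for prefix in PREFIXES:
        if chunk.startswith(prefix) and len(chunk) > len(prefix):
            return [prefix] + split_chunk(chunk[len(prefix):])
    # Peel one suffix off the end, then process the remainder.
    for suffix in SUFFIXES:
        if chunk.endswith(suffix) and len(chunk) > len(suffix):
            return split_chunk(chunk[:-len(suffix)]) + [suffix]
    # Split on infixes, keeping the separators as their own tokens.
    return [part for part in INFIX_RE.split(chunk) if part]


def toy_tokenize(text):
    """Tokenize text by whitespace first, then by the toy rules above."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize("Let's visit St. Louis (U.S.) for $10, a 5km walk!"))
print(toy_tokenize("real-time"))
```

Every token produced this way is still a piece of the original text, which matches the point made above: tokenization splits the text without changing it.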

## 📚 spaCy Objects

> After importing the spacy module in the first code cell (`In [1]`), we load a **model** and name it `nlp`.<br>Next we create a **Doc** object by applying the model to our text, and name it `doc`.<br>spaCy also builds a companion **Vocab** object that we'll cover in later sections.<br>The **Doc** object that holds the processed text is our focus here.

## ➿ Pipeline
> When we run `nlp`, our text enters a *processing pipeline* that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
![pipeline1.png](attachment:pipeline1.png)
We can check to see what components currently live in the pipeline. In later sections we'll learn how to disable components and add new ones as needed.

___
## 🔖 Part-of-Speech Tagging (POS)
> The next step after splitting the text up into tokens is to assign parts of speech. In the output of the first code cell (`In [1]`), `Tesla` is recognized as a ***proper noun***. Here some statistical modeling is required. For example, words that follow "the" are typically nouns.

___
## 🧮 Dependencies
> We also looked at the syntactic dependencies assigned to each token. `Tesla` is identified as an `nsubj` or the ***nominal subject*** of the sentence.

___
## ➕ Additional Token Attributes
> We'll see these again in upcoming lectures. For now we just want to illustrate some of the other information that spaCy assigns to tokens:

|Attribute|Description|Example (`Tesla`)|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`Tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|


___
## 🧾 Spans
> Large Doc objects can be hard to work with at times. A **span** is a slice of a Doc object in the form `Doc[start:stop]`.

___
## 📑 Sentences
> Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Doc.sents`. Later we'll write our own segmentation rules.

In [1]:
import spacy
import pandas as pd

data = pd.read_csv('../input/nlp-getting-started/train.csv')

# 1. Loading the language library
nlp = spacy.load('en_core_web_sm')

# 2. Building a Pipeline Object
doc = nlp(u'''
Tesla will start selling cars in India next year, government says. 
Elon Mask (CEO of Tesla) is now the richest men in the world.
''')


# 3. Using Tokens
for token in doc:
    print(f"{token.text:{12}}{token.pos_:{12}}{token.dep_:{12}}{token.lemma_}")


           SPACE       dep         

Tesla       PROPN       nsubj       Tesla
will        AUX         aux         will
start       VERB        ccomp       start
selling     VERB        xcomp       sell
cars        NOUN        dobj        car
in          ADP         prep        in
India       PROPN       pobj        India
next        ADJ         amod        next
year        NOUN        npadvmod    year
,           PUNCT       punct       ,
government  NOUN        nsubj       government
says        VERB        ROOT        say
.           PUNCT       punct       .

           SPACE       dep         

Elon        PROPN       compound    Elon
Mask        PROPN       prep        Mask
(           PUNCT       punct       (
CEO         PROPN       appos       CEO
of          ADP         prep        of
Tesla       PROPN       pobj        Tesla
)           PUNCT       punct       )
is          AUX         ROOT        be
now         ADV         advmod      now
the         DET         det       

In [2]:
data[data.target == 1]['text'][1]

'Forest fire near La Ronge Sask. Canada'

In [3]:
data[data.target == 1]['text'][300]

'Shadow boxing the apocalypse'

In [4]:
data[data.target == 0]['text']

15                                         What's up man?
16                                          I love fruits
17                                       Summer is lovely
18                                      My car is so fast
19                           What a goooooooaaaaaal!!!!!!
                              ...                        
7581    @engineshed Great atmosphere at the British Li...
7582    Cramer: Iger's 3 words that wrecked Disney's s...
7584    These boxes are ready to explode! Exploding Ki...
7587                                   Sirens everywhere!
7593    I just heard a really loud bang and everyone i...
Name: text, Length: 4342, dtype: object

In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f614ecf8a60>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f614ecf8910>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f614e9c27d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f614e967f50>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f614e977190>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f614e9c2a50>)]

In [6]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
text = """
Elon Musk, the billionaire CEO of Tesla and SpaceX, is now the richest person in the world, surpassing former titleholder and Amazon chief Jeff Bezos with a net worth of $189.7 billion, according to Forbes’s real-time billionaire net-worth estimates on Jan. 8, 2021 at 1pm. Since March, Musk’s wealth has grown almost seven-fold, up a staggering $163.1 billion.
"""
doc = nlp(text)

In [8]:
quote = doc[30:50]
print(quote)
print(type(quote))

a net worth of $189.7 billion, according to Forbes’s real-time billionaire net-worth estimates
<class 'spacy.tokens.span.Span'>


In [9]:
for i, sentence in enumerate(doc.sents, 1):
    print(f"{i} - {sentence}")

1 - 
Elon Musk, the billionaire CEO of Tesla and SpaceX, is now the richest person in the world, surpassing former titleholder and Amazon chief Jeff Bezos with a net worth of $189.7 billion, according to Forbes’s real-time billionaire net-worth estimates on Jan. 8, 2021 at 1pm.
2 - Since March, Musk’s wealth has grown almost seven-fold, up a staggering $163.1 billion.
3 - 



## Named Entity

In [10]:
for entity in doc.ents:
    print(f"{entity.text:-<{20}}{entity.label_:-<{20}}{str(spacy.explain(entity.label_))}")

Tesla---------------ORG-----------------Companies, agencies, institutions, etc.
Amazon--------------ORG-----------------Companies, agencies, institutions, etc.
Jeff Bezos----------PERSON--------------People, including fictional
$189.7 billion------MONEY---------------Monetary values, including unit
Forbes--------------ORG-----------------Companies, agencies, institutions, etc.
Jan. 8, 2021--------DATE----------------Absolute or relative dates or periods
1pm-----------------TIME----------------Times smaller than a day
March---------------DATE----------------Absolute or relative dates or periods
Musk----------------PERSON--------------People, including fictional
almost seven-fold---CARDINAL------------Numerals that do not fall under another type
$163.1 billion------MONEY---------------Monetary values, including unit


## Noun Chunks

In [11]:
for chunk in doc.noun_chunks:
    print(chunk.text)


Elon Musk
the billionaire CEO
Tesla
SpaceX
the richest person
the world
former titleholder
Amazon chief Jeff Bezos
a net worth
Forbes’s real-time billionaire net-worth estimates
Jan.
1pm
March
Musk’s wealth


## Built-in Visualizers

In [12]:
from spacy import displacy

displacy.render(doc, style='dep', jupyter=True, options={'distance':90})

## Visualizing the entity recognizer

In [13]:
displacy.render(doc, style='ent', jupyter=True)

## Stemming

- Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for 'boat' might return 'boats' and 'boating'. Here, 'boat' would be the stem for [boat, boater, boating, boats].
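
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. The `SUFFIXES` tuple and the `naive_stem` function are hypothetical and far cruder than the Porter and Snowball stemmers used in the cells that follow; they only show the principle of reducing variants to a common stem.

```python
# Hypothetical suffix list for illustration; real stemmers apply many ordered rules.
SUFFIXES = ("ing", "er", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["boat", "boater", "boating", "boats"]:
    print(f"{word} --------> {naive_stem(word)}")
```

A rule set this small breaks quickly on irregular forms (it leaves `ran` untouched, for example), which is one reason established algorithms are preferred in practice.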


In [14]:
import nltk
from nltk.stem.porter import PorterStemmer

words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
p_stemmer = PorterStemmer()

for word in words:
    print(f"{word} --------> {p_stemmer.stem(word)}")

run --------> run
runner --------> runner
ran --------> ran
runs --------> run
easily --------> easili
fairly --------> fairli
fairness --------> fair


In [15]:
from nltk.stem.snowball import SnowballStemmer

words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly', 'fairness']
s_stemmer = SnowballStemmer(language='english')

for word in words:
    print(f"{word} --------> {s_stemmer.stem(word)}")

run --------> run
runner --------> runner
ran --------> ran
runs --------> run
easily --------> easili
fairly --------> fair
fairness --------> fair


In [16]:
words = ['generous', 'generation', 'generously', 'generate']

print('===============SNOWBALL STEMMER================')
for word in words:
    print(f"{word} --------> {s_stemmer.stem(word)}")
    
print('===============PORTER STEMMER================')
for word in words:
    print(f"{word} --------> {p_stemmer.stem(word)}")

generous --------> generous
generation --------> generat
generously --------> generous
generate --------> generat
generous --------> gener
generation --------> gener
generously --------> gener
generate --------> gener


## Lemmatization

- In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words. The lemma of 'was' is 'be' and the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence. Lemmatization is typically seen as much more informative than simple stemming, which is why Spacy has opted to only have Lemmatization available instead of Stemming. 
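
The dependence on usage can be illustrated with a small lookup keyed on both the word and its part of speech. This is a minimal sketch, assuming a hypothetical hand-written `LEMMA_TABLE`; spaCy's lemmatizer relies on much larger tables and rules.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse POS tag).
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("ran", "VERB"): "run",
}


def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemma("meeting", "VERB"))  # used as a verb: "We are meeting today"
print(toy_lemma("meeting", "NOUN"))  # used as a noun: "The meeting is over"
print(toy_lemma("was", "AUX"))
```

The same surface form maps to different lemmas depending on the tag, which is why lemmatization normally runs after part-of-speech tagging.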

In [17]:
text = nlp(u"I am a runner running in a race because I love to run since I ran everyday")

for token in text:
    print(f"{token.text:{12}}{token.pos_:{10}}\t{token.lemma:{20}}\t{token.lemma_}")

I           PRON      	 4690420944186131903	I
am          AUX       	10382539506755952630	be
a           DET       	11901859001352538922	a
runner      NOUN      	12640964157389618806	runner
running     VERB      	12767647472892411841	run
in          ADP       	 3002984154512732771	in
a           DET       	11901859001352538922	a
race        NOUN      	 8048469955494714898	race
because     SCONJ     	16950148841647037698	because
I           PRON      	 4690420944186131903	I
love        VERB      	 3702023516439754181	love
to          PART      	 3791531372978436496	to
run         VERB      	12767647472892411841	run
since       SCONJ     	10066841407251338481	since
I           PRON      	 4690420944186131903	I
ran         VERB      	12767647472892411841	run
everyday    NOUN      	12502803309396265471	everyday


## Stop Words

- Words like 'a' and 'the' appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these stop words, and they can be filtered from text to be processed. Spacy holds a built-in list of some `326` English stop words.
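
Filtering itself is simple once a stop list exists. The sketch below uses a hypothetical seven-word `STOP_WORDS` set rather than spaCy's list, and compares lowercased tokens so that `IS` and `is` are treated alike.

```python
# Hypothetical mini stop list; spaCy's English list is far longer.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


print(remove_stop_words(["The", "price", "of", "a", "Tesla", "IS", "high"]))
```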

In [18]:
nlp = spacy.load('en_core_web_sm')

print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

{'each', 'beforehand', 'anyhow', 'may', 'much', 'somehow', 'noone', 'you', 'an', 'though', 'me', 'any', 'alone', 'no', 'up', 'do', 'next', 'go', 'so', 'below', 'still', 'from', 'she', "'re", '‘m', 'i', 'elsewhere', 'thereby', 'others', 'another', 'never', 'had', 'should', 'both', '’s', '‘d', 'under', 'not', 'toward', 'few', 'three', 'thru', 'whole', 'eleven', 'whenever', 'very', 'therefore', 'what', 'through', 'name', 'your', 'ourselves', 'of', 'some', 'least', 'nevertheless', 'seeming', 'thus', 'until', 'n’t', 'say', 'has', 'before', 'call', 'could', 'herein', 'nor', '’ve', 'back', 'top', 'front', 'also', 'in', 'yourselves', 'last', 'us', 'only', 'seem', 'namely', 'afterwards', 'first', 'out', 'became', 'now', 'become', 'where', 'whence', 'latterly', 'itself', 'even', 'about', 'upon', 'themselves', 'her', '‘s', 'onto', 'was', 'by', 'well', 'myself', 'either', 'anyone', 'former', 'their', 'are', 'take', 'sometime', 'becomes', 'he', 'they', '‘ll', 'somewhere', 'bottom', 'nothing', 'fort

In [19]:
words = ['is', 'and', 'Tesla', 'you', 'IS', 'AND']

for word in words:
    print(f"{word}: is stop word: {nlp.vocab[word].is_stop}")

is: is stop word: True
and: is stop word: True
Tesla: is stop word: False
you: is stop word: True
IS: is stop word: True
AND: is stop word: True


In [20]:
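# We can add our own stop words
nlp.Defaults.stop_words.add('btw')
nlp.Defaults.stop_words.add('u')
# Note: 'u' is still reported as a non-stop word in the output below, most likely
# because a lexeme already cached in the vocab keeps its old flag.
# Setting nlp.vocab['u'].is_stop = True updates the flag directly.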

sentence = 'Where was u ? I was looking for you btw session...'
for word in sentence.split():
    print(f"{word:{20}}: is stop word: {nlp.vocab[word].is_stop}")

Where               : is stop word: True
was                 : is stop word: True
u                   : is stop word: False
?                   : is stop word: False
I                   : is stop word: True
was                 : is stop word: True
looking             : is stop word: False
for                 : is stop word: True
you                 : is stop word: True
btw                 : is stop word: True
session...          : is stop word: False


In [21]:
# We can also remove a stop word
nlp.vocab['for'].is_stop = False
sentence = 'Where was u ? I was looking for you btw session...'
for word in sentence.split():
    print(f"{word:{20}}: is stop word: {nlp.vocab[word].is_stop}")

Where               : is stop word: True
was                 : is stop word: True
u                   : is stop word: False
?                   : is stop word: False
I                   : is stop word: True
was                 : is stop word: True
looking             : is stop word: False
for                 : is stop word: False
you                 : is stop word: True
btw                 : is stop word: True
session...          : is stop word: False


## Phrase Matching and Vocabulary

- We can think of this as a more powerful version of regular expressions, in which patterns are written over tokens and can take token attributes such as case, punctuation and part of speech into account.
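
As a rough mental model, each pattern is a list of per-token conditions, and a match is a run of consecutive tokens that satisfies them in order. The sketch below is a plain-Python illustration under that assumption; `match_patterns`, `lower_is` and `is_punct` are hypothetical helpers, not part of spaCy, and spaCy's `Matcher` supports many more attributes and operators.

```python
def match_patterns(tokens, patterns):
    """Return (start, end) index pairs where any pattern matches the token list.

    Each pattern is a list of predicates, one per token, applied in order.
    """
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                check(tok) for check, tok in zip(pattern, window)
            ):
                matches.append((start, end))
    return matches


def lower_is(value):
    """Build a predicate that compares the lowercased token with value."""
    return lambda tok: tok.lower() == value


def is_punct(tok):
    """True if the token contains only non-alphanumeric, non-space characters."""
    return len(tok) > 0 and all(not ch.isalnum() and not ch.isspace() for ch in tok)


patterns = [
    [lower_is("solarpower")],                          # SolarPower
    [lower_is("solar"), is_punct, lower_is("power")],  # Solar-Power
    [lower_is("solar"), lower_is("power")],            # Solar Power
]

tokens = ["Solar", "Power", "and", "SolarPower", "and", "Solar", "-", "Power"]
print(match_patterns(tokens, patterns))
```

The `(start, end)` pairs are token indexes, in the same spirit as the `(match_id, start, end)` tuples that the `Matcher` cell in this section prints.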

In [22]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern_1 = [{'LOWER': 'solarpower'}] # ----> SolarPower
pattern_2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}] # ---> Solar-Power
pattern_3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}] # ---> Solar Power

matcher.add('SolarPower', [pattern_1, pattern_2, pattern_3])

text = u'''
Solar Power is the conversion of energy from sunlight into electricity, 
either directly using photovoltaics (PV), indirectly using concentrated SolarPower, 
or a combination. Concentrated Solar-Power systems use lenses or mirrors and solar 
tracking systems to focus a large area of sunlight into a small beam.
'''
doc = nlp(text)
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 25, 26), (8656102463236116519, 33, 36)]


## Word Vectors and Semantic Similarity

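- spaCy can compare two objects and estimate how similar they are with `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`. Each takes another object and returns a similarity score, by default the cosine similarity of the underlying word vectors: close to `1` for very similar text and lower for unrelated text.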
- `Important`: needs a model that has word vectors included, for example: `en_core_web_md`, `en_core_web_lg`, not `en_core_web_sm`.
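
Cosine similarity itself is easy to compute by hand. The following sketch uses made-up three-dimensional vectors (real word vectors have hundreds of dimensions) and a hypothetical `cosine_similarity` helper, so treat the numbers as purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
pizza = [0.9, 0.8, 0.1]
pasta = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]

print(cosine_similarity(pizza, pasta))  # similar directions give a high score
print(cosine_similarity(pizza, car))    # different directions give a lower score
```

Because the score depends only on the angle between vectors, it is symmetric, and in principle it can be negative when two vectors point in opposite directions.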

In [23]:
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.3.0/en_core_web_md-3.3.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 7.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [24]:
import en_core_web_md

# Load a larger model with vectors
nlp = en_core_web_md.load()

# Compare two documents
doc_1 = nlp("I like fast food")
doc_2 = nlp("I like pizza")

print(doc_1.similarity(doc_2))
print(doc_2.similarity(doc_1))

0.8382381514100752
0.8382381514100752


In [25]:
# Compare two tokens
doc = nlp("I like pizza and pasta")

token_1 = doc[2]
token_2 = doc[4]
print(token_1.similarity(token_2))

1.0


In [26]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.5528544552721405


## View token tags

Recall that we can obtain a particular token by its index position.
- To view the description of either type of tag use `spacy.explain(tag)`

In [27]:
text = u'''
Since March, Musk’s wealth has grown almost seven-fold, up a staggering $163.1 billion.
'''
doc = nlp(text)
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}")


          SPACE    _SP    whitespace
Since      SCONJ    IN     conjunction, subordinating or preposition
March      PROPN    NNP    noun, proper singular
,          PUNCT    ,      punctuation mark, comma
Musk       PROPN    NNP    noun, proper singular
’s         PART     POS    possessive ending
wealth     NOUN     NN     noun, singular or mass
has        AUX      VBZ    verb, 3rd person singular present
grown      VERB     VBN    verb, past participle
almost     ADV      RB     adverb
seven      NUM      CD     cardinal number
-          ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fold       ADV      RB     adverb
,          PUNCT    ,      punctuation mark, comma
up         ADV      RB     adverb
a          DET      DT     determiner
staggering ADJ      JJ     adjective (English), other noun-modifier (Chinese)
$          SYM      $      symbol, currency
163.1      NUM      CD     cardinal number
billion    NUM      CD     cardinal number
.          PUNCT   

## Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

***


___
## Fine-grained Part-of-speech Tags
Tokens are subsequently given a fine-grained tag as determined by morphology:
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>

## Working with POS Tags

In English, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important.

In [28]:
doc = nlp("I read books on NLP.")
r = doc[1]
print(f"{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}\n")

doc = nlp("I am reading a book on NLP.")
r = doc[2]
print(f"{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}")

read       VERB     VBP    verb, non-3rd person singular present

reading    VERB     VBG    verb, gerund or present participle


## Counting POS Tags

The `Doc.count_by()` method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object.

In [29]:
doc = nlp("The quick brown fox jumped over the lazy dog's back.")

pos_count = doc.count_by(spacy.attrs.POS)
print(pos_count)
new = {}
for key, value in pos_count.items():
    new[doc.vocab[key].text] = value
    
print(new)

{90: 2, 84: 3, 92: 3, 100: 1, 85: 1, 94: 1, 97: 1}
{'DET': 2, 'ADJ': 3, 'NOUN': 3, 'VERB': 1, 'ADP': 1, 'PART': 1, 'PUNCT': 1}


# Gensim

Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. 

## Document 
In Gensim, a document is an object of the text sequence type (commonly known as `str` in Python 3). A document could be anything from a short 140-character tweet or a single paragraph (e.g., a journal article abstract) to a news article or a book.

In [30]:
!wget https://www.gutenberg.org/files/1342/1342-0.txt

--2023-02-01 18:18:55--  https://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 772145 (754K) [text/plain]
Saving to: ‘1342-0.txt’


2023-02-01 18:19:03 (801 KB/s) - ‘1342-0.txt’ saved [772145/772145]



In [31]:
import os
import spacy

nlp = spacy.load('en_core_web_sm')

def read_file(file_name):
    with open(file_name, 'r') as file:
        return file.read()
    
text = read_file('1342-0.txt')
processed_text = nlp(text)

## Corpus
A corpus is a collection of document objects. Corpora serve two roles in Gensim:
1. Input for training a model.
2. Documents to organize.

In [32]:
# An example corpus: the sentences of the book (the count is printed below)
sentences = [s for s in processed_text.sents]

print(len(sentences))

6183


In [33]:
print(sentences[30:34])

[
All the minor passages--the loves of Jane and Bingley, the advent of Mr.
Collins, the visit to Hunsford, the Derbyshire tour--fit in after the
same unostentatious, but masterly fashion., There is no attempt at the
hide-and-seek, in-and-out business, which in the transactions between
Frank Churchill and Jane Fairfax contributes no doubt a good deal to the
intrigue of_ Emma, _but contributes it in a fashion which I do not think
the best feature of that otherwise admirable book., Although Miss Austen
always liked something of the misunderstanding kind, which afforded her
opportunities for the display of the peculiar and incomparable talent to
be noticed presently, she has been satisfied here with the perfectly
natural occasions provided by the false account of Darcy’s conduct given
by Wickham, and by the awkwardness (arising with equal naturalness) from
the gradual transformation of Elizabeth’s own feelings from positive
aversion to actual love., I do not know whether the all-grasping h

In [34]:
len(processed_text.text.split())

130406

The above example loads the entire corpus into memory. In practice, corpora may be very large, so loading them into memory may be impossible. Gensim handles such corpora by streaming one document at a time.
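
A common way to stream a corpus is a small class whose `__iter__` method reads the source lazily. The sketch below assumes one document per line and writes a throwaway temporary file so it can run on its own; the `StreamedCorpus` class and the sample text are hypothetical.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one tokenized document (here: one non-empty line) at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:
                    yield tokens


# Build a small throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("It is a truth universally acknowledged\n\nthat a single man\n")
    corpus_path = tmp.name

corpus = StreamedCorpus(corpus_path)
for document in corpus:
    print(document)

# Unlike a plain generator, the object can be iterated a second time.
second_pass = list(corpus)
os.remove(corpus_path)
```

Using a class rather than a generator function matters when a model needs several passes over the data, because a generator is exhausted after the first pass.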

## Word2Vec


Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships.

Word2Vec is very useful in automatic text tagging, recommender systems and machine translation.

`Word2Vec` embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors close together have similar meanings based on context, and vectors distant from each other have differing meanings. For example, `strong` and `powerful` would be close together, while `strong` and `Paris` would be relatively far apart.
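
"Based on context" has a concrete meaning: in the skip-gram variant, the network is trained on (center word, nearby word) pairs taken from a sliding window. The sketch below only generates such pairs; `skipgram_pairs` is a hypothetical helper, and Gensim's actual training adds further steps that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["Darcy", "danced", "with", "Elizabeth"], window=1)
for center, context in pairs:
    print(f"{center} -> {context}")
```

Words that keep appearing with the same neighbours produce similar training pairs, which is why their vectors end up close together.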

In [35]:
import gensim
from gensim.models import Word2Vec

print(f"Gensim Version: {gensim.__version__}")

# We need data for training the model
processed_sentences = [sent.lemma_.split() for sent in processed_text.sents]
processed_sentences[0]

Gensim Version: 4.0.1


['\ufeffthe',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Pride',
 'and',
 'prejudice,',
 'by',
 'Jane',
 'Austen',
 'this',
 'ebook',
 'be',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
 'other',
 'part',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restriction',
 'whatsoever.']

In [36]:
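# Word2Vec accepts several parameters that affect both training speed and quality
interchangeable_words_model = Word2Vec(
    sentences=processed_sentences,
    min_count=10, # Prune the internal dictionary: ignore words seen fewer than 10 times
#     vector_size=200, # number of dimensions of the word vectors (called `size` before Gensim 4.0)
    window=2, # maximum distance between the current and the predicted word
    compute_loss=True, # store the training loss so it can be read back below
    sg=1 # 1 = skip-gram, 0 = CBOW
)

# print(len(interchangeable_words_model.wv.key_to_index))  # vocabulary size (Gensim 4.x)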

# getting the training loss
training_loss = interchangeable_words_model.get_latest_training_loss()
print(f"Training Loss: {training_loss}")

for w, sim in interchangeable_words_model.wv.most_similar('Darcy'):
    print((w, sim))

Training Loss: 899973.5
('Bingley', 0.970473051071167)
('Collins', 0.9701353907585144)
('Darcy,', 0.9364341497421265)
('Collins,', 0.9210847020149231)
('Wickham', 0.9209935069084167)
('Gardiner', 0.9069268703460693)
('Bingley,', 0.8984025716781616)
('Bingley’s', 0.8918498754501343)
('Darcy’s', 0.8892329931259155)
('Bennet,', 0.8779745697975159)


## Evaluating

Because `Word2Vec` training is unsupervised, there is no single objective way to evaluate the result. In practice, evaluation depends on the end application.

## FastText

In [37]:
# Gensim 4.x names the corpus argument `corpus_iterable` in build_vocab() and train()
# from gensim.models import FastText

# model = FastText(window=2)
# model.build_vocab(corpus_iterable=processed_sentences)
# model.train(corpus_iterable=processed_sentences, total_examples=len(processed_sentences), epochs=10)

# for w, sim in model.wv.most_similar('Darcy'):
#     print((w, sim))

In [38]:
# for w, sim in model.wv.most_similar('Darcy', topn=20):
#     print((w, sim))

In [39]:
# model = FastText(window=50)
# model.build_vocab(corpus_iterable=processed_sentences)
# model.train(corpus_iterable=processed_sentences, total_examples=len(processed_sentences), epochs=10)

# for w, sim in model.wv.most_similar('Darcy'):
#     print((w, sim))

In [40]:
# Gensim 4.x replaced `wv.vocab` with `wv.key_to_index`
# print("night" in model.wv.key_to_index)
# print("nights" in model.wv.key_to_index)

In [41]:
# print(model.wv.similarity("nights", "night"))
# print(model.wv.similarity("tonight", "night"))