# NLP INTRODUCTION

In this notebook I will use the spaCy library for the following tasks:
    
    Basic text processing and pattern matching.
    Building machine learning models with text.
    Representing text with word embeddings that numerically capture the meaning of words and documents.

In [3]:
import spacy

Loading the en_core_web_sm model

In [5]:
nlp = spacy.load('en_core_web_sm')

Processing the text

In [8]:
doc = nlp("Winners don't do different things, they do the things differently.")

### Tokenising

Processing the text returns a document object that contains tokens.
A token is a unit of text in the document, such as an individual word or a punctuation mark. Iterating over the document yields its tokens in order.

In [11]:
for t in doc:
    print(t)

Winners
do
n't
do
different
things
,
they
do
the
things
differently
.
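
For contrast, here is a minimal sketch of a naive regex tokenizer in plain Python. It is not how spaCy tokenizes, and `naive_tokenize` is a hypothetical helper written only for this illustration. Because it splits purely on word characters and punctuation, it breaks the contraction "don't" into three pieces, whereas the spaCy output above keeps it as `do` and `n't`.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Winners don't do different things, they do the things differently."
naive_tokens = naive_tokenize(sentence)

# The contraction is split at the apostrophe instead of into "do" + "n't".
print(naive_tokens)
```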


### Text Processing

A few kinds of text preprocessing can improve how we model words.

**Lemma**: The base form of a word. E.g., "eating" is lemmatized to "eat".

**Stopword**: A word that occurs frequently in the English language but carries very little information. E.g., a, the, is, and.

In [14]:
print("Token\t\tLemma \t\tStopword")

for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token		Lemma 		Stopword
Winners		winner		False
do		do		True
n't		n't		True
do		do		True
different		different		False
things		thing		False
,		,		False
they		they		True
do		do		True
the		the		True
things		thing		False
differently		differently		False
.		.		False


However, lemmatizing and dropping stopwords entirely can discard useful information and degrade model performance.
Treat this preprocessing as a choice to tune during hyperparameter optimization rather than something to apply by default.
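
One way to act on this is to make each preprocessing step a switch and compare the variants downstream. The sketch below is plain Python and assumes the token, lemma and stopword values printed in the table above. `preprocess` and its two flags are hypothetical names for this illustration, not part of spaCy.

```python
from itertools import product

# (text, lemma, is_stop) rows copied from the table printed above.
rows = [
    ("Winners", "winner", False), ("do", "do", True), ("n't", "n't", True),
    ("do", "do", True), ("different", "different", False),
    ("things", "thing", False), (",", ",", False), ("they", "they", True),
    ("do", "do", True), ("the", "the", True), ("things", "thing", False),
    ("differently", "differently", False), (".", ".", False),
]

def preprocess(rows, lemmatize, drop_stopwords):
    """Return the token strings for one preprocessing configuration."""
    kept = [r for r in rows if not (drop_stopwords and r[2])]
    return [lemma if lemmatize else text for text, lemma, _ in kept]

# Enumerate all four on/off combinations, as a tuning loop would.
variants = {
    (lemmatize, drop): preprocess(rows, lemmatize, drop)
    for lemmatize, drop in product([False, True], repeat=2)
}
for (lemmatize, drop), tokens in variants.items():
    print(f"lemmatize={lemmatize}, drop_stopwords={drop}: {tokens}")
```

In a real experiment, each variant would feed the same model, and the validation score would decide which configuration to keep.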

### Pattern Matching

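Pattern matching finds tokens or phrases within chunks of text or whole documents. Below, PhraseMatcher is created with attr='LOWER', so it compares lowercased token text and the matching is case-insensitive.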

In [16]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

In [17]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)

In [18]:
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


In [19]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
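
To make the `(start, end)` indices concrete, here is a simplified plain-Python sketch of case-insensitive phrase matching over a token list. It is not spaCy's implementation: `find_phrases` is a hypothetical helper, phrases are split on whitespace rather than with spaCy's tokenizer, and the token list is written out by hand on the assumption that it mirrors how spaCy split the review text (it is consistent with the indices printed above).

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token spans whose lowercased text equals a phrase."""
    lowered = [t.lower() for t in tokens]
    targets = [p.lower().split() for p in phrases]
    spans = []
    for start in range(len(lowered)):
        for target in targets:
            end = start + len(target)  # end index is exclusive, like a slice
            if lowered[start:end] == target:
                spans.append((start, end))
    return spans

# Hand-written token list (assumed to match spaCy's split of the review text).
tokens = ("Glowing review overall , and some really interesting side - by - side "
          "photography tests pitting the iPhone 11 Pro against the Galaxy Note "
          "10 Plus and last year ’s iPhone XS and Google Pixel 3 .").split()
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']

spans = find_phrases(tokens, terms)
for start, end in spans:
    print(start, end, " ".join(tokens[start:end]))
```

Each span is a half-open range of token positions, which is why slicing the document with `start:end` in the cell above returns exactly the matched phrase.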
