# Intro to NLP
> Notes from the Kaggle Course.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/chart-preview.png

This course is fairly hands-on: it starts by introducing spaCy, so these notes include `spacy.load` code examples.

In [4]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [7]:
# The loaded pipeline can now process English text
doc = nlp("Tea is healthy and calming, don't you think?")

In [8]:
for token in doc:
  print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?


These are `Token` objects. Each token exposes its lemma (base form) as `token.lemma_` and whether it is a stop word as `token.is_stop`.

In [9]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
  print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		n't		True
you		you		True
think		think		False
?		?		False


Lemmatization and stop-word removal can help a model, but they can also hurt its performance. Treat them as preprocessing hyperparameters to tune, not as fixed defaults.
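
One way to treat these choices as hyperparameters is to put them behind flags in a small preprocessing helper. The sketch below is not part of the course: `preprocess` and its parameters `lemmatize` and `drop_stop` are hypothetical names. It relies only on the `text`, `lemma_`, and `is_stop` attributes of a token, so it runs here on stand-in objects without loading a model.

```python
from types import SimpleNamespace


def preprocess(tokens, lemmatize=True, drop_stop=True):
    """Return a list of strings from token-like objects.

    Each token only needs `text`, `lemma_`, and `is_stop` attributes,
    which spaCy tokens provide.
    """
    result = []
    for token in tokens:
        if drop_stop and token.is_stop:
            continue  # skip stop words when requested
        result.append(token.lemma_ if lemmatize else token.text)
    return result


# Stand-in tokens so the sketch runs without a spaCy model;
# the values mirror three rows of the table above.
fake_doc = [
    SimpleNamespace(text="Tea", lemma_="tea", is_stop=False),
    SimpleNamespace(text="is", lemma_="be", is_stop=True),
    SimpleNamespace(text="calming", lemma_="calm", is_stop=False),
]

print(preprocess(fake_doc))  # ['tea', 'calm']
print(preprocess(fake_doc, lemmatize=False, drop_stop=False))  # ['Tea', 'is', 'calming']
```

With spaCy, the `doc` created earlier can be passed directly, for example `preprocess(doc, lemmatize=True, drop_stop=False)`, and each flag combination can be compared on a validation set.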

## Pattern Matching

spaCy has pattern-matching capabilities that are easier to use than regular expressions.

In [10]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Matchers are built on a vocabulary, so the vocabulary of the English model loaded above (`nlp.vocab`) is used. `attr='LOWER'` matches on the lowercased form of each token, which makes matching case-insensitive.


Convert the terms to match into documents and add them to the matcher.

In [11]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", patterns)

In [12]:
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the " 
               "Galaxy Note 10 Plus and last year's iPhone XS and Google Pixel 3.")
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


A match is a tuple of `(match_id, start, end)`, where `start` and `end` are token indices into the document and `end` is exclusive.

In [13]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
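
To look at every match and not only the first, the same unpacking can be done in a loop. The helper below is a hypothetical sketch (`describe_matches` is not a spaCy function). It assumes that `str(doc[start:end])` yields the matched text, as it does for a spaCy `Span`, and that `strings` maps a match ID back to its name the way `nlp.vocab.strings` does.

```python
def describe_matches(doc, matches, strings):
    """Return (pattern_name, matched_text) pairs for every match."""
    return [
        (strings[match_id], str(doc[start:end]))
        for match_id, start, end in matches
    ]


# With the objects defined above, the call would look like:
# describe_matches(text_doc, matches, nlp.vocab.strings)
```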


## Text Classification

Machines need numeric representations of text. 

One way to convert a sentence or phrase to a numeric representation is to count the occurrences of each word in a document.
The resulting vector has one entry per word in the vocabulary of the entire corpus, so its length equals the vocabulary size. It is a variation of one-hot encoding.

This is called *bag of words*.
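
As a minimal sketch of the idea, the example below builds bag-of-words vectors with the standard library only. The two-sentence corpus, the whitespace tokenization, and the `bag_of_words` helper are illustrative assumptions and are not taken from the course.

```python
from collections import Counter

corpus = [
    "tea is healthy",
    "tea is calming and tea is cheap",
]

# Whitespace tokenization keeps the sketch dependency-free.
tokenized = [document.lower().split() for document in corpus]

# The vocabulary is every distinct word in the corpus, in a fixed order.
vocabulary = sorted({word for words in tokenized for word in words})


def bag_of_words(words, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(words, vocabulary) for words in tokenized]

print(vocabulary)  # ['and', 'calming', 'cheap', 'healthy', 'is', 'tea']
for vector in vectors:
    print(vector)
# [0, 0, 0, 1, 1, 1]
# [1, 1, 1, 0, 2, 2]
```

Every document maps to a vector of the same length (six here, one slot per vocabulary word), regardless of how long the document is.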

Another approach scales each term count by how rare the term is across the corpus, so words that appear in most documents get less weight. This representation is called TF-IDF (Term Frequency - Inverse Document Frequency).
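
The weighting can be sketched in a few lines. The version below assumes one common formulation, tf = count / document length and idf = log(N / document frequency). Libraries often use smoothed variants, so their exact numbers may differ. The corpus and the `tf_idf` helper are illustrative assumptions.

```python
import math
from collections import Counter

corpus = [
    "tea is healthy",
    "coffee is strong",
    "tea is calming",
]
tokenized = [document.lower().split() for document in corpus]


def tf_idf(words, tokenized_corpus):
    """Return {term: weight} for one document taken from the corpus.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    counts = Counter(words)
    n_docs = len(tokenized_corpus)
    weights = {}
    for term, count in counts.items():
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for other in tokenized_corpus if term in other)
        weights[term] = (count / len(words)) * math.log(n_docs / doc_freq)
    return weights


weights = tf_idf(tokenized[0], tokenized)
for term, weight in weights.items():
    print(f"{term}\t{weight:.3f}")
```

In this toy corpus "is" occurs in every document, so its weight is zero, while "healthy" occurs in only one document and receives the largest weight.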


spaCy supports a bag-of-words model through its `TextCategorizer` pipeline component.



In [14]:
nlp = spacy.blank("en")

textcat = nlp.create_pipe("textcat", config={
  "exclusive_classes": True,
  "architecture": "bow"
})

nlp.add_pipe(textcat)

ConfigValidationError: 

Config validation error

textcat -> architecture        extra fields not permitted
textcat -> exclusive_classes   extra fields not permitted

{'nlp': <spacy.lang.en.English object at 0x7fdf109c6fa0>, 'name': 'textcat', 'architecture': 'bow', 'exclusive_classes': True, 'model': {'@architectures': 'spacy.TextCatEnsemble.v2', 'linear_model': {'@architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': True, 'ngram_size': 1, 'no_output_layer': False}, 'tok2vec': {'@architectures': 'spacy.Tok2Vec.v2', 'embed': {'@architectures': 'spacy.MultiHashEmbed.v2', 'width': 64, 'rows': [2000, 2000, 1000, 1000, 1000, 1000], 'attrs': ['ORTH', 'LOWER', 'PREFIX', 'SUFFIX', 'SHAPE', 'ID'], 'include_static_vectors': False}, 'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2', 'width': 64, 'window_size': 1, 'maxout_pieces': 3, 'depth': 2}}}, 'threshold': 0.5, '@factories': 'textcat'}
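
The `ConfigValidationError` above most likely arises because the cell uses the spaCy v2 style of configuration (`architecture`, `exclusive_classes` at the top level), while the installed version is v3. A possible v3 equivalent is sketched below under stated assumptions: the architecture name `spacy.TextCatBOW.v2` and its settings are copied from the `linear_model` entry in the error output, `nlp.add_pipe` is called with the component's string name and not with a component object, and `build_textcat_pipeline` is a hypothetical helper name. The import sits inside the helper only so that the config can be inspected without spaCy installed.

```python
# Assumption: the architecture name and settings are copied from the
# 'linear_model' block of the error output above (spaCy 3.1).
TEXTCAT_BOW_CONFIG = {
    "model": {
        "@architectures": "spacy.TextCatBOW.v2",
        "exclusive_classes": True,
        "ngram_size": 1,
        "no_output_layer": False,
    }
}


def build_textcat_pipeline():
    """Create a blank English pipeline with a bag-of-words text classifier."""
    import spacy  # imported here so the config above needs no dependency

    nlp = spacy.blank("en")
    # In spaCy v3, add_pipe takes the component's string name and a config.
    textcat = nlp.add_pipe("textcat", config=TEXTCAT_BOW_CONFIG)
    return nlp, textcat
```

Calling `nlp, textcat = build_textcat_pipeline()` would replace the failing cell. Other spaCy releases may register a different version of the architecture, so the name may need adjusting.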