# Getting [spaCy](https://spacy.io/) with NLP
A practical and accessible guide to natural language processing (NLP) using the spaCy Python library. spaCy provides a lot of out-of-the-box functionality:
- Multi-language support
- Pretrained models
- Built-in visualizers
- Named entity recognition

In [1]:
!pip install "spacy[transformers]"
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_lg
!pip install -U sentence-transformers
!pip install spacytextblob
!python -m textblob.download_corpora

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_trf")

# Options for displaying parts of speech with displaCy's dependency visualizer
options = {"compact": True, "color": "black", "font": "Source Sans Pro"}


---
## Displaying Named Entities
- Quick [demo article](https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/)

In [38]:
raw_text = """
Brain Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in San Francisco, California, United States.
"""

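# Run the pipeline; the detected entities are available as text1.ents
text1 = nlp(raw_text)

# Highlight the entities inline in the notebook
displacy.render(text1, style="ent", jupyter=True)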

In [3]:
# Look up the human-readable description of an entity label
spacy.explain("GPE")

---
## Comparing Sentence Similarity via Word2Vec/Cosine Similarity
- A nice explanation of [cosine similarity](https://www.sciencedirect.com/topics/computer-science/cosine-similarity#:~:text=Cosine%20similarity%20measures%20the%20similarity,document%20similarity%20in%20text%20analysis.)
- What is [Word2Vec](https://jalammar.github.io/illustrated-word2vec/)

The results will vary depending on which embedding (vectorization) you use.
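
To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made up for illustration; real embeddings have hundreds of dimensions, and the libraries used in this notebook compute the score for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented for this example)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not magnitude, so a vector and a scaled copy of it score (approximately) 1, unrelated directions score 0, and opposite directions score -1.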


In [4]:
# We need the non-transformer model if we want to stay inside spaCy for sentence similarity scoring (as of June 2022)
nlp_lg = spacy.load("en_core_web_lg")

# Create 3 example sentences
sent1 = nlp_lg("Running for president is probably hard.")
sent2 = nlp_lg("Space aliens lurk in the night time.")
sent3 = nlp_lg("The president was elected in november.")

# Compute similarity
print("Comparing against: ", sent1)
print(sent1.similarity(sent2), sent2)
print(sent1.similarity(sent3), sent3)
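
For a model with static word vectors, spaCy's default `Doc.vector` is an average of the token vectors, and `similarity` is the cosine between two such averages. The sketch below imitates that idea in plain Python with a made-up three-dimensional vocabulary. The `toy_vectors` table and both helper functions are hypothetical and are not spaCy's API. It illustrates one consequence of averaging: word order is ignored.

```python
import math

# Hypothetical 3-d word vectors, invented for this illustration
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "man": [0.7, 0.3, 0.1],
    "bites": [0.1, 0.8, 0.2],
    "sleeps": [0.0, 0.2, 0.9],
}

def sentence_vector(tokens):
    """Average the word vectors of the tokens (unknown words are skipped)."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

s1 = sentence_vector("dog bites man".split())
s2 = sentence_vector("man bites dog".split())
s3 = sentence_vector("dog sleeps".split())

order_swapped = cosine_similarity(s1, s2)  # same words, different order
different = cosine_similarity(s1, s3)      # different words

print(order_swapped, different)
```

Because the two reordered sentences average to the same vector, they are indistinguishable under this scheme, which is one reason contextual sentence embeddings can give different results.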

spaCy's transformer model doesn't produce sentence embeddings in the same way `en_core_web_lg` does. The suggested approach is to use the [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) for sentence similarity scoring.
- [StackOverflow reference](https://stackoverflow.com/questions/72454434/getting-similarity-score-with-spacy-and-a-transformer-model)
- I found it more accurate than spaCy's `en_core_web_lg` model
- It is a package independent of spaCy
- Good blog [post](https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6) on this library's usage

In [41]:
from sentence_transformers import SentenceTransformer, util

# Alternative model: 'paraphrase-MiniLM-L6-v2'
trf_model = SentenceTransformer('all-mpnet-base-v2')

# Sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
            'Space aliens lurk in the night time.',
            'The president was elected in november.',
            'Running for president is probably hard.',
            'The november election was a busy month for the president elect.']

# Sentences are encoded by calling model.encode()
embeddings = trf_model.encode(sentences, convert_to_tensor=True)

# Print the cosine similarity of every sentence to the reference sentence
print("Comparing against: ", sentences[4])
print('=========')
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    # print("Embedding:", embedding)
    cosine_scores = util.cos_sim(embeddings[4], embedding)
    print("Cos Sim:", cosine_scores.item())
    print("")

In [6]:
# There is probably a way to compute sentence similarity scores manually for the transformer model using the word embeddings, but I could not get this working.

# import torch

# def similarity(obj1, obj2):
#         (v1, t1), (v2, t2) = obj1._.trf_data.tensors, obj2._.trf_data.tensors
#         try:
#             return ((1 - cosine(v1, v2)) + (1 - cosine(t1, t2))) / 2
#         except:
#             return 0.0


# sent1 = nlp("Running for president is probably hard.")
# sent2 = nlp("Space aliens lurk in the night time.")

# s1 = sent1._.trf_data.tensors[0].reshape(-1, 768)
# s2 = sent2._.trf_data.tensors[0].reshape(-1, 768)

# cosine_scores
# similarity(sent1,sent2)

---
### Sentiment Analysis
spaCy can make use of the [TextBlob NLP library](https://github.com/sloria/TextBlob) to get polarity and subjectivity scores, which together make up a generic sentiment score.
- Separate resource for an intro to [TextBlob](https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/#:~:text=Polarity%20is%20float%20which%20lies,of%20%5B0%2C1%5D.)

In [43]:
# Importing SpacyTextBlob registers the 'spacytextblob' pipeline component
from spacytextblob.spacytextblob import SpacyTextBlob
nlp_lg.add_pipe('spacytextblob')

text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp_lg(text)
print("Polarity: ", doc._.blob.polarity)          # Polarity: float in [-1, 1]; 1 is a positive statement, -1 a negative one
print("Subjectivity: ", doc._.blob.subjectivity)  # Subjectivity: float in [0, 1]; 1 is personal opinion, emotion, or judgment, 0 is factual information
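
Polarity is a continuous score, so turning it into a label means choosing cut-offs. The sketch below shows one way to do that. The `label_sentiment` helper and the 0.1 threshold are assumptions for illustration and are not part of TextBlob or spacytextblob.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary choice; tune it for your own data.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Example scores (made up, not actual TextBlob output)
labels = [label_sentiment(p) for p in (-0.8, 0.05, 0.6)]
print(labels)  # ['negative', 'neutral', 'positive']
```

With the cell above, this could be called as `label_sentiment(doc._.blob.polarity)`.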

---
## Part of Speech Manipulation
- We can use part-of-speech (POS) tags to manipulate things like pluralization.

In [37]:
d = nlp_lg('A dog can sit on command if trained by a person.')
displacy.render(d, style="dep",jupyter=True, options=options)

# TextBlob tags use the Penn Treebank tag set; 'NN' marks a singular (or mass) noun
for word, pos in d._.blob.tags:
    if pos == 'NN':
        print(word.pluralize(), end=" ")
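
The loop above relies on TextBlob for both the tags and the pluralization. To show the idea without any library, the sketch below filters hand-written `(word, tag)` pairs and applies a deliberately naive pluralizer. The tag list and `naive_pluralize` are hypothetical stand-ins, not actual tagger output or TextBlob's algorithm.

```python
def naive_pluralize(noun):
    """Very rough English pluralization; irregular nouns are not handled."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

# Hand-written Penn Treebank style tags, for illustration only
tagged = [
    ("A", "DT"), ("dog", "NN"), ("can", "MD"), ("sit", "VB"),
    ("on", "IN"), ("command", "NN"), ("if", "IN"), ("trained", "VBN"),
    ("by", "IN"), ("a", "DT"), ("person", "NN"),
]

# Keep only singular nouns ('NN') and pluralize them
plurals = [naive_pluralize(word) for word, tag in tagged if tag == "NN"]
print(plurals)  # ['dogs', 'commands', 'persons']
```

Note that the naive rule produces "persons"; irregular plurals such as "people" need a lookup table, which is one reason to prefer a library routine over hand-rolled rules.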

---
### Rule-Based Matching
Rule-based matching finds tokens based on features of the text, such as part of speech. See the full list of [features here](https://spacy.io/usage/rule-based-matching#adding-patterns-attributes) and a [demo here](https://explosion.ai/demos/matcher).

In [34]:
from spacy.matcher import Matcher
from spacy.tokens import Span

# example sentences
doc1 = nlp("You read this book")
doc2 = nlp("I will book my ticket")

# Match "book" only where it is used as a noun
pattern = [{'TEXT': 'book', 'POS': 'NOUN'}]

# Initialize the matcher with the pipeline's shared vocab
matcher = Matcher(nlp.vocab)
matcher.add('book_noun', [pattern])
match1 = matcher(doc1)
match2 = matcher(doc2)

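def display_match(doc, matches):
    # Turn each match into a labeled Span and add it to the doc's entities
    for match_id, start, end in matches:
        span = Span(doc, start, end, label=match_id)
        doc.ents = list(doc.ents) + [span]

    # Render the doc with the matched spans highlighted as entities
    displacy.render(doc, style="ent", jupyter=True)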

# Show each sentence's dependency parse and POS tags, followed by any matches
displacy.render(doc1,style="dep",jupyter=True, options=options)
display_match(doc1, match1)
displacy.render(doc2,style="dep",jupyter=True, options=options)
display_match(doc2, match2)
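
Conceptually, a token pattern is a list of per-token attribute constraints that is slid across the document. The plain-Python sketch below imitates that for the `book` example. The `match_pattern` helper and the hand-tagged tokens are hypothetical, and spaCy's real `Matcher` supports much more (operators, regular expressions, sets).

```python
def match_pattern(tokens, pattern):
    """Return (start, end) index pairs where every pattern entry matches.

    tokens:  list of dicts, one per token, e.g. {"TEXT": "book", "POS": "NOUN"}
    pattern: list of dicts with the attribute values each token must have
    """
    matches = []
    width = len(pattern)
    if width == 0:
        return matches
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            all(tok.get(attr) == value for attr, value in spec.items())
            for tok, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches

# Hand-tagged tokens for the two example sentences (not real tagger output)
doc1_tokens = [
    {"TEXT": "You", "POS": "PRON"}, {"TEXT": "read", "POS": "VERB"},
    {"TEXT": "this", "POS": "DET"}, {"TEXT": "book", "POS": "NOUN"},
]
doc2_tokens = [
    {"TEXT": "I", "POS": "PRON"}, {"TEXT": "will", "POS": "AUX"},
    {"TEXT": "book", "POS": "VERB"}, {"TEXT": "my", "POS": "PRON"},
    {"TEXT": "ticket", "POS": "NOUN"},
]

pattern = [{"TEXT": "book", "POS": "NOUN"}]
noun_matches = match_pattern(doc1_tokens, pattern)
verb_matches = match_pattern(doc2_tokens, pattern)
print(noun_matches, verb_matches)  # [(3, 4)] []
```

As with the real matcher, spans are reported as half-open `(start, end)` token indices, and the second sentence yields no match because "book" is tagged as a verb there.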