# spaCy

### Models

spaCy comes with a variety of models for each language. For instance, the models for English are available [here](https://spacy.io/models/en). You'll need to download each model separately:

```bash
python3 -m spacy download en_core_web_sm
python3 -m spacy download en_core_web_md
```

In [5]:
## download the small and medium English core web models
## larger models generally perform better (for version 3.2.0 the sm wheel is about 14 MB and the md wheel about 46 MB)
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download en_core_web_md

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Pattern Matching Using spaCy

The code and example below are from Ashiq KS's article [Rule-Based Matching with spacy](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68):

In [1]:
# The input text string; it is converted to a Doc object further below
text = '''
Computer programming is the process of writing instructions that get executed by computers. 
The instructions, also known as code, are written in a programming language which the computer 
can understand and use to perform a task or solve a problem. Basic computer programming involves 
the analysis of a problem and development of a logical sequence of instructions to solve it. 
There can be numerous paths to a solution and the computer programmer seeks to design and 
code that which is most efficient. Among the programmer’s tasks are understanding requirements, 
determining the right programming language to use, designing or architecting the solution, coding, 
testing, debugging and writing documentation so that the solution can be easily
understood by other programmers. Computer programming is at the heart of computer science. It is the 
implementation portion of software development, application development 
and software engineering efforts, transforming ideas and theories into actual, working solutions.
'''

In [3]:
import spacy

In [4]:
from spacy.matcher import Matcher #import Matcher class from spacy
#import the Span class to extract the words from the document object
from spacy.tokens import Span 

#Language class with the English model 'en_core_web_sm' is loaded
nlp = spacy.load("en_core_web_sm")

doc = nlp(text) # convert the string above to a document

#instantiate a new Matcher class object 
matcher = Matcher(nlp.vocab)

### Define the Target Pattern

The `pattern` that you define should be a list of dictionaries, each describing one token to match, in order.

Here, we match `computer` (in any casing) used as a `NOUN`, followed by any token that is not a `VERB`.

In [None]:
# define the pattern
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
           {'POS': {'NOT_IN': ['VERB']}}]


### Load the Pattern into the Matcher

In [None]:
# add the pattern to the previously created matcher object
# (spaCy v3 takes a list of patterns; the older v2 form was matcher.add("Matching", None, pattern))
matcher.add("Matching", [pattern])
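
Once a pattern is registered, calling the matcher on a `Doc` returns `(match_id, start, end)` tuples that index into the document's tokens. The sketch below is a minimal illustration, not the article's code. To run without downloading a model it assumes a blank English pipeline (`spacy.blank("en")`), so the pattern uses only the tokenizer-level attributes `LOWER` and `IS_PUNCT` instead of `POS`. The pattern name `COMPUTER_PHRASE` and the sample sentence are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only), so no model download is needed.
# POS-based constraints require a tagged pipeline such as en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("Computer programming is at the heart of computer science.")

matcher = Matcher(nlp.vocab)
# "computer" in any casing, followed by any token that is not punctuation
pattern = [{"LOWER": "computer"}, {"IS_PUNCT": False}]
matcher.add("COMPUTER_PHRASE", [pattern])  # spaCy v3 signature: name, list of patterns

# Each match is (match_id, start_token, end_token); sort by start for stable output
found = []
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    span = doc[start:end]
    found.append(span.text)
    print(nlp.vocab.strings[match_id], start, end, span.text)
```

With a tagged pipeline such as `en_core_web_sm` loaded instead, the `POS` keys from the pattern defined earlier can be used in the same way.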

## Using Regular Expressions in spaCy

The example below can be found at https://spacy.io/usage/rule-based-matching. It uses the `re.finditer()` function to
iterate over all the matches found in the document text, and `doc.char_span()` to map each match back to tokens.

In [7]:
import spacy
import re
nlp = spacy.load("en_core_web_sm")
##think of this as 1 row of your dataframe, 1 document
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

##simple regex cleaning
expression = r"\b[Uu](nited|\.?) ?[Ss](tates|\.?)\b"
##finditer creates an iterator, allows you to "loop" through the text to find matches
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object or None if match doesn't map to valid token sequence
    if span is not None:
        print("Found match:", span.text)

Found match: United States
Found match: United States
Found match: US
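
The regex itself finds one more candidate than the loop prints. A minimal sketch using only the standard library (same expression and sentence as above) lists the raw matches. The abbreviation `U.S.` is matched only as `U.S`, without the trailing period. spaCy keeps `U.S.` as a single token, so that character range does not line up with token boundaries and `doc.char_span()` returns `None` for it, which is consistent with the three matches printed above.

```python
import re

text = ("The United States of America (USA) are commonly known as "
        "the United States (U.S. or US) or America.")
expression = r"\b[Uu](nited|\.?) ?[Ss](tates|\.?)\b"

# Collect the matched text together with its character offsets
raw_matches = [(m.group(0), m.span()) for m in re.finditer(expression, text)]
for matched_text, (start, end) in raw_matches:
    print(f"{matched_text!r} at characters {start}-{end}")
```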


## Part of Speech Tagging

In [6]:
# !python3 -m spacy download en_core_web_sm
# !python3 -m spacy download en_core_web_md

In [8]:
import spacy

nlp = spacy.load('en_core_web_md')

In [9]:
import pandas as pd
rows = []
doc = nlp(u"Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    rows.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))
    
data = pd.DataFrame(rows, columns=["text", "lemma", "part_of_speech", "tag", "dependency", "shape", "is_alpha", "is_stopword"])
data.head()

## one pass through the pipeline gives us:
## tokenization
## lemmatization
## part-of-speech tags (coarse pos_ and fine-grained tag_)
## dependency labels (compound marks a compound phrase, nsubj marks the nominal subject)
## stopword flags

| | text | lemma | part_of_speech | tag | dependency | shape | is_alpha | is_stopword |
|---|---|---|---|---|---|---|---|---|
| 0 | Steve | Steve | PROPN | NNP | compound | Xxxxx | True | False |
| 1 | Jobs | Jobs | PROPN | NNP | nsubj | Xxxx | True | False |
| 2 | and | and | CCONJ | CC | cc | xxx | True | True |
| 3 | Apple | Apple | PROPN | NNP | conj | Xxxxx | True | False |
| 4 | is | be | AUX | VBZ | aux | xx | True | True |


### Named Entity Recognition

In [10]:
# example from spacy docs; `nlp` is still the en_core_web_md pipeline loaded above
doc = nlp("Steve Jobs and Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

## named entities: people, organizations, places, monetary amounts, and so on

Steve Jobs 0 10 PERSON
Apple 15 20 ORG
U.K. 42 46 GPE
$1 billion 59 69 MONEY


In [11]:
# visualize this using displacy:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

# Word Embeddings (word2vec Introduction) from Intro to Algorithmic Marketing

### All images come from the textbook *Introduction to Algorithmic Marketing* by Ilya Katsov

## Continuous Bag of Words (Use Context to Predict Target Word)
![Continuous bag of words architecture](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/word2vec_cbow.png)

## Softmax
![Softmax](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/softmax.png)
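
As a rough reminder of what a softmax layer computes: it turns a vector of raw scores (one per vocabulary word) into probabilities that sum to 1. The sketch below is a plain-Python illustration with made-up scores and a three-word vocabulary; it is not taken from the textbook.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max score keeps exp() from overflowing; the result is unchanged
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate target words
vocab = ["cat", "dog", "banana"]
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```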

## Skipgram
![Skipgram architecture](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/skipgram.png)

## Word Embedding Clusters
![Word embedding clusters](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/wordembedding_cluster.png)

In [12]:
import en_core_web_sm
import en_core_web_md
import spacy
from scipy.spatial.distance import cosine
nlp = en_core_web_sm.load()

In [13]:
tokens = nlp(u'dog cat Beijing sad depressed couch sofa canine China Chinese France Paris banana')

for token1 in tokens:
    for token2 in tokens:
        if token1 != token2:
            print(f" {token1} - {token2}: {1 - cosine(token1.vector, token2.vector)}")
            
## 1 - cosine distance = cosine similarity
## China and France have high similarity (they are often used in the same contexts)
## Paris and France also have high similarity

 dog - cat: 0.6064711809158325
 dog - Beijing: 0.26680809259414673
 dog - sad: 0.23861631751060486
 dog - depressed: 0.15567244589328766
 dog - couch: 0.40754207968711853
 dog - sofa: 0.28589534759521484
 dog - canine: 0.40234488248825073
 dog - China: 0.3280990421772003
 dog - Chinese: 0.223441019654274
 dog - France: 0.45519712567329407
 dog - Paris: 0.4896498918533325
 dog - banana: 0.37603437900543213
 cat - dog: 0.6064711809158325
 cat - Beijing: 0.34632179141044617
 cat - sad: 0.2476261705160141
 cat - depressed: 0.08602733165025711
 cat - couch: 0.2503412663936615
 cat - sofa: 0.25963470339775085
 cat - canine: 0.38918542861938477
 cat - China: 0.28076714277267456
 cat - Chinese: 0.11958666890859604
 cat - France: 0.3794781565666199
 cat - Paris: 0.39501628279685974
 cat - banana: 0.4623035490512848
 Beijing - dog: 0.26680809259414673
 Beijing - cat: 0.34632179141044617
 Beijing - sad: 0.2716704308986664
 Beijing - depressed: 0.22452880442142487
 Beijing - couch: 0.1336926072835
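
`scipy.spatial.distance.cosine` returns the cosine *distance*, which is why the loop above prints `1 - cosine(...)`. The underlying similarity is the dot product of the two vectors divided by the product of their lengths. A minimal plain-Python sketch with made-up 3-dimensional vectors (real spaCy vectors are much longer):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen only to illustrate three typical cases
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1.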

In [18]:
nlp = en_core_web_sm.load()
tokens = nlp('dog')

print(tokens.vector.shape)
nlp = en_core_web_md.load()

tokens = nlp('dog')

print(tokens.vector.shape)

## the vector size depends on the spaCy model: 96 dimensions for sm, 300 for md

(96,)
(300,)


In [21]:
tokens = nlp('dog')
tokens.vector
## for en_core_web_md: this is the pretrained vector for "dog"; everyone who loads the same
## model version gets the same vector, so it is a shared representation of the word

array([-4.0176e-01,  3.7057e-01,  2.1281e-02, -3.4125e-01,  4.9538e-02,
        2.9440e-01, -1.7376e-01, -2.7982e-01,  6.7622e-02,  2.1693e+00,
       -6.2691e-01,  2.9106e-01, -6.7270e-01,  2.3319e-01, -3.4264e-01,
        1.8311e-01,  5.0226e-01,  1.0689e+00,  1.4698e-01, -4.5230e-01,
       -4.1827e-01, -1.5967e-01,  2.6748e-01, -4.8867e-01,  3.6462e-01,
       -4.3403e-02, -2.4474e-01, -4.1752e-01,  8.9088e-02, -2.5552e-01,
       -5.5695e-01,  1.2243e-01, -8.3526e-02,  5.5095e-01,  3.6410e-01,
        1.5361e-01,  5.5738e-01, -9.0702e-01, -4.9098e-02,  3.8580e-01,
        3.8000e-01,  1.4425e-01, -2.7221e-01, -3.7016e-01, -1.2904e-01,
       -1.5085e-01, -3.8076e-01,  4.9583e-02,  1.2755e-01, -8.2788e-02,
        1.4339e-01,  3.2537e-01,  2.7226e-01,  4.3632e-01, -3.1769e-01,
        7.9405e-01,  2.6529e-01,  1.0135e-01, -3.3279e-01,  4.3117e-01,
        1.6687e-01,  1.0729e-01,  8.9418e-02,  2.8635e-01,  4.0117e-01,
       -3.9222e-01,  4.5217e-01,  1.3521e-01, -2.8878e-01, -2.28

# Finding Most Similar Words (Using Our Old Methods)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# inspect the default settings for CountVectorizer
CountVectorizer()

In [None]:
with open("poor_amazon_toy_reviews.txt") as f:
    reviews = f.readlines()

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words="english",
                             max_features=500,
                             token_pattern=r'(?u)\b[a-zA-Z][a-zA-Z]+\b')
X = vectorizer.fit_transform(reviews)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data.head()

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# create similarity matrix (one row and one column per word)
words = vectorizer.get_feature_names_out()
similarity_matrix = pd.DataFrame(cosine_similarity(data.T.values),
                                 columns=words,
                                 index=words)

In [None]:
# unstack matrix into table
similarity_table = similarity_matrix.rename_axis(None).rename_axis(None, axis=1).stack().reset_index()

In [None]:
# rename columns
similarity_table.columns = ["word1", "word2", "similarity"]
similarity_table.shape

In [None]:
# drop self-pairs (similarity 1.0) and any other near-identical pairs
similarity_table = similarity_table[similarity_table["similarity"] < 0.99]
similarity_table.shape

In [None]:
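# each pair appears twice, as (a, b) and (b, a), with the same score;
# dropping duplicate scores keeps one of them (it also drops distinct pairs that tie)
similarity_table.sort_values(by="similarity", ascending=False).drop_duplicates(
    subset="similarity", keep="first").head(10)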
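
To see what the cells above compute without the review file, here is a small plain-Python sketch on three made-up documents. Each word is represented by its column of counts across documents, and word pairs are ranked by cosine similarity. It uses `itertools.combinations` so that each unordered pair appears once, which is an alternative to dropping rows with duplicate scores. The documents and variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus standing in for the review file
docs = [
    "broken toy cheap plastic",
    "cheap plastic broke fast",
    "battery died fast",
]

# Count vector per word: one entry per document (a column of the document-term matrix)
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted({word for c in counts for word in c})
word_vectors = {w: [c[w] for c in counts] for w in vocab}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Score every unordered word pair once, then sort from most to least similar
pairs = [
    (w1, w2, cosine_similarity(word_vectors[w1], word_vectors[w2]))
    for w1, w2 in combinations(vocab, 2)
]
pairs.sort(key=lambda p: p[2], reverse=True)

for w1, w2, score in pairs[:5]:
    print(f"{w1} - {w2}: {score:.2f}")
```

With so few documents, several distinct pairs tie at the same score, which is the situation where dropping duplicates by score alone would discard real pairs.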

In [None]:
top_500_words = list(vectorizer.get_feature_names_out())

# Exercise: Similar Words Using Word Embeddings

In [None]:
# load your top 500 words into spaCy

tokens = nlp(" ".join(top_500_words))

In [None]:
from itertools import product
# create a list of similarity tuples

similarity_tuples = []

for token1, token2 in product(tokens, repeat=2):
    # store the token text rather than the Token object so the table holds plain strings
    similarity_tuples.append((token1.text, token2.text, token1.similarity(token2)))

similarities = pd.DataFrame(similarity_tuples, columns=["word1","word2", "score"])


In [None]:
# find similar words
similarities[similarities["score"] < 1].sort_values(
    by="score", ascending=False).drop_duplicates(
    subset="score", keep="first").head(5)