This file contains notes about **word embeddings and their use cases** and **pipelines**.

In [16]:
import spacy
from spacy import displacy
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def load_nlp(modelname):
    return spacy.load(modelname)

nlp = load_nlp("en_core_web_lg")

with open("/home/eric/python/notes/spacynotes/fcc_repo/data/wiki_us.txt", "r") as f:
    text = f.read()
    
doc = nlp(text)

# Word Embeddings

`Word embeddings`, or word vectors, are numerical representations of words in a multidimensional space. Each word is a vector, and the vectors for a vocabulary are stored together as the rows of a matrix.

The **purpose of the word vector** is to give a computer system a numerical form of a word that it can compare and compute with.

These `embeddings` are learned by machine learning models and attempt to capture **syntactic structure** and **semantic similarity**.

In [17]:
sentence1 = list(doc.sents)[0]

sentence1[0].vector # grab the vector of the first word

array([-7.2681e+00, -8.5717e-01,  5.8105e+00,  1.9771e+00,  8.8147e+00,
       -5.8579e+00,  3.7143e+00,  3.5850e+00,  4.7987e+00, -4.4251e+00,
        1.7461e+00, -3.7296e+00, -5.1407e+00, -1.0792e+00, -2.5555e+00,
        3.0755e+00,  5.0141e+00,  5.8525e+00,  7.3378e+00, -2.7689e+00,
       -5.1641e+00, -1.9879e+00,  2.9782e+00,  2.1024e+00,  4.4306e+00,
        8.4355e-01, -6.8742e+00, -4.2949e+00, -1.7294e-01,  3.6074e+00,
        8.4379e-01,  3.3419e-01, -4.8147e+00,  3.5683e-02, -1.3721e+01,
       -4.6528e+00, -1.4021e+00,  4.8342e-01,  1.2549e+00, -4.0644e+00,
        3.3278e+00, -2.1590e-01, -5.1786e+00,  3.5360e+00, -3.1575e+00,
       -3.5273e+00, -3.6753e+00,  1.5863e+00, -8.1594e+00, -3.4657e+00,
        1.5262e+00,  4.8135e+00, -3.8428e+00, -3.9082e+00,  6.7549e-01,
       -3.5787e-01, -1.7806e+00,  3.5284e+00, -5.1114e-02, -9.7150e-01,
       -9.0553e-01, -1.5570e+00,  1.2038e+00,  4.7708e+00,  9.8561e-01,
       -2.3186e+00, -7.4899e+00, -9.5389e+00,  8.5572e+00,  2.74

The point of these `word embeddings` is to capture similarity quickly and reliably: words with related meanings end up with vectors that are close together.

In [24]:
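your_word = "dog"

# Look up the vector for your_word and find the 10 most similar vocabulary entries
query = np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]])
ms = nlp.vocab.vectors.most_similar(query, n=10)  # returns (keys, best_rows, scores)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores (higher means more similar), not distances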

print(words)

['dog', 'dogs', 'cat', 'puppy', 'pet', 'pup', 'canine', 'wolfdogs', 'dogsled', 'uppy']
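
To make the lookup above less of a black box, here is a minimal sketch of a nearest-neighbour search by cosine similarity. It does not use spaCy at all: the 3-dimensional `toy_vectors`, and the helper names `cosine_similarity` and `nearest_words`, are invented for illustration, and real vectors have hundreds of dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "cat": [0.7, 0.6, 0.3],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, vectors, n=3):
    """Return the n words whose vectors are most similar to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec)) for other, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:n]]


print(nearest_words("dog", toy_vectors))
```

As in the spaCy output, the query word itself comes back first, because a vector is always most similar to itself.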


Using spaCy, we can calculate **similarity between two docs**, as well as **similarity between two words**.

In [40]:

nlp = load_nlp("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

# Similarity of a span and a token
salty_fries = doc1[2:4]
burgers = doc1[5]
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761
salty fries <-> hamburgers 0.6938489675521851
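
For models with static word vectors, spaCy's default similarity for a doc or span is, as far as I understand it, the cosine similarity of the averaged token vectors. The sketch below imitates that idea in plain Python. The 2-dimensional `token_vectors` and the helper `average_vector` are hypothetical and only meant to show the averaging step.

```python
import math

# Hypothetical 2-dimensional token vectors, invented for illustration.
token_vectors = {
    "fries": [0.9, 0.1],
    "hamburgers": [0.8, 0.3],
    "food": [0.7, 0.2],
    "weather": [0.1, 0.9],
}


def average_vector(tokens):
    """Element-wise mean of the vectors of the given tokens."""
    rows = [token_vectors[t] for t in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = average_vector(["fries", "hamburgers"])  # stands in for a two-token doc
doc_b = average_vector(["food"])
doc_c = average_vector(["weather"])

print("fries+hamburgers vs food:   ", cosine_similarity(doc_a, doc_b))
print("fries+hamburgers vs weather:", cosine_similarity(doc_a, doc_c))
```

One consequence of averaging is that word order is lost, so two docs containing the same words in a different order get the same vector.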


# Pipelines

A **`pipeline`** is a sequence of `pipes`, or actors on data, that **make alterations to the data** or **extract information from it**.
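
The idea of pipes acting on data in sequence can be sketched without spaCy. This is a toy model, not spaCy's API: the functions `lowercase`, `tokenize`, `count_tokens` and `run_pipeline` are made up for illustration, and a plain dictionary stands in for the doc.

```python
def lowercase(record):
    # Alters the data: normalise the text to lowercase.
    record["text"] = record["text"].lower()
    return record


def tokenize(record):
    # Extracts information: split the text on whitespace.
    record["tokens"] = record["text"].split()
    return record


def count_tokens(record):
    # Extracts information that depends on an earlier pipe's output.
    record["n_tokens"] = len(record["tokens"])
    return record


def run_pipeline(text, pipes):
    """Pass a record through each pipe in order and return the result."""
    record = {"text": text}
    for pipe in pipes:
        record = pipe(record)
    return record


result = run_pipeline("Pipes Run In Order", [lowercase, tokenize, count_tokens])
print(result)
```

Order matters here: `count_tokens` would fail if it ran before `tokenize`, since later pipes can depend on what earlier pipes produce.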

<img src="pipeline.png" alt="drawing" width="700"/>

## How to add pipes

Sometimes, an `off the shelf pipeline` from spaCy is not good enough, or may be too slow.

In this case, you'll want to build your own **pipeline**.


In [50]:
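import requests
from bs4 import BeautifulSoup

nlp = spacy.blank("en")  # creates a blank model, specifying English
nlp.add_pipe("sentencizer")  # rule-based sentence splitting, the only pipe we add

# Download the text, rejoin words hyphenated across line breaks, and flatten newlines
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
nlp.max_length = 5278439  # raise the character limit so the whole text fits in one doc

doc = nlp(soup)
print(len(list(doc.sents)))  # 94k sentences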

94134


## Examining a pipeline

In spaCy, we have a few different ways to study a pipeline. If we want to do this in a script, we can call `nlp.analyze_pipes()`.

In [53]:
nlp = load_nlp("en_core_web_sm")

nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

Note the **dictionary structure**. It tells us not only what is inside the pipeline, but also the order of the pipes.

Each **key** under `"summary"` is a `pipe`, and each **value** is a dictionary describing that pipe.

Every one of these dictionaries has an `"assigns"` key, which lists the token and doc attributes that the pipe sets as data passes through the pipeline.

Each also has a `"scores"` key. When that list is not empty, it names the metrics used to evaluate the pipe's machine learning model.
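
A short sketch of reading this structure in a script. To keep it runnable without loading a model, `analysis` is a trimmed, hand-copied subset of the output above (three pipes only); in practice it would be the return value of `nlp.analyze_pipes()`. The variable names are my own.

```python
# Trimmed copy of the analyze_pipes() output above: three pipes, same key layout.
analysis = {
    "summary": {
        "tok2vec": {"assigns": ["doc.tensor"], "requires": [], "scores": [], "retokenizes": False},
        "tagger": {"assigns": ["token.tag"], "requires": [], "scores": ["tag_acc"], "retokenizes": False},
        "ner": {
            "assigns": ["doc.ents", "token.ent_iob", "token.ent_type"],
            "requires": [],
            "scores": ["ents_f", "ents_p", "ents_r", "ents_per_type"],
            "retokenizes": False,
        },
    }
}

summary = analysis["summary"]

# Python dicts keep insertion order, so the keys give the pipe order.
pipe_order = list(summary)

# Keep only the pipes that report evaluation metrics.
scored_pipes = {name: info["scores"] for name, info in summary.items() if info["scores"]}

for name in pipe_order:
    print(name, "assigns", summary[name]["assigns"])
print("pipes with scores:", list(scored_pipes))
```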