In [2]:
import spacy

In [3]:
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


# Word Vectors

### What is a word vector?
A word vector (also known as a word embedding) is a way to represent words as numerical vectors in a high-dimensional space.

The idea is that words with similar meanings will have similar vectors.

Traditional text processing methods treat words as discrete entities, which means they can't easily capture the relationships between words. For example, "king" and "queen" are related but are treated as completely separate words. Word vectors allow us to capture semantic meaning and relationships between words.

**Similar words have similar vectors: the value at each index (dimension) is close to the value at the same index in the other word's vector.**

### Example (illustrative values)

`King vector: [ 0.12  0.44  0.55 -0.67  0.12  0.14 -0.22  0.31  0.45  0.10]`

`Queen vector: [ 0.11  0.43  0.54 -0.66  0.11  0.13 -0.21  0.30  0.44  0.09]`

`Man vector: [ 0.34  0.12  0.67 -0.55  0.44  0.56 -0.33  0.21  0.11  0.08]`

`Woman vector: [ 0.33  0.11  0.66 -0.54  0.43  0.55 -0.32  0.20  0.10  0.07]`

**Each dimension (or index) in a word vector does not correspond to a specific, interpretable feature like "royalty" or "gender". Instead, the dimensions are abstract and result from the training process of the word embedding algorithm.**
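One common way to quantify "similar vectors" is cosine similarity. The sketch below is a minimal, standard-library illustration that reuses the example vectors listed above; the helper name `cosine_similarity` is an assumption made for this example and is not part of spaCy.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Example vectors copied from the text above.
king = [0.12, 0.44, 0.55, -0.67, 0.12, 0.14, -0.22, 0.31, 0.45, 0.10]
queen = [0.11, 0.43, 0.54, -0.66, 0.11, 0.13, -0.21, 0.30, 0.44, 0.09]
man = [0.34, 0.12, 0.67, -0.55, 0.44, 0.56, -0.33, 0.21, 0.11, 0.08]
woman = [0.33, 0.11, 0.66, -0.54, 0.43, 0.55, -0.32, 0.20, 0.10, 0.07]

king_queen = cosine_similarity(king, queen)
king_man = cosine_similarity(king, man)
man_woman = cosine_similarity(man, woman)

print(f"king  <-> queen: {king_queen:.4f}")
print(f"king  <-> man:   {king_man:.4f}")
print(f"man   <-> woman: {man_woman:.4f}")
```

With these example values, the king/queen pair scores higher than the king/man pair, which matches the intuition that vectors with close values at every index are treated as similar.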



In [4]:
nlp = spacy.load("en_core_web_md") 

In [5]:
with open("data/wiki_us.txt") as f:
    text = f.read()

In [6]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [7]:
import numpy as np

your_word = "country"

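# Look up the word's vector and query the vector table for its 10 nearest entries.
# most_similar returns a (keys, best_rows, scores) tuple, one row per query vector.
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)

# Map the returned hash keys back to their strings.
words = [nlp.vocab.strings[w] for w in ms[0][0]]

# ms[2] holds similarity scores (higher means more similar), not distances.
similarities = ms[2]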

print(words)

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


# `similarity` method for docs

In [8]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

In [9]:
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


With a score of about 0.69, the two sentences are fairly similar.

In [10]:
doc3 = nlp("The Empire State Building is in New York.")

In [11]:
print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


In this case, the similarity is low (about 0.18).

In [12]:
doc4 = nlp("I like oranges.")
doc5 = nlp("I like apples.")

In [13]:
print(doc4, "<->", doc5, doc4.similarity(doc5))

I like oranges. <-> I like apples. 0.9787322286815502


Here the similarity is very high (about 0.98), since "apples" and "oranges" both lie in the same cluster of fruit words and the two sentences share the same structure.
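By default, spaCy's `Doc.similarity` estimates similarity as the cosine between the averaged word vectors of the two docs. The sketch below reproduces that idea in plain Python under loud assumptions: the three-dimensional vectors in `toy_vectors` are hypothetical values invented for illustration, and `doc_vector` and `cosine_similarity` are helper names for this example, not spaCy functions.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this illustration.
toy_vectors = {
    "i": [0.1, 0.1, 0.0],
    "like": [0.2, 0.3, 0.1],
    "apples": [0.9, 0.1, 0.0],
    "oranges": [0.8, 0.2, 0.1],
    "buildings": [0.0, 0.1, 0.9],
}


def doc_vector(tokens):
    """Average the vectors of the given tokens, dimension by dimension."""
    vectors = [toy_vectors[token] for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


apples_doc = doc_vector(["i", "like", "apples"])
oranges_doc = doc_vector(["i", "like", "oranges"])
buildings_doc = doc_vector(["i", "like", "buildings"])

sim_fruit = cosine_similarity(apples_doc, oranges_doc)
sim_mixed = cosine_similarity(apples_doc, buildings_doc)

print(f"apples doc <-> oranges doc:   {sim_fruit:.3f}")
print(f"apples doc <-> buildings doc: {sim_mixed:.3f}")
```

Because the shared words "i" and "like" contribute identically to both averages, sentences with the same structure start out close, and the remaining difference comes from the one word that changes.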

# spaCy Pipelines

![image.png](attachment:image.png)

Two pipes are active in the pipeline shown above.

**Entity Ruler**\
An EntityRuler is a rule-based named entity recognizer that finds entities.

**Entity Linker**\
An EntityLinker pipe identifies which entity each mention refers to, in order to perform toponym resolution.

While we can read entities from the `doc.ents` attribute, we can also use more sophisticated pipelines.\
We will use the **Tok2Vec** input layer to vectorize the input sentence.

### Creating a blank spaCy pipeline

In [16]:
nlp = spacy.blank("en")

# A blank pipeline contains only the English tokenizer.

### Adding a single pipe - sentencizer

In [17]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1783f6600>

### Why is it better to just add a single pipe sometimes?
![image.png](attachment:image.png)

The time difference is massive: 7 seconds for the blank pipeline with only the Sentencizer versus 47 minutes for spaCy's small English model.

**The Sentencizer pipeline is faster**.

This is because the larger model also runs components such as a tagger, parser, lemmatizer, and named entity recognizer, all of which take additional time.

However, the **English model will be more accurate**. You can see this in the comparison above, where the Sentencizer pipeline reports a smaller value, likely because it did not recognize some sentence boundaries.
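To see why a rule-based splitter is cheap but can miss boundaries, consider the toy sketch below. It is a deliberately simplified, character-level illustration and not spaCy's Sentencizer implementation; the function name `split_sentences` and its `punct` parameter are assumptions made for this example.

```python
def split_sentences(text, punct=(".", "!", "?")):
    """Toy rule: end a sentence at a punctuation mark followed by whitespace or end of text."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        at_end = i + 1 == len(text)
        if ch in punct and (at_end or text[i + 1].isspace()):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    # Keep any trailing text that did not end with punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


clean = split_sentences("This is a sentence. Is this another? Yes!")
print(clean)

# A heading with no closing punctuation is merged into the next sentence.
merged = split_sentences("Heading without a period\nNext sentence here.")
print(len(merged))
```

A single pass over the text with a fixed punctuation rule needs no statistical model, which is where the speed comes from. The second call shows the cost: a boundary that is not marked by one of the listed characters goes undetected, so the sentence count comes out lower.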

### Analyzing pipeline

In [18]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
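The dictionary returned by `analyze_pipes()` can be summarized programmatically. The sketch below is a minimal example that hard-codes the sentencizer output shown above so it runs without spaCy; the helper name `describe_pipes` is an assumption made for this illustration.

```python
# Output of nlp.analyze_pipes() for the blank pipeline, copied from above.
analysis = {
    "summary": {
        "sentencizer": {
            "assigns": ["token.is_sent_start", "doc.sents"],
            "requires": [],
            "scores": ["sents_f", "sents_p", "sents_r"],
            "retokenizes": False,
        }
    },
    "problems": {"sentencizer": []},
    "attrs": {
        "token.is_sent_start": {"assigns": ["sentencizer"], "requires": []},
        "doc.sents": {"assigns": ["sentencizer"], "requires": []},
    },
}


def describe_pipes(analysis):
    """Return one line per pipe listing the attributes it assigns."""
    lines = []
    for name, meta in analysis["summary"].items():
        assigns = ", ".join(meta["assigns"]) or "(nothing)"
        lines.append(f"{name}: assigns {assigns}")
    return lines


summary_lines = describe_pipes(analysis)
for line in summary_lines:
    print(line)

# Pipes whose "problems" list is non-empty have unmet requirements.
pipes_with_problems = [name for name, issues in analysis["problems"].items() if issues]
print(pipes_with_problems)
```

In a live session, the same helper could be called on the return value of `nlp.analyze_pipes()` directly instead of the hard-coded dictionary.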

In [20]:
nlp2 = spacy.load("en_core_web_sm")
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

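The larger model has many more pipes (`tok2vec`, `tagger`, `parser`, `attribute_ruler`, `lemmatizer`, and `ner`), so it assigns many more annotations than the Sentencizer alone.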