# Introduction and Word Vectors

- Two thoughts from Prof. Christopher Manning about language:
    - Language is a highly evolved system of communication, yet it is very **uncertain**.
        - Even so, humans share agreed-upon meanings, which lets us communicate well.
        - Internally and subconsciously, we perform a kind of probabilistic inference to determine meaning, not just to convey information but also for social functions.
    - For artificial intelligence to reach a truly sophisticated level, it needs to capture human knowledge, which is predominantly conveyed through human language.
        - Human language is our networking language: through it we collectively form a huge network of individuals.
        - Language is what made humans so dominant. It let them work collectively as a group or team, which is how they not only survived in a world of more powerful animals but thrived.
        - The invention of writing allowed this knowledge to be shared across space and across time, not just verbally.
        - Writing is a very recent phenomenon on the scale of evolution (~5,000 years old), but it made humans far more powerful.
        - We compress knowledge efficiently and convey a view of the world in very few bits of information. For example, reading "I went to the zoo and saw an elephant" constructs a whole visual scene in your mind. Stored as an image on a computer, that scene could take a few megabytes, yet it was communicated in very few words.

### How do we represent the meaning of words?
- Linguists use denotational semantics to think about meaning.
    - In this view, the meaning of a word is the thing it represents.
    - $$\text{signifier (symbol)} \leftrightarrow \text{signified (idea or thing)}$$
    - The word "chair" denotes all the things that are chairs.
    - The word "running" denotes the set of actions that make up the activity of running, i.e. 🏃‍♂️

### How do we have usable meaning in a computer?
- Common solution: use a resource like `WordNet`, a thesaurus that records words and their relationships through synonym sets (synsets) and hypernyms ("is a" relationships).

In [7]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ramand\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

- Synonyms of "good"

In [10]:
from nltk.corpus import wordnet as wn
poses = {'n' : 'noun', 'v' : 'verb', 'a' : 'adjective', 's' : 'adjective(s)', 'r' : 'adverb' }
for synset in wn.synsets("good"):
    print(f'{poses[synset.pos()]} : {", ".join([l.name() for l in synset.lemmas()])}')

noun : good
noun : good, goodness
noun : good, goodness
noun : commodity, trade_good, good
adjective : good
adjective(s) : full, good
adjective : good
adjective(s) : estimable, good, honorable, respectable
adjective(s) : beneficial, good
adjective(s) : good
adjective(s) : good, just, upright
adjective(s) : adept, expert, good, practiced, proficient, skillful, skilful
adjective(s) : good
adjective(s) : dear, good, near
adjective(s) : dependable, good, safe, secure
adjective(s) : good, right, ripe
adjective(s) : good, well
adjective(s) : effective, good, in_effect, in_force
adjective(s) : good
adjective(s) : good, serious
adjective(s) : good, sound
adjective(s) : good, salutary
adjective(s) : good, honest
adjective(s) : good, undecomposed, unspoiled, unspoilt
adjective(s) : good
adverb : well, good
adverb : thoroughly, soundly, good


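- Hypernyms of "tiger". Note that `tiger.n.01` is the sense "a fierce or audacious person", which is why the chain below runs through `person`; the animal sense is `tiger.n.02`.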

In [14]:
tiger = wn.synset("Tiger.n.01")
hyper = lambda t: t.hypernyms()
list(tiger.closure(hyper))

[Synset('person.n.01'),
 Synset('causal_agent.n.01'),
 Synset('organism.n.01'),
 Synset('physical_entity.n.01'),
 Synset('living_thing.n.01'),
 Synset('entity.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01')]

- WordNet lists the various senses of "good" in English. The distinctions are so fine-grained that even humans can barely tell them apart.
- It misses nuance, e.g. "expert" is listed as a synonym of "good", which holds only in some contexts.
- It misses new meanings of words, e.g. "wicked good". It is built by human labor and is impossible to keep up to date.
- It cannot give an accurate measure of word similarity, i.e. a score for how similar a pair of words is.
- It is very subjective.


### Localist Representation
This is tangential to the current discussion.
- A representation in which each entity is represented independently, with its own dedicated unit or dimension.
- It can only describe a number of distinct objects that is linear in the number of dimensions.
- It does not capture any relationship between the entities.
- One-hot encoding is a localist representation.
- If we treat each word as a symbol, then English, which is estimated to have 13 million words, would need a 13-million-dimensional vector per word, with a single 1 and 0s everywhere else.
- In neurology, the localist theory holds that each neuron stands for a single concept on a stand-alone basis, so each neuron (localist unit) carries its own "meaning and representation".
- This is the opposite of a distributed representation.
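
As a minimal sketch of the localist idea, the snippet below builds one-hot vectors for a toy three-word vocabulary. The vocabulary and the helper name `one_hot` are illustrative assumptions, not something taken from the lecture.

```python
# Toy vocabulary (an assumption for illustration; a real lexicon is far larger).
vocab = ["hotel", "conference", "motel"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in vocab:
    print(word, one_hot(word))
```

Each new word adds one more dimension, which is why the vector length grows linearly with the vocabulary.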


### Representing words as discrete symbols
- **Pre-2012**
    - Words are treated as discrete symbols in a lexicon (vocabulary).
    - "hotel", "conference", "motel": a localist representation.
    - One-hot encoding is used to represent the words as vectors:
        - motel = [00000100]
        - hotel =  [00000010]
    - These vectors become huge because languages have a lot of words.
    - In a language like English, derivational morphology allows an almost unlimited number of words.
        - New words are created by adding affixes to existing words.
        - e.g. paternal --> paternalistic --> paternalistically
        - This can grow the vocabulary of a language many times over.
    - This takes huge computational power, as a word vector can have 500,000 dimensions or more.
    - An even bigger problem is that we often care about the relationships between words and their meanings.
        - If I search for "Seattle motels", I would probably also like to see "Seattle hotels" in the results.
        - However, representing words as symbols keeps these words orthogonal in the vector space (see the one-hot vectors above).
        - There is no notion of similarity between one-hot encoded vectors.
        - Word-similarity tables could solve this, but they blow up the computational cost: storing a similarity score for every pair of words gives a very large table, e.g. a 500,000-word vocabulary yields 500,000 × 500,000 = 250 billion cells.
- Instead, how about **encoding "similarity" in the vectors themselves?**
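
The orthogonality problem and the size of a pairwise similarity table can both be checked with a few lines of arithmetic. This is a small sketch using the `motel` and `hotel` vectors from the notes; the helper name `dot` is an assumption introduced for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# One-hot vectors as written in the notes above.
motel = [0, 0, 0, 0, 0, 1, 0, 0]
hotel = [0, 0, 0, 0, 0, 0, 1, 0]

print(dot(motel, hotel))  # 0: distinct one-hot vectors are orthogonal
print(dot(motel, motel))  # 1: a one-hot vector only matches itself

# Size of a table holding one similarity score per ordered pair of words.
vocab_size = 500_000
table_cells = vocab_size ** 2
print(f"{table_cells:,}")  # 250,000,000,000
```

Because every pair of distinct one-hot vectors has a dot product of 0, the encoding itself carries no information about which words are related.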

### Distributional Semantics
- Meaning in distributional semantics: a word's meaning is given by the words that frequently appear close to it.
- When a word *w* appears in a text, its **context** is the set of words that appear near *w* within a fixed-size window.
- "You shall know a word by the company it keeps" - J. R. Firth 1957
- ![](https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fdailygrind%2FlUwgP_m7kQ.png?alt=media&token=933154b6-2990-4d0a-aa17-323b7ddabf13)
- The meaning of "banking" is captured by the collection of all the words that appear around it.
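
The fixed-size context window can be sketched as follows. The function name `context_windows`, the window size of 2, and the example sentence are all assumptions chosen for illustration.

```python
def context_windows(tokens, window_size=2):
    """Yield (center, context) pairs, where context holds up to
    window_size tokens on each side of the center word."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        yield center, left + right

# Toy sentence (an assumption for illustration).
sentence = "government debt problems turning into banking crises".split()

for center, context in context_windows(sentence, window_size=2):
    if center == "banking":
        print(center, "->", context)
```

Collecting these contexts over many occurrences of a word in a large corpus is what gives the word its distributional meaning.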


### Distributed Representation

- This idea is the opposite of a localist representation such as one-hot encoding.
- In one-hot encoding, each word's vector is independent of every other word's and is extremely large (English has an estimated 13 million words): all 0s except for a single 1 that identifies the word.
    - motel = \[00000100...00\]
    - hotel  = \[00000010...00\]
- In a distributed representation, each word is a dense vector chosen to be similar to the vectors of words that appear in similar contexts.
- In other words, words that appear in similar contexts live close together in the vector space, e.g. "motel" and "hotel" are related and will be close to each other.
- The dimensionality of these vectors is very small compared to the vocabulary, e.g. 50, 100, 200, ... up to 4000, versus 13 million English words.
- We use this smaller vector space to encode the relationships between words.
$$ \text{banking} = \begin{pmatrix} 0.102 \\ 0.432 \\0.445\\ 0.001\\0.034 \end{pmatrix}$$
- These word vectors are sometimes called Word Embeddings or word representations.
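
To make "close in the vector space" concrete, here is a minimal sketch that compares dense vectors with cosine similarity, one common closeness measure. The 4-dimensional vectors are hypothetical values picked by hand for illustration; they are not learned embeddings and do not come from the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration only.
embeddings = {
    "hotel": [0.81, 0.10, 0.45, 0.02],
    "motel": [0.78, 0.15, 0.40, 0.05],
    "banana": [0.03, 0.92, 0.01, 0.30],
}

print(cosine_similarity(embeddings["hotel"], embeddings["motel"]))
print(cosine_similarity(embeddings["hotel"], embeddings["banana"]))
```

With these toy values, "hotel" scores much higher against "motel" than against "banana". That graded notion of similarity is something one-hot vectors cannot express.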



