# What is Named Entity Recognition

To FIND and CLASSIFY names in text

![](images/ner_intro.png)

A hard task because
- An official name can sound like typical words: "First National Bank", "Future School"
- The class of an entity is hard to determine: "Charles Schwab" can refer to a person or an organization

=> ambiguous, dependent on context

Overall, classifying a single word in isolation is hard because many words are ambiguous, such as "sanction" (to permit, or to penalize?) or "Paris" (Paris the city, or Paris Hilton?)

# Window classification

Use a window containing neighboring words to classify the center word

## Softmax classifier

Steps:
- Use word2vec to convert each word in a FIXED-SIZE window to a vector
- Concatenate those vectors. x_window will be a vector of size (window size * word vector length,)
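The two steps above can be sketched in NumPy. This is only an illustration: the embedding table is random instead of real pre-trained word2vec vectors, and the window width (2 words on each side), the `<pad>` token, and the three output classes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table standing in for pre-trained word2vec vectors (random here).
vocab = ["<pad>", "not", "all", "museums", "in", "paris", "are", "amazing"]
word_to_id = {w: i for i, w in enumerate(vocab)}
d = 4                                   # word vector length
E = rng.normal(size=(len(vocab), d))    # one row per word

def window_vector(tokens, center, half_width=2):
    """Concatenate the vectors of the words in a fixed-size window around `center`."""
    ids = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(tokens):
            ids.append(word_to_id[tokens[pos]])
        else:
            ids.append(word_to_id["<pad>"])   # pad past the sentence edges
    return E[ids].reshape(-1)                 # shape: (window size * d,)

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

tokens = ["not", "all", "museums", "in", "paris", "are", "amazing"]
x_window = window_vector(tokens, center=4)    # window centered on "paris"

n_classes = 3                                 # e.g. PERSON, LOCATION, OTHER (assumed)
W = rng.normal(size=(n_classes, x_window.size))
probs = softmax(W @ x_window)                 # class probabilities for the center word
print(x_window.shape, probs)
```

With a window of 5 words and 4-dimensional vectors, `x_window` has 20 entries, and the softmax output sums to 1 over the classes.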

![](images/ner_softmax.png)

## Binary classifier with unnormalized score

Still use a fixed window and the concatenated vector x_window of size (window size * word vector length,)

Task: classify whether center word is a location or not (binary classification)
- We go over all positions in the corpus, but only a correct position should get a high score. Correct position = a position that has an actual location entity at its center

E.g.: Not all museums in Paris are amazing
- One true window: 'museums in **Paris** are amazing'
- Wrong window: 'all museums **in** Paris are'
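To make the true/wrong distinction concrete, here is a small sketch that enumerates every full window of the example sentence and marks whether its center word is a location. The `locations` set is an assumed gold label for this toy sentence, and positions too close to the sentence edges are simply skipped.

```python
def windows(tokens, half_width=2):
    """Yield (center_index, window_tokens) for every position with a full window."""
    for c in range(half_width, len(tokens) - half_width):
        yield c, tokens[c - half_width : c + half_width + 1]

tokens = "Not all museums in Paris are amazing".split()
locations = {"Paris"}   # assumed gold labels for this toy sentence

# A window is "true" only when its CENTER word is a location.
labelled = [(" ".join(w), tokens[c] in locations) for c, w in windows(tokens)]
for text, is_true in labelled:
    print(f"{'true ' if is_true else 'wrong'}  {text}")
```

Only the window centered on "Paris" is labelled true; windows that merely contain "Paris" off-center count as wrong.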

### Neural net feed forward for this architecture

Assume the word vector length is 4. We will have a 3-layer neural net as described below

![](images/ner_neuralnet.png)

a is in R(8x1). The output s is in R(2x1), i.e. a binary output
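A shape-only sketch of this forward pass, following the dimensions stated above (a is 8x1, s is 2x1). The window of 5 words, the sigmoid activation, and the parameter names `W`, `b`, `U` are assumptions for illustration; the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

d, window = 4, 5            # word vector length 4; a window of 5 words is assumed
hidden = 8                  # so that a is 8x1
n_out = 2                   # s is 2x1, as stated in the notes

x = rng.normal(size=(window * d, 1))       # concatenated window vector, 20x1
W = rng.normal(size=(hidden, window * d))  # 8x20
b = np.zeros((hidden, 1))                  # 8x1
U = rng.normal(size=(n_out, hidden))       # 2x8

z = W @ x + b                 # 8x1 pre-activation
a = 1.0 / (1.0 + np.exp(-z))  # 8x1 hidden activation (sigmoid is an assumed choice)
s = U @ a                     # 2x1 unnormalized scores
print(a.shape, s.shape)
```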

## Cons of updating word vectors using SGD

If we are doing binary classification of movie reviews, for example, and start from pre-trained word2vec vectors, SGD will update the representations of the words that appear in the training set in a way that breaks the original word relationships
- Words like "TV", "telly", "television" start close to each other. If "TV" and "telly" are in the training set but "television" appears only in the test set, the first two get updated and move away, while "television" stays where it was. Things get worse if "TV" and "telly" are key words for determining the negativity of a review.
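A toy illustration of this drift, with made-up 3-dimensional vectors and a made-up "task direction" standing in for the accumulated effect of many SGD steps. It is not a real training run, only a picture of why an unseen synonym gets left behind.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy pre-trained vectors: the three synonyms start close together.
vecs = {
    "TV":         np.array([1.0, 0.9, 0.0]),
    "telly":      np.array([0.9, 1.0, 0.0]),
    "television": np.array([1.0, 1.0, 0.1]),
}
before = cosine(vecs["TV"], vecs["television"])

# Stand-in for many SGD steps: words seen in training are pushed along some
# task-specific direction; "television" never occurs in training, so it stays put.
task_direction = np.array([0.0, -1.0, 2.0])   # made-up direction
for word in ("TV", "telly"):
    vecs[word] = vecs[word] + task_direction

after = cosine(vecs["TV"], vecs["television"])
print(f"TV~television before: {before:.2f}, after: {after:.2f}")
print(f"TV~telly after: {cosine(vecs['TV'], vecs['telly']):.2f}")
```

The two training words stay close to each other, but both move away from "television", so a classifier that learned to rely on "TV" and "telly" will not transfer to it.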

![](images/ner_updatew2v_or_not.png)

Note:
- Use transfer learning (pre-trained word2vec)? Yes. But if we already have a big corpus of 100+ million words (probably from machine translation tasks), then I guess it's fine to start with randomly initialized word vectors

About fine-tuning pre-trained word2vec:
- Small dataset (~100k words): don't, since a small dataset is easy to overfit and the model can't generalize well
- Large dataset (>1 million words): yes

# Dependency

## Definition

![](images/dependency_1.png)

![](images/dependency_2.png)

![](images/dependency_3.png)

![](images/dependency_4.png)

![](images/dependency_5.png)

![](images/dependency_6.png)

## Greedy transition-based parsing

![](images/dependency_7.png)

![](images/dependency_8.png)

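Note that there are only 3 actions:
- shift: pop the first word of the buffer and push it onto the stack
- left arc: add the dependency top of stack -> second-to-top, then remove the second-to-top word (the word to the left of the top) from the stack
- right arc: add the dependency second-to-top -> top of stack, then remove the top word from the stack

Finish when the buffer is empty and only ROOT remains on the stack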
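A minimal sketch of the three actions, assuming the arc-standard variant in which the stack starts with a `ROOT` symbol and parsing ends with an empty buffer and only `ROOT` left on the stack. The sentence and the action sequence are hand-written for illustration; nothing is predicted here.

```python
def parse(words, actions):
    """Apply a sequence of arc-standard actions; return the (head, dependent) arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))          # move first buffer word onto the stack
        elif action == "left_arc":
            dependent = stack.pop(-2)            # remove the second-to-top word
            arcs.append((stack[-1], dependent))  # arc: top -> second-to-top
        elif action == "right_arc":
            dependent = stack.pop()              # remove the top word
            arcs.append((stack[-1], dependent))  # arc: second-to-top -> top
        else:
            raise ValueError(f"unknown action: {action}")
    assert not buffer and stack == ["ROOT"], "parse did not finish"
    return arcs

arcs = parse(
    ["I", "ate", "fish"],
    ["shift", "shift", "left_arc", "shift", "right_arc", "right_arc"],
)
print(arcs)
```

Each word is shifted once and removed once, which is why the number of actions grows linearly with sentence length.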

## Use machine learning to predict action

Joakim Nivre's idea (**MaltParser**): build a machine learning classifier to predict the next action

Each action is predicted by a discriminative classifier (e.g., a softmax classifier) over each legal move

- Max of 3 untyped choices; max of |R| × 2 + 1 when typed
- Features: top of stack word, POS; first in buffer word, POS; etc.
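The action count and the greedy choice over legal moves can be sketched as follows. The relation labels, the legality rules (no shift on an empty buffer, `ROOT` never becomes a dependent), and the score dictionary are illustrative assumptions; in a real parser the scores would come from the trained classifier.

```python
def all_actions(relations):
    """Shift plus one left-arc and one right-arc per relation label: 2*|R| + 1."""
    actions = [("shift", None)]
    for r in relations:
        actions.append(("left_arc", r))
        actions.append(("right_arc", r))
    return actions

def legal_actions(stack, buffer, relations):
    """Keep only the moves that are possible in the current configuration."""
    legal = []
    for name, r in all_actions(relations):
        if name == "shift" and not buffer:
            continue                      # nothing left to shift
        if name == "left_arc" and (len(stack) < 2 or stack[-2] == "ROOT"):
            continue                      # ROOT can never be a dependent
        if name == "right_arc" and len(stack) < 2:
            continue                      # need two items to draw an arc
        legal.append((name, r))
    return legal

def greedy_choice(scores, legal):
    """Pick the highest-scoring legal action; `scores` maps action -> classifier score."""
    return max(legal, key=lambda a: scores.get(a, float("-inf")))

relations = ["nsubj", "dobj", "root"]     # tiny label set for illustration
n_actions = len(all_actions(relations))   # 2 * 3 + 1
legal = legal_actions(["ROOT", "I", "ate"], ["fish"], relations)
scores = {("left_arc", "nsubj"): 2.1, ("shift", None): 0.3}   # made-up scores
choice = greedy_choice(scores, legal)
print(n_actions, choice)
```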

Pros: 
- **There is NO search (in the simplest form).** But you can profitably do a beam search if you wish (slower but better). You keep k good parse prefixes at each time step
- The model’s accuracy is fractionally below the state of the art in dependency parsing
- It provides very fast linear time parsing, with great performance

![](images/dependency_9.png)

![](images/dependency_10.png)

So we have
- the first word of the stack (d-dimensional vector)
- the POS of that word (d-dimensional vector)
- the dependency label of that word (d-dimensional vector): labels form a small set, and you can extract similarities between them

Then concatenate them.

The feature vector is built from a list of tokens (e.g., the last word in the stack, the first word in the buffer, the dependent of the second-to-last word in the stack if there is one, etc.).
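A simplified sketch of building such an input vector: look up a d-dimensional embedding for the word, its POS tag, and its dependency label, for a couple of chosen tokens, then concatenate everything. The embedding tables are random, only two tokens are used (top of stack, first in buffer), and the `<null>` symbol for missing items is an assumption of this example; a real parser uses a longer token list.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4   # embedding size shared by words, POS tags, and labels (toy value)

def embedding_table(symbols):
    return {s: rng.normal(size=d) for s in symbols}

word_emb  = embedding_table(["<null>", "ROOT", "I", "ate", "fish"])
pos_emb   = embedding_table(["<null>", "ROOT", "PRP", "VBD", "NN"])
label_emb = embedding_table(["<null>", "nsubj", "dobj"])

def features(stack, buffer, pos_of, label_of):
    """Concatenate word, POS, and label embeddings for a few chosen tokens."""
    s1 = stack[-1] if stack else "<null>"          # top of stack
    b1 = buffer[0] if buffer else "<null>"         # first word in buffer
    tokens = [s1, b1]
    parts  = [word_emb[t] for t in tokens]
    parts += [pos_emb[pos_of.get(t, "<null>")] for t in tokens]
    parts += [label_emb[label_of.get(t, "<null>")] for t in tokens]
    return np.concatenate(parts)

pos_of = {"ROOT": "ROOT", "I": "PRP", "ate": "VBD", "fish": "NN"}
label_of = {"I": "nsubj"}          # labels assigned so far in the partial parse
x = features(["ROOT", "ate"], ["fish"], pos_of, label_of)
print(x.shape)
```

With 2 tokens, 3 kinds of embedding, and d = 4, the input vector has 24 entries; missing tokens or labels fall back to the `<null>` embedding so the size stays fixed.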

![](images/dependency_11.png)