# NLP

![](NLP_examples.png)

# Language Models

* `pip install spacy`
* `python -m spacy download en_core_web_md` - if this doesn't load, try
* `python -m spacy download en_core_web_sm`

----

### Learning Objectives:
#### * What are the categories of NLP?
#### * What is a Language model?
#### * See a 'traditional' language model
#### * Language Representation areas of interest
#### * Introducing spaCy
#### * Introducing word2vec

---

## Categories of NLP:

#### Language Representation - translating natural language into 'computer language'

#### Language Classification - Input is some language, output is a label describing that language

#### Language Generation - Input is some language, output is some new language

---

### Tell me what they are, and where they go!
* Sentiment analysis
* Speech recognition
* Representing Grammar (Syntax)
* Relation Extraction
* Question Answering
* Named Entity Recognition
* Representing Meaning (Semantics)
* Chatbots
* Machine Translation
* Dependency grammar
* The 'Grammarly' model
* Information Summarisation
* Language Reasoning
* Representing Context (Pragmatics)

---

### What is a Language Model?

#### A model which tries to assess the likelihood of language
* I go home vs I home go - syntax
* I go home vs I go house - semantics
* I go home vs I go home and tell myself everything is going to be OK because I'm so depressed - pragmatics

---

$P(W) = P(w_1, w_2, ..., w_n)$

or

$P(w_{t+1} | w_{t-n+1}, ..., w_{t})$
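
To make the two formulas concrete, here is a minimal sketch that scores a sentence with a bigram model, i.e. a context of a single word. The toy corpus and the helper names `bigram_probability` and `sentence_probability` are invented for illustration and are not part of the notebook that follows.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "i go home . i go home . i go out . you go home .".split()

# Count every word that has a successor, and every (word, next word) pair.
unigram_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

def sentence_probability(sentence):
    """Probability of the words after the first one, as a product of bigram terms."""
    words = sentence.lower().split()
    probability = 1.0
    for prev_word, word in zip(words, words[1:]):
        probability *= bigram_probability(prev_word, word)
    return probability

print(sentence_probability("I go home"))  # 0.75
print(sentence_probability("I home go"))  # 0.0
```

The word order that appears in the corpus gets a high score, while the scrambled order gets zero because the pair ("i", "home") was never observed.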

---

### A 'traditional' language model... for Sequence Generation

#### A bigram Markov chain (more later in the course)

In [1]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
import re

In [2]:
def generate_word_distribution(data):
    """create a next-word probability distribution for every word (bigram model)
    
        params: data - a Bunch data object from sklearn
        returns: Series of P(next_word | word), indexed by (word, next_word)
    """

    text = data['data']
    # lowercase each article and keep tokens of two or more word characters
    all_data = [token for article in text
                for token in re.findall(r'(?u)\b\w\w+\b', article.lower())]
    words = pd.DataFrame({'words': all_data})
    # pair each word with its successor, then normalise the counts per word
    words['next_words'] = words['words'].shift(-1)
    word_distribution = words.groupby('words')['next_words'].value_counts(normalize=True)
    
    return word_distribution

In [3]:
def text_generation(seed, length, distribution):
    """seed a distribution with a seed word, and ask it to make more words
        
        params: seed - A seed word
                length - Number of words to generate after the seed
                distribution - A word probability distribution
                
        returns: generated sentence, or None if a word has no known successor
    """
    
    try:
        seed = seed.lower()
        for i in range(length):
            # sample the next word from the distribution of the current last word
            next_words = distribution[seed.split()[-1]]
            seed += ' ' + np.random.choice(next_words.index, p=next_words.values)
        return seed
    
    except KeyError:
        print('Oops! Try another seed')
        return None

### Download text data

In [4]:
data = fetch_20newsgroups(remove=['headers', 'footers'])

### Calculate the bigram probabilities

In [5]:
distribution = generate_word_distribution(data)

### Generate some new sentences

In [6]:
sentence = text_generation('It', 20, distribution)

In [7]:
sentence

'it seems to get ahold of this view of them are discussions belong when the most of burning truck to natural'
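
The same bigram idea can be sketched without pandas, which may make the mechanics easier to follow. This is only an illustrative sketch on a made-up sentence: `build_bigram_table` and `generate` are hypothetical helper names, and the random generator is passed in so that runs can be made repeatable.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev_word, word in zip(tokens, tokens[1:]):
        table[prev_word][word] += 1
    return table

def generate(seed, length, table, rng):
    """Sample up to `length` words after the seed, weighted by follower counts."""
    words = [seed.lower()]
    for _ in range(length):
        followers = table.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything
        candidates = list(followers)
        weights = list(followers.values())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return ' '.join(words)

# Made-up training text, for illustration only.
tokens = "the cat sat on the mat and the cat ate the fish".split()
table = build_bigram_table(tokens)
print(generate('The', 5, table, random.Random(0)))
```

Each generated word depends only on the word before it, which is why the output is locally plausible but drifts quickly, just like the newsgroup sentence above.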

----

### Language Representation areas of interest

In [None]:
apples -> apple #stemming
am -> be # lemmatization 
an apple and a banana -> apple banana #removing stop words
Apple -> apple #lowercase

#### * Preprocessing - tokenization, stop words, lowercase, lemmatization/stemming, normalization
#### * Curse of dimensionality
#### * Semantic similarity
#### * Word order
#### * Word sense disambiguation
#### * Grammar
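
As a rough illustration of the preprocessing item, the sketch below tokenizes, lowercases and removes stop words using only the standard library. The stop word list is a tiny hand-picked assumption and `simple_preprocess` is a hypothetical name; stemming and lemmatization are left out because they need linguistic resources.

```python
import re

# Tiny hand-picked stop word list, for illustration only; real lists are much longer.
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'this'}

def simple_preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(simple_preprocess('An apple and a banana'))  # ['apple', 'banana']
```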

---

## Introducing spaCy

#### Solves: Preprocessing

In [8]:
import spacy

In [9]:
nlp = spacy.load('en_core_web_md')

In [10]:
def clean_text(review, model):
    """preprocess a string (tokenize, drop stop words and non-alphabetic tokens, lemmatize, lowercase) and return the cleaned result
        params: review - a string
                model - a spacy model
                
        returns: list of cleaned tokens
    """
    
    new_doc = []
    doc = model(review)
    for word in doc:
        if not word.is_stop and word.is_alpha:
            new_doc.append(word.lemma_.lower())
            
    return new_doc

In [19]:
clean_text('this is a document', nlp)

['document']

## No Similarity in BOW!!

![](orthogonal_BOW.png)
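
A small sketch of why this happens, assuming a made-up three-word vocabulary: in a one-hot (bag-of-words style) encoding every word gets its own axis, so any two different words are orthogonal and their cosine similarity is zero. The names `one_hot` and `cosine` are illustrative helpers.

```python
import math

# Made-up vocabulary, for illustration only.
vocabulary = ['king', 'queen', 'banana']

def one_hot(word):
    """Vector with a 1 in the position of the word and 0 elsewhere."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

print(cosine(one_hot('king'), one_hot('queen')))   # 0.0
print(cosine(one_hot('king'), one_hot('banana')))  # 0.0
print(cosine(one_hot('king'), one_hot('king')))    # 1.0
```

"king" is exactly as dissimilar to "queen" as it is to "banana", so the representation carries no notion of meaning.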

### Introducing word2vec

#### Solves: Semantic similarity, curse of dimensionality

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
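def vectorize_my_word(word, model):
    lexeme = model.vocab[word]
    # vocab lookups do not raise for unknown words, so check for a vector explicitly
    if not lexeme.has_vector:
        print("Doesn't look like this word can be found")
        return None
    # reshape to (1, n_dims) so it can be passed to cosine_similarity
    return lexeme.vector.reshape(1, -1)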

In [34]:
vectorize_my_word('apple', nlp).shape #one word is 2d, then a list of words is 3d

(1, 300)

In [26]:
queen = vectorize_my_word('queen', nlp)
king = vectorize_my_word('king', nlp)
man = vectorize_my_word('man', nlp)
woman = vectorize_my_word('woman', nlp)

In [27]:
new_queen = king - man + woman

In [28]:
cosine_similarity(new_queen, queen)

array([[0.78808445]], dtype=float32)

In [29]:
cosine_similarity(queen, king)

array([[0.725261]], dtype=float32)

In [31]:
princess = vectorize_my_word('princess', nlp)

In [32]:
cosine_similarity(queen, princess)

array([[0.6578181]], dtype=float32)
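
The analogy arithmetic can also be framed as a search: compute `king - man + woman` and ask which candidate word is closest to the result. The sketch below uses hand-made two-dimensional vectors (roughly "royalty" and "gender" axes) purely as an assumption for illustration; they are not real word2vec or spaCy vectors, and `closest_word` is a hypothetical helper.

```python
import math

# Hand-made toy vectors, NOT real embeddings: [royalty, gender].
toy_vectors = {
    'king':  [1.0, 1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def closest_word(target, candidates):
    """Return the candidate whose vector has the highest cosine similarity to target."""
    return max(candidates, key=lambda word: cosine(target, toy_vectors[word]))

# king - man + woman, computed dimension by dimension
target = [k - m + w for k, m, w in zip(toy_vectors['king'],
                                       toy_vectors['man'],
                                       toy_vectors['woman'])]
best = closest_word(target, ['queen', 'apple', 'man'])
print(best)  # queen
```

With real embeddings the match is approximate rather than exact, which is why the similarity computed above is high but below 1.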

---

### What can I do with this?

# Practical takeaway 1:
* Use the spaCy cleaning function, or one that you write yourself, on your corpus to reduce dimensionality and get better performance on the downstream task (artist prediction)

# Practical takeaway 2:
* Try using word2vec vectors in place of BOW for your ML prediction
* CAVEAT! The ML models we use accept 2D inputs only, but you might find yourself with a 3D input
* QUESTION? How can I transform my 3D input into 2D so that ML models accept it?
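
One common option, among several, is to average the word vectors of each document so that every document becomes a single fixed-length row. The sketch below assumes tiny made-up three-dimensional vectors and a hypothetical helper `mean_pool`; it is one possible answer, not the only one.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    count = len(word_vectors)
    return [sum(dimension) / count for dimension in zip(*word_vectors)]

# Made-up 3-dimensional vectors; the spaCy vectors used above have 300 dimensions.
documents = [
    [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]],                   # document with two words
    [[0.0, 1.0, 1.0], [0.0, 3.0, 1.0], [3.0, 2.0, 1.0]],  # document with three words
]

# One row per document, one column per vector dimension: a 2D input.
features = [mean_pool(document) for document in documents]
print(features)  # [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
```

Averaging discards word order, so it trades some information for a shape the model can accept. spaCy's `Doc.vector` uses the same idea by default, returning the average of the token vectors.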