
# Documentation

[SpaCy](https://spacy.io/api/doc)

[NLTK](https://www.nltk.org/)

**Natural Language Processing (NLP)**  - A subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. 

**Token** - An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing

Attributes of good tokens:
- Should be stored in an iterable data structure (allows analysis of each "semantic unit")
- Should be all the same case (reduces the complexity of our data)
- Should be free of non-alphanumeric characters, i.e. punctuation and whitespace (removes information that is probably not relevant to the analysis)

In [0]:
import re

def tokenize(text):
    """
    Really basic tokenizer.
    Parses a string into a list of lowercase semantic units (words).
    """
    # Drop everything except letters, digits, and spaces, then lowercase and split
    tokens = re.sub(r'[^a-zA-Z0-9 ]', '', text).lower().split()
    return tokens

In [0]:
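# Basic tokenizing with spaCy
import spacy
from spacy.tokenizer import Tokenizer

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)

# `sample` can be any string; this one is only an illustration
sample = "Natural language processing turns raw text into tokens."
[token.text for token in tokenizer(sample)]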

In [0]:
# Tokenizer Pipe
# `df` is assumed to be a DataFrame with a 'reviews.text' column
tokens = []

for doc in tokenizer.pipe(df['reviews.text'], batch_size=500):
    doc_tokens = [token.text for token in doc]
    tokens.append(doc_tokens)
    
df['tokens'] = tokens

**Stopwords** - Words which are filtered out before or after processing of natural language data

In [0]:
# Spacy's Default Stop Words
nlp.Defaults.stop_words

In [0]:
# Stop Words and Remove punctuation
tokens = []

for doc in tokenizer.pipe(df['reviews.text'], batch_size=500):
    doc_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)
    
df['tokens'] = tokens

In [0]:
# Extending Stop Words
STOP_WORDS = nlp.Defaults.stop_words.union(['I', 'amazon', 'i', 'it', "it's", 'it.', 'the', 'this',])

tokens = []

for doc in tokenizer.pipe(df['reviews.text'], batch_size=500):
    doc_tokens = []

    for token in doc:
        # Compare the lowercased text, since the stop words are lowercase
        if token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())

    tokens.append(doc_tokens)
    
df['tokens'] = tokens

**Statistical Trimming** - A technique to preserve the words that describe most of the variation in your data.
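
One simple way to trim statistically is by document frequency: drop words that appear in too few documents (too rare to generalize) or in too many (too common to discriminate). The sketch below is a minimal pure-Python illustration; the toy corpus and the cutoffs `min_docs` and `max_share` are hypothetical values chosen for the example. The `filter_extremes(no_below=..., no_above=...)` call in the LDA cell further down applies the same idea.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized
toy_docs = [
    ["the", "battery", "life", "is", "great"],
    ["the", "screen", "is", "great"],
    ["the", "battery", "died", "quickly"],
    ["the", "tablet", "is", "fine"],
]

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

# Assumed cutoffs: keep words found in at least 2 documents,
# but in no more than 75% of all documents
min_docs = 2
max_share = 0.75
n_docs = len(toy_docs)

kept = {
    word
    for word, count in doc_freq.items()
    if count >= min_docs and count / n_docs <= max_share
}

# Rebuild each document using only the surviving words
trimmed_docs = [[word for word in doc if word in kept] for doc in toy_docs]
print(sorted(kept))
print(trimmed_docs)
```

Here "the" is dropped for appearing in every document, and words such as "tablet" are dropped for appearing in only one.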

**Stemming** - The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

**Lemmatization** - Usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

In [0]:
# Porter stemming algorithm
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["game","gaming","gamed","games","gamer"]

for word in words:
    print(ps.stem(word))

In [0]:
# SpaCy lemmatization
text = "This is the start of our NLP adventure. We started here with Spacy."
doc = nlp(text)

for token in doc:
    print(token.lemma_)

In [0]:
# Wrap it all in a function
def get_lemmas(text):
    lemmas = []
    doc = nlp(text)
    
    for token in doc:
        if (not token.is_stop and not token.is_punct and
                token.pos_ != 'PRON'):
            lemmas.append(token.lemma_)

    return lemmas

**Vectorization** - Transforming text into a meaningful vector (or array) of numbers.

In [0]:
# Count Vectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["We created a new dataset which emphasizes diversity of content, by \
        scraping content from the Internet."," In order to preserve document \
        quality, we used only pages which have been curated/filtered by \
        humans—specifically, we used outbound links from Reddit which received \
        at least 3 karma."," This can be thought of as a heuristic indicator \
        for whether other users found the link interesting (whether educational\
         or funny), leading to higher data quality than other similar datasets,\
         such as CommonCrawl."]

# create the transform
vectorizer = CountVectorizer(stop_words='english')

# tokenize and build vocab
vectorizer.fit(text)


# Create a Vocabulary
# The vocabulary establishes all of the possible words that we might use.
vectorizer.vocabulary_

# The vocabulary dictionary maps each word to a column index, not to a count
dtm = vectorizer.transform(text)

# Get word counts for each document
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
dtm = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
dtm.head()

**Bag-of-Words Model** - A representation of text that describes the occurrence of words within a document. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

BoW involves:
- A vocabulary of known words.
- A measure of the presence of known words.

One limitation of Bag-of-Words approaches is that any information about the textual context surrounding a word is lost. As a result, often the only tools we have for identifying words with similar usage or meaning, and consolidating them into a single vector, are stemming and lemmatization. Both tend to be quite limited at consolidating words unless the words are very close in their spelling or share a root.
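
To make the loss of word order concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are hypothetical, not part of any library.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical example sentences with opposite meanings
sentence_a = "dog bites man"
sentence_b = "man bites dog"

# The vocabulary of known words, in a fixed order
bow_vocab = sorted(set(sentence_a.split()) | set(sentence_b.split()))

vec_a = bag_of_words(sentence_a, bow_vocab)
vec_b = bag_of_words(sentence_b, bow_vocab)

print(bow_vocab)
print(vec_a, vec_b)
print(vec_a == vec_b)
```

Both sentences map to the same vector, so a Bag-of-Words model cannot tell them apart.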

In [0]:
# Basic Bag-of-Words
from collections import Counter

df['tokens'] = df['reviews.text'].apply(tokenize)

# `Counter` takes an iterable,
# but you can instantiate an empty one and update it.
word_counts = Counter()

# Update it based on a split of each of our documents
df['tokens'].apply(lambda x: word_counts.update(x))

# Print out the 10 most common words
word_counts.most_common(10)

**Term Frequency - Inverse Document Frequency (TF-IDF)** - The purpose of TF-IDF is to find what is unique to each document. To do this, we penalize the term frequencies of words that are common across all documents, which allows each document's unique topics to rise to the top.
- Term Frequency: The share of a document's words accounted for by each word.
- Inverse Document Frequency: A penalty for the word appearing in a high number of documents.
<center><img src="https://mungingdata.files.wordpress.com/2017/11/equation.png?w=430&h=336" width="300"></center>
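
The two ingredients can be combined in a few lines. The sketch below assumes one common textbook formulation, `tf * log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. The toy documents and the helper `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents
tfidf_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

n_docs = len(tfidf_docs)

# Document frequency: the number of documents containing each word
doc_freq = Counter(word for doc in tfidf_docs for word in set(doc))

def tf_idf(doc):
    """Weight each word by its share of the document times log(N / df)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = [tf_idf(doc) for doc in tfidf_docs]
for doc_weights in weights:
    print(doc_weights)
```

A word that appears in every document ("the") gets a weight of zero, while a word unique to one document ("away") gets the largest weight in that document.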

In [0]:
# Term Frequency - Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

# Create a vocabulary and get TF-IDF weights per document
dtm = tfidf.fit_transform(text)

# View Feature Matrix as DataFrame
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
docs = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
docs.head()

**Latent Semantic Indexing (LSI)** - Also called Latent Semantic Analysis (LSA). A technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts (topics) related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. Documents are compared by the cosine similarity of their vectors in the concept space: values close to 1 represent very similar paragraphs, while values close to 0 represent very dissimilar paragraphs.
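
One way to sketch this with scikit-learn is to apply a truncated SVD to a TF-IDF matrix and then compare documents by cosine similarity in the reduced space. The toy corpus, the choice of 2 components, and the variable names below are assumptions made for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: two documents about cats, two about markets
lsa_docs = [
    "the cat sat on the mat with another cat",
    "a cat and a kitten played on the mat",
    "stock prices fell as the market closed",
    "the market rallied and stock prices rose",
]

# Build the document-term matrix with TF-IDF weights
lsa_dtm = TfidfVectorizer(stop_words="english").fit_transform(lsa_docs)

# Project each document onto 2 latent concepts (an assumed value)
svd = TruncatedSVD(n_components=2, random_state=42)
doc_concepts = svd.fit_transform(lsa_dtm)

# Pairwise cosine similarity between documents in concept space
similarities = cosine_similarity(doc_concepts)
print(similarities.round(2))
```

Documents on the same subject should score close to 1 against each other and close to 0 against documents on the other subject.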

**Topic Modeling** - A type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of topic modeling and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

In [0]:
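# LDA Modeling
import re

from gensim import corpora
from gensim.models.ldamulticore import LdaMulticore

# `doc_stream(path)` is assumed to yield one list of tokens per document
# A Dictionary Representation of all the words in our corpus
id2word = corpora.Dictionary(doc_stream(path))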

# Let's remove extreme values from the dataset
id2word.filter_extremes(no_below=10, no_above=0.75)

corpus = [id2word.doc2bow(text) for text in doc_stream(path)]

lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   random_state=723812,
                   num_topics = 15,
                   passes=10,
                   workers=4
                  )

words = [re.findall(r'"([^"]*)"',t[1]) for t in lda.print_topics()]
topics = [' '.join(t[0:5]) for t in words]

In [0]:
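# LDA Interpretation
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in newer pyLDAvis releases

pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda, corpus, id2word)
distro = [lda[d] for d in corpus]

def update(doc):
    # Start every topic at 0 so each document gets all 15 topic columns
    d_dist = {k: 0 for k in range(0, 15)}
    for t in doc:
        d_dist[t[0]] = t[1]
    return d_dist

new_distro = [update(d) for d in distro]

# `titles` is assumed to hold one label per document
df = pd.DataFrame.from_records(new_distro, index=titles)
df.columns = topics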

In [0]:
from gensim.models import CoherenceModel

def compute_coherence_values(dictionary, corpus, path, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    path : path to input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        stream = doc_stream(path)
        model = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, workers=4)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=stream, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, 
                                                        corpus=corpus, 
                                                        path=path, 
                                                        start=2, 
                                                        limit=40, 
                                                        step=6)

In [0]:
# Show graph
import matplotlib.pyplot as plt

limit = 40
start = 2
step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# Pass the label in a list; a bare string is read as one-character labels
plt.legend(["coherence_values"], loc='best')
plt.show()

### Other

In [0]:
# KNN Example
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')

# Fit on TF-IDF vectors (converted to a dense array for the ball tree)
nn.fit(dtm.toarray())

# Query using kneighbors (expects a 2D array, so slice a single row)
nn.kneighbors(dtm[232].toarray())

In [0]:
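# KNN with Outside Text Sample
# `random_tech_article` must be an iterable of documents, e.g. [article_text];
# transform() raises an error if it is given a bare string
new = tfidf.transform(random_tech_article)
nn.kneighbors(new.toarray())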

In [0]:
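# Predicting categorical values with NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# X (raw text), y (labels), and `test` are assumed to be defined already
vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=10000)
sgdc = SGDClassifier()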
pipe = Pipeline([('vect', vectorizer), ('sgdc', sgdc)])
pipe.fit(X, y)

score = (cross_val_score(pipe, X, y, 
                          cv = 10, 
                          scoring = 'accuracy',
                          n_jobs = -1,
                          verbose = 10)).mean()
print(score)

preds = pipe.predict(test['description'])