# Natural Language Processing

What is NLP?
- Using computers to process (analyze, understand, generate) natural human languages

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding -- that is, enabling computers to derive meaning from human or natural language input.

Why NLP?
- Most knowledge created by humans is unstructured text
- Need some way to make sense of it
- Enables quantitative analysis of text data


Why NLTK?
- High-quality, reusable NLP functionality

In [2]:
import nltk
#nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Down

[nltk_data]    | Downloading package senseval to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/senseval.zip.
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/sentiwordnet.zip.
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/sentence_polarity.zip.
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/sinica_treebank.zip.
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /Users/ernestogiron/nltk_data...
[nltk_data]    |   Unzipping corpora/smultron.zip.
[nltk_data]    | Downloading p

True

In [5]:
'''
Tokenization

What:  Separate text into units such as sentences or words
Why:   Gives structure to previously unstructured text
Notes: Relatively easy with English language text, not easy with some languages
'''

# "corpus" = collection of documents
# "corpora" = plural form of corpus

import requests
from bs4 import BeautifulSoup
r = requests.get("http://en.wikipedia.org/wiki/Data_science")
b = BeautifulSoup(r.text, "lxml")
paragraphs = b.find("body").find_all("p")
text = ""
for paragraph in paragraphs:
    text += paragraph.text + " "
# Data Science corpus
text[:500]

# tokenize into sentences
sentences = nltk.sent_tokenize(text)
sentences[:10]

# tokenize into words
tokens = nltk.word_tokenize(text)
tokens[:100]

# only keep tokens that start with a letter (using regular expressions)
import re
clean_tokens = [token for token in tokens if re.search('^[a-zA-Z]+', token)]
clean_tokens[:100]

# count the tokens
from collections import Counter
c = Counter(clean_tokens)
c.most_common(25)       # mixed case
sorted(c.items())[:25]  # counts similar words separately
for item in sorted(c.items())[:25]:
    print(item[0], item[1])

ASA 1
Action 1
Additionally 1
Advanced 2
Although 1
American 2
An 1
Analysis 2
Analytics 3
April 2
Areas 1
Association 3
August 1
Because 1
Big 1
Board 1
Business 2
C. 1
C.F 1
CODATA 1
Carver 1
Century 3
Chandra 1
Chikio 1
Classification 1
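
Because the tokens are mixed case, words such as `Data` and `data` are counted separately. The following is a minimal sketch of the same tokenize-and-count steps using only the standard library, with lowercasing added before counting. The `simple_tokenize` helper and the sample sentence are illustrative assumptions, and the regular expression is much cruder than `nltk.word_tokenize`.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Return runs of ASCII letters as tokens (hypothetical helper, not part of NLTK)."""
    return re.findall(r"[A-Za-z]+", text)

# Made-up sample text, used only for illustration
sample_text = "Data science is a field. The field uses data, and Data matters."
tokens = simple_tokenize(sample_text)

mixed_case_counts = Counter(tokens)                # 'Data' and 'data' are counted apart
lower_counts = Counter(t.lower() for t in tokens)  # case-folded before counting

print(mixed_case_counts["Data"], mixed_case_counts["data"])
print(lower_counts["data"])
```

Whether to lowercase depends on the task: case can carry information (for example, proper nouns), so this is a design choice rather than a fixed rule.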


In [6]:
'''
Stemming
What:  Reduce a word to its base/stem form
Why:   Often makes sense to treat multiple word forms the same way
Notes: Uses a "simple" and fast rule-based approach
       Output can be undesirable for irregular words
       Stemmed words are usually not shown to users (used for analysis/indexing)
       Some search engines treat words with the same stem as synonyms
'''

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# example stemming
stemmer.stem('charge')
stemmer.stem('charging')
stemmer.stem('charged')

# stem the tokens
stemmed_tokens = [stemmer.stem(t) for t in clean_tokens]

# count the stemmed tokens
c = Counter(stemmed_tokens)
c.most_common(25)       # all lowercase
sorted(c.items())[:25]  # some are strange

[('a', 24),
 ('about', 2),
 ('academ', 1),
 ('action', 1),
 ('activ', 1),
 ('actual', 1),
 ('ad', 1),
 ('addit', 1),
 ('address', 1),
 ('advanc', 3),
 ('advantag', 1),
 ('advoc', 1),
 ('advocaci', 1),
 ('after', 1),
 ('all', 1),
 ('alon', 1),
 ('also', 1),
 ('although', 1),
 ('american', 2),
 ('an', 4),
 ('analysi', 6),
 ('analyst', 2),
 ('analyt', 6),
 ('analyz', 1),
 ('and', 49)]
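
To make the "rule-based" idea concrete, here is a toy suffix-stripping stemmer. It is only a sketch under simplifying assumptions (a short, hand-picked suffix list and a minimum stem length of three characters); the real Snowball algorithm applies many more rules in several passes.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    Hypothetical helper for illustration; this is not the Snowball algorithm.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["charge", "charging", "charged", "wolves", "is"]:
    print(w, "->", toy_stem(w))
```

As with the real stemmer, the resulting stems are not necessarily dictionary words, which is why they are typically used for indexing and analysis rather than display.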

In [7]:
'''
Lemmatization
What:  Derive the canonical form ('lemma') of a word
Why:   Can be better than stemming, reduces words to a 'normal' form.
Notes: Uses a dictionary-based approach (slower than stemming)
'''

lemmatizer = nltk.WordNetLemmatizer()

# compare stemmer to lemmatizer
stemmer.stem('dogs')
lemmatizer.lemmatize('dogs')

stemmer.stem('wolves') # Better for information retrieval and search
lemmatizer.lemmatize('wolves') # Better for text analysis

stemmer.stem('is')
lemmatizer.lemmatize('is')
lemmatizer.lemmatize('is',pos='v')

'be'
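
The last two calls differ only in the part-of-speech argument, which is why `'is'` maps to `'be'` only when it is treated as a verb. A minimal sketch of the dictionary-based idea follows; the tiny `TOY_LEMMAS` table and the `toy_lemmatize` helper are assumptions made for illustration and stand in for the WordNet lookup.

```python
# Hypothetical lookup table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("dogs", "n"): "dog",
    ("wolves", "n"): "wolf",
    ("is", "v"): "be",
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("wolves"))       # looked up as a noun
print(toy_lemmatize("is"))           # no noun entry, so returned unchanged
print(toy_lemmatize("is", pos="v"))  # looked up as a verb
```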

In [8]:
'''
Part of Speech Tagging
What:  Determine the part of speech of a word
Why:   This can inform other methods and models such as Named Entity Recognition
Notes: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
'''

temp_sent = 'Sinan and Kevin are great teachers!'
# pos_tag takes a tokenized sentence
nltk.pos_tag(nltk.word_tokenize(temp_sent))

[('Sinan', 'NNP'),
 ('and', 'CC'),
 ('Kevin', 'NNP'),
 ('are', 'VBP'),
 ('great', 'JJ'),
 ('teachers', 'NNS'),
 ('!', '.')]
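
One way these tags can inform another step is lemmatization, where the lemmatizer needs a coarse part of speech. The sketch below maps Penn Treebank tags to the single-letter codes `'n'`, `'v'`, `'a'`, `'r'`; the `penn_to_wordnet` helper is a hypothetical name, and falling back to noun for unknown tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a coarse part-of-speech letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default to noun (assumption)

# Tagged tokens copied from the output above
tagged = [('Sinan', 'NNP'), ('and', 'CC'), ('Kevin', 'NNP'), ('are', 'VBP'),
          ('great', 'JJ'), ('teachers', 'NNS'), ('!', '.')]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```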

In [9]:
'''
Stopword Removal
What:  Remove common words that will likely appear in any text
Why:   They don't tell you much about your text
'''

# most of top 25 stemmed tokens are "worthless"
c.most_common(25)

# view the list of stopwords
stopwords = nltk.corpus.stopwords.words('english')
sorted(stopwords)

# stem the stopwords
stemmed_stops = [stemmer.stem(t) for t in stopwords]

# remove stopwords from stemmed tokens
stemmed_tokens_no_stop = [t for t in stemmed_tokens if t not in stemmed_stops]
c = Counter(stemmed_tokens_no_stop)
most_common_stemmed = c.most_common(25)

# remove stopwords from cleaned tokens
clean_tokens_no_stop = [t for t in clean_tokens if t not in stopwords]
c = Counter(clean_tokens_no_stop)
most_common_not_stemmed = c.most_common(25)

# Compare the most common results for stemmed words and non stemmed words
for i in range(25):
    text_list = most_common_stemmed[i][0] + '  ' + str(most_common_stemmed[i][1]) + ' '*25
    text_list = text_list[0:30]
    text_list += most_common_not_stemmed[i][0] + '  ' + str(most_common_not_stemmed[i][1])
    print(text_list)

data  62                      data  39
scienc  39                    Data  23
statist  19                   science  23
term  12                      In  16
scientist  11                 Science  16
method  7                     term  12
comput  7                     Statistical  8
busi  7                       The  7
use  7                        scientists  7
intern  7                     methods  6
analysi  6                    statistics  6
analyt  6                     business  5
confer  6                     used  5
journal  6                    International  5
field  5                      analysis  4
publish  5                    many  4
lectur  5                     conference  4
mine  4                       first  4
mani  4                       Journal  4
job  4                        field  3
univ  4                       information  3
first  4                      computer  3
statistician  4               Review  3
big  4                        Century  3
area  3      
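
The right-hand column still contains `In` and `The` because the stopword list is lowercase while the cleaned tokens are mixed case, so the membership test misses capitalized stopwords. A minimal sketch of the difference, using a small made-up stopword set and token list (both are assumptions for illustration):

```python
# Hypothetical, deliberately tiny stopword set
toy_stopwords = {"in", "the", "is", "a", "of"}

# Made-up tokens with mixed case
tokens = ["In", "the", "field", "The", "data", "Data", "is", "data"]

# Case-sensitive filter: capitalized stopwords slip through
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Lowercase first, then filter: all stopwords are removed
case_insensitive = [t.lower() for t in tokens if t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```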

In [10]:
'''
Named Entity Recognition
What:  Automatically extract the names of people, places, organizations, etc.
Why:   Can help you to identify "important" words
Notes: Training NER classifier requires a lot of annotated training data
       Should be trained on data relevant to your task
       Stanford NER classifier is the "gold standard"
'''

def extract_entities(text):
    entities = []
    # tokenize into sentences
    for sentence in nltk.sent_tokenize(text):
        # tokenize sentences into words
        # add part-of-speech tags
        # use NLTK's NER classifier
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        # parse the results
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

for entity in extract_entities('Kevin and Sinan are instructors for General Assembly in Washington, D.C.'):
    print('[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves()))

[PERSON] Kevin
[PERSON] Sinan
[ORGANIZATION] General Assembly
[GPE] Washington


In [11]:
'''
Term Frequency - Inverse Document Frequency (TF-IDF)
What:  Computes the "relative frequency" with which a word appears in a document
           compared to its frequency across all documents
Why:   More useful than "term frequency" for identifying "important" words in
           each document (high frequency in that document, low frequency in
           other documents)
Notes: Used for search engine scoring, text summarization, document clustering
'''

sample = ['Bob likes sports', 'Bob hates sports', 'Bob likes likes trees']

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit_transform(sample).toarray()
list(vect.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(sample).toarray()
list(tfidf.get_feature_names_out())

['bob', 'hates', 'likes', 'sports', 'trees']
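
To see where the TF-IDF weights come from, the sketch below computes them by hand for the same three sample documents. It assumes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults; the whitespace tokenization is a simplification that happens to be sufficient for this sample.

```python
import math
from collections import Counter

sample = ['Bob likes sports', 'Bob hates sports', 'Bob likes likes trees']

# Simplified tokenization: lowercase and split on whitespace
docs = [doc.lower().split() for doc in sample]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency (assumed to match the library default)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF weights of one document, in vocab order."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

In each row, `bob` receives the lowest nonzero weight because it appears in every document, while a term that appears in only one document, such as `hates`, is weighted up.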

In [14]:
'''
LDA - Latent Dirichlet Allocation
What:  Way of automatically discovering topics from sentences
Why:   Much quicker than manually creating and identifying topic clusters
'''
# pip install lda
import lda
import numpy as np

# Instantiate a count vectorizer with two additional parameters
vect = CountVectorizer(stop_words='english', ngram_range=(1, 3))
sentences_train = vect.fit_transform(sentences)

# Instantiate an LDA model
model = lda.LDA(n_topics=10, n_iter=500)
model.fit(sentences_train) # Fit the model 
n_top_words = 10
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    # note: the slice [:-n_top_words:-1] keeps the n_top_words - 1 highest-weighted terms
    topic_words = np.array(vect.get_feature_names_out())[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ', '.join(topic_words)))

INFO:lda:n_documents: 39
INFO:lda:vocab_size: 1497
INFO:lda:n_words: 1874
INFO:lda:n_topics: 10
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -21343
INFO:lda:<10> log likelihood: -18304
INFO:lda:<20> log likelihood: -17843
INFO:lda:<30> log likelihood: -17596
INFO:lda:<40> log likelihood: -17548
INFO:lda:<50> log likelihood: -17598
INFO:lda:<60> log likelihood: -17455
INFO:lda:<70> log likelihood: -17323
INFO:lda:<80> log likelihood: -17331
INFO:lda:<90> log likelihood: -17201
INFO:lda:<100> log likelihood: -17441
INFO:lda:<110> log likelihood: -17433
INFO:lda:<120> log likelihood: -17237
INFO:lda:<130> log likelihood: -17315
INFO:lda:<140> log likelihood: -17217
INFO:lda:<150> log likelihood: -17144
INFO:lda:<160> log likelihood: -17174
INFO:lda:<170> log likelihood: -17277
INFO:lda:<180> log likelihood: -17379
INFO:lda:<190> log likelihood: -17262
INFO:lda:<200> log likelihood: -17249
INFO:lda:<210> log likelihood: -17369
INFO:lda:<220> log likelihood: -17194
INFO:lda:<230> log l

Topic 0: field, areas, technical, technical areas, article data, computing data, computing, article, field statistics
Topic 1: data, scientists, data scientists, lecture, lecture entitled statistics, data collection, entitled statistics data, analysts, collection
Topic 2: data, science, data science, journal, 2015, data science journal, 2014, science journal, statistical learning
Topic 3: statistical, analysis, data mining, mining, section, association, learning, 2001, research
Topic 4: science, data, data science, statistics, term, computer, term data science, term data, used
Topic 5: university, classification, international, publication, applications, society, definition, degree, survey
Topic 6: information, systems, issues, digital, digital data, data driven, technology, disciplinary, librarians archivists
Topic 7: methods, conference, analytics, international, conference data, launched, data, ecda, journal data science
Topic 8: big data, statistician, big, term statistician, mahal
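
Each topic line above lists nine terms even though `n_top_words` is 10, because the reversed slice `[:-n_top_words:-1]` stops one element early. A minimal standard-library sketch of top-term selection that returns exactly `n` terms is shown below; the `top_words` helper, the toy vocabulary, and the toy weights are assumptions for illustration.

```python
def top_words(topic_dist, vocab, n):
    """Return the n terms with the highest weight in topic_dist (hypothetical helper)."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_dist[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]

# Made-up vocabulary and topic-word weights
toy_vocab = ["data", "science", "journal", "statistics", "mining"]
toy_topic = [0.40, 0.25, 0.05, 0.20, 0.10]

print(top_words(toy_topic, toy_vocab, 3))

# For comparison: sorting ascending and slicing with [:-n:-1] yields n - 1 terms
ascending = [toy_vocab[i] for i in sorted(range(len(toy_vocab)), key=lambda i: toy_topic[i])]
print(ascending[:-3:-1])
```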

In [15]:
'''
EXAMPLE: Automatically summarize a document
'''

# corpus of 2000 movie reviews
from nltk.corpus import movie_reviews
reviews = [movie_reviews.raw(filename) for filename in movie_reviews.fileids()]

# create document-term matrix
tfidf = TfidfVectorizer(stop_words='english')
dtm = tfidf.fit_transform(reviews)
features = list(tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

# find the most and least "interesting" sentences in a randomly selected review
def summarize():
    # choose a random movie review    
    review_id = np.random.randint(0, len(reviews))
    review_text = reviews[review_id]

    # we are going to score each sentence in the review for "interesting-ness"
    sent_scores = []
    # tokenize document into sentences
    for sentence in nltk.sent_tokenize(review_text):
        # exclude short sentences
        if len(sentence) > 6:
            score = 0
            token_count = 0
            # tokenize sentence into words
            tokens = nltk.word_tokenize(sentence)
            # compute sentence "score" by summing TF-IDF for each word
            for token in tokens:
                if token in features:
                    score += dtm[review_id, features.index(token)]
                    token_count += 1
            # divide score by the number of matched tokens (+1 avoids division by zero)
            sent_scores.append((score / float(token_count + 1), sentence))

    # lowest scoring sentences
    print('\nLOWEST:\n')
    for sent_score in sorted(sent_scores)[:3]:
        print(sent_score[1])

    # highest scoring sentences
    print('\nHIGHEST:\n')
    for sent_score in sorted(sent_scores, reverse=True)[:3]:
        print(sent_score[1])

# try it out!
summarize()


LOWEST:

not that it doesn't try .
yet , in the end , it doesn't really matter .
the four leads are nearly all playing the same character .

HIGHEST:

there's the group's unofficial leader , thurgood ( david chappelle ) , scarface ( guillermo diaz ) , brian ( jim breuer ) , and kenny ( harland williams ) .
that is , until thurgood stumbles upon a stash of pharmaceutical marijuana being tested at the company where he works as a janitor .
to top it off , and in a move contrasting with the tone of the rest of the film , thurgood is given a love interest , mary jane ( rachel true ) .
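
The scoring step in `summarize` can be isolated as follows: sum the weights of the words that have one, then divide by the number of matched words plus one. This is only a sketch; the `score_sentence` helper, the word weights, and the three short sentences are made-up assumptions, not values taken from the TF-IDF matrix.

```python
def score_sentence(sentence, weights):
    """Average the weights of known words; the +1 avoids division by zero."""
    tokens = sentence.lower().split()
    matched = [weights[t] for t in tokens if t in weights]
    return sum(matched) / (len(matched) + 1)

# Hypothetical per-word weights standing in for TF-IDF values
toy_weights = {"thurgood": 0.30, "janitor": 0.25, "marijuana": 0.20, "matter": 0.05}

toy_sentences = [
    "it doesn't really matter",
    "thurgood works as a janitor",
    "not that it doesn't try",
]

ranked = sorted(toy_sentences, key=lambda s: score_sentence(s, toy_weights), reverse=True)
for s in ranked:
    print(round(score_sentence(s, toy_weights), 3), s)
```

Sentences made up of common words score near zero, while sentences containing rare, document-specific words (such as character names) rise to the top, which matches the pattern in the output above.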


In [22]:
'''
TextBlob Demo: "Simplified Text Processing"
Installation: pip install textblob
'''
from textblob import TextBlob, Word

# identify words and noun phrases
blob = TextBlob('Kevin and Sinan are instructors for General Assembly in Washington, D.C.')
blob.words
blob.noun_phrases

# sentiment analysis
blob = TextBlob('I hate this horrible movie. This movie is not very good.')
blob.sentences
blob.sentiment.polarity
[sent.sentiment.polarity for sent in blob.sentences]

# sentiment subjectivity
TextBlob("I am a cool person").sentiment.subjectivity # Pretty subjective
TextBlob("I am a person").sentiment.subjectivity # Pretty objective
# different scores for essentially the same sentence
print(TextBlob('Kevin and Sinan are instructors for General Assembly in Washington, D.C.').sentiment.subjectivity)
print(TextBlob('Kevin and Sinan are instructors in Washington, D.C.').sentiment.subjectivity)

# singularize and pluralize
blob = TextBlob('Put away the dishes.')
[word.singularize() for word in blob.words]
[word.pluralize() for word in blob.words]

# spelling correction
blob = TextBlob('15 minuets late')
blob.correct()

# spellcheck
Word('parot').spellcheck()

# definitions
Word('bank').define()
Word('bank').define('v')

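# translation and language identification
# note: translate() and detect_language() call an external translation service
#       and are not available in recent TextBlob releases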
blob = TextBlob('Welcome to the classroom.')
print(blob.translate(to='es'))
blob = TextBlob('Hola amigos')
blob.detect_language()

0.5
0.0
Bienvenido al salón de clases.


'es'
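
Lexicon-based scoring is one common way to estimate sentence polarity. The toy sketch below illustrates the general idea with a four-word lexicon and a simple negation rule; the lexicon values, the negation list, and the `toy_polarity` helper are all assumptions for illustration, and this is not how TextBlob is implemented.

```python
import re

# Hypothetical word scores in the range [-1, 1]
TOY_LEXICON = {"hate": -0.8, "horrible": -1.0, "good": 0.7, "love": 0.5}
NEGATIONS = {"not", "never", "no"}

def toy_polarity(text):
    """Average the scores of known words, flipping the sign after a negation."""
    scores = []
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
        elif token in TOY_LEXICON:
            value = TOY_LEXICON[token]
            scores.append(-value if negate else value)
            negate = False  # a negation applies only to the next scored word
    return sum(scores) / len(scores) if scores else 0.0

for sentence in ["I hate this horrible movie.",
                 "This movie is not very good.",
                 "I am a person"]:
    print(sentence, toy_polarity(sentence))
```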

In [23]:
'''
Data Science Toolkit Sentiment
Provides many different APIs for converting and getting information
We'll use the text2sentiment API.
'''
# Import the necessary modules
import requests
import json

# Sample sentences
sentences = ['I love Sinan!', 'I hate Sinan!', 'I feel nothing about Sinan!']
# API endpoint (i.e., the URL they ask you to send your text to)
url = 'http://www.datasciencetoolkit.org/text2sentiment/'

# Loop through the sentences
for sentence in sentences:
    payload = {'text': sentence} # The sentence we want the sentiment of
    headers = {'content-type': 'application/json'} # The type of data you are sending
    r = requests.post(url, data=json.dumps(payload), headers=headers) # Send the data
    print(sentence, json.loads(r.text)['score']) # Print the results

I love Sinan! 3.0
I hate Sinan! -3.0
I feel nothing about Sinan! 0
