# NLP part 1: Basic tasks

---

You are currently looking at **version 1.0** of this notebook.

---

## Text Mining areas in short:

## 1. Working text (Part 1)
Working with text requires a toolbox quite different from the one used for numerical data. Characters, words, and sentences generally need to be cleaned and (pre)processed before any actual analysis. Luckily, there are valuable frameworks and toolboxes available, such as NLTK:
 - NLTK documentation link: http://www.nltk.org/api/nltk.html
 - NLTK cheat sheet: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
 - NLTK book: http://www.nltk.org/book/

## 2. Sentiment Analysis (Part 2)
Sentiment analysis is generally a starting point in analyzing a text and is then coupled with other techniques (e.g., topic analysis). It is usually done using a corpus of positive and negative words.
It identifies entities and emotions in a sentence and uses these to determine whether the entity is being viewed positively or negatively.

#### Easy example of sentiment analysis
<li>I had an <b style="color:green">excellent</b> souffle at the restaurant Cavity Maker</li>
<li>Excellent is a positive word for both the souffle as well as for the restaurant</li>

#### Not so easy examples
Often, looking at words alone is not enough to figure out the sentiment:  
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for a ‘stuck at home’ snow day</i></li> This one is easy since it includes an explicit positive opinion using a positive word
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for using as a liner for your cat’s litter box</i></li> Not so simple! The positive word "excellent" is used with a negative connotation. 
<li><i>The Girl on the Train is <span style="color:green">better</span> than Gone Girl</i></li> The positive word is used as a comparator. Whether the writer likes The Girl on the Train or not depends on what he or she thinks of Gone Girl
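
The limits of pure word counting can be sketched in a few lines. The two tiny word sets below are hypothetical stand-ins for a real lexicon such as Hu and Liu's, and `lexicon_score` is an illustrative helper, not an NLTK function.

```python
import string

# Hypothetical mini-lexicon; a real one contains thousands of entries
POSITIVE = {"excellent", "better", "good"}
NEGATIVE = {"terrible", "worse", "bad"}

def lexicon_score(sentence):
    """Count positive minus negative words; a score above 0 reads as positive."""
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

snow_day = "The Girl on the Train is an excellent book for a 'stuck at home' snow day"
litter_box = "The Girl on the Train is an excellent book for using as a liner for your cat's litter box"

# Both sentences receive the same score, although the second one is negative
print(lexicon_score(snow_day), lexicon_score(litter_box))
```

Because the score only looks at individual words, sarcasm and comparisons such as "better than" need context that a plain lexicon lookup does not have.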

## Sources of sentiment coded words
<ol>
<li>Hu and Liu's sentiment analysis lexicon: words coded as either positive or negative</li>
<ul>
<li>http://ptrckprry.com/course/ssd/data/positive-words.txt</li>
<li>http://ptrckprry.com/course/ssd/data/negative-words.txt</li>
</ul>
<li>NRC Emotion Lexicon: words coded into emotional categories (many languages)</li>
<ul>
<li>http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm</li>
</ul>
<li>SentiWordNet: Lists of words weighted by positive or negative sentiment. Includes guidance on how to use the words</li>
<ul>
<li>http://sentiwordnet.isti.cnr.it/</li>
</ul>
<li>VADER sentiment tool: 7800 words with positive or negative polarity</li>
<ul>
<li>Included with Python's NLTK</li>
</ul>
</ol>

## 3. Topic modeling (Part 2)
The goal of topic modeling is to identify the major concepts underlying a piece of text.  
Topic modeling uses "unsupervised learning". No a priori knowledge is necessary,  
though such knowledge does help when cleaning up the results.

---
## Setup notebook
---

### Import the generic libraries used in this notebook

In [None]:
%matplotlib inline

import string
import numpy as np
import pandas as pd
import requests
import json
import re
from collections import OrderedDict, Counter
import pprint

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

### Manage warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Set defaults and constants

In [None]:
# Set pandas defaults
pd.set_option('display.max_rows', 10)                        # Show max 10 rows: head(5) ... tail(5); full option name avoids ambiguous matches
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # Set precision of DataFrames/Series

### Check current working directory and file structure

In [None]:
!pwd
# !ls

---
## 1. Working text
---

In [None]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
n_chars = len(text1) # The length of text1

In [None]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' ' (the trailing space yields one empty string)
n_words = len(text2)

In [None]:
print(text2)
n_chars, n_words

In [None]:
list('abcdefghijklm'), list('1234567890')

### List comprehension allows us to find specific words:

In [None]:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2

In [None]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

In [None]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

<br>
We can find unique words using `set()`.

In [None]:
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4), len(set(text4))

In [None]:
set(text4)

In [None]:
set([w.lower() for w in text4])

### Processing free-text

In [None]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6;

#### Finding hashtags:

In [None]:
[w for w in text6 if w.startswith('#')]

#### Finding callouts:

In [None]:
[w for w in text6 if w.startswith('@')]

In [None]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

#### Regular expressions help us with more complex parsing
For example, `'@[A-Za-z0-9_]+'` matches an `'@'` followed by at least one character that is a:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* digit (`'0-9'`)
* or underscore (`'_'`)

Note that `re.search` finds the pattern anywhere in a word, not only at its start.
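
As a rough illustration of why the pattern is stricter than `startswith('@')`, the sketch below uses a shortened, illustrative version of the tweet and `re.findall`, which returns every match in the whole string.

```python
import re

# Shortened, illustrative version of the tweet used in this notebook
tweet = '@UN @UN_Women "Ethics are built right into the ideals" #UNSG @ NY Society'

# startswith('@') also accepts the bare '@', which is not a callout
naive = [w for w in tweet.split(' ') if w.startswith('@')]

# The pattern requires at least one allowed character after '@'
mentions = re.findall(r'@[A-Za-z0-9_]+', tweet)

print(naive)
print(mentions)
```

Keep in mind that the pattern is not anchored to word boundaries, so it would also match the tail of an e-mail address.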

In [None]:
import re

In [None]:
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

### Read a labeled data set; [(text, label)]

In [None]:
with open("data/sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()

#### First look at data structure

In [None]:
content[:10]

#### Split sentences and labels

In [None]:
## Remove leading and trailing white spaces before splitting labels
content = [x.strip() for x in content]

## Separate the sentences from the labels; '\t1\n' => 1 is the label
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]
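
A slightly more defensive variant of the split is sketched below. The two sample lines are made up for illustration and only mimic the `sentence<TAB>label` layout; `rsplit` with a limit of 1 is used on the assumption that a sentence could itself contain a tab.

```python
# Hypothetical lines in the same "sentence<TAB>label" layout as the data file
sample = [
    "The battery died after a day.\t0\n",
    "Great sound, would buy again.\t1\n",
]

# Split once from the right so only the final tab separates text and label
rows = [line.strip().rsplit("\t", 1) for line in sample]

sample_sentences = [text for text, _ in rows]
sample_labels = [int(label) for _, label in rows]  # numeric labels instead of strings

print(sample_sentences)
print(sample_labels)
```

Converting the labels to integers at this point avoids comparing against the strings `'0'` and `'1'` later on.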

### Preprocessing the text data

To transform this prediction problem into one amenable to linear classification, we will first need to preprocess the text data. We will do four transformations:

1. Remove punctuation and numbers.
2. Transform all words to lower-case.
3. Remove _stop words_.
4. Convert the sentences into vectors, using a bag-of-words representation.
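
Step 4 can be sketched without any extra library. The three documents below are hypothetical, already-cleaned sentences, and `bag_of_words` is an illustrative helper; a real pipeline would more likely use a library vectorizer.

```python
from collections import Counter

# Hypothetical, already-cleaned sentences
docs = ["good phone good price", "bad battery", "good battery life"]

# Vocabulary: every distinct word, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return the count of each vocabulary word in doc, ignoring word order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Every sentence becomes a vector of the same length, which is what a linear classifier needs as input.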

In [None]:
def full_remove(x, removal_list):
    # Replace chars from removal list with spaces
    for remove_item in removal_list:
        x = x.replace(remove_item, ' ')
    # Return without superfluous spaces
    return ' '.join(x.split(None))

In [None]:
## Remove digits
digit_less = [full_remove(x, list('1234567890')) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]
type(sents_lower), sents_lower[:5]

#### Stop words
 - Stop words are words that are filtered out because they are believed to contain no useful information for the task at hand. You can create your own arbitrary stop word list or use a generic one.

In [None]:
from nltk.corpus import stopwords

In [None]:
corpus = ' '.join(sents_lower)
dictionary = set(corpus.split())

# Use predefined stop words set
stop_words = set(stopwords.words('english'))

# Define our own unwanted words set
unwanted_words = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

# Get short words
MIN_LENGTH = 3
short_words = set([word for word in dictionary if len(word) < MIN_LENGTH])

# Define set of words to clear from text/sentences
clear_set = stop_words | unwanted_words | short_words

# Clear text from unwanted words
sents_split = [x.split() for x in sents_lower]
sents_processed = [' '.join(list(filter(lambda word: word not in clear_set, sent_words))) for sent_words in sents_split]

What do the sentences look like so far?

In [None]:
sents_processed[0:10]

---
## Basic NLP Tasks with NLTK
---

### NLTK sources
 - NLTK documentation link: http://www.nltk.org/api/nltk.html
 - Commands cheat sheet: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
 - NLTK book: http://www.nltk.org/book/

In [None]:
import nltk
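# Note: this import defines text1..text9 and sent1..sent9 and therefore
# overwrites the text1..text8 variables created earlier in this notebook
from nltk.book import *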

### Counting vocabulary of words

In [None]:
'no words in text:', len(text7), text7

In [None]:
'no words in sentence:', len(sent7), sent7

In [None]:
'no unique words:', len(set(text7))

In [None]:
'first 10 unique words:', list(set(text7))[:10]

### Frequency of words

In [None]:
dist = FreqDist(text7)
dist2 = Counter(text7)
len(dist), dist == dist2

In [None]:
vocab1 = dist.keys()
# vocab1[:10] # can't slice in python 3

# Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

In [None]:
'frequency of key in text:', dist['four']

In [None]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
'words with more than 5 characters and frequency higher than 100:', freqwords
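
`FreqDist` is a subclass of `collections.Counter`, so the same counting logic can be tried on a small example without loading an NLTK corpus. The sentence below is an arbitrary illustration, and the thresholds are scaled down to suit such a short text.

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)

print(counts["be"])           # frequency of a single word
print(counts.most_common(2))  # the two most frequent words with their counts

# Same filter shape as above, with thresholds scaled down for a short text
frequent = [w for w in counts if len(w) > 1 and counts[w] > 1]
print(frequent)
```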

### Normalization and stemming
Stemming is the process of reducing inflected or derived words to their stem (base or root form). The stem need not be identical to the morphological root of the word.

In [None]:
input1 = 'List listed lists listing listings'
words1 = input1.lower().split(' ')
words1

In [None]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]
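
To make the idea of a stem more concrete, here is a deliberately naive suffix stripper. It is a toy rule set written for illustration, not the Porter algorithm, and `naive_stem` with its suffix list is a hypothetical helper.

```python
def naive_stem(word, suffixes=("ings", "ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in "list listed lists listing listings".split()])
print(naive_stem("running"))
```

The second call returns `runn`, which shows why a stem need not be a real word or the morphological root.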

### Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word.

In [None]:
inputl = 'Walk walked walks walking walker Walkers'
wordsl = inputl.lower().split(' ')

WNlemma = nltk.WordNetLemmatizer()
'walks => walk ', [WNlemma.lemmatize(t) for t in wordsl], [WNlemma.lemmatize(t) for t in wordsl] == wordsl

In [None]:
udhr = nltk.corpus.udhr.words('English-Latin1')
'Universal declaration of human rights corpus:', udhr[:20]

In [None]:
[porter.stem(t) for t in udhr[:20]]

In [None]:
WNlemma = nltk.WordNetLemmatizer()
lemmatized = [WNlemma.lemmatize(t) for t in udhr[:20]]

#### Lexical diversity

In [None]:
len(set(lemmatized)) / len(lemmatized)
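
Lexical diversity is the share of distinct tokens among all tokens. A minimal sketch on a toy sentence follows; `lexical_diversity` is an illustrative helper, and returning 0.0 for empty input is an assumption made here to avoid a division by zero.

```python
def lexical_diversity(tokens):
    """Share of distinct tokens; 1.0 means no token repeats."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

lowered = "to be or not to be".split()
mixed_case = "To be or not to be".split()

print(lexical_diversity(lowered))     # 4 distinct tokens out of 6
print(lexical_diversity(mixed_case))  # 'To' and 'to' count as different tokens
```

Normalizing case (or lemmatizing) before counting therefore lowers the measured diversity.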

### Tokenization

In [None]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

In [None]:
text_tokens = nltk.word_tokenize(text11)
text_nltk = nltk.Text(text_tokens)
text_tokens, text_nltk

In [None]:
nltk.word_tokenize(text11), '-'*50, 'no of words:', len(nltk.word_tokenize(text11))
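
The difference between splitting on spaces and tokenizing can be approximated with a regular expression. This is a simplified sketch, not what `nltk.word_tokenize` does: NLTK additionally splits the contraction into `should` and `n't`, while the pattern below keeps it as one token.

```python
import re

sentence = "Children shouldn't drink a sugary drink before bed."

# Whitespace splitting keeps the full stop glued to the last word
print(sentence.split(' '))

# Words (with an optional internal apostrophe) or single punctuation marks
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(tokens)
```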

In [None]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

In [None]:
sentences

In [None]:
text1

In [None]:
len(nltk.word_tokenize(' '.join(text1)))

In [None]:
text1[:10], nltk.Text(text1[:10])

In [None]:
# Percentage of tokens that are 'whale' (case-insensitive: 'whale' and 'Whale')
words = ' '.join(text1).lower().split(' ')
dist = FreqDist(words)
dist['whale'] * 100 / len(nltk.word_tokenize(' '.join(text1)))

In [None]:
FreqDist(text1).most_common(10)

In [None]:
# word length > 5, frequency > 150
dist = FreqDist(text1).most_common()
sorted([k for k, v in dist if len(k) > 5 and v > 150])

In [None]:
# Longest word + length
from collections import OrderedDict
dist = FreqDist(text1).most_common()

# dictionary sorted by length of the key string
longest_word = OrderedDict(sorted(dist, key=lambda t: len(t[0]), reverse=True)).popitem(last=False)
longest_word[0], len(longest_word[0])

In [None]:
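# Index = word length, value = last word seen with that length; sort so the longest comes last
pd.Series({len(w): w for w in text1}).sort_index().iloc[-1:]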

In [None]:
# unique words with frequency of more than 2000 and their frequency
dist = FreqDist(text1).most_common(50)
result = sorted([(f, w) for w, f in dist if f > 2000 and w.isalpha()])

In [None]:
# Average number of tokens per sentence
sentences = nltk.sent_tokenize(' '.join(text1))
np.mean([len(nltk.word_tokenize(s)) for s in sentences])

---
## Advanced NLP Tasks with NLTK
---

### POS tagging

In [None]:
nltk.help.upenn_tagset('NN'), nltk.help.upenn_tagset('DT'), nltk.help.upenn_tagset('VB'), nltk.help.upenn_tagset('MD')

In [None]:
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

In [None]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

In [None]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

In [None]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

### POS tagging and parsing ambiguity

In [None]:
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

In [None]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

### Named Entities: People, places, organizations
 - Named entities are often the subject of sentiments so identifying them can be very useful

### Named entity detection - Part-of-speech tagging
 - tokenize sentences with sentence detector (English)
 - tokenize words in each sentence
 - chunk them; ne_chunk identifies likely chunked candidates (ne = named entity)
 - build chunks using nltk's guess on what members of chunk represent (people, place, organization)

In [None]:
# NOTE: `community_data` is not defined in this notebook; it is assumed to be
# an NLTK corpus reader (anything with a .raw() method) loaded beforehand.
en = {}
try:
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = sent_detector.tokenize(community_data.raw().strip())
    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)
        for tree in chunked:
            # Only named-entity subtrees have a label; plain tokens are tuples
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
except Exception as e:
    print(str(e))

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(en)

In [None]:
# most frequent parts of speech in this text? What is their frequency?
df = pd.DataFrame(nltk.pos_tag(text1))
df.columns = ['word', 'pos']
df = df.groupby('pos')['pos'].count().sort_values(ascending=False)
list(zip(df.head(5).index, df.head(5)))

---
## Part 2: Text Mining areas explained

[Open Notebook](./nlp_part2_sentiment_topic_similarity_classification_.ipynb)

---