# Text mining intro

---

You are currently looking at **version 1.0** of this notebook.

---

## Text Mining areas in short:

## 1. Working with text
Working with text needs a toolbox that is quite different from the one used for numerical data. Generally, characters, words and sentences need to be cleaned and (pre)processed before the actual analysis can start. Luckily, there are very valuable frameworks and toolboxes around, like NLTK:
 - NLTK documentation link: http://www.nltk.org/api/nltk.html
 - NLTK cheat sheet: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
 - NLTK book: http://www.nltk.org/book/

## 2. Sentiment Analysis
Sentiment analysis is often a starting point in analyzing a text and is then coupled with other techniques (e.g., topic analysis). It is usually done with a lexicon of positive and negative words.
It identifies entities and emotions in a sentence and uses these to determine whether the entity is being viewed positively or negatively.

#### An easy sentiment analysis example
<li>I had an <b style="color:green">excellent</b> souffle at the restaurant Cavity Maker</li>
<li>Excellent is a positive word for both the souffle as well as for the restaurant</li>

#### Not so easy examples
Often, looking at words alone is not enough to figure out the sentiment:  
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for a ‘stuck at home’ snow day</i></li> This one is easy since it includes an explicit positive opinion using a positive word
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for using as a liner for your cat’s litter box</i></li> Not so simple! The positive word "excellent" is used with a negative connotation. 
<li><i>The Girl on the Train is <span style="color:green">better</span> than Gone Girl</i></li> The positive word is used as a comparator. Whether the writer likes The Girl on the Train or not depends on what he or she thinks of Gone Girl
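
To see why word counting alone falls short, here is a minimal sketch that scores the three example sentences with a tiny hand-made lexicon. The `POSITIVE` and `NEGATIVE` sets and the `lexicon_score` helper are illustrative assumptions, not part of any library or of the lexicons listed in this notebook.

```python
import string

# Hypothetical mini-lexicon, for illustration only
POSITIVE = {"excellent", "better", "good"}
NEGATIVE = {"bad", "awful", "worse"}

def lexicon_score(sentence):
    """Return (number of positive words) - (number of negative words)."""
    # Lower-case and strip punctuation so that "cat's" or "day." do not break matching
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

examples = [
    "The Girl on the Train is an excellent book for a 'stuck at home' snow day",
    "The Girl on the Train is an excellent book for using as a liner for your cat's litter box",
    "The Girl on the Train is better than Gone Girl",
]
scores = [lexicon_score(s) for s in examples]
print(scores)
```

All three sentences receive the same score, although only the first one is clearly positive. A pure word count cannot tell them apart.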

## Sources of sentiment coded words
<ol>
<li>Hu and Liu's sentiment analysis lexicon: words coded as either positive or negative</li>
<ul>
<li>http://ptrckprry.com/course/ssd/data/positive-words.txt</li>
<li>http://ptrckprry.com/course/ssd/data/negative-words.txt</li>
</ul>
<li>NRC Emotion Lexicon: words coded into emotional categories (many languages)</li>
<ul>
<li>http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm</li>
</ul>
<li>SentiWordNet: Lists of words weighted by positive or negative sentiment. Includes guidance on how to use the words</li>
<ul>
<li>http://sentiwordnet.isti.cnr.it/</li>
</ul>
<li>VADER sentiment tool: about 7,500 words and other lexical features with positive or negative polarity</li>
<ul>
<li>Included with python nltk</li>
</ul>
</ol>

## 3. Topic modeling
The goal of topic modeling is to identify the major concepts underlying a piece of text.  
Topic modeling uses unsupervised learning: no a priori knowledge of the topics is necessary.  
Prior knowledge does help, though, when cleaning up the results!

---
## Setup notebook
---

### Import the generic libraries used in this notebook

In [None]:
%matplotlib inline

import string
import numpy as np
import pandas as pd
import requests
import json
import re
from collections import OrderedDict, Counter
import pprint

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

### Manage warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Set defaults and constants

In [None]:
# Set pandas defaults
pd.set_option('display.max_rows', 10)                        # Show max 10 rows: head(5) ... tail(5)
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # Set precision of DataFrames/Series

### Check current working directory and file structure

In [None]:
!pwd
# !ls

---
## 1. Working with text
---

In [None]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations"
n_chars = len(text1) # The number of characters in text1

In [None]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.
n_words = len(text2)

In [None]:
print(text2)
n_chars, n_words

In [None]:
list('abcdefghijklm'), list('1234567890')

### List comprehension allows us to find specific words:

In [None]:
[w for w in text2 if len(w) > 3] # Words in text2 that are more than 3 letters long

In [None]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

In [None]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

<br>
We can find unique words using `set()`.

In [None]:
text3 = 'To be or not to be'
text4 = text3.split(' ')
len(text4), len(set(text4))

In [None]:
set(text4)

In [None]:
set([w.lower() for w in text4])

### Processing free-text

In [None]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')
text6;

#### Finding hashtags:

In [None]:
[w for w in text6 if w.startswith('#')]

#### Finding callouts:

In [None]:
[w for w in text6 if w.startswith('@')]

In [None]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

#### Regular expressions help us with more complex parsing
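For example, `'@[A-Za-z0-9_]+'` matches every word that contains an `'@'` followed by at least one character from this set:
* capital letters (`'A-Z'`)
* lowercase letters (`'a-z'`)
* digits (`'0-9'`)
* underscore (`'_'`)

Note that `re.search` looks for the pattern anywhere in the word; use `re.match` to require that the word starts with `'@'`.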
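
As a small self-contained illustration of this pattern, the sketch below extracts the mentions from the example tweet and contrasts `search` with `match`. The word list with `info@example.org` is a made-up test case.

```python
import re

MENTION = re.compile(r'@[A-Za-z0-9_]+')

tweet = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match; the lone '@' is skipped
# because the pattern needs at least one allowed character after it
mentions = MENTION.findall(tweet)

# search matches anywhere in the word, match only at the start
words = ['@UN', 'info@example.org', '@']
searched = [w for w in words if MENTION.search(w)]
matched = [w for w in words if MENTION.match(w)]

print(mentions, searched, matched)
```

If only real callouts are wanted, `match` (or a pattern anchored with `^`) avoids picking up e-mail addresses.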

In [None]:
import re

In [None]:
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

### Read a labeled data set; [(text, label)]

In [None]:
with open("data/sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()

#### First look at data structure

In [None]:
content[:10]

#### Split sentences and labels

In [None]:
## Remove leading and trailing white spaces before splitting labels
content = [x.strip() for x in content]

## Separate the sentences from the labels; '\t1\n' => 1 is the label
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

### Preprocessing the text data

To transform this prediction problem into one amenable to linear classification, we will first need to preprocess the text data. We will do four transformations:

1. Remove punctuation and numbers.
2. Transform all words to lower-case.
3. Remove _stop words_.
4. Convert the sentences into vectors, using a bag-of-words representation.
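
Step 4, the bag-of-words representation, can be sketched without any library. The three toy documents and the `to_vector` helper are assumptions made for illustration; in practice a vectorizer from a machine learning library would usually do this job.

```python
from collections import Counter

# Toy documents, already cleaned and lower-cased
docs = ["good food good service", "bad service", "good wine"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocab=vocab):
    """Count how often each vocabulary word occurs in doc (word order is ignored)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

Each sentence becomes a vector with one entry per vocabulary word, which is the form a linear classifier expects.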

In [None]:
def full_remove(x, removal_list):
    # Replace chars from removal list with spaces
    for remove_item in removal_list:
        x = x.replace(remove_item, ' ')
    # Return without superfluous spaces
    return ' '.join(x.split(None))

In [None]:
## Remove digits
digit_less = [full_remove(x, list('1234567890')) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]
type(sents_lower), sents_lower[:5]

#### Stop words
 - Stop words are words that are filtered out because they are believed to contain no useful information for the task at hand. You can create your own arbitrary stop word list or use a generic one.

In [None]:
from nltk.corpus import stopwords

In [None]:
corpus = ' '.join(sents_lower)
dictionary = set(corpus.split())

# Use predefined stop words set
stop_words = set(stopwords.words('english'))

# Define our own unwanted words set
unwanted_words = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

# Get short words
MIN_LENGTH = 3
short_words = set([word for word in dictionary if len(word) < MIN_LENGTH])

# Define set of words to clear from text/sentences
clear_set = stop_words | unwanted_words | short_words

# Clear text from unwanted words
sents_split = [x.split() for x in sents_lower]
sents_processed = [' '.join(word for word in sent_words if word not in clear_set) for sent_words in sents_split]

What do the sentences look like so far?

In [None]:
sents_processed[0:10]

---
## Basic NLP Tasks with NLTK
---

### NLTK sources
 - NLTK documentation link: http://www.nltk.org/api/nltk.html
 - Commands cheat sheet: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
 - NLTK book: http://www.nltk.org/book/

In [None]:
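import nltk
# Note: this star import rebinds text1 ... text9 and sent1 ... sent9 to NLTK's sample texts,
# overwriting the variables with the same names defined earlier in this notebook
from nltk.book import *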

### Counting vocabulary of words

In [None]:
'number of words in text:', len(text7), text7

In [None]:
'number of words in sentence:', len(sent7), sent7

In [None]:
'number of unique words:', len(set(text7))

In [None]:
'first 10 unique words:', list(set(text7))[:10]

### Frequency of words

In [None]:
dist = FreqDist(text7)
dist2 = Counter(text7)
len(dist), dist == dist2

In [None]:
vocab1 = dist.keys()
# vocab1[:10] # can't slice in python 3

# Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

In [None]:
'frequency of key in text:', dist['four']

In [None]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
'words with more than 5 characters and frequency higher than 100:', freqwords

### Normalization and stemming
Stemming is the process of reducing inflected or derived words to their stem (base or root form). The stem need not be identical to the morphological root of the word.

In [None]:
input1 = 'List listed lists listing listings'
words1 = input1.lower().split(' ')
words1

In [None]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

### Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.
For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word.

In [None]:
inputl = 'Walk walked walks walking walker Walkers'
wordsl = inputl.lower().split(' ')

WNlemma = nltk.WordNetLemmatizer()
'walks => walk ', [WNlemma.lemmatize(t) for t in wordsl], [WNlemma.lemmatize(t) for t in wordsl] == wordsl

In [None]:
udhr = nltk.corpus.udhr.words('English-Latin1')
'Universal declaration of human rights corpus:', udhr[:20]

In [None]:
[porter.stem(t) for t in udhr[:20]]

In [None]:
WNlemma = nltk.WordNetLemmatizer()
lemmatized = [WNlemma.lemmatize(t) for t in udhr[:20]]

#### Lexical diversity
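
Lexical diversity is the ratio of distinct tokens to the total number of tokens (also called the type-token ratio). A minimal sketch, with the helper name `lexical_diversity` chosen here for illustration:

```python
def lexical_diversity(tokens):
    """Share of distinct tokens among all tokens (type-token ratio)."""
    if not tokens:
        return 0.0  # avoid dividing by zero for an empty text
    return len(set(tokens)) / len(tokens)

tokens = 'to be or not to be'.split()
ratio = lexical_diversity(tokens)
print(ratio)
```

A value close to 1 means almost every token is new; a low value means a lot of repetition. The ratio tends to drop as texts get longer, so it is best compared across texts of similar length.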

In [None]:
len(set(lemmatized)) / len(lemmatized)

### Tokenization

In [None]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

In [None]:
text_tokens = nltk.word_tokenize(text11)
text_nltk = nltk.Text(text_tokens)
text_tokens, text_nltk

In [None]:
nltk.word_tokenize(text11), '-'*50, 'no of words:', len(nltk.word_tokenize(text11))

In [None]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

In [None]:
sentences

In [None]:
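# Assumption: moby_raw is the raw text of Moby Dick from NLTK's Gutenberg corpus (the source of text1)
moby_raw = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
len(nltk.word_tokenize(moby_raw))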

In [None]:
text1[:10], nltk.Text(text1[:10])

In [None]:
words = ' '.join(text1).lower().split(' ')
dist = FreqDist(words)#.most_common() in ['whale', 'Whale']
dist['whale'] * 100 / len(nltk.word_tokenize(' '.join(text1)))

In [None]:
FreqDist(text1).most_common(10)

In [None]:
# word length > 5, frequency > 150
dist = FreqDist(text1).most_common()
sorted([k for k, v in dist if len(k) > 5 and v > 150])

In [None]:
# Longest word + length
from collections import OrderedDict
dist = FreqDist(text1).most_common()

# dictionary sorted by length of the key string
longest_word = OrderedDict(sorted(dist, key=lambda t: len(t[0]), reverse=True)).popitem(last=False)
longest_word[0], len(longest_word[0])

In [None]:
pd.Series({len(w):w for w in text1})[-1:]

In [None]:
# unique words with frequency of more than 2000 and their frequency
dist = FreqDist(text1).most_common(50)
result = sorted([(f, w) for w, f in dist if f > 2000 and w.isalpha()])

In [None]:
# Average number of tokens per sentence
sentences = nltk.sent_tokenize(' '.join(text1))
np.mean([len(nltk.word_tokenize(s)) for s in sentences])

---
## Advanced NLP Tasks with NLTK
---

### POS tagging

In [None]:
nltk.help.upenn_tagset('NN'), nltk.help.upenn_tagset('DT'), nltk.help.upenn_tagset('VB'), nltk.help.upenn_tagset('MD')

In [None]:
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

In [None]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

In [None]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

In [None]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

### POS tagging and parsing ambiguity

In [None]:
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

In [None]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

### Named Entities: People, places, organizations
 - Named entities are often the subject of sentiments so identifying them can be very useful

### Named entity detection - Part-of-speech tagging
 - tokenize sentences with the (English) sentence detector
 - tokenize the words in each sentence
 - tag each word with its part of speech
 - chunk them; `ne_chunk` identifies likely named-entity chunks (ne = named entity)
 - build the chunks using NLTK's guess of what each chunk represents (person, place, organization)

In [None]:
# Note: community_data is created further down in this notebook
# (see the PlaintextCorpusReader cell); run that cell first.
en = {}
try:
    # nltk.sent_tokenize uses the English Punkt sentence detector by default
    sentences = nltk.sent_tokenize(community_data.raw().strip())
    for sentence in sentences:
        tokenized = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized)
        chunked = nltk.ne_chunk(tagged)
        for tree in chunked:
            # Only subtrees (the named-entity chunks) have a label; plain tokens are tuples
            if hasattr(tree, 'label'):
                ne = ' '.join(c[0] for c in tree.leaves())
                en[ne] = [tree.label(), ' '.join(c[1] for c in tree.leaves())]
except Exception as e:
    print(str(e))

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(en)

In [None]:
# most frequent parts of speech in this text? What is their frequency?
df = pd.DataFrame(nltk.pos_tag(text1))
df.columns = ['word', 'pos']
df = df.groupby('pos')['pos'].count().sort_values(ascending=False)
list(zip(df.head(5).index, df.head(5)))

---
## Text Mining areas explained
---

## 1. Sentiment Analysis
Sentiment analysis is often a starting point in analyzing a text and is then coupled with other techniques (e.g., topic analysis). It is usually done with a lexicon of positive and negative words.
It identifies entities and emotions in a sentence and uses these to determine whether the entity is being viewed positively or negatively.

#### An easy sentiment analysis example
<li>I had an <b style="color:green">excellent</b> souffle at the restaurant Cavity Maker</li>
<li>Excellent is a positive word for both the souffle as well as for the restaurant</li>

#### Not so easy examples
Often, looking at words alone is not enough to figure out the sentiment:  
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for a ‘stuck at home’ snow day</i></li> This one is easy since it includes an explicit positive opinion using a positive word
<li><i>The Girl on the Train is an <span style="color:green">excellent</span> book for using as a liner for your cat’s litter box</i></li> Not so simple! The positive word "excellent" is used with a negative connotation. 
<li><i>The Girl on the Train is <span style="color:green">better</span> than Gone Girl</i></li> The positive word is used as a comparator. Whether the writer likes The Girl on the Train or not depends on what he or she thinks of Gone Girl

## Sources of sentiment coded words
<ol>
<li>Hu and Liu's sentiment analysis lexicon: words coded as either positive or negative</li>
<ul>
<li>http://ptrckprry.com/course/ssd/data/positive-words.txt</li>
<li>http://ptrckprry.com/course/ssd/data/negative-words.txt</li>
</ul>
<li>NRC Emotion Lexicon: words coded into emotional categories (many languages)</li>
<ul>
<li>http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm</li>
</ul>
<li>SentiWordNet: Lists of words weighted by positive or negative sentiment. Includes guidance on how to use the words</li>
<ul>
<li>http://sentiwordnet.isti.cnr.it/</li>
</ul>
<li>VADER sentiment tool: about 7,500 words and other lexical features with positive or negative polarity</li>
<ul>
<li>Included with python nltk</li>
</ul>
</ol>

## Simple sentiment analysis using Hu and Liu's sentiment analysis lexicon
Compute the proportion of positive and negative words in a text.

In [None]:
import nltk
# nltk.download() # download datasets (corpora)
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords, inaugural, PlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud, STOPWORDS

#### Get positive and negative words from Hu and Liu's sentiment analysis lexicon

In [None]:
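def get_words(url):
    '''Download a word list, skipping comment lines (containing ';') and blank lines'''
    text = requests.get(url).content.decode('latin-1')
    # splitlines + strip also copes with Windows line endings
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line and ';' not in line]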

In [None]:
# Get lists of positive and negative words
p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
positive_words = get_words(p_url)
negative_words = get_words(n_url)
positive_words[:5], negative_words[:5]

#### Read text

In [None]:
with open('data/community.txt','r') as f:
    community = f.read()
with open('data/le_monde.txt','r') as f:
    le_monde = f.read()

#### Unique words in positive and negative domains

In [None]:
def unique_share(text, lexicon):
    '''Percentage of the unique tokens in text that appear in lexicon'''
    unique_tokens = set(word_tokenize(text))
    return len(unique_tokens & set(lexicon)) * 100 / len(unique_tokens)

print('community:', unique_share(community, positive_words), unique_share(community, negative_words),
      '\nle monde:', unique_share(le_monde, positive_words), unique_share(le_monde, negative_words))

#### Relative frequency words in positive and negative domains

In [None]:
cpos, cneg, lpos, lneg = 0, 0, 0, 0

ttl_community = len(word_tokenize(community))
ttl_le_monde = len(word_tokenize(le_monde))

for word in word_tokenize(community):
    if word in positive_words:
        cpos += 100/ttl_community
    if word in negative_words:
        cneg += 100/ttl_community
        
for word in word_tokenize(le_monde):
    if word in positive_words:
        lpos += 100/ttl_le_monde
    if word in negative_words:
        lneg += 100/ttl_le_monde

print("text source: {}\t {}\t {}".format('positive','negative','difference'))    
print("community:   {0:1.2f}\t {1:1.2f}\t\t {2:1.2f}".format(cpos, cneg, (cpos-cneg)))
print("le monde:    {0:1.2f}\t {1:1.2f}\t\t {2:1.2f}".format(lpos, lneg, (lpos-lneg)))


## Simple sentiment analysis using associations with emotions
- 8 emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust
- NRC data codifies words with emotions
- 14,182 words are coded into 2 sentiments and 8 emotions


#### For example, the word abandoned is associated with anger, fear and sadness, and has a negative sentiment
- abandoned	anger	        1
- abandoned	anticipation	0
- abandoned	disgust	        0
- abandoned	fear	        1
- abandoned	joy	            0
- abandoned	negative	    1
- abandoned	positive	    0
- abandoned	sadness	        1
- abandoned	surprise	    0
- abandoned	trust	        0
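
The rows listed above map to a dictionary entry as sketched below. The rows are rebuilt in memory as tab-separated lines (the layout shown above), so the sketch runs without the lexicon file; `parse_nrc_lines` is a helper name introduced here for illustration.

```python
from collections import defaultdict

# The ten 'abandoned' rows from the list above: (word, emotion, flag)
rows = [
    ("abandoned", "anger", 1), ("abandoned", "anticipation", 0),
    ("abandoned", "disgust", 0), ("abandoned", "fear", 1),
    ("abandoned", "joy", 0), ("abandoned", "negative", 1),
    ("abandoned", "positive", 0), ("abandoned", "sadness", 1),
    ("abandoned", "surprise", 0), ("abandoned", "trust", 0),
]
lines = ["{}\t{}\t{}".format(*row) for row in rows]

def parse_nrc_lines(lines):
    """Collect, per word, the emotions and sentiments flagged with 1."""
    associations = defaultdict(list)
    for line in lines:
        word, emotion, flag = line.strip().split("\t")
        if flag == "1":
            associations[word].append(emotion)
    return dict(associations)

emotions = parse_nrc_lines(lines)
print(emotions)
```

Only the associations flagged with 1 are kept, so each word maps to a short list of emotions and sentiments.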

#### Build dictionary of the NRC sentiment data

In [None]:
!head -50 data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt

In [None]:
def get_nrc_data(nrc, skip_lines=46):
    '''Build {word: [associated emotions/sentiments]} from the NRC lexicon file'''
    emotion_dict = {}
    with open(nrc, 'r') as f:
        for line_number, line in enumerate(f):
            # skip the header (the first 46 lines in this version of the file)
            if line_number < skip_lines:
                continue
            fields = line.strip().split('\t')
            # ignore blank or malformed lines
            if len(fields) != 3:
                continue
            word, emotion, flag = fields
            # keep only the associations that are flagged with 1
            if int(flag) == 1:
                emotion_dict.setdefault(word, []).append(emotion)
    return emotion_dict

In [None]:
emotion_dict = get_nrc_data("data/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt")
emotion_dict['abandoned']

## Yelp without API
 - make sure the file "yelp_data.pickle" is in the `data` directory next to your notebook

In [None]:
import pickle
with open ('data/yelp_data.pickle', 'rb') as fp:
    all_snippets = pickle.load(fp)

In [None]:
# Sanity check
all_snippets

## Yelp with API
- https://www.yelp.com/developers/documentation/v3
- Log into Yelp (top right-hand corner of the page)
- Click <span style="color:blue">Create App</span> in the left-hand menu bar
- Enter the app info (leave the optional fields blank)
- Copy the client ID and API key to a secure place (a separate text file that you do not share, rather than this notebook)
- You might need to generate the API key by clicking on Manage App after creating an app. Make sure you get one!
- Save the credentials to a file (client ID on the first line, API key on the second) and restrict its permissions with `chmod 400`
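
The last step can also be done from Python. The sketch below writes a two-line credentials file (client ID first, API key second) and makes it readable by its owner only. The helper name `save_credentials` and the placeholder values are assumptions; the demo writes to a temporary directory so that no real file is touched, and the permission bits behave as described on POSIX systems.

```python
import os
import stat
import tempfile

def save_credentials(path, client_id, api_key):
    """Write client id and API key on separate lines, then make the file owner-read-only."""
    with open(path, 'w') as f:
        f.write('{}\n{}\n'.format(client_id, api_key))
    os.chmod(path, stat.S_IRUSR)  # same effect as `chmod 400` on POSIX systems

# Demonstration with placeholder values in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'yelp_credentials.txt')
save_credentials(demo_path, 'my-client-id', 'my-api-key')

# Read the file back the same way the notebook does
with open(demo_path) as f:
    client_id, api_key = f.read().split('\n')[:2]
mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(oct(mode))
```

Keeping the key in a separate, read-only file also makes it less likely to end up in a shared copy of the notebook.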

In [None]:
# !ls

### Read keys from credentials file

In [None]:
with open('yelp_credentials.txt') as f:
    contents = f.read().split('\n')
    CLIENT_ID = contents[0]
    API_KEY = contents[1]

In [None]:
# API constants, you shouldn't have to change these.
API_HOST = 'https://api.yelp.com'     # The API url header
BUSINESS_PATH = '/v3/businesses/'     # The path to get data for a single business
SEARCH_PATH = '/v3/businesses/search' # The path for an API request to find businesses

In [None]:
def get_response(business_id, search=False, search_data=None):
    # Construct the url: the search endpoint, or the reviews of a single business
    if search:
        url = API_HOST + SEARCH_PATH
    else:
        url = '{}{}{}/reviews'.format(API_HOST, BUSINESS_PATH, business_id)
    # Authorisation
    headers = {'Authorization': 'Bearer {}'.format(API_KEY)}
    # Send a single request (search_data is passed as the query string)
    return requests.get(url, headers=headers, params=search_data).json()

In [None]:
def get_business(business, location, number=15):
    # requests URL-encodes the parameter values itself, so the location is passed unchanged
    search_data = {'term': business, 'location': location, 'limit': number}
    return get_response(business, True, search_data)['businesses']

### Get restaurants

In [None]:
location = 'Amsterdam, Netherlands'
business = 'restaurant'
get_business(business, location, 2)

In [None]:
location = 'Amsterdam, Netherlands'
df = pd.DataFrame(get_business(business, location, 20))
df.head(10)

In [None]:
def get_business_review(business_id):
    response = get_response(business_id)
    return [review['text'] for review in response['reviews']]

In [None]:
get_business_review(df['alias'][4])

In [None]:
def get_reviews(business, location, number=25):
    businesses = get_business(business, location, number)
    review_list = []  # stays empty when no businesses are found
    for business in businesses or []:
        review_text = get_business_review(business['alias'])
        review_list.append((business['alias'], business['name'], review_text))
    return review_list

In [None]:
all_snippets = get_reviews(business, location)
all_snippets[:2]

### Analyze emotions in text

In [None]:
def emotion_analyser(text, emotion_dict=emotion_dict):
    '''Compute percentage of words associated with emotion'''
    # Set up the emotion count dict
    emotions = {x for y in emotion_dict.values() for x in y}
    emotion_count = {}
    for emotion in emotions:
        emotion_count[emotion] = 0

    # Count emotions and normalize by total number of words
    words = text.split()
    total_words = len(words)
    for word in words:
        if emotion_dict.get(word):
            for emotion in emotion_dict.get(word):
                emotion_count[emotion] += 1/total_words*100
                
    return emotion_count

### Analyse joint reviews for each restaurant

In [None]:
def sentiment_analyser(snippets, emotion_dict=emotion_dict):
    df_ = pd.DataFrame([emotion_analyser(' '.join(snippet[2]), emotion_dict) for snippet in snippets])
    df_['restaurant'] = [snippet[1] for snippet in snippets]
    df_.set_index('restaurant', inplace=True)
    pd.set_option('display.float_format', lambda x: '{:.1f}%'.format(x))
    return df_

In [None]:
sentiment_analyser(all_snippets)

### Combine address, reviews and sentiment

In [None]:
def analyse_nearby_restaurants(business, address, number=15):
    snippets = get_reviews(business, address, number)
    return sentiment_analyser(snippets)

In [None]:
analyse_nearby_restaurants('bar', 'Community Food and Juice, New York, NY', 15)

In [None]:
analyse_nearby_restaurants('pub', 'Amsterdam', 15)

## Word Clouds

In [None]:
# !pip install wordcloud

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
def word_cloud(snippets):
    'Combine all reviews to one corpus and generate wordcloud'
    text = ' '.join(str(snippet[2]) for snippet in snippets)
    return WordCloud(stopwords=STOPWORDS, background_color='white', width=3000, height=3000).generate(text)

In [None]:
plt.imshow(word_cloud(all_snippets))
plt.axis('off')
plt.show()

In [None]:
reviews_dict = {snippet[1]:' '.join(snippet[2]) for snippet in all_snippets}
reviews_dict.keys()

In [None]:
text = reviews_dict['Cannibale Royale']
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=3000, height=3000).generate(text)

In [None]:
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

#### Use the PlaintextCorpusReader to combine text files into a corpus

In [None]:
community = "data/community", "community.*"
le_monde = "data/le_monde", "le_monde.*"
heights = "data/heights", "heights.*"
amigos = "data/amigos", "amigos.*"

community_data = PlaintextCorpusReader(*community)
le_monde_data = PlaintextCorpusReader(*le_monde)
heights_data = PlaintextCorpusReader(*heights)
amigos_data = PlaintextCorpusReader(*amigos)

In [None]:
amigos_data.fileids()[:5]

In [None]:
amigos_data.raw()[:300]

#### Modified sentiment analyser
 - snippets are joined in a raw file
 - each tuple consists of (restaurant, review)

In [None]:
def sentiment_analyser_2(snippets, emotion_dict=emotion_dict):
    df_ = pd.DataFrame([emotion_analyser(snippet[1], emotion_dict) for snippet in snippets])
    df_['restaurant'] = [snippet[0] for snippet in snippets]
    df_.set_index('restaurant', inplace=True)
    pd.set_option('display.float_format', lambda x: '{:.1f}%'.format(x))
    return df_

In [None]:
emotion_analyser(community_data.raw())

In [None]:
restaurant_data = [('community', community_data.raw()),('le monde', le_monde_data.raw())
                  ,('heights', heights_data.raw()), ('amigos', amigos_data.raw())]

sentiment_analyser_2(restaurant_data)

## Text Complexity analysis
complexity factors:
 - average word length: longer words add to complexity
 - average sentence length: longer sentences are more complex (unless the text is rambling!)
 - vocabulary: the ratio of unique words used to the total number of words (more variety, more complexity)

<b>token:</b> A sequence (or group) of characters of interest. 
For example, in the analysis below, a token is a word.
 - A token is the base unit of analysis
 - convert text into tokens and nltk text object
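
The three complexity factors can be worked through on a tiny example without NLTK. This is only a sketch: `simple_complexity` is a hypothetical helper that splits sentences on `.`, `!` and `?` and treats runs of letters as words, which is much cruder than NLTK's tokenizers.

```python
import re

def simple_complexity(text):
    """Return (average word length, average sentence length, unique-to-total word ratio)."""
    # Crude sentence split: break on runs of . ! ? and drop empty pieces
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    # Crude word split: runs of letters and apostrophes
    words = re.findall(r"[A-Za-z']+", text)
    vocab = {w.lower() for w in words}
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    vocab_ratio = len(vocab) / len(words)
    return avg_word_len, avg_sent_len, vocab_ratio

sample = "The food was good. The service was slow."
avg_word_len, avg_sent_len, vocab_ratio = simple_complexity(sample)
print(avg_word_len, avg_sent_len, vocab_ratio)
```

The sample has 8 words in 2 sentences, of which 6 are distinct, so the numbers are easy to check by hand.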

In [None]:
text = le_monde_data.raw()
sentences = nltk.Text(sent_tokenize(text))
print(len(sentences))

In [None]:
words = nltk.Text(word_tokenize(text))
print(len(words))

In [None]:
def get_complexity(text):
    num_chars = len(text)
    num_words = len(word_tokenize(text))
    num_sentences = len(sent_tokenize(text))
    vocab = {x.lower() for x in word_tokenize(text)}
    return len(vocab),int(num_chars/num_words),int(num_words/num_sentences),len(vocab)/num_words

In [None]:
get_complexity(le_monde_data.raw())

In [None]:
df = pd.DataFrame()
for i, text in enumerate(restaurant_data):
    (vocab, word_size, sent_size, vocab_to_text) = get_complexity(text[1])
    df.loc[i, 'text'] = text[0]
    df.loc[i, 'vocab_size'] = vocab
    df.loc[i, 'word_size'] = word_size
    df.loc[i, 'sent_size'] = sent_size
    df.loc[i, 'unique_words'] = vocab_to_text
    pd.set_option('display.float_format', lambda x: '%.2f' % x)
df

## Word cloud comparison

In [None]:
texts = restaurant_data

# Remove unwanted words
DELETE_WORDS = []

def remove_words(text_string, DELETE_WORDS=DELETE_WORDS):
    for word in DELETE_WORDS:
        text_string = text_string.replace(word,' ')
    return text_string

# Remove short words
MIN_LENGTH = 4

def remove_short_words(text_string, min_length=MIN_LENGTH):
    # Keep only the words with at least min_length characters
    return ' '.join(word for word in text_string.split() if len(word) >= min_length)


# Setup axes/grid
ROW_NUM, COL_NUM = 2, 2
fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12, 12))

for i in range(0, len(texts)):
    text_string = remove_words(texts[i][1])
    text_string = remove_short_words(text_string)
    ax = axes[i//2, i%2] # axes is a 2-D array because ROW_NUM >= 2 and COL_NUM >= 2
    ax.set_title(texts[i][0])
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=1200, height=1000, max_words=20).generate(text_string)
    ax.imshow(wordcloud)
    ax.axis('off')
plt.show()

## NLTK ships with a large collection of pre-tokenized corpora
- use nltk.download() to download the corpora

    

## Comparative analysis
- compare US President inaugural speeches

In [None]:
inaugural.fileids()[-5:]

In [None]:
inaugural.raw('2017-Trump.txt')[:220]

<h4>Let's look at the complexity of the speeches by four presidents</h4>

In [None]:
texts = [('trump', inaugural.raw('2017-Trump.txt')),
         ('obama', inaugural.raw('2009-Obama.txt')+inaugural.raw('2013-Obama.txt')),
         ('jackson', inaugural.raw('1829-Jackson.txt')+inaugural.raw('1833-Jackson.txt')),
         ('washington', inaugural.raw('1789-Washington.txt')+inaugural.raw('1793-Washington.txt'))]

In [None]:
pd.DataFrame([get_complexity(text[1]) for text in texts], 
             index=['Trump', 'Obama', 'Jackson', 'Washington'],
             columns=['vocab', 'word_size', 'sent_size', 'vocab_to_text'])

### Sentence length over time


In [None]:
year, sentence_lengths = [], []
for fileid in inaugural.fileids():
    year.append(fileid.split('-')[0])
    sentence_lengths.append(get_complexity(' '.join(inaugural.words(fileid)))[2])
plt.figure(figsize=(8,12))
plt.plot(sentence_lengths, year);

### Dispersion plots 
 - show the relative frequency and location of words over the text

In [None]:
from nltk.book import *

In [None]:
plt.figure(figsize=(12,8))
text4.dispersion_plot(['government', 'citizen', 'freedom', 'duties', 'America', 'independence', 'God', 'patriotism'])

### Word stemming
 - example: patriot, patriotic, patriotism all express roughly the same idea
 - nltk has a stemmer that implements the "Porter Stemming Algorithm" (https://tartarus.org/martin/PorterStemmer/)

In [None]:
def strip_text(raw):
    striptext = raw.replace('\n\n', ' ')
    return striptext.replace('\n', ' ')

In [None]:
striptext = strip_text(inaugural.raw())

In [None]:
sentences = sent_tokenize(striptext)
words = word_tokenize(striptext)

In [None]:
p_stemmer = PorterStemmer()
text = nltk.Text([p_stemmer.stem(i).lower() for i in words])

In [None]:
from collections import Counter
word_counter = Counter(text)
# word_counter.most_common(20)

In [None]:
plt.figure(figsize=(12,8))
text.dispersion_plot(['govern', 'citizen', 'free', 'america', 'independ', 'god', 'patriot'])

### Weighted word analysis using Vader
- VADER contains a lexicon of about 7,500 features, each weighted by how positive or negative it is
- It uses these features to calculate how positive, negative and neutral a passage is
- And combines these results to give a compound sentiment (higher = more positive) for the passage
- Rated by humans and tuned on Twitter data; generally considered good for informal communication
- 10 human raters scored each feature from -4 (extremely negative) to +4 (extremely positive)
- Adjusts the sentiment of a sentence using rules about word order and context (e.g., negation and intensifiers)
- "marginally good" will get a lower positive score than "extremely good"
- Computes a "compound" score based on heuristics (between -1 and +1)
- Includes sentiment of emoticons, punctuation, and other 'social media' lexicon elements
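
As a rough sketch of how a "compound" score can stay between -1 and +1, the snippet below sums word valences and squashes the total with `x / sqrt(x^2 + alpha)`, using `alpha=15` as in the VADER reference implementation. The valences in `toy_valences` are hypothetical numbers, not values read from the lexicon, and the sketch leaves out VADER's rules for negation, intensifiers, capitalisation and punctuation.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence into the interval [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm_score))

# Hypothetical valences on the -4..+4 rating scale (illustrative only)
toy_valences = {"good": 2.0, "great": 3.0, "terrible": -2.5}

def toy_compound(words):
    """Sum the valences of the known words and normalise the total."""
    total = sum(toy_valences.get(word, 0.0) for word in words)
    return normalize(total)
```

With this normalisation, each extra sentiment-bearing word moves the score further from zero, but it never leaves the interval.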

In [None]:
# !pip install vaderSentiment

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
def vader_comparison(texts):
    '''Average the VADER scores over the sentences of each (name, text) pair in texts'''
    import numpy as np
    analyzer = SentimentIntensityAnalyzer()
    
    # Use the argument that was passed in, not the global restaurant_data
    restos = [resto[0] for resto in texts]
    df_ = pd.DataFrame(np.zeros(4*len(restos)).reshape(-1, 4), 
                       index=restos,
                       columns=['positive', 'negative', 'neutral', 'compound'])
    
    for resto in texts:
        name = resto[0]
        sentences = sent_tokenize(resto[1])
        n = len(sentences)
        for sentence in sentences:
            vs = analyzer.polarity_scores(sentence)
            df_.loc[name, 'positive'] += vs['pos']/n
            df_.loc[name, 'negative'] += vs['neg']/n
            df_.loc[name, 'neutral'] += vs['neu']/n
            df_.loc[name, 'compound'] += vs['compound']/n
    return df_

In [None]:
vader_comparison(restaurant_data)

### Get text sentiment when 'service' is mentioned in sentences.

In [None]:
meaningful_sents = [(i, sentence) for i, sentence in enumerate(sentences) if 'service' in sentence]
vader_comparison(meaningful_sents)       


### Affect calculator for common terms in our domain (e.g., food items)

In [None]:
def get_affect(text, keyword, lower=False):
    '''Mean VADER compound score of the sentences in text that mention keyword'''
    analyzer = SentimentIntensityAnalyzer()
    # sent_tokenize uses the same English Punkt model and is already used elsewhere in this notebook
    sentences = sent_tokenize(text.strip())
    sentence_count, running_total = 0, 0
    
    for sentence in sentences:
        if lower:
            sentence = sentence.lower()
        if keyword in sentence:
            vs = analyzer.polarity_scores(sentence)
            running_total += vs['compound']
            sentence_count += 1
    # No sentence mentions the keyword: report a neutral score
    if sentence_count == 0:
        return 0
    return running_total/sentence_count

In [None]:
get_affect(community_data.raw(), 'service', True)

### Concordance returns text fragments around a word

In [None]:
nltk.Text(community_data.words()).concordance('service', 100)

## Text summarisation
 - generate a short summary of a large piece of text automatically
 - use these summaries as input into a topic analyzer

#### A naive form of summarization 
- identify the most frequent words in a piece of text
- use the occurrence of these words in sentences to rate the importance of a sentence

In [None]:
summary_sentences, candidate_sentences, candidate_sentence_counts = [], {}, {}
striptext = strip_text(community_data.raw())

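#### Remove stopwords, punctuation and numbers

In [None]:
words = word_tokenize(striptext)
# Build the English stopword set once; comparing lower-cased tokens also drops capitalised stopwords such as "The"
stop_words = set(stopwords.words('english'))
lowercase_words = [word.lower() for word in words
                  if word.lower() not in stop_words and word.isalpha()]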

#### Get word frequencies

In [None]:
word_frequencies = Counter(lowercase_words)
most_frequent_words = word_frequencies.most_common(20)

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(most_frequent_words)

#### Lowercase the sentences
`candidate_sentences` is a dictionary with the original sentence as the key and its lowercase version as the value.

In [None]:
candidate_sentences = {sentence: sentence.lower() for sentence in sent_tokenize(striptext)}

#### Summarise word frequency scores per sentence

In [None]:
for long, short in candidate_sentences.items():
    count = 0
    for freq_word, frequency_score in most_frequent_words:
        if freq_word in short:
            count += frequency_score
            candidate_sentence_counts[long] = count

In [None]:
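# Keep the four highest-scoring sentences: sort on the score (x[1]), not on the sentence text (x[0])
sorted_sentences = OrderedDict(sorted(
                    candidate_sentence_counts.items(),
                    key = lambda x: x[1],
                    reverse = True)[:4])
pp.pprint(sorted_sentences)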

<h4>Packaging all this into a function</h4>


In [None]:
def build_naive_summary(text):

    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')

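    # Build the English stopword set once instead of rebuilding the list for every token
    stop_words = set(stopwords.words('english'))
    lowercase_words = [word.lower() for word in word_tokenize(striptext)
                      if word.lower() not in stop_words and word.isalpha()]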
    
    candidate_sentences = {sentence: sentence.lower() for sentence in sent_tokenize(striptext)}
    candidate_sentence_counts = {}   
    most_frequent_words = FreqDist(lowercase_words).most_common(20)
    
    for long, short in candidate_sentences.items():
        count = 0
        for freq_word, frequency_score in most_frequent_words:
            if freq_word in short:
                count += frequency_score
                candidate_sentence_counts[long] = count   
                
    sorted_sentences = OrderedDict(sorted(
                        candidate_sentence_counts.items(),
                        key = lambda x: x[1],
                        reverse = True)[:4])
    return sorted_sentences   
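
One caveat of the naive approach is that `freq_word in short` is a substring test, so a frequent word such as "art" also matches inside "party". The dependency-free sketch below shows a whole-word variant of the same scoring idea. The function `score_sentences`, its regex tokeniser and the toy sentences are assumptions for illustration; they do not replace `build_naive_summary`.

```python
import re
from collections import Counter

def score_sentences(sentences, stop_words=frozenset(), top_n=20):
    """Score each sentence by the summed frequency of the frequent words it contains."""
    tokenised = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for tokens in tokenised for w in tokens if w not in stop_words)
    frequent = dict(counts.most_common(top_n))
    # set(tokens): a frequent word contributes once per sentence, matched as a whole word
    return {s: sum(frequent.get(w, 0) for w in set(tokens))
            for s, tokens in zip(sentences, tokenised)}

toy_sentences = ["The art was fine.", "A party of five arrived.", "Art and more art."]
toy_scores = score_sentences(toy_sentences, stop_words={"the", "was", "a", "of", "and"})
```

Here "party" no longer picks up the count of "art", which the substring test would have added.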

In [None]:
summary = '\n'.join(build_naive_summary(community_data.raw()))
print(summary)

In [None]:
summary = '\n'.join(build_naive_summary(le_monde_data.raw()))
print(summary)

### We can summarize George Washington's first inaugural speech

In [None]:
build_naive_summary(inaugural.raw('1789-Washington.txt'))

## gensim: another text summarizer
- Gensim's summarizer (a variant of TextRank) uses a network with sentences as nodes and 'lexical similarity' as weights on the arcs between nodes
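
The graph idea can be sketched without gensim. The code below is a simplified illustration, not gensim's implementation: the similarity measure (shared words divided by the log sentence lengths), the damping factor and the fixed number of iterations are all assumptions, and the helper names are made up for this example.

```python
import math
import re

def overlap_similarity(tokens_a, tokens_b):
    """Shared-word count normalised by log sentence lengths (assumed similarity measure)."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0
    return len(tokens_a & tokens_b) / (math.log(len(tokens_a)) + math.log(len(tokens_b)))

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style iteration over the similarity graph."""
    token_sets = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    # weights[i][j] is the weight on the arc between sentence i and sentence j
    weights = [[overlap_similarity(token_sets[i], token_sets[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(weights[j][i] / totals[j] * scores[j]
                                  for j in range(n) if totals[j] > 0)
                  for i in range(n)]
    return scores

toy_sentences = ["The pizza was great and the service was great.",
                 "Great pizza and friendly service.",
                 "Parking nearby is expensive."]
toy_scores = rank_sentences(toy_sentences)
```

Sentences that share vocabulary with many others reinforce each other, while an isolated sentence (the one about parking) keeps only the baseline score. A summary is then the top-ranked sentences.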

In [None]:
# !pip install "gensim<4"  # the gensim.summarization module was removed in gensim 4.0

In [None]:
import gensim.summarization

In [None]:
# community_root = "data/community"
# le_monde_root = "data/le_monde"
# community_files = "community.*"
# le_monde_files = "le_monde.*"
# heights_root = "data/heights"
# heights_files = "heights.*"
# amigos_root = "data/amigos"
# amigos_files = "amigos.*"
# community_data = PlaintextCorpusReader(community_root,community_files)
# le_monde_data = PlaintextCorpusReader(le_monde_root,le_monde_files)
# heights_data = PlaintextCorpusReader(heights_root,heights_files)
# amigos_data = PlaintextCorpusReader(amigos_root,amigos_files)

In [None]:
type(community_data)

In [None]:
summary_sentences, candidate_sentences, candidate_sentence_counts = [], {}, {}
striptext = strip_text(community_data.raw())

In [None]:
print(gensim.summarization.keywords(striptext, words=10))

In [None]:
summary = gensim.summarization.summarize(striptext, word_count=100) 
print(summary)

In [None]:
summary = '\n'.join(build_naive_summary(community_data.raw()))
print(summary)

---
## 2. Topic modeling
---
The goal of topic modeling is to identify the major concepts underlying a piece of text.  
Topic modeling uses "Unsupervised Learning": no a priori knowledge is necessary, though it does help in cleaning up the results!

### LDA: Latent Dirichlet Allocation Model
 - identify potential topics using pruning techniques like 'upward closure'
 - compute conditional probabilities for topic word-sets
 - identify the most likely topics, over multiple passes probabilistically picking topics in each pass
 - intuitive explanation: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

In [None]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS

#### Prepare the text

In [None]:
striptext = strip_text(PlaintextCorpusReader("data/", "Nikon_coolpix_4300.txt").raw())
sentences = sent_tokenize(striptext)

#words = word_tokenize(striptext)
#tokenize each sentence into word tokens
texts = [[word for word in sentence.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for sentence in sentences]
len(texts)

<h4>Create a (word, frequency) dictionary for each word in the text</h4>

In [None]:
print(texts[:3])  # peek at the first few tokenised sentences

In [None]:
text

In [None]:
dictionary = corpora.Dictionary(texts)                # maps each distinct word to an integer id
corpus = [dictionary.doc2bow(text) for text in texts] # (word_id, frequency) pairs by sentence
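
What the dictionary and the bag-of-words corpus contain can be mimicked in plain Python. This is only an illustration of the data structures, not gensim's code: the helpers `build_token2id` and `to_bow` are hypothetical, and assigning ids in order of first appearance is an assumption (gensim may number the words differently).

```python
from collections import Counter

def build_token2id(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenised_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are ignored."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_docs = [["camera", "great", "camera"], ["battery", "great"]]
toy_token2id = build_token2id(toy_docs)
toy_bow = [to_bow(doc, toy_token2id) for doc in toy_docs]
```

Note that words missing from the dictionary are silently dropped, so a corpus must be converted with a dictionary built from the same kind of text.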

In [None]:
dictionary.token2id.items();

In [None]:
dictionary.keys();

In [None]:
corpus[5]

In [None]:
texts[5]

In [None]:
dictionary[73], dictionary[4]

### LDA analysis
parameters:  
 - Number of topics: the number of topics you want generated. The larger the document, the more topics are desirable
 - Passes: the number of passes the LDA model makes through the corpus. More passes mean a slower analysis

In [None]:
# Set parameters
num_topics = 5
passes = 10 

In [None]:
lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=passes)

<h4>See results</h4>

In [None]:
pp = pprint.PrettyPrinter(indent=2)
pp.pprint(lda.print_topics(num_words=3))

### Matching topics to documents
- sort topics by probability
- using sentences as documents here, so this is less than ideal

In [None]:
from operator import itemgetter
topics = lda.get_document_topics(corpus[0], minimum_probability=0.05, per_word_topics=False)
sorted(topics, key=itemgetter(1), reverse=True)

### Making sense of the topics
 - draw wordclouds

In [None]:
def draw_wordcloud(lda, n_topics, min_size=0, STOPWORDS=[]):
    topics = lda.show_topic(n_topics, topn=50)
    
    df_ = pd.DataFrame([(word, prob) for word, prob in topics 
                        if len(word) >= min_size if word not in STOPWORDS], 
                       columns=['word', 'prob'])
    
    multip = 100 * df_['prob'] / df_['prob'].sum()
    df_['multip'] =  multip.astype('int32')
    word_list = (df_['word'] + ' ') * df_['multip']
    text = ''.join(word_list)
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=3000, height=3000).generate(text)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show();

In [None]:
draw_wordcloud(lda, 2)

### Let's look at Presidential addresses to see what sorts of topics emerge from there
 - Each document will be analyzed for topics
 - The corpus will consist of 58 documents, one per presidential address

In [None]:
REMOVE_WORDS = {'shall','generally','spirit','country','people','nation','nations','great','better'}

# Create a corpus of documents: one list of tokens per presidential address
text_list = []
for fileid in inaugural.fileids():
    doc = []
    for word in inaugural.words(fileid):
        word = word.lower()  # STOPWORDS and REMOVE_WORDS are lower-case
        if word in STOPWORDS or word in REMOVE_WORDS or not word.isalpha() or len(word) < 5:
            continue
        doc.append(word)
    text_list.append(doc)

# Create a word dictionary (id, word) from the addresses themselves,
# not from the `sentences` of the previous example
dictionary = corpora.Dictionary(text_list)
by_address_corpus = [dictionary.doc2bow(text) for text in text_list]

<h2>Create the model</h2>

In [None]:
lda = LdaModel(by_address_corpus, id2word=dictionary, num_topics=20, passes=10)

In [None]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(lda.print_topics(num_words=10))

<h2>We can now compare presidential addresses by topic</h2>

In [None]:
len(by_address_corpus)

In [None]:
topics = lda.get_document_topics(by_address_corpus[0], minimum_probability=0, per_word_topics=False)
sorted(topics, key=itemgetter(1), reverse=True)

In [None]:
draw_wordcloud(lda, 18)

In [None]:
print(lda.show_topic(12, topn=5))
print(lda.show_topic(18, topn=5))

## Similarity
 - Given a corpus of documents, when a new document arrives, find the document that is the most similar
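
The core of such a search is a similarity measure between word-count vectors. As a hedged sketch, the code below ranks known documents by cosine similarity on raw word counts. It skips the LSI projection used in the cells that follow, and the helper names and toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(new_doc, known_docs):
    """Rank known documents by similarity to new_doc, most similar first."""
    query = new_doc.lower().split()
    sims = [(i, cosine_similarity(query, doc.lower().split()))
            for i, doc in enumerate(known_docs)]
    return sorted(sims, key=lambda pair: -pair[1])

toy_known = ["great french toast and coffee", "spicy tacos and strong margaritas"]
toy_ranking = most_similar("amazing french toast", toy_known)
```

The result has the same shape as the output of `get_doc_similarity` below: a list of `(document index, similarity)` pairs sorted from most to least similar.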

In [None]:
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities

In [None]:
doc1 = """
Many, many years ago, I used to frequent this place for their amazing french toast. 
It's been a while since then and I've been hesitant to review a place I haven't been to in 7-8 years... 
but I passed by French Roast and, feeling nostalgic, decided to go back.

It was a great decision.

Their Bloody Mary is fantastic and includes bacon (which was perfectly cooked!!), olives, 
cucumber, and celery. The Irish coffee is also excellent, even without the cream which is what I ordered.

Great food, great drinks, a great ambiance that is casual yet familiar like a tiny little French cafe. 
I highly recommend coming here, and will be back whenever I'm in the area next.

Juan, the bartender, is great!! One of the best in any brunch spot in the city, by far.
"""

In [None]:
doc2 = """
I went to Mexican Festival Restaurant for Cinco De Mayo because I had been there years 
prior and had such a good experience. This time wasn't so good. The food was just 
mediocre and it wasn't hot when it was brought to our table. They brought my friends food out 
10 minutes before everyone else and it took forever to get drinks. We let it slide because the place was 
packed with people and it was Cinco De Mayo. Also, the margaritas we had were slamming! Pure tequila. 

But then things took a turn for the worst. As I went to get something out of my purse which was on 
the back of my chair, I looked down and saw a huge water bug. I had to warn the lady next to me because 
it was so close to her chair. We called the waitress over and someone came with a broom and a dustpan and 
swept it away like it was an everyday experience. No one seemed phased.

Even though our waitress was very nice, I do not think we will be returning to Mexican Festival again. 
It seems the restaurant is a shadow of its former self.
"""

In [None]:
all_text = [community_data.raw()] + [le_monde_data.raw()] + [amigos_data.raw()] + [heights_data.raw()]
doc_list = [community_data, le_monde_data, amigos_data, heights_data]
documents = [doc.raw() for doc in doc_list]
assert all_text == documents

texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]

In [None]:
def get_lsi(texts):
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return dictionary, corpus, models.LsiModel(corpus, id2word=dictionary, num_topics=2)

In [None]:
def get_doc_similarity(doc, dictionary, corpus, lsi):
    '''Match new doc against known docs to get similarity'''
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    index = similarities.MatrixSimilarity(lsi[corpus])
    sims = index[vec_lsi]
    return sorted(enumerate(sims), key=lambda x: -x[1])

In [None]:
dictionary, corpus, lsi = get_lsi(texts)
get_doc_similarity(doc1, dictionary, corpus, lsi)

In [None]:
get_doc_similarity(doc2, dictionary, corpus, lsi)

---
## Sentiment analysis using logistic regression
---

The **`sentiment`** data set consists of 3000 sentences which come from reviews on `imdb.com`, `amazon.com`, and `yelp.com`. Each sentence is labeled according to whether it comes from a positive review or negative review.

We will use <font color="magenta">logistic regression</font> to learn a classifier from this data.

Before starting on this notebook, download the data from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences. The folder `sentiment_labelled_sentences` (containing the data file `full_set.txt`) should be placed in the `data` directory next to the notebook, which is where the code below looks for it.

## 1. Set up notebook, load and preprocess data

First, some standard includes.

In [None]:
%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

Now, we load in the data. Make sure the folder `sentiment_labelled_sentences` is inside the `data` directory next to the notebook, and that the folder contains `full_set.txt`.

The data set consists of 3000 sentences, each labeled '1' (if it came from a positive review) or '0' (if it came from a negative review). To be consistent with our notation from lecture, we will change the negative review label to '-1'.

In [None]:
# Read in the data set.
with open("data/sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()

In [None]:
# Remove leading and trailing white space
content = [x.strip() for x in content]

# Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

In [None]:
# Transform the labels from {0, 1} to {-1, +1}
y = np.array(labels, dtype='int8')
y = 2*y - 1

### Preprocessing the text data

To transform this prediction problem into one amenable to linear classification, we will first need to preprocess the text data. We will do four transformations:

1. Remove punctuation and numbers.
2. Transform all words to lower-case.
3. Remove _stop words_.
4. Convert the sentences into vectors, using a bag-of-words representation.

We begin with the first two steps.

In [None]:
## full_remove takes a string x and a list of characters removal_list 
## returns x with all the characters in removal_list replaced by ' '
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
digit_less = [full_remove(x, digits) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]

#### Remove unwanted, short and stop words

In [None]:
from nltk.corpus import stopwords

In [None]:
corpus = ' '.join(sents_lower)
dictionary = set(corpus.split())

# Use predefined stop words set
stop_words = set(stopwords.words('english'))

# Define our own unwanted words set
unwanted_words = set(['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from'])

# Get short words
MIN_LENGTH = 3
short_words = set([word for word in dictionary if len(word) < MIN_LENGTH])

# Define set of words to clear from text/sentences
clear_set = stop_words | unwanted_words | short_words

# Clear text from unwanted words
sents_split = [x.split() for x in sents_lower]
sents_processed = [' '.join(list(filter(lambda word: word not in clear_set, sent_words))) for sent_words in sents_split]

What do the sentences look like so far?

In [None]:
sents_processed[0:10]

### Bag of words

In order to use linear classifiers on our data set, we need to transform our textual data into numeric data. 
The classical way to do this is known as the _bag of words_ representation. 

_bag of words_ representation: 
 - each word is thought of as corresponding to a number in `{1, 2, ..., V}` where `V` is the size of our vocabulary. 
 - each sentence is represented as a V-dimensional vector $x$, where $x_i$ is the number of times that word $i$ occurs in the sentence.

We use the `CountVectorizer` class in `scikit-learn` for the transformation.

We will cap the number of features at 4500, meaning:
 - a word will make it into our vocabulary only if it is one of the 4500 most common words in the corpus
 - this can weed out spelling mistakes and words which occur too infrequently to be useful

We will also append a '1' to the end of each vector to allow our linear classifier to learn a bias term.
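
A small worked example may help to see why appending a '1' lets a linear classifier learn a bias: the bias becomes the weight of the constant feature. The vocabulary, weights and bias below are hypothetical values chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical 3-word vocabulary with one weight per word, plus a bias
vocabulary = ["bad", "good", "movie"]
w = [-1.5, 2.0, 0.1]
b = -0.25

def to_counts(sentence):
    """Bag-of-words vector: how often each vocabulary word occurs in the sentence."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

x = to_counts("good good movie")
score_with_bias = dot(w, x) + b           # explicit bias term
score_augmented = dot(w + [b], x + [1])   # bias folded in as the weight of the constant 1
```

Both scores are the same number, so a model that only learns weights can still represent `w.x + b` once every vector ends in 1.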

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Transform to bag of words representation.
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 4500)
data_features = vectorizer.fit_transform(sents_processed)

# Append '1' to the end of each vector.
p_mat = data_features.toarray()
data_mat = np.ones((p_mat.shape[0], p_mat.shape[1]+1))
data_mat[:,:-1] = p_mat

### Training / test split

Finally, we split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

In [None]:
## Split the data into testing and training sets

# Random stratified selection of the indices 
np.random.seed(0)
test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), 
                      np.random.choice((np.where(y==1))[0], 250, replace=False))
all_inds = range(len(labels))
train_inds = list(set(all_inds) - set(test_inds))

train_data = data_mat[train_inds,]
train_labels = y[train_inds]

test_data = data_mat[test_inds,]
test_labels = y[test_inds]

print("train data: ", train_data.shape)
print("test data: ", test_data.shape)

## 2. Fitting a logistic regression model to the training data

We use a logistic regression classifier, trained with stochastic gradient descent (SGD), from `scikit-learn`.  
Randomness in the SGD procedure yields slightly different solutions (and thus different error values).
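
To show what SGD does for logistic regression with labels in {-1, +1}, here is a minimal pure-Python sketch. It is not how `SGDClassifier` is implemented: there is no regularisation, the learning rate is constant, and the function names, toy data and hyperparameters are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights by SGD on the logistic loss log(1 + exp(-y * w.x))."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a random order each epoch
        for i in order:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # gradient of log(1 + exp(-margin)) w.r.t. w is -y * x * sigmoid(-margin)
            scale = -y[i] * sigmoid(-margin)
            w = [wj - learning_rate * scale * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1

# Toy data: columns are count("good"), count("bad") and a constant 1 for the bias
toy_X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
toy_y = [1, 1, -1, -1]
toy_w = sgd_logistic(toy_X, toy_y)
```

Changing `seed` changes the order in which examples are visited, which is the source of the run-to-run variation mentioned above.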

In [None]:
from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
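# The logistic loss is called 'log_loss' from scikit-learn 1.1 on; use loss='log' on older versions
clf = SGDClassifier(loss='log_loss', penalty='l2', max_iter=200)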
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print('Training error: {}'.format(float(errs_train)/len(train_labels)))
print('Test error: {}'.format(float(errs_test)/len(test_labels)))

## 3. Analyzing the margin

The logistic regression model produces not just classifications but also conditional probability estimates. 
We will say that `x` has **margin** `gamma` if (according to the logistic regression model)  
`Pr(y=1|x) > (1/2)+gamma`  
or  
`Pr(y=1|x) < (1/2)-gamma`  

The following function **margin_dist** computes the fraction of points in the test set that have margin at least `gamma`.

Input:
1. the classifier (`clf`, computed earlier)
2. the test set (`test_data`)
3. a value of `gamma`   

In [None]:
# Return fraction of test points for which Pr(y=1) lies in [0, 0.5 - gamma) or (0.5 + gamma, 1]
def margin_dist(clf, test_data, gamma):
    # Compute probability on each test point
    proba = clf.predict_proba(test_data)[:, 1]
    # Find data points for which prediction is at least gamma away from 0.5
    margin_inds = np.where((proba > (0.5+gamma)) | (proba < (0.5-gamma)))[0]
    
    return float(len(margin_inds))/len(test_data)
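
As a small worked example of the margin definition, the sketch below applies the same test to four hypothetical probabilities in plain Python. The function `margin_fraction` and the numbers are assumptions for illustration; they are not model output.

```python
def margin_fraction(probabilities, gamma):
    """Fraction of points whose Pr(y=1|x) is more than gamma away from 0.5."""
    confident = [p for p in probabilities if p > 0.5 + gamma or p < 0.5 - gamma]
    return len(confident) / len(probabilities)

# Hypothetical values of Pr(y=1|x) for four test points
toy_probabilities = [0.95, 0.60, 0.45, 0.10]
half = margin_fraction(toy_probabilities, 0.2)   # only 0.95 and 0.10 clear the margin
```

With `gamma = 0.2` the thresholds are 0.3 and 0.7, so the two hesitant predictions (0.60 and 0.45) drop out.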

We now visualize the test set's distribution of margin values.

In [None]:
gammas = np.arange(0, 0.5, 0.01)
f = np.vectorize(lambda g: margin_dist(clf, test_data, g))

In [None]:
plt.plot(gammas, f(gammas), linewidth=2, color='green')
plt.xlabel('Margin', fontsize=14)
plt.ylabel('Fraction of points above margin', fontsize=14)
plt.show()

Next, we investigate a natural question: <font color="magenta">Are points `x` with larger margin more likely to be classified correctly?</font>

To address this, we define a function **margin_errors** that computes the fraction of points with margin at least `gamma` that are misclassified.

In [None]:
# Return error of predictions that lie in intervals [0, 0.5 - gamma) and (0.5 + gamma, 1]
def margin_errors(clf, test_data, test_labels, gamma):
    # Compute probability on each test point
    preds = clf.predict_proba(test_data)[:, 1]
    
    # Find data points for which prediction is at least gamma away from 0.5
    margin_inds = np.where((preds > (0.5+gamma)) | (preds < (0.5-gamma)))[0]
    
    # Compute error on those data points; the rate is undefined when no point clears the margin
    if len(margin_inds) == 0:
        return np.nan
    num_errors = np.sum((preds[margin_inds] > 0.5) != (test_labels[margin_inds] > 0.0))
    return float(num_errors)/len(margin_inds)

We now visualize the relationship between margin and error rate.

In [None]:
## Create grid of gamma values
gammas = np.arange(0, 0.5, 0.01)

## Compute margin_errors on test data for each value of g
f = np.vectorize(lambda g: margin_errors(clf, test_data, test_labels, g))

## Plot the result
plt.plot(gammas, f(gammas), linewidth=2)
plt.ylabel('Error rate', fontsize=14)
plt.xlabel('Margin', fontsize=14)
plt.show()

## 4. Words with large influence

Finally, we attempt to partially **interpret** the logistic regression model.

Which words are most important in deciding whether a sentence is positive? As a first approximation to this, we simply take the words whose coefficients in `w` have the largest positive values.

Likewise, we look at the words whose coefficients in `w` have the most negative values, and we think of these as influential in negative predictions.

In [None]:
## Map each column index to its word (the vectorizer's vocabulary maps word -> index)
bow_series = pd.Series({v:k for k, v in vectorizer.vocabulary_.items()})

## Get indices of sorting w; the last entry of w belongs to the appended constant '1', not to a word
inds = np.argsort(w[:-1])

# First/last number of words
N = 10

## Words with the most negative and the most positive coefficients
print("Highly negative words:\n{}".format(bow_series[inds[:N]].values))
print("Highly positive words:\n{}".format(bow_series[inds[-N:]].values))

## 5. Unresolved, undecided

Suppose you are building a classifier, and can tolerate an error rate of at most some value `e`. Unfortunately, every classifier you try has a higher error than this. 

Therefore, you decide that the classifier is allowed to occasionally **abstain**: that is, to say *"don't know"*. When it actually makes a prediction, it must have error rate at most `e`. And subject to this constraint, it should abstain as infrequently as possible.
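
One possible way to approach this, sketched under assumptions: abstain whenever the margin is below `gamma`, and pick the smallest `gamma` on a grid for which the error rate on the remaining points is at most `e`. The function names, the grid step and the toy probabilities below are hypothetical. In practice `gamma` would be chosen on held-out validation data rather than on the test set.

```python
def error_when_confident(probabilities, labels, gamma):
    """Error rate over points with margin above gamma, or None if there are none."""
    kept = [(p, y) for p, y in zip(probabilities, labels)
            if p > 0.5 + gamma or p < 0.5 - gamma]
    if not kept:
        return None
    errors = sum((p > 0.5) != (y > 0) for p, y in kept)
    return errors / len(kept)

def smallest_margin_for(probabilities, labels, max_error, step=0.01):
    """Smallest gamma on the grid whose confident predictions have error <= max_error."""
    for k in range(int(round(0.5 / step))):
        gamma = k * step
        err = error_when_confident(probabilities, labels, gamma)
        if err is not None and err <= max_error:
            return gamma
    return None  # no threshold on the grid meets the target

# Hypothetical Pr(y=1|x) values with labels in {-1, +1}
toy_probs = [0.9, 0.8, 0.615, 0.555, 0.385, 0.2, 0.1]
toy_labels = [1, 1, -1, 1, 1, -1, -1]
toy_gamma = smallest_margin_for(toy_probs, toy_labels, max_error=0.1)
```

Because the set of points that receive a prediction only shrinks as `gamma` grows, the first `gamma` that meets the error target is also the one that abstains least often.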