In [9]:
import numpy as np
import re
from os import listdir
from os.path import isfile, join
from collections import Counter,defaultdict
import random


filepath_pos = "review_polarity/txt_sentoken/pos/"
filepath_neg = "review_polarity/txt_sentoken/neg/"

filenames_pos = sorted([filepath_pos+f for f in listdir(filepath_pos) if isfile(join(filepath_pos,f))])
filenames_neg = sorted([filepath_neg+f for f in listdir(filepath_neg) if isfile(join(filepath_neg,f))])

train_filenames_pos = filenames_pos[:800]
train_filenames_neg = filenames_neg[:800]

test_filenames_pos = filenames_pos[800:]
test_filenames_neg = filenames_neg[800:]
train_filenames = train_filenames_pos + train_filenames_neg
test_filenames = test_filenames_pos + test_filenames_neg

classes = [1,-1]


def get_data(filenames_pos, filenames_neg):
    """Return a list of (label, Counter of word counts) pairs, one per file."""
    data = []
    # Positive reviews are labelled 1, negative reviews -1.
    for label, filenames in ((1, filenames_pos), (-1, filenames_neg)):
        for filename in filenames:
            word_list = []
            with open(filename) as fh:
                for line in fh:
                    # Replace everything except word characters and apostrophes with spaces.
                    word_list.extend(re.sub(r"[^\w']", " ", line).split())
            data.append((label, Counter(word_list)))
    return data


<h3>Text classification</h3>

A very common problem in NLP:
<center>
<p style="border:3px; width: 500px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 1em;">
<i>Given a piece of text, assign a label from a predefined set</i>
</p>
</center>

<b>What could the labels be?</b>

<ul>
<li>positive vs negative (e.g. sentiment in reviews)</li>
<li>about world politics or not</li>
<li>author name (author identification)</li>
<li>pass or fail in essay grading</li>
</ul>

### In this section

We will see how to:
- represent text with numbers
- learn a classifier using the perceptron rule

... and you will ask me questions!

The maths we need is addition, subtraction, multiplication and division! 

In [38]:
with open("review_polarity/txt_sentoken/pos/cv750_10180.txt") as fh:
    example_text = fh.read()

### Sentiment analysis on film reviews

<img src="images/imdb.jpg" style="width:100%;">

### Representing text with numbers

In [14]:
print(example_text)

what's shocking about " carlito's way " is how good it is . 
having gotten a bit of a bad rap for not being a big box office hit like pacino's previous film , " scent of a woman , " and not having as strong a performance as he did in that one ( he had just won an oscar ) , " carlito's way " was destined for underrated heaven . 
that's what it is : an underrated gem of a movie . 
and what a shame because pacino and de palma both do amazing jobs with it , and turn it into a great piece of a pulpy character study . 
 " carlito's way " deals with , well , carlito brigante ( pacino ) , a puerto rican ex-drug kingpin , who gets out of a long jailterm when his coke-addicted , curly-haired lawyer ( sean penn ) points out a legal technicality . 
of course , carlito was actually awoken in prison , and has decided to go straight , even if he's really a crook at heart . 
carlito , like barry lyndon , is a man who is trapped by fate at every turn , and can't escape into something he is not . 
carli

### Counting words

In [12]:
dictionary = Counter(re.sub(r"[^\w']", " ", example_text).split())
print(dictionary)

Counter({'a': 47, 'and': 37, 'the': 31, 'is': 26, 'of': 24, 'to': 20, 'in': 18, 'it': 13, 'his': 12, 'he': 11, 'great': 10, 'at': 9, 'carlito': 9, 'film': 8, 'by': 8, 'which': 8, 'for': 8, 'as': 8, 'but': 8, 'some': 7, "he's": 7, 'out': 7, 'pacino': 7, "carlito's": 7, 'when': 6, 'him': 6, 'has': 6, "it's": 6, 'was': 6, 'i': 6, 'with': 6, 'way': 6, 'not': 5, 'palma': 5, 'well': 5, 'like': 5, 'de': 5, 'big': 4, 'what': 4, 'all': 4, 'woman': 4, 'into': 4, 'amazing': 4, 'that': 4, 'from': 4, 'her': 4, 'get': 4, 'are': 4, 'scent': 4, 'good': 4, 'scene': 4, 'this': 4, 'we': 4, 'on': 4, 'underrated': 3, 'who': 3, 'if': 3, 'performance': 3, 'you': 3, 'every': 3, 'actually': 3, 'emotional': 3, 'down': 3, 'where': 3, "what's": 3, 'best': 3, 'about': 3, 'even': 3, 'scenes': 3, 'or': 3, 'prison': 3, 'miller': 3, 'more': 3, 'doing': 3, 'me': 3, 'strong': 3, 'gets': 3, 'an': 3, 'turn': 3, 'since': 2, 'lot': 2, 'sean': 2, 'done': 2, 'luis': 2, 'said': 2, 'being': 2, 'did': 2, 'still': 2, 'never': 2, 

### Bag of words representation

- The higher the counts for a word, the more important it is for the document
- No document contains every word; most words have a count of 0 (implicitly)

Anything missing?

- which words to keep?
- how to value their presence/absence?
- word order is ignored, could we add bigrams?

Choice of representation (features) matters a lot!
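
The bigram idea above can be sketched as follows. This is a minimal illustration, assuming the same tokenisation as the word-counting cell; the helper name `bag_of_ngrams` is hypothetical and not part of the original notebook.

```python
import re
from collections import Counter


def bag_of_ngrams(text):
    """Count unigrams and bigrams in a text (hypothetical helper)."""
    # Same tokenisation as the word-counting cell:
    # keep word characters and apostrophes, split on everything else.
    tokens = re.sub(r"[^\w']", " ", text).split()
    features = Counter(tokens)
    # Adjacent token pairs capture a little word order, e.g. "not good".
    features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


features = bag_of_ngrams("not good , not good at all")
print(features["not good"])  # count of the bigram
print(features["good"])      # count of the unigram
```

Because punctuation is stripped before pairing, some bigrams span a comma or full stop; whether that is desirable is another representation choice.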

### Our first classifier

Now we have represented a text as counts over words/features.

We need a model to decide whether the review is positive or negative.

If each word $n$ has count $x_n$ in the review and is associated with a weight $w_n$, then:

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

In [26]:
bag_of_words=Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'film': 8, 'but': 8, 'some': 7, 'pacino': 7, "carlito's": 7, 'palma': 5, 'well': 5, 'like': 5,  'woman': 4, 'amazing': 4}) 

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

In [18]:
print(bag_of_words)

Counter({'and': 37, 'is': 26, 'he': 11, 'great': 10, 'carlito': 9, 'but': 8, 'film': 8, 'some': 7, "carlito's": 7, 'pacino': 7, 'like': 5, 'palma': 5, 'well': 5, 'amazing': 4, 'woman': 4})


In [34]:
weights = {'and': 0.0, 'is': 0.0, 'he': 0.0, 'great': 0.0,
           'carlito': 0.0, 'but': 0.0, 'film': 0.0, 'some': 0.0,
           "carlito's": 0.0, 'pacino': 0.0, 'like': 0.0,
           'palma': 0.0, 'well': 0.0, 'amazing': 0.0, 'woman': 0.0}

In [37]:
score = 0.0
for word, counts in bag_of_words.items():
    score += counts * weights[word]
print(score)
print("positive" if score >= 0.0 else "negative")

0.0
positive
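
The scoring loop above can be wrapped in a reusable function. This is a sketch only: the name `predict` and the hand-picked `toy_weights` are illustrative assumptions, not part of the original notebook. Words without a weight are treated as having weight 0, so they do not change the score.

```python
from collections import Counter


def predict(weights, features):
    """Return 1 if w . x >= 0, else -1 (hypothetical helper)."""
    # Words without a weight contribute 0 to the score.
    score = sum(count * weights.get(word, 0.0) for word, count in features.items())
    return 1 if score >= 0.0 else -1


# Illustrative, hand-picked weights (not learned from data).
toy_weights = {"great": 1.0, "amazing": 1.0, "bad": -1.0}
toy_review = Counter({"great": 10, "amazing": 4, "film": 8})
print(predict(toy_weights, toy_review))
```

As in the cell above, a score of exactly 0 is treated as positive.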


<h3>Another view</h3>
<a href="https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/"><img src="images/perceptron.png" style="width:600px; background:none; border:none; box-shadow:none;" /></a>
<p class="fragment">
How to learn the weights $\mathbf{w}$?
</p>

<h3>The perceptron</h3>

<p><img style="float: left; width:40%" src="images/colorfulperceptron.jpg"/><img src="images/Rosenblatt-CAL1958.jpg" style="width:35%; float: right;"/>
</p>

<p>Proposed by Rosenblatt in 1958 and still in use by researchers</p>

### Supervised learning

Given training documents with the correct labels

$$D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}$$

Find the weights $\mathbf{w}$ for the linear classifier

$$\hat y = sign(\sum_{n=1}^N w_nx_n) = sign(\mathbf{w} \cdot \mathbf{x})$$

so that we can predict the labels of **unseen** documents


### Supervised learning


<img src="images/supervisedMLbyRaschka.jpg" style="width:100%;">

<h3>Learning with the perceptron</h3>
<p style="font-size: 100%; border:3px; width: 90%; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em;">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,y^1)...(\mathbf{x}^M,y^M)\}\\
& set\; \mathbf{w} = \mathbf{0} \\
& \mathbf{for} \; (\mathbf{x},y) \in D_{train} \; \mathbf{do}\\
& \quad predict  \; \hat y = sign(\mathbf{w}\cdot \phi(\mathbf{x}))\\
& \quad \mathbf{if} \; \hat y \neq y \; \mathbf{then}\\
& \quad \quad \mathbf{if} \; \hat y\; \mathbf{is}\; 1 \; \mathbf{then}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} - \phi(\mathbf{x})\\
& \quad \quad \mathbf{else}\\
& \quad \quad \quad update \; \mathbf{w} = \mathbf{w} + \phi(\mathbf{x})\\
& \mathbf{return} \; \mathbf{w}
\end{align}
</p>

<ul class="fragment">
<li>error-driven, online learning</li>
<li>$\mathbf{x}$ is the document; $\phi(\mathbf{x})$ is its feature representation (bag of words, bigrams, etc.)</li>
</ul>
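
The pseudocode above can be sketched in Python as follows. This is a minimal, single-pass version, assuming the data is a list of `(label, Counter)` pairs as returned by `get_data`; the function name `train_perceptron` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_perceptron(train_data):
    """One pass of the perceptron rule over (label, Counter) pairs."""
    weights = defaultdict(float)  # set w = 0
    for label, features in train_data:
        # predict y_hat = sign(w . phi(x)), treating a score of 0 as positive
        score = sum(count * weights[word] for word, count in features.items())
        predicted = 1 if score >= 0.0 else -1
        if predicted != label:
            # w = w - phi(x) when we wrongly predicted 1,
            # w = w + phi(x) when we wrongly predicted -1;
            # both cases are w = w + y * phi(x).
            for word, count in features.items():
                weights[word] += label * count
    return weights


# Made-up training data, for illustration only.
toy_data = [
    (1, Counter({"great": 2, "film": 1})),
    (-1, Counter({"boring": 2, "film": 1})),
]
weights = train_perceptron(toy_data)
print(dict(weights))
```

The weights only change on mistakes, which is what makes the rule error-driven; in practice one would usually make several passes over shuffled training data.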

### A little test

Given the following tweets labeled with sentiment:

| Label        | Tweet | 
| -------------|--------|
| negative     | Very sad about Iran. |
| negative     | No Sat off...Need to work 6 days a week. |
| negative     | I’m a sad panda today.|
| positive     | such a beautiful satisfying day of bargain shopping. loves it. |
| positive     | who else is in a happy mood?? |
| positive     | actually quite happy today. |

What features would the perceptron find indicative of positive/negative class?

Would they generalize to unseen test data?

### Evaluation

The standard way to evaluate our classifier is:

$$ Accuracy = \frac{correctLabels}{allInstances}$$

What could go wrong?

When one class is much more common than the other, always predicting the
majority class gives high accuracy without learning anything useful.
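
A small sketch of this pitfall, using made-up labels (95 negative and 5 positive reviews) and a hypothetical `accuracy` helper:

```python
def accuracy(predicted, correct):
    """Fraction of instances whose predicted label matches the correct one."""
    matches = sum(p == c for p, c in zip(predicted, correct))
    return matches / len(correct)


# Made-up labels: 95 negative reviews, 5 positive ones.
correct = [-1] * 95 + [1] * 5
# A "classifier" that ignores its input and always says negative.
always_negative = [-1] * 100
print(accuracy(always_negative, correct))
```

The score looks good even though not a single positive review is found.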

### Evaluation

| Predicted/Correct     | MinorityClass | MajorityClass |
| --------------------- |:-------------:| -------------:|
| **MinorityClass**     | TruePositive  | FalsePositive |
| **MajorityClass**     | FalseNegative | TrueNegative  |

$$ Precision = \frac{TruePositive}{TruePositive+FalsePositive}$$

$$ Recall = \frac{TruePositive}{TruePositive+FalseNegative}$$
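
The two formulas can be computed from predicted and correct labels as sketched below. The helper name `precision_recall`, the `target` parameter (the label treated as the minority class) and the convention of returning 0.0 for an empty denominator are all assumptions made for this illustration; the labels are made up.

```python
def precision_recall(predicted, correct, target=1):
    """Precision and recall for the class `target` (hypothetical helper)."""
    pairs = list(zip(predicted, correct))
    true_pos = sum(p == target and c == target for p, c in pairs)
    false_pos = sum(p == target and c != target for p, c in pairs)
    false_neg = sum(p != target and c == target for p, c in pairs)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels: 3 positive (minority) and 5 negative reviews.
correct = [1, 1, 1, -1, -1, -1, -1, -1]
predicted = [1, -1, -1, 1, -1, -1, -1, -1]
precision, recall = precision_recall(predicted, correct)
print(precision, recall)
```

Here one of the two predicted positives is right (precision 1/2) and one of the three actual positives is found (recall 1/3), while accuracy would still be 5/8.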

### Time for (another!) test

<img src="images/maxresdefault.jpg" style="width:100%;">

Discuss in pairs what features you would use in a classifier that predicts FAIL/PASS for an essay!