# NLP 1 : Classics

In this practical session, we review some of the basics of text classification: **bag of words and linear models**

## Classic NLP pipeline

To classify text with statistical classifiers, the text must first be vectorized. Different methods exist to go from raw text to vectors. In this practical session, we explore the bag of words model.

### Bag of words model:

- A dictionary of all considered words (of size $D$) is built from the training text. Creating this dictionary can be seen as handcrafting features.
- Each document is represented as a **sparse** vector (of size $D$) coding for each possible word (feature) in the dictionary.
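
As a rough illustration of the two steps above, the sketch below builds a dictionary and count vectors by hand on a made-up three-document corpus. The helper names (`build_vocabulary`, `vectorize`) and the whitespace tokenization are assumptions for this example, not what the vectorizer used later in the session does internally.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = ["good movie", "bad movie", "good good plot"]

def build_vocabulary(texts):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Return a sparse representation {column index: count} for one document."""
    counts = Counter(w for w in text.split() if w in vocab)
    return {vocab[w]: c for w, c in counts.items()}

vocab = build_vocabulary(docs)                  # dictionary of size D
vectors = [vectorize(d, vocab) for d in docs]   # one sparse vector per document
```

Only non-zero entries are stored, which is what makes the representation sparse: most reviews use a tiny fraction of the dictionary.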


### Classification pipeline overview

```
      
raw text ---|Feature extraction|---> vectors ---|statistical classifier|---> pos or neg
```

In this pipeline you have two ways of improving performance:

- handcrafting better features
- optimizing the classifier hyper-parameters


### Goal of this session:

1. Warmup (just read and run -- make sure you know what's happening)
2. Tinker with the feature extractor and try different methods (stop-words, stemming,...)
3. Tinker with the classifiers
4. Visualize best features
5. **(Bonus)** make word clouds of those features 

## The data: Large Movie Review Dataset 

Here, to explore this classical NLP pipeline, we propose to do binary sentiment classification.  For this, we use a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. 

- [The original dataset can be found here](http://ai.stanford.edu/~amaas/data/sentiment/)

To make things easier, the data has already been formatted into a JSON file (`json_pol`) containing a train and a test list of `(string(review),int(class))` pairs. As it's a binary classification problem, there are two classes: 

- `0` codes for negative reviews 
- `1` codes for positive reviews


The json has the following format: 
```json
{
"train":[["this is a positive review",1],["this is a negative review",0],...],
"test":[["this is a positive review",1],["this is a negative review",0],...]
}
```


------------------------------------
# WARMUP (just read and run)
## Step 1: Load Data



### <font color='red'> /!\ YOU NEED TO UNZIP dataset/json_pol.zip first /!\</font>

The JSON has the following format: `{"train":[[review,class],...], "test":[[review,class],...]}`. 

 - We need to load it and collect both test and train lists:

In [2]:
import json
from collections import Counter

#### /!\ YOU NEED TO UNZIP dataset/json_pol.zip first /!\


# Loading json
with open("dataset/json_pol",encoding="utf-8") as f:
    data = f.readlines()
    json_data = json.loads(data[0])
    train = json_data["train"]
    test = json_data["test"]
    

# Quick Check
counter_train = Counter((x[1] for x in train))
counter_test = Counter((x[1] for x in test))
print("Number of train reviews : ", len(train))
print("----> # of positive : ", counter_train[1])
print("----> # of negative : ", counter_train[0])
print("")
print(train[0])
print("")
print("Number of test reviews : ",len(test))
print("----> # of positive : ", counter_test[1])
print("----> # of negative : ", counter_test[0])

print("")
print(test[0])
print("")



Number of train reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

["The undoubted highlight of this movie is Peter O'Toole's performance. In turn wildly comical and terribly terribly tragic. Does anybody do it better than O'Toole? I don't think so. What a great face that man has!<br /><br />The story is an odd one and quite disturbing and emotionally intense in parts (especially toward the end) but it is also oddly touching and does succeed on many levels. However, I felt the film basically revolved around Peter O'Toole's luminous performance and I'm sure I wouldn't have enjoyed it even half as much if he hadn't been in it.", 1]

Number of test reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old 

## Step 2: Feature extraction

Now that we have the data, we need to vectorize the text so it can be used by classifiers.
Different methods exist to vectorize text. Here we use a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) approach:

>  In the bag of words model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

In other words, each text becomes a (sparse) vector which codes for its words. With the following class from scikit-learn, it is straightforward to get a bag of words from raw texts: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

>Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data. [USER GUIDE](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
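
To make the tokenization step concrete, here is a small sketch that lowercases a string and applies the default `token_pattern` listed in the scikit-learn signature (`(?u)\b\w\w+\b`). It only approximates what `CountVectorizer` does with default settings; the `tokenize` helper and the sample sentence are assumptions for illustration.

```python
import re

# Default word pattern from the CountVectorizer signature: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase, then keep tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("What a great face! I don't think so.<br />")
```

With this pattern, single-character words such as "a" and "I" disappear, "don't" is cut at the apostrophe, and the HTML remnant `<br />` leaves a `br` token behind. This is one reason the cleaning exercises later in the session may matter.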



In [3]:
from sklearn.feature_extraction.text import CountVectorizer
classes = [pol for text,pol in train]
corpus = [text for text,pol in train]

# vectorizer = CountVectorizer(input='content', encoding='utf-8',
#                              decode_error='strict', strip_accents=None,
#                              lowercase=True, preprocessor=None, tokenizer=None,
#                              stop_words=None, token_pattern='(?u)\b\w\w+\b',
#                              ngram_range=(1, 1), analyzer='word',
#                              max_df=1.0, min_df=1, max_features=None,
#                              vocabulary=None, binary=False, dtype='numpy.int64')


vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()[:500].tolist()) # we only print a few (get_feature_names() was removed in scikit-learn 1.2)




['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02', '020410', '029', '03', '04', '041', '05', '050', '06', '06th', '07', '08', '087', '089', '08th', '09', '0f', '0ne', '0r', '0s', '10', '100', '1000', '1000000', '10000000000000', '1000lb', '1000s', '1001', '100b', '100k', '100m', '100min', '100mph', '100s', '100th', '100x', '100yards', '101', '101st', '102', '102nd', '103', '104', '1040', '1040a', '1040s', '105', '1050', '105lbs', '106', '106min', '107', '108', '109', '10am', '10lines', '10mil', '10min', '10minutes', '10p', '10pm', '10s', '10star', '10th', '10x', '10yr', '11', '110', '1100', '11001001', '1100ad', '111', '112', '1138', '114', '1146', '115', '116', '117', '11f', '11m', '11th', '12', '120', '1200', '1200f', '1201', '1202', '123', '12383499143743701', '125', '125m', '127', '128', '12a', '12hr', '12m', '12mm', '12s', '12th', '13', '130', '1300', '1300s', '131', 

## Step 3: Classifiers

Once we have vectorized data, we can use them to train statistical classifiers.

Here, we propose to use three classic options:

- Naïve Bayes
- Logistic Regression
- SVM


We fit each model below with near-default parameters and evaluate the accuracy of each model on the test set.

In [4]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score


#Naïve Bayes
nb_clf = MultinomialNB()
nb_clf.fit(X, classes)


#Logistic Regression
lr_clf = LogisticRegression(random_state=0, solver='lbfgs',n_jobs=-1)
lr_clf.fit(X, classes)

#Linear SVM
svm_clf = LinearSVC(random_state=0, tol=1e-5)
svm_clf.fit(X, classes)


true = [pol for text,pol in test]
test_corpus = [text for text,pol in test]
X_test = vectorizer.transform(test_corpus)

pred_nb = nb_clf.predict(X_test)
pred_lr = lr_clf.predict(X_test)
pred_svm = svm_clf.predict(X_test)


print(f"Naïve Bayes accuracy: {accuracy_score(true, pred_nb)}")
print(f"Logistic Regression accuracy: {accuracy_score(true, pred_lr)}")
print(f"SVM accuracy: {accuracy_score(true, pred_svm)}")

Naïve Bayes accuracy: 0.81356
Logistic Regression accuracy: 0.86372
SVM accuracy: 0.84572


--------------------------------------------------

# EXERCISES (now it's your turn)

To further improve performance, we can try to improve the simple bag of words model used above by addressing some issues:

- **(a)** If we visualize the word frequency distribution, we see that a few words (roughly 20) appear much more often than the others. These words are often referred to as **stop words**. Would removing them improve accuracy?
- **(b)** Some words are made of two or three words...
- **(c)** Is it really necessary to keep words which appear in only 2 or 3 reviews?
- **(d)** Punctuation, UPPER CASE, numbers... Does cleaning bring improvements?



In [None]:
# Let's plot the count of the 1000 most used words:

import matplotlib.pyplot as plt
%matplotlib inline

from collections import Counter

wc = Counter()
for text,pol in train+test:
    wc.update(text.split(" "))
    
freq = [f for w,f in wc.most_common(1000)]

plt.plot(freq[:1000])
print(wc.most_common(20))

**(a) Let's remove stop words**

>stop_words : string {'english'}, list, or None (default)

>If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

[see CountVectorizer for full doc](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
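
As a rough sketch of what stop-word filtering amounts to, tokens found in the list are simply dropped before counting. The tiny `STOP_WORDS` set below is a hypothetical stand-in for a real list (for example one from NLTK), and `remove_stop_words` is a helper invented for this example.

```python
# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "a", "and", "of", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the story is an odd one and quite disturbing".split()
filtered = remove_stop_words(tokens)
```

When building a better list, it may be worth checking whether it contains negations such as "not": many English stop-word lists do, and dropping them could hurt sentiment classification.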

In [None]:
# CountVectorizer can take a list of stop words as argument.
# Build or download a list of stop words (from NLTK for example)

stop_words = ["the", "a", "and"] #Make a better list

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(corpus)

**(b) n-grams**

>ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

>analyzer : string, {'word', 'char', 'char_wb'} or callable
Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

[see CountVectorizer for full doc](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
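
To see what `ngram_range=(min_n, max_n)` means for word n-grams, the sketch below enumerates them by hand. `word_ngrams` is a helper written for this example only; it mirrors the rule quoted above (all `n` with `min_n <= n <= max_n`) but is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams(["not", "very", "good"], 1, 2)
```

Bigrams such as "not very" let a linear model see some local word order, at the price of a much larger dictionary.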

In [None]:
vectorizer = CountVectorizer(ngram_range=(1,1),analyzer='word') # Maybe 2-grams or 3-grams bring improvements ?
X = vectorizer.fit_transform(corpus)

**(c) Restrict dictionary size**

>max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

>min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

>max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.

[see CountVectorizer for full doc](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
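
The thresholds above act on *document frequency* (in how many documents a term appears), not on total counts. The following sketch, using absolute integer thresholds and helper names invented for this example (`document_frequency`, `filter_vocabulary`), shows the idea on a toy corpus.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count in how many documents each term appears (not total occurrences)."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): a term counts once per document
    return df

def filter_vocabulary(tokenized_docs, min_df=1, max_df=None):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    df = document_frequency(tokenized_docs)
    upper = len(tokenized_docs) if max_df is None else max_df
    return sorted(t for t, c in df.items() if min_df <= c <= upper)

docs = [["good", "movie"], ["bad", "movie"],
        ["good", "plot", "movie"], ["movie", "movie"]]
kept = filter_vocabulary(docs, min_df=2, max_df=3)
```

Here "movie" is dropped as too frequent (4 documents) while "bad" and "plot" are dropped as too rare (1 document each).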

In [None]:
vectorizer = CountVectorizer(max_df=100000,min_df=5,max_features=25000) #try out some values
X = vectorizer.fit_transform(corpus)

#What is the dictionary size now ?
dic_size = len(vectorizer.vocabulary_) # one way: count the entries of the learned vocabulary
print(dic_size)

**(d) Clean text ?**
>strip_accents : {'ascii', 'unicode', None}
Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.
Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize.

>lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.

>preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

>token_pattern : string
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).


The `preprocessor` argument takes a function which processes the text directly. This way it becomes easy to do "fancy" things like:
- [part of speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)
- [stemming](https://en.wikipedia.org/wiki/Stemming) 

To do both, you can use [NLTK tagger](http://www.nltk.org/api/nltk.tag.html) and [NLTK stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem).


[see CountVectorizer for full doc](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
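
A possible starting point for such a function is sketched below. It lowercases the text, strips HTML tags such as the `<br />` found in the reviews, and replaces punctuation with spaces. The name `clean_text` and the exact regular expressions are assumptions for this example. Since the scikit-learn documentation describes `preprocessor` as overriding the built-in preprocessing stage, the sketch lowercases explicitly rather than relying on `lowercase=True`.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <br />
NON_WORD_RE = re.compile(r"[^\w\s]")   # anything that is neither a word char nor whitespace

def clean_text(text):
    """Lowercase, drop HTML tags and punctuation, collapse whitespace."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("What a GREAT face!<br /><br />The story is odd...")
```

Such a function could then be passed as `CountVectorizer(preprocessor=clean_text)`; whether it actually improves accuracy is what the exercise asks you to check.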

In [None]:
import re

reg = r"\b\w+\b" # raw string so that \b is a word boundary (not a backspace); matches whole words

# 1) Try removing punctuation or putting text to lower case (maybe use a regex)
# 2) Try "Stemming" - "pos-tagging" the text

def preprocess(text):
    """
    Transforms text to remove unwanted bits.
    """
    return text.replace("."," ") # This function is only taking care of dots, what about !:,?+-&*%

vectorizer = CountVectorizer(preprocessor=preprocess)
X = vectorizer.fit_transform(corpus)

**(bonus)** 

- Instead of word counts, the bag of words vector can simply record which words are present:

> binary : boolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

[see CountVectorizer for full doc](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- Words can also be weighted by importance. The most used weighting scheme in Information Retrieval is TF-IDF.

[TfidfVectorizer from scikit can be directly used](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
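
For intuition, here is a hand-rolled TF-IDF sketch using raw counts as term frequency and a smoothed inverse document frequency, $\text{idf}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1$. This is one common variant, chosen here as an assumption; `TfidfVectorizer` additionally normalizes each vector by default, which this sketch omits. The `tfidf` helper and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document (no length normalization)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # document frequency: once per document
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized_docs]

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
weights = tfidf(docs)
```

In the second document, "bad" (present in one document) receives a larger weight than "movie" (present in two), which is the intended effect: rarer terms are treated as more informative.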



In [None]:
#from sklearn.feature_extraction.text import TfidfVectorizer

#vectorizer = CountVectorizer(binary=True)
#vectorizer = TfidfVectorizer()


# Answer the following questions

- What is the most effective pre-processing ?
- Which model is the most accurate ?

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

#Naïve Bayes
nb_clf = MultinomialNB()
nb_clf.fit(X, classes)

#Logistic Regression
lr_clf = LogisticRegression(random_state=0, solver='lbfgs',n_jobs=-1)
lr_clf.fit(X, classes)

#Linear SVM
svm_clf = LinearSVC(random_state=0, tol=1e-5)
svm_clf.fit(X, classes)


true = [pol for text,pol in test]
test_corpus = [text for text,pol in test]
X_test = vectorizer.transform(test_corpus)

pred_nb = nb_clf.predict(X_test)
pred_lr = lr_clf.predict(X_test)
pred_svm = svm_clf.predict(X_test)


print(f"Naïve Bayes accuracy: {accuracy_score(true, pred_nb)}")
print(f"Logistic Regression accuracy: {accuracy_score(true, pred_lr)}")
print(f"SVM accuracy: {accuracy_score(true, pred_svm)}")

# Visualizing features

It can be interesting to find out which words are the most positive or negative for our models. To do so, you can simply look at how each model weights each feature with respect to each class.

In [None]:
# we first build a dictionary {id_feature : word} from our vectorizer

features = {v:k for k,v in vectorizer.vocabulary_.items()} # invert mapping (k2v)

### Naïve Bayes

For the naïve Bayes model, we can look directly at `p(word | class)`

[MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)

> feature_log_prob_ : array, shape (n_classes, n_features)
Empirical log probability of features given a class, P(x_i|y).
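
One possible way to turn these log probabilities into a ranking is sketched below on made-up numbers. Rather than sorting by `log p(word | class)` alone, which tends to favour words that are frequent in both classes, it sorts by the difference between the two rows. The values, the `id2word` mapping and the `rank_by_log_ratio` helper are all hypothetical; this is a suggestion, not the expected solution.

```python
import math

# Hypothetical per-class word probabilities p(word | class)
id2word = {0: "awful", 1: "great", 2: "movie", 3: "boring"}
log_prob = [
    [math.log(p) for p in (0.30, 0.05, 0.45, 0.20)],  # class 0 (negative)
    [math.log(p) for p in (0.05, 0.40, 0.50, 0.05)],  # class 1 (positive)
]

def rank_by_log_ratio(log_prob, id2word):
    """Sort words by log p(w|pos) - log p(w|neg), most negative first."""
    neg, pos = log_prob
    scores = {id2word[i]: pos[i] - neg[i] for i in range(len(neg))}
    return sorted(scores, key=scores.get)

ranking = rank_by_log_ratio(log_prob, id2word)
most_neg = ranking[:2]        # lowest log ratio
most_pos = ranking[::-1][:2]  # highest log ratio
```

Note that "movie" has the highest probability in both classes, yet its log ratio is close to zero, so it does not dominate either end of the ranking.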

In [None]:
import numpy as np
#Naïve Bayes

k = 50 # we want the 50 most negative and positive words

feat_neg = nb_clf.feature_log_prob_[0] # get list of negative class log probability
#feat_pos =  # same for positive

most_neg = [] # find the corresponding words
most_pos = [] 

print(most_neg[:k])
print(most_pos[:k])





## Linear Models : Logistic Regression & SVM


For linear models, we can look at feature coefficients:
 [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
 
 	
>coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
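
For a binary linear model, a large positive coefficient pushes the decision towards class `1` and a large negative one towards class `0`, so sorting the coefficient vector gives both ends of the ranking at once. The sketch below uses made-up weights and a hypothetical `top_k_features` helper; with the real models, `coefs` would play the role of `coef_[0]` and `id2word` the role of the inverted vocabulary.

```python
def top_k_features(coefs, id2word, k=2):
    """Return the k most negative and k most positive words by coefficient."""
    order = sorted(range(len(coefs)), key=lambda i: coefs[i])  # ascending
    most_neg = [id2word[i] for i in order[:k]]
    most_pos = [id2word[i] for i in order[::-1][:k]]
    return most_neg, most_pos

# Made-up weights, for illustration only
id2word = {0: "awful", 1: "great", 2: "movie", 3: "boring", 4: "superb"}
coefs = [-1.2, 0.9, 0.01, -0.7, 1.1]
most_neg, most_pos = top_k_features(coefs, id2word, k=2)
```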

In [None]:
#Logistic Regression

k = 50 # we want the 50 most negative and positive words

feat = lr_clf.coef_[0] 


In [None]:
#SVM
k = 50 # we want the 50 most negative and positive words

feat = svm_clf.coef_[0]




------------------------
## **(bonus:)**  [make word clouds!](https://github.com/amueller/word_cloud)

### Installation
If you are using pip:

`pip install wordcloud`

If you are using conda, you can install from the conda-forge channel:

`conda install -c conda-forge wordcloud`


In [None]:
# import ...

# Get text:
#text = "this is example text"

# Generate a word cloud image
# wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
# import matplotlib.pyplot as plt
# plt.imshow(wordcloud, interpolation='bilinear')
# plt.axis("off")


# ------------ End of Practical -------------

# Legacy loading method (if you use original data)

The data is split into two folders, "train" and "test". Each of them contains two subfolders, "neg" and "pos", holding the reviews as ".txt" files, one file per review.

=> We wish to load those reviews `(text,polarity)` in two lists: `train` and `test`

Data format is: numReview_polarity.txt , one file per review.

Now that we have each review's filepath we have two things left to do:
 - read each review from file 
 - extract polarity from filename
 

In [None]:
### This code is for legacy

import glob
from os.path import split as pathsplit


def get_polarity(f):
    """
    Extracts polarity from filename:
    0 is negative (< 5)
    1 is positive (> 5)
    """
    _,name = pathsplit(f)
    if int(name.split('_')[1].split('.')[0]) < 5:
        return 0
    else:
        return 1

def open_one(f):
    """
    Opens one file and returns its review text and polarity
    """
    polarity = get_polarity(f)
    
    with open(f,"r",encoding='utf-8') as review:
        text = " ".join(review.readlines()).strip()
    
    return (text,polarity)

#We first recover all reviews filepaths using glob - https://docs.python.org/3/library/glob.html

dir_train = "dataset/aclImdb/train/"
dir_test = "dataset/aclImdb/test/"

train_files = glob.glob(dir_train+'pos/*.txt') + glob.glob(dir_train+'neg/*.txt')
test_files = glob.glob(dir_test+'pos/*.txt') + glob.glob(dir_test+'neg/*.txt')


print(train_files[:3]) #We have 2 lists of files

print(open_one(train_files[0])) #test

train = [open_one(x) for x in train_files]
test = [open_one(x) for x in test_files]