**Book**: Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications  
**Book**: Data Mining: The Textbook

# Sentiment Analysis

<img src="figures/sentiment_analysis_copy.jpg">

**Sentiment analysis** or **opinion mining** analyzes  
  * **people’s** *opinions*, *appraisals*, *attitudes*, and *emotions*  
  * **toward** *entities*, *individuals*, *issues*, *events*, *topics*, and their attributes.
* For **example**, businesses always want to find public or consumer **opinions about their products and services**.

**Social media**

* **Individuals and organizations** increasingly use content on the Web for their **decision making**
  * To **buy a consumer product**, there are many user **reviews of products** on the Web.
  * To gather public **opinions about products and services** there is an abundance of information publicly available.
* **Finding and monitoring opinion sites** on the Web and **distilling the information** contained in them remains a formidable task because of the proliferation of diverse sites. 

**Summarizing the information**

* The average human reader will have difficulty identifying relevant sites and accurately **summarizing** the information and **opinions** contained in them.
* **Bias problem**: `people often pay greater attention to opinions that are consistent with their own preferences`.
* People also have difficulty when the **amount of information** to be processed is **large**.

## The Problem of Opinion Mining

* **One opinion** represents only **the view of a single person**
  * it is essential to **analyze a collection of opinions** rather than only one
  * some form of **summary of opinions** is needed

### Problem Definitions
We can use a review of an iPhone to introduce the problem:
>    “(1) I bought an **iPhone** a few days ago. (2) <font style="color:blue">It was such a nice phone.</font> (3) <font style="color:blue">The touch screen was really cool</font>. (4) <font style="color:blue">The voice quality was clear too</font>. (5) <font style="color:red">However, my mother was mad with me as I did not tell her before I bought it</font>. (6) <font style="color:red">She also thought the phone was too expensive, and wanted me to return it to the shop</font>. ... ”

**Question**: what do we want to **mine or extract** from this review?

* There are `several opinions` in this review.
  * Sentences (2), (3), and (4) express some **<font style="color:blue">positive opinions</font>**
  * Sentences (5) and (6) express **<font style="color:red">negative opinions</font>** or emotions.

* The opinions all have some `targets`.  
  The targets of the opinions in sentence/s   <img align="right" style="padding-left:10px;" src="figures/sentiment-graph.png" width="30%">  
  * (2) is the **iPhone** as a whole, 
  * (3) and (4) are “ **touch screen** ” and “ **voice quality** ” of the iPhone. 
  * (6) is the **price** of the iPhone,
  * (5) is “ **me** ”, not iPhone.

* The opinions also have different `holders`. The holder of the opinions in sentences 
  * (2), (3), and (4) is the **author of the review** (“ I ”), 
  * (5) and (6) is “ **my mother**. ”
  
With this example in mind, we can now **formally define** the `opinion mining problem`.

In general, **opinions** can be **expressed about** anything, 
* e.g., a **product**, a **service**, an **individual**, an **organization**, an **event**, or a **topic**, 
* **by any person** or organization. 

We use the term <font style="color:red">**entity**</font> to denote the **target object** that has been evaluated. 
* An entity can have a **set of components** (or parts) and a **set of attributes**. 
* Each component may have its own **sub-components** and its set of **attributes**, and so on. 
* Thus, an entity can be hierarchically decomposed based on the <font style="color:red">**part-of**</font> relation.

**Mobile phone example**

<img src="figures/Mobile-handset-architecture.png" width="70%">

**Definition of entity**

> An **entity** $e$ is a product, service, person, event, organization, or topic.   
  It is associated with a pair, $e$: $(T, W)$, where $T$ is a **hierarchy of components (or parts)**, subcomponents, and so on,   
  and $W$ is a **set of attributes of e**. Each component or sub-component also has its own set of attributes.

**Example iPhone**

* A particular brand of cellular phone is an entity, e.g., iPhone.
* It has a set of **components**, e.g., *battery* and *screen*, 
* and also a set of **attributes**, e.g., *voice quality*, *size*, and *weight*. 
* The **battery component** also has its own set of **attributes**, e.g., *battery life* and *battery size*.

One can express an **opinion on** 
* the **cellular phone** itself (the root node), e.g., "<font style="color:red">I do not like iPhone</font>" or 
* on any one of its **attributes**, e.g., "<font style="color:red">The voice quality of iPhone is lousy</font>" 

**Simplified and Flattened entity** (two levels)

* the **root level node** is still the **entity** itself, 
* the **second level nodes** are the different <font style="color:red">**aspects**</font> of the entity.

> The **aspects** of an entity $e$ are the **components and attributes** of $e$.
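
The hierarchy and its flattening into aspects can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `flatten_aspects` helper are hypothetical names introduced here, and the components and attributes are those of the iPhone example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An entity or one of its components (hypothetical helper class)."""
    name: str
    attributes: List[str] = field(default_factory=list)
    components: List["Node"] = field(default_factory=list)


def flatten_aspects(entity: Node) -> List[str]:
    """Collect all components and attributes below the root as a flat list of aspects."""
    aspects = list(entity.attributes)
    for comp in entity.components:
        aspects.append(comp.name)              # the component itself is an aspect
        aspects.extend(flatten_aspects(comp))  # so are its attributes and sub-components
    return aspects


iphone = Node(
    name="iPhone",
    attributes=["voice quality", "size", "weight"],
    components=[
        Node("battery", attributes=["battery life", "battery size"]),
        Node("screen"),
    ],
)

aspects = flatten_aspects(iphone)
print(aspects)
```

After flattening, the root (`iPhone`) is the entity and every other name is one of its aspects, which is the two-level view used in the rest of this section.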

**Types of opinions**

Two main types of opinions: 
* **Regular opinions** are often referred to simply as opinions in the research literature. 
* A **comparative opinion** expresses a relation of similarities or differences *between two or more entities*

**Opinion**
* An opinion (or regular opinion) is simply a <font style="color:blue">*positive or negative view, attitude, emotion, or appraisal*</font> about an **entity** or an **aspect** of the entity from an **opinion holder**. 

**Opinion orientations or polarity**
* **Positive**, **negative**, and **neutral** are called opinion orientations. 
* Neutral is often interpreted as no opinion.

**Definition of opinion**

 > An opinion (or regular opinion) is a quintuple,
$$(e_i , a_{ij} , oo_{ijkl} , h_k , t_l )$$
where 
* $e_i$ is the name of an **entity**, 
* $a_{ij}$ is an **aspect** of $e_i$ , 
* $oo_{ijkl}$ is the **opinion orientation**  about aspect $a_{ij}$ of entity $e_i$ , 
* $h_k$ is the **opinion holder**, and 
* $t_l$ is the **time** when the opinion is expressed by $h_k$. 

> The opinion orientation $oo_{ijkl}$ can be **positive**, **negative**, or **neutral** or be expressed
with different strength/intensity levels. 

**Remark 1**

The opinion $oo_{ijkl}$ must be 
* given by **opinion holder** $h_k$ 
* about aspect $a_{ij}$ of entity $e_i$ at time $t_l$ . 

Otherwise, we may **assign an opinion to a wrong entity or wrong aspect**, etc.

**Remark 2**

These five components in $(e_i , a_{ij} , oo_{ijkl} , h_k , t_l )$ are essential. 
* Missing any one of them can be problematic in general.  
  **Example** "`The picture quality is great`" 
  * if we do not know whose picture quality, the opinion is of little use. 
* This is not true for every application.  
  **Examples** 
  * `knowing each opinion holder is not necessary if we want to summarize opinions from a large number of people`. 
  * New components can be added to the tuple:   
  In some applications we may want to know the sex and age of each opinion holder.
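
As a minimal sketch, the quintuple can be represented as a named tuple and filled in with the opinions of the iPhone review discussed earlier. The `Opinion` class and its field names are assumptions made for illustration; the review gives no date, so the time field holds a placeholder.

```python
from typing import NamedTuple


class Opinion(NamedTuple):
    """Hypothetical container for the quintuple (e, a, oo, h, t)."""
    entity: str       # e_i
    aspect: str       # a_ij ("GENERAL" when the entity as a whole is evaluated)
    orientation: str  # oo_ijkl: "positive", "negative" or "neutral"
    holder: str       # h_k
    time: str         # t_l (placeholder: the review does not state a date)


# Sentences (2), (3), (4) and (6) of the review; sentence (5) targets
# the author rather than the iPhone, so it is left out here.
review_opinions = [
    Opinion("iPhone", "GENERAL", "positive", "author", "unknown"),
    Opinion("iPhone", "touch screen", "positive", "author", "unknown"),
    Opinion("iPhone", "voice quality", "positive", "author", "unknown"),
    Opinion("iPhone", "price", "negative", "author's mother", "unknown"),
]

for op in review_opinions:
    print(op)
```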

### Aspect-Based Opinion Summary

* **One opinion** from a single holder is usually **not sufficient** for action.
  * **Summary of opinions** is needed.
  * A common form of summary: `aspect-based opinion summary` (or `feature-based opinion summary`)
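
An aspect-based summary essentially counts positive and negative opinions per aspect. The sketch below shows one way to do that; the `summarize` function is a hypothetical helper and the input pairs are made-up illustrative data, not results from a real corpus.

```python
from collections import defaultdict

# Made-up (aspect, orientation) pairs, as if extracted from several reviews.
opinions = [
    ("GENERAL", "positive"), ("GENERAL", "positive"), ("GENERAL", "negative"),
    ("touch screen", "positive"),
    ("voice quality", "positive"), ("voice quality", "negative"),
    ("price", "negative"),
]


def summarize(pairs):
    """Count positive and negative opinions for each aspect."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for aspect, orientation in pairs:
        if orientation in ("positive", "negative"):  # neutral opinions are skipped
            summary[aspect][orientation] += 1
    return dict(summary)


summary = summarize(opinions)
for aspect, counts in summary.items():
    print(f"{aspect}: +{counts['positive']} / -{counts['negative']}")
```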


**Example of Aspect-based opinion summary**

<img src="figures/aspect-based-opinion-summary.png" width="50%">

**Example of Aspect-based opinion summary**

<img src="figures/visual-aspect-based-opinion-summary.png" width="70%">

## Document Sentiment Classification

<img src="figures/docment-sentiment-classification.png" width="30%">


Given an **opinionated document** $d$ evaluating an entity $e$,  
determine the opinion orientation $oo$ on $e$,  
i.e., determine $oo$ on aspect GENERAL in the quintuple $(e, GENERAL, oo, h, t)$.  
`e, h, and t are assumed known or irrelevant`.

**Assumption** 

Sentiment classification assumes that 
* the opinion document $d$ (e.g., a product review) expresses opinions on a **single entity** $e$ and 
* the opinions are from a **single opinion holder** $h$.

### Classification Based on Supervised Learning

Sentiment classification obviously can be formulated as 
* a supervised learning problem with <font style="color:red">three classes</font>, **positive**, **negative**, and **neutral**. 
* Training and testing data used in the existing research are mostly **product reviews**
* **Example**: For reviews with ratings (e.g., 1–5 stars), 
  * a review with 4 or 5 stars is considered a **positive review**, 
  * a review with 1 or 2 stars is considered a **negative review** 
  * a review with 3 stars is considered a **neutral review**.
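
The rating-to-class rule above can be written as a small function. The name `stars_to_label` is hypothetical and the thresholds are exactly the ones listed.

```python
def stars_to_label(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment class."""
    if not 1 <= stars <= 5:
        raise ValueError("rating must be between 1 and 5 stars")
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


labels = [stars_to_label(s) for s in range(1, 6)]
print(labels)
```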

**Sentiment classification vs topic-based text classification**

Sentiment classification is similar to classic **topic-based text classification**,   
which classifies documents into predefined topic classes, e.g., politics, sciences, sports, etc. 
* In sentiment classification, **opinion words** (also called sentiment words) that `indicate positive or negative opinions` are important, 
  * e.g., *great, excellent, amazing, horrible, bad, worst, etc.*

**Supervised learning methods and features**

Any existing supervised learning method can be applied to sentiment classification, 
* e.g., **Naïve Bayesian** classification, **Support vector machines** (SVM), etc.

Features
* Unigrams, TF-IDF, Part of speech, Opinion words and phrases, Negations, etc
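
To illustrate the negation feature, one common trick is to prefix every token that follows a negation word with a marker until the next punctuation mark, so that "like" and "not like" become different unigrams. The sketch below assumes this scheme; the `mark_negation` function, the `NOT_` prefix, and the short negation list are illustrative choices, not a fixed standard.

```python
import re

NEGATIONS = {"not", "no", "never", "n't", "cannot"}  # small illustrative list
CLAUSE_END = re.compile(r"^[.,;:!?]$")               # punctuation ends the negation scope


def mark_negation(tokens):
    """Prefix tokens inside a negation scope with 'NOT_'."""
    marked = []
    negated = False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            negated = False
            marked.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negated else tok)
    return marked


tokens = "I do not like this phone , but the screen is great .".split()
marked = mark_negation(tokens)
print(marked)
```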

### Classification Based on Unsupervised Learning

**Opinion words and phrases** are the dominating indicators for sentiment classification. 
* Using **unsupervised learning** based on such words and phrases would be quite natural.
* Perform **classification** using some **fixed syntactic phrases** that are likely to be used to **express opinions**.
  * Example: extract `phrases containing adjectives or adverbs`, since adjectives and adverbs are good indicators of opinions.
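
The syntactic-phrase approach needs a part-of-speech tagger. As a simpler stand-in that also requires no labeled data, the sketch below just counts opinion words from a tiny lexicon. It is not the phrase-based method described above; the two word sets reuse the example opinion words given earlier, and the function name `lexicon_orientation` is hypothetical.

```python
import string

POSITIVE = {"great", "excellent", "amazing"}  # tiny illustrative lexicon
NEGATIVE = {"horrible", "bad", "worst"}


def lexicon_orientation(text: str) -> str:
    """Classify a text by counting positive and negative opinion words."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["The screen is amazing and the sound is great.",
                 "Worst battery ever, horrible.",
                 "I bought it yesterday."]:
    print(sentence, "->", lexicon_orientation(sentence))
```

A real system would use a much larger lexicon and would have to handle negation and context-dependent words, which this sketch ignores.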

## Sentence Subjectivity and Sentiment Classification

## Aspect-Based Opinion Mining

## Mining Comparative Opinions

## Opinion Spam Detection

### Challenges and Issues 
Challenges 
* Relevant objects vs irrelevant ones 
* Same feature expressed in different wordings 
* Words that could be positive or negative in different contexts 
* Long text that could contain both positive and negative opinions
* Detecting opinion-oriented sentences
* Integrating the tasks above

Some other issues 
* Identifying comparison words 
* Dealing with different writing style by different people 
* Tracking changing opinions 
* Measuring strength of opinions 
* Tackling sarcastic statements and mixed views 
* Spam opinions

## Python example

First, let's import useful libraries

In [1]:
import matplotlib.pyplot as plt
#%matplotlib inline 
#plt.style.use('seaborn-whitegrid')
plt.rc('text', usetex=True)
plt.rc('font', family='times')
plt.rc('xtick', labelsize=10) 
plt.rc('ytick', labelsize=10) 
plt.rc('font', size=12) 

**Installation requirements**

In [2]:
!pip install unidecode



**Download the NLTK corpora**

Before starting, it is important to download the NLTK corpora. You can download individual data packages or the entire collection (using “all”). Useful corpora for this notebook include wordnet, movie_reviews, and stopwords.

In [3]:
import nltk
#nltk.download()
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ignazio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Task 1: Stemmer/lemmatizer example using NLTK

In [4]:
raw_docs = ["Here are some very simple basic sentences.", 
            "They won't be very interesting, I'm afraid.", 
            "The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."
           ]
            
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]

import re
import string
regex = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_review = []
    for token in review:
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

preprocessed_docs = []
for doc in tokenized_docs_no_punctuation:
    final_doc = []
    for word in doc:
        # Alternatives: the two stemmers, which need no extra corpora
        # final_doc.append(porter.stem(word))
        # final_doc.append(snowball.stem(word))
        # The lemmatizer requires 'corpora/wordnet' -> nltk.download('wordnet')
        final_doc.append(wordnet.lemmatize(word))
    preprocessed_docs.append(final_doc)

print("tokenized_docs_no_punctuation\n", tokenized_docs_no_punctuation)
print("preprocessed_docs\n",preprocessed_docs)

tokenized_docs_no_punctuation
 [['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
preprocessed_docs
 [['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentence'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'example', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'work', 'on', 'very', 'simple', 'data']]


**Results**

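These examples use the NLTK classes "`PorterStemmer`", "`SnowballStemmer`", and "`WordNetLemmatizer`". 
* The output above comes from the lemmatizer, which here only maps plurals to their singular form (*sentences* → *sentence*, *examples* → *example*, *works* → *work*). 
* The two stemmers give broadly **similar** results but truncate more aggressively, e.g., the Porter stemmer reduces *very* to *veri*. 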

### Task 2: Word frequencies feature vector

In [5]:
mydoclist = ['Mireia loves me more than Hector loves me', 
             'Sergio likes me more than Mireia loves me', 
             'He likes basketball more than footbal']
from collections import Counter
for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] +=1
    print(tf.items())
    
# define a set with all possible words included
# in all the sentences or "corpus"    
def build_lexicon(corpus):  
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon
def tf(term, document):
    return freq(term, document)
def freq(term, document):
    return document.split().count(term)
vocabulary = build_lexicon(mydoclist)
doc_term_matrix = []
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
for doc in mydoclist:
    print('The doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
    print ('The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string))
    doc_term_matrix.append(tf_vector)
    
print ('All combined, here is our master document term matrix: ')
print (doc_term_matrix)

dict_items([('Mireia', 1), ('loves', 2), ('Hector', 1), ('me', 2), ('more', 1), ('than', 1)])
dict_items([('Mireia', 1), ('Sergio', 1), ('loves', 1), ('likes', 1), ('me', 2), ('more', 1), ('than', 1)])
dict_items([('footbal', 1), ('basketball', 1), ('He', 1), ('likes', 1), ('more', 1), ('than', 1)])
Our vocabulary vector is [Sergio, loves, than, likes, Mireia, Hector, basketball, He, footbal, me, more]
The doc is "Mireia loves me more than Hector loves me"
The tf vector for Document 1 is [0, 2, 1, 0, 1, 1, 0, 0, 0, 2, 1]
The doc is "Sergio likes me more than Mireia loves me"
The tf vector for Document 2 is [1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 1]
The doc is "He likes basketball more than footbal"
The tf vector for Document 3 is [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
All combined, here is our master document term matrix: 
[[0, 2, 1, 0, 1, 1, 0, 0, 0, 2, 1], [1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 1], [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]]


Based on the previous script, 
* each document is in the **same feature space**,  
* we can start applying some **machine learning** methods: classifying, clustering, and so on. 

**Problems**: 
* Words are **not** all **equally informative**. 
  * If words appear too frequently in a single document, they are going to muck up our analysis. 
* We need to **normalize the vectors**, for example with the L2 norm.

### Task 3: L2 Normalization

For any given vector $\vec{u}$ its unit vector (written as $\hat{u}$) is calculated as follows:
$$\hat{u} = \frac{\vec{u}}{\|\vec{u}\|}$$

In [6]:
import math
import numpy as np
def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]
doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))
print ('A regular old document term matrix: ')
print (np.matrix(doc_term_matrix))
print ('\nA document term matrix with row-wise L2 norms of 1:')
print (np.matrix(doc_term_matrix_l2))

A regular old document term matrix: 
[[0 2 1 0 1 1 0 0 0 2 1]
 [1 1 1 1 1 0 0 0 0 2 1]
 [0 0 1 1 0 0 1 1 1 0 1]]

A document term matrix with row-wise L2 norms of 1:
[[0.         0.57735027 0.28867513 0.         0.28867513 0.28867513
  0.         0.         0.         0.57735027 0.28867513]
 [0.31622777 0.31622777 0.31622777 0.31622777 0.31622777 0.
  0.         0.         0.         0.63245553 0.31622777]
 [0.         0.         0.40824829 0.40824829 0.         0.
  0.40824829 0.40824829 0.40824829 0.         0.40824829]]


We have scaled down the vectors so that each element lies in the interval [0, 1] and each row has unit length, without losing too much
valuable information. 
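
As a worked check, take the first row of the document term matrix above, $\vec{u} = (0, 2, 1, 0, 1, 1, 0, 0, 0, 2, 1)$:

$$\|\vec{u}\| = \sqrt{2^2 + 1^2 + 1^2 + 1^2 + 2^2 + 1^2} = \sqrt{12} \approx 3.4641$$

$$\frac{2}{\sqrt{12}} \approx 0.5774, \qquad \frac{1}{\sqrt{12}} \approx 0.2887$$

These values agree with the entries 0.57735027 and 0.28867513 printed for that row.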

### Task 4: Feature weighting by its inverse document frequency

In [7]:
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount +=1
    return doccount
def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (float(df)) )
my_idf_vector = [idf(word, mydoclist) for word in vocabulary]
print ('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
print ('The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']')

Our vocabulary vector is [Sergio, loves, than, likes, Mireia, Hector, basketball, He, footbal, me, more]
The inverse document frequency vector is [1.098612, 0.405465, 0.000000, 0.405465, 0.405465, 1.098612, 1.098612, 1.098612, 1.098612, 0.405465, 0.000000]
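
The cell above only computes the IDF vector, using $\mathrm{idf}(t) = \ln(N / \mathrm{df}(t))$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. A minimal, self-contained sketch of the weighting step follows: each term frequency is multiplied by the IDF of its term. The helper names `term_freq` and `inv_doc_freq` are chosen here to avoid clashing with the notebook's own functions, and the vocabulary is sorted only to make the column order reproducible.

```python
import math

docs = ['Mireia loves me more than Hector loves me',
        'Sergio likes me more than Mireia loves me',
        'He likes basketball more than footbal']

# Sorted so that the column order is the same on every run.
vocab = sorted({word for doc in docs for word in doc.split()})


def term_freq(term, doc):
    """Raw count of a term in one document."""
    return doc.split().count(term)


def inv_doc_freq(term, corpus):
    """ln(N / df): terms found in every document get weight 0."""
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / df)


tfidf_matrix = [[term_freq(t, d) * inv_doc_freq(t, docs) for t in vocab]
                for d in docs]

# Non-zero weights of the first document
for term, weight in zip(vocab, tfidf_matrix[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Words such as *more* and *than*, which occur in all three documents, receive a weight of zero, while words found in a single document are weighted most heavily.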


### Task 5: Film critics binary sentiment analysis recognition code 

In this example we apply the whole sentiment analysis process to the Large Movie Review
Dataset (http://www.aclweb.org/anthology/P11-1015). 
* This is one of the largest publicly available datasets for sentiment analysis: 
  * more than 50,000 texts from movie reviews; 
  * ground-truth annotations marking reviews as **positive** or **negative**. 
  * We use a small random subset of the data: 1,000 training reviews and 200 test reviews. 

You can use the following commands in a Linux operating system to download the required data:

In [8]:
#!wget http://ai.stanford.edu/~amaas//data/sentiment/aclImdb_v1.tar.gz 

--2019-05-20 20:10:46--  http://ai.stanford.edu/~amaas//data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2019-05-20 20:11:17 (2,64 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



In [9]:
!mkdir -p data
!tar -xf aclImdb_v1.tar.gz -C data/
!mkdir -p data/train/pos2 data/train/neg2
!mkdir -p data/test/pos2 data/test/neg2


**For Windows systems**

* You can download the data here: http://ai.stanford.edu/~amaas/data/sentiment/
* Extract the archive into the folder data/

**Binary sentiment analysis recognition**

The next lines of code move the review files into the working folders created above; the following example draws a random subset of these reviews for binary sentiment classification.

In [10]:
import os

# Move every review file from the extracted archive into the working folders
for split in ("train", "test"):
    for label in ("pos", "neg"):
        src_dir = os.path.join("data", "aclImdb", split, label)
        dst_dir = os.path.join("data", split, label + "2")
        for file in os.listdir(src_dir):
            if file.endswith(".txt"):
                os.rename(os.path.join(src_dir, file), os.path.join(dst_dir, file))

**Training and testing**

And the next script will perform the whole training and testing procedure on the selected subset of the dataset.

In [12]:
import os
import re
import string
import time, random
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from unidecode import unidecode

def BoW():
    # Tokenizing text
    text_tokenized = [word_tokenize(doc) for doc in text]
    # Removing punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_docs_no_punctuation = []
    for review in text_tokenized:
        new_review = []
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
    # Stemming (Porter)
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in tokenized_docs_no_punctuation:
        final_doc = ''
        for word in doc:
            final_doc = final_doc + ' ' + porter.stem(word)
        preprocessed_docs.append(final_doc)
    return preprocessed_docs

print('Reading the training data positive')
text = []
files = random.sample(os.listdir("data/train/pos2/"), 500)
for file in files:
    if file.endswith(".txt"):
        infile = open('data/train/pos2/' + file, 'r')
        text.append(unidecode(infile.read()))
        infile.close()
num_posTrain=len(text)

print('Reading the training data negative')
files = random.sample(os.listdir("data/train/neg2/"), 500)
for file in files:
    if file.endswith(".txt"):
        infile = open('data/train/neg2/' + file, 'r')
        text.append(unidecode(infile.read()))
        infile.close()
num_Train=len(text)

print('Defining dictionaries')

preprocessed_docs=BoW()
# Computing the TF-IDF word space
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
trainData = tfidf_vectorizer.fit_transform(preprocessed_docs)

# Reading the test data

print('Reading the test data positive')

text = []
files = random.sample(os.listdir("data/test/pos2/"), 100)
for file in files:
    if file.endswith(".txt"):
        infile = open('data/test/pos2/' + file, 'r')
        text.append(unidecode(infile.read()))
        infile.close()
num_posTest=len(text)

print('Reading the test data negative')
files = random.sample(os.listdir("data/test/neg2/"), 100)
for file in files:
    if file.endswith(".txt"):
        infile = open('data/test/neg2/' + file, 'r')
        text.append(unidecode(infile.read()))
        infile.close()
num_Test=len(text)

print('Computing test feature vectors')
start_time = time.time()

preprocessed_docs=BoW()
testData = tfidf_vectorizer.transform(preprocessed_docs)

targetTrain = []
for i in range(0,num_posTrain):
    targetTrain.append(0)
for i in range(0,num_Train-num_posTrain):
    targetTrain.append(1)

targetTest = []
for i in range(0,num_posTest):
    targetTest.append(0)
for i in range(0,num_Test-num_posTest):
    targetTest.append(1)

print('Training and testing on training Naive Bayes')
start_time = time.time()

# GaussianNB needs dense input; toarray() gives a plain ndarray
# (recent scikit-learn releases reject the np.matrix returned by todense())
X_train = trainData.toarray()
X_test = testData.toarray()

gnb = GaussianNB()
y_pred = gnb.fit(X_train, targetTrain).predict(X_train)
print("Number of mislabeled training points out of a total %d points : %d" % (trainData.shape[0],(targetTrain != y_pred).sum()))

print('Training and testing on test Naive Bayes')

# The model is already fitted, so it is only applied to the test set here
y_pred = gnb.predict(X_test)
print("Number of mislabeled test points out of a total %d points : %d" % (testData.shape[0],(targetTest != y_pred).sum()))

print('Training and testing on train with SVM')
clf = svm.SVC(gamma="scale")
clf.fit(X_train, targetTrain)
y_pred = clf.predict(X_train)
print("Number of mislabeled training points out of a total %d points : %d" % (trainData.shape[0],(targetTrain != y_pred).sum()))

print('Testing on test with already trained SVM')
y_pred = clf.predict(X_test)
print("Number of mislabeled test points out of a total %d points : %d" % (testData.shape[0],(targetTest != y_pred).sum()))

Reading the training data positive
Reading the training data negative
Defining dictionaries
Reading the test data positive
Reading the test data negative
Computing test feature vectors
Training and testing on training Naive Bayes
Number of mislabeled training points out of a total 1000 points : 12
Training and testing on test Naive Bayes
Number of mislabeled test points out of a total 200 points : 85
Training and testing on train with SVM
Number of mislabeled training points out of a total 1000 points : 188
Testing on test with already trained SVM
Number of mislabeled test points out of a total 200 points : 52


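The previous example uses a small fraction of one of the largest publicly available datasets for sentiment analysis, which includes more than 50,000 texts from movie reviews. In this run, Naive Bayes mislabels only 12 of the 1,000 training reviews but 85 of the 200 test reviews (42.5%), a sign of overfitting, while the SVM mislabels 188 training reviews and 52 test reviews (26%).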

### Task 6: Tweet binary sentiment analysis recognition code

* Another **simple example** of sentiment analysis, this time based on tweets. 
* Other works rely on **larger tweet datasets** ( http://www.sananalytics.com/lab/twitter-sentiment/ ) 

In [14]:
def BoW():
    # Tokenizing text
    text_tokenized = [word_tokenize(doc) for doc in text]
    # Removing punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_docs_no_punctuation = []
    for review in text_tokenized:
        new_review = []
        for token in review:
            new_token = regex.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
    # Stemming (Porter)
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in tokenized_docs_no_punctuation:
        final_doc = ''
        for word in doc:
            final_doc = final_doc + ' ' + porter.stem(word)
        preprocessed_docs.append(final_doc)
    return preprocessed_docs

text = ['I love this sandwich.', 'This is an amazing place!',
        'I feel very good about these beers.',
         'This is my best work.', 'What an awesome view', 'I do not like this restaurant',
         'I am tired of this stuff.', 'I can not deal with this', 'He is my sworn enemy!',
         'My boss is horrible.']

targetTrain = [0,0,0,0,0,1,1,1,1,1]
preprocessed_docs=BoW()
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
trainData = tfidf_vectorizer.fit_transform(preprocessed_docs)

text = ['The beer was good.', 'I do not enjoy my job', 'I aint feeling dandy today',
        'I feel amazing!'
        ,'Gary is a friend of mine.', 'I can not believe I am doing this.']
targetTest = [0,1,1,0,0,1]
preprocessed_docs=BoW()
testData = tfidf_vectorizer.transform(preprocessed_docs)

# GaussianNB needs dense input; toarray() gives a plain ndarray
# (recent scikit-learn releases reject the np.matrix returned by todense())
X_train = trainData.toarray()
X_test = testData.toarray()

gnb = GaussianNB()
y_pred = gnb.fit(X_train, targetTrain).predict(X_train)
print("Number of mislabeled training points out of a total %d points : %d" % (trainData.shape[0],(targetTrain != y_pred).sum()))

print('Training and testing on test Naive Bayes')

# The model is already fitted, so it is only applied to the test set here
y_pred = gnb.predict(X_test)
print("Number of mislabeled test points out of a total %d points : %d" % (testData.shape[0],(targetTest != y_pred).sum()))

print('Training and testing on train with SVM')
clf = svm.SVC(gamma="scale")
clf.fit(X_train, targetTrain)
y_pred = clf.predict(X_train)
print("Number of mislabeled training points out of a total %d points : %d" % (trainData.shape[0],(targetTrain != y_pred).sum()))

print('Testing on test with already trained SVM')
y_pred = clf.predict(X_test)
print("Number of mislabeled test points out of a total %d points : %d" % (testData.shape[0],(targetTest != y_pred).sum()))

Number of mislabeled training points out of a total 10 points : 0
Training and testing on test Naive Bayes
Number of mislabeled test points out of a total 6 points : 2
Training and testing on train with SVM
Number of mislabeled training points out of a total 10 points : 0
Testing on test with already trained SVM
Number of mislabeled test points out of a total 6 points : 2


In this simple scenario, both learning strategies achieve the same recognition rates on both the training and the test set.

Note that similar words are shared between tweets. In practice, with real examples, tweets will include unstructured sentences and abbreviations, making recognition harder. 