
## Analytics Specializations & Applications - Week 4

# Text Analytics 2 - Case Study Exercises
----------

This set of exercises assumes that you have completed the accompanying **"Text Analytics 1 - Preparatory Exercises"** Jupyter notebook. If you haven't, please find and run through that set of exercises first.

### Scenario
Now that we have the tools we need under our belt, consider the following case study scenario. We are a communications/media company like Krow Communications, whose outputs are now receiving reviews and discussion on the web (particularly in the form of YouTube comments). We would like to build a text analytics solution that takes these reviews (which we assume to be unstructured text alone), performs some text analytics on them, and then tells us whether each review was positive or negative (to do this we will need to perform sentiment analysis). Once we have a method of doing this we can score the success of our outputs automatically (and also potentially gauge how reaction towards them changes over time).

The problem is that we currently have no basis for assessing reviews - our media outputs don't get "scored". This problem is called the "cold start" problem - we just don't have any ground truth which we can build a text analytics model against. 

Luckily, we may be able to leverage some "transfer learning" - one of our partners has a dataset of movie reviews that **are** accompanied with a score - so we know if the text they include is broadly positive or negative.

By performing text analytics on this dataset and concentrating on sentiment (rather than movie actors, directors, genres, etc.) we will be able to create a natural language model that will receive any review - such as those we get discussing our advertising campaigns - and tell us something about the author's reaction. We can then document this and use it in our future pitches.

### The dataset
Our transfer dataset consists of 25,000 written movie reviews from the Internet Movie Database, IMDb (www.imdb.com). No movie has more than 30 reviews, and the review text is accompanied by a binary score (with the value 1 if the manual IMDb rating for that review is greater than 6, and the value 0 if the rating is less than 5). From this data we will learn what constitutes a positive and a negative review in terms of text.
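To make the labelling rule concrete, here is a minimal sketch of how a star rating could be turned into the binary score described above. The function name `rating_to_label` is hypothetical, and a 1-10 rating scale is assumed; the dataset description does not say what happens to ratings of 5 or 6, so the sketch simply leaves them unlabelled.

```python
def rating_to_label(rating):
    """Map a star rating to a binary sentiment label.

    Returns 1 for ratings above 6, 0 for ratings below 5, and None for
    the middle ratings, which the dataset description leaves unlabelled.
    """
    if rating > 6:
        return 1
    if rating < 5:
        return 0
    return None


for stars in (2, 5, 8):
    print(stars, "->", rating_to_label(stars))
```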

To analyse this text, so we can understand what constitutes a positive and a negative review in terms of language, we will implement the following:

* Data Collation
* Stripping / Case Folding
* Stemming
* Stopping
* Tokenization 
* Vectorization (and TF-IDF)
* Testing (using Cosine Similarity)
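As a rough illustration of what the text preparation steps in this list do to a single sentence, here is a toy sketch using only the standard library. Everything in it is an assumption made for illustration: the stop list is tiny, `toy_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the steps are applied in whatever order is convenient for the sketch rather than the order listed above.

```python
import re

# A tiny illustrative stop list; real stop lists are much longer.
TOY_STOP_WORDS = {"the", "a", "is", "was", "and", "it", "this"}


def toy_stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = text.lower()                                      # case folding
    tokens = re.findall(r"[a-z]+", text)                     # stripping + tokenization
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]  # stopping
    return [toy_stem(t) for t in tokens]                     # stemming


print(preprocess("The acting was AMAZING, and it kept me watching!"))
# → ['act', 'amaz', 'kept', 'me', 'watch']
```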


Let's begin by loading in the data (which is provided in a file in the same folder as this exercise)...


<span style="font-weight:bold; color:green;">&rarr; Load in and examine the first ten lines of the data</span>

In [1]:
import pandas
data = pandas.read_csv("movie_data.tsv", delimiter="\t")

#-- examine the first 10 lines of the data here
data.head(10)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
5,8196_8,1,I dont know why people think this is such a ba...
6,7166_2,0,"This movie could have been very good, but come..."
7,10633_1,0,I watched this video at a friend's house. I'm ...
8,319_1,0,"A friend of mine bought this film for £1, and ..."
9,8713_10,1,<br /><br />This movie is full of references. ...


Have a look at the last entry - see how it has HTML tags in it. We need to get rid of these (and let's lose the punctuation while we are at it), so let's first do some stripping. I've created a custom function to do this, which is out of scope for this course, so for now we will simply import and apply it:

In [2]:
import html_cleaner
data.review = html_cleaner.remove_html(data.review)

#-- examine the first 10 lines of the data again
data.head(10)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
5,8196_8,1,I dont know why people think this is such a ba...
6,7166_2,0,"This movie could have been very good, but come..."
7,10633_1,0,I watched this video at a friend's house. I'm ...
8,319_1,0,"A friend of mine bought this film for £1, and ..."
9,8713_10,1,This movie is full of references. Like \Mad Ma...
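The `html_cleaner` module itself is not shown in this notebook, so the following is only a guess at the kind of work such a helper might do, written with the standard library. The function name `strip_html` and its `drop_punctuation` flag are hypothetical. Punctuation removal is off by default here because the output above still shows punctuation, which suggests the real helper may leave it in place.

```python
import html
import re


def strip_html(text, drop_punctuation=False):
    """Hypothetical stand-in for a cleaner; not the course's html_cleaner."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop tags such as <br />
    text = html.unescape(text)                    # turn &amp; etc. back into characters
    if drop_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace


sample = "<br /><br />This movie is full of references."
print(strip_html(sample))
print(strip_html(sample, drop_punctuation=True))
```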


In [3]:
#-- describe the data here
data.describe()

Unnamed: 0,sentiment
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


We are going to learn a model that can distinguish reviews with positive sentiment from those with negative sentiment, so first split the data into test and training sets (use the first 20,000 items for the training data and the rest for the test data).

<span style="font-weight:bold; color:green;">&rarr; Split the data into test and training</span>

In [4]:
#-- drop() returns a new DataFrame, so assign the result back
data = data.drop(["id"], axis=1)
train_data = data[:20000]
test_data = data[20000:]
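A positional split like this quietly assumes that the file is not ordered by sentiment. One cheap sanity check is to compare the share of positive labels in each part, sketched here on plain Python lists with made-up labels (the helper names are hypothetical):

```python
def positional_split(rows, n_train):
    """Split rows by position: the first n_train for training, the rest for testing."""
    return rows[:n_train], rows[n_train:]


def positive_share(labels):
    """Fraction of labels equal to 1."""
    return sum(labels) / len(labels)


# Toy labels standing in for the sentiment column (made-up values).
toy_labels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
train_labels, test_labels = positional_split(toy_labels, 8)

print(len(train_labels), len(test_labels))
print(positive_share(train_labels), positive_share(test_labels))
```

If the two shares were very different, shuffling the rows before splitting would be the usual remedy.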

OK, as before, the next step is to vectorize our text data - let's do that first with a simple Count Vectorizer (and examine how much TF-IDF can improve things later).

<span style="font-weight:bold; color:green;">&rarr; Complete the following code</span>

In [5]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

#-- create our vectorizer object, ready to fit and 
#-- transform our data into a vector space format
vectorizer = CountVectorizer()

#-- setup the model's feature space using our training data
vectorizer.fit(train_data.review)

#-- and then convert the training data set into vector format
train_features = vectorizer.transform(train_data.review)

#-- while we are here convert our test dataset in the same way
test_features = vectorizer.transform(test_data.review)

print("Training and test data successfully vectorized")

Training and test data successfully vectorized
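To see what fitting and transforming mean here, the following is a minimal pure-Python sketch of a bag-of-words vectorizer. It is not how scikit-learn implements `CountVectorizer` (which produces sparse matrices and has its own tokenization rules); the function names and the simple regular expression are assumptions for illustration. The key point is that the vocabulary is learned from the training documents only, so words that appear only in test documents are ignored.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of two or more letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


def fit_vocabulary(documents):
    """Assign a column index to every distinct token in the training documents."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def transform(documents, vocabulary):
    """Turn each document into a list of counts, ignoring unseen tokens."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


toy_train = ["Good film, good cast", "Bad film"]
toy_test = ["Good plot"]

vocabulary = fit_vocabulary(toy_train)
print(vocabulary)
print(transform(toy_train, vocabulary))
print(transform(toy_test, vocabulary))   # "plot" was never seen, so it is dropped
```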


Now let's create a model that will understand how sentiment is constructed out of text in some way. For this job we could use any classifier, but given Naive Bayes models have historically been used in text analysis, let's maintain that tradition here:

In [6]:
#-- let's use a multinomial naive bayes classifier
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()

#-- fit the model to our training data - note in this step the model is
#-- finding the relationship between word frequencies and the sentiment
#-- of each review
NB.fit(train_features, train_data.sentiment)
print("Linguistic Model successfully created")

Linguistic Model successfully created
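For intuition about what the fit step is learning, here is a small from-scratch sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is a teaching sketch under simplifying assumptions (documents are already tokenized lists, everything is held in plain dictionaries) and is not the scikit-learn implementation; the function names are hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


toy_docs = [
    ["great", "film"],
    ["loved", "great", "cast"],
    ["awful", "film"],
    ["boring", "awful", "plot"],
]
toy_labels = [1, 1, 0, 0]

prior, likelihood = train_multinomial_nb(toy_docs, toy_labels)
print(predict(["great", "cast"], prior, likelihood))
print(predict(["awful", "plot"], prior, likelihood))
```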


Now let's see how well our model works, by testing it on our holdout dataset (note that we would normally cross-validate here to get a more representative score, but a single holdout test is fine for now):

In [7]:
#-- generate some predictions
results = NB.predict_proba(test_features)
print(results)

[[9.28052187e-01 7.19478130e-02]
 [9.89831089e-01 1.01689113e-02]
 [3.18273809e-02 9.68172619e-01]
 ...
 [9.99454524e-01 5.45475684e-04]
 [9.91074863e-01 8.92513651e-03]
 [8.89195479e-05 9.99911080e-01]]


The results come in two columns for each review - the first column is the probability that it is a negative review, and the second is the probability that it is a positive review. We can come up with an actual prediction of whether the review contains positive sentiment or not by seeing if the second column is > 0.5 or not (our threshold):

In [8]:
#-- note the neat syntax here: First we index the result's second column using
#-- [:,1] and then we test if it is more than 0.5 and hence a positive review
predictions = results[:,1] > 0.5

#-- the entries which were more than 0.5 are designated as True
print(predictions)

[False False  True ... False False  True]
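The 0.5 threshold is a choice rather than a rule. As a small sketch with made-up probabilities, this shows how raising the threshold makes the classifier more reluctant to call a review positive (the function name `classify` is hypothetical):

```python
def classify(positive_probabilities, threshold=0.5):
    """Label a review positive when its positive-class probability exceeds the threshold."""
    return [p > threshold for p in positive_probabilities]


# Made-up probabilities for illustration only.
toy_probabilities = [0.07, 0.55, 0.97, 0.45]

print(classify(toy_probabilities))
print(classify(toy_probabilities, threshold=0.6))
```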


In [9]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_data.sentiment, predictions)
print("We predicted the sentiment of {0:.01f}% of reviews correctly".format(acc*100))

We predicted the sentiment of 84.2% of reviews correctly
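Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on made-up data and also counts the four kinds of outcome, which can be more informative than a single number. The helper names are hypothetical, and the predictions are booleans to mirror the thresholded output above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == int(p))
    return matches / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p and t == 1:
            counts["tp"] += 1
        elif p and t == 0:
            counts["fp"] += 1
        elif not p and t == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up labels and boolean predictions, for illustration only.
toy_truth = [1, 0, 1, 1, 0, 0]
toy_predictions = [True, False, False, True, True, False]

print(round(accuracy(toy_truth, toy_predictions), 3))
print(confusion_counts(toy_truth, toy_predictions))
```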


84% is not bad at all, given we are using a simple and quick bag of words approach. In fact this is no doubt good enough for the business task, and we could start applying the model to our own reviews.

However, we can do better as we've omitted some useful steps. Your challenge is now to see how much you can improve the results of this model by implementing:
> * Stopping
> * Stemming
> * Case Folding
> * and a TfidfVectorizer()

Also consider:
> * What can you find out about the important features (i.e. which words are most influential?) 
> * Can you design a query that fools the model? Tip: consider including negative words even though the review is good...

Good luck! And ask for help if you run out of ideas - we will be going through the solution on Tuesday, along with the coursework release.




<br/>

### SOLUTIONS SECTION

Below is an example solution to some of the above challenges. These aren't the only ways of doing all the data cleansing, and finding important features, so if you've found another way then that's great.

Below is code that creates a vectorizer that uses TF-IDF, which we give a tokenizer that keeps only alphabetical words. Note that the vectorization can take a few minutes to run (the star next to the code block in Jupyter indicates it is still processing).

In [10]:
import pandas
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
import html_cleaner

#-- first let's create a function that will do our stemming
#-- for us when we need it to, along with removing punctuation...
def tokenize_and_stem(text):
    stemmer = PorterStemmer()
    stemmed_tokens = []
    
    #-- let's just extract alphabetical words of at least 3 characters
    tokens = nltk.RegexpTokenizer(r'[a-zA-Z]{3,}').tokenize(text)
    for t in tokens:
        stemmed_tokens.append(stemmer.stem(t))
    return stemmed_tokens


#-- create our vectorizer object, and fit our data, cleansed by:
#-- 1. supplying the vectorizer our custom stemming function
#-- 2. telling it to use its inbuilt stop words
#-- 3. telling it to convert all the text to lowercase itself
vectorizer_new = TfidfVectorizer(
                 tokenizer = tokenize_and_stem,
                 stop_words='english', 
                 lowercase=True)
print("Improved text vectorizer successfully created...")

#-- and use this upgraded vectorizer to fit our customer reviews
train_features = vectorizer_new.fit_transform(train_data.review)
print("Training reviews successfully vectorized...")

test_features = vectorizer_new.transform(test_data.review)
print("Testing reviews successfully vectorized.")


Improved text vectorizer successfully created...




Training reviews successfully vectorized...
Testing reviews successfully vectorized.
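To show what TF-IDF weighting adds over raw counts, here is a minimal pure-Python sketch. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which to my understanding matches the defaults documented for scikit-learn's `TfidfVectorizer`; treat that correspondence as an assumption rather than a guarantee. The function name and toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Compute L2-normalised TF-IDF weights with a smoothed IDF term."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


toy_docs = [["great", "film"], ["bad", "film"], ["great", "great", "cast"]]
toy_vectors = tfidf_vectors(toy_docs)

# "film" appears in two of the three documents, so within the second
# document it is weighted lower than the rarer word "bad".
print(toy_vectors[1])
```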


In [11]:
#-- list the tokens learned by the improved vectorizer
tokens = vectorizer_new.get_feature_names()

In [12]:
from sklearn.naive_bayes import MultinomialNB

NB_new = MultinomialNB()
NB_new.fit(train_features, train_data.sentiment)
print("Linguistic Model successfully created")

Linguistic Model successfully created


In [13]:
from sklearn.metrics import accuracy_score

#-- lets run our test data through the sentiment analyser
results = NB_new.predict_proba(test_features)

#-- and then see how good its predictions were now
predictions = results[:,1] > 0.5
acc = accuracy_score(test_data.sentiment, predictions)
print("We predicted the sentiment of {0:.2f}% of reviews correctly".format(acc*100))

We predicted the sentiment of 85.50% of reviews correctly


Well, 85.5% is definitely an improvement (even if 1.3 percentage points doesn't seem a huge one in this case, it could well be significant in practice!). Now let's move on to see if we can unpick how our customer review analyser is working and address the question:

> * What can you find out about the important features (i.e. which words are most influential?) 

Our text classification model gives every word a coefficient to show how important it is - the higher the better. For this Naive Bayes model the coefficients are log-probabilities of each word given the positive class, which is why they are negative, so "higher" translates to "closer to zero". Let's first join these up with their corresponding names:

In [14]:
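#-- extract the feature names (scikit-learn 1.2+ removed this method;
#-- use get_feature_names_out() there instead)
feature_names = vectorizer_new.get_feature_names()

#-- extract the coefficient assigned to each feature (coef_ was removed
#-- from MultinomialNB in scikit-learn 1.1; feature_log_prob_[1] holds
#-- the same values for this two-class model)
coefficients = NB_new.coef_[0]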

#-- pair the words and their coefficients
word_coefficients = list(zip(coefficients, feature_names))
print( word_coefficients[5000:5010])

[(-11.709844112606802, 'brenna'), (-10.408091352572558, 'brennan'), (-11.509167275048224, 'brennen'), (-10.415827639374001, 'brent'), (-11.6360779618455, 'brenten'), (-11.528025678409094, 'brentwood'), (-11.709844112606802, 'brereton'), (-11.709844112606802, 'breslin'), (-11.709844112606802, 'bresnahan'), (-10.876968869350712, 'bressart')]




To find the important words that indicate sentiment, we now just need to sort these words in order of their coefficient:

In [15]:
#-- sort the words based on their coefficient score - note the odd "lambda"
#-- syntax in the function - this essentially just says use the first 
#-- element (i.e. the score) of the coefficient/name pairs to do the sorting:
word_importance = sorted(word_coefficients, key=lambda x: x[0])
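If the `key=lambda` idiom is unfamiliar, this tiny standalone sketch may help. The three pairs are rounded from values that appear in this notebook's outputs, and `operator.itemgetter` is shown as an equivalent alternative to the lambda.

```python
from operator import itemgetter

pairs = [(-6.3, "good"), (-11.7, "brenna"), (-5.5, "film")]

# Sort ascending by the first element of each pair (the score).
ascending = sorted(pairs, key=lambda pair: pair[0])
print(ascending)

# itemgetter(0) does the same job as the lambda above.
print(sorted(pairs, key=itemgetter(0)) == ascending)

# The highest-scoring pairs sit at the end of an ascending sort.
print(ascending[-2:])
```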

Now we have an ordered list, we can look at the last 20 say, to find the words that are important in detecting sentiment - and hence distinguishing positive reviews from negative ones:

In [16]:
for a,b in word_importance[-20:]:
    print(a, b)

-6.711050926237108 play
-6.666375677565675 best
-6.579716975004585 realli
-6.575640906766945 make
-6.522937570646216 just
-6.472685608123468 charact
-6.401405032632604 watch
-6.361756448785925 love
-6.361305580799283 time
-6.360537299145098 ha
-6.3553334816843305 stori
-6.335025571568903 good
-6.284764713363029 great
-6.262635951894112 veri
-6.237542071214977 like
-5.92191010047866 hi
-5.6724097783330265 wa
-5.5137134989245595 film
-5.500440529092075 movi
-5.401471479131113 thi


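This looks sensible - important words like "like", "good", "great" and "love" are high up there as we'd expect, along with the stems of "story" and "characters", which fit with what people look for in good movies. Notice that stems such as "thi", "wa", "hi" and "ha" also rank highly: the stemmer runs before the stop word filter, so "this", "was", "his" and "has" are reduced to forms the stop list no longer recognises.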
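One caveat worth exploring: ranking words by a single class's log-probability tends to favour words that are simply frequent (such as "movi" and "film") rather than words that separate the classes. A common alternative, sketched here on made-up counts with hypothetical names, is to rank by the difference between the positive and negative log-probabilities. With a fitted scikit-learn `MultinomialNB`, the analogous quantity would be `feature_log_prob_[1] - feature_log_prob_[0]`; that has not been run on this dataset here.

```python
import math


def log_odds_ranking(pos_counts, neg_counts, alpha=1.0):
    """Rank words by log P(word | positive) - log P(word | negative)."""
    vocab = sorted(set(pos_counts) | set(neg_counts))
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        log_p_pos = math.log((pos_counts.get(word, 0) + alpha) / pos_total)
        log_p_neg = math.log((neg_counts.get(word, 0) + alpha) / neg_total)
        scores[word] = log_p_pos - log_p_neg
    # Most negative-leaning words first, most positive-leaning words last.
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up word counts per class, for illustration only.
toy_pos = {"movi": 50, "great": 20, "bad": 2}
toy_neg = {"movi": 50, "great": 3, "bad": 25}

ranking = log_odds_ranking(toy_pos, toy_neg)
for word, score in ranking:
    print(word, round(score, 2))
```

In this toy example "movi" is the most frequent word in both classes, yet its score is close to zero because it says little about sentiment.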

In terms of "transfer learning" these items, however, may not apply to adverts/brands - what still remains is to try and create a review that can fool our model to explore this!

In [17]:
#-- create some test "reviews"
test_reviews = [
    "I'm not sure about this advert - it is selling a bad brand!",
    "I love this advert - it is selling a good brand!",
    "This is excellent work",
    "What is this rubbish?",
    "Please save us from this nonsense",
    "I enjoyed watching this",
    "I wanted to say this advert is bad, but I can't - just the opposite in fact!"
]

#-- vectorize it
vec_test = vectorizer_new.transform( test_reviews )

#-- Run it through the sentiment analyser
results = NB_new.predict_proba(vec_test)

#-- Examine the sentiment the model detects
for t, r in zip(test_reviews, results):
    if r[1] > 0.5:
        print(t, "--> POSITIVE REVIEW")
    else:
        print(t, "--> NEGATIVE REVIEW")

I'm not sure about this advert - it is selling a bad brand! --> NEGATIVE REVIEW
I love this advert - it is selling a good brand! --> POSITIVE REVIEW
This is excellent work --> POSITIVE REVIEW
What is this rubbish? --> NEGATIVE REVIEW
Please save us from this nonsense --> NEGATIVE REVIEW
I enjoyed watching this --> POSITIVE REVIEW
I wanted to say this advert is bad, but I can't - just the opposite in fact! --> NEGATIVE REVIEW


Notice how all the reviews are categorized correctly... apart from the last one. With some carefully worded phrasing, we have tricked our model. Nonetheless, even with a simple bag of words approach we have a useful tool for the business, which we can now use to track the company's influence.
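The reason the last review fools the model is that a bag of words throws away word order, so "bad" counts against the review regardless of what surrounds it. One partial remedy to experiment with is adding n-grams (short word sequences) as features. The sketch below uses a hypothetical `ngrams` helper to show what that looks like; `TfidfVectorizer` exposes an `ngram_range` parameter for the same idea. Note that this would only help with adjacent negation such as "not bad", and probably not with the longer-range phrasing used in the last test review.

```python
import re


def ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("This advert is not bad", 1))
print(ngrams("This advert is not bad", 2))
```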