# Part 1: VADER

In this example we will classify text as negative or positive statements.
Before you continue, please make sure that you have NLTK installed:
http://www.nltk.org/

If everything is set up correctly, the cell below should run without errors.

In [1]:
import pandas as pd
import numpy as np
import nltk
%matplotlib inline

In [2]:
# Let's start by using a text dataset from sklearn
# This dataset contains around 18000 newsgroup posts
# http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)

In [3]:
# Now we'll start using VADER.
# NLTK itself must already be installed (see the top of this notebook).
# Run these two lines of code to fetch the VADER lexicon,
# which VADER needs to score negative/positive words
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jens/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [4]:
# Here we create our sentiment analyser!
from nltk.sentiment.vader import SentimentIntensityAnalyzer
model = SentimentIntensityAnalyzer()



In [6]:
# Using model.polarity_scores(), give VADER a string
# from the newsgroup dataset and print the results.
# What do the numbers mean?
model.polarity_scores(newsgroups_train.data[0])

{'compound': 0.807, 'neg': 0.012, 'neu': 0.916, 'pos': 0.072}
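
To help with the question in the cell above: `neg`, `neu` and `pos` are the proportions of the text that fall into each category (they add up to roughly 1), while `compound` is a single summary score squashed into the range -1 to 1. The sketch below illustrates that squashing step as I understand it from the NLTK VADER source, where the summed word valences x are divided by sqrt(x² + alpha) with alpha = 15. The function name `normalize_valence` and the sample sums are made up for illustration.

```python
import math


def normalize_valence(valence_sum, alpha=15.0):
    """Squash an unbounded sum of word valences into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Made-up valence sums, only to show how the curve saturates near -1 and 1
for valence_sum in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(valence_sum, round(normalize_valence(valence_sum), 3))
```

A long post with many mildly positive words can therefore reach a high `compound` score even when `pos` is small, which is likely what happens with the 0.807 above.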

In [7]:
newsgroups_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [8]:
# Cool! Now we can see how happy something is.
# Now extract the scores for each email
scores = [model.polarity_scores(x) for x in newsgroups_train.data]

In [10]:
type(scores[0])

dict

In [12]:
# Good job! But it's a little confusing with all those numbers...
# Let's make it simpler by reducing all that data into 'h' and 'n'.
# 'h' if the post is more happy than negative, and 'n' if the
# post is more negative than happy.
# Put this as an additional column in a pandas data frame.
# We treat a post as happy if its compound score is >= 0.
# Similarly, a post is negative if its compound score is < 0.
def happysad(n):
    if n >= 0:
        return 'h'
    else:
        return 'n'
    
hs_score = [happysad(d['compound']) for d in scores]

In [15]:
len([sad for sad in hs_score if sad == 'n'])

3341
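
The split above counts a compound score of exactly 0 as 'h', so completely neutral posts end up labelled happy. A commonly cited convention from the VADER documentation uses a small dead zone instead: positive at compound >= 0.05, negative at compound <= -0.05, and neutral in between. A minimal sketch of that variant, where the helper name `label_sentiment` and the example scores are made up:

```python
from collections import Counter


def label_sentiment(compound, threshold=0.05):
    """Return 'h', 'n' or 'neutral' using a symmetric dead zone around 0."""
    if compound >= threshold:
        return 'h'
    if compound <= -threshold:
        return 'n'
    return 'neutral'


# Made-up scores in the same dict format that polarity_scores() returns
example_scores = [
    {'compound': 0.807},
    {'compound': -0.42},
    {'compound': 0.0},
    {'compound': 0.03},
]
labels = [label_sentiment(d['compound']) for d in example_scores]
counts = Counter(labels)  # tallies every label in one pass
print(counts)
```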

In [None]:
# Hint: You can use the pandas .assign method
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html
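
One possible way to follow the hint, sketched on a tiny stand-in for the `scores` list. The first dict is the output shown earlier; the second is made up, and the column name `happy` is an arbitrary choice.

```python
import pandas as pd


def happysad(n):
    """Same rule as in the notebook: 'h' for compound >= 0, otherwise 'n'."""
    return 'h' if n >= 0 else 'n'


# Stand-in for the real `scores` list (second entry is invented)
example_scores = [
    {'compound': 0.807, 'neg': 0.012, 'neu': 0.916, 'pos': 0.072},
    {'compound': -0.5, 'neg': 0.3, 'neu': 0.6, 'pos': 0.1},
]

# A list of dicts becomes one row per post, one column per key
df = pd.DataFrame(example_scores)

# .assign returns a new data frame with the extra column added
df = df.assign(happy=df['compound'].apply(happysad))
print(df)
```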


# Part 2: Text classification

Now we'll start to analyse the text and actually classify which category each
text belongs to, based on the content of the emails. So we will have
to predict a categorical variable, given some text.

You probably noticed that the emails contain a "target" variable as well. This
is a number indicating what category the email is about. This is the variable
we want to predict!

In [16]:
# First we have to transform text into something we can work
# with (... numbers of course!! we love numbers!)
# Import the TfidfVectorizer below and instantiate a new vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [22]:
# Now that we have a vectorizer we have to fit it to our data
crazy_vector = vectorizer.fit_transform(newsgroups_train.data)

In [23]:
len(newsgroups_train.data)

11314

In [21]:
crazy_vector.shape

(11314, 130107)

In [26]:
# Check out the shape of the vector - what are you seeing?
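
Each row of the matrix is one post and each column is one distinct token in the vocabulary, which is why 11314 posts give 11314 rows and 130107 columns. The result is a sparse matrix, since most tokens never appear in any given post. A toy sketch with a made-up three-sentence corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 3 documents, 7 distinct tokens
toy_posts = [
    "the car is fast",
    "the car is red",
    "space is big",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_posts)

print(toy_matrix.shape)                    # (3, 7)
print(sorted(toy_vectorizer.vocabulary_))  # the tokens behind the 7 columns
```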

In [27]:
set(newsgroups_train.target)

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}

In [30]:
# Now that we have a representation of text we can actually
# work with (numbers!), we can start to classify the category
# Start by importing the k-nearest neighbours classifier and
# instantiate a new KNeighborsClassifier model - remember to set the
# number of neighbours with n_neighbors=X !
# Note: this rebinds `model`, replacing the VADER analyser from Part 1.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=20)


In [32]:
# Now that you have the KNeighborsClassifier, fit it to your
# tfidf vectors as your input (X) and the actual target
# category as your output (y)
model.fit(crazy_vector, newsgroups_train.target)

# Hint: Instead of applying the vectorizer and the classifier one by one
# you can combine them into a single pipeline instead
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=20, p=2,
           weights='uniform')
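
The pipeline hint in the cell above could look roughly like this. It is a sketch on a made-up four-post corpus with invented labels, and it uses `n_neighbors=1` only because the toy data is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up posts and labels: 0 = cars, 1 = space
toy_posts = [
    "the engine and the brakes of this car",
    "a fast car with a loud engine",
    "the rocket reached orbit around the moon",
    "a satellite in orbit around the planet",
]
toy_targets = [0, 0, 1, 1]

# The pipeline vectorizes raw text and feeds the result to the classifier
toy_pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
toy_pipeline.fit(toy_posts, toy_targets)

# predict() accepts raw strings because the vectorizer is part of the pipeline
prediction = toy_pipeline.predict(["my car needs a new engine"])
print(prediction)
```

With the real data, the same pipeline object would be fitted on `newsgroups_train.data` and `newsgroups_train.target` directly, without a separate `fit_transform` step.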

In [34]:
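# Finally we have to figure out how well that went.
# Use accuracy_score to see what fraction of the categories
# we got right! Note that we score the model on the same
# posts it was fitted on, so this is training accuracy.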
from sklearn.metrics import accuracy_score
accuracy_score(newsgroups_train.target, model.predict(crazy_vector))

0.79901007601202045
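
Because the score above comes from the training posts themselves, a held-out set usually gives a fairer estimate of how the classifier handles unseen text. A minimal sketch of that pattern on made-up toy data (all posts and labels below are invented). The key detail is that the test posts go through `transform`, not `fit_transform`, so they are mapped onto the vocabulary learned from the training posts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training and held-out posts: 0 = cars, 1 = space
train_posts = [
    "the engine and the brakes of this car",
    "a fast car with a loud engine",
    "the rocket reached orbit around the moon",
    "a satellite in orbit around the planet",
]
train_targets = [0, 0, 1, 1]
test_posts = [
    "the car has a loud engine",
    "the moon is a satellite in orbit",
]
test_targets = [0, 1]

toy_vectorizer = TfidfVectorizer()
train_vectors = toy_vectorizer.fit_transform(train_posts)  # learn the vocabulary
test_vectors = toy_vectorizer.transform(test_posts)        # reuse it unchanged

toy_model = KNeighborsClassifier(n_neighbors=1)
toy_model.fit(train_vectors, train_targets)

held_out_accuracy = accuracy_score(test_targets, toy_model.predict(test_vectors))
print(held_out_accuracy)
```

In the notebook itself, the held-out posts could come from `fetch_20newsgroups(subset='test')`.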