# Part 1: VADER

In this example we will classify text as negative or positive statements.
Before you continue, please make sure that you have NLTK installed:
http://www.nltk.org/

If everything is set up correctly, the cell below should run without errors.

In [1]:
import pandas as pd
import numpy as np
import nltk
%matplotlib inline

In [None]:
# Let's start by using a text dataset from sklearn.
# This dataset contains around 18,000 posts from 20 newsgroups
# (we load only the training subset here).
# http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)

In [None]:
# Now we'll start using VADER.
# For that you first have to install NLTK.
# Once you have done that, run these two lines of code to fetch
# the VADER lexicon, which it needs to score negative/positive words.
import nltk
nltk.download('vader_lexicon')

In [None]:
# Here we create our sentiment analyser!
from nltk.sentiment.vader import SentimentIntensityAnalyzer
model = SentimentIntensityAnalyzer()

In [None]:
# Using model.polarity_scores(), give VADER a string
# from the newsgroup data and print the results.
# What do the numbers mean?

In [None]:
# Cool! Now we can see how happy something is.
# Now extract the scores for each email

In [None]:
# Good job! But it's a little confusing with all those numbers...
# Let's make it simpler by reducing all that data into 'H' and 'N'.
# 'H' if the post is more happy than negative, and 'N' if the
# post is more negative than happy.
# Put this as an additional column in your pandas data frame.
# A post counts as happy if its compound score is >= 0.
# Similarly, a post counts as negative if its compound score is < 0.

In [None]:
# Hint: You can use the pandas .assign method
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html
scores = [model.polarity_scores(t) for t in newsgroups_train["data"]]
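
One possible way to turn a list of score dictionaries into a labelled data frame is sketched below. To keep it self-contained, it uses three hand-written dictionaries (`example_scores`, purely hypothetical values) in place of the real `scores` list. The column name `label` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical VADER-style outputs standing in for the real `scores` list
example_scores = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

# One row per post, one column per score
df = pd.DataFrame(example_scores)

# 'H' when compound >= 0, otherwise 'N'; .assign returns a new data frame
df = df.assign(label=np.where(df["compound"] >= 0, "H", "N"))

print(df)
```

With the real data, `pd.DataFrame(scores)` would produce the same four score columns, one row per email.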

# Part 2: Text classification

Now we'll start to analyse the text and predict which category each email
belongs to, based on its content. In other words, we will predict a
categorical variable, given some text.

You probably noticed that the emails come with a "target" variable as well. This
is a number indicating which category the email belongs to. This is the variable
we want to predict!

In [None]:
# First we have to transform text into something we can work
# with (... numbers of course!! we love numbers!)
# Import the TfidfVectorizer below and instantiate a new vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Now that we have a vectorizer we have to fit it to our data

In [None]:
# ... and finally we have to use the vectorizer to transform
# our text into high-dimensional vectors

In [None]:
# Check out the shape of the vector - what are you seeing?
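
The three steps above (instantiate, fit, transform) can be sketched on a tiny made-up corpus. `toy_docs` is hypothetical and only stands in for the newsgroup texts. The result is a sparse matrix with one row per document and one column per distinct term in the fitted vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for newsgroups_train.data
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and the inverse document frequencies
vectorizer.fit(toy_docs)

# Turn each document into a row of TF-IDF weights
X = vectorizer.transform(toy_docs)

# (number of documents, number of distinct terms)
print(X.shape)
```

On the full newsgroup data the second dimension will be far larger, since every distinct token becomes its own column. `fit_transform` performs both steps in one call.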

In [None]:
# Now that we have a representation of text we can actually
# work with (numbers!), we can start to classify the category
# Start by importing the k-nearest neighbours classifier and
# instantiate a new KNeighborsClassifier model - remember to set
# the number of neighbours with n_neighbors=X !
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Now that you have the KNeighborsClassifier, fit it with your
# tfidf vectors as your input (X) and the actual target
# category as your output (y)
# Hint: Instead of creating all these models one by one
# you can create them as a single pipeline instead
# http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline
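
A minimal sketch of the pipeline idea, assuming the `KNeighborsClassifier` imported earlier is the intended model. The texts, labels and `n_neighbors=1` are made up for illustration and are not tuned for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = cars, 1 = space
toy_texts = [
    "the engine of my car makes a strange noise",
    "which tires fit this car model",
    "the rocket reached orbit after launch",
    "the telescope observed a distant galaxy",
]
toy_labels = [0, 0, 1, 1]

# The pipeline vectorizes the text and then classifies the vectors
pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))

# Fit on raw text; the vectorizer is fitted as part of the pipeline
pipeline.fit(toy_texts, toy_labels)

predictions = pipeline.predict(toy_texts)
print(predictions)
```

With the newsgroup data you would fit on `newsgroups_train.data` and `newsgroups_train.target` instead of the toy lists.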

In [None]:
# Finally we have to figure out how well that went.
# Use accuracy_score from sklearn.metrics to see how many
# categories we got right!
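
As a small illustration of how `accuracy_score` works: it compares the true labels with the predicted ones and returns the fraction that match. The two label lists below are hypothetical.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted category numbers
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

# Fraction of positions where prediction and truth agree (4 of 5 here)
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)
```

Accuracy measured on the same emails the model was fitted on tends to look better than it should. A fairer check would likely use held-out data, for example the posts returned by `fetch_20newsgroups(subset='test')`.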