In [None]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction import _stop_words
from sklearn.metrics.pairwise import cosine_similarity as cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.tokenize.casual import casual_tokenize
import pandas as pd
#from nlpia.data.loaders import get_data
from sklearn.decomposition import TruncatedSVD
import numpy as np
# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Week 6 Classification 

In previous weeks, we've looked at how we can turn text documents into vectors of numbers, and how these numbers give us some idea of the meaning of that text. We can then use these vectors for tasks like search, query and recommendation. 

### Unsupervised Algorithms

When we used techniques like **SVD** or **LDA** to create topics to assign documents to, this was a form of **unsupervised** algorithm. This is because we just gave the text to the algorithm and got it to decide how it wanted to split up the data, and what each topic would represent. 

The algorithm just invented some topics, then decided how much each document belonged to each topic. We could then go through and manually try to ascribe a theme to each topic if we wanted to, for example

- Topic 1: The cat topic
- Topic 2: The computer topic
- Topic 3: The dog food and treehouses topic?

Sometimes this worked and there was a clear, human understandable concept that encapsulated the topic well, and sometimes this wasn't possible. 



### Pandas Data Frames 

Hopefully by now we have a good handle on the data structures for holding data in Python. We've seen **vectors** (1D arrays) and **matrices** (2D arrays, or arrays of arrays). These can either be a native Python `list`, or a `numpy array`. 

We've also seen data frames from the pandas library a few times already, mainly for their nice display qualities. We'll formally introduce them now, as they will start making ever more appearances! 

You can initialise them from an existing array, and use column and row names to index them instead of just numbers!

In [None]:
a = np.array([1,2,3,4])

In [None]:
#Initialising 
#Pass an array or list into the constructor 
a = np.arange(9).reshape((3,3))
df = pd.DataFrame(a)
df = pd.DataFrame(a, columns = ["col1","col2","another boring name"])
df = pd.DataFrame(a, index = ["row1","row2", "row3 your boat"], columns = ["col1","col2","another boring name"])
print(df)

In [None]:
#Accessing by column name, then by row name or row position
print(df["col1"]["row1"])
print(df["col1"].iloc[0])

#iloc for indexes
print(df.iloc[0])
print(df.iloc[1])
print(df.iloc[1,1])

### The NewsGroup Dataset

This dataset contains newsgroup posts about space and about hockey.

In [None]:
df = pd.read_csv("../data/space_hockey.csv")

In [None]:
df

In [None]:
#See some random space posts
test = df["label"] == 0
random_post = df[test].sample(1)
print(random_post["text"].item())

In [None]:
#See some random hockey posts
test = df["label"] == 1
random_post = df[test].sample(1)
print(random_post["text"].item())

In [None]:
#I've added in my own tokens I want to remove
#A set makes the membership check below fast
remove = set(_stop_words.ENGLISH_STOP_WORDS) | {"%","!",",",":","@","£","$","&","*","(",")","?","<",">",".","+","_","|","/","-"}
def my_tokeniser(doc):
    #Split the document into tokens
    tokens = casual_tokenize(doc)
    processed = []
    for t in tokens:
        #make lowercase
        t = t.lower()
        #Remove stop words
        if t not in remove:
            processed.append(t)
    #Return a list of tokens for that document
    return processed
#Make TFIDF Vectoriser
vectoriser = TfidfVectorizer(tokenizer=my_tokeniser)
#Fit the model
tfidf_model = vectoriser.fit(df["text"])
#Get the vocab
vocab = np.array(tfidf_model.get_feature_names_out())

#Get vectors for everything
vectorised = tfidf_model.transform(df["text"]) #.todense()
vectorised_df = pd.DataFrame(vectorised.todense(), columns = vocab)
#Get vectors for space articles
vectorised_space = tfidf_model.transform(df[df["label"] == 0]["text"])
vectorised_space_df = pd.DataFrame(vectorised_space.todense(), columns = vocab)
#Get vectors for hockey articles
vectorised_hockey = tfidf_model.transform(df[df["label"] == 1]["text"])
vectorised_hockey_df = pd.DataFrame(vectorised_hockey.todense(), columns = vocab)


In [None]:
#Get SVD vectors
svd = TruncatedSVD(n_components = 16, n_iter = 100) 
svd_topic_vectors = svd.fit_transform(vectorised)

In [None]:
topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=['topic{}'.format(i) for i in range(16)])
topic_weights #display it

In [None]:
topic_weights.topic2.sort_values(ascending=False).head(10) # show top 10 weighted words for topic 2

# Supervised Machine Learning 

But what if we had specific labels that we wanted to attach to documents that we knew about before hand? 

- Is this text spam or not?
- Is this book about horror, sci-fi or cooking?
- Is this song going to be a hit?

This is where **supervised machine learning** comes in, specifically, **classification**. 

To try and get a feel for the task of classification, we're going to have a look at each class and see if we think it will be easy to write a computer program that can pick apart these two groups.

We're going to try an experiment to see if we can find two words that would allow us to decide if a document was from the space group or the hockey group.

In [None]:
#Sum up TFIDF score for each word across all documents
space_sums = pd.DataFrame(vectorised_space.sum(axis = 0).T, index=vocab, columns = ["sums"])
hockey_sums = pd.DataFrame(vectorised_hockey.sum(axis = 0).T, index=vocab, columns = ["sums"])
print("\nSPACE:\n", space_sums["sums"].sort_values(ascending=False).head(30), "\nHOCKEY:\n", hockey_sums["sums"].sort_values(ascending=False).head(30))

### Selecting the TFxIDF values for a specific token

We're going to plot the `TFxIDF` values for **each post** for **two specific tokens**.

The arrays that hold our `TFxIDF` values have the shape ``numPost x lengthOfVocab`` 

In [None]:
vectorised.shape

We can use a **filter** (a boolean mask) to get the index of a **given token**

```
filter = vocab == features[0]
```

Then we can grab all the rows (using the colon `:`) for that column. 

We end up with **2 features** for each post (one per token), which we can plot on a graph. The colour represents **which news group the post was originally from**.
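
As a minimal sketch of this selection, the snippet below uses a tiny made-up matrix rather than the real newsgroup data. The names `toy_vocab` and `toy_tfidf` and all the scores are hypothetical, chosen only to show how the mask and the colon work together.

```python
import numpy as np

# Hypothetical toy data: 3 posts (rows) x 4 vocabulary tokens (columns)
toy_vocab = np.array(["game", "moon", "nhl", "orbit"])
toy_tfidf = np.array([
    [0.0, 0.7, 0.0, 0.5],   # a space-like post
    [0.6, 0.0, 0.8, 0.0],   # a hockey-like post
    [0.1, 0.0, 0.0, 0.0],   # a post that uses neither word much
])

# Comparing the vocab array to a token gives one True/False per column
mask = toy_vocab == "moon"
print(mask)

# The colon keeps every row; the mask keeps only the matching column
moon_scores = toy_tfidf[:, mask]
print(moon_scores.shape)
print(moon_scores.ravel())
```

The result still has two dimensions (one column), which is why later cells sometimes flatten it before use.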

### Indexes **and** objects in `for loops` with `enumerate()`

Previously, we have looked at two main approaches to `for loops`. We either iterate through an array and get the **values themselves**

```
chapters = ["chapter1", "chapter2", "chapter3"]
for c in chapters:
    analyse(c)

```

Or we've used `range()` to iterate over some indexes

```
chapters = ["chapter1", "chapter2", "chapter3"]
for i in range(len(chapters)):
    analyse_chapter_at(i)

```

But **YOU CAN HAVE THEM BOTH**

Below, we use the `enumerate()` function. This returns **two values** stored in **two separate variables**. The first contains the index, and the second the actual object. 

Turns out you can have it all. 
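
Here is a small sketch of `enumerate()` on the same `chapters` list used above. The `seen` list is only there so we can inspect what the loop produced.

```python
chapters = ["chapter1", "chapter2", "chapter3"]

# enumerate() yields (index, value) pairs, unpacked here into i and c
seen = []
for i, c in enumerate(chapters):
    print(i, c)
    seen.append((i, c))
```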

Check out the code below and see if you can understand it!

In [None]:
#Look up the documentation for the plot method we use below
from matplotlib.axes import Axes
?Axes.plot

In [None]:
#Which TFIDF scores would be useful when trying to determine which class?
import matplotlib.pyplot as plt
classes = [vectorised_space.todense(), vectorised_hockey.todense()]

#Good features
features = ['nhl', 'moon']

#Bad features
#features = ['article', 'subject']

fig, ax = plt.subplots(figsize=(12,8))
col = "bo"
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
for i, c in enumerate(classes):
    #Get TFIDF for all posts, for the column for each vocab
    x = c[:, vocab == features[0]]
    y = c[:, vocab == features[1]]
    ax.plot(x, y, col, label = "space" if i == 0 else "hockey")
    col = "gx"
ax.legend()

In [None]:
#Lots of values near 0
fig, ax = plt.subplots(figsize=(12,8))
for i,c in enumerate(classes):
    x = c[:, vocab == features[0]]
    y = c[:, vocab == features[1]]
    ax.set_xlim([0,0.5])
    ax.set_yscale("log")
    
    ax.hist(x, bins = 100)
    ax.hist(y, bins = 100)
    

### Supervised Learning Overview

In [None]:
from IPython.display import Image
Image("../media/supervisedlearninglearner.png")

### Models

Models don't have to be constructed with machine learning. A model is simply something that, when given an **input** (in our case some vectorised text), is able to produce some **output**. In the case of **classification** this will always be a single discrete label. 

This model could be made by hand, and in fact some text classification models were made by hand until quite recently. However, this can be time consuming and complicated, and result in models which are not particularly robust. 

Machine learning allows us to make **models** from **data**. This means that given **new data** that the model hasn't seen before, we can assign a new label to that new data. 

### What's the simplest data-driven model?

Let's think of the simplest model we can make, using the data we have. For example, because our two classes are roughly balanced, we could get about 50% accuracy by just picking the same class every time without even looking at the data. Generally, we would hope to be doing better than this **baseline**, or chance!
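
One way to sketch this "always pick the same class" baseline is with `DummyClassifier` from `sklearn`. The labels below are made up (50 of each class) and are an assumption for illustration, not the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical balanced labels: 50 "space" (0) and 50 "hockey" (1)
toy_y = np.array([0] * 50 + [1] * 50)
# The baseline ignores the inputs, so a placeholder column of zeros is enough
toy_X = np.zeros((len(toy_y), 1))

# Always predict whichever class was most common in the training labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)
baseline_acc = baseline.score(toy_X, toy_y) * 100
print("Baseline accuracy: %0.1f percent" % baseline_acc)
```

Any model we train should beat this number, otherwise it has not learned anything useful from the features.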

We can do much better than that using **learning algorithms** to generate our models in a process called **training**. We will cover a couple of learning algorithms used for classification later in the lecture. 


### Datasets

The format of our dataset is quite simple. Each example (in our case each text document), is represented by 1 or more numbers. This might be a Bag of Words Vector, a TF/IDF vector, or a Topic Model Vector, or some other representations we'll learn about later on in the course. 

This is the **input**. Each example also has a label, or an **output**. This label tells us which class the example belongs to and we can use this to **train** our model.

#### Labels
A quick word on labels in classification tasks. They are:

- Discrete 
- Categorical (not numerical, although we often use numbers). 1 is not less than 2, 3 is not half 6. 



### Training 

This is where we take the dataset (the input that represents each example, and its accompanying label), and generate a model that will learn to take new inputs, and be able to correctly label them. 

Training often starts from some random initialisation. We then iteratively try each example, check if we got it right by comparing the model's prediction against the actual label, and use this information to improve the model. Eventually, we end up with a model that works well (hopefully!). 
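
To make that loop concrete, here is a rough sketch using a perceptron-style update rule. This is an illustrative stand-in, not the algorithm used elsewhere in this notebook, and the toy data, `weights` and `bias` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 2 features per example, linearly separable labels
toy_X = np.array([[0.9, 0.1], [0.8, 0.0], [0.7, 0.2],
                  [0.1, 0.9], [0.0, 0.8], [0.2, 0.7]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

# Start from a random initialisation
weights = rng.normal(size=2)
bias = 0.0

for epoch in range(100):
    mistakes = 0
    for x, label in zip(toy_X, toy_y):
        # Try the example and compare the prediction with the actual label
        prediction = int(np.dot(weights, x) + bias > 0)
        error = label - prediction
        if error != 0:
            # Use the mistake to nudge the model towards the right answer
            weights += error * x
            bias += error
            mistakes += 1
    # Stop once a full pass over the data produces no mistakes
    if mistakes == 0:
        break

final_predictions = (toy_X @ weights + bias > 0).astype(int)
print("Stopped after epoch", epoch, "with predictions", final_predictions)
```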

The algorithms we'll look at today actually don't really have much of a training process as such, unlike the more fancy learning algorithms we'll look at later in the course. 

### New Data

Now we've trained the model, it's time to try it out on some new data! Remember, the goal is to make something that is able to correctly label new data (new instances of text) based on what it has learned about the categories from the dataset. It is worth noting here that a model will only tend to be good at identifying new examples that are similar, or are at least different in similar ways, to the ones it has seen before.  

We'll look into how to evaluate our models later on. 

In [None]:
from sklearn.naive_bayes import GaussianNB

#Pick features
#features = ["subject","article"]

features = ["moon","nhl"]
#features = ["space","hockey", "moon","nhl"]
# features = np.random.choice(vocab, 100)

#Add features to main dataframe

vectorised_dense = vectorised.todense()


for f in features:
    #Flatten the single matching column to a 1D array before storing it
    df[f] = np.asarray(vectorised_dense[:, vocab == f]).ravel()
#Initialise the model
gnb = GaussianNB()

#Get the dataset
X = df[features]
y = df["label"]

print(X.shape, y.shape)

#Train the model
model = gnb.fit(X, y)

#See if the model works (note: this predicts on the same data we trained on)
y_pred = model.predict(X)

#How many did it get right?
num_incorrect = (y != y_pred).sum()
total = df["label"].shape[0]
acc = (total - num_incorrect) / total * 100
print("Number of mislabeled points out of a total %d points : %d, Accuracy is %0.3f percent" % (total, num_incorrect, acc))

### How good is that score?

Is the model better than random? Here we show what happens if we randomly guess a label.

In [None]:
#Generate some random results
y_pred = np.random.randint(0,2,len(y))

#How many did it get right?
num_incorrect = (y != y_pred).sum()
total = df["label"].shape[0]
acc = (total - num_incorrect) / total * 100
print("Number of mislabeled points out of a total %d points : %d, Accuracy is %0.3f percent" % (total, num_incorrect, acc))

### Good for some....

For some examples the two words are really good, but for examples where the TF/IDF for each word is near 0, it's a bad representation.

For example, out of 1000 of each category, we may have 300 space posts where the features are (0,0), and 400 hockey posts where the features are (0,0). So if we then have a new post that scores (0,0), what category do we put it in? We can't use just these features to reliably tell them apart!
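
A quick way to check how often this happens is to count the posts where both features are zero, per class. The sketch below uses a tiny made-up table (`toy`) with invented scores, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical toy scores for the two features, plus the class label
toy = pd.DataFrame({
    "moon":  [0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    "nhl":   [0.0, 0.0, 0.0, 0.5, 0.3, 0.0],
    "label": [0,   0,   0,   1,   1,   1],
})

# A post is "silent" if both feature scores are exactly zero
silent = (toy[["moon", "nhl"]] == 0).all(axis=1)

# Count silent posts per class: these all look identical to the model
silent_per_class = silent.groupby(toy["label"]).sum()
print(silent_per_class)
```

If both classes have many silent posts, no classifier trained on only these two features can separate them.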

In [None]:
#Get the accuracy 
acc = (total - num_incorrect) / total * 100
acc

### Evaluating Models 

#### Training Set Accuracy
Training set accuracy is when we take all the data we trained the model on and use it to determine the accuracy of that same trained model. It's important to remember that the goal of most traditional machine learning tasks is to take existing examples and produce something which generalises to new examples. 

In an issue known as overfitting, a model may have high training set accuracy, but when given new examples will perform poorly. In our newsgroup example, this means the model has learnt to identify the exact posts it has seen before, rather than learning something intrinsic about all space or hockey posts that would allow it to correctly label novel ones under different circumstances in the wild. 
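
The sketch below exaggerates overfitting on purpose. It assumes random features and random labels (so there is nothing real to learn) and uses a 1-nearest-neighbour classifier, which is a stand-in chosen because it memorises its training data; it is not a model used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical data with NO real pattern: random features, random labels
noise_X = rng.normal(size=(400, 5))
noise_y = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    noise_X, noise_y, test_size=0.5, random_state=0)

# A 1-nearest-neighbour model simply memorises every training example
memoriser = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = memoriser.score(X_tr, y_tr) * 100
test_acc = memoriser.score(X_te, y_te) * 100
print("Training set accuracy: %0.1f percent" % train_acc)
print("Unseen data accuracy:  %0.1f percent" % test_acc)
```

The training set accuracy is perfect because every example is its own nearest neighbour, while accuracy on the unseen half should hover around chance.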

#### Validation Set Accuracy
To get a more accurate measure of accuracy (meta, right?), you can hold back some of your data from training and use this to evaluate your model during development. This way your model is being tested on unseen data and you can have more confidence it will work when you put it into production. Proportions for this split vary, but 10% is often used.

The problem with validation set accuracy is that you lose some of your precious data for training. Also, if you have small datasets, a 10% split of an already small number of examples may not actually give you much of an idea about how well your model is performing. 

You can see that no one method is right for all situations and compromises often have to be made. It's also important to note that these methods will only work as well as your data is good. They won’t, for example, spot bias in your model if your held-back data also lacks the necessary diversity. 

#### Test Sets
Some people hold back even more data to really test how well their model will work in the real world (or "production"). This is because they may overfit to their validation data when tuning their models. 
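
One possible way to get all three sets is to call `train_test_split()` twice, as sketched below. The stand-in data and the 60/20/20 proportions are assumptions for illustration; the right proportions depend on how much data you have.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 examples, 3 features, balanced labels
toy_X = np.arange(300).reshape((100, 3))
toy_y = np.array([0, 1] * 50)

# First hold back 20% as the final test set...
X_rest, X_final, y_rest, y_final = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)
# ...then split what is left into training and validation sets
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_fit), len(X_val), len(X_final))
```

The second split takes 25% of the remaining 80 examples, so the three sets end up with 60, 20 and 20 examples.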

#### Cross Validation 
It's the best (or worst?) of both worlds approach! More on that later.
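
As a brief preview, here is a hedged sketch using `cross_val_score` from `sklearn`. The data is made up (one feature whose average differs between the two classes), and 5 folds is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical toy data: one informative feature whose mean differs by class
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = rng.normal(loc=toy_y * 3.0, scale=1.0).reshape(-1, 1)

# 5-fold cross validation: each fold takes a turn as the held-back data
fold_scores = cross_val_score(GaussianNB(), toy_X, toy_y, cv=5)
print("Accuracy per fold:", fold_scores)
print("Mean accuracy: %0.3f" % fold_scores.mean())
```

Every example is used for both training and evaluation, just never in the same fold.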

### `train_test_split(X, y, test_size)`

The `sklearn` library has a function called `train_test_split()` that we can use to create our different datasets for **training** and **validation**.

In [None]:
#Split the full TFIDF vectors and labels into training and held-back sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vectorised_df, df["label"], test_size=0.2, random_state=0)
#Train the model
model = gnb.fit(X_train,y_train)
#See if the model works
y_pred = model.predict(X_test)
num_incorrect = (y_test != y_pred).sum()
total = y_test.shape[0]
acc = (total - num_incorrect) / total * 100
print("Number of mislabeled points out of a total %d points : %d, Accuracy is %0.3f percent" % (total, num_incorrect, acc))

In [None]:
##Look up the documentation using ?
?train_test_split

### Comparing Representations

OK, so now we can compare the TF/IDF, SVD and LDA representations. Remember, no one representation is better than the others for all datasets and problems, but one might be better for this dataset and problem. 

In [None]:
# Set things up
lda_cv = CountVectorizer(stop_words='english', tokenizer=my_tokeniser,
                        max_df=.1,
                        max_features=5000)
count_data = lda_cv.fit_transform(df["text"])
lda = LatentDirichletAllocation(n_components=16,
                                random_state=123,
                                learning_method='batch')

In [None]:
lda_topics = lda.fit_transform(count_data)

In [None]:
#Compare the 3 representations
tfidf = np.asarray(vectorised.todense())
features = [tfidf, svd_topic_vectors, lda_topics]
#enumerate() gives us the index and the feature set together
for i, f in enumerate(features):
    print("Feature set ",i)
    X_train, X_test, y_train, y_test = train_test_split(f, df["label"], test_size=0.3, random_state=0)
    gnb = GaussianNB()
    y_pred = gnb.fit(X_train, y_train).predict(X_test)
    num_incorrect = (y_test != y_pred).sum()
    total = y_test.shape[0]
    acc = (total - num_incorrect) / total * 100
    print("Number of mislabeled points out of a total %d points : %d, %0.3f accurate" % (total, num_incorrect, acc))

### Naive Bayes

Now we're going to learn about the algorithm that we've been using to classify, as it is widely used in text classification tasks. 

When we see a new example we ask "Given I know that this document has these feature values, what is the probability that it is from each class?". Naive Bayes will then return the **most likely** label, given those values. 

```
P(c=0|x1,x2,x3..)
P(c=1|x1,x2,x3..)
```

How we reach this probability is in two steps.

First, we ask "What is the probability that a document in a particular class has these particular values?". We can look at a **probability distribution** (as shown in the plot below) to work this out. For example, we may see that for **space (class 0)**, the probability of seeing a **TF/IDF** score of 0.4 for the word **time** might be 0.084 and the probability in **hockey (class 1)** is only 0.069. 

```
P(time=0.4|c=0) = 0.084
P(time=0.4|c=1) = 0.069
```

In [None]:
from scipy import stats 
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,8))
feature = "time"
for i,c in enumerate(classes):
    #Get vals for word
    tfidf_vals = c[:, vocab == feature]
    lim = 220
    #Keep only the highest scores, dropping most of the 0s (for plotting)
    tfidf_vals = sorted(np.array(tfidf_vals.flatten())[0])[-lim:]
    x = np.linspace(0, 0.2, len(tfidf_vals))
    #Find mean and standard deviation
    m, s = stats.norm.fit(tfidf_vals)
    #Draw normal curve
    pdf_g = stats.norm.pdf(x, m, s)
    ax[i].set_ylim([0,25])
    ax[i].hist(tfidf_vals, bins = 40)
    ax[i].plot(x, pdf_g, label="Probability distribution")
    index = 43
    print("prob = ", pdf_g[index]/len(tfidf_vals))
    #Label the marker with the x value it is actually drawn at
    ax[i].plot(x[index], pdf_g[index],"rx",ms=12, label="tfxidf = %0.2f" % x[index]) 
    ax[i].set_xlabel("tfxidf scores for \"" + feature +"\"")
    ax[i].set_title("space" if i == 0 else "hockey")
    ax[i].legend()


But this is not enough, because it may also be that **regardless** of observations, it's actually way more likely that you'll get one class over another. 

In our case the classes are nicely balanced so this won't have much effect, because the chance of seeing each class, regardless of what we know about the text, is about 50/50.

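But using **Bayes Rule** we can calculate the probability of each class, given the observation, taking into account the probability before you had made the observation. The result is proportional to (`∝`) the prior multiplied by the likelihood; dividing by the sum over both classes turns the two scores into probabilities that add up to 1.

```
P(c=0|time=0.4) ∝ P(c=0) * P(time=0.4|c=0)
P(c=1|time=0.4) ∝ P(c=1) * P(time=0.4|c=1)
```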
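
As a worked example, the sketch below plugs in the two likelihoods quoted earlier (0.084 and 0.069) and assumes balanced priors of 0.5 each. The dictionary names are just for illustration.

```python
# Likelihoods quoted earlier: P(time=0.4 | class)
likelihood = {"space": 0.084, "hockey": 0.069}
# Assumed priors: balanced classes, so each is 0.5
prior = {"space": 0.5, "hockey": 0.5}

# Prior multiplied by likelihood, for each class
unnormalised = {c: prior[c] * likelihood[c] for c in prior}

# Dividing by the total makes the two results sum to 1
total = sum(unnormalised.values())
posterior = {c: unnormalised[c] / total for c in prior}
print(posterior)
```

With equal priors the ranking is decided by the likelihoods alone, so space comes out slightly ahead of hockey.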

But this is only for one of the features, and as we've seen, we may use anywhere from 1 to thousands! 

We can calculate each probability for each feature separately and combine them to get a final probability and this is where the **naive** in **naive bayes** comes in. 

The maths that we use to combine the probabilities assumes that all the features are independent and unrelated. However, it is quite likely that some will actually be related in some way. We ignore this and make the assumption anyway, and it turns out it still works quite well!
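
To see the whole calculation in one place, here is a sketch that does it by hand on a tiny made-up dataset and compares the result with `GaussianNB`. The scores in `toy_X` and the `new_post` values are hypothetical, and the comparison uses a tolerance because `GaussianNB` adds a very small smoothing term to each variance.

```python
import numpy as np
from scipy import stats
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy scores for two features; rows are posts
toy_X = np.array([[0.40, 0.00], [0.30, 0.05], [0.50, 0.10],
                  [0.00, 0.45], [0.05, 0.30], [0.10, 0.50]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
new_post = np.array([0.35, 0.05])

scores = []
for c in [0, 1]:
    rows = toy_X[toy_y == c]
    # How common is this class, before looking at the features?
    prior = len(rows) / len(toy_X)
    # One normal curve per feature, fitted to this class only
    likelihoods = stats.norm.pdf(new_post, rows.mean(axis=0), rows.std(axis=0))
    # The "naive" step: multiply the per-feature values as if independent
    scores.append(prior * likelihoods.prod())

# Normalise so the two class scores sum to 1
by_hand = np.array(scores) / sum(scores)
by_sklearn = GaussianNB().fit(toy_X, toy_y).predict_proba([new_post])[0]
print(by_hand, by_sklearn)
```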

## Summary

### Classification only works for what you trained it for 

1. Will always pick one of the classes it's been trained on 


2. Data has to be in the same format 


### Uses of Text Classification in Creative Industries 

- Classifying social media response to events 


    - Did people enjoy the show?


- Grouping media from text descriptions 


    - Can help if we have categories that we know are useful


- Identifying new items in museum collections 


    - Which experts should look at these documents?


- Filtering out inappropriate content 


    - Live performances with audience participation 

### Rules of Thumb

 - A model is created by a learning algorithm 
 
 
 - A classifier learns to attach discrete labels to new data
 
 
 - We want a model that works well with new data (generalises well, **not overfit**)
 
 
 - Different features can affect model performance
 
 
 - The model can only learn variations similar to those it has seen before  
     
     
     - Generally fewer features is better
     
     
     - Generally more data is better, especially for more complex problems 
     
     
 - More classes (more choices) generally makes for a harder problem 
 
 
 - Equal numbers of examples for each class is also generally better