# Representing text for machine learning

A long document made up of words is not a representation that a machine-learning
model can work with directly.

We need to convert the text into a numerical representation. Here we will look at
some baseline methods that work very well in practice. They also form the basis
of more complicated ideas used later on. Strong baselines are important in a world
of deep learning: you have to be able to demonstrate that the additional complexity,
reduced explainability, and additional technical debt from using deep learning are
worth it.

In [None]:
sentences = ["The Uber driver behind the wheel of an autonomous car that hit "
             "and killed a pedestrian in Arizona could have avoided the collision "
             "if she had not been distracted, according to police investigating "
             "the incident.",
             "An avoidability analysis by police in Tempe, Arizona following March's"
             " crash suggested that Rafaela Vasquez, Uber's safety driver, may have "
             "been watching the online video service Hulu in the car.",
             "The death of 49-year-old Elaine Herzberg is believed to be the first "
             "time an autonomous car has killed a bystander, prompting a series of "
             "investigations into what happened.",
            ]

In [None]:
# create a bag of words representation of the above sentences

In [None]:
# check that you can undo your bag of words

In [None]:
# what fraction of entries are not zero in X?

In [None]:
(X>0).sum(axis=1) / X.shape[1]
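
One possible way to work through the exercises above is sketched below. It is a minimal sketch, not the intended solution: the two short `docs` strings stand in for the `sentences` list, and the `tokenize` helper and its regular expression are assumptions made for illustration.

```python
import re
import numpy as np

# Toy stand-in for the `sentences` list defined above (an assumption for brevity).
docs = ["the car hit the pedestrian", "the driver was distracted"]


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Map each distinct token to a column index, in alphabetical order.
all_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_tokens)}

# One row per document, one column per vocabulary word; entries are counts.
X = np.zeros((len(docs), len(vocabulary)), dtype=int)
for row, doc in enumerate(docs):
    for token in tokenize(doc):
        X[row, vocabulary[token]] += 1

# "Undo" the bag of words: recover which words occur in each row.
# The original word order cannot be recovered.
index_to_word = {idx: word for word, idx in vocabulary.items()}
recovered = [[index_to_word[j] for j in np.flatnonzero(row)] for row in X]

# Fraction of non-zero entries in each row.
density = (X > 0).sum(axis=1) / X.shape[1]

print(X)
print(recovered)
print(density)
```

With a realistic vocabulary most entries of `X` are zero, which is why library implementations store the result as a sparse matrix.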

## Using ready made tools for this

scikit-learn has tools to do this for you.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# ... use CountVectorizer()

# `get_feature_names()` was removed in scikit-learn 1.2; use `get_feature_names_out()`
print(sorted(vect.get_feature_names_out()))

In [None]:
# use the learnt mapping to convert sentences to bag of words


In [None]:
# transform the vector representation back to "words"

In [None]:
# Does the number of features reported by `CountVectorizer` differ from the
# number your manual code found? If so, why?

## Classify movies from their reviews

Fetch the dataset from http://ai.stanford.edu/~amaas/data/sentiment/ and untar it to
a directory near this notebook. I placed it in `../data/`.

In [None]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_trainval: {}".format(type(text_trainval)))
print("length of text_trainval: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

In [None]:
print("classes: {}".format(set(y_trainval)))

In [None]:
# look at an example review
print("text_trainval[42]:\n{}".format(text_trainval[42].decode()))

In [None]:
# Vectorise the review texts!
from sklearn.model_selection import train_test_split


text_trainval = [doc.replace(b"<br />", b" ") for doc in text_trainval]

# ... your code here

In [None]:
# check out some of the words
feature_names = vect.get_feature_names_out()
print(feature_names[:10])
print(feature_names[30000:30010])
print(feature_names[::3000])

In [None]:
# Fit a logistic regression model to the data and measure its performance.
# This gives us a baseline that we need to improve on.
# What score do you achieve, and on which dataset? Did you split the data
# into training and testing sets?
#
# Once your model is running, tune the regularisation strength `C` using
# scikit-learn's RandomizedSearchCV, or LogisticRegressionCV, which is
# more efficient.

In [None]:
lr.score(X_val, y_val)
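
The workflow described above (split, vectorise, fit, tune `C`, score) could look roughly like the sketch below. The toy reviews are hypothetical stand-ins for the IMDb data so that the block runs on its own, and the split size and `Cs=5` are arbitrary choices. Scores on such a tiny corpus say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews standing in for the IMDb data (1 = positive, 0 = negative).
positive = [
    "a wonderful film with great acting",
    "great story and wonderful characters",
    "I loved this movie, brilliant and moving",
    "brilliant acting, I loved every minute",
    "an excellent and enjoyable film",
    "excellent story, great fun",
    "enjoyable, moving and wonderful",
    "loved the great cast, excellent movie",
]
negative = [
    "a terrible film with awful acting",
    "awful story and boring characters",
    "I hated this movie, dull and boring",
    "dull acting, I hated every minute",
    "a bad and tedious film",
    "bad story, terrible waste",
    "tedious, boring and awful",
    "hated the terrible cast, bad movie",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Hold out a validation set, keeping the class balance in both parts.
text_train, text_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Learn the vocabulary on the training split only, then reuse it for validation.
vect = CountVectorizer()
X_train = vect.fit_transform(text_train)
X_val = vect.transform(text_val)

# LogisticRegressionCV chooses C by cross-validation on the training data.
lr = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
lr.fit(X_train, y_train)

train_score = lr.score(X_train, y_train)
val_score = lr.score(X_val, y_val)
print("train accuracy:", train_score)
print("validation accuracy:", val_score)
```

Fitting the vectorizer on the training split only keeps the validation documents from influencing the vocabulary.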

In [None]:
# Find out which words correlate with a good and a bad review.
# You can inspect the weights of the linear model by looking
# at the `coef_` attribute of your LogisticRegression instance.
# Check the documentation to learn about all the attributes:
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
#
# do they seem sensible? Could you explain why a review is
# getting a high or a low predicted sentiment?

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14

In [None]:
# ... your code here ...

## Bonus

* investigate how to configure the vectorizer to exclude stop words
* should you fix the spelling of misspelt words?
* only include words that appear more than N times?
* only include the M most frequent words?
* how few examples do you need to achieve a "good" performance?
* a bag of words does not know anything about the order of words,
  can you construct bi-grams (pairs of words) and improve the
  performance?
* construct a logistic regression model in keras and use that instead
  of using scikit-learn's implementation
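
Several of the ideas above map onto `CountVectorizer` parameters. The sketch below shows them one at a time on a made-up toy corpus. Note that `min_df` counts the number of documents a word occurs in, not its total number of occurrences, so it only approximates "appears more than N times".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, made up for this sketch.
docs = ["the movie was good", "the movie was not good", "the plot was weak"]

# Drop English stop words using scikit-learn's built-in list.
no_stop = CountVectorizer(stop_words="english").fit(docs)

# Keep only words that occur in at least 2 documents.
frequent = CountVectorizer(min_df=2).fit(docs)

# Keep only the 3 most frequent words across the corpus.
top3 = CountVectorizer(max_features=3).fit(docs)

# Use single words and pairs of adjacent words (bi-grams) as features.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(no_stop.vocabulary_))
print(sorted(frequent.vocabulary_))
print(sorted(top3.vocabulary_))
print(sorted(bigrams.vocabulary_))
```

Bi-grams such as "not good" let a linear model distinguish negated phrases that a plain bag of words cannot.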


# Term frequency, inverse document frequency

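Instead of using the raw count of how often a word appears, we can weight each
count by how rare the word is across documents. This is term frequency, inverse
document frequency (TF-IDF): words that appear in almost every document get a low
weight, while words concentrated in a few documents get a high one.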

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


text_classifier = make_pipeline(
    TfidfVectorizer(min_df=3, max_df=0.8, ngram_range=(1, 2)),
    LogisticRegression(),
)

In [None]:
%%time
text_classifier.fit(text_train, y_train)

In [None]:
text_classifier.score(text_val, y_val)

## Baseline results

With about 20s of computer time and a few minutes of work we are at 88% accuracy.

We have not really tuned this baseline yet. It is possible that some investment in
finding better hyperparameters will improve the baseline further. Try it out.
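
One way to start tuning is a grid search over the pipeline's parameters, sketched below. The toy reviews and the parameter grid are assumptions chosen so that the block runs quickly on its own. On the real data you would pass `text_train` and `y_train` and probably search a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the IMDb data (1 = positive, 0 = negative).
positive = [
    "a wonderful film with great acting",
    "great story and wonderful characters",
    "I loved this movie, brilliant and moving",
    "brilliant acting, I loved every minute",
    "an excellent and enjoyable film",
    "excellent story, great fun",
]
negative = [
    "a terrible film with awful acting",
    "awful story and boring characters",
    "I hated this movie, dull and boring",
    "dull acting, I hated every minute",
    "a bad and tedious film",
    "bad story, terrible waste",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

# Refit the whole pipeline for every parameter combination and fold, so the
# vectorizer never sees the held-out fold.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```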