# Bag of words: using scikit-learn and spaCy
A bag-of-words model represents a text by its word frequencies and ignores word order.
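
As a minimal illustration of the idea (a sketch using only the Python standard library, not the spaCy/scikit-learn pipeline built below), two sentences containing the same words in a different order produce identical bags of words. The helper name `bag_of_words` and the example sentences are made up for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)
print(a == b)  # word order is discarded, so the two bags compare equal
```

This is also the main limitation of the approach: the two sentences mean different things, but the representation cannot tell them apart.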

### *Task:* categorize a news text

### Overview of steps:
1. download data sets (train and test)
2. preprocess
    - clean text (remove newlines, convert html, etc.)
    - tokenize (divide into words)
    - lemmatize (running, runs, ran -> run)
    - remove stop words
3. get frequency counts of tokens
4. train an SVM classifier with a linear kernel on the training data
5. test on the test data
6. get performance of classifier

source: https://nicschrading.com/project/Intro-to-NLP-with-spaCy/

### Setup:
1. Install spaCy:  
    `conda install spacy`  (if that doesn't work, go here: https://spacy.io/docs/usage/)  

2. Download spaCy model:  
    `python -m spacy download en_core_web_sm`
    
3. Install scikit-learn:  
    `conda install scikit-learn` (if that doesn't work, go here: http://scikit-learn.org/stable/install.html)

4. Download stopwords from NLTK:

In [None]:
import nltk 
nltk.download("stopwords")

### Demo!

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.datasets import load_files
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
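# ENGLISH_STOP_WORDS is exposed by the text module; the old stop_words module was removed from scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS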
from sklearn import metrics
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import numpy as np
import re

### 1. load the data sets (train and test)
We're using a subset of the dataset called “Twenty Newsgroups”. Here is the official description, quoted from the website:

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In [None]:
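# decode the files to text on load; without an encoding, load_files returns raw bytes
train = load_files("data/20news-bydate-train", encoding="latin-1")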

What are the categories?

In [None]:
train.target_names

Let’s print the first lines of the first loaded file:

In [None]:
print("\n".join(train.data[0].split("\n")[:3]))

print(train.target_names[train.target[0]])


Load the test files

In [None]:
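# decode the files to text on load; without an encoding, load_files returns raw bytes
test = load_files("data/20news-bydate-test", encoding="latin-1")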

### 2. preprocess

In [None]:
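# Set up spaCy (spacy.en no longer exists in current releases; load the small English model instead)
# The dependency parser and NER are disabled because only tokens and lemmas are needed
import spacy
parser = spacy.load("en_core_web_sm", disable=["parser", "ner"])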

# A custom stoplist
STOPLIST = set(stopwords.words('english') + ["n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS))
# List of symbols we don't care about
SYMBOLS = list(string.punctuation) + ["-----", "---", "...", "“", "”", "'ve"]

# Every step in a pipeline needs to be a "transformer". 
# Define a custom transformer to clean text using spaCy
class CleanTextTransformer(TransformerMixin):
    """
    Convert text to cleaned text
    """

    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
    
# A custom function to clean the text before sending it into the vectorizer
def clean_text(text):
    # get rid of newlines
    text = text.strip().replace("\n", " ").replace("\r", " ")
    
    # replace HTML symbols
    text = text.replace("&amp;", "and").replace("&gt;", ">").replace("&lt;", "<")
    
    # lowercase
    text = text.lower()

    return text

# A custom function to tokenize the text using spaCy
# and convert to lemmas
def tokenize_text(sample):
    # get the tokens using spaCy
    tokens = parser(sample)

    # lemmatize
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas

    # stoplist the tokens
    tokens = [tok for tok in tokens if tok not in STOPLIST]

    # stoplist symbols
    tokens = [tok for tok in tokens if tok not in SYMBOLS]

    # remove empty and whitespace-only tokens
    tokens = [tok for tok in tokens if tok.strip()]

    return tokens
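
To see what the cleaning and stop-word steps do to a single string, here is a small sketch that runs without spaCy or NLTK. It is a simplified stand-in, not the functions defined above: `toy_clean`, `toy_tokenize`, and `TOY_STOPLIST` are hypothetical names, the regex tokenizer only approximates spaCy's, and no lemmatization is performed.

```python
import re

# Hypothetical miniature stop list; the notebook combines NLTK's and scikit-learn's lists.
TOY_STOPLIST = {"the", "a", "is", "and", "of"}

def toy_clean(text):
    """Mirror the cleaning step: drop newlines, decode a few HTML entities, lowercase."""
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.replace("&amp;", "and").replace("&gt;", ">").replace("&lt;", "<")
    return text.lower()

def toy_tokenize(text):
    """Stand-in for spaCy: keep runs of letters, digits, and apostrophes, then drop stop words."""
    words = re.findall(r"[a-z0-9']+", text)
    return [w for w in words if w not in TOY_STOPLIST]

sample = "The price of Oil &amp; Gas\nis rising"
cleaned = toy_clean(sample)
tokens = toy_tokenize(cleaned)

print(cleaned)
print(tokens)
```

With real lemmatization, "rising" would additionally be reduced to its base form, so that different inflections of a word are counted together.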

### 3. get counts of tokens

In [None]:
# the vectorizer to use
# note that the tokenizer in CountVectorizer is replaced by a custom function built on spaCy's tokenizer
vectorizer = CountVectorizer(tokenizer=tokenize_text, ngram_range=(1,1), encoding='latin-1')
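
To make the output of `CountVectorizer` concrete, the following sketch fits one on two made-up sentences. It uses the default tokenizer rather than `tokenize_text`, and the names `toy_docs` and `toy_vectorizer` are hypothetical. Each row of the resulting matrix is a document and each column is a token from the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat saw the dog"]  # made-up documents
toy_vectorizer = CountVectorizer()                 # default tokenizer, not the spaCy one
counts = toy_vectorizer.fit_transform(toy_docs)    # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index (columns are in alphabetical order)
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# dense view of the counts; fine for a toy example, too large for the real data
print(counts.toarray())
```

The second row has a 2 in the column for "the", because that word occurs twice in the second document.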

### 4. train an SVM classifier with a linear kernel on the training data
Warning: this may take a couple of minutes

In [None]:
clf = LinearSVC()
# define the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])

# train
pipe.fit(train.data, train.target)
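
For a version of the same pattern that runs in well under a second, here is a sketch of a pipeline trained on four made-up sentences. It is not the notebook's pipeline: it omits the cleaning transformer and the spaCy tokenizer, and the documents, labels, and `toy_`-prefixed names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data: label 0 = sports, label 1 = politics
toy_train_docs = [
    "the team won the match",
    "the striker scored a goal",
    "the senate passed the bill",
    "the president signed the law",
]
toy_train_labels = [0, 0, 1, 1]

# Each step is a (name, estimator) pair; the last step is the classifier
toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", LinearSVC()),
])
toy_pipe.fit(toy_train_docs, toy_train_labels)

# fit and predict push raw strings through every step in order
toy_preds = toy_pipe.predict(toy_train_docs)
print(toy_preds)
```

Bundling the vectorizer and classifier into one object means the vocabulary learned during `fit` is reused automatically when predicting on new text.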

### 5. test on the test data

In [None]:
# test
preds = pipe.predict(test.data)

### 6. get performance of classifier

In [None]:
print("Overall Accuracy", np.mean(preds == test.target))
print(metrics.classification_report(test.target, preds, target_names=test.target_names))
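
The numbers in the report can be reproduced by hand. The sketch below uses made-up labels (not the newsgroup results) to compute accuracy plus the precision, recall, and F1 score for one class with plain NumPy.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2])  # made-up gold labels
y_pred = np.array([0, 1, 1, 1, 0, 2])  # made-up predictions

# accuracy: fraction of documents whose predicted label matches the gold label
accuracy = np.mean(y_pred == y_true)

cls = 1  # class to inspect
true_pos = np.sum((y_pred == cls) & (y_true == cls))

# precision: of everything predicted as cls, how much really is cls
precision = true_pos / np.sum(y_pred == cls)
# recall: of everything that really is cls, how much was found
recall = true_pos / np.sum(y_true == cls)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

`classification_report` performs this per-class computation for every category and adds averaged rows at the bottom.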


#### Confusion matrix

In [None]:
metrics.confusion_matrix(test.target, preds)
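
In the matrix returned by `metrics.confusion_matrix`, rows correspond to the true class and columns to the predicted class, so the diagonal holds the correctly classified documents. The following sketch builds such a matrix by hand for made-up labels to show how the entries arise.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 2])  # made-up gold labels
y_pred = np.array([0, 1, 1, 1, 0, 2])  # made-up predictions
n_classes = 3

# cm[i, j] counts documents whose true class is i and predicted class is j
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# the diagonal sum divided by the total is the overall accuracy
print(np.trace(cm) / cm.sum())
```

Large off-diagonal entries point to pairs of categories the classifier confuses with each other.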