In [1]:
import numpy as np

# Python Classifier

## Introduction

The objective of this mini-project is to gain experience with natural language processing by using text data to train a machine learning classifier. I work with four Wikipedia articles: three about python the snake and one about Python the programming language. I use three snake articles because each is short compared to the programming language article. The goal is to train a model that classifies whether a sentence refers to python the snake or Python the programming language.

## Loading the Data

In [2]:
import spacy
import wikipedia

The corpus comes from Wikipedia, one page per article. To turn each page into a set of documents, I use the spaCy library to split the page content into a list of sentences.

In [3]:
# Load text processing pipeline.
nlp = spacy.load('en_core_web_sm')

# Return a list of sentences in Wikipedia articles.
def pages_to_sentences(*pages):
    sentences = []
    
    for page in pages:
        p = wikipedia.page(page)
        doc = nlp(p.content)
        sentences += [sent.text for sent in doc.sents]
        
    return sentences

In [4]:
animal_sents = pages_to_sentences('Reticulated Python', 'Ball Python', 'Pythonidae')
language_sents = pages_to_sentences('Python (programming language)')

Once the lists of sentences are created, I concatenate them into a single set of documents. I create the labels manually by repeating each class name once per sentence in the corresponding list and concatenating the results.

In [5]:
documents = animal_sents + language_sents
labels = (['animal'] * len(animal_sents)) + (['language'] * len(language_sents))
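Because three snake articles are combined against a single programming language article, it can be worth confirming that the labels line up with the documents and seeing how balanced the two classes are. The following is a minimal sketch of such a check. The short sentence lists are hypothetical stand-ins for the real `animal_sents` and `language_sents`, which require a Wikipedia download.

```python
from collections import Counter

# Hypothetical stand-ins for the sentence lists built from Wikipedia.
animal_sents = [
    "Pythons are nonvenomous snakes.",
    "The ball python is native to Africa.",
    "It kills prey by constriction.",
]
language_sents = [
    "Python is a programming language.",
    "It supports multiple paradigms.",
]

# Same construction as in the cell above.
documents = animal_sents + language_sents
labels = ['animal'] * len(animal_sents) + ['language'] * len(language_sents)

# Every document must have exactly one label.
assert len(documents) == len(labels)

# Count how many sentences fall in each class.
class_counts = Counter(labels)
print(class_counts)
```

A strongly skewed count would suggest that raw accuracy alone may overstate how well the model separates the two classes.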

## Modeling

In [6]:
from spacy.lang.en import STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

I construct a machine learning model trained on tf-idf features, that is, raw term counts reweighted by inverse document frequency and then normalized. Here are some things to consider:
* Consider which hyperparameters to tune for the model.
* Subsampling the training data reduces training time, which is helpful when searching for the best hyperparameters. Note that the final model will perform best if it is trained on the full data set.
* Supplying a stop word list, so that those tokens are ignored, may help with performance.
* Including bigrams may help with performance, but the risk of overfitting will increase.
* More documents will generally lead to better performance.
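To make the tf-idf weighting concrete, here is a small pure-Python sketch of the computation. It assumes the smoothed idf formula and L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, and it tokenizes by whitespace for simplicity, so it illustrates the idea rather than reproducing the library. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["the python bit me", "the python code runs", "the snake sleeps"]
rows = tfidf(toy_docs)
print(rows[0])
```

In the first toy document, "the" appears in every document and gets the lowest weight, "python" appears in two and sits in the middle, and "bit" appears only once and gets the highest weight.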

### Stop Words

Notice that there are many symbols in the documents. These symbols carry no signal for the model, so I add them to the stop words to keep them from being interpreted as features.

In [7]:
symbols = {'\n', '\n\n', '\n\n\n', ' ', '"', "'", "'s", '(', ')', ',', '-', '.', ':', ';', '<', '='}
evolved_stop_words = STOP_WORDS.union(symbols)
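One caveat worth checking: `TfidfVectorizer`'s default token pattern only keeps runs of two or more word characters, so punctuation and whitespace are likely never emitted as tokens in the first place. The sketch below imitates that default with the standard `re` module (lowercasing, then applying the pattern `(?u)\b\w\w+\b`). The helper name `default_like_tokens` is hypothetical, and the sketch approximates the vectorizer rather than calling it.

```python
import re

# Same regular expression as TfidfVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Lowercase the text and extract tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

symbols = {'\n', '\n\n', '\n\n\n', ' ', '"', "'", "'s", '(', ')', ',',
           '-', '.', ':', ';', '<', '='}

tokens = default_like_tokens("Where is my pet python; I can't find her!")
print(tokens)

# None of the symbols survive tokenization, so none can become a feature.
overlap = set(tokens) & symbols
print(overlap)
```

If the overlap is empty for the real corpus as well, the extra symbols are harmless but redundant. Note that contractions are split by this pattern, which is one way a stop word list can end up inconsistent with the tokenizer.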

### Fitting Model

I use a MultinomialNB model because it provides a probability estimate for each class.
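As a rough illustration of where those probabilities come from, here is a pure-Python sketch of multinomial naive Bayes with additive smoothing. It assumes raw word counts and whitespace tokenization, whereas the model in this notebook is fit on tf-idf values, so the numbers are not comparable. The names `train_nb` and `nb_predict_proba` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    classes = sorted(set(labels))
    log_prior, log_lik = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Additive smoothing: add alpha to every word count in the vocabulary.
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return classes, log_prior, log_lik

def nb_predict_proba(doc, model):
    """Return {class: posterior probability}; unseen words are skipped."""
    classes, log_prior, log_lik = model
    scores = [
        log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
        for c in classes
    ]
    # Normalize in a numerically stable way (subtract the max before exp).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {c: e / z for c, e in zip(classes, exps)}

toy_docs = ["snake bite", "snake grows long", "code loops", "code error"]
toy_labels = ["animal", "animal", "language", "language"]

weak = nb_predict_proba("long snake", train_nb(toy_docs, toy_labels, alpha=1))
strong = nb_predict_proba("long snake", train_nb(toy_docs, toy_labels, alpha=100))
print(weak, strong)
```

A larger `alpha` pulls the per-class word probabilities toward uniform, so the posterior drifts toward the class prior. That is the trade-off the grid search over `alpha` from 1 to 100 explores.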

In [8]:
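# TfidfVectorizer builds unigram and bigram tf-idf features; GridSearchCV then
# cross-validates the smoothing parameter alpha of MultinomialNB on those features.
est = Pipeline([
    # Newer scikit-learn releases expect stop_words to be a list, so convert the set.
    ('vectorizer', TfidfVectorizer(stop_words=list(evolved_stop_words), ngram_range=(1,2))),
    ('classifier', GridSearchCV(MultinomialNB(), param_grid={'alpha': np.linspace(1, 100, 100)}))
])
est.fit(documents, labels)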

'Training accuracy: {}'.format(est.score(documents, labels))

'Training accuracy: 0.8971028971028971'

In [9]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
est.named_steps['vectorizer'].get_feature_names()[1000:1005]

['blah blah', 'blah eggs', 'blah evaluates', 'blender', 'blender cinema']

### Testing

Here are some test documents that may trick the model.

In [10]:
test_docs = ["My Python program is only 100 bytes long.",
             "A python's bite is not venomous but still hurts.",
             "I can't find the error in the python code.",
             "Where is my pet python; I can't find her!",
             "I use for and while loops when writing Python.",
             "The python will loop and wrap itself onto me.",
             "I use snake case for naming my variables.",
             "My python has grown to over 10 ft long!",
             "I use virtual environments to manage package versions.",
             "Pythons are the largest snakes in the environment."]

In [11]:
est.classes_

array(['animal', 'language'], dtype='<U8')

In [12]:
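y_proba = est.predict_proba(test_docs)
# Column 1 holds the probability of est.classes_[1] ('language');
# a value above 0.5 selects index 1, otherwise index 0 ('animal').
predicted_indices = (y_proba[:, 1] > 0.5).astype(int)

# Print each sentence with its predicted class and that class's probability.
for i, index in enumerate(predicted_indices):
    print(test_docs[i], '--> {} at {:g}%'.format(est.classes_[index], y_proba[i, index] * 100))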

My Python program is only 100 bytes long. --> language at 66.8501%
A python's bite is not venomous but still hurts. --> animal at 64.9406%
I can't find the error in the python code. --> language at 83.0824%
Where is my pet python; I can't find her! --> animal at 73.6011%
I use for and while loops when writing Python. --> language at 63.5988%
The python will loop and wrap itself onto me. --> language at 65.4708%
I use snake case for naming my variables. --> animal at 51.8593%
My python has grown to over 10 ft long! --> animal at 73.6018%
I use virtual environments to manage package versions. --> language at 75.1641%
Pythons are the largest snakes in the environment. --> animal at 91.4915%


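The model did a good job on the test documents overall. It failed on the sentence "The python will loop and wrap itself onto me.", but the error is not severe because the model is only about 65% confident in that prediction. The sentence "I use snake case for naming my variables." was also labeled animal, although at roughly 52% the model is close to undecided there.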
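The results can also be tallied programmatically. The sketch below copies the predicted labels and confidences from the output above and compares them with the intended labels. The intended labels are an assumption on my part: the test sentences appear to alternate between the programming language and the snake, so `expected` encodes that reading. The 70% cutoff for "low confidence" is likewise an arbitrary illustrative choice.

```python
# (sentence number, predicted label, confidence in percent), copied from the output above.
predictions = [
    (1, 'language', 66.8501),
    (2, 'animal', 64.9406),
    (3, 'language', 83.0824),
    (4, 'animal', 73.6011),
    (5, 'language', 63.5988),
    (6, 'language', 65.4708),
    (7, 'animal', 51.8593),
    (8, 'animal', 73.6018),
    (9, 'language', 75.1641),
    (10, 'animal', 91.4915),
]

# Assumption: the intended labels alternate language, animal, language, ...
expected = ['language', 'animal'] * 5

# Sentence numbers where the prediction disagrees with the intended label.
misses = [n for (n, pred, _), exp in zip(predictions, expected) if pred != exp]
accuracy = (len(predictions) - len(misses)) / len(predictions)

# Sentence numbers where the model is less than 70% confident (arbitrary cutoff).
low_confidence = [n for n, _, conf in predictions if conf < 70]

print('misses:', misses)
print('accuracy:', accuracy)
print('low confidence:', low_confidence)
```

Under that reading, both disagreements fall among the low-confidence predictions, which suggests the predicted probability is a useful signal for deciding which sentences to review by hand.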