# Python demo: Bag-of-Words modeling entirely in `scikit-learn`

Bag-of-words should always be in your back pocket.  There will basically never be a situation where bag-of-words gives you no results but more complex models do.  Usually, a bag-of-words model will at least give you _something_, as long as your task can be said to depend on the _meaning of the words in the text._  Bag-of-words also gives you a lot of opportunities to inject expert and domain knowledge into the modeling process, which we'll see in the notebooks that follow this one.

Bag-of-words means representing our documents as _vectors of word counts._  I.e., our _features_ are "what words do you use, and how often do you use them?"  Bag-of-words ignores things like word order, syntactic relationships, etc.; while such things can definitely provide some useful information, a lot of real-world tasks see pretty aggressively diminishing returns from them.

This notebook will show the most quick-and-dirty way to do bag-of-words models in Python: using `scikit-learn`'s `CountVectorizer()`.

Pros:
- Extremely simple code--just a few lines to get a whole model up and running.
- Often surprisingly accurate for many tasks, despite its simplicity.  This makes a scikit-learn pipeline an excellent "first attempt" at a text problem, or a great baseline to compare more sophisticated models against.
- Very fast (with the right choice of model, e.g. Naive Bayes, as we're using here).

Cons:
- Not a lot of fine-grained control, which usually leads to leaving accuracy on the table.

In [1]:
# requirements
# !conda install --yes pandas scikit-learn

In [2]:
import pandas as pd

# load the data
train = pd.read_csv("../../data/train.csv")
test = pd.read_csv("../../data/test.csv")

train.head()

| Unnamed: 0 | review_id | product_id | reviewer_id | stars | review_body | review_title | language | product_category |
|---|---|---|---|---|---|---|---|---|
| 0 | en_0964290 | product_en_0740675 | reviewer_en_0342986 | 1 | Arrived broken. Manufacturer defect. Two of th... | I'll spend twice the amount of time boxing up ... | en | furniture |
| 1 | en_0690095 | product_en_0440378 | reviewer_en_0133349 | 1 | the cabinet dot were all detached from backing... | Not use able | en | home_improvement |
| 2 | en_0311558 | product_en_0399702 | reviewer_en_0152034 | 1 | I received my first order of this product and ... | The product is junk. | en | home |
| 3 | en_0044972 | product_en_0444063 | reviewer_en_0656967 | 1 | This product is a piece of shit. Do not buy. D... | Fucking waste of money | en | wireless |
| 4 | en_0784379 | product_en_0139353 | reviewer_en_0757638 | 1 | went through 3 in one day doesn't fit correct ... | bubble | en | pc |


`scikit-learn` has three main tools for doing the text-to-vector conversion:
- `sklearn.feature_extraction.text.CountVectorizer`, which represents each document as a vector of word counts.
- `sklearn.feature_extraction.text.TfidfVectorizer` and `TfidfTransformer`, which apply Term Frequency-Inverse Document Frequency scaling; this can help improve accuracy for some models.  (`TfidfVectorizer` is the same thing as `CountVectorizer` followed by `TfidfTransformer`.)
- `sklearn.feature_extraction.text.HashingVectorizer`, a version of `CountVectorizer` that uses the hashing trick to map directly from words to columns.  This can be a _lot_ faster for extremely large datasets, but it can also lead to _hash collisions_, where several words get mapped to the same feature/column.
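As a quick sanity check on the list above, the sketch below uses three made-up documents to confirm that `TfidfVectorizer` matches `CountVectorizer` followed by `TfidfTransformer` (with default settings), and to show that `HashingVectorizer` needs no fitting because the number of columns is fixed up front.  The value `n_features=8` is deliberately tiny and chosen only for illustration; it makes collisions likely.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Made-up documents, for illustration only.
docs = ["arrived broken", "works great", "broken on arrival, great refund"]

# One-step tf-idf vs. counts followed by tf-idf scaling.
tfidf_direct = TfidfVectorizer().fit_transform(docs)
tfidf_two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(same)  # True

# HashingVectorizer is stateless: no vocabulary is learned, so we can
# call transform() directly.  Every word is hashed into one of 8 columns.
hashed = HashingVectorizer(n_features=8).transform(docs)
print(hashed.shape)  # (3, 8)
```

Because there is no stored vocabulary, a `HashingVectorizer` also cannot tell you which word a given column corresponds to, which is part of the trade-off for its speed.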

We're just going to use `CountVectorizer`--feel free to swap it out for `TfidfVectorizer` or `HashingVectorizer` on your own and see how the results change.  We're also going to use a Bernoulli Naive Bayes model to do the classification, since it's extremely fast even on massive, sparse datasets, and it'll be accurate enough.  Feel free to swap this out for other models, but be aware that the large sparse matrices we get from bag-of-words transformations tend to make most models run very slowly (especially with multi-class classification like we're doing here).
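Swapping the vectorizer is a one-line change inside a pipeline.  The sketch below loops over the three options on a made-up four-review toy dataset (the texts, labels, and `n_features=2**10` are all assumptions for illustration, not the notebook's data or settings).

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for the review text and star ratings.
toy_texts = [
    "great product love it",
    "terrible broken junk",
    "love it works great",
    "junk arrived broken",
]
toy_stars = [5, 1, 5, 1]

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    # alternate_sign=False keeps the hashed features non-negative.
    "hashing": HashingVectorizer(n_features=2**10, alternate_sign=False),
}

toy_predictions = {}
for name, vectorizer in vectorizers.items():
    pipeline = Pipeline([("bag of words", vectorizer), ("clf", BernoulliNB())])
    pipeline.fit(toy_texts, toy_stars)
    toy_predictions[name] = pipeline.predict(["love it", "broken junk"])
    print(name, toy_predictions[name])
```

One thing to keep in mind: `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only sees whether a word is present or absent.  With this particular model, counts and tf-idf weights therefore collapse to the same features; the choice of vectorizer is likely to matter more with models that use the actual values.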

In [3]:
# first: a helper function to abstract the "fit + predict + score" logic.
from sklearn import metrics

def fit_and_score(clf, train, test):
    """fit the model `clf` to the `train` dataset and evaluate its
    performance on the `test` dataset."""
    clf.fit(train["review_body"], train["stars"])
    preds = clf.predict(test["review_body"])
    
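    # calculate some classification metrics
    # (note: scikit-learn's convention is `metric(y_true, y_pred)`, while
    # the calls here pass `(preds, y_true)`.  Accuracy, macro F1, and MAE
    # come out the same either way, but R2 depends on the order.)
    accuracy = metrics.accuracy_score(preds, test["stars"])
    f1 = metrics.f1_score(preds, test["stars"], average="macro")

    # and some regression metrics (since "predict the number of stars"
    # could reasonably be either kind of task).
    r2 = metrics.r2_score(preds, test["stars"])
    mae = metrics.mean_absolute_error(preds, test["stars"])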
    
    return pd.Series({"Accuracy": accuracy, "F1": f1, "R2": r2, "MAE": mae})
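scikit-learn's metric functions are documented as taking `(y_true, y_pred)`, in that order.  The sketch below uses made-up labels to check which of the four metrics used in the helper are sensitive to that order.

```python
from sklearn import metrics

# Made-up labels, for illustration only.
y_true = [1, 1, 3, 5, 5]
y_pred = [1, 3, 3, 3, 5]

# Accuracy, macro F1, and MAE give the same value in either order.
acc = (metrics.accuracy_score(y_true, y_pred),
       metrics.accuracy_score(y_pred, y_true))
f1 = (metrics.f1_score(y_true, y_pred, average="macro"),
      metrics.f1_score(y_pred, y_true, average="macro"))
mae = (metrics.mean_absolute_error(y_true, y_pred),
       metrics.mean_absolute_error(y_pred, y_true))

# R2 is normalized by the variance of its *first* argument, so it changes.
r2_expected_order = metrics.r2_score(y_true, y_pred)  # 0.5
r2_swapped_order = metrics.r2_score(y_pred, y_true)   # 0.0

print(acc, f1, mae)
print(r2_expected_order, r2_swapped_order)
```

So if R2 is the number you care about, it is worth double-checking the argument order before comparing scores across models.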

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# This is it--this is our pipeline.  CountVectorizer--dropping words
# that appear in >50% of our documents or <10 documents--followed
# by a Bernoulli Naive Bayes model.  Super simple, and super fast.
classifier = Pipeline([
    ("bag of words", CountVectorizer(max_df=0.5, min_df=10)),
    ("clf", BernoulliNB()),
])
fit_and_score(classifier, train, test).rename("Bag of Words + Bernoulli Naive Bayes")

Accuracy    0.453400
F1          0.433647
R2          0.315741
MAE         0.857600
Name: Bag of Words + Bernoulli Naive Bayes, dtype: float64

In [5]:
# fit a dummy classifier to check how much better than a random guess
# we are.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV

classifier = GridSearchCV(
    DummyClassifier(),
    param_grid={"strategy": ["most_frequent", "prior", "stratified", "uniform"]}
)
fit_and_score(classifier, train, test).rename("Dummy Classifier")

Accuracy    0.198800
F1          0.198675
R2         -0.975752
MAE         1.605800
Name: Dummy Classifier, dtype: float64
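For intuition about what a dummy baseline should score, the sketch below builds a made-up, perfectly balanced five-class label set and fits a `most_frequent` dummy to it.  The balanced labels are an assumption for illustration, not a claim about the review dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 20 examples of each star rating, 1 through 5.
toy_y = np.repeat([1, 2, 3, 4, 5], 20)
toy_X = np.zeros((len(toy_y), 1))  # DummyClassifier ignores the features

dummy = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
chance_accuracy = dummy.score(toy_X, toy_y)
print(chance_accuracy)  # 0.2
```

With five equally common classes, always guessing one class is right one time in five, so any real model should clear that bar comfortably.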

`scikit-learn` has a lot of options you can specify for the `CountVectorizer()` object.  You can filter tokens by frequency, capture n-grams, remove stopwords, apply stemming (by passing in your own tokenizer), etc.

As compact as the scikit-learn approach is, though, it wraps _all_ the language-y bits up in the `CountVectorizer()` and its options; we don't get a huge amount of freedom to muck around with the internals.  This is where we can use other libraries like Gensim and spaCy, which we'll see in the next notebook.

Despite its simplicity, the two-step pipeline we've used here is always a good tool to bust out for quick-and-dirty checks and testing.  It's fast, simple, and will usually give you a good baseline for model performance (though you should still always double-check against a dummy model to make sure you're doing better than a random guess).