Inspired by:
- https://www.datahubbs.com/tf-idf-starting-learning-text/
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

I chose the Porter stemmer because it does more than strip a trailing s (e.g. kids -> kid): it also reduces inflected forms such as playing and played to play. The stems it produces stay fairly legible, and it does not overstem much.
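To make that trade-off easier to check, here is a minimal sketch that runs the example words through three NLTK stemmers. The Snowball and Lancaster stemmers are not used anywhere else in this notebook; they are included here only as an assumed point of comparison.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Words taken from the test documents used later in this notebook
words = ['kids', 'playing', 'played']

# Stemmers chosen for illustration only; the notebook itself uses Porter
stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

# Map each stemmer name to the list of stems it produces
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}
for name, result in stems.items():
    print(name, result)
```

Running the same comparison on a larger sample of your own vocabulary is a quick way to judge how aggressive each stemmer is before committing to one.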

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [2]:
cv = CountVectorizer(stop_words='english',
                     ngram_range=(1, 2))

In [3]:
testdoc = pd.Series(data=['There are kids playing',
                          'Oh to be a kid who played'])
testdoc

0       There are kids playing
1    Oh to be a kid who played
dtype: object

In [4]:
transformed_nostem = cv.fit_transform(testdoc)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())
print(transformed_nostem.toarray())

['kid', 'kid played', 'kids', 'kids playing', 'oh', 'oh kid', 'played', 'playing']
[[0 0 1 1 0 0 0 1]
 [1 1 0 0 1 1 1 0]]


In [5]:
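from nltk.stem import PorterStemmer
port = PorterStemmer()
# Default word analyzer: lowercases and tokenizes, with no stop-word removal
analyzer = CountVectorizer().build_analyzer()

def stem_words(doc):
    # Stem every token produced by the default analyzer
    return [port.stem(word) for word in analyzer(doc)]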

In [6]:
# Caveat: with a callable analyzer, CountVectorizer ignores stop_words and ngram_range
cv_stem = CountVectorizer(stop_words='english', analyzer=stem_words, ngram_range=(1, 2))

In [7]:
transformed_withstem = cv_stem.fit_transform(testdoc)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv_stem.get_feature_names_out().tolist())
print(transformed_withstem.toarray())

['are', 'be', 'kid', 'oh', 'play', 'there', 'to', 'who']
[[1 0 1 0 1 1 0 0]
 [0 1 1 1 1 0 1 1]]
