# [How to Prepare Text Data for Machine Learning with scikit-learn](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)  

By Jason Brownlee on September 29, 2017, in Natural Language Processing

## Introduction

Text data requires special preparation before you can start using it for predictive modeling.  
The text must first be split into words, a step called tokenization.  
Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).  
The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.  
In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.  
After completing this tutorial, you will know:

- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to unique integers with HashingVectorizer.  

Let’s get started.

## Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.  
Instead, we need to convert the text to numbers.  
We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm.  
Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers.  
A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.  
This can be done by assigning each word a unique number.  
Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words.  
The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.  
This is the bag-of-words model: we are concerned only with encoding schemes that represent which words are present, or the degree to which they are present, in encoded documents, without any information about order.  
There are many ways to extend this simple method, both by better clarifying what a “word” is and by defining what to encode about each word in the vector.  
The scikit-learn library provides three different schemes that we can use, and we will briefly look at each.
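
To make the encoding described above concrete, here is a minimal hand-rolled sketch in plain Python. It is not how scikit-learn implements it: the toy documents, the whitespace tokenization, and the `encode` helper are all assumptions made for illustration.

```python
from collections import Counter

# Toy corpus (made up for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# Assign each distinct word a unique column index.
# Sorting gives a stable, repeatable order.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: index for index, word in enumerate(all_words)}

def encode(doc, vocab):
    """Return a fixed-length count vector; words outside vocab are ignored."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [encode(doc, vocab) for doc in docs]

print(vocab)    # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Every vector has one position per vocabulary word, so documents of different lengths all map to vectors of the same length, and word order plays no role in the result.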

## Word Counts with CountVectorizer

The [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.  
You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() method to learn a vocabulary from one or more documents.
3. Call the transform() method on one or more documents as needed to encode each as a vector.  

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.  
Because these vectors will contain a lot of zeros, we call them sparse.  
SciPy provides an efficient way of handling sparse vectors in its scipy.sparse package.  
A call to transform() returns a sparse matrix; to inspect it and better understand what is going on, you can convert it to a dense NumPy array by calling its toarray() method.  
Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Create a list of text documents:
text = ["The quick brown fox jumped over the lazy dog."]
# Create the transform:
vectorizer = CountVectorizer()
# Tokenize and build vocabulary:
vectorizer.fit(text)
# Summarize:
print("vectorizer.vocabulary: {}".format(vectorizer.vocabulary_))
# Encode the document:
vector = vectorizer.transform(text)
# Summarize the encoded vector:
print("vector.shape: {}".format(vector.shape))
print("type(vector): {}".format(type(vector)))
print("vector.toarray(): {}".format(vector.toarray()))
```

```
vectorizer.vocabulary: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
vector.shape: (1, 8)
type(vector): <class 'scipy.sparse.csr.csr_matrix'>
vector.toarray(): [[1 1 1 1 1 1 1 2]]
```
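
As a brief aside on the sparse representation described earlier, the sketch below encodes two documents and compares the number of stored non-zero values with the total number of cells. The second document is made up for illustration, and `fit_transform()` is used here as shorthand for calling `fit()` and then `transform()` on the same data.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Two documents; the second one is made up for illustration.
docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
]

vectorizer = CountVectorizer()
# Learn the vocabulary and encode the documents in one step.
matrix = vectorizer.fit_transform(docs)

print(issparse(matrix))  # True
print(matrix.shape)      # (2, 8): 2 documents, 8 vocabulary words
print(matrix.nnz)        # 10 non-zero counts are stored, out of 16 cells

# Convert to a dense NumPy array only for inspection.
dense = matrix.toarray()
print(dense)
```

With a realistic vocabulary of tens of thousands of words, most cells are zero, which is why the dense form is usually reserved for small examples like this one.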


The vocabulary learned by fit() is stored in the vocabulary_ attribute, and printing it shows exactly what was tokenized:

```python
print(vectorizer.vocabulary_)
```

```
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
```


We can see that all words were made lowercase by default and that the punctuation was ignored.  
These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.  
Running the example first prints the vocabulary, then the shape of the encoded document.  
We can see that there are 8 words in the vocabulary, and therefore encoded vectors have a length of 8.  
We can then see that the encoded vector is a sparse matrix.  
Finally, we can see an array version of the encoded vector showing a count of 1 for each word except “the” (index 7), which has a count of 2.

```python
print("vector.shape: {}".format(vector.shape))
print("type(vector): {}".format(type(vector)))
print("vector.toarray(): {}".format(vector.toarray()))
```

```
vector.shape: (1, 8)
type(vector): <class 'scipy.sparse.csr.csr_matrix'>
vector.toarray(): [[1 1 1 1 1 1 1 2]]
```
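
The configurable tokenizing behavior mentioned above can be sketched with a few constructor parameters. The three settings below (`lowercase`, `stop_words`, `ngram_range`) are an illustrative selection of my own, not a recommendation from this tutorial, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["The quick brown fox jumped over the lazy dog."]

# Keep the original casing, so "The" and "the" become separate entries.
cased = CountVectorizer(lowercase=False).fit(text)
print(sorted(cased.vocabulary_))

# Drop common English words using scikit-learn's built-in stop word list.
no_stop = CountVectorizer(stop_words="english").fit(text)
print(sorted(no_stop.vocabulary_))

# Count single words and pairs of adjacent words (unigrams and bigrams).
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(text)
print(sorted(bigrams.vocabulary_))
```

Each variant changes what counts as a “word”, and therefore the length and meaning of the encoded vectors, so the same settings must be used for training and for new documents.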


Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary.  
These words are ignored and no count is given in the resulting vector.  
Below is an example of using the vectorizer above to encode a document with one word in the vocabulary and one word that is not.

```python
# Encode another sample document:
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())
```

```
[[0 0 0 0 0 0 0 1]]
```


Running this example prints the array version of the encoded sparse vector, showing one occurrence of the one word in the vocabulary, while the word not in the vocabulary is ignored completely.  
The encoded vectors can then be used directly with a machine learning algorithm.
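
As a rough sketch of that last step, the example below feeds count vectors into a classifier. The training documents, the labels, and the choice of `MultinomialNB` are assumptions made for illustration; this tutorial does not prescribe a particular model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (made up for illustration).
train_docs = [
    "the dog barked at the mailman",
    "my dog chased the ball",
    "the cat purred on the sofa",
    "my cat ignored the ball",
]
train_labels = ["dog", "dog", "cat", "cat"]

# Learn the vocabulary and encode the training documents as count vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Fit a classifier directly on the sparse count vectors.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Encode a new document with the SAME fitted vectorizer.
# "puppy" and "and" are not in the vocabulary, so they are ignored.
new_docs = ["the puppy and the dog barked"]
X_new = vectorizer.transform(new_docs)
print(model.predict(X_new))  # ['dog']
```

Note that the new document is encoded with transform() only, never fit(), so its vector lines up column for column with the vectors the model was trained on.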