# [How to Prepare Text Data for Machine Learning with scikit-learn](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)  

## by Jason Brownlee on September 29, 2017 in Natural Language Processing

## Introduction

Text data requires special preparation before you can start using it for predictive modeling.  
The text must first be parsed to split it into words, a step called tokenization.  
Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).  
The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.  
In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.  
After completing this tutorial, you will know:

- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to hashed integer features with HashingVectorizer.  

Let’s get started.

## Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.  
Instead, we need to convert the text to numbers.  
We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm.  
Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers.  
A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.  
The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.  
This can be done by assigning each word a unique number.  
Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words.  
The value in each position in the vector can be filled with a count or frequency of the corresponding word in the encoded document. This is the bag-of-words model: we are only concerned with encoding schemes that represent which words are present, or the degree to which they are present, in the encoded documents, without any information about their order.
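To make the encoding concrete, here is a minimal pure-Python sketch of the idea. The helper names (`tokenize`, `build_vocabulary`, `encode`) and the example sentences are illustrative assumptions, not part of scikit-learn, and the tokenizer is deliberately naive.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'"

def tokenize(text):
    # Naive tokenizer (an assumption for illustration): split on whitespace,
    # strip surrounding punctuation, and lowercase each word.
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_vocabulary(documents):
    # Assign each distinct word a unique integer index.
    # Sorting keeps the assignment reproducible.
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def encode(document, vocabulary):
    # Encode a document as a fixed-length count vector.
    # Words that are not in the vocabulary are ignored.
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(document)).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

docs = ["The quick brown fox jumped over the lazy dog.", "The dog barked."]
vocab = build_vocabulary(docs)
vec = encode("the dog and the fox", vocab)
print(vocab)
print(vec)
```

Because only counts are stored, reordering the words of a document produces the same vector, which is exactly the order information the model discards.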

There are many ways to extend this simple method, both by clarifying what a “word” is and by defining what to encode about each word in the vector.

The scikit-learn library provides three different schemes that we can use, and we will briefly look at each.
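As a quick preview, the sketch below runs the three vectorizers named in the tutorial overview on two toy documents and compares the shapes of their outputs. The example sentences and the choice of `n_features=20` are assumptions made for illustration, and default settings are used everywhere else.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

docs = ["The quick brown fox jumped over the lazy dog.", "The dog barked."]

# Word counts: learn a vocabulary, then count each word per document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: the same vocabulary, but counts are reweighted so that words
# common to many documents contribute less.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)

# Hashing: no vocabulary is stored; words are hashed into a fixed number
# of columns (20 here, an arbitrary choice), so no fitting is needed.
hash_vec = HashingVectorizer(n_features=20)
hashed = hash_vec.transform(docs)

print(counts.shape, weights.shape, hashed.shape)
print(count_vec.vocabulary_)
```

All three return sparse matrices with one row per document. The first two have one column per vocabulary word, while the hashing output has a fixed width regardless of how many distinct words appear.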