# One-hot encoding
In the one-hot encoding approach, a new dummy feature is created for each
unique value in the nominal feature column.
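
As a minimal sketch of this idea in plain Python: the `colors` column and its values are hypothetical, chosen only for illustration. In practice a library helper such as `pandas.get_dummies` would normally be used instead.

```python
# Hypothetical nominal feature column (illustrative data, not from the text)
colors = ['red', 'green', 'blue', 'green']

# One dummy feature per unique value, in sorted order
categories = sorted(set(colors))

# Each row gets a 1 in the column of its own value and 0 everywhere else
one_hot = [[int(value == category) for category in categories]
           for value in colors]

print(categories)  # ['blue', 'green', 'red']
for row in one_hot:
    print(row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.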

# Bag-of-words
The idea behind the bag-of-words model can be summarized as follows:
1. We create a vocabulary of unique words from the entire set of documents.
2. We construct a feature vector for each document that contains the number of
times each vocabulary word occurs in that document.
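
The two steps can be sketched by hand in plain Python. This is only an illustration under simplifying assumptions: the `tokenize` helper is hypothetical, it lowercases the text, treats commas as spaces, and splits on whitespace, and no stop words are removed.

```python
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet, sweet',
        'The sun is shining and the weather is sweet']

def tokenize(text):
    # Simplifying assumption: lowercase, drop commas, split on whitespace
    return text.lower().replace(',', ' ').split()

# Step 1: vocabulary of unique words across all documents
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# Step 2: one count vector per document, one entry per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Because nothing is filtered out here, frequent words such as "the" and "is" also get their own columns.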

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

## Create a set of documents

In [2]:
docs = np.array([
    'The sun is shining',
    'The weather is sweet, sweet',
    'The sun is shining and the weather is sweet'])

## Vectorize the documents

In [3]:
vectorizer = CountVectorizer(stop_words='english')
# stop_words='english' removes common English stop words:
# articles such as "the", "a", and "an";
# auxiliary verbs such as "do", "be", and "will";
# and prepositions such as "on", "around", and "beneath".

In [4]:
bag = vectorizer.fit_transform(docs).toarray()
bag

array([[1, 1, 0, 0],
       [0, 0, 2, 1],
       [1, 1, 1, 1]])

## Vocabulary

In [5]:
vectorizer.vocabulary_

{u'shining': 0, u'sun': 1, u'sweet': 2, u'weather': 3}
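
The vocabulary maps each word to a column index of the count array. As a small sketch, with the values copied from the outputs above and hard-coded so the snippet runs on its own, the counts of each document can be labeled with their words:

```python
# Values copied from the outputs above
vocabulary = {'shining': 0, 'sun': 1, 'sweet': 2, 'weather': 3}
bag = [[1, 1, 0, 0],
       [0, 0, 2, 1],
       [1, 1, 1, 1]]

# Order the words by their column index
columns = sorted(vocabulary, key=vocabulary.get)

# Pair each count with its word, keeping only words that occur in the document
labeled = [{word: count for word, count in zip(columns, row) if count > 0}
           for row in bag]

for entry in labeled:
    print(entry)
```

For example, the second row reads as "sweet" occurring twice and "weather" once, matching the second document.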