## BoW Dictionaries
One of the most common ways to implement the BoW model in Python is as a **dictionary with each key set to a word and each value set to the number of times that word appears.**

The words of a text go into the bag-of-words and come out as a dictionary of words with their corresponding counts. For statistical models, the text used to build the model is called the training data. Usually the text is first broken up into documents (shorter strings of text, generally sentences).

**Let’s build a function that converts a given training text into a bag-of-words!**
```python
from preprocessing import preprocess_text
# Define text_to_bow() below:
def text_to_bow(some_text):
  bow_dictionary = {}
  tokens = preprocess_text(some_text)
  for token in tokens:
    if token in bow_dictionary:
      bow_dictionary[token] += 1
    else:
      bow_dictionary[token] = 1

  return bow_dictionary

print(text_to_bow("I love fantastic flying fish. These flying fish are just ok, so maybe I will find another few fantastic fish..."))
```
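
The `preprocessing` module imported above is not shown in these notes. To run the idea end to end, here is a minimal sketch with a stand-in tokenizer. The name `simple_preprocess_text` and its behavior are assumptions: it only lowercases and extracts runs of letters, whereas the lesson's `preprocess_text` appears to also normalize word forms (the features dictionary later in these notes contains `fly` and `function`), so exact tokens and counts may differ.

```python
import re

def simple_preprocess_text(text):
  """Hypothetical stand-in for preprocess_text: lowercase, then tokenize."""
  # Keep runs of letters and apostrophes; punctuation and digits are dropped.
  return re.findall(r"[a-z']+", text.lower())

def text_to_bow(some_text):
  bow_dictionary = {}
  for token in simple_preprocess_text(some_text):
    # dict.get supplies 0 the first time a token is seen.
    bow_dictionary[token] = bow_dictionary.get(token, 0) + 1
  return bow_dictionary

bow = text_to_bow("I love fantastic flying fish. These flying fish are just ok, so maybe I will find another few fantastic fish...")
print(bow["fish"], bow["flying"], bow["i"])  # 3 2 2
```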

## BoW Vectors
A feature vector is a numeric representation of an item’s important features. Each feature has its own column. If the feature exists for the item, you could represent that with a `1`. If the feature does not exist for that item, you could represent that with a `0`.

Turning text into a BoW vector is known as **feature extraction or vectorization.** When building BoW vectors, we generally create a **features dictionary** of all vocabulary in our training data (usually several documents) mapped to indices.
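
To make the `1`/`0` idea concrete, here is a small sketch. The `features` list and the helper `to_presence_vector` are made up for illustration; any fixed, ordered vocabulary would work the same way.

```python
# Hypothetical feature columns, one per word, in a fixed order.
features = ["fish", "fantastic", "five", "function"]

def to_presence_vector(tokens, features):
  """Return 1 for each feature that appears in tokens, else 0."""
  present = set(tokens)
  return [1 if feature in present else 0 for feature in features]

tokens = ["five", "fantastic", "fish", "fish"]
print(to_presence_vector(tokens, features))  # [1, 1, 1, 0]
```

A BoW vector uses the same column layout but stores how many times each word occurs rather than only whether it occurs.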

## Building a Features Dictionary
```python
from preprocessing import preprocess_text
# Define create_features_dictionary() below:
def create_features_dictionary(documents):
  features_dictionary = {}
  merged = " ".join(documents)
  tokens = preprocess_text(merged)
  index = 0
  for token in tokens:
    if token not in features_dictionary:
      features_dictionary[token] = index
      index += 1

  return features_dictionary, tokens

training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]

print(create_features_dictionary(training_documents)[0])
```

## Building a BoW Vector
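A BoW vector is a list with one slot per entry in the features dictionary. Each index in the list corresponds to a word, and the value at that index is set to the word's count in the text.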

```python
from preprocessing import preprocess_text
# Define text_to_bow_vector() below:
def text_to_bow_vector(some_text, features_dictionary):
  bow_vector = [0] * len(features_dictionary) # list of 0's length of dict
  tokens = preprocess_text(some_text)
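  for token in tokens:
    # Tokens never seen in training have no index, so skip them
    # instead of raising a KeyError.
    if token in features_dictionary:
      feature_index = features_dictionary[token]
      bow_vector[feature_index] += 1
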
  return bow_vector, tokens


features_dictionary = {'function': 8, 'please': 14, 'find': 6, 'five': 0, 'with': 12, 'fantastic': 1, 'my': 11, 'another': 10, 'a': 13, 'maybe': 9, 'to': 5, 'off': 4, 'faraway': 7, 'fish': 2, 'fly': 3}

text = "Another five fish find another faraway fish."
print(text_to_bow_vector(text, features_dictionary)[0])
```
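
A features dictionary only covers words seen in the training data, so new text can contain tokens that have no index. One simple option is to leave those tokens out of the vector. The sketch below does that and also reports what was skipped; the helper name `vectorize_tokens`, the tiny dictionary, and the choice to skip unknown words are assumptions for illustration.

```python
def vectorize_tokens(tokens, features_dictionary):
  """Count known tokens into a vector and collect tokens with no index."""
  bow_vector = [0] * len(features_dictionary)
  unknown = []
  for token in tokens:
    index = features_dictionary.get(token)  # None when the token is unseen
    if index is None:
      unknown.append(token)
    else:
      bow_vector[index] += 1
  return bow_vector, unknown

# Hypothetical, shortened features dictionary.
small_features = {"five": 0, "fantastic": 1, "fish": 2, "find": 3}
vector, unknown = vectorize_tokens(["find", "five", "fish", "fish", "sharks"], small_features)
print(vector)   # [1, 0, 2, 1]
print(unknown)  # ['sharks']
```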

As with most tasks in Python, there are already libraries that can do all of the above work for you.

For `text_to_bow()`, you can approximate the functionality with the `Counter` class from the `collections` module:
```python
from collections import Counter

tokens = ['another', 'five', 'fish', 'find', 'another', 'faraway', 'fish']
print(Counter(tokens))

# On Python 3.7+, ties keep first-seen order:
# Counter({'another': 2, 'fish': 2, 'five': 1, 'find': 1, 'faraway': 1})
```

For vectorization, you can use `CountVectorizer` from the machine learning library `scikit-learn`. You can use `fit()` to train the features dictionary and then `transform()` to transform text into a vector:
```python
from sklearn.feature_extraction.text import CountVectorizer
 
training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]
test_text = ["Another five fish find another faraway fish."]
bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(training_documents)
bow_vector = bow_vectorizer.transform(test_text)
print(bow_vector.toarray())
# [[2 0 1 1 2 1 0 0 0 0 0 0 0 0 0]]
```
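
To see where that output comes from, the sketch below reproduces it in plain Python. It relies on two of `CountVectorizer`'s defaults: text is lowercased and tokenized with the pattern `(?u)\b\w\w+\b` (tokens of two or more word characters, which is why "a" gets no column), and features are indexed in sorted order. It is an approximation for illustration, not scikit-learn's implementation, and the helper names are made up.

```python
import re

# Two or more word characters, matching CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
  return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
  """Map each unique token to its position in sorted order."""
  unique_tokens = sorted({token for doc in documents for token in tokenize(doc)})
  return {token: index for index, token in enumerate(unique_tokens)}

def transform(text, vocabulary):
  vector = [0] * len(vocabulary)
  for token in tokenize(text):
    if token in vocabulary:  # tokens not seen during fitting are ignored
      vector[vocabulary[token]] += 1
  return vector

training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]
vocabulary = fit_vocabulary(training_documents)
print(transform("Another five fish find another faraway fish.", vocabulary))
# [2, 0, 1, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because there is no lemmatization here, "flew", "function", and "functions" each get their own column, unlike the hand-built features dictionary earlier in these notes.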

#### script.py
```python
from spam_data import training_spam_docs, training_doc_tokens, training_labels, test_labels, test_spam_docs, training_docs, test_docs
from sklearn.naive_bayes import MultinomialNB
# Import CountVectorizer from sklearn:
from sklearn.feature_extraction.text import CountVectorizer

# Define bow_vectorizer:
bow_vectorizer = CountVectorizer()

# Define training_vectors:
training_vectors = bow_vectorizer.fit_transform(training_docs)
# Define test_vectors:
test_vectors = bow_vectorizer.transform(test_docs)

spam_classifier = MultinomialNB()

def spam_or_not(label):
  return "spam" if label else "not spam"

# Train the classifier on the training vectors:
spam_classifier.fit(training_vectors, training_labels)

# score() returns the mean accuracy on the test set; predict() returns one label per document.
accuracy = spam_classifier.score(test_vectors, test_labels)
predictions = spam_classifier.predict(test_vectors)

print("The predictions for the test data were {0}% accurate.\n\nFor example, '{1}' was classified as {2}.\n\nMeanwhile, '{3}' was classified as {4}.".format(accuracy * 100, test_docs[7], spam_or_not(predictions[7]), test_docs[15], spam_or_not(predictions[15])))
```

## Review of Bag-of-Words
You made it! And you’ve learned plenty about the bag-of-words language model along the way:

- Bag-of-words (BoW) — also referred to as the unigram model — is a statistical language model based on word count.
- There are loads of real-world applications for BoW.
- BoW can be implemented as a Python dictionary with each key set to a word and each value set to the number of times that word appears in a text.
- For BoW, training data is the text that is used to build a BoW model.
- BoW test data is the new text that is converted to a BoW vector using a trained features dictionary.
- A feature vector is a numeric depiction of an item’s salient features.
- Feature extraction (or vectorization) is the process of turning text into a BoW vector.
- A features dictionary is a mapping of each unique word in the training data to a unique index. This is used to build out BoW vectors.
- Because it counts single words rather than longer word sequences, BoW has less data sparsity than statistical models such as higher-order n-grams. It also suffers less from overfitting.
- BoW has higher perplexity than other models, making it less ideal for language prediction.
- One solution to overfitting is language smoothing, in which a bit of probability is taken from known words and allotted to unknown words.
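
The smoothing idea in the last bullet can be sketched with add-one (Laplace) smoothing on word counts. This is one simple smoothing scheme chosen for illustration, not necessarily the one the lesson has in mind, and reserving a single extra vocabulary slot for all unseen words is an assumption of this sketch.

```python
from collections import Counter

def smoothed_unigram_probability(word, counts, vocabulary_size):
  """Add-one estimate: (count + 1) / (total tokens + vocabulary size)."""
  total = sum(counts.values())
  # Counter returns 0 for words it has never seen.
  return (counts[word] + 1) / (total + vocabulary_size)

tokens = ["five", "fantastic", "fish", "find", "fish"]
counts = Counter(tokens)
# Known word types plus one reserved slot for any unseen word.
vocabulary_size = len(counts) + 1

p_fish = smoothed_unigram_probability("fish", counts, vocabulary_size)
p_unseen = smoothed_unigram_probability("shark", counts, vocabulary_size)
print(p_fish, p_unseen)  # 0.3 0.1
```

Without smoothing, "fish" would get 2/5 = 0.4 and "shark" would get 0. Smoothing moves a little probability from the known words to the unseen one.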

The spam data for this lesson were taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms%20spam%20collection).

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository http://archive.ics.uci.edu/ml

```python
from spam_data import training_spam_docs, training_doc_tokens, training_labels, training_docs
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

test_text = """
Play around with the spam classifier!
"""

bow_vectorizer = CountVectorizer()

training_vectors = bow_vectorizer.fit_transform(training_docs)
test_vectors = bow_vectorizer.transform([test_text])

spam_classifier = MultinomialNB()
spam_classifier.fit(training_vectors, training_labels)

predictions = spam_classifier.predict(test_vectors)

print("Looks like a normal email!" if predictions[0] == 0 else "You've got spam!")
```
