# Feature Extraction - Bag of Words

This is the paragraph we will work with:

*Bertrand Piccard and André Borschberg spent more than 10 years and 150 million euros in the research and development of the Solar Impulse solar-powered plane, and they are not ready to stop. During their journey from concept to successful flight, the pair delved deeply into the world of energy-efficient batteries and green technology, becoming experts in both fields. Using this knowledge and experience, Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel. “The future is clean. The future is you. The future is now. Let’s take it further,” said Solar Impulse co-founder Piccard following the completion of the groundbreaking flight.*

Taken from http://www.digitaltrends.com/cool-tech/solar-impulse-round-world-flight-finished/#ixzz4FoSpSgIC 

We are going to:

* Remove punctuation and numbers
* Lowercase
* Tokenize the text
* Remove stop words
* Count the word frequencies
* Normalize with TF-IDF



In [7]:
original_text = "Bertrand Piccard and André Borschberg spent more than 10 years and 150 million euros in the research and development of the Solar Impulse solar-powered plane, and they are not ready to stop. During their journey from concept to successful flight, the pair delved deeply into the world of energy-efficient batteries and green technology, becoming experts in both fields. Using this knowledge and experience, Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel. “The future is clean. The future is you. The future is now. Let’s take it further,” said Solar Impulse co-founder Piccard following the completion of the groundbreaking flight."
print(original_text)

Bertrand Piccard and André Borschberg spent more than 10 years and 150 million euros in the research and development of the Solar Impulse solar-powered plane, and they are not ready to stop. During their journey from concept to successful flight, the pair delved deeply into the world of energy-efficient batteries and green technology, becoming experts in both fields. Using this knowledge and experience, Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel. “The future is clean. The future is you. The future is now. Let’s take it further,” said Solar Impulse co-founder Piccard following the completion of the groundbreaking flight.


# 1. Remove Punctuation and numbers
Equivalent to keeping only letters

We'll use a regex library: **re** and **re.sub(pattern, repl, string)**


In [5]:
import re

text = re.sub("[^a-zA-Z]",    # The pattern to search for: any character that is not an ASCII letter
        " ",                   # The string that replaces each match
        original_text )        # The text to search
print(text)

Bertrand Piccard and Andr  Borschberg spent more than    years and     million euros in the research and development of the Solar Impulse solar powered plane  and they are not ready to stop  During their journey from concept to successful flight  the pair delved deeply into the world of energy efficient batteries and green technology  becoming experts in both fields  Using this knowledge and experience  Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel   The future is clean  The future is you  The future is now  Let s take it further   said Solar Impulse co founder Piccard following the completion of the groundbreaking flight 


# Alternatives

The accented letter is missing: "André" became "Andr", because "é" is not in the range a-zA-Z.

Many alternatives: http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python

* Extend the list of accepted letters with: àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ
* Use a list of punctuation signs (string.punctuation) and remove the characters that appear in it
* ...
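
A variation on the first option, sketched here as an assumption rather than as part of the original notebook, avoids enumerating accented letters by hand. In Python 3, `\w` in a `str` pattern matches Unicode word characters, so the class `[\W\d_]` matches everything that is not a letter. The helper name `keep_letters` and the sample sentence are illustrative only.

```python
import re

def keep_letters(s):
    """Replace every run of non-letter characters with a single space."""
    # \W = non-word characters, \d = digits, _ = underscore (which \w would keep)
    return re.sub(r"[\W\d_]+", " ", s).strip()

sample = "André Borschberg spent more than 10 years on the solar-powered plane."
cleaned = keep_letters(sample)
print(cleaned)
```

Unlike the `string.punctuation` approach below, this replaces the hyphen with a space, so "solar-powered" yields two words instead of "solarpowered".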

In [16]:
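import string
# string.punctuation only contains ASCII punctuation, so curly quotes and digits are kept
exclude = list(string.punctuation)
# Keep every character that is not in the exclusion list
text = ''.join(ch for ch in original_text if ch not in exclude)
print(text)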

Bertrand Piccard and André Borschberg spent more than 10 years and 150 million euros in the research and development of the Solar Impulse solarpowered plane and they are not ready to stop During their journey from concept to successful flight the pair delved deeply into the world of energyefficient batteries and green technology becoming experts in both fields Using this knowledge and experience Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel “The future is clean The future is you The future is now Let’s take it further” said Solar Impulse cofounder Piccard following the completion of the groundbreaking flight


In [23]:
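numbers = [str(n) for n in range(10)]   # the digits '0' to '9'
# Also exclude the curly quotes, which are not part of string.punctuation, and the digits
exclude = list(string.punctuation + '“’”') + numbers
text = ''.join(ch for ch in original_text if ch not in exclude)
print(text)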


Bertrand Piccard and André Borschberg spent more than  years and  million euros in the research and development of the Solar Impulse solarpowered plane and they are not ready to stop During their journey from concept to successful flight the pair delved deeply into the world of energyefficient batteries and green technology becoming experts in both fields Using this knowledge and experience Piccard and Borschberg plan to focus on applying what they learned to the broader areas of aviation and travel The future is clean The future is you The future is now Lets take it further said Solar Impulse cofounder Piccard following the completion of the groundbreaking flight


# Lowercase + Tokenize

In [25]:
text = text.lower().split()
print(text)

['bertrand', 'piccard', 'and', 'andré', 'borschberg', 'spent', 'more', 'than', 'years', 'and', 'million', 'euros', 'in', 'the', 'research', 'and', 'development', 'of', 'the', 'solar', 'impulse', 'solarpowered', 'plane', 'and', 'they', 'are', 'not', 'ready', 'to', 'stop', 'during', 'their', 'journey', 'from', 'concept', 'to', 'successful', 'flight', 'the', 'pair', 'delved', 'deeply', 'into', 'the', 'world', 'of', 'energyefficient', 'batteries', 'and', 'green', 'technology', 'becoming', 'experts', 'in', 'both', 'fields', 'using', 'this', 'knowledge', 'and', 'experience', 'piccard', 'and', 'borschberg', 'plan', 'to', 'focus', 'on', 'applying', 'what', 'they', 'learned', 'to', 'the', 'broader', 'areas', 'of', 'aviation', 'and', 'travel', 'the', 'future', 'is', 'clean', 'the', 'future', 'is', 'you', 'the', 'future', 'is', 'now', 'lets', 'take', 'it', 'further', 'said', 'solar', 'impulse', 'cofounder', 'piccard', 'following', 'the', 'completion', 'of', 'the', 'groundbreaking', 'flight']
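
Calling `split()` without an argument matters here: it splits on any run of whitespace, so the double spaces left behind by the removed numbers do not produce empty tokens. A small sketch of the difference, on a made-up fragment (the variable names are illustrative):

```python
# Four spaces in a row, as left behind when digits are replaced by spaces
spaced = "More than    years"

tokens_default = spaced.lower().split()     # splits on any run of whitespace
tokens_single = spaced.lower().split(" ")   # splits on every single space

print(tokens_default)  # ['more', 'than', 'years']
print(tokens_single)   # ['more', 'than', '', '', '', 'years']
```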


# Remove Stop words

Several alternatives

* Manually 
* Using NLTK

## Installing and Downloading NLTK
        pip install nltk      # in a shell
        nltk.download()       # in Python, after `import nltk`; opens the NLTK downloader


In [31]:
# Manually: a short hand-written list (shown for comparison, not used below)

my_stopwords = ['and', 'in', 'of','the']

# Using NLTK

import nltk
from nltk.corpus import stopwords # Import the stop word list
nltk_stopwords = stopwords.words("english")

print('-------------------------------------------- STOPWORDS:')
print(nltk_stopwords)
print('-------------------------------------------- TEXT:')
# Keep only the tokens that are not in the NLTK stop word list
text = [w for w in text if w not in nltk_stopwords]
print(text)


-------------------------------------------- STOPWORDS:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very
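
The manual alternative is defined in the cell above but not applied. A minimal sketch of how it could be used, assuming a small hand-picked token list rather than the full paragraph; a `set` is used here instead of a list because membership tests are faster:

```python
# Hand-written stop words, same words as my_stopwords above, stored as a set
my_stopwords = {'and', 'in', 'of', 'the'}

# Illustrative tokens, not the notebook's full token list
tokens = ['the', 'future', 'is', 'clean', 'and', 'the', 'future', 'is', 'now']

# Keep only the tokens that are not stop words
filtered = [w for w in tokens if w not in my_stopwords]
print(filtered)  # ['future', 'is', 'clean', 'future', 'is', 'now']
```

With such a short list, words like "is" survive; the NLTK list is far more complete.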

# Vectorize



In [41]:
# Sort the text
text.sort()
print('-------------------------------------------- Sorted Text:')
print(text)

# Create a list of (word, count) pairs, one per distinct word
counts = [ (w,text.count(w)) for w in set(text)] 

counts

-------------------------------------------- Sorted Text:
['andré', 'applying', 'areas', 'aviation', 'batteries', 'becoming', 'bertrand', 'borschberg', 'borschberg', 'broader', 'clean', 'cofounder', 'completion', 'concept', 'deeply', 'delved', 'development', 'energyefficient', 'euros', 'experience', 'experts', 'fields', 'flight', 'flight', 'focus', 'following', 'future', 'future', 'future', 'green', 'groundbreaking', 'impulse', 'impulse', 'journey', 'knowledge', 'learned', 'lets', 'million', 'pair', 'piccard', 'piccard', 'piccard', 'plan', 'plane', 'ready', 'research', 'said', 'solar', 'solar', 'solarpowered', 'spent', 'stop', 'successful', 'take', 'technology', 'travel', 'using', 'world', 'years']


[('successful', 1),
 ('solar', 2),
 ('fields', 1),
 ('future', 3),
 ('becoming', 1),
 ('learned', 1),
 ('experience', 1),
 ('bertrand', 1),
 ('borschberg', 2),
 ('following', 1),
 ('take', 1),
 ('broader', 1),
 ('ready', 1),
 ('energyefficient', 1),
 ('aviation', 1),
 ('batteries', 1),
 ('clean', 1),
 ('areas', 1),
 ('travel', 1),
 ('development', 1),
 ('pair', 1),
 ('focus', 1),
 ('technology', 1),
 ('solarpowered', 1),
 ('said', 1),
 ('world', 1),
 ('million', 1),
 ('green', 1),
 ('using', 1),
 ('euros', 1),
 ('spent', 1),
 ('impulse', 2),
 ('plan', 1),
 ('research', 1),
 ('journey', 1),
 ('delved', 1),
 ('concept', 1),
 ('deeply', 1),
 ('knowledge', 1),
 ('experts', 1),
 ('andré', 1),
 ('piccard', 3),
 ('plane', 1),
 ('groundbreaking', 1),
 ('stop', 1),
 ('applying', 1),
 ('years', 1),
 ('lets', 1),
 ('flight', 2),
 ('cofounder', 1),
 ('completion', 1)]
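
The same counts can be obtained with `collections.Counter` from the standard library, which scans the list once instead of calling `count` for every distinct word. A short sketch on an illustrative token list (not the notebook's `text` variable):

```python
from collections import Counter

# Illustrative tokens; in the notebook this would be the filtered `text` list
tokens = ['future', 'piccard', 'future', 'solar', 'piccard', 'future']

word_counts = Counter(tokens)          # maps each word to its number of occurrences
top_two = word_counts.most_common(2)   # the two most frequent words, highest count first

print(word_counts['future'])  # 3
print(top_two)                # [('future', 3), ('piccard', 2)]
```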

# And now in one line!

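* scikit-learn: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
* nltk

Note that CountVectorizer's default tokenization keeps the numbers ('10', '150') and splits hyphenated words ('energy', 'efficient'), so the vocabulary differs slightly from the manual result.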



In [51]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words= 'english')
X = vectorizer.fit_transform([original_text])
print(vectorizer.vocabulary_)
X.toarray()


{'150': 1, 'successful': 46, 'solar': 43, 'fields': 21, 'learned': 32, 'experience': 19, 'bertrand': 7, 'borschberg': 8, 'let': 33, 'future': 26, 'broader': 9, 'ready': 40, 'aviation': 5, 'batteries': 6, 'energy': 17, 'andré': 2, 'travel': 48, 'areas': 4, 'development': 15, 'pair': 35, 'focus': 23, 'technology': 47, 'world': 50, 'said': 42, 'million': 34, 'green': 27, 'journey': 30, 'using': 49, '10': 0, 'euros': 18, 'clean': 10, 'impulse': 29, 'research': 41, 'plan': 37, 'delved': 14, 'concept': 12, 'deeply': 13, 'knowledge': 31, 'following': 24, 'experts': 20, 'piccard': 36, 'powered': 39, 'efficient': 16, 'groundbreaking': 28, 'stop': 45, 'applying': 3, 'spent': 44, 'years': 51, 'founder': 25, 'flight': 22, 'plane': 38, 'completion': 11}


array([[1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 3,
        1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)
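
The index stored in `vocabulary_` is the column of that word in the count matrix. A minimal sketch of how the two could be joined back into per-word counts, using a short made-up document rather than the paragraph above (the names `doc`, `cv` and `word_counts` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up document, chosen so the counts are easy to verify by eye
doc = "solar flight future future future solar"

cv = CountVectorizer()
counts_matrix = cv.fit_transform([doc])   # sparse matrix with one row per document
row = counts_matrix.toarray()[0]          # dense counts for the single document

# vocabulary_ maps each word to its column index in the count matrix
word_counts = {word: int(row[col]) for word, col in cv.vocabulary_.items()}
print(word_counts)
```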

In [55]:
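# check: columns 26, 36 and 43 of the array hold a count of 3,
# and 'future' should map to one of them
import numpy as np
np.where(X.toarray() == 3)         # not displayed: only the last expression of a cell is shown
vectorizer.vocabulary_['future']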

26
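
The last step of the plan, TF-IDF normalization, only makes sense with several documents, because the IDF part down-weights words that appear in many of them. A hedged sketch using scikit-learn's `TfidfVectorizer` on three made-up sentences (the documents and variable names are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus: 'the' occurs in every document, 'plane' in only one
docs = [
    "the solar plane completed the flight",
    "the solar panel powers the house",
    "the flight was delayed",
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)   # one row per document, one column per word

# idf_ holds the inverse document frequency of each column
idf_by_word = {word: tfidf.idf_[col] for word, col in tfidf.vocabulary_.items()}
print(idf_by_word['the'], idf_by_word['plane'])

# With the default settings each row is scaled to unit Euclidean length
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)
```

A word present in every document, such as "the", gets the lowest IDF, while a word found in a single document gets a higher one.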