## Learning from text

* Bag of words is the same thing as a count vectorizer: each term (each word) is weighted by how often it occurs in a document
* [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) - counts the number of times each word shows up in each document of a corpus



In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer()

In [2]:
string1 = 'hi Katie the self driving car will be late Best Sebastian'
string2 = 'Hi Sebastian the machine learning class will be great great great Best Katie'
string3 = 'Hi Katie the machine learning class will be most excellent'
email_list = [string1, string2, string3]

In [3]:
vectorizer.fit(email_list)  # learn the vocabulary from the corpus
bag_of_words = vectorizer.transform(email_list)  # sparse document-term count matrix

In [4]:
print(bag_of_words)  # each entry reads: (document index, word index) count

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 6)	3
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 11)	1
  (1, 13)	1
  (1, 15)	1
  (1, 16)	1
  (2, 0)	1
  (2, 3)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (2, 10)	1
  (2, 11)	1
  (2, 12)	1
  (2, 15)	1
  (2, 16)	1


In [5]:
print(vectorizer.vocabulary_.get("great"))  # column index of "great"; get_feature_names_out() lists every word in recent scikit-learn releases

6
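To make the indices above less opaque, here is a minimal standard-library sketch of what a count vectorizer does. It assumes the default behavior is roughly "lowercase, keep tokens of two or more word characters, sort the vocabulary alphabetically"; the names `tokenize`, `vocabulary` and `counts` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

emails = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie the machine learning class will be most excellent",
]

def tokenize(text):
    # Assumption: lowercase and keep runs of two or more word characters,
    # which approximates CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_words = sorted({word for email in emails for word in tokenize(email)})
vocabulary = {word: index for index, word in enumerate(all_words)}

# One row per document, one column per vocabulary word.
counts = []
for email in emails:
    word_counts = Counter(tokenize(email))
    counts.append([word_counts[word] for word in all_words])

print(vocabulary["great"])             # 6
print(counts[1][vocabulary["great"]])  # 3
```

Under these assumptions "great" lands in column 6 and is counted three times in the second email, which matches the `(1, 6)	3` entry printed by the vectorizer.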


* Vocabulary: not all words are equal; some words carry more information than others (the low-information words are called **stopwords**);
* It can be valuable to remove these words so that they do not become noise in the dataset;
* Get a list of stopwords using NLTK (the Natural Language Toolkit)

In [6]:
import nltk
nltk.download()  # opens the interactive downloader; nltk.download("stopwords") fetches only the stopword corpus

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
from nltk.corpus import stopwords

sw = stopwords.words("english")
len(sw)

179
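As a rough sketch of how such a list gets used, the snippet below filters one of the emails against a small stopword set. The set `sample_stopwords` is a hand-picked, hypothetical subset for illustration only; in practice it would be replaced by `sw` from the cell above. The helper `remove_stopwords` is likewise an illustrative name.

```python
# Hypothetical, hand-picked subset of stopwords for illustration only;
# in practice this would be the list from stopwords.words("english").
sample_stopwords = {"the", "will", "be", "most"}

def remove_stopwords(text, stopword_set):
    # Lowercase, split on whitespace, and drop every word in the stopword set.
    return [word for word in text.lower().split() if word not in stopword_set]

email = "Hi Katie the machine learning class will be most excellent"
filtered = remove_stopwords(email, sample_stopwords)
print(filtered)
# ['hi', 'katie', 'machine', 'learning', 'class', 'excellent']
```

Converting the list to a `set` first keeps each membership check cheap when the corpus is large.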

* Vocabulary: not all unique words are really different
* Several words can share the same root; a stemming algorithm (a `stemmer`) reduces each of them to that common stem

In [8]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

In [9]:
stemmer.stem("responsiveness")

u'respons'

In [10]:
stemmer.stem("responsivity")

u'respons'

In [11]:
stemmer.stem("unresponsive")

u'unrespons'

## Order of operations in text processing

1. stemming
2. bag-of-words representation

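If you build the bag-of-words representation before you stem, there is no point in stemming afterwards: variants of the same word (for example "responsive" and "responsiveness") have already been counted as separate features, so the same root ends up repeated many times in your bag-of-words representation.

Stemming operates on individual word strings, whereas the bag-of-words representation is a matrix covering many documents, so stemming has to happen while the text is still a string.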
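The effect of the ordering can be sketched with the standard library alone. Note that `toy_stem` below is a hypothetical stand-in that only strips three suffixes; it is not the Snowball algorithm, and `SnowballStemmer("english").stem` could be passed as the `stem` argument instead. The function `bag_of_words` and the sample documents are also made up for illustration.

```python
from collections import Counter

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few suffixes only.
    for suffix in ("iveness", "ivity", "ive"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def bag_of_words(docs, stem=None):
    # Step 1: stem word by word (strings in, strings out).
    tokenized = []
    for doc in docs:
        words = doc.lower().split()
        tokenized.append([stem(w) for w in words] if stem else words)
    # Step 2: count tokens per document (one matrix row per document).
    vocab = sorted({w for words in tokenized for w in words})
    matrix = []
    for words in tokenized:
        word_counts = Counter(words)
        matrix.append([word_counts[w] for w in vocab])
    return vocab, matrix

docs = ["responsive design", "responsiveness matters", "responsivity of the design"]
raw_vocab, raw_matrix = bag_of_words(docs)
stemmed_vocab, stemmed_matrix = bag_of_words(docs, stem=toy_stem)

print(len(raw_vocab), len(stemmed_vocab))  # 7 5
print(stemmed_vocab)  # ['design', 'matters', 'of', 'respons', 'the']
```

Without stemming, the three variants occupy three separate columns; stemming first collapses them into a single `respons` column that is non-zero in every document.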

## The TfIdf representation
Tf is term frequency (similar to bag of words: how often a word occurs in a document) and Idf is inverse document frequency (the word also gets a weighting based on how often it occurs in the corpus as a whole, that is, across all the documents put together).

* It rates rare words, which help you distinguish messages from each other, more highly than common words
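A minimal sketch of the weighting, assuming the plain textbook form tf × log(N / df) with raw counts as the term frequency. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ; the helper `tf_idf` here is illustrative only.

```python
import math
from collections import Counter

emails = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie the machine learning class will be most excellent",
]

docs = [email.lower().split() for email in emails]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    # Assumes the word occurs in at least one document of the corpus.
    tf = doc.count(word)                       # raw term frequency
    idf = math.log(n_docs / doc_freq[word])    # unsmoothed inverse document frequency
    return tf * idf

second_email = docs[1]
for word in ("hi", "machine", "great"):
    print(word, round(tf_idf(word, second_email), 3))
# hi 0.0
# machine 0.405
# great 3.296
```

"hi" appears in every email, so its weight drops to zero, while "great" appears in only one email (and three times there), so it gets the highest weight.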