# Bag of Words (Or: *What is unique about working with text*)

**Q**: How is text different from the types of data you've seen so far?
* It's a string, so can't do mathematical operations on it
* It can be ambiguous
* It can be in a wide range of formats / files
* It comes at many levels of granularity (file, paragraph, sentence, word, character)
* It is unstructured

**Q**: What does this mean for how you can work with text compared to types of data you've worked with so far?
* Need to make it structured
* Need to clean it well

There are two main aspects of working with text that help us feed it into computers / machine learning models:
1. Data/text preprocessing
2. Turning text into features

### 1. Data Preprocessing

You already know that data cleaning is a large part of a data scientist's workflow. When working with unstructured data such as text, data cleaning plays an even bigger role.

**Q**: Things we may want to do to clean our textual data:
* Lemmatization
* Stemming
* Tokenization
* Remove small words that don't contribute to the meaning
* ...

* Split the corpus into individual words / tokens
* Remove punctuation
* Deal with capitalization
* Remove most common words:
    * Use a list of words to remove (stop words)
    * Remove the words that appear in more than X% of documents
* Reduce words to their base parts:
    * *Stemming*: the process of removing and replacing suffixes to get to the root of a word, its *stem*.
        * based on heuristics (e.g. *-ational -> -ate*, *-tional -> -tion*)
        * does not always produce a word
        * feet->feet, wolves->wolv, cats->cat, talked->talk
    * *Lemmatization*: uses vocabulary and morphological analysis to return the base or dictionary form of a word, its *lemma*.
        * does not always return a reduced form
        * feet->foot, wolves->wolf, cats->cat, talked->talked
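
The cleaning steps listed above can be sketched in plain Python. This is a minimal illustration rather than a production pipeline: the stop-word list `STOP_WORDS` and the helper names `preprocess` and `toy_stem` are made up for this example, and `toy_stem` applies only the two suffix heuristics quoted in the list, so it is not a full stemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "in", "my", "was", "we", "all"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    text = text.lower()                    # deal with capitalization
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                  # split into tokens on whitespace
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Apply the two suffix heuristics from the list above (toy example)."""
    for suffix, replacement in (("ational", "ate"), ("tional", "tion")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(preprocess("Yesterday, my submarine was in love"))
# ['yesterday', 'submarine', 'love']
print([toy_stem(w) for w in ["relational", "conditional", "submarine"]])
# ['relate', 'condition', 'submarine']
```

Note that the longer suffix is checked first, so that *-ational* is not mistakenly handled by the *-tional* rule.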

In [None]:
corpus = ["we all love a yellow submarine",             # Beatles
          "yesterday, my submarine was in love",        # Beatles
          "we are love trouble with loyalty here",      # Eminem
          "loyalty to us is worth more than love is"]   # Eminem
labels = ['Beatles'] * 2 + ['Eminem'] * 2

### 2. Turning text into features

**Q**: Once you have your data cleaned and preprocessed like this, what can you imagine using as your features?
* Amount of certain words
* Length of song / words in song
* Categorize words, then use word count

#### Bag of words

Each token/word will be a feature/column.
* large sparse matrix
* word order lost
* counts not normalized
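
To make the idea concrete before handing it over to `scikit-learn`, here is a hand-rolled sketch of a bag-of-words matrix built with the standard library only. The `tokenize` helper is an assumption for this example: it lowercases the text and keeps tokens of two or more word characters, which mirrors `CountVectorizer`'s default token pattern (so the single-letter word "a" is dropped).

```python
import re
from collections import Counter

corpus = ["we all love a yellow submarine",
          "yesterday, my submarine was in love",
          "we are love trouble with loyalty here",
          "loyalty to us is worth more than love is"]


def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# One Counter of token counts per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all tokens; each token is one column.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word (0 if absent).
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero even for this tiny corpus, which is why real implementations store the matrix in a sparse format.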

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()

In [None]:
X = vectorizer.fit_transform(corpus)

In [None]:
X

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

##### Exercise: How can we remove the most common words?

* Using a list of stop words
* Removing the words that appear in more than X% of documents

See `CountVectorizer` documentation for how to do each of these: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Remove most common words using these two methods. Use `.vocabulary_` and `.stop_words_` attributes to see which words have remained and which are removed (latter only in the case of the second method). 

What do you notice?

* Using a list of stop words

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

* Removing the words that appear in more than X% of documents

In [None]:
vectorizer = CountVectorizer(max_df=0.75)
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

In [None]:
vectorizer.stop_words_

In [None]:
X_df.columns

In [None]:
vectorizer.vocabulary_

In [None]:
vectorizer.get_feature_names_out()

#### n-grams

Instead of single tokens, we now also count token pairs (bigrams), triplets (trigrams), etc.
* even larger, sparser matrix
* preserves local word order
* counts not normalized
* too many features:
    * remove high-frequency n-grams: can include stop words; not very informative
    * remove low-frequency n-grams: typos and rare n-grams; likely to overfit
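
As a rough illustration of what counting n-grams means, the sketch below generates them by hand from a list of tokens. The helper name `ngrams` is made up for this example, and the tokens come from a plain whitespace split, so (unlike `CountVectorizer`'s default tokenizer) the single-letter word "a" is kept.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "we all love a yellow submarine".split()

# Unigrams plus bigrams, i.e. the hand-made analogue of ngram_range=(1, 2).
features = ngrams(tokens, 1) + ngrams(tokens, 2)

print(ngrams(tokens, 2))
# ['we all', 'all love', 'love a', 'a yellow', 'yellow submarine']
print(len(features))
# 11
```

A document with *k* tokens yields *k - n + 1* n-grams for each *n*, which is why the feature space grows so quickly.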

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

#### TF-IDF

Stands for `term frequency - inverse document frequency`. It accounts for how common a word is across the whole corpus (not just inside a single document).

##### TF = term frequency 

TF(t, d) - frequency of term (n-gram) _t_ in document _d_

##### IDF(t) = inverse document frequency (of term _t_ in the whole corpus)

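$ IDF(t) = \log \frac{1+N}{1+N_t}+1 $

where $N$ is the total number of documents in the corpus, $N_t$ is the number of documents that contain term _t_, and $\log$ is the natural logarithm.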


If term _t_ doesn't appear in many documents: IDF is "big".

If term _t_ appears in many documents: IDF is close to 1 ("small") -> common terms are penalized.

$ TFIDF(t, d) = TF(t,d)*IDF(t) $ 

**Q**: What kind of terms will have high TF-IDF?
* Those that appear many times in a small number of documents/songs.
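
The formulas above can be checked by hand on the toy corpus. The sketch below assumes that TF is the raw count of a term in a document and reuses a hypothetical `tokenize` helper (lowercase, tokens of two or more word characters); the function names `idf` and `tfidf` are made up for this example, and no normalization is applied.

```python
import math
import re
from collections import Counter

corpus = ["we all love a yellow submarine",
          "yesterday, my submarine was in love",
          "we are love trouble with loyalty here",
          "loyalty to us is worth more than love is"]


def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(doc)) for doc in corpus]
N = len(docs)  # total number of documents


def idf(term):
    # N_t: number of documents that contain the term.
    n_t = sum(1 for doc in docs if term in doc)
    return math.log((1 + N) / (1 + n_t)) + 1


def tfidf(term, doc_index):
    # TF is assumed to be the raw count of the term in the document.
    return docs[doc_index][term] * idf(term)


print(round(idf("love"), 4))     # appears in all 4 documents -> 1.0
print(round(idf("yellow"), 4))   # appears in 1 document -> 1.9163
print(round(tfidf("is", 3), 4))  # count 2 in the last document -> 3.8326
```

"love" appears in every document, so its IDF is exactly 1, the smallest possible value; "is" appears twice in a single document, so it gets the highest score of the three.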

##### Exercise: Implement TF-IDF vectorizer

Look up how to use the TF-IDF vectorizer in `scikit-learn`. How does your feature dataframe differ from the `CountVectorizer` one?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

*Bonus question*: What can you say about the values in your new `X_df` (think about sums, normalizations, etc.)? Post your guesses in Slack! 

*Answer*: unlike `CountVectorizer`, `TfidfVectorizer` normalizes its output by default. Each row is scaled to unit Euclidean (L2) length, so the squares of the values in each row sum to 1.

You can check this yourself:

In [None]:
np.square(X_df).sum(axis=1)
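
The same normalization can be reproduced by hand. This is a small sketch with a made-up row of un-normalized TF-IDF values (the numbers in `row` are hypothetical): dividing each value by the row's Euclidean length makes the sum of squares equal to 1, up to floating-point error.

```python
import math

# Hypothetical un-normalized TF-IDF values for one document.
row = [1.0, 1.9163, 1.9163]

# Euclidean (L2) length of the row.
norm = math.sqrt(sum(v * v for v in row))

# Scale every value by the row length.
normalized = [v / norm for v in row]

print(normalized)
print(sum(v * v for v in normalized))  # approximately 1.0
```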

##### Extra exercise

Use your scraped lyrics and run them through a vectorizer of your choice. Then split the resulting feature matrix `X` and your labels into training and test sets and train a logistic regression model. Check how the choice of vectorizer and the parameters we mentioned affect the performance of your model.
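
A minimal sketch of this workflow is shown below. It uses the four-line toy corpus in place of your scraped lyrics, so the resulting score says nothing meaningful; the `test_size`, `stratify`, and `random_state` values are arbitrary choices for this example, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the scraped lyrics.
corpus = ["we all love a yellow submarine",
          "yesterday, my submarine was in love",
          "we are love trouble with loyalty here",
          "loyalty to us is worth more than love is"]
labels = ["Beatles"] * 2 + ["Eminem"] * 2

# Turn the text into features, as described in the exercise.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Hold out half of the songs, keeping one song per band in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

# Fit the classifier and report accuracy on the held-out songs.
model = LogisticRegression()
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

To compare vectorizers, swap `TfidfVectorizer()` for `CountVectorizer()` or pass parameters such as `stop_words`, `max_df`, or `ngram_range`, and rerun.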

Once you have your final model, run the following lines of code to get the words that are strongest predictors for each of your bands and post them in Slack.

`import operator`

`from sklearn.linear_model import LogisticRegression`

`model = LogisticRegression()  # fit this on your training data before reading coef_`

`print(operator.itemgetter(*np.argsort(model.coef_[0]))(vectorizer.get_feature_names_out())[-20:])`

`print(operator.itemgetter(*np.argsort(model.coef_[0]))(vectorizer.get_feature_names_out())[:20])`