## Bag of Words: our first attempt at Language Representation

### Objectives:
#### Understand that we are doing feature engineering!
#### Understand what Bag of Words vectors are
#### Transform your corpus into Count Vectors and Tf-idf Vectors

---

### We can say that language is a form of mapping
#### * And mapping from ideas to symbols is common yet different across the world!
#### * From phonograms (symbols that represent sounds), e.g.:

### `cat - /kat/`

#### * To logograms (symbols that represent whole words or ideas), e.g.:

# `好`

#### * To others - facial / body language cues, emojis

# 😋

#### * Today we introduce one more type - Bag of Words
#### * An approach to representing language differently from what we're 'used to'
#### * Computers like this way of seeing language - denotational semantics
#### * It's not the only way! We're learning more techniques tomorrow

---

### Steps

#### Get a corpus (needs to be in list form)

In [1]:
corpus = """What you want
Baby, I got it
What you need
Do you know I got it?
All I'm askin'
Is for a little respect when you get home (just a little bit)
Hey baby (just a little bit) when you get home
(Just a little bit) mister (just a little bit)
I ain't gonna do you wrong while you're gone
Ain't gonna do you wrong 'cause I don't wanna
All I'm askin'
Is for a little respect when you come home (just a little bit)
Baby (just a little bit) when you get home (just a little bit)
Yeah (just a little bit)
I'm about to give you all of my money
And all I'm askin' in return, honey
Is to give me my propers
When you get home (just a, just a, just a, just a)
Yeah, baby (just a, just a, just a, just a)
When you get home (just a little bit)
Yeah (just a little bit)"""

In [None]:
# our corpus needs TO BE IN LIST FORM
corpus = [corpus]

In [9]:
#we could choose to split our songs line by line - not for this week tho!
corpus[0].split('\n')

['What you want',
 'Baby, I got it',
 'What you need',
 'Do you know I got it?',
 "All I'm askin'",
 'Is for a little respect when you get home (just a little bit)',
 'Hey baby (just a little bit) when you get home',
 '(Just a little bit) mister (just a little bit)',
 "I ain't gonna do you wrong while you're gone",
 "Ain't gonna do you wrong 'cause I don't wanna",
 "All I'm askin'",
 'Is for a little respect when you come home (just a little bit)',
 'Baby (just a little bit) when you get home (just a little bit)',
 'Yeah (just a little bit)',
 "I'm about to give you all of my money",
 "And all I'm askin' in return, honey",
 'Is to give me my propers',
 'When you get home (just a, just a, just a, just a)',
 'Yeah, baby (just a, just a, just a, just a)',
 'When you get home (just a little bit)',
 'Yeah (just a little bit)']

In [22]:
# let's add a new song
new_song = """Looking out on the morning rain
I used to feel so uninspired
And when I knew I had to face another day
Lord, it made me feel so tired
Before the day I met you, life was so unkind
But you're the key to my peace of mind
'Cause you make me feel
You make me feel
You make me feel like
A natural woman (woman)
When my soul was in the lost and found
You came along to claim it
I didn't know just what was wrong with me
'Til your kiss helped me name it
Now I'm no longer doubtful, of what I'm living for
And if I make you happy I don't need to do more"""

In [23]:
corpus.append(new_song)

---

### Build a Count Vectorizer:
* Fit
* Then transform
* Do it in one step with `fit_transform` if you can!
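
As a minimal sketch of the one-step route: `fit_transform` learns the vocabulary and builds the count matrix in a single call. The two-line `toy_corpus` below is a made-up stand-in for the songs, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy_corpus is a hypothetical stand-in for the list of songs
toy_corpus = ["just a little bit", "you make me feel"]

# Two steps: learn the vocabulary, then count tokens per document
two_step = CountVectorizer().fit(toy_corpus).transform(toy_corpus)

# One step: fit and transform in a single call
one_step = CountVectorizer().fit_transform(toy_corpus)

# Both routes give the same document-term matrix
same = (two_step.toarray() == one_step.toarray()).all()
print(one_step.shape, same)
```

Fitting and transforming separately is still useful when the vocabulary is learned on one set of documents and then applied to another.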

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
cv = CountVectorizer()

In [26]:
cv.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [27]:
vec_corpus = cv.transform(corpus)

In [28]:
vec_corpus

<2x97 sparse matrix of type '<class 'numpy.int64'>'
	with 116 stored elements in Compressed Sparse Row format>

In [29]:
vec_corpus.todense()

matrix([[ 1,  2,  4,  0,  1,  0,  3,  4,  0, 10,  0,  0,  1,  0,  1,  0,
          0,  3,  1,  0,  0,  0,  2,  0,  5,  2,  1,  2,  2,  0,  0,  0,
          1,  6,  1,  0,  1,  3,  2, 18,  0,  0,  0,  1,  0,  0, 12,  0,
          0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  0,  2,  0,  0,
          1,  0,  0,  1,  0,  0,  0,  1,  0,  1,  2,  1,  0,  0,  0,  0,
          0,  2,  0,  0,  0,  1,  1,  0,  2,  6,  1,  0,  0,  2,  3, 13,
          0],
        [ 0,  0,  0,  1,  3,  1,  0,  0,  1,  0,  1,  1,  1,  1,  0,  2,
          1,  1,  1,  1,  1,  5,  1,  1,  0,  0,  0,  0,  0,  1,  1,  1,
          0,  0,  0,  1,  1,  0,  3,  1,  1,  1,  1,  1,  1,  1,  0,  1,
          1,  1,  1,  1,  1,  4,  6,  1,  1,  0,  0,  1,  1,  2,  1,  1,
          1,  1,  1,  2,  1,  1,  1,  0,  1,  1,  0,  0,  3,  1,  4,  1,
          1,  5,  1,  1,  1,  0,  0,  3,  2,  2,  0,  1,  2,  1,  0,  7,
          1]])

#### Sparse Matrix
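Most of a document-term matrix consists of zeroes, because each document uses only a small part of the vocabulary. A sparse matrix stores only the non-zero values to save memory. To view it, we convert it into a **dense** matrix; putting it in a Pandas DataFrame makes it easier to read as well.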
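
A small sketch of the idea, using SciPy's `csr_matrix` directly. The 2 x 6 array is a hypothetical example, not the song data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A hypothetical 2 x 6 count matrix that is mostly zeroes
dense = np.array([[0, 0, 3, 0, 0, 1],
                  [2, 0, 0, 0, 0, 0]])

sparse = csr_matrix(dense)

# Only the non-zero entries are stored
print(sparse.nnz)        # number of stored values
print(sparse.toarray())  # back to a regular NumPy array for viewing
```

`toarray()` returns a plain NumPy array, while `todense()` (used in this notebook) returns a `numpy.matrix`; either works for viewing.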

In [17]:
import pandas as pd

In [32]:
# Just a word counter for each song - we call this Term Frequency, where term = token = word
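# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
df = pd.DataFrame(vec_corpus.todense(), columns=cv.get_feature_names_out(), index=['Respect','Natural'])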
df

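| | about | ain | all | along | and | another | askin | baby | before | bit | ... | was | what | when | while | with | woman | wrong | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Respect | 1 | 2 | 4 | 0 | 1 | 0 | 3 | 4 | 0 | 10 | ... | 0 | 2 | 6 | 1 | 0 | 0 | 2 | 3 | 13 | 0 |
| Natural | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | ... | 3 | 2 | 2 | 0 | 1 | 2 | 1 | 0 | 7 | 1 |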


## Pros and cons of Count Vectors???

## Cons
* Consistent spelling is a requirement
* Simple tokenization: text is split on whitespace and punctuation only, and multi-word tokens (n-grams with n > 1) are ignored by default
* Case sensitive unless the text is lowercased first (`CountVectorizer` lowercases by default)
* Different forms of the same root word look like different words
* Basically - MORE COLUMNS = HARDER TO FIND PATTERNS - curse of dimensionality
* Semantic similarity isn't captured
* Context / grammatical context isn't captured
* Word order isn't captured
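
Two of these cons are easy to see in code. The sketch below uses made-up sentences (not the songs) to show that word order is lost and that each form of a root word gets its own column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences: same words, opposite meaning
sentences = ["the dog bites the man", "the man bites the dog"]

counts = CountVectorizer().fit_transform(sentences).toarray()

# Word order is lost, so both rows are identical
identical = (counts[0] == counts[1]).all()

# Different forms of one root word get separate columns
forms = CountVectorizer().fit(["ask asks asking asked"])
n_columns = len(forms.vocabulary_)

print(identical, n_columns)
```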

## Pros
* Fast way of grasping your vocab
* Fast way of counting occurrences of words
* Good first method of predicting language patterns
* Easy to understand
* Despite all the cons, we can still get good results on ML using this preprocessing technique

---

## Uniqueness-scaled BOW - Tf-Idf Vectors:

* TF - Term Frequency (count of a word w in doc d)
* IDF - Inverse Document Frequency

$TFIDF(w,d) = TF(w,d) \times IDF(w)$

$IDF(w) = \log(\frac{1 + \text{no. documents}}{1 + \text{no. documents containing word } w}) + 1$

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse doc frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean (L2) norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{||v||_2}$
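
The steps above can be sketched by hand with NumPy and checked against scikit-learn. The small `counts` matrix is a hypothetical example, and the comparison assumes `TfidfTransformer` is left at its defaults (smoothed IDF, L2 norm).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents (rows) x 3 terms (columns)
counts = np.array([[2, 2, 0],
                   [0, 2, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed IDF, following the formula above (natural log)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = counts * idf                  # TF * IDF for every term

# Divide each row by its Euclidean (L2) norm
manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with scikit-learn's default TfidfTransformer
sklearn_result = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual, sklearn_result))
```

After normalisation every document vector has length 1, which is why documents of different lengths become comparable.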

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [None]:
# The important bit is the no. of documents / no. of documents containing w! Let's look at this for a couple of examples

In [57]:
df['respect'] #ratio = 2/1 = 2, MORE UNIQUE

Respect    2
Natural    0
Name: respect, dtype: int64

In [58]:
df['what'] #ratio = 2/2 = 1, LESS UNIQUE

Respect    2
Natural    2
Name: what, dtype: int64
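
As a worked example, here are these two words plugged into the smoothed IDF formula, with 2 documents in the corpus and taking $\log$ as the natural logarithm (which is what scikit-learn uses):

$IDF(respect) = \log(\frac{1 + 2}{1 + 1}) + 1 \approx 1.405$

$IDF(what) = \log(\frac{1 + 2}{1 + 2}) + 1 = 1$

So each count of respect is weighted roughly 1.4 times as heavily as each count of what, before normalisation.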

In [40]:
from sklearn.feature_extraction.text import TfidfTransformer

In [42]:
tf = TfidfTransformer()

In [43]:
tf.fit(vec_corpus)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [44]:
tf_corpus = tf.transform(vec_corpus)

In [45]:
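# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
df2 = pd.DataFrame(tf_corpus.todense(), columns=cv.get_feature_names_out(), index=['Respect', 'Natural'])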

In [46]:
# count vector results
df

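| | about | ain | all | along | and | another | askin | baby | before | bit | ... | was | what | when | while | with | woman | wrong | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Respect | 1 | 2 | 4 | 0 | 1 | 0 | 3 | 4 | 0 | 10 | ... | 0 | 2 | 6 | 1 | 0 | 0 | 2 | 3 | 13 | 0 |
| Natural | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | ... | 3 | 2 | 2 | 0 | 1 | 2 | 1 | 0 | 7 | 1 |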


In [47]:
#tfidf vector results
df2

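| | about | ain | all | along | and | another | askin | baby | before | bit | ... | was | what | when | while | with | woman | wrong | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Respect | 0.038206 | 0.076412 | 0.152825 | 0.0 | 0.027184 | 0.0 | 0.114619 | 0.152825 | 0.0 | 0.382062 | ... | 0.0 | 0.054368 | 0.163104 | 0.038206 | 0.0 | 0.0 | 0.054368 | 0.114619 | 0.353392 | 0.0 |
| Natural | 0.0 | 0.0 | 0.0 | 0.06968 | 0.148733 | 0.06968 | 0.0 | 0.0 | 0.06968 | 0.0 | ... | 0.209039 | 0.099156 | 0.099156 | 0.0 | 0.06968 | 0.13936 | 0.049578 | 0.0 | 0.347044 | 0.06968 |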


### What's the effect of tfidf? Let's look

In [50]:
df['respect']

Respect    2
Natural    0
Name: respect, dtype: int64

In [51]:
df2['respect']

Respect    0.076412
Natural    0.000000
Name: respect, dtype: float64

In [52]:
df['what']

Respect    2
Natural    2
Name: what, dtype: int64

In [53]:
df2['what']

Respect    0.054368
Natural    0.099156
Name: what, dtype: float64

In [55]:
df.sum(axis=1)

Respect    133
Natural    110
dtype: int64

## Conclusions!!!
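#### The word respect is more unique than the word what (it appears in only one song), so it gets a higher score for the same count
#### The song Natural has fewer words than the song Respect, so after normalisation each word occurrence carries more weight (compare the scores for what)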

---

## To make your code shorter, you could use the TfidfVectorizer
* This does both steps (CountVectorizer and TfidfTransformer) in one. The reason I show both in the tutorial is that it's easier to understand word vectors this way

`from sklearn.feature_extraction.text import TfidfVectorizer`
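
A minimal sketch of the shortcut, checking that it matches the two-step route when both are left at their default settings. `toy_corpus` is a made-up stand-in for the list of songs.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# toy_corpus is a hypothetical stand-in for the list of songs
toy_corpus = ["just a little bit of respect",
              "you make me feel like a natural woman"]

# Two steps: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```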