### Bag of Words:
* One of the earliest approaches to representing text as vectors.
* The common approach to word vectorisation until 2013 (Mikolov et al., word2vec)

#### Pros
* Works for any text
* Easy and fast to do
* Does not require a language model (just the corpus)

#### Cons
* Does not apply language knowledge (the built-in stop-word list in scikit-learn covers English only)
* All words are equally similar / dissimilar (discrete, orthogonal vectors)
* Order of words is ignored
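
The last two cons can be checked with a short sketch. The two sentences below are illustrative examples, not part of the lyric corpus, and `demo_cv` is a hypothetical name chosen so it does not clash with the `cv` object used later.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words in a different order (illustrative only)
docs = ['dog bites man', 'man bites dog']

demo_cv = CountVectorizer()
counts = demo_cv.fit_transform(docs).toarray()

# Both rows are identical, so word order is lost
same_vector = np.array_equal(counts[0], counts[1])
print(counts)
print(same_vector)

# Each single word is a one-hot vector; two different words have dot product 0
one_hot = np.eye(len(demo_cv.vocabulary_), dtype=int)
print(one_hot[0] @ one_hot[1])
```

Because every pair of distinct words is orthogonal, the representation carries no notion of one word being closer in meaning to another.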

In [2]:
artist = ['Bliss N Eso', 'Frank Sinatra']
corpus = ['Its up to you, New York, New york',
        'Come fly with me, lets fly lets fly away',
        'Theres a party at my house',
        'No strings attached from the get go']
from sklearn.feature_extraction.text import CountVectorizer

---

### The Count Vectorizer:
#### Steps to build
* Create a corpus
* Fit a CountVectorizer (CV) on it
* Transform the corpus into a sparse, then dense, matrix

In [3]:
cv = CountVectorizer()
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
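
The `token_pattern` and `lowercase=True` settings in the output above decide what counts as a word. A rough sketch of that tokenisation using only the `re` module, assuming it mirrors the default behaviour closely enough for this corpus:

```python
import re

# Default token_pattern shown in the CountVectorizer repr
token_pattern = r'(?u)\b\w\w+\b'

line = 'Theres a party at my house'

# The text is lowercased first, then every run of 2+ word characters is a token
tokens = re.findall(token_pattern, line.lower())
print(tokens)
```

The single-character word "a" does not match the pattern, which is why it never shows up in the vocabulary.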

#### Sparse Matrix
Most of our matrix consists of zeroes. A Sparse Matrix only stores the non-zero values to save memory. We need to convert it into a **dense** matrix to view it effectively.
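
To make the storage idea concrete, here is a small sketch with SciPy's `csr_matrix` (the Compressed Sparse Row format). The matrix values are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero matrix (illustrative values)
dense = np.array([[0, 0, 2, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 3]])

sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) elements
print(sparse.data)     # the non-zero values themselves
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # where each row starts inside data / indices

# Converting back restores the full matrix, zeros included
restored = sparse.toarray()
print(restored)
```

Only 3 of the 12 entries are stored; the saving grows with vocabulary size, since each document uses a tiny fraction of all words.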

In [4]:
# fit the vectorizer: learn the vocabulary from the corpus
cv.fit(corpus)  # expects an iterable of strings

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [5]:
cv.vocabulary_  # maps each word to its column index in the matrix (not its count)

{'its': 9,
 'up': 20,
 'to': 19,
 'you': 23,
 'new': 13,
 'york': 22,
 'come': 3,
 'fly': 4,
 'with': 21,
 'me': 11,
 'lets': 10,
 'away': 2,
 'theres': 18,
 'party': 15,
 'at': 0,
 'my': 12,
 'house': 8,
 'no': 14,
 'strings': 16,
 'attached': 1,
 'from': 5,
 'the': 17,
 'get': 6,
 'go': 7}

In [7]:
cv.get_feature_names()  # removed in scikit-learn 1.2; newer versions use get_feature_names_out()

['at',
 'attached',
 'away',
 'come',
 'fly',
 'from',
 'get',
 'go',
 'house',
 'its',
 'lets',
 'me',
 'my',
 'new',
 'no',
 'party',
 'strings',
 'the',
 'theres',
 'to',
 'up',
 'with',
 'york',
 'you']

In [8]:
#transform the output
corpus_vec = cv.transform(corpus)

In [9]:
corpus_vec  # sparse matrix: only the non-zero counts are stored

<4x24 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [10]:
corpus_vec.todense()  # dense matrix: every entry is stored, zeros included

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1,
         0, 2, 1],
        [0, 0, 1, 1, 3, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
         0, 0, 0]], dtype=int64)

In [14]:
import pandas as pd
df = pd.DataFrame(corpus_vec.todense(), index = corpus, columns=cv.get_feature_names())

In [17]:
df

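| | at | attached | away | come | fly | from | get | go | house | its | ... | no | party | strings | the | theres | to | up | with | york | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Its up to you, New York, New york | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 |
| Come fly with me, lets fly lets fly away | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Theres a party at my house | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| No strings attached from the get go | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |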


**A downside of the Count Vectorizer is that raw counts ignore how distinctive a word is: a word that appears in every document is weighted the same as one that appears in only one. This is where TF-IDF comes in.**

---

### The Tf-Idf Transformer:

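* TF - Term Frequency: how often word w occurs in doc d (scikit-learn uses the raw count)
* IDF - Inverse Document Frequency: down-weights words that occur in many documents

$TFIDF(w,d) = TF(w,d) \times IDF(w)$

$IDF(w) = \ln\left(\frac{1 + N}{1 + DF(w)}\right) + 1$, where $N$ is the number of documents and $DF(w)$ is the number of documents containing word $w$.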
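
A quick worked example of the IDF formula, using only the standard library. The helper name `smoothed_idf` is hypothetical, and the natural logarithm is assumed.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF(w) = ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4  # the four lyric lines in the corpus

rare = smoothed_idf(n_docs, 1)    # word found in only one document
common = smoothed_idf(n_docs, 4)  # word found in every document

print(round(rare, 3), round(common, 3))
```

A word that occurs in every document gets the minimum weight of 1, while rarer words get a larger weight.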

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse document frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean (L2) norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{\|v\|_2}$
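
The steps above can be sketched by hand with NumPy and compared against scikit-learn. This is a minimal sketch, assuming TF is the raw count and IDF is the smoothed form given earlier. The three-document corpus is made up so that one word ("fly") appears in more than one document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Small illustrative corpus in which 'fly' appears in two documents
docs = ['come fly with me', 'fly away', 'party at my house']

counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray()                  # TF as raw counts

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed IDF
tfidf = counts * idf                              # TF * IDF for every term

# Divide each row by its Euclidean norm
manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with the default TfidfTransformer (norm='l2', smooth_idf=True)
expected = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual, expected))
```

After normalisation every row has length 1, so documents of different lengths become comparable.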

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer

In [19]:
tf = TfidfTransformer()

In [21]:
#fit the output of the count vectorizer
tf.fit(corpus_vec)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [22]:
transform_vec = tf.transform(corpus_vec)

In [23]:
transform_vec

<4x24 sparse matrix of type '<class 'numpy.float64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [24]:
#comparing the two dataframes
df = pd.DataFrame(corpus_vec.todense(), index = corpus, columns=cv.get_feature_names())
df

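| | at | attached | away | come | fly | from | get | go | house | its | ... | no | party | strings | the | theres | to | up | with | york | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Its up to you, New York, New york | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 |
| Come fly with me, lets fly lets fly away | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Theres a party at my house | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| No strings attached from the get go | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |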


In [26]:
vdf = pd.DataFrame(transform_vec.todense().round(2), index = corpus, columns=cv.get_feature_names())
vdf

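| | at | attached | away | come | fly | from | get | go | house | its | ... | no | party | strings | the | theres | to | up | with | york | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Its up to you, New York, New york | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.29 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.29 | 0.29 | 0.0 | 0.58 | 0.29 |
| Come fly with me, lets fly lets fly away | 0.0 | 0.0 | 0.24 | 0.24 | 0.73 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.24 | 0.0 | 0.0 |
| Theres a party at my house | 0.45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.45 | 0.0 | ... | 0.0 | 0.45 | 0.0 | 0.0 | 0.45 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| No strings attached from the get go | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.38 | 0.38 | 0.38 | 0.0 | 0.0 | ... | 0.38 | 0.0 | 0.38 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |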
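
As a side note, scikit-learn also offers `TfidfVectorizer`, which combines the counting and weighting steps. The sketch below assumes default settings for all three classes and checks that both routes give the same matrix for this corpus. The names `two_step` and `one_step` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ['Its up to you, New York, New york',
          'Come fly with me, lets fly lets fly away',
          'Theres a party at my house',
          'No strings attached from the get go']

# Two-step route used in this notebook: count first, then weight
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(corpus))

# One-step route with default settings
one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

The two-step route is still useful when the raw counts are needed as well, as in the comparison of `df` and `vdf`.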


In [None]:
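df2 = vdf.copy()  # assumption: df2 was meant to be a copy of the TF-IDF frame
df2['artist'] = [0]*2 + [1]*2  # one label per lyric line: first two rows 0, last two rows 1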

---