### Leftovers

So there were two unanswered questions at the end of last class. Let's get into both of them. 
- Why did 'a' not get included in the `CountVectorizer`? And, if we wanted to, how do we include 'a'? 
- Why do the TF-IDF weights look...not sensible? 

#### CountVectorizers, again

With most of our `sklearn` classes and modules, default values kick in when we don't specify anything. We saw a couple of examples of this with n-grams last week. If you don't pass `ngram_range` to the `CountVectorizer`, it assumes you only want unigrams. Similarly, if you don't pass a `stop_words` list, it assumes you don't want to remove any stop words, so every token is kept. 

Other parameters that the `CountVectorizer` cares about include:
- `lowercase` (by default, True)
- `min_df` (by default, 1, i.e. a term only needs to appear in one document)
- `max_df` (by default, 1.0)
- `max_features` (by default, None, i.e. _everything_ is a feature). 

So, where does 'a' come in? Through another parameter called `token_pattern`. The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) (always, always read the documentation!) says (emphasis mine): 
> Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. **The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)**. 

This means that single-character words ('A', 'I') will get dropped by the `CountVectorizer`.

In [None]:
song_titles = [
    "Candle in the wind", 
    "A pillow of winds", 
    "Wind of change",
    "The wind cries Mary"
]

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer



**Right, now, how can we get our CountVectorizer to include it?**

We know that `token_pattern` is the parameter that's eliminating single-character words. So, let's have a look at what the _actual_ value of `token_pattern` is. We can do this by simply running `cv.token_pattern`, which gives us the below result:

> `'(?u)\\b\\w\\w+\\b'`
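
To see what that pattern does on its own, here is a minimal sketch that applies it with Python's `re` module instead of the vectorizer. The `tokenize` helper is hypothetical (it is not part of `sklearn`); it only mimics the two steps that matter here, lowercasing and matching the default pattern.

```python
import re

# The default pattern shown above, written as a raw string:
# a word boundary, then at least two word characters, then a word boundary.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

song_titles = [
    "Candle in the wind",
    "A pillow of winds",
    "Wind of change",
    "The wind cries Mary",
]


def tokenize(text, pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical helper: lowercase the text, then keep every regex match."""
    return re.findall(pattern, text.lower())


tokens_per_title = [tokenize(title) for title in song_titles]
for title, tokens in zip(song_titles, tokens_per_title):
    print(f"{title!r} -> {tokens}")
```

The second title loses its leading 'a' because a single character can never satisfy `\w\w+`.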

![alt text](files/regex.png)

Source: https://www.xkcd.com/208/
    
    
OK, so we want to make a minor change to the above regex to include single-character words. How do we do it?
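
One possible answer, sketched with the `re` module: drop one of the two `\w` so the pattern requires one or more word characters instead of two or more. The constant name below is made up for illustration.

```python
import re

# Assumption: "one or more word characters" is the behaviour we want.
SINGLE_CHAR_TOKEN_PATTERN = r"(?u)\b\w+\b"

tokens = re.findall(SINGLE_CHAR_TOKEN_PATTERN, "A pillow of winds".lower())
print(tokens)  # ['a', 'pillow', 'of', 'winds']
```

The same string can be handed to the vectorizer through its `token_pattern` parameter, e.g. `CountVectorizer(token_pattern=r"(?u)\b\w+\b")`, after which 'a' should show up as a feature.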


#### TF-IDF, again
Right, so, we know TF-IDF means "term frequency inverse document frequency." The calculation is:

> `tf * idf`

 But, how do you calculate `tf` and `idf` respectively? 
 
> `tf(t)` = `(# times term appears in document) ÷ (total # of words in document)`

For example, suppose you had a document with a single line: "Data, data, data. It's all about the data." Here, `tf(t='data')` is the number of times 'data' appears divided by the total number of words, i.e. 4/8 or 1/2. 
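
The same arithmetic as a short sketch. The tokenizer here is an assumption made for this example only: it keeps apostrophes inside words so that "It's" counts as one word, which matches the count of 8 above.

```python
import re
from collections import Counter

document = "Data, data, data. It's all about the data."

# Assumption: lowercase everything and treat runs of letters, digits,
# underscores and apostrophes as words.
words = re.findall(r"[\w']+", document.lower())
counts = Counter(words)

# tf = (# times the term appears) / (total # of words)
tf_data = counts["data"] / len(words)
print(len(words), counts["data"], tf_data)  # 8 4 0.5
```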

Next up, we have "inverse document frequency" or "idf". The calculation for `idf` is:
> ` log [ (1 + n) / (1 + df(d, t)) ] + 1 ` (natural log; this is the smoothed form `sklearn` uses by default)

...where: 
- `n` is the # of documents
- `df(d, t)` is the number of documents containing the term
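
A minimal sketch of the IDF calculation on the song titles, assuming the smoothed form that `sklearn` uses by default (`smooth_idf=True`), i.e. `ln((1 + n) / (1 + df)) + 1`, and the default token pattern (so 'a' is dropped). Everything is computed by hand with the standard library.

```python
import math
import re

song_titles = [
    "Candle in the wind",
    "A pillow of winds",
    "Wind of change",
    "The wind cries Mary",
]

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # default pattern: tokens of 2+ characters

# One set of distinct terms per document: df only asks whether a term is present.
docs = [set(re.findall(TOKEN_PATTERN, title.lower())) for title in song_titles]
n = len(docs)
vocabulary = sorted(set().union(*docs))

# df: in how many documents does each term appear?
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed idf (assumed to match sklearn's default settings).
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocabulary}

for term in vocabulary:
    print(f"{term:<8} df={df[term]} idf={idf[term]:.4f}")
```

Note that 'wind' and 'winds' are counted separately because no stemming is applied here.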

Let's go back to our song titles toy example below to see how it works with that. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer



Note: for the purposes of this walkthrough, we aren't using stopwords or a stemmer. We'll get to that in a few cells. Now, let's take the example of 'Candle in the Wind'. The term frequency for each of the terms is 1/4, i.e.:
- `tf('candle') = tf('in') = tf('the') = tf('wind')` = 1/4 (because each term appears only once)

Let's now do the whole 'tfidf' for 'candle' and 'wind':
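
A minimal sketch of that calculation, assuming the smoothed IDF (`ln((1 + n) / (1 + df)) + 1`, the `sklearn` default) and document frequencies counted by hand from the four titles: 'candle' appears in 1 title and 'wind' in 3.

```python
import math

n = 4                            # number of song titles
df = {"candle": 1, "wind": 3}    # counted by hand from the four titles
tf = 1 / 4                       # each term appears once in a 4-word title

tfidf = {}
for term, doc_freq in df.items():
    idf = math.log((1 + n) / (1 + doc_freq)) + 1
    tfidf[term] = tf * idf
    print(f"{term}: idf={idf:.4f}, tf*idf={tfidf[term]:.4f}")
# candle: idf=1.9163, tf*idf=0.4791
# wind: idf=1.2231, tf*idf=0.3058
```

'wind' gets the lower weight because it appears in three of the four titles, so it does less to distinguish this title from the others.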

We can confirm whether our IDF calculations are accurate by checking what `TfidfVectorizer` calculated. 

In [None]:
cv.idf_

Right, so, now we can multiply our term frequency with our inverse document frequency. There's another _implicit_ step here, normalization (the `norm` parameter, which defaults to `'l2'`), which we won't get into, but it means our final calculations will be different from what the sklearn class reports.
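
For the curious, here is a rough sketch of why the hand-computed numbers will not match exactly. By default, `TfidfVectorizer` uses raw counts as the term frequency and then rescales each document's vector to unit length (`norm='l2'`). The `l2_normalize` helper below is hypothetical, and the IDF values assume the smoothed default formula.

```python
import math

n = 4
# Document frequencies for the terms of "Candle in the wind", counted by hand.
df = {"candle": 1, "in": 1, "the": 2, "wind": 3}
idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}


def l2_normalize(weights):
    """Hypothetical helper: scale the weights so their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


raw_count_weights = {t: 1 * idf[t] for t in idf}      # tf as a raw count (1 each)
length_scaled_weights = {t: 0.25 * idf[t] for t in idf}  # tf as count / 4 words

from_raw = l2_normalize(raw_count_weights)
from_scaled = l2_normalize(length_scaled_weights)

for term in idf:
    print(f"{term:<7} {from_raw[term]:.4f} {from_scaled[term]:.4f}")
```

Both columns come out identical: once the vector is L2-normalized, dividing by the document length no longer matters, because any constant factor cancels out.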

We can now replicate this with the stemmed TfidfVectorizer that uses stopwords. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Reuse the parent's analyzer (preprocessing, tokenizing, stop words).
        analyzer = super().build_analyzer()
        # Stem each token the parent analyzer yields.
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))




Meanwhile, the IDF values calculated are:


In [None]:
tv.idf_