# Recap so far

## What we have done
- How to request web pages in Python
- How to deal with the HTML document that we receive (regular expressions, HTML parsing)
- We used this knowledge to scrape lyrics for one or several artists (and store them?)
- Yesterday you started talking about how to deal with language in machine learning models (tokenizing, lemmatizing)
- This morning we talked about the problems of imbalanced datasets and how to potentially deal with them

## What is missing
- All of you have one or more bodies of text for different artists
- What is the artist in our case? The class, label, y, target
- What is our X? The lyrics
- How to turn the lyrics into some input that a machine learning model is able to understand

We are now going to talk about the most basic approach to doing this.

## Today
- Convert words into numbers
- ASCII (encoding)??
- Text can be vectorized: we can translate each word into a binary/numerical representation
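
As a rough illustration of what "converting words into numbers" can mean, here is a minimal sketch in plain Python. The two example sentences and the helper `to_count_vector` are made up for this illustration; they are not part of the lyrics project.

```python
from collections import Counter

# Two tiny made-up documents (illustrative only)
corpus = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word gets a fixed position (column)
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in corpus]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a list of numbers of the same length, which is the kind of input a machine learning model can work with.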

In [1]:
# Import pandas
import pandas as pd

# Import spacy and load the language model
import spacy
nlp = spacy.load('en_core_web_md')

In [2]:
# Load the dataset
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(remove=['headers','footers'])
news = pd.DataFrame({'data': data.data, 
                   'label': data.target}).iloc[:20]
news.head()

Unnamed: 0,data,label
0,I was wondering if anyone out there could enli...,7
1,A fair number of brave souls who upgraded thei...,4
2,"well folks, my mac plus finally gave up the gh...",4
3,Robert J.C. Kyanko (rob@rjck.UUCP) wrote:\n> a...,1
4,"From article <C5owCB.n3p@world.std.com>, by to...",14


In [3]:
news['data'][0], data.target_names[news['label'][0]]

('I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 'rec.autos')

In [4]:
# Define cleaning function
def clean_text(review, model):
    """Preprocess a string: tokenize, drop stop words and non-alphabetic
    tokens, lemmatize and lowercase.

        params: review - a string
                model - a spacy model

        returns: the cleaned tokens joined into a single string
    """
    doc = model(review)
    # Keep the lowercased lemma of every alphabetic token that is not a stop word
    tokens = [word.lemma_.lower() for word in doc
              if not word.is_stop and word.is_alpha]
    return ' '.join(tokens)

In [5]:
# Preprocess the data
news['data'] = news['data'].apply(clean_text, model=nlp)
news.head()

Unnamed: 0,data,label
0,wonder enlighten car see day sport car look l...,7
1,fair number brave soul upgrade si clock oscil...,4
2,folk mac plus finally give ghost weekend star...,4
3,robert kyanko uucp write write article know w...,1
4,article tom baker article pack rat write clea...,14


## Bag of Words:
* The first attempt at creating word vectors. 
* The common approach for word vectorisation until word2vec appeared in 2013 (Mikolov et al.)

#### Pros
* Works for any text
* Easy and fast to do
* Does not require a language model (just the corpus)

#### Cons
* Does not apply language knowledge (the built-in stop word list covers English only)
* All words are equally similar / dissimilar (discrete, orthogonal vectors)
* Order of words is ignored
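
The last two cons can be made concrete with a small sketch in plain Python. The sentences, the three-word vocabulary and the helpers `bag_of_words` and `dot` are invented for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word; the order of the words is discarded."""
    return Counter(text.lower().split())

# Two sentences with opposite meaning but the same words
a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")
print(a == b)  # the two bags of words are identical

# One-hot vectors for a made-up three-word vocabulary
vocabulary = ["good", "great", "terrible"]
one_hot = {word: [int(i == j) for j in range(len(vocabulary))]
           for i, word in enumerate(vocabulary)}

def dot(u, v):
    """Dot product of two equally long lists."""
    return sum(x * y for x, y in zip(u, v))

# "good" is as far from "great" as it is from "terrible"
print(dot(one_hot["good"], one_hot["great"]))
print(dot(one_hot["good"], one_hot["terrible"]))
```

Because every word gets its own dimension, the dot product between any two different words is zero, so the representation carries no notion of similar meaning.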

---

### The Count Vectorizer:
#### Steps to build
* Create a corpus
* Fit a CountVectorizer on it
* Transform the corpus into a sparse, then dense, matrix

In [6]:
# Import the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
# Instantiate the CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

In [8]:
# Fit the CountVectorizer
vectorizer.fit(news['data'])

CountVectorizer(stop_words='english')

In [9]:
# Transform the corpus with the fitted CountVectorizer
vectorized_news = vectorizer.transform(news['data'])

In [10]:
vectorized_news # 20 newsgroup posts x 1176 distinct words

<20x1176 sparse matrix of type '<class 'numpy.int64'>'
	with 1584 stored elements in Compressed Sparse Row format>

#### Sparse Matrix
Most of our matrix consists of zeroes. A Sparse Matrix only stores the non-zero values to save memory. We need to convert it into a **dense** matrix to view it effectively.
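
To get a feeling for what a sparse matrix stores, here is a small sketch using `scipy.sparse` directly (scikit-learn returns SciPy sparse matrices). The 3x4 array is a made-up example.

```python
import numpy as np
from scipy import sparse

# A small made-up matrix that is mostly zeros
dense = np.array([[0, 0, 3, 0],
                  [0, 0, 0, 0],
                  [1, 0, 0, 2]])

# Compressed Sparse Row format, the same format the CountVectorizer returns
sparse_matrix = sparse.csr_matrix(dense)

print(sparse_matrix.nnz)        # number of stored (non-zero) values
print(sparse_matrix.data)       # the stored values themselves
print(sparse_matrix.indices)    # the column index of each stored value
print(sparse_matrix.toarray())  # back to a dense NumPy array
```

Note that `.toarray()` returns a regular NumPy array, while `.todense()` returns a `numpy.matrix`; both show the same numbers.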

In [11]:
# Look at the outcome
vectorized_news.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 1, ..., 1, 0, 0],
        [2, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [12]:
# Make a DataFrame out of it
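# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
vectorized_news = pd.DataFrame(vectorized_news.todense(), columns=vectorizer.get_feature_names_out(), index=news.index)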
vectorized_news.head()

Unnamed: 0,able,abraham,abs,absolute,absurd,abuse,accel,acceleration,acceptance,access,...,yep,yes,yesterday,yhwh,yo,york,young,yr,zoom,zyklon
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [13]:
vectorized_news.tail()

Unnamed: 0,able,abraham,abs,absolute,absurd,abuse,accel,acceleration,acceptance,access,...,yep,yes,yesterday,yhwh,yo,york,young,yr,zoom,zyklon
15,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
16,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
18,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Have a look at the columns
vectorized_news.columns

Index(['able', 'abraham', 'abs', 'absolute', 'absurd', 'abuse', 'accel',
       'acceleration', 'acceptance', 'access',
       ...
       'yep', 'yes', 'yesterday', 'yhwh', 'yo', 'york', 'young', 'yr', 'zoom',
       'zyklon'],
      dtype='object', length=1176)

**A downside of the Count Vectorizer is that it does not take into account how distinctive a word is: a word that occurs in almost every document counts just as much as a rare one. This is where TF-IDF comes in.**

---

### The Tf-Idf Transformer:

* TF - Term Frequency (number of times a word w occurs in document d)
* IDF - Inverse Document Frequency

$TFIDF(w,d) = TF(w,d) \cdot IDF(w)$

$IDF(w) = \log\left(\frac{1 + \text{no. of documents}}{1 + \text{no. of documents containing word } w}\right) + 1$

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse document frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{||v||_2}$

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
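
The steps above can be sketched by hand with NumPy. The count matrix below is a made-up example with 3 documents and 3 terms, and the variable names are chosen for this illustration only; the natural logarithm is assumed for the IDF.

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]], dtype=float)

n_documents = counts.shape[0]

# Number of documents that contain each term at least once
document_frequency = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, following the IDF formula above
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# Term frequency times inverse document frequency
tfidf = counts * idf

# Divide each document vector (row) by its Euclidean norm
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
tfidf_normalised = tfidf / norms

print(idf)
print(tfidf_normalised)
```

The first term occurs in every document and gets the lowest IDF, while the two rarer terms get a higher weight. With default settings, `TfidfTransformer().fit_transform(counts).toarray()` should give the same numbers.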

In [15]:
# Import the TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer

In [16]:
# Instantiate it
transformer = TfidfTransformer()

In [17]:
# Fit it
transformer.fit(vectorized_news)

TfidfTransformer()

In [18]:
# Transform it
transformed_news = transformer.transform(vectorized_news)

In [19]:
transformed_news

<20x1176 sparse matrix of type '<class 'numpy.float64'>'
	with 1584 stored elements in Compressed Sparse Row format>

In [20]:
# Make a DataFrame out of it
transformed_news = pd.DataFrame(transformed_news.todense(), 
                                columns=vectorized_news.columns, 
                                index=vectorized_news.index)
transformed_news.head()

Unnamed: 0,able,abraham,abs,absolute,absurd,abuse,accel,acceleration,acceptance,access,...,yep,yes,yesterday,yhwh,yo,york,young,yr,zoom,zyklon
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068885,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.110837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Compare the two DataFrames


### OK, so how could we use this to predict an artist with a classification model we have already seen?

**First, add a labels column to your dataframe by factorizing the artist name**

**Now follow the normal workflow for splitting, training and predicting a model**

1. Problem is well defined

2. The data was/is scraped by you

3. The data can be split by using the train-test-split

4. EDA is less important here

5. Feature Engineering: CountVectorization, TfidfTransformation --> We fit on the training data

6. Fit a model: Classification models (Logistic Regression, Decision Tree, Random Forest, Naive Bayes on Friday)

7. Cross-Validation/Hyperparameter Optimization

8. Test the model on test data --> Just use .transform(X_test) and don't fit the Vectorizer and the Transformer again.

*In this case, the only difference is that we will need to pass our new data (a song lyric) through the fitted vectorizer and transformer first*

*Remember not to refit the vectorizer and the transformer; just use them to transform the data*
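
A minimal sketch of this workflow could look as follows. The lyrics, the artist names `artist_a` / `artist_b` and all variable names are placeholders invented for this example; your own scraped data takes their place, and Logistic Regression is just one of the possible classifiers.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in for the scraped lyrics
songs = pd.DataFrame({
    'lyrics': ['love heart baby tonight', 'baby love dance heart',
               'heart love kiss tonight', 'dance baby love kiss',
               'street money city fight', 'city night money street',
               'fight street city power', 'money power night city'],
    'artist': ['artist_a'] * 4 + ['artist_b'] * 4,
})

# Factorize the artist name into an integer label
songs['label'], artist_names = pd.factorize(songs['artist'])

# Split BEFORE vectorizing so the test data stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    songs['lyrics'], songs['label'],
    test_size=0.25, random_state=42, stratify=songs['label'])

# Fit the vectorizer and the transformer on the training data only
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(vectorizer.fit_transform(X_train))

# Fit a classification model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Test data: only .transform(), no refitting
X_test_tfidf = transformer.transform(vectorizer.transform(X_test))
print(model.score(X_test_tfidf, y_test))

# New data (a song lyric) goes through the same fitted steps
new_lyric = ['money and power in the city']
new_tfidf = transformer.transform(vectorizer.transform(new_lyric))
predicted_artist = artist_names[model.predict(new_tfidf)[0]]
print(predicted_artist)
```

Words in the new lyric that the vectorizer never saw during fitting are simply ignored, which is why the columns of the training and test matrices always line up.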

---

## To make your code shorter, you could use the TfidfVectorizer
* This does both steps (CountVectorizer and TfidfTransformer) in one. The reason I show both in the tutorial is that it's easier to understand word vectors this way

`from sklearn.feature_extraction.text import TfidfVectorizer`
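
As a quick check, the following sketch compares the two-step approach with the `TfidfVectorizer` on a tiny made-up corpus, using default settings for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Tiny made-up corpus (illustrative only)
corpus = ['love heart baby tonight',
          'street money city fight',
          'baby love money city']

# Two steps: count the words, then weight the counts
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: the TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should produce the same matrix
print(np.allclose(two_step.toarray(), one_step.toarray()))
```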