In [None]:
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

# Machine learning and text data

If we have a corpus of texts, we first must preprocess those texts into a format the algorithms can understand. This usually means converting the representation of our text into numbers! We will do this using Scikit-learn, a popular Machine Learning library. 

# CountVectorizer

`CountVectorizer` will help us quickly tokenize text, learn its vocabulary, and encode the text as a vector for use in machine learning. This is often referred to as document encoding. 

Note that many ML operations in Scikit-learn are two-part processes:

### Fit -> Transform

This means slightly different things based on the kind of ML algorithm you're using. Here, when dealing with document encoding, *fitting* means tokenizing the texts and learning the vocabulary, while *transforming* means (yes) transforming the texts into vectors of numbers based on the encoding of that vocabulary. Our *model* will be this transformed dataset.

In [None]:
# Define a corpus
corpus = [
    "This is the first document.",
    "This is the second document.",
    "And the third one.",
    "Here we go with the fourth document?"
    ]

# Define an empty bag (of words)
vectorizer = CountVectorizer()

# Use the .fit method to tokenize the text and learn the vocabulary
vectorizer.fit(corpus)

# Print the vocabulary. Note: this yields a dictionary. What are the keys and values? 
vectorizer.vocabulary_

Now that we've `fit` the vectorizer to our texts and learned a vocabulary, we can `transform` each text into a vector based on the frequency (count) of each vocabulary word in that text.

In [None]:
vector = vectorizer.transform(corpus)

In [None]:
print(vector)


# Document term matrix

A [document term matrix](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) displays term frequencies that occur across a collection of documents. We want to encode the documents into a [sparse matrix](https://sebastianraschka.com/faq/docs/bag-of-words-sparsity.html#:~:text=By%20definition%2C%20a%20sparse%20matrix,as%20a%20word%2Dcount%20vector.&text=Thus%2C%20if%20most%20of%20your,most%20likely%20sparse%20as%20well!) to represent the frequencies of each vocabulary word across the documents.

We can take a look at our document term matrix as a dense array, where each row is a document and each column is a vocabulary word.

In [None]:
print(vector.toarray())

However, storing every entry of a sparse matrix (i.e., a matrix where most values are 0) in a full array like this can end up costing us a lot of memory and computing power. Performing operations across such an array may take a long time.

So, we can use an alternate data structure to represent the sparse data. The representation printed below is laid out like a *coordinate list*: only the non-zero entries are kept, each as a tuple containing the row and column index, followed by the value.

The column headers of this coordinate list could read **(document number, vocabulary word index), frequency**.


In [None]:
# Encode the documents
print(vector) 
print(vector.shape)
print(type(vector))

In [None]:
# Look at the arrays in the above cell. 
# In which documents does "and" appear? 
# What about "document"? What about "the"?
print(vectorizer.get_feature_names_out())


In [None]:
# What does this tell us? 
vectorizer.transform(['my new document is this']).toarray()

Read more [here](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) about sparse matrices for machine learning.

Note that the kind of model we are building here does not take word order into account. It simply counts words per document! Put differently, `CountVectorizer` creates a **bag of words model**, which represents a text by turning it into a "bag" of normalized words and counting them.


# Bigrams

In addition to unigrams, using bigrams can be useful to preserve some ordering information. Here we can look at two (bi), three (tri), four, or more words at a time!

> NOTE: **`ngram_range=(1,2)`** will get you unigrams and bigrams, **`ngram_range=(1,3)`** adds trigrams, **`ngram_range=(1,4)`** adds four-grams, etc. To get bigrams only, use **`ngram_range=(2,2)`**.

> **`token_pattern=r'\b\w+\b'`** is a regular expression that treats every run of word characters as a token, including single-character words that the default pattern drops.

We could also add many other parameters, such as `stop_words='english'` to remove the words on a built-in English stop word list.

In [None]:
# Define a bigram bag of words 
bigram_vectorizer = CountVectorizer(ngram_range = (1,2),
                                    token_pattern = r'\b\w+\b')
bigram_vectorizer

In [None]:
# Analyze the bigram bag of words
analyze = bigram_vectorizer.build_analyzer()
analyze('Bigrams. Are cool!')

# Apply this idea to our `corpus` variable from above

In [None]:
corpus

In [None]:
# Corpus transformation
bigram_array = bigram_vectorizer.fit_transform(corpus).toarray()
print(bigram_array)

In [None]:
# What are feature names? The column names! The rows are our documents :) 
print(bigram_vectorizer.get_feature_names_out())

In [None]:
# Note that these values are not word counts; each one is the column index of the term in the vocabulary
bigram_vectorizer.vocabulary_

# Review - Exploring Data with Pandas

For this next section we will use a dataset called `music_reviews.csv` (collected from [Metacritic](https://www.metacritic.com/)), which includes album reviews from well-known music magazines. 

Let's first explore the data. This is not only informative, but also helps ensure there are no glaring errors. Our data includes both the actual review (in the "body" column) and the numeric score, so we can start by exploring the latter.

First, what genres are in this dataset, and how many reviews in each genre?

In [None]:
reviews = pd.read_csv("../../Data/music_reviews.csv", sep = "\t")
print(reviews.shape)
reviews.head()

In [None]:
reviews['genre'].value_counts()

In [None]:
# Who were the artists?
reviews.artist.value_counts().head(20)

# or

# reviews['artist'].value_counts().head(20)

In [None]:
# Who were the reviewers?
reviews['critic'].value_counts().head(20)

In [None]:
# What was the distribution of review scores like?
reviews['score'].plot(kind='hist', 
                      bins = 50, 
                      figsize = (6, 3)); 

In [None]:
# Remember .groupby? It allows us to group the scores by genre!
reviews_grouped_by_genre = reviews.groupby('genre')

# Now let's get the mean for all scores, sorting in descending order  
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)

Together, let's make barplots for the number of reviews by genre.

In [None]:
# Get frequencies (counts) for the number of reviews by genre
reviews['genre'].value_counts()

In [None]:
# Convert this to a data frame
gen = pd.DataFrame(reviews['genre'].value_counts())

gen = gen.reset_index()
gen.columns =['GENRE', 'COUNT']

gen

In [None]:
# Create the plot
gen_fig = sns.barplot(x = 'COUNT', 
                      y = 'GENRE', 
                      data = gen, 
                      orient = 'h')

We could also make barplots for average review score by genre and boxplots for the review scores by genre. 

In [None]:
mean_review = reviews.groupby('genre')['score'].mean().sort_values()
mean_review.plot.barh(rot = 15);

In [None]:
# Boxplots of the score distribution by genre
sns.boxplot(x = "score", y = "genre", data = reviews);

> NOTE: remember that exploring your data with basic summary statistics and visualizations is a good first step before anything more complex!

# TF-IDF

[Term frequency–inverse document frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) can be thought of as an extension of `CountVectorizer`. However, instead of only counting words, TF-IDF weights each word by how distinctive it is for a document compared with the rest of the corpus. We'll talk more about what it is later. First, let's recap a bit of data preprocessing.

Let's use Python's scikit-learn package again. We'll use `CountVectorizer` as we did before, but we'll also use a word weighting technique called TF-IDF (term frequency inverse document frequency) to identify important and distinguishing words within this dataset, working with the results in Pandas.

We ask the question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [None]:
# What is going on here?
def remove_digits(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

reviews['body_without_digits'] = reviews['body'].apply(remove_digits)

In [None]:
reviews

In [None]:
# View the first body entry
reviews["body"][0]

In [None]:
# View entry 190 without digits and compare it with reviews["body"][190]. What happened?
list(reviews["body_without_digits"])[190]

# `CountVectorizer` revisited

Let's first revisit `CountVectorizer` and see what kind of vocabulary we are dealing with in the music reviews "body_without_digits" column. Whoa, that is a lot of words!

In [None]:
cv = CountVectorizer()
cv_bow = cv.fit_transform(reviews['body_without_digits'])
print(cv_bow)

This format is actually called Compressed Sparse Row (CSR) format and is useful because we can save huge document term matrices in this format - but it is difficult for a human to read. Let's convert it to a format we are more familiar with - a dataframe. We call this object "dtm" as it is a document-term matrix - a matrix with all the terms and their counts in all the documents.

In [None]:
dtm = pd.DataFrame(cv_bow.toarray(), columns=cv.get_feature_names_out(), index=reviews.index)
print(dtm.shape)

dtm.head(5)

Note most of the counts are 0: it's a sparse matrix. We're spending a lot of memory resources on zero values, which do not contain any useful information. This is why we use other representations such as Coordinate Lists or Compressed Sparse Matrices.
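
To get a feel for the memory cost, here is a rough sketch comparing a dense array with its CSR equivalent. The matrix is randomly generated and hypothetical (1,000 documents by 5,000 words, roughly 1% non-zero), not the music reviews data, so the exact numbers will differ for a real corpus.

```python
import numpy as np
from scipy import sparse

# Hypothetical document-term matrix: 1,000 documents x 5,000 words
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000), dtype=np.int64)

# Set about 1% of the cells to a count of 1 (duplicate positions simply overlap)
rows = rng.integers(0, 1000, size=50_000)
cols = rng.integers(0, 5000, size=50_000)
dense[rows, cols] = 1

# CSR keeps three arrays: the non-zero values, their column indices, and row pointers
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense array: {dense_bytes:,} bytes")
print(f"CSR matrix:  {sparse_bytes:,} bytes")
print(f"non-zero entries: {csr.nnz:,}")
```

The dense array pays for every zero, while the CSR matrix only pays for the entries that carry information.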

In [None]:
# Look at just the first row in its entirety
# Now do a command + f / control + f search for the number 1
pd.set_option('display.max_rows', None)
dtm.iloc[0]

# What can we do with a DTM?

In [None]:
# Quickly identify the most frequent words:
dtm.sum().sort_values(ascending=False).head(20)

In [None]:
# View the most infrequent words:
dtm.sum().sort_values().head(20)

In [None]:
# View the average number of times each word is used in a review:
dtm.mean().sort_values(ascending=False).head(20)

# TF-IDF scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind these word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as such:

**idf_word1 = number_of_documents / number_of_documents_with_word1**

so TF-IDF is:

**tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)**

You can, and often should, normalize the term frequency by the length of the document (so there's no bias for longer or shorter documents). Otherwise, long documents (with lots of words) would receive inflated TF-IDF scores.

**tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)**
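
As a worked example of the length-normalized formula above, here is a minimal sketch in plain Python. The three short documents and the helper name `tfidf_scores` are made up for illustration, and tokenization is just a whitespace split.

```python
from collections import Counter

# Hypothetical mini-corpus, tokenized by splitting on whitespace
manual_docs = ["the sax solo", "the rap beat", "the jazz sax"]
tokenized = [doc.split() for doc in manual_docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_scores(tokens):
    """Score each word as (count / document length) * (n_docs / doc_freq)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * (n_docs / doc_freq[word])
        for word, count in counts.items()
    }

first_doc_scores = tfidf_scores(tokenized[0])
for word, score in sorted(first_doc_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In the first document, "solo" scores highest because it appears in no other document, while "the" scores lowest because it appears in all three.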

We can calculate all of this manually, but scikit-learn has a built-in class to do so. With its default settings, this class uses a smoothed, log-scaled inverse document frequency and normalizes each document vector, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 
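
For the curious, the sketch below reproduces scikit-learn's numbers by hand on a made-up mini-corpus. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), where the idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. With other settings the formula changes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, used only for this check
check_docs = ["the sax solo", "the rap beat", "the jazz sax"]

# Raw counts, one row per document and one column per word
counts = CountVectorizer().fit_transform(check_docs).toarray()
n_documents = counts.shape[0]
document_freq = (counts > 0).sum(axis=0)

# Smoothed, log-scaled idf (assumes the default smooth_idf=True)
idf = np.log((1 + n_documents) / (1 + document_freq)) + 1

# Weight the counts, then scale each row to unit length (assumes norm='l2')
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with scikit-learn's own result
sklearn_tfidf = TfidfVectorizer().fit_transform(check_docs).toarray()
matches = bool(np.allclose(manual_tfidf, sklearn_tfidf))
print(matches)
```

Because of the row normalization, scores are comparable across documents of different lengths, but they are no longer simple multiples of the raw counts.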

# The `TfidfVectorizer` Class

To do so, we do the same thing we did before with `CountVectorizer`, but use the `TfidfVectorizer` class instead.

In [None]:
# If the stop word list is missing, run nltk.download('stopwords') once
from nltk.corpus import stopwords
stop = stopwords.words('english')

tfidfvec = TfidfVectorizer(min_df=2,lowercase=True,stop_words=stop)

tfidf_bow = tfidfvec.fit_transform(reviews['body_without_digits'])
print(tfidf_bow)

Note that each word in each review now has a TF-IDF score attached to it.

Again, let's turn this into a Pandas DataFrame (most of its values will be zero):

In [None]:
tfidf = pd.DataFrame(tfidf_bow.toarray(), columns=tfidfvec.get_feature_names_out(), index=reviews.index)
tfidf.head(5)

Note that we still have a lot of zeroes – that is, documents in which certain words don't appear at all (and thus don't receive a TF-IDF score).

What are the words with the highest TF-IDF score across all documents?

In [None]:
# Look at the 20 words with highest tf-idf weights:
tfidf.max().sort_values(ascending=False).head(20)

Ok! We have successfully identified content words. Note that stop words were removed here, because we passed a stop word list to `TfidfVectorizer`. What else do you notice about this list?

# Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre: 

In [None]:
tfidf['genre_'] = reviews['genre']
tfidf.head()

Now let's compare the words with the highest tf-idf weight for each genre. We'll create three dataframes:

In [None]:
rap = tfidf[tfidf['genre_'] == 'Rap']
indie = tfidf[tfidf['genre_'] == 'Indie']
jazz = tfidf[tfidf['genre_'] == 'Jazz']

In [None]:
# Have a quick look
rap.head(3)

In [None]:
# Note: max() gets the max value for each column (word)
# numeric_only=True excludes the "genre_" column
rap.max(numeric_only=True).sort_values(ascending=False).head(10)

In [None]:
indie.max(numeric_only=True).sort_values(ascending=False).head(10)

In [None]:
jazz.max(numeric_only=True).sort_values(ascending=False).head(10)

What does this tell us? For instance, it might be interesting that "authentic" is typically used in rap reviews, as well as terms like "tight" and "punch". Meanwhile, indie is connected with words like "likable" and "awesome", and jazz with more technical terminology like "minimalist", "innovative" and "descending".

In week 5 you will learn about topic modeling to see how machines can identify potentially abstract topics in text(s)!