In [None]:
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

# Machine learning and text data

If we have a corpus of texts, we first must preprocess those texts into a format the algorithms can understand. This usually means converting the representation of our text into numbers!

# Bag of words model

A bag of words model represents a text as an unordered "bag" of its words: word order is discarded, and the words are normalized (for example, lowercased) and counted. Scikit-learn's `CountVectorizer` class helps us do this.
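Before reaching for scikit-learn, it can help to see the idea by hand. The sketch below is only an illustration: `bag_of_words` is a hypothetical helper, and its regular expression (tokens of two or more word characters) is an assumption chosen to resemble `CountVectorizer`'s default behavior.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Assumption: a token is a run of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bag = bag_of_words("This is the second second document.")
print(bag)
```

Notice that the counter keeps how often each word occurs but not where it occurred in the sentence.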

# CountVectorizer

`CountVectorizer` will help us quickly tokenize text, learn its vocabulary, and encode the text as a vector for use in machine learning. This is often referred to as document encoding. 

In [None]:
# Define a corpus
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this is the first is document and?"
    ]

# Define an empty bag (of words)
vectorizer = CountVectorizer()

# Use the .fit method to tokenize the text and learn the vocabulary
vectorizer.fit(corpus)

# Print the vocabulary
vectorizer.vocabulary_

Our output is a dictionary. What are the keys and values? 

# Document term matrix

A [document term matrix](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) records how often each term occurs in each document of a collection. We want to encode the documents into a [sparse matrix](https://sebastianraschka.com/faq/docs/bag-of-words-sparsity.html#:~:text=By%20definition%2C%20a%20sparse%20matrix,as%20a%20word%2Dcount%20vector.&text=Thus%2C%20if%20most%20of%20your,most%20likely%20sparse%20as%20well!) that represents the frequency of each vocabulary word in each document.

If the printed output below had column headers, they would read **(document number, vocabulary word index)   frequency**

In [None]:
# Encode the documents
vector = vectorizer.transform(corpus)
print(vector) # 4 x 9 sparse matrix - four documents with nine words across them!
print(vector.shape)
print(type(vector))

In [None]:
# View the vector as arrays (4 x 9). Nice!!
# Each row is a document
# Each column is a vocabulary word (0 thru 8)
print(vector.toarray())

In [None]:
# Look at the arrays in the above cell. 
# In which documents does "and" appear? 
# What about "document"? What about "the"?
# Requires scikit-learn >= 1.0 (older versions used get_feature_names)
vectorizer.get_feature_names_out()

In [None]:
# What does this tell us? 
vectorizer.transform(['document']).toarray()
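One way to make the arrays easier to read is to label the columns with the vocabulary. The sketch below uses a hypothetical three-document `toy_corpus` (not the `corpus` variable above) and assumes that sorting `vocabulary_` by its values recovers the column order.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each term to its column index,
# so sorting the terms by that index gives the column order
terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# Rows are documents, columns are vocabulary terms
toy_dtm = pd.DataFrame(toy_counts.toarray(), columns=terms)
print(toy_dtm)
```

With labeled columns, questions such as "which documents contain `and`?" can be answered by reading a single column.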

# Bigrams

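In addition to unigrams, bigrams can be useful because they preserve some word-order information. We can look at two (bigrams), three (trigrams), four, or more words at a time!

> NOTE: `ngram_range=(min_n, max_n)` sets the shortest and longest n-grams to extract, so **`ngram_range=(1,2)`** will get you unigrams and bigrams, **`ngram_range=(1,3)`** adds trigrams, **`ngram_range=(1,4)`** adds four-grams, etc. Use **`ngram_range=(2,2)`** for bigrams only.

> **`token_pattern=r'\b\w+\b'`** is a regular expression that treats every run of word characters as a token. Unlike the default pattern, it keeps single-character words such as "a" and "I".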
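To see what these two parameters change, the sketch below runs three analyzers on a hypothetical sentence. The sentence and the variable names are assumptions for illustration; only `ngram_range` and `token_pattern` differ between the vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "I saw a cat"  # hypothetical example sentence

# Default settings: unigrams only, and single-character tokens are dropped
default_analyzer = CountVectorizer().build_analyzer()
print(default_analyzer(sentence))

# Custom token_pattern keeps "i" and "a"; (1, 2) returns unigrams and bigrams
uni_and_bi = CountVectorizer(ngram_range=(1, 2),
                             token_pattern=r"\b\w+\b").build_analyzer()
print(uni_and_bi(sentence))

# (2, 2) returns bigrams only
only_bi = CountVectorizer(ngram_range=(2, 2),
                          token_pattern=r"\b\w+\b").build_analyzer()
print(only_bi(sentence))
```

Comparing the three printed lists shows which tokens each setting adds or removes.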

In [None]:
# Define a bigram bag of words
bigram_vectorizer = CountVectorizer(ngram_range = (1,2),
                                    token_pattern = r'\b\w+\b', 
                                    min_df = 1)
bigram_vectorizer

In [None]:
# Analyze the bigram bag of words
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams. Are cool!')

# Apply this idea to our `corpus` variable from above

Why do we have four rows? 

How many columns do we have? What do they represent? 

In [None]:
corpus

In [None]:
# Corpus transformation
x = bigram_vectorizer.fit_transform(corpus).toarray()
print(x)

In [None]:
# What are feature names? The column names! The rows are our documents :) 
bigram_vectorizer.get_feature_names_out()

In [None]:
# What is going on here? 
# Search for which documents a certain vocabulary term appears in
feature_index = bigram_vectorizer.vocabulary_.get('is the')
x[:, feature_index]

# Document encoding / machine learning (continued)

[Term frequency–inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) can be thought of as an extension of `CountVectorizer`. However, instead of using raw word counts, TFIDF weights each count by how rare the word is across documents, so that words concentrated in a few documents score higher than words that appear everywhere.

# Vocabulary
- **Document Term Matrix:** A matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

- **TF-IDF Scores:** Short for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

- **Topic Modeling:** A general class of statistical models that uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
    
- **LDA:** Latent Dirichlet Allocation. A particular model for topic modeling. It does not take document order into account, unlike other topic modeling algorithms. Also see word2vec and BERT! (Week 5)

# DTM/TF-IDF

Let's use Python's scikit-learn package to make a document term matrix from the dataset `music_reviews.csv` (collected from [Metacritic](https://www.metacritic.com/)). We will then use the DTM and a word weighting technique called TF-IDF (term frequency inverse document frequency) to identify important and distinguishing words within this dataset with Pandas.

We ask the question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [None]:
reviews = pd.read_csv("../../Data/music_reviews.csv", sep = "\t")
reviews.head()

# Review - Explore the Data with Pandas

Let's first explore the data. This is not only informative, but also helps ensure there are no glaring errors.

First, what genres are in this dataset, and how many reviews in each genre?

In [None]:
reviews['genre'].value_counts()

In [None]:
# Who were the artists?
reviews.artist.value_counts().head(20)

# or

# reviews['artist'].value_counts().head(20)

In [None]:
# Who were the reviewers?
reviews['critic'].value_counts().head(20)

In [None]:
# What was the distribution of review scores like?
reviews['score'].plot(kind='hist', 
                      bins = 50, 
                      figsize = (6, 3)); 

In [None]:
# View average score by genre
reviews_grouped_by_genre = reviews.groupby("genre")
# print(reviews_grouped_by_genre)
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)

Together, let's make barplots for the number of reviews by genre.

In [None]:
# Get frequencies (counts) for the number of reviews by genre
reviews["genre"].value_counts()

In [None]:
# Convert this to a data frame
gen = pd.DataFrame(reviews["genre"].value_counts())
gen = gen.reset_index()
gen

In [None]:
# Check out the new column names
list(gen.columns)

In [None]:
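# Rename these columns by position: the first holds the genre, the second the count
# (the default names differ between pandas versions, so we do not rely on them)
gen.columns = ["GENRE", "COUNT"]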
gen

In [None]:
# Create the plot
gen_fig = sns.barplot(x = 'COUNT', 
                      y = 'GENRE', 
                      data = gen, 
                      orient = 'h')

We could also make barplots for average review score by genre and boxplots for the review scores by genre. 

In [None]:
mean_review = reviews.groupby('genre')['score'].mean()
mean_review.plot.barh(rot = 15);

In [None]:
# Boxplots of review scores by genre
sns.boxplot(x = "score", y = "genre", data = reviews);

# Wait, so what about TF-IDF?

Now that we have a sense of how these albums were scored by genre, we can take a look at the language of the reviews themselves to see how those words might relate to the album scores!

> NOTE: remember that exploring your data with basic summary statistics and visualizations is a good first step before anything more complex!

In [None]:
# What is going on here?
def remove_digits(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

reviews['body_without_digits'] = reviews['body'].apply(remove_digits)

In [None]:
reviews

In [None]:
# View the first body entry
list(reviews["body"])[0]

In [None]:
# View that same body entry - but without digits! What happened?
list(reviews["body_without_digits"])[0]
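If the effect is hard to spot in a long review, a toy example can help. The sketch below uses a hypothetical two-row data frame (`toy_reviews`, not the real dataset) and also shows a vectorized alternative with `str.replace`; treating the two approaches as equivalent is an assumption that holds for ordinary ASCII digits.

```python
import pandas as pd

# Hypothetical review texts, for illustration only
toy_reviews = pd.DataFrame({
    "body": ["Released in 2012, this is album no. 3.", "No digits here."]
})

def remove_digits(comment):
    """Return the comment with every digit character removed."""
    return "".join(ch for ch in comment if not ch.isdigit())

# Character-by-character version, as in the cell above
toy_reviews["body_without_digits"] = toy_reviews["body"].apply(remove_digits)

# Vectorized alternative using a regular expression
toy_reviews["regex_version"] = toy_reviews["body"].str.replace(r"\d", "", regex=True)

print(toy_reviews["body_without_digits"].tolist())
```

Removing digits before vectorizing keeps years, track numbers, and scores from becoming vocabulary terms of their own.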

# `CountVectorizer` revisited

Let's revisit `CountVectorizer` and see what kind of vocabulary we are dealing with in the music reviews "body_without_digits" column. Whoa, that is a lot of words!!!

In [None]:
countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(reviews['body_without_digits'])
print(sparse_dtm)

This format is called Compressed Sparse Row (CSR) format. It is useful because it stores only the nonzero entries, so we can save huge document term matrices in it, but it is difficult for a human to read. Let's convert it to a format we are more familiar with, a data frame:

In [None]:
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names_out(), index=reviews.index)
print(dtm.shape)

# Whaaaaaaaaaaaat is going on?
dtm.head(n = 5)

In [None]:
# Look at just the first row in its entirety
dtm.iloc[0]

In [None]:
# How about....
# Now do a command + f / control + f search for the number 1
pd.set_option('display.max_rows', None)
dtm.iloc[0]

# What can we do with a DTM?

In [None]:
# Quickly identify the most frequent words:
dtm.sum().sort_values(ascending=False).head(20)

In [None]:
# View the most infrequent words:
dtm.sum().sort_values().head(20)

In [None]:
# View the average number of times each word is used in a review:
dtm.mean().sort_values(ascending=False).head(20)
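The summaries above work on the dense data frame, which can use a lot of memory for a large corpus. As a sketch of an alternative, the totals can be computed directly on the sparse matrix. The three-document `toy_docs` corpus and all variable names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat ran", "the dog ran far"]  # hypothetical corpus

toy_vec = CountVectorizer()
toy_sparse = toy_vec.fit_transform(toy_docs)

# Recover the column order from the term -> column index mapping
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

# Summing over axis 0 gives one total per term; flatten the result to 1-D
totals = np.asarray(toy_sparse.sum(axis=0)).ravel()
term_totals = pd.Series(totals, index=terms).sort_values(ascending=False)
print(term_totals.head(3))

# Share of cells that are nonzero: a rough measure of how sparse the matrix is
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(round(density, 2))
```

On a real corpus the density is usually far lower than in this toy example, which is why the sparse format saves so much space.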

# TF-IDF scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind these word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent but are also used in every single document will not be distinguishing. We want to identify words that are unevenly distributed across the corpus. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as:

**number_of_documents / number_of_documents_with_term**

so:

**tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)**

You can, and often should, normalize the term frequency by the length of the document:

**tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)**

We can calculate this manually, but scikit-learn has a built-in class to do so. It uses a smoothed, log-scaled inverse document frequency and normalizes each document's vector, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
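To make the two calculations concrete, here is a sketch on a hypothetical three-document corpus. It first applies the simple formula from the text, then reproduces what `TfidfVectorizer` does with its default settings, assuming those defaults are a smoothed IDF of `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["the cat sat", "the cat ran", "the dog ran far"]  # hypothetical corpus

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Simple version from the text: normalized term frequency * (N / df)
tf = counts / counts.sum(axis=1, keepdims=True)
simple_tfidf = tf * (n_docs / doc_freq)

# scikit-learn defaults (assumed): smoothed log IDF, then L2-normalize each row
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sklearn_tfidf))
```

In both versions, "the" (which appears in every document) receives a lower weight than a word such as "sat" that appears in only one.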

# `TfidfVectorizer`

To do so, we do the same thing we did before with `CountVectorizer`, but use the `TfidfVectorizer` class instead.

In [None]:
tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(reviews['body_without_digits'])
print(sparse_tfidf)

Turn this into a Pandas DataFrame: 

In [None]:
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names_out(), index=reviews.index)
tfidf.head(n = 50)

In [None]:
# Look at the 20 words with highest tf-idf weights:
tfidf.max().sort_values(ascending=False).head(20)

Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?

# Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre: 

In [None]:
tfidf['genre_'] = reviews['genre']
tfidf.head()

Now let's compare the words with the highest tf-idf weight for each genre: 

In [None]:
rap = tfidf[tfidf['genre_'] == 'Rap']
indie = tfidf[tfidf['genre_'] == 'Indie']
jazz = tfidf[tfidf['genre_'] == 'Jazz']

In [None]:
rap.max(numeric_only=True).sort_values(ascending=False).head(10)

In [None]:
indie.max(numeric_only=True).sort_values(ascending=False).head(10)

In [None]:
jazz.max(numeric_only=True).sort_values(ascending=False).head(10)
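The cells above rank words by their maximum weight in any single review. A possible variation, sketched below on a hypothetical four-review data frame, is to average the weights within each genre so that one unusual review has less influence. The texts, genre labels, and variable names are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews, for illustration only
toy = pd.DataFrame({
    "genre": ["Rap", "Rap", "Jazz", "Jazz"],
    "body": [
        "sharp verses and heavy beats",
        "clever verses over heavy beats",
        "a warm saxophone solo",
        "saxophone and piano trade solo lines",
    ],
})

toy_vec = TfidfVectorizer()
toy_weights = toy_vec.fit_transform(toy["body"]).toarray()
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_tfidf = pd.DataFrame(toy_weights, columns=terms, index=toy.index)

# Average each term's weight within a genre, then list the top terms per genre
mean_by_genre = toy_tfidf.groupby(toy["genre"]).mean()
for genre, row in mean_by_genre.iterrows():
    print(genre, row.sort_values(ascending=False).head(3).index.tolist())
```

Whether the maximum or the mean is more informative depends on the question, so it is worth comparing both on the real data.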

In week 5 you will learn about topic modeling to see how machines can identify potentially abstract topics in texts! [Check out the sweet animation on the Wikipedia page](https://en.wikipedia.org/wiki/Topic_model)