<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week9/Text_Analytics_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Week 7
# Text Analytics 2

[Text Analytics](https://people.ischool.berkeley.edu/~hearst/text-mining.html) (or text mining) is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles.

### This will be one of the most difficult lab sessions of the semester. Don't hesitate to ask if anything is unclear!

### Table of Contents
#### 0. Project: Git and GitHub
#### 1. Recap on text representation
* 1.1 Some important concepts
* 1.2 Bag of Words (BOW)
* 1.3 TF-IDF

#### 2. Introduction to Gensim and Word Embedding
* 2.1 Word embedding with Word2Vec
* 2.2 Exercise

#### 3. Complaints Classification: TF-IDF vs. Doc2Vec
* 3.1 Load and clean data
* 3.2 EDA
* 3.3 Classification using TF-IDF and Logistic Regression
* 3.4 Classification using Doc2Vec and Logistic Regression

Author: Luc Kunz

## 0. Project: Git and GitHub
For the project, you will have to work with Git and GitHub. The following documentation can be useful to you:
* [Git and GitHub tutorial for beginners](https://www.youtube.com/playlist?list=PL4cUxeGkcC9goXbgTDQ0n_4TBzOO0ocPR)
* [GitHub Desktop video 1](https://www.youtube.com/watch?v=fJtyf62yAb8)
* [GitHub Desktop video 2](https://www.youtube.com/watch?v=GqNAD4XoZ6k)
* [Git Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf)

If you're having trouble completing your project using Git and/or GitHub Desktop, please let me know by email or Slack and we can arrange an additional lab session on how to do a Python project with Git and GitHub Desktop.

In [None]:
# Import required packages
import gensim
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import bs4 as bs
import urllib.request
import spacy
import string
import math
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
import seaborn as sns

# Load English language model of spacy
sp = spacy.load('en_core_web_sm')

## 1. Recap on text representation
In order to be able to use texts as inputs for classification, we have to transform them into numbers (i.e. vectors). There are several ways of doing this.

### 1.1 Some important concepts
* Document = some text i.e. a string (e.g. a sentence, a tweet, paragraph of text, book, news article, etc.).
* Corpus = collection of documents.
* Dictionary = list of unique tokens in (preprocessed) corpus.
* Vector = mathematical representation of a document (e.g. Bag of Words).
* Model = algorithm used for transforming vectors from one representation to another (e.g. TF-IDF).

In [None]:
# A document
doc = 'Tom confessed that he had fallen in love with me' # single quotes
doc = "Tom confessed that he had fallen in love with me" # double quotes
doc = """Tom confessed that he had fallen in love with me.""" # triple quotes

In [None]:
# A corpus
d1 = "Tom confessed that he had fallen in love with me"
d2 = "We musn't joke around with love"
d3 = "Human-caused climate change has caused land ice to melt and ocean water to expand"
d4 = "Climate change is not really happening"
d5 = "We asked Tom what he wanted for Christmas"
corpus = [d1, d2, d3, d4, d5]

# Preprocessing
from gensim.utils import simple_preprocess
processed_corpus = []
for doc in corpus:
  processed_corpus.append(simple_preprocess(doc))
processed_corpus

In [None]:
# A dictionary
from gensim import corpora
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

In [None]:
dictionary.token2id

### 1.2 Bag of Words (BOW)

Bag of Words is the simplest approach to achieve the transformation of documents into vectors. It is divided into two basic steps:
* Create a dictionary of unique words from the corpus.
* Analyse the documents, i.e. for each word in the dictionary and each document, count how many times the word appears in the document (0 if it is absent).

Let's try to code it from scratch using spacy:

In [None]:
# Tokens in document
def get_tokens(document):
  doc_tokens = []
  for token in sp(document):
      if (token.is_punct == False) and (token.is_space == False):
        doc_tokens.append(token.lower_)
  return doc_tokens

In [None]:
# List of unique words in corpus (dictionary)
def vocabulary(corpus):
  # Declare output
  word_list = []
  # Loop documents - lower each word and add it to the output
  for document in corpus:
    spacy_doc = sp(document)
    for token in spacy_doc:
      if token.lower_ not in word_list and (token.is_punct == False) and (token.is_space == False):
        word_list.append(token.lower_)
  # Return output
  return word_list
    
vocabulary(corpus)

We now have a function to get the words of a document and a function to get the unique words of a corpus of documents. We can use them to create the Bag of Words.

In [None]:
# Bag of Words
def bow(document, corpus):
  # Get tokens
  doc_tokens = get_tokens(document)
  corpus_tokens = vocabulary(corpus)
  # Initialization
  bag = {}
  for token in corpus_tokens:
    bag[token] = 0
  # Add 1 if token is in document
  for token in doc_tokens:
    bag[token] += 1
  # Return
  return bag

bow(d1, corpus)

In [None]:
# Dataframe - all documents in corpus
bag_of_words = []
for doc in corpus:
  bag = bow(doc, corpus)
  bag_of_words.append(bag)
  
pd.DataFrame(bag_of_words)

Remarks:
* This is not perfect (e.g. we could remove stopwords, use n-grams, lemmas).
* We can use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from sklearn as shown below.

In [None]:
# Using CountVectorizer
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(corpus).todense()
bag_of_words

In [None]:
# Features
vectorizer.vocabulary_

In [None]:
# DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bag_of_words = pd.DataFrame(bag_of_words, columns=vectorizer.get_feature_names_out())
bag_of_words

Advantages of BOW:
* No need of huge corpus of words to get good results in practice.
* Easy to understand (i.e. not mathematically complex).

Disadvantages of BOW:
* A lot of zeros (imagine a corpus of 1000 articles) --> consume memory and space.
* Does not maintain any context information ("I eat a fish" vs. "A fish eats me").
* Half solutions: n-grams, specifying min_df and max_df (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).
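
To make the n-gram "half solution" concrete, here is a minimal sketch in plain Python. The two sentences and the `ngrams` helper are hypothetical illustrations, not part of the lab code. Both sentences contain exactly the same words, so their unigram bags are identical, but their bigrams differ, which recovers a little of the word order.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two toy sentences with the same words in a different order (illustrative only)
s1 = "the dog bites the man".split()
s2 = "the man bites the dog".split()

# Unigram bags are identical: a plain Bag of Words cannot tell the sentences apart
same_unigrams = Counter(ngrams(s1, 1)) == Counter(ngrams(s2, 1))

# Bigram bags differ: ('dog', 'bites') appears only in s1, ('man', 'bites') only in s2
same_bigrams = Counter(ngrams(s1, 2)) == Counter(ngrams(s2, 2))

print(same_unigrams, same_bigrams)
print(set(ngrams(s1, 2)) - set(ngrams(s2, 2)))
```

In scikit-learn the same idea is available through the `ngram_range` parameter of `CountVectorizer`, at the cost of a larger (and even sparser) vocabulary.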


### 1.3 TF-IDF
TF-IDF is a bag-of-words variant in which, instead of filling the embedding vector with zeros, ones, or raw counts, you store floating-point weights that carry more information. The idea is to emphasize words that appear in few documents of the corpus. A word that appears many times in one document but in few other documents gets a high weight in that document, whereas a word that appears many times in many documents gets a low weight. A highly weighted word is therefore very useful for identifying the document.

TF(word, document) = Term frequency = (Number of occurrences of a word in document)/(Total words in the document)
- greater if word appears many times in document

IDF(word) = Inverse Document Frequency = Log((Total number of documents)/(Number of documents containing the word))
- greater if word appears in fewer documents

TF-IDF = TF*IDF
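
As a worked example of the formulas above, the sketch below computes the TF-IDF of the word "climate" in the fourth document of the small corpus used in this notebook. It assumes a naive tokenization (lowercasing and splitting on whitespace) and the natural logarithm; the variable names are illustrative only.

```python
import math

# Same five documents as the small corpus defined earlier in the notebook
docs = [
    "Tom confessed that he had fallen in love with me",
    "We musn't joke around with love",
    "Human-caused climate change has caused land ice to melt and ocean water to expand",
    "Climate change is not really happening",
    "We asked Tom what he wanted for Christmas",
]

# Naive tokenization (assumption): lowercase and split on whitespace
tokenized = [d.lower().split() for d in docs]

word = "climate"
document = tokenized[3]  # "climate change is not really happening" -> 6 tokens

tf_value = document.count(word) / len(document)       # 1 / 6
doc_freq = sum(word in d for d in tokenized)          # appears in 2 documents
idf_value = math.log(len(tokenized) / doc_freq)       # ln(5 / 2)
tfidf_value = tf_value * idf_value

print(round(tf_value, 4), round(idf_value, 4), round(tfidf_value, 4))
# 0.1667 0.9163 0.1527
```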

In [None]:
# Term frequency (TF)
def tf(document):
  # Get tokens
  tokens = get_tokens(document)
  # Initialization
  term_freq = {}
  for token in tokens:
    term_freq[token] = 0
  # Increment
  for token in tokens:
    term_freq[token] += 1/len(tokens)
  # Return
  return term_freq

tf(d3)

In [None]:
# Inverse document frequency
def idf(corpus):
  # Get list of unique words in corpus
  voc = vocabulary(corpus)
  # Initialization
  inv_doc_freq = {}
  for word in voc:
    inv_doc_freq[word] = 0
  # Number of documents in which each word appears (document frequency)
  for word in voc:
    for document in corpus:
      doc_tokens = get_tokens(document)
      if word in doc_tokens:
        inv_doc_freq[word] += 1
  #print(inv_doc_freq)
  #print("\n----------------------\n")
  # IDF
  inv_doc_freq = {k: math.log(len(corpus) / inv_doc_freq[k]) for k in inv_doc_freq.keys()}
  # Return
  return inv_doc_freq

idf(corpus)

In [None]:
# TF-IDF
def tfidf(document, corpus):
  # TF
  tf_bag = tf(document)
  # IDF
  idf_bag = idf(corpus)
  # TF*IDF
  tfidf_bag = {k: tf_bag[k]*idf_bag[k] for k in tf_bag.keys()}
  
  return tfidf_bag

tfidf(d3, corpus)

In [None]:
# DataFrame
bag_of_words_tfidf = []
for doc in corpus:
  bag = tfidf(doc, corpus)
  bag_of_words_tfidf.append(bag)
  
pd.DataFrame(bag_of_words_tfidf).fillna(0)

Remarks:
* This is not perfect (e.g. we could remove stopwords, use n-grams, lemmas).
* We can use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from sklearn as shown below.

In [None]:
# Using TfidfVectorizer
vectorizer = TfidfVectorizer()
bag_of_words = vectorizer.fit_transform(corpus).todense()

# DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bag_of_words = pd.DataFrame(bag_of_words, columns=vectorizer.get_feature_names_out())
bag_of_words
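
The values returned by `TfidfVectorizer` differ from those of the from-scratch functions. According to the scikit-learn documentation, the default settings use raw counts as TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and then scale each document vector to unit L2 norm. The sketch below reproduces that recipe in plain Python on a hypothetical toy corpus. It is an approximation for illustration (tokenization and other details of the real class are ignored), not the library's implementation.

```python
import math

def sklearn_style_tfidf(tokenized_docs):
    """TF-IDF with smoothed IDF and L2 normalisation (sketch of sklearn's default recipe)."""
    n_docs = len(tokenized_docs)
    vocab_list = sorted({token for doc in tokenized_docs for token in doc})
    # Document frequency: number of documents containing each token
    df = {t: sum(t in doc for doc in tokenized_docs) for t in vocab_list}
    # Smoothed IDF: never zero, even for a token present in every document
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab_list}
    rows = []
    for doc in tokenized_docs:
        raw = [doc.count(t) * idf[t] for t in vocab_list]   # raw count * idf
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab_list, rows

# Hypothetical, already tokenized toy corpus
toy_docs = [
    ["climate", "change", "is", "real"],
    ["climate", "change", "melts", "ice"],
    ["tom", "loves", "ice"],
]
toy_vocab, toy_rows = sklearn_style_tfidf(toy_docs)
print(toy_vocab)
print([round(v, 3) for v in toy_rows[0]])
```

Within the first document, "real" (present in one document) receives a larger weight than "climate" (present in two), and every row has length 1, which is why the numbers are not directly comparable with the unnormalised from-scratch values.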

Advantage of TF-IDF:
* Smart way of representing documents in corpus. More information is provided.

Disadvantages of TF-IDF (same as for BOW):
* A lot of zeros (imagine a corpus of 1000 articles) --> consume memory and space
* Does not maintain any context information ("I eat a fish" vs. "A fish eats me")
* Half solutions: n-grams, specifying min_df and max_df (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)).
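
The effect of `min_df` and `max_df` can be sketched in plain Python: tokens that appear in too few or too many documents are dropped from the vocabulary, which shortens the vectors. The helper `prune_vocabulary` and the toy documents are hypothetical, and the sketch assumes `min_df` is an absolute document count while `max_df` is a proportion (scikit-learn accepts either form for both parameters).

```python
def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency is >= min_df (a count)
    and whose document proportion is <= max_df (a fraction of the corpus)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):          # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(
        token for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Hypothetical, already tokenized toy corpus
toy_corpus = [
    ["the", "climate", "is", "changing"],
    ["the", "ice", "is", "melting"],
    ["the", "climate", "warms", "the", "ocean"],
    ["the", "ocean", "is", "rising"],
]

full_vocab = prune_vocabulary(toy_corpus)
pruned_vocab = prune_vocabulary(toy_corpus, min_df=2, max_df=0.75)
print(len(full_vocab), pruned_vocab)
# 9 ['climate', 'is', 'ocean']
```

Here "the" is removed because it occurs in every document, and the words seen in a single document are removed because they are too rare, so the vocabulary shrinks from 9 to 3 tokens.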

## 2. Introduction to Gensim and Word Embedding

In the following, we illustrate how we can find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

We use Gensim. A complete tutorial can be found [here](https://www.tutorialspoint.com/gensim/gensim_introduction.htm).

### 2.1 Word Embedding with Word2Vec

Word embedding approaches use neural network-based techniques to convert words into vectors such that semantically similar words are close to each other in an N-dimensional space, where N is the dimension of the vectors. The underlying assumption is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model.

Two word embedding methods:
* Word2Vec by Google
* GloVe (Global vectors for Word Representation) by Stanford

Word2Vec gives astonishing results. Its ability to maintain a semantic relationship is reflected in a classic example where if you have a vector for the word "King" and you remove the vector represented by the word "Man" from the "King" and add "Woman", you get a vector that is close to the vector "Queen". 

* King - Man + Woman = Queen
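
The arithmetic behind this analogy can be sketched with hand-made toy vectors. The three dimensions (roughly "royalty", "male", "female") and all the numbers below are invented for illustration; real Word2Vec vectors have many more dimensions and are learned from data. Closeness is measured with cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 3-dimensional vectors: [royalty, male, female]
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                       toy_vectors["man"],
                                       toy_vectors["woman"])]

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: v for w, v in toy_vectors.items() if w not in ("king", "man", "woman")}
best_match = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best_match)  # queen
```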

Second example: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2Vec they will therefore share a similar vector representation.

In real applications, Word2Vec models are trained on billions of words. For example, [Google's pre-trained Word2Vec model](https://code.google.com/archive/p/word2vec/) was trained on about 100 billion words from Google News and contains vectors for 3 million words and phrases.

GloVe is an alternative to Word2Vec that learns word vectors from global word co-occurrence statistics. More information [here](https://nlp.stanford.edu/projects/glove/).

More detail on word embedding will be given in the class following this lab session. You can also click [here](https://www.youtube.com/watch?v=yFFp9RYpOb0) to watch a video on Word2Vec.

In [None]:
# Get texts from Wikipedia
def get_text(url):
  scrapped_data = urllib.request.urlopen(url)
  article = scrapped_data.read()
  parsed_article = bs.BeautifulSoup(article,'lxml')
  paragraphs = parsed_article.find_all('p')
  article_text = ""
  for p in paragraphs:
    article_text += p.text
  return article_text

machine_learning = get_text("https://en.wikipedia.org/wiki/Machine_learning")
ai = get_text("https://en.wikipedia.org/wiki/Artificial_intelligence")

machine_learning

In [None]:
ai

In [None]:
# Group texts in list
texts = [machine_learning, ai]

In [None]:
# Create tokenizer function for preprocessing
def spacy_tokenizer(text):

    # Define stopwords, punctuation, and numbers
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    punctuations = string.punctuation
    numbers = "0123456789"

    # Create spacy object
    mytokens = sp(text)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Remove suffixes like ".[1" in "experience.[1"
    mytokens_2 = []
    for word in mytokens:
      for char in word:
        if (char in punctuations) or (char in numbers):
          word = word.replace(char, "")
      if word != "":
        mytokens_2.append(word)

    # Return preprocessed list of tokens
    return mytokens_2

# Tokenize texts
processed_texts = []
for text in texts:
  processed_text = spacy_tokenizer(text)
  processed_texts.append(processed_text)

In [None]:
for processed_text in processed_texts:
  print(processed_text[:20])

In [None]:
# Word embedding
# Parameters:
#     - min_count: minimum number of occurrences of a word in the corpus for it to be taken into account
#     - vector_size: dimension of the vectors representing the tokens (named `size` before gensim 4.0)
#     - IMPORTANT: processed_texts must be a list of lists of tokens!
word2vec = Word2Vec(processed_texts, min_count=2, vector_size=100)
vocab = word2vec.wv.key_to_index  # token -> index mapping (replaces `wv.vocab` from gensim < 4.0)
print(vocab)

In [None]:
# Vector
v1 = word2vec.wv['intelligence'] 
v1

In [None]:
# Similar vectors/words
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

In [None]:
# Similarity between two words
word2vec.wv.similarity('computer', 'animal')

In [None]:
word2vec.wv.similarity('computer', 'machine')

Remarks:
* Many things can be done with Gensim (e.g. [topic modelling](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/))
* There exists also `Doc2Vec`, which is used to create a vectorised representation of a group of words (i.e. a document) taken collectively as a single unit (illustrated in the next section).

### 2.2 Exercise
Analyze the Wikipedia article on [Coronavirus](https://en.wikipedia.org/wiki/Coronavirus) as above. Follow the steps and send your answers and code to @Luc Kunz on Slack (direct message) or via Zoom (private). This is a good way to improve your participation grade.

In [None]:
# 1. Get text from URL - use the get_text() function defined above
coronavirus = get_text('https://en.wikipedia.org/wiki/Coronavirus')

# 2. Processing - tokenization using the spacy_tokenizer() function
processed_corona = spacy_tokenizer(coronavirus)
processed_corona[:10]

In [None]:
# 3. How many times does the word "virus" occur?
count = 0
for word in processed_corona:
  if word == 'virus':
    count += 1
count

In [None]:
# 4. Create a Word2Vec representation of the article with a min_count of 1 and a vector size of 50
word2vec_corona = Word2Vec([processed_corona], min_count=1, vector_size=50)  # `size=50` before gensim 4.0

# 5. What are the 10 most similar words to "virus"?
word2vec_corona.wv.most_similar('virus')

## 3. Complaints Classification: TF-IDF vs. Doc2Vec
We classify consumer finance complaints into 12 pre-defined categories using:
* TF-IDF and logistic regression
* Doc2Vec and logistic regression

We use the same tokenizer function, train-test split, classification algorithm, etc. The only difference is the mathematical representation (i.e. the vectorization from the tokens) of the complaints:
* TF-IDF: important words (or n-grams) are words (n-grams) that frequently appear in few documents.
* Doc2Vec: similar documents must be close to each other in n-dimensional space. Focus on the context of the documents.

### 3.1 Load and clean data
We work with a sample of a large data set from Data.gov that can be found [here](https://catalog.data.gov/dataset/consumer-complaint-database).



In [None]:
# Load data from GitHub
path = "https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week9/data/complaints_sample.csv"
df = pd.read_csv(path, index_col=0)
df.head()

In [None]:
df.info()

The data set includes 18 columns and 9101 rows describing consumer complaints about financial products. In this case, we want to predict the `Product` category based on the text of the complaint (i.e. `Consumer complaint narrative`).

In [None]:
# Select columns of interest
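# .copy() makes `data` an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
data = df[["Product", "Consumer complaint narrative"]].copy()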
data.head()

About two thirds of the complaint narratives are missing (null values). They cannot be used for the prediction, so we drop them.

In [None]:
# Drop NaN
print(data.isnull().sum())
data = data.dropna().reset_index(drop=True)
data.head()

In [None]:
data.info()

We end up with 3137 complaints for which we would like to predict the product concerned.

### 3.2 EDA

In [None]:
# Total number of words - over 600,000
data['Consumer complaint narrative'].apply(lambda x: len(x.split(' '))).sum()

In [None]:
# Sample
data['Consumer complaint narrative'].sample().values[0]

The data has been anonymized (i.e. names, dates, IDs, etc. have been replaced by XXXX).

In [None]:
# Imbalanced dataset
data.Product.value_counts()

There are 17 categories. We group some of them together (e.g. `Credit card`, `Prepaid card`, and `Credit or prepaid card`) because they are sub-categories of each other. We end up with 12 categories.

In [None]:
# Clean
dic_replace = {'Credit reporting':'Credit reporting, credit repair services, or other personal consumer reports', 
               'Credit card':'Credit card or prepaid card', 
               'Payday loan':'Payday loan, title loan, or personal loan', 
               'Money transfers':'Money transfer, virtual currency, or money service',
               'Prepaid card':'Credit card or prepaid card',
               'Virtual currency':'Money transfer, virtual currency, or money service'}
data.replace(dic_replace, inplace=True)
data.Product.value_counts()

In [None]:
# Plot number of complaints per category
cnt_pro = data['Product'].value_counts()
plt.figure(figsize=(12,4))
sns.barplot(x=cnt_pro.index, y=cnt_pro.values, alpha=0.8)  # recent seaborn versions require keyword arguments
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Base rate
round(len(data[data.Product == "Credit reporting, credit repair services, or other personal consumer reports"]) / len (data), 4)
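
The base rate is the share of the most frequent category, i.e. the accuracy we would get by always predicting that category (assuming, as the value counts suggest, that the category used above is the largest one). Any classifier should beat it to be useful. A minimal sketch on invented toy labels:

```python
from collections import Counter

# Invented toy labels (not the real complaint data)
toy_labels = ["credit reporting"] * 5 + ["mortgage"] * 3 + ["debt collection"] * 2

# Most frequent class and its share of the data set
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]
toy_base_rate = majority_count / len(toy_labels)

print(majority_class, toy_base_rate)
# credit reporting 0.5
```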

### 3.3 Classification using TF-IDF and Logistic Regression

In [None]:
# Import packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
# Define tokenizer function
def spacy_tokenizer(sentence):

    punctuations = string.punctuation
    stop_words = spacy.lang.en.stop_words.STOP_WORDS

    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Remove anonymous dates and people
    mytokens = [ word.replace('xx/', '').replace('xxxx/', '').replace('xx', '') for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in ["xxxx", "xx", ""] ]

    # Return preprocessed list of tokens
    return mytokens

In [None]:
# Select features
X = data['Consumer complaint narrative'] # the features we want to analyze
ylabels = data['Product'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=1234)

X_train

In [None]:
y_train

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
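
To read the heatmap, it helps to see how accuracy and the confusion matrix are built. The sketch below recomputes both by hand on invented toy labels, with rows as actual classes and columns as predicted classes (the same orientation as the plot above). The helper name `toy_confusion_matrix` is hypothetical. Per-class recall is added because, on an imbalanced data set, overall accuracy can hide poor performance on small categories.

```python
def toy_confusion_matrix(y_true, y_pred, labels):
    """Rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Invented toy labels and predictions
toy_true = ["loan", "loan", "card", "card", "card", "mortgage"]
toy_pred = ["loan", "card", "card", "card", "loan", "mortgage"]
toy_classes = sorted(set(toy_true))          # ['card', 'loan', 'mortgage']

toy_matrix = toy_confusion_matrix(toy_true, toy_pred, toy_classes)

# Accuracy = correct predictions (diagonal) / all predictions
toy_accuracy = sum(toy_matrix[i][i] for i in range(len(toy_classes))) / len(toy_true)

# Recall per class = diagonal entry / row total
toy_recall = {c: toy_matrix[i][i] / sum(toy_matrix[i]) for i, c in enumerate(toy_classes)}

print(toy_matrix)                 # [[2, 1, 0], [1, 1, 0], [0, 0, 1]]
print(round(toy_accuracy, 4))     # 0.6667
print(toy_recall)
```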

### 3.4 Classification using Doc2Vec and Logistic Regression
We now try to do the same exercise, but using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html).

In [None]:
%%time
# Tokenize data - same tokenizer function as before (%%time must be the first line of the cell)
from gensim.models.doc2vec import TaggedDocument
sample_tagged = data.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['Consumer complaint narrative']), tags=[r.Product]), axis=1)
print(sample_tagged.head(20))

In [None]:
sample_tagged.values[10]

In [None]:
# Train test split - same split as before
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=1234)

train_tagged

In [None]:
test_tagged

In [None]:
# Number of CPU cores, used to parallelise Doc2Vec training and speed it up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

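# dm=0 selects the distributed bag of words (DBOW) variant; the keyword for the number of passes is `epochs`
model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epochs=300)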
model_dbow.build_vocab([x for x in train_tagged.values])

We now train the distributed bag of words model. In short, it trains a neural network and the optimal weights are the coefficients of the vectors of the documents. Therefore, similar documents will be close to each other in the N-dimensional space (N being the size of the vectors). More information on this [here](https://thinkinfi.com/simple-doc2vec-explained/).

In [None]:
# Train distributed Bag of Words model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    # infer_vector takes `epochs` in gensim >= 4.0 (the argument was called `steps` before)
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=100)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

In [None]:
# Each document (i.e. complaint) is now a vector in a 30-dimensional space.
# Similar complaints should have similar vector representation.
X_train[:3]

In [None]:
# Fit model on training set - same algorithm as before
logreg = LogisticRegression(max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)

# Predictions
y_pred = logreg.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

## References
* https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
* https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f