# Practical 1: Pre-processing
#### Ayoub Bagheri
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

In this practical, we are going to do some text preprocessing! Are you looking for Python documentation to refresh your knowledge of programming? If so, you can check https://docs.python.org/3/reference/

Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
* Zero configuration required
* Free access to GPUs
* Easy sharing

Colab notebooks are Jupyter notebooks that are hosted by Colab. Here you can find links to more detailed introductions to Colab: https://colab.research.google.com/notebooks/intro.ipynb

We need the following packages:

In [1]:
import nltk
import string
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd 
import re
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # for bag of words feature extraction
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

You can simply run `!pip install package_name` to install a package. On your own computer you generally only need to install each package once and can then just import it. In Colab, however, you may need to reinstall a package after reconnecting to the runtime.

NB: The nltk package comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/

To install the data after installing nltk, use nltk’s data downloader: `nltk.download()`.

### Let's get started!

### Pre-processing a simple text
#### `If you feel comfortable with Python and Google Colab, skip to question 8.`

Open Colab and create a new empty notebook to work with Python 3! Go to https://colab.research.google.com/ and log in with your account. Then click on "File $\rightarrow$ New notebook".

1\. **Text is known as a string object or as an array of characters. Create an object _a_ with the value of "Hello @Text Mining World! I'm here to learn everything, right?", and then print it!**

2\. **Since this is an array, print the first and last character of your object.**

3\. **Use the built-in string method lower() (it is part of Python itself, not of nltk) to convert the characters in the object to their lowercase form and save the result into a new object _b_.**
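
The first three questions can be sketched as follows. This is only one possible approach: it relies on plain Python indexing and the built-in `str.lower()` method, and the names `a` and `b` are the ones suggested in the questions.

```python
# Create the string from question 1.
a = "Hello @Text Mining World! I'm here to learn everything, right?"
print(a)

# Strings can be indexed like arrays: index 0 is the first character,
# index -1 is the last one.
print(a[0], a[-1])

# lower() returns a new string; the original object a is unchanged.
b = a.lower()
print(b)
```

Because strings are immutable, `a` still holds the original capitalization after `b` has been created.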

4\. **Use the _string_ package to print the list of punctuation characters.**

Punctuation can separate characters, words, phrases, or sentences. In some applications it is very important to the task at hand; in others it is redundant and should be removed!

5\. **Use the code below to remove the punctuation from our string. Name your object c.**

In [None]:
# Remember there are many ways to remove punctuation! This is only one of them:
c = "".join([char for char in b if char not in string.punctuation])
print(c)

6\. **Use the `word_tokenize` function from _nltk_ and tokenize string *b*. Compare that with the tokenization of string _c_.**

7\. **Use the `sent_tokenize` function from the _nltk_ package and split string _b_ into sentences. Compare that with the sentence tokenization of string _c_.**

### Pre-processing a text corpus

Pre-processing a corpus is similar to pre-processing a text string. 

Here are some resources for public text data sets:
- CLARIN Resource Families: https://www.clarin.eu/portal
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table
- Kaggle: https://www.kaggle.com/

Here, we want to analyze and pre-process the `Taylor Swift song lyrics` from all her albums. We downloaded this data set from the Kaggle website and have already placed it in the data folder. You can find more information about the original data set here: https://www.kaggle.com/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums

8\. **Read the “taylor_swift.csv” data set. Inspect its first and last rows with the head and tail functions.**

9\. **Use the code below to replace the '\n' characters with a space to remove the line breaks. The code adds a new column to the dataframe, _Preprocessed Lyrics_. We are going to fill this column with the preprocessed text produced by the steps in the following questions.**

In [None]:
def remove_linebreaks(text):
    """custom function to remove the line breaks"""
    return re.sub(r'\n', ' ', text)

ts_lyrics["Preprocessed Lyrics"] = ts_lyrics["Lyrics"].apply(remove_linebreaks)
ts_lyrics.head()

10\. **Write a custom function to remove the punctuation. (Hint: You can use the method in question 5 or make use of `str.maketrans` together with `string.punctuation`.)**
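
A minimal sketch of the `maketrans` route mentioned in the hint is shown below. The function name `remove_punctuation` matches the name used later in this practical, but the body is only one possible implementation and is an assumption here.

```python
import string


def remove_punctuation(text):
    """Remove every character listed in string.punctuation (one possible approach)."""
    # str.maketrans("", "", chars) builds a table that deletes all chars.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


example = remove_punctuation("hello @text mining world! i'm here")
print(example)
```

On a dataframe column the function could then be applied with `.apply(remove_punctuation)`, in the same way as `remove_linebreaks` in question 9. Note that apostrophes are removed too, which is why "i'm" becomes "im".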

11\. **Convert the characters to their lowercase forms. Think about why and when we need this step in our analysis.**
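
One reason for lowercasing is that term counts otherwise treat "Love" and "love" as different words. The toy tokens below are hypothetical and only illustrate that effect; they are not taken from the lyrics data set.

```python
from collections import Counter

# Hypothetical tokens with mixed capitalization.
tokens = "Love love LOVE story Story".split()

# Without lowercasing, every spelling variant is a separate term.
raw_counts = Counter(tokens)
print(raw_counts)

# After lowercasing, the variants collapse into a single term each.
lower_counts = Counter(token.lower() for token in tokens)
print(lower_counts)
```

Lowercasing is less appropriate when case carries meaning, for example when distinguishing "US" from "us" or when detecting named entities.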

12\. **Use the code below to list the 20 most frequent terms in your preprocessed lyrics.**

In [None]:
# To get all lyrics in one text, you can concatenate all of them using the " ".join(list) syntax, 
# which joins all elements in a list separating them by whitespace.
text = " ".join(lyric for lyric in ts_lyrics["Preprocessed Lyrics"])

# split() returns list of all the words in the string
split_it = text.split()
  
# Pass the split_it list to an instance of the Counter class.
# Use a new name so that the Counter class itself is not overwritten.
word_counts = Counter(split_it)

# most_common(k) returns the k most frequent values and their counts.
most_occur = word_counts.most_common(20)
  
print(most_occur)


You see that these are mainly stop words. Before removing them, let's plot a word cloud of our data.

13\. **Use the code below to plot a wordcloud with a maximum of 50 words. Check the command _?WordCloud_ to review the help page of this function.**

In [None]:
wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14\. **Run the code given below to remove the stop words, after extending the stop word list with the words "im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt". Show the 20 most frequent terms and plot the word cloud of 50 words again.**

In [None]:
# run the code nltk.download('stopwords') if needed
stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:
stop_words.update(["im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt"])
# stop_words.discard('word') # this is when you want to remove a word from the list
print(stop_words)

In [None]:
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in stop_words])

ts_lyrics["Preprocessed Lyrics"] = ts_lyrics["Preprocessed Lyrics"].apply(lambda text: remove_stopwords(text))
ts_lyrics.head()

In [None]:
from collections import Counter
# To get all lyrics in one text, you can concatenate all of them using the " ".join(list) syntax, 
# which joins all elements in a list separating them by whitespace.
text = " ".join(lyric for lyric in ts_lyrics["Preprocessed Lyrics"])

# split() returns list of all the words in the string
split_it = text.split()
  
# Pass the split_it list to an instance of the Counter class.
# Use a new name so that the Counter class itself is not overwritten.
word_counts = Counter(split_it)

# most_common(k) returns the k most frequent values and their counts.
most_occur = word_counts.most_common(20)
  
print(most_occur)

15\. **We can apply stemming or lemmatization on our text data. Apply a lemmatizer from nltk and save the results.**

### Vector space model: Bag-of-Words

16\. **Use the CountVectorizer from the sklearn package and build a bag of words model on _Preprocessed Lyrics_ based on term frequency. Check the shape of the output matrix.**

17\. **Inspect the first 100 terms in the vocabulary.**
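
The sketch below shows how a bag-of-words model could be built with `CountVectorizer`. The toy corpus is hypothetical and stands in for the _Preprocessed Lyrics_ column. The names `vectorizer1` and `dtm` follow the ones used later in this practical, but how they were originally created is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for ts_lyrics["Preprocessed Lyrics"].
toy_docs = [
    "love story baby just say yes",
    "shake shake shake it off",
    "you belong with me",
]

vectorizer1 = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse
# document-term matrix with one row per document.
dtm = vectorizer1.fit_transform(toy_docs)
print(dtm.shape)

# vocabulary_ maps each term to its column index; sorting by that
# index lists the terms in column order.
terms = sorted(vectorizer1.vocabulary_, key=vectorizer1.vocabulary_.get)
print(terms[:100])
```

With the default settings, tokens shorter than two characters are dropped and no stop words are removed, so the vocabulary size depends on the pre-processing done beforehand.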

18\. **Using TfidfVectorizer, you can create a model based on tfidf. Apply a TfidfVectorizer to your text data. Does the shape of the output matrix differ from that of the document-term matrix (dtm) from question 16?**

19\. **Use the TfidfVectorizer to create an n-gram based model with n = 1 and 2. (Hint: Use the ngram_range argument to determine the lower and upper boundary of the range of n-values for different n-grams to be extracted.)**
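
As a rough illustration of the two tfidf questions, the sketch below fits a unigram model and a unigram-plus-bigram model on a hypothetical toy corpus. The variable names are made up for this example and are not necessarily the ones used in the rest of the practical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for ts_lyrics["Preprocessed Lyrics"].
toy_docs = [
    "love story baby just say yes",
    "shake shake shake it off",
    "you belong with me",
]

# Unigram tfidf model: same vocabulary as a default CountVectorizer,
# but the cells hold tfidf weights instead of raw counts.
tfidf_unigram_vectorizer = TfidfVectorizer()
tfidf_unigram = tfidf_unigram_vectorizer.fit_transform(toy_docs)

# ngram_range=(1, 2) extracts unigrams and bigrams, so the
# vocabulary (number of columns) grows.
tfidf_bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_bigram = tfidf_bigram_vectorizer.fit_transform(toy_docs)

print(tfidf_unigram.shape, tfidf_bigram.shape)

# By default each row is L2-normalized.
row_norms = np.linalg.norm(tfidf_unigram.toarray(), axis=1)
print(row_norms)
```

The number of rows stays equal to the number of documents in both models; only the number of columns changes when bigrams are added.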

20\. **We want to compare the lyrics of the Friends theme song with the lyrics of Taylor Swift's songs and find the most similar one. Use the code below to, first, apply the pre-processing steps and then transform the text into count and tfidf vectors. Do the bag-of-words models (tf vs tfidf) agree on the song most similar to the Friends theme song?**

In [None]:
friends_theme_lyrics = "So no one told you life was going to be this way. Your job's a joke, you're broke, your love life's DOA. It's like you're always stuck in second gear, When it hasn't been your day, your week, your month, or even your year. But, I'll be there for you, when the rain starts to pour. I'll be there for you, like I've been there before. I'll be there for you, cause you're there for me too."
friends_theme_lyrics

In [None]:
friends_theme_lyrics = remove_punctuation(friends_theme_lyrics)
friends_theme_lyrics = friends_theme_lyrics.lower()
friends_theme_lyrics = remove_stopwords(friends_theme_lyrics)
friends_theme_lyrics = lemmatize_words(friends_theme_lyrics)
friends_theme_lyrics

In [None]:
friends_theme_lyrics_tf = vectorizer1.transform([friends_theme_lyrics])
# print both shapes: the number of columns (vocabulary size) should match
print(friends_theme_lyrics_tf.shape)
print(dtm.shape)

In [None]:
# compute the cosine similarity between every song and the Friends theme song
cosine_sim_dtm = cosine_similarity(dtm, friends_theme_lyrics_tf)
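
To make clear what the cosine similarity measures, here is a small pure-Python sketch on raw term counts. It is not the scikit-learn implementation; the helper `cosine_from_counts` and the toy texts are hypothetical and only illustrate the formula (dot product divided by the product of the vector lengths).

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine similarity of two whitespace-tokenized texts using raw term counts."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Only terms that occur in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and "songs".
query = "rain day friend"
songs = ["rain rain day", "love story", "friend friend friend"]

scores = [cosine_from_counts(song, query) for song in songs]
# Index of the most similar song, analogous to np.argmax on the similarity matrix.
best = max(range(len(songs)), key=scores.__getitem__)
print(scores, best)
```

A song that shares no terms with the query gets a similarity of 0, and longer songs are not favoured merely for being long because the counts are normalized by vector length.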

In [None]:
max_index = np.argmax(cosine_sim_dtm, axis=0)
print(cosine_sim_dtm[max_index])
max_index

In [None]:
ts_lyrics.iloc[max_index]

In [None]:
# show the most similar song (index 38 in the original run)
ts_lyrics["Preprocessed Lyrics"].iloc[max_index[0]]

In [None]:
friends_theme_lyrics_tfidf = vectorizer3.transform([friends_theme_lyrics])
print(friends_theme_lyrics_tfidf.shape)
print(tfidf_matrix3.shape)
# compute the cosine similarity between every song and the Friends theme song
cosine_sim_tfidf = cosine_similarity(tfidf_matrix3, friends_theme_lyrics_tfidf)

In [None]:
max_index = np.argmax(cosine_sim_tfidf, axis=0)
print(cosine_sim_tfidf[max_index])
max_index

In [None]:
ts_lyrics.iloc[max_index]

In [None]:
# show the most similar song (index 16 in the original run)
ts_lyrics["Preprocessed Lyrics"].iloc[max_index[0]]