# Practical 1: Text Pre-processing
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />


#### Applied Text Mining - Utrecht Summer School

In this practical, we will first get acquainted with Python in Google Colab, and then do some text preprocessing! Are you looking for Python documentation to refresh your knowledge of programming? If so, you can check https://docs.python.org/3/reference/

Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with:
* Zero configuration required
* Free access to GPUs
* Easy sharing

Colab notebooks are Jupyter notebooks that are hosted by Colab. Here you can find links to more detailed introductions to Colab: https://colab.research.google.com/notebooks/intro.ipynb

### Let's get started!

### Pre-processing simple texts

1\. **Open Colab and create a new empty notebook to work with Python 3!** We are going to work with the Python libraries NLTK, Gensim, and spaCy. 

Go to https://colab.research.google.com/ and log in with your account. Then click on "File $\rightarrow$ New notebook".

2\. **Text is also known as a string variable, or as an array of characters. Create a variable _a_ with the text value of "Hello @Text Mining World! I'm here to learn everything, right?", and then print it!**

In [1]:
a = "Hello @Text Mining World! I'm here to learn everything, right?"
a

"Hello @Text Mining World! I'm here to learn everything, right?"

3\. **Since this is an array, print the first and last character of your variable.**

In [57]:
print(a[0]) # without print, a notebook cell only displays the value of its last expression
print(a[61])
l = len(a)
print("Length of your string is: ", l)
print(a[l-1])

H
?
Length of your string is:  62
?
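Python also supports negative indices, which count from the end of the string, so the last character can be read without knowing the length. A minimal sketch, reusing the string from step 2:

```python
a = "Hello @Text Mining World! I'm here to learn everything, right?"

first_char = a[0]    # index 0 is the first character
last_char = a[-1]    # index -1 is the last character, same as a[len(a) - 1]

print(first_char, last_char)
```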


4\. **Use the _"!pip install"_ command to install the packages nltk, spacy, gensim, and numpy. The _string_ module is part of the Python standard library, so it does not need to be installed.**

Generally, you only need to install each package once on your computer and can then simply import it. In Colab, however, you may need to reinstall a package when you reconnect to the runtime.

NB: nltk comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/
To install the data after installing nltk, call nltk’s data downloader: "nltk.download()".

In [None]:
!pip install numpy
!pip install nltk
!pip install gensim
!pip install spacy
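Before reinstalling anything, it can help to check which modules are already importable in the current session. The sketch below uses only the standard library; the helper name `is_installed` is an assumption made for this example. Note that `string` always shows up as available because it ships with Python itself.

```python
import importlib.util

# Module names used in this practical; "string" is part of the standard library.
packages = ["numpy", "nltk", "gensim", "spacy", "string"]

def is_installed(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None

status = {name: is_installed(name) for name in packages}
for name, available in status.items():
    note = "available" if available else "missing (install with pip)"
    print(f"{name}: {note}")
```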

NB: If you want to stop your code in Jupyter and the cell keeps running:
- Add raise SystemExit("Stop right there!") at the point where execution should stop.
- Or cut the code out of the cell: press "ctrl+a" to select all the code in the cell you want to stop, then press "ctrl+x" to cut it. The cell is now empty, so only an empty cell is executed.

5\. **Import (load) the nltk package, then use the string method _lower_ to convert the characters in string _a_ to their lowercase form and save the result in a new variable _b_.**

In [3]:
import nltk
# nltk.download()
b = a.lower()
b

"hello @text mining world! i'm here to learn everything, right?"

6\. **Use the _string_ package to print the list of punctuation characters.**

Punctuation can separate characters, words, phrases, or sentences. In some applications it is very important to the task at hand; in others it is redundant and should be removed!

In [4]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


7\. **Use the punctuation list to remove the punctuations from the lowercase form of our example string _a_. Name your variable _c_.**

In [5]:
# Remember there are many ways to remove punctuation! This is only one of them:
c = "".join([char for char in b if char not in string.punctuation])
print(c)

hello text mining world im here to learn everything right


8\. **Use the _word_tokenize_ function from _nltk_ to tokenize string _b_. Compare that with the tokenization of string _c_.**

In [62]:
from nltk.tokenize import word_tokenize
print(word_tokenize(b))
print(word_tokenize(c))


['hello', '@', 'text', 'mining', 'world', '!', 'i', "'m", 'here', 'to', 'learn', 'everything', ',', 'right', '?']
['hello', 'text', 'mining', 'world', 'im', 'here', 'to', 'learn', 'everything', 'right']


We see that the main difference lies in the punctuation. However, we also see that some words are now merged in the tokenization of string c: "i'm" became "im" when the apostrophe was deleted.
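The merged token comes from deleting the apostrophe outright. One alternative, sketched here with only the standard library, is to replace each punctuation character with a space before splitting. The variable names `to_space` and `spaced` are chosen for this example only.

```python
import string

b = "hello @text mining world! i'm here to learn everything, right?"

# Map every punctuation character to a space instead of deleting it.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = b.translate(to_space)

# split() with no arguments collapses runs of whitespace.
tokens = spaced.split()
print(tokens)
```

Whether "i" and "m" as separate tokens are preferable to "im" depends on the task; neither form is the original word.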

9\. **Use the _RegexpTokenizer_ class from _nltk_ to tokenize string _b_ whilst removing punctuation. This way you will avoid unnecessary concatenations.**

In [6]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize(b)

['hello',
 'text',
 'mining',
 'world',
 'i',
 'm',
 'here',
 'to',
 'learn',
 'everything',
 'right']

With this tokenizer, the punctuation is dropped just as it was for string c, but the output is not identical: "i'm" is split into 'i' and 'm' instead of being merged into 'im'.
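The pattern `\w+` can also be tried out with the standard-library `re` module. The sketch below assumes plain `re.findall` as a stand-in for the tokenizer, which should give the same tokens here, and adds a hypothetical variant pattern that keeps contractions such as "i'm" in one piece.

```python
import re

b = "hello @text mining world! i'm here to learn everything, right?"

# \w+ matches maximal runs of letters, digits, and underscores.
tokens_split = re.findall(r"\w+", b)

# Allowing an apostrophe inside a token keeps contractions together.
tokens_contractions = re.findall(r"\w+(?:'\w+)*", b)

print(tokens_split)
print(tokens_contractions)
```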

10\. **Use the _sent_tokenize_ function from the _nltk_ package to split string _b_ into sentences. Compare that with the sentence tokenization of string _c_.**

In [64]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(b))
print(sent_tokenize(c))

['hello @text mining world!', "i'm here to learn everything, right?"]
['hello text mining world im here to learn everything right']


You may wonder why sentence tokenization is needed when word tokenization is available. Imagine you need to compute the average number of words per sentence. For that task you need both the NLTK sentence tokenizer and the NLTK word tokenizer: the first gives the number of sentences, the second the number of words. Because the resulting ratio is numeric, it can serve as a feature for machine learning models.
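The ratio can be sketched without any downloads. The example below does not use the NLTK tokenizers; it assumes a naive regular-expression sentence splitter (break after `.`, `!`, or `?`) and a simple word pattern, both of which are stand-ins chosen for illustration.

```python
import re

text = "Hello @Text Mining World! I'm here to learn everything, right?"

# Naive sentence splitter (an assumption for this sketch): break after ., ! or ?
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Count word tokens per sentence; apostrophes stay inside a token.
words_per_sentence = [len(re.findall(r"\w+(?:'\w+)*", s)) for s in sentences]

average = sum(words_per_sentence) / len(sentences)
print(sentences)
print(words_per_sentence, average)
```

On real text, a trained sentence tokenizer handles abbreviations and similar cases that this naive split would get wrong.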

### Pre-processing a text corpus (data set)

Pre-processing a data set is similar to pre-processing simple text strings. First, we need to get some data. For this, we can use our own data set, scrape data from the web, or use social media APIs. There are also some websites with publicly available data sets:
- CLARIN Resource Families: https://www.clarin.eu/portal
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table
- Kaggle: https://www.kaggle.com/

Here, we want to analyze and pre-process the Taylor Swift song lyrics data from all her albums. We downloaded this data set from Kaggle and placed it in the data folder. Here is the link to the original data set: https://www.kaggle.com/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums

11\. Read the “taylor_swift_lyrics.csv” data set from the data folder into a dataframe. Try out the head and tail functions on your dataframe.

In [10]:
import pandas as pd 
ts_lyrics = pd.read_csv("data/taylor_swift_lyrics.csv")

In [13]:
ts_lyrics.head()

Unnamed: 0,Artist,Album,Title,Lyrics
0,Taylor Swift,Taylor Swift,Tim McGraw,He said the way my blue eyes shinx\nPut those ...
1,Taylor Swift,Taylor Swift,Picture to Burn,"State the obvious, I didn't get my perfect fan..."
2,Taylor Swift,Taylor Swift,Teardrops on my Guitar,"Drew looks at me,\nI fake a smile so he won't ..."
3,Taylor Swift,Taylor Swift,A Place in This World,"I don't know what I want, so don't ask me\n'Ca..."
4,Taylor Swift,Taylor Swift,Cold As You,You have a way of coming easily to me\nAnd whe...


In [14]:
ts_lyrics.tail()

Unnamed: 0,Artist,Album,Title,Lyrics
127,Taylor Swift,folklore,mad woman,What did you think I'd say to that?\nDoes a sc...
128,Taylor Swift,folklore,epiphany,"Keep your helmet\nKeep your life, son\nJust a ..."
129,Taylor Swift,folklore,betty,"Betty, I won't make assumptions about why you ..."
130,Taylor Swift,folklore,peace,Our coming of age has come and gone\nSuddenly ...
131,Taylor Swift,folklore,hoax,My only one\nMy smoking gun\nMy eclipsed sun\n...


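The next two cells were captured after step 12 had been run, which is why their output already includes the _Preprocessed Lyrics_ column.

In [31]: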
ts_lyrics.iloc[0]

Artist                                                     Taylor Swift 
Album                                                      Taylor Swift 
Title                                                         Tim McGraw
Lyrics                 He said the way my blue eyes shinx\nPut those ...
Preprocessed Lyrics    He said the way my blue eyes shinx Put those G...
Name: 0, dtype: object

In [32]:
ts_lyrics.head(1)

Unnamed: 0,Artist,Album,Title,Lyrics,Preprocessed Lyrics
0,Taylor Swift,Taylor Swift,Tim McGraw,He said the way my blue eyes shinx\nPut those ...,He said the way my blue eyes shinx Put those G...


12\. Add a new column to the dataframe and name it _Preprocessed Lyrics_, then fill it with the preprocessed text produced by the steps in this and the following questions. First, replace the '\n' line breaks with a space character.

In [30]:
import re
def remove_linebreaks(text):
    """custom function to remove the line breaks"""
    return re.sub(r'\n', ' ', text)

ts_lyrics["Preprocessed Lyrics"] = ts_lyrics["Lyrics"].apply(remove_linebreaks)
ts_lyrics.head()

Unnamed: 0,Artist,Album,Title,Lyrics,Preprocessed Lyrics
0,Taylor Swift,Taylor Swift,Tim McGraw,He said the way my blue eyes shinx\nPut those ...,He said the way my blue eyes shinx Put those G...
1,Taylor Swift,Taylor Swift,Picture to Burn,"State the obvious, I didn't get my perfect fan...","State the obvious, I didn't get my perfect fan..."
2,Taylor Swift,Taylor Swift,Teardrops on my Guitar,"Drew looks at me,\nI fake a smile so he won't ...","Drew looks at me, I fake a smile so he won't s..."
3,Taylor Swift,Taylor Swift,A Place in This World,"I don't know what I want, so don't ask me\n'Ca...","I don't know what I want, so don't ask me 'Cau..."
4,Taylor Swift,Taylor Swift,Cold As You,You have a way of coming easily to me\nAnd whe...,You have a way of coming easily to me And when...


13\. Write another custom function to remove the punctuation. You can use the previous method or combine the built-in _str.maketrans_ with _string.punctuation_.

In [34]:
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', string.punctuation))

ts_lyrics["Preprocessed Lyrics"] = ts_lyrics["Preprocessed Lyrics"].apply(remove_punctuation)
ts_lyrics.head()

Unnamed: 0,Artist,Album,Title,Lyrics,Preprocessed Lyrics
0,Taylor Swift,Taylor Swift,Tim McGraw,He said the way my blue eyes shinx\nPut those ...,he said the way my blue eyes shinx put those g...
1,Taylor Swift,Taylor Swift,Picture to Burn,"State the obvious, I didn't get my perfect fan...",state the obvious i didnt get my perfect fanta...
2,Taylor Swift,Taylor Swift,Teardrops on my Guitar,"Drew looks at me,\nI fake a smile so he won't ...",drew looks at me i fake a smile so he wont see...
3,Taylor Swift,Taylor Swift,A Place in This World,"I don't know what I want, so don't ask me\n'Ca...",i dont know what i want so dont ask me cause i...
4,Taylor Swift,Taylor Swift,Cold As You,You have a way of coming easily to me\nAnd whe...,you have a way of coming easily to me and when...


14\. Change all the characters to their lowercase forms. Think about why and when we need this step in our analysis.

In [35]:
ts_lyrics["Preprocessed Lyrics"] = ts_lyrics["Preprocessed Lyrics"].str.lower()
ts_lyrics.head()

Unnamed: 0,Artist,Album,Title,Lyrics,Preprocessed Lyrics
0,Taylor Swift,Taylor Swift,Tim McGraw,He said the way my blue eyes shinx\nPut those ...,he said the way my blue eyes shinx put those g...
1,Taylor Swift,Taylor Swift,Picture to Burn,"State the obvious, I didn't get my perfect fan...",state the obvious i didnt get my perfect fanta...
2,Taylor Swift,Taylor Swift,Teardrops on my Guitar,"Drew looks at me,\nI fake a smile so he won't ...",drew looks at me i fake a smile so he wont see...
3,Taylor Swift,Taylor Swift,A Place in This World,"I don't know what I want, so don't ask me\n'Ca...",i dont know what i want so dont ask me cause i...
4,Taylor Swift,Taylor Swift,Cold As You,You have a way of coming easily to me\nAnd whe...,you have a way of coming easily to me and when...
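The three steps above (line breaks, punctuation, lowercase) can also be chained with pandas string methods instead of custom functions. This is a minimal sketch on a hypothetical two-row dataframe, `toy`, rather than the real lyrics file; the column names mirror the ones used in this practical.

```python
import string

import pandas as pd

# Hypothetical stand-in for the lyrics dataframe.
toy = pd.DataFrame({"Lyrics": ["Hello,\nWorld!", "I'm here;\nare you?"]})

# Translation table that deletes every punctuation character.
table = str.maketrans("", "", string.punctuation)

toy["Preprocessed Lyrics"] = (
    toy["Lyrics"]
    .str.replace("\n", " ", regex=False)  # line breaks become spaces
    .str.translate(table)                 # strip punctuation
    .str.lower()                          # lowercase everything
)
print(toy)
```

The order matters: replacing line breaks first keeps words on adjacent lines from being glued together.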


15\. List the 20 most frequent terms in this dataframe and plot a word cloud.

16\. Calculate the average number of words per sentence in the data set. Can you also compute this ratio for each document?

17\. Plot word clouds.

18\. Remove stop words and calculate the words-per-sentence ratio again.

19\. Find all the context-related words in your data set. Let's define a word as context-related only if it appears in more than 90 percent of the documents. Plot a word cloud of these words in your data set.

20\. Modify the list of stop words by adding the context-related words. Remove the modified list of stop words from your data and analyze the result.

21\. What percentage of documents in the data set are not written in English?

22\. Find all the dates and create a dataframe of sentences versus dates, if any.
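As a possible starting point for the questions on frequent terms and context-related words, the sketch below counts term frequency and document frequency with `collections.Counter`. The mini-corpus `docs` is hypothetical and stands in for the preprocessed lyrics column; the 90 percent threshold follows the definition given above.

```python
from collections import Counter

# Hypothetical mini-corpus; each string plays the role of one preprocessed document.
docs = [
    "the sun is up and the sky is blue",
    "the rain came down all day",
    "the wind and the rain",
]

tokenized = [doc.split() for doc in docs]

# Term frequency: count every occurrence across all documents.
term_counts = Counter(token for tokens in tokenized for token in tokens)
top_terms = term_counts.most_common(3)

# Document frequency: count each word at most once per document.
doc_counts = Counter(token for tokens in tokenized for token in set(tokens))

# A word is context-related if it appears in more than 90 percent of documents.
threshold = 0.9 * len(docs)
context_words = sorted(w for w, n in doc_counts.items() if n > threshold)

print(top_terms)
print(context_words)
```

For the real data set, `most_common(20)` would give the 20 most frequent terms, and `context_words` could be appended to a stop word list.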

