## Typical Pre-processing for Text Data
1. **Tokenization**: split a text into individual tokens (words and punctuation marks).
2. **Normalization**: bring the text into a consistent form, e.g., convert it to lowercase, correct spelling mistakes, etc.
3. **Cleaning**: remove unwanted parts, e.g., punctuation, stop words, etc.
4. **Lemmatization**/**stemming**: convert each word to its corresponding 'root word'. Lemmatization and stemming are not the same; Question 01 below summarizes the difference.
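
As a rough sketch of how these steps chain together, the function below applies steps 1 to 3 using only the Python standard library. The function name `preprocess`, the whitespace-based tokenization, and the tiny `STOP_WORDS` set are illustrative assumptions, not part of NLTK; the cells that follow use NLTK's tokenizer and stop-word list instead, and step 4 is left out here.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "is", "that", "he", "are", "there", "not"}

def preprocess(text):
    tokens = text.split()                                  # 1. tokenization (whitespace only)
    tokens = [t.lower() for t in tokens]                   # 2. normalization
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]          # 3a. strip punctuation characters
    tokens = [t for t in tokens if t.isalpha()]            # 3b. drop empty / non-alphabetic tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 3c. drop stop words
    return tokens

result = preprocess("He says that the film is really fantastic #brilliant.")
print(result)  # ['says', 'film', 'really', 'fantastic', 'brilliant']
```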

## Tokenization

In [2]:
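import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').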

text1 = ("After watching two hours non stop, "
         "he says that the film is really fantastic #brilliant.")
text2 = ("Foods sold there are little bit pricy, "
         "meanwhile the taste is not delicious #notrecommended.")

tokens1 = word_tokenize(text1)
print("tokens1:\n", tokens1)

tokens2 = word_tokenize(text2)
print("tokens2:\n", tokens2)

tokens1:
 ['After', 'watching', 'two', 'hours', 'non', 'stop', ',', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '#', 'brilliant', '.']
tokens2:
 ['Foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', ',', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '#', 'notrecommended', '.']


## Normalization
In this block of code, we try one normalization step: converting the tokens to lowercase.

In [3]:
normalized_words1 = [w.lower() for w in tokens1]
print("normalised_words1:\n", normalized_words1)

normalized_words2 = [w.lower() for w in tokens2]
print("normalised_words2:\n", normalized_words2)

normalised_words1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', ',', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '#', 'brilliant', '.']
normalised_words2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', ',', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '#', 'notrecommended', '.']


## Cleaning 01: remove punctuation


In [8]:
import string

table = str.maketrans('', '', string.punctuation)
punc_removed1 = [w.translate(table) for w in normalized_words1]
print("punc_removed1:\n", punc_removed1)

punc_removed2 = [w.translate(table) for w in normalized_words2]
print("punc_removed2:\n", punc_removed2)

punc_removed1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', '', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '', 'brilliant', '']
punc_removed2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', '', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '', 'notrecommended', '']


## Cleaning 02: remove non-alphabetic tokens

In [10]:
isalpha_words1 = [word for word in punc_removed1 if word.isalpha()]
print("isalpha_words1:\n", isalpha_words1)

isalpha_words2 = [word for word in punc_removed2 if word.isalpha()]
print("isalpha_words2:\n", isalpha_words2)

isalpha_words1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', 'brilliant']
isalpha_words2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', 'notrecommended']


## Cleaning 03: remove stop words

In [18]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# print("stop_words:\n", stop_words)

stopWords_removed1 = [w for w in isalpha_words1 if w not in stop_words]
print("stopWords_removed1:\n", stopWords_removed1)

stopWords_removed2 = [w for w in isalpha_words2 if w not in stop_words]
print("stopWords_removed2:\n", stopWords_removed2)

stopWords_removed1:
 ['watching', 'two', 'hours', 'non', 'stop', 'says', 'film', 'really', 'fantastic', 'brilliant']
stopWords_removed2:
 ['foods', 'sold', 'little', 'bit', 'pricy', 'meanwhile', 'taste', 'delicious', 'notrecommended']


## Stemming

In [19]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

stemmed_word1 = [ps.stem(w) for w in stopWords_removed1]
print("stemmed_word1:\n", stemmed_word1)

stemmed_word2 = [ps.stem(w) for w in stopWords_removed2]
print("stemmed_word2:\n", stemmed_word2)

stemmed_word1:
 ['watch', 'two', 'hour', 'non', 'stop', 'say', 'film', 'realli', 'fantast', 'brilliant']
stemmed_word2:
 ['food', 'sold', 'littl', 'bit', 'prici', 'meanwhil', 'tast', 'delici', 'notrecommend']
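
To make the effect of stemming easier to see, here is a small sketch that runs the same `PorterStemmer` on several inflections of one verb. The word list is a made-up illustration; only `PorterStemmer` and its `stem` method come from NLTK.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Different inflections of the same verb are cut down to one shared stem.
forms = ["watching", "watched", "watches"]
stems = [ps.stem(w) for w in forms]
print(stems)

# The stem does not have to be a dictionary word.
odd_stem = ps.stem("fantastic")
print(odd_stem)
```

Because the stemmer works by rule rather than by dictionary lookup, it needs no extra NLTK data downloads.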


## Lemmatization

In [21]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

lemmatized_words1 = [lemmatizer.lemmatize(w) for w in stopWords_removed1]
print("lemmatized_words1:\n", lemmatized_words1)

lemmatized_words2 = [lemmatizer.lemmatize(w) for w in stopWords_removed2]
print("lemmatized_words2:\n", lemmatized_words2)

lemmatized_words1:
 ['watching', 'two', 'hour', 'non', 'stop', 'say', 'film', 'really', 'fantastic', 'brilliant']
lemmatized_words2:
 ['food', 'sold', 'little', 'bit', 'pricy', 'meanwhile', 'taste', 'delicious', 'notrecommended']


# Example of Converting Preprocessed Text into Numerical Features

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# merge two texts into one list
two_preprocessed_text = [lemmatized_words1, lemmatized_words2]

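# define tfidf vectorizer
# The documents are already lists of tokens, so this identity function
# stands in for both the tokenizer and the preprocessor.
def dummy(doc):
    return doc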

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)

# train
model = tfidf.fit(two_preprocessed_text)
# transform to numerical features using the trained model
numerical_features = model.transform(two_preprocessed_text).toarray()

print("numerical_features of text1:\n", numerical_features[0],
      "; shape:", numerical_features[0].shape)
print("numerical_features of text2:\n", numerical_features[1],
      "; shape:", numerical_features[1].shape)

numerical_features of text1:
 [ 0.          0.31622777  0.          0.31622777  0.31622777  0.
  0.31622777  0.          0.          0.31622777  0.          0.
  0.31622777  0.31622777  0.          0.31622777  0.          0.31622777
  0.31622777] ; shape: (19,)
numerical_features of text2:
 [ 0.33333333  0.          0.33333333  0.          0.          0.33333333
  0.          0.33333333  0.33333333  0.          0.33333333  0.33333333
  0.          0.          0.33333333  0.          0.33333333  0.          0.        ] ; shape: (19,)
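
The values above can be checked by hand. The sketch below assumes `TfidfVectorizer`'s default settings, in particular L2 normalization of each row. The two token lists share no terms and every term occurs exactly once, so all non-zero entries in a row carry the same weight before normalization, and after normalization each one equals 1 / sqrt(number of distinct terms in that document).

```python
import math

doc1 = ['watching', 'two', 'hour', 'non', 'stop', 'say', 'film',
        'really', 'fantastic', 'brilliant']
doc2 = ['food', 'sold', 'little', 'bit', 'pricy', 'meanwhile',
        'taste', 'delicious', 'notrecommended']

# The documents have no term in common, so the vocabulary has 10 + 9 entries.
shared_terms = set(doc1) & set(doc2)
vocabulary_size = len(set(doc1) | set(doc2))
print(shared_terms, vocabulary_size)  # set() 19

# A unit-length row with n equal non-zero entries has entries of 1 / sqrt(n).
w1 = 1 / math.sqrt(len(set(doc1)))
w2 = 1 / math.sqrt(len(set(doc2)))
print(round(w1, 8), round(w2, 8))  # 0.31622777 0.33333333
```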


## Question 01 (Q01)
What is/are the difference(s) between stemming and lemmatization?

**Answer:**<br>
Stemming reduces a word to a root form that does not need to be a valid dictionary word. It applies heuristic rules that chop off common word endings, so that related word forms are cut down to the same stem (e.g., 'really' becomes 'realli' in the output above).

Lemmatization, in contrast, reduces a word to its dictionary form (lemma), which must be a valid dictionary word. It uses lexical knowledge, such as a vocabulary and the word's part of speech, to find that base form.

## Question 02 (Q02)
Please explain what TF-IDF is! <br>
***Note***: (i) you can insert pictures (if you want) in the answer, and then upload all the materials (this ipynb file and the pictures) as one zip file to the course portal, (ii) you can also use mathematical equations here, for example: you can write $log_{2}(P_{i})$ by using `$log_{2}(P_{i})$`.

**Answer:**<br>
**TF-IDF** (term frequency-inverse document frequency) is a method for scoring how relevant a term is to a document, based on how often the term occurs in that document and how rare the term is across the whole collection of documents.

**TF** (Term Frequency) measures how frequently a term occurs in a document. A word that occurs many times in a document has a high term frequency.

$TF(t) = \frac{\text{Number of times term } t \text{ appears in the document}}{\text{Total number of terms in the document}}$

**IDF** (Inverse Document Frequency) measures how important a term is. Not every word with a high term frequency is important (e.g., "is", "of"), so IDF weighs down terms that appear in many documents and scales up terms that appear in only a few.

$IDF(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with the term } t \text{ in it}}\right)$

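For a term $i$ in document $j$, the TF-IDF weight is the product of the two:

$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df(i)}\right)$

where $N$ is the total number of documents and $df(i)$ is the number of documents that contain term $i$.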
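
The formulas can be tried out with a few lines of plain Python. This is a minimal sketch of the textbook definition given here, assuming the natural logarithm; the helper names `tf`, `idf`, `tf_idf` and the two toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes `term` occurs in at least one document (otherwise division by zero).
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus of two already-tokenized documents.
docs = [["film", "really", "fantastic", "film"],
        ["food", "really", "pricy"]]

w_film = tf_idf("film", docs[0], docs)      # tf = 2/4, idf = ln(2/1)
w_really = tf_idf("really", docs[0], docs)  # tf = 1/4, idf = ln(2/2) = 0
print(w_film)    # 0.5 * ln(2), about 0.3466
print(w_really)  # 0.0
```

"really" appears in both documents, so its IDF is zero and it contributes nothing, while "film" is both frequent in the first document and absent from the second, so it receives the highest weight.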

## (Bonus) Question 03 (Q03)
What are other methods that can be used to convert "preprocessed text" to "numerical features" other than TF-IDF?

**Answer:**<br>
**Bag of Words**<br>
In the bag-of-words model, a text is represented as the bag (multiset) of its words: the representation records how many times each word occurs in the text and ignores the order of the words.

Example: the text
```
    John likes to watch movies. Mary likes movies too. John also likes to watch football games.
```

can be represented as:

```
    BoW = {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1};
```

With scikit-learn's `CountVectorizer` class, we can convert a collection of text documents into a matrix of word counts.
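
As a minimal sketch, the example text can be passed to `CountVectorizer` as a single document. The variable names are illustrative, and default settings are assumed, which lowercase the text, so "John" is counted as "john".

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ("John likes to watch movies. Mary likes movies too. "
        "John also likes to watch football games.")

vectorizer = CountVectorizer()               # default settings lowercase the text
counts = vectorizer.fit_transform([text])    # sparse matrix, one row per document

# Map each vocabulary term back to its count in the single document.
bow = {term: int(counts[0, idx]) for term, idx in vectorizer.vocabulary_.items()}
print(bow["john"], bow["likes"], bow["movies"])  # 2 3 2
```

With several documents in the input list, each row of `counts` holds the word counts of one document over the shared vocabulary.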