Date: 03/09/2021

Version: 3.0

Environment: Python 3.8.3 and Anaconda 6.0.3

Operating System: macOS Big Sur (Version 11.5.1)

Libraries used:
* nltk (Natural Language Toolkit)
* re (for regular expressions)
* math (for mathematical functions)
* json (json functions)
* nltk.tokenize (for tokenization)
* nltk.probability (for processing probabilistic information, e.g. FreqDist)
* itertools (contains functions for efficient looping)

## Table of Contents
* 1. Introduction
* 2. Import libraries
* 3. Conversion of PDF to text file
* 4. Operations on generated text file
* 5. Tokenization
* 6. Context-independent stopwords removal
* 7. Generate the 200 bigram collocations
* 8. Re-tokenize
* 9. Stemming (Porter) the unigrams
* 10. Removing Context-dependent and Rare Tokens
* 11. Generating sorted corpus vocab
* 12. CountVectorizer
* Conclusion

## 1. Introduction

This task performs text preprocessing on a provided dataset of cryptocurrency-related articles. The dataset is supplied as a PDF, which must first be converted to a suitable text format before it can be preprocessed. Preprocessed data of this kind is used directly in downstream applications such as document summarization, recommender systems, and learning-to-rank methods.

The main goals of this task are to:

1. Generate a sorted corpus vocabulary `.txt` file.
2. For each day, generate the sparse representation (i.e., doc-term matrix) in `.txt` format.


Provided resources/datasets:
* Sample dataset for test run.
* Designated `.pdf` dataset according to student ID.
* `stopwords_en.txt` file of English stopwords

More details for each task will be given in the following sections.

## 2. Import libraries

In [1]:
import nltk
import re
import math
import json 

from nltk.tokenize import RegexpTokenizer
from nltk.probability import *
from nltk.util import ngrams
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer

from itertools import chain

## 3. Conversion of PDF to text file

Preprocessing requires the data in plain-text form. As we are given a `.pdf` file, we convert it to `.txt` format using `pdfminer`.

In [2]:
!pip install pdfminer



In [3]:
!pdf2txt.py -o text_31072100.txt '31072100_task2_pdf.pdf'
print('Conversion completed!')

Conversion completed!


## 4. Operations on generated text file

In [4]:
# Reading the produced .txt file
 
pdfTxtFile = 'text_31072100.txt'

with open(pdfTxtFile) as pdf_txt:
    txt_content = pdf_txt.read()
#     print(repr(txt_content))

To extract only the date and content of each article, i.e. to discard the titles and separate the articles from one another, we use a regular expression and apply `findall()` to get tuples of `(date, content)`.

In [5]:
# Raw strings keep the regex escapes (\[, \d, \n) intact.
re_article = re.compile(
    r"(?s)"                                                         # '.' also matches newlines
    r"(\[\d{4}[-|/](?:0?[1-9]|1[0-2])[-|/](?:[0-3]?[0-9])\]?)"      # group 1: date header
    r"(?:.*?\n\n)"                                                  # skip the title up to the first blank line
    r"(.*?)"                                                        # group 2: article content
    r"(?=\[\d{4}[-|/](?:0?[1-9]|1[0-2])[-|/](?:[0-3]?[0-9])\]?|$)"  # stop before the next date header or at the end
)



a_list = re_article.findall(txt_content)
# a_list

In [6]:
len(a_list)

528

> We get `528` articles.
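As a rough illustration of how this kind of pattern splits the text, the sketch below runs a simplified version of it on a made-up two-article sample. The sample text, the variable names, and the simplified date pattern (dates always written as `[YYYY-MM-DD]`) are assumptions for illustration only, not the actual dataset or the exact expression used above.

```python
import re

# Hypothetical sample mimicking the layout: [date] title, blank line, body.
sample = (
    "[2021-07-09] First title\n\n"
    "Body of the first article.\n"
    "[2021-07-10] Second title\n\n"
    "Body of the second article.\n"
)

# Simplified date pattern (assumption: dates always look like [YYYY-MM-DD]).
date = r"\[\d{4}-\d{2}-\d{2}\]"

# (?s) lets '.' match newlines; the lookahead stops each body at the
# next date header or at the end of the text.
pattern = re.compile(rf"(?s)({date}).*?\n\n(.*?)(?={date}|$)")

pairs = [(d, body.strip()) for d, body in pattern.findall(sample)]
print(pairs)
```

Each tuple holds the date header and the body, with the title line skipped by the non-capturing part of the pattern.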

However, many articles may share the same date. In that case we need to concatenate them, which can be done with a dictionary and the following code snippet.

In [7]:
# Concatenating articles with same date

articles_dict = {}

for element in a_list:
        
    if element[0] in articles_dict:
        print(element[0])
        
        articles_dict.update({
            element[0]: articles_dict[element[0]] + element[1]
        })
        
        print('*'*20)
        print(articles_dict[element[0]])
    else:
        articles_dict[element[0]] = element[1]

[2018-07-09]
********************
The problem is not when people in power start requiring very strict regulations on bitcoins, but the
enforcement of those regulations. They plainly do not have the manpower and the technical
capability to do it.Also, the people in power should first stop the money laundering ring that involves
some of the biggest banks in the world before attacking bitcoin. Or maybe the big banks' monopoly
on money laundering is threatened by bitcoin? heheheThree of the world’s most respected
economists have led a joint attack on bitcoin, claiming the digital currency will be “regulated into
oblivion” as governments globally move to clamp down on money laundering.  Joseph Stiglitz,
Nouriel Roubini and Kenneth Rogoff have renewed their assault on the cryptocurrency believing it
will be subject to further sharp and damaging falls as authorities crack down on criminals using
bitcoin to launder money and avoid taxes.Stiglitz, the Nobel Prize-winning economist, told Financi

In [8]:
len(articles_dict)

415

> After concatenation, the number of articles with a unique date came down to `415`.
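The same grouping idea can be sketched more compactly with `collections.defaultdict`. This is only an alternative illustration on made-up `(date, content)` tuples; the names `pairs` and `merged` are hypothetical and stand in for `a_list` and `articles_dict`.

```python
from collections import defaultdict

# Hypothetical (date, content) tuples standing in for a_list.
pairs = [
    ("[2021-07-09]", "First article. "),
    ("[2021-07-10]", "Another day. "),
    ("[2021-07-09]", "Second article on the same day."),
]

merged = defaultdict(str)
for date, content in pairs:
    merged[date] += content  # a missing key starts as the empty string

merged = dict(merged)
print(len(pairs), "articles ->", len(merged), "dates")
```

Because dictionaries preserve insertion order, each date keeps the position of its first article.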

Inspecting the corpus showed that some words have a `\n` inside them, e.g. `bitcoin\n-unity` and `smart\n-contracts`. This happens mostly in URLs, which continue onto the next line in the original file. We need to handle such cases because they can lead to incorrect or misleading bigrams in later steps. To do so, we use the `json` library's `dumps()` to serialise the dictionary to a string, replace each `\n-` with `-`, and then use `loads()` to turn the string back into a dictionary.

In other words, `dumps()` converts the whole corpus to a single `str` so that we can apply string operations such as `replace()` to it.

In [9]:
urls_correction = json.dumps(articles_dict).replace('\\n-','-')

articles_dict = json.loads(urls_correction)
articles_dict

{'[2021-07-09]': 'If there are Russian users reading this, it might be good to begin learning about Monero and other\nanonymous coins. You cannot risk your wallets being tainted with dirty coins if it is not tainted\nalready. Also, I know there might be some of you would who will use the argument that there was\ndata from Chainalysis that showed only 1% of bitcoin transactions were criminal. However, you\ncannot trust the Russian government not to frame you or taint your wallet themselves. This is okay if\nyour wallet holds only a small amount. What if you are a holder of $1 million in bitcoin? You might\nnot be 100% safe.Russian Prosecutor General Igor Krasnov has revealed that legislative\namendments are being prepared on the confiscation of crypto assets. “A serious challenge is the\ncriminal use of cryptocurrencies in our country,” he said.Russia is preparing amendments to the\ncurrent legislation to allow for the confiscation of crypto assets found to be proceeds from crime,\nTass

In [10]:
len(articles_dict) # Rechecking that we haven't lost any info

415
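The reason the replacement targets `'\\n-'` is that `json.dumps()` escapes a newline as the two characters backslash and `n`. The sketch below shows this on a made-up one-article dictionary (the URL and names are hypothetical) and compares the round trip with a direct per-article `replace()`, which should give the same result for ordinary text.

```python
import json

# Hypothetical one-article dictionary with a URL broken across two lines.
articles = {"[2021-07-09]": "see https://example.com/smart\n-contracts for details\n"}

serialised = json.dumps(articles)
# In the JSON string the newline appears as the two characters '\' and 'n'.
assert "\\n-" in serialised

fixed = json.loads(serialised.replace("\\n-", "-"))
print(repr(fixed["[2021-07-09]"]))

# The same repair without the JSON round trip.
direct = {date: text.replace("\n-", "-") for date, text in articles.items()}
```

Other newlines, such as the one at the end of the article, are left untouched by both variants.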

## 5. Tokenization

Tokenization is the process of breaking a given piece of text into a list of sentences or words. Breaking a paragraph into a list of sentences is called sentence tokenization; breaking the sentences further into a list of words is called word tokenization.

Here we perform word tokenization using the regular expression provided in the assignment specification:
<h4><center>r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"</center></h4>

In [11]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

In [12]:
def tokenizeRawArticleData(date):
    """
     function to tokenize the article paragraph
     :param date: date of the article
    """
    raw_article = articles_dict[date].lower()            # lowercasing the whole article text
    tokenised_article = tokenizer.tokenize(raw_article)  # applying tokenizer
    return (date, tokenised_article)

In [13]:
# Applying the function to whole corpus of articles

tokenized_corpus = dict(tokenizeRawArticleData(date) for date in articles_dict)
# tokenized_corpus
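To get a feel for what this pattern keeps and drops, the sketch below applies the same regular expression with the standard library's `re.findall`, which should behave like `RegexpTokenizer` in its default matching mode. The sample sentence is made up for illustration.

```python
import re

# Same pattern as the tokenizer above, applied with the standard library.
token_re = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

sample = "Bitcoin's state-of-the-art smart-contracts don't cost $100."
tokens = token_re.findall(sample.lower())
print(tokens)
# ["bitcoin's", 'state-of', 'the-art', 'smart-contracts', "don't", 'cost']
```

The pattern allows at most one internal hyphen or apostrophe, so a longer chain such as `state-of-the-art` is split into two tokens, and digits and currency symbols are dropped entirely.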

### Corpus Stats Helper Function

In [14]:
def stats_corpus(tokenized_corpus):
    words = list(chain.from_iterable(tokenized_corpus.values()))
    vocab = set(words)
    lexical_diversity = len(words)/len(vocab)
    print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
    "\nLexical diversity: ", lexical_diversity)

In [15]:
stats_corpus(tokenized_corpus)

Vocabulary size:  11408 
Total number of tokens:  93858 
Lexical diversity:  8.227384291725105


In [16]:
len(tokenized_corpus.values())  # Re-assuring

415

## 6. Context-independent stopwords removal

Stop words are words that occur very commonly, such as 'a', 'an', 'the', 'at', etc. We remove them during preprocessing because they carry little information and only take up additional space.

There can be two kinds of stopwords:

1. Context-independent 
2. Context-dependent *(later)*

We will be using the custom stopwords file `stopwords_en.txt` provided to remove the context-independent stopwords from the corpus.

In [17]:
# reading the stopwords file into a list where each element will be a stopword

with open('Assessment 1/Task2 datasets/stopwords_en.txt') as stop_words_file:
    stop_words = stop_words_file.read().splitlines()
    
print(type(stop_words))
print(len(stop_words))

<class 'list'>
571


> There are `571` stopwords in the provided custom file.

We convert the list of stopwords to a set using `set()`.

**Why convert to a set?**

A Python set is backed by a hash table, so checking whether a token is in the set takes constant time on average, whereas the same check on a list scans the elements one by one. A set requires hashable items and discards both item order and duplicates, neither of which matters for a stopword lookup. We therefore convert the stopword list into a set and then filter every article against it. A list remains the better choice when order or duplicates must be preserved.

In [18]:
stop_words_set = set(stop_words)

for date_doc in tokenized_corpus:
    tokenized_corpus[date_doc] = [ w for w in tokenized_corpus[date_doc] if w not in stop_words_set ] 

In [19]:
stats_corpus(tokenized_corpus)

Vocabulary size:  10923 
Total number of tokens:  48748 
Lexical diversity:  4.462876499130275


## 7. Generate the 200 bigram collocations

Besides the unigrams we have worked with so far, n-grams are also used extensively in text analysis tasks. An n-gram is a contiguous sequence of `n` words from a given text. To compute n-grams, you typically slide a window of `n` words forward over the text, one word at a time.

**What are n-grams used for?** They can be used to build n-gram language models, which in turn support speech recognition, spelling correction, entity detection, etc. In text mining, n-grams are used to develop features for classification algorithms such as SVMs, MaxEnt models, and Naive Bayes. The idea is to expand the unigram feature space with n-grams.
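The sliding-window idea can be sketched in a few lines of plain Python. The helper name `sliding_ngrams` and the token list are hypothetical, used only to illustrate the windowing.

```python
def sliding_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["bitcoin", "price", "hits", "record", "high"]
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of `k` tokens yields `k - n + 1` n-grams, so the five tokens here give four bigrams and three trigrams.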

The first step is to concatenate all the tokenized articles using the `chain.from_iterable` function. Wrapping the result in `list()` gives a single flat list of all the words in the corpus.

In [20]:
words_tcorpus = list(chain.from_iterable(tokenized_corpus.values()))
len(words_tcorpus)

48748

Extracting a list of n-grams from a text is easily accomplished with the function `ngrams()`:

In [21]:
bi_grams = ngrams(words_tcorpus, n=2)
fd_bigram = FreqDist(bi_grams)

Collocations are expressions of multiple words that commonly co-occur. 

For example, to extract bigram collocations, we can first extract the bigrams and then rank them by some association measure to find the ones that co-occur most strongly. A commonly used measure is Pointwise Mutual Information (PMI). The following code finds the best 200 bigrams by PMI score.

In [22]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words_tcorpus)

In [23]:
top_200_bigrams= finder.nbest(bigram_measures.pmi, 200)
top_200_bigrams

[('a-blockbuster', 'year-for'),
 ('a-bullish', 'sign-returns'),
 ('a-kind', 'downloadable'),
 ('a-peek', 'into-how'),
 ('abandoned', 'smelter'),
 ('abolish-cash', 'in-favor'),
 ('accenture-creates', 'blockchain-editing'),
 ('accepting-bitcoin', 'in-swi'),
 ('access-post', 'fork-token'),
 ('achievements-and', 'aims-for'),
 ('achilles', 'tendon'),
 ('acqu', 'ired-circle'),
 ('acquisitions-marketplace', 'unveile'),
 ('action-against', 'poloniex-if'),
 ('add-ons', 'aesthetic'),
 ('adding-nort', 'h-americas'),
 ('adds-dynamic', 'bitcoin-mode'),
 ('adept', 'assigning'),
 ('adjustment', 'two-weeks'),
 ('adoption-of', 'blockchain-technologies'),
 ('adornohttps', 'btctheory'),
 ('advance-bill', 'regulate-bitcoin'),
 ('advance-cost', 'bitcoin-entered'),
 ('adventure', 'saas'),
 ('advertisers-a', 'cut-of'),
 ('advised', 'refrain'),
 ('advisor-offers', 'predictions-about'),
 ('aesthetic', 'gameplay'),
 ('afiyu-i', 'blokchejn'),
 ('aforementioned', 'frozen'),
 ('african-bitcoin', 'trading-platform'

In [24]:
len(top_200_bigrams)

200
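Many of the top-ranked pairs above look like one-off combinations, which is typical of PMI: it compares how often a pair occurs with how often its words occur separately, so a pair of words that each appear only once, and together, scores very high. The sketch below computes PMI by hand on a made-up token list to show this effect. The function `pmi_scores` and the toy corpus are hypothetical, and the formula is the usual `log2(count(x, y) * N / (count(x) * count(y)))`, which may differ in small details from NLTK's implementation.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI for each adjacent pair: log2(count(x,y) * N / (count(x) * count(y)))."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: math.log2(c * n / (word_counts[pair[0]] * word_counts[pair[1]]))
        for pair, c in pair_counts.items()
    }

# Hypothetical toy corpus: "smart contract" recurs, "achilles tendon" occurs once.
tokens = ("smart contract bitcoin price smart contract bitcoin market "
          "achilles tendon bitcoin smart contract").split()
scores = pmi_scores(tokens)

for pair in [("achilles", "tendon"), ("smart", "contract")]:
    print(pair, round(scores[pair], 3))
```

The single occurrence of `achilles tendon` outranks the three occurrences of `smart contract`. A minimum-frequency filter is a common remedy, and NLTK's finder offers `apply_freq_filter` for that purpose, though whether to use it depends on the assignment specification.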

## 8. Re-tokenize

We now have 200 bigrams from the corpus that we do not want split into two individual words, so we retokenize using `MWETokenizer`.

> A ``MWETokenizer`` takes a string which has already been divided into tokens and
retokenizes it, merging multi-word expressions into single tokens, using a lexicon
of MWEs: [Source](https://www.nltk.org/_modules/nltk/tokenize/mwe.html)

In [25]:
mwe_tokenizer = MWETokenizer(top_200_bigrams)

collocated_articles = dict((date, mwe_tokenizer.tokenize(article)) for date,article in tokenized_corpus.items())

all_collocated_words = list(chain.from_iterable(collocated_articles.values()))
collocated_vocab = list(set(all_collocated_words))

In [26]:
print(len(all_collocated_words))
print(len(collocated_vocab))

48555
10730


> * `48555` words in our `collocated_articles` corpus
* `10730` unique words in our `collocated_articles` corpus
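As a small illustration of the merge, the sketch below runs `MWETokenizer` on a made-up token list with a two-entry lexicon (both hypothetical; the real lexicon is `top_200_bigrams`). With the default settings, the merged words are joined by `_`.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical bigram lexicon; the real one is top_200_bigrams.
mwe = MWETokenizer([("smart", "contracts"), ("achilles", "tendon")])

tokens = ["smart", "contracts", "are", "an", "achilles", "tendon", "of", "smart", "money"]
merged = mwe.tokenize(tokens)
print(merged)
```

Only exact adjacent matches are merged, so the lone `smart` before `money` stays a unigram. Each merge reduces the token count by one, which is why the corpus shrinks slightly after retokenizing.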

## 9. Stemming (Porter) the unigrams

Stemming means mapping a group of words to the same stem by removing prefixes or suffixes without giving any value to the “grammatical meaning” of the stem formed after the process.

e.g.

computation --> comput

computer --> comput 

hobbies --> hobbi

We can see that stemming tries to bring the word back to their base word but the base word may or may not have correct grammatical meanings.

We will be using `PorterStemmer` (as per the specification) from the NLTK package. The Porter stemmer is one of the oldest and most widely used stemming algorithms.

### **Important**

* Since we want the bigrams in the corpus to remain intact, we don't apply stemming to them. The following code stems only the unigrams: if a token contains `_` (the separator `MWETokenizer` inserts when merging a bigram), we keep it as it is.

```python
[ ... w if '_' in w else stemmer.stem(w) ... ] 
```

* We need to discard words with length smaller than 3, so keep only the words having `length >= 3` and this is applied to whole corpus (unigrams or bigrams)
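The two rules above can be sketched on a handful of made-up tokens (the token list is hypothetical; the stems for the three example words match those listed earlier in this section).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens: three unigrams, one short token, one merged bigram.
tokens = ["computation", "computer", "hobbies", "is", "smart_contracts"]

# Stem unigrams only; tokens containing '_' are bigrams and stay unchanged.
stemmed = [t if "_" in t else stemmer.stem(t) for t in tokens]

# Drop tokens shorter than three characters.
kept = [t for t in stemmed if len(t) >= 3]
print(kept)
```

Note that the length filter runs after stemming, so it applies to the stems rather than to the original words.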

In [27]:
stemmer = PorterStemmer()

uni_bigram_articles = {}

for date_article in collocated_articles:
    uni_bigram_articles[date_article] = [ w if '_' in w else stemmer.stem(w) for w in collocated_articles[date_article] ]
    
    # Removing words with < 3 length
    uni_bigram_articles[date_article] = [ w for w in uni_bigram_articles[date_article] if len(w) >= 3]

In [28]:
uni_bigram_articles

{'[2021-07-09]': ['russian',
  'user',
  'read',
  'good',
  'begin',
  'learn',
  'monero',
  'anonym',
  'coin',
  'risk',
  'wallet',
  'taint',
  'dirti',
  'coin',
  'taint',
  'argument',
  'data',
  'chainalysi',
  'show',
  'bitcoin',
  'transact',
  'crimin',
  'trust',
  'russian',
  'govern',
  'frame',
  'taint',
  'wallet',
  'wallet',
  'hold',
  'small',
  'amount',
  'holder',
  'million',
  'bitcoin',
  'safe',
  'russian',
  'prosecutor',
  'gener',
  'igor',
  'krasnov',
  'reveal',
  'legisl',
  'amend',
  'prepar',
  'confisc',
  'crypto',
  'asset',
  'challeng',
  'crimin',
  'cryptocurr',
  'countri',
  'russia',
  'prepar',
  'amend',
  'current',
  'legisl',
  'confisc',
  'crypto',
  'asset',
  'found',
  'proce',
  'crime',
  'tass',
  'report',
  'wednesday',
  'krasnov',
  'virtual',
  'asset',
  'sourc',
  'incom',
  'crimin',
  'emphas',
  'cryptocurr',
  'corrupt',
  'includ',
  'briberi',
  'latenc',
  'crimin',
  'act',
  'recent',
  'aggrav',
  'cryp

In [29]:
stats_corpus(uni_bigram_articles)

Vocabulary size:  8050 
Total number of tokens:  48161 
Lexical diversity:  5.9827329192546586


## 10. Removing Context-dependent and Rare Tokens

To remove context-dependent stopwords and rare tokens, we follow these rules:
* Context-dependent stopwords: tokens that appear in more than `ceil(Number_of_days / 2)` days must be removed from the vocab.
* Rare tokens: tokens that appear in fewer than `10` days must be removed from the vocab.

From the rules, it is clear that we need to take into account the **document frequency** for both scenarios.

> Document frequency is the number of documents containing a particular term (regardless of how many times it occurs in the same document)

Our approach is to collect both the context-dependent stopwords and the rare tokens in a single set, `final_remove_tokens`, and remove them from the corpus in the same way we removed the stopwords.

In [30]:
df_words = list(chain.from_iterable([set(value) for value in uni_bigram_articles.values()]))
df_ub_grams = FreqDist(df_words)
df_ub_grams

FreqDist({'http': 381, 'bitcoin': 327, 'news': 204, 'cryptocurr': 180, 'www': 172, 'read': 154, 'blockchain': 147, 'currenc': 140, 'market': 136, 'time': 130, ...})

**IMPORTANT**: It is quite possible for a bigram to occur in fewer than `10` documents, in which case it would be treated as a rare token. This is not what we want, so we apply the rare-token condition only to unigrams and not to bigrams, using the following check in the code below:

```python
...
if token_docfreq < 10  and '_' not in token:
    ...
```
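The two thresholds and the bigram exemption can be sketched on a toy corpus. Everything in this example is made up for illustration: the documents, the names, and in particular the rare-token threshold of `2`, which replaces the real value of `10` so that the effect is visible on four documents.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: one token list per day.
docs = {
    "d1": ["bitcoin", "bitcoin", "price"],
    "d2": ["bitcoin", "market"],
    "d3": ["bitcoin", "price", "bank"],
    "d4": ["bank", "smart_contracts"],
}

# set() ensures a token is counted once per document, however often it repeats.
doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))

too_common = math.ceil(len(docs) / 2)  # remove if document frequency is above this
too_rare = 2                           # toy threshold; the task uses 10

remove = {
    tok for tok, df in doc_freq.items()
    if df > too_common or (df < too_rare and "_" not in tok)
}
print(sorted(remove))
```

Here `bitcoin` appears in three of four days and is removed as a context-dependent stopword, `market` appears in one day and is removed as rare, and `smart_contracts` also appears in only one day but is kept because it is a bigram.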

In [31]:
context_dep_stopwords = []
rare_tokens = []

number_of_days = len(uni_bigram_articles)

for token, token_docfreq in df_ub_grams.items():
    
    # if the token's doc-freq is greater than the threshold, append it to the list
    if token_docfreq > math.ceil(number_of_days/2):
        context_dep_stopwords.append(token)
        
    # if the token's doc-freq is smaller than 10 and the token is not a bigram, append it to the list
    if token_docfreq < 10  and '_' not in token:
        rare_tokens.append(token)

# a set uses hashing, so membership checks during removal are faster    
final_remove_tokens = set(context_dep_stopwords + rare_tokens)   
len(final_remove_tokens)

7097

> There are `7097` tokens (includes both context-dependent stopwords and rare tokens) we need to remove 

In [32]:
for article_date in uni_bigram_articles:
    uni_bigram_articles[article_date] = [ w for w in uni_bigram_articles[article_date] if w not in final_remove_tokens ] 

In [33]:
stats_corpus(uni_bigram_articles)

Vocabulary size:  953 
Total number of tokens:  29773 
Lexical diversity:  31.24134312696747


## 11. Generating sorted corpus vocab

This step generates a file in the specified format, containing each word/token in the corpus with its respective term frequency (TF).

In [34]:
uni_bigram_words = list(chain.from_iterable([value for value in uni_bigram_articles.values()]))
tokens_freq_dict = FreqDist(uni_bigram_words)
# len(set(uni_bigram_words))

In [35]:
tokens_freq_dict

FreqDist({'cryptocurr': 594, 'blockchain': 381, 'market': 375, 'exchang': 334, 'currenc': 323, 'bank': 310, 'price': 308, 'trade': 276, 'news': 266, 'digit': 244, ...})

In [36]:
final_vocab_dict = dict(tokens_freq_dict)

In [37]:
len(final_vocab_dict)

953

In [38]:
sorted_vocab = sorted(final_vocab_dict.items())

# the context manager closes the file even if a write fails
with open('31072100_vocab.txt', "w") as vocab_output_file:
    for word, freq in sorted_vocab:
        vocab_output_file.write(str(word) + ':' + str(freq) + '\n')

## 12. CountVectorizer

In [39]:
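from sklearn.feature_extraction.text import CountVectorizer
# token_pattern=r"\S+" keeps each whitespace-separated token whole; the default
# pattern would split hyphenated tokens such as 'add-ons' and miss their counts
vectorizer = CountVectorizer(analyzer = "word", vocabulary=set(uni_bigram_words), token_pattern=r"\S+") 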

In [40]:
data_features = vectorizer.fit_transform([' '.join(value) for value in uni_bigram_articles.values()])
print(data_features.shape)

(415, 953)


In [41]:
# on scikit-learn 1.0 and later, use get_feature_names_out() instead of get_feature_names()
list(set(uni_bigram_words)-set(vectorizer.get_feature_names()))

[]

In [42]:
vocab = sorted(set(uni_bigram_words))
vocab2 = vectorizer.get_feature_names()

In [43]:
dates_articles = list(uni_bigram_articles.keys())

In [44]:
count_vec_file = open('31072100_countVec.txt', "w")  # "w" so a re-run overwrites instead of appending duplicate rows

In [45]:
dense_counts = data_features.toarray()  # convert once instead of once per row

for article_date, row_counts in zip(dates_articles, dense_counts):
    # keep non-zero counts as index:count pairs, with indices starting at 1
    entries = [str(idx) + ':' + str(count) for idx, count in enumerate(row_counts, 1) if count > 0]
    count_vec_file.write(','.join([str(article_date)] + entries) + '\n')

In [46]:
count_vec_file.close()
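Each line of the output file holds a date followed by `index:count` pairs for the terms that occur on that day, where the index is the 1-based position of the term in the sorted vocabulary. The sketch below reproduces that format in plain Python on a made-up vocabulary and two made-up documents. It is only an illustration of the file layout under those assumptions, not a replacement for `CountVectorizer`.

```python
from collections import Counter

vocab = sorted({"bank", "bitcoin", "price"})          # hypothetical vocabulary
index = {term: i for i, term in enumerate(vocab, 1)}  # 1-based, as in the loop above

# Hypothetical preprocessed documents, one token list per date.
docs = {"[2021-07-09]": ["bitcoin", "price", "bitcoin"], "[2021-07-10]": ["bank"]}

lines = []
for date, tokens in docs.items():
    counts = Counter(t for t in tokens if t in index)
    # emit only the terms that occur, ordered by their vocabulary index
    entries = [f"{index[t]}:{counts[t]}" for t in sorted(counts, key=index.get)]
    lines.append(",".join([date] + entries))

print("\n".join(lines))
```

Storing only the non-zero entries keeps the file small, since each day uses a small fraction of the vocabulary.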

# Conclusion

The task was completed, and the two required files, the vocabulary and the count vectors, were created.

# References

* Tut 5 Solutions, FIT5196, Monash University, Link Confidential
* Tut 4 Solutions, FIT5196, Monash University, Link Confidential