# Text processing
#### Creator Name: ZHIYIN WANG

Date: 13/09/2020

Version: 2.0

Environment: Python 3.6.0

Libraries used:
* pandas 0.19.2 (for data frame, included in Anaconda Python 3.6) 
* nltk 3.2.2 (Natural Language Toolkit, included in Anaconda Python 3.6)
* nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
* nltk.tokenize (for tokenization, included in Anaconda Python 3.6)
* nltk.stem (for stemming words, included in Anaconda Python 3.6)
* nltk.util (for finding bigrams, included in Anaconda Python 3.6)
* nltk.probability (for counting token frequency, included in Anaconda Python 3.6)
* sklearn.feature_extraction.text (for creating countvector matrix, included in Anaconda Python 3.6)
* langid (for classifying language of text, not included in Anaconda Python 3.6)
* itertools (for chaining all values in a dictionary together, part of the Python standard library)

## 1. Introduction
This assignment carries out several text processing and analysis tasks on tweets stored in an Excel workbook. The workbook, `31436285.xlsx` (17.7 MB), contains a total of 81 sheets. The required tasks are the following:

1. Generate the corpus vocabulary with the same structure as sample_vocab.txt. Please note that the vocabulary must be sorted alphabetically.

2. For each day (i.e., sheet in your excel file), calculate the top 100 frequent unigrams and top 100 frequent bigrams according to the structure of sample_100uni.txt and sample_100bi.txt. If you have fewer than 100 bigrams for a particular day, just include the top-n bigrams for that day (n<100).

3. Generate the sparse representation (i.e., doc-term matrix) of the excel file according to the structure of sample_countVec.txt.

More details for each task will be given in the following sections.

## 2.  Import libraries 

Install and import the required libraries.

In [1]:
!pip install langid



In [2]:
import pandas as pd
from langid import classify
from itertools import chain
import nltk
from nltk import MWETokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.collocations import *
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.util import ngrams

## 3. Examining and Loading data

First, let's read the Excel file into `excel_data`.

In [3]:
# read the excel data 
excel_data = pd.ExcelFile('31436285.xlsx')
#excel_data = pd.ExcelFile("sample.xlsx")

The Excel file contains many sheets.

Create a dictionary named `data` to hold the text of every sheet.

In [4]:
# create a dictionary to contain data
data = {}        

Before each sheet is read into the `data` dictionary, it needs to be reformatted.

The sheets contain all-NaN rows, and some sheets have incorrect column names.

`dropna()` is applied along axis 0 with `how="all"` to drop every row that contains only NaN values. In sheets with incorrect column names, the columns are renamed from the first remaining row with `rename()`, and that row is then dropped.

After adjusting the rows and column names, the sheet is reindexed so that errors don't occur in the later loops.

Next, only English tweets are kept. The language of each tweet is identified with langid's `classify()` function, and tweets whose result is not "en" need to be dropped. However, dropping rows while looping changes the data frame being iterated and causes issues, so the indices of all non-English tweets are appended to `drop_list` and dropped together afterwards using `drop()`.

Now all the English tweets of the sheet are in the correct format. They are joined into a single string and added to the dictionary under the sheet name.

In [5]:
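# read every sheet in the excel file and drop the empty rows in every sheet
for i in range(len(excel_data.sheet_names)):
    # read excel sheet
    df = excel_data.parse(i)

    # drop rows in which every value is NaN
    # (axis is passed by keyword so the call also works on newer pandas versions)
    df = df.dropna(axis = 0, how = "all")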
    
    # if the column names are not correct
    # rename the column names with the first row and drop the first row 
    if 'text' not in df.columns:
        df.rename(columns = df.iloc[0], inplace = True)
        df = df[1:]
        
    # Reindex rows
    df.index = range(len(df.index))
    
    # create a list to contain the index of rows to drop
    drop_list = []
    
    # check if the text is in english, only keep english tweets
    # append the index of each row that needs to be dropped to the list
    for j in range(len(df.text)):
        if classify(str(df.text[j]))[0] != "en":
            # append index to drop_list
            drop_list.append(j)
    
    # drop the rows where tweets are not in english
    df.drop(drop_list,inplace = True) 
    
    # add the sheet's text to the dictionary, keyed by sheet name
    data[excel_data.sheet_names[i]] = "\n".join(map(str,df.text.values))
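
The collect-then-drop pattern used in the loop above can be illustrated on a plain Python list. This is only a minimal sketch: `looks_english` is a hypothetical stand-in for `langid.classify` (it simply treats pure-ASCII text as English), and the sample tweets are made up.

```python
def looks_english(text):
    # Hypothetical stand-in for langid.classify(text)[0] == "en":
    # treats any pure-ASCII string as English.
    return all(ord(ch) < 128 for ch in text)

tweets = ["stay safe everyone", "今日はいい天気", "wash your hands", "¿cómo estás?"]

# First pass: only record the positions of the rows to drop.
drop_list = [j for j, text in enumerate(tweets) if not looks_english(text)]

# Second pass: remove all recorded rows at once, so no index shifts mid-loop.
drop_set = set(drop_list)
english_tweets = [text for j, text in enumerate(tweets) if j not in drop_set]

print(drop_list)       # [1, 3]
print(english_tweets)  # ['stay safe everyone', 'wash your hands']
```

Separating the "decide" pass from the "delete" pass keeps the positions seen by the loop stable, which is the reason `drop_list` is filled first and `drop()` is called once afterwards.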

## 4. Transform data to tokens

To generate the vocabulary, unigrams, and bigrams, tokens must first be generated from the raw text data.

Create a new dictionary to hold the tokens of each sheet.

In [21]:
# initialise the token dictionary (key = date, value = list of tokens)
token_dic = {}

Create a function to tokenize the text of each sheet.

`RegexpTokenizer` is used here with the regex `[a-zA-Z]+(?:[-'][a-zA-Z]+)?`.
The regular expression matches any run of letters, optionally followed by a single hyphen or apostrophe and another run of letters, so words such as "well-known" and "don't" are kept as one token. Digits and other symbols are never part of a token.

Before tokenizing the tweets, all text is converted to lower case using `lower()` for consistency.

`.tokenize()` then splits the tweets into tokens, and the resulting list is returned.

In [22]:
# define function for creating unigram tokens
def tokenizeRawData(text):
    # regex to extract tokens
    tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
    
    # transform into lower case to keep all words consistent 
    text_lower = text.lower()
    
    # create unigram tokens from the lower-cased text
    all_uni_tokens = tokenizer.tokenize(text_lower)
    
    return all_uni_tokens
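
To see what the pattern keeps and discards, the sketch below applies the same regex with the standard library's `re.findall`, which should yield the same tokens as `RegexpTokenizer` in its default (non-gap) mode. The sample sentence and the helper name `tokenize` are made up for illustration.

```python
import re

# Same pattern as in tokenizeRawData: letters, optionally followed by
# ONE hyphen or apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

def tokenize(text):
    # Lower-case first, then extract every non-overlapping match.
    return TOKEN_PATTERN.findall(text.lower())

sample = "Don't panic: COVID-19 is a well-known state-of-the-art topic!"
tokens = tokenize(sample)
print(tokens)
# ["don't", 'panic', 'covid', 'is', 'a', 'well-known', 'state-of', 'the-art', 'topic']
```

Two details are worth noting: "COVID-19" is reduced to "covid" because digits never match, and a word with several hyphens is split after the first hyphenated pair, since the optional group can only match once.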


Loop through every sheet and apply the tokenize function to its tweets. The resulting tokens are stored in the token dictionary created before.

In [23]:
# create token dictionary with keys = dates, value = tokens, all tokens in lower case
for k,v in data.items():
    tokens = tokenizeRawData(v)
    token_dic[k] = tokens

## 5. Generate unigrams

The output has the following requirements:

1. The context-independent and context-dependent (with the threshold set to more than 60 days) stop words must be removed from the vocab. The provided context-independent stop words list (i.e., stopwords_en.txt) must be used.
2. Tokens should be stemmed using the Porter stemmer.
3. Rare tokens (with the threshold set to less than 5 days) must be removed from the vocab.
4. Create the sparse matrix using CountVectorizer.
5. Tokens with a length of less than 3 should be removed from the vocab.
6. The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure.

As announced, context-dependent stopwords need to be removed before stemming.

To find tokens with a document frequency below 5 or above 60, `FreqDist()` can be used to count in how many days each token appears.

First, `set()` is applied to each day's tokens so that every token is counted at most once per day, and `chain.from_iterable` then joins all the per-day sets into one list. Passing that list to `FreqDist()` gives the document frequency of every token, which is stored as `words_freq`.

The next step is to find the tokens in `words_freq` with a frequency < 5 or > 60. `words_freq` behaves like a dictionary with key = token and value = number of days.
The context-dependent stopwords are collected by looping over its items and stored in two sets. Sets are used because membership tests on a set are much faster than on a list, which matters when every token is checked against them later.

In [24]:
# Create Unigrams 

# create vocab set contain unique words based on the word list
# And count frequency of words appear in total days, one day = one count, the document frequency.
words_freq = FreqDist(list(chain.from_iterable([set(value) for value in token_dic.values()])))

# define context dependent stopwords that need to be removed from vocab
# appear less than 5 days
lessFreqwords = set([k for k, v in words_freq.items() if v < 5])
# appear more than 60 days
highFreqwords = set([k for k, v in words_freq.items() if v > 60])
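
The document-frequency idea can be checked on a toy corpus. The sketch below uses `collections.Counter` (the class that `FreqDist` extends) so that it runs without nltk; the three "days", their tokens, and the thresholds `MIN_DAYS` and `MAX_DAYS` are invented for illustration and are much smaller than the assignment's 5 and 60.

```python
from collections import Counter
from itertools import chain

tokens_by_day = {
    "day1": ["covid", "covid", "mask", "the"],
    "day2": ["covid", "vaccine", "the"],
    "day3": ["the", "lockdown"],
}

# set() makes each token count once per day, so the Counter holds
# the number of days a token appears in, not its raw frequency.
doc_freq = Counter(chain.from_iterable(set(tokens) for tokens in tokens_by_day.values()))

MIN_DAYS = 2  # assumed toy threshold: fewer days than this counts as rare
MAX_DAYS = 2  # assumed toy threshold: more days than this counts as too common

rare_words = {w for w, n in doc_freq.items() if n < MIN_DAYS}
common_words = {w for w, n in doc_freq.items() if n > MAX_DAYS}

print(doc_freq["covid"])   # 2 (appears twice on day1, but only in two days)
print(sorted(rare_words))  # ['lockdown', 'mask', 'vaccine']
print(common_words)        # {'the'}
```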


Now the context-dependent stopwords can be removed from the token dictionary. Two functions are created here, one for each frequency threshold. Tokens with document frequency < 5 are removed first, and tokens with document frequency > 60 are removed next.

The results are stored in new token dictionaries.

In [25]:
# define functions for removing words that meets the low frequency threshold.
def removeLessFreqWords(date):
    return (date, [w for w in token_dic[date] if w not in lessFreqwords])

# remove low frequency words from token dictionary (document frequency < 5)
token_dic_2 = dict(removeLessFreqWords(date) for date in token_dic.keys())

# define functions for removing words that meets the high frequency threshold.
def removeHighFreqWords(date):
    return (date, [w for w in token_dic_2[date] if w not in highFreqwords])

# remove high frequency words from token dictionary (document frequency > 60)
token_dic_3 = dict(removeHighFreqWords(date) for date in token_dic_2.keys())



The context-independent stopwords given in the file `stopwords_en.txt` are removed next. The file is read first, and the stopword list is converted to a `set()` for faster membership tests.

As before, a function is defined for stopword removal and applied to every value inside the token dictionary.
These stopwords are removed before stemming because stemming would change some of them, and the stemmed forms would no longer match the stopword list.

The new tokens are stored in `token_dic_clean`.

In [26]:
# read stopwords text and convert to a set for faster membership tests.
stopwords = []
with open("stopwords_en.txt") as f:
    stopwords = f.read().splitlines()
stopwords_set = set(stopwords)  

# stopwords are removed here as they carry little meaning; removing them also shrinks the data.
# These context-independent stopwords take up a large proportion of the whole dataset
def stopword_remove(date):
    return(date, [w for w in token_dic_3[date] if w not in stopwords_set])

# new dictionary with stopwords removed
token_dic_clean = dict(stopword_remove(date) for date in token_dic_3.keys())

Stemming must use the Porter method, so `PorterStemmer()` is used here to stem every token in the cleaned token dictionary.

Tokens with a length of less than 3 need to be removed. This is done after stemming rather than before, because stemming may create tokens with a length of less than 3, and those should be removed as well.

The tokens with a length of less than 3 are removed with the function `removeLengthless3`.

The remaining tokens are stored as the final token dictionary.

In [27]:
# define stemming method - Porter
def stemming(date):
    stem = PorterStemmer()
    return(date, [stem.stem(w) for w in token_dic_clean[date]])

# stem the text here to avoid miscounting words with the same meaning. 
# As stated in the announcement, stemming should be done after context-dependent stopword removal
token_dic_stemmed = dict(stemming(date) for date in token_dic_clean.keys())

# remove words with length less than 3 from tokens
def removeLengthless3(date):
    return(date, [w for w in token_dic_stemmed[date] if (len(w) > 2)])

# remove words with length less than 3 after stemming, before the final output, to make sure no word with length < 3 is in the output.
token_dic_final = dict(removeLengthless3(date) for date in token_dic_stemmed.keys())

Use `FreqDist` to count the frequency of each token on each date, keep the 100 most frequent, and store the result in a dictionary named `token_dic_unigram`.

In [28]:
# define function to filter top 100 unigrams in a day
def unigram(date):
    return(date, FreqDist(token_dic_final[date]).most_common(100))

# create dictionary containing the top 100 unigrams for each day
token_dic_unigram = dict(unigram(date) for date in token_dic_final.keys())

`token_dic_unigram` contains the unigrams and frequency counts to output. Each day is written to the txt file as one line in the form `date:[('unigram1', count1), ('unigram2', count2), ...]`.

In [29]:
# output unigrams to txt

unigram_file = open("100uni.txt", "w")
for k,v in token_dic_unigram.items():
    unigram_file.write("{}:{}".format(k,v) + "\n")
unigram_file.close()
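
The shape of each output line follows directly from formatting a list of `(token, count)` tuples. The sketch below reproduces it with `collections.Counter`, whose `most_common` method is the one `FreqDist` inherits; the date, the tokens, and the cut-off `TOP_N` are made-up values chosen to keep the output short.

```python
from collections import Counter

tokens_by_day = {"2020-03-01": ["case", "covid", "case", "test", "covid", "case"]}
TOP_N = 2  # the notebook uses 100; 2 keeps the example readable

lines = []
for date, tokens in tokens_by_day.items():
    # most_common returns (token, count) pairs, highest count first
    top = Counter(tokens).most_common(TOP_N)
    lines.append("{}:{}".format(date, top))

print(lines[0])  # 2020-03-01:[('case', 3), ('covid', 2)]
```

If a day has fewer distinct tokens than the cut-off, `most_common` simply returns all of them, so no special handling is needed for short days.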

## 6. Generate bigrams

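Bigrams are generated using `ngrams()` with `n = 2`. They are built from the raw tokens in `token_dic`, without any stopword removal, to match the sample output as closely as possible.

Bigrams are written to a text file in the same format as the unigrams.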

In [30]:
# create bigrams

# define the rule for finding the top 100 bigrams, use ngrams with n = 2
def bigram(date):
    return(date, FreqDist(ngrams(token_dic[date], n = 2)).most_common(100))

# create dictionary containing the top 100 bigrams for each day
# In order to match the sample_bigram output as closely as possible,
# bigrams are created from data that did not go through any stopword removal
token_dic_bigram = dict(bigram(date) for date in token_dic.keys())

# output bigrams to txt
bigram_file = open("100bi.txt", "w")
for k,v in token_dic_bigram.items():
    bigram_file.write("{}:{}".format(k,v) + "\n")
bigram_file.close()
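
For `n = 2`, the pairs that `ngrams()` produces are simply each token together with its successor. A minimal standard-library sketch of the same counting, with an invented token list and a hypothetical helper named `bigrams`, looks like this:

```python
from collections import Counter

def bigrams(tokens):
    # Pair every token with the one that follows it.
    return zip(tokens, tokens[1:])

tokens = ["stay", "at", "home", "stay", "at", "work"]
bigram_counts = Counter(bigrams(tokens))

print(bigram_counts.most_common(1))  # [(('stay', 'at'), 2)]
print(len(bigram_counts))            # 4 distinct bigrams
```

Because each day's tweets are joined into one string before tokenizing, some pairs counted this way span the boundary between two consecutive tweets.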

## 7. Generate vocabulary list

The vocabulary list needs to contain both the 200 meaningful bigrams and all unigrams.

The unigrams are already available, so the 200 bigrams need to be found next.
The bigrams are found using `nltk.collocations.BigramAssocMeasures()` with the PMI measure. Bigrams containing a word with a length of less than 3 are filtered out as well.

The vocabulary list does not require frequencies, so `set()` can be used to get all the unique unigrams.

`MWETokenizer` is used to convert bigrams from the (x, y) format to the x_y format.
`final_vocab` is the vocabulary list to output; `sort()` puts it in alphabetical order.

In [31]:
# create vocab list

# a list of all original unigram tokens 
#token_list = list(chain.from_iterable([value for value in token_dic.values()]))

# vocab is formed by unique tokens in all days (which extract from tokens that are cleaned)
vocab = list(chain.from_iterable([value for value in token_dic_final.values()]))
vocab_set = set(vocab)

# Bigrams are generated from token_dic, the lower-cased tokens before any stopword removal.
# The tokens are not stemmed, as bigrams built from stemmed words would look weird
# create list with the tokens mentioned above.
bigram_list = list(chain.from_iterable([value for value in token_dic.values()]))
bigram_measures = nltk.collocations.BigramAssocMeasures()

# find bigrams in the token lists
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(bigram_list)
# To make sure length < 3 is removed 
bigram_finder.apply_word_filter(lambda w: len(w) < 3)
# use pmi measures to find the top 200 bigrams
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200)


# change from set to list
vocab_list = list(vocab_set)
vocab_list.sort()
# create an identical token list, to which the words of each bigram are appended below
token_list = list(vocab_set)

# append bigrams into vocab list
for i in top_200_bigrams:
    vocab_list.append(i) # list with vocabs + 200 bigrams


# append every single word in bigrams into the new token list
for bi in top_200_bigrams:
    for w in bi:
        token_list.append(w)

# use the MWE tokenizer, built from vocab_list, on token_list so that each bigram is joined with "_"
mwe_tokenizer = MWETokenizer(vocab_list)
final_vocab = mwe_tokenizer.tokenize(token_list)

# sort alphabetically
final_vocab.sort()
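
To make the PMI ranking more concrete, the sketch below scores bigrams by hand with the textbook definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), estimating each probability as a count divided by the number of tokens. It is an illustration under that assumption, not a reimplementation of NLTK's scorer, whose counting details may differ; the sentence is made up.

```python
import math
from collections import Counter

tokens = "new york is big new york is busy the city is big".split()

word_counts = Counter(tokens)
pair_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)

def pmi(pair):
    # log2( P(x, y) / (P(x) * P(y)) ) with probabilities estimated as count / n_words
    w1, w2 = pair
    return math.log2(pair_counts[pair] * n_words / (word_counts[w1] * word_counts[w2]))

ranked = sorted(pair_counts, key=pmi, reverse=True)

print(pmi(("is", "big")))              # 2.0
print(round(pmi(("new", "york")), 3))  # 2.585
print(round(pmi(("the", "city")), 3))  # 3.585
```

The pair ("the", "city") occurs only once yet outranks ("new", "york"), which occurs twice. PMI rewards pairs whose words rarely appear apart, so very rare pairs tend to dominate the top of the list when no minimum frequency is applied.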

Then the result can be written to a text file, with a number assigned to each word in order.

In [32]:
# output vocab list to txt
vocab_file = open("vocab.txt", "w")
i = 0
for w in final_vocab:
    vocab_file.write("{}:{}".format(w,i) + "\n")
    i = i + 1
vocab_file.close()
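
The target layout of vocab.txt, one `word:index` pair per line in alphabetical order, can be sketched without the `MWETokenizer` round trip by joining each bigram with "_" directly. This is a simplified alternative for illustration, not the method used in the cell above; the unigrams and bigrams are invented.

```python
unigrams = {"mask", "covid", "vaccin"}
top_bigrams = [("new", "york"), ("social", "distancing")]

# Join each bigram as x_y, merge with the unigrams, and sort alphabetically.
vocab = sorted(unigrams | {"_".join(pair) for pair in top_bigrams})

# enumerate assigns the running index that is written next to each word.
vocab_lines = ["{}:{}".format(word, index) for index, word in enumerate(vocab)]

print(vocab_lines)
# ['covid:0', 'mask:1', 'new_york:2', 'social_distancing:3', 'vaccin:4']
```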

## 8. Generate countvectors

First, all bigrams in the token dictionary need to be found.
This is done by the `filter_tokens` function, and the results are stored in a new dictionary.

Next, the new bigram dictionary is combined with the unigram dictionary by matching keys.
This gives a complete dictionary with both bigrams and unigrams.

The `CountVectorizer` code that creates `count_matrix` and checks its shape is left commented out; the counts are written out with `FreqDist` in the next cell instead.

In [33]:
# create countvec 

# keep only the tokens that contain "_" (the top 200 bigrams and other merged words) in each document.
def filter_tokens(date):
    return(date, [w for w in mwe_tokenizer.tokenize(token_dic[date]) if "_" in w])

# Store the bigram tokens into its corresponding key(date) in the dictionary
vocab_docu = dict(filter_tokens(date) for date in token_dic.keys())

# filter out the ones that are not bigrams in our vocab
vocab_docu_2 = dict((date, [w for w in vocab_docu[date] if w in final_vocab]) for date in vocab_docu.keys())

# combine the two dictionaries
for k,v in token_dic_final.items():
    if k in vocab_docu_2:
        vocab_docu_2[k] = vocab_docu_2[k] + v

# generate vectorizer
#vectorizer = CountVectorizer(analyzer = "word")

# create our matrix that contains number of documents and vocab counts by joining all tokens
#count_matrix = vectorizer.fit_transform([' '.join(value) for value in vocab_docu_2.values()])

#count_matrix.shape

Output to a txt file, with a number assigned to each word in the vocabulary. The frequency of each word is found using `FreqDist()`, and each word is represented by the same number as in vocab.txt.
Each day is written as one line in the format `date,vocab_no:count vocab_no:count ...`.

In [34]:
# create countVectors
out_file = open("countVec.txt", 'w')

# create a dictionary to hold the numbers assigned to each vocab
vocab_dict = {}
i = 0
for w in final_vocab:
    vocab_dict[w] = i
    i = i + 1

# write to txt in format date,vocab_no:count vocab_no:count ...
for i, d in vocab_docu_2.items():
    d_idx = [vocab_dict[w] for w in d]
    # write "date," before writing countVectors
    out_file.write("{},".format(i))
    # write countVectors
    for k, v in FreqDist(d_idx).items():
        out_file.write("{}:{} ".format(k,v))
    out_file.write('\n')
out_file.close()
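
The sparse line format can be checked on a small example. The sketch below maps tokens to vocabulary indices and counts them with `collections.Counter` instead of `FreqDist`; the vocabulary, the date, and the tokens are made up, and unlike the cell above it sorts the entries by index and omits the trailing space.

```python
from collections import Counter

vocab_index = {"covid": 0, "mask": 1, "new_york": 2}
doc_tokens = {"2020-03-01": ["covid", "mask", "covid", "new_york"]}

lines = []
for date, tokens in doc_tokens.items():
    # Count how often each vocabulary index occurs in this document.
    counts = Counter(vocab_index[t] for t in tokens)
    entries = " ".join("{}:{}".format(i, n) for i, n in sorted(counts.items()))
    lines.append("{},{}".format(date, entries))

print(lines[0])  # 2020-03-01,0:2 1:1 2:1
```

Only indices with a non-zero count are written, which is what makes the representation sparse: a day that uses 300 of several thousand vocabulary entries produces 300 pairs rather than a full row.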