# FIT5196 Assessment 1
## Task 2
#### Student Name: Anirban Roy Chowdhury
#### Student ID: 30539676

Date: 13/09/20

Version: 3.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re (for regular expressions, included in Anaconda Python 3.6)
* os (for operating system interfaces)
* nltk (Natural Language Toolkit)
* pandas (for data frames, included in Anaconda Python 3.6)
* matplotlib (for plotting frequency distributions)
* langid (for language classification)
* nltk.tokenize (for tokenization, both MWE and single word)
* nltk.stem (for the Porter stemmer)
* itertools (for iteration helpers such as chain)
* nltk.collocations (for finding bigram collocations)
* nltk.probability (for FreqDist frequency distributions)
* sklearn.feature_extraction.text (for CountVectorizer, to generate the sparse matrix representation)

# 1. Introduction
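This task turns an Excel workbook of tweets, with one sheet per day, into unigram, bigram, vocabulary and count-vector files. Several steps need to be executed to get there.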

<p>Our goal is to generate:

1. 100uni.txt - A file containing the top 100 unigrams for each day, along with their counts.
2. 100bi.txt - A file containing the top 100 bigrams for each day, along with their counts, selected according to the measures given in the specification.
3. vocab.txt - A text file containing every word in our corpus. Each word is assigned a unique index by which it is referred to.
4. countVec.txt - A text file containing the sparse document-word matrix representation, where each sheet is written as a sequence of index and count pairs: the key is the word's unique index in vocab.txt and the value is the frequency of the word in that sheet.
</p>

<p>
A general rundown of the steps by which we will achieve this:

1. Read and clean all our Excel data sheets.
2. Identify and remove all non-English tweets.
3. Tokenize the text.
4. Remove context-dependent and context-independent stop words.
5. Porter stem each token.
6. Generate the top 100 unigrams and bigrams per day (sheet).
7. Create the sparse matrix representation of the documents.
8. Generate vocab and countVec.

</p>

In [None]:
import os,re
import nltk
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
from langid.langid import LanguageIdentifier, model
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import itertools
from nltk.collocations import *
from nltk.tokenize import MWETokenizer
from itertools import chain
from nltk.probability import *
from sklearn.feature_extraction.text import CountVectorizer

# 2. Methodology 

## 2.1.Data Exploration & Wrangling

### 2.1.1 Initialize variables

<p>
We initialize our language identifier, the tokenizer with the given regex, and our Porter stemmer. These are used later in the assignment but are initialized here.
</p>
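
To illustrate what the token pattern accepts, the sketch below applies the same regular expression with the standard-library `re.findall`, used here only as a stand-in for `RegexpTokenizer`. The sample sentence is made up for illustration.

```python
import re

# Same pattern that is passed to RegexpTokenizer in this notebook.
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sample tweet, already lower-cased.
sample = "it's a state-of-the-art covid19 test, isn't it?"

# findall returns every non-overlapping match of the pattern, left to right.
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
```

The pattern allows at most one hyphen or apostrophe inside a token, so longer compounds are split into pieces, and digits are never part of a token.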

In [None]:
#Initializing our classifier
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
#Initialize tokenizer
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
#Porter Stemmer
stemmer = PorterStemmer()

### 2.1.2. Read Data & Create context independent stop word list 

<p>
We can read the entire Excel workbook along with all its sheets by passing "sheet_name=None"; pandas then returns a dict that maps each sheet name to its DataFrame.
</p>

In [None]:
#Read all the sheets in the excel file without using the excel headers
#excel_data_set = pd.read_excel('D:\\2.Uni Work\\FIT5196\\Assignment 1\\30539676.xlsx', sheet_name = None, header= None)
excel_data_set = pd.read_excel('D:\\2.Uni Work\\FIT5196\\Assignment 1\\part2\\sample.xlsx', sheet_name = None, header= None)

In [None]:
#Creating a list of all stop words
stop_words = []
#open the file
with open('stopwords_en.txt', 'r', encoding='utf8') as f:
    #Iterate over each line while stripping '\n' from the text and adding it into the List
    stop_words = [line.lower().rstrip('\n') for line in f]

## 2.2. Initial Manipulation (Cleaning and removing useless data)
<p>
Since our data has multiple null rows and columns, duplicate tweets and non-English tweets, we need to clean the data thoroughly before we can start our text analysis.
</p> 

<p>
Our wrangling consists of the following steps:

1. Since the table of data in each sheet of the Excel workbook does not start at the first row and column, we drop all empty null rows and columns.
2. We name our initial 3 columns and remove the header row.
3. Drop any duplicate tweets by the 'id' column and keep only the first occurrence.
4. Lower case the text column.
5. Since langid classifies one text at a time, iterate over each row and classify each text according to its language.
6. Drop all rows where the "lang" column does not have English as its classified language.
7. Tokenize the remaining tweets and add the tokens to the "token" column.
8. Remove all context-independent stop words from the tokens, storing the result in the "tokens_minus_context_independent_stop_words" column.
9. Remove all tokens with length less than 3.
10. Reset the index.
 </p>
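
As a minimal sketch of the two token-level filters in the list above (stop word removal and the minimum length of 3), assuming plain Python lists rather than DataFrame columns. The function name `filter_tokens` and the sample data are hypothetical.

```python
def filter_tokens(tokens, stop_words, min_length=3):
    """Drop stop words and tokens shorter than min_length."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_length]

# Hypothetical stop word set; a set makes each membership test fast.
stop_words = {"the", "is", "a", "of"}
tokens = ["the", "virus", "is", "a", "uk", "risk"]

filtered = filter_tokens(tokens, stop_words)
print(filtered)
```

Holding the stop words in a set rather than a list keeps the lookup cost low when the same check runs for every token of every tweet.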

In [None]:
unigrams_dict = {}
bigrams_dict = {}
#iterate over the dict containing the name of the sheet as key and dataframe containing the sheet data as value
for key, df in excel_data_set.items():
    #Dropping columns that are entirely null
    df.dropna(how="all",axis=1,inplace= True)    
    #Dropping rows that contain any null value
    df.dropna(subset=df.columns,inplace= True)
    #Changing columns names for easier manipulation and readability 
    df.columns = ['text','id','created_at']
    #Removing header from our dataframes - i.e text, id, created_at tags present in the sheet
    df.drop(df.head(1).index, inplace=True)
    #Remove duplicates
    df.drop_duplicates('id',keep='first',inplace=True)
    # Adding intermediary columns
    df['lang'] = ""
    df['token'] = ""
    df['tokens_minus_context_independent_stop_words'] = ""
    df['porter_stemmed_token'] = ""
    df['bigram_token'] = ""
    df['concatenated_tokens'] = ""
    df['tokens_minus_all_stop_words'] = ""
    #Lower case and cast to string
    df['text'] = df['text'].apply(lambda x:str(x).lower())
    #Iterate over each tweet and add the classified language to the 'lang' column
    #(write through df.at: the row yielded by iterrows() may be a copy, so assigning to it can be lost)
    for index, row in df.iterrows():
        df.at[index, 'lang'] = identifier.classify(row['text'])[0]
    #Drop all rows whose classified language is not English
    df.drop(df.loc[df['lang']!='en'].index, inplace=True)
    #Add a column with the tokenized result of the text column.
    df['token'] = df['text'].apply(lambda x: tokenizer.tokenize(x))
    #Only keep those words not in stop_word list
    df['tokens_minus_context_independent_stop_words'] = df['token'].apply(lambda x: [token for token in x if token not in stop_words])
    df['tokens_minus_context_independent_stop_words'] = df['tokens_minus_context_independent_stop_words'].apply(lambda x: [token for token in x if len(token)>=3])
    #df.drop('lang', axis=1, inplace=True)
    #Reset index for easier manipulation
    df.reset_index(drop=True, inplace=True)

## 2.3. Creating Context-dependent Stopword list

<p>
We have now removed all the useless information. Previously we created a list of context-independent stop words; now we create the context-dependent stop word list, using the rules stated in the specification:

* Words appearing in more than 60 days (sheets)
* Rare tokens appearing in fewer than 5 days (sheets)

Since we have already tokenized our text and removed context-independent stop words, we can use the code below to get the number of days (sheets) each word appears in.

For each day (sheet) we collect all of its tokens into one list and convert that list to a set, so each word appears at most once per day. Chaining these sets together and taking the FreqDist of the result gives, for each unique word, the number of days (sheets) in which it has appeared.

By filtering this distribution according to our conditions we get the lists of context-dependent stop words.
</p>
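
The day-count idea described above can be sketched with the standard library alone, using `collections.Counter` as a stand-in for `FreqDist`. The function name, the parameter names `max_days` and `min_days`, and the toy corpus are assumptions for illustration; the thresholds are shrunk so the effect is visible on three days.

```python
from collections import Counter

def context_dependent_stop_words(tokens_per_day, max_days=60, min_days=5):
    """Return words seen in more than max_days or in fewer than min_days days."""
    # Converting each day's tokens to a set counts every word once per day.
    day_counts = Counter(
        word for tokens in tokens_per_day.values() for word in set(tokens)
    )
    return {w for w, n in day_counts.items() if n > max_days or n < min_days}

# Hypothetical toy corpus: one token list per day.
tokens_per_day = {
    "day1": ["covid", "covid", "mask", "news"],
    "day2": ["covid", "mask"],
    "day3": ["covid", "vaccine"],
}

result = context_dependent_stop_words(tokens_per_day, max_days=2, min_days=2)
print(sorted(result))
```

Here "covid" is removed for appearing on too many days, "news" and "vaccine" for appearing on too few, and "mask" is kept.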

In [None]:
#Iterate over excel_data_set
#For each sheet, flatten the token lists of all tweets into a single list
#and convert it to a set so that only the unique words of that day remain.
#chain.from_iterable then flattens the per-sheet sets into one long list.
#Since each word occurs at most once per sheet (i.e. date) in that list,
#its frequency in the FreqDist is the number of sheets the word appears in.
words_2 = list(chain.from_iterable([set(list(chain(*df['tokens_minus_context_independent_stop_words']))) for key, df in excel_data_set.items()]))
fd_2 = FreqDist(words_2)
words_greater_60 = list(filter(lambda x:fd_2[x]>60,fd_2))
words_lesser_5 = list(filter(lambda x:fd_2[x]<5,fd_2))
context_dependent_stopwords = words_greater_60 + words_lesser_5

In [None]:
words_lesser_5

A look at the day counts of the 25 most common words.

In [None]:
fd_2.plot(25)

## 2.4. Final Manipulation (Removing context-dependent stop words, Porter stemming & concatenation)

<p>
The steps are as follows:

1. Remove any words that appear in our context-dependent stop word list (i.e. words occurring in more than 60 days or fewer than 5 days).
2. Porter stem all remaining words.
3. Concatenate the tokens of each tweet back into a single string. This is mainly for the count vectorization done further below; this column also gives a basic gist of each tweet, free of stop words and showing only the words that are relevant to textual analysis.
</p>

In [None]:
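#Convert the stop word list to a set once so that each membership test is fast
context_dependent_stopword_set = set(context_dependent_stopwords)
for key, df in excel_data_set.items():
    print(key)
    #Remove context-dependent stop words, then Porter stem the remaining tokens
    df['porter_stemmed_token'] = df['tokens_minus_context_independent_stop_words'].apply(lambda x:[stemmer.stem(token) for token in x if token not in context_dependent_stopword_set])
    #Concatenate all the Porter-stemmed tokens of each row into a single string.
    df['concatenated_tokens'] = df['porter_stemmed_token'].apply(lambda x: " ".join(x))
    print(df['concatenated_tokens'])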

## 2.5.Bigram & Unigram Generation


In [None]:
#Function taken from the tutorial; returns a list of the top bigrams ranked by PMI.
#Note: pmi_measure is the number of bigrams to return (100 by default).
def get_bigrams(input_list,pmi_measure=100):
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(input_list)
    bigram_finder.apply_freq_filter(20)
    #bigram_finder.apply_word_filter(lambda w: len(w) < 3)# or w.lower() in ignored_words)
    top_n_bigrams = bigram_finder.nbest(bigram_measures.pmi, pmi_measure) # Top-100 bigrams
    return top_n_bigrams

<p>
The following steps are executed to build our unigram and bigram dicts:

1. Each entry in unigrams_dict has the sheet name (i.e. the date) as its key and, as its value, the 100 most frequent Porter-stemmed tokens of that sheet with their counts, sorted by frequency.
2. Per sheet, we generate a list of the top 100 bigrams using our get_bigrams() function, which returns them based on the parameters mentioned in the specification.
3. We initialize an MWE tokenizer per sheet with these bigrams.
4. The MWE tokenizer is then applied to the 'token' column to merge matching word pairs into bigrams. Note that this column holds only the raw tokens; no further preprocessing is done on it. The results are stored in the 'bigram_token' column.
5. Since the MWE tokenizer joins 2 unigrams into a bigram using the default separator '_', we filter the 'bigram_token' column, keeping only the tokens that contain '_'.
6. Finally, similar to unigrams_dict, we make an entry in bigrams_dict containing the frequency distribution of the bigrams generated for that sheet.
</p>
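
To make the bigram ranking in step 2 more concrete, here is a simplified standard-library sketch of scoring bigrams by pointwise mutual information after a frequency filter. It is an illustration of the idea only, not NLTK's implementation; the function name `top_bigrams_by_pmi`, its parameters and the toy token list are hypothetical.

```python
import math
from collections import Counter

def top_bigrams_by_pmi(tokens, min_freq=20, top_n=100):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_freq times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), n in bigram_counts.items():
        if n < min_freq:
            continue  # frequency filter, analogous to apply_freq_filter
        # PMI = log2( count(w1, w2) * N / (count(w1) * count(w2)) )
        pmi = math.log2(n * total / (word_counts[w1] * word_counts[w2]))
        scored.append(((w1, w2), pmi))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy token stream with a small frequency threshold.
tokens = ["a", "b", "a", "b", "c", "d"]
ranked = top_bigrams_by_pmi(tokens, min_freq=2)
print(ranked)
```

PMI rewards pairs whose words occur together more often than their individual frequencies would suggest, which is why a frequency filter is applied first: without it, rare pairs seen only once or twice tend to dominate the ranking.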

In [None]:
for key, df in excel_data_set.items():
    #add entry into unigram dict from porter stemmed tokens for a single sheet.
    unigrams_dict[key] = sorted(list(dict(FreqDist(list(chain(*df['porter_stemmed_token'].tolist())))).items()),key=lambda x:x[1],reverse=True)[:100]
    #####################BIGRAMS#####################
    #Get 100 bigrams per sheet according to the parameters in the function.
    top_100_bigrams = get_bigrams(list(chain(*df['token'].tolist())),100)
    #Initialize the MWE tokenizer for the current sheet by passing the top 100 bigrams
    mwetokenizer = MWETokenizer(top_100_bigrams)
    #Apply the trained bigrams on the token column to generate bigrams
    df['bigram_token'] = df['token'].apply(lambda x: mwetokenizer.tokenize(x))
    #only keep the words with "_" in them, since MWE join bigrams using "_"
    df['bigram_token'] =  df['bigram_token'].apply(lambda x: [token for token in x if '_' in token])
    #add entry into the dictionary of bigrams containing the top 100 bigram of the sheet.
    bigrams_dict[key] = sorted(list(dict(FreqDist(list(chain(*df['bigram_token'].tolist())))).items()),key=lambda x:x[1],reverse=True)[:100]

In [None]:
test_key = list(excel_data_set.keys()).pop(0)

In [None]:
excel_data_set[test_key]['text'][:10]

In [None]:
excel_data_set['2020-03-22']['concatenated_tokens'][:10]

The code below reformats our dictionary into the format of the assignment specification: it splits each joined bigram string into a tuple of its two words.

In [None]:
#Reformat our dictionary into our desired output format
new_bigram_dict = {}
for key, value in bigrams_dict.items():
    temp = []
    for x in value:
        word_split = x[0].split("_")
        a = ((word_split[0],word_split[1]),x[1])
        temp.append(a)
    new_bigram_dict[key] = temp

Output code for creating the required _100uni.txt and _100bi.txt files

In [None]:
#Write to the desired outputfile according to the given schema
with open('30539676_100uni.txt','w+',encoding='utf-8') as f:
    for key, val in unigrams_dict.items():
        f.write('%s:%s\n' % (key,val))

In [None]:
#Write to the desired outputfile according to the given schema
with open('30539676_100bi.txt','w+',encoding='utf-8') as f:
    for key, val in new_bigram_dict.items():
        f.write('%s:%s\n' % (key,val))

## 2.6. Creating sparse matrix to generate countvec.txt and vocab.txt


In the 2 code cells below we initialize our CountVectorizer and fit it over a list of strings, where each string contains the concatenated tokens of an entire sheet, so that each string represents one document (sheet).

In [None]:
#Create sample vocab and count vectorizer
vectorizer = CountVectorizer(analyzer = "word") 

In [None]:
data_features = vectorizer.fit_transform([" ".join(df['concatenated_tokens'].tolist()) for key,df in excel_data_set.items()])
print (data_features.shape)

The shape printed above gives the dimensions of our document-by-word matrix. Since our Excel workbook has 81 sheets, it will be 81 x no_of_features. Next, we build a dictionary with each word as the key and a unique index as the value.

In [None]:
#Creating a dict for vocab index, each word is assigned a unique index.
#Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
vocab = vectorizer.get_feature_names()
vocab_index_dict = {word: index for index, word in enumerate(vocab)}

<p>
The following code cell creates the doc-word matrix as a dictionary where the key is the name of the sheet (in our case the date) and the value is a dictionary containing the words that appear in that sheet along with their counts.

The steps taken are as follows:

1. Create a list of keys from excel_data_set.
2. Get the vocab list from "vectorizer.get_feature_names()".
3. Iterate over each row of the 2D array returned by "data_features.toarray()".
    1. Here each row is the doc-word representation of a sheet.
4. Using the zip() function, zip together the vocab and the doc-word representation of that sheet.
    1. Since the vocab is in the same order as the columns of the array returned by "data_features.toarray()", zip correctly pairs each word with its count.
5. Add the word_count_dict created for the sheet as a value to doc_matrix_dict, with the date as the key.

</p>

In [None]:
vocab = vectorizer.get_feature_names()
date_keys = list(excel_data_set.keys())
word_count_dict ={}
doc_matrix_dict = {}
for doc_matrix_rep in data_features.toarray():
    word_count_dict = {}
    date_key = date_keys.pop(0)
    for word, count in zip(vocab,doc_matrix_rep):
        if count>0:
            word_count_dict[word] = count
    doc_matrix_dict[date_key] = word_count_dict

<p>
The following code creates the required countvec representation. The code cell below replaces each word with its corresponding unique index from _vocab.txt. This way we are able to create a countVec representation of each document (sheet).
</p>
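
The mapping from words to indices and then to per-document counts can be sketched without scikit-learn, as below. This is a stand-in under stated assumptions: the function name `build_vocab_and_countvec` and the toy documents are hypothetical, the vocabulary is indexed in alphabetical order, and the `index:count` line format is one plausible rendering rather than the exact output of the notebook.

```python
from collections import Counter

def build_vocab_and_countvec(docs):
    """docs maps a document name to its list of tokens."""
    # Index the vocabulary in sorted order.
    vocab = sorted({token for tokens in docs.values() for token in tokens})
    index_of = {word: i for i, word in enumerate(vocab)}
    lines = []
    for name, tokens in docs.items():
        counts = Counter(tokens)
        # Only words that occur in the document are written (sparse form).
        pairs = ",".join(f"{index_of[w]}:{counts[w]}" for w in sorted(counts))
        lines.append(f"{name},{pairs}")
    return index_of, lines

# Hypothetical toy documents keyed by date.
docs = {
    "2020-03-22": ["mask", "covid", "covid"],
    "2020-03-23": ["vaccin", "covid"],
}
index_of, lines = build_vocab_and_countvec(docs)
print(index_of)
print(lines)
```

Writing only the non-zero entries keeps each line short even when the vocabulary is large, since most words do not occur in any given document.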

In [None]:
#Creating output for countvec.txt
reformated_doc_matrix_date = {}
for key,val in doc_matrix_dict.items():
    word_keys = val.keys()
    temp_dict = dict()
    for word in word_keys:
        if word in vocab_index_dict:
            temp_dict[vocab_index_dict[word]] = val[word]
    reformated_doc_matrix_date[key] = temp_dict

Code for creating the required files.

In [None]:
#Code to write vocab.txt
with open('30539676_vocab.txt','w+',encoding='utf-8') as f:
    for key,val in vocab_index_dict.items():
        f.write('%s:%s\n'%(key,vocab_index_dict[key]))

In [None]:
with open('30539676_countvec.txt','w+') as f:
    for key, val in reformated_doc_matrix_date.items():
        f.write('%s,%s\n' % (key,str(val).strip('{}')))

# 3. Conclusion

<p>
The following tasks have been done:
    
    1. Parsed the data and dropped null and duplicate values.
    2. Kept only English text.
    3. Tokenized the text.
    4. Removed context-independent and context-dependent stop words.
    5. Porter stemmed each token.
    6. Generated the top 100 unigrams and bigrams per sheet.
    7. Generated the vocab of the corpus.
    8. Generated the doc-word matrix representation through CountVectorizer.
</p>