# FIT5196 Task 2 in Assessment 1
#### Student Name: Armin Berger
#### Student ID: 26255367

Date: 29/08/2020

Version: 1.0

Environment: Python 2.7.11 and Jupyter notebook

Libraries used:

* langid      (to check the language of a tweet, included in Anaconda Python 2.7) 
* re          (for regular expression, included in Anaconda Python 2.7) 
* numpy       (for numpy array, included in Anaconda Python 2.7)
* pandas      (for dataframe, included in Anaconda Python 2.7) 
* itertools   (for combining lists, included in Anaconda Python 2.7)
* nltk        (for tokenizing and stemming, included in Anaconda Python 2.7)
* sklearn     (for sparse matrix, included in Anaconda Python 2.7)



# <span style="color:blue"> 2.0 General Steps </span>

In order to achieve the required outcomes of task 2, we first need to take several general steps that will help us in all sub-tasks of task 2.

Firstly, we will import all required libraries.

Secondly, we will parse all of the xlsx tweet file data and save its tokens in a dict.

Thirdly, we will get all the context-dependent stop words (those that appear on more than 60 days) and all the rare words (those that appear on fewer than 5 days).

### 1. Import libraries

In [33]:
# Firstly, we will import a multitude of libraries to help us with this task

import re # used for regular expression

import langid # used to check whether a tweet is in english 

# character encoding standard
import unicodedata

# nltk used for parsing and cleaning text
import nltk
from nltk.stem import PorterStemmer # porter stemmer = applies 5 rules
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

# for vectorisation
from sklearn.feature_extraction.text import CountVectorizer

# obtaining punctuation
import string

# for data frame manipulation 
import pandas as pd

# for combining lists into one single list
from itertools import chain

import collections

### 2. Parse the xlsx File and get all tokens per sheet 

In this step we parse all the information of the provided Excel file which contains the tweets. 

One of the main obstacles while reading in the tweet data from the xlsx file was the varying position of the sheets' headers as well as the lack of headers for some sheets. Thus, I decided to drop all empty rows and columns and to assign a header row to sheets missing headers, so that each dataframe had a header row in the same position.

After a number of wrangling steps and tokenization of each English tweet, we save the retrieved tokens per sheet, keyed by the sheet's date, in a dict.

In [34]:
## GENERAL TOOLS: defining global variables/methods that get used throughout the code

# read in the provided context-independent stop words from the text file
# (the with-block closes the file again once it has been read)
with open('stopwords_en.txt', 'r') as stop_word_file:
    # remove \n from each stopword and save as a set
    stop_words = set([x.replace("\n", "") for x in stop_word_file])

# set the tokenizer to the desired regular expression
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# calling the PorterStemmer function
porter = PorterStemmer()
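
To make the tokenizer pattern above easier to inspect, here is a minimal sketch that applies the same regular expression with the standard `re` module only. The helper name `tokenize` and the sample sentence are made up for illustration; the sketch assumes that non-overlapping matching with `findall` is how the pattern is applied.

```python
import re

# Same pattern as the RegexpTokenizer above: a run of letters, optionally
# followed by ONE hyphen or apostrophe and another run of letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")


def tokenize(text):
    """Lower-case the text and return every non-overlapping pattern match."""
    return TOKEN_PATTERN.findall(text.lower())


sample_tokens = tokenize("Don't panic: COVID-19 is a well-known topic")
print(sample_tokens)
```

Note that digits are never part of a token, so "COVID-19" is reduced to "covid", while "don't" and "well-known" survive as single tokens.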

In [35]:
## NOTE: this code block runs for a while but creates the desired output (please don't restart it, just let it run)

## READ IN TWEET DATA AND TOKENIZE 

# create a dict that saves all tokens per sheet as a list 
all_tweets_dict = {}

# read in excel with multiple sheets 
tweet_data_frames = pd.ExcelFile('26255367.xlsx')

# save the title of each sheet in a list
list_sheet_titles = tweet_data_frames.sheet_names

# iterate through all the sheets in the dataframe 
for item in list_sheet_titles:
    
    # save each read in sheet as a data frame
    # parse is similar to the read_excel method and transforms a specified sheet 
    # into a data frame
    sheet_data_frame = tweet_data_frames.parse(sheet_name=item)
    
    # drop all empty columns
    sheet_data_frame = sheet_data_frame.dropna(axis='columns', how='all')
    
    # drop all empty rows
    sheet_data_frame = sheet_data_frame.dropna(axis=0, how='all')
    
    # turn the first row into the header of each column
    # adapted from user ostrokach on the 02/09/2020 
    # from: https://stackoverflow.com/questions/31328861/python-pandas-replacing-header-with-top-row
    
    # if the first cell of the first row is 'text', that row holds the real column names
    if sheet_data_frame.iloc[0, 0] == 'text':
        sheet_data_frame = sheet_data_frame.rename(columns=sheet_data_frame.iloc[0]).drop(sheet_data_frame.index[0]) 
    
    # reset the index of each data frame
    sheet_data_frame = sheet_data_frame.reset_index()
    
    # after creating a uniform dataframe format for each sheet read in we save the tweet text
    # as a list
    text_per_sheet = sheet_data_frame['text'].values.tolist()
    
    # create list that saves all lists of tokens per sheet
    list_all_token_per_sheet = []
    
    # now tokenize the tweet text
    for tweet in text_per_sheet:
            
        # skip cells that are not strings (e.g. empty or numeric cells)
        if type(tweet) == str:
            tweet_language = langid.classify(tweet)
            # check if a tweet is in english using the langid package
            if tweet_language[0] == 'en':

                # set tweet to lower case
                tweet = tweet.lower()

                # now we tokenize each sentence with the regular expression provided to us

                # we now apply the tokenizer that was used during the week 5 tutorial
                # result is a list containing all the word tokens of the current tweet
                unigram_tokens = tokenizer.tokenize(tweet)

                # append list of tokens to all tokens list
                list_all_token_per_sheet.append(unigram_tokens)

    # combine all list of tokens into one list per sheet
    list_all_token_per_sheet = list(chain.from_iterable(list_all_token_per_sheet))
    
    # assign the date as a key and the list of tweet text as a value
    all_tweets_dict[item] = list_all_token_per_sheet
    
    
# the final result is a dict that maps each sheet name (the date) to the list of
# tokens from that sheet's english tweets

### 3. Generate all rare words and context dependent stop words

In this step we seek to find all rare words (tokens that appear on fewer than 5 days) and all context-dependent stop words (tokens that appear on more than 60 days).

In [36]:
## WORD FREQUENCY

# firstly, in order to find both context dependent stop words and rare words we need to get the sheet/document 
# frequency of each word across all the sheets 

# collect the unique tokens of every sheet into one list using a list comprehension
# adapted logic from tutorial week 5
token_document_frequencey = list(chain.from_iterable([set(item) for item in all_tweets_dict.values()]))

# get the document/sheet frequency of each token(in how many sheets does a word appear)
# using the function FreqDist introduced in tutorial week 5
token_document_frequencey = FreqDist(token_document_frequencey)
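
The document-frequency idea used in this cell can be checked on a toy input. The following is a sketch using only the standard library (`collections.Counter` in place of `FreqDist`); the function name `document_frequency` and the two toy sheets are hypothetical.

```python
from collections import Counter
from itertools import chain


def document_frequency(tokens_by_sheet):
    """Count, for every token, the number of sheets it occurs in."""
    # set(tokens) collapses repeats, so each sheet adds at most 1 per token
    unique_per_sheet = (set(tokens) for tokens in tokens_by_sheet.values())
    return Counter(chain.from_iterable(unique_per_sheet))


toy_sheets = {
    "2020-03-01": ["covid", "covid", "mask"],
    "2020-03-02": ["covid", "lockdown"],
}
toy_doc_freq = document_frequency(toy_sheets)
print(toy_doc_freq["covid"], toy_doc_freq["mask"])
```

Although "covid" occurs three times in total, its document frequency is 2 because it appears in two sheets.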


In [39]:
## CONTEXT DEPENDENT STOPWORDS

# after getting the document frequency of each token we can sub select the too frequent and rare ones

# get all context dependent stop words by only selecting the ones with a frequency count larger or equal to 61
# list for all context_dependent_stop_words
context_dependent_stop_words = []

# iterate through the items of frequency dict
for keys, values in token_document_frequencey.items():
    
    # only select words with a frequency larger or equal to 61
    if values >= 61:
        
        # append key as stop word (the loop variable is called keys, not key)
        context_dependent_stop_words.append(keys)
        
# save all context_dependent_stop_words as a set for fast retrieval 
context_dependent_stop_words = set(context_dependent_stop_words)


## RARE WORDS

# list for rare words
rare_words = []

# iterate through the items of frequency dict
for keys, values in token_document_frequencey.items():
    
    # check for frequency
    if values <= 4:
        
        # append key as rare word
        rare_words.append(keys)
        
# save all rare_words as a set for fast retrieval 
rare_words = set(rare_words)
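
The two loops above can also be written as set comprehensions. This is a sketch only: the function name `split_by_document_frequency`, its parameter names, and the toy frequencies are assumptions, while the thresholds (at most 4 sheets for rare words, at least 61 sheets for context-dependent stop words) mirror the cell above.

```python
def split_by_document_frequency(doc_freq, rare_max=4, stop_min=61):
    """Return (rare_words, context_dependent_stop_words) as two sets."""
    # rare: appears in fewer than 5 sheets, i.e. in at most rare_max sheets
    rare = {word for word, count in doc_freq.items() if count <= rare_max}
    # context dependent stop word: appears in more than 60 sheets
    stop = {word for word, count in doc_freq.items() if count >= stop_min}
    return rare, stop


toy_counts = {"covid": 80, "mask": 30, "zoonotic": 2, "the": 61, "vaccine": 5}
toy_rare, toy_stop = split_by_document_frequency(toy_counts)
print(sorted(toy_rare), sorted(toy_stop))
```

Words such as "vaccine" (5 sheets) and "mask" (30 sheets) fall between the two thresholds and therefore stay in the vocabulary.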


# <span style="color:blue">Task 2.1 - Generate Corpus Vocabulary </span>

### 1. Get all unigrams for the corpus vocabulary

Since all of the data within the xlsx file is already read in and saved, we can directly start using it. To get all unigrams we need to filter out all the undesired words. 

1. Get a list of all unigrams by chaining together all dictionary values 

2. Remove all regular stop words

3. Remove all rare words

4. Remove any token that is in context_dependent_stop_words

5. Remove any token with a length smaller than three

6. Stem the remaining tokens using the PorterStemmer

7. Append all stemmed tokens to the unigram_vocabulary list

8. Only keep unique words by turning it into a set

9. Sort the set alphabetically

Done, now you have all the unigrams for the corpus vocabulary!

In [40]:
# chain together all unigrams that were stored as dict values
all_unigram_tokens = list(chain.from_iterable(all_tweets_dict.values()))


In [41]:
## FILTER OUT WORDS

# list for all tokens in the corpus vocabulary
unigram_vocabulary = []

# iterate through the list of all tokens
for word in all_unigram_tokens:
    
    # check if word is in regular stopwords 
    if word not in stop_words:
        
        # check if word is in rare words
        if word not in rare_words:
            
            # check if word is in context dependent stopwords
            if word not in context_dependent_stop_words:
                
                # check if word is equal or longer than len() = 3 
                if len(word) >= 3:
                    
                    # stem each word using the porter stemmer
                    word = porter.stem(word)
                    
                    # append each word to the corpus_vocabulary list
                    unigram_vocabulary.append(word)

                        
# sort vocabs and turn list into set
unigram_vocabulary = sorted(set(unigram_vocabulary))

# now we have all the unigrams 
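
The nested checks above amount to one filter followed by stemming. The sketch below shows that shape with the standard library only. The function `build_unigram_vocabulary`, the merged `excluded` set, and the toy stemmer are hypothetical stand-ins; in the notebook the stemmer would be `porter.stem` and the excluded words would be the union of the three sets built earlier.

```python
def build_unigram_vocabulary(tokens, excluded, stem=lambda word: word, min_length=3):
    """Filter tokens, stem the survivors and return a sorted list of unique stems."""
    kept = {
        stem(word)
        for word in tokens
        # the length check is applied to the raw token, before stemming
        if word not in excluded and len(word) >= min_length
    }
    return sorted(kept)


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


toy_excluded = {"the"} | {"zoonotic"} | {"covid"}  # stop, rare, context dependent
toy_vocab = build_unigram_vocabulary(
    ["the", "masks", "mask", "covid", "zoonotic", "at", "vaccines"],
    toy_excluded,
    stem=toy_stem,
)
print(toy_vocab)
```

Merging the three sets once means each token needs a single membership test instead of three.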

### 2. Get all bigrams for the corpus vocabulary

Since all of the data within the xlsx file is already read in and saved, we can directly start using it. To get the bigrams of the corpus vocabulary we follow these steps:

1. Get a list of all tokens by chaining together all dictionary values 

2. Build a BigramCollocationFinder from the full token list

3. Filter out every bigram that contains a token with a length smaller than three

4. Rank the remaining bigrams by PMI (pointwise mutual information) and keep the top 200

5. Join the two words of each bigram with an underscore

Done, now you have all the bigrams for the corpus vocabulary!

In [42]:
# Get list of all tokens by chaining together all dictionary values
all_tokens = list(chain.from_iterable([item for item in all_tweets_dict.values()]))


In [43]:
## CREATE BIGRAMS

# most of this code was adapted from "tutorial_05_answer" example that was posted on moodle

# set up the nltk.collocations association measures for later use
bigram_measures = nltk.collocations.BigramAssocMeasures()

# use the method nltk.collocations.BigramCollocationFinder.from_words to get all bigrams
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_tokens)

# use a lambda function to drop every bigram that contains a word shorter than 3 characters
bigram_finder.apply_word_filter(lambda w: len(w) < 3)

# get the top 200 bigrams using nbest(pmi, 200)
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200)

# new list for top 200 concatenated bigrams
top_200_bigrams_connected = []

# iterate through all bigrams and connect them with '_'
for item in top_200_bigrams:
    
    # append all connected bigrams 
    top_200_bigrams_connected.append(item[0] + '_' + item[1])
    
# Now we have the top 200 bigrams 
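
The cell above ranks bigrams by PMI. As a rough illustration of what that score measures, here is a standard-library sketch of the textbook definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The function `bigram_pmi` is hypothetical, and the counting convention (word counts and the total number of tokens as denominators) is an assumption that may differ in detail from what NLTK does internally.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return a dict mapping each adjacent word pair to its PMI score."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {
        (first, second): math.log2(
            count * total / (word_counts[first] * word_counts[second])
        )
        for (first, second), count in bigram_counts.items()
    }


toy_scores = bigram_pmi(["new", "york", "new", "york", "big", "city"])
print(toy_scores[("new", "york")], toy_scores[("big", "city")])
```

In this toy input the pair ("big", "city") occurs only once but scores higher than ("new", "york"), because both of its words occur nowhere else. PMI tends to favour such rare pairs, which is worth keeping in mind when reading the top 200 list.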

### 3.  Combine unigrams and bigrams into one list and sort 

In [44]:
# new list for both unigrams and bigrams of corpus vocabulary 
corpus_vocabulary = list(unigram_vocabulary) + list(top_200_bigrams_connected)

# sort the new list
corpus_vocabulary = sorted(corpus_vocabulary)

### 4. Write unigrams and bigrams to txt file

In the final step we write all unigrams and bigrams to a txt file.

In [45]:
# we create a txt file and set it to write mode (w)
token_file = open('<26255367>_vocab.txt','w')

# iterate through all the vocabs
for i in range(len(corpus_vocabulary)):
        
    # combine both the vocab and its id
    info = f"{corpus_vocabulary[i]}:{i}"
    
    # write to file
    token_file.write(info)
    
    # go to the next file line
    token_file.write("\n")

# we close the text file after writing to it
token_file.close()
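
The `token:index` lines written above can also be produced with `enumerate`, which avoids indexing the list by hand. This is a small sketch; the helper `vocab_lines` and the toy vocabulary are made up for illustration.

```python
def vocab_lines(vocabulary):
    """Return one 'token:index' string per vocabulary entry, in list order."""
    return [f"{token}:{index}" for index, token in enumerate(vocabulary)]


toy_vocabulary = sorted(["mask", "new_york", "covid"])
toy_lines = vocab_lines(toy_vocabulary)
print("\n".join(toy_lines))
```

The lines could then be written inside a `with open(...)` block, which closes the file even if writing fails part way through.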

# <span style="color:blue">Task 2.2 - Generate the Sparse Representation of the Corpus Vocabulary</span>

After getting the corpus vocabulary we can create a count vector which is used for further text analysis. To create the sparse representation we need to follow the following set of steps:

1. Create a dict called corpus_vocab_dict_id that contains all tokens of the corpus vocabulary as keys and id as values

2. Create a dict called sparse_vocab_dict in which each key-value pair represents each day/sheet. Only stem each token and only append as a value if they are in the corpus vocabulary 

3. Create a text file to which we write the sparse representation

4. Assign each token in sparse_vocab_dict with their matching id number from corpus_vocab_dict_id

5. Write the date of the iterated sheet to the text file

6. Get frequency of each token in tokens_per_sheet_id

7. Go to the next line at the end of each sheet and close the file once all sheets are written

Now you have the sparse representation of the corpus vocabulary written to a text file!

In [46]:
# dict to save all the vocabs and their id in 
corpus_vocab_dict_id = {}

# iterate through the list of vocabs
for i in range(len(corpus_vocabulary)):
    
    # add each vocab and its id to the dict
    corpus_vocab_dict_id[corpus_vocabulary[i]] = i

    
# create a dict for each day/sheet which only contains all tweet tokens that are part
# of the corpus vocabulary
sparse_vocab_dict = {}

# loop through the dict containing all tokens 
for key,value in all_tweets_dict.items():
    
    # create a list to save all the tokens in
    vocab_token_list = []
    
    # loop through each token
    for item in value:
        
        # stem token
        word = porter.stem(item)
        
        # check if token is part of the vocabulary
        if word in unigram_vocabulary:
            
            # append all tokens that are part of the vocabulary
            vocab_token_list.append(word)
    
    # add the date as key with the list of tokens as a value to sparse_vocab_dict
    sparse_vocab_dict[key] = vocab_token_list
            

## now we can create the actual sparse representation and write it to a text file 

# create a text file to write to
sparse_rep = open('<26255367>_countVec.txt', 'w')

for key, value in sparse_vocab_dict.items():
    
    # use list comprehension to loop through all tokens in value and assign them their
    # matching number 
    tokens_per_sheet_id = [corpus_vocab_dict_id[x] for x in value]
    
    # write date to the file 
    sparse_rep.write(key)
    
    # write every token id and its frequency on this sheet as ',id:count'
    for keys, vals in FreqDist(tokens_per_sheet_id).items():
        sparse_rep.write(f',{keys}:{vals}')
    
    # after each sheet/day is done go to the next line like shown in the sample output
    sparse_rep.write('\n')
    
    
# close the text file
sparse_rep.close()
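
To show the line format of the sparse representation in isolation, here is a sketch that builds one output line for one day from a token list and a token-to-id mapping. The function `count_vector_line` and the toy data are hypothetical. Sorting the pairs by id is a choice made here for readability; the cell above writes them in the order in which the ids are first seen.

```python
from collections import Counter


def count_vector_line(date, tokens, vocab_ids):
    """Return 'date,id:count,id:count,...' for the tokens found in vocab_ids."""
    # tokens that are not in the vocabulary are skipped
    counts = Counter(vocab_ids[token] for token in tokens if token in vocab_ids)
    pairs = [f"{token_id}:{count}" for token_id, count in sorted(counts.items())]
    return ",".join([date] + pairs)


toy_ids = {"covid": 0, "mask": 1, "vaccine": 2}
toy_line = count_vector_line("2020-03-01", ["mask", "covid", "mask", "unknown"], toy_ids)
print(toy_line)
```

Only ids with a non-zero count appear in the line, which is what makes the representation sparse.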

# <span style="color:blue">Task 2.3 - Find Top 100 Unigrams each day</span>

### 1. Parse the xlsx File

Like in the previous task we need to read in the xlsx file and each of the sheets.
Since we have done that before we will be reusing the read in data from earlier saved in the dict: all_tweets_dict.

In this case, it is important to save the data from each sheet separately since we are searching for the top 100 unigrams and bigrams of each day (each sheet).

### 2. Create unigrams for each sheet

In this step, we will extract all text data from each sheet (representing one day of tweets) and then process the text to extract the top 100 unigrams. To do that we need to follow the following steps:

1. Iterate through the dict containing all tokens by date/sheet.
   For each sheet then:

2. Iterate through all tokens per sheet

3. Remove all regular stop words

4. Remove all rare words

5. Remove any token that is in context_dependent_stop_words

6. Remove any token with a length smaller than three

7. Stem the remaining tokens using the PorterStemmer

8. Append all stemmed tokens to the all_tokens_each_day list

9. Get frequency of words using FreqDist 

10. Save word and frequency in a list of tuples

11. Sort the list based on word frequency, keep the first 100 entries and save them in a dict called all_unigrams

In [47]:
# dict that saves the top 100 unigrams of all excel sheets 
all_unigrams = {}

# iterate through the tokens of each sheet/day
for key,values in all_tweets_dict.items():
    
    all_tokens_each_day = []
    list_of_tup = []
    
    # loop through all tokens per sheet
    for word in values:
        
        # check if word is in regular stopwords 
        if word not in stop_words:

            # check if word is in rare words
            if word not in rare_words:

                # check if word is in context dependent stopwords
                if word not in context_dependent_stop_words:

                    # check if word is equal or longer than len() = 3 
                    if len(word) >= 3:

                        # stem each word using the porter stemmer
                        word = porter.stem(word)
                        #print(word)

                        # append each word to the all_tokens_each_day list
                        all_tokens_each_day.append(word)
    
    # count frequency of each word per sheet              
    top_100_tokens_each_day = FreqDist(all_tokens_each_day)

    # iterate through the items of frequency dict
    for keys, values in top_100_tokens_each_day.items():
        list_of_tup.append((keys,values))
    
    # use lambda function to sort the list of tuples by the frequency of each token
    # used the logic provided by user: Sven Marnach
    # from: https://stackoverflow.com/questions/8459231/sort-tuples-based-on-second-parameter
    # accessed on: 11/09/2020
    list_of_tup = sorted(list_of_tup,key=lambda x: x[1], reverse = True) 
    
    # append key and value to dict 
    all_unigrams[key] = list_of_tup[:100]
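
The count, sort and slice sequence in this cell can be condensed with `most_common`. The sketch below uses `collections.Counter` from the standard library; `FreqDist` in NLTK 3 subclasses `Counter`, so the same method should be available there, but treat that as an assumption to verify against the installed version. The helper `top_n` and the toy tokens are hypothetical.

```python
from collections import Counter


def top_n(tokens, n=100):
    """Return the n most frequent tokens as (token, count) tuples."""
    return Counter(tokens).most_common(n)


toy_top = top_n(["mask", "covid", "mask", "vaccine", "covid", "mask"], n=2)
print(toy_top)
```

If a day has fewer than `n` distinct tokens, `most_common` simply returns all of them.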


### 3. Write top 100 unigrams  each day + frequency to txt file

In the final step, we write the top 100 unigrams of each day and their respective frequency to a txt file.

In [48]:
# we create a txt file and set it to write mode (w)
unigram_file = open('<26255367>_100uni.txt','w')

# iterate through dict containing all unigrams and their counts
for key,value in all_unigrams.items():
        
        # write date to the file
        unigram_file.write(f"{key}:")
        
        # write list of top 100 unigrams of each sheet to the file
        unigram_file.write(str(value))
        
        # go to next line
        unigram_file.write("\n")

# we close the text file after writing to it
unigram_file.close()

# <span style="color:blue">Task 2.4 - Find Top 100 Bigrams each day</span>

A bigram is a pair of adjacent words; frequent bigrams are words that commonly appear together (like "monash", "university"). To collect the top 100 bigrams of each day/sheet we need to follow the following set of steps:

1. Iterate through the dict containing all tokens by date/sheet.
   For each sheet then:

2. Generate all bigrams from the sheet's tokens using ngrams()

3. Get the frequency of each bigram using FreqDist 

4. Save each bigram and its frequency in a list of tuples

5. Sort the list by frequency in descending order, keep the first 100 entries and save them in a dict called all_bigrams

### 1. Generate top 100 bigrams each day

In [49]:
# dict that saves the top 100 bigrams of all excel sheets 
all_bigrams = {}

# iterate through the values in each dict
for key,values in all_tweets_dict.items():
    
    # create list to hold bigrams and their frequency
    list_of_tup_bi = []
    
    # in this step we get all possible bigrams for each day/sheet using ngrams() and then
    # count their frequency using FreqDist function
    bigrams_frequency = FreqDist(ngrams(values, n=2))


    # iterate through the items of frequency dict
    for keys, values in bigrams_frequency.items():
        list_of_tup_bi.append((keys,values))
    
    # use lambda function to sort the list of tuples by the frequency of each token
    # used the logic provided by user: Sven Marnach
    # from: https://stackoverflow.com/questions/8459231/sort-tuples-based-on-second-parameter
    # accessed on: 11/09/2020
    list_of_tup_bi = sorted(list_of_tup_bi ,key=lambda x: x[1], reverse = True) 
    
    # append key and value to dict
    if len(list_of_tup_bi) > 100:
        all_bigrams[key] = list_of_tup_bi[:100]
    else:
        all_bigrams[key] = list_of_tup_bi
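
For reference, adjacent-pair counting can be reproduced with the standard library alone. The sketch below pairs each token with its successor via `zip` and counts the pairs; `top_bigrams` and the toy tokens are hypothetical names used only for this example.

```python
from collections import Counter


def top_bigrams(tokens, n=100):
    """Return the n most frequent adjacent token pairs with their counts."""
    # zip(tokens, tokens[1:]) yields (t0, t1), (t1, t2), ... like ngrams(tokens, 2)
    return Counter(zip(tokens, tokens[1:])).most_common(n)


toy_bigrams = top_bigrams(["stay", "home", "stay", "safe", "stay", "home"], n=1)
print(toy_bigrams)
```

Two observations follow from the cell above. Slicing with `[:100]` already returns the whole list when it holds fewer than 100 items, so both branches of the `if`/`else` give the same result. Also, because each sheet's tokens were flattened into one list earlier, some counted bigrams join the last word of one tweet with the first word of the next.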

### 2. Write top 100 bigrams  each day and their frequency to txt file

In the final step we write the top 100 bigrams and their respective frequency to a txt file.

In [50]:
# we create a txt file and set it to write mode (w)
bigram_file = open('<26255367>_100bi.txt','w')

# iterate through dict containing all bigrams and their counts
for key,value in all_bigrams.items():
        
        # write date to the file
        bigram_file.write(f"{key}:")
        
        # write list of top 100 bigrams each sheet to file
        bigram_file.write(str(value))
        
        # go to next line
        bigram_file.write("\n")
        

# we close the text file after writing to it
bigram_file.close()

# 3. Summary

The main difficulty of task 2 was to "re-engineer" the text processing steps taken to arrive at the sample output. Since the order of steps was not given and in some cases steps were skipped or executed in a different order it was hard to match the sample output.

Thus, to get closer to the desired output I had to read in the sample output files and analyze them to see patterns within the output. Such patterns included whether words seemed to be stemmed, whether they were shorter than three characters, or whether they contained stop words or rare words. After finding such patterns in the different output files, one could figure out which steps were taken and process the input files in the same way.

Since the tutorials worked on similar tasks, I reused the logic of the code in many areas. Whenever I reused code logic from the tutorials I noted it right above the code.
