# Processing Text Data with NLTK
#### Author Name: Dahye Kim

Date: 02/09/2020

Version: 1.0

Environment: Python 3.7.9 and Jupyter notebook

Libraries used: 
* pandas (for reading the Excel file and building a data frame of tweets; included in Anaconda Python 3) 
* nltk (for tokenisation, frequency distributions, collocations and stemming) 
* xlrd (for parsing the Excel file)
* numpy (for wrangling the parsed sheets into data frames) 
* langid (for identifying the language of each tweet) 

## 1.  Import libraries 

In [3]:
from __future__ import division # true division (already the default on Python 3) 
from itertools import chain # chaining a series of iterables into one 

import pandas as pd # parsing Excel file and organise the parsed object into data frame 
import xlrd # reading excel file and parse each sheet in excel file 
import langid # identify the language used in the tweet 
import nltk # for text analysis - including retrieving document frequency, vocab frequency, extracting bigrams, etc. 
import numpy as np # for wrangling the data frame 
from nltk.stem import PorterStemmer # stemming tokens after necessary filtering steps 
from nltk.probability import FreqDist as fd # retrieving document frequency and vocab frequency
from nltk.util import ngrams # retrieving all the possible bigram combinations 
from sklearn.feature_extraction.text import CountVectorizer as cv # create sparse matrix using count vectors 

## 2. Parsing the excel file and create data frame for all the tweets

To parse the Excel file, I imported pandas and xlrd. The file is read with the ```ExcelFile()``` function from pandas, and each sheet is then parsed into its own data frame. I used pandas and numpy to handle the missing values and errors that can arise in the parsing process. 

In [4]:
excel_data = pd.ExcelFile('semiStructuredTweets.xlsx')

In [5]:
dfs = []
# all the excel tweets parsed by .parse() function are appended into dfs

for i in range(len(excel_data.sheet_names)):
    df = excel_data.parse(i)
    # a new data frame created by the .parse() function 
    
    df.dropna(axis = 1, how = 'all', inplace = True)
    df.dropna(axis = 0, how = 'all', inplace = True)
    # drop all the empty rows and columns 
    
    df.columns = df.iloc[0,:]
    # the new data frame's column name is the first row of the data frame ('text', 'id', 'created_at')
    
    df.drop(df.index[0], axis = 0, inplace = True)
    # the first row, after becoming the column name of each data frame, is dropped 
    
    df.iloc[:,2]=df.iloc[:,2].apply(lambda x: x.split('T')[0])
    # only keeping the date for each tweet 
    # the created-date of the tweet is in the third column of each data frame 
    
    df.reset_index(inplace = True, drop=True)
    # the row index is reset 
    
    dfs.append(df)
    # new data frame is appended into dfs list 

While looping through the data frames, I found that some of them did not have ```['text', 'id', 'created_at']``` as their column names. In these data frames, the first tweet of the sheet had been assigned as the column names. I corrected this by inserting the column names back into the data frame as a new first row using numpy. In the same loop, I called ```langid.classify()``` on each tweet to identify whether it is written in English. All non-English tweets are dropped after the classification.
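
The header repair described above can be sketched on a toy sheet. This is a minimal illustration under stated assumptions: the helper name `restore_header_row` and the toy data are hypothetical, and it uses `pd.concat` rather than `np.insert` to push the mistaken header back in as the first row.

```python
import pandas as pd

EXPECTED_COLUMNS = ['text', 'id', 'created_at']


def restore_header_row(frame, expected=EXPECTED_COLUMNS):
    """Return a frame labelled with `expected` column names.

    If the current labels differ, they are assumed to be a tweet that was
    mistaken for the header, and they are re-inserted as the first row.
    """
    if list(frame.columns) == list(expected):
        return frame.copy()
    # the mistaken header becomes a one-row frame with the correct labels
    header_row = pd.DataFrame([list(frame.columns)], columns=expected)
    body = frame.copy()
    body.columns = list(expected)
    return pd.concat([header_row, body], ignore_index=True)


# hypothetical sheet whose first tweet ended up as the column names
broken = pd.DataFrame(
    [['second tweet', 2, '2020-07-01']],
    columns=['first tweet', 1, '2020-07-01'],
)
fixed = restore_header_row(broken)
print(fixed)
```

Returning a new frame, rather than modifying the input in place, makes it harder to forget to store the repaired result.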

In [6]:
from nltk.tokenize import RegexpTokenizer as tokeniser 
tokenizer = tokeniser(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
tokens = dict()

for i in range(len(excel_data.sheet_names)): 
# looping through each data frame in dfs 

    if dfs[i].columns[0] != 'text' or dfs[i].columns[1]!='id' or dfs[i].columns[2]!='created_at': 
    # the data frame whose column names are not ['text', 'id', 'created_at'], the first tweet should've been assigned 
    # as the column name.
    
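        dfs[i] = pd.DataFrame(np.insert(dfs[i].values, 0, values = dfs[i].columns, axis = 0),
                              columns = ['text', 'id', 'created_at'])
        # the old column names (really the first tweet) are inserted as the first row with np.insert(),
        # and the rebuilt data frame, with the correct column names, replaces dfs[i]
        dfs[i]['created_at'] = dfs[i]['created_at'].apply(lambda x: str(x).split('T')[0])
        # the restored row missed the date trimming in the previous cell, so it is applied again here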
        
    dfs[i]['lang'] = dfs[i]['text'].apply(lambda x: langid.classify(str(x))[0])
    # identify the language of each piece of tweet with langid.classify(). A new column 'lang' is created
    dfs[i] = dfs[i][dfs[i].lang == 'en'].copy()
    # keep only the tweets whose language is English 
    # .copy() avoids pandas' SettingWithCopyWarning on the column assignments below 
    dfs[i].drop(['lang'], inplace = True, axis = 1)
    # drop the 'lang' column after using it for filtering purpose 

    dfs[i]['text']=dfs[i]['text'].apply(lambda x: str(x).lower())
    # change all the tweets to lower-case letters 
    dfs[i]['tokens']=dfs[i]['text'].apply(lambda x: tokenizer.tokenize(x))
    # create a new column called 'tokens', which contains all the tokens of respective tweets 
    tokens[dfs[i].iloc[0,2]] = list(chain.from_iterable(list(dfs[i]['tokens'])))
    # create a key-value pair in the dictionary token, whose key is the created_date of tweets in each sheet
    # the value of the key is the list of tokens from each tweet 


# 3. Corpus Vocabularies

For the corpus vocabulary, I first retrieve the 'bag of words' containing every word that occurs in any tweet. The list ```words``` contains all the words with repetition, and the set ```vocabs``` keeps only the unique words. After that, I retrieved the bigrams and unigrams for the corpus vocabulary list. The steps I took for the bigrams are as follows: 

* Retrieving bigrams 
    1. Use the collocation finder to locate all the bigrams in the bag of words 
    2. Filter out every bigram that contains a word whose length is less than 3 
    3. Use the PMI measure to retrieve the top 200 bigrams from the filtered list of bigrams 
    4. Join each bigram into the 'xxx_xxx' format 
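
To make the PMI step concrete, here is a minimal standard-library sketch of the usual pointwise mutual information score for adjacent word pairs, log2(count(w1, w2) * N / (count(w1) * count(w2))). The function name `pmi_scores` and the toy word list are hypothetical; the notebook itself relies on NLTK's collocation finder for this.

```python
import math
from collections import Counter


def pmi_scores(words, min_len=3):
    """Score adjacent word pairs by PMI, skipping pairs with a short word."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        if len(w1) < min_len or len(w2) < min_len:
            continue  # same idea as the word-length filter in step 2
        scores[(w1, w2)] = math.log2(
            pair_count * total / (unigram_counts[w1] * unigram_counts[w2])
        )
    return scores


toy_words = ['machine', 'learning', 'is', 'fun', 'machine', 'learning',
             'and', 'deep', 'learning']
scores = pmi_scores(toy_words)
best = max(scores, key=scores.get)
print('_'.join(best))  # and_deep
```

In this toy input the pair seen only once, `and_deep`, outscores the repeated `machine_learning`, because both of its words occur nowhere else. PMI rewards exclusivity rather than raw frequency.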

## 3.1 Vocab Bigrams

In [7]:
words = list(chain.from_iterable(tokens.values()))
# create a list of words composed of the tokens from all the tweets
vocabs = set(words)
# use set() function to retrieve unique words and remove repetitive words from the list words

In [8]:
bigram_measures = nltk.collocations.BigramAssocMeasures()

bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(words)
# use collocations to locate the bigrams from the bag of words 

bigram_finder.apply_word_filter(lambda w: len(w) < 3)
# filter out every bigram that contains a word shorter than 3 characters 

vocabBigrams = bigram_finder.nbest(bigram_measures.pmi, 200)
# use the PMI measure to retrieve the 200 highest-scoring bigrams 
vocabBigrams = [i[0]+'_'+i[1] for i in vocabBigrams]
# apply the 'xxx_xxx' format to all the bigrams retrieved 

### Possible shortcomings of this approach: 

Before scoring the bigrams from the bag of words, I filtered out every bigram containing a word whose length is less than 3. This could discard many bigrams that are critical for analysing tweets on specific topics. 
When creating a corpus vocabulary, it could be more reasonable to rank the bigrams with the available association measures first, and apply the length filter only afterwards.

## 3.2 Vocab Unigrams  

To retrieve the unigrams, I followed the steps below: 

1. Use ```FreqDist()``` to retrieve the document frequency of each unique word, where each day's sheet of tweets is treated as one document. 
2. Create the sets of context-independent stopwords, context-dependent stopwords, and rare tokens, based on the given stopword list and the document frequencies. 
3. Filter out the context-independent stopwords. 
4. Filter out the context-dependent stopwords. 
5. Remove the tokens whose length is less than 3. 
6. Remove the rare tokens. 
7. Use ```PorterStemmer()``` to stem the rest of the tokens. 
8. Create a set of unique unigrams.
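
The document-frequency idea behind the first two steps can be sketched with the standard library alone. The names and the toy thresholds below are hypothetical and far smaller than the 60 and 5 used in this notebook. Each token is counted at most once per document, so the count is the number of documents that contain it.

```python
from collections import Counter
from itertools import chain


def document_frequency(docs):
    """Count, for each token, how many documents it appears in."""
    # set() collapses repeats inside one document before counting
    return Counter(chain.from_iterable(set(tokens) for tokens in docs.values()))


toy_docs = {
    '2020-07-01': ['covid', 'covid', 'mask', 'today'],
    '2020-07-02': ['covid', 'mask', 'lockdown'],
    '2020-07-03': ['covid', 'vaccine'],
}
doc_freq = document_frequency(toy_docs)

# toy thresholds chosen only for this three-document example
too_frequent = {token for token, n in doc_freq.items() if n > 2}
too_rare = {token for token, n in doc_freq.items() if n < 2}
print(too_frequent)  # {'covid'}
```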

### 3.2.1 context-independent and -dependent stopwords  

In [10]:
with open('stopwords_en.txt', 'r') as stopword_file:
    stopwords = set(word.rstrip() for word in stopword_file)
# read each line of 'stopwords_en.txt' to retrieve all the context-independent stopwords 

freqDist = fd(list(chain.from_iterable([set(value) for value in tokens.values()])))
# get the document frequency of each token with freqDist function by chaining the list of tokens of each value in the 
# tokens dictionary 

frequentTokens = set([key for key, val in freqDist.items() if val >60])
# the threshold for the context-dependent stopwords is 60. 
# The tokens whose document frequency is larger than 60 are categorised as context-dependent stopwords 

rareTokens = set([key for key, val in freqDist.items() if val <5])
# the threshold for the rare tokens is 5. 
# The tokens whose document frequency is smaller than 5 are categorised as rare tokens 

### 3.2.2 Extracting unigrams  

In [11]:
stemmer = PorterStemmer()
# the unigrams are stemmed with porter stemmer 

unigrams = list(chain.from_iterable(tokens.values()))
# create a list of words composed of the tokens from all the tweets

unigrams = [stemmer.stem(w)
            for w in unigrams
            if w not in stopwords
            and w not in frequentTokens
            and len(w) >= 3
            and w not in rareTokens]
# remove the context-independent stopwords, context-dependent stopwords, short tokens and rare tokens, 
# then stem the tokens that remain 

vocabUnigrams = set(unigrams)
# extract unique words from the refined list of tokens 

## 3.3 Writing the corpus vocabularies into a file  

Each entry in the corpus vocabulary file is assigned a unique integer ID. This ID is used when creating the sparse matrix of the documents.
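
A minimal sketch of the `token:index` line format, written to an in-memory buffer and read back, may help show how the IDs can be recovered later. The helper names `write_vocab` and `read_vocab` and the three toy entries are hypothetical.

```python
import io


def write_vocab(vocab, handle):
    """Write one 'token:index' line per entry, in sorted order."""
    for index, item in enumerate(sorted(vocab)):
        handle.write('{}:{}\n'.format(item, index))


def read_vocab(handle):
    """Parse 'token:index' lines back into a token -> index dictionary."""
    mapping = {}
    for line in handle:
        # split on the last colon so the index is always the final field
        token, _, index = line.rstrip('\n').rpartition(':')
        mapping[token] = int(index)
    return mapping


buffer = io.StringIO()
write_vocab({'mask', 'covid', 'stay_home'}, buffer)
buffer.seek(0)
vocab_ids = read_vocab(buffer)
print(vocab_ids)  # {'covid': 0, 'mask': 1, 'stay_home': 2}
```

Sorting before enumerating keeps the IDs reproducible, since iterating over a set does not guarantee any particular order.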

In [12]:
sampleVocab = open('vocab.txt', 'w')

corpusVocabs = list(vocabUnigrams) + list(vocabBigrams)
# create a list of corpus vocabularies by adding the list of unigrams and bigrams together 

for index, item in enumerate(sorted(corpusVocabs)): 
    # sort the list of corpus vocabulary when writing into the file 
    sampleVocab.write('{}:{}'.format(item, index))
    # assigning token id to each token 
    sampleVocab.write('\n')

sampleVocab.close()

# 4. Unigrams  

To extract unigrams from each document, I applied the steps below to the 'bag of words' of each document: 

1. Filter out the context-independent stopwords. 
2. Filter out the context-dependent stopwords. 
3. Remove the tokens whose length is less than 3. 
4. Remove the rare tokens. 
5. Use ```PorterStemmer()``` to stem the rest of the tokens. 
6. Use the ```FreqDist().most_common()``` function to extract the 100 most common unigrams and their frequencies, sorted by frequency. 

In [13]:
dailyUnigrams=dict()

for date in tokens.keys(): 
    temp = [stemmer.stem(w) for w in tokens[date]
            if w not in stopwords
            and w not in frequentTokens
            and len(w) >= 3
            and w not in rareTokens]
    dailyUnigrams[date]=fd(temp).most_common(100)
    # keep the 100 most frequent unigrams from each day of tweets 
    # most_common() returns (unigram, frequency) tuples sorted by descending frequency 

## 4.1 Writing the 100 most frequent unigrams of each day and their frequencies to a file  

In [14]:
sampleVocab = open('100uni.txt', 'w')

for key, value in dailyUnigrams.items(): 
    sampleVocab.write('{}:{}'.format(key, value))
    sampleVocab.write('\n')

sampleVocab.close()

# 5. Bigrams  

To extract bigrams from each document, I applied the steps below to the 'bag of words' of each document: 

1. Generate every bigram (pair of adjacent tokens) from the bag of words with ```ngrams()```.
2. Use the ```FreqDist().most_common()``` function to extract the 100 most common bigrams and their frequencies, sorted by frequency. 
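
As a minimal sketch of these two steps without NLTK: adjacent pairs can be produced with `zip`, and `collections.Counter` (the class that NLTK's `FreqDist` subclasses) provides `most_common()`, which already returns its result sorted by descending count. The function name `top_bigrams` and the toy tokens are hypothetical.

```python
from collections import Counter


def top_bigrams(tokens, k=100):
    """Return the k most frequent adjacent token pairs with their counts."""
    pairs = zip(tokens, tokens[1:])  # every pair of neighbouring tokens
    return Counter(pairs).most_common(k)


toy_tokens = ['stay', 'at', 'home', 'stay', 'at', 'work']
result = top_bigrams(toy_tokens, k=2)
print(result[0])  # (('stay', 'at'), 2)
```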

In [15]:
dailyBigrams=dict()

for date in tokens: 
    fdbigram = fd(ngrams(tokens[date], n = 2))
    
    # ngrams() generates every pair of adjacent tokens in the bag of words, 
    # and FreqDist counts the frequency of each bigram 
    
    dailyBigrams[date] = fdbigram.most_common(100)
    # each bigram is presented as a tuple
    # most_common(100) keeps only the 100 most frequent bigrams, 
    # already sorted by descending frequency 
    

### Possible shortcomings of this approach: 

The bigram output contains many bigrams that are composed of two context-independent or context-dependent stopwords. Such bigrams occur quite frequently in each document and crowd out the bigrams that would help in analysing each text document. The output bigrams could be more meaningful if we: 

* extract meaningful bigrams with collocation finders and association measures 
* remove the bigrams composed of two context-independent stopwords
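
The second suggestion could look roughly like the sketch below, which drops a bigram only when both of its words are stopwords and then takes the most frequent of what remains. The function name, the toy tokens and the toy stopword set are all hypothetical, and this is not what the cell above does.

```python
from collections import Counter


def top_content_bigrams(tokens, stopwords, k=100):
    """Most frequent bigrams, ignoring pairs made of two stopwords."""
    counts = Counter(zip(tokens, tokens[1:]))
    kept = {
        pair: n for pair, n in counts.items()
        if not (pair[0] in stopwords and pair[1] in stopwords)
    }
    return Counter(kept).most_common(k)


toy_tokens = ['this', 'is', 'the', 'new', 'normal',
              'this', 'is', 'the', 'new', 'normal']
toy_stopwords = {'this', 'is', 'the'}
content_bigrams = dict(top_content_bigrams(toy_tokens, toy_stopwords))
print(content_bigrams[('new', 'normal')])  # 2
```

Bigrams with a single stopword, such as `('the', 'new')`, are kept here; whether those should also be removed is a separate design choice.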


## 5.1 Writing the 100 most frequent bigrams of each day and their frequencies to a file  

In [16]:
sampleVocab = open('100bi.txt', 'w')

for key, value in dailyBigrams.items(): 
    sampleVocab.write('{}:{}'.format(key, value))
    sampleVocab.write('\n')

sampleVocab.close()

# 6. Sparse Matrix  

In [17]:
vocabs = vocabUnigrams | set(vocabBigrams)
# create a unified set of corpus vocabularies (unigrams and bigrams) 
# this aims to make the look-ups efficient when creating the sparse matrix 

In [18]:
# creating a dictionary of tokens and their respective created date
# the tokens in this dictionary are the ones in the corpus vocabulary

filteredToken=dict()
# a dictionary for creating the sparse matrix 

for date in tokens.keys(): 
    
    # stem each token of the day's tweets once, 
    # and keep only the stems that are in the corpus vocabulary 
    filteredToken[date] = [stem for stem in map(stemmer.stem, tokens[date])
                           if stem in vocabs]

In [19]:
matrix = open('countVec.txt', 'w')

vocabDict = {vocab:index for index, vocab in enumerate(sorted(list(vocabs)))}
# assign index ID to each vocabulary in corpus vocabs list

for date, unigram in filteredToken.items(): 
    
    matrix.write(date)
    # the start of the line is the date (document name) of the tweets 
    
    d_idx = [vocabDict[singleToken] for singleToken in unigram]
    # map each token of the document to its index in the corpus vocabulary 
    
    for k, v in fd(d_idx).items(): 
        matrix.write(',{}:{}'.format(k, v))
        # write how many times each vocabulary index occurs in this document 
    matrix.write('\n')

matrix.close()
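
Each line of `countVec.txt` therefore has the form `date,index:count,index:count,...`. A minimal sketch of encoding and decoding one such line follows. The helper names and toy vocabulary are hypothetical, and writing the indices in sorted order is an assumption of this sketch rather than something the cell above guarantees.

```python
from collections import Counter


def encode_line(doc_name, tokens, vocab_ids):
    """Build 'doc,index:count,...' from the tokens found in vocab_ids."""
    counts = Counter(vocab_ids[t] for t in tokens if t in vocab_ids)
    body = ','.join('{}:{}'.format(i, n) for i, n in sorted(counts.items()))
    return '{},{}'.format(doc_name, body) if body else doc_name


def decode_line(line):
    """Parse 'doc,index:count,...' back into (doc, {index: count})."""
    doc_name, *pairs = line.split(',')
    counts = {int(i): int(n) for i, n in (p.split(':') for p in pairs)}
    return doc_name, counts


toy_vocab_ids = {'covid': 0, 'mask': 1, 'stay_home': 2}
line = encode_line('2020-07-01', ['mask', 'covid', 'mask', 'unknown'], toy_vocab_ids)
print(line)  # 2020-07-01,0:1,1:2
```

Only non-zero counts are written, which is what keeps the representation sparse.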
