# Data Preprocessing and LDA Model Fitting for Iteration 3 of Fair Is:
In this notebook I document preprocessing and LDA model fitting of my Primary Dataset for iteration 3 of the Fair Is project. 
For more information on the previous iterations of this project, see my Data Management Plan and Methodologies Statement. 

To see how the Primary Dataset was created see Data Creation for Iteration 3 of Fair Is.

This notebook is split between Preprocessing Steps and Model Fitting for our corpus of 308 papers.

I conducted the following preprocessing steps:
- **Tokenization**
    - *ngrams*
    - *bi-grams*
    - *ngram verbs*
    - *ngram nouns*
    - *bigram nouns*
- **Lemmatization**
- **Creation of Dictionary and Document Term Matrices**

I conducted the following Topic Modeling Steps: 
- **Fit model using LDA**
- **Fit another topic model using CorEx (Correlation Explanation)**

- **Further Normalization**

### Importing Libraries and Packages:

In [1]:
import pandas as pd
import json
import csv
import nltk as nltk
import gensim as gm
import os
import os.path
import numpy as np


In [2]:
#in order to use the word_tokenize function we need the nltk punkt package. 
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/aster/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Accessing Clean Dataset

In [2]:
clean_data = os.path.join('../data/processed_data/csv/cleaned_primary_data_12022021.csv')
data = pd.read_csv(clean_data)

In [3]:
#just checking out data real quick
data.head()

Unnamed: 0,X,title,abstract
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...
1,2,fairness academic course timetabling,consider problem creating fair course timetab...
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...


In [4]:
data.columns

Index(['X', 'title', 'abstract'], dtype='object')

## Preprocessing:

Preprocessing is somewhat similar to cleaning: it describes the steps taken to prepare data to be fit to a model. While the primary dataset is structured and cleaned, it's still much closer to unstructured text than to something machine readable. 

### Tokenization

*Tokenization* is the process of separating meaningful strings of text into units called tokens. In text analysis, and Natural Language Processing more generally, models don't "understand" or "read" text in the way a human does, so the text has to be broken into discrete units that can be counted. 

In tokenization I'm also thinking about ngrams, sequences of tokens where *n* is some number: a single token is a unigram, two tokens are a bigram, and so forth.

Using bigrams allows us to take account of terms like "machine learning" rather than treat them as separate terms. 
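To make the idea of unigrams and bigrams concrete, here is a minimal sketch in plain Python. It assumes the text is already cleaned (lowercase, no punctuation), so splitting on whitespace stands in for `nltk.word_tokenize`; the helper name `bigrams` is my own and only mirrors what `nltk.bigrams` does in the cells below.

```python
def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

# One cleaned title from the dataset, split on whitespace into unigram tokens.
title_tokens = "fairness academic course timetabling".split()

# Adjacent pairs of unigrams form the bigrams.
title_bigrams = bigrams(title_tokens)

print(title_tokens)
print(title_bigrams)
```

A title with four tokens yields three bigrams, since each bigram needs a following token.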

So we will create new columns from our titles and abstracts: 

- unigram tokens for titles
- unigram tokens for abstracts
- bigram tokens for titles
- bigram tokens for abstracts

#### unigram tokens for titles:

The following tokenization code was worked out by Professor Vicky Rampin:

In [5]:
# map applies the function to each row of the column in turn
# x = the specific title being tokenized at that moment

data['title_tokens'] = data['title'].map(lambda x: nltk.word_tokenize(x))

#### bigram tokens for titles:

In [6]:
data['title_bigrams'] = data['title_tokens'].apply(lambda row: list(nltk.bigrams(row)))
#print(data['title_bigrams'])

#### unigram tokens for abstracts:

In [7]:
#now let's apply this to abstracts:
data['abstract_tokens'] = data['abstract'].map(lambda x: nltk.word_tokenize(x))


#### bigram tokens for abstracts:

In [8]:
data['abstract_bigrams'] = data['abstract_tokens'].apply(lambda row: list(nltk.bigrams(row)))
#checking dataframe
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),..."


### Parts of Speech Tagging

In [10]:
#in order to use the nltk pos_tag function we need the averaged_perceptron_tagger package
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aster/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [9]:
#trying parts of speech tagging with unigram tokens
data['title_tokens_pos'] = data['title_tokens'].apply(lambda row: list(nltk.pos_tag(row)))

In [10]:
#does this also work with bigrams? 
data['title_bigrams_pos'] = data['title_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [11]:
#now parts of speech tagging for abstracts bigrams
data['abstract_tokens_pos'] = data['abstract_tokens'].apply(lambda row: list(nltk.pos_tag(row)))
data['abstract_bigrams_pos'] = data['abstract_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [12]:
#checking to see if this worked:
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams,title_tokens_pos,title_bigrams_pos,abstract_tokens_pos,abstract_bigrams_pos
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,...","[(unfair, JJ), (items, NNS), (detection, VBP),...","[((unfair, JJ), (items, NNS)), ((items, NNS), ...","[(measurement, NN), (professionals, NNS), (com...","[((measurement, NN), (professionals, NNS)), ((..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr...","[(fairness, JJ), (academic, JJ), (course, NN),...","[((fairness, JJ), (academic, JJ)), ((academic,...","[(consider, VB), (problem, NN), (creating, VBG...","[((consider, VB), (problem, NN)), ((problem, N..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr...","[(safeguarding, VBG), (ecommerce, NN), (adviso...","[((safeguarding, VBG), (ecommerce, NN)), ((eco...","[(electronic, JJ), (marketplaces, NNS), (trans...","[((electronic, JJ), (marketplaces, NNS)), ((ma..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max...","[(decomposition, NN), (maxmin, NN), (fair, NN)...","[((decomposition, NN), (maxmin, NN)), ((maxmin...","[(propose, JJ), (decomposition, NN), (maxmin, ...","[((propose, JJ), (decomposition, NN)), ((decom..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),...","[(fair, JJ), (assignment, NN), (indivisible, J...","[((fair, JJ), (assignment, NN)), ((assignment,...","[(consider, VB), (discrete, JJ), (assignment, ...","[((consider, VB), (discrete, JJ)), ((discrete,..."
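The tagged columns can be filtered by part of speech, which is one way to get the noun and verb ngrams listed at the top of this notebook. The sketch below is only an illustration: `keep_tags` and `noun_bigrams` are hypothetical helpers, and the tagged tokens are copied from the first row of the output above. Tags follow the Penn Treebank convention, where noun tags start with `NN` and verb tags start with `VB`.

```python
def keep_tags(tagged_tokens, prefixes):
    """Return the words whose POS tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]

def noun_bigrams(tagged_tokens):
    """Return adjacent word pairs in which both words are tagged as nouns."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs
            if t1.startswith("NN") and t2.startswith("NN")]

# The first three tagged title tokens shown in the output above.
tagged_title = [("unfair", "JJ"), ("items", "NNS"), ("detection", "VBP")]

nouns = keep_tags(tagged_title, ("NN",))
verbs = keep_tags(tagged_title, ("VB",))
print(nouns, verbs, noun_bigrams(tagged_title))
```

Note that the tagger labels *detection* as a verb here. The text was tagged after cleaning, so the tagger sees less sentence context than it would in the original prose, and filtered results should be spot-checked.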


### Lemmatization
*Lemmatization* is a commonly used preprocessing step in text analysis. When we lemmatize tokens, we reduce each one to its dictionary form, called a lemma. For example, *running* becomes *run*. 

To lemmatize we will use the WordNetLemmatizer, a tool that is part of the NLTK package. It relies on [WordNet](https://wordnet.princeton.edu/), a lexical database of semantic relations between English words, to lemmatize the words in our abstract and title data. 

Finally, WordNetLemmatizer allows us to choose the part of speech of the lemma. In this iteration I am selecting the verb part of speech, first because I'm considering the "understanding" of fairness as procedural, in action, and second for the sake of making a decision to move through this project. I also leave code as a comment to return noun forms of lemmas (if no part of speech is specified, WordNetLemmatizer defaults to nouns). 
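As an alternative to picking a single part of speech for every token, the POS tags computed earlier could drive the choice per token. This is not what this notebook does; it is a sketch of one possible approach. The helper `penn_to_wordnet` is hypothetical and maps Penn Treebank tags to the single-letter codes that WordNet uses (`'n'` noun, `'v'` verb, `'a'` adjective, `'r'` adverb), falling back to noun as WordNetLemmatizer itself does.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("R"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else

# Tagged tokens in the same (word, tag) shape as the *_tokens_pos columns.
tagged = [("safeguarding", "VBG"), ("ecommerce", "NN"), ("unfair", "JJ")]
pos_choices = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_choices)

# With the lemmatizer defined below, each token could then be lemmatized as:
# lem.lemmatize(word, penn_to_wordnet(tag))
```

The trade-off is that lemmas then depend on tagger accuracy, whereas lemmatizing everything as a verb is a single, consistent decision.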

In [46]:
#in order to use lemmatization we need to import the WordNetLemmatizer and wordnet dictionary
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
from nltk.corpus import treebank

In [27]:
#set the WordNetLemmatizer to a variable
lem = WordNetLemmatizer()

In [91]:
def lem_text(text):
    return [lem.lemmatize(w, 'v') for w in text] #for verbs
    #return [lem.lemmatize(w, 'n') for w in text] #for nouns: remove the hashtag/pound sign here, comment out the verb line above, and rerun the cell

In [92]:
#Lemmatize titles:
data['title_lemmatized']= data['title_tokens'].map(lambda x: lem_text(x))


In [96]:
#Lemmatize abstracts:
data['abstract_lemmatized'] = data['abstract_tokens'].map(lambda x: lem_text(x))

In [106]:
#Get title lemma bigrams:
data['title_bigram_lemmatized']=data['title_lemmatized'].apply(lambda row: list(nltk.bigrams(row)))


In [109]:
#Get abstract lemma bigrams:
data['abstract_bigram_lemmatized']=data['abstract_lemmatized'].apply(lambda row: list(nltk.bigrams(row)))
                                                                     

### Further Cleaning and Normalization:

One last check: are there any rows without values (imagine an empty cell in a spreadsheet) that we need to consider? 

In [113]:
data.isnull().values.any()

False

We'll also deal with a stray column at the beginning by changing its name and setting it as our index. 

In [129]:
#changing column name
data = data.rename(columns={"X":"id"})

In [134]:
#checking to make sure it worked:
#data.head()

In [131]:
#set newly renamed column as index
data = data.set_index('id')

In [133]:
#checking to make sure it worked
#data.head()

Finally we save our preprocessed data and we're ready for Topic Modeling!

In [136]:
data.to_csv('../data/processed_data/csv/processed_primary_data.csv')

In [138]:
#just making sure everything looks good!
#test = pd.read_csv('../data/processed_data/csv/processed_primary_data.csv')
#test.head()
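One thing to keep in mind when reloading this file: CSV cells hold text only, so the token and bigram lists are written out as their string form and come back as strings, not lists. The sketch below illustrates this with the standard library instead of pandas, using a hypothetical one-row miniature of the saved data, and restores the lists with `ast.literal_eval`.

```python
import ast
import csv
import io

# Hypothetical miniature of one saved row; list cells are written via str().
rows = [{
    "id": 1,
    "title_tokens": ["unfair", "items", "detection"],
    "title_bigrams": [("unfair", "items"), ("items", "detection")],
}]

# Write to an in-memory CSV so the example needs no files on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title_tokens", "title_bigrams"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_tokens = loaded[0]["title_tokens"]

# literal_eval safely parses the string form back into Python lists and tuples.
restored_tokens = ast.literal_eval(raw_tokens)
restored_bigrams = ast.literal_eval(loaded[0]["title_bigrams"])
print(type(raw_tokens).__name__, restored_tokens, restored_bigrams)
```

With pandas, the same function can be supplied per column through the `converters` argument of `pd.read_csv`, for example `converters={'title_tokens': ast.literal_eval}`.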

## Topic Modeling:

### Creation of Dictionary, Corpus and Document Term Matrices
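As a conceptual illustration of what this step produces, here is a minimal sketch in plain Python. It is not the code used for this project: the second document is made up, and the names `token_to_id`, `to_bow`, and `dtm` are my own. A dictionary maps each unique token to an integer id, a corpus stores each document as (token id, count) pairs, and a document-term matrix lays the same counts out with one row per document and one column per token. gensim offers `corpora.Dictionary` and its `doc2bow` method for the first two steps.

```python
from collections import Counter

# Tokenized documents; the first is a title from the dataset, the second is made up.
docs = [
    ["fairness", "academic", "course", "timetabling"],
    ["fair", "course", "timetabling", "course"],
]

# Dictionary: every unique token gets an integer id (alphabetical order here).
vocab = sorted({token for doc in docs for token in doc})
token_to_id = {token: i for i, token in enumerate(vocab)}

def to_bow(doc):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

# Corpus: one bag-of-words per document.
corpus = [to_bow(doc) for doc in docs]

# Document-term matrix: rows are documents, columns are token ids.
dtm = [[dict(bow).get(i, 0) for i in range(len(vocab))] for bow in corpus]

print(token_to_id)
print(corpus)
print(dtm)
```

Word order is discarded at this point, which is why bigrams have to be built beforehand if multi-word terms should survive into the model.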

### LDA - Latent Dirichlet Allocation

### CorEX - Correlation Explanation