# Data Preprocessing and LDA Model Fitting for Iteration 3 of Fair Is:
In this notebook I document the preprocessing and LDA model fitting of my Primary Dataset for iteration 3 of the Fair Is project. 
For more information on the previous iterations of this project, see my Data Management Plan and Methodologies Statement. 

To see how the Primary Dataset was created, see Data Creation for Iteration 3 of Fair Is.

This notebook is split between Preprocessing Steps and Model Fitting for our corpus of 308 papers.

I conducted the following preprocessing steps:
- **Tokenization**
    - *ngrams*
    - *bi-grams*
    - *ngram verbs*
    - *ngram nouns*
    - *bigram nouns*
- **Lemmatization**
- **Creation of Dictionary and Document Term Matrices**

I conducted the following Topic Modeling Steps: 
- **Fit model using LDA**
- **Fitting another topic model: CorEx (Correlation Explanation)**

- **Further Normalization**

### Importing Libraries and Packages:

In [1]:
import pandas as pd
import json
import csv
import nltk
import gensim as gm
import os
import os.path
import numpy as np


In [2]:
#in order to use the word_tokenize function we need the nltk punkt package. 
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/aster/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Accessing Clean Dataset

In [3]:
# build the path from its parts so os.path.join actually does the joining
clean_data = os.path.join('..', 'data', 'processed_data', 'csv', 'cleaned_primary_data_12022021.csv')
data = pd.read_csv(clean_data)

In [4]:
#just checking out data real quick
data.head()

Unnamed: 0,X,title,abstract
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...
1,2,fairness academic course timetabling,consider problem creating fair course timetab...
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...


In [5]:
data.columns

Index(['X', 'title', 'abstract'], dtype='object')

## Preprocessing:

Preprocessing is similar to cleaning: it describes the steps taken to prepare data so that it can be fit to a model. While the primary dataset is structured and cleaned, it is still much closer to unstructured text than to something machine readable. 

### Tokenization

*Tokenization* is the process of separating text into meaningful units called tokens. In Text Analysis, and Natural Language Processing more generally, models don't "understand" or "read" text in the way a human does, so the text has to be broken into discrete units that a model can count and compare. 
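
As a rough illustration of what a tokenizer produces, the sketch below splits an already-cleaned string on whitespace. The helper `simple_tokenize` is hypothetical and only a stand-in: the notebook itself uses `nltk.word_tokenize`, which also handles punctuation and contractions that a plain split would not.

```python
def simple_tokenize(text):
    """Split a cleaned, lowercased string into tokens on whitespace.

    Hypothetical stand-in for nltk.word_tokenize, for illustration only.
    """
    return text.split()


# a title taken from the dataset preview above
title = "fairness academic course timetabling"
tokens = simple_tokenize(title)
print(tokens)
```

Because the titles and abstracts were already stripped of punctuation during cleaning, the two approaches should give similar tokens on this data, though that is an assumption rather than something checked here.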

In tokenization I'm also thinking about n-grams, or sequences of *n* consecutive tokens: a single token is a unigram, two tokens are a bigram, and so forth.

Using bigrams allows us to treat terms like "machine learning" as a single unit rather than as two separate terms. 
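
To make the idea of an n-gram concrete, here is a minimal sketch that builds unigrams and bigrams from a token list by sliding a window of size *n* over it. The function name `make_ngrams` is hypothetical; the cells below use `nltk.bigrams`, which does the same job for *n* = 2.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["fairnessaware", "machine", "learning", "perspective"]
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
print(bigrams)
```

A document with fewer than *n* tokens simply yields an empty list, so one-word titles produce no bigrams.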

So we will create new columns from our titles and abstracts: 

- unigram tokens for titles
- unigram tokens for abstracts
- bigram tokens for titles
- bigram tokens for abstracts

#### unigram tokens for titles:

The following tokenization code was worked out by Professor Vicky Rampin:

In [6]:
# map = iterator (goes thru each row)
# x = specific title that is being tokenized in the specific moment

data['title_tokens'] = data['title'].map(lambda x: nltk.word_tokenize(x))

#### bigram tokens for titles:

In [7]:
data['title_bigrams'] = data['title_tokens'].apply(lambda row: list(nltk.bigrams(row)))
print(data['title_bigrams'])

0      [(unfair, items), (items, detection), (detecti...
1      [(fairness, academic), (academic, course), (co...
2      [(safeguarding, ecommerce), (ecommerce, adviso...
3      [(decomposition, maxmin), (maxmin, fair), (fai...
4      [(fair, assignment), (assignment, indivisible)...
                             ...                        
303    [(one, label), (label, one), (one, billion), (...
304    [(reviewable, automated), (automated, decision...
305    [(dangers, stochastic), (stochastic, parrots),...
306    [(formalizing, trust), (trust, artificial), (a...
307    [(tilt, gdpraligned), (gdpraligned, transparen...
Name: title_bigrams, Length: 308, dtype: object


#### unigram tokens for abstracts:

In [8]:
#now let's apply this to abstracts:
data['abstract_tokens'] = data['abstract'].map(lambda x: nltk.word_tokenize(x))


#### bigram tokens for abstracts:

In [9]:
data['abstract_bigrams'] = data['abstract_tokens'].apply(lambda row: list(nltk.bigrams(row)))
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),..."


In [69]:

#abstract_tokens_pos --> create
#abstract_bigrams_pos --> create

### Parts of Speech Tagging

In [10]:
#in order to use the nltk pos_tag function we need the averaged_perceptron_tagger package
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aster/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [13]:
#trying parts of speech tagging with unigram tokens
data['title_tokens_pos'] = data['title_tokens'].apply(lambda row: list(nltk.pos_tag(row)))

In [15]:
#checking to see if it worked
data['title_tokens_pos'].head()

0    [(unfair, JJ), (items, NNS), (detection, VBP),...
1    [(fairness, JJ), (academic, JJ), (course, NN),...
2    [(safeguarding, VBG), (ecommerce, NN), (adviso...
3    [(decomposition, NN), (maxmin, NN), (fair, NN)...
4    [(fair, JJ), (assignment, NN), (indivisible, J...
Name: title_tokens_pos, dtype: object

In [17]:
#does this also work with bigrams? 
data['title_bigrams_pos'] = data['title_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [19]:
#now parts of speech tagging for abstract tokens and abstract bigrams
data['abstract_tokens_pos'] = data['abstract_tokens'].apply(lambda row: list(nltk.pos_tag(row)))
data['abstract_bigrams_pos'] = data['abstract_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [20]:
#checking to see if this worked:
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams,title_tokens_pos,title_bigrams_pos,abstract_tokens_pos,abstract_bigrams_pos
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,...","[(unfair, JJ), (items, NNS), (detection, VBP),...","[((unfair, JJ), (items, NNS)), ((items, NNS), ...","[(measurement, NN), (professionals, NNS), (com...","[((measurement, NN), (professionals, NNS)), ((..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr...","[(fairness, JJ), (academic, JJ), (course, NN),...","[((fairness, JJ), (academic, JJ)), ((academic,...","[(consider, VB), (problem, NN), (creating, VBG...","[((consider, VB), (problem, NN)), ((problem, N..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr...","[(safeguarding, VBG), (ecommerce, NN), (adviso...","[((safeguarding, VBG), (ecommerce, NN)), ((eco...","[(electronic, JJ), (marketplaces, NNS), (trans...","[((electronic, JJ), (marketplaces, NNS)), ((ma..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max...","[(decomposition, NN), (maxmin, NN), (fair, NN)...","[((decomposition, NN), (maxmin, NN)), ((maxmin...","[(propose, JJ), (decomposition, NN), (maxmin, ...","[((propose, JJ), (decomposition, NN)), ((decom..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),...","[(fair, JJ), (assignment, NN), (indivisible, J...","[((fair, JJ), (assignment, NN)), ((assignment,...","[(consider, VB), (discrete, JJ), (assignment, ...","[((consider, VB), (discrete, JJ)), ((discrete,..."
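
The preprocessing list at the top of this notebook mentions noun and verb n-grams. One possible way to pull them out of the tagged columns is sketched below. The function names are hypothetical, and the sample pairs are hand-written in the shape that `nltk.pos_tag` returns. The sketch relies on the Penn Treebank convention that noun tags start with `NN` and verb tags start with `VB`.

```python
def filter_by_pos(tagged_tokens, prefix):
    """Keep the tokens whose Penn Treebank tag starts with `prefix`."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


def noun_bigrams(tagged_bigrams):
    """Keep the bigrams in which both tokens are tagged as nouns."""
    return [(w1, w2) for (w1, t1), (w2, t2) in tagged_bigrams
            if t1.startswith("NN") and t2.startswith("NN")]


# hand-written sample of (token, tag) pairs, not taken from a real run
tagged = [("fair", "JJ"), ("assignment", "NN"), ("problem", "NN"), ("consider", "VB")]
# pair each tagged token with its successor, like nltk.bigrams does
tagged_bigrams = list(zip(tagged, tagged[1:]))

nouns = filter_by_pos(tagged, "NN")
verbs = filter_by_pos(tagged, "VB")
print(nouns, verbs, noun_bigrams(tagged_bigrams))
```

Applied to this notebook, the same functions could presumably be mapped over `title_tokens_pos` and `title_bigrams_pos`, but that has not been run here.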


### Lemmatization

In [21]:
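#in order to use lemmatization we need NLTK's WordNetLemmatizer
#(it also needs the 'wordnet' corpus, which can be fetched with nltk.download('wordnet'))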
from nltk.stem import WordNetLemmatizer

In [98]:
lem = WordNetLemmatizer()

In [157]:
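def lemmatize_text(alist):
    # NOTE: when this is called through .map() (next cell), `alist` is one row's
    # list of tokens, so `text` is a single token string and `w` is a single character.
    # `return` exits on the very first iteration, so only the first letter of the
    # first token comes back (see the title_lemmatized column below).
    for text in alist:
        for w in text:
    #for w in range(len(text)):
        #for w in row:
            return lem.lemmatize(w)
            #return lem.lemmatize(w, pos='n')
        #.apply(lambda x: lemmatize_text(x))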

In [164]:
data['title_lemmatized']= data['title_tokens'].map(lambda x: lemmatize_text(x))

In [165]:
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams,title_tokens_pos,title_bigrams_pos,abstract_tokens_pos,abstract_bigrams_pos,title_lemmatized
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,...","[(unfair, JJ), (items, NNS), (detection, VBP),...","[((unfair, JJ), (items, NNS)), ((items, NNS), ...","[(measurement, NN), (professionals, NNS), (com...","[((measurement, NN), (professionals, NNS)), ((...",u
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr...","[(fairness, JJ), (academic, JJ), (course, NN),...","[((fairness, JJ), (academic, JJ)), ((academic,...","[(consider, VB), (problem, NN), (creating, VBG...","[((consider, VB), (problem, NN)), ((problem, N...",f
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr...","[(safeguarding, VBG), (ecommerce, NN), (adviso...","[((safeguarding, VBG), (ecommerce, NN)), ((eco...","[(electronic, JJ), (marketplaces, NNS), (trans...","[((electronic, JJ), (marketplaces, NNS)), ((ma...",s
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max...","[(decomposition, NN), (maxmin, NN), (fair, NN)...","[((decomposition, NN), (maxmin, NN)), ((maxmin...","[(propose, JJ), (decomposition, NN), (maxmin, ...","[((propose, JJ), (decomposition, NN)), ((decom...",d
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),...","[(fair, JJ), (assignment, NN), (indivisible, J...","[((fair, JJ), (assignment, NN)), ((assignment,...","[(consider, VB), (discrete, JJ), (assignment, ...","[((consider, VB), (discrete, JJ)), ((discrete,...",f
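
The `title_lemmatized` column holds a single letter per row because the nested loop returns on the first character of the first token. The sketch below reproduces that behaviour next to one possible fix. To keep it runnable without the WordNet data, it uses `toy_lemmatize`, a hypothetical stand-in for `lem.lemmatize`; the other function names are also made up for illustration.

```python
def lemmatize_first_char(tokens, lemmatize):
    """Mirror of the loop above: returns on the first character of the first token."""
    for text in tokens:
        for w in text:
            return lemmatize(w)


def lemmatize_tokens(tokens, lemmatize):
    """Lemmatize every token in one row and return the results as a list."""
    return [lemmatize(token) for token in tokens]


# Hypothetical stand-in for lem.lemmatize, covering only two plural forms.
TOY_LEMMAS = {"items": "item", "libraries": "library"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


row = ["unfair", "items", "detection"]
print(lemmatize_first_char(row, toy_lemmatize))  # only one letter comes back
print(lemmatize_tokens(row, toy_lemmatize))      # one lemma per token
```

In the notebook, the fixed version would presumably be applied as `data['title_tokens'].map(lambda x: lemmatize_tokens(x, lem.lemmatize))`.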


In [None]:
#Notes from Nick:
#see if you can isolate a single term, using basic lemmatizer
#lower the complexity and zero in on a smaller version of the problem. try it with one and then go from there.
#think about arrays
#try it out with one less complex example

#array of array is an issue

# trying it with unfair 
# trying it with [unfair, educational, measurement]

In [124]:
def func_y(text):
    for x in text:
        print(x)

In [141]:
def lemy(text):
    for w in text:
        print(lem.lemmatize(w))

In [145]:
words = ['libraries', 'corpora', 'fairness', 'adversarial', 'mitigating', 'recommendation'] 

In [146]:
lemy(words)
#func_y(words)

library
corpus
fairness
adversarial
mitigating
recommendation


In [135]:
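#note: this cell (135) ran before lemmatize_text was redefined in cell 157;
#the printed output below appears to come from an earlier version that printed each lemma
lemmatize_text(data['title_tokens'])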

unfair
item
detection
educational
measurement
fairness
academic
course
timetabling
safeguarding
ecommerce
advisor
cheating
behavior
towards
robust
trust
model
handling
unfair
rating
decomposition
maxmin
fair
curriculumbased
course
timetabling
problem
fair
assignment
indivisible
object
ordinal
preference
online
fair
division
analysing
food
bank
problem
relation
accuracy
fairness
binary
classification
fair
task
allocation
transportation
efficiency
sequenceability
fair
division
indivisible
good
additive
preference
fairness
program
property
fair
division
via
social
comparison
balancing
lexicographic
fairness
utilitarian
objective
application
kidney
exchange
fairjudge
trustworthy
user
prediction
rating
platform
beyond
parity
fairness
objective
collaborative
filtering
new
fairness
metric
recommendation
embrace
difference
impossibility
fairness
generalized
impossibility
result
decision
networked
fairness
cake
cutting
fairnessaware
machine
learning
perspective
fairness
testing
testing
software

policy
fairness
accuracy
distributed
governance
role
computing
social
change
role
computing
social
change
relationship
trust
ai
trustworthy
machine
learning
technology
philosophical
basis
algorithmic
recourse
philosophical
basis
algorithmic
recourse
effect
confidence
explanation
accuracy
trust
calibration
aiassisted
decision
making
leaveoneout
unfairness
fairness
welfare
equity
personalized
pricing
reimagining
algorithmic
fairness
india
beyond
narrative
counternarratives
data
sharing
africa
whole
thing
smack
gender
algorithmic
exclusion
bioimpedancebased
body
composition
analysis
algorithmic
recourse
counterfactual
explanation
intervention
semioticsbased
epistemic
tool
reason
ethical
issue
digital
technology
design
development
measurement
fairness
fairness
risk
assessment
instrument
postprocessing
achieve
counterfactual
equalized
odds
high
dimensional
model
explanation
axiomatic
approach
agentbased
model
evaluate
intervention
online
dating
platform
decrease
racial
homogamy
designing
ac

In [166]:
#test_1 = data['title_tokens'].apply(lambda row: list(lem.lemmatize(row)))
#test_2 = data['title_tokens'].apply(lambda row: lem.lemmatize(row))

#need to make this not a list somehow. 
# getting the error: "TypeError: unhashable type: 'list'"
# lem.lemmatize() expects a single string, but here each row is a list of tokens.
# a list is mutable, and mutable built-in containers are unhashable,
# so the call presumably fails when the lemmatizer tries to look the "word" up internally.
# So I need to lemmatize each token inside the row instead of passing the whole row.
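
The `TypeError` comes down to hashability: a list cannot be hashed, while a string or a tuple can. The short sketch below checks this directly. The helper `is_hashable` is hypothetical and exists only for this illustration.

```python
def is_hashable(value):
    """Return True if `value` can be hashed (and so used as a dict key)."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


row = ["unfair", "items", "detection"]
print(is_hashable(row))         # a list is mutable, so it is not hashable
print(is_hashable(row[0]))      # a single string token is hashable
print(is_hashable(tuple(row)))  # a tuple of strings is hashable too
```

This is why passing one token at a time to the lemmatizer works, as in the `lemy(words)` cell, while passing a whole row does not.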


### Further Cleaning and Normalization:

- check for nulls and decide what to do with them
- strip extra whitespace
- remove punctuation from the token lists we created
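
The checks listed above could be sketched roughly as follows. Both helpers are hypothetical and written in plain Python so they can be tried on a small list before being mapped over a DataFrame column. Treating missing values as empty strings is just one option, not a decision the notebook has made.

```python
import string


def safe_text(value):
    """Return an empty string for missing values so later steps don't fail."""
    # None and NaN (the only value not equal to itself) both count as missing
    if value is None or value != value:
        return ""
    return str(value)


def normalize_tokens(tokens):
    """Strip whitespace and punctuation from each token and drop empty results."""
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = (token.strip().translate(strip_punct) for token in tokens)
    return [token for token in cleaned if token]


print(safe_text(float("nan")) == "")
print(normalize_tokens([" fair ", "e-commerce", "!", ""]))
```

Note that stripping punctuation this way also joins hyphenated words, so `e-commerce` becomes `ecommerce`, which matches how the cleaned titles already look.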

## Topic Modeling:

### Creation of Dictionary, Corpus and Document Term Matrices
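
As a placeholder for this step, here is a minimal plain-Python sketch of what a dictionary and a bag-of-words corpus contain. The functions `build_dictionary` and `doc_to_bow` are hypothetical. In practice gensim's `corpora.Dictionary` and its `doc2bow` method play these roles, and gensim may assign token ids in a different order than this sketch does.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# two small hand-written documents, standing in for lemmatized titles
docs = [["fair", "division", "indivisible", "good"],
        ["fair", "task", "allocation", "fair"]]
token2id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus)
```

Each row of `corpus` is one row of the document-term matrix in sparse form: only the terms that actually occur in the document are stored, together with their counts.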

### LDA - Latent Dirichlet Allocation

### CorEx - Correlation Explanation