# Data Preprocessing and LDA Model Fitting for Iteration 3 of Fair Is:
In this notebook I document preprocessing and LDA model fitting of my Primary Dataset for iteration 3 of the Fair Is project. 
For more information on the previous iterations of this project, see my Data Management Plan and Methodologies Statement. 

To see how the Primary Dataset was created, see Data Creation for Iteration 3 of Fair Is.

This notebook is split between Preprocessing Steps and Model Fitting for our corpus of 308 papers.

I conducted the following preprocessing steps:
- **Tokenization**
    - *ngrams*
    - *bi-grams*
    - *ngram verbs*
    - *ngram nouns*
    - *bigram nouns*
- **Lemmatization**
- **Creation of Dictionary and Document Term Matrices**

I conducted the following Topic Modeling Steps: 
- **Fit model using LDA**
- **Fit another topic model: CorEx (Correlation Explanation)**

- **Further Normalization**

### Importing Libraries and Packages:

In [1]:
import pandas as pd
import json
import csv
import nltk as nltk
import gensim as gm
import os
import os.path
import numpy as np


In [15]:
#in order to use the word_tokenize function we need the nltk punkt package. 
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/aster/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Accessing Clean Dataset

In [23]:
# build the path to the cleaned CSV, relative to this notebook
clean_data = os.path.join('..', 'data', 'processed_data', 'csv', 'cleaned_primary_data_12022021.csv')
data = pd.read_csv(clean_data)

In [27]:
#just checking out data real quick
data.head()

Unnamed: 0,X,title,abstract
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...
1,2,fairness academic course timetabling,consider problem creating fair course timetab...
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...


In [13]:
data.columns

Index(['X', 'title', 'abstract'], dtype='object')

## Preprocessing:

Preprocessing is similar to cleaning: it describes the steps taken to prepare data so that it can be fit to a model. While the primary dataset is structured and cleaned, it is still much closer to unstructured text than to something machine readable. 

### Tokenization

*Tokenization* is the process of separating meaningful strings of text into units called tokens. In text analysis, and natural language processing more generally, models don't "understand" or "read" text in the way a human does, so the text first has to be broken into discrete units that can be counted. 

In tokenization I'm also thinking about n-grams: sequences of tokens where *n* is the number of tokens in the sequence. A single token is a unigram, two consecutive tokens are a bigram, and so forth.

Using bigrams lets us treat terms like "machine learning" as a single unit rather than as two separate terms. 

So we will create new columns from our titles and abstracts: 

- unigram tokens for titles
- unigram tokens for abstracts
- bigram tokens for titles
- bigram tokens for abstracts
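
To make the unigram/bigram distinction concrete, here is a minimal sketch in plain Python. The helper name `ngrams` is my own for illustration (the cells below use `nltk.bigrams` instead), and the tokens are those of the first title in the dataset.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    # Zip together n staggered copies of the token list:
    # tokens[0:], tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))


# Tokens of the first title in the dataset.
title_tokens = ["unfair", "items", "detection", "educational", "measurement"]

unigrams = ngrams(title_tokens, 1)  # one-element tuples
bigrams = ngrams(title_tokens, 2)   # pairs of consecutive tokens

print(unigrams)
print(bigrams)
```

A title with five tokens yields five unigrams and four bigrams; a document shorter than *n* tokens yields no n-grams at all.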

#### unigram tokens for titles:

The following tokenization code was worked out by Professor Vicky Rampin:

In [44]:
# map = applies the function to each row of the column in turn
# x = the specific title being tokenized at that moment

data['title_tokens'] = data['title'].map(lambda x: nltk.word_tokenize(x))

   X                                              title  \
0  1     unfair items detection educational measurement   
1  2               fairness academic course timetabling   
2  3  safeguarding ecommerce advisor cheating behavi...   
3  4   decomposition maxmin fair curriculumbased cou...   
4  5  fair assignment indivisible objects ordinal pr...   

                                            abstract  \
0   measurement professionals come agreement defi...   
1   consider problem creating fair course timetab...   
2   electronic marketplaces transaction buyers wi...   
3   propose decomposition maxmin fair curriculumb...   
4   consider discrete assignment problem agents e...   

                                        title_tokens  
0  [unfair, items, detection, educational, measur...  
1          [fairness, academic, course, timetabling]  
2  [safeguarding, ecommerce, advisor, cheating, b...  
3  [decomposition, maxmin, fair, curriculumbased,...  
4  [fair, assignment, indivisible

#### bigram tokens for titles:

In [51]:
data['title_bigrams'] = data['title_tokens'].apply(lambda row: list(nltk.bigrams(row)))
print(data['title_bigrams'])

0      [(unfair, items), (items, detection), (detecti...
1      [(fairness, academic), (academic, course), (co...
2      [(safeguarding, ecommerce), (ecommerce, adviso...
3      [(decomposition, maxmin), (maxmin, fair), (fai...
4      [(fair, assignment), (assignment, indivisible)...
                             ...                        
303    [(one, label), (label, one), (one, billion), (...
304    [(reviewable, automated), (automated, decision...
305    [(dangers, stochastic), (stochastic, parrots),...
306    [(formalizing, trust), (trust, artificial), (a...
307    [(tilt, gdpraligned), (gdpraligned, transparen...
Name: title_tokens_bigrams, Length: 308, dtype: object


#### unigram tokens for abstracts:

In [46]:
#now let's apply this to abstracts:
data['abstract_tokens'] = data['abstract'].map(lambda x: nltk.word_tokenize(x))


0    [measurement, professionals, come, agreement, ...
1    [consider, problem, creating, fair, course, ti...
2    [electronic, marketplaces, transaction, buyers...
3    [propose, decomposition, maxmin, fair, curricu...
4    [consider, discrete, assignment, problem, agen...
Name: abstract_tokens, dtype: object

#### bigram tokens for abstracts:

In [56]:
data['abstract_bigrams'] = data['abstract_tokens'].apply(lambda row: list(nltk.bigrams(row)))
data.head()

In [69]:

#abstract_tokens_pos --> create
#abstract_bigrams_pos --> create

### Parts of Speech Tagging

In [54]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aster/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [58]:
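# first attempt: passing the whole column fails, because nltk.pos_tag expects
# one list of string tokens, not a Series of token lists
data['title_tokens_pos_tagged'] = nltk.pos_tag(data['title_tokens'])
data['title_tokens_pos_tagged'].head()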

AttributeError: 'list' object has no attribute 'isdigit'

In [59]:
#second attempt: tag each row's list of unigram tokens separately
data['title_tokens_pos_tagged'] = data['title_tokens'].apply(lambda row: nltk.pos_tag(row))

In [60]:
data['title_tokens_pos_tagged'].head()

0    [(unfair, JJ), (items, NNS), (detection, VBP),...
1    [(fairness, JJ), (academic, JJ), (course, NN),...
2    [(safeguarding, VBG), (ecommerce, NN), (adviso...
3    [(decomposition, NN), (maxmin, NN), (fair, NN)...
4    [(fair, JJ), (assignment, NN), (indivisible, J...
Name: title_tokens_pos_tagged, dtype: object

In [62]:
test = data['title_tokens_pos_tagged'].apply(lambda row: list(nltk.bigrams(row)))

In [63]:
print(test)

0      [((unfair, JJ), (items, NNS)), ((items, NNS), ...
1      [((fairness, JJ), (academic, JJ)), ((academic,...
2      [((safeguarding, VBG), (ecommerce, NN)), ((eco...
3      [((decomposition, NN), (maxmin, NN)), ((maxmin...
4      [((fair, JJ), (assignment, NN)), ((assignment,...
                             ...                        
303    [((one, CD), (label, VBZ)), ((label, VBZ), (on...
304    [((reviewable, JJ), (automated, VBD)), ((autom...
305    [((dangers, NNS), (stochastic, JJ)), ((stochas...
306    [((formalizing, VBG), (trust, NN)), ((trust, N...
307    [((tilt, NN), (gdpraligned, VBD)), ((gdpralign...
Name: title_tokens_pos_tagged, Length: 308, dtype: object


In [61]:
#does this also work with bigrams? (no: pos_tag needs string tokens, and each bigram is a tuple)
data['title_tokens_pos_bigrams'] = data['title_bigrams'].apply(lambda row: list(nltk.pos_tag(row)))

AttributeError: 'tuple' object has no attribute 'isdigit'
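
Since the tagger only accepts string tokens, the noun and verb n-grams listed in the plan at the top have to be derived from the tagged unigrams. The following is only a sketch of one way to do that, not the method used in this project: the helper names `select_by_tag` and `noun_bigrams` are hypothetical, and the sample rows are copied from the visible part of the tagged output above.

```python
def select_by_tag(tagged_tokens, tag_prefix):
    """Keep the words whose Penn Treebank tag starts with tag_prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(tag_prefix)]


def noun_bigrams(tagged_tokens):
    """Keep consecutive word pairs in which both words are tagged as nouns."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in pairs
        if t1.startswith("NN") and t2.startswith("NN")
    ]


# Visible (untruncated) parts of rows 0 and 3 of the tagged output above.
tagged_row_0 = [("unfair", "JJ"), ("items", "NNS"), ("detection", "VBP")]
tagged_row_3 = [("decomposition", "NN"), ("maxmin", "NN"), ("fair", "NN")]

nouns = select_by_tag(tagged_row_0, "NN")  # NN, NNS, NNP, ...
verbs = select_by_tag(tagged_row_0, "VB")  # VB, VBD, VBG, VBP, ...
noun_pairs = noun_bigrams(tagged_row_3)

print(nouns, verbs, noun_pairs)
```

On the full dataset this could be applied row by row, for example `data['title_tokens_pos_tagged'].apply(lambda row: select_by_tag(row, 'NN'))`. Note that the filtered lists inherit whatever the tagger assigned: in the output above "detection" is tagged `VBP`, so it would land in the verb list.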

### Lemmatization

For lemmatization we should first lemmatize the unigram tokens, then build bigrams from the lemmas. 
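
A minimal sketch of that order of operations (lemmatize unigrams first, then pair up the lemmas) is below. Everything here is an assumption for illustration: the helper names are hypothetical, and `toy_lemmatize` is a tiny hand-written lookup standing in for a real lemmatizer. In the notebook the real one would most likely be `WordNetLemmatizer().lemmatize` from `nltk.stem`, which requires `nltk.download('wordnet')` first.

```python
def lemmatize_tokens(tokens, lemmatize):
    """Apply a word-level lemmatizer function to every token."""
    return [lemmatize(token) for token in tokens]


def bigrams_from(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# Stand-in lemmatizer (hypothetical): a hand-written lookup table.
# Words not in the table are returned unchanged.
TOY_LEMMAS = {"items": "item", "objects": "object"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


# Tokens of the first title in the dataset.
title_tokens = ["unfair", "items", "detection", "educational", "measurement"]

title_lemmas = lemmatize_tokens(title_tokens, toy_lemmatize)
title_lemma_bigrams = bigrams_from(title_lemmas)

print(title_lemmas)
print(title_lemma_bigrams)
```

Passing the lemmatizer in as a function keeps the pipeline the same whichever lemmatizer is eventually chosen.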

In [None]:
#lemmatized unigram tokens

#lemmatized bigram tokens

#from parts of speech tagged unigrams select out 

In [None]:
#Create csvs with IDs for:
# - ID, title and abstract
# - ID, title tokens
# - ID, title bigrams
# - ID, title unigram lemmas
# - ID 

### Creation of Dictionary and Document Term Matrices
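
This step is not implemented yet. As a rough sketch of what it produces, the plain-Python code below builds a dictionary (token to integer id) and a bag-of-words row for each document. The helper names are hypothetical, and the second document is a made-up toy example. With gensim, the `corpora.Dictionary` class and its `doc2bow` method cover the same ground, although the ids gensim assigns may differ from the ones here.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            # setdefault only assigns a new id the first time a token is seen
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


documents = [
    ["fairness", "academic", "course", "timetabling"],  # second title in the dataset
    ["fair", "course", "fair", "assignment"],           # hypothetical toy document
]

token2id = build_dictionary(documents)
corpus = [doc_to_bow(tokens, token2id) for tokens in documents]

print(token2id)
print(corpus)
```

Each row of `corpus` is a sparse row of the document-term matrix: only the terms that occur in that document are listed, together with their counts.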

## Topic Modeling:

### LDA - Latent Dirichlet Allocation

### CorEX - Correlation Explanation