# Data Preprocessing and LDA Model Fitting for Iteration 3 of Fair Is:
In this notebook I document preprocessing and LDA model fitting of my Primary Dataset for iteration 3 of the Fair Is project. 
For more information on the previous iterations of this project you can see my Data Managemant Plan and Methodologies Statement. 

To see how the Primary Dataset was created see Data Creation for Iteration 3 of Fair Is.

This notebook is split between Preprocessing Steps and Model Fitting for our corpus of 308 papers.

I conducted the following preprocessing steps:
- **Tokenization**
    - *ngrams*
    - *bi-grams*
    - *ngram verbs*
    - *ngram nouns*
    - *bigram nouns*
- **Lematization**
- **Creation of Dictionary and Document Term Matrices**

I conducted the following Topic Modeling Steps: 
- **Fit model using LDA**
- **Fitting other tipic CorEx (Correlation Explanation)**

- **Further Normalization**

### Importing Libraries and Packages:

In [1]:
import pandas as pd
import json
import csv
import nltk as nltk
import gensim as gm
import os
import os.path
import numpy as np


In [2]:
#in order to use the word_tokenize function we need the nltk punkt package. 
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/aster/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Accessing Clean Dataset

In [2]:
clean_data = os.path.join('../data/processed_data/csv/cleaned_primary_data_12022021.csv')
data = pd.read_csv(clean_data)

In [3]:
#just checking out data real quick
data.head()

Unnamed: 0,X,title,abstract
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...
1,2,fairness academic course timetabling,consider problem creating fair course timetab...
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...


In [4]:
data.columns

Index(['X', 'title', 'abstract'], dtype='object')

## Preprocessing:

Preprocessing is somewhat similar to cleaning, it describes the steps made to prepare data to be fit to a model. While the primary dataset is structured and cleaned it's still much closer to unstructured text than to something machine readable. 

### Tokenization

*Tokenization* is the process of seperating meaningful strings of text into units called tokens. in Text Analysis, and Natural Language Processing more generally, models don't "understand" or "read" text in the way a human does. 

In tokenization i'm also thing about ngrams, or a sequence of tokens where *n* is some number --- A single token would be a unigram, two tokens would be a bigram, and so forth.

Using bigrams allows us to take account of terms like "machine learning" rather than consider them seperate terms. 

So we will create new columns from our abstracts: 

- unigram tokens for titles
- unigram tokens for abstracts
- bigram tokens for titles
- bigram tokens for abstracts

#### unigram tokens for titles:

The following tookenization code was figured out by Professor Vicky Rampin:

In [5]:
# map = iterator (goes thru each row)
# x = specific title that is being tokenized in the specific moment

data['title_tokens'] = data['title'].map(lambda x: nltk.word_tokenize(x))

#### bigram tokens for titles:

In [6]:
data['title_bigrams'] = data['title_tokens'].apply(lambda row: list(nltk.bigrams(row)))
#print(data['title_bigrams'])

#### unigram tokens for abstracts:

In [7]:
#now let's apply this to abstracts:
data['abstract_tokens'] = data['abstract'].map(lambda x: nltk.word_tokenize(x))


#### bigram tokens for abstracts:

In [8]:
data['abstract_bigrams'] = data['abstract_tokens'].apply(lambda row: list(nltk.bigrams(row)))
#checking dataframe
#data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),..."


### Parts of Speech Tagging

In [10]:
#in order to use the nltk pos_tag function need averaged_perceptron_tagger
#nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aster/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [9]:
#trying parts of speech tagging with unigram tokens
data['title_tokens_pos'] = data['title_tokens'].apply(lambda row: list(nltk.pos_tag(row)))

In [10]:
#does this also work with bigrams? 
data['title_bigrams_pos'] = data['title_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [11]:
#now parts of speech tagging for abstracts bigrams
data['abstract_tokens_pos'] = data['abstract_tokens'].apply(lambda row: list(nltk.pos_tag(row)))
data['abstract_bigrams_pos'] = data['abstract_tokens_pos'].apply(lambda row: list(nltk.bigrams(row)))

In [12]:
#checking to see if this worked:
data.head()

Unnamed: 0,X,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams,title_tokens_pos,title_bigrams_pos,abstract_tokens_pos,abstract_bigrams_pos
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"[unfair, items, detection, educational, measur...","[(unfair, items), (items, detection), (detecti...","[measurement, professionals, come, agreement, ...","[(measurement, professionals), (professionals,...","[(unfair, JJ), (items, NNS), (detection, VBP),...","[((unfair, JJ), (items, NNS)), ((items, NNS), ...","[(measurement, NN), (professionals, NNS), (com...","[((measurement, NN), (professionals, NNS)), ((..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"[fairness, academic, course, timetabling]","[(fairness, academic), (academic, course), (co...","[consider, problem, creating, fair, course, ti...","[(consider, problem), (problem, creating), (cr...","[(fairness, JJ), (academic, JJ), (course, NN),...","[((fairness, JJ), (academic, JJ)), ((academic,...","[(consider, VB), (problem, NN), (creating, VBG...","[((consider, VB), (problem, NN)), ((problem, N..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"[safeguarding, ecommerce, advisor, cheating, b...","[(safeguarding, ecommerce), (ecommerce, adviso...","[electronic, marketplaces, transaction, buyers...","[(electronic, marketplaces), (marketplaces, tr...","[(safeguarding, VBG), (ecommerce, NN), (adviso...","[((safeguarding, VBG), (ecommerce, NN)), ((eco...","[(electronic, JJ), (marketplaces, NNS), (trans...","[((electronic, JJ), (marketplaces, NNS)), ((ma..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"[decomposition, maxmin, fair, curriculumbased,...","[(decomposition, maxmin), (maxmin, fair), (fai...","[propose, decomposition, maxmin, fair, curricu...","[(propose, decomposition), (decomposition, max...","[(decomposition, NN), (maxmin, NN), (fair, NN)...","[((decomposition, NN), (maxmin, NN)), ((maxmin...","[(propose, JJ), (decomposition, NN), (maxmin, ...","[((propose, JJ), (decomposition, NN)), ((decom..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"[fair, assignment, indivisible, objects, ordin...","[(fair, assignment), (assignment, indivisible)...","[consider, discrete, assignment, problem, agen...","[(consider, discrete), (discrete, assignment),...","[(fair, JJ), (assignment, NN), (indivisible, J...","[((fair, JJ), (assignment, NN)), ((assignment,...","[(consider, VB), (discrete, JJ), (assignment, ...","[((consider, VB), (discrete, JJ)), ((discrete,..."


### Lemmatization
*Lemmatization* is a commonly used pre-processing step in text analysis. When we lemmatize tokens, we shorten them to the shortest meaningful root of a word, called a lemma. For example *running* becomes run. 

To lemmatize we will using the WordNetLemmatizer, a tool that is part of the NLTK package and uses [WordNet](https://wordnet.princeton.edu/), a database of semantic relations between word forms in over 200 languages to lemmatize the words in our abstract and title data. 

Finally, WordNetLemmatizer allows us to chose the part of speech of the lemma. In this iteration I am selecting the verb part of speech, one because i'm considering the "understanding" of fairness as proceedural, in action and secondly for the sake of making a decision to move through this project. I also leave code as a comment to return noun forms of lemmas (if no part of speech is specified WordNetLemmatizer defaults to nouns). 

In [46]:
#in order to use lemmatization we need to use import the WordNetLemmatizer and wordnet dictionary
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
from nltk.corpus import treebank

In [27]:
#set the WordNetLemmatizer to a variable
lem = WordNetLemmatizer()

In [91]:
def lem_text(text):
    return [lem.lemmatize(w, 'v') for w in text] #for verbs
    #return [lem.lemmatize(w, 'n') for w in text] #remove hastag/pound, comment out verb line and run cell for nouns

In [92]:
#Lemmatize titles:
data['title_lemmatized']= data['title_tokens'].map(lambda x: lem_text(x))


In [96]:
#Lemmatize abstracts:
data['abstract_lemmatized'] = data['abstract_tokens'].map(lambda x: lem_text(x))

In [106]:
#Get title lemma bigrams:
data['title_bigram_lemmatized']=data['title_lemmatized'].apply(lambda row: list(nltk.bigrams(row)))


In [109]:
#Get abstract lemma bigrams:
data['abstract_bigram_lemmatized']=data['abstract_lemmatized'].apply(lambda row: list(nltk.bigrams(row)))
                                                                     

### Further Cleaning and Normalization:

One last check to see if there are any rows without values (imagine an empty cell in a spreadsheet) we need to consider. 

In [113]:
data.isnull().values.any()

False

We'll also deal with a stray column at the begiining since by changing it's name and setting it as our index. 

In [129]:
#changing column name
data = data.rename(columns={"X":"id"})

In [134]:
#checking to make sure it worked:
#data.head()

In [131]:
#set newly renamed column as index
data = data.set_index('id')

In [133]:
#checking to make sure it worked
#data.head()

Finally we save our preprocessed data and we're ready for Topic Modeling!

In [136]:
data.to_csv('../data/processed_data/csv/processed_primary_data.csv')

In [138]:
#just making sure everything looks good!
#test = pd.read_csv('../data/processed_data/csv/processed_primary_data.csv')
#test.head()

## Topic Modeling:

### Creation of Dictionary, Corpus and Document Term Matrices

For Topic Modeling we will require two objects made from our data: 
- Dictionary:
Like a dictionary in the traditional sense, this is an object that holds all of the avaialble words in our documents. 

- Document-Term Matrix (DTM):
Also called a corpus, The Document Term Matrix, is a kind of table that lists all of the words in our dictionary along it's x axis and all of the documents in our corpus as it's index. The values in each cell are the frquency of that term in that document: 

For example:

doc_1= "all that you touch"
doc_2= "you change"

- Our dictionary would be = ["all", "that", "you", touch", "change"]
- Our corpus would be = [doc_1, doc_2]

- Our DTM would then list how mnay times "all" appeared in each document, how many times "that" appeared in each document, how many times "you" appeared in each document, and so forth. 

In [140]:
#first let's read in all our processed data:
docs = pd.read_csv('../data/processed_data/csv/processed_primary_data.csv')

In [141]:
docs.head()

Unnamed: 0,id,title,abstract,title_tokens,title_bigrams,abstract_tokens,abstract_bigrams,title_tokens_pos,title_bigrams_pos,abstract_tokens_pos,abstract_bigrams_pos,title_lemmatized,abstract_lemmatized,title_bigram_lemmatized,abstract_bigram_lemmatized
0,1,unfair items detection educational measurement,measurement professionals come agreement defi...,"['unfair', 'items', 'detection', 'educational'...","[('unfair', 'items'), ('items', 'detection'), ...","['measurement', 'professionals', 'come', 'agre...","[('measurement', 'professionals'), ('professio...","[('unfair', 'JJ'), ('items', 'NNS'), ('detecti...","[(('unfair', 'JJ'), ('items', 'NNS')), (('item...","[('measurement', 'NN'), ('professionals', 'NNS...","[(('measurement', 'NN'), ('professionals', 'NN...","['unfair', 'items', 'detection', 'educational'...","['measurement', 'professionals', 'come', 'agre...","[('unfair', 'items'), ('items', 'detection'), ...","[('measurement', 'professionals'), ('professio..."
1,2,fairness academic course timetabling,consider problem creating fair course timetab...,"['fairness', 'academic', 'course', 'timetabling']","[('fairness', 'academic'), ('academic', 'cours...","['consider', 'problem', 'creating', 'fair', 'c...","[('consider', 'problem'), ('problem', 'creatin...","[('fairness', 'JJ'), ('academic', 'JJ'), ('cou...","[(('fairness', 'JJ'), ('academic', 'JJ')), (('...","[('consider', 'VB'), ('problem', 'NN'), ('crea...","[(('consider', 'VB'), ('problem', 'NN')), (('p...","['fairness', 'academic', 'course', 'timetabling']","['consider', 'problem', 'create', 'fair', 'cou...","[('fairness', 'academic'), ('academic', 'cours...","[('consider', 'problem'), ('problem', 'create'..."
2,3,safeguarding ecommerce advisor cheating behavi...,electronic marketplaces transaction buyers wi...,"['safeguarding', 'ecommerce', 'advisor', 'chea...","[('safeguarding', 'ecommerce'), ('ecommerce', ...","['electronic', 'marketplaces', 'transaction', ...","[('electronic', 'marketplaces'), ('marketplace...","[('safeguarding', 'VBG'), ('ecommerce', 'NN'),...","[(('safeguarding', 'VBG'), ('ecommerce', 'NN')...","[('electronic', 'JJ'), ('marketplaces', 'NNS')...","[(('electronic', 'JJ'), ('marketplaces', 'NNS'...","['safeguard', 'ecommerce', 'advisor', 'cheat',...","['electronic', 'marketplaces', 'transaction', ...","[('safeguard', 'ecommerce'), ('ecommerce', 'ad...","[('electronic', 'marketplaces'), ('marketplace..."
3,4,decomposition maxmin fair curriculumbased cou...,propose decomposition maxmin fair curriculumb...,"['decomposition', 'maxmin', 'fair', 'curriculu...","[('decomposition', 'maxmin'), ('maxmin', 'fair...","['propose', 'decomposition', 'maxmin', 'fair',...","[('propose', 'decomposition'), ('decomposition...","[('decomposition', 'NN'), ('maxmin', 'NN'), ('...","[(('decomposition', 'NN'), ('maxmin', 'NN')), ...","[('propose', 'JJ'), ('decomposition', 'NN'), (...","[(('propose', 'JJ'), ('decomposition', 'NN')),...","['decomposition', 'maxmin', 'fair', 'curriculu...","['propose', 'decomposition', 'maxmin', 'fair',...","[('decomposition', 'maxmin'), ('maxmin', 'fair...","[('propose', 'decomposition'), ('decomposition..."
4,5,fair assignment indivisible objects ordinal pr...,consider discrete assignment problem agents e...,"['fair', 'assignment', 'indivisible', 'objects...","[('fair', 'assignment'), ('assignment', 'indiv...","['consider', 'discrete', 'assignment', 'proble...","[('consider', 'discrete'), ('discrete', 'assig...","[('fair', 'JJ'), ('assignment', 'NN'), ('indiv...","[(('fair', 'JJ'), ('assignment', 'NN')), (('as...","[('consider', 'VB'), ('discrete', 'JJ'), ('ass...","[(('consider', 'VB'), ('discrete', 'JJ')), (('...","['fair', 'assignment', 'indivisible', 'object'...","['consider', 'discrete', 'assignment', 'proble...","[('fair', 'assignment'), ('assignment', 'indiv...","[('consider', 'discrete'), ('discrete', 'assig..."


In [142]:
docs.columns

Index(['id', 'title', 'abstract', 'title_tokens', 'title_bigrams',
       'abstract_tokens', 'abstract_bigrams', 'title_tokens_pos',
       'title_bigrams_pos', 'abstract_tokens_pos', 'abstract_bigrams_pos',
       'title_lemmatized', 'abstract_lemmatized', 'title_bigram_lemmatized',
       'abstract_bigram_lemmatized'],
      dtype='object')

In [501]:
#Loading our processed data columns into variables
doc_t = docs['title_tokens']
doc_a = docs['abstract_tokens']

In [199]:
import re

In [139]:
#To create dictionaries for our tiles and abstracts we need gensim corpora
from gensim import corpora

In [None]:
#I also want to think about extremes: specifically words that are very minimal and words that are too prevelant. 
#first I want to try fitting a model, then consider trying to filter extremes to see how topics change

#for abstracts:
#filter out tokens that appear in less than 10 documents
#filter out tokens that appear in more than 60% of documents
#a_dict.filter_extremes(no_below=10, no_above=0.6)

#for titles:
#filter out tokens that appear in less than 10 documents
#filter out tokens that appear in more than 60% of documents
#t_dict.filter_extremes(no_below=10, no_above=0.6)

In [502]:
def to_dict(text):
    pat1 = r"\["
    pat2 = r"\]"
    x = text.str.replace(pat1, "", regex=True)
    w = x.str.replace(pat2, "", regex=True)
    y = w.str.replace("'", "", regex =True)
    return y.str.split(",")

In [503]:
#last preprocessing steps to make our data ready for turning into a dictionary
doc_t=to_dict(docs['title_tokens'].apply(lambda x: x.strip()))
doc_a=to_dict(docs['abstract_tokens'].apply(lambda x: x.strip()))

In [504]:
doc_t[0]

['unfair', ' items', ' detection', ' educational', ' measurement']

In [505]:
doc_a[65]

['beginning',
 ' history',
 ' ai',
 ' interest',
 ' games',
 ' platform',
 ' research',
 ' field',
 ' developed',
 ' humanlevel',
 ' competence',
 ' complex',
 ' games',
 ' became',
 ' target',
 ' researchers',
 ' worked',
 ' reach',
 ' relatively',
 ' recently',
 ' target',
 ' finally',
 ' met',
 ' traditional',
 ' tabletop',
 ' games',
 ' backgammon',
 ' chess',
 ' go',
 ' current',
 ' research',
 ' focus',
 ' shifted',
 ' electronic',
 ' games',
 ' provide',
 ' unique',
 ' challenges',
 ' often',
 ' case',
 ' ai',
 ' research',
 ' results',
 ' liable',
 ' exaggerated',
 ' misrepresented',
 ' either',
 ' authors',
 ' third',
 ' parties',
 ' extent',
 ' games',
 ' benchmark',
 ' consist',
 ' fair',
 ' competition',
 ' human',
 ' ai',
 ' also',
 ' matter',
 ' debate',
 ' work',
 ' review',
 ' statements',
 ' made',
 ' authors',
 ' third',
 ' parties',
 ' general',
 ' media',
 ' academic',
 ' circle',
 ' game',
 ' benchmark',
 ' results',
 ' discuss',
 ' factors',
 ' can',
 ' impact',
 

In [506]:
#creating titles dictionary:
t_dict = corpora.Dictionary(doc_t)

#creating titles corpus/dtm:
t_corpus = [t_dict.doc2bow(text) for text in doc_t]


#[dictionary.doc2bow(doc.split()) for doc in text]
#t_dict = corpora.Dictionary([doc_t.split() for doc in doc_t])
#a_dict = corpora.Dictionary(doc_a) #302 unique tokens

In [512]:
a_dict = corpora.Dictionary(doc_a)

In [513]:
a_dict.filter_extremes(no_below=5, no_above=0.01)

In [515]:
print(a_dict.filter_extremes(no_below=5, no_above=0.01,keep_n=100))

None


In [531]:
#creating abstract dictionary:
a_dict = corpora.Dictionary(doc_a)

#creating abstract corpus:
a_corpus = [a_dict.doc2bow(text) for text in doc_a]

In [473]:
#filter extremes does not seem to work! Will return to later. 
#t_dict = t_dict.filter_extremes(no_below=5, no_above=0.3)
#for titles:
#filter out tokens that appear in less than 10 documents
#filter out tokens that appear in more than 60% of documents


In [443]:
#t_dict[3]

In [None]:
#from gensim documentation:
#dictionart
#id2word = corpora.Dictionary(data_lemmatized)
#variable holding all our documents
#texts = data_lemmatized
#corpus = [id2word.doc2bow(text) for text in texts]

Now we can use a python module called pickle to save our dictionaries and corpora/DTMs for later use. This way we don't have to go through the entire notebook if we come back to this later on!

In [170]:
import pickle

In [544]:
#all title objects:
with open('../data/processed_data/objects/title_tokens_12152021.pkl','wb') as a:
    pickle.dump(doc_t, a)

with open('../data/processed_data/objects/title_tokens_dictionary_12152021.pkl','wb') as b:
    pickle.dump(t_dict, b)

with open('../data/processed_data/objects/title_tokens_corpus_12152021.pkl','wb') as c:
    pickle.dump(t_corpus, c) 

#all abstract objects
with open('../data/processed_data/objects/abstract_tokens_12152021.pkl','wb') as d:
    pickle.dump(doc_a, d) 

with open('../data/processed_data/objects/abstract_tokens_dictionary_12152021.pkl','wb') as e:
    pickle.dump(a_dict, e)

with open('../data/processed_data/objects/abstract_tokens_corpus_12152021.pkl','wb') as f:
    pickle.dump(a_corpus, f) 

### LDA - Latent Dirichlet Allocation

explaining laten dirichlet allocation and porbability distribution.

LDA requires a *k-value*, where k is the number of topics to attempt clustering topics into. LDA will also ask us for a number of top words to look for. 

I would like to see what topics come up on the low end with k=6

On the high end using Arvind Narayanan's turtorial research 21 definitions of fairness and their politics, I'll try k=21

I'd then like to use feature engineering to find a third k=value to try. 

For each k-value/LDA batch I'll be using the top 10 terms for each topic. 

Finally as closing analysis procudures I'll:
- pickle each model
- create wordclouds with the top 10 words of each topic
- use pyLDAviz to generate visualizations

In [526]:
#setting iniital k-values
k_1 = 21

In [517]:
len(doc_t)

308

In [525]:
#To try a different lower k-value i'm going to use the square root of n/2 where n is the number of documents
import math

#math.sqrt(308/2)
k_2=math.sqrt(308/2)
print(math.sqrt(308/2))
print(k_3)

12.409673645990857
12.409673645990857


In [447]:
#First go with titles!
import gensim



In [532]:
#Title Model where k=21
lda_t_1 = gensim.models.ldamodel.LdaModel(corpus=t_corpus, id2word=t_dict, num_topics=k_1)

In [533]:
#Printing Ttile Topics where k=21
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)



title_topics = pd.DataFrame(lda_t_1.print_topics(num_words=20))
title_topics.columns = ["topics", "top 10 terms"]
title_topics

Unnamed: 0,topics,top 10 terms
0,2,"0.046*"" fairness"" + 0.020*"" classification"" + 0.020*"" algorithmic"" + 0.013*"" accountability"" + 0.013*"" ai"" + 0.013*""fair"" + 0.013*"" division"" + 0.013*"" fair"" + 0.013*"" systems"" + 0.007*"" intersectional"" + 0.007*"" advisor"" + 0.007*"" food"" + 0.007*"" trust"" + 0.007*"" study"" + 0.007*"" robust"" + 0.007*""safeguarding"" + 0.007*"" divisive"" + 0.007*"" goods"" + 0.007*""cant"" + 0.007*"" accounting"""
1,14,"0.030*"" explanations"" + 0.020*"" ai"" + 0.020*"" case"" + 0.020*""fairness"" + 0.010*""impossible"" + 0.010*"" contrastive"" + 0.010*"" unfairness"" + 0.010*"" online"" + 0.010*"" scenario"" + 0.010*"" towards"" + 0.010*"" mitigating"" + 0.010*"" semantic"" + 0.010*"" bios"" + 0.010*"" setting"" + 0.010*"" explainable"" + 0.010*""ethical"" + 0.010*"" algorithmic"" + 0.010*"" models"" + 0.010*"" covid"" + 0.010*"" nlp"""
2,4,"0.056*"" fair"" + 0.023*"" learning"" + 0.015*"" equality"" + 0.015*""learning"" + 0.015*"" eliminating"" + 0.015*"" machine"" + 0.015*""fair"" + 0.015*"" fairness"" + 0.011*"" clustering"" + 0.008*"" understanding"" + 0.008*"" crowd"" + 0.008*"" unfair"" + 0.008*"" trust"" + 0.008*"" discounted"" + 0.008*"" lessons"" + 0.008*"" add"" + 0.008*"" social"" + 0.008*"" equity"" + 0.008*"" false"" + 0.008*"" dominant"""
3,5,"0.065*"" fairness"" + 0.028*"" fair"" + 0.018*"" definitions"" + 0.018*"" algorithmic"" + 0.012*"" computing"" + 0.012*"" task"" + 0.012*"" causal"" + 0.012*"" counterfactual"" + 0.012*"" framework"" + 0.012*"" bias"" + 0.012*"" decisions"" + 0.012*""fairness"" + 0.006*"" aggregation"" + 0.006*""reputation"" + 0.006*"" minimal"" + 0.006*"" toolkit"" + 0.006*""roles"" + 0.006*"" fairmaml"" + 0.006*"" multiresource"" + 0.006*""dissecting"""
4,13,"0.031*"" fairness"" + 0.016*"" social"" + 0.016*""measuring"" + 0.016*"" populationlevel"" + 0.016*"" learning"" + 0.016*"" signaling"" + 0.016*""roles"" + 0.016*""active"" + 0.016*""access"" + 0.016*"" machine"" + 0.016*"" instead"" + 0.016*"" source"" + 0.016*"" nonexpert"" + 0.016*"" comprehension"" + 0.016*"" inequality"" + 0.016*"" computing"" + 0.016*"" unawareness"" + 0.016*"" metrics"" + 0.016*"" change"" + 0.001*""fair"""
5,17,"0.071*"" fairness"" + 0.036*"" fair"" + 0.015*"" classification"" + 0.015*"" accuracy"" + 0.015*"" data"" + 0.015*"" ai"" + 0.015*"" construction"" + 0.015*"" causal"" + 0.015*"" systems"" + 0.007*"" theorem"" + 0.007*"" prediction"" + 0.007*"" maximin"" + 0.007*"" eligibility"" + 0.007*"" biased"" + 0.007*"" use"" + 0.007*"" people"" + 0.007*"" transparently"" + 0.007*""point"" + 0.007*"" improve"" + 0.007*"" pathways"""
6,19,"0.053*""fairness"" + 0.032*"" learning"" + 0.022*"" fairness"" + 0.011*"" fairly"" + 0.011*"" methodological"" + 0.011*"" optimization"" + 0.011*"" critically"" + 0.011*"" product"" + 0.011*"" deep"" + 0.011*""human"" + 0.011*"" metaalgorithm"" + 0.011*"" clinical"" + 0.011*"" guarantees"" + 0.011*"" research"" + 0.011*""disparate"" + 0.011*"" program"" + 0.011*"" effects"" + 0.011*"" bioinspired"" + 0.011*"" manipulation"" + 0.011*"" constraints"""
7,16,"0.050*"" fairness"" + 0.029*"" models"" + 0.029*""fairness"" + 0.022*"" data"" + 0.022*""fair"" + 0.015*"" systems"" + 0.015*"" detection"" + 0.015*"" twosided"" + 0.007*"" collaborative"" + 0.007*"" graphical"" + 0.007*"" using"" + 0.007*"" filtering"" + 0.007*"" sensing"" + 0.007*"" can"" + 0.007*"" computational"" + 0.007*"" parrots"" + 0.007*"" eyes"" + 0.007*"" measuring"" + 0.007*"" big"" + 0.007*""multiwinner"""
8,0,"0.046*"" fair"" + 0.027*"" fairness"" + 0.015*"" privacy"" + 0.015*"" models"" + 0.015*"" individual"" + 0.015*"" data"" + 0.012*"" learning"" + 0.012*"" machine"" + 0.008*"" matters"" + 0.008*"" public"" + 0.008*"" facial"" + 0.008*"" performing"" + 0.008*"" neural"" + 0.008*""beyond"" + 0.008*"" monotonic"" + 0.008*"" sequenceability"" + 0.008*"" training"" + 0.008*"" high"" + 0.008*"" sharing"" + 0.008*"" accountability"""
9,18,"0.018*"" prediction"" + 0.009*"" thought"" + 0.009*"" online"" + 0.009*"" datasets"" + 0.009*"" platforms"" + 0.009*"" counterfactual"" + 0.009*""towards"" + 0.009*"" automated"" + 0.009*"" differences"" + 0.009*""algorithmic"" + 0.009*"" clauses"" + 0.009*"" detector"" + 0.009*"" assessment"" + 0.009*"" mobility"" + 0.009*"" allocation"" + 0.009*"" healthcare"" + 0.009*"" achieve"" + 0.009*"" rating"" + 0.009*"" new"" + 0.009*"" machine"""


In [545]:
#Abstracts LDA where k=21:
lda_a_1 = gensim.models.ldamodel.LdaModel(corpus=a_corpus, id2word=a_dict, num_topics=k_1)

In [546]:
#Printing abstract topics where k=21:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)



abstract_topics = pd.DataFrame(lda_a_1.print_topics(num_words=20))
abstract_topics.columns = ["topics", "top 10 terms"]
abstract_topics

Unnamed: 0,topics,top 10 terms
0,13,"0.014*"" fairness"" + 0.007*"" models"" + 0.007*"" groups"" + 0.007*"" can"" + 0.006*"" fair"" + 0.006*"" systems"" + 0.006*"" machine"" + 0.005*"" show"" + 0.005*"" data"" + 0.005*"" algorithm"" + 0.005*"" learning"" + 0.005*"" different"" + 0.005*"" used"" + 0.004*"" agents"" + 0.004*"" paper"" + 0.004*"" algorithmic"" + 0.004*"" work"" + 0.003*"" model"" + 0.003*"" proposed"" + 0.003*"" kmeans"""
1,7,"0.015*"" fairness"" + 0.008*"" algorithm"" + 0.008*"" stories"" + 0.008*"" fair"" + 0.007*"" data"" + 0.007*"" users"" + 0.006*"" learning"" + 0.005*"" truth"" + 0.005*"" can"" + 0.005*"" risk"" + 0.005*"" two"" + 0.004*"" paper"" + 0.004*"" social"" + 0.004*"" accuracy"" + 0.004*"" algorithms"" + 0.004*"" attributes"" + 0.004*"" groups"" + 0.003*"" tasks"" + 0.003*"" false"" + 0.003*"" different"""
2,5,"0.013*"" trust"" + 0.008*"" ai"" + 0.007*"" fairness"" + 0.006*"" systems"" + 0.006*"" legal"" + 0.006*"" model"" + 0.006*"" can"" + 0.005*"" paper"" + 0.004*"" models"" + 0.004*"" methods"" + 0.004*"" ml"" + 0.004*"" learning"" + 0.004*"" two"" + 0.004*"" data"" + 0.004*"" behavior"" + 0.004*"" recourse"" + 0.004*"" decision"" + 0.003*"" discuss"" + 0.003*"" decisions"" + 0.003*"" method"""
3,4,"0.013*"" fairness"" + 0.012*"" data"" + 0.011*"" model"" + 0.010*"" can"" + 0.008*"" learning"" + 0.007*"" systems"" + 0.007*"" decision"" + 0.006*"" models"" + 0.005*"" show"" + 0.005*"" explanations"" + 0.005*"" decisions"" + 0.004*"" unfairness"" + 0.004*"" used"" + 0.004*"" fair"" + 0.004*"" bias"" + 0.004*"" results"" + 0.004*"" algorithm"" + 0.004*"" paper"" + 0.003*"" work"" + 0.003*"" also"""
4,18,"0.039*"" fairness"" + 0.013*"" data"" + 0.010*"" model"" + 0.008*"" learning"" + 0.007*"" can"" + 0.007*"" work"" + 0.006*"" models"" + 0.006*"" fair"" + 0.005*"" show"" + 0.005*"" machine"" + 0.005*"" different"" + 0.005*"" algorithms"" + 0.004*"" problem"" + 0.004*"" systems"" + 0.004*"" algorithmic"" + 0.004*"" notions"" + 0.003*"" provide"" + 0.003*"" unfairness"" + 0.003*"" decision"" + 0.003*"" existing"""
5,9,"0.020*"" learning"" + 0.015*"" fairness"" + 0.008*"" algorithms"" + 0.007*"" model"" + 0.007*"" framework"" + 0.006*"" deep"" + 0.006*"" models"" + 0.006*"" can"" + 0.005*"" fair"" + 0.005*"" problem"" + 0.004*"" algorithmic"" + 0.004*"" people"" + 0.004*"" machine"" + 0.004*"" systems"" + 0.004*"" data"" + 0.004*"" propose"" + 0.004*"" trust"" + 0.004*"" may"" + 0.004*"" used"" + 0.003*"" different"""
6,16,"0.012*"" data"" + 0.008*"" transparency"" + 0.006*"" information"" + 0.005*"" ml"" + 0.005*"" analysis"" + 0.004*"" risk"" + 0.004*"" users"" + 0.004*"" assessments"" + 0.004*"" algorithmic"" + 0.004*"" present"" + 0.004*"" paper"" + 0.004*"" use"" + 0.004*"" language"" + 0.003*"" decisions"" + 0.003*"" systems"" + 0.003*"" different"" + 0.003*"" fair"" + 0.003*"" fairness"" + 0.003*"" gdpr"" + 0.003*"" legal"""
7,20,"0.015*"" fairness"" + 0.009*"" can"" + 0.009*"" bias"" + 0.007*"" learning"" + 0.006*"" fair"" + 0.006*"" accuracy"" + 0.005*"" model"" + 0.005*"" data"" + 0.005*"" show"" + 0.005*"" algorithms"" + 0.004*"" paper"" + 0.004*"" models"" + 0.004*"" results"" + 0.004*"" propose"" + 0.004*"" machine"" + 0.004*"" two"" + 0.003*"" however"" + 0.003*"" using"" + 0.003*"" metrics"" + 0.003*"" used"""
8,19,"0.013*"" fair"" + 0.012*"" algorithmic"" + 0.011*"" fairness"" + 0.008*"" problem"" + 0.007*"" algorithms"" + 0.006*"" social"" + 0.006*"" can"" + 0.005*"" algorithm"" + 0.004*"" systems"" + 0.004*"" also"" + 0.004*"" new"" + 0.004*"" agents"" + 0.003*"" impact"" + 0.003*"" accountability"" + 0.003*"" show"" + 0.003*"" concepts"" + 0.003*"" computer"" + 0.003*"" problems"" + 0.003*"" different"" + 0.003*"" several"""
9,11,"0.014*"" fairness"" + 0.013*"" fair"" + 0.010*"" can"" + 0.010*"" systems"" + 0.005*"" work"" + 0.005*"" classification"" + 0.004*"" data"" + 0.004*"" learning"" + 0.004*"" algorithms"" + 0.004*"" different"" + 0.004*"" approach"" + 0.003*"" interventions"" + 0.003*"" also"" + 0.003*"" show"" + 0.003*"" online"" + 0.003*"" discrimination"" + 0.003*"" problem"" + 0.003*"" machine"" + 0.003*"" decisionmaking"" + 0.003*"" platform"""


In [541]:
#Title Model where k=math.sqrt(308/2)
lda_t_2 = gensim.models.ldamodel.LdaModel(corpus=t_corpus, id2word=t_dict, num_topics=k_3)

In [538]:
#Printing Title Model where k=math.sqrt(308/2)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)



title_topics = pd.DataFrame(lda_t_2.print_topics(num_words=20))
title_topics.columns = ["topics", "top 10 terms"]
title_topics

Unnamed: 0,topics,top 10 terms
0,0,"0.039*"" fairness"" + 0.017*"" explanations"" + 0.013*""fair"" + 0.013*"" models"" + 0.009*"" vision"" + 0.009*"" computer"" + 0.009*"" learning"" + 0.009*""model"" + 0.009*"" bias"" + 0.009*"" decisionmaking"" + 0.009*"" explainable"" + 0.009*"" constraints"" + 0.009*"" counterfactual"" + 0.009*"" trust"" + 0.009*"" ai"" + 0.005*"" intelligence"" + 0.005*"" application"" + 0.005*"" slopes"" + 0.005*"" algorithmic"" + 0.005*"" adaptation"""
1,1,"0.042*"" fairness"" + 0.038*""fairness"" + 0.023*"" learning"" + 0.015*"" fair"" + 0.015*"" data"" + 0.012*"" minimal"" + 0.012*"" framework"" + 0.012*"" fairly"" + 0.012*"" warnings"" + 0.012*"" counterfactual"" + 0.012*"" fairmaml"" + 0.008*"" networks"" + 0.008*"" neural"" + 0.008*"" causal"" + 0.008*""causal"" + 0.008*"" recourse"" + 0.008*"" discrimination"" + 0.008*"" testing"" + 0.008*"" model"" + 0.008*"" edge"""
2,2,"0.030*"" fair"" + 0.025*"" division"" + 0.020*"" fairness"" + 0.015*"" ai"" + 0.015*"" gdpr"" + 0.015*"" classification"" + 0.010*"" perspective"" + 0.010*"" interventions"" + 0.010*""algorithmic"" + 0.010*"" education"" + 0.010*"" unfairness"" + 0.010*"" contextual"" + 0.010*"" interpretation"" + 0.010*"" model"" + 0.010*"" linguistic"" + 0.010*""online"" + 0.010*""concept"" + 0.005*"" exclusionary"" + 0.005*"" admissions"" + 0.005*"" us"""
3,3,"0.048*"" fairness"" + 0.028*"" fair"" + 0.016*"" learning"" + 0.012*"" deep"" + 0.012*"" systems"" + 0.008*"" study"" + 0.008*"" bias"" + 0.008*"" multiple"" + 0.008*"" accuracy"" + 0.008*"" constraints"" + 0.008*"" policies"" + 0.008*"" understanding"" + 0.008*"" recommender"" + 0.008*"" data"" + 0.008*"" evaluate"" + 0.008*"" classification"" + 0.004*"" predictions"" + 0.004*"" models"" + 0.004*"" case"" + 0.004*"" extensible"""
4,4,"0.035*"" learning"" + 0.026*""fairness"" + 0.018*"" machine"" + 0.018*"" algorithms"" + 0.013*"" systematic"" + 0.013*"" fair"" + 0.013*"" fairness"" + 0.009*"" predictive"" + 0.009*"" investigating"" + 0.009*"" accountability"" + 0.009*"" online"" + 0.009*"" fairnessaware"" + 0.009*"" robustness"" + 0.009*"" algorithm"" + 0.009*"" deep"" + 0.009*"" models"" + 0.005*"" bias"" + 0.005*""differentially"" + 0.005*"" optimization"" + 0.005*"" approaches"""
5,5,"0.044*"" fairness"" + 0.037*"" learning"" + 0.037*"" fair"" + 0.019*"" machine"" + 0.015*"" models"" + 0.015*"" algorithmic"" + 0.011*"" false"" + 0.008*"" constraints"" + 0.008*"" dynamic"" + 0.008*"" federated"" + 0.008*"" unfairness"" + 0.008*"" impact"" + 0.008*"" towards"" + 0.008*"" representations"" + 0.008*""learning"" + 0.008*""fair"" + 0.008*"" systems"" + 0.004*"" negatives"" + 0.004*""understanding"" + 0.004*"" organizations"""
6,6,"0.028*"" fair"" + 0.020*"" learning"" + 0.016*"" ai"" + 0.016*"" data"" + 0.016*""fairness"" + 0.012*"" trust"" + 0.012*"" making"" + 0.012*"" decision"" + 0.012*""fair"" + 0.008*""learning"" + 0.008*"" accountability"" + 0.008*"" principles"" + 0.008*"" framework"" + 0.008*"" representations"" + 0.008*""fairnessaware"" + 0.008*"" individually"" + 0.008*"" fairness"" + 0.008*"" classification"" + 0.004*""closing"" + 0.004*"" accuracy"""
7,7,"0.064*"" fairness"" + 0.022*""fairness"" + 0.016*"" algorithmic"" + 0.011*"" classifiers"" + 0.011*""fair"" + 0.011*"" constraints"" + 0.011*"" search"" + 0.011*"" clustering"" + 0.011*"" welfare"" + 0.006*""fat"" + 0.006*"" diverse"" + 0.006*"" neural"" + 0.006*"" playing"" + 0.006*"" attributes"" + 0.006*"" bayes"" + 0.006*"" reframing"" + 0.006*"" critically"" + 0.006*"" multiple"" + 0.006*"" evaluation"" + 0.006*"" transparency"""
8,8,"0.023*"" fairness"" + 0.023*""fair"" + 0.020*"" fair"" + 0.014*"" social"" + 0.012*"" algorithmic"" + 0.012*""fairness"" + 0.012*"" classification"" + 0.008*"" definitions"" + 0.008*"" fairnessaware"" + 0.008*"" unfair"" + 0.008*"" towards"" + 0.008*"" constraints"" + 0.008*"" basis"" + 0.008*"" data"" + 0.008*"" via"" + 0.008*""philosophical"" + 0.008*"" systems"" + 0.008*"" optimization"" + 0.008*"" recourse"" + 0.007*"" behavior"""
9,9,"0.067*"" fairness"" + 0.021*"" fair"" + 0.018*"" learning"" + 0.016*"" machine"" + 0.016*""fair"" + 0.011*""algorithmic"" + 0.011*"" models"" + 0.011*"" ai"" + 0.008*"" effects"" + 0.008*"" towards"" + 0.008*"" approach"" + 0.008*"" prediction"" + 0.008*"" classification"" + 0.008*"" twosided"" + 0.008*"" systems"" + 0.005*"" ml"" + 0.005*"" people"" + 0.005*"" metrics"" + 0.005*"" goods"" + 0.005*"" indivisible"""


In [542]:
#Abstract Model where k=math.sqrt(308/2)
lda_a_2 = gensim.models.ldamodel.LdaModel(corpus=a_corpus, id2word=a_dict, num_topics=k_2)

In [543]:
#Printing Abstract Topics where k=math.sqrt(308/2)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)



abstract_topics_k2 = pd.DataFrame(lda_a.print_topics(num_words=20))
abstract_topics_k2.columns = ["topics", "top 10 terms"]
abstract_topics_k2

Unnamed: 0,topics,top 10 terms
0,0,"0.015*"" fairness"" + 0.009*"" learning"" + 0.009*"" data"" + 0.007*"" models"" + 0.006*"" can"" + 0.005*"" systems"" + 0.005*"" model"" + 0.005*"" problem"" + 0.005*"" machine"" + 0.005*"" algorithms"" + 0.004*"" paper"" + 0.004*"" training"" + 0.003*"" different"" + 0.003*"" framework"" + 0.003*"" decisions"" + 0.003*"" constraints"" + 0.003*"" classification"" + 0.003*"" present"" + 0.003*"" fair"" + 0.003*"" decision"""
1,1,"0.027*"" fairness"" + 0.011*"" fair"" + 0.009*"" can"" + 0.006*"" algorithms"" + 0.006*"" data"" + 0.006*"" show"" + 0.005*"" algorithm"" + 0.005*"" learning"" + 0.005*"" problem"" + 0.005*"" model"" + 0.004*"" approach"" + 0.004*"" machine"" + 0.004*"" different"" + 0.004*"" notions"" + 0.004*"" notion"" + 0.004*"" results"" + 0.004*"" method"" + 0.003*"" proposed"" + 0.003*"" two"" + 0.003*"" work"""
2,2,"0.023*"" fairness"" + 0.009*"" data"" + 0.008*"" systems"" + 0.007*"" algorithms"" + 0.006*"" learning"" + 0.005*"" unfairness"" + 0.005*"" can"" + 0.005*"" metrics"" + 0.005*"" show"" + 0.005*"" algorithmic"" + 0.004*"" fair"" + 0.004*"" discrimination"" + 0.004*"" paper"" + 0.004*"" algorithm"" + 0.004*"" machine"" + 0.004*"" also"" + 0.004*"" bias"" + 0.003*"" different"" + 0.003*"" problem"" + 0.003*"" using"""
3,3,"0.025*"" fairness"" + 0.011*"" data"" + 0.010*"" learning"" + 0.008*"" fair"" + 0.007*"" can"" + 0.007*"" machine"" + 0.006*"" model"" + 0.005*"" work"" + 0.005*"" models"" + 0.005*"" systems"" + 0.004*"" problem"" + 0.004*"" framework"" + 0.004*"" algorithms"" + 0.004*"" different"" + 0.004*"" paper"" + 0.004*"" ai"" + 0.004*"" show"" + 0.003*"" propose"" + 0.003*"" constraints"" + 0.003*"" performance"""
4,4,"0.010*"" learning"" + 0.009*"" fairness"" + 0.007*"" model"" + 0.007*"" machine"" + 0.006*"" models"" + 0.005*"" may"" + 0.005*"" data"" + 0.004*"" can"" + 0.004*"" algorithms"" + 0.004*"" social"" + 0.004*"" groups"" + 0.004*"" two"" + 0.003*"" using"" + 0.003*"" datasets"" + 0.003*"" used"" + 0.003*"" work"" + 0.003*"" use"" + 0.003*"" new"" + 0.003*"" group"" + 0.003*"" world"""
5,5,"0.021*"" fairness"" + 0.013*"" model"" + 0.011*"" models"" + 0.009*"" data"" + 0.007*"" learning"" + 0.005*"" can"" + 0.005*"" fair"" + 0.005*"" used"" + 0.004*"" different"" + 0.004*"" propose"" + 0.004*"" algorithms"" + 0.004*"" also"" + 0.004*"" explanations"" + 0.004*"" decision"" + 0.004*"" approach"" + 0.004*"" work"" + 0.003*"" machine"" + 0.003*"" paper"" + 0.003*"" sensitive"" + 0.003*"" framework"""
6,6,"0.026*"" fairness"" + 0.011*"" learning"" + 0.008*"" can"" + 0.007*"" algorithms"" + 0.006*"" algorithmic"" + 0.006*"" systems"" + 0.005*"" show"" + 0.004*"" work"" + 0.004*"" data"" + 0.004*"" machine"" + 0.004*"" different"" + 0.004*"" new"" + 0.004*"" fair"" + 0.004*"" social"" + 0.004*"" may"" + 0.004*"" decision"" + 0.003*"" welfare"" + 0.003*"" ai"" + 0.003*"" agents"" + 0.003*"" models"""
7,7,"0.010*"" data"" + 0.010*"" fairness"" + 0.007*"" can"" + 0.007*"" users"" + 0.006*"" systems"" + 0.005*"" algorithm"" + 0.005*"" different"" + 0.004*"" fair"" + 0.004*"" models"" + 0.004*"" problem"" + 0.004*"" algorithms"" + 0.004*"" results"" + 0.004*"" show"" + 0.004*"" stories"" + 0.004*"" decision"" + 0.004*"" paper"" + 0.003*"" learning"" + 0.003*"" use"" + 0.003*"" performance"" + 0.003*"" bias"""
8,8,"0.011*"" fairness"" + 0.007*"" learning"" + 0.005*"" can"" + 0.005*"" groups"" + 0.005*"" fair"" + 0.005*"" two"" + 0.005*"" bias"" + 0.004*"" algorithm"" + 0.004*"" data"" + 0.004*"" show"" + 0.004*"" accuracy"" + 0.004*"" also"" + 0.003*"" existing"" + 0.003*"" model"" + 0.003*"" research"" + 0.003*"" recommendations"" + 0.003*"" classification"" + 0.003*"" models"" + 0.003*"" machine"" + 0.003*"" analysis"""
9,9,"0.019*"" fairness"" + 0.011*"" fair"" + 0.008*"" can"" + 0.007*"" data"" + 0.005*"" different"" + 0.005*"" groups"" + 0.004*"" users"" + 0.004*"" agents"" + 0.004*"" impact"" + 0.004*"" systems"" + 0.004*"" also"" + 0.004*"" paper"" + 0.003*"" clustering"" + 0.003*"" problem"" + 0.003*"" decisions"" + 0.003*"" show"" + 0.003*"" learning"" + 0.003*"" algorithmic"" + 0.003*"" algorithm"" + 0.003*"" algorithms"""


### Model Saving

In [571]:
#title model where k=21
lda_t_1.save('../data/data_products/models/lda_titles_k21')

#title model where k=math.sqrt(308/2)
lda_t_2.save('../data/data_products/models/lda_titles_k12')

#abstract model where k=21
lda_a_1.save('../data/data_products/models/lda_abstracts_k21')

#abstract model where k=math.sqrt(308/2)
lda_a_2.save('../data/data_products/models/lda_abstracts_k12')

### Topic Model Visualization: 


In [516]:
import wordcloud

In [555]:
import pyLDAvis.gensim_models

In [564]:
#title model where k=21
lda_t_1_display = pyLDAvis.gensim_models.prepare(lda_t_1, corpus=t_corpus, dictionary=t_dict, sort_topics=False)

pyLDAvis.display(lda_t_1_display)

  default_term_info = default_term_info.sort_values(


In [559]:
#title model where k=math.sqrt(308/2)
lda_t_2_display = pyLDAvis.gensim_models.prepare(lda_t_2, corpus=t_corpus, dictionary=t_dict, sort_topics=False)

pyLDAvis.display(lda_t_2_display)

  default_term_info = default_term_info.sort_values(


In [561]:
#abstract model where k=21
lda_a_1_display = pyLDAvis.gensim_models.prepare(lda_a_1, corpus=a_corpus, dictionary=a_dict, sort_topics=False)

pyLDAvis.display(lda_a_1_display)

  default_term_info = default_term_info.sort_values(


In [562]:
#abstract model where k=math.sqrt(308/2)
lda_a_2_display = pyLDAvis.gensim_models.prepare(lda_a_2, corpus=a_corpus, dictionary=a_dict, sort_topics=False)

pyLDAvis.display(lda_a_2_display)

  default_term_info = default_term_info.sort_values(


### Saving pyLDAvis:

In [572]:
#saving title model vis where k=21
t_k21 = pyLDAvis.gensim_models.prepare(lda_t_1, corpus=t_corpus, dictionary=t_dict, sort_topics=False)
pyLDAvis.save_html(t_k21,'../data/data_products/vis/lda_titles_k21.html')

  default_term_info = default_term_info.sort_values(


In [573]:
#saving title model k=math.sqrt(308/2)
t_k12 = pyLDAvis.gensim_models.prepare(lda_t_2, corpus=t_corpus, dictionary=t_dict, sort_topics=False)
pyLDAvis.save_html(t_k12,'../data/data_products/vis/lda_titles_k12.html')

  default_term_info = default_term_info.sort_values(


In [574]:
#saving abstract model where k=21
a_k21 = pyLDAvis.gensim_models.prepare(lda_a_1, corpus=a_corpus, dictionary=a_dict, sort_topics=False)
pyLDAvis.save_html(a_k21,'../data/data_products/vis/lda_abstracts_k21.html')

  default_term_info = default_term_info.sort_values(


In [575]:
#saving abstract model k=math.sqrt(308/2)
a_k12 = pyLDAvis.gensim_models.prepare(lda_a_2, corpus=a_corpus, dictionary=a_dict, sort_topics=False)
pyLDAvis.save_html(a_k12,'../data/data_products/vis/lda_abstracts_k12.html')

  default_term_info = default_term_info.sort_values(


### Model Evaluating

#notes fomr chris:

- try to pick a smaller number of topics cause this may be getting to in the weeds and there might be a lot of overlap
- to get a smaller number try using square root of n/2 , where n is the number of documents
- then compare difference and variation between n=12 and n=21

-don't worry at this point about inclusion of words like "fairness" "fair" cause taking them out at this point may mess with th eword proximity LDA uses ton decide on topics. 




In [None]:
#fit a model where k=3 as a guess number
#fit a model where k=10 as an inbetween
#fit a model where k=20 per Narayanan

#pickle our models

#look into feature engineering to see if there is a better number of topics we can try
#look into clustering algorithms like k-means which we've done before

#create a pylda viz and a wordcloud for each of our topics top 10 words
#then do some type of reading into and explanation of each. 

#getting rid of words that appear to frequently across topics. 