The preprocessing steps below are based on: 

Notebook to accompany the article by Arseniev-Koehler and Foster, "Teaching an algorithm what it means to be fat: machine-learning as a model for cultural learning." Code last checked with Python 3 on Windows on 11/27/2019. Please cite our paper or <a href="https://github.com/arsena-k/Word2Vec-bias-extraction">GitHub repo</a> if reused. 

The sample csv file can be found in the <a href="https://www.dropbox.com/sh/hlkvu07umu5f3ok/AABfDVdAlQ3K5DOoaaACQPw-a?dl=0">course Dropbox folder</a>.

<b>Getting Started:</b>
<ul>
    <li>Install the needed packages; in particular, make sure Cython is installed.</li>
    <li>Know where this Jupyter Notebook is saved on your computer. That folder is your "working directory", and it is where we will keep all models and data files. Any model you save is written there, and any dataset or pretrained model you download should be saved there too: when loading files, Jupyter Notebook looks in your working directory by default.</li>
</ul>
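
If you are unsure which folder is your working directory, a minimal sketch like the one below can report it and check whether a file is present. The helper name `check_in_working_directory` is hypothetical and not part of the original notebook; the file name is the sample csv used later on.

```python
from pathlib import Path

# Hypothetical helper (not from the original notebook): return the current
# working directory and whether `filename` exists directly inside it.
def check_in_working_directory(filename):
    cwd = Path.cwd()
    return cwd, (cwd / filename).is_file()

cwd, found = check_in_working_directory('NativeVoicesCorpus-Sample.csv')
print('Working directory:', cwd)
print('Sample csv found:', found)
```

If the second line prints `False`, move the csv file into the folder printed on the first line before continuing.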

In [19]:
import cython #ENSURE cython package is installed on computer/canopy
from gensim.models import phrases 
from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html
from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial
from statistics import mean
from gensim.models import Word2Vec, KeyedVectors
import pandas as pd
from string import ascii_letters, digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline
from gensim.test.utils import datapath
#np.set_printoptions(threshold=np.inf) #uncomment to print full output (requires "import numpy as np" first)

#if you don't have any of these libraries, you will need to install them first with conda install or pip install
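
To see which libraries still need installing before running the imports, one option is to ask Python whether each top-level module can be found. This is only a sketch: the helper `find_missing` and the list of module names are assumptions drawn from the import cell, and the pip or conda package name can differ from the import name (for example, `sklearn` is installed as scikit-learn).

```python
import importlib.util

# Hypothetical helper: return the module names that cannot be located
# in the current Python environment.
def find_missing(module_names):
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Top-level modules used by the import cell (assumed list).
required = ['cython', 'gensim', 'sklearn', 'scipy', 'pandas', 'matplotlib']
missing = find_missing(required)
print('Missing modules:', missing)
```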

In [20]:
dat = pd.read_csv('NativeVoicesCorpus-Sample.csv') #this file should be in your working directory - the same folder where this Jupyter Notebook is saved
dat #view what it looks like

,Document,Date,Text
0,ASP-v1-003,1789-08-22,"No. 3. [1st Session. THE SIX NATIONS, THE WYAN..."
1,ASP-v1-004,1789-08-22,No. 4. [1st Session. THE SOUTHERN TRIBES. COMM...
2,ASP-v1-005,1789-09-16,No. 5. [1st Session. WABASH. COMMUNICATED to T...
3,ASP-v1-006,1789-09-17,"No. 6. [1st Session. THE SIX NATIONS, THE WYAN..."
4,ASP-v1-007,1789-09-18,No. 7. 1st Session. INDIAN TREATIES. COMMUNICA...
5,ASP-v1-008,1790-01-11,No. 8. [2d Session. THE CREEKS. COMMUNICATED T...
6,ASP-v1-010,1790-08-04,No. 10. [2d Session. r THE CREEKS. COMMUNICATE...
7,ASP-v1-011,1790-08-06,No. 11. [2d Session. THE CREEKS. COMMUNICATED ...
8,ASP-v1-012,1790-08-07,No. 12. [2d Session. THE CREEKS. COMMUNICATED ...
9,ASP-v1-013,1790-08-11,No. 13. [2d Session. THE CHEROKEES. COMMUNICAT...
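
The next cell strips digits with `str.translate` and then splits each document on whitespace. A toy sketch of that step on a single short string is shown here; the `tokenize` helper and the sample string are illustrative assumptions, not part of the original notebook.

```python
from string import ascii_letters, digits

# Same table as the notebook: letters map to themselves, and every
# character listed in the third argument (the digits) is deleted.
translator = str.maketrans(ascii_letters, ascii_letters, digits)

# Hypothetical helper: delete digits, trim surrounding whitespace,
# then split on whitespace.
def tokenize(text):
    return text.translate(translator).strip().split()

sample = "No. 4. [1st Session. THE SOUTHERN TRIBES."
tokens = tokenize(sample)
print(tokens)
# ['No.', '.', '[st', 'Session.', 'THE', 'SOUTHERN', 'TRIBES.']
```

Punctuation is left untouched, so a number followed by a period leaves a bare `'.'` token, and `'[1st'` becomes `'[st'`.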


In [21]:
dat2= [str(i[2]) for i in dat.values] #keep only the text of each document, stored at position 2 of each row; positions 0 and 1 hold the document ID and the date
translator = str.maketrans(ascii_letters, ascii_letters, digits) #stripping digits here, see: https://stackoverflow.com/questions/32538305/using-translate-on-a-string-to-strip-digits-python-3

sentences=[]
for i in dat2:
    docs= i.translate(translator)
    docs= docs.strip()
    sentences.append(docs.split())
    
print(sentences[1])
print(len(sentences)) # check that there are actually around 50 documents here. It's ok if the text is not perfectly clean (e.g. some nonsensical words) for these purposes

['No.', '.', '[st', 'Session.', 'THE', 'SOUTHERN', 'TRIBES.', 'COMMUNICATED', 'TO', 'THE', 'SENATE', 'AUGUST', ',', '.', 'The', 'President', 'of', 'the', 'United', 'States', 'came', 'into', 'the', 'Senate', 'Chamber,', 'attended', 'by', 'General', 'Knox,', 'and', 'laid', 'before', 'the', 'Senate', 'the', 'following', 'statement', 'of', 'facts,', 'with', 'the', 'questions', 'thereto', 'annexed,', 'for', 'their', 'advice', 'and', 'consent:', 'To', 'conciliate', 'the', 'powerful', 'tribes', 'of', 'Indians', 'in', 'the', 'Southern', 'district,', 'amounting', 'probably', 'to', 'fourteen', 'thousand', 'fighting', 'men,', 'and', 'to', 'attach', 'them', 'firmly', 'to', 'the', 'United', 'States,', 'may', 'be', 'regarded', 'as', 'highly', 'worthy', 'of', 'the', 'serious', 'attention', 'of', 'Government.', 'The', 'measure', 'includes,', 'not', 'only', 'peace', 'and', 'security', 'to', 'the', 'whole', 'southern', 'frontier,', 'but', 'is', 'calculated', 'to', 'form', 'a', 'barrier', 'against', 'the