* This file generates a comparison dataframe (see last cell) covering every combination of normalization (stemming vs lemmatizing), stopword handling, and embedding algorithm (gensim vs GloVe):
  * stemming + stopwords + gensim
  * stemming + no stopwords + gensim
  * stemming + stopwords + glove
  * stemming + no stopwords + glove 
  * lemmatizing + stopwords + gensim
  * lemmatizing + no stopwords + gensim
  * lemmatizing + stopwords + glove
  * lemmatizing + no stopwords + glove 
* Gensim's Word2Vec trains a neural network (CBOW by default), whereas GloVe fits word vectors to a word co-occurrence matrix (no neural network)
* The models were compared empirically across different years by inspecting the top 20 words closest (using most_similar) to an empirically chosen word
  * See the word embedding presentation linked here ____ for the years and words they were compared against
* **Conclusion: Empirically determined that the Gensim model with lemmatization and stopwords included is the best approach**
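
To make the CBOW vs co-occurrence distinction above concrete, here is a minimal pure-Python sketch. The helper names `cbow_pairs` and `cooccurrence_counts` and the toy sentence are illustrative assumptions, not part of this notebook; the counts are unweighted, which is a simplification of what a GloVe corpus builder records.

```python
from collections import Counter

def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, the kind of examples CBOW predicts from."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs within a window, the kind of statistic GloVe is fit to."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[tuple(sorted((word, tokens[j])))] += 1
    return counts

# Hypothetical toy sentence, not taken from the paper corpus
toy_sentence = ["the", "optic", "nerve", "enters", "the", "eye"]
print(cbow_pairs(toy_sentence)[2])
print(cooccurrence_counts(toy_sentence)[("nerve", "optic")])
```

CBOW consumes the pairs one at a time during training, while the co-occurrence route aggregates the whole corpus into counts first and only then fits vectors.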

# Training GloVe model on neuroscience papers

# Conclusion

Gensim model with lemmatization and stopwords included is the best approach

# Testing

In [None]:
!pip install python-docx

Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184508 sha256=8d2e573bc0b3b732d5cc50e9803869659ef042203d1fa202d9cd4eb16c3c366b
  Stored in directory: /root/.cache/pip/wheels/f6/6f/b9/d798122a8b55b74ad30b5f52b01482169b445fbb84a11797a6
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


In [None]:
!pip install glove_python-binary

Collecting glove_python-binary
  Downloading glove_python_binary-0.2.0-cp37-cp37m-manylinux1_x86_64.whl (948 kB)

In [None]:
import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [None]:
from docx import Document
import nltk
nltk.download('punkt')
import re
from nltk import sent_tokenize
import pandas as pd
from nltk.corpus import stopwords
nltk.download('stopwords')
import pickle
import numpy as np
import glob

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer 
from nltk.stem import WordNetLemmatizer

In [None]:
import nltk 
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from glove import Corpus, Glove

In [1]:
!git clone 'https://github.com/igorbrigadir/stopwords.git'

Cloning into 'stopwords'...
remote: Enumerating objects: 149, done.
remote: Total 149 (delta 0), reused 0 (delta 0), pack-reused 149
Receiving objects: 100% (149/149), 85.27 KiB | 498.00 KiB/s, done.
Resolving deltas: 100% (52/52), done.


In [4]:
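alir3z4_data = '/content/stopwords/en/alir3z4.txt'

# Read one stopword per line. pd.read_csv would treat the first entry ("'ll")
# as the column header and leave it out of the list.
with open(alir3z4_data, 'r') as fn:
    new_stops = [line.strip() for line in fn if line.strip()]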

In [None]:
DOMAIN_STOPS = {'pubmed', 'et', 'al', 'page'}
STOPWORDS = set(
    stopwords.words('english') + stopwords.words('german') + stopwords.words('dutch')
    + stopwords.words('french') + stopwords.words('spanish') + new_stops
) | DOMAIN_STOPS

In [None]:
ROOT = "/content/drive/MyDrive/regen_x"

In [None]:
# for lemmatization 
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])

In [None]:
def get_docx(file_path):
    doc = []
    for para in Document(file_path).paragraphs:
        if para.text == "":
            continue
        doc += (sent_tokenize(para.text.lower()))
    return doc


# This function takes a folder of files and returns one list whose elements are
# the processed sentences of all the files (each sentence is itself a list of words)
# def get_proc_docs(training_paper_year, STARTWORDS, STOPWORDS, max_papers=None, verbose=True, use_porter=False, useStopWords=False):
#   global_path = "/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/"
#   folder_path = global_path + "{}/".format(training_paper_year)
#   print(folder_path) 
#   file_paths = glob.glob(folder_path + "*.docx")

#   print("Number of files: {}".format(len(file_paths)))
#   if len(file_paths) == 0:
#     raise Exception("Folder has no files - maybe drive was not mounted?")
#   ## -- Collecting Papers from Given Year -- ##
#   proc_docs = [] 

#   counter = 1
#   length = len(file_paths)
#   for f in file_paths:
#     doc = get_docx(f)
    
#     for sentence in doc:
#       # don't think we need to remove stopwords and such if we're training embeddings 
#       # do lemmatization here as well 

#       proc_sentence = [] 
#       if useStopWords:
#         proc_sentence = [word for word in re.findall(r'\w+', sentence)]
#       else:
#         proc_sentence = [word for word in re.findall(r'\w+', sentence)]

#       if use_porter:
#         proc_sentence = do_stemming(proc_sentence) 
#       else:
#         proc_sentence = do_lemmatizing(proc_sentence) 

#       if useStopWords and use_porter:
#         proc_sentence = [word for word in proc_sentence if word not in STEMMED_STOPWORDS]
#       elif useStopWords and not use_porter:
#         proc_sentence = [word for word in proc_sentence if word not in LEMMATIZED_STOPWORDS]

#       proc_docs.append(proc_sentence)  

#     if(verbose):
#       print("\t{}/{}".format(counter, length))
#     counter += 1

#     if max_papers != None:
#       if counter == max_papers+1:
#         break 

#   return proc_docs

def get_proc_docs(training_paper_year, STARTWORDS, STOPWORDS, max_papers=None, verbose=True, use_porter=False, useStopWords=False):
  global_path = "/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/"
  folder_path = global_path + "{}/".format(training_paper_year)
  print(folder_path) 
  file_paths = glob.glob(folder_path + "*.docx")

  print("Number of files: {}".format(len(file_paths)))
  if len(file_paths) == 0:
    raise Exception("Folder has no files - maybe drive was not mounted?")
  ## -- Collecting Papers from Given Year -- ##
  proc_docs = [] 

  counter = 1
  length = len(file_paths)
  for f in file_paths:
    doc = get_docx(f)
    
    for sentence in doc:
      # useStopWords=True removes stopwords and tokens of 1-2 characters;
      # useStopWords=False keeps every token
      if useStopWords:
        proc_sentence = [word for word in re.findall(r'\w+', sentence) if ((len(word) > 2) and (word not in STOPWORDS))]
      else:
        proc_sentence = re.findall(r'\w+', sentence)

      if use_porter:
        proc_sentence = do_stemming(proc_sentence) 
      else:
        proc_sentence = do_lemmatizing(proc_sentence) 

      proc_docs.append(proc_sentence)  

    if verbose:
      print("\t{}/{}".format(counter, length))
    counter += 1

    # Stop once max_papers files have been processed
    if max_papers is not None and counter == max_papers + 1:
      break

  return proc_docs

def do_stemming(filtered):
  # Build the stemmer once instead of once per token
  stemmer = PorterStemmer()
  # Alternatives: LancasterStemmer() or SnowballStemmer('english')
  return [stemmer.stem(f) for f in filtered]

def do_lemmatizing(filtered):
  # convert list to string 
  spacy_parsed_text = nlp(" ".join(filtered)) 
  # Get the lemma for each token in the parsed text 
  
  # I wanted to keep pronouns so not taking lemma if it's a pronoun but if you want to remove pronouns use below commented line 
  # return " ".join([token.lemma_ for token in doc])

  # return as list of words again 
  return [token.lemma_ if token.lemma_ != '-PRON-' else token.lower_ for token in spacy_parsed_text]
 

def get_start_stop():
    domain_stops = {'pubmed', 'et', 'al', 'page'}
    with open('/content/stopwords/en/alir3z4.txt', 'r') as fn:
        new_stops = [line.strip() for line in fn.readlines()]
    STOPWORDS =  set(stopwords.words('english') + stopwords.words('german') +  stopwords.words('dutch') + stopwords.words('french') +  stopwords.words('spanish')  + new_stops) | domain_stops

    fn = glob.glob(ROOT + '/data/start-words/*')
    ALL_STARTS = [pickle.load(open(f , 'rb')) for f in fn]
    STARTWORDS = {}
    for f in ALL_STARTS:
      STARTWORDS.update(f)
    STARTWORDS = set(STARTWORDS.keys())

    assert(type(STOPWORDS)==set and type(STARTWORDS)==set)
    return (STARTWORDS, STOPWORDS)
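
The token filtering that `get_proc_docs` applies when `useStopWords=True` can be sketched in isolation as follows. This is a simplified stand-in under stated assumptions: `tokenize`, `toy_stops`, and the sample sentence are hypothetical, and the real pipeline goes on to stem or lemmatize the result.

```python
import re

def tokenize(sentence, stopwords=None):
    """Split into word tokens; with a stopword set, also drop stopwords and 1-2 character tokens."""
    words = re.findall(r'\w+', sentence.lower())
    if stopwords is None:
        return words
    return [w for w in words if len(w) > 2 and w not in stopwords]

# Hypothetical stopword set and sentence, for illustration only
toy_stops = {"the", "was", "and"}
sentence = "The cornea was opaque and the pupil dilated in 3 days."

kept_all = tokenize(sentence)             # corresponds to useStopWords=False
filtered = tokenize(sentence, toy_stops)  # corresponds to useStopWords=True
print(kept_all)
print(filtered)
```

Note that the length filter also removes short tokens such as "in" and "3" even though they are not in the stopword set.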

In [None]:
STARTWORDS, STOPWORDS = get_start_stop()

In [None]:
# Sets give constant-time membership tests when filtering tokens
STEMMED_STOPWORDS = set(do_stemming(STOPWORDS))
LEMMATIZED_STOPWORDS = set(do_lemmatizing(STOPWORDS))

# Glove Model

In [None]:
def train_glove(proc_docs):
  # Create a corpus object
  corpus = Corpus()

  # Fit the corpus to build the co-occurrence matrix that GloVe is trained on
  corpus.fit(proc_docs, window=10)

  glove = Glove(no_components=5, learning_rate=0.05) 
  glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
  glove.add_dictionary(corpus.dictionary)
  # glove.save('glove.model')

  return glove 

In [None]:
def get_glove_model(year, STARTWORDS, STOPWORDS, max_papers=None):
  proc_docs = get_proc_docs(year, STARTWORDS, STOPWORDS, max_papers)
  return train_glove(proc_docs)

In [None]:
# This function takes a folder of files and returns one list whose elements are
# the processed files (each file is a single flat list of words, not split into sentences)
# def get_proc_docs_glove(training_paper_year, STARTWORDS, STOPWORDS, max_papers=None, verbose=True, use_porter=False, useStopWords=False):
#   global_path = "/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/"
#   folder_path = global_path + "{}/".format(training_paper_year)
#   print(folder_path) 
#   file_paths = glob.glob(folder_path + "*.docx")

#   print("Number of files: {}".format(len(file_paths)))
#   if len(file_paths) == 0:
#     raise Exception("Folder has no files - maybe drive was not mounted?")
#   ## -- Collecting Papers from Given Year -- ##
#   proc_docs = [] 

#   counter = 1
#   length = len(file_paths)
#   for f in file_paths:
#     doc = ' '.join(get_docx(f))
#     # proc_doc = [word for word in re.findall(r'\w+', doc.lower()) if ((word in STARTWORDS) and (len(word) > 2) and (word not in STOPWORDS))]
    
#     proc_doc = [word for word in re.findall(r'\w+', doc) if ((len(word) > 2))]

#     if use_porter:
#       proc_doc = do_stemming(proc_doc)      
#     else:
#       proc_doc = do_lemmatizing(proc_doc)

#     if useStopWords and use_porter:
#       proc_doc = [word for word in proc_doc if (word not in STEMMED_STOPWORDS)]
#     if useStopWords and not use_porter:
#       proc_doc = [word for word in proc_doc if (word not in LEMMATIZED_STOPWORDS)]



#     proc_docs.append(proc_doc)
#     print("{}/{}".format(counter, length))
#     counter += 1

#     if max_papers != None:
#       if counter == max_papers+1:
#         break 

#   return proc_docs

def get_proc_docs_glove(training_paper_year, STARTWORDS, STOPWORDS, max_papers=None, verbose=True, use_porter=False, useStopWords=False):
  global_path = "/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/"
  folder_path = global_path + "{}/".format(training_paper_year)
  print(folder_path) 
  file_paths = glob.glob(folder_path + "*.docx")

  print("Number of files: {}".format(len(file_paths)))
  if len(file_paths) == 0:
    raise Exception("Folder has no files - maybe drive was not mounted?")
  ## -- Collecting Papers from Given Year -- ##
  proc_docs = [] 

  counter = 1
  length = len(file_paths)
  for f in file_paths:
    doc = ' '.join(get_docx(f))
    # proc_doc = [word for word in re.findall(r'\w+', doc.lower()) if ((word in STARTWORDS) and (len(word) > 2) and (word not in STOPWORDS))]
    
    # useStopWords=True removes stopwords and tokens of 1-2 characters;
    # useStopWords=False keeps every token
    if useStopWords:
      proc_doc = [word for word in re.findall(r'\w+', doc) if ((len(word) > 2) and (word not in STOPWORDS))]
    else:
      proc_doc = re.findall(r'\w+', doc)


    if use_porter:
      proc_doc = do_stemming(proc_doc)      
    else:
      proc_doc = do_lemmatizing(proc_doc)



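    proc_docs.append(proc_doc)
    # Respect the verbose flag, as get_proc_docs does
    if verbose:
      print("{}/{}".format(counter, length))
    counter += 1

    # Stop once max_papers files have been processed
    if max_papers is not None and counter == max_papers + 1:
      break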

  return proc_docs
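
`get_proc_docs_glove` keeps each paper as one flat token list instead of a list of sentences. One consequence is that context windows can span sentence boundaries. The sketch below illustrates this with a hypothetical helper `window_pairs` and two made-up sentences; it is not the co-occurrence logic of the glove package itself.

```python
def window_pairs(tokens, window=2):
    """Return the set of unordered word pairs lying within `window` tokens of each other."""
    pairs = set()
    for i, word in enumerate(tokens):
        for other in tokens[max(0, i - window):i]:
            pairs.add(frozenset((word, other)))
    return pairs

# Hypothetical two-sentence document
sentences = [["retina", "detached"], ["patient", "recovered"]]

# Sentence-level contexts: pairs never cross a sentence boundary
per_sentence = set()
for s in sentences:
    per_sentence |= window_pairs(s)

# Document-level contexts: one flat token list, so windows span boundaries
flat = [w for s in sentences for w in s]
per_document = window_pairs(flat)

cross_boundary = per_document - per_sentence
print(len(per_sentence), len(per_document), len(cross_boundary))
```

Whether the extra cross-sentence pairs help or add noise is an empirical question; the sketch only shows that the two preprocessing functions feed the models different context sets.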

# Gensim Model

In [None]:
from gensim.models import Word2Vec

# Comparison

In [None]:
import itertools 
set(itertools.product([True, False], repeat=2))

{(False, False), (False, True), (True, False), (True, True)}
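
The four flag pairs above, crossed with the two algorithms, give the eight configurations listed at the top of the notebook. A small sketch of that enumeration follows; `column_label` is a hypothetical helper that mirrors the column naming used in the comparison dataframe, and the ordering here follows `itertools.product` rather than the order used during training.

```python
import itertools

def column_label(use_porter, remove_stopwords, algorithm):
    """Build a column name from the two preprocessing flags and the algorithm name."""
    pre = "Stemming " if use_porter else "Lemmatization "
    stop = "No Stopwords - " if remove_stopwords else "*Stopwords Included* - "
    return pre + stop + algorithm

labels = [
    column_label(use_porter, remove_stopwords, algorithm)
    for use_porter, remove_stopwords in itertools.product([True, False], repeat=2)
    for algorithm in ("Gensim", "Glove")
]

for label in labels:
    print(label)
```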

In [None]:
def train_models_for_year(year, word):
  df = pd.DataFrame()

  permutations = [(True, False), (True, True), (False, False), (False, True)]
  for p in permutations:
    proc_docs = get_proc_docs(year, STARTWORDS, STOPWORDS, verbose=True, use_porter=p[0], useStopWords=p[1])
    proc_docs_glove = get_proc_docs_glove(year, STARTWORDS, STOPWORDS, verbose=True, use_porter=p[0], useStopWords=p[1]) # don't split into sentences for GloVe

    gensim_model = Word2Vec(sentences=proc_docs, min_count=1) 
    glove_model = train_glove(proc_docs_glove) 

    # p[0] selects stemming vs lemmatization; p[1] == True means stopwords were removed
    pre = "Stemming " if p[0] else "Lemmatization "
    stop = "No Stopwords - " if p[1] else "*Stopwords Included* - "

    try:
      gensim_similar = gensim_model.wv.most_similar(word, topn=20)
      print(gensim_similar)
      df[pre + stop + "Gensim"] = [word_tuple[0] for word_tuple in gensim_similar]
    except KeyError:
      df[pre + stop + "Gensim"] = ["Word Not Found"] * 20

    try:
      # glove-python drops the query word from its results, so number=21 yields 20 neighbours
      glove_similar = glove_model.most_similar(word, number=21)
      print(glove_similar)
      df[pre + stop + "Glove"] = [word_tuple[0] for word_tuple in glove_similar]
    except Exception:  # glove-python raises a plain Exception for unknown words
      df[pre + stop + "Glove"] = ["Word Not Found"] * 20

  return df
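
Both `most_similar` calls used above rank vocabulary words by how close their vectors are to the query word's vector. A minimal pure-Python sketch of that idea, assuming cosine similarity as the closeness measure, is shown below; the function names and the two-dimensional toy vectors are made up for illustration and are not either library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(vectors, query, topn=3):
    """Rank every other word by cosine similarity to `query`; raises KeyError if it is missing."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]

# Hypothetical 2-d embeddings; trained vectors have many more dimensions
toy_vectors = {
    "eye": [1.0, 0.0],
    "cornea": [0.9, 0.1],
    "lens": [0.7, 0.3],
    "month": [0.0, 1.0],
}
neighbours = [w for w, _ in most_similar(toy_vectors, "eye")]
print(neighbours)
```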
  

In [None]:
df = train_models_for_year(1907, "eye")

/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/1907/
Number of files: 5
	1/5
	2/5
	3/5
	4/5
	5/5
/content/drive/MyDrive/regen_x/data/ocr_paper_COMPREHENSIVE/1907/
Number of files: 5
1/5
2/5
3/5
4/5
5/5
Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29
[('part', 0.9918349981307983), ('other', 0.9902467727661133), ('side', 0.9868088364601135), ('cornea', 0.9850528240203857), ('muscl', 0.9843302965164185), ('lesion', 0.9836300611495972), ('lower', 0.9828320741653442), ('involv', 0.9822753071784973), ('on', 0.9810366630554199), ('upper', 0.9804495573043823), ('sensori', 0.9785416126251221), ('motor', 0.977988600730896), ('posterior', 0.9775254726409912), ('cell', 0.9768990874290466), ('anterior', 0.9768800735473633), ('region'

In [None]:
df

| | Stemming *Stopwords Included* - Gensim | Stemming *Stopwords Included* - Glove | Stemming No Stopwords - Gensim | Stemming No Stopwords - Glove | Lemmatization *Stopwords Included* - Gensim | Lemmatization *Stopwords Included* - Glove | Lemmatization No Stopwords - Gensim | Lemmatization No Stopwords - Glove |
|---|---|---|---|---|---|---|---|---|
| 0 | part | should | treatment | angioma | part | operation | patient | lapse |
| 1 | other | neuralgia | patient | genev | lesion | sufficiently | report | direction |
| 2 | side | all | report | sound | side | semaphore | cornea | application |
| 3 | cornea | make | oper | enucl | motor | later | age | exciting |
| 4 | muscl | patient | age | sit | other | try | treatment | chorea |
| 5 | lesion | with | solut | correct | neuron | child | operation | subject |
| 6 | lower | ptosi | cornea | myopic | position | repeat | solution | host |
| 7 | involv | short | day | experi | diagnosis | his | 3 | quiet |
| 8 | on | brain | 1 | demonstr | cornea | use | paper | hausmann |
| 9 | upper | agitan | month | quiet | brain | instrument | lens | rosis |
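
The comparison in this notebook was done by eye. As a rough supplement, the overlap between two columns can be quantified. The sketch below uses the first ten rows shown above for the two "Stopwords Included" Gensim columns; the `jaccard` helper is an assumption added for illustration and was not part of the original evaluation. Stems and lemmas of the same word (for example `muscl`) are treated as different strings, so the score understates the true agreement.

```python
# First ten neighbours of "eye" (1907), copied from the dataframe output above
stem_gensim = ["part", "other", "side", "cornea", "muscl",
               "lesion", "lower", "involv", "on", "upper"]
lemma_gensim = ["part", "lesion", "side", "motor", "other",
                "neuron", "position", "diagnosis", "cornea", "brain"]

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

shared = sorted(set(stem_gensim) & set(lemma_gensim))
score = jaccard(stem_gensim, lemma_gensim)
print(shared, round(score, 3))
```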
