## Vocabulary building on Semantic Scholar

The Semantic Scholar dataset is our largest dataset of purely computer science text. We would like to build an overall vocabulary from it. The data has already been cleaned using the cleaning3 script.

- Remove stopwords (the text is lemmatised, so the stopword list passed in must be lemmatised too)
- Find a representative list of 1-3grams

In order to filter, we would like to keep only terms that occur at least a certain number of times. However, since recent years contain more data, it is unfair to set a single minimum document frequency over the entire dataset.

In our previous work:

**For each year:**
- remove stopwords
- use `CountVectorizer` to create a vocabulary of ngrams (1-5), using a `min_df` of 6. The idea is that if a term is important, at some point it will appear in at least 6 papers in a single year. This limits the size of the vocabulary.
- We then attempted to remove redundant terms (more important when using large ngrams). This prevents too much term overlap.

However:
- We are working with a very large amount of data
- We have not yet tried to remove German abstracts

What we could do this time:
- remove German abstracts
- remove stopwords
- use `CountVectorizer` to create a vocabulary of 1-5grams using a `min_df` of 6
- vectorize the entire dataset and apply a second `min_df` of 20. We cannot reasonably work out trends for terms which occur fewer than 20 times in a dataset of 2 million abstracts.
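
The two thresholds in this plan can be sketched in plain Python. This is only an illustration under assumptions: the function names, the toy documents and the tiny thresholds are made up for the example, and a `Counter` stands in for `CountVectorizer`. A term's count for a year is only added if it clears the yearly threshold, and the summed counts are then filtered by the overall threshold.

```python
from collections import Counter


def extract_ngrams(text, max_n, stopwords):
    """Return the set of 1..max_n-grams in one document, after dropping stopwords."""
    tokens = [t for t in text.split() if t not in stopwords]
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def yearly_document_frequency(documents, max_n, stopwords):
    """Count, for each term, how many documents of one year contain it."""
    df = Counter()
    for doc in documents:
        # A set per document means each term is counted at most once per document.
        df.update(extract_ngrams(doc, max_n, stopwords))
    return df


def build_vocabulary(docs_by_year, yearly_min_df, overall_min_df,
                     max_n=3, stopwords=frozenset()):
    """Apply a per-year threshold first, then a threshold on the summed counts."""
    totals = Counter()
    for documents in docs_by_year.values():
        df = yearly_document_frequency(documents, max_n, stopwords)
        for term, count in df.items():
            if count >= yearly_min_df:  # stage 1: per-year minimum
                totals[term] += count
    # Stage 2: minimum over the whole dataset.
    return {t: c for t, c in totals.items() if c >= overall_min_df}


# Toy data (hypothetical, not from the dataset).
docs_by_year = {
    1999: ["deep learn model", "deep learn", "graph model"],
    2000: ["deep learn the model", "deep learn graph", "the graph"],
}
vocab = build_vocabulary(docs_by_year, yearly_min_df=2, overall_min_df=3,
                         max_n=2, stopwords={"the"})
print(vocab)
```

In the toy data, "model" and "graph" each clear the yearly threshold in only one year, so their totals stay below the overall threshold and they are dropped.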

In [1]:
from langdetect import detect
import json
from collections import defaultdict
import numpy as np
import boto3
import re
import unicodedata
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm
import random
import pickle

import sys
import time
import csv
sys.path.append("../../tools")
import my_stopwords3

from sys import getsizeof

stop = my_stopwords3.get_stopwords()

# Language detection

### How long does it take to detect German-language abstracts?

In [3]:
langs = defaultdict(list)

with open("../../Data/semantic_scholar_cleaned/2000.txt", "r") as f:
    for line in tqdm(f,  desc='Lines in file processed...'):
        langs[detect(line[0:1000])].append(line[0:200])



Lines in file processed...: 65450it [08:53, 122.78it/s]


TypeError: object of type '_io.TextIOWrapper' has no len()

It would take about 7 hours to run this on the entire dataset on this PC

### How good is the detection?


In [15]:
for key in langs.keys():
    print(key, len(langs[key]))

en 64710
de 451
ca 5
es 60
pt 33
fr 163
it 6
pl 2
sv 1
hr 1
so 7
ro 2
nl 2
sq 3
lt 1
cy 3


In [16]:
langs['de']

['feldbusse fur die heim und gebaudeautomation feldbusse sind gerade im bereich der gebaudeautomation heutzutage nicht mehr wegzudenken in diesem beitrag soll eine kurze einfuhrung in die gebaudeautomat',
 'entwurfsmustergesteuerte erzeugung von ocl constraint eine der grosten hurden auf dem weg zur formalen verifikation von software ist die erstellung und validierung der hierfur notwendigen formalen spe',
 '3ddashmodellierung von pflanzenblattern mittels eines depth from motion verfahrens in dieser arbeit wird ein verfahren zur gewinnung von 3ddashmodellen de adergerusts von pflanzenblattern fur die bota',
 'hierarchische wasserscheiden transformation zur lippensegmentierung in farbbildsequenzen zur losung komplexer segmentierungsprobleme wird eine hierarchische und farbbasierte wasserscheidentransformati',
 'selbstorganisierende klassifikation von 2d laserscans zur navigation eines amr der vorliegende beitrag stellt einen neuronalen interpretationsansatz fur laserscans vor der eine z

In [17]:
langs['es']

['identificacion de patrones de reutilizacion de requisitos de sistemas de informacion resumen en este articulo se exponen algunos de los resultados de la aplicacion de la plantillas patrones de requisi',
 'una introduccion la bibliotecas digitales geograficas resumen la bibliotecas digitales geograficas bdg contienen documentos con una caracteristica especial estan georreferenciados e decir un documento',
 'base de datos con acceso en linea consulta visual de fondos documentales de arquitectura el proyecto de digitalizacion del archivo grafico de la escuela de arquitectura de barcelona resumen la activid',
 'principios basicos de usabilidad para ingenieros software la usabilidad e un tema que esta cobrando una importancia cada vez mayor en el desarrollo de software pesar de ello la ingenieria del software',
 'relaciones entre casos de uso en el unified modeling language el unified modeling language uml e un lenguaje grafico semiformal que ha sido aceptado como estandar para describir 

In [18]:
del(langs)

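**Thoughts: the detection is pretty good**

We'd be doing this to remove about 1% of abstracts (740 of the 65,450 abstracts from 2000 were not detected as English). Unfortunately I think it will be necessary. Let's try again with a smaller sample of text from each abstract: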

In [20]:
langs = defaultdict(list)

with open("../../Data/semantic_scholar_cleaned/2000.txt", "r") as f:
    for line in tqdm(f,  desc='Lines in file processed...'):
        langs[detect(line[0:200])].append(line[0:200])



Lines in file processed...: 65450it [07:56, 137.24it/s]


In [21]:
for key in langs.keys():
    print(key, len(langs[key]))

en 64695
de 450
id 2
es 62
it 20
pt 33
fr 161
pl 1
nl 3
ro 3
hr 1
so 3
sv 2
sq 2
lt 1
ca 5
no 1
tl 1
af 3
da 1


In [22]:
langs['fr']

['colinearite et instabilite numerique dans le modele lineaire nous donnons dans cet article expression du coefficient de correlation multiple dans un modele lineaire en fonction de coefficient de corre',
 'modelisation par chaines de markov homogenes ergodiques de appels de puissance un parc de chauffe eau electriques lors de evaluation de action de maitrise de la demande electricite on besoin de method',
 'recensement et description de mot composes methodes et application ce memoire decrit le recherches en informatique linguistique menees par auteur dans le domaine de mot composes et plus specialement d',
 'utilisation de methodes formelles pour le developpement de programme paralleles le travail decrit dans cette these pour but etudier comment on peut appliquer le methodes formelles la parallelisassions',
 'planification de emplois du temp et de la formation au sein une grande entreprise nous decrivons une methode issue de la recherche operationnelle permettant de planifier emploi d

**Thoughts: the reduction in time taken (08:53 down to 07:56) isn't sufficient to justify a major cut in the amount of text available to the detector**


## Go through all abstracts and remove the ones which are not in English

In [2]:
for year in range(1980, 2000):
    t0 = time.time()
    to_write = []
    with open("D:/CDT/Data/semantic_scholar_cleaned/"+str(year)+".txt", "r") as f:
        for line in tqdm(f,  desc='Lines in file processed...', mininterval=10):
            # Detect the language from the first 1000 characters and keep English abstracts only
            if detect(line[0:1000]) == 'en':
                to_write.append(line)

    # Append mode: delete any existing output file before re-running, or lines will be duplicated
    with open("../../Data/semantic_scholar_cleaned_langdetect/"+str(year)+".txt", "a") as f:
        for line in to_write:
            f.write(line)

    del to_write
    print(year, time.time()-t0)

Lines in file processed...: 5245it [00:26, 195.40it/s]

1980 26.913002490997314


Lines in file processed...: 5565it [00:28, 196.03it/s]

1981 28.477207899093628


Lines in file processed...: 6380it [00:33, 192.42it/s]

1982 33.250861167907715


Lines in file processed...: 7138it [00:41, 170.24it/s]

1983 42.04364633560181


Lines in file processed...: 8235it [00:51, 160.95it/s]

1984 51.276766300201416


Lines in file processed...: 9064it [00:54, 164.90it/s]

1985 55.11902856826782


Lines in file processed...: 10193it [01:12, 140.79it/s]

1986 72.57939624786377


Lines in file processed...: 11646it [01:10, 165.90it/s]

1987 70.34139394760132


Lines in file processed...: 14660it [01:43, 141.84it/s]

1988 103.5322802066803


Lines in file processed...: 16706it [01:38, 169.35it/s]

1989 98.8591980934143


Lines in file processed...: 19760it [01:53, 173.77it/s]

1990 113.99781918525696


Lines in file processed...: 22660it [02:19, 162.81it/s]

1991 139.42477655410767


Lines in file processed...: 25386it [02:22, 177.83it/s]

1992 143.03540110588074


Lines in file processed...: 29680it [03:03, 161.67it/s]

1993 183.91923785209656


Lines in file processed...: 33808it [03:35, 157.22it/s]

1994 215.40284776687622


Lines in file processed...: 36210it [03:59, 151.08it/s]

1995 240.05048847198486


Lines in file processed...: 40082it [04:41, 142.20it/s]

1996 282.3543186187744


Lines in file processed...: 44237it [05:18, 139.00it/s]

1997 318.7963058948517


Lines in file processed...: 51735it [05:54, 145.94it/s]

1998 355.0483636856079


Lines in file processed...: 55869it [06:36, 141.04it/s]


1999 396.7487895488739
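
The filtering loop above holds every kept line in memory before writing. A streaming variant could look like the sketch below. It is a minimal sketch under assumptions: `keep_english`, `filter_file` and `toy_detect` are hypothetical names, the detector is passed in as a function so the example runs without any language-detection library, and dropping lines the detector cannot classify is a choice made here, not something the notebook does.

```python
import io


def keep_english(lines, detect_language, max_chars=1000):
    """Yield only the lines whose first max_chars characters are detected as English."""
    for line in lines:
        try:
            is_english = detect_language(line[:max_chars]) == "en"
        except Exception:
            # Assumption: a line the detector cannot classify (for example an
            # empty line) is dropped rather than stopping the whole run.
            is_english = False
        if is_english:
            yield line


def filter_file(src, dst, detect_language):
    """Stream lines from src to dst, writing each kept line immediately."""
    kept = 0
    for line in keep_english(src, detect_language):
        dst.write(line)
        kept += 1
    return kept


def toy_detect(text):
    """Stand-in detector for the demo only; not a real language detector."""
    if not text.strip():
        raise ValueError("no features in text")
    return "de" if " und " in text else "en"


source = io.StringIO(
    "graph model for search\n"
    "feldbusse fur die heim und gebaudeautomation\n"
    "\n"
    "neural network\n"
)
target = io.StringIO()
kept = filter_file(source, target, toy_detect)
print(kept, repr(target.getvalue()))
```

With langdetect, `detect_language` would be `langdetect.detect`. That library raises `LangDetectException` for text with no usable features, and its README notes that results can vary between runs unless `DetectorFactory.seed` is set.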


# Build a vocabulary

- Lemmatise the stopwords
- For each year, remove stopwords, then build a vocabulary.
- The overall vocabulary should be stored in a dictionary whose values are the total document count of each term. This should be updated each year.


In [8]:
vectorizer = CountVectorizer(strip_accents='unicode',
                             ngram_range=(1,5),
                             stop_words=stop,
                             min_df=6,
                             max_df = 1000000
                            )

t0 = time.time()
with open("../../Data/semantic_scholar_cleaned/1980.txt", "r") as f:
    documents = f.readlines()
    documents = [d.strip() for d in documents] 

vector = vectorizer.fit_transform(documents)
print(time.time()-t0)

6.5104005336761475


In [20]:
# Total term count before binarising. Elements >1 are set to 1 in the next cell so that column sums give the document frequency of each term.
print(np.sum(vector))

370383


In [21]:
vector[vector>1] = 1
print(np.sum(vector))

267893


In [36]:
np.asarray(np.sum(vector, axis=0))[0]

array([30,  6,  6, ...,  8, 18, 27], dtype=int64)

In [28]:
len(vectorizer.vocabulary_)

8998
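
The binarise-then-sum step can also be requested from the vectorizer itself. A minimal sketch on toy documents (the documents are made up and not from the dataset): with `binary=True`, `CountVectorizer` records each term at most once per document, so the column sums are already document frequencies.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical). "neural network" occurs twice in the first one.
docs = [
    "neural network neural network",
    "neural network model",
    "graph model",
]

# binary=True sets every non-zero count to 1 inside the vectorizer.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
matrix = vectorizer.fit_transform(docs)

# Column sums of a 0/1 matrix are document frequencies.
counts = np.asarray(matrix.sum(axis=0)).ravel()
doc_freq = {term: int(counts[idx]) for term, idx in vectorizer.vocabulary_.items()}

# Raw counts would give 3 for "neural network"; the document frequency is 2.
print(doc_freq["neural network"])
```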

In [40]:
vocabulary = defaultdict(int)

vector[vector>1] = 1
document_frequency = np.asarray(np.sum(vector, axis=0))[0]

for term in vectorizer.vocabulary_:
    vocabulary[term] += document_frequency[vectorizer.vocabulary_[term]]
    
del vector
del vectorizer

## Now build the vocabulary for real...

In [2]:
vocabulary = defaultdict(int)

for year in range(2020, 1999, -1):
    t0 = time.time()
    with open("E:/Data/semantic_scholar_cleaned_langdetect/"+str(year)+".txt", "r") as f:
        documents = f.readlines()
        documents = [d.strip() for d in documents] 

    vectorizer = CountVectorizer(strip_accents='unicode',
                             ngram_range=(1,4),
                             stop_words=stop,
                             min_df=10,
                             max_df = 1000000
                            )
    
    vector = vectorizer.fit_transform(documents)

    del documents
    
    vector[vector>1] = 1
    document_frequency = np.asarray(np.sum(vector, axis=0))[0]

    for term in vectorizer.vocabulary_:
        vocabulary[term] += document_frequency[vectorizer.vocabulary_[term]]

    del vector
    del vectorizer
    del document_frequency
    
    pickle.dump(vocabulary, open("interim_vocabulary.p", "wb"))
    
    print(year, len(vocabulary.keys()), time.time()-t0)

2020 640730 759.7719836235046
2019 781990 787.8609170913696
2018 844614 733.3687374591827
2017 882535 661.7947747707367
2016 910152 718.7338662147522
2015 932299 657.1821525096893
2014 950451 614.8959999084473
2013 965540 525.9848170280457
2012 977135 485.18851375579834
2011 986440 478.5876338481903
2010 992834 397.3440001010895
2009 999337 371.5789999961853
2008 1003902 306.7249596118927
2007 1008582 288.5188488960266
2006 1012205 256.6606636047363
2005 1016732 230.3845226764679
2004 1018897 203.78000020980835
2003 1019513 144.63499999046326
2002 1020035 123.1819999217987
2001 1020687 107.71599984169006
2000 1021362 100.76200032234192


In [7]:
list(vocabulary.keys())
pickle.dump(list(vocabulary.keys()), open("vocabulary.p", "wb"))

In [6]:
len(set(list(vocabulary.keys())))

1021362

The issue with this approach is that it uses a massive amount of space, and my poor computer cannot cope. The smallest file, 2000.txt, contains about 60,000 documents. We can't realistically find trends in terms that occur in only a tiny fraction of a year's documents. Therefore, set the cut-off at 30 documents in a single year at a term's peak (0.05% of 60,000 documents).

We also don't need to hold the entire matrix, since we are only trying to build a vocabulary.

In order to do this in manageable chunks, how about vectorizing 10,000 at a time?

In [55]:
chunk = 10000
min_yearly_df = 20

vocabulary2 = defaultdict(int)

    
t0 = time.time()

interim_vocabulary = defaultdict(int)
with open("../../Data/semantic_scholar_cleaned_langdetect/2000.txt", "r") as f:
    documents = f.readlines()
    documents = [d.strip() for d in documents] 
    random.shuffle(documents)



# Go through the documents in chunks, creating a vocabulary
for i in tqdm(range(int(np.ceil(len(documents)/chunk))-1), desc='Chunks processed'):

    vectorizer = CountVectorizer(strip_accents='unicode',
                         ngram_range=(1,4),
                         stop_words=stop,
                         min_df=2,
                         max_df = chunk
                        )
    
    vector = vectorizer.fit_transform(documents[chunk*i:chunk*(i+1)])
    vector[vector>1] = 1
    document_frequency = np.asarray(np.sum(vector, axis=0))[0]

    del vector 

    for term in vectorizer.vocabulary_:
        interim_vocabulary[term] += document_frequency[vectorizer.vocabulary_[term]]

    del document_frequency
    del vectorizer

# Now do for the last remaining documents
vectorizer = CountVectorizer(strip_accents='unicode',
                     ngram_range=(1,4),
                     stop_words=stop,
                     min_df=1,
                     max_df = chunk
                    )

vector = vectorizer.fit_transform(documents[chunk*(i+1):chunk*(i+2)])
vector[vector>1] = 1
document_frequency = np.asarray(np.sum(vector, axis=0))[0]

del vector 

for term in vectorizer.vocabulary_:
    interim_vocabulary[term] += document_frequency[vectorizer.vocabulary_[term]]

del document_frequency
del vectorizer
    
del documents

for term in interim_vocabulary.keys():
    if interim_vocabulary[term] >= min_yearly_df:
        vocabulary2[term] += interim_vocabulary[term]

print(len(vocabulary2), time.time()-t0)
    



Chunks processed: 100%|██████████████████████████████████████████████████████████████████| 6/6 [01:19<00:00, 13.20s/it]


21773 87.42224407196045
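
The chunked cell above handles the last partial chunk with a second copy of the code, and its per-chunk `min_df=2` discards a term's count in any chunk where it appears only once, so the summed counts are likely undercounts. A possible alternative is sketched below in plain Python, under assumptions: the helper names and toy documents are hypothetical, a `Counter` stands in for the vectorizer, and the only threshold is applied after all chunks have been summed.

```python
from collections import Counter
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of up to chunk_size lines, including the final short one."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_document_frequency(lines, chunk_size, terms_of):
    """Sum per-chunk document frequencies without any per-chunk cut-off."""
    totals = Counter()
    for chunk in iter_chunks(lines, chunk_size):
        chunk_df = Counter()  # plays the role of one chunk's vectorizer output
        for doc in chunk:
            chunk_df.update(set(terms_of(doc)))  # at most one count per document
        totals.update(chunk_df)
    return totals


def unigrams(doc):
    """Toy term extractor for the demo; the notebook uses 1-4grams."""
    return doc.split()


# Toy data (hypothetical): five documents processed two at a time.
docs = ["graph model", "neural model", "graph search", "neural graph", "model"]
totals = chunked_document_frequency(docs, chunk_size=2, terms_of=unigrams)

min_yearly_df = 3
kept = {term: count for term, count in totals.items() if count >= min_yearly_df}
print(kept)
```

In this toy run, "graph" and "model" each appear in three documents and are kept. With a per-chunk minimum of 2, each would lose one count from a chunk where it appears only once and would fall below the threshold. Because `iter_chunks` accepts any iterable, an open file can be passed directly instead of the result of `readlines()`.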


In [60]:
chunk = 10000
min_yearly_df = 30

vocabulary = defaultdict(int)

    
t0 = time.time()

interim_vocabulary = defaultdict(int)
with open("../../Data/semantic_scholar_cleaned_langdetect/2000.txt", "r") as f:
    documents = f.readlines()
    documents = [d.strip() for d in documents] 

# Use the threshold defined at the top of the cell rather than repeating the number
vectorizer = CountVectorizer(strip_accents='unicode',
                         ngram_range=(1,4),
                         stop_words=stop,
                         min_df=min_yearly_df,
                         max_df = 100000
                        )

vector = vectorizer.fit_transform(documents)
vector[vector>1] = 1
document_frequency = np.asarray(np.sum(vector, axis=0))[0]


for term in vectorizer.vocabulary_:
    vocabulary[term] += document_frequency[vectorizer.vocabulary_[term]]
    
print(len(vocabulary), time.time()-t0)

21958 90.74334478378296


In [40]:
del vector
del document_frequency
del vectorizer
del documents
