# Semantic Scholar data

We downloaded a subset of the Semantic Scholar corpus. The original download page was:

http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download/

The newer link is:

http://s2-public-api.prod.s2.allenai.org/corpus/


We used the following command:

```aws s3 cp --no-sign-request --recursive s3://ai2-s2-research-public/open-corpus/2021-03-01/ destinationPath```

Some notes on this:
- The data is very large: it is split across 5000 gz files.
- Because of its size, I can only work in batches: download a batch, unzip it, process it, save the processed data, and delete the raw files. Only then can I download the next batch.
- We keep only records whose `sources` field includes DBLP, i.e. `"sources":["DBLP"]`.

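To download a single batch, restrict the file names with `--include` and `--exclude` filters, for example:

aws s3 cp --no-sign-request --exclude "*" --include "*-2*" --exclude "*-20*" --exclude "*-21*" --recursive s3://ai2-s2-research-public/open-corpus/2021-03-01/ C:\Users\[me]\Documents\CDT\Data\semanticscholar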
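The unzip step in the batch workflow could probably be skipped by streaming the compressed files directly. The sketch below assumes each downloaded file is gzip-compressed with one JSON record per line; `iter_records` and the demo file are hypothetical names used only for illustration.

```python
import gzip
import json
import os
import tempfile


def iter_records(gz_path):
    """Yield one parsed JSON record per non-empty line of a gzipped file."""
    with gzip.open(gz_path, "rt", encoding="utf8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny demonstration with a throwaway file (made-up records, not corpus data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo-corpus.gz")
with gzip.open(demo_path, "wt", encoding="utf8") as handle:
    handle.write(json.dumps({"title": "A", "year": 2001}) + "\n")
    handle.write(json.dumps({"title": "B", "year": 2002}) + "\n")

demo_records = list(iter_records(demo_path))
print([record["year"] for record in demo_records])
```

Because the generator holds only one line in memory at a time, the raw file can be deleted as soon as the loop over it has finished.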

In [1]:
import json
from collections import defaultdict
import numpy as np
import boto3
import re
import unicodedata
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

import sys
sys.path.append("../../tools")

import tools
import cleaning
import html

def extract_data(document):
    '''
    Extract the year, title and abstract from one line of the Semantic Scholar corpus.
    Newlines and carriage returns in the title and abstract are replaced with spaces.
    Returns (year, JSON string) for DBLP records and (None, None) otherwise.
    '''
    json_document = json.loads(document)

    # Only keep records that list DBLP as a source
    if 'DBLP' not in json_document['sources']:
        return None, None

    title = json_document['title'].replace('\n', ' ').replace('\r', ' ')
    abstract = json_document['paperAbstract'].replace('\n', ' ').replace('\r', ' ')
    extracted_data = {
        'title': title,
        'paperAbstract': abstract
    }
    return json_document['year'], json.dumps(extracted_data)
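A more defensive variant may be useful if some records turn out to have missing or null fields. The sketch below is an assumption about how that could be handled, not part of the original pipeline; `extract_dblp_record` is a hypothetical name and the sample record is made up.

```python
import json


def extract_dblp_record(line):
    """Return (year, json_string) for DBLP records, else (None, None)."""
    record = json.loads(line)
    if "DBLP" not in (record.get("sources") or []):
        return None, None

    def flatten(value):
        # Treat a missing or null field as an empty string, then drop line breaks.
        return (value or "").replace("\n", " ").replace("\r", " ")

    extracted = {
        "title": flatten(record.get("title")),
        "paperAbstract": flatten(record.get("paperAbstract")),
    }
    return record.get("year"), json.dumps(extracted)


# Made-up record with a null abstract and a line break in the title.
sample = json.dumps({
    "title": "Line one\nline two",
    "paperAbstract": None,
    "sources": ["DBLP"],
    "year": 2012,
})
year, payload = extract_dblp_record(sample)
print(year, payload)
```

Records with an empty abstract would still be written out here; they are only filtered later, by the length check in the cleaning step.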

### Anatomy of a document

```
{
    "id":"7bbfdcca4478ba11e511cc46546e5da9fc82fe19",
    "title":"Enhancing the TORA protocol using network localization and selective node participation",
    "paperAbstract":"The Temporally-Ordered Routing Algorithm (TORA) is a distributed routing protocol that is based on a family of link reversal algorithms. TORA is able to provide multiple loop-free routes to any destination using the route creation, maintenance and erasure functions. TORA performs well in networks with a small number of traffic connections but poorly in networks with a large number of traffic connections. This poor performance is due to the traffic congestion caused by excessive route maintenance. This traffic congestion is further aggravated by routing overhead produced by the large number of traffic connections. We propose two modifications to improve TORA using a network localization approach and selective node participation approach. The network localization approach initializes and maintains a localized portion of the entire network while the selective node participation approach selects a subset of nodes to participate as part of the network. Benchmarks against original TORA show that our TORA modifications results in an overall performance improvement in terms of packet delivery, routing overhead and packet latency.",
    "authors":[
        {"name":"Kwan Hui Lim","ids":["4089267"]},
        {"name":"Amitava  Datta","ids":["1716678"]}
        ],
    "inCitations":["07fb7dec071f73d10e0fc6b6e6c065e4297e9006","48d2d86f355f1aca5d71e4a230980c3da8d9c5f6","2b9e549f406c46a33f4c98d9c2b350388e96af74","0e75fee74633d3c776602feadbd2dda9a059b950","d1c86fb6d5ef8c9f97cec9865f9cb58734c9458f","5287d1ab116cb8ce1353c274cba868038dc820fc"],
    "outCitations":["53937dd143269339fc35cba93d397f8fcff62d1b","21e5ce796636e566642224d4737ee3e0eae07470","1c7f78f506d3409f1efa137d51169d5d87fdd24a","673892e326414dfd9e72853048b467d6e75d16d8","9928bfab5ef374e42ab30f8222be9c460afef313","a8c934aa4b9d2736f97421354c51d0f11bfeb63b","cbf3f6628a039f1542324860c2b2363b19ab2619","9548b1a9142a5297c63cd254901eb751af80acd3","5f9d49753857692e342d094a2f417e8de37ad18c","ca95be7fef2ba6e1a98bd1fdbb04dcf140b2ad33","0f28d106e7dc6464166f83583ee496559daa8b7f"],
    "year":2012,
    "s2Url":"https://semanticscholar.org/paper/7bbfdcca4478ba11e511cc46546e5da9fc82fe19",
    "sources":["DBLP"],
    "pdfUrls":["https://doi.org/10.1109/PIMRC.2012.6362586","http://staffhome.ecm.uwa.edu.au/~10449838/2012-PIMRC-toraPaper.pdf"],
    "venue":"2012 IEEE 23rd International Symposium on Personal, Indoor and Mobile Radio Communications - (PIMRC)",
    "journalName":"2012 IEEE 23rd International Symposium on Personal, Indoor and Mobile Radio Communications - (PIMRC)",
    "journalVolume":"",
    "journalPages":"1503-1508",
    "doi":"10.1109/PIMRC.2012.6362586",
    "doiUrl":"https://doi.org/10.1109/PIMRC.2012.6362586",
    "pmid":"",
    "fieldsOfStudy":["Computer Science"],
    "magId":"2132186477",
    "s2PdfUrl":"",
    "entities":[]
}

```

The important fields are:
- title
- paperAbstract
- sources
- year
- fieldsOfStudy

### Load the file and parse individual lines as JSON


In [1]:
f = open("../../Data/semanticscholar_sample/sample-S2-records", "r", encoding="utf8")

In [3]:
f.readline()

'{"id":"989e305765a01478d9a786987fb6f4fc379da91b","title":"FRANCESCO GUERRA ― NADIA ROBOTTI, Ettore Majorana. Aspects of his Scientific and Academic Activity. Pisa: Edizioni Scuola Normale Superiore, 2008. 243 pp., ISBN 978-88-7642-331-4.","paperAbstract":"","authors":[{"name":"Luisa  Bonolis","ids":["15983375"]}],"inCitations":[],"outCitations":[],"year":2009,"s2Url":"https://semanticscholar.org/paper/989e305765a01478d9a786987fb6f4fc379da91b","sources":[],"pdfUrls":[],"venue":"","journalName":"Nuncius-journal of The History of Science","journalVolume":"24","journalPages":"540-541","doi":"10.1163/182539109X00912","doiUrl":"https://doi.org/10.1163/182539109X00912","pmid":"","fieldsOfStudy":["Philosophy"],"magId":"2090086602","s2PdfUrl":"","entities":[]}\n'

In [6]:
json_record = json.loads(f.readline())

In [7]:
json_record

{'id': '376c7945fa36d821ab9cf40f9eb307528bca2f88',
 'title': 'UMP lost and found mobile application',
 'paperAbstract': 'Since generation, the issue of losing personal belongings is a common thing for all people. Anyone can lose their personal belongings. At Universiti Malaysia Pahang only, almost every week there will be a lot of cases of lost and found personal items such as wallets, matric cards, room keys and so on. Usually, the information regarding the lost and found item was spread by the student himself via the UMP portal in the announcement board and the WhatsApp media. This method is not very efficient because not all students at UMP will receive information about the lost and found items. Hence, UMP Lost and Found Mobile Application is developed to centralized the information regarding all the lost and found items in a mobile application. This app can help students find lost items or track owners for items they have found. Any lost or found items around UMP will be reported 

In [8]:
json_record['sources']

[]

### Findings

- It is easy to read the data in JSON format and extract information.
- Not every record has an associated source: `sources` can be an empty list.
- Records can have zero or more fields of study (`fieldsOfStudy`). However, it is not clear how these are assigned, and the documentation is opaque.
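To put numbers on these findings, the values of a list-valued field can be tallied across records. This is a minimal sketch using `collections.Counter`; `tally_field` is a hypothetical helper, and the sample lines are made up rather than taken from the corpus.

```python
import json
from collections import Counter


def tally_field(lines, field):
    """Count the values seen in a list-valued field; empty lists count as '<none>'."""
    counts = Counter()
    for line in lines:
        values = json.loads(line).get(field) or []
        counts.update(values if values else ["<none>"])
    return counts


# Made-up records for illustration only.
sample_lines = [
    json.dumps({"sources": ["DBLP"], "fieldsOfStudy": ["Computer Science"]}),
    json.dumps({"sources": [], "fieldsOfStudy": ["Philosophy"]}),
    json.dumps({"sources": ["DBLP", "Medline"], "fieldsOfStudy": []}),
]
source_counts = tally_field(sample_lines, "sources")
field_counts = tally_field(sample_lines, "fieldsOfStudy")
print(source_counts)
print(field_counts)
```

Running the same tally over the sample file would show how many records have no source at all, and which fields of study dominate.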

In [52]:
dblp_documents = defaultdict(list)
with open("../../Data/semanticscholar_sample/sample-S2-records", "r", encoding="utf8") as f:
    for line in f:
        year, extracted_data = extract_data(line)
        if year is not None:
            dblp_documents[year].append(extracted_data)

In [53]:
dblp_documents

defaultdict(list,
            {2012: ['{"title": "Enhancing the TORA protocol using network localization and selective node participation", "paperAbstract": "The Temporally-Ordered Routing Algorithm (TORA) is a distributed routing protocol that is based on a family of link reversal algorithms. TORA is able to provide multiple loop-free routes to any destination using the route creation, maintenance and erasure functions. TORA performs well in networks with a small number of traffic connections but poorly in networks with a large number of traffic connections. This poor performance is due to the traffic congestion caused by excessive route maintenance. This traffic congestion is further aggravated by routing overhead produced by the large number of traffic connections. We propose two modifications to improve TORA using a network localization approach and selective node participation approach. The network localization approach initializes and maintains a localized portion of the entire n

In [54]:
for year in dblp_documents.keys():
    with open("../../Data/semanticscholar_sample/year_sorted_data/"+str(year)+".txt", "a") as f:
        for document in dblp_documents[year]:
            # One JSON document per line, so the file can be read back line by line
            f.write(document + '\n')
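The later cells read these year files back with `json.loads` on each line, which only works if every document sits on its own line. The round trip can be sketched as follows, writing to a temporary directory; `append_by_year` and `read_year` are hypothetical helpers and the documents are made up.

```python
import json
import os
import tempfile
from collections import defaultdict


def append_by_year(documents_by_year, out_dir):
    """Append each JSON string to <out_dir>/<year>.txt, one document per line."""
    os.makedirs(out_dir, exist_ok=True)
    for year, documents in documents_by_year.items():
        path = os.path.join(out_dir, f"{year}.txt")
        with open(path, "a", encoding="utf8") as handle:
            for document in documents:
                handle.write(document + "\n")


def read_year(out_dir, year):
    """Read one year file back into a list of dictionaries."""
    with open(os.path.join(out_dir, f"{year}.txt"), encoding="utf8") as handle:
        return [json.loads(line) for line in handle]


# Made-up documents grouped by year, mirroring the shape of dblp_documents.
demo = defaultdict(list)
demo[2012].append(json.dumps({"title": "T1", "paperAbstract": "A1"}))
demo[2012].append(json.dumps({"title": "T2", "paperAbstract": "A2"}))
demo[1999].append(json.dumps({"title": "T3", "paperAbstract": "A3"}))

out_dir = tempfile.mkdtemp()
append_by_year(demo, out_dir)
loaded = read_year(out_dir, 2012)
print(len(loaded), loaded[1]["title"])
```

Note that append mode means running the same batch twice writes its records twice, so each batch should be processed only once.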

### Now try on real data...

In [56]:
dblp_documents = defaultdict(list)
with open("../../Data/semanticscholar/s2-corpus-000", "r", encoding="utf8") as f:
    for line in f:
        year, extracted_data = extract_data(line)
        if year is not None:
            dblp_documents[year].append(extracted_data)

In [59]:
for key in sorted(dblp_documents.keys()):
    print(key, len(dblp_documents[key]))

1974 2
1975 2
1976 2
1982 1
1983 2
1984 1
1985 3
1986 1
1987 2
1988 4
1989 1
1990 6
1991 3
1992 4
1993 6
1994 7
1995 10
1996 10
1997 12
1998 8
1999 10
2000 18
2001 9
2002 16
2003 14
2004 19
2005 33
2006 26
2007 27
2008 38
2009 37
2010 30
2011 32
2012 40
2013 51
2014 36
2015 49
2016 50
2017 49
2018 60
2019 56
2020 62
2021 4


## Read a range of files and save the filtered data

In [2]:
for n in range(600, 1000):
    # File names use zero-padded three-digit numbers, e.g. s2-corpus-000
    filenumber = str(n).zfill(3)

    dblp_documents = defaultdict(list)
    with open("../../Data/semanticscholar/s2-corpus-"+filenumber, "r", encoding="utf8") as f:
        for line in f:
            year, extracted_data = extract_data(line)
            if year is not None:
                dblp_documents[year].append(extracted_data)

    # Append this file's DBLP records to the per-year files, one JSON document per line
    for year in dblp_documents.keys():
        with open("../../Data/semantic_scholar_filtered/"+str(year)+".txt", "a") as f:
            for document in dblp_documents[year]:
                f.write(document + '\n')

    # Report how many DBLP records this file contributed
    print(filenumber, sum(len(documents) for documents in dblp_documents.values()))

600 897
601 899
602 884
603 939
604 873
605 838
606 877
607 882
608 900
609 916
610 878
611 831
612 864
613 871
614 865
615 880
616 837
617 912
618 889
619 927
620 863
621 886
622 861
623 918
624 865
625 907
626 870
627 851
628 871
629 901
630 849
631 911
632 926
633 870
634 928
635 871
636 893
637 936
638 912
639 886
640 928
641 838
642 914
643 913
644 916
645 871
646 905
647 927
648 842
649 914
650 919
651 883
652 845
653 920
654 874
655 889
656 897
657 846
658 914
659 887
660 895
661 950
662 875
663 873
664 858
665 887
666 905
667 897
668 891
669 858
670 863
671 897
672 856
673 880
674 885
675 896
676 936
677 855
678 876
679 866
680 871
681 853
682 912
683 891
684 859
685 899
686 908
687 895
688 881
689 887
690 909
691 864
692 916
693 860
694 844
695 895
696 876
697 894
698 890
699 859
700 889
701 858
702 895
703 886
704 849
705 866
706 892
707 870
708 854
709 902
710 845
711 866
712 880
713 872
714 912
715 895
716 909
717 896
718 848
719 903
720 917
721 896
722 898
723 890
724 945


## Cleaning data

At this point, we try out a cleaning pipeline. The requirements are:

- the title field must not be empty
- the abstract field must be at least 50 characters long
- we need to be able to deal with Unicode characters
- the text should be lemmatised



In [42]:
with open("../../Data/semantic_scholar_filtered/1980.txt", "r", encoding="utf8") as f:
    lines = [json.loads(line) for line in f]

In [55]:
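# Normalise Unicode, e.g. turn \xa0 (non-breaking space) into a plain space
text = lines[892]['paperAbstract']
text = unicodedata.normalize('NFKC', text)
# Replace line breaks and tabs with spaces
text = text.replace('\n',' ')
text = text.replace('\r',' ')
text = text.replace('\t',' ')
# Replace HTML entities such as &amp; or &#38; with a space
text = re.sub(r'&#?[a-z0-9]+;', ' ', text)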
text

'Abstract Let 0⩽ x 1 ⩽ x 2 ⩽ x 3 ⩽· be a sequence of real numbers, lim x i =+∞. We prove that there exists a sequence P ={ z 1 , z 2 , z 3 ,·} in E 2 such that |; z i | = x i and every straight line of E 2 comes arbitrarily near to P , if and only if Σ 1 x i =+∞ . Analogous results are valid for the case of higher dimensions.'
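The regular expression above deletes HTML entities. An alternative would be to decode them with the standard library's `html.unescape` before normalising. This is a sketch of that alternative, not what the notebook does; `normalise_text` is a hypothetical name and the example string is made up.

```python
import html
import re
import unicodedata


def normalise_text(text):
    """Decode HTML entities, apply NFKC, and collapse whitespace to single spaces."""
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up string with named and numeric entities, a non-breaking space and line breaks.
example = "Caf&eacute;\xa0society &amp; x&#178;\n\tresults"
normalised = normalise_text(example)
print(normalised)
```

Decoding keeps characters such as accented letters that deletion would lose, at the cost of letting symbols like `&` through to the tokeniser.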

## Cleaning pipeline

- Remove special characters
- Remove punctuation
- Remove stopwords (not yet done by `clean` below: the tokeniser only drops single-character tokens)
- Set everything to lowercase
- Lemmatise

In [3]:
def reasonable(title, abstract):
    '''
    Check that there is a title and that the abstract is at least 50 characters long.
    Float values are rejected because they cannot be a valid title or abstract.
    '''
    if isinstance(title, float) or isinstance(abstract, float):
        return False
    if title == '':
        return False
    return len(abstract) >= 50

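def normalise_acronymns(text):
    '''
    Remove the periods in acronyms, e.g. U.S.A. -> USA.
    Adapted from the method found at https://stackoverflow.com/a/40197005 
    '''

    # A single letter followed by a period and whitespace (e.g. at a sentence boundary)
    # gets a second period, so that one period survives the substitution below
    text = re.sub(r'\s([A-Za-z])\.\s', r' \1..  ', text)
    # Drop the period after any single letter that does not follow a word character
    return re.sub(r'(?<!\w)([A-Za-z])\.', r'\1', text)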

def normalise_decimals(text):
    '''
    Remove the periods in decimal numbers and replace with POINT
    '''
    return re.sub(r'([0-9])\.([0-9])', r'\1POINT\2', text)

def normalise_dashes(text):
    '''
    Where a dash connects one or two characters to another word, replace it with DASH
    so that the tokeniser keeps the compound together (e.g. x-ray -> xDASHray)
    '''
    # When it occurs at the start of text...
    text = re.sub(r'(^[0-9a-zA-Z]{1,2})-([0-9a-zA-Z])', r'\1DASH\2', text)
    
    # When it occurs in the middle of text...
    text = re.sub(r'([\s\W][0-9a-zA-Z]{1,2})-([0-9a-zA-Z])', r'\1DASH\2', text)
    text = re.sub(r'([0-9a-zA-Z])-([0-9a-zA-Z]{1,2}[\s\W])', r'\1DASH\2', text)
    
    # When it occurs at the end of text
    text = re.sub(r'([0-9a-zA-Z])-([0-9a-zA-Z]{1,2}$)', r'\1DASH\2', text)
    return text
    
def clean(text, wnl, tokeniser):
    '''
    Normalise the text, tokenise it with `tokeniser` and lemmatise each token with `wnl`.
    Returns the lemmas joined by single spaces.
    '''
    # Remove special characters
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'&#?[0-9a-zA-Z]+;', ' ', text)
    
    # Remove line breaks
    text = text.replace('\n',' ')
    text = text.replace('\r',' ')
    text = text.replace('\t',' ')
    
    # Preserve acronyms
    text = normalise_acronymns(text)
    # Preserve decimal points
    text = normalise_decimals(text)
    # Normalise dashes
    text = normalise_dashes(text)
    
    # Lemmatise word by word
    lemmas = [wnl.lemmatize(word) for word in tokeniser(text)]

    return ' '.join(lemmas)
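The effect of the three normalisation helpers is easiest to see on small inputs. The sketch below repeats them so that it runs on its own, with the letter class written as `[A-Za-z]`. `simple_tokens` is a hypothetical stand-in for the scikit-learn analyser: it lowercases and keeps runs of two or more word characters, which is assumed to mirror the default token pattern. The example sentences are made up.

```python
import re


def normalise_acronymns(text):
    # Protect single letters at sentence boundaries, then strip acronym periods.
    text = re.sub(r'\s([A-Za-z])\.\s', r' \1..  ', text)
    return re.sub(r'(?<!\w)([A-Za-z])\.', r'\1', text)


def normalise_decimals(text):
    # Replace the decimal point between two digits with POINT.
    return re.sub(r'([0-9])\.([0-9])', r'\1POINT\2', text)


def normalise_dashes(text):
    # Replace dashes that join a one- or two-character fragment to another word.
    text = re.sub(r'(^[0-9a-zA-Z]{1,2})-([0-9a-zA-Z])', r'\1DASH\2', text)
    text = re.sub(r'([\s\W][0-9a-zA-Z]{1,2})-([0-9a-zA-Z])', r'\1DASH\2', text)
    text = re.sub(r'([0-9a-zA-Z])-([0-9a-zA-Z]{1,2}[\s\W])', r'\1DASH\2', text)
    return re.sub(r'([0-9a-zA-Z])-([0-9a-zA-Z]{1,2}$)', r'\1DASH\2', text)


def simple_tokens(text):
    # Stand-in tokeniser: lowercase, keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())


acronym = normalise_acronymns('made in the U.S.A. today')
decimal = normalise_decimals('accuracy of 0.95')
dashed = normalise_dashes('x-ray scans of long-term k-means runs')
tokens = simple_tokens(dashed)

print(acronym)   # made in the USA today
print(decimal)   # accuracy of 0POINT95
print(dashed)    # xDASHray scans of long-term kDASHmeans runs
print(tokens)
```

Without the DASH placeholder, `x-ray` would be split at the dash and the single-character `x` dropped by the tokeniser. Longer compounds such as `long-term` are left alone and become two tokens.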

In [7]:
# Create the lemmatiser (requires the NLTK WordNet data, e.g. nltk.download('wordnet'))
wnl = WordNetLemmatizer()

# Create a tokeniser
count = CountVectorizer(strip_accents='ascii', min_df=1)
tokeniser = count.build_analyzer()

In [8]:
with open("../../Data/semantic_scholar_filtered/1980.txt", "r", encoding="utf8") as f:
    lines = [json.loads(line) for line in f]

In [9]:
for i in range(250,300):
    title = lines[i]['title']
    abstract = lines[i]['paperAbstract']
    text = title + ' ' + abstract
    print(text)
    print('-'*50)
    if reasonable(title, abstract):
        print(clean(text, wnl, tokeniser))
    else:
        print('NOT REASONABLE')
    print('='*50)

Some experiments in discrete utterance recognition This paper is concerned with the following three aspects of the discrete utterance recognition problem: utterance normalization, dynamic programming algorithm implementation, and boundary error effects. Performance sensitivity as a function of each aspect of the problem is comparatively studied utilizing several available alternatives and significant conclusions are drawn regarding each of them. The concept of proportional normalizing is introduced as an effective method of handling the utterance normalization problem. A database consisting of the utterances of the alpha-digit vocabulary produced by several male and female speakers is used to conduct all the experiments.
--------------------------------------------------
some experiment in discrete utterance recognition this paper is concerned with the following three aspect of the discrete utterance recognition problem utterance normalization dynamic programming algorithm implementati

In [18]:
clean(text, wnl, tokeniser)

'yield model for productivity optimization of vlsi memory chip with redundancy and partially good product model with mixed poisson statistic ha been developed for calculating the yield for memory chip with redundant line and for partially good product the mixing process requires two parameter which are readily obtained from product data the product is described in the model by critical area which depend on the circuit sensitivity to defect and they can be determined in systematic way the process is represented in the model by defect density and gross yield loss these are measured with defect monitor independently of product type this paper show how the yield for any product can be calculated given the critical area defect density and mixing parameter future yield are forecast by using expected improvement in defect density example show good agreement between actual and calculated yield'

## Clean real data

In this section, we go through each year file in `semantic_scholar_filtered` and clean the data. The cleaned year files are stored in `semantic_scholar_cleaned`, with one document per line (title and abstract joined together). The data is then prepared for vectorisation and use.

In [6]:
# Create the lemmatiser (requires the NLTK WordNet data, e.g. nltk.download('wordnet'))
wnl = WordNetLemmatizer()

# Create a tokeniser
count = CountVectorizer(strip_accents='ascii', min_df=1)
tokeniser = count.build_analyzer()


for year in range(1980, 2022):
    cleaned_text = []
    print(year)
    with open("../../Data/semantic_scholar_filtered/"+str(year)+".txt", "r", encoding="utf8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)

            title = record['title']
            abstract = record['paperAbstract']
            if reasonable(title, abstract):
                text = title + ' ' + abstract
                cleaned_text.append(clean(text, wnl, tokeniser))

    # Append the cleaned documents for this year, one per line
    with open("../../Data/semantic_scholar_cleaned/"+str(year)+".txt", "a") as f:
        for line in cleaned_text:
            f.write(line + '\n')

1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
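Before vectorising, it can help to check the cleaned files by counting how many documents each token appears in. The sketch below uses only the standard library as a stand-in for a real vectoriser; `document_frequencies` is a hypothetical helper and the cleaned lines are made-up examples in the style of the output above.

```python
from collections import Counter


def document_frequencies(cleaned_lines):
    """Count, for each token, how many documents contain it at least once."""
    frequencies = Counter()
    for line in cleaned_lines:
        # Use a set so a token repeated within one document is counted once.
        frequencies.update(set(line.split()))
    return frequencies


# Made-up cleaned documents, one per line as in the cleaned year files.
cleaned_lines = [
    "yield model for memory chip memory",
    "memory chip with redundant line",
    "utterance recognition problem",
]
frequencies = document_frequencies(cleaned_lines)
print(frequencies.most_common(3))
```

Tokens that appear in nearly every document, or in only one, are natural candidates for the stopword and minimum-frequency settings of whichever vectoriser is used next.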


In [19]:
i

0