This interactive .ipynb notebook contains functions that open the Sci-Hub page for a given DOI, extract the text of the article found there, and count the occurrences of a specific target word in that text.

*** Please copy the .ipynb file to your own drive before working on it *** 

Author: Busra Tugce Gurbuz

---

In [None]:
#@markdown Install needed libraries
!pip install slate3k

In [42]:
#@markdown Import needed libraries
import re
import io
import requests
import slate3k as slate #Run the install cell above if this import fails
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

This part covers the following components:

(1) **extractPDFlink(doi)** - Extracts the PDF URL for a DOI from the Sci-Hub webpage

(2) **extractTextFromPDF(pdf_link)** - Extracts the text from the PDF at the obtained URL

(3) **findNumWords(extracted_text, target_word)** - Counts the occurrences of a target word in the extracted text
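
The three components are meant to be chained: the output of each step is the input of the next. A minimal sketch of that flow follows. The name `countWordForDoi` and its three function parameters are hypothetical, introduced here only for illustration; the step functions are passed in as arguments so the sketch itself needs no network access.

```python
def countWordForDoi(doi, target_word, get_link, get_text, count_words):
  '''
  Hypothetical helper (not part of the original notebook) that chains the three steps
  ARGS:
    - doi : (str) DOI of the paper
    - target_word : (str) the word to count
    - get_link : function mapping a DOI to a PDF URL
    - get_text : function mapping a PDF URL to the article text
    - count_words : function mapping (text, target_word) to a count
  RETURNS:
    - n : (int) whatever count_words returns for the extracted text
  '''
  pdf_link = get_link(doi)                 #step 1: DOI -> PDF URL
  extracted_text = get_text(pdf_link)      #step 2: PDF URL -> text
  return count_words(extracted_text, target_word) #step 3: text -> count
```

With the functions defined in this notebook, the call would look like `countWordForDoi('10.1111/cdev.13080', 'data', extractPDFlink, extractTextFromPDF, findNumWords)`.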

In [25]:
def extractPDFlink(doi):
  '''
  Extracts the PDF URL for a DOI from the Sci-Hub webpage
  ARGS:
    - doi : (str) DOI of the paper
  RETURNS:
    - pdf_link : (str) URL of the article PDF on Sci-Hub
  '''
  ##Form url: unfortunately, I could only pass scihub security ;(
  url = 'https://www.sci-hub.st/' + doi
  ##Request data from the url
  response = requests.get(url) #fetches the webpage
  source_text = response.text  #page source as text
  soup = BeautifulSoup(source_text, 'html.parser') #parses the source into a bs4 object (explicit parser avoids a warning)
  link_line = soup.find_all('embed', {'id': 'pdf'}) #finds the <embed> tag that holds the PDF link
  if not link_line: #no such tag, e.g. the article is missing or the page layout changed
    raise ValueError(f'No PDF link found on {url}')
  pdf_link = link_line[0].get('src')

  return pdf_link

In [26]:
#An example of PDF-link given doi
pdf_link = extractPDFlink('10.1111/cdev.13080')
pdf_link

'https://twin.sci-hub.st/6833/259fc8ef5277189fcc8e1fd80e22f3b5/yaple2018.pdf#navpanes=0&view=FitH'
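
The link returned above ends with a viewer fragment (`#navpanes=0&view=FitH`) that is not needed for downloading the file. In general, the `src` attribute of an embed tag can also be protocol-relative (`//host/path`) or relative to the page; whether Sci-Hub ever returns such links is an assumption here, not something this notebook shows. A minimal sketch of normalizing the link with the standard library, where `normalizePDFlink` and its `page_url` default are hypothetical:

```python
from urllib.parse import urljoin, urldefrag

def normalizePDFlink(src, page_url='https://www.sci-hub.st/'):
  '''
  Hypothetical helper: turns an embed src into an absolute URL without a fragment
  ARGS:
    - src : (str) value of the embed tag's src attribute
    - page_url : (str) URL of the page the tag was found on (assumed default)
  RETURNS:
    - (str) absolute URL with any '#...' fragment removed
  '''
  absolute = urljoin(page_url, src)        #resolves '//host/path' and '/path' against the page URL
  clean, _fragment = urldefrag(absolute)   #splits off the '#...' part
  return clean
```

An already absolute link passes through `urljoin` unchanged, so the helper only strips the fragment in that case.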

In [36]:
def extractTextFromPDF(pdf_link):
  '''
  Extracts the text from the article PDF at the given URL
  ARGS:
    - pdf_link : (str) URL of the article PDF on Sci-Hub [comes from extractPDFlink()]
  RETURNS:
    - text : (str) text of the whole article
  '''
  ##Extract the text from PDF
  remote_file = urlopen(Request(pdf_link)).read() #download the PDF as bytes
  memory_file = io.BytesIO(remote_file) #wrap the bytes in an in-memory file
  extracted_text = slate.PDF(memory_file) #extract the text (list-like, see the example output below)
  ##Clean the text
  text = str(extracted_text) #convert the extracted list to one string (its printed form)
  text = text.replace('\n', ' ') #replace real newline characters with spaces
  text = text.replace('\\n', ' ') #replace the literal two-character sequence \n produced by str() with spaces

  return text
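
Because `str()` is applied to the whole extracted object, the returned text keeps the list's brackets and quotes (the example output starts with `['`). A possible alternative is sketched below, under the assumption that the extractor yields a sequence with one string per page. The name `joinPages` is hypothetical and the sketch works on any list of strings.

```python
import re

def joinPages(pages):
  '''
  Hypothetical helper: joins per-page strings into one clean string
  ARGS:
    - pages : (iterable of str) text of each page
  RETURNS:
    - (str) all pages joined, with every whitespace run collapsed to one space
  '''
  joined = ' '.join(str(page) for page in pages) #one space between pages
  return re.sub(r'\s+', ' ', joined).strip()     #collapse newlines, tabs and repeated spaces
```

Collapsing whitespace this way also helps multi-word targets such as 'working memory' match when the PDF breaks the phrase across lines.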

In [37]:
#An example of extracted text from PDF link
extracted_text = extractTextFromPDF(pdf_link)
extracted_text



"['Child Development, xxxx 2018, Volume 00, Number 0, Pages 1–13  this Special Section is Meta-analysis and Individual Participant Data  The title for Synthesis in Child Development, edited by Glenn I. Roisman and Marinus H. van IJzendoorn  N-back Working Memory Task: Meta-analysis of Normative fMRI Studies With Children  National Research University Higher School of Economics  Zachary Yaple  Marie Arsalidou  National Research University Higher School of Economics and York University  The n-back task is likely the most popular measure of working memory for functional magnetic resonance imaging (fMRI) studies. Despite accumulating neuroimaging studies with the n-back task and children, its neural representation is still unclear. fMRI studies that used the n-back were compiled, and data from children up to 15 years (n = 260) were analyzed using activation likelihood estimation. Results show concordance in frontoparietal regions recognized for their role in working memory as well as regio

In [44]:
def findNumWords(extracted_text, target_word):
  '''
  Counts the occurrences of a target word in the extracted text
  ARGS:
    - extracted_text : (str) text extracted from the article PDF [comes from extractTextFromPDF()]
    - target_word : (str) the word to search for in the article
  RETURNS:
    - n : (int) number of occurrences of the target word in the article
  '''
  word_list = re.findall(re.escape(target_word), extracted_text, flags=re.IGNORECASE) #case-insensitive matches of the literal target word (re.escape keeps characters like '.' or '(' from acting as regex syntax)
  n = len(word_list)
  print(f'Number of occurrences of the word {target_word} is {n}')
  return n

In [47]:
#An example of counting target words in the text
n0 = findNumWords(extracted_text, 'data')
n1 = findNumWords(extracted_text, 'working memory')
n2 = findNumWords(extracted_text, 'data avaliable')

Number of occurrences of the word data is 23
Number of occurrences of the word working memory is 45
Number of occurrences of the word data avaliable is 0
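
Note that `findNumWords` counts every place the target appears, including inside longer words, so a search for 'data' would also count 'database' or 'metadata' if they occur. If only whole words should count, one option is to add word boundaries around the escaped target. The sketch below is a variant for illustration, not the notebook's method, and `countWholeWord` is a hypothetical name.

```python
import re

def countWholeWord(text, target_word):
  '''
  Hypothetical variant of findNumWords that ignores matches inside longer words
  ARGS:
    - text : (str) text to search
    - target_word : (str) word or phrase to count
  RETURNS:
    - (int) number of case-insensitive whole-word matches
  '''
  pattern = r'\b' + re.escape(target_word) + r'\b' #\b marks a boundary between word and non-word characters
  return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = 'Data from the database; metadata and data.'
print(countWholeWord(sample, 'data'))  #counts 'Data' and 'data' but not 'database' or 'metadata'
```

The `\b` boundaries assume the target starts and ends with a letter, digit, or underscore; a target that ends in punctuation would need a different boundary rule.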
