# Web Scraping Approach Description
---


Authors: 
- Pablo Reyes Martín           (NIA: 100409333)
- Claudio Sotillos Peceros     (NIA: 100409401)
- Bosco De Enrique Romeu       (NIA: 100406718)
- Daniel De Las Cuevas Turel  (NIA: 100406666)

---
To build our own dataset, we proceeded as follows.

First, we chose 10 main disciplines:<br>
'Computer Science', 'Philosophy', 'Mathematics', 'Psychology', 'Biomedicine', 'Criminology and Criminal Justice', 'Geography', 'Education', 'Physics' and 'Economics'.
From the Springer Link search page we took the URL of each discipline (10 disciplines, 10 URLs).

If you click on any of these disciplines, you will find the articles split into pages (about 20 articles per page), ordered from newest to oldest. Initially, you can only access each article's title 
and the URL that leads to the article's page (with all its information: abstract, authors, the article itself, etc.).

Moving between pages is simple because each discipline's URL is almost static: the only part that changes is the page number. For example:  
https://link.springer.com/search/page/1?facet-discipline=%22Biomedicine%22&facet-language=%22En%22&facet-content-type=Article <br>
Of course, the '%22Biomedicine%22' segment changes for each discipline, but since we have gathered the URL of each discipline, we do not need to edit that segment to move to another discipline.
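
As a minimal sketch of this pagination scheme, the snippet below rebuilds the example URL above from a discipline's search URL. The helper name `page_url` is hypothetical (the notebook's own helper does the same job with string splitting), and it only assumes that the page number goes into the path as `/page/<n>`, as in the example.

```python
from urllib.parse import urlsplit, urlunsplit

def page_url(search_url: str, page: int) -> str:
    """Insert '/page/<page>' into the path of a search URL, keeping the query as-is."""
    parts = urlsplit(search_url)
    paged_path = f"{parts.path.rstrip('/')}/page/{page}"
    return urlunsplit((parts.scheme, parts.netloc, paged_path, parts.query, parts.fragment))

base = ("https://link.springer.com/search"
        "?facet-discipline=%22Biomedicine%22&facet-language=%22En%22&facet-content-type=Article")
example = page_url(base, 1)
print(example)
```

Changing the second argument is then enough to walk through the result pages of a discipline.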

That is why we first ran an initial 'outer' web scraping pass to obtain each article's title and URL. This pass loops over the discipline URLs 
and collects about 2000 [title, url, discipline name] instances per discipline, for a total of roughly 20000 instances (19799 rows in the dataset shown further down; the number later shrinks because some articles have no abstract).
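
A quick check of these figures, assuming exactly 20 articles on every page (in practice it is only approximately 20) and the `range(1, 100)` page loop used in the outer scraping cell:

```python
pages_per_discipline = len(range(1, 100))   # the loop visits pages 1 to 99
articles_per_page = 20                      # assumption: every page is full
disciplines = 10

per_discipline = pages_per_discipline * articles_per_page
total = per_discipline * disciplines
print(per_discipline, total)
```

Under that assumption the loop yields 1980 instances per discipline and 19800 overall, slightly below the rounded figures quoted above.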

Finally, the 'inner' web scraping pass went through the URL column and obtained the remaining features, which are found on each article's own page. 
The first pass took about an hour to complete and the second took about 4 hours. 
There is probably a more efficient way of doing this, but this is what we were able to do. 
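
One direction that might help, since the work is dominated by waiting on the network, is to issue several requests concurrently. The sketch below uses the standard library's `ThreadPoolExecutor`; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so that the example runs offline. Whether this would actually be faster than the approach used in this notebook has not been measured.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL using a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

def fake_fetch(url):
    # Stand-in for a real HTTP request (hypothetical, for illustration only)
    return f"<html>{url}</html>"

pages = fetch_all(["a", "b", "c"], fake_fetch)
print(pages)
```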

Once we had generated a pandas DataFrame with all the features, we deleted the rows that did not contain an abstract, since these are useless for the topic modelling task.
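
A minimal sketch of this filtering step on a toy DataFrame (the rows are made up for illustration; only the 'Abstract' column name matches our dataset). It uses `DataFrame.dropna`, which is one standard way to do it and not necessarily the exact code used further down.

```python
import pandas as pd

toy = pd.DataFrame({
    "titles": ["Article A", "Article B", "Article C"],
    "Abstract": ["Some abstract text", None, "Another abstract"],
})

# Keep only the rows whose 'Abstract' is not null, then renumber the index
clean = toy.dropna(subset=["Abstract"]).reset_index(drop=True)
print(len(toy), "->", len(clean))
```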

<h1>Let's proceed with the web scraping</h1>

In [None]:
%%capture   
!pip install ray

In [None]:
from google.colab import drive
from bs4 import BeautifulSoup
import requests
import pandas as pd
import ray
import pickle

In [None]:
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Local folder for saving the final dataset
local_folder = '/content/drive/MyDrive/ML Applications/Final Project/'   # Claudio's local folder
# local_folder = '/content/drive/MyDrive/'     # Pablo's local folder

In [None]:
# Topics we are going to select
topics = ['Physics', 'Biomedicine','Economics','Computer Science','Mathematics','Philosophy','Psychology','Geography','Criminology and Crime Justice','Education']


# Search URL of each discipline (same order as `topics`)

linksoftopics=["https://link.springer.com/search?facet-language=%22En%22&facet-content-type=Article&facet-discipline=%22Physics%22",
               "https://link.springer.com/search?facet-language=%22En%22&facet-content-type=Article&facet-discipline=%22Biomedicine%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-discipline=%22Economics%22&facet-language=%22En%22&facet-content-type=Article&just-selected-from-overlay-value=%22Economics%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-language=%22En%22&just-selected-from-overlay-value=%22Computer+Science%22&facet-content-type=Article&facet-discipline=%22Computer+Science%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-language=%22En%22&facet-content-type=Article&facet-discipline=%22Mathematics%22&just-selected-from-overlay-value=%22Mathematics%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-language=%22En%22&facet-discipline=%22Philosophy%22&facet-content-type=Article&just-selected-from-overlay-value=%22Philosophy%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-discipline=%22Psychology%22&facet-language=%22En%22&facet-content-type=Article&just-selected-from-overlay-value=%22Psychology%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-language=%22En%22&just-selected-from-overlay-value=%22Geography%22&facet-content-type=Article&facet-discipline=%22Geography%22",
               "https://link.springer.com/search?just-selected-from-overlay=facet-discipline&facet-language=%22En%22&facet-content-type=Article&facet-discipline=%22Criminology+and+Criminal+Justice%22&just-selected-from-overlay-value=%22Criminology+and+Criminal+Justice%22",
               "https://link.springer.com/search?facet-language=%22En%22&facet-content-type=Article&facet-discipline=%22Education%22&just-selected-from-overlay-value=%22Education%22&just-selected-from-overlay=facet-discipline"]

In [None]:
# RUN THIS CELL FIRST: helper functions used by both scraping passes

# Build the URL of results page `n` by inserting "/page/<n>" before the query string
def pagelinkofdiscipline(linkoftpc,n):
  lis=linkoftpc.split("?")
  lis.insert(1,"/page/"+str(n)+"?")
  return "".join(lis)

# Return the text of a BeautifulSoup tag, or None if the tag was not found
def ifnone(metadata):
  if metadata is None:
    return None
  return metadata.text

# Return the text of every tag in a result set (None if there is no result set)
def ifkey(request):
  if request is None:
    return None
  return [tag.text for tag in request]

# Return the absolute article URL of every link tag.
# An empty list (rather than None) is returned when no links were found,
# so that callers can safely use len() and list.extend() on the result.
def geturls(request):
  if not request:
    return []
  return ["https://link.springer.com"+link.get('href') for link in request]

# Return a list with the discipline name (topic) repeated `length` times
def generatelistoftopic(topic,length):
  return [topic]*length

# Outer Web Scraping

In [None]:
totaltitles=[]
totalurls=[]
totaltopics=[]
# NOTE: the [6:] slices restrict this run to the last four disciplines;
# remove them to scrape all ten in a single run
for topic,topic_url in zip(topics[6:],linksoftopics[6:]):
  print(topic)
  # range(1,100) visits pages 1 to 99 (about 20 articles each)
  for page in range(1,100):
      # Build the URL of this results page for the current discipline
      cleantopic=pagelinkofdiscipline(topic_url,page)
      # Request the page and collect the article links
      soup=BeautifulSoup(requests.get(cleantopic).text,"lxml")
      links=soup.find_all("a",{"class":"title"})
      # Get the URLs and the titles that belong to them
      # (fall back to an empty list if the page has no article links)
      urlspage=geturls(links) or []
      titlespage=ifkey(links)
      topicgeneration=generatelistoftopic(topic,len(urlspage))

      # Append everything to the lists that accumulate the results
      totalurls.extend(urlspage)
      totaltitles.extend(titlespage)
      totaltopics.extend(topicgeneration)

In [None]:
# # CheckPoint
# with open(local_folder+ 'Non-Clean Datasets/' +"firstphase.pickle","wb") as write:
#   pickle.dump({"titles":totaltitles,"urls":totalurls,"target":totaltopics},write)

In [None]:
# Reload the Data
with open(local_folder+ 'Non-Clean Datasets/'+"firstphase.pickle","rb") as read:
  phase1=pickle.load(read)

urls=phase1["urls"]  # Obtain the urls for making the inner search

# Inner Web Scraping

In [None]:
# Restart Ray so that this cell can be re-run safely
ray.shutdown()
ray.init()

# Remote task: download one article page and extract its metadata
@ray.remote
def phase(url):
  soup=BeautifulSoup(requests.get(url).text,"lxml")
  abstract_md=ifnone(soup.find(id="Abs1-content"))
  publish_md=ifnone(soup.find(class_="c-bibliographic-information__value"))
  authors_md=ifnone(soup.find(class_="c-article-author-affiliation__authors-list"))
  journal_md=ifnone(soup.find("i",{"data-test":"journal-title"}))
  accesses_md=ifnone(soup.find("p",{"class":"c-article-metrics-bar__count"}))
  keywords_md=ifkey(soup.find_all("span",{"itemprop":"about"}))

  return [abstract_md,publish_md,authors_md,journal_md,accesses_md,keywords_md]

2021-04-21 11:19:45,758	INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265


In [None]:
# CAUTION: running this cell resets `journals` to an empty list
journals=[]

In [None]:
# Number of URLs sent to Ray in each batch
steps=200

In [None]:
for bound in range(0,len(urls),steps):
    # Launch one remote task per URL in this batch, then wait for all of them
    res=[phase.remote(x) for x in urls[bound:bound+steps]]
    journalinterval=ray.get(res)
    journals.extend(journalinterval)

    print("{} execution...".format((bound+steps)//steps))
    

1 execution...
2 execution...
3 execution...
4 execution...
5 execution...
6 execution...
7 execution...
8 execution...
9 execution...
10 execution...
11 execution...
12 execution...
13 execution...
14 execution...
15 execution...
16 execution...
17 execution...
18 execution...
19 execution...
20 execution...
21 execution...
22 execution...
23 execution...
24 execution...
25 execution...
26 execution...
27 execution...
28 execution...
29 execution...
30 execution...
31 execution...
32 execution...
33 execution...
34 execution...
35 execution...
36 execution...
37 execution...
38 execution...
39 execution...
40 execution...
41 execution...
42 execution...
43 execution...
44 execution...
45 execution...
46 execution...
47 execution...
48 execution...
49 execution...
50 execution...
51 execution...
52 execution...
53 execution...
54 execution...
55 execution...
56 execution...
57 execution...
58 execution...
59 execution...
60 execution...
61 execution...
62 execution...
63 execution...
6

In [None]:
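# Save the extra features in the dictionary.
# Each entry of `journals` is [abstract, date, authors, journal, accesses, keywords];
# zip(*journals) transposes these rows into one tuple per feature. This avoids
# building a NumPy array from rows whose keyword lists have different lengths.
(abstracts, dates, authors, journal_names,
 accesses, keywords) = (list(col) for col in zip(*journals))
phase1['Abstract'] = abstracts
phase1['Publication Date'] = dates
phase1['Authors'] = authors
phase1['Journal'] = journal_names
phase1['Accesses'] = accesses
phase1['KeyWords'] = keywords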


In [None]:
df_dict = pd.DataFrame(phase1,columns = list(phase1.keys())) 

In [None]:
df_dict= df_dict[['titles', 'Abstract', 'KeyWords', 'Authors', 'Journal','Publication Date',  'Accesses' ,'urls', 'target']]

In [None]:
# Randomize the ordering of the Rows
result = df_dict.sample(frac=1).reset_index(drop=True)

In [None]:
# # Save the DataSet
# with open(local_folder+ 'Non-Clean Datasets/'+"Dataset.pickle", "wb") as save:
#           pickle.dump(result, save)  

In [None]:
# Load the dataset
with open(local_folder+ 'Non-Clean Datasets/'+"Dataset.pickle", "rb") as metadata:
   md=pickle.load(metadata)

In [None]:
md

Unnamed: 0,titles,Abstract,KeyWords,Authors,Journal,Publication Date,Accesses,urls,target
0,Improved approaches for density-based outlier ...,Density-based algorithms are important data cl...,"[Data analysis, Density clustering, DBSCAN, Ou...",Aymen Abid & Abdennaceur Kachouri,Computing,13 January 2021,44 Accesses,https://link.springer.com/article/10.1007/s006...,Computer Science
1,Can a Bodily Theorist of Pain Speak Mandarin?,"According to a bodily view of pain, pains are ...","[Pain, Bodily theories of pain, Cross-linguist...",Chenwei Nie,Philosophia,15 July 2020,250 Accesses,https://link.springer.com/article/10.1007/s114...,Philosophy
2,Ando–Choi–Effros liftings for regular maps bet...,The Ando–Choi–Effros lifting theorem provides ...,"[Banach lattices, Regular maps, Liftings, Posi...",Javier Alejandro Chávez-Domínguez,Positivity,01 April 2019,44 Accesses,https://link.springer.com/article/10.1007/s111...,Mathematics
3,Towards the Epistemology of the Non-trivial: R...,The present article discusses shared epistemol...,"[Non-trivial research, First-person research, ...",Urban Kordeš & Ema Demšar,Foundations of Science,25 November 2019,122 Accesses,https://link.springer.com/article/10.1007/s106...,Philosophy
4,Kenny’s Whistleblowing and Stanger’s Whistlebl...,,[],Wim Vandekerckhove,Philosophy of Management,11 June 2020,555 Accesses,https://link.springer.com/article/10.1007/s409...,Philosophy
...,...,...,...,...,...,...,...,...,...
19795,Rituals of Vocational Socialisation: Faith-Bui...,This paper addresses the question of how highe...,"[Transitions, Rituals, Vocational socialisatio...",Rebecca Ye,Vocations and Learning,15 April 2020,162 Accesses,https://link.springer.com/article/10.1007/s121...,Education
19796,Modelling inelastic Granular Media Using Dynam...,We construct a new mesoscopic model for granul...,"[Granular media, Dynamical Density Functional ...",B. D. Goddard & T. D. Hurst,Journal of Statistical Physics,18 January 2020,217 Accesses,https://link.springer.com/article/10.1007/s109...,Physics
19797,Availability-aware and energy-aware dynamic SF...,Software-defined networking and network functi...,"[Service function chains, Placement, Network f...",Guto Leoni Santos & Judith Kelner,The Journal of Supercomputing,28 March 2021,28 Accesses,https://link.springer.com/article/10.1007/s112...,Computer Science
19798,Impact of Trade Liberalisation on the Informal...,This paper empirically investigates the impact...,"[Trade, Informal sector, Panel data, BRICS]",Pooja Khanna,The Indian Journal of Labour Economics,03 December 2020,49 Accesses,https://link.springer.com/article/10.1007/s410...,Economics


# Drop the rows with None values

In [None]:
# Replace the null abstracts with 0 so they can be selected in the next cells
md.loc[md['Abstract'].isnull(), 'Abstract'] = 0

In [None]:
# Count the rows without an abstract (the ones we are about to lose)
len(list(md[md['Abstract'] == 0].index))

1353

In [None]:
# Remove the abstracts that are 0's
md.drop(list(md[md['Abstract'] == 0].index), inplace=True)

In [None]:
md= md.reset_index(drop=True)
md

Unnamed: 0,titles,Abstract,KeyWords,Authors,Journal,Publication Date,Accesses,urls,target
0,Improved approaches for density-based outlier ...,Density-based algorithms are important data cl...,"[Data analysis, Density clustering, DBSCAN, Ou...",Aymen Abid & Abdennaceur Kachouri,Computing,13 January 2021,44 Accesses,https://link.springer.com/article/10.1007/s006...,Computer Science
1,Can a Bodily Theorist of Pain Speak Mandarin?,"According to a bodily view of pain, pains are ...","[Pain, Bodily theories of pain, Cross-linguist...",Chenwei Nie,Philosophia,15 July 2020,250 Accesses,https://link.springer.com/article/10.1007/s114...,Philosophy
2,Ando–Choi–Effros liftings for regular maps bet...,The Ando–Choi–Effros lifting theorem provides ...,"[Banach lattices, Regular maps, Liftings, Posi...",Javier Alejandro Chávez-Domínguez,Positivity,01 April 2019,44 Accesses,https://link.springer.com/article/10.1007/s111...,Mathematics
3,Towards the Epistemology of the Non-trivial: R...,The present article discusses shared epistemol...,"[Non-trivial research, First-person research, ...",Urban Kordeš & Ema Demšar,Foundations of Science,25 November 2019,122 Accesses,https://link.springer.com/article/10.1007/s106...,Philosophy
4,A Spatially Sixth-Order Hybrid L1-CCD Method f...,We consider highly accurate schemes for nonlin...,[nonlinear time fractional Schrödinger equatio...,Chun-Hua Zhang & Hai-Wei Sun,Applications of Mathematics,04 December 2019,30 Accesses,https://link.springer.com/article/10.21136/AM....,Mathematics
...,...,...,...,...,...,...,...,...,...
18442,Rituals of Vocational Socialisation: Faith-Bui...,This paper addresses the question of how highe...,"[Transitions, Rituals, Vocational socialisatio...",Rebecca Ye,Vocations and Learning,15 April 2020,162 Accesses,https://link.springer.com/article/10.1007/s121...,Education
18443,Modelling inelastic Granular Media Using Dynam...,We construct a new mesoscopic model for granul...,"[Granular media, Dynamical Density Functional ...",B. D. Goddard & T. D. Hurst,Journal of Statistical Physics,18 January 2020,217 Accesses,https://link.springer.com/article/10.1007/s109...,Physics
18444,Availability-aware and energy-aware dynamic SF...,Software-defined networking and network functi...,"[Service function chains, Placement, Network f...",Guto Leoni Santos & Judith Kelner,The Journal of Supercomputing,28 March 2021,28 Accesses,https://link.springer.com/article/10.1007/s112...,Computer Science
18445,Impact of Trade Liberalisation on the Informal...,This paper empirically investigates the impact...,"[Trade, Informal sector, Panel data, BRICS]",Pooja Khanna,The Indian Journal of Labour Economics,03 December 2020,49 Accesses,https://link.springer.com/article/10.1007/s410...,Economics


In [None]:
# # Save the CLEAN DataSet
# with open(local_folder+ 'Non-Clean Datasets/'+"Dataset_clean.pickle", "wb") as save:
#           pickle.dump(md, save)  

# Finally, we have our dataset!