# Parsing GEPRIS pages of NFDI consortia for project applicants and participants
The list of funded NFDI projects with their GEPRIS IDs was saved in notebook 02. Let's read that file into a dataframe.

In [1]:
import pandas as pd
df = pd.read_csv("../../../data/GEPRIS_NFDI_all.csv").fillna('')
df

Unnamed: 0,GEPRIS,Title,Description
0,https://gepris.dfg.de/gepris/projekt/441914366,GHGA,German Human Genome-Phenome Archive
1,https://gepris.dfg.de/gepris/projekt/441926934,NFDI4Cat,NFDI for Catalysis-Related Sciences
2,https://gepris.dfg.de/gepris/projekt/441958017,NFDI4Culture,Consortium for research data on material and i...
3,https://gepris.dfg.de/gepris/projekt/441958208,NFDI4Chem,Chemistry Consortium in the NFDI
4,https://gepris.dfg.de/gepris/projekt/442032008,NFDI4BioDiversity,"Biodiversity, Ecology & Environmental Data"
5,https://gepris.dfg.de/gepris/projekt/442077441,DataPLANT,Data in PLANT research
6,https://gepris.dfg.de/gepris/projekt/442146713,NFDI4Ing,National Research Data Infrastructure for Engi...
7,https://gepris.dfg.de/gepris/projekt/442326535,NFDI4Health,National Research Data Infrastructure for Pers...
8,https://gepris.dfg.de/gepris/projekt/442494171,KonsortSWD,"Consortium for the Social, Behavioural, Educat..."
9,https://gepris.dfg.de/gepris/projekt/460033370,Text+,


For testing, let's start with a single consortium, BERD@BW.

In [2]:
import requests
GEPRIS = "https://gepris.dfg.de/gepris/projekt/460037581"
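params = {'language': 'en'}  # request the English version of the page
r = requests.get(GEPRIS, params=params)
# Re-encode with the encoding requests guessed, then decode the bytes as UTF-8
text = r.text.encode(r.encoding).decode('utf8')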
print(text[0:100].strip())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www

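The `encode`/`decode` round trip in the cell above presumably compensates for a wrongly guessed response encoding. The following is a minimal, offline sketch of that effect; the sample string and the `iso-8859-1` guess are assumptions chosen for illustration, not values taken from the actual response.

```python
# Illustrative only: simulate a UTF-8 page that was decoded with the wrong codec.
original = "Universität Mannheim"
raw_bytes = original.encode("utf-8")       # bytes as a server would send them
wrong_guess = "iso-8859-1"                 # assumed wrong value of r.encoding
garbled = raw_bytes.decode(wrong_guess)    # what r.text would contain in that case

# Same round trip as in the notebook cell: back to bytes, then decode as UTF-8.
repaired = garbled.encode(wrong_guess).decode("utf-8")

print(garbled)
print(repaired)
```

If the guessed encoding is already UTF-8, the round trip leaves the text unchanged, so the step is harmless in that case.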

Then the subject area is parsed:

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
subject_area = results[0].find_all('span')[-1].text.strip().replace('  ', '').replace('\n', ', ')
print(subject_area)

Social and Behavioural Sciences
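The chained `replace` calls above turn a multi-line span into a comma-separated string. As a hedged alternative, the same cleanup can be sketched by splitting on lines; `normalize_subject_area` is a hypothetical helper and the raw text below is an assumed example, since the real markup may differ.

```python
def normalize_subject_area(raw: str) -> str:
    """Join the non-empty, stripped lines of `raw` with ', '."""
    parts = [line.strip() for line in raw.splitlines()]
    return ", ".join(part for part in parts if part)


# Hypothetical raw span text with uneven indentation.
raw = "\n   Social and Behavioural Sciences\n     Humanities\n"
print(normalize_subject_area(raw))
```

Splitting on lines does not depend on the indentation width, whereas removing pairs of spaces can leave a stray space when a line is indented by an odd number of spaces.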


The homepage is parsed:

In [4]:
try:
    homepage = soup.find('a', class_="extern").get('href')
except AttributeError:  # no external link on the page: find() returned None
    homepage = ''
print(homepage)




The project description is parsed:

In [5]:
description = soup.find('div', id="projekttext").text.strip()
print(description)

The research domains of business, economics, and other social sciences are concerned with the relationships among individuals and organizations within a society. To understand these complex systems, social science disciplines have been using empirical methods for a long time. However, unstructured and non-standard data, i.e., information that either does not have a previously defined data model or is not organized in a predefined manner (e.g., images or videos from social media), are gaining in relevance. The generation of continuous data streams in society and economy (datafication) strengthens this trend: It is estimated that by 2025 80% of the data processed in economic applications will be available in unstructured form. Because of the sheer size and, more importantly, the lack of structure and the heterogeneity of raw digital data, the BERD@NFDI community calls for innovative and reusable methods, mostly from artificial intelligence and machine learning, as well as a suitable stor

Let's parse the rest of the useful data, put it into a `berd` dictionary, and pretty-print it:

In [6]:
s = soup.find_all('div', class_="content_frame")
berd = {}
for k in s[1].find_all('div'):
    try:
        t = k.find('span', class_="name")
        if t.get_text().strip()!="DFG Programme" and k.find_all('a', class_="intern"):
            berd[t.get_text().strip()] = [ (g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in k.find_all('a', class_="intern")]
    except AttributeError:  # div without a span.name label: t is None
        pass
from pprint import pprint
pprint(berd)

{'Applying institution': [('Universität Mannheim',
                           'https://gepris.dfg.de/gepris/institution/10045')],
 'Co-Spokespersons': [('Professor Dr. Bernd  Bischl',
                       'https://gepris.dfg.de/gepris/person/289708818'),
                      ('Professor Dr. Stefan  Dietze',
                       'https://gepris.dfg.de/gepris/person/218646654'),
                      ('Professor Dr. Marc  Fischer',
                       'https://gepris.dfg.de/gepris/person/1804971'),
                      ('Dr. Sabine  Gehrlein',
                       'https://gepris.dfg.de/gepris/person/398007310'),
                      ('Professor Dr. Mark  Heitmann',
                       'https://gepris.dfg.de/gepris/person/216564778'),
                      ('Professor Dr. Hartmut  Höhle',
                       'https://gepris.dfg.de/gepris/person/442455860'),
                      ('Professor Dr. Göran  Kauermann',
                       'https://gepris.dfg.de/gepris/pers
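Each value in the dictionary is a list of `(name, URL)` tuples, and the last two path segments of a URL carry the entity type and its GEPRIS ID. The sketch below flattens such a dictionary into rows; `flatten` and `berd_sample` are hypothetical names, and the sample reuses only entries visible in the output above.

```python
berd_sample = {
    "Applying institution": [
        ("Universität Mannheim", "https://gepris.dfg.de/gepris/institution/10045"),
    ],
    "Co-Spokespersons": [
        ("Professor Dr. Bernd  Bischl", "https://gepris.dfg.de/gepris/person/289708818"),
        ("Professor Dr. Stefan  Dietze", "https://gepris.dfg.de/gepris/person/218646654"),
    ],
}


def flatten(entries):
    """Return (role, name, kind, gepris_id) rows from a role -> links mapping."""
    rows = []
    for role, links in entries.items():
        for name, url in links:
            # The URL ends in .../<kind>/<id>, e.g. .../person/289708818
            kind, gepris_id = url.rstrip("/").split("/")[-2:]
            # Collapse the double spaces that appear inside the scraped names
            rows.append((role, " ".join(name.split()), kind, gepris_id))
    return rows


for row in flatten(berd_sample):
    print(row)
```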

Now we can loop over all consortia and collect the same data for each of them automatically:

In [7]:
nfdi = {}
for GEPRIS, title in zip(df['GEPRIS'],df['Title']):
    nfdi[GEPRIS] = {}
    nfdi[GEPRIS]['title'] = title
    params = {'language': 'en'}
    r = requests.get(GEPRIS, params=params)
    text = r.text.encode(r.encoding).decode('utf8')
    soup = BeautifulSoup(text, 'html.parser')
    results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
    subject_area = results[0].find_all('span')[-1].text.strip().replace('  ', '').replace('\n', ', ')
    nfdi[GEPRIS]['subject_area'] = subject_area
    try:
        homepage = soup.find('a', class_="extern").get('href')
    except AttributeError:  # no external link on the page: find() returned None
        homepage = ''
    nfdi[GEPRIS]['homepage'] = homepage
    description = soup.find('div', id="projekttext").text.strip()
    nfdi[GEPRIS]['description'] = description
    s = soup.find_all('div', class_="content_frame")
    for k in s[1].find_all('div'):
        try:
            t = k.find('span', class_="name")
            if t.get_text().strip()!="DFG Programme" and k.find_all('a', class_="intern"):
                nfdi[GEPRIS][t.get_text().strip()] = [ (g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in k.find_all('a', class_="intern")]
        except AttributeError:  # div without a span.name label: t is None
            pass
df_nfdi = pd.DataFrame(nfdi).fillna('')
df_nfdi.T

Unnamed: 0,title,subject_area,homepage,description,Applying institution,Co-applicant institution,Participating Institution,Spokesperson,Participating Persons,Co-Spokespersons,Cooperation partners
https://gepris.dfg.de/gepris/projekt/441914366,GHGA,"Medicine, Biology",https://ghga.dkfz.de/,Human genome sequencing and other omics data m...,"[(Deutsches Krebsforschungszentrum (DKFZ), htt...","[(Charité - Universitätsmedizin Berlin, https:...",[(CISPA - Helmholtz-Zentrum für Informationssi...,"[(Professor Dr. Oliver Stegle, Ph.D., https:/...","[(Viktor Achter, https://gepris.dfg.de/gepris...","[(Privatdozent Dr. Peer Bork, https://gepris....",
https://gepris.dfg.de/gepris/projekt/441926934,NFDI4Cat,"Biology, Chemistry, Mathematics, Thermal Engin...",http://gecats.org/NFDI4Cat.html,The overall strategy for the transformation of...,[(DECHEMA Gesellschaft für Chemische Technik u...,[(Fraunhofer-Institut für Offene Kommunikation...,"[(Technische Universität Darmstadt, https://ge...","[(Dr. Andreas Förster, https://gepris.dfg.de/...","[(Professor Dr.-Ing. Bastian Etzold, https://...","[(Professor Dr. Matthias Beller, https://gepr...",
https://gepris.dfg.de/gepris/projekt/441958017,NFDI4Culture,"Humanities, Construction Engineering and Archi...",https://nfdi4culture.de,Digital data on tangible and intangible cultur...,[(Akademie der Wissenschaften und der Literatu...,[(FIZ KarlsruheLeibniz-Institut für Informatio...,[(Arbeitsgemeinschaft der kunsthistorischen Bi...,"[(Professor Torsten Schrade, https://gepris.d...","[(Professorin Dr. Stefanie Acquavella-Rauch, ...","[(Reinhard Altenhöner, https://gepris.dfg.de/...",
https://gepris.dfg.de/gepris/projekt/441958208,NFDI4Chem,Chemistry,https://nfdi4chem.de,The vision of NFDI4Chem is the digitalisation ...,"[(Friedrich-Schiller-Universität Jena, https:/...",[(FIZ KarlsruheLeibniz-Institut für Informatio...,[(Beilstein-Institut zur Förderung der chemisc...,"[(Professor Dr. Christoph Steinbeck, https://...","[(Privatdozent Dr. Carsten Baldauf, https://g...","[(Dr. Felix Bach, https://gepris.dfg.de/gepri...",
https://gepris.dfg.de/gepris/projekt/442032008,NFDI4BioDiversity,"Biology, Medicine, Agriculture, Forestry and V...",https://www.nfdi4biodiversity.org,Biodiversity is more than just the diversity o...,"[(Universität Bremen, https://gepris.dfg.de/ge...",[(Forschungsinstitut für Nutztierbiologie (FBN...,[(Alfred-Wegener-InstitutHelmholtz-Zentrum für...,"[(Professor Dr. Frank Oliver Glöckner, https:...","[(Professor Dr. Christian Ammer, https://gepr...","[(Professorin Dr. Aletta Bonn, https://gepris...",
https://gepris.dfg.de/gepris/projekt/442077441,DataPLANT,Biology,https://nfdi4plants.de,"In modern hypothesis-driven research, scientis...","[(Albert-Ludwigs-Universität Freiburg, https:/...","[(Eberhard Karls Universität Tübingen, https:/...",[(Helmholtz Zentrum München - Deutsches Forsch...,"[(Dr. Dirk von Suchodoletz, https://gepris.dfg...","[(Professor Dr. Rolf Backofen, https://gepris...","[(Dr. Jens Krüger, https://gepris.dfg.de/gepr...",
https://gepris.dfg.de/gepris/projekt/442146713,NFDI4Ing,"Mechanical and Industrial Engineering, Thermal...",https://nfdi4ing.de,NFDI4Ing brings together the engineering commu...,[(Rheinisch-Westfälische Technische Hochschule...,[(Deutsches Zentrum für Luft- und Raumfahrt e....,[(Bundesanstalt für Materialforschung und -prü...,"[(Professor Dr.-Ing. Robert Schmitt, https://...","[(Professorin Dr. Jasmin Aghassi-Hagmann, htt...","[(Verena Anthofer, https://gepris.dfg.de/gepr...",
https://gepris.dfg.de/gepris/projekt/442326535,NFDI4Health,Medicine,https://www.nfdi4health.de,Germany has accumulated a wealth of health-rel...,[(Deutsche Zentralbibliothek für Medizin (ZB M...,"[(Charité - Universitätsmedizin Berlin, https:...",[(Behörde für Gesundheit und Verbraucherschutz...,"[(Professorin Dr. Juliane Fluck, https://gepr...","[(Professor Dr. Thomas Behrens, https://gepri...","[(Professor Dr. Wolfgang Ahrens, https://gepr...",
https://gepris.dfg.de/gepris/projekt/442494171,KonsortSWD,"Social and Behavioural Sciences, Humanities",https://www.konsortswd.de,"The social, behavioural, educational, and econ...",[(GESIS - Leibniz-Institut für Sozialwissensch...,[(Deutsches Institut für Wirtschaftsforschung ...,"[(Bundesamt für Migration und Flüchtlinge, htt...","[(Professor Dr. Christof Wolf, https://gepris...",,"[(Dr. Maja Adena, https://gepris.dfg.de/gepri...",
https://gepris.dfg.de/gepris/projekt/460033370,Text+,Humanities,,Text+ aims to develop a research data infrastr...,"[(Leibniz-Institut für Deutsche Sprache (IDS),...",[(Berlin-Brandenburgische Akademie der Wissens...,"[(Akademie der Wissenschaften in Hamburg, http...","[(Professor Dr. Erhard Hinrichs, https://gepr...","[(Professor Dr. Andreas Henrich, https://gepr...","[(Privatdozent Dr. Alexander Geyken, https://...",
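When looping over many pages, a single failing request or an unexpected page layout would abort the whole run. One possible way to isolate per-page failures is sketched below. `collect_pages` and `fake_fetch` are hypothetical names, and the `fetch` argument stands in for whatever function downloads and parses one page; nothing here is part of the notebook itself.

```python
import time


def collect_pages(urls, fetch, delay=0.0):
    """Call fetch(url) for each URL, keeping results and errors separate."""
    pages, errors = {}, {}
    for i, url in enumerate(urls):
        if i and delay:
            time.sleep(delay)  # optional pause between consecutive requests
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # record the failure and continue with the next URL
            errors[url] = repr(exc)
    return pages, errors


def fake_fetch(url):
    """Stand-in for a real download-and-parse function (assumption)."""
    if url.endswith("/0"):
        raise ValueError("no such project")
    return "<html>" + url + "</html>"


pages, errors = collect_pages(
    ["https://example.org/1", "https://example.org/0"], fake_fetch
)
print(sorted(pages), sorted(errors))
```

The `errors` mapping makes it easy to see which pages need a second look instead of silently losing them.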


Let's save the parsed data to a CSV file:

In [13]:
df_nfdi.T.to_csv("../../../data/GEPRIS_NFDI_project_pages.csv", index=True, index_label='gepris', encoding='utf-8')
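One caveat when reading this file back: cells that held lists of `(name, URL)` tuples are written as their string representation, so they come back as plain strings. The sketch below shows one way to recover them with `ast.literal_eval`, using only the standard library. The miniature table, its names, and its URLs are placeholders, and `parse_cell` is a hypothetical helper.

```python
import ast
import csv
import io

# Hypothetical miniature of the saved table, written the way a CSV writer
# stringifies list cells.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["gepris", "title", "Spokesperson"])
writer.writerow(["https://example.org/projekt/1", "Demo A",
                 [("Dr. A  Example", "https://example.org/person/1")]])
writer.writerow(["https://example.org/projekt/2", "Demo B", ""])


def parse_cell(value):
    """Turn a stringified list back into a list; empty cells become []."""
    return ast.literal_eval(value) if value.startswith("[") else []


buffer.seek(0)
records = list(csv.DictReader(buffer))
spokespersons = [parse_cell(record["Spokesperson"]) for record in records]
print(spokespersons)
```

With pandas, a function like `parse_cell` could be passed for the relevant columns through the `converters` argument of `pd.read_csv`.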