In [1]:
# Import usual Python and data handling stuff
import numpy as np
import pandas as pd

In [2]:
# Not yet used - No graph :-(
#import matplotlib.pyplot as plt
#%matplotlib inline

# Analysis of the participations in EU Framework Programmes

This analysis covers participation in the EU Framework Programmes for Research and Innovation between 2007 and 2017, a period spanning FP7 (2007-2013) and Horizon 2020 (since 2014).

## FP7 datasets: 2007-2013

### Set up general coding information from Cordis

Cordis has several dictionaries for coded information. The following datasets are created on this basis and will be used to "decode" the information provided in the projects and organisations datasets for FP7.

In [3]:
# Symbols and names of countries in several languages
country_file = "FP7/cordisref-countries.xls"
countries = pd.read_excel(
    country_file,
    sheet_name="cordisref-countries",
    header=0,
)

In [4]:
# Symbols and names of FP7 sub-programmes in several languages
fp7_programmes_file = "FP7/cordisref-FP7programmes.xls"
fp7_programmes = pd.read_excel(
    fp7_programmes_file,
    sheet_name="Hoja1",
    header=0,
)

In [5]:
# Full names of funding schemes
fp7_schemes_file = "FP7/cordisref-projectFundingSchemeCategory.xls"
fp7_schemes = pd.read_excel(
    fp7_schemes_file,
    sheet_name="cordisref-projectFundingSchemeC",
    header=0,
)
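
As a minimal sketch of how such a reference table can "decode" a coded column, the toy example below maps country codes to English names. The column names `code`, `language` and `name`, as well as the sample rows, are assumptions made for illustration; the actual Cordis reference files may use different headers.

```python
import pandas as pd

# Hypothetical reference table: one row per (code, language) pair.
countries_ref = pd.DataFrame({
    "code": ["BE", "BE", "FR", "FR"],
    "language": ["en", "fr", "en", "fr"],
    "name": ["Belgium", "Belgique", "France", "France"],
})

# Hypothetical participations with a coded country column.
participations = pd.DataFrame({
    "organizationName": ["Org A", "Org B", "Org C"],
    "country": ["FR", "BE", "XX"],  # "XX" has no entry in the reference table
})

# Keep a single language so that each code maps to exactly one name.
english = countries_ref.loc[countries_ref["language"] == "en"]
code_to_name = english.set_index("code")["name"]

# Series.map leaves NaN wherever a code is missing from the lookup.
participations["countryName"] = participations["country"].map(code_to_name)
```

Filtering to one language first matters here: with several names per code, the lookup index would not be unique and the mapping would fail.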

### FP7 projects and organisations
The data for projects and participations in FP7 are available on the Open Data Portal of the European Union:

https://data.europa.eu/euodp/en/data/dataset/cordisfp7projects

Two datasets will be created:
* `fp7_proj`: with the descriptors of the FP7 funded projects
* `fp7_part`: with the descriptions of the participating organisations

These two datasets will be merged into a single `fp7` dataset in which the project information is repeated for each participation.

In [6]:
%%time

fp7_proj_file = "FP7/cordis-fp7projects.xlsx"
fp7_part_file = "FP7/cordis-fp7organizations.xlsx"

fp7_proj = pd.read_excel(
    fp7_proj_file, sheet_name="cordis-fp7projects",
    header=0,
)
fp7_part = pd.read_excel(
    fp7_part_file,
    sheet_name="cordis-h2020organizations",  # sic: the sheet of the FP7 file is mislabelled in the Cordis dataset
    header=0,
)

CPU times: user 23.3 s, sys: 145 ms, total: 23.5 s
Wall time: 23.5 s


In [7]:
# We rename some columns that appear in both datasets:
# In projects:
fp7_proj = fp7_proj.rename(
    index=str,
    columns={
        "id": "projectID",
        "rcn": "projectRCN",
        "acronym": "projectAcronym"
    }
)
# In participations:
fp7_part = fp7_part.rename(
    index=str,
    columns={
        "id": "organizationID",
        "projectRcn": "projectRCN",
        "name": "organizationName",
    }
)

So, let's merge these two datasets and create the `fp7` dataset.

In [8]:
fp7 = fp7_proj.merge(
    fp7_part,
    on=["projectRCN","projectID","projectAcronym"]
)

In [9]:
fp7_proj.shape, fp7_part.shape, fp7.shape

((25778, 21), (146021, 23), (146021, 41))
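
The shapes are consistent with a one-to-many merge: every participation row is kept, and the column count is 21 + 23 minus the 3 shared key columns, i.e. 41. The sketch below reproduces this behaviour on invented toy data; the `title` column and all values are assumptions for illustration. The `validate` argument is an optional safeguard that is not used in the cell above.

```python
import pandas as pd

# Toy projects table: one row per project.
projects = pd.DataFrame({
    "projectRCN": [1, 2],
    "projectID": [100, 200],
    "projectAcronym": ["ALPHA", "BETA"],
    "title": ["First project", "Second project"],
})

# Toy participations table: several rows per project.
participants = pd.DataFrame({
    "projectRCN": [1, 1, 2],
    "projectID": [100, 100, 200],
    "projectAcronym": ["ALPHA", "ALPHA", "BETA"],
    "organizationName": ["Org A", "Org B", "Org C"],
})

merged = projects.merge(
    participants,
    on=["projectRCN", "projectID", "projectAcronym"],
    how="inner",
    validate="one_to_many",  # raises MergeError if a project key is duplicated
)

# 3 participation rows; 4 + 4 columns minus the 3 shared keys = 5 columns.
print(merged.shape)
```

The project `title` is repeated on each participation row of the same project, which is the layout described for the `fp7` dataset.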

## Horizon 2020 datasets: 2014-2020

### Set up general coding information from Cordis

Cordis has several dictionaries for coded information. The following datasets are created on this basis and will be used to "decode" the information provided in the projects and organisations datasets for FP7 and Horizon 2020.

In [10]:
# Symbols and names of Horizon 2020 sub-programmes in several languages
h2020_programmes_file = "Horizon 2020/cordisref-H2020programmes.xls"
h2020_programmes = pd.read_excel(
    h2020_programmes_file,
    sheet_name="Hoja1",
    header=0,
)

In [11]:
# Symbols and names of Horizon 2020 topics in several languages
h2020_topics_file = "Horizon 2020/cordisref-H2020topics.xlsx"
h2020_topics = pd.read_excel(
    h2020_topics_file,
    sheet_name="cordisref-H2020topics",
    header=0,
)

In [12]:
# Symbols and names of Horizon 2020 research topics
h2020_sic_file = "Horizon 2020/cordisref-sicCode.xls"
h2020_sic = pd.read_excel(
    h2020_sic_file,
    sheet_name="cordisref-sicCode",
    header=0,
)

### Horizon 2020 projects and organisations
The data for projects and participations in Horizon 2020 are available on the Open Data Portal of the European Union:

https://data.europa.eu/euodp/en/data/dataset/cordisH2020projects

The cut-off date is: 2017-10-12

Two datasets will be created:
* `h2020_proj`: with the descriptors of the Horizon 2020 funded projects
* `h2020_part`: with the descriptions of the participating organisations

These two datasets will be merged into a single `h2020` dataset in which the project information is repeated for each participation.


In [13]:
%%time

h2020_proj_file = "Horizon 2020/cordis-h2020projects.csv"
h2020_part_file = "Horizon 2020/cordis-h2020organizations.xlsx"

h2020_proj = pd.read_csv(
    h2020_proj_file,
    sep=";",
    header=0,
)
h2020_part = pd.read_excel(
    h2020_part_file,
    sheet_name="organisation",
    header=0,
)

CPU times: user 6.83 s, sys: 24 ms, total: 6.85 s
Wall time: 6.85 s


In [14]:
# We rename some columns that appear in both datasets:
# In projects:
h2020_proj = h2020_proj.rename(
    index=str,
    columns={
        "id": "projectID",
        "rcn": "projectRCN",
        "acronym": "projectAcronym"
    }
)
# In participations:
h2020_part = h2020_part.rename(
    index=str,
    columns={
        "id": "organizationID",
        "projectRcn": "projectRCN",
        "name": "organizationName",
    }
)

So, let's merge these two datasets and create the `h2020` dataset.

In [15]:
h2020 = h2020_proj.merge(
    h2020_part,
    on=["projectRCN", "projectID"]
)

In [16]:
h2020_proj.shape, h2020_part.shape, h2020.shape

((14837, 21), (71312, 23), (62506, 42))

Some projects have the same RCN and ID in both datasets but different project acronyms, so the merge above uses only the RCN and ID as keys. Let's find out which ones they are:

In [17]:
h2020.loc[
    h2020.projectAcronym_x.ne(h2020.projectAcronym_y),
    ["projectRCN","projectID","topics","projectAcronym_x","projectAcronym_y"]
]

       projectRCN  projectID        topics projectAcronym_x projectAcronym_y
31485      208306     724846  ERC-2016-COG              321              321
31486      208306     724846  ERC-2016-COG              321              321
56230      194607     649660    EE-10-2014     Save at Work     Save at Work
56231      194607     649660    EE-10-2014     Save at Work     Save at Work
56232      194607     649660    EE-10-2014     Save at Work     Save at Work
56233      194607     649660    EE-10-2014     Save at Work     Save at Work
56234      194607     649660    EE-10-2014     Save at Work     Save at Work
56235      194607     649660    EE-10-2014     Save at Work     Save at Work
56236      194607     649660    EE-10-2014     Save at Work     Save at Work
56237      194607     649660    EE-10-2014     Save at Work     Save at Work
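
The acronyms listed above look identical when printed, yet they compare as unequal. One possible explanation, which the data shown here does not confirm, is a difference in type (a number versus its string form) or in surrounding whitespace. The sketch below shows how such a hypothesis could be tested on a toy frame; all values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the merged frame; the values are invented.
toy = pd.DataFrame({
    "projectAcronym_x": [321, "Save at Work", "ALPHA"],
    "projectAcronym_y": ["321", "Save at Work ", "ALPHA"],
})

# Raw comparison: an integer never equals its string form, and trailing
# whitespace makes two otherwise identical strings differ.
raw_mismatch = toy["projectAcronym_x"].ne(toy["projectAcronym_y"])

# Normalised comparison: cast to string and strip surrounding whitespace.
left = toy["projectAcronym_x"].astype(str).str.strip()
right = toy["projectAcronym_y"].astype(str).str.strip()
normalised_mismatch = left.ne(right)

print(raw_mismatch.sum(), normalised_mismatch.sum())
```

If the normalised comparison reports no mismatch on the real data, the acronym could be cleaned in both datasets and used as a merge key again.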


The merge shows that not every participation record found a matching project. Indeed, 8806 rows of the `h2020_part` dataset carry a project record number (`projectRCN`) that is _not_ present in the `h2020_proj` dataset, which is exactly the difference between the sizes of `h2020_part` (71312 rows) and `h2020` (62506 rows).

In [18]:
rcn_in_proj = h2020_proj["projectRCN"].unique()
missing_rcn = h2020_part.loc[~h2020_part["projectRCN"].isin(rcn_in_proj),"projectRCN"]

In [19]:
len(missing_rcn) + h2020.shape[0] == h2020_part.shape[0]

True
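
An alternative way to isolate the unmatched participations is a left merge with `indicator=True`, which also makes it easy to tell unmatched rows apart from unmatched distinct projects. This is a minimal sketch on invented toy data, not the check used above.

```python
import pandas as pd

# Toy data: project 3 has participations but no project record.
proj = pd.DataFrame({"projectRCN": [1, 2], "title": ["P1", "P2"]})
part = pd.DataFrame({
    "projectRCN": [1, 1, 2, 3, 3],
    "organizationName": ["A", "B", "C", "D", "E"],
})

# Left merge from the participations side; the `_merge` column records
# whether each row was found in both tables or only in the left one.
check = part.merge(proj, on="projectRCN", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only"]

n_unmatched_rows = len(unmatched)                    # participation rows
n_unmatched_rcn = unmatched["projectRCN"].nunique()  # distinct projects

print(n_unmatched_rows, n_unmatched_rcn)
```

In the toy data, two participation rows are unmatched but they belong to a single project, so the two counts differ.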

## Analysis

This part is the general analysis of the FP7 and Horizon 2020 participations extracted from the Cordis project files (`fp7` and `h2020` datasets).

We first check which column headers of the `h2020` dataset are absent from the `fp7` dataset.

In [20]:
h2020.columns.difference(fp7.columns)

Index(['projectAcronym_x', 'projectAcronym_y'], dtype='object')
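
`Index.difference` is one-directional: it only reports columns of `h2020` that `fp7` lacks. Since `fp7` has 41 columns and `h2020` has 42, the reverse check is also worth running. The sketch below compares both directions on two empty toy frames whose column names are chosen for illustration.

```python
import pandas as pd

# Toy frames that only carry column headers.
a = pd.DataFrame(columns=["projectRCN", "projectID", "projectAcronym", "title"])
b = pd.DataFrame(columns=[
    "projectRCN", "projectID", "projectAcronym_x", "projectAcronym_y", "title",
])

only_in_b = b.columns.difference(a.columns)         # columns b has, a lacks
only_in_a = a.columns.difference(b.columns)         # columns a has, b lacks
either = a.columns.symmetric_difference(b.columns)  # columns in exactly one frame

print(list(only_in_b), list(only_in_a))
```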