# ORCiD example queries

The following examples show how to use the openly available BigQuery dataset at: ds-open-datasets.orcid.summaries_2024

Further documentation on the ORCiD schema, along with how to get connected to BigQuery, can be found at: https://docs.dimensions.ai/bigquery/


In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
from google.cloud import bigquery
from google.cloud.bigquery import magics

project_id = "ds-consultancy-gbq"  # update as needed, e.g. ds-data-team
magics.context.project = project_id  # project used by the %%bigquery cells
bq_params = {}

client = bigquery.Client(project=project_id)
%load_ext google.cloud.bigquery

**Before we go further, a quick warning: in BigQuery, don't use "select *" to explore a dataset. It will be expensive. Select only the columns that you need.**
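
One way to see what a query would cost before running it is a dry run, which reports the bytes that would be scanned without executing anything. The sketch below is an illustration only: `format_bytes` and `estimate_query_bytes` are hypothetical helper names, and the second function assumes the `google-cloud-bigquery` client library and valid credentials are available.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count in binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_query_bytes(sql: str, project_id: str) -> int:
    """Dry-run `sql` and return the bytes BigQuery reports it would process."""
    # Imported lazily so format_bytes can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    # dry_run=True validates the query and estimates its size without running it.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

In the notebook this could be called as `format_bytes(estimate_query_bytes(sql, project_id))`, once with `select *` and once with a single column, to compare the two estimates.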

# Exploring the ORCiD dataset on Google BigQuery

In this notebook, we break down the ORCiD schema in BigQuery and demonstrate how to query each section.

* [counting orcid identifiers](#orcid_identifier)
* [querying person data](#person)
  * [biography details](#person.biography)
  * [emails](#person.emails)
  * [addresses](#person.addresses)
  * [keywords](#person.keywords)
  * [external identifiers](#person.external_identifiers)
* [activities](#activities)
  * [education](#activities.educations)
  * [employments](#activities.employments)
  * [funding](#activities.fundings)
  * [peer reviews](#activities.peer_reviews)
  * [works](#activities.works)
  * [invited positions](#activities.invited_position)
  * [memberships](#activities.memberships)
  * [qualifications](#activities.qualifications)
  * [services](#activities.services)
  * [research resources](#activities.research_resources)
  

## <a name="orcid_identifier"> Querying orcid identifiers</a>

How many active orcids do we have?

In [None]:
%%bigquery

select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null

Unnamed: 0,f0_
0,21046010


Example: How many orcid records have a publicly verified email?

In [None]:

%%bigquery

select count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   and history.verified_email is True

Unnamed: 0,f0_
0,16623231
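
The two counts above can be combined into a share. A minimal sketch, using only the numbers copied from the query outputs in this section:

```python
# Counts copied from the two query outputs above.
active_records = 21_046_010
active_with_verified_email = 16_623_231

# Fraction of active records whose history.verified_email flag is true.
verified_share = active_with_verified_email / active_records
print(f"{verified_share:.1%} of active records have a verified email")
```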


How many orcids have been created by year?

In [None]:
%%bigquery

select extract(YEAR FROM timestamp(history.submission_date)) year, count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   group by 1
   order by 1


Unnamed: 0,year,f0_
0,2012,43695
1,2013,420762
2,2014,584072
3,2015,731309
4,2016,1021238
5,2017,1343595
6,2018,1546783
7,2019,1959241
8,2020,2598695
9,2021,2659905
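
The yearly counts are easier to read as a running total. The sketch below accumulates the rows shown in the output above (2012 to 2021 only, as displayed); it is an illustration in plain Python rather than part of the original query.

```python
from itertools import accumulate

# (year, records created) pairs copied from the query output above.
created_per_year = [
    (2012, 43_695),
    (2013, 420_762),
    (2014, 584_072),
    (2015, 731_309),
    (2016, 1_021_238),
    (2017, 1_343_595),
    (2018, 1_546_783),
    (2019, 1_959_241),
    (2020, 2_598_695),
    (2021, 2_659_905),
]

years = [year for year, _ in created_per_year]
# Running total of active records created up to and including each year.
running_totals = list(accumulate(count for _, count in created_per_year))

for year, total in zip(years, running_totals):
    print(year, f"{total:,}")
```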


How many orcids were last modified in 2024 (only up to the date of the snapshot)?

In [None]:
%%bigquery

select  count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
   where history.deactivation_date is null
   and extract(YEAR FROM timestamp(history.last_modified_date)) = 2024


Unnamed: 0,f0_
0,7471503


## <a name="person">person</a>

How many orcid names are public?

In [None]:
%%bigquery
# How many orcid names are public?

select  person.name.visibility, count(orcid_identifier)
from ds-open-datasets.orcid.summaries_2024
group by 1

Unnamed: 0,visibility,f0_
0,,47693
1,public,21023410


How many orcid records have other names?

In [None]:
%%bigquery
# How many orcids have other names?
 select count(orcid_identifier)
 from
 ds-open-datasets.orcid.summaries_2024
 where person.other_names is not null



Unnamed: 0,f0_
0,742479


How many orcids have a populated credit name?

In [None]:
%%bigquery
# How many orcids have a populated credit name?
 select count(orcid_identifier)
 from
 ds-open-datasets.orcid.summaries_2024
 where person.name.credit_name is not null

Unnamed: 0,f0_
0,811105


### <a name="person.biography">person.biography</a>

Example: What are the most common researcher URL prefixes (the first 25 characters of each URL)?

In [None]:


%%bigquery
 select substr(url.url,1,25) url_beginning, count(orcid_identifier)
 from
 ds-open-datasets.orcid.summaries_2024,
    unnest(person.researcher_urls.urls) url
group by 1
order by 2 desc

Unnamed: 0,url_beginning,f0_
0,https://www.linkedin.com/,238827
1,https://www.researchgate.,150454
2,https://scholar.google.co,125628
3,https://www.facebook.com/,40467
4,http://www.linkedin.com/i,35762
...,...,...
617144,https://solangecoutinho.a,1
617145,https://github.com/kewien,1
617146,http://xochicuicani.blogs,1
617147,http://proyectoacademico.,1
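
The query above joins each record to its own array of URLs with `unnest`, then groups by a 25-character prefix. The same logic can be sketched in plain Python on a few hypothetical records (the ORCiD iDs and URLs below are invented for illustration and do not come from the dataset):

```python
from collections import Counter

# Hypothetical records shaped like the fields the query reads.
records = [
    {
        "orcid": "0000-0000-0000-0001",
        "urls": [
            "https://www.linkedin.com/in/example-a",
            "https://github.com/example-a",
        ],
    },
    {"orcid": "0000-0000-0000-0002", "urls": ["https://www.linkedin.com/in/example-b"]},
    {"orcid": "0000-0000-0000-0003", "urls": []},
]

# Equivalent of `from table, unnest(urls) url`: one row per (record, url) pair.
# A record with an empty array contributes no rows at all.
rows = [(record["orcid"], url) for record in records for url in record["urls"]]

# Equivalent of `substr(url.url, 1, 25)` followed by group by and count.
prefix_counts = Counter(url[:25] for _, url in rows)

for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```

Because the join drops records with no URLs, the counts describe URLs rather than researchers; a record with two LinkedIn links would be counted twice.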


Example: How many profiles have biographies?

In [None]:
%%bigquery
# How many profiles have biographies?

select count(orcid_identifier)
 from
ds-open-datasets.orcid.summaries_2024
    where person.biography.content is not null


Unnamed: 0,f0_
0,917857


What are some of the most used words in biographies?

In [None]:
%%bigquery
# What are some of the most used words in biographies?

with bio_tokens as

(select orcid_identifier.path orcid, split(person.biography.content,' ') tokens
 from
 ds-open-datasets.orcid.summaries_2024
    where person.biography.content is not null
    #limit 1000
    ),

tdif as (
SELECT orcid, TF_IDF(tokens, 10000, 20) OVER() AS results
FROM bio_tokens
ORDER BY orcid)

select token.index, count(token.value)
  from tdif, unnest(results) as token
  where token.index is not null
  and token.value > .7
  group by 1
  order by 2 desc
  #limit 200
;


Unnamed: 0,index,f0_
0,,3210
1,DE,1379
2,Estudiante,1149
3,Student,1087
4,EN,967
...,...,...
7703,Austrian,1
7704,"use,",1
7705,"Elsevier,",1
7706,supervise,1
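
The output above is topped by an empty token and includes entries such as "use," and "Elsevier,", which is what splitting on a single space would be expected to produce. The sketch below reproduces that behaviour on a hypothetical biography string and shows one possible cleanup; it is an illustration in Python, not the tokenisation used by the query.

```python
import re

# Hypothetical biography text with a double space and trailing punctuation.
bio = "Researcher at Elsevier,  interested in data use, and open science."

# Same idea as split(content, ' '): consecutive spaces yield empty tokens,
# and punctuation stays attached to the neighbouring word.
naive_tokens = bio.split(" ")

# One possible cleanup: lower-case the text and keep runs of word characters.
clean_tokens = re.findall(r"\w+", bio.lower())

print(naive_tokens)
print(clean_tokens)
```

In SQL, a similar cleanup could be attempted with `regexp_extract_all` in place of `split`, at the cost of changing the counts shown above.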


### <a name="person.emails">person.emails</a>

Which emails are available in ORCiD? A sample of 10:

In [None]:
%%bigquery
# emails: a sample of 10

select orcid_identifier.path, email.email
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.emails.emails) email
limit 10

Unnamed: 0,path,email
0,0000-0002-1456-9818,songwy19@mails.tsinghua.edu.cn
1,0000-0003-3404-1719,said.saifullah@northsouth.edu
2,0000-0002-5085-7868,domansarkar@gmail.com
3,0000-0001-8582-1243,fisiolorraine@gmail.com
4,0000-0002-2492-0173,soumendughosh25@gmail.com
5,0000-0002-2492-0173,soumendughosh@nitm.ac.in
6,0000-0001-7879-0485,daniela.bermudez@upch.pe
7,0000-0003-4547-1685,yue.149@osu.edu
8,0000-0003-4547-1685,tommy96@whu.edu.cn
9,0000-0002-4844-6437,Kate.weeks@unimelb.edu.au


These emails are those that have been made public.

In [None]:
%%bigquery
# emails: how many are made public?

select email.visibility, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.emails.emails) email
group by 1

Unnamed: 0,visibility,f0_
0,public,729430


### <a name="person.addresses">person.addresses</a>

How many public addresses are there?

In [None]:
%%bigquery
# How many public addresses are there?

select address.visibility, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.addresses.addresses) address
group by 1

Unnamed: 0,visibility,f0_
0,public,2855634


How many public addresses are there, from which countries?

In [None]:
%%bigquery
# How many public addresses are there, from which countries?

select address.country, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.addresses.addresses) address
group by 1
order by 2 desc

Unnamed: 0,country,f0_
0,BR,273644
1,CN,232258
2,US,211547
3,IN,172261
4,ES,102482
...,...,...
245,NU,6
246,CC,6
247,BV,5
248,TK,4


### <a name="person.keywords">person.keywords</a>

What are the most frequent keywords in the ORCID dataset?

In [None]:
%%bigquery
# Most frequent keywords?

select lower(keyword.content), count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.keywords.keywords) keyword
group by 1
order by 2 desc

Unnamed: 0,f0_,f1_
0,machine learning,19257
1,artificial intelligence,10382
2,deep learning,7789
3,bioinformatics,7404
4,education,6359
...,...,...
1189703,b-value,1
1189704,computational structural bioinformatics,1
1189705,"design, product design, tool design, mechanica...",1
1189706,mitochondria-targeted therapeutic strategies,1
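
The tail of the output above suggests that some keyword entries hold several comma-separated terms in a single value, so they are counted as one long keyword. A possible refinement is to split each entry before counting. The sketch below shows the idea in Python on hypothetical entries; the choice of `,` and `;` as separators is an assumption, not something the schema guarantees.

```python
import re
from collections import Counter

# Hypothetical keyword entries, including multi-valued ones.
keyword_entries = [
    "Machine Learning",
    "machine learning",
    "design, product design, tool design",
    "Bioinformatics; Machine learning",
]


def split_keywords(entry: str) -> list[str]:
    """Split one entry on commas or semicolons and normalise each part."""
    parts = re.split(r"[,;]", entry)
    return [part.strip().lower() for part in parts if part.strip()]


keyword_counts = Counter(
    keyword for entry in keyword_entries for keyword in split_keywords(entry)
)

for keyword, count in keyword_counts.most_common(3):
    print(keyword, count)
```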


### <a name="person.external_identifiers">person.external_identifiers</a>

What are the most common identifier types in ORCiD?

In [None]:

%%bigquery

select lower(identifier.type), count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(person.external_identifiers.identifiers) identifier
group by 1
order by 2 desc

Unnamed: 0,f0_,f1_
0,scopus author id,1791459
1,researcherid,700225
2,sciprofiles,325959
3,loop profile,261417
4,ciência id,67820
5,researcher name resolver id,14228
6,gnd,9692
7,中国科学家在线,5457
8,isni,4227
9,pitt id,3400


## <a name="activities">activities</a>

### <a name="activities.educations">activities.educations</a>


What are the most common role titles on education records starting in 2024?

In [None]:
%%bigquery

select record.role_title, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.educations.groups) grp,
   unnest(grp.records) record
   where start_date.year = "2024"
group by 1
order by 2 desc

Unnamed: 0,role_title,f0_
0,PhD,2118
1,,1768
2,Mestrado,753
3,Doctor of Philosophy,666
4,Doutorado,583
...,...,...
12989,Doktor Öğretim Üyesi,1
12990,Clinical & Health Psychology,1
12991,MPhil Pharmaceutical Microbiology,1
12992,Bacharelado em Sistemas de Informação,1


### <a name="activities.employments">activities.employments</a>

How many employments starting in 2024 have a disambiguated organization, and from which source?

In [None]:
%%bigquery
# How many employments starting in 2024 have a disambiguated organization?

select organization.disambiguated_organization.source, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.employments.groups) grp,
   unnest(grp.records) record
   where start_date.year = "2024"
group by 1
order by 2 desc

Unnamed: 0,source,f0_
0,ROR,228862
1,,30104
2,RINGGOLD,2865
3,FUNDREF,1904
4,GRID,576


### <a name="activities.fundings">activities.fundings</a>

Which disambiguation sources are used for funding organizations on fundings starting in 2024?

In [None]:
%%bigquery

select organization.disambiguated_organization.source, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.fundings.groups) grp,
   unnest(grp.records) record
   where start_date.year = "2024"
group by 1
order by 2 desc

Unnamed: 0,source,f0_
0,ROR,7347
1,,6914
2,FUNDREF,6152
3,RINGGOLD,255
4,GRID,15


### <a name="activities.peer_reviews">activities.peer_reviews</a>

Charting peer reviews in ORCiD: what are the most common sources (convening organizations)?

In [None]:
%%bigquery
# Charting peer reviews in ORCiD

select record.reviewer_role, record.review_type,  record.convening_organization.name, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
    unnest(activities.peer_reviews.groups) grp,
    unnest(grp.groups) grpp,
    unnest(grpp.records) record
 #  where start_date.year = "2024"
group by 1,2,3
order by 4 desc

Unnamed: 0,reviewer_role,review_type,name,f0_
0,reviewer,review,Clarivate PLC,3232053
1,reviewer,review,Publons,3117297
2,reviewer,review,"Elsevier, Inc.",2660331
3,reviewer,review,Springer Nature,2261495
4,reviewer,review,American Chemical Society,1382847
...,...,...,...,...
142,editor,evaluation,Journal of hydrology,1
143,reviewer,review,"Programmes de bioéthique, École de santé publi...",1
144,reviewer,review,Museu de Ciências Naturais/SEMA-RS,1
145,reviewer,review,Coventry University,1
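
Peer reviews are nested one level deeper than the other activities: the query unnests `groups`, then the `groups` inside each group, and only then `records`. The sketch below walks the same path in Python over a small hypothetical structure; only the field names are taken from the query, the values are invented.

```python
from collections import Counter

# Hypothetical activities struct mirroring the path used in the query:
# activities.peer_reviews.groups[].groups[].records[]
activities = {
    "peer_reviews": {
        "groups": [
            {
                "groups": [
                    {
                        "records": [
                            {"reviewer_role": "reviewer", "review_type": "review"},
                            {"reviewer_role": "reviewer", "review_type": "review"},
                        ]
                    },
                    {"records": [{"reviewer_role": "editor", "review_type": "evaluation"}]},
                ]
            },
            {"groups": []},  # an outer group with nothing inside yields no rows
        ]
    }
}


def iter_peer_review_records(activities: dict):
    """Yield every record, following the three nested arrays in order."""
    for outer_group in activities["peer_reviews"]["groups"]:
        for inner_group in outer_group["groups"]:
            yield from inner_group["records"]


role_type_counts = Counter(
    (record["reviewer_role"], record["review_type"])
    for record in iter_peer_review_records(activities)
)
print(role_type_counts)
```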


### <a name="activities.works">activities.works</a>

In [None]:
%%bigquery

select identifier.type, identifier.relationship, count(identifier.value), count(distinct identifier.value)
from ds-open-datasets.orcid.summaries_2024,
    unnest(activities.works.groups) grp,
    unnest(grp.external_ids.identifiers) identifier
 #  where start_date.year = "2024"
group by 1,2
order by 3 desc
limit 40

Unnamed: 0,type,relationship,f0_,f1_
0,doi,self,77717419,42486133
1,eid,self,40109044,25228992
2,wosuid,self,13081783,10284413
3,source-work-id,self,10843236,8947762
4,pmid,self,8324386,5712657
5,other-id,self,2997650,1336464
6,pmc,self,2545425,1765676
7,arxiv,self,2078254,373735
8,handle,self,1552755,1350702
9,isbn,self,1031978,776139
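
The two count columns above differ because the same identifier value can appear on many records. Dividing the total by the distinct count gives the average number of occurrences per identifier. A minimal sketch using the rows shown in the output above:

```python
# type -> (count(identifier.value), count(distinct identifier.value)),
# copied from the query output above.
works_identifiers = {
    "doi": (77_717_419, 42_486_133),
    "eid": (40_109_044, 25_228_992),
    "wosuid": (13_081_783, 10_284_413),
    "source-work-id": (10_843_236, 8_947_762),
    "pmid": (8_324_386, 5_712_657),
    "other-id": (2_997_650, 1_336_464),
    "pmc": (2_545_425, 1_765_676),
    "arxiv": (2_078_254, 373_735),
    "handle": (1_552_755, 1_350_702),
    "isbn": (1_031_978, 776_139),
}

# Average number of occurrences of each distinct identifier value.
occurrences_per_value = {
    id_type: total / distinct
    for id_type, (total, distinct) in works_identifiers.items()
}

for id_type, ratio in sorted(
    occurrences_per_value.items(), key=lambda item: item[1], reverse=True
):
    print(f"{id_type:15s} {ratio:.2f}")
```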


### <a name="activities.invited_position">activities.invited_positions</a>

In [None]:
%%bigquery

select record.role_title, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.invited_positions.records) rec,
   unnest(rec.records) record
 #  where start_date.year = "2024"
group by 1
order by 2 desc

Unnamed: 0,role_title,f0_
0,,27068
1,Visiting Professor,8391
2,Visiting Scholar,4841
3,Member,3984
4,Visiting Researcher,3603
...,...,...
160076,Ph Student,1
160077,Participação em Banca de Doutorado de Fernanda...,1
160078,PROMETEO,1
160079,Head of Global Optimization,1


### <a name="activities.memberships">activities.memberships</a>

In [None]:
%%bigquery

select record.role_title, record.department_name, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.memberships.groups) grp,
   unnest(grp.records) record
 #  where start_date.year = "2024"
group by 1,2
order by 3 desc

Unnamed: 0,role_title,department_name,f0_
0,,,329271
1,Member,,60171
2,Fellow,,12417
3,member,,8819
4,Life Member,,8807
...,...,...,...
257379,Annual,ECE Department,1
257380,Human Rights Study Group,Law,1
257381,redaktor naczelna,Dziennikarstwo,1
257382,Regular/Active (3 years),,1
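
Role titles are free text, so "Member" and "member" appear as separate rows in the output above. The sketch below merges counts that differ only by case or surrounding whitespace, using the four named rows from that output (all with an empty department name). It is an illustration of one normalisation choice, not a step the original query performs.

```python
from collections import Counter

# role_title -> count, copied from rows 1 to 4 of the query output above.
role_counts = {
    "Member": 60_171,
    "Fellow": 12_417,
    "member": 8_819,
    "Life Member": 8_807,
}

# Merge titles that differ only by case or surrounding whitespace.
merged_counts = Counter()
for title, count in role_counts.items():
    merged_counts[title.strip().casefold()] += count

for title, count in merged_counts.most_common():
    print(title, count)
```

The same effect could be sought in SQL by grouping on `lower(record.role_title)`, as the keyword and identifier queries earlier in the notebook already do.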


### <a name="activities.qualifications">activities.qualifications</a>

In [None]:
%%bigquery

select record.role_title, record.department_name, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.qualifications.groups) grp,
   unnest(grp.records) record
 #  where start_date.year = "2024"
group by 1,2
order by 3 desc

Unnamed: 0,role_title,department_name,f0_
0,,,59395
1,PhD,,6531
2,MBBS,,2045
3,Investigador RENACYT,Nivel VII,1718
4,Investigador RENACYT,Grupo María Rostworowski - Nivel I,1422
...,...,...,...
825492,Processo Legislativo Constitucional: Teoria e ...,,1
825493,"Estudos Sobre ""O Capital""",,1
825494,Master of Computer Science with a Specializati...,School Of Engineering and Technology,1
825495,PhD,School of Labor and Human Resources,1


### <a name="activities.services">activities.services</a>

In [None]:
%%bigquery

select record.role_title, record.department_name, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.services.groups) grp,
   unnest(grp.records) record
 #  where start_date.year = "2024"
group by 1,2
order by 3 desc

Unnamed: 0,role_title,department_name,f0_
0,,,26932
1,Member,,5509
2,President,,3442
3,Reviewer,,2999
4,Associate Editor,,1625
...,...,...,...
225404,PhD Committee Member Neuroscience Department,Shelby Towers,1
225405,Editorial Board ('Publications'),,1
225406,SECRETARIO,Unidad Judicial Civil Con Sede En La Parroquia...,1
225407,Postgrad teaching (Master & Doctorate): Protot...,Faculdade de Arquitetura e Urbanismo,1


### <a name="activities.research_resources">activities.research_resources</a>

In [None]:
%%bigquery

select record.proposal.title.title, count(orcid_identifier.path)
from ds-open-datasets.orcid.summaries_2024,
   unnest(activities.research_resources.groups) grp,
   unnest(grp.records) record
 #  where start_date.year = "2024"
group by 1
order by 2 desc

Unnamed: 0,title,f0_
0,Neutron Beam Award at Spallation Neutron Sourc...,858
1,Neutron Beam Award at High Flux Isotope Reacto...,338
2,Neutron Beam Award at High Flux Isotope Reacto...,199
3,Prediction of soil microbiome phenotypic respo...,16
4,Unraveling Redox Transformation Mechanisms of ...,13
...,...,...
1488,Discovering Dynamics of Subduction through the...,1
1489,Large Eddy Simulation of Hypersonic Shock Wave...,1
1490,Studying the Evolution of Atmospheric Aerosol ...,1
1491,Continued improvement of a Deep Learning Neura...,1
