<a href="https://colab.research.google.com/github/MarciaFG/skill-flow/blob/main/Flows_1980_2000_first_level_for.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Academic Mobility Flows using BigQuery**

Author: Marcia R. Ferreira (Complexity Science Hub Vienna & TU Wien)

Date: September 28, 2022

Input: Dimensions database on BigQuery

Output: 

Other notes: 

## Colab Initialization

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Wed Feb  1 19:27:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P0    30W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install required Drivers

In [3]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
#!pip install psutil
#!pip install humanize
#!pip install pynput

# libraries
import psutil
import humanize
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import requests
import torch
import nltk
import GPUtil as GPU

# plotting
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

from google.cloud import bigquery
from google.colab import files
%load_ext google.colab.data_table
%load_ext google.cloud.bigquery

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gputil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... [?25l[?25hdone
  Created wheel for gputil: filename=GPUtil-1.4.0-py3-none-any.whl size=7409 sha256=58c08318bf012febb58fc197f4f377f4ac231b6d3d417213b21b7504054dfadb
  Stored in directory: /root/.cache/pip/wheels/ba/03/bb/7a97840eb54479b328672e15a536e49dc60da200fb21564d53
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


In [4]:
# only one GPU on Colab and isn’t guaranteed
import psutil
import os
import humanize
import GPUtil as GPU

GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ),\
       " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB"\
       .format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Gen RAM Free: 25.7 GB  | Proc size: 472.3 MB
GPU RAM Free: 15109MB | Used: 0MB | Util   0% | Total 15360MB


**Loading data from Google Drive (If needed)**

In [None]:
# run this to upload files
# from google.colab import files
# uploaded = files.upload() 

**Mounting the Google Drive folder**

In [4]:
from google.colab import drive
drive.mount('/content/drive')

# let's test it
with open('/content/drive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat /content/drive/My\ Drive/foo.txt

Mounted at /content/drive
Hello Google Drive!

**Runtime credentials**

In [5]:
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


**Declare the Cloud project ID which will be used throughout this notebook**

In [6]:
# declare your project 
project_id = "cshdimensionstest"

# **PART I**

In [None]:
# test to see if it is working correctly
# set up parameters eg for a specific journal
bq_params = {}
bq_params["journal_id"] = "jour.1115214"

In [None]:
# test to see if it is working correctly
%%bigquery --params $bq_params --project $project_id 

select distinct 
  journal.id, journal.title, journal.issn, journal.eissn, publisher.name, date_inserted
from `dimensions-ai.data_analytics.publications` 
where  journal.id = @journal_id
and publisher is not null
order by date_inserted desc
limit 1

**Ok! it works let's start!**

## 1.0 Load Data from GBQ

#### *Basic Table*

In [None]:
# Constructing the mobility flows intermediary table for the FOR categorization
%%bigquery --project $project_id 

#create or replace table cshdimensionstest.test.disambiguated_authors and corresponding publications
CREATE OR REPLACE TABLE cshdimensionstest.test.basic_1980_2000 AS 

SELECT p.id, researcher_ids, research_orgs, category_for.code, p.year
FROM `dimensions-ai.data_analytics.publications` p
    , unnest(category_for.first_level.full) category_for
    , unnest(researcher_ids) researcher_ids
    , unnest(research_orgs) research_orgs
    JOIN `dimensions-ai.data_analytics.researchers` r 
    ON r.id=p.researcher_ids
WHERE researcher_ids IS NOT NULL 
  AND research_orgs IS NOT NULL
  AND category_for IS NOT NULL > its best to allow for null values here
  AND p.year BETWEEN 1980 AND 2000
  AND first_publication >= 1980
ORDER BY p.id, researcher_ids, research_orgs

-- this gives us the publications with disambiguated researchers ids
-- AND  the pubs with authors that have affiliation linkages
-- AND the pubs with author-aff links that have an FOR category associated
-- AND between 1980 and 2000
-- This will be our basic table

Executing query with job ID: a7144245-cfce-42b6-8df6-b22b6334350d
Query executing: 0.41s


ERROR:
 400 Syntax error: Expression to the left of comparison must be parenthesized at [13:32]

Location: US
Job ID: a7144245-cfce-42b6-8df6-b22b6334350d



In [None]:
%%bigquery --project $project_id
-- let's have a look
SELECT  * FROM cshdimensionstest.test.basic_1980_2000 limit 2;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,id,researcher_ids,research_orgs,code,year,is_multi_affiliation
0,pub.1134443158,ur.07412113215.62,grid.267468.9,39,1980,1
1,pub.1006946147,ur.015375262574.30,grid.257410.5,47,1980,1


In [None]:
%%bigquery --project $project_id
-- count the number of FOR categories per publication 
SELECT id, COUNT (DISTINCT code) N_codes FROM cshdimensionstest.test.basic_1980_2000 GROUP BY id order by N_CODES DESC LIMIT 5;
-- A publication can have up to 5 codes
SELECT COUNT(*) FROM cshdimensionstest.test.basic_1980_2000;
-- 69534837 total rows


In [None]:
%%bigquery --project $project_id
SELECT COUNT(distinct researcher_ids) FROM cshdimensionstest.test.basic_1980_2000;
-- 4510186

### *Multiple Affiliations*

* Identify multiple affiliations in the table and update the basic table



In [None]:
%%bigquery --project $project_id 
# indicate whether an author-affiliation is shared  or not
CREATE OR REPLACE TABLE cshdimensionstest.test.multi_affiliations AS

SELECT DISTINCT p.id, p.researcher_ids, p.research_orgs, p.year, s.is_multi_affiliation, s.aff_w
FROM cshdimensionstest.test.basic_1980_2000 p
 JOIN 
    (
    SELECT id, researcher_ids, COUNT(DISTINCT research_orgs) as aff_w, CASE WHEN COUNT(DISTINCT research_orgs)  > 1 Then 1 Else 0 END is_multi_affiliation
    FROM cshdimensionstest.test.basic_1980_2000
    GROUP BY id, researcher_ids
    ) s
  ON p.id=s.id and p.researcher_ids=s.researcher_ids;

CREATE OR REPLACE TABLE cshdimensionstest.test.basic_1980_2000 AS SELECT * FROM cshdimensionstest.test.multi_affiliations;

SELECT * FROM cshdimensionstest.test.basic_1980_2000 order by id, researcher_ids, research_orgs LIMIT 20;

DROP TABLE IF EXISTS cshdimensionstest.test.multi_affiliations;

In [None]:
%%bigquery --project $project_id 

SELECT * FROM cshdimensionstest.test.basic_1980_2000 order by id, researcher_ids, research_orgs LIMIT 20

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,id,researcher_ids,research_orgs,year,is_multi_affiliation,aff_w
0,pub.1000000002,ur.01042225174.03,grid.129553.9,1999,1,2
1,pub.1000000002,ur.01042225174.03,grid.46078.3d,1999,1,2
2,pub.1000000002,ur.01241053402.31,grid.129553.9,1999,1,2
3,pub.1000000002,ur.01241053402.31,grid.46078.3d,1999,1,2
4,pub.1000000009,ur.010314133322.78,grid.14335.30,1995,1,4
5,pub.1000000009,ur.010314133322.78,grid.410335.0,1995,1,4
6,pub.1000000009,ur.010314133322.78,grid.7362.0,1995,1,4
7,pub.1000000009,ur.010314133322.78,grid.9026.d,1995,1,4
8,pub.1000000009,ur.01311727131.02,grid.14335.30,1995,1,4
9,pub.1000000009,ur.01311727131.02,grid.410335.0,1995,1,4


### *Repeatable To Dos*

*   To create the basic tables for all years copy the code above and rerun it for the next few decades
*   Note that the tables need to have overlapping years otherwise it will not be possible to capture the transitions at year ceilings



# **PART II**

## 2.0 Researcher Trajectories

#### *Time Sequences*

In [None]:
%%bigquery --project $project_id 

# step (1): give a row number to the years in the order
 create or replace table cshdimensionstest.test.sequence_1980_2000 as 
  select distinct researcher_ids, 
    year, 
    dense_rank() over (
      partition by researcher_ids 
      order by 
        year asc
    ) as t 
  from `cshdimensionstest.test.basic_1980_2000`
  order by 
    researcher_ids, 
    year, 
    t
#DROP TABLE IF EXISTS cshdimensionstest.test.sequence_00_02;

In [None]:
%%bigquery --project $project_id 
SELECT * FROM cshdimensionstest.test.sequence_1980_2000 where researcher_ids = 'ur.011460612366.60' order by t;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,year,t
0,ur.011460612366.60,1980,1
1,ur.011460612366.60,1981,2
2,ur.011460612366.60,1982,3
3,ur.011460612366.60,1983,4
4,ur.011460612366.60,1985,5
5,ur.011460612366.60,1986,6
6,ur.011460612366.60,1987,7
7,ur.011460612366.60,1988,8
8,ur.011460612366.60,1989,9
9,ur.011460612366.60,1990,10


#### *Affiliation Weights*

In [None]:
%%bigquery --project $project_id 
# step (2)
# generating affiliation weights if the author has had more than one affiliation simultaneously
 create or replace table cshdimensionstest.test.affweight_1980_2000 as 
  select 
    distinct researcher_ids, 
    id, 
    1 * 1.0 / count(distinct research_orgs) as aff_weight 
  from 
    `cshdimensionstest.test.basic_1980_2000`
  group by 
    researcher_ids, 
    id
  order by researcher_ids, id;

drop table if exists cshdimensionstest.test.affweight_00_02;

In [None]:
%%bigquery --project $project_id 
SELECT * FROM cshdimensionstest.test.affweight_1980_2000 order by researcher_ids, id limit 10;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,id,aff_weight
0,ur.010000001271.33,pub.1039310092,1.0
1,ur.010000001341.07,pub.1064721680,0.333333
2,ur.01000000143.58,pub.1007920533,0.5
3,ur.01000000143.58,pub.1055163401,1.0
4,ur.01000000162.06,pub.1017303151,1.0
5,ur.01000000162.06,pub.1040922518,1.0
6,ur.01000000162.06,pub.1082688278,1.0
7,ur.010000001625.53,pub.1000729400,1.0
8,ur.010000001625.53,pub.1020269617,1.0
9,ur.01000000255.40,pub.1010721240,0.5


In [None]:
%%bigquery --project $project_id 
# step (3)
# merging results from steps 1-2
create or replace table cshdimensionstest.test.psequence_weight_1980_2000 as 
  select 
      a.researcher_ids,
      a.id,
    --  a.code,
      a.year,
      a.research_orgs,
      a.is_multi_affiliation,
      b.t,
      c.aff_weight, 
      a.aff_w as n_au_orgs
  from
      `cshdimensionstest.test.basic_1980_2000` as a 
      inner join
         `cshdimensionstest.test.sequence_1980_2000` as b 
         on a.researcher_ids = b.researcher_ids 
         and a.year = b.year 
      inner join
         `cshdimensionstest.test.affweight_1980_2000` as c 
         on c.researcher_ids = a.researcher_ids 
         and c.id = a.id 
  order by
        b.researcher_ids,
        b.year,
        b.t;

  # drop table if exists cshdimensionstest.test.psequence_weight_00_02;

  select * from cshdimensionstest.test.psequence_weight_1980_2000  order by researcher_ids, id, year, t limit 10;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,id,year,research_orgs,is_multi_affiliation,t,aff_weight,n_au_orgs
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,0,1,1.0,1
1,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,1,0.333333,3
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,1,0.333333,3
3,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,1,0.333333,3
4,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,1,0.5,2
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,1,0.5,2
6,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,0,1,1.0,1
7,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,0,1,1.0,1
8,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,0,1,1.0,1
9,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,0,1,1.0,1


#### *First Affiliation*


*   The first affiliation of an author is sometimes in the data table. This has to do with (1) missing author-affiliation linkages and/or (2) lack of field classification codes associated to a publication
*   The column `is_orign` marks whether that institution is the authors' first affiliation in the whole database and not just in the dataset for the overall period 1980-2022



In [None]:
%%bigquery --project $project_id 
# filter the dataset by researchers that started in 1980 or after
# all researchers can still be found in this table cshdimensionstest.test.psequence_weight_00_02
create or replace table cshdimensionstest.test.researchers_after_1980 as
  select p.*
    , first_publication_year
    , case 
        when first_publication_year = year then 1 else 0 
      end is_origin
  from cshdimensionstest.test.psequence_weight_1980_2000 p
  join (
        select distinct researcher_ids, first_publication_year
        from cshdimensionstest.test.psequence_weight_1980_2000 au
        left join dimensions-ai.data_analytics.researchers r on au.researcher_ids=r.id
        where first_publication_year >= 1980
        ) s
    on p.researcher_ids=s.researcher_ids;
    
select * from cshdimensionstest.test.researchers_after_1980 
order by researcher_ids,  year, t limit 10;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,id,year,research_orgs,is_multi_affiliation,t,aff_weight,n_au_orgs,first_publication_year,is_origin
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,0,1,1.0,1,1981,0
1,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,1,0.333333,3,1997,0
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,1,0.333333,3,1997,0
3,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,1,0.333333,3,1997,0
4,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,0,1,1.0,1,2000,1
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,1,0.5,2,2000,1
6,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,1,0.5,2,2000,1
7,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,0,1,1.0,1,1993,1
8,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,0,1,1.0,1,1993,1
9,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,0,1,1.0,1,1993,1


In [None]:
%%bigquery --project $project_id 
# make a list of all origins and researcher_ids combinations in the dataset
# match the origins to the whole trajectory and mark it as 1
create or replace table cshdimensionstest.test.origins as
select distinct researcher_ids, research_orgs, is_origin
from cshdimensionstest.test.researchers_after_1980
where is_origin = 1;

In [None]:
%%bigquery --project $project_id 
# join all the origins to the trajectories after 1980 table
create or replace table cshdimensionstest.test.researchers_after_1980_with_origins as
select a.*, b.research_orgs as first_affiliation, ifnull(b.is_origin, 0) is_origin_all
from cshdimensionstest.test.researchers_after_1980 a
left join cshdimensionstest.test.origins b 
  on a.researcher_ids=b.researcher_ids
  and a.research_orgs=b.research_orgs;

drop table cshdimensionstest.test.researchers_after_1980;
create or replace table cshdimensionstest.test.researchers_after_1980 as
select * from cshdimensionstest.test.researchers_after_1980_with_origins;
drop table cshdimensionstest.test.researchers_after_1980_with_origins;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN is_multi_affiliation;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN first_affiliation;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN first_publication_year;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN is_origin;

**Check table**

In [12]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.researchers_after_1980 
order by researcher_ids, t
limit 10

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,id,year,research_orgs,t,aff_weight,n_au_orgs,is_origin_all
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,1,1.0,1,0
1,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,0.333333,3,0
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,0.333333,3,0
3,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,0.333333,3,0
4,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,1,1.0,1,1
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,0.5,2,1
6,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,0.5,2,1
7,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,1,1.0,1,1
8,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,1,1.0,1,1
9,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,1,1.0,1,1


**Ok now we can contruct the mobility network.**


---

- we can use this table to calculate the number of publications of the author per `year`
- we count the fractional number of papers using the `aff_weight`

# **PART III**

## 3.0 Network Flows

We will split the calculation of the network flows:
1. All affiliation transitions network
2. Multiple affiliation transitions network
3. Single affiliation transitions network



In [None]:
%%bigquery --project $project_id 
# now we have everything we need to construct the flows at the institutional level
create or replace table cshdimensionstest.test.flows_1980_2000 as 
  select distinct
    a.researcher_ids,
    a.id as pub1,
    b.id as pub2,
    a.research_orgs as unit1,
    b.research_orgs as unit2,
    a.t as t1,
    b.t as t2,
    a.year as p1,
    b.year as p2,
    a.aff_weight as aff_w1,
    b.aff_weight as aff_w2,
    a.is_multi_affiliation as is_multi1,
    b.is_multi_affiliation as is_multi2,
   -- a.is_origin as is_origin1,
   -- b.is_origin as is_origin2,
    a.is_origin_all as is_origin_all1,
    b.is_origin_all as is_origin_all2
  from
        cshdimensionstest.test.researchers_after_1980 a 
    inner join
        cshdimensionstest.test.researchers_after_1980 b 
        on a.researcher_ids = b.researcher_ids 
  where
        a.t < b.t and a.t = b.t - 1 
  order by
    a.researcher_ids,
    a.t,
    b.t;
  #select * from cshdimensionstest.test.flows_1980_2000 limit 10;

Query is running:   0%|          |

In [None]:
%%bigquery --project $project_id 
ALTER TABLE cshdimensionstest.test.flows_1980_2000 DROP COLUMN is_origin1;
ALTER TABLE cshdimensionstest.test.flows_1980_2000 DROP COLUMN is_origin2;
--ALTER TABLE cshdimensionstest.test.flows_1980_2000 DROP COLUMN aff_weight1;
--ALTER TABLE cshdimensionstest.test.flows_1980_2000 DROP COLUMN aff_weight2;

In [None]:
%%bigquery --project $project_id 
# now we make a table with all the double affiliations by each t for all researcher_ids
create or replace table cshdimensionstest.test.multi_affiliations as
select distinct researcher_ids, 
                research_orgs,
                t, 
                is_origin
from cshdimensionstest.test.researchers_after_1980 
where is_multi_affiliation = 1;

# CONTINUE HERE

In [None]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.multi_affiliations where  researcher_ids = 'ur.01000000255.40' order by  t, research_orgs

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,research_orgs,t,is_origin
0,ur.01000000255.40,grid.136593.b,1,1
1,ur.01000000255.40,grid.416963.f,1,1
2,ur.01000000255.40,grid.136593.b,2,0
3,ur.01000000255.40,grid.258799.8,2,0


In [None]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.flows_1980_2000 where  researcher_ids = 'ur.01000000255.40' order by t1, unit1

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,pub1,pub2,unit1,unit2,t1,t2,p1,p2,aff_weight1,aff_weight2,is_multi_affiliation1,is_multi_affiliation2,is_origin_all1,is_origin_all2
0,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.136593.b,1,2,1980,1981,0.5,0.5,1,1,1,1
1,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.258799.8,1,2,1980,1981,0.5,0.5,1,1,1,0
2,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.258799.8,1,2,1980,1981,0.5,0.5,1,1,1,0
3,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.136593.b,1,2,1980,1981,0.5,0.5,1,1,1,1


**Rules**

1. if unit1 = unit2 and is_origin_all2 = 1 then 'stayed in first affiliation';

2. if unit1 = unit2 and is_origin_all2 = 0 then 'stayed in';

3. if unit1 != unit2 and is_origin_all1 = 1 and is_origin_all2 = 1 and t1 = t2-1 then 'redundant edge';

4. if unit1 != unit2 and (is_origin_all2 = 0 or is_origin_all2 = 0) then 'moved to';

In [None]:
%%bigquery --project $project_id 
# lets test it with researcher_ids = 'ur.01000000255.40'
create or replace table cshdimensionstest.test.flows_1980_2000_types_of_mobility as
select * , 
  case when unit1 = unit2 and is_origin_all2 = 1 then 'stayed in'
    else case when unit1 = unit2 and is_origin_all2 = 0 then 'stayed in unit2' 
     else case when unit1 != unit2 and is_origin_all1 = 1 and is_origin_all2 = 1 and t1 = t2-1 then 'redundant edge'
      else case when unit1 != unit2 and (is_origin_all2 = 0 or is_origin_all2 = 0) then 'moved to'
        else 'stayed in unit1' 
          end end end end as mobility_type
from cshdimensionstest.test.flows_1980_2000;

drop table cshdimensionstest.test.flows_1980_2000;

create or replace table cshdimensionstest.test.flows_1980_2000 as
select * from cshdimensionstest.test.flows_1980_2000_types_of_mobility;

drop table cshdimensionstest.test.flows_1980_2000_types_of_mobility;

select * from cshdimensionstest.test.flows_1980_2000 
order by researcher_ids, t1, t2, pub1, pub2, unit1, unit2
limit 200;


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,pub1,pub2,unit1,unit2,t1,t2,p1,p2,is_multi_affiliation1,is_multi_affiliation2,is_origin_all1,is_origin_all2,mobility_type
0,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.136593.b,1,2,1980,1981,1,1,1,1,stayed in
1,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.258799.8,1,2,1980,1981,1,1,1,0,moved to
2,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.136593.b,1,2,1980,1981,1,1,1,1,redundant edge
3,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.258799.8,1,2,1980,1981,1,1,1,0,moved to
4,ur.01000000352.51,pub.1026856550,pub.1004086972,grid.10253.35,grid.10253.35,1,2,1993,1994,0,1,1,1,stayed in
5,ur.01000000352.51,pub.1026856550,pub.1004086972,grid.10253.35,grid.6553.5,1,2,1993,1994,0,1,1,0,moved to
6,ur.01000000352.51,pub.1026856550,pub.1046881649,grid.10253.35,grid.10253.35,1,2,1993,1994,0,0,1,1,stayed in
7,ur.01000000352.51,pub.1004086972,pub.1025636724,grid.10253.35,grid.10253.35,2,3,1994,1997,1,0,1,1,stayed in
8,ur.01000000352.51,pub.1004086972,pub.1025636724,grid.6553.5,grid.10253.35,2,3,1994,1997,1,0,0,1,other
9,ur.01000000352.51,pub.1046881649,pub.1025636724,grid.10253.35,grid.10253.35,2,3,1994,1997,0,0,1,1,stayed in


In [None]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.flows_1980_2000 
order by researcher_ids, t1, t2, pub1, pub2, unit1, unit2
limit 200;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,pub1,pub2,unit1,unit2,t1,t2,p1,p2,is_multi_affiliation1,is_multi_affiliation2,is_origin_all1,is_origin_all2,mobility_type
0,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.136593.b,1,2,1980,1981,1,1,1,1,stayed in
1,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.136593.b,grid.258799.8,1,2,1980,1981,1,1,1,0,moved to
2,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.136593.b,1,2,1980,1981,1,1,1,1,redundant edge
3,ur.01000000255.40,pub.1010721240,pub.1033444575,grid.416963.f,grid.258799.8,1,2,1980,1981,1,1,1,0,moved to
4,ur.01000000352.51,pub.1026856550,pub.1004086972,grid.10253.35,grid.10253.35,1,2,1993,1994,0,1,1,1,stayed in
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,ur.010000007767.96,pub.1042812711,pub.1058370169,grid.417597.9,grid.13992.30,6,7,1997,1998,1,1,0,1,other
196,ur.010000007767.96,pub.1042812711,pub.1058370169,grid.417597.9,grid.4991.5,6,7,1997,1998,1,1,0,0,moved to
197,ur.010000007767.96,pub.1042812711,pub.1060588289,grid.13992.30,grid.13992.30,6,7,1997,1998,1,1,1,1,stayed in
198,ur.010000007767.96,pub.1042812711,pub.1060588289,grid.13992.30,grid.9619.7,6,7,1997,1998,1,1,1,0,moved to


# TBC HERE

In [None]:
%%bigquery --project $project_id 
# we use this table to remove all the edges between multiple affiliations
# lets test with this id first: ur.01000000255.40

select b.* 
from cshdimensionstest.test.multi_affiliations a
join cshdimensionstest.test.flows_1980_2000 b 
  on a.researcher_ids=b.researcher_ids
  and 
where researcher_ids = 'ur.01000000255.40'
order by t




Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,research_orgs,t,is_origin
0,ur.01000000255.40,grid.416963.f,1,1
1,ur.01000000255.40,grid.136593.b,1,1
2,ur.01000000255.40,grid.136593.b,2,0
3,ur.01000000255.40,grid.258799.8,2,0


In [None]:
%%bigquery --project $project_id 

# step (7)
# bring the flows and the start publication datasets together and save it in a table
create or replace table cshdimensionstest.test.flows_with_start_00_02 as 
   select
      researcher_ids,
      pub1,
      pub2,
      field1,
      field2,
      unit1,
      unit2,
     cast(t1 as int) t1,
     cast(p1 as int) p1,
      t2,
      p2,
      aff_weight,
      case
         when
            unit1 = unit2 
         then
            'stayed in' 
         else
            case
               when
                  unit1 != unit2 
               then
                  'moved to' 
               else
                  'error' 
            end
      end
      as mobility_type 
   from
      `cshdimensionstest.test.flows_00_02` 
   union all
   select
      researcher_ids,
     'void' as pub1,
      pub2,
      field1,
      field2,
     'void' as unit1,
      unit2,
     cast(t1 as int) t1,
     cast(p1 as int) p1,
      t2,
      p2,
      aff_weight,
      mobility_type 
   from
      `cshdimensionstest.test.first_pub_00_02` 
   order by
      researcher_ids,
      t2;

In [None]:
#@title Hidden cell
%%bigquery --project $project_id

select * from `cshdimensionstest.test.flows_with_start_00_02` 
order by researcher_ids, p1, p2
limit 10 

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.01000000143.58,void,pub.1055163401,2581,2581,void,grid.17091.3e,0,2000,1,2000,1.0,started in
1,ur.01000000143.58,void,pub.1007920533,2581,2581,void,grid.17091.3e,0,2000,1,2000,0.5,started in
2,ur.01000000143.58,void,pub.1007920533,2581,2581,void,grid.417570.0,0,2000,1,2000,0.5,started in
3,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.417570.0,grid.17089.37,1,2000,2,2001,0.333333,moved to
4,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.17089.37,1,2000,2,2001,0.333333,moved to
5,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.17091.3e,1,2000,2,2001,0.333333,stayed in
6,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.417570.0,grid.31501.36,1,2000,2,2001,0.333333,moved to
7,ur.01000000143.58,pub.1055163401,pub.1053189419,2581,2581,grid.17091.3e,grid.17089.37,1,2000,2,2001,0.333333,moved to
8,ur.01000000143.58,pub.1055163401,pub.1053189419,2581,2581,grid.17091.3e,grid.31501.36,1,2000,2,2001,0.333333,moved to
9,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.31501.36,1,2000,2,2001,0.333333,moved to


## **2.0 Mobility Flows**


# *Test table*
- this was the table initially given to LIU, but its not correct

In [None]:
%%bigquery --project $project_id
create table cshdimensionstest.test.aggregated_moved_to_00_02 as
select  unit1 as geoid_o
      , unit2 as geoid_d
      , field1 as catid_o
      , field2 as catid_d
      , p1 as date_o
      , p2 as date_d
      , '2000-2002' as date_range
      , sum(aff_weight) as weighted_flows
      , count(researcher_ids) as flows
from `cshdimensionstest.test.flows_with_start_00_02` 
where mobility_type = 'moved to'
group by unit1,unit2, field1,field2,  p1,p2

In [None]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.aggregated_moved_to_00_02`
order by geoid_o, date_o, date_d, catid_o
limit 10

Unnamed: 0,geoid_o,geoid_d,catid_o,catid_d,date_o,date_d,date_range,weighted_flows,flows
0,grid.1001.0,grid.28312.3a,2330,2921,2000,2001,2000-2002,0.333333,1
1,grid.1001.0,grid.11355.33,2330,2409,2000,2001,2000-2002,3.166667,8
2,grid.1001.0,grid.6612.3,2330,2389,2000,2001,2000-2002,0.5,1
3,grid.1001.0,grid.4991.5,2330,2330,2000,2001,2000-2002,1.666667,3
4,grid.1001.0,grid.1003.2,2330,2344,2000,2001,2000-2002,0.5,1
5,grid.1001.0,grid.1005.4,2330,2353,2000,2001,2000-2002,0.5,1
6,grid.1001.0,grid.483427.e,2330,2409,2000,2001,2000-2002,0.666667,2
7,grid.1001.0,grid.425004.7,2330,2409,2000,2001,2000-2002,0.5,1
8,grid.1001.0,grid.1022.1,2330,2953,2000,2001,2000-2002,2.0,4
9,grid.1001.0,grid.37172.30,2330,2921,2000,2001,2000-2002,0.333333,1


In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

sql = """
  SELECT *
  FROM `cshdimensionstest.test.aggregated_moved_to_00_02` 
  order by geoid_o, date_o, date_d, catid_o
"""
movedto_edges = client.query(sql).to_dataframe()
movedto_edges.head(10)

# save the dataset
movedto_edges.to_csv('movedto_edges.csv')
files.download('movedto_edges.csv')

Unnamed: 0,geoid_o,geoid_d,catid_o,catid_d,date_o,date_d,date_range,weighted_flows,flows
0,grid.1001.0,grid.1003.2,2330,2366,2000,2001,2000-2002,1.0,1
1,grid.1001.0,grid.32197.3e,2330,2933,2000,2001,2000-2002,1.0,1
2,grid.1001.0,grid.508487.6,2330,2330,2000,2001,2000-2002,0.666667,2
3,grid.1001.0,grid.12136.37,2330,2409,2000,2001,2000-2002,0.666667,2
4,grid.1001.0,grid.264756.4,2330,2330,2000,2001,2000-2002,0.5,1
5,grid.1001.0,grid.8127.c,2330,2447,2000,2001,2000-2002,0.833333,2
6,grid.1001.0,grid.117476.2,2330,2921,2000,2001,2000-2002,1.0,2
7,grid.1001.0,grid.8484.0,2330,2746,2000,2001,2000-2002,0.333333,1
8,grid.1001.0,grid.5596.f,2330,2933,2000,2001,2000-2002,0.5,1
9,grid.1001.0,grid.5333.6,2330,2921,2000,2001,2000-2002,0.333333,1


In [None]:
%%bigquery --project $project_id

# calculating outflows for each institution
create table cshdimensionstest.test.total_outflows_00_02 as
  select  geoid_o   # sending instution
        , catid_o   # sending field
        , date_d    # sending year: we consider the destination date and the sending date
        , date_range
        , sum(flows) as t_outflows   
        , sum(weighted_flows)  as t_weightedOutflows 
  from `cshdimensionstest.test.aggregated_moved_to_00_02`
  group by geoid_o, catid_o, date_d, date_range

In [None]:
%%bigquery --project $project_id

# calculating inflows for each institution
create table cshdimensionstest.test.total_inflows_00_02 as
  select  geoid_d    # receiving instution
        , catid_d    # receiving field
        , date_d     # receiving year: we consider the destination date and the sending date
        , date_range
        , sum(flows) as inflows   
        , sum(weighted_flows)  as weightedInflows
  from `cshdimensionstest.test.aggregated_moved_to_00_02`
  group by geoid_d, catid_d, date_d, date_range


In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

sql = """
  SELECT *
  FROM `cshdimensionstest.test.total_inflows_00_02` 
"""

inflows = client.query(sql).to_dataframe()

#inflows.to_csv('inflows.csv')
#!cp inflows.csv "gdrive/My Drive/CSH-DIMENSIONS Flows Test/BigQuery-results"

sql = """
  SELECT *
  FROM `cshdimensionstest.test.total_outflows_00_02` 
"""
outflows = client.query(sql).to_dataframe()

#from google.colab import files
#files.download('inflows.csv')
#outflows.to_csv('outflows.csv')
#!cp inflows.csv "gdrive/My Drive/CSH-DIMENSIONS Flows Test/BigQuery-results"
#from google.colab import files
#files.download('outflows.csv')

In [None]:
#@title Hidden Cell
inflows.sort_values(["geoid_d", "catid_d", "date_d"]).head(10)

Unnamed: 0,geoid_d,catid_d,date_d,date_range,inflows,weightedInflows
4666,grid.1001.0,2330,2001,2000-2002,208,107.833333
4759,grid.1001.0,2330,2002,2000-2002,234,162.666667
7252,grid.1001.0,2344,2001,2000-2002,537,272.583333
9142,grid.1001.0,2344,2002,2000-2002,419,241.666667
12230,grid.1001.0,2353,2001,2000-2002,96,63.5
11395,grid.1001.0,2353,2002,2000-2002,29,10.0
16908,grid.1001.0,2358,2001,2000-2002,212,150.5
16441,grid.1001.0,2358,2002,2000-2002,380,159.916667
18109,grid.1001.0,2366,2002,2000-2002,43,37.0
20890,grid.1001.0,2377,2001,2000-2002,19262,2502.594061


In [None]:
#@title Hidden Cell
outflows.sort_values(["geoid_o", "catid_o", "date_d"]).head(10)

Unnamed: 0,geoid_o,catid_o,date_d,date_range,t_outflows,t_weightedOutflows
4761,grid.1001.0,2330,2001,2000-2002,501,243.75
3223,grid.1001.0,2330,2002,2000-2002,311,154.666667
9245,grid.1001.0,2344,2001,2000-2002,498,243.5
9032,grid.1001.0,2344,2002,2000-2002,684,354.423077
11374,grid.1001.0,2353,2001,2000-2002,37,18.5
12302,grid.1001.0,2353,2002,2000-2002,187,75.50641
15729,grid.1001.0,2358,2001,2000-2002,346,181.166667
14254,grid.1001.0,2358,2002,2000-2002,275,113.044872
17733,grid.1001.0,2366,2001,2000-2002,48,20.333333
17388,grid.1001.0,2366,2002,2000-2002,23,9.5


In [None]:
# merge the inflows and outflows dataframe
result = pd.merge(inflows
                  , outflows
                  , how="outer"
                  , left_on=["geoid_d", "catid_d", "date_d"]
                  , right_on=["geoid_o", "catid_o", "date_d"]
                  ).reset_index(drop = True)
def diff(a, b):
    return b - a

result["net_mobility"] = result['inflows'] - result['t_outflows']
result["weighted_net_mobility"] = result['weightedInflows'] - result['t_weightedOutflows']
#result.sort_values(["geoid_o", "catid_o", "date_d"]).head(10)
flow_ind = result.rename(columns = {'date_d': 'MoveYear'
                         , ' t_ouflows': 'outflows' 
                         , 't_weightedOutflows': 'weightedOutflows'
                         , 'date_range_x':'Range'
                         , 'net_mobility':'NetFlows'
                         , 'weighted_net_mobility': 'WeightedNetFlows'}) \
                         [[  'geoid_d', 'catid_d', 'inflows', 'weightedInflows'\
                           , 'geoid_o', 'catid_o', 't_outflows', 'weightedOutflows'\
                           , 'NetFlows', 'WeightedNetFlows', 'MoveYear', 'Range']]

# save the indicators to a csv file
#flow_ind.to_csv('Flows_indicators.csv')
#files.download('Flows_indicators.csv')
flow_ind.sort_values(["geoid_o", "catid_o", "MoveYear"]).head(10)

# store dataset directly into GBQ and DRIVE
# store in drive
flow_ind.to_csv('2023_01_08_flows_output.csv', encoding = 'utf-8-sig') 

# store in GBQ
# import pandas_gbq
# table_id = 'test.2023_01_08_flows_output'
# pandas_gbq.to_gbq(flow_ind, table_id, project_id=project_id)

flow_ind.head(1)

# *End of test table*

## **3.0 Flows**

**For each institution id and year we comput the following basic indicators:**

1.   Institution id
2.   Year
3. pcp
4. workforce (# of researchers per year)
5. net mobility (ok)
6. avg academic age
7. total author inflow (ok)
8. total author outflow (ok)

### **3.1 Filtering the edges**
**We need to do some  cleaning because we have edges that should not be in the table**

1. Filter out all edges between the origin institutions at t1=1 from table `cshdimensionstest.test.flows_with_start_00_02` 
and create a filtered table `cshdimensionstest.test.flows_with_start_00_02_filtered`

*Note: Let's do it like this for now but we need to incorporate this in the creation of the network at some point*

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE cshdimensionstest.test.flows_with_start_00_02_filtered as

WITH filtered AS (
SELECT *
FROM `cshdimensionstest.test.flows_with_start_00_02`
WHERE t1 = 0 AND mobility_type = 'started in'
), grouped AS (
SELECT researcher_ids, ARRAY_AGG(unit2) unit2_values
FROM filtered
GROUP BY researcher_ids
)
SELECT t1.*
FROM `cshdimensionstest.test.flows_with_start_00_02` t1
JOIN grouped t2
ON t1.researcher_ids = t2.researcher_ids
WHERE t1.t1 <> 1
OR NOT (t1.t1 = 1 AND t1.unit2 IN UNNEST(t2.unit2_values) AND t1.mobility_type <> 'stayed in')

In [None]:
from google.cloud import bigquery
client = bigquery.Client()

table_id = 'cshdimensionstest.test.flows_with_start_00_02_filtered'
table = client.get_table(table_id)

# print the names of all columns in the table
print([field.name for field in table.schema])

['researcher_ids', 'pub1', 'pub2', 'field1', 'field2', 'unit1', 'unit2', 't1', 'p1', 't2', 'p2', 'aff_weight', 'mobility_type']


In [None]:
%%bigquery --project $project_id

# create a new table with the rows where mobility_type is 'stayed in'
CREATE OR REPLACE TABLE cshdimensionstest.test.df_stayed_in AS (
  SELECT researcher_ids, pub1, pub2, field1, field2, unit1, unit2, t1, p1, t2, p2, aff_weight,mobility_type
  FROM cshdimensionstest.test.flows_with_start_00_02_filtered
  WHERE mobility_type = 'stayed in' and  researcher_ids = 'ur.0676562433.28' 
  --GROUP BY t1, t2, unit2, researcher_ids
);

# create a new table with the rows where mobility_type is 'moved to'
CREATE OR REPLACE TABLE cshdimensionstest.test.df_moved_to AS (
  SELECT researcher_ids, pub1, pub2, field1, field2, unit1, unit2, t1, p1, t2, p2, aff_weight,mobility_type
  FROM cshdimensionstest.test.flows_with_start_00_02_filtered
  WHERE mobility_type = 'moved to'  and  researcher_ids = 'ur.0676562433.28' 
  --GROUP BY t1, t2, unit2, researcher_ids
);

# merge the two tables on t1, unit2, and id, and keep only the rows where the id is not in both tables
CREATE OR REPLACE TABLE cshdimensionstest.test.df_filtered AS (
  SELECT s.*
  FROM cshdimensionstest.test.df_stayed_in s
  LEFT JOIN cshdimensionstest.test.df_moved_to m
  USING (t1, t2, unit2, researcher_ids)
  WHERE m.researcher_ids IS NULL  and  researcher_ids = 'ur.0676562433.28' 
);

Query is running:   0%|          |

In [None]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.df_filtered` 
--order by t1, researcher_ids, p1, t2, p2, unit1, unit2
--limit  20

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.0676562433.28,pub.1002325839,pub.1001389009,2401,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
1,ur.0676562433.28,pub.1002325839,pub.1001389009,2401,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
2,ur.0676562433.28,pub.1002325839,pub.1001389009,2415,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
3,ur.0676562433.28,pub.1002325839,pub.1001389009,2415,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
4,ur.0676562433.28,pub.1019988787,pub.1001389009,2480,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,ur.0676562433.28,pub.1042389624,pub.1060822875,2913,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
87,ur.0676562433.28,pub.1042389624,pub.1060822875,2913,2421,grid.16750.35,grid.16750.35,1,2000,2,2001,0.333333,stayed in
88,ur.0676562433.28,pub.1060597145,pub.1060822875,2921,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
89,ur.0676562433.28,pub.1060597145,pub.1060822875,2921,2421,grid.16750.35,grid.16750.35,1,2000,2,2001,0.333333,stayed in


In [None]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
where   researcher_ids = 'ur.0676562433.28' 
order by t1, p1, t2, p2, pub1, pub2, unit1, unit2
limit  200

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.0676562433.28,void,pub.1002325839,3021,3021,void,grid.419859.8,0,2000,1,2000,0.500000,started in
1,ur.0676562433.28,void,pub.1002325839,2401,2401,void,grid.419859.8,0,2000,1,2000,0.500000,started in
2,ur.0676562433.28,void,pub.1002325839,2415,2415,void,grid.419859.8,0,2000,1,2000,0.500000,started in
3,ur.0676562433.28,void,pub.1002325839,2401,2401,void,grid.425806.d,0,2000,1,2000,0.500000,started in
4,ur.0676562433.28,void,pub.1002325839,3021,3021,void,grid.425806.d,0,2000,1,2000,0.500000,started in
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419330.c,grid.1005.4,1,2000,2,2001,0.333333,moved to
196,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419859.8,grid.1005.4,1,2000,2,2001,0.333333,moved to
197,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2421,grid.419859.8,grid.1005.4,1,2000,2,2001,0.333333,moved to
198,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in


In [None]:
%%bigquery --project $project_id

select * from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
where researcher_ids = 'ur.0676562433.28' and unit2 = 'grid.36425.36'
order by t1, researcher_ids, p1, t2, p2, unit1, unit2
limit 100

In [None]:
%%bigquery --project $project_id

select researcher_ids, unit2, count(distinct mobility_type) as t_mob_type,  t1, p1, t2, p2
from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
group by researcher_ids, unit2,  t1, p1, t2, p2
having count(distinct mobility_type) > 1
order by t1
limit 1000

In [None]:
"""
# Filter the dataframe to include only rows with t1 = 0 and mobility_type = 'started in'
df_filtered = df_test[(df_test['t1'] == 0) & (df_test['mobility_type'] == 'started in')]

# Group the dataframe by id
df_grouped = df_filtered.groupby(['researcher_ids'])

# Iterate through the groups and store the values of unit2 for each group
unit2_values = {}
for name, group in df_grouped:
    unit2_values[name] = set(group['unit2'].tolist())

# Iterate through the original dataframe and remove rows where unit2 and t1 = 1 appear in the same row
# and unit2 also appears in t1 = 0 for the same id
to_remove = []
for index, row in df_test.iterrows():
    if (row['t1'] == 1) and (row['unit2'] in unit2_values[row['researcher_ids']]):
        to_remove.append(index)

# Drop the rows from the dataframe
df_test.drop(to_remove, inplace=True)
"""

In [None]:
query = """
SELECT 
  outer_query.unit2 as id
  ,outer_query.p2 as pub_year
  ,'2000-2002' as date_range
  ,count(distinct outer_query.researcher_ids) as t_workforce
  ,subquery.t_initial_affiliations
  ,subquery2.t_stayers
  ,subquery3.t_incomers
  ,subquery4.t_outgoers
FROM `cshdimensionstest.test.flows_with_start_00_02_filtered` AS outer_query
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_initial_affiliations
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'started in'
  GROUP BY unit2, p2
) AS subquery
ON outer_query.unit2 = subquery.unit2 AND outer_query.p2 = subquery.p2
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_stayers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'stayed in' 
  GROUP BY unit2, p2
) AS subquery2
ON outer_query.unit2 = subquery2.unit2 AND outer_query.p2 = subquery2.p2
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_incomers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'moved to'
  GROUP BY unit2, p2
) AS subquery3
ON outer_query.unit2 = subquery3.unit2 AND outer_query.p2 = subquery3.p2
LEFT JOIN (
  SELECT unit1, p2, COUNT(distinct researcher_ids) AS t_outgoers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'moved to'
  GROUP BY unit1, p2
) AS subquery4
ON outer_query.unit2 = subquery4.unit1 AND outer_query.p2 = subquery4.p2
GROUP BY outer_query.unit2, outer_query.p2, subquery.t_initial_affiliations, subquery2.t_stayers, subquery3.t_incomers, subquery4.t_outgoers
ORDER BY outer_query.unit2, outer_query.p2 
LIMIT 10
"""
query_job = client.query(query)
df = query_job.to_dataframe()
df

# **NEXT STEPS**

1. **Make a hello world program**
1. **Connect resources to each other:**
 e.g., can I print the GBQ data in a website (print=show any table) for instance?
1. **Other considerations**
* how to run queries fast enough (users should not have delays)
* how does the interface look like
* how to put all calculations in one query?
* how to connect the web interface to google bigquery?
* what if multiple users use it? performance?


# Coverage Indicators

In [None]:
# calculate the number of publications without disambiguated researchers

%%bigquery --project $project_id 

SELECT id, researcher_ids FROM `dimensions-ai.data_analytics.publications` 
--left join unnest(researcher_ids) researcher_ids
where researcher_ids is not null
order by id
LIMIT 1000