<a href="https://colab.research.google.com/github/MarciaFG/skill-flow/blob/main/Flows_1980_2000_first_level_for.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Academic Mobility Flows using BigQuery**

Author: Marcia R. Ferreira (Complexity Science Hub Vienna & TU Wien)

Date: September 28, 2022

Input: Dimensions database on BigQuery

Output: 

Other notes: 
*   To create the basic tables for all years, copy the code in this notebook and rerun it for each of the following decades.
*   The tables need to have overlapping years; otherwise it will not be possible to capture the transitions at the year ceilings.
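A minimal sketch of how such overlapping windows could be generated, assuming inclusive year ranges, a 20-year span, and a one-year overlap. The helper name `overlapping_windows` and its parameters are illustrative and not part of the notebook.

```python
def overlapping_windows(first_year, last_year, span=20, overlap=1):
    """Return inclusive (start, end) year windows; consecutive windows share `overlap` years."""
    if overlap >= span:
        raise ValueError("overlap must be smaller than span")
    windows = []
    start = first_year
    while start < last_year:
        end = min(start + span, last_year)
        windows.append((start, end))
        if end == last_year:
            break
        # the next window starts inside the current one so that they share `overlap` years
        start = end - overlap + 1
    return windows


windows = overlapping_windows(1980, 2022)
print(windows)
```

With a one-year overlap, each window starts on the last year of the previous one, so a move between 2000 and 2001 is captured by the second table.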


## Colab Initialization

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Fri Feb  3 09:18:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install required Drivers

In [3]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
#!pip install psutil
#!pip install humanize
#!pip install pynput

# libraries
import psutil
import humanize
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import requests
import torch
import nltk
import GPUtil as GPU

# plotting
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

from google.cloud import bigquery
from google.colab import files
%load_ext google.colab.data_table
%load_ext google.cloud.bigquery

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gputil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-py3-none-any.whl size=7409 sha256=ec418d690fa0eb477ad225b2ef71d06377013dd4e2a267e25085e5ed60f76ae7
  Stored in directory: /root/.cache/pip/wheels/ba/03/bb/7a97840eb54479b328672e15a536e49dc60da200fb21564d53
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


In [4]:
# Colab provides at most one GPU, and even that one isn't guaranteed
import psutil
import os
import humanize
import GPUtil as GPU

GPUs = GPU.getGPUs()
gpu = GPUs[0] if GPUs else None  # None when no GPU is attached to the runtime

def printm():
    """Print free system RAM, process size, and GPU memory usage."""
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    if gpu is None:
        print("No GPU detected")
        return
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB"
          .format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

Gen RAM Free: 26.1 GB  | Proc size: 459.9 MB
GPU RAM Free: 15109MB | Used: 0MB | Util   0% | Total 15360MB


**Uploading files from your local machine (if needed)**

In [None]:
# run this to upload files
# from google.colab import files
# uploaded = files.upload() 

**Mounting the Google Drive folder**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# let's test it
with open('/content/drive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat /content/drive/My\ Drive/foo.txt

Mounted at /content/drive
Hello Google Drive!

**Runtime credentials**

In [5]:
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


**Declare the Cloud project ID which will be used throughout this notebook**

In [6]:
# declare your project 
project_id = "cshdimensionstest"

# **PART I - Data Wrangling**

In [None]:
# test to see if it is working correctly
# set up parameters eg for a specific journal
bq_params = {}
bq_params["journal_id"] = "jour.1115214"

In [None]:
%%bigquery --params $bq_params --project $project_id 
-- test to see if it is working correctly

select distinct 
  journal.id, journal.title, journal.issn, journal.eissn, publisher.name, date_inserted
from `dimensions-ai.data_analytics.publications` 
where  journal.id = @journal_id
and publisher is not null
order by date_inserted desc
limit 1

**OK, it works. Let's start!**

## 1.0 Load Data from GBQ

### *Basic Table*

1.   The intermediary table is restricted to:
  - publication id
  - researcher id
  - research organizations
  - first-level FOR (Fields of Research) code
  - year, as our unit of time
2.   Filters:
  - only disambiguated researchers for whom author-affiliation linkages exist
  - research organizations for which a GRID id exists
  - only publications that meet the two conditions above and have a first-level FOR code assigned
  - researchers whose first publication was in or after 1980
3.   Time period:
  - 1980-2000
  - we run separate scripts for shorter time frames to reduce the computational load
  - we will consider parallelizing the full code at the end
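The restrictions and filters above can be sketched on toy records in plain Python. This is only an illustration of the logic, not the BigQuery implementation: the record layout, the ids, and the helper name `basic_rows` are assumptions made for the example.

```python
from itertools import product


def basic_rows(publications, first_publication_year, start=1980, end=2000, min_first_year=1980):
    """Flatten publication records into (pub, researcher, org, FOR code, year) rows."""
    rows = []
    for pub in publications:
        if not start <= pub["year"] <= end:
            continue
        # product() yields nothing when any list is empty, like a cross join with UNNEST
        for rid, org, code in product(pub["researcher_ids"], pub["research_orgs"], pub["for_codes"]):
            # researchers missing from the lookup, or starting before 1980, are dropped
            if first_publication_year.get(rid, min_first_year - 1) >= min_first_year:
                rows.append((pub["id"], rid, org, code, pub["year"]))
    return rows


# hypothetical toy data
publications = [
    {"id": "pub.1", "year": 1985, "researcher_ids": ["ur.1"], "research_orgs": ["grid.1"], "for_codes": ["39"]},
    {"id": "pub.2", "year": 1998, "researcher_ids": ["ur.1", "ur.2"], "research_orgs": ["grid.1", "grid.2"], "for_codes": ["47"]},
    {"id": "pub.3", "year": 1999, "researcher_ids": ["ur.1"], "research_orgs": [], "for_codes": ["47"]},
    {"id": "pub.4", "year": 2005, "researcher_ids": ["ur.1"], "research_orgs": ["grid.1"], "for_codes": ["39"]},
]
first_publication_year = {"ur.1": 1981, "ur.2": 1975}

rows = basic_rows(publications, first_publication_year)
for row in rows:
    print(row)
```

Here `pub.3` drops out because it has no organization, `pub.4` because it falls outside 1980-2000, and `ur.2` because the first publication predates 1980.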



In [None]:
%%bigquery --project $project_id 
# Constructing the mobility flows intermediary table for the FOR categorization

#create or replace table cshdimensionstest.test.disambiguated_authors and corresponding publications
CREATE OR REPLACE TABLE cshdimensionstest.test.basic_1980_2000 AS 

SELECT p.id, researcher_ids, research_orgs, category_for.code, p.year
FROM `dimensions-ai.data_analytics.publications` p
    , unnest(category_for.first_level.full) category_for
    , unnest(researcher_ids) researcher_ids
    , unnest(research_orgs) research_orgs
    JOIN `dimensions-ai.data_analytics.researchers` r 
    ON r.id = researcher_ids -- join on the unnested researcher id, not on the p.researcher_ids array
WHERE researcher_ids IS NOT NULL 
  AND research_orgs IS NOT NULL
  AND category_for IS NOT NULL -- it may be best to allow NULL values here
  AND p.year BETWEEN 1980 AND 2000
  AND r.first_publication_year >= 1980
ORDER BY p.id, researcher_ids, research_orgs

-- this gives us the publications with disambiguated researchers ids
-- AND  the pubs with authors that have affiliation linkages
-- AND the pubs with author-aff links that have an FOR category associated
-- AND between 1980 and 2000
-- This will be our basic table

In [None]:
%%bigquery --project $project_id
-- let's have a look
SELECT  * FROM cshdimensionstest.test.basic_1980_2000 limit 2;

Unnamed: 0,id,researcher_ids,research_orgs,code,year,is_multi_affiliation
0,pub.1134443158,ur.07412113215.62,grid.267468.9,39,1980,1
1,pub.1006946147,ur.015375262574.30,grid.257410.5,47,1980,1


In [None]:
%%bigquery --project $project_id
-- count the number of FOR categories per publication 
-- SELECT id, COUNT (DISTINCT code) N_codes FROM cshdimensionstest.test.basic_1980_2000 GROUP BY id order by N_CODES DESC LIMIT 5;
-- A publication can have up to 5 codes
SELECT COUNT(*) FROM cshdimensionstest.test.basic_1980_2000;
-- 55873771 total rows

In [None]:
%%bigquery --project $project_id
SELECT COUNT(distinct researcher_ids) FROM cshdimensionstest.test.basic_1980_2000; -- 4,510,186 distinct researchers
SELECT COUNT(distinct research_orgs) FROM cshdimensionstest.test.basic_1980_2000; -- 35,136 unique organizations
SELECT COUNT(distinct id) FROM cshdimensionstest.test.basic_1980_2000; -- 10,025,338 publications of any document type

### *Multiple Affiliations*

* Identify multiple affiliations in the table and update the basic table
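As a rough illustration of this step, the sketch below counts the distinct organizations per publication-researcher pair and derives the flag from that count. The function name `flag_multi_affiliations` is hypothetical; the two example pairs are taken from the sample output shown later in this notebook.

```python
from collections import defaultdict


def flag_multi_affiliations(rows):
    """rows: iterable of (pub_id, researcher_id, org). Returns per-pair count and 0/1 flag."""
    orgs = defaultdict(set)
    for pub_id, researcher_id, org in rows:
        orgs[(pub_id, researcher_id)].add(org)
    return {
        key: {"aff_w": len(org_set), "is_multi_affiliation": int(len(org_set) > 1)}
        for key, org_set in orgs.items()
    }


rows = [
    ("pub.1039310092", "ur.010000001271.33", "grid.417643.3"),
    ("pub.1064721680", "ur.010000001341.07", "grid.224260.0"),
    ("pub.1064721680", "ur.010000001341.07", "grid.265457.7"),
    ("pub.1064721680", "ur.010000001341.07", "grid.448385.6"),
]
flags = flag_multi_affiliations(rows)
for key, value in flags.items():
    print(key, value)
```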



In [None]:
%%bigquery --project $project_id 
# indicate whether an author-affiliation is shared  or not
CREATE OR REPLACE TABLE cshdimensionstest.test.multi_affiliations AS

SELECT DISTINCT p.id, p.researcher_ids, p.research_orgs, p.year, s.is_multi_affiliation, s.aff_w
FROM cshdimensionstest.test.basic_1980_2000 p
 JOIN 
    (
    SELECT id, researcher_ids, COUNT(DISTINCT research_orgs) as aff_w, CASE WHEN COUNT(DISTINCT research_orgs)  > 1 Then 1 Else 0 END is_multi_affiliation
    FROM cshdimensionstest.test.basic_1980_2000
    GROUP BY id, researcher_ids
    ) s
  ON p.id=s.id and p.researcher_ids=s.researcher_ids;

CREATE OR REPLACE TABLE cshdimensionstest.test.basic_1980_2000 AS SELECT * FROM cshdimensionstest.test.multi_affiliations;

SELECT * FROM cshdimensionstest.test.basic_1980_2000 order by id, researcher_ids, research_orgs LIMIT 20;

DROP TABLE IF EXISTS cshdimensionstest.test.multi_affiliations;

## 2.0 Researcher Trajectories

#### *Time Sequences*

In [None]:
%%bigquery --project $project_id 

# step (1): rank each researcher's publication years in ascending order (t = 1 for the first year)
 create or replace table cshdimensionstest.test.sequence_1980_2000 as 
  select distinct researcher_ids, 
    year, 
    dense_rank() over (
      partition by researcher_ids 
      order by 
        year asc
    ) as t 
  from `cshdimensionstest.test.basic_1980_2000`
  order by 
    researcher_ids, 
    year, 
    t;

In [None]:
%%bigquery --project $project_id 
SELECT * FROM cshdimensionstest.test.sequence_1980_2000 where researcher_ids = 'ur.011460612366.60' order by t;
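The time index `t` produced by `dense_rank()` can be pictured with a small Python equivalent. This is a sketch for intuition only; `year_sequence` is a hypothetical helper, and the example years are those of researcher `ur.01000000352.51` in the flows output further below.

```python
def year_sequence(rows):
    """rows: iterable of (researcher_id, year). Returns {(researcher_id, year): t}."""
    years = {}
    for researcher_id, year in rows:
        years.setdefault(researcher_id, set()).add(year)
    # consecutive ranks over the distinct years, so gaps in calendar years do not create gaps in t
    return {
        (researcher_id, year): t
        for researcher_id, year_set in years.items()
        for t, year in enumerate(sorted(year_set), start=1)
    }


rows = [
    ("ur.01000000352.51", 1994),
    ("ur.01000000352.51", 1993),
    ("ur.01000000352.51", 1997),
    ("ur.01000000352.51", 1993),  # a second publication in the same year gets the same t
]
sequence = year_sequence(rows)
print(sequence)
```

Note that `t` counts publishing years, not calendar years: 1994 and 1997 are adjacent steps even though three years separate them.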

#### *Affiliation Weights*

In [None]:
%%bigquery --project $project_id 
# step (2)
# generating affiliation weights if the author has had more than one affiliation simultaneously
 create or replace table cshdimensionstest.test.affweight_1980_2000 as 
  select 
    distinct researcher_ids, 
    id, 
    1 * 1.0 / count(distinct research_orgs) as aff_weight 
  from 
    `cshdimensionstest.test.basic_1980_2000`
  group by 
    researcher_ids, 
    id
  order by researcher_ids, id;

drop table if exists cshdimensionstest.test.affweight_00_02;

In [None]:
%%bigquery --project $project_id 
SELECT * FROM cshdimensionstest.test.affweight_1980_2000 order by researcher_ids, id limit 10;

In [None]:
%%bigquery --project $project_id 
# step (3)
# merging results from steps 1-2
create or replace table cshdimensionstest.test.psequence_weight_1980_2000 as 
  select 
      a.researcher_ids,
      a.id,
    --  a.code,
      a.year,
      a.research_orgs,
      a.is_multi_affiliation,
      b.t,
      c.aff_weight, 
      a.aff_w as n_au_orgs
  from
      `cshdimensionstest.test.basic_1980_2000` as a 
      inner join
         `cshdimensionstest.test.sequence_1980_2000` as b 
         on a.researcher_ids = b.researcher_ids 
         and a.year = b.year 
      inner join
         `cshdimensionstest.test.affweight_1980_2000` as c 
         on c.researcher_ids = a.researcher_ids 
         and c.id = a.id 
  order by
        b.researcher_ids,
        b.year,
        b.t;

  # drop table if exists cshdimensionstest.test.psequence_weight_00_02;

  select * from cshdimensionstest.test.psequence_weight_1980_2000  order by researcher_ids, id, year, t limit 10;

Unnamed: 0,researcher_ids,id,year,research_orgs,is_multi_affiliation,t,aff_weight,n_au_orgs
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,0,1,1.0,1
1,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,1,0.333333,3
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,1,0.333333,3
3,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,1,0.333333,3
4,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,1,0.5,2
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,1,0.5,2
6,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,0,1,1.0,1
7,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,0,1,1.0,1
8,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,0,1,1.0,1
9,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,0,1,1.0,1


#### *First Affiliation*


*   The first affiliation of an author is sometimes missing from the data table. This is due to (1) missing author-affiliation linkages and/or (2) publications without a field classification code.
*   The column `is_origin` marks whether that institution is the author's first affiliation in the whole database, and not just in the dataset for the overall period 1980-2022
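A small sketch of the origin logic, under the assumption that an institution counts as an origin when the researcher published there in their first publication year, and that this mark is then propagated to every row with the same researcher-institution pair. The helper name `mark_origins` and the toy ids are illustrative.

```python
def mark_origins(rows, first_publication_year):
    """rows: list of (researcher_id, org, year). Appends a 0/1 origin flag to each row."""
    # researcher-org pairs observed in the researcher's first publication year
    origins = {
        (researcher_id, org)
        for researcher_id, org, year in rows
        if first_publication_year.get(researcher_id) == year
    }
    # propagate the mark to the whole trajectory
    return [
        (researcher_id, org, year, int((researcher_id, org) in origins))
        for researcher_id, org, year in rows
    ]


rows = [
    ("ur.A", "grid.1", 1993),
    ("ur.A", "grid.2", 1995),
    ("ur.A", "grid.1", 1996),
    ("ur.B", "grid.3", 1985),
]
first_publication_year = {"ur.A": 1993, "ur.B": 1981}

marked = mark_origins(rows, first_publication_year)
for row in marked:
    print(row)
```

Researcher `ur.B` first published in 1981, but that publication is not in the table, so no institution is marked as their origin.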



In [None]:
%%bigquery --project $project_id 
# filter the dataset by researchers that started in 1980 or after
# all researchers can still be found in this table cshdimensionstest.test.psequence_weight_00_02
create or replace table cshdimensionstest.test.researchers_after_1980 as
  select p.*
    , first_publication_year
    , case 
        when first_publication_year = year then 1 else 0 
      end is_origin
  from cshdimensionstest.test.psequence_weight_1980_2000 p
  join (
        select distinct researcher_ids, first_publication_year
        from cshdimensionstest.test.psequence_weight_1980_2000 au
        left join dimensions-ai.data_analytics.researchers r on au.researcher_ids=r.id
        where first_publication_year >= 1980
        ) s
    on p.researcher_ids=s.researcher_ids;
    
select * from cshdimensionstest.test.researchers_after_1980 
order by researcher_ids,  year, t limit 10;

Unnamed: 0,researcher_ids,id,year,research_orgs,is_multi_affiliation,t,aff_weight,n_au_orgs,first_publication_year,is_origin
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,0,1,1.0,1,1981,0
1,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,1,0.333333,3,1997,0
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,1,0.333333,3,1997,0
3,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,1,0.333333,3,1997,0
4,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,0,1,1.0,1,2000,1
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,1,0.5,2,2000,1
6,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,1,0.5,2,2000,1
7,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,0,1,1.0,1,1993,1
8,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,0,1,1.0,1,1993,1
9,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,0,1,1.0,1,1993,1


In [None]:
%%bigquery --project $project_id 
# make a list of all origins and researcher_ids combinations in the dataset
# match the origins to the whole trajectory and mark it as 1
create or replace table cshdimensionstest.test.origins as
select distinct researcher_ids, research_orgs, is_origin
from cshdimensionstest.test.researchers_after_1980
where is_origin = 1;

In [None]:
%%bigquery --project $project_id 
# join all the origins to the trajectories after 1980 table
create or replace table cshdimensionstest.test.researchers_after_1980_with_origins as
select a.*, b.research_orgs as first_affiliation, ifnull(b.is_origin, 0) is_origin_all
from cshdimensionstest.test.researchers_after_1980 a
left join cshdimensionstest.test.origins b 
  on a.researcher_ids=b.researcher_ids
  and a.research_orgs=b.research_orgs;

drop table cshdimensionstest.test.researchers_after_1980;
create or replace table cshdimensionstest.test.researchers_after_1980 as
select * from cshdimensionstest.test.researchers_after_1980_with_origins;
drop table cshdimensionstest.test.researchers_after_1980_with_origins;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN is_multi_affiliation;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN first_affiliation;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN first_publication_year;
ALTER TABLE cshdimensionstest.test.researchers_after_1980 DROP COLUMN is_origin;

In [None]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.researchers_after_1980 
order by researcher_ids, t
limit 10

Unnamed: 0,researcher_ids,id,year,research_orgs,t,aff_weight,n_au_orgs,is_origin_all
0,ur.010000001271.33,pub.1039310092,1985,grid.417643.3,1,1.0,1,0
1,ur.010000001341.07,pub.1064721680,1998,grid.448385.6,1,0.333333,3,0
2,ur.010000001341.07,pub.1064721680,1998,grid.265457.7,1,0.333333,3,0
3,ur.010000001341.07,pub.1064721680,1998,grid.224260.0,1,0.333333,3,0
4,ur.01000000143.58,pub.1055163401,2000,grid.17091.3e,1,1.0,1,1
5,ur.01000000143.58,pub.1007920533,2000,grid.417570.0,1,0.5,2,1
6,ur.01000000143.58,pub.1007920533,2000,grid.17091.3e,1,0.5,2,1
7,ur.01000000162.06,pub.1017303151,1993,grid.412587.d,1,1.0,1,1
8,ur.01000000162.06,pub.1082688278,1993,grid.412597.c,1,1.0,1,1
9,ur.01000000162.06,pub.1040922518,1993,grid.412587.d,1,1.0,1,1


In [None]:
%%bigquery --project $project_id 

create or replace table cshdimensionstest.test.researchers_after_1980_simplified as
select distinct researcher_ids, research_orgs, year, t, is_origin_all
from cshdimensionstest.test.researchers_after_1980;

In [None]:
%%bigquery --project $project_id 
select * from cshdimensionstest.test.researchers_after_1980_simplified 
order by researcher_ids, t
limit 10

Unnamed: 0,researcher_ids,research_orgs,year,t,is_origin_all
0,ur.010000001271.33,grid.417643.3,1985,1,0
1,ur.010000001341.07,grid.448385.6,1998,1,0
2,ur.010000001341.07,grid.265457.7,1998,1,0
3,ur.010000001341.07,grid.224260.0,1998,1,0
4,ur.01000000143.58,grid.417570.0,2000,1,1
5,ur.01000000143.58,grid.17091.3e,2000,1,1
6,ur.01000000162.06,grid.412597.c,1993,1,1
7,ur.01000000162.06,grid.412587.d,1993,1,1
8,ur.010000001625.53,grid.5596.f,1998,1,1
9,ur.010000001625.53,grid.498578.f,1998,1,1


**OK, now we can construct the mobility network.**


---

- we can also use this table to calculate the number of publications of the author per `year`
- we use the publications in this table to calculate the indicators for authors for the period 1980-2000
- we count the fractional number of papers using the `aff_weight`
- note that the table contains repeated rows for author-pub-org combinations
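One possible reading of the fractional count is sketched below: each publication contributes its `aff_weight` to every institution listed for the author, and repeated author-pub-org rows are counted once. The function name `fractional_counts` is hypothetical; the example rows come from the sample output for researcher `ur.01000000143.58`, with one row duplicated on purpose.

```python
def fractional_counts(rows):
    """rows: iterable of (researcher_id, pub_id, year, org, aff_weight).

    Returns {(researcher_id, org, year): fractional number of papers}.
    """
    seen = set()
    totals = {}
    for researcher_id, pub_id, year, org, aff_weight in rows:
        key = (researcher_id, pub_id, org)
        if key in seen:  # skip repeated author-pub-org rows
            continue
        seen.add(key)
        group = (researcher_id, org, year)
        totals[group] = totals.get(group, 0.0) + aff_weight
    return totals


rows = [
    ("ur.01000000143.58", "pub.1007920533", 2000, "grid.17091.3e", 0.5),
    ("ur.01000000143.58", "pub.1007920533", 2000, "grid.417570.0", 0.5),
    ("ur.01000000143.58", "pub.1055163401", 2000, "grid.17091.3e", 1.0),
    ("ur.01000000143.58", "pub.1055163401", 2000, "grid.17091.3e", 1.0),  # repeated row
]
totals = fractional_counts(rows)
print(totals)
```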

# **PART II - Mobility Networks**

## 3.0 Network Flows

We will split the calculation of the network flows:
1. Institutions
2. Cities
3. Countries


---



We can later think of customisable layers such as NUTS2, etc.

### 3.1 Cross-Institutional Flows

In [None]:
%%bigquery --project $project_id 
# now we have everything we need to construct the flows at the institutional level
create or replace table cshdimensionstest.test.flows_1980_2000 as 
  select distinct
    a.researcher_ids,
  --  a.id as pub1,
  --  b.id as pub2,
    a.research_orgs as unit1,
    b.research_orgs as unit2,
    a.t as t1,
    b.t as t2,
    a.year as p1,
    b.year as p2,
  --  a.aff_weight as aff_w1,
  --  b.aff_weight as aff_w2,
   -- a.is_multi_affiliation as is_multi1,
   -- b.is_multi_affiliation as is_multi2,
   -- a.is_origin as is_origin1,
   -- b.is_origin as is_origin2,
    a.is_origin_all as is_origin1,
    b.is_origin_all as is_origin2
  from
        cshdimensionstest.test.researchers_after_1980_simplified a 
    inner join
        cshdimensionstest.test.researchers_after_1980_simplified b 
        on a.researcher_ids = b.researcher_ids 
  where
        a.t < b.t and a.t = b.t - 1 
        and a.research_orgs != b.research_orgs;
  --      and a.researcher_ids = 'ur.01000000367.29';
  --order by
  --  a.researcher_ids,
  --  a.t,
  --  b.t;

create or replace table cshdimensionstest.test.flows_1980_2000_without_origins as  
select *
from cshdimensionstest.test.flows_1980_2000
where is_origin1 = 0 or is_origin2 = 0;

drop table if exists cshdimensionstest.test.flows_1980_2000;

create or replace table cshdimensionstest.test.flows_1980_2000 as  
select * 
from cshdimensionstest.test.flows_1980_2000_without_origins;

drop table if exists cshdimensionstest.test.flows_1980_2000_without_origins;

# check the table
select * from cshdimensionstest.test.flows_1980_2000 
order by researcher_ids, t1,t2, unit1, unit2 
limit 50;

Unnamed: 0,researcher_ids,unit1,unit2,t1,t2,p1,p2,is_origin1,is_origin2
0,ur.01000000255.40,grid.136593.b,grid.258799.8,1,2,1980,1981,1,0
1,ur.01000000255.40,grid.416963.f,grid.258799.8,1,2,1980,1981,1,0
2,ur.01000000352.51,grid.10253.35,grid.6553.5,1,2,1993,1994,1,0
3,ur.01000000352.51,grid.6553.5,grid.10253.35,2,3,1994,1997,0,1
4,ur.01000000367.29,grid.47840.3f,grid.1214.6,4,5,1995,1996,1,0
5,ur.01000000367.29,grid.47840.3f,grid.34477.33,4,5,1995,1996,1,0
6,ur.01000000367.29,grid.47840.3f,grid.423213.5,4,5,1995,1996,1,0
7,ur.01000000367.29,grid.47840.3f,grid.453560.1,4,5,1995,1996,1,0
8,ur.01000000532.21,grid.425618.c,grid.483432.a,1,2,1992,1997,0,0
9,ur.01000000532.21,grid.433823.d,grid.483432.a,1,2,1992,1997,0,0


*   Attention: this is the most computationally expensive table, so be careful not to run it too many times.
*   Now we are ready to aggregate the flows.
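For intuition, the self-join behind the flows table can be sketched in plain Python: pair each researcher's rows at step `t` with their rows at step `t + 1`, keep pairs with different institutions, and drop pairs where both ends are marked as origins. The helper name `build_flows` is hypothetical; the example rows reproduce researcher `ur.01000000352.51` from the output above.

```python
from collections import defaultdict


def build_flows(rows):
    """rows: iterable of (researcher_id, org, year, t, is_origin_all)."""
    by_step = defaultdict(list)
    for researcher_id, org, year, t, origin in rows:
        by_step[(researcher_id, t)].append((org, year, origin))

    flows = set()  # a set mirrors SELECT DISTINCT
    for (researcher_id, t), sources in by_step.items():
        targets = by_step.get((researcher_id, t + 1), [])
        for org1, p1, origin1 in sources:
            for org2, p2, origin2 in targets:
                # a move needs two different institutions, and at least one non-origin end
                if org1 != org2 and (origin1 == 0 or origin2 == 0):
                    flows.add((researcher_id, org1, org2, t, t + 1, p1, p2, origin1, origin2))
    return sorted(flows)


rows = [
    ("ur.01000000352.51", "grid.10253.35", 1993, 1, 1),
    ("ur.01000000352.51", "grid.6553.5", 1994, 2, 0),
    ("ur.01000000352.51", "grid.10253.35", 1997, 3, 1),
]
flows = build_flows(rows)
for flow in flows:
    print(flow)
```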

### 3.2 Indicators


*   **Total Flows**: the flows exchanged between two institutions in a given year
  - note that we cannot compute this indicator for individual institutions; it is defined only for a source-destination pair of institutions
*   **Outflows**: total number of researchers leaving an institution
*   **Inflows**: total number of researchers entering an institution
*   We take the destination date `date_d` as the 'moving date'
*   **Net Flows**: the difference between Inflows and Outflows
  - if *negative*, then the institution is losing researchers
  - if *positive*, then the institution is gaining researchers
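The indicators listed above can be sketched on a handful of toy flows. This is an illustration under the assumption that each indicator counts distinct researchers per destination year, as the queries below do; the function name `flow_indicators` and the ids are made up for the example.

```python
from collections import defaultdict


def flow_indicators(flows):
    """flows: iterable of (researcher_id, origin_org, destination_org, destination_year)."""
    pairs, leaving, entering = defaultdict(set), defaultdict(set), defaultdict(set)
    for researcher_id, origin, destination, year in flows:
        pairs[(origin, destination, year)].add(researcher_id)
        leaving[(origin, year)].add(researcher_id)
        entering[(destination, year)].add(researcher_id)

    total_flows = {key: len(ids) for key, ids in pairs.items()}
    # use every org-year seen on either side, so outflow-only institutions are kept
    net_flows = {
        key: len(entering.get(key, ())) - len(leaving.get(key, ()))
        for key in set(entering) | set(leaving)
    }
    return total_flows, net_flows


flows = [
    ("ur.1", "grid.A", "grid.B", 1990),
    ("ur.2", "grid.A", "grid.B", 1990),
    ("ur.3", "grid.B", "grid.A", 1990),
    ("ur.1", "grid.A", "grid.C", 1990),
]
total_flows, net_flows = flow_indicators(flows)
print(total_flows)
print(net_flows)
```

Because researchers are counted once per institution and year, `ur.1` moving from `grid.A` to both `grid.B` and `grid.C` adds only one outflow for `grid.A`.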

#### Pairwise Flows
 **Total Flows**: the flows exchanged between two institutions in a given calendar year

In [29]:
%%bigquery --project $project_id

-- Calculate the total flows between institutional pairs
CREATE OR REPLACE TABLE cshdimensionstest.test.flows_1980_2000_institutional_flows AS
SELECT 
  unit1 AS geoid_o,
  unit2 AS geoid_d,
  p2 AS date_d,
  COUNT(DISTINCT researcher_ids) AS total_flows
FROM 
  cshdimensionstest.test.flows_1980_2000
GROUP BY 
  geoid_o, 
  geoid_d, 
  date_d;

-- Check the table 
SELECT 
  * 
FROM 
  cshdimensionstest.test.flows_1980_2000_institutional_flows 
ORDER BY 
  geoid_o, 
  geoid_d, 
  date_d 
LIMIT 50;


Unnamed: 0,geoid_o,geoid_d,date_d,total_flows
0,grid.1001.0,grid.1002.3,1982,2
1,grid.1001.0,grid.1002.3,1983,3
2,grid.1001.0,grid.1002.3,1984,8
3,grid.1001.0,grid.1002.3,1985,4
4,grid.1001.0,grid.1002.3,1986,12
5,grid.1001.0,grid.1002.3,1987,10
6,grid.1001.0,grid.1002.3,1988,14
7,grid.1001.0,grid.1002.3,1989,8
8,grid.1001.0,grid.1002.3,1990,26
9,grid.1001.0,grid.1002.3,1991,21


#### *Outflows*
**Outflows**: Total number of researchers leaving an institution in a given calendar year

In [55]:
%%bigquery --project $project_id

# calculate the total outflows by calendar year
CREATE OR REPLACE TABLE cshdimensionstest.test.flows_1980_2000_institutional_outflows AS
SELECT unit1 AS geoid_o,
   --    unit2 AS geoid_d,
   --    p1 AS date_o,
       p2 AS date_d,
       COUNT(DISTINCT researcher_ids) outflows
FROM cshdimensionstest.test.flows_1980_2000
GROUP BY geoid_o, date_d;

# check table 
select * 
from cshdimensionstest.test.flows_1980_2000_institutional_outflows 
order by geoid_o, date_d 
limit 5;  # limit the result to 5 rows for viewing purposes

Unnamed: 0,geoid_o,date_d,outflows
0,grid.1001.0,1981,25
1,grid.1001.0,1982,40
2,grid.1001.0,1983,76
3,grid.1001.0,1984,138
4,grid.1001.0,1985,160


#### *Inflows*
 **Inflows**: Total number of researchers entering an institution in a given calendar year

In [56]:
%%bigquery --project $project_id
# The cell magic above selects the BigQuery project

# Calculate the total inflows
CREATE OR REPLACE TABLE cshdimensionstest.test.flows_1980_2000_institutional_inflows AS
SELECT 
  -- unit1 AS geoid_o,   # commented out as it is not being used in the query
      unit2 AS geoid_d,  # rename column unit2 to geoid_d
  -- p1 AS date_o,        # commented out as it is not being used in the query
       p2 AS date_d,    # rename column p2 to date_d
       COUNT(DISTINCT researcher_ids) inflows  # count the number of unique researcher_ids
FROM cshdimensionstest.test.flows_1980_2000
GROUP BY geoid_d, date_d;  # group by destination geoid and date

# Check the created table
SELECT * 
FROM cshdimensionstest.test.flows_1980_2000_institutional_inflows 
ORDER BY geoid_d, date_d 
LIMIT 5;  # limit the result to 5 rows for viewing purposes

Unnamed: 0,geoid_d,date_d,inflows
0,grid.1001.0,1981,12
1,grid.1001.0,1982,46
2,grid.1001.0,1983,75
3,grid.1001.0,1984,106
4,grid.1001.0,1985,146


 #### *Net Flows*
 **Net Flows, number**: the difference between Inflows and Outflows
  - if *`negative`*, then the institution is losing researchers
  - if *`positive`*, then the institution is gaining researchers

  Note: there are more appropriate metrics of net mobility that can be found here: https://unstats.un.org/unsd/statcom/48th-session/documents/BG-4a-Migration-Handbook-E.pdf

**Net migration rate**: the difference between the number of people moving into a place and the number of people moving out of that place, expressed as a proportion of the population.

<p>Mathematically: N = (I - O) / P * 100</p>
<p>Where:</p>
<ul>
  <li>N = Net migration rate</li>
  <li>I = Number of people moving into a place (inflows)</li>
  <li>O = Number of people moving out of a place (outflows)</li>
  <li>P = Population</li>
</ul>
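A worked example of the formula, using the values reported in this notebook for `grid.1001.0` in 1981 and 1982. The function name `net_migration_rate` is illustrative; the rounding to five decimals follows the query used below.

```python
def net_migration_rate(inflows, outflows, population, ndigits=5):
    """N = (I - O) / P * 100, rounded to `ndigits` decimals."""
    if population <= 0:
        raise ValueError("population must be positive")
    return round((inflows - outflows) / population * 100, ndigits)


# grid.1001.0 in 1981: 12 inflows, 25 outflows, 252 publishing researchers
rate_1981 = net_migration_rate(12, 25, 252)
# grid.1001.0 in 1982: 46 inflows, 40 outflows, 339 publishing researchers
rate_1982 = net_migration_rate(46, 40, 339)
print(rate_1981, rate_1982)
```

The 1981 value is negative because more researchers left the institution than joined it that year.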


In [None]:
%%bigquery --project $project_id
# to calculate the net migration rate we need to know the population first
# we count all publishing authors in the dataset
create or replace table cshdimensionstest.test.population_1980_2000 as
select research_orgs, year, count(distinct researcher_ids) as population_year
from cshdimensionstest.test.researchers_after_1980_simplified 
group by research_orgs, year;

In [12]:
%%bigquery --project $project_id
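# Combine the inflows and outflows into one table
# note: the LEFT JOIN keeps only org-years that have inflows; org-years with outflows only are dropped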
CREATE OR REPLACE TABLE cshdimensionstest.test.flows_1980_2000_institution_flows_indicators AS
SELECT ifnull(geoid_o, geoid_d) as org
, ifnull(a.date_d, b.date_d) as date_d
, ifnull(inflows, 0) as inflows
, ifnull(outflows, 0) as outflows
FROM cshdimensionstest.test.flows_1980_2000_institutional_inflows a 
LEFT JOIN cshdimensionstest.test.flows_1980_2000_institutional_outflows b
ON a.geoid_d=b.geoid_o and a.date_d=b.date_d;

Unnamed: 0,org,date_d,inflows,outflows
0,grid.1001.0,1981,12,25
1,grid.1001.0,1982,46,40
2,grid.1001.0,1983,75,76
3,grid.1001.0,1984,106,138
4,grid.1001.0,1985,146,160
5,grid.1001.0,1986,166,184
6,grid.1001.0,1987,253,212
7,grid.1001.0,1988,278,314
8,grid.1001.0,1989,350,335
9,grid.1001.0,1990,400,439


In [23]:
%%bigquery --project $project_id
# merge the population counts per year with the inflows and outflows table
# calculate the net migration rate
create or replace table cshdimensionstest.test.flows_1980_2000_institutional_in_out_net as
select a.*
  , b.inflows
  , b.outflows
  , round((inflows - outflows) / population_year * 100, 5) AS net_migration_rate
from cshdimensionstest.test.population_1980_2000 a
left join cshdimensionstest.test.flows_1980_2000_institution_flows_indicators b 
  on b.org=a.research_orgs 
  and b.date_d=a.year;

# Check the created table
SELECT * 
FROM cshdimensionstest.test.flows_1980_2000_institutional_in_out_net 
ORDER BY research_orgs, year 
LIMIT 50;  # limit the result to 50 rows for viewing purposes

Unnamed: 0,research_orgs,year,population_year,inflows,outflows,net_migration_rate
0,grid.1001.0,1980,202,,,
1,grid.1001.0,1981,252,12.0,25.0,-5.15873
2,grid.1001.0,1982,339,46.0,40.0,1.76991
3,grid.1001.0,1983,392,75.0,76.0,-0.2551
4,grid.1001.0,1984,512,106.0,138.0,-6.25
5,grid.1001.0,1985,548,146.0,160.0,-2.55474
6,grid.1001.0,1986,597,166.0,184.0,-3.01508
7,grid.1001.0,1987,760,253.0,212.0,5.39474
8,grid.1001.0,1988,841,278.0,314.0,-4.28062
9,grid.1001.0,1989,880,350.0,335.0,1.70455


The net migration rate is an indicator that provides insight into the balance between the number of people moving into and out of a place. It is expressed as a percentage of the population, making it a useful measure for comparing the migration trends of different areas. A positive net migration rate indicates that there are more people moving into the place than moving out, while a negative rate indicates that there are more people leaving than arriving.

Here are some ways to interpret the net migration rate:

*   A positive net migration rate indicates growth in the population due to migration, which can be seen as a positive sign for the economy, job market, and overall development of the area.
*   A high positive net migration rate may indicate a strong pull factor, such as a growing economy, high quality of life, or attractive amenities.
*   A negative net migration rate indicates a decrease in the population due to migration, which may indicate that people are leaving the area due to factors such as a declining economy, high crime rates, or a lack of job opportunities.
*   A low negative net migration rate may indicate a minor trend of people leaving the area, while a high negative rate may indicate a more serious and persistent trend of population loss.

It is important to note that the net migration rate is only one of many indicators that can be used to understand migration trends, and it is important to consider other factors such as age, education, and employment status of the population in order to gain a full understanding of the migration patterns in a given area.

In [None]:
%%bigquery --project $project_id
# Delete any redundant tables from GBQ
drop table if exists cshdimensionstest.test.flows_1980_2000_institutional_inflows;
drop table if exists cshdimensionstest.test.flows_1980_2000_institutional_outflows;
drop table if exists cshdimensionstest.test.flows_1980_2000_institution_flows_indicators;

**Now we compute other basic indicators such as:**
 - Absence of mobility, number;


### 3.3 Python

In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

sql = """
  SELECT *
  FROM `cshdimensionstest.test.aggregated_moved_to_00_02` 
  order by geoid_o, date_o, date_d, catid_o
"""
movedto_edges = client.query(sql).to_dataframe()
movedto_edges.head(10)

# save the dataset
movedto_edges.to_csv('movedto_edges.csv')
files.download('movedto_edges.csv')

Unnamed: 0,geoid_o,geoid_d,catid_o,catid_d,date_o,date_d,date_range,weighted_flows,flows
0,grid.1001.0,grid.1003.2,2330,2366,2000,2001,2000-2002,1.0,1
1,grid.1001.0,grid.32197.3e,2330,2933,2000,2001,2000-2002,1.0,1
2,grid.1001.0,grid.508487.6,2330,2330,2000,2001,2000-2002,0.666667,2
3,grid.1001.0,grid.12136.37,2330,2409,2000,2001,2000-2002,0.666667,2
4,grid.1001.0,grid.264756.4,2330,2330,2000,2001,2000-2002,0.5,1
5,grid.1001.0,grid.8127.c,2330,2447,2000,2001,2000-2002,0.833333,2
6,grid.1001.0,grid.117476.2,2330,2921,2000,2001,2000-2002,1.0,2
7,grid.1001.0,grid.8484.0,2330,2746,2000,2001,2000-2002,0.333333,1
8,grid.1001.0,grid.5596.f,2330,2933,2000,2001,2000-2002,0.5,1
9,grid.1001.0,grid.5333.6,2330,2921,2000,2001,2000-2002,0.333333,1


In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

sql = """
  SELECT *
  FROM `cshdimensionstest.test.total_inflows_00_02` 
"""

inflows = client.query(sql).to_dataframe()

#inflows.to_csv('inflows.csv')
#!cp inflows.csv "gdrive/My Drive/CSH-DIMENSIONS Flows Test/BigQuery-results"

sql = """
  SELECT *
  FROM `cshdimensionstest.test.total_outflows_00_02` 
"""
outflows = client.query(sql).to_dataframe()

#from google.colab import files
#files.download('inflows.csv')
#outflows.to_csv('outflows.csv')
#!cp outflows.csv "gdrive/My Drive/CSH-DIMENSIONS Flows Test/BigQuery-results"
#from google.colab import files
#files.download('outflows.csv')

In [None]:
#@title Hidden Cell
inflows.sort_values(["geoid_d", "catid_d", "date_d"]).head(10)

Unnamed: 0,geoid_d,catid_d,date_d,date_range,inflows,weightedInflows
4666,grid.1001.0,2330,2001,2000-2002,208,107.833333
4759,grid.1001.0,2330,2002,2000-2002,234,162.666667
7252,grid.1001.0,2344,2001,2000-2002,537,272.583333
9142,grid.1001.0,2344,2002,2000-2002,419,241.666667
12230,grid.1001.0,2353,2001,2000-2002,96,63.5
11395,grid.1001.0,2353,2002,2000-2002,29,10.0
16908,grid.1001.0,2358,2001,2000-2002,212,150.5
16441,grid.1001.0,2358,2002,2000-2002,380,159.916667
18109,grid.1001.0,2366,2002,2000-2002,43,37.0
20890,grid.1001.0,2377,2001,2000-2002,19262,2502.594061


In [None]:
#@title Hidden Cell
outflows.sort_values(["geoid_o", "catid_o", "date_d"]).head(10)

Unnamed: 0,geoid_o,catid_o,date_d,date_range,t_outflows,t_weightedOutflows
4761,grid.1001.0,2330,2001,2000-2002,501,243.75
3223,grid.1001.0,2330,2002,2000-2002,311,154.666667
9245,grid.1001.0,2344,2001,2000-2002,498,243.5
9032,grid.1001.0,2344,2002,2000-2002,684,354.423077
11374,grid.1001.0,2353,2001,2000-2002,37,18.5
12302,grid.1001.0,2353,2002,2000-2002,187,75.50641
15729,grid.1001.0,2358,2001,2000-2002,346,181.166667
14254,grid.1001.0,2358,2002,2000-2002,275,113.044872
17733,grid.1001.0,2366,2001,2000-2002,48,20.333333
17388,grid.1001.0,2366,2002,2000-2002,23,9.5


In [None]:
import pandas as pd

# merge the inflows and outflows dataframes on institution, field, and year
result = pd.merge(inflows
                  , outflows
                  , how="outer"
                  , left_on=["geoid_d", "catid_d", "date_d"]
                  , right_on=["geoid_o", "catid_o", "date_d"]
                  ).reset_index(drop = True)

# net mobility = inflows - outflows
# (NaN where an institution/field/year appears on only one side of the outer merge)
result["net_mobility"] = result['inflows'] - result['t_outflows']
result["weighted_net_mobility"] = result['weightedInflows'] - result['t_weightedOutflows']
#result.sort_values(["geoid_o", "catid_o", "date_d"]).head(10)

# rename the columns and keep only the indicator columns
flow_ind = result.rename(columns = {'date_d': 'MoveYear'
                         , 't_outflows': 'outflows'
                         , 't_weightedOutflows': 'weightedOutflows'
                         , 'date_range_x':'Range'
                         , 'net_mobility':'NetFlows'
                         , 'weighted_net_mobility': 'WeightedNetFlows'}) \
                         [[  'geoid_d', 'catid_d', 'inflows', 'weightedInflows'\
                           , 'geoid_o', 'catid_o', 'outflows', 'weightedOutflows'\
                           , 'NetFlows', 'WeightedNetFlows', 'MoveYear', 'Range']]

# save the indicators to a csv file
#flow_ind.to_csv('Flows_indicators.csv')
#files.download('Flows_indicators.csv')
flow_ind.sort_values(["geoid_o", "catid_o", "MoveYear"]).head(10)

# store dataset directly into GBQ and DRIVE
# store in drive
flow_ind.to_csv('2023_01_08_flows_output.csv', encoding = 'utf-8-sig') 

# store in GBQ
# import pandas_gbq
# table_id = 'test.2023_01_08_flows_output'
# pandas_gbq.to_gbq(flow_ind, table_id, project_id=project_id)

flow_ind.head(1)
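The outer merge in the cell above leaves `NaN` wherever an institution, field, and year combination appears in only one of the two tables, so the net flow is `NaN` for those rows. The sketch below reproduces the same merge on made-up toy data and shows one possible way to handle it, assuming a missing side can be read as zero flow. The `_toy` frames, the `geoid` column, and the zero-fill choice are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the inflows and outflows tables (values are made up).
inflows_toy = pd.DataFrame({
    "geoid_d": ["grid.A", "grid.B"],
    "catid_d": [1, 1],
    "date_d": [2001, 2001],
    "inflows": [10, 4],
})
outflows_toy = pd.DataFrame({
    "geoid_o": ["grid.A", "grid.C"],
    "catid_o": [1, 1],
    "date_d": [2001, 2001],
    "t_outflows": [3, 7],
})

merged = pd.merge(
    inflows_toy,
    outflows_toy,
    how="outer",
    left_on=["geoid_d", "catid_d", "date_d"],
    right_on=["geoid_o", "catid_o", "date_d"],
)

# One institution id per row, whichever side the row came from.
merged["geoid"] = merged["geoid_d"].fillna(merged["geoid_o"])

# Assumption: a missing side means zero flow (NaN could also mean "unknown").
flow_cols = ["inflows", "t_outflows"]
merged[flow_cols] = merged[flow_cols].fillna(0)
merged["net_flows"] = merged["inflows"] - merged["t_outflows"]

net = merged.set_index("geoid")["net_flows"].to_dict()
print(net)
```

Whether to fill with zero or keep `NaN` depends on whether the source tables are complete for the period: a missing row is a true zero only if every institution with any flow is guaranteed to appear.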

## **3.0 Flows**

**For each institution id and year we compute the following basic indicators:**

1.   Institution id
2.   Year
3. pcp
4. workforce (# of researchers per year)
5. net mobility (ok)
6. avg academic age
7. total author inflow (ok)
8. total author outflow (ok)

### **3.1 Filtering the edges**
**We need to do some cleaning because the table contains edges that should not be there**

1. Filter out all edges between the origin institutions at t1=1 from table `cshdimensionstest.test.flows_with_start_00_02` 
and create a filtered table `cshdimensionstest.test.flows_with_start_00_02_filtered`

*Note: Let's do it like this for now but we need to incorporate this in the creation of the network at some point*

In [None]:
%%bigquery --project $project_id

CREATE OR REPLACE TABLE cshdimensionstest.test.flows_with_start_00_02_filtered as

WITH filtered AS (
SELECT *
FROM `cshdimensionstest.test.flows_with_start_00_02`
WHERE t1 = 0 AND mobility_type = 'started in'
), grouped AS (
SELECT researcher_ids, ARRAY_AGG(unit2) unit2_values
FROM filtered
GROUP BY researcher_ids
)
SELECT t1.*
FROM `cshdimensionstest.test.flows_with_start_00_02` t1
JOIN grouped t2
ON t1.researcher_ids = t2.researcher_ids
WHERE t1.t1 <> 1
OR NOT (t1.t1 = 1 AND t1.unit2 IN UNNEST(t2.unit2_values) AND t1.mobility_type <> 'stayed in')

In [None]:
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

table_id = 'cshdimensionstest.test.flows_with_start_00_02_filtered'
table = client.get_table(table_id)

# print the names of all columns in the table
print([field.name for field in table.schema])

['researcher_ids', 'pub1', 'pub2', 'field1', 'field2', 'unit1', 'unit2', 't1', 'p1', 't2', 'p2', 'aff_weight', 'mobility_type']


In [None]:
%%bigquery --project $project_id

# create a new table with the rows where mobility_type is 'stayed in'
CREATE OR REPLACE TABLE cshdimensionstest.test.df_stayed_in AS (
  SELECT researcher_ids, pub1, pub2, field1, field2, unit1, unit2, t1, p1, t2, p2, aff_weight,mobility_type
  FROM cshdimensionstest.test.flows_with_start_00_02_filtered
  WHERE mobility_type = 'stayed in' and  researcher_ids = 'ur.0676562433.28' 
  --GROUP BY t1, t2, unit2, researcher_ids
);

# create a new table with the rows where mobility_type is 'moved to'
CREATE OR REPLACE TABLE cshdimensionstest.test.df_moved_to AS (
  SELECT researcher_ids, pub1, pub2, field1, field2, unit1, unit2, t1, p1, t2, p2, aff_weight,mobility_type
  FROM cshdimensionstest.test.flows_with_start_00_02_filtered
  WHERE mobility_type = 'moved to'  and  researcher_ids = 'ur.0676562433.28' 
  --GROUP BY t1, t2, unit2, researcher_ids
);

# anti-join on t1, t2, unit2, and researcher_ids: keep only the 'stayed in' rows that have no matching 'moved to' row
CREATE OR REPLACE TABLE cshdimensionstest.test.df_filtered AS (
  SELECT s.*
  FROM cshdimensionstest.test.df_stayed_in s
  LEFT JOIN cshdimensionstest.test.df_moved_to m
  USING (t1, t2, unit2, researcher_ids)
  WHERE m.researcher_ids IS NULL  and  researcher_ids = 'ur.0676562433.28' 
);
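The `LEFT JOIN ... WHERE m.researcher_ids IS NULL` pattern in the last statement is an anti-join: it keeps the 'stayed in' rows that have no 'moved to' counterpart on the same keys. A pandas equivalent can be sketched with the `indicator` option of `merge`. The toy frames below are made up for illustration; only the column names follow the table schema.

```python
import pandas as pd

keys = ["t1", "t2", "unit2", "researcher_ids"]

# Toy 'stayed in' and 'moved to' rows (made-up values).
stayed = pd.DataFrame({
    "researcher_ids": ["r1", "r1", "r2"],
    "t1": [1, 1, 1],
    "t2": [2, 2, 2],
    "unit2": ["grid.A", "grid.B", "grid.A"],
})
moved = pd.DataFrame({
    "researcher_ids": ["r1"],
    "t1": [1],
    "t2": [2],
    "unit2": ["grid.B"],
})

# Left-join on the keys; `_merge` records where each row was found.
flagged = stayed.merge(
    moved[keys].drop_duplicates(), on=keys, how="left", indicator=True
)

# Keep the rows that exist only on the 'stayed in' side.
stayed_only = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(stayed_only)
```

Deduplicating the right-hand keys first keeps the left join from multiplying 'stayed in' rows when a key matches several 'moved to' rows.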

In [None]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.df_filtered` 
--order by t1, researcher_ids, p1, t2, p2, unit1, unit2
--limit  20

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.0676562433.28,pub.1002325839,pub.1001389009,2401,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
1,ur.0676562433.28,pub.1002325839,pub.1001389009,2401,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
2,ur.0676562433.28,pub.1002325839,pub.1001389009,2415,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
3,ur.0676562433.28,pub.1002325839,pub.1001389009,2415,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
4,ur.0676562433.28,pub.1019988787,pub.1001389009,2480,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,ur.0676562433.28,pub.1042389624,pub.1060822875,2913,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
87,ur.0676562433.28,pub.1042389624,pub.1060822875,2913,2421,grid.16750.35,grid.16750.35,1,2000,2,2001,0.333333,stayed in
88,ur.0676562433.28,pub.1060597145,pub.1060822875,2921,2421,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in
89,ur.0676562433.28,pub.1060597145,pub.1060822875,2921,2421,grid.16750.35,grid.16750.35,1,2000,2,2001,0.333333,stayed in


In [None]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
where   researcher_ids = 'ur.0676562433.28' 
order by t1, p1, t2, p2, pub1, pub2, unit1, unit2
limit  200

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.0676562433.28,void,pub.1002325839,3021,3021,void,grid.419859.8,0,2000,1,2000,0.500000,started in
1,ur.0676562433.28,void,pub.1002325839,2401,2401,void,grid.419859.8,0,2000,1,2000,0.500000,started in
2,ur.0676562433.28,void,pub.1002325839,2415,2415,void,grid.419859.8,0,2000,1,2000,0.500000,started in
3,ur.0676562433.28,void,pub.1002325839,2401,2401,void,grid.425806.d,0,2000,1,2000,0.500000,started in
4,ur.0676562433.28,void,pub.1002325839,3021,3021,void,grid.425806.d,0,2000,1,2000,0.500000,started in
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419330.c,grid.1005.4,1,2000,2,2001,0.333333,moved to
196,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419859.8,grid.1005.4,1,2000,2,2001,0.333333,moved to
197,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2421,grid.419859.8,grid.1005.4,1,2000,2,2001,0.333333,moved to
198,ur.0676562433.28,pub.1042389624,pub.1001389009,2913,2389,grid.419859.8,grid.419859.8,1,2000,2,2001,0.333333,stayed in


In [None]:
%%bigquery --project $project_id

select * from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
where researcher_ids = 'ur.0676562433.28' and unit2 = 'grid.36425.36'
order by t1, researcher_ids, p1, t2, p2, unit1, unit2
limit 100

In [None]:
%%bigquery --project $project_id

select researcher_ids, unit2, count(distinct mobility_type) as t_mob_type,  t1, p1, t2, p2
from `cshdimensionstest.test.flows_with_start_00_02_filtered` 
group by researcher_ids, unit2,  t1, p1, t2, p2
having count(distinct mobility_type) > 1
order by t1
limit 1000

In [None]:
"""
# Filter the dataframe to include only rows with t1 = 0 and mobility_type = 'started in'
df_filtered = df_test[(df_test['t1'] == 0) & (df_test['mobility_type'] == 'started in')]

# Group the dataframe by id
df_grouped = df_filtered.groupby('researcher_ids')

# Iterate through the groups and store the values of unit2 for each group
unit2_values = {}
for name, group in df_grouped:
    unit2_values[name] = set(group['unit2'].tolist())

# Iterate through the original dataframe and remove rows where unit2 and t1 = 1 appear in the same row
# and unit2 also appears in t1 = 0 for the same id
to_remove = []
for index, row in df_test.iterrows():
    # .get avoids a KeyError for researchers without a 'started in' row
    if (row['t1'] == 1) and (row['unit2'] in unit2_values.get(row['researcher_ids'], set())):
        to_remove.append(index)

# Drop the rows from the dataframe
df_test.drop(to_remove, inplace=True)
"""

In [None]:
query = """
SELECT 
  outer_query.unit2 as id
  ,outer_query.p2 as pub_year
  ,'2000-2002' as date_range
  ,count(distinct outer_query.researcher_ids) as t_workforce
  ,subquery.t_initial_affiliations
  ,subquery2.t_stayers
  ,subquery3.t_incomers
  ,subquery4.t_outgoers
FROM `cshdimensionstest.test.flows_with_start_00_02_filtered` AS outer_query
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_initial_affiliations
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'started in'
  GROUP BY unit2, p2
) AS subquery
ON outer_query.unit2 = subquery.unit2 AND outer_query.p2 = subquery.p2
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_stayers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'stayed in' 
  GROUP BY unit2, p2
) AS subquery2
ON outer_query.unit2 = subquery2.unit2 AND outer_query.p2 = subquery2.p2
LEFT JOIN (
  SELECT unit2, p2, COUNT(distinct researcher_ids) AS t_incomers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'moved to'
  GROUP BY unit2, p2
) AS subquery3
ON outer_query.unit2 = subquery3.unit2 AND outer_query.p2 = subquery3.p2
LEFT JOIN (
  SELECT unit1, p2, COUNT(distinct researcher_ids) AS t_outgoers
  FROM `cshdimensionstest.test.flows_with_start_00_02_filtered`
  WHERE mobility_type = 'moved to'
  GROUP BY unit1, p2
) AS subquery4
ON outer_query.unit2 = subquery4.unit1 AND outer_query.p2 = subquery4.p2
GROUP BY outer_query.unit2, outer_query.p2, subquery.t_initial_affiliations, subquery2.t_stayers, subquery3.t_incomers, subquery4.t_outgoers
ORDER BY outer_query.unit2, outer_query.p2 
LIMIT 10
"""
query_job = client.query(query)
df = query_job.to_dataframe()
df
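To check the query's logic on a small sample, the same indicators can be recomputed in pandas. The sketch below uses made-up toy rows and a hypothetical helper, `distinct_researchers`. It counts distinct researchers per institution and year for each mobility type, and keeps only institutions that appear as `unit2`, as the `LEFT JOIN`s in the query do.

```python
import pandas as pd

# Toy flows (made-up values); column names follow the filtered flows table.
flows_toy = pd.DataFrame({
    "researcher_ids": ["r1", "r2", "r3", "r4"],
    "unit1": ["grid.A", "grid.B", "void", "grid.A"],
    "unit2": ["grid.A", "grid.A", "grid.A", "grid.B"],
    "p2": [2001, 2001, 2001, 2001],
    "mobility_type": ["stayed in", "moved to", "started in", "moved to"],
})


def distinct_researchers(frame, unit_col, name):
    """Count distinct researchers per (institution, year)."""
    counts = frame.groupby([unit_col, "p2"])["researcher_ids"].nunique()
    return counts.rename(name).rename_axis(["id", "pub_year"])


started = flows_toy[flows_toy["mobility_type"] == "started in"]
stayed = flows_toy[flows_toy["mobility_type"] == "stayed in"]
moved = flows_toy[flows_toy["mobility_type"] == "moved to"]

workforce = distinct_researchers(flows_toy, "unit2", "t_workforce")
indicators = pd.concat(
    [
        workforce,
        distinct_researchers(started, "unit2", "t_initial_affiliations"),
        distinct_researchers(stayed, "unit2", "t_stayers"),
        distinct_researchers(moved, "unit2", "t_incomers"),
        distinct_researchers(moved, "unit1", "t_outgoers"),  # origin side
    ],
    axis=1,
)

# Keep only institutions seen as unit2 and show missing counts as 0.
indicators = indicators.reindex(workforce.index).fillna(0).astype(int)
print(indicators)
```

The query returns `NULL` where a count is missing; filling with 0 here is a presentation choice for the toy output.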

# PART III - Coverage Indicators

1. **Make a hello world program**
1. **Connect resources to each other:**
 e.g., can I show a GBQ table on a website?
1. **Other considerations**
* How can queries run fast enough that users see no delays?
* What does the interface look like?
* How can all calculations be put in one query?
* How can the web interface be connected to Google BigQuery?
* What happens to performance if multiple users use it at once?
