<a href="https://colab.research.google.com/github/MarciaFG/skill-flow/blob/main/Flows_test_2000_2002_FOR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Academic Mobility Flows using BigQuery and Firebase**

Author: Marcia R. Ferreira (Complexity Science Hub Vienna & TU Wien)

Date: September 28, 2022

Input: Affiliation Trajectories

Model: 

Output: 

Other notes: 

# Colab Initialization

In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Fri Nov  4 09:40:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    42W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Import required libraries

In [4]:
import numpy as np
import requests
import pandas as pd
from tqdm import tqdm
import torch
import nltk
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
from google.cloud import bigquery
import humanize

In [5]:
%load_ext google.colab.data_table

In [6]:
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


# Download and Load Data from Google BigQuery

### Declare the Cloud project ID which will be used throughout this notebook

In [7]:
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

project_id = "cshdimensionstest"

%load_ext google.cloud.bigquery

# set up query parameters, e.g. for a specific journal
bq_params = {}
bq_params["journal_id"] = "jour.1115214"

Authenticated


In [8]:
%%bigquery --params $bq_params --project $project_id
# test to see if it is working correctly (the cell magic has to be the first line of the cell)

select distinct 
  journal.id, journal.title, journal.issn, journal.eissn, publisher.name, date_inserted
from `dimensions-ai.data_analytics.publications` 
where  journal.id = @journal_id
and publisher is not null
order by date_inserted desc
limit 1

Unnamed: 0,id,title,issn,eissn,name,date_inserted
0,jour.1115214,Nature Biotechnology,1087-0156,1546-1696,Springer Nature,2022-11-01 21:22:55+00:00


### OK, it works. Let's start!

# **1. Extract Dimensions Data from Google BigQuery**
This script extracts test data for Liu.

In [28]:
%%bigquery --project $project_id 

drop table if exists `cshdimensionstest.test.basic_2000_2002`

In [None]:
%%bigquery --project $project_id
# Constructing the mobility flows for the FOR categorization

create table cshdimensionstest.test.basic_2000_2002 as 
select   p.id,
         p.year,
         p.date,
         researcher_ids,
         research_orgs,
         category_for.name,
         category_for.id as cat_id
from     `dimensions-ai.data_analytics.publications` p
        left join unnest(p.researcher_ids) researcher_ids
        left join unnest(p.research_orgs) research_orgs
        left join unnest(p.category_for.second_level.FULL) category_for
where    researcher_ids is not null
and      research_orgs is not null
and      category_for.name is not null
and      category_for.id is not null
and      year between 2000 and 2002
order by p.id;

In [37]:
%%bigquery --project $project_id 
select * from `cshdimensionstest.test.basic_2000_2002`
limit 10


Unnamed: 0,id,year,date,researcher_ids,research_orgs,name,cat_id
0,pub.1000000033,2002,2002-01,ur.013211012560.00,grid.4643.5,Biomedical Engineering,2837
1,pub.1000000033,2002,2002-01,ur.01300514341.04,grid.4643.5,Materials Engineering,2921
2,pub.1000000033,2002,2002-01,ur.0743476073.71,grid.8982.b,Biomedical Engineering,2837
3,pub.1000000033,2002,2002-01,ur.0743476073.71,grid.4643.5,Materials Engineering,2921
4,pub.1000000033,2002,2002-01,ur.01073301374.46,grid.4643.5,Materials Engineering,2921
5,pub.1000000033,2002,2002-01,ur.0671533174.24,grid.4643.5,Materials Engineering,2921
6,pub.1000000033,2002,2002-01,ur.0671533174.24,grid.4643.5,Biomedical Engineering,2837
7,pub.1000000033,2002,2002-01,ur.01164265741.55,grid.8982.b,Materials Engineering,2921
8,pub.1000000033,2002,2002-01,ur.01300514341.04,grid.8982.b,Biomedical Engineering,2837
9,pub.1000000033,2002,2002-01,ur.0671533174.24,grid.8982.b,Materials Engineering,2921
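
The rows above repeat a single publication once for every combination of researcher, organisation, and category, because each `unnest` in the extraction query multiplies the rows. A minimal plain-Python sketch of that behaviour follows; the record and the `flatten` helper are hypothetical and only mimic the shape of the Dimensions data.

```python
from itertools import product

# Hypothetical toy record shaped like one publication with nested arrays.
publication = {
    "id": "pub.1",
    "year": 2002,
    "researcher_ids": ["ur.a", "ur.b"],
    "research_orgs": ["grid.x", "grid.y"],
    "category_for": [("Biomedical Engineering", 2837), ("Materials Engineering", 2921)],
}

def flatten(pub):
    """Return one row per (researcher, organisation, category) combination."""
    return [
        {
            "id": pub["id"],
            "year": pub["year"],
            "researcher_ids": researcher,
            "research_orgs": org,
            "name": name,
            "cat_id": cat_id,
        }
        for researcher, org, (name, cat_id) in product(
            pub["researcher_ids"], pub["research_orgs"], pub["category_for"]
        )
    ]

rows = flatten(publication)
print(len(rows))  # 2 researchers x 2 organisations x 2 categories = 8
```

One consequence worth keeping in mind: every researcher on a publication is paired with every organisation on it, so the later steps inherit these combinations.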


In [None]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.sequence_00_02`

In [33]:
%%bigquery --project $project_id 
# step (1)
# now we need to construct the trajectories of researchers

 create table cshdimensionstest.test.sequence_00_02 as 
  select 
    distinct researcher_ids, 
    year, 
    dense_rank() over (
      partition by researcher_ids 
      order by 
        year asc
    ) as t 
  from 
    `cshdimensionstest.test.basic_2000_2002` 
  order by 
    researcher_ids, 
    year, 
    t;
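
To illustrate what `dense_rank()` does in step (1), here is a small plain-Python sketch on made-up data. The `records` list and the `dense_rank_years` helper are hypothetical; the point is only that `t` counts a researcher's distinct publication years in order, without gaps.

```python
from collections import defaultdict

# Hypothetical (researcher, publication year) pairs; repeats mimic several papers per year.
records = [
    ("ur.a", 2000), ("ur.a", 2000), ("ur.a", 2002),
    ("ur.b", 2001), ("ur.b", 2002),
]

def dense_rank_years(pairs):
    """Map each (researcher, year) to its 1-based rank among that researcher's distinct years."""
    years = defaultdict(set)
    for researcher, year in pairs:
        years[researcher].add(year)
    return {
        (researcher, year): rank
        for researcher, seen in years.items()
        for rank, year in enumerate(sorted(seen), start=1)
    }

sequence = dense_rank_years(records)
print(sequence[("ur.a", 2002)])  # 2: the second active year, even though 2001 is missing
```

Because the rank is dense, `t = 2` means "second year with a publication", not "one calendar year later".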

In [None]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.affweight_00_02`

In [35]:
%%bigquery --project $project_id 
# step (2)
# generate affiliation weights: 1 / number of distinct affiliations a researcher has on a publication,
# so that simultaneous affiliations share the weight of one publication
 create table cshdimensionstest.test.affweight_00_02 as 
  select 
    distinct researcher_ids, 
    id, 
    1 * 1.0 / count(distinct research_orgs) as aff_weight 
  from 
    `cshdimensionstest.test.basic_2000_2002`
  group by 
    researcher_ids, 
    id
  order by researcher_ids, id;
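
The weighting in step (2) can be sketched in plain Python as follows. The rows and the `affiliation_weights` helper are hypothetical; the logic mirrors `1.0 / count(distinct research_orgs)` grouped by researcher and publication.

```python
from collections import defaultdict

# Hypothetical (researcher, publication, organisation) rows.
rows = [
    ("ur.a", "pub.1", "grid.x"),
    ("ur.a", "pub.1", "grid.y"),
    ("ur.a", "pub.1", "grid.x"),  # repeated row, e.g. from a second category
    ("ur.a", "pub.2", "grid.x"),
]

def affiliation_weights(rows):
    """Return 1 / (number of distinct organisations) per (researcher, publication)."""
    orgs = defaultdict(set)
    for researcher, pub, org in rows:
        orgs[(researcher, pub)].add(org)
    return {key: 1.0 / len(found) for key, found in orgs.items()}

weights = affiliation_weights(rows)
print(weights[("ur.a", "pub.1")])  # 0.5, because two distinct organisations share the publication
```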

In [None]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.psequence_weight_00_02`

In [40]:
%%bigquery --project $project_id 

# step (3)
# merging results from steps 1-2
# consider using a subquery to combine these 3 steps

create table cshdimensionstest.test.psequence_weight_00_02 as 
  select
      a.researcher_ids,
      a.id,
      a.name,
      a.cat_id,
      a.year,
      a.research_orgs,
      b.t,
      c.aff_weight 
  from
      `cshdimensionstest.test.basic_2000_2002` as a 
      inner join
         `cshdimensionstest.test.sequence_00_02` as b 
         on a.researcher_ids = b.researcher_ids 
         and a.year = b.year 
      inner join
         `cshdimensionstest.test.affweight_00_02` as c 
         on c.researcher_ids = a.researcher_ids 
         and c.id = a.id 
  order by
        b.researcher_ids,
        b.year,
        b.t;

In [None]:
%%bigquery --project $project_id 
select * from `cshdimensionstest.test.psequence_weight_00_02` limit 10

In [42]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.origin_institution_00_02`

In [43]:
%%bigquery --project $project_id 

# step (4)
# generate the origins and destinations for each researcher
  create  table cshdimensionstest.test.origin_institution_00_02 as 
  select  
  *, 
  case when t = 1 
    then 'origin' 
        else 'destination' 
            end od 
from `cshdimensionstest.test.psequence_weight_00_02`
  order by researcher_ids, year, t;

In [None]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.first_pub_00_02`

In [45]:
%%bigquery --project $project_id 

# step (5)
# getting the first publication of each researcher
   create table cshdimensionstest.test.first_pub_00_02 as 
   select distinct
      a.researcher_ids,
      a.id as pub2,
      a.cat_id as field1,
      a.cat_id as field2,
      a.research_orgs as unit2,
      "0" as t1,
      a.year as p1,
      a.t as t2,
      a.year as p2,
      a.aff_weight,
      'started in' as mobility_type 
   from `cshdimensionstest.test.origin_institution_00_02` a 
   where
      t = 1;

In [46]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.flows_00_02`

In [47]:
%%bigquery --project $project_id 

# step (6)
# now we have everything we need to construct the flows at the institutional level
create table cshdimensionstest.test.flows_00_02 as 
  select
    a.researcher_ids,
    a.id as pub1,
    a.cat_id as field1,
    b.id as pub2,
    b.cat_id as field2,
    a.research_orgs as unit1,
    b.research_orgs as unit2,
    a.t as t1,
    a.year as p1,
    b.t as t2,
    b.year as p2,
    b.aff_weight  
  from
    `cshdimensionstest.test.psequence_weight_00_02` as a 
    inner join
        `cshdimensionstest.test.psequence_weight_00_02` as b 
        on a.researcher_ids = b.researcher_ids 
  where
        a.t < b.t 
    and a.t = b.t - 1 
  order by
    a.researcher_ids,
    a.t,
    b.t;
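
The self-join in step (6) pairs each row at step `t` with every row of the same researcher at step `t + 1`. Note that `a.t = b.t - 1` already implies `a.t < b.t`, so the first condition is redundant. A plain-Python sketch on hypothetical rows (the `trajectory` data and `consecutive_flows` helper are illustrative only):

```python
# Hypothetical rows: (researcher, publication, organisation, t)
trajectory = [
    ("ur.a", "pub.1", "grid.x", 1),
    ("ur.a", "pub.2", "grid.y", 2),
    ("ur.a", "pub.3", "grid.y", 3),
    ("ur.b", "pub.4", "grid.z", 1),
]

def consecutive_flows(rows):
    """Pair every row at step t with every row of the same researcher at step t + 1."""
    flows = []
    for researcher_a, pub_a, unit_a, t_a in rows:
        for researcher_b, pub_b, unit_b, t_b in rows:
            if researcher_a == researcher_b and t_a == t_b - 1:
                flows.append((researcher_a, pub_a, pub_b, unit_a, unit_b, t_a, t_b))
    return flows

flows = consecutive_flows(trajectory)
for flow in flows:
    print(flow)
```

A researcher with only one time step, like `ur.b` here, produces no flow rows; in the notebook such researchers appear only through the "started in" rows from step (5).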

In [50]:
%%bigquery --project $project_id 
drop table if exists `cshdimensionstest.test.flows_with_start_00_02`

In [51]:
%%bigquery --project $project_id

# step (7)
# bring the flows and the start publication datasets together and save it in a table

create table cshdimensionstest.test.flows_with_start_00_02 as 
   select
      researcher_ids,
      pub1,
      pub2,
      field1,
      field2,
      unit1,
      unit2,
     cast(t1 as int) t1,
     cast(p1 as int) p1,
      t2,
      p2,
      aff_weight,
      case
         when
            unit1 = unit2 
         then
            'stayed in' 
         else
            case
               when
                  unit1 != unit2 
               then
                  'moved to' 
               else
                  'error' 
            end
      end
      as mobility_type 
   from
      `cshdimensionstest.test.flows_00_02` 
   union all
   select
      researcher_ids,
     'void' as pub1,
      pub2,
      field1,
      field2,
     'void' as unit1,
      unit2,
     cast(t1 as int) t1,
     cast(p1 as int) p1,
      t2,
      p2,
      aff_weight,
      mobility_type 
   from
      `cshdimensionstest.test.first_pub_00_02` 
   order by
      researcher_ids,
      t2;

In [55]:
%%bigquery --project $project_id

select * from `cshdimensionstest.test.flows_with_start_00_02` 
order by researcher_ids, p1, p2
limit 10 

Unnamed: 0,researcher_ids,pub1,pub2,field1,field2,unit1,unit2,t1,p1,t2,p2,aff_weight,mobility_type
0,ur.01000000143.58,void,pub.1055163401,2581,2581,void,grid.17091.3e,0,2000,1,2000,1.0,started in
1,ur.01000000143.58,void,pub.1007920533,2581,2581,void,grid.17091.3e,0,2000,1,2000,0.5,started in
2,ur.01000000143.58,void,pub.1007920533,2581,2581,void,grid.417570.0,0,2000,1,2000,0.5,started in
3,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.417570.0,grid.17089.37,1,2000,2,2001,0.333333,moved to
4,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.17089.37,1,2000,2,2001,0.333333,moved to
5,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.17091.3e,1,2000,2,2001,0.333333,stayed in
6,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.417570.0,grid.31501.36,1,2000,2,2001,0.333333,moved to
7,ur.01000000143.58,pub.1055163401,pub.1053189419,2581,2581,grid.17091.3e,grid.17089.37,1,2000,2,2001,0.333333,moved to
8,ur.01000000143.58,pub.1055163401,pub.1053189419,2581,2581,grid.17091.3e,grid.31501.36,1,2000,2,2001,0.333333,moved to
9,ur.01000000143.58,pub.1007920533,pub.1053189419,2581,2581,grid.17091.3e,grid.31501.36,1,2000,2,2001,0.333333,moved to
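
The nested `case` expression in step (7) labels each flow. A rough plain-Python equivalent is sketched below; `classify_move` is a hypothetical helper, and `None` stands in for SQL `NULL`, which is the only situation in which neither `unit1 = unit2` nor `unit1 != unit2` is true and the `'error'` branch is reached.

```python
def classify_move(unit1, unit2):
    """Label a flow the way the nested CASE expression does."""
    if unit1 is None or unit2 is None:
        # In SQL, a comparison with NULL is neither true nor false,
        # so both WHEN branches fail and the ELSE branch fires.
        return "error"
    return "stayed in" if unit1 == unit2 else "moved to"

labels = [
    classify_move("grid.x", "grid.x"),
    classify_move("grid.x", "grid.y"),
    classify_move(None, "grid.y"),
]
print(labels)
```

As the output table shows, a researcher with several affiliations in consecutive years can receive both "stayed in" and "moved to" rows for the same pair of publications.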


In [58]:
%%bigquery --project $project_id

create table cshdimensionstest.test.aggregated_moved_to_00_02 as
select  unit1 as geoid_o
      , unit2 as geoid_d
      , field1 as catid_o
      , field2 as catid_d
      , p1 as date_o
      , p2 as date_d
      , '2000-2002' as date_range
      , sum(aff_weight) as weighted_flows
      , count(researcher_ids) as flows
from `cshdimensionstest.test.flows_with_start_00_02` 
where mobility_type = 'moved to'
group by unit1, unit2, field1, field2, p1, p2

In [6]:
%%bigquery --project $project_id
select * from `cshdimensionstest.test.aggregated_moved_to_00_02`
order by geoid_o, date_o, date_d, catid_o
limit 10

Unnamed: 0,geoid_o,geoid_d,catid_o,catid_d,date_o,date_d,date_range,weighted_flows,flows
0,grid.1001.0,grid.28312.3a,2330,2921,2000,2001,2000-2002,0.333333,1
1,grid.1001.0,grid.11355.33,2330,2409,2000,2001,2000-2002,3.166667,8
2,grid.1001.0,grid.6612.3,2330,2389,2000,2001,2000-2002,0.5,1
3,grid.1001.0,grid.4991.5,2330,2330,2000,2001,2000-2002,1.666667,3
4,grid.1001.0,grid.1003.2,2330,2344,2000,2001,2000-2002,0.5,1
5,grid.1001.0,grid.1005.4,2330,2353,2000,2001,2000-2002,0.5,1
6,grid.1001.0,grid.483427.e,2330,2409,2000,2001,2000-2002,0.666667,2
7,grid.1001.0,grid.425004.7,2330,2409,2000,2001,2000-2002,0.5,1
8,grid.1001.0,grid.1022.1,2330,2953,2000,2001,2000-2002,2.0,4
9,grid.1001.0,grid.37172.30,2330,2921,2000,2001,2000-2002,0.333333,1
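
The aggregation behind this table can be sketched in plain Python. The flow rows and the `aggregate_moved_to` helper are hypothetical; the logic follows the query: keep only "moved to" rows, group by origin, destination, fields, and years, then sum the weights and count the rows. Note that `count(researcher_ids)` counts rows, not distinct researchers.

```python
from collections import defaultdict

# Hypothetical flow rows:
# (unit1, unit2, field1, field2, p1, p2, aff_weight, mobility_type)
flows = [
    ("grid.x", "grid.y", 2330, 2409, 2000, 2001, 0.5, "moved to"),
    ("grid.x", "grid.y", 2330, 2409, 2000, 2001, 1.0, "moved to"),
    ("grid.x", "grid.x", 2330, 2330, 2000, 2001, 1.0, "stayed in"),
]

def aggregate_moved_to(rows):
    """Sum weights and count rows per (unit1, unit2, field1, field2, p1, p2) for 'moved to' flows."""
    totals = defaultdict(lambda: {"weighted_flows": 0.0, "flows": 0})
    for *key, weight, kind in rows:
        if kind != "moved to":
            continue
        cell = totals[tuple(key)]
        cell["weighted_flows"] += weight
        cell["flows"] += 1
    return dict(totals)

aggregated = aggregate_moved_to(flows)
print(aggregated)
```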


In [6]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

sql = """
  SELECT *
  FROM `cshdimensionstest.test.aggregated_moved_to_00_02` 
  order by geoid_o, date_o, date_d, catid_o
"""

movedto_edges = client.query(sql).to_dataframe()
movedto_edges.head(10)

Unnamed: 0,geoid_o,geoid_d,catid_o,catid_d,date_o,date_d,date_range,weighted_flows,flows
0,grid.1001.0,grid.5292.c,2330,2344,2000,2001,2000-2002,0.333333,1
1,grid.1001.0,grid.97008.36,2330,2953,2000,2001,2000-2002,0.833333,2
2,grid.1001.0,grid.34421.30,2330,2401,2000,2001,2000-2002,1.833333,3
3,grid.1001.0,grid.49100.3c,2330,2867,2000,2001,2000-2002,0.5,2
4,grid.1001.0,grid.418228.5,2330,2389,2000,2001,2000-2002,0.25,1
5,grid.1001.0,grid.1019.9,2330,2867,2000,2001,2000-2002,0.333333,1
6,grid.1001.0,grid.11762.33,2330,2330,2000,2001,2000-2002,0.333333,1
7,grid.1001.0,grid.251924.9,2330,2933,2000,2001,2000-2002,0.5,1
8,grid.1001.0,grid.5596.f,2330,2867,2000,2001,2000-2002,0.5,1
9,grid.1001.0,grid.9619.7,2330,2330,2000,2001,2000-2002,2.5,5


In [2]:
# save dataframe locally so we can use it next time without GBQ connection
from google.colab import files
movedto_edges.to_csv('movedto_edges.csv')
files.download('movedto_edges.csv')

NameError: ignored

# **Compute Indicators**

**For each institution id and year we compute the following basic indicators:**


1.  Institution id
2.  Year
3.  pcp
4.  Workforce
5.  Net mobility
6.  Average academic age
7.  Total author inflow
8.  Total author outflow
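
The notebook does not yet define these indicators, so the following is only a sketch under stated assumptions: inflow is the sum of flows arriving at an institution, outflow is the sum of flows leaving it, both are attributed to the arrival year (`date_d`), and net mobility is inflow minus outflow. The `edges` data and the `mobility_indicators` helper are hypothetical; pcp, workforce, and academic age are left out because their definitions are not given here.

```python
from collections import defaultdict

# Hypothetical edges shaped like the aggregated table: (geoid_o, geoid_d, date_d, flows)
edges = [
    ("grid.x", "grid.y", 2001, 3),
    ("grid.y", "grid.x", 2001, 1),
    ("grid.x", "grid.z", 2002, 2),
]

def mobility_indicators(edges):
    """Return inflow, outflow, and net mobility per (institution, year).

    Assumption: net mobility = inflow - outflow, keyed by the arrival year.
    """
    stats = defaultdict(lambda: {"inflow": 0, "outflow": 0})
    for origin, destination, year, flows in edges:
        stats[(destination, year)]["inflow"] += flows
        stats[(origin, year)]["outflow"] += flows
    return {
        key: {**value, "net_mobility": value["inflow"] - value["outflow"]}
        for key, value in stats.items()
    }

indicators = mobility_indicators(edges)
print(indicators[("grid.x", 2001)])
```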

# **Improvements to the code:**


*   Load data directly into a Google Cloud Storage bucket as CSV
*   Optimize and simplify queries



In [1]:
# preview the first row of the dataframe
movedto_edges.head(1)


NameError: ignored

In [1]:
# The `%%bigquery --project $project_id` cell magic only runs statements inside GBQ.
# To save the output in a variable `df`, add the variable name to the magic command, like so:
# %%bigquery --project $project_id df
# Here we use the Python client instead and load a sample of the flows into a dataframe.
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

sql = """
  SELECT researcher_ids, pub1, pub2, field1, field2, unit1, unit2, p1, p2, aff_weight, mobility_type
  FROM `cshdimensionstest.test.flows_with_start_00_02` 
  order by researcher_ids, p1, p2
  Limit 10
"""

all_edges = client.query(sql).to_dataframe()
all_edges.head(10)


NameError: ignored

In [None]:
"""
%%bigquery --project $project_id

# Now we aggregate the results to create the final flows table
# counting the flows

create table cshdimensionstest.test.for_flows_2000_2002 as 
select
   unit1 as institution_s,
   org1.latitude as lat_s,
   org1.longitude as lng_s,
   unit2 as institution_t,     
   org2.latitude as lat_t,
   org2.longitude as lng_t,                           
   name1 as for_s,
   name2 as for_t,                                        
   "Field of Research (Second Level)" as category_system,
   p2 as moving_year,                                       
   count(distinct researcher_ids) as researcher_flows,
   sum(aff_weight) as affiliation_weights 
from
   `cshdimensionstest.test.flows_with_start_00_02` a 
   inner join
      `cshdimensionstest.test.organisations` org1 
      on a.unit1 = org1.id 
   inner join
      `cshdimensionstest.test.organisations` org2 
      on a.unit2 = org2.id 
where
   mobility_type = "moved to"  
group by
   unit1,
   org1.latitude,
   org1.longitude,
   unit2,
   org2.latitude,
   org2.longitude,
   name1,
   name2,
   p2; 
"""

*   1.1 Make a "hello world" example.
*   1.2 Try to connect the resources to each other: can the GBQ data be shown on a website (that is, can any table be displayed)?

*   How can queries run fast enough that users do not experience delays?
*   What should the interface look like?
*   How can all calculations be put into one query?
*   How can the web interface be connected to Google BigQuery?
*   How does performance hold up if multiple users use it at the same time?


# **2. Set up a Firebase project**

Before you can continue, you need to set up a Firebase project:

1.  If you don't already have a Firebase project, create a new project in the [Firebase console](https://console.firebase.google.com/). Then, open your project and do the following:

    1.  On the [Settings](https://console.firebase.google.com/project/_/settings/serviceaccounts/adminsdk) page, create a service account and download the service account key file. Keep this file safe, since it grants administrator access to your project.

    1.  On the [Storage](https://console.firebase.google.com/project/_/storage) page, enable Cloud Storage. Take note of your bucket name.

        You need a Storage bucket to temporarily store model files while adding them to your Firebase project. If you are on the Blaze plan, you can create and use a bucket other than the default for this purpose.

    1.  On the [ML Kit](https://console.firebase.google.com/project/_/ml) page, click **Get started** if you haven't yet enabled ML Kit.

1.  In the [Google APIs console](https://console.developers.google.com/apis/library/firebaseml.googleapis.com?project=_), open your Firebase project and enable the Firebase ML API.

# **3. Set up Mobility Webtool and Connection to CSH Server**