In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from google.cloud import bigquery
import pandas as pd
from open_patstat.plots import stacked_bar_chart, bar_chart
import plotly.plotly as py
import plotly.io as pio
import numpy as np

In [3]:
data_path = 'data/'
plots_path = 'plots/'
views_path = 'views/'

In [4]:
client = bigquery.Client()

The purpose of this notebook is to provide insights into the geographical flows of applications. The destination of an application is defined as its `appln_auth`, and its origin is defined as the inventor's country (from the address).

<font color='red'>Warning: the number of known inventors varies from 0 to more than 80.</font>

The aforementioned warning means that we have to set an (arbitrary) assignment rule. 

A solution would be to use the `inv_rank` that is assigned to each inventor of a given application by *some* authorities. However, since this is not systematic, this is not our favourite option to perform international investigations.

Another solution would be to use the `nb_inv` variable to assign $\frac{1}{nb\_inv}$ of a given application to the country of each inventor. However, the output is far from perfect. Depending on the authority and/or the period, the resulting time series of applications departs markedly from our benchmark (see the tls201 exploratory analysis). This is partly due to the pitfalls of the `nb_inv` variable. For example, it is set to 0 when the inventors are unknown, so this attribution method forces us to exclude those applications even though they can be numerous. Other anomalies remain unexplained: when we sum the number of applications obtained with this method, we obtain more applications than the database actually contains, even though the applications with 0 inventors are excluded. A complete discussion can be found in the old version of `tls20167.ipynb`.

In this context we created `nb_occ`, the number of occurrences of an application id in the `tls20167` table, where each occurrence corresponds to one inventor. We use the `nb_occ` variable to assign $\frac{1}{nb\_occ}$ of a given application to the country of each inventor. This solution natively embodies desirable properties such as:

- the ability to split an application between different origin countries
- no zero case: every application has at least 1 occurrence, even if the origin country is `Null`
- the total number of applications (overall and by `appln_auth`, `year`, etc.) remains the same <font color=#1F618D>Nb: this includes the applications with 0 known inventors and thus no country of origin</font>
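
The properties listed above can be checked on a toy example. The sketch below is only an illustration: the table `toy` and its values are made up, and `weight` is a hypothetical column name; only `nb_occ`, `appln_auth` and `person_ctry_code` come from the notebook. It assumes pandas 1.1 or later for the `dropna=False` option of `groupby`.

```python
import pandas as pd

# Toy stand-in for the inventor table: one row per (application, inventor).
# Application 2 has no known inventor country but still occupies one row.
toy = pd.DataFrame({
    "appln_id": [1, 1, 2, 3, 3, 3, 4],
    "appln_auth": ["FR", "FR", "FR", "US", "US", "US", "US"],
    "person_ctry_code": ["FR", "DE", None, "US", "US", "JP", "FR"],
})

# nb_occ = number of rows sharing the same appln_id (at least 1 by construction).
toy["nb_occ"] = toy.groupby("appln_id")["appln_id"].transform("count")
toy["weight"] = 1 / toy["nb_occ"]

# Fractional counts by destination and origin; dropna=False keeps unknown origins.
flows = toy.groupby(["appln_auth", "person_ctry_code"], dropna=False)["weight"].sum()

total_weight = flows.sum()
n_applications = toy["appln_id"].nunique()
print(flows)
print(total_weight, n_applications)
```

Because each application contributes weights that sum to 1, the total of the fractional counts matches the number of distinct applications (up to floating-point rounding), including the application whose origin is unknown.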

# Data

```python
query = """
SELECT
  SUM(1/nb_occ) AS nb,
  year,
  appln_auth,
  person_ctry_code
FROM
  `raw.tls20167_cp_v2`
GROUP BY
  year,
  appln_auth,
  person_ctry_code;
  """
client.query(query).to_dataframe().to_csv(views_path + '20167_ApplnAuth_byOriginYear_vf.csv')
```

In [5]:
df = pd.read_csv(views_path + '20167_ApplnAuth_byOriginYear_vf.csv', index_col=0)

# Sanity check

In [6]:
print("There are {} applications in the table".format(int(df['nb'].sum())))

There are 85790188 applications in the table


In [7]:
# time series of applications filed in FR (including unknown origins)
fig = bar_chart(df.query('appln_auth=="FR"').groupby('year').sum())
py.iplot(fig, filename='test')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~cverluise/0 or inside your plot.ly account where it is named 'test'






In [8]:
# time series of applications filed in FR (excluding unknown origins)
fig = bar_chart(df.query('appln_auth=="FR"').dropna(subset=['person_ctry_code'], axis=0).groupby('year').sum())
py.iplot(fig, filename='test')



# "Input output" matrix

We build an input/output matrix where:

- Values in rows are grouped by country of origin (`person_ctry_code`). They represent application outflows (exports/outputs)

- Values in columns are grouped by destination (`appln_auth`). They represent application inflows (imports/inputs)

- Special case: values on the diagonal are applications with same `appln_auth` and `person_ctry_code`

In [9]:
# we focus on the 2000-2005 period - should be refined based on the 
tmp = df.query('2000<=year<=2005')
tmp_io = pd.crosstab(index=tmp['person_ctry_code'], columns=tmp['appln_auth'], 
                     values=tmp['nb'], aggfunc='sum',
                     margins=True, margins_name='Total', dropna=False)
tmp_io = tmp_io.replace(np.nan, 0)

For the sake of simplicity, we restrict the matrix to the subset of countries with the largest record of patent flows (inflows here).

In [10]:
# we filter on countries which import a sufficiently large amount of patents
exp_list = list(tmp_io.columns[tmp_io.sum(0)>1e5])
subset_io = tmp_io.loc[exp_list, exp_list]

Nb: Overall, the diagonal is where we find the most extreme values.

## Margin analysis: net balance of applications

The positive net balance of some European countries could be linked to the strongly negative net balance of EP.

In [11]:
pd.DataFrame(subset_io['Total'] - subset_io.loc['Total']).\
rename(columns={'person_ctry_code':'ctry_code', 'Total':'net_balance'}).sort_values('net_balance')# X-M

person_ctry_code,net_balance
JP,-1813005.0
EP,-848448.9
US,-638826.6
AU,-396452.9
CN,-374291.4
KR,-306959.1
AT,-160611.5
CA,-158409.5
RU,-131958.7
BR,-103714.4
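
To make the X-M computation concrete, here is a minimal sketch on a made-up 3x3 matrix. The numbers and the names `io`, `exports` and `imports` are assumptions for illustration only; rows are origins and columns are destinations, as in the notebook.

```python
import pandas as pd

# Toy input/output matrix (made-up numbers): rows = origin, columns = destination.
io = pd.DataFrame(
    [[50.0, 10.0, 5.0],
     [20.0, 80.0, 10.0],
     [0.0, 5.0, 30.0]],
    index=pd.Index(["FR", "US", "JP"], name="person_ctry_code"),
    columns=pd.Index(["FR", "US", "JP"], name="appln_auth"),
)

exports = io.sum(axis=1)  # X: applications by inventors from each country
imports = io.sum(axis=0)  # M: applications filed at each authority

# The diagonal appears in both X and M, so it cancels out in the balance.
net_balance = (exports - imports).rename("net_balance").sort_values()
print(net_balance)
```

In this closed toy matrix the balances sum to zero. That need not hold for a subset of a larger matrix, where the totals also include flows to and from countries outside the subset.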


## Export analysis

From the `Total` column, we see that inventors living in the subset mainly export patents to the same subset of authorities. The only exception seems to be the EPO (`EP`). Q: what does a `person_ctry_code` equal to `EP` mean?

In [12]:
subset_io.T.apply(lambda s: s/s[:-1].sum()).T

person_ctry_code \ appln_auth,AT,AU,BR,CA,CN,DE,DK,EP,ES,FR,GB,IT,JP,KR,MX,RU,TW,US,Total
AT,0.341471,0.001008,0.003243,0.030569,0.023972,0.218917,0.0137,0.199194,0.023916,0.001217,0.002878,0.0013,0.00029,0.005652,0.006224,0.009264,0.007974,0.109212,1.079615
AU,0.034612,0.232531,0.00798,0.094929,0.056441,0.049499,0.00751,0.142392,0.009517,0.000414,0.016793,9.3e-05,0.000426,0.010983,0.015603,0.006418,0.010741,0.303117,1.057562
BR,0.009304,0.000318,0.808129,0.014262,0.011228,0.02433,0.002264,0.037059,0.007123,0.000866,0.002721,0.00043,7.3e-05,0.003053,0.012892,0.002284,0.002649,0.061015,1.038334
CA,0.030559,0.000579,0.003593,0.374022,0.023583,0.059599,0.005473,0.106829,0.008221,0.001158,0.009214,6.2e-05,0.000235,0.008327,0.012056,0.003802,0.009931,0.342758,1.038516
CN,0.001239,4.2e-05,0.00017,0.001433,0.969946,0.002296,0.000166,0.004528,0.00034,0.000101,0.000469,1.2e-05,4.5e-05,0.001453,0.000217,0.000528,0.004877,0.012139,1.002218
DE,0.056723,0.000602,0.003609,0.021525,0.029899,0.480095,0.010066,0.218231,0.021566,0.002654,0.003494,0.000635,0.000251,0.006645,0.007495,0.0056,0.007603,0.123307,1.05115
DK,0.074151,0.003698,0.005536,0.049659,0.048767,0.123981,0.268065,0.188798,0.022498,0.000689,0.007939,0.00044,0.000238,0.008582,0.012825,0.012519,0.004751,0.166865,1.11485
EP,0.067481,0.0,0.575835,0.0,0.0,0.067481,0.0,0.067481,0.0,0.0,0.0,0.0,0.0,0.0,0.134961,0.0,0.01928,0.067481,1.902528
ES,0.040094,0.000544,0.006021,0.021614,0.016829,0.073951,0.009341,0.119471,0.597593,0.005371,0.005301,0.001405,0.000144,0.004171,0.013598,0.00598,0.004025,0.074546,1.071944
FR,0.072969,0.000764,0.007723,0.046182,0.04453,0.148893,0.012505,0.207061,0.034261,0.217879,0.004698,0.000632,0.000527,0.00963,0.015378,0.009452,0.006997,0.159917,1.079143
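
The row-wise normalisation used above can be sketched on a small made-up matrix. The values and the names `within_subset` and `export_shares` are assumptions for illustration; as in the notebook, each row is divided by the sum of its entries excluding `Total`.

```python
import pandas as pd

# Toy matrix (made-up numbers). "Total" counts all destinations,
# including authorities that are not listed as columns here.
io = pd.DataFrame(
    {"FR": [50.0, 20.0], "US": [10.0, 80.0], "Total": [75.0, 110.0]},
    index=pd.Index(["FR", "US"], name="person_ctry_code"),
)

# Exports that stay inside the subset of authorities (all columns but "Total").
within_subset = io.drop(columns="Total").sum(axis=1)

# Divide each row by its within-subset sum.
export_shares = io.div(within_subset, axis=0)
print(export_shares)
```

With this normalisation the subset columns of each row sum to 1, and the `Total` entry is the ratio of all exports to within-subset exports. A value close to 1 means almost everything is filed within the subset; the further above 1, the larger the share filed elsewhere.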


## Import analysis 

From the `Total` row, we see that the authorities (`appln_auth`) in the subset only partly import patents from inventors living in the same subset of countries. Or is it an artifact of `nan`? This is notably the case for Japan and Australia.

In [13]:
subset_io.apply(lambda s: s/s[:-1].sum())

person_ctry_code \ appln_auth,AT,AU,BR,CA,CN,DE,DK,EP,ES,FR,GB,IT,JP,KR,MX,RU,TW,US,Total
AT,0.09277,0.003544,0.003762,0.005893,0.001092,0.01362487,0.016467,0.01317159,0.012262,0.00094,0.00136,0.002535,0.000149,0.000535,0.004795,0.004506,0.001338714,0.002894668,0.008062
AU,0.008697,0.75597,0.008562,0.016926,0.002377,0.002849254,0.008349,0.008708182,0.004513,0.000296,0.007338,0.000168,0.000202,0.000961,0.011116,0.002887,0.001667851,0.007430483,0.007304
BR,0.00131,0.000579,0.485904,0.001425,0.000265,0.0007847763,0.00141,0.00126999,0.001893,0.000347,0.000666,0.000434,1.9e-05,0.00015,0.005147,0.000576,0.0002304886,0.0008381297,0.004018
CA,0.019172,0.004699,0.009628,0.166518,0.00248,0.008565838,0.015192,0.01631295,0.009734,0.002066,0.010054,0.000277,0.000278,0.001819,0.021447,0.004271,0.003850547,0.02097955,0.017908
CN,0.0056,0.002474,0.00328,0.004599,0.735116,0.002378236,0.003314,0.004983065,0.002901,0.001297,0.003692,0.00038,0.000389,0.002288,0.002781,0.004275,0.01362808,0.005354969,0.124562
DE,0.289809,0.039809,0.078737,0.078044,0.025605,0.5619322,0.227541,0.2713838,0.207943,0.038575,0.031043,0.023279,0.002423,0.011824,0.108576,0.051234,0.02400604,0.06146406,0.147616
DK,0.015085,0.009734,0.004809,0.007169,0.001663,0.005778077,0.241268,0.00934836,0.008637,0.000399,0.002809,0.000643,9.2e-05,0.000608,0.007398,0.00456,0.0005972429,0.003311832,0.006234
EP,3e-06,0.0,0.000107,0.0,0.0,6.703512e-07,0.0,7.12217e-07,0.0,0.0,0.0,0.0,0.0,0.0,1.7e-05,0.0,5.16674e-07,2.854813e-07,2e-06
ES,0.012056,0.002117,0.007732,0.004612,0.000848,0.005094122,0.012426,0.008743785,0.339119,0.004595,0.002772,0.003035,8.2e-05,0.000437,0.011594,0.00322,0.0007479346,0.002186878,0.00886
FR,0.103425,0.014019,0.046745,0.046451,0.010579,0.04834615,0.078418,0.07143274,0.091646,0.878558,0.011581,0.006427,0.001413,0.004754,0.061804,0.023989,0.006129198,0.02211353,0.042042
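
One way to probe the `nan` question raised above is to compute, for each authority, the share of applications whose origin is missing. The sketch below runs on made-up data; `toy_df`, `missing` and `na_share` are hypothetical names, and treating whitespace-only codes as missing is an assumption.

```python
import pandas as pd

# Toy stand-in for df (made-up numbers), with the same columns as the notebook.
toy_df = pd.DataFrame({
    "nb": [60.0, 40.0, 10.0, 90.0],
    "appln_auth": ["JP", "JP", "FR", "FR"],
    "person_ctry_code": [None, "JP", "  ", "FR"],
})

codes = toy_df["person_ctry_code"]
# Assumption: a missing origin is either null or a whitespace-only string.
missing = codes.isna() | (codes.str.strip() == "")

# Applications with a missing origin, summed by authority, over all applications.
nb_missing = toy_df["nb"].where(missing, 0).groupby(toy_df["appln_auth"]).sum()
nb_total = toy_df.groupby("appln_auth")["nb"].sum()
na_share = nb_missing / nb_total
print(na_share)
```

A high share for a given authority would suggest that its low within-subset import ratio reflects missing inventor addresses rather than genuine inflows from outside the subset.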


<font color='indianred'>FR: 
    
- clustering ?
- determinants of flows intensity? </font>

# Country time series

## Application inflows

In [22]:
for cnt in df['appln_auth'].unique():
    tmp = df.query('appln_auth==@cnt').dropna(subset=['year']).copy()
    if tmp['nb'].sum()>1e5:
        tmp.loc[tmp['person_ctry_code'].isna(), 'person_ctry_code'] = 'NA'
        tmp['person_ctry_code'] = tmp['person_ctry_code'].replace('  ', 'NA')
        top8 = tmp.groupby(['person_ctry_code']).sum()['nb'].sort_values(ascending=False).index[:8]
        tmp.loc[~tmp['person_ctry_code'].isin(top8), "person_ctry_code"] = 'others'
        tmp = tmp.groupby(['person_ctry_code', 'year']).sum()['nb'].reset_index('person_ctry_code')
        fig = stacked_bar_chart(tmp, 'person_ctry_code', 'nb', ('div', 'Spectral'), 
                                title='Application inflows by year and inventors origin in {}'.format(cnt))
        pio.write_image(fig, plots_path + 
                        '/20167_patentInflows_byYearOrigin/20167_{}_patentInflows.png'.format(cnt), width=1200, height=900) 
        #py.iplot(fig, filename='test')
    else:
        pass

## Application outflows

In [21]:
for cnt in df['person_ctry_code'].unique():
    tmp = df.query('person_ctry_code==@cnt').dropna(subset=['year']).copy()
    if tmp['nb'].sum()>1e4:
        tmp.loc[tmp['appln_auth'].isna(), 'appln_auth'] = 'NA'
        tmp['appln_auth'] = tmp['appln_auth'].replace('  ', 'NA')
        top8 = tmp.groupby(['appln_auth']).sum()['nb'].sort_values(ascending=False).index[:8]
        tmp.loc[~tmp['appln_auth'].isin(top8), "appln_auth"] = 'others'
        tmp = tmp.groupby(['appln_auth', 'year']).sum()['nb'].reset_index('appln_auth')
        fig = stacked_bar_chart(tmp, 'appln_auth', 'nb', ('div', 'Spectral'), 
                               title='Application outflows by year and destination authority from {}'.format(cnt))
#        py.iplot(fig, filename='test')
        pio.write_image(fig, plots_path + 
                        '/20167_patentOutflows_byYearOrigin/20167_{}_patentOutflows.png'.format(cnt), 
                        width=1200, height=900) 
    else:
        pass