In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from google.cloud import bigquery
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from pathora.plots import stacked_bar_chart, bar_chart
import plotly.plotly as py

In [3]:
data_path = 'data/'
plots_path = 'plots/'
views_path = 'views/'

In [4]:
client = bigquery.Client()

**Date**: 11.10.2018

**State**: Work in progress

The purpose of this notebook is to provide insights on the geographical origins of applications. The geographical origin of an application is defined as the inventor's country. 

There can be more than one inventor per application, so we have to choose how to assign an application to one or several countries. The most straightforward way is to assign $\frac{1}{nb\_inventors}$ of a given application to the country of each inventor. However, the output is far from perfect. Depending on the authority and/or the period, the overall time series of applications deviates from what we expect (see the tls201 exploratory analysis). This is partly due to the `nb_inventors` variable. For example, when the inventors are unknown, it is set to 0. With this attribution method, we have to exclude these applications, although they can represent a significant share for some authorities and periods. Other anomalies may be at play: when we sum the number of applications obtained with this method, we obtain more applications than the database actually contains, even though the applications with 0 inventors are excluded.
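The fractional attribution can be illustrated on a toy table. This is only a sketch with made-up values: the column names mirror those used in the queries, and the mismatch between `nb_inventors` and the number of inventor rows is a hypothetical mechanism for the overcount, not something verified in this notebook.

```python
import pandas as pd

# Toy inventor-level table: one row per (application, inventor). Values are made up.
rows = pd.DataFrame(
    {
        "appln_id": [1, 1, 2, 3, 3, 3],
        "person_ctry_code": ["DE", "FR", "US", "DE", "DE", "GB"],
        # Hypothetical inconsistency: application 3 has three inventor rows
        # but reports nb_inventors = 2.
        "nb_inventors": [2, 2, 1, 2, 2, 2],
    }
)

# Fractional attribution: each row contributes 1 / nb_inventors to its country.
rows["weight"] = 1 / rows["nb_inventors"]
by_country = rows.groupby("person_ctry_code")["weight"].sum()

# If nb_inventors were always consistent, both numbers would be equal.
total_attributed = by_country.sum()
nb_applications = rows["appln_id"].nunique()
print(by_country)
print(total_attributed, nb_applications)
```

In this toy case the attributed total exceeds the number of distinct applications, which is the same symptom as the one observed on the real data.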

A simpler approach is to keep only the country of the first inventor (given by `invt_seq_nr`). That is what we do in `v3`. *Note that the final analysis cannot rely on such a simplistic approach.* That being said, these results can be complemented by our analysis of the case where `nb_inventors` is 0. Put together, these two approaches seem to yield consistent time series.

Research directions: 

1. Replace `nb_inventors` with a `nb_occurences` variable. Such a variable takes values in $\mathbb{N}^*$, which solves our `Error: 0 denominator` issue. Moreover, applications without known inventors will be categorized as `na` without any additional step (thus minimizing the risk of making a mistake).
2. Authorities with many missing inventor locations should be complemented with external datasets (e.g. scraping).
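The first research direction could look like the following sketch on toy data. It assumes (this is an assumption, not a property checked here) that an application without known inventors still appears as one row with a missing country code; the values are made up.

```python
import pandas as pd

# Toy data, made up. Application 4 has no known inventor: its country is missing.
rows = pd.DataFrame(
    {
        "appln_id": [1, 1, 2, 3, 3, 3, 4],
        "person_ctry_code": ["DE", "FR", "US", "DE", "DE", "GB", None],
    }
)

# Unknown origins fall into the `na` category without any extra step.
rows["person_ctry_code"] = rows["person_ctry_code"].fillna("na")

# nb_occurences = number of rows per application, hence always >= 1.
rows["nb_occurences"] = rows.groupby("appln_id")["appln_id"].transform("count")
rows["weight"] = 1 / rows["nb_occurences"]

by_country = rows.groupby("person_ctry_code")["weight"].sum()
print(by_country)
print(by_country.sum(), rows["appln_id"].nunique())
```

Because the denominator is computed from the rows themselves, the weights of each application sum to 1 by construction, so the attributed total matches the number of applications (up to floating point error).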

# Queries

```python
# v1
query="""
SELECT
  SUM(1/nb_inventors) AS nb_appln,
  year,
  appln_auth,
  person_ctry_code
FROM
  raw.tls20167_cp
WHERE
  nb_inventors>0
GROUP BY
  year,
  appln_auth,
  person_ctry_code
ORDER BY
  year,
  appln_auth,
  person_ctry_code;
"""
# Written with the `_v1` suffix so that it matches the file read in the checks below
client.query(query).to_dataframe().to_csv(views_path + '20167_ApplnAuth_byOriginYear_v1.csv')
```

```python
# v2
query="""
SELECT
  SUM(1/nb_applicants) AS nb_appln,
  year,
  appln_auth,
  person_ctry_code
FROM
  raw.tls20167_cp
WHERE
  nb_applicants>0
GROUP BY
  year,
  appln_auth,
  person_ctry_code
ORDER BY
  year,
  appln_auth,
  person_ctry_code;
"""
client.query(query).to_dataframe().to_csv(views_path + '20167_ApplnAuth_byOriginYear_v2.csv')
```

```python
# v3
query="""
SELECT
  COUNT(*) AS nb_appln,
  year,
  appln_auth,
  person_ctry_code
FROM
  raw.tls20167_cp
WHERE
  invt_seq_nr=1
GROUP BY
  year,
  appln_auth,
  person_ctry_code
ORDER BY
  year,
  appln_auth,
  person_ctry_code;
"""
client.query(query).to_dataframe().to_csv(views_path + '20167_ApplnAuth_byOriginYear_v3.csv')
```

# Checks and exploratory analysis

In [5]:
df_v1 = pd.read_csv(views_path + '20167_ApplnAuth_byOriginYear_v1.csv', index_col=0)
df_v2 = pd.read_csv(views_path + '20167_ApplnAuth_byOriginYear_v2.csv', index_col=0)
df_v3 = pd.read_csv(views_path + '20167_ApplnAuth_byOriginYear_v3.csv', index_col=0)

## Checks

In [8]:
# Check: each total should equal the number of applications (i.e. approx. 85e6)
for i, tmp in enumerate([df_v1, df_v2, df_v3], start=1):
    print('df_v{}: {}'.format(i, tmp.nb_appln.sum()))

df_v1: 92496205.35466744
df_v2: 172274724.7333502
df_v3: 59110698
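
To make the gap with the expected total easier to read, the printed sums can be expressed as a relative deviation. A minimal sketch, reusing the values printed above; `expected` is the approximate figure quoted in the check, not an exact count.

```python
# Approximate number of applications quoted in the check above (an approximation)
expected = 85e6

# Totals copied from the output of the check
totals = {"v1": 92496205.35466744, "v2": 172274724.7333502, "v3": 59110698}

# Relative deviation from the expected total: positive means overcounting
deviation = {name: total / expected - 1 for name, total in totals.items()}
for name, dev in deviation.items():
    print("{}: {:+.1%}".format(name, dev))
```

v1 and v2 overshoot the expected total, while v3 falls short, which is consistent with v3 missing the applications that have no first inventor.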


## Exploratory analysis

In [18]:
df = df_v3

In [19]:
cnt_appln = "DE"

In [14]:
tmp = df.dropna().query('appln_auth==@cnt_appln').copy()
tmp['person_ctry_code'] = tmp['person_ctry_code'].replace('  ', 'na')
top8 = tmp.groupby(['person_ctry_code']).sum()['nb_appln'].sort_values(ascending=False).index[:8]
tmp.loc[~tmp['person_ctry_code'].isin(top8), "person_ctry_code"] = 'others'
tmp = tmp.groupby(['person_ctry_code', 'year']).sum()['nb_appln'].reset_index('person_ctry_code')
fig = stacked_bar_chart(tmp, 'person_ctry_code', 'nb_appln', ('div', 'Spectral'))
py.iplot(fig, filename='test')



Overall, quite consistent with the aggregate results, but with some pathological cases/periods (e.g. France, US, GB).
Hypothesis: the information on inventors (their number, or the first inventor) periodically disappears.

# `nb_inventors` = 0

When `nb_inventors` = 0, there is no first inventor, meaning that the previous query (v3) misses the related applications. These applications should nevertheless enter the `na` category. It seems that adding them gets us back to the expected distribution of applications over time.

```python
query="""SELECT
  COUNT(*) AS count,
  year,
  appln_auth,
  nb_inventors
FROM
  raw.tls201_cp
GROUP BY
  year,
  appln_auth,
  nb_inventors
ORDER BY
  year,
  appln_auth,
  nb_inventors;"""
client.query(query).to_dataframe().to_csv(views_path + '201_nbInventors_byApplnAuthYear.csv')  
```  
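
Once these counts are exported, the zero-inventor rows could be folded into the v3 results as an `na` origin. The sketch below uses small made-up tables standing in for `df_v3` and `df_nbi`; only the column names come from the queries above.

```python
import pandas as pd

# Toy stand-in for df_v3 (first-inventor counts). Values are made up.
v3 = pd.DataFrame(
    {
        "year": [2000, 2000, 2001],
        "appln_auth": ["DE", "DE", "DE"],
        "person_ctry_code": ["DE", "US", "DE"],
        "nb_appln": [10, 2, 12],
    }
)
# Toy stand-in for df_nbi (applications by number of inventors). Values are made up.
nbi = pd.DataFrame(
    {
        "year": [2000, 2000, 2001, 2001],
        "appln_auth": ["DE", "DE", "DE", "DE"],
        "nb_inventors": [0, 1, 0, 2],
        "count": [5, 12, 1, 12],
    }
)

# Applications without inventors are missed by v3: relabel them as `na`.
missing = (
    nbi.loc[nbi["nb_inventors"] == 0, ["year", "appln_auth", "count"]]
    .rename(columns={"count": "nb_appln"})
    .assign(person_ctry_code="na")
)

combined = pd.concat([v3, missing], ignore_index=True, sort=False)
totals = combined.groupby(["appln_auth", "year"])["nb_appln"].sum()
print(totals)
```

The resulting totals per authority and year can then be compared with the aggregate time series of applications.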

In [16]:
df_nbi = pd.read_csv(views_path + '201_nbInventors_byApplnAuthYear.csv', index_col=0)

In [21]:
# The query above names the aggregate column `count`
fig = bar_chart(df_nbi.dropna().query('appln_auth==@cnt_appln & nb_inventors==0').set_index('year')['count'])
py.iplot(fig, filename='test')

