In [1]:
%load_ext autoreload
%autoreload 2

**Import and meta**

In [2]:
from google.cloud import bigquery
from open_patstat.plots import bar_chart, stacked_bar_chart
from open_patstat.documentation.shortdoc import ShortDoc, print_markdown, print_signature

import plotly.plotly as py
import plotly.graph_objs as go
import plotly.io as pio
import plotly.figure_factory as ff
import pandas as pd
import os
import pycountry
import colorlover as cl

In [3]:
data_path = '../data/'
plots_path = '../plots/'
views_path = '../views/'

[Credentials]:https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#client-libraries-install-python

To instantiate your Client, you need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the file path of the JSON file that contains your service account key. Follow the steps described [here][credentials]. 

Readers only interested in existing output can comment out the `client` instantiation and run the rest of the notebook using the aggregate data provided in `/views`.
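A minimal sketch of how that choice could be automated, assuming the key path is read from the environment. The helper name `credentials_available` and the flag `USE_BIGQUERY` are hypothetical and not part of `open_patstat`:

```python
import os


def credentials_available(env=None):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points to an existing file."""
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(key_path) and os.path.isfile(key_path)


# Only the query cells need a client; cells reading from the views folder do not.
USE_BIGQUERY = credentials_available()
```

With such a flag, `bigquery.Client()` would only be instantiated when `USE_BIGQUERY` is true.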

In [4]:
documentation = ShortDoc()
client = bigquery.Client()

----

**What is this notebook for?**

The Application table (`TLS201_APPLN`) is PATSTAT's central table. It contains the key bibliographical data elements relevant to identify the patent application. Most of the elements in this table can be found on the first page of a printed patent document. It also links to many other database tables.

This notebook is an introduction to the PATSTAT "Application" table (`TLS201_APPLN`, edition `PATSTAT2016a`).

More precisely:

- we report variable descriptions and encoding details from the `Data Catalog for Patstat`(5.07)
- we provide summary statistics and visualizations (when relevant)

**Data**

In [5]:
print_markdown(documentation.version())

**Author**: European patent office

**Distribution**: Epo patstat customers

**Patstat edition**: 2016 spring edition

**Version**: 5.07

**Date**: 01.04.2016

**Output**

[db-logo]: https://aem.dropbox.com/cms/content/dam/dropbox/www/en-us/branding/app-dropbox-windows@2x.png.transform/half-res/img.png
[db]:https://www.dropbox.com/sh/b1gs90gtoduu02v/AABICBNYH2kysjX-4JTqee0Wa?dl=0

`plots/` and `views/` are not available in our GitHub repository. However, interested readers can access those files in our dedicated Dropbox folder.

[![alt text][db-logo]][db]

----

**TLS201_APPLN Schema**

In [6]:
print_markdown(documentation.desc_table('TLS201_APPLN'))

**Description**: Tls201_appln: application (5.1). contains the key bibliographical data elements relevant to identify the patent application. most of the elements in this table can be found on the first page of a printed patent document, e.g. application authority, application number and application filing date.
from a database structure point of view, this table is very important because it links to many other database tables via the application id attribute.
note: the following attributes have been renamed in the 2015 spring edition:
- prior_earliest_date -> earliest_filing_date
- prior_earliest_year -> earliest_filing_year
- publn_earliest_date -> earliest_publn_date
- publn_earliest_year

**Page**: 37

In [7]:
table_ref = client.dataset('raw').table('tls201_cp')
table = client.get_table(table_ref)
schema = table.schema
schema = [[field.name, field.field_type, field.mode] for field in schema]

In [8]:
pd.DataFrame(schema, columns=['Name', 'Type', 'Mode']).set_index('Name')

Name,Type,Mode
appln_id,INTEGER,NULLABLE
appln_auth,STRING,NULLABLE
appln_nr,STRING,NULLABLE
appln_kind,STRING,NULLABLE
appln_filing_date,DATE,NULLABLE
appln_filing_year,INTEGER,NULLABLE
appln_nr_epodoc,STRING,NULLABLE
appln_nr_original,STRING,NULLABLE
ipr_type,STRING,NULLABLE
internat_appln_id,INTEGER,NULLABLE


----

# Technical identifier

## APPLN_ID

In [9]:
print_markdown(documentation.desc_variable('APPLN_ID'))

**Name**:  application identification 

**Also known as**:  n/a 

**Description**:  surrogate key technical unique identifier without any business meaning 

**Domain**:  9 999 999 

**Default value**:  n/a 

**Source database**:  docdb (range 1), patstat (ranges 2, 3, 4) 

**Source field name**: 
 for range 1 (see below for definition of ranges)
<application-reference is-representative="yes" doc-id="11607218" data-format="docdb">
<document-id>

<country>de</country>

<doc-number>8909720</doc-number>

<kind>u</kind>

<date>19890812</date>
</document-id>
for ranges 2, 3 and 4 appln_id is set as described in section 4.4 "application replenishment".
source sub-field identifier n/a source codes
for range 1 <application-reference is-representative="yes" doc-id="11607218" data-format="docdb">


**Page**: 88

Note: 

1. Range 1: 1 to 900 000 000. Filed applications which have a related publication in DOCDB. Unique but not sequential.
2. Range 2: 900 000 001 to 930 000 000. Artificial applications which are created in PATSTAT for prior applications, claimed as *priorities*, which do not have an application-reference in DOCDB.
3. Range 3: 930 000 001 to 960 000 000. Artificial filing applications with kind code `D2` which are created in PATSTAT for those artificial publications which are also created in PATSTAT because these *publications* are cited, but do not have a publication-reference in DOCDB.
4. Range 4: 960 000 001 to 999 999 999. Artificial filing applications with kind code `D3` which are created in PATSTAT because these *applications* are cited.
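The same classification can be done on the Python side, for example to check a handful of identifiers without running a query. This is a small sketch built only from the bounds listed above; the function name `appln_id_range` is an illustrative choice:

```python
from bisect import bisect_left

# Inclusive upper bounds of ranges 1 to 4, as listed in the note above
RANGE_UPPER_BOUNDS = [900_000_000, 930_000_000, 960_000_000, 999_999_999]


def appln_id_range(appln_id: int) -> str:
    """Map an APPLN_ID to "range_1" ... "range_4"."""
    if not 1 <= appln_id <= RANGE_UPPER_BOUNDS[-1]:
        raise ValueError("appln_id outside the documented ranges: {}".format(appln_id))
    # bisect_left returns the index of the first upper bound >= appln_id
    return "range_{}".format(bisect_left(RANGE_UPPER_BOUNDS, appln_id) + 1)
```

For instance, `appln_id_range(11607218)` falls in range 1, while any identifier above 960 000 000 falls in range 4.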

```python
query = """
SELECT
  year,
  appln_auth,
  COUNT(*) AS nb_range,
  CASE
    WHEN appln_id BETWEEN 1 AND 900000000 THEN "range_1"
    WHEN appln_id BETWEEN 900000001 AND 930000000 THEN "range_2"
    WHEN appln_id BETWEEN 930000001 AND 960000000 THEN "range_3"
    WHEN appln_id BETWEEN 960000001 AND 999999999 THEN "range_4"
  END AS `range`
FROM
  raw.tls201_cp
GROUP BY
  year,
  appln_auth,
  `range`
ORDER BY
  year,
  appln_auth;"""

client.query(query=query).to_dataframe().to_csv(views_path + '201_ApplnId_byRange.csv')
```

In [10]:
df = pd.read_csv(views_path + '201_ApplnId_byRange.csv', index_col=0)

In [12]:
# Overall range distribution
tmp = df.groupby(['range']).sum()['nb_range']

fig = bar_chart(tmp, title='ApplicationId by Range')
pio.write_image(fig, plots_path + '/201_ApplnId_byRange.png')
#py.iplot(fig, filename='201_ApplnId_byRange')

In [13]:
# Range distribution by year
tmp = df.groupby(['year', 'range']).sum()['nb_range'].to_frame().reset_index('range')
fig = stacked_bar_chart(tmp, 'range', 'nb_range', title='ApplnId by range and year')
pio.write_image(fig, plots_path + '/201_ApplnId_byRangeYear.png', width=1200, height=800)
#py.iplot(fig, filename='201_ApplnId_byRange')

# Business identifiers

## APPLN_AUTH

In [14]:
print_markdown(documentation.desc_variable('APPLN_AUTH'))

**Name**:  application authority 

**Also known as**:  country, state. receiving office in case of pct application 

**Description**:  patent authority where the national, international or regional application was filed 

**Domain**:  up to 2 ascii characters (a-z), according to wipo st.3 and including rh (south rhodesia) 

**Default value**:  n/a 

**Source database**:  docdb 

**Page**: 83

```python

query = """
SELECT 
    appln_auth,
    year,
    count(appln_auth) as nb_patents
FROM
    raw.tls201_cp
GROUP BY 
    appln_auth,
    year
ORDER BY
    appln_auth,
    year;"""

client.query(query=query).to_dataframe().to_csv(views_path + '201_ApplnAuth_byYear.csv')
```

In [15]:
tmp_path = '/201_ApplnAuth_byYearAuth'

In [16]:
df_org = pd.read_csv(data_path + 'tls801_part01.txt')
df = pd.read_csv(views_path + '/201_ApplnAuth_byYear.csv', index_col=0)

In [17]:
# Overall applications by auth since the 1970s
threshold = 1970
tmp = (df.query('year >= @threshold')
       .groupby('appln_auth').sum()
       .merge(df_org, how='left', left_index=True, right_on=['ctry_code'])
       .set_index('st3_name'))
fig = bar_chart(tmp['nb_patents'].sort_values(ascending=False).iloc[:15],
                title='Number of Applications by ApplnAuth since {}'.format(threshold))
pio.write_image(fig, plots_path + '/201_ApplnAuth_byAuth70s.png', width=1200, height=800)
#  py.iplot(fig)


In [18]:
# Application by auth over time
for cnt_code in df.appln_auth.unique():
    names = df_org.query('ctry_code == @cnt_code')["st3_name"].values
    # some organizations don't have names: fall back to the authority code
    cnt_name = names[0] if len(names) else cnt_code
    tmp = df.dropna().set_index('year').query('appln_auth==@cnt_code')['nb_patents']
    # be careful: there is a large number of year=nan for some countries (e.g. DE, CA)
    if sum(tmp) > 1e6:
        fig = bar_chart(tmp, title='Applications by year in {}'.format(cnt_name))
        pio.write_image(fig,
                        plots_path + tmp_path + '/201_ApplnAuth_{}_byYear.png'.format(cnt_code),
                        width=1200, height=800)


## APPLN_NR

In [19]:
print_markdown(documentation.desc_variable('APPLN_NR'))

**Name**:  application number 

**Also known as**:  "dossier number" in case of ep applications 

**Description**:  number issued by the patent authority where the national, international or regional application was filed 

**Domain**:  up to 15 ascii characters. this attribute must be unique in combination with appln_auth & appln_kind. the last character is either numeric or a, d, k, t or x. the docdb administrators make the application numbers end with a d, t or x to create "dummy" application numbers that are present because the number is mandatory but the actual number is not known.
- a - data errors
- d - dummy application; the publication number is put in front of the d
- k - special type of older brazilian application (number format 11nnnnnk)
- t - dummy technical priority
- x - dummy pre-1970 derived priority

**Default value**:  empty string 

**Source database**:  docdb


**Page**: 92

Note: The last character of `APPLN_NR` is either numeric or one of A, D, K, T or X. The DOCDB administrators append D, T or X to create "dummy" application numbers, which are present because the number is mandatory but the actual number is not known.

- A: data errors
- D: dummy application; the publication number is put in front of the D
- K: special type of older Brazilian application (number format 11nnnnnK )
- T: dummy technical priority
- X: dummy pre-1970 derived priority
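As a quick illustration of the suffix convention, here is a sketch that labels a single application number by its last character. The function name and the label `"no suffix"` are illustrative choices, not PATSTAT terminology:

```python
SUFFIX_MEANING = {
    "A": "data errors",
    "D": "dummy application",
    "K": "older Brazilian application",
    "T": "dummy technical priority",
    "X": "dummy pre-1970 derived priority",
}


def appln_nr_suffix(appln_nr: str) -> str:
    """Describe the last character of an APPLN_NR."""
    # [-1:] also works on the empty string, which is the documented default value
    last = appln_nr.strip()[-1:].upper()
    return SUFFIX_MEANING.get(last, "no suffix")
```

Numbers ending with a digit, as well as the empty default value, are reported as `"no suffix"`.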



```python
query = """
SELECT
  #  year,
  COUNT(*) AS nb_dummy,
  CASE
    WHEN SUBSTR(appln_nr, -1)="A" THEN "data errors"
    WHEN SUBSTR(appln_nr, -1)="D" THEN "dummy application"
    WHEN SUBSTR(appln_nr, -1)="T" THEN "technical priority"
    WHEN SUBSTR(appln_nr, -1)="X" THEN "pre-1970 derived priority"
    ELSE "no dummy"
  END AS dummy
FROM
  raw.tls201_cp
GROUP BY
  #  year,
  #  appln_auth,
  dummy;
  #ORDER BY
  #  year;
  #  appln_auth;"""
# feel free to uncomment the `year` / `appln_auth` lines to get year and authority details
client.query(query=query).to_dataframe().to_csv(views_path + '201_ApplnNR_byDummy.csv')
```

In [20]:
df = pd.read_csv(views_path + '201_ApplnNR_byDummy.csv', index_col=0)

In [21]:
tmp = df.set_index('dummy')['nb_dummy'].sort_values(ascending=False)
fig = bar_chart(tmp, title='ApplnNr by Dummy')
pio.write_image(fig, plots_path + '201_ApplnNR_byDummy.png')

## APPLN_KIND

In [22]:
print_markdown(documentation.desc_variable('APPLN_KIND'))

**Name**:  kind of application 

**Also known as**:  n/a 

**Description**:  specification of the kind of application 

**Domain**:  up to 2 ascii characters a,b,c,d,d2,d3,e,f,h,i,k,l,m,n,o,p,q,t,u,w and others
- a: patent
- u: utility model
- f: design patent
- p: provisional application
- w: pct application (in the international phase)
- t: used by some offices (e. g. at, de, dk, es, gr, hr, pl, pt, si, sm, tr) for applications which are "translations" of granted pct or ep applications
- d2, d3: artificial applications
- d,k,l,m,n: dummy for de-duplicating

other values are used temporarily to resolve minor problems that would otherwise have prevented the application to be recorded in docdb. see also section 4.4 "application replenishment".


**Default value**:  n/a 

**Source database**:  docdb 

**Page**: 90

Note:

- A: patent
- U: utility model
- F: design patent
- P: provisional application
- W: PCT application (in the international phase)
- T: used by some offices (e. g. AT, DE, DK, ES, GR, HR, PL, PT, SI, SM, TR) for applications which are "translations" of granted PCT or EP applications
- D2, D3: artificial applications
- D,K,L,M,N: dummy for de-duplicating
- Other values are used temporarily to resolve minor problems that would otherwise have prevented the application from being recorded in DOCDB
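The list above can be expressed as a small lookup, for instance to relabel a column after loading raw data. This is a sketch: the labels mirror the ones used in this notebook, and stripping whitespace assumes that one-letter kinds may be stored space-padded (the query in this section compares against values such as `"A "`):

```python
KIND_LABELS = {
    "A": "Patent",
    "U": "Utility model",
    "F": "Design patent",
    "P": "Provisional application",
    "W": "PCT application",
    "T": "Translation",
    "D2": "Artificial application",
    "D3": "Artificial application",
}
DEDUP_KINDS = {"D", "K", "L", "M", "N"}  # dummies used for de-duplicating


def appln_kind_label(appln_kind: str) -> str:
    """Return a readable label for an APPLN_KIND value."""
    kind = appln_kind.strip().upper()  # assumption: values may be space-padded
    if kind in DEDUP_KINDS:
        return "Dummy"
    # any other value is one of the temporary codes
    return KIND_LABELS.get(kind, "Temp")
```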

```python
query = """SELECT
  year,
  appln_auth,
  count(*) AS nb_kind,
  CASE
    WHEN appln_kind="A " THEN "Patent"
    WHEN appln_kind="U " THEN "Utility model"
    WHEN appln_kind="F " THEN "Design patent"
    WHEN appln_kind="P " THEN "Provisional application"
    WHEN appln_kind="W " THEN "PCT application"
    WHEN appln_kind="T " THEN "Translation"
    WHEN appln_kind="D2" OR appln_kind="D3" THEN "Artificial application"
    WHEN appln_kind="D "
  OR appln_kind="K "
  OR appln_kind="L "
  OR appln_kind="M "
  OR appln_kind="N " THEN "Dummy"
    ELSE "Temp"
  END AS kind
FROM
  raw.tls201_cp
GROUP BY
  year,
  appln_auth,
  kind
ORDER BY
  year,
  appln_auth,
  kind;"""
client.query(query).to_dataframe().to_csv(views_path+'201_ApplnKind_byYearAuth.csv')
```

In [23]:
df = pd.read_csv(views_path + '201_ApplnKind_byYearAuth.csv',
                 index_col=0)


In [25]:
# Overall ApplnKind
tmp = df.groupby('kind').sum()['nb_kind'].sort_values(ascending=False)
fig = bar_chart(tmp, title='Appln kind')
pio.write_image(fig, plots_path + '201_ApplnKind.png')


In [26]:
# ApplnKind by Year
tmp = df.dropna().set_index('year')
fig = stacked_bar_chart(tmp, 'kind', 'nb_kind', title='Appln kind by year')
pio.write_image(fig, plots_path + '201_ApplnKind_byYear.png',
                width=1200, height=800)


<font color=#1F618D>PC: Could look by country as well.</font>

## <font color=grey>APPLN_FILING_DATE</font>

In [27]:
print_markdown(documentation.desc_variable('APPLN_FILING_DATE'))

**Name**:  application filing date 

**Also known as**:  date of receipt 

**Description**:  date on which the application was physically received at the patent authority 

**Domain**:  date (up to 9999-12-31) 

**Default value**:  9999-12-31 

**Source database**:  docdb


**Page**: 85

##  APPLN_FILING_YEAR

In [28]:
print_markdown(documentation.desc_variable('APPLN_FILING_YEAR'))

**Name**:  year of the application filing date


**Also known as**:  n/a 

**Description**: 


**Domain**:  4 digits in the form yyyy (e. g. 2015) 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  derived from attribute appln_filing_date of table tls201_appln computed as
 format(appln_filing_date, 'yyyy')
source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 87
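The derivation above is a plain year extraction, so a filing date left at its default (`9999-12-31`) yields the year 9999. The sketch below shows one way to handle this when working with raw dates. Treating the default date as missing is a choice made here for illustration, not PATSTAT behaviour, and the function name is hypothetical:

```python
from datetime import date
from typing import Optional

DEFAULT_DATE = date(9999, 12, 31)  # documented default of APPLN_FILING_DATE


def filing_year(appln_filing_date: str) -> Optional[int]:
    """Return the year of an ISO date string, or None for the default date."""
    parsed = date.fromisoformat(appln_filing_date)
    return None if parsed == DEFAULT_DATE else parsed.year
```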

<font color=#1F618D>PC: Partitioning is derived from the `APPLN_FILING_YEAR`. Many prior queries can help us retrieve the distribution of patents per year at 0 cost.</font>

In [29]:
df = pd.read_csv(views_path + '/201_ApplnAuth_byYear.csv', index_col=0)

In [30]:
tmp = df.groupby('year').sum()['nb_patents']
fig = bar_chart(tmp, title='Appln filing by Year')
pio.write_image(fig, plots_path + '201_ApplnFilingYear_byYear.png',
               width=1200, height=800)

## <font color=grey>APPLN_NR_EPODOC</font>

In [31]:
print_markdown(documentation.desc_variable('APPLN_NR_EPODOC'))

**Name**:  application number in epodoc format 

**Also known as**:  epodoc application number 

**Description**:  number in epodoc format (containing letters and digits) which, if present - will uniquely identify an application. the number is created by the epo based on the docdb application number, application authority and application kind. 

**Domain**:  up to 20 ascii characters (typically, 13 - 14 characters)
explanation of the format, according to annex xi of the "exchange format" document of docdb, version 2.4.3 from 01.01.2013 basic structure of application and priority-numbers in data-format="epodoc" is
country
number
ccyy - century/year derived from application- or priority-date
nnnnnnn - serial number, leading zeroes when required
kind-code, when kind-code not = 'a'
extended structure for a number of countries
country [ "wo" when kind-code in data-format="docdb" is "w" ]
number
ccyy  century/year derived from application- or priority-date
xx  "other data"
nnnnn  serial number, leading zeroes when required
kind-code, when kind-code not = 'a'
"other data" may be
regional office, e.g. 'mi' when country = 'it' and regional office = milan
filing country, e.g. 'us' when country = 'wo' and filing country = us
...
length of the concatenated string is generally fixed at 13 characters or 14 when the kind-code is appended. strings exceeding a total of 13 or 14 may occur, when the number of significant digits exceeds the number of digits reserved for the serial number, e.g. de.
a special format applies to numbers that in data-format="docdb" have been suffixed with letters 'd' or 't' or
 country
'd' or 't' or 'x'
number
kind-code, when kind-code not = 'a' 

**Default value**:  empty (if not provided by docdb due to formatting issues) 

**Source database**:  docdb


**Page**: 94

<font color=#1F618D>PC: can be used to retrieve other variables (see p 96)</font>

## <font color=grey>APPLN_NR_ORIGINAL</font>

In [32]:
print_markdown(documentation.desc_variable('APPLN_NR_ORIGINAL'))

**Name**:  application number in original format 

**Also known as**:  original application number 

**Description**:  application number in original format as provided by the supplier. it is assumed that the number is as printed on the respective publications.
typically these numbers do not contain the country code. in about 10% of the applications no original application number is known.


**Domain**:  up to 100 characters


**Default value**:  empty 

**Source database**:  docdb


**Source field name**: 
1) source for the standard (= non-artificial) applications <exchapplication-reference data-format="original">
<document-id>
 <doc-number>11137814</doc-number>
</document-id> </exchapplication-reference>
if docdb does not provide an original application number in any of the publications of an application, then appln_nr_original will contain an empty string.
if docdb provides multiple conflicting original application numbers for the same application, then only one (= any of the conflicting) original application numbers should be stored. (note this is supposed to not happen, but may still occur due to data errors)
ep publications published after 2013-03-13, the application number is published in docdb with a check digit, i.e. 04801606.7. for sake of consistency with previous original application numbers, the check digit is removed in patstat.
 2) for all artificial applications the attribute appln_nr_original will contain an empty string.
source sub-field identifier data-format="original" source codes n/a 

**Comments**:  this attribute is useful to combine application data of patstat with other databases which also contains the original application number. the original application number is not necessarily unique within the same appln_auth and the same appln_kind (e.g. for patents and utility models). for example, the offices of us, jp, fr, ch, cs, it, su seem to have re-used their application numbers at least in some periods of time.


**Page**: 96

Note: Typically these numbers do not contain the country code. In about 10% of the applications no original application number is known.

Comment: This attribute is useful to combine application data of PATSTAT with other databases which also contain the original application number. Note that it is not necessarily unique within the same `APPLN_AUTH` and `APPLN_KIND`.
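When joining on the original number, an external source may still carry the EP check digit that PATSTAT removes (e.g. `04801606.7`). A minimal sketch of that normalization follows; the function name is hypothetical and the exact rule applied by PATSTAT is not specified beyond the example in the catalog:

```python
def strip_check_digit(appln_nr: str) -> str:
    """Drop a trailing ".d" check digit, e.g. "04801606.7" -> "04801606"."""
    head, sep, tail = appln_nr.rpartition(".")
    # only strip when the part after the last dot is a single digit
    if sep and len(tail) == 1 and tail.isdigit():
        return head
    return appln_nr
```

Numbers without a check digit are returned unchanged, so the function can be applied to a whole column before merging.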

## IPR_TYPE

In [33]:
print_markdown(documentation.desc_variable('IPR_TYPE'))

**Name**:  type of intellectual property right 

**Also known as**:  n/a 

**Description**:  type of intellectual property right 

**Domain**:  2 ascii characters pi, um, dp;
pi - patent of invention um - utility model dp - design patent 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  appln_auth, appln_kind, publn_kind source sub-field identifier n/a source codes if , or (appln_auth = 'fr' and appl_kind = 'a' and at least one related publications has a publn_kind = 'a3' or 'a4' or 'a7' or a8') then ipr_type = 'um' for utility model else if appln_kind = 'f ' and appln_auth is not 'fr' then ipr_type = 'dp' for design patent. for all other values of appln_kind, set ipr_type to 'pi' for patent of invention. note that in america, a patent of invention is known as a utility patent. this rule applies to all instances of appln_kind, whether it is derived from application-reference or a priority-reference.


**Comments**:  the rule to compute utility models and design patents does cover all major, but not necessarily all cases. the rule may be improved in the future. 

**Page**: 169

Note: Derived from `APPLN_KIND` (see p 169)
    
- PI: Patent of invention
- UM: Utility Model
- DP: Design Patent

<font color='orange'>*TODO*</font>: 

- Query by year, auth and ipr_type only.
- Clean code

In [35]:
df_ipr = pd.read_csv(views_path + '/ipr_yr_auth.csv', index_col=0)

In [36]:
df_ipr.query('appln_auth =="FR"').head()

Unnamed: 0,ipr_type,date,appln_auth,nb_ipr
3524,UM,2015.0,FR,87
3525,PI,2015.0,FR,2292
3526,UM,2014.0,FR,297
3527,PI,2014.0,FR,12236
3528,UM,2013.0,FR,486


In [37]:
def plot_ipr_type(df: pd.DataFrame, cnt: str, path: str = plots_path):
    """Save a stacked bar chart of applications by IPR type and year for authority `cnt`."""
    tmp = df.query('appln_auth == @cnt').dropna()
    try:
        cnt_name = pycountry.countries.get(alpha_2=cnt).name
    except (AttributeError, LookupError):
        # not an ISO 3166 country (e.g. a regional office): use the raw code
        cnt_name = cnt
    data = []
    ipr_types = tmp.groupby('ipr_type').sum()["nb_ipr"].sort_values(
        ascending=False).index.unique()
    for i, ipr in enumerate(ipr_types):
        data += [
            go.Bar(
                x=tmp.query('ipr_type == @ipr')["date"].values,
                y=tmp.query('ipr_type == @ipr')["nb_ipr"].values,
                name=ipr,
                # 3-color palette: one color per IPR type (PI, UM, DP)
                marker=dict(color=cl.flipper()['seq']['3']['Reds'][i], ))
        ]
    layout = go.Layout(
        barmode='stack',
        title='Composition of patent applications in {}'.format(cnt_name))
    fig = go.Figure(data=data, layout=layout)
    pio.write_image(fig, path + '/{}_ts_patents.png'.format(cnt))
    #py.iplot(fig, filename='stacked-bar', )

## <font color=grey>INTERNAT_APPLN_ID</font>

In [39]:
print_markdown(documentation.desc_variable('INTERNAT_APPLN_ID'))

**Name**:  application identification of the earlier pct international application for an application. 

**Also known as**:  n/a 

**Description**:  technical unique identifier without any business meaning


**Domain**:  9 999 999 

**Default value**:  0 

**Source database**:  docdb, patstat 

**Page**: 156

# Route of the application

<font color=orange>*TODO*</font>: 

- Summarize : http://www.wipo.int/pct/en/guide/ip03.html#_chapt3 and p 155 (5.07)
- Funnel chart ? From national to international even if the route goes the other way round. Still don't know what to do with that info.


```python
query="""SELECT
  year,
  appln_auth,
  int_phase,
  reg_phase,
  nat_phase,
  COUNT(*) AS nb_phase
FROM
  raw.tls201_cp
GROUP BY
  year,
  appln_auth,
  int_phase,
  reg_phase,
  nat_phase
ORDER BY
  year,
  appln_auth;"""

client.query(query).to_dataframe().to_csv(views_path + '201_Phases.csv')
```

In [40]:
df = pd.read_csv(views_path + '201_Phases.csv', index_col=0)

In [41]:
df.groupby(['int_phase', 'reg_phase', 'nat_phase']).sum()['nb_phase'].to_frame()

int_phase,reg_phase,nat_phase,nb_phase
False,False,False,5626
False,False,True,72110968
False,True,False,1765314
False,True,True,1853271
True,False,False,2798628
True,False,True,4971398
True,True,False,1399161
True,True,True,877567


In [47]:
df.groupby(['int_phase']).sum()['nb_phase'].to_frame()

int_phase,nb_phase
False,75740125
True,10050063


In [49]:
df.groupby(['reg_phase']).sum()[['int_phase', 'nb_phase']]

reg_phase,int_phase,nb_phase
False,3963.0,79888810
True,320.0,5895313


In [50]:
df.groupby(['nat_phase']).sum()[['int_phase', 'nb_phase']]

nat_phase,int_phase,nb_phase
False,2076.0,5968729
True,2254.0,79819269


### INT_PHASE

In [51]:
print_markdown(documentation.desc_variable('INT_PHASE'))

**Name**:  indicator whether the application is or has been in the international phase 

**Also known as**:  n/a 

**Description**:  indicates that an application is or has been in the international phase.
this covers all international filings at the receiving office as well as all applications based on these filings. 

**Domain**:  1 ascii character
- y: yes
- n: no
- space: not known (in case of uncertain interpretations; used very little or not at all)

**Default value**:  n 

**Source database**:  patstat 

**Source field name**:  derived from table tls201_appln
- y: if the application has appln_kind = w (i.e. international filing) or internat_appln_id > 0 (i.e. based on internat. application)
- n: otherwise

source sub-field identifier n/a

**Page**: 154

Note: These indicators provide a somewhat *simplistic* approach to identify the route an application has taken. This is the result of interpretations and assumptions for which no responsibility whatsoever can be accepted.

WARNING: These indicators only help to understand applications which actually exist in PATSTAT. It does not help to answer questions like “How many EP applications are valid in country x”, because not every office publishes patents which are validated / granted in their country. Consequently, there is no publication or application in PATSTAT for every granted patent. The same will apply for the Unitary Patents, if there is no publication for that (see p 154).
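The derivation rule for `INT_PHASE` quoted in the catalog entry is simple enough to restate in code. The sketch below returns a boolean, as in the aggregated views used in this notebook, and strips whitespace on the assumption that `appln_kind` may be space-padded:

```python
def int_phase(appln_kind: str, internat_appln_id: int) -> bool:
    """True if the application is or has been in the international phase."""
    # international filing at the receiving office
    is_international_filing = appln_kind.strip().upper() == "W"
    # application based on an earlier international application
    based_on_international = internat_appln_id > 0
    return is_international_filing or based_on_international
```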

### REG_PHASE

In [52]:
print_markdown(documentation.desc_variable('REG_PHASE'))

**Name**:  indicator whether the application is or has been in the regional phase 

**Also known as**:  n/a 

**Description**:  indicates that an application is or has been in the regional phase.


**Domain**:  1 ascii character
- y: yes
- n: no
- space: not known (in case of uncertain interpretations; used very little or not at all)


**Default value**:  n


**Source database**:  patstat; 

**Source field name**:  derived from tables tls201_appln , tls211_pat_publn and the http//www.epo.org/searching-for-patents/helpful-resources/raw-data/data/tables/regular.html )
- y: if the appln_kind <> w and (appln_auth is a regional office or (appln_auth is a member of an regional office and the publn_kind code indicates that the patent publication is the result of a regional phase))
- n: otherwise

source sub-field identifier n/a


**Comments**:  for explanation and disclaimer see attribute int_phase in section 6.57.


**Page**: 259

### NAT_PHASE

In [53]:
print_markdown(documentation.desc_variable('NAT_PHASE'))

**Name**:  indicator whether the application is in the national phase 

**Also known as**:  n/a 

**Description**:  indicates that an application is in the national phase.


**Domain**:  1 ascii character
- y: yes
- n: no
- space: not known (in case of uncertain interpretations; used very little or not at all)


**Default value**:  n


**Source database**:  patstat; 

**Source field name**:  derived from table tls201_appln
- y: if the application has appln_kind <> w and appln_auth is a national office
- n: otherwise

source sub-field identifier n/a


**Comments**:  for explanation and disclaimer see attribute int_phase in section 6.57.


**Page**: 213

# Data from priorities

## <font color=grey>EARLIEST_FILING_DATE</font>

In [54]:
print_markdown(documentation.desc_variable('EARLIEST_FILING_DATE'))

**Name**:  date of the earliest filing 

**Also known as**:  n/a 

**Description**:  the earliest date of the filing dates of the application itself, its paris convention priority applications, the applications with which it is related via technical relations and its application continuations. only directly related applications are considered; this is unlike the inpadoc family, where applications might also be indirectly related. 

**Domain**:  date (up to 9999-12-31) 

**Default value**:  9999-12-31 

**Source database**:  patstat 

**Source field name**:  derived from the tables
- tls201_appln: self-priority
- tls204_appln_prior: paris convention priority
- tls205_tech_rel: technical relations
- tls216_appln_cont: application continuations

source sub-field identifier n/a

**Comments**:  n/a


**Page**: 138

<font color=grey>Note: Only directly related applications are considered; this is unlike the INPADOC family, where applications might also be indirectly related.</font>

## <font color='grey'>EARLIEST_FILING_YEAR</font>

In [55]:
print_markdown(documentation.desc_variable('EARLIEST_FILING_YEAR'))

**Name**:  year of the earliest filing date 

**Also known as**:  n/a 

**Description**:  year of the earliest filing date 

**Domain**:  4 digits in the form yyyy (e. g. 2015) 

**Default value**:  n/a 

**Source database**:  patstat


**Source field name**:  derived from attribute earliest_filing_date of table tls201_appln computed as
 format(earliest_filing_date, 'yyyy') source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 140

## <font color=grey>EARLIEST_FILING_ID</font>

In [56]:
print_markdown(documentation.desc_variable('EARLIEST_FILING_ID'))

**Name**:  application id of the earliest filing 

**Also known as**:  first filing 

**Description**:  the id of the earliest application, considering the application itself, its paris convention priority applications, the applications with which it is related via technical relations and its application continuations. only directly related applications are considered; this is unlike the inpadoc family, where applications might also be indirectly related. 

**Domain**: 
surrogate key technical unique identifier without any business meaning 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  derived from the tables
- tls201_appln: self-priority
- tls201_appln: pct application
- tls204_appln_prior: paris convention priority
- tls205_tech_rel: technical relations
- tls216_appln_contn: application continuations

source sub-field identifier n/a

**Comments**: 
if multiple applications have been filed on the earliest filing date, then conceptually any of these applications can be regarded as the earliest application. nevertheless, the logic to determine the application which has been filed first is like this 1. if there is a pct application which was filed on the earliest application date, then the appln_id of this pct application is taken as the earliest_filing_id. 2. else if there are 1 or more paris convention priorities which were filed on the earliest application date, then the paris convention priority with the smallest appln_id is taken as the earliest_filing_id. 3. else the application which was filed on the earliest application date with the smallest appln_id will be taken.


**Page**: 139

<font color=grey>Note: If multiple applications have been filed on the earliest filing date, then conceptually any of these applications can be regarded as the earliest application. Nevertheless, the logic to determine the application which has been filed first is like this:

1. If there is a PCT application which was filed on the earliest application date, then the APPLN_ID of this PCT application is taken as the EARLIEST_FILING_ID.
2. Else: If there are 1 or more Paris convention priorities which were filed on the earliest application date, then the Paris convention priority with the smallest APPLN_ID is taken as the EARLIEST_FILING_ID.
3. Else: the application which was filed on the earliest application date with the smallest APPLN_ID will be taken.
</font>
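The tie-breaking logic above can be sketched as follows. The `Candidate` structure and its field names are hypothetical, introduced only for this illustration. When several PCT applications share the earliest date, the rule does not say which one is taken, so picking the smallest `appln_id` in that case is an assumption:

```python
from typing import Iterable, NamedTuple


class Candidate(NamedTuple):
    appln_id: int
    filing_date: str  # ISO format, so string order equals date order
    is_pct: bool = False
    is_paris_priority: bool = False


def earliest_filing_id(candidates: Iterable[Candidate]) -> int:
    """Apply rules 1 to 3 to the applications filed on the earliest date."""
    candidates = list(candidates)
    earliest = min(c.filing_date for c in candidates)
    tied = [c for c in candidates if c.filing_date == earliest]
    pct = [c.appln_id for c in tied if c.is_pct]
    if pct:  # rule 1
        return min(pct)
    paris = [c.appln_id for c in tied if c.is_paris_priority]
    if paris:  # rule 2
        return min(paris)
    return min(c.appln_id for c in tied)  # rule 3
```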

# Data from publications

<font color=orange>TODO</font>: 

- count by YEAR and AUTH
- avg(year) between EARLIEST_FILING_DATE and EARLIEST_PUBLN_DATE

## <font color='grey'>EARLIEST_PUBLN_DATE</font>

In [57]:
print_markdown(documentation.desc_variable('EARLIEST_PUBLN_DATE'))

**Name**:  date of earliest publication 

**Also known as**:  n/a 

**Description**: 


**Domain**:  date (up to 9999-12-31) 

**Default value**:  9999-12-31 

**Source database**:  patstat 

**Source field name**:  derived from table tls211_pat_publn

source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 142
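
The default value `9999-12-31` acts as a placeholder for "no publication date" and lies outside the range of pandas' nanosecond-resolution timestamps. A minimal sketch of one way to turn it into a missing value before computing anything on dates; `parse_publn_dates` and the sample series are hypothetical, and the dates are assumed to arrive as ISO `yyyy-mm-dd` strings.

```python
import pandas as pd

DEFAULT_DATE = "9999-12-31"  # default value reported by the catalog


def parse_publn_dates(raw: pd.Series) -> pd.Series:
    """Parse ISO date strings, mapping the default date to NaT."""
    as_text = raw.astype(str)
    # Blank out the placeholder first so it never reaches the parser
    cleaned = as_text.where(as_text != DEFAULT_DATE)
    return pd.to_datetime(cleaned, format="%Y-%m-%d", errors="coerce")


sample = pd.Series(["2015-06-30", "9999-12-31"])
parsed = parse_publn_dates(sample)
print(parsed)
```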

## <font color='grey'>EARLIEST_PUBLN_YEAR</font>

In [58]:
print_markdown(documentation.desc_variable('EARLIEST_PUBLN_YEAR'))

**Name**:  year of the earliest publication date 

**Also known as**:  n/a 

**Description**: 


**Domain**:  4 digits in the form yyyy (e. g. 2015) 

**Default value**:  n/a 

**Source database**:  patstat


**Source field name**:  derived from attribute earliest_publn_date of table tls201_appln, computed as `format(earliest_publn_date, 'yyyy')`

source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 143
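
The `format(earliest_publn_date, 'yyyy')` rule can be mimicked in pandas by keeping the first four characters of the ISO date. A small sketch on made-up values (`publn_year` is a hypothetical helper). Under this rule the default date `9999-12-31` would map to the year 9999, which is worth filtering out before plotting; whether the actual table behaves this way has not been checked here.

```python
import pandas as pd


def publn_year(dates: pd.Series) -> pd.Series:
    """Year of an ISO 'yyyy-mm-dd' date, as a 4-digit integer."""
    return dates.astype(str).str[:4].astype(int)


years = publn_year(pd.Series(["2015-06-30", "9999-12-31"]))
# Drop the placeholder year before any aggregation or plot
real_years = years[years != 9999]
print(years.tolist(), real_years.tolist())
```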

## <font color='grey'>EARLIEST_PAT_PUBLN_ID</font>

In [59]:
print_markdown(documentation.desc_variable('EARLIEST_PAT_PUBLN_ID'))

**Name**:  id of the earliest publication 

**Also known as**:  n/a 

**Description**:  the id of a publication published on the earliest publication date of an application 

**Domain**:  9 999 999 

**Default value**:  0 

**Source database**:  patstat


**Source field name**:  the earliest publication date is indicated by attribute earliest_publn_date of table tls201_appln. table tls211_pat_publn contains the publications with their id (attribute pat_publn_id).

source sub-field identifier n/a 

**Comments**:  if more than one publication is published on the same (earliest) publication date, then any one is selected.


**Page**: 141
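
A sketch of how the earliest publication per application could be derived from a publication table. The toy frame only mimics the attributes named above (`pat_publn_id`, `appln_id`, plus an assumed `publn_date` column). The catalog says that any publication may be selected when several share the earliest date; breaking the tie on the smallest `pat_publn_id` is an arbitrary choice made here for reproducibility.

```python
import pandas as pd

# Toy stand-in for tls211_pat_publn (values are made up)
publn = pd.DataFrame({
    "pat_publn_id": [501, 502, 503, 504],
    "appln_id": [1, 1, 1, 2],
    "publn_date": ["2001-03-01", "2001-03-01", "2003-07-15", "2010-01-05"],
})

earliest = (
    # ISO date strings sort chronologically; the id breaks ties
    publn.sort_values(["appln_id", "publn_date", "pat_publn_id"])
    .drop_duplicates("appln_id", keep="first")
    .rename(columns={
        "publn_date": "earliest_publn_date",
        "pat_publn_id": "earliest_pat_publn_id",
    })
    .reset_index(drop=True)
)
print(earliest)
```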

# Data derived from publications

<font color=orange>TODO</font>: 

- clean code

## GRANTED

In [60]:
print_markdown(documentation.desc_variable('GRANTED'))

**Name**:  "granted" indicator 

**Also known as**:  n/a 

**Description**:  "1" if there exists a publication of the grant; "0" otherwise 

**Domain**:  0 or 1 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  derived from attribute publn_first_grant of table tls211_pat_publn

source sub-field identifier n/a 

**Comments**: 
the same disclaimer as for attribute publn_first_grant applies
although the epo has taken great care in analysing the grant information, this process is the result of interpretations and assumptions for which no responsibility whatsoever can be accepted.


**Page**: 147

Note: This variable is the result of interpretations and assumptions for which no responsibility whatsoever can be accepted.

```python
query = """
SELECT
  granted,
  year,
  appln_auth,
  COUNT(*) AS nb_grant
FROM
  raw.tls201_cp
#WHERE
#  date>'1900-01-01'
GROUP BY
  granted,
  appln_auth,
  year
ORDER BY
  granted,
  year,
  appln_auth
  ;"""
# Run the aggregation on BigQuery and cache the result locally as a view
df_granted = client.query(query).to_dataframe()
df_granted.to_csv(os.path.join(views_path, 'grant_yr_auth.csv'))
```

In [62]:
df_granted = pd.read_csv(views_path + '/grant_yr_auth.csv', index_col=0)

In [63]:
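def plot_grant(df: pd.DataFrame, cnt: str, path: str = plots_path):
    """Save a stacked bar chart of granted vs. not granted applications per year for office `cnt`."""
    tmp = df.query('appln_auth == @cnt').dropna()
    try:
        cnt_name = pycountry.countries.get(alpha_2=cnt).name
    except (AttributeError, LookupError):
        # Code unknown to pycountry: fall back to the custom mapping
        # (cnt_a2_name is expected to be defined earlier in the notebook)
        cnt_name = cnt_a2_name[cnt]
    data = []

    for boo in [True, False]:
        # Opposite ends of the diverging RdBu scale: last colour if granted, first otherwise
        i = -1 if boo else 0
        sub = tmp.query('granted == @boo')
        data += [
            go.Bar(
                x=sub["year"].values,
                y=sub["nb_grant"].values,
                name='granted' if boo else 'not granted',
                marker=dict(color=cl.scales['3']['div']['RdBu'][i]))
        ]
    layout = go.Layout(
        barmode='stack', title='Patent grants in {}'.format(cnt_name))
    fig = go.Figure(data=data, layout=layout)
    pio.write_image(fig, os.path.join(path, '{}_ts_grants.png'.format(cnt)))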
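
Besides the stacked bars, the same aggregate can be turned into a share of granted applications per authority and year. A minimal sketch on a toy frame with the columns of the query above (`granted`, `year`, `appln_auth`, `nb_grant`); `grant_share` is a hypothetical helper and the numbers are made up. Since `granted` is 0 whenever no grant publication exists, applications still pending count as not granted.

```python
import pandas as pd

# Toy aggregate with the same columns as grant_yr_auth.csv (made-up counts)
toy = pd.DataFrame({
    "granted": [True, False, True, False],
    "year": [2000, 2000, 2001, 2001],
    "appln_auth": ["EP", "EP", "EP", "EP"],
    "nb_grant": [30, 70, 50, 50],
})


def grant_share(df: pd.DataFrame) -> pd.Series:
    """Share of granted applications per (appln_auth, year)."""
    # Works whether `granted` is stored as bool or as 0/1
    weighted = df.assign(granted_count=df["nb_grant"] * df["granted"].astype(int))
    totals = weighted.groupby(["appln_auth", "year"])[["granted_count", "nb_grant"]].sum()
    return (totals["granted_count"] / totals["nb_grant"]).rename("grant_share")


share = grant_share(toy)
print(share)
```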


# Family data

## <font color='grey'>DOCDB_FAMILY_ID</font>

In [64]:
print_markdown(documentation.desc_variable('DOCDB_FAMILY_ID'))

**Name**:  identifier of a docdb simple family 

**Also known as**:  docdb family id; simple family id 

**Description**: 
means that most probably the applications share exactly the same priorities (paris convention or technical relation or others) as in table tls201_appln, tls204_appln_prior, tls205_tech_rel and tls216_appln_contn. 

**Domain**:  number 0 .. 9 999 999;
a value 0 indicates that the application does not belong to any docdb family. this is only the case for the dummy application (appln_id = 0) and for artificial applications (appln_id ≥ 900 000 000) 

**Default value**:  0 

**Source database**:  docdb 

**Source field name**:  <exchange-document country="de" doc-number="10331291" kind="a1" family-id=" 33441709" date="20050217" is-representative="y" date-of-last-exchange="2006120611" date-of-previous-exchange="20050217" date-added-docdb="20050201" status="a">
source sub-field identifier family-id 

**Comments**: 
generally speaking, if two applications claim exactly the same prior applications as priorities (these can be e. g. paris convention priorities or technical relation priorities; for details see section 4.4.1 application replenishment for priorities), they are defined by the epo as belonging to the same docdb simple family. the epo reserves the right to classify an application into a particular simple family irrespective of this general rule -
the epo does this by creating artificial priorities for an application or by ignoring certain
 the simplified definition of the docdb family is that all their priorities must be the same. docdb family members generally refer to the same invention.
 the simple family is also at times used to attribute automatically the same cpc classification symbols and other attributes to their family members.
as a general rule, the value of the docdb_family_id will not change. it will be the same across editions of docdb and patstat. however, corrections to priority numbers or changes in the priority pictures (priority numbers changing from active to inactive or vice-versa) might lead to a change in the family-id of a given publication. see also section 4.3.2 stable ids


**Page**: 132

Note: 0 indicates that the application does not belong to any DOCDB family. This is only the case for the dummy application (APPLN_ID = 0) and for artificial applications (APPLN_ID ≥ 900 000 000). See more p 132.
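
Following the note above, a small sketch of how the dummy and artificial applications could be excluded before any family-level count. The frame is a toy stand-in for `tls201_appln`, and `ARTIFICIAL_APPLN_ID_MIN` is a name introduced here for the 900 000 000 threshold.

```python
import pandas as pd

ARTIFICIAL_APPLN_ID_MIN = 900_000_000  # threshold taken from the note above

# Toy stand-in for tls201_appln (values are made up)
appln = pd.DataFrame({
    "appln_id": [0, 12345, 67890, 900_000_001],
    "docdb_family_id": [0, 33441709, 33441709, 0],
})

has_family = appln["docdb_family_id"] != 0
is_real = (appln["appln_id"] != 0) & (appln["appln_id"] < ARTIFICIAL_APPLN_ID_MIN)

in_family = appln[has_family & is_real]
print(in_family)
```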

## <font color='grey'>INPADOC_FAMILY_ID</font>

In [65]:
print_markdown(documentation.desc_variable('INPADOC_FAMILY_ID'))

**Name**:  identifier of an inpadoc extended priority family 

**Also known as**:  inpadoc family id; extended family id 

**Description**:  means that the applications share a priority directly or indirectly via a third application. a 'priority' in this case means a link shown between applications as in tables tls201_appln (regional/national phase of a pct application), tls204_appln_prior (paris convention priorities), tls205_tech_rel (patents which have been technically linked by patent examiners on the basis of similar content) and table tls216_appln_contn (continuations, divisions etc.).


**Domain**:  number 0 .. 9 999 999; a value 0 indicates that the application does not belong to any inpadoc family. this is only the case for the dummy application (appln_id = 0) and for artificial applications replenished because of citations (i.e. appln_id 930 000 000) 

**Default value**:  0 

**Source database**:  this attribute is calculated during the preparation of patstat data. 

**Source field name**:  n/a source sub-field identifier n/a


**Page**: 152

Note: Much patent research is affected by the “family” concepts. There are various definitions of how to link different patents into “families”. This INPADOC extended priority family was developed by the INPADOC organisation before it was integrated into the EPO.
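
To make the difference between the two family concepts concrete, here is a toy sketch. It uses the simplified definitions only: a simple family groups applications whose priority sets are identical, an extended family groups applications linked directly or indirectly through any shared priority. The application and priority labels are invented, and the real DOCDB and INPADOC rules are richer than this.

```python
from collections import defaultdict

# Hypothetical applications and the priorities they claim
priorities = {
    "A": {"P1"},
    "B": {"P1"},
    "C": {"P1", "P2"},
    "D": {"P2"},
    "E": {"P3"},
}


def simple_families(prio):
    """Group applications whose priority sets are exactly the same."""
    groups = defaultdict(set)
    for appln, claimed in prio.items():
        groups[frozenset(claimed)].add(appln)
    return sorted(sorted(g) for g in groups.values())


def extended_families(prio):
    """Group applications connected through any chain of shared priorities."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Link every application to each priority it claims (union-find)
    for appln, claimed in prio.items():
        for p in claimed:
            parent[find(("appln", appln))] = find(("prio", p))

    groups = defaultdict(set)
    for appln in prio:
        groups[find(("appln", appln))].add(appln)
    return sorted(sorted(g) for g in groups.values())


print(simple_families(priorities))
print(extended_families(priorities))
```

In this example C sits alone in its simple family because its priority set differs from all others, yet it bridges A, B and D into a single extended family.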

## <font color='grey'>DOCDB_FAMILY_SIZE</font>

In [66]:
print_markdown(documentation.desc_variable('DOCDB_FAMILY_SIZE'))

**Name**:  size of docdb simple family 

**Also known as**:  n/a 

**Description**:  size of docdb simple family of a given application 

**Domain**: 


**Default value**:  n/a 

**Source database**:  patstat


**Source field name**:  derived from table tls201_appln
 source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 134
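
A sketch of how such a size could be recomputed, assuming it is simply the number of applications in `tls201_appln` that share the same `docdb_family_id` (the catalog entry above does not spell out the formula). The frame is a toy example.

```python
import pandas as pd

# Toy stand-in for tls201_appln (values are made up)
appln = pd.DataFrame({
    "appln_id": [1, 2, 3, 4],
    "docdb_family_id": [100, 100, 100, 200],
})

# Each application receives the member count of its own family
appln["docdb_family_size"] = (
    appln.groupby("docdb_family_id")["appln_id"].transform("count")
)
print(appln)
```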

<font color=orange>TODO</font>: 

- distribution (requires external info to trace the earliest filing appln such as YEAR and AUTH)

## <font color='grey'>NB_CITING_DOCDB_FAM</font>

<font color=orange>TODO</font>: 

- distribution (requires external info to trace the earliest filing appln such as YEAR and AUTH)
- interactions between family size and nb_citing (dist plot) https://plot.ly/ipython-notebooks/2d-kernel-density-distributions/

In [67]:
print_markdown(documentation.desc_variable('NB_CITING_DOCDB_FAM'))

**Name**:  number of forward citations on family level 

**Also known as**:  n/a 

**Description**:  number of distinct docdb simple families citing at least one of the publications or applications of the docdb simple family of the current application (search report citations from tls212_citation) 

**Domain**:  number 0 .. about 3.000 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  derived from table tls228_docdb_fam_citn
source sub-field identifier n/a 

**Comments**:  n/a


**Page**: 215
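
The description above amounts to counting distinct citing families per cited family. A sketch on a toy family-to-family citation table: the column names `docdb_family_id` (citing side) and `cited_docdb_family_id` are assumptions about the layout of `tls228_docdb_fam_citn`, and the values are made up.

```python
import pandas as pd

# Toy stand-in for tls228_docdb_fam_citn (column names are assumptions)
fam_citn = pd.DataFrame({
    "docdb_family_id": [10, 11, 11, 12],    # citing family
    "cited_docdb_family_id": [1, 1, 1, 2],  # cited family
})

# Distinct citing families per cited family
nb_citing = (
    fam_citn.groupby("cited_docdb_family_id")["docdb_family_id"]
    .nunique()
    .rename("nb_citing_docdb_fam")
)

# Attach the count to applications; families never cited get 0
appln = pd.DataFrame({"appln_id": [101, 102, 103], "docdb_family_id": [1, 2, 3]})
appln["nb_citing_docdb_fam"] = (
    appln["docdb_family_id"].map(nb_citing).fillna(0).astype(int)
)
print(appln)
```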

# Aggregated data

## <font color='grey'>NB_APPLICANTS</font>

<font color=orange>TODO</font>: distribution + Think about longitudinal dimension (YEAR, AUTH)


<font color=red>WARNING</font>: How to avoid duplicates ? -> Group by earliest_pat_publn_date, DOCDB family.

In [68]:
print_markdown(documentation.desc_variable('NB_APPLICANTS'))

**Name**:  number of applicants of an application according to the most recent publication 

**Also known as**:  n/a 

**Description**:  number of applicants of an application according to the most recent publication 

**Domain**:  number
about 250 

**Default value**:  n/a 

**Source database**:  patstat 

**Source field name**:  derived from table tls207_pers_appln
 source sub-field identifier n/a 

**Comments**:  only the latest known set of applicants is considered (e. g. from the latest publication)


**Page**: 214
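
One possible reading of the warning above: keep a single application per DOCDB family (here the one with the earliest publication date) before looking at the distribution, so that an invention filed in several offices is not counted several times. The frame is a toy example, and choosing the earliest-published member is an assumption rather than a settled choice.

```python
import pandas as pd

# Toy stand-in for tls201_appln (values are made up)
appln = pd.DataFrame({
    "appln_id": [1, 2, 3, 4],
    "docdb_family_id": [100, 100, 200, 200],
    "earliest_publn_date": ["2001-03-01", "2002-05-01", "2010-01-05", "2009-11-20"],
    "nb_applicants": [2, 2, 1, 3],
})

one_per_family = (
    # ISO date strings sort chronologically; appln_id breaks ties
    appln.sort_values(["docdb_family_id", "earliest_publn_date", "appln_id"])
    .drop_duplicates("docdb_family_id", keep="first")
)

# Number of families for each applicant count
distribution = one_per_family["nb_applicants"].value_counts().sort_index()
print(distribution)
```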

## <font color='grey'>NB_INVENTORS</font>

<font color=orange>TODO</font>: distribution + Think about longitudinal dimension (YEAR, AUTH)


<font color=red>WARNING</font>: How to avoid duplicates ? -> Group by earliest_pat_publn_date, DOCDB family.

In [69]:
print_markdown(documentation.desc_variable('NB_INVENTORS'))

**Name**:  number of inventors of an application according to the most recent publication 

**Also known as**:  n/a 

**Description**:  number of inventors of an application according to the most recent publication 

**Domain**:  nu 

**Default value**:  n/a 

**Source database**:  patstat


**Source field name**:  derived from table tls207_pers_appln
 source sub-field identifier n/a 

**Comments**:  only the latest known set of inventors is considered (e. g. from the latest publication)


**Page**: 216
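
Both NB_APPLICANTS and NB_INVENTORS are derived from `tls207_pers_appln`. A sketch of such a count, assuming that table carries one row per person and application with sequence numbers `applt_seq_nr` and `invt_seq_nr`, where a value above 0 marks the person as applicant or inventor respectively. These column names and the toy values are assumptions, and the "latest known set" restriction mentioned in the comments is not modelled here.

```python
import pandas as pd

# Toy stand-in for tls207_pers_appln (column names are assumptions)
pers_appln = pd.DataFrame({
    "person_id": [1, 2, 3, 3],
    "appln_id": [10, 10, 10, 11],
    "applt_seq_nr": [1, 0, 2, 1],
    "invt_seq_nr": [0, 1, 2, 1],
})

counts = (
    pers_appln.assign(
        nb_applicants=pers_appln["applt_seq_nr"] > 0,
        nb_inventors=pers_appln["invt_seq_nr"] > 0,
    )
    # Summing the boolean flags counts the persons in each role
    .groupby("appln_id")[["nb_applicants", "nb_inventors"]]
    .sum()
)
print(counts)
```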