In [1]:
%load_ext autoreload
%autoreload 2

**Import and meta**

In [2]:
from google.cloud import bigquery
from open_patstat.plots import bar_chart, stacked_bar_chart, pie_chart
from open_patstat.documentation.shortdoc import ShortDoc, print_markdown, print_signature

import plotly.plotly as py
import plotly.graph_objs as go
import plotly.io as pio
import plotly.figure_factory as ff
import pandas as pd
import os
import pycountry
import colorlover as cl

In [3]:
data_path = '../data/'
plots_path = '../plots/'
views_path = '../views/'

[Credentials]:https://cloud.google.com/bigquery/docs/quickstarts/quickstart-client-libraries#client-libraries-install-python

To instantiate the client, set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the JSON file that contains your service account key. Follow the steps described [here][credentials].
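Before creating the client, it can help to confirm that the variable is set and points to an existing file. The sketch below uses only the standard library; `credentials_status` is a hypothetical helper name, not part of `google.cloud.bigquery`.

```python
import os


def credentials_status(env=os.environ):
    """Return 'unset', 'missing file' or 'ok' for GOOGLE_APPLICATION_CREDENTIALS.

    Hypothetical helper: it only inspects the environment and the file system,
    it does not validate the content of the service account key.
    """
    path = env.get('GOOGLE_APPLICATION_CREDENTIALS')
    if not path:
        return 'unset'
    if not os.path.isfile(path):
        return 'missing file'
    return 'ok'


print(credentials_status())
```

If the status is not `ok`, `bigquery.Client()` is unlikely to find the service account key through this variable.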

In [4]:
client = bigquery.Client()
documentation = ShortDoc()
techfield_df = pd.read_csv(data_path + 'tls901_part01_short.txt')
ipcnace_df = pd.read_csv(data_path + 'tls902_part01_short.txt')

---

**What is this notebook for?**

The classification tables (mainly `TLS209_APPLN_IPC` and `TLS224_APPLN_CPC`) assign patent applications to broad technological fields. This notebook covers the basics (the "101") of the PatStat classification tables.

More precisely:

- we report descriptions and encoding details from the `Data Catalog for Patstat` (5.07)
- we provide summary statistics and visualizations (when relevant)

**Data**

In [5]:
print_markdown(documentation.version())

**Author**: European Patent Office

**Distribution**: EPO PATSTAT customers

**Patstat edition**: 2016 Spring Edition

**Version**: 5.07

**Date**: 01.04.2016

**Output**

[db-logo]: https://aem.dropbox.com/cms/content/dam/dropbox/www/en-us/branding/app-dropbox-windows@2x.png.transform/half-res/img.png
[db]:https://www.dropbox.com/sh/b1gs90gtoduu02v/AABICBNYH2kysjX-4JTqee0Wa?dl=0

`plots/` and `views/` are not available on our GitHub repository. However, interested readers can access those files on our dedicated Dropbox.

[![alt text][db-logo]][db]

---

**REMARKS**:

- There are only 70e6 applications with at least one IPC classification. NB: the difference with the total number of applications might be partly explained by replenished applications, which are unlikely to be classified
- Each application can be assigned several IPC classifications
- When a given application has more than one classification, for the sake of simplicity we allocate a weight of `1/nb_occurences` to each of its IPC classifications

<font color=#1F618D>WARNING: for the sake of simplicity, we only consider the first 4 characters of the `ipc_class_symbol` (the subclass level). This leads to some (presumably minor) approximations. Future, more rigorous studies should not do so.</font>
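The weighting and truncation described in the remarks can be sketched on a toy example. This is a minimal illustration in plain Python rather than the BigQuery pipeline used below. The rows are invented, and `nb_occurences` is assumed to be the number of classification rows attached to an application.

```python
from collections import defaultdict

# Hypothetical toy rows: (appln_id, ipc_class_symbol)
rows = [
    (1, 'a61k  31/00'),
    (1, 'c07d 405/06'),
    (2, 'a61k  31/00'),
    (3, 'h04q   7/32'),
    (3, 'h04q   7/38'),
    (3, 'a61k  31/00'),
    (3, 'c07k  14/00'),
]


def fractional_counts(rows):
    """Weight each classification by 1/nb_occurences and sum by 4-character symbol."""
    # Assumption: nb_occurences = number of classification rows per application
    nb_occurences = defaultdict(int)
    for appln_id, _ in rows:
        nb_occurences[appln_id] += 1
    counts = defaultdict(float)
    for appln_id, symbol in rows:
        # Only the first 4 characters (the subclass) are kept
        counts[symbol[:4]] += 1 / nb_occurences[appln_id]
    return dict(counts)


counts = fractional_counts(rows)
print(counts)
```

With this weighting, every application contributes exactly 1 in total, so the weights sum to the number of classified applications (3 in the toy data).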

# International Patent Classification

In [6]:
print_markdown(documentation.desc_table('TLS209_APPLN_IPC'))

**Description**: Tls209_appln_ipc: international patent classification (catalog section 5.8). The table contains all international patent classifications linked to the applications. The set of classifications linked to a single application is a de-duplicated merge of all classifications of the various publication instances linked to the specific application. Additionally, only the latest version of the ipc classifications is used. This means that the user does not have to worry about reclassifications because older applications will always be classified according to the latest ipc version.

**Page**: 47

In [7]:
table_ref = client.dataset('raw').table('tls209')
table = client.get_table(table_ref)
schema = table.schema
schema = [[schema[i].name, schema[i].field_type, schema[i].mode] for i in range(len(schema))]
pd.DataFrame(schema, columns=['Name', 'Type', 'Mode']).set_index('Name')

| Name | Type | Mode |
|---|---|---|
| appln_id | INTEGER | REQUIRED |
| ipc_class_symbol | STRING | NULLABLE |
| ipc_class_level | STRING | NULLABLE |
| ipc_version | DATE | NULLABLE |
| ipc_value | STRING | NULLABLE |
| ipc_position | STRING | NULLABLE |
| ipc_gener_auth | STRING | NULLABLE |


## IPC_CLASS_SYMBOL

In [8]:
print_markdown(documentation.desc_variable('IPC_CLASS_SYMBOL'))

**Name**:  ipc classification symbol (ipc 8th edition) 

**Also known as**:  (ipc) class, (ipc) classification 

**Description**:  classification symbol according to the international patent classification, eighth edition (entered into force january 1, 2006)


**Domain**:  up to 15 characters (a-z, 0-9, /, space) as allowed by ipc; examples: `a61k`, `h04q   7/32`, `c07k  14/00`, `c07d 405/06`, `h01m2220/20`. note that spaces may be required on positions 5-7, because the slash "/" is always on the 9th position. for more details see the table below.

**Default value**:  n/a 

**Source database**:  docdb 

**Page**: 162
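The fixed layout described in the domain (subclass on positions 1-4, slash always on the 9th position) can be sketched as a small parser. `split_ipc_symbol` is a hypothetical helper written for illustration; it is not part of `open_patstat`.

```python
def split_ipc_symbol(symbol):
    """Split a fixed-layout symbol into (subclass, main_group, subgroup).

    Hypothetical helper. Symbols classified at subclass level only
    (e.g. 'a61k') return None for main group and subgroup.
    """
    symbol = symbol.rstrip()
    subclass = symbol[:4]
    if len(symbol) <= 4:
        return subclass, None, None
    # The slash is expected on the 9th position (index 8)
    if len(symbol) < 10 or symbol[8] != '/':
        raise ValueError('expected "/" on the 9th position: {!r}'.format(symbol))
    main_group = symbol[4:8].strip()
    subgroup = symbol[9:].strip()
    return subclass, main_group, subgroup


for example in ['a61k', 'h04q   7/32', 'c07k  14/00', 'c07d 405/06', 'h01m2220/20']:
    print(example, '->', split_ipc_symbol(example))
```

A symbol whose padding spaces were collapsed (for instance `'h04q 7/32'`) raises a `ValueError`, which makes whitespace damage visible early.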

<font color='orange'>*TODO*:

- Extend analysis to NACE2 (use `ipcnace_df`)
- Extend analysis at the country level (country specific patterns are likely) </font>

```python
query = """
SELECT
  SUM(1/nb_occurences) AS nb,
  SAFE.SUBSTR(ipc_class_symbol, 0, 4) as short_ipc_symbol,
  appln_auth,
  year
FROM
  `raw.tls2019_cp`
GROUP BY
  short_ipc_symbol,
  appln_auth,
  year;
"""
client.query(query).to_dataframe().to_csv(views_path + '2019_ipcClassSymbol_byAuthYr.csv')
```

In [9]:
df = pd.read_csv(views_path + '2019_ipcClassSymbol_byAuthYr.csv', index_col=0)
df = df.merge(techfield_df, how='left', left_on='short_ipc_symbol', right_on='ipc_maingroup_symbol')





In [10]:
# Distribution of (weighted) applications across technical sectors
fig = pie_chart(df.groupby('techn_sector')['nb'].sum(), title='Applications by technical sector')
pio.write_image(fig, plots_path + '2019_technSector.png')

In [11]:
# Time series of applications for each technical sector
# (symbols without a matching sector, i.e. NaN after the left merge, are skipped)
by_sector_year = df.groupby(['techn_sector', 'year'])['nb'].sum()
for techsect in df['techn_sector'].dropna().unique():
    fig = bar_chart(by_sector_year.loc[techsect],
                    title='Applications in {} (sector)'.format(techsect))
    pio.write_image(fig, plots_path +
                    '2019_technSector_byYr/2019_{}_technSector_byYr.png'.format(str(techsect).replace(' ', '')))
#py.iplot(fig)

In [12]:
# Time series of applications for each technical field
# (symbols without a matching field, i.e. NaN after the left merge, are skipped)
by_field_year = df.groupby(['techn_field', 'year'])['nb'].sum()
for techfield in df['techn_field'].dropna().unique():
    fig = bar_chart(by_field_year.loc[techfield],
                    title='Applications in {} (field)'.format(techfield))
    pio.write_image(fig, plots_path +
                    '2019_technField_byYr/2019_{}_technField_byYr.png'.format(str(techfield).replace(' ', '')))
#py.iplot(fig)

In [13]:
# ts techsect by techfield
for techsect in df['techn_sector'].unique():
    tmp = df.query('techn_sector==@techsect').set_index('year')[['techn_field','nb']]
    tmp_top = tmp.groupby('techn_field').sum().sort_values('nb', ascending=False)
    if len(tmp_top)>11:
        top10 = tmp_top.index[:10]
        tmp.loc[~tmp['techn_field'].isin(top10), "techn_field"] = 'others'
    tmp = tmp.groupby(['techn_field', 'year']).sum().reset_index('techn_field')
    tmp = tmp.merge(tmp.groupby('year').sum(), how='left',
              left_on='year', right_on='year', suffixes=('','_yr'))
    tmp['share'] = tmp['nb']/tmp['nb_yr']
    fig = stacked_bar_chart(tmp, clusters='techn_field', values='share', colors=('div', 'Spectral'), 
                            title='Applications in {} by field'.format(techsect),
                            legend=dict(orientation="h"), width=900, height=900)
    #py.iplot(fig)  
    pio.write_image(fig, plots_path + '2019_technField_byYr/2019_{}_technSectorField_byYr.png'.format(str(techsect).replace(' ','')))
    

## IPC_CLASS_LEVEL

In [14]:
print_markdown(documentation.desc_variable('IPC_CLASS_LEVEL'))

**Name**:  ipc classification level indicator 

**Also known as**:  n/a 

**Description**:  denotes whether an authority classified either in the full ipc, in main groups or in sub classes only. 

**Domain**:  1 character
- a = classification in the full ipc, e.g. 'h04q   7/32', 'c07k  14/00'
- c = classification in main groups only, e.g. 'h04h   1/00', 'a61k  31/00'
- s = classification in subclasses only, e.g. 'h04h', 'a61k'


**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<classifications-ipcr>
<classification-ipcr sequence="1">
 <text>a43c  11/00        20060101cfi20070118bhus        </text>
</classification-ipcr>
<classification-ipcr sequence="2">
 <text>a43c  11/00        20060101afi20070118bhus        </text>
</classification-ipcr>
</classifications-ipcr>

source sub-field identifier: position 28 of the source field
......12345678901234567890123456789012345678901234567890
<text>a43c  11/00        20060101cfi20070118bhus        </text>
these text strings are all 50 bytes long. see wipo st.8. take byte 28 as the value of ipc_class_level.

**Comments**:  see the description of table tls209_appln_ipc on how the ipc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.


**Page**: 161
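The catalog entries in this section locate several fields by their position in the 50-byte text string: version on 20-27, level on 28, position on 29, value on 30, generating office on 41-42. A minimal sketch of that slicing follows. `parse_st8` is a hypothetical helper, and taking the symbol from positions 1-15 is an assumption based on the "up to 15 characters" domain of `ipc_class_symbol`.

```python
from datetime import datetime

# Example string from the catalog, rebuilt to its full 50 bytes
ST8_TEXT = 'a43c  11/00' + ' ' * 8 + '20060101cfi20070118bhus' + ' ' * 8


def parse_st8(text):
    """Slice a 50-byte ST.8 string into the tls209 columns (hypothetical helper)."""
    if len(text) != 50:
        raise ValueError('expected 50 characters, got {}'.format(len(text)))
    return {
        # Assumption: the symbol occupies positions 1-15
        'ipc_class_symbol': text[0:15].rstrip(),
        # Positions 20-27: version indicator, yyyymmdd
        'ipc_version': datetime.strptime(text[19:27], '%Y%m%d').date(),
        # Position 28: classification level (a, c or s)
        'ipc_class_level': text[27],
        # Position 29: first or later position (f, l or space)
        'ipc_position': text[28],
        # Position 30: classification value (i or n)
        'ipc_value': text[29],
        # Positions 41-42: generating office (ST.3 code)
        'ipc_gener_auth': text[40:42],
    }


record = parse_st8(ST8_TEXT)
print(record)
```

Python slices are 0-based and end-exclusive, hence `text[27]` for byte 28 and `text[19:27]` for bytes 20 to 27.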

```python
query = """
SELECT
  SUM(1/nb_occurences) AS nb,
  ipc_class_level,
  appln_auth,
  year
FROM
  `raw.tls2019_cp`
GROUP BY
  ipc_class_level,
  appln_auth,
  year;
"""
client.query(query).to_dataframe().to_csv(views_path + '2019_ipcClassLevel_byAuthYr.csv')
```

In [15]:
df = pd.read_csv(views_path + '2019_ipcClassLevel_byAuthYr.csv', index_col=0)

In [16]:
# One stacked bar chart per authority with more than 1e6 (weighted) classifications
for cnt in df['appln_auth'].unique():
    df_cnt = df.query('appln_auth==@cnt').dropna()
    if df_cnt['nb'].sum() > 1e6:
        fig = stacked_bar_chart(df_cnt.set_index('year'),
                                clusters='ipc_class_level', values='nb',
                                title='IPC class Levels in {}'.format(cnt))
        pio.write_image(fig, plots_path + '2019_ipcLevel_byYrAuth/2019_{}_ipcClassLevel_byAuthYr.png'.format(cnt))

## IPC_VERSION

In [17]:
print_markdown(documentation.desc_variable('IPC_VERSION'))

**Name**:  ipc version 

**Also known as**:  n/a 

**Description**:  version of the ipc 

**Domain**:  date between '2006-01-01' and current date 

**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<classifications-ipcr>
<classification-ipcr sequence="1">
 <text>a43c  11/00        20060101cfi20070118bhus        </text>
</classification-ipcr>
<classification-ipcr sequence="2">
 <text>a43c  11/00        20060101afi20070118bhus        </text>
</classification-ipcr>
</classifications-ipcr>

source sub-field identifier: positions 20 to 27, version indicator in yyyymmdd date format


**Comments**:  see wipo st.8 for an explanation.
see the description of table tls209_appln_ipc on how the ipc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.


**Page**: 168

## IPC_VALUE

In [18]:
print_markdown(documentation.desc_variable('IPC_VALUE'))

**Name**:  classification value 

**Also known as**:  invention / additional; inventive/non-inventive


**Description**:  indication of the value of the classification i.e. is the class symbol relating to the invention or to aspects not related to the invention (but in the application). 

**Domain**:  1 character i=invention, n=additional (non-invention) 

**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<classifications-ipcr>
<classification-ipcr sequence="1">
 <text>a43c  11/00        20060101cfi20070118bhus        </text>
</classification-ipcr>
<classification-ipcr sequence="2">
 <text>a43c  11/00        20060101afi20070118bhus        </text>
</classification-ipcr>
</classifications-ipcr>

source sub-field identifier: position 30, classification value (inventive or non-inventive): i, n

**Comments**:  see wipo st.8 for an explanation.
see the description of table tls209_appln_ipc on how the ipc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.
invention related ipc symbols are printed bold on the front page of patent documents, according to wipo standard st.10/c.


**Page**: 167

```python
query = """
SELECT
  SUM(1/nb_occurences) AS nb,
  ipc_value,
  appln_auth,
  year
FROM
  `raw.tls2019_cp`
GROUP BY
  ipc_value,
  appln_auth,
  year;
"""
client.query(query).to_dataframe().to_csv(views_path + '2019_ipcValue_byAuthYr.csv')
```

In [19]:
df = pd.read_csv(views_path + '2019_ipcValue_byAuthYr.csv', index_col=0)

In [20]:
# One stacked bar chart per authority with more than 1e6 (weighted) classifications
for cnt in df['appln_auth'].unique():
    df_cnt = df.query('appln_auth==@cnt').dropna()
    if df_cnt['nb'].sum() > 1e6:
        fig = stacked_bar_chart(df_cnt.set_index('year'),
                                clusters='ipc_value', values='nb',
                                title='IPC value in {}'.format(cnt))
        pio.write_image(fig, plots_path + '2019_ipcValue_byYrAuth/2019_{}_ipcValue_byYr.png'.format(cnt))

## <font color='grey'>IPC_POSITION</font>

In [21]:
print_markdown(documentation.desc_variable('IPC_POSITION'))

**Name**:  first or later position of symbol 

**Also known as**:  n/a 

**Description**:  indicates the position of the class symbol in the sequence of classes that form the classification 

**Domain**:  1 character f=first, l=later. space =unidentified 

**Default value**:  space 

**Source database**:  docdb 

**Source field name**: 
<classifications-ipcr>
<classification-ipcr sequence="1">
 <text>a43c  11/00        20060101cfi20070118bhus        </text>
</classification-ipcr>
<classification-ipcr sequence="2">
 <text>a43c  11/00        20060101afi20070118bhus        </text>
</classification-ipcr>
</classifications-ipcr>

if there is a space in <classification-ipcr> in position 29, then record a space in patstat in ipc_position.
source sub-field identifier: position 29, first or later position of symbol: f, l

**Comments**:  see wipo st.8 for an explanation.
for patent authorities (e. g. uspto) where the law entails the concept of "first" class, the first class symbol in a list of class symbols is the main class. for other authorities, like the epo, there is no meaning in the position - classes may be quoted in alphabetical order for instance. some researchers use a weighting technique to analyse by ipc.


**Page**: 166

## <font color='grey'>IPC_GENER_AUTH</font>

In [22]:
print_markdown(documentation.desc_variable('IPC_GENER_AUTH'))

**Name**:  ipc generating authority 

**Also known as**:  n/a 

**Description**:  patent office that generated the ipc classification of the application concerned 

**Domain**:  2 ascii characters (a-z), according to wipo st.3


**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<classifications-ipcr>
<classification-ipcr sequence="1">
 <text>a43c  11/00        20060101cfi20070118bhus        </text>
</classification-ipcr>
<classification-ipcr sequence="2">
 <text>a43c  11/00        20060101afi20070118bhus        </text>
</classification-ipcr>
</classifications-ipcr>

source sub-field identifier: positions 41-42, generating office: aa, zz (st.3)

**Comments**:  see wipo st.8. see the description of table tls209_appln_ipc on how the ipc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.


**Page**: 164

# Cooperative Patent Classification

In [23]:
print_markdown(documentation.desc_table('TLS224_APPLN_CPC'))

**Description**: Tls224_appln_cpc: cooperative patent classification (catalog section 5.20). The table contains all cooperative patent classifications linked to the applications. The set of classifications linked to a single application is a de-duplicated merge of all classifications of the various publication instances linked to the specific application.
From a statistical point of view it is important to remember that cpc codes are propagated to all members of the same docdb family (simple family).

**Page**: 64
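Because CPC codes are propagated to all members of a DOCDB family, counting applications per symbol counts the same classification several times for large families. The toy sketch below contrasts the two units of count. The rows are invented, and the `docdb_family_id` column is assumed to come from a join with the application table, which this notebook does not perform.

```python
# Hypothetical toy rows: (appln_id, docdb_family_id, cpc_class_symbol)
rows = [
    (1, 100, 'b60v   1/16'),
    (2, 100, 'b60v   1/16'),  # same family as application 1
    (3, 100, 'b60v   1/16'),  # same family as application 1
    (4, 200, 'b60v   1/16'),
    (5, 300, 'g06f   9/06'),
]


def count_by_symbol(rows, unit):
    """Count distinct applications or distinct families per CPC symbol."""
    index = {'application': 0, 'family': 1}[unit]
    # A (symbol, unit id) pair is counted once, whatever its number of rows
    seen = {(row[2], row[index]) for row in rows}
    counts = {}
    for symbol, _ in seen:
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts


per_application = count_by_symbol(rows, 'application')
per_family = count_by_symbol(rows, 'family')
print(per_application)
print(per_family)
```

In the toy data, `b60v   1/16` is attached to four applications but only two families, which is the kind of gap the catalog warns about.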

In [24]:
table_ref = client.dataset('raw').table('tls224')
table = client.get_table(table_ref)
schema = table.schema
schema = [[schema[i].name, schema[i].field_type, schema[i].mode] for i in range(len(schema))]
pd.DataFrame(schema, columns=['Name', 'Type', 'Mode']).set_index('Name')

| Name | Type | Mode |
|---|---|---|
| appln_id | INTEGER | REQUIRED |
| cpc_class_symbol | STRING | NULLABLE |
| cpc_scheme | STRING | NULLABLE |
| cpc_version | DATE | NULLABLE |
| cpc_value | STRING | NULLABLE |
| cpc_position | STRING | NULLABLE |
| cpc_gener_auth | STRING | NULLABLE |


## <font color='grey'>CPC_CLASS_SYMBOL</font>

In [25]:
print_markdown(documentation.desc_variable('CPC_CLASS_SYMBOL'))

**Name**:  cpc classification symbol 

**Also known as**:  cpc class, cpc classification, cpc symbol 

**Description**:  classification symbol according to the cooperative patent classification 

**Domain**:  up to 19 characters (a-z, 0-9, /, space); all values which are allowed by the cpc; corresponds to positions 1 - 19 (i.e. section, class, subclass, main group, subgroup) of the 50 character long text string as defined by wipo st.8, with trailing spaces removed. examples: `a61k`, `h04q   7/32`, `c07k  14/00`, `c07d 405/06`, `h01m2220/20`. note that spaces may be required on positions 5-7, because the slash "/" is always on the 9th position. for more details see the table below.

**Default value**:  n/a 

**Source database**:  docdb 

**Page**: 122

## <font color='grey'>CPC_SCHEME</font>

In [26]:
print_markdown(documentation.desc_variable('CPC_SCHEME'))

**Name**:  classification scheme 

**Also known as**:  n/a 

**Description**:  the two schemes are:
- cpc: cpc symbol allocated by the epo or the uspto
- cpcno: cpc symbol allocated by the national office

**Domain**:  up to 5 ascii characters; cpc or cpcno 

**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<patent-classification sequence="1">
 <classification-scheme office="ep" scheme="cpc">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>g06f  17/30233     </classification-symbol>
 <symbol-position>f</symbol-position>
 <classification-value>i</classification-value>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>
<patent-classification sequence="2">
 <classification-scheme office="ep" scheme="cpcno">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>g06f   9/06        </classification-symbol>
 <classification-value>i</classification-value>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <generating-office>gb</generating-office>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>

source sub-field identifier n/a

**Comments**:  

**Page**: 127
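To make the source field more concrete, the sketch below reads the two example records with the standard library `xml.etree.ElementTree`. The mapping from XML elements to `tls224` column names is an assumption made for illustration, as is the `<patent-classifications>` wrapper around the two records; `parse_cpc` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the catalog example, wrapped in a single root element
SAMPLE = """
<patent-classifications>
  <patent-classification sequence="1">
    <classification-scheme office="ep" scheme="cpc"><date>20130101</date></classification-scheme>
    <classification-symbol>g06f  17/30233     </classification-symbol>
    <symbol-position>f</symbol-position>
    <classification-value>i</classification-value>
  </patent-classification>
  <patent-classification sequence="2">
    <classification-scheme office="ep" scheme="cpcno"><date>20130101</date></classification-scheme>
    <classification-symbol>g06f   9/06        </classification-symbol>
    <classification-value>i</classification-value>
    <generating-office>gb</generating-office>
  </patent-classification>
</patent-classifications>
"""


def parse_cpc(xml_text):
    """Return one dict per <patent-classification> (assumed column mapping)."""
    records = []
    root = ET.fromstring(xml_text.strip())
    for node in root.iter('patent-classification'):
        scheme = node.find('classification-scheme')
        records.append({
            'cpc_scheme': scheme.get('scheme'),
            'cpc_version': scheme.findtext('date'),
            # Trailing spaces are removed, inner padding is kept
            'cpc_class_symbol': node.findtext('classification-symbol').rstrip(),
            'cpc_value': node.findtext('classification-value'),
            # Defaults mirror the catalog: space for position, empty for authority
            'cpc_position': node.findtext('symbol-position', default=' '),
            'cpc_gener_auth': node.findtext('generating-office', default=''),
        })
    return records


records = parse_cpc(SAMPLE)
for record in records:
    print(record)
```

The first record (scheme `cpc`) carries a position but no generating office, while the second (scheme `cpcno`) carries a generating office but no position.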

## <font color='grey'>CPC_VERSION</font>

In [27]:
print_markdown(documentation.desc_variable('CPC_VERSION'))

**Name**:  cpc version 

**Also known as**:  n/a 

**Description**:  version of the cpc 

**Domain**:  date between '2013-01-01' and current date 

**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<patent-classifications>
<patent-classification sequence="1">
 <classification-scheme office="ep" scheme="cpc">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>b60v   1/16        </classification-symbol>
 <classification-value>i</classification-value>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>

source sub-field identifier n/a

**Comments**: 
see the description of table tls224_appln_cpc on how the cpc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.


**Page**: 129

## <font color='grey'>CPC_VALUE</font>

In [28]:
print_markdown(documentation.desc_variable('CPC_VALUE'))

**Name**:  classification value 

**Also known as**:  invention / additional 

**Description**:  indication of the value of the classification i.e. is the class symbol relating to the invention or to aspects not related to the invention (but in the application). 

**Domain**:  1 character; i=invention a=additional (non-invention)


**Default value**:  n/a 

**Source database**:  docdb 

**Source field name**: 
<patent-classifications>
<patent-classification sequence="1">
 <classification-scheme office="ep" scheme="cpc">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>b60v   1/16        </classification-symbol>
 <classification-value>i</classification-value>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>

source sub-field identifier n/a

**Comments**:  see the description of table tls224_appln_cpc on how the cpc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat.
 

**Page**: 128

## <font color='grey'>CPC_POSITION</font>

In [29]:
print_markdown(documentation.desc_variable('CPC_POSITION'))

**Name**:  first or later position of cpc symbol 

**Also known as**:  n/a 

**Description**:  indicates the position of the class symbol in the sequence of classes that form the classification. first / later indications are only available for cpc symbols allocated by the epo or uspto. 

**Domain**:  1 character; f = first, l = later, space = unidentified


**Default value**:  space 

**Source database**:  docdb

**Source field name**: 
<patent-classifications>
<patent-classification sequence="1">
 <classification-scheme office="ep" scheme="cpc">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>b60v   1/16        </classification-symbol>
 <classification-value>i</classification-value>
 <symbol_position>l</symbol_position>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>

<symbol_position>l</symbol_position>
this field is only available for scheme "cpc". this field is not used with scheme "cpcno".
source sub-field identifier n/a


**Page**: 125

## <font color='grey'>CPC_GENER_AUTH</font>

In [30]:
print_markdown(documentation.desc_variable('CPC_GENER_AUTH'))

**Name**:  cpc generating authority 

**Also known as**:  n/a 

**Description**:  patent office that classified the application with a cpc symbol


**Domain**:  up to 2 characters (a-z) or spaces;
- empty/spaces (when scheme is cpc, i.e. ep / us are assigning the cpc symbols)
- values according to wipo st.3 (when scheme is cpcno)

**Default value**:  n/a

**Source database**:  docdb

**Source field name**: 
<patent-classification sequence="2">
 <classification-scheme office="ep" scheme="cpcno">
  <date>20130101</date>
 </classification-scheme>
 <classification-symbol>b60v   1/16        </classification-symbol>
 <classification-value>i</classification-value>
 <classification-status>b</classification-status>
 <classification-data-source>h</classification-data-source>
 <generating-office>gb</generating-office>
 <action-date>
  <date>20130101</date>
 </action-date>
</patent-classification>

<generating-office>gb</generating-office>
this field is only used for scheme "cpcno". this field is not used with scheme "cpc".
source sub-field identifier n/a

**Comments**:  see the description of table tls224_appln_cpc on how the cpc symbols, which are allocated in docdb to publications, are de-duplicated and assigned to applications in patstat
application. this should be the authority of this application.


**Page**: 124
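The catalog ties `cpc_position` and `cpc_gener_auth` to `cpc_scheme`: the position is only filled for scheme `cpc`, and the generating authority only for scheme `cpcno`. A minimal consistency check, written as a sketch under that reading of the catalog (`check_cpc_row` is a hypothetical helper):

```python
def check_cpc_row(cpc_scheme, cpc_position, cpc_gener_auth):
    """Return a list of rule violations (empty when the row looks consistent)."""
    problems = []
    if cpc_scheme == 'cpc':
        # EPO / USPTO allocation: no generating authority expected
        if cpc_gener_auth.strip():
            problems.append('cpc_gener_auth should be blank for scheme cpc')
    elif cpc_scheme == 'cpcno':
        # National office allocation: no first/later position expected
        if cpc_position.strip():
            problems.append('cpc_position should be blank for scheme cpcno')
        auth = cpc_gener_auth.strip()
        if not (len(auth) == 2 and auth.isalpha()):
            problems.append('cpc_gener_auth should be a 2-letter code for scheme cpcno')
    else:
        problems.append('unknown cpc_scheme: {!r}'.format(cpc_scheme))
    return problems


print(check_cpc_row('cpc', 'f', ''))      # consistent row
print(check_cpc_row('cpcno', 'f', ''))    # two violations
```

Running such a check on a sample of `tls224` rows is one way to see how closely the data follows the documented rules.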