# Data Story Outline

## Part 1. The Big Picture

In this first part, we answer the following questions:
- What does the typical patent holder look like today, and how has that profile changed since the 1990s?
- Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?
- How has the number of assignees and inventors evolved in this period?

### <mark>Approach (TO ADJUST) </mark>

- Question 1 can be a natural way to progress after e.g. a short introduction about patents themselves, giving the reader insight into the main sources of patent creation as well as a historical overview of the same metric.  
- Q1.1 focuses on how the data evolves through time from a geographic standpoint, so using maps to visualize the migration could be a good approach. Interactive graphs are discussed below (Q2) and could be an alternative here as well. 
    - One approach is to use several maps for successive time periods, in which the reader could see whether patterns of innovation centers emerge.
    - Another approach could be an interactive chart that lets the reader toggle between time periods, enabling comparison on a single map.
- Q1.2 can potentially provide a good segue into what we brought up in the Project's abstract: _"While innovation is often portrayed as the product of either one charismatic leader - or a ragtag team of geniuses - in reality we suspect that innovations, however important, happen in small steps supported by large networks of people."_
    - This will also be addressed in more detail in research question 2.
    - Various barplots or line plots could provide helpful tools for answering the first part of the question, while the same tools might also be an initial way to approach the second part, after segmenting the data into geographies. Maps could also be explored here.
    

## Part 2. Peeling back the layers

In this second part, we look at the following examples of innovation networks for specific patents held by the following specific assignees:
- companies : 
  - patents :
- academic institutions :
  - patents :
- governments :
  - patents :

By looking at these networks, we answer the following questions:
- If we take a few different types of companies / government bodies / academic institutions, and look at the network supporting some of their patents, what do these networks look like, in light of the networks seen in Part 1?
- Within the same companies, how have their networks evolved between the 1990s and today, if we look at patents similar to those above?
- Across companies of the same type, to what extent, if any, are their networks similar?

While the **data gathering** and **data preprocessing** were done mostly in parallel and in an iterative fashion, for simplicity we present the data gathering first and the data preprocessing second, each as in its final iteration. 

### <mark>Approach (TO ADJUST)</mark>

- For the geographically focused research question, a natural approach can be to explore the data using maps to visualize e.g. the innovation networks.  
- In HW2 we worked with **Folium**, which could be an interesting tool here, as we have access to clean longitude and latitude data (see preprocessing.ipynb). Folium offers, for example, clustering functionality, which could be a way to convey the magnitude of the networks in different geographic zones. 
    - Folium also gives us the option to add interactivity.
- To represent networks, an alternative could be to use a graph-based approach, where relationships between entities within a network could be represented by nodes and edges. The Python library **networkx**, which enables the creation of Network Graphs, is one option. 
    - Network representations could also be relevant to answer question 2.2, when exploring the citation of patents from similar companies. 
- Throughout the data story, **interactive graphs** could be particularly effective in making the plots self-sufficient and inviting exploration by the reader. Adding the element of time can enable comparing the development of innovation networks, and by hovering over parts of the map the reader can obtain more detailed information. 
    - The Python library **HoloViews** can, alongside networkx, enable the creation of interactive network graphs. 
    - Another interactive visualization library to explore is **Bokeh**.
- Questions Q2.1-2. could provide tangible examples for how prevalent (or not) patent networks are for well-known companies.
- Following these results with those of questions Q2.3-4 could give a good overview of the similarities and/or differences between the three major categories of organizations (academia, non-governmental and governmental organizations). This should be of interest to readers in various fields, and (hopefully) also to our peers (students) who are interested in research and innovation and are contemplating possible tracks for their careers.  
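
The graph-based approach mentioned in the list above can be illustrated without committing to a library yet. The following is a minimal sketch, assuming a citation network stored as directed `(citing, cited)` pairs; the patent labels are placeholders rather than real data, and the helper names `build_adjacency` and `citation_counts` are hypothetical. The same edge list could later be handed to networkx, whose `DiGraph.add_edges_from` method accepts an iterable of such pairs.

```python
# Placeholder citation edges: (citing patent, cited patent). Not real data.
edges = [
    ("P0", "P1"),
    ("P0", "P2"),
    ("P1", "P3"),
    ("P2", "P3"),
]


def build_adjacency(edge_list):
    """Map every node to the list of nodes it cites (empty list for leaves)."""
    adjacency = {}
    for citing, cited in edge_list:
        adjacency.setdefault(citing, []).append(cited)
        adjacency.setdefault(cited, [])  # make sure leaf nodes appear too
    return adjacency


def citation_counts(edge_list):
    """Count how many times each patent is cited (its in-degree)."""
    counts = {}
    for _, cited in edge_list:
        counts[cited] = counts.get(cited, 0) + 1
    return counts


adjacency = build_adjacency(edges)
counts = citation_counts(edges)
print(adjacency)
print(counts)
```

Keeping the network as a plain edge list leaves the choice of plotting library open, since both a node-link diagram and a map overlay can be derived from it.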

In summary, the visualization techniques we decide upon, as well as the amount of interactivity we add, if any, will have a big impact on the quality of the data story. The story is intended to eventually be published on a website generated by Jekyll, a static site generator.

# Data Gathering

## PatentsView Database

The database offers a wide range of features for all patents since 1976, which can be extracted through their API. 

The most relevant documentation can be found here:
- [Query Language Documentation](http://www.patentsview.org/api/query-language.html)
- [Field List](http://www.patentsview.org/api/patent.html)

For each research question, we came up with a list of required output **fields** which the API calls needed to return, as well as a list of input **filters** that limit the amount of extra pre-processing we had to do, while at the same time providing us with all the information necessary for us to cover the topics listed in the first section.

The complete list of **fields** is as follows:
- **Part 1**. For each patent in a given timeframe, we need:
  - `cited_patent_number`: Patent number of the cited patents.
  - `inventor_latitude`: Latitude of all the inventors as listed on the selected patent.
  - `inventor_longitude`: Longitude of all the inventors as listed on the selected patent.
  - `patent_type`: Category of patent (see below).
  - `assignee_organization`: Organization name, if assignee is organization.
  - `assignee_type`: Classification of assignee (see below).

- For **Part 2**, the list is the same, with the exception of the last two items, which are not needed for the second part. 

The `assignee_type` field allows us to have a good picture of the typical patent rights holder. The `assignee_type` categories are as follows:
- '2': US company or corporation
- '3': foreign company or corporation
- '4': US individual
- '5': foreign individual
- '6': US government
- '7': foreign government
- '8': country government
- '9': US state government
- '1x': part interest

We also include the `patent_type` field because we want to be able to distinguish between the major patent categories:
- 'Defensive Publication': "... an intellectual property strategy used to prevent another party from obtaining a patent on a product, apparatus or method for instance." ( [wikipedia](https://en.wikipedia.org/wiki/Defensive_publication) )
- 'Design': "... legal protection granted to the ornamental design of a functional item" ( [wikipedia](https://en.wikipedia.org/wiki/Design_patent) )
- 'Plant': covering any "new variety of plant" ( [wikipedia](https://en.wikipedia.org/wiki/Plant_breeders'_rights) )
- 'Reissue': correction of "a significant error in an already issued patent" ( [uslegal](https://definitions.uslegal.com/r/reissue-patent/) )
- 'Statutory Invention Registration': "for publishing patent applications on which they no longer felt they could get patents" ( [wikipedia](https://en.wikipedia.org/wiki/United_States_Statutory_Invention_Registration) )
- 'Utility': patent for a "useful" invention ( [wikipedia](https://en.wikipedia.org/wiki/Utility_(patent)) )

**Utility patents** are the most relevant for us. They protect "any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof" ( [U.S. Code § 101](https://www.law.cornell.edu/uscode/text/35/101) ). We chose to exclude the **reissue** patent type from our results, as reissues are not truly innovations, but only corrections of already issued patents. This category is thus omitted directly when calling the API.

We chose to query for the **latitudes** and **longitudes** instead of the location in city-state-country format, as coordinates are more useful for visualization purposes. Furthermore, our preliminary data analysis showed that the data in city-state-country format was not uniform (i.e., some cities were named in full, while others were abbreviated in different ways). The latitude and longitude data, on the other hand, is uniform, and its missing values are easier to clean than those in the name format.

We also needed to be able to query for patents based on the following **filters**:
- `patent_number`: US Patent number, as assigned by USPTO.
- `app_date`: Date the patent application was filed (filing date)

To situate our patent data in time, we chose to consider the date at which the patent application was filed, instead of the date at which it was granted. The reasoning behind this choice is that at the time of the patent application, the innovation supporting it already exists. So while using the application date will most probably reduce the amount of most-recent data we can work with, in our opinion it will paint a more vivid picture of innovation. Furthermore, as our analysis is performed on data spanning a few decades, we're confident this choice will not be detrimental to our story. 
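
To make the fields and filters above concrete, here is a minimal sketch of how a request could be assembled from them. It only builds the URL and does not contact the API. The endpoint in `BASE_URL`, the lowercase spelling of the `reissue` value, and the helper names `build_params` and `build_query_url` are assumptions for illustration, not taken from pipeline.py; the operators (`_and`, `_gte`, `_lt`, `_neq`) are those of the query language documented at the link above.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the PatentsView patents query API; verify against the docs.
BASE_URL = "https://www.patentsview.org/api/patents/query"

# Output fields listed for Part 1.
FIELDS = [
    "cited_patent_number",
    "inventor_latitude",
    "inventor_longitude",
    "patent_type",
    "assignee_organization",
    "assignee_type",
]


def build_params(start_date, end_date, page=1, per_page=10000):
    """Return the q/f/o parameters for patents filed in [start_date, end_date)."""
    query = {
        "_and": [
            {"_gte": {"app_date": start_date}},
            {"_lt": {"app_date": end_date}},
            # Assumed spelling of the reissue category value.
            {"_neq": {"patent_type": "reissue"}},
        ]
    }
    options = {"page": page, "per_page": per_page}
    return {
        "q": json.dumps(query),
        "f": json.dumps(FIELDS),
        "o": json.dumps(options),
    }


def build_query_url(start_date, end_date, page=1, per_page=10000):
    """Encode the parameters into a full request URL (hypothetical helper)."""
    return BASE_URL + "?" + urlencode(build_params(start_date, end_date, page, per_page))


url = build_query_url("1990-01-01", "1990-04-01")
print(url)
```

Filtering on `app_date` with a half-open interval means consecutive date ranges can share a boundary without returning the same patent twice.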

## Working with the PatentsView API

While the **PatentsView API** made it easy to extract data, the usage limits meant we had to implement the following functions to automate the incremental extraction and saving of all the data we needed.

**Each query returns at most 10,000 results per page and is capped at 10 pages. As a rule of thumb, to avoid hitting this limit when querying for all patent applications in a given date range, we keep the date range to about a quarter of a year.**
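
The rule of thumb above can be sketched as follows: split each year into four date ranges and work out how many pages a given result count would need. The function names `quarter_ranges` and `pages_needed` are hypothetical and are not taken from pipeline.py; the page size of 10,000 is the limit stated above.

```python
from datetime import date


def quarter_ranges(year):
    """Return four (start, end) ISO date pairs covering `year`; `end` is exclusive."""
    starts = [date(year, month, 1) for month in (1, 4, 7, 10)]
    ends = starts[1:] + [date(year + 1, 1, 1)]
    return [(s.isoformat(), e.isoformat()) for s, e in zip(starts, ends)]


def pages_needed(total_count, per_page=10000):
    """Number of pages required to fetch `total_count` results."""
    return -(-total_count // per_page)  # ceiling division


for start, end in quarter_ranges(1990):
    print(start, end)
print(pages_needed(25000))
```

If a quarter ever needed more pages than the cap allows, the same idea could be applied again with shorter ranges, such as months.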

The following implemented functions can be found in the **[pipeline.py](https://github.com/cmdavid-epfl/Project/blob/master/pipeline.py)** module.

- **patentsviewAPI**: puts together the query string, the output fields string and the options string, and then extracts and saves the data returned by the PatentsView API in JSON format. It calls the following functions:
  - query: 
  - get_data :
  
The saved JSON data has the following format:

- page number in format '1', '2', ...
  - 'patents': list of patents in page. For each patent:
    - 'patent_type'
    - 'inventors' : list of inventors listed for the patent. For each inventor:
      - 'inventor_key_id'
      - 'inventor_latitude'
      - 'inventor_longitude'
    - 'assignees': list of assignees listed for the patent. For each assignee:
      - 'assignee_key_id'
      - 'assignee_organization'
      - 'assignee_type'
    - 'cited_patents' : list of other patents cited by the current patent. For each cited patent:
      - 'cited_patent_number'
  - 'count': number of results in the page
  - 'total_patent_count': total number of patents referenced in the results  
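
As an illustration of the layout described above, the sketch below walks a small made-up sample and pulls out inventor coordinates and cited patent numbers. The sample values are invented for the example (including the assumption that coordinates arrive as strings or as missing values), and the helper names `inventor_coordinates` and `cited_numbers` are hypothetical, not the functions in pipeline.py.

```python
# Made-up sample in the layout described above; all values are illustrative.
saved = {
    "1": {
        "patents": [
            {
                "patent_type": "utility",
                "inventors": [
                    {"inventor_key_id": "1",
                     "inventor_latitude": "37.3852",
                     "inventor_longitude": "-122.114"},
                    {"inventor_key_id": "2",
                     "inventor_latitude": None,
                     "inventor_longitude": None},
                ],
                "assignees": [
                    {"assignee_key_id": "10",
                     "assignee_organization": "Example Corp",
                     "assignee_type": "2"},
                ],
                "cited_patents": [{"cited_patent_number": "1234567"}],
            }
        ],
        "count": 1,
        "total_patent_count": 1,
    }
}


def inventor_coordinates(data):
    """Collect (latitude, longitude) floats for inventors with usable coordinates."""
    coords = []
    for page in data.values():
        for patent in page["patents"]:
            for inventor in patent["inventors"]:
                lat = inventor.get("inventor_latitude")
                lon = inventor.get("inventor_longitude")
                if lat is None or lon is None:
                    continue  # skip inventors whose location is missing
                coords.append((float(lat), float(lon)))
    return coords


def cited_numbers(data):
    """Collect the patent numbers cited by every patent in every page."""
    return [
        cited["cited_patent_number"]
        for page in data.values()
        for patent in page["patents"]
        for cited in patent["cited_patents"]
    ]


print(inventor_coordinates(saved))
print(cited_numbers(saved))
```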


# Data Preprocessing

## Part 1.

The following functions in the **pipeline** module are specifically for the first part:
- **load_data**: converts the saved JSON data from the PatentsView API to the format that is most useful in answering the research topics (see below), by calling on the following functions:
  - get_full_year_data :
  - (load_)preprocess_data :

The preprocessed data has the following structure:




- get_ts :

In [None]:
from pipeline import load_data, get_ts

In [1]:
MY_PATH = '/media/dcm/HDD/ADA_DATA'
MIN_YEAR = 1990
MAX_YEAR = 2016
year_range = range(MIN_YEAR,MAX_YEAR + 1)

**Running this next line for the full time range (1990-2016), if no data is yet saved to disk, will take a few hours**

In [None]:
full_year_data = load_data(year_range, MY_PATH)

## Part 2.

The functions to get the data for part 2 are:
- load_layers(_data) :
  - get_layers_data :
    - get_cited_patents_data :
    - preprocess_layer_data :

In [2]:
from pipeline import load_layers_data

In [3]:
apple_example_data = load_layers_data(filename = 'Apple', patent_number = ['9430098'], layers = 4, data_dir = MY_PATH)

/media/dcm/HDD/ADA_DATA Apple_layer0.json
already on file
/media/dcm/HDD/ADA_DATA Apple_layer1.json
already on file
/media/dcm/HDD/ADA_DATA Apple_layer2.json
already on file
/media/dcm/HDD/ADA_DATA Apple_layer3.json
already on file
saving data


In [4]:
apple_example_data['0']

{'cited_patents': ['7345677', '8477463'],
 'inventors':          latitude  longitude
 1221243   37.9476   -122.525
 3146580   33.6846   -117.826
 3317742   37.3852   -122.114}

In [5]:
apple_example_data['1']

{'cited_patents': ['4317227',
  '5059959',
  '5194852',
  '5404458',
  '5412189',
  '5628031',
  '5638093',
  '5691959',
  '5717432',
  '5856820',
  '5986224',
  '6161434',
  '6167165',
  '6404353',
  '6549193',
  '6555235',
  '6724373',
  '6738051',
  '6891527',
  '5152401'],
 'inventors':          latitude  longitude
 1878402   48.8638    2.44845
 2212351   48.8566    2.35222
 2212352   48.8566    2.35222
 2824276   13.0423   77.61360
 503175    48.8130    2.23847}
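
The outputs above suggest that each layer lists the patents cited by the previous layer. As a rough sketch of that idea only, and not the actual `load_layers_data` implementation, the layers could be collected breadth-first as below. The lookup table `toy_citations` stands in for the API calls and uses placeholder patent labels; the function name `collect_layers` and the choice to skip patents already seen are assumptions.

```python
def collect_layers(root_patents, cited_by, layers):
    """Return {layer index as str: {'cited_patents': [...]}} for `layers` levels.

    `cited_by` maps a patent number to the list of patents it cites.
    """
    result = {}
    current = list(root_patents)
    seen = set(current)
    for depth in range(layers):
        cited = []
        for patent in current:
            for number in cited_by.get(patent, []):
                if number not in seen:  # assumption: each patent is kept once
                    seen.add(number)
                    cited.append(number)
        result[str(depth)] = {"cited_patents": cited}
        current = cited  # the next layer expands what this layer cited
    return result


# Toy citation table with placeholder labels (not real patent numbers).
toy_citations = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"]}
toy_layers = collect_layers(["A"], toy_citations, layers=3)
print(toy_layers)
```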

# Visualizations

## Part 1.

In [None]:
from visualizations import get_png, get_timeseries_fig

In [None]:
# Whole World
for year in year_range:
    get_png(full_year_data,'All',year)

In [None]:
# Asia
for year in year_range:
    get_png(full_year_data,'All',year, zoom_on = (25,120,3.5))

In [None]:
# Europe
for year in year_range:
    get_png(full_year_data,'All',year, zoom_on = (53,18,3.5))

In [None]:
# US
for year in year_range:
    get_png(full_year_data,'All',year, zoom_on = (40,-90,3.5))

In [None]:
# Top 10 Assignees US, Non-US
num_inventors_top10_us = []
num_inventors_top10_nonus = []

for year in year_range:
    num_inventors_top10_us.append(get_png(full_year_data,'US Assignees',year, k = 10))
    num_inventors_top10_nonus.append(get_png(full_year_data,'Non-US Assignees',year, k = 10))

In [None]:
import numpy as np

# time series figure
get_timeseries_fig(full_year_data, year_range, np.array(num_inventors_top10_us) / 1000, 
                   np.array(num_inventors_top10_nonus) / 1000)

## Part 2.

In [6]:
from visualizations import save_layers

In [9]:
save_layers(apple_example_data, 'apple', zoom_on = None, layered = True)