In [1]:
import requests
import json
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Figuring out the relevant data for each research question

## Q1/ How can we best identify and visualize different geographical innovation networks? Can we estimate the number of people in such networks?

For this first part, we need the following output **fields**:
- `patent_number`: US Patent number, as assigned by USPTO.
- `assignee_latitude`: Latitude for assignee's location as listed on the patent.
- `assignee_longitude`: Longitude for assignee's location as listed on the patent.
- `cited_patent_number`: Patent number of cited patent.
- `inventor_latitude`: Latitude of inventor's city as listed on the selected patent.
- `inventor_longitude`: Longitude of inventor's city as listed on the selected patent.
- `inventor_lastknown_latitude`: Latitude of inventor's city as of their most recent patent grant date.
- `inventor_lastknown_longitude`: Longitude of inventor's city as of their most recent patent grant date.
- `patent_type`: Category of patent (see below).

We chose to query for latitudes and longitudes rather than for locations in city-state-country format, as coordinates are more useful for visualization. Furthermore, our preliminary data analysis showed that the city-state-country data was not uniform (i.e., some cities were named in full, while others were abbreviated in different ways). The latitude and longitude data, on the other hand, is uniform, and its missing values are easier to clean.

We include the `patent_type` field because we want to be able to distinguish between the major patent categories (these will be detailed further):
- 'Defensive Publication'
- 'Design'
- 'Plant'
- 'Reissue'
- 'Statutory Invention Registration'
- 'Utility'

We will also need to be able to query for patents based on the following **filters**:
- `patent_number`: US Patent number, as assigned by USPTO.
- `app_date`: Date the patent application was filed (filing date)

For the purposes of situating our patent data in time, we chose to consider the date on which the patent application was filed, instead of the date on which it was granted. The reasoning behind this choice is that, at the time of the patent application, the innovation supporting it already exists. So while using the application date will most probably reduce the amount of recent data we can work with, in our opinion it will paint a more vivid picture of innovation. Furthermore, as our analysis is performed on data spanning a few decades, we're confident this choice will not be detrimental to our story.
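
As a rough illustration of the filing-date filter, the sketch below builds a date-range query as a plain dictionary. The `_and`, `_gte` and `_lte` operators reflect our understanding of the PatentsView query language; the helper name `build_date_query` is hypothetical, and the way **pipeline.py** actually assembles its query string is not shown here.

```python
import json


def build_date_query(app_date_from, app_date_to):
    """Return a query dict selecting patents filed within [app_date_from, app_date_to].

    Hypothetical helper: the operator names are assumed to match the
    PatentsView query language and should be checked against its documentation.
    """
    return {
        "_and": [
            {"_gte": {"app_date": app_date_from}},
            {"_lte": {"app_date": app_date_to}},
        ]
    }


query = build_date_query("2015-01-01", "2015-03-31")
# Serialized form, as it would be sent in the `q` parameter of a request
query_string = json.dumps(query)
print(query_string)
```

Keeping the query as a dictionary until the last moment makes it easy to add further criteria before serializing it.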

### Q1.1/ If we then take a few examples of different types of companies and look at the network of patents supporting their own patents, will these networks match up with the former innovation networks, or will they be more self-contained? In the latter case, can we estimate the number of people that make up these networks? Are these innovation networks concentrated around specific areas, or are they spread out?

This part does not require any output fields beyond those of the first case. It will, however, be necessary for us to query for the patents owned by specific companies. To this end, we need the following **filter**:
- `assignee_organization`: Organization name, if assignee is organization

The process we will follow in order to answer this question is as follows:
- Settle on a few examples of companies operating in the same industry (e.g., Facebook, Twitter, Snapchat or Apple, Samsung, Huawei).
- For each of the chosen companies, identify the patents they held at a given point in time
- Identify the network of innovators behind these patents
- Visualize the networks for each of the companies
- Analyze the results - see Q1.2

For this question, as well as the following ones, we will also try to design and implement an interactive way to visualize the results. 

### Q1.2/ Do similar companies use the same knowledge bases to innovate? For example, if we look at different social networking companies, will the networks supporting their patents be distinct? Will a given company's patents mostly cite its own previous patents, or will they tap outside innovation networks? On what scale?

The features necessary to answer these questions have been covered above. 
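
One possible way to quantify how self-contained a company's knowledge base is would be the share of its citations that point to its own patents. The sketch below uses made-up organizations and patent numbers; the function name `self_citation_share` and the two-table layout are assumptions for illustration, not part of the pipeline.

```python
import pandas as pd

# Toy data: owner of each patent, and which patents each one cites
owners = pd.Series({"A1": "Acme", "A2": "Acme", "B1": "Bolt"})
citations = pd.DataFrame({
    "patent_number": ["A2", "A2", "B1", "B1"],
    "cited_patent_number": ["A1", "B1", "A1", "X9"],
})


def self_citation_share(citations, owners):
    """Fraction of each organization's citations that point to its own patents."""
    df = citations.copy()
    df["citing_org"] = df["patent_number"].map(owners)
    # Cited patents outside the dataset get NaN here and never count as self-citations
    df["cited_org"] = df["cited_patent_number"].map(owners)
    df["is_self"] = df["citing_org"] == df["cited_org"]
    return df.groupby("citing_org")["is_self"].mean()


shares = self_citation_share(citations, owners)
print(shares)
```

A share close to 1 would suggest a self-contained network, while a share close to 0 would suggest that the company mostly taps outside innovation networks.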

### Q1.3/ What about if we look at university/academic knowledge bases and compare them with those of the companies analyzed above?

### Q1.4/ What about governmental or non-governmental organizations, or international agencies?

### Q1.5/ How have the innovation networks identified above evolved through time?

The process to be followed to answer these questions is similar to the process for **Q1.1**, with the exception that we need to be able to identify which patents belong to companies, organizations and governmental entities. To this end, we add the following output **fields** to our search queries:
- `assignee_type`: Classification of assignee.

The assignee classes are as follows:
- '2': US company or corporation
- '3': foreign company or corporation
- '4': US individual
- '5': foreign individual
- '6': US government
- '7': foreign government
- '8': country government
- '9': US state government
- '1x': part interest
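
The classes above can be turned into a small lookup table for labelling results. This is a minimal sketch: the function name is hypothetical, and it assumes that '1x' means a leading '1' followed by one of the base codes, which should be checked against the PatentsView documentation.

```python
ASSIGNEE_TYPE_LABELS = {
    "2": "US company or corporation",
    "3": "foreign company or corporation",
    "4": "US individual",
    "5": "foreign individual",
    "6": "US government",
    "7": "foreign government",
    "8": "country government",
    "9": "US state government",
}


def describe_assignee_type(code):
    """Map an assignee_type code to a readable label.

    Assumption: a two-character code starting with '1' denotes a part
    interest held by the class given by the second character.
    """
    code = str(code)
    if len(code) == 2 and code.startswith("1"):
        base = ASSIGNEE_TYPE_LABELS.get(code[1], "unknown")
        return f"part interest ({base})"
    return ASSIGNEE_TYPE_LABELS.get(code, "unknown")


print(describe_assignee_type("2"))
print(describe_assignee_type("13"))
```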

We will also need the following extra output **field**:
- `app_date`: Date the patent application was filed (filing date)

## Q2/ What does the typical patent-holder look like today (Corporation, Universities, Governments, Individuals), and how has that evolved throughout time / geographies?

### Q2.1/  Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?

### Q2.2/ How has the number of assignees and inventors evolved through time for different patent types? Are there significant differences in these numbers between different geographies?

The features necessary to answer these questions have been covered above. 

# Gathering and pre-processing the data

The **pipeline.py** file contains the functions used to prepare the data for our analysis. The functions that are used at this level are:
- The **patentsviewAPI()** function puts together the query string, the output fields string and the options string, and then extracts and saves the data returned by the PatentsView API in json format. The saved json data is of the following format (if there are no extra output fields added to the query):


    - page number in format '1', '2', ...
      - 'patents': list of patents in page. For each patent:
        - 'patent_number'
        - 'patent_type'
        - 'inventors' : list of inventors listed for the patent. For each inventor:
          - 'inventor_latitude'
          - 'inventor_longitude'
          - 'inventor_lastknown_latitude'
          - 'inventor_lastknown_longitude'
          - 'inventor_key_id'
        - 'assignees': list of assignees listed for the patent. For each assignee:
          - 'assignee_latitude'
          - 'assignee_longitude'
          - 'assignee_organization'
          - 'assignee_type'
          - 'assignee_key_id'
        - 'applications'
          - 'app_date'
          - 'app_id'
        - 'cited_patents' : list of other patents cited by the current patent. For each cited patent:
          - 'cited_patent_number'
      - 'count': number of results in the page
      - 'total_patent_count': total number of patents referenced in the results


- The **json_to_pandas()** function converts the saved JSON data from the PatentsView API to the format that is most convenient for us: a dictionary containing the following Pandas DataFrames:
  - `date`: associates each patent_number with its application date.
  - `patent_type`: the number of patents in each patent_type category.
  - `assignee_type`: the number of patents in each assignee_type category.
  - `assignee_organization`: the number of patents belonging to each organization in the dataset.
  - `cited_patent_number`: the number of citations for each patent cited by a patent in the dataset.
  - `inventor_location`: inventor locations as stated in the patent application, indexed by patent_number.
  - `inventor_lastknown_location`: inventor locations as stated in their most recent patent, indexed by patent number.
  - `assignee_location`: assignee locations as stated in the patent application, indexed by patent number.
  
  This function will make analyzing the 'raw' data more straightforward. Using this function to analyze the query results, we will then be able to specify the data cleaning process that is needed.
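
To illustrate the kind of flattening involved, the sketch below converts one toy page in the nested format described above into flat tables with `pd.json_normalize`. It is not the actual **json_to_pandas()** implementation, which lives in **pipeline.py** and is not shown here; the sample values are made up.

```python
import pandas as pd

# Toy page following the nested layout described above (values are made up)
page = {
    "patents": [
        {
            "patent_number": "1",
            "patent_type": "utility",
            "inventors": [
                {"inventor_latitude": "40.7", "inventor_longitude": "-74.0", "inventor_key_id": "11"},
                {"inventor_latitude": None, "inventor_longitude": None, "inventor_key_id": "12"},
            ],
        },
        {
            "patent_number": "2",
            "patent_type": "design",
            "inventors": [
                {"inventor_latitude": "0.1", "inventor_longitude": "0.1", "inventor_key_id": "13"},
            ],
        },
    ],
    "count": 2,
}

# One row per (patent, inventor) pair, carrying the patent-level fields along
inventors = pd.json_normalize(
    page["patents"], record_path="inventors", meta=["patent_number", "patent_type"]
)
inventor_location = inventors.set_index("patent_number")[
    ["inventor_latitude", "inventor_longitude"]
]

# Number of patents in each patent_type category
type_counts = pd.DataFrame(page["patents"])["patent_type"].value_counts()
print(inventor_location)
print(type_counts)
```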
  
**The API returns at most 10,000 results per page, capped at 10 pages per query. To avoid hitting this limit when querying for all patent applications in a given date range, as a rule of thumb we keep the date range to about a quarter of a year.**
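
A year can be split into quarter-sized date ranges with a few lines of standard-library code. The helper name `quarter_ranges` is hypothetical; this is a sketch of the rule of thumb, not part of **pipeline.py**.

```python
from datetime import date, timedelta


def quarter_ranges(year):
    """Return (first_day, last_day) ISO date strings for the four quarters of a year."""
    ranges = []
    for start_month in (1, 4, 7, 10):
        start = date(year, start_month, 1)
        # First day of the following quarter, rolling over into the next year after Q4
        if start_month == 10:
            next_start = date(year + 1, 1, 1)
        else:
            next_start = date(year, start_month + 3, 1)
        end = next_start - timedelta(days=1)
        ranges.append((start.isoformat(), end.isoformat()))
    return ranges


for start, end in quarter_ranges(2015):
    print(start, end)
```

Each pair could then be used as the start and end of one date-range query, so that a full year is covered by four separate queries.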

# Example of analysis on a sample of data

## get and pre-process data

In [3]:
from pipeline import patentsviewAPI, json_to_pandas

In [None]:
datafile = patentsviewAPI(filename = '2015_q1', filepath = 'data',
                          app_date_from = '"2015-01-01"', app_date_to = '"2015-03-31"')

In [4]:
data = json_to_pandas('data/2015_q1.json')

## Analysis

In [5]:
data.keys()

dict_keys(['date', 'patent_type', 'assignee_type', 'assignee_organization', 'cited_patent_number', 'inventor_location', 'inventor_lastknown_location', 'assignee_location'])

Check the `date` dataframe. There should not be any inconsistencies here, as our query was based on the dates, so inconsistent dates would not have been returned.

In [None]:
data['date'].min().values[0], data['date'].max().values[0]

In [None]:
# NaN never equals itself, so `== float('nan')` always counts 0; isna() counts both None and NaN entries
data['date'].isna().values.sum()

Check the `patent_type` dataframe.

In [6]:
data['patent_type']

| | count |
|---|---|
| utility | 47695 |
| design | 7218 |
| plant | 251 |
| reissue | 94 |


**Utility patents** are the most relevant for us. They are patents 

Check the `assignee_type` dataframe.

In [None]:
data['assignee_type']

While the NaN count in the `assignee_type` dataframe is negligible compared to the total count of assignees, there is an issue: assignee_types 4 and 5 refer to, respectively, US and foreign individuals. The total number of individual assignees is thus much smaller than the NaN assignee count. Before removing the NaN values, we will need to investigate further what they may represent.

Check the `assignee_organization` dataframe.

In [None]:
print(data['assignee_organization'].head(10))
print(data['assignee_organization'].tail(10))

In [None]:
for assignee in data['assignee_organization'].index.values[1:]:
    if 'sony' in assignee.lower():
        print(assignee)

The NaN count should normally correspond to the total number of assignees with an individual `assignee_type`. In this case, though, we notice that it corresponds almost perfectly to the sum of the individual-assignee count and the NaN count in the `assignee_type` dataframe.

While the organization names are definitely not in a standardized format, this will not be a problem, as we can simply use the list we obtain from querying for dates to get a list of the names used in the database.

In [None]:
sum(data['assignee_type']['count'].iloc[2:5]), sum(data['assignee_organization'].iloc[0])

Check `cited_patent_number` dataframe.

In [None]:
print(data['cited_patent_number'].head(10))
print(data['cited_patent_number'].tail(10))

In [None]:
len(data['cited_patent_number']) - 1

In [None]:
data['cited_patent_number'].describe()

The NaN count here refers to the number of patent citations by the patents in our dataset which do not have a patent number. This is to say that the NaN count is an upper bound on the number of cited patents with no corresponding patent number. As the number of total cited patents is almost 100 times larger than the NaN count, we judge that we can safely disregard them and have them be skipped in the data cleaning process. Otherwise, we see that all entries are numeric, which suggests that the format is entirely uniform.

Check `inventor_location` dataframe.

In [None]:
data['inventor_location'].dtypes

The first thing to do will be to convert the latitudes and longitudes from objects to floats, while performing the data pre-processing.
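
A minimal sketch of that conversion, using `pd.to_numeric` on a toy dataframe with made-up values. With `errors="coerce"`, anything that cannot be parsed as a number, including `None`, becomes NaN.

```python
import pandas as pd

# Toy stand-in for an object-typed location dataframe
raw = pd.DataFrame(
    {
        "inventor_latitude": ["40.7", None, "0.1"],
        "inventor_longitude": ["-74.0", None, "0.1"],
    },
    index=pd.Index(["1", "1", "2"], name="patent_number"),
)

# Convert every column to floats; unparseable entries become NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
print(numeric)
```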

In [None]:
data['inventor_location'].describe()

In [None]:
len(data['assignee_location'])

In [None]:
# isna() counts both None and NaN entries (NaN never compares equal to itself)
data['assignee_location'].isna().values.sum()

In [None]:
sum(data['assignee_location'].values == '0.1')

The major issue here is that almost 10% of the latitude and longitude data is either ('None','None') or ('0.1','0.1'). As our goal is to come up with a way to 'visualize' the dynamics of innovation, we need to discard this data. 
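
Discarding these rows could look like the sketch below. It assumes that a pair of exactly (0.1, 0.1) is always a placeholder rather than a real location; the function name `drop_placeholder_coordinates` and the sample values are made up for illustration.

```python
import pandas as pd


def drop_placeholder_coordinates(df, lat_col, lon_col):
    """Return float coordinates with missing and (0.1, 0.1) placeholder rows removed."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lon = pd.to_numeric(df[lon_col], errors="coerce")
    missing = lat.isna() | lon.isna()
    placeholder = (lat == 0.1) & (lon == 0.1)
    # Use a plain boolean array so that duplicate patent numbers in the index are harmless
    keep = (~(missing | placeholder)).to_numpy()
    return pd.DataFrame(
        {lat_col: lat.to_numpy()[keep], lon_col: lon.to_numpy()[keep]},
        index=df.index[keep],
    )


raw = pd.DataFrame(
    {
        "assignee_latitude": ["35.6", None, "0.1", "48.9"],
        "assignee_longitude": ["139.7", None, "0.1", "2.35"],
    },
    index=pd.Index(["1", "2", "3", "3"], name="patent_number"),
)
cleaned = drop_placeholder_coordinates(raw, "assignee_latitude", "assignee_longitude")
print(cleaned)
```

Comparing the length of the cleaned dataframe with the original gives the share of rows that were discarded.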

Check `inventor_location` and `inventor_lastknown_location` dataframes.

In [None]:
data['inventor_location'].dtypes, data['inventor_lastknown_location'].dtypes

In [None]:
# isna() counts both None and NaN entries; the second value counts the '0.1' placeholder
data['inventor_location'].isna().values.sum(), sum(data['inventor_location'].values == '0.1')

In [None]:
# isna() counts both None and NaN entries; the second value counts the '0.1' placeholder
data['inventor_lastknown_location'].isna().values.sum(), sum(data['inventor_lastknown_location'].values == '0.1')

To be able to produce a more in-depth analysis of the  