In [1]:
import requests
import json
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Figuring out the relevant data for each research question

## Q1/ How can we best identify and visualize different geographical innovation networks? Can we estimate the number of people in such networks?

For this first part, we need the following output **fields**:
- `patent_number`: 
- `assignee_latitude`
- `assignee_longitude`
- `cited_patent_number`: Patent number of cited patent
- `inventor_latitude`
- `inventor_longitude`
- `inventor_lastknown_latitude`:
- `inventor_lastknown_longitude`:
- `patent_type`:

We chose to query for latitudes and longitudes rather than for locations in city-state-country format, as coordinates are more useful for visualization. Furthermore, our preliminary data analysis showed that the city-state-country data was not uniform (i.e., some cities were named in full, while others were abbreviated in different ways). The latitude and longitude data, on the other hand, is uniform, and its missing values are easier to clean than those of the other format.

The reason for including the `patent_type` field is that we want to be able to distinguish between major patent categories, i.e. (this will be detailed further)
- 'Defensive Publication'
- 'Design'
- 'Plant'
- 'Reissue'
- 'Statutory Invention Registration'
- 'Utility'

We will also need to be able to query for patents based on the following **filters**:
- `patent_number`
- `app_date`

To situate our patent data in time, we chose to use the date on which the patent application was filed rather than the date on which the patent was granted. The reasoning behind this choice is that the innovation supporting a patent already exists at the time of the application. Using the application date will most probably reduce the amount of recent data we can work with, presumably because recently filed applications may not have been granted yet, but in our opinion it will paint a more vivid picture of innovation. Furthermore, as our analysis is performed on data spanning a few decades, we are confident this choice will not be detrimental to our story.
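As a rough illustration of how the fields and filters above could be expressed, the sketch below assembles a request URL in the `q`/`f`/`o` parameter style of the legacy PatentsView query API. The endpoint URL and the helper name `build_query_url` are assumptions made for illustration; the actual request logic lives in **pipeline.py** and may differ.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the legacy PatentsView API; check the current API docs.
BASE_URL = "https://api.patentsview.org/patents/query"


def build_query_url(date_from, date_to, fields, page=1, per_page=10000):
    """Hypothetical helper: build a query URL filtered on app_date."""
    # q: filter on the application (filing) date range.
    q = {"_and": [{"_gte": {"app_date": date_from}},
                  {"_lte": {"app_date": date_to}}]}
    # o: paging options.
    o = {"page": page, "per_page": per_page}
    params = {"q": json.dumps(q), "f": json.dumps(fields), "o": json.dumps(o)}
    return BASE_URL + "?" + urlencode(params)


# f: the output fields listed for Q1.
fields = ["patent_number", "patent_type", "cited_patent_number",
          "assignee_latitude", "assignee_longitude",
          "inventor_latitude", "inventor_longitude",
          "inventor_lastknown_latitude", "inventor_lastknown_longitude"]

url = build_query_url("2015-01-01", "2015-05-31", fields)
```

The URL is only built here, not requested, so the sketch can be run offline.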

### Q1.1/ If we then take a few examples of different types of companies and look at the network of patents supporting their own patents, will these networks match up with the former innovation networks, or will they be more self-contained? In the latter case, can we estimate the number of people that make up these networks? Are these innovation networks concentrated around specific areas, or are they spread out?

This part does not require any output fields beyond those of the first case. On the other hand, we need to be able to query for the patents owned by specific companies. To this end, we need the following **filter**:
- `assignee_organization`

The process we will follow in order to answer this question is as follows:
- Settle on a few examples of companies operating in the same industry (e.g., Facebook, Twitter, Snapchat or Apple, Samsung, Huawei).
- For each of the chosen companies, identify the patents they held at a given point in time
- Identify the network of innovators behind these patents
- Visualize the networks for each of the companies
- Analyze the results - see Q1.2

For this question, as well as the following ones, we will also try to design and implement an interactive way to visualize the results. 

### Q1.2/ Do similar companies use the same knowledge bases to innovate? For example, if we look at different social networking companies, will the networks supporting their patents be distinct? Will a given company's patents mostly cite its own previous patents, or will they tap outside innovation networks? On what scale?

The features necessary to answer these questions have been covered above. 
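One possible way to measure how much a company cites its own patents is the share of its citations that point to patents held by the same organization. The sketch below uses made-up toy data; the column layout and the `owners` lookup are assumptions, not the structure produced by our pipeline.

```python
import pandas as pd

# Toy data (made up): which patent cites which other patent.
citations = pd.DataFrame({
    "patent_number":       ["A1", "A1", "A2", "B1", "B1"],
    "cited_patent_number": ["A0", "B0", "A1", "A0", "C0"],
})

# Toy lookup (made up): organization holding each patent.
owners = pd.Series({"A0": "Alpha", "A1": "Alpha", "A2": "Alpha",
                    "B0": "Beta", "B1": "Beta", "C0": "Gamma"})

# Attach the owner of the citing and of the cited patent to each citation.
citations["citing_org"] = citations["patent_number"].map(owners)
citations["cited_org"] = citations["cited_patent_number"].map(owners)
citations["self_citation"] = citations["citing_org"] == citations["cited_org"]

# Share of each organization's citations that point to its own patents.
self_share = citations.groupby("citing_org")["self_citation"].mean()
print(self_share)
```

A share close to 1 would suggest a self-contained knowledge base, while a share close to 0 would suggest the company mostly taps outside networks.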

### Q1.3/ What about if we look at university/academic knowledge bases and compare them with those of the companies analyzed above?

### Q1.4/ What about governmental or non-governmental organizations, or international agencies?

### Q1.5/ How have the innovation networks identified above evolved through time?

The process to be followed to answer these questions is similar to the process for **Q1.1**, with the exception that we need to be able to identify which patents belong to companies, organizations and governmental entities. To this end, we add the following output **fields** to our search queries:
- `assignee_type`:

We will also need the following extra output **field**:
- `app_date`: Date the patent application was filed (filing date)

## Q2/ What does the typical patent-holder look like today (Corporation, Universities, Governments, Individuals), and how has that evolved throughout time / geographies?

### Q2.1/  Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?

### Q2.2/ How has the number of assignees and inventors evolved through time for different patent types? Are there significant differences in these numbers between different geographies?

The features necessary to answer these questions have been covered above. 
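For Q2.1, one candidate measure of inventor migration is the great-circle distance between an inventor's location on a patent and their last-known location. The sketch below implements the standard haversine formula; using it as a migration measure is our assumption, and the two coordinate pairs are simply taken from the inventor location sample shown later in this notebook.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Example: distance between two inventor locations listed for one patent.
distance = haversine_km(44.9958, -92.8794, 1.37209, 103.947)
print(round(distance), "km")
```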

# Gathering and processing the data

The **pipeline.py** file contains the functions used to prepare the data for our analysis. The functions that are used at this level are:
- The **patentsviewAPI()** function puts together the query string, the output fields string and the options string, and then extracts and saves the data returned by the PatentsView API in json format. The saved json data is of the following format (if there are no extra output fields added to the query):

# THIS LIST IS TO CHECK

    - page number in format '1', '2', ...
      - 'patents': list of patents in page. For each patent:
        - 'patent_number'
        - 'patent_date'
        - 'inventors' : list of inventors listed for the patent. For each inventor:
          - 'inventor_latitude'
          - 'inventor_longitude'
          - 'inventor_lastknown_latitude'
          - 'inventor_lastknown_longitude'
          - 'inventor_key_id'
        - 'assignees': list of assignees listed for the patent. For each assignee:
          - 'assignee_latitude'
          - 'assignee_longitude'
          - 'assignee_organization'
          - 'assignee_type'
            - '2': US company or corporation
            - '3': foreign company or corporation
            - '4': US individual
            - '5': foreign individual
            - '6': US government
            - '7': foreign government
            - '8': country government
            - '9': US state government
            - '1x': part interest
          - 'assignee_key_id'
        - 'applications'
          - 'app_country'
          - 'app_date'
          - 'app_id'
        - 'cited_patents' : list of other patents cited by the current patent. For each cited patent:
          - 'cited_patent_number'
        - 'foreign_priority'
          - 'forprior_country'
      - 'count': number of results in the page
      - 'total_patent_count': total number of patents referenced in the results



- The **json_to_pandas()** function converts the saved JSON data from the PatentsView API to the format that is most convenient for us - a set containing the following Pandas DataFrames:
  - `patent_info`:
  - 
  
The API returns at most 10,000 results per page, capped at 10 pages, i.e. at most 100,000 results per query. To avoid hitting this limit when querying for all patent applications in a given date range, we keep the date range to about half a year as a rule of thumb.
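To cover a longer period under this rule of thumb, the date range can be split into consecutive windows that are each queried separately. The helper below is a hypothetical sketch (`date_windows` is not part of **pipeline.py**) and assumes the start date is the first day of a month.

```python
from datetime import date, timedelta


def date_windows(start, end, months=5):
    """Yield (first_day, last_day) pairs covering [start, end].

    Each window spans `months` calendar months; the last one is cut at `end`.
    Assumes `start` is the first day of a month.
    """
    current = start
    while current <= end:
        # First day of the month that lies `months` months after `current`.
        total = current.year * 12 + (current.month - 1) + months
        next_start = date(total // 12, total % 12 + 1, 1)
        yield current, min(next_start - timedelta(days=1), end)
        current = next_start


windows = list(date_windows(date(2015, 1, 1), date(2015, 12, 31)))
for first_day, last_day in windows:
    print(first_day.isoformat(), "to", last_day.isoformat())
```

Each pair could then be passed as the `app_date_from` and `app_date_to` arguments of a separate query.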

In [2]:
from pipeline import patentsviewAPI, json_to_pandas

## get and process data

In [3]:
datafile = patentsviewAPI(filename = '2015_h1', filepath = 'data',
                          app_date_from = '"2015-01-01"', app_date_to = '"2015-05-31"')
data = json_to_pandas(datafile)

fetching first page
fetching page 2
fetching page 3
fetching page 4
fetching page 5
fetching page 6
fetching page 7
fetching page 8
fetching page 9
saving data
error: empty page
error: empty page
error: empty page
error: empty page
error: empty page
error: empty page
error: empty page
error: empty page
error: empty page


## Analysis

In [9]:
data.keys()

dict_keys(['date', 'assignee_type', 'assignee_organization', 'cited_patent_number', 'inventor_location', 'inventor_lastknown_location', 'assignee_location'])

In [10]:
data['date'].describe()

          app_date
count        89853
unique         151
top     2015-03-27
freq          1195


In [15]:
data['assignee_type']

      count
3.0   42571
2.0   41685
NaN    7468
4.0     400
6.0     277
5.0     265
7.0      68
15.0      1
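The numeric codes in these counts can be decoded with the `assignee_type` legend given in the JSON format description. The sketch below is a hypothetical helper; it assumes that a code of the form '1x' means "part interest" combined with base type x (so 15 would be a foreign individual with part interest), which is our reading of the legend rather than a confirmed rule.

```python
import math

# Labels copied from the assignee_type legend in this notebook.
ASSIGNEE_TYPES = {
    2: "US company or corporation",
    3: "foreign company or corporation",
    4: "US individual",
    5: "foreign individual",
    6: "US government",
    7: "foreign government",
    8: "country government",
    9: "US state government",
}


def describe_assignee_type(code):
    """Return (label, part_interest) for a numeric assignee_type code."""
    if code is None or math.isnan(float(code)):
        return "missing", False
    code = int(code)
    # Assumption: codes 10-19 ('1x') flag part interest in base type x.
    part_interest = 10 <= code <= 19
    base = code - 10 if part_interest else code
    return ASSIGNEE_TYPES.get(base, "unknown"), part_interest


for code in (3.0, 2.0, float("nan"), 15.0):
    print(code, describe_assignee_type(code))
```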


In [16]:
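# Show the most and the least frequent assignee organizations.
# Both print() calls return None, so the notebook echoes the tuple (None, None).
print(data['assignee_organization'].head(10)), print(data['assignee_organization'].tail(10))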

                                                  count
NaN                                                8134
Samsung Electronics Co., Ltd.                      2395
International Business Machines Corporation        1877
Canon Kabushiki Kaisha                             1182
LG Electronics Inc.                                 783
Samsung Display Co., Ltd.                           767
QUALCOMM Incorporated                               694
Taiwan Semiconductor Manufacturing Company, Ltd.    664
Kabushiki Kaisha Toshiba                            635
Seiko Epson Corporation                             614
                                        count
HOGUE TOOL & MACHINE, INC.                  1
Magna Car Top Systems of America, Inc.      1
FreeWire Technologies, Inc.                 1
Delta Kogyo Co., Ltd.                       1
Winkler Canvas Ltd.                         1
TRW Vehicle Safety Systems Inc.             1
Befra Electronic, S.R.O                     1
Gustav Magenwirt

(None, None)

In [17]:
len(data['cited_patent_number'])

885678

In [18]:
sum(data['cited_patent_number']['count'].values)

1727376

In [19]:
# Show the most and the least frequently cited patents.
print(data['cited_patent_number'].head(10)), print(data['cited_patent_number'].tail(10))

         count
NaN      10771
7674650    212
6294274    208
7061014    207
7732819    207
6563174    206
7064346    206
6727522    206
7297977    206
7282782    206
         count
3502917      1
4137558      1
4233534      1
4289923      1
4306760      1
4341921      1
4467137      1
4565417      1
5098319      1
7788956      1


(None, None)

In [20]:
data['inventor_location'].describe()

             lat       lon
count   242887.0  242887.0
unique   14987.0   15221.0
top      37.5665   126.978
freq      7061.0    7061.0


In [21]:
data['inventor_location'].head(40)

              lat       lon
8997419   41.8009   -87.937
8997419   42.1878   -88.183
8997430   51.0951    5.7913
8997430   50.7707    3.8752
9002934   40.8768  -73.3246
9004524  -41.2838   174.741
9004524  -41.2838   174.741
9005554   29.7604  -95.3698
9005725   44.9958  -92.8794
9005725   1.37209   103.947


In [22]:
# Missing coordinates are stored as None; identity comparison is the idiomatic check.
data['assignee_location']['lat'].values[2] is None

True

In [23]:
sum(data['assignee_location']['lat'].values == None)

7594

In [24]:
data['assignee_location'].head(40)

              lat       lon
8997419   41.6986  -88.0684
8997430   50.9099    3.3737
9002934      None      None
9004524  -36.8575    174.81
9005554      None      None
9005725   43.1566  -77.6088
9009087   33.7484  -117.875
9011499   33.6846  -117.826
9014148   35.7796  -78.6382
9016606   51.8156   -0.8084


In [None]:
locations.replace('0.1',float('nan'),inplace=True)
locations.dropna(inplace=True)
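Since `locations` is not defined in the cells shown here, the following is a self-contained sketch of the same cleaning idea on made-up toy data. It assumes that the string '0.1' is a placeholder for a missing coordinate (as the cell above suggests) and that missing values may also appear as None.

```python
import pandas as pd

# Toy data (made up) mixing valid coordinates, None and the '0.1' placeholder.
locations = pd.DataFrame(
    {"lat": [41.6986, None, "0.1", -36.8575],
     "lon": [-88.0684, None, "0.1", 174.81]},
    index=["8997419", "9002934", "9005554", "9004524"],
)

# Blank out the assumed '0.1' placeholder, coerce everything to numbers
# (None and unparsable values become NaN), then drop incomplete rows.
cleaned = locations.mask(locations == "0.1")
cleaned = cleaned.apply(pd.to_numeric, errors="coerce").dropna()
print(cleaned)
```

Working on a copy rather than with `inplace=True` keeps the raw data available for checking how many rows were dropped.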