# Example of analysis on a sample of data

In [1]:
import pandas as pd
from pipeline import patentsviewAPI, json_to_pandas

## get data from API

In [8]:
data['assignee_type']

Unnamed: 0,count
3.0,26723
2.0,25245
,4444
4.0,224
5.0,177
6.0,160
7.0,38


In [9]:
pd.isna(data['assignee_type'].index.values)

array([False, False,  True, False, False, False, False])

While the **NaN** count in the `assignee_type` dataframe is small compared to the total count of assignees, there is an issue: assignee types 4 and 5 refer to US and foreign individuals, respectively. The total number of individual assignees is thus much smaller than the NaN count. Before removing the NaN values, we need to investigate what they may represent.
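To make the comparison concrete, the counts displayed above can be tallied directly. This is only a sketch: it hard-codes the values from the `assignee_type` output rather than reading them from `data`, and the variable names are illustrative.

```python
# Counts copied from the `assignee_type` output above; the None key stands for the NaN row.
counts = {3.0: 26723, 2.0: 25245, None: 4444, 4.0: 224, 5.0: 177, 6.0: 160, 7.0: 38}

missing = counts[None]                    # assignees with no type
individuals = counts[4.0] + counts[5.0]   # US + foreign individuals
total = sum(counts.values())

print(missing, individuals, total)        # 4444 401 57011
print(round(missing / total, 3))          # 0.078, share of assignees with no type
print(round(missing / individuals, 1))    # 11.1, NaN rows per individual assignee
```

The NaN rows outnumber the individual assignees by roughly eleven to one, which is why they cannot simply be assumed to be individuals.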

Check the `assignee_organization` dataframe.

In [10]:
print(data['assignee_organization'].head(10)), print(data['assignee_organization'].tail(10))

                                                  count
NaN                                                4841
Samsung Electronics Co., Ltd.                      1550
International Business Machines Corporation        1161
Canon Kabushiki Kaisha                              696
LG Electronics Inc.                                 507
QUALCOMM Incorporated                               474
Kabushiki Kaisha Toshiba                            470
Seiko Epson Corporation                             470
Samsung Display Co., Ltd.                           419
Taiwan Semiconductor Manufacturing Company, Ltd.    385
                                              count
R.J. Reynolds Tobacco Products                    1
XI'AN AOLAN SCIENCE AND TECHNOLOGY CO., LTD.      1
Wowwee Group Limited                              1
Epoch Company, Ltd.                               1
VeraWall, LLC                                     1
Onepin, Inc.                                      1
Thirdwayv, Inc.     

(None, None)

In [11]:
data['assignee_organization'].index.values[0] is None

True

In [12]:
for assignee in data['assignee_organization'].index.values[1:]:
    if 'sony' in assignee.lower():
        print(assignee)

Sony Corporation
Sony Mobile Communications Inc.
Sony Interactive Entertainment America LLC
Sony Semiconductor Solutions Corporation
SONY NETWORK ENTERTAINMENT INTERNATIONAL LLC
Sony Computer Entertainment Europe Limited
Sony Interactive Entertainment Europe Limited
Sony Computer Entertainment Inc.
SONY OLYMPUS MEDICAL SOLUTIONS INC.
Sony Corporation of America
Sony Europe Limited


The NaN count in `assignee_organization` should normally correspond to the total number of individual assignees (assignee types 4 and 5). In this case, though, we notice that it corresponds almost perfectly to the sum of the individual assignee counts and the NaN count in the `assignee_type` dataframe (4845 versus 4841, as the comparison below shows).

While the organization names are clearly not in a standardized format, this will not be a problem: the list we obtain from querying by date gives us the names exactly as they are used in the database.

In [13]:
sum(data['assignee_type']['count'].iloc[2:5]), sum(data['assignee_organization'].iloc[0])

(4845, 4841)

Check `cited_patent_number` dataframe.

In [14]:
print(data['cited_patent_number'].head(10)), print(data['cited_patent_number'].tail(10))

         count
NaN       6668
7674650    127
7732819    124
7297977    123
6294274    123
7282782    123
7385224    122
7064346    122
7061014    122
7323356    122
         count
6911243      1
6670521      1
6363530      1
440051       1
436738       1
419780       1
5034078      1
4640859      1
3894352      1
6309066      1


(None, None)

In [15]:
len(data['cited_patent_number']) - 1

611599

In [16]:
data['cited_patent_number'].describe()

Unnamed: 0,count
count,611600.0
mean,1.71286
std,8.728264
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,6668.0


The NaN count here is the number of citations, made by the patents in our dataset, that carry no patent number. It is therefore an upper bound on the number of distinct cited patents with no corresponding patent number. As the number of distinct cited patents (611,599) is roughly 90 times larger than the NaN count (6,668), we judge that we can safely disregard these entries and skip them in the data cleaning process. Otherwise, the entries displayed above are all numeric, which suggests that the format is uniform.
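The size comparison can be reproduced from the numbers printed in the cells above. This sketch hard-codes those outputs; the total number of citations is only approximate, because it is reconstructed from the rounded mean reported by `describe()`.

```python
# Values copied from the cell outputs above.
unique_cited = 611599      # len(data['cited_patent_number']) - 1
nan_count = 6668           # count on the NaN row
rows, mean_count = 611600, 1.71286   # from describe(); the mean is rounded

ratio = unique_cited / nan_count
print(round(ratio, 1))     # 91.7

# Approximate share of all citations that have no patent number.
approx_total_citations = rows * mean_count
nan_share = nan_count / approx_total_citations
print(f"{nan_share:.2%}")
```

Measured against all citations rather than distinct cited patents, the entries without a patent number are well under 1% of the total.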

Check `inventor_location` dataframe.

In [17]:
data['inventor_location'].dtypes

lat    object
lon    object
dtype: object

The first step of the data pre-processing will be to convert the latitudes and longitudes from objects to floats.
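One possible way to do this conversion is `pd.to_numeric` with `errors="coerce"`, which maps unparseable entries such as `None` to NaN. The sketch below runs on a small made-up frame, not on `data['inventor_location']`, and is not necessarily how the pipeline performs the conversion.

```python
import pandas as pd

# Toy frame mimicking the object-typed columns; the values are illustrative.
locations = pd.DataFrame({
    "lat": ["37.5665", None, "0.1", "40.7128"],
    "lon": ["126.978", None, "0.1", "-74.006"],
})

# errors="coerce" turns anything that cannot be parsed (including None) into NaN.
numeric = locations.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)                   # both columns become float64
print(int(numeric["lat"].isna().sum())) # 1
```

Note that the string `'0.1'` converts cleanly to the float `0.1`, so it still has to be filtered out separately.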

In [18]:
data['inventor_location'].describe()

Unnamed: 0,lat,lon
count,148610.0,148610.0
unique,11977.0,12129.0
top,37.5665,126.978
freq,4425.0,4425.0


In [19]:
len(data['assignee_location'])

57011

In [20]:
sum(data['assignee_location'].values == None), sum(data['assignee_location'].values == float('nan'))

(array([4522, 4522]), array([0, 0]))

In [21]:
sum(data['assignee_location'].values == '0.1')

array([885, 885])

The major issue here is that almost 10% of the latitude and longitude data in `assignee_location` is either (None, None) or ('0.1', '0.1'): (4522 + 885) / 57011 ≈ 9.5%. As our goal is to find a way to visualize the dynamics of innovation, we need to discard these entries.
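One caveat about the checks above: a comparison with `float('nan')` is always False, because NaN is not equal to itself, so the zero counts do not by themselves rule out NaN entries. A minimal sketch of a mask that catches `None`, NaN, and the `'0.1'` value in one pass, on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "lat": ["37.5665", None, "0.1", float("nan")],
    "lon": ["126.978", None, "0.1", float("nan")],
})

# NaN never compares equal to anything, including itself.
print(float("nan") == float("nan"))   # False

# isna() catches both None and NaN; the '0.1' value is checked separately.
missing_mask = toy["lat"].isna() | toy["lat"].eq("0.1")
print(int(missing_mask.sum()))        # 3

# Rows that survive the filter.
usable = toy[~missing_mask]
print(len(usable))                    # 1
```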

Check `inventor_location` and `inventor_lastknown_location` dataframes.

In [22]:
data['inventor_location'].dtypes, data['inventor_lastknown_location'].dtypes

(lat    object
 lon    object
 dtype: object, lat    object
 lon    object
 dtype: object)

In [23]:
len(data['inventor_location']), len(data['inventor_lastknown_location'])

(148633, 148633)

In [24]:
sum(data['inventor_location'].values == None), sum(data['inventor_location'].values == float('nan')), sum(data['inventor_location'].values == '0.1')

(array([23, 23]), array([0, 0]), array([1154, 1154]))

In [25]:
sum(data['inventor_lastknown_location'].values == None), sum(data['inventor_lastknown_location'].values == float('nan')), sum(data['inventor_lastknown_location'].values == '0.1')

(array([4, 4]), array([0, 0]), array([1110, 1110]))

Assuming that the missing values do not overlap between the `inventor_location` and `inventor_lastknown_location` dataframes, an upper bound on the proportion of missing data is:

In [26]:
no_data = (
    sum(data['inventor_location'].values == None)
    + sum(data['inventor_location'].values == '0.1')
    + sum(data['inventor_lastknown_location'].values == None)
    + sum(data['inventor_lastknown_location'].values == '0.1')
)

round(no_data[0]/len(data['inventor_location']), 4)

0.0154

Coming back to the **NaN** count in the `assignee_type` dataframe, we look at the raw json data for these entries.

In [27]:
import json
import numpy as np

In [28]:
# Use a context manager so the file handle is closed after loading.
with open(datafile) as f:
    json_data = json.load(f)

In [29]:
missing_location_data = []   # patent numbers with at least one missing location
missing_assignee_type = {}   # patent number -> organization, for assignees with no type

# A latitude of None or '0.1' is treated as missing location data.
MISSING_LATITUDES = (None, '0.1')

for page in json_data:
    for patent in json_data[page]['patents']:
        for inventor in patent['inventors']:
            if inventor['inventor_latitude'] in MISSING_LATITUDES:
                missing_location_data.append(patent['patent_number'])

            if inventor['inventor_lastknown_latitude'] in MISSING_LATITUDES:
                missing_location_data.append(patent['patent_number'])

        for assignee in patent['assignees']:
            if assignee['assignee_latitude'] in MISSING_LATITUDES:
                missing_location_data.append(patent['patent_number'])

            if assignee['assignee_type'] is None:
                missing_assignee_type[patent['patent_number']] = assignee['assignee_organization']

We look at how many patent_numbers are missing at least one location datapoint.

In [30]:
len(set(missing_location_data))

6249

Only a minority of these patents are missing a single location data point. Most patents with missing location data are missing both the inventor and the assignee locations.

In [31]:
len(missing_assignee_type)

4444

We now **merge** the patent numbers of the `missing_location_data` points and the `missing_assignee_type` points, and we count the number of unique patent numbers.

In [32]:
len(set(missing_assignee_type) | set(missing_location_data))

6249

The merged set has the same size (6249) as the set of patents missing location data, so **not a single** patent lacks an `assignee_type` while still having complete `location` data. For us, this is good news. It means that by omitting the data which has no `assignee_type`, we do not lose more information than necessary, because we have to omit the data for which there is no location in any case, as location is the most crucial information for our purposes.
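The overlap between the two groups can also be inspected directly with set operations, which list the patents that lack an `assignee_type` but are not flagged for missing location data. The sketch below uses hypothetical toy patent numbers and organization names, not the values from the dataset.

```python
# Toy stand-ins for the two collections built from the raw JSON.
missing_location = ["101", "102", "102", "103"]        # duplicates are possible
missing_type = {"102": None, "103": "Acme LLC"}         # patent number -> organization

location_set = set(missing_location)
type_set = set(missing_type)            # iterating a dict yields its keys

only_type = type_set - location_set     # no assignee_type, but location present
union = type_set | location_set         # patents dropped under either criterion

print(len(location_set), len(union))    # 3 3
print(sorted(only_type))                # []
```

When the union is no larger than the location set, as in this toy example, dropping the rows without an `assignee_type` removes nothing that the location filter would have kept.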

## Checking cleaned data