# Analysis of the USPTO AI dataset

The generation and contents of the dataset are described in [Giczy, Pairolero, and Toole](https://link.springer.com/article/10.1007/s10961-021-09900-2).

**Abstract**  

Artificial intelligence (AI) is an area of increasing scholarly and policy interest. To help researchers, policymakers, and the public, this paper describes a novel dataset identifying AI in over 13.2 million patents and pre-grant publications (PGPubs). The dataset, called the Artificial Intelligence Patent Dataset (AIPD), was constructed using machine learning models for each of eight AI component technologies covering areas such as natural language processing, AI hardware, and machine learning. The AIPD contains two data files, one identifying the patents and PGPubs predicted to contain AI and a second file containing the patent documents used to train the machine learning classification models. We also present several evaluation metrics based on manual review by patent examiners with focused expertise in AI, and show that our machine learning approach achieves state-of-the-art performance across existing alternatives in the literature. We believe releasing this dataset will strengthen policy formulation, encourage additional empirical work, and provide researchers with a common base for building empirical knowledge on the determinants and impacts of AI invention.

## Import Packages and connect to Google Drive

In [1]:
%pip install sentence-transformers scann


In [1]:
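# import needed packages
import time

import numpy as np  # used below for np.sort over the patent years
import pandas as pd
from utils import *  # project helpers, including Patent_ONET_tasks_matching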


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## AI Patent data description

### Variables Included in the AI Prediction Data
*(The following descriptions are verbatim from the publication)*  

#### Document number
Each document has a unique identification number, captured by the variable **doc_id**, which has a string data type. For utility patents, the number is seven or eight digits long. If the patent is a reissue patent, then the number begins with “RE” followed by several digits. The patent numbers do not include any leading zeros, either at the beginning of utility patent numbers or after the “RE” for reissue patents. For PGPubs, the number is 11 digits long: a four-digit year (corresponding to the publication year of the PGPub) followed by seven numbers. In the data, the PGPub number does not include a forward slash between the year and numbers.
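The format rules above can be turned into a small check. This is a minimal sketch, not part of the publication or the dataset tooling: `classify_doc_id` is a hypothetical helper, and the regular expressions encode only the rules quoted in the paragraph above.

```python
import re


def classify_doc_id(doc_id: str) -> str:
    """Classify a doc_id string by the format rules quoted above (hypothetical helper)."""
    if re.fullmatch(r"RE\d+", doc_id):
        return "reissue patent"      # "RE" followed by digits
    if re.fullmatch(r"\d{11}", doc_id):
        return "PGPub"               # four-digit year + seven digits, no slash
    if re.fullmatch(r"[1-9]\d{6,7}", doc_id):
        return "utility patent"      # seven or eight digits, no leading zero
    return "unknown"
```

Applied to the notebook's dataframe, this could be used as `AIpatent['doc_id'].map(classify_doc_id)` to cross-check the **flag_patent** variable.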

#### Application number
Each patent application has an associated patent application number. An application may be published multiple times, e.g., as a PGPub and as a patent. An application may also include a corrected PGPub. Hence, the application number is not unique and if necessary may be used to combine all publications of a given application. The application number for utility patents is a 2-digit series code that includes a leading zero for series below 10, followed by a 6-digit serial number. In the data, the application number does not include a slash between the series code and serial number. The application number is contained in the variable **appl_id** of data type string.
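Because **appl_id** is not unique, it can serve as a grouping key. The sketch below shows the idea in plain Python; `group_by_application` is a hypothetical helper and the identifiers in the example are made up for illustration, not taken from the dataset.

```python
from collections import defaultdict


def group_by_application(records):
    """Map each appl_id to the list of doc_ids published for it (hypothetical helper)."""
    groups = defaultdict(list)
    for doc_id, appl_id in records:
        groups[appl_id].append(doc_id)
    return dict(groups)


# Made-up identifiers: one application published as both a PGPub and a patent.
records = [
    ("20190123456", "16123456"),
    ("10000000", "16123456"),
    ("RE12345", "09654321"),
]
grouped = group_by_application(records)
```

With pandas, the equivalent would be `AIpatent.groupby('appl_id')['doc_id'].apply(list)`.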

#### Publication date
We capture the publication date of the document by the variable **pub_dt** (for granted patents, this date is also known as the grant date or issue date). The date format is “YYYY-MM-DD,” where YYYY is the calendar year, MM is the numeric month (1–12), and DD is the numeric day of the month. This variable is a string data type.

#### Patent flag
To easily distinguish between patents and PGPubs, we include a flag variable, **flag_patent**, of data type integer that is equal to 1 for patents and 0 for PGPubs.

#### Consolidated “any AI” variables
There are two variables that look across all AI technology component models: **flag_tng_any** and **predict50_any_ai**. Both of these are binary variables of data type integer. The first variable, **flag_tng_any** is equal to 1 if the document was used in training any of the eight technology component models, i.e., if it was in the seed or anti-seed sets of any model, and it is equal to 0 if not used for training in any model. The second variable, **predict50_any_ai**, takes a value of 1 if the document was predicted to be AI in any of the eight technology component models based on a 50% threshold and 0 if not predicted as AI in any of the models.

#### Training flags
Since we employed multiple models and did not want the documents used in training to automatically default as AI or not AI, we generated predictions for all the training documents. We include a flag of data type integer whether the document was part of the seed or anti-seed set for each component technology model. The variables are **flag_train_ml** for machine learning, **flag_train_evo** for evolutionary computation, **flag_train_nlp** for natural language processing, **flag_train_speech** for speech, **flag_train_vision** for vision, **flag_train_kr** for knowledge processing, **flag_train_planning** for planning and control, and **flag_train_hardware** for AI hardware. The variable takes on a value of 1 if the document was in the seed or anti-seed sets for that model, and 0 if not.

#### AI prediction score
Each AI technology component model generates a probability score between 0.0 and 1.0 as a prediction of whether the document belongs in the AI technology component. There are eight variables, one for each component technology, of data type float. The variables are **ai_score_ml** for machine learning, **ai_score_evo** for evolutionary computation, **ai_score_nlp** for natural language processing, **ai_score_speech** for speech, **ai_score_vision** for vision, **ai_score_kr** for knowledge processing, **ai_score_planning** for planning and control, and **ai_score_hardware** for AI hardware. In our code, the scores are data type float64. We do not round the data to any number of significant digits.

#### AI prediction using a 50% threshold
For convenience, we also include a variable for each AI technology component that translates the probability score into a binary prediction based on a 50% threshold. If the score is greater than or equal to 0.50, then the document is predicted to be AI in that component technology. The variables are **predict50_ml** for machine learning, **predict50_evo** for evolutionary computation, **predict50_nlp** for natural language processing, **predict50_speech** for speech, **predict50_vision** for vision, **predict50_kr** for knowledge processing, **predict50_planning** for planning and control, and **predict50_hardware** for AI hardware. The variables take a value of 1 if the technology component model score predicts AI in that component based on a 50% threshold, and 0 if the model score does not.
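The relationship between the scores, the per-component flags, and **predict50_any_ai** can be sketched as follows. This is an illustration of the rule as described above (score greater than or equal to 0.50), not the authors' code; `predict50_flags` and `COMPONENTS` are hypothetical names.

```python
# Suffixes of the eight component technologies, as listed in the variable names above.
COMPONENTS = ["ml", "evo", "nlp", "speech", "vision", "kr", "planning", "hardware"]


def predict50_flags(scores: dict) -> dict:
    """Derive the binary predict50_* flags from ai_score_* values (hypothetical helper)."""
    flags = {
        f"predict50_{c}": int(scores[f"ai_score_{c}"] >= 0.50)
        for c in COMPONENTS
    }
    # "Any AI" is 1 when at least one component model predicts AI.
    flags["predict50_any_ai"] = int(any(flags.values()))
    return flags
```

Note that a score of exactly 0.50 counts as a positive prediction under this rule.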

#### Analysis phase
The final variable identifies whether we created the data under the Phase 1 analysis or the updated Phase 2 analysis. As described above, Phase 1 includes patent documents to train the models and predictions for patent documents through early 2019. Phase 2 includes predictions for additional patent documents through 2020 (primarily 2019 and 2020). The **analysis_phase** variable is of data type integer and takes a value of 1 if the predictions were generated under Phase 1 and a value of 2 if the predictions were generated under the updated Phase 2 analysis.

## Load the Data from multiple sources

In [None]:
# get O-net data
!curl -o onet_statements.xlsx https://www.onetcenter.org/dl_files/database/db_26_0_excel/Task%20Statements.xlsx

In [None]:
#onet = pd.read_excel('onet_statements.xlsx')
onet = pd.read_csv('/content/drive/MyDrive/Big Data/patent/task_ratings.csv')
onet.head()

In [None]:
# Load the USPTO ai dataset
#AIpatent = pd.read_csv('/content/drive/MyDrive/Big Data/patent/ai_model_predictions.tsv',
#                       delimiter="\t", quoting = csv.QUOTE_NONNUMERIC).drop_duplicates()

AIpatent = pd.read_csv('/content/drive/MyDrive/bq-results-20230303-173253-1677864879860/ai_patents_with_title_abstracts.csv').drop_duplicates()

In [None]:
allThese = ['flag_train_ml', 'ai_score_ml', 
            'flag_train_evo', 'ai_score_evo', 
            'flag_train_nlp', 'ai_score_nlp', 
            'flag_train_speech', 'ai_score_speech',
            'flag_train_vision', 'ai_score_vision',
            'flag_train_kr', 'ai_score_kr', 
            'flag_train_planning', 'ai_score_planning', 
            'flag_train_hardware', 'ai_score_hardware', 
            'analysis_phase']

AIpatent = AIpatent.drop(allThese,axis=1)
AIpatent.head()

In [None]:
#AIpats = AIpatent[AIpatent.predict50_any_ai==1]
#AIpats = AIpats[AIpats.flag_patent==1]
AIpatent['pub_dt'] = pd.to_datetime(AIpatent['pub_dt'])
AIpatent['year'] = AIpatent.pub_dt.dt.year
AIpatent['doc_id'] = AIpatent['doc_id'].astype('string')
AIpatent['appl_id'] = AIpatent['appl_id'].astype('string')

## Start Analysis

1. Create a transformer model
2. Create embeddings for the tasks (and normalize them)
3. For each year of patent data
    - generate embeddings using the same model used for the tasks
    - normalize the embeddings
    - create a ScaNN searcher, which builds a tree of patents that are close to each other under the dot_product metric (because we normalized the embeddings for both the tasks and the patents, the dot product and the cosine similarity are one and the same)
    - perform the search, i.e., the comparison between the tasks and the patents
    - store the results, which contain the best patent matches for each task and their distances (i.e., cosine similarities), in the O*NET dataframe
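The matching step can be illustrated on toy vectors. This is a minimal sketch under stated assumptions: it uses an exact brute-force search in NumPy in place of the ScaNN searcher, hand-written 2-D vectors in place of sentence-transformer embeddings, and the hypothetical helpers `l2_normalize` and `top_k_matches`. It is not the implementation inside `Patent_ONET_tasks_matching`.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so that dot product equals cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top_k_matches(task_emb, patent_emb, k=2):
    """Return indices and similarities of the k most similar patents per task."""
    sims = l2_normalize(task_emb) @ l2_normalize(patent_emb).T  # (n_tasks, n_patents)
    neighbors = np.argsort(-sims, axis=1)[:, :k]                # best matches first
    distances = np.take_along_axis(sims, neighbors, axis=1)
    return neighbors, distances


# Toy "embeddings" (assumption: real ones come from the transformer model).
tasks = np.array([[1.0, 0.0], [0.0, 2.0]])
patents = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 0.5]])
neighbors, distances = top_k_matches(tasks, patents, k=2)
```

An approximate searcher such as ScaNN trades a small amount of recall for speed, which matters when each year contains many thousands of patents; the brute-force version above is only practical for small inputs.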

In [None]:
analysis = Patent_ONET_tasks_matching(onet_tasks=onet.Task.values.tolist())

In [None]:
start = time.time()

for year in np.sort(AIpatent['year'].unique()): 
    st_time = time.time()
    print(year)
    # generate input data
    pat_year = AIpatent[AIpatent['year']==year]

    onet['neighbors_'+str(year)], onet['distances_'+str(year)] = analysis.compareByYear(patent_title=pat_year.title.values.tolist())
    # time.time() returns seconds, so divide by 60 to report minutes
    print('\nCompleted {} matching in {:.2f} minutes.'.format(year, (time.time()-st_time)/60))

end = time.time()
print("\n\nTotal elapsed time was {:.2f} hours.".format((end - start)/3600))

## Calculate the number of patents above a certain cosine similarity

To declare a match, we must choose a cosine similarity value above which we will deem a patent capable of performing a task.

In [None]:
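# This function counts the number of patents that match a task above
# a specific cosine similarity threshold.
def count_above_threshold(entry, threshold):
    # entry may be a stringified list such as '[0.81, 0.64]' (e.g., after a
    # CSV round trip) or an in-memory list/array of similarities.
    if isinstance(entry, str):
        parts = entry.replace('[', '').replace(']', '').split(',')
        entry = [p for p in parts if p.strip()]
    return sum(1 for x in entry if float(x) > threshold)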

In [None]:
# copy() so the column assignments below do not write into a view of onet
New_onet = onet[['O*NET-SOC Code','Title','Task']].copy()

cosineSim = 0.75

for year in range(1976,2021):
    print(year)
    New_onet.loc[:,str(year)] = onet['distances_'+str(year)].apply(count_above_threshold,threshold=cosineSim)