# Patent Breakthrough walkthrough

This notebook walks through the complete process of analysing breakthrough patents, from preparing the input files
to calculating impact and novelty scores.

## 1. Preparing input files

There are three input files: a file with patent texts, a patent/year-index, and a list of patent/CPC-codes.

### Patents
In its raw format, the input file contains the text of one patent file per line.
Each line starts with a path pointing to that patent's original text 
file (`/Volumes/External/txt/0000000-0100000/US1009.txt`), followed by the patent text. Example file: `./data/raw_input.txt`. 
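
As a minimal sketch of this layout, the snippet below splits one raw line into its path and text and derives a patent number from the file name. It assumes that the path contains no whitespace, that whitespace separates it from the text, and that file names look like `US<digits>.txt`. `split_raw_line` is a hypothetical helper, not part of `docembedder`, and the sample text is made up.

```python
import re
from pathlib import PurePosixPath


def split_raw_line(line):
    """Split a raw input line into (patent_number, path, text).

    Assumption: the path has no whitespace and is followed by whitespace.
    """
    parts = line.strip().split(maxsplit=1)
    path_str = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    # "US1009.txt" -> 1009; assumes the file name is "US<digits>.txt".
    match = re.fullmatch(r"US(\d+)", PurePosixPath(path_str).stem)
    if match is None:
        raise ValueError(f"Unexpected file name in path: {path_str}")
    return int(match.group(1)), path_str, text


# Made-up example line in the layout described above.
sample = "/Volumes/External/txt/0000000-0100000/US1009.txt IMPROVEMENT IN WASHING MACHINES"
patent_number, path_str, text = split_raw_line(sample)
print(patent_number, text[:30])
```

In the actual pipeline this parsing is handled by `compress_raw`; the sketch only illustrates the structure of a line.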


### Patent/Year-index
Contains the year of publication of each patent. Example file: `./data/year.csv`. 


### CPC-file
The CPC-file (Cooperative Patent Classification) contains the patent classification code for each patent. These codes are used to calculate benchmark similarities. Example file: `./data/GPCPCs.txt`

Note: the included data files only contain a small subset of the original data, for example purposes.

### Other files
The three other files in the data folder (`greek.txt`, `stopwords.txt`, and `symbols.txt`) are required by the `OldPreprocessor` class.

In [None]:
from pathlib import Path

data_path = "./data"
input_file = Path(f"{data_path}/raw_input.txt")
year_file = Path(f"{data_path}/year.csv")
cpc_fp = Path(f"{data_path}/GPCPCs.txt")
patent_dir = Path("./patents")
output_folder = Path("./output")
output_fp = Path("./output", "patents.h5")
results_fp = Path("./results")

output_folder.mkdir(exist_ok=True)
patent_dir.mkdir(exist_ok=True)

### 1.1. Compressing

The compressor function transforms the patents to a more manageable format, sorts and saves them by year of publication, and compresses the resulting files.

In [None]:
from docembedder.preprocessor.parser import compress_raw


# Only compress when the patent directory does not contain any xz-files yet.
if not any(path.suffix == ".xz" for path in patent_dir.iterdir()):
    print("Compressing raw files")
    compress_raw(input_file, year_file, patent_dir)
else:
    print(f"xz-files already present in '{patent_dir}'")

You now have XZ-compressed files containing the patents per year. Each file contains a list of JSON objects, and each JSON object has the following keys:

- `patent`: patent's ID
- `file`: path of original text file (not actually used)
- `contents`: patent text
- `year`: year of publication
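
To inspect such a file outside the pipeline, something like the following sketch may work. It assumes each file is a single UTF-8 JSON list compressed with XZ, as described above. `read_patents_xz` is a hypothetical helper and the record it writes is a made-up stand-in, so the example runs without the real data.

```python
import json
import lzma
import tempfile
from pathlib import Path


def read_patents_xz(xz_fp):
    """Read a list of patent dictionaries from an XZ-compressed JSON file."""
    with lzma.open(xz_fp, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Illustrative record with the four keys listed above; values are made up.
example = [{"patent": 1, "file": "US1.txt", "contents": "made-up text", "year": 1877}]

with tempfile.TemporaryDirectory() as tmp_dir:
    xz_fp = Path(tmp_dir, "1877.xz")
    # Write a tiny stand-in file, then read it back.
    with lzma.open(xz_fp, mode="wt", encoding="utf-8") as handle:
        json.dump(example, handle)
    patents = read_patents_xz(xz_fp)

print(sorted(patents[0].keys()))
```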

## 2. Calculating embeddings

We calculate embeddings and scores with four different models: Countvec, Tf-Idf, Doc2Vec, and BERT ([PatentSBERTa](https://github.com/AI-Growth-Lab/PatentSBERTa)).


### 2.1. Preprocessors & parameters
Each model has its own preprocessor with various parameters. Most models also have configurable hyperparameters. These parameter values were optimised on the original dataset; the resulting values are used in the `compute_embeddings_*()` functions below.

To recalibrate preprocessor and model parameters, run each model's hyperopt-script. See the [readme](https://github.com/UtrechtUniversity/patent-breakthrough/blob/main/docs/hyperparameter.md) and [hyperopt-notebooks](hyperopt/) for more details.


### 2.2. Calculating embeddings
Next, we calculate the embeddings.

In [None]:
from docembedder.models import TfidfEmbedder
from docembedder.preprocessor.preprocessor import Preprocessor
from docembedder.preprocessor.oldprep import OldPreprocessor
from docembedder.models.doc2vec import D2VEmbedder
from docembedder.models import CountVecEmbedder
from docembedder.models import BERTEmbedder

from docembedder.utils import run_models
from docembedder.pretrained_run import pretrained_run_models
import datetime

def check_files(sim_spec):
    """Raise an error if a yearly patent file in the simulation range is missing."""
    for year in range(sim_spec.year_start, sim_spec.year_end):
        if not (patent_dir / f"{year}.xz").is_file():
            raise ValueError(f"Please download patent file {year}.xz and put it in "
                             f"the right directory ({patent_dir})")

def compute_embeddings_cv(patent_dir, output_fp, cpc_fp, sim_spec, n_jobs):

    model_cv = {
        "countvec": CountVecEmbedder(method='sigmoid')
    }
    prep_cv = {
        "prep-countvec": OldPreprocessor(list_path=data_path)
    }

    check_files(sim_spec)
    run_models(prep_cv, model_cv, sim_spec, patent_dir, output_fp, cpc_fp, n_jobs=n_jobs)
    print('Calculated countvec embeddings')

    
def compute_embeddings_tfidf(patent_dir, output_fp, cpc_fp, sim_spec, n_jobs):
    
    model_tfidf = {
        "tfidf": TfidfEmbedder(
            ngram_max=1, stop_words='english', stem=False, norm='l1',
            sublinear_tf=True, min_df=6, max_df=0.665461)
    }
    prep_tfidf = {
        "prep-tfidf": Preprocessor(keep_caps=True, keep_start_section=True, remove_non_alpha=True),
    }

    check_files(sim_spec)
    run_models(prep_tfidf, model_tfidf, sim_spec, patent_dir, output_fp, cpc_fp, n_jobs=n_jobs)
    print('Calculated tfidf embeddings')

    
def compute_embeddings_doc2vec(patent_dir, output_fp, cpc_fp, sim_spec, n_jobs):

    model_doc2vec = {
        "doc2vec": D2VEmbedder(epoch=8, min_count=13, vector_size=100)
    }
    prep_doc2vec = {
        "prep-doc2vec": Preprocessor(keep_caps=False, keep_start_section=True, remove_non_alpha=False)
    }

    check_files(sim_spec)
    run_models(prep_doc2vec, model_doc2vec, sim_spec, patent_dir, output_fp, cpc_fp, n_jobs=n_jobs)
    print('Calculated doc2vec embeddings')

def compute_embeddings_bert(patent_dir, output_fp, cpc_fp, sim_spec, n_jobs):

    model_bert = {
        "bert": BERTEmbedder(pretrained_model='AI-Growth-Lab/PatentSBERTa')
    }
    prep_bert = {
        "prep-bert": Preprocessor(keep_caps=True, keep_start_section=True, remove_non_alpha=True)
    }

    check_files(sim_spec)
    # n_jobs is accepted for a uniform call signature but is not passed on here.
    pretrained_run_models(prep_bert, model_bert, sim_spec, patent_dir, output_fp, cpc_fp)
    print('Calculated BERT embeddings')

#### Defining the calculation window

Embeddings are calculated within a time window. The window is then shifted over the dataset, and the embeddings are recalculated at each new position.
This procedure is configured with a `SimulationSpecification`, which has the following attributes:
    
- `year_start`: start year of the entire (sub)set of data to calculate embeddings for.
- `year_end`: end year of that (sub)set (the end year itself is not included).
- `window_size`: width of the window (in years) to compute embeddings for.
- `window_shift`: number of years between subsequent windows.
- `debug_max_patents`: restrict the number of patents per year (optional; for testing purposes).
    
With the `n_jobs` parameter you can set the number of concurrent jobs to run. A higher number means faster processing, but be aware that each job utilises one CPU core.
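
To get a feel for how these attributes interact, the sketch below enumerates windows for a 20-year range with an 11-year window and a shift of 1. It assumes that only windows fitting entirely before `year_end` are used; how `docembedder` actually places windows at the edges of the range may differ. `iter_windows` is a hypothetical helper, not a library function.

```python
def iter_windows(year_start, year_end, window_size, window_shift):
    """Yield (first_year, last_year) pairs, with last_year inclusive.

    Assumption: only windows that fit entirely before year_end are used.
    """
    if window_shift < 1:
        raise ValueError("window_shift must be at least 1")
    first = year_start
    while first + window_size <= year_end:
        yield first, first + window_size - 1
        first += window_shift


windows = list(iter_windows(1877, 1897, window_size=11, window_shift=1))
print(len(windows), windows[0], windows[-1])
```

Under this assumption, a larger `window_shift` reduces the number of windows, and therefore the amount of computation, roughly in proportion.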

In [None]:
from docembedder.simspec import SimulationSpecification

sim_spec = SimulationSpecification(
    year_start=1877,
    year_end=1897,
    window_size=11,
    window_shift=1
)

n_jobs=2

#### Computing embeddings

Now that we've defined the window, we can calculate embeddings, using each of the four models.
    
Be aware that, depending on the number of patents and the window size, this can take quite some time
and can require a (_very_) large amount of memory. Warnings from the Countvec calculations can be ignored.

All output is stored in an HDF5 file, which contains the embeddings for all patents in all windows.

In [None]:
args={'patent_dir': patent_dir, 'output_fp': output_fp, 'cpc_fp': cpc_fp, 'sim_spec': sim_spec, 'n_jobs': n_jobs}

# Countvec
compute_embeddings_cv(**args)

# Tf-Idf
compute_embeddings_tfidf(**args)

# Doc2Vec
compute_embeddings_doc2vec(**args)

# BERT
compute_embeddings_bert(**args)

## 3. Impact and novelty scores

### 3.1. Calculating the scores

After we've computed and stored the embeddings, we compute novelty and impact scores. The result is a dictionary per model, each containing the novelties and impacts for each patent.


_Note on exponents_

The exponents (`[1.0, 2.0, 3.0]`) are used in the calculations to reward patents that are more similar to the patent under consideration. The backward and forward similarities for each patent are calculated based on the mean of all cosine similarities with the preceding and following patents in the window, respectively, using the formula `(x1**a + x2**a + ...)**(1/a)`, with `a` being the exponent. An `a` larger than 1 increases the weight of similarities closer to 1, i.e. of embeddings that are more similar to the one under consideration. The output includes the result for each exponent.
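
The effect of the exponent can be illustrated with a small sketch. It implements the formula exactly as written above, without any normalisation the library may apply, and assumes non-negative similarities. `aggregate_similarities` and `largest_term_share` are hypothetical helpers and the similarity values are made up.

```python
def aggregate_similarities(similarities, exponent):
    """Compute (x1**a + x2**a + ...)**(1/a) for non-negative similarities."""
    return sum(x ** exponent for x in similarities) ** (1.0 / exponent)


def largest_term_share(similarities, exponent):
    """Fraction of the inner sum contributed by the largest similarity."""
    powered = [x ** exponent for x in similarities]
    return max(powered) / sum(powered)


# Made-up cosine similarities of one patent to two other patents.
similarities = [0.3, 0.4]
for a in [1.0, 2.0, 3.0]:
    print(a, aggregate_similarities(similarities, a), largest_term_share(similarities, a))
```

The share of the largest similarity in the sum grows with `a`, which is the sense in which higher exponents reward the most similar patents.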

In [None]:
from docembedder.analysis import DocAnalysis
from docembedder.datamodel import DataModel
from collections import defaultdict

import pandas as pd

def compute_impacts(embedding_fp, output_dir):
    exponents = [1.0, 2.0, 3.0]

    impact_novel = defaultdict(lambda: defaultdict(list))

    with DataModel(embedding_fp, read_only=False) as data:
        analysis = DocAnalysis(data)
       
        for window, model in data.iterate_window_models():
            results = analysis.impact_novelty_results(window, model, exponents, cache=False, n_jobs=8)

            for expon, res in results.items():
                if expon == exponents[0]:
                    impact_novel[model]["patent_ids"].extend(res["patent_ids"])
                impact_novel[model][f"impact-{expon}"].extend(res["impact"])
                impact_novel[model][f"novelty-{expon}"].extend(res["novelty"])

    output_dir.mkdir(exist_ok=True, parents=True)

    # Write one CSV per model, sorted by patent ID.
    for model, scores in impact_novel.items():
        classifier_name = model.split("-")[-1]
        impact_fp = Path(output_dir, f"impact-{classifier_name}.csv")
        pd.DataFrame(scores).sort_values("patent_ids").to_csv(impact_fp, index=False)


compute_impacts(embedding_fp=output_fp, output_dir=results_fp)

### 3.2. Output

After the computations are done, novelty and impact scores are written to CSV files in the results folder: one file per model, with novelty and impact scores for each exponent. The `patent_ids` column refers back to the patent IDs from the original data.
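
As a rough sketch of the resulting layout, the snippet below builds the column names that follow from `compute_impacts` and reads one row from an in-memory stand-in for a results file. It assumes the exponent keys are formatted as in the `exponents` list (`1.0`, `2.0`, `3.0`); the numbers in the sample row are made up.

```python
import csv
import io

exponents = [1.0, 2.0, 3.0]

# Column names as produced by the f-strings in compute_impacts.
expected_columns = ["patent_ids"]
for expon in exponents:
    expected_columns += [f"impact-{expon}", f"novelty-{expon}"]

# Stand-in for one results file; the values are illustrative only.
sample_csv = io.StringIO(
    ",".join(expected_columns) + "\n"
    "1009,0.1,0.2,0.3,0.4,0.5,0.6\n"
)
rows = list(csv.DictReader(sample_csv))
print(expected_columns)
print(rows[0]["patent_ids"], rows[0]["novelty-2.0"])
```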

Below is a list of the resulting files.

In [None]:
[str(path.absolute()) for path in results_fp.iterdir()]