# Importing Necessary Libraries and Modules

In this section, we import the required libraries and modules that will be used throughout the notebook. These include:
- `EPABClient` from the `epo.tipdata.epab` module, the client used in the cells below to query patent publication data.
- `os` for interacting with the operating system.
- `numpy` (imported as `np`) for numerical operations and data manipulation.
- `pandas` (imported as `pd`) for data manipulation and analysis.
- `re` for regular expression operations, which can be useful for text processing.

All later cells depend on these imports, so run this cell first.


In [15]:
from epo.tipdata.epab import EPABClient
import os
import numpy as np
import pandas as pd
import re

## Helper Functions for Text Processing

This section defines two helper functions:
1. `extract_texts(claims)`: takes a list of claim dictionaries and returns a list of their `text` values, skipping any dictionary that has no `text` key.

2. `remove_brackets(text)`: removes markup tags, that is, every pair of angle brackets together with the text between them (`<...>`). It accepts either a single string or a list of strings and uses a non-greedy regular expression with `re.sub` to replace each tag with an empty string. Any other input type is returned unchanged.

These functions are useful for cleaning and extracting relevant information from patent claims.


In [5]:
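def extract_texts(claims):
    # Collect the 'text' value of every claim dictionary that has one
    return [claim['text'] for claim in claims if 'text' in claim]


def remove_brackets(text):
    # Strip <...> tags from a list of strings or from a single string
    if isinstance(text, list):
        return [re.sub(r'<.*?>', '', item) for item in text]
    elif isinstance(text, str):
        return re.sub(r'<.*?>', '', text)
    # Any other type (for example None) is returned unchanged
    return text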
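
As a quick check of the two helpers, the sketch below runs them on a small hand-made claims list. The sample dictionaries (including the `language` key) are invented for illustration and only approximate what the EPAB data may look like; the helper definitions are repeated so that the block runs on its own.

```python
import re


def extract_texts(claims):
    return [claim['text'] for claim in claims if 'text' in claim]


def remove_brackets(text):
    if isinstance(text, list):
        return [re.sub(r'<.*?>', '', item) for item in text]
    elif isinstance(text, str):
        return re.sub(r'<.*?>', '', text)
    return text


# Hypothetical sample: two entries carry a 'text' key, one does not
sample_claims = [
    {'language': 'EN', 'text': '<claim id="c1"><claim-text>1. A widget.</claim-text></claim>'},
    {'language': 'DE'},
    {'language': 'FR', 'text': '<claim-text>1. Un dispositif.</claim-text>'},
]

texts = extract_texts(sample_claims)   # the entry without 'text' is skipped
cleaned = remove_brackets(texts)       # tags are removed from every string

print(cleaned)                         # ['1. A widget.', '1. Un dispositif.']
print(remove_brackets(None))           # None: non-string input passes through
```

Because `.` does not match a newline by default, a tag that spans several lines would be left in place. Passing `flags=re.DOTALL` to `re.sub` is one way to handle that case if it occurs in the data.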

## Initializing the EPABClient

In this section, we initialize the `EPABClient` instance, which connects to the European Patent Office (EPO) data services.

- The `env` parameter determines the environment the client connects to:
  - `'TEST'`: a testing environment in which only limited data may be available.
  - `'PROD'`: the production environment, in which the full data is accessible.

The cell below connects to the `'TEST'` environment; pass `env='PROD'` instead to retrieve the full data.


In [17]:
epab = EPABClient(env='TEST')
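
One way to switch environments without editing the cell is to read the value from an environment variable. The sketch below is an assumption and not part of the `epo.tipdata.epab` API: the variable name `EPAB_ENV` and the helper `resolve_env` are hypothetical, and only the two values named above are accepted.

```python
import os

VALID_ENVS = ('TEST', 'PROD')  # the two values described above


def resolve_env(default='TEST'):
    # EPAB_ENV is a hypothetical variable name chosen for this sketch
    env = os.environ.get('EPAB_ENV', default).strip().upper()
    if env not in VALID_ENVS:
        raise ValueError(f"env must be one of {VALID_ENVS}, got {env!r}")
    return env


env = resolve_env()
print(f"Connecting to the {env} environment")
# epab = EPABClient(env=env)  # then create the client as in the cell above
```

Rejecting unknown values early keeps a typo from silently selecting the wrong environment.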

## Function to Query Patent Data and Save Abstracts and Claims as Text Files

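This function retrieves publications that have an English abstract, one publication year at a time, and writes the abstract and the claims of each publication to separate text files. Publications with an empty abstract or no claims are skipped, and only the first entry of the `claims` list is written.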

### Parameters:
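- `start_year`: The first year to retrieve (inclusive).
- `end_year`: The last year to retrieve (inclusive).

For each year, a directory named after that year (for example `1982`) is created in the current working directory. Each publication is stored there as two files, `<publication number>_abstract.txt` and `<publication number>_claims.txt`.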


In [30]:
def process_and_save_patent_abstracts_and_claims(start_year, end_year):
    # Query abstracts in English
    q = epab.query_abstract_language("en")
    
    # Loop through the range of years
    for year in range(start_year, end_year + 1):
        # Create a directory for the current year
        new_folder_path = str(year)
        os.makedirs(new_folder_path, exist_ok=True)
        output_folder = new_folder_path
        
        # Query publications for the current year
        p = epab.query_publication_date(str(year) + "%")
        s = q & p
        # s represents the data with English abstracts for the current year
        
        # Fetch results in a DataFrame
        df = s.get_results("title.en,publication,ipc,abstract.text,claims", output_type="dataframe")
        df['abstract.text'] = df['abstract.text'].apply(remove_brackets)
        
        # Iterate over each row in the DataFrame
        for _, row in df.iterrows():
            pub_number = row['publication.number']
            
            # Check that 'claims' and 'abstract.text' are non-empty
            if row['claims'] and row['abstract.text']:
                # Define filenames for abstract and claims
                abstract_filename = os.path.join(output_folder, f'{pub_number}_abstract.txt')
                claims_filename = os.path.join(output_folder, f'{pub_number}_claims.txt')
                
                # Extract and clean abstract and claims text
                abstract = row['abstract.text']
                claims = row['claims'][0]['text']
                claims_ = remove_brackets(claims)
                
                # Save the claims content to a text file (UTF-8, so non-ASCII characters are written reliably)
                with open(claims_filename, 'w', encoding='utf-8') as claims_file:
                    claims_file.write(claims_)
                
                # Save the abstract content to a text file
                with open(abstract_filename, 'w', encoding='utf-8') as abstract_file:
                    abstract_file.write(abstract)
        
        print(f"Data for the year {year} has been processed and saved.")


In [31]:
process_and_save_patent_abstracts_and_claims(1982, 1982)

Data for the year 1982 has been processed and saved.
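
To work with the saved files later, they can be paired up again by publication number. The sketch below assumes the naming scheme used by the function above (`<number>_abstract.txt` and `<number>_claims.txt` inside a folder named after the year). The helper name `load_year` is hypothetical, and reading the files as UTF-8 is an assumption that may not hold if they were written with a different platform default.

```python
import os


def load_year(folder):
    # Map publication number -> {'abstract': ..., 'claims': ...}
    records = {}
    for name in sorted(os.listdir(folder)):
        for kind in ('abstract', 'claims'):
            suffix = f'_{kind}.txt'
            if name.endswith(suffix):
                pub_number = name[:-len(suffix)]
                path = os.path.join(folder, name)
                with open(path, encoding='utf-8') as fh:
                    records.setdefault(pub_number, {})[kind] = fh.read()
    return records


# Only runs if the folder from the call above exists
if os.path.isdir('1982'):
    docs = load_year('1982')
    print(f"{len(docs)} publications loaded")
```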


## Function to Query and Process Patent Data Over a Specified Year Range

This function retrieves patent data with abstracts in English and a publication date within a given year range. It performs the following steps:
1. **Queries the EPABClient** for publications that have an English abstract and a publication date within the requested range.
2. **Cleans and processes the data**: strips markup tags from the abstracts and claims, drops publications without any claims text, keeps the first claims text of each remaining publication, and reduces the `ipc` column to a list of IPC symbols.
3. **Saves the result** as a CSV file inside a ZIP archive for further analysis.

### Parameters:
- `start_year`: The first year of the desired range (inclusive).
- `end_year`: The last year of the desired range (inclusive).

The output file is named `{start_year}_{end_year}_all_data.zip` and contains a single CSV file, `{start_year}_{end_year}_all_data.csv`.
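
The cleaning steps listed above can be tried without a connection to the EPO services. The sketch below applies the same pandas operations to a hand-made DataFrame. The two sample rows and their values are invented, and the nested layout of `claims` and `ipc` is assumed from how the function accesses them.

```python
import re

import pandas as pd


def extract_texts(claims):
    return [claim['text'] for claim in claims if 'text' in claim]


def remove_brackets(text):
    if isinstance(text, list):
        return [re.sub(r'<.*?>', '', item) for item in text]
    elif isinstance(text, str):
        return re.sub(r'<.*?>', '', text)
    return text


# Invented sample rows; the second publication has no claims text
df = pd.DataFrame({
    'publication.number': ['0000001', '0000002'],
    'abstract.text': ['<p>A widget.</p>', '<p>A gadget.</p>'],
    'claims': [[{'text': '<claim-text>1. A widget.</claim-text>'}], []],
    'ipc': [[{'symbol': 'A01B1/00'}], [{'symbol': 'B02C3/00'}]],
})

df['abstract.text'] = df['abstract.text'].apply(remove_brackets)
df['claims'] = df['claims'].apply(extract_texts).apply(remove_brackets)

# Drop rows with an empty claims list, then keep the first claims text
df = df[df['claims'].apply(len) > 0].reset_index(drop=True)
df['claims'] = df['claims'].apply(lambda x: x[0])

# Reduce each IPC entry to its symbol
df['ipc'] = df['ipc'].apply(lambda x: [item['symbol'] for item in x])

print(df.to_dict('records'))
```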

In [21]:
def process_patent_data(start_year, end_year):
    # Query abstracts in English
    q = epab.query_abstract_language("en")
    # Query publications in the given range of years
    p = epab.query_publication_date(f"{start_year}0101-{end_year}1231")
    # Combine the queries
    s = q & p
    
    # Get results and convert them to a DataFrame
    df = s.get_results("title.en,publication,ipc,abstract.text,claims", output_type="dataframe")
    
    # Clean the abstract texts and claims
    df['abstract.text'] = df['abstract.text'].apply(remove_brackets)
    df['claims'] = df['claims'].apply(extract_texts)
    df['claims'] = df['claims'].apply(remove_brackets)
    
    # Filter out rows without claims
    df = df[df['claims'].apply(lambda x: len(x) > 0)].reset_index(drop=True)
    
    # Keep only the first claims text of each publication
    df['claims'] = df['claims'].apply(lambda x: x[0])
    
    # Drop unnecessary publication columns
    df = df.drop(columns=['publication.country', 'publication.kind', 'publication.language'])
    
    # Extract only the 'symbol' field from the 'ipc' column
    df['ipc'] = df['ipc'].apply(lambda x: [item['symbol'] for item in x])
    
    # Define the name for the CSV file inside the ZIP archive
    csv_file_in_zip = f'{start_year}_{end_year}_all_data.csv'
    
    # Save the DataFrame as a zipped CSV file
    zip_filename = f'{start_year}_{end_year}_all_data.zip'
    df.to_csv(zip_filename, index=False, compression=dict(method='zip', archive_name=csv_file_in_zip))
    
    print(f"Data from {start_year} to {end_year} has been saved to {zip_filename}.")

In [22]:
process_patent_data(1982, 1982)

Data from 1982 to 1982 has been saved to 1982_1982_all_data.zip.
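
The archive can be loaded back with pandas, which infers ZIP compression from the file extension. One caveat: the `ipc` lists are written to CSV as their string representation, so they come back as strings. The sketch below converts them with `ast.literal_eval`. It first writes a small invented DataFrame so that it runs on its own; the file name `demo_all_data.zip` is hypothetical.

```python
import ast
import os
import tempfile

import pandas as pd

# Invented row standing in for the output of process_patent_data
demo = pd.DataFrame({
    'publication.number': ['0000001'],
    'claims': ['1. A widget.'],
    'ipc': [['A01B1/00', 'B02C3/00']],
})

zip_path = os.path.join(tempfile.mkdtemp(), 'demo_all_data.zip')
demo.to_csv(zip_path, index=False,
            compression=dict(method='zip', archive_name='demo_all_data.csv'))

# Reading the number as str keeps any leading zeros
loaded = pd.read_csv(zip_path, dtype={'publication.number': str})

print(type(loaded.loc[0, 'ipc']))  # the list was stored as text
loaded['ipc'] = loaded['ipc'].apply(ast.literal_eval)
print(loaded.loc[0, 'ipc'])
```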
