<pre>
<img align="center" width="900" src="https://raw.githubusercontent.com/andrelmfsantos/INPADOC-Data-Normalization/main/Images/FAPESP_Header_Google_Colab_english.png">
</pre>

> * __INPADOC: International Patent Documentation__
* [FAPESP Process Number: 23/12389-1](https://bv.fapesp.br/pt/auxilios/113767/solucoes-diagnosticas-e-terapeuticas-da-covid-19-protegidas-por-patentes-sistematizacao-das-principa/)

# Info

## Authors

|   |   |
|--:|:--|
|**Authors:**|[Priscila Rezende da Costa](https://bv.fapesp.br/pt/pesquisador/67192/priscila-rezende-da-costa/) $-$ [Camila Naves Arantes](http://lattes.cnpq.br/3897204543440920) $-$ [Alex Fabianne de Paulo](http://lattes.cnpq.br/9690861410844635) $-$ <br> [Geciane Silveira Porto](https://bv.fapesp.br/pt/pesquisador/89388/geciane-silveira-porto/) $-$ [André Luis Marques Ferreira dos Santos](http://lattes.cnpq.br/9690861410844635) $-$ [Celise Marson](http://lattes.cnpq.br/2618279063609476)|
|**Host Institution:**|[Universidade Nove de Julho (UNINOVE). Campus Vergueiro. São Paulo, SP, Brasil](https://bv.fapesp.br/pt/instituicao/1496/campus-vergueiro/)|
|**Date:**|July 28, 2024|

## Query & Files

**24,986 results from Derwent Innovations Index for:**

* Query:

    * TS=("covid" OR "coronavirus" OR "pandemic" OR "sars-cov-2")
* [Derwent $-$ Web of Science](https://www.webofscience.com/wos/diidw/summary/e80a62ba-ad88-42d0-a179-f69a58eb4087-fe8523ab/diidw-relevance/1)
* Export: "Tab delimited file"
* Record Content: Full Record

**GitHub files** (folder :: tab-delimited export files):

* Exports_Web_of_Science_FY2019 :: savedrecs_0.txt
* Exports_Web_of_Science_FY2020 :: savedrecs_1.txt to savedrecs_3.txt
* Exports_Web_of_Science_FY2021 :: savedrecs_4.txt to savedrecs_10.txt
* Exports_Web_of_Science_FY2022 :: savedrecs_11.txt to savedrecs_18.txt
* Exports_Web_of_Science_FY2023 :: savedrecs_19.txt to savedrecs_24.txt
* Exports_Web_of_Science_FY2024 :: savedrecs_25.txt to savedrecs_26.txt

## About this Computer Program

**About this Notebook**:

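1. **Import Necessary Modules:**
   - `pandas` for data manipulation.
   - `os` for creating the output folder and building file paths.
   - `requests` for downloading the export files from GitHub.
   - `StringIO` for handling string data as file-like objects.
   - `files` from `google.colab` for file download functionality.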

2. **Configure Pandas:**
   - Set the display option to show the full content of each column without truncating.

3. **Read Multiple Export Files from GitHub:**
   - Define the base URL and the year folders that hold the exports.
   - Read the tab-delimited files `savedrecs_0.txt` to `savedrecs_26.txt` and store them in a list of dataframes.

4. **Combine DataFrames:**
   - Concatenate all dataframes into a single dataframe.
   - Print the length of the combined dataframe and display the first few rows.

5. **Split and Reshape Data:**
   - Split the 'AE' column on the semicolon delimiter and create a new dataframe with the split values in separate columns.
   - Join this new dataframe with the original dataframe to retain the 'PN' and 'AE' columns.
   - Reshape the dataframe to long format (one row per assignee value), dropping NaN values and the helper 'Coluna' column.
   - Split the 'assignee_split' column into 3 new columns:
     (a) assignee_name - registered name;
     (b) assignee_abbreviation - abbreviation of assignee_name;
     (c) assignee_individual_legal - natural or legal person.

6. **Print Information and Sample Data:**
   - Print the length and count of unique values in the 'assignee_split' column.
   - Sort the dataframe by 'PN' and display a random sample of 10 rows.

7. **Save and Download the Final DataFrame:**
   - Define the folder name and file path for the output CSV.
   - Save the final dataframe to a CSV file.
   - Download the CSV file using Google Colab's file download functionality.

This notebook processes tab-delimited Derwent Innovations Index exports stored in a GitHub repository, reshapes the assignee data, and finally saves and downloads the processed data as a CSV file.
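
The three-way split of `assignee_split` described in step 5 can be sketched with a regular expression. This is a minimal sketch, not the notebook's own implementation: the function name `parse_assignee`, the pattern, and the rule that only the `Individual` suffix marks a natural person (everything else is treated as a legal person) are assumptions inferred from assignee strings that appear in the outputs, such as `BIONTECH SE (BNTC-C)` and `JONG K S (JONG-Individual)`.

```python
import re

# Assumed layout: "<registered name> (<CODE>-<type>)", e.g. "UNIV MIAMI (UYMI-C)".
ASSIGNEE_PATTERN = re.compile(
    r"^\s*(?P<name>.*?)\s*\((?P<abbr>[^()\-]+)-(?P<kind>[^()]+)\)\s*$"
)


def parse_assignee(raw):
    """Return (assignee_name, assignee_abbreviation, assignee_individual_legal)."""
    match = ASSIGNEE_PATTERN.match(raw)
    if match is None:
        # Unexpected layout: keep the stripped text as the name, leave the rest empty.
        return raw.strip(), None, None
    kind = match.group("kind").strip()
    # Assumption: only the "Individual" suffix marks a natural person.
    person_type = "natural person" if kind == "Individual" else "legal person"
    return match.group("name"), match.group("abbr"), person_type


print(parse_assignee("BIONTECH SE (BNTC-C)"))
print(parse_assignee(" JONG K S (JONG-Individual)"))
```

With pandas, the function could be applied to the `assignee_split` column with `.apply(parse_assignee)`, and the resulting tuples expanded into the three new columns.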

## Dictionary

### <center>Variable Dictionary</center>

|n   |Column|Non-Null Count|Dtype   |Description|
|:---|:-----|-------------:|:-------|:----------|
|1   |PN    |18298         |object  |Patent Number(s)|
|2   |TI    |18298         |object  |Title|
|3   |AU    |18175         |object  |Inventor(s)|
|4   |AE    |18298         |object  |Patent Assignee Name(s) and Code(s)|
|5   |GA    |18298         |object  |Derwent Primary Accession Number|
|6   |AB    |18297         |object  |Abstract|
|7   |TF    |0             |float64 |Technology Focus|
|8   |EA    |52            |object  |Equivalent Abstracts|
|9   |DC    |18298         |object  |Derwent Class Code(s)|
|10  |MC    |18298         |object  |Derwent Manual Code(s)|
|11  |IP    |18298         |object  |International Patent Classification (IPC)|
|12  |PD    |18298         |object  |Patent Details|
|13  |AD    |18298         |object  |Application Details and Date|
|14  |FD    |7761          |object  |Further Application Details|
|15  |PI    |18298         |object  |Priority Application Information|
|16  |DS    |8091          |object  |Designated States|
|17  |FS    |99            |object  |Field of Search|
|18  |CP    |10691         |object  |Cited Patents|
|19  |CR    |8335          |object  |Cited References|
|20  |DN    |11282         |object  |DCR Number(s) (Derwent Chemistry Resource)|
|21  |MN    |2099          |object  |Markush Number|
|22  |RI    |3169          |object  |Ring Index Number|
|23  |CI    |11120         |object  |Derwent Compound Number|
|24  |RG    |4597          |object  |Derwent Registry Number|

# Code

In [18]:
# @title Package Requirements

# Import necessary modules
import os
from io import StringIO

import pandas as pd
import requests
from google.colab import files  # Colab-only helper used to download the output CSV

# Set the display option to show the full content of each column without truncating
pd.set_option('display.max_colwidth', None)

In [4]:
# @title Get multiple tab-delimited files from GitHub from different folders

# Ensure the output directory exists
output_folder_path = 'TXT_Web_of_Science_to_CSV'
os.makedirs(output_folder_path, exist_ok=True)

# Base URL for the files on GitHub
base_url = 'https://raw.githubusercontent.com/andrelmfsantos/INPADOC-Data-Normalization/main/Exports_Web_of_Science_Full_Years/'

# Define column names
columns = ['PN', 'TI', 'AU', 'AE', 'GA', 'AB', 'TF', 'EA', 'DC', 'MC', 'IP', 'PD', 'AD', 'FD', 'PI', 'DS', 'FS', 'CP', 'CR', 'DN', 'MN', 'RI', 'CI', 'RG']

# Dictionary with folder names and respective file ranges
folders_files = {
    "Exports_Web_of_Science_FY2019": ["savedrecs_0.txt"],
    "Exports_Web_of_Science_FY2020": [f"savedrecs_{i}.txt" for i in range(1, 4)],
    "Exports_Web_of_Science_FY2021": [f"savedrecs_{i}.txt" for i in range(4, 11)],
    "Exports_Web_of_Science_FY2022": [f"savedrecs_{i}.txt" for i in range(11, 19)],
    "Exports_Web_of_Science_FY2023": [f"savedrecs_{i}.txt" for i in range(19, 25)],
    "Exports_Web_of_Science_FY2024": [f"savedrecs_{i}.txt" for i in range(25, 27)]
}

# Function to read a tab-delimited file from a URL and return a DataFrame
def read_file_to_dataframe(file_url):
    response = requests.get(file_url, timeout=60)  # Avoid hanging on a stalled connection
    response.raise_for_status()  # Ensure we notice bad responses
    lines = response.text.splitlines()
    if not lines:
        return pd.DataFrame(columns=columns)

    # The first line is the export's header row (it may start with a BOM);
    # skip it and use the column names defined above instead.
    data = []
    for line in lines[1:]:
        split_line = line.split('\t')
        # Pad rows that have fewer fields than expected with None
        entry = {col: split_line[i] if i < len(split_line) else None for i, col in enumerate(columns)}
        data.append(entry)

    # Create DataFrame
    return pd.DataFrame(data, columns=columns)

# List to hold all DataFrames
all_dataframes = []

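# Iterate over each folder and file
# (the loop variable is named file_names so it does not shadow the `files` module imported from google.colab)
for folder, file_names in folders_files.items():
    for file_name in file_names: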
        file_url = f"{base_url}{folder}/{file_name}"
        try:
            df = read_file_to_dataframe(file_url)
            print(f'{file_name}: {len(df)} rows')

            # Add Folder_Name column
            df['Folder_Name'] = folder

            # Add DataFrame to list
            all_dataframes.append(df)

            # Save DataFrame to CSV
            output_file_path = os.path.join(output_folder_path, f'{file_name.replace(".txt", ".csv")}')
            df.to_csv(output_file_path, index=False)
            print(f'Saved {output_file_path}')
        except requests.exceptions.RequestException as e:
            print(f'Failed to download {file_url}: {e}')

# Combine all DataFrames into a single DataFrame
combined_data = pd.concat(all_dataframes, ignore_index=True)

# Save the combined DataFrame to a CSV file
combined_output_file_path = os.path.join(output_folder_path, 'combined_data.csv')
combined_data.to_csv(combined_output_file_path, index=False)
print(f'Saved combined data to {combined_output_file_path}')
print(combined_data.shape)
combined_data.sample(5, random_state = 43)

savedrecs_0.txt: 194 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_0.csv
savedrecs_1.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_1.csv
savedrecs_2.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_2.csv
savedrecs_3.txt: 427 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_3.csv
savedrecs_4.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_4.csv
savedrecs_5.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_5.csv
savedrecs_6.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_6.csv
savedrecs_7.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_7.csv
savedrecs_8.txt: 968 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_8.csv
savedrecs_9.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_9.csv
savedrecs_10.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_10.csv
savedrecs_11.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_11.csv
savedrecs_12.txt: 1000 rows
Saved TXT_Web_of_Science_to_CSV/savedrecs_12.csv
savedrecs_13.txt: 1000

Unnamed: 0,PN,TI,AU,AE,GA,AB,TF,EA,DC,MC,...,DS,FS,CP,CR,DN,MN,RI,CI,RG,Folder_Name
21441,CN116694637-A,"New small interfering RNA for inhibiting spike glycoprotein gene of coronavirus, and used in preparation of nucleic acid drugs for treating diseases caused by coronavirus, comprises sense strand comprising specific base pair sequence and antisense strand comprising specific base pair sequence",HSU W; MENG Y; LI P,GUIYI TECHNOLOGY SHANGHAI CO LTD (GUIY-Non-standard),202396457R,"NOVELTY - A small interfering RNA (siRNA) for inhibiting spike glycoprotein gene of coronavirus, comprising a sense strand comprising a base pair sequence and an antisense strand comprising a base pair sequence, is new. USE - The siRNA is useful for inhibiting spike glycoprotein gene of coronavirus, and used in preparation of nucleic acid drugs for treating or preventing diseases caused by the coronavirus. ADVANTAGE - The siRNA specifically interfere with the coronavirus, and can effectively knock down the spike sugar of 2019-ncoronavirus-omicro-BA.1/BA.2 protein gene. DETAILED DESCRIPTION - A small interfering RNA (siRNA) for inhibiting spike glycoprotein gene of coronavirus, comprising a sense strand comprising a base pair sequence (5'-CCUCAAUGAGGUUGCCAAGAAUU-3'), and an antisense strand comprising a base pair sequence (5'-UUCUUGGCAACCUCAUUGAGGCG-3'), is new. An INDEPENDENT CLAIM is included for use of siRNA in the preparation of nucleic acid drugs for treating or preventing diseases caused by the coronavirus.",,,"D16 (Fermentation industry - including fermentation equipment, brewing, yeast production, production of pharmaceuticals and other chemicals by fermentation, microbiology, production of vaccines and antibodies, cell and tissue culture and genetic engineering.); B04 (Natural products and polymers. 
Including testing of body fluids (other than blood typing or cell counting), pharmaceuticals or veterinary compounds of unknown structure, testing of microorganisms for pathogenicity, testing of chemicals for mutagenicity or human toxicity and fermentative production of DNA or RNA. General compositions.)",D05-H12D8A; D05-H99; B14-A02B5; B14-S03C; B04-E07C; B04-E99,...,,,,,105730-0-0-0 N,,,RA012P N,,Exports_Web_of_Science_FY2023
7458,US11035817-B1,"Diagnosing viral infection by modifying electrode with electrode material, functionalizing modified electrode by grafting carboxy phenyl group, immobilizing viral antigen on electrode, capping viral antigen and immersing sample in antibody",ZOUROB M; EISSA S,UNIV ALFAISAL (UYAL-Non-standard),2021649415,"NOVELTY - Process for diagnosing a viral infection involves modifying an electrode using an electrode material to make a modified electrode, functionalizing the modified electrode by grafting a carboxy phenyl group to make a functional electrode, immobilizing a viral antigen on the functional electrode to make a viral antigen coated functional electrode, capping the viral antigen coated functional electrode with a layer of tip material to make a tipped electrochemical sensor, in which the layer of tip material is one of cotton, nylon, rayon, polyurethane foam or polyester, contacting the tipped electrochemical sensor with a mammalian body part which releases a mucous membrane secretion on the tipped electrochemical sensor, immersing the sample collected by the tipped electrochemical sensor into a tube containing antibody solution and a redox solution, and applying a voltage difference on the sample collected on the tipped electrochemical sensor. USE - The process is useful for diagnosing a viral infection, where the viral infection is due to a Coronavirus. ADVANTAGE - The process is economical, enables infection control, and has capability of miniaturization and high sensitivity, selectivity and accuracy. 
DETAILED DESCRIPTION - Process for diagnosing a viral infection involves modifying an electrode using an electrode material to make a modified electrode, functionalizing the modified electrode by grafting a carboxy phenyl group to make a functional electrode, immobilizing a viral antigen on the functional electrode to make a viral antigen coated functional electrode, capping the viral antigen coated functional electrode with a layer of tip material to make a tipped electrochemical sensor, in which the layer of tip material is one of cotton, nylon, rayon, polyurethane foam or polyester, contacting the tipped electrochemical sensor with a mammalian body part which releases a mucous membrane secretion on the tipped electrochemical sensor, immersing the sample collected by the tipped electrochemical sensor into a tube containing antibody solution and a redox solution, and applying a voltage difference on the sample collected on the tipped electrochemical sensor in the tube containing redox to read a difference in a reduction peak current using a square wave voltammetry or a charge transfer resistance using electrochemical impedance spectroscopy using Smartphone. 
An INDEPENDENT CLAIM is included for a method of diagnosing a viral infection involving modifying an electrode using carbon nanofiber to make a modified carbon electrode, grafting a carboxy phenyl group to make a functional carbon electrode by functionalizing the modified carbon electrode, coating a viral antigen on the functional carbon electrode to make a coated functional electrode, layering a cotton layer to make a cotton tipped electrochemical sensor on the viral antigen coated functional electrode with, collecting a sample from a mammalian body part infected by a virus which releases a mucous membrane secretion on the cotton tipped electrochemical sensor, immersing the sample collected cotton tipped electrochemical sensor into a tube containing antibody and redox solution, applying a voltage difference on the sample collected on the cotton tipped electrochemical sensor in the tube containing redox to read a difference in a reduction peak current, and identifying the presence or absence of viral infection based on current difference.",,,"A89 (Photographic, laboratory equipment, optical - including electrophotographic, thermographic uses.); B04 (Natural products and polymers. Including testing of body fluids (other than blood typing or cell counting), pharmaceuticals or veterinary compounds of unknown structure, testing of microorganisms for pathogenicity, testing of chemicals for mutagenicity or human toxicity and fermentative production of DNA or RNA. 
General compositions.); D16 (Fermentation industry - including fermentation equipment, brewing, yeast production, production of pharmaceuticals and other chemicals by fermentation, microbiology, production of vaccines and antibodies, cell and tissue culture and genetic engineering.); S03 (Scientific Instrumentation);",A12-E14; A12-V03C2; A03-A05; B11-C08B; B12-K04G1B; B04-B04C1; B11-C11; B11-C08E8; B04-C03D; B04-C02A; B04-C03C; B04-G08; B11-C07A; B11-C12; D05-H10; D05-H09; D05-H06A; S03-E09F; S03-E03C; W01-C01G8S; W01-C01D3C,...,,,,"US11035817-B1 Yanez-Sedeno, Integrated Affinity Biosensing Platforms on Screen-Printed Electrodes Electrografted with Diazonium Salts Sensors 2018, 18, 675, pp. 1-21 (Year: 2018).; Layqah et al, An electrochemical immunosensor for the corona virus associated with the Middle East respiratory syndrome using an array of gold nanoparticle-modified carbon electrodes Microchimica Acta (2019) 186: 224; pp. 1-10 (Year: 2019).; Alamer Rapid colorimetric lactoferrin-based sandwich immunoassay on cotton swabs for the detection of foodborne pathogenic bacteria Taianta 185 (2018) 275-280 (Year: 2020).",199339-0-0-0 M K; 102573-0-0-0 M K; 90356-0-0-0 ; 135416-0-0-0 ; 192544-0-0-0,,,RA2I0R M K; R02035 M K; R01852 ; R24076 ; R24077,2035-S,Exports_Web_of_Science_FY2021
17219,US2022162686-A1; WO2022115187-A1; CA3194659-A1; IN202317027250-A,"System for detecting target microorganism in sample suspected of containing target microorganism, has primary filters, filter membrane, loop mediated isothermal amplification (LAMP) reagents, and hydrogel components for forming hydrogel",CID C A; DOBELLE L; GU A Y; WU X; ZHU Y; LI J; HOFFMANN M R,CALIFORNIA INST OF TECHNOLOGY (CALY-C); CALIFORNIA INST OF TECHNOLOGY (CALY-C),202272512W,"NOVELTY - System (10) for detecting a target microorganism in a sample suspected of containing the target microorganism, comprises primary filters (18) configured to remove particles larger than the target microorganism from the sample, thus producing a primary filtered sample; a filter membrane (20) receiving the primary filtered sample and configured to trap the target microorganism on a membrane while passing through the membrane particles present in the primary filtered sample that are smaller than the target microorganism; loop mediated isothermal amplification (LAMP) reagents (11); hydrogel components for forming a hydrogel; a substrate configured to receive the filter membrane, the LAMP reagents and the hydrogel components, where the LAMP reagents and the hydrogel components are placed on the membrane to form a loaded substrate; an incubator (14) configured to heat the loaded substrate; and a fluorescence illuminator configured to illuminate the loaded substrate. USE - The system is useful for detecting a target microorganism in a sample suspected of containing the target microorganism (claimed). ADVANTAGE - The method may be employed for rapid and inexpensive point-of-use (POU) absolute quantification of SARS-CoV-2 in environmental water or wastewater samples with high sensitivity. 
DETAILED DESCRIPTION - An INDEPENDENT CLAIM is included for detecting a target microorganism in a sample suspected of containing the target microorganism, which involves filtering the sample to remove particles larger than the target microorganism from the sample, where producing a primary filtered sample; filtering the primary filtered sample with a filter membrane configured to trap the target microorganism, if present, on a membrane while passing through the membrane particles present in the primary filtered sample that are smaller than the target microorganism, thus producing a loaded membrane; combining LAMP reagents and hydrogel components for forming a hydrogel into a mixture; applying the loaded membrane to a slide; applying the mixture to the loaded membrane after the membrane is placed on the slide to form a loaded slide; incubating the loaded slide; illuminating the loaded, incubated slide with a fluorescence illuminator; and visually detecting the presence or absence of one or more fluorescent amplicons on the loaded, incubated slide that are produced as a result of a LAMP reaction amplifying the DNA/RNA of the target microorganism if the target microorganism is present on the membrane, where the presence of the amplicons is indicative of the presence of the target microorganism in the sample and the absence of the amplicons is indicative of the absence of the target microorganism in the sample. DESCRIPTION OF DRAWING(S) - The drawing shows a schematic view of membrane-based in-gel LAMP (mgLAMP) assay system. 
10System 11Reagents 14Incubator 18Primary filters 20Filter membrane",,,"D16 (Fermentation industry - including fermentation equipment, brewing, yeast production, production of pharmaceuticals and other chemicals by fermentation, microbiology, production of vaccines and antibodies, cell and tissue culture and genetic engineering.); D15 (Chemical or biological treatment of water, industrial waste and sewage - including purification, sterilising or testing water, scale prevention, treatment of sewage sludge, regeneration of active carbon which has been used for water treatment and impregnating water with gas e.g. CO2, but excluding plant and anti-pollution devices (C02).); B04 (Natural products and polymers. Including testing of body fluids (other than blood typing or cell counting), pharmaceuticals or veterinary compounds of unknown structure, testing of microorganisms for pathogenicity, testing of chemicals for mutagenicity or human toxicity and fermentative production of DNA or RNA. General compositions.); A89 (Photographic, laboratory equipment, optical - including electrophotographic, thermographic uses.)",D05-H99; D05-H09; D05-H18B2; D04-A01; B04-E99; B04-C03D; B04-C03C; B11-C08J; B12-K04F; B11-C08F8; B11-C07B3; B11-C08N2; B11-C08K; B11-C08E; A05-E06B; A10-E01; A12-L04B; A12-W11A,...,WO2022115187-A1: (National): AE; AG; AL; AM; AO; AT; AU; AZ; BA; BB; BG; BH; BN; BR; BW; BY; BZ; CA; CH; CL; CN; CO; CR; CU; CZ; DE; DJ; DK; DM; DO; DZ; EC; EE; EG; ES; FI; GB; GD; GE; GH; GM; GT; HN; HR; HU; ID; IL; IN; IR; IS; IT; JO; JP; KE; KG; KH; KN; KP; KR; KW; KZ; LA; LC; LK; LR; LS; LU; LY; MA; MD; ME; MG; MK; MN; MW; MX; MY; MZ; NA; NG; NI; NO; NZ; OM; PA; PE; PG; PH; PL; PT; QA; RO; RS; RU; RW; SA; SC; SD; SE; SG; SK; SL; ST; SV; SY; TH; TJ; TM; TN; TR; TT; TZ; UA; UG; US; UZ; VC; VN; WS; ZA; ZM; ZW (Regional): BW; GH; GM; KE; LR; LS; MW; MZ; NA; RW; SD; SL; ST; SZ; TZ; UG; ZM; ZW; EA; AL; AT; BE; BG; CH; CY; CZ; DE; DK; EE; ES; FI; FR; GB; GR; HR; HU; IE; IS; IT; LT; LU; 
LV; MC; MK; MT; NL; NO; PL; PT; RO; RS; SE; SI; SK; SM; TR; OA,,"; WO2022115187-A1 -- US20160362718-A1 ; US20170275696-A1 ; WO2020047077-A1 CALIFORNIA TECHNOLOGY INST (CALY) HOFFMANN M R, LIN X, HUANG X",,131343-0-0-0 M K; 444-0-0-0,,,R08416 M K; R00351,,Exports_Web_of_Science_FY2022
12754,WO2022072401-A1; US2023381159-A1,"Preventing or treating viral infection caused by coronavirus e.g. severe acute respiratory syndrome coronavirus, human coronavirus-NL63 and middle east respiratory syndrome-coronavirus CoV, by administering substituted aromatic compounds",BUCHWALD P,UNIV MIAMI (UYMI-C); UNIV MIAMI (UYMI-C),202248330Q,"NOVELTY - Preventing or treating a viral infection in a subject, comprises administering substituted aromatic compounds (I) and their salts. USE - The method is useful for: preventing or treating a viral infection in a subject, where the viral infection is caused by a coronavirus e.g. severe acute respiratory syndrome coronavirus (SARS-CoV), SARS-CoV-2, human coronavirus-NL63, middle east respiratory syndrome-CoV, HCoV-229E, HCoV-OC43 and/or HCoV-HKU1; and inhibiting an interaction between a coronavirus spike protein and its receptor, thus decreasing viral attachment and entry into a host cell, where the receptor is angiotensin converting enzyme 2 (ACE2), dipeptidyl peptidase 4 (DPP4), or CD13 (all claimed). Test details are described but no results given. ADVANTAGE - The method inhibits an interaction between a coronavirus spike protein and its receptor, thus decreasing viral attachment and entry into a host cell. DETAILED DESCRIPTION - Preventing or treating a viral infection in a subject, comprises administering substituted aromatic compounds of formula (I) and their salts. 
Ring A1, ring D1 = substituted phenyl moiety of formula (a), pyridine-2,5-diyl, furan-2,4-diyl, furan-2,5-diyl, thiophen-2,5-diyl or thiophen-2,4-diyl; R1 = H, halo, CF3, SO3H, CO2R1b, NO2, NH2 or substituted phenyl moiety of formula (b); L1, L2 = -C(=O)-NH- or -NH-C(=O)-; n, m = 0-4; R2 = halo, CF3, SO3H, CO2R1b, NO2 or NH2 and when two R2 are adjacent, they can together form -(N=N-NH)- or, with the carbon atoms to which they are attached, form a 6C aryl optionally substituted by 1-4 R3; R3 = halo, OH, CF3, SO3H, CO2R1b, NO2 or NH2 and when two R3 are adjacent, they can together form -(N=N-NH)-; R4 = halo, CF3, SO3H, CO2R1b, NO2 or NH2 and when two R4 are adjacent, they can together form -(N=N-NH)- or, with the carbon atoms to which they are attached, form a 6C aryl optionally substituted by 1-4 R3; R1a = H, 1-5C alkyl or 1-5C alkoxy; and R1b = H or 1-5C alkyl. INDEPENDENT CLAIMS are also included for: (1) substituted aromatic compounds (I); and (2) a composition comprising (I) and a carrier.",,,"B05 (Other organics - aromatics, aliphatic, organo-metallics, compounds whose substituents vary such that they would be classified in several of B01 - B05.); A96 (Medical, dental, veterinary, cosmetic.)",B06-H; B07-H; B14-A02; B14-K01D; A12-V01,...,WO2022072401-A1: (National): AE; AG; AL; AM; AO; AT; AU; AZ; BA; BB; BG; BH; BN; BR; BW; BY; BZ; CA; CH; CL; CN; CO; CR; CU; CZ; DE; DJ; DK; DM; DO; DZ; EC; EE; EG; ES; FI; GB; GD; GE; GH; GM; GT; HN; HR; HU; ID; IL; IN; IR; IS; IT; JO; JP; KE; KG; KH; KN; KP; KR; KW; KZ; LA; LC; LK; LR; LS; LU; LY; MA; MD; ME; MG; MK; MN; MW; MX; MY; MZ; NA; NG; NI; NO; NZ; OM; PA; PE; PG; PH; PL; PT; QA; RO; RS; RU; RW; SA; SC; SD; SE; SG; SK; SL; ST; SV; SY; TH; TJ; TM; TN; TR; TT; TZ; UA; UG; US; UZ; VC; VN; WS; ZA; ZM; ZW (Regional): BW; GH; GM; KE; LR; LS; MW; MZ; NA; RW; SD; SL; ST; SZ; TZ; UG; ZM; ZW; EA; AL; AT; BE; BG; CH; CY; CZ; DE; DK; EE; ES; FI; FR; GB; GR; HR; HU; IE; IS; IT; LT; LU; LV; MC; MK; MT; NL; NO; PL; PT; RO; RS; SE; 
SI; SK; SM; TR; OA,,,,K U; K U; K U; K U; K U,227024601 K U,,RD6HGQ K U; RD6HGR K U; RD6HGS K U; RD6HGT K U; RD6HGU K U,,Exports_Web_of_Science_FY2022
6172,IN202011015157-A,"Deep convolutional neural network system for reliable detection of severe acute respiratory syndrome (SARS) related coronavirus, has data augmentation technique and supervised machine learning model that are provided to demonstrate construction and training of system",DHALL I; VASHISTH S; SARASWAT S,UNIV AMITY (UAMI-C),2021C0880K,"NOVELTY - The system comprises automated feature extraction from a set of data of the X-Ray images of user which are loaded in the system. The data augmentation technique and a supervised machine learning model are used to demonstrate the construction and training of system. A classification report is provided to involve various evaluation parameters such as precision, recall, f1-score, and support. The images are fed to the convolutional neural network model for the classification of COVID-19 positive and negative cases and receive a software system to diagnose Corona virus disease. USE - Deep convolutional neural network system for reliable detection of SARS- related coronavirus and for building model and analyzing data of people who are suffering from Coronavirus disease. ADVANTAGE - The accuracy improved consistently with respect to the number of steps is provided. The COVID-19 is identified in early stages of diagnosis, thus reducing the plausible deaths. The accurate statistics of patients is provided with a positive report of COVID-19 which is used in medical research. The system prevents spreading the pandemic at the very rapid rate. DESCRIPTION OF DRAWING(S) - The drawing shows a flowchart illustrating the working of CoronaX showing the real time system architecture to determine SARS-related coronavirus (COVID-19).",,,"T01 (Digital Computers); W06 (Aviation, Marine and Radar Systems)",T01-J03; T01-J05B2; T01-J16C1; T01-J16C2; T01-N01E1; T01-N03; W06-A06H3,...,,,,,,,,,,Exports_Web_of_Science_FY2021
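
The manual parsing in `read_file_to_dataframe` (split each line on tabs, pad short rows with `None`) could also be expressed with the standard-library `csv` module. The following is a minimal sketch on an in-memory sample, not the notebook's method: the sample text, the three-column subset, and the helper name `parse_tab_delimited` are illustrative assumptions.

```python
import csv
import io

# Illustrative three-column subset; the notebook itself uses 24 column names.
COLUMNS = ["PN", "TI", "AE"]


def parse_tab_delimited(text, columns):
    """Parse tab-delimited text into a list of dicts, padding short rows with None."""
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark if present
    # QUOTE_NONE keeps double quotes inside fields as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    records = []
    for row in reader:
        padded = row + [None] * (len(columns) - len(row))
        records.append(dict(zip(columns, padded)))
    return records


# Hypothetical sample: one complete row and one row that is missing the AE field.
sample = (
    "\ufeffPN\tTI\tAE\n"
    "GB2620028-A\tExample title\tBIONTECH SE (BNTC-C)\n"
    "CN112914186-A\tShort row\n"
)
records = parse_tab_delimited(sample, COLUMNS)
print(records)
```

Passing `records` to `pd.DataFrame(records, columns=COLUMNS)` would give the same kind of frame the notebook builds by hand.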


In [5]:
# @title Split the column and create a new DataFrame with the split values

# Split the 'AE' column by ';' and create a new DataFrame with one split value per column
new_df = combined_data['AE'].str.split(';', expand=True)
new_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,KAWASAKI GAKUEN EDUCATIONAL FOUND (KAWA-Non-standard),KAWASAKI GAKUEN EDUCATIONAL FOUND (KAWA-Non-standard),,,,,,,,,...,,,,,,,,,,
1,SHANGHAI VETERINARY INST CHINESE ACAD AG (CAGS-C),SHANGHAI VETERINARY INST CHINESE ACAD (CAGS-C),,,,,,,,,...,,,,,,,,,,
2,KOREA PUBLIC (KOPU-Non-standard),,,,,,,,,,...,,,,,,,,,,
3,UNIV KOOKMIN IND ACAD COOP FOUND (UYKO-C),,,,,,,,,,...,,,,,,,,,,
4,KOREA RES INST BIOSCIENCE & BIOTECHNOLOG (KRIB-C),,,,,,,,,,...,,,,,,,,,,
5,SHAANXI NUOWEILIHUA BIOTECHNOLOGY CO LTD (SHAA-Non-standard),SHAANXI NUOWEILIHUA BIOTECHNOLOGY CO LTD (SHAA-Non-standard),,,,,,,,,...,,,,,,,,,,
6,UNIV INNER MONGOLIA NATIONALITY (UIMN-C),UNIV INNER MONGOLIA NATIONALITY (UIMN-C),,,,,,,,,...,,,,,,,,,,
7,CHINESE PEOPLES LIBERATION ARMY DISEASE (CHPE-Non-standard),,,,,,,,,,...,,,,,,,,,,
8,UNIV YONSEI IND ACADEMIC COOP FOUND (UYIA-C),KOREA DISEASE CONTROL & PREVENTION CENT (KODI-Non-standard),,,,,,,,,...,,,,,,,,,,
9,NANTONG INT TRAVEL HEALTH CARE CLINIC (NANT-Non-standard),,,,,,,,,,...,,,,,,,,,,
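
Splitting on `';'` alone keeps the space that follows each semicolon in the raw `AE` strings (for example `UNIV MIAMI (UYMI-C); UNIV MIAMI (UYMI-C)`), so a later piece can differ from the first one only by a leading blank. A minimal sketch of a whitespace-tolerant split in plain Python follows; the helper name `split_assignees` is hypothetical, and the notebook itself does not strip the pieces.

```python
def split_assignees(ae_value):
    """Split an AE string on ';' and strip surrounding whitespace from each piece."""
    return [piece.strip() for piece in ae_value.split(";") if piece.strip()]


ae = "UNIV MIAMI (UYMI-C); UNIV MIAMI (UYMI-C)"

raw_pieces = ae.split(";")
clean_pieces = split_assignees(ae)

print(len(set(raw_pieces)))    # 2: the second piece starts with a space
print(len(set(clean_pieces)))  # 1: both pieces are identical after stripping
```

In pandas, calling `.str.strip()` on the `assignee_split` column after reshaping would have a comparable effect on the unique count.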


In [6]:
# @title Join dataframes

# Create a new DataFrame with the 'PN' and 'AE' columns
new_df1 = combined_data[['PN','AE']]
# Join the new DataFrame with the split columns DataFrame (aligned on the row index)
new_df1 = new_df1.join(new_df)
# Display a sample of the joined DataFrame
new_df1.sample(5,random_state = 43)

Unnamed: 0,PN,AE,0,1,2,3,4,5,6,7,...,44,45,46,47,48,49,50,51,52,53
21441,CN116694637-A,GUIYI TECHNOLOGY SHANGHAI CO LTD (GUIY-Non-standard),GUIYI TECHNOLOGY SHANGHAI CO LTD (GUIY-Non-standard),,,,,,,,...,,,,,,,,,,
7458,US11035817-B1,UNIV ALFAISAL (UYAL-Non-standard),UNIV ALFAISAL (UYAL-Non-standard),,,,,,,,...,,,,,,,,,,
17219,US2022162686-A1; WO2022115187-A1; CA3194659-A1; IN202317027250-A,CALIFORNIA INST OF TECHNOLOGY (CALY-C); CALIFORNIA INST OF TECHNOLOGY (CALY-C),CALIFORNIA INST OF TECHNOLOGY (CALY-C),CALIFORNIA INST OF TECHNOLOGY (CALY-C),,,,,,,...,,,,,,,,,,
12754,WO2022072401-A1; US2023381159-A1,UNIV MIAMI (UYMI-C); UNIV MIAMI (UYMI-C),UNIV MIAMI (UYMI-C),UNIV MIAMI (UYMI-C),,,,,,,...,,,,,,,,,,
6172,IN202011015157-A,UNIV AMITY (UAMI-C),UNIV AMITY (UAMI-C),,,,,,,,...,,,,,,,,,,


In [7]:
# @title Get unique values from each column - with ID option

# Reshape the DataFrame from wide to long: one row per (PN, AE, assignee) value
df_stacked = new_df1.melt(id_vars=['PN', 'AE'], var_name='Coluna', value_name='assignee_split')

# Drop rows with NaN values, then drop the helper 'Coluna' column.
# Reassigning the result (instead of using inplace=True) avoids pandas' SettingWithCopyWarning.
df_stacked = df_stacked.dropna().drop(columns=['Coluna'])

# Print the number of rows and the count of unique values in 'assignee_split'
print('Len Dataframe:', len(df_stacked.assignee_split))
print('Unique Assignees:', len(df_stacked.assignee_split.unique()))
print("-----------------------")

# Sort the DataFrame by 'PN' and take a random sample of 10 rows
df_stacked.sort_values(by=['PN'], ascending=False).sample(10, random_state=43)

Len Dataframe: 53395
Unique Assignees: 25282
-----------------------


Unnamed: 0,PN,AE,assignee_split
13682,WO2022061292-A2; WO2022061292-A3; CA3193258-A1; EP4213938-A2; US2023348586-A1,UNIV TEXAS SYSTEM (TEXA-C); UNIV TEXAS SYSTEM (TEXA-C); UNIV TEXAS SYSTEM BOARD REGENTS (TEXA-C),UNIV TEXAS SYSTEM (TEXA-C)
37756,WO2022271199-A1; EP4358974-A1,ONCOVIR INC (ONCO-Non-standard); ONCOVIR INC (ONCO-Non-standard),ONCOVIR INC (ONCO-Non-standard)
22072,WO2023177976-A2; WO2023177976-A3; US2023366045-A1,PROMEGA CORP (PRMG-C); PROMEGA CORP (PRMG-C),PROMEGA CORP (PRMG-C)
92466,WO2022232641-A1; AU2022266821-A1; CA3217822-A1; AU2022266821-A9; CN117500522-A; EP4329807-A1; JP2024519450-W; US2024190970-A1,PURETECH LYT INC (PURE-Non-standard); PURETECH LYT INC (PURE-Non-standard); PURETECH HEALTH LLC (PURE-Non-standard); PURETECH LYT INC (PURE-Non-standard); PURETECH LYT INC (PURE-Non-standard),PURETECH LYT INC (PURE-Non-standard)
157,WO2019196937-A1; CN110372664-A; CN112119070-A; EP3782993-A1; KR2021023814-A; US2021179598-A1; JP2021521214-W; EP3782993-A4; CN112119070-B,UNIV EAST CHINA SCI & TECHNOLOGY (UYEC-C); UNIV EAST CHINA SCI & TECHNOLOGY (UYEC-C); UNIV EAST CHINA SCI & TECHNOLOGY (UYEC-C); UNIV EAST CHINA SCI & TECHNOLOGY (UYEC-C),UNIV EAST CHINA SCI & TECHNOLOGY (UYEC-C)
18080,CN116920124-A,BEIJING SANROAD BIOLOGICAL PROD CO LTD (BEIJ-Non-standard),BEIJING SANROAD BIOLOGICAL PROD CO LTD (BEIJ-Non-standard)
3353,CN112914186-A,SURITALATU (SURI-Individual),SURITALATU (SURI-Individual)
20287,GB2620028-A,BIONTECH SE (BNTC-C),BIONTECH SE (BNTC-C)
23151,KR2023089644-A,JONG K S (JONG-Individual),JONG K S (JONG-Individual)
2607,CN111610327-A; CN111610327-B,BEIJING BEIER BIOENGINEERING CO LTD (BEIJ-Non-standard),BEIJING BEIER BIOENGINEERING CO LTD (BEIJ-Non-standard)
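
The reshape step can be illustrated on a small scale. The sketch below uses two rows taken from the sample output and hypothetical variable names (`wide`, `long_df`). It shows why the long table has more rows than unique assignees: an assignee that is repeated in `AE` yields one row per repetition, because duplicates are not dropped.

```python
import pandas as pd

# Toy wide table: id columns plus one numbered column per split assignee
wide = pd.DataFrame({
    "PN": ["GB2620028-A", "WO2022271199-A1; EP4358974-A1"],
    "AE": [
        "BIONTECH SE (BNTC-C)",
        "ONCOVIR INC (ONCO-Non-standard); ONCOVIR INC (ONCO-Non-standard)",
    ],
    0: ["BIONTECH SE (BNTC-C)", "ONCOVIR INC (ONCO-Non-standard)"],
    1: [None, "ONCOVIR INC (ONCO-Non-standard)"],
})

# Wide to long: every numbered column becomes rows under 'assignee_split'
long_df = wide.melt(id_vars=["PN", "AE"], var_name="Coluna", value_name="assignee_split")

# Remove the padding rows and the helper column
long_df = long_df.dropna(subset=["assignee_split"]).drop(columns="Coluna")

print("Rows:", len(long_df))
print("Unique assignees:", long_df["assignee_split"].nunique())
```

Here the four melted rows shrink to three after the empty cell is dropped, while only two distinct assignees remain.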


In [12]:
# @title Check error part 2
# If the "Count individual and legal person" block below shows an inconsistent category, replace the offending value here
df_stacked['assignee_split'] = df_stacked['assignee_split'].replace(
    "AMBULANC (SHENZHEN) TECH CO LTD (AMBU-Non-standard)",
    "AMBULANC SHENZHEN TECH CO LTD (AMBU-Non-standard)"
)

In [13]:
# @title Split the 'assignee_split' column and create new columns to individual and legal person
#import pandas as pd

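# Function to split the 'assignee_split' column and create new columns
def split_assignee_column(df):
    # 'assignee_name': everything before the first opening parenthesis
    df['assignee_name'] = df['assignee_split'].str.extract(r'^(.*?)\s*\(', expand=False)

    # 'assignee_abbreviation': text between the first '(' and the first '-' after it
    df['assignee_abbreviation'] = df['assignee_split'].str.extract(r'\((.*?)-', expand=False)

    # 'assignee_individual_legal': skip the four-letter code plus '-' (5 characters)
    # after the first '(' and keep the rest up to the last ')'.
    # Names that contain parentheses themselves break this assumption and must be
    # fixed beforehand (see "Check error part 2").
    df['assignee_individual_legal'] = df['assignee_split'].str.extract(r'\((.{5})(.*)\)')[1]

    return df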

# Simulating the combined dataframe with the 'assignee_split' column
#data = {
#    'assignee_split': [
#        "ETERNIVAX BIOMEDICAL INC (ETER-Non-standard)",
#        "GENETECH PHARMA (GEN-Standard)",
#        "HEALTHY LIFE CORP (HL-Non-standard)"
#    ]
#}
#combined_data = pd.DataFrame(data)

# Applying the function to create new columns
df_stacked = split_assignee_column(df_stacked)

# Displaying the dataframe to show the result
df_stacked.head()

Unnamed: 0,PN,AE,assignee_split,assignee_name,assignee_abbreviation,assignee_individual_legal
0,WO2019093347-A1; JP2019552824-X; JP7227615-B2,KAWASAKI GAKUEN EDUCATIONAL FOUND (KAWA-Non-standard); KAWASAKI GAKUEN EDUCATIONAL FOUND (KAWA-Non-standard),KAWASAKI GAKUEN EDUCATIONAL FOUND (KAWA-Non-standard),KAWASAKI GAKUEN EDUCATIONAL FOUND,KAWA,Non-standard
1,CN109796531-A; CN109796531-B,SHANGHAI VETERINARY INST CHINESE ACAD AG (CAGS-C); SHANGHAI VETERINARY INST CHINESE ACAD (CAGS-C),SHANGHAI VETERINARY INST CHINESE ACAD AG (CAGS-C),SHANGHAI VETERINARY INST CHINESE ACAD AG,CAGS,C
2,KR2018201-B1,KOREA PUBLIC (KOPU-Non-standard),KOREA PUBLIC (KOPU-Non-standard),KOREA PUBLIC,KOPU,Non-standard
3,KR2019092776-A; KR2234745-B1,UNIV KOOKMIN IND ACAD COOP FOUND (UYKO-C),UNIV KOOKMIN IND ACAD COOP FOUND (UYKO-C),UNIV KOOKMIN IND ACAD COOP FOUND,UYKO,C
4,KR2019119391-A; KR2047072-B1,KOREA RES INST BIOSCIENCE & BIOTECHNOLOG (KRIB-C),KOREA RES INST BIOSCIENCE & BIOTECHNOLOG (KRIB-C),KOREA RES INST BIOSCIENCE & BIOTECHNOLOG,KRIB,C
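
The three patterns above all start from the first opening parenthesis, which is why a name such as "AMBULANC (SHENZHEN) TECH CO LTD (AMBU-Non-standard)" needs a manual replacement first. As a possible alternative, not the notebook's method, a single pattern anchored at the end of the string can tolerate parentheses inside the name. `ASSIGNEE_PATTERN` and the sample Series are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical pattern: name, then a trailing "(CODE-Type)" group at the end of the string
ASSIGNEE_PATTERN = (
    r"^(?P<assignee_name>.*?)\s*"
    r"\((?P<assignee_abbreviation>[^()-]+)-"
    r"(?P<assignee_individual_legal>[^()]+)\)\s*$"
)

sample = pd.Series([
    "BIONTECH SE (BNTC-C)",
    "JONG K S (JONG-Individual)",
    "AMBULANC (SHENZHEN) TECH CO LTD (AMBU-Non-standard)",
])

# Named groups become column names in the extracted DataFrame
parts = sample.str.extract(ASSIGNEE_PATTERN)
print(parts)
```

Rows that do not end in a `(CODE-Type)` group come back as missing values in all three columns, which makes them easy to find with `isna()`.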


## Assignee terms
In the assignee column (`AE`) of the Derwent export, each name is followed by a four-letter code and a type label in parentheses. The type labels have the following meanings:

1. **C (Company)**: The assignee is an organization with a standard assignee code. In this dataset the label is not limited to businesses: it appears on companies and on universities alike, for example `BIONTECH SE (BNTC-C)` and `UNIV MIAMI (UYMI-C)`. The rights to the patent are held by an organization rather than an individual.

2. **Individual**: The assignee is an individual person, for example `JONG K S (JONG-Individual)`. The rights to the patent are held by a person rather than a corporation or organization.

3. **Non-standard**: The assignee is an organization without a standard code. In this dataset the label covers companies, foundations, and clinics, for example `ONCOVIR INC (ONCO-Non-standard)`. The four-letter code is not necessarily unique: `BEIJ` is shared by two different Beijing companies in the samples above.

4. **Soviet Institute**: The assignee is a research institute or organization that was part of the Soviet Union. These institutes carried out scientific research and development and were common assignees for patents filed during the Soviet era.

These labels categorize patent ownership and indicate who holds the rights to each invention.

In [14]:
# @title Count individual and legal person
df_stacked.groupby('assignee_individual_legal').size().reset_index(name='count')

Unnamed: 0,assignee_individual_legal,count
0,C,17370
1,Individual,13505
2,Non-standard,22511
3,Soviet Institute,9
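
The four categories can be collapsed into the two classes named in the cell title. The sketch below starts from the counts printed above and assumes, as a simplification that the notebook does not state explicitly, that every type other than `Individual` counts as a legal person. The column name `person_type` is hypothetical.

```python
import pandas as pd

# Counts copied from the output table above
counts = pd.DataFrame({
    "assignee_individual_legal": ["C", "Individual", "Non-standard", "Soviet Institute"],
    "count": [17370, 13505, 22511, 9],
})

# Assumption: anything that is not 'Individual' is treated as a legal person
is_individual = counts["assignee_individual_legal"] == "Individual"
counts["person_type"] = counts["assignee_individual_legal"].where(is_individual, "Legal person")

summary = counts.groupby("person_type")["count"].sum()
print(summary)
print("Total:", summary.sum())
```

The total, 53395, matches the `Len Dataframe` value printed earlier, so every row received exactly one of the four categories.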


In [16]:
# @title Check error part 1
# If necessary, check which "assignee_individual_legal" values are out of standard so the adjustment can be made in the code block:
# "Split the 'assignee_split' column and create new columns to individual and legal person"
df_stacked[df_stacked['assignee_individual_legal'] == 'HEN) TECH CO LTD (AMBU-Non-standard']
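
The filter above targets one specific malformed value. A more general check, sketched here on toy data, flags every row whose type falls outside the four labels described under "Assignee terms". `EXPECTED_TYPES`, `toy`, and `unexpected` are hypothetical names, and the third row reproduces the malformed value from the filter above.

```python
import pandas as pd

# The four labels listed under "Assignee terms"
EXPECTED_TYPES = {"C", "Individual", "Non-standard", "Soviet Institute"}

toy = pd.DataFrame({
    "assignee_split": [
        "BIONTECH SE (BNTC-C)",
        "JONG K S (JONG-Individual)",
        "AMBULANC (SHENZHEN) TECH CO LTD (AMBU-Non-standard)",
    ],
    "assignee_individual_legal": [
        "C",
        "Individual",
        "HEN) TECH CO LTD (AMBU-Non-standard",
    ],
})

# Rows with an unknown (or missing) type need a manual fix before counting
unexpected = toy[~toy["assignee_individual_legal"].isin(EXPECTED_TYPES)]
print(unexpected["assignee_split"].tolist())
```

An empty result means the split produced only the expected categories and no replacement is needed.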

In [19]:
# @title Download the csv file

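# Colab-only helper for browser downloads (harmless if already imported earlier in the notebook)
from google.colab import files

# Define the output file's base name (stored in folder_name) and the CSV file path
folder_name = 'Assignees_Splits_with_Publications_Numbers_Notebook_02.1_From_GitHub_Split_Assignees'
file_path = f'{folder_name}.csv'

# Save the final DataFrame to a CSV file, without the index column
df_stacked.to_csv(file_path, index=False)

# Download the CSV file through the browser
files.download(file_path)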
