# Data Preprocessing

This notebook performs the basic data wrangling steps that together form the pre-processing phase of data analysis, that is, converting data from its initial format into one better suited for analysis. The steps include handling missing values, standardizing formats so the data is consistent, and converting categorical variables into numerical ones.

In [1]:
import pandas as pd
import numpy as np

import utils

In [2]:
file_path = "../data/patsnap_data.xlsx"

In [3]:
df = utils.load_patent_data(file_path, 'sheet1')

In [4]:
df

Unnamed: 0,Number,Publication Number,Title,Legal Status & Events,Current Assignee,Application Date,IPC,Patent Valuation,Abstract,Abstract_English,Claims,Title_English,CPC
0,1,US6056237A,Sonotube compatible unmanned aerial vehicle an...,Non-payment,1281329 ALBERTA LTD.,1997-06-25,B64D1/02 | B64D1/00 | B64D33/02 | B64C39/00 | ...,-,The present invention is generally comprised o...,The present invention is generally comprised o...,I claim:_x000D_\n1. A sonotube compatible unma...,Sonotube compatible unmanned aerial vehicle an...,B64C3/40 | B64C5/12 | B64C39/024 | B64D1/02 | ...
1,2,US8511606B1,Unmanned aerial vehicle base station,Granted,THE BOEING COMPANY,2009-12-09,B64D41/00,"$ 56,000","A method and apparatus comprising a platform, ...","A method and apparatus comprising a platform, ...",1. An apparatus comprising:_x000D_\na platform...,Unmanned aerial vehicle base station,B64C39/028 | B64C39/024 | B64C2201/066 | B64C2...
2,3,US8948935B1,Providing a medical support device via an unma...,Granted | Transfer,WING AVIATION LLC,2013-01-02,G06Q10/00 | B64C39/02 | G16H40/67,"$ 79,000",Embodiments described herein may relate to an ...,Embodiments described herein may relate to an ...,1. An unmanned aerial vehicle (UAV) comprising...,Providing a medical support device via an unma...,A61B5/00 | A61B19/0264 | B64C39/024 | G06F19/3...
3,4,US20100250022A1,Useful unmanned aerial vehicle,Withdrawn-Deemed,"AIR RECON, INC.",2006-12-29,G05D1/00 | B64C13/20 | G06F3/048,-,An unmanned aerial vehicle (UAV) addresses rem...,An unmanned aerial vehicle (UAV) addresses rem...,1-17. (canceled)_x000D_\n18. A method of opera...,Useful unmanned aerial vehicle,B64C2201/141 | B64C2201/145 | G05D1/0094 | G05...
4,5,US20110084162A1,Autonomous Payload Parsing Management System a...,Abandoned-Undetermined,HONEYWELL INTERNATIONAL INC.,2009-10-09,B64C29/00 | G01M1/12 | B64C17/10 | B64D37/14 |...,-,An unmanned aerial vehicle (UAV) for making pa...,An unmanned aerial vehicle (UAV) for making pa...,1. An unmanned aerial vehicle (UAV) for making...,Autonomous Payload Parsing Management System a...,B64C39/024 | B64C2201/027 | B64C2201/088 | B64...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,CN206407426U,Take precautions against earthquakes and swing...,Granted,华南智能机器人创新研究院,2016-12-30,B65D90/52 | B64C39/02,"$ 3,600",The utility model discloses a taking precautio...,The utility model discloses a taking precautio...,1.一种防震荡药箱，其特征在于，包括外壳、用来密封外壳的上盖、用于防震荡的药液分隔结构；外壳...,Take precautions against earthquakes and swing...,-
9996,9997,CN106708070A,Aerial photographing control method and apparatus,Granted,深圳市道通智能航空技术股份有限公司,2015-08-17,G05D1/08,"$ 130,000","The invention, which relates to the technical ...","The invention, which relates to the technical ...",1.一种航拍控制方法，其特征在于，包括步骤：_x000D_\n预设无人机状态数据与触发无人机...,Aerial photographing control method and apparatus,-
9997,9998,CN205864058U,A redundancy power supply for unmanned aerial ...,Non-payment,深圳光启空间技术有限公司,2016-07-14,H02J9/06,-,The utility model discloses a redundancy power...,The utility model discloses a redundancy power...,1.一种用于无人机系统的冗余电源，其特征在于，包括：_x000D_\n电源保护模块，与电源模...,A redundancy power supply for unmanned aerial ...,-
9998,9999,GB2377683A,Composite of unmanned aerial vehicles,Withdrawn-Undetermined,BAE SYSTEMS PLC,2001-07-20,B64D7/08 | B64D5/00 | B64C39/02 | F42B15/36 | ...,-,The aerial vehicle comprises a plurality of UA...,The aerial vehicle comprises a plurality of UA...,Claims_x000D_\n1 An aerial vehicle comprising ...,Composite of unmanned aerial vehicles,B64C39/024 | B64C2201/082 | B64C2201/102 | B64...


## Checking Data Types

Purpose:
1. Spot potential information loss and type mismatches
2. Confirm compatibility with the Python methods used later

In [5]:
df.dtypes

Number                    int64
Publication Number       object
Title                    object
Legal Status & Events    object
Current Assignee         object
Application Date         object
IPC                      object
Patent Valuation         object
Abstract                 object
Abstract_English         object
Claims                   object
Title_English            object
CPC                      object
dtype: object

The data types seem reasonable.
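The columns load without errors, but `Application Date` and `Patent Valuation` are stored as `object` (strings). If later steps need them as typed values, a conversion along these lines is one option. This is a minimal sketch on a toy frame: the `-` placeholder and the `$ 56,000` format are taken from the preview above, while the `Valuation_USD` column name is hypothetical.

```python
import pandas as pd

# Toy frame mimicking the formats seen in the preview above
sample = pd.DataFrame({
    "Application Date": ["1997-06-25", "2009-12-09", "2013-01-02"],
    "Patent Valuation": ["-", "$ 56,000", "$ 79,000"],
})

# Parse dates; unparseable values become NaT instead of raising
sample["Application Date"] = pd.to_datetime(sample["Application Date"], errors="coerce")

# Strip the currency symbol, spaces and thousands separators, then convert;
# the "-" placeholder cannot be parsed and becomes NaN
cleaned = sample["Patent Valuation"].str.replace(r"[$,\s]", "", regex=True)
sample["Valuation_USD"] = pd.to_numeric(cleaned, errors="coerce")

print(sample.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on placeholders, at the cost of silently turning any unexpected format into a missing value, so a follow-up `isna()` check is advisable.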

## Data Cleaning

Check duplicates

In [8]:
# duplicate_count = df.duplicated().sum()

# duplicate_count

# Check for duplicates based on Title_English and Abstract_English columns
duplicates = df.duplicated(subset=['Title_English', 'Abstract_English'])

# Print duplicated rows
df[duplicates]

Unnamed: 0,Number,Publication Number,Title,Legal Status & Events,Current Assignee,Application Date,IPC,Patent Valuation,Abstract,Abstract_English,Claims,Title_English,CPC
212,213,USD813724S1,Unmanned aerial vehicle,Granted,"SHENZHEN C-FLY INTELLIGENT TECHNOLOGY CO., LTD.",2017-05-18,-,-,-,-,The ornamental design for an “unmanned aerial ...,Unmanned aerial vehicle,-
250,251,USD761690S1,Unmanned aerial vehicle,Granted,"DRONESMITH TECHNOLOGIES, INC.",2014-11-06,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
342,343,USD782365S1,Unmanned aerial vehicle,Granted,XDYNAMICS LIMITED,2016-03-17,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
416,417,USD818874S1,Unmanned aerial vehicle,Granted,"YUNEEC INTERNATIONAL (CHINA) CO, LTD",2017-06-27,-,-,-,-,The ornamental design for a unmanned aerial ve...,Unmanned aerial vehicle,-
429,430,USD776569S1,Unmanned aerial vehicle,Granted,"MATTERNET, INC.",2015-03-26,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
484,485,USD725548S1,Unmanned aerial vehicle,Granted | Pledge,"MERRILL TECHNOLOGIES GROUP, INC.",2014-07-14,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
529,530,USD784202S1,Unmanned aerial vehicle,Granted,"HANWHA TECHWIN CO., LTD",2016-03-10,-,-,-,-,The ornamental design of an unmanned aerial ve...,Unmanned aerial vehicle,-
542,543,USD665331S1,Unmanned aerial vehicle,Granted | Pledge,"UNMANNED SYSTEMS, INC.",2011-11-09,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
563,564,USD776570S1,Unmanned aerial vehicle,Granted,"MATTERNET, INC.",2015-03-26,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
612,613,USD745435S1,Unmanned aerial vehicle,Granted,"HANWHA TECHWIN CO., LTD.",2014-11-21,-,-,-,-,The ornamental design for an unmanned aerial v...,Unmanned aerial vehicle,-
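Every row flagged above is a design patent whose abstract is the placeholder `-`, and the publication numbers differ, so a match on title and abstract does not necessarily mean the records are true duplicates. One way to check this, sketched on toy data, is to ignore rows whose abstract is only a placeholder and to compare against a check on `Publication Number`. The rows and publication numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Publication Number": ["USD1S1", "USD2S1", "US3A", "US3A"],
    "Title_English": ["Unmanned aerial vehicle"] * 2 + ["Base station"] * 2,
    "Abstract_English": ["-", "-", "A platform ...", "A platform ..."],
})

# Naive check: the second row of each matching pair is flagged
naive = toy.duplicated(subset=["Title_English", "Abstract_English"])

# Stricter check: only count a row as a duplicate when the abstract is real text
has_abstract = toy["Abstract_English"].replace("-", np.nan).notna()
strict = naive & has_abstract

# Alternative key: the publication number should be unique per patent
by_number = toy.duplicated(subset=["Publication Number"])

print(naive.sum(), strict.sum(), by_number.sum())
```

In this toy frame the naive check flags the placeholder-only design patent as well, while the two stricter checks flag only the genuinely repeated record.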


Formatting Topic/Title and Category

In [6]:
df[['TITLE','CATEGORY','SUB_CATEGORY']] = df[['TITLE','CATEGORY','SUB_CATEGORY']].apply(lambda x: x.str.upper())

df[['TITLE', 'CATEGORY','SUB_CATEGORY']]

Unnamed: 0,TITLE,CATEGORY,SUB_CATEGORY
0,SYSTEMS AND METHODS FOR STRUCTURE DISCOVERY AN...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
1,"FREQUENCY SEPARATOR, OPTICAL QUANTIZATION CIRC...",ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
2,MEDICAL STRUCTURED REPORTING WORKFLOW ASSISTED...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
3,"SYSTEMS, METHODS, AND APPARATUSES FOR GENERATI...",ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
4,NATURAL LANGUAGE PROCESSING FOR BLOCKCHAIN-BAS...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
...,...,...,...
8151,SYSTEMS AND METHODS FOR MONITORING AUTOMATED C...,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES
8152,PORTABLE COMPOSITE BONDING INSPECTION SYSTEM,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES
8153,METHODS OF DEBONDING A COMPOSITE TOOLING,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES
8154,ADHESIVE OF A SILICON AND SILICA COMPOSITE FOR...,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES


Formatting Assignee

In [7]:
# Standardize company names
df['ASIGNEE'] = df['ASIGNEE'].str.upper().str.strip()

# Remove duplicates after standardization
df.drop_duplicates(inplace=True)

# Display standardized company names
df['ASIGNEE']

0                           CASETEXT, INC.
1          MITSUBISHI ELECTRIC CORPORATION
2                                 EBIT SRL
3       PREMIER HEALTHCARE SOLUTIONS, INC.
4                                      NaN
                       ...                
8151                    THE BOEING COMPANY
8152                     SPACE MICRO, INC.
8153       TOYOTA MOTOR SALES U.S.A., INC.
8154                                   NaN
8155                    THE BOEING COMPANY
Name: ASIGNEE, Length: 8156, dtype: object
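Upper-casing and stripping whitespace can still leave near-duplicate spellings, such as `HANWHA TECHWIN CO., LTD` and `HANWHA TECHWIN CO., LTD.` in the duplicate listing earlier. A possible extra normalization step is sketched below. The helper name `normalize_assignee` is hypothetical, and the rules (collapse whitespace, drop trailing periods) are assumptions that would need checking against the real data.

```python
import pandas as pd

def normalize_assignee(names: pd.Series) -> pd.Series:
    """Upper-case, trim, collapse inner whitespace and drop trailing periods."""
    return (
        names.str.upper()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.rstrip(".")
    )

names = pd.Series([
    "Hanwha Techwin Co., Ltd",
    "HANWHA TECHWIN CO., LTD.",
    None,                       # missing assignees stay missing
    " The  Boeing Company ",
])
normalized = normalize_assignee(names)

print(normalized.nunique())
```

Dropping trailing periods also turns `INC.` into `INC`, which is harmless as long as the rule is applied to every row consistently.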

In [8]:
df.head()

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,PATENT_NO,RELEVANCY,PRODUCT,CATEGORY,SUB_CATEGORY,TAXONOMY_SSTT
0,SYSTEMS AND METHODS FOR STRUCTURE DISCOVERY AN...,G06F40/289;G06F40/40;G06F16/33;G06F40/279;G06F...,"CASETEXT, INC.",2023-06-29,11861321,69.63724,,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
1,"FREQUENCY SEPARATOR, OPTICAL QUANTIZATION CIRC...",G02F1/01;G02F2/006;G02F7/00,MITSUBISHI ELECTRIC CORPORATION,2023-06-27,20230350269,46.685993,,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
2,MEDICAL STRUCTURED REPORTING WORKFLOW ASSISTED...,G06F40/174;G06N20/00;G16H50/20;G16H10/60;G16H1...,EBIT SRL,2023-06-27,20240006039,71.25871,,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
3,"SYSTEMS, METHODS, AND APPARATUSES FOR GENERATI...",G06F40/30;G06F16/34;G06F40/279;G06F21/6254;G06...,"PREMIER HEALTHCARE SOLUTIONS, INC.",2023-06-23,20230418981,61.53855,,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
4,NATURAL LANGUAGE PROCESSING FOR BLOCKCHAIN-BAS...,G06F16/289;G06F16/283;G06F16/24573;G06F16/27,,2023-06-09,20230409604,68.528656,,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09


In [9]:
df.describe(include="all")

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,PATENT_NO,RELEVANCY,PRODUCT,CATEGORY,SUB_CATEGORY,TAXONOMY_SSTT
count,8156,8043,3698,8156,8156.0,8156.0,0.0,8156,8156,8156
unique,7376,7794,1661,,8092.0,,,5,17,17
top,UNMANNED AERIAL VEHICLE,G06F21/577,INTERNATIONAL BUSINESS MACHINES CORPORATION,,20200390000.0,,,ARTIFICIAL INTELLIGENCE,INTEGRATED PLATFORMS,C02
freq,147,11,122,,2.0,,,2812,2063,2063
mean,,,,2016-03-28 17:00:08.827856896,,60.742594,,,,
min,,,,2006-01-03 00:00:00,,12.973147,,,,
25%,,,,2013-02-13 00:00:00,,45.402196,,,,
50%,,,,2017-02-22 00:00:00,,58.735012,,,,
75%,,,,2019-12-27 00:00:00,,70.13864,,,,
max,,,,2023-10-16 00:00:00,,229.25974,,,,


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8156 entries, 0 to 8155
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   TITLE          8156 non-null   object        
 1   CPC            8043 non-null   object        
 2   ASIGNEE        3698 non-null   object        
 3   FILING_DATE    8156 non-null   datetime64[ns]
 4   PATENT_NO      8156 non-null   object        
 5   RELEVANCY      8156 non-null   float64       
 6   PRODUCT        0 non-null      float64       
 7   CATEGORY       8156 non-null   object        
 8   SUB_CATEGORY   8156 non-null   object        
 9   TAXONOMY_SSTT  8156 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 637.3+ KB


## Missing values

Three columns have missing values: `CPC`, `ASIGNEE`, and `PRODUCT` (which is entirely empty).

In [11]:
missing_data = df.isna()

# sum() to count missing values in each column
missing_data_count = missing_data.sum()

print(missing_data_count)

TITLE               0
CPC               113
ASIGNEE          4458
FILING_DATE         0
PATENT_NO           0
RELEVANCY           0
PRODUCT          8156
CATEGORY            0
SUB_CATEGORY        0
TAXONOMY_SSTT       0
dtype: int64
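Raw counts are easier to judge as a share of all rows; in the output above, for instance, `ASIGNEE` is missing in 4458 of 8156 rows, roughly 55%. A small sketch of such a summary, using a toy frame with the same column names (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CPC": ["G06F40/289", None, "G01N21/94", "H01L21/67306"],
    "ASIGNEE": ["THE BOEING COMPANY", None, None, "EBIT SRL"],
    "PRODUCT": [np.nan] * 4,
})

# Missing count and percentage per column, sorted by severity
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```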


In [12]:
# Drop rows with missing values in some columns
df.dropna(subset=['FILING_DATE','PATENT_NO','RELEVANCY','CPC'], inplace=True)
print(df.isna().sum())

TITLE               0
CPC                 0
ASIGNEE          4439
FILING_DATE         0
PATENT_NO           0
RELEVANCY           0
PRODUCT          8043
CATEGORY            0
SUB_CATEGORY        0
TAXONOMY_SSTT       0
dtype: int64


The pandas `replace()` method will be used. *Values of the Series/DataFrame are replaced with other values dynamically. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.*
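A minimal illustration of that difference, on toy data rather than the notebook's frame: `replace()` targets values wherever they occur, while `.loc` targets a position. The `-` placeholder and the `"Unknown"` fill value are used only as examples.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ASIGNEE": ["EBIT SRL", np.nan, "-"],
    "CPC": ["G02F1/01", "-", np.nan],
})

# replace(): every "-" placeholder becomes NaN, wherever it appears
by_value = toy.replace("-", np.nan)

# .loc: only the explicitly addressed cell changes
by_location = toy.copy()
by_location.loc[1, "ASIGNEE"] = "Unknown"

print(by_value.isna().sum())
print(by_location)
```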

> Further discussion is needed on how to handle the remaining missing values

Drop rows that have no CPC inventive code

In [13]:
# # Will be moved to utils.py for reusability and maintainability

# def replace_to_unknown(df):
#     # Replace NaN values with "Unknown" in the entire DataFrame
#     return df.replace(np.nan, "Unknown")

In [14]:
# main_df = replace_to_unknown(main_df)

# main_df.head()

In [15]:
# #save the cleaned data to a new file
# data_path = "../data/processed/uav.csv"

# main_df.to_csv(data_path, index=False)

## Feature Engineering

Adding a `SECTION` feature by extracting it from the `TAXONOMY_SSTT` feature

In [16]:
# Extract only the letters from the "TAXONOMY_SSTT" column
df['SECTION'] = df['TAXONOMY_SSTT'].str.extract(r'([A-Z]+)')

df[['TAXONOMY_SSTT','SECTION']].head()

Unnamed: 0,TAXONOMY_SSTT,SECTION
0,A09,A
1,A09,A
2,A09,A
3,A09,A
4,A09,A
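`str.extract` returns the first capture group of the first match, so `[A-Z]+` picks up the leading letters of each code. A quick check on a few of the codes that appear in this dataset is sketched below. `expand=False` is used here so the result is a Series rather than a one-column DataFrame, and the split into a numeric part is an optional extra that the notebook does not perform.

```python
import pandas as pd

codes = pd.Series(["A09", "C01", "B14", "B02"], name="TAXONOMY_SSTT")

# Leading letters -> section; expand=False returns a Series
section = codes.str.extract(r"([A-Z]+)", expand=False)

# Optional: two named groups split the code into section and number
parts = codes.str.extract(r"(?P<SECTION>[A-Z]+)(?P<NUMBER>\d+)")

print(section.tolist())
print(parts)
```

Note that the extracted number stays a string (`"09"`, not `9`), which preserves the leading zero.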


Unique Values

In [17]:
# Get unique values in the specific column
unique_sstt = df['TAXONOMY_SSTT'].unique()

unique_sstt

array(['A09', 'C01', 'A08', 'C07', 'B07', 'B14', 'B02', 'C03', 'B10',
       'C02', 'B08', 'B09', 'C04', 'B04', 'A04', 'A01', 'B12'],
      dtype=object)

In [18]:
# Count unique values in the "ASIGNEE" column
unique_asignee_count = df['ASIGNEE'].nunique()

print(unique_asignee_count)

1624


Drop `RELEVANCY` and `PRODUCT` for now

In [19]:
df.drop(columns=['RELEVANCY','PRODUCT'], inplace=True)

Adding `YEAR` feature

In [20]:
# Extract year and create 'YEAR' column
df['YEAR'] = df['FILING_DATE'].dt.year

# Move 'YEAR' column next to 'FILING_DATE' column
filing_date_index = df.columns.get_loc('FILING_DATE')
df.insert(filing_date_index + 1, 'YEAR', df.pop('YEAR'))

In [21]:
df.tail()

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,YEAR,PATENT_NO,CATEGORY,SUB_CATEGORY,TAXONOMY_SSTT,SECTION
8151,SYSTEMS AND METHODS FOR MONITORING AUTOMATED C...,B29C70/386;G01N21/88,THE BOEING COMPANY,2006-05-16,2006,20070277919,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES,B12,B
8152,PORTABLE COMPOSITE BONDING INSPECTION SYSTEM,G01N21/94,"SPACE MICRO, INC.",2007-04-23,2007,20070252084,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES,B12,B
8153,METHODS OF DEBONDING A COMPOSITE TOOLING,B29C70/30;B29C33/505;B64F5/10,"TOYOTA MOTOR SALES U.S.A., INC.",2006-07-21,2006,20070006960,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES,B12,B
8154,ADHESIVE OF A SILICON AND SILICA COMPOSITE FOR...,H01L21/67306,,2006-06-01,2006,20060213601,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES,B12,B
8155,COMPOSITE LAMINATION USING ARRAY OF PARALLEL M...,B29C70/545;B29C70/386;B29C70/202;B32B38/1808;B...,THE BOEING COMPANY,2006-03-02,2006,20060162143,COMPOSITE,MANUFACTURING PROCESSES/ DESIGN TOOLS/ TECHNIQUES,B12,B


In [22]:
df.dtypes

TITLE                    object
CPC                      object
ASIGNEE                  object
FILING_DATE      datetime64[ns]
YEAR                      int32
PATENT_NO                object
CATEGORY                 object
SUB_CATEGORY             object
TAXONOMY_SSTT            object
SECTION                  object
dtype: object

## Aggregate data per Week

In [23]:
def calculate_application_per_week(data: pd.DataFrame, date_column: str):
    """
    Count the patent applications filed in each week.

    Parameters:
    - data: DataFrame containing the patent application data
    - date_column: Name of the column containing the date information

    Returns:
    - DataFrame with one row per week and the number of applications in that week
    """
    # Convert date column to datetime format (note: this modifies `data` in place)
    data[date_column] = pd.to_datetime(data[date_column])

    # Map each date to its weekly period (Monday to Sunday)
    data['Week'] = data[date_column].dt.to_period('W')

    # Group by week and count the number of applications in each group
    applications_per_week = data.groupby('Week').size().reset_index(name='Applications')

    return applications_per_week

In [25]:
series = calculate_application_per_week(data=df, date_column='FILING_DATE')

series

Unnamed: 0,Week,Applications
0,2006-01-02/2006-01-08,3
1,2006-01-09/2006-01-15,8
2,2006-01-16/2006-01-22,5
3,2006-01-23/2006-01-29,4
4,2006-01-30/2006-02-05,6
...,...,...
918,2023-09-18/2023-09-24,1
919,2023-09-25/2023-10-01,2
920,2023-10-02/2023-10-08,1
921,2023-10-09/2023-10-15,3
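Grouping on a period column only yields weeks that contain at least one application, so quiet weeks are absent from the result. If a gap-free series is needed, for example for time-series modeling, resampling is one alternative. The sketch below uses toy dates and a hypothetical helper name, `weekly_counts_with_gaps`; it also adds a running total, which the notebook's function does not compute.

```python
import pandas as pd

def weekly_counts_with_gaps(data: pd.DataFrame, date_column: str) -> pd.DataFrame:
    """Count rows per Monday-Sunday week, keeping weeks with zero rows."""
    dates = pd.DatetimeIndex(pd.to_datetime(data[date_column]))
    # One unit per application; "W" bins end on Sunday and empty bins sum to 0
    counts = pd.Series(1, index=dates).resample("W").sum()
    out = counts.rename("Applications").rename_axis("Week_Ending").reset_index()
    # Running total of applications up to and including each week
    out["Cumulative"] = out["Applications"].cumsum()
    return out

toy = pd.DataFrame({"FILING_DATE": ["2006-01-03", "2006-01-04", "2006-01-20"]})
weekly = weekly_counts_with_gaps(toy, "FILING_DATE")

print(weekly)
```

Unlike the grouped version, each row here is labeled by the Sunday that ends the week rather than by a period, and the input frame is left unmodified.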


Write the preprocessed dataset to Excel for use in modeling and exploratory analysis

In [28]:
missing_data = df.isna()

# sum() to count missing values in each column
missing_data_count = missing_data.sum()

print(missing_data_count)

TITLE               0
CPC                 0
ASIGNEE          4439
FILING_DATE         0
YEAR                0
PATENT_NO           0
CATEGORY            0
SUB_CATEGORY        0
TAXONOMY_SSTT       0
SECTION             0
dtype: int64
