# Data Preprocessing

This notebook performs the basic data wrangling steps that together make up the preprocessing phase of data analysis: converting data from its initial format into one better suited to analysis. The steps include handling missing values, standardizing formats so the data is consistent, and converting categorical variables into numerical variables.

In [1]:
import pandas as pd
import numpy as np

In [2]:
file_path = "../data/raw/data.xlsx"

In [3]:
df = pd.read_excel(file_path, sheet_name="integrated_db")
df.head()

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,PATENT_NO,RELEVANCY,CATEGORY,PRODUCT,SSTT_TAXONOMY
0,DEAD TIME GENERATOR AND DIGITAL SIGNAL PROCESS...,H03K5/1515;H03K3/017;H03F3/2173;H03K3/78;H03F1...,,2018-06-22,20200220527,58.472015,Artificial Intelligence,Information and Signal Processing Technologies,A09
1,APPARATUS AND METHOD FOR SPEAKER TUNING AND AU...,H04R3/12;H04R1/26;H04R1/403;H04R29/002;H04R1/026,"HARMAN INTERNATIONAL INDUSTRIES, INCORPORATED",2018-06-01,20200169821,56.17669,Artificial Intelligence,Information and Signal Processing Technologies,A09
2,Modified pi-sigma-delta-modulator based digita...,H03M3/42;H03M3/414;H03M7/3022;H03M3/352;H03M7/...,"HUAWEI TECHNOLOGIES CO., LTD.",2018-11-02,10615819,33.97642,Artificial Intelligence,Information and Signal Processing Technologies,A09
3,DIGITAL SIGNAL PROCESSING NOISE FILTER TO INCR...,G06F9/45558;H04L43/50;H04L41/0604;H04L43/20;H0...,,2018-05-25,20190363970,37.46541,Artificial Intelligence,Information and Signal Processing Technologies,A09
4,Optical Signal Processing Device,H04B10/548;H04J14/0213;G02F1/31;H04J14/0212;H0...,,2017-12-25,20190349112,39.456894,Artificial Intelligence,Information and Signal Processing Technologies,A09


## Checking Data Types

Purpose:
1. Spot mismatches between a column's content and its data type
2. Confirm the columns are compatible with the Python methods used later

In [4]:
df.dtypes

TITLE                    object
CPC                      object
ASIGNEE                  object
FILING_DATE      datetime64[ns]
PATENT_NO                object
RELEVANCY               float64
CATEGORY                 object
PRODUCT                  object
SSTT_TAXONOMY            object
dtype: object

The data types seem reasonable.
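
If a type mismatch did turn up, it could be handled along the lines of the sketch below. The toy frame and its values are hypothetical; only the column names mirror the dataset. It assumes dates should be parsed with `pd.to_datetime` and that patent numbers are identifiers rather than quantities.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
sample = pd.DataFrame({
    "FILING_DATE": ["2018-06-22", "2018-06-01", "not a date"],
    "PATENT_NO": [20200220527, 10615819, 20190363970],
})

# Parse dates explicitly; unparseable entries become NaT instead of raising.
sample["FILING_DATE"] = pd.to_datetime(sample["FILING_DATE"], errors="coerce")

# Treat patent numbers as identifiers (strings), not as numbers to do math on.
sample["PATENT_NO"] = sample["PATENT_NO"].astype(str)

print(sample.dtypes)
print("Unparseable dates:", sample["FILING_DATE"].isna().sum())
```

Counting the `NaT` values after coercion is a quick way to see how many entries failed to parse.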

## Data Cleaning

Check for duplicate rows

In [5]:
duplicate_count = df.duplicated().sum()

duplicate_count

0
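
Note that `df.duplicated()` only flags rows that are identical in every column. The `describe` output further down reports 2465 unique `PATENT_NO` values across 2476 rows, so some patent numbers do repeat. A minimal sketch of a key-based check, on hypothetical toy data, might look like this:

```python
import pandas as pd

# Hypothetical toy data: same patent number, different category.
sample = pd.DataFrame({
    "PATENT_NO": ["111", "222", "111", "333"],
    "CATEGORY": ["A", "B", "C", "D"],
})

# Full-row check: no row is identical to another in every column.
full_row_dupes = sample.duplicated().sum()

# Key-based check: keep=False marks every occurrence of a repeated key.
key_dupes = sample[sample.duplicated(subset=["PATENT_NO"], keep=False)]

print(full_row_dupes)
print(key_dupes)
```

Whether repeated patent numbers should be dropped depends on the analysis; a patent may legitimately appear under more than one category.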

Formatting `TITLE`, `CATEGORY`, and `PRODUCT` (convert to uppercase)

In [6]:
df[['TITLE','CATEGORY','PRODUCT']] = df[['TITLE','CATEGORY','PRODUCT']].apply(lambda x: x.str.upper())

df[['TITLE', 'CATEGORY','PRODUCT']]

Unnamed: 0,TITLE,CATEGORY,PRODUCT
0,DEAD TIME GENERATOR AND DIGITAL SIGNAL PROCESS...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
1,APPARATUS AND METHOD FOR SPEAKER TUNING AND AU...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
2,MODIFIED PI-SIGMA-DELTA-MODULATOR BASED DIGITA...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
3,DIGITAL SIGNAL PROCESSING NOISE FILTER TO INCR...,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
4,OPTICAL SIGNAL PROCESSING DEVICE,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES
...,...,...,...
2471,HYBRID CERAMIC MATRIX COMPOSITE TURBINE BLADES...,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...
2472,PROCESS FOR MANUFACTURING A TUBULAR COMPONENT ...,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...
2473,SELECTIVE REINFORCEMENT WITH METAL MATRIX COMP...,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...
2474,POLYMER MATRIX COMPOSITE PUSHROD,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...


Formatting Assignee (`ASIGNEE` column)

In [7]:
# Standardize company names: uppercase and trim surrounding whitespace
df['ASIGNEE'] = df['ASIGNEE'].str.upper().str.strip()

# Drop rows that became exact duplicates after standardization
df.drop_duplicates(inplace=True)

# Display standardized company names
df['ASIGNEE']

0                                                 NaN
1       HARMAN INTERNATIONAL INDUSTRIES, INCORPORATED
2                       HUAWEI TECHNOLOGIES CO., LTD.
3                                                 NaN
4                                                 NaN
                            ...                      
2471                         GENERAL ELECTRIC COMPANY
2472                                           SNECMA
2473             TOUCHSTONE RESEARCH LABORATORY, LTD.
2474                 3M INNOVATIVE PROPERTIES COMPANY
2475                                              NaN
Name: ASIGNEE, Length: 2476, dtype: object
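
Uppercasing and stripping leave internal whitespace untouched, so two spellings of the same company can still count as distinct values. The sketch below adds a whitespace-collapsing step; this step is an assumption and not part of the cell above, and the names are hypothetical variants.

```python
import pandas as pd

# Hypothetical variants of one company name, plus a missing value.
names = pd.Series([
    "  Huawei Technologies Co., Ltd. ",
    "HUAWEI  TECHNOLOGIES CO., LTD.",
    None,
])

cleaned = (
    names.str.upper()
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
)

print(cleaned.nunique())  # missing values are not counted
```

Missing values pass through the `.str` methods unchanged, so they remain `NaN` afterwards.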

In [8]:
df.head()

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,PATENT_NO,RELEVANCY,CATEGORY,PRODUCT,SSTT_TAXONOMY
0,DEAD TIME GENERATOR AND DIGITAL SIGNAL PROCESS...,H03K5/1515;H03K3/017;H03F3/2173;H03K3/78;H03F1...,,2018-06-22,20200220527,58.472015,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
1,APPARATUS AND METHOD FOR SPEAKER TUNING AND AU...,H04R3/12;H04R1/26;H04R1/403;H04R29/002;H04R1/026,"HARMAN INTERNATIONAL INDUSTRIES, INCORPORATED",2018-06-01,20200169821,56.17669,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
2,MODIFIED PI-SIGMA-DELTA-MODULATOR BASED DIGITA...,H03M3/42;H03M3/414;H03M7/3022;H03M3/352;H03M7/...,"HUAWEI TECHNOLOGIES CO., LTD.",2018-11-02,10615819,33.97642,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
3,DIGITAL SIGNAL PROCESSING NOISE FILTER TO INCR...,G06F9/45558;H04L43/50;H04L41/0604;H04L43/20;H0...,,2018-05-25,20190363970,37.46541,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09
4,OPTICAL SIGNAL PROCESSING DEVICE,H04B10/548;H04J14/0213;G02F1/31;H04J14/0212;H0...,,2017-12-25,20190349112,39.456894,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09


In [9]:
df.describe(include="all")

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,PATENT_NO,RELEVANCY,CATEGORY,PRODUCT,SSTT_TAXONOMY
count,2476,2461,1317,2476,2476.0,2476.0,2476,2476,2476
unique,2266,2400,612,,2465.0,,5,17,17
top,SYSTEM AND METHOD FOR DIGITAL SIGNAL PROCESSING,F02K9/52,"MITSUI CHEMICALS, INC.",,7784390.0,,ARTIFICIAL INTELLIGENCE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...,A01
freq,13,9,54,,2.0,,582,431,431
mean,,,,2013-02-13 00:01:44.684975616,,58.674873,,,
min,,,,2006-01-05 00:00:00,,12.977243,,,
25%,,,,2009-06-28 12:00:00,,37.27437,,,
50%,,,,2013-09-24 12:00:00,,52.171883,,,
75%,,,,2016-08-04 06:00:00,,73.41124,,,
max,,,,2018-12-28 00:00:00,,144.04594,,,


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2476 entries, 0 to 2475
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   TITLE          2476 non-null   object        
 1   CPC            2461 non-null   object        
 2   ASIGNEE        1317 non-null   object        
 3   FILING_DATE    2476 non-null   datetime64[ns]
 4   PATENT_NO      2476 non-null   object        
 5   RELEVANCY      2476 non-null   float64       
 6   CATEGORY       2476 non-null   object        
 7   PRODUCT        2476 non-null   object        
 8   SSTT_TAXONOMY  2476 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(7)
memory usage: 174.2+ KB


## Missing values

Two columns have missing values: `CPC` and `ASIGNEE`.

In [11]:
missing_data = df.isna()

# sum() to count missing values in each column
missing_data_count = missing_data.sum()

print(missing_data_count)

TITLE               0
CPC                15
ASIGNEE          1159
FILING_DATE         0
PATENT_NO           0
RELEVANCY           0
CATEGORY            0
PRODUCT             0
SSTT_TAXONOMY       0
dtype: int64


The pandas `replace()` method will be used. *Values of the Series/DataFrame are replaced with other values dynamically. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.*

> Further discussion is needed on how to handle the missing values.
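
To make the difference between value-based and location-based updates concrete, here is a small sketch on hypothetical toy data. The `"Unknown"` placeholder follows the commented-out helper in this notebook; `fillna` is shown only as one possible alternative, not as the chosen approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
sample = pd.DataFrame({
    "CPC": ["H03K5/1515", np.nan],
    "ASIGNEE": [np.nan, "SNECMA"],
})

# Value-based: every NaN, wherever it occurs, becomes "Unknown".
by_value = sample.replace(np.nan, "Unknown")

# fillna with a dict limits the fill to the named columns.
by_column = sample.fillna({"ASIGNEE": "Unknown"})

# Location-based: .loc needs an explicit row mask and column label.
by_location = sample.copy()
by_location.loc[by_location["ASIGNEE"].isna(), "ASIGNEE"] = "Unknown"

print(by_value)
print(by_column)
```

The per-column form may be worth considering here, since `CPC` has 15 gaps and `ASIGNEE` has 1159, and the two columns need not be treated the same way.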

In [12]:
# # To be moved to utils.py for reusability and maintainability

# def replace_to_unknown(df):
#     # Replace NaN values with "Unknown" in the entire DataFrame
#     return df.replace(np.nan, "Unknown")

In [13]:
# main_df = replace_to_unknown(main_df)

# main_df.head()

In [14]:
# #save the cleaned data to a new file
# data_path = "../data/processed/uav.csv"

# main_df.to_csv(data_path, index=False)

## Feature Engineering

Add a `SECTION` feature by extracting the letter prefix from the `SSTT_TAXONOMY` column

In [15]:
# Extract the first run of uppercase letters from the "SSTT_TAXONOMY" column
df['SECTION'] = df['SSTT_TAXONOMY'].str.extract(r'([A-Z]+)')

df[['SSTT_TAXONOMY','SECTION']].head()

Unnamed: 0,SSTT_TAXONOMY,SECTION
0,A09,A
1,A09,A
2,A09,A
3,A09,A
4,A09,A
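
As a sketch of how the extraction behaves on edge cases, the toy example below uses `expand=False` so that `str.extract` returns a Series rather than a one-column DataFrame. The anchored patterns, the `number` feature, and the malformed values are assumptions for illustration only.

```python
import pandas as pd

# Three taxonomy codes from the dataset plus two hypothetical bad values.
codes = pd.Series(["A09", "C01", "B14", "123", None])

# Leading letters only; values without a letter prefix yield NaN.
section = codes.str.extract(r"^([A-Z]+)", expand=False)

# Trailing digits, kept as strings so leading zeros survive.
number = codes.str.extract(r"(\d+)$", expand=False)

print(pd.DataFrame({"code": codes, "section": section, "number": number}))
```

Checking `section.isna().sum()` after the extraction shows whether any code failed to match the expected pattern.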


Unique Values

In [16]:
# Get the unique values in the "SSTT_TAXONOMY" column
unique_sstt = df['SSTT_TAXONOMY'].unique()

unique_sstt

array(['A09', 'C01', 'A08', 'C07', 'B07', 'B14', 'C02', 'B02', 'C03',
       'C05', 'C04', 'B08', 'B09', 'A04', 'B04', 'B12', 'A01'],
      dtype=object)

In [17]:
# Count unique values in the "ASIGNEE" column
unique_asignee_count = df['ASIGNEE'].nunique()

print(unique_asignee_count)

612


Drop `RELEVANCY` as it is no longer needed

In [18]:
# Drop 'RELEVANCY' column
df.drop(columns=['RELEVANCY'], inplace=True)

Adding `YEAR` feature

In [19]:
# Extract year and create 'YEAR' column
df['YEAR'] = df['FILING_DATE'].dt.year

# Move 'YEAR' column next to 'FILING_DATE' column
filing_date_index = df.columns.get_loc('FILING_DATE')
df.insert(filing_date_index + 1, 'YEAR', df.pop('YEAR'))

In [20]:
df

Unnamed: 0,TITLE,CPC,ASIGNEE,FILING_DATE,YEAR,PATENT_NO,CATEGORY,PRODUCT,SSTT_TAXONOMY,SECTION
0,DEAD TIME GENERATOR AND DIGITAL SIGNAL PROCESS...,H03K5/1515;H03K3/017;H03F3/2173;H03K3/78;H03F1...,,2018-06-22,2018,20200220527,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09,A
1,APPARATUS AND METHOD FOR SPEAKER TUNING AND AU...,H04R3/12;H04R1/26;H04R1/403;H04R29/002;H04R1/026,"HARMAN INTERNATIONAL INDUSTRIES, INCORPORATED",2018-06-01,2018,20200169821,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09,A
2,MODIFIED PI-SIGMA-DELTA-MODULATOR BASED DIGITA...,H03M3/42;H03M3/414;H03M7/3022;H03M3/352;H03M7/...,"HUAWEI TECHNOLOGIES CO., LTD.",2018-11-02,2018,10615819,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09,A
3,DIGITAL SIGNAL PROCESSING NOISE FILTER TO INCR...,G06F9/45558;H04L43/50;H04L41/0604;H04L43/20;H0...,,2018-05-25,2018,20190363970,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09,A
4,OPTICAL SIGNAL PROCESSING DEVICE,H04B10/548;H04J14/0213;G02F1/31;H04J14/0212;H0...,,2017-12-25,2017,20190349112,ARTIFICIAL INTELLIGENCE,INFORMATION AND SIGNAL PROCESSING TECHNOLOGIES,A09,A
...,...,...,...,...,...,...,...,...,...,...
2471,HYBRID CERAMIC MATRIX COMPOSITE TURBINE BLADES...,F01D5/282;F01D5/284;C04B35/62868;C04B35/565;C0...,GENERAL ELECTRIC COMPANY,2006-06-22,2006,20070072007,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...,A01,A
2472,PROCESS FOR MANUFACTURING A TUBULAR COMPONENT ...,C22C47/064;B23K20/233;C22C49/11;B23K20/021,SNECMA,2006-05-26,2006,20070045251,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...,A01,A
2473,SELECTIVE REINFORCEMENT WITH METAL MATRIX COMP...,B22D19/02;B22D19/14,"TOUCHSTONE RESEARCH LABORATORY, LTD.",2006-01-31,2006,20060254744,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...,A01,A
2474,POLYMER MATRIX COMPOSITE PUSHROD,F01L1/146,3M INNOVATIVE PROPERTIES COMPANY,2006-06-06,2006,20060225685,COMPOSITE,STRUCTURAL & SMART MATERIALS & STRUCTURAL MECH...,A01,A


In [22]:
df.dtypes

TITLE                    object
CPC                      object
ASIGNEE                  object
FILING_DATE      datetime64[ns]
YEAR                      int32
PATENT_NO                object
CATEGORY                 object
PRODUCT                  object
SSTT_TAXONOMY            object
SECTION                  object
dtype: object

Write the preprocessed dataset to Excel for use in modeling and exploratory analysis

In [21]:
df.to_excel('../data/processed/cleaned_data.xlsx', index=False, sheet_name='cleaned_data')

# One-Hot Encoding

Encoding categorical variables

> Will be implemented on the Integrated Dataset

**Problem**

- Most statistical models cannot take objects or strings as input; for training, they accept only numeric inputs

**Solution**
- Encode the values by adding a new feature for each unique value of the original feature to be encoded
- Create these dummy variables, one per unique category, with the `pandas.get_dummies()` function
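
The solution above could be sketched as follows on hypothetical toy data. Encoding `CATEGORY` is an assumption for illustration; which columns of the integrated dataset get encoded is still to be decided.

```python
import pandas as pd

# Hypothetical toy rows; the category labels appear in the dataset.
sample = pd.DataFrame({
    "PATENT_NO": ["111", "222", "333"],
    "CATEGORY": ["ARTIFICIAL INTELLIGENCE", "COMPOSITE", "ARTIFICIAL INTELLIGENCE"],
})

# One 0/1 column per unique category; the original column is dropped.
encoded = pd.get_dummies(sample, columns=["CATEGORY"], prefix="CATEGORY", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Passing `dtype=int` yields 0/1 integers instead of booleans, which some modeling libraries handle more readily. Columns with many unique values, such as `ASIGNEE` with 612, would produce a very wide frame if encoded this way.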