# Data Preprocessing

This notebook performs the basic data wrangling steps that together form the pre-processing phase of data analysis: converting data from its initial format into one better suited to analysis. The steps are handling missing values, standardizing data formats so they are consistent, and converting categorical variables into numerical variables.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
file_path = r"D:\OneDrive\Documents\data.xlsx"

In [3]:
df = pd.read_excel(file_path, sheet_name="UAV")
df.head()

Unnamed: 0,title,cpc_i,asignee,filing_date,patent_no,relevancy,product,category,sub_category,taxonomy_sstt,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,UNMANNED AERIAL VEHICLE,B64C27/08;B64C39/024;B64C27/16;B64U30/296;B64U...,,2018-12-27,20210214075,73.887054,,Aviation,,,,,,,,,
1,"LOGISTICS SYSTEM, UNMANNED AERIAL VEHICLE, AND...",B65G1/1373;B64C39/024;G05D1/104;B64D1/22;G06K7...,,2018-12-27,20220076192,64.447464,,Aviation,,,,,,,,,
2,"UNMANNED AERIAL VEHICLE CONTROL SYSTEM, UNMANN...",G08G5/045;B64F1/36;G05D1/0653;G08G5/0013;G08G5...,"Rakuten, Inc.",2018-12-25,20210209954,68.0203,,Aviation,,,,,Year,Patent Application,Patent Application Percentage(%),Cumulative Count,Cumulative Percentage(%)
3,"SYSTEMS, METHODS, AND DEVICES FOR ITEM DELIVER...",G05D1/102;G06Q20/065;G06Q20/322;H04L9/3213;G06...,"Ford Global Technologies, LLC",2018-12-21,20200202284,63.734207,,Aviation,,,,,2006,9,0.006891,9,0.006891
4,SYSTEM FOR AUTONOMOUS UNMANNED AERIAL VEHICLE ...,B64C39/024;G08G5/045;G08G5/0052;G08G5/0043;G08...,,2018-12-20,20200202729,66.28873,,Aviation,,,,,2007,12,0.009188,21,0.01608


In [4]:
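# Keep only the columns needed for the analysis; the remaining columns are
# empty or belong to a separate yearly summary table on the same sheet.
main_table = ['title','cpc_i','asignee','filing_date','patent_no','relevancy','category']
main_df = df[main_table].copy()  # copy so later edits do not touch df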

main_df

Unnamed: 0,title,cpc_i,asignee,filing_date,patent_no,relevancy,category
0,UNMANNED AERIAL VEHICLE,B64C27/08;B64C39/024;B64C27/16;B64U30/296;B64U...,,2018-12-27,20210214075,73.887054,Aviation
1,"LOGISTICS SYSTEM, UNMANNED AERIAL VEHICLE, AND...",B65G1/1373;B64C39/024;G05D1/104;B64D1/22;G06K7...,,2018-12-27,20220076192,64.447464,Aviation
2,"UNMANNED AERIAL VEHICLE CONTROL SYSTEM, UNMANN...",G08G5/045;B64F1/36;G05D1/0653;G08G5/0013;G08G5...,"Rakuten, Inc.",2018-12-25,20210209954,68.020300,Aviation
3,"SYSTEMS, METHODS, AND DEVICES FOR ITEM DELIVER...",G05D1/102;G06Q20/065;G06Q20/322;H04L9/3213;G06...,"Ford Global Technologies, LLC",2018-12-21,20200202284,63.734207,Aviation
4,SYSTEM FOR AUTONOMOUS UNMANNED AERIAL VEHICLE ...,B64C39/024;G08G5/045;G08G5/0052;G08G5/0043;G08...,,2018-12-20,20200202729,66.288730,Aviation
...,...,...,...,...,...,...,...
1301,Unmanned aerial vehicle,,"Aerovision Vehiculos Aeros, S.L.",2006-06-12,D573939,72.343630,Aviation
1302,Self-Contained Avionics Sensing And Flight Con...,G05D1/101,United States of America as represented by the...,2006-06-08,20070069083,62.483673,Aviation
1303,Relative navigation for aerial refueling of an...,G01S19/40;B64U80/25;G01S19/42;G05D1/104;B64D39...,Honeywell International Inc.,2006-05-15,20090326736,68.210840,Aviation
1304,METHOD AND SYSTEM FOR AUTONOMOUS TRACKING OF A...,G01S3/7864;G01S13/723;G05D1/0094,Honeywell International Inc.,2006-04-25,20070250260,64.014730,Aviation


## Checking Data Types

Purpose:
1. Spot mismatches between what a column contains and the type it is stored as
2. Confirm the types are compatible with the Python and pandas methods used later

In [5]:
main_df.dtypes

title                  object
cpc_i                  object
asignee                object
filing_date    datetime64[ns]
patent_no              object
relevancy             float64
category               object
dtype: object

The data types seem reasonable.
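One detail worth a closer look is that `patent_no` is stored as `object` rather than as a number. A likely reason, visible in the rows above, is that some patent numbers carry a letter prefix (for example `D573939`). The sketch below uses a small made-up sample, not the real sheet, to show one way of locating such entries and of giving the column a single consistent type. The variable names `sample`, `as_number` and `non_numeric` are illustrative only.

```python
import pandas as pd

# Hypothetical sample mimicking `patent_no`: mostly numeric IDs plus one
# design-patent number with a letter prefix.
sample = pd.DataFrame({"patent_no": [20210214075, 20220076192, "D573939"]})

# A column that mixes integers and strings is stored as `object`.
print(sample["patent_no"].dtype)

# Try to parse every entry as a number; entries that fail become NaN.
as_number = pd.to_numeric(sample["patent_no"], errors="coerce")

# The rows that failed to parse are the non-numeric identifiers.
non_numeric = sample.loc[as_number.isna(), "patent_no"]
print(non_numeric.tolist())

# Identifiers are labels, not quantities, so casting them all to strings
# gives the column one consistent type.
sample["patent_no"] = sample["patent_no"].astype(str)
```

Whether to keep patent numbers as strings is a design choice; it avoids accidental arithmetic on identifiers and preserves prefixes such as `D`.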

In [6]:
main_df.describe(include="all")

Unnamed: 0,title,cpc_i,asignee,filing_date,patent_no,relevancy,category
count,1306,1238,541,1306,1306.0,1306.0,1306
unique,1165,1223,267,,1301.0,,1
top,Unmanned aerial vehicle,B64C39/024,"Amazon Technologies, Inc.",,20140200000.0,,Aviation
freq,50,7,52,,2.0,,1306
mean,,,,2016-05-13 23:54:29.218989312,,66.406219,
min,,,,2006-02-10 00:00:00,,50.843895,
25%,,,,2015-09-09 06:00:00,,64.164612,
50%,,,,2016-11-19 12:00:00,,66.74526,
75%,,,,2017-12-06 12:00:00,,68.755875,
max,,,,2018-12-27 00:00:00,,74.89759,


In [7]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1306 entries, 0 to 1305
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   title        1306 non-null   object        
 1   cpc_i        1238 non-null   object        
 2   asignee      541 non-null    object        
 3   filing_date  1306 non-null   datetime64[ns]
 4   patent_no    1306 non-null   object        
 5   relevancy    1306 non-null   float64       
 6   category     1306 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 71.6+ KB


## Handle missing values

Two columns have missing values: `cpc_i` and `asignee`.

In [8]:
missing_data = main_df.isna()

# sum() to count missing values in each column
missing_data_count = missing_data.sum()

print(missing_data_count)

title            0
cpc_i           68
asignee        765
filing_date      0
patent_no        0
relevancy        0
category         0
dtype: int64


In [9]:
for column in missing_data.columns:
    print(missing_data[column].value_counts())
    print()

title
False    1306
Name: count, dtype: int64

cpc_i
False    1238
True       68
Name: count, dtype: int64

asignee
True     765
False    541
Name: count, dtype: int64

filing_date
False    1306
Name: count, dtype: int64

patent_no
False    1306
Name: count, dtype: int64

relevancy
False    1306
Name: count, dtype: int64

category
False    1306
Name: count, dtype: int64
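Raw counts are easier to judge next to the share of rows they represent: in the output above, `asignee` is missing in 765 of 1306 rows (about 59%), while `cpc_i` is missing in 68 (about 5%). As a minimal sketch on a made-up frame (the `toy` data and the `summary` name are assumptions, not part of the dataset), count and share can be reported together:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "cpc_i": ["B64C39/024", np.nan, "G05D1/101", "G05D1/104"],
    "asignee": [np.nan, np.nan, "Rakuten, Inc.", np.nan],
})

# isna() gives a boolean frame; sum() counts the True values per column and
# mean() gives the fraction of True values per column.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})
print(summary)
```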



The pandas `replace()` method will be used. *Values of the Series/DataFrame are replaced with other values dynamically. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.*
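To make the distinction concrete, here is a small sketch on a made-up column (the `toy` frame and variable names are illustrative assumptions). `replace()` matches by value, `.loc` needs an explicit mask saying where to write, and `fillna()` is the pandas method dedicated to missing values. All three should give the same result here.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing assignees.
toy = pd.DataFrame({"asignee": ["Rakuten, Inc.", np.nan, np.nan]})

# replace(): match by value, wherever it occurs in the frame.
by_value = toy.replace(np.nan, "Unknown")

# .loc: the location must be given explicitly, here as a boolean mask.
by_location = toy.copy()
by_location.loc[by_location["asignee"].isna(), "asignee"] = "Unknown"

# fillna(): purpose-built for missing values.
by_fillna = toy.fillna("Unknown")

print(by_value["asignee"].tolist())
```

A per-column dictionary, such as `toy.fillna({"asignee": "Unknown"})`, is one option if different columns should eventually receive different placeholders.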

> Further discussion is needed on how to handle the missing values.

In [10]:
# Will be moved to utils.py for reusability and maintainability

def replace_to_unknown(df):
    """Return a new DataFrame with every NaN value replaced by "Unknown"."""
    return df.replace(np.nan, "Unknown")

In [11]:
main_df = replace_to_unknown(main_df)

main_df.head()

Unnamed: 0,title,cpc_i,asignee,filing_date,patent_no,relevancy,category
0,UNMANNED AERIAL VEHICLE,B64C27/08;B64C39/024;B64C27/16;B64U30/296;B64U...,Unknown,2018-12-27,20210214075,73.887054,Aviation
1,"LOGISTICS SYSTEM, UNMANNED AERIAL VEHICLE, AND...",B65G1/1373;B64C39/024;G05D1/104;B64D1/22;G06K7...,Unknown,2018-12-27,20220076192,64.447464,Aviation
2,"UNMANNED AERIAL VEHICLE CONTROL SYSTEM, UNMANN...",G08G5/045;B64F1/36;G05D1/0653;G08G5/0013;G08G5...,"Rakuten, Inc.",2018-12-25,20210209954,68.0203,Aviation
3,"SYSTEMS, METHODS, AND DEVICES FOR ITEM DELIVER...",G05D1/102;G06Q20/065;G06Q20/322;H04L9/3213;G06...,"Ford Global Technologies, LLC",2018-12-21,20200202284,63.734207,Aviation
4,SYSTEM FOR AUTONOMOUS UNMANNED AERIAL VEHICLE ...,B64C39/024;G08G5/045;G08G5/0052;G08G5/0043;G08...,Unknown,2018-12-20,20200202729,66.28873,Aviation


In [12]:
# Save the cleaned data to a new file
data_path = "../data/processed/uav.csv"

main_df.to_csv(data_path, index=False)
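CSV stores no type information, so the types checked earlier are not guaranteed to survive a reload. The sketch below round-trips a made-up two-row frame through an in-memory buffer instead of the real file; the `cleaned` and `restored` names are assumptions. It shows one way to ask `read_csv` to restore the date column and to keep patent numbers as strings.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with a datetime column and string identifiers.
cleaned = pd.DataFrame({
    "filing_date": pd.to_datetime(["2018-12-27", "2006-06-12"]),
    "patent_no": ["20210214075", "D573939"],
})

# Write to an in-memory buffer, standing in for the CSV file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column; dtype keeps identifiers as text
# so numeric-looking patent numbers are not converted to integers.
restored = pd.read_csv(
    buffer,
    parse_dates=["filing_date"],
    dtype={"patent_no": str},
)
print(restored.dtypes)
```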

# One-Hot Encoding

Encoding categorical variables as numerical features.

> Will be implemented on the Integrated Dataset

**Problem**

- Most statistical models cannot take objects or strings as input; model training accepts only numbers

**Solution**
- Encode the values by adding a new feature for each unique value of the original feature to be encoded
- Add these dummy variables, one per unique category, using the `pandas.get_dummies()` function
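
The solution above can be sketched on a small made-up frame. In the current UAV sheet `category` has only one value (`Aviation`), so encoding it alone would add no information; the second category, `Maritime`, is a hypothetical stand-in for what the integrated dataset might contain.

```python
import pandas as pd

# Hypothetical rows; "Maritime" is invented for illustration only.
toy = pd.DataFrame({
    "patent_no": ["P1", "P2", "P3"],
    "category": ["Aviation", "Maritime", "Aviation"],
})

# One new 0/1 column per unique category; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["category"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` drops one dummy column per feature, which avoids perfectly collinear inputs.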