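# Feature Engineering

This notebook covers the following:
* **Feature construction**:
    * Missing-value count: count the missing fields in each sample
    * Text feature: merge all text columns into one so that a single `TF-IDF` transform can be applied
    * Text length: record the length of the merged text
* **Feature transformation**:
    * Text feature: vectorize with `TF-IDF`
    * Numeric features: standardize with `StandardScaler`
    * Ordinal categorical variables: encode with `OrdinalEncoder`
    * Nominal categorical variables: encode with `OneHotEncoder`
* **Building the preprocessing pipeline**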

In [1]:
import joblib
import pathlib
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

processed_data_dir = pathlib.Path("../dataset/processed")
model_dir = pathlib.Path("../app/models")

In [2]:
df = pd.read_feather(processed_data_dir / "processed_data.feather")
df.sample(n=1, random_state=42)

Unnamed: 0,title,description,requirements,company_profile,location,employment_type,industry,benefits,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function,is_missing_required_experience,is_missing_employment_type,is_missing_industry,is_missing_required_education,is_missing_requirements,is_missing_department,is_missing_telecommuting,is_missing_has_company_logo,is_missing_has_questions,is_missing_company_profile,salary
11270,sales executive,city location hub city usajoin fastestgrowe go...,new business developmentprospect qualify close...,visual bi lead fast grow firm focus exclusivel...,"us, ny, new york city",fulltime,information technology and services,salary bonus commensurate experienceexcellent ...,0,Missing,1.0,1.0,1.0,midsenior level,high school or equivalent,consulting,0,0,0,0,0,1,0,0,0,0,0.0


## Feature Construction

### Missing-Value Count

Build a column that counts **how many fields are missing in each sample**, by summing the `is_missing_*` indicator columns.

In [3]:
columns = [col for col in df.columns if col.startswith('is_missing')]
df["missing"] = df[columns].sum(axis=1)
df = df.drop(columns=columns)
df.sample(n=1, random_state=42)

Unnamed: 0,title,description,requirements,company_profile,location,employment_type,industry,benefits,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function,salary,missing
11270,sales executive,city location hub city usajoin fastestgrowe go...,new business developmentprospect qualify close...,visual bi lead fast grow firm focus exclusivel...,"us, ny, new york city",fulltime,information technology and services,salary bonus commensurate experienceexcellent ...,0,Missing,1.0,1.0,1.0,midsenior level,high school or equivalent,consulting,0.0,1
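
The counting step can be checked on a tiny frame. The sketch below is only an illustration: the `toy` frame and its two indicator columns are made up, and it assumes the indicators are 0/1 integers as in the output above.

```python
import pandas as pd

# Made-up frame with two indicator columns (1 = the original value was missing).
toy = pd.DataFrame(
    {
        "title": ["engineer", "clerk", "analyst"],
        "is_missing_department": [1, 0, 1],
        "is_missing_industry": [1, 0, 0],
    }
)

indicator_columns = [col for col in toy.columns if col.startswith("is_missing")]
# The row-wise sum of the 0/1 indicators is the number of missing fields.
toy["missing"] = toy[indicator_columns].sum(axis=1)
toy = toy.drop(columns=indicator_columns)

print(toy["missing"].tolist())  # [2, 0, 1]
```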


### Merged Text Feature

The long-text columns are `title`, `description`, `requirements`, `company_profile`, and `benefits`. They are merged into a single `text` column.

In [4]:
text_columns = ['title', 'description', 'requirements', 'company_profile', 'benefits']
df["text"] = (
    df[text_columns]
    .fillna('')
    .astype(str)
    .agg(lambda x: ' '.join(x.str.strip()), axis=1)
    .str.replace(r'\s+', ' ', regex=True)
)
df = df.drop(columns=text_columns)
df.sample(n=1, random_state=42)

Unnamed: 0,location,employment_type,industry,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function,salary,missing,text
11270,"us, ny, new york city",fulltime,information technology and services,0,Missing,1.0,1.0,1.0,midsenior level,high school or equivalent,consulting,0.0,1,sales executive city location hub city usajoin...
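
To see what the merge does with missing values and stray whitespace, here is a minimal sketch on made-up rows. The `toy` frame is hypothetical, and the final `.str.strip()` is an addition that is not in the cell above: it removes the trailing space left behind when the last column is empty.

```python
import pandas as pd

# Made-up rows with missing values and irregular whitespace.
toy = pd.DataFrame(
    {
        "title": ["  data  analyst ", "driver"],
        "description": ["build   reports", None],
        "benefits": [None, " free  lunch"],
    }
)
toy_text_columns = ["title", "description", "benefits"]

merged = (
    toy[toy_text_columns]
    .fillna("")                                          # missing text becomes an empty string
    .astype(str)
    .agg(lambda row: " ".join(row.str.strip()), axis=1)  # join the columns row by row
    .str.replace(r"\s+", " ", regex=True)                # collapse runs of whitespace
    .str.strip()                                         # assumption: also trim the ends
)

print(merged.tolist())  # ['data analyst build reports', 'driver free lunch']
```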


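### Text Length

Compute the length (in characters) of the merged text, so that the relationship between text length and fraudulent postings can be examined.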

In [5]:
df["text_length"] = df["text"].apply(len)
df.sample(n=1, random_state=42)

Unnamed: 0,location,employment_type,industry,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function,salary,missing,text,text_length
11270,"us, ny, new york city",fulltime,information technology and services,0,Missing,1.0,1.0,1.0,midsenior level,high school or equivalent,consulting,0.0,1,sales executive city location hub city usajoin...,2364
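
One simple way to look at that relationship is to compare the average length per class. The sketch below uses three made-up rows, so its numbers say nothing about the real dataset; it only shows the shape of the check.

```python
import pandas as pd

# Made-up examples; the labels are illustrative only.
toy = pd.DataFrame(
    {
        "text": [
            "earn money fast",
            "senior engineer needed for backend team",
            "work from home",
        ],
        "fraudulent": [1, 0, 1],
    }
)

toy["text_length"] = toy["text"].str.len()  # character count, same result as apply(len)
length_by_label = toy.groupby("fraudulent")["text_length"].mean()

print(toy["text_length"].tolist())  # [15, 39, 14]
print(length_by_label.loc[1])       # 14.5
```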


## Feature Transformation

In [6]:
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

### Ordinal Encoding

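`Ordinal Encoding` maps categorical values to integers. It is appropriate when the categories have an intrinsic order, that is, when the ordering itself carries meaning. Note that `OrdinalEncoder` takes the category order from the sorted values unless `categories` is passed explicitly.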

In [7]:
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal_columns = ["required_education", "required_experience"]
ordinal_encoder
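
The effect of the category order can be seen in a small sketch. The level names in `experience_levels` and their ranking are assumptions made for illustration; the real values in `required_experience` may differ and should be checked before passing `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ranking, lowest to highest; adjust to the values in the real data.
experience_levels = ["entry level", "associate", "midsenior level", "director"]

toy = pd.DataFrame({"required_experience": ["director", "entry level", "midsenior level"]})

# Default: categories are taken from the data in sorted (alphabetical) order.
auto_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
# Explicit: the integer codes follow the ranking given above.
ordered_encoder = OrdinalEncoder(
    categories=[experience_levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

auto_codes = auto_encoder.fit_transform(toy).ravel()
ordered_codes = ordered_encoder.fit_transform(toy).ravel()
# A value that is not in the category list is mapped to unknown_value.
unseen = ordered_encoder.transform(pd.DataFrame({"required_experience": ["intern"]})).ravel()

print(auto_codes.tolist())     # [0.0, 1.0, 2.0]
print(ordered_codes.tolist())  # [3.0, 0.0, 2.0]
print(unseen.tolist())         # [-1.0]
```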

### OneHot Encoding

In [8]:
onthot_columns = ["location", "department", "function", "employment_type", "industry"]
onthot_encoder = OneHotEncoder(handle_unknown="ignore")
onthot_encoder
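
One-hot encoding turns each category into its own 0/1 column. With `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as an all-zero row instead of raising an error. A minimal sketch on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and new data; "contract" never appears during fitting.
train = pd.DataFrame({"employment_type": ["fulltime", "parttime", "fulltime"]})
new = pd.DataFrame({"employment_type": ["parttime", "contract"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded = encoder.transform(new).toarray()  # the default output is sparse, so densify to print

print(list(encoder.categories_[0]))  # ['fulltime', 'parttime']
print(encoded.tolist())              # [[0.0, 1.0], [0.0, 0.0]]
```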

### StandardScaler

In [9]:
num_columns = ['salary', 'text_length', "missing"]
scaler = StandardScaler()
scaler
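
`StandardScaler` standardizes each column to zero mean and unit variance, using the population standard deviation. It does not rescale values into a fixed range such as [0, 1]. A quick check on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0]])  # a single made-up numeric column
scaled = StandardScaler().fit_transform(values)

# (x - mean) / std with mean = 2 and population std = sqrt(2/3)
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```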

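### TF-IDF Vectorization

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a feature-extraction method widely used in text mining. It measures how important a term is to one document relative to the whole collection of documents (the corpus).
* **TF (Term Frequency)**: how often term $t$ occurs in document $d$, normalized by the document length
$$
TF(t, d)=\frac{\text{number of occurrences of } t \text{ in } d}{\text{total number of terms in } d}
$$
* **IDF (Inverse Document Frequency)**: how rare term $t$ is across the corpus
$$
IDF(t)=\log{\frac{\text{total number of documents}}{\text{number of documents containing } t + 1}}
$$
* **TF-IDF**: the final weight is the product $TF(t, d) \times IDF(t)$, which expresses the importance of term $t$ in document $d$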

In [10]:
vectorizer = TfidfVectorizer(
    ngram_range=(1, 3), 
    encoding="utf-8"
)
vectorizer

### Building the Preprocessing Pipeline

In [11]:
preprocessor = ColumnTransformer(
    transformers=[
        ("vectorizer", vectorizer, "text"),
        ("OrdinalEncoder", ordinal_encoder, ordinal_columns),
        ("OneHotEncoder", onthot_encoder, onthot_columns),
        ("StandScaler", scaler, num_columns),
    ],
    remainder="drop",
    verbose=True,
    n_jobs=-1,
)
preprocessor
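
The following sketch shows how such a `ColumnTransformer` behaves once it is fitted. It uses a made-up three-row frame and fresh transformer instances (`toy_preprocessor` is a hypothetical name), not the notebook's data. Note the column selectors: a bare string hands the vectorizer a 1-D column of documents, while a list hands the encoders and the scaler a 2-D frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration.
toy = pd.DataFrame(
    {
        "text": ["earn money fast", "senior engineer wanted", "earn money at home"],
        "employment_type": ["parttime", "fulltime", "parttime"],
        "text_length": [15, 22, 18],
    }
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Bare string: TfidfVectorizer receives a 1-D sequence of documents.
        ("vectorizer", TfidfVectorizer(), "text"),
        # Lists: these transformers receive 2-D input.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["employment_type"]),
        ("scaler", StandardScaler(), ["text_length"]),
    ],
    remainder="drop",
)

features = toy_preprocessor.fit_transform(toy)
n_text_features = len(toy_preprocessor.named_transformers_["vectorizer"].vocabulary_)

# The outputs are stacked side by side: 8 text terms + 2 one-hot columns + 1 scaled column.
print(features.shape)  # (3, 11)
```

The result may be a sparse matrix or a dense array depending on the overall density, so downstream code should not assume one or the other.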

## Saving the Data

### Save the `preprocessor`

In [12]:
joblib.dump(preprocessor, processed_data_dir / "preprocessor.joblib", compress=('gzip', 5))

['../dataset/processed/preprocessor.joblib']

### Save Features and Target

In [13]:
inputs = df.drop(columns=["fraudulent"])
target = df["fraudulent"]
joblib.dump(inputs, processed_data_dir / "inputs.joblib", compress=('gzip', 5))
joblib.dump(target, processed_data_dir / "target.joblib", compress=('gzip', 5))

['../dataset/processed/target.joblib']
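
A later step can read these files back with `joblib.load`, which detects the compression on its own. The sketch below does a round trip in a temporary directory with a made-up frame, so it does not touch the real files. Keep in mind that the `preprocessor` is dumped here without having been fitted in this notebook, so it still needs a `fit` on training data after loading.

```python
import pathlib
import tempfile

import joblib
import pandas as pd

toy_inputs = pd.DataFrame({"salary": [0.0, 1.0], "missing": [1, 0]})  # made-up data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "inputs.joblib"
    joblib.dump(toy_inputs, path, compress=("gzip", 5))
    restored = joblib.load(path)  # no compression argument is needed when loading

print(restored.equals(toy_inputs))  # True
```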