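# Data Processing

This notebook covers the following steps:
* **Data integration**: merge the two datasets into a single dataset
* **Data cleaning**: handle missing values, duplicates, and outliers
    * Missing values: see the missing-value handling strategy below
    * Outliers: see the outlier handling strategy below
    * Duplicates: drop them, keeping the first occurrence of each record
* **Dimensionality reduction**: strip unnecessary characters from fields (for example `$`)
* **Data extraction**: extract the minimum and maximum salary from the salary range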

In [1]:
import os

import sys
import re
from pathlib import Path
import shutup

os.environ['NUMPY_NO_NEP50_WARNING'] = '1'
shutup.please()

In [2]:
import pandas as pd 
import numpy as np
import ydata_profiling as yp
import spacy
from scipy import stats
import concurrent.futures
from tqdm import tqdm

tqdm.pandas()
# Add the parent directory to the module search path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
# Define the data and output directories
RAW_DATA_DIR = Path("../dataset/raw")
EXTERNAL_DATA_DIR = Path("../dataset/external")
PROCESSED_DATA_DIR = Path("../dataset/processed")
TEMPLATES_DIR = Path("../templates")
STATIC_DATA_DIR = Path("../static")

In [3]:
import json

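# Helper that reads a JSON file and returns its parsed content
def read_json(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data
    except FileNotFoundError:
        print(f"File {file_path} not found")
        return None  # make the fallback value explicit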

## Loading the datasets

### Loading the first dataset

In [4]:
df1 = pd.read_csv(RAW_DATA_DIR / "Fake Postings.csv")
print(f"Dataset shape: {df1.shape}")
df1.sample(n=1, random_state=42)

Dataset shape: (10000, 10)


Unnamed: 0,title,description,requirements,company_profile,location,salary_range,employment_type,industry,benefits,fraudulent
6252,Plant breeder/geneticist,Debate capital begin me protect. Earn $5000/we...,"Basic knowledge in throw, no degree required. ...",Terry Ltd - Established 1996.,Staciemouth,$65149-$136311,Contract,Automotive,Remote work opportunities,1


In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            10000 non-null  object
 1   description      10000 non-null  object
 2   requirements     10000 non-null  object
 3   company_profile  10000 non-null  object
 4   location         10000 non-null  object
 5   salary_range     10000 non-null  object
 6   employment_type  10000 non-null  object
 7   industry         10000 non-null  object
 8   benefits         10000 non-null  object
 9   fraudulent       10000 non-null  int64 
dtypes: int64(1), object(9)
memory usage: 781.4+ KB


| Field                 | Non-null count | Missing count | Missing rate | Dtype  | Variable type         |
|:---:                  |:---:       |:---:       |:---:      |:---:      |:---:                      |
| `title`               | 10000      | 0          | 0.0000    | object     | Text                      |
| `description`         | 10000      | 0          | 0.0000    | object     | Text                      |
| `requirements`        | 10000      | 0          | 0.0000    | object     | Text                      |
| `company_profile`     | 10000      | 0          | 0.0000    | object     | Text                      |
| `location`            | 10000      | 0          | 0.0000    | object     | Categorical               |
| `salary_range`        | 10000      | 0          | 0.0000    | object     | Discrete quantitative     |
| `employment_type`     | 10000      | 0          | 0.0000    | object     | Ordinal categorical       |
| `industry`            | 10000      | 0          | 0.0000    | object     | Categorical               |
| `benefits`            | 10000      | 0          | 0.0000    | object     | Text                      |
| `fraudulent`          | 10000      | 0          | 0.0000    | int64      | Binary                    |


In [6]:
yp.ProfileReport(df1, title="The First Dataset Analysis Report", explorative=True).to_file(TEMPLATES_DIR / "firstDataSet.html")


Since this dataset has no missing values, it can go straight to exploration and processing.

### Loading the second dataset

In [7]:
df2 = pd.read_csv(RAW_DATA_DIR / "fake_job_postings.csv")
print(f"Dataset shape: {df2.shape}")
df2.sample(1)

Dataset shape: (17880, 18)


Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
13861,13862,Sr. Interactive Producer,"US, VA, Richmond",,,"We're artists, thinkers, and doers in an open,...",The Senior Interactive Producer’s primary resp...,A passion for delivering top-notch creative so...,,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Project Management,0


In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  function             11425 non-null  object
 17  fraudulent           17880 non-null  int64 
dtypes: int64(5), object(13)

| Field                 | Non-null count | Missing count | Missing rate | Dtype  | Variable type         |
|---|---|---|---|---|---|
| `job_id`             | 17880      | 0          | 0.0000    | int64      | Discrete quantitative     |
| `title`              | 17880      | 0          | 0.0000    | object     | Text                      |
| `location`           | 17534      | 346        | 0.0194    | object     | Categorical               |
| `department`         | 6333       | 11547      | 0.6458    | object     | Categorical               |
| `salary_range`       | 2868       | 15012      | 0.8396    | object     | Discrete quantitative     |
| `company_profile`    | 14572      | 3308       | 0.1850    | object     | Text                      |
| `description`        | 17879      | 1          | 0.0001    | object     | Text                      |
| `requirements`       | 15184      | 2696       | 0.1508    | object     | Text                      |
| `benefits`           | 10668      | 7212       | 0.4034    | object     | Text                      |
| `telecommuting`      | 17880      | 0          | 0.0000    | int64      | Binary                    |
| `has_company_logo`   | 17880      | 0          | 0.0000    | int64      | Binary                    |
| `has_questions`      | 17880      | 0          | 0.0000    | int64      | Binary                    |
| `employment_type`    | 14409      | 3471       | 0.1941    | object     | Ordinal categorical       |
| `required_experience`| 10830      | 7050       | 0.3943    | object     | Ordinal categorical       |
| `required_education` | 9775       | 8105       | 0.4533    | object     | Ordinal categorical       |
| `industry`           | 12977      | 4903       | 0.2742    | object     | Categorical               |
| `function`           | 11425      | 6455       | 0.3610    | object     | Categorical               |
| `fraudulent`         | 17880      | 0          | 0.0000    | int64      | Binary                    |


In [9]:
yp.ProfileReport(df2, title="The Second Dataset Analysis Report", explorative=True).to_file(TEMPLATES_DIR / "secondDataSet.html")


## Merging the datasets

In [10]:
df = pd.concat([df1, df2], axis=0)
del df["job_id"]  # drop the job_id column
print(f"Dataset shape: {df.shape}")
df.to_feather(PROCESSED_DATA_DIR / "combined_data.feather")  # save the merged data
df.sample(n=1, random_state=42)

Dataset shape: (27880, 17)


Unnamed: 0,title,description,requirements,company_profile,location,salary_range,employment_type,industry,benefits,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function
11326,Administrative Assistant,Our customer is providing something new in Hea...,"High school diploma/GED equivalent, or equival...",MedTalent is a modern staffing company that sp...,"US, FL, Jacksonville",,Full-time,Hospital & Health Care,,0,,0.0,1.0,1.0,Associate,High School or equivalent,Administrative
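
The merge above relies on how `pd.concat` aligns columns and row labels. The following is a minimal sketch on two toy frames (hypothetical values, not the real datasets): columns missing from one frame are filled with `NaN`, which is why integer flags such as `telecommuting` show up as floats, and row labels repeat unless `ignore_index=True` is passed.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values).
a = pd.DataFrame({"title": ["x", "y"], "fraudulent": [1, 0]})
b = pd.DataFrame({"title": ["z"], "telecommuting": [1], "fraudulent": [0]})

# Default behaviour: union of the columns, original row labels are kept.
merged = pd.concat([a, b], axis=0)
print(merged.index.tolist())          # labels repeat: [0, 1, 0]
print(merged["telecommuting"].dtype)  # float64, because NaN was introduced

# ignore_index=True assigns a fresh 0..n-1 index instead.
merged_reset = pd.concat([a, b], axis=0, ignore_index=True)
print(merged_reset.index.is_unique)   # True
```

Whether to reset the index is a design choice; repeated labels matter mostly for later label-based selection with `.loc`.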


In [11]:
report = yp.ProfileReport(df, title="Not Processed Fake Job Postings Dataset Analysis Report", explorative=True)
report.to_file(TEMPLATES_DIR / "raw_data_report.html")


* Categorical variables: `location`, `industry`, `department`, `function`, `telecommuting`, `has_company_logo`, `has_questions`
* Ordinal categorical variables: `required_experience`, `required_education`, `employment_type`
* Text variables: `title`, `description`, `requirements`, `company_profile`, `benefits`
* Discrete quantitative variables: `salary_range`

## Data cleaning

### Handling duplicates

Duplicates are simply dropped, keeping the first occurrence of each record.

In [12]:
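print(f"Merged dataset shape before removing duplicates: {df.shape}")
df.drop_duplicates(keep='first', inplace=True)
print(f"Merged dataset shape after removing duplicates: {df.shape}")

Merged dataset shape before removing duplicates: (27880, 17)
Merged dataset shape after removing duplicates: (27599, 17)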
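
As a small illustration of what `keep='first'` does, here is a sketch on a toy frame (hypothetical values): when two rows are identical across all columns, the earlier one is retained and the later one is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "a"],
    "fraudulent": [0, 1, 0],
})

# Rows 0 and 2 are identical; keep="first" retains row 0 and drops row 2.
deduped = toy.drop_duplicates(keep="first")
print(deduped.index.tolist())  # [0, 1]
```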


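### Handling outliers

* Outliers mainly concern numeric variables. Common detection methods include:
  * Box plot: widely used to detect and identify outliers in data
  * Z-score method: suited to roughly symmetric data
  * Tukey's method: a common way to identify outliers based on the median and quartiles (suited to symmetric data)
  * Adjusted box plot method (designed for skewed data)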
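
The Z-score and Tukey rules listed above can be sketched as follows. This is an illustrative implementation under common default assumptions (a z-score threshold of 3 and a Tukey multiplier of 1.5); the function names and the sample values are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points whose absolute z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def tukey_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

sample = [10, 12, 11, 13, 12, 11, 10, 12, 100]
print(tukey_outliers(sample))                  # only the last point is flagged
print(zscore_outliers(sample, threshold=2.0))  # only the last point is flagged
print(zscore_outliers(sample).any())           # False with the default threshold
```

With the default threshold of 3 the z-score rule flags nothing in this sample, because the extreme value inflates the standard deviation it is measured against. Quartile-based fences are less sensitive to that effect.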

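#### Text variable processing

Apply the following to the long-text variables:
* **Noise removal**: remove punctuation, stop words (such as "the" and "is"), extra whitespace, HTML tags, and so on
* **Lemmatization**: normalize word forms, for example mapping `running` and `ran` to `run`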

In [13]:
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser", "textcat"])

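def preprocess(text):
    # Non-string values (e.g. NaN) are returned as None so they stay missing
    if isinstance(text, str):
        text = text.strip()
        text = text.lower()
        text = re.sub(r'<.*?>', '', text)          # strip HTML tags
        text = re.sub(r'[^\w\s]', '', text)        # strip punctuation
        text = re.sub(r'[^\x00-\x7F]+', '', text)  # strip non-ASCII characters
        text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
        text = re.sub(r'\d+', '', text)            # strip digits
        doc = nlp(text)
        # Lemmatize and drop stop words and punctuation tokens
        processed_text = " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])
        return processed_text
    else:
        return None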

columns = ["description", "requirements", "company_profile", "benefits"]

def parallel_apply(df, column_name):
    with tqdm(total=len(df[column_name]), desc=f"Processing {column_name}") as pbar:
        def update_progress(text):
            pbar.update(1)
            return preprocess(text)
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            result = list(executor.map(update_progress, df[column_name]))
    
    return result

for col in columns:
    df[col] = parallel_apply(df, col)

df.sample(n=1, random_state=42)

Processing description: 100%|██████████| 27599/27599 [04:59<00:00, 92.16it/s] 
Processing requirements: 100%|██████████| 27599/27599 [03:32<00:00, 130.00it/s]
Processing company_profile: 100%|██████████| 27599/27599 [03:12<00:00, 143.54it/s]
Processing benefits: 100%|██████████| 27599/27599 [02:07<00:00, 216.99it/s]


Unnamed: 0,title,description,requirements,company_profile,location,salary_range,employment_type,industry,benefits,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function
17192,Decision analytics and optimization,merl seek highly motivated qualified intern as...,,merls internship program give student excellen...,"US, MA, Cambridge",,,,,0,DA,0.0,1.0,1.0,,,


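Apply the following to all string columns (`location` is only lowercased, and `salary_range` is handled separately):
* Strip leading and trailing whitespace
* Convert to lowercase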

In [14]:
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r'<.*?>', '', text)
        text = text.strip()
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'[^\x00-\x7F]+', '', text)
        text = text.lower()
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    else:
        return text

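# salary_range and location are handled separately
columns = [col for col in df.columns if col not in ["salary_range", "location"]]

# Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map
df[columns] = df[columns].applymap(clean_text)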
df["location"] = df["location"].apply(lambda x: x.lower() if isinstance(x, str) else x)
df.sample(n=1, random_state=42)

Unnamed: 0,title,description,requirements,company_profile,location,salary_range,employment_type,industry,benefits,fraudulent,department,telecommuting,has_company_logo,has_questions,required_experience,required_education,function
17192,decision analytics and optimization,merl seek highly motivated qualified intern as...,,merls internship program give student excellen...,"us, ma, cambridge",,,,,0,da,0.0,1.0,1.0,,,


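#### Outliers in numeric variables

The numeric variable `salary_range` contains some anomalous values such as `Oct-20`. These are set to missing and handled later together with the other missing values.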

In [15]:
df['salary_range'] = df['salary_range'].str.replace('$', '', regex=False)
# Sample of the anomalous values found in salary_range
data = df[~df["salary_range"].isna()]
data[~data['salary_range'].str.match(r'^\d+-\d+$', na=False)]["salary_range"].sample(n=5, random_state=42)

9911     Oct-20
10905    10-Nov
159       9-Dec
17233    10-Nov
10788    11-Nov
Name: salary_range, dtype: object

Anomalous values of this kind are set to missing.

In [16]:
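print(f'Missing values before outlier handling: {df["salary_range"].isna().sum()}')
outlier = data[~data['salary_range'].str.match(r'^\d+-\d+$', na=False)]
# Caveat: any '40000' row is also part of `outlier`, so the .loc assignment below resets it to NaN
df.loc[df['salary_range'] == '40000', 'salary_range'] = '40000-40000'
# Caveat: .loc selects by label, and labels repeat after pd.concat without ignore_index
df.loc[outlier.index, 'salary_range'] = np.nan
print(f'Missing values after outlier handling: {df["salary_range"].isna().sum()}')

Missing values before outlier handling: 14772
Missing values after outlier handling: 14807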
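
Because `df` was built with `pd.concat` without `ignore_index=True`, row labels from the two source datasets overlap, and a label-based assignment such as `df.loc[outlier.index, ...]` touches every row that shares a label. The following is a sketch of a mask-based alternative on a toy frame (hypothetical values). It is not the procedure that produced the counts above, and applying it to the real data would likely give different counts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"salary_range": ["1000-2000", "Oct-20", "40000", "3000-4000", None]},
    index=[0, 1, 2, 1, 3],  # label 1 appears twice on purpose
)

# Step 1: turn single values such as "40000" into "40000-40000".
toy["salary_range"] = toy["salary_range"].str.replace(r"^(\d+)$", r"\1-\1", regex=True)

# Step 2: build a row-wise boolean mask of non-missing values that still do not
# look like "<digits>-<digits>", then set only those rows to NaN.
invalid = toy["salary_range"].notna() & ~toy["salary_range"].str.fullmatch(r"\d+-\d+", na=False)
toy.loc[invalid, "salary_range"] = np.nan

print(toy["salary_range"].tolist())
```

A boolean mask selects rows by position, so the valid `3000-4000` row that shares label `1` with the `Oct-20` row is left untouched, and the normalized `40000-40000` value survives because the mask is computed after normalization.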


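### Handling missing values

* **Drop the variable**: if a variable has a high missing rate (above $80\%$), low coverage, and low importance, it can be dropped outright
* **Constant imputation**: fill with a constant
* **Statistic imputation**: if the missing rate is low (below $95\%$) and the importance is low, fill according to the data distribution
  * If the data are roughly evenly distributed, fill with the variable's mean
  * If the data are skewed, fill with the median
* **Interpolation-based imputation**: includes random imputation, multiple imputation, hot-deck imputation, Lagrange interpolation, Newton interpolation, and so on
* **Model-based imputation**: predict the missing values with regression, Bayesian, random forest, decision tree, or similar models
* **Dummy-variable encoding**: if the variable is discrete and has few distinct values, convert it into dummy variables
  * For example, a `SEX` variable with the three values `male`, `female`, and `NA` can be converted into `IS_SEX_MALE`, `IS_SEX_FEMALE`, and `IS_SEX_NA`
  * If a variable has a dozen or more distinct values, group the low-frequency values into a single `other` class to reduce dimensionality. This preserves as much of the variable's information as possible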
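
A few of the strategies above can be sketched on a toy frame. The column names, values, and the `min_count` threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [20.0, 22.0, None, 21.0, 80.0],  # skewed numeric column
    "sex": ["male", "female", None, "male", "female"],
    "city": ["a", "a", "b", "c", None],
})

# Statistic imputation: the median is less affected by the extreme value 80.
toy["age_filled"] = toy["age"].fillna(toy["age"].median())

# Dummy-variable encoding: dummy_na=True adds a separate indicator column for NaN.
sex_dummies = pd.get_dummies(toy["sex"], prefix="IS_SEX", dummy_na=True)

# Rare-category grouping: values seen fewer than `min_count` times become "other".
min_count = 2
counts = toy["city"].value_counts()
rare = counts[counts < min_count].index
toy["city_grouped"] = toy["city"].where(~toy["city"].isin(rare), "other")

print(toy[["age_filled", "city_grouped"]])
print(sex_dummies.columns.tolist())
```

Note that `pd.get_dummies` names the missing-value indicator `IS_SEX_nan`; renaming it to `IS_SEX_NA` would be a separate step.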

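#### Chi-square test

* The **chi-square test** is a statistical method used to test **whether two categorical variables are significantly associated**, or whether observed values differ significantly from expected values. It is a non-parametric test that is widely used for analyzing frequency (count) data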
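
As a worked example, the statistic compares observed counts $O$ with the counts $E$ expected under independence, $\chi^2 = \sum (O - E)^2 / E$. The sketch below runs `scipy.stats.chi2_contingency` on a hypothetical 2x2 table (the counts are invented for illustration).

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table:
# rows = field missing (no / yes), columns = fraudulent (0 / 1)
observed = np.array([[90, 10],
                     [60, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3g}, dof={dof}")
print(expected)  # expected counts under independence: [[75, 25], [75, 25]]
```

For tables with one degree of freedom, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is slightly smaller than the uncorrected formula would give.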


##### Effect of missingness on fraudulent postings

In [17]:
results = []

for column in df.columns:
    if df[column].isnull().sum() > 0: 
        contingency_table = pd.crosstab(df[column].isna(), df['fraudulent'])
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
        results.append({
            'Column': column,
            'Chi2': chi2,
            'p-value': p_value,
            'Degrees of Freedom': dof
        })
pd.DataFrame(results)

| | Column | Chi2 | p-value | Degrees of Freedom |
|---|---|---|---|---|
| 0 | description | 0.04767 | 0.8271693 | 1 |
| 1 | requirements | 1387.620051 | 1.029501e-303 | 1 |
| 2 | company_profile | 731.502645 | 4.221427e-161 | 1 |
| 3 | location | 163.507137 | 1.938347e-37 | 1 |
| 4 | salary_range | 16376.809302 | 0.0 | 1 |
| 5 | employment_type | 1725.716525 | 0.0 | 1 |
| 6 | industry | 2798.309855 | 0.0 | 1 |
| 7 | benefits | 4713.773972 | 0.0 | 1 |
| 8 | department | 3938.702578 | 0.0 | 1 |
| 9 | telecommuting | 24182.277811 | 0.0 | 1 |


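##### Effect of the categorical variables themselves on fraudulent postings

* Only categorical variables are considered here; text variables and the discrete quantitative variable are considered later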

In [18]:
results = []
categories = ["location", "department", "industry", "function", "telecommuting", "has_company_logo", "has_questions", "employment_type", "required_experience", "required_education"]

for column in categories:
    if df[column].isnull().sum() > 0: 
        contingency_table = pd.crosstab(df[column], df['fraudulent'])
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
        results.append({
            'Column': column,
            'Chi2': chi2,
            'p-value': p_value,
            'Degrees of Freedom': dof
        })
pd.DataFrame(results)

| | Column | Chi2 | p-value | Degrees of Freedom |
|---|---|---|---|---|
| 0 | location | 25242.531697 | 0.0 | 3033 |
| 1 | department | 3120.5525 | 7.216729e-169 | 1209 |
| 2 | industry | 18375.21225 | 0.0 | 134 |
| 3 | function | 566.752727 | 1.250509e-96 | 36 |
| 4 | telecommuting | 21.549122 | 3.448805e-06 | 1 |
| 5 | has_company_logo | 1185.495956 | 8.658615e-260 | 1 |
| 6 | has_questions | 144.784234 | 2.394048e-33 | 1 |
| 7 | employment_type | 9507.30549 | 0.0 | 5 |
| 8 | required_experience | 99.411614 | 3.328869e-19 | 6 |
| 9 | required_education | 508.975897 | 2.722865e-101 | 12 |


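The chi-square test results show:
* The first set of results (missingness) focuses on whether a variable being missing is associated with fraudulent postings, that is, whether missingness itself carries a significant signal
  * Significant variables: `company_profile`, `salary_range`, `required_experience`, `employment_type`, `industry`, `required_education`, `requirements`, `location`, `benefits`, `department`, `telecommuting`, `has_company_logo`, `has_questions`
  * No significant relationship: `description`
* The second set of results (category distributions) looks at the association between specific category values and fraudulent postings, highlighting how much the individual categories matter
  * **All categorical variables are significant**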

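#### Missing-value handling strategy

##### Missing values in categorical variables

* The categorical variables are `location`, `industry`, `department`, `function`, `telecommuting`, `has_company_logo`, `has_questions`, `employment_type`, `required_education`, and `required_experience`
  * `location` has only 346 missing values, so those rows are simply dropped
  * For the other variables:
    * Create a new field that flags missing values
    * Fill missing values with `Missing` (the numeric flags `telecommuting`, `has_company_logo`, and `has_questions` are filled with `-1`)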

In [19]:
significant_variables = read_json(EXTERNAL_DATA_DIR/"significant_variable.json")["category"]["significant"]
significant_variables[:-1]

['required_experience',
 'employment_type',
 'industry',
 'required_education',
 'requirements',
 'department',
 'telecommuting',
 'has_company_logo',
 'has_questions']

In [20]:
# For location, simply drop the rows with missing values
df.dropna(subset=["location"], inplace=True)

In [21]:
# Flag missing values
for var in significant_variables[:-1]: 
    df[f'is_missing_{var}'] = df[var].isna().astype(int)
nums = ['telecommuting', 'has_company_logo', 'has_questions']
fields = ["function", "department", "industry", "required_experience", "required_education", "employment_type"]

for field in fields:
    df[field] = df[field].fillna("Missing")

for num in nums:
    df[num] = df[num].fillna(-1)
df.isna().sum()

title                                 0
description                           1
requirements                       2475
company_profile                    3221
location                              0
salary_range                      14486
employment_type                       0
industry                              0
benefits                           6859
fraudulent                            0
department                            0
telecommuting                         0
has_company_logo                      0
has_questions                         0
required_experience                   0
required_education                    0
function                              0
is_missing_required_experience        0
is_missing_employment_type            0
is_missing_industry                   0
is_missing_required_education         0
is_missing_requirements               0
is_missing_department                 0
is_missing_telecommuting              0
is_missing_has_company_logo           0


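##### Missing values in text variables

* The text variables are `company_profile`, `description`, `requirements`, and `benefits`. Their missing values are handled as follows:
  * `description` has only one missing value, so **that record is dropped**
  * `company_profile` has a missing rate of $18.5\%$. A binary feature `is_missing_company_profile` records whether `company_profile` is missing, which preserves the signal carried by missingness; the value itself is then filled with `Missing`
  * `requirements` has a missing rate of $15\%$ and is filled with `Missing`
  * `benefits` has a missing rate of $40\%$, which is high, and is filled with `Missing`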

In [22]:
df = df.dropna(subset=['description'])
df['is_missing_company_profile'] = df['company_profile'].isna().astype(int)
df["company_profile"] = df["company_profile"].fillna("Missing")
df['requirements'] = df['requirements'].fillna('Missing')
df['benefits'] = df['benefits'].fillna('Missing')
df.isna().sum()

title                                 0
description                           0
requirements                          0
company_profile                       0
location                              0
salary_range                      14485
employment_type                       0
industry                              0
benefits                              0
fraudulent                            0
department                            0
telecommuting                         0
has_company_logo                      0
has_questions                         0
required_experience                   0
required_education                    0
function                              0
is_missing_required_experience        0
is_missing_employment_type            0
is_missing_industry                   0
is_missing_required_education         0
is_missing_requirements               0
is_missing_department                 0
is_missing_telecommuting              0
is_missing_has_company_logo           0


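##### Missing values in the discrete quantitative variable

* The only discrete quantitative variable is `salary_range`, with a missing rate of $83\%$. Its missing values are handled in two steps:
  1. Add a binary variable `is_missing_salary_range` that records whether `salary_range` is missing; this flag directly captures the **significant relationship between missingness and fraud**
  2. Fill missing values with `0-0`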

In [23]:
# Flag missingness on df itself (not on the earlier `data` slice) before filling
df['is_missing_salary_range'] = df['salary_range'].isna().astype(int)
df["salary_range"] = df["salary_range"].fillna("0-0")
df.isna().sum()

title                             0
description                       0
requirements                      0
company_profile                   0
location                          0
salary_range                      0
employment_type                   0
industry                          0
benefits                          0
fraudulent                        0
department                        0
telecommuting                     0
has_company_logo                  0
has_questions                     0
required_experience               0
required_education                0
function                          0
is_missing_required_experience    0
is_missing_employment_type        0
is_missing_industry               0
is_missing_required_education     0
is_missing_requirements           0
is_missing_department             0
is_missing_telecommuting          0
is_missing_has_company_logo       0
is_missing_has_questions          0
is_missing_company_profile        0
is_missing_salary_range           0
dtype: int64

At this point, all missing values have been handled.

## Data extraction

In [24]:
df[['lower_salary', 'upper_salary']] = df['salary_range'].str.split("-", expand=True)
df["lower_salary"] = df["lower_salary"].astype(float)
df["upper_salary"] = df["upper_salary"].astype(float)
df["salary"] = (df["lower_salary"] + df["upper_salary"]) / 2
df = df.drop(columns=["salary_range", "lower_salary", "upper_salary"])
df.dtypes

title                              object
description                        object
requirements                       object
company_profile                    object
location                           object
employment_type                    object
industry                           object
benefits                           object
fraudulent                          int64
department                         object
telecommuting                     float64
has_company_logo                  float64
has_questions                     float64
required_experience                object
required_education                 object
function                           object
is_missing_required_experience      int64
is_missing_employment_type          int64
is_missing_industry                 int64
is_missing_required_education       int64
is_missing_requirements             int64
is_missing_department               int64
is_missing_telecommuting            int64
is_missing_has_company_logo       
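
The cell above keeps only the midpoint `salary` and drops the two bounds. If the minimum and maximum are also wanted as features, a sketch along the following lines could be used. It runs on toy values, and the column names `lower_salary` and `upper_salary` simply mirror the temporary columns in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"salary_range": ["65149-136311", "0-0", "40000-40000"]})

# Split "<lower>-<upper>" into two numeric columns.
bounds = toy["salary_range"].str.split("-", expand=True).astype(float)
toy["lower_salary"] = bounds[0]
toy["upper_salary"] = bounds[1]

# Midpoint of the range, as in the cell above.
toy["salary"] = (toy["lower_salary"] + toy["upper_salary"]) / 2
print(toy)
```

Rows whose range was filled with the `0-0` placeholder end up with a salary of `0`, so a missing-value flag is needed to tell them apart from real values.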

## Saving the data

In [25]:
df.drop_duplicates(keep="first", inplace=True)  # drop duplicates
df.to_feather(PROCESSED_DATA_DIR / "processed_data.feather")  # save the processed data

In [26]:
profile = yp.ProfileReport(df, title="Fake Job Postings Dataset Analysis Report")
profile.to_file(TEMPLATES_DIR / "report.html")
