# Data Preprocessing


## Reading the Dataset

Create an artificial dataset.

In [1]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row represents one data sample
    f.write('2,NA,106000\n') 
    f.write('4,NA,178100\n') 
    f.write('NA,NA,140000\n')

Read the data with `pandas`.

In [2]:
import pandas as pd

data = pd.read_csv(data_file)
data

|   | NumRooms | Alley | Price  |
|---|----------|-------|--------|
| 0 | NaN      | Pave  | 127500 |
| 1 | 2.0      | NaN   | 106000 |
| 2 | 4.0      | NaN   | 178100 |
| 3 | NaN      | NaN   | 140000 |
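
The missing values arise because `pd.read_csv` treats the string `NA` as missing by default. The following minimal sketch reads the same rows from an in-memory buffer and counts the missing entries per column. The `io.StringIO` buffer and the names `csv_text`, `df`, and `missing_per_column` are assumptions for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv, held in memory instead of on disk (illustrative).
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)

# "NA" is among the strings read_csv converts to NaN by default.
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Checking these counts before imputing is a quick way to see which columns need attention.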


## Handling Missing Values

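Using positional indexing with `iloc`, we split `data` into `inputs` and `outputs`: the former holds the first two columns of `data`, and the latter holds the last column. For numeric values missing from `inputs`, we replace the "NaN" entries with the mean of the same column.

Note: `pd.DataFrame.mean()` no longer works on string columns, so use `pd.DataFrame.select_dtypes()` to select the numeric columns before taking the mean.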

In [3]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = inputs.fillna(inputs.select_dtypes(include='number').mean())
inputs

|   | NumRooms | Alley |
|---|----------|-------|
| 0 | 3.0      | Pave  |
| 1 | 2.0      | NaN   |
| 2 | 4.0      | NaN   |
| 3 | 3.0      | NaN   |
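
An equivalent way to write the imputation step, sketched here under the assumption of a reasonably recent pandas, is to pass `numeric_only=True` to `DataFrame.mean`, which restricts the mean to numeric columns. The frame `inputs_demo` is rebuilt by hand so that the block runs on its own. It and the names `col_means` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the two input columns (illustrative).
inputs_demo = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# Mean of numeric columns only; the string column "Alley" is skipped.
col_means = inputs_demo.mean(numeric_only=True)

# fillna with a Series fills each column whose name appears in the Series index,
# so only "NumRooms" is imputed and "Alley" keeps its missing values.
filled = inputs_demo.fillna(col_means)
print(filled)
```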


In [4]:
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=float)
inputs

|   | NumRooms | Alley_Pave | Alley_nan |
|---|----------|------------|-----------|
| 0 | 3.0      | 1.0        | 0.0       |
| 1 | 2.0      | 0.0        | 1.0       |
| 2 | 4.0      | 0.0        | 1.0       |
| 3 | 3.0      | 0.0        | 1.0       |
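
With `dummy_na=True`, `pd.get_dummies` treats a missing value as a category of its own, so `Alley` is expanded into one indicator column per category plus one for "missing". The sketch below builds the same two indicator columns by hand for the `Alley` column alone and compares them with the output of `pd.get_dummies`. The names `alley`, `manual`, and `encoded` are illustrative.

```python
import numpy as np
import pandas as pd

# The Alley column on its own (illustrative copy).
alley = pd.Series(["Pave", np.nan, np.nan, np.nan], name="Alley")

# Hand-written indicators: one for the "Pave" category, one for missing values.
manual = pd.DataFrame({
    "Alley_Pave": (alley == "Pave").astype(float),
    "Alley_nan": alley.isna().astype(float),
})

# The library call used in the notebook, applied to the same column.
encoded = pd.get_dummies(alley.to_frame(), dummy_na=True, dtype=float)

print(encoded)
```

Passing `dtype=float` keeps the indicators numeric, which matters for the later conversion to a tensor.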


## Converting to Tensors

In [5]:
import torch
X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
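
Because pandas stores the features as 64-bit floats, `X` comes out as `torch.float64`, while `y` remains an integer tensor. PyTorch's default floating-point type is `float32`, so a common follow-up is to cast during conversion. The sketch below is an assumption about typical downstream use, not a step from the original notebook. The arrays `features` and `prices` are hand-copied stand-ins for `inputs.values` and `outputs.values`.

```python
import numpy as np
import torch

# Hand-copied stand-ins for inputs.values and outputs.values (illustrative).
features = np.array([[3.0, 1.0, 0.0],
                     [2.0, 0.0, 1.0],
                     [4.0, 0.0, 1.0],
                     [3.0, 0.0, 1.0]])
prices = np.array([127500, 106000, 178100, 140000])

# Request float32 explicitly instead of inheriting NumPy's float64 / int64.
X32 = torch.tensor(features, dtype=torch.float32)
y32 = torch.tensor(prices, dtype=torch.float32)

print(X32.dtype, y32.dtype)
```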