<a href="https://colab.research.google.com/github/boshuaiYu/Deeplearning_pytorch/blob/main/Chapter_2.2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

2.2 Data Preprocessing

In [12]:
import torch

2.2.1 Reading the Dataset

In [13]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

For details on using the os module, see https://blog.csdn.net/m0_55697123/article/details/119464001

In [14]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000


2.2.2 Handling Missing Values

In [15]:
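# Imputation: fill each missing value with a substitute value
# Deletion: simply drop the missing values

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]  # slice end is exclusive: columns 0 and 1
# numeric_only=True restricts the mean to numeric columns; recent pandas
# versions raise an error when asked for the mean of the string column Alley
inputs = inputs.fillna(inputs.mean(numeric_only=True))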
inputs

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN


For details on using the fillna function, see https://blog.csdn.net/weixin_39549734/article/details/81221276
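The cell above uses imputation. The deletion approach named in its comments can be sketched as follows. This is a minimal illustration, assuming the same toy data rebuilt in memory; the names `rows_kept`, `na_counts`, and `cols_kept` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same toy data as house_tiny.csv, built in memory so the sketch is self-contained.
data = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'Alley': ['Pave', np.nan, np.nan, np.nan],
    'Price': [127500, 106000, 178100, 140000],
})

# Option 1: drop every row whose NumRooms value is missing.
rows_kept = data.dropna(subset=['NumRooms'])

# Option 2: drop the column that has the most missing values.
na_counts = data.isna().sum()          # missing-value count per column
cols_kept = data.drop(columns=na_counts.idxmax())

print(rows_kept)
print(cols_kept)
```

Deletion is simple but discards information: here, dropping rows halves the dataset, so imputation is often preferable when data is scarce.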

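For categorical or discrete values in inputs, we treat "NaN" as a category of its own. Because the "Alley" column takes only two category values, "Pave" and "NaN", pandas can automatically convert it into two columns, "Alley_Pave" and "Alley_nan". A row whose alley type is "Pave" gets "Alley_Pave" set to 1 and "Alley_nan" set to 0. A row with a missing alley type gets "Alley_Pave" and "Alley_nan" set to 0 and 1, respectively.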

In [16]:
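# One-hot encoding with pandas, commonly used for categorical features.
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to True/False).
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)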
inputs

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1


2.2.3 Converting to Tensor Format

In [17]:
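# to_numpy(dtype=float) yields a purely numeric array even if the indicator columns are boolean
X, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(outputs.values)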
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
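Note that X comes out as float64 and y as int64, while PyTorch's default floating-point dtype is float32. If later code expects float32, a cast along these lines is one option. This is a sketch with the tensors retyped by hand; `X32` and `y32` are illustrative names.

```python
import torch

# Values copied from the output above so the sketch runs on its own.
X = torch.tensor([[3., 1., 0.],
                  [2., 0., 1.],
                  [4., 0., 1.],
                  [3., 0., 1.]], dtype=torch.float64)
y = torch.tensor([127500, 106000, 178100, 140000])

X32 = X.to(torch.float32)                  # features as float32
y32 = y.to(torch.float32).reshape(-1, 1)   # targets as a float32 column vector

print(X32.dtype, y32.dtype, y32.shape)
```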

2.3 Tensor Computation

For help understanding higher-dimensional tensors, see https://blog.csdn.net/qq_40288627/article/details/109129487


https://yuchi.blog.csdn.net/article/details/85786389?spm=1001.2101.3001.6650.4&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-4-85786389-blog-109129487.pc_relevant_3mothn_strategy_recovery&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-4-85786389-blog-109129487.pc_relevant_3mothn_strategy_recovery&utm_relevant_index=9 

In [21]:
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = A.clone()  # allocate new memory and assign a copy of A to B
A

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.],
        [16., 17., 18., 19.]])

In [20]:
A_sum_axis0 = A.sum(axis=0)  # sum over axis 0: rows collapse, one total per column
A_sum_axis1 = A.sum(axis=1)  # sum over axis 1: columns collapse, one total per row
A_sum_axis0, A_sum_axis0.shape, A_sum_axis1, A_sum_axis1.shape


(tensor([40., 45., 50., 55.]),
 torch.Size([4]),
 tensor([ 6., 22., 38., 54., 70.]),
 torch.Size([5]))

In [22]:
# Keep the number of axes unchanged when computing the sum
sum_A = A.sum(axis=1, keepdims=True)
sum_A

tensor([[ 6.],
        [22.],
        [38.],
        [54.],
        [70.]])
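Keeping the summed axis makes the result broadcastable against the original tensor. As a small sketch of why that is useful, the row sums can divide A directly so that every row sums to 1 (the name `normalized` is chosen for this example only):

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
sum_A = A.sum(axis=1, keepdims=True)   # shape (5, 1)

# (5, 4) / (5, 1): each row of A is divided by its own row sum via broadcasting.
normalized = A / sum_A

print(normalized.sum(axis=1))
```

Without keepdims=True the sum has shape (5,), which does not broadcast against the (5, 4) tensor, so the division would raise a shape error.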

In [25]:
# Dot product
x = torch.tensor([2., 3., 4., 5.])
y = torch.ones(4, dtype=torch.float32)
x, y, torch.dot(x,y)

(tensor([2., 3., 4., 5.]), tensor([1., 1., 1., 1.]), tensor(14.))
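The dot product is the sum of the elementwise products, so the same value can be obtained without torch.dot. A minimal check (`elementwise` is an illustrative name):

```python
import torch

x = torch.tensor([2., 3., 4., 5.])
y = torch.ones(4, dtype=torch.float32)

# Multiply element by element, then add everything up.
elementwise = torch.sum(x * y)

print(elementwise, torch.dot(x, y))
```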

In [27]:
# Matrix-vector product
A.shape, x.shape, torch.mv(A, x)

(torch.Size([5, 4]), torch.Size([4]), tensor([ 26.,  82., 138., 194., 250.]))
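Each entry of the matrix-vector product is the dot product of one row of A with x. The sketch below rebuilds the result row by row to make that explicit; it is for illustration only and is slower than torch.mv (`row_dots` is an illustrative name).

```python
import torch

A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
x = torch.tensor([2., 3., 4., 5.])

# One dot product per row of A, stacked into a vector of length 5.
row_dots = torch.stack([torch.dot(row, x) for row in A])

print(row_dots)
print(torch.mv(A, x))
```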

In [28]:
# Matrix-matrix product
B = torch.ones(4, 3)
torch.mm(A, B)

tensor([[ 6.,  6.,  6.],
        [22., 22., 22.],
        [38., 38., 38.],
        [54., 54., 54.],
        [70., 70., 70.]])

In [30]:
a = torch.ones((2,3,4))
print(len(x), a.shape, a.size())

4 torch.Size([2, 3, 4]) torch.Size([2, 3, 4])


In [35]:
a = torch.arange(24).reshape(2,3,4)
print(a.sum(axis=0), a.sum(axis=1), a.sum(axis=2))

tensor([[12, 14, 16, 18],
        [20, 22, 24, 26],
        [28, 30, 32, 34]]) tensor([[12, 15, 18, 21],
        [48, 51, 54, 57]]) tensor([[ 6, 22, 38],
        [54, 70, 86]])
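Summing over an axis removes that axis from the shape. The following sketch prints the resulting shape for each axis of the same (2, 3, 4) tensor and also sums over two axes at once (`both` is an illustrative name):

```python
import torch

a = torch.arange(24).reshape(2, 3, 4)

# Summing over axis k drops dimension k from the shape.
for axis in range(a.ndim):
    print(axis, a.sum(axis=axis).shape)

# Several axes can be reduced in one call by passing a list.
both = a.sum(axis=[0, 1])
print(both)
```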
