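The Iris dataset, also known as the iris flower dataset, is a widely used benchmark for classification experiments and a classic example of a multivariate dataset; it was introduced by Fisher in 1936. It contains 150 samples divided into 3 classes of 50 samples each, and every sample has 4 attributes. The task is to predict which of three species (Setosa, Versicolour, Virginica) an iris belongs to from those four attributes: sepal length, sepal width, petal length, and petal width.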

Next, we use this real dataset to walk through the general workflow of deep learning.

Download the dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

* Split the dataset into a training set and a test set:
shuf iris.data | split -l $(( 151 * 80 / 100 ))
* This command produces two files, xaa and xab. Rename them:
* mv xaa train.csv && mv xab test.csv
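The same shuffle-and-split idea can be sketched in Python with only the standard library. This is a minimal sketch rather than the procedure used above: the helper name `split_lines`, the seed, and the toy input rows are assumptions made for illustration, and blank lines are dropped before splitting.

```python
import random

def split_lines(lines, train_percent=80, seed=0):
    """Shuffle non-blank lines and split them into (train, test) lists.

    Hypothetical helper for illustration; not part of the original workflow.
    """
    rows = [line for line in lines if line.strip()]  # drop blank lines
    rng = random.Random(seed)                        # local RNG for reproducibility
    rng.shuffle(rows)
    cut = len(rows) * train_percent // 100           # integer math, like the shell version
    return rows[:cut], rows[cut:]

# Toy stand-in for 150 data rows plus one trailing blank line (made-up content)
toy_lines = [f"row-{i}" for i in range(150)] + [""]
train_rows, test_rows = split_lines(toy_lines)
print(len(train_rows), len(test_rows))
```

Filtering blank lines first keeps an empty row from landing in either split.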


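Some notes:
1. Why split the data into a training set and a test set?
2. Normalizing the data features
3. Data loading and training on random mini-batches
4. Training convergence
5. Underfitting and overfitting
6. Using grid search to tune the DNN's hyperparameters: learning rate, network depth, hidden layer size, etc.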
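Grid search, the last item above, amounts to a loop over every combination of candidate values. The following is only an outline under stated assumptions: the candidate values are made up, and `score_config` is a hypothetical placeholder standing in for "train a model with this configuration and return its validation accuracy". It does not train anything.

```python
from itertools import product

# Candidate values for each hyperparameter (illustrative assumptions)
param_grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "num_hidden_layers": [1, 2],
    "hidden_size": [16, 32],
}

def score_config(config):
    """Hypothetical placeholder: a real version would train a model with
    `config` and return its accuracy on held-out validation data."""
    # Dummy deterministic score so the sketch runs without any training
    return 1.0 - abs(config["learning_rate"] - 0.01) - abs(config["hidden_size"] - 32) / 1000

def grid_search(grid, score_fn):
    """Evaluate every combination in `grid` and return (best_config, best_score)."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score > best_score:  # strict comparison keeps the first of any ties
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = grid_search(param_grid, score_config)
print(best_config, best_score)
```

In practice the score would usually come from a validation split rather than the test set, so that the test set still gives an unbiased estimate after tuning.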

In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

torch.manual_seed(0)


# 1. Define the dataset
class IrisDataset(Dataset):
    def __init__(self, data_type="train"):
        assert data_type in ('train', 'test')
        # Map the species names in the CSV to integer class indices
        self.labels = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
        # The files have no header row: columns 0-3 are features, column 4 is the label
        self.pd_frame = pd.read_csv("./dataset/iris/%s.csv" % (data_type), header=None)

    def __len__(self):
        return len(self.pd_frame)

    def __getitem__(self, idx):
        label = self.pd_frame.iloc[idx, 4]
        X = self.pd_frame.iloc[idx, 0:4]
        return X.to_numpy(np.float32), self.labels[label]

# Single-sample test
train = IrisDataset("train")
test = IrisDataset("test")
print(len(train),len(test))
print(train[0])

# Batch loading test
train_dataloader = DataLoader(train, batch_size=2, shuffle=True)
train_features, train_labels = next(iter(train_dataloader))
print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

features = train_features[0]
print("features0:",features)
features = features.squeeze()
print("features1:",features)
label = train_labels[0]
print("label:",label)


test_dataloader = DataLoader(test, batch_size=2, shuffle=True)


36 30
(array([7.6, 3. , 6.6, 2.1], dtype=float32), 2)
Feature batch shape: torch.Size([2, 4])
Labels batch shape: torch.Size([2])
features0: tensor([5.4000, 3.9000, 1.7000, 0.4000])
features1: tensor([5.4000, 3.9000, 1.7000, 0.4000])
label: tensor(0)
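The features printed above are raw measurements, and the dataset class does not apply the normalization mentioned in the notes. One common option is z-score standardization using statistics computed on the training set only. The sketch below assumes that choice (the original does not say which scheme it intends); the function names are illustrative, and the small arrays reuse a few sample rows that appear in this notebook's printed outputs.

```python
import numpy as np

def fit_standardizer(train_x):
    """Return the per-feature mean and standard deviation of the training data."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def standardize(x, mean, std):
    """Apply z-score standardization with previously fitted statistics."""
    return (x - mean) / std

# Toy feature rows (illustrative; far fewer than a real training set)
train_x = np.array([[7.6, 3.0, 6.6, 2.1],
                    [5.4, 3.9, 1.7, 0.4],
                    [5.5, 3.5, 1.3, 0.2]], dtype=np.float32)
test_x = np.array([[5.2, 4.1, 1.5, 0.1]], dtype=np.float32)

mean, std = fit_standardizer(train_x)
train_z = standardize(train_x, mean, std)
test_z = standardize(test_x, mean, std)  # reuse the training statistics on test data
print(train_z.mean(axis=0), train_z.std(axis=0))
```

Fitting the statistics on the training rows alone keeps information about the test set out of the preprocessing step.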


In [2]:
# 2. Define the model
class IrisModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(IrisModel, self).__init__()
        self.fc_0 = nn.Linear(input_dim, 25) 
        self.fc_1 = nn.Linear(25, 30) 
        self.fc_2 = nn.Linear(30, output_dim) 

    def forward(self, x):
        out = F.relu(self.fc_0(x))
        out = F.relu(self.fc_1(out))
        out = self.fc_2(out)
#         print("out:",out.shape,out)
#         out = F.softmax(out, dim=1)
        return out

# Initialize the model
model = IrisModel(4, 3)  # 4 input features; iris has 3 classes

# Single-sample test
# x = torch.tensor(train[0][0]).view(1,-1)
# print(x,x.shape)
# output = model(x)
# print("Single-sample test:", output)

# Batch test
train_features, train_labels = next(iter(train_dataloader))
print(train_features.shape)
print(train_features)
pred = model(train_features)
print("Batch test:",pred)
print(pred.shape,train_labels.shape)


torch.Size([2, 4])
tensor([[5.5000, 3.5000, 1.3000, 0.2000],
        [5.2000, 4.1000, 1.5000, 0.1000]])
Batch test: tensor([[-0.2501, -0.7116, -0.0021],
        [-0.2437, -0.7377,  0.0563]], grad_fn=<AddmmBackward>)
torch.Size([2, 3]) torch.Size([2])


In [3]:
# 3. Define the loss function, optimizer, etc.
learning_rate = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # optimization algorithm

criterion = nn.CrossEntropyLoss()  

loss = criterion(pred,train_labels)
print(loss)

# Note: the target must be a 1D tensor of class indices; otherwise this call fails with the error
# "cross_entropy 1D target tensor expected, multi-target not supported"

tensor(1.0770, grad_fn=<NllLossBackward>)
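To see what `nn.CrossEntropyLoss` computes, the printed loss can be reproduced by hand: take the log-softmax of each row of logits, pick out the entry for the true class, negate it, and average over the batch. The sketch below uses NumPy and the logits printed for the batch above. The labels are not printed in the original, so `[0, 0]` is an assumption (both feature rows look like Iris-setosa); with it, the result agrees with the printed 1.0770 to about three decimal places.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy of integer class targets given raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]  # log-probability of the true class
    return -picked.mean()

# Logits copied from the printed model output for the batch
logits = np.array([[-0.2501, -0.7116, -0.0021],
                   [-0.2437, -0.7377,  0.0563]])
targets = np.array([0, 0])  # assumed labels (not shown in the original output)

loss = cross_entropy(logits, targets)
print(round(float(loss), 4))
```

This is also why `forward` returns raw logits: the loss applies log-softmax internally, so adding `F.softmax` in the model would apply the normalization twice.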


In [4]:
# 4. Start training
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()  # training mode
    for batch, (X, y) in enumerate(dataloader):
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 5 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

    # Loss of the final batch; `current` still holds the last logged sample count
    print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

train(train_dataloader, model, criterion, optimizer)

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()  # evaluation mode
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
#             X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    
epochs = 30
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, criterion, optimizer)
    test(test_dataloader, model, criterion)
print("Done!")

loss: 1.675363  [    0/   36]
loss: 0.899168  [   10/   36]
loss: 0.829396  [   20/   36]
loss: 1.261095  [   30/   36]
loss: 1.045605  [   30/   36]
Epoch 1
-------------------------------
loss: 0.928218  [    0/   36]
loss: 1.050659  [   10/   36]
loss: 0.899680  [   20/   36]
loss: 1.273160  [   30/   36]
loss: 1.197524  [   30/   36]
Test Error: 
 Accuracy: 30.0%, Avg loss: 1.183782 

Epoch 2
-------------------------------
loss: 1.279933  [    0/   36]
loss: 0.786806  [   10/   36]
loss: 1.274202  [   20/   36]
loss: 0.895924  [   30/   36]
loss: 1.314132  [   30/   36]
Test Error: 
 Accuracy: 50.0%, Avg loss: 1.157805 

Epoch 3
-------------------------------
loss: 0.961736  [    0/   36]
loss: 0.825188  [   10/   36]
loss: 1.168127  [   20/   36]
loss: 1.205422  [   30/   36]
loss: 1.253262  [   30/   36]
Test Error: 
 Accuracy: 60.0%, Avg loss: 1.133702 

Epoch 4
-------------------------------
loss: 1.275472  [    0/   36]
loss: 0.915248  [   10/   36]
loss: 1.147025  [   20/ 