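# Dataset Introduction

## About the Dataset
### 🦠 Breast Cancer Dataset
This dataset describes patients examined for breast cancer. Each record contains a unique patient ID, the diagnosis (cancer type), and the visual characteristics of the tumor, including the mean values of those characteristics.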

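### 📚 Main features of the dataset
1. id: the unique ID of each patient.
2. diagnosis: the type of cancer. This attribute takes the value "M" (malignant) or "B" (benign).
3. radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean: mean values of the tumor's visual characteristics.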

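There are also several categorical features, in which the patients in the dataset are labeled with numeric values. You can inspect them in the "Charts" area.

The remaining features hold the mean values of the tumor image characteristics, each spanning a specific range:

* radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean

Each of these features maps to a table that counts the values falling within each range. **You can inspect the charts.**

**Each sample contains the patient's unique ID, the cancer diagnosis, and the mean values of the cancer's visual characteristics.**

A dataset like this can be used to train or test models and algorithms for cancer diagnosis. Understanding and analyzing it can improve our grasp of the visual characteristics associated with cancer and, in turn, of diagnosis.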

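### ✨ Example projects that can be built with this dataset
**Logistic regression**: This algorithm works well for binary classification problems. It may be a suitable choice here because the dataset has two classes, "malignant" (M) and "benign" (B). It can be used to predict the cancer type from the visual features in the dataset.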

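**K-nearest neighbors (KNN)**: KNN classifies a sample by looking at the k samples closest to it. The algorithm assumes that patients with similar features tend to have similar types of cancer. By exploiting these neighborhood relationships in the dataset, KNN can be used for cancer diagnosis.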
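
The voting idea behind KNN can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `knn_predict` and the two-feature sample points are hypothetical and chosen only to make the majority vote easy to follow.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples, for illustration only
train_X = np.array(
    [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
)
train_y = np.array(["B", "B", "B", "M", "M", "M"])

prediction = knn_predict(train_X, train_y, np.array([2.8, 3.1]), k=3)
print(prediction)
```

The query point lies next to the three samples labeled "M", so all three votes go to that class.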

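**Support vector machine (SVM)**: SVMs are very effective for classification tasks, especially two-class problems. Because an SVM focuses on separating the classes in the dataset as cleanly as possible, it is a powerful algorithm for cancer diagnosis.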
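
One way to compare the three candidate models is 5-fold cross-validation with the same preprocessing for each. The sketch below assumes that scikit-learn's bundled `load_breast_cancer` data is an acceptable stand-in for `Cancer_Data.csv`. The hyperparameters (`max_iter=1000`, `n_neighbors=5`, an RBF kernel) are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for Cancer_Data.csv
X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf"),
}

mean_scores = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```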

### Load the dataset

In [5]:
# Import the required modules
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# 1. Load the dataset (raw string so the backslashes in the Windows path are not treated as escapes)
data = pd.read_csv(r"D:\ALL_code\al_study_file\al_study\MachineLearning\Cancer_Data.csv")

In [6]:
data.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


In [7]:
data.shape

(569, 33)

In [8]:
data.describe()

| | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 0.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | NaN |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | NaN |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | NaN |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | NaN |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | NaN |
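
The `describe()` output reports a count of 0 for `Unnamed: 32`, so that column holds no values at all. An empty column like this typically appears when every line of a CSV file ends with a trailing comma. The sketch below reproduces the effect on a hypothetical three-row CSV (the values are made up) and removes the column with `dropna(axis=1, how="all")`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the trailing comma on each line creates an empty column
csv_text = (
    "id,diagnosis,radius_mean,\n"
    "1,M,17.99,\n"
    "2,B,11.42,\n"
    "3,B,12.05,\n"
)
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())

# Drop every column whose values are all missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping the column by content avoids hard-coding a name such as `Unnamed: 32`.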


### 2. Basic data processing

In [47]:
# 2. Basic data processing
# 2.4 Select the feature values and the target value
headers = data.columns.tolist()
headers = headers[2:-1]
print(headers)

['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
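
The slice `headers[2:-1]` relies on the column order: it skips `id` and `diagnosis` at the front and the empty `Unnamed: 32` column at the end. A sketch of an alternative that selects features by name is shown below. The toy frame and the `non_feature_columns` list are hypothetical and only mirror the layout of the real file.

```python
import pandas as pd

# Hypothetical miniature frame with the same column layout as the real data
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "diagnosis": ["M", "B"],
        "radius_mean": [17.99, 11.42],
        "texture_mean": [10.38, 20.38],
        "Unnamed: 32": [float("nan"), float("nan")],
    }
)

# Exclude columns by name instead of by position
non_feature_columns = ["id", "diagnosis", "Unnamed: 32"]
feature_names = [c for c in toy.columns if c not in non_feature_columns]
print(feature_names)

# The positional slice gives the same result for this layout
positional = toy.columns.tolist()[2:-1]
```

Selecting by name keeps working if columns are later reordered or an extra column is added.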


In [48]:
x = data[headers]  # The features form a two-dimensional array (a DataFrame)
y = data["diagnosis"]  # The target is a one-dimensional array (a Series)

In [50]:
x.head()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [51]:
y.head()

0    M
1    M
2    M
3    M
4    M
Name: diagnosis, dtype: object
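
The target is stored as the strings "M" and "B". Some workflows need numeric labels, and it is often useful to check the class balance before training. The sketch below uses a hypothetical five-element Series. The mapping of "M" to 1 and "B" to 0 is an illustrative convention and is not required by `KNeighborsClassifier`, which accepts string labels directly.

```python
import pandas as pd

# Hypothetical labels, for illustration only
toy_y = pd.Series(["M", "M", "B", "B", "B"], name="diagnosis")

# Map the string labels to integers (1 = malignant, 0 = benign)
encoded = toy_y.map({"B": 0, "M": 1})
print(encoded.tolist())

# Share of each class in the data
class_share = toy_y.value_counts(normalize=True)
print(class_share)
```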

In [52]:
# 2.5 Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.2)
x_train.shape
x_train.head()
type(x_train)

pandas.core.frame.DataFrame
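
`train_test_split` shuffles the rows at random, so the proportion of "M" and "B" samples can differ between the training and test sets. Passing `stratify` keeps the class ratio the same in both. The sketch below shows this on hypothetical data with 60 benign and 40 malignant labels. The notebook's own call does not use `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 benign and 40 malignant samples
toy_y = pd.Series(["B"] * 60 + ["M"] * 40, name="diagnosis")
toy_x = pd.DataFrame({"feature": range(100)})

# stratify=toy_y preserves the 60/40 class ratio in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=2, stratify=toy_y
)
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```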

In [53]:
x_test.shape

(114, 30)

In [54]:
y_train.shape

(455,)

In [55]:
y_test.shape

(114,)

In [56]:
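# 3. Feature engineering -- feature preprocessing (standardization)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set;
# refitting on the test set would scale it with its own statistics
x_test = transfer.transform(x_test)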
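
Standardization subtracts each feature's mean and divides by its standard deviation. The usual practice is to learn both statistics from the training split only and then apply them unchanged to the test split. The sketch below shows this on made-up numbers; the arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test samples with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)   # per-feature mean of the training data
print(scaler.scale_)  # per-feature standard deviation of the training data
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is expressed in the same units, so a value far from the training mean stays far from 0.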

In [57]:
# 4. Machine learning -- KNN + cross-validation
# 4.1 Instantiate an estimator
estimator = KNeighborsClassifier()

# 4.2 Cross-validation, implemented with a grid search
param_grid = {"n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}
estimator = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=10)

# 4.3 Train the model
estimator.fit(x_train, y_train)
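
When the scaler is fitted before the grid search, each cross-validation fold is validated on data that already influenced the scaling. A common alternative is to wrap the scaler and the classifier in a `Pipeline` so the scaler is refit on every training fold. The sketch below assumes that scikit-learn's bundled `load_breast_cancer` data is an acceptable stand-in for `Cancer_Data.csv`. The step names `scaler` and `knn`, the smaller grid, and `cv=5` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for Cancer_Data.csv
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Scaling and classification are tuned and validated as one unit
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, param_grid=grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```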

In [58]:
# 5. Model evaluation
# 5.1 Print the accuracy
score_ret = estimator.score(x_test, y_test)
print("Accuracy:\n", score_ret)


Accuracy:
 0.9736842105263158
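
Accuracy alone does not show which class the errors fall in, and for a diagnosis task a missed malignant case is usually more costly than a false alarm. A confusion matrix and the recall for the "M" class make this visible. The sketch below uses hypothetical labels, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels, for illustration only
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_hat = ["M", "M", "B", "B", "B", "B", "B", "M"]

# Rows are true classes, columns are predicted classes, both ordered as [M, B]
cm = confusion_matrix(y_true, y_hat, labels=["M", "B"])
print(cm)

# Share of truly malignant samples that were predicted as malignant
malignant_recall = recall_score(y_true, y_hat, pos_label="M")
print(malignant_recall)
```

Here one of the three malignant samples is predicted as benign, so the recall for "M" is 2/3 even though 6 of the 8 predictions are correct.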


In [59]:
# 5.2 Predictions
y_pred = estimator.predict(x_test)
print("Predictions:\n", y_pred)

Predictions:
 ['B' 'B' 'B' 'M' 'B' 'M' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'M' 'B' 'B'
 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'B'
 'B' 'B' 'B' 'B' 'M' 'M' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'B' 'M' 'B' 'B'
 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'M' 'B' 'M' 'M' 'B' 'M' 'M' 'B' 'M' 'M' 'M'
 'B' 'M' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'M'
 'B' 'B' 'B' 'M' 'M' 'B' 'M' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'B' 'B' 'B' 'B'
 'B' 'B' 'M' 'M' 'B' 'M']


In [62]:
print(y_test == y_pred)

528     True
291     True
467     True
108     True
340     True
       ...  
471     True
449     True
24      True
38     False
230     True
Name: diagnosis, Length: 114, dtype: bool
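
The printed comparison is truncated, so it is easier to summarize the boolean Series than to read it row by row. The sketch below counts the misclassified rows, lists their index labels, and recomputes the accuracy. The labels and index values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels (with original row indices) and predictions
y_true = pd.Series(["B", "M", "B", "M"], index=[528, 291, 38, 230], name="diagnosis")
y_hat = np.array(["B", "M", "M", "M"])

matches = y_true == y_hat                      # element-wise comparison
n_wrong = int((~matches).sum())                # number of misclassified samples
wrong_rows = matches[~matches].index.tolist()  # their original row indices
accuracy = matches.mean()                      # fraction of correct predictions

print(n_wrong, wrong_rows, accuracy)
```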


In [60]:
# 5.3 Other results
print("Best model:\n", estimator.best_estimator_)
print("Best cross-validation score:\n", estimator.best_score_)
print("All results:\n", estimator.cv_results_)

Best model:
 KNeighborsClassifier(n_neighbors=7)
Best cross-validation score:
 0.9692270531400966
All results:
 {'mean_fit_time': array([0.00109971, 0.0009968 , 0.00109744, 0.00100183, 0.00100286,
       0.00110343, 0.00101013, 0.0009033 , 0.00100784, 0.00100584]), 'std_fit_time': array([2.82816863e-04, 1.28030777e-05, 3.01016633e-04, 8.70733898e-06,
       8.66417840e-06, 2.98634159e-04, 1.47568582e-05, 3.01619048e-04,
       1.73868455e-05, 2.56899310e-05]), 'mean_score_time': array([0.03051338, 0.00800278, 0.00750592, 0.00840697, 0.00769069,
       0.00810118, 0.0084908 , 0.00800173, 0.00809622, 0.00788808]), 'std_score_time': array([0.06688248, 0.00046285, 0.00049515, 0.0004844 , 0.00065589,
       0.00070225, 0.00067129, 0.00042874, 0.00054133, 0.00069226]), 'param_n_neighbors': masked_array(data=[3, 5, 7, 9, 11, 13, 15, 17, 19, 21],
             mask=[False, False, False, False, False, False, False, False,
                   False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_neigh