Below is the model-selection map from the official scikit-learn website.

![ml_map](../images/ml_map.svg)

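I have not yet covered much of the more statistics-oriented material, such as naive Bayes and kernel estimation.

Traditional ML has one more very important model: the support vector machine.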

# Day 7: SVM
Use the UCI Adult dataset to compare against the logistic regression from days 1-2  
Use California house-price to compare against the models from days 3-4

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC,SVR
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

In [None]:
X = pd.read_csv('../data/Adult/X.csv')
y = pd.read_csv('../data/Adult/y.csv')
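# Data cleaning (a lesson learned the hard way)
y_series = y.iloc[:, 0]  # take the target column as a Series
y_series = y_series.str.strip()  # strip leading and trailing whitespace
y = y_series.str.replace(r'\.$', '', regex=True)  # drop a trailing period, if present
# Make sure y contains only two classes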
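
The cleaning step matters because labels that differ only by surrounding whitespace or a trailing period would otherwise be counted as separate classes. A minimal sketch of the effect, using a small hand-written Series (`raw` is a hypothetical stand-in, not the real `y.csv`):

```python
import pandas as pd

# Hypothetical raw labels mimicking the inconsistencies the cleaning step targets
raw = pd.Series([" <=50K", ">50K.", "<=50K.", " >50K "])
print(raw.nunique())  # four distinct strings before cleaning

# Same two operations as the cell above: strip whitespace, drop a trailing period
cleaned = raw.str.strip().str.replace(r"\.$", "", regex=True)

# After cleaning, exactly two distinct classes should remain
classes = sorted(cleaned.unique())
print(classes)
assert len(classes) == 2
```

Running the same `assert` on the real `y` is one way to act on the "only two classes" comment.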

In [5]:
def build_preprocessor():
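    """Build the preprocessing pipeline, handling numerical and categorical features separately."""
    # Numerical features: impute missing values, then standardize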
    numerical_features = ['age', 'fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    
    # Categorical features: impute missing values, then one-hot encode
    categorical_features = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='passthrough'  # keep any columns not listed above
    )
    return preprocessor
    

In [6]:
# Split into training and test sets, keeping DataFrame format so column names are preserved
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = pd.DataFrame(x_train, columns=X.columns)
x_test = pd.DataFrame(x_test, columns=X.columns)

In [8]:
# Replace the missing-value marker '?' in the categorical variables with np.nan
categorical_features = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
for df in [x_train, x_test]:
    df.loc[:, categorical_features] = df.loc[:, categorical_features].replace('?', np.nan)

In [9]:
# Build the preprocessor and transform the data
preprocessor = build_preprocessor()
X_train_processed = preprocessor.fit_transform(x_train)
X_test_processed = preprocessor.transform(x_test)
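
To see what the preprocessor produces, here is a small sketch on toy data. The two-column frames `toy_train` and `toy_test` and the name `toy_preprocessor` are made up for illustration, and `sparse_threshold=0` is added only so the output is a dense array that is easy to print. It shows the `'?'` marker being imputed and a category unseen during fitting being encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Adult data: one numeric and one categorical column
toy_train = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 60.0],
    "workclass": ["Private", "State-gov", "?", "Private"],
})
toy_test = pd.DataFrame({"age": [33.0], "workclass": ["Never-worked"]})

# '?' marks a missing value, so turn it into NaN before imputing
toy_train["workclass"] = toy_train["workclass"].replace("?", np.nan)

numeric = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                    ("scaler", StandardScaler())])
categorical = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_preprocessor = ColumnTransformer(
    transformers=[("num", numeric, ["age"]),
                  ("cat", categorical, ["workclass"])],
    sparse_threshold=0,  # always return a dense array (illustration only)
)

# Fit on the training frame only, then reuse the fitted statistics on the test frame
train_out = toy_preprocessor.fit_transform(toy_train)
test_out = toy_preprocessor.transform(toy_test)

print(train_out.shape)  # one scaled numeric column plus one column per category seen in training
print(test_out[0, 1:])  # one-hot part of the test row with the unseen category
```

Fitting only on the training split, as the cell above does, keeps test-set statistics out of the imputer and scaler.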

In [13]:
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Build the SVC model
svc = SVC()

# Search the hyperparameters with GridSearchCV
grid_search = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)

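- C: regularization parameter. The strength of the regularization is inversely proportional to C, so a smaller C means stronger regularization.  
- kernel: kernel type. Common choices are 'linear', 'rbf' (Gaussian radial basis function), 'poly' (polynomial), and 'sigmoid'.  
- gamma: kernel coefficient. Applies to 'rbf', 'poly', and 'sigmoid'. The larger gamma is, the smaller each sample's range of influence and the more complex the model; the smaller gamma is, the larger the range of influence and the simpler the model.  
- degree: degree of the polynomial kernel. Only applies when kernel='poly'.  
- probability: whether to enable probability estimates (default False). Enabling it makes predict_proba available but adds computational cost.  
- shrinking: whether to use the shrinking heuristic (default True), which speeds up training.  
- class_weight: class weights, used to handle class imbalance.  
- random_state: random seed for reproducibility (only takes effect where randomness is involved, such as probability estimation).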
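
The effect of gamma described above can be checked empirically. The sketch below uses synthetic two-moons data rather than the Adult dataset, and the three gamma values are arbitrary choices for illustration. It records training accuracy, test accuracy, and the number of support vectors for each setting.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (not the Adult dataset), chosen only for illustration
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", C=1, gamma=gamma).fit(Xd_train, yd_train)
    results[gamma] = {
        "train_acc": model.score(Xd_train, yd_train),
        "test_acc": model.score(Xd_test, yd_test),
        "n_support": int(model.n_support_.sum()),  # total support vectors over both classes
    }

for gamma, r in results.items():
    print(f"gamma={gamma}: {r}")
```

A large gap between training and test accuracy at a high gamma is the usual sign that the decision boundary has become too local.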

In [16]:
grid_search.fit(X_train_processed, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time= 1.1min
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time= 1.2min
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time= 1.2min
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time= 1.2min
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time= 1.2min
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time= 1.2min
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time= 1.3min
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time= 1.3min
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time= 1.3min
[CV] END ..................C=0.1, gamma=scale, kernel=linear; total time= 1.3min
[CV] END .....................C=0.1, gamma=scale, kernel=rbf; total time= 1.3min
[CV] END ....................C=0.1, gamma=scale, kernel=poly; total time= 1.4min
[CV] END ...................
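
The "18 candidates, 90 fits" in the log follows directly from the grid: 3 values of C × 3 kernels × 2 gamma settings, each fitted on 5 folds. Because the linear kernel does not use gamma, the linear candidates are each fitted twice with identical results. One possible way to avoid that, sketched here with `ParameterGrid` just to count candidates, is to pass a list of grids (`split_grid` is a suggested alternative, not what the notebook ran):

```python
from sklearn.model_selection import ParameterGrid

# The grid as written: every kernel is combined with every gamma setting
full_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}

# Alternative: a list of grids, so gamma is varied only for kernels that use it
split_grid = [
    {"C": [0.1, 1, 10], "kernel": ["linear"]},
    {"C": [0.1, 1, 10], "kernel": ["rbf", "poly"], "gamma": ["scale", "auto"]},
]

cv_folds = 5
n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_full * cv_folds)    # 18 candidates, 90 fits
print(n_split, n_split * cv_folds)  # 15 candidates, 75 fits
```

`GridSearchCV` accepts the same list-of-dicts form for `param_grid`, which matters here given that a single fit takes over a minute.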

In [None]:
y_pred = grid_search.predict(X_test_processed)
report = classification_report(y_test, y_pred)

In [18]:
print(report)

              precision    recall  f1-score   support

       <=50K       0.87      0.95      0.91      7414
        >50K       0.77      0.56      0.65      2355

    accuracy                           0.85      9769
   macro avg       0.82      0.75      0.78      9769
weighted avg       0.85      0.85      0.85      9769
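
To put the 0.85 accuracy in context, it helps to compare it with a trivial classifier that always predicts the majority class. The arithmetic below uses only the support counts and the >50K recall printed in the report; since the report rounds to two decimals, the derived count is approximate.

```python
# Support counts copied from the classification report
support = {"<=50K": 7414, ">50K": 2355}
total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(support.values()) / total
print(f"{majority_baseline:.3f}")  # 0.759

# Recall for >50K is reported as 0.56, so roughly this many high earners are found
found = round(0.56 * support[">50K"])
print(found, "of", support[">50K"])  # 1319 of 2355
```

So the model gains about nine points over the majority baseline, while still missing a large share of the >50K class.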



Almost identical to logistic regression. The official baselines show the same thing. For this task, tree-based models classify particularly well.

---

I put the code that uses SVR to predict California house prices in scripts/day7.py.