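0. For the assignments below, carry over the full preprocessing workflow from Friday's class (ColumnTransformer). Once you are familiar with the code in the slides, consider implementing it yourself.
    Implementation requirements:
  a. Most ML algorithms do not expect missing values: missing values in numerical features are imputed with the median, and missing values in categorical features are replaced by the most frequent category.
  b. Most ML algorithms only accept numerical inputs: categorical features are one-hot encoded.
  c. Compute and add a few ratio features: bedrooms_ratio, rooms_per_house, and people_per_house. Hopefully these correlate better with the median house value.
  d. Add cluster similarity features. These will likely be more useful to the model than raw latitude and longitude.
  e. Replace long-tailed features with their logarithm, since most models prefer features with roughly uniform or Gaussian distributions.
  f. Most ML algorithms prefer all features to have roughly the same scale: all numerical features are standardized.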


In [1]:
from pathlib import Path
import pandas as pd
housing = pd.read_csv(Path("datasets/housing/housing.csv"))

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler,FunctionTransformer
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.compose import make_column_selector, make_column_transformer

def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
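        # KMeans estimator parameters: the number of clusters and the random seed.
        # KMeans is a stochastic algorithm: it relies on randomness to locate the clusters.
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)

        # sample_weight sets the relative weight of each sample; it is passed to KMeans.fit at training time.
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self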

    def transform(self, X):
        # self.kmeans_.cluster_centers_ holds the locations of the cluster centers
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)

default_num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler())

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
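    remainder=default_num_pipeline)  # remainder: the transformer applied to every column not listed above; the intent is that only housing_median_age is left (the label is picked up too if it has not been dropped)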

In [3]:
housing_prepared = preprocessing.fit_transform(housing)

In [4]:
preprocessing.get_feature_names_out()

array(['bedrooms__ratio', 'rooms_per_house__ratio',
       'people_per_house__ratio', 'log__total_bedrooms',
       'log__total_rooms', 'log__population', 'log__households',
       'log__median_income', 'geo__Cluster 0 similarity',
       'geo__Cluster 1 similarity', 'geo__Cluster 2 similarity',
       'geo__Cluster 3 similarity', 'geo__Cluster 4 similarity',
       'geo__Cluster 5 similarity', 'geo__Cluster 6 similarity',
       'geo__Cluster 7 similarity', 'geo__Cluster 8 similarity',
       'geo__Cluster 9 similarity', 'cat__ocean_proximity_<1H OCEAN',
       'cat__ocean_proximity_INLAND', 'cat__ocean_proximity_ISLAND',
       'cat__ocean_proximity_NEAR BAY', 'cat__ocean_proximity_NEAR OCEAN',
       'remainder__housing_median_age', 'remainder__median_house_value'],
      dtype=object)

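1. Try a support vector machine regressor (sklearn.svm.SVR) for this regression task.
     Experiment with its hyperparameters, for example kernel="linear" and kernel="rbf"; each kernel comes with its own additional hyperparameters. Use both GridSearchCV and RandomizedSearchCV to find the hyperparameters with the best cross-validated predictive performance.

     Note that support vector machines do not scale well to large datasets, so train your model on only the first 5,000 instances of the training set and use only 3-fold cross-validation; otherwise the search will take hours.
     Don't worry about what the SVM hyperparameters mean for now; they will be explained in detail in the lecture on support vector machines (SVM).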

In [5]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split,GridSearchCV
X = housing.drop("median_house_value", axis=1)
y = housing["median_house_value"]

# Keep only the first 5,000 samples
X_subset = X.iloc[:5000]
y_subset = y.iloc[:5000]

# Split into training and validation sets (validation fraction 0.2)
X_train, X_test, y_train, y_test = train_test_split(
    X_subset, y_subset, test_size=0.2, random_state=42
)

# -------------------------------
# 2. Build the full pipeline: preprocessing + SVR
# -------------------------------
svr_pipeline = make_pipeline(preprocessing, SVR())

# -------------------------------
# 3. GridSearchCV: exhaustive search
# -------------------------------
param_grid = [
    {
        'svr__kernel': ['linear'],
        'svr__C': [0.1, 1, 10, 100]
    },
    {
        'svr__kernel': ['rbf', 'poly'],
        'svr__C': [0.1, 1, 10, 100],
        'svr__gamma': ['scale', # 'scale': gamma = 1 / (n_features * X.var()), for features whose scales differ widely
                       'auto', # 'auto': gamma = 1 / n_features, for features on roughly the same scale
                       0.001, 0.01, 0.1, 1]
    }
]

grid_search = GridSearchCV(
    svr_pipeline,
    param_grid,
    cv=3,
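    scoring='neg_mean_squared_error', # negative MSE (higher is better); take sqrt(-score) to get the RMSE
    n_jobs=-1, # use all CPU cores in parallel
    verbose=1 # print progress messages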
)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 52 candidates, totalling 156 fits


In [6]:
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score (RMSE):", np.sqrt(-grid_search.best_score_))

Best parameters: {'svr__C': 100, 'svr__gamma': 'scale', 'svr__kernel': 'rbf'}
Best cross-validation score (RMSE): 112159.84794959595
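The task also asks for RandomizedSearchCV, which samples a fixed number of hyperparameter combinations from distributions instead of enumerating a grid. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` and a plain `StandardScaler` + `SVR` pipeline so that it runs on its own, and the distribution ranges (`loguniform(0.1, 1000)` for C, `expon(scale=1.0)` for gamma) are illustrative choices, not tuned values.

```python
from scipy.stats import expon, loguniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data so the sketch is self-contained (assumption:
# in the notebook you would use X_train / y_train instead).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

demo_pipeline = make_pipeline(StandardScaler(), SVR())

# Lists are sampled uniformly; scipy distributions are sampled via .rvs().
# gamma is ignored when the sampled kernel is "linear".
param_distribs = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": loguniform(0.1, 1000),
    "svr__gamma": expon(scale=1.0),
}

rnd_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions=param_distribs,
    n_iter=10,                              # number of sampled combinations
    cv=3,
    scoring="neg_root_mean_squared_error",  # negated RMSE, higher is better
    random_state=42,
)
rnd_search.fit(X_demo, y_demo)

print("Best parameters:", rnd_search.best_params_)
print("Best cross-validation RMSE:", -rnd_search.best_score_)
```

To apply this to the housing data, replace `demo_pipeline` with `svr_pipeline` and fit on `X_train, y_train`; the `svr__` parameter prefix stays the same because `make_pipeline` names the step after the lowercased class name.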


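2. Read up on SelectFromModel in sklearn, then try adding a SelectFromModel transformer to the data preprocessing pipeline to keep only the most important attributes. Then train on the data with a regression model of your choice (linear regression, decision tree, or random forest).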


In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Define the feature selector (based on a random forest)
feature_selector = SelectFromModel(
    RandomForestRegressor(n_estimators=10, random_state=42),
    threshold="median"  # keep features whose importance is at least the median
)

# Build the full pipeline
full_pipeline_rfr = Pipeline([
    ("preprocessing", preprocessing),
    ("feature_selection", feature_selector),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=42))
])

In [12]:
from sklearn.metrics import mean_squared_error, r2_score
# Using the random forest as the example
full_pipeline_rfr.fit(X_train, y_train)

# Predict on the held-out split
y_pred = full_pipeline_rfr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean squared error: {mse:,.2f}")
print(f"R² score: {r2:.4f}")

Mean squared error: 3,614,193,326.43
R² score: 0.7412
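To see which features SelectFromModel actually keeps, the fitted selector exposes `get_support()`, and the underlying estimator exposes `feature_importances_`. The sketch below illustrates this on synthetic data from `make_regression` (an assumption made so the example runs on its own); it is not a result for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 6 features, only 2 of which are informative.
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 n_informative=2, random_state=42)

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=10, random_state=42),
    threshold="median",  # keep features with importance >= the median importance
)
X_selected = selector.fit_transform(X_demo, y_demo)

mask = selector.get_support()                      # boolean mask of kept features
importances = selector.estimator_.feature_importances_

print("Kept features:", mask)
print("Importances:", importances.round(3))
print("Shape after selection:", X_selected.shape)
```

In the notebook's pipeline, the same mask should be available as `full_pipeline_rfr["feature_selection"].get_support()` after fitting, and it can be combined with `full_pipeline_rfr["preprocessing"].get_feature_names_out()` to list the retained feature names.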
