# Using XGBoost for Classification Problems

Source: [link](https://towardsdatascience.com/beginners-guide-to-xgboost-for-classification-problems-50f75aac5390)

## Data preprocessing

In addition to data cleaning, the following steps are needed:
1. Scale the numeric features
2. Encode the categorical features

Here we use the Kaggle dataset [Rain in Australia](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) and predict whether it rains today from a set of weather measurements.  
In this part, preprocessing is done with Scikit-Learn Pipelines.

In [1]:
import pandas as pd

rain = pd.read_csv("data/weatherAUS.csv")

rain.head()

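| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |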


In [2]:
rain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

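This dataset contains 10 years of weather measurements from multiple weather stations across Australia. It can be used to predict whether it rains tomorrow or today, so it contains two target columns: `RainToday` and `RainTomorrow`.  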

### Dropping unnecessary features

Since we only predict `RainToday`, we drop the other target column and a few features that are not needed.

In [3]:
cols_to_drop = ["Date", "Location", "RainTomorrow", "Rainfall"]

rain.drop(cols_to_drop, axis=1, inplace=True)

### Handling missing values

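Dropping `Rainfall` in the previous step is required: it records the amount of rain for the day, so keeping it would give the target away.  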
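To see why `Rainfall` would leak the target, one can compare it with `RainToday` directly. The sketch below is only an illustration: the first three rows mirror values from `rain.head()`, the last two rows are made up, and the 1 mm threshold is an assumption about how `RainToday` is derived rather than something this notebook verifies.

```python
import pandas as pd

# Toy rows: the first three mirror rain.head(); the last two are hypothetical.
toy = pd.DataFrame(
    {
        "Rainfall": [0.6, 0.0, 1.0, 5.2, 12.4],
        "RainToday": ["No", "No", "No", "Yes", "Yes"],
    }
)

# Assumed rule: a day counts as rainy when more than 1 mm was recorded.
rule_based = (toy["Rainfall"] > 1.0).map({True: "Yes", False: "No"})

# If a simple rule on one feature reproduces the target, that feature leaks it.
agreement = (rule_based == toy["RainToday"]).mean()
print(agreement)
```

Running the same comparison on the full `rain` frame (before the column is dropped) would show how closely the rule holds on the real data.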

Next, we handle missing data by looking at the proportion of missing values in each column.

In [4]:
missing_props = rain.isna().mean(axis=0)

missing_props

MinTemp          0.010209
MaxTemp          0.008669
Evaporation      0.431665
Sunshine         0.480098
WindGustDir      0.070989
WindGustSpeed    0.070555
WindDir9am       0.072639
WindDir3pm       0.029066
WindSpeed9am     0.012148
WindSpeed3pm     0.021050
Humidity9am      0.018246
Humidity3pm      0.030984
Pressure9am      0.103568
Pressure3pm      0.103314
Cloud9am         0.384216
Cloud3pm         0.408071
Temp9am          0.012148
Temp3pm          0.024811
RainToday        0.022419
dtype: float64

If 40% or more of a column's values are missing, we drop that column.

In [5]:
over_threshold = missing_props[missing_props >= 0.4]

over_threshold

Evaporation    0.431665
Sunshine       0.480098
Cloud3pm       0.408071
dtype: float64

Three columns have at least 40% of their values missing, so we drop them.

In [6]:
rain.drop(over_threshold.index, 
        axis=1, 
        inplace=True)
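The threshold rule can also be wrapped in a small helper that returns a new frame instead of modifying the original in place. This is a sketch on made-up data; `drop_sparse_columns` and the toy column names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.4):
    """Return a copy of df without columns whose missing share is >= threshold."""
    missing_share = df.isna().mean(axis=0)
    to_drop = missing_share[missing_share >= threshold].index
    return df.drop(columns=to_drop)


toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
        "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],  # 20% missing
        "complete": [1, 2, 3, 4, 5],  # 0% missing
    }
)

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Because the helper does not use `inplace=True`, the original frame stays available for comparison.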

In [20]:
# Split the data into features and target

X = rain.drop("RainToday", axis=1)
y = rain.RainToday

In [31]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y = le.fit_transform(y)

y

array([0, 0, 0, ..., 0, 0, 0])
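The missing-value table above shows that about 2% of the `RainToday` values are missing. Depending on the scikit-learn version, `LabelEncoder` may either raise an error on such values or encode them as a class of their own, so it can be safer to fill the missing labels before encoding. The sketch below shows that order on a made-up series; it is one possible approach, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column with one missing label.
y_raw = pd.Series(["No", "Yes", np.nan, "No", "No"], name="RainToday")

# Fill missing labels with the most frequent value before encoding.
y_filled = y_raw.fillna(y_raw.mode()[0])

le = LabelEncoder()
y_encoded = le.fit_transform(y_filled)

print(le.classes_)  # classes learned by the encoder
print(y_encoded)
```

Checking `le.classes_` after fitting is a quick way to confirm that only the two expected classes are present.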

### Building the feature-processing pipelines

For the categorical features, we impute missing values with each column's mode (its most frequent value) and encode them with One-Hot encoding.

In [32]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
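        # `sparse_output` replaces `sparse` from scikit-learn 1.2 on; older versions need sparse=False
        ("oh-encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),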
    ]
)
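To make the effect of these two steps concrete, here is a minimal sketch on a made-up column. It keeps the encoder's default sparse output and converts it with `.toarray()`, so it does not depend on the name of the dense-output parameter, which differs between scikit-learn versions. The toy data and the name `demo_pipeline` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical wind-direction column with one missing value.
toy = pd.DataFrame({"WindGustDir": ["W", "W", np.nan, "NE"]})

demo_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("oh-encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# The pipeline returns a sparse matrix here; convert it for display.
encoded = demo_pipeline.fit_transform(toy).toarray()
print(encoded)
```

The missing entry is filled with `W`, the most frequent value, and the two output columns correspond to the categories `NE` and `W`.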

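For the numeric features, we impute missing values with the column mean and use `StandardScaler` to standardize each feature to zero mean and unit standard deviation.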

In [33]:
from sklearn.preprocessing import StandardScaler

numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), 
            ("scale", StandardScaler())]
)
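A quick check on a made-up column shows what the numeric steps do: the missing value is replaced by the column mean, and after scaling the column has mean 0 and standard deviation 1. The toy values and the name `demo_pipeline` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical temperature column with one missing value.
toy = pd.DataFrame({"MinTemp": [10.0, 20.0, np.nan, 30.0]})

demo_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)

scaled = demo_pipeline.fit_transform(toy)

# The imputed row sits exactly at the mean, so it maps to 0 after scaling.
print(scaled.ravel())
print(scaled.mean(axis=0), scaled.std(axis=0))
```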

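Finally, we combine the two pipelines into a column transformer.  

To specify which columns each pipeline applies to, we first need to separate the categorical and numeric features.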

In [34]:
cat_cols = X.select_dtypes(exclude="number").columns
num_cols = X.select_dtypes(include="number").columns

In [35]:
from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, num_cols),
        ("categorical", categorical_pipeline, cat_cols),
    ]
)
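The sketch below applies the same structure to a tiny made-up frame to show the shape of the result: the numeric columns come first, followed by one column per category. Passing `sparse_threshold=0` asks `ColumnTransformer` to always return a dense array; that setting, the toy data, and the name `demo_processor` are choices made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame(
    {
        "MinTemp": [13.4, 7.4, np.nan, 9.2],
        "WindGustDir": ["W", "WNW", "W", np.nan],
    }
)

cat_cols = toy.select_dtypes(exclude="number").columns
num_cols = toy.select_dtypes(include="number").columns

demo_processor = ColumnTransformer(
    transformers=[
        (
            "numeric",
            Pipeline([("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]),
            num_cols,
        ),
        (
            "categorical",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="most_frequent")),
                    ("oh-encode", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            cat_cols,
        ),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = demo_processor.fit_transform(toy)
print(out.shape)  # 1 scaled numeric column + 2 one-hot columns (W, WNW)
```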

## Classification with XGBoost

In [36]:
import xgboost as xgb

xgb_cl = xgb.XGBClassifier()

In [37]:
print(type(xgb_cl))

<class 'xgboost.sklearn.XGBClassifier'>


In [40]:
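# Apply preprocessing
X_processed = full_processor.fit_transform(X)
# y is already an integer array from LabelEncoder, so it contains no NaN
# for this imputer to fill; the call mainly reshapes y into a single column
y_processed = SimpleImputer(strategy="most_frequent").fit_transform(
    y.reshape(-1, 1)
)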

from sklearn.model_selection import train_test_split

# Use stratify to keep the class proportions the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y_processed, stratify=y_processed, random_state=1121218
)
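The cell above fits `full_processor` on all rows before splitting, so the imputation and scaling statistics also reflect the test rows. A common alternative is to split first and fit the preprocessing on the training rows only. The sketch below shows that order with a plain `StandardScaler` on randomly generated data; the arrays and variable names are hypothetical, and the same pattern would apply to the full column transformer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=20.0, scale=5.0, size=(400, 3))  # hypothetical features
y_demo = np.array([0] * 300 + [1] * 100)  # hypothetical, imbalanced labels

# Split first; stratify keeps the share of class 1 equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print(y_tr.mean(), y_te.mean())
```

Both printed values are the share of class 1, which should match the overall share thanks to `stratify`.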

In [41]:
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()

# Fit
xgb_cl.fit(X_train, y_train)

# Predict
preds = xgb_cl.predict(X_test)

In [42]:
accuracy_score(y_test, preds)

0.8350886841743435
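An accuracy figure is easier to interpret next to a trivial baseline, especially when one class is much more frequent than the other. The sketch below compares a set of made-up predictions with a baseline that always predicts the majority class and prints the confusion matrix; the arrays are invented for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions: 8 dry days, 2 rainy days.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Baseline that always predicts the majority class (0).
majority_baseline = np.zeros_like(y_true)

model_acc = accuracy_score(y_true, y_pred)
baseline_acc = accuracy_score(y_true, majority_baseline)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(model_acc, baseline_acc)
print(cm)
```

In this toy case the model and the baseline reach the same accuracy, while the confusion matrix shows that the model finds only one of the two rainy days.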