# Red Wine Quality Classification
#### 顾翔, 沈贝宁, 那铭心, 季诚
#### This problem was handled by 顾翔
---
## Runtime environment
* Python 3
* Libraries installed: pandas, scikit-learn, numpy
* Tools: Anaconda, Jupyter Notebook

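### 1. Data processing
* The dataset is split by stratified random sampling on `quality` into an 80% training set and a 20% test set.
* z-score standardization did not improve the results, so it is not used.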
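To illustrate what the stratified split buys, here is a minimal sketch on synthetic labels. The class counts and the random feature matrix are made up for illustration and are not the wine data; only `train_test_split` and its `stratify` argument come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the 'quality' column (illustrative only)
rng = np.random.default_rng(0)
y = np.repeat([3, 4, 5, 6, 7, 8], [10, 50, 680, 640, 200, 20])
X = rng.normal(size=(len(y), 4))

# Same call pattern as the notebook: 20% test set, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5
)

def class_shares(labels):
    """Return {class: fraction of samples} for a label array."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

full_shares = class_shares(y)
train_shares = class_shares(y_train)
test_shares = class_shares(y_test)

# Print the share of each class in the full, training, and test sets
for cls in full_shares:
    print(cls, round(full_shares[cls], 3), round(train_shares[cls], 3), round(test_shares[cls], 3))
```

With stratification, even the rarest classes appear in the test set in roughly the same proportion as in the full data, which a plain random split does not guarantee.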

In [1]:
import pandas as pd
# Load the dataset; the file uses ';' as the field separator
df = pd.read_csv('winequality-red.csv', sep=';')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


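In [2]:
from sklearn.model_selection import train_test_split
#from sklearn import preprocessing

# Stratified 80/20 split on the label so both sets keep the same quality distribution
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['quality'], random_state=5)

# Separate the features from the 'quality' label
train_x, train_y = train_df.drop(columns='quality'), train_df['quality']
test_x, test_y = test_df.drop(columns='quality'), test_df['quality']

# z-score standardization (disabled: it did not improve the results)
#train_x=preprocessing.scale(train_x.values)
#test_x=preprocessing.scale(test_x.values)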
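The commented-out `preprocessing.scale` calls would standardize the training and test sets separately, each with its own mean and standard deviation. A common alternative, sketched below on synthetic arrays (the data and variable names are illustrative assumptions, not the notebook's), is to fit a `StandardScaler` on the training set only and reuse its statistics for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices standing in for train_x / test_x (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
X_test = rng.normal(loc=10.0, scale=3.0, size=(50, 3))

# Learn per-feature mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6))  # per-feature mean of the scaled training set
print(X_train_std.std(axis=0).round(6))   # per-feature std of the scaled training set
```

Tree-based models split on per-feature thresholds, so they are largely insensitive to this kind of rescaling, which is consistent with the observation that standardization did not help here.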

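### 2. Random forest
* Random forest is a Bagging ensemble whose base estimators are all decision trees; in theory it performs better than a single decision tree.
* Library reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* Grid search is used to find the best hyperparameters.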

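In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier()
# 'auto' is omitted from max_features: for classifiers it was an alias of 'sqrt'
# and is no longer accepted by recent scikit-learn releases
grid_values = {'n_estimators': [50, 100, 200], 'max_depth': [None, 30, 15, 5], 'max_features': ['sqrt', 'log2'], 'min_samples_leaf': [1, 20, 50, 100]}
# Cross-validated grid search, selecting by accuracy
grid_rfc = GridSearchCV(rfc, param_grid=grid_values, scoring='accuracy')
grid_rfc.fit(train_x, train_y)
# Accuracy of the refit best estimator on the held-out test set
score = grid_rfc.score(test_x, test_y)
print("Random forest accuracy:", score)

Random forest accuracy: 0.66875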
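The grid search above reports only the final test accuracy. As a hedged sketch, the snippet below shows how the selected hyperparameters and the cross-validated score could be inspected through `best_params_` and `best_score_`. It uses a small synthetic dataset from `make_classification` and a reduced grid so that it runs quickly; neither is the notebook's actual data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic 3-class problem (illustrative stand-in for the wine data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Reduced grid; random_state is fixed so repeated runs give the same search result
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", search.best_score_)
print("held-out test accuracy:", search.score(X_test, y_test))
```

Comparing `best_score_` with the test accuracy gives a rough sense of whether the selected model generalizes beyond the folds it was tuned on.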


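### 3. AdaBoost
* AdaBoost trains a set of weak classifiers on differently weighted versions of the same training set, then combines them with weights into a strong classifier.
* Library reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
* Grid search is used to find the best hyperparameters.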
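The reweighting idea can be sketched as one round of the classic binary (discrete) AdaBoost update. The labels and weak-learner predictions below are made up for illustration, and scikit-learn's `AdaBoostClassifier` uses a multi-class generalization (SAMME), so its exact formulas differ from this sketch.

```python
import numpy as np

# Toy binary labels in {-1, +1} and the predictions of one weak classifier (illustrative)
y = np.array([1, 1, -1, -1, 1, -1, 1, -1])
h = np.array([1, 1, -1, -1, -1, -1, 1, 1])  # wrong on samples 4 and 7

# Start with uniform sample weights
w = np.full(len(y), 1 / len(y))

# Weighted error of the weak classifier
miss = h != y
err = w[miss].sum()

# Vote weight of this classifier in the final ensemble
alpha = 0.5 * np.log((1 - err) / err)

# Increase the weights of misclassified samples, decrease the rest, then renormalize
w = w * np.exp(-alpha * y * h)
w /= w.sum()

print("weighted error:", err)
print("classifier weight alpha:", alpha)
print("updated sample weights:", w.round(4))
```

After the update, the misclassified samples together carry half of the total weight, so the next weak classifier is pushed to focus on them.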

In [60]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
#from sklearn.model_selection import GridSearchCV
# Base learner: a decision tree limited to depth 10
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10))
grid_values = {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}
# Cross-validated grid search, selecting by accuracy
grid_ada = GridSearchCV(ada, param_grid=grid_values, scoring='accuracy')
grid_ada.fit(train_x, train_y)
# Accuracy of the refit best estimator on the held-out test set
score = grid_ada.score(test_x, test_y)
print("AdaBoost accuracy:", score)

AdaBoost accuracy: 0.7125


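### 4. Result analysis
* The evaluation metric is $\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$; for this multi-class task that amounts to the fraction of test samples whose predicted quality equals the true quality.
* Test accuracy is around 70% (0.66875 for random forest, 0.7125 for AdaBoost), so random forest is slightly weaker than AdaBoost. We suspect the accuracy is limited by possible outliers and by irrelevant or correlated features in the data; a possible next step is to apply Lasso feature selection before classification.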
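The accuracy formula is written in binary terms (TP, TN, FP, FN), while wine quality has several classes. As a small worked example, accuracy can be computed from a confusion matrix as the sum of the diagonal divided by the total count, which reduces to the formula above in the binary case. The matrices below are invented for illustration and are not results from this notebook.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Binary case, rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
binary_cm = np.array([[50, 10],
                      [5, 35]])
binary_acc = accuracy_from_confusion(binary_cm)  # (TN + TP) / total = 85 / 100

# Three-class case: the same rule applies, diagonal over total
multi_cm = np.array([[30, 5, 0],
                     [4, 40, 6],
                     [1, 4, 10]])
multi_acc = accuracy_from_confusion(multi_cm)  # 80 / 100

print(binary_acc, multi_acc)
```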