# ML Pipeline 
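Follow the instructions below to build your machine learning pipeline.
### 1. Import libraries and load data
- Import the Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y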

In [1]:
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])

import sqlite3

import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline 
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# from sklearn.pipeline import Pipeline, FeatureUnion
# from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
from sqlalchemy import create_engine

engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)


In [3]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
X = df.message.values
Y = df.iloc[:, -36:]

### 2. Write a tokenization function to process the text

In [5]:
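def tokenize(text):
    # Lowercase the text and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    words = word_tokenize(text)
    # Build the stop-word set and the lemmatizer once per call, not once per word
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return tokens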

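### 3. Build a machine learning pipeline
This pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.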
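
To make the role of `MultiOutputClassifier` concrete, here is a minimal sketch on hypothetical toy data (the arrays `X_toy` and `Y_toy` are invented for illustration and are not part of the dataset). It shows that the wrapper fits one copy of the base estimator per target column and returns one prediction column per target.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 6 samples, 2 numeric features, 3 binary target columns.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0]])
Y_toy = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])

multi_clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=10))
multi_clf.fit(X_toy, Y_toy)

# One fitted estimator per target column.
n_fitted = len(multi_clf.estimators_)

# Predictions have the same number of columns as Y_toy.
toy_pred = multi_clf.predict(X_toy)
print(n_fitted, toy_pred.shape)
```

In the notebook's pipeline the same idea applies with 36 target columns, so 36 decision trees are trained, one per category.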

In [6]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('transformer', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=10), n_jobs=-1))
])

In [7]:
len(X)

26216

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=10)

In [9]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...tion_leaf=0.0, presort=False, random_state=10,
            splitter='best'),
           n_jobs=-1))])

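### 5. Test your model
Report the f1 score, precision, and recall for each output category of the dataset. You can do this by iterating over the columns and calling sklearn's `classification_report` on each one.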

In [10]:
y_pred = pipeline.predict(X_test)
y_test = np.asarray(y_test)
for i in range(0,len(y_pred.T)):
    print("------nth column : ", i)
    print(classification_report(y_test.T[i], y_pred.T[i]))

------nth column :  0
             precision    recall  f1-score   support

          0       0.51      0.49      0.50      3078
          1       0.85      0.84      0.84      9930
          2       0.13      0.48      0.21       100

avg / total       0.76      0.75      0.76     13108

------nth column :  1
             precision    recall  f1-score   support

          0       0.91      0.92      0.91     10837
          1       0.58      0.54      0.56      2271

avg / total       0.85      0.85      0.85     13108

------nth column :  2
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     13048
          1       0.02      0.02      0.02        60

avg / total       0.99      0.99      0.99     13108

------nth column :  3
             precision    recall  f1-score   support

          0       0.75      0.75      0.75      7656
          1       0.65      0.64      0.64      5452

avg / total       0.70      0.70      0.70     13108
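
As a reminder of what the precision, recall, and f1-score columns in these reports measure, here is a small worked sketch in plain Python. The helper `binary_scores` and the toy label lists are hypothetical and only illustrate the standard definitions for a single positive class; they are not taken from the dataset.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels: 3 true positives in the data, 3 predicted positives.
y_true_toy = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred_toy = [1, 0, 0, 0, 0, 1, 0, 1]

precision, recall, f1 = binary_scores(y_true_toy, y_pred_toy)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(precision, recall, f1, accuracy)
```

Here accuracy (0.75) is higher than precision and recall (both 2/3) because the majority class is easy to get right. The same effect explains why a rare category can show a high weighted average in the reports while its positive class scores poorly.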


In [11]:
DecisionTreeClassifier().get_params().keys()


dict_keys(['class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])

### 6. Improve your model
Use grid search to find the best combination of parameters.

In [12]:
from sklearn.model_selection import GridSearchCV

parameters = {
             'clf__estimator__max_depth':(3,8)
             }

cv = GridSearchCV(pipeline, parameters)
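
The parameter name `clf__estimator__max_depth` follows scikit-learn's double-underscore convention: pipeline step `clf`, then the wrapped `estimator`, then its `max_depth`. The sketch below checks that name and runs a small search on synthetic data. The toy pipeline, the generated dataset, and `cv=3` are assumptions made for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-label data standing in for the vectorized messages.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=60, n_features=10, n_classes=3, random_state=0
)

toy_pipeline = Pipeline([
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=10))),
])

# The grid key must be one of the names reported by get_params().
key = 'clf__estimator__max_depth'
key_is_valid = key in toy_pipeline.get_params()

toy_search = GridSearchCV(toy_pipeline, {key: [3, 8]}, cv=3)
toy_search.fit(X_toy, Y_toy)

# After fitting, the winning setting and the refitted model are available.
best_depth = toy_search.best_params_[key]
print(key_is_valid, best_depth, toy_search.best_score_)
```

After `fit`, `predict` on the search object delegates to `best_estimator_`, which is the pipeline refitted with the best parameters on the full training data.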

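### 7. Test your model
Print the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, process, and pipelines, there is no minimum performance requirement. However, tuning the model to improve accuracy, precision, and recall can make your project stand out, especially in your portfolio.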

In [None]:
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
for i in range(10):
    print(classification_report(y_test[:,i],y_pred[:,i]) )


In [None]:
import sklearn.metrics  # binds the name `sklearn` and makes sure the metrics submodule is loaded
for i in range(10):
    print(sklearn.metrics.accuracy_score(y_test[:,i],y_pred[:,i]) )

### 8. Try improving your model further. Here are a few ideas:
* Try other machine learning algorithms
* Add features other than TF-IDF
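
One way to add a feature besides TF-IDF is to combine the text features with a custom transformer through `FeatureUnion`. The sketch below is only an illustration: `MessageLengthExtractor` is a hypothetical transformer, the toy messages are invented, and the default `CountVectorizer` tokenizer is used so the example runs without NLTK downloads.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with each message's character count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


toy_messages = ["we need water", "storm is coming tonight", "thank you"]

features = FeatureUnion([
    ('text', Pipeline([
        ('vectorizer', CountVectorizer()),
        ('transformer', TfidfTransformer()),
    ])),
    ('length', MessageLengthExtractor()),
])

# TF-IDF columns come first, followed by the single length column.
feature_matrix = features.fit_transform(toy_messages)
print(feature_matrix.shape)
```

The length column is on a very different scale from the TF-IDF values. That matters little for tree-based classifiers but may call for scaling with other models.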

In [15]:
from sklearn import multioutput
from sklearn.multioutput import MultiOutputClassifier

pipeline2 = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=tokenize)),
        ('transformer', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier(),n_jobs=-1))
      ])

pipeline2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))])

In [16]:
y_pred2 = pipeline2.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i], y_pred2[:,i]))
for i in range(10):
    print(sklearn.metrics.accuracy_score(y_test[:,i],y_pred2[:,i]))

             precision    recall  f1-score   support

          0       0.63      0.45      0.53      3078
          1       0.85      0.91      0.88      9930
          2       0.17      0.40      0.24       100

avg / total       0.79      0.80      0.79     13108

             precision    recall  f1-score   support

          0       0.89      0.98      0.93     10837
          1       0.80      0.40      0.53      2271

avg / total       0.87      0.88      0.86     13108

             precision    recall  f1-score   support

          0       1.00      1.00      1.00     13048
          1       0.00      0.00      0.00        60

avg / total       0.99      1.00      0.99     13108

             precision    recall  f1-score   support

          0       0.74      0.86      0.80      7656
          1       0.75      0.58      0.65      5452

avg / total       0.74      0.74      0.74     13108

             precision    recall  f1-score   support

          0       0.93      0.99 

  'precision', 'predicted', average, warn_for)


### 9. Export your model as a pickle file

In [17]:
import pickle 


with open('pipeline.pkl','wb') as fw:
    pickle.dump(pipeline2,fw)
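
To confirm that a saved model can be used later, it helps to load it back and compare predictions. The following is a minimal round-trip sketch on a hypothetical toy classifier written to a temporary directory; the file name `toy_model.pkl` and the training data are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy model standing in for the fitted pipeline.
toy_model = DecisionTreeClassifier(random_state=10)
toy_model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    # Save the fitted model.
    with open(model_path, 'wb') as fw:
        pickle.dump(toy_model, fw)
    # Load it back into a new object.
    with open(model_path, 'rb') as fr:
        restored_model = pickle.load(fr)

original_pred = list(toy_model.predict([[0], [3]]))
restored_pred = list(restored_model.predict([[0], [3]]))
print(original_pred == restored_pred)
```

Note that pickle stores functions by reference, not by value. A pipeline built with `CountVectorizer(tokenizer=tokenize)` can therefore only be unpickled in a process where `tokenize` is importable under the same module name.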
    

### 10. Use this notebook to complete `train.py`
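Use the template file provided in the Resources folder to write a script that runs the steps above, creating a database and exporting a model based on a new dataset specified by the user.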
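
As a starting point for reading the user-specified inputs, a script could parse its paths from the command line. The sketch below is not the provided template: the argument names `database_filepath`, `model_filepath`, and `--table` are hypothetical, with defaults borrowed from the file and table names used in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments for a training script."""
    parser = argparse.ArgumentParser(
        description='Train the message classifier and save it as a pickle file.'
    )
    parser.add_argument('database_filepath',
                        help='path to the SQLite database with the cleaned messages')
    parser.add_argument('model_filepath',
                        help='path where the trained model pickle is written')
    parser.add_argument('--table', default='DisasterResponse',
                        help='name of the table to read from the database')
    return parser.parse_args(argv)


# Example invocation with explicit arguments instead of sys.argv.
args = parse_args(['DisasterResponse.db', 'pipeline.pkl'])
print(args.database_filepath, args.model_filepath, args.table)
```

The remaining steps of the script would then mirror the notebook: load the table, build the pipeline, fit it, report the metrics, and dump the model to `args.model_filepath`.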