# ML Pipeline 
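Follow the instructions below to build your machine learning pipeline.
### 1. Import and load
- Import the Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variables Y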

In [1]:
# import libraries
import re
import pickle
import pandas as pd
from sqlalchemy import create_engine

from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

In [2]:
nltk.download(['punkt', 'wordnet','stopwords'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName_0.db')
df = pd.read_sql_table("InsertTableName_0", engine)
X = df["message"]
Y = df.drop(["id", "message", "original", "genre"], axis=1)
category_names = Y.columns.tolist()

### 2. Write a tokenization function to process the text

In [16]:
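# Build these once, instead of reloading the stop word list for every token
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def tokenize(text):
    """Lowercase, tokenize, and lemmatize a message, dropping English stop words."""
    # Convert to lower case
    text = text.lower()

    # Replace punctuation characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Lemmatize
    tokens = [LEMMATIZER.lemmatize(word).strip() for word in tokens]

    return tokens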
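
To make the first two steps of `tokenize` easier to follow, here is a minimal sketch that uses only the standard library. It is a simplification, not the notebook's function: `str.split()` stands in for `word_tokenize`, stop word removal and lemmatization are skipped, and the sample message is made up.

```python
import re


def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-z0-9]", " ", text.lower())


# Assumption: str.split() is a rough stand-in for nltk's word_tokenize
sample = "Water, food & shelter needed!!"
rough_tokens = normalize(sample).split()
print(rough_tokens)  # ['water', 'food', 'shelter', 'needed']
```

Because punctuation is replaced by spaces rather than deleted, words joined by punctuation are split apart instead of being fused into one token.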

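### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.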

In [17]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
pipeline

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])
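
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a tiny numeric dataset. The data is made up, and `n_estimators=10` and `random_state=0` are arbitrary choices for the example. The point is that the wrapper fits one copy of the base estimator per target column and returns one prediction column per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data (made up for illustration): 6 samples, 2 features, 3 binary targets
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
Y_toy = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])

multi_clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
multi_clf.fit(X_toy, Y_toy)

# One fitted estimator per target column
n_fitted = len(multi_clf.estimators_)
# Predictions have one column per target
pred_shape = multi_clf.predict(X_toy).shape
print(n_fitted, pred_shape)
```

In the notebook's pipeline the same thing happens with 36 target columns, so each `fit` trains 36 random forests.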

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline

In [18]:
# Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

# Train pipeline
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [19]:
predict_y = pipeline.predict(X_test)
for i in range(len(category_names)):
    category = category_names[i]
    print(category)
    print(classification_report(Y_test[category], predict_y[:, i]))

related
             precision    recall  f1-score   support

          0       0.61      0.46      0.53      1563
          1       0.84      0.90      0.87      4944
          2       0.23      0.32      0.27        47

avg / total       0.78      0.79      0.78      6554

request
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5443
          1       0.77      0.40      0.53      1111

avg / total       0.87      0.88      0.86      6554

offer
             precision    recall  f1-score   support

          0       0.99      1.00      1.00      6521
          1       0.00      0.00      0.00        33

avg / total       0.99      0.99      0.99      6554

aid_related
             precision    recall  f1-score   support

          0       0.76      0.85      0.80      3884
          1       0.73      0.61      0.66      2670

avg / total       0.75      0.75      0.74      6554

medical_help




             precision    recall  f1-score   support

          0       0.92      0.99      0.96      6019
          1       0.53      0.09      0.15       535

avg / total       0.89      0.92      0.89      6554

medical_products
             precision    recall  f1-score   support

          0       0.95      1.00      0.97      6210
          1       0.66      0.10      0.17       344

avg / total       0.94      0.95      0.93      6554

search_and_rescue
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      6395
          1       0.58      0.04      0.08       159

avg / total       0.97      0.98      0.97      6554

security
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      6438
          1       0.10      0.01      0.02       116

avg / total       0.97      0.98      0.97      6554

military
             precision    recall  f1-score   support

          0       0.97      1.00 

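### 5. Test your model
Report the f1 score, precision, and recall for each output category of the dataset. You can do this by iterating over the columns and calling sklearn's `classification_report` on each one.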
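
To show what the three numbers in each report row mean, here is a small standard-library sketch that computes them from raw counts for a single label. The helper name `binary_scores` and the label lists are made up for illustration. In the notebook itself `classification_report` does this work.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one label, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for a single category column
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_scores(y_true, y_pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.67 0.50 0.57
```

When a label is never predicted, precision has a zero denominator. The sketch returns 0.0 in that case, which is also how rows such as class `1` of `offer` end up at 0.00.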

### 6. Improve your model
Use grid search to find the best combination of parameters.

In [20]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f86d5c00378>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [23]:
parameters = {
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__min_samples_split": [2, 3]
}

cv = GridSearchCV(pipeline, param_grid=parameters)
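
The parameter names follow the nesting of the pipeline: `clf` is the pipeline step, `estimator` is the model wrapped by `MultiOutputClassifier`, and `n_estimators` is the random forest's own parameter. The sketch below enumerates the combinations this grid expands to, using only the standard library. The fold count is an assumption, since the default depends on the scikit-learn version and the `cv` argument.

```python
from itertools import product

# Same grid as in the cell above
grid = {
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__min_samples_split": [2, 3],
}

# Every combination of one value per parameter
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
for combo in combinations:
    print(combo)

n_folds = 3  # assumption: depends on the scikit-learn version and the cv argument
total_fits = len(combinations) * n_folds
print(total_fits, "cross-validation fits, plus one refit on the best combination")
```

Since every fit trains one forest per category, even this small grid takes a while. That is worth keeping in mind before adding more values.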

In [None]:
cv.fit(X_train, Y_train)

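### 7. Test your model
Print the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, process, and pipelines, there is no minimum performance metric required. However, tuning your model for better accuracy, precision, and recall can make your project stand out, especially in your portfolio.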
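
Accuracy alone can be misleading for rare categories. As a worked example, the arithmetic below uses the support counts from the `offer` report earlier in this notebook (6521 negatives, 33 positives) and asks what a model that always predicts 0 would score. The always-zero model is a hypothetical baseline, not the notebook's classifier.

```python
# Support counts for `offer`, taken from the classification report earlier in the notebook
n_negative = 6521
n_positive = 33

# A hypothetical model that always predicts 0 gets every negative right
# and every positive wrong
accuracy = n_negative / (n_negative + n_positive)
recall_positive = 0 / n_positive

print(f"accuracy = {accuracy:.3f}, recall for class 1 = {recall_positive:.2f}")
# accuracy = 0.995, recall for class 1 = 0.00
```

This is why the per-class precision and recall rows are more informative than a single accuracy figure for this dataset.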

In [None]:
# GridSearchCV refits the best parameter combination, so cv.predict uses the tuned model
cv_predict_y = cv.predict(X_test)
for i in range(len(category_names)):
    category = category_names[i]
    print(category)
    print(classification_report(Y_test[category], cv_predict_y[:, i]))

### 8. Try improving your model further, for example:
* Try other machine learning algorithms
* Try features other than TF-IDF
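
One possible way to add a feature next to TF-IDF is `FeatureUnion` with a small custom transformer. The sketch below is only an illustration: the `MessageLength` transformer and the sample messages are hypothetical, and `CountVectorizer()` uses its default tokenizer here, whereas the notebook passes `tokenizer=tokenize`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the number of characters in each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return a 2-D column so FeatureUnion can stack it next to the TF-IDF matrix
        return np.array([len(text) for text in X]).reshape(-1, 1)


features = FeatureUnion([
    ("text", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("length", MessageLength()),
])

# Made-up messages for illustration
messages = ["we need water", "food and shelter needed urgently", "thank you"]
matrix = features.fit_transform(messages)
print(matrix.shape)
```

The resulting `features` object could replace the `vect` and `tfidf` steps as a single step in front of `clf`.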

In [None]:
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

pipeline2

In [None]:
pipeline2.fit(X_train, Y_train)

In [None]:
ada_predict_y = pipeline2.predict(X_test)
for i in range(len(category_names)):
    category = category_names[i]
    print(category)
    print(classification_report(Y_test[category], ada_predict_y[:, i]))

### 9. Export your model as a pickle file

In [None]:
pkl_filename = 'ada_clf_model.pkl'
with open(pkl_filename, 'wb') as file:  
    pickle.dump(pipeline2, file)
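
Loading the file back is the mirror image of the export. The sketch below shows the round trip with a plain dictionary as a stand-in for the fitted pipeline (an assumption made to keep the example self-contained) and a temporary directory instead of the working directory.

```python
import os
import pickle
import tempfile

# Stand-in object (assumption): in the notebook this would be the fitted `pipeline2`
model = {"name": "ada_clf_model", "n_categories": 36}

pkl_path = os.path.join(tempfile.mkdtemp(), "ada_clf_model.pkl")
with open(pkl_path, "wb") as file:
    pickle.dump(model, file)

# Load the object back, as a later session or another script would
with open(pkl_path, "rb") as file:
    loaded_model = pickle.load(file)

print(loaded_model == model)  # True
```

Note that pickle stores functions by reference rather than by value, so the code that loads the real pipeline also needs `tokenize` to be importable under the same module name.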

### 10. Use this notebook to complete `train.py`
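Use the template file provided in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.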
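
The provided template is not shown here, so the following is only a guess at the overall shape such a script might take. The argument names `database_filepath` and `model_filepath` and the function layout are hypothetical. The notebook steps would slot in where the comment indicates.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths a training script would need (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Train the message classifier and save it.")
    parser.add_argument("database_filepath", help="path to the SQLite database with the messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would go here, in order:
    # load data -> build pipeline -> fit -> print classification reports -> pickle.dump
    print(f"Would train on {args.database_filepath} and save to {args.model_filepath}")
    return args


# Explicit arguments for the demo; a real script would call main() and read sys.argv
args = main(["messages.db", "classifier.pkl"])
```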