# ML Pipeline 
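Follow the instructions below to build your machine learning pipeline.
### 1. Import and load
- Import the Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y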

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
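# sklearn.grid_search was removed from scikit-learn; both searches live in model_selection
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib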
import pickle



In [2]:
categories = ['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
        'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
        'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport',
        'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
        'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages_categories', engine)
#X = df[['message']]
X = df.message.values
y = df[categories].values


In [4]:
X.shape

(26180,)

In [5]:
y.shape

(26180, 36)

In [6]:
# Inspect rows whose 'related' label is 2 (without overwriting df)
df[df['related'] == 2].head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 117 | 146 | Dans la zone de Saint Etienne la route de Jacm... | Nan zon st. etine rout jakmel la bloke se mize... | direct | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 218 | 263 | . .. i with limited means. Certain patients co... | t avec des moyens limites. Certains patients v... | direct | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 304 | 373 | The internet caf Net@le that's by the Dal road... | Cyber cafe net@le ki chita rout de dal tou pr ... | direct | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 459 | 565 | Bonsoir, on est a bon repos aprs la compagnie ... | Bonswa nou nan bon repo apri teleko nan wout t... | direct | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 575 | 700 | URGENT CRECHE ORPHANAGE KAY TOUT TIMOUN CROIX ... | r et Salon Furterer. mwen se yon Cosmtologue. ... | direct | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
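
The rows above have `related` equal to 2, so that column is not strictly binary. The notebook does not say how these rows should be handled. The sketch below shows two common options on a small made-up frame. Both options are assumptions rather than steps taken in this notebook, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy stand-in for the real table; column names follow the notebook.
toy = pd.DataFrame({
    "message": ["need water", "hello", "road blocked"],
    "related": [1, 0, 2],
    "request": [1, 0, 0],
})

# Option A (assumption): treat 2 as "related" by capping the label at 1.
clipped = toy.assign(related=toy["related"].clip(upper=1))

# Option B (assumption): drop the ambiguous rows entirely.
dropped = toy[toy["related"] != 2]

print(clipped["related"].tolist())  # [1, 0, 1]
print(len(dropped))                 # 2
```

Either choice would have to be applied before `X` and `y` are extracted, so that messages and labels stay aligned.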


### 2. Write a tokenization function to process the text

In [6]:
def tokenize(text):
    # Normalize text: keep only letters and digits, and lowercase so that
    # capitalized stop words such as "The" are matched below
    text = re.sub(r"[^a-zA-Z0-9]", " ", text).lower()

    # Tokenize text
    words = word_tokenize(text)

    # Remove stop words (build the set once instead of once per word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]

    # Reduce words to their base form; stemming was used at first,
    # then replaced with lemmatization
    # stemmed = [PorterStemmer().stem(w).lower().strip() for w in words]
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(tok).strip() for tok in words]

    return clean_tokens
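
To see what the normalization step alone does, here is a minimal sketch that uses only the standard library. The `normalize` helper and the sample strings are hypothetical. `str.split` stands in for `word_tokenize` so that no NLTK data needs to be downloaded.

```python
import re


def normalize(text):
    """Replace every character that is not an ASCII letter or digit with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text)


sample = "URGENT: we need water, food & tents!"
print(normalize(sample).split())
# ['URGENT', 'we', 'need', 'water', 'food', 'tents']

# Accented letters are not in the allowed range, so they are replaced as well.
print(normalize("crèche").split())
# ['cr', 'che']
```

The second call shows a side effect worth keeping in mind for this dataset, which contains French and Haitian Creole text: words with accented characters are split into fragments.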
    

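### 3. Build a machine learning pipeline
This pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.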

In [18]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
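
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits the same three-step pipeline on four made-up messages with two label columns. The toy data, the label meanings, and the fixed `random_state` are assumptions. The default `CountVectorizer` tokenizer is used here in place of `tokenize` so the example runs without NLTK data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 4 messages, 2 label columns (say, "water" and "food").
messages = ["we need water", "please send food", "water and food needed", "thank you"]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One independent classifier is fitted per label column.
print(len(toy_pipeline.named_steps["clf"].estimators_))  # 2

# Predictions come back with one column per label.
pred = toy_pipeline.predict(["we need food"])
print(pred.shape)  # (1, 2)
```

With the real data the same structure yields 36 fitted forests and a prediction array with 36 columns.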

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [20]:
X_train.shape

(19635,)

In [21]:
y_train.shape

(19635, 36)
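
The training shape of 19635 rows follows from the default behaviour of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. A minimal sketch on row indices only is shown below. The `random_state=42` argument is an assumption added for reproducibility; the notebook calls the function without it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26180  # number of messages reported by X.shape
indices = np.arange(n_rows)

# With no test_size given, scikit-learn holds out 25% of the rows.
train_idx, test_idx = train_test_split(indices, random_state=42)

print(len(train_idx), len(test_idx))  # 19635 6545
```

Without a fixed seed, every call produces a different split. That may be why the per-class support counts differ between the classification reports in this notebook, since the split is run more than once.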


In [22]:
%%time
# train classifier
pipeline.fit(X_train, y_train)

CPU times: user 1min 24s, sys: 10.2 s, total: 1min 34s
Wall time: 1min 36s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [14]:
%%time
# predict on test data
y_pred = pipeline.predict(X_test)

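### 5. Test the model
Report the f1 score, precision, and recall for each output category of the dataset. You can iterate over the columns and call sklearn's `classification_report` on each one.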
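
The per-column iteration can also be condensed into one summary table. The sketch below is one possible way to do that, not the notebook's method. The toy labels and category names are made up, and it assumes every column is binary (so it would not work as-is on a column that also contains the value 2). `zero_division=0` is used so that a category with no predicted positives scores 0 without emitting a warning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy labels: 6 messages, 2 categories.
names = ["water", "food"]
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]])
y_hat = np.array([[1, 0], [0, 1], [0, 0], [0, 0], [1, 1], [0, 0]])

rows = []
for i, name in enumerate(names):
    # Metrics for the positive class (label 1) of this column.
    p, r, f1, _ = precision_recall_fscore_support(
        y_true[:, i], y_hat[:, i], average="binary", zero_division=0)
    rows.append({"category": name, "precision": p, "recall": r, "f1": f1})

summary = pd.DataFrame(rows).set_index("category")
print(summary.round(2))
```

For the toy data, `water` gets precision 1.0, recall 0.67, and f1 0.8, while `food` gets 0.5 for all three.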

In [77]:
y_test[:,0]

array([1, 1, 1, ..., 1, 1, 1])

In [15]:
for i in range(len(categories)):
    print("Categories:", categories[i])
    print(classification_report(y_test[:,i], y_pred[:,i]))

Categories: related
             precision    recall  f1-score   support

          0       0.67      0.11      0.19      1594
          1       0.82      0.07      0.13      4899
          2       0.01      0.96      0.02        52

avg / total       0.78      0.09      0.15      6545

Categories: request
             precision    recall  f1-score   support

          0       0.84      0.99      0.91      5446
          1       0.73      0.09      0.15      1099

avg / total       0.82      0.84      0.79      6545

Categories: offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6516
          1       0.00      0.00      0.00        29

avg / total       0.99      1.00      0.99      6545

Categories: aid_related
             precision    recall  f1-score   support

          0       0.60      0.99      0.75      3895
          1       0.72      0.04      0.08      2650

avg / total       0.65      0.61      0.48      6545

... (output truncated)


### 6. Improve the model
Use grid search to find the best combination of parameters.

In [36]:
pipeline.get_params()

{'clf': MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
            n_jobs=1),
 'clf__estimator': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
 'clf__estimator__bootstrap': True,
 'clf__estimator__class_we

In [8]:
def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    parameters = {
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__max_df': (0.5, 1.0),
        'vect__max_features': (None, 5000),
        'tfidf__use_idf': (True, False),
        'clf__estimator__n_estimators': [10, 50],
        'clf__estimator__min_samples_split': [2, 4]
    }
    
    # cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1)

    # With many parameters, RandomizedSearchCV can be tried in place of GridSearchCV
    n_iter_search = 5
    cv = RandomizedSearchCV(pipeline, param_distributions=parameters, n_iter=n_iter_search)
    
    return cv
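
To get a feel for why the randomized search is cheaper, the candidate counts can be checked without fitting anything. The sketch below copies the `parameters` dictionary from `build_model` and counts the combinations with `ParameterGrid` and `ParameterSampler`. The `random_state=0` argument is an assumption added so the sample is reproducible.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 1.0),
    'vect__max_features': (None, 5000),
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [10, 50],
    'clf__estimator__min_samples_split': [2, 4],
}

# A full grid search tries every combination: 2 values for each of 6 parameters.
n_grid = len(ParameterGrid(parameters))

# A randomized search with n_iter=5 draws only 5 of those combinations.
sampled = list(ParameterSampler(parameters, n_iter=5, random_state=0))

print(n_grid)        # 64
print(len(sampled))  # 5
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count multiplied by the number of folds.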



In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [10]:
%%time

model = build_model()
model.fit(X_train, y_train)

CPU times: user 26min 25s, sys: 2min 30s, total: 28min 55s
Wall time: 29min 16s


In [11]:
model.best_estimator_

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [12]:
model.best_score_

0.25408708938120705

In [13]:
model.best_params_

{'clf__estimator__min_samples_split': 2,
 'clf__estimator__n_estimators': 50,
 'tfidf__use_idf': False,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__ngram_range': (1, 1)}

In [14]:
%%time
y_pred = model.predict(X_test)

CPU times: user 22.3 s, sys: 3.16 s, total: 25.5 s
Wall time: 25.9 s


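### 7. Test the model
Show the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, process, and pipelines, there is no minimum performance requirement. However, fine-tuning the model for accuracy, precision, and recall can make your project stand out, especially on your resume.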
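
Accuracy can be computed directly from the label arrays. The sketch below, on made-up labels, contrasts per-category accuracy with exact-match accuracy, where a message only counts as correct if all of its labels are right. The arrays are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np

# Hypothetical toy labels: 4 messages, 2 categories.
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_hat = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])

# Fraction of correct labels in each column.
per_category_accuracy = (y_true == y_hat).mean(axis=0)

# Fraction of rows where every label is correct.
exact_match = (y_true == y_hat).all(axis=1).mean()

print(per_category_accuracy)  # [0.75 0.75]
print(exact_match)            # 0.5
```

The default `score` of `MultiOutputClassifier` is this stricter exact-match measure, which may explain why `best_score_` is much lower than the per-category figures in the reports.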

In [17]:
for i in range(len(categories)):
    print("Categories:", categories[i])
    print(classification_report(y_test[:,i], y_pred[:,i]))

Categories: related
             precision    recall  f1-score   support

          0       0.70      0.40      0.51      1490
          1       0.84      0.95      0.89      5006
          2       0.90      0.18      0.31        49

avg / total       0.81      0.82      0.80      6545

Categories: request
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      5428
          1       0.83      0.50      0.62      1117

avg / total       0.89      0.90      0.89      6545

Categories: offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6520
          1       0.00      0.00      0.00        25

avg / total       0.99      1.00      0.99      6545

Categories: aid_related
             precision    recall  f1-score   support

          0       0.80      0.83      0.81      3870
          1       0.74      0.69      0.72      2675

avg / total       0.77      0.78      0.77      6545

... (output truncated)


### 8. Keep improving the model, for example:
* Try other machine learning algorithms
* Try features other than TF-IDF
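
One way to add a feature besides TF-IDF is to combine transformers with `FeatureUnion`. The sketch below is only an illustration: the `MessageLength` transformer is a hypothetical feature, not something this notebook uses, and `TfidfVectorizer` with its default tokenizer replaces the `vect` and `tfidf` steps to keep the example short.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

messages = ["we need water", "food please"]
matrix = features.fit_transform(messages)

# 5 TF-IDF columns (one per distinct word) plus 1 length column.
print(matrix.shape)  # (2, 6)
```

A `FeatureUnion` like this could take the place of the first two pipeline steps, with the classifier step left unchanged.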

### 9. Export the model as a pickle file

In [18]:
# Save to local disk
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)


In [35]:
# Load the model from local disk
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
# The loaded model can be used for predict and other calls
# print(loaded_model.predict(['We need water and food']))
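
A quick way to check an export is to confirm that the reloaded model predicts the same as the original. The sketch below does this round trip with a small stand-in model written to a temporary directory. The toy pipeline, the data, and the file name are all hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy model standing in for the tuned search object.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
toy_model.fit(["we need water", "thank you", "send water now", "good morning"],
              [1, 0, 1, 0])

# Write the model to disk, then read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

new_messages = ["water please", "hello there"]
same = (toy_model.predict(new_messages) == restored.predict(new_messages)).all()
print(same)  # True
```

One caveat for the real pipeline: pickle stores a custom function such as `tokenize` by reference, not by value, so the function has to be defined or importable under the same name in whatever script loads the file.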

### 10. Use this notebook to complete `train.py`
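Use the template file provided in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.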
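
The template file itself is not shown here, so the following is only a guess at how such a script might be laid out. The function names, the two positional arguments, and the returned message are all hypothetical. The notebook steps are marked as comments where they would go.

```python
def parse_args(argv):
    """Return (database_path, model_path) from a list of command-line arguments."""
    if len(argv) != 2:
        raise SystemExit("usage: python train.py <database_path> <model_path>")
    database_path, model_path = argv
    return database_path, model_path


def main(argv):
    database_path, model_path = parse_args(argv)
    # The notebook steps would go here, in order:
    #   1. load X and y from the database   (section 1)
    #   2. build and fit the search model   (sections 3 to 6)
    #   3. print the per-category reports   (section 7)
    #   4. pickle the fitted model          (section 9)
    return "load {} -> save {}".format(database_path, model_path)


# In a real script this call would receive sys.argv[1:] instead.
print(main(["DisasterResponse.db", "model.pkl"]))
# load DisasterResponse.db -> save model.pkl
```

Keeping argument parsing separate from the training steps makes it easier to point the same script at a new database.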