# ML Pipeline 
按照如下的指导要求，搭建你的机器学习管道。
### 1. 导入与加载
- 导入 Python 库
- 使用 [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html) 从数据库中加载数据集
- 定义特征变量X 和目标变量 Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
import sqlite3
from sqlalchemy import create_engine


In [2]:
# load data from 

engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table(table_name='InsertTableName',con=engine,index_col='id')
X = df['message']
Y = df[['related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']]

### 2. 编写分词函数，开始处理文本

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()


def tokenize(text):
    text = re.sub(r"[^z-zA-Z0-9]"," " ,text.lower())
    
    tokens = word_tokenize(text)
    
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


### 3. 创建机器学习管道 
这个机器学习管道应该接收 `message` 列作输入，输出分类结果，分类结果属于该数据集中的 36 个类。你会发现 [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) 在预测多目标变量时很有用。

In [4]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.tree import DecisionTreeClassifier

class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)


pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier( DecisionTreeClassifier(random_state =42), n_jobs = -1))
         ])

### 4. 训练管道
- 将数据分割成训练和测试集
- 训练管道

In [5]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 


X_train,X_test ,Y_train ,Y_test =train_test_split(X,Y)
# X_train.drop([6554,19661])
pipeline.fit(X_train ,Y_train)
Y_pred=pipeline.predict(X_test)



In [6]:
Y_test.columns

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [7]:
Y_pred=pd.DataFrame(Y_pred,columns=['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']).astype('int')
Y_pred['related'].head()
Y_test=pd.DataFrame(Y_test).astype('int')

In [8]:
for i in Y_test.columns:
    print(Y_test[i].head())

id
15086    1
18727    1
14138    2
18064    1
18165    1
Name: related, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: request, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: offer, dtype: int64
id
15086    1
18727    0
14138    0
18064    1
18165    0
Name: aid_related, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: medical_help, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: medical_products, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: search_and_rescue, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: security, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: military, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: child_alone, dtype: int64
id
15086    0
18727    0
14138    0
18064    0
18165    0
Name: water, dtype: int64
id
15086    1
18727    0

### 5. 测试模型
报告数据集中每个输出类别的 f1 得分、准确度和召回率。你可以对列进行遍历，并对每个元素调用 sklearn 的 `classification_report`。

In [9]:

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

for i in Y_test.columns:
    print(i)
    print('f1_scroe:',f1_score(Y_test[i], Y_pred[i], average='macro'))
    print('precision_score:',precision_score(Y_test[i], Y_pred[i], average="macro"))
    print('recall_score:',recall_score(Y_test[i], Y_pred[i], average="macro"),'\n')

      
  

related
f1_scroe: 0.299576714199
precision_score: 0.335976975569
recall_score: 0.333886081298 

request
f1_scroe: 0.4883023607
precision_score: 0.602200950966
recall_score: 0.512895315414 

offer
f1_scroe: 0.498776384215
precision_score: 0.49786259542
recall_score: 0.499693533558 

aid_related
f1_scroe: 0.487806472407
precision_score: 0.572090312223
recall_score: 0.532164226416 

medical_help
f1_scroe: 0.502886575454
precision_score: 0.549253210798
recall_score: 0.508952425393 

medical_products
f1_scroe: 0.516190704728
precision_score: 0.570202633055
recall_score: 0.513898245542 

search_and_rescue
f1_scroe: 0.502318730975
precision_score: 0.51844717878
recall_score: 0.503567903694 

security
f1_scroe: 0.494874759152
precision_score: 0.491352923171
recall_score: 0.498447446049 

military
f1_scroe: 0.494635935693
precision_score: 0.497350651029
recall_score: 0.499500696057 

child_alone
f1_scroe: 1.0
precision_score: 1.0
recall_score: 1.0 

water
f1_scroe: 0.509144934011
precision_scor

### 6. 优化模型
使用网格搜索来找到最优的参数组合。 

In [10]:

from sklearn.model_selection import GridSearchCV
# pipeline = Pipeline([
#         ('vect', CountVectorizer(tokenizer=tokenize)),
#         ('tfidf', TfidfTransformer()),
#         ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state =42), n_jobs = -1))
#     ])



parameters = {
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__max_df': (0.5, 0.75, 1.0),
        'vect__max_features': (None, 5000, 10000),
        'tfidf__use_idf': (True, False)
#         'clf__max_depth':(1,21),
#         'clf__criterion':np.array(['entropy','gini'])
    
#         'clf__n_estimators': [50, 100, 200],
#         'clf__min_samples_split': [2, 3, 4]
    
#         'features__transformer_weights': (
#             {'text_pipeline': 1, 'starting_verb': 0.5},
#             {'text_pipeline': 0.5, 'starting_verb': 1},
#             {'text_pipeline': 0.8, 'starting_verb': 1},
#         )
    }

cv = GridSearchCV(pipeline, param_grid=parameters)
# param_grid= dict(features__pca__n__components=[1, 2, 3],  features__univ__select__k=[1,2],svm__C=[0.1, 1, 10])  
# grid_search= GridSearchCV(pipeline, param_grid=param_grid, verbose=10)  

### 7. 测试模型
打印微调后的模型的精确度、准确率和召回率。  

因为本项目主要关注代码质量、开发流程和管道技术，所有没有模型性能指标的最低要求。但是，微调模型提高精确度、准确率和召回率可以让你的项目脱颖而出——特别是让你的简历更出彩。

In [11]:
fit=cv.fit(X_train, Y_train)
print("\nBest Parameters:", fit.best_params_)
Y_pred=fit.best_estimator_.predict(X_test)


Best Parameters: {'tfidf__use_idf': False, 'vect__max_df': 0.5, 'vect__max_features': None, 'vect__ngram_range': (1, 2)}


In [12]:
Y_pred=pd.DataFrame(Y_pred,columns=['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']).astype('int')
Y_test=pd.DataFrame(Y_test).astype('int')
for i in Y_test.columns:
    print(i)
    print('f1_scroe:',f1_score(Y_test[i], Y_pred[i], average='macro'))
    print('precision_score:',precision_score(Y_test[i], Y_pred[i], average="macro"))
    print('recall_score:',recall_score(Y_test[i], Y_pred[i], average="macro"),'\n')

related
f1_scroe: 0.297852301971
precision_score: 0.342548256636
recall_score: 0.334308985016 

request
f1_scroe: 0.485681227553
precision_score: 0.607627627537
recall_score: 0.511981072474 

offer
f1_scroe: 0.49889135255
precision_score: 0.497863573936
recall_score: 0.49992338339 

aid_related
f1_scroe: 0.495077460421
precision_score: 0.58276897423
recall_score: 0.537440189173 

medical_help
f1_scroe: 0.497369258099
precision_score: 0.546473914435
recall_score: 0.506405279503 

medical_products
f1_scroe: 0.512931034483
precision_score: 0.595324060612
recall_score: 0.512193899692 

search_and_rescue
f1_scroe: 0.492645920421
precision_score: 0.487065666616
recall_score: 0.498355520752 

security
f1_scroe: 0.495302633605
precision_score: 0.491367456073
recall_score: 0.499301350722 

military
f1_scroe: 0.503499538803
precision_score: 0.531235625575
recall_score: 0.504961832933 

child_alone
f1_scroe: 1.0
precision_score: 1.0
recall_score: 1.0 

water
f1_scroe: 0.499872696522
precision_sco

### 8. 继续优化模型，比如：
* 尝试其他的机器学习算法
* 尝试除 TF-IDF 外其他的特征

In [13]:
from sklearn.ensemble import RandomForestClassifier
import nltk
nltk.download('averaged_perceptron_tagger')

def tokenize(text):
    text = re.sub(r"[^z-zA-Z0-9]"," " ,text.lower())
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens



def TextLengthExtractor(text):
    txt_length = text.apply(len)
    return txt_length
    
pipeline_fix = Pipeline([
        ('features', FeatureUnion([
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))

#             ('starting_verb', StartingVerbExtractor())
        ])),
        ('clf', RandomForestClassifier())
    ])
# RandomForestClassifier()
# MultiOutputClassifier(DecisionTreeClassifier(random_state =42), n_jobs = -1)
pipeline_fix.fit(X_train,Y_train)
Y_pred=pipeline_fix.predict(X_test)
Y_pred=pd.DataFrame(Y_pred,columns=['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']).astype('float')
# Y_test=Y_test.apply(pd.to_numeric)
Y_test=Y_test.astype('float')

for i in Y_test.columns:
    print(i)
    print('f1_scroe:',f1_score(Y_test[i], Y_pred[i], average='macro'))
    print('precision_score:',precision_score(Y_test[i], Y_pred[i], average="macro"))
    print('recall_score:',recall_score(Y_test[i], Y_pred[i], average="macro"),'\n')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
related
f1_scroe: 0.296423427529
precision_score: 0.367164595635
recall_score: 0.335470391587 

request
f1_scroe: 0.486145164774
precision_score: 0.644660337087
recall_score: 0.51317166574 

offer
f1_scroe: 0.498929663609
precision_score: 0.497863899908
recall_score: 0.5 

aid_related
f1_scroe: 0.490926051925
precision_score: 0.570292308644
recall_score: 0.532757225117 

medical_help
f1_scroe: 0.490291790674
precision_score: 0.558448269894
recall_score: 0.503862054813 

medical_products
f1_scroe: 0.490393411748
precision_score: 0.507618920159
recall_score: 0.500411750119 

search_and_rescue
f1_scroe: 0.49323436171
precision_score: 0.487095296274
recall_score: 0.499530148786 

security
f1_scroe: 0.49545804465
precision_score: 0.491372728661
recall_score: 0.499611861512 

military
f1_scroe: 0.500470356082
precision_score: 0.54682

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


precision_score: 0.494123931624
recall_score: 0.499845607534 

other_infrastructure
f1_scroe: 0.495613360012
precision_score: 0.62124419036
recall_score: 0.503109984244 

weather_related
f1_scroe: 0.525026301975
precision_score: 0.630205360508
recall_score: 0.544530654268 

floods
f1_scroe: 0.495971993409
precision_score: 0.587104940166
recall_score: 0.506715285875 

storm
f1_scroe: 0.495261582126
precision_score: 0.624263914392
recall_score: 0.510095340184 

fire
f1_scroe: 0.49712268856
precision_score: 0.494655672622
recall_score: 0.499614435534 

earthquake
f1_scroe: 0.570010498264
precision_score: 0.709411031474
recall_score: 0.552025041377 

cold
f1_scroe: 0.49432914127
precision_score: 0.489083969466
recall_score: 0.499688036188 

other_weather
f1_scroe: 0.495709090163
precision_score: 0.563582085398
recall_score: 0.504086912531 

direct_report
f1_scroe: 0.482394088114
precision_score: 0.637137898159
recall_score: 0.513139980372 



### 9. 导出模型为 pickle file

In [14]:
import pickle 
from sklearn.externals import joblib
from sklearn.svm import SVC
from sklearn import datasets



#1.保存成Python支持的文件格式Pickle
#在当前目录下可以看到svm.pickle
with open('pipeline_fix.pickle','wb') as fw:
    pickle.dump(pipeline_fix,fw)
#加载svm.pickle
with open('pipeline_fix.pickle','rb') as fr:
    new_pipeline_fix1 = pickle.load(fr)



# #2.保存成sklearn自带的文件格式Joblib
# joblib.dump(svm,'svm.pkl')
# #加载svm.pkl
# new_svm2 = joblib.load('svm.pkl')
# print (new_svm2.predict(X[0:1]))

### 10. Use this notebook to complete `train.py`
使用资源 (Resources)文件里附带的模板文件编写脚本，运行上述步骤，创建一个数据库，并基于用户指定的新数据集输出一个模型。