# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import re
import numpy as np
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, multilabel_confusion_matrix
from joblib import dump

In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('categorized_messages', engine)

In [3]:
# check whether any column contains only a single value
df.nunique()[df.nunique()==1]

child_alone    1
dtype: int64

In [4]:
# dropping columns with no value for the model and defining features X and targets y

X = df['message']
y = df.drop(columns=['id','message','original','genre', 'child_alone'])
category_names = y.columns.tolist()

### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    '''
    INPUT - text (string)
    OUTPUT - cleaned tokens (list of strings)
    
    This function tokenizes and cleanses a string by the following steps:
        1. any url will be replaced by the string 'urlplaceholder'
        2. any punctuation and capitalization will be removed
        3. the text will be tokenized
        4. any stopword will be removed
        5. each remaining word will be lemmatized
    '''
    
    # get list of all urls using regex
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    
    # replace each url in text string with placeholder
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')

    # Remove punctuation characters
    text = re.sub(r'[^a-zA-Z0-9äöüÄÖÜß ]', '', text.lower())
    
    # tokenize
    tokens = word_tokenize(text)
    
    # Remove stop words (build the set once instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokens = [tok for tok in tokens if tok not in stop_words]

    # instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()

    # lemmatize
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
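
To see what the first two cleaning steps do before NLTK gets involved, the following minimal sketch applies the same URL and punctuation patterns to a made-up message. The helper name `clean_text` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

# Same patterns as in tokenize() above
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
NON_ALNUM_REGEX = r'[^a-zA-Z0-9äöüÄÖÜß ]'


def clean_text(text):
    """Hypothetical helper: replace URLs, lowercase, and strip punctuation."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, 'urlplaceholder')
    return re.sub(NON_ALNUM_REGEX, '', text.lower())


# Made-up sample message
sample = 'Water needed! See http://example.com/help now.'
cleaned = clean_text(sample)
print(cleaned)  # water needed see urlplaceholder now
```

Note that the URL pattern also accepts trailing punctuation such as a full stop, so a URL at the very end of a sentence would swallow the closing period into the placeholder.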

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 in this notebook, because `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
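
As a rough illustration of how `MultiOutputClassifier` fits one classifier per target column, here is a minimal sketch on made-up data. The toy messages, the two label columns and the small `n_estimators` value are assumptions chosen only to keep the example fast; the default word tokenizer is used instead of the notebook's `tokenize` function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two made-up binary categories: [water, medical]
toy_messages = np.array([
    'we need drinking water',
    'water supply is gone',
    'doctor needed urgently',
    'send medical help and a doctor',
    'no water and no doctor here',
    'everything is fine',
])
toy_labels = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 1],
    [0, 0],
])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# One fitted forest per label column
n_fitted = len(toy_pipeline.named_steps['clf'].estimators_)

# Predictions have one row per message and one column per category
toy_pred = toy_pipeline.predict(np.array(['please bring water', 'is there a doctor']))
print(n_fitted, toy_pred.shape)
```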

In [6]:
def init_pipelines():
    '''
    Input - None
    Output - list of ML pipelines
    
    This function instantiates three ML pipelines, each consisting of a CountVectorizer that uses the tokenize function defined above, a TfidfTransformer and an ML classifier. 
    The first element in the list uses a RandomForestClassifier, the second a GradientBoostingClassifier and the third an AdaBoostClassifier. 
    '''

    #Random Forest Classifier Pipeline
    pipe_rfc = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

    #Gradient Boost Classifier Pipeline
    pipe_gbc = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(GradientBoostingClassifier()))
    ])

    #ADA Boost Classifier Pipeline 
    pipe_abc = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(AdaBoostClassifier()))
    ])

    pipelines = [pipe_rfc, pipe_gbc, pipe_abc]
    
    return pipelines

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
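
The split step can be tried out on toy data first. The sketch below is only an illustration with made-up arrays; it uses the same `test_size=.3` and `random_state=42` as the notebook and shows that features and targets stay aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up messages and a made-up two-column binary target
toy_X = np.array(['message {}'.format(i) for i in range(10)])
toy_y = np.arange(20).reshape(10, 2) % 2

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=.3, random_state=42)

# 70 % of the rows go to training, 30 % to testing
print(len(X_tr), len(X_te), y_tr.shape, y_te.shape)
```

Fixing `random_state` makes the split reproducible, which matters here because the notebook repeats the split before the grid search and relies on getting the same test set.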

In [7]:
def train_pipeline(pipeline, X_train_array, y_train_array, n_jobs=-1):
    '''
    INPUT:
    pipeline - ML pipeline
    X_train_array - a np-array of training features
    y_train_array - a np-array of training responses
    n_jobs - number of workers to be used by the classifier (default = -1)
    
    OUTPUT:
    pipeline - a fit pipeline
    
    This function fits the input pipeline to the input X and y array. 
    '''

    #setting pipeline parameters
    if 'clf__estimator__n_jobs' in pipeline.get_params():
        pipeline.set_params(clf__estimator__n_jobs=n_jobs)
    
    #fitting pipeline
    pipeline.fit(X_train_array, y_train_array)
    
    return pipeline

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
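
For reference, the three metrics can be computed by hand for a single binary category. The following is a minimal sketch with made-up labels; `binary_prf` is a hypothetical helper, not a scikit-learn function, and it returns 0.0 for undefined ratios in the same spirit as `zero_division=0`.

```python
def binary_prf(y_true, y_pred):
    """Hypothetical helper: precision, recall and F1 for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Report 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = binary_prf(toy_true, toy_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

A category for which the model never predicts the positive class, such as `offer` in the reports of this notebook, ends up with 0.00 in all three columns for value 1.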

In [8]:
def eval_pipeline(pipeline, X_test_array, y_test, column_names):
    '''
    INPUT:
    pipeline - ML pipeline
    X_test_array - a np-array of testing features
    y_test - a DataFrame of testing responses
    column_names - a list of the category (target) column names
    
    OUTPUT:
    None; a classification report for each category is printed out. 
    '''
    
    #predicting y_test
    y_pred = pd.DataFrame(pipeline.predict(X_test_array), columns=column_names)
        
    #classifier used: class name of the estimator wrapped by the MultiOutputClassifier
    classifier_name = type(pipeline.named_steps['clf'].estimator).__name__
    
    #printing classification report
    print('_________________________________ {} ___________________________________________________\n'.format(classifier_name))
    for col in y_test.columns:
        print('{}\n{}'.format(col, classification_report(y_test[col], y_pred[col], zero_division=0)))
        
    print('_______________________________________________________________________________________')
    print('Classification report in total')
    print(classification_report(y_test.melt().value, y_pred.melt().value, zero_division=0))
        
    print('____________________________________________________________________________________\n')

In [9]:
#instantiation of the pipelines
pipelines = init_pipelines()

#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
#converting of train and test df into np-array
X_train_array, y_train_array, X_test_array = np.array(X_train), np.array(y_train), np.array(X_test)

#creation of a list of feature names
column_names = y_train.columns.tolist()

#fitting and evaluation of each pipeline in pipeline list
for pipe in pipelines:
    pipe = train_pipeline(pipe, X_train_array, y_train_array)
    eval_pipeline(pipe, X_test_array, y_test, column_names)

_________________________________ RandomForestClassifier ___________________________________________________

related
              precision    recall  f1-score   support

           0       0.70      0.42      0.52      1853
           1       0.84      0.95      0.89      6012

    accuracy                           0.82      7865
   macro avg       0.77      0.68      0.71      7865
weighted avg       0.81      0.82      0.80      7865

request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      6552
           1       0.84      0.49      0.62      1313

    accuracy                           0.90      7865
   macro avg       0.87      0.74      0.78      7865
weighted avg       0.90      0.90      0.89      7865

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7829
           1       0.00      0.00      0.00        36

    accuracy                           1.00      7

### 6. Improve your model
Use grid search to find better parameters. 
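
Because every candidate is refit once per fold, it helps to estimate the size of a search before starting it. The sketch below expands grids shaped like the ones in this notebook with `itertools.product`; `expand_grid` is a hypothetical helper, and the 5 folds are an assumption taken from the "Fitting 5 folds" line in the log output.

```python
from itertools import product

# Grids copied from the notebook cell (renamed to avoid clashing with it)
grid_rfc = {'tfidf__use_idf': [True],
            'clf__estimator__n_estimators': [100, 200],
            'clf__estimator__min_samples_split': [2, 5],
            'clf__estimator__n_jobs': [-1]}
grid_gbc = {'tfidf__use_idf': [True, False],
            'clf__estimator__n_estimators': [100, 200],
            'clf__estimator__learning_rate': [.05, .1]}


def expand_grid(param_grid):
    """Hypothetical helper: list every parameter combination in a grid."""
    keys = sorted(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*(param_grid[k] for k in keys))]


N_FOLDS = 5  # assumption: matches "Fitting 5 folds" in the log
fits_rfc = len(expand_grid(grid_rfc)) * N_FOLDS
fits_gbc = len(expand_grid(grid_gbc)) * N_FOLDS
print(fits_rfc, fits_gbc)  # 20 40
```

These counts agree with the "20 out of 20" and "40 out of 40" lines in the log; at several minutes per fit, each extra value in a grid adds a noticeable amount of run time.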

In [10]:
#creation of parameter dictionaries to be searched by grid search

params_rfc = {'tfidf__use_idf': [True],
              'clf__estimator__n_estimators':[100,200],
              'clf__estimator__min_samples_split':[2,5],
              'clf__estimator__n_jobs':[-1]
}

params_gbc = {'tfidf__use_idf': [True,False],
              'clf__estimator__n_estimators': [100, 200],
              'clf__estimator__learning_rate': [.05, .1]
}

params_abc = {'tfidf__use_idf': [True,False],
              'clf__estimator__n_estimators': [200, 300],
              'clf__estimator__learning_rate': [.15, .2]
}

params_gs = [params_rfc, params_gbc, params_abc]

In [11]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)
#converting of train and test df into np-array
X_train_array, y_train_array, X_test_array = np.array(X_train), np.array(y_train), np.array(X_test)
    
#creation of list of feature names
column_names = y_train.columns.tolist()

#instantiation of pipelines
pipelines = init_pipelines()
cv_list = []
results = []

#performing a grid search for each pipeline in the pipeline list. the results and best estimators are appended to the corresponding lists. 
for i, pipe in enumerate(pipelines):
    print(pipe)
    cv = GridSearchCV(pipe, param_grid=params_gs[i], verbose=3, return_train_score=True)
    cv.fit(X_train_array, y_train_array)
    cv_list.append(cv)
    results.append(cv.cv_results_)
    eval_pipeline(cv.best_estimator_, X_test_array, y_test, column_names)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True, score=(train=0.995, test=0.272), total= 7.8min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 10.2min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True, score=(train=0.994, test=0.268), total= 4.6min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 16.8min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True, score=(train=0.994, test=0.257), total= 4.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True, score=(train=0.994, test=0.265), total= 4.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100, clf__estimator__n_jobs=-1, tfidf__use_idf=True, score=(train=0.994, test=0.267), total= 5.6min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=200, clf__estimator__n_jobs=-1, tfidf__use_idf=True 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=200, clf__estimator__n_jobs=-1, tfidf__u

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 191.2min finished


_________________________________ RandomForestClassifier ___________________________________________________

related
              precision    recall  f1-score   support

           0       0.70      0.42      0.53      1853
           1       0.84      0.95      0.89      6012

    accuracy                           0.82      7865
   macro avg       0.77      0.68      0.71      7865
weighted avg       0.81      0.82      0.81      7865

request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      6552
           1       0.83      0.50      0.62      1313

    accuracy                           0.90      7865
   macro avg       0.87      0.74      0.78      7865
weighted avg       0.89      0.90      0.89      7865

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7829
           1       0.00      0.00      0.00        36

    accuracy                           1.00      7

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True, score=(train=0.267, test=0.231), total= 7.8min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 10.2min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True, score=(train=0.273, test=0.223), total= 7.7min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 20.2min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True, score=(train=0.273, test=0.232), total= 7.7min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True, score=(train=0.270, test=0.229), total= 7.8min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=True, score=(train=0.264, test=0.245), total= 7.6min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=False 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=False, score=(train=0.269, test=0.238), total= 7.5min
[CV] clf__estimator__learning_rate=0.05, clf__estimator__n_estimators=100, tfidf__use_idf=False 
[CV]  clf__estimator__learning_rate=0

[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 411.5min finished


_________________________________ GradientBoostingClassifier ___________________________________________________

related
              precision    recall  f1-score   support

           0       0.73      0.24      0.36      1853
           1       0.81      0.97      0.88      6012

    accuracy                           0.80      7865
   macro avg       0.77      0.60      0.62      7865
weighted avg       0.79      0.80      0.76      7865

request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      6552
           1       0.83      0.50      0.62      1313

    accuracy                           0.90      7865
   macro avg       0.87      0.74      0.78      7865
weighted avg       0.89      0.90      0.89      7865

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7829
           1       0.00      0.00      0.00        36

    accuracy                      

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True, score=(train=0.258, test=0.250), total= 6.5min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.8min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True, score=(train=0.260, test=0.235), total= 5.5min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 16.3min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True, score=(train=0.262, test=0.246), total= 5.4min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True 
[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True, score=(train=0.265, test=0.249), total= 6.0min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True 
[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=True, score=(train=0.259, test=0.262), total= 6.5min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=False 
[CV]  clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=False, score=(train=0.259, test=0.251), total= 6.0min
[CV] clf__estimator__learning_rate=0.15, clf__estimator__n_estimators=200, tfidf__use_idf=False 
[CV]  clf__estimator__learning_rate=0

[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 423.2min finished


_________________________________ AdaBoostClassifier ___________________________________________________

related
              precision    recall  f1-score   support

           0       0.74      0.22      0.34      1853
           1       0.80      0.98      0.88      6012

    accuracy                           0.80      7865
   macro avg       0.77      0.60      0.61      7865
weighted avg       0.79      0.80      0.75      7865

request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      6552
           1       0.83      0.48      0.61      1313

    accuracy                           0.90      7865
   macro avg       0.87      0.73      0.77      7865
weighted avg       0.89      0.90      0.88      7865

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7829
           1       0.00      0.00      0.00        36

    accuracy                           0.

In [12]:
#printout of each best estimator
for i in range(len(cv_list)):
    print(cv_list[i].best_estimator_)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))])
Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer(use_idf=False)),
                ('clf',
                 MultiOutputClassifier(estimator=GradientBoostingClassifier(n_estimators=200)))])
Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer(use_idf=False)),
                ('clf',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(learning_rate=0.2,
                                                                    n_estimators=300)))])


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!
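
When comparing numbers, note that the `[CV]` test scores of roughly 0.25 in the grid-search log and the pooled accuracy of about 0.95 in the report further down are likely different metrics. To my understanding, `GridSearchCV` without a `scoring` argument falls back on the estimator's own `score` method, which for a multi-output classifier counts a message as correct only if every category is predicted correctly. The following minimal sketch on made-up labels shows how far the two views can diverge.

```python
# Made-up true and predicted labels: 4 messages, 3 categories
toy_true = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
toy_pred = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
]

# Exact-match (subset) accuracy: a row counts only if every label is right
subset_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

# Pooled accuracy: flatten all label cells, similar in spirit to y_test.melt().value
flat_true = [v for row in toy_true for v in row]
flat_pred = [v for row in toy_pred for v in row]
pooled_accuracy = sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true)

print(subset_accuracy, pooled_accuracy)  # 0.25 0.75
```

Only one of the four rows matches completely, yet nine of the twelve individual labels are correct.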

In [13]:
#assignment of each best estimator
rf_clf = cv_list[0].best_estimator_
gb_clf = cv_list[1].best_estimator_
ab_clf = cv_list[2].best_estimator_

best_estimators = [rf_clf, gb_clf, ab_clf]

In [14]:
#pooled classification report for each best estimator to determine the overall best one
for best_estimator in best_estimators:
    print(best_estimator)
    y_pred = pd.DataFrame(best_estimator.predict(X_test_array), columns=y_train.columns.tolist())
    print()
    print(classification_report(y_test.melt().value, y_pred.melt().value, zero_division=0))
    print('_______________________________________________________________________________________')
    print()

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(n_jobs=-1)))])

              precision    recall  f1-score   support

           0       0.95      0.99      0.97    250388
           1       0.82      0.53      0.64     24887

    accuracy                           0.95    275275
   macro avg       0.89      0.76      0.81    275275
weighted avg       0.94      0.95      0.94    275275

_______________________________________________________________________________________

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000022A5EAC5E50>)),
                ('tfidf', TfidfTransformer(use_idf=False)),
                ('clf',
                 MultiOutputClassifier(estimator=GradientBoostingClassifier(n_estimators=200)))])

              pr

## Model evaluation

The pooled classification reports show slightly better scores for the positive class (value = 1), especially in precision, recall and f1-score, for the Gradient Boosting Classifier. 
This model is therefore chosen as the final model to implement.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [15]:
#exporting the overall best estimator (gb_clf)
dump(gb_clf, r'.\gb_classifier.joblib') 

['.\\gb_classifier.joblib']
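
A saved model is only useful if it can be loaded again. The sketch below shows a `dump`/`load` round trip with joblib; the dictionary is just a stand-in for a fitted pipeline, and the file name and temporary directory are assumptions made for the example.

```python
import os
import tempfile

from joblib import dump, load

# Stand-in for a fitted pipeline (assumption for illustration)
toy_model = {'name': 'placeholder for a fitted pipeline', 'n_estimators': 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join keeps the path portable across operating systems
    model_path = os.path.join(tmp_dir, 'toy_classifier.joblib')
    dump(toy_model, model_path)
    restored_model = load(model_path)

print(restored_model == toy_model)  # True
```

When loading the real pipeline, the `tokenize` function has to be importable under the same name, because the pickled `CountVectorizer` stores only a reference to it, not its code.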

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
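
One possible shape for such a script is sketched below. The argument names `database_filepath` and `model_filepath`, as well as the placeholder step list, are assumptions for illustration; the actual template in the Resources folder may be organised differently.

```python
import argparse


def parse_args(argv=None):
    """Read the two hypothetical command-line arguments."""
    parser = argparse.ArgumentParser(description='Train and export a message classifier.')
    parser.add_argument('database_filepath', help='path to the SQLite database with the cleaned messages')
    parser.add_argument('model_filepath', help='path where the fitted model should be saved')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps; a real script would call the notebook's functions here
    steps = [
        'load data from {}'.format(args.database_filepath),
        'build pipeline',
        'train pipeline',
        'evaluate pipeline',
        'save model to {}'.format(args.model_filepath),
    ]
    return steps


# Example call with made-up paths instead of real command-line input
planned_steps = main(['DisasterResponse.db', 'classifier.joblib'])
for step in planned_steps:
    print(step)
```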