<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Import-libraries-and-load-data-from-database." data-toc-modified-id="1.-Import-libraries-and-load-data-from-database.-1">1. Import libraries and load data from database.</a></span></li><li><span><a href="#2.-Write-a-tokenization-function-to-process-your-text-data" data-toc-modified-id="2.-Write-a-tokenization-function-to-process-your-text-data-2">2. Write a tokenization function to process your text data</a></span></li><li><span><a href="#3.-Build-a-machine-learning-pipeline" data-toc-modified-id="3.-Build-a-machine-learning-pipeline-3">3. Build a machine learning pipeline</a></span></li><li><span><a href="#4.-Train-pipeline" data-toc-modified-id="4.-Train-pipeline-4">4. Train pipeline</a></span></li><li><span><a href="#5.-Test-your-model" data-toc-modified-id="5.-Test-your-model-5">5. Test your model</a></span></li><li><span><a href="#Results-with-5%-sample-of-the-data" data-toc-modified-id="Results-with-5%-sample-of-the-data-6">Results with 5% sample of the data</a></span></li><li><span><a href="#Results-with-10%-sample-of-the-data:" data-toc-modified-id="Results-with-10%-sample-of-the-data:-7">Results with 10% sample of the data:</a></span></li><li><span><a href="#Conclusion-/-Choice-of-classifier" data-toc-modified-id="Conclusion-/-Choice-of-classifier-8">Conclusion / Choice of classifier</a></span></li><li><span><a href="#6.-Improve-your-model" data-toc-modified-id="6.-Improve-your-model-9">6. Improve your model</a></span></li><li><span><a href="#7.-Test-your-model" data-toc-modified-id="7.-Test-your-model-10">7. Test your model</a></span></li><li><span><a href="#Results-after-GridSearchCV:" data-toc-modified-id="Results-after-GridSearchCV:-11">Results after GridSearchCV:</a></span></li><li><span><a href="#8.-Try-improving-your-model-further.-Here-are-a-few-ideas:" data-toc-modified-id="8.-Try-improving-your-model-further.-Here-are-a-few-ideas:-12">8. Try improving your model further. 
Here are a few ideas:</a></span></li><li><span><a href="#9.-Export-your-model-as-a-pickle-file" data-toc-modified-id="9.-Export-your-model-as-a-pickle-file-13">9. Export your model as a pickle file</a></span></li><li><span><a href="#10.-Use-this-notebook-to-complete-train.py" data-toc-modified-id="10.-Use-this-notebook-to-complete-train.py-14">10. Use this notebook to complete <code>train.py</code></a></span></li></ul></div>

# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import time
from datetime import datetime
from IPython.display import display, HTML
from sqlalchemy import create_engine
import pickle
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split,  GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score, recall_score, precision_score

# tokenize() also needs the stopwords corpus
nltk.download(["wordnet", "punkt", "stopwords"])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\thoma\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thoma\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
display(HTML("<style>.container { width:90% !important; }</style>"))

pd.set_option("display.float_format", lambda x: "%.2f" % x)
pd.set_option('display.max_colwidth', 500)
pd.set_option("display.max_rows", 1000)

In [10]:
# load data from database
engine = create_engine('sqlite:///disaster_response.db')
df = pd.read_sql_table("disaster_messages", engine)
# optionally load only a sample of the data for faster processing:
# df = df.sample(frac=0.1, random_state=1)
X = df["message"].values
df_targets = df.drop(labels=["id", "message", "original", "genre"], axis=1)
Y = df_targets.values
features = df_targets.columns
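
To make the shapes of `X`, `Y` and `features` concrete, here is a minimal sketch on a made-up DataFrame. The column layout (`id`, `message`, `original`, `genre`, then one column per category) follows the cell above; the two category columns and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the "disaster_messages" table (all values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "road is blocked", "thank you"],
    "original": [None, None, None],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

X_toy = toy["message"].values                      # 1-D array of raw texts
targets = toy.drop(columns=["id", "message", "original", "genre"])
Y_toy = targets.values                             # 2-D array: samples x categories
category_names = list(targets.columns)

print(X_toy.shape, Y_toy.shape, category_names)
```

Each row of `Y_toy` holds the full label vector of one message, which is the layout a multi-output classifier expects.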

In [29]:
# visualizations for the web app
import plotly.graph_objects as go

df_temp = df.iloc[:,4:].sum(axis=0)

fig = go.Figure(data=[go.Bar(
    x=df_temp.index,
    y=df_temp.values,
)])
fig.update_layout(title_text='Category counts')

In [35]:
# correlation matrix
corr = df.iloc[:,4:].corr()

fig = go.Figure(data=[go.Heatmap(
                   z=corr.values,
                   x=corr.index,
                   y=corr.columns,
                   hoverongaps = False
)])
fig.show()

### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """
    function: 
        converts a message into lemmatized, lower-case word tokens without punctuation or stopwords
    args:
        text(str): message to be classified later
    returns:
        lemmed(list of str): lemmatized, lower-case word tokens of the message without punctuation or stopwords
    """
    # replace punctuation with spaces and lower-case the text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    words = word_tokenize(text)
    # build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # lemmatization: reduce words to their root form
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w) for w in words]
    
    return lemmed
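
The normalization and stopword steps can be tried without any nltk downloads. The sketch below is a simplified stand-in, not the function above: it uses `str.split` instead of `word_tokenize`, a tiny hypothetical stopword set instead of nltk's English list, and it skips lemmatization.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses nltk's full English list instead.
TOY_STOPWORDS = {"the", "is", "a", "we", "to"}

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

def toy_tokenize(text):
    """Whitespace-split the normalized text and drop the toy stopwords (no lemmatization)."""
    return [w for w in normalize(text).split() if w not in TOY_STOPWORDS]

tokens = toy_tokenize("We need WATER, food & shelter!")
print(tokens)
```

Punctuation and case are gone after `normalize`, so `"WATER,"` and `"water"` end up as the same token before TF-IDF weighting.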

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

See section 5 ("Test your model"), where the pipeline is built for each candidate classifier.
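
As a rough illustration of the structure, the sketch below wires a TF-IDF vectorizer into a `MultiOutputClassifier` on six made-up messages with two made-up categories. It uses the vectorizer's default tokenizer rather than `tokenize()`, and `n_neighbors=1` is chosen only because the toy data set is so small.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Made-up messages and two made-up binary categories: [water, transport].
texts = np.array([
    "we need water", "no drinking water here", "water supply is gone",
    "road is blocked", "bridge collapsed on the road", "the road is flooded",
])
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("tfidfvect", TfidfVectorizer()),  # default tokenizer, not tokenize()
    ("clf", MultiOutputClassifier(KNeighborsClassifier(n_neighbors=1))),
])
toy_pipeline.fit(texts, labels)

pred = toy_pipeline.predict(["is there any water"])
print(pred.shape)  # one row per message, one column per category
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, so the prediction always has as many columns as `labels`.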

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

See section 5 ("Test your model"), where the data is split and each pipeline is trained.
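
One detail worth considering when splitting: without a fixed `random_state`, every call to `train_test_split` draws a different split, so classifiers trained after separate calls are not scored on the same rows. The sketch below, on made-up data, shows that a fixed seed makes the split repeatable; the seed `42` and `test_size=0.25` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 "messages" (just indices here) with 3 binary categories each.
X_demo = np.arange(20)
Y_demo = np.random.RandomState(0).randint(0, 2, size=(20, 3))

# Fixing random_state makes the split repeatable, so every classifier can be
# trained and scored on exactly the same rows.
split_a = train_test_split(X_demo, Y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, Y_demo, test_size=0.25, random_state=42)

X_train_a, X_test_a, Y_train_a, Y_test_a = split_a
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(len(X_train_a), len(X_test_a), same_split)
```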

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [5]:
def show_results(Y_test, Y_pred, features):
    """
    function: prints summary statistics after fitting a model and predicting with it
    args:
        Y_test(numpy.ndarray): test labels from train_test_split
        Y_pred(numpy.ndarray): predicted labels
        features(list of str): names of the target categories, one per column of Y_test
    return:
        df_res(DataFrame): DataFrame containing the classification report data for each category
    """
    
    # subset_accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in Y_test
    subset_accuracy = accuracy_score(Y_test, Y_pred)
    # overall_accuracy: share of individual labels predicted correctly
    overall_accuracy = (Y_pred == Y_test).mean()
    print("subset_accuracy: {:.3f}\noverall_accuracy: {:.3f}".format(subset_accuracy, overall_accuracy))

    # create a results dataframe containing the classification reports for all columns (multiindexed)
    df_res = pd.DataFrame()
    for name, y_test, y_pred in zip(features, Y_test.transpose(), Y_pred.transpose()):
        df_temp = pd.DataFrame.from_dict(classification_report(y_test, y_pred, output_dict=True, zero_division=1))
        df_temp = pd.concat([df_temp], axis=1, keys=[name]) # add column name as additional level
        df_res = pd.concat([df_res, df_temp], axis=1)

    # overall mean classification report values:
    display(df_res.transpose().mean())
    # display(df_res.transpose())
    return df_res
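
The two accuracy figures printed by `show_results` can differ a lot, which explains why the results below show a subset accuracy around 0.2 next to an overall accuracy above 0.9. A minimal sketch with made-up labels (4 messages, 3 categories) shows the difference using plain NumPy:

```python
import numpy as np

# Made-up ground truth and predictions: 4 messages, 3 categories.
Y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0]])
Y_hat = np.array([[1, 0, 0],   # all three labels correct
                  [0, 1, 1],   # one label wrong
                  [0, 0, 0],   # one label wrong
                  [1, 0, 0]])  # one label wrong

# Subset accuracy: a row counts only if every label in it is correct.
subset_acc = (Y_true == Y_hat).all(axis=1).mean()
# Element-wise accuracy: fraction of individual labels that are correct.
elementwise_acc = (Y_true == Y_hat).mean()

print(subset_acc, elementwise_acc)
```

Only one of four rows is fully correct, while nine of twelve individual labels are. With 36 categories per message, a single wrong label is enough to fail the subset criterion.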

In [None]:
list_classifier = [MLPClassifier(), KNeighborsClassifier(), GaussianNB(), RandomForestClassifier(), AdaBoostClassifier()]

print("Starting Script Execution at {}".format(datetime.now()))

for clf in list_classifier:
    start_time = time.time()
    print(str(clf))
    pipeline = Pipeline([
            ('tfidfvect', TfidfVectorizer(tokenizer = tokenize)),
            ('to_dense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)), # dense ndarray so that GaussianNB works
            ('clf',  MultiOutputClassifier(clf))
        ])

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y) 

    # train pipeline/classifier
    pipeline.fit(X_train, Y_train)

    # predict on test data
    Y_pred = pipeline.predict(X_test)
    
    df_res = show_results(Y_test, Y_pred, features)
    
    print("--- {:.0f}s seconds to apply {} ---\n".format((time.time() - start_time), str(clf)))
    
print("End of ML Pipeline Execution at {}".format(datetime.now()))

### Results with 5% sample of the data
Starting Script Execution at 2021-09-26 19:12:45.116248

MLPClassifier()
subset_accuracy: 0.183
overall_accuracy: 0.932
precision     0.88
recall        0.72
f1-score      0.73
support     199.20
dtype: float64
--- 474s seconds to apply MLPClassifier() ---

KNeighborsClassifier()
subset_accuracy: 0.192
overall_accuracy: 0.934
precision     0.87
recall        0.71
f1-score      0.71
support     198.09
dtype: float64
--- 484s seconds to apply KNeighborsClassifier() ---

GaussianNB()
subset_accuracy: 0.091
overall_accuracy: 0.895
precision     0.68
recall        0.69
f1-score      0.68
support     198.08
dtype: float64
--- 495s seconds to apply GaussianNB() ---

RandomForestClassifier()
subset_accuracy: 0.146
overall_accuracy: 0.927
precision     0.92
recall        0.69
f1-score      0.69
support     198.09
dtype: float64
--- 551s seconds to apply RandomForestClassifier() ---

AdaBoostClassifier()
subset_accuracy: 0.146
overall_accuracy: 0.934
precision     0.79
recall        0.76
f1-score      0.76
support     199.20
dtype: float64
--- 691s seconds to apply AdaBoostClassifier() ---

### Results with 10% sample of the data:
Starting Script Execution at 2021-09-26 19:28:05.530925

MLPClassifier()
subset_accuracy: 0.213
overall_accuracy: 0.937
precision     0.81
recall        0.73
f1-score      0.74
support     398.21
dtype: float64
--- 1106s seconds to apply MLPClassifier() ---

KNeighborsClassifier()
subset_accuracy: 0.201
overall_accuracy: 0.934
precision     0.85
recall        0.72
f1-score      0.72
support     398.21
dtype: float64
--- 23s seconds to apply KNeighborsClassifier() ---

GaussianNB()
subset_accuracy: 0.119
overall_accuracy: 0.890
precision     0.69
recall        0.68
f1-score      0.68
support     395.98
dtype: float64
--- 26s seconds to apply GaussianNB() ---

RandomForestClassifier()
subset_accuracy: 0.232
overall_accuracy: 0.941
precision     0.92
recall        0.71
f1-score      0.71
support     395.99
dtype: float64
--- 257s seconds to apply RandomForestClassifier() ---

AdaBoostClassifier()
subset_accuracy: 0.180
overall_accuracy: 0.938
precision     0.80
recall        0.75
f1-score      0.76
support     395.99
dtype: float64
--- 493s seconds to apply AdaBoostClassifier() ---

### Conclusion / Choice of classifier
- MLPClassifier needs far too much time (1106 s on the 10% sample) for results that are only close to the others
- GaussianNB has the worst results (lowest accuracy, precision and f1-score in both runs)
- RandomForestClassifier and AdaBoostClassifier have good results but take very long (257 s and 493 s on the 10% sample); their subset accuracy improved with the bigger sample size
- KNeighborsClassifier is very fast (23 s on the 10% sample) and also has very good results

- KNeighborsClassifier will be used for further tuning

### 6. Improve your model
Use grid search to find better parameters. 

In [6]:
clf = KNeighborsClassifier()

start_time = time.time()
print(str(clf))
pipeline = Pipeline([
        ('tfidfvect', TfidfVectorizer(tokenizer = tokenize)),
        ('clf',  MultiOutputClassifier(clf))
    ])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y) 

# train pipeline/classifier
pipeline.fit(X_train, Y_train)

# predict on test data
Y_pred = pipeline.predict(X_test)

print("--- {:.0f}s seconds to apply {} ---\n".format((time.time() - start_time), str(clf)))

df_res = show_results(Y_test, Y_pred, features)

KNeighborsClassifier()
--- 18s seconds to apply KNeighborsClassifier() ---

subset_accuracy: 0.229
overall_accuracy: 0.934


precision     0.88
recall        0.71
f1-score      0.72
support     395.99
dtype: float64

In [17]:
display(pipeline.get_params())

params_pick = {
        #'tfidfvect__use_idf': (True, False),
        'clf__estimator__n_neighbors': [3, 5, 8]
}

cv = GridSearchCV(pipeline, param_grid = params_pick)

{'memory': None,
 'steps': [('tfidfvect',
   TfidfVectorizer(tokenizer=<function tokenize at 0x000001837BADA700>)),
  ('clf', MultiOutputClassifier(estimator=KNeighborsClassifier()))],
 'verbose': False,
 'tfidfvect': TfidfVectorizer(tokenizer=<function tokenize at 0x000001837BADA700>),
 'clf': MultiOutputClassifier(estimator=KNeighborsClassifier()),
 'tfidfvect__analyzer': 'word',
 'tfidfvect__binary': False,
 'tfidfvect__decode_error': 'strict',
 'tfidfvect__dtype': numpy.float64,
 'tfidfvect__encoding': 'utf-8',
 'tfidfvect__input': 'content',
 'tfidfvect__lowercase': True,
 'tfidfvect__max_df': 1.0,
 'tfidfvect__max_features': None,
 'tfidfvect__min_df': 1,
 'tfidfvect__ngram_range': (1, 1),
 'tfidfvect__norm': 'l2',
 'tfidfvect__preprocessor': None,
 'tfidfvect__smooth_idf': True,
 'tfidfvect__stop_words': None,
 'tfidfvect__strip_accents': None,
 'tfidfvect__sublinear_tf': False,
 'tfidfvect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidfvect__tokenizer': <function __main__.tokeniz
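
The key `clf__estimator__n_neighbors` in `params_pick` follows scikit-learn's double-underscore naming for nested parameters. The sketch below rebuilds a comparable pipeline (with the default tokenizer, as a simplification) and shows that the key appears in `get_params()` and can be set directly with `set_params`, which follows the same naming scheme.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

demo_pipeline = Pipeline([
    ("tfidfvect", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier())),
])

# "<step>__<attribute>__<parameter>": step "clf" -> its "estimator" -> "n_neighbors".
key = "clf__estimator__n_neighbors"
print(key in demo_pipeline.get_params())  # the key must exist to be searchable

# set_params accepts the same nested key that appears in param_grid.
demo_pipeline.set_params(**{key: 8})
print(demo_pipeline.named_steps["clf"].estimator.n_neighbors)
```

Checking a key against `get_params()` before starting a long grid search is a cheap way to catch typos in `param_grid`.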

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [18]:
cv.fit(X_train, Y_train)
Y_pred = cv.predict(X_test)
df_res = show_results(Y_test, Y_pred, features)

subset_accuracy: 0.253
overall_accuracy: 0.936


precision     0.92
recall        0.71
f1-score      0.71
support     395.99
dtype: float64

In [16]:
print(cv.best_params_)

{'clf__estimator__n_neighbors': 8}


### Results after GridSearchCV:
subset_accuracy: 0.277
overall_accuracy: 0.943
precision     0.93
recall        0.72
f1-score      0.72
support     395.99

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

See the classifier comparison in section 5 above.

### 9. Export your model as a pickle file

In [None]:
# save model as pickle file
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv, f)
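
Loading the file back works the same way in reverse. The sketch below round-trips a small dictionary as a stand-in for the fitted `cv` object, writing into a temporary directory so nothing is left behind; the stand-in and its contents are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted GridSearchCV object; any picklable object behaves the same.
model_stub = {"best_params": {"clf__estimator__n_neighbors": 8}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")
    with open(path, "wb") as f:
        pickle.dump(model_stub, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stub)
```

Note that pickle stores functions by reference (module and name), not by value. The real model references `tokenize`, so the script that loads `classifier.pkl` most likely needs `tokenize` to be available under the same module name it had when the model was saved.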

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.