# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import re
import pickle

import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load data from database
from sqlalchemy import inspect

engine = create_engine('sqlite:///DisasterResponse.db')

# engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.x
table_name = inspect(engine).get_table_names()[0]

df = pd.read_sql_table(table_name, engine)

cat = df.columns[4:]

X = df['message'].values
y = df[cat].values

In [3]:
df.iloc[:,4:].sum().sort_values(ascending=False).reset_index()

|   | index | 0 |
|---|---|---|
| 0 | related | 20282 |
| 1 | aid_related | 10860 |
| 2 | weather_related | 7297 |
| 3 | direct_report | 5075 |
| 4 | request | 4474 |
| 5 | other_aid | 3446 |
| 6 | food | 2923 |
| 7 | earthquake | 2455 |
| 8 | storm | 2443 |
| 9 | shelter | 2314 |


In [4]:
genre_counts = df.groupby('genre').count()['message']
list(genre_counts.index)

['direct', 'news', 'social']

### 2. Write a tokenization function to process your text data

In [5]:
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
stop_words = set(stopwords.words('english'))  # build once instead of once per token

def tokenize(text, lemmatizer=WordNetLemmatizer()):
    """Normalize, tokenize, and lemmatize a message; URLs become 'urlplaceholder'."""
    # Detect and replace URLs
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, 'urlplaceholder')

    # Remove all non-alphanumeric characters and tokenize text
    clean_tokens = word_tokenize(re.sub(r"[^a-zA-Z0-9]", " ", text.lower()))

    # Remove stopwords
    clean_tokens = [t for t in clean_tokens if t not in stop_words]

    # Lemmatize tokens
    clean_tokens = [lemmatizer.lemmatize(t) for t in clean_tokens]

    return clean_tokens
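
To see the URL step in isolation, here is a small standard-library sketch. The `normalize_urls` and `simple_tokens` helpers and the sample message are illustrative, not part of the notebook. The pattern is the one assigned to `url_regex`, written here as a raw string, and `re.sub` stands in for the `findall`/`replace` loop.

```python
import re

# Same pattern as url_regex, written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize_urls(text, placeholder='urlplaceholder'):
    """Replace every URL matched by URL_REGEX with a fixed placeholder token."""
    return re.sub(URL_REGEX, placeholder, text)

def simple_tokens(text):
    """Lowercase, blank out non-alphanumerics, split on whitespace (no NLTK)."""
    cleaned = re.sub(r'[^a-zA-Z0-9]', ' ', normalize_urls(text).lower())
    return cleaned.split()

sample = 'Water needed, see http://example.org/map?id=7 for location!'
tokens = simple_tokens(sample)
print(tokens)
```

Replacing the URL before stripping punctuation matters: otherwise the address would be split into fragments such as `http`, `example`, and `org`, each of which would become its own vocabulary entry.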

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
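
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a tiny made-up numeric dataset with two binary targets. The data and variable names are assumptions for demonstration only; the real pipeline feeds TF-IDF features and 36 target columns instead.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): 6 samples, 2 numeric features
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
# Two binary target columns, analogous to two message categories
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]])

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy)
print(pred.shape)            # one prediction column per target
print(len(clf.estimators_))  # one fitted estimator per target column
```

The wrapper fits one independent copy of the base estimator per target column, so training cost grows roughly linearly with the number of categories.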

In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(class_weight='balanced')))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [8]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None,
            verbose=0, warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [9]:
def report(y_true, y_pred):
    for i in range(0, len(cat)):
        print(cat[i])
        print("\tAccuracy: {:.4f}\t Precision: {:.4f}\t Recall: {:.4f}\t F1_score: {:.4f}".format(
            accuracy_score(y_true[:, i], y_pred[:, i]),
            precision_score(y_true[:, i], y_pred[:, i], average='weighted'),
            recall_score(y_true[:, i], y_pred[:, i], average='weighted'),
            f1_score(y_true[:, i], y_pred[:, i], average='weighted')
        ))
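
The `report` helper passes `average='weighted'`, which averages the per-class scores by class frequency. One consequence is that weighted recall always equals accuracy, which is why those two columns match on every line of the output. The standard-library sketch below recomputes the weighted scores by hand on a made-up imbalanced column; `weighted_scores` and the data are illustrative only.

```python
def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision and recall for one label column."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = 0.0
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        weight = support / n
        # A class that is never predicted contributes 0 to precision
        precision += weight * (tp / predicted if predicted else 0.0)
        recall += weight * (tp / support)
    return accuracy, precision, recall

# Made-up imbalanced column: 9 negatives, 1 positive, model always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10
acc, prec, rec = weighted_scores(y_true, y_pred)
print(acc, prec, rec)
```

Here the single positive example is missed entirely, so the positive class has a recall of 0, yet the weighted recall is still 0.9. For rare categories, per-class numbers (for example from `classification_report`) are likely to be more informative than the weighted averages.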

In [10]:
y_pred = pipeline.predict(X_test)

In [11]:
pipeline.get_params()

{'clf': MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
             criterion='gini', max_depth=None, max_features='auto',
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
             verbose=0, warm_start=False),
            n_jobs=1),
 'clf__estimator': RandomForestClassifier(bootstrap=True, class_weight='balanced',
             criterion='gini', max_depth=None, max_features='auto',
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
             verbose=0, warm_start=False),
 'clf__estimator__bootstrap': True,
 'clf__estimat

In [12]:
report(y_test, y_pred)

related
	Accuracy: 0.8026	 Precision: 0.7925	 Recall: 0.8026	 F1_score: 0.7952
request
	Accuracy: 0.8874	 Precision: 0.8788	 Recall: 0.8874	 F1_score: 0.8764
offer
	Accuracy: 0.9950	 Precision: 0.9900	 Recall: 0.9950	 F1_score: 0.9925
aid_related
	Accuracy: 0.7486	 Precision: 0.7465	 Recall: 0.7486	 F1_score: 0.7451
medical_help
	Accuracy: 0.9255	 Precision: 0.9041	 Recall: 0.9255	 F1_score: 0.9004
medical_products
	Accuracy: 0.9538	 Precision: 0.9465	 Recall: 0.9538	 F1_score: 0.9369
search_and_rescue
	Accuracy: 0.9753	 Precision: 0.9759	 Recall: 0.9753	 F1_score: 0.9632
security
	Accuracy: 0.9817	 Precision: 0.9640	 Recall: 0.9817	 F1_score: 0.9728
military
	Accuracy: 0.9663	 Precision: 0.9556	 Recall: 0.9663	 F1_score: 0.9549
child_alone
	Accuracy: 1.0000	 Precision: 1.0000	 Recall: 1.0000	 F1_score: 1.0000
water
	Accuracy: 0.9536	 Precision: 0.9476	 Recall: 0.9536	 F1_score: 0.9439
food
	Accuracy: 0.9284	 Precision: 0.9223	 Recall: 0.9284	 F1_score: 0.9178
shelter
	Accuracy: 0.9281



	Accuracy: 0.9852	 Precision: 0.9808	 Recall: 0.9852	 F1_score: 0.9791
money
	Accuracy: 0.9785	 Precision: 0.9789	 Recall: 0.9785	 F1_score: 0.9684
missing_people
	Accuracy: 0.9878	 Precision: 0.9760	 Recall: 0.9878	 F1_score: 0.9819
refugees
	Accuracy: 0.9644	 Precision: 0.9512	 Recall: 0.9644	 F1_score: 0.9490
death
	Accuracy: 0.9608	 Precision: 0.9578	 Recall: 0.9608	 F1_score: 0.9458
other_aid
	Accuracy: 0.8699	 Precision: 0.8385	 Recall: 0.8699	 F1_score: 0.8225
infrastructure_related
	Accuracy: 0.9326	 Precision: 0.8817	 Recall: 0.9326	 F1_score: 0.9022
transport
	Accuracy: 0.9545	 Precision: 0.9433	 Recall: 0.9545	 F1_score: 0.9355
buildings
	Accuracy: 0.9525	 Precision: 0.9411	 Recall: 0.9525	 F1_score: 0.9357
electricity
	Accuracy: 0.9814	 Precision: 0.9727	 Recall: 0.9814	 F1_score: 0.9728
tools
	Accuracy: 0.9939	 Precision: 0.9878	 Recall: 0.9939	 F1_score: 0.9909
hospitals
	Accuracy: 0.9893	 Precision: 0.9788	 Recall: 0.9893	 F1_score: 0.9840
shops
	Accuracy: 0.9965	 Precis

### 6. Improve your model
Use grid search to find better parameters. 
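
Before launching a search it helps to know how many fits it implies. The sketch below enumerates the same grid as the `parameters` dict in the following cell using only `itertools`; the `combos` and `n_folds` names are illustrative.

```python
from itertools import product

# Same grid as the `parameters` dict used for GridSearchCV
parameters = {
    'vect__min_df': [1, 10],
    'vect__lowercase': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__estimator__min_samples_split': [2, 5],
    'clf__estimator__n_estimators': [10, 20],
}

# Every combination of one value per parameter
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 2  # matches cv=2
print(len(combos))            # number of candidate settings
print(len(combos) * n_folds)  # pipeline fits during the search
```

Five parameters with two values each give 32 candidates, so `cv=2` means 64 pipeline fits, plus one final refit on the full training set when `refit=True`. Each added value multiplies the total, which is why the grid is kept small.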

In [13]:
parameters = {
    'vect__min_df': [1, 10],
    'vect__lowercase': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__estimator__min_samples_split': [2, 5],
    'clf__estimator__n_estimators': [10, 20]
}

# Use all CPU cores (n_jobs=-1); the search is very slow otherwise
cv = GridSearchCV(pipeline, param_grid=parameters, cv=2, n_jobs=-1)

In [14]:
cv.fit(X_train, y_train)

GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None,
            verbose=0, warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__min_df': [1, 10], 'vect__lowercase': [True, False], 'tfidf__smooth_idf': [True, False], 'clf__estimator__min_samples_split': [2, 5], 'clf__estimator__n_estimators': [10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [15]:
y_pred = cv.predict(X_test)

In [16]:
report(y_test, y_pred)

related
	Accuracy: 0.8164	 Precision: 0.8106	 Recall: 0.8164	 F1_score: 0.8113
request
	Accuracy: 0.8897	 Precision: 0.8847	 Recall: 0.8897	 F1_score: 0.8865
offer
	Accuracy: 0.9945	 Precision: 0.9890	 Recall: 0.9945	 F1_score: 0.9918
aid_related
	Accuracy: 0.7653	 Precision: 0.7670	 Recall: 0.7653	 F1_score: 0.7659
medical_help
	Accuracy: 0.9278	 Precision: 0.9137	 Recall: 0.9278	 F1_score: 0.9162
medical_products
	Accuracy: 0.9556	 Precision: 0.9438	 Recall: 0.9556	 F1_score: 0.9438
search_and_rescue
	Accuracy: 0.9765	 Precision: 0.9702	 Recall: 0.9765	 F1_score: 0.9663
security
	Accuracy: 0.9831	 Precision: 0.9713	 Recall: 0.9831	 F1_score: 0.9753
military
	Accuracy: 0.9681	 Precision: 0.9602	 Recall: 0.9681	 F1_score: 0.9622
child_alone
	Accuracy: 1.0000	 Precision: 1.0000	 Recall: 1.0000	 F1_score: 1.0000
water
	Accuracy: 0.9596	 Precision: 0.9560	 Recall: 0.9596	 F1_score: 0.9567
food
	Accuracy: 0.9385	 Precision: 0.9347	 Recall: 0.9385	 F1_score: 0.9352
shelter
	Accuracy: 0.9365



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
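
One possible way to add a feature besides TF-IDF is a custom transformer combined with the text features through `FeatureUnion`. The sketch below is an assumption about how that could look, not the notebook's method: `MessageLength` is a hypothetical transformer, and the default `CountVectorizer` tokenizer is used so that the example runs without NLTK.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: whitespace-separated word count per message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])

# TF-IDF columns and the length column are stacked side by side
features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

messages = ['we need water', 'storm destroyed the bridge near town']
matrix = features.fit_transform(messages)
print(matrix.shape)
```

In a full pipeline, `features` would take the place of the separate `vect` and `tfidf` steps in front of `clf`, and grid-search keys would gain a prefix such as `features__text__vect__min_df`.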

In [23]:
ada_parameters = {
    'vect__min_df': [1, 20, 50, 100],
    'vect__lowercase': [True, False],
    'tfidf__smooth_idf': [True, False],
    'tfidf__sublinear_tf': [True, False],
    'clf__estimator__algorithm': ['SAMME', 'SAMME.R'],
    'clf__estimator__n_estimators': [1, 5, 10, 50] 
}

In [24]:
ada_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=10)))
])

# Use all CPU cores (n_jobs=-1); the search is very slow otherwise
ada_cv = GridSearchCV(ada_pipeline, param_grid=ada_parameters, cv=2, n_jobs=-1)

In [25]:
ada_pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fe97f5aba60>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=1.0, n_estimators=50, random_state=10),
              n_jobs=1))],
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=

In [26]:
ada_cv.fit(X_train, y_train)

GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=10),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__min_df': [1, 20, 50, 100], 'vect__lowercase': [True, False], 'tfidf__smooth_idf': [True, False], 'tfidf__sublinear_tf': [True, False], 'clf__estimator__algorithm': ['SAMME', 'SAMME.R'], 'clf__estimator__n_estimators': [1, 5, 10, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [27]:
y_pred = ada_cv.predict(X_test)

In [28]:
report(y_test, y_pred)

related
	Accuracy: 0.7864	 Precision: 0.7569	 Recall: 0.7864	 F1_score: 0.7431
request
	Accuracy: 0.8834	 Precision: 0.8744	 Recall: 0.8834	 F1_score: 0.8746
offer
	Accuracy: 0.9940	 Precision: 0.9909	 Recall: 0.9940	 F1_score: 0.9920
aid_related
	Accuracy: 0.7611	 Precision: 0.7621	 Recall: 0.7611	 F1_score: 0.7557
medical_help
	Accuracy: 0.9318	 Precision: 0.9193	 Recall: 0.9318	 F1_score: 0.9190
medical_products
	Accuracy: 0.9603	 Precision: 0.9530	 Recall: 0.9603	 F1_score: 0.9537
search_and_rescue
	Accuracy: 0.9774	 Precision: 0.9710	 Recall: 0.9774	 F1_score: 0.9708
security
	Accuracy: 0.9823	 Precision: 0.9720	 Recall: 0.9823	 F1_score: 0.9757
military
	Accuracy: 0.9687	 Precision: 0.9618	 Recall: 0.9687	 F1_score: 0.9637
child_alone
	Accuracy: 1.0000	 Precision: 1.0000	 Recall: 1.0000	 F1_score: 1.0000
water
	Accuracy: 0.9587	 Precision: 0.9555	 Recall: 0.9587	 F1_score: 0.9565
food
	Accuracy: 0.9434	 Precision: 0.9411	 Recall: 0.9434	 F1_score: 0.9418
shelter
	Accuracy: 0.9442

### 9. Export your model as a pickle file

In [29]:
# The with-block closes the file on exit, so no explicit close() is needed
with open('classifier.pkl', 'wb') as pkl_file:
    pickle.dump(cv, pkl_file)
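
A quick round trip is a cheap check that an export can be read back. The sketch below uses a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the working directory; both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way
model = {'name': 'demo', 'params': {'n_estimators': 10}}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(path, 'wb') as pkl_file:  # closed automatically when the block exits
    pickle.dump(model, pkl_file)

with open(path, 'rb') as pkl_file:
    restored = pickle.load(pkl_file)

print(restored == model)
```

Note that pickle stores functions by reference rather than by value, so loading the real pipeline elsewhere requires `tokenize` to be importable under the same module name it had when the model was saved.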

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
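
The template itself is not shown here, so the following is only a guess at the overall shape of such a script. The argument names `database_filepath` and `model_filepath` and the helper functions are hypothetical, and the notebook steps are indicated by a comment rather than implemented.

```python
import argparse

def parse_args(argv=None):
    """Parse hypothetical command-line arguments for a train.py script."""
    parser = argparse.ArgumentParser(description='Train and export the classifier.')
    parser.add_argument('database_filepath',
                        help='SQLite database holding the cleaned messages')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would run here in order:
    # load data -> build pipeline -> fit -> report -> pickle.dump
    print('Loading data from', args.database_filepath)
    print('Saving model to', args.model_filepath)
    return args

# Example invocation with explicit arguments instead of sys.argv
args = main(['DisasterResponse.db', 'classifier.pkl'])
```

Passing `argv` explicitly keeps the entry point easy to test; a real script would typically call `main()` with no arguments so that `argparse` reads from the command line.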