# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import sqlite3
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# load data from database
def load_data():
    conn = create_engine('sqlite:///InsertDatabaseName.db')
    df = pd.read_sql_table('InsertTableName', conn)
    #conn = sqlite3.connect('InsertDatabaseName.db')
    #df = pd.read_sql("SELECT * FROM InsertTableName", conn)
    X = df.message.values
    labels = df.columns[4:]
    y = df[labels].values
    return X, y, labels

X, y, labels = load_data()

### 2. Write a tokenization function to process your text data

In [5]:
labels

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [None]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [None]:
def tokenize(text):
    # raw string so that backslash escapes such as \( reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
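
The URL-replacement step of `tokenize` can be tried on its own, without downloading any NLTK data. The sketch below reuses the same pattern; the helper name `replace_urls` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Same pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder token."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text


# Hypothetical sample message, used only for illustration
sample = "Flooding reported, see http://example.com/map for details"
cleaned = replace_urls(sample)
print(cleaned)  # Flooding reported, see urlplaceholder for details
```

Replacing URLs before tokenizing keeps every link from becoming its own rare vocabulary entry.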

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
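
As a minimal sketch of what `MultiOutputClassifier` does, the example below fits it on small synthetic data (an assumption made purely for illustration; the notebook's real inputs are TF-IDF features). The wrapper trains one copy of the base estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data: 40 samples, 5 numeric features, 2 binary targets
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 5)
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0.5).astype(int),
    (X_toy[:, 1] > 0.5).astype(int),
])

# One RandomForestClassifier is fitted per target column
multi_clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
multi_clf.fit(X_toy, Y_toy)

Y_hat = multi_clf.predict(X_toy)
print(Y_hat.shape)                 # (40, 2)
print(len(multi_clf.estimators_))  # 2
```

`RandomForestClassifier` also accepts a two-dimensional target directly, which is why a pipeline that uses it without the wrapper still runs.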

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)

In [None]:
#pipeline = Pipeline([
#    ('vect', CountVectorizer(tokenizer=tokenize)),
#    ('tfidf', TfidfTransformer()),
#    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
#])

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the accuracy, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 
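
With multi-label targets, `accuracy_score` reports subset accuracy: a row counts as correct only if every label in it matches. The sketch below contrasts that with per-label accuracy, precision, and recall on a tiny hand-made example (the arrays are illustrative assumptions, not project data).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Four samples, two labels
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_hat = np.array([[1, 0], [1, 0], [0, 1], [1, 0]])

# Exact-match ratio: only rows 0 and 2 are fully correct
subset_acc = accuracy_score(y_true, y_hat)

# Fraction of correct predictions within each label column
per_label_acc = (y_true == y_hat).mean(axis=0)

# average=None returns one value per label instead of a single aggregate
per_label_precision = precision_score(y_true, y_hat, average=None, zero_division=0)
per_label_recall = recall_score(y_true, y_hat, average=None, zero_division=0)

print(subset_acc)           # 0.5
print(per_label_acc)        # [0.75 0.75]
print(per_label_precision)  # first label 2/3, second label 1.0
print(per_label_recall)     # [1.  0.5]
```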

In [None]:
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

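# f1_score needs an explicit `average` for multi-label targets, e.g. average='micro'
#print(f1_score(y_train, y_pred_train, average='micro'))
#print(f1_score(y_test, y_pred_test, average='micro'))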

In [None]:
print("------------------- Scores for training set ------------------------")
print(classification_report(y_train, y_pred_train, target_names=labels))
print("------------------- Scores for testing set ------------------------")
print(classification_report(y_test, y_pred_test, target_names=labels))

### 6. Improve your model
Use grid search to find better parameters. 
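
Grid search over a pipeline addresses each parameter as `<step name>__<parameter>`. The sketch below shows the mechanics on a toy pipeline; the synthetic data, the `scale` step, and the grid values are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=60, n_features=6, random_state=0)

toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])

# 2 x 2 = 4 parameter combinations, each evaluated with 3-fold cross-validation
param_grid = {
    'clf__n_estimators': [5, 10],
    'clf__min_samples_split': [2, 4],
}

search = GridSearchCV(toy_pipeline, param_grid=param_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(len(search.cv_results_['params']))  # 4
```

The number of fits grows multiplicatively with each added parameter, so on the full message dataset it is reasonable to start with a small grid.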

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [None]:
parameters = {
    #'vect__ngram_range': ((1, 1), (1, 2)),
    #'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000),
    #'tfidf__use_idf': (True, False),
    'clf__n_estimators': [100],
    #'clf__min_samples_split': [2, 3, 4]
}

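# classification_report returns a text report rather than a number, so it cannot
# serve as a scorer; micro-averaged F1 is used here as one possible choice
scorer = make_scorer(f1_score, average='micro')

grid_obj = GridSearchCV(pipeline, param_grid=parameters, scoring=scorer)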

In [None]:
grid_fit = grid_obj.fit(X_train, y_train)

In [None]:
#best_clf = grid_fit.best_estimator_
y_pred_train = grid_fit.predict(X_train)
y_pred_test = grid_fit.predict(X_test)

In [None]:
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [None]:
print("------------------- Scores for training set ------------------------")
print(classification_report(y_train, y_pred_train, target_names=labels))
print("------------------- Scores for testing set ------------------------")
print(classification_report(y_test, y_pred_test, target_names=labels))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
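
One way to add a feature alongside TF-IDF is `FeatureUnion`, which runs several transformers on the same input and concatenates their outputs column-wise. The sketch below adds a message-length feature; the `MessageLength` transformer and the sample messages are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: character count of each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per message, one column holding its length
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

docs = ["we need water", "road blocked", "need food and water"]
matrix = features.fit_transform(docs)
print(matrix.shape)  # (3, 8): 7 vocabulary columns plus 1 length column
```

The `features` object can then take the place of the `vect` and `tfidf` steps in a pipeline, followed by the classifier step.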

In [8]:
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df = pd.read_sql_table('disaster_messages_table', engine)

### 9. Export your model as a pickle file

In [7]:
df.shape

(26215, 40)

In [None]:
# sklearn.externals.joblib was removed from recent scikit-learn releases; import joblib directly
import joblib
joblib.dump(grid_obj, 'saved_model.pkl')
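
A saved model is only useful if it can be reloaded and gives the same predictions. The round trip below is a minimal sketch on a toy model written to a temporary directory; the data and the file name `toy_model.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on synthetic data
X_toy, y_toy = make_classification(n_samples=40, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)  # True
```

When the pipeline uses a custom function such as `tokenize`, that function has to be importable wherever the pickle is loaded, because pickle stores a reference to it rather than its code.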

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
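
The template file is not shown here, so the following is only a sketch of how such a script might read its inputs from the command line. The argument names `database_path` and `model_path` are hypothetical and may differ from the template.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a hypothetical train.py."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and save it to disk."
    )
    # Hypothetical argument names; adapt them to the provided template
    parser.add_argument('database_path', help='path to the SQLite database file')
    parser.add_argument('model_path', help='where to write the pickled model')
    return parser.parse_args(argv)


# Passing a list stands in for what a user would type on the command line
args = parse_args(['../data/DisasterResponse.db', 'saved_model.pkl'])
print(args.database_path)  # ../data/DisasterResponse.db
print(args.model_path)     # saved_model.pkl
```

Inside the script, the steps from this notebook (load data, build the pipeline, fit, evaluate, export) would then run in order using these two paths.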