# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
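
The three steps above can be sketched on a toy table. This is a minimal sketch, not the notebook's own loader: the table name `toy_messages`, its columns, and its rows are made up for illustration, and `pd.read_sql_query` over a plain `sqlite3` connection is swapped in for `read_sql_table` so that the example does not need SQLAlchemy.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the real message table
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "storm damaged the road", "thank you"],
    "original": ["", "", ""],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "storm": [0, 1, 0],
})

# Write the toy table to an in-memory SQLite database, then read it back
conn = sqlite3.connect(":memory:")
toy.to_sql("toy_messages", conn, index=False)
df_toy = pd.read_sql_query("SELECT * FROM toy_messages", conn)
conn.close()

# Feature: the raw message text. Targets: every column after the first four.
X_toy = df_toy["message"].values
label_names = df_toy.columns[4:]
Y_toy = df_toy[label_names].values

print(X_toy.shape, Y_toy.shape)
```

Slicing the label columns by position only works while the column order stays fixed; selecting them by name is an alternative if the schema may change.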

In [1]:
# import libraries
import sqlite3
import pandas as pd
from sqlalchemy import create_engine

In [4]:

def load_data():
    """Load the messages (X), the category matrix (y) and the category names."""
    engine = create_engine('sqlite:///../data/DisasterResponse.db')
    df = pd.read_sql_table('disaster_messages_table', engine)
    X = df.message.values
    # the first four columns are not categories; the remaining 36 are the labels
    labels = df.columns[4:]
    y = df[labels].values
    return X, y, labels

X, y, labels = load_data()

### 2. Write a tokenization function to process your text data

In [5]:
labels

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [6]:
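# number of 'news' messages with related == 1
# NOTE: df is local to load_data(), so it is undefined here and this cell raises NameError
df.groupby(['genre', 'related']).count()['message']['news'][1]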

NameError: name 'df' is not defined

In [7]:
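# count the messages labelled 1 in each category, overall and within the 'news' genre
category_counts = list()
category_counts_news = list()
for label in labels:
    try:
        category_counts.append(df.groupby(label).count()['message'][1])
        category_counts_news.append(df.groupby(['genre', label]).count()['message']['news'][1])
    except Exception:
        # falls back to 0 on any error; this also hides the NameError raised
        # while df is undefined, which is why every count below is 0
        category_counts.append(0)
        category_counts_news.append(0)
print(category_counts)
print(category_counts_news)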

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
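
The per-category counts attempted above can also be computed without a loop or a `try`/`except`. The sketch below is an assumption-laden illustration: the toy DataFrame and its two label columns are invented, and it assumes every label column holds only 0 or 1, so that a column sum equals the number of positive messages.

```python
import pandas as pd

# Hypothetical toy data: two label columns, two genres
toy = pd.DataFrame({
    "message": ["need water", "storm coming", "water please", "hello"],
    "genre": ["news", "news", "direct", "direct"],
    "water": [1, 0, 1, 0],
    "storm": [0, 1, 0, 0],
})
label_cols = ["water", "storm"]

# Positive messages per category over the whole table
counts_all = toy[label_cols].sum()

# Positive messages per category within the 'news' genre only
counts_news = toy.groupby("genre")[label_cols].sum().loc["news"]

print(counts_all.to_dict())
print(counts_news.to_dict())
```

Because nothing is caught silently here, a missing DataFrame or column raises immediately instead of producing a row of zeros.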


In [8]:
# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /Users/bhalder/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/bhalder/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
def tokenize(text):
    # raw string so that \( and \) are not treated as (invalid) string escapes
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
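
To see the URL replacement and normalisation steps in isolation, here is a simplified, dependency-free stand-in for `tokenize`. It is only a sketch: `simple_tokenize` and `URL_PATTERN` are hypothetical names, a plain regular expression replaces `word_tokenize`, and no lemmatization is performed, so its output will differ from the NLTK version on real text.

```python
import re

# Simplified URL pattern (an assumption; looser than the notebook's url_regex)
URL_PATTERN = re.compile(r"https?://\S+")


def simple_tokenize(text):
    """Replace URLs with a placeholder, then split into lowercase word tokens."""
    text = URL_PATTERN.sub("urlplaceholder", text)
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


tokens = simple_tokenize("Water needed at http://example.com/map NOW")
print(tokens)  # ['water', 'needed', 'at', 'urlplaceholder', 'now']
```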

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
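
As a rough illustration of that hint, the sketch below wraps a random forest in `MultiOutputClassifier` inside a small text pipeline. The four messages and two label columns are invented, and the default `CountVectorizer` tokenizer is used instead of the notebook's `tokenize` to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: columns are [water, storm]
toy_X = ["need water", "storm coming", "water please", "big storm"]
toy_Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # one random forest is fitted per label column
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_X, toy_Y)

toy_pred = toy_pipeline.predict(["we need water"])
print(toy_pred.shape)  # one row, one column per label
```

`RandomForestClassifier` also accepts a 2-D target directly, so the wrapper is optional for this estimator; it becomes necessary for estimators that only support a single target.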

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)

In [12]:
#pipeline = Pipeline([
#    ('vect', CountVectorizer(tokenizer=tokenize)),
#    ('tfidf', TfidfTransformer()),
#    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
#])

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
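
One detail worth sketching for the split step is reproducibility. The example below uses invented data and assumes a fixed `random_state` of 42 and a 25% test share; neither value comes from the notebook, which calls `train_test_split` with its defaults.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 2 binary labels each
toy_X = np.array(["message %d" % i for i in range(20)])
toy_Y = np.random.RandomState(0).randint(0, 2, size=(20, 2))

# Splitting twice with the same random_state gives the same partition
split_a = train_test_split(toy_X, toy_Y, test_size=0.25, random_state=42)
split_b = train_test_split(toy_X, toy_Y, test_size=0.25, random_state=42)

X_train_a, X_test_a, Y_train_a, Y_test_a = split_a
identical = all((a == b).all() for a, b in zip(split_a, split_b))

print(len(X_train_a), len(X_test_a), identical)
```

Without a fixed `random_state`, each run draws a different split, so scores from separate runs are not directly comparable.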

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

### 5. Test your model
Report the accuracy, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 
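
When reading accuracy for multi-label targets, it helps to separate per-label accuracy from exact-match accuracy. The sketch below uses invented arrays to show the difference: called on 2-D label matrices, `accuracy_score` returns the share of rows where every label matches, which can be far lower than the accuracy of any single label.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted label matrices (4 messages, 2 labels)
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_hat = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])

# Exact-match (subset) accuracy: a row counts only if all labels are right
subset_accuracy = accuracy_score(y_true, y_hat)

# Per-label accuracy: each column is scored on its own
per_label_accuracy = (y_true == y_hat).mean(axis=0)

print(subset_accuracy)              # 0.5
print(per_label_accuracy.tolist())  # [0.75, 0.75]
```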

In [14]:
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)

In [26]:
from sklearn.metrics import accuracy_score

# accuracy of each category column on the test set
print("Accuracy Scores for all the categories are as follows - ")
for i, label in enumerate(labels):
    print("{}: {}".format(label, accuracy_score(y_test[:, i], y_pred_test[:, i])))

Accuracy Scores for all the categories are as follows - 
related: 0.7966127555691181
request: 0.878547451937748
offer: 0.9952700640830028
aid_related: 0.6981995727799817
medical_help: 0.9231003967043027
medical_products: 0.9533109551418981
search_and_rescue: 0.9734513274336283
security: 0.9798596277082697
military: 0.9670430271589868
child_alone: 1.0
water: 0.9412572474824534
food: 0.9070796460176991
shelter: 0.9189807750991761
clothing: 0.9884040280744584
money: 0.9769606347268843
missing_people: 0.9882514494964907
refugees: 0.9668904485810192
death: 0.9563625267012511
other_aid: 0.8643576441867562
infrastructure_related: 0.9346963686298444
transport: 0.9516325907842539
buildings: 0.9517851693622216
electricity: 0.9780286847726579
tools: 0.9938968568812938
hospitals: 0.9909978638999084
shops: 0.9957277998169057
aid_centers: 0.988556606652426
other_infrastructure: 0.9540738480317363
weather_related: 0.8051571559353067
floods: 0.9270674397314617
storm: 0.9102837961550199
fire: 0.9909978

In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
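# with 2-D multi-label targets, accuracy_score is the subset accuracy:
# a message counts as correct only if all of its labels are predicted correctly
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))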

#print(f1_score(y_train, y_pred_train))
#print(f1_score(y_test, y_pred_test))

0.885865418849499
0.259231003967043


In [16]:
print("------------------- Scores for training set ------------------------")
print(classification_report(y_train, y_pred_train, target_names=labels))
print("------------------- Scores for testing set ------------------------")
print(classification_report(y_test, y_pred_test, target_names=labels))

------------------- Scores for training set ------------------------
                        precision    recall  f1-score   support

               related       0.99      1.00      1.00     15061
               request       1.00      0.93      0.96      3360
                 offer       1.00      0.78      0.88        87
           aid_related       1.00      0.96      0.98      8192
          medical_help       1.00      0.83      0.91      1583
      medical_products       1.00      0.81      0.90      1001
     search_and_rescue       1.00      0.78      0.88       553
              security       1.00      0.73      0.84       341
              military       1.00      0.79      0.88       644
           child_alone       0.00      0.00      0.00         0
                 water       1.00      0.90      0.95      1225
                  food       1.00      0.92      0.96      2199
               shelter       1.00      0.88      0.94      1764
              clothing       1.00  

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### 6. Improve your model
Use grid search to find better parameters. 
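
A minimal sketch of a grid search over a text pipeline follows. The data, the candidate values for `clf__n_estimators`, the two-fold cross-validation, and the micro-averaged F1 scorer are all assumptions chosen to keep the example small; they are not the settings this notebook settles on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: columns are [water, storm]
toy_X = ["need water", "storm coming", "water please", "big storm",
         "no water here", "storm damage", "send water", "storm warning"]
toy_Y = np.array([[1, 0], [0, 1]] * 4)

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__n_estimators": [5, 10]}

# A scorer must return a single number; micro-averaged F1 is one option
toy_scorer = make_scorer(f1_score, average="micro")

search = GridSearchCV(toy_pipeline, param_grid=param_grid, scoring=toy_scorer, cv=2)
search.fit(toy_X, toy_Y)

print(search.best_params_)
```

Every added parameter value multiplies the number of fits, so on the full dataset it is usually worth starting with a small grid.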

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

parameters = {
        #'vect__ngram_range': ((1, 1), (1, 2)),
        #'vect__max_df': (0.5, 0.75, 1.0),
        #'vect__max_features': (None, 5000, 10000),
        #'tfidf__use_idf': (True, False),
        'clf__n_estimators': [100],
        #'clf__min_samples_split': [2, 3, 4]
    }

# classification_report returns text, not a number, so it cannot serve as a scorer;
# micro-averaged F1 is used here as one reasonable single-number metric
scorer = make_scorer(f1_score, average='micro')

grid_obj = GridSearchCV(pipeline, param_grid=parameters, scoring=scorer)

In [None]:
grid_fit = grid_obj.fit(X_train, y_train)

In [None]:
#best_clf = grid_fit.best_estimator_
y_pred_train = grid_fit.predict(X_train)
y_pred_test = grid_fit.predict(X_test)

In [None]:
print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [None]:
print("------------------- Scores for training set ------------------------")
print(classification_report(y_train, y_pred_train, target_names=labels))
print("------------------- Scores for testing set ------------------------")
print(classification_report(y_test, y_pred_test, target_names=labels))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
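
For the second idea, one common pattern is to combine TF-IDF with a hand-written feature through `FeatureUnion`. The sketch below is only an illustration: `MessageLength` is a hypothetical transformer that counts whitespace-separated words, and nothing in this notebook shows that such a feature improves the model.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])


# Stack the TF-IDF columns and the length column side by side
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

toy_X = ["we need water", "storm"]
combined = features.fit_transform(toy_X)
print(combined.shape)  # 4 TF-IDF terms + 1 length column
```

The union can replace the vectorizer steps at the start of a `Pipeline`, with the classifier step left unchanged.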

In [8]:
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df = pd.read_sql_table('disaster_messages_table', engine)

### 9. Export your model as a pickle file

In [7]:
df.shape

(26215, 40)

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib
joblib.dump(grid_obj, 'saved_model.pkl')
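
A saved model is only useful if it can be loaded again, so a quick round trip is a cheap check. This sketch assumes the standalone `joblib` package is installed; the toy pipeline (with `LogisticRegression` swapped in for speed) and the temporary file name are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy model standing in for the tuned pipeline
toy_X = ["need water", "storm coming", "water please", "big storm"]
toy_y = [1, 0, 1, 0]
toy_model = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
toy_model.fit(toy_X, toy_y)

# Save to a temporary file, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    restored = joblib.load(model_path)

same_predictions = bool((restored.predict(toy_X) == toy_model.predict(toy_X)).all())
print(same_predictions)
```

A pipeline built with a custom tokenizer function can only be loaded where that function is importable under the same module path.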

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
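
The template file itself is not shown here, so the following is only a guess at how such a script might accept its inputs. The function name `parse_args` and the argument names `database_filepath` and `model_filepath` are hypothetical, and the training steps are left as comments.

```python
import argparse


def parse_args(argv=None):
    """Read the database path and the output model path from the command line."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it."
    )
    parser.add_argument("database_filepath", help="SQLite database to read from")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes the parser easy to try out
args = parse_args(["../data/DisasterResponse.db", "saved_model.pkl"])

# In a full script, the notebook's steps would follow in order:
#   1. load X, y and the label names from args.database_filepath
#   2. build the pipeline and run the grid search
#   3. print the evaluation report
#   4. dump the fitted model to args.model_filepath
print(args.database_filepath, args.model_filepath)
```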