# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import pickle
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.svm import SVC

import warnings

warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to /Users/cheehli/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cheehli/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/cheehli/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///message.db')
df = pd.read_sql_table('message', engine)

# Feature: raw message text; targets: the category columns (column 4 onward)
X = df['message']
Y = df.iloc[:, 4:]
Y = Y.fillna(0).astype(int)

### 2. Write a tokenization function to process your text data

In [3]:

def tokenize(text):
    """
    Description:
    Tokenize the input text
    
    Arguments:
    text: string. Message string
       
    Returns:
    stemmed: list of strings. Stemmed word tokens with stop words removed
    """
    # Lowercase the text and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize message string
    tokens = word_tokenize(text)
    
    # Remove stop words, then stem the remaining tokens
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
    
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return stemmed
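
To see what each step of `tokenize` does without downloading any NLTK data, the sketch below reproduces the normalization and stop word filtering in plain Python. The stop word set is a tiny hypothetical stand-in for `stopwords.words("english")`, whitespace splitting stands in for `word_tokenize`, and stemming is omitted, so the result only approximates what the real function returns.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop word list
TOY_STOP_WORDS = {"we", "are", "in", "of", "and", "the", "is", "a"}


def toy_tokenize(text):
    """Lowercase, replace non-alphanumeric characters, split, drop stop words."""
    # Same regular expression as in tokenize(): anything that is not a
    # letter or digit becomes a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Whitespace split (assumption: stands in for word_tokenize)
    tokens = text.split()
    return [word for word in tokens if word not in TOY_STOP_WORDS]


example = "We are in need of water, food and shelter!"
print(toy_tokenize(example))  # prints: ['need', 'water', 'food', 'shelter']
```

Stop words are filtered before stemming in the real function as well, so the comparison is made against the unstemmed token.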

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X, Y)
X_train_counts.shape

(26153, 34894)

In [5]:
count_vect.vocabulary_.get(u'update')

32816
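
The number returned by `vocabulary_.get` is the column index of that term in the count matrix. As a rough sketch of how such a mapping comes about, the code below builds a vocabulary from three made-up messages using the default token pattern shown in the `CountVectorizer` repr later in this notebook, and assigns indices in sorted term order. It is a plain-Python approximation under those assumptions, not scikit-learn's implementation.

```python
import re

# Made-up messages for illustration
docs = ["Water needed now", "Need food and water", "Update on the storm"]

# Default CountVectorizer token pattern: words of two or more characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

tokens_per_doc = [token_pattern.findall(doc.lower()) for doc in docs]

# Collect the unique terms and give each one a column index in sorted order
unique_terms = sorted({term for tokens in tokens_per_doc for term in tokens})
vocabulary = {term: index for index, term in enumerate(unique_terms)}

print(vocabulary.get('update'))   # column index of the term 'update'
print(vocabulary.get('missing'))  # None: the term never occurs in the toy corpus
```

Because indices follow sorted order, a term starting with "u" lands near the end of the columns, which is consistent with `'update'` mapping to 32816 out of 34894 features above.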

In [6]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(26153, 34894)
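
As a rough illustration of what `TfidfTransformer` does to the count matrix, the sketch below applies the weighting that scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit length. The three-document count matrix is made up, and the helper name `tfidf_rows` is hypothetical.

```python
import math

# Toy term-count matrix: 3 documents (rows) x 3 terms (columns), invented here
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]


def tfidf_rows(counts):
    """Apply smoothed idf weighting and L2 row normalization to raw counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Leave all-zero rows untouched to avoid dividing by zero
        result.append([v / norm for v in weighted] if norm > 0 else weighted)
    return result


tfidf = tfidf_rows(counts)
for row in tfidf:
    print([round(v, 3) for v in row])
```

The transform reweights values but keeps the matrix dimensions, which is why both cells above report the same shape.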

In [7]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, Y['related'])
clf

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [9]:
# Fit one Multinomial Naive Bayes classifier per target column
Y1 = Y.values  # label matrix, shape (n_samples, n_outputs)
n_samples, n_features = X_train_tfidf.shape
n_outputs = Y1.shape[1]

nb = MultinomialNB()
multi_target_nb = MultiOutputClassifier(nb, n_jobs=-1)
preds = multi_target_nb.fit(X_train_tfidf, Y1).predict(X_train_tfidf)
preds

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])


In [11]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [12]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 42)

np.random.seed(42)
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [29]:
def evaluate_metrics(actual, predicted, col_names):
    """Evaluate metrics for ML model
    
    Args:
    actual: array. Actual labels.
    predicted: array. Predicted labels.
    col_names: strings. Category names.
       
    Returns:
    metrics_df: dataframe. Dataframe containing the accuracy, precision, recall and f1 score.
    """
    metrics = []
    
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i])
        recall = recall_score(actual[:, i], predicted[:, i])
        f1 = f1_score(actual[:, i], predicted[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return metrics_df
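
The tables printed in this notebook show the same value in all four metric columns. That pattern is what micro-averaging produces when every sample has exactly one label, because micro-averaged precision, recall and F1 all reduce to accuracy. The default `average='binary'` setting instead scores only the positive class, which can look very different on rare categories. The sketch below computes both by hand on made-up labels; the helper names are hypothetical and it does not call scikit-learn.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1, pooling counts over all classes."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in pairs if t == label and p == label)
        fp += sum(1 for t, p in pairs if t != label and p == label)
        fn += sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up rare category: 2 positives in 20 samples, classifier predicts all zeros
y_true = [1, 1] + [0] * 18
y_pred = [0] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("binary  :", binary_scores(y_true, y_pred))
print("micro   :", micro_scores(y_true, y_pred))
```

In this toy case the all-zeros classifier scores about 0.9 on every micro-averaged metric while its positive-class precision, recall and F1 are all 0, so the share of positive labels in each category is worth keeping in mind when reading the tables.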

In [14]:
# Calculate evaluation metrics for training set
Y_train_pred = pipeline.predict(X_train)
col_names = list(Y.columns.values)

print(evaluate_metrics(np.array(Y_train), Y_train_pred, col_names))

                        Accuracy  Precision    Recall        F1
related                 0.996074   0.996074  0.996074  0.996074
request                 0.996125   0.996125  0.996125  0.996125
offer                   1.000000   1.000000  1.000000  1.000000
aid_related             0.995208   0.995208  0.995208  0.995208
medical_help            0.998776   0.998776  0.998776  0.998776
medical_products        0.998725   0.998725  0.998725  0.998725
search_and_rescue       0.999847   0.999847  0.999847  0.999847
security                0.999796   0.999796  0.999796  0.999796
military                0.999541   0.999541  0.999541  0.999541
water                   0.998521   0.998521  0.998521  0.998521
food                    0.998063   0.998063  0.998063  0.998063
shelter                 0.997910   0.997910  0.997910  0.997910
clothing                0.999592   0.999592  0.999592  0.999592
money                   0.999490   0.999490  0.999490  0.999490
missing_people          0.999898   0.999

In [15]:
# Calculate evaluation metrics for test set
Y_test_pred = pipeline.predict(X_test)

eval_metrics0 = evaluate_metrics(np.array(Y_test), Y_test_pred, col_names)
print(eval_metrics0)

                        Accuracy  Precision    Recall        F1
related                 0.749044   0.749044  0.749044  0.749044
request                 0.823826   0.823826  0.823826  0.823826
offer                   0.996330   0.996330  0.996330  0.996330
aid_related             0.567365   0.567365  0.567365  0.567365
medical_help            0.919254   0.919254  0.919254  0.919254
medical_products        0.952286   0.952286  0.952286  0.952286
search_and_rescue       0.971555   0.971555  0.971555  0.971555
security                0.978437   0.978437  0.978437  0.978437
military                0.964979   0.964979  0.964979  0.964979
water                   0.936076   0.936076  0.936076  0.936076
food                    0.885762   0.885762  0.885762  0.885762
shelter                 0.910384   0.910384  0.910384  0.910384
clothing                0.986542   0.986542  0.986542  0.986542
money                   0.978131   0.978131  0.978131  0.978131
missing_people          0.988377   0.988

In [16]:
# Calculate the proportion of rows in each column that have label == 1
Y.sum()/len(Y)

related                   0.761442
request                   0.171300
offer                     0.004550
aid_related               0.415937
medical_help              0.079800
medical_products          0.050243
search_and_rescue         0.027683
security                  0.018009
military                  0.032883
water                     0.064008
food                      0.112033
shelter                   0.088671
clothing                  0.015524
money                     0.023095
missing_people            0.011433
refugees                  0.033495
death                     0.045731
other_aid                 0.131840
infrastructure_related    0.065193
transport                 0.045999
buildings                 0.051046
electricity               0.020418
tools                     0.006080
hospitals                 0.010821
shops                     0.004588
aid_centers               0.011815
other_infrastructure      0.044010
weather_related           0.279203
floods              

### 6. Improve your model
Use grid search to find better parameters. 

In [17]:
# Define performance metric for use in grid search scoring object
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i], average='micro')
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score

In [18]:
# Create grid search object

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[25], 
              'clf__estimator__min_samples_split':[2, 10]}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 10)

# Find best parameters
np.random.seed(81)
tuned_model = cv.fit(X_train, Y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.951, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.950, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.0min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.949, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.951, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  6.0min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.952, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  7.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.950, total= 1.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.8min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.947, total= 1.3min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 10.2min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.950, total= 1.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 11.5min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.949, total= 1.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 12.9min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.952, total= 1.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1, score=0.952, total= 1.7min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1, score=0.949, total= 1.5min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1, score=0.949, total= 1.6min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=False, v

[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 50.7min finished


In [19]:
# Get results of grid search
tuned_model.cv_results_

{'mean_fit_time': array([83.8225821 , 77.06772246, 91.80289469, 71.4910068 , 59.20762925,
        65.24876695, 57.68075151, 56.70537701]),
 'std_fit_time': array([0.47462673, 1.833335  , 3.49119524, 1.73886141, 1.19715861,
        0.78046821, 0.93173599, 1.99345152]),
 'mean_score_time': array([5.90592847, 5.44561496, 6.08065743, 5.20166078, 5.97253361,
        5.48790326, 5.95637584, 5.28131938]),
 'std_score_time': array([0.02123286, 0.27552219, 0.14589494, 0.13935694, 0.08913836,
        0.09659778, 0.0926959 , 0.24993658]),
 'param_clf__estimator__min_samples_split': masked_array(data=[2, 2, 2, 2, 10, 10, 10, 10],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_clf__estimator__n_estimators': masked_array(data=[25, 25, 25, 25, 25, 25, 25, 25],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': 

In [20]:
# Best mean test score
np.max(tuned_model.cv_results_['mean_test_score'])

0.9512594236606953

In [21]:
# Parameters for best mean test score
tuned_model.best_params_

{'clf__estimator__min_samples_split': 10,
 'clf__estimator__n_estimators': 25,
 'tfidf__use_idf': False,
 'vect__min_df': 1}

These are the best parameters found. The grid was intentionally kept small, because a larger one would take very long to run; the 40 fits above already took about 50 minutes.
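
The cost of widening the search can be estimated before running it: the number of fits is the product of the number of values per parameter, multiplied by the number of folds. The sketch below counts candidates for the grid used above with `itertools.product`. The 5 folds are taken from the log line "Fitting 5 folds for each of 8 candidates", and the helper name `count_fits` is hypothetical.

```python
from itertools import product

# Same grid as in the grid search cell above
parameters = {
    'vect__min_df': [1, 5],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [25],
    'clf__estimator__min_samples_split': [2, 10],
}


def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds


n_candidates, n_fits = count_fits(parameters, n_folds=5)
print(n_candidates, n_fits)  # prints: 8 40

# Adding values to a single parameter grows the search multiplicatively
wider = dict(parameters, clf__estimator__n_estimators=[25, 50, 100])
print(count_fits(wider, n_folds=5))  # prints: (24, 120)
```

By default `GridSearchCV` also refits the best candidate once on the full training set, which this count does not include.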

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [22]:
# Calculate evaluation metrics for test set
tuned_pred_test = tuned_model.predict(X_test)

eval_metrics1 = evaluate_metrics(np.array(Y_test), tuned_pred_test, col_names)

print(eval_metrics1)

                        Accuracy  Precision    Recall        F1
related                 0.751032   0.751032  0.751032  0.751032
request                 0.827191   0.827191  0.827191  0.827191
offer                   0.996330   0.996330  0.996330  0.996330
aid_related             0.569812   0.569812  0.569812  0.569812
medical_help            0.918489   0.918489  0.918489  0.918489
medical_products        0.950298   0.950298  0.950298  0.950298
search_and_rescue       0.971096   0.971096  0.971096  0.971096
security                0.978284   0.978284  0.978284  0.978284
military                0.965132   0.965132  0.965132  0.965132
water                   0.936076   0.936076  0.936076  0.936076
food                    0.885609   0.885609  0.885609  0.885609
shelter                 0.909772   0.909772  0.909772  0.909772
clothing                0.986542   0.986542  0.986542  0.986542
money                   0.977825   0.977825  0.977825  0.977825
missing_people          0.988377   0.988

In [23]:
# Get summary stats for first model
eval_metrics0.describe()

       Accuracy  Precision    Recall        F1
count      35.0       35.0      35.0      35.0
mean   0.922548   0.922548  0.922548  0.922548
std     0.09153    0.09153   0.09153   0.09153
min    0.567365   0.567365  0.567365  0.567365
25%    0.910154   0.910154  0.910154  0.910154
50%    0.952133   0.952133  0.952133  0.952133
75%     0.98096    0.98096   0.98096   0.98096
max     0.99633    0.99633   0.99633   0.99633


In [24]:
# Get summary stats for tuned model
eval_metrics1.describe()

       Accuracy  Precision    Recall        F1
count      35.0       35.0      35.0      35.0
mean   0.923029   0.923029  0.923029  0.923029
std    0.090407   0.090407  0.090407  0.090407
min    0.569812   0.569812  0.569812  0.569812
25%     0.91046    0.91046   0.91046   0.91046
50%    0.949839   0.949839  0.949839  0.949839
75%     0.98096    0.98096   0.98096   0.98096
max     0.99633    0.99633   0.99633   0.99633


The difference between the two models' F1 scores is not significant: the mean is 0.9225 before tuning and 0.9230 after.

We could search a larger range of parameters, but that would take considerably longer to run.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

To try to improve the model further, we replace the Random Forest Classifier in the pipeline with a polynomial-kernel SVM classifier. SVMs are often used for text categorization because of their “ability to process many thousand different inputs. This opens the opportunity to use all words in a text directly as features” (Diederich et al., 2003).

To keep the number of grid search cases to a minimum, we fix the CountVectorizer and TfidfTransformer parameters at a single value each (`vect__min_df = 5`, `tfidf__use_idf = True`) and search only over the SVM's `C`. Note that these values differ from the best ones reported by the grid search in the previous section (`vect__min_df = 1`, `tfidf__use_idf = False`).

In [25]:
# Try using SVM instead of Random Forest Classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(SVC()))
])

parameters2 = {'vect__min_df': [5],
              'tfidf__use_idf':[True],
              'clf__estimator__kernel': ['poly'], 
              'clf__estimator__degree': [2],
              'clf__estimator__C':[1, 20]}

cv2 = GridSearchCV(pipeline2, param_grid = parameters2, scoring = scorer, verbose = 10)

# Find best parameters
np.random.seed(81)
tuned_model2 = cv2.fit(X_train, Y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.954, total=25.8min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 25.8min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.952, total=25.7min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 51.5min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.951, total=25.0min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 76.5min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.951, total=25.0min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 101.6min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.954, total=25.1min
[CV] clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 126.7min remaining:    0.0s


[CV]  clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.952, total=30.2min
[CV] clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 156.9min remaining:    0.0s


[CV]  clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.949, total=29.9min
[CV] clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 186.8min remaining:    0.0s


[CV]  clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.950, total=29.8min
[CV] clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 216.6min remaining:    0.0s


[CV]  clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.951, total=30.2min
[CV] clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 246.8min remaining:    0.0s


[CV]  clf__estimator__C=20, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.953, total=30.0min


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 276.7min finished


In [26]:
# Get results of grid search
tuned_model2.cv_results_

{'mean_fit_time': array([1389.646698  , 1679.39087877]),
 'std_fit_time': array([19.87938288,  8.31776847]),
 'mean_score_time': array([130.66194692, 121.26301761]),
 'std_score_time': array([1.7340107 , 0.69049438]),
 'param_clf__estimator__C': masked_array(data=[1, 20],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_clf__estimator__degree': masked_array(data=[2, 2],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_clf__estimator__kernel': masked_array(data=['poly', 'poly'],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': masked_array(data=[True, True],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'param_vect__min_df': masked_array(data=[5, 5],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__estimator__C': 1,
   'clf__estimator__d

The per-fold scores in the log above are nearly identical for both values of `C` (roughly 0.949 to 0.954), so the grid search gives little basis for choosing between them.

In [27]:
# Calculate evaluation metrics for test set
tuned_pred_test2 = tuned_model2.predict(X_test)

eval_metrics2 = evaluate_metrics(np.array(Y_test), tuned_pred_test2, col_names)

print(eval_metrics2)

                        Accuracy  Precision    Recall        F1
related                 0.761278   0.761278  0.761278  0.761278
request                 0.825203   0.825203  0.825203  0.825203
offer                   0.996330   0.996330  0.996330  0.996330
aid_related             0.575470   0.575470  0.575470  0.575470
medical_help            0.922159   0.922159  0.922159  0.922159
medical_products        0.953510   0.953510  0.953510  0.953510
search_and_rescue       0.971861   0.971861  0.971861  0.971861
security                0.978437   0.978437  0.978437  0.978437
military                0.965285   0.965285  0.965285  0.965285
water                   0.938981   0.938981  0.938981  0.938981
food                    0.889586   0.889586  0.889586  0.889586
shelter                 0.914054   0.914054  0.914054  0.914054
clothing                0.986848   0.986848  0.986848  0.986848
money                   0.978437   0.978437  0.978437  0.978437
missing_people          0.988377   0.988

This model trains extremely slowly: each cross-validation fit took 25 to 30 minutes, and the whole grid search took about 277 minutes.

On the test set, the SVM's scores are close to those of the tuned Random Forest model (for example, 0.761 versus 0.751 for "related" and 0.575 versus 0.570 for "aid_related"), so the much longer training time buys very little. We could try more parameter values for the SVM to look for a better combination, but instead we will stick with the tuned Random Forest Classifier based model.

### 9. Export your model as a pickle file

In [28]:
# Save to file
with open('disaster_model.sav', 'wb') as f:
    pickle.dump(tuned_model, f)
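
To use the saved model later, it can be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted grid search object and a temporary directory standing in for the working directory; both stand-ins are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (assumption: any picklable object round-trips the same way)
stand_in_model = {'clf__estimator__min_samples_split': 10, 'vect__min_df': 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'disaster_model.sav')

    # Write the object; the context manager closes the file handle
    with open(model_path, 'wb') as f:
        pickle.dump(stand_in_model, f)

    # Read it back
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # prints: True
```

Pickle stores functions by reference, so loading the real pipeline requires the `tokenize` function to be importable under the same name in the loading script. Pickle files should only be loaded from trusted sources.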

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.