# ML Pipeline Preparation
Follow the instructions below to create your ML pipeline. As before, this notebook is a template provided by the Udacity team, and it is useful for getting familiar with the ML pipeline concepts applied in the project.

### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [60]:
# import libraries
import os
import pandas as pd
import numpy as np
import pickle 
import re
import sqlalchemy
import nltk
from nltk.corpus import stopwords
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin

%matplotlib inline

In [18]:
nltk.download(['punkt', 'wordnet','stopwords', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ARULLOAO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ARULLOAO\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ARULLOAO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ARULLOAO\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# load data from database
engine = create_engine('sqlite:///../data/disaster_process_data.db')
df = pd.read_sql('SELECT * FROM disaster_process_data', engine)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Exploring data for X
df['message'][:10]

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
5               Information about the National Palace-
6                       Storm at sacred heart of jesus
7    Please, we need tents and water. We are in Sil...
8      I would like to receive the messages, thank you
9    I am in Croix-des-Bouquets. We have health iss...
Name: message, dtype: object

In [6]:
# Keep only the category columns created earlier
df.iloc[:, 4:]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26210,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Assigning data

X = df['message']
Y = df.iloc[:, 4:]
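
Selecting the targets by position with `iloc[:, 4:]` depends on the column order in the table. As a minimal sketch, the same split can be written by naming the non-target columns. The toy frame below is made up for illustration and only reuses the column names shown in `df.head()` above; `NON_TARGET_COLUMNS`, `X_toy`, and `Y_toy` are hypothetical names.

```python
import pandas as pd

# Toy frame with the same leading columns as the real table (values are illustrative)
toy_df = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
    "original": ["Un front froid", "Cyclone nan fini"],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "storm": [0, 1],
})

# Hypothetical constant: every column that is not a category label
NON_TARGET_COLUMNS = ["id", "message", "original", "genre"]

X_toy = toy_df["message"]
Y_toy = toy_df.drop(columns=NON_TARGET_COLUMNS)
print(list(Y_toy.columns))
```

Dropping by name keeps working if a column is later added or reordered before the categories.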

### 2. Write a tokenization function to process your text data

In [20]:
def tokenize(text):
    '''
    INPUT
    - raw text
    OUTPUT
    - clean text by using tokenization and lemmatization techniques 
    
    '''
    # Detect URLs and replace them with a placeholder (URLs are unlikely in disaster messages, but cleaning them is good practice)
    url_format = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls_found = re.findall(url_format, text)
    
    for url in urls_found:
        text = text.replace(url, "urlplaceholder")
    
    # Split the text into tokens
    tokens = word_tokenize(text)
    
    # Create the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize, lowercase, and strip each token
    clean_text = []
    for token in tokens:
        lemma = lemmatizer.lemmatize(token).lower().strip()
        clean_text.append(lemma)

    return clean_text

In [21]:
df['message'][0]

'Weather update - a cold front from Cuba that could pass over Haiti'

In [23]:
# Testing the function from above
text = 'Weather update - a cold front from Cuba that could pass over Haiti https://en.unesco.org/about-us/introducing-unesco'
tokenize(text)

['weather',
 'update',
 '-',
 'cold',
 'front',
 'cuba',
 'could',
 'pas',
 'haiti',
 'urlplaceholder']
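
The URL step of `tokenize` can also be checked on its own, without the NLTK data. The sketch below uses the same pattern as the function above, but replaces matches in one pass with `re.sub` instead of `re.findall` followed by `str.replace`. `URL_PATTERN` and `replace_urls` are names introduced here for illustration only.

```python
import re

# Same pattern as in tokenize(), compiled once and written as a raw string
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def replace_urls(text, placeholder="urlplaceholder"):
    """Return text with every URL match swapped for the placeholder."""
    return URL_PATTERN.sub(placeholder, text)

sample = "Weather update https://en.unesco.org/about-us/introducing-unesco"
cleaned = replace_urls(sample)
print(cleaned)  # Weather update urlplaceholder
```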

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [24]:
# Creating the Pipeline and applying the RandomForestClassifier() model

pipeline = Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer()),
            ('clf', MultiOutputClassifier(RandomForestClassifier())) # Method for Multi target classification
        ])
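
To see what the multi-output wrapper returns before running it on the full dataset, here is a small sketch on made-up data. It uses the default `CountVectorizer` tokenizer rather than `tokenize`, so it does not need the NLTK downloads. The messages, the two target columns, and the `toy_` names are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

toy_messages = [
    "we need water and food",
    "storm destroyed the bridge",
    "please send water",
    "heavy storm tonight",
]
# Two hypothetical target columns: [water, storm]
toy_targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # One RandomForestClassifier is fitted per target column
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])

toy_pipeline.fit(toy_messages, toy_targets)
toy_pred = toy_pipeline.predict(["water is needed", "storm warning"])
print(toy_pred.shape)  # one row per message, one column per target
```

With the real data the prediction array has one column per category, in the same order as the columns of `Y`.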

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [25]:
# Split data 
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [26]:
%%time
# Train pipeline
pipeline.fit(X_train, y_train)

Wall time: 5min 34s


Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x000001F8B1BBF3A0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [27]:
%%time
y_pred = pipeline.predict(X_test)

Wall time: 1min 1s


In [28]:
y_pred

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [29]:
y_test.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
7917,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20913,1,0,0,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
22523,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18442,1,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,0,0
1336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
classification_report(y_test['related'], y_pred[:, 0])

'              precision    recall  f1-score   support\n\n           0       0.73      0.34      0.47      2034\n           1       0.83      0.96      0.89      6617\n\n    accuracy                           0.82      8651\n   macro avg       0.78      0.65      0.68      8651\nweighted avg       0.80      0.82      0.79      8651\n'

In [49]:
Y.columns

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [46]:
# Function that reports the model's performance for each target variable

def model_performance(y_test, y_pred):
    '''
    INPUT
    - y_test
    - y_pred
    
    OUTPUT
    - None (There is a print containing the precision | recall | f1-score | support measures)
    
    '''
    for i, col in enumerate(y_test):
        # Printing the results for each variable
        print('--------------------------------------------------------------')
        print(f'\033[1m{col}\033[0m')
        print(classification_report(y_test[col], y_pred[:,i]))
       

In [47]:
model_performance(y_test, y_pred)

--------------------------------------------------------------
[1mrelated[0m
              precision    recall  f1-score   support

           0       0.73      0.34      0.47      2034
           1       0.83      0.96      0.89      6617

    accuracy                           0.82      8651
   macro avg       0.78      0.65      0.68      8651
weighted avg       0.80      0.82      0.79      8651

--------------------------------------------------------------
[1mrequest[0m
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      7201
           1       0.85      0.49      0.62      1450

    accuracy                           0.90      8651
   macro avg       0.88      0.74      0.78      8651
weighted avg       0.90      0.90      0.89      8651

--------------------------------------------------------------
[1moffer[0m
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      8610
  



              precision    recall  f1-score   support

           0       0.87      1.00      0.93      7491
           1       0.73      0.04      0.07      1160

    accuracy                           0.87      8651
   macro avg       0.80      0.52      0.50      8651
weighted avg       0.85      0.87      0.81      8651

--------------------------------------------------------------
[1minfrastructure_related[0m
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      8100
           1       0.75      0.01      0.01       551

    accuracy                           0.94      8651
   macro avg       0.84      0.50      0.49      8651
weighted avg       0.92      0.94      0.91      8651

--------------------------------------------------------------
[1mtransport[0m
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      8266
           1       0.71      0.06      0.11       385

    acc



In [51]:
# General accuracy (mean)
accuracy = (y_pred == y_test).mean().mean()
accuracy

0.947443455477209
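
The general accuracy above is the mean of the per-label accuracies, so it can stay high even when the positive class of a label is rarely found (several reports above show recall below 0.10 for class 1). The toy example below, with made-up labels and predictions, walks through the arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up ground truth and predictions for two labels
toy_true = pd.DataFrame({"related": [1, 0, 1, 0], "offer": [0, 0, 0, 1]})
toy_pred = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

matches = toy_true == toy_pred   # element-wise comparison, keeps the column names
per_label = matches.mean()       # accuracy per label: 3 of 4 correct in each column
overall = per_label.mean()       # mean over the labels

print(per_label)
print(overall)  # 0.75
```

In this toy case the single positive `offer` example is missed, yet the `offer` accuracy is still 0.75. This is why the per-label precision and recall are worth reading alongside the mean.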

### 6. Improve your model
Use grid search to find better parameters. 

In [52]:
# Check all the available parameters
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001F8B1BBF3A0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001F8B1BBF3A0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,


In [53]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 1.0),
    'clf__estimator__n_estimators' : [10,60,100]
}

cv = GridSearchCV(pipeline, param_grid=parameters)

cv

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001F8B1BBF3A0>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [10, 60, 100],
                         'vect__max_df': (0.5, 1.0),
                         'vect__ngram_range': ((1, 1), (1, 2))})

In [54]:
%%time
cv.fit(X_train, y_train)

Wall time: 4h 33min 48s


GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001F8B1BBF3A0>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [10, 60, 100],
                         'vect__max_df': (0.5, 1.0),
                         'vect__ngram_range': ((1, 1), (1, 2))})
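
The long wall time above follows from the size of the grid. The sketch below counts the parameter combinations and the resulting number of pipeline fits. It assumes 5-fold cross-validation and a final refit on the best parameters, which are the `GridSearchCV` defaults in recent scikit-learn releases; the version used in this notebook may differ.

```python
from itertools import product

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 1.0),
    'clf__estimator__n_estimators': [10, 60, 100],
}

# Every combination of the listed values is one candidate
n_candidates = len(list(product(*parameters.values())))

assumed_folds = 5                                  # assumption: default cv
total_fits = n_candidates * assumed_folds + 1      # +1 for the final refit (assumption: refit=True)

print(n_candidates, total_fits)  # 12 61
```

Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on the available cores, which may shorten the search. A smaller grid is the other obvious lever.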

In [55]:
%%time
cv.best_params_

Wall time: 0 ns


{'clf__estimator__n_estimators': 100,
 'vect__max_df': 0.5,
 'vect__ngram_range': (1, 2)}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [56]:
%%time
y_pred = cv.predict(X_test)

Wall time: 1min 26s


In [57]:
accuracy = (y_pred == y_test).mean()
accuracy

related                   0.819790
request                   0.898509
offer                     0.995261
aid_related               0.766385
medical_help              0.921743
medical_products          0.952953
search_and_rescue         0.973413
security                  0.980118
military                  0.969137
child_alone               1.000000
water                     0.958849
food                      0.942666
shelter                   0.933765
clothing                  0.986822
money                     0.978037
missing_people            0.988209
refugees                  0.964860
death                     0.958502
other_aid                 0.866836
infrastructure_related    0.936192
transport                 0.956768
buildings                 0.953185
electricity               0.979193
tools                     0.993527
hospitals                 0.990521
shops                     0.995839
aid_centers               0.987400
other_infrastructure      0.956884
weather_related     

In [58]:
accuracy = (y_pred == y_test).mean().mean()
accuracy

0.9465957692752281

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [74]:
# Transformer that checks whether a sentence starts with a verb

class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        sentences = nltk.sent_tokenize(text)
        for sent in sentences:
            pos_tags = nltk.pos_tag(tokenize(sent))
            # An empty tag list (pos_tags = []) yields False
            if len(pos_tags):
                first_word, first_tag = pos_tags[0]
                if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                    return True
                return False
            else:
                return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

In [75]:
X,Y = load_data()

In [77]:
type(X)

pandas.core.series.Series

In [79]:
S = StartingVerbExtractor()

In [65]:
def load_data():
    engine = create_engine('sqlite:///../data/disaster_process_data.db')
    df = pd.read_sql('SELECT * FROM disaster_process_data', engine)
    X = df['message']
    Y = df.iloc[:, 4:]
    return X, Y

In [66]:
def tokenize(text):
    '''
    INPUT
    - raw text
    OUTPUT
    - clean text by using tokenization and lemmatization techniques 
    
    '''
    # Detect URLs and replace them with a placeholder (URLs are unlikely in disaster messages, but cleaning them is good practice)
    url_format = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    urls_found = re.findall(url_format, text)
    
    for url in urls_found:
        text = text.replace(url, "urlplaceholder")
    
    # Split the text into tokens
    tokens = word_tokenize(text)
    
    # Create the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize, lowercase, and strip each token
    clean_text = []
    for token in tokens:
        lemma = lemmatizer.lemmatize(token).lower().strip()
        clean_text.append(lemma)
     
    # Remove stop words (new in this version); build the set once instead of reloading the list for every token
    stop_words = set(stopwords.words("english"))
    clean_text = [word for word in clean_text if word not in stop_words]
    
    return clean_text


In [67]:
def build_model():

    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    parameters = {
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
    #'features__text_pipeline__vect__max_df': (0.5, 1.0),
    'clf__estimator__n_estimators' : [50,100]
    }
    
    cv = GridSearchCV(pipeline, param_grid=parameters)

    return cv

In [68]:
def model_performance(model, X_test, y_test):
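    '''
    INPUT
    - model: fitted estimator with a predict method
    - X_test: test messages
    - y_test: true labels, one column per category
    
    OUTPUT
    - None (prints precision | recall | f1-score | support for each category, then the general accuracy)
    
    '''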
    import warnings
    warnings.filterwarnings("ignore")
    
    y_pred = model.predict(X_test)
    
    for i, col in enumerate(y_test):
        # Printing the results for each variable
        print('--------------------------------------------------------------')
        print(f'\033[1m{col}\033[0m')
        print(classification_report(y_test[col], y_pred[:,i]))
    
    accuracy = (y_pred == y_test).mean().mean()
    print(f'General Accuracy: {accuracy}')  

In [71]:
def main():
    """ Builds the model, trains the model, evaluates the model, saves the model."""
    
    X, Y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

    print('Building model...')
    model = build_model()

    print('Training model...')
    model.fit(X_train, y_train)

    print('Evaluating model...')
    model_performance(model, X_test, y_test)
    
    print('Saving model...')
    filename = 'classifier.pkl'
    # Use a context manager so the file is closed after writing
    with open(filename, 'wb') as model_file:
        pickle.dump(model, model_file)

In [72]:
%%time
main()

Building model...
Training model...


Traceback (most recent call last):
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 988, in fit_transform
    return self._hstack(Xs)
  File "C:\Users\ARULLOAO\Anaconda3\lib\site-packages\sklearn\pipeline.py", li

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
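
The traceback is truncated, so the cause cannot be read off directly. One plausible explanation is that `starting_verb` returns `None` for messages where `nltk.sent_tokenize` yields no sentences, because the `for` loop never runs and there is no `return` after it. A column mixing `True`, `False`, and `None` has dtype `object`, which cannot be stacked with the float64 TF-IDF matrix in the `FeatureUnion`. The sketch below reproduces that dtype problem with plain pandas. The word list and both functions are hypothetical stand-ins for the POS-tagging logic, not the notebook's code.

```python
import pandas as pd

HYPOTHETICAL_VERBS = {"send", "bring", "help"}  # stand-in for POS tagging

def flag_with_fallthrough(text):
    # Same control flow as starting_verb: returns inside the loop only
    for word in text.split():
        return word.lower() in HYPOTHETICAL_VERBS
    # No return here, so empty text gives None

def flag_always_bool(text):
    for word in text.split():
        return word.lower() in HYPOTHETICAL_VERBS
    return False  # explicit fallback for empty text

messages = pd.Series(["Send water please", "", "We need tents"])

broken = pd.DataFrame(messages.apply(flag_with_fallthrough))
fixed = pd.DataFrame(messages.apply(flag_always_bool)).astype(int)

print(broken.dtypes.iloc[0])  # object
print(fixed.iloc[:, 0].tolist())  # [1, 0, 0]
```

If this is indeed the cause, adding a final `return False` to `starting_verb` and casting the result in `transform` to a numeric dtype should let the union stack both feature blocks.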

### 9. Export your model as a pickle file

In [73]:
import pickle
# Use a context manager so the file is closed after writing
with open('classifier.pkl', 'wb') as model_file:
    pickle.dump(cv, model_file)

PicklingError: Can't pickle <function tokenize at 0x000001F8B1BBF3A0>: it's not the same object as __main__.tokenize
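
The message suggests that `cv` still holds the `tokenize` object that existed when the pipeline was built, while the name `tokenize` has since been rebound (the function is defined twice in this notebook). Pickle stores functions by name, so it refuses when the name no longer points to the same object. The sketch below reproduces the situation with a toy function. `first_version` and `stale_pickle_failed` are names introduced here for illustration.

```python
import pickle

def tokenize(text):
    return text.split()

# A fitted pipeline would keep a reference like this one
first_version = tokenize

def tokenize(text):  # re-running a cell rebinds the name to a new function object
    return text.lower().split()

try:
    pickle.dumps(first_version)
    stale_pickle_failed = False
except pickle.PicklingError as err:
    stale_pickle_failed = True
    print(err)

print(stale_pickle_failed)
```

Two possible ways out, neither of which is done in this notebook: rebuild and refit the model after the final `tokenize` definition, or keep `tokenize` in an importable module so that training and loading code share one definition.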

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.