# ML Pipeline Testing

---

**Run on terminal** requirements:

"If someone in the future comes with a revised or new dataset of messages, they should be able to easily create a new model just by running your code. These Python scripts should be able to run with additional arguments specifying the files used for the data and model."

`python process_data.py disaster_messages.csv disaster_categories.csv DisasterResponse.db`

`python train_classifier.py ../data/DisasterResponse.db classifier.pkl`

This is the test part of the Machine Learning Pipeline:

>- import libraries
>- read data from the `Messages` table of the SQLite database at `sqlite:///Messages.db`
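
The `load_data` output in section 1 lists the tables found in the database before reading from `Messages`. A minimal sketch of that kind of lookup, using only the standard library's `sqlite3`, might look like this. The in-memory database, its toy rows, and the helper name `list_tables` are illustrative assumptions, not the code in `train_classifier.py`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Tiny in-memory stand-in for Messages.db (toy rows, not the real dataset).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages (id INTEGER PRIMARY KEY, message TEXT, request INTEGER)"
)
conn.executemany(
    "INSERT INTO Messages (message, request) VALUES (?, ?)",
    [("toy message one", 0), ("toy message two", 1)],
)
conn.commit()

tables = list_tables(conn)
messages = conn.execute(
    "SELECT message, request FROM Messages ORDER BY id"
).fetchall()
print(tables)
print(messages)
```

Note that `sqlite3.connect` takes a plain file path such as `Messages.db`, while the `sqlite:///` prefix is the URL form used by SQLAlchemy.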

In [1]:
#import libraries
from time import time
import math
import numpy as np
import pandas as pd
import pprint as pp
import pickle

import udacourse2 #my library for this project
import train_classifier as tr #my pipeline

data_file = 'sqlite:///Messages.db' #sys.argv[1] 
classifier = 'classifier.pkl' #sys.argv[2]

### 1. Test `load_data`

In [2]:
X, y = tr.load_data(data_file=data_file,
                    remove_cols=True,
                    verbose=True)

###load_data function started
existing tables in my SQLite database: ['Messages']
all labels are blank in 6304 rows
remaining rows: 19876
removal complete!
*tokenized column don´t exist, creating it
found 6 rows with no tokens
*after removal, found 0 rows with no tokens
*tokenized column dropped
now I have 19870 rows to train
###function group_check started
  - count for main class:aid_related, 10841 entries
  - for main, without any sub-categories,  3507 entries
  - for subcategories,  7360 entries
  - for lost parent sub-categories,  26 entries
    *correcting, new count: 0 entries
elapsed time: 0.0829s
###function group_check started
  - count for main class:weather_related, 7286 entries
  - for main, without any sub-categories,  1357 entries
  - for subcategories,  5929 entries
  - for lost parent sub-categories,  0 entries
    *correcting, new count: 0 entries
elapsed time: 0.0370s
###function group_check started
  - count for main class:infrastructure_related, 1705 entries
  - fo

In [3]:
X.head(4)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
Name: message, dtype: object

In [4]:
y.head(4)

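| | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | water | food | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |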
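
The `group_check` lines in the `load_data` log report "lost parent" rows and then a corrected count of 0. One plausible reading, assumed here and not confirmed by the source, is that a row is "lost" when a sub-category is set but its main class is not, and that the fix sets the main class to 1. The counts agree with that reading: 3507 + 7360 = 10867 = 10841 + 26 for `aid_related`. A sketch under that assumption, with a hypothetical helper `fix_lost_parents` and toy rows:

```python
def fix_lost_parents(rows, parent, subs):
    """Set `parent` to 1 wherever a sub-category is 1 but the parent is 0.

    Returns the number of rows that were corrected.
    """
    fixed = 0
    for row in rows:
        if row[parent] == 0 and any(row[sub] == 1 for sub in subs):
            row[parent] = 1
            fixed += 1
    return fixed


# Toy label rows (hypothetical data, column names taken from the dataset).
rows = [
    {"aid_related": 1, "water": 1, "food": 0},  # consistent
    {"aid_related": 0, "water": 0, "food": 1},  # sub-category without parent
    {"aid_related": 0, "water": 0, "food": 0},  # consistent
]
subs = ["water", "food"]

first_pass = fix_lost_parents(rows, "aid_related", subs)
second_pass = fix_lost_parents(rows, "aid_related", subs)
print(first_pass, second_pass)
```

The second pass finding nothing to fix mirrors the "new count: 0 entries" line in the log. The actual check lives in `udacourse2` and may differ.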


### 2. Test `build_model` function

In [5]:
model_pipeline = tr.build_model(verbose=True)

###build_model function started
Tree-type Classifier (Adaboost-default) pipeline is on the way
*note: parameter C, is NOT used in this family of Classifiers, so don´t call it!
creating convencional Adaboost pipeline
*Classifier pipeline was created
process time:0 seconds


In [6]:
model_pipeline

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function fn_tokenize_fast at 0x000002B9BAF713A0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(random_state=42)))])

### 3. Test `train` function

In [7]:
model = tr.train(X=X,
                 y=y,
                 model=model_pipeline,
                 verbose=True)

###train function started
data split into train and text seems OK
###function scores_report started
using top 10 labels
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.67      0.73      0.70      2260
           1       0.76      0.70      0.73      2708

    accuracy                           0.71      4968
   macro avg       0.71      0.72      0.71      4968
weighted avg       0.72      0.71      0.72      4968

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.85      0.93      0.89      3167
           1       0.86      0.71      0.78      1801

    accuracy                           0.85      4968
   macro avg       0.86      0.82      0.83      4968
weighted avg       0.85      0.85      0.85      4968

###################################################
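
The columns of each report follow the usual definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A small pure-Python sketch of those formulas follows. The helper names and the toy labels are illustrative assumptions; the notebook output itself has the layout of scikit-learn's classification report.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one class of a binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
scores = binary_scores(y_true, y_pred)
print(scores)

# Cross-check against the aid_related row for class 1 (precision 0.76, recall 0.70).
aid_f1 = f1_from(0.76, 0.70)
print(round(aid_f1, 2))
```

Plugging the rounded precision and recall for `aid_related`, class 1, into the F1 formula gives about 0.73, matching the reported value.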

### 4. Test `export_model` function

In [8]:
file_name = 'classifier.pkl'

tr.export_model(model=model,
                file_name=file_name,
                verbose=True)

###export_model function started
*trained Classifier was exported
process time:0 seconds


True

In [9]:
with open('classifier.pkl', 'rb') as pk_reader:
    model_unpk = pickle.load(pk_reader)
    
model_unpk

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function fn_tokenize_fast at 0x000002B9BAF713A0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(random_state=42)))])
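
The export and reload steps above amount to a pickle round trip. A minimal sketch of that round trip is shown here with a plain dictionary standing in for the fitted pipeline. The helpers `export_object` and `load_object` and the toy object are assumptions for illustration, not the body of `tr.export_model`.

```python
import os
import pickle
import tempfile


def export_object(obj, file_name):
    """Serialize `obj` to `file_name` with pickle; return True on success."""
    with open(file_name, "wb") as pk_writer:
        pickle.dump(obj, pk_writer)
    return True


def load_object(file_name):
    """Read back an object previously written with `export_object`."""
    with open(file_name, "rb") as pk_reader:
        return pickle.load(pk_reader)


# Stand-in for a fitted pipeline (hypothetical): a plain dict of settings.
toy_model = {"steps": ["vect", "tfidf", "clf"], "random_state": 42}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")
    exported = export_object(toy_model, path)
    restored = load_object(path)

print(exported, restored == toy_model)
```

One point worth keeping in mind for the real pipeline: pickle stores functions by reference (module and name), so the tokenizer `fn_tokenize_fast` shown in the repr has to be importable in whatever process loads `classifier.pkl`.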

### 5. Test `run_pipeline` function

In [10]:
data_file = 'sqlite:///Messages.db'
start = time()

tr.run_pipeline(data_file=data_file, verbose=False)

spent = time() - start
print('process time: {:.0f} seconds'.format(spent))

process time: 191 seconds


### 6. Call the `main` function

In [11]:
print(tr.main.__doc__)

This is the main Machine Learning Pipeline function. It calls the other 
    ones, in the correct order.
    Example: python train_classifier.py
    Basic parameters:
      - data_file - just indicate the complete path after the command 
        (default:'../data/DisasterResponse.db')
        Example: python train_classifier.py ../data/Database.db
      - classifier - you need to indicate both data_file and classifier
        (default:'classifier.pkl')
        Example: python train_classifier.py ../data/Database.db other.pkl
    Extra parameters:
      here you need to indicate both data_file and classifier, in order to use 
      them you can use only one, or more, in any order
      -v -> verbose - if you want some verbosity during the running
            (default=False)
      -r -> remove columns - if you want to remove (un)trainable columns from
            your y-labels dataset (default=False)
      -t -> test size for splitting your data (default=0.25)
      -s -> change Classifi

In [12]:
tr.run_pipeline(data_file=data_file,
                classifier=classifier,
                verbose=True)

###run_pipeline function started
###load_data function started
existing tables in my SQLite database: ['Messages']
all labels are blank in 6304 rows
remaining rows: 19876
removal complete!
*tokenized column don´t exist, creating it
found 6 rows with no tokens
*after removal, found 0 rows with no tokens
*tokenized column dropped
now I have 19870 rows to train
###function group_check started
  - count for main class:aid_related, 10841 entries
  - for main, without any sub-categories,  3507 entries
  - for subcategories,  7360 entries
  - for lost parent sub-categories,  26 entries
    *correcting, new count: 0 entries
elapsed time: 0.0989s
###function group_check started
  - count for main class:weather_related, 7286 entries
  - for main, without any sub-categories,  1357 entries
  - for subcategories,  5929 entries
  - for lost parent sub-categories,  0 entries
    *correcting, new count: 0 entries
elapsed time: 0.1099s
###function group_check started
  - count for main class:infrastruc

True

---

## Test Area

The correct string format for the SQLAlchemy database URL is explained [here](https://stackoverflow.com/questions/49776619/sqlalchemy-exc-argumenterror-could-not-parse-rfc1738-url-from-string)

In [13]:
tr.main(data_file=data_file,
        classifier=classifier,
        verbose=True)

###run_pipeline function started
###load_data function started
existing tables in my SQLite database: ['Messages']
all labels are blank in 6304 rows
remaining rows: 19876
removal complete!
*tokenized column don´t exist, creating it
found 6 rows with no tokens
*after removal, found 0 rows with no tokens
*tokenized column dropped
now I have 19870 rows to train
###function group_check started
  - count for main class:aid_related, 10841 entries
  - for main, without any sub-categories,  3507 entries
  - for subcategories,  7360 entries
  - for lost parent sub-categories,  26 entries
    *correcting, new count: 0 entries
elapsed time: 0.0790s
###function group_check started
  - count for main class:weather_related, 7286 entries
  - for main, without any sub-categories,  1357 entries
  - for subcategories,  5929 entries
  - for lost parent sub-categories,  0 entries
    *correcting, new count: 0 entries
elapsed time: 0.0370s
###function group_check started
  - count for main class:infrastruc

In [14]:
raise Exception('test area')

Exception: test area

In [None]:
#args = sys.argv[1:] #on the command line, skip the script name

simul_args = ['xuru.db', 'boco.pkl', '-r', '-C=3.', '-t=.2', '-a', '-v']
#simul_args = ['xuru.db']
#invalid example: '-xu' is not a known option and raises an exception
Esimul_args = ['xuru.db', 'boco.pkl', '-v', '-a', '-C=4.0', '-r', '-xu']
optionals = ['-r', '-C', '-t', '-a', '-v']
args = simul_args

#first, set default arguments
data_file = '../data/DisasterResponse.db'
classifier = 'classifier.pkl'
remove_cols = False
C = 2.0
test_size = 0.25
best_10 = True
verbose = False

#second, override the two main (positional) arguments when given
if len(args) > 0:
    data_file = args[0]
if len(args) > 1:
    classifier = args[1]

#third, parse the optional flags that follow the two main arguments
for arg in args[2:]:
    comm = arg[:2] #get the command part, e.g. '-C' from '-C=3.'
    if comm not in optionals:
        raise ValueError('invalid argument: {}'.format(arg))
    if comm == '-r':
        remove_cols = True
    elif comm == '-C':
        C = float(arg[3:]) #text after '-C=', converted to a number
    elif comm == '-t':
        test_size = float(arg[3:]) #text after '-t=', converted to a number
    elif comm == '-a':
        best_10 = False
    elif comm == '-v':
        verbose = True

print('data_file={} classifier={} remove_cols={} C={} test_size={} best_10={} verbose={}'
      .format(data_file, classifier, remove_cols, C, test_size, best_10, verbose))
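
The same two positional arguments and five flags could also be handled by the standard library's `argparse`, which does the type conversion and the unknown-flag check itself. The sketch below is one possible arrangement, not the parser used in `train_classifier.py`. The `dest` names mirror the variables in the cell above, and passing values as separate tokens (`-C 3.`) is a choice made here for simplicity.

```python
import argparse


def build_parser():
    """Build a parser for two optional positionals and five flags."""
    parser = argparse.ArgumentParser(description="Train a message classifier (sketch).")
    parser.add_argument("data_file", nargs="?", default="../data/DisasterResponse.db")
    parser.add_argument("classifier", nargs="?", default="classifier.pkl")
    parser.add_argument("-r", dest="remove_cols", action="store_true")
    parser.add_argument("-C", dest="C", type=float, default=2.0)
    parser.add_argument("-t", dest="test_size", type=float, default=0.25)
    parser.add_argument("-a", dest="best_10", action="store_false")  # default is True
    parser.add_argument("-v", dest="verbose", action="store_true")
    return parser


parser = build_parser()

# No arguments at all: every setting falls back to its default.
defaults = parser.parse_args([])

# Same settings as simul_args above, with values as separate tokens.
custom = parser.parse_args(
    ["xuru.db", "boco.pkl", "-r", "-C", "3.", "-t", ".2", "-a", "-v"]
)
print(vars(defaults))
print(vars(custom))
```

With this approach an unrecognized flag such as `-xu` makes `parse_args` print a usage message and exit, instead of raising a generic exception.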

In [None]:
arg='-s;'

arg[2:]