# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
#import libraries
#measuring time and making basic math
from time import time
import math
import numpy as np
import udacourse2 #my library for this project!

#my own ETL pipeline
#import process_data as pr

#dealing with datasets and showing content
import pandas as pd
#import pprint as pp

#SQLAlchemy toolkit
from sqlalchemy import create_engine
from sqlalchemy import pool
from sqlalchemy import inspect

#natural language toolkit
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

#REGEX toolkit
import re

#Machine Learning toolkit
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

#Machine Learning Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier #need MOClassifier!
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

#pickling tool
import pickle

When I first tried to use NLTK, I got the error shown below:

- the point is that installing the library is not enough

- you also need to install the supporting corpora and dictionaries that each task depends on

- this is easy to solve (I hope to find a Brazilian Portuguese dictionary when I need to put this into practice in my work)

    LookupError: 
    **********************************************************************
      Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('stopwords')
  
      For more information see: https://www.nltk.org/data.html

      Attempted to load corpora/stopwords
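
The cells below download each resource by hand after each error. A minimal sketch of doing the same check up front, assuming the three resources this notebook ends up needing (`punkt`, `stopwords`, `wordnet`): `nltk.data.find` raises `LookupError` when a resource is missing, and `nltk.download` fetches it. The function name `ensure_nltk_resources` and the `find`/`download` parameters are my own illustration, added only so the logic can be tried without NLTK or a network connection.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it is not installed yet.

    resources maps a lookup path (e.g. 'corpora/stopwords') to the
    package name given to the downloader (e.g. 'stopwords').
    find/download default to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources.items():
        try:
            find(path)              # raises LookupError if missing
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched


# resources used in this notebook (lookup path -> package name)
NEEDED = {
    'tokenizers/punkt': 'punkt',
    'corpora/stopwords': 'stopwords',
    'corpora/wordnet': 'wordnet',
}

# ensure_nltk_resources(NEEDED)  # uncomment in the notebook; needs network on first run
```

If a lookup path is wrong for the installed NLTK version, the only cost is that the package is downloaded again.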

In [2]:
#import nltk
#nltk.download('punkt')

    LookupError: 
    **********************************************************************
    Resource stopwords not found.
    Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('stopwords')

In [3]:
#nltk.download('stopwords')

    LookupError: 
    **********************************************************************
    Resource wordnet not found.
    Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('wordnet')

In [4]:
#nltk.download('wordnet')

In [5]:
#load data from database
#setting NullPool disables connection pooling, so it is easy to close the database connection
#in our case the DB is so simple that this looks like the best choice
#SQLAlchemy documentation
#https://docs.sqlalchemy.org/en/14/core/reflection.html
engine = create_engine('sqlite:///Messages.db', poolclass=pool.NullPool) #, echo=True)

#retrieving tables names from my DB
#https://stackoverflow.com/questions/6473925/sqlalchemy-getting-a-list-of-tables
inspector = inspect(engine)
print('existing tables in my SQLite database:', inspector.get_table_names())

existing tables in my SQLite database: ['Messages']


As my target is the Messages table, I read it into a Pandas DataFrame:

In [6]:
#importing from a SQL database to Pandas (the reference below uses MySQL, here the DB is SQLite)
#https://stackoverflow.com/questions/37730243/importing-data-from-a-mysql-database-into-a-pandas-data-frame-including-column-n/37730334
#connection_str = 'mysql+pymysql://mysql_user:mysql_password@mysql_host/mysql_db'
#connection = create_engine(connection_str)

connection = engine.connect()
df = pd.read_sql('SELECT * FROM Messages', con=connection)
connection.close()

df.name = 'df'

df.head(1)

Unnamed: 0,message,original,genre,if_blank,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
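
For a quick check without SQLAlchemy or Pandas, the same two steps (list the tables, then read one of them with its column names) can be sketched with the standard library's `sqlite3` module. This is only an illustration on a toy in-memory table that I made up; it does not touch `Messages.db`.

```python
import sqlite3

# toy in-memory database standing in for Messages.db (made-up content)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Messages (message TEXT, genre TEXT, related INTEGER)')
conn.execute("INSERT INTO Messages VALUES ('Weather update', 'direct', 1)")

# step 1: list the existing tables
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# step 2: read the whole table, keeping the column names
cursor = conn.execute('SELECT * FROM Messages')
columns = [col[0] for col in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor.fetchall()]

conn.close()

print('existing tables:', tables)
print('first row:', rows[0])
```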


Splitting into X and Y datasets:

- X is the **Message** column

In [7]:
X = df['message']
X.head(1)

0    Weather update - a cold front from Cuba that c...
Name: message, dtype: object

- Y is the **Classification** labels

- I excluded all the columns that don't make sense as labels for classifying a message

In [8]:
Y = df[df.columns[4:]]
Y.head(1)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [9]:
msg_text = X.iloc[0]
msg_text

'Weather update - a cold front from Cuba that could pass over Haiti'

In [10]:
#let's insert some noise to see if it is filtered well
msg_text = "Weather update01 - a 00cold-front from Cuba's that could pass over Haiti' today"
low_text = msg_text.lower()

#I need to take only valid words
#a basic pattern (very common in Regex courses)
gex_text = re.sub(r'[^a-zA-Z]', ' ', low_text)

#other solutions tried, from several sources
#re.sub(r'^\b[^a-zA-Z]\b', ' ', low_text)
#re.sub(r'^/[^a-zA-Z ]/g', ' ', low_text)
#re.sub(r'^/[^a-zA-Z0-9 ]/g', ' ', low_text)

gex_text

'weather update     a   cold front from cuba s that could pass over haiti  today'

Found this [here](https://stackoverflow.com/questions/1751301/regex-match-entire-words-only)

- '-' passed away, so it´s not so nice!

In [11]:
re.sub(r'^/\b($word)\b/i', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

In [12]:
re.sub(r'^\b[a-zA-Z]{3}\b', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

In [13]:
re.sub(r'^[a-zA-Z]{3}$', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"
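
The three attempts above return the text unchanged, and the reason is in the patterns themselves: `^` and `$` anchor the match to the start and end of the whole string, and the `/.../i` wrapper (like `$word`) is JavaScript/Perl notation that Python's `re` does not interpret as delimiters or flags. A small sketch of patterns that do behave as intended on the same noisy sentence (the short-word filter is my own addition for illustration):

```python
import re

text = "weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

# anchored to the whole string, so it can only match a 3-letter text
unchanged = re.sub(r'^[a-zA-Z]{3}$', ' ', text)

# replace every run of non-letters with a single space
cleaned = re.sub(r'[^a-z]+', ' ', text).strip()

# drop whole words of one or two letters, then normalize the spaces
no_short = ' '.join(re.sub(r'\b[a-z]{1,2}\b', ' ', cleaned).split())

print(unchanged == text)
print(cleaned)
print(no_short)
```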

In [14]:
col_words = word_tokenize(gex_text)
col_words

['weather',
 'update',
 'a',
 'cold',
 'front',
 'from',
 'cuba',
 's',
 'that',
 'could',
 'pass',
 'over',
 'haiti',
 'today']

In [15]:
unnuseful = stopwords.words("english")
relevant_words = [word for word in col_words if word not in unnuseful]
relevant_words

['weather',
 'update',
 'cold',
 'front',
 'cuba',
 'could',
 'pass',
 'haiti',
 'today']

I noticed a lot of geographic references. I don't think they will be very useful for us, so let's try to remove them too...

References for the city database in NLTK [here](https://stackoverflow.com/questions/37025872/unable-to-import-city-database-dataset-from-nltk-data-in-anaconda-spyder-windows?rq=1)

In [16]:
import nltk.sem.chat80 as ct #.sql_demo()

LookupError: 
**********************************************************************
  Resource city_database not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('city_database')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/city_database/city.db

  Searched in:
    - 'C:\\Users\\epass/nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\share\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\epass\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************

In [17]:
#import nltk
#nltk.download('city_database')

In [18]:
countries = {
    country:city for city, country in ct.sql_query(
        "corpora/city_database/city.db",
        "SELECT City, Country FROM city_table"
    )
}

They look nice (and lowercased):

- watch for possible errors with composite names, like united_states

In [19]:
for c in countries:
    print(c)

greece
thailand
spain
east_germany
united_kingdom
india
belgium
romania
hungary
argentina
egypt
china
venezuela
united_states
west_germany
hongkong
turkey
indonesia
south_africa
pakistan
soviet_union
japan
peru
philippines
australia
mexico
italy
canada
france
south_korea
brazil
vietnam
chile
singapore
iran
austria
poland


I couldn't find Haiti:

- the countries list is not complete!

- the lookup raises `KeyError: 'haiti'`

In [20]:
#countries['haiti']

In [21]:
nogeo_words = [word for word in relevant_words if word not in countries]
nogeo_words

['weather',
 'update',
 'cold',
 'front',
 'cuba',
 'could',
 'pass',
 'haiti',
 'today']

Unfortunately, it's only a **demo**! We need something better for our project...

In [22]:
#df_cities = pd.read_csv('cities15000.txt', sep=';')
df_cities = pd.read_csv('cities15000.txt', sep='\t', header=None)
df_cities_15000 = df_cities[[1, 17]]
df_cities_15000.columns = ['City', 'Region']
df_cities_15000.head(5)

| | City | Region |
|---|---|---|
| 0 | les Escaldes | Europe/Andorra |
| 1 | Andorra la Vella | Europe/Andorra |
| 2 | Umm Al Quwain City | Asia/Dubai |
| 3 | Ras Al Khaimah City | Asia/Dubai |
| 4 | Zayed City | Asia/Dubai |


Tried this [here](https://data.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000%40public/information/?disjunctive.cou_name_en)

In [23]:
df_cities.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2020-03-03
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",25.56473,55.55517,P,PPLA,AE,,7,,,,62747,,2,Asia/Dubai,2019-10-24
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",25.78953,55.9432,P,PPLA,AE,,5,,,,351943,,2,Asia/Dubai,2019-09-09
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",23.65416,53.70522,P,PPL,AE,,1,103.0,,,63482,,124,Asia/Dubai,2019-10-24


Found country names on GitHub [here](https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv)

- a small trick and we have our own countries list!

In [24]:
df_countries = pd.read_csv('all.csv')
df_countries = df_countries['name'].apply(lambda x: x.lower())
countries = df_countries.tolist()
countries

['afghanistan',
 'åland islands',
 'albania',
 'algeria',
 'american samoa',
 'andorra',
 'angola',
 'anguilla',
 'antarctica',
 'antigua and barbuda',
 'argentina',
 'armenia',
 'aruba',
 'australia',
 'austria',
 'azerbaijan',
 'bahamas',
 'bahrain',
 'bangladesh',
 'barbados',
 'belarus',
 'belgium',
 'belize',
 'benin',
 'bermuda',
 'bhutan',
 'bolivia (plurinational state of)',
 'bonaire, sint eustatius and saba',
 'bosnia and herzegovina',
 'botswana',
 'bouvet island',
 'brazil',
 'british indian ocean territory',
 'brunei darussalam',
 'bulgaria',
 'burkina faso',
 'burundi',
 'cabo verde',
 'cambodia',
 'cameroon',
 'canada',
 'cayman islands',
 'central african republic',
 'chad',
 'chile',
 'china',
 'christmas island',
 'cocos (keeling) islands',
 'colombia',
 'comoros',
 'congo',
 'congo, democratic republic of the',
 'cook islands',
 'costa rica',
 "côte d'ivoire",
 'croatia',
 'cuba',
 'curaçao',
 'cyprus',
 'czechia',
 'denmark',
 'djibouti',
 'dominica',
 'dominican re

With this list I can eliminate many country names (though perhaps not all of them). In our case, they only add noise to the data.

In [25]:
nogeo_words = [word for word in relevant_words if word not in countries]
nogeo_words

['weather', 'update', 'cold', 'front', 'could', 'pass', 'today']
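
One limitation of filtering token by token: a composite name such as 'united states' can never match, because the tokenizer has already split it into two words. A minimal sketch of an alternative, assuming the names are removed from the lowercased text *before* tokenizing. The function name `remove_place_names` and the tiny list are hypothetical, not part of `udacourse2`.

```python
import re

def remove_place_names(text, names):
    """Remove whole-word occurrences of each name from the lowercased text."""
    # longest names first, so 'united states' wins over a shorter overlap
    ordered = sorted(names, key=len, reverse=True)
    pattern = r'\b(?:' + '|'.join(re.escape(name) for name in ordered) + r')\b'
    stripped = re.sub(pattern, ' ', text.lower())
    return ' '.join(stripped.split())  # collapse the leftover spaces


places = ['cuba', 'united states']  # toy list for illustration
print(remove_place_names('Cold front from Cuba to the United States', places))
print(remove_place_names('Cuban food', places))  # 'cuban' is kept: no whole-word match
```

The word boundaries keep related words such as 'cuban' in the text, which may or may not be what the classifier needs.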

First test:
    
- over the first message only

In [26]:
message = 'Weather update - a cold front from Cuba that could pass over Haiti'
tokens = udacourse2.fn_tokenize_fast(msg_text, 
                                     verbose=True)

['update', 'today', 'pass', 'weather', 'cold', 'haiti', 'front', 'cuba']


In [27]:
message = 'Weather update - a cold front from Cuba that could pass over Haiti'
tokens = udacourse2.fn_tokenize(msg_text, 
                                lemmatize=True, 
                                rem_city=True, 
                                agg_words=True,
                                rem_noise=True,
                                elm_short=3,
                                verbose=True)

tokens

###Tokenizer function started
process time:0.0205 seconds
Tokens-start:79, token/stop:9, remove cities:7 &noise:7
 +lemmatizer:7
 +eliminate short:7


['could', 'update', 'today', 'pas', 'weather', 'cold', 'front']

It's not so good yet; some noise still appears among the lemmatized words:

- an "l" was found, as in **French words** like *l'orange*;

- my **City** filter needs a lot of improvement, as it didn't filter avenues and many other **geographic** references;

- it let through a lot of useless words of **two** letters or fewer, such as **u** and **st**;

- a lot of noisy words such as **help**, **thanks** and **please** were found;

- several words are **repeated** within some messages, like ['river', ... 'river', ...]

Basic test call

- only for the first messages (the loop below stops after about 200), verbose

In [28]:
b_start = time()

i = 0
for message in X:
    out = udacourse2.fn_tokenize_fast(message, 
                                      verbose=True)
    i += 1
    if i > 200: #it's only a test, you can adjust the limit!
        break

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

['update', 'pass', 'weather', 'cold', 'haiti', 'front', 'cuba']
['hurricane']
['name', 'looking']
['desperately', 'destroyed', 'hospital', 'leogane', 'croix', 'needs', 'supplies', 'functioning', 'reports']
['today', 'tonight', 'side', 'haiti', 'rest', 'says']
['national', 'information']
['storm']
['silo', 'water', 'need', 'tents', 'please']
['messages', 'receive']
['issues', 'health', 'workers', 'croix', 'des', 'bouquets']
['water', 'starving', 'nothing', 'eat', 'thirsty']
['petionville', 'regarding', 'information', 'need']
['water', 'thomassin', 'desperately', 'pyron', 'need']
['delma', 'didine', 'need', 'together', 'food']
['order', 'participate', 'use', 'information']
['water', 'temporary', 'delmas', 'impasse', 'medications', 'clothes', 'charite', 'need', 'tents', 'comitee', 'shelter', 'dire', 'food', 'please']
['water', 'chretien', 'extension', 'impasse', 'dying', 'need', 'klecin', 'extended', 'hungry', 'sick', 'food', 'hunger']
['want', 'call']
['understand', 'use']
['earthquake']

['american', 'tomorrow', 'embassy', 'open']
['firm', 'south', 'architects', 'african', 'wondering', 'construction', 'stockholders', 'needs', 'civilians', 'moment', 'recruited']
['done', 'underneath', 'jacmel', 'rubbles', 'trapped', 'help', 'college', 'latriniti']
['enter', 'electricity', 'houses']
['kinds', 'vile', 'ona', 'route', 'bon', 'repos', 'need', 'help', 'really']
['muguet', 'please', 'petion']
['else', 'services', 'call', 'information', 'need', 'help', 'translators', 'distributors', 'food', 'beeing', 'write', 'please']
['political', 'hinche', 'true', 'held']
['french', 'creole', 'need', 'half', 'speak', 'field', 'english']
['financed', 'unrecognized', 'bid', 'initial', 'truncated', 'accelerated', 'launched', 'characterse', 'menfp', 'organized', 'formation']
['truncated', 'continues', 'education', 'regarding']
['desperately', 'help', 'needs', 'humiliated', 'bizoton']
['car', 'leave', 'find']
['water', 'without', 'delmas', 'something', 'live', 'finished', 'morning', 'food', 'ple

In [29]:
b_start = time()

i = 0
for message in X:
    print(message)
    out = udacourse2.fn_tokenize(message, 
                                 lemmatize=True, 
                                 rem_city=True, 
                                 agg_words=True,
                                 rem_noise=True,
                                 elm_short=3,
                                 great_noisy=True,
                                 verbose=True)
    print(out)
    print()
    i += 1
    if i > 20: #it's only a test, you can adjust the limit!
        break

b_spent = time() - b_start
print('process time:{:.4f} seconds'.format(b_spent))

Weather update - a cold front from Cuba that could pass over Haiti
###Tokenizer function started
process time:0.0150 seconds
Tokens-start:66, token/stop:8, remove cities:6 &noise:6
 +lemmatizer:6
 +eliminate short:6
 +eliminate noisy from 300:6
['could', 'update', 'pas', 'weather', 'cold', 'front']

Is the Hurricane over or is it not over
###Tokenizer function started
process time:0.0135 seconds
Tokens-start:39, token/stop:1, remove cities:1 &noise:1
 +lemmatizer:1
 +eliminate short:1
 +eliminate noisy from 300:1
['hurricane']

Looking for someone but no name
###Tokenizer function started
process time:0.0120 seconds
Tokens-start:31, token/stop:3, remove cities:3 &noise:2
 +lemmatizer:2
 +eliminate short:2
 +eliminate noisy from 300:2
['name', 'looking']

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
###Tokenizer function started
process time:0.0135 seconds
Tokens-start:100, token/stop:11, remove cities:11 &noise:10
 +lemmatizer:10


Don't run this one! (complete tokenizer)

- it's a slow test! (it takes about 221 seconds to tokenize the whole dataframe)

In [30]:
#b_start = time()

#X_tokens = X.apply(lambda x: udacourse2.fn_tokenize(x, 
#                                                    lemmatize=True, 
#                                                    rem_city=True, 
#                                                    agg_words=True,
#                                                    rem_noise=True,
#                                                    elm_short=3,
#                                                    great_noisy=True,
#                                                    verbose=False))

#b_spent = time() - b_start
#print('process time:{:.0f} seconds'.format(b_spent))

- this is a much faster test (it takes about 46 seconds to run; the run shown below took 55 seconds)

- the secret is that it loops only once per row, as it condenses all the filters into a single loop

In [31]:
b_start = time()

X_tokens = X.apply(lambda x: udacourse2.fn_tokenize_fast(x, 
                                                         verbose=False))

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

process time:55 seconds


Now I have a **series** with all my tokenized messages:

In [32]:
X_tokens.head(5)

0    [update, pass, weather, cold, haiti, front, cuba]
1                                          [hurricane]
2                                      [name, looking]
3    [desperately, destroyed, hospital, leogane, cr...
4            [today, tonight, side, haiti, rest, says]
Name: message, dtype: object

And I can filter it for rows that have an **empty list**:
    
- solution found [here](https://stackoverflow.com/questions/29100380/remove-empty-lists-in-pandas-series)

In [33]:
X_tokens[X_tokens.str.len() == 0]

2522     []
2678     []
4487     []
5347     []
5709     []
5737     []
6152     []
6153     []
6229     []
7190     []
7266     []
7559     []
7751     []
7807     []
8891     []
8901     []
9650     []
9863     []
12221    []
12225    []
12258    []
Name: message, dtype: object

In [34]:
ser2 = X_tokens[X_tokens.str.len() > 0]
ser2

0        [update, pass, weather, cold, haiti, front, cuba]
1                                              [hurricane]
2                                          [name, looking]
3        [desperately, destroyed, hospital, leogane, cr...
4                [today, tonight, side, haiti, rest, says]
                               ...                        
26240    [rice, locally, meals, fish, demonstrated, ene...
26241    [candidate, contract, month, starting, jakarta...
26242    [chokoria, rice, assessment, lentils, operatin...
26243    [women, elections, harcourt, port, conduct, co...
26244    [recognizing, radical, humanitarian, came, cri...
Name: message, Length: 26224, dtype: object

In [35]:
b_start = time()

dic_tokens = udacourse2.fn_subcount_lists(column=X_tokens, 
                                          verbose=False)

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

process time:0 seconds


Sorted dictionary [here](https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value)

In [36]:
dic_tokens

d_tokens = dic_tokens['elements']
t_sorted = sorted(d_tokens.items(), key=lambda kv: kv[1], reverse=True)

if t_sorted:
    print('data processed')

data processed


Sorted list of tuples with the most frequent tokens:

- showing the 300 most frequent elements

In [37]:
t_sorted[:300]

[('food', 2528),
 ('water', 2413),
 ('help', 2344),
 ('need', 2003),
 ('please', 1901),
 ('earthquake', 1705),
 ('government', 930),
 ('haiti', 926),
 ('areas', 913),
 ('find', 873),
 ('information', 842),
 ('relief', 745),
 ('aid', 726),
 ('affected', 705),
 ('sandy', 702),
 ('health', 644),
 ('children', 591),
 ('work', 561),
 ('tents', 532),
 ('emergency', 531),
 ('give', 521),
 ('flood', 513),
 ('want', 510),
 ('supplies', 508),
 ('last', 500),
 ('international', 494),
 ('house', 488),
 ('power', 477),
 ('still', 471),
 ('rain', 467),
 ('million', 467),
 ('hit', 454),
 ('rains', 454),
 ('support', 444),
 ('santiago', 441),
 ('school', 432),
 ('disaster', 429),
 ('heavy', 429),
 ('assistance', 425),
 ('victims', 420),
 ('hurricane', 419),
 ('shelter', 419),
 ('family', 414),
 ('storm', 412),
 ('live', 412),
 ('destroyed', 410),
 ('floods', 396),
 ('families', 393),
 ('south', 389),
 ('medical', 384),
 ('houses', 383),
 ('national', 379),
 ('port', 377),
 ('united', 376),
 ('years', 
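
I don't show the internals of `udacourse2.fn_subcount_lists` here, but the same kind of ranking can be sketched with `collections.Counter` from the standard library, which counts and sorts in one go. The token lists below are toy data.

```python
from collections import Counter

# toy stand-in for the series of tokenized messages
token_lists = [
    ['water', 'food', 'need'],
    ['water', 'tents'],
    [],                      # an empty message contributes nothing
    ['food', 'water'],
]

# count every token over all the messages
counts = Counter(token for tokens in token_lists for token in tokens)

# most frequent first, as a list of (token, count) tuples
top = counts.most_common(2)
print(top)
```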

Modifying the **tokenize** function so that it takes in a list of less meaningful tokens to discard:

- **ver 1.2** created

In [38]:
great_noisy = ['people', 'help', 'need', 'said', 'country', 'government', 'one', 'year', 'good', 'day',
    'two', 'get', 'message', 'many', 'region', 'city', 'province', 'road', 'district', 'including', 'time',
    'new', 'still', 'due', 'local', 'part', 'problem', 'may', 'take', 'come', 'effort', 'note', 'around',
    'person', 'lot', 'already', 'situation', 'see', 'response', 'even', 'reported', 'caused', 'village', 'bit',
    'made', 'way', 'across', 'west', 'never', 'southern', 'january', 'least', 'zone', 'small', 'next', 'little',
    'four', 'must', 'non', 'used', 'five', 'wfp', 'however', 'com', 'set', 'every', 'think', 'item', 'yet', 
    'carrefour', 'asking', 'ask', 'site', 'line', 'put', 'unicef', 'got', 'east', 'june', 'got', 'ministry']

---

#### Older attempt to clear tokens

I tried to isolate some words that I consider noisy, for exclusion:

- general geographic references, such as **area** and **village**;

- social communication words, such as **thanks** and **please**;

- religious expressions, such as **pray**;

- words with little meaning, such as **thing** and **like**;

- tokenize function **ver 1.1** created;

- I visually filtered out some words that I think don't add much to the **Machine Learning**;

- just think about it: would you prefer your **AI** trained on 'thanks' or on 'hurricane'?

- I'm really not 100% sure about these words, but my **tokenize** function can enable or disable this list, so I can retrain the model and see whether performance increases or decreases

In [39]:
unhelpful_words = ['thank', 'thanks', 'god', 'fine', 'number', 'area', 'let', 'stop', 'know', 'going', 'thing',
    'would', 'hello', 'say', 'neither', 'right', 'asap', 'near', 'want', 'also', 'like', 'since', 'grace', 
    'congratulate', 'situated', 'tell', 'almost', 'hyme', 'sainte', 'croix', 'ville', 'street', 'valley', 'section',
    'carnaval', 'rap', 'cry', 'location', 'ples', 'bless', 'entire', 'specially',  'sorry', 'saint', 'village', 
    'located', 'palace', 'might', 'given']

Testing **eliminate duplicates**:

In [40]:
test = ['addon', 'place', 'addon']
test = list(set(test))
test

['addon', 'place']
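
A note on this test: a `set` does not guarantee any particular order, so the tokens of a message may come back shuffled. If the original word order matters, a small alternative (my suggestion, not what the test above does) is `dict.fromkeys`, which keeps the first occurrence of each word in place.

```python
words = ['place', 'addon', 'place', 'river', 'addon']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(words))

# when only a stable, reproducible order is needed, sorting the set also works
unique_sorted = sorted(set(words))

print(unique_in_order)
print(unique_sorted)
```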

Testing **eliminate short words**:

In [41]:
min = 3
list2 = []
test2 = ['addon', 'l', 'us', 'place']

for word in test2:
    if len(word) < min:
        print('elliminate:', word)
    else: 
        list2.append(word)
    
list2

elliminate: l
elliminate: us


['addon', 'place']

solution [here](https://stackoverflow.com/questions/3501382/checking-whether-a-variable-is-an-integer-or-not)

In [42]:
if isinstance(min, int):
    print('OK')

OK


Now I have two **Tokenizer** functions:

- `fn_tokenize` $\rightarrow$ it lets me test each method individually and contains all the methods described, but it is a bit slow, as it iterates over all the words again for each method

- `fn_tokenize_fast` $\rightarrow$ it is a **boosted** version with only one iteration, so it runs faster, but you cannot switch each method on or off individually for more precise tests
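
To make the difference concrete, here is a minimal sketch of the single-iteration idea. It is not the code of `fn_tokenize_fast`; the function name `tokenize_single_pass` and the three tiny word sets are hypothetical stand-ins. Every filter is a `set` lookup inside the same loop, so each word is visited only once.

```python
import re

# tiny hypothetical filter sets, only for illustration
STOPWORDS = {'a', 'the', 'from', 'that', 'over', 'is', 'it', 'or', 'not'}
NOISE = {'please', 'thanks'}
PLACES = {'cuba', 'haiti'}

def tokenize_single_pass(text, min_len=3):
    """Apply all the filters in one loop over the words of the text."""
    seen = set()    # words already emitted (removes repetitions)
    tokens = []
    for word in re.findall(r'[a-z]+', text.lower()):
        if (len(word) < min_len or word in STOPWORDS or word in NOISE
                or word in PLACES or word in seen):
            continue
        seen.add(word)
        tokens.append(word)
    return tokens


print(tokenize_single_pass(
    'Weather update - a cold front from Cuba that could pass over Haiti'))
```

Using sets instead of lists for the filters matters too: a membership test on a set takes roughly constant time, while on a list it scans every element.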

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.



---

### A small review over each item for our first machine learning pipelines

#### Feature Extraction

Feature Extraction from SKlearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

*"Convert a collection of text documents to a matrix of token counts"*

- we are looking for **tokens** that will be turned into **vectors** for a Machine Learning model;

- each message becomes a row of a **matrix** and each token a column, and each cell holds the number of times that token occurs in that message.

"This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix."

- matrix representations of real-world data are normally quite **sparse**

- in this case, to save some memory, the documentation indicates the use of a proper sparse representation

"If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data."

- we have already done this, drastically reducing the **variability** of terms

- it is represented by our **fn_tokenize** function
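
To see what "a matrix of token counts" means, here is a pure-Python illustration on three toy messages. It is only a sketch of the idea and not scikit-learn's implementation, which stores the same information in a sparse matrix.

```python
# three toy tokenized messages
docs = [
    ['water', 'food', 'water'],
    ['storm', 'rain'],
    ['food', 'rain', 'rain'],
]

# one column per distinct token
vocabulary = sorted({token for doc in docs for token in doc})
index = {token: col for col, token in enumerate(vocabulary)}

# one row per message, each cell counts the token in that message
matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for token in doc:
        row[index[token]] += 1
    matrix.append(row)

zeros = sum(cell == 0 for row in matrix for cell in row)

print(vocabulary)
for row in matrix:
    print(row)
print('zero cells:', zeros, 'of', len(docs) * len(vocabulary))
```

Even in this tiny case half of the cells are zero, which is why a sparse representation pays off with a real vocabulary.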

#### Preprocessing

TF-IDF from SKlearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)

- **tf** is about **term frequency** and;

- **idf** is about **inverse document frequency**.

*"Transform a count matrix to a normalized tf or tf-idf representation"*

- it means that it basically **normalizes** the count matrix

*Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.*

- it takes the term frequency and **rescales** it by the general document frequency

*The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.*

- the idea is not to give too much weight to a **noisy** and very frequent word

- we tried to eliminate some of the **noisy** words "manually", but as the number of tokens is so high, it's nearly impossible to do a good job

#### Training a Machine Learning model

As we have **labels**, a good strategy is to use **supervised learning**

- we could try to make **clusters** of messages using **unsupervised learning**, or try some **semi-supervised learning** strategy, as some of the messages (40) don't have any classification;

- the most obvious way is to train a **Classifier**;

- as we have multiple labels, a **Multi Target Classifier** seems to be the best choice.

Multi target classification [here](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)

*"This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification"*

- OK, we will basically be training on one **slice** of the labels at a time (one classifier per label column), as not many **Machines** natively support multi-target.
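
Putting the three pieces together, here is a minimal sketch of such a pipeline on toy data, assuming the messages are already tokenized (as lists of words). Passing a callable as `analyzer` makes `CountVectorizer` use the token lists as they are; the `identity` helper, the toy messages and the two toy labels are my own illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB


def identity(tokens):
    """The messages are already tokenized, so nothing is left to analyze."""
    return tokens


# toy tokenized messages and two toy label columns: [aid, weather]
docs = [['water', 'food'], ['storm', 'rain'], ['water', 'tents'], ['rain', 'flood']]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer=identity)),     # token counts
    ('tfidf', TfidfTransformer()),                    # rescaled counts
    ('clf', MultiOutputClassifier(MultinomialNB())),  # one classifier per label
])

pipeline.fit(docs, labels)
pred = pipeline.predict([['water']])
print(pred)
```

A named function is used instead of a `lambda` so that the fitted pipeline can still be pickled later.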

## I. Prepare the data

Perform the last operations to prepare the dataset for **Machine Learning** training

For **training** data, a row in which all the labels are blank is a **data inconsistency**

- so we have 6,317 rows that we need to **remove** before **training**

In [43]:
print('all labels are blank in {} rows'.format(df[df['if_blank'] == 1].shape[0]))

all labels are blank in 6317 rows


In [44]:
df = df[df['if_blank'] == 0]
df.shape[0]

19928

Verifying if removal was complete

In [45]:
if df[df['if_blank'] == 1].shape[0] == 0:
    print('removal complete!')
else:
    raise Exception('something went wrong with rows removal before training')

removal complete!


What is this **crazy thing** over here? 

>- I create a **provisory** column and fill it with the **tokenized** messages
>- Why do I need it for now? Just for removing rows that are **impossible to train** on
>- After tokenization, if I get an **empty list**, I need to remove that row before training

In [46]:
start = time()

try:
    df = df.drop('tokenized', axis=1)
except KeyError:
    print('OK')

#inserting a provisory column
df.insert(1, 'tokenized', np.nan)

#tokenizing over the provisory
df['tokenized'] = df.apply(lambda x: udacourse2.fn_tokenize_fast(x['message']), axis=1)

#removing NaN over provisory (if any still exist)
df = df[df['tokenized'].notnull()]

spent = time() - start
print('process time:{:.0f} seconds'.format(spent))

df.head(1)

OK
process time:48 seconds


Unnamed: 0,message,tokenized,original,genre,if_blank,related,request,offer,aid_related,medical_help,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,"[update, pass, weather, cold, haiti, front, cuba]",Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Filtering empty lists in the `provisory` column, found [here](https://stackoverflow.com/questions/42964724/pandas-filter-out-column-values-containing-empty-list)

And another **crazy thing**: I changed my mind about removing the `provisory` tokenized column:

>- why? Just because I have already **tokenized** my **X** subdataset, so I will not need to do it again later!
>- and if I do things **wisely**, I will speed up the pipeline, as I have already done the hard work for the **CountVectorizer**
>- it will also make it easier to **train** several Classifiers, as I save a lot of repeated processing by doing it **early** in my process!

In [47]:
empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
print('found {} rows with no tokens'.format(empty_tokens))

df = df[df['tokenized'].apply(lambda x: len(x)) > 0]
empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
print('*after removal, found {} rows with no tokens'.format(empty_tokens))

#I will not drop it anymore!
#try:
#    df = df.drop('provisory', axis=1)
#except KeyError:
#    print('OK')

#Instead, I will drop 'message' column
try:
    df = df.drop('message', axis=1)
except KeyError:
    print('OK')

print('now I have {} rows to train'.format(df.shape[0]))
df.head(1)

found 6 rows with no tokens
*after removal, found 0 rows with no tokens
now I have 19922 rows to train


Unnamed: 0,tokenized,original,genre,if_blank,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,"[update, pass, weather, cold, haiti, front, cuba]",Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## II. Break the data

Break the dataset into the **training columns** and **labels** (if it has **multilabels**)

X is the **Training Text Column**:

- if I look at the potential training data really carefully, I could use the `genre` column as training data too!

- or I could also use the `related`, `request`, `offer` columns for training the `aid_related` label

*A discussion of how much these **Label** columns are **hierarchically defined** comes later in this notebook*

---

For the moment, I am using only the tokens extracted from `message` (the `tokenized` column) as training data

In [48]:
X = df['tokenized']
X.head(1)

0    [update, pass, weather, cold, haiti, front, cuba]
Name: tokenized, dtype: object

Y is constituted by the **Classification Labels**

*Updated 1 - removed the `related` column from the Labels dataset. Why? Because when I look at the statistics, it is always `1`. So it is causing problems when training our Classifier*

>- was: `y = df[df.columns[4:]]`
>- now: `y = df[df.columns[5:]]`

*Updated 2 - removed columns that contain **only zeroes**. Why? Because it is **impossible to train** our Classifier on them!*

In [49]:
y = df[df.columns[5:]]

remove_lst = []

for column in y.columns:
    col = y[column]
    if (col == 0).all():
        print('*{} -> only zeroes training column!'.format(column))
        remove_lst.append(column)
    else:
        #print('*{} -> column OK'.format(column))
        pass
print(remove_lst)

y = y.drop(remove_lst, axis=1)

y.head(1)

*child_alone -> only zeroes training column!
['child_alone']


Unnamed: 0,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
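
The loop above can also be written as a single boolean mask. This is only a minimal sketch on a made-up label frame (`y_demo` and its values are assumptions for illustration, the column names just mirror the ones in this notebook):

```python
import pandas as pd

#toy label frame, made up for illustration only
y_demo = pd.DataFrame({'request':     [0, 1, 0],
                       'child_alone': [0, 0, 0],
                       'offer':       [1, 0, 0]})

#columns where every value is zero cannot be trained
only_zeroes = y_demo.columns[(y_demo == 0).all(axis=0)].tolist()
print(only_zeroes)

#keep only the columns that have at least one non-zero value
y_clean = y_demo.loc[:, (y_demo != 0).any(axis=0)]
print(y_clean.columns.tolist())
```

The result should be the same as with the explicit loop; the loop version is easier to extend with per-column messages.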


## III. Split the data 

Into **Train** and **Test** subdatasets

- let´s start with **20%** of test data

- I was not using the **random_state** setting (and why **42**? I personally think it is a reference to the book **The Hitchhiker´s Guide to the Galaxy**, by Douglas Adams!)

*Note update: now I am using **random_state**, to ensure the same results on each function call*
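
To see what **random_state** buys, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are assumptions, not the project data): two calls with the same seed should give exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

#toy data, made up for illustration only
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

#two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = np.array_equal(a_test, b_test)
print('same test rows on both calls:', same_split)
print('test rows:', a_test.shape[0], 'of', X_demo.shape[0])
```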

In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

And it looks OK:

In [51]:
X_train.shape[0] + X_test.shape[0]

19922

## IV. Choose your first Classifier 

- and build a **Pipeline** for it

Each Pipeline is a Python object whose **methods**, such as **fit()**, can be called

---

What **Classifier** to choose?

- **Towards Data Science** gives us some tips [here](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

---

Start with a **Naïve Bayes** (NB)

`clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)`

In a Pipeline way:

>- I should use `CountVectorizer(tokenizer=udacourse2.fn_tokenize_fast)`, but I will **not**!
>- why? Because I already did the **tokenization** in an earlier step
>- so, how to get around this hellish `tokenizer=...` parameter?
>- I found a clever solution [here](https://stackoverflow.com/questions/35867484/pass-tokens-to-countvectorizer)
>- so, I prepared a **dummy** function to bypass the tokenizer in **CountVectorizer**
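
A minimal sketch of the **dummy** trick on two made-up, already tokenized documents (`docs_tokens` is an assumption for illustration). Passing `token_pattern=None` is my addition here, only to keep scikit-learn from warning that the pattern is unused:

```python
from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    #identity: the document is already a list of tokens
    return doc

#made-up pre-tokenized documents
docs_tokens = [['weather', 'cold', 'front'],
               ['water', 'food', 'water']]

vect = CountVectorizer(tokenizer=dummy, preprocessor=dummy, token_pattern=None)
counts = vect.fit_transform(docs_tokens)

#vocabulary learned directly from the token lists
vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.toarray())
```

As the preprocessor is replaced too, no lowercasing happens inside the vectorizer, so the tokens must already be clean when they arrive.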

First I tried to set the Classifier as **MultinomialNB()**, and it crashed:

>- only **one** Label to be trained was expected, and there were 36 Labels!
>- reading the SKlearn documentation, it became clear that if your Classifier algorithm was not originally built for **multiple labels**, it is necessary to run it **n** times, once for each label
>- so it is necessary to include this in our pipeline, wrapping the Classifier in the `MultiOutputClassifier()` meta-estimator
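
The crash and its fix can be reproduced on a tiny scale. This is just a sketch with made-up count features and two label columns (`X_demo`, `Y_demo` are assumptions, not the project data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier

#toy count features (4 samples, 3 words) and 2 label columns, made up
X_demo = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
Y_demo = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

#MultinomialNB alone expects a 1d target and refuses the 2d one
try:
    MultinomialNB().fit(X_demo, Y_demo)
    single_failed = False
except ValueError as err:
    single_failed = True
    print('MultinomialNB alone:', err)

#wrapped, one MultinomialNB is fitted per label column
multi = MultiOutputClassifier(MultinomialNB()).fit(X_demo, Y_demo)
pred = multi.predict(X_demo)
print(pred.shape, len(multi.estimators_))
```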

*And... it looks pretty **fast** to train, doesn´t it? What is the secret? We are **bypassing** the tokenizer and preprocessor, as we **already applied** them to the dataset!*

*Another thing: we are not using the **whole** dataset... it´s because of a little **issue** we have, as there are a lot of **missing labels** in the dataset! And in my view, they would **distort** our training! (later I will compare the results with training on the **raw** dataset)*

**Naïve Bayes** is known as a very **fast** method:

>- but it is also known for being not so **accurate**
>- and it has only a **few** parameters for later refinement

In [82]:
start = time()

def dummy(doc):
    return doc

#Naïve Bayes classifier pipeline
pipeline_mbnb = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf',  MultiOutputClassifier(MultinomialNB()))])
                          #('clf', MultinomialNB())]) #<-my terrible mistake!
#remembering:
#CountVectorizer -> makes the count for tokenized vectors
#TfidfTransformer -> makes the weight "normalization" for word occurrences
#MultinomialNB -> is my Classifier

#fit pipeline_mbnb (our first Classifier model)
fit_mbnb = pipeline_mbnb.fit(X_train, y_train)

spent = time() - start
print('NAÏVE BAYES - process time:{:.2f} seconds'.format(spent))

TypeError: __init__() got an unexpected keyword argument 'random_state'

If I want, I can see the parameters for my **Pipeline**, using this command

In [53]:
#pipeline_mbnb.get_params()

## V. Run metrics for it

Predicting using **Naïve Bayes** Classifier

And I got this **weird** warning message:

"**UndefinedMetricWarning:**" 

>- "Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples"
>- "Use `zero_division` parameter to control this behavior"

And searching, I found this explanation [here](https://stackoverflow.com/questions/43162506/undefinedmetricwarning-f-score-is-ill-defined-and-being-set-to-0-0-in-labels-wi)

>- it is not a **weird error** at all. Some labels were never predicted when running the Classifier
>- so the report doesn´t know how to compute precision and F-score for them

"What you can do, is decide that you are not interested in the scores of labels that were not predicted, and then explicitly specify the labels you are interested in (which are labels that were predicted at least once):"

`metrics.f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))`

#### Dealing with this issue

**First**, I altered my function `fn_plot_scores` so that it does not make comparisons over an empty (**not trained**) column of `y_pred`

And to check whether all predicted values are **zeroes**, see [here](https://stackoverflow.com/questions/48570797/check-if-pandas-column-contains-all-zeros)

And I was using a **general** calculation for Accuracy in my function. The problem is: **zeroes** matched against **zeroes** count as correct, pushing my Accuracy towards **1** and distorting it to a higher (**unreal**) value:

>- so, for general model Accuracy, I cannot use `accuracy = (y_pred == y_test.values).mean()`
>- I am using `f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))` instead
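
The distortion is easy to show with a made-up rare label (the 95/5 proportion below is an assumption for illustration): a model that only predicts **zeroes** still gets a high Accuracy, while the f1-score for the positive class is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

#made-up rare label: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
#a "model" that never predicts the positive class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
#zero_division=0 silences the UndefinedMetricWarning and returns 0.0
f1_pos = f1_score(y_true, y_pred, zero_division=0)

print('accuracy: {:.2f}'.format(acc))
print('f1 for the positive class: {:.2f}'.format(f1_pos))
```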

In [54]:
y_pred = pipeline_mbnb.predict(X_test)
udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*request -> label iloc[0]
              precision    recall  f1-score   support

           0       0.86      0.97      0.91      3088
           1       0.83      0.45      0.59       897

    accuracy                           0.86      3985
   macro avg       0.85      0.71      0.75      3985
weighted avg       0.85      0.86      0.84      3985

######################################################
*offer -> label iloc[1]
  - as y_pred has only zeroes, report is not valid
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.78      0.48      0.60      1819
           1       0.67      0.89      0.76      2166

    accuracy                           0.70      3985
   macro avg       0.72      0.68      0.68      3985
weighted avg       0.72      0.70      0.69      3985

################

Model Accuracy is distorted by **false fitting** (zeroes over zeroes)

Manually, I could find the true mean as about **82.3%**

In [55]:
import statistics

real_f1 = [.78, .86, .83, .85, .80, .83, .81, .91, .86, .69, .83]
statistics.mean(real_f1)

0.8227272727272728
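
Instead of typing the values by hand, the same kind of mean could be computed in a loop. This is only a sketch: `mean_weighted_f1` is a hypothetical helper (it is not part of `udacourse2`), and skipping the columns where only **zeroes** were predicted is my assumption of how the manual list was built.

```python
import numpy as np
from sklearn.metrics import f1_score

def mean_weighted_f1(y_true, y_pred):
    '''Hypothetical helper: mean of the weighted f1-score over label columns,
    skipping columns where only zeroes were predicted.'''
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for i in range(y_true.shape[1]):
        if not y_pred[:, i].any():
            continue #untrained/empty prediction column, report not valid
        scores.append(f1_score(y_true[:, i], y_pred[:, i],
                               average='weighted', zero_division=0))
    return float(np.mean(scores)) if scores else float('nan')

#made-up example: second column is never predicted, so it is skipped
y_true_demo = [[1, 0], [0, 0], [1, 1], [0, 0]]
y_pred_demo = [[1, 0], [0, 0], [1, 0], [0, 0]]
demo_score = mean_weighted_f1(y_true_demo, y_pred_demo)
print(demo_score)
```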

#### Criticism of the performance of my Classifier

I know what you are thinking: "Uh, there is something **wrong** with the Accuracy of this guy"

So, as you can see: **92.3%** is too high for a **Naïve Bayes Classifier**!

There are some explanations here:

>- if you read it with care, you will find this **weird** label `related`. It seems to be **positive** for every row of my dataset, so it distorts the average towards a **higher** value
>- if you look at each **weighted avg**, you will find some clearly **bad** values, such as **68%** for **aid_related** (if you think about it, the model guesses well for this label in only about **2/3** of the cases... so a really **bad** performance)

*Updated 1: when I removed the `related` column, my **Model Accuracy** fell to **56.1%**. Normally my Labels hold something like a **75-78%** f1-score. Now I think that these **untrainable columns** are pulling my average Accuracy down!*

---

But there is another **criticism** of this data.

I am an **Engineer** by profession, and I have worked for almost **19** years in a **hydrology** datacenter for the Brazilian Government. So, in some cases, you see some data and start thinking: "this data is not what it seems".

And the main problem with this data is:

>- it is a **mistake** to think that all we need to do with it is to train a **Supervised Learning** machine!
>- if you look with care, this is not about **Supervised Learning**, it is actually a **Semi-Supervised Learning** problem. Why?
>- just consider that there were **zillions** of Twitter messages about catastrophes all around the world. When a message was not originally in English, they translated it. Then someone manually **labeled** each of these catastrophe reports. And a **lot** of them were left with **no classification**
>- if I just interpret it as a **Supervised Learning** challenge, I will feed my Classifier with a lot of **false negatives**. And my Machine Learning Model will learn to **leave blank** a lot of these messages, as it was trained on my **raw** data!

So in the **preprocessing** step, I avoided **unlabelled data**, filtering out of the training set every row that does not contain any label. They were clearly **neglected** for labeling when manually processed!
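
The filter for **unlabelled** rows could look like the sketch below. The frame `df_demo` and the short `label_cols` list are made up for illustration; in this notebook the real filtering happens in the earlier preprocessing step.

```python
import pandas as pd

#made-up frame: the second message received no label at all
df_demo = pd.DataFrame({'message': ['need water', 'hello there', 'storm coming'],
                        'request': [1, 0, 0],
                        'storm':   [0, 0, 1]})
label_cols = ['request', 'storm']

#a row is considered labelled if at least one label column is set
has_label = df_demo[label_cols].sum(axis=1) > 0
df_labelled = df_demo[has_label]

print('kept {} of {} rows'.format(df_labelled.shape[0], df_demo.shape[0]))
```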




## VI. Try other Classifiers

- I will try some Classifiers based on a **hierarchical structure**:

>- why a **hierarchical structure** for words? Because I think we do it **naturally** in our brain
>- when science mimics nature, I personally think that things go better. So, let´s try it!

The first of them is the **Random Forest** Classifier

>- as I am treating **RFC** as a **single-label** Classifier (scikit-learn´s version can also accept several label columns directly), we need to call it **n** times, once for each label to be classified
>- so, we need to call it indirectly, using the **Multi-Output** Classifier tool
>- it took **693.73 seconds** (about 11 minutes and 34 seconds) to complete the task (not so bad!)
>- I tried to configure a **GridSearch**, just to set the number of processors to `-1` (meaning the **maximum** number)

Accuracy was near **93%** before removing the `related` label. Now it stays at **93.8%**. So, it doesn´t matter!
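
A grid search is not the only way to use several cores: `RandomForestClassifier` takes an `n_jobs` parameter itself (and so does `MultiOutputClassifier`). A minimal sketch on random made-up data (`X_demo`, `Y_demo` and the small `n_estimators=10` are assumptions to keep it quick):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

#random toy data: 40 samples, 5 features, 3 binary labels
rng = np.random.RandomState(42)
X_demo = rng.rand(40, 5)
Y_demo = (rng.rand(40, 3) > 0.5).astype(int)

#n_jobs=-1 asks each forest to build its trees on all available cores
clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42))
clf.fit(X_demo, Y_demo)

pred = clf.predict(X_demo)
print(pred.shape)
```

Whether this actually shortens the 11 minutes on the real dataset depends on the machine, so it would need to be timed.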

In [81]:
start = time()

def dummy(doc):
    return doc

pipeline_rafo = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42)))])
fit_rafo = pipeline_rafo.fit(X_train, y_train)

#an attempt to use multiple cores to process the task

#param_grid = 
#gs_clf = GridSearchCV(pipeline_rafo, parameters, n_jobs=-1)
#gs_clf = gs_clf.fit(X_train, y_train)

spent = time() - start
print('RANDOM FOREST - process time:{:.2f} seconds'.format(spent))

TypeError: __init__() got an unexpected keyword argument 'random_state'

In [57]:
y_pred = pipeline_rafo.predict(X_test)
udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*request -> label iloc[0]
              precision    recall  f1-score   support

           0       0.90      0.95      0.92      3088
           1       0.78      0.64      0.70       897

    accuracy                           0.88      3985
   macro avg       0.84      0.79      0.81      3985
weighted avg       0.87      0.88      0.87      3985

######################################################
*offer -> label iloc[1]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3956
           1       1.00      0.03      0.07        29

    accuracy                           0.99      3985
   macro avg       1.00      0.52      0.53      3985
weighted avg       0.99      0.99      0.99      3985

######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3890
           1       0.80      0.13      0.22        95

    accuracy                           0.98      3985
   macro avg       0.89      0.56      0.60      3985
weighted avg       0.97      0.98      0.97      3985

######################################################
*other_weather -> label iloc[32]
              precision    recall  f1-score   support

           0       0.94      1.00      0.96      3716
           1       0.50      0.06      0.11       269

    accuracy                           0.93      3985
   macro avg       0.72      0.53      0.54      3985
weighted avg       0.91      0.93      0.91      3985

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   support

           0       0.84      0.93      0.88      2960
           1       0.71      0.49      0.58      1025

    ac

Another tree-like Classifier is **Adaboost**:

>- they say Adaboost is especially good at **differentiating** positives and negatives
>- it took **106.16 seconds** (about 1 minute and 46 seconds) to complete the task... not so bad... (as AdaBoost doesn´t use full **trees**, but **stumps**, to do its job)

Accuracy was near **91%**. After removing the `related` label, it rose to **93.9%**. As Adaboost is based on **stumps**, a bad label perhaps distorts the model.
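
The **stumps** can be inspected directly: with default settings, each weak learner inside `AdaBoostClassifier` is a decision tree limited to one split. A small sketch on synthetic data (the dataset from `make_classification` and `n_estimators=10` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

#synthetic binary problem, just for inspection
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
ada = AdaBoostClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

#depth of every weak learner: a stump has depth 1
depths = [tree.get_depth() for tree in ada.estimators_]
learner_name = type(ada.estimators_[0]).__name__
print(learner_name, depths)
```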

In [78]:
start = time()

def dummy(doc):
    return doc

pipeline_adab = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                         ('tfidf', TfidfTransformer()),
                         ('clf',  MultiOutputClassifier(AdaBoostClassifier(random_state=42)))])
fit_adab = pipeline_adab.fit(X_train, y_train)

spent = time() - start
print('ADABOOST - process time:{:.2f} seconds'.format(spent))

KeyboardInterrupt: 

In [59]:
y_pred = pipeline_adab.predict(X_test)
udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*request -> label iloc[0]
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      3088
           1       0.78      0.55      0.65       897

    accuracy                           0.86      3985
   macro avg       0.83      0.75      0.78      3985
weighted avg       0.86      0.86      0.86      3985

######################################################
*offer -> label iloc[1]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3956
           1       0.00      0.00      0.00        29

    accuracy                           0.99      3985
   macro avg       0.50      0.50      0.50      3985
weighted avg       0.99      0.99      0.99      3985

######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3963
           1       0.10      0.05      0.06        22

    accuracy                           0.99      3985
   macro avg       0.55      0.52      0.53      3985
weighted avg       0.99      0.99      0.99      3985

######################################################
*hospitals -> label iloc[22]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3935
           1       0.21      0.08      0.12        50

    accuracy                           0.98      3985
   macro avg       0.60      0.54      0.55      3985
weighted avg       0.98      0.98      0.98      3985

######################################################
*shops -> label iloc[23]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3954
           1       0.25      0.03      0.06        31

    accuracy      

---

#### Falling in a trap when choosing another Classifier

Then I tried a **Stochastic Gradient Descent** (SGD) [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

_"Linear classifiers (SVM, logistic regression, etc.) with SGD training"_

It can work as a **Support Vector Machine** (SVM), which is a fancy way of defining a good frontier


`clf = SGDClassifier()` with some parameters
  
>- `learning_rate='optimal'` $\rightarrow$ **decreasing strength schedule** used when updating the model with the gradient of the loss at each sample
>- `loss='hinge'` $\rightarrow$ **Linear SVM** for the fitting model (works with data represented as dense or sparse arrays for features)
>- `penalty=['l2', 'l1', 'elasticnet']` $\rightarrow$ **regularizer** that shrinks model parameters towards the zero vector, using the squared Euclidean norm (l2), the absolute norm (l1), or a combination of both (**Elastic Net**)
>- `alpha=[1e-5, 1e-4, 1e-3]` $\rightarrow$ constant that multiplies the regularization term: the higher the value, the **stronger** the regularization (also used to compute the **Learning Rate** when `learning_rate` is set to 'optimal')
>- `n_iter=[1, 5, 10]` $\rightarrow$ number of passes over the Training Data (**Epochs**). It only impacts the behavior of the **fit method**, and not the partial_fit method (newer scikit-learn releases call this parameter `max_iter`)
>- `random_state=42` $\rightarrow$ if you want to replicate exactly the same output each time you retrain your machine
  
*Observe that this is a kind of annotated reading of the text at the SkLearn website for this Classifier*

---

And **SGDC** didn´t work! It gave me a **ValueError: y should be a 1d array, got an array instead**. So, something went wrong:

Searching for the cause of the problem, I found this explanation [here](https://stackoverflow.com/questions/20335853/scikit-multilabel-classification-valueerror-bad-input-shape)

*"No, SGDClassifier does not do **multilabel classification** (what I need!) -- it does **multiclass classification**, which is a different problem, although both are solved using a one-vs-all problem reduction.*

*Then, neither **SGD** nor OneVsRestClassifier.fit will accept a **sparse matrix** (is what I have!) for y.*

*- SGD wants an **array of labels** (is what I have!), as you've already found out*

*- OneVsRestClassifier wants, for multilabel purposes, a list of lists of labels"*

*Observe that this is a kind of a lecture over the explanatory text that I got at SKLearn website for SGDC for Multilabel*

---

There is a good explanation about **Multiclass** and **Multilabel** Classifiers [here](https://scikit-learn.org/stable/modules/multiclass.html)
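
One possible way around the **ValueError**, which I have not timed on the real dataset, is to wrap **SGDC** in `MultiOutputClassifier`, as was done for the other Classifiers. The sketch below uses made-up data (`X_demo`, `Y_demo` are assumptions) and also shows how scikit-learn itself names this kind of target:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.utils.multiclass import type_of_target

#toy data: 30 samples, 4 features, 2 label columns derived from the features
rng = np.random.RandomState(42)
X_demo = rng.rand(30, 4)
Y_demo = (X_demo[:, :2] > 0.5).astype(int)

#a 2d array of 0/1 is a multilabel target, not a multiclass one
target_kind = type_of_target(Y_demo)
print(target_kind)

#one linear SVM trained by SGD per label column
sgd = MultiOutputClassifier(
    SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42))
sgd.fit(X_demo, Y_demo)

pred = sgd.predict(X_demo)
print(pred.shape)
```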

Don´t try to run this code:

In [60]:
#start = time()

#def dummy(doc):
#    return doc

#random_state=42 #<-just to remember!
#pipeline_sgrd = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
#                          ('tfidf', TfidfTransformer()),
#                          ('clf', SGDClassifier(loss='hinge', 
#                                                penalty='l2',
#                                                alpha=1e-3))]) 
#fit_sgrd = pipeline_sgrd.fit(X_train, y_train)

#spent = time() - start
#print('STOCHASTIC GRADIENT DESCENT - process time:{:.2f} seconds'.format(spent))

Let's try **K-Neighbors Classifier**:

- average Accuracy was **91.8%**... not so bad!

In [76]:
start = time()

def dummy(doc):
    return doc

#random_state=42 #<-just to remember!
pipeline_knbr = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultiOutputClassifier(KNeighborsClassifier(n_neighbors=3)))])
fit_knbr = pipeline_knbr.fit(X_train, y_train)

spent = time() - start
print('K NEIGHBORS CLASSIFIER - process time:{:.2f} seconds'.format(spent))

K NEIGHBORS CLASSIFIER - process time:0.63 seconds


In [77]:
y_pred = pipeline_knbr.predict(X_test)
udacourse2.fn_scores_report(y_test, y_pred)

KeyboardInterrupt: 

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

Linear Support Vector Machine, fed by TfidfVectorizer:

- Average Accuracy of **93.8%**

In [73]:
start = time()

def dummy(doc):
    return doc

feats = TfidfVectorizer(analyzer='word', 
                        tokenizer=dummy, 
                        preprocessor=dummy,
                        token_pattern=None,
                        ngram_range=(1, 3))

#feats = feats.fit_transform(X_train)

pipeline_lnsv = Pipeline([
    ('vect', feats),
    ('clf', OneVsRestClassifier(LinearSVC(C=2., 
                                          random_state=42)))
])

fit_lnsv = pipeline_lnsv.fit(X_train, y_train)

spent = time() - start
print('LINEAR SUPPORT VECTOR MACHINE - process time:{:.2f} seconds'.format(spent))

LINEAR SUPPORT VECTOR MACHINE - process time:22.85 seconds


*NotFittedError: Vocabulary not fitted or provided*
[here](https://stackoverflow.com/questions/60472925/python-scikit-svm-vocabulary-not-fitted-or-provided)
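
This error shows up when a vectorizer is asked to `transform` before it has learned a vocabulary, for example when a fresh vectorizer is created at prediction time instead of reusing the fitted one. A minimal sketch with made-up sentences (`docs` is an assumption for illustration):

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['cold front over cuba', 'we need water and food']

#a brand new vectorizer has no vocabulary yet
unfitted = TfidfVectorizer()
try:
    unfitted.transform(docs)
    raised = False
except NotFittedError as err:
    raised = True
    print('not fitted:', err)

#after fit, the same call works and keeps the training vocabulary size
fitted = TfidfVectorizer().fit(docs)
features = fitted.transform(['water in cuba'])
print(features.shape)
```

Keeping the vectorizer inside the **Pipeline** avoids the problem, as `fit` and `predict` then always go through the same fitted object.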

In [74]:
y_pred = pipeline_lnsv.predict(X_test)
udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*request -> label iloc[0]
              precision    recall  f1-score   support

           0       0.92      0.91      0.91      3088
           1       0.69      0.73      0.71       897

    accuracy                           0.87      3985
   macro avg       0.81      0.82      0.81      3985
weighted avg       0.87      0.87      0.87      3985

######################################################
*offer -> label iloc[1]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3956
           1       1.00      0.03      0.07        29

    accuracy                           0.99      3985
   macro avg       1.00      0.52      0.53      3985
weighted avg       0.99      0.99      0.99      3985

######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3877
           1       0.62      0.51      0.56       108

    accuracy                           0.98      3985
   macro avg       0.81      0.75      0.78      3985
weighted avg       0.98      0.98      0.98      3985

######################################################
*tools -> label iloc[21]
  - as y_pred has only zeroes, report is not valid
######################################################
*hospitals -> label iloc[22]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3935
           1       0.33      0.04      0.07        50

    accuracy                           0.99      3985
   macro avg       0.66      0.52      0.53      3985
weighted avg       0.98      0.99      0.98      3985

######################################################
*shops -> label iloc[23]
  - as y_pred has only zeroes, report is not valid


## VIII. Make a Fine Tuning effort over Classifiers

If the Classifier that you chose has parameters that can be tuned, you can try to improve its performance by changing these parameters according to some criteria.

**Grid Search**

`parameters = {'vect__ngram_range': [(1, 1), (1, 2)],`
              `'tfidf__use_idf': (True, False),`
              `'clf__alpha': (1e-2, 1e-3)}`

- use **multiple cores** to process the task

`gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)`

`gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)`

- see the **mean score** of the parameters

`gs_clf.best_score_`

`gs_clf.best_params_`
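
The snippet above comes from a single-label tutorial. For a pipeline like the ones in this notebook, where the Classifier sits inside `MultiOutputClassifier`, the parameter name needs one more level: `clf__estimator__alpha`. A small sketch on made-up token lists (`docs`, `labels` and `cv=2` are assumptions to keep it tiny):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def dummy(doc):
    return doc

#made-up pre-tokenized messages and two label columns
docs = [['water', 'food'], ['storm', 'rain'], ['water', 'need'], ['rain', 'flood'],
        ['food', 'need'], ['storm', 'wind'], ['water', 'help'], ['flood', 'wind']]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

pipe = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy,
                                          token_pattern=None)),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultiOutputClassifier(MultinomialNB()))])

#step name, then "estimator" (the wrapped Classifier), then its parameter
param_grid = {'tfidf__use_idf': (True, False),
              'clf__estimator__alpha': (1e-2, 1e-3)}

gs_demo = GridSearchCV(pipe, param_grid, cv=2) #n_jobs=-1 could be added here
gs_demo.fit(docs, labels)

print(gs_demo.best_params_)
demo_pred = gs_demo.predict([['water', 'food']])
```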

In [None]:
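#names adapted from the tutorial to this notebook (pipeline_mbnb, X_train, y_train)
#alpha lives inside MultiOutputClassifier, so it is clf__estimator__alpha
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__estimator__alpha': (1e-2, 1e-3)}

#use multiple cores to process the task
gs_clf = GridSearchCV(pipeline_mbnb, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)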

Score of the parameters

In [None]:
classification_report...

In [None]:
raise Exception('under development')

In [None]:
#best mean cross-validated score and the parameters that produced it
print(gs_clf.best_score_)
print(gs_clf.best_params_)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
def plot_scores(y_test, y_pred):
    #Testing the model
    # Printing the classification report for each label
    i = 0
    for col in y_test:
        print('Feature {}: {}'.format(i+1, col))
        print(classification_report(y_test[col], y_pred[:, i]))
        i = i + 1
    accuracy = (y_pred == y_test.values).mean()
    print('The model accuracy is {:.3f}'.format(accuracy))

In [None]:
# Prediction: the Random Forest Classifier (pipeline_rafo, as fitted above)
y_pred = pipeline_rafo.predict(X_test)
plot_scores(y_test, y_pred)

In [None]:
# Prediction: the Naive Bayes classifier (pipeline_mbnb, as fitted above)
y_pred = pipeline_mbnb.predict(X_test)
plot_scores(y_test, y_pred)

[here](https://stackoverflow.com/questions/31744519/load-pickled-classifier-data-vocabulary-not-fitted-error)

[here](https://stackoverflow.com/questions/32674380/countvectorizer-vocabulary-wasnt-fitted)

In [None]:
# Prediction: the Adaboost Classifier (pipeline_adab, as fitted above)
y_pred = pipeline_adab.predict(X_test)
plot_scores(y_test, y_pred)

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
#parameters = 

#cv = 

# Show parameters for the pipeline
pipeline_rafo.get_params()

In [None]:
# Show parameters for the pipeline
pipeline_adab.get_params()

In [None]:
# Using grid search
# Create Grid search parameters for Random Forest Classifier   
parameters_rfc = {
        'tfidf__use_idf': (True, False),
        'clf__estimator__n_estimators': [10, 20]
}

cv_rfc = GridSearchCV(pipeline_rafo, param_grid = parameters_rfc)
cv_rfc

In [None]:
# Using grid search
# Create Grid search parameters
parameters_ada = {
        'tfidf__use_idf': (True, False),
        'clf__estimator__n_estimators': [50, 60, 70]
}

cv_ada = GridSearchCV(pipeline_adab, param_grid = parameters_ada)
cv_ada

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [None]:
# Create a pickle file for the model

#https://www.codegrepper.com/code-examples/python/save+and+load+python+pickle+stackoverflow
#with open('filename.pkl', 'wb') as handle:
#    pickle.dump(a, handle, protocol=pickle.HIGHEST_PROTOCOL)
#with open('filename.pkl', 'rb') as handle:
#    b = pickle.load(handle)

#Pickle documentation
#https://docs.python.org/3/library/pickle.html#module-pickle
file_name = 'classifier.pkl'
with open(file_name, 'wb') as handle: #wb for write+binary
    pickle.dump(cv_ada, handle)

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

In [None]:
# import packages
import sys

def load_data(data_file):
    # read in file
    # clean data
    # load to database
    # define features and label arrays
    return X, y

def build_model():
    # text processing and model pipeline
    # define parameters for GridSearchCV
    # create gridsearch object and return as final model pipeline
    return model_pipeline

def train(X, y, model):
    # train test split
    # fit model
    # output model test results
    return model

def export_model(model):
    # Export model as a pickle file
    pass  # placeholder, so the template is valid Python until it is filled in

def run_pipeline(data_file):
    X, y = load_data(data_file)  # run ETL pipeline
    model = build_model()  # build model pipeline
    model = train(X, y, model)  # train model pipeline
    export_model(model)  # save model

if __name__ == '__main__':
    data_file = sys.argv[1]  # get filename of dataset
    run_pipeline(data_file)  # run data pipeline

In [None]:
import pickle
from sklearn import svm  # needed for svm.LinearSVC below
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
import string


# Function to pass the list to the Tf-idf vectorizer
def returnPhrase(inputList):
    return inputList


# Pre-processing the sentence which we input to predict the emotion
def transformSentence(sentence):
    s = []
    sentence = sentence.replace('\n', '')
    sentTokenized = word_tokenize(sentence)
    s.append(sentTokenized)
    sWithoutPunct = []
    punctList = list(string.punctuation)
    curSentList = s[0]
    newSentList = []
    for word in curSentList:
        if word.lower() not in punctList:
            newSentList.append(word.lower())
    sWithoutPunct.append(newSentList)
    mystemmer = PorterStemmer()
    tokenziedStemmed = []
    for i in range(0, len(sWithoutPunct)):
        curList = sWithoutPunct[i]
        newList = []
        for word in curList:
            newList.append(mystemmer.stem(word))
        tokenziedStemmed.append(newList)
    return tokenziedStemmed


# Extracting the features for SVM
myVectorizer = TfidfVectorizer(analyzer='word', tokenizer=returnPhrase, preprocessor=returnPhrase,
                               token_pattern=None,
                               ngram_range=(1, 3))

# The SVM Model
curC = 2  # cost factor in SVM
SVMClassifier = svm.LinearSVC(C=curC)

filename = 'finalized_model.sav'
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

# Input sentence
with open('trial_truth_001.txt', 'r') as file:
    sent = file.read().replace('\n', '')

transformedTest = transformSentence(sent)

X_test = myVectorizer.fit_transform(transformedTest).toarray()
Prediction = loaded_model.predict(X_test)

# Printing the predicted emotion
print(Prediction)