# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
#import libraries
#measuring time and making basic math
from time import time
import math
import numpy as np
import udacourse2 #my library for this project!
import statistics

#my own ETL pipeline
#import process_data as pr

#dealing with datasets and showing content
import pandas as pd
#import pprint as pp

#SQLAlchemy toolkit
from sqlalchemy import create_engine
from sqlalchemy import pool
from sqlalchemy import inspect

#natural language toolkit
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

#REGEX toolkit
import re

#Machine Learning preparing/preprocessing toolkits
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

#Machine Learning Feature Extraction tools
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#Machine Learning Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier #need MOClassifier!
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

#Machine Learning Classifiers extra tools
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

#Machine Learning Metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

#pickling tool
import pickle

When I first tried to use NLTK, I got the error shown below:

- the point is that installing the library is not enough

- you also need to install the supporting dictionaries (corpora) for each task

- this is easily solved with the NLTK Downloader (I hope to find a Brazilian Portuguese dictionary when I need to put this into practice in my own work)

    LookupError: 
    **********************************************************************
      Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('stopwords')
  
      For more information see: https://www.nltk.org/data.html

      Attempted to load corpora/stopwords

In [2]:
#import nltk
#nltk.download('punkt')

    LookupError: 
    **********************************************************************
    Resource stopwords not found.
    Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('stopwords')

In [3]:
#nltk.download('stopwords')

    LookupError: 
    **********************************************************************
    Resource wordnet not found.
    Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('wordnet')

In [4]:
#nltk.download('wordnet')

In [5]:
#load data from database
#setting NullPool disables connection pooling, so it is easy to close the database connection
#in our case, the DB is so simple that it looks like the best choice
#SQLAlchemy documentation
#https://docs.sqlalchemy.org/en/14/core/reflection.html
engine = create_engine('sqlite:///Messages.db', poolclass=pool.NullPool) #, echo=True)

#retrieving tables names from my DB
#https://stackoverflow.com/questions/6473925/sqlalchemy-getting-a-list-of-tables
inspector = inspect(engine)
print('existing tables in my SQLite database:', inspector.get_table_names())

existing tables in my SQLite database: ['Messages']
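As a cross-check that does not depend on SQLAlchemy, the same table listing can be obtained with the standard-library `sqlite3` module by querying `sqlite_master`. The sketch below uses an in-memory database with a stand-in `Messages` table (an assumption, so that it runs anywhere); pointing `connect` at `Messages.db` would inspect the real file instead.

```python
import sqlite3

#in-memory database with a stand-in table, only for illustration
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Messages (message TEXT, related INTEGER)')

#sqlite_master is SQLite's internal catalog of tables, indexes and views
rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
table_names = [name for (name,) in rows]
conn.close()

print('existing tables:', table_names)
```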


As my target is the `Messages` table, I read this table into a Pandas dataframe:

In [6]:
#importing MySQL to Pandas
#https://stackoverflow.com/questions/37730243/importing-data-from-a-mysql-database-into-a-pandas-data-frame-including-column-n/37730334
#connection_str = 'mysql+pymysql://mysql_user:mysql_password@mysql_host/mysql_db'
#connection = create_engine(connection_str)

connection = engine.connect()
df = pd.read_sql('SELECT * FROM Messages', con=connection)
connection.close()

df.name = 'df'

df.head(1)

Unnamed: 0,message,original,genre,if_blank,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Splitting in X and Y datasets:

- X is the **Message** column

In [7]:
X = df['message']
X.head(1)

0    Weather update - a cold front from Cuba that c...
Name: message, dtype: object

- Y is the **Classification** labels

- I excluded all the columns that don't make sense as labels to classify our messages

In [8]:
Y = df[df.columns[4:]]
Y.head(1)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [9]:
msg_text = X.iloc[0]
msg_text

'Weather update - a cold front from Cuba that could pass over Haiti'

In [10]:
#let's insert some noise to see if it is filtering well
msg_text = "Weather update01 - a 00cold-front from Cuba's that could pass over Haiti' today"
low_text = msg_text.lower()

#I need to keep only valid words
#a basic pattern (very common in Regex courses): replace every non-letter with a space
gex_text = re.sub(r'[^a-zA-Z]', ' ', low_text)

#other solutions tried, from several sources
#re.sub(r'^\b[^a-zA-Z]\b', ' ', low_text)
#re.sub(r'^/[^a-zA-Z ]/g', ' ', low_text)
#re.sub(r'^/[^a-zA-Z0-9 ]/g', ' ', low_text)

gex_text

'weather update     a   cold front from cuba s that could pass over haiti  today'

Found this pattern [here](https://stackoverflow.com/questions/1751301/regex-match-entire-words-only)

- the '-' passed through untouched (the text comes back unchanged), so it's not so nice!

In [11]:
re.sub(r'^/\b($word)\b/i', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

In [12]:
re.sub(r'^\b[a-zA-Z]{3}\b', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

In [13]:
re.sub(r'^[a-zA-Z]{3}$', ' ', low_text)

"weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

In [14]:
col_words = word_tokenize(gex_text)
col_words

['weather',
 'update',
 'a',
 'cold',
 'front',
 'from',
 'cuba',
 's',
 'that',
 'could',
 'pass',
 'over',
 'haiti',
 'today']
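Looking back at the three `re.sub` attempts in cells 11 to 13: they return the text unchanged. A likely reason is that the patterns were copied in JavaScript/Perl literal syntax (`/…/i`), which Python's `re` treats as ordinary characters, and that the leading `^` only allows a match at the very start of the string. A minimal sketch of how the same ideas could be written for Python follows; the variable names are illustrative.

```python
import re

low_text = "weather update01 - a 00cold-front from cuba's that could pass over haiti' today"

#original attempt: "^" pins the match to position 0, where "wea" is followed by "t" (no word boundary)
unchanged = re.sub(r'^\b[a-zA-Z]{3}\b', ' ', low_text)

#whole words made only of letters; "update01" and "00cold" are skipped because digits are word characters
only_words = re.findall(r'\b[a-zA-Z]+\b', low_text)

#Python spelling of the JavaScript "/.../i" flag
found = re.findall(r'\bCUBA\b', low_text, flags=re.IGNORECASE)
```

Note that this whole-word approach drops `update01` and `00cold` entirely, while the character-level substitution used in cell 10 keeps `update` and `cold`. Which behaviour is preferable depends on how much noise the messages contain.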

In [15]:
unnuseful = stopwords.words("english")
relevant_words = [word for word in col_words if word not in unnuseful]
relevant_words

['weather',
 'update',
 'cold',
 'front',
 'cuba',
 'could',
 'pass',
 'haiti',
 'today']

I noticed a lot of geographic references. I think they will not be very useful for us. Let's try to remove them too...

References for the city database in NLTK [here](https://stackoverflow.com/questions/37025872/unable-to-import-city-database-dataset-from-nltk-data-in-anaconda-spyder-windows?rq=1)

In [16]:
import nltk.sem.chat80 as ct #.sql_demo()

LookupError: 
**********************************************************************
  Resource city_database not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('city_database')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/city_database/city.db

  Searched in:
    - 'C:\\Users\\epass/nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\share\\nltk_data'
    - 'C:\\ProgramData\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\epass\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************

In [17]:
#import nltk
#nltk.download('city_database')

In [18]:
countries = {
    country:city for city, country in ct.sql_query(
        "corpora/city_database/city.db",
        "SELECT City, Country FROM city_table"
    )
}

They look nice (and lowercased):

- watch out for possible errors with composite names, like united_states

In [19]:
for c in countries:
    print(c)

greece
thailand
spain
east_germany
united_kingdom
india
belgium
romania
hungary
argentina
egypt
china
venezuela
united_states
west_germany
hongkong
turkey
indonesia
south_africa
pakistan
soviet_union
japan
peru
philippines
australia
mexico
italy
canada
france
south_korea
brazil
vietnam
chile
singapore
iran
austria
poland


I couldn't find Haiti:

- the countries list is not complete!

- the lookup below raises `KeyError: 'haiti'`

In [20]:
#countries['haiti']

In [21]:
nogeo_words = [word for word in relevant_words if word not in countries]
nogeo_words

['weather',
 'update',
 'cold',
 'front',
 'cuba',
 'could',
 'pass',
 'haiti',
 'today']

Unfortunately, it's only a **demo**! We need something better for our project...

In [22]:
#df_cities = pd.read_csv('cities15000.txt', sep=';')
df_cities = pd.read_csv('cities15000.txt', sep='\t', header=None)
df_cities_15000 = df_cities[[1, 17]]
df_cities_15000.columns = ['City', 'Region']
df_cities_15000.head(5)

Unnamed: 0,City,Region
0,les Escaldes,Europe/Andorra
1,Andorra la Vella,Europe/Andorra
2,Umm Al Quwain City,Asia/Dubai
3,Ras Al Khaimah City,Asia/Dubai
4,Zayed City,Asia/Dubai


Tried this [here](https://data.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000%40public/information/?disjunctive.cou_name_en)

In [23]:
df_cities.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",42.50729,1.53414,P,PPLA,AD,,8,,,,15853,,1033,Europe/Andorra,2008-10-15
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",42.50779,1.52109,P,PPLC,AD,,7,,,,20430,,1037,Europe/Andorra,2020-03-03
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",25.56473,55.55517,P,PPLA,AE,,7,,,,62747,,2,Asia/Dubai,2019-10-24
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",25.78953,55.9432,P,PPLA,AE,,5,,,,351943,,2,Asia/Dubai,2019-09-09
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",23.65416,53.70522,P,PPL,AE,,1,103.0,,,63482,,124,Asia/Dubai,2019-10-24


Found country names on GitHub [here](https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv)

- a small trick and we have our own countries list!

In [24]:
df_countries = pd.read_csv('all.csv')
df_countries = df_countries['name'].apply(lambda x: x.lower())
countries = df_countries.tolist()
countries

['afghanistan',
 'åland islands',
 'albania',
 'algeria',
 'american samoa',
 'andorra',
 'angola',
 'anguilla',
 'antarctica',
 'antigua and barbuda',
 'argentina',
 'armenia',
 'aruba',
 'australia',
 'austria',
 'azerbaijan',
 'bahamas',
 'bahrain',
 'bangladesh',
 'barbados',
 'belarus',
 'belgium',
 'belize',
 'benin',
 'bermuda',
 'bhutan',
 'bolivia (plurinational state of)',
 'bonaire, sint eustatius and saba',
 'bosnia and herzegovina',
 'botswana',
 'bouvet island',
 'brazil',
 'british indian ocean territory',
 'brunei darussalam',
 'bulgaria',
 'burkina faso',
 'burundi',
 'cabo verde',
 'cambodia',
 'cameroon',
 'canada',
 'cayman islands',
 'central african republic',
 'chad',
 'chile',
 'china',
 'christmas island',
 'cocos (keeling) islands',
 'colombia',
 'comoros',
 'congo',
 'congo, democratic republic of the',
 'cook islands',
 'costa rica',
 "côte d'ivoire",
 'croatia',
 'cuba',
 'curaçao',
 'cyprus',
 'czechia',
 'denmark',
 'djibouti',
 'dominica',
 'dominican re

With this list I can eliminate many country names (perhaps not all of them). In our case, they produce noise in our data.

In [25]:
nogeo_words = [word for word in relevant_words if word not in countries]
nogeo_words

['weather', 'update', 'cold', 'front', 'could', 'pass', 'today']
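Two details of this filter are worth noting. Testing `word not in countries` against a list scans the whole list for every token, while a `set` gives constant-time lookups on average. Also, multi-word names such as 'costa rica' can never equal a single token, so indexing the individual words of each name catches more of them (at the cost of also removing generic words like 'islands' or 'and'). A minimal sketch with a shortened stand-in list:

```python
#shortened stand-in for the full list loaded from all.csv
countries = ['cuba', 'haiti', 'costa rica', 'dominica']
relevant_words = ['weather', 'update', 'cold', 'front', 'cuba', 'could',
                  'pass', 'haiti', 'today', 'costa', 'rica']

#set lookup instead of list scan
country_set = set(countries)
nogeo_words = [word for word in relevant_words if word not in country_set]

#also index every single word that appears inside a country name
country_parts = {part for name in countries for part in name.split()}
nogeo_strict = [word for word in relevant_words if word not in country_parts]
```

Here `nogeo_words` still contains 'costa' and 'rica', while `nogeo_strict` removes them.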

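First test:

- over the first message only (note that the call below passes the noisy `msg_text` defined earlier, not `message`, which is why `today` shows up in the output)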

In [26]:
message = 'Weather update - a cold front from Cuba that could pass over Haiti'
tokens = udacourse2.fn_tokenize_fast(msg_text, 
                                     verbose=True)

['weather', 'update', 'cold', 'front', 'cuba', 'pass', 'haiti', 'today']


In [27]:
message = 'Weather update - a cold front from Cuba that could pass over Haiti'
tokens = udacourse2.fn_tokenize(msg_text, 
                                lemmatize=True, 
                                rem_city=True, 
                                agg_words=True,
                                rem_noise=True,
                                elm_short=3,
                                verbose=True)

tokens

Tokens-start:79, token/stop:9, remove cities:7 &noise:7
 +lemmatizer:7
 +eliminate short:7


['update', 'front', 'cold', 'weather', 'today', 'pas', 'could']

It's not so good; some noise still appears in the lemmatized words:

- an "l" was found, as in **French words** like *l'orange*;

- my **City** filter needs a lot of improvement, as it didn't filter avenues and many other **geographic** references;

- it let through a lot of useless words of **two** letters or fewer, such as **u** and **st**;

- a lot of noisy words such as **help**, **thanks** and **please** were found;

- several words are **repeated** within some messages, like ['river', ... 'river', ...]

Basic test call

- only for the first messages (the loop below stops after about 200), verbose

In [28]:
b_start = time()

i = 0
for message in X:
    out = udacourse2.fn_tokenize_fast(message, 
                                      verbose=True)
    i += 1
    if i > 200: #it's only for test, you can adjust it!
        break

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

['weather', 'update', 'cold', 'front', 'cuba', 'pass', 'haiti']
['hurricane']
['looking', 'name']
['reports', 'leogane', 'destroyed', 'hospital', 'croix', 'functioning', 'needs', 'supplies', 'desperately']
['says', 'side', 'haiti', 'rest', 'today', 'tonight']
['information', 'national']
['storm']
['please', 'need', 'tents', 'water', 'silo']
['receive', 'messages']
['croix', 'des', 'bouquets', 'health', 'issues', 'workers', 'croix', 'des', 'bouquets']
['nothing', 'eat', 'water', 'starving', 'thirsty']
['petionville', 'need', 'information', 'regarding']
['thomassin', 'pyron', 'water', 'desperately', 'need', 'water']
['together', 'need', 'food', 'delma', 'didine']
['information', 'order', 'participate', 'use']
['comitee', 'delmas', 'impasse', 'charite', 'temporary', 'shelter', 'dire', 'need', 'water', 'food', 'medications', 'tents', 'clothes', 'please']
['need', 'food', 'water', 'klecin', 'dying', 'hunger', 'impasse', 'chretien', 'klecin', 'extended', 'extension', 'hungry', 'sick']
['call

['criminals', 'jacmel', 'asking', 'police']
['dying', 'hunger', 'petion']
['exactly', 'akay', 'name', 'toman', 'miles', 'kabar', 'name', 'kafou', 'lachomy']
['american', 'embassy', 'open', 'tomorrow']
['construction', 'firm', 'south', 'african', 'recruited', 'civilians', 'architects', 'stockholders', 'wondering', 'needs', 'moment']
['done', 'jacmel', 'trapped', 'underneath', 'rubbles', 'college', 'latriniti', 'help']
['enter', 'houses', 'electricity']
['need', 'kinds', 'help', 'bon', 'repos', 'route', 'ona', 'vile', 'really', 'need', 'help']
['muguet', 'please', 'petion']
['please', 'call', 'write', 'need', 'information', 'need', 'help', 'beeing', 'translators', 'food', 'distributors', 'services', 'else']
['true', 'political', 'held', 'hinche']
['field', 'need', 'speak', 'creole', 'french', 'half', 'english']
['unrecognized', 'characterse', 'initial', 'formation', 'accelerated', 'organized', 'menfp', 'financed', 'bid', 'launched', 'truncated']
['continues', 'truncated', 'regarding', 'e

Another Call:

In [29]:
b_start = time()

i = 0
for message in X:
    print(message)
    out = udacourse2.fn_tokenize(message, 
                                 lemmatize=True, 
                                 rem_city=True, 
                                 agg_words=True,
                                 rem_noise=True,
                                 elm_short=3,
                                 great_noisy=True,
                                 verbose=True)
    print(out)
    print()
    i += 1
    if i > 20: #it's only for test, you can adjust it!
        break

b_spent = time() - b_start
print('process time:{:.4f} seconds'.format(b_spent))

Weather update - a cold front from Cuba that could pass over Haiti
Tokens-start:66, token/stop:8, remove cities:6 &noise:6
 +lemmatizer:6
 +eliminate short:6
 +eliminate noisy from 300:6
['update', 'front', 'cold', 'weather', 'pas', 'could']

Is the Hurricane over or is it not over
Tokens-start:39, token/stop:1, remove cities:1 &noise:1
 +lemmatizer:1
 +eliminate short:1
 +eliminate noisy from 300:1
['hurricane']

Looking for someone but no name
Tokens-start:31, token/stop:3, remove cities:3 &noise:2
 +lemmatizer:2
 +eliminate short:2
 +eliminate noisy from 300:2
['looking', 'name']

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
Tokens-start:100, token/stop:11, remove cities:11 &noise:10
 +lemmatizer:10
 +eliminate short:8
 +eliminate noisy from 300:6
['destroyed', 'leogane', 'functioning', 'desperately', 'hospital', 'supply']

says: west side of Haiti, rest of the country today and tonight
Tokens-start:63, token/stop:8, remove cit

Don't try it! (complete tokenizer)

- it's a slow test! (it takes about 221 seconds to tokenize the whole dataframe)

In [30]:
#b_start = time()

#X_tokens = X.apply(lambda x: udacourse2.fn_tokenize(x, 
#                                                    lemmatize=True, 
#                                                    rem_city=True, 
#                                                    agg_words=True,
#                                                    rem_noise=True,
#                                                    elm_short=3,
#                                                    great_noisy=True,
#                                                    verbose=False))

#b_spent = time() - b_start
#print('process time:{:.0f} seconds'.format(b_spent))

- this is a faster test (it takes about 40 to 50 seconds to run)

- the secret is that it loops only once per row, as it condenses all the filters into a single loop

In [31]:
b_start = time()

X_tokens = X.apply(lambda x: udacourse2.fn_tokenize_fast(x, 
                                                         verbose=False))

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

process time:42 seconds
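The speed-up can be illustrated with a toy comparison. This is not the code of `udacourse2`; the two functions and their parameters below are hypothetical and only show why one combined pass does less work than several chained passes while returning the same tokens.

```python
def filter_multi_pass(words, stop_words, noisy_words, min_len=3):
    #each filter walks the whole (remaining) list again
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w not in noisy_words]
    words = [w for w in words if len(w) >= min_len]
    return words

def filter_single_pass(words, stop_words, noisy_words, min_len=3):
    #all conditions are checked while visiting each word once
    return [w for w in words
            if w not in stop_words and w not in noisy_words and len(w) >= min_len]

words = ['please', 'we', 'need', 'tents', 'and', 'water', 'in', 'silo']
stop_words = {'we', 'and', 'in'}
noisy_words = {'please', 'need'}

multi = filter_multi_pass(words, stop_words, noisy_words)
single = filter_single_pass(words, stop_words, noisy_words)
```

The trade-off is the one described in this notebook: the single-pass form is faster, but the individual filters can no longer be switched on and off separately.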


Now I have a **series** with all my tokenized messages:

In [32]:
X_tokens.head(5)

0    [weather, update, cold, front, cuba, pass, haiti]
1                                          [hurricane]
2                                      [looking, name]
3    [reports, leogane, destroyed, hospital, croix,...
4            [says, side, haiti, rest, today, tonight]
Name: message, dtype: object

And I can filter it for rows that have an **empty list**:
    
- solution found [here](https://stackoverflow.com/questions/29100380/remove-empty-lists-in-pandas-series)

In [33]:
X_tokens[X_tokens.str.len() == 0]

2522     []
2678     []
4487     []
5347     []
5709     []
5737     []
6152     []
6153     []
6229     []
7190     []
7266     []
7559     []
7751     []
7807     []
8891     []
8901     []
9650     []
9863     []
12221    []
12225    []
12258    []
Name: message, dtype: object

In [34]:
ser2 = X_tokens[X_tokens.str.len() > 0]
ser2

0        [weather, update, cold, front, cuba, pass, haiti]
1                                              [hurricane]
2                                          [looking, name]
3        [reports, leogane, destroyed, hospital, croix,...
4                [says, side, haiti, rest, today, tonight]
                               ...                        
26240    [training, demonstrated, enhance, micronutrien...
26241    [suitable, candidate, selected, ocha, jakarta,...
26242    [proshika, operating, cox, bazar, municipality...
26243    [women, protesting, conduct, elections, tearga...
26244    [radical, shift, thinking, came, result, meeti...
Name: message, Length: 26224, dtype: object

In [35]:
b_start = time()

dic_tokens = udacourse2.fn_subcount_lists(column=X_tokens, 
                                          verbose=False)

b_spent = time() - b_start
print('process time:{:.0f} seconds'.format(b_spent))

process time:0 seconds


Sorted dictionary [here](https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value)

In [36]:
dic_tokens

d_tokens = dic_tokens['elements']
t_sorted = sorted(d_tokens.items(), key=lambda kv: kv[1], reverse=True)

if t_sorted:
    print('data processed')

data processed
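The internals of `fn_subcount_lists` are not shown here. As a rough stand-in (an assumption, not that function's code), the standard library can produce the same kind of sorted frequency list: `collections.Counter` counts the tokens and `most_common` returns them already ordered by count.

```python
from collections import Counter
from itertools import chain

#stand-in for the X_tokens series: one list of tokens per message
token_lists = [['water', 'food'], ['water', 'tents'], [], ['food', 'water']]

#flatten all the lists and count each token
counts = Counter(chain.from_iterable(token_lists))

#list of (token, count) tuples, most frequent first
top_tokens = counts.most_common(2)
print(top_tokens)
```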


Sorted list of tuples with the most frequent tokens:

- showing the 300 most frequent elements

In [37]:
t_sorted[:300]

[('water', 2930),
 ('food', 2799),
 ('help', 2623),
 ('need', 2162),
 ('please', 2051),
 ('earthquake', 1802),
 ('haiti', 1043),
 ('government', 993),
 ('areas', 979),
 ('sandy', 927),
 ('find', 917),
 ('information', 856),
 ('relief', 812),
 ('aid', 778),
 ('affected', 758),
 ('health', 753),
 ('children', 662),
 ('work', 593),
 ('million', 583),
 ('emergency', 580),
 ('flood', 568),
 ('supplies', 550),
 ('tents', 550),
 ('want', 549),
 ('give', 541),
 ('international', 535),
 ('power', 525),
 ('house', 522),
 ('last', 520),
 ('rain', 507),
 ('rains', 506),
 ('still', 485),
 ('hurricane', 484),
 ('hit', 479),
 ('heavy', 479),
 ('disaster', 472),
 ('school', 469),
 ('support', 468),
 ('storm', 457),
 ('santiago', 450),
 ('assistance', 446),
 ('medical', 441),
 ('shelter', 436),
 ('floods', 428),
 ('victims', 427),
 ('families', 427),
 ('destroyed', 425),
 ('south', 425),
 ('family', 424),
 ('live', 423),
 ('national', 412),
 ('north', 410),
 ('houses', 409),
 ('port', 400),
 ('united',

Modifying the **tokenize** function so that it takes a list of less meaningful tokens to discard:

- **ver 1.2** update: tokenizer function created!

In [38]:
great_noisy = ['people', 'help', 'need', 'said', 'country', 'government', 'one', 'year', 'good', 'day',
    'two', 'get', 'message', 'many', 'region', 'city', 'province', 'road', 'district', 'including', 'time',
    'new', 'still', 'due', 'local', 'part', 'problem', 'may', 'take', 'come', 'effort', 'note', 'around',
    'person', 'lot', 'already', 'situation', 'see', 'response', 'even', 'reported', 'caused', 'village', 'bit',
    'made', 'way', 'across', 'west', 'never', 'southern', 'january', 'least', 'zone', 'small', 'next', 'little',
    'four', 'must', 'non', 'used', 'five', 'wfp', 'however', 'com', 'set', 'every', 'think', 'item', 'yet', 
    'carrefour', 'asking', 'ask', 'site', 'line', 'put', 'unicef', 'got', 'east', 'june', 'got', 'ministry']

---

#### Older attempt to clear tokens

I tried to isolate some words that I think are noisy, for exclusion:

- general geographic references, such as **area** and **village**;

- social communication words, such as **thanks** and **please**;

- religious expressions, such as **pray**;

- words with little meaning, such as **thing** and **like**;

- I visually filtered some words that I think don't add much to the **Machine Learning** model;

- just think about it: would you prefer your **AI** trained on 'thanks' or on 'hurricane'?

- I'm really not 100% sure about these words, but my **tokenize** function can enable and disable this list, so I can re-train the model and see whether the performance increases or decreases

In [39]:
unhelpful_words = ['thank', 'thanks', 'god', 'fine', 'number', 'area', 'let', 'stop', 'know', 'going', 'thing',
    'would', 'hello', 'say', 'neither', 'right', 'asap', 'near', 'want', 'also', 'like', 'since', 'grace', 
    'congratulate', 'situated', 'tell', 'almost', 'hyme', 'sainte', 'croix', 'ville', 'street', 'valley', 'section',
    'carnaval', 'rap', 'cry', 'location', 'ples', 'bless', 'entire', 'specially',  'sorry', 'saint', 'village', 
    'located', 'palace', 'might', 'given']

Testing **eliminate duplicates**:

In [40]:
test = ['addon', 'place', 'addon']
test = list(set(test))
test

['addon', 'place']
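One caveat: a `set` does not keep the original order of the words, so the order of the result above is not guaranteed. If the order matters, `dict.fromkeys` removes duplicates while keeping the first occurrence of each word (dictionaries preserve insertion order since Python 3.7):

```python
test = ['place', 'addon', 'place', 'river', 'addon']

#keys of a dict are unique and keep insertion order
unique_ordered = list(dict.fromkeys(test))
print(unique_ordered)
```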

Testing **eliminate short words**:

In [41]:
min_len = 3 #renamed from min, to avoid shadowing the built-in min()
list2 = []
test2 = ['addon', 'l', 'us', 'place']

for word in test2:
    if len(word) < min_len:
        print('eliminate:', word)
    else: 
        list2.append(word)
    
list2

eliminate: l
eliminate: us


['addon', 'place']

Solution for checking whether a variable is an integer, [here](https://stackoverflow.com/questions/3501382/checking-whether-a-variable-is-an-integer-or-not)

In [42]:
if isinstance(min_len, int):
    print('OK')

OK


Now I have two **Tokenizer** functions:

- `fn_tokenize` $\rightarrow$ it lets me test each method individually and contains all the methods described, but it is a bit slow, as it iterates over all the words again for each method

- `fn_tokenize_fast` $\rightarrow$ a **boosted** version, with only one iteration, so it runs faster, but you cannot switch each method on and off individually for finer testing

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.



---

### A small review over each item for our first machine learning pipelines

#### Feature Extraction

Feature Extraction from SKlearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

"Convert a collection of text documents to a matrix of token counts"

- we are looking for **tokens** that will be turned into **vectors** for a Machine Learning model;

- each document becomes a row of **counts** in a **matrix**, where each value indicates how many times a given token appears.

"This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix."

- matrix representations of real-world data are usually quite **sparse** (most entries are zero)

- so, to save memory, the implementation uses a proper sparse representation

"If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data."

- we already did this, drastically reducing the **variability** of terms

- it is done by our `fn_tokenize` function
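To make the idea of a token-count matrix concrete, here is a toy version in plain Python. It is only a sketch of the kind of result `CountVectorizer` produces (one row per document, one column per vocabulary term, dense here instead of sparse); the tiny corpus is made up.

```python
from collections import Counter

#made-up corpus, already tokenized
docs = [['water', 'food', 'water'], ['storm', 'water']]

#vocabulary: every distinct token, sorted, mapped to a column index
all_terms = sorted({token for doc in docs for token in doc})
vocabulary = {term: col for col, term in enumerate(all_terms)}

#count matrix: rows are documents, columns are terms
count_matrix = []
for doc in docs:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])
```

Most entries become zero as soon as the vocabulary grows, which is why the real implementation stores the matrix in sparse form.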

#### Preprocessing

TF-IDF from SKlearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)

- **tf** is about **term frequency** and;

- **idf** is about **inverse document frequency**.

"Transform a count matrix to a normalized tf or tf-idf representation"

- it means that it basically **normalizes** the count matrix

*Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.*

- it takes the term frequency and **rescales** it by the general document frequency

*The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.*

- the idea is to avoid giving too much weight to a **noisy** and very frequent word

- we tried to "manually" eliminate some of the **noisy** words, but as the number of tokens is too high, it's nearly impossible to do a good job by hand
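The rescaling can be written out by hand. The sketch below follows the formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), namely $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, followed by L2 normalization of each row. The count matrix is a made-up example.

```python
import math

#made-up count matrix: 3 documents x 2 terms; term 0 appears everywhere, term 1 only once
counts = [[2, 0], [1, 0], [1, 3]]
n_docs = len(counts)
n_terms = len(counts[0])

#document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

#smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + df[t])) + 1 for t in range(n_terms)]

#tf * idf, then scale each row to unit Euclidean length
tfidf = []
for row in counts:
    weighted = [row[t] * idf[t] for t in range(n_terms)]
    norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
    tfidf.append([w / norm for w in weighted])
```

The term that appears in every document gets the lowest possible idf (1.0), while the rare term gets a larger weight, which is exactly the down-weighting of frequent words described above.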

#### Training a Machine Learning model

As we have **labels**, a good strategy is to use **supervised learning**

- we could try to build **clusters** of messages using **unsupervised learning**, or try a **semi-supervised learning** strategy, as we have some messages (40) that don't have any classification;

- the most obvious way is to train a **Classifier**;

- as we have multiple labels, a **Multi Target Classifier** seems to be the best choice.

Multi target classification [here](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html)

"This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification"

- OK, so we will basically train one classifier on each label column (a **slice** of the targets), as few estimators natively support multi-target classification.
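The "one classifier per target" strategy can be sketched in a few lines of plain Python. This is a toy illustration, not scikit-learn's implementation: `MajorityClassifier` and `OnePerTarget` are hypothetical names, and the base model simply predicts the most frequent label it saw during training.

```python
from collections import Counter

class MajorityClassifier:
    #toy base model: always predicts the most frequent training label
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class OnePerTarget:
    #fits an independent base model for each label column
    def __init__(self, make_model):
        self.make_model = make_model

    def fit(self, X, Y):
        n_targets = len(Y[0])
        self.models_ = [self.make_model().fit(X, [row[t] for row in Y])
                        for t in range(n_targets)]
        return self

    def predict(self, X):
        per_target = [model.predict(X) for model in self.models_]
        #transpose: one row of labels per sample
        return [list(labels) for labels in zip(*per_target)]

X = ['need water', 'storm coming', 'need food']
Y = [[1, 0], [0, 1], [1, 0]]  #columns: request, storm (made-up labels)

clf = OnePerTarget(MajorityClassifier).fit(X, Y)
predictions = clf.predict(['need tents'])
```

Each column gets its own model, so correlations between labels are ignored. That is the price of this simple strategy.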

## I. Prepare the data

Perform the last operations to prepare the dataset for training with **Machine Learning**

For **training** purposes, a row in which all the labels are blank is a **data inconsistency**

- so we have 6,317 rows that we need to **remove** before **training**

In [43]:
print('all labels are blank in {} rows'.format(df[df['if_blank'] == 1].shape[0]))

all labels are blank in 6317 rows


In [44]:
df = df[df['if_blank'] == 0]
df.shape[0]

19928

Verifying if removal was complete

In [45]:
if df[df['if_blank'] == 1].shape[0] == 0:
    print('removal complete!')
else:
    raise Exception('something went wrong with rows removal before training')

removal complete!


**Version 1.3** update: **pre-tokenizer** (an early tokenization step) created, for removing **untrainable rows**

What is this **crazy thing** over here? 

>- I create a **provisional** column and **tokenize** it
>- Why do I need it for now? Just to remove rows that are **impossible to train** on
>- After tokenization, if I get an **empty list**, I need to remove this row before training

In [46]:
start = time()

try:
    df = df.drop('tokenized', axis=1)
except KeyError:
    print('OK')

#inserting a provisional column
df.insert(1, 'tokenized', np.nan)

#tokenizing into the provisional column
df['tokenized'] = df.apply(lambda x: udacourse2.fn_tokenize_fast(x['message']), axis=1)

#removing NaN from the provisional column (if any still exist)
df = df[df['tokenized'].notnull()]

spent = time() - start
print('process time:{:.0f} seconds'.format(spent))

df.head(1)

OK
process time:34 seconds


Unnamed: 0,message,tokenized,original,genre,if_blank,related,request,offer,aid_related,medical_help,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,"[weather, update, cold, front, cuba, pass, haiti]",Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Filtering empty lists in the provisional `tokenized` column, found [here](https://stackoverflow.com/questions/42964724/pandas-filter-out-column-values-containing-empty-list)

**Version 1.4** update: the **pre-tokenized** column can now be used as input for the **Machine Learning Classifier**, saving time!

And another **crazy thing**: I changed my mind about removing the provisional `tokenized` column:

>- why? Because I have already **tokenized** my **X** subdataset, so I will not need to do it again later!
>- and if I do this **wisely**, I will speed up the pipeline, as the hard work for the **CountVectorizer** is already done
>- it will also make it easier to **train** several Classifiers, as I save a lot of repeated processing by doing it **early** in my process!

In [47]:
empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
print('found {} rows with no tokens'.format(empty_tokens))

df = df[df['tokenized'].apply(lambda x: len(x)) > 0]
empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
print('*after removal, found {} rows with no tokens'.format(empty_tokens))

#I will not drop it anymore!
#try:
#    df = df.drop('provisory', axis=1)
#except KeyError:
#    print('OK')

#Instead, I will drop 'message' column
try:
    df = df.drop('message', axis=1)
except KeyError:
    print('OK')

print('now I have {} rows to train'.format(df.shape[0]))
df.head(1)

found 6 rows with no tokens
*after removal, found 0 rows with no tokens
now I have 19922 rows to train


Unnamed: 0,tokenized,original,genre,if_blank,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,"[weather, update, cold, front, cuba, pass, haiti]",Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
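The empty-list filter used in the cell above can also be written with a single boolean mask. This is a minimal sketch on a toy frame; the sample rows are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    'tokenized': [['weather', 'update'], [], ['need', 'water'], []],
    'related': [1, 0, 1, 0],
})

# Length of each token list; zero means nothing survived tokenization
n_tokens = toy['tokenized'].map(len)
print('found {} rows with no tokens'.format((n_tokens == 0).sum()))

# Keep only the rows that still have at least one token
toy_clean = toy[n_tokens > 0]
print('now I have {} rows to train'.format(toy_clean.shape[0]))
```

Computing the lengths once and reusing the mask avoids running `apply` three times over the column.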


---

#### Database data inconsistency fix

**Version 1.5** update - added a **hierarchical structure** to the labels, for checking and correcting unfilled classes that have at least one subclass already filled

A **more advanced** issue about these data

You can find a more detailed explanation in the file `ETL Pipeline Preparatione.ipynb`

The fact is: 

>- these labels are not as **chaotic** as we initially thought
>- looking with care, we can see a very clear **hierarchical structure** in them
>- if they are really hierarchical, we can check them for **data inconsistencies**, using **database fundamentals**

---

#### Another viewpoint about these labels

If we look at them more carefully, we can find a curious pattern in them

These labels look as if they have a kind of hierarchy behind them:

First **hierarchical** class: 

>- **related**
>- **request**
>- **offer**
>- **direct_report**

And then, **related** seems to have a **Second** hierarchical class

These are features to consider when training a classifier on **two layers**, or when **grouping** them all into main groups, as they are clearly **collinear**:

>- **aid_related** $\rightarrow$ groups aid calling (new things to add/ to do **after** the disaster)
>>- **food**
>>- **shelter**
>>- **water**
>>- **death**
>>- **refugees**
>>- **money**
>>- **security**
>>- **military**
>>- **clothing**
>>- **tools**
>>- **missing_people**
>>- **child_alone**
>>- **search_and_rescue**
>>- **medical_help**
>>- **medical_products**
>>- **aid_centers**
>>- **other_aid**
>- **weather_related** $\rightarrow$ groups what was the main **cause** of the disaster
>>- **earthquake**
>>- **storm**
>>- **floods**
>>- **fire**
>>- **cold**
>>- **other_weather**
>- **infrastructure_related** $\rightarrow$ groups **heavy infra** that was probably damaged during the disaster
>>- **buildings**
>>- **transport**
>>- **hospitals**
>>- **electricity**
>>- **shops**
>>- **other_infrastructure**

Applying a correction for **database data consistency**:

>- using the function that I already created (see: `ETL Pipeline Preparatione.ipynb`)
>- the idea is that when at least one **subcategory** of a **category** is filled, the **category** itself is expected to be filled too
>- this is valid for the main category **related** too!

*This is just one more **advanced step** for **data preparation**, as it involves only a mechanical and automated correction*
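The rule described above (a filled subcategory implies a filled category) can be sketched in a few lines of pandas. The helper `fill_parent_from_children` is hypothetical and written only for illustration; it is not the implementation of `udacourse2.fn_group_check`, and the toy rows are made up:

```python
import pandas as pd

# Hypothetical helper, NOT the udacourse2.fn_group_check implementation
def fill_parent_from_children(data, parent, children):
    """Set `parent` to 1 wherever at least one child label is 1."""
    fixed = data.copy()
    has_child = fixed[children].any(axis=1)      # any subcategory filled?
    lost = has_child & (fixed[parent] == 0)      # child filled, parent not
    fixed.loc[lost, parent] = 1
    return fixed, int(lost.sum())

toy = pd.DataFrame({
    'weather_related': [1, 0, 0, 1],
    'floods':          [1, 1, 0, 0],
    'storm':           [0, 0, 0, 0],
})

fixed, n_fixed = fill_parent_from_children(toy, 'weather_related', ['floods', 'storm'])
print('corrected {} rows'.format(n_fixed))
print(fixed['weather_related'].tolist())
```

The second row has `floods` filled but no `weather_related`, so it is the only one that gets corrected.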

In [48]:
#correction for aid_related
df = udacourse2.fn_group_check(dataset=df,
                               subset='aid',
                               correct=True, 
                               shrink=False, 
                               shorten=False, 
                               verbose=True)
#correction for weather_related
df = udacourse2.fn_group_check(dataset=df,
                               subset='wtr',
                               correct=True, 
                               shrink=False, 
                               shorten=False, 
                               verbose=True)
#correction for infrastructure_related
df = udacourse2.fn_group_check(dataset=df,
                               subset='ifr',
                               correct=True, 
                               shrink=False, 
                               shorten=False, 
                               verbose=True)
#correction for related(considering that the earlier were already corrected)
df = udacourse2.fn_group_check(dataset=df,
                               subset='main',
                               correct=True, 
                               shrink=False, 
                               shorten=False, 
                               verbose=True)
print(df.shape)
df.head(1)

###function group_check started
  - count for main class:aid_related, 10877 entries
  - for main, without any sub-categories,  3515 entries
  - for subcategories,  7388 entries
  - for lost parent sub-categories,  26 entries
    *correcting, new count: 0 entries
elapsed time: 0.1971s
###function group_check started
  - count for main class:weather_related, 7304 entries
  - for main, without any sub-categories,  1359 entries
  - for subcategories,  5945 entries
  - for lost parent sub-categories,  0 entries
    *correcting, new count: 0 entries
elapsed time: 0.0794s
###function group_check started
  - count for main class:infrastructure_related, 1705 entries
  - for main, without any sub-categories,  679 entries
  - for subcategories,  2926 entries
  - for lost parent sub-categories,  1900 entries
    *correcting, new count: 0 entries
elapsed time: 0.0848s
###function group_check started
  - count for main class:related, 19922 entries
  - for main, without any sub-categories,  9436 entr

Unnamed: 0,tokenized,original,genre,if_blank,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,"[weather, update, cold, front, cuba, pass, haiti]",Un front froid se retrouve sur Cuba ce matin. ...,direct,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## II. Break the data

Break the dataset into the **training columns** and **labels** (if it has **multilabels**)

X is the **Training Text Column**:

- if I observe the potential training data really well, I could use the `genre` column as training data too!

- or I can also use the `related`, `request`, `offer` columns for training `aid_related` data

*A discussion of how much these **Label** columns are **hierarchically defined** is made elsewhere in this notebook*

---

For the moment, I am using only the message text (already tokenized, in the `tokenized` column) as training data

In [49]:
X = df['tokenized']
X.head(1)

0    [weather, update, cold, front, cuba, pass, haiti]
Name: tokenized, dtype: object

Y is constituted by the **Classification Labels**

**Version 1.6** update: removed the `related` column from the Labels dataset. Why? Because when I look at the statistics after training the **Machine Learning Classifier**, it always turns out as `1`. So this column sometimes causes problems when training our Classifier (as with Adaboost), while adding nothing to the model

>- was: `y = df[df.columns[4:]]`
>- now: `y = df[df.columns[5:]]`

**Version 1.7** update: removed columns that contain **only zeroes**. Why? Because they are **impossible to train** on our Classifier, so they add nothing to the model

In [50]:
y = df[df.columns[5:]]

remove_lst = []

for column in y.columns:
    col = y[column]
    if (col == 0).all():
        print('*{} -> only zeroes training column!'.format(column))
        remove_lst.append(column)
    else:
        #print('*{} -> column OK'.format(column))
        pass
print(remove_lst)

y = y.drop(remove_lst, axis=1)

y.head(1)

*child_alone -> only zeroes training column!
['child_alone']


Unnamed: 0,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
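The loop above can also be expressed as one vectorized check. A minimal sketch on a toy label frame (the values are invented for illustration):

```python
import pandas as pd

# Toy label frame (hypothetical values)
toy_y = pd.DataFrame({
    'request':     [0, 1, 0],
    'child_alone': [0, 0, 0],
    'floods':      [1, 0, 0],
})

# Columns where every value is zero give the Classifier nothing to learn
only_zeroes = toy_y.columns[(toy_y == 0).all(axis=0)].tolist()
print(only_zeroes)

toy_y = toy_y.drop(columns=only_zeroes)
```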


## III. Split the data 

Into **Train** and **Test** subdatasets

>- let's start with **20%** of test data (the split cell below currently uses `test_size=0.25`)
>- at first I was not using the **random_state** setting (and why **42**? I personally think it is a reference to the book **The Hitchhiker's Guide to the Galaxy**, by Douglas Adams!)

**Version 1.8** update: now I am using **random_state** parameter, so I can compare exactly the same thing, when using randomized processes, for ensuring the same results for each function call
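The effect of `random_state` can be checked on toy data. In this small sketch the array `data` is a placeholder, not the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # placeholder data

# Two independent calls with the same seed
a_train, a_test = train_test_split(data, test_size=0.25, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.25, random_state=42)

# The same seed gives exactly the same partition
same_split = np.array_equal(a_test, b_test)
print(same_split, len(a_train), len(a_test))
```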

---

**Future** possible updates:

>- I can test/train using other values for test_size, like **0.25**, and see if it changes the results much
>- I can try to do **bootstrap** and see if I can plot a good **normalization** curve for it!

**NEW Future** possible update:

>- I could use **Cross Validation** in order to use all my data for training!
>- **Warning**: some papers advise care when using **Cross Validation** for Model Training. The reason is that it may allow **data leakage** from your **train** to your **test** dataset, masking the real power of your model!
>- so I need to **study more** about that before trying to implement it in Python
>- the discussion about "data leakage" when using cross validation strategies when **fitting** data is [here](https://stackoverflow.com/questions/56129726/fitting-model-when-using-cross-validation-to-evaluate-performance)
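One common source of leakage is fitting the vectorizer on all the data before splitting. A possible way around that, sketched below under the assumption of a single binary label and made-up messages, is to cross-validate the whole `Pipeline`, so the vectorizer and tf-idf steps are re-fit on the training part of each fold. This does not settle the concerns in the linked discussion; it only illustrates the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def identity(doc):
    # documents are already token lists, so pass them through unchanged
    return doc

# Toy pre-tokenized messages and one binary label (invented for illustration)
docs = pd.Series([['need', 'water'], ['storm', 'coming'], ['need', 'food'],
                  ['flood', 'storm'], ['water', 'food'], ['storm', 'flood'],
                  ['need', 'help', 'water'], ['heavy', 'storm']])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

pipe = Pipeline([
    ('vect', CountVectorizer(tokenizer=identity, preprocessor=identity,
                             token_pattern=None)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Each fold fits vect + tfidf + clf on its own training part only
scores = cross_val_score(pipe, docs, labels, cv=4, scoring='f1')
print(scores)
```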

In [51]:
#Split makes randomization, so random_state parameter was set
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.25, 
                                                    random_state=42)

And it looks OK:

In [52]:
X_train.shape[0] + X_test.shape[0]

19922

## IV. Choose your first Classifier 

- and build a **Pipeline** for it

Each Pipeline is a Python object that exposes **methods** such as **fit()**

---

What **Classifier** to choose?

- **Towards Data Science** gives us some tips [here](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

---

Start with a **Naïve Bayes** (NB)

`clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)`

In a Pipeline way (pipeline documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)):

>- I had some issues with `CountVectorizer`, but could solve them using Stack Overflow [here](https://stackoverflow.com/questions/32674380/countvectorizer-vocabulary-wasnt-fitted)
>- should I use `CountVectorizer(tokenizer=udacourse2.fn_tokenize_fast)`?... but I will **not**!
>- why? Because I already did the **tokenization** in an earlier step
>- so, how to bypass this hellish `tokenizer=...` parameter?
>- I found a clever solution [here](https://stackoverflow.com/questions/35867484/pass-tokens-to-countvectorizer)
>- so, I prepared a **dummy** function to bypass the tokenizer of **CountVectorizer**
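To see what the **dummy** trick does, here is a small sketch on two made-up token lists. Passing `token_pattern=None` is my addition; as far as I know it only silences the warning about the unused default pattern:

```python
from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    # the document is already a list of tokens, so return it untouched
    return doc

# Two invented, already-tokenized messages
tokenized_docs = [['weather', 'update', 'cold', 'front'],
                  ['cold', 'cold', 'water']]

vect = CountVectorizer(tokenizer=dummy, preprocessor=dummy, token_pattern=None)
counts = vect.fit_transform(tokenized_docs)

print(sorted(vect.vocabulary_))   # the learned vocabulary, alphabetically
print(counts.toarray())           # one row per message, one column per token
```

Because both the preprocessor and the tokenizer are pass-through functions, `CountVectorizer` only counts the tokens it receives.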

First I tried to set the Classifier as **MultinomialNB()**, and it crashed:

>- only **one** Label to be trained was expected, and there were 36 Labels!
>- reading the SKlearn documentation, it became clear that, if your Classifier algorithm was not originally built for **multilabel** targets, it is necessary to run it **n** times, once for each label
>- so it is necessary to include this in our pipeline, using the `MultiOutputClassifier()` wrapper

*And... it looks pretty **fast** to train, doesn't it? What is the secret? We are **bypassing** the tokenizer and preprocessor, as we **already did** that work on the dataset!*

*Another thing, we are not using the **whole** dataset... it's just because of a little **issue** we have, as there are a lot of **missing labels** in the dataset! And for me, it will **distort** our training! (later I will compare the results with training on the **raw** dataset)*

**Naïve Bayes** is known as a very **fast** method:

>- but it is also known as being not so **accurate**
>- and it has very **few** parameters for later refinement

I could reach a Model Accuracy of **92.2%**, after **.58** seconds for fitting the Classifier

In [53]:
start = time()

def dummy(doc):
    return doc

#Naïve Bayes classifier pipeline - no randomization involved
pipeline_mbnb = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf',  MultiOutputClassifier(MultinomialNB()))])
                          #('clf', MultinomialNB())]) #<-my terrible mistake!
#remembering:
#CountVectorizer -> counts the occurrences of each token
#TfidfTransformer -> makes the weight "normalization" for word occurrences
#MultinomialNB -> is my Classifier

#fit text_clf (our first Classifier model)
pipeline_mbnb.fit(X_train, y_train)

spent = time() - start
print('NAÏVE BAYES - process time: {:.2f} seconds'.format(spent))

NAÏVE BAYES - process time: 1.10 seconds


If I want, I can see the parameters for my **Pipeline**, using this command

In [54]:
#pipeline_mbnb.get_params()

## V. Run metrics for it

Predicting using **Naïve Bayes** Classifier

And I got this **weird** warning message:

"**UndefinedMetricWarning:**" 

>- "Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples"
>- "Use `zero_division` parameter to control this behavior"

And searching, I found this explanation [here](https://stackoverflow.com/questions/43162506/undefinedmetricwarning-f-score-is-ill-defined-and-being-set-to-0-0-in-labels-wi)

>- it is not a **weird error** at all. Some labels couldn't be predicted when running the Classifier
>- so the report doesn't know how to handle them

"What you can do, is decide that you are not interested in the scores of labels that were not predicted, and then explicitly specify the labels you are interested in (which are labels that were predicted at least once):"

`metrics.f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))`

#### Dealing with this issue

**First**, I altered my function `fn_plot_scores` so that it does not allow comparisons over an empty (**not trained**) column in `y_pred`

The check for whether all predicted values are **zeroes** comes from [here](https://stackoverflow.com/questions/48570797/check-if-pandas-column-contains-all-zeros)

I was also using a **general** calculation for Accuracy in my function. The problem is: **zeroes** matched against **zeroes** count as an accuracy of **1**, distorting my actual Accuracy towards a higher (**unreal**) value:

>- so, for general model Accuracy, I cannot use `accuracy = (y_pred == y_test.values).mean()`
>- I use `f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))` instead
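The distortion is easy to reproduce on a toy example. In this sketch the label matrix is invented, and I use the `zero_division` parameter mentioned in the warning instead of the `labels=` trick, so it is an illustration and not the code inside `udacourse2`:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel ground truth: 6 messages x 3 labels, mostly zeroes
y_true = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]])

# A "lazy" model that never predicts a positive label
y_lazy = np.zeros_like(y_true)

# Cell-by-cell accuracy rewards every zero matched against a zero
naive_accuracy = (y_lazy == y_true).mean()

# Weighted F1 sees that no positive label was ever found
weighted_f1 = f1_score(y_true, y_lazy, average='weighted', zero_division=0)

print('naive accuracy: {:.2f}'.format(naive_accuracy))
print('weighted F1:    {:.2f}'.format(weighted_f1))
```

The lazy model matches 15 of the 18 cells, yet it never finds a single positive label.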

**Version 1.9** updated: created my own customized function for showing metrics

**Version 1.15** updated: improved my customized function for other metrics

>- I was using the mean F1 Score as "Model Precision" and that seems a bit **silly**, as there are other metrics
>- I found better material in the SkLearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)
>- for example, as we are using binary labels, and the most important one is the "1" label, we can set this in the parameters as `average='binary'` and `pos_label=1`
>- another thing: looking at **Precision** and **Recall** separately seems more **informative** here than **F1** alone
>- about ill-defined parameters, I found some documentation at [Udacity](https://knowledge.udacity.com/questions/314220)
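As an illustration of the `average='binary'` and `pos_label=1` settings, the sketch below scores each label column on its own. The arrays and label names are toy values, and the loop is only my assumption of how such a report could be built, not the body of `fn_scores_report2`:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions: 4 messages x 2 labels
y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 0], [1, 0]])
label_names = ['aid_related', 'floods']

report = {}
for i, name in enumerate(label_names):
    # score only the positive class ("1") of this label
    precision = precision_score(y_true[:, i], y_hat[:, i], average='binary',
                                pos_label=1, zero_division=0)
    recall = recall_score(y_true[:, i], y_hat[:, i], average='binary',
                          pos_label=1, zero_division=0)
    report[name] = (precision, recall)
    print('{}: precision={:.2f} recall={:.2f}'.format(name, precision, recall))
```

The `floods` column is never predicted, so `zero_division=0` sets its precision to zero without raising the warning.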

**Future improvement**

>- there are better metrics for **multilabel classification** [here](https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3)
>- we could use **Precision at k** `P@k`, **Avg precision at k** `AP@k`, **Mean avg precision at k** `MAP@k` and **Sampled F1 Score** `F1 Samples`
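Of the metrics listed, the **Sampled F1 Score** is already available in scikit-learn through `average='samples'`. A minimal sketch with made-up arrays (the `@k` metrics are not covered here):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy data: 2 messages x 3 labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 0]])

# F1 is computed per message (row) and then averaged over messages
f1_samples = f1_score(y_true, y_hat, average='samples', zero_division=0)
print('F1 Samples: {:.3f}'.format(f1_samples))
```

The first message scores 2/3 (one of its two labels was found) and the second scores 1, which gives a mean of 5/6.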

---

**Version 1.17** for **Naïve Bayes**: new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **31.2**%
>- Precision now is **85.9**%
>- Recall now is **26.4**%

---

**Version 1.18** for **Naïve Bayes** letting the tokenizer take the same word more than once:

>- Model Accuracy now is **31.5**%
>- Precision now is **86.3**%
>- Recall now is **26.6**%

In [55]:
y_pred = pipeline_mbnb.predict(X_test)
udacourse2.fn_scores_report2(y_test, 
                             y_pred,
                             best_10=True)
#udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.77      0.44      0.56      2313
           1       0.65      0.89      0.75      2668

    accuracy                           0.68      4981
   macro avg       0.71      0.66      0.65      4981
weighted avg       0.70      0.68      0.66      4981

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.79      0.93      0.85      3109
           1       0.83      0.59      0.69      1872

    accuracy                           0.80      4981
   macro avg       0.81      0.76      0.77      4981
weighted avg       0.81      0.80      0.79      4981

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   sup

(0.31483529958459766, 0.8628820325241943, 0.26578107990419364)

Model Accuracy is distorted by **false fitting** (zeroes over zeroes)

Manually, I found the true mean to be close to **82%**

In [56]:
real_f1 = [.78, .86, .83, .85, .80, .83, .81, .91, .86, .69, .83]

corr_precision = statistics.mean(real_f1)
print('F1 corrected Model Accuracy: {:.2f} ({:.0f}%)'.format(corr_precision, corr_precision*100.))

F1 corrected Model Accuracy: 0.82 (82%)


#### Criticism of the performance of my Classifier

I know what you are thinking: "Uh, there is something **wrong** with the Accuracy of this guy"

So, as you can see: **92.2%** is too high for a **Naïve Bayes Classifier**!

There are some explanations here:

>- if you read it with care, you will find this **weird** label `related`. It seems to be **positive** for every row in my dataset, so it pushes the average **higher**
>- if you look at each **weighted avg**, you will find some clearly **bad** values, such as **68%** for **aid_related** (if you think about it, the model guesses this label correctly in only about **2/3** of the cases... so a really **bad** performance)

*Update 1: when I removed the `related` column, my **Model Accuracy** fell to **56.1%**. Normally my Labels hold an f1-score of around **75-78%**. Now I think that these **untrainable columns** are making my average Accuracy fall!*

---

But there is another **criticism** about this data.

I am an **Engineer** by profession, and I have worked for almost **19** years in a **hydrology** datacenter for the Brazilian Government. So, in some cases, you see some data and start thinking: "this data is not what it seems".

And the main problem with this data is:

>- it is a **mistake** to think that all we need to do with it is to train a **Supervised Learning** machine!
>- if you look with care, this is not about **Supervised Learning**, it is actually a **Semi-Supervised Learning** problem. Why?
>- just consider that there were **zillions** of Twitter messages about catastrophes all around the world. When a message was not originally in English, they translated it. Then someone manually **labeled** each of these catastrophe reports. And a **lot** of them remained with **no classification**
>- if I just interpret it as a **Supervised Learning** challenge, I will feed my Classifier with a lot of **false negatives**. And my Machine Learning Model will learn to **leave blank** a lot of these messages, as it was trained on my **raw** data!

So in the **preprocessing** step, I avoided **unlabelled data** by filtering out of the training set every row that does not contain any label. They were clearly **neglected** during manual labeling!
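The filter for rows without any label can be sketched as follows. The frame and the `label_cols` list are toy stand-ins; in the real dataset the label columns are the ones listed in the hierarchy earlier in this notebook:

```python
import pandas as pd

# Toy frame: the second message was never labeled by anyone
toy = pd.DataFrame({
    'tokenized': [['need', 'water'], ['hello'], ['storm', 'damage']],
    'request': [1, 0, 0],
    'storm':   [0, 0, 1],
})
label_cols = ['request', 'storm']  # hypothetical subset of the label columns

# A row is usable for training only if at least one label is filled
has_label = toy[label_cols].sum(axis=1) > 0
labelled = toy[has_label]

print('kept {} of {} rows'.format(labelled.shape[0], toy.shape[0]))
```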




## VI. Try other Classifiers

- I will try some Classifiers based on a **hierarchical structure**:

>- why a **hierarchical structure** for words? Because I think we do it **naturally** in our brain
>- when science mimics nature, I personally think that things go better. So, let's try it!

First of them, **Random Forest** Classifier

>- I am treating **RFC** as a **single-label** Classifier here (scikit-learn's implementation can also accept a 2D label matrix directly), so it is called **n** times, once for each label to be classified
>- so, we need to call it indirectly, using the **Multi-Output** Classifier tool
>- it took **693.73 seconds** (about 11 minutes and 34 seconds) to complete the task (not so bad!)
>- I tried to configure a **GridSearch**, just to set the number of processors to `-1` (meaning all the available ones)
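A GridSearch is not strictly needed just to use more processors: `RandomForestClassifier` accepts `n_jobs` itself (and so does `MultiOutputClassifier`). A minimal sketch on random toy data, with sizes and `n_estimators` chosen only to keep it fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Random toy features and 3 binary labels derived from them (illustration only)
rng = np.random.RandomState(42)
X_toy = rng.rand(60, 10)
y_toy = (X_toy[:, :3] > 0.5).astype(int)

# n_jobs=-1 asks the forest to build its trees on all available cores
forest = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42)
clf = MultiOutputClassifier(forest)
clf.fit(X_toy, y_toy)

pred = clf.predict(X_toy)
print(pred.shape)
```

Whether this shortens the 11 minutes measured above depends on the machine, so it is something to time rather than assume.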

Accuracy was near to **93%** before removing the `related` label. Now it is **93.8%**. So, it doesn't matter!

**Version 1.10** update: prepared other Machine Learning Classifiers for training the data

---

**Version 1.17** for **Random Forest**: new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **66.5**%
>- Precision now is **69.8**%
>- Recall now is **70.1**%

---

**Version 1.18** for **Random Forest** letting the tokenizer take the same word more than once:

>- Model Accuracy now is **66.4**%
>- Precision now is **79.8**%
>- Recall now is **59.7**%

In [57]:
start = time()

def dummy(doc):
    return doc

#Random Forest makes randomization, so random_state parameter was set
pipeline_rafo = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42)))])

pipeline_rafo.fit(X_train, y_train)

#an attempt to use multiple cores to process the task

#param_grid = 
#gs_clf = GridSearchCV(pipeline_rafo, parameters, n_jobs=-1)
#gs_clf = gs_clf.fit(X_train, y_train)

spent = time() - start
s_min = spent // 60
print('RANDOM FOREST - process time: {:.0f} minutes, {:.2f} seconds ({:.2f}s)'\
      .format(s_min, spent-(s_min*60), spent))

RANDOM FOREST - process time: 13 minutes, 21.23 seconds (801.23s)


In [58]:
y_pred = pipeline_rafo.predict(X_test)
udacourse2.fn_scores_report2(y_test, 
                             y_pred,
                             best_10=True)
#udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.74      0.67      0.70      2313
           1       0.74      0.79      0.76      2668

    accuracy                           0.74      4981
   macro avg       0.74      0.73      0.73      4981
weighted avg       0.74      0.74      0.74      4981

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.87      0.93      0.89      3109
           1       0.86      0.76      0.81      1872

    accuracy                           0.86      4981
   macro avg       0.86      0.84      0.85      4981
weighted avg       0.86      0.86      0.86      4981

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   sup

(0.6637689138512191, 0.7976322127539948, 0.5969071962832412)

Another tree-like Classifier is **Adaboost**:

>- they say Adaboost is especially good at **differentiating** positives from negatives
>- it took **106.16 seconds** (about **1** minute and **46** seconds) to complete the task... not so bad... (as AdaBoost doesn't use **trees**, but **stumps** for doing its job)

Accuracy was near to **91%**. After removing the `related` label:

>- it rose to **93.6%**. As Adaboost is based on **stumps**, a bad label perhaps distorts the model
>- training time dropped to **71.57** seconds, a reduction of about 30%

*Adaboost seems to be really **fast** when compared to Random Forest. And without losing too much in terms of Model Accuracy...*

---

**Version 1.17** for **Adaboost**: new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **66.3**%
>- Precision now is **77.7**%
>- Recall now is **58.7**%

---

**Version 1.18** for **Adaboost** letting the tokenizer take the same word more than once:

>- Model Accuracy now is **65.4**%
>- Precision now is **77.3**%
>- Recall now is **57.8**%

In [59]:
start = time()

def dummy(doc):
    return doc

#Adaboost makes randomization, so random_state parameter was set
pipeline_adab = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf',  MultiOutputClassifier(AdaBoostClassifier(random_state=42)))])

pipeline_adab.fit(X_train, y_train)

spent = time() - start
print('ADABOOST - process time: {:.2f} seconds'.format(spent))

ADABOOST - process time: 142.60 seconds


In [60]:
y_pred = pipeline_adab.predict(X_test)
udacourse2.fn_scores_report2(y_test, 
                             y_pred,
                             best_10=True)
#udacourse2.fn_scores_report(y_test, y_pred)

###function scores_report started
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.69      0.70      0.69      2313
           1       0.73      0.73      0.73      2668

    accuracy                           0.71      4981
   macro avg       0.71      0.71      0.71      4981
weighted avg       0.71      0.71      0.71      4981

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      3109
           1       0.88      0.70      0.78      1872

    accuracy                           0.85      4981
   macro avg       0.86      0.82      0.83      4981
weighted avg       0.85      0.85      0.85      4981

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   sup

(0.654363384791236, 0.7728101688203811, 0.5784827533968593)

---

#### Falling into a trap when choosing another Classifier

Then I tried a **Stochastic Gradient Descent** (SGD) [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

_"Linear classifiers (SVM, logistic regression, etc.) with SGD training"_

It can work as a **Support Vector Machine** (SVM), which is a fancy way of defining a good frontier


`clf = SGDClassifier()` with some parameters

>- `learning_rate='optimal'`$\rightarrow$ **decreasing strength schedule** used for updating the gradient of the loss at each sample
>- `loss='hinge'` $\rightarrow$ **Linear SVM** for the fitting model (works with data represented as dense or sparse arrays for features)
>- `penalty=['l2', 'l1', 'elasticnet']` $\rightarrow$ **regularizer** that shrinks model parameters towards the zero vector using the squared Euclidean norm (l2), the absolute norm (l1), or a combination of both (**Elastic Net**)
>- `alpha=[1e-5, 1e-4, 1e-3]` $\rightarrow$ constant that multiplies the regularization term; the higher the value, the **stronger** the regularization (also used to compute the **Learning Rate** when learning_rate is set to 'optimal')
>- `n_iter=[1, 5, 10]` $\rightarrow$ number of passes over the Training Data (**Epochs**); recent scikit-learn versions call this parameter `max_iter`. It only impacts the behavior in the **fit method**, and not the partial_fit method
>- `random_state=42` $\rightarrow$ if you want to replicate exactly the same output each time you retrain your machine

*Note that this is my reading of the text on the SkLearn website for this Classifier*

---

And **SGDC** didn't work! It gave me a **ValueError: y should be a 1d array, got an array instead**. So, something went wrong.

Searching for the cause of the problem, I found this explanation [here](https://stackoverflow.com/questions/20335853/scikit-multilabel-classification-valueerror-bad-input-shape)

*"No, SGDClassifier does not do **multilabel classification** (what I need!) -- it does **multiclass classification**, which is a different problem, although both are solved using a one-vs-all problem reduction"*

*(we use Multiclass Classification when the possible classifications are **mutually exclusive**. For example, I have a picture with a kind of fruit, and it could be classified as a **banana**, or a **pear**, or even an **apple**. Clearly that is not our case!)*

*Then, neither **SGD** nor OneVsRestClassifier.fit will accept a **sparse matrix** (which is what I have!) for y* 

*- SGD wants an **array of labels** (which is what I have!), as you've already found out*

*- OneVsRestClassifier wants, for multilabel purposes, a list of lists of labels*

*Note that this is my reading of the explanatory text that I found about SGDC for Multilabel*

---

There is a good explanation about **Multiclass** and **Multilabel** Classifiers [here](https://scikit-learn.org/stable/modules/multiclass.html)

Don't try to run this code:

In [61]:
#start = time()

#def dummy(doc):
#    return doc

#random_state=42 #<-just to remember!
#pipeline_sgrd = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
#                          ('tfidf', TfidfTransformer()),
#                          ('clf', SGDClassifier(loss='hinge', 
#                                                penalty='l2',
#                                                alpha=1e-3))]) 
#fit_sgrd = pipeline_sgrd.fit(X_train, y_train)

#spent = time() - start
#print('STOCHASTIC GRADIENT DESCENT - process time:{:.2f} seconds'.format(spent))
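The pipeline above fails because `SGDClassifier` receives a 2D label matrix. One possible way around it, which I have not timed or scored on the real dataset, is the same `MultiOutputClassifier` wrapper used for the other Classifiers. A sketch on random toy data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier

# Random toy features and 4 binary labels derived from them (illustration only)
rng = np.random.RandomState(42)
X_toy = rng.rand(80, 12)
y_toy = (X_toy[:, :4] > 0.5).astype(int)

# max_iter replaces the older n_iter parameter in recent scikit-learn versions
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                    max_iter=1000, tol=1e-3, random_state=42)

# One independent linear SVM is fitted per label column
multi_sgd = MultiOutputClassifier(sgd)
multi_sgd.fit(X_toy, y_toy)

pred = multi_sgd.predict(X_toy)
print(pred.shape)
```

With the wrapper, each label becomes an ordinary binary problem, which is what `SGDClassifier` expects.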

Let's try **K-Neighbors Classifier**

**First** try, `n_neighbors=3`:

>- model Accuracy was **91.8%**... not so bad!
>- and... why only **3** neighbors? This parameter is quite **arbitrary** in our case... it could be 2 or 5... depending on how many (or how few) neighbors we can rely on, a different value could **tune** our classifier better... so why not try it using **GridSearch**?

**Second** try, `n_neighbors=7` and `p=1` (using **GridSearch**, explanation below to tune it for a better result):

>- it took **.74** seconds to **fit** the Classifier
>- the slowest part was to **predict**, which took **5** minutes and **27** seconds!
>- it gave us **92.0%** of model Accuracy... and a lot of **non-fitting** labels!
>- so, it was not a good idea to use the new parameters, the **original ones** are better!

Some reflections about models, **GridSearch** and best parameters:

>- sometimes a **slight** difference isn't worth the computational price
>- another thing to reflect on: why did I start with only **3** neighbors? Because Twitter messages are quite **short**. When tokenized, the number of **tokens** normally doesn't exceed **7**!
>- so, giving a brutal **resolution** to poor data is normally not a good idea

**Third** try, `n_neighbors=3` and `p=1`

>- I achieved **91.3%** accuracy, without using so much computational power!
>- only tuning the **power** parameter a bit provided me with a slightly **better** result
>- training time is **0.79** seconds and predict is **5** minutes and **27** seconds

**Version 1.11** update: preparation of k-Neighbors Classifier for training

*k-Neighbors does not seem to fit so well for this kind of problem!*

---

**Version 1.17** for **k-Nearest** updated with new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **39.1**%
>- Precision now is **60.1**%
>- Recall now is **32.6**%

---

**Version 1.18** for **k-Nearest** letting the tokenizer take the same word more than once:

>- Model Accuracy now is **38.8**%
>- Precision now is **60.5**%
>- Recall now is **32.2**%

In [62]:
start = time()

def dummy(doc):
    return doc

#k-Neighbors doesn't use randomization
pipeline_knbr = Pipeline([('vect', CountVectorizer(tokenizer=dummy, preprocessor=dummy)),
                          ('tfidf', TfidfTransformer()),
                          ('clf', MultiOutputClassifier(KNeighborsClassifier(n_neighbors=3, p=1)))])

pipeline_knbr.fit(X_train, y_train)

spent = time() - start
print('K NEIGHBORS CLASSIFIER - process time: {:.2f} seconds'.format(spent))

K NEIGHBORS CLASSIFIER - process time: 0.56 seconds


In [63]:
start = time()

y_pred = pipeline_knbr.predict(X_test)
udacourse2.fn_scores_report2(y_test, 
                             y_pred,
                             best_10=True)
#udacourse2.fn_scores_report(y_test, y_pred)

spent = time() - start
print('process time: {:.2f} seconds'.format(spent))

###function scores_report started
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.53      0.78      0.63      2313
           1       0.67      0.39      0.49      2668

    accuracy                           0.57      4981
   macro avg       0.60      0.59      0.56      4981
weighted avg       0.60      0.57      0.56      4981

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.72      0.89      0.80      3109
           1       0.69      0.43      0.53      1872

    accuracy                           0.72      4981
   macro avg       0.71      0.66      0.66      4981
weighted avg       0.71      0.72      0.70      4981

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   sup

Linear Support Vector Machine, fed by TfidfVectorizer:
    
>- now, the idea is to train another type of machine, a **Support Vector Machine** (SVM)
>- SVM uses another philosophy, as you create a coordinate space for **vectors**
>- the coordinate system of this space can be made of **cartesian planes** or **polar combinations**
>- the idea is to separate the data using vectors as **separation elements**
>- in this case, we use only **linear** elements to make the separation
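A minimal sketch of what a linear separation element looks like, on toy 2-D points that are an assumption and not the project data. For a fitted `LinearSVC`, the score of a point is `w·x + b`, and the sign of that score decides the class.

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy, linearly separable points (assumption)
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [3., 3.], [3., 4.], [4., 3.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm = LinearSVC(C=2., random_state=42).fit(X_toy, y_toy)

# the separating element is the hyperplane w.x + b = 0
w = svm.coef_.ravel()
b = svm.intercept_[0]
manual_scores = X_toy @ w + b
manual_pred = (manual_scores > 0).astype(int)

print(manual_pred)
```

Recomputing the scores by hand gives the same values as `svm.decision_function(X_toy)`, which is why prediction with a linear model is so cheap.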

Why **Linear**?

>- the **computational cost** for linear entities on **discrete** computers is really low (if we were using **valved** computers, we could start exploring **non-linear** models with better profit)
>- now we need **fit** and **transform** operations on our vectors provider
>- it is a **fast** machine (**18.84** seconds), with the amazing Model Accuracy of a bit less than **93%** (one of the features could not be trained!)
>- after correcting the **label consistencies**, based on our **hierarchical structure**, Model Accuracy rose a bit, reaching **93.6%**!

**Version 1.12** update: preparation of a completely different kind of **Machine Learning Classifier**

---

**Version 1.17** for **Linear Support Vector** updated with new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **70.6**%
>- Precision now is **70.8**%
>- Recall now is **71.1**%

In [64]:
start = time()

def dummy(doc):
    return doc

feats = TfidfVectorizer(analyzer='word', 
                        tokenizer=dummy, 
                        preprocessor=dummy,
                        token_pattern=None,
                        ngram_range=(1, 3))

classif = OneVsRestClassifier(LinearSVC(C=2., 
                                        random_state=42))

#don't use this line, I thought it was necessary to do the separation!
#feats = feats.fit_transform(X_train)

pipeline_lnsv = Pipeline([('vect', feats),
                          ('clf', classif)])

pipeline_lnsv.fit(X_train, y_train)

spent = time() - start
print('LINEAR SUPPORT VECTOR MACHINE - process time:{:.2f} seconds'.format(spent))

LINEAR SUPPORT VECTOR MACHINE - process time:27.10 seconds


If you experience the error *NotFittedError: Vocabulary not fitted or provided*, see the discussion [here](https://stackoverflow.com/questions/60472925/python-scikit-svm-vocabulary-not-fitted-or-provided)
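A minimal sketch of how this error can be reproduced and avoided, assuming pre-tokenized documents like the ones used in this notebook (the toy `docs` and the `identity` helper are illustrative names). Calling `transform` on a vectorizer that was never fitted raises `NotFittedError`; fitting first, or letting a `Pipeline` do it, avoids the problem.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # documents are already lists of tokens
    return doc

docs = [['water', 'food'], ['storm', 'flood'], ['food', 'shelter']]
vect = TfidfVectorizer(tokenizer=identity,
                       preprocessor=identity,
                       token_pattern=None)

# transform before fit: there is no vocabulary yet
try:
    vect.transform(docs)
    raised = False
except NotFittedError:
    raised = True

# fit first, then transform
vect.fit(docs)
matrix = vect.transform(docs)
print(raised, matrix.shape)
```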

---

#### Test Area (for Version 1.16 improvement)

I am trying to create new **fancy** metrics for scoring my Classifiers

>- I was taking only the **General Average F1 Score** as a metric, and it seems so poorly detailed


The most frequently classified labels, according to my `fn_labels_report` function, are:

1. related:19928 (75.9%)
2. aid_related:10903 (41.5%)
3. weather_related:7304 (27.8%)
4. direct_report:5080 (19.4%)
5. request:4480 (17.1%)
6. other_aid:3448 (13.1%)
7. food:2930 (11.2%)
8. earthquake:2455 (9.4%)
9. storm:2448 (9.3%)
10. shelter:2319 (8.8%)
11. floods:2158 (8.2%)

When I remove **related** (after dropping the rows that have **no** classification at all, it is **"1"** for **all** of my dataset, so I cannot **train** on it), I get these new columns:

1. aid_related
2. weather_related
3. direct_report
4. request
5. other_aid
6. food
7. earthquake
8. storm
9. shelter
10. floods

Turning them into a list:

`top_labels = ['aid_related', 'weather_related', 'direct_report', 'request', 'other_aid', 'food', 'earthquake', 'storm', 'shelter', 'floods']`

Retrieve their position by name [here](https://stackoverflow.com/questions/13021654/get-column-index-from-column-name-in-python-pandas):

`y_test.columns.get_loc("offer")`
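A minimal sketch of both steps together, on a toy label frame that is an assumption (`y_toy` and its values are illustrative): rank the labels by their number of positive rows, then look up each column position with `get_loc`.

```python
import pandas as pd

# toy label frame (assumption), one 0/1 column per label
y_toy = pd.DataFrame({'related': [1, 1, 1, 1],
                      'request': [1, 0, 0, 0],
                      'aid_related': [1, 1, 1, 0],
                      'food': [0, 1, 1, 0]})

# rank labels by positive count, leaving `related` out
ranked = y_toy.drop(columns=['related']).sum().sort_values(ascending=False)
top_toy_labels = list(ranked.index[:2])

# position of each selected label in the original frame
positions = [y_toy.columns.get_loc(name) for name in top_toy_labels]
print(top_toy_labels, positions)
```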

**Version 1.16** update: new `fn_scores_report2` function created

---

**Version 1.17** for **Linear Support Vector Machine** updated with new, **more realistic** metrics based on the **top 10** labels:

>- Model Accuracy now is **69.9**%
>- Precision now is **70.8**%
>- Recall now is **71.1**%

**Version 1.18** for **Linear Support Vector Machine** letting the tokenizer take the same word more than once:

>- Model Accuracy now is **70.5**%
>- Precision now is **71.9**%
>- Recall now is **69.7**%

In [65]:
y_pred = pipeline_lnsv.predict(X_test)
udacourse2.fn_scores_report2(y_test, 
                             y_pred,
                             best_10=True)

###function scores_report started
######################################################
*aid_related -> label iloc[2]
              precision    recall  f1-score   support

           0       0.75      0.56      0.64      2313
           1       0.69      0.84      0.76      2668

    accuracy                           0.71      4981
   macro avg       0.72      0.70      0.70      4981
weighted avg       0.72      0.71      0.70      4981

######################################################
*weather_related -> label iloc[26]
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      3109
           1       0.80      0.79      0.80      1872

    accuracy                           0.85      4981
   macro avg       0.84      0.84      0.84      4981
weighted avg       0.85      0.85      0.85      4981

######################################################
*direct_report -> label iloc[33]
              precision    recall  f1-score   sup

(0.7051297978208334, 0.7187426419336524, 0.6965549902942181)
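The tuple above appears to be accuracy, precision and recall averaged over the selected labels. A minimal sketch of that kind of averaging follows, assuming a plain mean over label columns; it is not the actual `udacourse2.fn_scores_report2` implementation, and `mean_label_scores` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# toy multilabel truth and predictions (assumption): 4 rows, 2 labels
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_hat = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

def mean_label_scores(y_true, y_hat):
    '''Average accuracy, precision and recall over the label columns.'''
    acc, prec, rec = [], [], []
    for j in range(y_true.shape[1]):
        acc.append(accuracy_score(y_true[:, j], y_hat[:, j]))
        prec.append(precision_score(y_true[:, j], y_hat[:, j], zero_division=0))
        rec.append(recall_score(y_true[:, j], y_hat[:, j], zero_division=0))
    return float(np.mean(acc)), float(np.mean(prec)), float(np.mean(rec))

scores = mean_label_scores(y_true, y_hat)
print(scores)
```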

Don't use this function! (deprecated)

In [66]:
#y_pred = pipeline_lnsv.predict(X_test)
#udacourse2.fn_scores_report(y_test, y_pred)

## VIII. Make a Fine-Tuning effort over Classifiers

#### First attempt: Stochastic Gradient Descent

**Grid Search**

`parameters = {'vect__ngram_range': [(1, 1), (1, 2)],`
              `'tfidf__use_idf': (True, False),`
              `'clf__alpha': (1e-2, 1e-3)}`

- use **multiple cores** to process the task

`gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)`

`gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)`

- see the **mean score** of the parameters

`gs_clf.best_score_`

`gs_clf.best_params_`

*Not implemented, because our SGD effort was abandoned. Only some sketches from my studies of GridSearch on SGD remain here! (source: scikit-learn parameters and documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html))*
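A minimal sketch of how the `step__parameter` names in such a grid relate to a pipeline, and how many candidates the grid produces. The pipeline `text_clf` is rebuilt here with default estimators as an assumption; no fitting is done.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': [True, False],
              'clf__alpha': [1e-2, 1e-3]}

# every grid key must be a parameter that the pipeline exposes
known = text_clf.get_params()
unknown_keys = [key for key in parameters if key not in known]

# number of parameter combinations that a grid search would try
n_candidates = len(ParameterGrid(parameters))
print(unknown_keys, n_candidates)
```

Each candidate is fitted once per cross-validation fold, so the run time grows quickly with every extra value in the grid.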

#### Second attempt: k-Neighbors

>- we can see tunable parameters using the command `Class_k.get_params()`
>- I tried to tune up for `n_neighbors` and for `p`
>- it took **74** minutes and **15** seconds to run (so, don't try it!)
>- best estimator was **n_neighbors=7** and **p=1** $\rightarrow$ "Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1)" (from SkLearn documentation)
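A minimal sketch of the Minkowski distance that the `p` parameter controls, written in plain Python for illustration (`minkowski_distance` is a hypothetical helper, not a scikit-learn function). With `p=1` it is the Manhattan distance, with `p=2` the Euclidean one.

```python
def minkowski_distance(a, b, p):
    '''Minkowski distance between two equal-length vectors.'''
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

u = [0.0, 3.0]
v = [4.0, 0.0]

manhattan = minkowski_distance(u, v, p=1)  # |0-4| + |3-0|
euclidean = minkowski_distance(u, v, p=2)  # sqrt(4^2 + 3^2)
print(manhattan, euclidean)
```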

**Version 1.13** update: implemented **Grid Search** for some selected Classifiers

**Future implementation**: test other parameters for a better fine-tuning (I didn't make an **exhaustive fine-tuning**!)

Don't use this code, it takes too much time to process!

In [67]:
#start = time()

#def dummy(doc):
#    return doc


#k-Neighbors doesn't use randomization
#Vect_k = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
#Transf_k = TfidfTransformer()
#Class_k = MultiOutputClassifier(KNeighborsClassifier())

#pipeline_knbr = Pipeline([('vect', Vect_k),
#                          ('tfidf', Transf_k),
#                          ('clf', Class_k)])

#param_dict = {'clf__estimator__n_neighbors': [3,5,7],
#              'clf__estimator__p': [1,2]}

#estimator = GridSearchCV(estimator=pipeline_knbr, 
#                         param_grid=param_dict,
#                         n_jobs=-1) #, scoring='roc_auc')

#estimator.fit(X_train, y_train)

#spent = time() - start
#s_min = spent // 60
#print('K NEIGHBORS CLASSIFIER - process time: {:.0f} minutes, {:.2f} seconds ({:.2f}s)'\
#      .format(s_min, spent-(s_min*60), spent))

In [68]:
#fit_knbr.best_estimator_ 

**Linear SVC**: new parameter found by using **Grid Search**

- `C=0.5`

- run time for training the Classifier is **4**min **26**sec

In [69]:
start = time()

def dummy(doc):
    return doc

feats = TfidfVectorizer(analyzer='word', 
                        tokenizer=dummy, 
                        preprocessor=dummy,
                        token_pattern=None,
                        ngram_range=(1, 3))
classif = OneVsRestClassifier(LinearSVC())

pipeline_lnsv = Pipeline([('vect', feats),
                          ('clf', classif)])


param_dict = {'clf__estimator__C': [0.1,0.5,1.0,2.0,5.0]}

estimator = GridSearchCV(estimator=pipeline_lnsv, 
                         param_grid=param_dict,
                         n_jobs=-1) #, scoring='roc_auc')

estimator.fit(X_train, y_train)

spent = time() - start
s_min = spent // 60
print('LINEAR SUPPORT VECTOR MACHINE - process time: {:.0f} minutes, {:.2f} seconds ({:.2f}s)'\
      .format(s_min, spent-(s_min*60), spent))

LINEAR SUPPORT VECTOR MACHINE - process time: 4 minutes, 30.83 seconds (270.83s)


In [70]:
#classif.get_params()

In [71]:
estimator.best_estimator_ 

Pipeline(steps=[('vect',
                 TfidfVectorizer(ngram_range=(1, 3),
                                 preprocessor=<function dummy at 0x0000022B95B5AC18>,
                                 token_pattern=None,
                                 tokenizer=<function dummy at 0x0000022B95B5AC18>)),
                ('clf', OneVsRestClassifier(estimator=LinearSVC(C=0.5)))])
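Besides `best_estimator_`, a fitted grid search also exposes the winning parameters and the mean score of every candidate. A minimal sketch on synthetic multilabel data follows; the data, `cv=3` and the names `search` and `c_grid` are assumptions for illustration, not this notebook's run.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# synthetic multilabel data (assumption)
X_toy, Y_toy = make_multilabel_classification(n_samples=200,
                                              n_features=20,
                                              n_classes=3,
                                              random_state=42)

c_grid = [0.1, 0.5, 1.0, 2.0, 5.0]
search = GridSearchCV(estimator=OneVsRestClassifier(LinearSVC(random_state=42)),
                      param_grid={'estimator__C': c_grid},
                      cv=3)
search.fit(X_toy, Y_toy)

best_c = search.best_params_['estimator__C']
mean_scores = search.cv_results_['mean_test_score']
print(best_c, mean_scores)
```

Comparing the entries of `mean_test_score` shows whether the best value wins by a real margin or only by noise.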

If you get *NotFittedError: Vocabulary not fitted or provided*, see [here](https://stackoverflow.com/questions/60472925/python-scikit-svm-vocabulary-not-fitted-or-provided)

## VIII. Choosing my Classifier

### Classifiers Training & Tuning Summary


|      Classifier      | Model Accuracy | Time to Train |           Observation          |
|:--------------------:|:--------------:|:-------------:|:------------------------------:|
| Binomial Naive Bayes | less than 82%  | 0.68s         | 22 labels couldn't be trained! |
| Random Forest        | less than 90%  | 11m 44s       | 3 labels couldn't be trained!  |
| Adaboost             | 93.6%          | 100.5s        |                                |
| k-Neighbors          | less than 90%  | 0.58s         | 3 labels couldn't be trained!  |
| Linear SVM           | less than 93%  | 26.81s        | 2 labels couldn't be trained!  |

*Thanks to the Tables Generator service [here](https://www.tablesgenerator.com/markdown_tables)*

#### In my concept, the rank is

**First** place, Adaboost. It seemed **reliable** and **fast** for this particular task, and it is a neat machine, really easy to understand

**Second** place, Linear SVM. Some of these labels are really **hard** to train and it was really **fast**

**Third** place, k-Neighbors. It is **fast** and seems as reliable as **Random Forest**, which is **too hard** to train

---

And I will take... **Linear SVM**!

*Why? Just because I cannot **really** believe that some of these labels can be trained!*

The "bad guys" were `tools`, `shops` and `aid_centers`, and the **real** problem involved is:

>- `shops` $\rightarrow$ 120
>- `tools` $\rightarrow$ 159
>- `aid_centers` $\rightarrow$ 309

>- there are so **few** labelled rows for these 3 guys that I really cannot believe that any Machine Learning Classifier can really **train** for them!
>- and what about **Adaboost**? Well, Adaboost is based on the **stumps** algorithm. And by processing the data, it cannot really reach a true **zero**, as the stumps inside it do not allow this kind of thing. So, instead of a **1**, it will give you a **0.999%**, which is worth nothing for practical uses
>- later I can run more **GridSearch** over **Linear SVM**. Adaboost doesn't have so many options for future improvement
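A small worked example of why so few positive rows are a problem for the accuracy figures. It assumes a hypothetical dataset of 10,000 rows (the real size is not used here) and the 120 positives listed for `shops`: a model that always answers 0 looks excellent on accuracy and finds nothing.

```python
n_rows = 10_000      # hypothetical dataset size (assumption)
n_positive = 120     # positive rows for `shops`, from the list above

y_true = [1] * n_positive + [0] * (n_rows - n_positive)
y_all_zero = [0] * n_rows   # a "classifier" that never predicts the label

correct = sum(t == p for t, p in zip(y_true, y_all_zero))
accuracy = correct / n_rows

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_all_zero))
recall = true_positives / n_positive

print('accuracy: {:.3f}, recall: {:.3f}'.format(accuracy, recall))
```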

So, I will use in my model **Linear SVM**

---

**Version 1.14** update: filtering for **valid** ones over critical labels

Chosen model changed to **Adaboost**. Why?

>- counting the **valid** labels showed that these labels are in fact **trainable**, but that it is not so easy to do
>- probably they are pressed to **zero**, as there are many more **false negatives** under these labels

**Future version** - as my label columns are clearly **hierarchical**:

>- I could break my original dataset into 3 **more specific** datasets, as `infrastructure_related`, `aid_related` and `weather_related`, and include in each one the rows that are **relevant**
>- In this case, the noise caused by **false negatives** will decrease, making it easier for each training to achieve a better score
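A minimal sketch of that future split, on a toy frame that is an assumption (`df_toy` and its rows are illustrative): each parent label gets its own dataset with only the rows where that parent is 1.

```python
import pandas as pd

# toy frame (assumption): tokenized text plus the three parent labels
df_toy = pd.DataFrame({'tokenized': [['need', 'water'],
                                     ['storm', 'coming'],
                                     ['bridge', 'down'],
                                     ['food', 'flood']],
                       'aid_related': [1, 0, 0, 1],
                       'weather_related': [0, 1, 0, 1],
                       'infrastructure_related': [0, 0, 1, 0]})

parents = ['infrastructure_related', 'aid_related', 'weather_related']

# one dataset per parent label, keeping only the relevant rows
subsets = {parent: df_toy[df_toy[parent] == 1] for parent in parents}
sizes = {parent: len(subset) for parent, subset in subsets.items()}
print(sizes)
```

A row can land in more than one subset, since a message may belong to several parent categories.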

---

**Version 1.17** updated: metrics **changed**, so my choice may change too!

New table for **Classifier evaluation** (10 greatest labels):

|      Classifier      | Precision | Recall | Worst Metrics |
|:--------------------:|:---------:|:------:|:-------------:|
| Binomial Naïve Bayes | 85.9%     | 26.4%  | 65.6% & 0.1%  |
| Random Forest        | 79.8%     | 60.1%  | 62.2% & 8.4%  |
| Adaboost             | 77.7%     | 58.7%  | 48.4% & 20.4% |
| k-Neighbors          | 60.1%     | 32.6%  | 28.6% & 1.2%  |
| Linear SVM           | 70.8%     | 71.1%  | 43.0% & 32.5% |

*Random Forest is very **slow** to fit!*
*k-Neighbors is really **slow** to predict!*

So, now I can see a lot of advantage for choosing **Linear SVM**:

>- it is not **slow** for fit/train
>- I can later explore other, better parameters using **GridSearch**
>- It **doesn't decay** so fast for labels without so many rows to train on
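One way to read the Precision and Recall columns of the Version 1.17 table together is the harmonic mean of the two. This is only a rough sketch: an F1 computed from averaged precision and recall is not the same as the average of per-label F1 scores.

```python
# (precision, recall) pairs copied from the Version 1.17 table
table = {'Naive Bayes': (0.859, 0.264),
         'Random Forest': (0.798, 0.601),
         'Adaboost': (0.777, 0.587),
         'k-Neighbors': (0.601, 0.326),
         'Linear SVM': (0.708, 0.711)}

def harmonic_mean(precision, recall):
    '''F1-style combination of precision and recall.'''
    return 2 * precision * recall / (precision + recall)

combined = {name: harmonic_mean(p, r) for name, (p, r) in table.items()}
best = max(combined, key=combined.get)

for name, value in sorted(combined.items(), key=lambda item: -item[1]):
    print('{:<14} {:.3f}'.format(name, value))
```

Under this reading the most balanced classifier is also the one chosen above.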

My second choice is **Adaboost**

*If things don't go well, I have a fancy alternative!*

**Version 1.18**: letting the tokenizer take the same word more than once:

|      Classifier      | Precision | Recall | Worst Metrics | Observations                  |
|:--------------------:|:---------:|:------:|:-------------:|:-----------------------------:|
| Binomial Naïve Bayes | 86.3%     | 26.6%  | 64.5% & 0.1%  | Imperceptible changes         |
| Random Forest        | 79.8%     | 59.7%  | 61.8% & 9.3%  | Recall lowered a bit          |
| Adaboost             | 77.3%     | 55.8%  | 46.1% & 15.9% | Recall lowered a bit          |
| k-Neighbors          | 60.5%     | 32.2%  | 29.5% & 1.9%  | Parameters slightly increased |
| Linear SVM           | 70.5%     | 71.9%  | 44.7% & 35.8% | Parameters slightly increased |

*So, I will **keep** my tokenizer allowing repeated tokens for each message, as I chose to use **Linear SVM**. If in the future training becomes too slow (as I get more and more messages in my dataset for training), I can go back to the earlier setting (only unique tokens per message)*

Checking the number of **positive** rows for the labels that have **few** data:

- observe that `child_alone` was previously removed from our training dataset

In [72]:
df2 = df[df.columns[5:]]
a = df2.apply(pd.Series.value_counts).loc[1]
a[a < 400]

offer             118.0
missing_people    299.0
tools             159.0
hospitals         283.0
shops             120.0
aid_centers       309.0
fire              282.0
Name: 1, dtype: float64

In [None]:
#mean score of the parameters
#gs_clf.best_score_
#gs_clf.best_params_

## IX. Export your model as a pickle file

1. Choose your model, with the fine-tuning already done (this can be changed later!)

How to deal with pickle [here](https://www.codegrepper.com/code-examples/python/save+and+load+python+pickle+stackoverflow)

Pickle documentation [here](https://docs.python.org/3/library/pickle.html#module-pickle)

2. Final considerations about this model:

>- I chose **Adaboost** as our Classifier
>- The explanation for my choice is in the item **above**

---

**Version 1.18** update: now my Classifier was changed to **Linear SVC**. The explanations for my choice are **above**

Trying the **Demo** code that I found at **Codegrepper.com**

In [None]:
import pickle

dic = {'hello': 'world'}

with open('filename.pkl', 'wb') as pk_writer: #wb is for write+binary
    pickle.dump(dic, 
                pk_writer, 
                protocol=pickle.HIGHEST_PROTOCOL)
    
with open('filename.pkl', 'rb') as pk_reader: #rb is for read+binary
    dic_unpk = pickle.load(pk_reader)

print (dic == dic_unpk)

In [None]:
file_name = 'classifier.pkl'

with open (file_name, 'wb') as pk_writer: 
    pickle.dump(pipeline_lnsv, pk_writer)
    
with open('classifier.pkl', 'rb') as pk_reader: #rb is for read+binary
    pipeline_lnsv = pickle.load(pk_reader)

In [None]:
pipeline_lnsv.predict(X_test)
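One detail worth keeping in mind when loading the pickled pipeline elsewhere: pickle stores functions by reference (module and name), not their code. Because the vectorizer holds the `dummy` function, the loading script needs a `dummy` importable under the same module path. A minimal sketch of the by-reference behaviour, using `math.sqrt` as a stand-in:

```python
import math
import pickle

payload = pickle.dumps(math.sqrt)

# the payload carries only the module and function names
stores_only_names = b'math' in payload and b'sqrt' in payload

# unpickling looks the function up again and returns the very same object
restored = pickle.loads(payload)
same_object = restored is math.sqrt

print(stores_only_names, same_object)
```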

## X. Use the notebook to complete `train.py`

Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

In [73]:
raise Exception('under development')

Exception: under development

In [None]:
#import packages
import sys
import math
import numpy as np
import udacourse2 #my library for this project!
import pandas as pd
from time import time

#SQLAlchemy toolkit
from sqlalchemy import create_engine
from sqlalchemy import pool
from sqlalchemy import inspect

#Machine Learning preparing/preprocessing toolkits
from sklearn.model_selection import train_test_split

#Machine Learning Feature Extraction tools
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#Machine Learning Classifiers
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

#Machine Learning Classifiers extra tools
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

#pickling tool
import pickle

#only a dummy function, as I pre-tokenize my data
def dummy(doc):
    return doc

#########1#########2#########3#########4#########5#########6#########7#########8
def load_data(data_file, 
              verbose=False):
    '''This function takes a path for a SQLite table and returns processed data
    for training a Machine Learning Classifier
    Inputs:
      - data_file (mandatory) - full path for SQLite table - text string
      - verbose (optional) - if you want some verbosity during the running 
        (default=False)
    Outputs:
      - X - tokenized text X-training - Pandas Series
      - y - y-multilabels 0|1 - Pandas Dataframe'''
    if verbose:
        print('###load_data function started')
    start = time()

    #1.read in file
    #importing SQLite to Pandas - load data from database
    engine = create_engine(data_file, poolclass=pool.NullPool) #, echo=True)
    #retrieving tables names from my DB
    inspector = inspect(engine)
    if verbose:
        print('existing tables in my SQLite database:', inspector.get_table_names())
    connection = engine.connect()
    df = pd.read_sql('SELECT * FROM Messages', con=connection)
    connection.close()
    df.name = 'df'
    
    #2.clean data
    #2.1.Elliminate rows with all-blank labels
    if verbose:
        print('all labels are blank in {} rows'.format(df[df['if_blank'] == 1].shape[0]))
    df = df[df['if_blank'] == 0]
    if verbose:
        print('remaining rows:', df.shape[0])
    #Verifying if removal was complete
    #(raise only when blank rows remain, whatever the verbosity)
    if df[df['if_blank'] == 1].shape[0] == 0:
        if verbose:
            print('removal complete!')
    else:
        raise Exception('something went wrong with rows removal before training')
            
    #2.2.Premature Tokenization Strategy (pre-tokenizer)
    #Pre-Tokenizer + not removing provisory tokenized column
    #inserting a tokenized column
    try:
        df = df.drop('tokenized', axis=1)
    except KeyError:
        if verbose:
            print('OK')
    df.insert(1, 'tokenized', np.nan)
    #tokenizing over the provisory
    df['tokenized'] = df.apply(lambda x: udacourse2.fn_tokenize_fast(x['message']), axis=1)
    #removing NaN over provisory (if it still exists)
    df = df[df['tokenized'].notnull()]
    empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
    if verbose:
        print('found {} rows with no tokens'.format(empty_tokens))
    df = df[df['tokenized'].apply(lambda x: len(x)) > 0]
    empty_tokens = df[df['tokenized'].apply(lambda x: len(x)) == 0].shape[0]
    if verbose:
        print('*after removal, found {} rows with no tokens'.format(empty_tokens))
    #I will drop the original 'message' column
    try:
        df = df.drop('message', axis=1)
    except KeyError:
        if verbose:
            print('OK')
    if verbose:
        print('now I have {} rows to train'.format(df.shape[0]))

    #2.3.Database Data Consistency Check/Fix
    #correction for aid_related
    df = udacourse2.fn_group_check(dataset=df,
                                   subset='aid',
                                   correct=True, 
                                   shrink=False, 
                                   shorten=False, 
                                   verbose=verbose)
    #correction for weather_related
    df = udacourse2.fn_group_check(dataset=df,
                                   subset='wtr',
                                   correct=True, 
                                   shrink=False, 
                                   shorten=False, 
                                   verbose=verbose)
    #correction for infrastructure_related
    df = udacourse2.fn_group_check(dataset=df,
                                   subset='ifr',
                                   correct=True, 
                                   shrink=False, 
                                   shorten=False, 
                                   verbose=verbose)
    #correction for related(considering that the earlier were already corrected)
    df = udacourse2.fn_group_check(dataset=df,
                                   subset='main',
                                   correct=True, 
                                   shrink=False, 
                                   shorten=False, 
                                   verbose=verbose)
    
    #load to database <-I don't know what this is for
    
    #3.Define features and label arrays (break the data)
    #3.1.X is the Training Text Column
    X = df['tokenized']
    #3.2.y is the Classification labels
    #I REMOVED "related" column from my labels, as it is impossible to train it!
    y = df[df.columns[5:]]
    remove_lst = []

    for column in y.columns:
        col = y[column]
        if (col == 0).all():
            if verbose:
                print('*{} -> only zeroes training column!'.format(column))
            remove_lst.append(column)
        else:
            #print('*{} -> column OK'.format(column))
            pass
        
    if verbose:
        print(remove_lst)
    y = y.drop(remove_lst, axis=1)
    
    spent = time() - start
    if verbose:
        print('*dataset split into X-Training Text Column and Y-Multilabels')
        print('process time:{:.0f} seconds'.format(spent))
    return X, y

#########1#########2#########3#########4#########5#########6#########7#########8
def build_model(verbose=False):
    '''This function builds the Classifier Pipeline, for future fitting
    Inputs:
      - verbose (optional) - if you want some verbosity during the running 
        (default=False)
    Output:
      - model_pipeline for your Classifier (untrained)
    '''
    if verbose:
        print('###build_model function started')
    start = time()
    
    #1.text processing and model pipeline
    #(text processing was made at a earlier step, at Load Data function)
    feats = TfidfVectorizer(analyzer='word', 
                            tokenizer=dummy, 
                            preprocessor=dummy,
                            token_pattern=None,
                            ngram_range=(1, 3))
    
    classif = OneVsRestClassifier(LinearSVC(C=2., 
                                            random_state=42))
    
    model_pipeline = Pipeline([('vect', feats),
                               ('clf', classif)])
    
    #define parameters for GridSearchCV (parameters already defined)
    #create gridsearch object and return as final model pipeline (made at pipeline preparation)
    #obs: for better performance, I pre-tokenized my data. And GridSearch was run on Jupyter,
    #     and the best parameters were adjusted, just to save processing time during code execution.
    spent = time() - start
    if verbose:
        print('*Linear Support Vector Machine pipeline was created')
        print('process time:{:.0f} seconds'.format(spent))
    return model_pipeline

#########1#########2#########3#########4#########5#########6#########7#########8
def train(X, 
          y, 
          model, 
          verbose=False):
    '''This function trains your already created Classifier Pipeline
    Inputs:
      - X (mandatory) - tokenized data for training - Pandas Series
      - y (mandatory) - Multilabels 0|1 - Pandas Dataframe
      - model (mandatory) - untrained Classifier Pipeline, as returned by
        build_model
      - verbose (optional) - if you want some verbosity during the running 
        (default=False)
    Output:
      - trained model'''
    if verbose:
        print('###train function started')
    start = time()

    #1.Train test split
    #Split makes randomization, so random_state parameter was set
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y, 
                                                        test_size=0.25, 
                                                        random_state=42)
    if (X_train.shape[0] + X_test.shape[0]) == X.shape[0]:
        if verbose:
            print('data split into train and test seems OK')
    else:
        raise Exception('something went wrong when splitting the data')
        
    #2.fit the model
    model.fit(X_train, y_train)
    
    #3.output model test results
    y_pred = model.predict(X_test)
    #verbosity of the report simply follows the function argument
    metrics = udacourse2.fn_scores_report2(y_test, 
                                           y_pred,
                                           best_10=True,
                                           verbose=verbose)

    for metric in metrics:
        if metric < 0.6:
            raise Exception('something is wrong, model is predicting poorly')

    spent = time() - start
    if verbose:
        print('*classifier was trained!')
        print('process time:{:.0f} seconds'.format(spent))
    return model

#########1#########2#########3#########4#########5#########6#########7#########8
def export_model(model,
                 file_name='classifier.pkl',
                 verbose=False):
    '''This function writes your already trained Classifier as a Pickle Binary
    file.
    Inputs:
      - model (mandatory) - your already trained Classifier - Python Object
      - file_name (optional) - the name of the file to be created (default:
         'classifier.pkl')
      - verbose (optional) - if you want some verbosity during the running 
        (default=False)
    Output: return True if everything runs OK
    ''' 
    if verbose:
        print('###export_model function started')
    start = time()

    #1.Export model as a pickle file

    #writing the file
    with open (file_name, 'wb') as pk_writer: 
        pickle.dump(model, pk_writer)

    #reading the file
    #with open('classifier.pkl', 'rb') as pk_reader:
    #    model = pickle.load(pk_reader)
    
    spent = time() - start
    if verbose:
        print('*trained Classifier was exported')
        print('process time:{:.0f} seconds'.format(spent))
        
    return True

#########1#########2#########3#########4#########5#########6#########7#########8
def run_pipeline(data_file='sqlite:///Messages.db', 
                 verbose=False):
    '''This function is a caller: it calls load, build, train and save modules
    Inputs:
      - data_file (optional) - complete path to the SQLite datafile to be 
        processed - (default='sqlite:///Messages.db')
      - verbose (optional) - if you want some verbosity during the running 
        (default=False)
    Output: return True if everything runs OK
    '''
    if verbose:
        print('###run_pipeline function started')
    start = time()

    #1.Run ETL pipeline
    X, y = load_data(data_file, 
                     verbose=verbose)
    #2.Build model pipeline
    model = build_model(verbose=verbose)
    #3.Train model pipeline
    model = train(X, 
                  y, 
                  model, 
                  verbose=verbose)
    # save the model
    export_model(model,
                 verbose=verbose)
    
    spent = time() - start
    if verbose:
        print('process time:{:.0f} seconds'.format(spent))
    return True

#########1#########2#########3#########4#########5#########6#########7#########8
def main():
    '''This is the main Machine Learning Pipeline function. It calls the other 
    ones, in the correct order.
    '''
    #get the database URL from the command line, else fall back to the default
    if len(sys.argv) > 1:
        data_file = sys.argv[1]
    else:
        data_file = 'sqlite:///Messages.db'
    run_pipeline(data_file=data_file,
                 verbose=True)

#########1#########2#########3#########4#########5#########6#########7#########8
if __name__ == '__main__':
    main()
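
The reading step that is commented out in `export_model` can be checked with a quick round trip. The sketch below is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in dictionary, not the trained classifier, and the file goes to a temporary folder instead of the real `classifier.pkl`.

```python
import os
import pickle
import tempfile

#hypothetical stand-in for the trained classifier (any picklable object works)
toy_model = {'name': 'toy_classifier', 'params': {'alpha': 0.5}}

#writing to a temporary folder, so no real model file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'classifier.pkl')

    #writing the file, as export_model does
    with open(file_name, 'wb') as pk_writer:
        pickle.dump(toy_model, pk_writer)

    #reading the file back
    with open(file_name, 'rb') as pk_reader:
        restored_model = pickle.load(pk_reader)

#the restored object is an equal copy, not the same object
print(restored_model == toy_model)
```

A pickle file should only be loaded when it comes from a trusted source, because unpickling can execute arbitrary code.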

`P@k` implementation [here](https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3)

"Given a list of actual classes and predicted classes, precision at k would be defined as the number of correct predictions considering only the top k elements of each class divided by k"

In [None]:
def patk(actual, pred, k):
	#we return 0 if k is 0 because 
	#   we can't divide the no of common values by 0 
	if k == 0:
		return 0

	#taking only the top k predictions in a class 
	k_pred = pred[:k]

	#taking the set of the actual values 
	actual_set = set(actual)

	#taking the set of the predicted values 
	pred_set = set(k_pred)

	#taking the intersection of the actual set and the pred set
	#   to find the common values
	common_values = actual_set.intersection(pred_set)

	return len(common_values)/len(k_pred)

In [None]:
#defining the values of the actual and the predicted class
y_true = [1 ,2, 0]
y_pred = [1, 1, 0]

In [None]:
if __name__ == "__main__":
    print(patk(y_true, y_pred,3))
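
To see how the result changes with `k`, the arithmetic for the example above can be traced by hand. This is a small sketch and not part of the linked article: `precision_at_k` is a hypothetical name that repeats the same set-based logic as `patk`, so the block runs on its own.

```python
def precision_at_k(actual, pred, k):
    #same set-based logic as patk above, repeated to keep this block self-contained
    if k == 0:
        return 0
    k_pred = pred[:k]
    common_values = set(actual).intersection(set(k_pred))
    return len(common_values) / len(k_pred)

y_true = [1, 2, 0]
y_pred = [1, 1, 0]

#k=1: top predictions [1],       common values {1},    1/1
#k=2: top predictions [1, 1],    common values {1},    1/2
#k=3: top predictions [1, 1, 0], common values {0, 1}, 2/3
for k in range(1, 4):
    print('P@{} = {:.4f}'.format(k, precision_at_k(y_true, y_pred, k)))
```

Because the comparison is done on sets, the repeated prediction `1` is counted only once, and the position of a label inside the top k does not matter.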

`AP@k` implementation [here](https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3)

"It is defined as the average of all the precision at k for k =1 to k"

In [None]:
import numpy as np
#patk is defined in the P@k cell above, so no separate pk module is needed

In [None]:
def apatk(actual, pred, k):
	#creating a list for storing the values of precision for each k 
	precision_ = []
	for i in range(1, k+1):
		#calculating the precision at different values of k 
		#      and appending them to the list 
		precision_.append(patk(actual, pred, i))

	#return 0 if there are no values in the list
	if len(precision_) == 0:
		return 0 

	#returning the average of all the precision values
	return np.mean(precision_)

In [None]:
#defining the values of the actual and the predicted class
y_true = [[1,2,0,1], [0,4], [3], [1,2]]
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

In [None]:
if __name__ == "__main__":
	for i in range(len(y_true)):
		for j in range(1, 4):
			print(
				f"""
				y_true = {y_true[i]}
				y_pred = {y_pred[i]}
				AP@{j} = {apatk(y_true[i], y_pred[i], k=j)}
				"""
			)

`MAP@k` implementation [here](https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3)

"The average of all the values of `AP@k` over the whole training data is known as `MAP@k`. This helps us give an accurate representation of the accuracy of whole prediction data"

In [None]:
import numpy as np
#apatk is defined in the AP@k cell above, so no separate apk module is needed

In [None]:
def mapk(actual, pred, k):

	#creating a list for storing the Average Precision Values
	average_precision = []
	#iterating through the whole data and calculating the apk for each 
	for i in range(len(actual)):
		average_precision.append(apatk(actual[i], pred[i], k))

	#returning the mean of all the data
	return np.mean(average_precision)

In [None]:
#defining the values of the actual and the predicted class
y_true = [[1,2,0,1], [0,4], [3], [1,2]]
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

In [None]:
if __name__ == "__main__":
    print(mapk(y_true, y_pred,3))
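
The chain from `P@k` to `AP@k` to `MAP@k` can be followed on the same sample data. The block below is a self-contained sketch using only plain Python. The names `precision_at_k` and `average_precision_at_k` are hypothetical and simply mirror the logic of `patk` and `apatk`.

```python
def precision_at_k(actual, pred, k):
    #set-based precision over the top k predictions, as in patk
    if k == 0:
        return 0
    k_pred = pred[:k]
    return len(set(actual).intersection(set(k_pred))) / len(k_pred)

def average_precision_at_k(actual, pred, k):
    #mean of P@1 ... P@k, as in apatk
    precisions = [precision_at_k(actual, pred, i) for i in range(1, k + 1)]
    if len(precisions) == 0:
        return 0
    return sum(precisions) / len(precisions)

y_true = [[1,2,0,1], [0,4], [3], [1,2]]
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

#one AP@3 value per sample, then their mean gives MAP@3
ap_values = [average_precision_at_k(a, p, 3) for a, p in zip(y_true, y_pred)]
map_value = sum(ap_values) / len(ap_values)

for a, p, ap in zip(y_true, y_pred, ap_values):
    print('y_true = {}  y_pred = {}  AP@3 = {:.4f}'.format(a, p, ap))
print('MAP@3 = {:.4f}'.format(map_value))
```

For the first sample the three precisions are 1, 1/2 and 2/3, so its `AP@3` is about 0.7222. The third sample has no correct prediction and contributes 0 to the mean.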

`F1 Samples` implementation [here](https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3)

"This metric calculates the F1 score for each instance in the data and then calculates the average of the F1 scores"

In [None]:
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
def f1_sampled(actual, pred):
    #converting the multi-label classification to a binary output
    #fitting once on both lists, so the two matrices share the same columns
    mlb = MultiLabelBinarizer()
    mlb.fit(list(actual) + list(pred))
    actual = mlb.transform(actual)
    pred = mlb.transform(pred)

    #calculating the f1 score averaged over the samples
    f1 = f1_score(actual, pred, average = "samples")
    return f1

In [None]:
#defining the values of the actual and the predicted class
y_true = [[1,2,0,1], [0,4], [3], [1,2]]
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

In [None]:
if __name__ == "__main__":
    print(f1_sampled(y_true, y_pred))
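
The per-instance averaging can also be traced by hand, without scikit-learn. In this sketch the F1 of one instance is taken as `2 * |actual ∩ pred| / (|actual| + |pred|)`, which is the usual F1 formula written with label sets. `f1_per_sample` is a hypothetical helper, and returning 0 when both label sets are empty is an assumption made here.

```python
def f1_per_sample(actual, pred):
    #F1 of one instance, computed on the sets of labels
    actual_set, pred_set = set(actual), set(pred)
    denominator = len(actual_set) + len(pred_set)
    if denominator == 0:
        #assumption: an instance with no labels at all scores 0
        return 0.0
    return 2 * len(actual_set & pred_set) / denominator

y_true = [[1,2,0,1], [0,4], [3], [1,2]]
y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

#one F1 value per instance, then the plain average over the instances
scores = [f1_per_sample(a, p) for a, p in zip(y_true, y_pred)]
f1_samples = sum(scores) / len(scores)

print(scores)
print('F1 samples = {:.4f}'.format(f1_samples))
```

For this data the per-instance scores are 0.8, 0.5, 0 and 0.5, so the average is 0.45, which is the value `f1_sampled` is expected to return for the same lists.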