# Text Classification: Multi-Class

## Background
This notebook tutorial is derived from [a classification example](https://github.com/asyml/forte/tree/master/examples/classification).
Given a table-like CSV file in which some columns hold the input text and one column holds the label, we set up a text classification pipeline below. The example also shows how to wrap external library classes and methods in a `PipelineComponent`.

## Inference Workflow

### Pipeline
* [Pipeline setup](https://github.com/asyml/forte/blob/master/examples/classification/bank_customer_intent.py#L123)

* The pipeline has one reader, `ClassificationDatasetReader`, and two processors,
`NLTKSentenceSegmenter` and `ZeroShotClassifier`.


### Reader
* [ClassificationDatasetReader](https://github.com/asyml/forte/blob/7dc6e6c7d62d9a4126bdfc5ca02d15be3ffd61ca/forte/data/readers/classification_reader.py#L26)
    * `set_up()`: checks that the configuration is valid. For example, `skip_k_starting_lines` must not be negative, since skipping a negative number of lines makes no sense. It also converts the values in the label column to digits.
    * `_collect()`: reads rows from the CSV file and returns an iterator that yields the line id and the line data.
    * `_cache_key_function()`: uses the line id as the cache key.
    * `_parse_pack()`: parses the data from the iterator returned by `_collect()` and loads it into the data pack.
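
The row-collection step can be sketched in plain Python. This is only an illustration of the idea described above, not the reader's actual code: `collect_rows` is a hypothetical helper, and the two sample rows reuse sentences from the sample data shown later in this notebook.

```python
import csv
import io
from typing import Iterator, List, Tuple


def collect_rows(
    csv_file, skip_k_starting_lines: int = 0
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line_id, row) pairs, skipping the first k lines.

    Hypothetical helper that mimics the behaviour described for `_collect()`.
    """
    if skip_k_starting_lines < 0:
        raise ValueError("skip_k_starting_lines must not be negative")
    for line_id, row in enumerate(csv.reader(csv_file)):
        if line_id < skip_k_starting_lines:
            continue  # e.g. a header line holding column names
        yield line_id, row


# In-memory stand-in for a CSV file with one header line.
sample = io.StringIO(
    "text,category\n"
    "How long does a card delivery take?,card_arrival\n"
    "still waiting on my new card,card_arrival\n"
)
rows = list(collect_rows(sample, skip_k_starting_lines=1))
for line_id, row in rows:
    print(line_id, row)
```

Because the line id comes from `enumerate`, it stays unique per row, which is what makes it usable as a cache key.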



### Processor
In this example, we want to classify data sentence by sentence so we wrapped `nltk.PunktSentenceTokenizer` in [NLTKSentenceSegmenter](https://github.com/asyml/forte-wrappers/blob/80cfe19926c0596edd13985581e8ca01a7be86ad/src/nltk/fortex/nltk/nltk_processors.py#L247) to segment sentences. 

* `_process()`: splits the data pack text into sentence spans.
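
To make "sentence spans" concrete, here is a minimal sketch that computes `(begin, end)` character offsets for each sentence. It deliberately swaps in a naive regular expression for the Punkt tokenizer, so it only illustrates the shape of the output; `sentence_spans` is a hypothetical helper and not part of Forte or NLTK.

```python
import re
from typing import List, Tuple


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) offsets of sentences, using a naive regex.

    A "sentence" here starts at a non-space, non-terminator character and
    runs up to and including any trailing '.', '!' or '?'. Punkt handles
    abbreviations and other cases that this pattern does not.
    """
    pattern = re.compile(r"[^.!?\s][^.!?]*[.!?]*")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


text = "How do I locate my card? i have not received my card"
spans = sentence_spans(text)
for begin, end in spans:
    # Each span indexes back into the original text.
    print((begin, end), repr(text[begin:end]))
```

Storing offsets rather than copied strings is what lets a later processor attach its results to the same region of text.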



We then need a model to do the classification. We wrap `transformers.pipeline` in
[Huggingface ZeroShotClassifier](https://github.com/asyml/forte-wrappers/blob/main/src/huggingface/fortex/huggingface/zero_shot_classifier.py).

* `_process()`: runs the classifier over the data pack and writes the prediction results back to it.

`ZeroShotClassifier` and `NLTKSentenceSegmenter` both inherit from `PackProcessor` because each processes one `DataPack` at a time. A processor that handles one `MultiPack` at a time would inherit from `MultiPackProcessor` instead.

## Imports

In [2]:
import sys
import csv
from termcolor import colored

from forte import Pipeline
from forte.data.readers import ClassificationDatasetReader
from fortex.nltk import NLTKSentenceSegmenter
from fortex.huggingface import ZeroShotClassifier
from ft.onto.base_ontology import Sentence
import pandas as pd

## Dataset

Banking77 is a multi-class dataset with 77 classes, each a fine-grained intent in the banking domain.
The training data can be downloaded from [train.csv](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv) and the test data from [test.csv](https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv).

## Visualize data

In [3]:
csv_path = "../../data_samples/banking77/sample.csv"
df = pd.read_csv(csv_path)
df.sample(5)

| Unnamed: 0 | text | category |
|---|---|---|
| 12 | I still don't have my card after 2 weeks. Wha... | card_arrival |
| 11 | How long does a card delivery take? | card_arrival |
| 15 | I have been waiting longer than expected for m... | card_arrival |
| 1 | I still have not received my new card, I order... | card_arrival |
| 13 | still waiting on my new card | card_arrival |


## Reader Configuration
`ClassificationDatasetReader` is designed to read table-like classification datasets. It currently supports only `csv` files, a common format for such data. To use the reader correctly, the user needs to inspect the dataset and configure the reader accordingly. We use the Banking77 dataset as an example throughout the explanation.
* The user needs to check the column names of the dataset. Suppose a dataset has the columns `[label, title, content]`. The first column holds the labels, and the second and third columns can both serve as input text. We can therefore set `forte_data_fields` to `['label', 'ft.onto.base_ontology.Title', 'ft.onto.base_ontology.Body']`, with one element for each column of the dataset. `label` is a keyword that tells the reader which column holds the label. `'ft.onto.base_ontology.Title'` and `'ft.onto.base_ontology.Body'` are two Forte data entries that wrap the input text. If the dataset contains columns the user does not want to use, the corresponding elements of `forte_data_fields` can be set to `None` so that the reader skips them.
* The user also needs to check how many classes the dataset has in order to configure `index2class`, a dictionary that maps zero-based indices to class names. For a dataset with many classes, such as Banking77, the user can store the class names in a list `class_names` and then set `index2class` to `dict(enumerate(class_names))`.
* The user needs to check whether the first line of the dataset contains column names rather than input data. If so, set `skip_k_starting_lines` to `1` to skip that line. Otherwise, `skip_k_starting_lines` defaults to `0`, which means the first line is not skipped. To skip several lines, set `skip_k_starting_lines` to the number of lines to skip.
* In some datasets the labels are digits rather than text. In that case the user needs to set `digit_label` to `True`. If the digit labels start from `1`, the user also needs to set `one_based_index_label` to `True`.
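
As a sketch of how the options in the list above fit together, consider a hypothetical dataset with the columns `[label, title, content, url]`, digit labels `1` and `2`, and a header line. The class names, the fourth column, and the `label_to_class` helper are all assumptions made for illustration; `text_fields` simply mirrors the key used in this notebook's own config cell, and the helper shows one plausible reading of how `digit_label` and `one_based_index_label` interact, not the reader's actual code.

```python
# Hypothetical dataset: columns [label, title, content, url],
# labels are the digits 1 and 2, first line holds the column names.
example_class_names = ["negative", "positive"]  # assumed class names

example_reader_config = {
    "forte_data_fields": [
        "label",
        "ft.onto.base_ontology.Title",
        "ft.onto.base_ontology.Body",
        None,  # unused fourth column, skipped by the reader
    ],
    "index2class": dict(enumerate(example_class_names)),
    "text_fields": [
        "ft.onto.base_ontology.Title",
        "ft.onto.base_ontology.Body",
    ],
    "digit_label": True,
    "one_based_index_label": True,
    "skip_k_starting_lines": 1,
}


def label_to_class(raw_label: str, config: dict) -> str:
    """Illustrative helper (not part of Forte): resolve a digit label."""
    index = int(raw_label)
    if config["one_based_index_label"]:
        index -= 1  # shift 1-based labels to zero-based indices
    return config["index2class"][index]


print(label_to_class("1", example_reader_config))
print(label_to_class("2", example_reader_config))
```

The names are prefixed with `example_` so that running this cell does not overwrite the `class_names` and reader config used for Banking77.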


## List of class names

In [4]:
class_names = [
    "activate_my_card",
    "age_limit",
    "apple_pay_or_google_pay",
    "atm_support",
    "automatic_top_up",
    "balance_not_updated_after_bank_transfer",
    "balance_not_updated_after_cheque_or_cash_deposit",
    "beneficiary_not_allowed",
    "cancel_transfer",
    "card_about_to_expire",
    "card_acceptance",
    "card_arrival",
    "card_delivery_estimate",
    "card_linking",
    "card_not_working",
    "card_payment_fee_charged",
    "card_payment_not_recognised",
    "card_payment_wrong_exchange_rate",
    "card_swallowed",
    "cash_withdrawal_charge",
    "cash_withdrawal_not_recognised",
    "change_pin",
    "compromised_card",
    "contactless_not_working",
    "country_support",
    "declined_card_payment",
    "declined_cash_withdrawal",
    "declined_transfer",
    "direct_debit_payment_not_recognised",
    "disposable_card_limits",
    "edit_personal_details",
    "exchange_charge",
    "exchange_rate",
    "exchange_via_app",
    "extra_charge_on_statement",
    "failed_transfer",
    "fiat_currency_support",
    "get_disposable_virtual_card",
    "get_physical_card",
    "getting_spare_card",
    "getting_virtual_card",
    "lost_or_stolen_card",
    "lost_or_stolen_phone",
    "order_physical_card",
    "passcode_forgotten",
    "pending_card_payment",
    "pending_cash_withdrawal",
    "pending_top_up",
    "pending_transfer",
    "pin_blocked",
    "receiving_money",
    "Refund_not_showing_up",
    "request_refund",
    "reverted_card_payment?",
    "supported_cards_and_currencies",
    "terminate_account",
    "top_up_by_bank_transfer_charge",
    "top_up_by_card_charge",
    "top_up_by_cash_or_cheque",
    "top_up_failed",
    "top_up_limits",
    "top_up_reverted",
    "topping_up_by_card",
    "transaction_charged_twice",
    "transfer_fee_charged",
    "transfer_into_account",
    "transfer_not_received_by_recipient",
    "transfer_timing",
    "unable_to_verify_identity",
    "verify_my_identity",
    "verify_source_of_funds",
    "verify_top_up",
    "virtual_card_not_working",
    "visa_or_mastercard",
    "why_verify_identity",
    "wrong_amount_of_cash_received",
    "wrong_exchange_rate_for_cash_withdrawal",
]

## Converting class names into numerical values

In [6]:
index2class = dict(enumerate(class_names))
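
For a quick look at what this mapping contains, the sketch below builds it for the first three class names only and derives the reverse lookup. `subset`, `subset_index2class` and `class2index` are names introduced here for illustration, chosen so they do not clash with the notebook's own variables.

```python
subset = ["activate_my_card", "age_limit", "apple_pay_or_google_pay"]

# enumerate() pairs each name with its zero-based position.
subset_index2class = dict(enumerate(subset))
print(subset_index2class)
# {0: 'activate_my_card', 1: 'age_limit', 2: 'apple_pay_or_google_pay'}

# Inverting the dictionary gives a class-name-to-index lookup.
class2index = {name: index for index, name in subset_index2class.items()}
print(class2index["age_limit"])  # 1
```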

## Initialize reader config

In [7]:
this_reader_config = {
    "forte_data_fields": [
        "ft.onto.base_ontology.Body",
        "label",
    ],
    "index2class": index2class,
    "text_fields": [
        "ft.onto.base_ontology.Body"
    ],
    "digit_label": False,
    "one_based_index_label": False,
}

## Initialize the pipeline 

In [8]:
pl = Pipeline()
pl.set_reader(ClassificationDatasetReader(), config=this_reader_config)
pl.add(NLTKSentenceSegmenter())
pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
pl.initialize()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bhaskarrao/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<forte.pipeline.Pipeline at 0x7f7b9a1afbb0>

## Predict the classification result of each sentence in the CSV file

In [9]:
for pack in pl.process_dataset(csv_path):
    for sent in pack.get(Sentence):
        sent_text = sent.text
        print(colored("Sentence:", "red"), sent_text, "\n")
        print(colored("Prediction:", "blue"), sent.classification)

Sentence: How do I locate my card?

Prediction: {'lost_or_stolen_card': 0.4788, 'compromised_card': 0.3407, 'get_physical_card': 0.3214, 'card_linking': 0.2053, 'card_acceptance': 0.1595, 'passcode_forgotten': 0.1424, 'getting_spare_card': 0.1394, 'order_physical_card': 0.1389, 'reverted_card_payment?': 0.1371, 'getting_virtual_card': 0.1054, 'virtual_card_not_working': 0.066, 'why_verify_identity': 0.0623, 'card_not_working': 0.0616, 'get_disposable_virtual_card': 0.061, 'card_swallowed': 0.0603, 'supported_cards_and_currencies': 0.06, 'contactless_not_working': 0.0593, 'activate_my_card': 0.0576, 'verify_my_identity': 0.056, 'card_payment_not_recognised': 0.055, 'card_about_to_expire': 0.0426, 'visa_or_mastercard': 0.0393, 'pending_card_payment': 0.0384, 'card_arrival': 0.0375, 'cash_withdrawal_not_recognised': 0.0372, 'unable_to_verify_identity': 0.0321, 'declined_card_payment': 0.0294, 'card_delivery_estimate': 0.0261, 'declined_cash_withdrawal': 0.025, 'direct_d

Sentence: Is there a way to know when my card will arrive?

Prediction: {'card_delivery_estimate': 0.2742, 'card_not_working': 0.1729, 'order_physical_card': 0.0868, 'card_acceptance': 0.0861, 'getting_spare_card': 0.0844, 'compromised_card': 0.0785, 'reverted_card_payment?': 0.0733, 'pending_card_payment': 0.0688, 'card_arrival': 0.0688, 'card_linking': 0.0637, 'virtual_card_not_working': 0.0618, 'getting_virtual_card': 0.0604, 'get_physical_card': 0.0568, 'pending_transfer': 0.0538, 'declined_transfer': 0.0476, 'transfer_timing': 0.0437, 'card_swallowed': 0.0372, 'pending_cash_withdrawal': 0.0326, 'unable_to_verify_identity': 0.0311, 'contactless_not_working': 0.0304, 'pin_blocked': 0.028, 'fiat_currency_support': 0.0229, 'supported_cards_and_currencies': 0.0216, 'activate_my_card': 0.021, 'failed_transfer': 0.0179, 'receiving_money': 0.016, 'get_disposable_virtual_card': 0.0147, 'passcode_forgotten': 0.0145, 'lost_or_stolen_card': 0.0143, 'visa_or_mastercard': 0.0

Sentence: i have not received my card

Prediction: {'pending_card_payment': 0.9204, 'pending_transfer': 0.8726, 'failed_transfer': 0.8579, 'card_payment_not_recognised': 0.8114, 'card_not_working': 0.7187, 'card_delivery_estimate': 0.6956, 'declined_transfer': 0.627, 'pending_cash_withdrawal': 0.5498, 'pin_blocked': 0.394, 'transfer_not_received_by_recipient': 0.3481, 'contactless_not_working': 0.3406, 'declined_card_payment': 0.3279, 'lost_or_stolen_card': 0.3207, 'pending_top_up': 0.3149, 'passcode_forgotten': 0.3072, 'virtual_card_not_working': 0.3023, 'direct_debit_payment_not_recognised': 0.2902, 'card_swallowed': 0.2836, 'reverted_card_payment?': 0.1706, 'cash_withdrawal_not_recognised': 0.1639, 'unable_to_verify_identity': 0.1463, 'get_physical_card': 0.1376, 'Refund_not_showing_up': 0.1242, 'getting_spare_card': 0.1046, 'card_acceptance': 0.1002, 'declined_cash_withdrawal': 0.0727, 'order_physical_card': 0.0627, 'balance_not_updated_after_bank_transfer': 0.05

Sentence: How long does a card delivery take?

Prediction: {'card_delivery_estimate': 0.3711, 'card_acceptance': 0.0781, 'get_physical_card': 0.0391, 'card_arrival': 0.0335, 'getting_spare_card': 0.0311, 'order_physical_card': 0.0302, 'card_swallowed': 0.0287, 'reverted_card_payment?': 0.0234, 'card_linking': 0.0211, 'get_disposable_virtual_card': 0.0192, 'getting_virtual_card': 0.0183, 'pending_card_payment': 0.0179, 'supported_cards_and_currencies': 0.0165, 'compromised_card': 0.0165, 'passcode_forgotten': 0.0134, 'topping_up_by_card': 0.0125, 'transfer_timing': 0.0124, 'pin_blocked': 0.0114, 'virtual_card_not_working': 0.0104, 'disposable_card_limits': 0.01, 'card_about_to_expire': 0.0094, 'activate_my_card': 0.0093, 'card_payment_fee_charged': 0.0086, 'edit_personal_details': 0.0078, 'card_not_working': 0.0077, 'declined_card_payment': 0.0076, 'verify_my_identity': 0.007, 'visa_or_mastercard': 0.0069, 'unable_to_verify_identity': 0.006, 'declined_transfer': 0.005

Sentence: I am still waiting for my card after 1 week.

Prediction: {'failed_transfer': 0.8337, 'pending_card_payment': 0.7525, 'card_not_working': 0.6236, 'pending_transfer': 0.6071, 'pin_blocked': 0.5322, 'card_delivery_estimate': 0.4697, 'card_acceptance': 0.3969, 'declined_transfer': 0.3573, 'pending_cash_withdrawal': 0.3484, 'transfer_not_received_by_recipient': 0.3238, 'reverted_card_payment?': 0.3207, 'top_up_failed': 0.2695, 'unable_to_verify_identity': 0.227, 'get_physical_card': 0.2269, 'virtual_card_not_working': 0.2238, 'order_physical_card': 0.2205, 'getting_spare_card': 0.2148, 'card_payment_not_recognised': 0.2092, 'card_swallowed': 0.1938, 'activate_my_card': 0.1805, 'transfer_timing': 0.1632, 'pending_top_up': 0.163, 'getting_virtual_card': 0.1603, 'exchange_charge': 0.1424, 'Refund_not_showing_up': 0.1388, 'card_linking': 0.1335, 'lost_or_stolen_card': 0.1296, 'balance_not_updated_after_bank_transfer': 0.1247, 'declined_card_payment': 0.1206, 'conta

Sentence: Why hasn't my card been delivered?

Prediction: {'order_physical_card': 0.3274, 'card_not_working': 0.3091, 'reverted_card_payment?': 0.2668, 'card_delivery_estimate': 0.2109, 'get_physical_card': 0.2102, 'getting_spare_card': 0.1794, 'pending_card_payment': 0.1527, 'card_acceptance': 0.1496, 'card_swallowed': 0.1089, 'failed_transfer': 0.096, 'compromised_card': 0.0908, 'request_refund': 0.0841, 'card_payment_not_recognised': 0.0724, 'virtual_card_not_working': 0.0644, 'card_linking': 0.0592, 'Refund_not_showing_up': 0.0575, 'getting_virtual_card': 0.0557, 'transfer_not_received_by_recipient': 0.0532, 'unable_to_verify_identity': 0.0456, 'activate_my_card': 0.0453, 'direct_debit_payment_not_recognised': 0.0389, 'pending_transfer': 0.0387, 'supported_cards_and_currencies': 0.0383, 'contactless_not_working': 0.0378, 'lost_or_stolen_card': 0.0374, 'card_arrival': 0.0362, 'get_disposable_virtual_card': 0.0356, 'declined_transfer': 0.0329, 'pin_blocked': 0.0324
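
Each prediction printed above is a dictionary with a score for every candidate label, so a common next step is to keep only the highest-scoring labels. The sketch below assumes the value behaves like a plain `dict` of label to score, as the printed output suggests; `top_k` is a hypothetical helper, and the four scores are copied from the output for "How long does a card delivery take?".

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# First four scores printed for "How long does a card delivery take?"
scores = {
    "card_delivery_estimate": 0.3711,
    "card_acceptance": 0.0781,
    "get_physical_card": 0.0391,
    "card_arrival": 0.0335,
}
best_label, best_score = top_k(scores, k=1)[0]
print(best_label, best_score)  # card_delivery_estimate 0.3711
```

Inside the prediction loop, the same helper could be called as `top_k(sent.classification)`. Note that the printed scores for a sentence can add up to more than 1 (for example, 0.9204 and 0.8726 for "i have not received my card"), so they are better read as independent per-label scores than as a probability distribution over the 77 classes.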