<img src="https://github.com/fluidml/fluidml/blob/main/logo/fluid_ml_logo.png" width=60 height=60 />

# **Tutorial: Text Classification using FluidML and Sklearn**
In this notebook, we'll go over some basics of FluidML and implement a complete ML pipeline that performs text classification.
Like most ML pipelines, it consists of several steps:
- Dataset collection
- Dataset pre-processing
- Featurization
- Training a classifier
- Evaluation of classifier

With FluidML, all of these steps can naturally be implemented as individual tasks and then later put together as a task graph/flow.

## **Setup**

**Note**: Because of limitations in how multiprocessing works inside Jupyter, we have to import the task definitions from a separate script.



```python
from fluidml.common import Task, Resource
from fluidml.swarm import Swarm
from fluidml.flow import Flow, GridTaskSpec, TaskSpec
from with_gs import DatasetFetchTask, PreProcessTask, TFIDFFeaturizeTask, GloveFeaturizeTask, TrainTask, EvaluateTask
from rich import print
```

## **1. Dataset Collection**

Let's use HuggingFace's [datasets](https://huggingface.co/datasets) repository to quickly get access to a text classification dataset. Specifically, we will use [TREC](https://huggingface.co/datasets/trec), a question classification dataset containing ~5k labeled questions in the training set and ~500 questions in the test set. The dataset contains two types of labels: fine and coarse. For simplicity, let's go with the coarse labels, which cover 6 unique classes.

We can implement dataset collection as a task of its own by inheriting from FluidML's Task class. The task just has to implement a run() method that returns its results as a dictionary.
For simplicity, we return a nested dictionary: the outer level holds the splits, and each split holds a list of sentences and a list of labels.

Here is the complete implementation of DatasetFetchTask:


```python
from typing import Any, Dict, List, Tuple

from datasets import load_dataset


class DatasetFetchTask(Task):
    def __init__(self, name: str, id_: int):
        super().__init__(name, id_)

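    def _get_split(self, dataset, split):
        if split == "test":
            return dataset[split]
        elif split in ["train", "val"]:
            # Carve the validation set out of the original training split:
            # the first 70% of the items become "train", the remaining 30% "val".
            train_items = list(dataset["train"])
            split_index = int(0.7 * len(train_items))
            return train_items[:split_index] if split == "train" else train_items[split_index:]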

    def _get_sentences_and_labels(self, dataset) -> Tuple[List[str], List[str]]:
        sentences = []
        labels = []
        for item in dataset:
            sentences.append(item["text"])
            labels.append(item["label-coarse"])
        return sentences, labels

    def run(self, results: Dict[str, Any], resource: Resource):
        dataset = load_dataset("trec")
        splits = ["train", "val", "test"]
        task_results = {}
        for split in splits:
            dataset_split = self._get_split(dataset, split)
            sentences, labels = self._get_sentences_and_labels(dataset_split)
            split_results = {
                "sentences": sentences,
                "labels": labels
            }
            task_results[split] = split_results
        return task_results

```
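
The `_get_split` helper above slices the original training data in dataset order: the first 70% of the items stay in `train` and the remaining 30% become `val`, with no shuffling. A minimal stand-alone sketch of that slicing follows. The `split_train_val` helper and the placeholder strings are assumptions made for illustration, not part of FluidML or TREC.

```python
from typing import List, Tuple


def split_train_val(items: List[str], train_fraction: float = 0.7) -> Tuple[List[str], List[str]]:
    """Slice a list into a leading train part and a trailing val part (no shuffling)."""
    split_index = int(train_fraction * len(items))
    return items[:split_index], items[split_index:]


# Placeholder items standing in for real questions
questions = [f"question {i}" for i in range(100)]
train_items, val_items = split_train_val(questions)

print(len(train_items), len(val_items))
print(train_items[-1], "|", val_items[0])
```

Because the order is preserved, the validation set is always the tail of the training data. If the source data happened to be sorted by label, shuffling before slicing would be worth considering.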

## **2. Dataset Pre-processing:**
Now that we have our raw datasets prepared, we can apply some pre-processing to clean them up a bit, such as *removing punctuation*, *removing digits* (the implementation below replaces them with a `<num>` token) and *lower-casing*.
We will implement this logic in a PreProcessTask, which takes the list of pre-processing steps to apply as a parameter.

**Note:** This task assumes that the raw sentences are available to it through the `results` dictionary passed to its run() method. FluidML ensures this automatically.
Therefore, PreProcessTask only has to implement its own logic and return the pre-processed sentences as results (**separation of concerns**).

Here is the complete implementation of PreProcessTask:

```python
import re
import string
from typing import Any, Dict, List


class PreProcessTask(Task):
    def __init__(self, name: str, id_: int, pre_processing_steps: List[str]):
        super().__init__(name, id_)
        self._pre_processing_steps = pre_processing_steps

    def _pre_process(self, text: str) -> str:
        pre_processed_text = text
        for step in self._pre_processing_steps:
            if step == "lower_case":
                pre_processed_text = pre_processed_text.lower()
            if step == "remove_punct":
                pre_processed_text = pre_processed_text.translate(
                    str.maketrans('', '', string.punctuation))
            if step == "remove_digits":
                pre_processed_text = re.sub(
                    r"\d+", "<num>", pre_processed_text)
        return pre_processed_text

    def run(self, results: Dict[str, Any], resource: Resource):
        task_results = {}
        for split in ["train", "val", "test"]:
            pre_processed_sentences = [
                self._pre_process(sentence) for sentence in results["dataset"]["result"][split]["sentences"]]
            task_results[split] = {"sentences": pre_processed_sentences}
        return task_results
```
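
Since the steps are applied in the order given, the order of `pre_processing_steps` can change the output. The sketch below copies the step logic into a stand-alone function (named `pre_process` here for illustration) and runs it on a made-up sentence. When `remove_punct` runs after `remove_digits`, the angle brackets of the `<num>` token are stripped as punctuation.

```python
import re
import string
from typing import List


def pre_process(text: str, steps: List[str]) -> str:
    """Apply the named steps in the order given (same logic as PreProcessTask._pre_process)."""
    for step in steps:
        if step == "lower_case":
            text = text.lower()
        if step == "remove_punct":
            text = text.translate(str.maketrans("", "", string.punctuation))
        if step == "remove_digits":
            text = re.sub(r"\d+", "<num>", text)
    return text


sample = "Who won the World Cup in 1998?"  # made-up example sentence

punct_then_digits = pre_process(sample, ["lower_case", "remove_punct", "remove_digits"])
digits_then_punct = pre_process(sample, ["lower_case", "remove_digits", "remove_punct"])

print(punct_then_digits)   # who won the world cup in <num>
print(digits_then_punct)   # who won the world cup in num
```

Listing `remove_punct` before `remove_digits` keeps the `<num>` token intact.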

## **3. Featurization:**

Now that we have our datasets prepared and sentences pre-processed, we can convert them to numerical vectors that can be fed to classifiers. To this end, we implement two featurizers: a TF-IDF featurizer and a GloVe featurizer. They can be implemented as two independent tasks, each of which takes the pre-processed sentences and returns vectorized sentences.

You may have guessed the pattern already: each of these tasks takes its input from the results of PreProcessTask.
To fetch the results of PreProcessTask, we simply access `results["pre_process"]`, where *pre_process* is the task name given to the instance of PreProcessTask.
More on this later, when we model the flow/pipeline.
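
To make the lookup pattern concrete, here is a small sketch that uses a hand-built dictionary in place of what FluidML passes to `run()`. The nesting (`task name -> "result" -> split -> "sentences"`) mirrors how the tasks in this tutorial index into `results`. The sentences and the `sentences_for` helper are placeholders invented for illustration.

```python
from typing import Any, Dict, List

# Hand-built stand-in for the `results` argument of run()
# (placeholder data, not real FluidML output).
results: Dict[str, Any] = {
    "pre_process": {
        "result": {
            "train": {"sentences": ["what is fluidml", "who wrote it"]},
            "val": {"sentences": ["where is it hosted"]},
            "test": {"sentences": ["how do tasks connect"]},
        }
    }
}


def sentences_for(results: Dict[str, Any], split: str) -> List[str]:
    """Look up the pre-processed sentences of one split by predecessor task name."""
    return results["pre_process"]["result"][split]["sentences"]


for split in ["train", "val", "test"]:
    print(split, len(sentences_for(results, split)))
```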

```python
from typing import Any, Dict

import numpy as np
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from sklearn.feature_extraction.text import TfidfVectorizer

class TFIDFFeaturizeTask(Task):
    def __init__(self, name: str, id_: int, min_df: int, max_features: int):
        super().__init__(name, id_)
        self._min_df = min_df
        self._max_features = max_features

    def run(self, results: Dict[str, Any], resource: Resource):
        tfidf_model = TfidfVectorizer(
            min_df=self._min_df, max_features=self._max_features)
        tfidf_model.fit(results["pre_process"]["result"]["train"]["sentences"])
        task_results = {}
        for split in ["train", "val", "test"]:
            tfidf_vectors = tfidf_model.transform(
                results["pre_process"]["result"][split]["sentences"]).toarray()
            task_results[split] = {"vectors": tfidf_vectors}
        return task_results
```

```python
class GloveFeaturizeTask(Task):
    def __init__(self, name: str, id_: int):
        super().__init__(name, id_)

    def run(self, results: Dict[str, Any], resource: Resource):
        # Load the GloVe embeddings once and reuse them for every split
        embedder = DocumentPoolEmbeddings([WordEmbeddings("glove")])
        task_results = {}
        for split in ["train", "val", "test"]:
            sentences = [Sentence(sent)
                         for sent in results["pre_process"]["result"][split]["sentences"]]
            embedder.embed(sentences)
            glove_vectors = [sent.embedding.cpu().numpy()
                             for sent in sentences]
            glove_vectors = np.array(glove_vectors).reshape(
                len(glove_vectors), -1)
            task_results[split] = {"vectors": glove_vectors}
        return task_results
```

## **4. Training a classifier**

We are all set to train a simple classifier. For this tutorial, let's stick with a simple logistic regression model from Sklearn.
For the inputs, we can stack the GloVe and TF-IDF vectors (obtained from the results of the featurization tasks), and for the targets, we can use the labels obtained from DatasetFetchTask. At the end, this task returns a trained Sklearn classifier.

**Note:** You are not limited to Sklearn. You can train any kind of model using your favorite library (PyTorch, TensorFlow, Keras, PyTorch Lightning, etc.).

```python
from typing import Any, Dict

import numpy as np
from sklearn.linear_model import LogisticRegression

class TrainTask(Task):
    def __init__(self, name: str, id_: int, max_iter: int, balanced: bool):
        super().__init__(name, id_)
        self._max_iter = max_iter
        self._class_weight = "balanced" if balanced else None

    def run(self, results: Dict[str, Any], resource: Resource):
        model = LogisticRegression(
            max_iter=self._max_iter, class_weight=self._class_weight)
        stacked_vectors = np.hstack((results["tfidf_featurize"]["result"]["train"]["vectors"],
                                     results["glove_featurize"]["result"]["train"]["vectors"]))
        model.fit(stacked_vectors,
                  results["dataset"]["result"]["train"]["labels"])
        task_results = {
            "model": model
        }
        return task_results
```

## **5. Evaluation of classifier**

Now that we have trained a classifier, it is time to evaluate it on all of the dataset splits. This task is straightforward: we take the labels from the results of DatasetFetchTask, the vectors from the two featurization tasks, and the model from TrainTask. Finally, EvaluateTask returns a nested dictionary containing a classification report for each of the train, val and test splits.



```python
from typing import Any, Dict

import numpy as np
from sklearn.metrics import classification_report

class EvaluateTask(Task):
    def __init__(self, name: str, id_: int):
        super().__init__(name, id_)

    def run(self, results: Dict[str, Any], resource: Resource):
        task_results = {}
        for split in ["train", "val", "test"]:
            stacked_vectors = np.hstack((results["tfidf_featurize"]["result"][split]["vectors"],
                                         results["glove_featurize"]["result"][split]["vectors"]))
            predictions = results["train"]["result"]["model"].predict(
                stacked_vectors)
            report = classification_report(
                results["dataset"]["result"][split]["labels"], predictions, output_dict=True)
            task_results[split] = {"classification_report": report}
        return task_results
```

## **Creating Flow/Pipeline**

So far, we have implemented the individual pipeline steps using FluidML's Task class, which was very straightforward.
You might be wondering how to put these tasks together so that they work as a single pipeline.

Thanks to FluidML's TaskSpec API, you can connect these tasks like Lego blocks :)

### **Task Specifications**
TaskSpec is a simple wrapper that lets you specify task details, such as the task name and the task arguments, which are used when the task is instantiated.

```python
# create all task specs
dataset_fetch_task = TaskSpec(task=DatasetFetchTask, name="dataset")
pre_process_task = TaskSpec(
    task=PreProcessTask, name="pre_process",
    task_kwargs={"pre_processing_steps": ["lower_case", "remove_punct"]})
featurize_task_1 = TaskSpec(
    task=GloveFeaturizeTask, name="glove_featurize")
featurize_task_2 = TaskSpec(
    task=TFIDFFeaturizeTask, name="tfidf_featurize",
    task_kwargs={"min_df": 5, "max_features": 1000})
train_task = TaskSpec(
    task=TrainTask, name="train",
    task_kwargs={"max_iter": 50, "balanced": True})
evaluate_task = TaskSpec(task=EvaluateTask, name="evaluate")
```

### **Task Dependencies**
More importantly, TaskSpec also provides a `requires()` method to specify the predecessor tasks that must be executed before a particular task.

For instance, in our example, DatasetFetchTask has to finish before PreProcessTask can start. Similarly,
PreProcessTask is required by both featurization tasks. We can specify these dependencies on the TaskSpec objects using `requires()`.


```python
# dependencies between tasks
pre_process_task.requires([dataset_fetch_task])
featurize_task_1.requires([pre_process_task])
featurize_task_2.requires([pre_process_task])
train_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2])
evaluate_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2, train_task])
```
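
These dependencies form a directed acyclic graph. As a rough illustration of the execution order they imply, the sketch below feeds the same dependencies, written as plain task names, to Python's standard-library `graphlib` (Python 3.9+). It does not use FluidML, and FluidML's own scheduling may differ in detail. It only shows which tasks become ready together.

```python
from graphlib import TopologicalSorter

# Each task name maps to the set of tasks it requires (mirrors the requires() calls)
dependencies = {
    "dataset": set(),
    "pre_process": {"dataset"},
    "glove_featurize": {"pre_process"},
    "tfidf_featurize": {"pre_process"},
    "train": {"dataset", "glove_featurize", "tfidf_featurize"},
    "evaluate": {"dataset", "glove_featurize", "tfidf_featurize", "train"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect the tasks stage by stage: everything in one stage has all of its
# predecessors finished and could, in principle, run at the same time.
stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

The third stage contains both featurization tasks, which is why they can be processed concurrently.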

### **Final list of tasks**
We collect all these task specs in a list, which we will pass to FluidML.

```python
# all tasks
tasks = [dataset_fetch_task,
         pre_process_task,
         featurize_task_1, featurize_task_2,
         train_task,
         evaluate_task]
```

### **Swarm & Flow**
Now that we have the final list of tasks ready to be run, we just have to create two objects:

- **Swarm:** contains several workers that run these tasks in parallel. In our example, featurize_task_1 and featurize_task_2 are independent of each other and can be executed concurrently.
- **Flow:** builds tasks from the provided task specifications and creates a task graph, which is processed by the Swarm.


```python
with Swarm(n_dolphins=2,
           refresh_every=10,
           return_results=True) as swarm:
    flow = Flow(swarm=swarm)
    results = flow.run(tasks)
print(results["evaluate"]["result"]["test"])
```

# Grid Search (TBD)

# AutoML (TBD)
