<img src="https://raw.githubusercontent.com/fluidml/fluidml/main/logo/fluid_ml_logo.png" width=500 height=500 />

# **Text Classification using FluidML and Sklearn**
In this notebook, we go over some basics of FluidML and implement a complete ML pipeline that performs text classification.
Like most ML pipelines, it consists of several tasks:
- **Dataset fetching** - Downloads and parses the dataset for text classification
- **Dataset pre-processing** - Pre-processes the raw dataset
- **Featurization** - Converts the raw sentences to numerical vectors
- **Training a classifier** - Trains a logistic regression model
- **Evaluation of classifier** - Evaluates the trained model on the train/val/test splits

With FluidML, each of these steps is implemented as an individual task that registers its dependencies, and the tasks are chained together into a task graph. FluidML then executes this graph in parallel and returns all results at the end.
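
To make the idea of a task graph concrete, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+). It is not how FluidML is implemented; the short task names are made up for illustration. It only shows how registered dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Map each (hypothetical) task name to the set of tasks it depends on.
dependencies = {
    "fetch": set(),
    "preprocess": {"fetch"},
    "featurize": {"preprocess"},
    "train": {"fetch", "featurize"},
    "evaluate": {"fetch", "featurize", "train"},
}


def execution_order(deps):
    """Return an order in which every task runs after all of its predecessors."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks that do not depend on each other have no fixed relative order, and those are the ones a scheduler can run concurrently.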

## **Setup**

To run this example, make sure to install FluidML with the additional example requirements. You can do this by `cd`-ing into the cloned fluidml directory and running:
```bash 
pip install .[examples]
```


**Note**: Because of limitations in how multiprocessing works inside Jupyter, we have to import our task definitions and some helper classes from a separate script. Our tasks are therefore defined in `sklearn_text_classification.py`, which implements not only the tasks but the entire functionality of this example, so interested readers can also run that script directly. To keep this notebook self-explanatory, we include a Markdown code snippet of each task implementation at the place where we would otherwise have defined the task.


In [10]:
from fluidml.common import Task, Resource
from fluidml.swarm import Swarm
from fluidml.flow import Flow, GridTaskSpec, TaskSpec
from sklearn_text_classification import DatasetFetchTask, PreProcessTask, TFIDFFeaturizeTask, GloveFeaturizeTask, TrainTask, EvaluateTask, ModelSelectionTask
from rich import print

## **Task Definitions**

### **1. Dataset Fetching**

Let's use HuggingFace's [datasets](https://huggingface.co/datasets) repository to quickly get access to a text classification dataset. Specifically, we will use [TREC](https://huggingface.co/datasets/trec), a question classification dataset containing ~5k labeled questions in the training set and ~500 questions in the test set. The dataset contains two types of labels: fine and coarse. For simplicity, let's go ahead with the coarse labels, of which there are 6.

We can implement this dataset collection as a task of its own by inheriting from FluidML's Task class. It just has to implement a run() method and publish its results.
For simplicity, this task publishes a nested dictionary: the outer level holds the different splits, and each split holds a list of sentences and a list of labels.

**Note:** This task publishes a 'raw dataset' as specified in `self.save(dataset_splits, "raw_dataset")` and `self.publishes = ["raw_dataset"]`. Only the items specified in self.publishes are passed to the downstream tasks.

Here is the complete implementation of DatasetFetchTask:


```python
from typing import List, Tuple

from datasets import load_dataset


class DatasetFetchTask(Task):
    def __init__(self):
        super().__init__()
        self.publishes = ["raw_dataset"]

    def _get_split(self, dataset, split):
        if split == "test":
            return dataset[split]
        elif split in ["train", "val"]:
            splitted = list(dataset["train"])
            split_index = int(0.7 * len(splitted))
            return splitted[:split_index] if split == "train" else splitted[split_index:]

    def _get_sentences_and_labels(self, dataset) -> Tuple[List[str], List[str]]:
        sentences = []
        labels = []
        for item in dataset:
            sentences.append(item["text"])
            labels.append(item["label-coarse"])
        return sentences, labels

    def run(self):
        dataset = load_dataset("trec")
        splits = ["train", "val", "test"]
        dataset_splits = {}
        for split in splits:
            dataset_split = self._get_split(dataset, split)
            sentences, labels = self._get_sentences_and_labels(dataset_split)
            split_results = {
                "sentences": sentences,
                "labels": labels
            }
            dataset_splits[split] = split_results
        self.save(dataset_splits, "raw_dataset")

```
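
The train/val handling in `_get_split` can be tried in isolation. The sketch below reproduces only the slicing logic on a toy list; the function name `split_train_val` and the toy data are made up for illustration. Note that the items are cut in their original order, without shuffling.

```python
from typing import Dict, List


def split_train_val(items: List[str], train_fraction: float = 0.7) -> Dict[str, List[str]]:
    """Cut a list into a leading train part and a trailing validation part."""
    split_index = int(train_fraction * len(items))
    return {"train": items[:split_index], "val": items[split_index:]}


toy_questions = [f"question {i}" for i in range(100)]
parts = split_train_val(toy_questions)
print(len(parts["train"]), len(parts["val"]))
print(parts["val"][0])
```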

### **2. Dataset Pre-processing:**
Now that we have our raw datasets prepared, we can apply some pre-processing to clean them up a bit, such as *removing punctuation*, *removing digits* and *lower-casing*.
We will implement this logic in a PreProcessTask, which takes a list of pre-processing steps as a parameter.

**Note:** This task assumes that the raw dataset is passed to it as an argument of its run() method. FluidML does this automatically.
Later, when we create the spec for PreProcessTask, we just have to register DatasetFetchTask as a predecessor. More on this when we model the flow/pipeline.

Therefore, PreProcessTask just has to implement its own logic and publish the pre-processed sentences as results. In this way, we can implement the tasks individually (**separation of concerns**).

Here is the complete implementation of PreProcessTask:

```python
import re
import string
from typing import Dict, List


class PreProcessTask(Task):
    def __init__(self, pre_processing_steps: List[str]):
        super().__init__()
        self._pre_processing_steps = pre_processing_steps
        self.publishes = ["pre_processed_dataset"]

    def _pre_process(self, text: str) -> str:
        # Apply the configured steps in the order they were given.
        pre_processed_text = text
        for step in self._pre_processing_steps:
            if step == "lower_case":
                pre_processed_text = pre_processed_text.lower()
            if step == "remove_punct":
                pre_processed_text = pre_processed_text.translate(
                    str.maketrans('', '', string.punctuation))
            if step == "remove_digits":
                # Digit runs are replaced by a "<num>" placeholder rather than dropped.
                pre_processed_text = re.sub(
                    r"\d+", "<num>", pre_processed_text)
        return pre_processed_text

    def run(self, raw_dataset: Dict):
        pre_processed_splits = {}
        for split in ["train", "val", "test"]:
            pre_processed_sentences = [
                self._pre_process(sentence) for sentence in raw_dataset[split]["sentences"]]
            pre_processed_splits[split] = {
                "sentences": pre_processed_sentences}
        self.save(pre_processed_splits, "pre_processed_dataset")
```
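
Because the steps are applied in the order in which they are listed, the order can change the result. The standalone sketch below mirrors the logic of `_pre_process` as a plain function (the function name and the sample question are made up for illustration) and runs it with three different step lists.

```python
import re
import string
from typing import List


def pre_process(text: str, steps: List[str]) -> str:
    """Apply the named steps in order, mirroring PreProcessTask._pre_process."""
    for step in steps:
        if step == "lower_case":
            text = text.lower()
        elif step == "remove_punct":
            text = text.translate(str.maketrans("", "", string.punctuation))
        elif step == "remove_digits":
            text = re.sub(r"\d+", "<num>", text)
    return text


question = "In 1999, who won?"
lowered = pre_process(question, ["lower_case", "remove_punct"])
punct_first = pre_process(question, ["remove_punct", "remove_digits"])
digits_first = pre_process(question, ["remove_digits", "remove_punct"])
print(lowered)       # in 1999 who won
print(punct_first)   # In <num> who won
print(digits_first)  # In num who won
```

In the last call the angle brackets of the `<num>` placeholder are themselves punctuation and get stripped, so `remove_punct` should come before `remove_digits` if the placeholder is meant to survive.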

### **3. Featurization:**

Now that we have our datasets prepared and sentences pre-processed, we can convert them to numerical vectors, which can then be fed to classifiers. To this end, we implement two featurizers: a TF-IDF featurizer and a GloVe featurizer. They are two independent tasks, each of which takes the pre-processed sentences and publishes vectorized sentences.

You may have guessed the pattern already: both tasks expect the `pre_processed_dataset` that was published by PreProcessTask.

```python
from typing import Dict

import numpy as np
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from sklearn.feature_extraction.text import TfidfVectorizer

class GloveFeaturizeTask(Task):
    def __init__(self):
        super().__init__()
        self.publishes = ["glove_featurized_dataset"]

    def run(self, pre_processed_dataset: Dict):
        # Load the GloVe embedder once and reuse it for every split.
        embedder = DocumentPoolEmbeddings([WordEmbeddings("glove")])
        featurized_splits = {}
        for split in ["train", "val", "test"]:
            sentences = [Sentence(sent)
                         for sent in pre_processed_dataset[split]["sentences"]]
            embedder.embed(sentences)
            glove_vectors = [sent.embedding.cpu().numpy()
                             for sent in sentences]
            glove_vectors = np.array(glove_vectors).reshape(
                len(glove_vectors), -1)
            featurized_splits[split] = {"vectors": glove_vectors}
        self.save(featurized_splits, "glove_featurized_dataset")


class TFIDFFeaturizeTask(Task):
    def __init__(self, min_df: int, max_features: int):
        super().__init__()
        self._min_df = min_df
        self._max_features = max_features
        self.publishes = ["tfidf_featurized_dataset"]

    def run(self, pre_processed_dataset: Dict):
        tfidf_model = TfidfVectorizer(
            min_df=self._min_df, max_features=self._max_features)
        tfidf_model.fit(pre_processed_dataset["train"]["sentences"])
        featurized_splits = {}
        for split in ["train", "val", "test"]:
            tfidf_vectors = tfidf_model.transform(
                pre_processed_dataset[split]["sentences"]).toarray()
            featurized_splits[split] = {"vectors": tfidf_vectors}
        self.save(featurized_splits, "tfidf_featurized_dataset")
```
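
Note that `TFIDFFeaturizeTask` fits the vectorizer on the training split only and then transforms every split with it. The sketch below illustrates that fit-on-train, transform-everywhere pattern with a deliberately simple term-count vectorizer written in plain Python. It is not TF-IDF and not Sklearn's implementation; the function names and toy sentences are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def fit_vocabulary(train_sentences: List[str]) -> Dict[str, int]:
    """Assign a column index to every word seen in the training sentences."""
    words = sorted({word for sentence in train_sentences for word in sentence.split()})
    return {word: index for index, word in enumerate(words)}


def transform(sentences: List[str], vocabulary: Dict[str, int]) -> List[List[int]]:
    """Count vocabulary words per sentence; words outside the vocabulary are ignored."""
    vectors = []
    for sentence in sentences:
        counts = Counter(sentence.split())
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vectors


splits = {
    "train": ["who wrote hamlet", "where is paris"],
    "val": ["who is einstein"],
}
vocabulary = fit_vocabulary(splits["train"])  # learned from the train split only
features = {name: transform(sentences, vocabulary) for name, sentences in splits.items()}
print(list(vocabulary))
print(features["val"])
```

The word "einstein" never occurs in the training split, so it has no column and does not contribute to the validation vector. Every split ends up with the same number of columns, which is what the classifier needs.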

### **4. Training a classifier**

We are all set to train a simple classifier. For this tutorial, let's stick with a simple logistic regression model from Sklearn.
For the inputs, we stack the GloVe and TF-IDF vectors (obtained from the featurization task results), and for the targets, we use the labels from the raw dataset. In the end, this task publishes a trained Sklearn classifier.

**Note:** You are not limited to Sklearn. You can train any kind of model using your favorite library (PyTorch, TensorFlow, Keras, PyTorch Lightning, etc.).

```python

from typing import Dict

import numpy as np
from sklearn.linear_model import LogisticRegression

class TrainTask(Task):
    def __init__(self, max_iter: int, balanced: bool):
        super().__init__()
        self._max_iter = max_iter
        self._class_weight = "balanced" if balanced else None
        self.publishes = ["trained_model"]

    def run(self, raw_dataset: Dict, tfidf_featurized_dataset: Dict, glove_featurized_dataset: Dict):
        model = LogisticRegression(
            max_iter=self._max_iter, class_weight=self._class_weight)
        stacked_vectors = np.hstack((tfidf_featurized_dataset["train"]["vectors"],
                                     glove_featurized_dataset["train"]["vectors"]))
        model.fit(stacked_vectors, raw_dataset["train"]["labels"])
        self.save(model, "trained_model")
```
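
Stacking the two feature sets means concatenating, for every sentence, its TF-IDF vector and its GloVe vector into one longer row. For two 2-D inputs this is what `np.hstack` does. The plain-Python sketch below spells that out on toy numbers; the helper name `hstack_rows` and the values are made up for illustration.

```python
from typing import List


def hstack_rows(left: List[List[float]], right: List[List[float]]) -> List[List[float]]:
    """Concatenate two feature matrices row by row (one row per sentence)."""
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    return [left_row + right_row for left_row, right_row in zip(left, right)]


tfidf_vectors = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]  # 2 sentences, 3 features each
glove_vectors = [[0.1, 0.2], [0.3, 0.4]]            # 2 sentences, 2 features each
stacked = hstack_rows(tfidf_vectors, glove_vectors)
print(len(stacked), len(stacked[0]))
```

The number of rows stays the same, and the number of columns is the sum of the two feature dimensions.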

### **5. Evaluation of classifier**

Now that we have trained a classifier, it is time to evaluate it on all of the dataset splits. This task is straightforward: it receives the featurized dataset splits and the trained model from its predecessors.
At the end, EvaluateTask publishes a nested dictionary containing a classification report for each of the train, val and test splits.


```python
from typing import Dict

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

class EvaluateTask(Task):
    def __init__(self):
        super().__init__()
        self.publishes = ["evaluation_results"]

    def run(self, raw_dataset: Dict, tfidf_featurized_dataset: Dict, glove_featurized_dataset: Dict,
            trained_model: LogisticRegression):
        evaluation_results = {}
        for split in ["train", "val", "test"]:
            stacked_vectors = np.hstack((tfidf_featurized_dataset[split]["vectors"],
                                         glove_featurized_dataset[split]["vectors"]))
            predictions = trained_model.predict(
                stacked_vectors)
            report = classification_report(
                raw_dataset[split]["labels"], predictions, output_dict=True)
            evaluation_results[split] = {"classification_report": report}
        self.save(evaluation_results, "evaluation_results")
```
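
The classification report contains, among other entries, a `"macro avg"` row whose `"f1-score"` is the unweighted mean of the per-class F1 scores. The sketch below recomputes that number by hand on toy labels, purely to show what it measures. In the pipeline itself, `classification_report` does this for us; the function name `macro_f1` and the toy data are made up for illustration.

```python
from typing import List


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Average the per-class F1 scores, giving every class equal weight."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Here the per-class F1 scores are 1.0, 2/3 and 0.8, so the macro average is about 0.822.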

## **Create and Run the Pipeline/Task-Graph via FluidML**

So far, we have looked into implementing our individual pipeline steps using FluidML's Task class and it was very straightforward.
You might be wondering how to put these tasks together so that they work as a single pipeline.

Thanks to FluidML's TaskSpec API, you can connect these tasks like Lego blocks :)

### **1. Instantiate Task Specifications**
TaskSpec is a simple wrapper that lets us specify task details, such as the task arguments that will be used when the task is instantiated.
Let's go ahead and create specs for all our tasks.

In [11]:
# create all task specs
dataset_fetch_task = TaskSpec(task=DatasetFetchTask)
pre_process_task = TaskSpec(task=PreProcessTask, task_kwargs={
                            "pre_processing_steps": ["lower_case", "remove_punct"]})
featurize_task_1 = TaskSpec(
    task=GloveFeaturizeTask)
featurize_task_2 = TaskSpec(
    task=TFIDFFeaturizeTask, task_kwargs={"min_df": 5, "max_features": 1000})
train_task = TaskSpec(task=TrainTask, task_kwargs={"max_iter": 50, "balanced": True})
evaluate_task = TaskSpec(task=EvaluateTask)

### **2. Registering Task Dependencies**
More importantly, TaskSpec also provides a `requires()` method to specify the predecessor tasks that need to be executed before that particular task.

For instance, in our example, DatasetFetchTask has to finish before PreProcessTask can start. Similarly,
PreProcessTask is required by both featurization tasks.

Using these task dependencies, FluidML creates a task graph and schedules the tasks accordingly.
Not just that, it automatically collects the results from the predecessor tasks and makes them available to the `run()` method.


In [12]:
# dependencies between tasks
pre_process_task.requires([dataset_fetch_task])
featurize_task_1.requires([pre_process_task])
featurize_task_2.requires([pre_process_task])
train_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2])
evaluate_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2, train_task])
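
To build some intuition for how published results can end up as `run()` arguments, here is a small standard-library sketch that matches published names against a function's parameter names. This is an assumption-laden illustration, not FluidML's actual mechanism; `call_with_published`, `train` and the toy data are made up.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_published(run: Callable, published: Dict[str, Any]) -> Any:
    """Call `run` with only those published results that match its parameter names."""
    wanted = list(inspect.signature(run).parameters)
    missing = [name for name in wanted if name not in published]
    if missing:
        raise KeyError(f"no predecessor published: {missing}")
    return run(**{name: published[name] for name in wanted})


def train(raw_dataset, tfidf_featurized_dataset):
    return f"training on {len(raw_dataset)} labels and {len(tfidf_featurized_dataset)} vectors"


# Everything the (hypothetical) predecessors have published so far.
published = {
    "raw_dataset": ["loc", "hum"],
    "tfidf_featurized_dataset": [[0.1], [0.2]],
    "pre_processed_dataset": ["not requested by train"],
}
message = call_with_published(train, published)
print(message)
```

The practical takeaway is the naming convention visible throughout this notebook: the parameter names of a task's `run()` method match the names listed in the `publishes` attribute of its predecessors.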

### **3. Creating a Final list of Task Specs**
We simply collect all these task specs in a list, which we will pass to FluidML.

In [13]:
# all task specs
tasks = [dataset_fetch_task,
         pre_process_task,
         featurize_task_1, featurize_task_2,
         train_task,
         evaluate_task]

### **4. Run the Pipeline/Task-Graph using Swarm & Flow**
Now that we have the final list of task specs, we just have to create:

- **Swarm:** a pool of workers that run these tasks in parallel. In our example, featurize_task_1 and featurize_task_2 are independent and can be executed concurrently.
- **Flow:** builds the tasks from the provided task specifications and creates a task graph, which is then processed by Swarm.


Using swarm and flow, we can just run our tasks using `flow.run(tasks)`.




In [14]:
with Swarm(n_dolphins=2,
           return_results=True,
           verbose=True) as swarm:
    flow = Flow(swarm=swarm)
    results = flow.run(tasks)
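
As a rough analogy for what running independent tasks concurrently means, the sketch below submits two stand-in featurizers to a pool with two workers from Python's standard library. It uses threads and made-up functions, so it says nothing about how Swarm is implemented internally; it only illustrates that two tasks without a mutual dependency can be in flight at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def glove_featurize(sentences):
    # Stand-in for an expensive featurizer: character count as a single feature.
    return [[len(sentence)] for sentence in sentences]


def tfidf_featurize(sentences):
    # Stand-in for a second featurizer: word count as a single feature.
    return [[len(sentence.split())] for sentence in sentences]


sentences = ["who wrote hamlet", "where is paris"]

# Two workers, loosely analogous to n_dolphins=2.
with ThreadPoolExecutor(max_workers=2) as pool:
    glove_future = pool.submit(glove_featurize, sentences)
    tfidf_future = pool.submit(tfidf_featurize, sentences)
    featurized = {"glove": glove_future.result(), "tfidf": tfidf_future.result()}

print(featurized)
```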

### **5. Results**
We can now go over the results and fetch a task's entry by its name. Each entry contains the task's results and its task config (accumulated up to that task in the graph).

In [15]:
print(results["EvaluateTask"]["config"])

In [16]:
print(results["EvaluateTask"]["result"]["evaluation_results"]["test"])

## **Grid Search:**

We can extend this pipeline with a grid search to tune hyper-parameters across the whole pipeline.
To enable grid search on a particular task, we just have to wrap it with `GridTaskSpec` instead of `TaskSpec`.

For example, for the training task, we can wrap it with `train_task = GridTaskSpec(task=TrainTask, gs_config={"max_iter": [50, 100], "balanced": [True, False]})`.

Internally, Flow expands this task into 4 tasks, one for each combination of `max_iter` and `balanced`.
Not only that, any successor tasks (for instance, evaluate task) in the task graph will also be automatically extended. In our example, we would have four evaluate tasks.
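
The number of expanded tasks is simply the size of the Cartesian product of the value lists. The sketch below enumerates the combinations for the `gs_config` above with `itertools.product`; the helper name `expand_grid` is made up, and this is not a claim about how Flow performs the expansion internally.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(gs_config: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one keyword-argument dictionary per combination of grid values."""
    names = list(gs_config)
    return [dict(zip(names, values)) for values in product(*gs_config.values())]


train_grid = expand_grid({"max_iter": [50, 100], "balanced": [True, False]})
for kwargs in train_grid:
    print(kwargs)
```

This yields 2 × 2 = 4 keyword-argument sets, matching the four training tasks described above.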

### **Model Selection Task:**

We now have several evaluation tasks, and once we run this task graph through Flow, we get one set of results per evaluation task.
We can implement a model selection task that consolidates these results and picks the best config for the pipeline, based on the macro-averaged F1 score on the validation split.

```python
class ModelSelectionTask(Task):
    def __init__(self):
        super().__init__()
        self.publishes = ["best_config", "best_performance"]

    def run(self, reduced_results: List[Dict]):
        sorted_results = sorted(reduced_results, key=lambda model_result: model_result["result"]
                                ["evaluation_results"]["val"]["classification_report"]["macro avg"]["f1-score"],
                                reverse=True)
        self.save(sorted_results[0]["config"], "best_config")
        self.save(sorted_results[0]["result"], "best_performance")
```


**Note:** We can now attach EvaluateTask as a predecessor of ModelSelectionTask. However, if we do it naively, we end up with 4 model selection tasks, since we have 4 evaluation tasks.
So we specify `reduce=True`, which creates only one instance of the model selection task and attaches all of the evaluation tasks to it as parents.

```python
model_selection_task = TaskSpec(task=ModelSelectionTask, reduce=True)
model_selection_task.requires([evaluate_task])
```
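
The selection logic can be checked without running the pipeline. The sketch below feeds made-up entries, nested like the `reduced_results` items used in `ModelSelectionTask.run()`, into a small helper that picks the entry with the highest validation macro-F1. The scores, the simplified `config` content and the helper names are hypothetical.

```python
from typing import Dict, List


def val_macro_f1(model_result: Dict) -> float:
    """Read the validation macro-F1 from one evaluation entry."""
    report = model_result["result"]["evaluation_results"]["val"]["classification_report"]
    return report["macro avg"]["f1-score"]


def select_best(reduced_results: List[Dict]) -> Dict:
    """Return the entry with the highest validation macro-F1."""
    return max(reduced_results, key=val_macro_f1)


def mock_entry(max_iter: int, f1: float) -> Dict:
    """Build a made-up entry with the nesting used in this notebook."""
    report = {"macro avg": {"f1-score": f1}}
    return {
        "config": {"max_iter": max_iter},
        "result": {"evaluation_results": {"val": {"classification_report": report}}},
    }


reduced_results = [mock_entry(50, 0.78), mock_entry(100, 0.83), mock_entry(200, 0.80)]
best = select_best(reduced_results)
print(best["config"], val_macro_f1(best))
```

Taking both the config and the performance from the same selected entry keeps the two published values consistent with each other.
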
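Finally, putting all of this together. Note that the cell below also grid-searches `max_features` of the TF-IDF featurizer, so together with the two `max_iter` and two `balanced` values the grid covers 2 × 2 × 2 = 8 combinations: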

In [17]:
# create all task specs
dataset_fetch_task = TaskSpec(task=DatasetFetchTask)
pre_process_task = TaskSpec(task=PreProcessTask, task_kwargs={
                            "pre_processing_steps": ["lower_case", "remove_punct"]})
featurize_task_1 = TaskSpec(
    task=GloveFeaturizeTask)
featurize_task_2 = GridTaskSpec(
    task=TFIDFFeaturizeTask, gs_config={"min_df": 5, "max_features": [1000, 2000]})
train_task = GridTaskSpec(task=TrainTask, gs_config={
                          "max_iter": [50, 100], "balanced": [True, False]})
evaluate_task = TaskSpec(task=EvaluateTask)
model_selection_task = TaskSpec(
    task=ModelSelectionTask, reduce=True)

In [18]:
# dependencies between tasks
pre_process_task.requires([dataset_fetch_task])
featurize_task_1.requires([pre_process_task])
featurize_task_2.requires([pre_process_task])
train_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2])
evaluate_task.requires(
    [dataset_fetch_task, featurize_task_1, featurize_task_2, train_task])
model_selection_task.requires([evaluate_task])

In [21]:
# all tasks
tasks = [dataset_fetch_task,
         pre_process_task,
         featurize_task_1, featurize_task_2,
         train_task,
         evaluate_task,
         model_selection_task]

with Swarm(n_dolphins=2,
           return_results=True,
           verbose=True) as swarm:
    flow = Flow(swarm=swarm)
    results = flow.run(tasks)

### **Choosing the best Config and best performance**

In [22]:
print(results["ModelSelectionTask"]["result"]["best_config"])
print(results["ModelSelectionTask"]["result"]["best_performance"]["evaluation_results"]["test"])

Using this best config, one can get the corresponding model from TrainTask results.

In [23]:
print(results["TrainTask"])
