# Training ML models for Supervised NLP Tasks
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Keypoints
- Common Natural Language Processing (NLP) pipelines consist of three main steps: text processing (normalization, tokenization, numericalization), feature extraction, and model training.

- Hyperparameter optimization techniques include Grid Search, Random Search, and Bayesian Optimization, each with its own advantages and use cases.

- The "60 iterations rule" in Random Search states that 60 random draws will include at least one configuration from the best 5% of the search space about 95% of the time ($1 - 0.95^{60} \approx 0.95$), regardless of the grid size.

- Bayesian Optimization uses a surrogate model, objective function, and acquisition function to efficiently navigate the hyperparameter space.

- Ensemble methods like StackingClassifier can improve model performance by combining multiple base classifiers with a meta-classifier.

- Chronological data splitting is crucial for legal ruling analysis to prevent temporal leakage and maintain the integrity of legal precedents.


## Learning Goals

By the end of this class, you will be able to:

1) Describe the steps involved in a typical Natural Language Processing (NLP) pipeline, including data collection, text cleaning, preprocessing, feature extraction, modeling, and evaluation.

2) Compare and contrast different hyperparameter optimization techniques such as Grid Search, Random Search, and Bayesian Optimization, explaining the "60 iterations rule" in the context of Random Search.

3) Apply chronological data splitting for legal ruling datasets to prevent temporal leakage and justify its importance for maintaining the integrity of legal precedent analysis.

4) Implement text processing and feature extraction techniques, specifically using TF-IDF vectorization, to transform raw legal text data into a numerical format suitable for machine learning models.

5) Train and evaluate various machine learning classifiers for NLP tasks, interpret key evaluation metrics like F1-score, balanced accuracy, accuracy, and Matthews Correlation Coefficient (MCC), and explain how ensemble methods like StackingClassifier can improve model performance.


# Understanding The Natural Language Processing (NLP) Pipeline

The NLP pipeline is a series of structured steps that help transform raw text into a format that machines can understand and use to make decisions or predictions. Here we explore each step for a complete understanding.

## 1. Data Collection

Data collection is the first and crucial step in the pipeline, where we gather raw text data from various sources. The quality and quantity of this data can significantly impact the effectiveness of your NLP model.

## 2. Text Cleaning

In this stage, we clean the collected data by removing noise such as HTML tags, emojis, punctuation marks, etc., which do not contribute to understanding the actual content. This cleaned-up data will improve the model's performance and save computational resources.

## 3. Preprocessing

Preprocessing involves transformation to ready the data for feature extraction, including tasks like tokenization (splitting text into words or phrases), stemming/lemmatization (reducing words to their base/root form), and removing stop words (common words like 'is', 'an', 'the' that don't carry much meaning).

## 4. Feature Extraction

This stage involves converting preprocessed data into a format that can be understood by machine learning algorithms. Techniques like Bag-of-Words or TF-IDF (Term Frequency-Inverse Document Frequency) are employed here to create numerical representations of the text.

`The points above were already addressed in Notebook 4.`


`During this class, we'll cover:`

## 5. Modeling

Once the features are extracted and in the proper format, we use them to build and train our NLP model. Depending on the end goal, different models can be used, such as Naive Bayes for classification or LSTM (Long Short-Term Memory) for sequence prediction.

## 6. Evaluation

After the model has been trained, it must be evaluated to ascertain its performance. Metrics like precision, recall, accuracy, and F1-score are typically considered. Also, the model might be tested with new data to validate its performance.
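
As a minimal sketch of why several metrics are usually reported together, the toy labels below (invented for illustration, not taken from the dataset) describe a degenerate classifier that always predicts the majority class. The metric functions are the ones from `sklearn.metrics`.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
)

# Toy, imbalanced labels (illustrative only): 8 samples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"accuracy={acc:.3f}  balanced_accuracy={bal_acc:.3f}")
print(f"macro_f1={macro_f1:.3f}  mcc={mcc:.3f}")
```

Accuracy looks acceptable here (0.8), while balanced accuracy (0.5) and MCC (0) show that the minority class is never identified.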




`What we won't cover in this class (I'll leave that to your MLOps professor):`

## 7. Deployment

Once satisfied with the model's performance, the next step is to deploy it for practical use. This can range from integrating within an existing system or application, to deploying on a server for production use.

## 8. Maintenance and Monitoring

Post-deployment, continuous monitoring is essential to ensure the model's performance doesn't degrade over time as data patterns change. Periodic retraining and tuning may be necessary to keep the model up-to-date.

# Understanding the BrCAD-5 Dataset

## Dataset Overview

The dataset we're working with is a sample from the [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5), a thorough collection of legal rulings from Brazilian Federal Small Claims Courts (FSCC). This dataset was specifically curated for academic purposes, with the primary goal of developing AI models capable of predicting appeal outcomes within the jurisdiction of the 5th Regional Federal Court (TRF5).

## Key Features of the Dataset

- **Sample Size**: Our sample contains 40,000 legal rulings (20,000 acórdãos and 20,000 sentenças), providing a reliable foundation for analysis.
- **Jurisdiction**: All cases are from the 5th Regional Federal Court (TRF5) jurisdiction in Brazil.
- **Case Types**: The dataset includes rulings from both federal courts and appellate panels.

## Dataset Structure

Besides the `text` column, which holds the full text of each ruling, the dataset is composed of three main columns:

1. **`case_number`**: 
   - A unique identifier for each legal ruling
   - Ensures each case can be distinctly referenced and tracked

2. **`ruling_type`**: 
   - Indicates the type of legal ruling
   - Two categories:
     - ACÓRDÃO (appellate decision): Typically refers to decisions made by appellate panels
     - SENTENÇA (first-instance judgment): Usually refers to decisions made by a single judge in a lower court

3. **`outcome`**: 
   - Represents the result of the legal ruling
   - Multiple possible outcomes:
     - PROVIMENTO: Appeal granted
     - PROVIMENTO PARCIAL: Appeal partially granted
     - NÃO PROVIMENTO: Appeal denied
     - IMPROCEDENTE: Claim dismissed
     - PROCEDENTE: Claim upheld
     - PARCIALMENTE PROCEDENTE: Claim partially upheld
     - EXTINTO SEM MÉRITO: Case dismissed without judgment on merits
     - HOMOLOGADA TRANSAÇÃO: Settlement agreement approved

## Relevance of the Dataset

1. **AI in Legal Prediction**: This dataset serves as a valuable resource for developing machine learning models that can predict legal outcomes, potentially revolutionizing legal research and case preparation.

2. **Understanding Brazilian Legal System**: It offers insights into the decision-making patterns within Brazilian Federal Small Claims Courts, which could be useful for comparative legal studies.

3. **Natural Language Processing (NLP) Applications**: The sample includes the full text of each ruling in the `text` column, which can be used for NLP tasks like legal document classification or summarization.

## Considerations for Analysis

- **Balanced Representation**: It's important to check if all outcome categories are adequately represented in the sample to ensure unbiased analysis.
- **Temporal Aspects**: Consider whether the rulings span a specific time period, as legal trends may change over time.
- **Contextual Factors**: Factors like the judge's identity are not provided in this dataset, but the metadata does include the case subject matter (`case_topic_*` columns) and the court (`court_id`), which could be influential in outcomes.


In [None]:
# Load our data.
import pandas as pd

df_texts = pd.read_parquet("data/brcad5/texts_sample.parquet.gz")
df_meta = pd.read_parquet("data/brcad5/metadata_sample.parquet.gz")

In [2]:
df_texts.head()

Unnamed: 0,case_number,text,outcome,ruling_type
2,0515165-56.2018.4.05.8202,SENTENÇA I - RELATÓRIO Dispensada a feitura do...,IMPROCEDENTE,SENTENÇA
11,0506287-42.2018.4.05.8300,"SENTENÇA I - RELATÓRIO Dispensado, nos termos ...",PARCIALMENTE PROCEDENTE,SENTENÇA
15,0513000-08.2019.4.05.8103,RECURSO INOMINADO CONTRA SENTENÇA DE EXTINÇÃO ...,NÃO PROVIMENTO,ACÓRDÃO
18,0503661-32.2018.4.05.8015,PROCESSO No 0503661-32.2018.4.05.8015 RECORREN...,NÃO PROVIMENTO,ACÓRDÃO
19,0516813-86.2018.4.05.8100,AMPARO SOCIAL (LOAS). REQUISITOS NÃO PREENCHID...,NÃO PROVIMENTO,ACÓRDÃO


In [None]:
df_texts.groupby("ruling_type").outcome.value_counts()

ruling_type  outcome                
ACÓRDÃO      NÃO PROVIMENTO             15581
             PROVIMENTO                  3004
             PROVIMENTO PARCIAL          1415
SENTENÇA     IMPROCEDENTE               11550
             PROCEDENTE                  5253
             PARCIALMENTE PROCEDENTE     2403
             EXTINTO SEM MÉRITO           791
             HOMOLOGADA TRANSAÇÃO           3
Name: count, dtype: int64

In [4]:
df_meta.head()

Unnamed: 0,case_number,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id
11,0510491-75.2017.4.05.8103,2017-10-11 00:00:00,INSS,2018-01-16 11:16:19,2018-03-26 17:07:16,6095,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Invalidez,19-CE
16,0509917-52.2017.4.05.8103,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE
19,0501051-55.2017.4.05.8103,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE
23,0511681-76.2017.4.05.8102,2017-09-27 00:00:00,INSS,2017-12-29 02:18:24,2018-03-26 17:07:16,6104,Direito Previdenciário,Benefícios em Espécie,Pensão por Morte (Art. 74/9),30-CE
33,0500777-79.2017.4.05.8107,2017-02-23 00:00:00,INSS,2017-11-27 17:58:46,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),25-CE


In [5]:
df_meta.columns

Index(['case_number', 'filing_date', 'defendant_normalized',
       'date_first_instance_ruling', 'date_appeal_panel_ruling',
       'case_topic_code', 'case_topic_1st_level', 'case_topic_2nd_level',
       'case_topic_3rd_level', 'court_id'],
      dtype='object')

In [6]:
df_texts.columns

Index(['case_number', 'text', 'outcome', 'ruling_type'], dtype='object')

In [7]:
df_texts.case_number.nunique()

20000


### Next Steps

As we proceed with our analysis, we'll need to:
1. Load the data into our working environment
2. Perform initial exploratory data analysis to understand the distribution of ruling types and outcomes
3. Consider any necessary data preprocessing steps, such as handling missing values or encoding categorical variables


## Problem Statement 1 - Classifying "Acórdãos"

When a legal case is adjudicated by a judge, the involved parties have the option to appeal the decision to a higher court. This appellate court then reviews the case and issues a document called an **"Acórdão"**. The "Acórdão" details the court's decision and the reasoning behind it, playing a crucial role in the judicial process. 

### Purpose and Importance of "Acórdãos"

The "Acórdão" can either:
- **Uphold** the original decision made by the lower court.
- **Overturn** the decision, leading to a different outcome.

These documents are invaluable for understanding the application of laws in various contexts. They offer insights into:
- **Judicial reasoning**: How judges interpret and apply the law.
- **Legal precedents**: Past decisions that influence future cases.

### Challenges in Analyzing "Acórdãos"

Despite their importance, "Acórdãos" are written in natural language, which poses challenges for systematic analysis:
1. **Complexity of Legal Language**: Legal terminology and reasoning can be sophisticated and difficult to parse for those without legal training.
2. **Volume of Data**: The sheer number of "Acórdãos" can be overwhelming, making manual analysis impractical.

To address these challenges, automated classification techniques can be employed to categorize and extract meaningful insights from these documents. By training machine learning models on labeled "Acórdãos," we can develop systems that classify the outcome of an appeal based on the content of the document.



## Problem Statement 2 - Predicting the Time to Issue an "Acórdão"

Once a case has been tried by a judge, the appellate court issues the "Acórdão" after a certain period. The duration between the initial trial and the issuance of the "Acórdão" can vary significantly, ranging from a few days to several months.

### Repercussions of the Time Frame

The time it takes to issue an "Acórdão" has significant repercussions for the parties involved:
- **Closure**: If the "Acórdão" supports the original decision, parties can find closure and move forward.
- **Uncertainty**: If the "Acórdão" overrules the initial decision, the parties may face prolonged uncertainty, potentially lasting months or years.

### Calculating the Time Frame

In the dataset, the time it takes to issue the "Acórdão" can be derived by calculating the difference between two columns:
- `date_first_instance_ruling`: The date when the initial trial decision was made.
- `date_appeal_panel_ruling`: The date when the appellate court issued the "Acórdão".

This calculation will provide an accurate representation of the time duration between the trial and the issuance of the "Acórdão".
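
The subtraction can be sketched with pandas as follows. The two rows are copied from the metadata preview shown earlier; the `days_to_acordao` column name is a hypothetical choice for illustration, not something defined by the dataset.

```python
import pandas as pd

# Two illustrative rows taken from the metadata preview
df_example = pd.DataFrame(
    {
        "date_first_instance_ruling": pd.to_datetime(
            ["2018-01-16 11:16:19", "2017-11-21 11:44:40"]
        ),
        "date_appeal_panel_ruling": pd.to_datetime(
            ["2018-03-26 17:07:16", "2018-03-26 17:07:16"]
        ),
    }
)

# Subtracting two datetime columns yields a Timedelta column
elapsed = (
    df_example["date_appeal_panel_ruling"]
    - df_example["date_first_instance_ruling"]
)

# Express the duration in (fractional) days; the column name is hypothetical
df_example["days_to_acordao"] = elapsed.dt.total_seconds() / 86_400

print(df_example["days_to_acordao"].round(2))
```

Keeping fractional days preserves the time-of-day information; `elapsed.dt.days` would give whole days instead.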

> **Note**: When developing predictive models, it is essential to use only the data available at the time of prediction. Therefore, the `date_appeal_panel_ruling` column cannot be used as an input for the model, since it is only known once the appeal has been decided. It is used solely to compute the target. This prevents data leakage and ensures the model's practical utility.

### Potential Predictive Features

To predict the duration between the trial and the issuance of the "Acórdão," consider using features available at the time of the initial trial, such as:
- **Case characteristics**: Type of case, complexity, involved parties.
- **Court attributes**: Location, workload, historical averages.
- **Judge information**: Identity, workload, historical performance.


# Problem 1 - Classifying "Acórdãos"

In [None]:
# Filter the dataframe to include only rows where the ruling type is "ACÓRDÃO"
# Select only the columns 'case_number', 'text', and 'outcome'
df = df_texts.query('ruling_type == "ACÓRDÃO"')[
    ["case_number", "text", "outcome"]
].copy()

# Merge the filtered dataframe with another dataframe containing metadata
# Specifically, we are adding the 'date_appeal_panel_ruling' column based on 'case_number'
df = df.merge(
    df_meta[["case_number", "date_appeal_panel_ruling"]], on="case_number", how="left"
)

# Display the summary information of the resulting dataframe
# This includes the number of entries, column names, non-null counts, and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   case_number               20000 non-null  object        
 1   text                      20000 non-null  object        
 2   outcome                   20000 non-null  object        
 3   date_appeal_panel_ruling  20000 non-null  datetime64[us]
dtypes: datetime64[us](1), object(3)
memory usage: 625.1+ KB


In [9]:
df.outcome.value_counts(normalize=True)

outcome
NÃO PROVIMENTO        0.77905
PROVIMENTO            0.15020
PROVIMENTO PARCIAL    0.07075
Name: proportion, dtype: float64

In [None]:
# Convert the 'outcome' column to a categorical type and then to integer codes
# This is useful for machine learning models that require numerical input
df["outcome_int"] = df.outcome.astype("category").cat.codes

# Display the count of each unique integer code in the 'outcome_int' column
# This helps to understand the distribution of different outcomes in the dataset
df.outcome_int.value_counts()

outcome_int
0    15581
1     3004
2     1415
Name: count, dtype: int64
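
To read results later, it helps to know which integer stands for which outcome. The sketch below uses a small hand-made Series (not the real dataframe) to show how the mapping could be recovered; pandas assigns category codes following the sorted order of the labels.

```python
import pandas as pd

# Hand-made sample with the three appeal outcomes (illustrative only)
outcomes = pd.Series(
    ["PROVIMENTO", "NÃO PROVIMENTO", "PROVIMENTO PARCIAL", "NÃO PROVIMENTO"]
)

as_category = outcomes.astype("category")

# cat.categories lists the labels in code order: position i holds the label of code i
code_to_label = dict(enumerate(as_category.cat.categories))

print(code_to_label)
print(as_category.cat.codes.tolist())
```

The order agrees with the counts above, where code 0 is the majority class, NÃO PROVIMENTO.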

## Step 1 - Split the data into training, validation and test sets

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataframe into training and testing sets
# Use stratified sampling based on the 'outcome' column to preserve the class proportions in each split
# Set aside 20% of the data for testing
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=271828, stratify=df.outcome
)

# Further split the testing set into validation and testing sets
# Use stratified sampling based on the 'outcome' column to preserve the class proportions in each split
# Set aside 50% of the testing set for validation, resulting in 10% of the original data
df_test, df_val = train_test_split(
    df_test, test_size=0.5, random_state=271828, stratify=df_test.outcome
)

# Display the shapes of the resulting datasets
# This helps to verify the sizes of the training, testing, and validation sets
df_train.shape, df_test.shape, df_val.shape

((16000, 5), (2000, 5), (2000, 5))

## Proper Data Splitting in Legal Ruling Analysis

Legal data is inherently chronological. When preparing legal ruling data for analysis or model training, it is essential to split the data in a way that respects its time order, so that the learned patterns and evaluations remain valid as the legal context evolves. Let's explore the importance of chronological data splitting and best practices for handling legal data. For more details, please refer to [this very interesting paper](https://aclanthology.org/2023.nllp-1.9/).

### Issues with Random Data Splitting

Random data splitting may seem convenient but introduces major challenges when dealing with time-ordered legal data:

1. **Temporal Leakage**  
   When future data accidentally influences training, models may appear to perform much better than they actually would on new cases. Here, the error estimated on randomly split data is misleadingly low compared to the error on truly unseen future data.

2. **Disruption of Legal Precedent**  
   Legal rulings often build on earlier decisions. Random splitting can break the chronological sequence, meaning that the influence of earlier cases on later ones is lost. This distortion can lead to erroneous models that do not reflect the evolution of legal arguments.

3. **Misrepresentation of Evolving Legal Trends**  
   Laws are subject to change, and legal interpretations evolve over time. Random splitting may combine older and newer data in a way that masks significant trends and shifts in legal reasoning, leading to an inaccurate assessment of current and future legal environments.


### Best Practices for Splitting Legal Ruling Data

To overcome these pitfalls, consider the following guidelines:

1. **Chronological Splitting**  
   Divide the dataset based on the date of the rulings:
   - **Training Set**: Use earlier rulings.
   - **Validation Set**: Use rulings from an intermediate time period.
   - **Test Set**: Use the most recent rulings.
   
   This approach preserves the natural sequence of legal decisions.

2. **Maintaining Temporal Order**  
   By ensuring that training always involves earlier data and testing uses later data, the evaluation simulates a real-world scenario where the model predicts future cases based solely on past knowledge.

3. **Realistic Model Evaluation**  
   Testing the model on the most recent rulings helps provide a realistic measure of its performance on truly unseen future cases. This method avoids overly optimistic results and better reflects the challenges encountered in practice.


### Benefits of Proper Data Splitting

- **Improved Generalization**  
  Models trained on historical data and evaluated on recent data are better prepared for real-world applications. They learn from past patterns and are challenged by new, unseen trends.

- **Accurate Performance Metrics**  
  By testing on data that comes chronologically after the training data, performance estimates move closer to what the model will achieve in production. This makes the reported metrics more reliable and more indicative of real-world performance.

- **Alignment with Legal Trends**  
  Chronologically splitting data preserves the evolution of legal standards. This allows models to capture shifts in legal interpretation and adapt appropriately.

### Additional Considerations

- **Time Window Selection**  
  Choose the time spans for training, validation, and testing carefully to ensure each segment contains enough data while still reflecting the chronological sequence.

- **Handling Landmark Cases**  
  Be mindful of significant legal events or landmark rulings. Such events might represent shifts that an effective model should capture. Consider segmenting the data around these events if necessary.

- **Regular Model Updates**  
  In areas where legal standards change quickly, updating the model with the most recent data can be crucial. Regular retraining ensures that the model remains relevant and performs well on the most current information.

> **Note**: Respecting the inherent temporal order in legal data is not merely a technical requirement; it is crucial for developing models that offer reliable and ethically sound judgments in real-world applications.

In [None]:
# Calculate the number of rows for the training set (80% of the total dataframe length)
train_length = int(0.8 * len(df))

# Calculate the number of rows for the validation set (10% of the total dataframe length)
validation_length = int(0.1 * len(df))

# Sort the dataframe by the 'date_appeal_panel_ruling' column in ascending order
# This ensures that the data is split chronologically, which can be important for time series data
df = df.sort_values(by="date_appeal_panel_ruling", ascending=True)

# Split the dataframe into training, validation, and testing sets based on the calculated lengths
# The training set includes the first 80% of the rows
df_train = df.iloc[:train_length].copy()

# The validation set includes the next 10% of the rows
df_val = df.iloc[train_length : train_length + validation_length].copy()

# The testing set includes the remaining 10% of the rows
df_test = df.iloc[train_length + validation_length :].copy()

# Display the shapes of the resulting datasets
# This helps to verify the sizes of the training, validation, and testing sets
df_train.shape, df_val.shape, df_test.shape

((16000, 5), (2000, 5), (2000, 5))

In [13]:
df_train.date_appeal_panel_ruling.min(), df_train.date_appeal_panel_ruling.max()

(Timestamp('2018-03-26 17:02:45'), Timestamp('2019-09-30 17:20:24'))

In [14]:
df_val.date_appeal_panel_ruling.min(), df_val.date_appeal_panel_ruling.max()

(Timestamp('2019-09-30 17:20:24'), Timestamp('2019-12-11 14:57:02'))

In [15]:
df_test.date_appeal_panel_ruling.min(), df_test.date_appeal_panel_ruling.max()

(Timestamp('2019-12-11 14:57:02'), Timestamp('2020-04-01 11:24:06'))
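
The ranges above show that the latest training timestamp equals the earliest validation timestamp (and likewise between validation and test), which suggests that rulings sharing a timestamp land on both sides of a purely positional cut. One possible alternative, sketched below on invented toy dates, is to cut on timestamp values so that such rulings stay together. The function name `split_at_timestamps` is hypothetical, and this is not the notebook's own method.

```python
import pandas as pd


def split_at_timestamps(df, date_col, train_frac=0.8, val_frac=0.1):
    """Chronological split in which rows sharing a timestamp stay in the same set."""
    df = df.sort_values(date_col)
    # Timestamps found at the positional 80% and 90% marks
    train_cutoff = df[date_col].iloc[int(train_frac * len(df)) - 1]
    val_cutoff = df[date_col].iloc[int((train_frac + val_frac) * len(df)) - 1]
    # Cut on values rather than positions, so ties never straddle a boundary
    train = df[df[date_col] <= train_cutoff]
    val = df[(df[date_col] > train_cutoff) & (df[date_col] <= val_cutoff)]
    test = df[df[date_col] > val_cutoff]
    return train, val, test


# Toy data: 20 rulings, two of which share the timestamp at the 80% mark
days = list(range(1, 16)) + [16, 16, 17, 18, 19]
toy = pd.DataFrame(
    {"date_appeal_panel_ruling": pd.to_datetime([f"2019-01-{d:02d}" for d in days])}
)

train, val, test = split_at_timestamps(toy, "date_appeal_panel_ruling")
print(len(train), len(val), len(test))
```

The trade-off is that the set sizes drift slightly from an exact 80/10/10 split, and a very large group of tied timestamps could leave the validation set small, so the resulting sizes are worth checking.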

## Step 2 - Text Processing and vectorization

In [None]:
# Import necessary functions from the 'helpers.text' module
from helpers.text import (
    remove_accented_characters,
    remove_numbers_punctuation_from_text,
    remove_excessive_spaces,
    remove_short_words,
)

# Apply the cleaning functions to the 'text' column of each dataframe
for dataframe in [df_train, df_val, df_test]:
    dataframe["clean_text"] = dataframe.text.apply(remove_accented_characters)
    dataframe["clean_text"] = dataframe.clean_text.apply(
        remove_numbers_punctuation_from_text
    )
    dataframe["clean_text"] = dataframe.clean_text.apply(remove_excessive_spaces)
    dataframe["clean_text"] = dataframe.clean_text.apply(remove_short_words)

In [None]:
df_train.iloc[0]["clean_text"]

'RELATORIO Trata recurso interposto pela parte autora face sentenca que julgou improcedente pedido reajuste beneficio com base Indice Precos Consumidor Terceira Idade IPC Instituto Brasileiro Economia Fundacao Getulio Vargas FGV IBRE substituicao Indice Nacional Precos Consumidor INPC Instituto Brasileiro Geografia Estatistica IBGE alegando que este indice aplicado pelo INSS nao preservaria carater permanente valor real dos beneficios previdenciarios que afrontaria comando art VOTO Conforme bem fundamenta sentenca recorrida preservacao valor real dos beneficios assegurada pela aplicacao dos indices estabelecidos pela propria legislacao previdenciaria nao cabendo poder judiciario substituir indice eleito pelo legislador fato forma indices reajustes que devem ser aplicados aos beneficios previdenciarios concedidos apos sao aqueles estabelecidos pela Lei uma vez que Carta Magna remeteu legislador ordinario definicao dos indices serem aplicados aos reajustes dos beneficios para preservacao
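
The `helpers.text` module is not shown in this notebook. Judging only from the cleaned output above (accents stripped, digits and punctuation gone, very short words dropped, case preserved), functions of this kind might look roughly like the sketch below. All names, the regular expressions, and the three-character minimum are assumptions for illustration, not the actual helper implementations.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD splits "ç" into "c" plus a combining mark; the marks are then dropped
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_numbers_punctuation(text: str) -> str:
    # Replace anything that is not an ASCII letter or whitespace with a space
    return re.sub(r"[^A-Za-z\s]", " ", text)


def collapse_spaces(text: str) -> str:
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


def drop_short_words(text: str, min_len: int = 3) -> str:
    # Assumed threshold: keep only words with at least `min_len` characters
    return " ".join(word for word in text.split() if len(word) >= min_len)


def clean_text(text: str) -> str:
    # Same order of operations as the loop above
    for step in (
        strip_accents,
        strip_numbers_punctuation,
        collapse_spaces,
        drop_short_words,
    ):
        text = step(text)
    return text


print(clean_text("Sentença 123: o pedido é IMPROCEDENTE, art. 5."))
```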

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Load Portuguese stopwords from NLTK
stopwords_nltk = stopwords.words("portuguese")

# Create a TF-IDF vectorizer with specific parameters:
# - stop_words: remove common Portuguese stopwords
# - max_features: limit the number of features to 1000
# - ngram_range: consider unigrams and bigrams
# - min_df: ignore terms that appear in fewer than 5 documents
# - max_df: ignore terms that appear in more than 80% of documents
# - lowercase: convert all text to lowercase
vectorizer = TfidfVectorizer(
    stop_words=stopwords_nltk,
    max_features=1000,
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.8,
    lowercase=True,
)

# Fit the vectorizer using only the training data
# This ensures that the model does not have access to the validation or test data during training
vectorizer.fit(df_train.clean_text)

# Transform the text data into TF-IDF vectors
# This converts the text data into numerical form suitable for machine learning models
X_train = vectorizer.transform(df_train.clean_text)
X_val = vectorizer.transform(df_val.clean_text)
X_test = vectorizer.transform(df_test.clean_text)

# Extract the target labels for training, validation, and testing sets
# These labels will be used to train and evaluate the machine learning model
y_train = df_train.outcome_int
y_val = df_val.outcome_int
y_test = df_test.outcome_int

In [None]:
df_train[["clean_text", "outcome_int"]].sample(10, random_state=271828)

Unnamed: 0,clean_text,outcome_int
13171,EMENTA PREVIDENCIARIO AUXILIO DOENCA LAUDO DES...,1
3344,EMENTA PREVIDENCIARIO BENEFICIO ASSISTENCIAL L...,0
2508,VOTO Dispensado relatorio nos termos art Lei a...,0
17639,PROCESSO EMENTA PROCESSO CIVIL ASSISTENCIA SOC...,0
17063,PROCESSO EMENTA ADMINISTRATIVO SERVIDOR PUBLIC...,0
15959,EMENTA PREVIDENCIARIO AUXILIO DOENCA APOSENTAD...,0
668,RECURSO INOMINADO DIREITO PREVIDENCIARIO PEDID...,0
4637,RECURSO INOMINADO DIREITO ADMINISTRATIVO SEGUR...,1
19977,VOTO EMENTA PREVIDENCIARIO BENEFICIO PREVIDENC...,0
1950,PROCESSO RECORRENTE MARIA LOURDES CONCEICAO RE...,0


In [20]:
X_train.shape, X_val.shape, X_test.shape

((16000, 1000), (2000, 1000), (2000, 1000))

In [21]:
y_train.shape, y_val.shape, y_test.shape

((16000,), (2000,), (2000,))

In [22]:
X_train[0]

<1x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 244 stored elements in Compressed Sparse Row format>

In [23]:
X_train = X_train.toarray()
X_val = X_val.toarray()
X_test = X_test.toarray()

In [24]:
X_train[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.05059559, 0.        , 0.        ,
       0.        , 0.        , 0.0465296 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.0376958 , 0.        ,
       0.02821556, 0.03924224, 0.04661888, 0.04740649, 0.        ,
       0.        , 0.        , 0.        , 0.04219101, 0.04322569,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.05352028, 0.        ,
       0.03443451, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.03712143, 0.04413272, 0.        , 0.        ,
       0.        , 0.03417699, 0.        , 0.12682283, 0.05464034,
       0.        , 0.04136232, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [None]:
import numpy as np


def get_top_ngrams(
    X_train: np.ndarray, vectorizer: TfidfVectorizer, top_n: int = 30
) -> np.ndarray:
    """
    Get the top n n-grams ranked by their summed TF-IDF weight.

    Note that the ranking reflects accumulated TF-IDF weight across documents,
    not raw occurrence counts.

    Args:
        X_train (np.ndarray): The dense TF-IDF matrix (documents x n-grams).
        vectorizer (TfidfVectorizer): The fitted vectorizer used to transform the text data.
        top_n (int, optional): The number of top n-grams to return. Defaults to 30.

    Returns:
        np.ndarray: The top n n-grams, ordered from highest to lowest total weight.
    """
    # Sum over documents (rows) to get the total TF-IDF weight of each n-gram
    total_ngram_weights = np.sum(X_train, axis=0)

    # Sort the n-gram indices by total weight, highest first
    sorted_ngrams_indices = np.argsort(total_ngram_weights)[::-1]

    # Keep the indices of the top n n-grams
    top_ngrams_indices = sorted_ngrams_indices[:top_n]

    # Get the names of the n-grams corresponding to the top n indices
    ngram_names = np.array(vectorizer.get_feature_names_out())

    return ngram_names[top_ngrams_indices]


# Use the function to get the top 30 n-grams from the training data
top_ngrams = get_top_ngrams(X_train, vectorizer, top_n=30)
top_ngrams

array(['incapacidade', 'beneficio', 'doenca', 'auxilio', 'atividade',
       'auxilio doenca', 'aposentadoria', 'segurado', 'laudo',
       'concessao', 'prova', 'especial', 'rural', 'fgts', 'anexo',
       'autor', 'monetaria', 'inss', 'periodo', 'pericial', 'invalidez',
       'correcao', 'trabalho', 'anos', 'tempo', 'aposentadoria invalidez',
       'social', 'prazo', 'carencia', 'direito'], dtype=object)

## Step 3 - Model training

In [None]:
import time

# Import various classifiers and utilities from scikit-learn and other libraries

# LightGBM classifier, a gradient boosting framework that uses tree-based learning algorithms
from lightgbm import LGBMClassifier

# CalibratedClassifierCV for probability calibration of classifiers
from sklearn.calibration import CalibratedClassifierCV

# Ensemble classifiers from scikit-learn
# ExtraTreesClassifier and RandomForestClassifier are ensemble methods that use multiple decision trees
# StackingClassifier allows combining multiple classifiers to improve performance
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)

# Linear models from scikit-learn
# LogisticRegression is a linear model for classification (binary or multiclass)
# SGDClassifier is a linear classifier using stochastic gradient descent
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Metrics for evaluating classification performance
# accuracy_score, balanced_accuracy_score, classification_report, confusion_matrix, f1_score, matthews_corrcoef
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
)

# Naive Bayes classifier for multinomially distributed data
from sklearn.naive_bayes import MultinomialNB

# K-Nearest Neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

# Neural network-based classifier
from sklearn.neural_network import MLPClassifier

# Support Vector Machine classifiers
# SVC is a kernelized support vector classifier (RBF kernel by default)
# LinearSVC is a support vector classifier restricted to a linear kernel, which scales better to large datasets
from sklearn.svm import SVC, LinearSVC

# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# XGBoost classifier, an optimized distributed gradient boosting library
from xgboost import XGBClassifier

## Classifiers Overview

This section provides a systematic explanation of several classifiers used in machine learning. The discussion covers how each classifier works, when it is best applied, and highlights important technical details. Mathematical formulas are provided where useful.


### 1. Calibrated-LSVC: CalibratedClassifierCV with LinearSVC

- **LinearSVC:**  
  - A linear support vector classifier that identifies the hyperplane separating data into classes.
  - The goal is to maximize the margin between classes.
  - Mathematically, given a set of data points $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, 1\}$, the classifier seeks a hyperplane described by $\mathbf{w}^T \mathbf{x} + b = 0$ such that the margin is maximized.

- **CalibratedClassifierCV:**  
  - This method adjusts the outputs of classifiers that do not naturally produce probability estimates.
  - It takes the scores from LinearSVC and maps them to reliable probability values.
  - The combination helps in situations where the probability estimate is important for applications such as ranking or threshold setting.

<p align="center">
  <img src="images/linear_svm.png" alt="" style="width: 40%; height: 40%"/>
</p>

> For a very nice video on model calibration and why it is important, check out [this video](https://www.youtube.com/watch?v=oOZr4kRJgFE&list=WL).
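
As a minimal sketch of the combination described above, the snippet below wraps a `LinearSVC` in `CalibratedClassifierCV` and checks that calibrated probabilities become available. The synthetic dataset from `make_classification` and the names `X_demo`, `y_demo` are assumptions for illustration, not the legal corpus used in this notebook.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic three-class data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# LinearSVC alone exposes decision_function but not predict_proba
svc = LinearSVC(random_state=0)
has_proba = hasattr(svc, "predict_proba")
print(has_proba)  # False

# Wrapping it adds predict_proba, fitted with internal cross-validation
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0), cv=5)
calibrated.fit(X_demo, y_demo)
proba = calibrated.predict_proba(X_demo[:3])
print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```

The default calibration method is a sigmoid fit on the decision scores; whether it or `method="isotonic"` works better depends on the amount of data available.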

---

### 2. LR: Logistic Regression

- **Logistic Function (Sigmoid):**  
  - Logistic Regression uses the sigmoid function to convert linear combinations of features into a probability between 0 and 1.  
    $$
    \sigma(z) = \frac{1}{1 + e^{-z}}
    $$
    where $ z = \mathbf{w}^T \mathbf{x} + b $.

- **Decision Boundary:**  
  - A standard decision boundary is set at a probability of 0.5. Instances with a probability above 0.5 are classified into one class, while those below are assigned to the other.

- **Applications:**  
  - It is particularly useful for binary classification tasks.
  - With methods such as one-vs-rest or softmax, it can be extended to multi-class classification.

<p align="center">
  <img src="images/logistic_regression.png" alt="" style="width: 40%; height: 40%"/>
</p>
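
To make the sigmoid and the 0.5 threshold concrete, here is a small sketch in plain Python. The weights, bias, and feature vector are hypothetical values chosen only for the arithmetic.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, bias, and feature vector for a two-feature problem
w = [0.8, -0.4]
b = 0.1
x = [2.0, 1.0]

z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # w^T x + b
p = sigmoid(z)                                    # probability of the positive class
label = int(p >= 0.5)                             # p >= 0.5 is equivalent to z >= 0
print(f"z={z:.2f}, p={p:.4f}, label={label}")
```

Because the sigmoid is monotonic, thresholding the probability at 0.5 is the same as checking the sign of $z$, which is why the decision boundary is linear.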

---

### 3. RF: Random Forest Classifier

- **Ensemble Method:**  
  - Random Forest constructs multiple decision trees and aggregates their predictions.
  - Each tree is built on a bootstrap sample of the data, and subsets of features are considered when splitting nodes.

- **Voting Mechanism:**  
  - To classify a new instance, each tree in the forest votes on the class, and the majority wins. (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes.)

- **Advantages:**  
  - The ensemble strategy reduces overfitting and improves generalization compared to a single decision tree.

<p align="center">
  <img src="images/random_forest.png" alt="" style="width: 40%; height: 40%"/>
</p>

---

### 4. LGBM
### 5. XGB
### 6. CatBoost: Gradient Boosting Machines

These classifiers are based on the gradient boosting principle, where weak learners are sequentially combined to create a strong learner.

- **General Principles:**
  - **Boosting Framework:**  
    Each new model is trained to correct the errors made by the combination of previous models. If the current model prediction is denoted as $\hat{y}_i$ for instance $i$, and the loss function is $L(y_i, \hat{y}_i)$, then subsequent models address the residuals.
  
  - **Tree-Based Learning:**  
    Decision trees serve as the weak learners, handling both numerical and categorical data.

- **Differences Among the Methods:**

  - **LGBM (LightGBM):**  
    - Optimizes speed and memory use.
    - Implements techniques such as Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

  - **XGBoost:**  
    - Emphasizes performance and includes regularization methods to mitigate overfitting.
    - Utilizes parallel processing for improved training times.

  - **CatBoost:**  
    - Specializes in handling categorical features by managing them directly without extensive preprocessing.
    - Uses ordered boosting and symmetric trees to ensure efficient predictions.

<p align="center">
  <img src="images/xgboost_lgbm.png" alt="" style="width: 40%; height: 40%"/>
</p>
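
The boosting principle can be sketched in a few lines. The example below is a simplified illustration for squared-error regression, where the residual is exactly the negative gradient of the loss; it is not how LightGBM, XGBoost, or CatBoost are implemented internally. The toy data and the names `X_toy`, `y_toy` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y_toy - prediction  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # small corrective step
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}, after 50 rounds: {final_mse:.4f}")
```

For classification the same loop applies, except that the trees are fitted to gradients of a classification loss such as log loss rather than to plain residuals.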

---

### 7. MLP: Multi-Layer Perceptron Classifier

- **Neural Network Structure:**  
  - Consists of an input layer, one or more hidden layers, and an output layer.
  - Each layer is formed by units (neurons) that process inputs and pass them to the next layer.

- **Activation Functions:**  
  - Uses non-linear activation functions like ReLU or sigmoid, which allow the network to model complex relationships.
  - The activation can be expressed as:  
    $$
    a = f(\mathbf{w}^T \mathbf{x} + b)
    $$
    where $f$ is a non-linear function such as $\sigma(z)$.

- **Learning via Backpropagation:**  
  - The network adjusts weights based on the difference between predicted and true outcomes, often using gradients computed by backpropagation.

<p align="center">
  <img src="images/mlp.png" alt="" style="width: 40%; height: 40%"/>
</p>
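
A forward pass through a one-hidden-layer network can be written directly from the activation formula. This is a sketch with randomly initialized (untrained) weights; the layer sizes and variable names are assumptions for illustration.

```python
import numpy as np


def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0.0, z)


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
x_in = rng.normal(size=4)                      # 4 input features
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)  # hidden layer with 5 units
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)  # output layer with 3 classes

hidden = relu(W1 @ x_in + b1)      # a = f(Wx + b) with f = ReLU
probs = softmax(W2 @ hidden + b2)  # class probabilities
print(probs, probs.argmax())
```

Training would adjust `W1`, `b1`, `W2`, and `b2` by backpropagating the loss gradient, which `MLPClassifier` handles internally.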

---

### 8. SGD: Stochastic Gradient Descent Classifier

- **Linear Classification:**  
  - Fits the same kind of linear model as logistic regression or a linear SVM; what differs is the training procedure, which is stochastic gradient descent.

- **Stochastic Gradient Descent (SGD):**  
  - Updates weights using only one or a few training instances at a time. This makes the algorithm suitable for large-scale problems.
  - The weight update rule is given by:  
    $$
    \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla L(\mathbf{w}_t; \mathbf{x}_i, y_i)
    $$
    where $\eta$ is the learning rate.

- **Flexibility:**  
  - Allows the choice of different loss functions such as hinge loss (similar to SVM) or logistic loss.
  - Incorporates regularization (e.g., L1 or L2) to address overfitting.

<p align="center">
  <img src="images/sgd.png" alt="" style="width: 40%; height: 40%"/>
</p>
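
The update rule above can be sketched for the logistic loss with one training instance per step. The toy data, learning rate, and number of epochs are assumptions; `SGDClassifier` adds learning-rate schedules and regularization on top of this basic idea.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linearly separable data: the label is 1 when x0 + x1 > 0
X_sgd = rng.normal(size=(500, 2))
y_sgd = (X_sgd[:, 0] + X_sgd[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.1  # learning rate

for epoch in range(5):
    for i in rng.permutation(len(X_sgd)):  # visit instances in random order
        p = 1.0 / (1.0 + np.exp(-(X_sgd[i] @ w + b)))  # predicted probability
        grad_w = (p - y_sgd[i]) * X_sgd[i]             # gradient of log loss w.r.t. w
        grad_b = p - y_sgd[i]                          # gradient of log loss w.r.t. b
        w -= eta * grad_w                              # w_{t+1} = w_t - eta * gradient
        b -= eta * grad_b

accuracy = np.mean(((X_sgd @ w + b) > 0) == (y_sgd == 1))
print(f"weights={w}, bias={b:.3f}, training accuracy={accuracy:.3f}")
```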

---

### 9. NB: Multinomial Naive Bayes

- **Probabilistic Framework:**  
  - Applies Bayes' theorem under the assumption that features are conditionally independent given the class (this is why it's called "naive").
  - The probability of a class $C$ given features $\mathbf{x} = (x_1, x_2, \dots, x_n)$ is computed as:  
    $$
    P(C \mid \mathbf{x}) = \frac{P(C)\prod_{i=1}^{n}P(x_i \mid C)}{P(\mathbf{x})}
    $$

- **Multinomial Variant:**  
  - Designed for discrete features such as word counts in text classification.
  - Works well when features represent the frequency of occurrence and can also work with fractional counts (for instance, tf-idf scores).

> **Note:** The independence assumption is a simplification that may not capture all interactions among features; however, the classifier performs effectively in many practical scenarios.
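
The posterior formula can be evaluated by hand on a tiny count matrix. The sketch below uses Laplace smoothing with `alpha = 1.0` and works in log space; the vocabulary, counts, and labels are hypothetical.

```python
import numpy as np

# Rows = documents, columns = counts of the hypothetical words
# ["beneficio", "prazo", "recurso"]
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 2, 3],
                     [1, 3, 2]])
y_nb = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

classes = np.unique(y_nb)
log_prior = np.log(np.array([(y_nb == c).mean() for c in classes]))  # log P(C)

# Smoothed per-class word probabilities P(x_i | C)
word_counts = np.array([X_counts[y_nb == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

x_new = np.array([2, 0, 0])  # new document: "beneficio" appears twice
log_posterior = log_prior + log_likelihood @ x_new  # unnormalized log P(C | x)
predicted = classes[np.argmax(log_posterior)]
print(log_posterior, predicted)
```

The denominator $P(\mathbf{x})$ is the same for every class, so it can be dropped when only the most probable class is needed.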

---

### 10. LinearSVC: Linear Support Vector Classification

- **Non-Probability Estimation Version:**  
  - Similar to CalibratedClassifierCV(LinearSVC), LSVC seeks the optimal hyperplane but does not directly yield probability scores.
  
- **Comparison with Calibration:**  
  - When used in combination with calibration methods, the same model can produce reliable probability estimates, making it more versatile in applications that require probabilistic outputs.

---

### 11. KNN: K-Nearest Neighbors Classifier

- **Instance-Based Learning:**  
  - The classifier does not build an explicit model; rather, it stores the training instances.
  - Classification is performed by comparing a new instance with the stored instances.

- **Distance Metric:**  
  - It typically uses the Euclidean distance to measure the closeness between instances:  
    $$
    d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}
    $$

- **"K" Parameter and Voting:**  
  - The value of $K$ determines how many neighbors are taken into account.
  - The new instance is assigned the class that is most common among its $K$ nearest neighbors.

<p align="center">
  <img src="images/knn.png" alt="" style="width: 40%; height: 40%"/>
</p>
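
A minimal from-scratch sketch of the procedure, using the Euclidean distance above and a majority vote. The helper `knn_predict` and the toy points are hypothetical; `KNeighborsClassifier` provides the same behavior with faster neighbor search.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                          # indices of the k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


X_knn = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_knn = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_knn, y_knn, np.array([0.3, 0.3]))
pred_near_five = knn_predict(X_knn, y_knn, np.array([4.5, 4.5]))
print(pred_near_origin, pred_near_five)  # 0 1
```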

---

### 12. DT: Decision Tree Classifier  
### 13. ET: Extra Trees Classifier

- **Decision Trees:**
  - Construct a tree-like model where each internal node represents a test on a feature.
  - The tree is built using a greedy approach, choosing the best feature and split point based on metrics such as information gain or Gini impurity.
  - While intuitive, decision trees can easily become complex and overfit the training data if not pruned.

- **Extra Trees:**
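  - An ensemble of randomized decision trees (Extremely Randomized Trees), similar to Random Forest but with additional randomness:
    - A random subset of features is considered at each split.
    - Instead of searching for the optimal threshold, a threshold is drawn at random for each candidate feature, and the best of these random splits is used.
  - This extra randomness reduces variance, usually at the cost of slightly higher bias, and makes training fast because no exhaustive threshold search is needed.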

> **Important:** Both DT and ET can be effective as base models. Their performance can be enhanced significantly when used in ensemble methods.
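
To illustrate how a greedy split is scored, the sketch below computes Gini impurity for a parent node and for two candidate splits. The helper names and the label lists are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect_split = weighted_gini([0, 0, 0, 0], [1, 1, 1, 1])
useless_split = weighted_gini([0, 0, 1, 1], [0, 0, 1, 1])

print(gini(parent), perfect_split, useless_split)  # 0.5 0.0 0.5
```

A decision tree picks the split with the lowest weighted impurity; Extra Trees applies the same scoring but only to randomly drawn thresholds.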

### Metrics Explanation

Let's explore the key metrics used for evaluating classification models, including their mathematical formulas and Python code fragments. Understanding these metrics and their calculations will help you interpret your model's performance more effectively.

#### 1. F1 Score
```python
f1 = f1_score(y, pred, average='micro')
```
The **F1 Score** is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall, making it particularly useful when you need to consider both false positives and false negatives.

- **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
- **Recall**: The ratio of correctly predicted positive observations to all observations in the actual class.

**Formula**:
$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Where:
- $\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
- $\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$

TP = True Positives, FP = False Positives, FN = False Negatives

**Range**: 0 to 1, where 1 is the best possible score.

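When using `average='micro'`, the F1 score is calculated globally by counting the total true positives, false negatives, and false positives across all classes. In a single-label multi-class problem every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 equals accuracy; this is why the F1 and ACC values coincide in the results reported in this notebook.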

> **Note**: The F1 Score is especially useful in scenarios with imbalanced classes, where one class is significantly more frequent than others. However, if true negatives are essential, consider using the Matthews Correlation Coefficient (MCC) instead.
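
The effect of the averaging mode is easy to check on a small example. The labels below are hypothetical: class 0 is frequent and classes 1 and 2 are rare.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

micro = f1_score(y_true, y_pred, average="micro")  # pooled over all classes
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
acc = accuracy_score(y_true, y_pred)

print(f"micro-F1={micro:.4f}  accuracy={acc:.4f}  macro-F1={macro:.4f}")
```

Micro-averaged F1 matches accuracy (0.8 here), while macro-averaged F1 is lower because the two rare classes each have a recall of only 0.5. When minority-class performance matters, macro averaging or the per-class scores are more informative.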

---

#### 2. Balanced Accuracy
```python
bacc = balanced_accuracy_score(y, pred)
```
**Balanced Accuracy** is the average of recall obtained on each class. This metric is particularly effective for imbalanced datasets, where some classes are underrepresented.

- **Recall (Sensitivity)**: The ability of the classifier to find all the positive samples.

**Formula**:
$$\text{Balanced Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{TP}_i}{\text{TP}_i + \text{FN}_i}$$

Where $n$ is the number of classes, and $\text{TP}_i$ and $\text{FN}_i$ are the true positives and false negatives for class $i$, respectively.

**Range**: 0 to 1, where 1 indicates perfect accuracy on all classes.

> **Example**: In a medical diagnosis scenario, balanced accuracy helps ensure that the model performs well across both common and rare conditions.
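
As a check on the formula, the sketch below recomputes balanced accuracy and accuracy from the confusion matrix that the training cell in this notebook prints for Calibrated-LSVC. Only NumPy is used; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported for Calibrated-LSVC (rows = true class, columns = predicted)
cm = np.array([[1474, 14, 3],
               [23, 286, 19],
               [24, 41, 116]])

recall_per_class = np.diag(cm) / cm.sum(axis=1)  # TP_i / (TP_i + FN_i)
balanced_accuracy = recall_per_class.mean()
accuracy = np.diag(cm).sum() / cm.sum()

print(recall_per_class.round(4))
print(f"balanced accuracy={balanced_accuracy:.4f}, accuracy={accuracy:.4f}")
```

The result reproduces the reported BACC of 0.8338 and ACC of 0.9380. The gap between the two comes from the low recall on class 2, which accuracy hides because that class has few samples.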

---

#### 3. Accuracy
```python
acc = accuracy_score(y, pred)
```
**Accuracy** measures the proportion of correct predictions out of the total number of samples. It is a straightforward metric but can be misleading in the presence of class imbalance.

**Formula**:
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

Where TN = True Negatives

**Range**: 0 to 1, where 1 means all predictions are correct.

> **Important**: In datasets with imbalanced classes, accuracy might give an inflated sense of performance if the model is biased towards the majority class.

---

#### 4. Classification Report
```python
cr = classification_report(y, pred)
```
The **Classification Report** provides a complete summary of various classification metrics for each class, including:

- **Precision**: How many selected items are relevant.
- **Recall**: How many relevant items are selected.
- **F1-Score**: The harmonic mean of precision and recall.
- **Support**: The number of actual occurrences of the class in the dataset.

The formulas for these metrics are:

- **Precision**: $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$
- **Recall**: $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **F1-Score**: $$2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

> **Usage**: This report shows you detailed insights into the performance of your model on a per-class basis, helping identify which classes are well-predicted and which are not.

---

#### 5. Matthews Correlation Coefficient (MCC)
```python
mcc = matthews_corrcoef(y, pred)
```
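The **Matthews Correlation Coefficient (MCC)** measures classification quality using all four cells of the confusion matrix (true and false positives and negatives), and it is regarded as a balanced measure even when classes are imbalanced. The formula below is the binary case; `matthews_corrcoef` also supports multi-class labels through a generalization computed from the full confusion matrix.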

**Formula**:
$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$

**Range**: -1 to 1, where 1 indicates perfect prediction, 0 no better than random prediction, and -1 total disagreement.

> **Analogy**: Think of MCC as a correlation coefficient between the observed and predicted binary classifications. It provides a more informative and truthful score in scenarios where class imbalance is a concern.
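
The binary formula can be applied directly to confusion-matrix counts. In this sketch the helper `mcc_binary` and both sets of counts are hypothetical; returning 0 when the denominator is zero is a convention adopted here for the degenerate case.

```python
import math


def mcc_binary(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC from confusion-matrix counts; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical test set with 5 positives and 95 negatives
always_negative = dict(tp=0, tn=95, fp=0, fn=5)  # predicts "negative" for everything
reasonable = dict(tp=4, tn=90, fp=5, fn=1)       # finds most positives

for name, counts in [("always negative", always_negative), ("reasonable", reasonable)]:
    acc = (counts["tp"] + counts["tn"]) / sum(counts.values())
    print(f"{name}: accuracy={acc:.2f}, MCC={mcc_binary(**counts):.4f}")
```

The trivial model has the higher accuracy (0.95 versus 0.94) but an MCC of 0, while the model that actually detects positives scores about 0.57.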

---

#### 6. Confusion Matrix
```python
cm = confusion_matrix(y, pred)
```
A **Confusion Matrix** is a table used to describe the performance of a classification model. It provides a breakdown of correct and incorrect predictions by each class.

For a binary classification problem, it takes the form:

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

From this matrix, various metrics can be derived:

- **Sensitivity (Recall)**: $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **Specificity**: $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$
- **Precision**: $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$

> **Visualization**: The confusion matrix helps visualize the performance of your model, making it easier to identify which classes are being misclassified.
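
One practical detail: `confusion_matrix` orders rows and columns by sorted label, so for labels `0` and `1` the negative class comes first. The sketch below, on hypothetical labels, shows the default layout and how to obtain the positive-first layout of the table above.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Default label order is sorted ([0, 1]), so the layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# Passing labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]]
cm_pos_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_pos_first)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, precision={precision:.2f}")
```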


### Goodhart's Law and the Misuse of Classification Metrics

> "When a measure becomes a target, it ceases to be a good measure." - Goodhart's Law

Goodhart's Law serves as a crucial reminder in machine learning: blindly optimizing for a single metric can lead to unintended and undesirable outcomes. This is especially relevant in classification tasks, where metrics are the main way we gauge model performance. The law highlights that once a metric is used as the primary goal for optimization, its correlation with the actual objective may degrade or vanish entirely.

#### Real-World Ramifications Beyond Just Numbers

While classification metrics offer quantitative feedback on model behavior, their interpretation necessitates a broader understanding of the problem's real-world context.  Overemphasis on metrics alone can obscure critical practical and ethical considerations:

1.  **Context-Specific Metric Selection**: The relevance of a classification metric is intrinsically linked to the specific problem we aim to solve and its associated costs. For example, in medical diagnosis, missing a disease (false negative) has far graver consequences than incorrectly indicating a disease when absent (false positive), necessitating a metric focus beyond just overall accuracy. Consider the costs associated with each type of error. Let $C_{FN}$ be the cost of a false negative and $C_{FP}$ be the cost of a false positive. The total cost can be represented as:

    $TotalCost = C_{FN} \times FN + C_{FP} \times FP$

    where $FN$ is the number of false negatives and $FP$ is the number of false positives.  In scenarios like disease detection, minimizing $TotalCost$ may prioritize reducing false negatives, even at the expense of increased false positives, rather than simply maximizing accuracy.

2.  **Multi-faceted Evaluation is Key**: A single metric rarely paints a complete picture of model efficacy. A more robust evaluation examines a range of metrics together. For example, a model might exhibit high accuracy but perform poorly in terms of precision or recall in specific classes, revealing class imbalance issues or other limitations not apparent from accuracy alone.

3.  **Ethical and Societal Dimensions**: Metrics are inherently numerical and may fail to capture the broader ethical and societal consequences of classification systems. For instance, a highly accurate predictive policing model might still perpetuate or even magnify biases present in historical crime data, leading to unfair or discriminatory outcomes for certain demographic groups.  The fairness and transparency of the model are aspects that go beyond standard classification metrics.

4.  **Interpretability vs. Optimization**: While optimizing for metrics is important, some highly optimized models, particularly complex ones, can become "black boxes". This lack of interpretability can be problematic in domains requiring transparency and understanding of the decision process, such as healthcare or finance. In these fields, understanding *why* a model makes a certain prediction is as important as the prediction's accuracy.
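
The cost formula from item 1 can be turned into a quick comparison. All numbers below are hypothetical: two models evaluated on the same 1,000 patients, with a missed disease assumed to cost 50 times more than a false alarm.

```python
def total_cost(fn: int, fp: int, cost_fn: float, cost_fp: float) -> float:
    """TotalCost = C_FN * FN + C_FP * FP."""
    return cost_fn * fn + cost_fp * fp


# Hypothetical error counts on 1,000 patients
model_a = {"fn": 10, "fp": 5}   # 15 errors, accuracy 0.985
model_b = {"fn": 2, "fp": 40}   # 42 errors, accuracy 0.958

# Assumed costs per error type
C_FN, C_FP = 50.0, 1.0

cost_a = total_cost(model_a["fn"], model_a["fp"], C_FN, C_FP)
cost_b = total_cost(model_b["fn"], model_b["fp"], C_FN, C_FP)
print(f"Model A cost: {cost_a}, Model B cost: {cost_b}")  # 505.0 and 140.0
```

Model A wins on accuracy, yet Model B is far cheaper under these costs, which is the kind of disagreement between a metric and the real objective that Goodhart's Law warns about.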

#### Strategies for Responsible Metric Application

1.  **Domain Expert Collaboration**: Engaging with subject matter experts is crucial for aligning metric selection with real-world objectives. Experts can provide insights into the practical repercussions of different types of errors and help determine which metrics genuinely reflect success in the application domain.

2.  **Employing Multiple Metrics**: To obtain a complete evaluation, utilize a range of metrics. For classification tasks, this might include accuracy, precision, recall, F1-score, AUC-ROC, and others. Examining these metrics in conjunction provides a more nuanced view of model strengths and weaknesses.

3.  **Developing Custom Metrics**: In some cases, standard metrics might not adequately capture the project's specific aims.  Creating custom metrics tailored to the problem can provide a more direct measure of success.  This requires a deep understanding of the problem domain and what constitutes a meaningful measure of performance in that context.

4.  **Evaluation Across Diverse Datasets**:  To ensure model robustness and generalizability, evaluate performance on multiple datasets, reflecting varied real-world conditions. This helps to identify if the model's performance is consistent across different scenarios or if it is overfitting to a specific dataset.

5.  **Real-World Performance Monitoring**:  Continuously monitor model performance after deployment in real-world settings.  Tracking performance in live operation, not just on static test sets, is vital to identify performance drift, data shifts, and ensure the model continues to meet its intended objectives over time.

> Recognizing Goodhart's Law with respect to classification metrics compels us to prioritize solving actual problems over merely optimizing numbers. Metrics are essential tools for model development, but they should serve as guides in our decision-making, not become the ultimate targets themselves. By adopting a broader perspective that encompasses the real-world impact and context of our models, we can strive to build more effective and ethically responsible machine learning solutions.

In [None]:
from typing import List, Tuple


def calculate_evaluation_metrics(
    y_true: pd.Series, y_pred: pd.Series
) -> Tuple[float, float, float, str, float, np.ndarray]:
    """
    Calculate evaluation metrics for model predictions.

    Args:
        y_true (pd.Series): The true labels.
        y_pred (pd.Series): The predicted labels.

    Returns:
        Tuple[float, float, float, str, float, np.ndarray]: The calculated metrics including F1 score, balanced accuracy, accuracy, classification report, Matthews correlation coefficient, and confusion matrix.
    """
    # Calculate F1 score
    f1 = f1_score(y_true, y_pred, average="micro")
    # Calculate balanced accuracy
    balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
    # Calculate accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # Generate classification report
    classification_report_str = classification_report(y_true, y_pred)
    # Calculate Matthews correlation coefficient
    matthews_corr_coeff = matthews_corrcoef(y_true, y_pred)
    # Generate confusion matrix
    confusion_matrix_arr = confusion_matrix(y_true, y_pred)

    return (
        f1,
        balanced_accuracy,
        accuracy,
        classification_report_str,
        matthews_corr_coeff,
        confusion_matrix_arr,
    )


def train_and_evaluate_models(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_valid: pd.DataFrame,
    y_valid: pd.Series,
    n_jobs: int = -1,
) -> Tuple[pd.DataFrame, List[List]]:
    """
    Train multiple models and evaluate their performance.

    Args:
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_valid (pd.DataFrame): The validation data.
        y_valid (pd.Series): The validation labels.
        n_jobs (int, optional): The number of jobs to run in parallel. Defaults to -1.

    Returns:
        Tuple[pd.DataFrame, List[List]]: A dataframe of the evaluation results and a list of classification reports.
    """
    # Define the models to be trained
    models = [
        (
            "Calibrated-LSVC",
            CalibratedClassifierCV(
                LinearSVC(random_state=271828, class_weight="balanced", dual="auto")
            ),
        ),
        (
            "LR",
            LogisticRegression(
                random_state=271828, n_jobs=n_jobs, class_weight="balanced"
            ),
        ),
        (
            "RF",
            RandomForestClassifier(
                random_state=271828, n_jobs=n_jobs, class_weight="balanced"
            ),
        ),
        (
            "LGBM",
            LGBMClassifier(
                random_state=271828, n_jobs=n_jobs, class_weight="balanced", verbose=-1
            ),
        ),
        (
            "XGB",
            # XGBClassifier does not take class_weight; reweight classes via
            # sample_weight in fit() if needed.
            XGBClassifier(random_state=271828, n_jobs=n_jobs, verbosity=0),
        ),
        ("MLP", MLPClassifier(random_state=271828)),
        (
            "SGD",
            SGDClassifier(random_state=271828, n_jobs=n_jobs, class_weight="balanced"),
        ),
        ("NB", MultinomialNB()),
        ("LSVC", LinearSVC(random_state=271828, class_weight="balanced", dual="auto")),
        ("KNN", KNeighborsClassifier(n_jobs=n_jobs)),
        ("DT", DecisionTreeClassifier(random_state=271828, class_weight="balanced")),
        (
            "ExtraTrees",
            ExtraTreesClassifier(
                random_state=271828, n_jobs=n_jobs, class_weight="balanced"
            ),
        ),
    ]

    evaluation_results = []
    classification_reports = []

    # Train each model and evaluate its performance
    for model_name, model in models:
        start_time = time.time()  # Record the start time

        try:
            # Train the model
            model.fit(X_train, y_train)
            # Make predictions on the validation set
            predictions = model.predict(X_valid)
        except Exception as e:
            # Handle any exceptions that occur during training or prediction
            print(f"Error {model_name} - {e}")
            continue

        # Calculate evaluation metrics
        (
            f1,
            balanced_accuracy,
            accuracy,
            classification_report_str,
            matthews_corr_coeff,
            confusion_matrix_arr,
        ) = calculate_evaluation_metrics(y_valid, predictions)
        # Store the classification report and confusion matrix
        classification_reports.append(
            [model_name, classification_report_str, confusion_matrix_arr]
        )

        elapsed_time = time.time() - start_time  # Calculate the elapsed time
        # Append the evaluation results
        evaluation_results.append(
            [
                model_name,
                f1,
                balanced_accuracy,
                accuracy,
                matthews_corr_coeff,
                elapsed_time,
                confusion_matrix_arr,
                classification_report_str,
            ]
        )

        # Print the evaluation results
        print(
            f"Name: {model_name} - F1: {f1:.4f} - BACC: {balanced_accuracy:.4f} - ACC: {accuracy:.4f} - MCC: {matthews_corr_coeff:.4f} - Elapsed: {elapsed_time:.2f}s"
        )
        print(classification_report_str)
        print(confusion_matrix_arr)
        print("*" * 20, "\n")

    # Create a DataFrame to store the evaluation results
    results_df = pd.DataFrame(
        evaluation_results,
        columns=[
            "Model",
            "F1",
            "BACC",
            "ACC",
            "MCC",
            "Total Time",
            "Confusion Matrix",
            "Classification Report",
        ],
    )
    # Convert the confusion matrix to a string for better readability in the DataFrame
    results_df["Confusion Matrix"] = results_df["Confusion Matrix"].apply(str)

    return results_df, classification_reports

In [None]:
df_results, creports = train_and_evaluate_models(
    X_train, y_train, X_val, y_val, n_jobs=-1
)

Name: Calibrated-LSVC - F1: 0.9380 - BACC: 0.8338 - ACC: 0.9380 - MCC: 0.8456 - Elapsed: 8.46s
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1491
           1       0.84      0.87      0.86       328
           2       0.84      0.64      0.73       181

    accuracy                           0.94      2000
   macro avg       0.88      0.83      0.85      2000
weighted avg       0.94      0.94      0.94      2000

[[1474   14    3]
 [  23  286   19]
 [  24   41  116]]
******************** 

Name: LR - F1: 0.9255 - BACC: 0.8922 - ACC: 0.9255 - MCC: 0.8309 - Elapsed: 6.81s
              precision    recall  f1-score   support

           0       1.00      0.95      0.97      1491
           1       0.82      0.85      0.84       328
           2       0.66      0.88      0.75       181

    accuracy                           0.93      2000
   macro avg       0.83      0.89      0.85      2000
weighted avg       0.94      0.93     



Name: LGBM - F1: 0.9565 - BACC: 0.9092 - ACC: 0.9565 - MCC: 0.8942 - Elapsed: 961.59s
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      1491
           1       0.89      0.89      0.89       328
           2       0.83      0.86      0.84       181

    accuracy                           0.96      2000
   macro avg       0.90      0.91      0.91      2000
weighted avg       0.96      0.96      0.96      2000

[[1467   15    9]
 [  14  291   23]
 [   4   22  155]]
******************** 

Name: XGB - F1: 0.9555 - BACC: 0.8957 - ACC: 0.9555 - MCC: 0.8910 - Elapsed: 115.46s
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1491
           1       0.87      0.89      0.88       328
           2       0.87      0.81      0.84       181

    accuracy                           0.96      2000
   macro avg       0.91      0.90      0.90      2000
weighted avg       0.96      0.96      0.96 

In [None]:
df_results.sort_values(by="MCC", ascending=False)

|    | Model | F1 | BACC | ACC | MCC | Total Time | Confusion Matrix | Classification Report |
|----|-------|----|------|-----|-----|------------|------------------|-----------------------|
| 3  | LGBM | 0.9565 | 0.909151 | 0.9565 | 0.894218 | 961.593016 | `[[1467 15 9]\n [ 14 291 23]\n [ 4 ...` | precision recall f1-score ... |
| 4  | XGB | 0.9555 | 0.895726 | 0.9555 | 0.890957 | 115.455211 | `[[1472 14 5]\n [ 18 293 17]\n [ 4 ...` | precision recall f1-score ... |
| 8  | LSVC | 0.948 | 0.894947 | 0.948 | 0.874801 | 2.258129 | `[[1460 19 12]\n [ 12 284 32]\n [ 2 ...` | precision recall f1-score ... |
| 6  | SGD | 0.946 | 0.89742 | 0.946 | 0.870491 | 0.64903 | `[[1458 15 18]\n [ 14 276 38]\n [ 2 ...` | precision recall f1-score ... |
| 5  | MLP | 0.9405 | 0.868181 | 0.9405 | 0.853977 | 47.054498 | `[[1461 22 8]\n [ 22 281 25]\n [ 13 ...` | precision recall f1-score ... |
| 0  | Calibrated-LSVC | 0.938 | 0.833811 | 0.938 | 0.845626 | 8.464702 | `[[1474 14 3]\n [ 23 286 19]\n [ 24 ...` | precision recall f1-score ... |
| 1  | LR | 0.9255 | 0.89225 | 0.9255 | 0.830926 | 6.807154 | `[[1413 39 39]\n [ 5 279 44]\n [ 1 ...` | precision recall f1-score ... |
| 2  | RF | 0.9315 | 0.8033 | 0.9315 | 0.830485 | 0.849617 | `[[1475 13 3]\n [ 24 292 12]\n [ 19 ...` | precision recall f1-score ... |
| 10 | DT | 0.9105 | 0.815001 | 0.9105 | 0.779225 | 17.517048 | `[[1447 32 12]\n [ 49 239 40]\n [ 10 ...` | precision recall f1-score ... |
| 11 | ExtraTrees | 0.906 | 0.719249 | 0.906 | 0.76533 | 0.539276 | `[[1474 14 3]\n [ 34 282 12]\n [ 30 ...` | precision recall f1-score ... |


### Ensemble Classifier: Stacking for Enhanced Performance

Ensemble learning in machine learning combines multiple models to achieve better performance than any single model alone. This approach is especially beneficial when dealing with complex classification tasks or when individual models have limitations. Key advantages include:

- **Improved Accuracy:** Combining models can capture different aspects of the data.
- **Enhanced Robustness:** The ensemble method reduces the risk of poor performance from any individual classifier.
- **Handling Complex Patterns:** Multiple models may learn different relationships in the data, improving overall prediction capability.

### The Stacking Classifier

The `StackingClassifier` is an ensemble method that integrates several models in two stages:

#### Two Levels of Stacking

1. **Base Classifiers:**
   - Multiple models are first trained on the original dataset.
   - They can be of different types (e.g., linear and non-linear) to bring varied perspectives to the problem.

2. **Meta-Classifier:**
   - This higher-level model is trained on the predictions provided by the base classifiers.
   - Rather than simply averaging or voting on the outputs, the meta-classifier learns the best way to combine these predictions for the final decision.

Mathematically, this can be expressed as:

$$
\hat{y} = f\Big( g_1(x),\; g_2(x),\; \dots,\; g_m(x) \Big)
$$

where:
- $ g_1(x), g_2(x), \dots, g_m(x) $ are the predictions (or probabilities) of the base classifiers.
- $ f $ represents the meta-classifier that outputs the final prediction $ \hat{y} $.

#### Workflow of Stacking

1. **Training the Base Classifiers:**
   - Each base model is trained using the full set of original features.
   
2. **Generating Meta-Features:**
   - Base classifiers predict outcomes on a separate hold-out set or via cross-validation. These predictions form a new set of features, often referred to as meta-features.
   
3. **Training the Meta-Classifier:**
   - The meta-classifier is trained on the meta-features to learn how to best combine the base models' predictions.
   
4. **Predicting New Data:**
   - For a new instance, base classifiers generate predictions that are passed to the meta-classifier, which then produces the final output.
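
The four workflow steps can be sketched manually with `cross_val_predict`, which produces out-of-fold predictions so that the meta-classifier never sees a prediction made on data the base model was trained on. This is an illustrative sketch on synthetic data with two arbitrarily chosen base models; `StackingClassifier` automates the same procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_s, y_s = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.25, random_state=0, stratify=y_s
)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]

# Step 2: out-of-fold class probabilities become the meta-features
meta_train = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
    for m in base_models
])

# Step 1: fit each base model on the full training set for use at prediction time
for m in base_models:
    m.fit(X_tr, y_tr)

# Step 3: train the meta-classifier on the meta-features
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# Step 4: new data flows through the base models, then the meta-classifier
meta_test = np.hstack([m.predict_proba(X_te) for m in base_models])
y_hat = meta_model.predict(meta_test)
print(meta_train.shape, (y_hat == y_te).mean())
```

Each base model contributes one column per class, so two base models on a three-class problem yield six meta-features.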

### Choosing the Right Classifiers for Stacking

Selecting diverse classifiers is key to effective stacking. Consider the following:

- **Algorithm Diversity:**  
  Use classifiers that rely on different fundamental assumptions. For example, combine linear models (such as Logistic Regression) with non-linear ones (such as Decision Trees).

- **Strengths and Weaknesses:**  
  Choose models with complementary capabilities. One classifier might handle missing data well, while another might excel at capturing complex interactions among features.

- **Probability Predictions:**  
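  Prefer base classifiers that expose probability estimates (`predict_proba`), since class probabilities give the meta-classifier more information than hard labels. In scikit-learn, `StackingClassifier` with the default `stack_method="auto"` falls back to `decision_function` and then to `predict` when `predict_proba` is unavailable.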


### Practical Considerations

When applying stacking, it is important to balance performance with potential challenges:

- **Computational Cost:**  
  Training multiple base models and a meta-model can be resource-intensive, particularly with large datasets or numerous classifiers.

- **Overfitting Risks:**  
  The method requires careful use of cross-validation or hold-out datasets to ensure that the meta-classifier does not overfit the data used to train the base classifiers.

- **Interpretability:**  
  With increased complexity from multiple models, understanding the decision process becomes more difficult. This trade-off is common when aiming for improved performance.

##### Implementation Note

For classifiers lacking `predict_proba()` (e.g., LinearSVC), the `CalibratedClassifierCV` wrapper can be used:

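```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
# Wrap the SVM so that it exposes predict_proba() through cross-validated calibration
calibrated_svc = CalibratedClassifierCV(LinearSVC())
```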

This calibration step adds probability estimation capabilities to otherwise non-probabilistic classifiers.


In [None]:
# Initialize individual models with specific parameters
random_forest_model = RandomForestClassifier(
    random_state=271828, n_jobs=-1, class_weight="balanced"
)
lgbm_model = LGBMClassifier(
    random_state=271828, n_jobs=-1, class_weight="balanced", verbose=-1
)
lsvc_model = LinearSVC(random_state=271828, class_weight="balanced", dual="auto")

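# List of base estimators for stacking.
# Note: lsvc_model is a plain LinearSVC (no CalibratedClassifierCV wrapper), so
# StackingClassifier's default stack_method="auto" uses its decision_function output.
base_estimators = [
    ("random_forest", random_forest_model),
    ("lsvc", lsvc_model),
    ("lgbm", lgbm_model),
]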

# Initialize the stacking classifier with base estimators and a final estimator
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(
        random_state=271828, n_jobs=-1, class_weight="balanced"
    ),
    n_jobs=2,
    cv=5,
)

# Fit the stacking model on the training data
stacking_model.fit(X_train, y_train)

# Predict on the validation data
predictions = stacking_model.predict(X_val)

# Calculate evaluation metrics for the predictions
(
    f1_score_val,
    balanced_accuracy_val,
    accuracy_val,
    classification_report_val,
    matthews_corr_coeff_val,
    confusion_matrix_val,
) = calculate_evaluation_metrics(y_val, predictions)

# Print the evaluation metrics
print(
    f"F1: {f1_score_val:.4f} - BACC: {balanced_accuracy_val:.4f} - ACC: {accuracy_val:.4f} - MCC: {matthews_corr_coeff_val:.4f}"
)
print(classification_report_val)
print(confusion_matrix_val)

# Took 14 minutes to run on a 48-core CPU
# Previous best MCC was 0.894218
# Best MCC is now 0.8947



F1: 0.9560 - BACC: 0.9227 - ACC: 0.9560 - MCC: 0.8947
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      1491
           1       0.89      0.88      0.88       328
           2       0.80      0.91      0.85       181

    accuracy                           0.96      2000
   macro avg       0.89      0.92      0.91      2000
weighted avg       0.96      0.96      0.96      2000

[[1459   20   12]
 [  10  288   30]
 [   1   15  165]]




As we've seen above, combining multiple classifiers into a single model can improve our results, at the cost of increased training time.



## Hyperparameter Optimization

<p align="center">
  <img src="images/hyperparameter_optimization.png" alt="" style="width: 80%; height: 80%"/>
</p>

Hyperparameter optimization is the process of selecting the best set of hyperparameters for a machine learning model. Hyperparameters are configuration settings that are fixed before training (for example, the regularization strength or the loss function) rather than learned from the data, and they influence how the model trains and performs. Choosing them well can improve evaluation metrics such as accuracy or validation loss.

The optimization problem is often expressed mathematically as:

$$
\mathcal{x}^* = \arg \min_{\mathcal{x}} f(\mathcal{x})
$$

where:

- **$\mathcal{x}^*$** is the optimal set of hyperparameters.
- **$\mathcal{x}$** is a candidate hyperparameter configuration drawn from the hyperparameter space.
- **$f(\mathcal{x})$** is the objective function that quantifies model performance (for example, validation loss or error) and is to be minimized.

> **Note:** The choice of $f(\mathcal{x})$ reflects what is crucial to the task at hand, such as minimizing loss, maximizing accuracy, or fine-tuning another performance metric.

There are several common methods used for hyperparameter optimization. Each method has trade-offs between thoroughness, computation time, and simplicity.


### Methods of Hyperparameter Optimization

<p align="center">
  <img src="images/hyperparameter_optimization2.jpg" alt="" style="width: 80%; height: 80%"/>
</p>

#### 1. Grid Search

Grid Search performs an exhaustive evaluation by checking every combination within a pre-defined hyperparameter grid. This method is straightforward but may be extremely time-consuming when the hyperparameter space is large.

- **Exhaustive Search:** Every possible combination within the specified ranges is evaluated.
- **Computational Cost:** Requires extensive computation, particularly when many hyperparameters or large datasets are involved.
- **Appropriate Use:** Best applied when the number of hyperparameters is limited or when sufficient computational resources are available.
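
To make the cost of an exhaustive search concrete, the sketch below enumerates a grid with `itertools.product` and counts the resulting model fits. The grid mirrors the `SGDClassifier` grid searched later in this notebook; the variable names and the 3-fold setting are only illustrative.

```python
from itertools import product

# Same grid shape as the SGDClassifier search used later in this notebook
param_grid = {
    "loss": ["hinge", "log_loss", "squared_hinge", "modified_huber"],
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    "max_iter": [1000, 2000, 3000, 4000, 5000],
    "penalty": ["l2", "l1", "elasticnet"],
}

# The Cartesian product of all value lists is every configuration grid search tries
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 3
n_fits = len(combinations) * cv_folds
print(len(combinations), "configurations,", n_fits, "cross-validation fits")
```

Adding one more hyperparameter with five candidate values would multiply both numbers by five, which is why grid search scales poorly with the number of hyperparameters.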

#### 2. Random Search

Random Search selects hyperparameter combinations at random from the defined space. This approach searches a diverse set of configurations and can often yield good results faster than Grid Search.

- **Random Sampling:** Instead of checking all combinations, a predetermined number of randomly chosen configurations are evaluated.
- **Efficiency:** Can discover high-performing settings without the heavy computational burden of an exhaustive search.
- **Intended Use:** Useful when dealing with a large hyperparameter space and when computational resources are more constrained.
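
As a rough sketch of the sampling step, the function below draws a fixed number of configurations by picking each hyperparameter value independently at random. The helper name `sample_configurations` and the seed are hypothetical. Note that this simple version samples with replacement, whereas scikit-learn's `RandomizedSearchCV` samples without replacement when every parameter is given as a list.

```python
import random


def sample_configurations(space, n_iter, seed=0):
    """Draw n_iter configurations, choosing each value independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


# Illustrative search space (same shape as the grid used in this notebook)
space = {
    "loss": ["hinge", "log_loss", "squared_hinge", "modified_huber"],
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    "max_iter": [1000, 2000, 3000, 4000, 5000],
    "penalty": ["l2", "l1", "elasticnet"],
}

configs = sample_configurations(space, n_iter=60)
print(len(configs), "configurations sampled; first one:", configs[0])
```

Only the sampled configurations are trained and scored, so the budget is set by `n_iter` rather than by the size of the space.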

#### 3. Bayesian Optimization

Bayesian Optimization is a model-based technique that sequentially selects hyperparameter configurations based on prior evaluations.

- **Probabilistic Modeling:** Uses models (such as Gaussian Processes) to estimate the performance $f(\mathcal{x})$ across the hyperparameter space.
- **Sequential Selection:** Chooses the next set of hyperparameters to evaluate based on an acquisition function that balances the exploration of uncertain regions with the exploitation of known good areas.
- **Efficiency:** Usually requires fewer evaluations compared to grid or random search, especially when individual evaluations are expensive.
- **Ideal Scenarios:** Best used when the search space is complex and model evaluations are computationally costly.


To save time, we'll run each kind of optimization on the hyperparameters of our `SGDClassifier`, one of the fastest of our top-performing classifiers. You may want to optimize whichever classifier performs best for you instead, but this is a good starting point.

#### Performing grid search

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.linear_model import SGDClassifier
from typing import Dict, Any
import pandas as pd
import warnings
from sklearn.exceptions import ConvergenceWarning


def perform_grid_search(
    model: Any,
    param_grid: Dict[str, Any],
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_val: pd.DataFrame,
    y_val: pd.Series,
) -> None:
    """
    Perform grid search to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (Any): The model to be trained.
        param_grid (Dict[str, Any]): The parameter grid for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the grid search using Matthews correlation coefficient
    scorer_mcc = make_scorer(matthews_corrcoef)

    # Create the GridSearchCV object with the model and parameter grid
    grid_search = GridSearchCV(model, param_grid, cv=3, scoring=scorer_mcc, n_jobs=-1)

    # Suppress ConvergenceWarning during grid search
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)

        # Perform the grid search by fitting the training data
        grid_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the grid search
    print("Best parameters: ", grid_search.best_params_)
    print("Best score: ", grid_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set.
    # Note: for classifiers, .score() returns mean accuracy, not MCC.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        best_model = grid_search.best_estimator_
        validation_score = best_model.score(X_val, y_val)
    print("Validation score: ", validation_score)


# Define the SGDClassifier model with specific parameters
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight="balanced")

# Define parameter grid for the search
search_parameters = {
    "loss": [
        "hinge",
        "log_loss",
        "squared_hinge",
        "modified_huber",
    ],  # Different loss functions
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],  # Regularization strength
    "max_iter": [1000, 2000, 3000, 4000, 5000],  # Number of iterations
    "penalty": ["l2", "l1", "elasticnet"],  # Regularization types
}

# Perform grid search
perform_grid_search(sgd_model, search_parameters, X_train, y_train, X_val, y_val)

# Took 17 minutes to run on a 48-core CPU
# The grid has 4 * 5 * 5 * 3 = 300 parameter combinations.
# With 3-fold cross-validation, that is 900 model fits (plus one final refit).

# Best parameters:  {'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 1000, 'penalty': 'l1'}
# Validation score (accuracy, from .score()):  0.947
# The best validation MCC for the SGDClassifier so far was 0.870491 (not comparable to the accuracy above)



Best parameters:  {'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 1000, 'penalty': 'l1'}
Best score:  0.8545240719127332
Validation score:  0.947
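
A note on reading these numbers: the search itself is scored with MCC, but calling `.score()` on a scikit-learn classifier reports mean accuracy, and the two metrics can differ noticeably on the same predictions. As an illustration, the sketch below recomputes both from the confusion matrix reported for the stacking model earlier in this notebook, using the standard multiclass MCC formula. It uses only the standard library; the variable names are illustrative.

```python
import math

# Confusion matrix reported for the stacking model (rows: true, columns: predicted)
cm = [
    [1459, 20, 12],
    [10, 288, 30],
    [1, 15, 165],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)                 # s: number of samples
correct = sum(cm[k][k] for k in range(n_classes))   # c: correctly classified samples
true_counts = [sum(row) for row in cm]              # t_k: samples per true class
pred_counts = [
    sum(cm[i][k] for i in range(n_classes)) for k in range(n_classes)
]                                                   # p_k: samples per predicted class

accuracy = correct / total

# Multiclass MCC:
# (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
denominator = math.sqrt(
    (total**2 - sum(p * p for p in pred_counts))
    * (total**2 - sum(t * t for t in true_counts))
)
mcc = numerator / denominator

print(f"accuracy = {accuracy:.4f}, MCC = {mcc:.4f}")
```

The result reproduces the reported ACC (0.9560) and MCC (0.8947) for that model. On imbalanced data the accuracy is the higher of the two, so an accuracy should only be compared with another accuracy, and an MCC with another MCC.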


#### Performing Random Search and Understanding the 60 Iterations Rule

In hyperparameter optimization, a common rule of thumb is that **60 iterations** of random search will, with 95% confidence, discover at least one hyperparameter set within the top 5% of performance. This conclusion is reached through a simple probability analysis and remains valid regardless of the total number of possible hyperparameter combinations.

The essence of this rule can be captured by considering each iteration of random search as an independent trial with a chance $ p $ of success, where "success" means finding a hyperparameter set in the top 5%. Here:
   - $ p = 0.05 $ (5% chance)
   - $ n $ is the number of independent trials (iterations)

We wish to find the minimum $ n $ such that the probability of having **at least one success** is at least 95%. Instead of computing the success probability directly, we first calculate the probability of **no success** in $ n $ trials and require it to be less than or equal to 5% (or 0.05).

1. **Probability of no success in $ n $ trials:**
   $ \text{Probability} = (1 - p)^n = (0.95)^n $

2. **Setting up the condition for 95% confidence:** $ (0.95)^n \le 0.05 $

3. **Solving for $ n $ using logarithms:** $ n \cdot \ln(0.95) \le \ln(0.05)$

   - Since $ \ln(0.95) $ is negative, dividing by it reverses the inequality: $n \ge \frac{\ln(0.05)}{\ln(0.95)}$

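   - Evaluating the logarithms yields $n \ge 58.4$, so $n = 59$ is the smallest integer that satisfies the condition.

   - This value is commonly rounded up to **60 iterations** as a convenient rule of thumb; with 60 trials, the probability of at least one success is about 95.4%.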

#### Interpretation and Consequences

- **Independence of Hyperparameter Space Size:**  
  The required number of iterations (approximately 60) does not depend on the overall size of the hyperparameter space. It only depends on the probability per iteration and the desired confidence level.

- **Application of the Geometric Distribution:**  
  The logic parallels that of the geometric distribution, where we generally ask, "How many trials are needed for the first success?" In this context, however, the focus is on achieving a 95% probability of at least one success after a certain number of trials.

- **Confidence Guarantee:**  
  With 60 random search iterations, one can be 95% confident of having found a hyperparameter configuration that falls in the top 5% of the space, assuming the success probability per trial is indeed 0.05.

#### Practical Considerations

- **Efficiency Over Grid Search:**  
  Random search can be more efficient, especially in high-dimensional spaces where many parameters have little effect on performance. It avoids the exhaustive nature of grid search, which may evaluate many non-promising combinations.

- **Sampling the Full Space:**  
  Since random search samples from the entire hyperparameter space, it helps in cases where it is not clear which hyperparameters are most influential.

- **Scalability and Adjustability:**  
  The necessary number of iterations grows logarithmically as the tolerated failure probability shrinks, and roughly in inverse proportion to the target fraction. For instance, to be 99% confident of finding a configuration in the top 1%, the number of iterations required is:
  $$
  n \ge \frac{\ln(0.01)}{\ln(0.99)} \approx 458.2 \quad\Rightarrow\quad n = 459
  $$

- **Lack of Informed Search:**  
  While random search is simple and robust, it does not use information from previous trials. Advanced methods like Bayesian optimization adjust future trials based on past outcomes, potentially leading to a more efficient overall search.

> **Key Note:** The probability model assumes independent trials and constant probability of success in each iteration. If these assumptions are not met (for example, in the presence of dependencies between trials), the rule of thumb may not hold.
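
The calculation behind the rule is easy to reproduce. The sketch below, which uses only the standard library, computes the smallest number of independent trials for a given target fraction and confidence level, under the same independence assumption stated in the note above. The function names are hypothetical.

```python
import math


def min_random_search_iterations(top_fraction, confidence):
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))


def prob_at_least_one_hit(top_fraction, n_iter):
    """Probability that at least one of n_iter independent draws lands in the top fraction."""
    return 1 - (1 - top_fraction) ** n_iter


print(min_random_search_iterations(0.05, 0.95))  # top 5% with 95% confidence -> 59
print(min_random_search_iterations(0.01, 0.99))  # top 1% with 99% confidence -> 459
print(prob_at_least_one_hit(0.05, 60))           # slightly above 0.95
```

Note that neither function takes the size of the search space as an argument, which is the point of the rule: only the target fraction and the desired confidence matter.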

In [None]:
from sklearn.model_selection import RandomizedSearchCV


def perform_random_search(
    model: SGDClassifier,
    param_dist: Dict[str, Any],
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_val: pd.DataFrame,
    y_val: pd.Series,
) -> None:
    """
    Perform random search to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (SGDClassifier): The model to be trained.
        param_dist (Dict[str, Any]): The parameter distribution for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the random search
    mcc_scorer = make_scorer(matthews_corrcoef)

    # Create the RandomizedSearchCV object with the SGDClassifier and parameter distribution
    random_search = RandomizedSearchCV(
        model,
        param_dist,
        cv=3,
        scoring=mcc_scorer,
        n_jobs=-1,
        n_iter=60,
        random_state=271828,
    )

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        # Perform the random search by fitting training data
        random_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the random search
    print("Best parameters: ", random_search.best_params_)
    print("Best score: ", random_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set.
    # Note: for classifiers, .score() returns mean accuracy, not MCC.
    best_model = random_search.best_estimator_
    validation_accuracy = best_model.score(X_val, y_val)
    print("Validation accuracy: ", validation_accuracy)


# Define the SGDClassifier model
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight="balanced")

# Define parameter distribution for the search
search_parameters = {
    "loss": ["hinge", "log_loss", "squared_hinge", "modified_huber"],
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    "max_iter": [1000, 2000, 3000, 4000, 5000],
    "penalty": ["l2", "l1", "elasticnet"],
}

# Perform random search
perform_random_search(sgd_model, search_parameters, X_train, y_train, X_val, y_val)

# This typically runs faster than GridSearchCV because only 60 of the 300 combinations are evaluated.
# Time to run: 3.5 minutes on a 48-core CPU

# Best parameters:  {'penalty': 'l1', 'max_iter': 3000, 'loss': 'hinge', 'alpha': 0.0001}
# Validation accuracy:  0.947
# The best validation MCC for the SGDClassifier so far was 0.870491 (not comparable to the accuracy above)



Best parameters:  {'penalty': 'l1', 'max_iter': 3000, 'loss': 'hinge', 'alpha': 0.0001}
Best score:  0.8545240719127332
Validation accuracy:  0.947


#### Bayesian Optimization for Hyperparameter Tuning

Bayesian optimization is a sophisticated approach to hyperparameter tuning, particularly valuable when dealing with complex models and computationally expensive evaluations. This method intelligently navigates the hyperparameter space by balancing exploration of unknown regions with exploitation of promising areas.

The fundamental principle of Bayesian optimization lies in its ability to learn from previous evaluations and make informed decisions about which hyperparameters to try next. This approach significantly reduces the number of evaluations needed to find optimal or near-optimal hyperparameters.

> **Key Insight**: Bayesian model-based methods can discover superior hyperparameters more efficiently by reasoning about the most promising configurations based on past trials.

##### Visual Understanding

To grasp the essence of Bayesian optimization, consider the following visual representations:

<p align="center">
  <img src="images/bayesian1.webp" alt="" style="width: 40%; height: 40%"/>
</p>

This image illustrates the initial state of the surrogate model (black line with gray uncertainty) after only two evaluations. At this stage, the surrogate model poorly approximates the true objective function (red line).

<p align="center">
  <img src="images/bayesian2.webp" alt="" style="width: 40%; height: 40%"/>
</p>

After eight evaluations, the surrogate model closely matches the true function. This improved approximation allows the algorithm to select hyperparameters that are likely to yield excellent results on the actual evaluation function.

For a clear conceptual introduction to Bayesian optimization, [check this link](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).

##### The Bayesian Approach

Bayesian methods mirror human learning processes:
1. Form an initial view (prior)
2. Update the model based on new experiences (posterior)

In hyperparameter optimization, this framework is applied iteratively to discover good model settings.

##### Key Components of Bayesian Optimization

1. **Objective Function**
   - Evaluates model performance for a given set of hyperparameters
   - Typically uses metrics like accuracy, loss, or cross-validation scores
   - Serves as the ground truth for comparing different configurations

2. **Surrogate Model**
   - Probabilistic approximation of the objective function
   - Often uses Gaussian Processes (GPs) for their ability to provide both predictions and uncertainty estimates
   - Predicts performance of untested hyperparameter sets based on observed results
   - Significantly reduces computational costs by minimizing actual evaluations of the (costly) objective function

3. **Acquisition Function**
   - Guides the selection of the next hyperparameter set to evaluate
   - Balances exploration (investigating new areas) and exploitation (focusing on known good regions)
   - Common types include:
     - Expected Improvement (EI)
     - Probability of Improvement (PI)
     - Upper Confidence Bound (UCB)
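
To show how an acquisition function trades off exploration and exploitation, here is a minimal sketch of Expected Improvement for a maximization problem, assuming the surrogate gives a Gaussian prediction (mean and standard deviation) for each candidate. The function name, the `xi` margin parameter, and the example numbers are illustrative assumptions, not values from this notebook.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected Improvement (maximization) under a Gaussian predictive distribution."""
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly
        return max(improvement, 0.0)
    z = improvement / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # standard normal PDF
    return improvement * cdf + std * pdf


best_score = 0.855  # best objective value observed so far (hypothetical)

# Candidate A: slightly better predicted mean, almost no uncertainty (exploitation)
ei_a = expected_improvement(mean=0.860, std=0.005, best_so_far=best_score)
# Candidate B: worse predicted mean, but high uncertainty (exploration)
ei_b = expected_improvement(mean=0.840, std=0.050, best_so_far=best_score)

print(f"EI(A) = {ei_a:.4f}, EI(B) = {ei_b:.4f}")
```

In this example the uncertain candidate B receives the higher score even though its predicted mean is lower, because there is a meaningful chance that it turns out much better than the current best.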

##### The Optimization Process

1. **Initialization**: Randomly select and evaluate a few initial hyperparameter sets.

2. **Surrogate Model Creation**: Fit a Gaussian Process to the initial data points.

3. **Acquisition Function Calculation**: Compute the acquisition function values across the hyperparameter space.

4. **Next Point Selection**: Choose the hyperparameter set with the highest acquisition function value.

5. **Evaluation**: Assess the chosen hyperparameters using the objective function.

6. **Model Update**: Refit the Gaussian Process with the new data point.

7. **Iteration**: Repeat steps 3-6 until a stopping criterion is met (e.g., performance threshold or iteration limit).
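
The loop above can be sketched in a few dozen lines of NumPy. This is a toy illustration under strong assumptions, not how `BayesSearchCV` is implemented: the search space is a single continuous hyperparameter (`log10(alpha)`), the objective is a made-up smooth function standing in for a cross-validated score, the surrogate is a zero-mean Gaussian Process with a fixed RBF kernel, and the acquisition function is an Upper Confidence Bound with an arbitrary weight of 2.0.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_obs, k_cross)        # K^-1 k*
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_cross * solved, axis=0)    # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_objective(log_alpha):
    """Hypothetical stand-in for a validation score; peaks at log_alpha = -3."""
    return np.exp(-((log_alpha + 3.0) ** 2))


rng = np.random.default_rng(0)
candidates = np.linspace(-4.0, 0.0, 81)             # search space: log10(alpha)

# Step 1: evaluate a few random starting points
x_obs = rng.choice(candidates, size=3, replace=False)
y_obs = toy_objective(x_obs)

for _ in range(10):
    # Steps 2 and 6: (re)fit the surrogate to all observations so far
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    # Step 3: Upper Confidence Bound acquisition over the whole space
    ucb = mean + 2.0 * std
    # Step 4: pick the candidate with the highest acquisition value
    x_next = candidates[np.argmax(ucb)]
    # Step 5: evaluate the true objective and record the result
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, toy_objective(x_next))

best = x_obs[np.argmax(y_obs)]
print(f"best log10(alpha) = {best:.2f}, score = {y_obs.max():.4f}")
```

Early iterations tend to pick points far from any observation, where the surrogate's uncertainty is largest; later ones concentrate near the region the surrogate predicts to be best. Libraries such as scikit-optimize additionally handle categorical and integer hyperparameters and fit the kernel parameters, which this sketch does not attempt.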

##### Advantages and Considerations

- **Efficiency**: Particularly beneficial for expensive-to-evaluate models, minimizing the number of evaluations needed.
- **Adaptability**: Learns from previous evaluations to make informed decisions about future trials.
- **Uncertainty Handling**: Incorporates uncertainty in its decision-making process, leading to a more stable exploration of the hyperparameter space.

> **Note**: While highly effective, Bayesian optimization does not guarantee finding the global optimum, especially in complex or noisy objective functions.

##### Illustrative Analogies

- **Treasure Map Analogy**: Think of the surrogate model as a map of an island. Initially, the map is rough with only a few landmarks (evaluations). As you explore more and add details (new evaluations), the map becomes a better guide for where to search next.

- **Drilling for Oil Analogy**: The acquisition function plays a role similar to deciding where to drill for oil. It weighs the promise of drilling deeper in a known rich area (exploitation) versus trying a new location (exploration) where the potential reward is uncertain but high.

In [None]:
from skopt import BayesSearchCV


def perform_bayesian_optimization(
    model: SGDClassifier,
    param_space: Dict[str, Any],
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_val: pd.DataFrame,
    y_val: pd.Series,
) -> None:
    """
    Perform Bayesian optimization to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (SGDClassifier): The model to be trained.
        param_space (Dict[str, Any]): The parameter space for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the Bayesian optimization
    mcc_scorer = make_scorer(matthews_corrcoef)

    # Create the BayesSearchCV object with the SGDClassifier and parameter space
    bayesian_search = BayesSearchCV(
        model,
        param_space,
        cv=3,
        scoring=mcc_scorer,
        n_jobs=-1,
        n_iter=30,
        random_state=271828,
        n_points=10,
    )

    # Perform the Bayesian optimization by fitting training data
    bayesian_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the Bayesian search
    print("Best parameters: ", bayesian_search.best_params_)
    print("Best score: ", bayesian_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set.
    # Note: for classifiers, .score() returns mean accuracy, not MCC.
    best_model = bayesian_search.best_estimator_
    validation_accuracy = best_model.score(X_val, y_val)
    print("Validation accuracy: ", validation_accuracy)


# Define the SGDClassifier model
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight="balanced")

# Define parameter space for the search
search_parameters = {
    "loss": ["hinge", "log_loss", "squared_hinge", "modified_huber"],
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    "max_iter": [1000, 2000, 3000, 4000, 5000],
    "penalty": ["l2", "l1", "elasticnet"],
}

# Perform Bayesian optimization
perform_bayesian_optimization(
    sgd_model, search_parameters, X_train, y_train, X_val, y_val
)

# This cell took 1 minute to run
# Best parameters:  OrderedDict({'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 5000, 'penalty': 'l1'})
# Validation accuracy:  0.947
# The best validation MCC for the SGDClassifier so far was 0.870491 (not comparable to the accuracy above)



Best parameters:  OrderedDict({'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 5000, 'penalty': 'l1'})
Best score:  0.8545240719127332
Validation accuracy:  0.947


# Problem 2 - Predicting the Time it Takes to Issue an "Acórdão"
Your turn to build a model! In this problem, you'll use the data from the previous problem to predict the time it takes for an appeal to be judged. This is a regression problem, and you'll use the same features as in the previous problem.

In [None]:
# Load our data.
import pandas as pd

df_texts = pd.read_parquet("data/brcad5/texts_sample.parquet.gz")
df_meta = pd.read_parquet("data/brcad5/metadata_sample.parquet.gz")

In [35]:
df_meta

Unnamed: 0,case_number,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id
11,0510491-75.2017.4.05.8103,2017-10-11 00:00:00,INSS,2018-01-16 11:16:19,2018-03-26 17:07:16,6095,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Invalidez,19-CE
16,0509917-52.2017.4.05.8103,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE
19,0501051-55.2017.4.05.8103,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE
23,0511681-76.2017.4.05.8102,2017-09-27 00:00:00,INSS,2017-12-29 02:18:24,2018-03-26 17:07:16,6104,Direito Previdenciário,Benefícios em Espécie,Pensão por Morte (Art. 74/9),30-CE
33,0500777-79.2017.4.05.8107,2017-02-23 00:00:00,INSS,2017-11-27 17:58:46,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),25-CE
...,...,...,...,...,...,...,...,...,...,...
152600,0511177-90.2019.4.05.8202,2019-11-08 00:00:00,INSS,2019-12-16 13:09:08,2020-03-31 11:47:59,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB
152605,0508370-06.2019.4.05.8200,2019-06-13 00:00:00,INSS,2019-10-03 18:01:23,2020-03-31 15:22:15,6096,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Idade (Art. 48/51),7-PB
152610,0508717-73.2018.4.05.8200,2018-06-20 00:00:00,INSS,2019-08-29 16:53:58,2020-03-31 15:22:15,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-PB
152628,0507511-81.2019.4.05.8202,2019-08-22 00:00:00,INSS,2019-11-05 17:10:31,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB


In [36]:
df_texts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 2 to 305281
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   case_number  40000 non-null  object
 1   text         40000 non-null  object
 2   outcome      40000 non-null  object
 3   ruling_type  40000 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB


In [37]:
df

Unnamed: 0,case_number,text,outcome,date_appeal_panel_ruling,outcome_int
5318,0529055-14.2017.4.05.8100,RELATÓRIO Trata-se de recurso interposto pela ...,NÃO PROVIMENTO,2018-03-26 17:02:45,0
19114,0504805-96.2017.4.05.8105,DIREITO PREVIDENCIÁRIO. AUXÍLIO-DOENÇA. LAUDO ...,NÃO PROVIMENTO,2018-03-26 17:07:16,0
13460,0511681-76.2017.4.05.8102,RECURSO INOMINADO. DIREITO PREVIDENCIÁRIO. PEN...,PROVIMENTO,2018-03-26 17:07:16,1
2359,0502738-58.2017.4.05.8106,ADMINISTRATIVO. GRATIFICAÇÃO DE DESEMPENHO. GD...,NÃO PROVIMENTO,2018-03-26 17:07:16,0
1706,0509527-91.2017.4.05.8100,EMENTA: APLICAÇÃO DO INPC À CORREÇÃO MONETÁRIA...,PROVIMENTO,2018-03-26 17:07:16,1
...,...,...,...,...,...
2880,0500987-74.2019.4.05.8200,VOTO-EMENTA ADMINISTRATIVO. CONVERSÃO DE LICEN...,PROVIMENTO,2020-03-31 15:22:15,1
12876,0508370-06.2019.4.05.8200,VOTO – EMENTA PREVIDENCIÁRIO. APOSENTADORIA PO...,NÃO PROVIMENTO,2020-03-31 15:22:15,0
183,0507364-55.2019.4.05.8202,VOTO - EMENTA PREVIDENCIÁRIO. EXTENSÃO DO PERÍ...,PROVIMENTO PARCIAL,2020-04-01 11:24:06,2
19799,0508286-96.2019.4.05.8202,VOTO - EMENTA PREVIDENCIÁRIO. EXTENSÃO DO PERÍ...,PROVIMENTO PARCIAL,2020-04-01 11:24:06,2


In [None]:
df_meta.date_first_instance_ruling = pd.to_datetime(
    df_meta.date_first_instance_ruling, yearfirst=True
)
df_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 11 to 152635
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   case_number                 20000 non-null  object        
 1   filing_date                 20000 non-null  object        
 2   defendant_normalized        20000 non-null  object        
 3   date_first_instance_ruling  20000 non-null  datetime64[ns]
 4   date_appeal_panel_ruling    20000 non-null  datetime64[us]
 5   case_topic_code             20000 non-null  int64         
 6   case_topic_1st_level        20000 non-null  object        
 7   case_topic_2nd_level        20000 non-null  object        
 8   case_topic_3rd_level        19993 non-null  object        
 9   court_id                    20000 non-null  object        
dtypes: datetime64[ns](1), datetime64[us](1), int64(1), object(7)
memory usage: 1.7+ MB


In [None]:
df_meta["time_to_trial_appeal"] = (
    df_meta["date_appeal_panel_ruling"] - df_meta["date_first_instance_ruling"]
)
df_meta["time_to_trial_appeal"] = df_meta["time_to_trial_appeal"].dt.days
df_meta["time_to_trial_appeal"].describe()

count    20000.000000
mean       153.045100
std        267.485493
min          8.000000
25%         55.000000
50%         80.000000
75%        127.000000
max       3599.000000
Name: time_to_trial_appeal, dtype: float64

In [None]:
df = df_texts.query('ruling_type == "SENTENÇA"').merge(
    df_meta, on="case_number", how="left"
)
df.drop_duplicates(subset="case_number", inplace=True)
df

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal
0,0515165-56.2018.4.05.8202,SENTENÇA I - RELATÓRIO Dispensada a feitura do...,IMPROCEDENTE,SENTENÇA,2018-10-19 00:00:00,INSS,2019-04-25 15:55:52,2019-08-20 10:11:35,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",15-PB,116
1,0506287-42.2018.4.05.8300,"SENTENÇA I - RELATÓRIO Dispensado, nos termos ...",PARCIALMENTE PROCEDENTE,SENTENÇA,2018-04-20 00:00:00,INSS,2019-05-02 23:48:08,2019-07-03 18:08:09,6099,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Tempo de Serviço (Art. 52/4),14-PE,61
2,0505610-87.2019.4.05.8102,SENTENÇA I – RELATÓRIO Por força do disposto n...,PROCEDENTE,SENTENÇA,2019-05-06 00:00:00,INSS,2019-07-28 09:57:06,2019-09-26 14:16:00,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,30-CE,60
3,0501720-83.2018.4.05.8100,"SENTENÇA Dispensado o relatório, nos termos do...",IMPROCEDENTE,SENTENÇA,2018-01-24 00:00:00,INSS,2018-08-27 12:39:45,2018-10-31 15:36:41,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-CE,65
4,0501533-63.2018.4.05.8104,"TERMO DE AUDIÊNCIA Aos 18 de setembro de 2018,...",PROCEDENTE,SENTENÇA,2018-05-02 00:00:00,INSS,2018-09-18 14:33:01,2018-11-14 13:50:04,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),22-CE,56
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0500473-94.2019.4.05.8306,SENTENÇA (Tipo C – Sem Resolução do Mérito) I ...,EXTINTO SEM MÉRITO,SENTENÇA,2019-03-14 00:00:00,INSS,2019-04-15 09:12:31,2019-07-22 15:38:11,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",25-PE,98
19996,0505814-16.2014.4.05.8100,SENTENÇA Por força do disposto no art. 38 da L...,IMPROCEDENTE,SENTENÇA,2014-03-21 00:00:00,CEF,2014-03-28 10:32:04,2018-07-06 11:33:24,7691,Direito Civil,Obrigações,Inadimplemento,14-CE,1561
19997,0500153-84.2018.4.05.8304,SENTENÇA I. Relatório Ante o disposto no art. ...,IMPROCEDENTE,SENTENÇA,2018-01-24 00:00:00,INSS,2018-03-23 08:35:39,2018-05-23 14:42:26,6177,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Concessão,20-PE,61
19998,0511210-80.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-11-08 00:00:00,INSS,2019-12-16 13:09:08,2020-03-16 14:46:57,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,91


In [None]:
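# Sort chronologically by appeal ruling date so that the positional split below respects time order
df = df.sort_values(by="date_appeal_panel_ruling", ascending=True).reset_index(
    drop=True
)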
df

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal
0,0529055-14.2017.4.05.8100,SENTENÇA Dispensado o relatório (art. 1o da Le...,IMPROCEDENTE,SENTENÇA,2017-12-22 00:00:00,INSS,2018-01-30 12:09:11,2018-03-26 17:02:45,6138,Direito Previdenciário,"RMI - Renda Mensal Inicial, Reajustes e Revisõ...",Reajustes e Revisões Específicos,14-CE,55
1,0505725-70.2017.4.05.8105,JUSTIÇA FEDERAL SEÇÃO JUDICIÁRIA DO ESTADO DO ...,IMPROCEDENTE,SENTENÇA,2017-11-21 00:00:00,INSS,2018-02-06 15:50:11,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,23-CE,48
2,0501051-55.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-se de demanda prev...,IMPROCEDENTE,SENTENÇA,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE,125
3,0512841-39.2017.4.05.8102,PODER JUDICIÁRIO JUSTIÇA FEDERAL DE PRIMEIRA I...,PROCEDENTE,SENTENÇA,2017-10-25 00:00:00,INSS,2017-11-09 10:58:36,2018-03-26 17:07:16,10288,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Sistema Remuneratório e Benefícios,17-CE,137
4,0509917-52.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-sede ação de rito ...,IMPROCEDENTE,SENTENÇA,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE,69
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0508370-06.2019.4.05.8200,SENTENÇA – TIPO A Vistos etc. Trata-se de ação...,PROCEDENTE,SENTENÇA,2019-06-13 00:00:00,INSS,2019-10-03 18:01:23,2020-03-31 15:22:15,6096,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Idade (Art. 48/51),7-PB,179
19996,0508717-73.2018.4.05.8200,SENTENÇA Dispensado o relatório nos termos do ...,IMPROCEDENTE,SENTENÇA,2018-06-20 00:00:00,INSS,2019-08-29 16:53:58,2020-03-31 15:22:15,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-PB,214
19997,0507511-81.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-08-22 00:00:00,INSS,2019-11-05 17:10:31,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,147
19998,0507364-55.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-08-21 00:00:00,INSS,2019-10-30 16:18:03,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,153


In [None]:
# Determine the length of the training set (80% of the total data)
train_len = int(0.8 * len(df))

# Determine the length of the validation set (10% of the total data)
val_len = int(0.1 * len(df))

# Create the training set by selecting the first 'train_len' rows from the dataframe
df_train = df.iloc[:train_len].copy()

# Create the validation set by selecting the next 'val_len' rows after the training set
df_val = df.iloc[train_len : train_len + val_len].copy()

# Create the test set by selecting the remaining rows after the training and validation sets
df_test = df.iloc[train_len + val_len :].copy()

# Display the shapes of the training, validation, and test sets to verify the split
df_train.shape, df_val.shape, df_test.shape

((16000, 14), (2000, 14), (2000, 14))
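
A positional split is only chronological if the dataframe was sorted by date first, so a quick sanity check on the split boundaries can be worthwhile. The sketch below repeats the same 80/10/10 recipe on a small made-up dataframe and checks that no training row is dated after any validation row, and no validation row after any test row. The toy data and the helper name `is_chronological` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (dates are made up for illustration).
toy = pd.DataFrame(
    {
        "case_number": [f"case-{i}" for i in range(10)],
        "date_appeal_panel_ruling": pd.to_datetime(
            [
                "2019-03-01", "2018-05-10", "2018-01-15", "2019-09-30", "2018-11-02",
                "2019-09-30", "2020-01-20", "2018-07-07", "2019-06-18", "2020-03-31",
            ]
        ),
    }
)

# Same recipe as above: sort by ruling date, then cut by position.
toy = toy.sort_values("date_appeal_panel_ruling").reset_index(drop=True)
train_len = int(0.8 * len(toy))
val_len = int(0.1 * len(toy))
toy_train = toy.iloc[:train_len]
toy_val = toy.iloc[train_len : train_len + val_len]
toy_test = toy.iloc[train_len + val_len :]


def is_chronological(earlier: pd.DataFrame, later: pd.DataFrame, col: str) -> bool:
    """Return True when no row in `earlier` is dated after any row in `later`."""
    return bool(earlier[col].max() <= later[col].min())


date_col = "date_appeal_panel_ruling"
split_ok = is_chronological(toy_train, toy_val, date_col) and is_chronological(
    toy_val, toy_test, date_col
)
print(split_ok)
```

The comparison uses `<=` rather than `<` because rulings that share the exact same timestamp can end up on both sides of a boundary. The sorted output above shows many rows with identical `date_appeal_panel_ruling` values, so this can happen with the real data as well.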

In [None]:
# Apply the same cleaning steps, in the same order, to every split
for dataframe in [df_train, df_val, df_test]:
    dataframe["clean_text"] = (
        dataframe["text"]
        .apply(remove_accented_characters)
        .apply(remove_numbers_punctuation_from_text)
        .apply(remove_excessive_spaces)
        .apply(remove_short_words)
    )

In [44]:
df_train

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal,clean_text
0,0529055-14.2017.4.05.8100,SENTENÇA Dispensado o relatório (art. 1o da Le...,IMPROCEDENTE,SENTENÇA,2017-12-22 00:00:00,INSS,2018-01-30 12:09:11,2018-03-26 17:02:45,6138,Direito Previdenciário,"RMI - Renda Mensal Inicial, Reajustes e Revisõ...",Reajustes e Revisões Específicos,14-CE,55,SENTENCA Dispensado relatorio art Lei art Lei ...
1,0505725-70.2017.4.05.8105,JUSTIÇA FEDERAL SEÇÃO JUDICIÁRIA DO ESTADO DO ...,IMPROCEDENTE,SENTENÇA,2017-11-21 00:00:00,INSS,2018-02-06 15:50:11,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,23-CE,48,JUSTICA FEDERAL SECAO JUDICIARIA ESTADO CEARA ...
2,0501051-55.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-se de demanda prev...,IMPROCEDENTE,SENTENÇA,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE,125,SENTENCA RELATORIO Trata demanda previdenciari...
3,0512841-39.2017.4.05.8102,PODER JUDICIÁRIO JUSTIÇA FEDERAL DE PRIMEIRA I...,PROCEDENTE,SENTENÇA,2017-10-25 00:00:00,INSS,2017-11-09 10:58:36,2018-03-26 17:07:16,10288,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Sistema Remuneratório e Benefícios,17-CE,137,PODER JUDICIARIO JUSTICA FEDERAL PRIMEIRA INST...
4,0509917-52.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-sede ação de rito ...,IMPROCEDENTE,SENTENÇA,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE,69,SENTENCA RELATORIO Trata sede acao rito especi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,0516038-07.2019.4.05.8013,Processo no 05160380720194058013T Autor: CARLS...,PROCEDENTE,SENTENÇA,2019-05-23 00:00:00,IFES,2019-07-29 11:47:02,2019-09-30 17:20:24,10220,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Regime Estatutário,14-AL,63,Processo Autor CARLSON LAMENHA APOLINARIO Reu ...
15996,0500940-73.2019.4.05.8015,SENTENÇA Trata-se de ação proposta em face da ...,IMPROCEDENTE,SENTENÇA,2019-01-29 00:00:00,CEF,2019-06-14 11:24:25,2019-09-30 17:20:24,10433,Direito Civil,Responsabilidade Civil,Indenização por Dano Moral,10-AL,108,SENTENCA Trata acao proposta face Caixa Econom...
15997,0509291-41.2019.4.05.8013,SENTENÇA Trata-se de ação proposta contra o IN...,PROCEDENTE,SENTENÇA,2019-03-29 00:00:00,INSS,2019-06-13 12:08:18,2019-09-30 17:20:24,6118,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Tempo de Contribuição (Art. ...,9-AL,109,SENTENCA Trata acao proposta contra INSS que p...
15998,0519807-23.2019.4.05.8013,"SENTENÇA Trata-se de ação de rito sumaríssimo,...",PROCEDENTE,SENTENÇA,2019-06-25 00:00:00,INSS,2019-07-16 11:13:16,2019-09-30 17:20:24,6100,Direito Previdenciário,Benefícios em Espécie,Aposentadoria Especial (Art. 57/8),9-AL,76,SENTENCA Trata acao rito sumarissimo com pedid...


In [None]:
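# Features: the cleaned ruling text of each split
X_train = df_train.clean_text
X_val = df_val.clean_text
X_test = df_test.clean_text

# Target: time_to_trial_appeal, a numeric column that appears to hold the number of
# days between the first-instance ruling and the appeal panel ruling
y_train = df_train.time_to_trial_appeal
y_val = df_val.time_to_trial_appeal
y_test = df_test.time_to_trial_appeal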

From now on, it's up to you.
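
Because `time_to_trial_appeal` is numeric, one way to approach this exercise is as a regression problem. Before building a model, it can help to have a trivial reference score to beat. The sketch below assumes mean absolute error (MAE) as the metric and uses scikit-learn's `DummyRegressor`; neither choice is prescribed by the notebook, and the toy lists only stand in for `X_train`, `y_train`, `X_val`, and `y_val`.

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy stand-ins for X_train / y_train / X_val / y_val (illustrative values only).
toy_X_train = [
    "sentenca improcedente",
    "sentenca procedente",
    "termo audiencia",
    "sentenca relatorio",
    "justica federal",
]
toy_y_train = [55, 48, 125, 137, 69]
toy_X_val = ["sentenca trata acao", "processo autor reu", "sentenca tipo", "poder judiciario"]
toy_y_val = [63, 108, 109, 76]

# DummyRegressor ignores the features and always predicts the training median.
baseline = DummyRegressor(strategy="median")
baseline.fit(toy_X_train, toy_y_train)

# Score the constant prediction on the validation targets.
baseline_mae = mean_absolute_error(toy_y_val, baseline.predict(toy_X_val))
print(f"Baseline MAE: {baseline_mae:.1f}")
```

Any text-based model you train should achieve a lower validation MAE than this constant prediction; if it does not, the features are probably not adding information about the target.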

## Takeaways
- Proper data preprocessing and feature extraction are essential for effective NLP tasks, especially in specialized domains like legal text analysis.

- Choosing the right hyperparameter optimization technique depends on the problem complexity, computational resources, and desired balance between exploration and exploitation.

- Ensemble methods can significantly enhance model performance by combining the strengths of multiple classifiers, but they may increase computational cost and reduce interpretability.

- When working with time-sensitive data like legal rulings, it's crucial to consider the temporal nature of the data in both preprocessing and model evaluation stages.

- Evaluation metrics should be chosen carefully based on the specific problem and domain; in this case, Matthews Correlation Coefficient (MCC) was emphasized for its effectiveness in imbalanced classification tasks.

- Balancing model performance with interpretability and computational efficiency is a key consideration in practical machine learning applications, especially in sensitive domains like legal prediction.
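
The "60 iterations rule" for Random Search follows from a short probability calculation. The sketch below assumes that each iteration draws a configuration independently and uniformly from the search space, so each draw has a 5% chance of landing in the best 5% of configurations. The function names are hypothetical and chosen only for this illustration.

```python
import math


def prob_hit_top_fraction(n_iter: int, top_fraction: float = 0.05) -> float:
    """Probability that at least one of n_iter independent uniform draws
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


def iterations_needed(confidence: float = 0.95, top_fraction: float = 0.05) -> int:
    """Smallest number of draws whose hit probability reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


p60 = prob_hit_top_fraction(60)
n95 = iterations_needed()
print(f"P(at least one top-5% configuration in 60 draws) = {p60:.3f}")
print(f"Draws needed for 95% confidence: {n95}")
```

Neither function takes the size of the search space as an input, which is why the rule holds regardless of the grid size. The guarantee concerns landing in the top 5% of the configurations being sampled; it says nothing about how good those configurations are in absolute terms.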


## Questions

1. What are the three key steps in text processing mentioned in the NLP pipeline?

2. What is the "60 iterations rule" with respect to random search for hyperparameter optimization?

3. What are the three main components of Bayesian Optimization for hyperparameter tuning?

4. How does the notebook split the data for training, validation and testing?

5. What is the purpose of the StackingClassifier in this notebook?

6. What evaluation metrics are used to assess the performance of the classifiers?


`Answers are commented inside this cell.`

<!-- 1. The three key steps in text processing mentioned in the NLP pipeline are: Normalization (standardizing text), Tokenization (breaking text into smaller units), and Numericalization (converting tokens to numerical representations).

2. The "60 iterations rule" states that 60 iterations of random search have about a 95% chance of sampling at least one configuration from the best 5% of the search space, regardless of the grid size, because 1 - 0.95^60 ≈ 0.954. This makes random search an efficient approach to hyperparameter optimization.

3. The three main components of Bayesian Optimization for hyperparameter tuning are: Objective Function (evaluates model performance), Surrogate Model (probabilistic approximation of the objective function), and Acquisition Function (guides selection of next hyperparameter set to evaluate).

4. The notebook splits the data chronologically: the first 80% of the data (sorted by date) is used for training, the next 10% for validation, and the final 10% for testing.

5. The StackingClassifier is used to combine multiple base classifiers with a meta-classifier, aiming to achieve better performance than any individual model by combining their strengths.

6. The evaluation metrics used include F1 score, balanced accuracy, accuracy, Matthews Correlation Coefficient (MCC), and confusion matrix. The notebook particularly emphasizes MCC as a key metric.
 -->
