# Training ML models for supervised NLP tasks
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Summary

## Keypoints
- Common Natural Language Processing (NLP) pipelines consist of three main steps: text processing (normalization, tokenization, numericalization), feature extraction, and model training.

- Hyperparameter optimization techniques include Grid Search, Random Search, and Bayesian Optimization, each with its own advantages and use cases.

- The "60 iterations rule" for Random Search states that 60 random draws have about a 95% chance of including at least one configuration from the best 5% of the search space, regardless of the grid size.

- Bayesian Optimization uses a surrogate model, objective function, and acquisition function to efficiently navigate the hyperparameter space.

- Ensemble methods like StackingClassifier can improve model performance by combining multiple base classifiers with a meta-classifier.

- Chronological data splitting is crucial for legal ruling analysis to prevent temporal leakage and maintain the integrity of legal precedents.
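
The arithmetic behind the "60 iterations rule" can be checked directly. The sketch below assumes each random-search draw is independent and lands in the best 5% of configurations with probability 0.05; the function name is illustrative.

```python
def prob_hit_top_fraction(n_iter: int, top_fraction: float = 0.05) -> float:
    """Probability that at least one of n_iter independent random draws
    falls inside the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


# The probability crosses 95% at roughly 60 draws, whatever the grid size.
for n in (10, 30, 60, 100):
    print(n, round(prob_hit_top_fraction(n), 4))
```

The grid size never enters the formula, which is why the rule holds regardless of how many configurations exist.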

## Takeaways
- Proper data preprocessing and feature extraction are essential for effective NLP tasks, especially in specialized domains like legal text analysis.

- Choosing the right hyperparameter optimization technique depends on the problem complexity, computational resources, and desired balance between exploration and exploitation.

- Ensemble methods can significantly enhance model performance by leveraging the strengths of multiple classifiers, but may increase computational cost and reduce interpretability.

- When working with time-sensitive data like legal rulings, it's crucial to consider the temporal nature of the data in both preprocessing and model evaluation stages.

- Evaluation metrics should be chosen carefully based on the specific problem and domain; in this case, Matthews Correlation Coefficient (MCC) was emphasized for its effectiveness in imbalanced classification tasks.

- Balancing model performance with interpretability and computational efficiency is a key consideration in practical machine learning applications, especially in sensitive domains like legal prediction.


# Understanding The Natural Language Processing (NLP) Pipeline

The NLP pipeline is a series of structured steps that help transform raw text into a format that machines can understand and use to make decisions or predictions. Here we explore each step for a complete understanding.

## 1. Data Collection

Data collection is the first and crucial step in the pipeline, where we gather raw text data from various sources. The quality and quantity of this data can significantly impact the effectiveness of your NLP model.

## 2. Text Cleaning

In this stage, we clean the collected data by removing noise such as HTML tags, emojis, punctuation marks, etc., which do not contribute to understanding the actual content. This cleaned-up data will improve the model's performance and save computational resources.

## 3. Preprocessing

Preprocessing involves transformation to ready the data for feature extraction, including tasks like tokenization (splitting text into words or phrases), stemming/lemmatization (reducing words to their base/root form), and removing stop words (common words like 'is', 'an', 'the' that don't carry much meaning).

## 4. Feature Extraction

This stage involves converting preprocessed data into a format that can be understood by machine learning algorithms. Techniques like Bag-of-Words or TF-IDF (Term Frequency-Inverse Document Frequency) are employed here to create numerical representations of the text.

`The points above were already addressed in Notebook 4.`


`During this class, we'll cover:`

## 5. Modeling

Once the features are extracted and in the proper format, we use them to build and train our NLP model. Depending on the end goal, different models can be used, such as Naive Bayes for classification or LSTM (Long Short-Term Memory) networks for sequence prediction.

## 6. Evaluation

After the model has been trained, it must be evaluated to ascertain its performance. Metrics like precision, recall, accuracy, and F1-score are typically considered. Also, the model might be tested with new data to validate its performance.
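
To make these metrics concrete, here is a minimal sketch that computes them from binary confusion-matrix counts. The counts are invented for illustration: a model that always predicts the majority class on a 78/22 split looks accurate but never finds a positive case.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 100 documents, 22 positives, model always predicts negative.
always_negative = binary_metrics(tp=0, fp=0, fn=22, tn=78)
print(always_negative)
```

Accuracy alone would suggest a reasonable model here, which is why precision, recall, and F1 are usually reported alongside it.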




`What we won't cover in this class (I'll leave that to your MLOps professor):`

## 7. Deployment

Once satisfied with the model's performance, the next step is to deploy it for practical use. This can range from integrating within an existing system or application, to deploying on a server for production use.

## 8. Maintenance and Monitoring

Post-deployment, continuous monitoring is essential to ensure the model's performance doesn't degrade over time, due to changes in data patterns. Periodic retraining and tuning may be necessary to keep the model up-to-date.

# Understanding the BrCAD-5 Dataset

## Dataset Overview

The dataset we're working with is a sample from the [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5), a comprehensive collection of legal rulings from Brazilian Federal Small Claims Courts (FSCC). This dataset was specifically curated for academic purposes, with the primary goal of developing AI models capable of predicting appeal outcomes within the jurisdiction of the 5th Regional Federal Court (TRF5).

## Key Features of the Dataset

- **Sample Size**: Our sample contains 40,000 legal rulings (20,000 first-instance sentences and 20,000 appellate rulings) covering 20,000 unique cases, providing a robust foundation for analysis.
- **Jurisdiction**: All cases are from the 5th Regional Federal Court (TRF5) jurisdiction in Brazil.
- **Case Types**: The dataset includes rulings from both federal courts and appellate panels.

## Dataset Structure

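The texts table (`df_texts`) has four columns. Besides `text`, which holds the full text of each ruling, there are three identifier and label columns: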

1. **`case_number`**: 
   - A unique identifier for each legal ruling
   - Ensures each case can be distinctly referenced and tracked

2. **`ruling_type`**: 
   - Indicates the type of legal ruling
   - Two categories:
     - ACÓRDÃO (Judgment): Typically refers to decisions made by appellate panels
     - SENTENÇA (Sentence): Usually refers to decisions made by a single judge in a lower court

3. **`outcome`**: 
   - Represents the result of the legal ruling
   - Multiple possible outcomes:
     - PROVIMENTO: Appeal granted
     - PROVIMENTO PARCIAL: Appeal partially granted
     - NÃO PROVIMENTO: Appeal denied
     - IMPROCEDENTE: Claim dismissed
     - PROCEDENTE: Claim upheld
     - PARCIALMENTE PROCEDENTE: Claim partially upheld
     - EXTINTO SEM MÉRITO: Case dismissed without judgment on merits
     - HOMOLOGADA TRANSAÇÃO: Settlement agreement approved

## Significance of the Dataset

1. **AI in Legal Prediction**: This dataset serves as a valuable resource for developing machine learning models that can predict legal outcomes, potentially revolutionizing legal research and case preparation.

2. **Understanding Brazilian Legal System**: It offers insights into the decision-making patterns within Brazilian Federal Small Claims Courts, which could be useful for comparative legal studies.

3. **Natural Language Processing (NLP) Applications**: The `text` column contains the full text of each ruling, which can be used for NLP tasks like legal document classification or summarization.

## Considerations for Analysis

- **Balanced Representation**: It's important to check if all outcome categories are adequately represented in the sample to ensure unbiased analysis.
- **Temporal Aspects**: Consider whether the rulings span a specific time period, as legal trends may change over time.
- **Contextual Factors**: The metadata table includes the case topic (`case_topic_*`) and the court (`court_id`), which may influence outcomes; other factors, such as the judge's identity, are not provided.


In [1]:
# Load our data.
import pandas as pd

df_texts = pd.read_parquet('data/brcad5/texts_sample.parquet.gz')
df_meta = pd.read_parquet('data/brcad5/metadata_sample.parquet.gz')

In [2]:
df_texts.head()

| | case_number | text | outcome | ruling_type |
|---|---|---|---|---|
| 2 | 0515165-56.2018.4.05.8202 | SENTENÇA I - RELATÓRIO Dispensada a feitura do... | IMPROCEDENTE | SENTENÇA |
| 11 | 0506287-42.2018.4.05.8300 | SENTENÇA I - RELATÓRIO Dispensado, nos termos ... | PARCIALMENTE PROCEDENTE | SENTENÇA |
| 15 | 0513000-08.2019.4.05.8103 | RECURSO INOMINADO CONTRA SENTENÇA DE EXTINÇÃO ... | NÃO PROVIMENTO | ACÓRDÃO |
| 18 | 0503661-32.2018.4.05.8015 | PROCESSO No 0503661-32.2018.4.05.8015 RECORREN... | NÃO PROVIMENTO | ACÓRDÃO |
| 19 | 0516813-86.2018.4.05.8100 | AMPARO SOCIAL (LOAS). REQUISITOS NÃO PREENCHID... | NÃO PROVIMENTO | ACÓRDÃO |


In [3]:
df_texts.groupby('ruling_type').outcome.value_counts()

ruling_type  outcome                
ACÓRDÃO      NÃO PROVIMENTO             15581
             PROVIMENTO                  3004
             PROVIMENTO PARCIAL          1415
SENTENÇA     IMPROCEDENTE               11550
             PROCEDENTE                  5253
             PARCIALMENTE PROCEDENTE     2403
             EXTINTO SEM MÉRITO           791
             HOMOLOGADA TRANSAÇÃO           3
Name: count, dtype: int64

In [4]:
df_meta.head()

| | case_number | filing_date | defendant_normalized | date_first_instance_ruling | date_appeal_panel_ruling | case_topic_code | case_topic_1st_level | case_topic_2nd_level | case_topic_3rd_level | court_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 0510491-75.2017.4.05.8103 | 2017-10-11 00:00:00 | INSS | 2018-01-16 11:16:19 | 2018-03-26 17:07:16 | 6095 | Direito Previdenciário | Benefícios em Espécie | Aposentadoria por Invalidez | 19-CE |
| 16 | 0509917-52.2017.4.05.8103 | 2017-09-26 00:00:00 | INSS | 2018-01-15 18:07:55 | 2018-03-26 17:07:16 | 6101 | Direito Previdenciário | Benefícios em Espécie | Auxílio-Doença Previdenciário | 31-CE |
| 19 | 0501051-55.2017.4.05.8103 | 2017-02-03 00:00:00 | INSS | 2017-11-21 11:44:40 | 2018-03-26 17:07:16 | 6103 | Direito Previdenciário | Benefícios em Espécie | Salário-Maternidade (Art. 71/73) | 31-CE |
| 23 | 0511681-76.2017.4.05.8102 | 2017-09-27 00:00:00 | INSS | 2017-12-29 02:18:24 | 2018-03-26 17:07:16 | 6104 | Direito Previdenciário | Benefícios em Espécie | Pensão por Morte (Art. 74/9) | 30-CE |
| 33 | 0500777-79.2017.4.05.8107 | 2017-02-23 00:00:00 | INSS | 2017-11-27 17:58:46 | 2018-03-26 17:07:16 | 6103 | Direito Previdenciário | Benefícios em Espécie | Salário-Maternidade (Art. 71/73) | 25-CE |


In [5]:
df_meta.columns

Index(['case_number', 'filing_date', 'defendant_normalized',
       'date_first_instance_ruling', 'date_appeal_panel_ruling',
       'case_topic_code', 'case_topic_1st_level', 'case_topic_2nd_level',
       'case_topic_3rd_level', 'court_id'],
      dtype='object')

In [6]:
df_texts.columns

Index(['case_number', 'text', 'outcome', 'ruling_type'], dtype='object')

In [7]:
df_texts.case_number.nunique()

20000


### Next Steps

As we proceed with our analysis, we'll need to:
1. Load the data into our working environment
2. Perform initial exploratory data analysis to understand the distribution of ruling types and outcomes
3. Consider any necessary data preprocessing steps, such as handling missing values or encoding categorical variables


## Problem Statement 1 - Classifying "Acórdãos"

When a legal case is adjudicated by a judge, the involved parties have the option to appeal the decision to a higher court. This appellate court then reviews the case and issues a document called an **"Acórdão"**. The "Acórdão" details the court's decision and the reasoning behind it, playing a crucial role in the judicial process. 

### Purpose and Importance of "Acórdãos"

The "Acórdão" can either:
- **Uphold** the original decision made by the lower court.
- **Overturn** the decision, fully or partially, leading to a different outcome.

These documents are invaluable for understanding the application of laws in various contexts. They offer insights into:
- **Judicial reasoning**: How judges interpret and apply the law.
- **Legal precedents**: Past decisions that influence future cases.

### Challenges in Analyzing "Acórdãos"

Despite their importance, "Acórdãos" are written in natural language, which poses challenges for systematic analysis:
1. **Complexity of Legal Language**: Legal terminology and reasoning can be intricate and difficult to parse for those without legal training.
2. **Volume of Data**: The sheer number of "Acórdãos" can be overwhelming, making manual analysis impractical.

To address these challenges, automated classification techniques can be employed to categorize and extract meaningful insights from these documents. By training machine learning models on labeled "Acórdãos," we can develop systems that classify the outcome of an appeal based on the content of the document.



## Problem Statement 2 - Predicting the Time to Issue an "Acórdão"

Once a case has been tried by a judge, the appellate court issues the "Acórdão" after a certain period. The duration between the initial trial and the issuance of the "Acórdão" can vary significantly, ranging from a few days to several months.

### Implications of the Time Frame

The time it takes to issue an "Acórdão" has significant implications for the parties involved:
- **Closure**: If the "Acórdão" supports the original decision, parties can find closure and move forward.
- **Uncertainty**: If the "Acórdão" overrules the initial decision, the parties may face prolonged uncertainty, potentially lasting months or years.

### Calculating the Time Frame

In the dataset, the time it takes to issue the "Acórdão" can be derived by calculating the difference between two columns:
- `date_first_instance_ruling`: The date when the initial trial decision was made.
- `date_appeal_panel_ruling`: The date when the appellate court issued the "Acórdão".

This calculation will provide an accurate representation of the time duration between the trial and the issuance of the "Acórdão".

> **Note**: When developing predictive models, it is essential to use only the data available at the time of prediction. Therefore, the `date_appeal_panel_ruling` column cannot be used as an input for the model, since it is only known once the appellate panel has ruled; it serves solely to compute the target. This prevents data leakage and ensures the model's practical utility.
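
As a minimal sketch of this calculation, the snippet below subtracts the two date columns on a small frame built from two rows of the metadata preview shown earlier. The frame `meta` and the target column name `days_to_appeal_ruling` are assumptions made for illustration; in the notebook the same subtraction would be applied to `df_meta`.

```python
import pandas as pd

# Two rows copied from the metadata preview; stands in for df_meta.
meta = pd.DataFrame({
    "case_number": ["0510491-75.2017.4.05.8103", "0501051-55.2017.4.05.8103"],
    "date_first_instance_ruling": pd.to_datetime(
        ["2018-01-16 11:16:19", "2017-11-21 11:44:40"]
    ),
    "date_appeal_panel_ruling": pd.to_datetime(
        ["2018-03-26 17:07:16", "2018-03-26 17:07:16"]
    ),
})

# Target: elapsed time between the first-instance ruling and the appellate ruling.
delta = meta["date_appeal_panel_ruling"] - meta["date_first_instance_ruling"]
meta["days_to_appeal_ruling"] = delta.dt.total_seconds() / 86400  # fractional days

print(meta[["case_number", "days_to_appeal_ruling"]])
```

Keeping the target in fractional days avoids losing the time-of-day information that the timestamps carry.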

### Potential Predictive Features

To predict the duration between the trial and the issuance of the "Acórdão," consider using features available at the time of the initial trial, such as:
- **Case characteristics**: Type of case, complexity, involved parties.
- **Court attributes**: Location, workload, historical averages.
- **Judge information**: Identity, workload, historical performance.


# Problem 1 - Classifying "Acórdãos"

In [8]:
# Filter the dataframe to include only rows where the ruling type is "ACÓRDÃO"
# Select only the columns 'case_number', 'text', and 'outcome'
df = df_texts.query('ruling_type == "ACÓRDÃO"')[['case_number', 'text', 'outcome']].copy()

# Merge the filtered dataframe with another dataframe containing metadata
# Specifically, we are adding the 'date_appeal_panel_ruling' column based on 'case_number'
df = df.merge(df_meta[['case_number', 'date_appeal_panel_ruling']], on='case_number', how='left')

# Display the summary information of the resulting dataframe
# This includes the number of entries, column names, non-null counts, and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   case_number               20000 non-null  object        
 1   text                      20000 non-null  object        
 2   outcome                   20000 non-null  object        
 3   date_appeal_panel_ruling  20000 non-null  datetime64[us]
dtypes: datetime64[us](1), object(3)
memory usage: 625.1+ KB


In [9]:
df.outcome.value_counts(normalize=True)

outcome
NÃO PROVIMENTO        0.77905
PROVIMENTO            0.15020
PROVIMENTO PARCIAL    0.07075
Name: proportion, dtype: float64

In [10]:
# Convert the 'outcome' column to a categorical type and then to integer codes
# This is useful for machine learning models that require numerical input
df['outcome_int'] = df.outcome.astype('category').cat.codes

# Display the count of each unique integer code in the 'outcome_int' column
# This helps to understand the distribution of different outcomes in the dataset
df.outcome_int.value_counts()

outcome_int
0    15581
1     3004
2     1415
Name: count, dtype: int64
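
To read the integer codes back as labels, one option is to inspect the category index. This is a sketch on a toy series, assuming the default behaviour in which pandas sorts string categories lexically; `code_to_label` is an illustrative name.

```python
import pandas as pd

# Toy series with the three appeal outcomes from the dataset.
outcomes = pd.Series(
    ["NÃO PROVIMENTO", "PROVIMENTO", "PROVIMENTO PARCIAL", "NÃO PROVIMENTO"]
)
as_category = outcomes.astype("category")

# Categories are sorted; the position of each label is its integer code.
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
print(as_category.cat.codes.tolist())
```

This matches the counts above: code 0 is the majority class (NÃO PROVIMENTO), code 1 is PROVIMENTO, and code 2 is PROVIMENTO PARCIAL.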

## Step 1 - Split the data into training, validation and test sets

In [11]:
from sklearn.model_selection import train_test_split

# Split the dataframe into training and testing sets
# Stratify on the 'outcome' column so each split preserves the original class proportions
# Set aside 20% of the data for testing
df_train, df_test = train_test_split(df, test_size=0.2, random_state=271828, stratify=df.outcome)

# Further split the testing set into validation and testing sets
# Stratify again so both halves keep the same class proportions
# Set aside 50% of the testing set for validation, resulting in 10% of the original data
df_test, df_val = train_test_split(df_test, test_size=0.5, random_state=271828, stratify=df_test.outcome)

# Display the shapes of the resulting datasets
# This helps to verify the sizes of the training, testing, and validation sets
df_train.shape, df_test.shape, df_val.shape

((16000, 5), (2000, 5), (2000, 5))

## The Importance of Proper Data Splitting in Legal Ruling Analysis

When working with datasets containing legal rulings, it's crucial to recognize the inherent temporal nature of the data. Legal cases and their corresponding rulings typically follow a chronological order, often organized by case number or date. This temporal aspect introduces unique challenges and considerations for data analysis and model training.

### Pitfalls of Random Data Splitting

Randomly splitting legal ruling data can lead to several significant issues:

1. **Temporal Leakage**: This is the primary concern when randomly splitting time-ordered data. Temporal leakage occurs when a model is inadvertently trained on future information, leading to:
   - Overly optimistic performance estimates
   - Models that fail to generalize well to truly unseen data
   - Unrealistic predictions based on future knowledge

2. **Disruption of Legal Precedent**: Legal systems often rely on precedent, where earlier rulings influence later ones. Random splitting can break these important temporal relationships.

3. **Misrepresentation of Legal Trends**: Laws and their interpretations evolve over time. Random splitting may obscure these trends, leading to a model that doesn't accurately capture the current legal landscape.

### Best Practices for Splitting Legal Ruling Data

To address these challenges, consider the following approach:

1. **Chronological Splitting**: Divide the dataset based on the date of the rulings:
   - Training Set: Earlier rulings
   - Validation Set: Intermediate rulings
   - Test Set: Most recent rulings

2. **Preserving Temporal Order**: This method maintains the natural progression of legal precedents and trends.

3. **Realistic Model Evaluation**: By testing on the most recent data, you can better assess how well your model will perform on truly future cases.

### Benefits of Proper Data Splitting

1. **Improved Model Generalization**: Your model learns from historical patterns and is tested on more recent, unseen data, better mimicking real-world applications.

2. **Accurate Performance Assessment**: Evaluating on chronologically later data provides a more realistic measure of model performance.


### Considerations for Implementation

- **Time Window Selection**: Carefully consider the time spans for each split to ensure sufficient data in each set while maintaining temporal integrity.
- **Handling Landmark Cases**: Be aware of significant legal changes or landmark cases that might dramatically shift legal interpretations.
- **Regular Retraining**: In rapidly evolving legal areas, consider implementing a system for regular model retraining with the most up-to-date data.

> **Key Takeaway**: In the analysis of legal rulings, chronological data splitting is not just a best practice—it's essential for creating reliable, ethical, and practically applicable models.

In [12]:
# Calculate the number of rows for the training set (80% of the total dataframe length)
train_length = int(0.8 * len(df))

# Calculate the number of rows for the validation set (10% of the total dataframe length)
validation_length = int(0.1 * len(df))

# Sort the dataframe by the 'date_appeal_panel_ruling' column in ascending order
# This ensures that the data is split chronologically, which can be important for time series data
df = df.sort_values(by='date_appeal_panel_ruling', ascending=True)

# Split the dataframe into training, validation, and testing sets based on the calculated lengths
# The training set includes the first 80% of the rows
df_train = df.iloc[:train_length].copy()

# The validation set includes the next 10% of the rows
df_val = df.iloc[train_length:train_length + validation_length].copy()

# The testing set includes the remaining 10% of the rows
df_test = df.iloc[train_length + validation_length:].copy()

# Display the shapes of the resulting datasets
# This helps to verify the sizes of the training, validation, and testing sets
df_train.shape, df_val.shape, df_test.shape

((16000, 5), (2000, 5), (2000, 5))

In [13]:
df_train.date_appeal_panel_ruling.min(), df_train.date_appeal_panel_ruling.max()

(Timestamp('2018-03-26 17:02:45'), Timestamp('2019-09-30 17:20:24'))

In [14]:
df_val.date_appeal_panel_ruling.min(), df_val.date_appeal_panel_ruling.max()

(Timestamp('2019-09-30 17:20:24'), Timestamp('2019-12-11 14:57:02'))

In [15]:
df_test.date_appeal_panel_ruling.min(), df_test.date_appeal_panel_ruling.max()

(Timestamp('2019-12-11 14:57:02'), Timestamp('2020-04-01 11:24:06'))
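
The outputs above show that the last training timestamp equals the first validation timestamp (and likewise between validation and test), so rulings issued at the same moment end up on both sides of a boundary. One possible way to avoid this, sketched under the assumption that tied timestamps should stay together, is to split on cutoff timestamps instead of row positions. The function name and the toy data are hypothetical.

```python
import pandas as pd


def chronological_split(frame, date_col, train_frac=0.8, val_frac=0.1):
    """Split by cutoff timestamps so rows sharing a timestamp stay in one set."""
    frame = frame.sort_values(date_col)
    dates = frame[date_col]
    n_train = int(train_frac * len(frame))
    n_val = int(val_frac * len(frame))
    train_cutoff = dates.iloc[n_train - 1]
    val_cutoff = dates.iloc[n_train + n_val - 1]
    train = frame[dates <= train_cutoff]
    val = frame[(dates > train_cutoff) & (dates <= val_cutoff)]
    test = frame[dates > val_cutoff]
    return train, val, test


# Toy data: 20 rulings on consecutive days, except rows 15 and 16 share a timestamp.
day_offsets = list(range(20))
day_offsets[16] = day_offsets[15]
toy = pd.DataFrame({
    "date_appeal_panel_ruling": pd.Timestamp("2019-01-01")
    + pd.to_timedelta(day_offsets, unit="D"),
})

train, val, test = chronological_split(toy, "date_appeal_panel_ruling")
print(len(train), len(val), len(test))
```

The trade-off is that the set sizes are no longer exactly 80/10/10, since a tied group moves as a whole to the earlier set.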

## Step 2 - Text Processing and vectorization

In [17]:
# Import necessary functions from the 'helpers.text' module
from helpers.text import remove_accented_characters, remove_numbers_punctuation_from_text, remove_excessive_spaces, remove_short_words

# Apply the cleaning functions to the 'text' column of each dataframe
for dataframe in [df_train, df_val, df_test]:
    dataframe['clean_text'] = dataframe.text.apply(remove_accented_characters)
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_numbers_punctuation_from_text)
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_excessive_spaces)
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_short_words)

In [18]:
df_train.iloc[0]['clean_text']

'RELATORIO Trata recurso interposto pela parte autora face sentenca que julgou improcedente pedido reajuste beneficio com base Indice Precos Consumidor Terceira Idade IPC Instituto Brasileiro Economia Fundacao Getulio Vargas FGV IBRE substituicao Indice Nacional Precos Consumidor INPC Instituto Brasileiro Geografia Estatistica IBGE alegando que este indice aplicado pelo INSS nao preservaria carater permanente valor real dos beneficios previdenciarios que afrontaria comando art VOTO Conforme bem fundamenta sentenca recorrida preservacao valor real dos beneficios assegurada pela aplicacao dos indices estabelecidos pela propria legislacao previdenciaria nao cabendo poder judiciario substituir indice eleito pelo legislador fato forma indices reajustes que devem ser aplicados aos beneficios previdenciarios concedidos apos sao aqueles estabelecidos pela Lei uma vez que Carta Magna remeteu legislador ordinario definicao dos indices serem aplicados aos reajustes dos beneficios para preservacao
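
The `helpers.text` module is not shown in this notebook. As a rough idea of what such helpers might do, here is a standard-library sketch; the function names, the regular expressions, and the three-character minimum word length are assumptions inferred from the output above, not the actual implementation.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose accented characters and drop the combining marks (ç -> c, ã -> a).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_numbers_and_punctuation(text: str) -> str:
    # Keep only ASCII letters and whitespace; everything else becomes a space.
    return re.sub(r"[^A-Za-z\s]", " ", text)


def collapse_spaces(text: str) -> str:
    # Replace runs of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def drop_short_words(text: str, min_len: int = 3) -> str:
    # Assumed threshold: words shorter than three characters are removed.
    return " ".join(word for word in text.split() if len(word) >= min_len)


def clean(text: str) -> str:
    for step in (strip_accents, drop_numbers_and_punctuation,
                 collapse_spaces, drop_short_words):
        text = step(text)
    return text


print(clean("SENTENÇA 123: julgou improcedente o pedido de benefício."))
# -> SENTENCA julgou improcedente pedido beneficio
```

Accents are stripped first so that the letter-only filter in the second step does not discard accented characters outright.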

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

# Load Portuguese stopwords from NLTK
stopwords_nltk = stopwords.words('portuguese')

# Create a TF-IDF vectorizer with specific parameters:
# - stop_words: remove common Portuguese stopwords
# - max_features: limit the number of features to 1000
# - ngram_range: consider unigrams and bigrams
# - min_df: ignore terms that appear in fewer than 5 documents
# - max_df: ignore terms that appear in more than 80% of documents
# - lowercase: convert all text to lowercase
vectorizer = TfidfVectorizer(
    stop_words=stopwords_nltk, 
    max_features=1000, 
    ngram_range=(1, 2), 
    min_df=5, 
    max_df=0.8, 
    lowercase=True
)

# Fit the vectorizer using only the training data
# This ensures that the model does not have access to the validation or test data during training
vectorizer.fit(df_train.clean_text)

# Transform the text data into TF-IDF vectors
# This converts the text data into numerical form suitable for machine learning models
X_train = vectorizer.transform(df_train.clean_text)
X_val = vectorizer.transform(df_val.clean_text)
X_test = vectorizer.transform(df_test.clean_text)

# Extract the target labels for training, validation, and testing sets
# These labels will be used to train and evaluate the machine learning model
y_train = df_train.outcome_int
y_val = df_val.outcome_int
y_test = df_test.outcome_int

In [20]:
df_train[['clean_text', 'outcome_int']].sample(10, random_state=271828)

| | clean_text | outcome_int |
|---|---|---|
| 13171 | EMENTA PREVIDENCIARIO AUXILIO DOENCA LAUDO DES... | 1 |
| 3344 | EMENTA PREVIDENCIARIO BENEFICIO ASSISTENCIAL L... | 0 |
| 2508 | VOTO Dispensado relatorio nos termos art Lei a... | 0 |
| 17639 | PROCESSO EMENTA PROCESSO CIVIL ASSISTENCIA SOC... | 0 |
| 17063 | PROCESSO EMENTA ADMINISTRATIVO SERVIDOR PUBLIC... | 0 |
| 15959 | EMENTA PREVIDENCIARIO AUXILIO DOENCA APOSENTAD... | 0 |
| 668 | RECURSO INOMINADO DIREITO PREVIDENCIARIO PEDID... | 0 |
| 4637 | RECURSO INOMINADO DIREITO ADMINISTRATIVO SEGUR... | 1 |
| 19977 | VOTO EMENTA PREVIDENCIARIO BENEFICIO PREVIDENC... | 0 |
| 1950 | PROCESSO RECORRENTE MARIA LOURDES CONCEICAO RE... | 0 |


In [21]:
X_train.shape, X_val.shape, X_test.shape

((16000, 1000), (2000, 1000), (2000, 1000))

In [22]:
y_train.shape, y_val.shape, y_test.shape

((16000,), (2000,), (2000,))

In [23]:
X_train[0]

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 244 stored elements and shape (1, 1000)>

In [24]:
X_train = X_train.toarray()
X_val = X_val.toarray()
X_test = X_test.toarray()

In [25]:
X_train[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.05059559, 0.        , 0.        ,
       0.        , 0.        , 0.0465296 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.0376958 , 0.        ,
       0.02821556, 0.03924224, 0.04661888, 0.04740649, 0.        ,
       0.        , 0.        , 0.        , 0.04219101, 0.04322569,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.05352028, 0.        ,
       0.03443451, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.03712143, 0.04413272, 0.        , 0.        ,
       0.        , 0.03417699, 0.        , 0.12682283, 0.05464034,
       0.        , 0.04136232, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [26]:
import numpy as np

def get_top_ngrams(X_train: np.ndarray, vectorizer: TfidfVectorizer, top_n: int = 30) -> np.ndarray:
    """
    Get the n-grams with the highest total TF-IDF weight in the vectorized text data.

    Note that the ranking uses summed TF-IDF weights, not raw occurrence counts.

    Args:
        X_train (np.ndarray): The dense TF-IDF matrix (documents x n-grams).
        vectorizer (TfidfVectorizer): The fitted vectorizer used to transform the text data.
        top_n (int, optional): The number of top n-grams to return. Defaults to 30.

    Returns:
        np.ndarray: The top n n-grams, ordered from highest to lowest total weight.
    """
    # Sum over documents (rows) to get the total TF-IDF weight of each n-gram
    total_ngram_weights = np.sum(X_train, axis=0)

    # Sort the n-gram indices by total weight, highest first
    sorted_ngrams_indices = np.argsort(total_ngram_weights)[::-1]

    # Keep the indices of the top n n-grams
    top_ngrams_indices = sorted_ngrams_indices[:top_n]

    # Get the names of the n-grams corresponding to the top n indices
    ngram_names = np.array(vectorizer.get_feature_names_out())

    return ngram_names[top_ngrams_indices]

# Use the function to get the top 30 n-grams from the training data
top_ngrams = get_top_ngrams(X_train, vectorizer, top_n=30)
top_ngrams

array(['incapacidade', 'beneficio', 'doenca', 'auxilio', 'atividade',
       'auxilio doenca', 'aposentadoria', 'segurado', 'laudo',
       'concessao', 'prova', 'especial', 'rural', 'fgts', 'anexo',
       'autor', 'monetaria', 'inss', 'periodo', 'pericial', 'invalidez',
       'correcao', 'trabalho', 'anos', 'tempo', 'aposentadoria invalidez',
       'social', 'prazo', 'carencia', 'direito'], dtype=object)

## Step 3 - Model training

In [27]:
import time

# Import various classifiers and utilities from scikit-learn and other libraries

# LightGBM classifier, a gradient boosting framework that uses tree-based learning algorithms
from lightgbm import LGBMClassifier

# CalibratedClassifierCV for probability calibration of classifiers
from sklearn.calibration import CalibratedClassifierCV

# Ensemble classifiers from scikit-learn
# ExtraTreesClassifier and RandomForestClassifier are ensemble methods that use multiple decision trees
# StackingClassifier allows combining multiple classifiers to improve performance
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier, StackingClassifier)

# Linear models from scikit-learn
# LogisticRegression is a linear model for binary and multiclass classification
# SGDClassifier is a linear classifier using stochastic gradient descent
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Metrics for evaluating classification performance
# accuracy_score, balanced_accuracy_score, classification_report, confusion_matrix, f1_score, matthews_corrcoef
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, classification_report, confusion_matrix, f1_score, matthews_corrcoef)

# Naive Bayes classifier for multinomially distributed data
from sklearn.naive_bayes import MultinomialNB

# K-Nearest Neighbors classifier
from sklearn.neighbors import KNeighborsClassifier

# Neural network-based classifier
from sklearn.neural_network import MLPClassifier

# Support Vector Machine classifiers
# SVC is a kernelized support vector classifier (RBF kernel by default)
# LinearSVC is a support vector classifier restricted to a linear kernel, which scales better to large datasets
from sklearn.svm import SVC, LinearSVC

# Decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# XGBoost classifier, an optimized distributed gradient boosting library
from xgboost import XGBClassifier

### Basic Guide to Classifiers

This section provides a systematic and detailed explanation of various classifiers commonly used in machine learning, focusing on their working principles and typical use cases. 

#### 1. Calibrated-LSVC: CalibratedClassifierCV with LinearSVC

* **LinearSVC:** A linear Support Vector Machine (SVM) classifier that seeks to find the optimal hyperplane separating different classes in a high-dimensional space. This hyperplane maximizes the margin between data points belonging to different classes.

* **CalibratedClassifierCV:** A method used to calibrate the output of classifiers that do not inherently provide well-calibrated probabilities.  Since `LinearSVC` primarily focuses on separating classes without directly estimating probabilities, `CalibratedClassifierCV` is employed to obtain more reliable probability estimates. This combination results in a calibrated SVM model.

<p align="center">
  <img src="images/linear_svm.png" alt="" style="width: 40%; height: 40%"/>
</p>
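
A minimal sketch of this combination is shown below. It uses synthetic three-class data from `make_classification` in place of the TF-IDF features (an assumption made so that the example is self-contained), and the sigmoid calibration method with 3-fold cross-validation is an illustrative choice.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the TF-IDF features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

svm = LinearSVC()
# LinearSVC exposes decision_function but has no predict_proba method.
print(hasattr(svm, "predict_proba"))

# Wrap the SVM so its decision scores are mapped to calibrated probabilities.
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba.round(3))  # one row per sample, one column per class
```

Having probabilities matters later for anything that consumes class scores rather than hard labels, such as stacking or threshold tuning.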

#### 2. LR: Logistic Regression 

Logistic Regression is a widely used classification algorithm that predicts the probability of an instance belonging to a particular class. Despite what its name might suggest, it is used for classification rather than regression. Here's how it works:

* **Logistic Function (Sigmoid):**  It utilizes a sigmoid function to map the output of a linear combination of input features to a probability value between 0 and 1. 
* **Decision Boundary:** Based on the calculated probability, the instance is assigned to the class with a probability greater than 0.5.  The point at which the probability equals 0.5 defines the decision boundary separating the classes.
* **Use Cases:** Known for its simplicity and efficiency, Logistic Regression finds applications in various domains, especially in binary classification problems but also extendable to multi-class scenarios using techniques like one-vs-rest or softmax.

<p align="center">
  <img src="images/logistic_regression.png" alt="" style="width: 40%; height: 40%"/>
</p>
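
The sigmoid mapping and the 0.5 decision threshold can be sketched in a few lines of plain Python. The weights, bias, and sample below are made-up values chosen only to illustrate the arithmetic; a fitted model would learn them from data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a model with two input features
weights = [1.5, -2.0]
bias = 0.25


def predict_proba(features):
    """Probability of the positive class for one sample."""
    linear_score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(linear_score)


def predict(features, threshold=0.5):
    """Assign class 1 when the probability exceeds the threshold."""
    return int(predict_proba(features) > threshold)


sample = [1.0, 0.5]  # linear score: 1.5 * 1.0 - 2.0 * 0.5 + 0.25 = 0.75
print(round(predict_proba(sample), 4), predict(sample))
```

A linear score of exactly 0 gives a probability of 0.5, which is why the decision boundary is the set of points where the linear combination equals zero.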

#### 3. RF: Random Forest Classifier

Random Forest is a powerful ensemble learning method that operates by constructing a multitude of decision trees during training.

* **Ensemble Approach:**  Instead of relying on a single decision tree, Random Forest leverages the power of multiple trees to make predictions. Each tree is trained on a different random subset of the training data (bootstrapping) and potentially a random subset of features. This randomness helps to reduce overfitting and improve generalization.
* **Prediction:**  To classify a new instance, the Random Forest combines the predictions from all its individual trees. It employs a voting mechanism, where the class receiving the most votes from the trees is chosen as the final prediction. 
* **Advantages:** This ensemble approach results in a classifier that is more robust, accurate, and less prone to overfitting compared to individual decision trees.

<p align="center">
  <img src="images/random_forest.png" alt="" style="width: 40%; height: 40%"/>
</p>

#### 4. LGBM: Light Gradient Boosting Machine Classifier
#### 5. XGB: XGBoost Classifier
#### 13. CatBoost: CatBoost Classifier

These three classifiers (LGBMClassifier, XGBClassifier, CatBoostClassifier) belong to the family of **gradient boosting algorithms** and share core principles while differing in specific implementation details and optimizations:

* **Gradient Boosting Framework:** They are ensemble methods that sequentially build an ensemble of weak learners, typically decision trees, to create a strong classifier. Each tree attempts to correct the errors made by the previous trees.
* **Tree-Based Learning:** All three use decision trees as weak learners, which handle numerical features directly and can accommodate categorical features (natively or after encoding, depending on the library).
* **Efficiency and Scalability:** They are designed to be efficient and scalable, handling large datasets and high-dimensional feature spaces effectively.

Here's a breakdown of their key characteristics:

* **LGBM (LightGBM):** Prioritizes speed and memory efficiency. It uses two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to improve efficiency without sacrificing accuracy.
* **XGBoost:**  Widely recognized for its speed and performance enhancements. It utilizes regularization techniques to prevent overfitting and leverages parallel processing for faster training.
* **CatBoost:**  Specifically designed to handle categorical features effectively.  It incorporates a method called ordered boosting to reduce bias and employs symmetric trees to speed up prediction time. 

**Key Differences and Considerations:**

* **Categorical Feature Handling:** CatBoost excels in handling categorical features.
* **Speed and Scalability:** LightGBM is often favored for its speed and efficiency, especially for large datasets.
* **Regularization and Overfitting:** XGBoost is known for its strong regularization techniques.

The choice among these gradient boosting algorithms often depends on specific dataset characteristics, performance requirements, and the importance of handling categorical features.

<p align="center">
  <img src="images/xgboost_lgbm.png" alt="" style="width: 40%; height: 40%"/>
</p>
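
To make "each tree attempts to correct the errors made by the previous trees" concrete, here is a deliberately simplified sketch: one-split trees (stumps) fitted to the residuals of a squared-error regression on toy data. It is not how LightGBM, XGBoost, or CatBoost are implemented; those libraries use gradients of the classification loss plus many optimizations. The data, the helper `fit_stump`, and the learning rate are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Return the one-split regression tree (stump) that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on each side
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


# Toy one-dimensional data with a step between x = 4 and x = 5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
learning_rate = 0.5

# Start from a constant prediction, then add one stump per boosting round
predictions = [sum(ys) / len(ys)] * len(ys)
mse_history = []
for _ in range(10):
    residuals = [y - p for y, p in zip(ys, predictions)]  # what is still unexplained
    mse_history.append(sum(r ** 2 for r in residuals) / len(ys))
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

print([round(m, 4) for m in mse_history])  # training error per round
```

Each round fits only what the current ensemble still gets wrong, and the learning rate shrinks every correction so that no single tree dominates.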

#### 6. MLP: Multi-Layer Perceptron Classifier

* **Neural Network Structure:** The MLP is a type of feedforward artificial neural network, a biologically inspired computing model loosely based on the structure of the human brain. It consists of interconnected nodes (neurons) organized in layers: an input layer that receives the data, one or more hidden layers responsible for learning complex patterns, and an output layer producing the classification result.
* **Non-Linear Activation Functions:** MLPs use non-linear activation functions (like sigmoid, ReLU) in the hidden layers. These functions introduce non-linearity, enabling the network to learn complex relationships between features that linear models might miss.  
* **Backpropagation for Learning:** MLPs learn from data through a process called backpropagation. During training, the network adjusts its internal weights based on the difference between its predicted output and the actual target values, minimizing errors over time.

<p align="center">
  <img src="images/mlp.png" alt="" style="width: 40%; height: 40%"/>
</p>

#### 7. SGD: Stochastic Gradient Descent Classifier

* **Linear Classifier:** SGDClassifier implements a linear classifier that finds the optimal hyperplane to separate different classes. 
* **Stochastic Gradient Descent (SGD):** The key difference lies in its optimization algorithm:  it employs Stochastic Gradient Descent (SGD) to learn from the data. Instead of updating weights based on the error calculated from the entire dataset (as in traditional gradient descent), SGD updates weights using the error from a single data point or a small batch of data. This makes SGD computationally less demanding, especially for large datasets.
* **Flexibility in Loss Function and Regularization:** SGD offers flexibility. It can be used with different loss functions (like hinge loss for SVM-like behavior or log loss for logistic regression) and allows incorporating various regularization techniques (L1, L2) to prevent overfitting.

<p align="center">
  <img src="images/sgd.png" alt="" style="width: 40%; height: 40%"/>
</p>
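
The per-sample update that distinguishes SGD from full-batch gradient descent can be sketched without any library. The example below trains a one-feature linear classifier with log loss on four toy points; the data, learning rate, and epoch count are assumptions, and the samples are visited in a fixed order (instead of being shuffled) only to keep the run reproducible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy, linearly separable data: negative values are class 0, positive values class 1
X_demo = [[-2.0], [-1.0], [1.0], [2.0]]
y_demo = [0, 0, 1, 1]

weights, bias = [0.0], 0.0
learning_rate = 0.5

for epoch in range(50):
    for features, target in zip(X_demo, y_demo):
        # Weights are updated after every single sample, not after a full pass
        probability = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        error = probability - target  # gradient of the log loss w.r.t. the linear score
        weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
        bias -= learning_rate * error

predicted = [
    int(sigmoid(sum(w * x for w, x in zip(weights, features)) + bias) > 0.5)
    for features in X_demo
]
print(weights, bias, predicted)
```

Swapping the `error` term for the gradient of the hinge loss would give SVM-like behavior, which mirrors the role of the `loss` parameter in `SGDClassifier`.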

#### 8. NB: Multinomial Naive Bayes

* **Probabilistic Classifier:** Naive Bayes is a probabilistic classifier based on Bayes' theorem with a strong "naive" assumption of independence among features. This means it assumes that the presence or absence of a particular feature in a class is independent of the presence or absence of other features.
* **Multinomial Variant:** The Multinomial Naive Bayes classifier is specifically designed to handle discrete features, making it well-suited for text classification tasks. 
    * **Text Classification Example:** Imagine classifying documents into categories like sports, politics, and entertainment. The presence of words like "athlete," "election," or "movie" can be strong indicators of the respective categories.  Each feature in this case might represent the frequency of occurrence of a specific word in the document. 

> The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
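
The word-count reasoning in the text classification example can be sketched as a tiny Multinomial Naive Bayes written from scratch. The six training documents, the function names, and the smoothing value `alpha=1.0` (Laplace smoothing) are assumptions for illustration; `MultinomialNB` in scikit-learn works on the same principle but on sparse count or tf-idf matrices.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = {word for doc in docs for word in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary)))
            for word in vocabulary
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Pick the class with the highest posterior score; unknown words are ignored."""
    scores = {
        label: log_prior + sum(log_likelihood[w] for w in doc.split() if w in log_likelihood)
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)


docs = [
    "athlete wins match", "athlete scores goal",
    "election vote count", "election campaign speech",
    "movie premiere tonight", "movie actor award",
]
labels = ["sports", "sports", "politics", "politics", "entertainment", "entertainment"]

nb_model = train_multinomial_nb(docs, labels)
print(predict(nb_model, "new movie award"))
print(predict(nb_model, "election vote"))
```

Summing log likelihoods word by word is exactly where the "naive" independence assumption enters: each word contributes to the score independently of the others.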

#### 9. LSVC: Linear Support Vector Classification

`LinearSVC`, like its counterpart `CalibratedClassifierCV(LinearSVC)`, implements a linear Support Vector Machine classifier. The core functionality remains identical—finding the optimal hyperplane in a high-dimensional space to separate data points belonging to different classes effectively.

The primary distinction lies in probability prediction:

* `LinearSVC`: Focuses solely on separating classes using a hyperplane and does not inherently provide probability estimates for predictions.
* `CalibratedClassifierCV(LinearSVC)`: Combines `LinearSVC` with probability calibration, allowing it to provide probability estimates for class predictions. 

#### 10. KNN: K-Nearest Neighbors Classifier

* **Instance-Based Learning:**  K-Nearest Neighbors (KNN) is a simple and intuitive algorithm that classifies new data points based on their proximity to known data points. It falls under the category of instance-based learning or lazy learning, where the algorithm doesn't explicitly learn a model from the training data but memorizes the training instances.
* **Distance Metric:** KNN relies on a distance metric, commonly Euclidean distance, to determine closeness between data points in the feature space. 
* **"K" Parameter:** When classifying a new instance, KNN identifies the 'K' nearest neighbors to the instance in the feature space. The value of 'K' is a user-defined parameter.
* **Majority Voting:** Once the 'K' nearest neighbors are identified, their class labels are used to classify the new instance. A majority voting scheme is employed, meaning that the new instance is assigned the class label that appears most frequently among its 'K' neighbors.

<p align="center">
  <img src="images/knn.png" alt="" style="width: 40%; height: 40%"/>
</p>
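
The three ingredients above (distance metric, `K`, majority vote) fit in a short from-scratch sketch. The toy points, labels, and the function name `knn_predict` are assumptions for illustration; `KNeighborsClassifier` adds efficient neighbor search on top of the same idea.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

There is no training step at all: the "model" is the stored data, so the cost is paid at prediction time, when distances to the training points are computed.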

#### 11. DT: Decision Tree Classifier
#### 12. ET: Extra Trees Classifier

* **Tree-Based Classifiers:** Both Decision Tree and Extra Trees classifiers are tree-based methods that recursively partition the input space based on feature values to make predictions. Imagine a flowchart where each node represents a decision based on a specific feature.  The branches represent possible outcomes, ultimately leading to a leaf node that indicates the predicted class label.

Let's differentiate these two:

**Decision Tree:**

* **Greedy Splitting:** It greedily searches for the best feature and threshold to split the data at each node. This search aims to maximize information gain or reduce impurity.
* **Prone to Overfitting:** Without proper pruning or limitations, decision trees can become overly complex, memorizing the training data, which often results in overfitting.

**Extra Trees (Extremely Randomized Trees):**

* **Random Subset of Features:** Like Random Forest, Extra Trees introduces randomization by selecting a random subset of features for splitting at each node.
* **Random Thresholds:** It goes a step further by also randomly selecting the splitting thresholds for each feature.
* **Advantages:** This additional randomness helps to reduce variance and prevent overfitting. Because it skips the search for the optimal threshold, each tree is also cheaper to build than in a Random Forest.

> While prone to overfitting, decision trees' performance can be significantly improved when used in ensemble methods like Random Forests. Extra Trees is itself such an ensemble: it combines the predictions of many extremely randomized trees.


### Metrics Explanation

Let's explore the key metrics used for evaluating classification models, including their mathematical formulas and Python code fragments. Understanding these metrics and their calculations will help you interpret your model's performance more effectively.

#### 1. F1 Score
```python
f1 = f1_score(y, pred, average='micro')
```
The **F1 Score** is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall, making it particularly useful when you need to consider both false positives and false negatives.

- **Precision**: The ratio of correctly predicted positive observations to the total predicted positives.
- **Recall**: The ratio of correctly predicted positive observations to all observations in the actual class.

**Formula**:
$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Where:
- $\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
- $\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$

TP = True Positives, FP = False Positives, FN = False Negatives

**Range**: 0 to 1, where 1 is the best possible score.

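When using `average='micro'`, the F1 score is calculated globally by counting the total true positives, false negatives, and false positives across all classes. In a single-label multiclass problem such as this one, every misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 is identical to accuracy (which is why the F1 and ACC values in the results further down coincide). Use `average='macro'` to weight every class equally.

> **Note**: The F1 Score is especially useful in scenarios with imbalanced classes, where one class is significantly more frequent than others. That benefit comes from the per-class or macro-averaged scores, not from the micro average.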
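
As a worked example, the sketch below recomputes micro and macro F1 by hand from the Calibrated-LSVC validation confusion matrix printed later in this notebook. It assumes the scikit-learn layout (rows are true classes, columns are predicted classes), which is consistent with the support column of that report.

```python
# Calibrated-LSVC confusion matrix from the validation run (rows: true, columns: predicted)
confusion = [
    [1474, 14, 3],
    [23, 286, 19],
    [24, 41, 116],
]
n_classes = len(confusion)
total = sum(sum(row) for row in confusion)

tp = [confusion[i][i] for i in range(n_classes)]
fp = [sum(confusion[r][i] for r in range(n_classes)) - tp[i] for i in range(n_classes)]
fn = [sum(confusion[i]) - tp[i] for i in range(n_classes)]

# Micro average: pool the counts over all classes, then apply the F1 formula once
micro_f1 = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))

# Macro average: compute F1 per class, then take the unweighted mean
per_class_f1 = [2 * tp[i] / (2 * tp[i] + fp[i] + fn[i]) for i in range(n_classes)]
macro_f1 = sum(per_class_f1) / n_classes

accuracy = sum(tp) / total

print(f"micro F1 = {micro_f1:.4f}, accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
print([round(f, 2) for f in per_class_f1])
```

The pooled false positives and false negatives are the same off-diagonal cells counted twice, which is why micro F1 collapses to accuracy here, while the macro average is pulled down by the weaker minority class.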

---

#### 2. Balanced Accuracy
```python
bacc = balanced_accuracy_score(y, pred)
```
**Balanced Accuracy** is the average of recall obtained on each class. This metric is particularly effective for imbalanced datasets, where some classes are underrepresented.

- **Recall (Sensitivity)**: The ability of the classifier to find all the positive samples.

**Formula**:
$$\text{Balanced Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{TP}_i}{\text{TP}_i + \text{FN}_i}$$

Where $n$ is the number of classes, and $\text{TP}_i$ and $\text{FN}_i$ are the true positives and false negatives for class $i$, respectively.

**Range**: 0 to 1, where 1 indicates perfect accuracy on all classes.

> **Example**: In a medical diagnosis scenario, balanced accuracy helps ensure that the model performs well across both common and rare conditions.
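
The formula can be checked by hand against the stacking confusion matrix reported later in this notebook. The sketch assumes rows are true classes and columns are predicted classes, as returned by `confusion_matrix`.

```python
# Stacking confusion matrix from the validation run (rows: true, columns: predicted)
confusion = [
    [1459, 20, 12],
    [10, 288, 30],
    [1, 15, 165],
]

# Recall per class: diagonal entry divided by the row total (TP_i / (TP_i + FN_i))
per_class_recall = [row[i] / sum(row) for i, row in enumerate(confusion)]

balanced_accuracy = sum(per_class_recall) / len(per_class_recall)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(map(sum, confusion))

print([round(r, 4) for r in per_class_recall])
print(f"balanced accuracy = {balanced_accuracy:.4f}, accuracy = {accuracy:.4f}")
```

Plain accuracy is dominated by the 1,491 samples of class 0, whereas balanced accuracy gives the 181 samples of class 2 the same weight as every other class.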

---

#### 3. Accuracy
```python
acc = accuracy_score(y, pred)
```
**Accuracy** measures the proportion of correct predictions out of the total number of samples. It is a straightforward metric but can be misleading in the presence of class imbalance.

**Formula**:
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

Where TN = True Negatives

**Range**: 0 to 1, where 1 means all predictions are correct.

> **Important**: In datasets with imbalanced classes, accuracy might give an inflated sense of performance if the model is biased towards the majority class.

---

#### 4. Classification Report
```python
cr = classification_report(y, pred)
```
The **Classification Report** provides a comprehensive summary of various classification metrics for each class, including:

- **Precision**: How many selected items are relevant.
- **Recall**: How many relevant items are selected.
- **F1-Score**: The harmonic mean of precision and recall.
- **Support**: The number of actual occurrences of the class in the dataset.

The formulas for these metrics are:

- **Precision**: $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$
- **Recall**: $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **F1-Score**: $$2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

> **Usage**: This report is invaluable for gaining detailed insights into the performance of your model on a per-class basis, helping identify which classes are well-predicted and which are not.

---

#### 5. Matthews Correlation Coefficient (MCC)
```python
mcc = matthews_corrcoef(y, pred)
```
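The **Matthews Correlation Coefficient (MCC)** measures the quality of a classification by taking all cells of the confusion matrix into account: true and false positives and negatives. It is regarded as a balanced measure even with imbalanced classes. It was originally defined for binary problems; `matthews_corrcoef` also supports the multiclass case, which is how it is used in this notebook.

**Formula** (binary case):
$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$

**Range**: -1 to 1, where 1 indicates perfect prediction, 0 no better than random prediction, and -1 total disagreement. In the multiclass case the maximum is still 1, but the minimum lies somewhere between -1 and 0, depending on the number and distribution of the true labels.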

> **Analogy**: Think of MCC as a correlation coefficient between the observed and predicted binary classifications. It provides a more informative and truthful score in scenarios where class imbalance is a concern.
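
Since this notebook has three classes, the sketch below uses the multiclass generalization of MCC expressed in terms of the confusion matrix (the form described in the scikit-learn documentation for `matthews_corrcoef`), applied to the stacking confusion matrix reported later on. The variable names are my own; for two classes the expression reduces to the binary formula above.

```python
import math

# Stacking confusion matrix from the validation run (rows: true, columns: predicted)
confusion = [
    [1459, 20, 12],
    [10, 288, 30],
    [1, 15, 165],
]
n_classes = len(confusion)

total = sum(map(sum, confusion))                            # s: number of samples
correct = sum(confusion[k][k] for k in range(n_classes))    # c: correctly predicted samples
true_counts = [sum(row) for row in confusion]               # t_k: times class k truly occurred
pred_counts = [sum(row[k] for row in confusion) for k in range(n_classes)]  # p_k: times class k was predicted

numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
denominator = math.sqrt(
    (total ** 2 - sum(p * p for p in pred_counts))
    * (total ** 2 - sum(t * t for t in true_counts))
)
mcc = numerator / denominator if denominator else 0.0  # 0 by convention when undefined

print(f"MCC = {mcc:.4f}")
```

The numerator compares the observed agreement with the agreement expected from the class frequencies alone, which is what makes the score robust to imbalance.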

---

#### 6. Confusion Matrix
```python
cm = confusion_matrix(y, pred)
```
A **Confusion Matrix** is a table used to describe the performance of a classification model. It provides a breakdown of correct and incorrect predictions by each class.

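For a binary classification problem with the positive class listed first, it takes the form shown below. Note that scikit-learn's `confusion_matrix` orders rows (true labels) and columns (predicted labels) by sorted label value, so for labels 0 and 1 it returns `[[TN, FP], [FN, TP]]`.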

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

From this matrix, various metrics can be derived:

- **Sensitivity (Recall)**: $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **Specificity**: $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$
- **Precision**: $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$

> **Visualization**: The confusion matrix helps visualize the performance of your model, making it easier to identify which classes are being misclassified.


### Goodhart's Law and Classification Metrics: Balancing Numbers with Real-World Impact

> "When a measure becomes a target, it ceases to be a good measure." - Goodhart's Law

#### The Essence of Goodhart's Law in Machine Learning

Goodhart's Law, encapsulated in the quote above, is a critical concept in machine learning that reminds us to look beyond mere numbers. This principle is particularly relevant when interpreting and applying classification metrics, as it highlights the potential pitfalls of overly focusing on optimizing a single measure.

#### Unpacking Classification Metrics

Classification metrics are essential tools for quantifying model performance, but they should be understood as proxies for real-world outcomes rather than ends in themselves. Let's explore some key metrics and their implications:

##### Accuracy: A Deceptive Simplicity

**Accuracy** measures the proportion of correct predictions across all classes. While seemingly straightforward, it can be misleading, especially in imbalanced datasets. For instance:

- In a dataset where 95% of samples belong to class A and 5% to class B, a model always predicting class A would achieve 95% accuracy without providing any valuable insights.
- This metric fails to capture the nuances of misclassification costs, which can vary significantly in real-world scenarios.
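
The 95%/5% scenario is easy to reproduce. The sketch below builds exactly that dataset, scores a "model" that always predicts class A, and compares accuracy with balanced accuracy and MCC; treating class B as the positive class for MCC is an assumption of this example.

```python
import math

# 95 samples of class A, 5 of class B; the "model" always predicts A
y_true = ["A"] * 95 + ["B"] * 5
y_pred = ["A"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(label):
    """Fraction of samples of `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


balanced_accuracy = (recall("A") + recall("B")) / 2

# Binary MCC with "B" as the positive class
tp = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
tn = sum(t == "A" and p == "A" for t, p in zip(y_true, y_pred))
fp = sum(t == "A" and p == "B" for t, p in zip(y_true, y_pred))
fn = sum(t == "B" and p == "A" for t, p in zip(y_true, y_pred))
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0  # 0 by convention when undefined

print(f"accuracy = {accuracy:.2f}, balanced accuracy = {balanced_accuracy:.2f}, MCC = {mcc:.2f}")
```

Accuracy rewards the constant prediction, while balanced accuracy drops to chance level and MCC to zero, which matches the intuition that this model has learned nothing about class B.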

##### Precision and Recall: A Balancing Act

**Precision** focuses on the accuracy of positive predictions, while **Recall** measures the model's ability to find all positive instances.

- $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$


- $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

These metrics are particularly useful when the costs of false positives and false negatives differ. For example:

- In email spam detection, high precision is crucial to avoid marking important emails as spam (false positives).
- In medical diagnosis, high recall is vital to ensure all potential cases of a severe condition are identified, even at the cost of some false alarms.

##### F1-Score: Seeking Balance

The **F1-score** provides a single metric that balances precision and recall: $F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

This metric is particularly useful when you need a balanced measure of a model's performance, especially with imbalanced datasets.


#### Beyond Metrics: The Real-World Perspective

While metrics provide valuable insights, it's crucial to consider their real-world implications:

1. **Context Matters**: The choice of metric should align with the specific problem and its real-world consequences. For instance, in fraud detection, the cost of missing a fraudulent transaction (false negative) might far outweigh the inconvenience of a false alarm (false positive).

2. **Holistic Evaluation**: Rather than fixating on a single metric, consider a combination of measures that provide a more comprehensive view of model performance.

3. **Ethical Considerations**: Metrics don't capture ethical implications or societal impacts. For example, a facial recognition system might have high accuracy but could perpetuate biases if not carefully designed and evaluated.

4. **Interpretability**: Some models might achieve high metric scores but lack interpretability, which can be crucial in fields like healthcare or finance where understanding the decision-making process is essential.

#### Practical Approaches to Metric Selection

1. **Engage with Domain Experts**: Collaborate with subject matter experts to understand the real-world implications of model predictions and choose metrics accordingly.

2. **Consider Multiple Metrics**: Use a combination of metrics to get a more comprehensive view of model performance.

3. **Custom Metrics**: Develop problem-specific metrics that better align with the actual goals of your project.

4. **Evaluate on Multiple Datasets**: Test your model on various datasets to ensure its performance is consistent and generalizable.

5. **Monitor Real-World Performance**: Implement systems to track how well your model performs in actual deployments, not just on test sets.


> Understanding Goodhart's Law in the context of classification metrics reminds us that our ultimate goal is to solve real-world problems, not just optimize numbers. While metrics are invaluable tools for model evaluation and improvement, they should guide our decisions rather than dictate them. By maintaining a holistic view that considers the broader impact and context of our models, we can develop more effective and responsible machine learning solutions.

In [28]:
from typing import List, Tuple

def calculate_evaluation_metrics(y_true: pd.Series, y_pred: pd.Series) -> Tuple[float, float, float, str, float, np.ndarray]:
    """
    Calculate evaluation metrics for model predictions.

    Args:
        y_true (pd.Series): The true labels.
        y_pred (pd.Series): The predicted labels.

    Returns:
        Tuple[float, float, float, str, float, np.ndarray]: The calculated metrics including F1 score, balanced accuracy, accuracy, classification report, Matthews correlation coefficient, and confusion matrix.
    """
    # Calculate F1 score
    f1 = f1_score(y_true, y_pred, average='micro')
    # Calculate balanced accuracy
    balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
    # Calculate accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # Generate classification report
    classification_report_str = classification_report(y_true, y_pred)
    # Calculate Matthews correlation coefficient
    matthews_corr_coeff = matthews_corrcoef(y_true, y_pred)
    # Generate confusion matrix
    confusion_matrix_arr = confusion_matrix(y_true, y_pred)

    return f1, balanced_accuracy, accuracy, classification_report_str, matthews_corr_coeff, confusion_matrix_arr

def train_and_evaluate_models(X_train: pd.DataFrame, y_train: pd.Series, X_valid: pd.DataFrame, y_valid: pd.Series, n_jobs: int = -1) -> Tuple[pd.DataFrame, List[List]]:
    """
    Train multiple models and evaluate their performance.

    Args:
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_valid (pd.DataFrame): The validation data.
        y_valid (pd.Series): The validation labels.
        n_jobs (int, optional): The number of jobs to run in parallel. Defaults to -1.

    Returns:
        Tuple[pd.DataFrame, List[List]]: A dataframe of the evaluation results and a list of classification reports.
    """
    # Define the models to be trained
    models = [
        ('Calibrated-LSVC', CalibratedClassifierCV(LinearSVC(random_state=271828, class_weight='balanced', dual='auto'))),
        ('LR', LogisticRegression(random_state=271828, n_jobs=n_jobs, class_weight='balanced')),
        ('RF', RandomForestClassifier(random_state=271828, n_jobs=n_jobs, class_weight='balanced')),
        ('LGBM', LGBMClassifier(random_state=271828, n_jobs=n_jobs, class_weight='balanced', verbose=-1)),
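        # NOTE: `class_weight` is not an XGBoost parameter, so it has no effect here; pass `sample_weight` to `fit` to reweight classes
        ('XGB', XGBClassifier(random_state=271828, n_jobs=n_jobs, class_weight='balanced', verbosity=0)),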
        ('MLP', MLPClassifier(random_state=271828)),
        ('SGD', SGDClassifier(random_state=271828, n_jobs=n_jobs, class_weight='balanced')),
        ('NB', MultinomialNB()),
        ('LSVC', LinearSVC(random_state=271828, class_weight='balanced', dual='auto')),
        ('KNN', KNeighborsClassifier(n_jobs=n_jobs)),
        ('DT', DecisionTreeClassifier(random_state=271828, class_weight='balanced')),
        ('ExtraTrees', ExtraTreesClassifier(random_state=271828, n_jobs=n_jobs, class_weight='balanced'))
    ]
    
    evaluation_results = []
    classification_reports = []
    
    # Train each model and evaluate its performance
    for model_name, model in models:
        start_time = time.time()  # Record the start time

        try:
            # Train the model
            model.fit(X_train, y_train)
            # Make predictions on the validation set
            predictions = model.predict(X_valid)
        except Exception as e:
            # Handle any exceptions that occur during training or prediction
            print(f'Error {model_name} - {e}')
            continue 

        # Calculate evaluation metrics
        f1, balanced_accuracy, accuracy, classification_report_str, matthews_corr_coeff, confusion_matrix_arr = calculate_evaluation_metrics(y_valid, predictions)
        # Store the classification report and confusion matrix
        classification_reports.append([model_name, classification_report_str, confusion_matrix_arr])

        elapsed_time = time.time() - start_time  # Calculate the elapsed time
        # Append the evaluation results
        evaluation_results.append([model_name, f1, balanced_accuracy, accuracy, matthews_corr_coeff, elapsed_time, confusion_matrix_arr, classification_report_str])

        # Print the evaluation results
        print(f'Name: {model_name} - F1: {f1:.4f} - BACC: {balanced_accuracy:.4f} - ACC: {accuracy:.4f} - MCC: {matthews_corr_coeff:.4f} - Elapsed: {elapsed_time:.2f}s')
        print(classification_report_str)
        print(confusion_matrix_arr)
        print('*' * 20, '\n')

    # Create a DataFrame to store the evaluation results
    results_df = pd.DataFrame(evaluation_results, columns=['Model', 'F1', 'BACC', 'ACC', 'MCC', 'Total Time', 'Confusion Matrix', 'Classification Report'])
    # Convert the confusion matrix to a string for better readability in the DataFrame
    results_df['Confusion Matrix'] = results_df['Confusion Matrix'].apply(lambda x: str(x))

    return results_df, classification_reports

In [29]:
df_results, creports = train_and_evaluate_models(X_train, y_train, X_val, y_val, n_jobs=-1)

Name: Calibrated-LSVC - F1: 0.9380 - BACC: 0.8338 - ACC: 0.9380 - MCC: 0.8456 - Elapsed: 8.50s
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1491
           1       0.84      0.87      0.86       328
           2       0.84      0.64      0.73       181

    accuracy                           0.94      2000
   macro avg       0.88      0.83      0.85      2000
weighted avg       0.94      0.94      0.94      2000

[[1474   14    3]
 [  23  286   19]
 [  24   41  116]]
******************** 

Name: LR - F1: 0.9255 - BACC: 0.8922 - ACC: 0.9255 - MCC: 0.8309 - Elapsed: 6.32s
              precision    recall  f1-score   support

           0       1.00      0.95      0.97      1491
           1       0.82      0.85      0.84       328
           2       0.66      0.88      0.75       181

    accuracy                           0.93      2000
   macro avg       0.83      0.89      0.85      2000
weighted avg       0.94      0.93     

In [30]:
df_results.sort_values(by='MCC', ascending=False)


|    | Model | F1 | BACC | ACC | MCC | Total Time (s) | Confusion Matrix | Classification Report |
|----|-------|----|------|-----|-----|----------------|------------------|-----------------------|
| 3  | LGBM | 0.9565 | 0.909151 | 0.9565 | 0.894218 | 17.729725 | `[[1467 15 9]\n [ 14 291 23]\n [ 4 ...` | `precision recall f1-score ...` |
| 4  | XGB | 0.9555 | 0.895726 | 0.9555 | 0.890957 | 16.173578 | `[[1472 14 5]\n [ 18 293 17]\n [ 4 ...` | `precision recall f1-score ...` |
| 8  | LSVC | 0.948 | 0.894947 | 0.948 | 0.874801 | 2.296809 | `[[1460 19 12]\n [ 12 284 32]\n [ 2 ...` | `precision recall f1-score ...` |
| 6  | SGD | 0.946 | 0.89742 | 0.946 | 0.870491 | 0.598567 | `[[1458 15 18]\n [ 14 276 38]\n [ 2 ...` | `precision recall f1-score ...` |
| 5  | MLP | 0.9405 | 0.868181 | 0.9405 | 0.853977 | 23.12721 | `[[1461 22 8]\n [ 22 281 25]\n [ 13 ...` | `precision recall f1-score ...` |
| 0  | Calibrated-LSVC | 0.938 | 0.833811 | 0.938 | 0.845626 | 8.498214 | `[[1474 14 3]\n [ 23 286 19]\n [ 24 ...` | `precision recall f1-score ...` |
| 1  | LR | 0.9255 | 0.89225 | 0.9255 | 0.830926 | 6.317317 | `[[1413 39 39]\n [ 5 279 44]\n [ 1 ...` | `precision recall f1-score ...` |
| 2  | RF | 0.93 | 0.797775 | 0.93 | 0.826808 | 0.749228 | `[[1475 12 4]\n [ 24 292 12]\n [ 19 ...` | `precision recall f1-score ...` |
| 10 | DT | 0.9105 | 0.815001 | 0.9105 | 0.779225 | 17.67632 | `[[1447 32 12]\n [ 49 239 40]\n [ 10 ...` | `precision recall f1-score ...` |
| 11 | ExtraTrees | 0.907 | 0.722932 | 0.907 | 0.768066 | 0.527333 | `[[1474 14 3]\n [ 34 282 12]\n [ 28 ...` | `precision recall f1-score ...` |


### Ensemble Classifier: Stacking for Enhanced Performance

<p align="center">
  <img src="images/stacking.png" alt="" style="width: 40%; height: 40%"/>
</p>

#### Understanding Ensemble Learning

Ensemble learning combines multiple classifiers and often achieves better performance than any of the individual models. This approach is particularly effective for:

- Tackling complex classification tasks
- Mitigating the limitations of single classifiers
- Improving overall prediction accuracy and robustness

#### The StackingClassifier: A Sophisticated Ensemble Technique

The `StackingClassifier` is an advanced ensemble method that leverages the strengths of multiple base classifiers and a meta-classifier:

1. **Base Classifiers**: Multiple models trained on the original dataset
2. **Meta-Classifier**: A higher-level model that learns to combine predictions from base classifiers

> **Key Insight**: The meta-classifier doesn't just average or vote on base classifier outputs. It learns the optimal way to combine these predictions, potentially capturing complex interactions between base models.

#### How Stacking Works

1. Train multiple base classifiers on the original dataset
2. Use these base classifiers to make predictions on a hold-out set or through cross-validation
3. Use the predictions from step 2 as features to train the meta-classifier
4. For new data, base classifiers make predictions, which are then fed into the meta-classifier for the final prediction
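
The four steps can be spelled out manually to show what `StackingClassifier` automates. This is a sketch under stated assumptions, not the notebook's pipeline: the synthetic data, the two base models, and every variable name are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the real data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# Step 2: out-of-fold predictions, so no base model scores a sample it was trained on
meta_features = np.hstack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")
    for model in base_models
])

# Step 3: the meta-classifier learns how to combine the base predictions
meta_model = LogisticRegression().fit(meta_features, y_demo)

# Step 1 (final fit): base models are refit on all training data for use at prediction time
for model in base_models:
    model.fit(X_demo, y_demo)

# Step 4: new data goes through the base models first, then through the meta-classifier
new_rows = X_demo[:3]
new_meta_features = np.hstack([model.predict_proba(new_rows) for model in base_models])
final_predictions = meta_model.predict(new_meta_features)

print(meta_features.shape, final_predictions)
```

Using out-of-fold predictions in step 2 is the key design choice: if the meta-classifier were trained on predictions the base models made for their own training data, it would learn to trust overfitted outputs.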

#### Choosing Classifiers for Stacking

Selecting diverse classifiers is crucial for maximizing the benefits of stacking:

- **Diversity in Algorithm Types**: Combine linear (e.g., Logistic Regression) and non-linear (e.g., Decision Trees) classifiers
- **Diversity in Strengths**: Include classifiers that excel in different aspects (e.g., handling missing data, feature importance)
- **Complementary Weaknesses**: Choose models whose weaknesses are offset by others' strengths

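> **Note**: In `scikit-learn`, `StackingClassifier` builds the meta-classifier's input from each base classifier with `stack_method='auto'` by default, which tries `predict_proba()`, then `decision_function()`, then `predict()`. Base classifiers without `predict_proba()` therefore still work. Common choices for the meta-classifier include Logistic Regression (the default) or Random Forest.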

#### Practical Considerations

- **Computational Cost**: Stacking can be computationally expensive, especially with many base classifiers or large datasets
- **Overfitting Risk**: Careful cross-validation is necessary to prevent overfitting, particularly in the meta-classifier stage
- **Interpretability**: While stacking often improves performance, it can reduce model interpretability

#### Implementation Note

For classifiers lacking `predict_proba()` (e.g., LinearSVC), the `CalibratedClassifierCV` wrapper can be used:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Wrap the SVM so that it exposes predict_proba()
calibrated_svc = CalibratedClassifierCV(LinearSVC())
```

This calibration step adds probability estimation capabilities to otherwise non-probabilistic classifiers.


In [31]:

# Initialize individual models with specific parameters
random_forest_model = RandomForestClassifier(random_state=271828, n_jobs=-1, class_weight='balanced')
lgbm_model = LGBMClassifier(random_state=271828, n_jobs=-1, class_weight='balanced', verbose=-1)
lsvc_model = LinearSVC(random_state=271828, class_weight='balanced', dual='auto')

# List of base estimators for stacking
base_estimators = [
    ('random_forest', random_forest_model),
    ('lsvc', lsvc_model),  # plain (uncalibrated) LinearSVC; stacking falls back to decision_function
    ('lgbm', lgbm_model),

]

# Initialize the stacking classifier with base estimators and a final estimator
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(random_state=271828, n_jobs=-1, class_weight='balanced'),
    n_jobs=2,
    cv=5
)

# Fit the stacking model on the training data
stacking_model.fit(X_train, y_train)

# Predict on the validation data
predictions = stacking_model.predict(X_val)

# Calculate evaluation metrics for the predictions
f1_score_val, balanced_accuracy_val, accuracy_val, classification_report_val, matthews_corr_coeff_val, confusion_matrix_val = calculate_evaluation_metrics(y_val, predictions)

# Print the evaluation metrics
print(f'F1: {f1_score_val:.4f} - BACC: {balanced_accuracy_val:.4f} - ACC: {accuracy_val:.4f} - MCC: {matthews_corr_coeff_val:.4f}')
print(classification_report_val)
print(confusion_matrix_val)

# Took 2 minutes to run on a 48-core CPU
# Best MCC was 0.894218
# Best MCC is now 0.8947



F1: 0.9560 - BACC: 0.9227 - ACC: 0.9560 - MCC: 0.8947
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      1491
           1       0.89      0.88      0.88       328
           2       0.80      0.91      0.85       181

    accuracy                           0.96      2000
   macro avg       0.89      0.92      0.91      2000
weighted avg       0.96      0.96      0.96      2000

[[1459   20   12]
 [  10  288   30]
 [   1   15  165]]


As we've seen above, combining multiple classifiers into a single model can improve our results (here only marginally, from an MCC of 0.8942 to 0.8947) at the cost of increased training time.



### Hyperparameter Optimization

<p align="center">
  <img src="images/hyperparameter_optimization.png" alt="" style="width: 40%; height: 40%"/>
</p>


Hyperparameter optimization is the process of selecting the best set of hyperparameters for a machine learning model. These hyperparameters control the model's behavior and directly impact its performance. Finding the optimal values can often lead to better results in terms of accuracy, precision, or other evaluation metrics.

Hyperparameter optimization can be written as $x^* = \arg\min_{x \in \mathcal{X}} f(x)$, where:
- $x^*$ is the optimal set of hyperparameters
- $\mathcal{X}$ is the hyperparameter space and $x$ is one configuration in it
- $f(x)$ is the objective function to be minimized (to maximize a metric such as MCC, minimize its negative)

The objective function can be defined based on various criteria, such as maximizing accuracy, minimizing loss, or optimizing a specific metric.

There are several methods of hyperparameter optimization, including:

<p align="center">
  <img src="images/hyperparameter_optimization2.jpg" alt="" style="width: 40%; height: 40%"/>
</p>

#### 1. Grid Search
Grid Search is an exhaustive search technique that evaluates all possible combinations of hyperparameter values within a specified range. It explores the entire parameter space but can be computationally expensive, especially for large datasets or complex models.

**Key Points:**
- **Exhaustive Search:** Evaluates every possible combination within the specified grid.
- **Computational Cost:** Can be very high, particularly with many hyperparameters or large datasets.
- **Use Cases:** Best for smaller hyperparameter spaces or when computational resources are not a constraint.
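
To make the cost of exhaustive search concrete, the sketch below enumerates a grid with `itertools.product` and counts the resulting model fits. The grid mirrors the one used for `SGDClassifier` later in this section; the names `combinations` and `n_fits` are illustrative, and the count ignores the final refit on the full training set.

```python
from itertools import product

# Same hyperparameter grid as the SGDClassifier search in this section
param_grid = {
    'loss': ['hinge', 'log_loss', 'squared_hinge', 'modified_huber'],
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    'max_iter': [1000, 2000, 3000, 4000, 5000],
    'penalty': ['l2', 'l1', 'elasticnet'],
}
cv_folds = 3

# Cartesian product of all value lists: one dict per configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each configuration is trained once per cross-validation fold
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 300 900
```

Adding one more hyperparameter with five candidate values would multiply both numbers by five, which is why grids are usually kept small.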

#### 2. Random Search
Random Search involves randomly sampling the hyperparameter space and evaluating the model's performance at each sampled point. This method is less computationally intensive than Grid Search, as it does not explore every possible combination. While there is no guarantee of finding the optimal values, in many cases, Random Search can provide reasonably good results within fewer iterations.

**Key Points:**
- **Random Sampling:** Selects random combinations of hyperparameters.
- **Efficiency:** Often finds good solutions faster than Grid Search.
- **Use Cases:** Suitable when the hyperparameter space is large and computational resources are limited.

#### 3. Bayesian Optimization
Bayesian Optimization is a sequential model-based optimization technique that incorporates prior knowledge about the hyperparameter space. It uses a probabilistic model (e.g., Gaussian Processes) to predict the performance of various hyperparameter settings and selects the next candidate based on a predefined acquisition function. This approach often needs fewer evaluations than Grid Search or Random Search, because it chooses points in the hyperparameter space using information from previous evaluations.

**Key Points:**
- **Probabilistic Model:** Uses models like Gaussian Processes to predict performance.
- **Sequential Approach:** Selects new hyperparameters based on past evaluations.
- **Efficiency:** Often more efficient than Grid and Random Search, especially for complex models.
- **Use Cases:** Ideal when the search space is large and evaluations are expensive.


> **TLDR:** Hyperparameter optimization plays a crucial role in improving a model's performance. The choice of optimization method depends on factors such as available computational resources, search space complexity, and the desired balance between exploration and exploitation. Each method has its advantages and limitations, and it's essential to select the one that best suits the specific problem at hand.

To save time, we'll perform each kind of optimization on the hyperparameters of our `SGDClassifier` (the fastest of our top classifiers). In your own problem, you may want to tune the best-performing classifier instead.

#### Performing grid search

In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.linear_model import SGDClassifier
from typing import Dict, Any
import pandas as pd
import warnings
from sklearn.exceptions import ConvergenceWarning

def perform_grid_search(model: Any, param_grid: Dict[str, Any], X_train: pd.DataFrame, y_train: pd.Series, X_val: pd.DataFrame, y_val: pd.Series) -> None:
    """
    Perform grid search to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (Any): The model to be trained.
        param_grid (Dict[str, Any]): The parameter grid for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the grid search using Matthews correlation coefficient
    scorer_mcc = make_scorer(matthews_corrcoef)

    # Create the GridSearchCV object with the model and parameter grid
    grid_search = GridSearchCV(model, param_grid, cv=3, scoring=scorer_mcc, n_jobs=-1)

    # Suppress ConvergenceWarning during grid search
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        
        # Perform the grid search by fitting the training data
        grid_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the grid search
    print("Best parameters: ", grid_search.best_params_)
    print("Best score: ", grid_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set
    # (note: .score() returns mean accuracy for classifiers, not MCC)
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        best_model = grid_search.best_estimator_
        validation_score = best_model.score(X_val, y_val)
    print("Validation score: ", validation_score)

# Define the SGDClassifier model with specific parameters
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight='balanced')

# Define parameter grid for the search
search_parameters = {
    'loss': ['hinge', 'log_loss', 'squared_hinge', 'modified_huber'],  # Different loss functions
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0],  # Regularization strength
    'max_iter': [1000, 2000, 3000, 4000, 5000],  # Number of iterations
    'penalty': ['l2', 'l1', 'elasticnet'],  # Regularization types
}

# Perform grid search
perform_grid_search(sgd_model, search_parameters, X_train, y_train, X_val, y_val)

# Took 17 minutes to run on a 48-core CPU
# It performs 4 * 5 * 5 * 3 combinations of parameters, which is 300 combinations in total.
# Since it uses Cross Validation with 3 folds, it will train 900 models in total!!!!!

# Best parameters:  {'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 1000, 'penalty': 'l1'}
# Validation accuracy (from .score()):  0.947
# The best valid MCC for the SGDClassifier was 0.870491


Best parameters:  {'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 1000, 'penalty': 'l1'}
Best score:  0.8545240719127332
Validation score:  0.947
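
Note that `best_model.score(X_val, y_val)` returns mean accuracy for scikit-learn classifiers, so the validation score of 0.947 is an accuracy, while the cross-validated best score of 0.8545 is an MCC; the two are not directly comparable. A minimal sketch of the distinction, using synthetic imbalanced data from `make_classification` (an assumption for illustration only, not the course data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem (85% / 15%)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = SGDClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_va)

score_value = model.score(X_va, y_va)       # mean accuracy
accuracy_value = accuracy_score(y_va, pred)  # identical to .score()
mcc_value = matthews_corrcoef(y_va, pred)    # the metric used during the search

print(f"score={score_value:.4f} accuracy={accuracy_value:.4f} MCC={mcc_value:.4f}")
```

To report a validation MCC for the tuned model, call `matthews_corrcoef(y_val, best_model.predict(X_val))` instead of `.score()`.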


#### Performing Random Search

##### Understanding the 60 Iterations Rule

In hyperparameter optimization, a widely-used rule of thumb states that **60 iterations of random search can find the best 5% set of parameters 95% of the time**, regardless of the grid size.

This rule can be framed as a probability problem:

> What is the minimum number of trials (iterations) needed to have a 95% chance of finding at least one hyperparameter set in the top 5%?

This scenario aligns with the **geometric distribution**, which models the number of trials needed to achieve the first success in repeated independent trials. However, our specific question focuses on the number of trials required to be 95% confident of at least one success, rather than the expected number of trials for the first success.

##### Mathematical Analysis

Let's define our variables:
- $p$ = probability of success (finding a top 5% hyperparameter set) in each trial = 0.05
- $n$ = number of trials (iterations)

We want to calculate the probability of *not* having a success after n trials, which should be less than or equal to 0.05 (5%) to ensure 95% confidence of at least one success.

1. Probability of no success in $n$ trials: $(1 - p)^n \le 0.05$
2. Substituting $p = 0.05$: $(1 - 0.05)^n \le 0.05$
3. Solving for $n$ using logarithms (the inequality flips because $\log(0.95) < 0$):
   - $n \cdot \log(0.95) \le \log(0.05)$
   - $n \ge \log(0.05) / \log(0.95) \approx 58.4$

Since $n$ must be an integer, the smallest value that satisfies the inequality is $n = 59$.
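
The calculation can be checked in a few lines. The helper name `iterations_needed` is an illustrative assumption; it simply evaluates the inequality above and rounds up to the next integer.

```python
import math

def iterations_needed(top_fraction: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - top_fraction)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - top_fraction))

n_rule = iterations_needed(top_fraction=0.05, confidence=0.95)
n_strict = iterations_needed(top_fraction=0.01, confidence=0.99)
print(n_rule, n_strict)  # 59 459
```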

##### Interpretation and Implications

- **59 iterations** are needed for 95% confidence of finding at least one top 5% hyperparameter set.
- This result is rounded up to 60 in the commonly cited rule of thumb.
- The beauty of this rule lies in its **independence from the size of the hyperparameter space**.

##### Practical Considerations

1. **Efficiency**: Random search can be more efficient than grid search, especially in high-dimensional spaces where many parameters may not significantly impact the model's performance.

2. **Adaptability**: This method works well when you're unsure which hyperparameters are most important, as it samples from the entire space.

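3. **Scalability**: For a confidence level $c$ and a target top fraction $p$, the required number of iterations is $n \ge \log(1 - c) / \log(1 - p)$. It grows with the logarithm of the allowed failure probability $1 - c$ and roughly as $1/p$ for small $p$.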

   > For example, to be 99% confident of finding a top 1% set, you would need approximately 459 iterations.

4. **Limitations**: While efficient, random search doesn't leverage information from previous trials to inform future searches, unlike more advanced methods like Bayesian optimization.
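
The rule can also be checked empirically. The simulation below is a sketch under a simplifying assumption: each random draw independently lands in the top 5% with probability 0.05, which matches sampling from a large or continuous space. The function name `hit_rate` and the number of experiments are illustrative choices.

```python
import random

def hit_rate(n_iter: int, top_fraction: float, n_experiments: int, seed: int = 0) -> float:
    """Fraction of simulated searches that sample at least one top configuration."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Each draw lands in the top fraction with probability `top_fraction`
        if any(rng.random() < top_fraction for _ in range(n_iter)):
            hits += 1
    return hits / n_experiments

estimated = hit_rate(n_iter=60, top_fraction=0.05, n_experiments=20_000)
exact = 1 - 0.95 ** 60
print(f"simulated: {estimated:.3f}  exact: {exact:.3f}")
```

Both values should sit slightly above 0.95, and nothing in the simulation depends on how many configurations the search space contains.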



In [33]:
from sklearn.model_selection import RandomizedSearchCV

def perform_random_search(model: SGDClassifier, param_dist: Dict[str, Any], X_train: pd.DataFrame, y_train: pd.Series, X_val: pd.DataFrame, y_val: pd.Series) -> None:
    """
    Perform random search to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (SGDClassifier): The model to be trained.
        param_dist (Dict[str, Any]): The parameter distribution for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the random search
    mcc_scorer = make_scorer(matthews_corrcoef)

    # Create the RandomizedSearchCV object with the SGDClassifier and parameter distribution
    random_search = RandomizedSearchCV(model, param_dist, cv=3, scoring=mcc_scorer, 
                                       n_jobs=-1, n_iter=60, random_state=271828)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=ConvergenceWarning)
        # Perform the random search by fitting training data
        random_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the random search
    print("Best parameters: ", random_search.best_params_)
    print("Best score: ", random_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set
    # (.score() returns mean accuracy for classifiers, not MCC)
    best_model = random_search.best_estimator_
    validation_accuracy = best_model.score(X_val, y_val)
    print("Validation accuracy: ", validation_accuracy)

# Define the SGDClassifier model
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight='balanced')

# Define parameter distribution for the search
search_parameters = {
    'loss': ['hinge', 'log_loss', 'squared_hinge', 'modified_huber'],
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    'max_iter': [1000, 2000, 3000, 4000, 5000],
    'penalty': ['l2', 'l1', 'elasticnet'],
}

# Perform random search
perform_random_search(sgd_model, search_parameters, X_train, y_train, X_val, y_val)

# This will typically run faster than GridSearchCV due to the reduced number of parameter combinations.
# Time to run: 3.5 minutes on a 48-core CPU

# Best parameters:  {'penalty': 'l1', 'max_iter': 3000, 'loss': 'hinge', 'alpha': 0.0001}
# Validation accuracy:  0.947
# The best valid MCC for the SGDClassifier was 0.870491


Best parameters:  {'penalty': 'l1', 'max_iter': 3000, 'loss': 'hinge', 'alpha': 0.0001}
Best score:  0.8545240719127332
Validation accuracy:  0.947


#### Bayesian Optimization for Hyperparameter Tuning

Bayesian optimization is a sophisticated approach to hyperparameter tuning, particularly valuable when dealing with complex models and computationally expensive evaluations. This method intelligently navigates the hyperparameter space by balancing exploration of unknown regions with exploitation of promising areas.

The fundamental principle of Bayesian optimization lies in its ability to learn from previous evaluations and make informed decisions about which hyperparameters to try next. This approach significantly reduces the number of evaluations needed to find optimal or near-optimal hyperparameters.

> **Key Insight**: Bayesian model-based methods can discover superior hyperparameters more efficiently by reasoning about the most promising configurations based on past trials.

##### Visual Understanding

To grasp the essence of Bayesian optimization, consider the following visual representations:

<p align="center">
  <img src="images/bayesian1.webp" alt="" style="width: 40%; height: 40%"/>
</p>

This image illustrates the initial state of the surrogate model (black line with gray uncertainty) after only two evaluations. At this stage, the surrogate model poorly approximates the true objective function (red line).

<p align="center">
  <img src="images/bayesian2.webp" alt="" style="width: 40%; height: 40%"/>
</p>

After eight evaluations, the surrogate model closely matches the true function. This improved approximation allows the algorithm to select hyperparameters that are likely to yield excellent results on the actual evaluation function.

For a conceptual introduction to Bayesian optimization, [check this link](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).

##### The Bayesian Approach

Bayesian methods mirror human learning processes:
1. Form an initial view (prior)
2. Update the model based on new experiences (posterior)

In the context of hyperparameter optimization, this framework is applied to discover optimal model settings iteratively.

##### Key Components of Bayesian Optimization

1. **Objective Function**
   - Evaluates model performance for a given set of hyperparameters
   - Typically uses metrics like accuracy, loss, or cross-validation scores
   - Serves as the ground truth for comparing different configurations

2. **Surrogate Model**
   - Probabilistic approximation of the objective function
   - Often uses Gaussian Processes (GPs) for their ability to provide both predictions and uncertainty estimates
   - Predicts performance of untested hyperparameter sets based on observed results
   - Significantly reduces computational costs by minimizing actual evaluations of the (costly) objective function

3. **Acquisition Function**
   - Guides the selection of the next hyperparameter set to evaluate
   - Balances exploration (investigating new areas) and exploitation (focusing on known good regions)
   - Common types include:
     - Expected Improvement (EI)
     - Probability of Improvement (PI)
     - Upper Confidence Bound (UCB)
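
As an illustration of how an acquisition function trades off exploration and exploitation, the sketch below implements the standard closed-form Expected Improvement for a maximization problem, assuming the surrogate returns a Gaussian prediction (mean and standard deviation) for each candidate. The function name, the `xi` margin parameter, and the numeric values are hypothetical and chosen only for this example.

```python
import math

def expected_improvement(mean: float, std: float, best_so_far: float, xi: float = 0.0) -> float:
    """Expected Improvement for maximization under a prediction N(mean, std**2)."""
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly
        return max(improvement, 0.0)
    z = improvement / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return improvement * cdf + std * pdf

best = 0.85  # hypothetical best score observed so far

# Candidate A: slightly better predicted mean, very low uncertainty (exploitation)
ei_exploit = expected_improvement(mean=0.86, std=0.01, best_so_far=best)
# Candidate B: worse predicted mean, high uncertainty (exploration)
ei_explore = expected_improvement(mean=0.80, std=0.10, best_so_far=best)

print(f"exploit: {ei_exploit:.4f}  explore: {ei_explore:.4f}")
```

In this example the uncertain candidate scores higher, so the optimizer would evaluate it next even though its predicted mean is lower.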

##### The Optimization Process

1. **Initialization**: Randomly select and evaluate a few initial hyperparameter sets.

2. **Surrogate Model Creation**: Fit a Gaussian Process to the initial data points.

3. **Acquisition Function Calculation**: Compute the acquisition function values across the hyperparameter space.

4. **Next Point Selection**: Choose the hyperparameter set with the highest acquisition function value.

5. **Evaluation**: Assess the chosen hyperparameters using the objective function.

6. **Model Update**: Refit the Gaussian Process with the new data point.

7. **Iteration**: Repeat steps 3-6 until a stopping criterion is met (e.g., performance threshold or iteration limit).
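
The steps above can be sketched in pure Python for a one-dimensional toy problem. Everything here is an assumption made for illustration: the objective is a cheap quadratic standing in for an expensive validation score, the surrogate is a zero-mean Gaussian Process with a fixed RBF kernel, and the acquisition function is an Upper Confidence Bound with a fixed exploration weight. Libraries such as `skopt` handle kernels, mixed parameter types, and numerical details far more carefully.

```python
import math
import random

def rbf_kernel(a: float, b: float, length_scale: float = 0.3) -> float:
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs with Gaussian elimination and partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_predict(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    gram = [[rbf_kernel(a, b) + (jitter if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf_kernel(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve_linear_system(gram, ys)))
    reduction = sum(k * w for k, w in zip(k_star, solve_linear_system(gram, k_star)))
    variance = max(rbf_kernel(x_new, x_new) - reduction, 0.0)
    return mean, math.sqrt(variance)

def objective(x: float) -> float:
    """Toy stand-in for an expensive validation score; peaks at x = 0.7."""
    return -(x - 0.7) ** 2

rng = random.Random(0)
candidates = [i / 100 for i in range(101)]

# Step 1: evaluate a few random initial points
xs = rng.sample(candidates, 3)
ys = [objective(x) for x in xs]
initial_best = max(ys)

for _ in range(10):
    untried = [x for x in candidates if x not in xs]

    def upper_confidence_bound(x: float) -> float:
        mean, std = gp_predict(xs, ys, x)  # steps 2 and 6: surrogate on all data so far
        return mean + 2.0 * std            # step 3: acquisition value

    x_next = max(untried, key=upper_confidence_bound)  # step 4: most promising point
    xs.append(x_next)                                  # step 5: evaluate the objective
    ys.append(objective(x_next))

best_x = xs[ys.index(max(ys))]
print(f"best x: {best_x:.2f}  best score: {max(ys):.4f}")
```

The surrogate is refit implicitly on every call to `gp_predict`, which is wasteful but keeps the sketch short; the loop stops after a fixed budget of 10 evaluations.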

##### Advantages and Considerations

- **Efficiency**: Particularly beneficial for expensive-to-evaluate models, minimizing the number of evaluations needed.
- **Adaptability**: Learns from previous evaluations to make informed decisions about future trials.
- **Uncertainty Handling**: Incorporates uncertainty in its decision-making process, leading to a more robust exploration of the hyperparameter space.

> **Note**: While highly effective, Bayesian optimization does not guarantee finding the global optimum, especially in complex or noisy objective functions.

##### Illustrative Analogies

- **Surrogate Model as a Treasure Map**: Initially, you have a rough sketch of an island. As you explore and mark landmarks (data points), your map becomes more detailed and accurate, guiding your search more effectively.

- **Acquisition Function as Oil Drilling**: Deciding whether to drill deeper in a promising spot (exploitation) or start a new drill elsewhere (exploration) based on current knowledge and potential rewards.



In [34]:

from skopt import BayesSearchCV

def perform_bayesian_optimization(model: SGDClassifier, param_space: Dict[str, Any], X_train: pd.DataFrame, y_train: pd.Series, X_val: pd.DataFrame, y_val: pd.Series) -> None:
    """
    Perform Bayesian optimization to find the best parameters for the model and evaluate it on the validation set.

    Args:
        model (SGDClassifier): The model to be trained.
        param_space (Dict[str, Any]): The parameter space for the search.
        X_train (pd.DataFrame): The training data.
        y_train (pd.Series): The training labels.
        X_val (pd.DataFrame): The validation data.
        y_val (pd.Series): The validation labels.
    """
    # Define the scorer for the Bayesian optimization
    mcc_scorer = make_scorer(matthews_corrcoef)

    # Create the BayesSearchCV object with the SGDClassifier and parameter space
    bayesian_search = BayesSearchCV(model, param_space, cv=3, scoring=mcc_scorer, 
                                    n_jobs=-1, n_iter=30, random_state=271828, n_points=10)

    # Perform the Bayesian optimization by fitting training data
    bayesian_search.fit(X_train, y_train)

    # Print the best parameters and corresponding MCC score found by the Bayesian search
    print("Best parameters: ", bayesian_search.best_params_)
    print("Best score: ", bayesian_search.best_score_)

    # Evaluate the model with chosen parameters on the validation set
    # (.score() returns mean accuracy for classifiers, not MCC)
    best_model = bayesian_search.best_estimator_
    validation_accuracy = best_model.score(X_val, y_val)
    print("Validation accuracy: ", validation_accuracy)

# Define the SGDClassifier model
sgd_model = SGDClassifier(random_state=271828, n_jobs=-1, class_weight='balanced')

# Define parameter space for the search
search_parameters = {
    'loss': ['hinge', 'log_loss', 'squared_hinge', 'modified_huber'],
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1e0],
    'max_iter': [1000, 2000, 3000, 4000, 5000],
    'penalty': ['l2', 'l1', 'elasticnet'],
}

# Perform Bayesian optimization
perform_bayesian_optimization(sgd_model, search_parameters, X_train, y_train, X_val, y_val)

# This cell took 1 minute to run
# Best parameters:  OrderedDict({'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 5000, 'penalty': 'l1'})
# Validation accuracy:  0.947
# The best valid MCC for the SGDClassifier was 0.870491


Best parameters:  OrderedDict([('alpha', 0.0001), ('loss', 'hinge'), ('max_iter', 5000), ('penalty', 'l1')])
Best score:  0.8545240719127332
Validation accuracy:  0.947


# Problem 2 - Predicting the Time it Takes to Issue an "Acórdão"
Your turn to build a model! In this problem, you'll use the data from the previous problem to predict the time it takes for an appeal to be judged. This is a regression problem, and you'll use the same features as in the previous problem.

In [35]:
# Load our data.
import pandas as pd

df_texts = pd.read_parquet('data/brcad5/texts_sample.parquet.gz')
df_meta = pd.read_parquet('data/brcad5/metadata_sample.parquet.gz')

In [36]:
df_meta

Unnamed: 0,case_number,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id
11,0510491-75.2017.4.05.8103,2017-10-11 00:00:00,INSS,2018-01-16 11:16:19,2018-03-26 17:07:16,6095,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Invalidez,19-CE
16,0509917-52.2017.4.05.8103,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE
19,0501051-55.2017.4.05.8103,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE
23,0511681-76.2017.4.05.8102,2017-09-27 00:00:00,INSS,2017-12-29 02:18:24,2018-03-26 17:07:16,6104,Direito Previdenciário,Benefícios em Espécie,Pensão por Morte (Art. 74/9),30-CE
33,0500777-79.2017.4.05.8107,2017-02-23 00:00:00,INSS,2017-11-27 17:58:46,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),25-CE
...,...,...,...,...,...,...,...,...,...,...
152600,0511177-90.2019.4.05.8202,2019-11-08 00:00:00,INSS,2019-12-16 13:09:08,2020-03-31 11:47:59,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB
152605,0508370-06.2019.4.05.8200,2019-06-13 00:00:00,INSS,2019-10-03 18:01:23,2020-03-31 15:22:15,6096,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Idade (Art. 48/51),7-PB
152610,0508717-73.2018.4.05.8200,2018-06-20 00:00:00,INSS,2019-08-29 16:53:58,2020-03-31 15:22:15,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-PB
152628,0507511-81.2019.4.05.8202,2019-08-22 00:00:00,INSS,2019-11-05 17:10:31,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB


In [37]:
df_texts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40000 entries, 2 to 305281
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   case_number  40000 non-null  object
 1   text         40000 non-null  object
 2   outcome      40000 non-null  object
 3   ruling_type  40000 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB


In [38]:
df

Unnamed: 0,case_number,text,outcome,date_appeal_panel_ruling,outcome_int
5318,0529055-14.2017.4.05.8100,RELATÓRIO Trata-se de recurso interposto pela ...,NÃO PROVIMENTO,2018-03-26 17:02:45,0
19114,0504805-96.2017.4.05.8105,DIREITO PREVIDENCIÁRIO. AUXÍLIO-DOENÇA. LAUDO ...,NÃO PROVIMENTO,2018-03-26 17:07:16,0
13460,0511681-76.2017.4.05.8102,RECURSO INOMINADO. DIREITO PREVIDENCIÁRIO. PEN...,PROVIMENTO,2018-03-26 17:07:16,1
2359,0502738-58.2017.4.05.8106,ADMINISTRATIVO. GRATIFICAÇÃO DE DESEMPENHO. GD...,NÃO PROVIMENTO,2018-03-26 17:07:16,0
1706,0509527-91.2017.4.05.8100,EMENTA: APLICAÇÃO DO INPC À CORREÇÃO MONETÁRIA...,PROVIMENTO,2018-03-26 17:07:16,1
...,...,...,...,...,...
2880,0500987-74.2019.4.05.8200,VOTO-EMENTA ADMINISTRATIVO. CONVERSÃO DE LICEN...,PROVIMENTO,2020-03-31 15:22:15,1
12876,0508370-06.2019.4.05.8200,VOTO – EMENTA PREVIDENCIÁRIO. APOSENTADORIA PO...,NÃO PROVIMENTO,2020-03-31 15:22:15,0
183,0507364-55.2019.4.05.8202,VOTO - EMENTA PREVIDENCIÁRIO. EXTENSÃO DO PERÍ...,PROVIMENTO PARCIAL,2020-04-01 11:24:06,2
19799,0508286-96.2019.4.05.8202,VOTO - EMENTA PREVIDENCIÁRIO. EXTENSÃO DO PERÍ...,PROVIMENTO PARCIAL,2020-04-01 11:24:06,2


In [39]:
df_meta.date_first_instance_ruling = pd.to_datetime(df_meta.date_first_instance_ruling, yearfirst=True)
df_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 11 to 152635
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   case_number                 20000 non-null  object        
 1   filing_date                 20000 non-null  object        
 2   defendant_normalized        20000 non-null  object        
 3   date_first_instance_ruling  20000 non-null  datetime64[ns]
 4   date_appeal_panel_ruling    20000 non-null  datetime64[us]
 5   case_topic_code             20000 non-null  int64         
 6   case_topic_1st_level        20000 non-null  object        
 7   case_topic_2nd_level        20000 non-null  object        
 8   case_topic_3rd_level        19993 non-null  object        
 9   court_id                    20000 non-null  object        
dtypes: datetime64[ns](1), datetime64[us](1), int64(1), object(7)
memory usage: 1.7+ MB


In [40]:
df_meta['time_to_trial_appeal'] = df_meta['date_appeal_panel_ruling'] - df_meta['date_first_instance_ruling']
df_meta['time_to_trial_appeal'] = df_meta['time_to_trial_appeal'].dt.days
df_meta['time_to_trial_appeal'].describe()

count    20000.000000
mean       153.045100
std        267.485493
min          8.000000
25%         55.000000
50%         80.000000
75%        127.000000
max       3599.000000
Name: time_to_trial_appeal, dtype: float64
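
Keep in mind that `.dt.days` keeps only the whole-day component of each time difference, so partial days are truncated rather than rounded. The sketch below reproduces this with the standard library, using the two timestamps of case `0515165-56.2018.4.05.8202` as they appear in the merged table further below; the variable names are illustrative.

```python
from datetime import datetime

first_instance = datetime(2019, 4, 25, 15, 55, 52)
appeal_panel = datetime(2019, 8, 20, 10, 11, 35)

elapsed = appeal_panel - first_instance

whole_days = elapsed.days                          # truncated, like pandas .dt.days
fractional_days = elapsed.total_seconds() / 86_400

print(whole_days)                 # 116
print(round(fractional_days, 2))  # 116.76
```

If sub-day precision matters for the regression target, `total_seconds()` (or `.dt.total_seconds()` in pandas) keeps it.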

In [41]:
df = df_texts.query('ruling_type == "SENTENÇA"').merge(df_meta, on='case_number', how='left')
df.drop_duplicates(subset='case_number', inplace=True)
df

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal
0,0515165-56.2018.4.05.8202,SENTENÇA I - RELATÓRIO Dispensada a feitura do...,IMPROCEDENTE,SENTENÇA,2018-10-19 00:00:00,INSS,2019-04-25 15:55:52,2019-08-20 10:11:35,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",15-PB,116
1,0506287-42.2018.4.05.8300,"SENTENÇA I - RELATÓRIO Dispensado, nos termos ...",PARCIALMENTE PROCEDENTE,SENTENÇA,2018-04-20 00:00:00,INSS,2019-05-02 23:48:08,2019-07-03 18:08:09,6099,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Tempo de Serviço (Art. 52/4),14-PE,61
2,0505610-87.2019.4.05.8102,SENTENÇA I – RELATÓRIO Por força do disposto n...,PROCEDENTE,SENTENÇA,2019-05-06 00:00:00,INSS,2019-07-28 09:57:06,2019-09-26 14:16:00,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,30-CE,60
3,0501720-83.2018.4.05.8100,"SENTENÇA Dispensado o relatório, nos termos do...",IMPROCEDENTE,SENTENÇA,2018-01-24 00:00:00,INSS,2018-08-27 12:39:45,2018-10-31 15:36:41,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-CE,65
4,0501533-63.2018.4.05.8104,"TERMO DE AUDIÊNCIA Aos 18 de setembro de 2018,...",PROCEDENTE,SENTENÇA,2018-05-02 00:00:00,INSS,2018-09-18 14:33:01,2018-11-14 13:50:04,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),22-CE,56
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0500473-94.2019.4.05.8306,SENTENÇA (Tipo C – Sem Resolução do Mérito) I ...,EXTINTO SEM MÉRITO,SENTENÇA,2019-03-14 00:00:00,INSS,2019-04-15 09:12:31,2019-07-22 15:38:11,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",25-PE,98
19996,0505814-16.2014.4.05.8100,SENTENÇA Por força do disposto no art. 38 da L...,IMPROCEDENTE,SENTENÇA,2014-03-21 00:00:00,CEF,2014-03-28 10:32:04,2018-07-06 11:33:24,7691,Direito Civil,Obrigações,Inadimplemento,14-CE,1561
19997,0500153-84.2018.4.05.8304,SENTENÇA I. Relatório Ante o disposto no art. ...,IMPROCEDENTE,SENTENÇA,2018-01-24 00:00:00,INSS,2018-03-23 08:35:39,2018-05-23 14:42:26,6177,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Concessão,20-PE,61
19998,0511210-80.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-11-08 00:00:00,INSS,2019-12-16 13:09:08,2020-03-16 14:46:57,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,91


In [42]:
df = df.sort_values(by='date_appeal_panel_ruling', ascending=True).reset_index(drop=True)
df

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal
0,0529055-14.2017.4.05.8100,SENTENÇA Dispensado o relatório (art. 1o da Le...,IMPROCEDENTE,SENTENÇA,2017-12-22 00:00:00,INSS,2018-01-30 12:09:11,2018-03-26 17:02:45,6138,Direito Previdenciário,"RMI - Renda Mensal Inicial, Reajustes e Revisõ...",Reajustes e Revisões Específicos,14-CE,55
1,0505725-70.2017.4.05.8105,JUSTIÇA FEDERAL SEÇÃO JUDICIÁRIA DO ESTADO DO ...,IMPROCEDENTE,SENTENÇA,2017-11-21 00:00:00,INSS,2018-02-06 15:50:11,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,23-CE,48
2,0501051-55.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-se de demanda prev...,IMPROCEDENTE,SENTENÇA,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE,125
3,0512841-39.2017.4.05.8102,PODER JUDICIÁRIO JUSTIÇA FEDERAL DE PRIMEIRA I...,PROCEDENTE,SENTENÇA,2017-10-25 00:00:00,INSS,2017-11-09 10:58:36,2018-03-26 17:07:16,10288,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Sistema Remuneratório e Benefícios,17-CE,137
4,0509917-52.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-sede ação de rito ...,IMPROCEDENTE,SENTENÇA,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE,69
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0508370-06.2019.4.05.8200,SENTENÇA – TIPO A Vistos etc. Trata-se de ação...,PROCEDENTE,SENTENÇA,2019-06-13 00:00:00,INSS,2019-10-03 18:01:23,2020-03-31 15:22:15,6096,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Idade (Art. 48/51),7-PB,179
19996,0508717-73.2018.4.05.8200,SENTENÇA Dispensado o relatório nos termos do ...,IMPROCEDENTE,SENTENÇA,2018-06-20 00:00:00,INSS,2019-08-29 16:53:58,2020-03-31 15:22:15,6114,Direito Previdenciário,Benefícios em Espécie,"Benefício Assistencial (Art. 203,V CF/88)",13-PB,214
19997,0507511-81.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-08-22 00:00:00,INSS,2019-11-05 17:10:31,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,147
19998,0507364-55.2019.4.05.8202,SENTENÇA I – RELATÓRIO Trata-se de ação de ind...,IMPROCEDENTE,SENTENÇA,2019-08-21 00:00:00,INSS,2019-10-30 16:18:03,2020-04-01 11:24:06,6176,Direito Previdenciário,Pedidos Genéricos Relativos aos Benefícios em ...,Parcelas de benefício não pagas,15-PB,153


In [43]:
# Determine the length of the training set (80% of the total data)
train_len = int(0.8 * len(df))

# Determine the length of the validation set (10% of the total data)
val_len = int(0.1 * len(df))

# Create the training set from the first 'train_len' rows; df is already sorted by date, so these are the oldest rulings
df_train = df.iloc[:train_len].copy()

# Create the validation set by selecting the next 'val_len' rows after the training set
df_val = df.iloc[train_len:train_len + val_len].copy()

# Create the test set by selecting the remaining rows after the training and validation sets
df_test = df.iloc[train_len + val_len:].copy()

# Print the shapes of the training, validation, and test sets to verify the splits
df_train.shape, df_val.shape, df_test.shape

((16000, 14), (2000, 14), (2000, 14))
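
The positional slicing above only yields a chronological split if `df` is already ordered by date. The sketch below makes that ordering explicit and checks that no earlier split contains a later date. The helper name `chronological_split`, the toy values, and the choice of `date_appeal_panel_ruling` as the sort column are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd


def chronological_split(df, date_col, train_frac=0.8, val_frac=0.1):
    """Sort by date_col, then slice into train/validation/test by position."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    train_len = int(train_frac * len(ordered))
    val_len = int(val_frac * len(ordered))
    train = ordered.iloc[:train_len].copy()
    val = ordered.iloc[train_len:train_len + val_len].copy()
    test = ordered.iloc[train_len + val_len:].copy()
    return train, val, test


# Toy data standing in for the rulings dataframe (all values are made up).
toy = pd.DataFrame({
    "date_appeal_panel_ruling": pd.to_datetime(
        ["2019-05-01", "2018-03-26", "2020-04-01", "2018-11-10", "2019-09-30",
         "2018-06-15", "2019-01-20", "2019-12-05", "2018-08-08", "2020-01-17"]
    ),
    "time_to_trial_appeal": [63, 55, 147, 90, 108, 48, 125, 179, 69, 214],
})

toy_train, toy_val, toy_test = chronological_split(toy, "date_appeal_panel_ruling")

# Sanity check: every training date precedes validation, which precedes test.
date = "date_appeal_panel_ruling"
assert toy_train[date].max() <= toy_val[date].min()
assert toy_val[date].max() <= toy_test[date].min()
print(len(toy_train), len(toy_val), len(toy_test))
```

On the real data, the same two assertions can be run directly on `df_train`, `df_val`, and `df_test` to confirm that the split does not leak future rulings into training.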

In [44]:
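# Apply the same cleaning steps, in the same order, to every split
for dataframe in [df_train, df_val, df_test]:
    # Strip accents (e.g. SENTENÇA -> SENTENCA)
    dataframe['clean_text'] = dataframe.text.apply(remove_accented_characters)
    # Drop digits and punctuation
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_numbers_punctuation_from_text)
    # Collapse repeated whitespace
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_excessive_spaces)
    # Drop very short words
    dataframe['clean_text'] = dataframe.clean_text.apply(remove_short_words)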

In [45]:
df_train

Unnamed: 0,case_number,text,outcome,ruling_type,filing_date,defendant_normalized,date_first_instance_ruling,date_appeal_panel_ruling,case_topic_code,case_topic_1st_level,case_topic_2nd_level,case_topic_3rd_level,court_id,time_to_trial_appeal,clean_text
0,0529055-14.2017.4.05.8100,SENTENÇA Dispensado o relatório (art. 1o da Le...,IMPROCEDENTE,SENTENÇA,2017-12-22 00:00:00,INSS,2018-01-30 12:09:11,2018-03-26 17:02:45,6138,Direito Previdenciário,"RMI - Renda Mensal Inicial, Reajustes e Revisõ...",Reajustes e Revisões Específicos,14-CE,55,SENTENCA Dispensado relatorio art Lei art Lei ...
1,0505725-70.2017.4.05.8105,JUSTIÇA FEDERAL SEÇÃO JUDICIÁRIA DO ESTADO DO ...,IMPROCEDENTE,SENTENÇA,2017-11-21 00:00:00,INSS,2018-02-06 15:50:11,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,23-CE,48,JUSTICA FEDERAL SECAO JUDICIARIA ESTADO CEARA ...
2,0501051-55.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-se de demanda prev...,IMPROCEDENTE,SENTENÇA,2017-02-03 00:00:00,INSS,2017-11-21 11:44:40,2018-03-26 17:07:16,6103,Direito Previdenciário,Benefícios em Espécie,Salário-Maternidade (Art. 71/73),31-CE,125,SENTENCA RELATORIO Trata demanda previdenciari...
3,0512841-39.2017.4.05.8102,PODER JUDICIÁRIO JUSTIÇA FEDERAL DE PRIMEIRA I...,PROCEDENTE,SENTENÇA,2017-10-25 00:00:00,INSS,2017-11-09 10:58:36,2018-03-26 17:07:16,10288,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Sistema Remuneratório e Benefícios,17-CE,137,PODER JUDICIARIO JUSTICA FEDERAL PRIMEIRA INST...
4,0509917-52.2017.4.05.8103,SENTENÇA I. RELATÓRIO Trata-sede ação de rito ...,IMPROCEDENTE,SENTENÇA,2017-09-26 00:00:00,INSS,2018-01-15 18:07:55,2018-03-26 17:07:16,6101,Direito Previdenciário,Benefícios em Espécie,Auxílio-Doença Previdenciário,31-CE,69,SENTENCA RELATORIO Trata sede acao rito especi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,0516038-07.2019.4.05.8013,Processo no 05160380720194058013T Autor: CARLS...,PROCEDENTE,SENTENÇA,2019-05-23 00:00:00,IFES,2019-07-29 11:47:02,2019-09-30 17:20:24,10220,Direito Administrativo e outras matérias do Di...,Servidor Público Civil,Regime Estatutário,14-AL,63,Processo Autor CARLSON LAMENHA APOLINARIO Reu ...
15996,0500940-73.2019.4.05.8015,SENTENÇA Trata-se de ação proposta em face da ...,IMPROCEDENTE,SENTENÇA,2019-01-29 00:00:00,CEF,2019-06-14 11:24:25,2019-09-30 17:20:24,10433,Direito Civil,Responsabilidade Civil,Indenização por Dano Moral,10-AL,108,SENTENCA Trata acao proposta face Caixa Econom...
15997,0509291-41.2019.4.05.8013,SENTENÇA Trata-se de ação proposta contra o IN...,PROCEDENTE,SENTENÇA,2019-03-29 00:00:00,INSS,2019-06-13 12:08:18,2019-09-30 17:20:24,6118,Direito Previdenciário,Benefícios em Espécie,Aposentadoria por Tempo de Contribuição (Art. ...,9-AL,109,SENTENCA Trata acao proposta contra INSS que p...
15998,0519807-23.2019.4.05.8013,"SENTENÇA Trata-se de ação de rito sumaríssimo,...",PROCEDENTE,SENTENÇA,2019-06-25 00:00:00,INSS,2019-07-16 11:13:16,2019-09-30 17:20:24,6100,Direito Previdenciário,Benefícios em Espécie,Aposentadoria Especial (Art. 57/8),9-AL,76,SENTENCA Trata acao rito sumarissimo com pedid...


In [46]:
X_train = df_train.clean_text
X_val = df_val.clean_text
X_test = df_test.clean_text

y_train = df_train.time_to_trial_appeal
y_val = df_val.time_to_trial_appeal
y_test = df_test.time_to_trial_appeal


From now on, it's up to you: train and evaluate a model that predicts `time_to_trial_appeal` from `clean_text`.
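
Before training anything, it can help to know what a trivial predictor scores, so that any model you build has a reference point to beat. The sketch below assumes the task is treated as regression on the numeric target and uses mean absolute error (MAE) as the metric; both choices, the helper name `median_baseline_mae`, and the toy values are illustrative assumptions rather than part of the original notebook.

```python
from statistics import median


def median_baseline_mae(y_train, y_eval):
    """Return the training median and the MAE of predicting it for every case."""
    guess = median(y_train)
    errors = [abs(y - guess) for y in y_eval]
    return guess, sum(errors) / len(errors)


# Toy targets (made-up values in the same unit as time_to_trial_appeal).
toy_y_train = [55, 48, 125, 137, 69]
toy_y_val = [63, 108, 109, 76]

guess, mae = median_baseline_mae(toy_y_train, toy_y_val)
print(f"constant prediction: {guess}, validation MAE: {mae}")
```

The function accepts any iterable of numbers, so it can be called as `median_baseline_mae(y_train, y_val)` with the series defined above. A model whose validation error is not clearly below this value has learned little from the text.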

# Questions

1. What are the three key steps in text processing mentioned in the NLP pipeline?

2. What is the "60 iterations rule" in the context of random search for hyperparameter optimization?

3. What are the three main components of Bayesian Optimization for hyperparameter tuning?

4. How does the notebook split the data for training, validation and testing?

5. What is the purpose of the StackingClassifier in the context of this notebook?

6. What evaluation metrics are used to assess the performance of the classifiers?


`Answers are commented inside this cell.`

<!-- 1. The three key steps in text processing mentioned in the NLP pipeline are: Normalization (standardizing text), Tokenization (breaking text into smaller units), and Numericalization (converting tokens to numerical representations).

2. The "60 iterations rule" states that 60 iterations of random search have roughly a 95% chance of sampling at least one configuration from the best 5% of the search space, regardless of the grid size (1 - 0.95^60 ≈ 0.954). This rule provides an efficient approach to hyperparameter optimization.

3. The three main components of Bayesian Optimization for hyperparameter tuning are: Objective Function (evaluates model performance), Surrogate Model (probabilistic approximation of the objective function), and Acquisition Function (guides selection of next hyperparameter set to evaluate).

4. The notebook splits the data chronologically: the first 80% of the data (sorted by date) is used for training, the next 10% for validation, and the final 10% for testing.

5. The StackingClassifier is used to combine multiple base classifiers with a meta-classifier, aiming to achieve superior performance compared to individual models by leveraging their collective strengths.

6. The evaluation metrics used include F1 score, balanced accuracy, accuracy, Matthews Correlation Coefficient (MCC), and confusion matrix. The notebook particularly emphasizes MCC as a key metric.
 -->
