# Natural Language Processing with Disaster Tweets

Competition Description

Twitter has become an important communication channel in times of emergency.

The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:


The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Submissions are evaluated using F1 between the predicted and expected answers.

F1 is calculated as follows:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where:

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

and:

True Positive [TP] = your prediction is 1, and the ground truth is also 1 - you predicted a positive and that's true!

False Positive [FP] = your prediction is 1, and the ground truth is 0 - you predicted a positive, and that's false.

False Negative [FN] = your prediction is 0, and the ground truth is 1 - you predicted a negative, and that's false.
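
As a quick worked example of the definitions above, here is a minimal sketch in plain Python that counts TP, FP and FN and computes F1 for the positive class. The function names and the toy labels are made up for illustration; they are not part of the competition material.

```python
def count_outcomes(y_true, y_pred):
    """Count true positives, false positives and false negatives for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # Convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy ground truth and predictions (hypothetical values)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn = count_outcomes(y_true, y_pred)
print(tp, fp, fn)                  # 3 1 1
print(f1_from_counts(tp, fp, fn))  # 0.75
```

Here precision and recall are both 3/4, so F1 is 0.75 as well.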

In [4]:
# check the dataset
import pandas as pd
df = pd.read_csv('train.csv')
df.head(20)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


So the two columns we need to work with are `text` (the model input) and `target` (the label).

### The choice of models

We need a model for a binary classification task. Here are some notes from investigating the options:

BERT and its Variants: BERT is particularly good at understanding the context of a word in a sentence, which can be useful for tasks like sentiment analysis where the meaning of words can be highly context-dependent. Its variants like RoBERTa or ALBERT might offer improved performance or efficiency. **BERT seems like a good place to start.**

DistilBERT: If computational resources or inference time are a concern, DistilBERT offers a good balance between performance and efficiency, as it is a distilled version of BERT that retains most of its capabilities. **Skipped for now; let us try plain BERT first.**

GPT-3 or GPT-4: If your task can benefit from a large-scale pre-trained model and you have access to it, GPT-3 or GPT-4 can be fine-tuned for binary classification tasks. These models can be particularly powerful if your task involves creative language or requires a deep understanding of nuanced text. **Not very relevant to this task**

Fine-tuning vs. Feature-based Approach: With models like BERT, you can either fine-tune the entire model on your task or use it as a feature extractor where only a simple classifier is trained on top of the BERT features. Fine-tuning generally offers better performance but requires more computational resources.

Data Efficiency: If you have a small dataset, you might want to consider models that are known for being data-efficient. Techniques like few-shot learning with GPT-3 or GPT-4, or using a model pre-trained in a similar domain to your task, can be beneficial.

Transfer Learning and Domain-Specific Models: If there's a pre-trained model that's been trained on a corpus relevant to your task (like legal documents, medical reports, etc.), using that model could lead to better performance as it's already familiar with the kind of language used in your domain.

Our training set is relatively small, so BERT is the first model we want to try.


### Data cleaning for the BERT model
When fine-tuning a BERT model, it's generally a good idea to clean and preprocess your text data to ensure that it is in a format that is compatible with the model and conducive to effective learning. Here are some common preprocessing steps for training a BERT model.

Text Cleaning: Depending on your dataset, you might want to remove unnecessary characters, such as HTML tags, special characters, or extra whitespace. You should also consider whether to convert the text to lowercase, as BERT models are available in both cased and uncased versions.

In [6]:
import re
import pandas as pd

# Define a function to clean the text
def clean_text(text):
    # Convert to lower case (we use the uncased BERT checkpoint later)
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs ('http\S+' also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove usernames (mentions)
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (just the symbol, not the text)
    text = re.sub(r'#', '', text)
    # Keep only letters and whitespace; the text is already lower-cased.
    # Note: the fourth positional argument of re.sub is count, so flags
    # must be passed by keyword (flags=...) when they are needed.
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse runs of whitespace (including new lines) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Assuming the dataset is in a CSV file and the text is in a column named 'text'
# Load your dataset
df = pd.read_csv('train.csv')

# Clean the text column
df['cleaned_text'] = df['text'].apply(clean_text)

# Show the first few rows of the cleaned text
print(df[['text', 'cleaned_text']].head())

# Save the cleaned dataset to a new CSV file
df.to_csv('cleaned_train.csv', index=False)


                                                text  \
0  Our Deeds are the Reason of this #earthquake M...   
1             Forest fire near La Ronge Sask. Canada   
2  All residents asked to 'shelter in place' are ...   
3  13,000 people receive #wildfires evacuation or...   
4  Just got sent this photo from Ruby #Alaska as ...   

                                        cleaned_text  
0  our deeds are the reason of this earthquake ma...  
1              forest fire near la ronge sask canada  
2  all residents asked to shelter in place are be...  
3  people receive wildfires evacuation orders in ...  
4  just got sent this photo from ruby alaska as s...  


## Understanding BERT and its controlling parameters

### BERT
BERT (Bidirectional Encoder Representations from Transformers) is a model that uses the Transformer architecture. It was introduced by researchers at Google in 2018 and has since become one of the most popular and influential models in natural language processing (NLP).

Here's a high-level overview of how the BERT model works:

Input Representation: BERT takes as input a sequence of tokens (words or subwords) and converts them into vectors using an embedding layer. It also adds special tokens like [CLS] and [SEP] to the sequence and uses positional embeddings to capture the order of the tokens.

Transformer Encoder: The core of the BERT model is a stack of Transformer encoder layers. Each layer consists of two main components: a multi-head self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to weigh the importance of different tokens in the sequence relative to each other, and the feed-forward network processes the output of the attention mechanism.

Self-Attention: In the self-attention mechanism, each token in the input sequence is transformed into a query, key, and value vector. The model then computes attention scores by taking the dot product of the query vector of each token with the key vectors of all tokens in the sequence (including itself). The scores are scaled by the square root of the key dimension and normalized with a softmax, which determines how much attention each token pays to every token in the sequence. The resulting weights are then used to compute a weighted sum of the value vectors, which becomes the output of the attention mechanism.

Feed-Forward Network: The output of the self-attention mechanism is then passed through a feed-forward neural network, which applies additional transformations to the data.

Output: The output of the final Transformer encoder layer can be used for various NLP tasks. For example, in classification tasks, the output corresponding to the [CLS] token is often used as the representation of the entire sequence and passed through additional layers to produce the final class predictions.

BERT's ability to capture the context of each token in a sequence bidirectionally (i.e., considering both the preceding and following tokens) is one of its key strengths. This allows it to understand the meaning of words in context more effectively than previous models that processed text in a unidirectional manner.
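
To make the self-attention step described above more concrete, here is a minimal single-head sketch in plain Python. It assumes the query, key and value vectors are already given (real BERT obtains them from learned linear projections and uses several heads in parallel), and the toy numbers are made up.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)  # Subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list with one vector per token.
    Returns (outputs, weights), both with one row per token.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token's query against every token's key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        out = [sum(w_j * v[i] for w_j, v in zip(w, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs, weights


# Three tokens with 2-dimensional vectors (hypothetical values)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, w = self_attention(q, k, v)
for row in w:
    print([round(x, 3) for x in row])  # Each row sums to 1
```

Every token attends to all tokens, both earlier and later in the sequence, which is what makes the representation bidirectional.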

### Training process
Tokenization: Use the BERT tokenizer to tokenize your text. This involves splitting the text into words or subwords (tokens) and converting each token into its corresponding ID in the BERT vocabulary. The tokenizer will also add special tokens like [CLS] and [SEP] as needed.

Padding and Truncation: Since every sequence in a batch must have the same length, and BERT accepts at most 512 tokens, you'll need to pad shorter sequences and truncate longer ones to a specified maximum length. The BERT tokenizer can handle this for you.

Attention Masks: Generate attention masks to tell the model which tokens are actual words and which are padding tokens. This is important for the model to correctly interpret the input sequences.

Label Encoding: If you're working on a classification task, you'll need to encode your labels into a format that the model can understand (e.g., converting categorical labels into numerical IDs).
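
The preparation steps above can be sketched without any library. The toy example below is only an illustration: it splits on whitespace and uses a tiny made-up vocabulary (`TOY_VOCAB`), whereas the real BERT tokenizer uses WordPiece subwords and a vocabulary of about 30,000 entries. The special-token IDs are the ones used by `bert-base-uncased`.

```python
# Special-token IDs as in bert-base-uncased
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical mini vocabulary, for illustration only
TOY_VOCAB = {"forest": 1000, "fire": 1001, "near": 1002, "la": 1003, "ronge": 1004}


def encode(text, max_length):
    """Return (input_ids, attention_mask), both of length max_length."""
    tokens = text.lower().split()
    ids = [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]
    # Truncate, leaving room for [CLS] and [SEP]
    ids = ids[: max_length - 2]
    ids = [CLS_ID] + ids + [SEP_ID]
    # 1 marks real tokens, 0 marks padding
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


input_ids, attention_mask = encode("Forest fire near La Ronge", max_length=10)
print(input_ids)       # [101, 1000, 1001, 1002, 1003, 1004, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With a smaller `max_length` the same function truncates instead of padding, so every encoded sequence ends up with the same length.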

### Controlling parameters (used in the code below)
- add_special_tokens=True:

This parameter tells the tokenizer to add special tokens to the beginning and end of each sequence. For BERT, these are typically [CLS] at the beginning and [SEP] at the end. The [CLS] token is used for classification tasks, and the [SEP] token is used to separate different sentences or segments within a single sequence.

- return_attention_mask=True:

This parameter tells the tokenizer to return the attention mask, which we discussed earlier. This is necessary for the BERT model to distinguish between real tokens and padding tokens.

- pad_to_max_length=True:

This parameter tells the tokenizer to pad all sequences to a specified maximum length (max_length). This is important because all input sequences in a batch must have the same length. Newer versions of the transformers library deprecate this flag in favor of padding='max_length'.

- input_ids = inputs['input_ids']:

input_ids are the tokenized representations of your input text. The BERT tokenizer converts each token (word or subword) in your text into a unique integer ID. These IDs are used by the BERT model to look up the corresponding embeddings (vector representations) of each token. The tokenizer output behaves like a dictionary, and inputs['input_ids'] retrieves these IDs from it by key.

- attention_masks = inputs['attention_mask']:

attention_masks are used to tell the model which tokens in the input are actual words and which ones are padding tokens. This is important because BERT models are trained on fixed-length sequences, and not all input sequences are the same length. Padding tokens (usually represented by the ID 0) are added to the end of shorter sequences to make them all the same length. The attention mask is a binary mask that indicates which tokens are padding (0) and which are real words (1).

- output_attentions=False:

This parameter is used when loading the BERT model. Setting it to False means that the model will not return the attention weights. The attention weights are used to understand how the model is focusing on different parts of the input sequence when making predictions. If you don't need this information, you can set this parameter to False to save memory and computation.

- output_hidden_states=False:

Similar to output_attentions, this parameter is used when loading the BERT model. Setting it to False means that the model will not return the hidden states of each layer. The hidden states can be used for more advanced analysis or for creating more complex models, but if you're just doing simple classification, you can set this to False to save resources.

In [None]:
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the dataset
df = pd.read_csv('cleaned_train.csv')

# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

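# Tokenize and encode sequences in the dataset.
# Note: this encodes the raw 'text' column. To train on the cleaned tweets,
# use 'cleaned_text' instead (after filling any rows that became empty).
inputs = tokenizer.batch_encode_plus(
    df['text'].tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,  # Deprecated in newer transformers versions; padding='max_length' replaces it
    truncation=True,  # Cut off anything longer than max_length
    max_length=256,  # Choose a max_length that suits your dataset
    return_tensors='pt',
)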

# Get the input IDs and attention masks from the tokenizer output
input_ids = inputs['input_ids']
attention_masks = inputs['attention_mask']

# Get the labels from the dataframe
labels = torch.tensor(df['target'].values)

# Load the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # Number of output labels--2 for binary classification
    output_attentions=False,
    output_hidden_states=False,
)


In [8]:
#Setting Up Data Loaders
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

# Create the DataLoaders for our training and validation sets.
batch_size = 32

train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            val_dataset,
            sampler = SequentialSampler(val_dataset),
            batch_size = batch_size
        )


### Optimizer
AdamW is a popular optimizer used in training deep learning models, especially in the context of fine-tuning models like BERT. It is a variant of the Adam optimizer that incorporates weight decay, a technique used to regularize and prevent overfitting in the model.

Here's why AdamW is commonly used for fine-tuning BERT:

Adaptive Learning Rates: AdamW, like Adam, computes adaptive learning rates for each parameter. This means that it adjusts the learning rate based on the history of gradients for each parameter, which can lead to more effective and faster training compared to optimizers with a fixed learning rate.

Weight Decay: AdamW incorporates weight decay in a way that is more compatible with adaptive learning rate algorithms. Weight decay is a form of regularization that helps prevent the weights from growing too large, which can reduce overfitting. In Adam, weight decay is usually implemented as an L2 penalty added to the gradient, so it is rescaled by the adaptive per-parameter learning rate. In AdamW, the decay is applied directly to the weights, decoupled from the gradient-based update, which can lead to better performance in practice.

Stability: AdamW includes a term called epsilon (eps) that helps improve numerical stability during optimization. This can be particularly important when working with deep neural networks, where small numerical errors can accumulate and lead to instability.

While AdamW is a popular choice for fine-tuning BERT and other transformer-based models, there are other optimizers you could consider:

SGD (Stochastic Gradient Descent): A simple and classic optimizer that can work well with a properly tuned learning rate and momentum. It may require more epochs to converge compared to AdamW.

RMSprop: Similar to Adam in that it maintains a moving average of squared gradients to adapt the learning rate for each parameter, but does not have a bias correction term.

Adam: The optimizer that AdamW is derived from. It also adapts learning rates based on the history of gradients but handles weight decay differently.

Each optimizer has its strengths and weaknesses, and the best choice can depend on the specific task and dataset. However, AdamW is often recommended for fine-tuning BERT due to its balance of efficiency, stability, and effectiveness in handling sparse gradients and adaptive learning rates.
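
To illustrate the difference in how weight decay is handled, here is a minimal sketch of a single update step for one scalar weight, written in plain Python. The function name and the hyperparameter values are chosen for illustration; this is a simplified picture of the update rules, not the library implementation.

```python
import math


def adam_like_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, l2=0.0, decoupled_wd=0.0):
    """One update of a single scalar weight.

    l2:           L2 penalty folded into the gradient (Adam with L2 regularization)
    decoupled_wd: decay applied directly to the weight (AdamW style)
    """
    grad = grad + l2 * w                      # L2 term goes through the adaptive scaling
    m = beta1 * m + (1 - beta1) * grad        # First moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # Second moment estimate
    m_hat = m / (1 - beta1 ** t)              # Bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * decoupled_wd * w             # Decoupled decay bypasses the scaling
    return w - adaptive_update - decay, m, v


# Start from w = 1.0 with a zero loss gradient, so only the decay acts.
w_l2_strong, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_l2_weak, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.001)
w_adamw, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.1)

print(w_l2_strong, w_l2_weak)  # Both close to 0.999
print(w_adamw)                 # Close to 0.9999
```

On this first step, the L2 variant moves the weight by roughly the learning rate no matter how strong the penalty is, because the adaptive normalization cancels the penalty's magnitude. The decoupled variant shrinks the weight in proportion to the decay coefficient.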


### Discussion of optimizers

AdamW is not universally superior to other optimizers; its effectiveness depends on the specific task, model architecture, and dataset. However, AdamW is often preferred for training deep learning models like BERT and CNNs because it combines the benefits of adaptive learning rates (from Adam) with a more effective handling of weight decay. This can lead to better generalization and faster convergence in many cases.

For CNNs, AdamW can also be a good choice, especially in scenarios where you're dealing with sparse gradients or require adaptive learning rates. However, traditional optimizers like SGD with momentum are also commonly used for training CNNs and can perform very well, particularly with a well-tuned learning rate schedule.

In summary, while AdamW is a powerful optimizer that is well-suited for many deep learning tasks, the best optimizer for a given problem depends on the specific characteristics of the task and the model. It's often a good idea to experiment with different optimizers and learning rate schedules to find the best combination for your particular application.

In [None]:
#Optimizer & Learning Rate Scheduler
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import get_linear_schedule_with_warmup

# Set up the optimizer.
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,           # One of the learning rates suggested by the BERT authors (5e-5, 3e-5, 2e-5)
                  eps = 1e-8,          # This is the default epsilon value in torch.optim.AdamW
                  weight_decay = 0.0   # Matches the default of the deprecated transformers.AdamW
                )

# Total number of training steps is [number of batches] x [number of epochs].
epochs = 3
total_steps = len(train_dataloader) * epochs

# Set up the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # No warm-up phase
                                            num_training_steps = total_steps)


### What does 'Scheduler' do?
The learning rate scheduler adjusts the learning rate during training, typically reducing it over time, which helps the model converge to a good solution. The scheduler used here, get_linear_schedule_with_warmup, is a common choice for fine-tuning BERT and other transformer models. It starts with a "warm-up" phase, where the learning rate increases linearly from 0 to the initial learning rate (set when configuring the optimizer). After the warm-up phase, the learning rate decreases linearly from the initial learning rate to 0 over the remaining training steps. Warm-up helps stabilize the early stages of training and often improves final performance. Because num_warmup_steps is 0 in the code above, there is no warm-up in this notebook: the learning rate starts at 2e-5 and decays linearly to 0.
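
The shape of this schedule can be sketched as a small function that returns the multiplier applied to the base learning rate at each step. This is a plain-Python illustration of the rule described above; the function name is made up. The 645 total steps come from the 215 batches per epoch shown in the training output times 3 epochs, and the 65 warm-up steps in the second example are a hypothetical value.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Fraction of the base learning rate to use at a given step."""
    if step < num_warmup_steps:
        # Warm-up: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 645  # 215 batches x 3 epochs

# No warm-up, as in this notebook
for step in (0, 215, 430, 645):
    print(step, base_lr * linear_schedule_multiplier(step, 0, total_steps))

# With a hypothetical warm-up of 65 steps
for step in (0, 65, 355, 645):
    print(step, base_lr * linear_schedule_multiplier(step, 65, total_steps))
```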

## Training Phase

### Function definitions
**flat_accuracy function**:
The flat_accuracy function is used to calculate the accuracy of the model's predictions. Here's how it works:

- preds is a 2D array where each row holds the model's output for a single input, and each column holds the score for one class. In this notebook the scores are raw logits rather than probabilities. The shape of preds is (number of examples, number of classes).
- np.argmax(preds, axis=1) finds the index of the maximum value in each row, which corresponds to the predicted class. This converts the scores into class predictions.
- pred_flat is a 1D array containing the predicted classes for all input examples.
- labels is an array containing the true labels; flatten() turns it into the 1D array labels_flat (this changes nothing if it is already 1D).
- The function then compares pred_flat and labels_flat element-wise to determine how many predictions match the true labels. The sum of matches is divided by the total number of examples to calculate the accuracy.
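
Here is a small worked example of the same computation in plain Python, without NumPy. The logits and labels are made-up values. It also checks that taking the argmax of the raw scores gives the same classes as taking the argmax after a softmax, so no softmax is needed just to pick the predicted class.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# One row per example, one column per class (hypothetical logits)
logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 1.1]]
labels = [0, 1, 0, 0]

preds = [argmax(row) for row in logits]
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(preds)     # [0, 1, 1, 0]
print(accuracy)  # 0.75

# Softmax keeps the ordering within each row, so the argmax is unchanged
probs = [softmax(row) for row in logits]
print([argmax(row) for row in probs] == preds)  # True
```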

**Explanation of model.cuda()**:

model.cuda() is a method that moves the model's parameters and buffers to the GPU. This allows the model to take advantage of the parallel processing power of the GPU, which can significantly speed up training and inference.
If you've only worked with CPU computation before, using .cuda() is a way to switch to GPU computation. Note that you'll need a CUDA-capable GPU and the appropriate CUDA software installed for this to work. If you don't have a GPU, you can remove this line and the other .cuda() calls, and the model will run on the CPU by default. **Setting up the CUDA environment is covered later.**

**model.train() and training loop**

model.train() is a method that sets the model to training mode. This is important because some layers, like dropout and batch normalization, behave differently during training than during evaluation. By calling model.train(), you're telling the model that it should prepare for training. The for loop after model.train() iterates over the training data in batches. Here's what happens inside the loop:

- Data Loading: Each batch of data is unpacked into input IDs (b_input_ids), attention masks (b_input_mask), and labels (b_labels). These are then moved to the GPU if available.
- Zero Gradient: The gradients are zeroed out using model.zero_grad() to ensure that they don't accumulate across batches.
- Forward Pass: The model performs a forward pass on the input data (b_input_ids and b_input_mask) and computes the loss based on the predictions and true labels (b_labels).
- Backward Pass: The loss.backward() call computes the gradients of the loss with respect to the model parameters.
- Gradient Clipping: torch.nn.utils.clip_grad_norm_() is used to prevent the gradients from becoming too large, which can cause numerical instability. The norm is a measure of the vector's magnitude, typically calculated using the L2 norm (Euclidean norm), which is the square root of the sum of the squares of all elements in the vector. If the norm of the gradient vector exceeds the specified threshold (in this case, 1), we scale down the entire vector so that its norm becomes equal to the threshold. This is done by dividing each element of the gradient vector by the ratio of the norm to the threshold. So, if the norm of the gradient vector is 1.9 and the threshold is 1, each element of the gradient vector is scaled down by a factor of 1/1.9. This ensures that the scaled gradient vector has a norm of 1 while preserving the direction of the original gradient vector. The elements of the gradient vector are scaled down proportionally, not just the highest value.
- Parameter Update: The optimizer (optimizer.step()) updates the model parameters based on the computed gradients.
- Learning Rate Update: The learning rate scheduler (scheduler.step()) adjusts the learning rate according to the schedule.

This process is repeated for each batch in the dataset, and for each epoch (full pass over the dataset), allowing the model to learn from the data and adjust its parameters to minimize the loss.
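
The gradient clipping arithmetic described above can be checked with a minimal sketch in plain Python. It treats the gradient as one flat list of numbers and uses a made-up helper name; PyTorch's clip_grad_norm_ applies the same idea to the combined norm of all parameter gradients.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector down if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # Same direction, smaller length
    return grads, total_norm


# A gradient vector with norm 1.9 (hypothetical values)
grads = [1.14, 1.52]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(round(norm_before, 6))            # 1.9
print([round(g, 6) for g in clipped])   # [0.6, 0.8]

# A gradient that is already small enough is left unchanged
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0)[0])  # [0.3, 0.4]
```

Both elements are divided by 1.9, so the ratio between them (and therefore the direction of the gradient) stays the same.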


### Validation Phase 
- Add batch to GPU:

The line batch = tuple(t.cuda() for t in batch) moves each tensor in the batch (input IDs, attention masks, and labels) to the GPU. This is done to accelerate computation since GPUs are much faster than CPUs for the matrix operations involved in deep learning.

The batch is in tuple format because the data loader typically returns batches as tuples or lists of tensors. The tuple format is used here to ensure that the structure of the batch remains unchanged when moving it to the GPU.

- Logits:

In the context of neural networks, logits are the raw, unnormalized scores output by the last layer of the model before any activation function like softmax is applied. These scores are then typically passed through a softmax function to convert them into probabilities for each class.
outputs[0] refers to the logits here because the validation call passes no labels, in which case the model returns the logits for each input sequence as the first element of its output. When labels are passed, as in the training loop, the first element is the loss instead.

- Move logits and labels to CPU:

Logits and labels are moved to the CPU (logits.detach().cpu().numpy() and b_labels.to('cpu').numpy()) for further processing (like calculating the F1 score) that doesn't require the computational power of the GPU. This is done because some operations, like converting tensors to NumPy arrays, can only be done on CPU tensors. Additionally, moving data off the GPU can help free up GPU memory for other computations.

- Weighted average for F1 score:

The average='weighted' parameter in the f1_score function specifies that the F1 score is computed for each class separately and then averaged, with each class weighted by its share of the true instances. This is useful for imbalanced datasets, because classes with more instances contribute more to the final score. Note that the competition metric defined at the top of this notebook is the F1 of the positive class only (average='binary', the scikit-learn default), so the weighted score printed here is not directly comparable to the leaderboard score.
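
To see how the weighted average is put together, here is a minimal sketch in plain Python that computes the per-class F1 scores and combines them by class frequency. The helper names and the toy labels are made up for illustration; in the notebook itself scikit-learn's f1_score does this work.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by true class frequency."""
    n = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * y_true.count(c) / n
               for c in sorted(set(y_true)))


# Imbalanced toy data: six negatives, four positives (hypothetical values)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(round(f1_for_class(y_true, y_pred, 0), 4))  # 0.7692
print(round(f1_for_class(y_true, y_pred, 1), 4))  # 0.5714 (F1 of the positive class)
print(round(weighted_f1(y_true, y_pred), 4))      # 0.6901
```

In this example the weighted score is noticeably higher than the F1 of the positive class, because the majority class is predicted more accurately.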


In [21]:
#Training loops
import numpy as np
import time
import datetime
from sklearn.metrics import f1_score


# Function to calculate the accuracy of predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Function for formatting the elapsed time.
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

# Move the model to the GPU.
model.cuda()

# Store the average loss after each epoch so we can plot them.
loss_values = []

for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    print(f'======== Epoch {epoch_i + 1} / {epochs} ========')
    print('Training...')
    
    t0 = time.time()
    total_loss = 0
    model.train()
    
    for step, batch in enumerate(train_dataloader):
        
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print(f'  Batch {step}  of  {len(train_dataloader)}.    Elapsed: {elapsed}.')
        
        # Unpack this training batch from our dataloader. 
        b_input_ids = batch[0].cuda()
        b_input_mask = batch[1].cuda()
        b_labels = batch[2].cuda()
        
        # Clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()        
        
        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        
        loss = outputs[0]
        
        # Accumulate the training loss over all of the batches 
        total_loss += loss.item()
        
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        
        # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update parameters and take a step using the computed gradient
        optimizer.step()
        
        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print(f"  Average training loss: {avg_train_loss:.2f}")
    print(f"  Training epoch took: {format_time(time.time() - t0)}")    
    print("\nRunning Validation with F1 score...")

    t0 = time.time()

    # Put the model in evaluation mode
    model.eval()

    # Variables to gather full output
    predictions , true_labels = [], []

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.cuda() for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients
        with torch.no_grad():
            # Forward pass
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
        
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Store predictions and true labels
        predictions.append(logits)
        true_labels.append(label_ids)

    # Flatten the predictions and true labels
    flat_predictions = np.concatenate(predictions, axis=0)
    flat_true_labels = np.concatenate(true_labels, axis=0)

    # Convert logits to predicted class (0 or 1) using argmax
    flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

    # Calculate the F1 score
    f1 = f1_score(flat_true_labels, flat_predictions, average='weighted') # 'weighted' accounts for label imbalance

    print(f"F1 Score: {f1:.2f}")
    print(f"Validation took: {format_time(time.time() - t0)}\n")

print("Validation complete!")


Training...
  Batch 40  of  215.    Elapsed: 0:00:33.
  Batch 80  of  215.    Elapsed: 0:01:06.
  Batch 120  of  215.    Elapsed: 0:01:40.
  Batch 160  of  215.    Elapsed: 0:02:13.
  Batch 200  of  215.    Elapsed: 0:02:46.
  Average training loss: 0.06
  Training epoch took: 0:02:58

Running Validation with F1 score...
F1 Score: 0.83
Validation took: 0:00:07

Training...
  Batch 40  of  215.    Elapsed: 0:00:33.
  Batch 80  of  215.    Elapsed: 0:01:07.
  Batch 120  of  215.    Elapsed: 0:01:40.
  Batch 160  of  215.    Elapsed: 0:02:14.
  Batch 200  of  215.    Elapsed: 0:02:47.
  Average training loss: 0.06
  Training epoch took: 0:02:59

Running Validation with F1 score...
F1 Score: 0.83
Validation took: 0:00:07

Training...
  Batch 40  of  215.    Elapsed: 0:00:33.
  Batch 80  of  215.    Elapsed: 0:01:07.
  Batch 120  of  215.    Elapsed: 0:01:40.
  Batch 160  of  215.    Elapsed: 0:02:14.
  Batch 200  of  215.    Elapsed: 0:02:47.
  Average training loss: 0.06
  Training epoch 

In [22]:
model.save_pretrained('./my_bert_model3')
tokenizer.save_pretrained('./my_bert_model3')

('./my_bert_model3\\tokenizer_config.json',
 './my_bert_model3\\special_tokens_map.json',
 './my_bert_model3\\vocab.txt',
 './my_bert_model3\\added_tokens.json')

In [23]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification

df_test = pd.read_csv('test.csv')

# Reload the fine-tuned model and tokenizer saved above
tokenizer = BertTokenizer.from_pretrained('my_bert_model3')
model = BertForSequenceClassification.from_pretrained('my_bert_model3')
model.eval()  # Make sure dropout is disabled for inference

input_text = df_test['text'].tolist()
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Run the whole test set through the model on the CPU, without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)

# Build the submission file: one predicted target per test id
df_output = pd.DataFrame()
df_output['id'] = df_test['id']
df_output['target'] = predictions.numpy()
df_output.to_csv('submit3.csv', index=False)

In [3]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())

1.12.0+cu116
True


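### Personal notes: other types of BERT fine-tuning
BERT can be used for many natural language processing tasks, not only binary classification. The fine-tuning setup and the output layer differ from task to task:

Multi-class classification: If the task has more than two classes, you can simply change the model's output layer so that it has one output neuron per class. For example, with 10 classes the output layer should have 10 neurons, with a softmax activation converting the outputs into a probability distribution.

Context prediction: If the task is to predict a word from its context (for example, a fill-in-the-blank task), you can use BERT's Masked Language Model (MLM) capability. In this case you randomly mask some words in the input text and train the model to predict them. The output layer has to produce vocabulary-sized outputs, giving a probability distribution over the vocabulary for each masked word.

Question answering: For question answering, you can use BERT to predict the start and end positions of the answer in a text at the same time. This usually means adding two output layers on top of BERT, one predicting where the answer starts and one predicting where it ends. Each output layer produces a probability distribution over the positions in the text.

For all of these tasks, the rest of the model (the BERT encoder layers) usually stays the same; only the output layer and the loss function are adapted to the task. You may also need to adjust the input format and the preprocessing steps to fit the task.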

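For question answering (QA) tasks, the model is usually trained to predict the start and end positions of the answer in a text. The training process, using BERT as the example, looks like this:

Data preparation:

- First, prepare a QA dataset in which each sample contains a question, a context text that contains the answer (for example, a paragraph), and the start and end positions of the answer within the context.
- For each sample, concatenate the question and the context, separated by a special separator token (such as [SEP]), and add a special classification token (such as [CLS]) at the beginning of the sequence.

Model architecture:

- Use BERT to encode the concatenated question and context into a sequence of token representations.
- On top of BERT's output, add two linear layers, one for predicting the start position of the answer and one for predicting the end position. Each layer produces one score per token position, so its output has the same length as the sequence; after a softmax, the value at each position is the probability that the answer starts (or ends) there.

Training:

- During training, the goal is to minimize the gap between the predicted and the true start and end positions. This is usually done with a cross-entropy loss.
- For each sample, compute the loss for the start position and the loss for the end position, then add them to get the total loss. Use a gradient-based optimizer (such as Adam) to update the model parameters and minimize the total loss.

Prediction and evaluation:

- At prediction time, the model receives a question and a context and outputs probability distributions over the start and end positions of the answer.
- Pick the most probable positions as the start and end of the answer, then extract the text in that range from the context as the predicted answer.
- Performance is usually measured with metrics such as precision, recall, and F1, which take into account how much the predicted answer overlaps with the true answer.

This is the basic training process for question answering with BERT. In practice, the process may need to be adjusted to fit the specific dataset and task.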
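
The prediction step can be sketched as follows. This plain-Python example assumes the model has already produced one start score and one end score per token; the scores and tokens are made up. Beyond picking the highest-scoring positions, the sketch also requires the end not to come before the start and limits the answer length, which is a common refinement rather than something stated in the notes above.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest combined score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens = ["[CLS]", "where", "is", "la", "ronge", "?", "[SEP]",
          "la", "ronge", "is", "in", "saskatchewan", ".", "[SEP]"]

# Hypothetical scores: the start peaks at "saskatchewan"; the end has a
# slightly higher peak at an earlier token, which is not a valid end for that start.
start_scores = [0.1] * len(tokens)
end_scores = [0.1] * len(tokens)
start_scores[11] = 5.0
end_scores[11] = 4.0
end_scores[8] = 4.5

start, end = best_span(start_scores, end_scores)
print(start, end)               # 11 11
print(tokens[start:end + 1])    # ['saskatchewan']
```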


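In practice, there are also QA tasks where only the question is given, without a context containing the answer. This is usually called open-domain question answering, and it is more challenging than traditional context-based question answering. Open-domain QA typically involves the following steps:

Document Retrieval: Given a question, the system first retrieves relevant documents or passages from a large collection (such as Wikipedia). This step usually relies on information retrieval techniques such as an inverted index, TF-IDF, or BM25.

Candidate Generation: Text fragments that may contain the answer are extracted from the retrieved documents as answer candidates. This can be done with simple heuristics, for example selecting the sentences or paragraphs that contain keywords from the question.

Answer Extraction: For each candidate text, a model like the context-based QA model described above (for example, a BERT-based model) predicts the start and end positions of the answer. This step usually concatenates the question and the candidate text as the model input.

Answer Ranking: All candidate answers are ranked, and the most likely one is returned as the final output. The ranking can be based on the probabilities predicted by the model, possibly combined with other features such as the relevance score of the candidate text or the confidence of the answer.

Open-domain QA combines techniques from information retrieval, natural language processing, and machine learning. With the progress of pre-trained language models and retrieval techniques, the performance of open-domain QA systems keeps improving, but it remains an active research area in natural language processing.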
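
As an illustration of the document retrieval step, here is a minimal sketch of TF-IDF ranking in plain Python. It is deliberately simplified (whitespace tokenization, no stemming, a basic TF-IDF formula rather than BM25), and the documents and question are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank_documents(question, documents):
    """Return document indices sorted from most to least relevant."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in doc_tokens for term in set(toks))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        # Sum of (term frequency) x (inverse document frequency) over question terms
        score = sum(tf[t] / len(toks) * math.log(n_docs / df[t])
                    for t in tokenize(question) if t in tf)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical mini collection
documents = [
    "la ronge is a town in saskatchewan canada",
    "the forest fire spread quickly near the lake",
    "bert is a transformer model for language understanding",
]

ranking = rank_documents("where is la ronge", documents)
print(ranking)  # [0, 2, 1]
```

The top-ranked passages would then be passed, together with the question, to the answer extraction model.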