<span style=" color: yellow; font-size: 24px;">Deep Learning 1 Project _ Group 5:<br>Tweet Sentiment Extraction (Kaggle · Featured Code Competition)<br></span><span style="font-size: 22px;">
Shrey	Patel	101541370<br>
Sam	Emami	101575471<br>
Eric	Lessa	101549935<br>
Dwip	Makwana	101483523<br>
Moossa	Hussain	101542820<br>
Chaoyu	Liu	101573622<br>
Devanshi 	Dave	101582208<br>
Rutika	Bhuva	101551781<br>
</span>

In [7]:
!pip install shap
!pip install transformers
!pip install datasets
!pip show torchvision
!pip install tf-keras
!pip install "transformers[torch]"
!pip install "accelerate>=0.26.0"

Name: torchvision
Version: 0.20.1+cu118
Summary: image and video datasets and models for torch deep learning
Home-page: https://github.com/pytorch/vision
Author: PyTorch Core Team
Author-email: soumith@pytorch.org
License: BSD
Location: c:\users\dwipm\.conda\envs\ml-venv-310\lib\site-packages
Requires: numpy, pillow, torch
Required-by: 


# Hugging Face Transformers Tools for Disaster Tweet Classification

## Key Components

### DisasterTweetProcessor Class
Our main class that encapsulates:
- Data loading and preprocessing
- Model initialization and setup
- Feature preparation
- SHAP analysis and visualization

In [8]:
import os
import torch
import numpy as np
import pandas as pd
import shap
import matplotlib.pyplot as plt
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline,
    DefaultDataCollator
)

### Core Transformers Components

#### `AutoModelForSequenceClassification`:
- Used for binary classification of disaster/non-disaster tweets
- Initialized with 'bert-base-uncased' model
- Configured with 2 output labels
- Takes sequences of text and outputs classification logits

#### `AutoTokenizer`:

- Handles text tokenization for BERT model
- Converts raw text into model-compatible format
- Manages padding and truncation
- Creates attention masks
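
As a rough illustration of what padding, truncation, and attention masks mean, here is a toy sketch in plain Python. It assumes a hypothetical whitespace tokenizer with made-up token ids (`toy_encode`, `PAD_ID`, and the on-the-fly vocabulary are illustrative names, not part of the Transformers API). The real BERT tokenizer uses WordPiece subwords and adds `[CLS]`/`[SEP]` tokens, so its ids will differ.

```python
PAD_ID = 0  # hypothetical padding id for this toy example


def toy_encode(texts, max_length):
    """Whitespace-tokenize, then truncate or pad every sequence to max_length."""
    vocab = {}  # toy vocabulary built on the fly; ids start at 1
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]
        ids = ids[:max_length]                                 # truncation
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)               # padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)    # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["Forest fire near La Ronge", "Great day"], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The first text is cut to four tokens and the second is padded with two zeros; the attention mask marks which positions the model should attend to.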

#### `Trainer`:

- Manages the complete training process
- Handles batching and optimization
- Reports evaluation loss each epoch (no `compute_metrics` function is passed, so accuracy and F1 are not computed)
- Manages model saving
- Used in our code with custom training arguments
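
The `Trainer` call in this notebook does not pass a `compute_metrics` function, so only losses appear in the training log. Below is a minimal sketch of such a function, assuming NumPy only. The returned metric names (`accuracy`, `f1`) and the demo logits are illustrative choices, not values from this project.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy and F1 for the positive (disaster) class from logits and labels."""
    logits, labels = eval_pred            # Trainer passes (predictions, label_ids)
    preds = np.argmax(logits, axis=-1)    # the highest logit is the predicted class
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": float(np.mean(preds == labels)), "f1": f1}


# Made-up logits for four tweets, for illustration only
demo_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]])
demo_labels = np.array([1, 0, 0, 1])
metrics = compute_metrics((demo_logits, demo_labels))
print(metrics)
```

A function like this would be handed to the trainer as `Trainer(..., compute_metrics=compute_metrics)`, after which the returned values should show up next to the validation loss at each evaluation.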

In [9]:
class DisasterTweetProcessor:
    def __init__(self, train_path):
        self.train_df = pd.read_csv(train_path)
        self.tokenizer = None
        self.model = None
        
    def prepare_data(self):
        """Clean and prepare the data"""
        print("Dataset columns:", self.train_df.columns.tolist())
        print("\nSample data:")
        print(self.train_df.head())
        
        self.train_df = self.train_df.dropna(subset=['text'])
        
        # Ensure we have binary labels
        if 'target' in self.train_df.columns:
            self.train_df['label'] = self.train_df['target'].astype(int)
        
        dataset = Dataset.from_pandas(self.train_df)
        split_data = dataset.train_test_split(test_size=0.1, seed=42)
        return split_data['train'], split_data['test']

    def setup_model(self):
        """Initialize the model and tokenizer for classification"""
        model_name = "bert-base-uncased"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Initialize for binary classification (2 labels)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, 
            num_labels=2
        )
        
        # Move model to available device
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(device)

    def prepare_features(self, examples):
        """Prepare features for classification"""
        tokenized = self.tokenizer(
            examples['text'],
            padding='max_length',
            truncation=True,
            max_length=128
        )
        
        if 'label' in examples:
            tokenized['labels'] = examples['label']
            
        return tokenized
    
    def analyze_with_shap(self, sample_texts):
        """
        Perform SHAP analysis on sample texts with proper text handling
        """
        try:
            device = next(self.model.parameters()).device
            
            # Function to predict probabilities
            def model_wrapper(texts):
                # Handle both string and list inputs
                if isinstance(texts, str):
                    texts = [texts]
                # Convert any non-string elements to strings
                texts = [str(t) if not isinstance(t, str) else t for t in texts]
                
                inputs = self.tokenizer(texts, padding=True, truncation=True, 
                                    max_length=128, return_tensors="pt")
                inputs = {k: v.to(device) for k, v in inputs.items()}
                
                with torch.no_grad():
                    outputs = self.model(**inputs)
                    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
                    return probs.cpu().numpy()

            # Create explainer
            print("Creating SHAP explainer...")
            masker = shap.maskers.Text(self.tokenizer)
            explainer = shap.Explainer(
                model_wrapper,
                masker,
                output_names=["Not Disaster", "Disaster"]
            )
            
            # Calculate SHAP values
            print("Calculating SHAP values...")
            shap_values = explainer(sample_texts)
            
            return shap_values
            
        except Exception as e:
            print(f"Error in SHAP analysis: {str(e)}")
            import traceback
            traceback.print_exc()
            return None


    def visualize_shap_analysis(self, shap_values, sample_texts):
        """
        Display SHAP text plots and save each one as an HTML file
        """
        try:
            out_dir = 'shap_visualizations'
            os.makedirs(out_dir, exist_ok=True)
            
            # Word Importance Plot for each text
            for i, text in enumerate(sample_texts):
                # display=False returns the red/blue visualization as an HTML string
                html = shap.plots.text(shap_values[i], display=False)
                out_path = os.path.join(out_dir, f'shap_text_{i + 1}.html')
                with open(out_path, 'w', encoding='utf-8') as f:
                    f.write(html)

                # Show the interactive plot inline in the notebook
                shap.plots.text(shap_values[i])

            print(f"\nVisualizations saved in 'shap_visualizations' directory:")
            print("- Word importance plots: Show which words influenced each prediction")
            
        except Exception as e:
            print(f"Error in SHAP visualization: {str(e)}")
            print("Shape of SHAP values:", shap_values.values.shape)
            print("Base values shape:", shap_values.base_values.shape)
            print("Detailed error info:")
            import traceback
            traceback.print_exc()


## SHAP Analysis Implementation
#### `analyze_with_shap`:

- Creates model wrapper for SHAP compatibility
- Handles text preprocessing
- Generates SHAP values for interpretability
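
To make "SHAP values" more concrete, the sketch below computes exact Shapley values for a three-token text under a hypothetical scoring function. `toy_score` and its keyword weights are invented for illustration and stand in for the trained classifier. The SHAP library approximates this kind of computation over the real model by masking tokens, rather than enumerating every subset as done here.

```python
from itertools import combinations
from math import factorial

# Hypothetical keyword weights standing in for a trained classifier
TOY_WEIGHTS = {"earthquake": 0.6, "major": 0.2}


def toy_score(kept_tokens):
    """Pretend 'disaster score': 0.1 baseline plus the weights of the visible tokens."""
    return 0.1 + sum(TOY_WEIGHTS.get(tok, 0.0) for tok in kept_tokens)


def shapley_values(tokens, score_fn):
    """Exact Shapley values: weighted average marginal contribution of each token."""
    n = len(tokens)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [tokens[j] for j in subset]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                total += weight * (score_fn(with_i) - score_fn(without))
        values.append(total)
    return values


tokens = ["major", "earthquake", "downtown"]
phi = shapley_values(tokens, toy_score)
for tok, val in zip(tokens, phi):
    print(f"{tok}: {val:+.3f}")
```

Because the toy score is additive, each token's Shapley value equals its weight, and the values sum to the difference between the full-text score and the empty-text baseline. A real BERT classifier is not additive, which is why the attribution has to be estimated.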

#### `visualize_shap_analysis`:

- Creates word importance visualizations
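- Shows feature contributions for the selected output class:
  - Red: pushes the prediction toward that class (toward "disaster" when the Disaster output is selected)
  - Blue: pushes the prediction away from that class
- Creates the `shap_visualizations` output directory for the plots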

In [10]:
def get_sample_size(dataset, max_size=1000):
    """
    Get appropriate sample size based on dataset size
    """
    return min(len(dataset), max_size)

## Model Training Process

1. ### Data Preparation

- Loads and cleans disaster tweet dataset
- Handles missing values
- Creates train/eval splits
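
The 90/10 split sizes printed in the run log (6851 train, 762 eval) can be checked by hand. The sketch below assumes the evaluation share is rounded up, which is consistent with the reported numbers; `split_sizes` is a hypothetical helper, not a function from the `datasets` library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# 7613 = 6851 + 762 rows, taken from the sizes printed in the run log
n_train, n_test = split_sizes(7613, 0.1)
print(n_train, n_test)
```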


2. ### Feature Processing

- Tokenizes text data
- Implements padding and truncation
- Generates attention masks
- Prepares binary labels


3. ### Training Configuration

- Uses complete dataset (no sampling)
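- Evaluates and saves a checkpoint at the end of every epoch
- Reloads the best checkpoint when training finishes (`load_best_model_at_end=True`); no early-stopping callback is registered
- Sets learning rate 2e-5, weight decay 0.01, batch size 8, and 3 epochs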
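
As far as the code in this notebook shows, training always runs for the full three epochs and the best checkpoint is reloaded afterwards. For comparison, here is a sketch of what patience-based early stopping would do with the validation losses from this run's training log. `early_stop_epoch` and `patience` are illustrative names for a plain-Python stand-in, not the Transformers implementation.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), 1-indexed, for patience-based early stopping."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch   # stop: no improvement for `patience` epochs
    return len(val_losses), best_epoch     # ran to completion


# Validation losses per epoch as reported in this notebook's training log
val_losses = [0.486356, 0.497899, 0.657174]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=1)
print(stop_epoch, best_epoch)
```

With a patience of one epoch, training would stop after epoch 2 and keep the epoch 1 weights, since validation loss rises from then on while training loss keeps falling. In Transformers, the corresponding mechanism is `EarlyStoppingCallback(early_stopping_patience=...)` passed through the trainer's `callbacks` argument, which works together with `load_best_model_at_end=True`.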

In [11]:
def main():
    # Check for GPU availability
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"\nUsing device: {device}")

    # Disable wandb
    os.environ["WANDB_MODE"] = "disabled"
    
    # Initialize processor
    print("Initializing data processor...")
    processor = DisasterTweetProcessor('../tweet-disaster/train.csv')
    
    # Prepare data
    print("\nPreparing datasets...")
    train_dataset, eval_dataset = processor.prepare_data()
    print(f"Train dataset size: {len(train_dataset)}")
    print(f"Eval dataset size: {len(eval_dataset)}")
    
    # Setup model
    print("\nSetting up model...")
    processor.setup_model()
    
    # Process features
    print("\nProcessing training data...")
    tokenized_train = train_dataset.map(
        processor.prepare_features,
        remove_columns=train_dataset.column_names,
        batched=True
    )
    
    print("\nProcessing evaluation data...")
    tokenized_eval = eval_dataset.map(
        processor.prepare_features,
        remove_columns=eval_dataset.column_names,
        batched=True
    )
    
    # Get appropriate sample sizes
    #train_sample_size = get_sample_size(tokenized_train)
    #eval_sample_size = get_sample_size(tokenized_eval)
    
    #print(f"\nSelecting {train_sample_size} training samples and {eval_sample_size} evaluation samples...")
    #small_train = tokenized_train.shuffle(seed=42).select(range(train_sample_size))
    #small_eval = tokenized_eval.shuffle(seed=42).select(range(eval_sample_size))
    
    print(f"\nUsing {len(tokenized_train)} training samples and {len(tokenized_eval)} evaluation samples...")

    # Setup training arguments
    training_args = TrainingArguments(
        "finetune-BERT-disaster",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_steps=100,
        save_strategy="epoch",
        load_best_model_at_end=True,
        no_cuda=not torch.cuda.is_available(),  # Use CUDA if available
        report_to="none"  # Disable reporting if not needed
    )
    
    # Initialize trainer
    print("\nInitializing trainer...")
    trainer = Trainer(
        model=processor.model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        data_collator=DefaultDataCollator(),
        tokenizer=processor.tokenizer,
    )
    
    # Train model
    print("\nStarting training...")
    try:
        trainer.train()
        print("\nTraining completed successfully!")
    except Exception as e:
        print(f"\nError during training: {str(e)}")
        return
    
    # Save the model
    print("\nSaving model...")
    trainer.save_model("disaster_tweet_model")
    
    # Test predictions
    print("\nTesting model on sample tweets...")
    test_texts = [
        "There was a major earthquake in the city center",
        "Having a great day at the park",
        "Breaking: Massive flood reported in coastal areas"
    ]
    
    # Get the device the model is on
    device = next(processor.model.parameters()).device
    
    # Tokenize test texts and move to correct device
    inputs = processor.tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = processor.model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # Move predictions to CPU for printing
        predictions = predictions.cpu()
    
    print("\nPrediction Results:")
    for text, pred in zip(test_texts, predictions):
        disaster_prob = pred[1].item()
        print(f"\nText: {text}")
        print(f"Disaster Probability: {disaster_prob:.4f}")

    # Perform SHAP analysis on sample texts
    print("\nPerforming SHAP analysis...")
    try:
        sample_texts = [
            "There was a major earthquake in the city center",
            "Having a great day at the park",
            "Breaking: Massive flood reported in coastal areas"
        ]
        sample_texts = [str(text) for text in sample_texts]
        
        print("Starting SHAP analysis with sample texts:")
        for i, text in enumerate(sample_texts):
            print(f"{i+1}. {text}")
            
        shap_values = processor.analyze_with_shap(sample_texts)
        
        if shap_values is not None:
            print("\nGenerating SHAP visualizations...")
            processor.visualize_shap_analysis(shap_values, sample_texts)
            print("\nSHAP analysis and visualization completed successfully!")
            
            # Display interpretation guide
            print("\nInterpretation Guide:")
            print("- Red words/features contribute positively to disaster classification")
            print("- Blue words/features contribute negatively to disaster classification")
            print("- The width of color bars indicates the magnitude of the contribution")
            
    except Exception as e:
        print(f"Error during SHAP analysis: {str(e)}")
        import traceback
        traceback.print_exc()

## Model Outputs Explained

- Logits: Raw, unnormalized model outputs
- Probabilities: Softmax-normalized prediction scores
- SHAP Values: Feature importance scores for interpretation
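
The step from logits to probabilities can be sketched in a few lines of plain Python. The logits below are made up for illustration; in the notebook the same conversion is done with `torch.nn.functional.softmax` on the model's output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                               # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one tweet: [not disaster, disaster]
logits = [-2.1, 3.1]
probs = softmax(logits)
print(f"Disaster probability: {probs[1]:.4f}")
```

The probabilities sum to one, and only the difference between the two logits matters: adding the same constant to both leaves the result unchanged.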

In [12]:
if __name__ == "__main__":
    main()


Using device: cuda
Initializing data processor...

Preparing datasets...
Dataset columns: ['id', 'keyword', 'location', 'text', 'target']

Sample data:
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
Train dataset size: 6851
Eval dataset size: 762

Setting up model...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Processing training data...


Map:   0%|          | 0/6851 [00:00<?, ? examples/s]


Processing evaluation data...


Map:   0%|          | 0/762 [00:00<?, ? examples/s]


Selecting Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 6851
}) training samples and Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 762
}) evaluation samples...

Initializing trainer...

Starting training...




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.4211 | 0.486356 |
| 2 | 0.3663 | 0.497899 |
| 3 | 0.2824 | 0.657174 |



Training completed successfully!

Saving model...

Testing model on sample tweets...

Prediction Results:

Text: There was a major earthquake in the city center
Disaster Probability: 0.9945

Text: Having a great day at the park
Disaster Probability: 0.0935

Text: Breaking: Massive flood reported in coastal areas
Disaster Probability: 0.9950

Performing SHAP analysis...
Starting SHAP analysis with sample texts:
1. There was a major earthquake in the city center
2. Having a great day at the park
3. Breaking: Massive flood reported in coastal areas
Creating SHAP explainer...
Calculating SHAP values...

Generating SHAP visualizations...



Visualizations saved in 'shap_visualizations' directory:
- Word importance plots: Show which words influenced each prediction

SHAP analysis and visualization completed successfully!

Interpretation Guide:
- Red words/features contribute positively to disaster classification
- Blue words/features contribute negatively to disaster classification
- The width of color bars indicates the magnitude of the contribution
