# Sentiment Analysis Project with Transfer Learning and Fine-Tuning

## 1. Introduction

This project applies transfer learning to sentiment analysis: we take a pre-trained language model and fine-tune it to classify text as positive or negative.

## 2. Environment Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 3. Model and Platform Research and Selection

For this project, we'll use DistilBERT, a distilled version of BERT that its authors report to be 40% smaller and 60% faster while retaining 97% of BERT's language-understanding performance. The notebook runs on a local machine or on a cloud platform such as Google Colab.

Other models we considered:
- BERT
- RoBERTa
- ELECTRA
- XLNet

We chose DistilBERT for its good balance between performance and computational efficiency.

## 4. Tokenization Research and Selection

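We'll use the tokenizer that ships with the distilbert-base-uncased checkpoint. It is a WordPiece tokenizer that lowercases its input and uses the same vocabulary as bert-base-uncased, so the token ids match what the model saw during pre-training.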

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Example of tokenization
example_text = "This movie was great! I really enjoyed it."
tokenized_text = tokenizer(example_text, padding=True, truncation=True, return_tensors="pt")
print("Tokenized text:", tokenizer.convert_ids_to_tokens(tokenized_text["input_ids"][0]))
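
The `padding` and `truncation` arguments decide how variable-length reviews become rectangular tensors. The block below is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The helper name `pad_and_truncate` is hypothetical and the token ids are illustrative placeholders; the pad id `0` is assumed to correspond to `[PAD]` in the uncased BERT vocabulary. Real tokenizers also keep the closing `[SEP]` token when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only: one short and one long sequence
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model tell padding apart from content, which is why the tokenizer returns it alongside `input_ids`.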

## 5. Dataset Research and Selection

We'll use the IMDB dataset for sentiment analysis. It contains 50,000 movie reviews, split evenly into 25,000 for training and 25,000 for testing, each labeled as positive or negative.

In [1]:
dataset = load_dataset("imdb")

# Split into training and testing sets
train_dataset = dataset["train"]
test_dataset = dataset["test"]

print(f"Training set size: {len(train_dataset)}")
print(f"Testing set size: {len(test_dataset)}")

## 6. Model Training

Now we'll fine-tune the DistilBERT model on our IMDB dataset.

In [None]:
# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
)

# Prepare the datasets
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

# Train the model
trainer.train()
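
The `warmup_steps=500` setting interacts with the total number of optimizer steps. The sketch below works through the arithmetic for a linear warmup followed by linear decay, under these assumptions: 25,000 training reviews, a single device, no gradient accumulation, and a peak learning rate of 5e-5 (the value is not set in the training arguments above, so it is assumed here). The function name `linear_schedule_lr` is hypothetical and only mirrors the shape of such a schedule.

```python
import math


def linear_schedule_lr(step, total_steps, warmup_steps=500, peak_lr=5e-5):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumptions: 25,000 examples, batch size 16, 3 epochs, one device
steps_per_epoch = math.ceil(25_000 / 16)
total_steps = steps_per_epoch * 3

for step in (0, 250, 500, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {linear_schedule_lr(step, total_steps):.2e}")
```

Under these assumptions the warmup covers roughly the first tenth of training; with a much smaller dataset, 500 warmup steps could consume most of the run.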

## 7. Model Evaluation

Let's evaluate our model's performance on the test set.

In [None]:
# Make predictions on the test set
predictions = trainer.predict(tokenized_test)

# Calculate accuracy
preds = np.argmax(predictions.predictions, axis=-1)
accuracy = accuracy_score(tokenized_test["label"], preds)
print(f"Model accuracy: {accuracy}")
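
Accuracy is a single number and hides which kind of mistake the model makes. As a hedged complement, the sketch below computes precision, recall, and F1 for the positive class in plain Python. The function name `binary_report` and the toy labels are hypothetical; in the notebook the inputs would be the test labels and `preds`.

```python
def binary_report(labels, preds):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data for illustration only, not results from the model
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_report(toy_labels, toy_preds)
print(report)
```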

## 8. Model Generalization

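To test our model's generalization capability, let's evaluate it on a dataset from a different domain: Yelp reviews. The Yelp labels are star ratings (0 to 4, for 1 to 5 stars), while our model predicts two classes, so we drop the neutral 3-star reviews and map the remaining ratings to negative or positive.

In [None]:
# Load the Yelp dataset
yelp_dataset = load_dataset("yelp_review_full")
yelp_test = yelp_dataset["test"].select(range(10000))  # Use a subset for faster evaluation

# Drop neutral 3-star reviews (label 2), then map 1-2 stars to 0 (negative)
# and 4-5 stars to 1 (positive) so the labels match the binary model
yelp_test = yelp_test.filter(lambda example: example["label"] != 2)
yelp_test = yelp_test.map(lambda example: {"label": int(example["label"] > 2)})

# Tokenize the Yelp dataset
tokenized_yelp = yelp_test.map(tokenize_function, batched=True)

# Evaluate the model on the Yelp dataset
yelp_predictions = trainer.predict(tokenized_yelp)
yelp_preds = np.argmax(yelp_predictions.predictions, axis=-1)
yelp_accuracy = accuracy_score(tokenized_yelp["label"], yelp_preds)
print(f"Accuracy on Yelp dataset: {yelp_accuracy}")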

## 9. Comparison of Tokenization Techniques

Let's compare the tokenization outputs of different models to understand how they process text differently.

In [None]:
text = "This movie was fantastic! The acting was superb and the plot kept me on the edge of my seat."

tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer_roberta = AutoTokenizer.from_pretrained("roberta-base")

print("DistilBERT tokenization:", tokenizer.tokenize(text))
print("BERT tokenization:", tokenizer_bert.tokenize(text))
print("RoBERTa tokenization:", tokenizer_roberta.tokenize(text))
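
DistilBERT and BERT both use WordPiece, while RoBERTa uses byte-level BPE, which is why the first two outputs match and the third does not. To make the WordPiece side concrete, here is a toy sketch of greedy longest-match-first splitting. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical, chosen only to force a split; the real vocabulary is much larger and may keep these words whole.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary for illustration
toy_vocab = {"super", "##b", "fan", "##tas", "##tic", "plot"}
for word in ("superb", "fantastic", "plot", "xyz"):
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```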

## 10. Model Deployment

To deploy our model for real-time sentiment prediction, we can create a simple Flask application. Here's an example of how this could be implemented:

In [None]:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
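def predict():
    text = request.json['text']
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Move inputs to the model's device; the Trainer may have placed the model on a GPU
    inputs = inputs.to(model.device)
    model.eval()  # Disable dropout for inference
    with torch.no_grad():  # Gradients are not needed for prediction
        outputs = model(**inputs)
    prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    sentiment = 'positive' if prediction[0][1] > 0.5 else 'negative'
    confidence = float(prediction[0][1] if sentiment == 'positive' else prediction[0][0])
    return jsonify({'sentiment': sentiment, 'confidence': confidence})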

# Uncomment the following lines to run the Flask app
# if __name__ == '__main__':
#     app.run(debug=True)

print("To deploy this model, serve the Flask app behind a production WSGI server rather than the built-in development server.")
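
The endpoint turns two logits into a label and a confidence score. The following pure-Python sketch shows that step in isolation, assuming index 0 means negative and index 1 means positive, as in the IMDB labels. The names `softmax` and `decide` are hypothetical helpers, and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, labels=("negative", "positive")):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits: a strongly positive and a mildly negative example
for logits in ([-1.2, 2.3], [0.9, -0.4]):
    label, confidence = decide(logits)
    print(logits, "->", label, round(confidence, 3))
```

With two classes, picking the larger probability is equivalent to the 0.5 threshold used in the route.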

## 11. Conclusions

In this project, we've successfully fine-tuned a DistilBERT model for sentiment analysis. Here are some key takeaways:

1. We achieved good accuracy on the IMDB dataset, demonstrating the effectiveness of transfer learning.
2. The model showed some generalization capability when tested on the Yelp dataset, though with lower accuracy.
3. Tokenization depends on the model family: DistilBERT and BERT share the same WordPiece vocabulary and split text identically, while RoBERTa's byte-level BPE can split the same text differently.
4. Deploying the model as a web service allows for real-time sentiment predictions.

Future improvements could include:
- Experimenting with other pre-trained models
- Fine-tuning hyperparameters for better performance
- Collecting a more diverse dataset for improved generalization
- Implementing more robust error handling and input validation in the deployment script
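
As one possible starting point for the input-validation item, the sketch below checks a request payload before it reaches the model. The helper name `validate_payload` and the 5,000-character limit are assumptions made for illustration, not part of the deployment script above.

```python
def validate_payload(payload, max_chars=5000):
    """Return (text, error); exactly one of the two is None."""
    if not isinstance(payload, dict):
        return None, "Request body must be a JSON object."
    text = payload.get("text")
    if not isinstance(text, str):
        return None, "Field 'text' must be a string."
    text = text.strip()
    if not text:
        return None, "Field 'text' must not be empty."
    if len(text) > max_chars:
        return None, f"Field 'text' exceeds {max_chars} characters."
    return text, None


# A valid payload and three invalid ones
for payload in ({"text": " Great film! "}, {"text": ""}, {"review": "hi"}, None):
    print(payload, "->", validate_payload(payload))
```

In the Flask route, this could be called on the result of `request.get_json(silent=True)`, returning the error message with an HTTP 400 status when the error is not `None`.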

This project demonstrates the power of transfer learning in NLP tasks and provides a foundation for further exploration in sentiment analysis and related fields.