# <p style="text-align: center;"> ARI5501 NLP Midterm Project: Sentiment Analysis</p>
## <p style="text-align: center;"> Fatih Ayaz</p>

The goal of this project is to apply sentiment analysis techniques to determine the sentiment (positive, negative, neutral) of various text datasets. We will:

**Step 0:** Data Preprocessing for IMDB Reviews  
**Step 1:** Train a sentiment analysis model (IMDB Reviews).  
**Step 2:** Evaluate the model's performance on a separate English dataset (Sentiment140).  
**Step 3:** Optionally, translate Turkish product reviews to English and test the model on the translated data.

# Step 0: Data Preprocessing for IMDB Reviews
**Objective:** Clean and preprocess the dataset to prepare it for effective model training.

In [1]:
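# Quote the extras specifier so zsh does not treat the brackets as a glob pattern;
# the PyPI package that provides sklearn is named scikit-learn
!pip install pandas nltk 'transformers[torch]' scikit-learn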


### Step 0.1: Load and Inspect the IMDB Dataset
Load the IMDB dataset and inspect its first rows and structure.

In [2]:
import pandas as pd

# Load the dataset
df_imdb = pd.read_parquet('train-00000-of-00001.parquet')

# Optional: take a sample of 5000 entries for a faster run (disabled here, so all 25000 rows are used)
# df_imdb = df_imdb.sample(n=5000, random_state=42)

# Display the first few rows of the dataframe
print(df_imdb.head())

# Display the structure and summary of the dataframe
df_imdb.info()


                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


### Step 0.2: Data Cleaning
Clean the text by converting it to lowercase, tokenizing it, and removing punctuation, non-alphabetic tokens, and stopwords.

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Define a function to clean the text
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Tokenize text
    tokens = word_tokenize(text)
    # Keep only purely alphabetic tokens (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Join words back into one string
    return ' '.join(tokens)

# Apply the cleaning function to the text column
df_imdb['text'] = df_imdb['text'].apply(clean_text)

# Display the cleaned text
print(df_imdb['text'].head())


[nltk_data] Downloading package punkt to /Users/fatihayaz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fatihayaz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    rented video store controversy surrounded firs...
1    curious yellow risible pretentious steaming pi...
2    avoid making type film future film interesting...
3    film probably inspired godard masculin féminin...
4    oh brother hearing ridiculous film umpteen yea...
Name: text, dtype: object
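
To see what this cleaning does to sentiment-bearing words, the sketch below applies the same three steps (lowercase, keep alphabetic tokens, drop stopwords) to two sentences. It is a simplified stand-in rather than the cell above: `toy_clean_text` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting replaces `word_tokenize`, and the stopword set is a small hand-picked subset of NLTK's English list, which includes negations such as "not" and "no".

```python
# Hypothetical, hand-picked subset of NLTK's English stopword list
TOY_STOPWORDS = {"this", "is", "the", "a", "not", "no", "was"}

def toy_clean_text(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    tokens = text.lower().split()
    tokens = [tok for tok in tokens if tok.isalpha()]
    tokens = [tok for tok in tokens if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

negated = toy_clean_text("This movie is not good")
plain = toy_clean_text("This movie is good")
print(repr(negated))
print(repr(plain))
```

Both sentences reduce to the same string, so the negation is gone before the text reaches the BERT tokenizer. This may be worth keeping in mind when interpreting the scores later in the notebook.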


### Step 0.3: Splitting Data into Training and Validation Sets
Split the dataset to create a training set and a validation set.

In [4]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_df, val_df = train_test_split(df_imdb, test_size=0.2, random_state=42)

# Display the sizes of the training and validation sets
print("Training Set Shape:", train_df.shape)
print("Validation Set Shape:", val_df.shape)


Training Set Shape: (20000, 2)
Validation Set Shape: (5000, 2)


# Step 1: Train a sentiment analysis model (IMDB Reviews)
**Objective:** Train a sentiment analysis model using the IMDB dataset to accurately predict sentiment labels.

### Step 1.1: Selecting a Pre-trained Model
Use a pre-trained BERT model from the transformers library.

In [5]:
from transformers import BertTokenizer, BertForSequenceClassification

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 1.2: Preparing Data for Training
Tokenize the text data and prepare it in a format suitable for training the model.

In [6]:
import torch

# Tokenize the data
train_encodings = tokenizer(train_df['text'].tolist(), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_df['text'].tolist(), truncation=True, padding=True, max_length=128)

# Convert to torch dataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_df['label'].tolist())
val_dataset = IMDbDataset(val_encodings, val_df['label'].tolist())
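
As a rough illustration of what `truncation=True, padding=True, max_length=128` produce, the sketch below pads and truncates toy token-id lists by hand. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, special tokens such as `[CLS]` and `[SEP]` are ignored, and the padding id of 0 is assumed to match `bert-base-uncased`.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three reviews of different lengths
toy_batch = [[7, 8, 9, 10, 11, 12], [7, 8], [7, 8, 9]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

With `max_length=128`, any review longer than 128 tokens loses its tail, which matters for long IMDB reviews whose verdict comes at the end.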


### Step 1.3: Fine-tuning the Model and Evaluation
Set up the training arguments, initialize the trainer, and start the training and evaluation process.

In [7]:
from transformers import Trainer, TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()

# Evaluate the model
evaluation_results = trainer.evaluate()
print(evaluation_results)




| Step | Training Loss |
|---|---|
| 10 | 0.7399 |
| 20 | 0.7164 |
| 30 | 0.7059 |
| 40 | 0.6964 |
| 50 | 0.6923 |
| 60 | 0.6812 |
| 70 | 0.7124 |
| 80 | 0.6863 |
| 90 | 0.6784 |
| 100 | 0.6549 |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


{'eval_loss': 0.5622368454933167, 'eval_runtime': 103.7648, 'eval_samples_per_second': 48.186, 'eval_steps_per_second': 3.016, 'epoch': 3.0}


The training and evaluation have been completed successfully. Here are the evaluation results:

With 5000 Rows:
- Eval Loss: 0.6512
- Eval Runtime: 16.88 seconds
- Eval Samples per Second: 59.25
- Eval Steps per Second: 3.73

With 25000 Rows:
- Eval Loss: 0.5622
- Eval Runtime: 103.76 seconds
- Eval Samples per Second: 48.19
- Eval Steps per Second: 3.02
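
The validation results above contain only the loss, because no metric function was given to the `Trainer`. A minimal sketch of one, assuming a two-class model: `compute_metrics` below is a hypothetical helper written in plain Python so that it runs on its own, and the toy logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 for a two-class model.

    eval_pred unpacks into (logits, labels); plain lists and NumPy
    arrays both work because the rows are only iterated over.
    """
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up logits and labels for four validation examples
toy_logits = [[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]]
toy_labels = [0, 1, 0, 1]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing the function as `Trainer(..., compute_metrics=compute_metrics)` should make `trainer.evaluate()` report these values alongside the loss.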


# Step 2: Evaluate Model Performance on English Data (Sentiment140)
**Objective:** Test the trained model on a separate English dataset (Sentiment140) to measure its performance.

### Step 2.1: Prepare the Sentiment140 Dataset
- Load and preprocess the Sentiment140 dataset. Reading the file with the default encoding raised errors, so I used ISO-8859-1 encoding.

In [10]:
import pandas as pd
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer

# Load the dataset with ISO-8859-1 encoding
df_sentiment = pd.read_csv('training.1600000.processed.noemoticon.csv', header=None, usecols=[0, 5], names=['label', 'text'], encoding='ISO-8859-1')
df_sentiment['label'] = df_sentiment['label'].replace(4, 1)  # Convert labels from 4 to 1 for positive sentiment

# Sample the dataset
df_sentiment = df_sentiment.sample(n=100000, random_state=42)  # Reduce the size for demonstration

# Define a function to clean the text
def clean_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if not word in stop_words]
    text = ' '.join(tokens)
    return text

# Clean and tokenize text
df_sentiment['text'] = df_sentiment['text'].apply(clean_text)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentiment_encodings = tokenizer(df_sentiment['text'].tolist(), truncation=True, padding=True, max_length=128)

# Convert to torch dataset
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

test_dataset = SentimentDataset(sentiment_encodings, df_sentiment['label'].tolist())
test_loader = DataLoader(test_dataset, batch_size=16)




### Step 2.2: Evaluate the Model
- Use the trained model to predict the sentiment of the test dataset and calculate the required performance metrics.

In [11]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model.eval()
predictions, true_labels = [], []

# Predict
for batch in test_loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    preds = torch.argmax(logits, dim=-1)
    predictions.extend(preds.cpu().numpy())
    true_labels.extend(batch['labels'].cpu().numpy())

# Calculate metrics
accuracy = accuracy_score(true_labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")


Accuracy: 0.62871
Precision: 0.6218381271911637
Recall: 0.6590686617256328
F1-score: 0.6399123274917321


The model's evaluation on the Sentiment140 dataset shows the following results:

Based on 1000 Samples:
- Accuracy: 0.665
- Precision: 0.6237
- Recall: 0.8452
- F1-score: 0.7178

Based on 100000 Samples:
- Accuracy: 0.629
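- Precision: 0.622
- Recall: 0.659
- F1-score: 0.640

These results indicate that the model transfers only moderately well from movie reviews to tweets, with accuracy between 0.63 and 0.67 on a two-class task. Accuracy and precision changed little with sample size, but recall fell from 0.845 on 1000 samples to 0.659 on 100000 samples, which lowered the F1-score from 0.718 to 0.640.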
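
The F1-score is the harmonic mean of precision and recall, so the reported figures can be cross-checked by hand. The sketch below recomputes it from the precision and recall printed by the evaluation cell and from the 1000-sample values listed above; `f1_score_from` is a hypothetical helper name.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall printed by the evaluation cell (100000 samples)
f1_100k = f1_score_from(0.6218381271911637, 0.6590686617256328)
# Precision and recall listed for the 1000-sample run
f1_1k = f1_score_from(0.6237, 0.8452)

print(f"F1 on 100000 samples: {f1_100k:.4f}")
print(f"F1 on 1000 samples:   {f1_1k:.4f}")
```

The first value agrees with the F1-score printed by the cell, and the second agrees with the 1000-sample figure to about three decimal places, since its inputs are already rounded.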

### Step 2.3: Display Sample Records from Sentiment140 Dataset
Since the Sentiment140 dataset does not include a 'Neutral' label, I have considered the two-class scenario (Positive and Negative). 'Neutral' is still included in the label mapping in case it appears as a class.

In [14]:
# Map the predicted sentiments back to their string labels
label_mapping = {1: 'Positive', 0: 'Negative', 2: 'Neutral'}  # Assuming '2' for neutral if applicable
df_sentiment['predicted_sentiment'] = [label_mapping.get(pred, 'Unknown') for pred in predictions]

# Map the true sentiments back to their string labels
df_sentiment['true_sentiment'] = df_sentiment['label'].map(label_mapping)

# Select a sample of 50 records to display
sample_records = df_sentiment[['text', 'true_sentiment', 'predicted_sentiment']].sample(n=50, random_state=42)

# Display the sample records
sample_records


| index | text | true_sentiment | predicted_sentiment |
|---|---|---|---|
| 164598 | whenever rains hard get motivated | Negative | Positive |
| 1510882 | therealedjones lol shut uuuuuup ed feel | Positive | Negative |
| 1267585 | mileycyrus hey miley climb really beautiful so... | Positive | Positive |
| 1541756 | congrats ddlovato ill vote u amp tell friends ... | Positive | Positive |
| 1131925 | kctothemaxxx contortionist kitteh | Positive | Positive |
| 323156 | vitorpbalan normal day except meeting friend b... | Negative | Positive |
| 286879 | katherinemarsh parents wo let get cos apparent... | Negative | Negative |
| 1173758 | thanks much tonyajc twhtan | Positive | Positive |
| 788180 | needs another marlena day soon | Negative | Positive |
| 243900 | sore throat | Negative | Negative |
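
A quick way to see which kind of error dominates is to count the (true, predicted) pairs. The sketch below does this for the ten rows displayed above, copied by hand; on the full sample the same counting could be applied to the `true_sentiment` and `predicted_sentiment` columns.

```python
from collections import Counter

# (true_sentiment, predicted_sentiment) for the ten rows displayed above
rows = [
    ("Negative", "Positive"),
    ("Positive", "Negative"),
    ("Positive", "Positive"),
    ("Positive", "Positive"),
    ("Positive", "Positive"),
    ("Negative", "Positive"),
    ("Negative", "Negative"),
    ("Positive", "Positive"),
    ("Negative", "Positive"),
    ("Negative", "Negative"),
]

# Count how often each (true, predicted) combination occurs
confusion = Counter(rows)
for (true_label, predicted_label), count in sorted(confusion.items()):
    print(f"true={true_label:<8} predicted={predicted_label:<8} count={count}")
```

In these ten rows, three of the four errors are negative tweets predicted as positive. Ten rows are too few to generalize from, so this only illustrates the counting.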


# Step 3: Multilingual Sentiment Analysis (Bonus Task)
**Objective:** Translate Turkish product comments into English, and use the translated data to test the model trained on English data.

### Step 3.1: Translate the Turkish Product Reviews
Use OpenAI's GPT-3.5-turbo to translate the text.

In [24]:
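import os
import openai

# Read the OpenAI API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ['OPENAI_API_KEY']

# Function to translate text from Turkish to English using the chat model
# (openai.ChatCompletion is the interface of openai package versions before 1.0)
def translate_text(text):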
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the following Turkish text to English."},
            {"role": "user", "content": text}
        ]
    )
    return response['choices'][0]['message']['content'].strip()

# Load the Turkish dataset
df_turkish = pd.read_json('CSE4078S24_Grp5_AlpacaStyle_DatasetCombined.json').sample(50)

# Translate the dataset
df_turkish['translated_text'] = df_turkish['input'].apply(lambda x: translate_text(x))

# Display some translations to verify
print(df_turkish[['input', 'translated_text']].head())


                                                    input  \
114922  Gerçek bilgi işlem donanımı genellikle standar...   
313044          On numara  beş yıldızlı bez.tavsiyeederim   
158900  Ha yatımda bu kadar bır kotu marka gormedım .h...   
336232   Ürün sorunsuz ve hızlı bir şekilde elime ulaştı.   
336239  Ürün çok iyi yaklaşık 6  aydır kullanıyorum. o...   

                                          translated_text  
114922  Real computing hardware generally relies on st...  
313044         I recommend the five-star quality product.  
158900  I have never seen such a bad brand in my life....  
336232  The product arrived in perfect condition and q...  
336239  The product is very good, I have been using it...  
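
The translation step makes one API call per row, so a single failed request would abort the whole `apply`. A minimal sketch of a wrapper that retries failed calls and reuses the result for repeated texts is shown below. `translate_all`, its parameters, and `fake_translate` are hypothetical names introduced for illustration; in the notebook the `translate_fn` argument would be the `translate_text` function.

```python
import time

def translate_all(texts, translate_fn, max_retries=3, wait_seconds=1.0):
    """Translate each distinct text once, retrying calls that raise."""
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            for attempt in range(1, max_retries + 1):
                try:
                    cache[text] = translate_fn(text)
                    break
                except Exception:
                    if attempt == max_retries:
                        cache[text] = None  # give up on this text
                    else:
                        time.sleep(wait_seconds)
        results.append(cache[text])
    return results

# Stand-in translator for demonstration: fails on its first call, then succeeds
calls = {"count": 0}

def fake_translate(text):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("temporary failure")
    return text.upper()

translated = translate_all(["iyi", "kötü", "iyi"], fake_translate, wait_seconds=0.0)
print(translated, calls["count"])
```

Texts that still fail after all retries come back as `None`; the `isinstance(x, str)` check in Step 3.2 would turn them into empty strings, which are then filtered out.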


### Step 3.2: Clean and Tokenize the Translated Text
Clean the translated text and tokenize it using BERT's tokenizer.

In [25]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import torch
from transformers import BertTokenizer
from torch.utils.data import DataLoader

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Define the text cleaning function
def clean_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if not word in stop_words]
    return ' '.join(tokens)

# Clean the translated text
df_turkish['translated_text'] = df_turkish['translated_text'].apply(lambda x: clean_text(x) if isinstance(x, str) else '')

# Ensure there are no empty entries in the translated_text column
df_turkish = df_turkish[df_turkish['translated_text'] != '']

# Tokenize using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
turkish_encodings = tokenizer(df_turkish['translated_text'].tolist(), truncation=True, padding=True, max_length=128)

# Convert to torch dataset
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

# Assuming the labels are 'Olumlu' (positive), 'Olumsuz' (negative), and 'Notr' (neutral)
label_mapping = {'Olumlu': 1, 'Olumsuz': 0, 'Notr': 2}
df_turkish['output'] = df_turkish['output'].map(label_mapping)

# Check if any labels are not mapped
unmapped_labels = df_turkish['output'].isnull().sum()
if unmapped_labels > 0:
    print(f"Found {unmapped_labels} unmapped labels. Please check the label mapping.")
    df_turkish = df_turkish.dropna(subset=['output'])
    # Re-tokenize after dropping rows so that encodings and labels stay aligned
    turkish_encodings = tokenizer(df_turkish['translated_text'].tolist(), truncation=True, padding=True, max_length=128)

# Convert the dataset (labels cast to int, since mapping with missing values yields floats)
turkish_dataset = SentimentDataset(turkish_encodings, df_turkish['output'].astype(int).tolist())
turkish_loader = DataLoader(turkish_dataset, batch_size=16)


[nltk_data] Downloading package punkt to /Users/fatihayaz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fatihayaz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 3.3: Evaluate the Model on Translated Data
Use the trained model to predict the sentiment of the translated Turkish dataset and calculate the required performance metrics.

In [26]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model.eval()
turkish_predictions, turkish_labels = [], []

# Predict
for batch in turkish_loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    # Keep labels out of the forward pass: the model has two output classes,
    # so a loss against the Neutral label (2) would be invalid
    labels = batch.pop('labels')
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    preds = torch.argmax(logits, dim=-1)
    turkish_predictions.extend(preds.cpu().numpy())
    turkish_labels.extend(labels.cpu().numpy())

# Calculate metrics
turkish_accuracy = accuracy_score(turkish_labels, turkish_predictions)
turkish_precision, turkish_recall, turkish_f1, _ = precision_recall_fscore_support(turkish_labels, turkish_predictions, average='macro')

print(f"Turkish Data - Accuracy: {turkish_accuracy}")
print(f"Turkish Data - Precision: {turkish_precision}")
print(f"Turkish Data - Recall: {turkish_recall}")
print(f"Turkish Data - F1-score: {turkish_f1}")


Turkish Data - Accuracy: 0.5
Turkish Data - Precision: 0.2896825396825397
Turkish Data - Recall: 0.4914529914529915
Turkish Data - F1-score: 0.3591397849462366


  _warn_prf(average, modifier, msg_start, len(result))


The evaluation results on the translated Turkish dataset show the following performance metrics:

- Accuracy: 0.50
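- Precision: 0.29
- Recall: 0.49
- F1-score: 0.36

The warning indicates that some labels had no predicted samples, so precision and F-score are ill-defined for those classes and are counted as 0 in the macro average. This is expected here: the model was loaded with the default two output classes (Negative and Positive), so it can never predict the Neutral label that appears in the Turkish data.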

In [17]:
# Map the integer labels (1, 0, 2) back to their string names
label_mapping = {1: 'Positive', 0: 'Negative', 2: 'Neutral'}

# Map the predicted sentiments back to their string labels
results_df = pd.DataFrame({
    'Original Text': df_turkish['input'],
    'Translated Text': df_turkish['translated_text'],
    'True Sentiment': df_turkish['output'].map(label_mapping),
    'Predicted Sentiment': [label_mapping.get(pred, 'Unknown') for pred in turkish_predictions]
})

# Display the table
results_df


| index | Original Text | Translated Text | True Sentiment | Predicted Sentiment |
|---|---|---|---|---|
| 72696 | Köyün adının nereden geldiği ve hakkında bilgi... | information name village comes | Neutral | Negative |
| 279137 | Ürün diğer markalar gibi değil.mükemmel bi güc... | product like brands excellent power water resi... | Positive | Positive |
| 83431 | Uefi temiz kurulum gerekiyor. aksi halde perfo... | clean uefi installation required otherwise get... | Positive | Negative |
| 88415 | Seçim komitesi on katılımcı belirlemiştir . | election committee selected ten participants | Neutral | Negative |
| 274273 | Kokusu gerçekten çok değişik. bir çok çiçek ko... | smell really unique perceive scents many flowe... | Positive | Positive |
| 12707 | 15 w 10 w'a göre çok iyi gayet güzel ve hızlı ... | delivery good nice fast compared days | Positive | Positive |
| 251374 | Ürün fiyatına göre gayet güzel | product quite nice price | Positive | Positive |
| 97514 | Daha önce kendime almıştım şimdi de eşime aldı... | bought one bought one spouse difference truly ... | Positive | Negative |
| 310459 | Kaliteli ve uygun | quality affordable | Positive | Positive |
| 63383 | Michael scofield annesinin evlilik öncesi soya... | michael scofield uses mother maiden name marriage | Neutral | Negative |
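
The ill-defined precision warning from the Turkish evaluation can be reproduced by hand on the ten rows displayed above. The sketch below computes per-class precision and the macro average in plain Python, counting a class that is never predicted as 0. `per_class_precision` is a hypothetical helper, and the rows are copied by hand from the table.

```python
# (true, predicted) labels for the ten rows displayed above
rows = [
    ("Neutral", "Negative"),
    ("Positive", "Positive"),
    ("Positive", "Negative"),
    ("Neutral", "Negative"),
    ("Positive", "Positive"),
    ("Positive", "Positive"),
    ("Positive", "Positive"),
    ("Positive", "Negative"),
    ("Positive", "Positive"),
    ("Neutral", "Negative"),
]

def per_class_precision(rows, classes):
    """Precision per class; None when the class is never predicted."""
    result = {}
    for cls in classes:
        true_labels_when_predicted = [true for true, pred in rows if pred == cls]
        if true_labels_when_predicted:
            hits = true_labels_when_predicted.count(cls)
            result[cls] = hits / len(true_labels_when_predicted)
        else:
            result[cls] = None  # undefined: no predictions for this class
    return result

classes = ["Negative", "Positive", "Neutral"]
precision = per_class_precision(rows, classes)
# Macro average, counting an undefined precision as 0
macro_precision = sum(p or 0.0 for p in precision.values()) / len(classes)
accuracy = sum(true == pred for true, pred in rows) / len(rows)

print(precision)
print(f"macro precision: {macro_precision:.3f}, accuracy: {accuracy:.2f}")
```

On these rows the Neutral class is never predicted and every Neutral review is labelled Negative, which pulls the macro precision well below the accuracy. The same pattern is consistent with the gap between accuracy and macro precision reported for the full Turkish sample.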


#  Project Summary

## Aim of the Project
The aim of this project was to apply sentiment analysis techniques to determine the sentiment (positive, negative, neutral) of various text datasets. Specifically, the objectives were to:
1. Train a sentiment analysis model using an English dataset (IMDB Reviews).
2. Evaluate the model's performance on a separate English dataset (Sentiment140).
3. Optionally, translate Turkish product reviews to English and test the model on the translated data to evaluate its performance on multilingual data.

## What We Did and Why

### 1. Data Preprocessing:
- **Loaded and Cleaned IMDB Dataset:** We loaded the IMDB dataset (25000 reviews; an earlier run used a 5000-row sample) and cleaned the text by converting it to lowercase and removing punctuation and stopwords. This step put the data in a standard format for model training.
- **Splitting Data:** We split the cleaned IMDB dataset into training (80%) and validation (20%) sets to train and evaluate the model.

### 2. Model Training:
- **Selected Pre-trained Model:** We chose a pre-trained BERT model for sequence classification due to its state-of-the-art performance in various NLP tasks.
- **Tokenized Data:** We tokenized the text data using BERT's tokenizer to prepare it for input into the model.
- **Fine-tuned the Model:** We fine-tuned the BERT model on the IMDB training dataset and evaluated it on the validation dataset.

### 3. Evaluation on English Data:
- **Prepared Sentiment140 Dataset:** We loaded and cleaned a sample of 100000 entries from the Sentiment140 dataset (1000 entries in an earlier run) and tokenized the text.
- **Evaluated the Model:** We evaluated the fine-tuned BERT model on the Sentiment140 dataset to measure its performance.

### 4. Multilingual Sentiment Analysis:

- **Translated Turkish Reviews:** We used OpenAI's GPT-3.5-turbo to translate Turkish product reviews to English.
- **Cleaned and Tokenized Translated Text:** We cleaned and tokenized the translated text to prepare it for evaluation.
- **Evaluated the Model:** We evaluated the model on the translated Turkish dataset to assess its performance on multilingual data.

## Outputs and Findings

### 1. Model Training on IMDB Dataset:
- The model was fine-tuned over 3 epochs.
- Validation Loss: 0.5622 when training on all 25000 rows (0.6512 with a 5000-row sample)

### 2. Evaluation on Sentiment140 Dataset:
- Accuracy: 0.629 on 100000 samples (0.665 on 1000 samples)
- Precision: 0.622 (0.624)
- Recall: 0.659 (0.845)
- F1-score: 0.640 (0.718)

### 3. Evaluation on Translated Turkish Dataset:
- Accuracy: 0.50
- Precision: 0.29
- Recall: 0.49
- F1-score: 0.36

The results indicate that the model transferred only moderately well to the English tweets (about 63% accuracy on 100000 samples) and performed worse on the translated Turkish dataset (50% accuracy). Part of that gap comes from the Neutral class, which the two-class model cannot predict. Further improvements, such as increasing the dataset size, fine-tuning with a Neutral class, using multilingual models, or addressing class imbalance, could enhance performance on multilingual data.

## Conclusion
- The project successfully demonstrated the process of training and evaluating a sentiment analysis model using BERT.
- The model showed moderate performance on the English Sentiment140 data but faced challenges with the translated Turkish data, highlighting the importance of multilingual support in sentiment analysis.
- The findings provide a foundation for further exploration and improvements in handling multilingual sentiment analysis tasks.