<a href="https://colab.research.google.com/github/andrybrew/data-science-machine-learning-BI/blob/main/005a_sentiment_analysis_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis with IndoBERT**

This notebook demonstrates sentiment analysis with **IndoBERT**, a pre-trained transformer model for the Indonesian language. The dataset consists of Twitter posts about **Bank Indonesia**'s interest rate policy.

## **About the Interest Rate Dataset**
The interest rate is a critical tool used by central banks like **Bank Indonesia** to influence the country's economic activities. It affects how much it costs to borrow money and the return on savings, thus directly impacting people's spending and saving behaviors.

The decision by the **Board of Governors of Bank Indonesia** to raise the **BI 7-Day Reverse Repo Rate (BI7DRR)** by **25 basis points to 6%** is a strategic move to address economic challenges such as **inflation** or **currency depreciation**. Such changes can significantly affect loans, mortgages, savings, and investments. Analyzing the public's reaction to the rate hike offers insight into general sentiment and into how people expect the decision to affect their everyday lives and economic outlook.

### **Data Source:** Twitter  
- **Keyword:** `suku bunga since:2023-10-19` ("suku bunga" is Indonesian for "interest rate")  
- **Collection Date:** October 22, 2023  
- **Total Tweets:** 503  
- **Language:** Bahasa Indonesia  

## **Importing Required Libraries**
We will install and import the necessary libraries for text processing, sentiment analysis, and visualization.

In [None]:
# Install Hugging Face Transformers (the separate "huggingface" package is not needed)
!pip install transformers --quiet

In [None]:
import pandas as pd
import seaborn as sns
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

## **Importing Dataset**
We will load the dataset using **pandas.read_csv()** from a publicly available URL.

In [None]:
# Fetching the dataset from GitHub
data_url = "https://raw.githubusercontent.com/andrybrew/IHT-SEM1302-30Okt/main/data/001_suku-bunga.csv"

# Using pandas read_csv function to load the data from the URL directly into a DataFrame
df_tweet = pd.read_csv(data_url)
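
Before running the model, it can help to confirm that every row has usable text, since the tokenizer expects strings. The sketch below is only an illustration: it uses a tiny made-up DataFrame in place of `df_tweet`, and `drop_empty_text` is a hypothetical helper that is not part of the original notebook. The column name `text` is the one used later in this notebook.

```python
import pandas as pd

def drop_empty_text(df, column="text"):
    """Drop rows whose text is missing or blank.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if column not in df.columns:
        raise KeyError(f"Expected a '{column}' column, found: {list(df.columns)}")
    cleaned = df.dropna(subset=[column]).copy()
    # Keep only rows that still contain non-whitespace characters
    cleaned = cleaned[cleaned[column].astype(str).str.strip() != ""]
    return cleaned.reset_index(drop=True)

# Made-up sample standing in for df_tweet
sample = pd.DataFrame({"text": ["suku bunga naik", None, "   ", "BI tahan suku bunga"]})
cleaned_sample = drop_empty_text(sample)
print(cleaned_sample)
```

In the notebook itself, the equivalent call would presumably be `df_tweet = drop_empty_text(df_tweet)`.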

## **Loading the IndoBERT Model**
We will use the **mdhugol/indonesia-bert-sentiment-classification** model from Hugging Face to classify sentiment. The necessary tokenizer and model will be set up for processing.

In [None]:
# Name of the pretrained model on the Hugging Face Hub
pretrained = "mdhugol/indonesia-bert-sentiment-classification"

In [None]:
# Load the model and tokenizer (downloaded from the Hugging Face Hub on first use)
model = AutoModelForSequenceClassification.from_pretrained(pretrained)
tokenizer = AutoTokenizer.from_pretrained(pretrained)

# Create the sentiment classifier using the Hugging Face pipeline
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

## **Performing Sentiment Analysis**
We will apply **IndoBERT** to classify each tweet into **Positive, Neutral, or Negative** sentiment categories. The sentiment classification will include:
- Tokenizing and truncating text to match model input requirements.
- Predicting sentiment labels based on model outputs.
- Mapping predictions to predefined sentiment categories.
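
As a rough illustration of the last two steps, the sketch below converts raw model outputs (logits) into a label and a confidence score using softmax and argmax. The logits are made-up numbers, `softmax` and `logits_to_sentiment` are hypothetical helpers, and the label order is assumed to follow the `LABEL_0`/`LABEL_1`/`LABEL_2` mapping used in this notebook. The Hugging Face pipeline performs comparable steps internally, so this code is for understanding only.

```python
import math

# Assumed order, matching the label_index mapping used in this notebook
LABELS = ["positive", "neutral", "negative"]  # index 0, 1, 2

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_sentiment(logits):
    """Return (label, score) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up logits for one tweet: the third class (negative) scores highest
example_label, example_score = logits_to_sentiment([-1.2, 0.3, 2.5])
print(example_label, round(example_score, 3))
```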

In [None]:
def get_sentiment(text):
    # Let the pipeline's tokenizer truncate the input to BERT's 512-token limit
    # (the limit includes the [CLS] and [SEP] special tokens)
    output = sentiment_analysis(text, truncation=True, max_length=512)[0]

    # Map the raw model label (e.g. 'LABEL_0') to a readable category.
    # label_index is defined in the next cell and must exist before this function is called.
    label = label_index[output['label']]
    score = output['score']
    return label, score

In [None]:
# Map the model's raw output labels to sentiment categories
label_index = {'LABEL_0': 'positive', 'LABEL_1': 'neutral', 'LABEL_2': 'negative'}

In [None]:
# Test with one sentence ("Kebijakannya kacau balau" means "The policy is a mess")
text = "Kebijakannya kacau balau"
sentiment, score = get_sentiment(text)
print(f'The sentiment is: {sentiment} with a score of: {score}')

## **Applying Sentiment Analysis to Dataset**
We will apply the sentiment classification function to each row in the dataset and create new columns for:
- **Sentiment Label** (Positive, Neutral, Negative)
- **Confidence Score** of the sentiment classification

In [None]:
# Get sentiment label for each row in dataframe
df_tweet[['sentiment', 'score']] = df_tweet['text'].apply(lambda x: pd.Series(get_sentiment(x)))
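
Calling the classifier once per row can be slow on larger datasets. One possible alternative, sketched below, is to send the texts in batches. The helper `classify_in_batches` is hypothetical, and the demo uses a stand-in `fake_classifier` so the snippet runs without downloading the model. It only assumes a classifier that takes a list of strings and returns one `{'label': ..., 'score': ...}` dict per string, which is the shape a Hugging Face text-classification pipeline returns for list input.

```python
def classify_in_batches(texts, classifier, label_map, batch_size=32):
    """Classify texts batch by batch; return (label, score) tuples in input order."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for output in classifier(batch):
            results.append((label_map[output["label"]], output["score"]))
    return results

# Stand-in for the real pipeline so this sketch runs offline (NOT a real model):
# it returns LABEL_2 for texts containing "kacau" ("chaotic") and LABEL_1 otherwise.
def fake_classifier(batch):
    return [
        {"label": "LABEL_2" if "kacau" in text else "LABEL_1", "score": 0.9}
        for text in batch
    ]

label_map = {"LABEL_0": "positive", "LABEL_1": "neutral", "LABEL_2": "negative"}
demo_texts = ["Kebijakannya kacau balau", "BI umumkan suku bunga", "Pasar kacau"]
demo_results = classify_in_batches(demo_texts, fake_classifier, label_map, batch_size=2)
print(demo_results)
```

In this notebook, the real classifier could presumably be passed in as `lambda batch: sentiment_analysis(batch, truncation=True)`, with `df_tweet['text'].tolist()` as the texts and `label_index` as the label map.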

In [None]:
# Show each tweet with its predicted sentiment and confidence score
df_tweet[['text', 'sentiment', 'score']]

In [None]:
# Visualize the sentiment distribution
sns.countplot(x='sentiment', data=df_tweet)
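
Alongside the plot, the same distribution can be summarized numerically. The sketch below uses a small made-up DataFrame (`demo`) rather than `df_tweet`, and shows how `value_counts` could report counts and percentage shares per sentiment. The column names `count` and `share_pct` are chosen here for illustration.

```python
import pandas as pd

# Made-up labels standing in for df_tweet['sentiment']
demo = pd.DataFrame(
    {"sentiment": ["negative", "neutral", "negative", "positive", "neutral", "negative"]}
)

counts = demo["sentiment"].value_counts()                      # absolute counts
shares = demo["sentiment"].value_counts(normalize=True) * 100  # percentage shares

# Both Series are indexed by sentiment label, so they align row by row
summary = pd.DataFrame({"count": counts, "share_pct": shares.round(1)})
print(summary)
```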