## Using Notebook Environments
1. To run a cell, press `Shift + Enter`. The notebook executes the code in the cell and moves to the next cell. If the cell is a markdown cell (text only), the notebook renders the markdown and moves to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If this happens, you can always restart the kernel by clicking Runtime in the menu bar (if you're using Colab) and selecting `Restart runtime`. This clears all variables and outputs.
3. The value of the final expression in a cell is displayed below the cell. If you want to display multiple values, use the `print()` function as usual.

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`).

## Environment Setup

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install sentence_transformers &> /dev/null

    # Change working directory to the day 3 workshop folder
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_3

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeClassifierCV
import seaborn as sns

## Feature Extraction
We will now use feature extraction to classify the tweets. We will use the `SentenceTransformer` class from the `sentence_transformers` library to extract features from the text, and then use a `RidgeClassifierCV` to classify the tweets. We will train the classifier on some additional training data (`media_bias_train`) and evaluate it on the test data (`media_bias_test`). We begin by loading the data:

In [None]:
# Reload test data from the previous notebook (day_3a.ipynb)
media_bias_test = pd.read_csv('media_bias_test.csv')

# Load training data
media_bias_train = pd.read_csv('media_bias_train.csv')
media_bias_train

We next initialize the `SentenceTransformer` model and extract features from the training data:

In [None]:
# Initialize feature extraction pipeline
model = SentenceTransformer('all-mpnet-base-v2')

# Extract features
train_features = model.encode(media_bias_train['text'])
train_features

We then standardize the features and train the `RidgeClassifierCV`. `RidgeClassifierCV` will automatically perform cross-validation to find the best alpha value from the list of `alphas` provided.

In [None]:
# Standardize features
scaler = StandardScaler()
scaler.fit(train_features)
# Overwrite train_features so the classifier below is fit on standardized features
train_features = scaler.transform(train_features)

# Initialize classifier
ridge = RidgeClassifierCV(alphas=[1e-3, 1e-2, 1e-1, 1, 10, 100])

# Train classifier
ridge.fit(train_features, media_bias_train['bias'])
f"Train accuracy: {ridge.score(train_features, media_bias_train['bias'])}"
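To see which regularization strength the cross-validation picked, you can inspect the fitted classifier's `alpha_` attribute. The following is a minimal sketch on synthetic data; the dataset from `make_classification` and the names `X`, `y`, and `clf` are stand-ins for illustration, not the workshop data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an embedding matrix (assumption: 200 samples, 20 features)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Standardize, then fit the classifier on the standardized features
X_scaled = StandardScaler().fit_transform(X)

alphas = [1e-3, 1e-2, 1e-1, 1, 10, 100]
clf = RidgeClassifierCV(alphas=alphas).fit(X_scaled, y)

# alpha_ holds the regularization strength selected by cross-validation
print(f"Selected alpha: {clf.alpha_}")
print(f"Train accuracy: {clf.score(X_scaled, y):.3f}")
```

If the selected value sits at either end of the list, it may be worth widening the range of `alphas` and refitting.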

We next extract features for the test set and evaluate the classifier:

In [None]:
# Extract features for test set
test_features = model.encode(media_bias_test['text'])

# Standardize features using the scaler fitted on the training data
test_features = scaler.transform(test_features)

# Test classifier
f"Test accuracy: {ridge.score(test_features, media_bias_test['bias'])}"
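Note that the test features are transformed with the scaler fitted on the training data rather than refitted. As a rough illustration of what that means, the sketch below reproduces the same arithmetic with plain NumPy on randomly generated arrays (the arrays and variable names are assumptions made for this example only).

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy "training features"
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # toy "test features"

# "Fit": compute per-feature statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "Transform": apply the training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The training set ends up with mean 0 and standard deviation 1 per feature; the test set will be close to that but not exact, which is expected because no test-set statistics are used.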

As you can see, the feature extraction method performs better than both zero-shot and few-shot classification. We can also visualize the confusion matrix:

In [None]:
# Confusion matrix
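# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(
    media_bias_test['bias'], ridge.predict(test_features),
    rownames=['True bias'], colnames=['Predicted bias']
)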
sns.heatmap(confusion, annot=True)
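Raw counts can be hard to compare when classes differ in size. One option is to row-normalize the confusion matrix so that each diagonal entry gives the share of that class predicted correctly. The sketch below uses made-up labels (`left`, `center`, `right` are hypothetical and may not match the label values in the workshop data).

```python
import pandas as pd

# Toy true and predicted labels (assumption: invented for illustration)
true = pd.Series(['left', 'left', 'center', 'right', 'right', 'center'], name='True bias')
pred = pd.Series(['left', 'center', 'center', 'right', 'right', 'left'], name='Predicted bias')

# Counts: rows are true labels, columns are predicted labels
confusion = pd.crosstab(true, pred)

# Row-normalized: each row sums to 1, so the diagonal is the per-class recall
recall = pd.crosstab(true, pred, normalize='index')

print(confusion)
print(recall)
```

The normalized table can be passed to `sns.heatmap(recall, annot=True)` in the same way as the count table.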