## Using Notebook Environments
1. To run a cell, press `Shift + Enter`. The notebook executes the cell and moves to the next one. If the cell is a markdown cell (text only), it renders the markdown and moves to the next cell.
2. Because cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If so, restart the kernel: in Colab, click Runtime in the menu bar and select `Restart runtime` (labelled `Restart session` in newer versions of Colab). This clears all variables. Outputs already on screen stay visible until you rerun the cells or clear the outputs.
3. The value of the last expression in a cell is displayed below the cell. To display several values, use the `print()` function as usual.
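
As a small illustration of point 3 (a sketch with made-up variable names, not part of the workshop material): only the value of the last expression in a cell is echoed automatically, so earlier values need an explicit `print()`, and a cell that ends with an assignment displays nothing.

```python
# Hypothetical values, used only to illustrate how cell output behaves
a = 2
b = 3

print(a)  # displayed because of the explicit print()
print(b)  # displayed because of the explicit print()

total = a + b  # an assignment on its own displays nothing
total          # last expression in the cell: a notebook echoes its value
```

Run as a plain script rather than in a notebook, the final line produces no output, which is why scripts rely on `print()` throughout.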

Notebook environments support code cells and markdown cells. In this workshop, markdown cells give high-level explanations of the code, while more specific details appear in the code cells themselves as comments (lines beginning with `#`).

## Environment Setup

In [1]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install transformers accelerate sentence-transformers

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to the day 3 workshop folder
    %cd /content/drive/MyDrive/LLM4SocBeSci/day_3

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/LLM4SocBeSci/day_3


In [2]:
import pandas as pd
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeClassifierCV
import seaborn as sns
from tqdm.notebook import tqdm_notebook as tqdm

In [45]:
media_bias_test = pd.read_csv('media_bias_test.csv')
media_bias_test

| Unnamed: 0 | title | source | bias |
|---|---|---|---|
| 0 | Clinton aims to reframe 2016 debate | CNN | left |
| 1 | Iowa caucuses: Donald Trump's moment of truth | CNN | left |
| 2 | Supreme Court to hear online free speech case | CNN | left |
| 3 | The speech every woman should hear | CNN | left |
| 4 | The signs of a Democratic landslide are everyw... | CNN | left |
| ... | ... | ... | ... |
| 155 | Obama taps Hagel for Pentagon, Brennan for CIA | Washington Times | right |
| 156 | Walker, GOP win big in Wis. recall races | Washington Times | right |
| 157 | OPINION: Restoring the Senate | Washington Times | right |
| 158 | Cleaning up the Big Abortion machine | Washington Times | right |


## Zero-shot Classification

In [4]:
torch.random.manual_seed(42)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 10,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

zero_shot_label = []
for headline in tqdm(media_bias_test['title'].iloc[:10]):
    message = [{"role": "user", "content": "Is this headline from a left-wing or right-wing source? Strictly answer with 'left' or 'right' only:\n" + headline}]
    output = pipe(message, **generation_args)[0]['generated_text'].lower()
    if 'left' in output:
        label = 'left'
    elif 'right' in output:
        label = 'right'
    else:
        label = 'nan'
    zero_shot_label.append(label)
    print(output, label)

# Only the first 10 headlines were classified, so align the labels to those rows (the rest stay NaN)
media_bias_test['zero_shot_label'] = pd.Series(zero_shot_label, index=media_bias_test.index[:len(zero_shot_label)])
media_bias_test


 right right
 left left
 right right
 left left
 right right
 right right
 right right
 left left
 right right
 left left
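
The loop above tests for the substring `'left'` before `'right'`, so an answer that mentions both words is labelled `left`, and a word such as "copyright" would count as `right`. A stricter parser could look like the sketch below. `parse_label` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
import re

def parse_label(output: str) -> str:
    """Map raw generated text to 'left', 'right', or 'nan' (unparseable)."""
    # Match whole words only, so "copyright" does not count as "right"
    found = set(re.findall(r"\b(left|right)\b", output.lower()))
    # Accept the answer only if exactly one of the two labels is mentioned
    return found.pop() if len(found) == 1 else "nan"

print(parse_label(" Right"))
print(parse_label("Left-wing"))
print(parse_label("could be left or right"))
print(parse_label("copyright notice"))
```

Treating ambiguous answers as `'nan'` keeps them visible as a separate category instead of silently counting them as one of the two classes.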


In [ ]:
# Comparing zero-shot and actual labels (only rows that were classified)
scored = media_bias_test.dropna(subset=['zero_shot_label'])
print(f'Zero-shot accuracy: {(scored["zero_shot_label"] == scored["bias"]).mean()}')

In [ ]:
# Confusion matrix (rows: actual bias, columns: zero-shot label)
confusion = pd.crosstab(media_bias_test['bias'], media_bias_test['zero_shot_label'])
sns.heatmap(confusion, annot=True, fmt='d')
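
To make explicit what the accuracy and the crosstab above compute, here is a plain-Python sketch. The labels are made up for illustration and are not results from the notebook.

```python
from collections import Counter

# Made-up labels for illustration only; not output of the model above
actual = ["left", "left", "right", "right", "right"]
predicted = ["left", "right", "right", "right", "left"]

# Accuracy: share of positions where the predicted and actual labels agree
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Confusion counts keyed by (actual, predicted): the cells of the crosstab
confusion_counts = Counter(zip(actual, predicted))

print(f"accuracy = {accuracy}")
for (a, p), n in sorted(confusion_counts.items()):
    print(f"actual={a:5s} predicted={p:5s} count={n}")
```

The diagonal cells, where actual and predicted agree, add up to the numerator of the accuracy. The off-diagonal cells show which class the model confuses with which.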

## Few-shot Classification

In [None]:
n_shots = 10
# Build one "headline: label" line per example from the first n_shots rows
examples = (media_bias_test['title'] + ': ' + media_bias_test['bias']).iloc[:n_shots].tolist()
few_shot_prompt = "Based on the following headlines and labels:\n" + '\n'.join(examples) + "\nClassify the following headline as left or right:\n"
few_shot_prompt
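
The cell above takes the first `n_shots` rows as examples. In the table shown earlier the first rows are all labelled `left`, so the prompt may end up with examples of only one class. One way to avoid this is to cap the number of examples per label, as in the sketch below. `build_few_shot_prompt` and `per_label` are hypothetical names introduced here. The headlines are copied from the table above.

```python
def build_few_shot_prompt(examples, per_label=2):
    """Build a prompt from (headline, label) pairs, keeping at most per_label examples of each label."""
    kept, counts = [], {}
    for headline, label in examples:
        if counts.get(label, 0) < per_label:
            counts[label] = counts.get(label, 0) + 1
            kept.append(f"{headline}: {label}")
    return (
        "Based on the following headlines and labels:\n"
        + "\n".join(kept)
        + "\nClassify the following headline as left or right:\n"
    )

# Headlines and labels copied from the table displayed earlier
examples = [
    ("Clinton aims to reframe 2016 debate", "left"),
    ("Supreme Court to hear online free speech case", "left"),
    ("The speech every woman should hear", "left"),
    ("OPINION: Restoring the Senate", "right"),
    ("Cleaning up the Big Abortion machine", "right"),
]
prompt = build_few_shot_prompt(examples, per_label=2)
print(prompt)
```

Headlines used as examples should ideally be excluded from the rows that are scored afterwards, since the model has already seen their labels.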

In [None]:
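# Editing headlines to include few-shot prompt
media_bias_test['few_shot_headline'] = few_shot_prompt + media_bias_test['title']

# Classify all news articles, reusing `pipe` and `generation_args` from the zero-shot cell
few_shot_label = []
for prompt in tqdm(media_bias_test['few_shot_headline']):
    output = pipe([{"role": "user", "content": prompt}], **generation_args)[0]['generated_text'].lower()
    few_shot_label.append('left' if 'left' in output else 'right' if 'right' in output else 'nan')
media_bias_test['few_shot_label'] = few_shot_label
media_bias_test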

In [None]:
# Comparing zero-shot and few-shot classification on rows that have both labels
# (rows used as few-shot examples are included, which favours the few-shot score)
scored = media_bias_test.dropna(subset=['zero_shot_label', 'few_shot_label'])
zero_shot_accuracy = (scored['zero_shot_label'] == scored['bias']).mean()
few_shot_accuracy = (scored['few_shot_label'] == scored['bias']).mean()
print('Zero-shot accuracy:', zero_shot_accuracy)
print('Few-shot accuracy:', few_shot_accuracy)

## Feature Extraction

In [None]:
fake_news_train = pd.read_csv('fake_news_train.csv')

# Initialize feature extraction pipeline
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(fake_news_train['headline'].tolist())

# Initialize classifier
ridge = RidgeClassifierCV()

# Train classifier
ridge.fit(features, fake_news_train['label'])
f"Train accuracy: {ridge.score(features, fake_news_train['label'])}"

In [None]:
# Extract features for test set
test_features = model.encode(media_bias_test['title'].tolist())

# Test classifier (assumes the training labels use the same classes as `bias`)
f"Test accuracy: {ridge.score(test_features, media_bias_test['bias'])}"