## Using Notebook Environments 
1. To run a cell, press `shift + enter`. The notebook will execute the code in the cell and move to the next cell. If the cell contains a markdown cell (text only), it will render the markdown and move to the next cell.
2. Since cells can be executed in any order and variables can be over-written, you may at some point feel that you have lost track of the state of your notebook. If this is the case, you can always restart the kernel by clicking Runtime in the menu bar (if you're using Colab) and selecting `Restart runtime`. This will clear all variables and outputs.
3. The final variable in a cell will be printed on the screen. If you want to print multiple variables, use the `print()` function as usual.

Notebook environments support code cells and markdown cells. For the purposes of this workshop, markdown cells are used to provide high-level explanations of the code. More specific details are provided in the code cells themselves in the form of comments (lines beginning with `#`)

## Environment Setup

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install transformers sentence-transformers accelerate

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to health
    %cd /content/drive/MyDrive/LLM4SocBeSci/day_3

In [36]:
import pandas as pd
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeClassifierCV
import seaborn as sns
from tqdm.notebook import tqdm_notebook as tqdm

In [34]:
media_bias_test = pd.read_csv('media_bias_test.csv')
media_bias_test 

Unnamed: 0,title,source,bias_text
0,California slashes water use for upstate farmers,USA TODAY,center
1,Twitter slapped its first 'manipulated media' ...,Business Insider,center
2,OPINION: There's a sobering truth to Trump's r...,CNN - Editorial,left
3,Supreme Court Justice Ruth Bader Ginsburg hosp...,USA TODAY,center
4,Noncompliance Kneecaps New Zealand's Gun Contr...,Reason,right
...,...,...,...
95,"GOP 2016 hopefuls take aim at Hillary, each ot...",CNN (Web News),left
96,"Feds reportedly eye interview with Clinton, re...",Fox Online News,right
97,Taliban Terrorists Have No Place at Camp David,Guest Writer - Right,right
98,Scorecard For A Departing President: Assessing...,NPR Online News,center


## Zero-shot Classification 

In [44]:
torch.random.manual_seed(42)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 10,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

for headline in tqdm(media_bias_test['title']):
    message = {"role": "user", "content": "Is this headline from a lef-wing or right-wing source? Answer with 'left' or 'right' only:\n" + headline}
    output = pipe(message, **generation_args)[0]['generated_text'].lower()
    if 'left' in output:
        label = 'left'
    elif 'right' in output:
        label = 'right'
    else:
       label =  'nan'
    break

output, label

  0%|          | 0/100 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


("\n\ncalifornia's water use is down by more", 'nan')

## Few-shot Classification

In [ ]:
n_shots = 10
few_shot_prompt = ((media_bias_test['headline'] + ': ' + media_bias_test['label']).iloc[:n_shots]).tolist().join('\n')
few_shot_prompt = "Based on the following headlines and labels: " + "\n" + few_shot_prompt + "\n" + "Classify the following headlines as real or fake:\n" 
few_shot_prompt

In [ ]:
# Editing headlines to include few-shot prompt
media_bias_test['few_shot_headline'] = few_shot_prompt + media_bias_test['headline']

# Classify all news articles
media_bias_test['few_shot_label'] = classifier(media_bias_test['few_shot_headline'].tolist(), candidate_labels, few_shot_prompt)
media_bias_test

In [ ]:
# Comparing zero-shot and few-shot classification
zero_shot_accuracy = (media_bias_test['zero_shot_label'].labels == media_bias_test['label']).mean()
few_shot_accuracy = (media_bias_test['few_shot_label'].labels == media_bias_test['label']).mean()
print('Zero-shot accuracy:', zero_shot_accuracy)
print('Few-shot accuracy:', few_shot_accuracy)

## Feature Extraction

In [ ]:
fake_news_train = pd.reead_csv('fake_news_train.csv')

# Initialize feature extraction pipeline
model = SentenceTransformer('all-MiniLM-L6-v2')  

# Extract features
features = model.encode(fake_news_train['headline'].tolist())

# Initialize classifier
ridge = RidgeClassifierCV()

# Train classifier
ridge.fit(features, fake_news_train['label'])
f"Train accuracy: {ridge.score(features, fake_news_train['label'])}"

In [ ]:
# Extract features for test set
test_features = model.encode(media_bias_test['headline'].tolist())

# Test classifier
f"Test accuracy: {ridge.score(test_features, media_bias_test['label'])}"