# Article Personalization Trained Using Preferences

This notebook tests the project's main hypothesis: that pairwise human preferences can be used to train an LLM to better predict which article a reader is most likely to want to read next.
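One common way to turn pairwise preferences into a training signal is a Bradley-Terry style loss: a reward model scores both articles and is penalized when the rejected article scores higher than the chosen one. The sketch below is an assumption about the objective, not a description of what this project's training run computes; `pairwise_loss` and its arguments are hypothetical names.

```python
import math

def pairwise_loss(chosen_score: float, rejected_score: float) -> float:
    """Return -log(sigmoid(chosen_score - rejected_score)) in a stable form."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp()
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss is small when the chosen article's score exceeds the rejected one's by a wide margin, and grows roughly linearly when the ordering is wrong.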

# Implementation

## Creating Corpus

In [1]:
links = []
with open('./data/small-dataset.txt') as f:
    links = f.read().splitlines() 

In [2]:
links[0]

'https://www.greaterwrong.com/posts/Y9Yqux7iPwpvnppyS/engineering-experience-through-score'

In [3]:
from dataclasses import dataclass

@dataclass
class Content:
    content_id: int
    url: str
    content_type: str
    content_text: str

In [4]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

def bs_fetcher(url):
    text = None
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    try:
        req = Request(url)
        req.add_header('User-Agent', user_agent)
        page = urlopen(req)
        html = page.read().decode("utf-8")
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text()
         
    except Exception as e:
        print('failed to fetch ', url)
    return text
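Text returned by `soup.get_text()` usually carries long runs of newlines and non-breaking spaces from the page layout. A minimal sketch of a cleanup step follows; `clean_text` is a hypothetical helper that is not part of the notebook, and whether to apply it before collecting preferences is a design choice left open here.

```python
import re

def clean_text(raw: str) -> str:
    """Collapse the whitespace runs left behind by HTML-to-text extraction."""
    # Treat non-breaking spaces as ordinary spaces
    text = raw.replace('\xa0', ' ')
    # Squeeze spaces and tabs inside each line, then drop empty lines
    lines = (re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)
```

It could be applied as `clean_text(bs_fetcher(url))` whenever the fetch succeeds and returns a string.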

In [5]:
# Testing bs_fetcher function
bs_fetcher('https://www.edge.org/response-detail/26557')

"\n\n\n\n\n\n\n\n\n\nEdge.org\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to main content\n\n\nCopyright © 2023 By Edge Foundation, Inc. All Rights Reserved.\n\n\n\n \n\nEdge.org\n\n\n\n\n\n\n\n \n\n\n\nTo arrive at the edge of the world's knowledge, seek out the most complex and sophisticated minds, put them in a room together, and have them ask each other the questions they are asking themselves.\n\n\n\n\nhttps://www.edge.org/response-detail/26557Printed On Sun November 19th 2023 \n\n\n\nSun, Nov 19, 2023HOMECONVERSATIONSVIDEOAUDIOANNUAL QUESTIONEVENTSNEWSLIBRARYABOUTPEOPLE \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n2016 : WHAT DO YOU CONSIDER THE MOST INTERESTING RECENT [SCIENTIFIC] NEWS? WHAT MAKES IT IMPORTANT?\n\n\n\n In the News [ 22 ] \n \xa0\xa0|\xa0\xa0 \n Contributors [ 199 ] \n \xa0\xa0|\xa0\xa0 \n View All Responses [ 199 ]  \n\n\n\n\n  \n Pamela McCorduck \n Author, Machines Who Think, The Universal Machine, Bounded Rationality, This Could Be Important; Co-author (with Edward 

In [6]:
def fetch_content(links) -> list[Content]:
    return [Content(content_id=i, url=url, content_type='article', content_text=bs_fetcher(url)) 
            for i, url in enumerate(links)]

In [7]:
import pandas as pd

contents_df = pd.DataFrame([o.__dict__ for o in fetch_content(links) if o.content_text])

failed to fetch  https://www.crazypi.com/nvidia-jetson-nano-2gb
failed to fetch  http://lukemuehlhauser.com/about/
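The cell above builds DataFrame rows from `o.__dict__` and skips articles whose fetch failed. The same filtering can be sketched with `dataclasses.asdict`, which is the documented way to convert a dataclass instance to a dict. `to_records` is a hypothetical helper name, and `Content` is redefined only so the block is self-contained.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Content:
    content_id: int
    url: str
    content_type: str
    content_text: Optional[str]  # None when the fetch failed

def to_records(contents: list[Content]) -> list[dict]:
    """Convert successfully fetched items to plain dicts, dropping failures."""
    return [asdict(c) for c in contents if c.content_text]
```

The resulting list can be passed directly to `pd.DataFrame(...)`.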


In [8]:
contents_df


Unnamed: 0,content_id,url,content_type,content_text
0,0,https://www.greaterwrong.com/posts/Y9Yqux7iPwp...,article,\n\nEngineering Experience Through Score - Les...
1,1,https://www.greaterwrong.com/posts/YMtZRGLbvdD...,article,\n\nGeneralized Efficient Markets in Political...
2,2,https://www.greaterwrong.com/posts/ZJzSxo6nCNv...,article,\n\nWhy Planning is Hard: A Multifaceted Model...
3,3,https://www.greaterwrong.com/posts/znfkdCoHMAN...,article,\n\nThe ground of optimization - LessWrong 2.0...
4,4,https://www.greaterwrong.com/s/oFePMp9rKftEeZDDr,article,\n\nLawful Truth - LessWrong 2.0 viewerArchive...
5,5,https://www.greaterwrong.com/s/oi873FWi6pHWxswSa,article,\n\nThe Science of Winning at Life - LessWrong...
6,6,https://www.coursera.org/specializations/whart...,article,Business Strategies for A Better World Special...
7,8,https://www.cs.cmu.edu/~02251/schedule.html,article,\n\n\n\n\n\n02-251: Great Ideas in Computation...
8,9,https://www.cs.cmu.edu/~15251/course-info.html,article,\n\n\n\n\n15-251 Fall 2018\n\n\n\n\n\n\n15-251...
9,10,https://slatestarcodex.com/2013/02/17/90-of-al...,article,\n\n\n\n90% of all claims about the problems w...


In [9]:
import pickle

# Save the fetched content as a binary (pickle) file; the context manager closes the file handle
with open("./temp/fetched-content.data", "wb") as f:
    pickle.dump(contents_df, f)

## Creating Pipeline For Collecting Preferences From User

In [10]:
import random, itertools

MAX_PAIRS = 10
order = list(range(len(contents_df)))

pairs = list(itertools.combinations(order, 2))
random.shuffle(pairs)


prefs = []

for x, y in pairs[:MAX_PAIRS]:
    cont_x = contents_df.iloc[x]
    cont_y = contents_df.iloc[y]
    print(cont_x.url , '\n', cont_y.url)
    
    pref = input('Type t for top and b for bottom link : ')
    c, r = None, None
    if pref == 't':
        c, r = cont_x, cont_y
    elif pref == 'b':
        c, r = cont_y, cont_x
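    else:
        # Neither option was picked: skip this pair instead of dereferencing None below
        print('Invalid input, skipping....')
        print('--------------')
        continue
    
    print('--------------')
    prefs.append({'chosen': c.content_text, 'rejected': r.content_text})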

https://www.edge.org/response-detail/26557 
 https://theory.cs.northwestern.edu/courses/
--------------
http://www.nwlink.com/~donclark/about/about.html 
 https://teachyourselfcs.com/
--------------
https://www.greaterwrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1 
 https://theory.cs.northwestern.edu/courses/
--------------
https://www.greaterwrong.com/posts/Y9Yqux7iPwpvnppyS/engineering-experience-through-score 
 https://www.greaterwrong.com/s/oFePMp9rKftEeZDDr
--------------
https://slatestarcodex.com/2014/07/30/meditations-on-moloch/ 
 http://www.math.toronto.edu/ilia/Teaching/MAT337.2018/index.html
--------------
https://www.greaterwrong.com/posts/ZJzSxo6nCNvod67Xs/why-planning-is-hard-a-multifaceted-model 
 https://teachcomputerscience.com/synchronous-and-asynchronous/
--------------
http://math.huji.ac.il/~mhochman/research-expo.html 
 http://stellar.mit.edu/S/course/18/fa08/18.415/materials.html
--------------
https://www.coursera.org/specializations/wharton-glob
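The pair selection above can be isolated into a small function so that it is reproducible and easy to test. This is a sketch under assumptions: `sample_pairs` and its `seed` parameter are hypothetical, and the random flip of each pair is an added design choice (so the "top" link is not always the article with the lower index), not something the notebook does.

```python
import itertools
import random

def sample_pairs(n_items: int, max_pairs: int, seed: int = 0) -> list[tuple[int, int]]:
    """Return up to max_pairs distinct index pairs in a reproducible random order."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    pairs = list(itertools.combinations(range(n_items), 2))
    rng.shuffle(pairs)
    # Randomly flip each pair so display position is not tied to corpus order
    return [(y, x) if rng.random() < 0.5 else (x, y) for x, y in pairs[:max_pairs]]
```

With this helper, the loop would iterate over `sample_pairs(len(contents_df), MAX_PAIRS)` in place of `pairs[:MAX_PAIRS]`.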

In [40]:
pref_df = pd.DataFrame(prefs)
pref_df

Unnamed: 0,chosen,rejected
0,\n\n\n\n\n\n\n\n\n\nEdge.org\n\n\n\n\n\n\n\n\n...,\n\n\n\n\nTeaching – Northwestern CS Theory Gr...
1,\n\nTeach Yourself Computer Science\n\n\n\n\n\...,\n\n\nAbout page for Donald Clark\n\n\n\n\n\n\...
2,\n\nThe ground of optimization - LessWrong 2.0...,\n\n\n\n\nTeaching – Northwestern CS Theory Gr...
3,\n\nLawful Truth - LessWrong 2.0 viewerArchive...,\n\nEngineering Experience Through Score - Les...
4,\n\n\n\nMAT337. Introduction to Real Analysis\...,\n\n\n\nMeditations On Moloch | Slate Star Cod...
5,\n\nWhy Planning is Hard: A Multifaceted Model...,Synchronous and Asynchronous Data Transmissi...
6,\n\n\n\nStellar : Message of the Day\n\n\n\n\n...,\n\nDynamical systems theory\n\n\n Dynamical S...
7,Business Strategies for A Better World Special...,\n\nAI Safety Fundamentals Course\n\n\n\n\n\n\...
8,\n\n18.408 - Fall '16\n\n\n18.408 Topics in Th...,\n\n\n\n\n\n02-251: Great Ideas in Computation...
9,\n\nThe Science of Winning at Life - LessWrong...,\n\nGeneralized Efficient Markets in Political...


In [25]:
pref_df.to_csv('./pref_data.csv')


In [2]:
pref_df = pd.read_csv('./pref_data.csv')

In [8]:
# s = pref_df[['chosen', 'rejected']].to_json('./pref_data.json')

In [13]:
s = pref_df[['chosen', 'rejected']].to_json('./pref_data.json', orient='records')

In [14]:
len(pref_df)

10

In [29]:
from datasets import load_dataset

ld_pref = load_dataset('csv', data_files='./pref_data.csv', split='train')

In [41]:
ld_pref['chosen']

['\n\nWhy Planning is Hard: A Multifaceted Model\n - LessWrong 2.0 viewerArchiveSequencesAboutSearchLog InQuestionsEventsShortformAlignment ForumAF CommentsHomeFeaturedAllTagsRecent CommentsWhy Planning is Hard: A Multifaceted Model\nRuby31 Mar 2019 2:33 UTC29 points9 commentsLW linkPlanning & Decision-Making\uf141Post permalinkLink without commentsLink without top nav barsLink without comments or top nav barsContentsContentsWhat is plan\xadning? Plan\xadning is a Pre\xaddic\xadtion/\u200bIn\xadfor\xadma\xadtion ProblemPlan\xadning is a Com\xadpu\xadta\xadtion ProblemPlan\xadning is a Self-Knowl\xadedge & Self-Mastery ProblemPre\xaddict\xading yourselfKnow\xading what you wantSelf-mas\xadtery of your hu\xadman brainHeuris\xadtics and biasesUs\xading Sys\xadtem 1 (in\xadtu\xadition) and Sys\xadtem 2 (ex\xadplicit rea\xadson) in harmonyEmo\xadtional MasteryPlan\xadning is a Re\xadcur\xadsive ProblemSum\xadmary: What It Takes to Be a Great PlannerEndnotesEpistemic con\xadfi\xaddence: High

In [31]:
df

Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 160800
})

### Load trained model



In [32]:
from transformers import OPTForCausalLM, AutoTokenizer, OPTForSequenceClassification

In [33]:
model = OPTForSequenceClassification.from_pretrained('/home/bhishma/Nextcloud/Documents/code/trl/output/checkpoint-5/')

Some weights of OPTForSequenceClassification were not initialized from the model checkpoint at facebook/opt-350m and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [9]:
tokenizer = AutoTokenizer.from_pretrained("/home/bhishma/Nextcloud/Documents/code/trl/output/checkpoint-5/")

In [11]:
inputs = tokenizer("Hello world!", return_tensors="pt")

In [28]:
outputs = model(**inputs)

((tensor([[[[ 0.3267, -0.3466, -0.4174,  ...,  0.3120, -0.1403, -0.3449],
            [-0.8922,  0.8802,  0.8702,  ..., -0.8283,  0.7369,  0.8701],
            [-0.8554,  0.8466,  0.8191,  ..., -0.8056,  0.7275,  0.8099],
            [-0.6896,  0.7023,  0.7028,  ..., -0.6649,  0.6121,  0.6782]],
  
           [[-0.8134,  0.2669,  0.8097,  ...,  0.1604,  1.0996, -0.7234],
            [ 0.8925, -0.5340, -0.8607,  ..., -0.3973, -1.2051,  0.7701],
            [ 0.7087, -0.4685, -0.6615,  ..., -0.3842, -0.9141,  0.6378],
            [ 0.3812, -0.1438, -0.3133,  ..., -0.0276, -0.5163,  0.3137]],
  
           [[-0.4666,  0.5239,  0.4211,  ..., -0.4509, -0.4898,  0.4881],
            [-0.6169,  0.7540,  0.7537,  ..., -0.7358, -0.6892,  0.7365],
            [-0.6376,  0.7563,  0.7913,  ..., -0.8508, -0.7095,  0.7841],
            [-0.4877,  0.5625,  0.6224,  ..., -0.8138, -0.5233,  0.6062]],
  
           ...,
  
           [[-0.0427,  0.2214, -0.2157,  ...,  0.1592,  0.0185,  0.1363],
       

In [17]:
outputs

tensor([[    2, 31414,   232,   328, 50118, 50118,   100,    17,    27,   119,
            10,    92, 12750,     7,     5, 37744,   232,     4,    38,    17]])

[2,
 31414,
 232,
 328,
 50118,
 50118,
 100,
 17,
 27,
 119,
 10,
 92,
 12750,
 7,
 5,
 37744,
 232,
 4,
 38,
 17]

In [25]:
tokenizer.decode(outputs.tolist()[0])

'</s>Hello world!\n\nI’m a newbie to the blogging world. I�'
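Assuming the sequence-classification checkpoint is used as a reward model that yields one score per text, predicting the preferred article comes down to comparing scores. The sketch below keeps the model out of the picture: `score_fn` is a hypothetical stand-in for a wrapper around the tokenizer and model, and the toy scorer in the usage lines exists only to make the block runnable.

```python
from typing import Callable, Sequence

def prefer(text_a: str, text_b: str, score_fn: Callable[[str], float]) -> str:
    """Return whichever text the scorer rates higher (ties go to text_a)."""
    return text_a if score_fn(text_a) >= score_fn(text_b) else text_b

def rank_articles(texts: Sequence[str], score_fn: Callable[[str], float]) -> list[str]:
    """Order texts from most to least preferred according to the scorer."""
    return sorted(texts, key=score_fn, reverse=True)

# Toy scorer for illustration only: longer text gets a higher score
toy_score = lambda text: float(len(text))
ranked = rank_articles(["short", "a longer article", "mid-size"], toy_score)
```

In practice `score_fn` would truncate each article to the model's context length before scoring, since the fetched pages are much longer than a typical input window.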