### Data Collection

Using PRAW on r/news to retrieve discussions on selected topics during 2020.

Topics: 
* Abortion
* Climate Change
* Gun Control
* Immigration
* Health Care

In [24]:
import praw
import pandas as pd
import datetime
import json

In [52]:
# initiate instance
reddit = praw.Reddit(
    client_id="slThjR4gi1Wxp-mEV1xYOw",
    client_secret="xlV0UgYRlXFJiaiop9qie133qYPdfg",
    user_agent="my user agent"
)

Manually select articles related to the topics from r/news 2020. 
First doing a search by year and num of comments/upvotes (not sure which one would make more sense, but anyways pretty similar results) -- it's not possible to do search by year directly on Reddit.

In [48]:
# refine search with other terms
queries= ['abortion', 'roe v wade', 'pro-choice', 'pro-life', 'abortion rights', 'abortion access',
          'gun control', 'gun access', 'background checks', 'second amendment', 'gun rights', 'mass shooting',
           'climate change', 'temperature', 'global warming', 'greenhouse gases', 'carbon emissions',
            'immigration', 'migration', 'border wall', 'border security',
             'healthcare', 'medicare', 'health insurance']


subreddit = reddit.subreddit('news')

dt_start_2020 = datetime.datetime(2020, 1, 1, 0, 0).timestamp()
dt_end_2020 = datetime.datetime(2020, 12, 31, 23, 59).timestamp()


results = []

for query in queries:
    search_results = subreddit.search(query, time_filter='all')

    # Filter submissions by the year 2020 and minimum number of comments
    for submission in search_results:
        if dt_start_2020 <= submission.created_utc <= dt_end_2020 and submission.num_comments > 1000:
            results.append({
                'search_term': query,
                'title': submission.title,
                'url': submission.url,
                'score': submission.score,
                'num_comments': submission.num_comments,
                'date': datetime.datetime.fromtimestamp(submission.created_utc),
                'id': submission.id
            })

df = pd.DataFrame(results)
df

Unnamed: 0,search_term,title,url,score,num_comments,date,id
0,abortion,Texas says abortions 'non-essential' amid pand...,https://www.bbc.com/news/52012243,10925,2473,2020-03-25 12:18:00,foolui
1,abortion,Group buys Alabama abortion clinic to keep it ...,https://www.sfgate.com/news/article/Group-buys...,51533,2625,2020-05-17 01:45:00,gl4zny
2,pro-choice,Pregnant woman suffers miscarriage after being...,https://news.sky.com/story/pregnant-woman-suff...,23169,1463,2020-11-26 13:58:53,k1et5v
3,pro-life,Kellyanne Conway's daughter the latest to test...,https://www.northjersey.com/story/news/coronav...,72812,3582,2020-10-04 22:03:11,j55ia6
4,pro-life,Iowa confirms first child death from COVID as ...,https://www.kcrg.com/2020/08/23/iowa-confirms-...,54468,3974,2020-08-24 04:55:50,ifgzya
5,pro-life,Atlanta officer who fatally shot Rayshard Broo...,https://www.cnn.com/2020/06/13/us/atlanta-poli...,3945,2875,2020-06-14 06:18:56,h8n2aj
6,abortion rights,COVID-19 Megathread #8,https://www.reddit.com/r/news/comments/fpy8ax/...,537,6311,2020-03-27 15:52:02,fpy8ax
7,abortion access,COVID-19 Megathread #8,https://www.reddit.com/r/news/comments/fpy8ax/...,533,6311,2020-03-27 15:52:02,fpy8ax
8,abortion access,Group buys Alabama abortion clinic to keep it ...,https://www.sfgate.com/news/article/Group-buys...,51541,2625,2020-05-17 01:45:00,gl4zny
9,gun access,COVID-19 Megathread #8,https://www.reddit.com/r/news/comments/fpy8ax/...,531,6311,2020-03-27 15:52:02,fpy8ax


I'm not really sure which kind of comments we want:
- only top level? -- also when input in the prompt we wouldn't have any sort of structure, but we lose the "discussion" part
- maybe the first top levels and the conversation after that? (a lot of bullshit)
- "controversiality" appearently is also sort of bullshit -- anyways we cannot filter by it, but by the percentage of votes that are upvotes (controversial are those with ~equal up and downvotes)
- "hot", "best" is only to search for submission in subreddit, not for commetns

In [69]:
# Handpicked some (pretty aleatory)

submission_ids = ['gl4zny', 'inznpg', '	i9jrvl', 'g25xe3', 'hc211l']

topic= {"abortion": "gl4zny", "climate change": "inznpg", "healthcare": "i9jrvl", "immigration": 'g25xe3', "gun control": 'hc211l'}


comments = []

for topic_name, submission_id in topic.items():
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=5)   ## it's a lot of comments - but thought of keeping it until we figure out how we are going to put them in prompt
    for comment in submission.comments:
        # Check if the comment is not deleted or removed
        if not comment.body == '[deleted]' and not comment.body == '[removed]':
            comments.append({
                'sub_id': submission_id,
                'topic': topic_name,
                'sub_title': submission.title,
                'comment': comment.body,
                'score': comment.score
            })
        # for reply in comment.replies:
        #     comments.append({
        #         'Submission ID': submission_id,
        #         'Submission Title': submission.title,
        #         'Comment Body': reply.body,
        #         'Comment Score': reply.score
        #     })

            
df_comments = pd.DataFrame(comments)
df_comments.to_csv('comments.csv', index=False)


Creating the prompts

Evaluate number of comments possible according to token count

In [25]:
df_comments= pd.read_csv('comments.csv')

In [33]:
# Token count for top 200 comments
for topic, group_df in df_comments.groupby('topic'):
    print("Topic:", topic)

    # sort comments by score
    sorted_comments = group_df.sort_values(by='score', ascending=False)
    top_200_comments = sorted_comments.head(200)
    
    # join into single string and tokenize
    all_comments_text = ' '.join(top_200_comments['comment'])
    token_count = len(all_comments_text.split())

    print("Total token count in top 200 comments:", token_count)
    print()

Topic: abortion
Total token count in top 200 comments: 3079

Topic: climate change
Total token count in top 200 comments: 4043

Topic: gun control
Total token count in top 200 comments: 6201

Topic: healthcare
Total token count in top 200 comments: 10059

Topic: immigration
Total token count in top 200 comments: 6352



We are well below 16k tokens... so probably would work with CodeLlama with ColabPro

* https://www.youtube.com/watch?v=ELax81LjFhU
* https://ai.meta.com/blog/code-llama-large-language-model-coding/ 

"The Code Llama models provide stable generations with up to 100,000 tokens of context. All models are trained on sequences of 16,000 tokens and show improvements on inputs with up to 100,000 tokens." It is fine-tuned for coding tasks, but still based on Llama 2. 

Create prompt

In [36]:
# Initialize an empty dictionary to store comments with scores for each topic
topic_comments_dict = {}

# Group by topic_name and iterate over groups
for topic, group_df in df_comments.groupby('topic'):
    topic_comments_list = []

    topic_comments_list.append(f"Headline: {group_df['sub_title'].iloc[0]}\n")

    i=0
    for index, row in top_200_comments.iterrows():
        comment = row['comment']
        score = row['score']
        comment_with_score = f"Comment {i}: '{comment}' (score: {score})\n"
        topic_comments_list.append(comment_with_score)
        i+=1
    
    # add list of comments to dictionary
    topic_comments_dict[topic] = topic_comments_list


topic_comments_dict["abortion"][:10]

# Save the dictionary to a file
with open('comments_for_prompts.json', 'w') as f:
    json.dump(topic_comments_dict, f, indent=4)

In [37]:
topic_comments_dict["climate change"]

["Headline: Los Angeles records county's highest temperature ever on record\n",
 "Comment 0: '“State officials won’t decide who gets the money. Instead, the state will give the money to a network of regional nonprofits to find and vet potential recipients. Advocates say that’s key to making the plan work because immigrants are unlikely to contact the government for fear of deportation.”\n\nGavin’s giving non-profits $75MM to dole out as they please in $500 increments. It doesn’t mention how they’ll keep these non-profits accountable.' (score: 1394)\n",
 "Comment 1: 'Illegal immigration should not be encouraged' (score: 898)\n",
 "Comment 2: 'US citizen here. How do I apply for immigrant status in California?' (score: 808)\n",
 "Comment 3: 'Didnt read all of the comments, but am I the only one amused by the irony that my state's essentially paying illegal aliens out of pocket and under the table (via nonprofit's discretion) with hard cash?\n\nEdit: and its less amusing that everythings 