# Format Data for an OpenAI Model to Classify r/AITA Comments
Using the post and comment data we have already uploaded to GCP for r/AITA, we will pull that data locally and convert it into the JSONL format that OpenAI's tooling prefers. From this data we will create two models:
1. A sentiment-prediction model for the classification label of first-level comments
2. A text blurb generator that explains how the AI arrived at its classification

For now, the first model will be trained only on the posts and the classifications taken from their comments.

In [1]:
import openai
import os
import sys
import importlib
import pandas as pd
import numpy as np
from pathlib import Path
from dotenv import load_dotenv
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

In [2]:
sys.path.append("..")

# load environment variables for praw to work later
load_dotenv(dotenv_path=Path("./settings.env"))

True

In [3]:
client = bigquery.Client()

In [4]:
# Remote data definitions
PROJ_NAME = "bonion"
DATASET_NAME = "AITA_dataset"
post_table_id = "{}.{}.post_table".format(PROJ_NAME, DATASET_NAME)
comment_table_id = "{}.{}.comment_table".format(PROJ_NAME, DATASET_NAME)
reply_table_id = "{}.{}.reply_table".format(PROJ_NAME, DATASET_NAME)

## Prepare Sentiment Analysis and Classification Prediction Dataset
Join each post with its comments and reshape the result into the form needed for training.

In [5]:
# get data 
query_for_post_data = """
    SELECT reddit_post_id AS post_id, post_title, post_self_text AS post_content, posts.upvotes AS post_upvotes, 
            comment_id, content AS comment_content, comments.upvotes AS comment_upvotes
    FROM {} posts JOIN {} comments ON (posts.reddit_post_id = SUBSTRING(comments.parent_id, 4, 100));
""".format(post_table_id, comment_table_id)

post_join_comment_df = pd.read_gbq(query_for_post_data, project_id=PROJ_NAME)

post_join_comment_df.head(4)

Unnamed: 0,post_id,post_title,post_content,post_upvotes,comment_id,comment_content,comment_upvotes
0,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4fbhf,Nta. Your kid is gonna grow up with half sibli...,11786
1,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e5me,NTA- id have cut em both out completely. They’...,4036
2,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e513,NTA\n\nI think you handled the situation as we...,26273
3,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4eips,"This is a lot to unpack, but you're definitely...",2271


In [6]:
post_join_comment_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8090 entries, 0 to 8089
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   post_id          8090 non-null   object
 1   post_title       8090 non-null   object
 2   post_content     8090 non-null   object
 3   post_upvotes     8090 non-null   Int64 
 4   comment_id       8090 non-null   object
 5   comment_content  8090 non-null   object
 6   comment_upvotes  8090 non-null   Int64 
dtypes: Int64(2), object(5)
memory usage: 458.3+ KB


In [169]:
# post_join_comment_df.iloc[3]["comment_content"]

### Outline of Model Data Inputs
Below are the inputs and outputs of each model.

1. ClassificationModel(post_title, post_contents) -> classification 
2. TextGenerationModel(post_title, post_contents, classification) -> completion with classification explicitly stated


#### A Note about the initial TextGenerationModel
For my first "naive" model, I'm going to expand the classification labels into full words, because the acronyms might not carry much meaning for OpenAI's model.

### Basic Data Modifications
Because r/AITA uses niche acronyms for its verdicts on a story, the model may have trouble understanding what they mean. I'm going to use regex to parse the classification label out of each comment_content:
- YTA = You're the Asshole
- YWBTA = You Would Be the Asshole
- NTA = Not the Asshole
- YWNBTA = You Would Not Be the Asshole
- ESH = Everyone Sucks Here
- NAH = No Assholes Here
- INFO = Not Enough Info

These labels will be placed in a *class* column within the df. Then I will convert all instances of these acronyms within the comment content to its full "version" e.g. YTA -> You're the Asshole.
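The parsing and expansion steps described above could look roughly like the sketch below. It is not the code in `data_utils`, which is not shown in this notebook; `parse_label` and `expand_labels` are hypothetical names, and the matching rule (first whole-word acronym found, case-insensitive) is an assumption.

```python
import re
from typing import Optional

# Acronym -> expanded phrase, following the label list above.
AITA_LABELS = {
    "ywnbta": "You Would Not Be the Asshole",
    "ywbta": "You Would Be the Asshole",
    "yta": "You're the Asshole",
    "nta": "Not the Asshole",
    "esh": "Everyone Sucks Here",
    "nah": "No Assholes Here",
    "info": "Not Enough Info",
}

# Whole-word, case-insensitive match on any of the acronyms.
LABEL_PATTERN = re.compile(
    r"\b(" + "|".join(AITA_LABELS) + r")\b", re.IGNORECASE
)


def parse_label(comment: str) -> Optional[str]:
    """Return the first acronym found in the comment (lowercased), or None."""
    match = LABEL_PATTERN.search(comment)
    return match.group(1).lower() if match else None


def expand_labels(comment: str) -> str:
    """Replace every acronym in the comment with its expanded phrase."""
    return LABEL_PATTERN.sub(lambda m: AITA_LABELS[m.group(1).lower()], comment)
```

One limitation of this sketch: "info" is also an ordinary word, so a comment such as "need more info" would be labelled `info` even if the commenter did not intend it as a verdict.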

In [20]:
from data import DataClasses
importlib.reload(DataClasses)
from data import data_utils
importlib.reload(data_utils)

<module 'data.data_utils' from '/home/cstainsby/class/dataProj/bonion/src/backend_api_service/../data/data_utils.py'>

In [8]:
# create a subset with only the necessary data for training 
post_join_comment_subset_df = post_join_comment_df[["post_title", "post_content", "comment_content"]]
post_join_comment_subset_df.head(4)

Unnamed: 0,post_title,post_content,comment_content
0,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,Nta. Your kid is gonna grow up with half sibli...
1,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA- id have cut em both out completely. They’...
2,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA\n\nI think you handled the situation as we...
3,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,"This is a lot to unpack, but you're definitely..."


In [9]:
classifications = []
for _, post_comment_row in post_join_comment_subset_df.iterrows():
    row_class = data_utils.parse_class_label_from_AITA_comment(post_comment_row["comment_content"])
    classifications.append(row_class)

post_join_comment_subset_df["class"] = classifications
post_join_comment_subset_df.head(4)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  post_join_comment_subset_df["class"] = classifications


Unnamed: 0,post_title,post_content,comment_content,class
0,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,Nta. Your kid is gonna grow up with half sibli...,nta
1,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA- id have cut em both out completely. They’...,nta
2,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA\n\nI think you handled the situation as we...,nta
3,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,"This is a lot to unpack, but you're definitely...",nta


### Voting System
Our data currently repeats each post_title and post_content once per comment. To give our model only one classification per post, we need some sort of voting system for determining the "correct" class. Because the comments come with classifications, we will assume the commenters know what they're talking about. There are multiple approaches to how we can account for these votes.

**For now we will stick to a majority vote**

It may be necessary to drop instances which are highly contested to improve performance.
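As a minimal sketch of a majority vote, the function below counts every comment's label for a post and returns the most frequent one. `majority_vote` is a hypothetical helper, not the `data_utils` implementation, and it assumes missing labels are represented as `None`.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(labels: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None label, or None if there is none.

    Ties are broken in favour of the label that was seen first.
    """
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For the vote to be meaningful, the function needs one label per comment, duplicates included. The same counts could also be used to flag highly contested posts, for example when the winning label holds only a small share of the votes.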

In [11]:
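# group the comments by post and collect the distinct class labels seen for each post
# (set() drops duplicates, so per-label vote counts are not kept)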
pjc_group_by_class = post_join_comment_subset_df.groupby(["post_title", "post_content"]).agg({
    "class": lambda x: list(set(x))
}).reset_index()
pjc_group_by_class

Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,"[None, nta]"
1,AITA (22M) for telling my mom (46F) to fuck off?,I lost my job two weeks ago due to redundancy....,"[None, nta, esh]"
2,AITA (31F) for not wanting to go out drinking ...,Me and my girlfriend Karla (fake name) live in...,"[None, nta, nah]"
3,AITA (39F) for NOT wanting to move to Las Vega...,Hello there. I've been with my husband for 12 ...,"[None, nta, info, esh, nah]"
4,AITA (F25) for telling my bridesmaid (F22) tha...,I’m going to try my best to detail this as cle...,"[None, nta]"
...,...,...,...
944,WIBTA if i told my ride off for texting while ...,My (17f ) family moved into a new neighborhood...,"[None, nta, ywbta]"
945,WIBTAH if I go against my husband and respond ...,I (27f) and my husband (26m) are back once aga...,"[yta, None, ywbta]"
946,WIBTAH if we didn't return the sofa to my dad'...,Hello all. I'll try and keep this as short as ...,"[None, nta, ywbta]"
947,[UPDATE] AITA for asking my boyfriend to charg...,\nOriginal Post: \n\nhttps://www.reddit.com/r/...,[None]


In [21]:
pjc_with_class_votes = pd.DataFrame({
    "post_title": pjc_group_by_class["post_title"],
    "post_content": pjc_group_by_class["post_content"]
})
votes = []

for _, pjc_row in pjc_group_by_class.iterrows():
    most_common_vote = data_utils.get_AITA_most_common_vote(pjc_row["class"])
    votes.append(most_common_vote)

pjc_with_class_votes["class"] = votes
pjc_with_class_votes.head(4)

Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,nta
1,AITA (22M) for telling my mom (46F) to fuck off?,I lost my job two weeks ago due to redundancy....,nta
2,AITA (31F) for not wanting to go out drinking ...,Me and my girlfriend Karla (fake name) live in...,nta
3,AITA (39F) for NOT wanting to move to Las Vega...,Hello there. I've been with my husband for 12 ...,nta


### Drop Empty Rows
I'm going to drop any posts that have made it to this point with a class of None, so that training does not fail on missing labels.

In [22]:
pjc_with_class_votes = pjc_with_class_votes.dropna()

In [23]:
pjc_with_class_votes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 946
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_title    920 non-null    object
 1   post_content  920 non-null    object
 2   class         920 non-null    object
dtypes: object(3)
memory usage: 28.8+ KB


### Put the data in a CSV
Store the data as a CSV so that OpenAI's tooling can convert it into JSONL.

In [24]:
aita_data_obj = DataClasses.AITAData()

In [25]:
aita_data_obj.store_post_to_class_data("post_comment_class", pjc_with_class_votes)

### Remove Excessively Long Rows
For classification purposes, there is a limit on the number of tokens that can be input.
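One way to filter on length is sketched below. It uses a rough characters-per-token heuristic rather than a real tokenizer, and both `CHARS_PER_TOKEN` and `MAX_PROMPT_TOKENS` are assumed placeholder values, not limits taken from OpenAI's documentation; check the actual limit for the model being trained.

```python
import math

CHARS_PER_TOKEN = 4        # rough rule of thumb for English text (assumption)
MAX_PROMPT_TOKENS = 2000   # hypothetical budget; replace with the real limit


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def is_within_token_budget(
    title: str, content: str, max_tokens: int = MAX_PROMPT_TOKENS
) -> bool:
    """Check whether title plus content is estimated to fit in the budget."""
    return estimate_tokens(title + "\n" + content) <= max_tokens
```

Applied row-wise to the DataFrame, this could be used as `df[df.apply(lambda r: is_within_token_budget(r["post_title"], r["post_content"]), axis=1)]`. A real tokenizer would give more accurate counts than the heuristic.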

In [26]:
pjc_with_class_votes.info()
pjc_with_class_votes.head(4)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 920 entries, 0 to 946
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_title    920 non-null    object
 1   post_content  920 non-null    object
 2   class         920 non-null    object
dtypes: object(3)
memory usage: 28.8+ KB


Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,nta
1,AITA (22M) for telling my mom (46F) to fuck off?,I lost my job two weeks ago due to redundancy....,nta
2,AITA (31F) for not wanting to go out drinking ...,Me and my girlfriend Karla (fake name) live in...,nta
3,AITA (39F) for NOT wanting to move to Las Vega...,Hello there. I've been with my husband for 12 ...,nta


### Prepare Data with OpenAI tooling
Run the following and follow the recommended procedures for producing JSONL for training the model.

        openai tools fine_tunes.prepare_data -f <LOCAL_FILE>
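For reference, the JSONL that the tool produces holds one JSON object per line with a `prompt` and a `completion` field. The sketch below builds such lines by hand. The `title:`/`content:` prompt layout, the `separator` suffix, and the leading space on the completion are assumptions based on common conventions for this format, and `build_prompt` and `to_jsonl_lines` are hypothetical names.

```python
import json
from typing import Iterable, Iterator, Mapping


def build_prompt(title: str, content: str) -> str:
    """Combine a post's title and body into a single prompt string."""
    return f"title: {title}\ncontent: {content}"


def to_jsonl_lines(
    rows: Iterable[Mapping[str, str]], separator: str = "\n\n###\n\n"
) -> Iterator[str]:
    """Yield one JSON line per row with 'prompt' and 'completion' keys."""
    for row in rows:
        record = {
            # The separator marks where the prompt ends (assumed convention).
            "prompt": build_prompt(row["post_title"], row["post_content"]) + separator,
            # The completion starts with a space (assumed convention).
            "completion": " " + row["class"],
        }
        yield json.dumps(record, ensure_ascii=False)
```

With a DataFrame, the rows can be supplied as `df.to_dict("records")` and the resulting lines written to a `.jsonl` file, one per line.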

## Data Formatting For Comment Generation

In [11]:
post_join_comment_subset_df.head(4)

Unnamed: 0,post_title,post_content,comment_content,class
0,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,Nta. Your kid is gonna grow up with half sibli...,nta
1,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA- id have cut em both out completely. They’...,nta
2,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA\n\nI think you handled the situation as we...,nta
3,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,"This is a lot to unpack, but you're definitely...",nta


In [14]:
post_join_comment_subset_df = post_join_comment_subset_df.dropna()

In [16]:
print("class labels present:", list(post_join_comment_subset_df["class"].unique()))

class labels present: ['nta', 'info', 'yta', 'esh', 'ywbta', 'nah', 'ywnbta']


In [23]:
aita_data_obj.store_post_and_class_to_text_data("post_and_class_to_gen_text", post_join_comment_subset_df)

In [27]:
temp_df = pd.read_csv("../data/store/post_and_class_to_gen_text.csv").head(4)

temp_df.iloc[3]["prompt"]

'title: AITA for leaving the call when my brother announced that his gf is pregnant?\ncontent: My brother and I do not get on.\n\nWhen we were younger he\'d go out of his way to make my life a living hell. To my parent\'s credit they did tell him off for it when they caught him but they both worked long hours and didn\'t have the energy to deal with our arguments. This continued into adulthood. He was salty that he failed his college course the first time around.\n\nThere was a bad argument in our family a while ago and, I shit you not, it all started because I refused the decorate my brothers living room. I wont go into too much detail but he wanted a pretty hefty discount, I said no and he threw a tantrum.\n\nYou really need to meet my brother to understand just how bad he is. But hopefully this post will do it some justice. Instead of being a grown up and talking to me, he decided to hook up with my toxic ex girlfriend and the mother of my child. Due to the rules here I can\'t go in