# Create and Fine-Tune an OpenAI Model for r/AITA
Using the post and comment data from r/AITA that we've already uploaded to GCP, we are going to pull that data locally and convert it into the JSONL format that OpenAI's fine-tuning tooling expects. With this data we will create two models:
1. A classifier that predicts the classification label given in a post's first-layer (top-level) comments
2. A text blurb generator that explains how the AI arrived at its classification

For now, the first model will be trained only on the posts and the classifications found in their comments.

In [95]:
import openai
import sys
import importlib
import pandas as pd
import numpy as np
from pathlib import Path
from dotenv import load_dotenv
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

In [96]:
sys.path.append("..")

# load environment variables for praw to work later
load_dotenv(dotenv_path=Path("../settings.env"))

True

In [97]:
client = bigquery.Client()

In [98]:
# Remote data definitions
PROJ_NAME = "bonion"
DATASET_NAME = "AITA_dataset"
post_table_id = "{}.{}.post_table".format(PROJ_NAME, DATASET_NAME)
comment_table_id = "{}.{}.comment_table".format(PROJ_NAME, DATASET_NAME)
reply_table_id = "{}.{}.reply_table".format(PROJ_NAME, DATASET_NAME)

## Prepare Sentiment Analysis and Classification Prediction Dataset
Convert posts and their comments into the shape needed for training.

In [166]:
# get data 
query_for_post_data = """
    SELECT reddit_post_id AS post_id, post_title, post_self_text AS post_content, posts.upvotes AS post_upvotes, 
            comment_id, content AS comment_content, comments.upvotes AS comment_upvotes
    FROM {} posts JOIN {} comments ON (posts.reddit_post_id = SUBSTRING(comments.parent_id, 4, 100));
""".format(post_table_id, comment_table_id)

post_join_comment_df = pd.read_gbq(query_for_post_data, project_id=PROJ_NAME)

post_join_comment_df.head(4)

Unnamed: 0,post_id,post_title,post_content,post_upvotes,comment_id,comment_content,comment_upvotes
0,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4fbhf,Nta. Your kid is gonna grow up with half sibli...,11786
1,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e5me,NTA- id have cut em both out completely. They’...,4036
2,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e513,NTA\n\nI think you handled the situation as we...,26273
3,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4eips,"This is a lot to unpack, but you're definitely...",2271
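
The join in the query above works because Reddit stores a comment's `parent_id` as a "fullname": a type prefix followed by the id, where `t3_` marks a post and `t1_` marks another comment. `SUBSTRING(comments.parent_id, 4, 100)` drops the first three characters, so only top-level comments end up matching a `reddit_post_id`. A minimal Python sketch of the same idea follows; the function name `post_id_from_parent` is hypothetical and not part of this project's code.

```python
# Reddit fullname prefix for posts ("links"); comments use "t1_" instead.
POST_PREFIX = "t3_"

def post_id_from_parent(parent_id):
    """Return the bare post id if the parent is a post, otherwise None."""
    if parent_id.startswith(POST_PREFIX):
        return parent_id[len(POST_PREFIX):]
    # The parent is another comment (a reply), so there is no post id to return.
    return None
```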


In [168]:
post_join_comment_df.info()

2796
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2796 entries, 0 to 2795
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   post_id          2796 non-null   object
 1   post_title       2796 non-null   object
 2   post_content     2796 non-null   object
 3   post_upvotes     2796 non-null   Int64 
 4   comment_id       2796 non-null   object
 5   comment_content  2796 non-null   object
 6   comment_upvotes  2796 non-null   Int64 
dtypes: Int64(2), object(5)
memory usage: 158.5+ KB


In [169]:
# post_join_comment_df.iloc[3]["comment_content"]

### Outline of Model Data Inputs
Below I will show the inputs and outputs of each of the models.

1. ClassificationModel(post_title, post_contents) -> classification 
2. TextGenerationModel(post_title, post_contents, classification) -> completion with classification explicitly stated


#### A Note about the initial TextGenerationModel
For my first "naive" model, I'm going to expand the classification labels into full words because the acronyms might not carry much meaning for OpenAI's model.

### Basic Data Modifications
r/AITA uses acronyms to rate each story, and because these acronyms are niche, the model may struggle to understand what they mean. I'm going to use regex to parse the classification label out of each comment_content:
- YTA = You're the Asshole
- YWBTA = You Would Be the Asshole
- NTA = Not the Asshole
- YWNBTA = You Would Not be the Asshole
- ESH = Everyone Sucks here
- NAH = No Assholes here
- INFO = Not Enough Info

These labels will be placed in a *class* column within the df. Then I will convert all instances of these acronyms within the comment content to its full "version" e.g. YTA -> You're the Asshole.
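
As a rough illustration of the parsing and expansion steps described above, here is a minimal sketch. It is not the implementation in `data_utils`; the names `LABEL_EXPANSIONS`, `parse_label`, and `expand_labels` are hypothetical, and the choice to take the first acronym found in a comment is an assumption.

```python
import re

# Acronym -> expanded phrase, copied from the list above.
LABEL_EXPANSIONS = {
    "YWNBTA": "You Would Not be the Asshole",
    "YWBTA": "You Would Be the Asshole",
    "YTA": "You're the Asshole",
    "NTA": "Not the Asshole",
    "ESH": "Everyone Sucks here",
    "NAH": "No Assholes here",
    "INFO": "Not Enough Info",
}

# Word boundaries stop an acronym from matching inside a longer word.
LABEL_PATTERN = re.compile(
    r"\b(" + "|".join(LABEL_EXPANSIONS) + r")\b", re.IGNORECASE
)

def parse_label(comment):
    """Return the first label found in the comment (lowercased), or None."""
    match = LABEL_PATTERN.search(comment)
    return match.group(1).lower() if match else None

def expand_labels(comment):
    """Replace every acronym in the comment with its expanded phrase."""
    return LABEL_PATTERN.sub(
        lambda m: LABEL_EXPANSIONS[m.group(1).upper()], comment
    )
```

Matching case-insensitively catches comments such as "Nta. ...", but it also means an ordinary use of the word "info" or "nah" is treated as a label, so a stricter pattern may be worth trying.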

In [190]:
from data import DataClasses
importlib.reload(DataClasses)
from data import data_utils
importlib.reload(data_utils)

<module 'data.data_utils' from '/home/cstainsby/class/dataProj/bonion/src/notebooks/../data/data_utils.py'>

In [191]:
# create a subset with only the necessary data for training 
post_join_comment_subset_df = post_join_comment_df[["post_title", "post_content", "comment_content"]]
post_join_comment_subset_df.head(4)

Unnamed: 0,post_title,post_content,comment_content
0,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,Nta. Your kid is gonna grow up with half sibli...
1,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA- id have cut em both out completely. They’...
2,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA\n\nI think you handled the situation as we...
3,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,"This is a lot to unpack, but you're definitely..."


In [192]:
classifications = []
for _, post_comment_row in post_join_comment_subset_df.iterrows():
    row_class = data_utils.parse_class_label_from_AITA_comment(post_comment_row["comment_content"])
    classifications.append(row_class)

post_join_comment_subset_df["class"] = classifications
post_join_comment_subset_df.head(4)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  post_join_comment_subset_df["class"] = classifications


Unnamed: 0,post_title,post_content,comment_content,class
0,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,Nta. Your kid is gonna grow up with half sibli...,nta
1,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA- id have cut em both out completely. They’...,nta
2,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,NTA\n\nI think you handled the situation as we...,nta
3,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,"This is a lot to unpack, but you're definitely...",nta


### Voting System
Our data currently repeats each post_title and post_content once per comment. To give the model only one classification per post, we need some sort of voting system for determining the "correct" class. Because the comments come with classifications, we will assume the commenters know what they're talking about. There are multiple ways to count these votes.

**For now we will stick to a majority vote**

It may be necessary to drop instances which are highly contested to improve performance.
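
A majority vote can be sketched with `collections.Counter`. This is not the code behind `data_utils.get_AITA_most_common_vote`; the function name `majority_vote` is hypothetical, and the sketch assumes it receives one label per comment with duplicates kept, since the counts are what decide the vote.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent non-missing label, or None if there is none."""
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    # most_common(1) gives a one-element list of (label, count) pairs.
    return counts.most_common(1)[0][0]
```

When two labels tie, `Counter.most_common` returns the one that was seen first, so highly contested posts could instead be detected by comparing the top two counts and dropped.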

In [193]:
# group the comments by post, collecting the distinct class labels seen for each post
pjc_group_by_class = post_join_comment_subset_df.groupby(["post_title", "post_content"]).agg({
    "class": lambda x: list(set(x))
}).reset_index()
pjc_group_by_class

Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,"[None, nta]"
1,AITA - I missed my daughter’s award ceremony b...,This might be a bit long but thanks for readin...,"[yta, info]"
2,AITA For Barring My Husband From The Bedroom T...,So here is the situation.\n\nMe: nurse. Workin...,"[nta, info]"
3,AITA For Firing An Employee After His Parents ...,I'm the VP of Sales at a software company and ...,[yta]
4,AITA For Refusing To Crochet Something For My ...,Throwaway Account\n\nI (24m) have never been l...,"[None, nta]"
...,...,...,...
290,WIBTA if I record the audio of my neighbours h...,"My neighbours are often super loud, blasting m...","[None, nta, esh]"
291,WIBTA if I refused to attend my cousins weddin...,Yes I'm aware that my cousin posted here and o...,"[None, nta]"
292,WIBTA if I started calling my white coworkers ...,I moved from Georgia to the Pacific Northwest ...,"[None, nta]"
293,[UPDATE] AITA for asking my boyfriend to charg...,\nOriginal Post: \n\nhttps://www.reddit.com/r/...,[None]


In [222]:
pjc_with_class_votes = pd.DataFrame({
    "post_title": pjc_group_by_class["post_title"],
    "post_content": pjc_group_by_class["post_content"]
})
votes = []

for _, pjc_row in pjc_group_by_class.iterrows():
    most_common_vote = data_utils.get_AITA_most_common_vote(pjc_row["class"])
    votes.append(most_common_vote)

pjc_with_class_votes["class"] = votes
pjc_with_class_votes.head(4)

Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,nta
1,AITA - I missed my daughter’s award ceremony b...,This might be a bit long but thanks for readin...,yta
2,AITA For Barring My Husband From The Bedroom T...,So here is the situation.\n\nMe: nurse. Workin...,nta
3,AITA For Firing An Employee After His Parents ...,I'm the VP of Sales at a software company and ...,yta


### Drop Empty rows
I'm going to drop any posts whose class is still None at this point, so that every training example has a label.

In [223]:
pjc_with_class_votes = pjc_with_class_votes.dropna()

In [224]:
pjc_with_class_votes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 276 entries, 0 to 292
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_title    276 non-null    object
 1   post_content  276 non-null    object
 2   class         276 non-null    object
dtypes: object(3)
memory usage: 8.6+ KB


### Put the data in a CSV
Store the data as a CSV so that OpenAI's tooling can convert it into JSONL.

In [225]:
post_comment_data_obj = DataClasses.PostCommentData()

In [226]:
post_comment_data_obj.store_data_for_training(pjc_with_class_votes)

### Remove Excessively long rows
For classification purposes there is a limit on the number of tokens that can be given as input.
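
One way to filter on length is sketched below. The budget `MAX_PROMPT_TOKENS` is a hypothetical placeholder (the real limit depends on the base model), and the token count is only a rough estimate based on the rule of thumb of about four characters per token for English text; a real tokenizer would be more accurate.

```python
# Hypothetical budget; the real limit depends on the base model that is chosen.
MAX_PROMPT_TOKENS = 2000

def approx_token_count(text):
    """Rough estimate, assuming about 4 characters per token for English text."""
    return len(text) // 4 + 1

def fits_token_budget(post_title, post_content, max_tokens=MAX_PROMPT_TOKENS):
    """True if the combined title and body stay within the assumed budget."""
    return approx_token_count(post_title + " " + post_content) <= max_tokens

# Possible use on the DataFrame from the cells above (not run here):
# keep = pjc_with_class_votes.apply(
#     lambda row: fits_token_budget(row["post_title"], row["post_content"]),
#     axis=1,
# )
# pjc_with_class_votes = pjc_with_class_votes[keep]
```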

In [231]:
pjc_with_class_votes.info()
pjc_with_class_votes.head(4)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 276 entries, 0 to 292
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   post_title    276 non-null    object
 1   post_content  276 non-null    object
 2   class         276 non-null    object
dtypes: object(3)
memory usage: 8.6+ KB


Unnamed: 0,post_title,post_content,class
0,(UPDATE) AITA for telling my step-daughter to ...,[Original post ](https://www.reddit.com/r/AmIt...,nta
1,AITA - I missed my daughter’s award ceremony b...,This might be a bit long but thanks for readin...,yta
2,AITA For Barring My Husband From The Bedroom T...,So here is the situation.\n\nMe: nurse. Workin...,nta
3,AITA For Firing An Employee After His Parents ...,I'm the VP of Sales at a software company and ...,yta


### Prepare Data with OpenAI tooling
Run the following and follow the recommended procedures for producing JSONL for training the model.

        openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

## The Classification Model

**ClassificationModel(post_title, post_contents) -> classification**

For the classification model, this requires that each training example be placed in the following format:
```
    {
        "prompt": "post_title=<post_title>, post_contents=<post_contents>",
        "completion": "<class label>"
    }
```
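
For reference, the format above can also be produced by hand with the standard `json` module. This is a minimal sketch rather than what `PostCommentData.store_data_for_training` or the `prepare_data` tool does; the function names `to_jsonl_line` and `write_jsonl` are hypothetical, and the `prepare_data` tool may suggest further adjustments to the prompts and completions.

```python
import json

def to_jsonl_line(post_title, post_content, label):
    """Serialize one training example as a single JSON line."""
    record = {
        "prompt": f"post_title={post_title}, post_contents={post_content}",
        "completion": label,
    }
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(record)

def write_jsonl(rows, path):
    """Write an iterable of (title, content, label) tuples to a JSONL file."""
    with open(path, "w", encoding="utf-8") as f:
        for title, content, label in rows:
            f.write(to_jsonl_line(title, content, label) + "\n")
```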

### Create the Model
Call:

        openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>

This will create a fine-tune job that trains `<BASE_MODEL>` on the training file.

In [236]:
# set api key from the environment instead of hardcoding the secret
# (assumes settings.env defines OPENAI_API_KEY)
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

In [237]:
openai.FineTune.list()

<OpenAIObject list at 0x7fa6fef96860> JSON: {
  "data": [],
  "object": "list"
}

In [None]:
openai.FineTune.