# Create and Fine-Tune an OpenAI Model for r/AITA
Using the r/AITA post and comment data we've already uploaded to GCP, we are going to pull that data locally and convert it into the JSONL format that OpenAI prefers. From this data we will create two models:
1. A sentiment prediction model for the classification label of the first-layer (top-level) comments
2. A text blurb generator explaining how the AI arrived at its classification

For now, the first model will be trained only on the posts and the classification labels found in their comments.

In [19]:
import openai
import sys
import importlib
import pandas as pd
import numpy as np
from pathlib import Path
from dotenv import load_dotenv
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

In [20]:
sys.path.append("..")

# load environment variables for praw to work later
load_dotenv(dotenv_path=Path("../settings.env"))

True

In [21]:
client = bigquery.Client()

In [22]:
# Remote data definitions
PROJ_NAME = "bonion"
DATASET_NAME = "AITA_dataset"
post_table_id = "{}.{}.post_table".format(PROJ_NAME, DATASET_NAME)
comment_table_id = "{}.{}.comment_table".format(PROJ_NAME, DATASET_NAME)
reply_table_id = "{}.{}.reply_table".format(PROJ_NAME, DATASET_NAME)

## Prepare Sentiment Analysis and Classification Prediction Dataset
Convert posts and their comments into the shape needed for training.

In [23]:
# get data 
query_for_post_data = """
    SELECT reddit_post_id AS post_id, post_title, post_self_text AS post_content, posts.upvotes AS post_upvotes, 
            comment_id, content AS comment_content, comments.upvotes AS comment_upvotes
    FROM {} posts JOIN {} comments ON (posts.reddit_post_id = SUBSTRING(comments.parent_id, 4, 100))
    LIMIT 4;
""".format(post_table_id, comment_table_id)

post_join_comment_df = pd.read_gbq(query_for_post_data, project_id=PROJ_NAME)

post_join_comment_df.head(4)

Unnamed: 0,post_id,post_title,post_content,post_upvotes,comment_id,comment_content,comment_upvotes
0,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4fbhf,Nta. Your kid is gonna grow up with half sibli...,11786
1,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e5me,NTA- id have cut em both out completely. They’...,4036
2,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4e513,NTA\n\nI think you handled the situation as we...,26273
3,jyk2ac,AITA for leaving the call when my brother anno...,My brother and I do not get on.\n\nWhen we wer...,22802,gd4eips,"This is a lot to unpack, but you're definitely...",2271
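
The join condition in the query relies on Reddit's "fullname" convention: a comment's `parent_id` carries a three-character type prefix, `t3_` when the parent is a post and `t1_` when the parent is another comment. `SUBSTRING(comments.parent_id, 4, 100)` drops that prefix (BigQuery positions are 1-based). Below is a minimal Python sketch of the same idea; the helper names `strip_fullname_prefix` and `is_top_level` are mine, not part of the project.

```python
def strip_fullname_prefix(fullname):
    """Drop the 3-character type prefix (e.g. "t3_") from a Reddit fullname.

    Mirrors SUBSTRING(parent_id, 4, 100): start at the 4th character
    and keep at most 100 characters.
    """
    return fullname[3:103]


def is_top_level(parent_id):
    """True when the comment's parent is a post rather than another comment."""
    return parent_id.startswith("t3_")


# "t3_" + post ID is what a top-level comment stores as its parent_id
print(strip_fullname_prefix("t3_jyk2ac"))
print(is_top_level("t3_jyk2ac"), is_top_level("t1_gd4fbhf"))
```

A reply's `parent_id` starts with `t1_` and holds a comment ID once stripped, so I would expect the join to keep only top-level comments, which is what the first model needs.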


In [26]:
post_join_comment_df.iloc[3]["comment_content"]

"This is a lot to unpack, but you're definitely NTA. \n\nHe purposely seeked out your crazy ex gf, got her pregnant, and expected you to be okay with that news? Id cut all contact with them. NTA!"

### Basic Data Modifications
Because r/AITA uses acronyms to rate each story, the model may have trouble understanding them given how niche they are. I'm going to use regex to parse each comment_content value for the classification labels:
- YTA = You're the Asshole
- YWBTA = You Would Be the Asshole
- NTA = Not the Asshole
- YWNBTA = You Would Not Be the Asshole
- ESH = Everyone Sucks Here
- NAH = No Assholes Here
- INFO = Not Enough Info

These labels will be placed in a *class* column in the df. Then I will convert every instance of these acronyms in the comment content to its full form, e.g. YTA -> You're the Asshole.

In [None]:
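# Placeholder: label extraction and acronym expansion are not implemented yet.
# iterrows() yields (index, row) pairs, so unpack them here.
for _, post_comment_row in post_join_comment_df.iterrows():
    break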
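
A minimal sketch of how the regex step described in this section might look. The names `LABEL_EXPANSIONS`, `extract_label` and `expand_labels` are hypothetical, and two choices here are my assumptions rather than settled design: matching is case-insensitive (the sample data contains "Nta."), and the first acronym found in a comment is taken as its label.

```python
import re

# Acronym -> full phrase, following the list in this section.
LABEL_EXPANSIONS = {
    "YWNBTA": "You Would Not Be the Asshole",
    "YWBTA": "You Would Be the Asshole",
    "YTA": "You're the Asshole",
    "NTA": "Not the Asshole",
    "ESH": "Everyone Sucks Here",
    "NAH": "No Assholes Here",
    "INFO": "Not Enough Info",
}

# Whole-word match so that, e.g., "YTA" is not found inside another word.
LABEL_PATTERN = re.compile(
    r"\b(" + "|".join(LABEL_EXPANSIONS) + r")\b", re.IGNORECASE
)


def extract_label(comment):
    """Return the first label acronym in the comment (upper-cased), or None."""
    match = LABEL_PATTERN.search(comment)
    return match.group(1).upper() if match else None


def expand_labels(comment):
    """Replace every label acronym in the comment with its full phrase."""
    return LABEL_PATTERN.sub(
        lambda m: LABEL_EXPANSIONS[m.group(1).upper()], comment
    )


sample = "This is a lot to unpack, but you're definitely NTA."
print(extract_label(sample))
print(expand_labels(sample))
```

On the DataFrame this could be applied with `post_join_comment_df["comment_content"].apply(extract_label)` to fill the *class* column. One caveat: case-insensitive matching will also pick up the everyday words "info" and "nah", so a stricter pattern may be needed once real comments are inspected.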

### Outline of Model Data Inputs
Below I will show the inputs and outputs of each model.

1. ClassificationModel(post_title, post_contents) -> classification 
2. TextGenerationModel(post_title, post_contents, classification) -> completion with classification explicitly stated


#### A Note about the initial TextGenerationModel
For my first "naive" model, I'm going to expand the classification labels into full words, because the acronyms might not carry much meaning for OpenAI's model.

### The Classification Model

**ClassificationModel(post_title, post_contents) -> classification**

For the first model, this would require that the data be placed in the following format:
```
    {
        "prompt":"post_title=<post_title>, post_contents=<post_contents>",
        "completion":"<class label>"
    }
```
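
As a rough sketch of how rows could be turned into that format, the code below builds one record per (post, label) pair and serialises the records as JSONL, one JSON object per line. The function names `to_classification_record` and `records_to_jsonl` are hypothetical, and the example row is placeholder text rather than real data.

```python
import json


def to_classification_record(post_title, post_content, label):
    """Build one prompt/completion pair in the format shown above."""
    return {
        "prompt": f"post_title={post_title}, post_contents={post_content}",
        "completion": label,
    }


def records_to_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    # json.dumps escapes newlines inside the post text, so every record
    # stays on a single physical line.
    return "".join(
        json.dumps(record, ensure_ascii=False) + "\n" for record in records
    )


# Placeholder row; real rows would come from post_join_comment_df.
records = [
    to_classification_record(
        "AITA for an example title?",
        "First paragraph.\n\nSecond paragraph.",
        "NTA",
    )
]
print(records_to_jsonl(records), end="")
```

The resulting string can be written to disk with `Path("train.jsonl").write_text(..., encoding="utf-8")`; the file name is only an example.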

In [25]:
# collect data: inspect the first (index, row) pair from the joined DataFrame
for post_comment_row in post_join_comment_df.iterrows():
    print(post_comment_row)
    
    break

(0, post_id                                                       jyk2ac
post_title         AITA for leaving the call when my brother anno...
post_content       My brother and I do not get on.\n\nWhen we wer...
post_upvotes                                                   22802
comment_id                                                   gd4fbhf
comment_content    Nta. Your kid is gonna grow up with half sibli...
comment_upvotes                                                11786
Name: 0, dtype: object)
