# 01 - Data Preparation
This notebook loads the raw JSON dataset, transforms it into a flat table (one comment per row), and takes a first look at the clickbait vs. non-clickbait distribution.

## Setup
First, import pandas and the `create_dataframe` helper from `data_preparation`, then set the display options.

In [1]:
import pandas as pd
from src.data.data_preparation import create_dataframe

# Show every column and widen the output so wide frames are not truncated
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Load and Prepare Data
Load the raw JSON and convert it into a pandas DataFrame. The `create_dataframe` helper splits each video's list of comments into separate rows, one comment per row, and renames the comment column to `text`.


In [5]:
raw_path = "../data/raw/youtube_data.json"
data_frame = create_dataframe(raw_path)
print("Total comments:", len(data_frame))
print("Unique videos:", data_frame.video_id.nunique())

Total comments: 66669
Unique videos: 20
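
The source of `create_dataframe` is not shown in this notebook. As a rough illustration of the reshaping described above, the sketch below uses a hypothetical stand-in, `create_dataframe_sketch`, and assumes each raw record stores its comments as a list under a `comments` key. Both the function name and the `comments` key are assumptions; the real helper and raw schema may differ.

```python
import pandas as pd


def create_dataframe_sketch(records):
    """Hypothetical stand-in for create_dataframe (not the project's code).

    Assumes each record holds its comments as a list under "comments".
    """
    frame = pd.DataFrame(records)
    # One row per comment; the other columns are repeated for each comment.
    frame = frame.explode("comments", ignore_index=True)
    # Match the column name used in this notebook.
    return frame.rename(columns={"comments": "text"})


# Invented sample records, only to show the shape of the result.
sample_records = [
    {"video_id": "vid_a", "clickbait": True,
     "metadata": {"title": "A"}, "comments": ["first", "second"]},
    {"video_id": "vid_b", "clickbait": False,
     "metadata": {"title": "B"}, "comments": ["third"]},
]

sample_frame = create_dataframe_sketch(sample_records)
print(sample_frame[["video_id", "clickbait", "text"]])
```

Two input records with three comments in total become three rows, each carrying its video's `video_id`, `clickbait`, and `metadata` values.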


## Quick Look at the Data

In [6]:
data_frame.head()

| | video_id | clickbait | metadata | text |
|---|---|---|---|---|
| 0 | 9bscSY_FOOA | True | `{'title': 'How To Be In A Calorie Deficit With...` | `Total bs, same advice which all doctors have b...` |
| 1 | 9bscSY_FOOA | True | `{'title': 'How To Be In A Calorie Deficit With...` | `I keep my calories under 1500 , my daily meal ...` |
| 2 | 9bscSY_FOOA | True | `{'title': 'How To Be In A Calorie Deficit With...` | `What if I only eat one meal and don’t snack th...` |
| 3 | 9bscSY_FOOA | True | `{'title': 'How To Be In A Calorie Deficit With...` | `You guys can skip breakfast, if it&#39;s okay ...` |
| 4 | 9bscSY_FOOA | True | `{'title': 'How To Be In A Calorie Deficit With...` | `In my experience it’s too easy to cheat if you...` |
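
Row 3 of the preview contains the HTML entity `&#39;`, and some of the sampled comments further down contain `<br>` tags, so the raw text appears to be HTML-escaped. This notebook does not clean the text. If a later step needs plain text, one possible approach is sketched below, where `clean_comment` is a hypothetical helper that is not part of `src.data`.

```python
import html
import re


def clean_comment(text):
    """Hypothetical helper: turn an HTML-escaped comment into plain text."""
    # Replace <br>, <br/> and <br /> line-break tags with a space.
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    # Decode entities such as &#39; or &amp; into their characters.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the replacements.
    return re.sub(r"\s+", " ", text).strip()


print(clean_comment("My arms can&#39;t handle it"))
print(clean_comment("Cloe= 30 second exercise <br>Me= skips last 3 seconds"))
```

With pandas, the same function could be applied column-wise, for example `data_frame.text.map(clean_comment)`.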


## Clickbait Distribution

In [7]:
count = data_frame.clickbait.value_counts()
print(count)
print((count / count.sum() * 100).round(1).astype(str) + "%")

clickbait
False    41401
True     25268
Name: count, dtype: int64
clickbait
False    62.1%
True     37.9%
Name: count, dtype: object
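
The percentages above are computed per comment. Because `clickbait` sits next to `video_id` on every row, it appears to be a video-level label, so it may also be worth checking how the videos themselves split between the two classes. A minimal sketch on a toy frame follows. The values are invented and only the column names come from the table above.

```python
import pandas as pd

# Toy stand-in for data_frame; the values are invented for illustration.
toy_frame = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "c", "c"],
    "clickbait": [True, True, True, False, False, False],
})

# Share of comments per class, as fractions that sum to 1.
comment_share = toy_frame.clickbait.value_counts(normalize=True)

# Number of distinct videos behind each class.
videos_per_class = toy_frame.groupby("clickbait").video_id.nunique()

print((comment_share * 100).round(1))
print(videos_per_class)
```

`value_counts(normalize=True)` returns the class shares directly, which avoids dividing by `count.sum()` by hand.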


## Sample Comments

In [9]:
# show a few positive and negative examples
print("=== Clickbait examples ===")
print(data_frame[data_frame.clickbait].text.sample(5, random_state=1).tolist())

print("\n=== Non-clickbait examples ===")
print(data_frame[~data_frame.clickbait].text.sample(5, random_state=1).tolist())


=== Clickbait examples ===
['I’m dying', 'My arms can&#39;t handle it 😩', 'I hardly ever eat carbs and I’m struggling to lose weight I barely even eat anything and I still can’t lose weight all I drink is water and I have my morning coffee', 'Cloe= 30 second exercise <br>Me= skips last 3 seconds', 'I just wanna reduce carbs since I notice that they&#39;re so high in calories.']

=== Non-clickbait examples ===
['i think this is the old spice guy when he was a baby.', 'Like the before', '@mariannevinas123 yeah it does. ive already got an ok 6 pack but the lower abs is my problem and i can honestly say in 12 times of me doing this workout every other day ive seen fast results all over :) so id keep on this and give it chance to work ', 'What if you can&#39;t do a full body workout..?? I have fibromyalgia, body pain, chronic pain, DDD, Scoliosis of the spine, etc. If i try to do basic stretching, my muscles start hurting right away. It&#39;s been ALONG TIME since I&#39;ve slept 9-6. <br>I 