In [35]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [42]:
random_seed = 42  # used for reproducibility

In [20]:
df = pd.read_json("data/Books_10k.jsonl", lines=True)

Now that the data is loaded, I will look at a few rows to understand what kind of data I am working with.

In [21]:
print(df.shape)

(10000, 10)


In [22]:
print(df.columns)
print(df.iloc[[0]])
print(df.iloc[[5000]])

Index(['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id',
       'timestamp', 'helpful_vote', 'verified_purchase'],
      dtype='object')
   rating                                          title  \
0       1  Not a watercolor book! Seems like copies imo.   

                                                text  \
0  It is definitely not a watercolor book.  The p...   

                                              images        asin parent_asin  \
0  [{'small_image_url': 'https://m.media-amazon.c...  B09BGPFTDB  B09BGPFTDB   

                        user_id               timestamp  helpful_vote  \
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ 2022-01-17 06:06:38.485             0   

   verified_purchase  
0               True  
      rating             title  \
5000       4  Informative Text   

                                                   text images        asin  \
5000  This is a good book if you want to learn all a...     []  0071441964   

     parent_asin            

After seeing the available columns and a few examples, I decide that the obvious target here is the rating.   
In my examples, the first review, rated 1, reads as negative: the user is disappointed that the book is not a watercolor book. The review rated 4, by contrast, describes the book as good.

So I assume the rating is given by the user and reflects their sentiment towards the product. I decide that rating will be my target variable.  
As input I will use the text column, since that is where the main body of the review is stored.

Now that I have noted my reasoning, so that I or others can later review it for biases, I will start looking at the distribution of my data.

In [24]:
df["rating"].value_counts()

rating
1    2000
5    2000
4    2000
3    2000
2    2000
Name: count, dtype: int64

In [37]:
df["length"] = df["text"].str.len()
df["length"].describe()

count    10000.000000
mean       871.326200
std       1145.227031
min          2.000000
25%        152.000000
50%        463.500000
75%       1150.000000
max      15674.000000
Name: length, dtype: float64

I see that the ratings are perfectly balanced (2,000 reviews per rating), so there is no need to correct for class imbalance.
I should, however, ensure that texts of various lengths end up in the test set, so I bin the reviews by the dataset's overall length quartiles and sample from every combination of rating and length bin to get a representative distribution.

In [38]:
df["length_group"], bins = pd.qcut(
    df["length"], q=4, labels=False, retbins=True, duplicates="drop"
)
np.set_printoptions(suppress=True, formatter={"float_kind": "{:,.0f}".format})
print(bins)

[2 152 464 1,150 15,674]
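To make the binning step easier to follow, here is a minimal sketch of what `pd.qcut` does with `q=4` and `labels=False`. The values in `toy_lengths` are made up for illustration and are not taken from the review data.

```python
import pandas as pd

# Hypothetical toy lengths, not taken from the review data.
toy_lengths = pd.Series([2, 10, 50, 120, 300, 460, 900, 1500])

# q=4 cuts at the 25%, 50% and 75% quantiles.
# labels=False returns integer bin codes (0 to 3) instead of interval labels.
# retbins=True also returns the bin edges, like `bins` in the cell above.
toy_groups, toy_bins = pd.qcut(toy_lengths, q=4, labels=False, retbins=True)

print(toy_groups.tolist())                              # bin code per row
print(toy_groups.value_counts().sort_index().tolist())  # rows per bin
print(toy_bins)                                         # five edges for four bins
```

Because the edges are quantiles of the data itself, each bin receives roughly the same number of rows, which is what makes the bins usable as a stratification key.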


In [32]:
df.iloc[[0]]

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,length,length_group
0,1,Not a watercolor book! Seems like copies imo.,It is definitely not a watercolor book. The p...,[{'small_image_url': 'https://m.media-amazon.c...,B09BGPFTDB,B09BGPFTDB,AFKZENTNBQ7A7V7UXW5JJI6UGRYQ,2022-01-17 06:06:38.485,0,True,1427,3


Now it is simply a matter of doing the split and saving my test and training data.
The test data will be used to evaluate models at test time, while the training data will later be split further into a training and a validation set and used for model training.

In [None]:
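# Sample 20% of every (rating, length_group) combination for the test set.
test_df = df.groupby(["rating", "length_group"], group_keys=False).apply(
    lambda x: x.sample(frac=0.2, random_state=random_seed)
)

# Everything that was not sampled into the test set becomes training data.
train_df = df.drop(test_df.index)
print(f"test set length = {len(test_df)}")
print(f"train set length = {len(train_df)}")

test set length = 2001
train set length = 7999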
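As a sanity check, I could compare how the rows are distributed over the rating and length bins in each part of the split. The sketch below uses a synthetic frame (`toy`) rather than the real reviews, and it uses `groupby(...).sample(...)` instead of the `apply` call from my cell, so it illustrates the idea and is not the exact split code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the review data: 5 ratings x 4 length groups.
toy = pd.DataFrame({
    "rating": rng.integers(1, 6, size=2000),
    "length_group": rng.integers(0, 4, size=2000),
})

# Stratified 20% sample from every (rating, length_group) cell.
toy_test = toy.groupby(["rating", "length_group"]).sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)


def cell_shares(frame):
    """Share of rows that fall in each (rating, length_group) cell."""
    return frame.groupby(["rating", "length_group"]).size() / len(frame)


comparison = pd.DataFrame({
    "full": cell_shares(toy),
    "train": cell_shares(toy_train),
    "test": cell_shares(toy_test),
})

# Largest absolute difference between the test shares and the full-data shares.
max_gap = (comparison["test"] - comparison["full"]).abs().max()
print(comparison.round(3))
print(f"largest test vs full gap: {max_gap:.4f}")
```

If the three columns are close for every cell, the split has kept the joint distribution of rating and text length. Small gaps are expected, because 20% of a cell rarely comes out as a whole number of rows.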




I now have a split that takes the distribution of both rating and text length into account, and I am ready to start building a training pipeline and selecting a model.
The last thing I will do in this notebook is save my dataset split.

In [None]:
train_df.to_json("data/train.jsonl", orient="records", lines=True, force_ascii=False)
test_df.to_json("data/test.jsonl", orient="records", lines=True, force_ascii=False)
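For the later training notebook, the saved file could be read back and split again into training and validation parts. This is only a sketch under assumptions: `toy_train` is a made-up stand-in for `train_df`, the file is written to a temporary directory instead of `data/`, and the 20% validation fraction is my guess, not a decision made in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for train_df; the real frame has more columns.
toy_train = pd.DataFrame({
    "rating": [1, 2, 3, 4, 5] * 20,
    "text": [f"review {i}" for i in range(100)],
    "length_group": [0, 1, 2, 3] * 25,
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    # Same writer arguments as the save cell.
    toy_train.to_json(path, orient="records", lines=True, force_ascii=False)
    # lines=True is needed again when reading, as with the original load.
    reloaded = pd.read_json(path, lines=True)

# Stratify on the same two keys used for the test split.
# frac=0.2 is an assumed validation fraction.
val_part = reloaded.groupby(["rating", "length_group"]).sample(
    frac=0.2, random_state=42
)
train_part = reloaded.drop(val_part.index)
print(len(train_part), len(val_part))
```

Note that `orient="records"` does not store the index, so the reloaded frame gets a fresh 0-based index. That is fine here because the validation rows are dropped by the index of the reloaded frame.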