**EDA**

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kritanjalijain/amazon-reviews")
path += "/test.csv"
print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/amazon-reviews/test.csv


In [4]:
import pandas as pd
from tabulate import tabulate

df = pd.read_csv(path)
df.columns = ['Label', 'Review', "Detailed Review"]

In [5]:
df.isnull().sum()

Label               0
Review             24
Detailed Review     0


In [6]:
# Among the rows with a missing 'Review', show the longest 'Detailed Review'.
null_reviews = df[df['Review'].isnull()]
longest_review_null = null_reviews["Detailed Review"].str.len().idxmax()
print(null_reviews.loc[longest_review_null, "Detailed Review"])
print("Length of the longest review:", len(null_reviews.loc[longest_review_null, "Detailed Review"]))

Some good rocking moments here and there (especially on "North Berwick Witch Trials" and "Oro the Manslayer"), but other than that Cathedral's style of Sabbath-isms combined with a sort of Southern Hard Rock groove can easily grow tired. "The Garden" also has its moments and strives to be different, but at 27 minutes long, it is just too long and sounds like an experiment gone wrong. Listening to some parts of this record reminded me of the funny tough-guy attitude brought by bands like Crowbar, which does not surprise considering who the producer is here. Cathedral rocks hard, but not in a way to maintain someone's interest for more than 5 minutes.
Length of the longest review: 657


In [7]:
df['Review'] = df['Review'].fillna(df['Detailed Review'])
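After this fill, rows that had no title carry the full review body in `Review`, which is why the longest `Review` reported in the next cell is the same 657-character text. A minimal sketch, on a made-up three-row frame (`df_demo` and `title_missing` are illustrative names, not part of the notebook), of recording which titles were missing so that title-length statistics can leave out the filled rows:

```python
import pandas as pd

# Hypothetical data standing in for the real dataset.
df_demo = pd.DataFrame({
    "Review": ["Good", None, "Bad value"],
    "Detailed Review": [
        "Liked it.",
        "A long body text used as a fallback title.",
        "Broke fast.",
    ],
})

# Remember which titles were missing before filling them.
title_missing = df_demo["Review"].isnull()
df_demo["Review"] = df_demo["Review"].fillna(df_demo["Detailed Review"])

# Length statistics restricted to titles that existed originally.
original_title_len = df_demo.loc[~title_missing, "Review"].str.len()
print("Longest original title:", original_title_len.max())
print("Longest title after fill:", df_demo["Review"].str.len().max())
```

Keeping the mask around also makes it possible to treat the filled rows differently later, for example by truncating them to the title length limit.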

In [8]:
# Find the longest string in df['Review'] (missing titles were filled from 'Detailed Review' above)

longest_review = df['Review'].str.len().idxmax()
print(df.loc[longest_review, 'Review'])
print("Length of the longest review:", len(df.loc[longest_review, 'Review']))

Some good rocking moments here and there (especially on "North Berwick Witch Trials" and "Oro the Manslayer"), but other than that Cathedral's style of Sabbath-isms combined with a sort of Southern Hard Rock groove can easily grow tired. "The Garden" also has its moments and strives to be different, but at 27 minutes long, it is just too long and sounds like an experiment gone wrong. Listening to some parts of this record reminded me of the funny tough-guy attitude brought by bands like Crowbar, which does not surprise considering who the producer is here. Cathedral rocks hard, but not in a way to maintain someone's interest for more than 5 minutes.
Length of the longest review: 657


In [9]:
longest_review = df['Detailed Review'].str.len().idxmax()
print(df.loc[longest_review, 'Detailed Review'])
print("Length of the longest review:", len(df.loc[longest_review, 'Detailed Review']))

Originally had a Cuisinart thermal, 12 cup. This coffee maker was $100 and I went through three of them in two years. The thermos was great but the coffee pot is designed terrible. I figured "how could I go wrong with a Mr. Coffee?" They were the first and should be reliable. I bought their Thermal Pot for about $60 and it lasted two months. One pot of coffee took almost thirty minutes to complete. The steam that came out of the top was unreal! The clock stopped working due to the excessive moisture and I took it back and got another one from Walmart. The second one worked great for two weeks and then started with the excessive steam problem again, although not as bad as the first pot. The beeper is almost too quiet. Some reviewers have said it is too loud. Not mine! I am really disappoint that Mr. Coffee would sell a product that is this inconsistent in performace. Their marketing director must have been a castaway from Microsoft. Apparently it is a good pot if you get one that works 

In [10]:
# Average character length of df['Detailed Review']

average_length = df['Detailed Review'].str.len().mean()
print("Average length of detailed reviews:", average_length)

Average length of detailed reviews: 404.8999022497556


In [11]:
# 75th percentile of character length in df['Detailed Review']

quantile_75 = df['Detailed Review'].str.len().quantile(0.75)
print("75th quantile of detailed review length:", quantile_75)

75th quantile of detailed review length: 565.0


In [12]:
# Average character length of df['Review']

average_length = df['Review'].str.len().mean()
print("Average length of reviews:", average_length)

Average length of reviews: 24.54189885474714


In [13]:
print(df['Review'].str.len().quantile(0.75))

32.0
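The lengths measured so far are character counts. As a rough sketch on made-up data (the `reviews` series below is hypothetical), whitespace-separated token counts can be measured alongside them. A token count can never exceed the character count of the same string, so character-based percentiles give a conservative upper bound on sequence length.

```python
import pandas as pd

# Hypothetical review titles, not taken from the dataset.
reviews = pd.Series([
    "Great product",
    "Not worth the money",
    "Works as described and arrived early",
])

char_len = reviews.str.len()            # characters per review
token_len = reviews.str.split().str.len()  # whitespace-separated tokens per review

summary = pd.DataFrame({"chars": char_len, "tokens": token_len})
print(summary)
print(summary.quantile(0.75))
```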


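Sequence lengths, chosen slightly above the 75th-percentile character lengths measured above (565.0 and 32.0):

Token length for Detailed Review = 600

Token length for Review = 50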

**Tokenizer**

In [14]:
D_size = 600
R_size = 50
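One way these limits might be applied, sketched under the assumption that an encoded review is a list of integer IDs and that one ID is reserved for padding (`PAD_ID` and `pad_or_truncate` are hypothetical names, not part of the notebook), is to cut or right-pad every sequence to the chosen size. If token IDs start at 0, as with `enumerate`, the padding ID would need to be kept out of that range.

```python
D_size = 600
R_size = 50
PAD_ID = 0  # assumption: ID 0 is reserved for padding

def pad_or_truncate(ids, size, pad_id=PAD_ID):
    """Return a list of exactly `size` IDs: cut long inputs, right-pad short ones."""
    clipped = list(ids[:size])
    return clipped + [pad_id] * (size - len(clipped))

short = pad_or_truncate([5, 9, 2], R_size)        # padded up to 50
long = pad_or_truncate(range(1, 1000), D_size)    # cut down to 600
print(len(short), len(long))
```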

In [24]:
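import regex as re

# GPT-4 style pre-tokenization pattern; needs the `regex` module for \p{...} classes and possessive quantifiers.
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

class Tokenizer:
    def __init__(self):
        self.lookup = {}    # token string -> UTF-8 bytes
        self.merge = False  # set when an input has more characters than vocab_size

    def clean(self, token):
        # Replace every character that is not a letter, digit, or whitespace with a space.
        cleaner = r'[^\p{L}\p{N}\s]'
        return re.sub(cleaner, ' ', token)

    def train(self, train: str, vocab_size: int):
        # Only sets the flag; no merge step is implemented in this draft.
        if len(train) > vocab_size:
            self.merge = True

        cleaned_text = self.clean(train)
        token_pattern = re.compile(GPT4_SPLIT_PATTERN)
        token_list = re.findall(token_pattern, cleaned_text)

        # Register each previously unseen token with its UTF-8 encoding.
        for token in token_list:
            if token not in self.lookup:
                self.lookup[token] = token.encode("UTF-8")

    def encode(self, token):
        # Placeholder: encoding is not implemented in this draft.
        pass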




In [None]:
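import regex as re
from tabulate import tabulate  # used only for the verbose vocabulary table

# GPT-4 style pre-tokenization pattern; needs the `regex` module for \p{...} classes and possessive quantifiers.
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

class Tokenizer:
    def __init__(self):
        self.lookup = {}  # token string -> integer ID
        self.merge = False
        self.clean_pattern = r'[^\p{L}\p{N}\s]'
        self.vocab_size = 0

    def clean(self, token):
        # Replace every character that is not a letter, digit, or whitespace with a space.
        return re.sub(self.clean_pattern, ' ', token)

    def train(self, train: str, vocab_size: int, verbose: bool = False):
        # Note: every call rebuilds self.lookup from scratch, so pass the whole
        # corpus as one string rather than calling train() once per row.

        # Flag inputs with more characters than vocab_size; the flag is not used yet.
        if len(train) > vocab_size:
            self.merge = True

        cleaned_text = self.clean(train)

        token_pattern = re.compile(GPT4_SPLIT_PATTERN)
        token_list = re.findall(token_pattern, cleaned_text)

        if verbose:
            print("Cleaned text:", cleaned_text)
            print("Tokens:", token_list)

        # Deduplicate and sort so that token IDs are deterministic.
        unique_tokens = sorted(set(token_list))

        # Truncation keeps the first vocab_size tokens in sort order, not the most frequent ones.
        if len(unique_tokens) > vocab_size:
            unique_tokens = unique_tokens[:vocab_size]

        self.lookup = {token: idx for idx, token in enumerate(unique_tokens)}
        self.vocab_size = len(self.lookup)

        if verbose:
            print("Vocabulary:", self.lookup)
            print("Vocabulary size:", self.vocab_size)
            vocab_table = [[token, idx] for token, idx in self.lookup.items()]
            print(tabulate(vocab_table, headers=["Token", "ID"], tablefmt="grid"))

        return self.lookup

    def encode(self, text: str):
        # Apply the same cleaning and splitting as in train().
        cleaned_text = self.clean(text)

        token_pattern = re.compile(GPT4_SPLIT_PATTERN)
        token_list = re.findall(token_pattern, cleaned_text)

        # Tokens missing from the vocabulary are mapped to -1.
        encoded = [self.lookup.get(token, -1) for token in token_list]

        return encoded, token_list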
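The truncation in `train` keeps the first `vocab_size` tokens in sorted order. A common alternative, shown here as a minimal sketch rather than the notebook's method, keeps the most frequent tokens instead. It uses the standard-library `re` with a simplified `\w+` pattern in place of `GPT4_SPLIT_PATTERN`, and the names `build_vocab`, `encode`, and the `<unk>` entry are illustrative assumptions.

```python
import re
from collections import Counter

UNK = "<unk>"  # assumption: a reserved entry for out-of-vocabulary tokens

def build_vocab(texts, vocab_size):
    """Give ID 0 to UNK and the following IDs to the most frequent tokens."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text))  # simplified split, not the GPT-4 pattern
    lookup = {UNK: 0}
    for token, _ in counts.most_common(vocab_size - 1):
        lookup[token] = len(lookup)
    return lookup

def encode(lookup, text):
    """Map each token to its ID, falling back to the UNK ID for unseen tokens."""
    return [lookup.get(token, lookup[UNK]) for token in re.findall(r"\w+", text)]

corpus = ["great book", "great price", "bad book", "great value"]
vocab = build_vocab(corpus, vocab_size=3)
print(vocab)
print(encode(vocab, "great book cover"))
```

Reserving an explicit unknown entry keeps every encoded ID inside the vocabulary range, which is convenient if the IDs later index an embedding table.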


In [25]:
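tokenizer = Tokenizer()

# Feed every review title to the tokenizer, one row at a time.
# R_size (50) is passed as vocab_size here.
for review in df['Review']:
    tokenizer.train(review, R_size)

In [26]:
print(tokenizer.lookup)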
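The second `Tokenizer.train` above assigns `self.lookup` afresh on each call, so calling it once per row would leave only the last row's tokens. A minimal sketch of an accumulating alternative, using a simplified `\w+` split as a stand-in for `GPT4_SPLIT_PATTERN` and a hypothetical helper named `update_vocab`:

```python
import re

def update_vocab(lookup, text):
    """Add unseen tokens from `text` to `lookup`, giving each the next free ID."""
    for token in re.findall(r"\w+", text):  # simplified split, not the GPT-4 pattern
        lookup.setdefault(token, len(lookup))
    return lookup

lookup = {}
for row in ["good book", "bad book", "good value"]:
    update_vocab(lookup, row)
print(lookup)
```

IDs assigned this way depend on the order in which rows are seen, so the rows should be processed in a fixed order if reproducible IDs matter.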

