# Data Preparation



In [1]:
# Import necessary libraries
import pandas as pd
from transformers import BertTokenizer



In [2]:
# Load dataset from csv file in data folder
df = pd.read_csv("../data/SPOTIFY_REVIEWS.csv")

### Data Filtering

The dataset was restricted to reviews dated between November 15, 2019 and November 15, 2023, a four-year window that touches five calendar years (2019 to 2023), so that the analysis reflects recent trends.

In [3]:
# Keep reviews dated 2019-11-15 through 2023-11-15 (a four-year window).
# Note: the string '2023-11-15' is parsed as midnight, so reviews posted later
# that day are excluded if the timestamps carry a time of day.
df['review_timestamp'] = pd.to_datetime(df['review_timestamp'])
df2 = df[(df['review_timestamp'] >= '2019-11-15') & (df['review_timestamp'] <= '2023-11-15')]

# Display the number of reviews before and after filtering
print("Original:", len(df))
print("After date filter:", len(df2))

Original: 3377423
After date filter: 1711607
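
The behavior of the upper bound `<= '2023-11-15'` can be checked on a few hand-made timestamps. This is a minimal sketch with invented values: the `ts` series is hypothetical, and whether `review_timestamp` in the dataset carries a time of day is an assumption. It compares the string-based bound with one alternative that keeps the whole final day.

```python
import pandas as pd

# Hypothetical timestamps around the upper bound (not taken from the dataset)
ts = pd.Series(pd.to_datetime([
    "2023-11-14 23:59:59",
    "2023-11-15 00:00:00",
    "2023-11-15 08:30:00",
]))

# Comparing against a date string uses midnight at the start of that day
up_to_midnight = ts <= "2023-11-15"

# One possible alternative: keep everything before the start of the next day
whole_day = ts < pd.Timestamp("2023-11-16")

print(up_to_midnight.tolist())
print(whole_day.tolist())
```

If the timestamps are date-only, both bounds select the same rows.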


Earlier exploratory analysis revealed that some reviews contained unusually long words. To improve data quality, we excluded every review containing a whitespace-delimited word of 15 or more characters.

In [4]:
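# Flag reviews that contain a very long word
def has_very_long_word(text, max_len=15):
    """Return True if any whitespace-separated word has max_len or more characters."""
    # Non-string values (e.g. NaN) are never flagged, so the filter keeps them
    if not isinstance(text, str):
        return False
    return any(len(w) >= max_len for w in text.split())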

# Filter out reviews with very long words
mask_long = df2["review_text"].apply(has_very_long_word)
df_clean = df2[~mask_long].copy()

# Display the number of reviews after filtering
print("After filtering long words:", len(df_clean))

After filtering long words: 1669701
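
As a possible alternative to the row-wise `apply`, the same rule can be sketched with a vectorized regular expression. This is an illustration on made-up reviews, not the method used above: the `reviews` series is hypothetical, and the pattern `\S{15,}` (a run of 15 or more non-whitespace characters) is assumed to match the whitespace-splitting rule for ordinary text.

```python
import pandas as pd

def has_very_long_word(text, max_len=15):
    # Same rule as in the notebook: flag words with max_len or more characters
    if not isinstance(text, str):
        return False
    return any(len(w) >= max_len for w in text.split())

# Hypothetical reviews: a 14-character word, a 15-character word, a missing value
reviews = pd.Series(["great app", "so " + "x" * 14, "so " + "x" * 15, None])

# Row-wise version
mask_apply = reviews.apply(has_very_long_word)

# Vectorized version; na=False treats missing reviews as not flagged
mask_regex = reviews.str.contains(r"\S{15,}", na=False)

print(mask_apply.tolist())
print(mask_regex.tolist())
```

On a dataset with millions of rows the vectorized form may be faster, but that is worth measuring rather than assuming.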


### Review Length Segmentation

The variables `raw_word_count`, `length_type`, and `length_type2` were created to categorize review length for further analysis. The four-category `length_type` allows for granular examination, which is useful for isolating reviews of a suitable complexity for specific tasks; for instance, very short reviews may lack the textual depth required for reliable topic modeling. The binary `length_type2` enables a direct comparison between short and long reviews, facilitating the development and evaluation of separate predictive models for each group.

In [5]:
# Calculate raw word count and create length type categories
df_clean['raw_word_count'] = df_clean['review_text'].str.split().str.len()
df_clean['length_type'] = pd.cut(df_clean['raw_word_count'], 
                                bins=[0, 3, 6, 10, float('inf')], 
                                labels=['Very short', 'Short', 'Medium', 'Long'])
df_clean['length_type2'] = pd.cut(df_clean['raw_word_count'], 
                                bins=[0, 6, float('inf')], 
                                labels=['Short','Long'])
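
To make the bin edges concrete, the sketch below applies the same `pd.cut` calls to a handful of hypothetical word counts (the `word_counts` values are invented for illustration). Because `pd.cut` uses right-closed intervals by default, a count of 3 falls in "Very short", a count of 6 in "Short", and a count of 0 falls outside every bin and becomes `NaN`.

```python
import pandas as pd

# Hypothetical word counts chosen to sit on and around the bin edges
word_counts = pd.Series([0, 3, 4, 6, 7, 10, 11])

# Four-category split: (0, 3], (3, 6], (6, 10], (10, inf)
length_type = pd.cut(word_counts,
                     bins=[0, 3, 6, 10, float("inf")],
                     labels=["Very short", "Short", "Medium", "Long"])

# Binary split: (0, 6], (6, inf)
length_type2 = pd.cut(word_counts,
                      bins=[0, 6, float("inf")],
                      labels=["Short", "Long"])

print(pd.DataFrame({"words": word_counts,
                    "length_type": length_type,
                    "length_type2": length_type2}))
```

Reviews with zero words, if any remain after cleaning, would therefore need separate handling before modeling.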

### Tokenization

The BERT tokenizer (`bert-base-uncased`) was used to segment review text into subword units; for example, "masti" is split into "mast" and "##i". This lets rare words that are common in informal user reviews, such as slang, misspellings, and product-specific terms, be represented as sequences of known pieces. Text that cannot be decomposed into vocabulary pieces is still mapped to `[UNK]`.

In [None]:
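# Create the BERT tokenizer and preview the first five reviews
# Note: astype(str) turns missing reviews (NaN) into the literal string "nan"
df_clean["review_text"] = df_clean["review_text"].astype(str)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Iterate by position: after filtering, the index labels no longer start at 0,
# so df_clean.loc[i, ...] with i in range(5) can raise a KeyError
for i, text in enumerate(df_clean["review_text"].head(5), start=1):
    tokens = tokenizer.tokenize(text)
    print(f"\n--- Review {i} ---")
    print("Original:", text)
    print("Tokens:", tokens)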

In [7]:
# Tokenize the review text using BERT tokenizer
df_clean["tokens"] = df_clean["review_text"].apply(lambda x: tokenizer.tokenize(str(x)))

In [8]:
# Display the first few tokenized reviews
df_clean["tokens"].head()

1663991    [i, love, the, fact, that, i, can, listen, to,...
1663992                                       [awesome, app]
1663993                                      [really, [UNK]]
1663994                                           [love, it]
1663995                                   [liked, mast, ##i]
Name: tokens, dtype: object
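
The `mast`, `##i` and `[UNK]` tokens in the output above follow from greedy longest-match-first subword splitting. The sketch below is a simplified illustration of that idea with a tiny hypothetical vocabulary (`toy_vocab` is invented). It is not the `transformers` implementation, which also lowercases, splits punctuation, and uses the full `bert-base-uncased` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"liked", "really", "mast", "##i"}

print(wordpiece("masti", toy_vocab))
print(wordpiece("liked", toy_vocab))
print(wordpiece("zzz", toy_vocab))
```

A word is split only when it is missing from the vocabulary as a whole, which is why frequent words such as "liked" stay intact.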

In [11]:
# Save the cleaned and tokenized DataFrame to a CSV file
df_clean.to_csv("../data/SPOTIFY_REVIEWS_tokens.csv", index=False)
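
One caveat when reloading this file: CSV has no list type, so each entry in the `tokens` column is written as its string representation and comes back as a plain string. The sketch below shows one way to restore the lists with `ast.literal_eval`. It uses a small hypothetical frame and an in-memory buffer rather than the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame standing in for df_clean
demo = pd.DataFrame({"tokens": [["awesome", "app"], ["liked", "mast", "##i"]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Plain read: each cell is a string such as "['awesome', 'app']"
as_text = pd.read_csv(io.StringIO(csv_text))

# Read with a converter that parses each cell back into a Python list
restored = pd.read_csv(io.StringIO(csv_text),
                       converters={"tokens": ast.literal_eval})

print(type(as_text["tokens"].iloc[0]).__name__)
print(restored["tokens"].iloc[1])
```

A binary format that preserves nested types would avoid the parsing step, though whether it suits the rest of this pipeline is not something the notebook specifies.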