Q4 - Regex Riddles in Data Cleaning

Question: Welcome to Regex Riddles in Data Cleaning!
You are given a dataset containing customer reviews for various fantastical products.
However, the data is quite messy with inconsistent formats, typos, and random special characters.
Your task is to use regular expressions and other data cleaning techniques to complete the following tasks:

- Standardize the format of email addresses.
- Correct common typos in product names.
- Remove any special characters from the reviews.
- Extract and count the number of reviews mentioning the word "magic".
- Identify the top 3 most frequently mentioned products in the reviews.

Datasets:

customer_reviews: Contains columns (review_id, email, product_name, review_text).

In [None]:
import pandas as pd
import numpy as np
import re

# Seed for reproducibility
np.random.seed(24)

# Generate synthetic data
review_ids = np.arange(1, 21)
emails = ['user{}@example.com'.format(i) for i in range(1, 21)]
emails = [email.replace('user', 'UsEr-') if i % 2 == 0 else email for i, email in enumerate(emails)]
product_names = ['Magic Wand', 'Potion', 'Spell Book', 'Crystal Ball', 'Flying Broom']
typos = ['Magic Wnd', 'Potin', 'Spell Bok', 'Crystal Bll', 'Flyng Broom']

reviews = [
    "This {} is awesome! It's pure magic!".format(np.random.choice(product_names + typos))
    for _ in review_ids
]
special_characters = ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
reviews = [
    review + ' ' + ''.join(np.random.choice(special_characters, 3))
    for review in reviews
]

# Create DataFrame
customer_reviews = pd.DataFrame({
    'review_id': review_ids,
    'email': emails,
    'product_name': [np.random.choice(product_names + typos) for _ in review_ids],
    'review_text': reviews
})

# Display the dataset
customer_reviews.head()

In [None]:
# Standardize the format of email addresses.
customer_reviews['email'] = customer_reviews['email'].str.lower()
customer_reviews[['review_id', 'email']]
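
Lowercasing fixes the letter case, but addresses generated as `UsEr-1@example.com` still carry a hyphen that the others lack. The block below is a minimal sketch of a fuller cleanup on a small standalone sample. It assumes that the hyphen after `user` and any surrounding whitespace are unwanted; the helper name `standardize_email` and the sample addresses are hypothetical, and the shape check is deliberately simple rather than a full address validator.

```python
import re

import pandas as pd

# Hypothetical sample mirroring the messy addresses in the dataset.
emails = pd.Series(['UsEr-1@example.com', 'user2@example.com', '  USER-3@Example.COM '])


def standardize_email(email: str) -> str:
    """Trim whitespace, lowercase, and drop the hyphen that follows 'user'."""
    email = email.strip().lower()
    # Assumption: a hyphen between 'user' and the digits is noise, not part of the address.
    return re.sub(r'^user-(?=\d)', 'user', email)


cleaned = emails.apply(standardize_email)

# Simple shape check: local part, '@', and a dotted domain.
is_valid = cleaned.str.fullmatch(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

print(cleaned.tolist())
print(is_valid.tolist())
```

Applied to the notebook's data, the same helper could be passed to `customer_reviews['email'].apply(...)`.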

In [None]:
# Correct common typos in product names.
typo_corrections = {
    'Magic Wnd': 'Magic Wand',
    'Potin': 'Potion',
    'Spell Bok': 'Spell Book',
    'Crystal Bll': 'Crystal Ball',
    'Flyng Broom': 'Flying Broom',
}
customer_reviews['product_name'] = customer_reviews['product_name'].replace(typo_corrections)
customer_reviews[['review_id', 'product_name']]
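
A fixed dictionary only corrects typos that are already known. As an alternative technique (fuzzy matching, not regex, and not part of the original solution), the sketch below uses `difflib.get_close_matches` from the standard library to map a misspelled name to the closest canonical one. The helper `correct_product`, the `cutoff` of 0.8, and the extra sample name are assumptions made for illustration.

```python
from difflib import get_close_matches

import pandas as pd

canonical = ['Magic Wand', 'Potion', 'Spell Book', 'Crystal Ball', 'Flying Broom']


def correct_product(name: str, cutoff: float = 0.8) -> str:
    """Return the closest canonical name, or the input unchanged if none is close enough."""
    matches = get_close_matches(name, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else name


# Hypothetical sample: the five known typos plus one unrelated product.
names = pd.Series(['Magic Wnd', 'Potin', 'Spell Bok', 'Crystal Bll', 'Flyng Broom', 'Invisibility Cloak'])
corrected = names.map(correct_product)

print(corrected.tolist())
```

A lower `cutoff` catches more distant typos but raises the risk of mapping a genuinely different product to the wrong name, so the threshold is worth checking against real data.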

In [None]:
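# Remove any special characters from the reviews.
# [^\w\s] matches anything that is not a word character or whitespace, apostrophes included.
customer_reviews['review_text'] = customer_reviews['review_text'].str.replace(r'[^\w\s]', '', regex=True).str.strip()
customer_reviews[['review_id', 'review_text']]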
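
To see what the pattern `[^\w\s]` keeps and removes, it helps to run it on a single review. The sketch below uses a hypothetical sample string in the same shape as the generated reviews. It also shows a variant that keeps apostrophes and tidies whitespace, which is an assumption about what a reader might prefer rather than a requirement of the task.

```python
import re

sample = "This Potion is awesome! It's pure magic! #$%"

# [^\w\s] matches anything that is not a word character or whitespace.
stripped = re.sub(r'[^\w\s]', '', sample)

# Variant: keep apostrophes inside words, then collapse repeated
# whitespace and trim the ends.
kept_apostrophes = re.sub(r"[^\w\s']", '', sample)
kept_apostrophes = re.sub(r'\s+', ' ', kept_apostrophes).strip()

print(repr(stripped))          # 'This Potion is awesome Its pure magic '
print(repr(kept_apostrophes))  # "This Potion is awesome It's pure magic"
```

The first result ends with a space because the pattern removes the trailing symbols but not the whitespace in front of them.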

In [None]:
# Extract and count the number of reviews mentioning the word "magic"
magic_reviews = customer_reviews[customer_reviews['review_text'].str.contains(r'\bmagic\b', case=False)]
magic_reviews_count = magic_reviews.shape[0]
magic_reviews_count
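
The `\b` anchors restrict the match to the whole word, so "magical" does not count as a mention of "magic". The sketch below, on a hypothetical three-row sample, contrasts the number of reviews that mention the word with the total number of occurrences, since a single review can contain it more than once.

```python
import pandas as pd

# Hypothetical sample reviews, already stripped of special characters.
reviews = pd.Series([
    'This Magic Wand is awesome Its pure magic',
    'A magical evening nothing more',
    'No enchantment here',
])

# Rows where "magic" appears as a whole word, ignoring case.
mentions_magic = reviews.str.contains(r'\bmagic\b', case=False, regex=True)
rows_with_magic = int(mentions_magic.sum())

# Total whole-word occurrences across all rows.
total_occurrences = int(reviews.str.count(r'(?i)\bmagic\b').sum())

print(rows_with_magic)    # 1
print(total_occurrences)  # 2
```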

In [None]:
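# Identify the top 3 most frequently mentioned products in the reviews.
# The review text still contains misspelled names, so match typos as well and map them to the correct name.
pattern = '({})'.format('|'.join(map(re.escape, product_names + typos)))
product_mentions = customer_reviews['review_text'].str.extractall(pattern)[0].replace(typo_corrections)
top_products = product_mentions.value_counts().head(3)
top_products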
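
When one product name is a prefix of another, the order of alternatives in the pattern matters, because the regex engine takes the first alternative that matches. The sketch below uses a hypothetical product list and hypothetical reviews (neither is in the dataset) to show one way to handle this: sort the names longest first and escape them with `re.escape` before joining.

```python
import re

import pandas as pd

# Hypothetical product list in which one name is a prefix of another.
names = ['Spell', 'Spell Book', 'Potion']
reviews = pd.Series([
    'This Spell Book is awesome',
    'This Potion beats any Spell',
    'This Spell Book rocks',
])

# Longest names first, so 'Spell Book' is tried before 'Spell'; re.escape
# guards against names that contain regex metacharacters.
ordered = sorted(names, key=len, reverse=True)
pattern = '({})'.format('|'.join(re.escape(name) for name in ordered))

# extractall returns one row per match; column 0 holds the captured name.
mentions = reviews.str.extractall(pattern)[0]
counts = mentions.value_counts()

print(counts)
```

With the names in their original order, `Spell` would match first and `Spell Book` would never be counted.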