Yelp's website, Yelp.com, is a crowd-sourced local business review and social networking site.[6] The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can review the products or services offered using a one-to-five-star rating scale.

<b>Problem Statement</b>

In a competitive market like the restaurant industry, understanding the factors that 
influence business success is crucial for stakeholders. Utilizing the Yelp dataset, 
this project aims to investigate the relationship between user engagement 
(reviews, tips, and check-ins) and business success metrics (review count, ratings) for restaurants.

<b>Data Overview</b>

This project uses a subset of the Yelp dataset, which contains information about businesses and user engagement from eight metropolitan areas in the USA and Canada. The original data is provided by Yelp in five separate JSON files:

business: Contains business information, including location, attributes, and categories.

review: Contains full review text, star ratings, and links to the user and business.

user: Contains user profiles, including their review count and "fan" status.

tip: Contains short, unstructured tips left by users.

checkin: Contains check-in data, recording when users checked in at each business.

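These JSON files are loaded into a SQLite database (yelp.db) so the data can be retrieved easily and efficiently. To keep loading times manageable, only the first 100,000 reviews and the first 50,000 users are loaded; the other three files are loaded in full.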
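Because the check-in file has only two columns, a per-business check-in count has to be derived before it can serve as an engagement metric. The sketch below shows one way to do that on toy data. It assumes the second column is named "date" and holds a comma-separated string of timestamps per business; that layout is an assumption and should be checked against the actual file.

```python
import pandas as pd

# Toy stand-in for checkin_df. Assumption: one row per business, with all
# check-in timestamps packed into a single comma-separated string.
toy_checkin = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2017-01-02 12:00:00",
        "2018-05-11 09:15:00",
    ],
})

# Split each string on commas and count the resulting timestamps.
toy_checkin["checkin_count"] = toy_checkin["date"].str.split(",").str.len()

print(toy_checkin[["business_id", "checkin_count"]])
```

The same two lines could be applied to checkin_df once its column layout has been confirmed.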

<b>Research Objectives</b>

The research has three main objectives:

Quantify the correlation between user engagement and business success: This involves analyzing how user engagement metrics like reviews, tips, and check-ins relate to business success metrics such as review count and average star ratings. The goal is to determine if businesses with more user engagement also have more reviews and higher ratings.

Analyze the impact of sentiment on business success: This objective focuses on investigating how positive sentiment in reviews and tips influences a restaurant's average star rating. The research will also explore if positive sentiment can lead to an increase in the total number of reviews.

Investigate time trends in user engagement: This objective aims to determine if consistent user engagement over a period of time is a better predictor of a business's long-term success than sporadic, short-lived bursts of activity.
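The third objective needs some numeric notion of "consistent" versus "sporadic" engagement. One possible measure, shown here only as a sketch on made-up data, is the coefficient of variation of monthly review counts: a value near zero means steady activity, while a large value means activity concentrated in a few months. The toy DataFrame, the six-month window, and the choice of this metric are all assumptions for illustration, not the method used in this project.

```python
import pandas as pd

# Hypothetical reviews: "steady" gets one review per month, while "bursty"
# gets the same total number of reviews within a single month.
toy_reviews = pd.DataFrame({
    "business_id": ["steady"] * 6 + ["bursty"] * 6,
    "date": pd.to_datetime(
        ["2021-01-15", "2021-02-15", "2021-03-15",
         "2021-04-15", "2021-05-15", "2021-06-15",
         "2021-03-01", "2021-03-02", "2021-03-03",
         "2021-03-04", "2021-03-05", "2021-03-06"]
    ),
})

# Label each review with its calendar month, e.g. "2021-03".
month = toy_reviews["date"].dt.strftime("%Y-%m").rename("month")
all_months = [f"2021-{m:02d}" for m in range(1, 7)]

# Reviews per business per month, with months without reviews filled as 0.
monthly = (
    toy_reviews.groupby(["business_id", month]).size()
    .unstack(fill_value=0)
    .reindex(columns=all_months, fill_value=0)
)

# Coefficient of variation: standard deviation divided by mean, per business.
consistency = monthly.std(axis=1) / monthly.mean(axis=1)
print(consistency)
```

Both toy businesses have six reviews in total, so the measure separates them purely by how evenly those reviews are spread over time.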

<b>Hypothesis Testing</b>

Higher levels of user engagement (more reviews, tips, and check-ins) correlate with higher review counts and ratings for restaurants.

Positive sentiment expressed in reviews and tips contributes to higher overall ratings and review counts for restaurants.

Consistent engagement over time is positively associated with sustained business success for restaurants.
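The first hypothesis can be checked with a correlation matrix once per-business engagement counts are available. The sketch below uses invented numbers and hypothetical column names (tip_count, checkin_count) to show the mechanics. It computes a rank-based (Spearman) correlation by ranking each column and then taking the ordinary Pearson correlation of the ranks, which is one reasonable choice for skewed count data, not necessarily the one used later in the analysis.

```python
import pandas as pd

# Invented per-restaurant aggregates; column names are assumptions.
toy = pd.DataFrame({
    "tip_count":     [0, 2, 5, 9, 14],
    "checkin_count": [3, 10, 25, 40, 90],
    "review_count":  [5, 20, 45, 80, 150],
    "stars":         [3.0, 3.5, 4.5, 4.0, 4.5],
})

# Spearman correlation = Pearson correlation computed on the ranks.
rank_corr = toy.rank().corr()

# Engagement metrics as rows, success metrics as columns.
print(rank_corr.loc[["tip_count", "checkin_count"], ["review_count", "stars"]])
```

A correlation close to 1 between an engagement metric and a success metric would be consistent with the hypothesis, although it would not by itself show that engagement causes success.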

In [1]:
import pandas as pd
import json
from sqlalchemy import create_engine

In [2]:
import time

start = time.time()

# ⏱ Small files can be loaded fully
business_df = pd.read_json("yelp_academic_dataset_business.json", lines=True)
checkin_df = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)
tip_df = pd.read_json("yelp_academic_dataset_tip.json", lines=True)

# ✅ Load only first N rows of huge files (e.g., 100,000 reviews and 50,000 users)
review_sample = []
with open("yelp_academic_dataset_review.json", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 100000:  # Limit rows
            break
        review_sample.append(json.loads(line))
review_df = pd.DataFrame(review_sample)

user_sample = []
with open("yelp_academic_dataset_user.json", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 50000:  # Limit rows
            break
        user_sample.append(json.loads(line))
user_df = pd.DataFrame(user_sample)

# ✅ Shape check
print(f"✅ business_df: {business_df.shape}")
print(f"✅ checkin_df: {checkin_df.shape}")
print(f"✅ tip_df: {tip_df.shape}")
print(f"✅ review_df (sampled): {review_df.shape}")
print(f"✅ user_df (sampled): {user_df.shape}")
print(f"\n⏱ Time taken: {round(time.time() - start, 2)} seconds")


✅ business_df: (150346, 14)
✅ checkin_df: (131930, 2)
✅ tip_df: (908915, 5)
✅ review_df (sampled): (100000, 9)
✅ user_df (sampled): (50000, 22)

⏱ Time taken: 39.91 seconds


In [3]:
import time
start = time.time()

print(business_df.shape)
print(checkin_df.shape)
print(review_df.shape)
print(tip_df.shape)
print(user_df.shape)

print(f"\nTime taken: {round(time.time() - start, 2)} seconds")


(150346, 14)
(131930, 2)
(100000, 9)
(908915, 5)
(50000, 22)

Time taken: 0.0 seconds


In [4]:
# Remove the 'attributes' and 'hours' columns, since they are not needed for this analysis

In [8]:
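# errors='ignore' keeps this cell safe to re-run once the columns are already gone;
# without it, a second run raises KeyError: "['attributes', 'hours'] not found in axis"
business_df.drop(columns=['attributes', 'hours'], errors='ignore', inplace=True)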

In [7]:
engine = create_engine('sqlite:///yelp.db')

def load_dataframe(df, table_name, engine):
    df.to_sql(table_name, con=engine, if_exists='replace', index=False)

# Load each DataFrame into a separate table
load_dataframe(business_df, 'business', engine)
load_dataframe(review_df, 'review', engine)
load_dataframe(user_df, 'user', engine)
load_dataframe(tip_df, 'tip', engine)
load_dataframe(checkin_df, 'checkin', engine)
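Once the tables are written, it is worth reading something back to confirm the load worked and to see what later queries might look like. The sketch below is self-contained: it uses an in-memory SQLite connection from the standard library and a three-row toy table instead of yelp.db. The filter on the categories column is an assumption about how restaurants could be selected, not a step taken from this notebook.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for yelp.db.
conn = sqlite3.connect(":memory:")

# Toy stand-in for the business table (values are invented).
toy_business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["Cafe A", "Garage B", "Diner C"],
    "categories": ["Restaurants, Cafes", "Automotive", "Diners, Restaurants"],
    "review_count": [120, 15, 48],
    "stars": [4.5, 3.0, 4.0],
})
toy_business.to_sql("business", con=conn, if_exists="replace", index=False)

# Sanity check: the row count in the table should match the DataFrame.
row_count = pd.read_sql_query("SELECT COUNT(*) AS n FROM business", conn)

# Example query: keep only businesses whose categories mention restaurants.
restaurants = pd.read_sql_query(
    "SELECT business_id, review_count, stars FROM business "
    "WHERE lower(categories) LIKE '%restaurant%' "
    "ORDER BY business_id",
    conn,
)
conn.close()

print(row_count)
print(restaurants)
```

Against the real database, the same queries can be passed to pd.read_sql_query with the engine created above in place of conn.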