# Case Study 2: Data Mining in Yelp Data


Please download the Yelp dataset from Case Study 2 on BrightSpace.

**Here is an example of the data format:**
### Business Objects

Business objects contain basic information about local businesses. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)],
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```
### Checkin Objects
```json
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for an hour-day block it will not be in the dict
}
```
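
The `checkin_info` keys encode an hour and a weekday as `'hour-day'`, where, going by the examples above, day 0 is Sunday and day 6 is Saturday. A minimal sketch of decoding such a key follows; the helper name `decode_checkin_key` is illustrative and not part of the dataset or any library.

```python
# Weekday names indexed by the day number used in checkin_info keys (0 = Sunday).
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def decode_checkin_key(key):
    """Split an 'hour-day' key such as '14-4' into (hour, weekday name)."""
    hour_str, day_str = key.split("-")
    hour, day = int(hour_str), int(day_str)
    if not (0 <= hour <= 23 and 0 <= day <= 6):
        raise ValueError(f"unexpected checkin key: {key!r}")
    return hour, DAYS[day]


print(decode_checkin_key("14-4"))  # (14, 'Thursday')
```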

# Problem: pick a data science problem that you plan to solve using Yelp Data
* The problem should be important and interesting, with potential impact in some area.
* The problem should be solvable using Yelp data and data science solutions.

Please briefly describe in the following cell: What problem are you trying to solve? Why is this problem important and interesting?

# Data Collection/Processing: 

In [None]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

import json

In [None]:
import pandas as pd
business_json_path = "yelp_academic_dataset_business.json"
df = pd.read_json(business_json_path, lines=True)

In [None]:
df

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Apn5Q_b6Nz61Tq4XzPdf9A,Minhas Micro Brewery,,1314 44 Avenue NE,Calgary,AB,T2E 6L6,51.091813,-114.031675,4.0,24,1,"{'BikeParking': 'False', 'BusinessAcceptsCredi...","Tours, Breweries, Pizza, Restaurants, Food, Ho...","{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'..."
1,AjEbIBw6ZFfln7ePHha9PA,CK'S BBQ & Catering,,,Henderson,NV,89002,35.960734,-114.939821,4.5,3,0,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...","Chicken Wings, Burgers, Caterers, Street Vendo...","{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0..."
2,O8S5hYJ1SMc8fA4QBtVujA,La Bastringue,Rosemont-La Petite-Patrie,1335 rue Beaubien E,Montréal,QC,H2G 1K7,45.540503,-73.599300,4.0,5,0,"{'Alcohol': 'beer_and_wine', 'Ambience': '{'ro...","Breakfast & Brunch, Restaurants, French, Sandw...","{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'..."
3,bFzdJJ3wp3PZssNEsyU23g,Geico Insurance,,211 W Monroe St,Phoenix,AZ,85003,33.449999,-112.076979,1.5,8,1,,"Insurance, Financial Services",
4,8USyCYqpScwiNEb58Bt6CA,Action Engine,,2005 Alyth Place SE,Calgary,AB,T2H 0N5,51.035591,-114.027366,2.0,4,1,{'BusinessAcceptsCreditCards': 'True'},"Home & Garden, Nurseries & Gardening, Shopping...","{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188588,sMQAZ3DkfrURFoJAyOhjEw,Ross Massage,,"8000 McKnight Rd, Ste 570",Pittsburgh,PA,15237,40.551152,-80.021213,2.5,9,0,"{'AcceptsInsurance': 'False', 'BusinessAccepts...","Skin Care, Beauty & Spas, Day Spas, Massage","{'Monday': '10:0-21:0', 'Tuesday': '10:0-21:0'..."
188589,6hvuCibNS4uECetHb9MCQQ,Four Seasons Boutique,,3341 Babcock Blvd,Pittsburgh,PA,15237,40.534242,-80.019556,2.0,5,1,,"Fashion, Women's Clothing, Accessories, Bridal...",
188590,KleCXFYOmdACcQUvf6_XEg,Walmart Supercenter,,5825 Thunder Rd,Concord,NC,28027,35.378669,-80.724733,3.0,26,1,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Mobile Phones, Shopping, Department Stores, Fo...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."
188591,3_fIsSxN2RBovQ_6EFtLzA,Residence Inn Charlotte Concord,,7601 Scott Padgett Pkwy,Concord,NC,28027,35.364366,-80.703454,4.0,19,1,"{'BusinessAcceptsCreditCards': 'True', 'DogsAl...","Event Planning & Services, Hotels & Travel, Ho...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."


In [None]:
print(len(df))
# Dropping the rows with Null Values
df = df.dropna()
print(len(df))


188593
127442
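
Calling `df.dropna()` with no arguments drops a row if any column is null, which here removes 61,151 of the 188,593 rows. If only some columns matter for the analysis, one hedged alternative is to pass `subset`. The sketch below uses a made-up toy frame, and the choice of `categories` as the required column is an assumption for illustration.

```python
import pandas as pd

# Toy frame standing in for the business table (values are made up).
toy = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "categories": ["Pizza, Food", None, "Bars"],
    "hours": [None, None, "{'Monday': '8:0-17:0'}"],
})

# Drop rows with a null in ANY column (what the cell above does).
all_cols = toy.dropna()

# Drop rows only when the column actually needed is null.
needed = toy.dropna(subset=["categories"])

print(len(all_cols), len(needed))
```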


# Data Exploration: Exploring the Yelp Dataset

**(1) Finding the most popular business categories:** 
* Print the top 10 most popular business categories in the dataset and their counts in a table (i.e., how many business objects are in each category). Here we say a category is "popular" if there are many business objects in this category (such as 'restaurants').

In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
c = ""
for j in df['categories']:
    c = c + j


In [None]:
categories = c.split(",")

In [None]:
from collections import Counter
counts = Counter(categories)
counts.most_common(10)

[(' Restaurants', 17966),
 (' Shopping', 12322),
 (' Food', 9931),
 (' Home Services', 6528),
 (' Beauty & Spas', 6495),
 (' Health & Medical', 6283),
 (' Nightlife', 6091),
 (' Bars', 5602),
 (' Local Services', 5227),
 (' Event Planning & Services', 4247)]
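
The counts above come from concatenating every `categories` string with no separator and then splitting on commas. That fuses the last category of one business with the first category of the next, and it leaves the leading spaces visible in the output (`' Restaurants'`). A hedged sketch of a per-business count that strips whitespace is shown below on made-up sample strings; the function name `count_categories` is illustrative.

```python
from collections import Counter


def count_categories(category_strings):
    """Count categories across businesses, one comma-separated string per business."""
    counts = Counter()
    for cats in category_strings:
        if not isinstance(cats, str):
            continue  # skip missing values such as None or NaN
        # Split per business so categories never fuse across rows.
        counts.update(c.strip() for c in cats.split(",") if c.strip())
    return counts


sample = ["Pizza, Restaurants, Food", "Restaurants, Bars", None]
for category, n in count_categories(sample).most_common(10):
    print(f"{category:<15}{n}")
```

On the notebook's data this would be called as `count_categories(df['categories'])`; the resulting numbers may differ somewhat from the list above.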

**(2) Finding the most popular business objects:**
* Print the top 10 most popular business objects/IDs in the dataset and their counts (i.e., the total number of checkins for each business object). Here we say a business object is "popular" if it attracts a large number of checkins from users.
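
Assuming check-in records follow the `checkin_info` layout documented earlier, total checkins per business can be obtained by summing the dictionary values. The sketch below uses made-up records and an illustrative function name. With real data, each line of the check-in file would first be parsed with `json.loads`; the file name is not given in this notebook, so it is left out here.

```python
from collections import Counter


def total_checkins_per_business(records):
    """Sum all hour-day checkin counts for each business_id."""
    totals = Counter()
    for rec in records:
        # Missing hour-day blocks are simply absent from the dict.
        totals[rec["business_id"]] += sum(rec.get("checkin_info", {}).values())
    return totals


# Made-up records in the documented checkin format.
sample_records = [
    {"type": "checkin", "business_id": "biz_a", "checkin_info": {"0-0": 2, "14-4": 5}},
    {"type": "checkin", "business_id": "biz_b", "checkin_info": {"23-6": 1}},
    {"type": "checkin", "business_id": "biz_c", "checkin_info": {}},
]

top = total_checkins_per_business(sample_records).most_common(10)
print(top)
```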

In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary






**(3) Other explorations you would like to present** 


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary






# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

Write code to implement the solution in Python:

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the Yelp dataset
yelp_data = pd.read_json("/content/yelp_academic_dataset_business.json", lines=True)

# Select the features we want to use for prediction
features = ['review_count']

# Extract the features and target variable from the data
X = yelp_data[features]
y = yelp_data['stars']

# One-hot encoding of 'categories' is disabled: 'categories' is not in `features`,
# so it would have to be added to X before this line could run.
#X = pd.get_dummies(X, columns=['categories'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create a linear regression model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the star rating for the testing data
y_pred = model.predict(X_test)

# Calculate the mean squared error (MSE) of the predictions
mse = mean_squared_error(y_test, y_pred)

print('Mean Squared Error:', mse)


ValueError: ignored
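
A single MSE figure is hard to judge in isolation. As a hedged sketch, the value from `mean_squared_error` can be compared with a baseline that always predicts the mean training rating. The helper names and toy numbers below are illustrative and not part of the notebook.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def baseline_mse(y_train, y_test):
    """MSE of a predictor that always outputs the mean of y_train."""
    constant = np.full(len(y_test), np.mean(y_train))
    return mse(y_test, constant)


# Toy ratings, made up for illustration.
y_train_toy = [3.0, 4.0, 5.0]
y_test_toy = [4.0, 2.0]
print(baseline_mse(y_train_toy, y_test_toy))
```

With the notebook's variables, `baseline_mse(y_train, y_test)` gives the figure to compare against `mse`; a regression that does not beat it is adding little over the average rating.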

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
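import matplotlib.pyplot as plt

# Plot the predicted star rating for each test case.
# Requires y_pred from the regression cell above; run that cell first.
plt.plot(range(len(y_pred)), y_pred)
plt.xlabel('Test case index')
plt.ylabel('Predicted star rating')
plt.show()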


NameError: ignored

# Results: summarize and visualize the results discovered from the analysis

Please use figures or tables to present the results.


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary








-----------------
# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook and submit it on BrightSpace. Please make sure all the plotted tables and figures are in the notebook.