# **Dataset**: **Google Local Review Data**


---

> This dataset contains Google Maps review information (ratings, text, images, etc.), business metadata (address, geographical info, descriptions, category information, price, open hours, and MISC info), and links to related businesses, collected in the United States up to Sep 2021.

> Prediction-based recommendation is implemented in this Colab notebook.

> The SVD model is imported from the `surprise` library.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Preprocessing**

In [None]:
users_file_path = '/content/drive/MyDrive/DAIICT-RS/CP2/review-North_Dakota.json'
meta_file_path = '/content/drive/MyDrive/DAIICT-RS/CP2/meta-North_Dakota.json'

In [None]:
import pandas as pd

In [None]:
import json

def parse(path):
    # Open the JSON file
    with open(path, 'r') as json_file:
        # Read each line in the file
        for line in json_file:
            # Use json.loads() to parse each line as JSON
            yield json.loads(line)

# Parse JSON file and store objects in a list
user_json_objects = list(parse(users_file_path))
meta_json_objects = list(parse(meta_file_path))

# Create a DataFrame from the list of JSON objects
users_df = pd.DataFrame(user_json_objects)
meta_df = pd.DataFrame(meta_json_objects)
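
> The `parse` helper above assumes one JSON object per line (the JSON Lines layout). The sketch below shows the same idea on a small temporary file so it can be run without the Drive data. The function name `parse_json_lines`, the sample records, and the blank-line handling are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile


def parse_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


# Made-up records (hypothetical), written one JSON object per line.
sample_lines = [
    '{"user_id": "u1", "rating": 5, "gmap_id": "g1"}',
    "",
    '{"user_id": "u2", "rating": 4, "gmap_id": "g1"}',
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

records = list(parse_json_lines(sample_path))
os.remove(sample_path)
print(records)
```

> Keeping the parser as a generator means only one line is decoded at a time; memory is only committed when the result is wrapped in `list(...)`, as the cell above does.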

In [None]:
users_df.head()

Unnamed: 0,user_id,name,time,rating,text,pics,resp,gmap_id
0,108849313974597426826,أحمد آل إبراهيم,1600517759614,5.0,I want to join Catholic Christ.,,,0x52d94fbefa0e6353:0xf709e2d8674fe3a
1,113748047376932419918,John OpenMinded,1594757122443,5.0,Its Catholic and devout. What more do you need?,,,0x52d94fbefa0e6353:0xf709e2d8674fe3a
2,108988419397291213849,Juergen Wolf,1574580027389,5.0,This was my Church in Karlsruhe. However in Ka...,,,0x52d94fbefa0e6353:0xf709e2d8674fe3a
3,109461075406832601697,Jamie Lee,1572608951549,4.0,Go with god,,,0x52d94fbefa0e6353:0xf709e2d8674fe3a
4,117748833597621418948,lucas03,1604339936721,5.0,(Translated by Google) everything faker not th...,,,0x52d94fbefa0e6353:0xf709e2d8674fe3a


In [None]:
users_df.shape

(1109558, 8)

In [None]:
meta_df.head()

Unnamed: 0,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
0,St Peter & Paul Church,"St Peter & Paul Church, 500 Main St, Karlsruhe...",0x52d94fbefa0e6353:0xf709e2d8674fe3a,,48.093248,-100.618664,[Catholic church],4.9,7,,,{'Accessibility': ['Wheelchair accessible entr...,,"[0x52d9384b75abac93:0x13526f8266cae6cf, 0x52d9...",https://www.google.com/maps/place//data=!4m2!3...
1,Northwest Martial Arts Academy,"Northwest Martial Arts Academy, 1430 Main Ave,...",0x52c8cbe775edec7d:0xb46e15ed33643070,,46.875093,-96.802717,[Martial arts school],5.0,8,,"[[Thursday, 7:30AM–8PM], [Friday, 7:30AM–8PM],...",{'Accessibility': ['Wheelchair accessible entr...,Closed ⋅ Opens 7:30AM,"[0x52c8ccbcb1785327:0x2d50311eabd7afc, 0x52cf3...",https://www.google.com/maps/place//data=!4m2!3...
2,Thad's Amazing Magic - Fargo Birthday Party Magic,Thad's Amazing Magic - Fargo Birthday Party Ma...,0x52c8cd270f50bbbb:0x4ee4629598a8090e,,46.812415,-96.856729,"[Magician, Children's party service]",5.0,58,,"[[Thursday, Open 24 hours], [Friday, Open 24 h...",{'Amenities': ['Good for kids']},Open 24 hours,"[0x52c8c9613725e9ef:0xc628b86d8593e7e6, 0x52c8...",https://www.google.com/maps/place//data=!4m2!3...
3,Threefold,"Threefold, 212 W Main Ave, Bismarck, ND 58501",0x52d7836b7314da5d:0xc3cc63667b8c13a0,,46.805707,-100.79299,"[Film production company, Video editing servic...",5.0,5,,"[[Wednesday, 9AM–6PM], [Thursday, 9AM–6PM], [F...",,Closed ⋅ Opens 9AM Thu,"[0x52d7836c2b519b77:0x74c84187e38f42b, 0x52d78...",https://www.google.com/maps/place//data=!4m2!3...
4,Gray Brothers Dairy,"Gray Brothers Dairy, 408 N Main St, Stanley, N...",0x5320bcc09c8e6f15:0xc888ebee3ea483b6,,48.324312,-102.39,,5.0,1,,,,,"[0x5320bcc63e8fe69d:0x4f22ad0dd39b1970, 0x5320...",https://www.google.com/maps/place//data=!4m2!3...


In [None]:
meta_df.shape

(11987, 15)

In [None]:
users_df.isnull().sum()

user_id      11122
name             0
time             0
rating       11122
text        497305
pics       1090487
resp        914063
gmap_id          0
dtype: int64

In [None]:
meta_df.isnull().sum()

name                    0
address                97
gmap_id                 0
description         10231
latitude                0
longitude               0
category               57
avg_rating              0
num_of_reviews          0
price               10023
hours                3272
MISC                 2419
state                3756
relative_results      982
url                     0
dtype: int64

> **Drop the rows that have `None` in their `user_id`**

In [None]:
users_df.dropna(subset=['user_id'], inplace=True)
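
> The null counts above show the same number of missing values for `user_id` and `rating`, which suggests (but does not prove) that they are missing on the same rows. A minimal sketch on a made-up frame, assuming pandas is available, shows how that could be checked before and after `dropna`. The frame `toy_reviews` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the reviews table: one row lacks both fields.
toy_reviews = pd.DataFrame(
    {
        "user_id": ["u1", None, "u2"],
        "rating": [5.0, None, 3.0],
        "gmap_id": ["g1", "g2", "g2"],
    }
)

# True when user_id and rating are missing on exactly the same rows.
missing_together = bool(
    (toy_reviews["user_id"].isnull() == toy_reviews["rating"].isnull()).all()
)

# Drop rows without a user_id, then confirm no nulls remain in either column.
cleaned = toy_reviews.dropna(subset=["user_id"]).reset_index(drop=True)
remaining_nulls = int(cleaned[["user_id", "rating"]].isnull().sum().sum())

print(missing_together, len(cleaned), remaining_nulls)
```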

## **Exploratory Data Analysis**

> **Count number of unique users**

In [None]:
num_of_unique_users = len(users_df['user_id'].unique())

print("Unique number of Users: ", num_of_unique_users)

Unique number of Users:  293523


> **Count how many businesses each user has reviewed**

In [None]:
# Group by user_id and count occurrences
user_counts = users_df.groupby('user_id').size().reset_index(name='count')

# Sort in descending order by counts
user_counts_sorted = user_counts.sort_values(by='count', ascending=False)

user_counts_sorted

Unnamed: 0,user_id,count
134999,108485153366736176374,438
2614,100163679090357494018,352
4515,100285235527386801510,341
251621,115830224400151689190,331
230030,114482968466650354622,330
...,...,...
122382,107686060009360628003,1
122383,107686095889188687482,1
122384,107686262875690134165,1
122386,107686344736906714446,1


> **We keep only users who reviewed more than 10 businesses (the filter below uses `count > 10`).**

In [None]:
filtered_users = user_counts_sorted[user_counts_sorted['count'] > 10]
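
> The comparison operator decides where the cut falls: `> 10` keeps users with 11 or more reviews, while `>= 10` would also keep users with exactly 10. The sketch below shows the same filtering pattern on a made-up frame; `toy_reviews` and the threshold `min_reviews` are hypothetical and chosen small so the effect is visible.

```python
import pandas as pd

# Hypothetical reviews: user "a" has 3, "b" has 2, "c" has 1.
toy_reviews = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b", "c"],
        "gmap_id": ["g1", "g2", "g3", "g1", "g2", "g1"],
    }
)

min_reviews = 2  # hypothetical threshold, strict inequality as in the cell above

# Count reviews per user, then keep users strictly above the threshold.
counts = toy_reviews.groupby("user_id").size()
active_ids = counts[counts > min_reviews].index

# Keep only the reviews written by those users.
filtered = toy_reviews[toy_reviews["user_id"].isin(active_ids)]

print(list(active_ids), len(filtered))
```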



---



In [None]:
meta_df.shape

(11987, 15)

> **Number of unique businesses**

In [None]:
unique_buss = len(meta_df['gmap_id'].unique())

print("No. of unique businesses: ", unique_buss)

No. of unique businesses:  11937


> **Drop the duplicated businesses**

In [None]:
# Drop duplicate rows based on 'gmap_id'
final_meta_mat = meta_df.drop_duplicates(subset=['gmap_id'])
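
> `drop_duplicates` keeps the first row for each `gmap_id` by default and discards later ones, even when their other columns differ. A small sketch on a made-up frame, assuming pandas is available, counts the duplicates first so the effect of the drop can be verified. The frame `toy_meta` is hypothetical.

```python
import pandas as pd

# Hypothetical metadata: "g1" appears twice with different names.
toy_meta = pd.DataFrame(
    {
        "gmap_id": ["g1", "g2", "g1"],
        "name": ["A", "B", "A again"],
    }
)

# Rows flagged as repeats of an earlier gmap_id.
n_duplicates = int(toy_meta.duplicated(subset=["gmap_id"]).sum())

# keep="first" is the default; it is spelled out here for clarity.
deduped = toy_meta.drop_duplicates(subset=["gmap_id"], keep="first")

print(n_duplicates, deduped["name"].tolist())
```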

> **Each business is associated with a set of categories such as restaurant, hair cutting, bar, quarry, etc.**

In [None]:
unique_category = list({item for sublist in final_meta_mat['category'].values if isinstance(sublist, list) and sublist for item in sublist})

# number of unique category
print(len(unique_category))

1960


In [None]:
# explode the list of category into multiple rows
meta_df_new = final_meta_mat.explode('category')

meta_df_new.reset_index(drop=True, inplace=True)
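
> `explode` turns each element of a list-valued cell into its own row and repeats the other columns. Rows whose `category` is missing stay as a single row with a null category, and a later `groupby('category')` leaves them out by default. The sketch below illustrates this on a made-up frame; `toy_meta` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical metadata: one business has two categories, one has none.
toy_meta = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "category": [["Bar", "Restaurant"], ["Bar"], None],
    }
)

# One row per (business, category) pair; "C" keeps a single null-category row.
exploded = toy_meta.explode("category").reset_index(drop=True)

# Count businesses per category; null categories are dropped by groupby.
category_counts = (
    exploded.groupby("category").size().sort_values(ascending=False)
)

print(len(exploded))
print(category_counts)
```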

> **Count the number of businesses associated with each category.**

In [None]:
# Group by category and count occurrences
meta_category_counts = meta_df_new.groupby('category').size().reset_index(name='count')

# Sort in descending order by counts
meta_category_counts_sorted = meta_category_counts.sort_values(by='count', ascending=False)

# Reset index
meta_category_counts_sorted.reset_index(drop=True, inplace=True)

meta_category_counts_sorted

Unnamed: 0,category,count
0,Restaurant,679
1,Bar,468
2,Gas station,447
3,Fast food restaurant,357
4,Park,293
...,...,...
1955,Punjabi restaurant,1
1956,Quarry,1
1957,Beach clothing store,1
1958,Battle site,1


> **Keep only the reviews written by the filtered users**

In [None]:
final_users_mat = users_df[users_df['user_id'].isin(filtered_users['user_id'])]
final_users_mat.reset_index(drop=True, inplace=True)

In [None]:
final_users_mat.shape

(570431, 8)

In [None]:
final_users_mat.columns

Index(['user_id', 'name', 'time', 'rating', 'text', 'pics', 'resp', 'gmap_id'], dtype='object')

In [None]:
final_meta_mat.shape

(11937, 15)

In [None]:
final_meta_mat.columns

Index(['name', 'address', 'gmap_id', 'description', 'latitude', 'longitude',
       'category', 'avg_rating', 'num_of_reviews', 'price', 'hours', 'MISC',
       'state', 'relative_results', 'url'],
      dtype='object')

> **Merge the users and business dataframes**

In [None]:
users_meta_merge_mat = final_users_mat.merge(final_meta_mat, on='gmap_id')
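
> `merge` defaults to an inner join, so reviews whose `gmap_id` has no metadata row are dropped, and columns present on both sides (here `name`) receive `_x`/`_y` suffixes. The sketch below, on made-up frames, spells those defaults out and adds `validate="many_to_one"`, which raises an error if the metadata side still contains duplicate keys. The frames and the suffix names are hypothetical.

```python
import pandas as pd

# Hypothetical reviews: "g9" has no matching metadata row.
toy_reviews = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3"],
        "name": ["Ann", "Bob", "Cy"],
        "gmap_id": ["g1", "g1", "g9"],
        "rating": [5.0, 4.0, 3.0],
    }
)
toy_meta = pd.DataFrame(
    {
        "gmap_id": ["g1", "g2"],
        "name": ["Cafe", "Park"],
    }
)

merged = toy_reviews.merge(
    toy_meta,
    on="gmap_id",
    how="inner",                      # the default, written out explicitly
    validate="many_to_one",           # fails if toy_meta repeats a gmap_id
    suffixes=("_user", "_business"),  # clearer than the default _x / _y
)

print(merged.columns.tolist())
print(len(merged))
```

> Comparing the row count before and after the merge, as the next cells do with `.shape`, is a quick way to see whether any reviews were dropped or multiplied.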

In [None]:
users_meta_merge_mat.shape

(570431, 22)

In [None]:
users_meta_merge_mat['user_id'].nunique()

20288

In [None]:
users_meta_merge_mat.columns

Index(['user_id', 'name_x', 'time', 'rating', 'text', 'pics', 'resp',
       'gmap_id', 'name_y', 'address', 'description', 'latitude', 'longitude',
       'category', 'avg_rating', 'num_of_reviews', 'price', 'hours', 'MISC',
       'state', 'relative_results', 'url'],
      dtype='object')

> **So, there are about `570k` rows (570,431) in the merged user-meta matrix.**

> **Now we keep only the three features needed to build the user-item matrix: `user_id`, `gmap_id`, and `rating`.**

In [None]:
user_meta_mat = users_meta_merge_mat[['user_id', 'gmap_id', 'rating']]

In [None]:
user_meta_mat.head()

Unnamed: 0,user_id,gmap_id,rating
0,113748047376932419918,0x52d94fbefa0e6353:0xf709e2d8674fe3a,5.0
1,109461075406832601697,0x52d94fbefa0e6353:0xf709e2d8674fe3a,4.0
2,113748047376932419918,0x52d94fbefa0e6353:0xf709e2d8674fe3a,5.0
3,109461075406832601697,0x52d94fbefa0e6353:0xf709e2d8674fe3a,4.0
4,108980581753286957185,0x52c8cbe775edec7d:0xb46e15ed33643070,5.0


In [None]:
user_meta_mat.shape

(570431, 3)

## **Recommendation Model**

In [None]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163006 sha256=689523c5568864b0222900405681d1d389f7b57497fa1b99a7eefc3c40ef1510
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scik

In [None]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

> **Apply the SVD model from the `surprise` library and predict the ratings.**

In [None]:
# Create a reader object specifying the rating scale
reader = Reader(rating_scale=(1, 5))

# Load the user-item matrix into the Surprise Dataset format
data = Dataset.load_from_df(user_meta_mat, reader)

# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Choose the SVD model
model = SVD()

# Train the model on the trainset
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7d371d2f8a00>
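
> For reference, the Surprise documentation describes `SVD` as a biased matrix factorization. The predicted rating for user $u$ and business $i$ combines the global mean $\mu$, a user bias $b_u$, an item bias $b_i$, and the dot product of latent factor vectors $p_u$ and $q_i$:

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

> The parameters are fitted by stochastic gradient descent on the regularized squared error over the known ratings:

$$
\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
$$

> With `SVD()` called without arguments, as above, the library defaults apply (100 factors, 20 epochs, learning rate 0.005, regularization 0.02). For a user or business not seen in the training set, the corresponding bias and factors are taken as zero, so the estimate falls back towards $\mu$ plus whatever bias is known.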

In [None]:
# Make predictions on the testset
predictions = model.test(testset)

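> **Measure the accuracy on the test set.**

In [None]:
# accuracy.rmse / accuracy.mae print the metric themselves by default,
# so pass verbose=False to avoid printing each value twice
print("RMSE: ", accuracy.rmse(predictions, verbose=False))
print("MAE: ", accuracy.mae(predictions, verbose=False))

RMSE:  0.9204258342886501
MAE:  0.6590567410368855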
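
> To make the two metrics concrete, the sketch below computes RMSE and MAE by hand on a few made-up (true rating, estimated rating) pairs. The pairs are hypothetical and unrelated to the results above; the point is only that RMSE penalizes large errors more heavily than MAE.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(5.0, 4.5), (3.0, 4.0), (4.0, 4.0), (1.0, 2.5)]

# Signed errors: 0.5, -1.0, 0.0, -1.5
errors = [true - est for true, est in pairs]

# Root mean squared error: square, average, then take the square root.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Mean absolute error: average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print("RMSE:", rmse)
print("MAE:", mae)
```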


In [None]:
user_id = 108988419397291213849
top_N = 10

recommendations = []

# Candidate businesses: every gmap_id that appears in the user-item data
gmap_ids = list(user_meta_mat['gmap_id'].unique())

# Predict a rating for each candidate business
for gmap_id in gmap_ids:
    predicted_rating = model.predict(str(user_id), str(gmap_id)).est
    recommendations.append((gmap_id, predicted_rating))

# Sort recommendations by predicted rating, highest first
recommendations.sort(key=lambda x: x[1], reverse=True)

# Keep the top-N recommended businesses
top_n_recommendations = recommendations[:top_N]

print("Recommendations for the user are as follows:")
print(70*"-")
for gmap_id, pred_rating in top_n_recommendations:
    # Look up the metadata row(s) for this business once
    business = meta_df[meta_df['gmap_id'] == gmap_id]
    print("Name: ", business['name'])
    print("Address: ", business['address'])
    print("Description: ", business['description'])
    print("Latitude: ", business['latitude'])
    print("Longitude: ", business['longitude'])
    print("Opening hours: ", business['hours'])

    print(70*"-")


Recommendations for the user are as follows:
----------------------------------------------------------------------
Name:  3180    Legacy Plumbing, LLC
Name: name, dtype: object
Address:  3180    Legacy Plumbing, LLC, 3955 40th Ave S suite A,...
Name: address, dtype: object
Description:  3180    None
Name: description, dtype: object
Latitude:  3180    46.818668
Name: latitude, dtype: float64
Longitude:  3180   -96.84529
Name: longitude, dtype: float64
Opening hours:  3180    [[Friday, Open 24 hours], [Saturday, Open 24 h...
Name: hours, dtype: object
----------------------------------------------------------------------
Name:  5302    Capital Heights Auto Clinic & Mr Lubester
Name: name, dtype: object
Address:  5302    Capital Heights Auto Clinic & Mr Lubester, 142...
Name: address, dtype: object
Description:  5302    None
Name: description, dtype: object
Latitude:  5302    46.834808
Name: latitude, dtype: float64
Longitude:  5302   -100.768776
Name: longitude, dtype: float64
Opening hours