# Phase 1: Data Ingestion and Relational Mapping
### Project: Explainable Review Intelligence System

**Objective:** The goal of this stage is to ingest large-scale unstructured review data and map it to business metadata. This creates the foundation for high-dimensional feature engineering (50+ features) by linking customer sentiment to specific business attributes.

**Data Sources:**
* `yelp_academic_dataset_review.json`: Customer ratings and text.
* `yelp_academic_dataset_business.json`: Business categories and attributes.

----

## 1. Importing Libraries

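We import `pandas` for tabular data handling and the standard-library `json` module, which `load_rows` below uses to parse each record.

In [2]:
import json
import pandas as pd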

----

## 2. Memory-Efficient Data Loading
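To keep memory usage bounded while handling multi-gigabyte files, we use a **row-limited streaming strategy**: the review file is read line by line and reading stops after a fixed number of records, so the full file is never held in memory.

* **Strategy:** Load the first 100,000 records from the `review` dataset.
* **Logic:** This gives a large sample for behavior analysis while staying within RAM limits. Because the records come from the start of the file rather than a random draw, the sample may not be representative of the full dataset.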

In [3]:
def load_rows(file_path, nrows):
    """Read the first `nrows` records of a JSON Lines file into a DataFrame."""
    # List that collects one parsed JSON object per line
    data = []

    with open(file_path, 'r', encoding='utf-8') as f:
        # Each line of the file holds one JSON object
        for index, line in enumerate(f):
            # Stop once the requested number of rows has been read
            if index >= nrows:
                break
            data.append(json.loads(line))

    # Build the dataframe after the file has been closed
    return pd.DataFrame(data)
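
The loader above reads one line at a time. If a chunked read is preferred, `pd.read_json` accepts `lines=True` together with `chunksize`, which yields DataFrames lazily. The following is a minimal sketch under that assumption, not the notebook's own method: the function name `load_rows_chunked`, the default `chunksize`, and the synthetic demo file are all illustrative.

```python
import json
import os
import tempfile

import pandas as pd


def load_rows_chunked(file_path, nrows, chunksize=10_000):
    """Read up to `nrows` records from a JSON Lines file, one chunk at a time.

    Hypothetical helper; the name and the default chunksize are assumptions.
    """
    chunks = []
    rows_read = 0
    # With lines=True and a chunksize, read_json yields DataFrames lazily
    with pd.read_json(file_path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            # Trim the last chunk so the total never exceeds nrows
            chunk = chunk.iloc[: nrows - rows_read]
            chunks.append(chunk)
            rows_read += len(chunk)
            if rows_read >= nrows:
                break
    if not chunks:
        return pd.DataFrame()
    return pd.concat(chunks, ignore_index=True)


# Tiny synthetic file standing in for the review dataset
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for i in range(25):
        tmp.write(json.dumps({"review_id": f"r{i}", "stars": i % 5 + 1}) + "\n")

demo_df = load_rows_chunked(tmp.name, nrows=10, chunksize=4)
os.remove(tmp.name)
print(demo_df.shape)
```

Only one chunk is held in memory while it is being parsed, so peak usage depends on `chunksize` rather than on the size of the file.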

In [4]:
reviews_df = load_rows('./data/yelp_academic_dataset_review.json', 100000)
businesses_df = pd.read_json('./data/yelp_academic_dataset_business.json', lines=True)

In [5]:
reviews_df.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


In [6]:
businesses_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


---

## 3. Relational Merging (Many-to-One)
We are performing an **Inner Join** between the sampled Reviews and the full Business dataset.

* **Key:** `business_id`
* **Purpose:** To enrich each review with context (Industry, Location, and Service Attributes). This allows us to move from "What is the star rating?" to "Which business attributes drive this rating?"

In [7]:
df = pd.merge(businesses_df, reviews_df, on='business_id', how='inner')
df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars_x,review_count,...,categories,hours,review_id,user_id,stars_y,useful,funny,cool,text,date
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,...,"Doctors, Traditional Chinese Medicine, Naturop...",,9vwYDBVI3ymdqcyJ5WW2Tg,e0imecnX_9MtLnS2rUZM-A,5.0,3,2,1,I've had acupuncture treatments with Abby over...,2012-05-02 18:07:38
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",-WXMS4p3D9NQsAPw4YPEyw,Jks_uMtTZHqP-84wSZ3COg,5.0,0,0,0,I have a po box there and ea. visit I am greet...,2014-09-15 14:37:42
2,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",z7TqAKXXArEB6LH6Nfr9BQ,trf3Qcz8qvCDKXiTgjUcEg,3.0,1,0,1,"Bottom Line: \nClean store, Quick Service, Go...",2011-08-01 03:45:56
3,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,...,"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...",8Di0vZGcRLVNCZ-AWKgshA,auE6cx-AMcv2fv4SW_gnzA,5.0,0,0,0,I went in to ship a package to my friend for h...,2018-03-06 03:17:02
4,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,...,"Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...",IOmiYoBPtQsY_fh5uA4mXg,P-NTOAMFVSDFGkhcj4GaIQ,4.0,1,0,0,We are fans of Target. They seem to have a li...,2017-02-19 15:11:22
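
Both inputs carry a `stars` column, so the merge above produces `stars_x` (the business average, from the left frame) and `stars_y` (the individual review rating). A possible variant, sketched here on toy data, names those columns explicitly with `suffixes` and asks pandas to check the many-to-one assumption with `validate`. The toy frames and the suffix names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for reviews_df and businesses_df
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "business_id": ["b1", "b1", "b2", "b9"],  # b9 has no matching business
    "stars": [5.0, 3.0, 4.0, 1.0],
})
businesses = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["Cafe", "Garage", "Salon"],
    "stars": [4.0, 3.5, 5.0],
})

merged = pd.merge(
    reviews,
    businesses,
    on="business_id",
    how="inner",
    # Readable names instead of the default _x / _y
    suffixes=("_review", "_business"),
    # Raises MergeError if business_id is not unique in the right frame
    validate="many_to_one",
)

# Reviews whose business_id has no match are dropped by the inner join
dropped = len(reviews) - len(merged)
print(merged.columns.tolist(), dropped)
```

Comparing the row counts before and after the join is a cheap way to see how many reviews the inner join discarded.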


---

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 22 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   100000 non-null  object 
 1   name          100000 non-null  object 
 2   address       100000 non-null  object 
 3   city          100000 non-null  object 
 4   state         100000 non-null  object 
 5   postal_code   100000 non-null  object 
 6   latitude      100000 non-null  float64
 7   longitude     100000 non-null  float64
 8   stars_x       100000 non-null  float64
 9   review_count  100000 non-null  int64  
 10  is_open       100000 non-null  int64  
 11  attributes    98369 non-null   object 
 12  categories    99996 non-null   object 
 13  hours         94176 non-null   object 
 14  review_id     100000 non-null  object 
 15  user_id       100000 non-null  object 
 16  stars_y       100000 non-null  float64
 17  useful        100000 non-null  int64  
 18  funny

---

## 4. High-Dimensional Feature Extraction
The `attributes` column in the merged dataset contains nested JSON/dictionary data. To unlock deep insights, we must **flatten** these key-value pairs into individual columns.

* **Expected Outcome:** Transformation of the dataset from a narrow structure to a wide structure.
* **Process:** Extracting one column per attribute key, covering both binary and categorical features such as 'WiFi', 'BusinessParking', and 'NoiseLevel'.

In [13]:
attributes_df = df['attributes'].apply(pd.Series)
df = pd.concat([df.drop('attributes', axis=1), attributes_df], axis=1)
df.sample(n=5)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars_x,review_count,...,AcceptsInsurance,BestNights,BYOB,Corkage,BYOBCorkage,HairSpecializesIn,Open24Hours,AgesAllowed,DietaryRestrictions,RestaurantsCounterService
57399,d_tRshM-w6S4QxE4VVi8tQ,Jones,700 Chestnut St,Philadelphia,PA,19106,39.949314,-75.152525,3.5,1141,...,,,,,'no',,,,,
35199,jQBPO3rYkNwIaOdQS5ktgQ,The Fountain On Locust,3037 Locust St,Saint Louis,MO,63103,38.636198,-90.223038,4.5,1010,...,,,False,False,,,,,,
3495,-DXNvQLhKwunHzg8OjkwXA,Northeast Auto Service,5155 E 65th St,Indianapolis,IN,46220,39.875143,-86.0824,4.5,54,...,,,,,,,,,,
52207,PP3FoHboDjA2FuKOEBRBiQ,Bon Prix Auto Sales,3724 Airline Dr,Metairie,LA,70001,29.975001,-90.165006,2.0,12,...,,,,,,,,,,
97172,9V0LMtO1riRw9-pUuG4NFg,Delicia,5215 N College Ave,Indianapolis,IN,46220,39.847406,-86.145181,4.5,603,...,,,False,,,,,,,
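
The sample above shows raw values such as `'no'` that still carry their quote characters. One way to flatten the dictionaries and normalise such strings in a single pass is sketched below. The helper names `clean_value` and `flatten_attributes` are hypothetical, and the string forms handled (`'True'`, `'False'`, `'None'`, and `u'...'`-style values) are assumptions about how the raw data is encoded. Nested dictionary strings are left untouched.

```python
import pandas as pd


def clean_value(value):
    """Normalise one raw attribute value (assumed string encodings)."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    if text in ("True", "False"):
        return text == "True"
    if text == "None":
        return None
    # Drop a leading unicode marker, then surrounding quotes: u'free' -> free
    if text[:2] in ("u'", 'u"'):
        text = text[1:]
    return text.strip("'\"")


def flatten_attributes(frame, column="attributes"):
    """Expand a column of dicts into one column per key."""
    # Missing attribute dicts become empty records, i.e. all-NaN rows
    records = [
        {key: clean_value(val) for key, val in attrs.items()}
        if isinstance(attrs, dict) else {}
        for attrs in frame[column]
    ]
    wide = pd.DataFrame(records, index=frame.index)
    return pd.concat([frame.drop(columns=column), wide], axis=1)


# Toy frame standing in for the merged dataset
toy = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "attributes": [
        {"WiFi": "u'free'", "BikeParking": "True"},
        None,
        {"WiFi": "'no'", "NoiseLevel": "u'quiet'"},
    ],
})
flat = flatten_attributes(toy)
print(flat)
```

Building the wide frame from a list of dictionaries avoids constructing one `pd.Series` per row, which tends to be slow on large frames.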


In [None]:
attributes_df.sample(n=10)

In [None]:
attributes_df.loc[:,'BusinessParking':'AcceptsInsurance'].sample(n=10)