**Assignment Title: Lab Assignment 1 - Text Processing Techniques**

**Author name: Garima Astha**

**ASU ID: 1234333687 (gastha)**

**File creation date: 26 Jan 2025**

**Objective: Use basic text processing techniques and Pandas to explore the Yelp Review data and examine the language of positive and negative restaurant reviews.**



**Import libraries and dataset**

In [None]:
import pandas as pd
import numpy as np
import spacy
from collections import Counter
from nltk.corpus import stopwords
import string

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Load the dataset
file_path = r'/content/sample_data/restaurant_reviews_az.csv'
df = pd.read_csv(file_path)



**Data Exploration - Display summary of dataset**

In [None]:
# Display summary of the dataset
# df.info() prints its report directly and returns None, so call it on its own
print("Dataset Summary:")
df.info()
print("\nFirst 5 rows:\n", df.head())



Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48147 entries, 0 to 48146
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    48147 non-null  object
 1   user_id      48147 non-null  object
 2   business_id  48147 non-null  object
 3   stars        48147 non-null  int64 
 4   useful       48147 non-null  int64 
 5   funny        48147 non-null  int64 
 6   cool         48147 non-null  int64 
 7   text         48147 non-null  object
 8   date         48147 non-null  object
dtypes: int64(4), object(5)
memory usage: 3.3+ MB

First 5 rows:
                 review_id                 user_id             business_id  \
0  IVS7do_HBzroiCiymNdxDg  fdFgZQQYQJeEAshH4lxSfQ  sGy67CpJctjeCWClWqonjA   
1  QP2pSzSqpJTMWOCuUuyXkQ  JBLWSXBTKFvJYYiM-FnCOQ  3w7NRntdQ9h0KwDsksIt5Q   
2  oK0cGYStgDOusZKz9B1qug  2_9fKnXChUjC5xArfF8BLg  OMnPtRGmbY8qH_wIILfYKA   
3  E_ABvFCNVLbfOgRg3Pv1KQ  9MExTQ76GSKhxSWnT
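
`DataFrame.info()` writes its report to standard output and returns `None`, so it cannot usefully be embedded inside a `print` call. The following is a minimal sketch of capturing the report as a string through the `buf` parameter. The frame `demo_df` and its values are made up for illustration and are not taken from the review data.

```python
import io

import pandas as pd

# Small made-up frame standing in for the review data (hypothetical values).
demo_df = pd.DataFrame({"stars": [1, 5, 3], "text": ["bad", "great", "ok"]})

# info() writes to the given buffer (stdout by default) and returns None.
buffer = io.StringIO()
result = demo_df.info(buf=buffer)
info_report = buffer.getvalue()

# The captured report can now be printed under a heading or saved to a file.
print("Dataset Summary:")
print(info_report)
```

Capturing the report this way is only needed when the text has to be reused. For display in a notebook, calling `df.info()` on its own line is enough.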

**Data Exploration - Check for missing values**

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values:\n", missing_values)



Missing Values:
 review_id      0
user_id        0
business_id    0
stars          0
useful         0
funny          0
cool           0
text           0
date           0
dtype: int64


**Select the 1 star reviews and 5 star reviews from the dataset. Apply necessary text processing techniques on the selected reviews.**

In [None]:
!pip install nltk
import nltk
nltk.download('stopwords')

# Select 1-star and 5-star reviews; .copy() gives an independent DataFrame,
# so adding a column below does not trigger SettingWithCopyWarning
selected_reviews = df[df['stars'].isin([1, 5])].copy()

# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Text Preprocessing Function
def preprocess_text(text):
    doc = nlp(text.lower())  # Convert to lowercase
    # Keep lemmas of alphabetic, non-stopword tokens (is_alpha already excludes punctuation)
    tokens = [token.lemma_ for token in doc if token.is_alpha and token.text not in stop_words]
    return tokens

selected_reviews['processed_text'] = selected_reviews['text'].apply(preprocess_text)

# Function to extract top N frequent words of a specific POS
def get_top_n_words(reviews, pos_tag, n=20):
    words = []
    for text in reviews:
        # Re-parse the joined lemmas so that each one receives a POS tag
        doc = nlp(" ".join(text))
        words.extend([token.text for token in doc if token.pos_ == pos_tag])
    return Counter(words).most_common(n)

# Separate 1-star and 5-star reviews
one_star_reviews = selected_reviews[selected_reviews['stars'] == 1]['processed_text']
five_star_reviews = selected_reviews[selected_reviews['stars'] == 5]['processed_text']




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
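
`get_top_n_words` runs the spaCy pipeline again for every part of speech requested, so each review is parsed several times. A possible alternative, sketched below under stated assumptions, is to tag each review once and count all parts of speech in a single pass. The function `top_n_by_pos` and the data in `toy_docs` are hypothetical. In the notebook, the `(lemma, pos)` pairs could come from `(token.lemma_, token.pos_)` for each token of a parsed review.

```python
from collections import Counter, defaultdict


def top_n_by_pos(tagged_docs, wanted_pos=("NOUN", "ADJ", "VERB"), n=20):
    """Count lemmas per POS tag in one pass over pre-tagged documents."""
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for lemma, pos in doc:
            if pos in wanted_pos:
                counts[pos][lemma] += 1
    # One ranked list of (lemma, count) pairs for each requested POS tag
    return {pos: counts[pos].most_common(n) for pos in wanted_pos}


# Hand-tagged toy reviews (hypothetical, not produced by spaCy)
toy_docs = [
    [("food", "NOUN"), ("cold", "ADJ"), ("wait", "VERB")],
    [("food", "NOUN"), ("great", "ADJ"), ("love", "VERB"), ("food", "NOUN")],
]

top_words = top_n_by_pos(toy_docs, n=2)
print(top_words["NOUN"])
```

Tagging the original sentences rather than the joined lemmas would also give the tagger full context, which may change some of the counts reported in this notebook.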


**Display the top 20 frequently used nouns in 1 star reviews and 5 star reviews, respectively.**

**Display the top 20 frequently used adjectives in 1 star reviews and 5 star reviews, respectively.**

**Display the top 20 frequently used verbs in 1 star reviews and 5 star reviews, respectively.**

In [None]:
# Find top 20 frequent nouns, adjectives, and verbs
print("\nTop 20 nouns in 1-star reviews:", get_top_n_words(one_star_reviews, "NOUN"))
print("\nTop 20 nouns in 5-star reviews:", get_top_n_words(five_star_reviews, "NOUN"))

print("\nTop 20 adjectives in 1-star reviews:", get_top_n_words(one_star_reviews, "ADJ"))
print("\nTop 20 adjectives in 5-star reviews:", get_top_n_words(five_star_reviews, "ADJ"))

print("\nTop 20 verbs in 1-star reviews:", get_top_n_words(one_star_reviews, "VERB"))
print("\nTop 20 verbs in 5-star reviews:", get_top_n_words(five_star_reviews, "VERB"))


Top 20 nouns in 1-star reviews: [('order', 8096), ('food', 6441), ('time', 4360), ('place', 3403), ('service', 3066), ('minute', 2406), ('restaurant', 2250), ('customer', 2216), ('table', 1705), ('manager', 1666), ('people', 1443), ('hour', 1258), ('experience', 1182), ('call', 1170), ('pizza', 1164), ('location', 1163), ('server', 1080), ('way', 1044), ('drink', 1039), ('meal', 1028)]

Top 20 nouns in 5-star reviews: [('food', 15137), ('place', 10061), ('order', 7535), ('service', 7365), ('time', 7099), ('restaurant', 4789), ('staff', 4000), ('tucson', 3980), ('love', 3920), ('pizza', 2797), ('menu', 2731), ('flavor', 2588), ('meal', 2394), ('experience', 2231), ('price', 2106), ('sauce', 1955), ('side', 1936), ('spot', 1925), ('drink', 1924), ('day', 1919)]

Top 20 adjectives in 1-star reviews: [('bad', 2390), ('good', 2077), ('last', 925), ('first', 893), ('terrible', 798), ('horrible', 787), ('cold', 731), ('great', 720), ('close', 621), ('long', 619), ('open', 596), ('rude', 595)
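
Several nouns, such as "food" and "order", appear in both top 20 lists, so raw frequency alone does not show what separates the two groups. The sketch below is one simple way to split the two noun lists printed above into shared and distinctive words. The helper `split_shared_and_distinctive` is a hypothetical name, and the word lists are copied from the noun output of this cell.

```python
# Top 20 nouns copied from the output above, in rank order
one_star_nouns = [
    "order", "food", "time", "place", "service", "minute", "restaurant",
    "customer", "table", "manager", "people", "hour", "experience", "call",
    "pizza", "location", "server", "way", "drink", "meal",
]
five_star_nouns = [
    "food", "place", "order", "service", "time", "restaurant", "staff",
    "tucson", "love", "pizza", "menu", "flavor", "meal", "experience",
    "price", "sauce", "side", "spot", "drink", "day",
]


def split_shared_and_distinctive(words_a, words_b):
    """Return (shared, only_a, only_b), keeping the original rank order."""
    set_a, set_b = set(words_a), set(words_b)
    shared = [w for w in words_a if w in set_b]
    only_a = [w for w in words_a if w not in set_b]
    only_b = [w for w in words_b if w not in set_a]
    return shared, only_a, only_b


shared, only_one_star, only_five_star = split_shared_and_distinctive(
    one_star_nouns, five_star_nouns
)
print("Shared:", shared)
print("Only in 1-star top 20:", only_one_star)
print("Only in 5-star top 20:", only_five_star)
```

A set difference on top 20 lists is a coarse comparison. A word ranked 21st in one group is treated as absent, so the result is a starting point for reading the reviews rather than a measure of significance.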

**Display the top 20 frequently used named entities from the selected reviews.**

In [None]:
import spacy
from collections import Counter

# Load spaCy's English NLP model
nlp = spacy.load("en_core_web_sm")

# Extract review text
# Note: this takes every review in df, not only the 1-star and 5-star subset
documents = df['text'].dropna().tolist()

# Process text and extract named entities
entities = []
for doc in nlp.pipe(documents, disable=["parser", "tagger"]):
    entities.extend([ent.text for ent in doc.ents])

# Count frequency of named entities
entity_counts = Counter(entities)

# Get the top 20 most common named entities
top_20_entities = entity_counts.most_common(20)

# Print the results
print("Top 20 Named Entities:")
for entity, count in top_20_entities:
    print(f"{entity}: {count}")



Top 20 Named Entities:
Tucson: 7045
first: 4365
one: 3743
two: 3662
2: 2701
Mexican: 2228
3: 2046
5: 1860
today: 1627
4: 1535
First: 1344
One: 1283
three: 1172
second: 1066
1: 1062
half: 1049
Italian: 886
Chinese: 837
tonight: 790
10: 773
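
Most of the entries above are numbers, ordinals, and date or time words, and "first" and "First" are counted separately. A possible refinement, sketched below, is to keep each entity's label, drop the numeric and temporal labels, and lowercase the text before counting. The label names follow the scheme used by spaCy's English models and are available as `ent.label_`. The function `count_named_entities` is hypothetical, and the labels in `toy_entities` are assigned by hand for illustration, not taken from model output.

```python
from collections import Counter

# Entity labels that describe numbers, dates, and amounts rather than names
NON_NAME_LABELS = {
    "CARDINAL", "ORDINAL", "DATE", "TIME", "QUANTITY", "PERCENT", "MONEY",
}


def count_named_entities(labeled_entities, excluded_labels=NON_NAME_LABELS):
    """Count (text, label) pairs, skipping excluded labels and merging case."""
    counts = Counter()
    for text, label in labeled_entities:
        if label not in excluded_labels:
            counts[text.strip().lower()] += 1
    return counts


# Hand-labeled toy entities (hypothetical, not produced by the model)
toy_entities = [
    ("Tucson", "GPE"), ("first", "ORDINAL"), ("First", "ORDINAL"),
    ("Mexican", "NORP"), ("mexican", "NORP"), ("two", "CARDINAL"),
    ("today", "DATE"), ("Tucson", "GPE"),
]

entity_counts_filtered = count_named_entities(toy_entities)
for entity, count in entity_counts_filtered.most_common(20):
    print(f"{entity}: {count}")
```

In the notebook loop, the pairs could be collected as `(ent.text, ent.label_)` instead of `ent.text` alone.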


**Conclusion**

**The top 20 frequently used nouns, adjectives and verbs help to understand the reviews better.**

**In 1-star reviews, common nouns such as "order", "time", "minute" and "manager", together with adjectives such as "bad", "cold", "terrible" and "wrong", signal disappointment with the restaurant. Common verbs such as "ask", "tell", "leave", "think" and "sit" are consistent with that dissatisfaction.**

**In 5-star reviews, by contrast, common nouns such as "food", "service", "staff" and "flavor", together with adjectives such as "amazing", "delicious" and "friendly", suggest that customers are happy with the restaurant. Common verbs such as "recommend", "enjoy", "love" and "feel" reinforce that customers enjoyed the experience.**

**After analysing the reviews, we can say that the keys to a good restaurant experience are food, service and time.**

**The analysis of the top 20 named entities suggests that most of the reviewed restaurants are in "Tucson" and that the popular cuisines are "Mexican", "Italian" and "Chinese". Most of the remaining entries are numbers, ordinals, and date or time words such as "first", "two" and "today".**


