Natural Language Understanding of Reviews

Objective

Which characteristics of businesses are associated with operational metrics derived from customer ratings and reviews?

Questions

  1. Do the services a business provides to the customer, its hours of operation, and the time of year play any role in the number of positive ratings or reviews?
  2. Does the activity of individual reviewers show any patterns associated with a higher number of positive ratings/reviews?
  3. Is the text of customer reviews associated with any of the provided or engineered features?

Data

The data was retrieved from the Yelp Open Dataset (https://www.yelp.com/dataset). This includes the business, review, user, tip, checkin and photo JSON files. The photo.json file was not utilized in the data warehouse.

Preprocessing

A data warehouse was constructed by joining the five datasets used in the analysis. The review set served as the backbone of the warehouse so that the maximum number of reviews was retained. The data was joined in the following order:

  • review
  • business
  • user
  • tip
  • checkin

The business data was used to obtain the name of each business in the tip set, which allowed features such as the number of compliments to be engineered. The checkin set allowed features to be created based on the time information. Exploratory data analysis (EDA) was then completed to examine the various components of the warehouse, including the reviews, users, business ratings, locations and time. Prior to the NLP processing step, the warehouse was filtered to the food categories with counts over 30,000 and the seven states with the highest review counts.
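
As a minimal sketch of the join order, assuming pandas, the standard Yelp Open Dataset file names and the public schema; the join keys and the tip aggregation are illustrative assumptions, not the project's exact code:

```python
import pandas as pd

# Standard Yelp Open Dataset file names (assumption); one JSON record per line.
review = pd.read_json('yelp_academic_dataset_review.json', lines=True)
business = pd.read_json('yelp_academic_dataset_business.json', lines=True)
user = pd.read_json('yelp_academic_dataset_user.json', lines=True)
tip = pd.read_json('yelp_academic_dataset_tip.json', lines=True)
checkin = pd.read_json('yelp_academic_dataset_checkin.json', lines=True)

# Aggregate tips per business, e.g. an engineered compliment count.
tip_features = (tip.groupby('business_id')['compliment_count']
                   .sum().rename('n_compliments').reset_index())

# Reviews are the backbone; left joins keep every review.
warehouse = (
    review
    .merge(business, on='business_id', how='left', suffixes=('', '_business'))
    .merge(user, on='user_id', how='left', suffixes=('', '_user'))
    .merge(tip_features, on='business_id', how='left')
    .merge(checkin, on='business_id', how='left', suffixes=('', '_checkin'))
)
```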

For the text processing step, non-English reviews were first removed using langid for language detection, with dask for parallel processing. Non-word tokens were then removed, stopwords were filtered using nltk, and the remaining tokens were lemmatized with WordNetLemmatizer. Further EDA was conducted after processing.
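
A hedged sketch of this step, reusing the `warehouse` DataFrame from the previous sketch; the partition count and the exact cleaning rules are illustrative:

```python
import re

import dask.dataframe as dd
import langid
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    # Keep alphabetic tokens only, drop stopwords, lemmatize the rest.
    tokens = re.findall(r'[a-z]+', text.lower())
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens
                    if t not in stop_words)

# Parallelize language detection and cleaning across partitions with dask.
ddf = dd.from_pandas(warehouse, npartitions=8)
ddf['language'] = ddf['text'].map(lambda t: langid.classify(t)[0],
                                  meta=('language', 'object'))
ddf = ddf[ddf['language'] == 'en']
ddf['text_clean'] = ddf['text'].map(clean_review,
                                    meta=('text_clean', 'object'))
warehouse_clean = ddf.compute()
```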

Classification

  • Word2Vec from gensim to build the vocabulary -> classification with XGBoost, CatBoost and LightGBM on a GPU (sketch 1 below).
  • tokenizer from Keras with num_words=100000 and max_length=500 -> bidirectional LSTM in TensorFlow with embedding_size=300 and batch_size=256 (sketch 2 below).
  • Tokenization and vectorization using tf.keras.layers.TextVectorization with max_features=50000 and sequence_length=300 -> a classifier composed of an Embedding layer and GlobalAveragePooling1D, with Dropout(0.2) after each layer and a final Dense layer. The baseline embedding size was embedding_dim=32. The loss function was losses.BinaryCrossentropy with Adam as the optimizer at the default learning rate. The models were trained for 10 epochs with batch_size=1. Hyperparameters were tuned with grid and random search, tracked using Weights & Biases (sketch 3 below).
  • BoW and TF-IDF models were constructed in PyTorch by sampling 20,000 reviews from each target group, building a vocabulary with max_len=300 and max_vocab=10000, tokenizing with wordpunct_tokenize, removing rare words, and constructing the BoW and TF-IDF vectors. batch_size=1 was used for both models. A FeedforwardTextClassifier with two hidden layers performed the classification. The loss function was CrossEntropyLoss with Adam as the optimizer, learning_rate=6e-5 and scheduler=CosineAnnealingLR(optimizer, 1). Early stopping was triggered if the validation loss exceeded the validation loss of the three previous epochs (sketch 4 below).
  • A classifier built on BertModel.from_pretrained('bert-base-uncased') was constructed in PyTorch using BertTokenizer.from_pretrained('bert-base-uncased') with max_length=300. The architecture employed an initial Dropout(p=0.4) followed by a linear layer, another dropout, a ReLU and a final linear layer. The loss function was CrossEntropyLoss with the Adam optimizer and learning_rate=5e-6. batch_size=1 and batch_size=8 were tested in separate models (sketch 5 below).
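
Sketch 1, a minimal Word2Vec -> gradient-boosting pipeline over the cleaned text; the vector size, averaging scheme, label column and XGBoost parameters are assumptions for illustration:

```python
import numpy as np
import xgboost as xgb
from gensim.models import Word2Vec

# Tokenized documents from the cleaned warehouse (previous sketch).
tokenized = [doc.split() for doc in warehouse_clean['text_clean']]
w2v = Word2Vec(sentences=tokenized, vector_size=300, window=5,
               min_count=5, workers=4)  # vector size assumed

def doc_vector(tokens, model):
    # Average the vectors of in-vocabulary tokens; zeros if none are known.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([doc_vector(t, w2v) for t in tokenized])
y = warehouse_clean['label'].values  # binary sentiment target (assumption)

clf = xgb.XGBClassifier(tree_method='gpu_hist')  # GPU; device='cuda' in XGBoost >= 2.0
clf.fit(X, y)
```

The same document vectors can be fed to CatBoost and LightGBM with their respective GPU flags.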
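
Sketch 2, the Keras tokenizer feeding a bidirectional LSTM, using the stated num_words, max_length, embedding size and batch size; the LSTM width, epoch count and output layer are assumptions:

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = warehouse_clean['text_clean'].tolist()
y = warehouse_clean['label'].values  # binary sentiment target (assumption)

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=500)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100000, 300),                    # embedding_size=300
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),  # width assumed
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y, batch_size=256, epochs=10,        # epoch count illustrative
          validation_split=0.2)
```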
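
Sketch 3, the TextVectorization classifier with the stated layer order, embedding_dim=32, BinaryCrossentropy loss and 10 epochs; it reuses `texts` and `y` from sketch 2:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, losses

vectorize_layer = layers.TextVectorization(
    max_tokens=50000,             # max_features
    output_sequence_length=300)   # sequence_length
vectorize_layer.adapt(texts)

model = tf.keras.Sequential([
    vectorize_layer,
    layers.Embedding(50000 + 1, 32),   # embedding_dim=32
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1),
])
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',  # default learning rate
              metrics=['accuracy'])
model.fit(np.array(texts), y, epochs=10, batch_size=1)
```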
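
Sketch 4, the PyTorch feedforward classifier over BoW/TF-IDF vectors with the stated loss, optimizer, learning rate and scheduler; the hidden-layer widths and the BoW construction are assumptions:

```python
import torch
import torch.nn as nn
from nltk.tokenize import wordpunct_tokenize

class FeedforwardTextClassifier(nn.Module):
    """Two hidden layers over a BoW or TF-IDF vector (widths assumed)."""
    def __init__(self, vocab_size=10000, hidden1=64, hidden2=32, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden1), nn.ReLU(),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
            nn.Linear(hidden2, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def bow_vector(text, vocab):
    # Term counts over a fixed vocabulary mapping {token: index}.
    vec = torch.zeros(len(vocab))
    for token in wordpunct_tokenize(text.lower()):
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

model = FeedforwardTextClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 1)
```

The TF-IDF variant differs only in how the input vector is weighted; the training loop with batch_size=1 and early stopping follows the usual PyTorch pattern.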
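
Sketch 5, the BERT classifier with the stated tokenizer, max_length, dropout probability and learning rate; the width of the intermediate linear layer is an assumption:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    def __init__(self, hidden=64, num_classes=2):  # hidden width assumed
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Dropout -> linear -> dropout -> ReLU -> final linear, per the README.
        self.drop1 = nn.Dropout(p=0.4)
        self.fc1 = nn.Linear(self.bert.config.hidden_size, hidden)
        self.drop2 = nn.Dropout(p=0.4)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = self.drop1(out.pooler_output)
        x = self.relu(self.drop2(self.fc1(x)))
        return self.fc2(x)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
enc = tokenizer(['Great food and friendly service!'], padding='max_length',
                truncation=True, max_length=300, return_tensors='pt')

model = BertClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
logits = model(enc['input_ids'], enc['attention_mask'])
```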
