Skip to content

Topic modeling for Disney reviews using HDBSCAN, Laten Dirichlet Allocation(LDA) and ChatGPT Prompt Engineering

License

Notifications You must be signed in to change notification settings

aimee0317/topic_modeling_Disney_reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Topic Modeling Disney Reviews

  • Author: Amelia Tang

About

Disney land is regarded as "The Happiest Place on Earth" by many according to U.S. News Travel (Anonymous 2021). This Natural Language Processing (NLP) project is to visualize and analyze the Disneyland Reviews data set I collected from Kaggle. The dataset comprises approximately 42,000 reviews of three Disneyland locations: Paris, California, and Hong Kong. To facilitate data visualization, I created a Tableau story that provides an overview of ratings by branches and reviewer locations. For text data preprocessing, I performed basic text cleaning (e.g., removing digits), as well as stemming and lemmatization. In the project, I utilized two methods, TextBlob and NLTK VADER, for sentiment polarity analysis and compared the polarity scores assigned by both methods.


For topic modeling, I utilized Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) and Latent Dirichlet Allocation (LDA) models to identify clusters/topics. However, the topics are hard to interpret as both models do not output semantic topics.


Finally, I used prompt engineering and the ChatGPT gpt-3.5-turbo API to conduct sentiment/polarity analysis and topic inference. The results of sentiment analysis generated by ChatGPT generally agree with the results generated by TextBlob and VADER, implemented using NLTK. Among the Disney reviews, it seems that there are some popular topics, such as fireworks, queues, rides, and staff.

Reports

EDA report (Pre-processing and sentiment analysis) can be found here.
NLP topic modeling report (LDA and HDBSCAN) can be found here.
ChatGPT Prompt Engineering sentiment analysis and topic inferring report can be found here.

Visualizations

Tableau

Tableau Demo

WordCloud

Unigram WordCloud

Unigram WordCloud

Bigram WordCloud

Bigram WordCloud

HDBSCAN Clusters

HDBSCAN Clusters

Usage

Creating the environment

At the project root conda env create --file tm_disney.yaml


Run the following command from the environment where you installed JupyterLab.

conda install nb_conda_kernels

Dependencies

A complete list of dependencies is available here.
Python 3.10.12 and Python packages:
- hdbscan==0.8.29
- openai==0.27.8
- umap==0.1.1
- nltk=3.7=pyhd3eb1b0_0
- textblob=0.15.3=py_0

References

Anonymous. 2021. “Disneyland Resort Reviews | U.S. News Travel.” "https://travel.usnews.com/Anaheim_Disneyland_CA/Things_To_Do/Disneyland_62335/".

About

Topic modeling for Disney reviews using HDBSCAN, Laten Dirichlet Allocation(LDA) and ChatGPT Prompt Engineering

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published