# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric.
- You will be writing a report describing and discussing these accomplishments.


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names


- Caitlin Connolly
- Carolyn Yatco
- Abby Koornwinder
- Joshua Widjanarko
- Caike Campana

# Abstract 
Our project attempts to predict a hotel's average score from its reviews. We will use the “515K Hotel Reviews Data in Europe” dataset from Kaggle. This dataset contains 515,000 reviews of 1,493 different hotels. Each entry includes the review text, separated into its positive and negative portions, the word count of each portion, the location of the hotel, and the number of days since the review. In addition, we will perform sentiment analysis on the positive and negative portions to measure their degree of positivity and negativity. After the sentiment analysis, we will test different models (OLS, K-Nearest Neighbors, etc.) and determine which model has the highest accuracy.


# Background
Beyond location, price, and amenities, consumers form their perception of a prospective hotel based on the experiences, or reviews, of previous guests. For example, according to TripAdvisor, a site for hotel and restaurant reviews and booking, roughly 81% of people always or frequently check a hotel’s reviews before booking <a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). However, not all reviews and experiences are weighted equally. Sparks and Browning observed that how information is framed within a review, as well as the focus of the review itself, makes the biggest difference in how much a review affects the overall view of a hotel <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). This raises the question of whether a computer can read these reviews and determine the trust (in the form of ratings) that a hotel has earned, in a way similar to how humans prioritize certain ideas and information and weigh the overall sentiment. Sentiment analysis, the extraction of emotion, feeling, and other subjective states from text, has been applied to a wide range of reviews, from movies on Netflix to restaurants <a name="sota"></a>[<sup>[3]</sup>](#sotanote). However, most work on reviews ends with extracting the overall positivity or negativity of a review, without testing whether it can be used to predict or gain insight into the overall rating of the hotel, movie, or other item being reviewed.


 

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

We are going to use the dataset [515K Hotel Reviews Data in Europe](https://www.kaggle.com/datasets/jiashenliu/515k-hotel-reviews-data-in-europe) from the Kaggle repository, published by the user Jiashen Liu.

The dataset contains 515,000 reviews of 1,493 hotels throughout Europe and was collected from the hotel booking website [Booking.com](https://www.booking.com). It has 17 features; among the most useful are the average score, hotel name, reviewer nationality, geographical latitude, geographical longitude, days since the review, negative word count, and positive word count.

Each observation contains one customer review with all of the features. Although the data is clean and has no missing values, we need to process the review text to obtain a quantifiable metric.


# Proposed Solution

Our proposed solution to predicting a hotel’s average score and an individual’s score of a hotel will have multiple parts: preprocessing and data cleaning, sentiment analysis of written reviews, determining the best model for predicting scores, and finally, training, testing, and evaluation. <br>

To begin, here are the libraries we plan on using. <br>
import numpy as np<br>
import pandas as pd<br>
from nltk.corpus import stopwords<br>
from wordcloud import STOPWORDS<br>
import matplotlib.pyplot as plt<br>
import string<br>

To preprocess and clean our data, we have to remove any entries with missing values, as well as entries from reviewers who gave only a numerical score without writing a review. Since we plan on using sentiment analysis on the written reviews, entries that provide no information beyond a score are not useful to us. We then need to remove any columns that correspond to features we do not need. <br>
Thus, the variables we plan on using are:

- the review date
- the average hotel score
- the hotel name
- the reviewer nationality
- the negative review
- the positive review
- the number of reviews the reviewer has given in the past
- the total number of valid reviews that the hotel has
- the tags the reviewer has given the hotel
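The cleaning steps described above could be sketched roughly as follows with pandas. The column names are our assumption about how the Kaggle CSV is labeled and should be checked against the actual file, and the three-row table is made-up data included only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Assumed column names for the nine variables listed above (verify against the real CSV).
KEEP_COLUMNS = [
    "Review_Date", "Average_Score", "Hotel_Name", "Reviewer_Nationality",
    "Negative_Review", "Positive_Review",
    "Total_Number_of_Reviews_Reviewer_Has_Given", "Total_Number_of_Reviews", "Tags",
]
# Assumed word-count columns, used here only to detect score-only entries.
NEG_COUNT = "Review_Total_Negative_Word_Counts"
POS_COUNT = "Review_Total_Positive_Word_Counts"


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, drop score-only reviews, and keep the chosen columns."""
    # Remove entries with a missing value in any column we rely on.
    df = df.dropna(subset=KEEP_COLUMNS + [NEG_COUNT, POS_COUNT])
    # Keep only entries where the reviewer wrote at least one word.
    has_text = (df[NEG_COUNT] > 0) | (df[POS_COUNT] > 0)
    return df.loc[has_text, KEEP_COLUMNS].reset_index(drop=True)


# Made-up rows: one complete review, one score-only entry, one with a missing value.
toy = pd.DataFrame({
    "Review_Date": ["8/3/2017", "8/3/2017", "7/31/2017"],
    "Average_Score": [7.7, 7.7, 8.4],
    "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B"],
    "Reviewer_Nationality": ["Ireland", "France", np.nan],
    "Negative_Review": ["Room was small", "", "Noisy street"],
    "Positive_Review": ["Friendly staff", "", "Good breakfast"],
    "Total_Number_of_Reviews_Reviewer_Has_Given": [7, 2, 1],
    "Total_Number_of_Reviews": [1403, 1403, 980],
    "Tags": ["Leisure trip", "Couple", "Solo traveler"],
    NEG_COUNT: [3, 0, 2],
    POS_COUNT: [2, 0, 2],
    "lat": [52.36, 52.36, 48.85],  # example of a column we would drop
})

cleaned = clean_reviews(toy)
print(cleaned)
```

Using the word counts to find score-only entries is one possible choice; if the real file marks empty reviews with a placeholder string instead, the filter would need to check for that string.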


In order to perform sentiment analysis on the reviews, we must first clean the text: removing stopwords, removing punctuation, and converting all letters to lowercase. We then plan to perform sentiment classification on the written reviews, which will turn the text into quantitative features that are easier to feed into our hotel rating prediction model. One way to do this is to use logistic regression to determine how positive or negative a review is. <br>
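As a rough illustration of the text cleaning and logistic-regression sentiment scoring described above, here is a minimal sketch using only NumPy. Several pieces are assumptions made for the example: the short stopword set stands in for the NLTK list we plan to use, the bag-of-words features and plain gradient descent are one simple way to fit the model, and the training sentences are made up. The labels come from the dataset's structure: text from the positive portion is labeled 1 and text from the negative portion is labeled 0.

```python
import string
import numpy as np

# Small illustrative stopword set (assumption); the project plans to use NLTK's list.
STOPWORDS = {"the", "a", "an", "was", "were", "is", "and", "to", "of", "in", "it", "very"}


def clean_text(text: str) -> list:
    """Lowercase, strip ASCII punctuation, and drop stopwords; returns a list of tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def bag_of_words(docs, vocab):
    """Count how often each vocabulary word appears in each tokenized document."""
    X = np.zeros((len(docs), len(vocab)))
    for i, tokens in enumerate(docs):
        for tok in tokens:
            if tok in vocab:  # words unseen during training are ignored
                X[i, vocab[tok]] += 1.0
    return X


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


# Made-up training text: positive portions are labeled 1, negative portions 0.
positive_parts = ["The staff was friendly and helpful!",
                  "Great location, very clean room.",
                  "Lovely breakfast and comfortable bed."]
negative_parts = ["The room was dirty and noisy.",
                  "Rude staff, broken shower.",
                  "Tiny room, terrible breakfast."]

train_docs = [clean_text(t) for t in positive_parts + negative_parts]
labels = np.array([1.0] * len(positive_parts) + [0.0] * len(negative_parts))
vocab = {tok: i for i, tok in enumerate(sorted({t for d in train_docs for t in d}))}
w, b = fit_logistic(bag_of_words(train_docs, vocab), labels)

# Degree of positivity (between 0 and 1) for two new made-up review snippets.
new_reviews = ["Friendly staff and a clean, comfortable room",
               "Dirty shower and rude, noisy neighbours"]
X_new = bag_of_words([clean_text(t) for t in new_reviews], vocab)
scores = 1.0 / (1.0 + np.exp(-(X_new @ w + b)))
print(scores)
```

The predicted probability can then serve as the numeric "degree of positivity" for a review portion. On the full dataset a library implementation would likely be a better choice than this hand-written loop.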
Next, we want to determine the model that best predicts the hotel ratings. The inputs to this model will be the nine variables listed above. Some potential models include OLS and KNN. For instance, if we went with KNN, we could take the ten nearest neighbors and use the mean of their scores as the prediction. <br>
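The KNN idea above, averaging the scores of the ten nearest neighbors, can be sketched in a few lines of NumPy. The two numeric features and the synthetic scores are made up for illustration, and Euclidean distance is an assumption; the real inputs would first need to be encoded as numbers.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=10):
    """Predict each query's score as the mean score of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


rng = np.random.default_rng(0)
# Made-up features (for example, positivity and negativity of a review) in [0, 1].
X_train = rng.uniform(0, 1, size=(200, 2))
# Synthetic scores: higher with the first feature, lower with the second.
y_train = 5 + 4 * X_train[:, 0] - 3 * X_train[:, 1]

X_query = np.array([[0.9, 0.1], [0.1, 0.9]])
preds = knn_predict(X_train, y_train, X_query, k=10)
print(preds)
```

Because KNN relies on distances, features on very different scales (such as review counts versus sentiment scores) would probably need to be standardized before using this approach.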
To split our data into training and test sets, we will randomly assign 80% of the entries to training and 20% to testing. Once we have trained our model, we will test it, compute our evaluation metrics, and plot our results using Matplotlib.
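The split, train, and evaluate loop described above might look like the following sketch, shown here with OLS. The data is synthetic, and RMSE and MAE are used only as example metrics; they are assumptions for this illustration, not a final choice. Plotting is left out to keep the snippet short.

```python
import numpy as np


def random_split(X, y, test_fraction=0.2, seed=0):
    """Randomly assign rows to a training set and a test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]


# Synthetic data standing in for the real features and scores.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
y = 5 + 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.3, size=500)

X_tr, X_te, y_tr, y_te = random_split(X, y, test_fraction=0.2)
coef = fit_ols(X_tr, y_tr)
pred = predict_ols(coef, X_te)

# Example error metrics computed on the held-out 20%.
rmse = np.sqrt(np.mean((y_te - pred) ** 2))
mae = np.mean(np.abs(y_te - pred))
print(f"train={len(y_tr)} test={len(y_te)} RMSE={rmse:.3f} MAE={mae:.3f}")
```

Fixing the random seed keeps the split reproducible, so different models can be compared on exactly the same test entries.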


# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Ethics & Privacy

Before beginning this project, our team acknowledged the potential ethics and privacy concerns that may arise from our data and implementation. First, the dataset we plan on using was legally obtained: it consists of data from Booking.com that has been made publicly available. We found this collection on Kaggle.com, posted by Jiashen Liu, who curated the dataset and released it into the public domain, so we are free to copy and modify it as we intend to do.

It is also important to acknowledge the potential biases in the dataset. Most notably, each entry includes the nationality of the reviewer, one of the many features in the dataset that we could use to predict the reviewer’s score of the hotel. We will keep this in mind, as it can introduce bias into our predictions.

Our team has also considered the privacy of the reviewers. The dataset contains no identifying personal features other than the reviewer’s nationality. Although specific written reviews may contain personal information, all of it was freely given by the individual, so the anonymity of individuals is maintained. One way we can address these ethical and privacy concerns is by using an ethics checklist that covers important ethical considerations in data collection, modeling, and analysis. A useful tool for adding such a checklist to our project is the command-line tool Deon.


# Team Expectations 

Our team expectations are intended to be in accordance with COGS 118A policies and guidelines. Each member is responsible for participating equally in all aspects of the project and for communicating if any conflicts or difficulties arise. Meeting times will be planned around personal schedules, and each member commits to attending the meetings or making up any work if they are unable to attend. To keep the project on schedule, each team member must be attentive to group conversations and stay in contact about any problems, ideas, or thoughts that could contribute to the project. When work is divided among team members, each member is responsible for contributing equally and finishing their tasks in a timely manner. Additionally, each team member will be responsible for reviewing deliverables and communicating feedback before they are turned in. If any conflicts arise, they are expected to be handled in a professional manner with consideration for all team members. Overall, each team member is expected to contribute equally and communicate actively with the rest of the team.


# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic (Pelé) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbaur)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data, do some EDA (Maradonna) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  | 




# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz):  TripAdvisor. (2019, July 16). Online reviews remain a trusted source of information when booking trips, reveals New Research. Retrieved April 24, 2022, from https://www.prnewswire.com/news-releases/online-reviews-remain-a-trusted-source-of-information-when-booking-trips-reveals-new-research-300885097.html 
<br> 
<a name="admonishnote"></a>2.[^](#admonish): Sparks, B.A., & Browning, V. (2011). The impact of online reviews on hotel booking intentions and perception of trust. Tourism Management, 32, 1310-1323.
<br>
<a name="sotanote"></a>3.[^](#sota): An example of one of these uses and how they approached sentiment analysis: https://towardsdatascience.com/customer-reviews-analysis-using-nlp-the-netflix-use-case-92b3645770e1