# Airbnb Sentiment Analysis

In this notebook we perform sentiment analysis on the name of each Airbnb listing and save the predicted sentiment label in our data frame as a new variable, to be used as a feature in a different notebook.


We will use `distilbert/distilbert-base-uncased-finetuned-sst-2-english` as our sentiment analysis model.
https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english#how-to-get-started-with-the-model


## Load data

In [None]:
import pandas as pd
from google.colab import drive
from tabulate import tabulate

drive.mount('/content/drive')

# Load data into a data frame
df = pd.read_csv('/content/drive/My Drive/Learning/Datasets/Airbnb_Open_Data.csv')

# Make all headers lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# View all columns
pd.set_option('display.max_columns', None)
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




Unnamed: 0,id,name,host_id,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,country,country_code,instant_bookable,cancellation_policy,room_type,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,US,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


## Sentiment Analysis

Let's try to perform sentiment analysis on `name`.

In [None]:
!pip install transformers==4.31.0
from transformers import pipeline, DistilBertForSequenceClassification



Create a function to determine the sentiment.

In [None]:
# Build a sentiment-analysis pipeline from the pre-trained DistilBERT SST-2 checkpoint
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def analyze_sentiment(text):
  # str() guards against non-string values; note that a missing name (NaN)
  # becomes the string 'nan' and is still classified like any other text
  result = sentiment_pipeline(str(text))[0]
  return result['label']  # 'POSITIVE' or 'NEGATIVE'

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
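
One caveat with `str(text)`: a listing with a missing name (such as row 3 in the preview above) is turned into the literal string `'nan'` and still receives a label. A minimal sketch of a variant that skips missing or blank names is shown here. The helper name `analyze_sentiment_safe` and its `classifier` parameter are assumptions for illustration and are not part of the original notebook; `classifier` is expected to behave like `sentiment_pipeline`.

```python
import pandas as pd

def analyze_sentiment_safe(text, classifier):
    """Return the classifier's label, or None when the name is missing or blank."""
    # pd.isna covers both None and float NaN, which is how pandas stores empty CSV cells
    if pd.isna(text) or not str(text).strip():
        return None
    # classifier is assumed to return a list like [{'label': 'POSITIVE', 'score': 0.99}]
    return classifier(str(text))[0]['label']
```

It could be applied with `df['name'].apply(analyze_sentiment_safe, classifier=sentiment_pipeline)`. Note that this would change the counts reported below, since rows without a name would no longer be labeled.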


Run it on the first 1,000 names as a quick test.

In [None]:
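first_1000_names = df['name'].head(1000)
sentiments = first_1000_names.apply(analyze_sentiment)

df['name_sentiment'] = None
# .loc slicing is label-based and inclusive, so :999 covers the first 1000 rows of the default index
df.loc[:999, 'name_sentiment'] = sentiments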
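
The `df.loc[:999, ...]` slice works here because the data frame still has its default 0, 1, 2, ... index. A small sketch of an alternative that assigns by the sample's own index labels, so it would also hold after filtering or re-indexing, is shown below. The toy frame `df_demo` and the dictionary lookup standing in for `analyze_sentiment` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the Airbnb data; the index is deliberately not 0..n-1
df_demo = pd.DataFrame({'name': ['a', 'b', 'c', 'd']}, index=[10, 11, 12, 13])

sample = df_demo['name'].head(2)
# Stand-in for sample.apply(analyze_sentiment), so the sketch runs without the model
labels = sample.map({'a': 'POSITIVE', 'b': 'NEGATIVE'})

df_demo['name_sentiment'] = None
# Assign by the sample's own index labels rather than a hard-coded slice
df_demo.loc[sample.index, 'name_sentiment'] = labels
```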

In [None]:
# Count the POSITIVE and NEGATIVE labels; every remaining row, including rows not yet classified, is counted as "other"

positive_count = df['name_sentiment'].value_counts().get('POSITIVE', 0)
negative_count = df['name_sentiment'].value_counts().get('NEGATIVE', 0)
other_count = len(df['name_sentiment']) - positive_count - negative_count

print(f"Positive: {positive_count}")
print(f"Negative: {negative_count}")
print(f"Other: {other_count}")


Positive: 902
Negative: 98
Other: 101599
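
The large "Other" figure is made up of the rows that have not been classified yet (their `name_sentiment` is still `None`), not of a third sentiment class. A minimal sketch that reports unclassified rows separately is shown below, using a toy series `s` in place of `df['name_sentiment']`; the variable names are illustrative.

```python
import pandas as pd

# Toy column: two classified rows and two rows not yet classified (None)
s = pd.Series(['POSITIVE', 'NEGATIVE', None, None], dtype=object)

counts = s.value_counts()              # missing values are excluded by default
positive = int(counts.get('POSITIVE', 0))
negative = int(counts.get('NEGATIVE', 0))
unclassified = int(s.isna().sum())     # rows that were never passed to the model
other = len(s) - positive - negative - unclassified  # any unexpected label

print(f"Positive: {positive}")
print(f"Negative: {negative}")
print(f"Unclassified: {unclassified}")
print(f"Other: {other}")
```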


It looks like it works, so let's run it on the entire data frame and save the results. **Warning: This will take about 2 hours to run.**

In [None]:
df['name_sentiment'] = df['name'].apply(analyze_sentiment)
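
Calling the model one row at a time is a likely contributor to the long runtime. A hedged sketch of one possible alternative is shown below: classify each distinct name once, pass the names to the classifier as a list with a `batch_size`, and map the labels back to every row. The helper `classify_unique` is hypothetical and not part of the original notebook, the speed-up was not measured here, and it only helps with duplicates if names actually repeat in the data. Unlike `analyze_sentiment`, it leaves rows with a missing name unlabeled.

```python
import pandas as pd

def classify_unique(names, classifier, batch_size=32):
    """Label each distinct non-missing name once, then map labels back to every row."""
    unique_names = names.dropna().astype(str).unique().tolist()
    # classifier is assumed to accept a list of strings plus a batch_size keyword
    # and to return one dict with a 'label' key per input string
    results = classifier(unique_names, batch_size=batch_size)
    label_by_name = {name: res['label'] for name, res in zip(unique_names, results)}
    # Rows with a missing name stay NaN instead of being classified as 'nan'
    return names.astype(str).where(names.notna()).map(label_by_name)
```

With the pipeline defined earlier this could be called as `df['name_sentiment'] = classify_unique(df['name'], sentiment_pipeline)`, since Transformers pipelines accept a list of texts and a `batch_size` argument.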

Save the data frame to Google Drive so it can be used in the Airbnb.ipynb notebook.

In [None]:
df.to_csv('/content/drive/My Drive/Learning/Datasets/Airbnb_Open_Data_with_sentiment.csv', index=False)