In [1]:
# Author: Afif Shomali
# Imports & loading dataset
import pandas as pd
import numpy as np
import ast
import re

df = pd.read_csv("Datasets/AirbnbData/All_Listings_Cleaned.csv")

# Feature Engineering for the Airbnb Dataset
Source: Inside Airbnb, accessed at https://insideairbnb.com/get-the-data/ (New York City listings and reviews datasets)  
License: [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

The listings dataset used here combines multiple months of data from Inside Airbnb. Additionally, data preprocessing and cleaning were performed; see `PreProcessingAirbnb.ipynb` for the changes that were made.


Steps overview:
  - [Create features based on the host verifications](#host-verification-status)
  - [Create features based on the amenities list](#amenities-column)
  - [Using the reviews data, get the sentiment of each review, then create a review score column for each host (an average or another metric based on sentiment counts), and impute values for hosts with no reviews](#creating-column-based-on-sentiment-of-reviews)
  - Extra: Try to create a score from the text of the description, title, host description, and neighborhood overview columns, using a metric that combines them into a new column
  - Extra: Create a feature that rates the listing photo

## Host Verification Status

Since there are only 3 types of verification methods (phone, email, work_email), we can either create 3 binary columns or create a single column that counts the number of verification methods. Some EDA will be done later to check which of the two options is better by comparing each against the 2 response variables used later in the analysis: price and whether the host is a superhost.

In [2]:
# Convert the column of stringified lists into a column of Python lists
df["host_verifications"] = df["host_verifications"].apply(ast.literal_eval)

# Create column for number of verifications the host has
df["num_host_verifications"] = df["host_verifications"].apply(len)

# Create columns for which types of verifications the host has
df["host_phone_verified"] = df["host_verifications"].apply(lambda x: 'phone' in x)
df["host_email_verified"] = df["host_verifications"].apply(lambda x: 'email' in x)
df["host_work_email_verified"] = df["host_verifications"].apply(lambda x: 'work_email' in x)
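The parsing step above assumes every `host_verifications` entry is a valid list literal. A minimal sketch of a more defensive parser, assuming that missing or malformed entries should count as "no verifications" (`parse_verifications` is a hypothetical helper, not part of this notebook):

```python
import ast

def parse_verifications(raw):
    """Parse a stringified list such as "['email', 'phone']" into a list.

    Assumption: anything missing or unparseable becomes an empty list
    instead of raising an exception.
    """
    if not isinstance(raw, str):
        return []  # e.g. NaN or None for a missing value
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Illustrative inputs only, not rows from the dataset
samples = ["['email', 'phone']", "[]", None, "not a list"]
parsed = [parse_verifications(s) for s in samples]
counts = [len(p) for p in parsed]            # number of verification methods
has_phone = ["phone" in p for p in parsed]   # binary flag for one method
```

It could be swapped in with `df["host_verifications"].apply(parse_verifications)` if the raw column turns out to contain bad values.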


## Amenities Column 

We see that there are over 7000 distinct amenity strings, so we have a few options. We can make a column that counts the number of amenities, or,  
since a lot of the amenities in the set describe the same amenity with some variation (e.g. HDTV with streaming services, kitchen, oven, etc.), we can pick a subset of amenities and create binary columns indicating whether a listing has each of them.  
For example, most of the elements of the amenities set below are just HDTV followed by a list of streaming services, so we can make columns for whether a listing has a TV or offers a certain streaming service.  
We can also collapse specific brands of appliances into a single column each (e.g. oven, stove, fridge), and do the same for items such as shampoo, soap & conditioner.

Here is the list of amenities we will create columns for after going through the set of amenities:
- kitchen 
- oven
- stove
- refrigerator
- air conditioning
- sound system
- wifi
- tv
- parking
- gym/exercise equipment 
- pool
- hygiene products (soap, shampoo, or conditioner)
- laundry (if a listing contains a washer or dryer)
- coffee 
- view
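Collapsing the many variants into a few columns comes down to keyword matching. A minimal sketch of whole-word matching with regular expressions, using a hypothetical helper `has_keyword`; the amenity strings `"backyard"` and `"hair dryer"` are illustrative and not taken from the dataset:

```python
import re

def has_keyword(amenity, keyword):
    """Return True if keyword occurs in amenity as a whole word or phrase."""
    return re.search(r"\b" + re.escape(keyword) + r"\b", amenity) is not None

checks = {
    # "tv" stands alone at the end of the string, so it matches
    ("100 inch hdtv with apple tv", "tv"): None,
    # "tv" is buried inside "hdtv", so a separate "hdtv" keyword is needed
    ("1 inch hdtv", "tv"): None,
    ("1 inch hdtv", "hdtv"): None,
    # leading punctuation does not get in the way
    (". body soap", "soap"): None,
    # word boundaries stop "ac" from matching inside a longer word
    ("backyard", "ac"): None,
    # caveat: a whole-word match can still be the wrong kind of item
    ("hair dryer", "dryer"): None,
}
for (amenity, keyword) in checks:
    checks[(amenity, keyword)] = has_keyword(amenity, keyword)
```

The last case suggests that a keyword such as `dryer` may need a more specific pattern if hair dryers should not count towards laundry.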


In [3]:
import pprint

# Collect every distinct amenity string (lowercased, stripped) across all listings
amenities_set = set()

for raw in df["amenities"]:
    amen_list = ast.literal_eval(raw)
    amen_list = [amen.lower().strip() for amen in amen_list]
    amenities_set.update(amen_list)

print(len(amenities_set))
pprint.pprint(amenities_set)

7343
{'- oven',
 '- refrigerator',
 '-- conditioner',
 '-- shampoo',
 '. body soap',
 '. conditioner',
 '. refrigerator',
 '1 bottle of travel sized body wash  body soap',
 '1 inch hdtv',
 '1 inch hdtv with amazon prime video, apple tv, netflix, roku',
 '1 inch hdtv with amazon prime video, chromecast, hulu, netflix, roku',
 '1 refrigerator',
 '100 inch hdtv with amazon prime video, apple tv, disney+, hbo max, hulu, '
 'netflix',
 '100 inch hdtv with amazon prime video, apple tv, disney+, hbo max, hulu, '
 'netflix, premium cable',
 '100 inch hdtv with amazon prime video, apple tv, disney+, hbo max, netflix, '
 'standard cable',
 '100 inch hdtv with amazon prime video, apple tv, hbo max, hulu, netflix',
 '100 inch hdtv with amazon prime video, apple tv, netflix, premium cable, '
 'standard cable',
 '100 inch hdtv with amazon prime video, disney+, hbo max, hulu, netflix',
 '100 inch hdtv with apple tv',
 '100 inch hdtv with chromecast',
 '100 inch hdtv with fire tv, netflix, amazon prim

In [4]:
# First convert the amenities column to lists of lowercase strings with no leading/trailing spaces
df["amenities"] = df["amenities"].apply(ast.literal_eval).apply(lambda x: [amen.lower().strip() for amen in x])

# Create total amenity count column
df["num_amenities"] = df["amenities"].apply(len)

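# Create individual amenity binary columns
# Helper function: flag each amenity column whose keywords match any of the row's amenities
def populate_amens(row):
    for amen, keywords in short_amenities.items():
        for keyword in keywords:
            # \b word boundaries keep short keywords like "ac" from matching inside longer words
            if any(re.search(r'\b' + re.escape(keyword) + r'\b', row_amens) for row_amens in row["amenities"]):
                row[amen] = True
                break  # one matching keyword is enough for this column
    return row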

# Dict of amenities to make columns for and their corresponding keywords  
short_amenities = {
    "kitchen": ["kitchen"], 
    "oven": ["oven"], 
    "stove": ["stove"], 
    "refrigerator": ["refrigerator", "fridge"], 
    "air conditioning": ["air conditioning", "ac"], 
    "sound system": ["sound system"], 
    "wifi": ["wifi"], 
    "tv": ["hdtv", "tv"], 
    "parking": ["parking", "garage"], 
    "gym/exercise equipment": ["gym", "exercise equipment"], 
    "pool" : ["pool"], 
    "hygiene products": ["soap", "shampoo", "conditioner"], 
    "laundry" : ["washer", "dryer"], 
    "coffee" : ["coffee"], 
    "view" : ["view"]
}

# Initialise every amenity column to False, then fill it in row by row
for amen in short_amenities:
    df[amen] = False

df = df.apply(populate_amens, axis=1)
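Row-wise `df.apply(..., axis=1)` rebuilds every row as a Series, which can be slow on a dataset of this size. A possible alternative, sketched on a toy DataFrame with a shortened keyword dict (both are stand-ins for illustration), compiles one pattern per column and applies it only to the `amenities` Series:

```python
import re
import pandas as pd

# Shortened stand-in for the keyword dictionary
keywords = {
    "tv": ["hdtv", "tv"],
    "laundry": ["washer", "dryer"],
}

# Toy listings; each row holds an already-parsed, lowercased amenity list
toy = pd.DataFrame({
    "amenities": [
        ["1 inch hdtv", "wifi"],
        ["washer", "kitchen"],
        [],
    ]
})

for column, words in keywords.items():
    # One alternation per column, e.g. \b(?:hdtv|tv)\b
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    toy[column] = toy["amenities"].apply(
        lambda amens, p=pattern: any(p.search(a) for a in amens)
    )
```

The alternation matches whenever any single keyword would match on its own, so the resulting flags should agree with the row-wise version.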

In [5]:
df.head()

Unnamed: 0,id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_name,host_since,host_location,...,sound system,wifi,tv,parking,gym/exercise equipment,pool,hygiene products,laundry,coffee,view
0,739333866230665371,2024-09-04,Lovely room 2 windows tv work desk ac included,"Lovely vocation room, has work desk , tv, 2 wi...",Blank,https://a0.muscache.com/pictures/miso/Hosting-...,3013025,Suada,2012-07-21,"New York, NY",...,False,True,True,False,False,False,False,False,False,False
1,572612125615500056,2024-09-04,Room by Sunny & Bay! Sunset Park & Bay Ridge,Cozy room in a charming Sunset Park apartment....,Blank,https://a0.muscache.com/pictures/5f44a178-6043...,358089614,Joshua,2020-07-23,"New York, United States",...,False,True,True,True,True,False,False,True,True,False
2,45267941,2024-09-04,Private Room in Luxury Apartment,Blank,Blank,https://a0.muscache.com/pictures/3c15a88e-b08a...,39162543,Jeff,2015-07-21,"New York, United States",...,False,True,True,False,False,False,False,True,False,False
3,838141198693830649,2024-09-04,Modern renovated huge apartment,Blank,Blank,https://a0.muscache.com/pictures/prohost-api/H...,148571080,David,2017-08-31,"New York, NY",...,False,False,False,False,False,False,False,True,False,False
4,1082660771919357919,2024-09-04,Summertime Park Slope townhouse,425 10th Street is what dreams are made of! S...,Blank,https://a0.muscache.com/pictures/hosting/Hosti...,394869975,Betty,2021-03-30,"Queens, NY",...,False,True,True,False,False,False,False,True,False,False


## Creating Column Based on Sentiment of Reviews

We start by combining the reviews datasets and omitting duplicate reviews using the `id` column of the reviews data.  
Then we do some partial pre-processing.
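A minimal sketch of the combine-and-deduplicate step, assuming each monthly reviews file shares the columns `listing_id`, `id`, and `comments`. The two small frames below are made-up stand-ins for the monthly files, and the cleanup shown is only one possible form of pre-processing:

```python
import pandas as pd

# Made-up stand-ins for two monthly review files with an overlapping review id
month_a = pd.DataFrame({
    "listing_id": [2595, 2595],
    "id": [17857, 19176],
    "comments": ["Great stay.\r<br/>Would return.", "Great experience."],
})
month_b = pd.DataFrame({
    "listing_id": [2595, 2595, 2595],
    "id": [19176, 19760, 34320],
    "comments": ["Great experience.", "Lovely host.", None],
})

reviews = (
    pd.concat([month_a, month_b], ignore_index=True)
    .drop_duplicates(subset="id")   # a review id identifies one review
    .dropna(subset=["comments"])    # nothing to score without text
    .reset_index(drop=True)
)

# Replace the line-break residue seen in the raw comments with a space
reviews["comments"] = (
    reviews["comments"].str.replace("\r<br/>", " ", regex=False).str.strip()
)
```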

We will use VADER to score the sentiment of the reviews:  
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
Sentiment Analysis of Social Media Text. Eighth International Conference on
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

I roughly followed the steps in:
https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517 

Some of the reviews are not in English, and VADER can only do sentiment analysis on English text,  
so we will use the Python `langdetect` library together with the Google Translate API to work around this.  
This differs from the approach in the GitHub code linked above, which uses the MyMemory Translation Service.  
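One way to structure the detect-then-translate step is to keep the routing logic separate from the backends. The sketch below uses a hypothetical helper `to_english` that takes the detector and translator as arguments; `langdetect.detect` could be passed as the detector if that library is installed, while the stand-in functions here only exist to make the example runnable:

```python
def to_english(text, detect_language, translate):
    """Return text in English, translating only when it is detected as non-English.

    detect_language: callable taking a str and returning a code such as "en".
    translate: callable taking (text, source_language) and returning English text.
    """
    try:
        language = detect_language(text)
    except Exception:
        # Assumption: if detection fails (e.g. blank or emoji-only text),
        # keep the original text unchanged.
        return text
    if language == "en":
        return text
    return translate(text, language)


# Stand-in backends for illustration only; they are not real services.
def fake_detect(text):
    if not text.strip():
        raise ValueError("no text to analyse")
    return "fr" if "séjour" in text else "en"

def fake_translate(text, source_language):
    return f"[translated from {source_language}] {text}"

english = to_english("Great experience.", fake_detect, fake_translate)
french = to_english("Notre séjour de trois nuits.", fake_detect, fake_translate)
blank = to_english("   ", fake_detect, fake_translate)
```

Injecting the backends also makes it easy to cache translations or swap services without touching the sentiment code.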

Some hosts will not have any reviews. This won't be an issue, since I plan to use the average compound sentiment score, 
which is a normalized score from -1 to 1 with zero being neutral. Hosts without any reviews will receive an average compound score of zero, making the sentiment 
of their reviews neutral, which makes the most sense from the perspective of replacing missing values.
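A sketch of the host-level aggregation, assuming each review already has a `compound` score. The toy values, the `compound` column, and the `host_review_sentiment` column name are illustrative assumptions; `id`, `host_id`, and `listing_id` follow the columns shown in the data above:

```python
import pandas as pd

# Toy listings: host 10 owns two listings, host 20 owns one with no reviews
listings = pd.DataFrame({"id": [1, 2, 3], "host_id": [10, 10, 20]})

# Toy review scores keyed by listing
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "compound": [0.8, 0.4, -0.3],
})

# Attach the host to every review, then average per host
scored = reviews.merge(
    listings[["id", "host_id"]], left_on="listing_id", right_on="id", how="inner"
)
host_mean = scored.groupby("host_id")["compound"].mean()

# Hosts with no reviews get the neutral score 0
listings["host_review_sentiment"] = listings["host_id"].map(host_mean).fillna(0.0)
```

Averaging over reviews rather than over listings means a host's busiest listing carries the most weight, which may or may not be the desired behaviour.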

In [None]:
# Imports for sentiment of reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt_tab')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Afif\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Afif\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [None]:
# Combining Datasets

In [None]:
# Some testing with nltk
rev = pd.read_csv("Datasets/AirbnbReviews/SepReviews.csv")

rev

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2595,17857,2009-11-21,50679,Jean,Notre séjour de trois nuits.\r<br/>Nous avons ...
1,2595,19176,2009-12-05,53267,Cate,Great experience.
2,2595,19760,2009-12-10,38960,Anita,I've stayed with my friend at the Midtown Cast...
3,2595,34320,2010-04-09,71130,Kai-Uwe,"We've been staying here for about 9 nights, en..."
4,2595,46312,2010-05-25,117113,Alicia,We had a wonderful stay at Jennifer's charming...
...,...,...,...,...,...,...
947323,1202281404451653124,1210187339508573604,2024-07-27,336561829,Eli,Nice comfy place to get some rest
947324,1202281439306897980,1212447511379103143,2024-07-30,562140003,Mert,It located in a great area that really close t...
947325,1202281541998087802,1208820196573552796,2024-07-25,38066548,Jonny,"Works for a single night, however place felt a..."
947326,1202281541998087802,1214515290122107417,2024-08-02,593375750,Agim,Rip off


In [45]:
analyzer = SIA()

analyzer.polarity_scores(rev["comments"][2].replace("\r<br/>", " "))

{'neg': 0.016, 'neu': 0.793, 'pos': 0.192, 'compound': 0.9248}
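The `compound` value stays within -1 and 1 because VADER squashes the raw sum of word valences. A sketch of that normalization, x / sqrt(x² + α) with α = 15, written from my recollection of the reference vaderSentiment code, so the constant should be treated as an assumption to check against the library source:

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded valence sum onto the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Larger raw sums approach, but never reach, -1 or 1
examples = {raw: round(normalize(raw), 4) for raw in (-10.0, 0.0, 1.0, 10.0)}
```

Because the curve flattens for large sums, long and enthusiastic reviews tend to cluster near 1, which is worth keeping in mind when averaging scores.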

In [44]:
# Average the per-sentence compound scores for a sample of reviews (rows 1 to 999)
for i in range(1, 1000):
    comment = rev["comments"][i]
    if not isinstance(comment, str):
        continue  # skip missing comments (NaN)
    sentence_list = nltk.tokenize.sent_tokenize(comment.replace("\r<br/>", " "))
    if not sentence_list:
        continue  # avoid dividing by zero on empty reviews
    paragraph_sentiment = sum(analyzer.polarity_scores(s)["compound"] for s in sentence_list)

    print(f"AVERAGE SENTIMENT FOR PARAGRAPH {i}: \t{round(paragraph_sentiment / len(sentence_list), 4)}")
    print("----------------------------------------------------")

AVERAGE SENTIMENT FOR PARAGRAPH 1: 	0.6249
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 2: 	0.4782
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 3: 	0.5968
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 4: 	0.8258
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 5: 	0.5252
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 6: 	0.7399
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 7: 	0.248
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 8: 	0.55
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 9: 	0.5233
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 10: 	0.7018
----------------------------------------------------
AVERAGE SENTIMENT FOR PARAGRAPH 11: 	0.773