# Using Word2vec to Analyze Airbnb Listing Names

In this notebook, we convert the *name* column into numeric vectors using word2vec.

Then, we build a preliminary gradient-boosted tree model to predict the price from the listing name.
This was meant to be the baseline of our predictive model for Airbnb prices, but we decided not to use the listing name
as a feature due to time constraints.

In [1]:
%load_ext lab_black
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.word2vec import H2OWord2vecEstimator

# Start a local H2O cluster, or connect to one that is already running.
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 08 secs
H2O_cluster_timezone: America/Chicago
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.1.1
H2O_cluster_version_age: 29 days
H2O_cluster_name: H2O_from_python_steven_unique_ka5gn4
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.996 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8


In [2]:
# Load the Airbnb listings CSV into an H2O frame
airbnb_names_path = "../data/airbnb_listings_2021.csv"
airbnb_names = h2o.import_file(
    airbnb_names_path, destination_frame="airbnbnames", header=1
)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [3]:
STOP_WORDS = [
    "w/",
    "at",
    "from",
    "in",
    "to",
    "/",
    "*",
    "-",
    "w",
    "+",
    "and",
    "&",
    "near",
    "next",
]

In [4]:
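# Break Airbnb names into a sequence of words (one token per row).
def tokenize(sentences, stop_words=STOP_WORDS):
    # Split on runs of non-word characters and lowercase every token.
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Drop one-character tokens; keep the NA rows that separate the names.
    tokenized_filtered = tokenized_lower[
        (tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()), :
    ]
    # Drop tokens that contain digits.
    tokenized_words = tokenized_filtered[
        tokenized_filtered.grep("[0-9]", invert=True, output_logical=True), :
    ]
    # Drop stop words, using the argument rather than the global list.
    tokenized_words = tokenized_words[
        (tokenized_words.isna()) | (~tokenized_words.isin(stop_words)), :
    ]
    return tokenized_words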
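
To make the filtering rules in `tokenize` easier to check without an H2O cluster, here is a minimal plain-Python sketch of the same steps for a single listing name. The function name `tokenize_plain` and the shortened stop-word set are assumptions made for illustration; the sketch is not the H2O implementation and does not reproduce the NA rows H2O uses to separate names.

```python
import re

# Hypothetical, shortened stop-word set used only for this illustration.
STOP_WORDS_PLAIN = {"at", "from", "in", "to", "w", "and", "near", "next"}


def tokenize_plain(name, stop_words=STOP_WORDS_PLAIN):
    """Split one listing name into lowercase word tokens."""
    # Split on runs of non-word characters, then lowercase.
    tokens = [t.lower() for t in re.split(r"\W+", name)]
    return [
        t
        for t in tokens
        # Same three filters as the notebook: length, digits, stop words.
        if len(t) >= 2 and not re.search(r"[0-9]", t) and t not in stop_words
    ]


print(tokenize_plain("Cozy & Clean Apartment for Two"))
print(tokenize_plain("2BR Loft near Central Park"))
```

Because the split pattern consumes every non-word character, punctuation-only entries in the stop-word list (such as `&` or `/`) can never appear as tokens in this sketch; only the alphabetic entries have any effect.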

In [5]:
# Print the predicted price for each listing name.
# Note: this function prints the prediction frame and returns None.
def predict(airbnb_names, w2v, gbm):
    words = tokenize(h2o.H2OFrame(airbnb_names).ascharacter())
    airbnb_names_vec = w2v.transform(words, aggregate_method="AVERAGE")
    print(gbm.predict(test_data=airbnb_names_vec))

In [6]:
# Tokenize every listing name into one word per row
words = tokenize(airbnb_names["name"])




In [7]:
print("Build word2vec model")
w2v_model = H2OWord2vecEstimator(sent_sample_rate=0.0, epochs=10)
w2v_model.train(training_frame=words)

Build word2vec model
word2vec Model Build progress: |█████████████████████████████████████████████████| (done) 100%
Model Details
H2OWord2vecEstimator :  Word2Vec
Model Key:  Word2Vec_model_python_1652420694528_1

No model summary for this model




In [8]:
print("Sanity check - find synonyms for the word 'clean'")
w2v_model.find_synonyms("clean", count=5)

Sanity check - find synonyms for the word 'clean'


OrderedDict([('convenient', 0.5932838916778564),
             ('environment', 0.5449401140213013),
             ('super', 0.501754641532898),
             ('comfortable', 0.4874721169471741),
             ('cute', 0.4853975772857666)])
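
Synonym scores from word2vec models are typically cosine similarities between word vectors; assuming that is what `find_synonyms` reports here, the sketch below shows the computation on made-up data. The three-dimensional vectors and the helper names `cosine_similarity` and `nearest` are hypothetical and unrelated to the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "clean": [1.0, 0.0, 1.0],
    "tidy": [1.0, 0.1, 0.9],
    "loud": [-1.0, 1.0, 0.0],
}


def nearest(word, vectors, count=1):
    """Return the `count` most similar other words with their scores."""
    scores = {
        other: cosine_similarity(vectors[word], vec)
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:count]


print(nearest("clean", toy_vectors, count=2))
```

Under that assumption, a score near 0.59 for "convenient" would indicate a moderate rather than strong similarity to "clean".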

In [9]:
print("Calculate a vector for each airbnb name")
airbnb_names_vecs = w2v_model.transform(words, aggregate_method="AVERAGE")

Calculate a vector for each airbnb name
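
The `aggregate_method="AVERAGE"` option collapses the word vectors of each name into one vector. A minimal sketch of that idea follows, assuming a plain dictionary of embeddings; the helper `average_vectors` is hypothetical, and its handling of unknown words (skip them, return `None` if none are known) is an assumption that may differ from H2O's behavior.

```python
def average_vectors(tokens, embeddings):
    """Average the embeddings of the known tokens, element by element."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: there is nothing to average.
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical two-dimensional embeddings for illustration.
toy_embeddings = {"cozy": [1.0, 3.0], "apartment": [3.0, 5.0]}

print(average_vectors(["cozy", "apartment"], toy_embeddings))
print(average_vectors(["unknownword"], toy_embeddings))
```

Averaging discards word order, so two names built from the same words in a different order map to the same vector.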


In [10]:
print("Prepare training&validation data (keep only names made of known words)")
valid_airbnb_names = ~airbnb_names_vecs["C1"].isna()
data = airbnb_names[valid_airbnb_names, :].cbind(
    airbnb_names_vecs[valid_airbnb_names, :]
)
data_split = data.split_frame(ratios=[0.8])

Prepare training&validation data (keep only names made of known words)
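
H2O documents `split_frame` as a probabilistic rather than exact split, so `ratios=[0.8]` yields roughly, not exactly, 80% of the rows for training, and the result changes between runs unless a `seed` is passed. The sketch below imitates that behavior on a plain list; `probabilistic_split` is a hypothetical helper, not H2O's algorithm.

```python
import random


def probabilistic_split(rows, ratio=0.8, seed=42):
    """Assign each row to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, valid = [], []
    for row in rows:
        (train if rng.random() < ratio else valid).append(row)
    return train, valid


train_rows, valid_rows = probabilistic_split(list(range(1000)))
print(len(train_rows), len(valid_rows))
```

In the notebook itself, the equivalent change would be `data.split_frame(ratios=[0.8], seed=...)` with any fixed integer, which would make the reported metrics reproducible across runs.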


In [11]:
print("Build a basic GBM model")
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(
    x=airbnb_names_vecs.names,
    y="price",
    training_frame=data_split[0],
    validation_frame=data_split[1],
)

Build a basic GBM model
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1652420694528_2


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 10459.0 | 5.0 | 5.0 | 5.0 | 6.0 | 23.0 | 12.06 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 67222.82256771806
RMSE: 259.2736441825857
MAE: 98.10490912939785
RMSLE: NaN
Mean Residual Deviance: 67222.82256771806

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 69235.89974501255
RMSE: 263.12715508858554
MAE: 98.96447094431291
RMSLE: NaN
Mean Residual Deviance: 69235.89974501255
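
As a reading aid for the metrics above, the sketch below computes MSE, RMSE, and MAE on made-up numbers and checks that each reported RMSE is the square root of the reported MSE. The helper `regression_metrics` and the toy prices are hypothetical; only the two MSE values are copied from the output above.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, and MAE for two equal-length sequences."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Toy prices, invented for illustration.
toy = regression_metrics(actual=[100.0, 200.0], predicted=[110.0, 180.0])
print(toy)

# MSE values copied from the training and validation reports above.
reported_train_mse = 67222.82256771806
reported_valid_mse = 69235.89974501255
print(math.sqrt(reported_train_mse), math.sqrt(reported_valid_mse))
```

An RMSE far above the MAE, as here (about 260 versus about 98), is what a small number of very large errors would produce, since squaring weights them heavily. RMSLE relies on log(1 + value), which is undefined at or below -1, so the NaN may point to negative predicted values; that explanation is a guess and was not verified against this data.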

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-05-13 00:45:22 | 0.023 sec | 0.0 | 295.531218 | 112.352819 | 87338.700985 | 267.574962 | 110.652229 | 71596.360196 |
| 2022-05-13 00:45:22 | 0.674 sec | 1.0 | 293.292151 | 111.118203 | 86020.285928 | 266.939644 | 109.543325 | 71256.773494 |
| 2022-05-13 00:45:23 | 0.835 sec | 2.0 | 291.47114 | 110.213471 | 84955.425263 | 266.422256 | 108.754906 | 70980.818466 |
| 2022-05-13 00:45:23 | 0.973 sec | 3.0 | 289.660536 | 109.201262 | 83903.226309 | 265.882435 | 107.747508 | 70693.469066 |
| 2022-05-13 00:45:23 | 1.105 sec | 4.0 | 288.010689 | 108.173277 | 82950.157035 | 265.447603 | 106.869703 | 70462.42983 |
| 2022-05-13 00:45:23 | 1.299 sec | 5.0 | 286.68832 | 107.401841 | 82190.193069 | 264.962797 | 106.069788 | 70205.283712 |
| 2022-05-13 00:45:23 | 1.422 sec | 6.0 | 285.360763 | 106.611551 | 81430.765001 | 264.625347 | 105.216325 | 70026.574434 |
| 2022-05-13 00:45:23 | 1.561 sec | 7.0 | 284.009228 | 105.97105 | 80661.241436 | 264.485544 | 104.611755 | 69952.603163 |
| 2022-05-13 00:45:23 | 1.692 sec | 8.0 | 282.919532 | 105.485373 | 80043.46184 | 264.291391 | 104.160913 | 69849.939239 |
| 2022-05-13 00:45:24 | 1.834 sec | 9.0 | 281.551486 | 104.912129 | 79271.239532 | 264.315049 | 103.792971 | 69862.44512 |



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| C68 | 504267424.0 | 1.0 | 0.162397 |
| C64 | 321595936.0 | 0.637749 | 0.103569 |
| C76 | 132874104.0 | 0.263499 | 0.042792 |
| C24 | 117955248.0 | 0.233914 | 0.037987 |
| C90 | 97031008.0 | 0.19242 | 0.031248 |
| C9 | 84719888.0 | 0.168006 | 0.027284 |
| C60 | 71706704.0 | 0.1422 | 0.023093 |
| C57 | 71219864.0 | 0.141234 | 0.022936 |
| C21 | 67965512.0 | 0.134781 | 0.021888 |
| C32 | 60404712.0 | 0.119787 | 0.019453 |



See the whole table with table.as_data_frame()
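
The columns of the variable-importance table are related by simple ratios: the scaled importance divides each relative importance by the largest one, and the percentage divides it by the sum over all variables. The sketch below illustrates this with the three largest values copied from the table; `scale_importances` is a hypothetical helper, and because the table above is truncated, the percentages it prints are shares of these three values only and will not match the reported column.

```python
def scale_importances(relative):
    """Derive scaled importance and share of total from raw importances."""
    top = max(relative.values())
    total = sum(relative.values())
    return {
        name: {"scaled": value / top, "percentage": value / total}
        for name, value in relative.items()
    }


# Top three relative importances copied from the table above.
top_three = {"C68": 504267424.0, "C64": 321595936.0, "C76": 132874104.0}
scaled = scale_importances(top_three)

for name, row in scaled.items():
    print(name, row)
```

The variables `C1`, `C2`, and so on are embedding dimensions, so these importances say which coordinates of the averaged word vector the GBM relied on, not which words matter.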




In [12]:
print("Predict!")
predict(["Cozy & Clean Apartment for Two"], w2v_model, gbm_model)
predict(["Private Room Near Central Park"], w2v_model, gbm_model)
predict(
    ["Charming Apartment Walking Distance from Times Square"], w2v_model, gbm_model
)

Predict!
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


predict
122.302



Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


predict
137.765



Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%


predict
182.541
