<a href="https://colab.research.google.com/github/dernameistegal/airbnb_price/blob/main/data_utils/data_preparation/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate tokenized descriptions and reviews

In [1]:
!git clone https://github.com/dernameistegal/airbnb_price.git

Cloning into 'airbnb_price'...
remote: Enumerating objects: 422, done.
remote: Counting objects: 100% (422/422), done.
remote: Compressing objects: 100% (395/395), done.
remote: Total 422 (delta 213), reused 113 (delta 21), pack-reused 0
Receiving objects: 100% (422/422), 3.23 MiB | 1.82 MiB/s, done.
Resolving deltas: 100% (213/213), done.


In [2]:
%cd airbnb_price
import sys
sys.path.append("/content/airbnb_price/custom_functions")
import general_utils as ut

/content/airbnb_price


## Installing the Hugging Face library 

Next, install the [transformers package from Hugging Face](https://huggingface.co/transformers/index.html), which provides a PyTorch interface to implementations of state-of-the-art embedding layers. The library includes pretrained language models such as BERT, XLNet, and OpenAI's GPT.


In [3]:
device = ut.get_device()

cuda available: True ; cudnn available: True ; num devices: 1
Using device Tesla P100-PCIE-16GB


In [4]:
%%capture
!pip install transformers
!pip install requests
!pip install captum


# Sentiment Analysis

In [5]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import fastprogress
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertModel
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from captum.attr import visualization

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Data Pre-processing

In [24]:
# load the training data
file_name = "/content/drive/MyDrive/Colab/airbnb/data/data1/listings_workfile.pickle"
data = pd.read_pickle(file_name)
data = data[["description_en", "reviews", "log_price"]]

In [25]:
train_ids, val_ids , test_ids = ut.train_val_test_split(data.index)

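# Note: despite its name, `reviews` holds the listing descriptions here;
# the guest reviews are tokenized in a later section.
reviews, labels = data['description_en'], data['log_price']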

## BERT: tokenization & input formatting for description


In [16]:
# Get rid of Colab warning about Tensorflow versions
%tensorflow_version 1.x

from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading the BERT tokenizer...')

# We will use bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

TensorFlow 1.x selected.
Loading the BERT tokenizer...



In [38]:
# Maximum length of a sequence
MAX_LEN = 256

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# Create attention masks
attention_masks = []

# For every sentence...
for review in tqdm(reviews):
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad to maximum length if the sequence is shorter
    sequence = tokenizer.encode_plus(
                    review,                      # Review to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    truncation=True,
                    max_length=MAX_LEN,
                    padding='max_length',
                    return_attention_mask=True)
    input_ids.append(sequence['input_ids'])
    attention_masks.append(sequence['attention_mask'])

input_ids = np.array(input_ids)
attention_masks = np.array(attention_masks)

100%|██████████| 11404/11404 [00:59<00:00, 191.19it/s]
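
The five steps listed in the comments above can be illustrated without downloading a model. The following is a minimal sketch with a toy whitespace tokenizer and a made-up vocabulary: `toy_encode`, `TOY_VOCAB`, and the IDs for ordinary words are hypothetical, and the real `BertTokenizer` uses WordPiece sub-word splitting instead. Only the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) match those visible in the outputs of this notebook.

```python
# Toy illustration of the encode steps; NOT the real BERT WordPiece tokenizer.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = {"nice": 1000, "flat": 1001, "in": 1002, "vienna": 1003}


def toy_encode(text, max_length):
    """Return (input_ids, attention_mask), both of length max_length."""
    # (1) Tokenize: lowercase and split on whitespace.
    tokens = text.lower().split()
    # (4) Map tokens to IDs, falling back to the unknown-token ID.
    ids = [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]
    # Truncate so that [CLS] and [SEP] still fit into max_length.
    ids = ids[: max_length - 2]
    # (2) + (3) Add the special tokens.
    ids = [CLS_ID] + ids + [SEP_ID]
    # The attention mask is 1 for real tokens and 0 for padding.
    mask = [1] * len(ids)
    # (5) Pad up to max_length.
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


ids, mask = toy_encode("Nice flat in Vienna", max_length=8)
print(ids)   # [101, 1000, 1001, 1002, 1003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Padding every sequence to the same length is what allows the results to be stacked into a single rectangular NumPy array afterwards.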


In [45]:
data["desc_input_ids"] = list(input_ids)
data["desc_attention_masks"] = list(attention_masks)
data

| | description_en | reviews | log_price | desc_input_ids | desc_attention_masks |
|---|---|---|---|---|---|
| 15883 | Four rooms, each one differently and individua... | [The stay with Eva was to my full satisfaction... | 4.787492 | [101, 2176, 4734, 1010, 2169, 2028, 11543, 199... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 38768 | 39m² apartment with beautiful courtyard of the... | [Super friendly host! Great tips, great apartm... | 4.110874 | [101, 4464, 2213, 10701, 4545, 2007, 3376, 101... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 40625 | Welcome to my Apt. 1!This is a 2bedroom apartm... | [Good place to visit the zoo, possibility to p... | 4.875197 | [101, 6160, 2000, 2026, 26794, 1012, 1015, 999... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 51287 | small studio in new renovated old house and ve... | [We had a perfect stay in this studio! Hannes ... | 4.077537 | [101, 2235, 2996, 1999, 2047, 10601, 2214, 216... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 70637 | The spaceMy apartment (including a large terra... | [The apartment of Elxe is a dream and just gre... | 3.912023 | [101, 1996, 2686, 8029, 4545, 1006, 2164, 1037... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| ... | ... | ... | ... | ... | ... |
| 53194341 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 53194438 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 53194698 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 53195075 | Enjoy the simple life in this quiet and centra... | [no review] | 4.060443 | [101, 5959, 1996, 3722, 2166, 1999, 2023, 4251... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


In [37]:
# Print the 24th example (index 23, since Python indexing starts at 0)
# 1) as the original text,
print(data["description_en"].iloc[23])
# 2) as a list of token IDs,
print(data["desc_input_ids"].iloc[23])
# 3) its attention mask,
print(data["desc_attention_masks"].iloc[23])
# 4) the human-readable string recreated from the token IDs, and
print(tokenizer.decode(data["desc_input_ids"].iloc[23]))
# 5) the human-readable string recreated with its attention mask applied
print(tokenizer.decode(data["desc_input_ids"].iloc[23][data["desc_attention_masks"].iloc[23].astype("bool")]))


The apartment Naschmarkt mini is beautiful studio in a calm neighborhood close to the famous Naschmarkt, Vienna's most popular market place with a lot of nice restaurants.The historic city center can be reached quickly: by foot it is a 10-15 minutes walk and with the metro you are even faster. The apartment is therefore the ideal base for singles, couples and small families to discover the beautiful city of Vienna.The spaceOur apartment „Naschmarkt mini“ is located in the mezzanine of a 19th century apartment building. It is a studio apartment with a comfortable couch that can be transformed into a double bed (150 x 200 cm in size) with a spring mattress.  Furthermore there is a doorway with a wooden double wing door that has been transformed into a large wardrobe. There is also a commode, a dining table with four chairs and an armchair. There is a flatscreen TV with a satellite receiver attached to the wall and free wi-fi internet acces
[  101  1996  4545 17235  2818 10665  2102  7163
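
Step 5 above relies on NumPy boolean indexing: converting the attention mask to `bool` and indexing the ID array with it keeps only the non-padding positions. A minimal sketch with made-up values (the array contents are illustrative, not taken from the dataset):

```python
import numpy as np

# Illustrative values only: [CLS], three tokens, [SEP], then three padding zeros.
ids = np.array([101, 2307, 4545, 999, 102, 0, 0, 0])
mask = np.array([1, 1, 1, 1, 1, 0, 0, 0])

# Boolean indexing drops every position whose mask entry is 0.
unpadded = ids[mask.astype(bool)]
print(unpadded.tolist())  # [101, 2307, 4545, 999, 102]
```

Decoding the unpadded IDs avoids the trailing run of padding tokens in the recreated string.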

# Tokenization for reviews

In [49]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/df.pickle"
data.to_pickle(path)

In [5]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/df.pickle"
data = pd.read_pickle(path)

In [6]:
reviews, labels = data['reviews'].to_numpy(), data['log_price'].to_numpy()
nreviews = np.vectorize(len)(reviews)
reviews = np.concatenate(reviews)
labels = np.repeat(labels, nreviews)
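
The cell above flattens the per-listing review lists into one long array and repeats each listing's label once per review, so that reviews and labels stay aligned. A small sketch of the same three calls on made-up data (the review texts and prices are hypothetical):

```python
import numpy as np

# Hypothetical data: three listings with 2, 1, and 3 reviews.
reviews_per_listing = np.empty(3, dtype=object)
reviews_per_listing[0] = ["great", "clean"]
reviews_per_listing[1] = ["no review"]
reviews_per_listing[2] = ["cozy", "central", "quiet"]
prices = np.array([4.8, 4.1, 3.9])

# Number of reviews per listing.
counts = np.vectorize(len)(reviews_per_listing)
# One flat array holding all reviews.
flat_reviews = np.concatenate(reviews_per_listing)
# Each label repeated once per review of its listing.
flat_labels = np.repeat(prices, counts)

print(counts.tolist())        # [2, 1, 3]
print(flat_reviews.tolist())  # ['great', 'clean', 'no review', 'cozy', 'central', 'quiet']
print(flat_labels.tolist())   # [4.8, 4.8, 4.1, 3.9, 3.9, 3.9]
```

Keeping the counts around is what makes it possible to regroup the flat arrays by listing later on.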

In [8]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/nreviews.npy"
np.save(path, nreviews)
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/reviews.npy"
np.save(path, reviews)
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/labels.npy"
np.save(path, labels)

In [9]:
len(reviews)

364781

In [10]:
# Maximum length of a sequence
MAX_LEN = 128

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# Create attention masks
attention_masks = []

# For every sentence...
for review in tqdm(reviews):
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad to maximum length if the sequence is shorter
    sequence = tokenizer.encode_plus(
                    review,                      # Review to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    truncation=True,
                    max_length=MAX_LEN,
                    padding='max_length',
                    return_attention_mask=True)
    input_ids.append(sequence['input_ids'])
    attention_masks.append(sequence['attention_mask'])

input_ids = np.array(input_ids)
attention_masks = np.array(attention_masks)

100%|██████████| 364781/364781 [11:45<00:00, 516.89it/s]


In [11]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/input_ids.npy"
np.save(path, input_ids)
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/labels.npy"
np.save(path, labels)
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/attention_masks.npy"
np.save(path, attention_masks)

In [6]:
root = "/content/drive/MyDrive/Colab/airbnb/data/reviews/"
attention_masks = np.load(root + "attention_masks.npy")
input_ids = np.load(root + "input_ids.npy")
labels = np.load(root + "labels.npy")

In [9]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/df.pickle"
data = pd.read_pickle(path)

In [11]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/nreviews.npy"
nreviews = np.load(path)

In [12]:
nreviews

array([ 14, 334, 162, ...,   1,   1,   1])

In [13]:
start = 0
id_list, mask_list = [], []
for nreview in nreviews:
    stop = start+nreview
    print(start, stop)
    id_list.append(input_ids[start:stop])
    mask_list.append(attention_masks[start:stop])
    start = stop

The last 5000 lines of the streaming output were truncated.
304913 305015
305015 305022
305022 305048
...
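
The loop above walks through `nreviews` with running start and stop offsets to regroup the flat arrays by listing. As an alternative sketch (not what the notebook itself runs), the same grouping can be obtained with `np.cumsum` and `np.split`. The arrays below are small made-up stand-ins for `input_ids` and `nreviews`:

```python
import numpy as np

# Made-up stand-ins: 6 tokenized reviews of length 4, belonging to 3 listings.
flat_ids = np.arange(24).reshape(6, 4)
counts = np.array([2, 1, 3])

# Split points are the cumulative counts without the final total.
split_points = np.cumsum(counts)[:-1]
grouped = np.split(flat_ids, split_points)

# Sanity checks: one group per listing, each with the expected number of rows.
print([g.shape for g in grouped])  # [(2, 4), (1, 4), (3, 4)]
assert counts.sum() == len(flat_ids)
```

Checking that the counts sum to the number of rows guards against regrouping arrays that were saved from different runs.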

In [14]:
data["review_input_ids"] = list(id_list)
data["review_attention_masks"] = list(mask_list)
data

| | description_en | reviews | log_price | desc_input_ids | desc_attention_masks | review_input_ids | review_attention_masks |
|---|---|---|---|---|---|---|---|
| 15883 | Four rooms, each one differently and individua... | [The stay with Eva was to my full satisfaction... | 4.787492 | [101, 2176, 4734, 1010, 2169, 2028, 11543, 199... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 1996, 2994, 2007, 9345, 2001, 2000, 202... | [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... |
| 38768 | 39m² apartment with beautiful courtyard of the... | [Super friendly host! Great tips, great apartm... | 4.110874 | [101, 4464, 2213, 10701, 4545, 2007, 3376, 101... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 3565, 5379, 3677, 999, 2307, 10247, 101... | [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... |
| 40625 | Welcome to my Apt. 1!This is a 2bedroom apartm... | [Good place to visit the zoo, possibility to p... | 4.875197 | [101, 6160, 2000, 2026, 26794, 1012, 1015, 999... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2204, 2173, 2000, 3942, 1996, 9201, 101... | [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... |
| 51287 | small studio in new renovated old house and ve... | [We had a perfect stay in this studio! Hannes ... | 4.077537 | [101, 2235, 2996, 1999, 2047, 10601, 2214, 216... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2057, 2018, 1037, 3819, 2994, 1999, 202... | [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... |
| 70637 | The spaceMy apartment (including a large terra... | [The apartment of Elxe is a dream and just gre... | 3.912023 | [101, 1996, 2686, 8029, 4545, 1006, 2164, 1037... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 1996, 4545, 1997, 3449, 2595, 2063, 200... | [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 53194341 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0... | [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... |
| 53194438 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0... | [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... |
| 53194698 | The apartment has a spacious living room with ... | [no review] | 4.077537 | [101, 1996, 4545, 2038, 1037, 22445, 2542, 228... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0... | [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... |
| 53195075 | Enjoy the simple life in this quiet and centra... | [no review] | 4.060443 | [101, 5959, 1996, 3722, 2166, 1999, 2023, 4251... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0... | [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... |


In [20]:
# Position of the listing to inspect
number = 70
# Print the first review of the listing at position `number`
# 1) as the original text,
print(data["reviews"].iloc[number][0])
# 2) as a list of token IDs,
print(data["review_input_ids"].iloc[number][0])
# 3) its attention mask,
print(data["review_attention_masks"].iloc[number][0])
# 4) the human-readable string recreated from the token IDs, and
print(tokenizer.decode(data["review_input_ids"].iloc[number][0]))
# 5) the human-readable string recreated with its attention mask applied
print(tokenizer.decode(data["review_input_ids"].iloc[number][0][data["review_attention_masks"].iloc[number][0].astype("bool")]))


The apartment 15 in Fleischmanngasse is beautiful and large. Very new furnished- chic, bright, pleasant. It is in a typical Viennese Biedermeier house and is very central, you are in the city center in 5 minutes. I felt very comfortable there!
[  101  1996  4545  2321  1999 13109 17580 19944 13807 11393  2003  3376
  1998  2312  1012  2200  2047 19851  1011  9610  2278  1010  4408  1010
  8242  1012  2009  2003  1999  1037  5171 20098 10087  3366 12170 14728
 10867  7416  2121  2160  1998  2003  2200  2430  1010  2017  2024  1999
  1996  2103  2415  1999  1019  2781  1012  1045  2371  2200  6625  2045
   999   102     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0  

In [21]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/df.pickle"
data.to_pickle(path)