
# Generate tokenized descriptions and reviews

In [1]:
!git clone https://github.com/dernameistegal/airbnb_price.git

Cloning into 'airbnb_price'...
remote: Enumerating objects: 1134, done.
remote: Counting objects: 100% (1134/1134), done.
remote: Compressing objects: 100% (1073/1073), done.
remote: Total 1134 (delta 675), reused 304 (delta 55), pack-reused 0
Receiving objects: 100% (1134/1134), 9.90 MiB | 9.88 MiB/s, done.
Resolving deltas: 100% (675/675), done.


In [2]:
%cd airbnb_price
import sys
sys.path.append("/content/airbnb_price/custom_functions")
import general_utils as ut

/content/airbnb_price


## Installing the Hugging Face library

Next, we install the [transformers package from Hugging Face](https://huggingface.co/transformers/index.html), which provides a PyTorch interface to pretrained language models such as BERT, XLNet, and OpenAI's GPT. This notebook uses it for the BERT tokenizer.


In [3]:
device = ut.get_device()

cuda available: True ; cudnn available: True ; num devices: 1
Using device Tesla P100-PCIE-16GB


In [4]:
%%capture
!pip install transformers
!pip install requests
#!pip install captum


# Sentiment Analysis

In [5]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
import fastprogress
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertModel
import torch.nn as nn
from torch.nn import CrossEntropyLoss

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Data Pre-processing

In [9]:
path = "/content/drive/MyDrive/Colab/airbnb/munich/listings_embeddedcats.pickle"
data = pd.read_pickle(path)
data = data[["description_en", "reviews_en", "log_price"]]

In [14]:
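# Note: `reviews` holds the listing descriptions here; the tokenization loop below iterates over it.
reviews, labels = data['description_en'], data['log_price']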

## BERT: tokenization & input formatting for description


In [15]:
# Get rid of Colab warning about Tensorflow versions
%tensorflow_version 1.x

from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading the BERT tokenizer...')

# We will use bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading the BERT tokenizer...


In [16]:
# Maximum length of a sequence
MAX_LEN = 256

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# Create attention masks
attention_masks = []

# For every description...
for review in tqdm(reviews):
    # `encode_plus` will:
    #   (1) Tokenize the description.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Truncate to MAX_LEN tokens, or pad to MAX_LEN if the sequence is shorter.
    sequence = tokenizer.encode_plus(
                    review,                     # Description to encode.
                    add_special_tokens=True,    # Add '[CLS]' and '[SEP]'.
                    truncation=True,
                    max_length=MAX_LEN,
                    padding='max_length',
                    return_attention_mask=True)
    input_ids.append(sequence['input_ids'])
    attention_masks.append(sequence['attention_mask'])

input_ids = np.array(input_ids)
attention_masks = np.array(attention_masks)

100%|██████████| 998/998 [00:02<00:00, 344.60it/s]
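
The special-token, truncation, and padding steps listed in the comments of the loop above can be illustrated without loading the tokenizer. The following is a minimal sketch, not the library's implementation: `pad_and_mask` is a hypothetical helper, and the IDs 101, 102, and 0 are assumed to be `[CLS]`, `[SEP]`, and `[PAD]` because that is how they appear in this notebook's output for `bert-base-uncased`.

```python
# Assumed special-token IDs, taken from the outputs shown in this notebook.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Hypothetical helper: add special tokens, truncate, pad, and build the mask."""
    body = token_ids[: max_len - 2]       # truncate, leaving room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]      # wrap the sequence in special tokens
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_len - len(ids)            # number of padding positions still needed
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


# Token IDs 2053 and 3793 correspond to the placeholder text "no text" in the data.
ids, mask = pad_and_mask([2053, 3793], max_len=8)
print(ids)   # [101, 2053, 3793, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask is 1 for every real token (including `[CLS]` and `[SEP]`) and 0 for padding, which matches the pattern visible in the `desc_attention_masks` column.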


In [17]:
data["desc_input_ids"] = list(input_ids)
data["desc_attention_masks"] = list(attention_masks)
data

Unnamed: 0,description_en,reviews_en,log_price,desc_input_ids,desc_attention_masks
159634,In this idyllic stylish flat you live very qui...,[Very nice hostess and beautiful idyllic envir...,3.951244,"[101, 1999, 2023, 8909, 8516, 10415, 2358, 851...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
170154,"Enjoy a quiet neighbourhood, easy access to th...",[With their hospitality and lots of tips and l...,4.007333,"[101, 5959, 1037, 4251, 10971, 1010, 3733, 322...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
170815,The spaceIt's a 1-room studio appartment with ...,[Thank you. Everything was good. Gladly again....,4.174387,"[101, 1996, 2686, 4183, 1005, 1055, 1037, 1015...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
186596,The spaceBeautiful bright 2-room-flat in calm ...,"[Good house direct to the street, Hatice is ve...",4.406719,"[101, 1996, 2686, 26401, 3775, 3993, 4408, 101...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
190529,The spaceSuper nice rooms with modern faciliti...,[I was only one night in Munich and everything...,4.356709,"[101, 1996, 7258, 6279, 2121, 3835, 4734, 2007...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...,...
53875317,Is very central and close the the city centre....,[no review],3.850148,"[101, 2003, 2200, 2430, 1998, 2485, 1996, 1996...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
53891783,no text,[no review],4.158883,"[101, 2053, 3793, 102, 0, 0, 0, 0, 0, 0, 0, 0,...","[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
53903590,"Hey there,Am pleased to offer you a warm welco...",[no review],4.174387,"[101, 4931, 2045, 1010, 2572, 7537, 2000, 3749...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
53929784,Hello folks!The apartment located near to (1 m...,[no review],4.844187,"[101, 7592, 12455, 999, 1996, 4545, 2284, 2379...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [19]:
# Inspect one example description (positional index 23, i.e. the 24th row)
idx = 23
# 1) the original text,
print(data["description_en"].iloc[idx])
# 2) its list of token IDs,
print(data["desc_input_ids"].iloc[idx])
# 3) its attention mask,
print(data["desc_attention_masks"].iloc[idx])
# 4) the human-readable string recreated from the token IDs, and
print(tokenizer.decode(data["desc_input_ids"].iloc[idx]))
# 5) the recreated string after applying the attention mask (padding removed)
print(tokenizer.decode(data["desc_input_ids"].iloc[idx][data["desc_attention_masks"].iloc[idx].astype("bool")]))


The spaceFully furnished 125 sq meter top-floor apt. (elevator) in lovely,renovated old building. Sendling - 5min walk to Theresienwiesn/Oktoberfest, subwaystation  Poccistr. U 3 and U 6.Avaible during the Oktoberfest - 19 th of September  to 5  of october- 110,00 € per night including all costs and wireless internet. You can use the newly renovated kitchen with wooden cabinets, ceramic cooktop, oven,microwave, dishwasher and eating areaGuest toilet and bright, white, full bath/shower with clothes washerLarge living room, dining, sitting room and hallHardwood floorsI have been living here six years and love it. Very convenient tomarkets, etc. I'm returning to the US for six month so you would besharing the apartment with a lively, fun, English-speaking Germanwoman. She is out of the apartment much of the time so it is likehaving the place to yourself. CherieHello fellows!!
[  101  1996  2686  7699 19851  8732  5490  8316  2327  1011  2723 26794
  1012  1006  7764  1007  1999  8403  101

# Tokenization for reviews

In [20]:
path = "/content/drive/MyDrive/Colab/airbnb/munich/language/train.pickle"
data.to_pickle(path)

In [21]:
path = "/content/drive/MyDrive/Colab/airbnb/munich/language/train.pickle"
data = pd.read_pickle(path)

In [22]:
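# Each entry of `reviews_en` is a list of reviews for one listing. Flatten them into
# one array and repeat each listing's label once per review so both stay aligned.
reviews, labels = data['reviews_en'].to_numpy(), data['log_price'].to_numpy()
nreviews = np.vectorize(len)(reviews)   # number of reviews per listing
reviews = np.concatenate(reviews)       # flat array of all reviews
labels = np.repeat(labels, nreviews)    # one label per review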
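
To see what this flattening does, here is a small plain-Python sketch of the same idea on toy data. The review texts and prices are invented for illustration, and the list comprehensions stand in for the NumPy calls used in the cell above.

```python
from itertools import chain

# Toy stand-ins for data['reviews_en'] and data['log_price'] (invented values).
reviews_per_listing = [["great", "clean"], ["no review"], ["cozy", "central", "quiet"]]
log_prices = [4.0, 3.9, 4.4]

# Number of reviews per listing, analogous to np.vectorize(len)(reviews).
nreviews = [len(r) for r in reviews_per_listing]

# One flat list of reviews, analogous to np.concatenate(reviews).
flat_reviews = list(chain.from_iterable(reviews_per_listing))

# Repeat each label once per review, analogous to np.repeat(labels, nreviews).
flat_labels = [price for price, n in zip(log_prices, nreviews) for _ in range(n)]

print(nreviews)      # [2, 1, 3]
print(flat_reviews)  # ['great', 'clean', 'no review', 'cozy', 'central', 'quiet']
print(flat_labels)   # [4.0, 4.0, 3.9, 4.4, 4.4, 4.4]
```

Keeping `nreviews` around matters: it is the only record of which flat review belongs to which listing.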

In [24]:
len(reviews)

19369

In [25]:
# Maximum length of a sequence
MAX_LEN = 128

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# Create attention masks
attention_masks = []

# For every sentence...
for review in tqdm(reviews):
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad to maximum length if the sequence is shorter
    sequence = tokenizer.encode_plus(
                    review,                      # Review to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    truncation=True,
                    max_length=MAX_LEN,
                    padding='max_length',
                    return_attention_mask=True)
    input_ids.append(sequence['input_ids'])
    attention_masks.append(sequence['attention_mask'])

input_ids = np.array(input_ids)
attention_masks = np.array(attention_masks)

100%|██████████| 19369/19369 [00:22<00:00, 853.03it/s] 


In [None]:
path = "/content/drive/MyDrive/Colab/airbnb/data/reviews/df.pickle"
data = pd.read_pickle(path)

In [27]:
nreviews.shape

(998,)

In [28]:
# Regroup the flat token arrays per listing: listing i owns the rows [start, stop).
start = 0
id_list, mask_list = [], []
for nreview in nreviews:
    stop = start+nreview
    print(start, stop)
    id_list.append(input_ids[start:stop])
    mask_list.append(attention_masks[start:stop])
    start = stop

0 34
34 530
530 594
594 757
757 764
764 828
828 840
840 1453
1453 1536
1536 1593
1593 1678
1678 1711
1711 1786
1786 1955
1955 2183
2183 2184
2184 2198
2198 2223
2223 2247
2247 2278
2278 2286
2286 2499
2499 2564
2564 2608
2608 2617
2617 2618
2618 2686
2686 2694
2694 2713
2713 2721
2721 2723
2723 2749
2749 2774
2774 2787
2787 2795
2795 2868
2868 2907
2907 3099
3099 3103
3103 3163
3163 3201
3201 3321
3321 3329
3329 3342
3342 3347
3347 3381
3381 3383
3383 3577
3577 3579
3579 3607
3607 3706
3706 3714
3714 3715
3715 3718
3718 3753
3753 4099
4099 4119
4119 4120
4120 4121
4121 4279
4279 4313
4313 4314
4314 4323
4323 4327
4327 4329
4329 4330
4330 4350
4350 4373
4373 4387
4387 4396
4396 4456
4456 4459
4459 4464
4464 4474
4474 4480
4480 4514
4514 4516
4516 4596
4596 4597
4597 4629
4629 4652
4652 4838
4838 4906
4906 4918
4918 4922
4922 4933
4933 4963
4963 5023
5023 5024
5024 5082
5082 5137
5137 5184
5184 5186
5186 5212
5212 5351
5351 5384
5384 5398
5398 5401
5401 5402
5402 5513
5513 5514
5514 5558

In [29]:
data["review_input_ids"] = list(id_list)
data["review_attention_masks"] = list(mask_list)
data

Unnamed: 0,description_en,reviews_en,log_price,desc_input_ids,desc_attention_masks,review_input_ids,review_attention_masks
159634,In this idyllic stylish flat you live very qui...,[Very nice hostess and beautiful idyllic envir...,3.951244,"[101, 1999, 2023, 8909, 8516, 10415, 2358, 851...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2200, 3835, 22566, 1998, 3376, 8909, 85...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,..."
170154,"Enjoy a quiet neighbourhood, easy access to th...",[With their hospitality and lots of tips and l...,4.007333,"[101, 5959, 1037, 4251, 10971, 1010, 3733, 322...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2007, 2037, 15961, 1998, 7167, 1997, 10...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..."
170815,The spaceIt's a 1-room studio appartment with ...,[Thank you. Everything was good. Gladly again....,4.174387,"[101, 1996, 2686, 4183, 1005, 1055, 1037, 1015...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 4067, 2017, 1012, 2673, 2001, 2204, 101...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..."
186596,The spaceBeautiful bright 2-room-flat in calm ...,"[Good house direct to the street, Hatice is ve...",4.406719,"[101, 1996, 2686, 26401, 3775, 3993, 4408, 101...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2204, 2160, 3622, 2000, 1996, 2395, 101...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..."
190529,The spaceSuper nice rooms with modern faciliti...,[I was only one night in Munich and everything...,4.356709,"[101, 1996, 7258, 6279, 2121, 3835, 4734, 2007...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 1045, 2001, 2069, 2028, 2305, 1999, 746...","[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..."
...,...,...,...,...,...,...,...
53875317,Is very central and close the the city centre....,[no review],3.850148,"[101, 2003, 2200, 2430, 1998, 2485, 1996, 1996...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0...","[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
53891783,no text,[no review],4.158883,"[101, 2053, 3793, 102, 0, 0, 0, 0, 0, 0, 0, 0,...","[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0...","[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
53903590,"Hey there,Am pleased to offer you a warm welco...",[no review],4.174387,"[101, 4931, 2045, 1010, 2572, 7537, 2000, 3749...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0...","[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."
53929784,Hello folks!The apartment located near to (1 m...,[no review],4.844187,"[101, 7592, 12455, 999, 1996, 4545, 2284, 2379...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[[101, 2053, 3319, 102, 0, 0, 0, 0, 0, 0, 0, 0...","[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,..."


In [30]:
# Inspect the first review of one example listing (positional index 70)
number = 70
# 1) the original text,
print(data["reviews_en"].iloc[number][0])
# 2) its list of token IDs,
print(data["review_input_ids"].iloc[number][0])
# 3) its attention mask,
print(data["review_attention_masks"].iloc[number][0])
# 4) the human-readable string recreated from the token IDs, and
print(tokenizer.decode(data["review_input_ids"].iloc[number][0]))
# 5) the recreated string after applying the attention mask (padding removed)
print(tokenizer.decode(data["review_input_ids"].iloc[number][0][data["review_attention_masks"].iloc[number][0].astype("bool")]))


Ingrid is a perfect hostess. I felt very comfortable. The room is very lovingly furnished.The apartment is perfect for business.The public transport is very fast to reach.Ingrid also advised me very well.I am very happy to come back to Munich.
[  101 22093  2003  1037  3819 22566  1012  1045  2371  2200  6625  1012
  1996  2282  2003  2200  8295  2135 19851  1012  1996  4545  2003  3819
  2005  2449  1012  1996  2270  3665  2003  2200  3435  2000  3362  1012
 22093  2036  9449  2033  2200  2092  1012  1045  2572  2200  3407  2000
  2272  2067  2000  7469  1012   102     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0  

In [33]:
path = "/content/drive/MyDrive/Colab/airbnb/munich/language/train.pickle"
data.to_pickle(path)