# Label the Subsets
This notebook is a helper tool for labeling subsets of the data with the `tortus` Python package. For each review we display the review text together with the listing's description and amenities list, so the labeler has context for the decision. The labels are "Yes" (the review describes a misleading listing), "No", and "Bad" (the listing was not misleading, but the guest did not enjoy the stay). The labeling widget configured below also offers a "Maybe" option.

## Import packages

In [None]:
%pip install tortus
!jupyter nbextension enable --py widgetsnbextension
from google.colab import drive
drive.mount('/content/drive')
from tortus import Tortus
import pandas as pd
import os

In [2]:
REVIEWS = "/content/drive/MyDrive/DS 440 Capstone/data/filtered/texas_reviews_filtered.csv"
LABELS = "/content/drive/MyDrive/DS 440 Capstone/data/labels/texas_reviews_labels.csv"
LISTINGS = "/content/drive/MyDrive/DS 440 Capstone/data/listings/texas_listings.csv"

## Load listing information and put into one field so Tortus can render it

In [3]:
listings = pd.read_csv(LISTINGS)[["id", "name", "description", "amenities"]]

def parse_amenities(amenities):
  # strip the wrapping braces/brackets and the quotes around each amenity
  for char in '{}[]"':
    amenities = amenities.replace(char, "")
  return [a.strip() for a in amenities.split(",")]

listings.amenities = listings.amenities.apply(parse_amenities)

print("Number of unique listings (beyond the subset):", listings.id.nunique())


Number of unique listings (beyond the subset): 11882
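The amenities column is stored as a single delimited string, and the parsing step above assumes every value is a non-missing string. The sketch below is one way to make that step more defensive. It assumes the raw values may be brace-delimited (`{TV,"Cable TV"}`), bracket-delimited (`["TV", "Wifi"]`), or missing; `parse_amenities_robust` is a hypothetical helper name and not part of the notebook.

```python
import pandas as pd

def parse_amenities_robust(raw):
    """Split a raw amenities string into a list of clean amenity names."""
    # Missing values (None / NaN) become an empty list instead of raising.
    if pd.isna(raw):
        return []
    # Drop the wrapping braces or brackets, whichever style was used.
    cleaned = str(raw).strip().strip("{}[]")
    # Split on commas, then trim whitespace and surrounding quotes per item.
    return [item.strip().strip('"') for item in cleaned.split(",") if item.strip()]

# Toy values covering the three assumed cases.
example = pd.Series(['{TV,"Cable TV",Wifi}', '["TV", "Wifi"]', None])
print(example.apply(parse_amenities_robust).tolist())
```

A plain comma split would still break an amenity whose name itself contains a comma, so this remains a sketch rather than a full parser.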



In [4]:
# join reviews data with listings
reviews = pd.read_csv(REVIEWS)

joined = pd.merge(reviews, listings, left_on="listing_id", right_on="id", suffixes=("_reviews_dataset", "_listings_dataset"))

def generate_display_text(row):
  return f"""<h2><b>{row["name"]} (id = {row["id"]})</b></h2>
  <h4>Description</h4>
  <p style="font-weight: normal;">{row.description}</p>
  <h4>Amenities</h4>
  <p style="font-weight: normal;">{row.amenities}</p>
  <hr/>
  <h4>Review <span style="font-weight: normal">(sentiment = {round(row.sentiment, 2)})</span></h4> 
  <p style="font-weight: normal;">{row["comments"]}</p>
  """

joined = joined.rename(columns={"id_reviews_dataset": "id"})

joined["display_text"] = joined.apply(generate_display_text, axis=1)

joined = joined[["id", "display_text", "id_listings_dataset"]]

joined.display_text[0]

print("Number of reviews:", reviews.shape[0])
print("number of unique listings in this reviews subset:", joined["id_listings_dataset"].nunique())

Number of reviews: 5629
number of unique listings in this reviews subset: 2701
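Because both input tables have an `id` column, the merge above renames them with the given suffixes, which is why the cell later renames `id_reviews_dataset` back to `id`. The toy example below illustrates that behaviour. The two small frames are made up for illustration, and `validate="many_to_one"` is an optional extra check (not used in the notebook) that raises if a listing id appears more than once on the listings side.

```python
import pandas as pd

# Made-up stand-ins for the reviews and listings tables.
reviews = pd.DataFrame({
    "id": [10, 11, 12],
    "listing_id": [1, 1, 2],
    "comments": ["great", "not as described", "fine"],
})
listings = pd.DataFrame({"id": [1, 2, 3], "name": ["Loft", "Cabin", "Condo"]})

joined = pd.merge(
    reviews,
    listings,
    left_on="listing_id",
    right_on="id",
    suffixes=("_reviews_dataset", "_listings_dataset"),
    validate="many_to_one",  # many reviews may point at one listing
)

# The clashing "id" columns are kept side by side under suffixed names.
print(joined.columns.tolist())
print(len(joined))
```

The default inner join drops reviews whose `listing_id` has no match, so comparing `len(joined)` with `len(reviews)` is a quick way to spot lost rows.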


## Run Tortus

In [14]:
import numpy as np


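# make sure the folder for the LABELS file exists
folder = os.path.dirname(LABELS)
os.makedirs(folder, exist_ok=True)

# load any labels saved by earlier sessions
try:
    labels = pd.read_csv(LABELS)
    labels_size = labels.shape[0]
except (FileNotFoundError, pd.errors.EmptyDataError):
    # first run: nothing has been labelled yet
    labels = None
    labels_size = 0

# calculate remaining reviews to label
if labels is None:
    remaining = joined
else:
    remaining = joined[~np.isin(joined.id, labels.id)]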

print("Number of rows total:", joined.shape[0])
print("Number of rows already labelled:", labels_size)
print("Amount remaining:", remaining.shape[0])


tortus = Tortus(remaining, "display_text", num_records=25, id_column="id", annotations=None, random=False, labels=['Yes', 'No', 'Maybe', "Bad"])

Number of rows total: 5629
Number of rows already labelled: 633
Amount remaining: 4998
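The "remaining" computation is an anti-join: keep every row whose id has not been labelled yet. A minimal sketch of that step on toy data follows, written with pandas' own `Series.isin` rather than `np.isin`. The helper name `remaining_rows` and the toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def remaining_rows(all_rows, labels, id_column="id"):
    """Return the rows of all_rows whose id does not appear in labels."""
    # With no label file yet (None) or an empty one, everything remains.
    if labels is None or labels.empty:
        return all_rows
    already_labelled = all_rows[id_column].isin(labels[id_column])
    return all_rows[~already_labelled]

# Toy data: four reviews, two of which already have a label.
all_rows = pd.DataFrame({"id": [1, 2, 3, 4], "display_text": list("abcd")})
labels = pd.DataFrame({"id": [2, 4], "label": ["Yes", "No"]})

print(remaining_rows(all_rows, labels)["id"].tolist())
print(len(remaining_rows(all_rows, None)))
```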


In [12]:
tortus.annotate()

HBox(children=(HTML(value='<h1>t &nbsp; <span style="color:#36a849">o</span>             &nbsp; r &nbsp; t &nb…

Output()

In [13]:
annotations = tortus.annotations

# add annotations to existing
labels = pd.concat([labels, annotations])

labels.to_csv(LABELS, index=False)
print("Done.")


Done.
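In the counts printed earlier, 5629 total minus 633 labelled gives 4996, not the 4998 reported as remaining. One possible explanation (an assumption, not verified here) is that the labels file contains repeated ids, for example from re-running the save cell. The sketch below shows one way to deduplicate by id when appending. `append_annotations` is a hypothetical helper, and the `label` column name is illustrative rather than taken from the `tortus` output.

```python
import pandas as pd

def append_annotations(labels, annotations, id_column="id"):
    """Combine old and new labels, keeping the newest label per id."""
    # Skip missing inputs so the first run (labels is None) also works.
    frames = [frame for frame in (labels, annotations) if frame is not None]
    combined = pd.concat(frames, ignore_index=True)
    # Later rows win, so a relabelled review replaces its old entry.
    return combined.drop_duplicates(subset=id_column, keep="last")

# Toy data: id 2 was labelled before and is labelled again.
existing = pd.DataFrame({"id": [1, 2], "label": ["Yes", "No"]})
new = pd.DataFrame({"id": [2, 3], "label": ["Bad", "Maybe"]})

merged = append_annotations(existing, new)
print(merged.to_dict("records"))
```

Keeping the last occurrence treats a repeated annotation as a correction. Using `keep="first"` would instead preserve the original label.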
