TODO: Show more about the formulas used in the calculations.

# Equity of Attention Experiment

## Step 1: Set up the working environment

Requirements for your working environment
- Python >= 3.7
- Package requirements: pandas, numpy, scipy, matplotlib, scikit-learn, tensorflow

If on Google Colab
- GDrive storage requirements: ~1GB

**IMPORTANT: If running on Google Colab, set 'running_on_colab' to True.**

In [1]:
running_on_colab = True

### Install required packages locally

Make sure to install the packages listed in local_requirements.txt using pip. Using a virtual environment is highly recommended.

### Install required packages on Google Colab
If on Google Colab, the only package that needs to be downloaded is tensorflow-gpu; the rest of the packages are already installed.

In [2]:
if running_on_colab:
    from google.colab import drive
    drive.mount('/content/gdrive')

In [3]:
if running_on_colab:
    %cd /content/gdrive/My Drive/

In [4]:
if running_on_colab:
    !git clone https://github.com/crojascampos/equity-of-attention.git

In [5]:
if running_on_colab:
    %cd equity-of-attention
    !git pull

In [6]:
if running_on_colab:
    !pip install -r 'colab_requirements.txt'

### Import packages

In [8]:
import sys
import os

sys.path.append(os.path.join('..'))

In [9]:
import pandas as pd

In [10]:
from models.ilpbased import ILPBased

## Step 2: Load and understand the Airbnb dataset

We set two variables indicating the data path and the name of the city in the Airbnb dataset (Boston, Geneva, or HongKong) and read the corresponding CSV file using pandas.

In [11]:
data_path = 'data' if running_on_colab else '../data'
airbnb_city = "Boston" # Possible: Boston, Geneva, HongKong

In [12]:
filename = 'airbnb_' + airbnb_city + '_listings'
dataset = pd.read_csv(
    os.path.join(data_path, 'datasets/' + filename + '.csv'),
    encoding='utf8'
)

For this data, we are looking for columns that can be treated as rankings with scores. The dataset contains the following columns that satisfy this condition:

- 'review_scores_rating'
- 'review_scores_accuracy'
- 'review_scores_cleanliness'
- 'review_scores_checkin'
- 'review_scores_communication'
- 'review_scores_location'
- 'review_scores_value'

In [13]:
dataset.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

### Prepare the model

The article describing this method considers two attention scenarios: singular attention, where only one subject is considered relevant for the calculations, and geometric attention, where several subjects are considered relevant.

First, let's assign values for the following variables:
- pref_qty, number of subjects to use in the calculations in each iteration. In every iteration, new subjects are selected based on calculations using the accumulated attention and relevance.
- k, number of subjects considered relevant in the calculations.
- prob, probability of any subject being chosen in the ranking.
- theta, threshold for the ideal score.
- iterations, number of times that the model will optimize each ranking.

For singular attention, both k and prob must be 1. Otherwise, k may be any positive integer less than or equal to pref_qty, and prob may be any value between 0 and 1 (both inclusive).
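To make the relationship between k and prob more concrete, here is a minimal sketch of how position-based attention weights could be computed. It assumes the geometric attention model, where position j (starting at 0) receives `prob * (1 - prob) ** j` and positions from k onward receive nothing. The helper `attention_weights` is hypothetical and is not part of the `ILPBased` module, whose internal implementation may differ.

```python
import numpy as np

def attention_weights(n_positions, k, prob):
    """Hypothetical helper: attention assigned to each ranking position.

    Assumes geometric attention: position j (0-based) receives
    prob * (1 - prob) ** j, and positions at or beyond k receive 0.
    """
    positions = np.arange(n_positions)
    weights = prob * (1.0 - prob) ** positions
    weights[positions >= k] = 0.0
    return weights

# Singular attention: k = 1 and prob = 1, so only the top position counts.
singular = attention_weights(5, k=1, prob=1)

# Geometric attention: the first k = 3 positions share decaying attention.
geometric = attention_weights(5, k=3, prob=0.5)

print(singular)
print(geometric)
```

Under this assumption, setting k = 1 and prob = 1 reduces the geometric model to singular attention, which is consistent with the rule stated above.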

The whole list of subjects may be used by setting pref_qty to the length of the ranking, but this is not recommended because it will hurt performance without any significant improvement in the result.

In [14]:
pref_qty = 100
k = 1
prob = 1
theta = 1e-7
iterations = 100

For the Airbnb dataset we are using, this method may be run in single-query mode, where only one ranking is used in the calculations, or in multi-query mode, where several rankings are used. The module supports passing either a list representing one ranking or a list of lists with several rankings.

For single query, we use the column 'review_scores_rating'. For multi query, we use all the columns listed above.

The following flag selects single or multi query in this notebook.

In [15]:
query_mode = "multi"

Because the rankings need a common scale, the values are normalized so that each one is represented as a number between 0 and 1. For this, the minimum and maximum possible values for each ranking are needed.
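As an illustration of the normalization described here, the following sketch applies standard min-max scaling to a toy list of scores. The function name `min_max_normalize` and the sample values are assumptions made for this example. The module performs its own normalization internally, possibly in a different way.

```python
import numpy as np

def min_max_normalize(values, min_value, max_value):
    """Hypothetical helper: map values from [min_value, max_value] to [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - min_value) / (max_value - min_value)

# Toy scores on a 1-to-5 scale (not taken from the dataset).
normalized = min_max_normalize([1.0, 3.0, 5.0], min_value=1, max_value=5)
print(normalized)
```

With a scale of 1 to 5, the lowest score maps to 0, the highest to 1, and the midpoint 3 to 0.5.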

In [16]:
print("Minimum values for each column:")
dataset[['review_scores_rating',
         'review_scores_accuracy',
         'review_scores_cleanliness',
         'review_scores_checkin',
         'review_scores_communication',
         'review_scores_location',
         'review_scores_value']].min()


Minimum values for each column:


review_scores_rating           0.0
review_scores_accuracy         0.0
review_scores_cleanliness      0.0
review_scores_checkin          0.0
review_scores_communication    1.0
review_scores_location         1.0
review_scores_value            1.0
dtype: float64

In [29]:
print("Maximum values for each column:")
dataset[['review_scores_rating',
         'review_scores_accuracy',
         'review_scores_cleanliness',
         'review_scores_checkin',
         'review_scores_communication',
         'review_scores_location',
         'review_scores_value']].max()

Maximum values for each column:


review_scores_rating           5.0
review_scores_accuracy         5.0
review_scores_cleanliness      5.0
review_scores_checkin          5.0
review_scores_communication    5.0
review_scores_location         5.0
review_scores_value            5.0
dtype: float64

The minimum and maximum must be passed to the module as pairs of values (for example, as tuples), with one pair per ranking.

The module accepts a list of pairs, and the list must have the same length as the number of rankings.
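The pairing rule can be sketched with a toy array that has one row per ranking. The array `rankings` and the name `min_max_pairs` are illustrative assumptions. In this notebook the pairs are written by hand from the minimum and maximum values printed above.

```python
import numpy as np

# Toy data: two rankings (rows) with three subjects each (columns).
rankings = np.array([
    [0.0, 4.5, 5.0],
    [1.0, 3.0, 5.0],
])

# One (minimum, maximum) pair per ranking, in the same order as the rows.
min_max_pairs = [(float(row.min()), float(row.max())) for row in rankings]

# The list of pairs must have as many entries as there are rankings.
assert len(min_max_pairs) == len(rankings)
print(min_max_pairs)
```

Deriving the pairs from the observed data only works if the data actually reaches the ends of the scale. Otherwise the known bounds of the scale should be written explicitly.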

In [18]:
ids = dataset['id'].to_numpy()

if query_mode == "single":
    min_max = (0,5)
    data = dataset['review_scores_rating'].fillna(0).to_numpy()
elif query_mode == "multi":
    min_max = [(0,5), (0,5), (0,5), (0,5), (1,5), (1,5), (1,5)]
    data = dataset[['review_scores_rating',
                    'review_scores_accuracy',
                    'review_scores_cleanliness',
                    'review_scores_checkin',
                    'review_scores_communication',
                    'review_scores_location',
                    'review_scores_value']].fillna(
                        {
                            'review_scores_rating': 0,
                            'review_scores_accuracy': 0,
                            'review_scores_cleanliness': 0,
                            'review_scores_checkin': 0,
                            'review_scores_communication': 1,
                            'review_scores_location': 1,
                            'review_scores_value': 1,
                        }
                    ).to_numpy().T


With the rankings and the minimum and maximum possible values for each, we create the object for the model.

In [20]:
model = ILPBased(data, min_max)

We prepare the model using the previously declared variables for the prefilter quantity, the k relevant subjects, the probability of being chosen, and the threshold.

In [21]:
model.prepare(pref_qty, k, prob, theta)

Then we run the model for the number of iterations previously defined.

In [22]:
model.start(iterations)

Iteration 0
------------------------------------------------------------
Ranking 0
Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /home/crojascampos/Projects/Python/equity-of-attention/.venv/lib/python3.9/site-packages/pulp/apis/../solverdir/cbc/linux/64/cbc /tmp/3ae220cb1e164bfbbce5cd0eae21935b-pulp.mps branch printingOptions all solution /tmp/3ae220cb1e164bfbbce5cd0eae21935b-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 206 COLUMNS
At line 50000 RHS
At line 50202 BOUNDS
At line 60203 ENDATA
Problem MODEL has 201 rows, 10000 columns and 20097 elements
Coin0008I MODEL read with 0 errors
Continuous objective value is 90.582 - 0.02 seconds
Cgl0004I processed model has 201 rows, 10000 columns (10000 integer (10000 of which binary)) and 20097 elements
Cutoff increment increased from 1e-05 to 0.001998
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 90.582
Cbc0038I Before mini b

The results can be obtained from the model's 'get_result' method.

In [24]:
result = model.get_result()

This will return the unfairness values for each iteration...

In [25]:
result["unfairness_vals"]

[636.1779999999999,
 1272.3559999999995,
 1908.5339999999994,
 2544.711999999999,
 3180.890000000001,
 3817.0679999999998,
 4453.245999999999,
 5089.423999999999,
 5725.601999999999,
 6361.780000000001]
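As a rough guide to reading these numbers, the sketch below assumes that unfairness is measured as the sum of absolute differences between each subject's accumulated attention and accumulated relevance. This is an assumption about the measure. The helper `unfairness` is hypothetical, the module may compute the value differently, and the inputs are toy values rather than results from the run above.

```python
import numpy as np

def unfairness(accumulated_attention, accumulated_relevance):
    """Hypothetical helper: L1 distance between attention and relevance."""
    attention = np.asarray(accumulated_attention, dtype=float)
    relevance = np.asarray(accumulated_relevance, dtype=float)
    return float(np.abs(attention - relevance).sum())

# Toy example: all attention goes to the first subject, while relevance is spread out.
value = unfairness([1.0, 0.0, 0.0], [0.5, 0.3, 0.2])
print(value)
```

Under this measure, a value of 0 would mean that every subject has received attention exactly equal to its relevance.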

The result also contains the accumulated relevance and attention for each subject.

In [27]:
result["acummulated_relevance"]

[69.19999999999999,
 67.34000000000003,
 66.97999999999999,
 66.70000000000002,
 69.00000000000003,
 0.0,
 0.0,
 0.0,
 62.57999999999996,
 64.33999999999995,
 69.21999999999997,
 69.35999999999999,
 65.52000000000002,
 64.95999999999998,
 67.95999999999995,
 65.70000000000003,
 63.019999999999975,
 66.14000000000001,
 67.66000000000001,
 62.96,
 69.23999999999997,
 68.57999999999997,
 68.45999999999998,
 68.79999999999997,
 68.26000000000002,
 61.48000000000002,
 67.88,
 64.48000000000003,
 61.77999999999999,
 67.84,
 68.12000000000002,
 66.26000000000002,
 63.97999999999998,
 63.98000000000001,
 64.17999999999996,
 61.17999999999999,
 68.50000000000001,
 67.68,
 69.63999999999999,
 63.579999999999956,
 65.60000000000001,
 67.52000000000002,
 69.14,
 68.89999999999996,
 66.79999999999997,
 68.00000000000001,
 68.36000000000003,
 67.97999999999996,
 63.96000000000001,
 68.75999999999996,
 67.09999999999998,
 66.34,
 69.45999999999997,
 45.46000000000001,
 62.440000000000005,
 69.3999999

In [30]:
result["acummulated_attention"]

[20.0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 