# Equity of Attention Experiment

## Step 1: Set up the working environment

Requirements for your working environment:
- Python >= 3.7
- Package requirements: pandas, numpy, scipy, matplotlib, scikit-learn, tensorflow

If running on Google Colab:
- GDrive storage requirements: ~1GB

IMPORTANT: Set the following variable to "locally" if running on your own hardware, or to "collab" if running on Google Colab.

In [119]:
run_location = "locally"

### Install required packages
If running on Google Colab, the only package that needs to be downloaded is tensorflow-gpu; the rest of the packages are already installed.

In [120]:
if run_location == "locally":
    !pip install -r ../requirements.txt




### Setting up in GDrive (only on Colab)

In [121]:
if run_location == "collab":
    from google.colab import drive
    drive.mount('/content/gdrive')

In [122]:
if run_location == "collab":
    %cd /content/gdrive/My Drive/

In [123]:
if run_location == "collab":
    !git clone https://github.com/crojascampos/equity-of-attention.git

In [124]:
if run_location == "collab":
    %cd equity-of-attention

In [125]:
if run_location == "collab":
    # Not needed in Google Colab (already installed), but just in case
    ! pip install matplotlib
    ! pip install numpy
    ! pip install pandas
    ! pip install scikit-learn
    ! pip install scipy
    # Needed
    ! pip install tensorflow-gpu

In [126]:
if run_location == "collab":
    %cd ./notebooks

### Import packages

In [127]:
import sys
import os
import math

sys.path.append(os.path.join('..'))

In [128]:
import pandas as pd
import numpy as np

In [129]:
import matplotlib.pyplot as plt
%matplotlib inline

In [130]:
# Load extra modules here

### Create folders for saving pre-computed results

We define the subfolders of **./data** that will store our pre-computed results. For each dataset:

- *data/outputs/splits* will contain two csv files with the train and test interactions, according to the selected train-test split rule.
- *data/outputs/instances* will contain a csv file with the instances to be fed to the model: pairs for point-wise recommenders or triplets for pair-wise recommenders.
- *data/outputs/models* will contain an h5 file with a pre-trained recommender model.
- *data/outputs/predictions* will contain a numpy file representing a user-item matrix, where each cell stores the relevance score of an item for a given user.
- *data/outputs/metrics* will contain a pickled dictionary with the evaluation metrics computed for a given recommender model.

**N.B.** This strategy lets us work with the intermediate outputs of the pipeline without starting from scratch every time (e.g., to apply a bias treatment as a post-processing step, we only need to load the predictions of a model).

In [131]:
data_path = '../data'

In [132]:
!mkdir "../data/outputs"
!mkdir "../data/outputs/splits"
!mkdir "../data/outputs/instances"
!mkdir "../data/outputs/models"
!mkdir "../data/outputs/predictions"
!mkdir "../data/outputs/metrics"

A subdirectory or file ../data/outputs already exists.
A subdirectory or file ../data/outputs/splits already exists.
A subdirectory or file ../data/outputs/instances already exists.
A subdirectory or file ../data/outputs/models already exists.
A subdirectory or file ../data/outputs/predictions already exists.
A subdirectory or file ../data/outputs/metrics already exists.
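
The `already exists` messages above come from re-running `!mkdir` on folders that are already present. As a minimal sketch, assuming the same folder layout, the folders could instead be created with `os.makedirs(..., exist_ok=True)`, which stays silent on re-runs and does not depend on the shell. The helper name `ensure_output_dirs` and the `OUTPUT_SUBFOLDERS` list are illustrative and not part of the original notebook.

```python
import os

# Subfolders described above; the list name is an assumption made for this sketch
OUTPUT_SUBFOLDERS = ["splits", "instances", "models", "predictions", "metrics"]

def ensure_output_dirs(data_path):
    """Create <data_path>/outputs/<subfolder> for every subfolder, skipping existing ones."""
    created = []
    for name in OUTPUT_SUBFOLDERS:
        path = os.path.join(data_path, "outputs", name)
        # exist_ok=True turns a repeated call into a no-op instead of an error
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

In this notebook it would be called as `ensure_output_dirs(data_path)`.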


## Step 2: Load and understand the Airbnb dataset

In [133]:
airbnb_city = "Boston"
airbnb_dataset = 'airbnb_' + airbnb_city + '_listings'

In [134]:
data = pd.read_csv(os.path.join(data_path, 'datasets/' + airbnb_dataset + '.csv'), encoding='utf8')
data.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

In [148]:
# Position bias calculation (geometric decay)
# Receives a position and returns a scalar: the attention bias of that position
# * pos -> position in the ranking; with pos = 1 the bias equals p
# * p ---> probability of a subject being selected, equal for every subject (default=0.5)
# * k ---> last position that receives attention; after it the bias is 0 (default=5)
def pos_bias(pos, p=0.5, k=5):
    # Positions past the cutoff k receive no attention
    if pos > k:
        return 0
    # Geometric decay: p * (1 - p)^(pos - 1)
    return p * math.pow(1 - p, pos - 1)
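
To make the shape of this bias concrete, the sketch below evaluates the same formula for a whole range of positions at once. It assumes positions are counted from 1, and the name `geometric_attention` is hypothetical; it is only an illustration of the formula in `pos_bias`, not a replacement for it.

```python
import numpy as np

def geometric_attention(n_positions, p=0.5, k=5):
    """Attention weights for positions 1..n_positions (1-based, an assumption of this sketch)."""
    positions = np.arange(1, n_positions + 1)
    # Same formula as pos_bias: p * (1 - p)^(pos - 1)
    weights = p * (1 - p) ** (positions - 1)
    # Positions after the cutoff k receive no attention
    weights[positions > k] = 0.0
    return weights

weights = geometric_attention(6)
print(weights.tolist())  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.0]
```

With the defaults, each position receives half the attention of the one before it, and the sixth position falls past the cutoff.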

In [146]:
# Ranking quality calculation
# Receives a position map and the ranking scores, and returns a scalar that indicates
# the quality of the ranking (higher is better)
# * values --> map indicating the position of each subject in the ranking; must be a square
#              numpy array whose side equals the length of the ranking
# * ranking -> the ranking scores, either a 1-dimensional list or numpy array
# * k -------> maximum position used in the calculation (default=5); must not exceed len(ranking)
def ranking_quality(values, ranking, k=5):
    # Initial checks
    if not isinstance(values, np.ndarray):
        raise TypeError("Values must be a numpy array")
    if values.shape != (len(ranking), len(ranking)):
        raise IndexError("Values must be a square array of the same length as the ranking")

    # Initialize the result and the size of the ranking
    result, n = 0, len(ranking)
    # Iterate over the first k positions
    for j in range(k):
        # Iterate over every subject in the ranking
        for i in range(n):
            # Gain (2^score - 1) divided by log2(j + 2), counted only where the map
            # places subject i at position j
            result += ((math.pow(2, ranking[i]) - 1) / math.log2((j + 1) + 1)) * values[i, j]
    return result
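
The double loop in `ranking_quality` is a weighted sum: a gain of `2^score - 1` per subject, a discount of `1 / log2(j + 2)` per position, and the map selecting which pairs count. As a rough sketch of the same arithmetic with NumPy matrix products (the name `ranking_quality_vectorized` is hypothetical, and unlike the loop it clamps `k` to the ranking length):

```python
import math
import numpy as np

def ranking_quality_vectorized(values, ranking, k=5):
    """Same sum as the nested loop: gains[i] * values[i, j] * discounts[j] over j < k."""
    ranking = np.asarray(ranking, dtype=float)
    k = min(k, len(ranking))  # assumption of this sketch: never look past the last column
    gains = 2.0 ** ranking - 1.0                   # one gain per subject, shape (n,)
    discounts = 1.0 / np.log2(np.arange(k) + 2.0)  # one discount per position, shape (k,)
    return float(gains @ values[:, :k] @ discounts)

# Identity map: subject i sits at position i
quality = ranking_quality_vectorized(np.eye(3), [3, 2, 1], k=2)
print(math.isclose(quality, 7 + 3 / math.log2(3)))  # True
```

In the example only the first two positions count, so the result is `7 / log2(2) + 3 / log2(3)`.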

In [162]:
# Unfairness calculation
# Receives a position map and the ranking scores, and returns a scalar that indicates
# the amount of unfairness in the ranking (lower is better)
# * values ------> map indicating the position of each subject in the ranking; must be a square
#                  numpy array whose side equals the length of the ranking
# * ranking -----> the ranking scores, either a 1-dimensional list or numpy array
# * pos_bias_p --> probability of a subject being chosen, the same for all subjects (default=0.5)
# * pos_bias_k --> last position that receives attention (default=5)
def unfairness(values, ranking, pos_bias_p=0.5, pos_bias_k=5):
    # Initial checks
    if not isinstance(values, np.ndarray):
        raise TypeError("Values must be a numpy array")
    if values.shape != (len(ranking), len(ranking)):
        raise IndexError("Values must be a square array of the same length as the ranking")

    # Initialize the result and the size of the ranking
    result, n = 0, len(ranking)
    print("Starting...", end="")
    # Iterate over every subject in the ranking
    for i in range(n):
        # Reset the accumulated attention and relevance for this subject
        accummA = 0
        accummR = 0
        # Iterate over every position in the ranking
        for j in range(n):
            # Progress message (three placeholders, three arguments)
            print("Result = {}. Calculating on [{}, {}]...".format(result, i, j), end="")
            # Accumulate attention and relevance up to position j
            # NOTE: j is zero-based, so the first column is evaluated as pos_bias(0),
            # which is p / (1 - p)
            accummA += pos_bias(j, pos_bias_p, pos_bias_k)
            accummR += ranking[i]
            # Add the absolute gap, counted only where the map places subject i at position j
            result += math.fabs(accummA - accummR) * values[i, j]
            # Return to the start of the line so the next message overwrites this one
            print("\r", end="")
    # Return the result
    return result

In [149]:
# Get position map
# Receives a ranking and returns a square numpy array whose side equals the length of the ranking
# * ranking --> the ranking, must be a 1-dimensional numpy array
def get_pos_map(ranking):
    if not isinstance(ranking, np.ndarray):
        raise TypeError("'ranking' must be a 1 dimensional numpy array")

    # Initialize the length of the ranking (n), the map, and a sorted copy of the ranking.
    # The names avoid shadowing the built-ins map() and sorted()
    n, pos_map, remaining = len(ranking), [], ranking.copy()
    # Sort the copy in ascending order
    remaining.sort()
    # Iterate over the ranking
    for i, val in enumerate(ranking):
        # Find the location (loc) of this value in what is left of the sorted copy
        row, loc = [], np.where(remaining == val)[0][0]
        # Delete the value from the sorted copy so it cannot be matched again
        remaining = np.delete(remaining, loc)
        # Shift the location by i, the number of values already deleted.
        # NOTE: this shift assumes every previously deleted value sat before loc in sorted order
        loc += i
        # Build a one-hot row: 1 at the location, 0 everywhere else
        for j in range(n):
            if j != loc: row.append(0)
            else: row.append(1)
        # Append the row to the map
        pos_map.append(row)
    # Return the map as a numpy array
    return np.array(pos_map)
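
Tracing the loop above on the small input `[4, 2, 7, 3]` gives columns 2, 1, 3 and 3, so two subjects share the last column and no subject lands in column 0. If a strict one-subject-per-position map is wanted, one possible alternative is to derive the positions from `np.argsort`. The sketch below is an assumption about the intended behavior, not the notebook's method; the name `position_map_argsort` is hypothetical, and it keeps the ascending order used above.

```python
import numpy as np

def position_map_argsort(ranking):
    """One-hot map where row i marks the rank of item i in ascending order."""
    ranking = np.asarray(ranking)
    n = len(ranking)
    # order[r] is the index of the item that holds rank r (stable for ties)
    order = np.argsort(ranking, kind="stable")
    # Invert the permutation: positions[i] is the rank of item i
    positions = np.empty(n, dtype=int)
    positions[order] = np.arange(n)
    # Place a single 1 per row at the item's rank
    pos_map = np.zeros((n, n), dtype=int)
    pos_map[np.arange(n), positions] = 1
    return pos_map

example_map = position_map_argsort(np.array([4, 2, 7, 3]))
print(example_map.tolist())
# [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```

Every row and every column of this map contains exactly one 1. Sorting `-ranking` instead would put the highest score in column 0.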

In [163]:
# Test with small data
test = np.array([4, 2, 7, 3])
test_map = get_pos_map(test)
unfairness(test_map, test)
#ranking_quality(test_map, test)

0 resulted in: 10.25
1 resulted in: 12.75
2 resulted in: 38.875
3 resulted in: 49.0


49.0

In [140]:
# Get the singular ranking
test_data = data[["id", "review_scores_rating"]].fillna(0)
test_data

| | id | review_scores_rating |
|---|---|---|
| 0 | 3781 | 4.95 |
| 1 | 5506 | 4.77 |
| 2 | 6695 | 4.79 |
| 3 | 10730 | 4.78 |
| 4 | 10813 | 5.00 |
| ... | ... | ... |
| 3038 | 50904278 | 0.00 |
| 3039 | 50917545 | 0.00 |
| 3040 | 50937751 | 0.00 |
| 3041 | 50955477 | 0.00 |


In [164]:
# Test the function with the full ranking
ranking = test_data["review_scores_rating"].to_numpy()

ranking_map = get_pos_map(ranking)

unfairness(ranking_map, ranking)

0 resulted in: 12056.231250000355
1 resulted in: 20740.43250000076
2 resulted in: 29686.18375000084
3 resulted in: 38479.41500000078
4 resulted in: 51227.44625000078
5 resulted in: 51229.41500000078
6 resulted in: 51231.38375000078
7 resulted in: 51233.35250000078
8 resulted in: 55849.58375000087
9 resulted in: 61256.31500000087
10 resulted in: 73500.58625000021
11 resulted in: 85943.4974999998
12 resulted in: 92696.69874999992
13 resulted in: 100825.59999999957
14 resulted in: 110343.13124999931
15 resulted in: 118729.66249999931
16 resulted in: 123608.39374999932
17 resulted in: 130600.02499999915
18 resulted in: 139569.72624999925
19 resulted in: 145610.7774999992
20 resulted in: 157864.96874999854
21 resulted in: 168088.43999999855
22 resulted in: 178316.77124999854
23 resulted in: 189651.99249999828
24 resulted in: 199684.67374999868
25 resulted in: 204071.5049999986
26 resulted in: 213449.03624999878
27 resulted in: 218555.46749999875
28 resulted in: 223637.11874999877
Calculatin

KeyboardInterrupt: 
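
The run above was interrupted before it finished: the nested Python loops visit every cell of the map and print a progress message for each one. As a rough sketch of how the same sum could be computed with array operations instead, the version below precomputes the cumulative attention per column and the cumulative relevance per cell. It keeps the zero-based call `pos_bias(j)` used in the loop so that results stay comparable; the name `unfairness_vectorized` is hypothetical, and `pos_bias` is repeated only to keep the block self-contained.

```python
import math
import numpy as np

def pos_bias(pos, p=0.5, k=5):
    # Same geometric bias as defined earlier in the notebook
    return p * math.pow(1 - p, pos - 1) if pos <= k else 0

def unfairness_vectorized(values, ranking, pos_bias_p=0.5, pos_bias_k=5):
    """Same sum as the nested loop, without per-cell Python iteration or printing."""
    ranking = np.asarray(ranking, dtype=float)
    n = len(ranking)
    # Attention accumulated up to column j (j is zero-based, as in the loop)
    cum_attention = np.cumsum([pos_bias(j, pos_bias_p, pos_bias_k) for j in range(n)])
    # Relevance accumulated by row i up to column j is (j + 1) * ranking[i]
    cum_relevance = np.outer(ranking, np.arange(1, n + 1))
    # Absolute gap per cell, kept only where the map has a nonzero entry
    return float(np.sum(np.abs(cum_attention[None, :] - cum_relevance) * values))

# Map obtained by tracing get_pos_map on [4, 2, 7, 3]
small_map = np.array([[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]])
print(unfairness_vectorized(small_map, [4, 2, 7, 3]))  # 49.0
```

The printed value matches the 49.0 reported for the small test. The intermediate arrays are n by n, so memory use grows quadratically with the number of listings.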

In [None]:
# IGNORE, TEST FOR SOMETHING
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
arr2 = np.array([[0, 1], [1, 0]])

np.any((arr2!=0)&(arr2!=1))

False