# Tecnologias de Processamento de dados - 2019/2020

## Phase II - Group 12


| Student          | Student ID | Contribution in hours |
|------------------|------------|-----------------------|
| Beatriz Lima     | 49377      |                       |
| David Almeida    | 54120      |                       |
| João Castanheira | 55052      |                       |
| Pedro Cotovio    | 55053      |                       |





## 0. Get the data

In [92]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from functools import reduce

from scipy import stats
from scipy.stats import norm

import psycopg2 as pg
import psycopg2.extras
import pandas.io.sql as sqlio
from datetime import datetime

The main datasets used in this warehouse are:

- Inside Airbnb data for Lisbon, Portugal (http://insideairbnb.com/get-the-data.html): listings.csv
- Alojamento Local open data from Turismo de Portugal (https://dadosabertos.turismodeportugal.pt/datasets/alojamento-local): Alojamento_Local.csv

In [10]:
listings_file_path = '../data/airbnb/listings.csv'
al_file_path = '../data/Alojamento_Local.csv'
df_al = pd.read_csv(al_file_path)
df_listings = pd.read_csv(listings_file_path)

#### Merge _df_listings_ with _alojamento_local.csv_

To enrich the main dataset, we cross it with the dataset from the Registo Nacional de Alojamento Local (RNAL). This adds further information about each listing's property and refines data that is already available, particularly location data.

In [11]:
def intTryParse(value):
    """Returns True if value can be parsed as an integer, False otherwise"""
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        return False

In [12]:
# keep only listings whose 'license' is not null and is not 'Exempt'
df_listings_with_license = df_listings[(~df_listings['license'].isnull())
                                        & (df_listings['license'] != 'Exempt')].copy() # copy avoids SettingWithCopyWarning

# remove '/AL' and '.' from the license code
df_listings_with_license['NrRNAL'] = [s.replace('/AL','').replace('.','')
                                      for s in df_listings_with_license['license']]

# keep only records whose license number can be converted to int
df_listings_with_license = df_listings_with_license[[intTryParse(s)
                                                     for s in df_listings_with_license['NrRNAL']]].copy()

# convert NrRNAL to int before merging the two dataframes
df_listings_with_license['NrRNAL'] = df_listings_with_license['NrRNAL'].astype(np.int64)

# inner join the two dataframes on the RNAL number
df_listings_al = pd.merge(df_listings_with_license, df_al, how='inner', on='NrRNAL')


Save the intersection of the two files to disk:

In [13]:
listings_al_file_path = '../data/listings_al.csv'
df_listings_al.to_csv(listings_al_file_path,index=False)
print('Dataset size: {}'.format(len(df_listings_al)))

Dataset size: 17168
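
The inner join silently drops every licensed listing that has no matching RNAL record. A minimal sketch of how the match rate could be inspected, using a left join with `indicator=True`. The two frames below are made-up stand-ins for `df_listings` and `df_al`, and the `municipality` column is a hypothetical name, not a column of the real RNAL file:

```python
import pandas as pd

# Toy stand-ins for df_listings and df_al (all values are made up)
listings = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'license': ['12345/AL', '6.789/AL', 'Exempt', None],
})
rnal = pd.DataFrame({'NrRNAL': [12345, 99999], 'municipality': ['A', 'B']})

# Normalise the license code; anything that is not numeric becomes NaN
codes = (listings['license']
         .str.replace('/AL', '', regex=False)
         .str.replace('.', '', regex=False))
listings['NrRNAL'] = pd.to_numeric(codes, errors='coerce')

# Keep rows with a numeric code and cast to the same dtype as the RNAL key
licensed = listings.dropna(subset=['NrRNAL']).astype({'NrRNAL': 'int64'})

# A left join with indicator=True labels each row 'both' or 'left_only'
merged = licensed.merge(rnal, how='left', on='NrRNAL', indicator=True)
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Rows labelled `left_only` are licensed listings that the inner join would discard.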


## 1. Dimensions and facts tables of the data warehouse

+ Define and model them in SQL
+ Identify hierarchies and fact granularity
+ Create the dimensions and facts tables in the DBMS (PostgreSQL)

### General schema

![Star schema](../images/schema.png)
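
As a rough sketch of how the listings fact table could be defined in SQL, the block below creates the fact table together with stub dimension tables. The fact columns follow the `listings` table queried later in this notebook. The dimension table names (`dim_host`, `dim_date`, and so on), the column types, and the one-row-per-listing granularity are assumptions, and an in-memory SQLite database stands in for PostgreSQL so that the example runs without a server:

```python
import sqlite3

# Hypothetical dimension names; each stub table holds only its primary key
dimensions = ['host', 'date', 'location', 'property', 'review']

ddl = [f"CREATE TABLE dim_{d} ({d}_id INTEGER PRIMARY KEY);" for d in dimensions]

# Fact table: one foreign key per dimension plus one measure
fk_columns = ",\n    ".join(
    f"{d}_id INTEGER NOT NULL REFERENCES dim_{d}({d}_id)" for d in dimensions
)
ddl.append(f"""CREATE TABLE listings (
    listing_id INTEGER PRIMARY KEY,
    {fk_columns},
    price_per_night INTEGER
);""")

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite enforces foreign keys only when enabled
for statement in ddl:
    conn.execute(statement)

# One row per dimension, then one fact row that references them
for d in dimensions:
    conn.execute(f"INSERT INTO dim_{d} VALUES (1)")
conn.execute("INSERT INTO listings VALUES (25659, 1, 1, 1, 1, 1, 60)")

# A fact row that points at a missing dimension key is rejected
try:
    conn.execute("INSERT INTO listings VALUES (29248, 2, 1, 1, 1, 1, 60)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(listings)")]
print(fact_columns, rejected)
```

Declaring the foreign keys makes the database reject facts whose dimension rows have not been loaded yet, which is why the dimensions must be inserted before the facts.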

## 2. Define an ETL workflow

+ Identify all data sources for all dimensions. Add URL links to all data that should be available. If the data is not public, link to Dropbox files, Google Drive, or a similar location
+ For each dimension show the code used for modeling, filtering and inserting data
+ Describe the process for inserting facts data

The ETL workflow is defined in separate notebooks for each dimension:

 - ETL_Property.ipynb
 - ETL_Host.ipynb
 - ETL_Review.ipynb
 - ETL_Date.ipynb
 - ETL_Location.ipynb

## 3. Critical assessment of the work
+ Describe potential issues with the ETL procedure used
+ Compare your schema to the one previously defined in phase I
+ Discuss the issues for updating the data warehouse with novel data

In [14]:
# Mention that dimensions might have unnecessary rows! 

Describe facts loading:
* Merge every mapping
* Merge with one fact
* Merge Facts

## 3. Process and insert facts

In [73]:
use_local = True
server_host = "localhost" if use_local else "appserver-01.alunos.di.fc.ul.pt"
port = 5432
sslmode = "disable" if use_local else "allow"
dbname = "postgres" if use_local else "tpd012"
dbusername = "postgres" if use_local else "tpd012"
dbpassword = "h9qoipj2" if use_local else "Airbnbosses69420"

In [74]:
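def get_listing_price(listing_id):
    """Returns the nightly price of a listing as an int, e.g. '$1,000.00' -> 1000"""
    price = df_listings_al[df_listings_al['id']==listing_id].price.values[0]
    # drop the cents, the thousands separator and the currency symbol
    return int(price.strip().split('.')[0].replace(',','').replace('$',''))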

# function to query table and convert it to pandas dataframe
def query_table(conn, table_name):
    """Returns DataFrame with queried database table"""
    sql = "select * from {};".format(table_name)
    #return dataframe
    return sqlio.read_sql_query(sql, conn)

# for this function to run, the dataframes must have the same columns, in the same order
def get_data_to_insert(df_etl, df_sql,pk):
    """Returns data valid for insertion in dimension from a new ETL-processed DataFrame"""
    # keep rows whose key is not yet in the database; `~` negates the boolean mask
    df_insert = df_etl[~df_etl[pk].astype(int).isin(df_sql[pk].astype(int))].dropna(how = 'all')
    # keep=False drops every row whose key appears more than once
    df_insert = df_insert.drop_duplicates(subset=[pk], keep=False)
    return df_insert

# function for bulk insert
def insert_data(df, table_name, conn):
    """Inserts selected data into a database table and closes the connection"""
    df_columns = list(df)
    columns = ",".join(df_columns)
    values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
    insert_stmt = "INSERT INTO {} ({}) {}".format(table_name,columns,values)
    success = True
    try:
        cursor = conn.cursor()
        # tolist() converts numpy scalars (e.g. numpy.int64) to native Python
        # types, which psycopg2 can adapt
        psycopg2.extras.execute_batch(cursor, insert_stmt, df.values.tolist())
        conn.commit()
    except pg.DatabaseError as error:
        success = False
        print(error)
    finally:
        if conn is not None:
            conn.close()
    return success
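
The effect of `get_data_to_insert` is easiest to see on a small example. The sketch below re-implements the same filter under a hypothetical name (`new_rows`) on made-up data: keys already in the database are skipped, and because of `keep=False` a key that appears twice in the ETL output is dropped entirely rather than deduplicated.

```python
import pandas as pd

def new_rows(df_etl, df_sql, pk):
    """Anti-join: rows of df_etl whose key is absent from df_sql (illustrative helper)"""
    already_loaded = df_etl[pk].astype(int).isin(df_sql[pk].astype(int))
    return df_etl[~already_loaded].drop_duplicates(subset=[pk], keep=False)

# Made-up ETL output: listing 1 is already loaded, listing 3 appears twice
df_etl = pd.DataFrame({'listing_id': [1, 2, 3, 3, 4],
                       'price_per_night': [60, 45, 80, 85, 50]})
df_sql = pd.DataFrame({'listing_id': [1], 'price_per_night': [60]})

result = new_rows(df_etl, df_sql, 'listing_id')
print(result)
```

Only listings 2 and 4 remain. If conflicting duplicates should be resolved rather than discarded, `keep='first'` or `keep='last'` would be the alternative.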

### 3.1 Listing Fact

Load the mappings between each dimension and the listing fact table


In [15]:
listings_date_path = '../processed_dt/df_listings_date.csv'
listings_host_path = '../processed_dt/df_listings_host.csv'
listings_property_path = '../processed_dt/df_listings_property.csv'
listings_review_path = '../processed_dt/df_listings_review.csv'
listings_location_path = '../processed_dt/location_fk.csv'

df_listings_date = pd.read_csv(listings_date_path)[['listing_id','date_id']]
df_listings_host = pd.read_csv(listings_host_path)[['listing_id','host_id']]
df_listings_property = pd.read_csv(listings_property_path).rename(columns={'ID':'listing_id','Property':'property_id'})[['listing_id','property_id']]
df_listings_review = pd.read_csv(listings_review_path)[['listing_id','review_id']]
df_listings_location = pd.read_csv(listings_location_path).rename(columns={'fk':'location_id','listings_id':'listing_id'})[['listing_id','location_id']]

Inner join all dataframes by 'listing_id'

In [76]:
#inner join all dataframes, by listing_id
dfs = [df_listings_date, df_listings_host, df_listings_property, df_listings_review, df_listings_location]
df_listings_facts_etl = reduce(lambda  left,right: pd.merge(left,right,on=['listing_id'], how='inner'), dfs)
#get the fact metric
df_listings_facts_etl['price_per_night'] = [get_listing_price(i) for i in df_listings_facts_etl['listing_id']]

In [84]:
df_listings_facts_etl

Unnamed: 0,listing_id,date_id,host_id,property_id,review_id,location_id,price_per_night
0,25659,13092017,107347,1,1,1194,60
1,29248,15012016,125768,2,2,537,60
2,29396,12052016,126415,1,1,1745,60
3,29720,31082017,128075,3,1,1592,1000
4,29915,16102018,128890,5,3,480,45
...,...,...,...,...,...,...,...
12049,41079317,22042016,316386247,9,9,1828,38
12050,41251727,1012020,303598743,7,14,725,40
12051,41397838,8032017,318669472,2,2,1669,50
12052,41403386,3122019,325646937,1023,3,1293,80
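
Calling `get_listing_price` once per listing rescans `df_listings_al` for every row. A possible vectorized alternative, shown here as a sketch on made-up data, parses the whole `price` column once and merges it on the listing id. The parsing steps mirror those of `get_listing_price`; the frames are small stand-ins for `df_listings_al` and `df_listings_facts_etl`:

```python
import pandas as pd

# Toy stand-ins for df_listings_al and df_listings_facts_etl
listings_al = pd.DataFrame({'id': [25659, 29720, 29915],
                            'price': ['$60.00', '$1,000.00', ' $45.00 ']})
facts = pd.DataFrame({'listing_id': [29915, 25659, 29720]})

# Same steps as get_listing_price, applied to the whole column at once
prices = (listings_al['price']
          .str.strip()
          .str.split('.').str[0]               # drop the cents
          .str.replace(',', '', regex=False)   # drop the thousands separator
          .str.replace('$', '', regex=False)   # drop the currency symbol
          .astype(int))

price_lookup = pd.DataFrame({'listing_id': listings_al['id'],
                             'price_per_night': prices})
facts = facts.merge(price_lookup, on='listing_id', how='left')
print(facts)
```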


Query listings table and convert it to dataframe

In [80]:
conn = pg.connect(host = server_host,database = dbname, user = dbusername,password = dbpassword, sslmode = sslmode)
df_listings_facts_sql = query_table(conn, 'listings')
conn.close()
df_listings_facts_sql.head()

Unnamed: 0,listing_id,host_id,date_id,location_id,property_id,review_id,price_per_night


Get just new listings that are not in the database

In [82]:
df_listings_insert = get_data_to_insert(df_listings_facts_etl,df_listings_facts_sql,'listing_id')
df_listings_insert

Unnamed: 0,listing_id,date_id,host_id,property_id,review_id,location_id,price_per_night
0,25659,13092017,107347,1,1,1194,60
1,29248,15012016,125768,2,2,537,60
2,29396,12052016,126415,1,1,1745,60
3,29720,31082017,128075,3,1,1592,1000
4,29915,16102018,128890,5,3,480,45
...,...,...,...,...,...,...,...
12049,41079317,22042016,316386247,9,9,1828,38
12050,41251727,1012020,303598743,7,14,725,40
12051,41397838,8032017,318669472,2,2,1669,50
12052,41403386,3122019,325646937,1023,3,1293,80


Insert listings into the database

In [None]:
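if len(df_listings_insert) > 0:
    table_name = 'listings'
    conn = pg.connect(host = server_host,database = dbname, user = dbusername,password = dbpassword, sslmode = sslmode)
    success = insert_data(df_listings_insert,table_name, conn) # insert_data closes the connection
    # open a new connection to refresh the local copy of the fact table
    conn = pg.connect(host = server_host,database = dbname, user = dbusername,password = dbpassword, sslmode = sslmode)
    df_listings_facts_sql = query_table(conn, 'listings')
    conn.close()
    if success: print('Data inserted successfully')
else: print('No data to insert')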

### 3.2 Availability fact

In [89]:
def date_pk(date):
    """Builds date primary key"""
    return int(date.strftime('%d%m%Y'))
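
Because the key is an integer built in day-month-year order, a leading zero in the day is lost (1 January 2020 becomes `1012020`, as seen in the `date_id` column above), and the keys do not sort chronologically. The sketch below illustrates both points. `date_from_pk` is a hypothetical inverse helper, not part of the original notebook, that restores the date by zero-padding the key to eight digits:

```python
from datetime import datetime

def date_pk(date):
    """Builds date primary key (same definition as in the notebook)"""
    return int(date.strftime('%d%m%Y'))

def date_from_pk(pk):
    """Hypothetical inverse: zero-pad to 8 digits and parse as ddmmYYYY"""
    return datetime.strptime('{:08d}'.format(pk), '%d%m%Y')

new_year = date_pk(datetime(2020, 1, 1))   # '01012020' parsed as int
earlier = date_pk(datetime(2019, 12, 3))   # '03122019' parsed as int

print(new_year, earlier)
print(date_from_pk(new_year))
# Integer order does not follow calendar order
print(new_year < earlier)
```

A year-month-day key (`'%Y%m%d'`) would sort chronologically, but switching to it would require rebuilding the date dimension with the same format.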

Read data from calendar.csv, which contains the data to insert into the availability fact table

In [109]:
df_calendar = pd.read_csv('../data/airbnb/calendar.csv')
print(df_calendar.shape)

(9125846, 7)

In [None]:
listings_ids_on_sql = df_listings_facts_sql['listing_id'].values
# keep only calendar rows whose listing exists in the listings fact table
df_calendar = df_calendar[df_calendar['listing_id'].isin(listings_ids_on_sql)]
print(df_calendar.shape)

The calendar file has more than 9 million records.

For the purpose of this project, we use only the first k rows of the calendar file.

In [103]:
# keep only the first k rows
k = 100000
df_availability_date = df_calendar.iloc[:k][['listing_id','date']].copy()

# create a column with the date primary key
df_availability_date['date_id'] = [date_pk(datetime.strptime(d, "%Y-%m-%d")) for d in df_availability_date['date']]

# remove the date column
df_availability_date = df_availability_date.drop(['date'], axis=1)

# drop duplicates, if any
df_abailability_date = df_availability_date.drop_duplicates(subset=['listing_id','date_id'])

print(df_abailability_date.shape)

(100000, 2)
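
Taking the first k rows keeps only the listings that happen to appear at the top of the file. One alternative, sketched here under the assumption that the calendar can be filtered by listing id while it is being read, is to stream the CSV in chunks and keep only rows for listings that are already in the fact table. The in-memory CSV, its `available` column, and the ids are made up for illustration:

```python
import io
import pandas as pd

# Small in-memory stand-in for calendar.csv
calendar_csv = io.StringIO(
    "listing_id,date,available\n"
    "1,2020-01-28,t\n"
    "2,2020-01-28,f\n"
    "1,2020-01-29,t\n"
    "3,2020-01-28,t\n"
    "3,2020-01-29,f\n"
)
known_listing_ids = [1, 3]  # stands in for df_listings_facts_sql['listing_id']

# Read two columns only, a few rows at a time, and filter each chunk
chunks = pd.read_csv(calendar_csv, usecols=['listing_id', 'date'], chunksize=2)
kept = [chunk[chunk['listing_id'].isin(known_listing_ids)] for chunk in chunks]
df_availability = pd.concat(kept, ignore_index=True)

# Same ddmmYYYY integer key as date_pk, computed for the whole column
df_availability['date_id'] = (pd.to_datetime(df_availability['date'], format='%Y-%m-%d')
                              .dt.strftime('%d%m%Y').astype(int))
print(df_availability)
```

With the real file, a much larger `chunksize` would be used so that memory stays bounded without slowing the read too much.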


In [105]:
df_listings_insert

Unnamed: 0,listing_id,date_id,host_id,property_id,review_id,location_id,price_per_night
0,25659,13092017,107347,1,1,1194,60
1,29248,15012016,125768,2,2,537,60
2,29396,12052016,126415,1,1,1745,60
3,29720,31082017,128075,3,1,1592,1000
4,29915,16102018,128890,5,3,480,45
...,...,...,...,...,...,...,...
12049,41079317,22042016,316386247,9,9,1828,38
12050,41251727,1012020,303598743,7,14,725,40
12051,41397838,8032017,318669472,2,2,1669,50
12052,41403386,3122019,325646937,1023,3,1293,80


In [108]:
pd.merge(df_abailability_date,df_listings_insert, how ='inner',on='listing_id')[['listing_id','date_id_x']].rename(columns = {'date_id_x':'date_id'})

Unnamed: 0,listing_id,date_id
0,250288,28012020
1,250288,29012020
2,250288,30012020
3,250288,31012020
4,250288,1022020
...,...,...
50367,416929,22012021
50368,416929,23012021
50369,416929,24012021
50370,416929,25012021


In [86]:
np.isin(np.unique(df_availability_date.listing_id.values),df_listings_facts['listing_id'].values)

array([False,  True, False, False,  True,  True,  True,  True, False,
        True,  True,  True, False, False,  True,  True, False, False,
       False,  True,  True,  True,  True,  True, False, False, False,
       False,  True, False,  True,  True, False,  True,  True,  True,
        True,  True, False, False, False, False, False,  True,  True,
       False, False,  True, False, False,  True, False, False, False,
       False, False])
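
A raw boolean array is hard to interpret. As a sketch on made-up ids, the same check can be summarized as the share of calendar listings that have a matching row in the fact table, together with the ids that are missing:

```python
import numpy as np

calendar_listing_ids = np.array([10, 10, 20, 30, 30, 40])  # made-up ids
fact_listing_ids = np.array([10, 30, 50])

unique_ids = np.unique(calendar_listing_ids)
in_facts = np.isin(unique_ids, fact_listing_ids)

coverage = in_facts.mean()            # fraction of calendar listings found in the facts
missing_ids = unique_ids[~in_facts]   # calendar listings without a fact row
print(coverage, missing_ids)
```

Availability rows for the missing ids cannot be inserted without violating the reference to the listings fact, so they have to be filtered out or their listings loaded first.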