# Tecnologias de Processamento de dados - 2019/2020

## Phase II - Group 12


| Student          | Student ID | Contribution in hours |
|------------------|------------|-----------------------|
| Beatriz Lima     | 49377      |                       |
| David Almeida    | 54120      |                       |
| João Castanheira | 55052      |                       |
| Pedro Cotovio    | 55053      |                       |





## 0. Get the data

In [9]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from functools import reduce

from scipy import stats
from scipy.stats import norm

The main datasets used in this warehouse are:

- listings.csv: Inside Airbnb listings for Lisbon, Portugal (http://insideairbnb.com/get-the-data.html)
- Alojamento_Local.csv: local accommodation registry data (https://dadosabertos.turismodeportugal.pt/datasets/alojamento-local)

In [10]:
listings_file_path = '../data/airbnb/listings.csv'
al_file_path = '../data/Alojamento_Local.csv'
df_al = pd.read_csv(al_file_path)
df_listings = pd.read_csv(listings_file_path)

#### Merge _df_listings_ with _alojamento_local.csv_

In order to enrich the main dataset, we can cross it with the dataset from Registo Nacional de Alojamento Local (RNAL) to obtain further information regarding each listing's property, as well as refine already available data, particularly in the case of location data.

In [11]:
def intTryParse(value):
    """Return True if value can be parsed as an integer, False otherwise."""
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        return False

In [12]:
# keep only listings with a usable license; .copy() avoids assigning to a slice of df_listings
df_listings_with_license = df_listings[(~df_listings['license'].isnull()) # 'license' is not null
                                        & (df_listings['license'] != 'Exempt')].copy() # and is not 'Exempt'

# string replace
df_listings_with_license['NrRNAL'] = [s.replace('/AL','').replace('.','') # remove '/AL' and '.' from code
                                      for s in df_listings_with_license['license']]

# keep only records where the license number can be converted to int
df_listings_with_license = df_listings_with_license[[intTryParse(s) # if code can be converted to int
                                                     for s in df_listings_with_license['NrRNAL']]].copy() # keep it

# convert NrRNAL to int before merging the two dataframes
df_listings_with_license['NrRNAL'] = df_listings_with_license['NrRNAL'].astype(np.int64) # convert code to int

# inner join the two dataframes
df_listings_al = pd.merge(df_listings_with_license, df_al, how='inner', on='NrRNAL')
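The same clean-up can be sketched with vectorized pandas string methods instead of list comprehensions. This is a minimal sketch on made-up license values, not the notebook's actual data; `toy_listings` and `toy_with_license` are hypothetical names, and `pd.to_numeric` may treat a few edge cases (for example scientific notation) differently from `int()`.

```python
import pandas as pd

# Toy stand-in for df_listings; the license values are invented for illustration.
toy_listings = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'license': ['12345/AL', '67.890/AL', 'Exempt', None, 'abc/AL'],
})

# Strip the '/AL' suffix and the '.' separators, then coerce to a number.
# Anything that is not numeric (including 'Exempt' and missing values) becomes NaN.
cleaned = (toy_listings['license']
           .str.replace('/AL', '', regex=False)
           .str.replace('.', '', regex=False))
toy_listings['NrRNAL'] = pd.to_numeric(cleaned, errors='coerce')

# Keep only rows with a usable registration number and store it as an integer.
toy_with_license = toy_listings.dropna(subset=['NrRNAL']).copy()
toy_with_license['NrRNAL'] = toy_with_license['NrRNAL'].astype('int64')

print(toy_with_license[['id', 'NrRNAL']])
```

Working on an explicit `.copy()` keeps the column assignment from triggering pandas' SettingWithCopyWarning.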


Save the intersection of the two files to disk:

In [13]:
listings_al_file_path = '../data/listings_al.csv'
df_listings_al.to_csv(listings_al_file_path,index=False)
print('Dataset size: {}'.format(len(df_listings_al)))

Dataset size: 17168
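An inner join silently drops listings whose license number has no match in the registry. One way to quantify that loss is a left join with `indicator=True`, sketched below on toy frames; `toy_listings`, `toy_al` and the `registry_info` column are hypothetical and only stand in for the real dataframes.

```python
import pandas as pd

# Toy stand-ins for df_listings_with_license and df_al (values are invented).
toy_listings = pd.DataFrame({'id': [10, 11, 12], 'NrRNAL': [100, 200, 300]})
toy_al = pd.DataFrame({'NrRNAL': [100, 300, 400],
                       'registry_info': ['a', 'b', 'c']})

# A left join keeps every listing; the '_merge' column records where each row came from.
checked = pd.merge(toy_listings, toy_al, how='left', on='NrRNAL', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print('Matched: {}, unmatched: {}'.format(len(checked) - len(unmatched), len(unmatched)))
```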


## 1. Dimensions and facts tables of the data warehouse

+ Define and model them in SQL
+ Identify hierarchies and fact granularity
+ Create the dimensions and facts tables in the DBMS (PostgreSQL)

### General schema

![Star schema](../images/schema.png)
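The overall shape of such a star schema can be sketched as follows. This is only an illustration under several assumptions: it uses an in-memory SQLite database as a runnable stand-in for PostgreSQL, the table names (`dim_*`, `fact_listing`) are hypothetical, only the key columns that appear in the facts dataframe are declared, and it assumes one fact row per listing. The real dimensions carry descriptive attributes that are omitted here.

```python
import sqlite3

# Dimension names taken from the ETL notebooks; table names are hypothetical.
DIMENSIONS = ['date', 'host', 'property', 'review', 'location']

conn = sqlite3.connect(':memory:')        # stand-in for the PostgreSQL connection
conn.execute('PRAGMA foreign_keys = ON')  # SQLite only enforces foreign keys when asked

for dim in DIMENSIONS:
    # Only the key is declared; descriptive attributes are left out of this sketch.
    conn.execute('CREATE TABLE dim_{0} ({0}_id INTEGER PRIMARY KEY)'.format(dim))

# One foreign key per dimension, plus the fact metric.
fk_columns = ',\n    '.join(
    '{0}_id INTEGER NOT NULL REFERENCES dim_{0} ({0}_id)'.format(dim)
    for dim in DIMENSIONS)
conn.execute('''
CREATE TABLE fact_listing (
    listing_id INTEGER PRIMARY KEY,
    {},
    price_per_night INTEGER NOT NULL
)'''.format(fk_columns))

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```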

## 2. Define an ETL workflow

+ Identify all data sources for all dimensions. Add URL links to all data that should be available. If the data is not public, point to Dropbox files, Google Drive, or a similar location
+ For each dimension show the code used for modeling, filtering and inserting data
+ Describe the process for inserting facts data

The ETL workflow is defined in separate notebooks for each dimension:

 - ETL_Property.ipynb
 - ETL_Host.ipynb
 - ETL_Review.ipynb
 - ETL_Date.ipynb
 - ETL_Location.ipynb

## 3. Critical assessment of the work
+ Describe potential issues with the ETL procedure used
+ Compare your schema to the one previously defined in phase I
+ Discuss the issues for updating the data warehouse with novel data

In [14]:
# Mention that dimensions might have unnecessary rows! 

Describe facts loading:
* Merge every mapping
* Merge with one fact
* Merge Facts

In [70]:
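def get_listing_price(listing_id):
    """Return the nightly price of a listing as an integer, e.g. '$1,000.00' -> 1000."""
    raw_price = df_listings_al[df_listings_al['id'] == listing_id].price.values[0]
    # drop the cents, then remove the thousands separator and the currency symbol
    return int(raw_price.strip().split('.')[0].replace(',', '').replace('$', ''))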

Load the mappings between each dimension and the listing fact table:


In [15]:
listings_date_path = '../processed_dt/df_listings_date.csv'
listings_host_path = '../processed_dt/df_listings_host.csv'
listings_property_path = '../processed_dt/df_listings_property.csv'
listings_review_path = '../processed_dt/df_listings_review.csv'
listings_location_path = '../processed_dt/location_fk.csv'

df_listings_date = pd.read_csv(listings_date_path)[['listing_id','date_id']]
df_listings_host = pd.read_csv(listings_host_path)[['listing_id','host_id']]
df_listings_property = pd.read_csv(listings_property_path).rename(columns={'ID':'listing_id','Property':'property_id'})[['listing_id','property_id']]
df_listings_review = pd.read_csv(listings_review_path)[['listing_id','review_id']]
df_listings_location = pd.read_csv(listings_location_path).rename(columns={'fk':'location_id','listings_id':'listing_id'})[['listing_id','location_id']]

In [68]:
# inner join all dataframes on listing_id
dfs = [df_listings_date, df_listings_host, df_listings_property, df_listings_review, df_listings_location]
df_listings_facts = reduce(lambda left, right: pd.merge(left, right, on=['listing_id'], how='inner'), dfs)
# get the fact metric
df_listings_facts['price_per_night'] = [get_listing_price(i) for i in df_listings_facts['listing_id']]
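Calling `get_listing_price` once per fact row filters the whole listings dataframe each time. A possible alternative is to parse all prices in one pass and look them up by id, as in the sketch below. The toy frames and the names `toy_listings_al`, `toy_facts` and `price_by_id` are hypothetical, and the sketch assumes listing ids are unique in the listings dataframe.

```python
import pandas as pd

# Toy stand-ins for df_listings_al and df_listings_facts (values are invented).
toy_listings_al = pd.DataFrame({'id': [1, 2, 3],
                                'price': ['$60.00', '$1,000.00', ' $45.00 ']})
toy_facts = pd.DataFrame({'listing_id': [3, 1, 2]})

# Parse every price string at once: strip whitespace, drop the cents,
# then remove the thousands separator and the currency symbol.
prices = (toy_listings_al['price']
          .str.strip()
          .str.split('.').str[0]
          .str.replace(',', '', regex=False)
          .str.replace('$', '', regex=False)
          .astype(int))

# Index the parsed prices by listing id and look each fact row up in that index.
price_by_id = pd.Series(prices.values, index=toy_listings_al['id'])
toy_facts['price_per_night'] = toy_facts['listing_id'].map(price_by_id)

print(toy_facts)
```

If ids could repeat, the index would need de-duplicating first, since the original function simply takes the first match.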


In [63]:
df_listings_facts

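|       | listing_id | date_id  | host_id   | property_id | review_id | location_id | price_per_night |
|-------|------------|----------|-----------|-------------|-----------|-------------|-----------------|
| 0     | 25659      | 13092017 | 107347    | 1           | 1         | 1194        | 60.0            |
| 1     | 29248      | 15012016 | 125768    | 2           | 2         | 537         | 60.0            |
| 2     | 29396      | 12052016 | 126415    | 1           | 1         | 1745        | 60.0            |
| 3     | 29720      | 31082017 | 128075    | 3           | 1         | 1592        | 1000.0          |
| 4     | 29915      | 16102018 | 128890    | 5           | 3         | 480         | 45.0            |
| ...   | ...        | ...      | ...       | ...         | ...       | ...         | ...             |
| 12049 | 41079317   | 22042016 | 316386247 | 9           | 9         | 1828        | 38.0            |
| 12050 | 41251727   | 1012020  | 303598743 | 7           | 14        | 725         | 40.0            |
| 12051 | 41397838   | 8032017  | 318669472 | 2           | 2         | 1669        | 50.0            |
| 12052 | 41403386   | 3122019  | 325646937 | 1023        | 3         | 1293        | 80.0            |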


In [22]:
df_availability_date = pd.read_csv('../processed_dt/df_availability_date.csv')[['listing_id','date_id']]
df_availability_date.head()

|   | listing_id | date_id  |
|---|------------|----------|
| 0 | 41791859   | 28012020 |
| 1 | 41791859   | 29012020 |
| 2 | 41791859   | 30012020 |
| 3 | 41791859   | 31012020 |
| 4 | 41791859   | 1022020  |
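The `date_id` values appear to encode dates as `ddmmyyyy` integers, so a leading zero in the day is lost (28012020 through 31012020 are followed by 1022020). That layout is inferred from the sample values, not stated anywhere in the notebook. Under that assumption, a key can be turned back into a date as sketched below.

```python
import pandas as pd

# Sample keys copied from the availability output above.
date_ids = pd.Series([28012020, 31012020, 1022020])

# Restore the leading zero dropped by the integer representation,
# then parse the 8-character string as day, month, year.
padded = date_ids.astype(str).str.zfill(8)
dates = pd.to_datetime(padded, format='%d%m%Y')

print(dates.dt.strftime('%Y-%m-%d').tolist())
```

Note that keys in this layout do not sort chronologically as integers.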


In [27]:
np.isin(np.unique(df_availability_date.listing_id.values),df_listings_facts['listing_id'].values)

array([False,  True, False, False,  True,  True,  True,  True, False,
        True,  True,  True, False, False,  True,  True, False, False,
       False,  True,  True,  True,  True,  True, False, False, False,
       False,  True, False,  True,  True, False,  True,  True,  True,
        True,  True, False, False, False, False, False,  True,  True,
       False, False,  True, False, False,  True, False, False, False,
       False, False])
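The raw boolean array is hard to read, but it does show that some availability listings have no row in the facts table. A short sketch of how the mask could be summarised and used to filter the availability data follows; the toy frames and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_availability_date and df_listings_facts (values are invented).
toy_availability = pd.DataFrame({
    'listing_id': [1, 1, 2, 3, 3],
    'date_id': [28012020, 29012020, 28012020, 28012020, 29012020],
})
toy_facts = pd.DataFrame({'listing_id': [1, 3, 4]})

# Same check as above: which distinct availability listings also appear in the facts?
availability_ids = np.unique(toy_availability['listing_id'].values)
in_facts = np.isin(availability_ids, toy_facts['listing_id'].values)

# Summarise the mask instead of printing it.
print('{} of {} availability listings have a fact row'.format(in_facts.sum(), len(in_facts)))
missing_ids = availability_ids[~in_facts]

# Keep only the availability rows whose listing exists in the facts table.
toy_availability_matched = toy_availability[
    toy_availability['listing_id'].isin(toy_facts['listing_id'])]
```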