# Italian AirBnB: A SQL Showcase

## The Beginning
As someone who studied and lived in southern Switzerland for four years, just minutes from the Italian border, and who learned Italian to a C1 level, I have a special place in my heart for Italy and Italian culture. As a well-traveled person with an insatiable curiosity, I wanted to examine Italian AirBnB rentals for whatever insights I might glean, while learning and showing off my SQL and Python skills.

I found a dataset on Kaggle at https://www.kaggle.com/datasets/salvatoremarcello/italian-airbnb-dataset. Other AirBnB datasets of interest are available at https://insideairbnb.com/get-the-data/.

## Prepping our data for load

While the structure of this Git repository and its associated databases is intended to mimic a potential production setup, certain elements (such as scheduled ETL) are missing. The focus of this project and its associated resources is to showcase SQL competency and, by extension, some infrastructural knowledge of OLAP database structure. In our case, the overall structure is as follows:
1. A Jupyter notebook (this one) cleans our CSV
2. DDL scripts are run to prep a data mart for loading
3. A staging table is loaded from the cleaned CSV via a bash script
4. The staging table is then written to the data mart and erased
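
Steps 2 through 4 can be sketched in miniature. This is only an illustration of the staging-then-mart pattern, not the project's actual scripts: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the table names `staging_listings` and `mart_listings`, their columns, and the function `load_via_staging` are all hypothetical.

```python
import sqlite3

def load_via_staging(rows):
    """Load rows into a staging table, copy them to the mart, then empty staging."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
    cur = conn.cursor()
    # Step 2: DDL for a hypothetical staging table and mart table
    cur.execute("CREATE TABLE staging_listings (listing_id INTEGER, city TEXT)")
    cur.execute("CREATE TABLE mart_listings (listing_id INTEGER, city TEXT)")
    # Step 3: load the cleaned rows into staging
    cur.executemany("INSERT INTO staging_listings VALUES (?, ?)", rows)
    # Step 4: write staging to the mart, then erase staging
    cur.execute(
        "INSERT INTO mart_listings SELECT listing_id, city FROM staging_listings"
    )
    cur.execute("DELETE FROM staging_listings")
    conn.commit()
    mart_count = cur.execute("SELECT COUNT(*) FROM mart_listings").fetchone()[0]
    staging_count = cur.execute("SELECT COUNT(*) FROM staging_listings").fetchone()[0]
    conn.close()
    return mart_count, staging_count

print(load_via_staging([(1, "Firenze"), (2, "Roma")]))  # (2, 0)
```

Keeping the raw load in a separate staging table means a failed or malformed load never touches the mart itself.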

<div class="alert alert-block alert-info">
Postgres may not be the ideal OLAP RDBMS, but it works well enough for this use case. It was chosen for several reasons, among them: 1. PostgreSQL is common and PL/pgSQL is a familiar language, 2. it deploys quickly with the Bitnami Helm chart, even locally, and 3. it integrates with Tableau.
</div>

***

In [1]:
# Load in packages
import pandas as pd
import numpy as np
# Constants import for brevity
from constants import *
# Setting max display
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

['.ipynb_checkpoints', 'constants.py', 'cleaning.ipynb', '__pycache__']


In [4]:
# Let's load in our dataset.
exp = pd.read_csv("~/projects/airbnb_sql/op-db/airbnb.csv")

### Some investigation

I want to see what I'm working with here. To get a feel for the data, I look at the cities, how many distinct "Host since" dates there are, and the dates on which the scrapes occurred. Finally, I look at the first 10 rows (condensed from 100 for clarity and brevity).

In [3]:
# Printing our unique values
print(pd.unique(exp['City']))
print(len(pd.unique(exp['Host since'])))
print(pd.unique(exp['Date of scraping']))
# Taking the head
exp.head(10)


In [None]:
exp.head(100)

In [None]:
len(exp)

In [None]:
print(len(pd.unique(exp['Listings id'])))

In [None]:
exp[exp["City"] == "Firenze"]

In [None]:
repeated_observations = exp[exp['Listings id'].duplicated(keep=False)]

In [None]:
repeated_observations

In [None]:
repeated_observations[repeated_observations['Listings id'] == 222527]

In [None]:
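from shapely.geometry import Point

def to_wkt(coord):
    """Convert a 'lat,lon' string into a WKT point string."""
    lat, lon = map(float, coord.split(','))
    # WKT orders coordinates as (x y), i.e. longitude first, then latitude
    return Point(lon, lat).wkt

# Overwrite the raw coordinate strings with their WKT representation
exp['Coordinates'] = exp['Coordinates'].apply(to_wkt)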
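
To make the latitude/longitude swap concrete, here is a minimal sketch of the same conversion written without shapely. It assumes, as `to_wkt` above implies, that the raw column holds `'lat,lon'` strings. The function name `to_wkt_plain` and the sample coordinates are hypothetical, and passing missing values through as `None` is an added assumption rather than something the original cell does.

```python
import math

def to_wkt_plain(coord):
    """Convert a 'lat,lon' string to a WKT point string; pass missing values through as None."""
    # Missing values typically arrive from pandas as None or float NaN
    if coord is None or (isinstance(coord, float) and math.isnan(coord)):
        return None
    lat, lon = (float(part) for part in coord.split(","))
    # WKT orders coordinates as x y, i.e. longitude first, then latitude
    return f"POINT ({lon} {lat})"

print(to_wkt_plain("43.7696,11.2558"))  # POINT (11.2558 43.7696)
```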

In [None]:
# Rename the columns
exp = exp.rename(columns=column_mapping)
# Select and reorder the columns
exp = exp[staging_columns]



In [None]:
# Convert the host label to a boolean; any label other than these two becomes NaN
exp['host_is_superhost'] = exp['host_is_superhost'].map({'Superhost': True, 'Host': False})
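
Because `Series.map` with a dict turns any label that is not a key into NaN, it can be worth checking for unexpected labels before mapping. The sketch below is one hedged way to do that in plain Python. The helper `unmapped_labels` and the sample values are hypothetical and not part of the original notebook.

```python
def unmapped_labels(values, mapping):
    """Return the sorted labels in `values` that have no key in `mapping`."""
    return sorted({v for v in values if v not in mapping})

# The two labels used in the mapping cell above
superhost_map = {"Superhost": True, "Host": False}

# Hypothetical sample values; lowercase "host" is not a key in the mapping
sample = ["Superhost", "Host", "host", "Superhost"]
print(unmapped_labels(sample, superhost_map))  # ['host']
```

In the notebook, it could be called on `exp['host_is_superhost'].dropna().unique()` before the mapping step. An empty list would mean every non-missing label is covered.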

In [None]:
exp.to_csv("/home/eandrews/projects/de-proj-1/op-db/airbnb_clean.csv", index=False)

In [None]:
# import pandas as pd
# df = pd.read_csv('/home/eandrews/projects/de-proj-1/op-db/airbnb_clean.csv')