# BigQuery Data Warehousing Script

**Link To Data**: https://data.cityofnewyork.us/City-Government/NYC-Citywide-Annualized-Calendar-Sales-Update/w2pb-icbu

**API end-point**: https://data.cityofnewyork.us/resource/w2pb-icbu.json

**Data Dictionary**: https://data.cityofnewyork.us/api/views/w2pb-icbu/files/8ed811b4-8238-4b5e-9acc-1e33d8705498?download=true&filename=Annualized_Calendar_Sales_Update%20Data_Dictionary.xlsx

**Cleaned Data Dictionary**: https://docs.google.com/spreadsheets/d/17XyGmnw2fZuTMCWVKB1XiWGHQuwqWOidm0w80lbIyjE/edit?usp=sharing

**IMPORTANT: This data set is 121.3 MB. Once downloaded, keep the file in the same directory as this Jupyter notebook so that the .csv file can be uploaded to Google Cloud Storage correctly. This script loads the data model described below from the dataset in Google Cloud Storage into BigQuery.**

# Data Model

## Fact Table:

**SALES_FACT**

- SALE_ID (Primary Key, String)
- DATE_ID (Foreign Key to DIM_DATE, Int)
- LOCATION_ID (Foreign Key to DIM_LOCATION, Int)
- SALE_PRICE (Int)
- RESIDENTIAL_UNITS (Int)
- COMMERCIAL_UNITS (Int)
- TOTAL_UNITS (Int)
- LAND_SQFT (Int)
- GROSS_SQFT (Int)
- INITIAL_BUILDING_CLASS (String)
- FINAL_BUILDING_CLASS (String)
- INITIAL_TAX_CLASS (Float)
- FINAL_TAX_CLASS (Float)
- PROPERTIES_UNSOLD (Boolean)
- PROPERTIES_UNSOLD_PRE_2020 (Boolean)
- PROPERTIES_UNSOLD_POST_2020 (Boolean)
- PROPERTIES_SOLD (Boolean)
- PROPERTIES_SOLD_POST_2020 (Boolean)
- PROPERTIES_SOLD_PRE_2020 (Boolean)

## Dimension Tables:

**Date Dimension (DIM_DATE)**

- DATE_ID (Primary Key, Int)
- SALE_DATE (Date)
- YEAR_BUILT (Int)
- YEAR_SOLD (Int)
- MONTH_SOLD (Int)
- DAY_SOLD (Int)


**Location Dimension (DIM_LOCATION)**

- LOCATION_ID (Primary Key, Int)
- BIN (Int)
- ADDRESS (String)
- BOROUGH (String)
- NEIGHBORHOOD (String)
- ZIP_CODE (Int)
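
To make the relationships in the data model concrete, the following minimal sketch builds the three tables in an in-memory SQLite database and joins the fact table to both dimensions. SQLite is an assumption used only for illustration (the notebook itself targets BigQuery), the sample rows are invented, and only a subset of the columns is shown.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery here; this is only an illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_DATE (
    DATE_ID INTEGER PRIMARY KEY,
    SALE_DATE TEXT,
    YEAR_SOLD INTEGER
);
CREATE TABLE DIM_LOCATION (
    LOCATION_ID INTEGER PRIMARY KEY,
    BOROUGH TEXT,
    NEIGHBORHOOD TEXT
);
CREATE TABLE SALES_FACT (
    SALE_ID TEXT PRIMARY KEY,
    DATE_ID INTEGER REFERENCES DIM_DATE (DATE_ID),
    LOCATION_ID INTEGER REFERENCES DIM_LOCATION (LOCATION_ID),
    SALE_PRICE INTEGER
);
""")

# Made-up sample rows, not taken from the NYC data set.
conn.execute("INSERT INTO DIM_DATE VALUES (1, '2021-03-15', 2021)")
conn.execute("INSERT INTO DIM_LOCATION VALUES (10, 'BROOKLYN', 'PARK SLOPE')")
conn.executemany(
    "INSERT INTO SALES_FACT VALUES (?, ?, ?, ?)",
    [("S-001", 1, 10, 950000), ("S-002", 1, 10, 1250000)],
)

# Each fact row resolves its foreign keys against the two dimension tables.
rows = conn.execute("""
    SELECT l.BOROUGH, d.YEAR_SOLD, COUNT(*) AS SALES, SUM(f.SALE_PRICE) AS TOTAL_PRICE
    FROM SALES_FACT AS f
    JOIN DIM_DATE AS d ON f.DATE_ID = d.DATE_ID
    JOIN DIM_LOCATION AS l ON f.LOCATION_ID = l.LOCATION_ID
    GROUP BY l.BOROUGH, d.YEAR_SOLD
""").fetchall()
print(rows)
```

The same join shape applies in BigQuery once the three tables are loaded: measures come from SALES_FACT, and descriptive attributes come from the dimensions.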

# Grant Required Permissions to your Google Cloud Service Account to Create a BigQuery Data Set

**1. Grant Permissions:**
- Go to the Google Cloud Console.
- Navigate to the IAM & Admin > IAM page.
- Locate the service account associated with the credentials you are using.
- Click the edit (pencil) icon for that account, then click "ADD ANOTHER ROLE".
- In the "Role" field, select "BigQuery Data Editor".
- Click "SAVE" to grant permissions.

**Note: The BigQuery Data Editor role lets your service account create datasets and edit their contents. This step is required for loading your dataset from Google Cloud Storage into the BigQuery data warehouse.**

# Install the google-cloud-bigquery library

In [None]:
pip install google-cloud-bigquery

# Install the pandas and gcsfs libraries

In [None]:
pip install pandas gcsfs

# Install pyarrow library

**NOTE: Once pyarrow is installed, you should be able to use the load_table_from_dataframe function without encountering the ValueError from the "Load Data into BigQuery Tables" Cell.**

**After installing pyarrow, you might need to restart your Python environment or Jupyter Notebook kernel before running the script again to ensure that the changes take effect.**

In [None]:
pip install pyarrow

# Import the Python 'os' module and set the path to your service account key

In [None]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'GOOGLE_CLOUD_ACCESSKEY.json'
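
Setting the environment variable does not check that the key file exists or is well formed, so a wrong path only surfaces later as an authentication error. The following is a minimal sketch, using only the standard library, that validates the file before setting the variable. The helper name `set_credentials` and the field list `REQUIRED_KEY_FIELDS` are hypothetical names introduced here for illustration.

```python
import json
import os
from pathlib import Path

# Fields expected in a service account key file (a subset, used as a sanity check).
REQUIRED_KEY_FIELDS = {"type", "project_id", "private_key", "client_email"}

def set_credentials(key_path):
    """Validate a service account key file, then point the Google client libraries at it."""
    path = Path(key_path)
    if not path.is_file():
        raise FileNotFoundError(f"Key file not found: {path.resolve()}")
    key = json.loads(path.read_text())
    missing = REQUIRED_KEY_FIELDS - set(key)
    if missing:
        raise ValueError(f"Key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"Expected a service account key, got type {key['type']!r}")
    # Store the absolute path so later cells work regardless of the working directory.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path.resolve())
    return key["project_id"]
```

Calling `set_credentials('GOOGLE_CLOUD_ACCESSKEY.json')` would then return the project id stored in the key file, which can be compared with the project id used in the cells below.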

# Create BigQuery Dataset

In [None]:
from google.cloud import bigquery

# Creates a function that creates a BigQuery dataset in your Google Cloud project
def create_bigquery_dataset(project_id, dataset_name):
    bigquery_client = bigquery.Client(project=project_id)
    dataset_id = f"{project_id}.{dataset_name}"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = "US"
    # exists_ok=True keeps this cell re-runnable if the dataset already exists
    bigquery_client.create_dataset(dataset, exists_ok=True)
    print(f"Dataset {dataset_id} is ready.")

# Replace "your-project-id" with your actual Google Cloud project id
project_id = 'your-project-id'
dataset_name = 'NYC_sales_cleaned'  
create_bigquery_dataset(project_id, dataset_name)
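
If you change `dataset_name`, the new name has to follow BigQuery's dataset naming rules. As far as the documented rules go, a dataset name may contain letters, digits, and underscores, up to 1,024 characters; hyphens and spaces are not allowed. The sketch below checks a name locally before any API call is made. The helper `is_valid_dataset_name` is a hypothetical name introduced here, and the pattern reflects the rules as stated above rather than an official validator.

```python
import re

# Letters, digits, and underscores only, 1 to 1,024 characters (assumed from the documented rules).
_DATASET_NAME = re.compile(r"[A-Za-z0-9_]{1,1024}")

def is_valid_dataset_name(name):
    """Return True if the whole name matches the assumed dataset naming pattern."""
    return _DATASET_NAME.fullmatch(name) is not None

for candidate in ["NYC_sales_cleaned", "nyc-sales-cleaned", "nyc sales"]:
    print(candidate, is_valid_dataset_name(candidate))
```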

# Create Tables in BigQuery

In [None]:
import os

from google.cloud import bigquery
from google.oauth2 import service_account

# Get the path to the service account key file from the environment variable
service_account_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')

# Set your Google Cloud credentials using the environment variable
credentials = service_account.Credentials.from_service_account_file(service_account_path)
# Initialize a BigQuery client
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Define your dataset and table names
dataset_name = 'NYC_sales_cleaned'
fact_table_name = 'SALES_FACT'
date_dim_table_name = 'DIM_DATE'
location_dim_table_name = 'DIM_LOCATION'

# Reference the dataset created above and confirm it exists (raises NotFound otherwise)
dataset_ref = bigquery.DatasetReference(client.project, dataset_name)
client.get_dataset(dataset_ref)

# Define the schema for the fact table
fact_table_schema = [
    bigquery.SchemaField('SALE_ID', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('DATE_ID', 'INTEGER'),
    bigquery.SchemaField('LOCATION_ID', 'INTEGER'),
    bigquery.SchemaField('SALE_PRICE', 'INTEGER'),
    bigquery.SchemaField('RESIDENTIAL_UNITS', 'INTEGER'),
    bigquery.SchemaField('COMMERCIAL_UNITS', 'INTEGER'),
    bigquery.SchemaField('TOTAL_UNITS', 'INTEGER'),
    bigquery.SchemaField('LAND_SQFT', 'INTEGER'),
    bigquery.SchemaField('GROSS_SQFT', 'INTEGER'),
    bigquery.SchemaField('INITIAL_BUILDING_CLASS', 'STRING'),
    bigquery.SchemaField('FINAL_BUILDING_CLASS', 'STRING'),
    bigquery.SchemaField('INITIAL_TAX_CLASS', 'FLOAT'),
    bigquery.SchemaField('FINAL_TAX_CLASS', 'FLOAT'),
    bigquery.SchemaField('PROPERTIES_UNSOLD', 'BOOL'),
    bigquery.SchemaField('PROPERTIES_UNSOLD_PRE_2020', 'BOOL'),
    bigquery.SchemaField('PROPERTIES_UNSOLD_POST_2020', 'BOOL'),
    bigquery.SchemaField('PROPERTIES_SOLD', 'BOOL'),
    bigquery.SchemaField('PROPERTIES_SOLD_POST_2020', 'BOOL'),
    bigquery.SchemaField('PROPERTIES_SOLD_PRE_2020', 'BOOL')
]

# Define the schema for the date dimension table
date_dim_table_schema = [
    bigquery.SchemaField('DATE_ID', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('SALE_DATE', 'DATE'),
    bigquery.SchemaField('YEAR_BUILT', 'INTEGER'),
    bigquery.SchemaField('YEAR_SOLD', 'INTEGER'),
    bigquery.SchemaField('MONTH_SOLD', 'INTEGER'),
    bigquery.SchemaField('DAY_SOLD', 'INTEGER')
]


# Define the schema for the location dimension table
location_dim_table_schema = [
    bigquery.SchemaField('LOCATION_ID', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('BIN', 'INTEGER'),
    bigquery.SchemaField('ADDRESS', 'STRING'),
    bigquery.SchemaField('BOROUGH', 'STRING'),
    bigquery.SchemaField('NEIGHBORHOOD', 'STRING'),
    bigquery.SchemaField('ZIP_CODE', 'INTEGER')
]

from google.api_core.exceptions import NotFound

# Creates a table only if it does not already exist in the dataset.
# Catching NotFound (rather than a bare except) lets authentication and
# permission errors surface instead of being mistaken for a missing table.
def create_table_if_missing(table_ref, schema, table_name):
    try:
        client.get_table(table_ref)
        print(f"Table {table_name} already exists in the dataset {dataset_name}.")
    except NotFound:
        client.create_table(bigquery.Table(table_ref, schema=schema))
        print(f"{table_name} Created")

# Creates the SALES_FACT table:
fact_table_ref = dataset_ref.table(fact_table_name)
create_table_if_missing(fact_table_ref, fact_table_schema, fact_table_name)

# Creates the DIM_DATE table:
date_dim_table_ref = dataset_ref.table(date_dim_table_name)
create_table_if_missing(date_dim_table_ref, date_dim_table_schema, date_dim_table_name)

# Creates the DIM_LOCATION table:
location_dim_table_ref = dataset_ref.table(location_dim_table_name)
create_table_if_missing(location_dim_table_ref, location_dim_table_schema, location_dim_table_name)
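
The schema lists in the cell above mirror the Data Model section column for column. As a rough illustration of that mapping, the sketch below derives schema dictionaries with `name`, `type`, and `mode` keys (the shape BigQuery uses for JSON schema definitions) from (column, model type) pairs. The names `MODEL_TO_BIGQUERY` and `to_bigquery_schema` are hypothetical, and only DIM_DATE is shown.

```python
# Maps the type labels used in the Data Model section to BigQuery column types.
MODEL_TO_BIGQUERY = {
    "Int": "INTEGER",
    "String": "STRING",
    "Float": "FLOAT",
    "Boolean": "BOOL",
    "Date": "DATE",
}

def to_bigquery_schema(columns, primary_key):
    """Return a list of schema dictionaries; the primary key column is marked REQUIRED."""
    return [
        {
            "name": name,
            "type": MODEL_TO_BIGQUERY[model_type],
            "mode": "REQUIRED" if name == primary_key else "NULLABLE",
        }
        for name, model_type in columns
    ]

# Columns of DIM_DATE as listed in the Data Model section.
dim_date_columns = [
    ("DATE_ID", "Int"), ("SALE_DATE", "Date"), ("YEAR_BUILT", "Int"),
    ("YEAR_SOLD", "Int"), ("MONTH_SOLD", "Int"), ("DAY_SOLD", "Int"),
]
dim_date_schema = to_bigquery_schema(dim_date_columns, primary_key="DATE_ID")
print(dim_date_schema[0])
```

Keeping the model in one place like this is a design choice, not something the notebook requires; it mainly reduces the chance that the documented model and the table schemas drift apart.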

# Read a dataset from your Google Cloud Storage into a Pandas DataFrame

In [None]:
import pandas as pd
from gcsfs import GCSFileSystem

# Replace 'xx' with your actual initials
gcs_bucket = 'cis-4400-project-xx'
gcs_file_path = 'NYC_sales_cleaned.csv'

# Use Pandas to read the dataset from GCS into a DataFrame
df = pd.read_csv(f'gcs://{gcs_bucket}/{gcs_file_path}')

# Display the first few rows of the DataFrame
df.head()
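
Before the DataFrame is split into fact and dimension tables, it can help to confirm that the CSV contains every column the data model expects, since a missing column otherwise shows up only as a `KeyError` during the split. The sketch below uses the standard `csv` module on an in-memory sample. The helper `missing_columns`, the shortened `REQUIRED_COLUMNS` set, and the sample header are assumptions made for illustration.

```python
import csv
import io

# A subset of the columns from the data model, for illustration.
REQUIRED_COLUMNS = {"SALE_ID", "DATE_ID", "LOCATION_ID", "SALE_PRICE", "SALE_DATE", "BOROUGH"}

def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header row, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - set(header))

# Hypothetical file contents in which the BOROUGH column is missing.
sample = io.StringIO(
    "SALE_ID,DATE_ID,LOCATION_ID,SALE_PRICE,SALE_DATE\n"
    "S-001,1,10,950000,2021-03-15\n"
)
print(missing_columns(sample, REQUIRED_COLUMNS))
```

With the DataFrame already loaded, the equivalent check would compare the required names against `df.columns`.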


# Load Data into BigQuery Tables

In [None]:
# Creates a function that uploads your data to BigQuery from a DataFrame
def upload_data_from_dataframe(df, table_ref):
    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.autodetect = True
    job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
    job.result()  # Wait for the job to complete

# Split the DataFrame into the fact DataFrame and the two dimension DataFrames
def split_df(df):
    fact_cols = [
        "SALE_ID", "DATE_ID", "LOCATION_ID",
        "SALE_PRICE", "RESIDENTIAL_UNITS",
        "COMMERCIAL_UNITS", "TOTAL_UNITS",
        "LAND_SQFT", "GROSS_SQFT",
        "INITIAL_BUILDING_CLASS", "FINAL_BUILDING_CLASS",
        "INITIAL_TAX_CLASS", "FINAL_TAX_CLASS",
        "PROPERTIES_UNSOLD", "PROPERTIES_UNSOLD_PRE_2020",
        "PROPERTIES_UNSOLD_POST_2020", "PROPERTIES_SOLD",
        "PROPERTIES_SOLD_POST_2020", "PROPERTIES_SOLD_PRE_2020",
    ]

    date_cols = [
        "DATE_ID", "SALE_DATE", "YEAR_BUILT",
        "YEAR_SOLD", "MONTH_SOLD", "DAY_SOLD",
    ]

    location_cols = [
        "LOCATION_ID", "BIN", "ADDRESS", "BOROUGH",
        "NEIGHBORHOOD", "ZIP_CODE",
    ]

    fact_df = df[fact_cols]
    date_dim_df = df[date_cols]
    location_dim_df = df[location_cols]
    
    # Return the split DataFrames
    return fact_df, date_dim_df, location_dim_df

fact_df, date_dim_df, location_dim_df = split_df(df)

# Upload the data to BigQuery
upload_data_from_dataframe(fact_df, fact_table_ref)
upload_data_from_dataframe(date_dim_df, date_dim_table_ref)
upload_data_from_dataframe(location_dim_df, location_dim_table_ref)
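
# De-duplicating the dimension tables

`split_df` selects columns without removing repeated rows, so if several sales share the same DATE_ID or LOCATION_ID, the dimension tables receive duplicate keys. Whether that happens depends on how the IDs were assigned in the cleaned CSV, which this notebook does not show. The following is a minimal sketch, assuming pandas is installed, that drops exact duplicate rows and verifies that the key column is unique afterwards. The helper `build_dimension` is a hypothetical name, and the sample data is invented.

```python
import pandas as pd

def build_dimension(df, columns, key):
    """Select dimension columns, drop exact duplicate rows, and verify the key is unique."""
    dim = df[columns].drop_duplicates().reset_index(drop=True)
    if dim[key].duplicated().any():
        # The same key maps to different attribute values; the source data needs fixing.
        raise ValueError(f"{key} is not unique after de-duplication")
    return dim

# Made-up rows: two sales at the same location, one at another.
sample = pd.DataFrame({
    "SALE_ID": ["S-001", "S-002", "S-003"],
    "LOCATION_ID": [10, 10, 11],
    "BOROUGH": ["BROOKLYN", "BROOKLYN", "QUEENS"],
})
location_dim = build_dimension(sample, ["LOCATION_ID", "BOROUGH"], key="LOCATION_ID")
print(len(location_dim))
```

Applied to this notebook, the dimension DataFrames would be built with a helper like this before they are passed to `upload_data_from_dataframe`, while the fact DataFrame keeps one row per sale.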