# NYC Apartment Search - Group 1

### Purpose of the Project:
The project uses data-driven approaches to analyze and visualize New York City apartment data, 311 complaints, and urban forestry data to help understand urban living dynamics. This analysis is intended to aid in making informed decisions about apartment rentals based on environmental and urban living conditions.

### Sections and Key Functions:
1. **Setup**
   - Initializes the environment with necessary libraries and settings.

2. **Part 1: Data Preprocessing**
   - Functions to load and clean data from various sources (ZIP codes, 311 complaints, tree census, Zillow rent data).
   - Quality checks and basic data explorations are conducted.

3. **Part 2: Storing Data**
   - Database setup functions to create tables and indices.
   - Functions to convert geometries for database insertion and to insert cleaned data into a PostgreSQL database.
   - Data retrieval functions to fetch and display samples from each database table.

4. **Part 3: Understanding the Data**
   - Functions to execute SQL queries and to extract meaningful insights from the database.
   - Various SQL queries analyze the relationship between apartment prices, complaints, and tree census data.

5. **Part 4: Visualizing the Data**
   - Multiple visualizations to represent data insights graphically, including trends over time and spatial distributions.

## Setup

In [None]:
# Standard library imports
import os
import pathlib
import subprocess
from datetime import datetime, timedelta
from typing import Tuple

# Third-party imports
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import pandas as pd
import requests
import seaborn as sns
from sqlalchemy import create_engine
from sqlalchemy.engine.base import Engine
from shapely.geometry import Point
from geoalchemy2 import Geometry, WKTElement

In [None]:
# Path configuration
DATA_DIR = pathlib.Path("data")
ZIPCODE_DATA_FILE = DATA_DIR / "nyc_zipcodes" / "nyc_zipcodes.shp"
ZILLOW_DATA_FILE = DATA_DIR / "zillow_rent_data.csv"
QUERY_DIR = pathlib.Path("queries")  # Directory for saving DB queries

# API configuration
APP_TOKEN = "J9t5fS2TcfDISWng9WsnCdvCP"
COMPLAINTS_URL = 'https://data.cityofnewyork.us/resource/erm2-nwe9.geojson'
TREES_URL = 'https://data.cityofnewyork.us/resource/uvpi-gqnh.geojson'

# Database configuration
DB_NAME = "nyc_data"
DB_USER = "williamsjs"
DB_URL = f"postgresql+psycopg2://{DB_USER}@localhost/{DB_NAME}"
engine = create_engine(DB_URL)

In [None]:
def ensure_directory_exists(directory: pathlib.Path):
    """Ensure that a directory exists; if not, create it."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
    except Exception as e:
        print(f"Error creating directory {directory}: {e}")

# Make sure the directories exist
ensure_directory_exists(DATA_DIR)
ensure_directory_exists(QUERY_DIR)

## Part 1: Data Preprocessing

The first part of the data cleaning process involves the following steps:

1. **Reading and cleaning the Zillow rental data**:
   - Loading the Zillow data from a CSV file
   - Melting the data so that each row represents a unique date-region pair
   - Filtering the data to include only New York City and the relevant date range (February 2022 to January 2024)
   - Keeping the required columns (zipcode, city, date, rent price) and renaming them
   - Converting the "zipcode" column to a string type and the "date" column to a datetime type

2. **Reading and cleaning the zipcode data**
3. **Downloading, cleaning, and filtering the 311 complaints data and tree data**
4. **Filtering all the datasets to include only the cleaned zipcode data**

5. **Performing data quality checks**:
   - Checking for null values in each dataset
   - Checking for duplicate entries in each dataset
   - Cross-referencing the zipcodes across the datasets to ensure consistency
6. **Showing summary information and the first five entries of each dataset**

Together, these steps extract, clean, and integrate the original data sources so that every dataset shares the same set of zipcodes and is ready for storage and analysis.

#### The `read_and_clean_zipcode_data()` function 
reads in a shapefile containing zipcode data, cleans and preprocesses the data, and returns a GeoDataFrame with unique zipcodes and their corresponding geometries. 

The key steps are:

1. Reading the shapefile using Geopandas, a library for working with geospatial data.
2. Selecting the relevant columns (zipcode and geometry) and renaming the 'ZIPCODE' column to 'zipcode' for better readability.
3. Converting the coordinate reference system (CRS) of the GeoDataFrame to EPSG:4326 (WGS84) for consistency.
4. Removing any duplicate zipcode entries, keeping only the first occurrence of each unique zipcode.
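
Steps 2 and 4 can be tried on a small in-memory table without the shapefile. The sketch below uses a plain pandas DataFrame with made-up rows as a stand-in for the GeoDataFrame (the `PO_NAME` column and the placeholder geometry strings are assumptions for illustration only); the selection, rename, and de-duplication calls are the same ones the function applies.

```python
import pandas as pd

# Made-up rows: zipcode 10004 appears twice, as it might when one zipcode
# is split across several polygons in the shapefile.
raw = pd.DataFrame({
    "ZIPCODE": ["10004", "10004", "11201"],
    "geometry": ["polygon A", "polygon B", "polygon C"],  # placeholder strings, not real shapes
    "PO_NAME": ["New York", "New York", "Brooklyn"],      # hypothetical extra column that gets dropped
})

cleaned = (
    raw[["ZIPCODE", "geometry"]]
    .rename(columns={"ZIPCODE": "zipcode"})
    .drop_duplicates(subset=["zipcode"], keep="first")
)

print(cleaned)
```

Note that `keep="first"` discards the second row for 10004, so a zipcode made of several shapes ends up represented by only one of them.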

In [None]:
def read_and_clean_zipcode_data() -> gpd.GeoDataFrame:
    """
    Read and clean zipcode data from a shapefile.
    
    Returns:
        GeoDataFrame: A cleaned GeoDataFrame containing unique zipcodes and their geometries.
    """
    # Read the shapefile using Geopandas
    zipcode_data = gpd.read_file(ZIPCODE_DATA_FILE)
    
    # Select relevant columns and rename them
    zipcode_cleaned = zipcode_data[['ZIPCODE', 'geometry']].rename(columns={'ZIPCODE': 'zipcode'})
    
    # Convert CRS to EPSG:4326 for consistency
    zipcode_cleaned = zipcode_cleaned.to_crs(epsg=4326)
    
    # Remove duplicate zipcodes, keeping the first occurrence
    zipcode_cleaned = zipcode_cleaned.drop_duplicates(subset=['zipcode'], keep='first')
    
    return zipcode_cleaned

#### The `download_and_clean_311_data()` function 
fetches 311 complaint data from the NYC Open Data API, cleans and preprocesses the data, and returns a GeoDataFrame containing the cleaned 311 complaints.

The key steps are:
1. Defining the API parameters to fetch 311 complaints created within a specific date range that have a non-null latitude.
2. Sending a request to the API and handling any errors that may occur during the download.
3. Creating a GeoDataFrame from the API response and setting the appropriate coordinate reference system (EPSG:4326).
4. Selecting and renaming the relevant columns, and dropping any rows with missing zipcodes.
5. Converting the 'created_date' column from a string to a date format.
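
The query parameters in step 1 can be assembled and inspected without calling the API. The sketch below is one possible way to do that; `build_complaints_params` is a hypothetical helper, the dates and limit are example values, and `YOUR_APP_TOKEN` is a placeholder. It also adds a longitude check, which is an assumption beyond what the function in this notebook filters on.

```python
from datetime import date
from urllib.parse import urlencode


def build_complaints_params(start: date, end: date, limit: int, app_token: str) -> dict:
    """Build Socrata query parameters for a date-bounded 311 request (hypothetical helper)."""
    # SoQL string literals are written in single quotes.
    where = (
        f"created_date >= '{start.isoformat()}T00:00:00.000' "
        f"AND created_date <= '{end.isoformat()}T23:59:59.999' "
        "AND latitude IS NOT NULL AND longitude IS NOT NULL"
    )
    return {"$$app_token": app_token, "$where": where, "$limit": limit}


# Example values only; substitute the date range and token you actually need.
params = build_complaints_params(date(2022, 2, 1), date(2024, 1, 31), 1000, "YOUR_APP_TOKEN")
query_string = urlencode(params)  # what ends up after the "?" in the request URL

print(params["$where"])
```

Passing the resulting dictionary as `params=` to `requests.get` would produce the same encoded query string, so printing it first is a cheap way to check the filter before downloading a large result.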

In [None]:
def download_and_clean_311_data() -> gpd.GeoDataFrame:
    """
    Download and clean 311 complaint data from the NYC Open Data API.
    
    Returns:
        GeoDataFrame: A cleaned GeoDataFrame containing 311 complaints with relevant fields and valid zipcodes.
    """
    # API parameters for fetching data
    complaints_params = {
        '$$app_token': APP_TOKEN,
        '$where': "created_date >= '2022-02-01T00:00:00.000' AND created_date <= '2024-02-29T00:00:00.000' AND latitude IS NOT NULL",
        '$limit': 1000000
    }
    
    # Requesting data from the API
    complaints_response = requests.get(COMPLAINTS_URL, params=complaints_params)
    if complaints_response.status_code != 200:
        raise Exception(f"Failed to download 311 data. Status code: {complaints_response.status_code}")

    # Create a GeoDataFrame from the response
    complaints_data = gpd.GeoDataFrame.from_features(complaints_response.json()['features']).set_crs(epsg=4326)
    
    # Select and rename columns, and drop rows without zipcodes
    complaints_cleaned = complaints_data[['unique_key', 'created_date', 'complaint_type', 'incident_zip', 'geometry']].rename(columns={
        'unique_key': 'unique_id', 
        'incident_zip': 'zipcode'
    }).dropna(subset=['zipcode'])
    
    # Convert 'created_date' from string to date
    complaints_cleaned['created_date'] = pd.to_datetime(complaints_cleaned['created_date']).dt.date
    
    return complaints_cleaned

#### The `download_and_clean_tree_data()` function 
fetches tree data from the NYC Open Data API, cleans and preprocesses the data, and returns a GeoDataFrame containing the cleaned tree data.

The key steps are:

1. Defining the API parameters to fetch tree data, with a limit of 10 million records.
2. Sending a request to the API and handling any errors that may occur during the download.
3. Creating a GeoDataFrame from the API response and dropping any rows where the latitude or longitude is missing.
4. Converting the latitude and longitude columns to float and creating geometry points from them.
5. Selecting and renaming the relevant columns, and dropping any rows where critical information (zipcode, health, or species) is missing.
6. Setting the coordinate reference system (EPSG:4326) for the GeoDataFrame.
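
The order of steps 3 to 5 matters: rows with missing values need to be dropped before a column is cast to `str`, because in many pandas versions the cast turns a missing value into literal text such as `'nan'`, which `dropna()` no longer detects. The sketch below walks through that ordering on made-up records; it leaves out the geometry step and is not the function itself.

```python
import pandas as pd

# Made-up records shaped like the API output: values arrive as strings or are missing.
raw = pd.DataFrame({
    "tree_id": ["1", "2", "3", "4", "5"],
    "latitude": ["40.72", None, "40.68", "40.75", "40.81"],
    "longitude": ["-73.99", "-73.95", "-73.97", "-73.98", "-73.95"],
    "zipcode": ["10002", "11211", None, "10016", "10027"],
    "health": ["Good", "Fair", "Good", None, "Poor"],
    "spc_common": ["red maple", "pin oak", "ginkgo", "honeylocust", "London planetree"],
})

trees = (
    raw.dropna(subset=["latitude", "longitude"])         # rows that cannot be located
    .astype({"latitude": float, "longitude": float})     # strings -> numbers
    .rename(columns={"spc_common": "species"})
    .dropna(subset=["zipcode", "health", "species"])     # drop missing values first...
    .astype({"zipcode": str})                            # ...then cast zipcode to str
)

print(trees[["tree_id", "zipcode", "latitude", "longitude"]])
```

Only trees 1 and 5 survive: tree 2 has no latitude, tree 3 has no zipcode, and tree 4 has no health value.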

In [None]:
def download_and_clean_tree_data() -> gpd.GeoDataFrame:
    """
    Download and clean tree data from the NYC Open Data API.
    
    Returns:
        GeoDataFrame: A cleaned GeoDataFrame containing tree data with geometries created from latitude and longitude.
    """
    trees_params = {
        '$$app_token': APP_TOKEN,
        '$limit': 10000000
    }
    
    trees_response = requests.get(TREES_URL, params=trees_params)
    if trees_response.status_code != 200:
        print(f"Failed to download tree data. Status code: {trees_response.status_code}")
        return None

    # Create a GeoDataFrame from the JSON response
    trees_data = gpd.GeoDataFrame.from_features(trees_response.json())

    # Drop rows where latitude or longitude is NaN, and convert them to float
    trees_data.dropna(subset=['latitude', 'longitude'], inplace=True)
    trees_data['latitude'] = trees_data['latitude'].astype(float)
    trees_data['longitude'] = trees_data['longitude'].astype(float)

    # Create geometry points from latitude and longitude
    trees_data['geometry'] = trees_data.apply(lambda row: Point(row['longitude'], row['latitude']), axis=1)

    # Select and rename columns, then drop rows where any critical information is missing.
    # The drop happens before the string cast, which can otherwise turn missing
    # zipcodes into literal text that dropna() does not detect.
    trees_cleaned = trees_data[['tree_id', 'spc_common', 'health', 'status', 'zipcode', 'geometry']].rename(
        columns={'spc_common': 'species'}
    ).dropna(subset=['zipcode', 'health', 'species'])
    trees_cleaned['zipcode'] = trees_cleaned['zipcode'].astype(str)
    trees_cleaned.crs = 'EPSG:4326'

    return trees_cleaned

#### The `read_and_clean_zillow_data()` function 
reads Zillow rental data for New York City from a CSV file, cleans and preprocesses the data, and returns a cleaned DataFrame.

The key steps are:

1. Loading the Zillow data from a CSV file into a DataFrame.
2. Melting the data so that each row represents a unique date-region pair, making it easier to work with.
3. Filtering the data to include only New York City and selecting the relevant date range (February 2022 to January 2024).
4. Keeping only the required columns (zipcode, city, data_date, rent_price) and renaming them for better readability.
5. Converting the 'zipcode' column to a string and the 'data_date' column to a datetime format.
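
The wide-to-long reshaping in step 2 is easier to see on a tiny table. The sketch below uses made-up rents and only two of the identifier columns; it is an illustration of the melt-and-filter idea, not the full function.

```python
import pandas as pd

# Made-up wide table: one row per zipcode, one column per month-end date.
wide = pd.DataFrame({
    "RegionName": [10001, 11201],
    "City": ["New York", "New York"],
    "2022-01-31": [4000.0, 3500.0],
    "2022-02-28": [4100.0, None],
})

# One row per (zipcode, date) pair.
tidy = pd.melt(wide, id_vars=["RegionName", "City"], var_name="data_date", value_name="rent_price")

# ISO-formatted date strings sort chronologically, so string comparison works as a date filter.
tidy = tidy[(tidy["data_date"] >= "2022-02-01") & (tidy["data_date"] <= "2024-01-31")].dropna()

tidy = tidy.rename(columns={"RegionName": "zipcode", "City": "city"})
tidy = tidy.assign(
    zipcode=tidy["zipcode"].astype(str),
    data_date=pd.to_datetime(tidy["data_date"]),
)

print(tidy)
```

The January column falls outside the date range and the missing February rent for 11201 is dropped, so a single row remains.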

In [None]:
def read_and_clean_zillow_data() -> pd.DataFrame:
    """
    Read and clean Zillow rental data for New York City.
    
    Returns:
        DataFrame: A cleaned DataFrame containing Zillow rental data with selected columns and filtered dates.
    """
    # Load Zillow data from a CSV file
    zillow_data = pd.read_csv(ZILLOW_DATA_FILE)
    
    # Melt the data so that every row is a unique date-region pair
    id_vars = ['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName', 'State', 'City', 'Metro', 'CountyName']
    melted_data = pd.melt(zillow_data, id_vars=id_vars, var_name='data_date', value_name='rent_price')
    
    # Filter for New York City data and select relevant dates
    zillow_cleaned = melted_data[
        (melted_data['City'] == 'New York') &
        (melted_data['data_date'] >= '2022-02-01') &
        (melted_data['data_date'] <= '2024-01-31')
    ]
    
    # Keep only the required columns and rename them
    zillow_cleaned = zillow_cleaned[['RegionName', 'City', 'data_date', 'rent_price']].rename(
        columns={'RegionName': 'zipcode', 'City': 'city'}
    ).dropna()
    
    # Convert data types
    zillow_cleaned['zipcode'] = zillow_cleaned['zipcode'].astype(str)
    zillow_cleaned['data_date'] = pd.to_datetime(zillow_cleaned['data_date'])

    return zillow_cleaned

#### The `load_and_clean_all_data()` function 
is responsible for loading and cleaning multiple datasets, including zip codes, Zillow rental data, 311 complaints, and tree data. It then filters all the datasets to only include entries corresponding to the cleaned zip code data.

The key steps are:

1. Reading and cleaning the zip code data using the `read_and_clean_zipcode_data()` function.
2. Reading, cleaning, and filtering the Zillow rental data using the `read_and_clean_zillow_data()` function, ensuring that only the data for the cleaned zip codes is retained.
3. Downloading, cleaning, and filtering the 311 complaints data using the `download_and_clean_311_data()` function, again ensuring that only the data for the cleaned zip codes is kept.
4. Downloading, cleaning, and filtering the tree data using the `download_and_clean_tree_data()` function, keeping only the data for the cleaned zip codes.
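
Each filtering step relies on `isin()`, which compares values as they are and does not convert between numbers and strings. The sketch below, using made-up rows, shows the cast-then-filter pattern; `valid_zipcodes` is a stand-in for `zipcode_data['zipcode']`, and the numeric zipcode column is an assumption chosen to make the cast necessary.

```python
import pandas as pd

valid_zipcodes = pd.Series(["10001", "11201"], name="zipcode")  # stand-in for zipcode_data['zipcode']

# Made-up rental rows; the zipcode column is numeric on purpose.
rents = pd.DataFrame({
    "zipcode": [10001, 7030, 11201],
    "rent_price": [4100.0, 3300.0, 3600.0],
})

# Cast to str first so both sides of isin() hold the same type.
rents["zipcode"] = rents["zipcode"].astype(str)
filtered = rents[rents["zipcode"].isin(valid_zipcodes)]

print(filtered)
```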

In [None]:
def load_and_clean_all_data() -> Tuple[gpd.GeoDataFrame, pd.DataFrame, gpd.GeoDataFrame, gpd.GeoDataFrame]:
    """
    Load and clean all relevant datasets including zip codes, rental data, 311 complaints, and tree data,
    filtering them to only include entries corresponding to the cleaned zip code data.
    
    Returns:
        Tuple[GeoDataFrame, DataFrame, GeoDataFrame, GeoDataFrame]: A tuple containing the cleaned data for
        zip codes, Zillow rental, 311 complaints, and tree data respectively.
    """
    # Read and clean zipcode data
    zipcode_data = read_and_clean_zipcode_data()
    
    # Read, clean and filter Zillow rental data
    zillow_data = read_and_clean_zillow_data()
    zillow_data = zillow_data[zillow_data['zipcode'].isin(zipcode_data['zipcode'])]
    
    # Download, clean and filter 311 complaints data
    complaints_data = download_and_clean_311_data()
    if complaints_data is not None:
        complaints_data = complaints_data[complaints_data['zipcode'].isin(zipcode_data['zipcode'])]
    
    # Download, clean and filter tree data
    tree_data = download_and_clean_tree_data()
    if tree_data is not None:
        tree_data = tree_data[tree_data['zipcode'].isin(zipcode_data['zipcode'])]

    return zipcode_data, zillow_data, complaints_data, tree_data

#### Load and clean all data sets

In [None]:
# Load and clean all data sets
zipcode_data, zillow_data, complaints_data, tree_data = load_and_clean_all_data()

#### The `check_data_quality()` function 
performs a series of data quality checks on the loaded datasets, including:

1. Checking for null values in each dataset (zipcode_data, zillow_data, complaints_data, and tree_data).
2. Checking for duplicate entries in each dataset, such as duplicate zipcodes, tree IDs, and unique IDs.
3. Cross-referencing the zipcodes across the datasets to ensure that all zipcodes in the Zillow rental, 311 complaints, and tree data are present in the cleaned zipcode_data.
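
The three checks can also be collected into a dictionary instead of printed, which makes them easier to compare across datasets. The sketch below is one way to do that; `summarize_quality` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, key_columns: list, valid_zipcodes: set) -> dict:
    """Return null counts, duplicate-key count, and unknown zipcodes (hypothetical helper)."""
    return {
        "nulls": {col: int(n) for col, n in df.isnull().sum().items()},
        "duplicate_keys": int(df.duplicated(subset=key_columns).sum()),
        "unknown_zipcodes": sorted(set(df["zipcode"].dropna()) - valid_zipcodes),
    }


# Made-up rental rows: one repeated (zipcode, date) pair, one missing rent,
# and one zipcode that is not in the reference set.
sample = pd.DataFrame({
    "zipcode": ["10001", "10001", "10001", "99999"],
    "data_date": ["2023-01-31", "2023-02-28", "2023-02-28", "2023-01-31"],
    "rent_price": [4000.0, 4050.0, 4050.0, None],
})

report = summarize_quality(sample, ["zipcode", "data_date"], {"10001", "11201"})
print(report)
```

For a table with one row per zipcode and date, the duplicate check is keyed on both columns, since a zipcode on its own is expected to repeat once per month.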

In [None]:
def check_data_quality():
    """
    Perform data quality checks on the loaded datasets.
    """

    # Check for null values in each dataset
    print("Null values in zipcode_data:", zipcode_data.isnull().sum())
    print("Null values in zillow_data:", zillow_data.isnull().sum())
    if complaints_data is not None:
        print("Null values in complaints_data:", complaints_data.isnull().sum())
    if tree_data is not None:
        print("Null values in tree_data:", tree_data.isnull().sum())

    # Check for duplicate entries
    print("Duplicate zipcodes in zipcode_data:", zipcode_data['zipcode'].duplicated().sum())
    if tree_data is not None:
        print("Duplicate tree_ids in tree_data:", tree_data['tree_id'].duplicated().sum())
    if complaints_data is not None:
        print("Duplicate unique_ids in complaints_data:", complaints_data['unique_id'].duplicated().sum())
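    # zillow_data has one row per zipcode per month, so duplicates are checked on the pair
    print("Duplicate zipcode-date pairs in zillow_data:", zillow_data.duplicated(subset=['zipcode', 'data_date']).sum())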

    # Cross-reference zipcodes across datasets
    if zillow_data is not None and zipcode_data is not None:
        print("Are all zillow_data zipcodes in zipcode_data?", all(zillow_data['zipcode'].isin(zipcode_data['zipcode'].unique())))
    if complaints_data is not None and zipcode_data is not None:
        print("Are all complaints_data zipcodes in zipcode_data?", all(complaints_data['zipcode'].isin(zipcode_data['zipcode'].unique())))
    if tree_data is not None and zipcode_data is not None:
        print("Are all tree_data zipcodes in zipcode_data?", all(tree_data['zipcode'].isin(zipcode_data['zipcode'].unique())))

# Run the data quality checks
check_data_quality()

#### Show basic info about zipcode_data

In [None]:
zipcode_data.info()

#### Show the first 5 entries of zipcode_data

In [None]:
zipcode_data.head()

#### Show basic info about complaints_data

In [None]:
complaints_data.info()

#### Show the first 5 entries of complaints_data

In [None]:
complaints_data.head()

#### Show basic info about tree_data

In [None]:
tree_data.info()

#### Show the first 5 entries of tree_data

In [None]:
tree_data.head()

#### Show basic info about zillow_data

In [None]:
zillow_data.info()

#### Show the first 5 entries of zillow_data

In [None]:
zillow_data.head()