# cx2313_ly2637 Final Project

### Setup

In [1]:
import json
import pathlib
import urllib.parse

import geoalchemy2 as gdb
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import requests
import shapely
import sqlalchemy as db

from sqlalchemy import Column, Integer, Float, String, DateTime, create_engine
from sqlalchemy.orm import declarative_base

In [2]:
DATA_DIR = pathlib.Path("data")
ZIPCODE_DATA_FILE = DATA_DIR / "zipcodes" / "ZIP_CODE_040114.shp"
ZILLOW_DATA_FILE = DATA_DIR / "zillow_rent_data.csv"

NYC_DATA_APP_TOKEN = "4P7xr8685SCdZVFOLXScTCqJi"
BASE_NYC_DATA_URL = "https://data.cityofnewyork.us/"
NYC_DATA_311 = "erm2-nwe9.geojson"
NYC_DATA_TREES = "5rq2-4hqu.geojson"

DB_SCHEMA_FILE = "schema.sql"
# directory where DB queries for Part 3 will be saved
QUERY_DIR = pathlib.Path("queries")

### Part 1: Data Preprocessing

Part 1 has two stages. First, we collect the data: some datasets are downloaded manually, and the rest are fetched automatically with Python scripts. Second, we prepare the data: we select the relevant columns, fix missing or invalid values, and draw samples from each dataset for further analysis.
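As a rough illustration of the automated download step, the sketch below only builds a paged request URL from the constants defined in Setup. It assumes the portal follows the usual Socrata conventions (a `resource/` path plus `$limit`, `$offset`, and `$$app_token` query parameters). The helper name `build_nyc_data_url` and the placeholder token are hypothetical and not part of the project code.

```python
import urllib.parse

BASE_NYC_DATA_URL = "https://data.cityofnewyork.us/"


def build_nyc_data_url(dataset_file, app_token, limit=1000, offset=0):
    """Build a paged download URL for one NYC Open Data dataset.

    Assumption: the portal uses the Socrata-style `resource/` path and the
    `$limit`, `$offset`, and `$$app_token` query parameters.
    """
    query = urllib.parse.urlencode(
        {"$$app_token": app_token, "$limit": limit, "$offset": offset},
        safe="$",  # keep the leading "$" of the parameter names unescaped
    )
    base = urllib.parse.urljoin(BASE_NYC_DATA_URL, f"resource/{dataset_file}")
    return f"{base}?{query}"


# "YOUR_APP_TOKEN" is a placeholder, not a real token.
url = build_nyc_data_url("5rq2-4hqu.geojson", app_token="YOUR_APP_TOKEN", limit=5)
print(url)
```

The resulting URL could then be passed to `requests.get`, increasing `offset` by `limit` on each request until a page comes back empty.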

#### 1.1 Load and clean data for zipcode file
- For the zipcode file, we first drop the columns we do not need, then review the basic information of the dataset, and finally identify what needs cleaning and clean it.

In [3]:
# Step 1: Load data and remove unnecessary columns
zipcode_data_file = DATA_DIR / "nyc_zipcodes.shp"
gdf = gpd.read_file(zipcode_data_file)

columns_to_keep = ['ZIPCODE', 'AREA', 'STATE', 'COUNTY', 'geometry']
zipcode_gdf = gdf[columns_to_keep]

# Rename the kept columns to lowercase for consistency
zipcode_gdf = zipcode_gdf.rename(columns={
    'ZIPCODE': 'zipcode',
    'AREA': 'area',
    'STATE': 'state',
    'COUNTY': 'county',
})

In [4]:
# Step 2: Inspect the basic information of the data
print(zipcode_gdf.info())
zipcode_gdf.head()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   zipcode   263 non-null    object  
 1   area      263 non-null    float64 
 2   state     263 non-null    object  
 3   county    263 non-null    object  
 4   geometry  263 non-null    geometry
dtypes: float64(1), geometry(1), object(3)
memory usage: 10.4+ KB
None


|   | zipcode | area | state | county | geometry |
|---|---------|------|-------|--------|----------|
| 0 | 11436 | 22699300.0 | NY | Queens | POLYGON ((1038098.252 188138.380, 1038141.936 ... |
| 1 | 11213 | 29631000.0 | NY | Kings | POLYGON ((1001613.713 186926.440, 1002314.243 ... |
| 2 | 11212 | 41972100.0 | NY | Kings | POLYGON ((1011174.276 183696.338, 1011373.584 ... |
| 3 | 11225 | 23698630.0 | NY | Kings | POLYGON ((995908.365 183617.613, 996522.848 18... |
| 4 | 11218 | 36868800.0 | NY | Kings | POLYGON ((991997.113 176307.496, 992042.798 17... |


- The coordinates in the `geometry` column are large projected values (for example, `1038098.252 188138.380`) rather than longitude/latitude pairs, so the zipcode file is not in the WGS 84 coordinate system. Our clean function therefore converts the geometries to WGS 84 (EPSG:4326) for consistency and compatibility with later analysis.
- We also drop duplicate rows so that each record is counted only once in our analysis and reporting.
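The observation about the coordinate system can be made explicit with a small range check. This is only a heuristic sketch; the helper `looks_like_lon_lat` is hypothetical and not part of the project. The two sample vertices are the first vertex of zipcode 11436 as printed before and after cleaning. In GeoPandas itself, the authoritative answer comes from inspecting the `crs` attribute of the GeoDataFrame.

```python
def looks_like_lon_lat(x, y):
    """Return True if (x, y) lies inside the valid longitude/latitude range."""
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# First vertex of zipcode 11436 as loaded from the shapefile (projected units).
raw_vertex = (1038098.252, 188138.380)
# The same vertex as reported after conversion to EPSG:4326 (longitude, latitude).
wgs84_vertex = (-73.80585, 40.68291)

print(looks_like_lon_lat(*raw_vertex))    # False
print(looks_like_lon_lat(*wgs84_vertex))  # True
```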

In [5]:
# Step 3: Clean function
def clean_zipcode_data(gdf):
    gdf_cleaned = gdf.copy()
    
    # Reproject to WGS 84 (EPSG:4326) for consistency
    gdf_cleaned = gdf_cleaned.to_crs(epsg=4326)
    
    # Validate geometric data in 'geometry'
    gdf_cleaned = gdf_cleaned[gdf_cleaned['geometry'].is_valid]
    
    # Remove duplicates
    gdf_cleaned = gdf_cleaned.drop_duplicates()
    
    # Ensure categorical consistency
    # Note: str.title() also turns the state code 'NY' into 'Ny'
    categorical_columns = ['state', 'county']
    for col in categorical_columns:
        gdf_cleaned.loc[:, col] = gdf_cleaned[col].str.title()
    
    return gdf_cleaned

cleaned_zipcode_gdf = clean_zipcode_data(zipcode_gdf)

- Check the cleaned data in detail. One of the 263 rows has been removed, and the geometries are now longitude/latitude pairs.

In [6]:
print(cleaned_zipcode_gdf.info())
cleaned_zipcode_gdf.head()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 262 entries, 0 to 262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   zipcode   262 non-null    object  
 1   area      262 non-null    float64 
 2   state     262 non-null    object  
 3   county    262 non-null    object  
 4   geometry  262 non-null    geometry
dtypes: float64(1), geometry(1), object(3)
memory usage: 12.3+ KB
None


|   | zipcode | area | state | county | geometry |
|---|---------|------|-------|--------|----------|
| 0 | 11436 | 22699300.0 | Ny | Queens | POLYGON ((-73.80585 40.68291, -73.80569 40.682... |
| 1 | 11213 | 29631000.0 | Ny | Kings | POLYGON ((-73.93740 40.67973, -73.93487 40.679... |
| 2 | 11212 | 41972100.0 | Ny | Kings | POLYGON ((-73.90294 40.67084, -73.90223 40.668... |
| 3 | 11225 | 23698630.0 | Ny | Kings | POLYGON ((-73.95797 40.67066, -73.95576 40.670... |
| 4 | 11218 | 36868800.0 | Ny | Kings | POLYGON ((-73.97208 40.65060, -73.97192 40.650... |


Now that we have the zipcodes for New York City, we will use them as a reference to filter the `311`, `tree`, and `Zillow` data down to NYC records. This filtering is included in the cleaning steps for each of those files.
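One possible shape for that filtering step is sketched below in plain Python. The helpers `normalize_zipcode` and `filter_by_zipcode`, the `value` field, and the sample rows are all made up for illustration. The sketch assumes zipcodes in the other files may arrive as strings, floats, or ZIP+4 values, which this notebook has not yet verified.

```python
def normalize_zipcode(value):
    """Return a 5-digit zipcode string, or None if the value is not one."""
    if value is None:
        return None
    text = str(value).strip()
    # Drop a ZIP+4 suffix ("10027-1234") and a float suffix ("10027.0").
    text = text.split("-")[0].split(".")[0]
    if len(text) == 5 and text.isascii() and text.isdigit():
        return text
    return None


def filter_by_zipcode(records, valid_zipcodes, key="zipcode"):
    """Keep only the records whose normalized zipcode is in valid_zipcodes."""
    kept = []
    for record in records:
        zipcode = normalize_zipcode(record.get(key))
        if zipcode in valid_zipcodes:
            kept.append({**record, key: zipcode})
    return kept


# Reference set: three zipcodes taken from the cleaned output above.
nyc_zipcodes = {"11436", "11213", "11212"}

# Made-up sample rows, for illustration only.
sample_rows = [
    {"zipcode": "11436", "value": 1},
    {"zipcode": 11213.0, "value": 2},
    {"zipcode": "90210", "value": 3},
    {"zipcode": None, "value": 4},
]

filtered = filter_by_zipcode(sample_rows, nyc_zipcodes)
print(filtered)
```

In the notebook, the reference set could be built with `set(cleaned_zipcode_gdf["zipcode"])`, and the same filter could be expressed on a DataFrame by mapping `normalize_zipcode` over the zipcode column and keeping the rows selected by `Series.isin`.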