## Cook County Parcel Data Cleaning, Integration & Connected Communities Mapping

This notebook processes multiple datasets at the parcel (PIN) level to prepare a clean, integrated view of land parcels in Cook County, with a focus on identifying those that fall within Chicago’s **Connected Communities (CC)** zoning area.

### Core Dataset: `parcel_cc.csv`
- Contains parcel identifiers (`pin`/`name`) and a key column called `inclusion_source`.
- `inclusion_source` indicates the specific reason why a parcel falls within the Connected Communities boundary (e.g., zoning inclusion, transit accessibility, affordability overlays).
- This file is used to flag PINs and later **merge with additional datasets** (e.g., tax records, city-owned land, vacant parcels).
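
A minimal sketch of how `inclusion_source` can be turned into a per-parcel flag. The PINs below are placeholders and the helper name `flag_cc` is hypothetical. The rule mirrors the filter used later in this notebook: a parcel counts as inside CC when `inclusion_source` is present and is not the `'Unknown'` placeholder.

```python
import pandas as pd

def flag_cc(df: pd.DataFrame) -> pd.Series:
    """Return True where inclusion_source names an actual inclusion reason."""
    src = df["inclusion_source"]
    return src.notna() & (src != "Unknown")

# Placeholder PINs for illustration only (not real parcels).
sample = pd.DataFrame({
    "pin": ["00000000000001", "00000000000002", "00000000000003"],
    "inclusion_source": ["cdot bus routes", "Unknown", None],
})
sample["is_inside_cc"] = flag_cc(sample)
print(sample)
```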

### Additional Steps Performed in This Notebook:
- **Standardize and deduplicate PINs** across large datasets using `tax_year` to retain the latest parcel records.
- **Merge and enrich** parcel data using geospatial and tabular joins.
- **Filter** PINs that fall inside or outside the Connected Communities boundaries.
- **Identify vacant parcels** using city-owned land and improvement status.
- **Spatially map** the resulting vacant CC parcels against CTA bus routes, L stations, and block group boundaries using Folium and GeoPandas.
- **Flag parcels** based on proximity to transit and inclusion in CC for further analysis or development prioritization.
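
The proximity flag in the last step could be sketched as follows. This is an illustration under stated assumptions, not the notebook's method: it uses the haversine (great-circle) distance on latitude/longitude columns, the 800 m threshold is hypothetical, and the coordinates are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def nearest_distance_m(parcels: pd.DataFrame, stops: pd.DataFrame) -> np.ndarray:
    """Haversine distance from each parcel to its nearest stop, in metres."""
    lat1 = np.radians(parcels["latitude"].to_numpy())[:, None]
    lon1 = np.radians(parcels["longitude"].to_numpy())[:, None]
    lat2 = np.radians(stops["latitude"].to_numpy())[None, :]
    lon2 = np.radians(stops["longitude"].to_numpy())[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return dist.min(axis=1)  # nearest stop per parcel

# Made-up coordinates for illustration only.
parcels = pd.DataFrame({
    "pin": ["A", "B"],
    "latitude": [41.8800, 41.9500],
    "longitude": [-87.6300, -87.9000],
})
stops = pd.DataFrame({"latitude": [41.8805], "longitude": [-87.6300]})

WALK_THRESHOLD_M = 800  # hypothetical cut-off, not taken from the CC ordinance
parcels["dist_to_stop_m"] = nearest_distance_m(parcels, stops)
parcels["near_transit"] = parcels["dist_to_stop_m"] <= WALK_THRESHOLD_M
print(parcels)
```

The pairwise distance matrix grows with parcels × stops, so for the full parcel table a spatial index or a chunked loop would likely be needed.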

### ✅ Final Outputs:
- Cleaned and enriched dataset of Cook County parcels
- Connected Communities inclusion flag per parcel
- Interactive HTML map showing vacant parcels, transit access, and zoning overlays
- **A CSV file containing all vacant properties along with their block group GEOIDs and CC inclusion status**


## Parcel_CC

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import geopandas as gpd
from shapely import wkt
from shapely.geometry import Point
import folium
from folium import Choropleth, GeoJson, GeoJsonTooltip
import ast

In [11]:
# Load the CSV file
file_path = 'C:/Users/kaur6/Downloads/Urban Analytics/Parcel_CC.csv'
df = pd.read_csv(file_path, low_memory=False)
print("**********************")
# Preview the data
print("First 5 rows of the dataset:")
print(df.head())
print("**********************")
# Summary info
print("\nDataset info:")
print(df.info())
print("**********************")
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
print("**********************")
# Summary statistics
print("\nSummary statistics:")
print(df.describe(include='all'))
print("**********************")

**********************
First 5 rows of the dataset:
             name      pin10      job_no  longitude   latitude  census_tract  \
0  01271000040000  127100004           0 -88.168264  42.092977  1.703180e+10   
1  01033000180000  103300018           0 -88.176387  42.140587  1.703180e+10   
2  01271020230000  127102023           0 -88.175209  42.090937  1.703180e+10   
3  01284100100000  128410010  2008000622 -88.186716  42.083475  1.703180e+10   
4  01023000530000  102300053           0 -88.159604  42.141777  1.703180e+10   

    tract_geoid  tract_white_perc  tract_black_perc inclusion_source  
0  1.703180e+10          0.669399          0.014344              NaN  
1  1.703180e+10          0.669399          0.014344              NaN  
2  1.703180e+10          0.669399          0.014344              NaN  
3  1.703180e+10          0.669399          0.014344              NaN  
4  1.703180e+10          0.669399          0.014344              NaN  
**********************

Dataset info:
<cl

In [13]:
# Count total rows with duplicated 'name' values (excluding the first occurrence)
duplicate_row_count = df.duplicated(subset='name', keep='first').sum()
print(f"🔁 Total duplicate rows based on 'name': {duplicate_row_count}")

🔁 Total duplicate rows based on 'name': 489199


In [17]:
# Count missing values per row
df['missing_count'] = df.isnull().sum(axis=1)

# Sort by missing count (ascending), so best rows come first
df_sorted = df.sort_values(by='missing_count')

df_best_rows = df_sorted.drop_duplicates(subset='name', keep='first').copy()
df_best_rows.drop(columns='missing_count', inplace=True)
df_best_rows.reset_index(drop=True, inplace=True)

# Convert specified columns to string
df_best_rows['name'] = df_best_rows['name'].astype(str)
df_best_rows['pin10'] = df_best_rows['pin10'].astype(str).str.zfill(10)  # pad to 10 digits
df_best_rows['census_tract'] = df_best_rows['census_tract'].astype(str)
df_best_rows['tract_geoid'] = df_best_rows['tract_geoid'].astype(str)
df_best_rows['inclusion_source'] = df_best_rows['inclusion_source'].astype(str)

# Fill nulls and 'nan' strings in these string columns
string_cols = ['name', 'census_tract', 'tract_geoid', 'inclusion_source']
df_best_rows[string_cols] = df_best_rows[string_cols].replace('nan', 'Unknown').fillna('Unknown')

# Drop rows where 'name' is 'Unknown' (originally missing)
df_best_rows = df_best_rows[df_best_rows['name'].str.lower() != 'unknown']

# Reset index
df_best_rows.reset_index(drop=True, inplace=True)

# Save cleaned file
output_path = 'C:/Users/kaur6/Downloads/Urban Analytics/Parcel_CC_Cleaned.csv'
df_best_rows.to_csv(output_path, index=False)

# Preview
print("✅ Final cleaned file saved.")
print(df_best_rows.head())

✅ Final cleaned file saved.
             name       pin10  job_no  longitude   latitude   census_tract  \
0  16052210320000  1605221032       0 -87.768517  41.904417  17031250600.0   
1  16131260170000  1613126017       0 -87.701964  41.875518  17031271200.0   
2  16243100090000  1624310009       0 -87.704923  41.852313  17031841700.0   
3  16252180030000  1625218003       0 -87.692609  41.847523  17031301200.0   
4  16123140420000  1612314042       0 -87.699154  41.884731  17031837100.0   

     tract_geoid  tract_white_perc  tract_black_perc  \
0  17031250600.0          0.012864          0.863835   
1  17031271200.0          0.059382          0.893112   
2  17031841700.0          0.050959          0.438904   
3  17031301200.0          0.031319          0.066308   
4  17031837100.0          0.181078          0.652256   

                   inclusion_source  
0                   cdot bus routes  
1  osm rail entrance, exit, station  
2             gtfs rail stop points  
3             

In [18]:
# Count total rows with duplicated 'name' values (excluding the first occurrence)
duplicate_row_count = df_best_rows.duplicated(subset='name', keep='first').sum()
print(f"🔁 Total duplicate rows based on 'name': {duplicate_row_count}")

🔁 Total duplicate rows based on 'name': 0
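
The cleaned preview shows `census_tract` and `tract_geoid` as strings such as `17031250600.0`. The trailing `.0` appears because the columns were parsed as floats before `astype(str)` was applied. A hedged sketch of one way to get digit-only GEOID strings (the helper name `geoid_to_str` is hypothetical):

```python
import pandas as pd

def geoid_to_str(col: pd.Series) -> pd.Series:
    """Convert float-parsed GEOIDs (e.g. 17031250600.0) to digit strings."""
    # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
    as_int = pd.to_numeric(col, errors="coerce").astype("Int64")
    return as_int.astype("string").fillna("Unknown")

geoids = pd.Series([17031250600.0, float("nan")])
print(geoid_to_str(geoids).tolist())
```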


In [19]:
# Load the CSV file
file_path = 'C:/Users/kaur6/Downloads/Urban Analytics/Parcel_CC_Cleaned.csv'
df = pd.read_csv(file_path, low_memory=False)
print("**********************")
# Preview the data
print("First 5 rows of the dataset:")
print(df.head())
print("**********************")
# Summary info
print("\nDataset info:")
print(df.info())
print("**********************")
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
print("**********************")
# Summary statistics
print("\nSummary statistics:")
print(df.describe(include='all'))
print("**********************")

**********************
First 5 rows of the dataset:
             name       pin10  job_no  longitude   latitude   census_tract  \
0  16052210320000  1605221032       0 -87.768517  41.904417  17031250600.0   
1  16131260170000  1613126017       0 -87.701964  41.875518  17031271200.0   
2  16243100090000  1624310009       0 -87.704923  41.852313  17031841700.0   
3  16252180030000  1625218003       0 -87.692609  41.847523  17031301200.0   
4  16123140420000  1612314042       0 -87.699154  41.884731  17031837100.0   

     tract_geoid  tract_white_perc  tract_black_perc  \
0  17031250600.0          0.012864          0.863835   
1  17031271200.0          0.059382          0.893112   
2  17031841700.0          0.050959          0.438904   
3  17031301200.0          0.031319          0.066308   
4  17031837100.0          0.181078          0.652256   

                   inclusion_source  
0                   cdot bus routes  
1  osm rail entrance, exit, station  
2             gtfs rail stop

### Renamed `Parcel_CC_Cleaned.csv` to `Parcel_CC_Unique.csv`

## Cook County

In [23]:
file_path = 'C:/Users/kaur6/Downloads/Urban Analytics/final_cook_county_data_with_same_dtype.csv'

# Preview first 100 rows only
df_sample = pd.read_csv(file_path, nrows=100, low_memory=False)
print(df_sample.head())
print(df_sample.info())

              pin  tax_year  card_num  class  township_code  \
0  31074070140000      2021         1  '295'             32   
1  31074070140000      2022         1  '295'             32   
2  31074070140000      2023         1  '295'             32   
3  31074070140000      2024         1  '295'             32   
4  31074070160000      2003         1  '295'             32   

   proration_key_pin  pin_proration_rate  card_proration_rate cdu  \
0                NaN                   1                    0  AV   
1                NaN                   1                    0  AV   
2                NaN                   1                    0  AV   
3                NaN                   1                    0  AV   
4                NaN                   1                    0  AV   

   pin_is_multicard  ...  basement_finish      roof_material  \
0             False  ...       Unfinished  Shingle + Asphalt   
1             False  ...       Unfinished  Shingle + Asphalt   
2             

In [27]:
file_path = 'C:/Users/kaur6/Downloads/Urban Analytics/final_cook_county_data_with_same_dtype.csv'
intermediate_path = 'C:/Users/kaur6/Downloads/Urban Analytics/intermediate_latest_by_pin.csv'
final_output = 'C:/Users/kaur6/Downloads/Urban Analytics/cook_county_latest_by_pin.csv'

chunk_size = 100000
first_chunk = True

# Step 1: Chunked processing and intermediate writing
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size, low_memory=False, on_bad_lines='skip')):
    print(f"🔄 Processing chunk {i + 1}")
    
    chunk['pin'] = chunk['pin'].astype(str)
    chunk['tax_year'] = pd.to_numeric(chunk['tax_year'], errors='coerce')
    chunk = chunk.dropna(subset=['tax_year'])

    # Keep latest tax_year per pin in this chunk
    chunk_latest = chunk.sort_values('tax_year').drop_duplicates('pin', keep='last')

    # Write to intermediate file
    mode = 'w' if first_chunk else 'a'
    header = first_chunk
    chunk_latest.to_csv(intermediate_path, mode=mode, header=header, index=False)
    first_chunk = False

# Step 2: Final deduplication on intermediate file
print("🔁 Final deduplication across all chunks...")
df = pd.read_csv(intermediate_path, low_memory=False)
df['pin'] = df['pin'].astype(str)
df['tax_year'] = pd.to_numeric(df['tax_year'], errors='coerce')
df = df.sort_values('tax_year').drop_duplicates('pin', keep='last')

# Step 3: Save final result
df.to_csv(final_output, index=False)
print(f"✅ Done! Final cleaned file saved to:\n{final_output}")
print("📊 Final row count:", df.shape[0])

🔄 Processing chunk 1
🔄 Processing chunk 2
🔄 Processing chunk 3
🔄 Processing chunk 4
🔄 Processing chunk 5
🔄 Processing chunk 6
🔄 Processing chunk 7
🔄 Processing chunk 8
🔄 Processing chunk 9
🔄 Processing chunk 10
🔄 Processing chunk 11
🔄 Processing chunk 12
🔄 Processing chunk 13
🔄 Processing chunk 14
🔄 Processing chunk 15
🔄 Processing chunk 16
🔄 Processing chunk 17
🔄 Processing chunk 18
🔄 Processing chunk 19
🔄 Processing chunk 20
🔄 Processing chunk 21
🔄 Processing chunk 22
🔄 Processing chunk 23
🔄 Processing chunk 24
🔄 Processing chunk 25
🔄 Processing chunk 26
🔄 Processing chunk 27
🔄 Processing chunk 28
🔄 Processing chunk 29
🔄 Processing chunk 30
🔄 Processing chunk 31
🔄 Processing chunk 32
🔄 Processing chunk 33
🔄 Processing chunk 34
🔄 Processing chunk 35
🔄 Processing chunk 36
🔄 Processing chunk 37
🔄 Processing chunk 38
🔄 Processing chunk 39
🔄 Processing chunk 40
🔄 Processing chunk 41
🔄 Processing chunk 42
🔄 Processing chunk 43
🔄 Processing chunk 44
🔄 Processing chunk 45
🔄 Processing chunk 

In [28]:
# Load the cleaned file
final_path = 'C:/Users/kaur6/Downloads/Urban Analytics/cook_county_latest_by_pin.csv'
# Read 'pin' as a string at load time; converting after the read cannot restore lost leading zeros
df = pd.read_csv(final_path, dtype={'pin': str}, low_memory=False)

# Check for duplicate pin values
duplicate_pins = df[df.duplicated(subset='pin', keep=False)]

# Print summary
print(f"🔍 Total duplicate rows based on 'pin': {duplicate_pins.shape[0]}")

🔍 Total duplicate rows based on 'pin': 0
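
The second pass matters because a PIN's records can be split across chunks, so per-chunk deduplication alone can leave several rows per PIN. A small self-contained sketch of the same idea on toy rows follows. As a variant, it carries the running "latest" table in memory instead of writing an intermediate file. The rows are illustrative, not taken from the county file.

```python
import io
import pandas as pd

# Toy rows: the same PIN appears in more than one chunk.
raw = io.StringIO(
    "pin,tax_year\n"
    "01011000040000,2021\n"
    "01011000040000,2023\n"
    "31074070140000,2022\n"
    "01011000040000,2024\n"
    "31074070140000,2020\n"
)

latest = None
for chunk in pd.read_csv(raw, chunksize=2, dtype={"pin": str}):
    chunk["tax_year"] = pd.to_numeric(chunk["tax_year"], errors="coerce")
    chunk = chunk.dropna(subset=["tax_year"])
    # Merge the new chunk with the running result, then keep the latest year per PIN.
    combined = chunk if latest is None else pd.concat([latest, chunk])
    latest = combined.sort_values("tax_year").drop_duplicates("pin", keep="last")

latest = latest.sort_values("pin").reset_index(drop=True)
print(latest)
```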


### Renamed `cook_county_latest_by_pin.csv` to `cook_county_unique_pins.csv`

## Merge unique Cook County PINs with the Parcel_CC file

In [1]:
# Load both CSVs
df1 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/cook_county_unique_pins.csv')
df2 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/Parcel_CC_Unique.csv')

# Rename 'name' in df2 to 'pin' for consistency
df2.rename(columns={'name': 'pin'}, inplace=True)

# Perform left join on 'pin'
merged_df = pd.merge(df1, df2, on='pin', how='left')

# Save result
merged_df.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/cook_county_merged_parcelCC.csv', index=False)


## Keep only PINs with a non-null `inclusion_source` (i.e., PINs inside Connected Communities)

In [17]:
# Load the CSV and force 'pin' column to be read as string
df = pd.read_csv(
    'C:/Users/kaur6/Downloads/Urban Analytics/cook_county_merged_parcelCC.csv',
    dtype={'pin': str}, low_memory=False
)

# Filter: keep rows where inclusion_source is NOT NaN and NOT 'Unknown'
filtered_df = df[df['inclusion_source'].notna() & (df['inclusion_source'] != 'Unknown')]

# Save the filtered result — PINs preserved as string
filtered_df.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc.csv', index=False)

In [19]:
# Replace with your file path
cook_county_pins = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/cook_county_unique_pins.csv', low_memory=False)
# Print number of rows
print(f"Number of Unique pins in cook county dataset: {len(cook_county_pins)}")

# Replace with your file path
cc_pins = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc.csv', low_memory=False)
# Print number of rows
print(f"Number of pins in cook county (inside connected Communities): {len(cc_pins)}")

Number of Unique pins in cook county dataset: 1139209
Number of pins in cook county (inside connected Communities): 212733
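
One way to sanity-check counts like these is to let the merge itself report how many keys matched. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on placeholder data. The frame names and PINs are illustrative assumptions.

```python
import pandas as pd

# Placeholder PINs for illustration only.
county = pd.DataFrame({
    "pin": ["00000000000001", "00000000000002", "00000000000003"],
    "tax_year": [2024, 2024, 2023],
})
parcel_cc = pd.DataFrame({
    "pin": ["00000000000001", "00000000000003", "00000000000009"],
    "inclusion_source": ["cdot bus routes", "Unknown", "gtfs rail stop points"],
})

merged = county.merge(
    parcel_cc,
    on="pin",
    how="left",
    validate="one_to_one",  # raises MergeError if either side has duplicate PINs
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

With a left join, the row count of `merged` should equal the row count of the county table. A larger count would indicate duplicate keys on the right-hand side.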


## Identify PINs with no recorded improvement (likely vacant)

In [None]:
import pandas as pd

# Path to your large file
input_file = 'C:/Users/kaur6/Downloads/Urban Analytics/final_cook_county_data_with_same_dtype.csv'
chunk_size = 1000000
# Columns required for scoring (used inside is_improved), but we read all columns
scoring_columns = [
    'building_sqft',
    'year_built',
    'num_rooms',
    'num_bedrooms',
    'num_full_baths',
    'type_of_residence',
    'construction_quality'
]

vacant_rows = []
chunk_num = 0

def is_improved(row):
    signals = [
        not pd.isna(row['building_sqft']) and row['building_sqft'] >= 196,
        not pd.isna(row['year_built']) and row['year_built'] > 0,
        not pd.isna(row['num_rooms']) and row['num_rooms'] > 0,
        not pd.isna(row['num_bedrooms']) and row['num_bedrooms'] > 0,
        not pd.isna(row['num_full_baths']) and row['num_full_baths'] > 0,
        pd.notna(row['type_of_residence']),
        pd.notna(row['construction_quality'])
    ]
    return sum(signals) >= 3

for chunk in pd.read_csv(input_file, chunksize=chunk_size, low_memory=False):
    chunk_num += 1
    print(f"\n🔄 Processing chunk {chunk_num}...")

    # Filter using only scoring columns but keep all data
    vacant_chunk = chunk[~chunk[scoring_columns].apply(is_improved, axis=1)]

    vacant_rows.append(vacant_chunk)

    print(f"✅ Finished chunk {chunk_num} | Vacant this chunk: {len(vacant_chunk)}")

# Combine and save
vacant_df = pd.concat(vacant_rows, ignore_index=True)
vacant_df.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/parcels_no_improvement_full_columns.csv', index=False)

print(f"\n🎯 Final: {len(vacant_df)} parcels classified as not clearly improved (with all original columns).")
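
Row-wise `apply(..., axis=1)` calls a Python function once per row, which can be slow on million-row chunks. A vectorized sketch of the same scoring rule (at least 3 of 7 signals means "improved") is shown below on toy data. The `pd.to_numeric(..., errors="coerce")` step is an added assumption to guard against non-numeric values, and the helper name `improvement_score` is hypothetical.

```python
import pandas as pd

POSITIVE_COLS = ["year_built", "num_rooms", "num_bedrooms", "num_full_baths"]
PRESENCE_COLS = ["type_of_residence", "construction_quality"]

def improvement_score(df: pd.DataFrame) -> pd.Series:
    """Count how many of the seven improvement signals each row satisfies."""
    sqft = pd.to_numeric(df["building_sqft"], errors="coerce")
    score = (sqft >= 196).astype(int)  # NaN compares as False
    for col in POSITIVE_COLS:
        score = score + (pd.to_numeric(df[col], errors="coerce") > 0).astype(int)
    for col in PRESENCE_COLS:
        score = score + df[col].notna().astype(int)
    return score

# Toy rows for illustration only.
toy = pd.DataFrame({
    "building_sqft": [1200, 0, None],
    "year_built": [1950, 0, 1990],
    "num_rooms": [5, 0, 0],
    "num_bedrooms": [3, 0, 0],
    "num_full_baths": [1, 0, 0],
    "type_of_residence": ["1 Story", None, None],
    "construction_quality": ["Average", None, "Average"],
})
toy["score"] = improvement_score(toy)
toy["is_vacant"] = toy["score"] < 3  # fewer than 3 signals: not clearly improved
print(toy[["score", "is_vacant"]])
```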

In [21]:
# Load the CSV with all columns
df = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/parcels_no_improvement_full_columns.csv', low_memory=False)

# Ensure tax_year is numeric for proper comparison
df['tax_year'] = pd.to_numeric(df['tax_year'], errors='coerce')

# Drop rows with missing tax_year
df = df.dropna(subset=['tax_year'])

# Sort by pin and tax_year (latest first), then drop duplicates
df_sorted = df.sort_values(by=['pin', 'tax_year'], ascending=[True, False])
df_deduped = df_sorted.drop_duplicates(subset='pin', keep='first')

# Save the cleaned version
df_deduped.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/parcels_no_improvement_unique_pin_latest_year.csv', index=False)

print(f"✅ Done! Final rows with unique pins and latest tax_year: {len(df_deduped)}")

✅ Done! Final rows with unique pins and latest tax_year: 18725
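
Because the vacancy filter runs on every PIN-year row before deduplication, the step above keeps the latest *unimproved* record per PIN. A PIN that was unimproved in an early year but improved later would still be included. The sketch below contrasts the two orderings on toy data. The `improved` column is a hypothetical stand-in for the result of `is_improved`.

```python
import pandas as pd

# Toy records: P1 was unimproved in 2003 but improved by 2024; P2 is still unimproved.
records = pd.DataFrame({
    "pin": ["P1", "P1", "P2", "P2"],
    "tax_year": [2003, 2024, 2010, 2024],
    "improved": [False, True, False, False],
})

# Order used above: filter unimproved rows first, then keep the latest row per PIN.
filter_then_dedupe = (
    records[~records["improved"]]
    .sort_values("tax_year")
    .drop_duplicates("pin", keep="last")
)

# Alternative: keep the latest row per PIN first, then filter on that row only.
latest = records.sort_values("tax_year").drop_duplicates("pin", keep="last")
dedupe_then_filter = latest[~latest["improved"]]

print(sorted(filter_then_dedupe["pin"]))
print(sorted(dedupe_then_filter["pin"]))
```

Which ordering is appropriate depends on whether historical vacancy or current vacancy is the target of the analysis.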


## City-owned vacant land

In [24]:
# Load your dataset
df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/Cleaned_City_Owned_Land.csv")

# Remove hyphens from the 'PIN' column (literal match, no regex needed)
df['PIN'] = df['PIN'].str.replace('-', '', regex=False)

# Save the cleaned dataset
df.to_csv("C:/Users/kaur6/Downloads/Urban Analytics/Cleaned_City_Owned_Land_no_hyphen.csv", index=False)

print("Hyphens removed and dataset saved!")

Hyphens removed and dataset saved!


In [35]:
not_imp_pins = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/parcels_no_improvement_unique_pin_latest_year.csv', dtype={'pin': str}, low_memory=False)
# Print number of rows
print(f"Number of pins not improved: {len(not_imp_pins)}")

city_pins = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/Cleaned_City_Owned_Land_no_hyphen.csv', dtype={'PIN': str}, low_memory=False)
# Print number of rows
print(f"Number of pins in city owned land: {len(city_pins)}")

Number of pins not improved: 18725
Number of pins in city owned land: 10916


In [36]:
city_pins.rename(columns={'PIN': 'pin'}, inplace=True)

# Concatenate both DataFrames (only the 'pin' column from each)
all_pins = pd.concat([not_imp_pins[['pin']], city_pins[['pin']]], ignore_index=True)

# Drop duplicates
unique_pins = all_pins.drop_duplicates()

# Save to CSV
unique_pins.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/all_pins_vacant.csv', index=False)

# Optional: print the number of unique pins
print(f"Total unique pins: {len(unique_pins)}")

Total unique pins: 29511


In [45]:
# Load both files
df1 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc.csv', dtype={'pin':str}, low_memory=False)
df2 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/all_pins_vacant.csv', dtype={'pin':str}, low_memory=False)

# Make sure the column names are consistent
# Assuming both files have a column named 'pin'
common_pins = set(df1['pin']).intersection(set(df2['pin']))

# Print the count of matching PINs
print(f"Number of matching PINs: {len(common_pins)}")

Number of matching PINs: 7698


In [46]:
# Filter rows from df1 where pin exists in df2
filtered_df = df1[df1['pin'].isin(df2['pin'])]

# Save the filtered result
filtered_df.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc_that_are_vacant.csv', index=False)

# Optional: print how many matched rows are saved
print(f"Number of rows saved: {len(filtered_df)}")

Number of rows saved: 7698


In [47]:
# Show count of null values in each column
null_counts = filtered_df.isnull().sum()

# Print the result
print(null_counts)

pin                            0
tax_year                       0
card_num                       0
class                          0
township_code                  0
proration_key_pin           6955
pin_proration_rate             0
card_proration_rate            0
cdu                            0
pin_is_multicard               0
pin_num_cards                  0
pin_is_multiland               0
pin_num_landlines              0
year_built                     0
building_sqft                  0
land_sqft                      0
num_bedrooms                   0
num_rooms                      0
num_full_baths                 0
num_half_baths                 0
num_fireplaces                 0
type_of_residence            378
construction_quality           0
num_apartments                 0
attic_finish                   0
garage_attached              448
garage_area_included         449
garage_size                    0
garage_ext_wall_material       0
attic_type                     0
basement_t

In [6]:
# --- Load pins ---
pins_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc_that_are_vacant.csv", dtype={'pin': str})
pins_gdf = gpd.GeoDataFrame(
    pins_df,
    geometry=gpd.points_from_xy(pins_df['longitude'], pins_df['latitude']),
    crs="EPSG:4326"
)

# --- Load Cook County block groups ---
block_groups_gdf = gpd.read_file("C:/Users/kaur6/Downloads/Urban Analytics/cook_county_bg/cook_county_block_groups.shp")
block_groups_gdf = block_groups_gdf.to_crs("EPSG:4326")

# --- Join pins with block groups ---
joined = gpd.sjoin(pins_gdf, block_groups_gdf, how='inner', predicate='within')
pin_counts = joined.groupby('GEOID').size().reset_index(name='pin_count')
block_groups_gdf = block_groups_gdf.merge(pin_counts, on='GEOID', how='left')
block_groups_gdf['pin_count'] = block_groups_gdf['pin_count'].fillna(0).astype(int)

# --- Create base map ---
m = folium.Map(location=[pins_df['latitude'].mean(), pins_df['longitude'].mean()], zoom_start=10, tiles='cartodbpositron')

# --- Choropleth for GEOID blocks ---
folium.Choropleth(
    geo_data=block_groups_gdf,
    data=block_groups_gdf,
    columns=["GEOID", "pin_count"],
    key_on="feature.properties.GEOID",
    fill_color="OrRd",
    fill_opacity=0.6,
    line_opacity=0.2,
    legend_name="Number of Pins"
).add_to(m)

# --- Add GEOID tooltips (one GeoJson layer instead of one layer per block group) ---
folium.GeoJson(
    block_groups_gdf[['GEOID', 'pin_count', 'geometry']],
    style_function=lambda x: {
        'color': 'black',
        'weight': 0.4,
        'fillOpacity': 0
    },
    tooltip=GeoJsonTooltip(
        fields=['GEOID', 'pin_count'],
        aliases=['GEOID:', 'Pin Count:']
    )
).add_to(m)

# --- Add PIN markers (Blue) ---
for _, row in joined.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=2,
        color='blue',
        fill=True,
        fill_opacity=0.7,
        tooltip=f"PIN: {row['pin']}<br>GEOID: {row['GEOID']}"
    ).add_to(m)


# --- Load bus routes CSV ---
bus_routes_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/CTA_-_Bus_Routes_20250421.csv")

# --- Convert MULTILINESTRING to geometry ---
bus_routes_df['geometry'] = bus_routes_df['the_geom'].apply(wkt.loads)

# --- Convert to GeoDataFrame ---
bus_routes_gdf = gpd.GeoDataFrame(bus_routes_df, geometry='geometry', crs="EPSG:4326")

# --- Plot bus routes on map ---
for _, row in bus_routes_gdf.iterrows():
    folium.GeoJson(
        row['geometry'],
        name=f"Bus Route: {row['NAME']}",
        style_function=lambda x: {
            'color': 'red',
            'weight': 2,
            'opacity': 0.6
        },
        tooltip=row['NAME']
    ).add_to(m)

# --- Load CTA L Stops ---
l_stops = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/CTA_-_System_Information_-_List_of__L__Stops_20250421.csv")
for _, row in l_stops.iterrows():
    try:
        loc = ast.literal_eval(row['Location'])
        folium.CircleMarker(
            location=[loc[0], loc[1]],
            radius=3,
            color='darkgreen',
            fill=True,
            fill_color='darkgreen',
            fill_opacity=0.6,
            tooltip=row['STOP_NAME']
        ).add_to(m)
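    except (ValueError, SyntaxError, TypeError, IndexError):
        # Skip stops whose 'Location' value is missing or not a "(lat, lon)" literal
        continue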


# --- Load Chicago boundary CSV ---
chicago_boundary_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/chicago_boundary.csv")

# --- Convert geometry from WKT in 'the_geom' column (wkt is already imported at the top) ---
chicago_boundary_df['geometry'] = chicago_boundary_df['the_geom'].apply(wkt.loads)

# --- Create GeoDataFrame ---
chicago_boundary_gdf = gpd.GeoDataFrame(chicago_boundary_df, geometry='geometry', crs="EPSG:4326")

# --- Add boundary to map ---
folium.GeoJson(
    chicago_boundary_gdf,
    name="Chicago Boundary",
    style_function=lambda x: {
        'color': 'black',
        'weight': 2,
        'fillOpacity': 0
    },
    tooltip="Chicago City Boundary"
).add_to(m)


# --- Output path for the saved map ---
output_path = "C:/Users/kaur6/Downloads/Urban Analytics/vacant_properties_with_transit_map.html"

# --- Custom legend ---
legend_html = """
<div style="
    position: fixed;
    bottom: 50px;
    left: 50px;
    width: 240px;
    z-index: 9999;
    font-size: 14px;
    background-color: white;
    border: 2px solid gray;
    border-radius: 8px;
    padding: 10px;
    box-shadow: 2px 2px 6px rgba(0,0,0,0.3);">
    <strong>Map Legend</strong><br>
    <svg width="20" height="10"><line x1="0" y1="5" x2="20" y2="5" style="stroke:red;stroke-width:2" /></svg> Bus Routes<br>
    <svg width="20" height="10"><line x1="0" y1="5" x2="20" y2="5" style="stroke:black;stroke-width:2" /></svg> Chicago Boundary<br>
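    <i style="background: darkgreen; width: 10px; height: 10px; float: left; margin-right: 8px; border-radius: 50%;"></i> CTA L Train Stations<br>
    <i style="background: blue; width: 10px; height: 10px; float: left; margin-right: 8px; border-radius: 50%;"></i> Vacant CC Parcels (PINs)<br>
    <i style="background: #ffeda0; width: 10px; height: 10px; float: left; margin-right: 8px;"></i> Block Groups (Cook County)<br>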
    <div style="clear: both;"></div>
</div>
"""
m.get_root().html.add_child(folium.Element(legend_html))

m.save(output_path)
print(f"✅ Map with transit saved at: {output_path}")

✅ Map with transit saved at: C:/Users/kaur6/Downloads/Urban Analytics/vacant_properties_with_transit_map.html


In [4]:
# Load the files
df1 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/Parcel_CC_Unique.csv', dtype={'name': str}, low_memory=False)
df2 = pd.read_csv('C:/Users/kaur6/Downloads/Urban Analytics/all_pins_vacant.csv', dtype={'pin': str}, low_memory=False)

# Merge to get 'longitude' and 'latitude' from df1 by matching pin -> name
merged_df = df2.merge(df1[['name', 'longitude', 'latitude']], left_on='pin', right_on='name', how='left')

# Keep only the 'pin', 'longitude', and 'latitude' columns in the result
result_df = merged_df[['pin', 'longitude', 'latitude']]

# Save to CSV if needed
result_df.to_csv('C:/Users/kaur6/Downloads/Urban Analytics/all_pins_vacant_with_lat_long.csv', index=False)

# View a sample
print(result_df.head())

             pin  longitude  latitude
0  1011000040000        NaN       NaN
1  1011000060000        NaN       NaN
2  1011000090000        NaN       NaN
3  1011000370000        NaN       NaN
4  1011000710000        NaN       NaN
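
The `NaN` coordinates in this preview coincide with 13-digit PINs (e.g. `1011000040000`), while `name` in `Parcel_CC_Unique.csv` carries all 14 digits (e.g. `01271000040000`). This suggests that leading zeros were lost when the PIN column was parsed as an integer in an earlier step, so the string keys no longer match. A hedged sketch of normalizing both keys before merging follows. It assumes every Cook County PIN is 14 digits. The helper name `normalize_pin` is hypothetical and the coordinates are made up.

```python
import pandas as pd

def normalize_pin(col: pd.Series) -> pd.Series:
    """Strip hyphens/whitespace and left-pad PINs with zeros to 14 digits."""
    return (
        col.astype(str)
           .str.replace("-", "", regex=False)
           .str.strip()
           .str.zfill(14)
    )

# Illustrative rows: the first PIN has lost its leading zero.
vacant = pd.DataFrame({"pin": ["1011000040000", "16052210320000"]})
parcels = pd.DataFrame({
    "name": ["01011000040000", "16052210320000"],
    "longitude": [-88.10, -87.77],  # made-up coordinates
    "latitude": [42.10, 41.90],
})

vacant["pin"] = normalize_pin(vacant["pin"])
parcels["name"] = normalize_pin(parcels["name"])
merged = vacant.merge(parcels, left_on="pin", right_on="name", how="left")
print(merged[["pin", "longitude", "latitude"]])
```

Reading PIN columns with `dtype={'pin': str}` in every `pd.read_csv` call avoids the loss in the first place.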


In [5]:
# Load both CSV files
all_pins_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/all_pins_vacant_with_lat_long.csv", dtype={'pin': str}, low_memory=False)
cc_pins_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/pins_inside_cc.csv", dtype={'pin': str}, low_memory=False)

# Create a set of pins that are inside connected communities
cc_pins_set = set(cc_pins_df['pin'])

# Add the column 'is_inside_cc' ('yes'/'no') based on whether the pin is in the set
all_pins_df['is_inside_cc'] = all_pins_df['pin'].isin(cc_pins_set).map({True: 'yes', False: 'no'})

# Optional: Save to a new file
all_pins_df.to_csv("C:/Users/kaur6/Downloads/Urban Analytics/all_vacant_pins_with_cc_flag.csv", index=False)

In [7]:
# Load the file (make sure this is the file where 'is_inside_cc' has been added)
df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/all_vacant_pins_with_cc_flag.csv", dtype={'pin': str})

# Count rows where 'is_inside_cc' is 'yes'
count_yes = (df['is_inside_cc'] == 'yes').sum()

print(f"Number of rows where is_inside_cc is 'yes': {count_yes}")

Number of rows where is_inside_cc is 'yes': 7698


In [8]:
# Step 1: Load the pins CSV (read 'pin' as a string so the identifier is not parsed as a number)
pins_df = pd.read_csv("C:/Users/kaur6/Downloads/Urban Analytics/all_vacant_pins_with_cc_flag.csv", dtype={'pin': str})
pins_gdf = gpd.GeoDataFrame(
    pins_df,
    geometry=gpd.points_from_xy(pins_df.longitude, pins_df.latitude),
    crs="EPSG:4326"  # WGS 84
)

# Step 2: Load the Cook County block groups shapefile
block_groups_gdf = gpd.read_file("C:/Users/kaur6/Downloads/Urban Analytics/cook_county_bg/cook_county_block_groups.shp")

# Ensure both layers are in the same coordinate system
block_groups_gdf = block_groups_gdf.to_crs("EPSG:4326")

# Step 3: Spatial join - find which block group each pin falls into
joined = gpd.sjoin(pins_gdf, block_groups_gdf, how="left", predicate="within")

# Step 4: Save to new CSV with block group GEOID
joined[['pin', 'longitude', 'latitude', 'is_inside_cc', 'GEOID']].to_csv("C:/Users/kaur6/Downloads/Urban Analytics/vacant_pins_with_block_group_cc_Flag.csv", index=False)
