# __Data Cleaning__
Zhalae Daneshvari, Peyton Smith, Sunny Liu

## Data Description
### What are the observations (rows) and the attributes (columns)?
- **Observations:** Each row represents a unique combination of a census tract in New York State and a nearby retail food store.
- **Attributes:** 
  - NAME: Census tract identifier
  - median_income: Median household income of the census tract
  - white_pop, black_pop, asian_pop, hispanic_pop: Population counts for each racial/ethnic group
  - tract: Census tract number
  - total_pop: Sum of the four racial/ethnic group counts above, used as the denominator for the percentage columns
  - white_percent, black_percent, asian_percent, hispanic_percent: Percentage of each racial/ethnic group in the tract
  - nearest_store_index: Row index of the nearest retail food store in the cleaned stores table
  - nearest_store_name: Name of the nearest retail food store
  - distance_to_nearest_store: Distance in meters from the census tract centroid to the nearest store

### Why was this dataset created?
This dataset was created to facilitate the analysis of the relationship between socioeconomic factors, racial composition, and access to grocery stores in New York State. It combines census data with retail food store locations to enable the exploration of potential disparities in food accessibility across different demographic groups. The retail food stores dataset was created to provide insights into grocery stores licensed by New York State. Its purposes include promoting transparency, facilitating economic analysis, informing policy-making, and aiding in business planning. The U.S. Census data was created to collect detailed demographics and additional information on the composition of small areas (tracts) within the U.S. It serves various purposes including resource allocation, policy-making, and academic research.

### Who funded the creation of the dataset?
The dataset is a combination of data from the U.S. Census Bureau (government-funded) and the New York State Department of Agriculture and Markets. As a result, it's logical that the creation and collection of this data was funded by the government for the purpose of research and policy. The merging and preprocessing of these datasets was done for the purpose of research for our INFO 2950 Group Project.

### What processes might have influenced what data was observed and recorded and what was not?
1. Census data collection methods and potential underreporting in certain communities
2. Licensing and registration processes for retail food stores, which may not capture all food sources
3. Definition of what constitutes a retail food store, potentially excluding some food sources
4. Geospatial calculation methods used to determine nearest stores and distances
5. The use of census tract centroids as proxies for residential locations, which may not accurately represent all residents' locations within a tract

### What preprocessing was done, and how did the data come to be in the form that you are using?
1. Merging of census tract data with retail food store location data
2. Calculation of racial/ethnic percentages from population counts
3. Geospatial analysis to determine the nearest store and its distance for each census tract
4. Data cleaning to handle missing values or outliers

### If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
People involved in the census data collection were aware and expected it to be used for various governmental and public purposes. Store owners providing information for licensing were aware of data collection for regulatory purposes. However, individuals were likely unaware of the specific use of this data for analyzing food accessibility disparities based on race and income.

### Where can your raw source data be found?
The raw census data can be accessed through the U.S. Census Bureau's API. https://www.census.gov/data/developers/guidance/api-user-guide.html

The raw retail food store data can be found on the New York State Open Data portal: https://data.ny.gov/Economic-Development/Retail-Food-Stores/9a8c-vfzj/about_data

The merged and preprocessed dataset (merged_census_store_data.csv) used for this analysis is stored within the git repository.


### Importing
First run 'pip install numpy seaborn pandas matplotlib requests geopandas shapely duckdb scipy scikit-learn' in a terminal.

In [2]:
! pip install geopandas
! pip install duckdb


In [3]:
#importing packages
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import requests
import geopandas as gpd
from shapely.geometry import shape
import os
import duckdb as db
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from shapely.geometry import Point



## Data Cleaning



### Data Cleaning: Retail Food Stores in New York State
Source List of Grocery Stores Provided by New York State:

https://data.ny.gov/Economic-Development/Retail-Food-Stores/9a8c-vfzj/about_data

This CSV file lists every licensed retail food store location in New York State. We pair it with census tract data to examine how the distance to the nearest grocery store varies across socioeconomic groups. We use the center of each census tract as an approximation of where its residents live, and use the census data for population statistics.

#### Step 1: Loading in the Data

Our primary grocery store dataset comes from the New York State Open Data portal, specifically the "Retail Food Stores" dataset. This dataset provides information about licensed retail food stores across New York State. This step loads our raw data into a pandas DataFrame. We print the initial shape so that we can later confirm that our cleaning steps are actually changing the DataFrame.

In [4]:
# Loading in the dataset
stores = pd.read_csv('https://data.ny.gov/api/views/9a8c-vfzj/rows.csv?accessType=DOWNLOAD')

# Displays the shape of the original raw data for the retail stores
print("Initial dataset shape:", stores.shape)

Initial dataset shape: (24221, 15)


#### Step 2: Selecting Relevant Columns
For our analysis of grocery store access, we only need each store's name and location, which we use to compute the distance from the store to the center of each census tract. We take these from the 'Entity Name' and 'Georeference' columns.

In [5]:
# Selecting relevant columns
stores = stores[['Entity Name', 'Georeference']]

# Renaming the columns for clarity
stores = stores.rename(columns={'Entity Name': 'name', "Georeference": "georeference"})

print(stores.head())

                   name                        georeference
0         ANK PETRO INC   POINT (-73.816249346 42.46925125)
1           EVANS JULIA  POINT (-73.949128787 42.577416307)
2            HUBRIX LLC  POINT (-77.795356304 42.158830586)
3       REID STORES INC  POINT (-78.277036649 42.220633641)
4  361 DELI GROCERY LLC  POINT (-73.877219853 40.871759695)


#### Step 3: Handling Missing Data
To ensure the accuracy of our spatial analysis in the next steps, we need to clean the data table to remove any stores with missing location data by dropping the NaN values. 

In [6]:
# Remove rows with missing Georeference
stores = stores.dropna(subset=['georeference'])

print("Dataset shape after removing missing georeference values:" + str(stores.shape))

Dataset shape after removing missing georeference values:(24221, 2)


#### Step 4: Extracting Latitude and Longitude from Georeference
To put the location data in a more usable format for calculating distances to the census tracts, we extract the latitude and longitude from the Georeference column into their own columns and then drop the original column.

In [7]:
# Extract latitude and longitude from Georeference
def extract_coordinates(georeference):
    coords = georeference.strip('POINT ()').split()  # strip() trims the characters in 'POINT ()' from both ends, leaving "lon lat"
    return pd.Series({'longitude': float(coords[0]), 'latitude': float(coords[1])})  # convert to float for ease of calculations

stores[['longitude', 'latitude']] = stores['georeference'].apply(extract_coordinates)

# Dropping the original Georeference column
stores = stores.drop('georeference', axis=1)

print(stores.head())

                   name  longitude   latitude
0         ANK PETRO INC -73.816249  42.469251
1           EVANS JULIA -73.949129  42.577416
2            HUBRIX LLC -77.795356  42.158831
3       REID STORES INC -78.277037  42.220634
4  361 DELI GROCERY LLC -73.877220  40.871760
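
The `strip('POINT ()')` call in the cell above works because `str.strip` removes any of the listed characters from both ends of the string; it does not match the literal prefix. As a more explicit alternative, the sketch below parses the same `POINT (lon lat)` text with a regular expression. The function name `parse_point` is hypothetical and is not part of the original notebook.

```python
import re

# Matches WKT text such as "POINT (-73.816249346 42.46925125)".
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_point(georeference):
    """Return (longitude, latitude) as floats, or None if the text is not a WKT point."""
    match = _POINT_RE.match(str(georeference))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

lon, lat = parse_point("POINT (-73.816249346 42.46925125)")
print(lon, lat)
```

Returning `None` for text that does not match makes malformed rows easy to find before converting to floats.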


#### Step 5: Cleaned Retail Grocery Store Data Check
Our final dataset contains the name, latitude, and longitude of each retail food store in New York State. This data set will be used to calculate distances between stores and census tract centroids.

When combined with census data, this will enable us to investigate how store locations correlate with various socioeconomic factors, addressing our research questions about food access across different demographic groups in New York State.

The next steps are to merge this cleaned store data with census tract data to begin our analysis of the relationship between store locations and socioeconomic factors.

In [8]:
# Info about the cleaned grocery stores dataset
print("Cleaned dataset shape:", stores.shape)
print("Cleaned column names: " + str(stores.columns.tolist()))

# First few rows of cleaned grocery dataset
print("First few rows of the cleaned dataset:")
print(stores.head().to_string())

# Saving the cleaned dataset to a csv file to use later 
stores.to_csv('cleaned_retail_stores.csv', index=False)

Cleaned dataset shape: (24221, 3)
Cleaned column names: ['name', 'longitude', 'latitude']
First few rows of the cleaned dataset:
                   name  longitude   latitude
0         ANK PETRO INC -73.816249  42.469251
1           EVANS JULIA -73.949129  42.577416
2            HUBRIX LLC -77.795356  42.158831
3       REID STORES INC -78.277037  42.220634
4  361 DELI GROCERY LLC -73.877220  40.871760


### Data Cleaning: Census Data for New York State

#### Step 1: Loading in Census Data through Census API

We'll use the Census API to collect demographic data for New York State census tracts, focusing on median household income and racial composition data. This step retrieves census data for all tracts in New York State, including median household income and population counts by race.

In [9]:
# Census API endpoint and parameters
api_key = "273983b851bd05f546e743ac334b18277e8c67d1"  
base_url = "https://api.census.gov/data/2021/acs/acs5"
variables = ["NAME", "B19013_001E", "B02001_002E", "B02001_003E", "B02001_005E", "B03003_003E"] # median household income, then population counts in this order: white, black, asian, hispanic
state = "36" #code for NY State

# API URL
url = f"{base_url}?get={','.join(variables)}&for=tract:*&in=state:{state}&key={api_key}"

# Making the API request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()

    # Creating the DataFrame
    census_df = pd.DataFrame(data[1:], columns=data[0])
    print("Successfully created new dataframe.")
else:
    print(f"Error: Unable to get data. Status code: {response.status_code}")
    print(response.text)
    exit()

Successfully created new dataframe.
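
The response is a JSON list of lists: the first inner list holds the column names and each remaining list is one tract, which is why the cell above builds the DataFrame from `data[1:]` with `columns=data[0]`. The sketch below shows that shape without making a network request. The helper names `build_census_url` and `rows_to_records` and the `CENSUS_API_KEY` environment variable are hypothetical, and the sample payload is illustrative (its values are copied from the first row printed in the next step).

```python
import os

BASE_URL = "https://api.census.gov/data/2021/acs/acs5"
VARIABLES = ["NAME", "B19013_001E", "B02001_002E", "B02001_003E", "B02001_005E", "B03003_003E"]

def build_census_url(api_key, state="36"):
    """Build the tract-level query URL used in the cell above."""
    return f"{BASE_URL}?get={','.join(VARIABLES)}&for=tract:*&in=state:{state}&key={api_key}"

def rows_to_records(data):
    """Turn the list-of-lists payload into a list of dicts keyed by the header row."""
    header, rows = data[0], data[1:]
    return [dict(zip(header, row)) for row in rows]

# Hypothetical: read the key from an environment variable instead of hard-coding it.
api_key = os.environ.get("CENSUS_API_KEY", "YOUR_KEY")
url = build_census_url(api_key)

# Illustrative payload with a single tract; every value arrives as a string.
sample = [
    VARIABLES + ["state", "county", "tract"],
    ["Census Tract 1, Albany County, New York", "44871", "530", "1086", "87", "311",
     "36", "001", "000100"],
]
records = rows_to_records(sample)
print(records[0]["B19013_001E"])
```

Because every value arrives as a string, the numeric columns have to be converted before any arithmetic.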


#### Step 2: Renaming and Selecting Columns
We rename the columns to human-readable names and select only the ones we need for our analysis. This step focuses our dataset on the demographic information relevant to our research question about food access across socioeconomic groups: median income and the white, black, asian, and hispanic population counts.

In [10]:
# Renaming the columns
column_names = {
    "NAME": "name",
    "B19013_001E": "median_income",
    "B02001_002E": "white_pop",
    "B02001_003E": "black_pop",
    "B02001_005E": "asian_pop",
    "B03003_003E": "hispanic_pop"
}

census_df = census_df.rename(columns=column_names)

# Selecting the relevant columns
census_df = census_df[["name", "median_income", "white_pop", "black_pop", "asian_pop", "hispanic_pop", "tract", "state", "county"]]

#adding GEOID values for matching 
census_df["GEOID"] = census_df["state"] + census_df["county"] + census_df["tract"]
print(census_df.head())

                                         name median_income white_pop  \
0     Census Tract 1, Albany County, New York         44871       530   
1  Census Tract 2.01, Albany County, New York         42456       521   
2  Census Tract 2.02, Albany County, New York         24792       124   
3  Census Tract 3.01, Albany County, New York         40666      1155   
4  Census Tract 3.02, Albany County, New York         42370      2180   

  black_pop asian_pop hispanic_pop   tract state county        GEOID  
0      1086        87          311  000100    36    001  36001000100  
1      1945       109          165  000201    36    001  36001000201  
2      2316         0           71  000202    36    001  36001000202  
3      1251       221          836  000301    36    001  36001000301  
4       457       387          155  000302    36    001  36001000302  
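
A tract GEOID is an 11-character code: a 2-digit state code, a 3-digit county code, and a 6-digit tract code. Plain string concatenation works in the cell above because the API returns each code as a zero-padded string. If the codes are ever read back as integers, the padding is lost. The sketch below restores it with `str.zfill`; the helper name `build_geoid` is hypothetical.

```python
def build_geoid(state, county, tract):
    """Concatenate FIPS codes into an 11-character tract GEOID, restoring zero padding."""
    return str(state).zfill(2) + str(county).zfill(3) + str(tract).zfill(6)

# Zero-padded strings, as returned by the API.
print(build_geoid("36", "001", "000100"))

# Integer codes, as they might look after a CSV round trip.
print(build_geoid(36, 1, 100))
```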


#### Step 3: Data Type Conversion
We need to convert the numeric columns to an appropriate data type for analysis. We use floats because a later step calculates each racial group's percentage of the population.

In [11]:
# Converting the numeric columns to float values
numeric_columns = ["median_income", "white_pop", "black_pop", "asian_pop", "hispanic_pop"]
census_df[numeric_columns] = census_df[numeric_columns].astype(float)

#Sanity check
print("Data types after conversion:")
print(census_df.dtypes)

Data types after conversion:
name              object
median_income    float64
white_pop        float64
black_pop        float64
asian_pop        float64
hispanic_pop     float64
tract             object
state             object
county            object
GEOID             object
dtype: object


#### Step 3: Handling Missing Values
We check for and handle missing values in our dataset. Missing values would cause problems when plotting the data later on, and we do not want to use rows that lack information crucial to our analysis.

In [12]:
# Replace -666666666 (Census placeholder for missing data) with NaN
census_df = census_df.replace(-666666666, np.nan)

# Drop the rows with the missing median income
census_df = census_df.dropna(subset=["median_income"])

# Check for missing values
print("Missing values:")
print(census_df.isnull().sum())

# Dataset shape after handling missing values 
print(census_df.shape)

Missing values:
name             0
median_income    0
white_pop        0
black_pop        0
asian_pop        0
hispanic_pop     0
tract            0
state            0
county           0
GEOID            0
dtype: int64
(5198, 10)
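
Replacing the sentinel only works once the column is numeric, since the API delivers every value as a string. The sketch below runs conversion, sentinel replacement, and row removal on a small toy table. It uses `pd.to_numeric(..., errors="coerce")` rather than the notebook's `astype(float)`, so unparseable text becomes NaN instead of raising an error. The toy values, including the missing income, are made up for illustration.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -666666666  # Census placeholder for a missing estimate

# Toy data (made up for illustration); values start out as strings.
toy = pd.DataFrame({
    "GEOID": ["36001000100", "36001000201", "36001000202"],
    "median_income": ["44871", "-666666666", "24792"],
})

# Convert to numbers; errors="coerce" turns unparseable text into NaN.
toy["median_income"] = pd.to_numeric(toy["median_income"], errors="coerce")

# Replace the sentinel with NaN, then drop tracts without a median income.
toy["median_income"] = toy["median_income"].replace(MISSING_SENTINEL, np.nan)
cleaned = toy.dropna(subset=["median_income"])

print(cleaned)
```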


#### Step 4: Creating Percentage variables per racial group
We create percentage variables for the racial composition of each census tract to help analyze whether the distance to the closest grocery store varies with race across New York State census tracts.

In [13]:
# Calculate total population as the sum of the four group counts
census_df["total_pop"] = census_df[["white_pop", "black_pop", "asian_pop", "hispanic_pop"]].sum(axis=1)

# Calculate percentage for each race/ethnicity
for group in ["white", "black", "asian", "hispanic"]:
    census_df[f"{group}_percent"] = census_df[f"{group}_pop"] / census_df["total_pop"] * 100

print(census_df.head())

# Saving the cleaned census data
census_df.to_csv('cleaned_census_data.csv', index=False)

                                         name  median_income  white_pop  \
0     Census Tract 1, Albany County, New York        44871.0      530.0   
1  Census Tract 2.01, Albany County, New York        42456.0      521.0   
2  Census Tract 2.02, Albany County, New York        24792.0      124.0   
3  Census Tract 3.01, Albany County, New York        40666.0     1155.0   
4  Census Tract 3.02, Albany County, New York        42370.0     2180.0   

   black_pop  asian_pop  hispanic_pop   tract state county        GEOID  \
0     1086.0       87.0         311.0  000100    36    001  36001000100   
1     1945.0      109.0         165.0  000201    36    001  36001000201   
2     2316.0        0.0          71.0  000202    36    001  36001000202   
3     1251.0      221.0         836.0  000301    36    001  36001000301   
4      457.0      387.0         155.0  000302    36    001  36001000302   

   total_pop  white_percent  black_percent  asian_percent  hispanic_percent  
0     2014.0      26
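
As a worked check of the percentage formula, the sketch below repeats the calculation by hand for Census Tract 1 in Albany County, using the counts printed above. Note that `total_pop` here is the sum of the four group counts rather than the tract's full population. People who identify as Hispanic are also counted under a race category in the census tables, so the four counts can overlap.

```python
# Group counts for Census Tract 1, Albany County (row 0 above).
counts = {"white": 530.0, "black": 1086.0, "asian": 87.0, "hispanic": 311.0}

# Denominator used by the notebook: the sum of the four group counts.
total_pop = sum(counts.values())

# Same formula as the loop above: count / total * 100.
percents = {group: count / total_pop * 100 for group, count in counts.items()}

print(total_pop)
for group, pct in percents.items():
    print(group, round(pct, 6))
```

The four percentages sum to 100 by construction, so they describe shares of the combined group counts, not shares of all residents.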

#### Step 5: Finding Center of Census Tracts 
Now that we have a cleaned dataset, we need the center of each census tract, which serves as a proxy for where its residents live and, in turn, for how far they are from a grocery store.
We looked up how to find the center of each census tract and ended up using the GeoPandas `.centroid` attribute (sources linked at bottom of notebook). We first reproject the tract polygons to EPSG:5070, a projected coordinate system with units in meters, so that the centroids are computed on a flat plane rather than on raw latitude/longitude values. We then create two new columns, one for the center point and one for the tract number, and keep the centroid together with GEOID, which is the key used to merge with the census information.

In [14]:
# Load in shapefile for New York Census Tracts
shapefile_path = "tl_2021_36_tract.shp"  # must be in same folder as ipynb file
tracts_gdf = gpd.read_file(shapefile_path)

# Reproject to EPSG:5070, a projected CRS with units in meters
tracts_gdf = tracts_gdf.to_crs(epsg=5070)

# Calculates centroids for each census tract
tracts_gdf['centroid'] = tracts_gdf.geometry.centroid

# The tract number is the last 6 characters of GEOID
tracts_gdf['tract'] = tracts_gdf['GEOID'].str[-6:]

# Keep only the centroid and GEOID columns
centroid_df = tracts_gdf[['centroid', "GEOID"]]
print(centroid_df.head())

                          centroid        GEOID
0  POINT (1835948.861 2171541.933)  36047069601
1  POINT (1836565.723 2171170.171)  36047069602
2   POINT (1831518.497 2174267.51)  36047079801
3  POINT (1831354.054 2173996.174)  36047079802
4  POINT (1838171.439 2174599.774)  36047105801
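
For a simple polygon without holes, the centroid is the area-weighted center of the shape, which can be computed from the vertices with the shoelace formula. The sketch below is a pure-Python illustration of that idea and is not the code GeoPandas runs internally. It assumes planar coordinates (such as EPSG:5070) and a ring that does not cross itself; the function name `polygon_centroid` is hypothetical.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...] vertices."""
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3.0 * area2), cy / (3.0 * area2)

rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(rectangle))
```

For an irregular tract, the centroid can fall far from where most residents live, or even outside the polygon, which is one reason it is only a proxy for residential location.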


#### Step 6: Merging DataFrames 
The US census data needs to be merged with the data on the center of each census tract. The two DataFrames are merged on their GEOID values.

In [None]:
# Ensure 'GEOID' is of string type in both DataFrames
census_df['GEOID'] = census_df['GEOID'].astype(str)
centroid_df['GEOID'] = centroid_df['GEOID'].astype(str)

# Merge the DataFrames on 'GEOID' using an inner join
merged_df = pd.merge(census_df, centroid_df, on='GEOID', how='inner')

# Print the first few rows of the merged DataFrame
print("Merged DataFrame:")
print(merged_df.head())


Merged DataFrame:
                                         name  median_income  white_pop  \
0     Census Tract 1, Albany County, New York        44871.0      530.0   
1  Census Tract 2.01, Albany County, New York        42456.0      521.0   
2  Census Tract 2.02, Albany County, New York        24792.0      124.0   
3  Census Tract 3.01, Albany County, New York        40666.0     1155.0   
4  Census Tract 3.02, Albany County, New York        42370.0     2180.0   

   black_pop  asian_pop  hispanic_pop   tract state county        GEOID  \
0     1086.0       87.0         311.0  000100    36    001  36001000100   
1     1945.0      109.0         165.0  000201    36    001  36001000201   
2     2316.0        0.0          71.0  000202    36    001  36001000202   
3     1251.0      221.0         836.0  000301    36    001  36001000301   
4      457.0      387.0         155.0  000302    36    001  36001000302   

   total_pop  white_percent  black_percent  asian_percent  hispanic_percent  \
0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  super().__setitem__(key, value)


#### Step 7: Calculating the distance from each census tract to its closest grocery store
Now we need to find the closest grocery store to each census tract centroid and the distance to it, since that distance is the main variable in our research question.

In this step we faced significant challenges because the data mixes geometry types (polygons for census tracts and points for store locations) and coordinate reference systems (the projected CRS EPSG:5070 used for the tract centroids and the geographic CRS EPSG:4326 used for store locations). Distances are only meaningful when both layers share a CRS whose units are meters, so we convert the tract centroids to EPSG:4326, build point GeoDataFrames for both datasets, and then reproject both to EPSG:32618 (UTM zone 18N) before measuring point-to-point distances. These are planar distances in the projected CRS rather than true geodesic distances, so they are an approximation that may be less accurate in the western part of the state, which lies outside UTM zone 18N. The number of spatial computations is also large, because every tract centroid is compared against every store. To handle this, we used the GeoPandas library, which extends pandas to handle geospatial data and relies on pyproj for coordinate transformations, to perform the CRS transformations, centroid calculations, and distance calculations.

We learned how to do much of this from the sources listed in the References section.

In [16]:

# Extracting the tract number and centroid columns
centroid_df = tracts_gdf[['tract', 'centroid','GEOID']].copy()

# Converting the centroids to latitude and longitude (WGS84, EPSG:4326)
centroid_df['centroid'] = centroid_df['centroid'].to_crs(epsg=4326)

# Extracting the latitude and longitude from the centroid points
centroid_df['longitude'] = centroid_df['centroid'].x
centroid_df['latitude'] = centroid_df['centroid'].y

# Dropping the original centroid column because it's no longer needed
centroid_df = centroid_df.drop(['centroid'], axis=1)

print("Centroid data:")
print(centroid_df.head())

# Loading the stores data
stores = pd.read_csv('cleaned_retail_stores.csv')

print("Stores data:")
print(stores.head())

# Creating the GeoDataFrames for both datasets
centroid_gdf = gpd.GeoDataFrame(
    centroid_df,
    geometry=gpd.points_from_xy(centroid_df.longitude, centroid_df.latitude),
    crs="EPSG:4326"
)

stores_gdf = gpd.GeoDataFrame(
    stores,
    geometry=gpd.points_from_xy(stores.longitude, stores.latitude),
    crs="EPSG:4326"
)

# Convert both GeoDataFrames to a projected CRS for distance calculation
centroid_gdf = centroid_gdf.to_crs(epsg=32618)
stores_gdf = stores_gdf.to_crs(epsg=32618)

# Finding the nearest grocery store to each census tract centroid
centroid_gdf['nearest_store_index'] = centroid_gdf.geometry.apply(
    lambda geom: stores_gdf.distance(geom).idxmin()
)
centroid_gdf['nearest_store_name'] = centroid_gdf['nearest_store_index'].apply(
    lambda idx: stores_gdf.loc[idx, 'name']
)

# Calculating the distance to the nearest store (in meters)
centroid_gdf['distance_to_nearest_store'] = centroid_gdf.geometry.apply(
    lambda geom: stores_gdf.loc[stores_gdf.distance(geom).idxmin()].geometry.distance(geom)
)

print("Centroid data with nearest store and distance:")
print(centroid_gdf.head())

# Converting back to WGS84 format for final output
centroid_gdf = centroid_gdf.to_crs(epsg=4326)

# Saving the result in a CSV file
centroid_gdf.to_csv('census_tracts_with_nearest_store.csv', index=False)

# Display the result
centroid_gdf.head()




Centroid data:
    tract        GEOID  longitude   latitude
0  069601  36047069601 -73.914786  40.627228
1  069602  36047069602 -73.908651  40.622730
2  079801  36047079801 -73.958719  40.660011
3  079802  36047079802 -73.961371  40.657991
4  105801  36047105801 -73.880596  40.649220
Stores data:
                   name  longitude   latitude
0         ANK PETRO INC -73.816249  42.469251
1           EVANS JULIA -73.949129  42.577416
2            HUBRIX LLC -77.795356  42.158831
3       REID STORES INC -78.277037  42.220634
4  361 DELI GROCERY LLC -73.877220  40.871760
Centroid data with nearest store and distance:
    tract        GEOID  longitude   latitude                        geometry  \
0  069601  36047069601 -73.914786  40.627228  POINT (591781.302 4497943.187)   
1  069602  36047069602 -73.908651  40.622730  POINT (592306.371 4497450.251)   
2  079801  36047079801 -73.958719  40.660011  POINT (588022.553 4501537.351)   
3  079802  36047079802 -73.961371  40.657991   POINT (58780

| | tract | GEOID | longitude | latitude | geometry | nearest_store_index | nearest_store_name | distance_to_nearest_store |
|---|---|---|---|---|---|---|---|---|
| 0 | 69601 | 36047069601 | -73.914786 | 40.627228 | POINT (-73.91479 40.62723) | 15784 | L & T SUSHI LLC | 165.315035 |
| 1 | 69602 | 36047069602 | -73.908651 | 40.62273 | POINT (-73.90865 40.62273) | 14369 | ALMONTE MILL FOOD CORP | 624.099569 |
| 2 | 79801 | 36047079801 | -73.958719 | 40.660011 | POINT (-73.95872 40.66001) | 4997 | HK FRUIT MARKET INC | 144.323987 |
| 3 | 79802 | 36047079802 | -73.961371 | 40.657991 | POINT (-73.96137 40.65799) | 13395 | HABIB I DELI INC | 95.43277 |
| 4 | 105801 | 36047105801 | -73.880596 | 40.64922 | POINT (-73.8806 40.64922) | 9051 | CVS ALBANY LLC | 302.382437 |
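
The cell above measures planar distances after projecting both layers to EPSG:32618. As an independent cross-check, the sketch below computes great-circle (haversine) distances directly from longitude and latitude. This is a different technique from the one used in the notebook, so small differences from the projected distances are expected. The function names are hypothetical, the Earth radius is an approximation, and the search is brute force.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; an approximation

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two longitude/latitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_store(centroid, stores):
    """Return (name, distance_m) of the closest store by brute-force search."""
    lon, lat = centroid
    return min(
        ((name, haversine_m(lon, lat, s_lon, s_lat)) for name, s_lon, s_lat in stores),
        key=lambda pair: pair[1],
    )

# First tract centroid and the first five stores from the outputs above.
# These five rows are only a sample, so the result is not the true nearest store.
centroid = (-73.914786, 40.627228)
stores = [
    ("ANK PETRO INC", -73.816249, 42.469251),
    ("EVANS JULIA", -73.949129, 42.577416),
    ("HUBRIX LLC", -77.795356, 42.158831),
    ("REID STORES INC", -78.277037, 42.220634),
    ("361 DELI GROCERY LLC", -73.877220, 40.871760),
]
name, distance = nearest_store(centroid, stores)
print(name, round(distance))
```

Running the same check on the full store list and comparing against `distance_to_nearest_store` would be one way to gauge how much the choice of projection affects the results.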


#### Step 8: Cleaning and merging the nearest grocery store data


In [20]:
# Loading the new census grocery store distance CSV file
store_distance_df = pd.read_csv('census_tracts_with_nearest_store.csv')

# Loading the cleaned census data CSV file
census_data_df = pd.read_csv('cleaned_census_data.csv')

# Ensure the GEOID number columns in both dataframes are of the same type (str)
store_distance_df['GEOID'] = store_distance_df['GEOID'].astype(str)
census_data_df['GEOID'] = census_data_df['GEOID'].astype(str)

# Extract the county name from the 'name' column into a new 'County' column, in uppercase
census_data_df['County'] = (
    census_data_df['name']
    .str.extract(r',\s(.*?)\sCounty')[0]
    .str.upper()
)

# Merging the dataframes on the GEOID number
merged_df = pd.merge(census_data_df, store_distance_df, on='GEOID', how='left')

# Keep the census-side tract column, then drop the columns that will not be used in the analysis
merged_df['tract'] = merged_df['tract_x']
columns_to_drop = ['longitude', 'latitude', 'geometry', 'state', 'name','tract_x','tract_y']
merged_df = merged_df.drop(columns=columns_to_drop, errors='ignore')

# Display the first few rows to verify changes
print(merged_df.head())

# Saving the merged dataframe to a new CSV file
output_file = 'merged_census_store_data.csv'
merged_df.to_csv(output_file, index=False)



   median_income  white_pop  black_pop  asian_pop  hispanic_pop  county  \
0        44871.0      530.0     1086.0       87.0         311.0       1   
1        42456.0      521.0     1945.0      109.0         165.0       1   
2        24792.0      124.0     2316.0        0.0          71.0       1   
3        40666.0     1155.0     1251.0      221.0         836.0       1   
4        42370.0     2180.0      457.0      387.0         155.0       1   

         GEOID  total_pop  white_percent  black_percent  asian_percent  \
0  36001000100     2014.0      26.315789      53.922542       4.319762   
1  36001000201     2740.0      19.014599      70.985401       3.978102   
2  36001000202     2511.0       4.938272      92.234170       0.000000   
3  36001000301     3463.0      33.352584      36.124747       6.381750   
4  36001000302     3179.0      68.575024      14.375590      12.173640   

   hispanic_percent  County  nearest_store_index  \
0         15.441907  ALBANY                10543   
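
Reading the CSV files back with `pd.read_csv` lets pandas infer numeric types, which is why `county` prints as `1` rather than `001` in the output above. GEOID keeps all of its digits only because it starts with `36`. The sketch below reads the code columns as strings and uses `indicator=True` on the merge to count tracts without a match. The in-memory CSV text stands in for the real files; the second tract is left unmatched on purpose to show the check.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files (values taken from the outputs above).
census_csv = io.StringIO(
    "GEOID,county,tract,median_income\n"
    "36001000100,001,000100,44871.0\n"
    "36001000201,001,000201,42456.0\n"
)
distance_csv = io.StringIO(
    "GEOID,nearest_store_index\n"
    "36001000100,10543\n"
)

# Read identifier columns as strings so leading zeros survive.
code_columns = {"GEOID": str, "county": str, "tract": str}
census = pd.read_csv(census_csv, dtype=code_columns)
distances = pd.read_csv(distance_csv, dtype={"GEOID": str})

# indicator=True adds a _merge column showing which rows found a match.
merged = census.merge(distances, on="GEOID", how="left", indicator=True)

print(merged[["GEOID", "county", "tract", "_merge"]])
print((merged["_merge"] == "left_only").sum(), "tract(s) without a store match")
```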


# Data Limitations

### 1. Oversimplification of Racial and Ethnic Categories:
The broad racial categories (white, black, Asian, Hispanic) may not capture the nuanced food access issues faced by specific ethnic subgroups or multiracial individuals.

**Impact**: This simplification could lead to overlooking cultural factors in food access and potentially misguide policy interventions aimed at improving food equity among diverse communities.


### 2. Limitations of Distance-Based Accessibility Measures:

Our analysis uses straight-line distances rather than actual travel routes such as roads, and doesn't account for transportation modes or barriers like highways or rivers.

**Impact**: This could significantly underestimate travel times and difficulties in accessing stores, especially in areas with complex urban layouts or for individuals relying on public transportation.


### 3. Exclusion of Food Quality and Affordability:

Our dataset focuses on the presence of stores but doesn't account for the quality, variety, or affordability of food offered.

**Impact**: This limitation could lead to overestimating true food access in areas where stores are present but don't offer healthy or affordable options, potentially masking issues of nutritional inequality.


### 4. Absence of Online and Mobile Food Services:

The focus on physical store locations doesn't account for the growing influence of online grocery delivery services and mobile food markets.

**Impact**: This omission could lead to underestimating food access in areas well-served by these alternative food sources, potentially skewing our understanding of modern food accessibility patterns.


### 5. Lack of Consideration for Informal Food Networks:

Our analysis doesn't capture informal food sources like community gardens, food sharing networks, or farmers' markets.

**Impact**: This could lead to underestimating food resources in some communities, particularly in areas with strong social networks or alternative food systems.


## References

Geopandas intro tutorial:
https://geopandas.org/en/stable/getting_started/introduction.html

Spatial data links:
https://walker-data.com/posts/proximity-analysis/
https://michaelminn.net/tutorials/python-proximity/index.html

Geopandas user guide, for geospatial operations in Python:
https://geopandas.org/en/stable/docs/user_guide.html

Shapely for geometric operations:
https://shapely.readthedocs.io/en/stable/manual.html 

Conducting Geospatial Analysis in GeoPandas:
https://geog-312.gishub.org/book/geospatial/geopandas.html 

Figuring out pd.DF.apply: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

Using Lambda function to PD Dataframes:
https://www.geeksforgeeks.org/applying-lambda-functions-to-pandas-dataframe/

Coordinate Reference Systems Info:
https://docs.qgis.org/latest/en/docs/gentle_gis_introduction/coordinate_reference_systems.html

Geopandas and Shapely guide:
https://www.learndatasci.com/tutorials/geospatial-data-python-geopandas-shapely/ 

API Sources: 
https://pygis.io/docs/d_access_census.html
https://n8henrie.com/uploads/2017/11/plotting-us-census-data-with-python-and-geopandas.html
