# Analysis of Food Store Accessibility in Erie County, New York
## Erin Gregoire
## Fall 2024

### Dataset Collection and Justification

As an Erie County native, I have heard anecdotal accounts of food deserts within the county, especially in the City of Buffalo. In this project, I have performed an analysis of the available food stores, income, and population within the municipalities of Erie County to answer the following questions:
- Are there any areas that have a "food desert" in terms of lack of quality food stores?
- Is there a correlation between the average income of a town and the quality of its food stores?
- Does living in the City of Buffalo differ from living in the Suburbs in terms of food store quality and accessibility?
- Can we predict whether a person is a resident of the City of Buffalo or a Suburb based on their income and local food stores?

### About the Data
The combination of these datasets will show what the various socioeconomic classes have access to in terms of grocery types and food store accessibility. These datasets will also provide response and predictor variables when using machine learning techniques to predict whether a location is in the City of Buffalo or a Suburb. The datasets fit together through the Zipcode column, which acts as a shared key and links the data.
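
As a minimal sketch of that linkage, the two tables can be joined on `Zipcode` with `DataFrame.merge`. The frames below are toy stand-ins, the store names and the 14202 return count are made up for illustration, and the choice of a left join is an assumption rather than the method used later in this project.

```python
import pandas as pd

# Toy stand-ins for the cleaned store and income tables (illustrative values).
stores = pd.DataFrame({
    "Food Store Name": ["STORE ONE", "STORE TWO", "STORE THREE"],
    "Zipcode": ["14001", "14001", "14202"],
})
incomes = pd.DataFrame({
    "Zipcode": ["14001", "14202"],
    "Number of returns": [4880, 1000],
})

# Many stores can share one zipcode, so this is a many-to-one join.
# validate= raises an error if the income table has duplicate zipcodes.
merged = stores.merge(incomes, on="Zipcode", how="left", validate="many_to_one")
print(merged)
```

Keeping `Zipcode` as a string in both tables matters here, since pandas will not match the string `"14001"` against the integer `14001`.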

#### Retail Food Stores
- Dataset by the New York State Department of Agriculture and Markets
- Provides Food Store Names, their Establishment Codes (Such as Store and Bakery), and their location
- Quality is controlled by a column called "Store Focus" which denotes the primary goal of the store
     - Grocery: Store is primarily for food products, includes food from all food groups (Produce, Dairy, Meat, and Dry/Canned Goods)
     - Convenience: Gas stations, mini-marts, corner stores, prepared food stores
     - Multi-Purpose: Large name stores where groceries are only a small percent of products sold
     - Specialty: Store focuses on only one food group or product. Includes bakeries, meat/fish stores, artisan goods
     - Pharmacy: Pharmacies or drug stores that may carry a small selection of grocery products
     - Other: Stores that carry prepared or snack foods that are not easily categorized above such as catering businesses.

#### Income and Tax Data
- Dataset by the US Department of Treasury and the IRS
- Provides income brackets and the number of returns conducted within each bracket
- Approximate population is accounted for by using the total number of tax returns for each zipcode
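
The population proxy described above could be extracted roughly as follows. This sketch assumes the income table has already been cleaned so that each zipcode has one row labeled `Total`; the two zipcodes and their counts are taken from the cleaned table shown later in this notebook.

```python
import pandas as pd

incomes = pd.DataFrame({
    "Zipcode": ["14001", "14001", "14004", "14004"],
    "Size of adjusted gross income": [
        "Total", "$1 under $25,000", "Total", "$1 under $25,000",
    ],
    "Number of returns": [4880, 1190, 5540, 1330],
})

# Keep only the per-zipcode Total rows and treat the number of
# returns as a rough stand-in for population.
population_proxy = (
    incomes[incomes["Size of adjusted gross income"] == "Total"]
    .set_index("Zipcode")["Number of returns"]
    .rename("Approximate population")
)
print(population_proxy)
```

The number of returns undercounts residents who do not file and counts joint filers once, so it is best read as a relative measure between zipcodes.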

#### Anticipated Challenges
- Removing excess data from zipcodes that are not in Erie County
- Merging datasets smoothly

### Data Preprocessing and Cleaning

In this step, I will load the datasets one at a time and work with each until it is fully cleaned. Then, each dataset will be saved into its own file to be used again in Step 3.

In [8]:
import pandas as pd
import json
import sqlite3
import ast

In [9]:
erie_zipcodes = ['14001', '14004', '14006', '14010', '14025', '14026', '14027', '14030', '14031', '14032', '14033', '14034', '14035', '14038', '14043', '14047', '14051', '14052', '14055', '14057', '14059', '14061', '14068', '14069', '14072', '14075', '14080', '14085', '14086', '14091', '14102', '14110', '14111', '14127', '14134', '14139', '14140', '14141', '14150', '14151', '14169', '14170', '14201', '14202', '14203', '14204', '14205', '14206', '14207', '14208', '14209', '14210', '14211', '14212', '14213', '14214', '14215', '14216', '14217', '14128', '14219', '14220', '14221', '14222', '14223', '14224', '14225', '14226', '14227', '14228', '14231', '14233', '14240', '14241', '14260', '14261', '14263', '14264', '14265', '14267', '14269', '14270', '14272', '14273', '14276', '14280']
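
Because this list is typed by hand, a quick sanity check can catch entry mistakes before it is used as a filter. The helper below is a hypothetical sketch (`check_zipcodes` is not part of the original notebook) that looks for malformed entries, duplicates, and places where an otherwise ascending list steps backwards.

```python
def check_zipcodes(zipcodes):
    """Return simple problems found in a hand-typed list of zipcode strings."""
    malformed = [z for z in zipcodes if not (len(z) == 5 and z.isdigit())]
    duplicates = sorted({z for z in zipcodes if zipcodes.count(z) > 1})
    # Adjacent pairs where the list stops ascending; in a sorted list
    # these entries deserve a second look.
    out_of_order = [(a, b) for a, b in zip(zipcodes, zipcodes[1:]) if a > b]
    return {
        "malformed": malformed,
        "duplicates": duplicates,
        "out_of_order": out_of_order,
    }

# Excerpt of the list above.
sample = ['14215', '14216', '14217', '14128', '14219', '14220']
report = check_zipcodes(sample)
print(report)
```

In this excerpt the check flags the pair `('14217', '14128')`, so that entry is worth verifying against an official zipcode listing for the county.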

#### Retail Food Stores Data

In [11]:
stores_data = pd.read_csv('Retail_Food_Stores_20241119.csv')

In [12]:
stores_data.head()

Unnamed: 0,County,License Number,Operation Type,Establishment Type,Entity Name,DBA Name,Street Number,Street Name,Address Line 2,Address Line 3,City,State,Zip Code,Square Footage,Georeference
0,SUFFOLK,763163,Store,A,HEALTHY MEALS DIRECT LLC,HEALTHY MEALS DIRECT,1866,DEER PARK AVE,,,DEER PARK,NY,11729,,POINT (-73.32901606 40.7599309)
1,TIOGA,763162,Store,A,ALDI INC,ALDI #62,1150,STATE ROUTE 17C,,,OWEGO,NY,13827,,POINT (-76.231859408 42.096557734)
2,WESTCHESTER,763161,Store,A,WALGREEN EASTERN CO INC,WALGREENS #21443,3320,CROMPOND RD,,,YORKTOWN HEIGHTS,NY,10598,,POINT (-73.830051866 41.291524945)
3,SUFFOLK,763134,Store,A,SANJHA BAZAAR LLC,SANJHA BAZAAR,2160,JERICHO TURNPIKE,,,COMMACK,NY,11725,,POINT (-73.284552495 40.842579189)
4,KINGS,763133,Store,A,SKILLMART INC,SKILLMART,1010,BEDFORD AVE,,,BROOKLYN,NY,11205,,POINT (-73.955486796 40.690346184)


#### Collect only food stores located in Erie County:

In [14]:
erie_stores = pd.DataFrame(stores_data[stores_data["County"] == "ERIE"])
erie_stores.head()

Unnamed: 0,County,License Number,Operation Type,Establishment Type,Entity Name,DBA Name,Street Number,Street Name,Address Line 2,Address Line 3,City,State,Zip Code,Square Footage,Georeference
96,ERIE,762218,Store,A,KABUL MARKET & BAKERY LLC,KABUL MARKET & BAKERY,803,NIAGARA FALLS BLVD,,,BUFFALO,NY,14226,1900.0,POINT (-78.822585223 42.976011007)
101,ERIE,762204,Store,AC,BLACHOWICZ MICHAEL I,MEGA BITES VENDING,92,FRANKLIN ST,,,BUFFALO,NY,14202,500.0,POINT (-78.877599185 42.883237171)
113,ERIE,762139,Store,AC,REGAN MICHAEL & BRENDA,REGANS VILLAGE DELI,83,JOHN STREET,,,AKRON,NY,14001,1200.0,POINT (-78.495381288 43.020975014)
133,ERIE,762086,Store,AC,SOE SAN LLC,GOLDEN BURMA ASIA FOODS,92,GRANT STREET,,,BUFFALO,NY,14213,2200.0,POINT (-78.890780135 42.916332862)
139,ERIE,762018,Store,AC,AKHOAT ALSALAM LLC,SOHO MARKET,1985,SOUTH PARKE AVE,,,BUFFALO,NY,14220,400.0,POINT (-78.823803865 42.844617678)


#### Remove unnecessary data

In [16]:
erie_stores = erie_stores.drop(columns=['County', 'License Number', 'Operation Type', 'Entity Name', 'Street Number', 'Street Name', 'Address Line 2', 'Address Line 3', 'State', 'Square Footage', 'Georeference'])

#### Rename columns and Check data type

In [18]:
erie_stores = erie_stores.rename(columns={'DBA Name': 'Food Store Name'})
erie_stores = erie_stores.rename(columns={'Zip Code': 'Zipcode'})

In [19]:
erie_stores.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1126 entries, 96 to 23318
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Establishment Type  1126 non-null   object
 1   Food Store Name     1126 non-null   object
 2   City                1126 non-null   object
 3   Zipcode             1126 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 44.0+ KB


In [20]:
erie_stores['Zipcode'] = erie_stores['Zipcode'].astype(str)

In [21]:
erie_stores.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1126 entries, 96 to 23318
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Establishment Type  1126 non-null   object
 1   Food Store Name     1126 non-null   object
 2   City                1126 non-null   object
 3   Zipcode             1126 non-null   object
dtypes: object(4)
memory usage: 44.0+ KB


#### Check for missing data and outliers

In [23]:
rows_with_missing_data = erie_stores[erie_stores.isna().any(axis=1)]
print(rows_with_missing_data)

Empty DataFrame
Columns: [Establishment Type, Food Store Name, City, Zipcode]
Index: []


In [24]:
establishment_types = erie_stores["Establishment Type"].unique()

In [25]:
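# Count the stores for each establishment type code
# (est_type avoids shadowing the built-in type())
for est_type in establishment_types:
    amount = len(erie_stores[erie_stores["Establishment Type"] == est_type])
    print(f' There are {amount} stores in Erie County with Establishment Type {est_type}')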

 There are 338 stores in Erie County with Establishment Type A
 There are 686 stores in Erie County with Establishment Type AC
 There are 67 stores in Erie County with Establishment Type ABC
 There are 2 stores in Erie County with Establishment Type ACW
 There are 5 stores in Erie County with Establishment Type AD
 There are 3 stores in Erie County with Establishment Type ACY
 There are 15 stores in Erie County with Establishment Type ACH
 There are 4 stores in Erie County with Establishment Type ACD
 There are 1 stores in Erie County with Establishment Type ABCD
 There are 1 stores in Erie County with Establishment Type ACK
 There are 2 stores in Erie County with Establishment Type ABCH
 There are 1 stores in Erie County with Establishment Type ACHD
 There are 1 stores in Erie County with Establishment Type ABHK


There are no outliers as such, although some establishment types are rare. The majority of food stores fall into three categories: AC (Store and Food Manufacturer), A (Store), and ABC (Store, Bakery, and Food Manufacturer). At this point, there is no need to remove or change the rare establishment types. Their codes overlap with the common ones (every type includes A, and most include C), so they remain relevant to the analysis.
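
The same tally can be produced in one step with `Series.value_counts`, which also makes it easy to see each type's share. This is an alternative sketch, not the code used above, and the sample codes are made up for illustration.

```python
import pandas as pd

# Small illustrative sample of establishment type codes.
codes = pd.Series(["A", "AC", "AC", "ABC", "AC", "A"], name="Establishment Type")

counts = codes.value_counts()  # sorted from most to least common
share = (counts / counts.sum() * 100).round(1)
summary = pd.DataFrame({"stores": counts, "percent": share})
print(summary)
```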

#### Ensure that multi-valued attributes will be able to be handled in later steps

In [28]:
def split_types (types):
    result = list(types)
    return result

In [29]:
erie_stores['Establishment Type'] = erie_stores['Establishment Type'].apply(split_types)

In [30]:
len(erie_stores['Food Store Name'].unique())

1058

In [31]:
erie_stores.head()

Unnamed: 0,Establishment Type,Food Store Name,City,Zipcode
96,[A],KABUL MARKET & BAKERY,BUFFALO,14226
101,"[A, C]",MEGA BITES VENDING,BUFFALO,14202
113,"[A, C]",REGANS VILLAGE DELI,AKRON,14001
133,"[A, C]",GOLDEN BURMA ASIA FOODS,BUFFALO,14213
139,"[A, C]",SOHO MARKET,BUFFALO,14220


In [32]:
erie_stores.to_csv('erie_stores.csv', index=False)
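
One thing to keep in mind when this file is read back: CSV has no list type, so the `Establishment Type` lists are written as text such as `['A', 'C']`. A minimal sketch of restoring them on load, using `ast.literal_eval` as a converter, is shown below. It uses an in-memory buffer instead of `erie_stores.csv` so that it runs on its own, and it is an assumption about how a later step might reload the data.

```python
import ast
import io

import pandas as pd

stores = pd.DataFrame({
    "Establishment Type": [["A"], ["A", "C"]],
    "Food Store Name": ["KABUL MARKET & BAKERY", "MEGA BITES VENDING"],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
stores.to_csv(buffer, index=False)
buffer.seek(0)

# literal_eval safely parses the text "['A', 'C']" back into a Python list.
reloaded = pd.read_csv(
    buffer,
    converters={"Establishment Type": ast.literal_eval},
)
print(reloaded["Establishment Type"].tolist())
```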

#### Summary of Retail Food Store Data Cleaning Process
- Loaded CSV file
- Removed rows that belonged to counties other than Erie County
- Removed unnecessary columns that will not be useful to analysis
- Cleaned up column names and data type discrepancies, and checked for missing data and outliers
- Saved file as new CSV

#### Income Tax Data

In [35]:
income_data = pd.read_csv('Tax_Zip_Code_Data.csv', usecols = (0, 1, 2, 3, 4, 5), skiprows = 3, skipfooter = 31, dtype = str, engine = 'python')

In [36]:
income_data.head(10)

Unnamed: 0,ZIP\ncode [1],Size of adjusted gross income,Number of \nreturns [2],Number of single returns,Number of joint returns,Number of head of household returns
0,,,,,,
1,,,(1),(2),(3),(4)
2,0.0,Total,9620850,5158780,2792800,1403280
3,0.0,"$1 under $25,000",2691220,2004510,257970,375610
4,0.0,"$25,000 under $50,000",2324490,1357130,393190,515930
5,0.0,"$50,000 under $75,000",1424670,779250,348060,247450
6,0.0,"$75,000 under $100,000",908050,408800,342880,121210
7,0.0,"$100,000 under $200,000",1493450,449260,876140,120400
8,0.0,"$200,000 or more",778970,159830,574560,22680
9,,,,,,


#### Remove extra blank header rows and correct column names

In [38]:
income_data = income_data.drop(0).drop(1)

In [39]:
income_data.head(10)

Unnamed: 0,ZIP\ncode [1],Size of adjusted gross income,Number of \nreturns [2],Number of single returns,Number of joint returns,Number of head of household returns
2,0.0,Total,9620850.0,5158780.0,2792800.0,1403280.0
3,0.0,"$1 under $25,000",2691220.0,2004510.0,257970.0,375610.0
4,0.0,"$25,000 under $50,000",2324490.0,1357130.0,393190.0,515930.0
5,0.0,"$50,000 under $75,000",1424670.0,779250.0,348060.0,247450.0
6,0.0,"$75,000 under $100,000",908050.0,408800.0,342880.0,121210.0
7,0.0,"$100,000 under $200,000",1493450.0,449260.0,876140.0,120400.0
8,0.0,"$200,000 or more",778970.0,159830.0,574560.0,22680.0
9,,,,,,
10,10001.0,,16070.0,11850.0,2560.0,1150.0
11,10001.0,"$1 under $25,000",2960.0,2370.0,210.0,290.0


In [40]:
income_data = income_data.rename(columns={'ZIP\ncode [1]': 'Zipcode'})

In [41]:
erie_incomes = income_data[income_data['Zipcode'].isin(erie_zipcodes)]

In [42]:
erie_incomes = erie_incomes.rename(columns={'Number of \nreturns [2]': 'Number of returns'})

In [43]:
erie_incomes.head(10)

Unnamed: 0,Zipcode,Size of adjusted gross income,Number of returns,Number of single returns,Number of joint returns,Number of head of household returns
9570,14001,,4880,2420,1950,380
9571,14001,"$1 under $25,000",1190,940,140,80
9572,14001,"$25,000 under $50,000",1170,770,240,130
9573,14001,"$50,000 under $75,000",830,420,290,90
9574,14001,"$75,000 under $100,000",600,170,350,60
9575,14001,"$100,000 under $200,000",890,120,750,20
9576,14001,"$200,000 or more",200,**,180,**
9578,14004,,5540,2700,2310,420
9579,14004,"$1 under $25,000",1330,1080,170,80
9580,14004,"$25,000 under $50,000",1310,860,240,170


#### Ensure Data Type Consistency

When the data was read into Python, each column contained a mix of value types (numbers with thousands separators, `**` placeholders, and blanks), so pandas needed an explicit data type for each column. Every column was loaded as a string for simplicity, with the intention of converting the numeric columns later in the cleaning process. The cells below show that conversion.

In [46]:
erie_incomes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 413 entries, 9570 to 10416
Data columns (total 6 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Zipcode                              413 non-null    object
 1   Size of adjusted gross income        354 non-null    object
 2   Number of returns                    413 non-null    object
 3   Number of single returns             413 non-null    object
 4   Number of joint returns              413 non-null    object
 5   Number of head of household returns  413 non-null    object
dtypes: object(6)
memory usage: 22.6+ KB


In [47]:
erie_incomes[['Number of returns', 'Number of single returns', 'Number of joint returns', 'Number of head of household returns']] = erie_incomes[['Number of returns', 'Number of single returns', 'Number of joint returns', 'Number of head of household returns']].replace(',', '', regex=True)

In [48]:
erie_incomes.loc[erie_incomes['Number of returns'] == '** ', 'Number of returns'] = 0
erie_incomes.loc[erie_incomes['Number of single returns'] == '** ', 'Number of single returns'] = 0
erie_incomes.loc[erie_incomes['Number of joint returns'] == '** ', 'Number of joint returns'] = 0
erie_incomes.loc[erie_incomes['Number of head of household returns'] == '** ', 'Number of head of household returns'] = 0

In [49]:
erie_incomes.head(10)

Unnamed: 0,Zipcode,Size of adjusted gross income,Number of returns,Number of single returns,Number of joint returns,Number of head of household returns
9570,14001,,4880,2420,1950,380
9571,14001,"$1 under $25,000",1190,940,140,80
9572,14001,"$25,000 under $50,000",1170,770,240,130
9573,14001,"$50,000 under $75,000",830,420,290,90
9574,14001,"$75,000 under $100,000",600,170,350,60
9575,14001,"$100,000 under $200,000",890,120,750,20
9576,14001,"$200,000 or more",200,0,180,0
9578,14004,,5540,2700,2310,420
9579,14004,"$1 under $25,000",1330,1080,170,80
9580,14004,"$25,000 under $50,000",1310,860,240,170


In [50]:
erie_incomes[['Number of returns', 'Number of single returns', 'Number of joint returns', 'Number of head of household returns']] = erie_incomes[['Number of returns', 'Number of single returns', 'Number of joint returns', 'Number of head of household returns']].apply(pd.to_numeric)
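
The comma removal, `**` replacement, and numeric conversion above can also be folded into a single helper. The sketch below is an alternative under stated assumptions: `to_count` is a hypothetical name, the sample values are illustrative, and `errors="coerce"` turns every unparseable value (not only `**`) into 0, which is a broader rule than the explicit replacement used above.

```python
import pandas as pd

raw = pd.DataFrame({
    "Number of returns": ["4,880", "200"],
    "Number of single returns": ["2,420", "** "],
})

def to_count(column):
    """Convert a column of count strings to integers, mapping placeholders to 0."""
    cleaned = column.str.replace(",", "", regex=False).str.strip()
    # Anything that is not a number (such as '**') becomes NaN, then 0.
    return pd.to_numeric(cleaned, errors="coerce").fillna(0).astype("int64")

clean = raw.apply(to_count)
print(clean)
print(clean.dtypes)
```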

In [51]:
erie_incomes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 413 entries, 9570 to 10416
Data columns (total 6 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Zipcode                              413 non-null    object
 1   Size of adjusted gross income        354 non-null    object
 2   Number of returns                    413 non-null    int64 
 3   Number of single returns             413 non-null    int64 
 4   Number of joint returns              413 non-null    int64 
 5   Number of head of household returns  413 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 22.6+ KB


In [52]:
erie_incomes = erie_incomes[erie_incomes["Zipcode"] != 0]

In [53]:
erie_incomes.head(20)

Unnamed: 0,Zipcode,Size of adjusted gross income,Number of returns,Number of single returns,Number of joint returns,Number of head of household returns
9570,14001,,4880,2420,1950,380
9571,14001,"$1 under $25,000",1190,940,140,80
9572,14001,"$25,000 under $50,000",1170,770,240,130
9573,14001,"$50,000 under $75,000",830,420,290,90
9574,14001,"$75,000 under $100,000",600,170,350,60
9575,14001,"$100,000 under $200,000",890,120,750,20
9576,14001,"$200,000 or more",200,0,180,0
9578,14004,,5540,2700,2310,420
9579,14004,"$1 under $25,000",1330,1080,170,80
9580,14004,"$25,000 under $50,000",1310,860,240,170


#### Feature Engineering

In [55]:
def calculate_percentage(group):
    total = group.iloc[0]["Number of returns"]
    group["Percentage (Number of returns)"] = (group["Number of returns"] / total) * 100
    return group

erie_incomes2 = erie_incomes.groupby("Zipcode", group_keys=False).apply(calculate_percentage, include_groups=False)

In [56]:
erie_incomes2["Zipcode"] = erie_incomes["Zipcode"]

In [57]:
cols = erie_incomes2.columns.tolist()

In [58]:
cols = cols[-1:] + cols[:-1]

In [59]:
erie_incomes = erie_incomes2[cols]

In [60]:
erie_incomes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 413 entries, 9570 to 10416
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Zipcode                              413 non-null    object 
 1   Size of adjusted gross income        354 non-null    object 
 2   Number of returns                    413 non-null    int64  
 3   Number of single returns             413 non-null    int64  
 4   Number of joint returns              413 non-null    int64  
 5   Number of head of household returns  413 non-null    int64  
 6   Percentage (Number of returns)       413 non-null    float64
dtypes: float64(1), int64(4), object(2)
memory usage: 25.8+ KB


In [61]:
erie_incomes['Size of adjusted gross income'] = erie_incomes['Size of adjusted gross income'].fillna('Total')

In [62]:
erie_incomes.head()

Unnamed: 0,Zipcode,Size of adjusted gross income,Number of returns,Number of single returns,Number of joint returns,Number of head of household returns,Percentage (Number of returns)
9570,14001,Total,4880,2420,1950,380,100.0
9571,14001,"$1 under $25,000",1190,940,140,80,24.385246
9572,14001,"$25,000 under $50,000",1170,770,240,130,23.97541
9573,14001,"$50,000 under $75,000",830,420,290,90,17.008197
9574,14001,"$75,000 under $100,000",600,170,350,60,12.295082
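
For comparison, the same percentage column can be computed without a custom function or re-attaching `Zipcode`, by using `groupby(...).transform`. This is an alternative sketch, not the approach used above, and like the original it assumes the Total row comes first within each zipcode. The sample rows are taken from the table above.

```python
import pandas as pd

incomes = pd.DataFrame({
    "Zipcode": ["14001"] * 3 + ["14004"] * 3,
    "Size of adjusted gross income": [
        "Total", "$1 under $25,000", "$25,000 under $50,000",
    ] * 2,
    "Number of returns": [4880, 1190, 1170, 5540, 1330, 1310],
})

# transform("first") repeats each zipcode's first value (the Total row)
# on every row of that zipcode, so the division lines up row by row.
totals = incomes.groupby("Zipcode")["Number of returns"].transform("first")
incomes["Percentage (Number of returns)"] = incomes["Number of returns"] / totals * 100
print(incomes)
```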


In [63]:
erie_incomes.to_csv('erie_incomes.csv', index=False)

#### Summary of Income Tax Return Data Cleaning Process
- Loaded CSV file
- Removed rows that had blanks/missing data
- Ensured data type consistency by converting columns to the correct data types
- Removed unnecessary columns that will not be useful to analysis
- Removed rows that belonged to zipcodes not in Erie County
- Feature engineered a column that gives the percentage of tax returns in each income bracket
- Saved to new CSV file

#### Establishment Type data from the Retail Food Store dataset

In [66]:
establishment_types = {
    "A" : 'Store',
    'B' : "Bakery",
    'C' : "Food Manufacturer",
    'D' : 'Food Warehouse',
    'H' : 'Wholesale Manufacturer',
    'K' : 'Vehicle',
    'W' : 'Farm Winery',
    'Y' : 'Slaughterhouse'}

In [67]:
import json

In [68]:
with open('establishments.json', 'w', encoding='utf-8') as file:
    json.dump(establishment_types, file, sort_keys=True, indent=4, separators=(',', ': '), ensure_ascii=False)

In [69]:
with open('establishments.json', 'r', encoding='utf-8') as file:
    testing = json.load(file)

#### Summary of Establishment Type Data Cleaning Process
- To make the establishment type codes interpretable, created a dictionary that holds each code and its meaning as key-value pairs
- Saved dictionary as a JSON file
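
As a sketch of how this lookup could be used later, the snippet below round-trips a subset of the dictionary through JSON and translates a list of codes into readable labels. `describe` is a hypothetical helper, and the in-memory round trip stands in for reading `establishments.json`.

```python
import json

# Subset of the establishment type dictionary defined above.
establishment_types = {"A": "Store", "B": "Bakery", "C": "Food Manufacturer"}

# Round-trip through a JSON string to confirm nothing is lost.
loaded = json.loads(json.dumps(establishment_types, sort_keys=True))
assert loaded == establishment_types

def describe(codes, lookup):
    """Translate a list of single-letter codes into their meanings."""
    return [lookup.get(code, f"Unknown ({code})") for code in codes]

print(describe(["A", "B", "C"], loaded))
```

Falling back to an `Unknown (...)` label keeps an unexpected code visible instead of raising a `KeyError`.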

### Summary of Data Cleaning Process:
- Food Store, Income Tax, and Establishment Type data are now stored in their own files
- Data only includes data for zipcodes in Erie County
- Zipcodes are stored as strings because they serve to categorize the data, not as integers for arithmetic
- Impact of Cleaning:
    - Drastically reduced the size of the data, both by removing rows with zipcodes in counties other than Erie and by removing columns that are irrelevant to the analysis.
    - Data is organized into clearly named columns and tables that are easy to understand and to load in later steps
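
One caveat when the saved files are loaded again: CSV does not record data types, so pandas will infer `Zipcode` as an integer unless told otherwise. The sketch below shows the difference, using an in-memory CSV in place of `erie_incomes.csv`; passing `dtype={"Zipcode": str}` is a suggestion, not something the steps above already do.

```python
import io

import pandas as pd

csv_text = "Zipcode,Number of returns\n14001,4880\n14004,5540\n"

# Default behavior: pandas infers an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype keeps zipcodes as text, matching the cleaned data.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Zipcode": str})

print(inferred["Zipcode"].dtype)
print(as_text["Zipcode"].tolist())
```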