### Starting analysis of the Open NYC data

In [4]:
import pandas as pd
df = pd.read_csv('GreenThumb_Garden_Info.csv')
df.keys()

Index(['assemblydist', 'borough', 'communityboard', 'congressionaldist',
       'coundist', 'gardenname', 'juris', 'multipolygon', 'openhrsf',
       'openhrsm', 'openhrssa', 'openhrssu', 'openhrsth', 'openhrstu',
       'openhrsw', 'parksid', 'policeprecinct', 'statesenatedist', 'status',
       'zipcode'],
      dtype='object')

In [None]:
df.drop(columns='multipolygon', inplace=True)

After some quick analysis, I realized that the "multipolygon" column was cluttering the output and making it hard to read. I checked Open NYC's data dictionary and confirmed that I won't need its information, so I removed it from my dataset.

In [15]:
active_g = df[df['status'].str.contains('Active')]
print(len(active_g))

#There are at least 541 active gardens in NYC.

541


In [17]:
#NYC population per the July 2022 estimate is 8,335,897
#Source: https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork,richmondcountynewyork,queenscountynewyork,newyorkcountynewyork,kingscountynewyork,bronxcountynewyork/PST045222

541 / 8335897 * 100000

#There are about 6.5 active community gardens per 100,000 NYC residents.

6.490003415349301

In [27]:
#Counting active gardens in each borough for future analysis

active_g_borough = active_g.borough.value_counts()
active_g_borough 

borough
B    238
M    138
X    115
Q     43
R      7
Name: count, dtype: int64

In [53]:
active_g_borough.to_csv('garden_by_borough.csv')

I'm saving this borough information into a new CSV so it's easier to add additional demographic information and I can have a quick file for datawrapper.

In [24]:
active_g_zipcode = active_g.zipcode.value_counts()
active_g_zipcode.to_csv('garden_by_zipcode.csv')

I'm also saving this zipcode information into a new CSV file so it's easier to add additional demographic information and I can have a quick file for datawrapper.
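
A minimal sketch of how the counts could be saved with explicit column names, so the CSV already has a `garden_count` column and no stray index column when it is read back. The `toy` frame is a made-up stand-in for `active_g`, and the column name `garden_count` is taken from the later cells; whether the original file was renamed by hand or in code is an assumption here.

```python
import pandas as pd

# Toy stand-in for the active-garden rows (the notebook itself uses active_g).
toy = pd.DataFrame({"zipcode": [11207, 11207, 10009, 11207, 10009, 11233]})

counts = (
    toy["zipcode"]
    .value_counts()                    # Series indexed by zipcode, largest count first
    .rename_axis("zipcode")            # name the index so it becomes a proper column
    .reset_index(name="garden_count")  # Series -> two-column DataFrame
)

# index=False keeps the row numbers out of the file, so reading it back
# with pd.read_csv does not create an extra "Unnamed: 0" column.
csv_text = counts.to_csv(index=False)
print(csv_text)
```

Passing a filename as the first argument of `to_csv` writes the same text to disk instead of returning it as a string.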

### Data analysis for boroughs

In [56]:
df_borough = pd.read_csv('garden_by_borough.csv')

In [57]:
#add borough population (per July 2022 estimate) to dataframe; the list order must match the rows: B, M, X, Q, R
#url: https://www.census.gov/quickfacts/fact/table/richmondcountynewyork,queenscountynewyork,newyorkcountynewyork,kingscountynewyork,bronxcountynewyork/PST045222
population = [2590516, 1596273, 1379946, 2278029, 491133]
df_borough['population'] = population
print(df_borough)

  borough  garden_count  population
0       B           238     2590516
1       M           138     1596273
2       X           115     1379946
3       Q            43     2278029
4       R             7      491133


In [65]:
#active community gardens per 100,000 residents in each borough
ratio = (df_borough.garden_count / df_borough.population * 100000).round(2)

#add ratio to the dataframe and save
df_borough['ratio'] = ratio
df_borough.to_csv('garden_by_borough.csv', index=False)
print(df_borough)

  borough  garden_count  population  ratio
0       B           238     2590516   9.19
1       M           138     1596273   8.65
2       X           115     1379946   8.33
3       Q            43     2278029   1.89
4       R             7      491133   1.43


### Data analysis for zipcodes

In [2]:
import pandas as pd
df_zipcode = pd.read_csv('garden_by_zipcode.csv')

Since I don't know how to scrape websites yet, I'm manually adding demographic data (population, racial makeup, median age and median household income) to the CSV.
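
As an alternative to typing values into the CSV by hand, here is a hedged sketch of joining a separate demographics table on `zipcode`. The `demographics` frame is hypothetical (in practice it might come from a downloaded census file); its population values are copied from the output shown further down, and 11233 is left out on purpose to show how a missing match surfaces.

```python
import pandas as pd

garden_counts = pd.DataFrame(
    {"zipcode": [11207, 10009, 11233], "garden_count": [44, 38, 26]}
)

# Hypothetical demographics table; 11233 is deliberately absent.
demographics = pd.DataFrame(
    {"zipcode": [11207, 10009], "population": [97690, 60000]}
)

# A left join keeps every garden zipcode; validate= raises an error if
# either table lists the same zipcode twice.
merged = garden_counts.merge(
    demographics, on="zipcode", how="left", validate="one_to_one"
)

# Rows whose demographics were not found end up with NaN and can be listed.
missing = merged[merged["population"].isna()]
print(merged)
print(missing["zipcode"].tolist())
```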

In [3]:
#active community gardens per 100,000 residents in each zipcode with a garden
ratio_zipcode = (df_zipcode.garden_count / df_zipcode.population * 100000).round(2)

#add ratio to the dataframe and save
df_zipcode['ratio_zipcode'] = ratio_zipcode
df_zipcode.to_csv('garden_by_zipcode.csv', index=False)
df_zipcode

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,garden_count,zipcode,borough,population,median_household_income,median_age,hispanic_or_latino,white_alone,black_alone,asian_alone,Unnamed: 10,ratio_zipcode
0,0,0,44,11207,B,97690,45616,35.0,34.4,5.1,55.2,1.5,,45.04
1,1,1,38,10009,M,60000,77551,37.0,24.5,50.8,6.8,12.9,,63.33
2,2,2,26,11233,B,85633,52380,34.1,15.9,11.4,67.7,1.3,,30.36
3,3,3,21,11221,B,89728,66923,32.1,32.4,20.0,39.5,4.4,,23.40
4,4,4,20,10027,M,65840,58435,31.0,25.8,26.1,35.3,8.6,,30.38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,86,86,1,10024,M,63916,153177,41.5,11.6,77.4,2.3,5.6,,1.56
87,87,87,1,10033,M,60739,71630,36.9,66.6,24.9,3.7,2.8,,1.65
88,88,88,1,11102,Q,27876,95478,35.0,25.8,48.2,6.2,14.9,,3.59
89,89,89,1,10475,X,44509,56327,46.7,31.2,5.2,58.4,2.2,,2.25


In [5]:
df_zipcode.garden_count.median()
#among zipcodes that have at least one garden, the median number of gardens per zipcode is 3.

3.0

print(df_zipcode.garden_count.value_counts())
#the most common number of gardens for a zipcode that has garden(s) is 1.
#68 zipcodes have a single-digit number of gardens.

For the two calculations above, I can't describe the results as the median or most common number of gardens per zipcode in the city, because this dataset doesn't include zipcodes without a garden.
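
To illustrate why the missing zero-garden zipcodes matter, here is a small sketch that pads the counts with zeros for a full list of zipcodes. The list `all_zipcodes` is an assumption: 99991 and 99992 are made-up placeholder codes, not real NYC data, standing in for zipcodes that have no garden.

```python
import pandas as pd

# Garden counts for three zipcodes taken from this notebook's output.
counts = pd.Series({11207: 44, 10009: 38, 10024: 1}, name="garden_count")

# Hypothetical complete list of zipcodes; 99991 and 99992 are placeholders
# for zipcodes without any garden.
all_zipcodes = [11207, 10009, 10024, 99991, 99992]

# reindex adds the missing zipcodes and fills their count with 0.
full = counts.reindex(all_zipcodes, fill_value=0)

median_with_gardens = counts.median()  # only zipcodes that have a garden
median_all = full.median()             # zero-garden zipcodes included
print(median_with_gardens, median_all)
```

With a real list of every NYC zipcode, the second median would be the citywide figure that the current dataset cannot provide.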

In [6]:
#the 10 zipcodes with the highest numbers of active community gardens
df_zipcode.head(10).to_csv('top_garden_zipcode.csv')

Saving this as a separate csv file for easy datawrapper visualization

In [7]:
#the borough where these top 10 zipcodes are located
df_zipcode.head(10).borough.value_counts()

borough
B    6
M    3
X    1
Name: count, dtype: int64

In [8]:
#There are at least 541 active gardens in NYC, so half is 271. 
print(df_zipcode.query('median_household_income < 52500').garden_count.sum())
print(df_zipcode.query('median_household_income < 52000').garden_count.sum())

289
263


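Here, I'm trying to figure out what kind of neighbourhood most of the community gardens are in, using median household income as the variable. I summed the garden counts for zipcodes below different income cutoffs to see where the total crosses half of all active gardens (about 271). Zipcodes with a median household income under $52,000 account for 263 gardens and those under $52,500 account for 289, so the halfway point falls just below $52,500.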
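
Instead of trying cutoffs one at a time, the same threshold could be found with a running total. This is only a sketch on five rows copied from the table above, so the income it finds applies to this toy subset, not to the full dataset.

```python
import pandas as pd

# Five rows copied from the zipcode table shown earlier.
toy = pd.DataFrame({
    "zipcode": [11207, 10009, 11233, 11221, 10027],
    "garden_count": [44, 38, 26, 21, 20],
    "median_household_income": [45616, 77551, 52380, 66923, 58435],
})

# Sort from lowest to highest income and keep a running total of gardens.
ordered = toy.sort_values("median_household_income").copy()
ordered["cum_gardens"] = ordered["garden_count"].cumsum()

# Half of all gardens in this toy subset.
half = toy["garden_count"].sum() / 2

# First row at which the running total reaches at least half the gardens.
threshold_row = ordered[ordered["cum_gardens"] >= half].iloc[0]
print(threshold_row["median_household_income"])
```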

In [10]:
print(df_zipcode.median_household_income.median())
#The median of median household income for zipcodes with garden(s) is $65,038.

print(df_zipcode.hispanic_or_latino.median())
#The median percentage of the Latino/Hispanic population is around 25.8%.

print(df_zipcode.white_alone.median())
#The median percentage of the white alone population is 20.4%

print(df_zipcode.black_alone.median())
#The median percentage of the black alone population is 23.8%

print(df_zipcode.asian_alone.median())
#The median percentage of the Asian alone population is 5.4%


65038.0
25.8
20.4
23.8
5.4


I'm using the median of the demographic data from all zipcodes with community gardens to approximate a profile of what kind of neighbourhood would have a garden. At the same time, I don't think this is conclusive, because I don't have the demographic data of zipcodes without a garden to include in this calculation. This calculation is better read as a description of the zipcodes that currently have garden(s), not as a predictive profile.

In [128]:
df_top_ten_zipcode = pd.read_csv('top_garden_zipcode.csv')

print(df_top_ten_zipcode.median_household_income.median())
#The median of median household income for these 10 zipcodes is $49,346.

print(df_top_ten_zipcode.hispanic_or_latino.median())
#The median percentage of the Latino/Hispanic population is around 33.4%.

print(df_top_ten_zipcode.white_alone.median())
#The median percentage of the white alone population is 12.25%

print(df_top_ten_zipcode.black_alone.median())
#The median percentage of the black alone population is 38.6%

print(df_top_ten_zipcode.asian_alone.median())
#The median percentage of the Asian alone population is 4.05%

49346.0
33.4
12.25
38.6
4.050000000000001


I'm using the median of the demographic data from the 10 zipcodes with the highest numbers of community gardens (ranging from 17 to 44) to approximate a profile of what a neighbourhood with such a statistic looks like.
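
The two sets of medians could also be put side by side in one table. The sketch below assumes a frame with the same column names as `df_zipcode`; it uses six rows copied from the output above and takes the top 3 instead of the top 10 so that it fits the toy data.

```python
import pandas as pd

# Six rows copied from the zipcode table shown earlier.
toy = pd.DataFrame({
    "zipcode": [11207, 10009, 11233, 10024, 10033, 11102],
    "garden_count": [44, 38, 26, 1, 1, 1],
    "median_household_income": [45616, 77551, 52380, 153177, 71630, 95478],
    "black_alone": [55.2, 6.8, 67.7, 2.3, 3.7, 6.2],
})

cols = ["median_household_income", "black_alone"]

# nlargest picks the rows with the most gardens without relying on file order.
top = toy.nlargest(3, "garden_count")

# One row per variable, one column per group of zipcodes.
summary = pd.DataFrame({
    "all_zipcodes": toy[cols].median(),
    "top_3": top[cols].median(),
})
print(summary)
```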