<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Relevant-imports" data-toc-modified-id="Relevant-imports-1">Relevant imports</a></span></li><li><span><a href="#Useful-datasets" data-toc-modified-id="Useful-datasets-2">Useful datasets</a></span></li><li><span><a href="#I.-Visualizing-Restaurants" data-toc-modified-id="I.-Visualizing-Restaurants-3">I. Visualizing Restaurants</a></span><ul class="toc-item"><li><span><a href="#Restaurants-on-a-map" data-toc-modified-id="Restaurants-on-a-map-3.1">Restaurants on a map</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Heatmap-showing-restaurants-locations-concentration" data-toc-modified-id="Heatmap-showing-restaurants-locations-concentration-3.1.0.0.1">Heatmap showing restaurants locations concentration</a></span></li><li><span><a href="#Number-of-restaurants-by-community-area" data-toc-modified-id="Number-of-restaurants-by-community-area-3.1.0.0.2">Number of restaurants by community area</a></span></li></ul></li><li><span><a href="#B.-Grocery-stores-by-community-area" data-toc-modified-id="B.-Grocery-stores-by-community-area-3.1.0.1">B. Grocery stores by community area</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Relevant imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import math
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import datetime

from autocorrect import Speller

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

from datetime import date as dt

import folium
from folium import plugins
from folium.plugins import HeatMap
#http://geopandas.org/install.html
import geopandas as gpd

from shapely.geometry import shape, Point

import requests
from pandas import json_normalize
import pandas_profiling

import warnings
warnings.filterwarnings('ignore')

# Useful datasets

In [4]:
socioeconomic_factors = pd.read_csv("socioeconomic_factors.csv")
inspections = pd.read_pickle('datasets/cleaned_inspections.pickle')

# I. Visualizing Restaurants

Let us first focus on one of the most important facility types in our dataset: restaurants.

## Restaurants on a map

Let's display the different restaurants in our dataset on a map.

In [5]:
# Keep only restaurants for the map
restaurant_locations = inspections[inspections['Facility Type']=='Restaurant']
# Reduce the data size: keep only restaurants that were inspected in 2018
restaurant_locations = restaurant_locations[restaurant_locations['Inspection Date'].dt.year==2018]
restaurant_locations = restaurant_locations[['DBA Name', 'Latitude','Longitude']].drop_duplicates()
restaurant_locations_array = np.array(restaurant_locations)

In [177]:
locations_map = folium.Map(location=[41.86087, -87.608945], zoom_start=10)


# Add one marker per restaurant, with the restaurant name in the popup
for name, latitude, longitude in restaurant_locations_array:
    folium.Marker(
        location=[latitude, longitude],
        popup=name,
        icon=folium.Icon(color='red', icon='info-sign')).add_to(locations_map)
#locations_map.save('./docs/restaurants_map.html')

**The map can be accessed [here](https://aaag97.github.io/ada-2019-project-databusters/restaurants_map.html)**

In [178]:
len(inspections['Location'].unique())

16812

In [179]:
len(inspections['DBA Name'].unique())

27420

We notice the following:

* On the map, the markers are stacked and condensed into the same regions, so this type of display does not meet our needs.
* Plotting one marker per establishment is not readable at this scale (the dataset contains 27,420 distinct establishment names). It will probably be useful later on, when we filter and visualize only a subset of the restaurants.
* The locations we have are apparently not very precise: there are fewer distinct locations (16,812) than distinct establishment names (27,420).

For these last two reasons, it makes more sense to show a heatmap, which displays the concentration of restaurants (areas with a high density of restaurants).
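To quantify how much the markers overlap, one option is to count how many distinct establishments share each coordinate pair. The sketch below is only an illustration: it uses a small made-up frame with the same column names as `restaurant_locations`, and the rows are not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only: same columns as restaurant_locations, made-up values.
toy_locations = pd.DataFrame({
    "DBA Name": ["A CAFE", "B GRILL", "C DINER", "D PIZZA"],
    "Latitude": [41.88, 41.88, 41.88, 41.95],
    "Longitude": [-87.63, -87.63, -87.63, -87.70],
})

# Number of distinct establishments recorded at each coordinate pair.
per_point = (
    toy_locations
    .groupby(["Latitude", "Longitude"])["DBA Name"]
    .nunique()
    .rename("nbr_establishments")
    .reset_index()
)

# Coordinates shared by several establishments produce stacked markers.
stacked = per_point[per_point["nbr_establishments"] > 1]
print(stacked)
```

Applied to the real data, a large share of rows in `stacked` would support the observation that many establishments map to the same point.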

##### Heatmap showing restaurants locations concentration

In [180]:
locations_heatmap = folium.Map([41.86087, -87.608945], zoom_start=11)

# List comprehension to build the list of [latitude, longitude] pairs
heat_data = [[row['Latitude'],row['Longitude']] for index, row in restaurant_locations.iterrows()]

# Plot it on the map
HeatMap(heat_data, radius=17).add_to(locations_heatmap)

# Display the map
#locations_heatmap.save('./docs/restaurants_heatmap.html')
#locations_heatmap

<folium.plugins.heat_map.HeatMap at 0x1c4fd00ef0>

**The map can be accessed [here](https://aaag97.github.io/ada-2019-project-databusters/restaurants_heatmap.html)**

As expected, there are many more restaurants in the city center of Chicago.
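The list of coordinates passed to `HeatMap` can also be built without `iterrows`, while dropping rows whose coordinates are missing (missing values may be rejected by the plugin). This is a minimal sketch, assuming the frame has `Latitude` and `Longitude` columns as above; `build_heat_data` is a hypothetical helper name and the rows are made up.

```python
import pandas as pd

def build_heat_data(locations: pd.DataFrame) -> list:
    """Return [[lat, lon], ...] for the rows that have both coordinates."""
    coords = locations[["Latitude", "Longitude"]].dropna()
    return coords.values.tolist()

# Illustrative rows only; the second one has no latitude and is dropped.
toy = pd.DataFrame({
    "DBA Name": ["A CAFE", "B GRILL", "C DINER"],
    "Latitude": [41.88, None, 41.95],
    "Longitude": [-87.63, -87.65, -87.70],
})
heat_data = build_heat_data(toy)
print(heat_data)
```

The resulting list has the same shape as the `heat_data` built in the cell above and could be passed to `HeatMap` in the same way.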

##### Number of restaurants by community area

Next, we display the community areas as a choropleth colored by the number of restaurants in each area.

The community area boundaries come from:
https://www.chicago.gov/city/en/depts/doit/dataset/boundaries_-_communityareas.html

In [181]:
fp = "./boundaries_processed.geojson"
boundaries_community_areas = gpd.read_file(fp)
boundaries_community_areas.head()

Unnamed: 0,area_number,community,geometry
0,0,ALBANY PARK,"MULTIPOLYGON (((-87.70404 41.97355, -87.70403 ..."
1,1,ARCHER HEIGHTS,"MULTIPOLYGON (((-87.71437 41.82604, -87.71436 ..."
2,2,ARMOUR SQUARE,"MULTIPOLYGON (((-87.62917 41.84556, -87.62947 ..."
3,3,ASHBURN,"MULTIPOLYGON (((-87.71255 41.75734, -87.71252 ..."
4,4,AUBURN GRESHAM,"MULTIPOLYGON (((-87.63990 41.75615, -87.63990 ..."


As a sanity check, we verify that the community areas in the inspection data (obtained with geopy) correspond to the community areas published on the City of Chicago portal.

In [182]:
# We care about restaurants only
restaurants_intensity = inspections[inspections['Facility Type']=='Restaurant']
# We focus on the year 2018
restaurants_intensity = restaurants_intensity[restaurants_intensity['Inspection Date'].dt.year==2018]
# Count distinct restaurant names per community area (named aggregation; dict renaming is no longer supported by pandas)
restaurants_intensity = restaurants_intensity[['Community Area','DBA Name']].drop_duplicates().groupby('Community Area')['DBA Name'].agg(nbr_restaurants=len)
restaurants_intensity.reset_index(inplace=True)
restaurants_intensity['Community Area'] = restaurants_intensity['Community Area'].str.upper().str.strip()
restaurants_intensity

Unnamed: 0,Community Area,nbr_restaurants
0,ALBANY PARK,118
1,ARCHER HEIGHTS,42
2,ARMOUR SQUARE,94
3,ASHBURN,53
4,AUBURN GRESHAM,49
...,...,...
71,WEST LAWN,63
72,WEST PULLMAN,11
73,WEST RIDGE,176
74,WEST TOWN,314


One community area does not appear in our dataset for the parameters we chose (restaurants in 2018). Let's add it to our data with nbr_restaurants = 0.

In [183]:
missing_1 = set(boundaries_community_areas['community'])-set(restaurants_intensity['Community Area'])
missing_1 = pd.DataFrame(missing_1, columns=['Community Area'])
missing_1['nbr_restaurants'] = 0
print("Community areas missing in the inspection dataset: ")
missing_1

Community areas missing in the inspection dataset: 


Unnamed: 0,Community Area,nbr_restaurants
0,RIVERDALE,0


We can now merge the missing community area into our data and display the resulting nbr_restaurants.

In [184]:
# DataFrame.append was removed in recent pandas versions; pd.concat is the equivalent
restaurants_intensity = pd.concat([restaurants_intensity, missing_1])
restaurants_intensity.sort_values(by=['Community Area'], inplace=True)
restaurants_intensity.reset_index(inplace=True,drop=True)
restaurants_intensity['Community Area'] = restaurants_intensity['Community Area'].astype('str')
restaurants_intensity

Unnamed: 0,Community Area,nbr_restaurants
0,ALBANY PARK,118
1,ARCHER HEIGHTS,42
2,ARMOUR SQUARE,94
3,ASHBURN,53
4,AUBURN GRESHAM,49
...,...,...
72,WEST LAWN,63
73,WEST PULLMAN,11
74,WEST RIDGE,176
75,WEST TOWN,314


In [185]:
# Initialize the map:
restaurants_by_community = folium.Map([41.86087, -87.608945], zoom_start=11, tiles = "cartodbpositron")
 
# Add the colors for the choropleth (folium.Choropleth replaces the deprecated Map.choropleth):
folium.Choropleth(
 geo_data=fp,
 data=restaurants_intensity,
 columns=['Community Area', 'nbr_restaurants'],
 key_on='feature.properties.community',
 fill_color='YlGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Number of restaurants by community'
).add_to(restaurants_by_community)
folium.LayerControl().add_to(restaurants_by_community)
 
# Save to html
#restaurants_by_community.save('./docs/restaurants_by_community.html')
#restaurants_by_community

<folium.map.LayerControl at 0x1c279283c8>

**The map can be accessed [here](https://aaag97.github.io/ada-2019-project-databusters/restaurants_by_community.html)**

The results are consistent with the heatmap: restaurants are concentrated around the city center of Chicago.

Another pattern also emerges: the northern community areas of the city seem to contain more restaurants than the southern ones.

#### B. Grocery stores by community area

After some research, we learned about a large-scale public health problem in Chicago known as **food deserts**. A report by the Illinois Advisory Committee to the United States Commission on Civil Rights states that some community areas in Chicago lack supermarkets, grocery stores and healthy food in general. Restricted access to healthy food leads to higher rates of chronic illnesses such as diabetes, hypertension and cardiovascular disease. [source](https://www.usccr.gov/pubs/docs/IL-FoodDeserts-2011.pdf)

Furthermore, the report states that these food deserts are located almost exclusively in African American neighborhoods, which makes the issue a civil rights one beyond its public health dimension.

**In the next part, we will try to see if we can visualize the food deserts using the food inspection dataset.**

In [186]:
# We care about grocery stores only (facility type 'Market' in the cleaned data)
groceries_intensity = inspections[inspections['Facility Type']=='Market']
# We focus on the year 2018
groceries_intensity = groceries_intensity[groceries_intensity['Inspection Date'].dt.year==2018]
# Count distinct store names per community area (named aggregation; dict renaming is no longer supported by pandas)
groceries_intensity = groceries_intensity[['Community Area','DBA Name']].drop_duplicates().groupby('Community Area')['DBA Name'].agg(nbr_grocery=len)
groceries_intensity.reset_index(inplace=True)
groceries_intensity['Community Area'] = groceries_intensity['Community Area'].str.upper().str.strip()
groceries_intensity.head()

Unnamed: 0,Community Area,nbr_grocery
0,ALBANY PARK,25
1,ARCHER HEIGHTS,1
2,ARMOUR SQUARE,11
3,ASHBURN,4
4,AUBURN GRESHAM,23


In [187]:
# Initialize the map:
groceries_by_community = folium.Map([41.86087, -87.608945], zoom_start=11, tiles = "cartodbpositron")
 
# Add the colors for the choropleth (folium.Choropleth replaces the deprecated Map.choropleth):
folium.Choropleth(
 geo_data=fp,
 data=groceries_intensity,
 columns=['Community Area', 'nbr_grocery'],
 key_on='feature.properties.community',
 fill_color='YlGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Number of grocery stores by community'
).add_to(groceries_by_community)
folium.LayerControl().add_to(groceries_by_community)
 
# Save to html
#groceries_by_community.save('./docs/groceries_by_community.html')
#groceries_by_community

<folium.map.LayerControl at 0x1c27928710>

**The map can be accessed [here](https://aaag97.github.io/ada-2019-project-databusters/groceries_by_community.html)**

"Chicago’s segregation is certainly legendary, with the North and South sides divided by class and race. To keep it stereotypically simple: The North Side is white, the South Side is black." [source](https://chicago.eater.com/2018/12/13/18138387/chicago-magazine-john-kessler-food-scene-racism-immigration-food)

We do see a difference in the concentration of grocery stores between the North and the South: the southern community areas are much more sparsely served.

* **For the next milestone, we will visualize the evolution of this distribution across years using timestamps.**
* **We will also plot the density of markets per 1,000 inhabitants in each community area, which seems fairer than plotting the absolute number of markets.**
* **Our dataset also contains health-related establishments serving food, and the health indicators dataset provides rates of diabetes and other diet-related diseases. These can be interesting to explore against what we get from the markets visualization.**
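The per-capita density mentioned in the list above could be computed along the following lines. This is only a sketch under stated assumptions: the store counts are the first two rows of the table shown earlier, while the `population` column and its values are placeholders, not real census figures. A real version would need a population figure per community area from an external source.

```python
import pandas as pd

# Store counts copied from the first rows of the grocery table above.
counts = pd.DataFrame({
    "Community Area": ["ALBANY PARK", "ARCHER HEIGHTS"],
    "nbr_grocery": [25, 1],
})

# PLACEHOLDER populations (hypothetical values, for illustration only).
population = pd.DataFrame({
    "Community Area": ["ALBANY PARK", "ARCHER HEIGHTS"],
    "population": [50000, 10000],
})

# Join counts with population and compute stores per 1,000 inhabitants.
density = counts.merge(population, on="Community Area", how="left")
density["grocery_per_1000"] = 1000 * density["nbr_grocery"] / density["population"]
print(density)
```

The `grocery_per_1000` column could then replace `nbr_grocery` in the `columns` argument of the choropleth.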