# Capstone Project - The Battle of the Neighborhoods (Week 2)
## Applied Data Science Capstone by IBM/Coursera

## Vegan Restaurants to fight Climate Change

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Climate change poses a threat to the security of our food supply. Rising temperatures, increased rain and more extreme weather events will all have an impact on crops and livestock. Food production also contributes to global warming. Agriculture - together with forestry - accounts for about a quarter of greenhouse gas emissions. Livestock rearing contributes to global warming through the methane gas the animals produce, but also via deforestation to expand pastures, for example. Feeding massive amounts of grain and water to farmed animals and then killing them and processing, transporting, and storing their flesh is extremely energy-intensive. And forests—which absorb greenhouse gases—are cut down in order to supply pastureland and grow crops for farmed animals. Finally, the animals themselves and all the manure that they produce release even more greenhouse gases into our atmosphere.

The environmental impact of meat production is important to many vegetarians and vegans. In the US, vegan burger patties are made from plant-based meat substitutes to taste like the real thing thanks to an iron-rich compound called heme. Eating vegan foods rather than animal-based ones is one of the most effective ways to reduce your carbon footprint. A University of Chicago study even showed that going vegan can reduce one's carbon footprint more effectively than switching from a conventional car to a hybrid.

In this project we will attempt to find one or several optimal locations for a Vegan restaurant. Specifically, this report is targeted at stakeholders interested in opening a Vegan restaurant in one of the populous cities of the two most populous U.S. states, California and Texas.

Since there are lots of restaurants in the cities within these two states, we will try to detect locations that are not already crowded with restaurants. We are also particularly interested in areas with hardly any Vegan restaurants in the vicinity.

We will use data science methods to identify a few of the most promising neighborhoods based on these criteria. The population of each neighborhood within the focus region will then be clearly presented so that stakeholders can choose the best possible location(s). This is by no means a complete study that considers all possible factors in determining the best possible locations.

## Data <a name="data"></a>

We will use the Foursquare API to collect data about the locations of Vegan places in 4 U.S. metro areas: Houston, TX; Dallas, TX; San Francisco, CA; and Los Angeles, CA. These are the two most populous metro areas in each of Texas and California.

Based on the definition of our problem, the factors that will influence our decision are:

* Number of existing Vegan restaurants in the metro area
* Mean distance to the Vegan restaurants in the metro area
* Population of the neighborhoods that have the fewest Vegan restaurants

The following data sources will be needed to extract/generate the required information:

* The number of Vegan restaurants and their locations in every neighborhood of the metro will be obtained using the **Foursquare API**

* The zip codes of the neighborhoods and population information will be obtained using open data available at **https://catalog.data.gov/dataset**

### Import Libraries

In [28]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis; pd.json_normalize is used below to flatten JSON
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests # library to handle HTTP requests
import matplotlib.pyplot as plt # plotting library
import folium # map rendering library
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
print('Libraries imported.')

Libraries imported.


### Foursquare Information

In [29]:
CLIENT_ID = '55B0MPP35REWKMZXF3B4A0PKIWJSEOL4WL5MKGUOBTDBAJIA' # My Foursquare ID
CLIENT_SECRET = 'SSL4STGZJU4BRSK5UPPJDDRVZARIJAYFZ02IZFJQVGLQGNS2' # My Foursquare Secret
VERSION = '20210417' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

Your credentials:
CLIENT_ID: 55B0MPP35REWKMZXF3B4A0PKIWJSEOL4WL5MKGUOBTDBAJIA


### Define the metro areas and leverage foursquare data

In [None]:
LIMIT = 500 # Number of venues requested per query
metros = ['Houston, TX', 'Dallas, TX', 'San Francisco, CA', 'Los Angeles, CA'] # The main city of each metro area is chosen
results = {}
for metro in metros:
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'near': metro,
        'limit': LIMIT,
        'categoryId': '4bf58dd8d48988d1d3941735'} # VEGAN/VEGETARIAN PLACE CATEGORY ID
    # Passing params lets requests URL-encode values such as 'Houston, TX'
    results[metro] = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()

## Methodology <a name="methodology"></a>

The objectives here are to assess which metro area has the lowest density of established Vegan restaurants and then to explore that area further to identify possible neighborhoods for a new restaurant. For each of the four metro areas, the main city within the metro is chosen initially, for simplicity with Foursquare. For example, "San Francisco, CA" is chosen in lieu of "San Francisco-Oakland-Berkeley, CA", and "Dallas, TX" in lieu of "Dallas-Fort Worth, TX". The number of Vegan places in the main city is assumed to be an indicator of the number of Vegan places in the entire metro region.

The Foursquare API is used via the venues explore query. The near parameter is used to get venues in each metro. The category ID 4bf58dd8d48988d1d3941735, which is available for Vegan/Vegetarian places, is assumed to identify Vegan restaurants. Foursquare limits us to a maximum of 100 venues per query. The request was repeated for each of the 4 studied metros to retrieve their top 100 venues. The name and coordinates of each venue were saved from the results and plotted on a map for visual inspection.
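As a rough sketch of how such a request can be assembled, the snippet below builds the explore URL with the standard library so that the near value is URL-encoded. The helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration; the endpoint, parameter names and category ID are the ones used in this notebook.

```python
from urllib.parse import urlencode

# Endpoint and category ID as used in this notebook
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'
VEGAN_CATEGORY_ID = '4bf58dd8d48988d1d3941735'  # Vegan/Vegetarian place

def build_explore_url(client_id, client_secret, version, near, limit=100):
    """Hypothetical helper: return the explore URL for one city."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'near': near,
        'limit': limit,
        'categoryId': VEGAN_CATEGORY_ID,
    }
    # urlencode escapes the comma and space in values such as 'Dallas, TX'
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials, not real ones
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20210417', 'Dallas, TX')
print(url)
```

The resulting URL can be passed to `requests.get(url).json()` in the same way as in the cells of this notebook.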

Next, to get an indicator of the density of Vegan places, the center coordinate of the venues was calculated as the mean of their latitude and longitude values. Then the Euclidean distance from each venue to this mean coordinate was calculated, and these distances were averaged. That average is the indicator: the mean distance to the mean coordinate. The smaller the value, the more tightly clustered the venues are.
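The indicator can be sketched in a few lines of NumPy. The function name `mean_distance_to_centroid` and the sample coordinates are assumptions for illustration only; they are not real venue data.

```python
import numpy as np

def mean_distance_to_centroid(coords):
    """Return the centroid and the mean Euclidean distance (in degrees) to it."""
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)                          # mean latitude, mean longitude
    distances = np.linalg.norm(coords - centroid, axis=1)   # one distance per venue
    return centroid, distances.mean()

# Hypothetical (lat, lng) pairs forming a square, not real venues
sample = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
centroid, spread = mean_distance_to_centroid(sample)
print(centroid, spread)
```

Because the distance is measured in degrees, and a degree of longitude covers less ground than a degree of latitude away from the equator, the value is best read as a relative indicator for comparing metros rather than as a physical distance.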

Next, the metro that is least densely populated with Vegan places is determined. This will be the focus area for further analysis. This project takes into account only population levels and assumes that the more people there are, the greater the demand for Vegan food. The zip codes, neighborhood names, coordinates and population information for this metro region are obtained. Each neighborhood within the largest county in the metro is analyzed in terms of its Vegan places and population. The most populated neighborhoods with the fewest Vegan places are then identified.

### Obtain current established Vegan places and densities in the metro areas of interest

In [None]:
df_venues = {}
for metro in metros:
    # Flatten the nested JSON items of each response into a dataframe
    venues = pd.json_normalize(results[metro]['response']['groups'][0]['items'])
    # Keep only name, address and coordinates, as an independent copy
    df_venues[metro] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']].copy()
    df_venues[metro].columns = ['Name', 'Address', 'Lat', 'Lng']

In [None]:
maps = {}
for metro in metros:
    metro_lat = np.mean([results[metro]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[metro]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    metro_lng = np.mean([results[metro]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[metro]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[metro] = folium.Map(location=[metro_lat, metro_lng], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[metro]['Lat'], df_venues[metro]['Lng'], df_venues[metro]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[metro])  
    print(f"Total number of vegan places in {metro} = ", results[metro]['response']['totalResults'])
    print("Established Places")

### Create a map of the Vegan places for all metro areas of interest

In [None]:
maps[metros[0]]

In [None]:
maps[metros[1]]

In [None]:
maps[metros[2]]

In [None]:
maps[metros[3]]

### Focus Area Determination

We can see that San Francisco and Los Angeles are the cities most densely populated with established Vegan places. Dallas is the least dense.

To measure the density, some basic statistics will be obtained. First, the mean location of the Vegan places is obtained; it should be near most of them if they are really dense, or far from them if not. Then the average distance from the venues to the mean coordinates is calculated.

In [None]:
maps = {}
for metro in metros:
    metro_lat = np.mean([results[metro]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        results[metro]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    metro_lng = np.mean([results[metro]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        results[metro]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[metro] = folium.Map(location=[metro_lat, metro_lng], zoom_start=11)
    venues_mean_coor = [df_venues[metro]['Lat'].mean(), df_venues[metro]['Lng'].mean()] 
    # add markers to map
    for lat, lng, label in zip(df_venues[metro]['Lat'], df_venues[metro]['Lng'], df_venues[metro]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[metro])
        folium.PolyLine([venues_mean_coor, [lat, lng]], color="green", weight=1.5, opacity=0.5).add_to(maps[metro])
    
    label = folium.Popup("Mean Co-ordinate", parse_html=True)
    folium.CircleMarker(
        venues_mean_coor,
        radius=10,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(maps[metro])

    print(metro)
    print("Mean Distance from Mean coordinates")
    # Euclidean distance (in degrees) from every venue to the mean coordinate, then averaged
    coords = df_venues[metro][['Lat', 'Lng']].values
    print(np.linalg.norm(coords - venues_mean_coor, axis=1).mean())

In [None]:
maps[metros[0]]

In [None]:
maps[metros[1]]

In [None]:
maps[metros[2]]

In [None]:
maps[metros[3]]

## Analysis <a name="analysis"></a>

We can see that Dallas, TX has the fewest Vegan places, and the places in this metro area are also farther apart. Houston has the next fewest.

Based on this, it is assumed that the Dallas-Fort Worth (DFW) metroplex has comparatively fewer Vegan restaurants than the other 3 metropolitan areas.

Let's explore the DFW metro area neighborhoods to determine the population levels and the location of currently established Vegan places.

We need to import the zip codes for the DFW metro along with its cities or neighborhoods and their coordinates.

In [None]:
# Import zip codes, latitude/longitude and cities/neighborhoods of Texas
raw_zip = pd.read_csv('https://raw.githubusercontent.com/amarvelous/Coursera_Capstone/main/us-zip-code-latitude-and-longitude.csv')
raw_zip.head()

### Data Cleaning/Processing
The raw Texas data is cleaned up to create a new dataframe containing only the DFW counties and their neighborhoods.

In [None]:
# Drop unused columns
raw_zip.drop(['zcta','parent_zcta','density','county_fips','all_county_weights','imprecise','military','timezone'], axis=1, inplace=True)
raw_zip=raw_zip[['state_name', 'state_id', 'county_name', 'city', 'zip', 'lat', 'lng', 'population']]

# Rename columns
raw_zip.rename(columns={'lat':'latitude','lng':'longitude','city':'neighborhood'}, inplace=True)
raw_zip.rename(columns={'county_name':'city','state_name':'state'}, inplace=True)
raw_zip.head()

DFW is mainly covered by 4 counties with sizable populations: Dallas, Tarrant, Collin and Denton.

In [None]:
# New dataframe with DFW neighborhoods (after the rename above, 'city' holds the county name)
dal_zip = raw_zip[raw_zip.city == 'Dallas']
Tar_zip = raw_zip[raw_zip.city == 'Tarrant']
Col_zip = raw_zip[raw_zip.city == 'Collin']
Dent_zip = raw_zip[raw_zip.city == 'Denton']

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the four county frames
DFW_zip = pd.concat([dal_zip, Tar_zip, Col_zip, Dent_zip])
print(DFW_zip.dtypes)
print(DFW_zip.shape)
DFW_zip

In [None]:
Dallas_pop = DFW_zip.loc[DFW_zip['city'] == 'Dallas', 'population'].sum()
print('Dallas County population is', Dallas_pop)

Tarrant_pop = DFW_zip.loc[DFW_zip['city'] == 'Tarrant', 'population'].sum()
print('Tarrant County population is', Tarrant_pop)

Collin_pop = DFW_zip.loc[DFW_zip['city'] == 'Collin', 'population'].sum()
print('Collin County population is', Collin_pop)

Denton_pop = DFW_zip.loc[DFW_zip['city'] == 'Denton', 'population'].sum()
print('Denton County population is', Denton_pop)

In [None]:
DFW_zip.groupby('city')['population'].sum().plot.bar(figsize=(10,5), color='r')
plt.title('Population per DFW County', fontsize = 20)
plt.ylim([0, 2500000])
plt.xlabel('County', fontsize = 15)
plt.ylabel('Population (M)',fontsize = 15)
plt.xticks(rotation = 'horizontal')
plt.show()

Among the four DFW counties, Dallas County is the most populous, with over 2.4 million residents. Dallas County is the focus area for this project. Similar studies can be extended to the other counties of the metro.

Next, let's render a map of the neighborhoods in Dallas County. The geocoder instance is given the user agent name dal_explorer.

In [None]:
address = 'Dallas, TX'
geolocator = Nominatim(user_agent="dal_explorer")
location = geolocator.geocode(address)
d_latitude = location.latitude
d_longitude = location.longitude
print('The geographical coordinate of Dallas County is {}, {}.'.format(d_latitude, d_longitude))

Map of Dallas County with neighborhoods superimposed.

In [None]:
# Map of Dallas county using latitude and longitude
map_dal = folium.Map(location=[d_latitude, d_longitude], zoom_start=10)

# Markers
for lat, lng, city, neighborhood in zip(dal_zip['latitude'], dal_zip['longitude'], dal_zip['city'], dal_zip['neighborhood']):
    label = '{}, {}'.format(neighborhood, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dal)  
    
map_dal

Let's obtain the population levels in the Dallas County neighborhoods

In [None]:
# Neighborhoods (cities) of Dallas County whose zip-level populations are summed
neighborhoods = ['Addison', 'Dallas', 'Desoto', 'Irving', 'Richardson', 'Rowlett', 'Carrollton',
                 'Coppell', 'Garland', 'Sachse', 'Grand Prairie', 'Cedar Hill', 'Duncanville',
                 'Hutchins', 'Lancaster', 'Mesquite', 'Seagoville', 'Wilmer', 'Balch Springs', 'Sunnyvale']

for name in neighborhoods:
    # Sum the population over all zip codes that belong to this neighborhood
    total_pop = dal_zip.loc[dal_zip['neighborhood'] == name, 'population'].sum()
    print(f'{name} population is', total_pop)

Within Dallas County, the most populous neighborhoods (or cities), those with 50K or more residents, are chosen for further analysis. These neighborhoods are Dallas, Garland, Grand Prairie, Irving, Mesquite, Richardson, and Rowlett.
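The 50K filter can also be expressed as a single groupby. The sketch below is a minimal illustration using a toy dataframe; the name `toy_zip` and its values are hypothetical and stand in for the zip-level rows of `dal_zip`.

```python
import pandas as pd

# Hypothetical zip-level rows; the notebook itself uses dal_zip
toy_zip = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'C', 'C'],
    'population': [30000, 25000, 12000, 40000, 10000],
})

THRESHOLD = 50_000  # "50K or more residents"

# Sum zip-level populations per neighborhood, then keep the populous ones
totals = toy_zip.groupby('neighborhood')['population'].sum()
populous = totals[totals >= THRESHOLD].sort_values(ascending=False)
print(populous)
```

With the real data, replacing `toy_zip` by `dal_zip` should give the population totals per neighborhood in one step, under the assumption that the column names match those created in the cleaning cell.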

In [None]:
LIMIT = 500 # Number of venues requested per query
d_cities = ['Dallas, TX', 'Garland, TX', 'Grand Prairie, TX', 'Irving, TX', 'Mesquite, TX', 'Richardson, TX', 'Rowlett, TX']
d_results = {}
for d_city in d_cities:
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'near': d_city,
        'limit': LIMIT,
        'categoryId': '4bf58dd8d48988d1d3941735'} # VEGAN/VEGETARIAN PLACE CATEGORY ID
    # Passing params lets requests URL-encode values such as 'Grand Prairie, TX'
    d_results[d_city] = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()

In [None]:
df_dvenues = {}
for d_city in d_cities:
    # Flatten the nested JSON items of each response into a dataframe
    dvenues = pd.json_normalize(d_results[d_city]['response']['groups'][0]['items'])
    # Keep only name, address and coordinates, as an independent copy
    df_dvenues[d_city] = dvenues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']].copy()
    df_dvenues[d_city].columns = ['Name', 'Address', 'Lat', 'Lng']

In [None]:
maps = {}
for d_city in d_cities:
    dcity_lat = np.mean([d_results[d_city]['response']['geocode']['geometry']['bounds']['ne']['lat'],
                        d_results[d_city]['response']['geocode']['geometry']['bounds']['sw']['lat']])
    dcity_lng = np.mean([d_results[d_city]['response']['geocode']['geometry']['bounds']['ne']['lng'],
                        d_results[d_city]['response']['geocode']['geometry']['bounds']['sw']['lng']])
    maps[d_city] = folium.Map(location=[dcity_lat, dcity_lng], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_dvenues[d_city]['Lat'], df_dvenues[d_city]['Lng'], df_dvenues[d_city]['Name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(maps[d_city])  
    print(f"Total number of vegan places in {d_city} = ", d_results[d_city]['response']['totalResults'])
    print("Established Places")

## Results and Discussion <a name="results"></a>

Per the above analysis, among the 4 most populous metro areas within the two most populous U.S. states, the Dallas, TX metro area has the fewest Vegan restaurants. The Dallas metro area, or DFW, is mainly covered by 4 counties with sizable populations. Dallas County is the most populous of these 4 counties. With Dallas County as the focus, the most populous neighborhoods/cities are taken to be those with 50K or more residents. This filtered the list down to 7 neighborhoods: Dallas, Irving, Richardson, Rowlett, Garland, Grand Prairie and Mesquite. From the above population and restaurant density analysis, we arrive at the following results:

* Dallas has 51 established Vegan places for its population of about 1.2M.
* Garland has 48 established Vegan places for its population of about 227K.
* Grand Prairie has 58 established Vegan places for its population of about 170K.
* Irving has 65 established Vegan places for its population of about 218K.
* Mesquite has 35 established Vegan places for its population of about 141K. 
* Richardson has 47 established Vegan places for its population of about 78K.
* Rowlett has 11 established Vegan places for its population of about 55K.

Dallas has 1 Vegan place for approximately every 23K residents. Garland has 1 Vegan place for approximately every 4.7K residents. Grand Prairie has 1 Vegan place for approximately every 3K residents, as does Irving. Mesquite has 1 Vegan place for approximately every 4K residents. Richardson has 1 Vegan place for approximately every 1.7K residents. Finally, Rowlett has 1 Vegan place for approximately every 5K residents.
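The ratios can be reproduced with a short calculation. The sketch below uses the approximate counts and populations listed above; the variable names are chosen for illustration only.

```python
# (established Vegan places, approximate population) as listed above
stats = {
    'Dallas': (51, 1_200_000),
    'Garland': (48, 227_000),
    'Grand Prairie': (58, 170_000),
    'Irving': (65, 218_000),
    'Mesquite': (35, 141_000),
    'Richardson': (47, 78_000),
    'Rowlett': (11, 55_000),
}

# Residents per Vegan place for each neighborhood
residents_per_place = {city: pop / places for city, (places, pop) in stats.items()}

# Most residents per Vegan place first, i.e. the least served neighborhoods
ranking = sorted(residents_per_place, key=residents_per_place.get, reverse=True)
for city in ranking:
    print(f'{city}: about {residents_per_place[city]:,.0f} residents per Vegan place')
```

Sorting by residents per Vegan place puts Dallas, Rowlett, Garland and Mesquite at the top of the list.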

## Conclusion <a name="conclusion"></a>

Based on an analysis of the population and the number of established Vegan places in the 4 largest metros within the two most populous U.S. states, the Dallas, TX metro provides opportunities for stakeholders to open a Vegan restaurant and encourage Vegan food among residents to help fight climate change. Within the DFW metro, the top 4 neighborhoods in the Dallas County area for opening a new Vegan food business are Dallas, Rowlett, Garland and Mesquite.

This project took into account only population levels and assumed that the more people there are, the greater the demand for Vegan food. Further analysis can be conducted taking into account the education and employment levels in the neighborhoods, as well as income levels and age groups.