# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Andrew Rodgers

## Introduction
Akron, Ohio already has many restaurants, but our client wants to open a new Chinese restaurant in the city. In addition to its many neighborhoods, Akron has a large population of commuters during the week, most of whom come from the cities and towns immediately outside the city.  

Our client wants the new restaurant to be as close to as many Akron residents as possible to maximize potential customers. The restaurant must also be in an urban area, so not in the middle of a highly residential neighborhood. Additionally, he wants it to be near other restaurants, preferably close to another Chinese restaurant, to increase competition.  

We will use data science, along with some cool math, to suggest a few attractive locations.

## Data
To find a possible solution for the client, we will use data from several sources: 
* 2017 zip code population data (US Census Bureau)
* Zip codes in and around Akron (https://www.zipmap.net/Ohio/Summit_County/Akron.htm)
  * Geographic centers of those zip codes (https://gist.github.com/erichurst/7882666)
* Foursquare API to find Chinese restaurants in Akron

Akron has distinct neighborhoods, but they are small or have poorly defined boundaries. To avoid this problem, we use the zip codes of Akron and its surrounding suburbs instead. Since zip codes are oddly shaped, we use the geographic center of each zip code and treat it as a point population. Then, we can use a dot product to measure each potential location's attractiveness. The attractiveness dot product for a given point **P** is equal to:  

<img src="https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/sum.png?raw=true" width="400"/>
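
As a rough illustration of this measure, the sketch below follows the weighting used in the **R** code later in this notebook (distance raised to the power 0.1, multiplied by population and by 10^-6). The function name `attractiveness` and the sample points are hypothetical, and the later cube-and-scale step in the **R** code is omitted because it does not change which points score lowest.

```python
from math import hypot

def attractiveness(px, py, zips, exponent=0.1, scale=1e-6):
    """Population-weighted distance score for a candidate point (lower is better).

    zips is a list of (x, y, population) tuples in planar coordinates (meters).
    The exponent and scale defaults mirror the R analysis later in this notebook.
    """
    total = 0.0
    for x, y, population in zips:
        total += hypot(px - x, py - y) ** exponent * population * scale
    return total

# Hypothetical zip code centers as (x, y, population); not real Akron data
zips = [(0.0, 0.0, 20000), (3000.0, 0.0, 10000), (0.0, 4000.0, 5000)]

near_big = attractiveness(100.0, 100.0, zips)    # close to the largest population
far_away = attractiveness(6000.0, 6000.0, zips)  # far from every center
print(near_big < far_away)
```

A point near the most populous center scores lower than a distant one, which is why the candidate locations are taken from the minima of this score.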

    
To satisfy the first requirement, maximum accessibility, we can find points $P_1,P_2,\ldots$ that are local minima of this measure over all potential points in the city. But first, we need to gather the lat/lon data for the zip codes.

In [6]:
import requests  #access APIs
import folium  #create map
import pandas as pd  #data manipulation
import matplotlib.pyplot as plt  #scatterplots
from math import sin,cos,sqrt,asin,atan2,pow,acos  #operations to convert spherical to cartesian
import utm #utm module for lat/lon to x/y
from geopy.geocoders import Nominatim  #geolocator data
from datetime import datetime  #date for API requests

#get census data for zip codes
url='https://api.census.gov/data/2017/acs/acs5/subject'
key='7bad81938d25611bd2d0362e77c32f0594ea0243'
zips='44301,44302,44303,44304,44305,44306,44307,44308,44310,44311,44313,44314,44320'
call='%s?key=%s&get=NAME,S0101_C01_001E&for=zip%%20code%%20tabulation%%20area:%s'
call = call % (url,key,zips)

resp=requests.get(call).json()

df=pd.DataFrame(columns=['name','population','zip'],data=resp[1:])
df=df[['zip','population']]
df=df.sort_values(by=['zip']).reset_index(drop=True)
df=df.astype(int)

git='https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/centerzips.txt?raw=True'

df2 = pd.read_csv(git,sep=",",header=0)
df['lat']=df2['LAT']
df['lon']=df2['LNG']

address='Akron, OH'
geolocator = Nominatim(user_agent="akron_zips")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_akron = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lon, zipcode, pop in zip(df['lat'], df['lon'], df['zip'],df['population']):
    label = "zip:{}\n pop:{}".format(zipcode,pop)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#ce1141',
        fill=True,
        fill_color='#ce1141',
        fill_opacity=0.5,
        parse_html=False).add_to(map_akron)
    
df.head()

ModuleNotFoundError: No module named 'folium'

Now that we have geographic and population data for each zip code, stored in the dataframe **df** above, the next step is to find the lat/lon data for Chinese restaurants in Akron. We export the location data to a *csv* file, <a href="https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/ch.csv?raw=true">ch.csv</a>, and use it in **R** to calculate the total distances between attractive locations and Chinese restaurants.

In [5]:
ch=pd.read_csv("https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/ch.csv?raw=true",sep=",",header=0)
ch.head()

NameError: name 'pd' is not defined

In [4]:
url = 'https://api.foursquare.com/v2/venues/explore'

params = dict(
  client_id='DYLLTWFBCB0RPXB3RKFNAWYZGXGJMGCKPGMPG4LEKLQ4MFSL',
  client_secret='TBKXLTH2RESZKL5FFAF1I5HKQF1AERA0WKRKXY444YCHT1KO',
  v=datetime.today().strftime('%Y%m%d'),
  ll='41.075576, -81.511134', # UA campus lat/long
  query='chinese',  # search term for Chinese restaurants
  limit=100,
  radius=1609*10 # 10 miles from location, in meters
)
resp = requests.get(url=url, params=params).json()
r = resp['response']['groups'][0]['items']
lat=[]
lon=[]
name=[]
for i in r:
      name.append(i['venue']['name'])
      lat.append(i['venue']['location']['lat'])
      lon.append(i['venue']['location']['lng']) 
ch=pd.DataFrame(list(zip(name,lat,lon)),columns=['Name','lat','lon'])

for name,lat,lon in zip(ch['Name'],ch['lat'],ch['lon']):
    label = "{}".format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#daa520',
        fill=True,
        fill_color='#daa520',
        fill_opacity=0.5,
        parse_html=False).add_to(map_akron)
    
map_akron

KeyError: 'groups'

The map shows the zip code centers (red), as well as the Chinese restaurants (yellow).  
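The `KeyError: 'groups'` shown above appears when the response does not contain the expected `groups` entry. A minimal sketch of a more defensive parse follows. The function name `extract_venues` and both sample payloads are hypothetical, and the shape of a failed response is an assumption. Unlike the loop above, which reads only the first group, this version walks every group it finds.

```python
def extract_venues(resp):
    """Return (name, lat, lon) tuples from an explore-style response dict.

    Returns an empty list instead of raising KeyError when the expected
    'groups' key is missing (for example, when the API returns an error).
    """
    groups = resp.get('response', {}).get('groups', [])
    venues = []
    for group in groups:
        for item in group.get('items', []):
            venue = item.get('venue', {})
            location = venue.get('location', {})
            # keep only venues that carry both coordinates
            if 'lat' in location and 'lng' in location:
                venues.append((venue.get('name', ''), location['lat'], location['lng']))
    return venues

# Hypothetical payloads shaped like the fields the loop above reads
ok = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Wok', 'location': {'lat': 41.08, 'lng': -81.52}}}]}]}}
failed = {'meta': {'code': 400}, 'response': {}}  # assumed error shape

print(extract_venues(ok))
print(extract_venues(failed))
```

An empty result can then be checked before building the dataframe, rather than failing partway through the cell.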

The code below doesn't work in notebooks but works in any other Python interpreter. It uses the **utm** module to convert lat/lon to UTM x/y coordinates, which makes the distance calculations much easier. It writes the new x/y coordinates to another *csv* file, <a href="https://github.com/arodgers11/Coursera_Capstone/blob/master/df.csv?raw=true">df.csv</a>, for processing in **R**.

In [None]:
# convert restaurant lat/lon to UTM x/y (meters) and save for R
x=[0]*len(ch)
y=[0]*len(ch)
for i in range(0,len(ch)):
    lat=ch['lat'][i]
    lon=ch['lon'][i]
    x[i],y[i]=utm.from_latlon(lat,lon)[:2]  #(easting, northing)
ch['x']=x
ch['y']=y
with open('ch.csv','w') as f:
    ch.to_csv(f,index=False)

# same conversion for the zip code centers
x=[0]*len(df)
y=[0]*len(df)
for i in range(0,len(df)):
    lat=df['lat'][i]
    lon=df['lon'][i]
    x[i],y[i]=utm.from_latlon(lat,lon)[:2]
df['x']=x
df['y']=y

with open('df.csv','w') as f:
    df.to_csv(f,index=False)
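
To illustrate why planar x/y coordinates simplify the distance calculations, the sketch below compares a great-circle (haversine) distance with a plain Euclidean distance after a simple local projection. This is not the UTM conversion used above: it swaps in an equirectangular approximation so that it runs with the standard library only. The function names and the Earth radius constant are assumptions; the two points are the UA campus coordinates from the Foursquare request and location K from the results table.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, a common approximation

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def to_local_xy(lat, lon, lat0, lon0):
    """Equirectangular projection around (lat0, lon0), in meters."""
    x = EARTH_RADIUS_M * radians(lon - lon0) * cos(radians(lat0))
    y = EARTH_RADIUS_M * radians(lat - lat0)
    return x, y

origin = (41.075576, -81.511134)  # UA campus, from the Foursquare request
point = (41.101461, -81.488748)   # location K, from the results table

x, y = to_local_xy(point[0], point[1], origin[0], origin[1])
planar = sqrt(x ** 2 + y ** 2)  # ordinary Euclidean distance
spherical = haversine(origin[0], origin[1], point[0], point[1])
print(round(planar), round(spherical))
```

Over a city-sized area the two distances agree closely, so once the points are in planar coordinates the rest of the analysis can use the simple Euclidean formula.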

## Analysis in R
This is the **R** code we used to manipulate the data and calculate the distances, as well as create the heatmap and scatter plots.

In [5]:
library(ggplot2)

df <- read.csv("~/python/df.csv")
ch <- read.csv("~/python/ch.csv")

df$xstd=df$x-mean(df$x)
df$ystd=df$y-mean(df$y)
xmean=mean(df$x)
ymean=mean(df$y)

xrange=seq(-5000,6000,by=100)
yrange=seq(-4500,6000,by=100)
xlen=length(xrange)
ylen=length(yrange)

distance <- function(x,y,x1,y1) {
  return(sqrt((x-x1)^2+(y-y1)^2))
}

d=matrix(0,nrow=xlen,ncol=ylen)

data <- expand.grid(X=xrange, Y=yrange)

for(x in 1:ylen)
  for(y in 1:xlen) {
    for(i in 1:length(df$zip))
      d[y,x]=d[y,x]+distance(df$xstd[i],df$ystd[i],yrange[x],xrange[y])^0.1*df$population[i]*10^-6
    d[y,x]=d[y,x]^3*100
  }

data$Z=matrix(unlist(d),nrow=xlen*ylen)

row=c(11,14,9,37,43,47,49,58,65,82,83,110) 
col=c(18,48,82,51,29,1,101,62,36,38,73,7)

# center the restaurants on the same origin as the zip codes
ch$xstd=ch$x-xmean
ch$ystd=ch$y-ymean

potential_points=data.frame('Location'=LETTERS[1:length(row)])
potential_points$xstd=round(xrange[row],2)
potential_points$ystd=round(yrange[col],2)
potential_points$x=potential_points$xstd+xmean
potential_points$y=potential_points$ystd+ymean

pp=potential_points
pp$row=row
pp$col=col
pp=pp[order(pp$row,decreasing = FALSE),]

# placeholder data frame with one row per candidate, used when plotting labels
empty=data.frame(rep(0,dim(pp)[1]))

# total distance from each candidate to every restaurant
pp$dot=rep(0,dim(pp)[1])
for(i in 1:dim(pp)[1])
  for(j in 1:length(ch$Name))
    pp$dot[i]=pp$dot[i]+distance(ch$x[j],ch$y[j],pp$x[i],pp$y[i])

p=ggplot(data, aes(X, Y, z=Z)) + geom_tile(aes(fill = Z))+ 
  theme_bw()+ 
  scale_fill_gradient(high="black", low="white")+
  geom_point(data=df,aes(xstd,ystd),cex=3,shape=19,color='red',inherit.aes = FALSE)+
  geom_point(data=empty,aes(xrange[pp$row],yrange[pp$col]),inherit.aes = FALSE,
             shape=as.character(pp$Location),color='blue',cex=3)+
  geom_point(data=ch,aes(xstd,ystd),inherit.aes = FALSE,
             shape=17,color='yellow',cex=3)

write.csv(pp,'~/python/potential_points.csv',row.names = FALSE)
write.csv(t(d),'~/python/d.csv',row.names=FALSE)

plot(p)

SyntaxError: invalid syntax (<ipython-input-5-470e716b3f21>, line 6)

Using **R**, we can create a heatmap of the attractiveness (Z) for every point in the map area. We standardized the x and y axes for more 'friendly' numbers. The resulting heatmap, after scaling by $1/1000000$, combined with the scatter plots of Chinese restaurants (yellow) and zip codes (red) looks like this:  

<img src="https://github.com/arodgers11/Coursera_Capstone/blob/master/map.png?raw=true"  width="800" align='left'/>  


Here we can see some clear bright spots, marked with blue letters, including some that are less bright. These points are local minima of the attractiveness measure, and they are the points we will consider for our proposal. We calculated the total distance between each location and the restaurants and wrote the results to a file, <a href="https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/potential_points.csv">potential_points.csv</a>.

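The **R** code above lists the row and column indices of these points directly, and the notebook does not say how they were picked. One way such indices could be found programmatically is sketched below, assuming a strict 8-neighbor comparison on the score grid. The function name `local_minima` and the sample grid are hypothetical, and the indices are 0-based, unlike the 1-based indices in **R**.

```python
def local_minima(grid):
    """Return (row, col) indices whose value is strictly below all neighbors.

    grid is a list of equal-length rows; up to 8 neighbors are checked, and
    cells on the border are compared only with the neighbors that exist.
    """
    rows, cols = len(grid), len(grid[0])
    minima = []
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(grid[r][c] < n for n in neighbors):
                minima.append((r, c))
    return minima

# Hypothetical score grid with two dips
grid = [
    [5, 4, 5, 6],
    [4, 1, 4, 5],
    [5, 4, 3, 2],
    [6, 5, 2, 0],
]
found = local_minima(grid)
print(found)
```

On a real heatmap, flat regions or noise may call for a looser comparison or some smoothing first.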
## Results From R
After all the analysis in **R**, we end up with the resulting data frame. Now, all we have left to do is map the locations, determine which qualifications are met by each location, and rank the qualifying locations for the client to make a decision.

We have to convert back to lat/lon to plot the locations on the map. Using some more code that will not work in notebooks, we can get the lat/long for each location.

In [6]:
lats=[]
lons=[]
for i in range(0,len(pp)):
    # Akron lies in UTM zone 17T (latitude band T covers 40-48 degrees north)
    lat,lon=utm.to_latlon(pp['x'][i],pp['y'][i],zone_number=17,zone_letter='T')
    lats.append(lat)
    lons.append(lon)
pp['lat']=lats
pp['lon']=lons
    
with open('potential_points.csv','w') as f:
    pp.to_csv(f,index=False)

NameError: name 'pp' is not defined

In [7]:
pp=pd.read_csv("https://github.com/arodgers11/Coursera_Capstone/blob/master/Personal/potential_points.csv?raw=true",sep=",",header=0)
pp.head()

NameError: name 'pd' is not defined

Now we can add the candidate locations to our map and perform some more analysis.

In [8]:
for name,lat,lon in zip(pp['Location'],pp['lat'],pp['lon']):
    label = "{}".format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#0000ff',
        fill=True,
        fill_color='#0000ff',
        fill_opacity=0.5,
        parse_html=False).add_to(map_akron)
map_akron

NameError: name 'pp' is not defined

Looking at the map, we can see a couple of problems.
* C,G,H are in the middle of parks
* L is right next to the airport and a dense residential area
* A,B,F,E are in residential areas  

We can eliminate these locations as possibilities, so our new dataframe looks like this.

In [156]:
badlocations=['A','B','C','E','F','G','H','L']
ppfinal=pp.loc[~pp['Location'].isin(badlocations)]
ppfinal=ppfinal.sort_values(by=['dot']).reset_index(drop=True)
ppfinal

| | Location | xstd | ystd | x | y | row | col | dot | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I | 1400 | -1000 | 457158.574684 | 4546435.0 | 65 | 36 | 54737.69479 | 41.06804 | -81.509925 |
| 1 | D | -1400 | 500 | 454358.574684 | 4547935.0 | 37 | 51 | 55512.009779 | 41.081399 | -81.543362 |
| 2 | J | 3100 | -800 | 458858.574684 | 4546635.0 | 82 | 38 | 61116.588899 | 41.069929 | -81.489705 |
| 3 | K | 3200 | 2700 | 458958.574684 | 4550135.0 | 83 | 73 | 65445.249006 | 41.101461 | -81.488748 |
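
The `dot` column used for this ranking is the sum of the distances from a candidate location to every Chinese restaurant, as computed in the **R** code. A minimal Python sketch of the same idea follows. The function name `total_distance` and all coordinates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import hypot

def total_distance(px, py, restaurants):
    """Sum of Euclidean distances from (px, py) to each restaurant."""
    return sum(hypot(px - rx, py - ry) for rx, ry in restaurants)

# Hypothetical planar coordinates, not taken from the Akron data
restaurants = [(0.0, 0.0), (3.0, 4.0)]
candidates = {'near': (0.0, 0.0), 'far': (6.0, 8.0)}

# Rank candidates by total distance, smallest first
ranked = sorted(candidates,
                key=lambda name: total_distance(*candidates[name], restaurants))
print(ranked)
```

Sorting in ascending order puts the location closest to the existing restaurants first, which matches the client's preference for being near other Chinese restaurants.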


In [157]:
address='Akron, OH'
geolocator = Nominatim(user_agent="akron_zips")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
map_akron_final = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lon, name in zip(ppfinal['lat'], ppfinal['lon'], ppfinal['Location']):
    label = "{}".format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='#0000FF',
        fill=True,
        fill_color='#0000FF',
        fill_opacity=0.5,
        parse_html=False).add_to(map_akron_final)
map_akron_final

# Conclusion
We now have a ranked list of possible locations for our client to consider. The client did not provide further criteria, but we can revisit the problem if more information becomes available. The legal, financial, and logistical implications of each location are left for the reader to consider. Many other locations could also be evaluated, but they are outside the scope of this report.

<a href='https://arodg000.wixsite.com/website'>Blog Post</a>

In [8]:
latitude,longitude

(41.5051613, -81.6934446)