In [10]:
# The code was removed by Watson Studio for sharing.

# Data Science Capstone Introduction
My project tries to answer the question of where someone should open a comic book store in my area.  The audience is largely myself, since I am considering whether to start a business, but the analysis would apply to anyone interested in opening a comic store.  I spent most of my life in the Midwest (United States), where most of the comic stores I knew depended on foot traffic: people literally walking past the store and deciding to drop in and see what was there.  A few years ago, though, I moved to southern New Jersey, where there aren't many places with foot traffic.  Most stores are in little shopping centers that are disconnected from each other, and you have to drive most places anyway because the area is not very walkable or public transit-friendly.  So there are really two questions to answer: is this a good place to open a comic store at all?  And if so, where would you want to place it?

# Capstone Data Section
To decide if and where to place a comic book store, I decided there would be three steps to the analysis:
* Find other cities similar to mine
* Analyze how many comic stores they have and what those neighborhoods are like
* Look at my area to see if and which areas are like those in the other cities

### Find Similar Cities
There are many ways to determine how cities might be similar to each other.  There's population, area, demographics, income levels, and much more.  I decided to rely on other people already having done the work.  I found two references:

[This](https://www.nytimes.com/interactive/2018/04/03/upshot/what-is-your-citys-twin.html) New York Times article written by someone who works at Indeed.  You can look up a city (as long as it's sufficiently large to be in their list) and it will tell you what other cities are similar to it in terms of job listings.  I thought this would be helpful because job listings could be related to income and certain kinds of people who like to buy comics.

[This](https://www.chicagofed.org/region/community-development/data/pcit) tool put together by the Federal Reserve Bank of Chicago, which lets you enter a city and tells you what other cities are similar to it on a few different dimensions: equity (demographics and income inequality measures), resilience (unemployment, income levels, and labor force participation measures), outlook (population and demographics measures), and housing (home value and rent measures).  I thought this would be helpful because it uses more direct measures of what the different cities are like and was put together by a different group, and thus yields a different set of similar cities.

One limitation of these tools is that my exact city, Egg Harbor Township NJ, is not big enough to be in their databases.  So I will use Atlantic City NJ as the 'base' city to compare to.  Another option would be Philadelphia, which is the nearest large city, but I am not interested in looking at Philadelphia.  I want to look at options in southern New Jersey specifically.

Using Atlantic City as the base city, I could use over 20 comparison cities based on these two references.  That seems like too many, especially since some of the comparisons won't really be very similar to Atlantic City, so I am going to choose the following:

* Las Vegas, NV from the Times link: Atlantic City is a gambling-based tourist town, so Vegas makes a lot of sense.  Vegas is a much bigger city and better tourist draw, though.
* North Port, FL from the Times link: seven of the top ten matches are in Florida.  North Port is 4th of the 7, but the similarity scores are very similar and North Port has the smallest population, making it a bit more similar to Atlantic City.  Both are coastal towns.
* Cleveland and Dayton, OH: the top matches from the Chicago equity link.  They're also on the resilience list.
* Albany, GA: the top match from the Chicago resilience link.
* Port Arthur, TX: the top match from the Chicago outlook link.
* New Orleans, LA: not the top match in the Chicago housing link, but fairly similar.  New Orleans was also on the Times list, presumably because both cities are heavily based on tourism.

These seven cities will hopefully give me a good sense of whether cities like mine can support comic book stores, and thus whether southern New Jersey should be able to, while also representing a range of city types.  I won't be dependent on comparing my area to one other place and relying on its special characteristics or uniqueness.

### Analyze The Similar Cities
I will use the Foursquare API to look at these cities in three steps:
1. How many comic stores are there in the area?
2. What kinds of places (Foursquare venues) are near those stores?
3. What kinds of places are **not** near stores?

Specifically, since the cities cover a range of sizes, I will use the lat/long and area listed for each city on Wikipedia to define an appropriate center and range for a Foursquare venue 'search' for comics.  I'll have to do some filtering on the results, as I know from doing a Foursquare search of my area that there could be redundant entries (I get my local comic store as well as the mall where it is located, and another business is listed twice under different categories).

Once I have a list of comic stores in each of the comparison cities, I will do a Foursquare venue 'explore' around each store to see what other points of interest are nearby.  Are comic stores near restaurants and coffee shops?  Are they in malls or other retail shopping?  Are there any that are near parks or something else?  This will be similar to what we did for the week 3 assignment, but based around comic stores instead of neighborhoods.  Importantly, I will pick a few lat/longs not near comic stores and get their characteristics as well.  This is a necessary step to make sure I'm not just picking retail areas or something where any kind of store could be; I want to find what makes a good place for a comic store specifically.
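A minimal sketch of the 'explore' step, assuming the v2 `venues/explore` endpoint returns venues under `response.groups[].items[].venue`.  The helper names `explore_params` and `parse_explore` and the sample response are hypothetical and only illustrate the shape of the data; no request is actually sent here.

```python
def explore_params(lat, lng, radius, limit, client_id, client_secret, version):
    """Build the query parameters for a Foursquare v2 venues/explore call."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }


def parse_explore(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response."""
    rows = []
    for group in payload.get('response', {}).get('groups', []):
        for item in group.get('items', []):
            venue = item['venue']
            # not every venue has a category, so fall back to 'NA'
            categories = venue.get('categories') or []
            category = categories[0]['name'] if categories else 'NA'
            rows.append((venue['name'],
                         venue['location']['lat'],
                         venue['location']['lng'],
                         category))
    return rows


# hypothetical response fragment, shaped like the assumed API output
sample = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Coffee',
                           'location': {'lat': 41.45, 'lng': -81.81},
                           'categories': [{'name': 'Coffee Shop'}]}},
                {'venue': {'name': 'Example Plaza',
                           'location': {'lat': 41.46, 'lng': -81.82},
                           'categories': []}},
            ]
        }]
    }
}

params = explore_params(41.45, -81.81, 500, 50, 'ID', 'SECRET', '20180605')
nearby = parse_explore(sample)
print(nearby)
```

The parsed tuples can go straight into a pandas DataFrame, one row per nearby venue, tagged with the comic store (or control point) they were found around.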

### Compare to My Area
Finally, I will take the venue types from the comic stores in other cities and see if there's a similar area near me.  I will answer this question in a couple ways:
1. How many comic stores are there in the other cities, and how does that number relate to characteristics in the Chicago link?  For example, there could be a simple relationship where larger cities have more stores, but perhaps it also varies with income level.
2. How does my area fit into a clustering analysis of those areas?  I'll do a similar analysis to our week 3 assignment, adding my area to the comparison cities, and see which cluster my area fits into if any.
3. A discriminant or logistic regression-type analysis to distinguish between areas that have comic stores and those that don't, to better see where in my area would be a good choice.

At the end of this analysis, I should have an idea of how many comic stores an area might support (since mine already has two) and what kinds of areas they tend to be in as defined by Foursquare venues.
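Both the clustering and the regression-type analysis need each location expressed as a row of numbers.  One possible representation, sketched below, is the share of nearby venues that fall in each Foursquare category.  The function name `category_profile` and the category lists are hypothetical examples, not real query results.

```python
from collections import Counter


def category_profile(categories, vocabulary):
    """Return the share of nearby venues in each category, in vocabulary order."""
    counts = Counter(categories)
    total = len(categories)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[c] / total for c in vocabulary]


# hypothetical venue categories found around a comic store and a control point
near_store = ['Coffee Shop', 'Pizza Place', 'Coffee Shop', 'Bookstore']
near_control = ['Gas Station', 'Pizza Place']

# a shared, sorted vocabulary keeps the columns aligned across locations
vocabulary = sorted(set(near_store) | set(near_control))
store_profile = category_profile(near_store, vocabulary)
control_profile = category_profile(near_control, vocabulary)

print(vocabulary)
print(store_profile)
print(control_profile)
```

Each profile sums to 1, so locations with many venues and locations with few are compared on the mix of venue types rather than on raw counts.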

# Methodology
As noted above, I need to gather data on my area and the seven comparison cities.  The Federal Reserve Bank of Chicago site allows for downloads but doesn't let you customize the cities in a list, or combine across their various similarity measures, so I entered the relevant data by hand into Excel and saved it as a CSV.  I'll read that in here and show the table so we know what we have.  

In [4]:
# Fetch the file
my_file = project.get_file("city_data.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
import pandas as pd
city_data = pd.read_csv(my_file)
city_data

Unnamed: 0,city,hispanic_white_dissimilarity_index,black_white_disimilarity_index,poverty_rate,change_poverty_rate,wage_based_gini,change_inequality_index,percent_white,percent_bachelors,share_metro_population,...,percent_foreign_born,percent_change_population,percent_family_with_children,percent_population_20_64,population,percent_old_houses,vacancy_rate,home_value_to_income_ratio,homeownership_rate,percent_rent_burdened
0,Atlantic City,34.3,55.8,33.1,14.0,0.3745,0.0455,15.3,16.4,14.3,...,33.0,-5.3,60.0,60.0,38372,74.0,24.0,6.0,26.6,59.9
1,Las Vegas,45.6,39.3,11.8,3.2,0.3316,-0.0013,44.2,23.9,29.3,...,20.8,31.0,49.7,59.2,626637,23.4,11.4,4.3,52.5,52.3
2,North Port,13.9,7.0,4.6,-1.0,0.3312,0.0118,79.9,20.4,8.2,...,11.0,182.6,37.7,54.1,64425,14.8,16.0,2.9,74.2,41.3
3,Cleveland,36.5,67.7,30.2,7.3,0.3212,0.0247,33.7,16.6,18.8,...,5.4,-19.0,53.6,61.1,387398,89.6,20.2,2.4,41.3,53.7
4,Dayton,42.8,74.4,27.4,9.2,0.3247,0.023,52.7,18.1,17.5,...,5.0,-15.3,53.0,61.0,140782,88.1,21.9,2.1,47.0,56.0
5,Albany,37.5,50.6,27.3,5.8,0.332,0.0014,21.5,20.3,48.7,...,2.1,-3.0,52.6,57.8,74631,61.5,14.8,2.9,39.4,51.2
6,Port Arthur,47.2,40.6,23.9,1.0,0.3476,0.0214,20.3,11.8,13.5,...,22.6,-4.3,55.4,57.9,55249,64.1,19.0,1.9,56.4,51.7
7,New Orleans,38.1,65.9,17.8,-5.9,0.3546,-0.0071,30.6,36.8,30.8,...,5.6,-19.6,45.8,63.9,389648,78.5,19.7,5.5,47.4,62.0


Next, I need to pull Foursquare data for these cities on what comic stores are in the area for each city.  As described before, I'll use Wikipedia to find a lat/long for each city and its rough area so that I can use that lat/long and search area in the Foursquare API call.  I put those into an Excel file as well.

In [17]:
# Fetch the file
my_file = project.get_file("geo_data.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
geo_data = pd.read_csv(my_file)

# the city area, listed in square miles, is converted to the radius (in feet) of a circle with the same area
# I'll assume that cities can be roughly estimated by circles, although that is obviously not true
# note: Foursquare documents its radius parameter in meters, so passing feet searches a wider area than the circle
import numpy as np

geo_data['radius'] = np.round(np.sqrt(geo_data['area']/np.pi)*5280, decimals = -2)
geo_data

Unnamed: 0,city,lat,long,area,radius
0,Atlantic City,39.377297,-74.451082,17.0,12300.0
1,Las Vegas,36.175,-115.136389,135.8,34700.0
2,North Port,27.066111,-82.171944,104.0,30400.0
3,Cleveland,41.482222,-81.669722,82.5,27100.0
4,Dayton,39.759444,-84.191667,56.5,22400.0
5,Albany,31.582222,-84.165556,55.8,22300.0
6,Port Arthur,29.885,-93.94,144.0,35700.0
7,New Orleans,29.95,-90.08,349.9,55700.0
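As a worked check of the conversion, here is the Atlantic City row computed by hand with only the standard library.  The function name `circle_radius` is hypothetical.  The meter version is included on the assumption that the Foursquare `radius` parameter is interpreted in meters, which is worth confirming against the API documentation.

```python
import math

FEET_PER_MILE = 5280
METERS_PER_MILE = 1609.344


def circle_radius(area_sq_miles, units_per_mile):
    """Radius of a circle with the given area, converted to the requested
    length unit and rounded to the nearest 100."""
    radius_miles = math.sqrt(area_sq_miles / math.pi)
    return round(radius_miles * units_per_mile, -2)


# Atlantic City is listed at 17.0 square miles
radius_feet = circle_radius(17.0, FEET_PER_MILE)
radius_meters = circle_radius(17.0, METERS_PER_MILE)

print(radius_feet)    # 12300.0, matching the table
print(radius_meters)  # 3700.0
```

A 17 square mile circle has a radius of about 2.33 miles, which is roughly 12,300 feet or 3,700 meters.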


In [85]:
# I don't expect there to be more than, say, 10 comic stores in any city.  But I'll allow up to 50
LIMIT = 50
query = 'comic'

import requests

store_list = []
for city, lat, long, radius in zip(geo_data['city'], geo_data['lat'], geo_data['long'], geo_data['radius']):
    print(city)
    # we can skip Atlantic City because I'm not looking for comic stores there.  Also, the comic stores in the area aren't actually in AC
    if city=='Atlantic City':
        continue
        
    # pass the query parameters as a dict so that requests builds and encodes the URL
    url = 'https://api.foursquare.com/v2/venues/search'
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(lat, long),
        'query': query,
        'radius': radius,
        'limit': LIMIT}
    results = requests.get(url, params=params).json()["response"]['venues']
        
    # keep only the relevant information for each venue.  Not all have a category, so fall back to 'NA'
    for v in results:
        category = v['categories'][0]['name'] if v['categories'] else 'NA'
        store_list.append((
            city,
            lat,
            long,
            v['name'],
            v['location']['lat'],
            v['location']['lng'],
            category))

# build the DataFrame once, after all cities have been queried
stores = pd.DataFrame(store_list, columns=['City',
                  'City Latitude',
                  'City Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category'])

stores.sort_values(by=['City','Venue'], inplace=True)
stores.reset_index(inplace=True, drop = True)

Atlantic City
Las Vegas
North Port
Cleveland
Dayton
Albany
Port Arthur
New Orleans


The Foursquare results have a number of repeats or redundancies; at the top of the Las Vegas results, you can see that the two entries for Cosmic Comics and Wishing Well Comics look suspiciously similar.  However, some stores could have multiple locations.  So I googled any potentially redundant stores and removed the confirmed duplicates below.

In [86]:
redundant_rows = [11,22,23,35,36,41,57,62,67]
final_stores = stores.drop(redundant_rows).reset_index(drop=True)
final_stores.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Albany,31.582222,-84.165556,Comics and Cards,31.607982,-84.207638,Hobby Shop
1,Cleveland,41.482222,-81.669722,A & A Comics,41.424399,-81.611015,Bookstore
2,Cleveland,41.482222,-81.669722,Astound Comics!,41.468459,-81.906546,Bookstore
3,Cleveland,41.482222,-81.669722,B & L Comics,41.409581,-81.734234,Bookstore
4,Cleveland,41.482222,-81.669722,Carol & John's Comic Book Shop,41.451862,-81.818361,Comic Shop
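The rows to check by hand could be narrowed down by flagging pairs of venues in the same city that have similar names and sit very close together.  This is only a sketch of that idea, not the procedure used above: the function names, the thresholds (0.8 name similarity, 0.1 miles) and the coordinates are all hypothetical.

```python
import math
from difflib import SequenceMatcher


def approx_miles(lat1, lon1, lat2, lon2):
    """Equirectangular distance approximation; adequate for nearby points."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 3958.8 * math.hypot(dx, dy)


def candidate_duplicates(venues, name_threshold=0.8, max_miles=0.1):
    """Return index pairs (i, j) of same-city venues with similar names
    that are within max_miles of each other."""
    pairs = []
    for i in range(len(venues)):
        for j in range(i + 1, len(venues)):
            city_a, name_a, lat_a, lng_a = venues[i]
            city_b, name_b, lat_b, lng_b = venues[j]
            if city_a != city_b:
                continue
            similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
            close = approx_miles(lat_a, lng_a, lat_b, lng_b) <= max_miles
            if similarity >= name_threshold and close:
                pairs.append((i, j))
    return pairs


# hypothetical rows: (city, venue name, latitude, longitude)
venues = [
    ('Las Vegas', 'Cosmic Comics', 36.1000, -115.1000),
    ('Las Vegas', 'Cosmic Comics!', 36.1001, -115.1001),
    ('Las Vegas', 'Another Comic Shop', 36.2000, -115.2000),
]

flagged = candidate_duplicates(venues)
print(flagged)
```

Flagged pairs still need a manual look, since a chain with two nearby locations would be flagged the same way as a true duplicate listing.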


Now that I have the comic store data, I want to do two things.  First, count the number of stores per city and add that to the city data.  Second, pick another place for each store that can serve as a kind of 'control': a place relatively nearby that doesn't have a comic store.  These controls will be used for distinguishing between locations that would be good or bad for a store in my area.

In [112]:
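# count the number of stores per city and add it to the city data
# note: `stores` still includes the redundant rows dropped above; count from `final_stores` to exclude them
store_count = stores.groupby('City')['Venue'].count().reset_index()
store_count.columns = ['city','store_count']
# a left merge keeps Atlantic City, which has no store count, with a missing value
city_data.merge(store_count,how='left',on='city')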

Unnamed: 0,city,hispanic_white_dissimilarity_index,black_white_disimilarity_index,poverty_rate,change_poverty_rate,wage_based_gini,change_inequality_index,percent_white,percent_bachelors,share_metro_population,...,percent_change_population,percent_family_with_children,percent_population_20_64,population,percent_old_houses,vacancy_rate,home_value_to_income_ratio,homeownership_rate,percent_rent_burdened,store_count
0,Atlantic City,34.3,55.8,33.1,14.0,0.3745,0.0455,15.3,16.4,14.3,...,-5.3,60.0,60.0,38372,74.0,24.0,6.0,26.6,59.9,
1,Las Vegas,45.6,39.3,11.8,3.2,0.3316,-0.0013,44.2,23.9,29.3,...,31.0,49.7,59.2,626637,23.4,11.4,4.3,52.5,52.3,29.0
2,North Port,13.9,7.0,4.6,-1.0,0.3312,0.0118,79.9,20.4,8.2,...,182.6,37.7,54.1,64425,14.8,16.0,2.9,74.2,41.3,3.0
3,Cleveland,36.5,67.7,30.2,7.3,0.3212,0.0247,33.7,16.6,18.8,...,-19.0,53.6,61.1,387398,89.6,20.2,2.4,41.3,53.7,16.0
4,Dayton,42.8,74.4,27.4,9.2,0.3247,0.023,52.7,18.1,17.5,...,-15.3,53.0,61.0,140782,88.1,21.9,2.1,47.0,56.0,12.0
5,Albany,37.5,50.6,27.3,5.8,0.332,0.0014,21.5,20.3,48.7,...,-3.0,52.6,57.8,74631,61.5,14.8,2.9,39.4,51.2,1.0
6,Port Arthur,47.2,40.6,23.9,1.0,0.3476,0.0214,20.3,11.8,13.5,...,-4.3,55.4,57.9,55249,64.1,19.0,1.9,56.4,51.7,3.0
7,New Orleans,38.1,65.9,17.8,-5.9,0.3546,-0.0071,30.6,36.8,30.8,...,-19.6,45.8,63.9,389648,78.5,19.7,5.5,47.4,62.0,20.0
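For the second task, one possible way to choose a control location is to step a fixed distance away from a store and keep the point only if it is far enough from every known store.  The sketch below assumes a 2 mile offset and a 1 mile minimum gap, both arbitrary choices, and the function names are hypothetical.  The two store coordinates are the A & A Comics and B & L Comics rows shown earlier.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def offset_point(lat, lon, miles, bearing_deg):
    """Point reached by travelling `miles` from (lat, lon) along a bearing."""
    d = miles / EARTH_RADIUS_MILES
    b = math.radians(bearing_deg)
    p1, l1 = math.radians(lat), math.radians(lon)
    p2 = math.asin(math.sin(p1) * math.cos(d) + math.cos(p1) * math.sin(d) * math.cos(b))
    l2 = l1 + math.atan2(math.sin(b) * math.sin(d) * math.cos(p1),
                         math.cos(d) - math.sin(p1) * math.sin(p2))
    return math.degrees(p2), math.degrees(l2)


def pick_control(store, all_stores, miles=2.0, min_gap=1.0, bearings=(0, 90, 180, 270)):
    """Return the first offset point at least min_gap miles from every store,
    or None if no bearing works."""
    for bearing in bearings:
        cand = offset_point(store[0], store[1], miles, bearing)
        if all(haversine_miles(cand[0], cand[1], s[0], s[1]) >= min_gap for s in all_stores):
            return cand
    return None


# (latitude, longitude) of two Cleveland stores from the table above
cleveland_stores = [(41.424399, -81.611015), (41.409581, -81.734234)]
control = pick_control(cleveland_stores[0], cleveland_stores)
print(control)
```

A point chosen this way could still land in a lake or a residential block with no venues at all, so each control would need a quick sanity check before being used.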

