# Capstone Project: InfoDyne Relocation Search of Baltimore, Maryland
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

InfoDyne Corporation is considering relocating its manufacturing operations from China to a location in the United States. InfoDyne's main customers are farmers worldwide who use its data acquisition and transmission products to improve crop yields. The State of Maryland and the United States Federal Government are offering tax incentives and loan guarantees to companies that relocate their businesses to cities that have traditionally had high crime rates.

InfoDyne management has asked its data science team to investigate the City of Baltimore in Maryland to identify a location that has acceptable crime rates along with access to the interstate highway system and ports for shipment. Additionally, if this project is successful, InfoDyne would like the identified location to offer expansion opportunities.

Baltimore City is problematic. Many areas are unsuitable for relocation because the crime rate is so high that employees would refuse to work in that part of the city. In addition, a workforce would need access to facilities such as food, gas, and other conveniences that may not be available in certain places in the city.

## Data <a name="data"></a>

In order to identify an optimal location for InfoDyne's manufacturing business, several factors must be considered.

* Crime rate by neighborhood: InfoDyne wants to take advantage of the tax and loan incentives, but does not want crime to be a detriment to its business. Since InfoDyne management is unfamiliar with the city, an analysis of crime rates and neighborhoods is warranted.

* Transport location: InfoDyne needs quick access to both the interstate highway system and local ports so that it can ship its product at reasonable cost across the world.

* Local amenities: Once an optimal low-crime location has been identified in Baltimore, InfoDyne management would like to identify locations of interest for its employees, such as food, gas, and other local businesses.

### Crime Data

Baltimore City publishes a large amount of city data on its https://data.baltimorecity.gov/ website, including crime data that is updated daily. Specifically, we will use the crime data located at: https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD

This data is incomplete and has many missing entries, so it needs to be cleaned before it is usable. The dataset is also very large, which prevents it from loading correctly on maps, so I will include only the data necessary for the analysis.



### Venue Data (Eateries and Gas Stations)

Foursquare is a service that provides crowdsourced information about venues, including their geographic locations. In this case, we will use the Foursquare API to pull information about local eateries and gas stations in Baltimore City.

In [1]:
#!conda install -c conda-forge folium --yes
import folium

In [2]:
import pandas as pd
import numpy as np
import requests
import io
from IPython.display import IFrame
from folium.plugins import HeatMap

## Methodology <a name="methodology"></a>

First, let's import the crime data from the Baltimore City open data service. This data is available in CSV format and can be downloaded through a URL.

In [3]:
#URL for Baltimore Crime Data
BALT_URL = 'https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD'
s=requests.get(BALT_URL).content
baltimoredata=pd.read_csv(io.StringIO(s.decode('utf-8')))

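Large CSV exports sometimes mix types within a column, which can make pandas emit a `DtypeWarning` when it reads the file in chunks. A minimal sketch of a loader that parses the downloaded bytes in one pass is shown here. The helper name `load_crime_csv` and the two-row stand-in data (a few values copied from the preview of the download, with most columns omitted) are assumptions for illustration only.

```python
import io

import pandas as pd


def load_crime_csv(raw_bytes: bytes) -> pd.DataFrame:
    """Parse downloaded CSV bytes into a DataFrame in a single pass."""
    text = raw_bytes.decode("utf-8")
    # low_memory=False makes pandas infer each column's dtype from the whole
    # file instead of chunk by chunk, which avoids mixed-type warnings.
    return pd.read_csv(io.StringIO(text), low_memory=False)


# Tiny stand-in for the real download (hypothetical subset of the columns).
sample_bytes = (
    b"CrimeDate,CrimeTime,Description,Post,Longitude,Latitude\n"
    b"10/19/2019,10:45:00,AUTO THEFT,221,-76.574312,39.286411\n"
    b"10/19/2019,15:17:00,LARCENY,,,\n"
)
sample_df = load_crime_csv(sample_bytes)
print(sample_df.shape)
```

With the real data, the same helper could be called as `load_crime_csv(requests.get(BALT_URL).content)`.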


In [4]:
baltimoredata.head()

Unnamed: 0,CrimeDate,CrimeTime,CrimeCode,Location,Description,Inside/Outside,Weapon,Post,District,Neighborhood,Longitude,Latitude,Location 1,Premise,vri_name1,Total Incidents
0,10/19/2019,15:17:00,6J,6500 OLIV,LARCENY,O,,,SOUTHEAST,,,,,OTHER - OUTSIDE,,1
1,10/19/2019,10:45:00,7A,3000 EASTERN AVE,AUTO THEFT,O,,221.0,SOUTHEAST,BALTIMORE-LINWOOD,-76.574312,39.286411,,STREET,,1
2,10/19/2019,21:25:00,3B,2700 DILLON ST,ROBBERY - STREET,O,,214.0,SOUTHEAST,CANTON,-76.577923,39.281131,,STREET,,1
3,10/19/2019,21:51:00,4E,2400 SHIRLEY AVE,COMMON ASSAULT,I,,533.0,NORTHERN,GREENSPRING,-76.659335,39.335694,,ROW/TOWNHOUSE-OCC,,1
4,10/19/2019,16:40:00,6G,600 WYNDHURST AVE,LARCENY,I,,521.0,NORTHERN,WYNDHURST,-76.629834,39.353382,,OTHER - INSIDE,,1


Some of the data formats are not readable, so we need to clean up the data and drop records that are unusable.

In [5]:
# Clean up the crime data
crimedata = baltimoredata.copy()  # work on a copy so the raw download stays unchanged
# Combine the date and time fields into a single string column
crimedata['CrimeDateTime'] = crimedata['CrimeDate'].astype(str) + ' ' + crimedata['CrimeTime'].astype(str)
crimedata = crimedata.drop(['CrimeDate', 'CrimeTime'], axis=1)  # delete the old date and time columns
# Convert to a datetime type; unparseable values become NaT
crimedata['CrimeDateTime'] = pd.to_datetime(crimedata['CrimeDateTime'], errors='coerce')
# Drop all rows prior to 2014 since they are invalid or poorly collected
crimedata = crimedata[crimedata['CrimeDateTime'].dt.year >= 2014]
# Drop all rows that have NaN for either latitude or longitude
crimedata = crimedata.dropna(subset=['Longitude', 'Latitude'])

crimedata.shape

(278302, 15)
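To make it easier to see which rows each cleaning step removes, the same steps can be wrapped in a function and run on a small toy table. This is only a sketch: the function name `clean_crime_data`, the `min_year` parameter, and the toy rows are assumptions, not part of the original notebook.

```python
import pandas as pd


def clean_crime_data(raw: pd.DataFrame, min_year: int = 2014) -> pd.DataFrame:
    """Return a cleaned copy: one timestamp column, recent rows, valid coordinates."""
    df = raw.copy()  # leave the caller's DataFrame untouched
    df["CrimeDateTime"] = pd.to_datetime(
        df["CrimeDate"].astype(str) + " " + df["CrimeTime"].astype(str),
        errors="coerce",  # unparseable timestamps become NaT
    )
    df = df.drop(columns=["CrimeDate", "CrimeTime"])
    df = df[df["CrimeDateTime"].dt.year >= min_year]  # NaT rows fail this test
    return df.dropna(subset=["Longitude", "Latitude"])


# Hypothetical rows: one valid, one without coordinates, one too old, one unparseable.
toy = pd.DataFrame({
    "CrimeDate": ["10/19/2019", "10/19/2019", "01/05/2012", "bad date"],
    "CrimeTime": ["10:45:00", "15:17:00", "08:00:00", "09:00:00"],
    "Description": ["AUTO THEFT", "LARCENY", "BURGLARY", "ARSON"],
    "Longitude": [-76.574312, None, -76.60, -76.61],
    "Latitude": [39.286411, None, 39.30, 39.31],
})
cleaned = clean_crime_data(toy)
print(cleaned[["Description", "CrimeDateTime"]])
```

Only the first toy row should survive: the others are dropped for missing coordinates, a pre-2014 date, and an unparseable date, respectively.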

In [6]:
crimedata.head(5)

Unnamed: 0,CrimeCode,Location,Description,Inside/Outside,Weapon,Post,District,Neighborhood,Longitude,Latitude,Location 1,Premise,vri_name1,Total Incidents,CrimeDateTime
1,7A,3000 EASTERN AVE,AUTO THEFT,O,,221,SOUTHEAST,BALTIMORE-LINWOOD,-76.574312,39.286411,,STREET,,1,2019-10-19 10:45:00
2,3B,2700 DILLON ST,ROBBERY - STREET,O,,214,SOUTHEAST,CANTON,-76.577923,39.281131,,STREET,,1,2019-10-19 21:25:00
3,4E,2400 SHIRLEY AVE,COMMON ASSAULT,I,,533,NORTHERN,GREENSPRING,-76.659335,39.335694,,ROW/TOWNHOUSE-OCC,,1,2019-10-19 21:51:00
4,6G,600 WYNDHURST AVE,LARCENY,I,,521,NORTHERN,WYNDHURST,-76.629834,39.353382,,OTHER - INSIDE,,1,2019-10-19 16:40:00
5,4C,1900 MCHENRY ST,AGG. ASSAULT,O,OTHER,934,SOUTHERN,CARROLLTON RIDGE,-76.647989,39.284216,,STREET,Tri-District,1,2019-10-19 03:24:00


## Analysis <a name="analysis"></a>

Now that we have the crime data imported, let's take a look.

First, we count the records by the description category to understand what types of crimes are represented in this data source.

In [7]:
# Show how many crimes by category
crimedata.groupby('Description')['Description'].count()

Description
AGG. ASSAULT            30057
ARSON                    1225
AUTO THEFT              24570
BURGLARY                40644
COMMON ASSAULT          45576
HOMICIDE                 1785
LARCENY                 62112
LARCENY FROM AUTO       37024
RAPE                     1738
ROBBERY - CARJACKING     2336
ROBBERY - COMMERCIAL     4965
ROBBERY - RESIDENCE      2802
ROBBERY - STREET        19785
SHOOTING                 3683
Name: Description, dtype: int64

Because InfoDyne is looking for a location where its company and employees can be safe, let's focus on the Commercial Robbery and Street Robbery categories of data.

One way to easily analyze the data is to visualize it. Here we use folium to build a map of Baltimore and add the crime data to it.

In [8]:
#Baltimore's Longitude and Latitude
latitude = 39.2904
longitude = -76.6122

In [9]:
# Create the map and visualize the city
city_map = folium.Map(location=[latitude, longitude], tiles='Stamen Terrain', zoom_start=14) 

We further reduce the large dataset to just the records in the two ROBBERY categories we want, and also create a random sample of the data for visualization.

In [10]:
# Extract the commercial and street robberies to determine safe places to work
#sample = crimedata[(crimedata['Description'] == 'HOMICIDE') & (crimedata['CrimeDateTime'].dt.year == 2018)]
robbery_sample = crimedata[crimedata['Description'].isin(['ROBBERY - COMMERCIAL', 'ROBBERY - STREET'])]
sample = robbery_sample.sample(n=1000)  # random subset of 1000 rows for plotting individual points
robbery_sample.shape

(24750, 15)
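Since crime by neighborhood is one of the stated factors, the filtered robberies can also be counted per neighborhood, and the random sample can be made repeatable. The sketch below uses a small made-up table; the toy rows and the choice of `random_state=0` are assumptions for illustration.

```python
import pandas as pd

ROBBERY_TYPES = ["ROBBERY - COMMERCIAL", "ROBBERY - STREET"]

# Hypothetical rows; the real data has many more columns and records.
toy = pd.DataFrame({
    "Description": ["ROBBERY - STREET", "LARCENY", "ROBBERY - COMMERCIAL",
                    "ROBBERY - STREET", "ROBBERY - RESIDENCE"],
    "Neighborhood": ["CANTON", "WYNDHURST", "CANTON",
                     "GREENSPRING", "CANTON"],
})

# isin() keeps rows whose Description matches any listed category.
robberies = toy[toy["Description"].isin(ROBBERY_TYPES)]

# Count incidents per neighborhood, highest first.
per_neighborhood = robberies["Neighborhood"].value_counts()

# A fixed random_state makes the plotted sample repeatable between runs.
sample = robberies.sample(n=2, random_state=0)
print(per_neighborhood)
```

Sorting neighborhoods by count gives a simple numeric companion to the heat map when comparing candidate areas.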

A great way to visualize data is through a heatmap. Here we use the folium HeatMap plugin to add one to the map.

In [11]:
# Create a heatmap from all commercial and street robberies, 2014-2019
heat_data = robbery_sample[['Latitude', 'Longitude']].values.tolist()
HeatMap(heat_data, radius=20).add_to(city_map)

<folium.plugins.heat_map.HeatMap at 0x7f84621f8b00>

Add smaller dots to the map to show a random sample of 1000 robbery crimes over the time period.

In [12]:
# Add a smaller sample (1000) of the robbery crimes to the map as small blue circles
for lat, lng, district in zip(sample.Latitude, sample.Longitude, sample.District):
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        color='blue',
        popup=district,  # show the police district when a dot is clicked
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(city_map)

Display the initial map to look for "lower" crime areas

In [13]:
#Display the heat map and identify location of where to place business
city_map

The area on Baltimore's south side (around the Horseshoe Casino) looks interesting. Let's identify conveniences nearby to see if this location is workable.

In [14]:
#Baltimore's south side Longitude and Latitude
loc_latitude = 39.2736
loc_longitude = -76.6274
folium.CircleMarker(
    [loc_latitude, loc_longitude],
    radius=10,
    color='black',
    popup='Desired Location',
    fill = True,
    fill_color = 'gray',
    fill_opacity = 0.6
).add_to(city_map)

<folium.vector_layers.CircleMarker at 0x7f846afbe550>

### Add Venue Data to the Map
Now that we have a map with a heat map showing areas of high and "lower" crime, let's get some venue data.

In this case, we use the Foursquare API to pull venue data for food and gas into a pandas DataFrame to be visualized.

In [16]:
# Lets get the venue data for establishments nearby this lat/long
#My foursquare developer account info
#CLIENT_ID
#CLIENT_SECRET
VERSION = '20180604'
LIMIT = 30
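The credentials are deliberately hidden in this notebook. One common way to supply them without writing them into a cell is to read them from environment variables. The sketch below assumes the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; neither the names nor the helper function come from the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given mapping."""
    try:
        # Hypothetical variable names; use whatever your environment defines.
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None


# Usage (requires both variables to be set in the environment):
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```

Passing the mapping as a parameter keeps the helper easy to exercise with a plain dictionary instead of the real environment.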

Let's form a query to send to Foursquare.

In [17]:
# Identify food venues within 1500 meters of the candidate lat/long
# The Foursquare API only allows one category at a time
search_query = 'food'
radius = 1500  # 1.5 km from the location
# Define the Foursquare URL we will use for our queries
#url = 'https://api.foursquare.com/v2/search/recommendations?client_id={}&client_secret={}&ll={},{}&v={}&sections={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)
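Building the URL with `str.format` works for single-word queries, but a query containing spaces or other special characters would need escaping. A hedged alternative is to let the standard library encode the parameters. The function name `build_search_url` and the placeholder credentials are assumptions; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, lat, lng, query,
                     radius=1500, limit=30, version="20180604"):
    """Assemble a venue-search URL with escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; a two-word query shows the escaping.
example_url = build_search_url("MY_ID", "MY_SECRET", 39.2736, -76.6274, "gas station")
print(example_url)
```

The same function could then be reused for each search term (food, gas, ports) by changing only `query` and `radius`.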


Let's get the result back (in JSON format) and normalize the values for processing.

In [18]:
# Send the GET request and examine the results
results = requests.get(url).json()
#print(results)
# Assign the relevant part of the returned JSON to venues
venues = results['response']['venues']

# Transform venues into a flat pandas DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize since pandas 1.0)
df_venues = pd.json_normalize(venues)
df_venues.shape

(15, 24)
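To show what the normalization step is extracting, here is a minimal sketch that flattens the same response structure using plain dictionaries. The function name `flatten_venues` and the hand-written stand-in response are assumptions; the first venue borrows values from this notebook's venue listings and the second is made up to show the empty-category case.

```python
def flatten_venues(response_json):
    """Turn a venue-search response into a list of flat dicts."""
    rows = []
    for venue in response_json["response"]["venues"]:
        categories = venue.get("categories", [])
        location = venue.get("location", {})
        rows.append({
            "name": venue.get("name"),
            # Keep the first listed category, or None when the list is empty.
            "category": categories[0]["name"] if categories else None,
            "lat": location.get("lat"),
            "lng": location.get("lng"),
            "distance": location.get("distance"),
        })
    return rows


# Hand-written stand-in for an API response (not real API output).
fake_response = {
    "response": {
        "venues": [
            {
                "name": "Shell",
                "categories": [{"name": "Gas Station"}],
                "location": {"lat": 39.273212, "lng": -76.629373, "distance": 175},
            },
            {"name": "Uncategorized Stand", "categories": [], "location": {}},
        ]
    }
}
rows = flatten_venues(fake_response)
print(rows[0])
```

A list of flat dictionaries like this can be passed straight to `pd.DataFrame(rows)` if a DataFrame is preferred.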

In [19]:


#Define data of interest and filter dataframe for just what we want
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in df_venues.columns if col.startswith('location.')] + ['id']
df_filtered = df_venues.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the nested column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
df_filtered['categories'] = df_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
df_filtered.columns = [column.split('.')[-1] for column in df_filtered.columns]

df_filtered.head()



Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,id
0,Dipal Food Market,Deli / Bodega,1001 W Cross St,US,Baltimore,United States,at W Hamburg St,795,"[1001 W Cross St (at W Hamburg St), Baltimore,...","[{'label': 'display', 'lat': 39.28057372686911...",39.280574,-76.629422,21230,MD,4df50b44a809141629a7f920
1,Ruben’s Mexican Food,Mexican Restaurant,,US,Baltimore,United States,,1197,"[Baltimore, MD 21230, United States]","[{'label': 'display', 'lat': 39.27718, 'lng': ...",39.27718,-76.61429,21230,MD,5a48991f8194fc667b02ca9f
2,Food Mart,Grocery Store,789 Washington Blvd,US,Baltimore,United States,Scott St,1050,"[789 Washington Blvd (Scott St), Baltimore, MD...","[{'label': 'display', 'lat': 39.28297822, 'lng...",39.282978,-76.628753,21230,MD,4c48c7566f1420a1a22eb954
3,Sunny Chinese Food,Chinese Restaurant,758 Washington Blvd,US,Baltimore,United States,,1055,"[758 Washington Blvd, Baltimore, MD 21230, Uni...","[{'label': 'display', 'lat': 39.28307398738335...",39.283074,-76.62778,21230,MD,4bb52cb746d4a593b9aec4c0
4,Shell,Gas Station,1712 Russell Street,US,Baltimore,United States,,175,"[1712 Russell Street, Baltimore, MD 21230, Uni...","[{'label': 'display', 'lat': 39.27321161908041...",39.273212,-76.629373,21230,MD,4b61787df964a520a7142ae3


In [20]:
# Add food venues to the map as green markers

for lat, lng, name in zip(df_filtered['lat'], df_filtered['lng'], df_filtered['name']):
    folium.Marker(
        [lat, lng],
        popup=folium.Popup(name, parse_html=True),
        icon=folium.Icon(color='green')
    ).add_to(city_map)
       


The Foursquare API has some issues with querying multiple venue categories at once, so here we query for gas stations separately.

In [21]:
# Let's add gas stations. The Foursquare API only allows one category at a time

search_query = 'gas'
radius = 1500  # 1.5 km from the location
# Define the Foursquare URL we will use for our queries
#url = 'https://api.foursquare.com/v2/search/recommendations?client_id={}&client_secret={}&ll={},{}&v={}&sections={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)

In [22]:
#Get Request and examine results
results = requests.get(url).json()
#print(results)
#Get the relevant part of the returned JSON and convert it to a pandas dataframe
# assign relevant part of JSON to venues
venues = results['response']['venues']

# Transform venues into a flat pandas DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize since pandas 1.0)
df_venues = pd.json_normalize(venues)

#Define data of interest and filter dataframe for just what we want
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in df_venues.columns if col.startswith('location.')] + ['id']
df_filtered = df_venues.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the nested column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
df_filtered['categories'] = df_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
df_filtered.columns = [column.split('.')[-1] for column in df_filtered.columns]

df_filtered.head()

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,id
0,Quest Gas Station,Gas Station,,US,Baltimore,United States,,1240,"[Baltimore, MD, United States]","[{'label': 'display', 'lat': 39.2752037180956,...",39.275204,-76.641646,,MD,4e70ffcee4cd5c86943e8db7
1,US Gas,Gas Station,1463 Washington Blvd,US,Baltimore,United States,at Bush St,1048,"[1463 Washington Blvd (at Bush St), Baltimore,...","[{'label': 'display', 'lat': 39.27587361627909...",39.275874,-76.63921,21230.0,MD,4c4b1f3e5609c9b6e30ab290
2,Shell,Gas Station,1712 Russell Street,US,Baltimore,United States,,175,"[1712 Russell Street, Baltimore, MD 21230, Uni...","[{'label': 'display', 'lat': 39.27321161908041...",39.273212,-76.629373,21230.0,MD,4b61787df964a520a7142ae3
3,The No Name Gas Station,Gas Station,"1004 St Paul St, Baltimore, MD 21202",US,Baltimore,United States,,1316,"[1004 St Paul St, Baltimore, MD 21202, Baltimo...","[{'label': 'display', 'lat': 39.27725023708575...",39.27725,-76.612871,21202.0,MD,4f7c955be4b06a80abd9e15d
4,BP,Gas Station,Haines St 2000,US,Baltimore,United States,,209,"[Haines St 2000, Baltimore, MD, United States]","[{'label': 'display', 'lat': 39.27245782635501...",39.272458,-76.629339,,MD,4c24c876c9bbef3bd950afac


In [23]:
# Add gas stations to the map as red markers

for lat, lng, name in zip(df_filtered['lat'], df_filtered['lng'], df_filtered['name']):
    folium.Marker(
        [lat, lng],
        popup=folium.Popup(name, parse_html=True),
        icon=folium.Icon(color='red')
    ).add_to(city_map)
       


In [24]:
# Let's get the ports of Baltimore to display on the map. The Foursquare API only allows one category at a time

search_query = 'Port'
radius = 5000  # 5 km from the location
# Define the Foursquare URL we will use for our queries
#url = 'https://api.foursquare.com/v2/search/recommendations?client_id={}&client_secret={}&ll={},{}&v={}&sections={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, loc_latitude, loc_longitude, VERSION, search_query, radius, LIMIT)

In [25]:
#Get Request and examine results
results = requests.get(url).json()
#print(results)
#Get the relevant part of the returned JSON and convert it to a pandas dataframe
# assign relevant part of JSON to venues
venues = results['response']['venues']

# Transform venues into a flat pandas DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize since pandas 1.0)
df_venues = pd.json_normalize(venues)

#Define data of interest and filter dataframe for just what we want
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in df_venues.columns if col.startswith('location.')] + ['id']
df_filtered = df_venues.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the nested column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
df_filtered['categories'] = df_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
df_filtered.columns = [column.split('.')[-1] for column in df_filtered.columns]

df_filtered.shape

(30, 15)

The "Port of Baltimore" venue names use inconsistent capitalization, so we filter with a case-insensitive match to keep only the actual "Port of Baltimore" ports.

In [26]:
df_ports = df_filtered[df_filtered['name'].str.contains("Port of Baltimore", case=False)]
df_ports


Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,id
1,Port of Baltimore - South Locust Point Terminals,Pier,2001 E McComas St,US,Baltimore,United States,,3233,"[2001 E McComas St, Baltimore, MD 21230, Unite...","[{'label': 'display', 'lat': 39.26370385271901...",39.263704,-76.592131,21230.0,MD,4c5195cd0ef3a593ffcd3d7c
2,Port of Baltimore - Cruise Maryland Terminal,Pier,2001 E McComas St,US,Baltimore,United States,,2667,"[2001 E McComas St, Baltimore, MD 21230, Unite...","[{'label': 'display', 'lat': 39.26523001850823...",39.26523,-76.598401,21230.0,MD,4b4b489df964a520709626e3
4,Port Of Baltimore - North Locust Point Terminals,Pier,Nicholson St.,US,Baltimore,United States,,3753,"[Nicholson St., Baltimore, MD 21230, United St...","[{'label': 'display', 'lat': 39.26871966323875...",39.26872,-76.584309,21230.0,MD,4f1e6543003906787174406d
11,Port Of Baltimore cruise Terminal,Cruise,,US,,United States,,4500,"[Maryland, United States]","[{'label': 'display', 'lat': 39.2333984375, 'l...",39.233398,-76.621826,,Maryland,565c32b6498ee88863406afb
12,Port of Baltimore,Pier,,US,Baltimore,United States,,4767,"[Baltimore, MD 21230, United States]","[{'label': 'display', 'lat': 39.25698063437564...",39.256981,-76.576424,21230.0,MD,4d1b74635acb721e8e860eb2
15,Port of Baltimore - Fairfield/Masonville Marin...,Pier,2900 Childs St,US,Baltimore,United States,,4523,"[2900 Childs St, Baltimore, MD 21226, United S...","[{'label': 'display', 'lat': 39.24781962784956...",39.24782,-76.586831,21226.0,MD,4cbf120d7dc9a0936aa736f5
18,Port Of Baltimore - Clinton St Piers,Pier,S Clinton St.,US,Baltimore,United States,,4932,"[S Clinton St., Baltimore, MD 21224, United St...","[{'label': 'display', 'lat': 39.26822809326266...",39.268228,-76.570587,21224.0,MD,4f1e63f600390678717421f7
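The case-insensitive match can be made a little more defensive. The sketch below, on a small hand-built Series, treats the pattern as plain text rather than a regular expression and counts missing names as non-matches. The third and fourth entries are hypothetical and only illustrate the non-matching and missing-name cases.

```python
import pandas as pd

names = pd.Series([
    "Port of Baltimore - Cruise Maryland Terminal",
    "Port Of Baltimore - Clinton St Piers",
    "Port Discovery",   # hypothetical non-port result for the "Port" query
    None,               # hypothetical venue with a missing name
])

# case=False ignores capitalization, regex=False treats the pattern as plain
# text, and na=False counts missing names as non-matches.
is_port = names.str.contains("Port of Baltimore", case=False, regex=False, na=False)
print(names[is_port].tolist())
```

Without `na=False`, a missing name would produce a missing value in the mask, which cannot be used directly for boolean indexing.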


In [27]:
# Let's add the ports of Baltimore to the map as blue markers

for lat, lng, name in zip(df_ports['lat'], df_ports['lng'], df_ports['name']):
    folium.Marker(
        [lat, lng],
        popup=folium.Popup(name, parse_html=True),
        icon=folium.Icon(color='blue')
    ).add_to(city_map)
       


## Results and Discussion <a name="results"></a>

Let's display the map with our data.

In [28]:
city_map

In the displayed map, we can see the following:
* A heat map of commercial and street robberies in Baltimore City
* A gray circle outlined in black marking a possible location in South Baltimore
* Red icons marking gas stations near the site
* Green icons marking food establishments in the area
* Blue icons marking the "Port of Baltimore" locations
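To put rough numbers on what the map shows, the straight-line distance from the candidate site to any venue can be computed with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` is illustrative, and the coordinates are taken from the venue and port listings in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


site = (39.2736, -76.6274)  # candidate location used earlier in the notebook
places = {
    "Shell": (39.273212, -76.629373),
    "Dipal Food Market": (39.280574, -76.629422),
    "Port of Baltimore - South Locust Point Terminals": (39.263704, -76.592131),
}
for name, (lat, lng) in places.items():
    print(f"{name}: {haversine_m(*site, lat, lng):.0f} m")
```

The results should land close to the `distance` values that Foursquare reports for the same venues. They are straight-line distances, so actual walking or driving routes will be longer.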

## Conclusion <a name="conclusion"></a>

InfoDyne Corporation can leverage the City of Baltimore in Maryland for its manufacturing operation. The location in South Baltimore has fewer robberies than most other areas of the city and provides ample access to highways such as Interstate 95, as well as easy access to the ports of Baltimore. Food and gas venues are not within easy walking distance, but employees can use public transportation or commute by local roads to nearby establishments.

Note: The location of this business will be very close to the Horseshoe Baltimore casino.