# Exploring Data Available for Austin, TX

This Jupyter Notebook is part of the Capstone Project for the IBM Applied Data Science Professional Certificate.

The direct GitHub link for this file is https://github.com/blackard/Coursera_Capstone/blob/master/Capstone%20Project%20Austin%20Data%20Exploration.ipynb, and the Jupyter Notebook Viewer link to pull the file from GitHub is https://nbviewer.jupyter.org/github/blackard/Coursera_Capstone/blob/master/Capstone%20Project%20Austin%20Data%20Exploration.ipynb.  Please use the Jupyter Notebook Viewer link to view the Folium maps included in the document.

This is the first part of a two-week Capstone Project for the IBM Applied Data Science Professional Certificate.  The requirement is to provide:
1. A description of the problem and a discussion of the background.
2. A description of the data and how it will be used to solve the problem.

I am presenting this in a somewhat mixed order, first providing some background about the selected city, then exploring the data available, and finally describing the proposed question and how the data will be used to answer that question.

## Background

Austin, TX is an interesting town.  I say town, but the Greater Austin metropolitan area had a population in excess of 2.2 million as of 2019, according to census data found on [Wikipedia - Greater Austin](https://en.wikipedia.org/wiki/Greater_Austin).  Austin itself is the 11th largest US city, and the 2nd largest US state capital, according to [Wikipedia - List of US cities by population](https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population).

Austin bills itself as The Live Music Capital of the World (see [Austin Relocation Guide](https://austinrelocationguide.com/live-music-capital-of-the-world/)), hosts the University of Texas at Austin, is one of the very few state capitals with a large economy unrelated to government business, and ranked number 1 in the [Best State Capitals to Live in](https://wallethub.com/edu/best-state-capitals/19030) in 2020.  Further, Austin is the heart of the Silicon Hills (see [Wikipedia - Silicon Hills](https://en.wikipedia.org/wiki/Silicon_Hills)), hosting a large number and variety of high-tech companies that provide a wealth of employment opportunities in a variety of fields.  Austin boasts a wide variety of opportunities for outdoor activity, and is also the home of the first purpose-built Formula 1 venue in the United States, [Circuit of the Americas](http://circuitoftheamericas.com/), which hosts a number of events and activities throughout the year.

Considering that Greater Austin has grown by over 100 people per day for several consecutive years, and that Austin is one of the fastest growing cities with a population over 1 million (see [Austin remains fastest-growing big city in country](https://www.bizjournals.com/austin/news/2020/03/26/austin-remains-fastest-growing-big-city-in-country.html)), it makes an interesting city to investigate.

Unfortunately, there isn't an aggregated collection of datasets covering all the Counties and Municipalities that make up Greater Austin.  Further, I was unable to find separate datasets that could be combined into a unified picture of the area.  Therefore, my investigation will be limited to The City of Austin itself.

### Austin Geographic Breakdowns

Austin provides access to a trove of datasets through the [Austin Open Data Portal](https://data.austintexas.gov/), as well as datasets from Austin's [GIS Data and Maps Department](http://www.austintexas.gov/department/gis-data), and [Demographic Data](https://www.austintexas.gov/page/demographic-data) aggregated by the city.  Digging into the data, there are three primary geographic breakdowns for the city.  First is what Austin calls Neighborhood Reporting Areas.  Each Neighborhood Reporting Area is comprised of several smaller elements such as Subdivisions or individual Neighborhoods, but The City of Austin seems to use Neighborhood Reporting Areas as the primary mechanism for planning and reporting.  There are also breakdowns by Census Tract or Census Block Group, and breakdowns by Zip Code.

### Difficulties Relating Neighborhood, Zip Code, Counties, Census Tracts or Other Data

The US Department of Housing and Urban Development (HUD) has published what it calls crosswalk files that can map between Zip Codes and Census Tracts or Census Block Groups.  While there is a one-to-many relationship between Counties and Census Tracts, the relationships between Zip Codes and Census Tracts, Neighborhoods and Zip Codes, or Neighborhoods and Census Tracts are many-to-many - meaning, for example, that one Neighborhood may be related to more than one Zip Code, and one Zip Code may be related to more than one Neighborhood.
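
To make the many-to-many point concrete, here is a minimal sketch that classifies a crosswalk by counting distinct partners on each side.  The Zip Codes and tract labels are made-up placeholders, not values taken from the HUD files, and `relationship_type` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (Zip Code, Census Tract) pairs, for illustration only
crosswalk = [
    ("78701", "tract_A"),
    ("78701", "tract_B"),
    ("78702", "tract_B"),
    ("78702", "tract_C"),
    ("78703", "tract_D"),
]

def relationship_type(pairs):
    """Classify (left, right) pairs as one-to-one, one-to-many,
    many-to-one, or many-to-many."""
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)

    # Does any left value map to several right values, and vice versa?
    left_has_many = any(len(r) > 1 for r in rights_per_left.values())
    right_has_many = any(len(l) > 1 for l in lefts_per_right.values())

    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"

print(relationship_type(crosswalk))
```

In this toy data, Zip Code 78701 touches two tracts and `tract_B` touches two Zip Codes, so values cannot be rolled up from one breakdown to the other without some apportioning rule.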

Other data that would be valuable in terms of choosing a neighborhood to live in are also problematic to work with.  For example, School District boundaries and designated school assignments do not follow other existing boundary data, and therefore cannot be aggregated properly with other candidate breakdowns.

For this reason, it is best to pick one breakdown of the region, and then use datasets that follow that same breakdown.

The choice here, then, is to use Neighborhood Reporting Areas, and to keep the analysis and reporting at that level.  This doesn't eliminate our challenges, as will be seen later, but provides a target for the investigation.

### Selected Candidate Datasources and Datasets

There is a GeoJSON file on the [Austin Open Data Portal](https://data.austintexas.gov/) providing Boundary data for each of Austin's 103 Neighborhood Reporting Areas.  The latest version of that file is from [January 4, 2021](https://data.austintexas.gov/resource/a7ap-j2yt.json).

The [Demographic Data](https://www.austintexas.gov/page/demographic-data) provides two interesting sets of demographic data as Excel files - [Table I: population, race and ethnicity, housing and density](https://www.austintexas.gov/sites/default/files/files/Planning/Demographics/Neighborhood_Reporting_Areas_Table_I.xlsx) and [Table II: household characteristics and age structure](https://www.austintexas.gov/sites/default/files/files/Planning/Demographics/Neighborhood_Reporting_Areas_Table_II.xlsx).

Adding this demographic data to the venue frequency data we've previously computed from content pulled from the Foursquare API may result in some interesting breakdowns in the K-Means Clustering.

### Possible Investigation Paths

Given the growth of The City of Austin over the last decade, it would be interesting to use Data Science methods to analyze Neighborhoods that people may want to live in if they actually choose to move to Austin.

There is some data that is not readily available for Neighborhood analysis.  For example, Texas assigns the schools a child would attend based on frequently changing zone maps that are not related to any of the breakdowns readily available, so school quality cannot be readily used in this analysis.  The crime report data available from the [Austin Open Data Portal](https://data.austintexas.gov/) provides Zip Code and Address, but not Neighborhood, and no readily accessible, freely available method for identifying the Neighborhood has been found.  These data, therefore, are not available for investigation at this time.

## Problem Proposal

The question, then, is: What might be good neighborhoods for a new Austinite to choose to live in?

I will try to use Clustering to group Neighborhoods into related sets.  First, build an investigative dataset by combining sets of demographic data with frequencies of the most common venue types reported by Foursquare within the neighborhood.  Next, use this data to perform K-Means Clustering, and analyze the data for the resulting clusters to try to extract key or defining characteristics of each cluster.  Finally, present this information for use when seeking to determine where one might choose to live if moving to Austin, TX.

### Using the Datasets

Neighborhood Boundary Data from the [January 4, 2021](https://data.austintexas.gov/resource/a7ap-j2yt.json) GeoJSON file will be used to produce a color-coded Choropleth map in Folium, rather than color-coding map markers.  By color-coding the bounding areas of each neighborhood, it is hoped the content will be more meaningful.

Housing and Population Density data from [Table I: population, race and ethnicity, housing and density](https://www.austintexas.gov/sites/default/files/files/Planning/Demographics/Neighborhood_Reporting_Areas_Table_I.xlsx) will be used to provide information about the population density of each neighborhood.  The race and ethnicity data in this table will be excluded from the analysis.

Counts of Families with Children, and a modified Age breakdown, will be extracted from [Table II: household characteristics and age structure](https://www.austintexas.gov/sites/default/files/files/Planning/Demographics/Neighborhood_Reporting_Areas_Table_II.xlsx).  Existing age groupings will be aggregated into fewer groups, including School Age Children, Late Teens to Early 20s, Mid 20s to Late 50s, and Seniors.
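
As a rough sketch of that regrouping, the snippet below sums fine-grained age brackets into the broader groups.  The bracket labels, their boundaries, and the counts are assumptions for illustration; the actual column names in Table II may differ, and `aggregate_age_groups` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical source brackets mapped to the broader groups named above.
# The real Table II column labels and cut-offs may be different.
AGE_GROUP_MAP = {
    "5 to 9": "School Age Children",
    "10 to 17": "School Age Children",
    "18 to 24": "Late Teens to Early 20s",
    "25 to 44": "Mid 20s to Late 50s",
    "45 to 59": "Mid 20s to Late 50s",
    "60 to 74": "Seniors",
    "75 and over": "Seniors",
}

def aggregate_age_groups(bracket_counts, group_map=AGE_GROUP_MAP):
    """Sum population counts per broader age group for one neighborhood."""
    totals = defaultdict(int)
    for bracket, count in bracket_counts.items():
        group = group_map.get(bracket)
        if group is not None:  # brackets without a mapping are skipped
            totals[group] += count
    return dict(totals)

# Made-up counts for a single neighborhood
example_row = {
    "5 to 9": 120, "10 to 17": 200, "18 to 24": 310,
    "25 to 44": 900, "45 to 59": 450, "60 to 74": 220, "75 and over": 80,
}
print(aggregate_age_groups(example_row))
```

The same mapping could be applied to a pandas DataFrame by summing the matching columns; plain dictionaries are used here only to keep the sketch self-contained.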

Frequency of Venue Types data collected from the Foursquare API will be aggregated with the Housing, Population, Family and Age data above to build a Dataset suitable for K-Means clustering.  The resulting Clusters will be included on a Folium Choropleth map, and a review of the data for each cluster will attempt to extract key characteristics of the group.
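
One way to picture the venue frequency step is sketched below: for each neighborhood, the share of its venues falling in each category.  The venue pairs are invented, and `venue_frequencies` is a hypothetical helper rather than code from this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs, for illustration only
venues = [
    ("Dawson", "Coffee Shop"),
    ("Dawson", "Coffee Shop"),
    ("Dawson", "Taco Place"),
    ("Dawson", "Park"),
    ("Windsor Park", "Park"),
    ("Windsor Park", "Grocery Store"),
]

def venue_frequencies(pairs):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1

    # Use the same category columns for every neighborhood
    categories = sorted({category for _, category in pairs})

    table = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        table[neighborhood] = {c: counter[c] / total for c in categories}
    return table

freq = venue_frequencies(venues)
print(freq["Dawson"])
```

Each row is equivalent to the mean of one-hot encoded category columns grouped by neighborhood.  Rows like these could then be joined with the demographic columns and passed to a K-Means implementation such as scikit-learn's `KMeans`; because counts and shares sit on very different scales, some form of feature scaling would likely be needed first.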

### Challenges

One challenge to overcome with the GeoJSON data and ArcGIS API will be determining the centroid for each Neighborhood.  Given that computing a centroid for an irregular, non-convex polygon is not trivial, the original hope was that a geocoding lookup of each Neighborhood name could be used to collect its Latitude and Longitude.  Unfortunately, the results were not always ideal.
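
For reference, the area-weighted centroid of a single simple polygon can be computed with the shoelace formula.  The sketch below is a minimal version that treats coordinates as planar (an approximation that is probably acceptable at neighborhood scale); `polygon_centroid` is a hypothetical helper, not part of this notebook.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) pairs."""
    ring = [(p[0], p[1]) for p in ring]
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # close the ring if needed

    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0  # shoelace term for this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross

    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area
    return cx / (3 * twice_area), cy / (3 * twice_area)

# A U-shaped (non-convex) polygon: a 3x3 square with a notch cut from the top
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
print(polygon_centroid(u_shape))
```

For the U shape, the centroid lands inside the notch, which is outside the polygon itself.  That illustrates why a centroid is not always a good representative point for a non-convex boundary, quite apart from the extra work needed for MultiPolygons.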

Let's take a look at what we got.

## An Initial Map of Neighborhoods

First, we have to perform installations and imports, and define some functions.

In [1]:
# Imports for the Project as a whole

# NOTE: Outputs for this are hidden since they don't add to the value of the review

# Install and Import folium - we'll use this for creating maps, displaying boundaries and specific points of interest
!pip install folium
import folium

# Install and Import geocoder - we'll use this to get longitude and latitude data from ArcGIS
!pip install geocoder
import geocoder

# Import requests - we'll use this to make HTTP requests
import requests

# Import pandas
import pandas as pd

# Import pandas json_normalize (the pandas.io.json location is deprecated since pandas 1.0)
from pandas import json_normalize

# Install and Import geopandas
!pip install geopandas
import geopandas

# Install and Import geojson
!pip install geojson
import geojson



In [2]:
# The code was removed by Watson Studio for sharing.

NOTE: We're reusing several functions from one of the practice exercises.  Credit to [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin) for their work on the [DS0701EN-3-3-2-Neighborhoods-New-York](https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DS0701EN/DS0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb) practice lab.

In [3]:
# Functions for the Project as a whole

# Foursquare API version
FOURSQUARE_VERSION = '20180605'
# A default Foursquare API limit value
FOURSQUARE_LIMIT = 300
# A default Foursquare API search radius
FOURSQUARE_RADIUS = 1000

# Lookup the Latitude and Longitude of a named Location
def get_location_coordinates(location):
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        # NOTE: Google failed to return results, but ArcGIS was very good at finding the coordinates
        g = geocoder.arcgis(location)
        lat_lng_coords = g.latlng
      
    return(lat_lng_coords[0], lat_lng_coords[1])

# Extract Category Types from Foursquare result
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # Fall back to the flattened column name produced by json_normalize
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Collect venues for multiple locations
def getNearbyVenues(names, latitudes, longitudes, radius=FOURSQUARE_RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            FOURSQUARE_CLIENT_ID, 
            FOURSQUARE_CLIENT_SECRET, 
            FOURSQUARE_VERSION, 
            lat, 
            lng, 
            radius, 
            FOURSQUARE_LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

# Get the top Venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

### Collect the Latitude and Longitude of each Neighborhood

Our first step is to try to determine the Latitude and Longitude for each Neighborhood.  For this we'll use ArcGIS to do a lookup of each Neighborhood name.

In [4]:
# Find the Latitude and Longitude of Austin, TX
city_state = "Austin, TX"
latitude, longitude = get_location_coordinates(city_state)

# The Austin Neighborhood Reporting Area GeoJSON
austin_nra_geojson = 'https://data.austintexas.gov/resource/a7ap-j2yt.geojson'

# First read the GeoJSON file and extract the Neighborhoods
df_austin_nra_geojson = geopandas.read_file(austin_nra_geojson)
df = df_austin_nra_geojson[['neighname']].copy()
# Make it pretty by renaming the column and converting the data to title case
df.columns=(['Neighborhood'])
df['Neighborhood'] = df['Neighborhood'].str.title()

# Add the new Latitude and Longitude columns to the Dataframe with zeros - this technique avoids warnings about updating slices
df = df.assign(Latitude=[0.0 for _ in range(len(df))])
df = df.assign(Longitude=[0.0 for _ in range(len(df))])

# Collect the Latitude and Longitude for the Austin Neighborhoods from ArcGIS
for index, row in df.iterrows():
    neigh_lat, neigh_long = get_location_coordinates('{}, {}'.format(row['Neighborhood'], city_state))
    df.loc[index,'Latitude'] = neigh_lat
    df.loc[index,'Longitude'] = neigh_long                                                     

df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Anderson Mill | 30.45619 | -97.80468 |
| 1 | Windsor Park | 30.31711 | -97.69981 |
| 2 | Dawson | 30.23569 | -97.76082 |
| 3 | West University | 30.28814 | -97.74728 |
| 4 | Mlk | 30.28387 | -97.69531 |


### Create the Map

Now we begin assembling the map.  First, we'll create the Folium map, and add the Neighborhood Boundary data using the GeoJSON file.  We include a tooltip (aka hover-text) for each Neighborhood so that, as the mouse moves over the Neighborhood boundary, the Neighborhood Name is displayed.

Next, we add circular Markers for each Neighborhood based on its retrieved Latitude and Longitude.  We include a popup for each Marker that will display the Neighborhood name.  In all cases, it is hoped the Neighborhood Marker is within the Neighborhood Boundary.

In [5]:
# Create the Folium map
map_austin_neighborhood = folium.Map(location=[latitude, longitude], zoom_start=11, control_scale = True)
# Add Neighborhood Boundaries to the map using the GeoJSON file
folium.GeoJson(austin_nra_geojson, 
               name="Neighborhood",
               tooltip=folium.GeoJsonTooltip(
                   fields=['neighname'],
                   aliases=['Neighborhood'],
                   localize=True
               )).add_to(map_austin_neighborhood)

# add markers to map
for index, row in df.iterrows():
    label = row['Neighborhood']
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_austin_neighborhood)

folium.LayerControl().add_to(map_austin_neighborhood)

map_austin_neighborhood

### Good and Bad Results

As can be seen, there is a mixture of good and bad results.  Take, for example, Jester, West Oak Hill and Franklin Park.  For those three neighborhoods, the Latitude and Longitude retrieved from ArcGIS seem reasonably located within their respective Neighborhood Boundaries.  However, for Barton Creek Mall, Avery Ranch -- Lakeline, and Bluff Springs, the Latitude and Longitude are either on the border of, or lie entirely outside, the Neighborhood boundary.
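
Rather than judging this by eye, the check could be automated with a point-in-polygon test.  The following is a minimal ray-casting sketch that works on GeoJSON-style MultiPolygon coordinates (a list of polygons, each a list of rings of `[lng, lat]` positions); the function names and test shapes are hypothetical, and a library such as shapely would be the more robust choice in practice.

```python
def point_in_ring(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of positions."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Does the edge from j to i straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def point_in_multipolygon(lng, lat, multipolygon_coords):
    """True if the point is inside any polygon and outside that polygon's holes."""
    for polygon in multipolygon_coords:
        exterior, holes = polygon[0], polygon[1:]
        if point_in_ring(lng, lat, exterior) and not any(
            point_in_ring(lng, lat, hole) for hole in holes
        ):
            return True
    return False

# Two made-up, disconnected squares standing in for a MultiPolygon boundary
square_a = [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]
square_b = [[10, 10], [12, 10], [12, 12], [10, 12], [10, 10]]
boundary = [[square_a], [square_b]]

print(point_in_multipolygon(2, 2, boundary))
print(point_in_multipolygon(6, 6, boundary))
```

Points that fall exactly on an edge are not handled consistently by this simple version, so markers sitting right on a border may be classified either way.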

Looking at the content of the GeoJSON, it can be seen that the boundary data is provided as MultiPolygons, and in several cases the boundary consists of separate, unconnected polygons; Del Valle East, for example, is actually three separate polygons.  Let's look at an extract of the GeoJSON file, as well as a map of just the Del Valle East neighborhood boundaries.

In the GeoJSON extract below, we can see that Anderson Mill's geometry entry is a MultiPolygon.  Further, we can see that Del Valle East's geometry is a MultiPolygon with three sets of coordinates.

```json
{
	"type": "FeatureCollection",
	"features": [
		{
			"type": "Feature",
			"properties": {
				... content removed...
				"neighname": "ANDERSON MILL",
				... content removed...
			},
			"geometry": {
				"type": "MultiPolygon",
				"coordinates": [
					... content removed...
				]
			}
		},
		... content removed...
		{
			"type": "Feature",
			"properties": {
				... content removed...
				"neighname": "DEL VALLE EAST",
				... content removed...
			},
			"geometry": {
				"type": "MultiPolygon",
				"coordinates": [
					[
						... content removed...
					],
					[
						... content removed...
					],
					[
						... content removed...
					]
				]
			}
		},
    ]
	... content removed...
}
```
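
To find every neighborhood whose boundary is split into several pieces, the number of polygons per feature can be counted directly from the parsed GeoJSON.  This is a small sketch using a made-up stand-in for the real file; `count_polygon_parts` is a hypothetical helper, and only the `neighname` property is taken from the extract above.

```python
import json

def count_polygon_parts(feature_collection, name_property="neighname"):
    """Map each feature's name to the number of polygons in its geometry."""
    counts = {}
    for feature in feature_collection["features"]:
        geometry = feature["geometry"]
        name = feature["properties"].get(name_property)
        if geometry["type"] == "MultiPolygon":
            counts[name] = len(geometry["coordinates"])
        elif geometry["type"] == "Polygon":
            counts[name] = 1
    return counts

# Tiny stand-in for the real file, with empty placeholder polygons
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"neighname": "ANDERSON MILL"},
     "geometry": {"type": "MultiPolygon", "coordinates": [[[]]]}},
    {"type": "Feature",
     "properties": {"neighname": "DEL VALLE EAST"},
     "geometry": {"type": "MultiPolygon", "coordinates": [[[]], [[]], [[]]]}}
  ]
}
""")

parts = count_polygon_parts(sample)
print({name: n for name, n in parts.items() if n > 1})
```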

In order to create a map of just the Del Valle East neighborhood borders, I extracted that GeoJSON feature, saved it to a new file, and uploaded it to GitHub at [Austin_TX_Neighborhoods_Del_Valle_East.geojson](https://github.com/blackard/Coursera_Capstone/blob/master/Austin_TX_Neighborhoods_Del_Valle_East.geojson).
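
The extraction could also be done programmatically instead of by hand.  The sketch below filters a parsed FeatureCollection down to one named feature; `extract_feature` and the stand-in data are hypothetical, and this is not necessarily how the uploaded file was produced.

```python
import json

def extract_feature(feature_collection, name, name_property="neighname"):
    """Return a new FeatureCollection holding only the features with this name."""
    matches = [
        feature
        for feature in feature_collection["features"]
        if feature["properties"].get(name_property) == name
    ]
    return {"type": "FeatureCollection", "features": matches}

# Stand-in for the parsed Neighborhood Reporting Areas GeoJSON
all_neighborhoods = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"neighname": "ANDERSON MILL"},
         "geometry": {"type": "MultiPolygon", "coordinates": [[[]]]}},
        {"type": "Feature",
         "properties": {"neighname": "DEL VALLE EAST"},
         "geometry": {"type": "MultiPolygon", "coordinates": [[[]], [[]], [[]]]}},
    ],
}

del_valle_east = extract_feature(all_neighborhoods, "DEL VALLE EAST")
# json.dumps gives the text that would be written to a .geojson file
print(json.dumps(del_valle_east)[:80])
```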