# This notebook is the Coursera/IBM Applied Data Science Project and Report

## Week 2 Assignment - Battle of Neighborhoods

### A Report on Choosing Housing Features


**Introduction** Companies seek to optimize their marketing efforts in order to attract more, and hopefully better-aligned, sales leads. To achieve this goal, companies must understand and respond to consumers’ needs and desires, and they need to communicate their product offerings in ways that customers will find acceptable. Moreover, consumers must be able to find options and make choices based on their unique situations. Watt (2019) explains how companies use multiple customer information data sets in their marketing strategies for lead generation. These observations show that customers’ personal perspectives are targets for companies’ marketing focus. This report demonstrates an approach to creating customer profiles that can be used for targeting, based on shared characteristics in customers’ historical behavioral data. The example used in this report is a prospective housing consumer.


**Problem Statement** The general problem for all companies involved in housing is the need to optimize lead generation. The specific problem is that housing suppliers and consumers find that the data needed to make informed housing decisions are often not readily available. Consumers in the housing market may find that the information they need is spread across multiple sources and locations. Prospective buyers and renters cannot efficiently compare available neighborhoods with their needs and desires. This problem leaves housing consumers with fairly involved research and analysis tasks to do on their own. On the other hand, planners and suppliers that can identify and segment prospects are better able to focus their marketing strategies and to communicate products, features, and services most effectively.

The HADS data, derived from the AHS National Data (2013), come from a government survey that records nationwide US housing data. The dataset contains both consumer data and industry averages. Anecdotal observations suggest that consumers also weigh other factors about the neighborhoods they are considering, such as social desirability and trade availability. Foursquare.com created a database of events and venues in neighborhoods that is useful for creating consumer "clusters" that reflect their behaviors and choices. It is reasonable to infer that companies should seek to understand buyers and renters through a detailed analysis of such key characteristics.

Buyers and renters prefer the most desirable neighborhoods within their budget and other _"suitability"_ criteria. Therefore, while the HUD survey contains data about housing affordability through an economic lens, the Foursquare data on business venues provide additional _"quality-of-living features"_ that housing prospects may find useful. Accordingly, this report may be useful to those involved in various aspects of the housing market, such as city and community planners. The report identifies the general characteristics of neighborhoods and residents. Data scientists commonly use _"clustering techniques"_ to segment prospects and customers, which is the approach taken here. Managers can use the customer segments as the basis for targeting, such as unique and specific marketing communications, personalized engagement and guidance, service and support, and other customer experience approaches. The scope of this project is limited to the ‘US West’ region as described in the HUD survey, and to zip codes within the City of San Jose.


**Research Questions** The main questions regarding creating customer profiles and segments in the housing market are:<br>
Question 1: What are the primary characteristics of house buyers and renters in the western US Census Region?<br>
Question 2: What are the primary characteristics of house buyers and renters in the City of San Jose?

The approach used in this project to answer these questions is based on Maxwell (2013), which suggests the model components listed below.

Goal: This research and report support both prospective housing consumers and agencies involved in the housing market. Governmental agencies have already invested in data acquisition that can be used as proposed in this project.

Conceptual Framework: Watt (2019) reported the results of interviews with marketing executives, which suggest using multiple datasets to better understand consumer behaviors and to enhance creativity and innovation in serving their needs.

Methods: The United States Office of Policy Development and Research (PD&R) created the HUD User program in 1978 to support housing research. PD&R owns the HUD reporting, which supports its suitability for this report. CMO.com is an industry leader in tracking and reporting on the digital transformation happening within companies. Its reporting reflects the views of industry thought leaders and experts.

Validity: A limitation of this report is that its format and scope are defined by the project definition for the Applied Data Science Capstone Project, a final assessment for course completion and certification. Therefore, the results presented should not be used beyond this intent.


**Methodology**

Each row in the HUD dataset represents a housing unit; it will be assigned a label for k-means clustering into feature groups. <br>Other HUD data provide the following items: 1. age of head of household, 2. household income, 3. income relative to median income, 4. cost of homes relative to median income, 5. housing costs, 6. fair market rent, 7. census region.

Data extracted from the Foursquare database are used to enhance the analysis by identifying the types of venues located within the neighborhoods studied. Therefore, the customer segments, or clusters, represent groups of shared characteristics. Individual profiles reflect the head of household’s characteristics in relation to the types of units they live in, their income, whether they own or rent, and comparisons to known averages.

 - Customer_Id
 - Age of head of household
 - Household Income
 - HH Income Relative to Median Income (Category)
 - Cost at 6% Relative to Median Income (Category)
 - Housing_cost
 - Fair_Mkt_Rent(avg)
 - Census_REGION
 - Postal Code
 - Venues

Table 1. Data Variables

| Variables     | Source        | ID    |
| ------------- |:-------------:| -----:|
| Customer_id   | HUD           | control   |
| Age of head of household | HUD      |   age1 |
| Household Income | HUD      |    ZINC2 |
| HH Income Relative to Median Income (Category)   | HUD   | FMTINCRELAMICAT   |
| Cost at 6% Relative to Median Income (Category)   | HUD   | FMTCOST06RELAMICAT   |
| Housing_cost   | HUD   | burden   | 
| Fair_Mkt_Rent(avg)   | HUD   | FMR   |
| Census_REGION   | HUD   | FMTREGION   |
| Postal Code   | US Postal Service   | ZIP   |
| Venues   | Foursquare   | Venues   |


This report focuses on identifying consumer groups using some of the common characteristics identified among them related to housing features. k-means clustering is a popular customer segmentation technique that is widely used for this type of study. The k-means approach creates clusters based on the similarities between customer profiles using the data items listed in Table 1.
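
As a rough illustration of how k-means groups similar profiles, the sketch below clusters a handful of made-up household profiles. The values, the choice of three features, and `n_clusters=2` are assumptions for illustration only; they are not taken from the HUD data or from the analysis in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical household profiles (made-up values):
# [age of head of household, household income, fair market rent]
profiles = np.array([
    [25, 30000, 900],
    [27, 32000, 950],
    [30, 35000, 1000],
    [55, 120000, 2500],
    [60, 130000, 2600],
    [58, 125000, 2550],
], dtype=float)

# Standardize first so that income, which has the largest raw values,
# does not dominate the distance calculation
scaled = StandardScaler().fit_transform(profiles)

# k=2 is chosen only for this illustration
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_

# Each household receives the label of its nearest cluster center
print(labels)
```

Households with similar age, income, and rent end up with the same label, which is the basis for the customer segments discussed in this report.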

**Process Steps**

Shown below are descriptions of the process steps for this analysis:

**Download and Explore Dataset**    
 - Data Acquisition: Create a csv file by extracting variables from the HUD dataset.
 - Slice the HUD data to select only 'West' region rows.
 - Preprocessing: Use the csv file to create a Pandas dataframe for the analysis. Normalize the analysis dataframe using scikit-learn's StandardScaler().
    
**Explore Neighborhoods in San Jose, California**
 - Find postal codes for San Jose, CA and surrounding districts and neighborhoods on the city-data.com site and create a csv file.
 - Use the csv file to create a geoJSON file. 
 - These data will be used in searching for neighborhood venues via geolocation and the Foursquare API

**Modelling using k-Means** 
 - Analyze Each Neighborhood by zipcode
    
**Analysis**
 - Create Neighborhood Clusters using k-Means processing
 - Identify the key profile indicators within neighborhood cluster groups.
 - Create textual label identifiers explaining the group characteristics in each cluster 
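
The download-and-preprocess steps listed above could be sketched roughly as follows. The rows are made up, the column names follow the IDs in Table 1, and the assumption that the region column holds the literal string `'West'` is not confirmed by this notebook. The z-score is computed with pandas using the population standard deviation, which corresponds to the default behavior of scikit-learn's `StandardScaler`.

```python
import pandas as pd

# Hypothetical rows; column names follow the IDs in Table 1, values are made up
hud = pd.DataFrame({
    "control": [1, 2, 3, 4],
    "FMTREGION": ["West", "South", "West", "West"],
    "age1": [30, 45, 50, 70],
    "ZINC2": [40000, 52000, 60000, 80000],
    "FMR": [1000, 800, 1500, 2000],
})

# Slice the data to keep only 'West' region rows (assumed label)
west = hud[hud["FMTREGION"] == "West"].reset_index(drop=True)

# Select the numeric features and standardize them (z-score, ddof=0)
features = west[["age1", "ZINC2", "FMR"]].astype(float)
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(2))
```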


### Download all the dependencies that we will need.

In [1]:
# Import libraries
# library for data analysis
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# flatten a JSON structure into a pandas dataframe
# (pandas >= 1.0 exposes json_normalize at the top level; older versions used pandas.io.json)
from pandas import json_normalize

# library to handle data in a vectorized manner
import numpy as np

# library to handle requests
import requests 
import urllib.request
import time

from datetime import datetime

from bs4 import BeautifulSoup
import csv

In [2]:
# library to handle JSON files
import json 

# install geopy (comment this line out if it is already installed)
!conda install -c conda-forge geopy --yes 

# convert an address into latitude and longitude values
import geopy
from geopy.geocoders import Nominatim 

Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [3]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# plotting library
import matplotlib.pyplot as plt
# backend for rendering plots within the browser
%matplotlib inline

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import os

# map rendering library: install folium (comment this line out if it is already installed)
!conda install -c conda-forge folium=0.5.0 --yes

import folium
from folium import FeatureGroup, LayerControl, Map, Marker
from IPython.display import HTML, display

print('Libraries imported.')

Collecting package metadata: ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


## Download the HUD Housing Affordability Data System (HADS) dataset from data.gov web site

## Scrape city-data.com web page to obtain San Jose zip codes

In [None]:
# Set the URL you want to webscrape from
url = 'http://www.city-data.com/zipmaps/San-Jose-California.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text,"html.parser")

# pause the code for a sec
time.sleep(1)

In [None]:
# .find_all returns a list of all matching elements

# 1: search the soup object for all <a> tags on the web page that have both an href and a title
# full_tag = soup.find_all('a', href=True, title=True)

# 2: search the soup object for all <a> tags on the web page that have an href attribute
full_tag = soup.find_all('a', href=True)

full_tag

In [None]:
# save the full_tag list to a Pandas dataframe, one string (the tag's HTML) per row
raw_neigh_df = pd.DataFrame([str(tag) for tag in full_tag])

# add a column to the dataframe to use for validating neighborhood names; use to 'sort' the web listings in Excel
# raw_neigh_df ['valid'] = 0

In [None]:
# show the dimensions and first rows of the web scraping dataframe
print(raw_neigh_df.shape)
raw_neigh_df.head(10)

In [None]:
# save raw_neigh_df to a csv file
# example: df.to_csv('test.csv', index=False, header=False)

# save the dataframe to a csv file to be used for editing with Excel functions
raw_neigh_df.to_csv('san_jose_web_scrapping.csv', index=False, header=False)
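
As a possible alternative to sorting the scraped links by hand, the sketch below pulls five-digit zip codes out of anchor tags. It uses only the Python standard library so it can be run on its own. The `sample_html` markup and the `LinkCollector` helper are hypothetical; the real city-data.com page structure is not shown in this notebook and may differ.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup, made up for this illustration
sample_html = """
<a href="/zips/95110.html">95110</a>
<a href="/zips/95112.html">95112 Zip Code</a>
<a href="/about.html">About</a>
<a name="top">no href</a>
"""

collector = LinkCollector()
collector.feed(sample_html)

# Keep any five-digit number found in the link text
zip_pattern = re.compile(r"\b\d{5}\b")
zip_codes = sorted({m for _, text in collector.links for m in zip_pattern.findall(text)})
print(zip_codes)
```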

###  OPTIONAL: use this code to save the list of found web strings to a csv file
example:

    with open("c:/sanjose_neigh.csv", "w") as fp:
        writer = csv.writer(fp)
        writer.writerows(full_tag)


## Create "Neighborhoods" Pandas dataframe

#### Create a dataset that contains the districts and neighborhoods in each of them

#### Read the 'sanjose_districts_neighborhoods_valid.csv' file into a pandas dataframe


In [4]:
# after web scraping, use Excel functions to edit the csv file. Then, get the data into a dataframe
input_df = pd.read_csv('sanjose_districts_neighborhoods_valid.csv', header=None)

In [5]:
# show the size of the input_df
input_df.shape

(42, 3)

In [6]:
input_df.head()

Unnamed: 0,0,1,2
0,District,Name,Valid
1,Campbell,SJ_Campbell_Los Gatos,1
2,Central,SJ_Santa Clara,1
3,Central,SJ_Downtown1,1
4,Central,SJ_Downtown2,1


In [7]:
# the file was read with header=None, so assign the column names to be used later in the geojson file
# (note: row 0 still holds the file's own header text)
input_df.columns = ['District', 'Neighborhood', 'Valid']

In [8]:
# drop column 'Valid' from the dataframe; it will not be used
input_df = input_df.drop('Valid', axis=1)

In [9]:
# show dataframe of districts and neighborhoods
input_df.head()

Unnamed: 0,District,Neighborhood
0,District,Name
1,Campbell,SJ_Campbell_Los Gatos
2,Central,SJ_Santa Clara
3,Central,SJ_Downtown1
4,Central,SJ_Downtown2


## Check the Dataframe of Districts and Neighborhoods

### Convert the pandas 'input_df' dataframe to a csv file; it will be the input file for geopy to find latitude and longitude data

In [10]:
# import pandas as pd
# create sanjose_data csv file using input_df dataframe. 
input_df.to_csv('sanjose_data.csv', index=False, header=None)

## Use this code for testing purposes, only!

Use to get San Jose, CA location coordinates using geopy. 


In [None]:
# Use this code to get city latitude/longitude data

# NOTE: Nominatim(user_agent="my-application") is recommended on the geopy web site
# import  os

# Use the geopy library to get the latitude and longitude values of San Jose, CA.
# In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent my_explorer, 
# as shown below.
address = 'San Jose, CA'

geolocator = Nominatim(user_agent="my_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Jose, CA are {}, {}.'.format(latitude, longitude))


### This is the complete portable code to get location data.

In [None]:
# Once this code is finished, it will output a new CSV file with two added columns: latitude and longitude
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3
# versions used: geopy 1.10.0, pandas 0.16.2, python 2.7.8

def main():
    io = pd.read_csv('sanjose_neighborhoods.csv', index_col=None, header=0, sep=",")
    
    # geocode() returns None when an address is not found, so guard against that
    def get_latitude(x):
        return x.latitude if x is not None else None
    
    def get_longitude(x):
        return x.longitude if x is not None else None

    # geolocator to use (GoogleV3, imported above, is an alternative)
    geolocator = Nominatim(user_agent="sj_explorer")

    geolocate_column = io['district'].apply(geolocator.geocode)
    io['latitude'] = geolocate_column.apply(get_latitude)
    io['longitude'] = geolocate_column.apply(get_longitude)
    io.to_csv('geocoding-output.csv')

if __name__ == '__main__':
    main()

### Use the 'geocoding-output.csv' file from above as input to the geoJSON tool (www.geojson.io). The result will be a geoJSON file. Name it 'sanjose_geoJSON'.
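
For readers who prefer to build the geoJSON file in code rather than with the web tool, the sketch below converts geocoded rows into a `FeatureCollection` with the same shape as the file loaded in the next section. The CSV column names (`name`, `district`, `PostalCode`, `latitude`, `longitude`) and the `rows_to_geojson` helper are assumptions for illustration; the actual layout of 'geocoding-output.csv' may differ.

```python
import csv
import io
import json

# Hypothetical sample of geocoded rows; the column names are assumed
sample_csv = io.StringIO(
    "name,district,PostalCode,latitude,longitude\n"
    "SJ_Downtown1,Central,95112,37.33864975,-121.8854218\n"
)


def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection of Point features from dict rows."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            "properties": {
                "name": row["name"],
                "district": row["district"],
                "PostalCode": int(row["PostalCode"]),
            },
            "geometry": {
                "type": "Point",
                # GeoJSON stores coordinates as [longitude, latitude]
                "coordinates": [float(row["longitude"]), float(row["latitude"])],
            },
        })
    return {"type": "FeatureCollection", "features": features}


collection = rows_to_geojson(csv.DictReader(sample_csv))
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Note the coordinate order: the later cells read latitude from index 1 and longitude from index 0, which matches this layout.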

## Load and explore the sanjose_geoJSON data

In [11]:
# Next, let's load the data.
with open('sanjose_geoJSON.geojson') as json_data:
    sanjose_data = json.load(json_data)

# Let's take a quick look at the data.
sanjose_data

{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'properties': {'': 0,
    'name': 'SJ_Campbell_Los Gatos',
    'district': 'Campbell',
    'PostalCode': 95008},
   'geometry': {'type': 'Point', 'coordinates': [-121.9448624, 37.28847885]}},
  {'type': 'Feature',
   'properties': {'': 1,
    'name': 'SJ_Santa Clara',
    'district': 'Central',
    'PostalCode': 95110},
   'geometry': {'type': 'Point', 'coordinates': [-121.9550934, 37.35578919]}},
  {'type': 'Feature',
   'properties': {'': 2,
    'name': 'SJ_Downtown1',
    'district': 'Central',
    'PostalCode': 95112},
   'geometry': {'type': 'Point', 'coordinates': [-121.8854218, 37.33864975]}},
  {'type': 'Feature',
   'properties': {'': 3,
    'name': 'SJ_Downtown2',
    'district': 'Central',
    'PostalCode': 95013},
   'geometry': {'type': 'Point', 'coordinates': [-121.8853607, 37.33306122]}},
  {'type': 'Feature',
   'properties': {'': 4,
    'name': 'SJ_Santa Clara_sw (Chapman Morse)',
    'district': 'Centr

In [12]:
# Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods.
# So, let's define a new variable that includes this data.
neighborhoods_data = sanjose_data['features']

# Let's take a look at the first item in this list.
neighborhoods_data[0]


{'type': 'Feature',
 'properties': {'': 0,
  'name': 'SJ_Campbell_Los Gatos',
  'district': 'Campbell',
  'PostalCode': 95008},
 'geometry': {'type': 'Point', 'coordinates': [-121.9448624, 37.28847885]}}

## Transform the geojson data into a pandas dataframe


In [13]:
# The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe.
# So let's start by creating an empty dataframe.

# define the dataframe columns
column_names = ['District', 'Neighborhood', 'Latitude', 'Longitude'] 


In [15]:
# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

# Take a look at the empty dataframe to confirm that the columns are as intended.
neighborhoods


Unnamed: 0,District,Neighborhood,Latitude,Longitude


In [16]:
# Then let's loop through the data [from the json file] and collect one row per neighborhood.
rows = []
for data in neighborhoods_data:
    district = data['properties']['district'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'District': district,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows.
neighborhoods = pd.DataFrame(rows, columns=column_names)

# Quickly examine the resulting dataframe.
neighborhoods

Unnamed: 0,District,Neighborhood,Latitude,Longitude
0,Campbell,SJ_Campbell_Los Gatos,37.288479,-121.944862
1,Central,SJ_Santa Clara,37.355789,-121.955093
2,Central,SJ_Downtown1,37.33865,-121.885422
3,Central,SJ_Downtown2,37.333061,-121.885361
4,Central,SJ_Santa Clara_sw (Chapman Morse),37.34161,-121.930382
5,Central,SJ_Central (Tropicana),37.334229,-121.842277
6,Central,SJ_Fruitdale_Campbell,37.312008,-121.935577
7,Core,"San Jose, CA",37.33865,-121.885422
8,East,SJ_East1 (McKinley),37.336948,-121.860771
9,East,SJ_East2 (Ryan),37.348629,-121.825218


In [17]:
# And make sure that the dataset has all 11 districts and 41 neighborhoods.
print('The dataframe has {} districts and {} neighborhoods.'.format(
        len(neighborhoods['District'].unique()),
        neighborhoods.shape[0]
    )
)


The dataframe has 11 districts and 41 neighborhoods.


In [18]:
limit = 41
neighborhoods = neighborhoods.iloc[0:limit, :]

### Use geopy library to get the latitude and longitude values of San Jose, CA.

In [19]:
# In order to define an instance of the geocoder, we need to define a user_agent. 
# We will name our agent sj_explorer, as shown below.
address = 'San Jose, CA'

geolocator = Nominatim(user_agent="sj_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Jose are {}, {}.'.format(latitude, longitude))


The geographical coordinates of San Jose are 37.3361905, -121.8905833.


### Create a map of San Jose with neighborhoods superimposed on top.

In [None]:
# For TESTING!
# LDN_COORDINATES = (37.3361905, -121.8905833) 
# myMap = folium.Map(location=LDN_COORDINATES, zoom_start=12)

# myMap

In [20]:
# create map of San Jose using latitude and longitude values
map_sanjose = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
# instantiate a feature group for the neighborhoods
markers = folium.map.FeatureGroup(name="Zips")

# loop through the 41 neighborhoods and add each to the markers feature group
for nameD, nameN, lat, lng in zip(neighborhoods.District, neighborhoods.Neighborhood, neighborhoods.Latitude, neighborhoods.Longitude):
  # add the marker to the feature group only; the group itself is added to the map below
  markers.add_child(
    folium.CircleMarker(location=[lat, lng], popup=str(nameN), 
      tooltip=str(nameN),
      radius=10,
      color='yellow',
      fill=True,
      fill_color='blue',
      fill_opacity=1.0
    )
  )

# add markers to map
map_sanjose.add_child(markers)
map_sanjose.add_child(folium.LayerControl())
map_sanjose.save("SJMap.html")

map_sanjose

In [21]:
# Let's simplify the above map and segment and cluster only the neighborhoods in the Central (downtown) district.
# So, let's slice the original dataframe and create a new dataframe of the downtown data.
downtown_data = neighborhoods[neighborhoods['District'] == 'Central'].reset_index(drop=True)

downtown_data.head()


Unnamed: 0,District,Neighborhood,Latitude,Longitude
0,Central,SJ_Santa Clara,37.355789,-121.955093
1,Central,SJ_Downtown1,37.33865,-121.885422
2,Central,SJ_Downtown2,37.333061,-121.885361
3,Central,SJ_Santa Clara_sw (Chapman Morse),37.34161,-121.930382
4,Central,SJ_Central (Tropicana),37.334229,-121.842277


In [22]:
# Let's get the geographical coordinates of San Jose, which are used to center the downtown map.
address = 'San Jose, CA'

geolocator = Nominatim(user_agent="sj_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Jose, CA are {}, {}.'.format(latitude, longitude))


The geographical coordinates of San Jose, CA are 37.3361905, -121.8905833.


In [23]:
# As we did with all of San Jose city, let's visualize the Central (downtown) district and the neighborhoods in it.
# create map of San Jose Downtown using latitude and longitude values
map_downtown = folium.Map(location=[latitude, longitude], zoom_start=15)

# instantiate a separate feature group for the downtown neighborhoods
downtown_markers = folium.map.FeatureGroup(name="Downtown")

# loop through the downtown neighborhoods and add each to the feature group
for lat, lng in zip(downtown_data.Latitude, downtown_data.Longitude):
  downtown_markers.add_child(
    folium.CircleMarker(
      [lat, lng],
      radius=10,
      color='yellow',
      fill=True,
      fill_color='blue',
      fill_opacity=1.0
    )
  )

# add markers to map
map_downtown.add_child(downtown_markers)
map_downtown.add_child(folium.LayerControl())
map_downtown.save("SJdowntown.html")

# map_downtown

## Use the Foursquare API to explore the neighborhoods and segment them

### Define Foursquare Credentials and Version

In [24]:
# The code was removed by Watson Studio for sharing.
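
Because the credentials cell was removed, the later cells need `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` to be defined before they can run. One way to do that without storing secrets in the notebook is sketched below. The environment variable names, the `load_foursquare_settings` helper, and the default version date are all assumptions for illustration; the Foursquare `v` parameter is a date in YYYYMMDD form.

```python
import os


def load_foursquare_settings(env=os.environ):
    """Read Foursquare settings from a mapping such as os.environ.

    The variable names used here are hypothetical, not a Foursquare convention.
    """
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # Placeholder date in YYYYMMDD form; replace with the version you target
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
    }


# Demonstration with a plain dict instead of the real environment
demo_settings = load_foursquare_settings({
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
})
print(sorted(demo_settings))
```

In the notebook itself, the returned values would be assigned to `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION`.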

### Let's explore the first neighborhood in our dataframe.

In [25]:
# Get the neighborhood's name.
downtown_data.loc[0, 'Neighborhood']

# Get the neighborhood's latitude and longitude values.
# neighborhood latitude value
neighborhood_latitude = downtown_data.loc[0, 'Latitude'] 
# neighborhood longitude value
neighborhood_longitude = downtown_data.loc[0, 'Longitude'] 
# neighborhood name
neighborhood_name = downtown_data.loc[0, 'Neighborhood'] 

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Latitude and longitude values of SJ_Santa Clara are 37.35578919, -121.9550934.


In [26]:
# Now, let's get the top 100 venues near the neighborhood above within a radius of 500 meters.
# First, let's create the GET request URL and name it url.

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
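
An alternative way to assemble the same request URL is to let the standard library encode the query string, which avoids mistakes when a value contains special characters. The `build_explore_url` helper and the placeholder credentials below are hypothetical; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL with a properly encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the coordinates are those of the first neighborhood
demo_url = build_explore_url("ID", "SECRET", "20180605", 37.35578919, -121.9550934)
print(demo_url)
```

The comma in the `ll` value is percent-encoded as `%2C`, which is the standard encoding for a comma inside a query parameter.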


In [27]:
# Send the GET request and examine the results
results = requests.get(url).json()
results


{'meta': {'code': 200, 'requestId': '5d0548b8f129b500251796de'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Santa Clara',
  'headerFullLocation': 'Santa Clara',
  'headerLocationGranularity': 'city',
  'totalResults': 14,
  'suggestedBounds': {'ne': {'lat': 37.360289194500005,
    'lng': -121.94944275431403},
   'sw': {'lat': 37.3512891855, 'lng': -121.96074404568596}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ad2614df964a52043e120e3',
       'name': 'Triton Museum of Art',
       'location': {'address': '1505 Warburton Ave',
        'crossStreet': 'at Lincoln St',
        'lat': 37.35597077916603,
        'lng': -121.95524600461

In [28]:
# All the information we need is in the items key. 
# Before we proceed, let's borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
# Now we are ready to clean the json and structure it into a pandas dataframe.
venues = results['response']['groups'][0]['items']
    
# flatten JSON
nearby_venues = json_normalize(venues) 

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Triton Museum of Art,Art Museum,37.355971,-121.955246
1,Kettle'e,Indian Restaurant,37.352102,-121.954609
2,BEST WESTERN University Inn Santa Clara,Hotel,37.352545,-121.955239
3,Maple Leaf Donuts,Donut Shop,37.355301,-121.959146
4,Kabab & Curry,Indian Restaurant,37.351468,-121.955424


In [29]:
# And how many venues were returned by Foursquare?
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))


14 venues were returned by Foursquare.


### 2. Explore Neighborhoods in San Jose Downtown

In [30]:
# Let's create a function to repeat the same process to all the neighborhoods in San Jose
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
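
The function above assumes that every response contains at least one group and that every venue has at least one category. A more defensive way to pull the same fields out of a response is sketched below. The `extract_venue_rows` helper is hypothetical, and `sample_payload` is a hand-written stand-in shaped like the response shown earlier; its second venue is made up to exercise the missing-category case.

```python
def extract_venue_rows(payload, neighborhood):
    """Return one dict per venue, tolerating missing groups or categories."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue.get("name"),
            "Venue Latitude": location.get("lat"),
            "Venue Longitude": location.get("lng"),
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows


# Hand-written stand-in for an API response (not real output)
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Triton Museum of Art",
                           "location": {"lat": 37.355971, "lng": -121.955246},
                           "categories": [{"name": "Art Museum"}]}},
                {"venue": {"name": "Uncategorized Place",
                           "location": {"lat": 37.35, "lng": -121.95},
                           "categories": []}},
            ]
        }]
    }
}

venue_rows = extract_venue_rows(sample_payload, "SJ_Santa Clara")
empty_rows = extract_venue_rows({"response": {"groups": []}}, "SJ_Santa Clara")
print(venue_rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame` to obtain a table like the one built by `getNearbyVenues`.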


### Now write the code to run the above function on each neighborhood and create a new dataframe called downtown_venues.

In [31]:
# run the above function on each downtown neighborhood and store the result in downtown_venues
downtown_venues = getNearbyVenues(names=downtown_data['Neighborhood'],
                                   latitudes=downtown_data['Latitude'],
                                   longitudes=downtown_data['Longitude']
                                  )


SJ_Santa Clara
SJ_Downtown1
SJ_Downtown2
SJ_Santa Clara_sw (Chapman Morse)
SJ_Central (Tropicana)
SJ_Fruitdale_Campbell


In [32]:
# Let's check the size of the resulting dataframe
print(downtown_venues.shape)
downtown_venues.head()


(153, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,SJ_Santa Clara,37.355789,-121.955093,Triton Museum of Art,37.355971,-121.955246,Art Museum
1,SJ_Santa Clara,37.355789,-121.955093,Kettle'e,37.352102,-121.954609,Indian Restaurant
2,SJ_Santa Clara,37.355789,-121.955093,BEST WESTERN University Inn Santa Clara,37.352545,-121.955239,Hotel
3,SJ_Santa Clara,37.355789,-121.955093,Maple Leaf Donuts,37.355301,-121.959146,Donut Shop
4,SJ_Santa Clara,37.355789,-121.955093,Kabab & Curry,37.351468,-121.955424,Indian Restaurant


In [33]:
# Let's check how many venues were returned for each neighborhood
downtown_venues.groupby('Neighborhood').count()


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
SJ_Central (Tropicana),5,5,5,5,5,5
SJ_Downtown1,29,29,29,29,29,29
SJ_Downtown2,80,80,80,80,80,80
SJ_Fruitdale_Campbell,11,11,11,11,11,11
SJ_Santa Clara,14,14,14,14,14,14
SJ_Santa Clara_sw (Chapman Morse),14,14,14,14,14,14


### Let's find out how many unique categories can be curated from all the returned venues

In [34]:
print('There are {} unique categories.'.format(len(downtown_venues['Venue Category'].unique())))


There are 82 unique categories.


### 3. Analyze Each Neighborhood

In [35]:
# one hot encoding
downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()


Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Art Museum,Asian Restaurant,Bakery,Bank,Bar,Beer Bar,Beer Garden,Breakfast Spot,Bubble Tea Shop,Café,Chinese Restaurant,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Cafeteria,Comedy Club,Convenience Store,Cuban Restaurant,Dessert Shop,Diner,Donut Shop,Fast Food Restaurant,Fish & Chips Shop,Food Truck,Fried Chicken Joint,Frozen Yogurt Shop,Gas Station,Gift Shop,Greek Restaurant,Grocery Store,Hawaiian Restaurant,Health & Beauty Service,Hookah Bar,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Juice Bar,Korean Restaurant,Latin American Restaurant,Lounge,Malay Restaurant,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Mobile Phone Shop,Music Store,Music Venue,Nail Salon,New American Restaurant,Nightclub,Office,Opera House,Performing Arts Venue,Persian Restaurant,Pharmacy,Pizza Place,Plaza,Pool,Pub,Ramen Restaurant,Restaurant,Rock Club,Salad Place,Sandwich Place,Science Museum,Seafood Restaurant,Shipping Store,Spa,Sports Bar,Sushi Restaurant,Taco Place,Theater,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wings Joint,Yoga Studio
0,SJ_Santa Clara,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,SJ_Santa Clara,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,SJ_Santa Clara,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,SJ_Santa Clara,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,SJ_Santa Clara,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [36]:
# And let's examine the new dataframe size.
downtown_onehot.shape


(153, 83)

### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [37]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

# Let's confirm the new size
downtown_grouped.shape


(6, 83)
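The reason the grouped mean can be read as a frequency is that each one-hot column holds only 0s and 1s, so its mean within a neighborhood is the share of that neighborhood's venues in that category. The following is a minimal sketch on made-up data; `toy_venues` and its values are hypothetical and are not taken from the Foursquare results.

```python
import pandas as pd

# Hypothetical toy data: one row per venue returned for a neighborhood.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Bar', 'Hotel', 'Bar', 'Bar'],
})

# One indicator column per category (1 where the venue has that category).
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category.
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, which is a quick sanity check for the real `downtown_grouped` as well.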

### Let's print each neighborhood along with the top 5 most common venues

In [38]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----SJ_Central (Tropicana)----
                 venue  freq
0   Mexican Restaurant   0.4
1   Chinese Restaurant   0.2
2             Pharmacy   0.2
3           Taco Place   0.2
4  American Restaurant   0.0


----SJ_Downtown1----
                venue  freq
0  Mexican Restaurant  0.10
1      Sandwich Place  0.10
2         Coffee Shop  0.03
3   Mobile Phone Shop  0.03
4         Music Store  0.03


----SJ_Downtown2----
                venue  freq
0  Mexican Restaurant  0.06
1                 Bar  0.06
2         Coffee Shop  0.06
3         Art Gallery  0.04
4               Hotel  0.04


----SJ_Fruitdale_Campbell----
                   venue  freq
0   Fast Food Restaurant  0.27
1  Vietnamese Restaurant  0.18
2                   Pool  0.09
3                    Spa  0.09
4      Convenience Store  0.09


----SJ_Santa Clara----
                  venue  freq
0  Fast Food Restaurant  0.14
1     Indian Restaurant  0.14
2   Fried Chicken Joint  0.14
3                 Hotel  0.14
4           Gas Stat

### Let's put that into a pandas dataframe

In [39]:
# First, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
# Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SJ_Central (Tropicana),Mexican Restaurant,Taco Place,Chinese Restaurant,Pharmacy,Yoga Studio,Fried Chicken Joint,Dessert Shop,Diner,Donut Shop,Fast Food Restaurant
1,SJ_Downtown1,Mexican Restaurant,Sandwich Place,Performing Arts Venue,Persian Restaurant,Hookah Bar,Gift Shop,Wings Joint,Korean Restaurant,Fish & Chips Shop,Diner
2,SJ_Downtown2,Mexican Restaurant,Coffee Shop,Bar,Art Gallery,Theater,Hotel,Yoga Studio,Café,Ice Cream Shop,Food Truck
3,SJ_Fruitdale_Campbell,Fast Food Restaurant,Vietnamese Restaurant,Convenience Store,Video Store,Pizza Place,Pool,Spa,Sandwich Place,Food Truck,Cuban Restaurant
4,SJ_Santa Clara,Indian Restaurant,Hotel,Fast Food Restaurant,Fried Chicken Joint,Gas Station,Diner,Donut Shop,Office,Italian Restaurant,Art Museum
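As an alternative to filling the frame row by row, the same ranking can be sketched with `Series.nlargest`. This is an illustration on hypothetical frequencies; `toy_grouped`, `top_venues`, and the `Rank N` column names are assumptions made for the example and are not part of the notebook.

```python
import pandas as pd

# Hypothetical frequency table in the same layout as downtown_grouped.
toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Bar': [0.25, 0.60],
    'Cafe': [0.50, 0.10],
    'Hotel': [0.15, 0.30],
    'Pharmacy': [0.10, 0.00],
})

def top_venues(grouped, n):
    """Return one row per neighborhood with its n highest-frequency categories."""
    freqs = grouped.set_index('Neighborhood')
    ranked = freqs.apply(lambda row: row.nlargest(n).index.tolist(),
                         axis=1, result_type='expand')
    ranked.columns = ['Rank {}'.format(i + 1) for i in range(n)]
    return ranked.reset_index()

toy_top = top_venues(toy_grouped, 3)
print(toy_top)
```

Note that ranks are only meaningful while frequencies are non-zero. In the output above, SJ_Central (Tropicana) has four categories with non-zero frequency, so its 5th through 10th entries are ties at zero and their order is arbitrary.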


### 4. Cluster Neighborhoods

In [40]:
# Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtown_merged = downtown_data

# merge downtown_grouped with downtown_data to add latitude/longitude for each neighborhood
downtown_merged = downtown_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# check the last columns!
downtown_merged.head() 


Unnamed: 0,District,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central,SJ_Santa Clara,37.355789,-121.955093,0,Indian Restaurant,Hotel,Fast Food Restaurant,Fried Chicken Joint,Gas Station,Diner,Donut Shop,Office,Italian Restaurant,Art Museum
1,Central,SJ_Downtown1,37.33865,-121.885422,1,Mexican Restaurant,Sandwich Place,Performing Arts Venue,Persian Restaurant,Hookah Bar,Gift Shop,Wings Joint,Korean Restaurant,Fish & Chips Shop,Diner
2,Central,SJ_Downtown2,37.333061,-121.885361,1,Mexican Restaurant,Coffee Shop,Bar,Art Gallery,Theater,Hotel,Yoga Studio,Café,Ice Cream Shop,Food Truck
3,Central,SJ_Santa Clara_sw (Chapman Morse),37.34161,-121.930382,4,Mexican Restaurant,Hotel,Breakfast Spot,Hookah Bar,Pharmacy,Nail Salon,Ramen Restaurant,Clothing Store,Mediterranean Restaurant,Massage Studio
4,Central,SJ_Central (Tropicana),37.334229,-121.842277,2,Mexican Restaurant,Taco Place,Chinese Restaurant,Pharmacy,Yoga Studio,Fried Chicken Joint,Dessert Shop,Diner,Donut Shop,Fast Food Restaurant
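The notebook fixes `kclusters = 5` by hand. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below uses synthetic data from `make_blobs` rather than the venue frequencies, and the names `X_demo`, `scores`, and `best_k` are hypothetical; it only illustrates the technique and is not the procedure used in this report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups (not the notebook's venues).
X_demo, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

The silhouette score is only defined for 2 to n-1 clusters, so with six neighborhoods the candidates would be k = 2 through 5.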


In [42]:
# Finally, let's visualize the resulting clusters
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)


# add a layer control and save the map
map_clusters.add_child(folium.LayerControl())
map_clusters.save("SJClusters.html")


map_clusters


### 5. Examine Clusters

Examine each cluster and determine the discriminating venue categories that distinguish each cluster.

In [43]:

# Based on the defining categories, you can then assign a name to each cluster. 

# Cluster 1
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SJ_Santa Clara,Indian Restaurant,Hotel,Fast Food Restaurant,Fried Chicken Joint,Gas Station,Diner,Donut Shop,Office,Italian Restaurant,Art Museum


### Cluster 2

In [44]:
# Cluster 2
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,SJ_Downtown1,Mexican Restaurant,Sandwich Place,Performing Arts Venue,Persian Restaurant,Hookah Bar,Gift Shop,Wings Joint,Korean Restaurant,Fish & Chips Shop,Diner
2,SJ_Downtown2,Mexican Restaurant,Coffee Shop,Bar,Art Gallery,Theater,Hotel,Yoga Studio,Café,Ice Cream Shop,Food Truck


### Cluster 3

In [45]:
# Cluster 3
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,SJ_Central (Tropicana),Mexican Restaurant,Taco Place,Chinese Restaurant,Pharmacy,Yoga Studio,Fried Chicken Joint,Dessert Shop,Diner,Donut Shop,Fast Food Restaurant


### Cluster 4

In [46]:
# Cluster 4
downtown_merged.loc[downtown_merged['Cluster Labels'] == 3, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,SJ_Fruitdale_Campbell,Fast Food Restaurant,Vietnamese Restaurant,Convenience Store,Video Store,Pizza Place,Pool,Spa,Sandwich Place,Food Truck,Cuban Restaurant


### Cluster 5

In [47]:
# Cluster 5
downtown_merged.loc[downtown_merged['Cluster Labels'] == 4, downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,SJ_Santa Clara_sw (Chapman Morse),Mexican Restaurant,Hotel,Breakfast Spot,Hookah Bar,Pharmacy,Nail Salon,Ramen Restaurant,Clothing Store,Mediterranean Restaurant,Massage Studio


## Using k-Means for Customer Segmentation

### Downloading Customer Data

### The HADS dataset is available as a CSV file. Use Excel functions for initial sorting and slicing tasks.

!wget -q -O 'customer_segmentation.csv' 'https://www.huduser.gov/portal/datasets/hads/hads.html'
print('Data Downloaded!')

In [48]:
# Import libraries
# import random
# import numpy as np
# import pandas as pd

# plotting library
# import matplotlib.pyplot as plt
# backend for rendering plots within the browser
# %matplotlib inline

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

In [49]:
# from HUD dataset
customers_df = pd.read_csv('customer_segmentation.csv')
customers_df.head()

Unnamed: 0,CONTROL,AGE1,Address,ZINC2,FMTINCRELAMICAT,FMTCOST06RELAMICAT,BURDEN,FMR
0,100008700141',26,,32000,4,4,0.366,773
1,'100008960141',60,,118987,7,6,0.116584,1125
2,100019710141',84,,28000,3,7,0.152143,975
3,'100022130103',55,,25441,3,2,0.226406,794
4,'100022330103',74,,44340,5,1,0.076319,752


### Pre-processing

In [50]:
df = customers_df.drop('Address', axis=1)
df.head()

Unnamed: 0,CONTROL,AGE1,ZINC2,FMTINCRELAMICAT,FMTCOST06RELAMICAT,BURDEN,FMR
0,100008700141',26,32000,4,4,0.366,773
1,'100008960141',60,118987,7,6,0.116584,1125
2,100019710141',84,28000,3,7,0.152143,975
3,'100022130103',55,25441,3,2,0.226406,794
4,'100022330103',74,44340,5,1,0.076319,752


In [51]:
# normalize the dataset
from sklearn.preprocessing import StandardScaler

X = df.values[:,1:]
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset



array([[-1.43958201, -0.49232804, -0.14444478, -0.20560906, -0.02411662,
        -1.31109861],
       [ 0.49998244,  0.64104324,  1.1058141 ,  0.73237825, -0.02636805,
        -0.57522422],
       [ 1.86908676, -0.54444485, -0.56119773,  1.2013719 , -0.02604707,
        -0.88880706],
       ...,
       [ 0.21475237, -0.81805811, -1.39470365, -0.67460272, -0.01672751,
        -1.76474845],
       [-1.61072005, -0.38843317, -0.56119773,  0.26338459, -0.02141821,
         2.7194861 ],
       [ 0.3288444 , -0.12767973,  0.68906114, -1.61259003, -0.02666743,
        -1.31737027]])
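`StandardScaler` rescales each column to zero mean and unit variance, i.e. z = (x - mean) / std with the population standard deviation, so that large-valued features such as ZINC2 do not dominate the distance calculation in k-means. The sketch below checks this by hand on the AGE1, ZINC2, and FMR values of the first three rows shown earlier; the names `X_toy`, `scaled`, and `manual` are hypothetical, and with only three rows the scaled numbers will not match the full-dataset output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# AGE1, ZINC2 and FMR for the first three rows of the table shown earlier.
X_toy = np.array([[26.0, 32000.0, 773.0],
                  [60.0, 118987.0, 1125.0],
                  [84.0, 28000.0, 975.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```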

### Modelling

In [52]:
# run model and group customers into three clusters

num_clusters = 3

k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_

print(labels)

[1 0 1 ... 1 0 1]


## Insights

In [53]:
# show customer groups
df["Labels"] = labels
df.head(5)

Unnamed: 0,CONTROL,AGE1,ZINC2,FMTINCRELAMICAT,FMTCOST06RELAMICAT,BURDEN,FMR,Labels
0,100008700141',26,32000,4,4,0.366,773,1
1,'100008960141',60,118987,7,6,0.116584,1125,0
2,100019710141',84,28000,3,7,0.152143,975,1
3,'100022130103',55,25441,3,2,0.226406,794,1
4,'100022330103',74,44340,5,1,0.076319,752,1


In [54]:
# show centroid values (profiles) using averaging of features in each cluster
df.groupby('Labels').mean(numeric_only=True)

Unnamed: 0_level_0,AGE1,ZINC2,FMTINCRELAMICAT,FMTCOST06RELAMICAT,BURDEN,FMR
Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,51.272976,116199.874537,6.28878,5.933854,0.247738,1633.842341
1,51.203514,27197.366147,2.56482,3.064551,1.29513,1185.279541
2,48.4,7.2,1.0,4.0,4806.327273,1541.0
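The group means above are reported in original units, while k-means itself works on the standardized data. Because standardization is an affine transform, mapping the fitted centroids back with `inverse_transform` should give approximately the same profiles as the per-label means. The sketch below illustrates this on a small made-up frame; `toy`, `scaler`, `km`, `centroids`, and `profiles` are hypothetical names and the values are illustrative, not HADS records.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: two clearly separated age/income groups.
toy = pd.DataFrame({
    'AGE1':  [30, 35, 40, 55, 60, 65],
    'ZINC2': [25000, 27000, 30000, 110000, 120000, 125000],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
toy['Labels'] = km.labels_

# Centroids live in standardized space; map them back to years and dollars.
centroids = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                         columns=['AGE1', 'ZINC2'])

# Per-label means in original units, as in the groupby above.
profiles = toy.groupby('Labels').mean()
print(profiles)
print(np.allclose(centroids.to_numpy(), profiles.to_numpy()))
```

Checking `toy['Labels'].value_counts()` alongside the profiles shows how many rows stand behind each one. A very small group with extreme means, such as label 2 above with a mean ZINC2 of 7.2 and a mean BURDEN above 4,800, may reflect outliers rather than a representative segment.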


## Create a profile for each group, considering the common characteristics of each cluster

## Results

**Insights - What was learned regarding the variables**<br> Considering the top five most frequent venues in the greater downtown San Jose, CA area, commercial locations dominate, as expected. Within the top five clusters of types of venues, only one gas station, two convenience stores, and one fitness location were found. However, within the 500-foot radius selected for this analysis, the top ten venues include a hotel, two personal services companies, and several additional multi-ethnic food service offerings.

**Profiles**<br> This study used a three-group cluster of housing consumers in the US Western Census Region. The representative housing consumer profiles were determined to be as follows:

1. middle-aged, high income, high disposable income
2. middle-aged, above-average income, little disposable income
3. young, average income, virtually no disposable income

It should be noted that the HUD census data for many western states does not reflect the economy of the City of San Jose, located in Silicon Valley.

## Discussion

**Explain how results relate to the research question**<br> The results show that restaurants and other types of food service venues dominate the Downtown area of San Jose, CA which is an expected result. However, the trend towards building multi-unit housing for new consumers argues for inclusion of other types of venues that are suitable for this use. For example, transportation, sports & fitness, leisure destinations, and family entertainment offerings should be considered.

Of the three profile groups identified in the study, young individuals with average incomes seem to struggle with housing affordability. Downtown housing prices consume a high percentage of individual incomes, perhaps leading to sharing arrangements. This leaves the group little budget for other living expenditures.

On the other hand, the HUD study showed two groups with above-average and high-income profiles, although these groups were split between those with high and those with low disposable incomes. Seemingly, the higher incomes available to older consumers attract them to high-end housing choices, to the detriment of young persons.


## Conclusions

**Findings and recommendations**<br> This report sought to describe the housing choice features that consumers' past behaviors have shown to be determinative for the market. In general, high-income individuals spend a lower proportion of their income on housing than medium- and lower-income consumers. Still, the higher-income population can easily take advantage of the restaurants and venues that exist in city centers.

The challenge for the housing industry is to find innovation and flexibility in the features housing consumers desire while simultaneously accounting for societal responsibility in planning and implementation.

## Sources

Aklson, Alex, Aghabozorgi, Saeed, & Lin, Polong (2019). IBM Data Science Professional Certificate Course. Coursera.com.

Housing Affordability Data System (HADS) (2013). HADS Data derived from AHS National Data [Data file and code book]. Retrieved from https://catalog.data.gov/dataset/housing-affordability-data-system-hads

Maxwell, Joseph A. (2013). Qualitative research design: An interactive approach. Thousand Oaks, CA: Sage Publications.

Watt, Nick (2019). Weave data sets together and deliver real insights. CMO.com. Retrieved June 12, 2019, from https://www.cmo.com/features/articles/2016/9/15/weave-data-sets-together-and-deliver-real-insights.html#gs.hyx5be

