<h1 align=center><font size = 5>Clustering and Comparing the Neighborhoods of Fairfax County in Virginia</font></h1>


# 1.1. Introduction

Fairfax County is a suburban county of Washington, D.C., in northern Virginia. It is the most populous county in the state and contains some of the most expensive housing markets in the DC-Maryland-Virginia (DMV) metropolitan region. Despite the global impact of the coronavirus pandemic, the real estate market in Fairfax County remains strong and competitive, driven mainly by the relatively strong economy of the DMV area. Other factors that influence the market are a lack of inventory, low mortgage rates, and low down payment options. However, the boom may not be uniform across the county's neighborhoods, so there is an opportunity to use data analysis to compare localities and generate insights for potential investments.

## 1.2. Business Problem

Successful investments are often informed by data and the insights drawn from it. This project uses a machine learning technique, clustering, to derive such insights and give potential homebuyers or investors a basis for an informed decision. In particular, the business problem that we are trying to answer is: Which neighborhood offers the best opportunity for purchasing a home or an investment property in Fairfax County? 
To solve this problem, we will cluster the neighborhoods of Fairfax County using two kinds of features: local venues and amenities (coffee shops, parks, etc.) and a smoothed, seasonally adjusted value of typical homes in each neighborhood. We will then compare the resulting clusters with the forecasted increase in home prices in each neighborhood to derive insight for purchasing a home or an investment property. 


# 2. Data Acquisition
## 2.1. Data 
In this project, we need four types of data to answer our business question: the names of neighborhoods, venues and amenities, typical home values, and the forecasted increase in housing prices. 
The neighborhood dataset includes the names of various localities in Fairfax County. Using each neighborhood's name, we can obtain its full address, latitude, and longitude with the Nominatim geocoder from the geopy library in Python. We can use the Foursquare API to get relevant information about venues and amenities in each neighborhood, and we obtain housing data from Zillow. To clean, analyze, and process the data, we will use pandas dataframes. 

We will load the neighborhood list in Fairfax County from a CSV file into a pandas dataframe. The CSV file was downloaded from https://www.fairfaxcounty.gov/demographics/interactive-map-communities-places-and-towns. Fairfax County offers an interactive map for selecting communities and towns within its borders. 

Foursquare is a location technology platform, and its API provides access to location data such as venues and amenities. Zillow is a real estate and rental marketplace that covers the entire lifecycle of owning a home, such as buying, selling, renting, financing, and remodeling. Zillow makes housing data available for download via the following link: https://www.zillow.com/research/data/. We will use the Zillow Home Value Index (ZHVI) and the Zillow Home Value Forecast (ZHVF) for this project, where ZHVI is a smoothed, seasonally adjusted measure of typical home values and market conditions for a given region. Zillow publishes ZHVI for different price tiers; for this project, we are using the mid-tier dataset, which reflects home values in the 35th to 65th percentile range. ZHVF is the one-year forecast of ZHVI. Both the ZHVI and ZHVF files are downloaded as CSV files and will be loaded into pandas dataframes for post-processing.
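
As a rough sketch of that loading step, the snippet below reads a ZHVI-style table into pandas and reshapes it from one column per month to one row per region and month. The inline CSV is a made-up stand-in for the downloaded file: the column names (`RegionName`, `State`, and the date columns) and all values are assumptions for illustration and should be checked against the actual Zillow download.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded ZHVI CSV (all values are invented).
sample_csv = io.StringIO(
    "RegionName,State,2020-11-30,2020-12-31\n"
    "Neighborhood A,VA,500000.0,505000.0\n"
    "Neighborhood B,VA,650000.0,653250.0\n"
)

zhvi_wide = pd.read_csv(sample_csv)

# Reshape: one row per (region, month) instead of one column per month.
zhvi_long = zhvi_wide.melt(
    id_vars=["RegionName", "State"],
    var_name="Date",
    value_name="ZHVI",
)
zhvi_long["Date"] = pd.to_datetime(zhvi_long["Date"])

# Most recent value per region, which could serve as a per-neighborhood feature.
latest_zhvi = (
    zhvi_long.sort_values("Date")
    .groupby("RegionName")["ZHVI"]
    .last()
)
print(latest_zhvi)
```

The long format makes it straightforward to filter by date or region before joining the values onto the neighborhood list.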


## Import Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
geolocator = Nominatim(user_agent="Fairfax_explorer")

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a flat pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import urllib.request
from bs4 import BeautifulSoup # to parse HTML and XML
from IPython.display import display_html

import re
print('Libraries imported.')

Libraries imported.


### Loading Fairfax Neighborhood Data into a pandas dataframe

In [10]:
fairfax_neighborhoods_df = pd.read_csv('Data/FairfaxNeighborhoods.csv')
fairfax_neighborhoods_df.head()

Unnamed: 0,NAMELSAD10,YEAR,POPULATION,POP_5YEAR,LAND_AREA,TOTAL_HU,SFD,SFA,MF_LOW,MF_MID,MF_HIGH,HOUSEHOLDS,HU_30UP,HU_LT30,VALUE_LOW,VALUE_MID,VALUE_HIGH,TOTAL_GFA,OFFICE_GFA,RETAIL_GFA,INDUST_GFA
0,Woodburn CDP,2020,8937,9002,1790.464,3161,1486,587,1088,0,0,3116,3028,131,936,1369,91,3191725,3060943,130782,0
1,Burke Centre CDP,2020,17339,17370,2024.901,6172,2299,2953,920,0,0,6153,5981,189,3250,2321,27,1611310,170736,1091745,348829
2,Fair Oaks CDP,2020,34151,36135,3232.245,15101,1136,5007,8567,391,0,14763,4989,10111,4578,3764,247,13557126,6842130,6368746,346250
3,Crosspointe CDP,2020,6034,6042,1420.916,1820,1820,0,0,0,0,1810,807,1012,51,1715,51,347650,0,347650,0
4,Wakefield CDP,2020,11726,11769,2434.22,3926,3789,137,0,0,0,3911,3777,149,137,3700,83,0,0,0,0


### Removing the suffix 'CDP' from the NAMELSAD10 column of the fairfax_neighborhoods_df dataframe and renaming that column to Neighborhoods

In [11]:
fairfax_neighborhoods_df["NAMELSAD10"] = fairfax_neighborhoods_df["NAMELSAD10"].str.replace(" CDP", "")
fairfax_neighborhoods_df.rename(columns = {'NAMELSAD10':'Neighborhoods'}, inplace = True)
# Un-comment the code below to save a clean copy of the fairfax_neighborhoods dataframe
#fairfax_neighborhoods_df.to_csv('fairfax_neighborhoods.csv')
fairfax_neighborhoods_df.head(30)

Unnamed: 0,Neighborhoods,YEAR,POPULATION,POP_5YEAR,LAND_AREA,TOTAL_HU,SFD,SFA,MF_LOW,MF_MID,MF_HIGH,HOUSEHOLDS,HU_30UP,HU_LT30,VALUE_LOW,VALUE_MID,VALUE_HIGH,TOTAL_GFA,OFFICE_GFA,RETAIL_GFA,INDUST_GFA
0,Woodburn,2020,8937,9002,1790.464,3161,1486,587,1088,0,0,3116,3028,131,936,1369,91,3191725,3060943,130782,0
1,Burke Centre,2020,17339,17370,2024.901,6172,2299,2953,920,0,0,6153,5981,189,3250,2321,27,1611310,170736,1091745,348829
2,Fair Oaks,2020,34151,36135,3232.245,15101,1136,5007,8567,391,0,14763,4989,10111,4578,3764,247,13557126,6842130,6368746,346250
3,Crosspointe,2020,6034,6042,1420.916,1820,1820,0,0,0,0,1810,807,1012,51,1715,51,347650,0,347650,0
4,Wakefield,2020,11726,11769,2434.22,3926,3789,137,0,0,0,3911,3777,149,137,3700,83,0,0,0,0
5,Mason Neck,2020,2367,2403,12813.096,817,815,2,0,0,0,797,637,172,224,438,111,1284,0,0,1284
6,McNair,2020,22015,27369,1320.213,9508,86,2815,6200,407,0,9256,285,9221,2564,833,0,9736348,6761503,2940195,34650
7,South Run,2020,6843,6836,1664.018,2123,2000,123,0,0,0,2119,1368,754,2,2056,64,2376,0,0,2376
8,Greenbriar,2020,8190,8193,1008.979,3134,1940,197,997,0,0,3125,2965,168,606,1737,0,798355,116581,680662,1112
9,Fair Lakes,2020,8956,9161,1545.597,3430,642,1452,978,358,0,3349,738,2689,1444,1064,63,5876702,2529740,3049725,297237
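
The plain `str.replace(" CDP", "")` call above removes the text wherever it occurs in a name. A slightly more defensive variant, sketched here on a small made-up Series, anchors the pattern to the end of the string so that only a trailing suffix is stripped. The sample names, including the hypothetical "Old CDP Road CDP", are illustrative only.

```python
import pandas as pd

# Illustrative names; "Old CDP Road CDP" is a hypothetical edge case.
names = pd.Series(["Woodburn CDP", "Burke Centre CDP", "Old CDP Road CDP", "Non Area 4"])

# \s+CDP$ matches whitespace followed by "CDP" only at the end of the string.
cleaned = names.str.replace(r"\s+CDP$", "", regex=True)
print(cleaned.tolist())
```

Passing `regex=True` explicitly also avoids depending on the default, which has differed between pandas versions.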


### Let's print the size of the data

In [5]:
fairfax_neighborhoods_df.shape

(72, 21)

### Let's clean the fairfax_neighborhoods dataframe by removing placeholder rows that have no real name (e.g., Non Area 4) and create a new dataframe, fairfax_data

In [12]:
# Keep only rows whose name does not start with the placeholder prefix "Non Area"
is_unnamed = fairfax_neighborhoods_df['Neighborhoods'].str.startswith('Non Area')
fairfax_data = fairfax_neighborhoods_df[~is_unnamed].reset_index(drop = True)
# Un-comment the code below to save a clean copy of the fairfax_data dataframe
#fairfax_data.to_csv('Fairfax_data.csv')
fairfax_data

Unnamed: 0,Neighborhoods,YEAR,POPULATION,POP_5YEAR,LAND_AREA,TOTAL_HU,SFD,SFA,MF_LOW,MF_MID,MF_HIGH,HOUSEHOLDS,HU_30UP,HU_LT30,VALUE_LOW,VALUE_MID,VALUE_HIGH,TOTAL_GFA,OFFICE_GFA,RETAIL_GFA,INDUST_GFA
0,Woodburn,2020,8937,9002,1790.464,3161,1486,587,1088,0,0,3116,3028,131,936,1369,91,3191725,3060943,130782,0
1,Burke Centre,2020,17339,17370,2024.901,6172,2299,2953,920,0,0,6153,5981,189,3250,2321,27,1611310,170736,1091745,348829
2,Fair Oaks,2020,34151,36135,3232.245,15101,1136,5007,8567,391,0,14763,4989,10111,4578,3764,247,13557126,6842130,6368746,346250
3,Crosspointe,2020,6034,6042,1420.916,1820,1820,0,0,0,0,1810,807,1012,51,1715,51,347650,0,347650,0
4,Wakefield,2020,11726,11769,2434.22,3926,3789,137,0,0,0,3911,3777,149,137,3700,83,0,0,0,0
5,Mason Neck,2020,2367,2403,12813.096,817,815,2,0,0,0,797,637,172,224,438,111,1284,0,0,1284
6,McNair,2020,22015,27369,1320.213,9508,86,2815,6200,407,0,9256,285,9221,2564,833,0,9736348,6761503,2940195,34650
7,South Run,2020,6843,6836,1664.018,2123,2000,123,0,0,0,2119,1368,754,2,2056,64,2376,0,0,2376
8,Greenbriar,2020,8190,8193,1008.979,3134,1940,197,997,0,0,3125,2965,168,606,1737,0,798355,116581,680662,1112
9,Fair Lakes,2020,8956,9161,1545.597,3430,642,1452,978,358,0,3349,738,2689,1444,1064,63,5876702,2529740,3049725,297237


### Get the corresponding latitude and longitude values for each neighborhood in Fairfax County

In [15]:
import sys # needed for the error message below

latitudes = []
longitudes = []
location_address = []

for line in fairfax_data['Neighborhoods']:
    address = line.strip()
    location = geolocator.geocode(address + ' Fairfax County Virginia')

    if location is not None:
        latitudes.append(location.latitude)
        longitudes.append(location.longitude)
        location_address.append(location.address)
    else:
        # Store placeholders so the lists stay aligned with fairfax_data
        latitudes.append(None)
        longitudes.append(None)
        location_address.append(None)
        sys.stderr.write("not found: %s\n" % address)
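
The lookup loop can also be written as a small helper that takes the geocoding function as an argument, which makes it easy to add rate limiting or to test without network access. This is a minimal sketch, not the notebook's own method: `geocode_neighborhoods` and `fake_geocode` are hypothetical names, and the fake address and coordinates are invented. The commented lines show how geopy's `RateLimiter` could wrap the Nominatim geocoder to space out requests.

```python
from types import SimpleNamespace


def geocode_neighborhoods(names, geocode, suffix=", Fairfax County, Virginia"):
    """Return one record per name; failed lookups keep their row with None values."""
    records = []
    for name in names:
        query = name.strip() + suffix
        location = geocode(query)  # expected to return None when nothing matches
        found = location is not None
        records.append({
            "Neighborhoods": name.strip(),
            "Full_Address": location.address if found else None,
            "Latitude": location.latitude if found else None,
            "Longitude": location.longitude if found else None,
        })
    return records


# With geopy, the callable could be a rate-limited Nominatim geocoder:
#   from geopy.geocoders import Nominatim
#   from geopy.extra.rate_limiter import RateLimiter
#   geocode = RateLimiter(Nominatim(user_agent="Fairfax_explorer").geocode,
#                         min_delay_seconds=1)


# Offline stand-in so the sketch runs without network access (fake data).
def fake_geocode(query):
    known = {
        "Burke Centre, Fairfax County, Virginia": SimpleNamespace(
            address="Burke Centre (fake address)", latitude=38.79, longitude=-77.30
        )
    }
    return known.get(query)


records = geocode_neighborhoods(["Burke Centre", "Nowhere"], fake_geocode)
print(records)
```

Because every name produces exactly one record, `pd.DataFrame(records)` yields a dataframe whose rows stay aligned with the input list even when a lookup fails.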

### Transform the data into a pandas dataframe that holds each neighborhood's full address, latitude, and longitude.

In [16]:
# Build the dataframe in one step; all four columns must have the same length
fairfax_geo_location = pd.DataFrame({
    'Neighborhoods': fairfax_data['Neighborhoods'],
    'Full_Address': location_address,
    'Latitude': latitudes,
    'Longitude': longitudes,
})

# Un-comment the code below to save a clean copy of the fairfax_geo_location dataframe
#fairfax_geo_location.to_csv('Fairfax_Geo_Location_Data.csv')
fairfax_geo_location.head()

Unnamed: 0,Neighborhoods,Full_Address,Latitude,Longitude
0,Woodburn,"Woodburn School Site Park, Holmes Run Acres, A...",38.852209,-77.21187
1,Burke Centre,"Burke Centre, Fairfax County, Virginia, 22015-...",38.790992,-77.300519
2,Fair Oaks,"Fair Oaks, Fairfax County, Virginia, 22035, Un...",38.863427,-77.35923
3,Crosspointe,"Crosspointe, Laurel Hill, Fairfax County, Virg...",38.724002,-77.265078
4,Wakefield,"Wakefield Forest, Fairfax County, Virginia, 22...",38.835395,-77.239581
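
In the output above, some lookups resolved to a nearby feature rather than the neighborhood itself (for example, Woodburn matched "Woodburn School Site Park"). One possible sanity check, sketched here on the five rows shown, flags rows whose first address component differs from the neighborhood name. The addresses are copied as displayed, truncation included, and the exact-match rule is an assumption that may be too strict for some names.

```python
import pandas as pd

# First rows of the geocoding output shown above (addresses truncated as displayed).
geo = pd.DataFrame({
    "Neighborhoods": ["Woodburn", "Burke Centre", "Fair Oaks", "Crosspointe", "Wakefield"],
    "Full_Address": [
        "Woodburn School Site Park, Holmes Run Acres, A...",
        "Burke Centre, Fairfax County, Virginia, 22015-...",
        "Fair Oaks, Fairfax County, Virginia, 22035, Un...",
        "Crosspointe, Laurel Hill, Fairfax County, Virg...",
        "Wakefield Forest, Fairfax County, Virginia, 22...",
    ],
})

# Compare the first comma-separated component of the address with the name.
first_component = geo["Full_Address"].str.split(",").str[0].str.strip()
suspect = geo[first_component != geo["Neighborhoods"]]
print(suspect["Neighborhoods"].tolist())
```

Flagged rows are candidates for manual review of their coordinates, not necessarily errors.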


## In a similar manner, we will load the housing prices. We will use the Foursquare API to get the data for venues and amenities