# Housing Sales Prices & Venues Data Analysis of Taipei City
## A. Introduction
### A.1. Description & Discussion of the Background
Taipei is the 40th most-populous urban area in the world; roughly one-third of Taiwanese citizens live in the metro district. Taipei City is home to an estimated population of **2,646,204** (2019) and is divided into 12 administrative districts. I have lived in Taipei City for 10 years and have had a good experience there, so I decided to use it in my project. Taipei City is an enclave of the municipality of New Taipei City, and its limits cover an area of **271.7997** square kilometers [1].

Taipei is a densely populated urban area whose population has continued to increase from year to year, as has that of the surrounding cities. It is also one of the world's most expensive and crowded cities, which makes it harder for investors to do business there; unaffordability remains a serious issue. Most investors would like access to relevant information so that they can invest in their preferred district at a lower real estate cost and choose the type of business that is most popular in that area. Naturally, most people are also looking for lower real estate prices and a less dense area. It is difficult for an investor to get all of this information in one place.

These problems create an opportunity to build a map and an information chart of Taipei City in which the real estate index is marked and each district is clustered according to its venue density.
### A.2. Data Description
To provide a possible solution, the data listed below are needed:
*  I used the **Foursquare API** to get the most common venues of a given district of Taipei City [2].
*  I found the Second-level Administrative Divisions of Taiwan in the Spatial Data Repository of NYU [3]. The .json file has the coordinates of all cities of Taiwan. I cleaned the data and reduced it to the city of Taipei [(here is my .json file)](https://github.com/augeord/Taipei/blob/master/map.geojson), which I used to create the choropleth map of the Housing Sales Price Index of Taipei.
*  I used the 'Search Nearby' option of **Google Maps** to get the center coordinates of each district [4].
*  I collected the latest per-square-meter Housing Sales Price (HSP) averages for each district of Taipei from a housing retail web page [5] [(here is my data)](https://github.com/augeord/Taipei/blob/master/Data.csv).

## B. Methodology

### B.1. Creating data table and data pre-processing

I used a GitHub repository as the database for my study. My master data frame `df` has the main components: the District, Average House Price, Latitude and Longitude information of the city.

In [1]:
import pandas as pd 
import numpy as np
import requests

url = 'https://raw.githubusercontent.com/augeord/Peer-graded-Assignment-Segmenting-and-Clustering-Neighborhoods-in-Toronto/master/taipei_city.csv'

df = pd.read_csv(url)

df.head()

Unnamed: 0,District,Avg-housePrice,Latitude,Longitude
0,Beitou,45026000,25.115176,121.515018
1,Daan,65344000,25.026158,121.542709
2,Datong,66177125,25.062724,121.511306
3,Nangang,43702353,25.031235,121.611195
4,Neihu,107481500,25.068942,121.590903


### B.2. Visual Map & Clustering in District

We need to import some external libraries for mapping and clustering the data, and we will also work on the Foursquare API part.

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


#### We will use folium library to visualize geographic details of Taipei and its Districts.

In [3]:
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge folium=0.7.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library


#### I use geopy library to get the latitude and longitude values of Taipei

In [4]:
address = 'Taipei, TP'

geolocator = Nominatim(user_agent="TP_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Taipei are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Taipei are 25.0375198, 121.5636796.


#### We can create a map of Taipei with the districts superimposed on top. We use the latitude and longitude values to draw it.

In [5]:
map_taipei = folium.Map(location=[latitude, longitude], zoom_start=9.5)

# add markers to map
for lat, lng, district in zip(df['Latitude'], df['Longitude'], df['District']):
    label = '{}'.format(district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_taipei)  
    
map_taipei

#### Let's use the Foursquare API to explore the districts and segment them.

In [6]:
CLIENT_ID = '' # my Foursquare ID
CLIENT_SECRET = '' # my Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


#### We will get the district's latitude and longitude values.

In [7]:
taipei_data = df

district_latitude = taipei_data.loc[0, 'Latitude'] # district latitude value
district_longitude = taipei_data.loc[0, 'Longitude'] # district longitude value

district_name = taipei_data.loc[0, 'District'] # district name

print('Latitude and longitude values of {} are {}, {}.'.format(district_name, 
                                                               district_latitude, 
                                                               district_longitude))

Latitude and longitude values of Beitou are 25.115176, 121.515018.


#### First, let's create the GET request URL and name it url. I set the limit to 100 venues and the radius to 750 meters for each district

In [8]:
LIMIT = 100
radius = 750
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    district_latitude, 
    district_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=&client_secret=&v=20180605&ll=25.115176,121.515018&radius=750&limit=100'
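The same request URL can also be assembled from named parameters with `urllib.parse.urlencode`, which handles escaping and avoids a long positional `format` call. This is only a sketch: the helper name `build_explore_url` is hypothetical, and the `MY_ID` / `MY_SECRET` values are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook cell above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=750, limit=100):
    """Hypothetical helper: build the explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma of the ll parameter readable instead of %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


# Placeholder credentials, Beitou coordinates from the table above
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 25.115176, 121.515018)
print(example_url)
```

Alternatively, `requests.get` accepts the same dictionary through its `params` argument, so the URL string never has to be built by hand.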

In [9]:
results = requests.get(url).json()

#### From the Foursquare lab, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [10]:
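# function that extracts the category of the venue
def get_category_type(row):
    # the category list sits under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # a venue without any category yields None, otherwise the first category name
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']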

#### Now we are ready to clean the json and structure it into a pandas dataframe

In [12]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

KeyError: 'groups'
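The `KeyError: 'groups'` above means the response had no `groups` entry, presumably because the credentials in the earlier cell are empty. A minimal sketch of a more defensive lookup follows. `extract_items` is a hypothetical helper, and the sample dictionaries only imitate the nesting used in the notebook (`response`, then `groups`, then `items`); they are not real Foursquare replies.

```python
def extract_items(results):
    """Return the venue items of an explore-style response, or [] if absent."""
    groups = results.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


# Imitated payloads (assumed shapes, not real API output)
ok_payload = {'response': {'groups': [{'items': [{'venue': {'name': 'Cafe A'}}]}]}}
error_payload = {'meta': {'code': 400}, 'response': {}}

print(len(extract_items(ok_payload)))
print(len(extract_items(error_payload)))
```

With a lookup like this, an empty list can be checked before the result is passed on to `json_normalize`.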

#### And how many venues were returned by Foursquare?

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

### B.3. Exploring Districts in Taipei

##### Let's create a function to get the venues of all the districts in Taipei

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

##### Now we can run the above function on each district and create a new dataframe called taipei_venues

In [None]:
taipei_venues = getNearbyVenues(names=taipei_data['District'],
                                   latitudes=taipei_data['Latitude'],
                                   longitudes=taipei_data['Longitude']
                                  )

##### Let's check the size of the resulting dataframe

In [None]:
print(taipei_venues.shape)
taipei_venues.head()

##### Let's check how many venues were returned for each district and sort them by count

In [None]:
summary = taipei_venues.groupby('District').count().reset_index()
summary['Count'] = summary['Venue']
summary = summary.drop(['District Latitude', 'District Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude','Venue Category'], axis=1)
summary = summary.sort_values('Count').reset_index(drop=True)
summary.head()

##### We can create a bar chart and analyze the big picture of it

In [None]:
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt
 
objects = summary.District
y_pos = np.arange(len(objects))
performance = summary.Count

plt.bar(y_pos, performance, align='center', alpha=0.4)
plt.xticks(y_pos, objects)
plt.ylabel('Venue')
plt.title('Total Number of Venue in District')
plt.xticks(rotation=90)

plt.show()

The graph clearly shows that every district has fewer than 40 venues around its given latitude and longitude. The graph does not cover all possible venues in each district.

Everything depends on the latitude and longitude data available; here it boils down to a single latitude and longitude pair for each district. In future studies we can look for neighborhood-level data with more latitude and longitude information.


##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(taipei_venues['Venue Category'].unique())))

### B.4. Analyzing Each District

##### We will analyze each district using the venue information

In [None]:
# one hot encoding
taipei_onehot = pd.get_dummies(taipei_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
taipei_onehot['District'] = taipei_venues['District'] 

# move neighborhood column to the first column
list_column = taipei_onehot.columns.tolist()
number_column = int(list_column.index('District'))
list_column = [list_column[number_column]] + list_column[:number_column] + list_column[number_column+1:] 
taipei_onehot = taipei_onehot[list_column]

taipei_onehot.head()

#### Next, let's group the rows by district, taking the mean of the frequency of occurrence of each category

In [None]:
taipei_grouped = taipei_onehot.groupby('District').mean().reset_index()
taipei_grouped.head()
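To make the one-hot and group-by step concrete, here is a small sketch on made-up data. The district names `A` and `B` and the venue rows are invented for illustration; only the pandas calls correspond to the cells above.

```python
import pandas as pd

# Toy data (made up): three venues in district A, one in district B
toy_venues = pd.DataFrame({
    'District': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Hotel', 'Park'],
})

# One indicator column per category; dtype=int keeps the values as 0/1 numbers
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
toy_onehot.insert(0, 'District', toy_venues['District'])

# The mean of 0/1 indicators is the share of each category within a district
toy_grouped = toy_onehot.groupby('District').mean().reset_index()
print(toy_grouped)
```

In this toy table, two of the three venues in district A are cafes, so its `Cafe` value is 2/3; each row of the grouped table sums to 1.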

#### Let's put that into a pandas dataframe
First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Now let's create the new dataframe and display the top 10 venues for each district.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['District']
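for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd come from the indicators list
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond the list, fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))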

# create a new dataframe
districts_venues_sorted = pd.DataFrame(columns=columns)
districts_venues_sorted['District'] = taipei_grouped['District']

for ind in np.arange(taipei_grouped.shape[0]):
    districts_venues_sorted.iloc[ind, 1:] = return_most_common_venues(taipei_grouped.iloc[ind, :], num_top_venues)

districts_venues_sorted.head()
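The ranking step can be tried on a single made-up row. This is a sketch only: `ordinal` is a hypothetical helper that builds the "1st", "2nd", "3rd", "4th" labels without relying on an exception, and the frequencies are invented.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Made-up category frequencies for a single district
toy_row = pd.Series({'District': 'A', 'Cafe': 0.5, 'Hotel': 0.3, 'Park': 0.2})

# Skip the District entry, then sort the rest from most to least frequent
top_two = toy_row.iloc[1:].sort_values(ascending=False).index.values[:2]
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(2)]
print(dict(zip(labels, top_two)))
```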

### B.5. Clustering the Districts
K-Means is one of the most common clustering methods in unsupervised learning, and I will use it in this project.

First, I will run K-Means to cluster the districts into 3 clusters, because the elbow method indicated that 3 is the optimum k.

In [None]:
kclusters = 3

taipei_grouped_clustering = taipei_grouped.drop('District', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(taipei_grouped_clustering)

# check cluster labels generated for each row in the dataframe
labels = kmeans.labels_
labels

In [None]:
from scipy.spatial.distance import cdist

distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(taipei_grouped_clustering)
    #kmeanModel.fit(taipei_grouped_clustering)
    distortions.append(sum(np.min(cdist(taipei_grouped_clustering, kmeanModel.cluster_centers_, 'canberra'), axis=1)) / taipei_grouped_clustering.shape[0])

# There are different distance metrics available for spatial distance.
# I chose canberra instead of euclidean because the canberra function gives me a clearer view of the elbow break point.

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
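The distortion value appended in the loop above is the average distance from each row to its nearest cluster center. A NumPy-only sketch of that quantity with the Canberra metric is shown below. `canberra_distortion` is a hypothetical name, the points and centers are made up, and terms with a zero denominator are counted as 0, which is the convention I am assuming here.

```python
import numpy as np


def canberra_distortion(points, centers):
    """Mean Canberra distance from each point to its nearest center."""
    points = np.asarray(points, dtype=float)[:, None, :]    # shape (n, 1, d)
    centers = np.asarray(centers, dtype=float)[None, :, :]  # shape (1, k, d)
    num = np.abs(points - centers)
    den = np.abs(points) + np.abs(centers)
    # Terms with a zero denominator (both coordinates are 0) contribute 0
    terms = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
    distances = terms.sum(axis=2)                            # shape (n, k)
    return distances.min(axis=1).mean()


# Made-up points and centers for illustration
toy_points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_centers = [[1.0, 0.0], [0.0, 1.0]]
print(canberra_distortion(toy_points, toy_centers))
```

Two of the toy points coincide with a center (distance 0) and the third is at Canberra distance 1 from either center, so the mean is 1/3.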

##### Let's create a new dataframe that includes the cluster as well as the top 10 venues for each district.

In [None]:
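taipei_merged = taipei_data.copy() # work on a copy so that df itself is not modified

# add clustering labels, matched by district name (kmeans.labels_ follows the row order of taipei_grouped)
label_by_district = dict(zip(taipei_grouped['District'], kmeans.labels_))
taipei_merged['Cluster Labels'] = taipei_merged['District'].map(label_by_district)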

# merge taipei_grouped with taipei_data to add latitude/longitude for each district
taipei_merged = taipei_merged.join(districts_venues_sorted.set_index('District'), on='District')

taipei_merged.head() # check the last columns!

##### We can also count the 1st Most Common Venue in each cluster. The resulting bar chart may help us find proper label names for each cluster.

In [None]:
count_venue = taipei_merged
count_venue = count_venue.drop(['District','Avg-housePrice', 'Latitude', 'Longitude'], axis=1)
count_venue = count_venue.groupby(['Cluster Labels','1st Most Common Venue']).size().reset_index(name='Counts')

#we can transpose it to plot bar chart
cv_cluster = count_venue.pivot(index='Cluster Labels', columns='1st Most Common Venue', values='Counts')
cv_cluster = cv_cluster.fillna(0).astype(int).reset_index(drop=True)
cv_cluster

In [None]:
#creating a bar chart of "Number of Venues in Each Cluster"
frame=cv_cluster.plot(kind='bar',figsize=(20,8),width = 0.8)

plt.legend(labels=cv_cluster.columns,fontsize= 14)
plt.title("Number of Venues in Each Cluster",fontsize= 16)
plt.xticks(fontsize=14)
plt.xticks(rotation=0)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Number of Venue', fontsize=14)

When we examine the graph above, we can label each cluster as follows:

* Cluster 0 : "Cafe Venues"  
* Cluster 1 : "Multiple Social Venues"  
* Cluster 2 : "Accommodation & Intensive Cafe Venues"  

##### We can now assign those new labels to existing label of clusters:

In [None]:
Cluster_labels = {'Clusters': [0,1,2], 'Labels': ["Cafe Venues","Multiple Social Venues","Accommodation & Intensive Cafe Venues"]}
Cluster_labels = pd.DataFrame(data=Cluster_labels)
Cluster_labels

#### Let's analyze the housing sales prices per square meter in specific ranges, so that we can also create new labels that capture pricing.

In [None]:
data_process = df.sort_values('Avg-housePrice').reset_index(drop=True)
data_process = data_process.drop(['Latitude', 'Longitude'], axis=1)
data_process.head()

#### We can examine the frequency of housing sales prices in different ranges. A histogram helps to visualize this

In [None]:
num_bins = 5
n, bins, patches = plt.hist(data_process['Avg-housePrice'], num_bins, facecolor='blue', alpha=0.5)
plt.title("Average Housing Sales Prices in Range",fontsize= 16)
plt.xticks(fontsize=14)
plt.xticks(rotation=0)
plt.xlabel('Average Housing Prices (m2/sq.)', fontsize=14)
plt.ylabel('Counts', fontsize=14)
plt.show()

#### As the histogram above suggests, we can define the ranges as below:

* < 50000000 AHP : "Low Level HSP"
* 75000000-100000000 AHP : "Mid Level HSP"
* ≥ 100000000 AHP : "High Level HSP"
    
In this case, we can create **"Level_labels"** with those levels.

In [None]:
level = []
for price in data_process['Avg-housePrice']:
    if price < 50000000:
        level.append("Low Level HSP")
    elif 75000000 <= price < 100000000:
        level.append("Mid Level HSP")
    else:
        # reached for prices of 100000000 and above, and also for prices
        # from 50000000 up to (not including) 75000000, which match neither test above
        level.append("High Level HSP")

data_process['Level_labels'] = level
data_process.head()
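The branching above can also be written as a small function, which makes it easy to check where a given price lands. This sketch mirrors the three branches of the loop exactly, thresholds included; the function name `price_level` is hypothetical, and the sample prices are the ones shown in the `df.head()` output earlier.

```python
def price_level(price):
    """Hypothetical helper mirroring the three branches of the loop above."""
    if price < 50000000:
        return "Low Level HSP"
    elif 75000000 <= price < 100000000:
        return "Mid Level HSP"
    else:
        # Everything that matches neither test above ends up here
        return "High Level HSP"


# Prices taken from the df.head() output earlier in the notebook
for name, price in [('Beitou', 45026000), ('Daan', 65344000), ('Neihu', 107481500)]:
    print(name, price_level(price))
```

With the notebook's table, the same function could be applied as `data_process['Avg-housePrice'].apply(price_level)`.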

Another goal was to show the top 3 venue categories and their counts for each district on the map. I therefore grouped the venues of each district by category, kept the 3 most frequent ones, and combined that information in the Join column.

In [None]:
top3 = taipei_venues.groupby(['District','Venue Category']).size().reset_index(name='Counts')
top3 = top3.sort_values(['District','Counts'],ascending=False).groupby('District').head(3).reset_index(drop=True)

top3['Join'] = top3['Counts'].map(str) + " " + top3['Venue Category']
top3 = top3.groupby(['District'])['Join'].apply(", ".join).reset_index()

top3.head()
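The construction of the Join column can be followed on a toy table. The rows below are invented, and the sketch sorts districts in ascending order for readability; otherwise it uses the same group, count, keep-three, and join steps as the cell above.

```python
import pandas as pd

# Made-up venue rows for two districts
toy = pd.DataFrame({
    'District': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Hotel', 'Park', 'Cafe', 'Cafe'],
})

counts = toy.groupby(['District', 'Venue Category']).size().reset_index(name='Counts')

# Keep the (up to) three most frequent categories per district
top = (counts.sort_values(['District', 'Counts'], ascending=[True, False])
             .groupby('District').head(3))

# "<count> <category>" strings, joined into one text per district
top = top.assign(Join=top['Counts'].map(str) + ' ' + top['Venue Category'])
joined = top.groupby('District')['Join'].apply(', '.join).reset_index()
print(joined)
```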

## C. Results

### C.1. Main table with results

#### Let's merge those new variables with the related cluster information in our main taipei_merged table.

In [None]:
import numpy as np

result = pd.merge(taipei_merged, 
                    top3[['District', 'Join']],
                    left_on = 'District',
                    right_on = 'District',
                    how = 'left')
result= pd.merge(result, 
                    Cluster_labels[['Clusters', 'Labels']],
                    left_on = 'Cluster Labels',
                    right_on = 'Clusters',
                    how = 'left')
result = pd.merge(result, 
                    data_process[['District', 'Level_labels']],
                    left_on = 'District',
                    right_on = 'District',
                    how = 'left')

result = result.drop(['Clusters'], axis=1)
result.head(3)

#### You can now see Join, Labels and Level_labels columns as the last three ones in above table

### C.2. Map of Cluster Results

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=9.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, join, cluster_number, label in zip(result['Latitude'], result['Longitude'], result['District'], result['Labels'], result['Join'], result['Cluster Labels'], result['Level_labels']):
    label = folium.Popup(str(poi) + " / " + str(cluster) + "-" + str(label) + " / " + str(join), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color= rainbow[cluster_number-1],
        popup=label,
        fill_color = rainbow[cluster_number-1],
        fill_opacity=1).add_to(map_clusters)
       
map_clusters
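One detail of the marker loop is the index `cluster_number-1`: for cluster 0 it evaluates to -1, which Python reads as the last list entry. The sketch below illustrates this with a stand-in color list; the hex values are assumptions for illustration, whereas the notebook builds `rainbow` from matplotlib.

```python
# Stand-in color list (assumed values); the notebook derives `rainbow` from matplotlib
palette = ['#8000ff', '#80ffb4', '#ff0000']

# Same indexing as rainbow[cluster_number-1] in the loop above
color_of = {cluster_number: palette[cluster_number - 1] for cluster_number in range(3)}

for cluster_number, color in color_of.items():
    # Index -1 wraps around, so cluster 0 receives the last color
    print(cluster_number, color)
```

Each cluster still gets its own color; the assignment is just shifted by one position.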

### C.3. Map of Housing Sales Prices
Another goal of this project was to visualize the Average Housing Sales Prices per square meter with a choropleth-style map. I took a json file of the Second-level Administrative Divisions of Taiwan from the Spatial Data Repository of NYU, cleaned it, and pulled out only Taipei City.

The final version of the json file is on GitHub; the next cell downloads it:


In [None]:
!wget --quiet https://raw.githubusercontent.com/augeord/Peer-graded-Assignment-Segmenting-and-Clustering-Neighborhoods-in-Toronto/master/map.geojson
    
#https://geo.nyu.edu/download/file/stanford-nj696zj1674-geojson.json    
print('GeoJSON file downloaded!')

taipei_geo = r'map.geojson'

# Taipei latitude and longitude
latitude = 25.0329694
longitude = 121.56541770000001

# display Taipei
taipei_map = folium.Map(location=[latitude, longitude], zoom_start=10)

In the final section, I created a choropleth map which also shows the following information for each district:

* District name,
* Cluster name,
* Housing Sales Price (HSP) level,
* Top 3 venue categories with their counts

In [None]:
taipei_map.choropleth(
    geo_data=taipei_geo,
    data=taipei_data,
    columns=['District','Avg-housePrice'],
    key_on='feature.properties.name',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='House Sales Price in Taipei',
    highlight=True
)

markers_colors = []
for lat, lon, poi, cluster, join, cluster_number, label in zip(result['Latitude'], result['Longitude'], result['District'], result['Labels'], result['Join'], result['Cluster Labels'], result['Level_labels']):
    label = folium.Popup(str(poi) + " / " + str(cluster) + "-" + str(label) + " / " + str(join), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color= rainbow[cluster_number-1],
        fill=True,
        fill_color= rainbow[cluster_number-1],
        fill_opacity=1).add_to(taipei_map)
   


#display map
taipei_map
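Because the choropleth joins the table to the GeoJSON through `key_on='feature.properties.name'`, a district whose name has no matching feature cannot be colored from the price data. A minimal sketch of a pre-check is given below. `unmatched_districts` is a hypothetical helper, and the GeoJSON fragment is made up; only the `properties.name` layout is taken from the `key_on` value above.

```python
def unmatched_districts(district_names, geojson, prop='name'):
    """Return the district names that have no feature with a matching property."""
    feature_names = {f['properties'].get(prop) for f in geojson.get('features', [])}
    return [d for d in district_names if d not in feature_names]


# Made-up GeoJSON fragment for illustration (geometry omitted)
toy_geo = {'type': 'FeatureCollection', 'features': [
    {'type': 'Feature', 'properties': {'name': 'Beitou'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'name': 'Daan'}, 'geometry': None},
]}

print(unmatched_districts(['Beitou', 'Daan', 'Datong'], toy_geo))
```

For the real file, the dictionary could be loaded with `json.load` from `map.geojson` and compared against `df['District']`.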