<h1 align=center><font size = 5>How Different Are Cities in Northern/Southern China?</font></h1>  
<h2 align=right> ---- Segmenting and Clustering Cities in China</h2>      
<h3 align=center><font size = 2>Victoria Wang</font></h3>  

### 1. Introduction    
In China, where I come from, the diversity of cultures is extraordinary.   
It is aften said that the largest cultural cleavage in China lies between people residing in the north and south side of the [the Qinhuai River](http://www.chinatraveldepot.com/CAD165-Qinhuai-River), where the dialects, foods, lifestyles, and social norms are so different they'd identify themselves as people of two different nations if they hadn't shared the same identity as Chinese citizens.     

I am curious to see if venues in the major cities of north/south China reflect such differences. Therefore, I selected 50 major Chinese cities scattered across the Chinese territory, extracted venue information in these cities from Four Square API, and attempted to cluster these cities based on common venue types.    

What I'm aiming to examine is if the resulting clusters are similar to the geograaphical division of these cities. This notebook shows the complete process of this simple analysis. 

In [53]:
# Install and import all necessary stuff
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import bs4
from bs4 import BeautifulSoup

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

# import k-means from clustering stage
import sklearn
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### 2. Data Acquirement and Cleaning
I start by creating a dataset of cities in China. To do that, I first scrape a table of Chinese cities and their respective coordinates that I found a website. [This is the link](https://www.latlong.net/category/cities-46-15.html). 

In [54]:
# Get city names and coordinates from webpage
with open('Cities in China with Lat Long.htm') as html: 
    soup = BeautifulSoup(html)

table = soup.find('div', class_="col-8").table.tbody

In [55]:
# Get column names of cities
cols = []
for str in table.find_all('th'):
    cols.append(str.text)

df_cities = pd.DataFrame(columns = cols)

# Get all city names
city_names = []
for str in table.find_all('a'): 
    city_names.append(str.text)
    
df_cities['Place Name'] = city_names
df_cities.head()

Unnamed: 0,Place Name,Latitude,Longitude
0,"Foshan, Gundong Province, China",,
1,"Anshan, Liaoning, China",,
2,"Datong, Shanxi, China",,
3,"Luoyang, Henan, China",,
4,"Baotou, Inner Mongolia, China",,


In [56]:
# Get all coordinates
coords = []

# Search 'tr' category
for str in table.find_all('tr'): 
    # Search 'td' sub-cat
    str = str.find_all('td')
    ll = []
    for l in str: # If reuturned value is not in the first column
        if l.a is None:
            ll.append(l.text)
    coords.append(ll)

coords = coords[1:] # For some reason the first item is an empty list, quick fix...
df_cities[['Latitude','Longitude']] = coords
df_cities[['Latitude','Longitude']] = df_cities[['Latitude','Longitude']].astype(float) # Change coords to floats

Then , I do some basic cleaning of the data. 

In [57]:
# Simplify city names 
newnames = []
for str in df_cities['Place Name']:
    str = [x.strip() for x in str.split(',')]
    
    # If name is a normal city within a province
    if len(str) == 3:
        str = str[0] + ", China"
        newnames.append(str)
        
    # If name is Beijing, Shanghai, etc.
    elif len(str) == 2:
        str = str[0] + ", China"
        newnames.append(str)
        
    else: 
        str = str[1] + ", China" 
        newnames.append(str)

df_cities['Place Name'] = newnames
df_cities.head()

Unnamed: 0,Place Name,Latitude,Longitude
0,"Foshan, China",23.016666,113.116669
1,"Anshan, China",41.116669,122.98333
2,"Datong, China",40.083332,113.300003
3,"Luoyang, China",34.669724,112.442223
4,"Baotou, China",40.650002,109.833336


For centuries, China has been using the Qinhuai River to divide the north and south of its territory. I will extend this tradition and use its latitude
to categorize my cities as Northern or Southern. 

In [58]:
# Qinhuai River Coordinates
address = 'Qinhuai River'
geolocator = Nominatim(user_agent = "cn_explorer")
location = geolocator.geocode(address)
river_lat = location.latitude
river_lng = location.longitude
print('The geograpical coordinate of Qinhuai River are {}, {}.'.format(river_lat, river_lng))

The geograpical coordinate of Qinhuai River are 32.0148145, 118.8113083.


Awesome. I will compare the latitudes of the city to the latitude of Qinhuai River. 

In [59]:
def nors(row):
    if row[1] > river_lat:
        return 'N'
    return 'S'

df_cities['N/S'] = df_cities.apply(nors, 1)
df_cities.head()

Unnamed: 0,Place Name,Latitude,Longitude,N/S
0,"Foshan, China",23.016666,113.116669,S
1,"Anshan, China",41.116669,122.98333,N
2,"Datong, China",40.083332,113.300003,N
3,"Luoyang, China",34.669724,112.442223,N
4,"Baotou, China",40.650002,109.833336,N


### 3. Visualize Northern/Southern cities in China
Now, I've successfully achieved 50 Chinese cities' information, including their names and coordinates, as well as whether they are northern or southern cities in China. 

I should probably generate a map to show these cities, to get a more intuitive sense of how they are located geographically. 

In [60]:
# Get the coordinates of China 
address = 'China'
geolocator = Nominatim(user_agent = "cn_explorer")
location = geolocator.geocode(address)
cn_lat = location.latitude
cn_lng = location.longitude
print('The geograpical coordinate of China are {}, {}.'.format(cn_lat, cn_lng))

The geograpical coordinate of China are 35.000074, 104.999927.


In [61]:
# Create map of China using latitude and longitude values
map_cn = folium.Map(location = [cn_lat, cn_lng], zoom_start = 3.5)

# Change color according to North/South divide 
def color(ns): 
    if ns == 'N': 
        return 'darkblue'
    
    return 'lightpink'

# Add markers to the map
for name, lat, lng, group in zip(df_cities['Place Name'], 
                                 df_cities['Latitude'], 
                                 df_cities['Longitude'], 
                                 df_cities['N/S']):
    label = name
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 4,
        popup = label,
        color = color(group),
        fill = True,
        fill_color = '#ffffff',
        parse_html = False).add_to(map_cn)  
    
map_cn

And there -- I've successfully visualized all 50 cities in China that I want to analyze. 
Now let's get the top venues in these cities! 

### 4. Obtain venue information from Four Square

In [62]:
# Set up Foursquare credentials
CLIENT_ID = 'DU1RZNIYZKMR4ND3KK4F3FASRDB1GJGKM0TYVLLLF0NFQKZU' # your Foursquare ID
CLIENT_SECRET = '30BO30FW3YDGZM0BQO0VV0VR0WDPNYESACKI4UFCQD1IAPX4' # your Foursquare Secret
VERSION = '20180604'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DU1RZNIYZKMR4ND3KK4F3FASRDB1GJGKM0TYVLLLF0NFQKZU
CLIENT_SECRET:30BO30FW3YDGZM0BQO0VV0VR0WDPNYESACKI4UFCQD1IAPX4


I borrow the 'get_category_type' function from the Foursquare lab and use it to get the venues' types. 

In [63]:
# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

I borrow the function from the previous lab, which gets the top 100 venues around the city center. 

In [71]:
def getNearbyVenues(names, latitudes, longitudes):
    LIMIT = 100
    url_format = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&near={}&limit={}'

    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = url_format.format(CLIENT_ID, 
                                CLIENT_SECRET, 
                                VERSION, 
                                name, 
                                LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

I apply the funtion to my city data. 

In [65]:
china_venues = getNearbyVenues(names = df_cities['Place Name'],
                               latitudes = df_cities['Latitude'],
                               longitudes = df_cities['Longitude'])

A quick look at the total number of venues I got from the search. 

In [66]:
# Count number of venues in each neighborhood, sort the result ascendingly. 
num = pd.DataFrame(china_venues.groupby('City').count()['Venue Category']).sort_values(by = 'Venue Category').reset_index()
num

Unnamed: 0,City,Venue Category
0,"Yichun, China",2
1,"Zibo, China",4
2,"Wuhu, China",4
3,"Wuhai, China",4
4,"Tieling, China",4
5,"Baotou, China",5
6,"Yancheng, China",5
7,"Tangshan, China",5
8,"Qiqihar, China",5
9,"Jiamusi, China",5


The result is rather disappointing, though, for **more than half of my cities got less than 15 results from the search.**    

I'm assuming that the lack of venue information of these cities **may affect the clustering results**, but let's proceed anyways.    

The next step is to count the frequency of occurances of each venue in a given city. 

In [24]:
# One hot encoding
china_onehot = pd.get_dummies(china_venues['Venue Category'], prefix="", prefix_sep="")

# Move neighborhood column to the first column
fixed_columns = [china_onehot.columns[-1]] + list(china_onehot.columns[:-1])
china_onehot = china_onehot[fixed_columns]
#china_onehot = china_onehot.drop('City', axis = 1)

# Put Neighbourhood names back to the one-hot-coded dataframe
names = pd.DataFrame(china_venues['City'])
china = pd.concat([names, china_onehot], axis=1)
china.head()

# Get the frenquency of occurances of each type of venue for each neighborhood
china_grouped = china.groupby('City').mean().reset_index()
china_grouped.head()

Unnamed: 0,City,Zoo,Airport,American Restaurant,Aquarium,Arcade,Art Museum,Arts & Crafts Store,Asian Restaurant,BBQ Joint,...,University,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Volcano,Whisky Bar,Wine Bar,Xinjiang Restaurant,Yoga Studio,Yunnan Restaurant,Zhejiang Restaurant
0,"Anshan, China",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Baotou, China",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Beijing, China",0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.03,0.01
3,"Changchun, China",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Changsha, China",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, we generate a dataframe that contains the top 10 venues in each city (now, because some of the cities have less than 10 results, this new dataframe may not correctly represent the attributes of venues in these cities). 

In [67]:
# Define a fundtion that returns the most common venue of a city
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [68]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
cities_venues_sorted = pd.DataFrame(columns = columns)
cities_venues_sorted['City'] = china_grouped['City']

for ind in np.arange(china_grouped.shape[0]):
    cities_venues_sorted.iloc[ind, 1:] = return_most_common_venues(china_grouped.iloc[ind, :], num_top_venues)

cities_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Anshan, China",Coffee Shop,Hotel,Shopping Mall,Park,Dim Sum Restaurant,Train Station,Farmers Market,Furniture / Home Store,French Restaurant,Fountain
1,"Baotou, China",Fast Food Restaurant,Hotel,Coffee Shop,BBQ Joint,Shopping Mall,Chinese Restaurant,Farmers Market,Garden,Furniture / Home Store,French Restaurant
2,"Beijing, China",Historic Site,Hotel,Park,Coffee Shop,Chinese Restaurant,Peking Duck Restaurant,Yunnan Restaurant,Electronics Store,Café,Bookstore
3,"Changchun, China",Hotel,Coffee Shop,Fast Food Restaurant,Shopping Mall,Department Store,Plaza,Flea Market,Park,Train Station,History Museum
4,"Changsha, China",Coffee Shop,Shopping Mall,Hotel,Chinese Restaurant,Park,Department Store,Fast Food Restaurant,Historic Site,Multiplex,Mountain


Have a quick look at the top venues of each city. 

In [70]:
num_top_venues = 10

for city in china_grouped['City']:
    print("----"+city+"----")
    temp = china_grouped[china_grouped['City'] == city].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Anshan, China----
                venue  freq
0         Coffee Shop  0.22
1               Hotel  0.22
2       Shopping Mall  0.22
3       Train Station  0.11
4                Park  0.11
5  Dim Sum Restaurant  0.11
6               Plaza  0.00
7                 Pub  0.00
8           Nightclub  0.00
9        Noodle House  0.00


----Baotou, China----
                  venue  freq
0  Fast Food Restaurant  0.25
1             BBQ Joint  0.17
2           Coffee Shop  0.17
3         Shopping Mall  0.17
4                 Hotel  0.17
5    Chinese Restaurant  0.08
6                   Zoo  0.00
7          Optical Shop  0.00
8       Other Nightlife  0.00
9     Outdoor Sculpture  0.00


----Beijing, China----
                    venue  freq
0           Historic Site  0.19
1                   Hotel  0.11
2                    Park  0.08
3             Coffee Shop  0.06
4      Chinese Restaurant  0.05
5  Peking Duck Restaurant  0.04
6       Yunnan Restaurant  0.03
7     Dumpling Restaurant  0.02
8  

                  venue  freq
0                 Hotel  0.31
1                 River  0.08
2  Fast Food Restaurant  0.08
3           Bus Station  0.08
4             Gastropub  0.08
5         Shopping Mall  0.08
6           Coffee Shop  0.08
7         Train Station  0.08
8           Pizza Place  0.08
9            Restaurant  0.08


----Qingdao, China----
                venue  freq
0         Coffee Shop  0.19
1               Hotel  0.17
2       Shopping Mall  0.07
3                Park  0.06
4                 Bar  0.06
5  Italian Restaurant  0.04
6       Historic Site  0.04
7              Museum  0.04
8               Beach  0.04
9              Church  0.02


----Qiqihar, China----
                   venue  freq
0     Chinese Restaurant  0.29
1          Train Station  0.14
2         Clothing Store  0.14
3            Bus Station  0.14
4                   Lake  0.14
5                   Park  0.14
6           Optical Shop  0.00
7        Other Nightlife  0.00
8      Outdoor Sculpture  0.00
9 

### 5. K-Means Clustering 
Now, finally, with the cleaned data I perform the K-Means clustering analysis, and try to make my cities into 2 clusters according to venues in their cities. Below is the process and results. 

In [44]:
# Set number of clusters
kclusters = 2

china_grouped_clustering = china_grouped.drop('City', 1)

# Run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(china_grouped_clustering)

# Add clustering labels
cities_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
cities_venues_sorted.head()
# Merge toronto_grouped with toronto data to add latitude/longitude for each neighborhood
china_merged = df_cities
china_merged.rename(columns = {'Place Name':'City'}, inplace = True)
china_merged = china_merged.join(cities_venues_sorted.set_index('City'), on = 'City').dropna(axis = 0)
# For some reason the labels changed to floats after the join. Quick fix!
china_merged['Cluster Labels'] = china_merged['Cluster Labels'].astype(int)
china_merged.head()

Unnamed: 0,City,Latitude,Longitude,N/S,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Foshan, China",23.016666,113.116669,S,1,Shopping Mall,Coffee Shop,Hotel,Fast Food Restaurant,Pizza Place,Metro Station,Dim Sum Restaurant,History Museum,Neighborhood,Buddhist Temple
1,"Anshan, China",41.116669,122.98333,N,1,Coffee Shop,Hotel,Shopping Mall,Park,Dim Sum Restaurant,Train Station,Farmers Market,Furniture / Home Store,French Restaurant,Fountain
2,"Datong, China",40.083332,113.300003,N,0,Hotel,Hostel,Shopping Mall,Chinese Restaurant,Train Station,French Restaurant,Fountain,Food,Flea Market,Fish & Chips Shop
3,"Luoyang, China",34.669724,112.442223,N,1,Historic Site,Hotel,Hostel,Chinese Restaurant,Train Station,Furniture / Home Store,French Restaurant,Fountain,Food,Flea Market
4,"Baotou, China",40.650002,109.833336,N,1,Fast Food Restaurant,Hotel,Coffee Shop,BBQ Joint,Shopping Mall,Chinese Restaurant,Farmers Market,Garden,Furniture / Home Store,French Restaurant


### 6. Visualize the result
The dataframe doesn't give me an intuitive sense on how the cities are clustered. I'll create a graph to see it more clearly. 

In [45]:
# Create map
map_clusters = folium.Map(location=[cn_lat, cn_lng], zoom_start = 3.5)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# Add markers to the map

for lat, lon, poi, cluster in zip(china_merged['Latitude'], china_merged['Longitude'], china_merged['City'], china_merged['Cluster Labels']):
    label = folium.Popup(poi, parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color = '#ffffff',
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 7. Summary
As we can clearly see from the map, clusters based on the types of common venues is perhaps not a good way of distinguishing northern/southern cities in China. Instead, the majority of cities, including cities in both the northern and southern regions of China, are put into the second cluster, while only a few are put into the first cluster.    

Perhaps, no matter where a city is located, business-wise they are all urban environments whose residents have the demand for similar kinds of venues, such as hotels, coffee shops, shopping malls etc. In the future, I hope I can find a better way of distinguishing cities in different regions and/or cities whose populations have different social/cultural traits.    

Thanks for reading it through! 