# A Guide to Living in Shanghai

## Coursera Applied Data Science Specialization "The Battle of Neighborhoods" Capstone Project Notebook

### Part 2: Data 

The project uses two datasets, namely ***shanghai_demo*** and ***shanghai_data***.

***shanghai_demo*** contains demographic information of all 16 districts (区) in Shanghai. It has 6 variables: _District_, _Population Density_, _Salary_, _Home Price_, _GDP_ and _GDP Per Capita_.

In the ***shanghai_demo*** dataset, _District_ is a character column listing Shanghai's 16 districts. _Population Density_ (persons per square kilometer) is the population density in 2017 (data source: [Shanghai Statistical Yearbook 2018](http://www.stats-sh.gov.cn/tjnj/nje18.htm?d1=2018tjnje/E0202.htm)). _Salary_ (RMB/month) is the personal monthly salary in 2019 (data source: [Sohu](http://www.sohu.com/a/297378775_391502)). _Home Price_ (RMB/square meter) is from the same source ([Sohu](http://www.sohu.com/a/297378775_391502)). The _GDP_ (billion RMB) column holds district-level GDP in 2018 (data source: [Sohu](http://www.sohu.com/a/314957795_612645)). Finally, _GDP Per Capita_ (RMB, 2017 data; shown as `GDP PP` in the output) is sourced from [好金贵财经](https://www.haojingui.com/gdp/5058.html).

Note that the variables come from different sources and different years, because it is extremely hard to find a single source that provides all of the relevant district-level data. This is the most serious limitation of this project. Provided the source data are accurate, however, it should not affect the results greatly, since these figures are unlikely to change much within one to two years.

In [1]:
# Import library
import pandas as pd

In [3]:
# Read shanghai_demo dataset
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
shanghai_demo=pd.read_csv(r"D:\Learning\Applied Data Science\Course 4 Applied Data Science Capstone\Week 4 and 5 Project\Shanghai District.csv")

# Print all rows of shanghai_demo
shanghai_demo

Unnamed: 0,District,Population Density,Salary,Home Price,GDP,GDP PP
0,Pudong,4567,8170,48713,1046.009,175448
1,Huangpu,32004,7160,81375,227.03,320701
2,Xuhui,19874,7640,71064,167.0,144983
3,Changning,18112,8030,68491,142.8,191305
4,Jing'an,28910,8380,66228,184.7,159550
5,Putuo,23431,7720,55738,100.17,72796
6,Hongkou,34058,7970,58927,83.801,96955
7,Yangpu,21627,7220,59443,184.77,130074
8,Minhang,6836,8030,47381,201.36,88089
9,Baoshan,7494,7910,38860,139.206,56506
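
The output above shows an `Unnamed: 0` column and the header `GDP PP` rather than _GDP Per Capita_. A minimal sketch of one way to tidy this at load time follows. It assumes that `Unnamed: 0` is a row index that was written into the CSV file, which the notebook does not confirm. The CSV text is re-typed from the first three rows of the output so that the sketch runs without the original file.

```python
import io

import pandas as pd

# First three rows of the output above, re-typed as CSV text.
# The empty first header cell is an assumption about how the file was saved.
csv_text = """,District,Population Density,Salary,Home Price,GDP,GDP PP
0,Pudong,4567,8170,48713,1046.009,175448
1,Huangpu,32004,7160,81375,227.03,320701
2,Xuhui,19874,7640,71064,167.0,144983
"""

# index_col=0 reuses the saved index instead of creating an "Unnamed: 0" column.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Rename the abbreviated header so it matches the variable name used in the text.
demo = demo.rename(columns={"GDP PP": "GDP Per Capita"})

print(demo.columns.tolist())
```

With the real file, the same two arguments (`index_col=0` and the `rename` mapping) would be applied to the `pd.read_csv` call in the cell above.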


***shanghai_data*** lists prominent neighborhoods in Shanghai (in both English and Chinese) as well as the districts they belong to. It has three variables: _District_, _Neighborhood_ and _Neighborhood Chinese Name_.

It is worth highlighting that Shanghai has no direct counterpart to the "neighborhood" (社区). "Neighborhood" is essentially a Western concept and does not denote the same thing in China. There, a "neighborhood" is closer to a "residential community" (小区) that contains only residential buildings, rather than a large area with shopping malls, stores, restaurants and attractions (and, of course, residential communities). The closest equivalent in Shanghai is in fact the "**subdistrict**" (街道).

"Subdistrict," however, is still not entirely the same as "neighborhood" in the Western sense. For instance, [Wujiaochang (五角场)](https://en.wikipedia.org/wiki/Wujiaochang) in Yangpu District is formally a subdistrict ("五角场街道") in the hierarchy of [Shanghai's township-level divisions](https://en.wikipedia.org/wiki/List_of_township-level_divisions_of_Shanghai) and can be treated as a neighborhood to an extent. In contrast, the famous [Xintiandi (新天地)](https://www.travelchinaguide.com/attraction/shanghai/xin-tian-di.htm), an area in Huangpu District known for its art, good food and fashion, is not a subdistrict but can be thought of as a neighborhood.

Nevertheless, the report adopts the Western convention and focuses on a total of 47 neighborhoods (subdistricts/towns) across the 16 districts of Shanghai. The author built ***shanghai_data*** himself, selecting the neighborhoods at his own discretion.

In [4]:
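# Read shanghai_data dataset
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
shanghai_data=pd.read_excel(r"D:\Learning\Applied Data Science\Course 4 Applied Data Science Capstone\Week 4 and 5 Project\Shanghai Neighborhood.xlsx")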

In [5]:
# Print all rows of shanghai_data
shanghai_data

Unnamed: 0,District,Neighborhood,Neighborhood Chinese Name
0,Pudong,Lujiazui,陆家嘴
1,Pudong,Century Park,世纪公园
2,Pudong,Zhoujiadu,周家渡
3,Pudong,Zhangjiang,张江
4,Huangpu,People's Square,人民广场
5,Huangpu,Huaihai Road,淮海路
6,Huangpu,The Bund,外滩
7,Huangpu,Former French Concession,旧法租界
8,Huangpu,Xintiandi,新天地
9,Huangpu,Dapuqiao,打浦桥
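
Both datasets share the _District_ column, so it can be useful to confirm that every neighborhood maps to a district present in the demographic table before combining them. The sketch below is illustrative only: the two small frames are re-typed from rows shown in the outputs above, and the variable names (`neighborhoods`, `demo`, `merged`) are hypothetical rather than taken from the notebook.

```python
import pandas as pd

# A few rows re-typed from the outputs above.
neighborhoods = pd.DataFrame({
    "District": ["Pudong", "Pudong", "Huangpu", "Huangpu"],
    "Neighborhood": ["Lujiazui", "Zhangjiang", "The Bund", "Xintiandi"],
})
demo = pd.DataFrame({
    "District": ["Pudong", "Huangpu", "Xuhui"],
    "Salary": [8170, 7160, 7640],
})

# Number of listed neighborhoods per district.
counts = neighborhoods.groupby("District").size()

# Districts spelled differently in, or absent from, one of the two tables.
unknown = set(neighborhoods["District"]) - set(demo["District"])
missing = set(demo["District"]) - set(neighborhoods["District"])

# Attach district-level figures to each neighborhood; validate raises an
# error if a district appears more than once in the demographic table.
merged = neighborhoods.merge(demo, on="District", how="left", validate="many_to_one")

print(counts)
print("unknown:", unknown, "missing:", missing)
```

Run against the full tables, `counts.sum()` should equal the 47 neighborhoods mentioned above, and both `unknown` and `missing` should be empty if all 16 districts are covered.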


***shanghai_data*** does not include latitude/longitude information for the neighborhoods, so each neighborhood name is geocoded to obtain its coordinates.

In [7]:
# Import libraries
import geopy
from geopy.geocoders import Nominatim

geolocator=Nominatim(user_agent="shanghai_explorer")

# Append the string ', Shanghai' to the Neighborhood column
# so that the geocoder resolves each name within Shanghai (better location precision)
shanghai_data['Neighborhood']=shanghai_data['Neighborhood']+', Shanghai'

# Get coordinates
shanghai_data['Coordinates']=shanghai_data['Neighborhood'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))

# Separate the Coordinates column into Latitude and Longitude columns
shanghai_data[['Latitude','Longitude']]=shanghai_data['Coordinates'].apply(pd.Series)

# Drop Coordinates column
shanghai_data.drop(['Coordinates'],axis=1,inplace=True)
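
The cell above assumes that every lookup succeeds and overwrites the _Neighborhood_ column with the query string. geopy's `geocode` returns `None` when nothing is found, in which case the `lambda` would raise an `AttributeError`. A minimal, more defensive sketch follows. The helper name `add_coordinates` and the stand-in geocoder are hypothetical, introduced only so that the example runs without network access.

```python
from types import SimpleNamespace

import pandas as pd


def add_coordinates(df, geocode, suffix=", Shanghai"):
    """Return a copy of df with Latitude and Longitude columns added.

    geocode is any callable that maps a query string to an object with
    .latitude and .longitude attributes, or to None when nothing is found.
    """
    out = df.copy()
    # Build the query separately so the Neighborhood column stays unchanged.
    queries = out["Neighborhood"] + suffix
    locations = [geocode(query) for query in queries]
    # Failed lookups become NaN instead of raising an AttributeError.
    out["Latitude"] = [loc.latitude if loc is not None else float("nan") for loc in locations]
    out["Longitude"] = [loc.longitude if loc is not None else float("nan") for loc in locations]
    return out


# Stand-in geocoder (hypothetical) that knows a single place; the
# coordinates are those shown for Lujiazui in the notebook's output.
known = {"Lujiazui, Shanghai": SimpleNamespace(latitude=31.240168, longitude=121.497945)}


def fake_geocode(query):
    return known.get(query)


sample = pd.DataFrame({
    "District": ["Pudong", "Pudong"],
    "Neighborhood": ["Lujiazui", "No Such Place"],
})
result = add_coordinates(sample, fake_geocode)
print(result)
```

With the real service, `geolocator.geocode` would be passed in place of `fake_geocode`. Wrapping it in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` is one way to space out requests to Nominatim.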

Now ***shanghai_data*** has geographic information included:

In [9]:
shanghai_data.head(10)

Unnamed: 0,District,Neighborhood,Neighborhood Chinese Name,Latitude,Longitude
0,Pudong,"Lujiazui, Shanghai",陆家嘴,31.240168,121.497945
1,Pudong,"Century Park, Shanghai",世纪公园,31.2187,121.554338
2,Pudong,"Zhoujiadu, Shanghai",周家渡,31.187146,121.489237
3,Pudong,"Zhangjiang, Shanghai",张江,31.207347,121.610182
4,Huangpu,"People's Square, Shanghai",人民广场,31.231926,121.471535
5,Huangpu,"Huaihai Road, Shanghai",淮海路,31.220936,121.467353
6,Huangpu,"The Bund, Shanghai",外滩,31.234038,121.488921
7,Huangpu,"Former French Concession, Shanghai",旧法租界,31.211806,121.464982
8,Huangpu,"Xintiandi, Shanghai",新天地,31.217936,121.469819
9,Huangpu,"Dapuqiao, Shanghai",打浦桥,31.208286,121.463941
