# 1 Data Preprocessing

#### Objective: Cleaning & filtering

You must use the pandas and geopandas packages to load and clean all the datasets. The process of cleaning & filtering the data should include:

● Removing unnecessary columns, keeping only the columns needed to answer questions in the other parts of this project;

● Removing invalid data points (use your discretion!);

● Normalizing column names & column types where needed;

● Normalizing the Spatial Reference Identifiers (SRID) of any geometry.

Tip: Use SoQL to control which data is downloaded:

https://dev.socrata.com/docs/queries/
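
As a rough illustration of the SoQL tip, the sketch below builds a request URL with `$select`, `$where`, and `$limit` clauses. The helper name `build_soql_url` is hypothetical (it is not part of this notebook), and the column list and date range are simply the ones used later in this section.

```python
from urllib.parse import urlencode

# Endpoint of the 311 dataset used later in this notebook.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def build_soql_url(base_url, select=None, where=None, limit=None):
    """Hypothetical helper: append SoQL clauses to base_url as query parameters."""
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if where:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = limit
    # urlencode percent-escapes spaces, quotes and "$" so the URL is safe to request.
    return f"{base_url}?{urlencode(params)}" if params else base_url


url = build_soql_url(
    BASE_URL,
    select=["unique_key", "created_date", "incident_zip", "complaint_type"],
    where="created_date between '2015-10-01T12:00:00.000' and '2023-09-30T12:00:00.000'",
    limit=100,
)
print(url)
```

Selecting columns and filtering rows on the server keeps the download small, which matters once the `$limit` is raised beyond the 100 rows used for testing.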

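#### Define a function to get data from NYC Open Data

In [27]:
import pandas as pd
import requests

def get_dataframe(url):
    """Request a Socrata JSON endpoint and return the records as a DataFrame."""
    headers = {
        'Accept': 'application/json',
        'X-App-Token': '8ITaLVGKJEzelLCfrNyuIi2rJ',
    }

    response = requests.get(url, headers=headers)
    # Fail early on HTTP errors instead of building a DataFrame from an error payload
    response.raise_for_status()
    data = response.json()

    return pd.DataFrame(data)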

###  1.1 Download 311 data

In [28]:
url_311 = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$limit=100"
df_311 = get_dataframe(url_311)

#### 1.1.1 Choose a specific date range

In [None]:
# Use this date range for now; load the full dataset at the end.
# SoQL filter to append to the request URL:
# $where=created_date between '2015-10-01T12:00:00.000' and '2023-09-30T12:00:00.000'

In [None]:
df_311['created_date'].max()
df_311['created_date'].min()

#### 1.1.2 Choose specific columns

In [50]:
new_311 = df_311[['unique_key', 'created_date', 'incident_zip', 'complaint_type']]

#### 1.1.3 Remove invalid data points

* Handle missing values (fill missing data)

* Fix incorrect data types

* Handle outliers
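
One possible way to carry out the three steps listed above is sketched below on a small, made-up sample with the same four columns. The function name `clean_311`, the sample rows, and the rule that a valid ZIP code is exactly five digits are assumptions for illustration, not decisions taken in this notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the four 311 columns kept above.
sample = pd.DataFrame({
    "unique_key": ["1", "2", None, "4", "5"],
    "created_date": ["2023-11-26T12:00:00.000", "not a date",
                     "2023-11-25T08:30:00.000", "2023-11-24T09:00:00.000",
                     "2023-11-23T10:00:00.000"],
    "incident_zip": ["10466", "11203", "10001", None, "1234"],
    "complaint_type": ["Derelict Vehicles", "Noise - Commercial",
                       "Noise - Commercial", "Illegal Parking", "Rodent"],
})


def clean_311(df):
    """Return a cleaned copy: valid key, parseable date, five-digit ZIP code."""
    out = df.copy()
    # Wrong types: unparseable dates become NaT instead of raising an error.
    out["created_date"] = pd.to_datetime(out["created_date"], errors="coerce")
    # Missing values: drop rows that lack a key, a date, or a ZIP code.
    out = out.dropna(subset=["unique_key", "created_date", "incident_zip"])
    # Anomalous values: keep only ZIP codes made of exactly five digits (assumed rule).
    out = out[out["incident_zip"].astype(str).str.fullmatch(r"\d{5}")]
    return out.reset_index(drop=True)


cleaned = clean_311(sample)
print(cleaned)
```

Dropping rows is only one option; filling missing values may be preferable for columns that are not needed as join keys.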

In [55]:
new_311.head()

| | unique_key | created_date | incident_zip | complaint_type |
|---|---|---|---|---|
| 0 | 59545060 | 2023-11-26T12:00:00.000 | 10466 | Derelict Vehicles |
| 1 | 59547157 | 2023-11-26T12:00:00.000 | 10466 | Derelict Vehicles |
| 2 | 59544006 | 2023-11-26T12:00:00.000 | 10466 | Derelict Vehicles |
| 3 | 59549309 | 2023-11-26T12:00:00.000 | 10466 | Derelict Vehicles |
| 4 | 59550253 | 2023-11-26T01:06:18.000 | 11203 | Noise - Commercial |


In [53]:
# Drop rows that have no unique_key
new_311 = new_311.dropna(subset=['unique_key'])
new_311['unique_key']

0     59545060
1     59547157
2     59544006
3     59549309
4     59550253
        ...   
95    59542895
96    59542795
97    59548177
98    59549270
99    59544982
Name: unique_key, Length: 100, dtype: object

#### 1.1.4 Normalize column types

In [36]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
new_311 = new_311.copy()
# fillna returns a new object, so the result must be assigned back.
# Fill only the text columns; a "None" string in created_date would not parse as a date.
text_cols = ['incident_zip', 'complaint_type']
new_311[text_cols] = new_311[text_cols].fillna("None")
new_311['incident_zip'] = new_311['incident_zip'].astype(str)
new_311['complaint_type'] = new_311['complaint_type'].astype(str)
new_311['created_date'] = pd.to_datetime(new_311['created_date'])

new_311.dtypes


unique_key                object
created_date      datetime64[ns]
incident_zip              object
complaint_type            object
dtype: object

### 1.2 Download Tree data

In [44]:
url_tree = "https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=100"
df_tree = get_dataframe(url_tree)

#### 1.2.1 Choose specific columns

In [56]:
new_tree = df_tree[['tree_id', 'zipcode', 'x_sp', 'y_sp', 'longitude', 'latitude', 'spc_common', 'health', 'status']]
new_tree.head()

| | tree_id | zipcode | x_sp | y_sp | longitude | latitude | spc_common | health | status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 180683 | 11375 | 1027431.148 | 202756.7687 | -73.84421522 | 40.72309177 | red maple | Fair | Alive |
| 1 | 200540 | 11357 | 1034455.701 | 228644.8374 | -73.81867946 | 40.79411067 | pin oak | Fair | Alive |
| 2 | 204026 | 11211 | 1001822.831 | 200716.8913 | -73.9366077 | 40.71758074 | honeylocust | Good | Alive |
| 3 | 204337 | 11211 | 1002420.358 | 199244.2531 | -73.93445616 | 40.71353749 | honeylocust | Good | Alive |
| 4 | 189565 | 11215 | 990913.775 | 182202.426 | -73.97597938 | 40.66677776 | American linden | Good | Alive |


#### 1.2.2 Clean data

#### 1.2.3 Normalize column types

In [57]:
new_tree.dtypes

tree_id       object
zipcode       object
x_sp          object
y_sp          object
longitude     object
latitude      object
spc_common    object
health        object
status        object
dtype: object
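
Every tree column arrives as a string because the data comes from JSON. A minimal sketch of one way to normalize the types follows, using two rows copied from the `new_tree.head()` output above. Keeping `zipcode` as a string is an assumption made here so that it matches the string ZIP codes in the 311 data.

```python
import pandas as pd

# Two rows copied from the new_tree.head() output, stored as strings
# the way the JSON endpoint returns them.
tree_sample = pd.DataFrame({
    "tree_id": ["180683", "200540"],
    "zipcode": ["11375", "11357"],
    "x_sp": ["1027431.148", "1034455.701"],
    "y_sp": ["202756.7687", "228644.8374"],
    "longitude": ["-73.84421522", "-73.81867946"],
    "latitude": ["40.72309177", "40.79411067"],
    "spc_common": ["red maple", "pin oak"],
    "health": ["Fair", "Fair"],
    "status": ["Alive", "Alive"],
})

typed = tree_sample.copy()
# Identifier: integer. ZIP codes stay strings (assumed choice, see the note above).
typed["tree_id"] = typed["tree_id"].astype("int64")
# Coordinates: floats; values that cannot be parsed become NaN.
for col in ["x_sp", "y_sp", "longitude", "latitude"]:
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

print(typed.dtypes)
```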

### 1.3 Download Geo data

In [None]:
import geopandas as gpd

In [None]:
gdf = gpd.read_file('data/nyc_zipcodes.shp')
print(gdf.head())
gdf.plot()

In [None]:
gdf2 = gpd.read_file('data/nyc_zipcodes.dbf')
gdf2.head()

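#### Normalize the Spatial Reference Identifiers (SRID) of any geometry

The GeoPandas library can be used to normalize the Spatial Reference Identifiers (SRID) of the geometries in a dataset. Normalizing the SRID means ensuring that all geometries in the dataset use the same reference system. This helps maintain consistency and accuracy when working with spatial data.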
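
A minimal sketch of SRID normalization with GeoPandas is shown below. It builds a one-point layer from the `x_sp`/`y_sp` values of the first tree row and reprojects it. Two things are assumptions here: that those state-plane coordinates are in EPSG:2263, and that EPSG:4326 (WGS 84 longitude/latitude) is the common target. The helper `normalize_crs` is hypothetical.

```python
import geopandas as gpd

# Hypothetical point layer; the coordinates come from the first tree row and are
# ASSUMED to be in EPSG:2263 (New York Long Island state plane, US feet).
points = gpd.GeoDataFrame(
    {"tree_id": [180683]},
    geometry=gpd.points_from_xy([1027431.148], [202756.7687]),
    crs="EPSG:2263",
)

TARGET_CRS = "EPSG:4326"  # assumed common target: WGS 84 longitude/latitude


def normalize_crs(gdf, target=TARGET_CRS):
    """Hypothetical helper: return a copy of gdf reprojected to the target CRS."""
    if gdf.crs is None:
        # Reprojection is impossible without knowing the source CRS.
        raise ValueError("GeoDataFrame has no CRS; assign one with set_crs first")
    return gdf.to_crs(target)


points_wgs84 = normalize_crs(points)
print(points_wgs84.crs)
print(points_wgs84.geometry.iloc[0])
```

The same call would apply to the ZIP code layer loaded above, for example `normalize_crs(gdf)`, once its `gdf.crs` has been checked.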

### 1.4 Zillow Rent data

In [59]:
df_zillow = pd.read_csv('data/zillow_rent_data.csv')
df_zillow.head()

| | RegionID | SizeRank | RegionName | RegionType | StateName | State | City | Metro | CountyName | 2015-01-31 | ... | 2022-12-31 | 2023-01-31 | 2023-02-28 | 2023-03-31 | 2023-04-30 | 2023-05-31 | 2023-06-30 | 2023-07-31 | 2023-08-31 | 2023-09-30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 91982 | 1 | 77494 | zip | TX | TX | Katy | Houston-The Woodlands-Sugar Land, TX | Fort Bend County | 1606.206406 | ... | 1994.653463 | 2027.438438 | 2042.237444 | 2049.325559 | 2016.531345 | 2023.438976 | 2031.558202 | 2046.144009 | 2053.486247 | 2055.771355 |
| 1 | 91940 | 3 | 77449 | zip | TX | TX | Katy | Houston-The Woodlands-Sugar Land, TX | Harris County | 1257.81466 | ... | 1749.6979 | 1738.217986 | 1747.30584 | 1758.407295 | 1758.891075 | 1762.980879 | 1771.751591 | 1779.338402 | 1795.384582 | 1799.63114 |
| 2 | 91733 | 5 | 77084 | zip | TX | TX | Houston | Houston-The Woodlands-Sugar Land, TX | Harris County | NaN | ... | 1701.21752 | 1706.900064 | 1706.067787 | 1723.72232 | 1735.48467 | 1752.132904 | 1756.990323 | 1754.429516 | 1757.602011 | 1755.03149 |
| 3 | 93144 | 6 | 79936 | zip | TX | TX | El Paso | El Paso, TX | El Paso County | NaN | ... | 1419.480272 | 1458.063897 | 1471.726681 | 1466.734658 | 1456.17566 | 1462.478506 | 1466.267391 | 1490.237063 | 1488.180414 | 1494.366097 |
| 4 | 62093 | 7 | 11385 | zip | NY | NY | New York | New York-Newark-Jersey City, NY-NJ-PA | Queens County | NaN | ... | 2935.80822 | 2895.699421 | 2873.209025 | 2881.906361 | 2913.546218 | 2963.964134 | 3005.735342 | 3034.413822 | 3064.476503 | 3079.585783 |
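
The Zillow table is wide (one column per month) and covers the whole country. The sketch below shows one way it might be reduced to what this project needs, using a small stand-in with values copied from the output above. It assumes that NYC rows can be selected with `City == "New York"` and that `RegionName` holds the ZIP code; the new names `zipcode`, `date`, and `rent` are illustrative.

```python
import pandas as pd

# Small stand-in for df_zillow with only two of the monthly columns.
zillow_sample = pd.DataFrame({
    "RegionID": [91982, 62093],
    "RegionName": [77494, 11385],
    "City": ["Katy", "New York"],
    "State": ["TX", "NY"],
    "2023-08-31": [2053.486247, 3064.476503],
    "2023-09-30": [2055.771355, 3079.585783],
})

# Monthly columns are the ones whose names start with a four-digit year.
date_cols = [c for c in zillow_sample.columns if c[:4].isdigit()]

nyc_rent = (
    zillow_sample[zillow_sample["City"] == "New York"]  # keep NYC rows (assumed filter)
    .rename(columns={"RegionName": "zipcode"})          # normalize the column name
    .melt(id_vars=["zipcode"], value_vars=date_cols,
          var_name="date", value_name="rent")           # wide -> long
    .dropna(subset=["rent"])                            # drop months with no data
)
# Match the types used for the other datasets: string ZIP codes, datetime dates.
nyc_rent["zipcode"] = nyc_rent["zipcode"].astype(str)
nyc_rent["date"] = pd.to_datetime(nyc_rent["date"])
print(nyc_rent)
```

The long format has one row per ZIP code and month, which tends to be easier to store in a database table and to join against the 311 and tree data by `zipcode`.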


# 2 Storing Data

# 3 Understanding Data

# 4 Visualizing Data