## Final Proposal: Weather Trends in Major Cities Across the World

Researcher: Fan Yang <br>
Email: [fan.yang@stern.nyu.edu](mailto:fan.yang@stern.nyu.edu)

This project will examine trends in weather indicators over the past 20 years in a sample of major cities around the world. Climate change is widely discussed, but the relevant research is usually neither easily accessible nor relatable for the layperson. I want to examine how weather patterns have evolved over the last two decades, and to visualize these changes in a way that makes their relevance for these cities clear.

The data for this project comes from the [National Oceanic and Atmospheric Administration Data Tool](https://www.ncdc.noaa.gov/cdo-web/datatools/selectlocation), which accepts data requests specifying location, time range and variables, and provides the requested data as a CSV file. I have obtained my data this way so far and intend to request more as it becomes necessary.

I have selected several cities for analysis rather than presenting a map, since a map would require data of the same granularity for every region of the world, which would be very difficult if possible at all. My criteria for selecting cities are mainly size and diversity. Choosing big, well-known cities should make this project interesting and relevant for a wider audience. I also want to include cities on different continents and in different climate types, to see whether the weather trends vary by location and climate type over the same observation period.

Based on these criteria, my initial list of cities to study is as follows:

* New York City (Humid Subtropical climate; urban heat island)
* Shanghai (Subtropical Maritime Monsoon climate)
* Cape Town (Warm Mediterranean climate)
* Paris (Temperate Oceanic climate)

Other potential cities to study: 
* Lima, Peru (Subtropical Desert climate, near Humboldt Current)
* Lhasa, Tibet, China (combination of/proximity to humid continental, cool semi-arid, subtropical highland climates)
* Arctic/Antarctica

Some key components I currently envision for the project: 
* Report basic metrics such as average temperature and precipitation for each city to establish an understanding of the weather pattern of the location.
* Relative metrics, such as percent increase in temperature, precipitation, etc.; temperature volatility (the difference between max and min temperatures). This can be compared across cities.
* Present visualizations of impact; e.g. plot temperature increase against increase in temperature volatility, use bubbles to represent cities and compare. 
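
The relative metrics listed above can be sketched as follows. This is a minimal illustration, not the final method: the city labels, years and values are made up, and `change_between` is a hypothetical helper. Percent change is shown only for precipitation, because a percentage of a Celsius temperature depends on where zero sits; temperature and volatility are compared as absolute changes instead.

```python
import pandas as pd

# Hypothetical annual averages for two made-up cities (not NOAA data).
annual = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Year": [1998, 2017, 1998, 2017],
    "Avg_daily_max_temp": [16.0, 17.0, 22.0, 22.5],   # degrees C
    "Avg_daily_min_temp": [8.0, 8.5, 12.0, 13.5],     # degrees C
    "Total_precip": [1000.0, 1100.0, 500.0, 450.0],   # mm
})

# Volatility as described above: max minus min temperature.
annual["Temp_range"] = annual["Avg_daily_max_temp"] - annual["Avg_daily_min_temp"]


def change_between(df, column, start_year, end_year, percent=False):
    """Per-city change in `column` between two years (hypothetical helper)."""
    wide = df.pivot(index="Name", columns="Year", values=column)
    diff = wide[end_year] - wide[start_year]
    if percent:
        return diff / wide[start_year] * 100
    return diff


summary = pd.DataFrame({
    "max_temp_change": change_between(annual, "Avg_daily_max_temp", 1998, 2017),
    "range_change": change_between(annual, "Temp_range", 1998, 2017),
    "precip_pct_change": change_between(annual, "Total_precip", 1998, 2017, percent=True),
})
print(summary)
```

Each row of `summary` is one city, so the columns could feed directly into a scatter or bubble chart comparing temperature change against the change in volatility.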



## Data Report

**Overview**:
The data for this project comes from the National Oceanic and Atmospheric Administration database. I selected the time range to be 20 years (somewhat arbitrary and may change later on), 1998-01 to 2018-01, and frequency to be monthly. The data is currently in metric units. I've downloaded the requested data from NOAA and will import them as csv files.

The database includes many variables; a few key variables for this project are average daily max/min temperatures for the month, total monthly precipitation, any extreme max/min temperatures observed in that month, average windspeed, etc. Many other variables are available and may be incorporated into the analysis later on. Not all variables are available for all locations; I'm still working on cleaning data and figuring out which ones to include. 
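
Since not every variable is available for every location, one way to decide which ones to keep is to measure the share of non-missing monthly values per station. The sketch below assumes the raw NOAA column codes (`NAME`, `TAVG`, `PRCP`, `SNOW`); the sample rows and the `min_share` threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined NOAA export (values are illustrative).
sample = pd.DataFrame({
    "NAME": ["X", "X", "Y", "Y"],
    "TAVG": [4.4, 4.8, np.nan, 23.6],
    "PRCP": [132.1, 147.6, 10.0, 12.0],
    "SNOW": [13.0, 0.0, np.nan, np.nan],
})

# Share of months with a recorded value, per station and variable.
coverage = sample.drop(columns="NAME").notna().groupby(sample["NAME"]).mean()
print(coverage)

# Keep only variables that every station reports at least `min_share` of the time.
min_share = 0.5  # arbitrary threshold, chosen for illustration only
usable = coverage.columns[(coverage >= min_share).all()].tolist()
print(usable)
```

Here `SNOW` is dropped because station `Y` never reports it, while `TAVG` survives with half of its months present.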




In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from pandas_datareader import data
import datetime as dt

Below I import the data I've downloaded, which are in two separate csv files: 

In [2]:
data1 = pd.read_csv("~/Desktop/Classes/Data Bootcamp/Data/New York.csv")

In [3]:
data2 = pd.read_csv("~/Desktop/Classes/Data Bootcamp/Data/Data_5cities.csv")

In [4]:
data1.shape

(241, 24)

In [5]:
data2.shape

(960, 18)

I combine the datasets by appending before I proceed to organizing the data:

In [6]:
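# DataFrame.append was removed in pandas 2.0; concat with sort=True keeps the sorted column order
data = pd.concat([data1, data2], sort=True)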
list(data.columns.sort_values())

['AWND',
 'CDSD',
 'DATE',
 'DP10',
 'DSND',
 'DSNW',
 'DT32',
 'DX90',
 'ELEVATION',
 'EMNT',
 'EMSD',
 'EMSN',
 'EMXP',
 'EMXT',
 'HDSD',
 'LATITUDE',
 'LONGITUDE',
 'NAME',
 'PRCP',
 'SNOW',
 'STATION',
 'TAVG',
 'TMAX',
 'TMIN',
 'WSF5']

In [7]:
data = data.rename(columns = {
    'AWND':'Avg Wind Speed',
    'DSND':'Snowdepth>1_inch',
    'DSNW':'Snowfall>1_inch',
    'DT32':'Min_temp<32',
    'DX90':'Max_temp>90',
    'EMNT':'Extreme_min_temp',
    'EMSD':'Max_daily_snowdepth', 
    'EMSN':'Max_daily_snowfall',
    'EMXP':'Max_daily_precip',
    'EMXT':'Extreme_max_temp',
    'PRCP':'Total_monthly_precip',
    'SNOW':'Total_monthly_snowfall',
    'TAVG':'T_midpoint',
    'TMAX':'Avg_daily_max_temp', 
    'TMIN':'Avg_daily_min_temp', 
    'WSF5':'Max_windspeed',
    'NAME':'Name', 
    'DATE':'Date'
    
})

In [8]:
data.head()

Unnamed: 0,Avg Wind Speed,CDSD,Date,DP10,Snowdepth>1_inch,Snowfall>1_inch,Min_temp<32,Max_temp>90,ELEVATION,Extreme_min_temp,...,LATITUDE,LONGITUDE,Name,Total_monthly_precip,Total_monthly_snowfall,STATION,T_midpoint,Avg_daily_max_temp,Avg_daily_min_temp,Max_windspeed
0,3.7,0.0,1998-01,,,0.0,13.0,0.0,42.7,-10.0,...,40.77898,-73.96925,"NY CITY CENTRAL PARK, NY US",132.1,13.0,USW00094728,4.4,7.7,1.2,17.9
1,3.9,0.0,1998-02,,,0.0,9.0,0.0,42.7,-8.3,...,40.77898,-73.96925,"NY CITY CENTRAL PARK, NY US",147.6,0.0,USW00094728,4.8,7.9,1.7,23.2
2,3.7,18.9,1998-03,,,1.0,9.0,0.0,42.7,-7.2,...,40.77898,-73.96925,"NY CITY CENTRAL PARK, NY US",129.2,127.0,USW00094728,7.4,11.2,3.7,20.1
3,3.0,18.9,1998-04,,,0.0,0.0,0.0,42.7,2.2,...,40.77898,-73.96925,"NY CITY CENTRAL PARK, NY US",179.2,0.0,USW00094728,12.2,16.9,7.6,22.8
4,3.0,66.1,1998-05,,0.0,0.0,0.0,0.0,42.7,6.7,...,40.77898,-73.96925,"NY CITY CENTRAL PARK, NY US",176.4,0.0,USW00094728,18.0,22.4,13.5,16.5


Replace the station names in the "Name" column with city names:

In [9]:
data.Name.unique()

array(['NY CITY CENTRAL PARK, NY US', 'CAPE TOWN INTERNATIONAL, SF',
       'LANZHOU, CH', 'PARIS LE BOURGET, FR', 'LHASA, CH', 'SHANGHAI, CH'], dtype=object)

In [10]:
# Assign the result back instead of using inplace=True on an attribute lookup
data["Name"] = data["Name"].replace({
    "NY CITY CENTRAL PARK, NY US": "New York", "CAPE TOWN INTERNATIONAL, SF": "Cape_Town",
    "LANZHOU, CH": "Lanzhou", "PARIS LE BOURGET, FR": "Paris",
    "LHASA, CH": "Lhasa", "SHANGHAI, CH": "Shanghai"})

The data is very messy; I will make separate dataframes for different variable categories, like temperature related or precipitation related. 

In [11]:
# Temperature volatility: difference between average daily max and min
data["Temp_diff"] = data["Avg_daily_max_temp"] - data["Avg_daily_min_temp"]

In [12]:
temp_data = data[["Name","Date", 'T_midpoint','Avg_daily_max_temp', 'Avg_daily_min_temp', "Temp_diff",'Extreme_max_temp', 'Extreme_min_temp' ]]

In [14]:
# From here on, temp_data is a GroupBy object keyed by city
temp_data = temp_data.groupby("Name")

In [15]:
temp_data.head(2)

Unnamed: 0,Name,Date,T_midpoint,Avg_daily_max_temp,Avg_daily_min_temp,Temp_diff,Extreme_max_temp,Extreme_min_temp
0,New York,1998-01,4.4,7.7,1.2,6.5,18.3,-10.0
1,New York,1998-02,4.8,7.9,1.7,6.2,14.4,-8.3
0,Cape_Town,1998-01,,27.2,,,36.5,
1,Cape_Town,1998-02,23.6,29.9,17.2,12.7,35.2,10.9
201,Lanzhou,1998-01,-4.8,1.0,-10.5,11.5,8.6,-16.8
202,Lanzhou,1998-02,3.0,9.0,-3.1,12.1,15.1,-10.6
333,Paris,1998-01,5.8,8.4,3.1,5.3,15.9,-4.8
334,Paris,1998-02,7.2,10.9,3.5,7.4,17.6,-4.8
540,Lhasa,1998-01,0.5,8.6,-7.7,16.3,14.1,-11.5
541,Lhasa,1998-02,3.0,10.0,-4.1,14.1,15.5,-9.9


In [17]:
temp_data.T_midpoint.mean()

Name
Cape_Town    20.245455
Lanzhou      11.035606
Lhasa         9.796277
New York     13.143154
Paris        14.805000
Shanghai     17.628889
Name: T_midpoint, dtype: float64
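
The overall means above collapse the whole period into one number per city, so they cannot show a trend. A possible next step is to parse the `YYYY-MM` date strings and average by calendar year. The sketch below uses made-up rows with the same column names as the temperature frame; it illustrates the approach rather than reporting results from the real data.

```python
import pandas as pd

# Hypothetical monthly rows shaped like the temperature frame (illustrative values).
monthly = pd.DataFrame({
    "Name": ["X"] * 4,
    "Date": ["1998-01", "1998-07", "2017-01", "2017-07"],
    "T_midpoint": [0.0, 20.0, 1.0, 22.0],
})

# Parse 'YYYY-MM' strings and pull out the calendar year.
monthly["Year"] = pd.to_datetime(monthly["Date"], format="%Y-%m").dt.year

# One row per city, one column per year.
annual_mean = (
    monthly.groupby(["Name", "Year"])["T_midpoint"]
    .mean()
    .unstack("Year")
)
print(annual_mean)
```

Because `mean()` skips missing values, a year with gaps in particular seasons would give a biased average. The requested range also ends at 2018-01, so 2018 would contain a single month and is probably best excluded from annual comparisons.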

## Summary

I think I have the data necessary to proceed with the project. I still need to fine-tune the dataset, build separate dataframes for other variable categories as I did for temperature, and figure out how to deal with incomplete series (NaNs appearing for different locations in different variables).
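
One possible way to build the remaining category dataframes is to drive the column selection from a single mapping. The grouping in `CATEGORIES` and the helper `split_by_category` are hypothetical, and the sample rows are illustrative; the column names follow the renaming used earlier in this report.

```python
import pandas as pd

# Hypothetical grouping of the renamed columns into categories.
CATEGORIES = {
    "temperature": ["T_midpoint", "Avg_daily_max_temp", "Avg_daily_min_temp"],
    "precipitation": ["Total_monthly_precip", "Max_daily_precip"],
}
KEYS = ["Name", "Date"]


def split_by_category(df, categories=CATEGORIES, keys=KEYS):
    """Return one dataframe per category, dropping rows with no data at all."""
    frames = {}
    for label, columns in categories.items():
        # Tolerate variables that are absent from this particular download.
        present = [c for c in columns if c in df.columns]
        if not present:
            continue
        frames[label] = df[keys + present].dropna(subset=present, how="all")
    return frames


# Illustrative rows only, not NOAA data.
sample = pd.DataFrame({
    "Name": ["X", "X"],
    "Date": ["1998-01", "1998-02"],
    "T_midpoint": [None, 23.6],
    "Avg_daily_max_temp": [27.2, 29.9],
    "Total_monthly_precip": [None, 12.0],
})
frames = split_by_category(sample)
print({label: frame.shape for label, frame in frames.items()})
```

Dropping only the rows where every variable in a category is missing keeps partially observed months, which leaves the decision on how to treat the remaining NaNs to a later step.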