# Exploring New Weather Data
### Andrew Attilio
### July 3, 2024

I found a weather dataset that looks useful:
https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data

We want to build a forecasting feature from temperature data, on the
assumption that temperature affects the spread of disease.

In [1]:
import pandas as pd

cities_df = pd.read_csv('../weather_data/cities.csv')
cities_df.head()

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
0,41515,Asadabad,Afghanistan,Kunar,AF,AFG,34.866,71.150005
1,38954,Fayzabad,Afghanistan,Badakhshan,AF,AFG,37.129761,70.579247
2,41560,Jalalabad,Afghanistan,Nangarhar,AF,AFG,34.441527,70.436103
3,38947,Kunduz,Afghanistan,Kunduz,AF,AFG,36.727951,68.87253
4,38987,Qala i Naw,Afghanistan,Badghis,AF,AFG,34.983,63.1333


In [2]:
cities_df = cities_df.loc[cities_df['iso2'] == 'US']
cities_df.head()

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
1138,72518,Albany,United States of America,New York,US,USA,42.670017,-73.819949
1139,72406,Annapolis,United States of America,Maryland,US,USA,38.97833,-76.492499
1140,72219,Atlanta,United States of America,Georgia,US,USA,33.830014,-84.399949
1141,74389,Augusta,United States of America,Maine,US,USA,44.310563,-69.779989
1142,72254,Austin,United States of America,Texas,US,USA,30.26695,-97.742778


In [3]:
cities_df.nunique()

station_id    49
city_name     49
country        1
state         49
iso2           1
iso3           1
latitude      49
longitude     49
dtype: int64

Our existing data identifies locations by location code, so we need a mapping
from those codes to the state names used in the weather data.

In [4]:
# Read location data.
locations = pd.read_csv("../datasets/locations.csv").iloc[1:]  # skip first row

# Map location codes to state names.
location_to_state = dict(zip(locations["location"], 
                             locations["location_name"]))

location_to_state

{'01': 'Alabama',
 '02': 'Alaska',
 '04': 'Arizona',
 '05': 'Arkansas',
 '06': 'California',
 '08': 'Colorado',
 '09': 'Connecticut',
 '10': 'Delaware',
 '11': 'District of Columbia',
 '12': 'Florida',
 '13': 'Georgia',
 '15': 'Hawaii',
 '16': 'Idaho',
 '17': 'Illinois',
 '18': 'Indiana',
 '19': 'Iowa',
 '20': 'Kansas',
 '21': 'Kentucky',
 '22': 'Louisiana',
 '23': 'Maine',
 '24': 'Maryland',
 '25': 'Massachusetts',
 '26': 'Michigan',
 '27': 'Minnesota',
 '28': 'Mississippi',
 '29': 'Missouri',
 '30': 'Montana',
 '31': 'Nebraska',
 '32': 'Nevada',
 '33': 'New Hampshire',
 '34': 'New Jersey',
 '35': 'New Mexico',
 '36': 'New York',
 '37': 'North Carolina',
 '38': 'North Dakota',
 '39': 'Ohio',
 '40': 'Oklahoma',
 '41': 'Oregon',
 '42': 'Pennsylvania',
 '44': 'Rhode Island',
 '45': 'South Carolina',
 '46': 'South Dakota',
 '47': 'Tennessee',
 '48': 'Texas',
 '49': 'Utah',
 '50': 'Vermont',
 '51': 'Virginia',
 '53': 'Washington',
 '54': 'West Virginia',
 '55': 'Wisconsin',
 '56': 'Wyoming'
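
The dictionary above maps location code to state name, while the weather data is keyed by `station_id` with a `state` column. A minimal sketch of connecting the two, assuming the state names match exactly between the files, is to invert the dictionary and map it onto the `state` column. The small frames here are toy stand-ins built from rows shown in the outputs above; `state_to_location`, `cities_with_codes`, and `station_to_location` are hypothetical names, and the station IDs are written as strings, which may not match the real dtype.

```python
import pandas as pd

# Toy stand-ins for cities_df and location_to_state (rows copied from the outputs above).
# Assumption: station_id is stored as a string; adjust if the real column is numeric.
cities_df = pd.DataFrame({
    "station_id": ["72518", "72406", "72219"],
    "city_name": ["Albany", "Annapolis", "Atlanta"],
    "state": ["New York", "Maryland", "Georgia"],
})
location_to_state = {"13": "Georgia", "24": "Maryland", "36": "New York"}

# Invert the mapping: state name -> location code.
state_to_location = {name: code for code, name in location_to_state.items()}

# Attach a location code to each station through its state name.
# assign() returns a new frame, so the filtered cities_df is not modified in place.
cities_with_codes = cities_df.assign(
    location=cities_df["state"].map(state_to_location)
)

# States with no matching code come back as NaN; list them so they can be checked by hand.
unmatched = cities_with_codes.loc[cities_with_codes["location"].isna(), "state"].tolist()

# Final lookup: station_id -> location code.
station_to_location = dict(
    zip(cities_with_codes["station_id"], cities_with_codes["location"])
)
```

Checking `unmatched` is worthwhile because the weather data covers 49 states, so some location codes will have no station.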

Let's explore what the weather data has to offer. We want to predict the
weather for a given day, which we can do by combining historical data for
that day with the current year's trend.

In [5]:
daily_weather = pd.read_parquet('../weather_data/daily_weather.parquet')

In [6]:
daily_weather.head()

Unnamed: 0,station_id,city_name,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
0,41515,Asadabad,1957-07-01,Summer,27.0,21.1,35.6,0.0,,,,,,
1,41515,Asadabad,1957-07-02,Summer,22.8,18.9,32.2,0.0,,,,,,
2,41515,Asadabad,1957-07-03,Summer,24.3,16.7,35.6,1.0,,,,,,
3,41515,Asadabad,1957-07-04,Summer,26.6,16.1,37.8,4.1,,,,,,
4,41515,Asadabad,1957-07-05,Summer,30.8,20.0,41.7,0.0,,,,,,


In [7]:
us_station_ids = cities_df['station_id'].unique()

In [8]:
daily_weather = daily_weather.loc[daily_weather['station_id'].isin(us_station_ids)]

In [9]:
daily_weather.head()

Unnamed: 0,station_id,city_name,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
0,72518,Albany,1938-06-01,Summer,,8.3,26.7,0.0,0.0,,,,,
1,72518,Albany,1938-06-02,Summer,,9.4,26.1,5.1,0.0,,,,,
2,72518,Albany,1938-06-03,Summer,,13.9,23.3,4.1,0.0,,,,,
3,72518,Albany,1938-06-04,Summer,,9.4,25.0,0.0,0.0,,,,,
4,72518,Albany,1938-06-05,Summer,,13.3,22.2,0.0,0.0,,,,,


In [10]:
daily_weather = daily_weather.loc[
    daily_weather['date'] >= pd.to_datetime('2000-01-01')
]

In [11]:
daily_weather.head()

Unnamed: 0,station_id,city_name,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
22493,72518,Albany,2000-01-01,Winter,1.0,-4.4,7.2,0.0,0.0,172.0,13.7,,1022.2,
22494,72518,Albany,2000-01-02,Winter,6.0,-1.7,11.7,0.5,0.0,,15.5,,1017.6,
22495,72518,Albany,2000-01-03,Winter,8.9,0.6,13.3,2.3,0.0,,14.8,,,
22496,72518,Albany,2000-01-04,Winter,7.6,0.6,15.6,16.5,0.0,,19.4,,1006.2,
22497,72518,Albany,2000-01-05,Winter,-2.5,-8.3,3.3,0.0,0.0,305.0,25.9,,1020.1,


Each station_id can be connected to our other data through the chain
station_id -> state name -> location code. For each station, we will average
the past ~20 years of temperature data and then incorporate this year's trend.
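
One possible way to sketch this plan is a day-of-year climatology plus a recent anomaly. The notebook does not say how the trend should be incorporated, so the approach below is an assumption: average `avg_temp_c` per station and day of year over the earlier years, then shift that average by the mean difference observed over the last 14 days of the current year. The data is synthetic, and `current_year`, the 14-day window, and `expected_temp` are hypothetical choices.

```python
import pandas as pd

# Synthetic data: one station, constant temperature per year (10, 11, 12 degrees C).
dates = pd.date_range("2021-01-01", "2023-12-31", freq="D")
toy = pd.DataFrame({
    "station_id": "72518",
    "date": dates,
    "avg_temp_c": 10.0 + (dates.year - 2021) * 1.0,
})
toy["doy"] = toy["date"].dt.dayofyear

current_year = 2023  # assumption: the year whose trend we want to fold in
history = toy[toy["date"].dt.year < current_year]
this_year = toy[toy["date"].dt.year == current_year]

# Climatology: mean temperature per station and day of year over the historical years.
# Note: day 366 only exists in leap years, so it is averaged over fewer years.
climatology = history.groupby(["station_id", "doy"])["avg_temp_c"].mean()

# Anomaly: how far each day of the current year sits from its climatological mean.
merged = this_year.merge(
    climatology.rename("clim_temp_c").reset_index(),
    on=["station_id", "doy"],
    how="left",
)
merged["anomaly_c"] = merged["avg_temp_c"] - merged["clim_temp_c"]

# Trend: mean anomaly over the 14 most recent observed days, per station.
recent_anomaly = (
    merged.sort_values("date")
    .groupby("station_id")
    .tail(14)
    .groupby("station_id")["anomaly_c"]
    .mean()
)

def expected_temp(station_id, day):
    """Climatological mean for that day of year, shifted by the recent anomaly."""
    doy = pd.Timestamp(day).dayofyear
    return climatology.loc[(station_id, doy)] + recent_anomaly.loc[station_id]
```

For example, `expected_temp("72518", "2024-01-10")` combines the historical mean for January 10 with the recent anomaly for that station. A longer window smooths out noise but reacts more slowly to a change in the current year.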

To get a week's average temperature, we will average avg_temp_c over the
7 days of that week.
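
A minimal sketch of the weekly average, using synthetic data for a single station. The notebook does not define where a week starts or ends, so weeks ending on Saturday (`W-SAT`) are an assumption here; swap the frequency to match however weeks are defined in the forecasting data. The column names `weekly_avg_temp_c` and `n_days` are hypothetical.

```python
import pandas as pd

# Synthetic data: 14 consecutive days starting on Sunday 2024-01-07.
dates = pd.date_range("2024-01-07", periods=14, freq="D")
toy = pd.DataFrame({
    "station_id": "72518",
    "date": dates,
    "avg_temp_c": [float(i) for i in range(14)],
})

# Group by station and by week. With freq="W-SAT" each bin is labelled with the
# Saturday that ends the week (assumption: weeks run Sunday through Saturday).
weekly = (
    toy.groupby(["station_id", pd.Grouper(key="date", freq="W-SAT")])["avg_temp_c"]
    .agg(weekly_avg_temp_c="mean", n_days="count")
    .reset_index()
)

# mean() skips missing values, so n_days records how many days actually
# contributed; weeks with fewer than 7 observations can be filtered or flagged.
full_weeks = weekly[weekly["n_days"] == 7]
```

Keeping the day count alongside the mean matters for this dataset because `avg_temp_c` is missing for some rows, as the earlier outputs show.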