<a href="https://colab.research.google.com/github/envirodatascience/final-project-insect-team/blob/main/temp_data_insect_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

FINAL PROJECT: Insect Group

In this Colab, we summarize NOAA temperature data by state and year (2000-2024).

Data from NOAA: https://www.ncei.noaa.gov/cdo-web/datasets

Documentation: https://www.ncei.noaa.gov/data/global-summary-of-the-year/doc/GSOY_documentation.pdf

In [2]:
# import modules

import pandas as pd
import numpy as np

In [16]:
# download the data

df_temp = pd.read_csv('https://www.ncei.noaa.gov/orders/cdo/3987151.csv')

In [17]:
df_temp.head()

Unnamed: 0,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,DX70,DX70_ATTRIBUTES,DX90,DX90_ATTRIBUTES,...,EMXT,EMXT_ATTRIBUTES,HTDD,HTDD_ATTRIBUTES,TAVG,TAVG_ATTRIBUTES,TMAX,TMAX_ATTRIBUTES,TMIN,TMIN_ATTRIBUTES
0,USC00069388,"WEST THOMPSON LAKE, CT US",41.9442,-71.9031,109.7,2000,,,,,...,,,,,,,,,,
1,USC00069388,"WEST THOMPSON LAKE, CT US",41.9442,-71.9031,109.7,2001,,,,,...,,,,,,,,,,
2,USC00069388,"WEST THOMPSON LAKE, CT US",41.9442,-71.9031,109.7,2002,,,,,...,,,,,,,,,,
3,USC00069388,"WEST THOMPSON LAKE, CT US",41.9442,-71.9031,109.7,2003,,,,,...,,,,,,,,,,
4,USC00069388,"WEST THOMPSON LAKE, CT US",41.9442,-71.9031,109.7,2004,,,,,...,,,,,,,,,,


In [18]:
df_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   STATION          504 non-null    object 
 1   NAME             504 non-null    object 
 2   LATITUDE         504 non-null    float64
 3   LONGITUDE        504 non-null    float64
 4   ELEVATION        504 non-null    float64
 5   DATE             504 non-null    int64  
 6   DX70             384 non-null    float64
 7   DX70_ATTRIBUTES  384 non-null    object 
 8   DX90             384 non-null    float64
 9   DX90_ATTRIBUTES  384 non-null    object 
 10  EMNT             384 non-null    float64
 11  EMNT_ATTRIBUTES  384 non-null    object 
 12  EMXT             384 non-null    float64
 13  EMXT_ATTRIBUTES  384 non-null    object 
 14  HTDD             368 non-null    float64
 15  HTDD_ATTRIBUTES  368 non-null    object 
 16  TAVG             381 non-null    float64
 17  TAVG_ATTRIBUTES 

Data Documentation:


*   STATION: Station identification code (e.g., USC00069388)
*   DATE: Year
*   DX70: Number of days with maximum temperature greater than 70F (21.1C)
*   DX90: Number of days with maximum temperature greater than 90F (32.2C)
*   EMNT: Extreme minimum temperature
*   EMXT: Extreme maximum temperature
*   HTDD: Heating Degree Days
*   TAVG: Average Annual Temperature
*   TMAX: Average Maximum Temperature
*   TMIN: Average Minimum Temperature
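Of the variables listed above, HTDD is the least self-explanatory, so a small worked example may help. The sketch below assumes the common definition of heating degree days with a 65F base (each day contributes 65 minus its mean temperature, or zero when the mean is at or above 65F). The base value and the `daily_mean` numbers are assumptions for illustration, not values taken from this dataset; check the GSOY documentation linked above for the exact method NOAA uses.

```python
import numpy as np

# Hypothetical daily mean temperatures in Fahrenheit (illustrative only).
daily_mean = np.array([30.0, 50.0, 65.0, 70.0])

# Assumed base temperature for heating degree days.
BASE_F = 65.0

# Each day contributes (base - mean) when the mean is below the base, else 0.
daily_hdd = np.maximum(BASE_F - daily_mean, 0.0)

# Total heating degree days over the period is the sum of the daily values.
hdd_total = float(daily_hdd.sum())
print(hdd_total)
```

A larger HTDD therefore indicates a colder year, which is why it can complement TAVG when comparing years.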



In [19]:
# check for duplicate rows

df_temp.duplicated().sum()

np.int64(0)
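The check above only counts rows that are identical in every column. As a complementary sketch, the snippet below counts rows that repeat the same (STATION, DATE) pair even when the measurements differ. The `sample` frame and its values are hypothetical stand-ins for `df_temp`.

```python
import pandas as pd

# Hypothetical miniature of df_temp; station codes and values are illustrative.
sample = pd.DataFrame({
    "STATION": ["STN_A", "STN_A", "STN_B"],
    "DATE": [2000, 2000, 2000],
    "TAVG": [47.1, 47.3, 46.3],
})

# Rows identical across all columns (what df_temp.duplicated() checks).
n_full_dupes = int(sample.duplicated().sum())

# Rows that repeat a (STATION, DATE) key, regardless of the other columns.
n_key_dupes = int(sample.duplicated(subset=["STATION", "DATE"]).sum())

print(n_full_dupes, n_key_dupes)
```

If the key-based count is zero on the real data, each station contributes at most one row per year.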

In [20]:
# list the unique years present in the data

df_temp['DATE'].unique()

array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2011, 2012,
       2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023,
       2024, 2009, 2010])
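The array above shows which years appear somewhere in the data, but not whether every station covers every year. A minimal sketch of a per-station coverage check follows. The `sample` frame is hypothetical, and the expected range is shortened to 2000-2002 to keep the example small; for this notebook it would be 2000-2024.

```python
import pandas as pd

# Hypothetical miniature of df_temp with one gap for station STN_B.
sample = pd.DataFrame({
    "STATION": ["STN_A", "STN_A", "STN_A", "STN_B", "STN_B"],
    "DATE": [2000, 2001, 2002, 2000, 2002],
})

# Years we expect each station to report (assumed range for this example).
expected_years = set(range(2000, 2003))

# Number of distinct years reported by each station.
years_per_station = sample.groupby("STATION")["DATE"].nunique()

# Years missing for each station, as a plain dict of sorted lists.
missing_years = {
    station: sorted(expected_years - set(group["DATE"]))
    for station, group in sample.groupby("STATION")
}

print(years_per_station)
print(missing_years)
```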

In [22]:
# summarize the temperature variables by station and year
# note: 'sum' returns 0 (not NaN) when every DX90 value in a group is missing

df_temp.groupby(['STATION', 'DATE']).agg({
    'DX90': 'sum',
    'EMNT': 'min',
    'EMXT': 'max',
    'TAVG': 'mean',
    'TMAX': 'mean',
    'TMIN': 'mean',
}).reset_index()

Unnamed: 0,STATION,DATE,DX90,EMNT,EMXT,TAVG,TMAX,TMIN
0,USC00060227,2000,4.0,-6.0,93.0,47.1,57.6,36.5
1,USC00060227,2001,9.0,-1.0,95.0,48.7,59.8,37.6
2,USC00060227,2002,22.0,,94.0,,59.8,
3,USC00060227,2003,7.0,-13.0,95.0,46.3,56.5,36.1
4,USC00060227,2004,0.0,-10.0,88.0,46.3,56.5,36.2
...,...,...,...,...,...,...,...,...
499,USW00094702,2020,11.0,11.0,95.0,55.3,62.9,47.7
500,USW00094702,2021,13.0,12.0,96.0,54.8,62.4,47.1
501,USW00094702,2022,11.0,6.0,94.0,53.7,61.7,45.7
502,USW00094702,2023,4.0,-4.0,93.0,54.8,62.5,47.1
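The table above is grouped by station, while the stated goal is a summary by state and year. One possible way to get there is sketched below. It assumes every NAME ends in a two-letter state code followed by "US", as in "WEST THOMPSON LAKE, CT US"; the `STATE` column, the `sample` frame, and the choice of aggregations are illustrative assumptions rather than the project's final method.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df_temp; only the first NAME appears in the real data.
sample = pd.DataFrame({
    "NAME": [
        "WEST THOMPSON LAKE, CT US",
        "HYPOTHETICAL STATION, CT US",
        "HYPOTHETICAL STATION, MA US",
    ],
    "DATE": [2000, 2000, 2000],
    "TAVG": [47.0, 49.0, 46.0],
    "EMXT": [93.0, 95.0, 90.0],
    "EMNT": [-6.0, np.nan, -10.0],
})

# Pull the two-letter state code that precedes the trailing "US".
sample["STATE"] = sample["NAME"].str.extract(r",\s*([A-Z]{2})\s+US$", expand=False)

# Average TAVG across stations; take the extremes across stations for EMXT and EMNT.
# Missing values are skipped by mean, max, and min.
state_year = (
    sample.groupby(["STATE", "DATE"])
    .agg(TAVG=("TAVG", "mean"), EMXT=("EMXT", "max"), EMNT=("EMNT", "min"))
    .reset_index()
)

print(state_year)
```

Names that do not match the pattern would get a missing STATE and drop out of the grouping, so it is worth checking `sample["STATE"].isna().sum()` before relying on the result.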
