# Bonus: Temperature Analysis I

In [1]:
import pandas as pd
import datetime as dt

In [2]:
# "tobs" is "temperature observations"
df = pd.read_csv('Resources/hawaii_measurements.csv')
df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
# Convert the date column format from string to datetime
df["date"] = pd.to_datetime(df["date"])

In [4]:
df.dtypes

station            object
date       datetime64[ns]
prcp              float64
tobs                int64
dtype: object

In [5]:
# Set the date column as the DataFrame index
# (set_index drops it from the regular columns by default)

df.set_index("date", inplace = True)

In [6]:
display(df)

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.00,63
2010-01-03,USC00519397,0.00,74
2010-01-04,USC00519397,0.00,76
2010-01-06,USC00519397,,73
...,...,...,...
2017-08-19,USC00516128,0.09,71
2017-08-20,USC00516128,,78
2017-08-21,USC00516128,0.56,76
2017-08-22,USC00516128,0.50,76


### Compare June and December data across all years 

In [7]:
from scipy import stats

In [8]:
# Filter data for desired months
df_june = df[df.index.month == 6]
df_dec = df[df.index.month == 12]

In [9]:
# Identify the average temperature for June
df_june["tobs"].mean().round(2)

74.94

In [10]:
# Identify the average temperature for December
df_dec["tobs"].mean().round(2)

71.04

<strong>This is a little tricky because temperatures recorded at the same station are not independent. Hence, the sampling units should be the weather stations. To accommodate a paired t-test, we average the temperature over the entire month at each station and then compare each station's mean temperature in June vs. December.</strong>

In [11]:
# Create collections of temperature data

stations = list(df["station"].unique())

June_mean_temps = []
Dec_mean_temps = []

# For each station, average the temperature data and append to the respective list for that month

for station in stations:
    station_june = df_june.loc[df_june["station"] == station]
    station_dec = df_dec.loc[df_dec["station"] == station]

    June_mean_temps.append(station_june["tobs"].mean())
    Dec_mean_temps.append(station_dec["tobs"].mean())

In [12]:
print(June_mean_temps)
print(Dec_mean_temps)

[77.55932203389831, 74.05084745762711, 76.00537634408602, 76.6554054054054, 73.39473684210526, 76.66810344827586, 73.27118644067797, 74.13939393939394, 71.9372197309417]
[71.10952380952381, 71.06944444444444, 73.2247191011236, 71.8348623853211, 72.42105263157895, 72.43333333333334, 69.90322580645162, 69.6842105263158, 69.29126213592232]
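The per-station averaging done in the loop can also be expressed as a single pivot table. The following is only a minimal sketch on made-up toy data: the `toy` frame, its station labels `A` and `B`, and the name `station_means` are hypothetical and are not part of the Hawaii dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the measurements table
toy = pd.DataFrame(
    {
        "station": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "date": pd.to_datetime(
            [
                "2010-06-01", "2011-06-01", "2010-12-01", "2011-12-01",
                "2010-06-01", "2011-06-01", "2010-12-01", "2011-12-01",
            ]
        ),
        "tobs": [78, 76, 70, 72, 75, 73, 69, 71],
    }
).set_index("date")

# Keep only June and December rows and record the month as a column
monthly = toy[toy.index.month.isin([6, 12])].copy()
monthly["month"] = monthly.index.month

# One row per station, one column per month (6 = June, 12 = December)
station_means = monthly.pivot_table(
    index="station", columns="month", values="tobs", aggfunc="mean"
)
print(station_means)
```

Because each row of `station_means` holds both months for one station, the June and December values stay aligned by station, which is what a paired test requires.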


In [13]:
# Run paired t-test

stats.ttest_rel(June_mean_temps, Dec_mean_temps)

Ttest_relResult(statistic=6.95696617044294, pvalue=0.00011759380231523222)

### Analysis

Because the p-value (about 0.0001) is less than 0.05, we reject the null hypothesis that the mean difference between June and December temperatures is zero. At all nine stations, the June mean is higher than the December mean.
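To make the paired test less of a black box, the sketch below recomputes the statistic by hand from the per-station means printed earlier. A paired t-test is equivalent to a one-sample t-test on the per-station differences, with n - 1 degrees of freedom. The variable names (`june`, `dec`, `diffs`, `t_manual`, `p_manual`) are chosen here for illustration only.

```python
import numpy as np
from scipy import stats

# Per-station mean temperatures, copied from the printed lists
june = np.array([77.55932203389831, 74.05084745762711, 76.00537634408602,
                 76.6554054054054, 73.39473684210526, 76.66810344827586,
                 73.27118644067797, 74.13939393939394, 71.9372197309417])
dec = np.array([71.10952380952381, 71.06944444444444, 73.2247191011236,
                71.8348623853211, 72.42105263157895, 72.43333333333334,
                69.90322580645162, 69.6842105263158, 69.29126213592232])

# Paired differences: one value per station
diffs = june - dec
n = len(diffs)

# t = mean difference / standard error of the mean difference
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Compare with SciPy's paired t-test
t_scipy, p_scipy = stats.ttest_rel(june, dec)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

With only nine stations the test has 8 degrees of freedom, so the result rests on a small number of sampling units.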

Technically, we should also evaluate whether temperatures are spatially autocorrelated (i.e., whether stations that are near each other have more similar mean temperatures than stations that are farther apart). If there is significant autocorrelation, a paired t-test might not be appropriate without adjusting the degrees of freedom to account for potential pseudoreplication.