# Bonus: Temperature Analysis I

In [1]:
import pandas as pd
from datetime import datetime as dt

In [2]:
# "tobs" is "temperature observations"
df = pd.read_csv('Resources/hawaii_measurements.csv')
df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
# Convert the date column format from string to datetime
df["date"] = pd.to_datetime(df["date"])

In [4]:
df.dtypes

station            object
date       datetime64[ns]
prcp              float64
tobs                int64
dtype: object

In [5]:
# Set the date column as the DataFrame index
df = df.set_index("date")
df.head()

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.0,63
2010-01-03,USC00519397,0.0,74
2010-01-04,USC00519397,0.0,76
2010-01-06,USC00519397,,73


In [6]:
# Drop the date column
# The assignment asks for this, but the later steps still need the dates.
# So first keep a copy (df_copy) that retains date as the index, then
# reset the index on df, which discards the dates there.
df_copy = df.copy()
df = df.reset_index(drop=True)
df.head()

,station,prcp,tobs
0,USC00519397,0.08,65
1,USC00519397,0.0,63
2,USC00519397,0.0,74
3,USC00519397,0.0,76
4,USC00519397,,73


### Compare June and December data across all years 

In [7]:
from scipy import stats

In [8]:
# Filter data for desired months
june_filter = (df_copy.index.month == 6)
december_filter = (df_copy.index.month == 12)
june_df = df_copy.loc[june_filter].copy()
december_df = df_copy.loc[december_filter].copy()
june_df.head()

date,station,prcp,tobs
2010-06-01,USC00519397,0.0,78
2010-06-02,USC00519397,0.01,76
2010-06-03,USC00519397,0.0,78
2010-06-04,USC00519397,0.0,76
2010-06-05,USC00519397,0.0,77


In [9]:
# Identify the average temperature for June (tobs)
# numeric_only=True skips the non-numeric "station" column
june_df.mean(numeric_only=True)

prcp     0.136360
tobs    74.944118
dtype: float64

In [10]:
# Identify the average temperature for December (tobs)
# numeric_only=True skips the non-numeric "station" column
december_df.mean(numeric_only=True)

prcp     0.216819
tobs    71.041529
dtype: float64

In [11]:
# Create collections of temperature data, removing any null values
june_list = june_df["tobs"].dropna().tolist()
december_list = december_df["tobs"].dropna().tolist()
len(june_list), len(december_list)

(1700, 1517)

In [12]:
# Run paired t-test
# ttest_rel requires both samples to have the same length,
# so trim the longer (June) list to the length of the December list
n = min(len(june_list), len(december_list))
stats.ttest_rel(june_list[:n], december_list[:n])

Ttest_relResult(statistic=34.804545051754815, pvalue=1.4623155269997529e-195)

### Analysis

Across all years in the data set, the average June temperature for the 9 Hawaii stations is about 74.9°F. The average December temperature for the same 9 stations is about 71.0°F, a difference of roughly 3.9°F.

I ran the recommended paired t-test because the comparison is between temperatures from the same group of stations at two different times of year. Had we compared two distinct groups of stations, an unpaired t-test would have been the appropriate choice.
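
For comparison, here is a minimal sketch of what the unpaired test could look like. The samples are synthetic stand-ins, not the Hawaii data, and the choice of Welch's variant (`equal_var=False`) is my assumption rather than something the assignment specifies.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two independent temperature samples (assumed values).
rng = np.random.default_rng(0)
june_sample = rng.normal(loc=75.0, scale=3.0, size=200)
december_sample = rng.normal(loc=71.0, scale=3.5, size=180)

# Unpaired (independent) t-test. equal_var=False selects Welch's variant,
# which does not assume equal variances. Sample sizes may differ.
welch = stats.ttest_ind(june_sample, december_sample, equal_var=False)
print(welch.statistic, welch.pvalue)
```

Unlike `ttest_rel`, `ttest_ind` accepts samples of different lengths, so no trimming is needed.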

The paired t-test requires arrays of the same size, so the June list was trimmed to the December count (1517) with a slice. June has 30 days and December has 31, and stations do not all report every day, so the raw counts cannot match. Even keeping a June reading only when the same station also has a December reading for the same day and year would leave some readings unpaired. Note that slicing pairs the values by position only, not by station or date.
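
One possible way to get matched pairs, sketched below, is to average each month per station and year and keep only the station-years that have both months. This is not what the notebook above does. The helper name `paired_monthly_means` and the tiny `demo` frame are hypothetical and made up for illustration.

```python
import pandas as pd
from scipy import stats

def paired_monthly_means(frame, month_a=6, month_b=12):
    """Mean tobs per (station, year) for two months, keeping complete pairs only.

    Hypothetical helper: expects a DatetimeIndex plus "station" and "tobs" columns.
    """
    tagged = frame.assign(year=frame.index.year, month=frame.index.month)
    means = (
        tagged[tagged["month"].isin([month_a, month_b])]
        .groupby(["station", "year", "month"])["tobs"]
        .mean()
        .unstack("month")  # one column per month
    )
    return means.dropna()  # drop station-years missing either month

# Made-up readings: station B has no December 2011 value.
dates = pd.to_datetime([
    "2010-06-01", "2010-06-02", "2010-12-01", "2010-12-02",
    "2011-06-01", "2011-12-01",
    "2010-06-01", "2010-12-01", "2011-06-01",
])
demo = pd.DataFrame(
    {
        "station": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
        "tobs": [78, 76, 70, 72, 78, 71, 75, 70, 74],
    },
    index=dates,
)

pairs = paired_monthly_means(demo)
result = stats.ttest_rel(pairs[6], pairs[12])
print(pairs)
print(result.statistic, result.pvalue)
```

With the real data this would be called as `paired_monthly_means(df_copy)`. Each row is then a genuine June/December pair for one station in one year, at the cost of a much smaller sample.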

Is there a meaningful difference between the temperature in June and December? The paired t-test returned a t-statistic of about 34.8 and a p-value of about 1.46e-195, far below 0.05. So we reject the null hypothesis of equal means: the difference between June and December temperatures is statistically significant.
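
The decision rule can be written out as a short sketch. The numbers are copied from the outputs of cells 9, 10, and 12 above. The 0.05 threshold is the one cited in the analysis, and `is_significant` is a hypothetical helper name.

```python
ALPHA = 0.05  # significance threshold cited in the analysis

def is_significant(p_value, alpha=ALPHA):
    """Return True when the null hypothesis of equal means is rejected."""
    return p_value < alpha

# Values copied from the notebook outputs above.
p_value = 1.4623155269997529e-195   # paired t-test, cell 12
june_mean = 74.944118               # cell 9
december_mean = 71.041529           # cell 10

print(is_significant(p_value))  # True
print(round(june_mean - december_mean, 2))
```

A small p-value says the difference is unlikely to be due to chance. The size of the difference itself comes from the two means.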