Question: 
Is there a meaningful difference between the temperature in Hawaii, for example, in June and December?

H0: There is no statistically significant difference between the mean June and December temperatures in Hawaii.

H1: There is a statistically significant difference between the mean June and December temperatures in Hawaii.

In [1]:
import pandas as pd
from datetime import datetime as dt

In [2]:
# "tobs" is "temperature observations"
df = pd.read_csv('Resources/hawaii_measurements.csv')
df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
# Print the data type of each column
print(df.dtypes)

station     object
date        object
prcp       float64
tobs         int64
dtype: object


In [4]:
# Convert the date column format from string to datetime
df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')

In [5]:
# Print dtypes again to confirm the date column is now datetime64
print(df.dtypes)

station            object
date       datetime64[ns]
prcp              float64
tobs                int64
dtype: object


In [6]:
# Set the date column as the DataFrame index
df.set_index('date', inplace = True)

df.head()

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.0,63
2010-01-03,USC00519397,0.0,74
2010-01-04,USC00519397,0.0,76
2010-01-06,USC00519397,,73


In [7]:
# Clean the data by dropping every column that contains a null value
# (this removes the whole prcp column; tobs has no nulls and is kept)
df.dropna(axis = 'columns', inplace = True)

df.head()

date,station,tobs
2010-01-01,USC00519397,65
2010-01-02,USC00519397,63
2010-01-03,USC00519397,74
2010-01-04,USC00519397,76
2010-01-06,USC00519397,73
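
As the output above shows, `dropna(axis='columns')` removes whole columns rather than individual rows. The following is a minimal sketch of that behavior on a made-up three-row frame (the values and the names `toy`, `by_column`, and `by_row` are assumptions for illustration), contrasted with a row-wise drop restricted to `tobs`:

```python
import pandas as pd

# Toy frame mimicking the notebook's columns; values are made up for illustration.
toy = pd.DataFrame(
    {
        "station": ["A", "A", "B"],
        "prcp": [0.08, None, 0.0],
        "tobs": [65, 73, 70],
    }
)

# axis='columns' drops every column with at least one null (here: prcp).
by_column = toy.dropna(axis="columns")

# subset=['tobs'] drops only rows where tobs itself is null (here: none).
by_row = toy.dropna(subset=["tobs"])

print(list(by_column.columns))
print(list(by_row.columns))
```

Either approach leaves `tobs` intact in this case; the row-wise form would matter only if precipitation were needed later in the analysis.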


### Compare June and December data across all years

In [8]:
from scipy import stats

In [9]:
# Filter data for June
june_df = df.loc[(df.index.month == 6)]

june_df.head()

date,station,tobs
2010-06-01,USC00519397,78
2010-06-02,USC00519397,76
2010-06-03,USC00519397,78
2010-06-04,USC00519397,76
2010-06-05,USC00519397,77


In [10]:
# Filter data for December
dec_df = df.loc[(df.index.month == 12)]

dec_df.head()

date,station,tobs
2010-12-01,USC00519397,76
2010-12-03,USC00519397,74
2010-12-04,USC00519397,74
2010-12-06,USC00519397,64
2010-12-07,USC00519397,64
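
The two month filters above could also be written as a single selection followed by a group-by on the index month. This is only a sketch on a made-up five-row frame (the dates, values, and the names `toy`, `subset`, and `monthly_mean` are assumptions, not part of the dataset):

```python
import pandas as pd

# Toy frame with a DatetimeIndex; dates and values are made up for illustration.
toy = pd.DataFrame(
    {"tobs": [78, 76, 74, 64, 70]},
    index=pd.to_datetime(
        ["2010-06-01", "2011-06-02", "2010-12-01", "2011-12-06", "2010-01-03"]
    ),
)

# Keep only June (6) and December (12) across all years.
subset = toy[toy.index.month.isin([6, 12])]

# Average temperature per month, pooled over years.
monthly_mean = subset.groupby(subset.index.month)["tobs"].mean()
print(monthly_mean)
```

Separate `june_df` and `dec_df` frames, as in the notebook, are still convenient when each sample is passed to a test function on its own.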


In [11]:
# Checking the spread and distribution of June Data
# june_df.hist(column = 'tobs')

In [12]:
# Checking the spread and distribution of December Data
# dec_df.hist(column = 'tobs')

In [13]:
# Identify the average temperature for June
# numeric_only=True skips the non-numeric station column (required in pandas 2.x)
june_avg = june_df.mean(numeric_only=True)

june_avg

tobs    74.944118
dtype: float64

In [14]:
# Identify the average temperature for December
# numeric_only=True skips the non-numeric station column (required in pandas 2.x)
dec_avg = dec_df.mean(numeric_only=True)
dec_avg

tobs    71.041529
dtype: float64

In [15]:
# Create collections of temperature data
# df['column_name'].values.tolist()

june_list = june_df['tobs'].values.tolist()

dec_list = dec_df['tobs'].values.tolist()

In [16]:
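# Run an independent (unpaired) two-sample t-test.
# ttest_ind is not a paired test; the June and December samples differ in size.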
stats.ttest_ind(june_list, dec_list)

Ttest_indResult(statistic=31.60372399000329, pvalue=3.9025129038616655e-191)
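
By default `ttest_ind` assumes both samples have equal variance. When that assumption is doubtful, Welch's variant is a common alternative; in SciPy it is selected with `stats.ttest_ind(june_list, dec_list, equal_var=False)`. The sketch below computes the Welch statistic by hand with the standard library, using only the five June and five December readings visible in the `head()` outputs above as a small demonstration. The helper name `welch_t` is hypothetical, and the result is not the full-dataset statistic.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Variance of each sample mean: sample variance divided by sample size.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# First five rows shown by june_df.head() and dec_df.head(); demonstration only.
june_sample = [78, 76, 78, 76, 77]
dec_sample = [76, 74, 74, 64, 64]

t_stat, dof = welch_t(june_sample, dec_sample)
print(round(t_stat, 3), round(dof, 3))
```

A positive statistic means the first sample (June) has the higher mean, matching the sign of the notebook's result.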

### Analysis

The null hypothesis is that there is no difference between the mean temperatures in Hawaii in June and December. The t-test returns a p-value of about 3.9e-191, far below any conventional significance level such as 0.05, so we reject the null hypothesis and conclude that the difference between June and December temperatures is statistically significant. The difference is small in absolute terms, however: the mean is about 74.9 in June versus 71.0 in December, a gap of roughly 3.9 degrees, which suggests that the climate in Hawaii is mild year-round.
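
A p-value indicates whether a difference is detectable, not how large it is, so an effect size can complement the test. Below is a minimal sketch of Cohen's d using the pooled standard deviation, written with the standard library only. The helper name `cohens_d` is hypothetical, and the inputs are just the five June and five December readings visible in the `head()` outputs, so the printed value is illustrative rather than a result for the full dataset.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# First five rows shown by june_df.head() and dec_df.head(); demonstration only.
june_sample = [78, 76, 78, 76, 77]
dec_sample = [76, 74, 74, 64, 64]

d = cohens_d(june_sample, dec_sample)
print(round(d, 2))
```

In the notebook itself, the same helper could be called as `cohens_d(june_list, dec_list)` to measure the effect over all observations.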