# Programming Assignment
## Summary Statistics


### Summary Statistics and Boulder, Colorado Weather Data in 2017
***

After completing this assignment you will be able to:
- Use Pandas to compute summary statistics on Boulder weather data 
- Figure out how summary statistics like mean and standard deviation change under transformations of the data

Go ahead and load NumPy and Pandas using their common aliases, `np` and `pd`.

In [1]:
import numpy as np 
import pandas as pd

The data we'll explore in this notebook concerns temperatures and other weather observations in Boulder County over the month of July 2017.  

The data was obtained from the National Oceanic and Atmospheric Administration's [Climate.gov](https://www.climate.gov/) website.  You can find and download loads of climate-related data from NOAA [here](https://www.climate.gov/maps-data/datasets).   

The data is stored in a .csv file called `clean_boulder_weather.csv`.

In [2]:
# Path to the data file (assumed to sit next to this notebook)
local_path = 'clean_boulder_weather.csv'

# Load the data into a DataFrame 
df = pd.read_csv(local_path)

Take a look at the first 50 or so rows of the DataFrame using the `head( )` method. 

In [3]:
df.head(50)

Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN
0,USW00094075,"BOULDER 14 W, CO US",2017-07-01,0.0,68.0,31.0
1,USW00094075,"BOULDER 14 W, CO US",2017-07-02,0.0,73.0,35.0
2,USW00094075,"BOULDER 14 W, CO US",2017-07-03,0.0,68.0,46.0
3,USW00094075,"BOULDER 14 W, CO US",2017-07-04,0.05,68.0,43.0
4,USW00094075,"BOULDER 14 W, CO US",2017-07-05,0.01,73.0,40.0
5,USW00094075,"BOULDER 14 W, CO US",2017-07-06,0.0,76.0,48.0
6,USW00094075,"BOULDER 14 W, CO US",2017-07-07,0.02,74.0,43.0
7,USW00094075,"BOULDER 14 W, CO US",2017-07-08,0.0,65.0,44.0
8,USW00094075,"BOULDER 14 W, CO US",2017-07-09,0.01,73.0,39.0
9,USW00094075,"BOULDER 14 W, CO US",2017-07-10,0.01,75.0,44.0


From this you should see that each row in the DataFrame refers to a particular weather station / date combination.  The columns of the DataFrame are as follows: 

- **STATION**: The unique identification code for each weather station 
- **NAME**: The location / name of the weather station 
- **DATE**: The date of the observation 
- **PRCP**: The precipitation (in inches)
- **TMAX**: The daily maximum temperature (in Fahrenheit)
- **TMIN**: The daily minimum temperature (in Fahrenheit)

Looking at the DataFrame above you can see that we have data from multiple weather stations.  

To observe how many, we can pass the **NAME** column (or the **STATION** column) into Python's set function. 

In [4]:
set(df["NAME"])

{'BOULDER 14 W, CO US',
 'BOULDER, CO US',
 'GROSS RESERVOIR, CO US',
 'NIWOT, CO US',
 'NORTHGLENN, CO US',
 'RALSTON RESERVOIR, CO US',
 'SUGARLOAF COLORADO, CO US'}

It appears that we have data from seven different weather stations.  For consistency, let's reduce the data to just the reports from the weather station in `Northglenn`.  
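As an alternative to `set`, Pandas can count distinct values directly. The sketch below uses a small made-up DataFrame (the station names in `toy` are stand-ins, not the contents of the real file) to show `nunique()` and `value_counts()`.

```python
import pandas as pd

# Toy stand-in for the NAME column (hypothetical rows, not the real file)
toy = pd.DataFrame({
    "NAME": ["NIWOT, CO US", "BOULDER, CO US", "NIWOT, CO US", "NORTHGLENN, CO US"],
})

# Number of distinct station names
n_stations = toy["NAME"].nunique()

# Number of rows (observations) recorded for each station
rows_per_station = toy["NAME"].value_counts()

print(n_stations)
print(rows_per_station)
```

On the real data, `df["NAME"].nunique()` should agree with the size of the set shown above.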

### Exercise 1
***
Extract the rows of the DataFrame concerned with the Northglenn weather station.  Store this data in a new DataFrame called `dfNorthglenn`. 

Your dataframe should start with the following values: 

![Screenshot%202024-06-24%20at%2011.47.13%E2%80%AFAM.png](attachment:Screenshot%202024-06-24%20at%2011.47.13%E2%80%AFAM.png)

In [5]:
# Filter the data for the "Northglenn" weather station
dfNorthglenn = df[df['NAME'] == 'NORTHGLENN, CO US']

dfNorthglenn.head()   # run this line after performing the task 

Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN
184,USC00055984,"NORTHGLENN, CO US",2017-07-01,0.0,74.0,51.0
185,USC00055984,"NORTHGLENN, CO US",2017-07-02,0.0,91.0,55.0
186,USC00055984,"NORTHGLENN, CO US",2017-07-03,0.0,91.0,57.0
187,USC00055984,"NORTHGLENN, CO US",2017-07-04,0.0,91.0,56.0
188,USC00055984,"NORTHGLENN, CO US",2017-07-05,0.0,96.0,56.0


### Exercise 2  
***
Pandas (and Numpy) have canned functions that compute each of the summary statistics discussed in lecture.  We use the .mean( ) function as an example.  All of these functions can be called either on a Pandas Series (i.e. a column of a DataFrame) or on an entire DataFrame at one time.  

For instance, the sample mean of the maximum daily temperature is given by: 

In [6]:
dfNorthglenn["TMAX"].mean()

92.33333333333333

Let us observe what happens if we call .mean( ) on the entire DataFrame. 

In [7]:
# Keep only the numeric columns of the DataFrame (note: this overwrites dfNorthglenn)
dfNorthglenn = dfNorthglenn.select_dtypes(include='number')

# Calculate the mean values now 
mean_vals = dfNorthglenn.mean()


print(mean_vals)

PRCP     0.021667
TMAX    92.333333
TMIN    59.666667
dtype: float64


In this case, Pandas returned a Series with the means of all of the **numerical** data in the DataFrame. 
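If you would rather not overwrite the DataFrame, one option is the `numeric_only=True` argument of `DataFrame.mean`. The following is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration only).

```python
import pandas as pd

# Toy frame mixing a text column with numeric columns (hypothetical values)
toy = pd.DataFrame({
    "NAME": ["A", "A", "A"],
    "TMAX": [90.0, 95.0, 100.0],
    "TMIN": [55.0, 60.0, 65.0],
})

# numeric_only=True skips non-numeric columns without modifying the DataFrame
toy_means = toy.mean(numeric_only=True)

print(toy_means)
```

The original `toy` frame still contains its **NAME** column afterwards, so no re-filtering is needed later.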

The functions for the other summary statistics are as follows: 

\begin{array}{l|l}
\textrm{Function} & \textrm{Statistics} \\
\hline
\textrm{.var()} & \textrm{variance} \\
\textrm{.std()} & \textrm{standard deviation} \\
\textrm{.min()} & \textrm{minimum value} \\
\textrm{.max()} & \textrm{maximum value} \\
\textrm{.median()} & \textrm{median value} \\
\textrm{.quantile(q)} & \textrm{quantile, where q is the desired percentage as a decimal} \\
\end{array}

Your job is to use these functions to compute the 5-number summary for the maximum daily temperature in `dfNorthglenn`.

In [8]:
dfNorthglenn.columns

Index(['PRCP', 'TMAX', 'TMIN'], dtype='object')

In [9]:
# Store the results in the following variables
# STEP 1: 'minval', 'maxval', 'Q1', 'Q2', 'Q3'
# STEP 2: then print them using the following code: print("5-Number Summary: {:.2f}    {:.2f}    {:.2f}    {:.2f}    {:.2f}".format(minval, Q1, Q2, Q3, maxval))

# Calculate the 5-number summary for TMAX in the dfNorthglenn DataFrame
minval = dfNorthglenn['TMAX'].min()  # Minimum value
maxval = dfNorthglenn['TMAX'].max()  # Maximum value
Q1 = dfNorthglenn['TMAX'].quantile(0.25)  # First quartile (25th percentile)
Q2 = dfNorthglenn['TMAX'].median()  # Median (50th percentile)
Q3 = dfNorthglenn['TMAX'].quantile(0.75)  # Third quartile (75th percentile)

print("5-Number Summary: {:.2f}    {:.2f}    {:.2f}    {:.2f}    {:.2f}".format(minval, Q1, Q2, Q3, maxval))

5-Number Summary: 74.00    89.25    93.00    98.00    101.00
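The five values can also be requested in a single call, since `.quantile( )` accepts a list of probabilities. Here is a small sketch on a made-up Series (`toy_tmax` is hypothetical data, not the Northglenn column).

```python
import pandas as pd

# Hypothetical temperatures, used only to illustrate the call
toy_tmax = pd.Series([74.0, 89.0, 91.0, 93.0, 96.0, 98.0, 101.0])

# q = 0 and q = 1 return the minimum and maximum; 0.5 returns the median
five_num = toy_tmax.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

print(five_num)
```

The result is a Series indexed by the requested probabilities, so the quartiles come back in the same order as the list.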


### Exercise 3 
***
It turns out that Pandas has a nice function called .describe( ) that will compute all of the standard summary statistics for you.  You can apply it either to a Pandas Series or to an entire DataFrame.  

Run the .describe( ) function on the **TMAX** column of your DataFrame `dfNorthglenn`, and check that the results agree with your computations from Exercise 2. 

In [10]:
tmax = dfNorthglenn['TMAX'].describe()

# your code here


print(tmax)

count     30.000000
mean      92.333333
std        7.345340
min       74.000000
25%       89.250000
50%       93.000000
75%       98.000000
max      101.000000
Name: TMAX, dtype: float64


In [11]:
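### BEGIN HIDDEN TESTS

# Note: assert (condition, "message") tests a non-empty tuple, which is always true
assert tmax.size == 8, "describe() should return 8 summary statistics"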


In [12]:
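# Filter the data for "Northglenn" weather station, keeping all columns
# .copy() gives an independent DataFrame, so new columns can be added later without a SettingWithCopyWarning
dfNorthglenn = df[df['NAME'] == 'NORTHGLENN, CO US'].copy()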

# Display the columns of the filtered DataFrame
print(dfNorthglenn.columns)

Index(['STATION', 'NAME', 'DATE', 'PRCP', 'TMAX', 'TMIN'], dtype='object')


### Exercise 4 
***
In this exercise we'll explore how the mean and the standard deviation change when we perform basic transformations on the data.  In particular, we're interested in what happens if we 

1. Add or subtract some value from every entry in the data set 
1. Multiply every entry in the data set by some value 

We know from above that the mean and standard deviation of the `Northglenn` **TMAX** values are 92.333 and 7.345340.  Experiment by **adding** the integer `3` to every entry of the **TMAX** column, and separately by **multiplying** every entry by `3`, recomputing the statistics each time. You will have 4 computations: `mean()` and `std()` after adding, then `mean()` and `std()` after multiplying. Use the `print` function to display your 4 computations.

In [13]:
# Original values for mean and std from the previous task
original_tmax_mean = 92.333
original_tmax_std = 7.345340

# 1. Adding 3 to every entry in the TMAX column
new_tmax_add_3 = dfNorthglenn['TMAX'] + 3
new_tmax_mean_add_3 = new_tmax_add_3.mean()
new_tmax_std_add_3 = new_tmax_add_3.std()

# 2. Multiplying every entry in the TMAX column by 3
new_tmax_multiply_3 = dfNorthglenn['TMAX'] * 3
new_tmax_mean_multiply_3 = new_tmax_multiply_3.mean()
new_tmax_std_multiply_3 = new_tmax_multiply_3.std()

# Print the results for all four computations
print(f"Mean after adding 3: {new_tmax_mean_add_3:.2f}")
print(f"Standard deviation after adding 3: {new_tmax_std_add_3:.2f}")

print(f"Mean after multiplying by 3: {new_tmax_mean_multiply_3:.2f}")
print(f"Standard deviation after multiplying by 3: {new_tmax_std_multiply_3:.2f}")



Mean after adding 3: 95.33
Standard deviation after adding 3: 7.35
Mean after multiplying by 3: 277.00
Standard deviation after multiplying by 3: 22.04


See if you can prove mathematically that your guess works in general, using the formulas for the two statistics. This part is for your own practice and observation; it is not graded.

$$
\bar{x} = \frac{1}{n} \displaystyle\sum_{k=1}^n x_k \quad \quad \textrm{and} \quad \quad s = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( x_k - \bar{x}\right)^2} 
$$
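To connect these formulas to the Pandas functions, the sketch below evaluates them by hand and compares the results with `.mean( )` and `.std( )`. It uses only the first five Northglenn **TMAX** values shown earlier; the variable names `x`, `xbar`, and `s` are chosen here to mirror the notation.

```python
import numpy as np
import pandas as pd

# First five Northglenn TMAX values from the head() output above
x = pd.Series([74.0, 91.0, 91.0, 91.0, 96.0])
n = len(x)

# Sample mean: (1/n) * sum of x_k
xbar = x.sum() / n

# Sample standard deviation: note the (n - 1) in the denominator
s = np.sqrt(((x - xbar) ** 2).sum() / (n - 1))

print(xbar, x.mean())
print(s, x.std())
```

Pandas' `.std( )` divides by $n-1$ by default (`ddof=1`), which matches the formula above. NumPy's `np.std` divides by $n$ unless you pass `ddof=1`.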

**Solution**: 

**The Mean with Addition**: It appears that when we add a constant to each observation the constant also gets added to the mean.  We can show this in general as follows.  Let $y_k = x_k + a$ be the shifted observations.  We then have  


$$
\bar{y} = \frac{1}{n} \sum_{k=1}^n y_k \quad = \quad \frac{1}{n} \sum_{k=1}^n (x_k + a) \quad = \quad 
\frac{1}{n} \sum_{k=1}^n x_k  + \frac{1}{n} \sum_{k=1}^n a 
\quad = \quad \bar{x}  + \frac{1}{n} \cdot an 
\quad = \quad \bar{x}  + a 
$$

**The Std Dev with Addition**: On the contrary, it appears that the standard deviation stays the same when we add a constant to each observation.  This should make intuitive sense because the std dev is a measure of the spread of the data, and by adding a constant to each observation we're just shifting things along the number line.  Let's see if we can use the formula for std dev to confirm this mathematically. 

$$
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( y_k - \bar{y} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left[ (x_k + a) - (\bar{x}+a) \right]^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( x_k - \bar{x} \right)^2 } 
$$

We thus see that the standard deviations of the $x$'s and the $y$'s are the same. 

**The Mean with Multiplication**: When we multiply each observation by a constant the mean also gets multiplied by the constant.  We can show this in general as follows.  Let $z_k = b \cdot x_k$ be the multiplied observations.  We then have  


$$
\bar{z} = \frac{1}{n} \sum_{k=1}^n z_k \quad = \quad \frac{1}{n} \sum_{k=1}^n b \cdot x_k  \quad = \quad 
b \cdot \frac{1}{n} \sum_{k=1}^n x_k  
\quad = \quad b\cdot \bar{x}  
$$

**The Std Dev with Multiplication**: Further, the std dev gets multiplied by the absolute value of the constant, which is the constant itself whenever $b \geq 0$ (as in our experiment with $b = 3$).  Let's see 

$$
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( z_k - \bar{z} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( b\cdot x_k - b\cdot \bar{x} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n b^2 \cdot \left( x_k - \bar{x} \right)^2 } \quad = \quad 
|b| \cdot \sqrt{\frac{1}{n-1}\sum_{k=1}^n  \left( x_k - \bar{x} \right)^2 } 
$$

This confirms what we observed in the experiment.
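The four results can be checked numerically. The sketch below reuses the first five Northglenn **TMAX** values and deliberately picks a negative multiplier (the choices `a = 3` and `b = -3` are just for illustration): since $\sqrt{b^2} = |b|$, a negative multiplier flips the sign of the mean but not of the standard deviation.

```python
import numpy as np
import pandas as pd

# First five Northglenn TMAX values from the head() output above
x = pd.Series([74.0, 91.0, 91.0, 91.0, 96.0])

a, b = 3, -3  # illustrative shift and (negative) scale

shifted = x + a
scaled = b * x

# Shift: the mean moves by a, the std dev is unchanged
print(np.isclose(shifted.mean(), x.mean() + a))
print(np.isclose(shifted.std(), x.std()))

# Scale: the mean is multiplied by b, the std dev by |b|
print(np.isclose(scaled.mean(), b * x.mean()))
print(np.isclose(scaled.std(), abs(b) * x.std()))
```

All four checks should print `True`.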

### Exercise 5 
***
Let us now apply a common transformation to the **TMAX** and **TMIN** columns by converting the temperatures from Fahrenheit to Celsius.  Remember that the transformation is given by: 

$$
\textrm{CELSIUS} = \frac{5}{9} (\textrm{FAHRENHEIT}-32) 
$$

First, use the Fahrenheit data in columns **TMAX** and **TMIN** to create Celsius columns in the `Northglenn` DataFrame called **TMAX-C** and **TMIN-C**. Use the `.loc` indexer to achieve this and display your results using `dfNorthglenn.head()`.

In [14]:
# Apply the Fahrenheit to Celsius conversion to the TMAX and TMIN columns
dfNorthglenn.loc[:, 'TMAX-C'] = (5/9) * (dfNorthglenn['TMAX'] - 32)
dfNorthglenn.loc[:, 'TMIN-C'] = (5/9) * (dfNorthglenn['TMIN'] - 32)

# Display the first few rows of the updated DataFrame
dfNorthglenn.head()


Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN,TMAX-C,TMIN-C
184,USC00055984,"NORTHGLENN, CO US",2017-07-01,0.0,74.0,51.0,23.333333,10.555556
185,USC00055984,"NORTHGLENN, CO US",2017-07-02,0.0,91.0,55.0,32.777778,12.777778
186,USC00055984,"NORTHGLENN, CO US",2017-07-03,0.0,91.0,57.0,32.777778,13.888889
187,USC00055984,"NORTHGLENN, CO US",2017-07-04,0.0,91.0,56.0,32.777778,13.333333
188,USC00055984,"NORTHGLENN, CO US",2017-07-05,0.0,96.0,56.0,35.555556,13.333333


Based on the material from Exercise 4, what do you expect the mean and the standard deviation of the daily maximum temperature to be in Celsius?

Let $\bar{x}^{min}$ and $\bar{x}^{max}$ represent the means of **TMIN** and **TMAX** in Fahrenheit, respectively. We then expect the means of the associated values in Celsius to be given by the following two formulas. Use them to calculate `y_min` and `y_max`.

$$
\begin{aligned}
\bar{y}^{min} &= \frac{5}{9}\left( \bar{x}^{min} - 32 \right) \\
\bar{y}^{max} &= \frac{5}{9}\left( \bar{x}^{max} - 32 \right)
\end{aligned}
$$

Once you've done your calculation, see if you're right by applying the .mean( ) method to **TMIN-C** and **TMAX-C**. Store these two values as `ybar_min` and `ybar_max`. Make sure again to use the `dfNorthglenn` data set. 

In [15]:
# Mean values for TMIN and TMAX in Fahrenheit, taken directly from the data
x_min = dfNorthglenn['TMIN'].mean()  # Mean of TMIN in Fahrenheit (about 59.667)
x_max = dfNorthglenn['TMAX'].mean()  # Mean of TMAX in Fahrenheit (about 92.333)

# Calculate expected mean temperatures in Celsius using the provided formulas
y_min = (5/9) * (x_min - 32)
y_max = (5/9) * (x_max - 32)

# Now, let's calculate the actual mean values for TMAX-C and TMIN-C in the DataFrame
ybar_min = dfNorthglenn['TMIN-C'].mean()
ybar_max = dfNorthglenn['TMAX-C'].mean()

# Print the results
print("Expected Mean Min Temp in Celsius = {:.3f}".format(y_min))
print("Expected Mean Max Temp in Celsius = {:.3f}".format(y_max))
print("Mean Min Temp in Celsius (from data) = {:.3f}".format(ybar_min))
print("Mean Max Temp in Celsius (from data) = {:.3f}".format(ybar_max))


Expected Mean Min Temp in Celsius = 15.370
Expected Mean Max Temp in Celsius = 33.519
Mean Min Temp in Celsius (from data) = 15.370
Mean Max Temp in Celsius (from data) = 33.519
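The standard deviation can be predicted the same way. Subtracting 32 is a shift, which leaves the std dev unchanged, and multiplying by $\frac{5}{9}$ scales it by $\frac{5}{9}$. The sketch below checks this on the first five Northglenn **TMAX** values only; the names `tmax_f`, `tmax_c`, and `expected_std_c` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# First five Northglenn TMAX values (Fahrenheit) from the head() output above
tmax_f = pd.Series([74.0, 91.0, 91.0, 91.0, 96.0])

# Convert to Celsius
tmax_c = (5 / 9) * (tmax_f - 32)

# Prediction from Exercise 4: the shift has no effect, the scale multiplies the std dev
expected_std_c = (5 / 9) * tmax_f.std()

print("Expected Std Dev in Celsius = {:.3f}".format(expected_std_c))
print("Std Dev in Celsius (from data) = {:.3f}".format(tmax_c.std()))
```

For the full month, the same reasoning predicts a Celsius standard deviation of roughly $\frac{5}{9} \cdot 7.345 \approx 4.08$ for **TMAX-C**.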


### Exercise 6 
***

(a) Compute the daily temperature range (max **minus** min) in Fahrenheit for each row in the `Northglenn` DataFrame and store it in a column called **TDIFF**.  Use the `.loc` indexer to do this; it takes just one line of code. Display your results using the `dfNorthglenn.head()` output. Then answer the questions below. 

(b)  What is the mean temperature difference over the month of July? 

(c)  What is the difference between the means of the max and min daily temperatures? 

(d)  Do you see a relationship between these two quantities?  If so, can you prove that it's always the case for the mean difference and the difference of means? (Open-ended question.) This will be shown to you. 

In [16]:
dfNorthglenn.columns

Index(['STATION', 'NAME', 'DATE', 'PRCP', 'TMAX', 'TMIN', 'TMAX-C', 'TMIN-C'], dtype='object')

In [17]:
dfNorthglenn["DATE"]

184    2017-07-01
185    2017-07-02
186    2017-07-03
187    2017-07-04
188    2017-07-05
189    2017-07-06
190    2017-07-07
191    2017-07-08
192    2017-07-09
193    2017-07-10
194    2017-07-11
195    2017-07-12
196    2017-07-13
197    2017-07-14
198    2017-07-15
199    2017-07-17
200    2017-07-18
201    2017-07-19
202    2017-07-20
203    2017-07-21
204    2017-07-22
205    2017-07-23
206    2017-07-24
207    2017-07-25
208    2017-07-26
209    2017-07-27
210    2017-07-28
211    2017-07-29
212    2017-07-30
213    2017-07-31
Name: DATE, dtype: object

In [18]:
# Compute the daily temperature range (TMAX - TMIN) and store it in a new column 'TDIFF'
dfNorthglenn.loc[:, 'TDIFF'] = dfNorthglenn['TMAX'] - dfNorthglenn['TMIN']

# Display the first few rows to verify the new column
dfNorthglenn.head()


Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN,TMAX-C,TMIN-C,TDIFF
184,USC00055984,"NORTHGLENN, CO US",2017-07-01,0.0,74.0,51.0,23.333333,10.555556,23.0
185,USC00055984,"NORTHGLENN, CO US",2017-07-02,0.0,91.0,55.0,32.777778,12.777778,36.0
186,USC00055984,"NORTHGLENN, CO US",2017-07-03,0.0,91.0,57.0,32.777778,13.888889,34.0
187,USC00055984,"NORTHGLENN, CO US",2017-07-04,0.0,91.0,56.0,32.777778,13.333333,35.0
188,USC00055984,"NORTHGLENN, CO US",2017-07-05,0.0,96.0,56.0,35.555556,13.333333,40.0


To compute the mean temperature difference we just need to compute the mean of the **TDIFF** column we just created. Store the result in `mean_diff` and run the line of code `print("Mean Temp Diff = {:.3f}".format(mean_diff))` to print out your result!

In [19]:
# PART B
# Calculate the mean temperature difference (mean of TDIFF)
mean_diff = dfNorthglenn['TDIFF'].mean()

print("Mean Temp Diff = {:.3f}".format(mean_diff))


Mean Temp Diff = 32.667


Now compute the difference of the **max** and **min** temperature means from the `dfNorthglenn` data frame. Store the result in `diff_of_means` and print it with the following line of code: `print("Diff of Mean Temps = {:.3f}".format(diff_of_means))`

In [20]:
# PART C
# Calculate the difference between the means of TMAX and TMIN
mean_max_temp = dfNorthglenn['TMAX'].mean()
mean_min_temp = dfNorthglenn['TMIN'].mean()

# Calculate the difference
diff_of_means = mean_max_temp - mean_min_temp

print("Diff of Mean Temps = {:.3f}".format(diff_of_means))

Diff of Mean Temps = 32.667
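The two results agree, and a short derivation suggests why. As a sketch for part (d), write $x_k^{max}$ and $x_k^{min}$ for the daily maximum and minimum on day $k$ (notation introduced here), and assume both are recorded on the same $n$ days. Splitting the sum gives

$$
\frac{1}{n} \sum_{k=1}^n \left( x_k^{max} - x_k^{min} \right) \quad = \quad 
\frac{1}{n} \sum_{k=1}^n x_k^{max} - \frac{1}{n} \sum_{k=1}^n x_k^{min} 
\quad = \quad \bar{x}^{max} - \bar{x}^{min} 
$$

So the mean of the differences equals the difference of the means. One caveat: Pandas skips missing values column by column, so if **TMAX** and **TMIN** were missing on different days, the two computations would average over different sets of rows and could disagree.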
