# Assignment 2



An NOAA dataset has been stored in the file `weather.csv`. This is the dataset to use for this assignment. Note: The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) [Daily Global Historical Climatology Network](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt) (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.

Each row in the assignment datafile corresponds to a single observation.

The following variables are provided to you:

* **id** : station identification code
* **date** : date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
* **element** : indicator of element type
    * TMAX : Maximum temperature (tenths of degrees C)
    * TMIN : Minimum temperature (tenths of degrees C)
* **value** : data value for element (tenths of degrees C)
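Because values are stored in tenths of degrees C, dividing by 10 gives degrees C. Here is a minimal sketch of that conversion. The rows are invented for illustration, the column names `Element` and `Data_Value` follow the code later in this notebook, and `temp_c` is a hypothetical column name.

```python
import pandas as pd

# Toy rows invented for illustration; they are not taken from weather.csv.
sample = pd.DataFrame({
    "ID": ["STATION_A", "STATION_A"],
    "Date": ["2012-01-24", "2012-01-24"],
    "Element": ["TMAX", "TMIN"],
    "Data_Value": [156, -22],  # tenths of degrees C
})

# Convert tenths of degrees C to degrees C (temp_c is a hypothetical name).
sample["temp_c"] = sample["Data_Value"] / 10.0
print(sample[["Element", "Data_Value", "temp_c"]])
```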

For this assignment, you must:

1. Read the documentation and familiarize yourself with the dataset, then write some Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
2. Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten-year (2005-2014) record high or record low was broken in 2015.
3. Watch out for leap days (i.e. February 29th); it is reasonable to remove these points from the dataset for the purpose of this visualization.
4. Make the visual nice! Leverage principles from the first module in this course when developing your solution. Consider issues such as legends, labels, and chart junk.
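The leap-day caveat in step 3 matters because the day-of-year number shifts by one after February 29 in a leap year. The sketch below shows one possible way to handle it with the standard library only. The helper name `day_of_year_ignoring_leap` and the choice of 2015 as the reference year are assumptions made for illustration.

```python
import datetime

def day_of_year_ignoring_leap(d, reference_year=2015):
    """Map a date onto a non-leap reference year and return its day of year (1..365).

    Returns None for February 29, which has no counterpart in a non-leap year.
    """
    if d.month == 2 and d.day == 29:
        return None
    return d.replace(year=reference_year).timetuple().tm_yday

# 1 March falls on a different day of year in a leap year...
print(datetime.date(2012, 3, 1).timetuple().tm_yday)
print(datetime.date(2013, 3, 1).timetuple().tm_yday)
# ...but maps to the same day once the year is normalised.
print(day_of_year_ignoring_leap(datetime.date(2012, 3, 1)))
```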



## Synopsis

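I read the CSV file `weather.csv` and split it into two periods: 2015 and every year other than 2015. I removed the leap days (29 February 2008 and 2012) from the period other than 2015; 2015 itself has no leap day. For each period I: converted the column named 'Date' to a date data type; computed the **'day of year'** for each row; grouped by **'day of year'**; and found the maximum of TMAX and the minimum of TMIN for each **'day of year'**.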

## Processing

The aim is to take the maximum of the TMAX records and the minimum of the TMIN records for each day of the year. The code below imports the modules, defines the helper functions used later, and reads the file `weather.csv`.

In [182]:
#1. step one - import pyplot and pandas, get the backend
%matplotlib notebook
import matplotlib as mpl
import matplotlib.pyplot as plt

import pandas as pd
import datetime
import numpy as np

#2 - create a function to get the day of the year from the Date column - this will be used later

def get_day_year(dataframe):
    dataframe['day_of_year'] = (dataframe['Date']
                                          .apply(lambda x: int(datetime.datetime.strftime(x,'%j'))))
    return dataframe

#3. create function to find the maximum for each day of the year - this will be used later

def get_max_day(dataframe):
    
    return (dataframe
          .loc[dataframe['Element']=='TMAX',:]
          .set_index('day_of_year')
          .groupby(level=0)['Data_Value']
        .agg(np.max))
#4. create function to find the minimum for each day of the year - this will be used later
def get_min_day(dataframe):
    return (dataframe
          .loc[dataframe['Element']=='TMIN',:]
          .set_index('day_of_year')
          .groupby(level=0)['Data_Value']
        .agg(np.min))

#5. replace the year in a Date column - this will be used later to ensure every year has the same calendar day,
# e.g. 1st March 2012 and 1st March 2013 both have 60 as the calendar day - otherwise they have different calendar days
def change_date(dataframe, col_str):
    # use the arguments passed in rather than a global dataframe
    return dataframe[col_str].map(lambda x: x.replace(year=2015))
        
#6 - read the raw data

df = pd.read_csv('weather.csv')
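
As a cross-check on the helper functions above, here is a minimal sketch of the same idea on toy data: take the record high as the maximum of TMAX and the record low as the minimum of TMIN for each day of the year. It uses the pandas `.dt.dayofyear` accessor in place of `strftime('%j')`. The rows are invented for illustration, and `record_high` and `record_low` are hypothetical names.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2006-01-01", "2005-01-02", "2006-01-02"] * 2),
    "Element": ["TMAX"] * 4 + ["TMIN"] * 4,
    "Data_Value": [100, 120, 90, 80, -50, -30, -10, -70],
})

# Day of year straight from the datetime accessor.
toy["day_of_year"] = toy["Date"].dt.dayofyear

# Record high: largest TMAX per day of year; record low: smallest TMIN per day of year.
record_high = toy[toy["Element"] == "TMAX"].groupby("day_of_year")["Data_Value"].max()
record_low = toy[toy["Element"] == "TMIN"].groupby("day_of_year")["Data_Value"].min()
print(record_high)
print(record_low)
```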


Here's the code to find the maximum and minimum for each calendar day.

In [200]:
#7 - change data type from string to date for the column Date

df["Date"] = pd.to_datetime(df["Date"])

#8 - exclude the year 2015 from the dataframe df
df['year']=df['Date'].apply(lambda date: date.year)
not_2015_df = df[df['year']!=2015]

#9 - remove the leap days (29 February 2008 and 2012) from the data for the years other than 2015
miss_leap_days_not_2015=not_2015_df[((not_2015_df['Date']!=pd.Timestamp("2008-02-29"))
                                     &(not_2015_df['Date']!=pd.Timestamp("2012-02-29")))]

#10 - replace every year with 2015 so that the same calendar date maps to the same day of year when grouping later

miss_leap_days_not_2015['Date'] = change_date(miss_leap_days_not_2015, 'Date')

#11 - get the calendar day of the year using the get_day_year function

miss_leap_days_not_2015= get_day_year(miss_leap_days_not_2015)

#12 - get the maximum for each calendar day over the years other than 2015

max_not_2015_df = get_max_day(miss_leap_days_not_2015)

#13 - get the minimum for each calendar day over the years other than 2015

min_not_2015_df = get_min_day(miss_leap_days_not_2015)

#14 - keep only the 2015 rows of the dataframe df (the 'year' column was added in step 8)
df_2015 = df[df['year']==2015]

#15 - get the day of the year for 2015 using the get_day_year function above

df_2015=get_day_year(df_2015)

#16 - get the minimum for each calendar day in 2015
min_2015_df = get_min_day(df_2015)
#17 - get the maximum for each calendar day in 2015
max_2015_df = get_max_day(df_2015)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
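
The warnings above are pandas' `SettingWithCopyWarning`. They appear when a new column is assigned on a dataframe that was produced by filtering another dataframe. One common way to avoid them, sketched here on toy data, is to take an explicit `.copy()` of the filtered rows before assigning. The rows and the `temp_c` column name are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration.
df = pd.DataFrame({"year": [2014, 2015, 2015], "Data_Value": [10, 20, 30]})

# .copy() makes the filtered result an independent dataframe,
# so adding a column to it is unambiguous and leaves df untouched.
only_2015 = df[df["year"] == 2015].copy()
only_2015["temp_c"] = only_2015["Data_Value"] / 10.0

print(only_2015)
print(df.columns.tolist())
```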
  



In [184]:
#18. check which backend is in use
mpl.get_backend()

'nbAgg'

In [245]:
#19. zip together the minimum temps during 2015, the minimum temps between 2005 & 2014, and the calendar day
min_tuple = zip(min_2015_df.tolist(), min_not_2015_df.tolist(), list(range(1,366)))

# helper: True if a 2015 minimum is below the 2005-2014 minimum for that calendar day
def test_min(min_2015, not_2015_series, calendar_day):
    return min_2015 < not_2015_series[calendar_day]

# helper: True if a 2015 maximum is above the 2005-2014 maximum for that calendar day
def test_max(max_2015, not_2015_series, calendar_day):
    return max_2015 > not_2015_series[calendar_day]

#20. create a tuple list where the temps during 2015 are less than the temps between 2005 and 2014 for the respective calendar days
lessthan_before_2015 = [(min_2015, day) for min_2015, min_not2015, day in min_tuple if min_not2015>min_2015]

#21. unpack the tuple list into the 2015 min temps that were below the 2005-2014 minimum, and the calendar days they happened
min_temp, day_broke_min = zip(*lessthan_before_2015)

#22. zip together the maximum temps during 2015, the maximum temps between 2005 & 2014, and the calendar day
max_tuple = zip(max_2015_df.tolist(), max_not_2015_df.tolist(), list(range(1,366)))

#23. create a tuple list where the temps during 2015 are more than the temps between 2005 and 2014 for the respective calendar days
morethan_before_2015 = [(max_2015, day) for max_2015, max_not2015, day in max_tuple if max_not2015<max_2015]

#24. unpack the tuple list into the 2015 max temps that were above the 2005-2014 maximum, and the calendar days they happened
max_temp, day_broke_max = zip(*morethan_before_2015)

print(test_max(306, max_not_2015_df, 127))


False
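
The tuple-based comparison above can also be written with boolean masks on pandas Series that share the same day-of-year index. This is only a sketch on toy values, not the notebook's data, and names such as `broke_low` and `broke_high` are hypothetical.

```python
import pandas as pd

# Toy values invented for illustration, indexed by day of year.
days = [1, 2, 3]
record_low = pd.Series([-50, -70, -20], index=days)
record_high = pd.Series([120, 90, 150], index=days)
low_2015 = pd.Series([-60, -10, -20], index=days)
high_2015 = pd.Series([110, 95, 150], index=days)

# Keep only the 2015 values that are strictly beyond the earlier record.
broke_low = low_2015[low_2015 < record_low]
broke_high = high_2015[high_2015 > record_high]

print(broke_low)
print(broke_high)
```

A tie with the old record is not counted as breaking it here, which matches the strict `<` and `>` used in `test_min` and `test_max`.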


In [234]:
#25. draw two lines for the 2005-2014 maximum and minimum temperature and label the graph
plt.figure()
plt.plot(max_not_2015_df, '-o', label='2005-2014 maximum temperature')
plt.plot(min_not_2015_df, '-o', label='2005-2014 minimum temperature')
plt.xlabel('Calendar day')
plt.ylabel('Temperature (tenths of degrees C)')
plt.title('When 2015 temperatures broke past maximum and minimum temperatures between 2005 & 2014')

#26. shade the area between the maximum and minimum curves
plt.gca().fill_between(range(1, 366), min_not_2015_df, max_not_2015_df, facecolor='grey', alpha=0.25)

#27. overlay the 2015 temps that broke the 2005-2014 minimum or maximum on the same figure
plt.scatter(day_broke_min, min_temp, c='blue', label='2015 temperature below 2005-2014 minimum')
plt.scatter(day_broke_max, max_temp, c='red', label='2015 temperature above 2005-2014 maximum')

#28. draw one legend that uses the labels set above
plt.legend()

<IPython.core.display.Javascript object>

<matplotlib.legend.Legend at 0xe228e50>
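
A calendar-day axis running from 1 to 365 is hard to read, so one possible refinement is to label the x-axis with month names. The sketch below only computes the tick positions and labels using the standard library. The names `tick_positions` and `tick_labels` are hypothetical, and 2015 is assumed as the non-leap reference year.

```python
import calendar
import datetime

REFERENCE_YEAR = 2015  # assumption: any non-leap year gives the same positions

# Day of year on which each month starts, plus a short label for it.
tick_positions = [
    datetime.date(REFERENCE_YEAR, month, 1).timetuple().tm_yday
    for month in range(1, 13)
]
tick_labels = [calendar.month_abbr[month] for month in range(1, 13)]

print(list(zip(tick_positions, tick_labels)))
```

These two lists could then be passed to `plt.xticks(tick_positions, tick_labels)` after the plotting calls.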