# Exercise 6: Weather anomalies 

The aim of this exercise is to analyze historical weather data.
- In Problem 1, you will read in a tricky data file and explore its contents.
- In Problem 2, you will convert and aggregate the data from daily temperatures in Fahrenheit to monthly average temperatures in Celsius.
- In Problem 3, you will finally analyze weather anomalies by comparing monthly average temperatures to a long-term average.

### Tips for completing this exercise

- Use **exactly** the same variable names as in the instructions. Your answers are graded automatically, and the tests rely on the formatting and variable names given in the instructions.
- **Please do not**:

   

## Problem 1 - Reading in a tricky data file 

Your first task for this exercise is to read the data file [data/1091402.txt](data/1091402.txt) into a variable called `data`. Pay attention to the structure of the input data and to the no-data values.
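
As a rough illustration of these two issues, the sketch below reads a small made-up sample rather than the real `data/1091402.txt`. The sample has a dashed separator on its second row and uses `-9999` as the no-data marker. The sample text and the names `sample_text` and `sample` are assumptions for demonstration only.

```python
import io

import pandas as pd

# Made-up sample mimicking the layout described above (not the real file)
sample_text = """STATION DATE TAVG TMAX TMIN
------- -------- ---- ---- ----
GHCND:X 19520101 37 39 34
GHCND:X 19520102 -9999 37 34
GHCND:X 19520103 33 -9999 30
"""

# Whitespace-separated columns; skip the dashed row (index 1); treat -9999 as NaN
sample = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",
    skiprows=[1],
    na_values=[-9999],
)

print(sample)
print(sample.isna().sum())
```

Printing the per-column NaN counts right after reading is a quick way to confirm that the no-data marker was actually recognised.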



**Your score on this problem will be based on the following criteria:**

- Reading the data into a variable called `data` using pandas
    - Skipping the second row of the datafile that contains `----------` characters that don't belong to the data
    - Converting the no-data values (`-9999`) into `NaN`
- Calculating basic statistics from the data
- Including comments that explain what most lines in the code do

### Part 1 

You should start by loading the data file.

- Read the data file into the variable `data`
    - Skip the second row
    - Convert the no-data values (`-9999`) into `NaN`

In [49]:
# Importing pandas
import pandas as pd

# Data reading
data = pd.read_csv("data/1091402.txt", sep = r"\s+", skiprows = [1], na_values = [-9999])


In [50]:
# Check that the dataframe looks ok:
data.head()

Unnamed: 0,STATION,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TAVG,TMAX,TMIN
0,GHCND:FIE00142080,51,60.3269,24.9603,19520101,0.31,37.0,39.0,34.0
1,GHCND:FIE00142080,51,60.3269,24.9603,19520102,,35.0,37.0,34.0
2,GHCND:FIE00142080,51,60.3269,24.9603,19520103,0.14,33.0,36.0,
3,GHCND:FIE00142080,51,60.3269,24.9603,19520104,0.05,29.0,30.0,25.0
4,GHCND:FIE00142080,51,60.3269,24.9603,19520105,0.06,27.0,30.0,25.0


In [51]:
# Check the last rows of the data (there should be some NaN values)
data.tail()

Unnamed: 0,STATION,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TAVG,TMAX,TMIN
23711,GHCND:FIE00142080,51,60.3269,24.9603,20170930,,47.0,49.0,44.0
23712,GHCND:FIE00142080,51,60.3269,24.9603,20171001,0.04,47.0,48.0,45.0
23713,GHCND:FIE00142080,51,60.3269,24.9603,20171002,,47.0,49.0,46.0
23714,GHCND:FIE00142080,51,60.3269,24.9603,20171003,0.94,47.0,,44.0
23715,GHCND:FIE00142080,51,60.3269,24.9603,20171004,0.51,52.0,56.0,


### Part 2 

In this section, you will calculate some basic statistics of the input data.

- Calculate how many no-data (NaN) values there are in the `TAVG` column
    - Assign your answer to a variable called `tavg_nodata_count`

In [52]:
# Calculating how many NaN values there are in the TAVG column
tavg_nodata_count = data["TAVG"].isna().sum()

In [53]:
# Print out the solution:
print(f'Number of no-data values in column "TAVG": {tavg_nodata_count}')

Number of no-data values in column "TAVG": 3308


- Calculate how many no-data (NaN) values there are for the `TMIN` column
    - Assign your answer into a variable called `tmin_nodata_count`

In [54]:
# Calculating how many NaN values there are in the TMIN column
tmin_nodata_count = data["TMIN"].isna().sum()

In [55]:
# Print out the solution:
print(f'Number of no-data values in column "TMIN": {tmin_nodata_count}')

Number of no-data values in column "TMIN": 365
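
The same `isna().sum()` pattern also works on a whole DataFrame, which gives the count for every column in one call. The following is a minimal sketch on made-up values; the names `sample`, `nodata_counts` and `nodata_share` are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame(
    {
        "TAVG": [37.0, np.nan, 33.0, np.nan],
        "TMIN": [34.0, 34.0, np.nan, 25.0],
    }
)

# isna() gives a boolean table; summing counts the True values per column
nodata_counts = sample.isna().sum()
print(nodata_counts)

# Share of missing values per column, as a fraction between 0 and 1
nodata_share = sample.isna().mean()
print(nodata_share)
```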


- Calculate the total number of days covered by this data file
    - Assign your answer into a variable called `day_count`

In [56]:
# Getting the total number of days (rows) covered by the dataframe
day_count = len(data)

In [57]:
# Print out the solution:
print(f'Number of days: {day_count}')

Number of days: 23716


- Find the date of the oldest (first) observation
    - Assign your answer into a variable called `first_obs`

In [58]:
# Getting the date of the oldest observation (the first row)
first_obs = data["DATE"].head(1).values[0]

In [59]:
# Print out the solution:
print(f'Date of the first observation: {first_obs}')

Date of the first observation: 19520101


- Find the date of the most recent (last) observation
    - Assign your answer into a variable called `last_obs`

In [60]:
# Getting the date of the most recent observation (the last row)
last_obs = data["DATE"].tail(1).values[0]

In [61]:
# Print out the solution:
print(f'Date of the last observation: {last_obs}')

Date of the last observation: 20171004
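
Taking the first and last rows assumes that the file is sorted by date. A possible alternative that does not rely on the row order is to parse the `DATE` integers as real dates and take the minimum and maximum. The sketch below uses made-up dates, and the names `sample`, `DATE_DT`, `first_date`, `last_date` and `span_days` are hypothetical.

```python
import pandas as pd

# Made-up, deliberately unsorted dates in the same YYYYMMDD integer form
sample = pd.DataFrame({"DATE": [19520103, 19520101, 20171004, 19690720]})

# Parse the integers as dates; the format string matches YYYYMMDD
sample["DATE_DT"] = pd.to_datetime(sample["DATE"].astype(str), format="%Y%m%d")

first_date = sample["DATE_DT"].min()
last_date = sample["DATE_DT"].max()

# Number of calendar days spanned, counting both end points
span_days = (last_date - first_date).days + 1
print(first_date.date(), last_date.date(), span_days)
```

The calendar span computed this way can differ from the row count if some days are missing from the file.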


- Find the average temperature for the whole data file (all observations) from column `TAVG`
    - Assign your answer into a variable called `avg_temp`

In [62]:
# Calculating the temperature average
avg_temp = data["TAVG"].mean()

In [63]:
# Print out the solution:
print(f'Average temperature (F) for the whole dataset: {round(avg_temp, 2)}')

Average temperature (F) for the whole dataset: 41.32


- Find the average `TMAX` temperature over the Summer of '69 (months May, June, July, and August of the year 1969)
    - Assign your answer into a variable called `avg_temp_1969`

In [64]:
# Selecting data for the summer of 1969: May (05), June (06), July (07) and August (08)
selected_summer_1969 = data[(data["DATE"] >= 19690501) & (data["DATE"] < 19690901)]
# Averaging the daily maximum temperatures (TMAX) over the selected period
avg_temp_1969 = selected_summer_1969["TMAX"].mean()

In [65]:
# This test print should print a number
print(f"Average temperature (F) for the Summer of '69: {round(avg_temp_1969, 2)}")
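
The date-range selection for May to August can be sketched on a small made-up sample. The values below are not taken from the real file, and the names `sample`, `in_summer`, `summer` and `summer_tmax_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample values (not taken from the real file)
sample = pd.DataFrame(
    {
        "DATE": [19690430, 19690501, 19690715, 19690831, 19690901],
        "TMAX": [50.0, 60.0, np.nan, 70.0, 55.0],
    }
)

# YYYYMMDD integers sort chronologically, so a numeric range selects the months
in_summer = (sample["DATE"] >= 19690501) & (sample["DATE"] <= 19690831)
summer = sample.loc[in_summer]

# mean() skips NaN values by default
summer_tmax_mean = summer["TMAX"].mean()
print(summer_tmax_mean)
```

Note that the mean is taken over the temperature column, not over `DATE`. Averaging the date integers still returns a number, so that kind of mistake is easy to miss.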


## Problem 2 - Calculating monthly average temperatures 

For this problem your goal is to calculate monthly average temperatures in degrees Celsius from the daily Fahrenheit values we have in the data file. You can continue working with the same DataFrame that you used in Problem 1.


- Calculating the monthly average temperatures in degrees Celsius for each month in the dataset (i.e., for each month of each year)
    - You should store the monthly average temperatures in a new Pandas DataFrame called `monthly_data`
    - `monthly_data` should contain a new column called `temp_celsius` that holds the monthly average temperatures in Celsius
    - Convert the `TAVG` values in Fahrenheit into Celsius and store the output in the `temp_celsius` column
- Including comments that explain what most lines in the code do



In [83]:
# Converting the DATE column to string in a new column called DATE_STR
data["DATE_STR"] = data["DATE"].astype(str)

# Extracting the year and month (YYYYMM) from the "DATE_STR" column into a new column called MONTH
data["MONTH"] = data["DATE_STR"].str.slice(start = 0, stop = 6)

# Grouping dataframe based on year and month
grouped = data.groupby(by = "MONTH")

# Calculating the monthly average in Fahrenheit
monthly_data = pd.DataFrame()
mean_col = ["TAVG"]
for key, group in grouped:
    mean_values = group[mean_col].mean()
    mean_values["MONTH"] = key
    row = mean_values.to_frame().transpose()
    monthly_data = pd.concat([monthly_data, row])

# Converting the temps to celsius in a newly created column
## Defining a function for converting temp unit
def fahr_to_celsius(fahr):
    return (fahr - 32) / 1.8

## creating a new column called temp_celsius and filling it with the converted data
monthly_data["temp_celsius"] = monthly_data["TAVG"].apply(fahr_to_celsius)
monthly_data.head()

Unnamed: 0,TAVG,MONTH,temp_celsius
0,29.478261,195201,-1.400966
0,24.8,195202,-4.0
0,13.807692,195203,-10.106838
0,39.607143,195204,4.22619
0,44.666667,195205,7.037037
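
The group-by-group loop above can also be written more compactly by letting `groupby` compute the mean directly. This is only a sketch of an alternative on made-up values, not the required solution. The names `sample` and `monthly_sample` are hypothetical, and the resulting column order (`MONTH`, `TAVG`, `temp_celsius`) differs from the output shown above.

```python
import pandas as pd

# Made-up daily values in Fahrenheit (not from the real file)
sample = pd.DataFrame(
    {
        "DATE": [19520101, 19520102, 19520201, 19520202],
        "TAVG": [32.0, 50.0, 14.0, 32.0],
    }
)

# Year-month key such as "195201" taken from the first six characters
sample["MONTH"] = sample["DATE"].astype(str).str.slice(0, 6)

# Mean TAVG per year-month; as_index=False keeps MONTH as a regular column
monthly_sample = sample.groupby("MONTH", as_index=False)["TAVG"].mean()

# Vectorised Fahrenheit -> Celsius conversion on the whole column
monthly_sample["temp_celsius"] = (monthly_sample["TAVG"] - 32) / 1.8
print(monthly_sample)
```

This version also produces a unique row index (0, 1, 2, ...), whereas concatenating one-row frames in a loop leaves every row labelled `0`.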


In [67]:
# This test prints the length of variable monthly_data
print(len(monthly_data))

790


In [68]:
# This test prints the column names of monthly_data
print(monthly_data.columns.values)

['TAVG' 'MONTH' 'temp_celsius']


In [69]:
# This test prints the mean of temp_celsius
print(monthly_data['temp_celsius'].mean())

5.097114347669991


In [70]:
# This test prints the median of temp_celsius
print(round(monthly_data['temp_celsius'].median(), 2))

4.73


## Problem 3 - Calculating temperature anomalies 

Our goal in this problem is to calculate monthly temperature anomalies in order to see how temperatures have changed over time, relative to a reference period from 1952 to 1980. You can continue working with the same data that you used in Problems 1 and 2.

**Your score on this problem will be based on the following criteria:**

### Part 1

- Calculating ***the average (mean) temperature for each month (e.g., January, February, March, ...) over the period from 1952 up to and including 1980*** in a new DataFrame called `reference_temps`
    - You should end up with 12 values, 1 mean temperature for each month during the time period (see example table and figure below).
    - The columns in the new DataFrame should be `month` and `ref_temp`
    
Your `reference_temps` dataframe should have the following structure: one value for each month of the year (12 in total), each representing the average over the observation period 1952-1980. The `ref_temp` temperatures should be in degrees Celsius.
   
| month    | ref_temp         |
|----------|------------------|
| 01       | -5.838761        |
| 02       | -7.064088        |
| 03       | -3.874213        |
| ...      | ...              |
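
One way to arrive at a table of this shape is sketched below on made-up monthly values: restrict the data to the reference period, extract the calendar month, and average per month. The names `monthly_sample`, `in_period`, `reference` and `reference_sample` are hypothetical, and the numbers are not from the real data.

```python
import pandas as pd

# Made-up monthly values in Celsius; MONTH is a YYYYMM string as in the exercise
monthly_sample = pd.DataFrame(
    {
        "MONTH": ["195201", "195202", "198001", "198002", "198101"],
        "temp_celsius": [-4.0, -6.0, -8.0, -10.0, 3.0],
    }
)

# Keep only the reference period 1952-1980 (string comparison works for YYYYMM)
in_period = (monthly_sample["MONTH"] >= "195201") & (monthly_sample["MONTH"] <= "198012")
reference = monthly_sample.loc[in_period].copy()

# Two-digit calendar month, e.g. "01" for January
reference["month"] = reference["MONTH"].str.slice(4, 6)

# One mean per calendar month, renamed to match the required column names
reference_sample = (
    reference.groupby("month", as_index=False)["temp_celsius"]
    .mean()
    .rename(columns={"temp_celsius": "ref_temp"})
)
print(reference_sample)
```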

### Part 2

- Calculating **a temperature anomaly for every month** in the `monthly_data` DataFrame using the corresponding monthly average temperature for each of the 12 months:
    - In order to achieve this you need to make **a table join** between `monthly_data` and `reference_temps` based on the month.
    - The temperature anomaly is calculated as the difference between the temperature for a given month (`temp_celsius` column in `monthly_data`) and the corresponding monthly reference temperature (`ref_temp` column in `reference_temps`).
    - Store the result in a new column `"diff"` 
    
As the output of the table join and the calculation, you should have three new columns in the `monthly_data` DataFrame:

1. `diff`: The temperature anomaly, i.e. the difference between the temperature for a given month (e.g., February 1960) and the mean temperature during the reference period (e.g., the average of all Februaries between 1952 and 1980)
2. `month`: The month for that row of observations
3. `ref_temp`: The monthly reference temperature

A summary of the relationships between the `monthly_data` and `reference_temps` DataFrames, as well as how the `diff` value should be calculated in the `monthly_data` DataFrame is presented in the figure below.

![Exercise 6 dataframes](img/exercise-6-dataframes.png)<br/>
*Figure 1. Relationships between the `monthly_data` and `reference_temps` DataFrames.*
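
The join and the anomaly calculation can be sketched on two tiny made-up tables. The names `monthly_sample`, `reference_sample` and `joined` are hypothetical, and a left join is used here as one reasonable choice so that no monthly rows are dropped.

```python
import pandas as pd

# Made-up values for illustration only
monthly_sample = pd.DataFrame(
    {
        "MONTH": ["196001", "196002", "201701"],
        "temp_celsius": [-3.0, -10.0, -1.0],
    }
)
reference_sample = pd.DataFrame({"month": ["01", "02"], "ref_temp": [-6.0, -8.0]})

# Common key: the two-digit calendar month
monthly_sample["month"] = monthly_sample["MONTH"].str.slice(4, 6)

# Left join keeps every monthly row and attaches its reference temperature
joined = monthly_sample.merge(reference_sample, on="month", how="left")

# Anomaly = observed monthly mean minus the reference mean for that month
joined["diff"] = joined["temp_celsius"] - joined["ref_temp"]
print(joined)
```

The key columns must have the same type in both tables (here both are two-character strings), otherwise the join will fail or match nothing.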

You should finally report which month had the greatest weather anomaly during the observed time period.

Remember to include comments in your code.

In [71]:
# Creating a new Dataframe
ref_dataframe = pd.DataFrame()

In [72]:
# Extracting the data from 1952 up to and including 1980
ref_dataframe = monthly_data[(monthly_data["MONTH"] >= "195201") & (monthly_data["MONTH"] < "198101")]
ref_dataframe.head()

Unnamed: 0,TAVG,MONTH,temp_celsius
0,29.478261,195201,-1.400966
0,24.8,195202,-4.0
0,13.807692,195203,-10.106838
0,39.607143,195204,4.22619
0,44.666667,195205,7.037037


In [73]:
# Working on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning
ref_dataframe = ref_dataframe.copy()

# Slicing the two-digit month out of the YYYYMM string in the MONTH column
ref_dataframe["month"] = ref_dataframe["MONTH"].str.slice(start = 4, stop = 6)
ref_dataframe.head()


Unnamed: 0,TAVG,MONTH,temp_celsius,month
0,29.478261,195201,-1.400966,1
0,24.8,195202,-4.0,2
0,13.807692,195203,-10.106838,3
0,39.607143,195204,4.22619,4
0,44.666667,195205,7.037037,5


In [74]:
# Grouping the rows by calendar month
ref_dataframe = ref_dataframe.groupby(by = "month")
ref_dataframe.head()

Unnamed: 0,TAVG,MONTH,temp_celsius,month
0,29.478261,195201,-1.400966,1
0,24.8,195202,-4.0,2
0,13.807692,195203,-10.106838,3
0,39.607143,195204,4.22619,4
0,44.666667,195205,7.037037,5
0,56.5,195206,13.611111,6
0,61.214286,195207,16.230159,7
0,57.483871,195208,14.157706,8
0,47.230769,195209,8.461538,9
0,35.892857,195210,2.162698,10


In [75]:
# Calculating the average for each calendar month over the whole reference period
reference_temps = pd.DataFrame()
mean_col = ["temp_celsius"]
for key, group in ref_dataframe:
    mean_values = group[mean_col].mean()
    mean_values["month"] = key
    row = mean_values.to_frame().transpose()
    reference_temps = pd.concat([reference_temps, row])

In [76]:
# Check the monthly data:
reference_temps

Unnamed: 0,temp_celsius,month
0,-5.838761,1
0,-7.064088,2
0,-3.874213,3
0,2.370749,4
0,9.482356,5
0,14.661728,6
0,16.520986,7
0,15.04565,8
0,9.934222,9
0,4.95224,10


In [77]:
# Creating a new column called ref_temp and filling it with the monthly averages
reference_temps["ref_temp"] = reference_temps["temp_celsius"].copy()

In [78]:
reference_temps

Unnamed: 0,temp_celsius,month,ref_temp
0,-5.838761,1,-5.838761
0,-7.064088,2,-7.064088
0,-3.874213,3,-3.874213
0,2.370749,4,2.370749
0,9.482356,5,9.482356
0,14.661728,6,14.661728
0,16.520986,7,16.520986
0,15.04565,8,15.04565
0,9.934222,9,9.934222
0,4.95224,10,4.95224


In [79]:
# Removing extra columns
reference_temps = reference_temps[["month", "ref_temp"]]
reference_temps

Unnamed: 0,month,ref_temp
0,1,-5.838761
0,2,-7.064088
0,3,-3.874213
0,4,2.370749
0,5,9.482356
0,6,14.661728
0,7,16.520986
0,8,15.04565
0,9,9.934222
0,10,4.95224


In [89]:
monthly_data.head()

Unnamed: 0,TAVG,MONTH,temp_celsius,_month,month
0,29.478261,195201,-1.400966,1,1
0,24.8,195202,-4.0,2,2
0,13.807692,195203,-10.106838,3,3
0,39.607143,195204,4.22619,4,4
0,44.666667,195205,7.037037,5,5


In [98]:
# Adding a month column to monthly_data to use as the common column when joining with the reference_temps dataframe
monthly_data["month"] = monthly_data["MONTH"].str.slice(start = 4, stop = 6)
monthly_data.head()

Unnamed: 0,TAVG,MONTH,temp_celsius,month
0,29.478261,195201,-1.400966,1
0,24.8,195202,-4.0,2
0,13.807692,195203,-10.106838,3
0,39.607143,195204,4.22619,4
0,44.666667,195205,7.037037,5


In [99]:
# Joining two dataframes
monthly_data = pd.merge(monthly_data, reference_temps, on = ["month"], how = "inner")
monthly_data.head()

Unnamed: 0,TAVG,MONTH,temp_celsius,month,ref_temp
0,29.478261,195201,-1.400966,1,-5.838761
1,24.8,195202,-4.0,2,-7.064088
2,13.807692,195203,-10.106838,3,-3.874213
3,39.607143,195204,4.22619,4,2.370749
4,44.666667,195205,7.037037,5,9.482356


In [105]:
# Calculating differences between temperature for each month in each year and the monthly average in the whole period
monthly_data["diff"] = monthly_data["temp_celsius"] - monthly_data["ref_temp"]
monthly_data.head()

Unnamed: 0,TAVG,MONTH,temp_celsius,month,ref_temp,diff
0,29.478261,195201,-1.400966,1,-5.838761,4.437795
1,24.8,195202,-4.0,2,-7.064088,3.064088
2,13.807692,195203,-10.106838,3,-3.874213,-6.232625
3,39.607143,195204,4.22619,4,2.370749,1.855441
4,44.666667,195205,7.037037,5,9.482356,-2.445319


In [104]:
# Print out descriptive statistics for the relevant columns:
monthly_data[["temp_celsius", "ref_temp", "diff"]].describe()

Unnamed: 0,temp_celsius
count,682.0
mean,5.097114
std,8.483949
min,-17.97491
25%,-1.685185
50%,4.726105
75%,12.87037
max,22.329749


Remember also to calculate which month had the largest temperature anomaly during the observed time period in comparison with the reference data. Use the cell below to calculate and print out the answers. Note, you may want to consider the largest absolute value of the temperature anomaly, as well as the largest positive and negative anomalies.

In [115]:
# Find the largest positive and the largest negative anomaly
max_pos_anomaly = monthly_data["diff"].max()
max_neg_anomaly = monthly_data["diff"].min()

In [119]:
# Convert the largest negative anomaly to a positive number to compare it with max_pos_anomaly
max_neg_anomaly = max_neg_anomaly * -1

In [123]:
# Report whichever anomaly is larger in absolute value
if max_pos_anomaly > max_neg_anomaly:
    print(f'The largest anomaly compared to the monthly average is {round(max_pos_anomaly, 2)}')
else:
    print(f'The largest anomaly compared to the monthly average is {round(max_neg_anomaly, 2)}')

The largest anomaly compared to the monthly average is 8.23
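
The comparison above reports the size of the anomaly but not the month in which it occurred. A possible way to look up the month as well is sketched below on made-up anomalies, assuming the DataFrame has a unique row index (as it does after a merge). The names `sample`, `idx_pos`, `idx_neg`, `idx_abs` and `largest` are hypothetical.

```python
import pandas as pd

# Made-up anomalies for illustration only
sample = pd.DataFrame(
    {
        "MONTH": ["195203", "196601", "201512"],
        "diff": [-6.2, -9.5, 8.2],
    }
)

# Row labels of the largest positive, largest negative and largest absolute anomaly
idx_pos = sample["diff"].idxmax()
idx_neg = sample["diff"].idxmin()
idx_abs = sample["diff"].abs().idxmax()

# .loc with a single label returns the whole row, including the MONTH value
largest = sample.loc[idx_abs]
print(f'Largest absolute anomaly: {largest["diff"]} in {largest["MONTH"]}')
```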


# Thank you for this exercise. It was really challenging for me!