# Exercise 6 - Data analysis with NumPy

In this week's exercise we will continue developing our skills using NumPy to analyze climate data.

After making your changes, you will need to upload your changes to GitHub as usual.
The answers to the questions in this week's exercise should be given by modifying the document in the requested places.

If you are uncertain about **the style of your code**, take a look at the **[PEP 8 - Style guide for Python code](https://www.python.org/dev/peps/pep-0008/)**.  

 - **Exercise 6 is due by 16:00 on 17.10.**
 - Don't forget to check out the [hints for this week's exercise](https://geo-python.github.io/2018/lessons/L6/exercise-6.html) if you're having trouble.
 - Scores on this exercise are out of 20 points.
 - There are 3 problems in total that you should solve. The fourth problem is optional for more advanced students (does not affect grading).

## Data

For problems 1-3 in this exercise we will be using climate data from the Helsinki-Vantaa airport station.
For these problems, we have daily observations obtained from the [NOAA Global Historical Climatology Network](https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND).
The file was downloaded using the "Custom GHCN-Daily Text" output format, including the following attributes:

| Attribute                | Description                      |
|--------------------------|----------------------------------|
| `STATION`                | Unique ID of the weather station |
| `ELEVATION`              | Elevation of the station         |
| `LATITUDE` , `LONGITUDE` | Coordinates of the station       |
| `DATE`                   | Date of the measurement          |
| `PRCP`                   | Precipitation                    |
| `TAVG`                   | Average temperature              |
| `TMAX`                   | Maximum temperature              |
| `TMIN`                   | Minimum temperature              |

The file for this problem is exactly as available from the NOAA website.
You may want to take a look at the [data](data/1091402.txt).

**Note**: Once again temperatures in this dataset are given in degrees Fahrenheit.

Additional information about the data format can be found in the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6.html).

# Problem 1 - Reading in a tricky data file (5 points)

#### Overview

Your first task for this exercise is to read the data file (`data/1091402.txt`) into a variable called **`data`**.
This should be done using the `np.genfromtxt()` function, and the resulting NumPy array should have the following attributes:

  - The numerical values for date, precipitation and temperatures read in as numbers (skip the other columns)
  - The header rows of the datafile should be skipped
  - The no-data values (assigned with value **`-9999`**) should properly be converted to `nan`
  
After successfully reading the data file you should find answers to the specific questions below, and upload your notebook to **your own repository** for this week's exercise.

You can find hints about how to do these things in the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6.html).

1. Read the file into a variable called **`data`**
   - Skip the first two rows
   - Read in only the date, precipitation, and temperatures values
   - Convert the no-data values into `NaN` (values -9999)
   - Split the data into 1D NumPy arrays called `date`, `precip`, `tavg`, `tmax`, and `tmin`
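The steps in this list can be sketched on a tiny in-memory sample. This is only an illustration under stated assumptions: the three data rows, the station ID `GHCND:X`, and all `sample_*` names are hypothetical and do not come from the real station file; only the column order follows the attribute table above.

```python
import io

import numpy as np

# Hypothetical sample mimicking the file layout: two header rows followed by
# whitespace-separated columns. These are NOT real observations.
sample = io.StringIO(
    "STATION ELEVATION LATITUDE LONGITUDE DATE PRCP TAVG TMAX TMIN\n"
    "------- --------- -------- --------- ---- ---- ---- ---- ----\n"
    "GHCND:X 51 60.3 24.9 19520101 0.31 37 39 34\n"
    "GHCND:X 51 60.3 24.9 19520102 -9999 35 37 34\n"
    "GHCND:X 51 60.3 24.9 19520103 0.14 33 36 -9999\n"
)

# Skip the header rows and read only DATE, PRCP, TAVG, TMAX and TMIN
sample_data = np.genfromtxt(sample, skip_header=2, usecols=(4, 5, 6, 7, 8))

# Convert the no-data marker to nan
sample_data[sample_data == -9999] = np.nan

# Transposing gives one row per column, which unpacks into 1D arrays
sample_date, sample_precip, sample_tavg, sample_tmax, sample_tmin = sample_data.T
```

Unpacking the transposed array is one possible way to split the columns; indexing each column separately (for example `sample_data[:, 0]`) gives the same result.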

In [11]:
# YOUR CODE HERE
import numpy as np

In [12]:
fp = 'data/1091402.txt'
# Skip the two header rows; columns 4-8 hold DATE, PRCP, TAVG, TMAX and TMIN
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 5, 6, 7, 8))

In [13]:
# Test print that should work
print(data[0,:])

[1.9520101e+07 3.1000000e-01 3.7000000e+01 3.9000000e+01 3.4000000e+01]


In [14]:
data_mask = (data < -9998)
data[data_mask] = np.nan

In [15]:
print(data)

[[1.9520101e+07 3.1000000e-01 3.7000000e+01 3.9000000e+01 3.4000000e+01]
 [1.9520102e+07           nan 3.5000000e+01 3.7000000e+01 3.4000000e+01]
 [1.9520103e+07 1.4000000e-01 3.3000000e+01 3.6000000e+01           nan]
 ...
 [2.0171002e+07           nan 4.7000000e+01 4.9000000e+01 4.6000000e+01]
 [2.0171003e+07 9.4000000e-01 4.7000000e+01           nan 4.4000000e+01]
 [2.0171004e+07 5.1000000e-01 5.2000000e+01 5.6000000e+01           nan]]


In [16]:
data.shape

(23716, 5)

In [17]:
type(data)

numpy.ndarray

In [18]:
data[0, 4:8]

array([34.])

In [107]:
date = data[:, 0]
precip = data[:, 1]
tavg = data[:, 2]
tmax = data[:, 3]
tmin = data[:, 4]

- How many no-data values (`nan`) are there for **`tavg`**?
  - Assign your answer to a variable called **`tavg_nan_count`**

In [125]:
# How many no-data values?
# YOUR CODE HERE
tavg_nan_count_mask = ~np.isfinite(tavg)  # True where tavg is nan
tavg_nan_count = np.count_nonzero(tavg_nan_count_mask)

In [128]:
# This test print should print a number
print(tavg_nan_count)

3308


- How many no-data values (`nan`) are there for `tmin`?
  - Assign your answer to a variable called **`tmin_nan_count`**

In [115]:
# How many no-data values?
# YOUR CODE HERE
tmin_nan_count_mask = ~np.isfinite(tmin)  # True where tmin is nan
tmin_nan_count = np.count_nonzero(tmin_nan_count_mask)

In [129]:
# This test print should print a number
print(tmin_nan_count)


- How many days total are covered by this data file?
  - Assign your answer into a variable called **`day_count`**

In [131]:
# How many days?
# YOUR CODE HERE
date_count_mask = np.isfinite(date)  # True for every row with a valid date
day_count = np.count_nonzero(date_count_mask)

In [134]:
# This test print should print a number
print(day_count)

23716


- When was the first observation made (i.e., the oldest)?
  - Assign your answer to a variable called **`first_obs`**

In [152]:
# YOUR CODE HERE
first_obs = date[0]


In [153]:
# This test print should print a number
print(first_obs)


19520101.0


- When was the last observation made (i.e., the most recent)?
  - Assign your answer to a variable called **`last_obs`**

In [154]:
# YOUR CODE HERE
last_obs = date[-1]

In [155]:
# This test print should print a number
print(last_obs)


20171004.0


- What was the average temperature of the whole data file (all years)?
  - Assign your answer to a variable called **`avg_temp`**

In [192]:
# YOUR CODE HERE
avg_temp_mask = np.isfinite(tavg)  # keep only days with a valid average temperature
avg_temp = np.mean(tavg[avg_temp_mask])

In [193]:
# This test print should print a number
print(avg_temp)


41.32408859270874


- What was the average **`TMAX`** temperature of the `Summer of 69` (i.e., including the months May, June, July, August of the year 1969)?
  - Assign your answer to a variable called **`avg_temp_69`**

In [200]:
# YOUR CODE HERE
# Dates are stored as YYYYMMDD numbers, so May-August 1969 is a numeric range
summer_69_mask = (date >= 19690501) & (date <= 19690831)

In [None]:
# Mean TMAX (degrees Fahrenheit) over those days, ignoring nan values
avg_temp_69 = np.nanmean(tmax[summer_69_mask])


In [None]:
# This test print should print a number
print(avg_temp_69)


# Problem 2 - Calculating monthly average temperatures (7.5 points)

For this problem your goal is to calculate monthly average temperature values in degrees Celsius from the daily values we have in the data file.
You can use the approaches taught in Lessons 4, 5 and 6 to solve this.
You can again consult the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6.html) if you are stuck.

**You can continue working with the same data that you used in Problem 1.**

#### For this problem you should:

1. Calculate the monthly average temperatures for the entire dataset (i.e., for each month of each year separately) using the approach shown in the lesson this week.
    - You should store the average temperatures into a new NumPy array called **`temp_monthly`**.
    - You will also need to create separate arrays called **`year`** and **`month`** to store the year and month from the date data. Be sure to use the temperature and date data with the `nan` values removed!
    - Note you should only do this calculation for the years **1952-2016**!
    - Also be aware that there is missing data for `tavg` for several years.
2. Create a new array called **`temp_monthly_celsius`** with the same size as **`temp_monthly`** that has the monthly average temperatures in Celsius.
3. Update and commit your changes to the notebook in your **own repository** of this week's exercise.
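The calculation in steps 1 and 2 can be sketched on a handful of made-up values. This is a minimal illustration, not necessarily the approach from the lesson: it splits the `YYYYMMDD` dates with integer arithmetic, and every `demo_*` name and value is hypothetical.

```python
import numpy as np

# Hypothetical daily values: YYYYMMDD dates and temperatures in Fahrenheit
demo_date = np.array([19520101, 19520102, 19520201, 19520202, 19530101])
demo_tavg = np.array([32.0, 50.0, np.nan, 41.0, 14.0])

# Remove the days with a missing temperature from both arrays
valid = np.isfinite(demo_tavg)
demo_date = demo_date[valid]
demo_tavg = demo_tavg[valid]

# Split YYYYMMDD into year and month using integer division and remainder
demo_year = demo_date // 10000
demo_month = (demo_date // 100) % 100

# Build one key per year-month pair (e.g. 195201) and find the unique keys
year_month = demo_year * 100 + demo_month
unique_year_month = np.unique(year_month)

# Average the daily values that share each year-month key
demo_monthly = np.array(
    [demo_tavg[year_month == ym].mean() for ym in unique_year_month]
)

# Convert the monthly means from Fahrenheit to Celsius
demo_monthly_celsius = (demo_monthly - 32) / 1.8
```

For the real data, the same mask-based selection could also be combined with a year filter such as `(demo_year >= 1952) & (demo_year <= 2016)` to restrict the calculation to the requested period.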

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Problem 3 - Calculating temperature anomalies (7.5 points)

Our goal in this problem is to calculate monthly temperature anomalies in order to see how temperatures have changed over time, relative to the 1952-1980 reference period.

We will again continue working with this same notebook.

In order to complete the problem, you must do the following things:

- You need to calculate a mean temperature ***for each month*** over the period 1952-1980 using the data in the data file. As a result, you should end up with 12 values, 1 mean temperature for each month over that period, and store them in a new NumPy array called **`ref_temps`**.
- You should also create a character string array called **`unique_months`** that contains a value for each month in the form below (i.e., January = `'01'`).
   
For example, your data should look something like the table below, with 1 value for each month of the year (12 total):
   
| unique_months    | ref_temps        |
|------------------|------------------|
| 01               | -5.350916        |
| 02               | -5.941307        |
| 03               | -2.440364        |
| ...              | ...              |
   
*Remember, these temperatures should be in degrees Celsius.*

- Once you have the monthly mean values for each of the 12 months, you can then calculate a temperature anomaly for every month in the `temp_monthly_celsius` array.
- The temperature anomaly we want to calculate is simply the temperature for one month in `temp_monthly_celsius` minus the corresponding monthly reference temperature in `ref_temps`.
    - Notice that these arrays are not the same size and you'll need a creative solution to this problem. The [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6.html) may help.
- You should thus end up with three new arrays: 

    1. **`anomaly`**  showing the temperature anomaly, the difference in temperature for a given month (e.g., February 1960) compared to the average (e.g., for February 1952-1980) 
    2. **`unique_months`** indicating the month
    3. **`ref_temps`** indicating the (monthly) reference temperature
    
- Update and commit your changes to the notebook in your **own repository** of this week's exercise.
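One possible way to handle the size mismatch between the monthly values and the 12 reference values is to look up, for each month, the position of its label in the unique month list. The sketch below uses only two months and two years of made-up Celsius values; all `demo_*` names and numbers are hypothetical, and the hints for the exercise may suggest a different solution.

```python
import numpy as np

# Hypothetical monthly mean temperatures (Celsius) with their year and month labels
demo_years = np.array([1952, 1952, 1990, 1990])
demo_months = np.array(['01', '02', '01', '02'])
demo_temps_c = np.array([-6.0, -5.0, -2.0, -1.0])

# Unique month labels, plus the index of each monthly value's label in that list
demo_unique_months, month_index = np.unique(demo_months, return_inverse=True)

# Reference temperature per month, using only the reference period 1952-1980
in_ref = (demo_years >= 1952) & (demo_years <= 1980)
demo_ref_temps = np.array(
    [demo_temps_c[in_ref & (demo_months == m)].mean() for m in demo_unique_months]
)

# Indexing with month_index repeats each reference value to match the monthly array
demo_anomaly = demo_temps_c - demo_ref_temps[month_index]

# Largest anomaly, ignoring any nan values
demo_max_anomaly = np.nanmax(demo_anomaly)
```

The fancy-indexing step `demo_ref_temps[month_index]` produces an array the same length as the monthly temperatures, so the subtraction works element by element without an explicit loop.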

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- What is the largest value in the `anomaly` array?
   - Print the answer in the cell below

#### Done!

That's it. You have now completed Problems 1-3. If you want, you can continue with the optional [Problem 4](Exercise-6-problem-4.ipynb).