# 2.2 Activity

For our **Unit 2** activities, we will continue working with snowpack, precipitation, and temperature measurements from 2014 to 2019 at the Central Sierra Snow Laboratory in the Sierra Nevada, California.

In the previous activity, we explored data types, type conversions, and data manipulation. In this activity, we'll perform more data operations but this time with NumPy instead of Pandas.

By the end of this activity, you will:

1. Create a NumPy array from a Pandas DataFrame.
2. Select values by indexing and ranges.
3. Take a simple random sample from an array.
4. Obtain descriptive statistics from an array.
5. Reflect on working with arrays vs DataFrames.

**Acknowledgements**

Osterhuber, Randall; Schwartz, Andrew (2021), Snowpack, precipitation, and temperature measurements at the Central Sierra Snow Laboratory for water years 1971 to 2019, Dryad, Dataset, https://doi.org/10.6078/D1941T

## Task 1: Setup Workspace

Import NumPy and Pandas, then read in the data file **exported at the end of Activity 2.1** to a variable called `precip` and drop any 'Unnamed' columns.

Return the first few rows of data to ensure it has loaded correctly.

*Note: If you did not save the file to your Google Drive, upload it to this workspace using the **Files** folder on the left.*

In [11]:
#Importing libraries
import pandas as pd
import numpy as np
#Reading in the data from last lesson
from google.colab import drive
drive.mount('/content/drive') #mount the drive
precip = pd.read_csv('/content/drive/MyDrive/revised_precip.csv')
#Dropping the "Unnamed" columns
precip.info() #here we can see the first column is unnamed so we'll remove that
precip.drop('Unnamed: 0', axis = 1, inplace = True)
#Let's take a look
precip.head() #it's gone

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1826 entries, 0 to 1825
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  1826 non-null   int64  
 1   Date                        1826 non-null   object 
 2   Air Temp Max (C)            1826 non-null   int64  
 3   Air Temp Min (C)            1826 non-null   int64  
 4   24-hour Total Precip (mm)   1826 non-null   int64  
 5   Season Total Precip (mm)    1826 non-null   int64  
 6   % of Precip as Snow         400 non-null    float64
 7   % of Precip as Rain         322 non-null    float64
 8   New Snow (cm)               1826 non-null   object 
 9   Season Total Snow (cm)      1826 non-null   float64
 10  Snowpack depth (cm)         1826 non-null   object 
 11  Snow Water Equival

,Date,Air Temp Max (C),Air Temp Min (C),24-hour Total Precip (mm),Season Total Precip (mm),% of Precip as Snow,% of Precip as Rain,New Snow (cm),Season Total Snow (cm),Snowpack depth (cm),Snow Water Equivalent (cm),Air Temp Max (F),Air Temp Min (F)
0,10/1/14,16,3,0,0,,,0,0.0,0,0,60.8,37.4
1,10/2/14,22,1,0,0,,,0,0.0,0,0,71.6,33.8
2,10/3/14,24,5,0,0,,,0,0.0,0,0,75.2,41.0
3,10/4/14,26,5,0,0,,,0,0.0,0,0,78.8,41.0
4,10/5/14,25,8,0,0,,,0,0.0,0,0,77.0,46.4
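
The task asks to drop *any* 'Unnamed' columns, while the cell above drops one by its exact name. As a hedged sketch of a more general approach, the example below filters on the column-name prefix. The small in-memory CSV and the names `csv_text`, `demo`, and `demo_clean` are made up for illustration and are not part of the activity's data file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the exported CSV: the empty first header
# is what pandas reads back as "Unnamed: 0".
csv_text = """,Date,Air Temp Max (C)
0,10/1/14,16
1,10/2/14,22
"""

demo = pd.read_csv(io.StringIO(csv_text))
print(list(demo.columns))

# Keep only the columns whose name does not start with "Unnamed"
demo_clean = demo.loc[:, ~demo.columns.str.startswith("Unnamed")]
print(list(demo_clean.columns))
```

This returns a new DataFrame rather than modifying `demo` in place, so the original stays available for comparison.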


Before you can begin using NumPy, you must convert your Pandas DataFrame to a NumPy array using `df.to_numpy()`.

Save this as a new variable called `np_precip` and print its values.

In [16]:
np_precip = precip.to_numpy()
print(np_precip)

[['10/1/14' 16 3 ... '0' 60.8 37.4]
 ['10/2/14' 22 1 ... '0' 71.6 33.8]
 ['10/3/14' 24 5 ... '0' 75.2 41.0]
 ...
 ['9/28/19' 8 -3 ... '0' 46.4 26.6]
 ['9/29/19' 2 -3 ... '--' 35.6 26.6]
 ['9/30/19' 3 -4 ... '1' 37.4 24.8]]


***Do you notice anything different?***

Compared to what we saw in the demonstration video, two things stand out when converting a DataFrame: 1. the output doesn't show the `array(...)` wrapper at the beginning, because `print()` displays the array's string form rather than its representation, and 2. the result is a two-dimensional array with one row per observation and one column per DataFrame column (in our case, 1826 rows by 13 columns), stored with `dtype=object` because the columns mix text and numbers.
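
One way to check the dimensions of a converted array directly is to inspect its `ndim`, `shape`, and `dtype` attributes. The sketch below uses a small made-up DataFrame (`demo`, not the real precipitation data) to show these attributes and the difference between the printed and `repr` forms.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with mixed column types
demo = pd.DataFrame({
    "Date": ["10/1/14", "10/2/14", "10/3/14"],
    "Air Temp Max (C)": [16, 22, 24],
    "Air Temp Max (F)": [60.8, 71.6, 75.2],
})

demo_arr = demo.to_numpy()

print(demo_arr.ndim)   # number of dimensions (axes): 2
print(demo_arr.shape)  # (rows, columns): (3, 3)
print(demo_arr.dtype)  # mixed text and numbers fall back to object

print(demo_arr)        # string form: no "array(...)" wrapper
print(repr(demo_arr))  # repr form: includes "array(...)" and the dtype
```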

## Task 2: Selecting Data

One common task in NumPy is selecting rows, columns, or individual observations.

Let's try slicing and dicing the data with various techniques.

### Row Indexing

Grab the first row in the dataset with `array[row_index]` where `array` is the name of the variable you created earlier.

*Hint: Indices in Python start at `0`.*

In [18]:
np_precip[0] #np_precip is our new array and 0 is the starting index

array(['10/1/14', 16, 3, 0, 0, nan, nan, '0', 0.0, '0', '0', 60.8, 37.4],
      dtype=object)

Now, let's select the `first 100` rows using the syntax:

 `array[row_start_index:row_end_index]`

*Hint: The starting row index is **inclusive**, whereas the ending row index is **exclusive**.*

In [19]:
np_precip[0:100] #0:100 stops at index 99, so there are 100 observations

array([['10/1/14', 16, 3, ..., '0', 60.8, 37.4],
       ['10/2/14', 22, 1, ..., '0', 71.6, 33.8],
       ['10/3/14', 24, 5, ..., '0', 75.2, 41.0],
       ...,
       ['1/6/15', 18, 0, ..., '17.6', 64.4, 32.0],
       ['1/7/15', 16, -2, ..., '17.6', 60.8, 28.4],
       ['1/8/15', 14, -3, ..., '--', 57.2, 26.6]], dtype=object)

Now, let's grab the `last 100` rows by using a negative start index:

`array[-num_rows:]`

*Note: If the value before or after `:` is omitted when specifying a range of indices, NumPy assumes you mean everything from the start or through the end, respectively.*

In [20]:
np_precip[-100:]

array([['6/23/19', 20, 3, ..., '0', 68.0, 37.4],
       ['6/24/19', 19, 4, ..., '0', 66.2, 39.2],
       ['6/25/19', 18, 2, ..., '0', 64.4, 35.6],
       ...,
       ['9/28/19', 8, -3, ..., '0', 46.4, 26.6],
       ['9/29/19', 2, -3, ..., '--', 35.6, 26.6],
       ['9/30/19', 3, -4, ..., '1', 37.4, 24.8]], dtype=object)
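
The slicing rules used in this section can be checked on a much smaller array. The following is a minimal sketch; the 10-row array `demo` is hypothetical and simply stands in for `np_precip`.

```python
import numpy as np

# Hypothetical 10-row, 2-column array standing in for np_precip
demo = np.arange(20).reshape(10, 2)

first_three = demo[0:3]  # rows 0, 1, 2 (row 3 is excluded)
same_first = demo[:3]    # omitting the start means "from row 0"
last_three = demo[-3:]   # rows 7, 8, 9

print(first_three.shape)  # (3, 2)
print(last_three[:, 0])   # first column of the last three rows
```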

### Column Indexing

Columns in NumPy can be selected using:

`array[:,column_index]`

Try selecting all rows of the column containing dates.

*Note: You can skip specifying rows or columns using the same `:` notation as within a range.*

In [21]:
np_precip[:,0] #date column is column 0

array(['10/1/14', '10/2/14', '10/3/14', ..., '9/28/19', '9/29/19',
       '9/30/19'], dtype=object)

Notice that the data that was returned no longer looks like a column. That's because it's not! It's a one-dimensional array. For simplicity's sake, we'll still refer to these as columns.
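
If a column-shaped (two-dimensional) result is needed, indexing with a list or a one-column slice keeps the column axis. This is a small sketch on a made-up array `demo`, not a required step in the activity.

```python
import numpy as np

# Hypothetical 4-row, 3-column array
demo = np.arange(12).reshape(4, 3)

col_1d = demo[:, 0]          # integer index drops the column axis
col_2d = demo[:, [0]]        # list index keeps it as a 4x1 array
col_2d_slice = demo[:, 0:1]  # a one-column slice also keeps it

print(col_1d.shape)  # (4,)
print(col_2d.shape)  # (4, 1)
```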



### Single Values

We can also select a single value in an array by specifying `array[row_index,column_index]`.

Select row `316` column `0`.

In [22]:
np_precip[316,0]

'8/13/15'

### Combining Ranges

Using what we learned above, let's select the first `5 columns` and the first `3 years` of data (1,095 rows).

Save this as a new variable called `first_three_seasons` and display the results.

**Syntax:**

`array[row_start:row_end,column_start:column_end]`

**Note:** Columns should correspond to `Date`, `Air Temp Max (C)`, `Air Temp Min (C)`, `24-hour Total Precip (mm)`, and `Season Total Precip (mm)` if the dataset is unchanged.

In [25]:
first_three_seasons = np_precip[0:1095,0:5]
print(first_three_seasons) #printing here to check

[['10/1/14' 16 3 0 0]
 ['10/2/14' 22 1 0 0]
 ['10/3/14' 24 5 0 0]
 ...
 ['9/27/17' 19 4 0 3064]
 ['9/28/17' 19 4 0 3064]
 ['9/29/17' 17 4 0 3064]]


Call `first_three_seasons.shape` (without parentheses, since `shape` is an attribute rather than a method) to get the number of rows and columns.

In [26]:
first_three_seasons.shape

(1095, 5)

Now, what if we want to drop the columns `Air Temp Max (C)` and `Air Temp Min (C)` and keep the precipitation information?

We could use `array[:,[index,index,index]]` to create a sub-selection.

Select the `Date`, `24-hour Total Precip (mm)`, and `Season Total Precip (mm)` columns (with all rows) using this method.

Overwrite `first_three_seasons` with this new data.

In [27]:
first_three_seasons = np_precip[:,[0,3,4]] #selecting from the full array, so all 1826 rows are kept (not just the first 1,095)

Call `first_three_seasons.shape` again to verify the change.

In [29]:
first_three_seasons.shape #1826 rows (every row of np_precip) and 3 columns

(1826, 3)
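
A row range and a list of column indices can also be combined in a single selection, which is one way to keep only some rows while sub-selecting columns. The sketch below uses a hypothetical 6-row array `demo` to compare the two shapes.

```python
import numpy as np

# Hypothetical 6-row, 4-column array
demo = np.arange(24).reshape(6, 4)

all_rows = demo[:, [0, 2, 3]]      # every row, three chosen columns
first_rows = demo[0:4, [0, 2, 3]]  # row range and column list combined

print(all_rows.shape)    # (6, 3)
print(first_rows.shape)  # (4, 3)
```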

### Sampling Techniques

In addition to creating selections using indices and ranges, you can also use sampling methods to generate selections.

Take a simple random sample of `250` observations from `first_three_seasons` column `24-hour Total Precip (mm)` using `np.random.choice()` ([Reference](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)).

Save this as a new variable called `sample` and display the results.

In [38]:
sample = np.random.choice(first_three_seasons[:,1], size = 250, replace = False) #column 1 is "24-hour Total Precip (mm)", the second column we kept above; replace = False so no observation is drawn twice
sample.shape #shows a size of 250

(250,)
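
By default, `np.random.choice()` samples with replacement, so the same observation can be drawn more than once unless `replace=False` is passed. The sketch below illustrates the difference using NumPy's `default_rng` generator, which accepts a seed so the sample can be reproduced. The seed value and the array `values` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
values = np.arange(1000)              # hypothetical population of 1000 distinct values

without_repl = rng.choice(values, size=250, replace=False)
with_repl = rng.choice(values, size=250, replace=True)

print(len(np.unique(without_repl)))  # always 250: no value repeats
print(len(np.unique(with_repl)))     # 250 or fewer: repeats are possible
```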

Take the mean of this sample using `np.mean()`.

In [39]:
np.mean(sample) #sample mean of 5.0

5.0
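
Beyond the mean, NumPy provides other descriptive statistics such as `np.median()`, `np.std()`, `np.min()`, and `np.max()`. Because arrays converted from a mixed-type DataFrame have `dtype=object`, converting to a numeric type first is a reasonable precaution. The readings below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical precipitation readings stored as an object array
obj_values = np.array([0, 0, 12, 3, 0, 41, 7], dtype=object)
values = obj_values.astype(float)  # convert to float64 before computing statistics

print(np.mean(values))    # 9.0
print(np.median(values))  # 3.0
print(np.std(values))     # population standard deviation
print(np.min(values), np.max(values))
```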

## Task 3: Thought Question

> *Compare working with Arrays in NumPy to working with DataFrames in Pandas. Do you think Pandas functions are easier to work with than those in NumPy? Why do you feel this way?*

I think so far it seems easier (and a little more intuitive) to work with DataFrames in Pandas than arrays in NumPy, mostly for the visualization aspect. I think for most people, thinking about data interaction and manipulation in terms of rows and columns is easier to grasp than thinking in terms of multi-dimensional arrays. I know NumPy is overall better for calculations (the demonstration talked about the computational cost that we haven't had to consider yet), but for me personally, the visualization and general syntax of Pandas functions seem easier to work with than functions in NumPy. I'm sure with more exposure, I'll find out they're not that different intuitively.