# Welcome to our 1-Hour Python Introduction!

This is a Jupyter notebook containing a very brief introduction to Python and its applications for CIGLR summer fellows.

In [None]:
# We must import the packages we use for our coding! Python has A LOT of packages
import numpy as np

# Quick Introduction: Variables, Strings, and Arrays

In [None]:
# Setting a variable

# A variable can hold an integer, a float, or a string...you don't have to specify the type!
a = 5
b = 10.0
c = 'hello'

print('int',a)
print('float',b)
print('string',c)


In [None]:
# math with variables

# you can do math between floats and integers...the result is a float whenever a float is involved!

test1 = a+b
test2 = a+5 # this will stay integer!
test3 = a+5.0

print(test1)
print(test2)
print(test3)
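To make the integer/float rule concrete, here is a minimal sketch that checks the result types with the built-in `type()`. The names `n`, `x`, and the result variables are made up for this illustration and are not used elsewhere in the notebook.

```python
# Illustrative values (hypothetical names, chosen for this sketch only)
n = 5        # int
x = 10.0     # float

int_sum = n + 5      # int + int stays int
mixed_sum = n + x    # int + float is promoted to float
quotient = n / 5     # true division always returns a float, even for two ints
floor_q = n // 2     # floor division of two ints stays int

print(type(int_sum).__name__, int_sum)      # int 10
print(type(mixed_sum).__name__, mixed_sum)  # float 15.0
print(type(quotient).__name__, quotient)    # float 1.0
print(type(floor_q).__name__, floor_q)      # int 2
```

The one surprise for newcomers is usually `/`: it returns a float even when both inputs are integers.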

### Arrays

In [None]:
# The Numpy Python package makes working with arrays incredibly easy. 

# create random array
arr = np.random.rand(10)
print(arr)

In [None]:
# Math with arrays 

# you can add/multiply a scalar value to an array and it will be applied to all entries
arr_add = arr + 1.0
print(arr_add)
print('')

arr_mult = arr * 2.0
print(arr_mult)

In [None]:
# you can also add/multiply two arrays together...but their shapes must match (or be compatible, like an array and a scalar)!

#create an array of integers
ints = np.arange(0,10)
print(ints)
print('')

# numpy makes it easy to see the "shape" of each array
print(arr.shape, ints.shape) #<-- each of these arrays has 10 entries
print('')

# now add them together
arr_plus_int = arr + ints
print(arr_plus_int)
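What happens when the shapes do not match? The sketch below, using made-up arrays `x` and `y`, shows the `ValueError` NumPy raises, plus the one common exception: a length-1 array (like a scalar) is applied to every entry.

```python
import numpy as np

x = np.arange(10)          # shape (10,)
y = np.arange(5)           # shape (5,)

mismatch_error = None
try:
    x + y                  # shapes (10,) and (5,) cannot be combined
except ValueError as err:
    mismatch_error = err
    print('ValueError:', err)

# A length-1 array behaves like a scalar: it is applied to every entry
z = x + np.array([100])
print(z.shape)             # (10,)
```

Checking `.shape` before combining arrays is a cheap way to catch this kind of error early.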

**Activity #1**

Find the average between two arrays "a" and "b", and assign it to the variable "ab_mean" (important: do not find the mean of EACH array; instead, create a new array in which each entry is the average of the corresponding entries in "a" and "b").

In [None]:
a = np.asarray([2,5,8,3,6])
b = np.asarray([4,3,2,3,10])

# code your solution below





### Array Statistics

In [None]:
# Numpy makes it SUPER easy to find the statistics of different arrays

# create another random array for example purposes
rand = np.random.randint(0,50,size=30).astype(float)
print(rand)

In [None]:
# max and min
print('max',np.max(rand),'min',np.min(rand))

# mean
print('mean',np.mean(rand))

# standard deviation
print('standard deviation',np.std(rand))

In [None]:
# UH OH! Our dataset all of a sudden has NaNs! 
rand[5] = np.nan
print(rand)
print('')

# when you do math with a NaN in Python, the result is also NaN...that's trouble
print('Oh no!' ,1.0 + np.nan)
print('')

# cue dramatic music
print(np.max(rand),np.min(rand),np.mean(rand),np.std(rand))

In [None]:
# don't worry...there's an easy fix

# max and min
print('max',np.nanmax(rand),'min',np.nanmin(rand))

# mean
print('mean',np.nanmean(rand))

# standard deviation
print('standard deviation',np.nanstd(rand))

Numpy has quite a few statistical functions. See the link below for additional documentation:

https://numpy.org/doc/stable/reference/routines.statistics.html
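As a small taste of what else is in there, here is a minimal sketch with a hand-made array (`vals` is invented for illustration). It counts the missing values with `np.isnan` and then uses two more NaN-aware functions, `np.nanmedian` and `np.nanpercentile`.

```python
import numpy as np

# Small hand-made array with one missing value (illustrative data only)
vals = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

n_missing = np.isnan(vals).sum()      # count the NaNs (True counts as 1)
med = np.nanmedian(vals)              # median of the valid entries [1, 2, 4, 5]
p90 = np.nanpercentile(vals, 90)      # 90th percentile, ignoring the NaN

print(n_missing)   # 1
print(med)         # 3.0
print(p90)
```

Counting the NaNs first is a useful habit: it tells you how much of the data the statistics are actually based on.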

### Indexing Arrays

Array "indexing" allows us to pull out only a section of an array! But here's the kicker: Python uses 0-based indexing, meaning the first entry is "the 0th entry".
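Here is a minimal sketch of what 0-based indexing means in practice, using a made-up array of letters so the positions are easy to follow. Note that in a slice like `1:4`, the start index is included and the stop index is not.

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e'])   # illustrative array

first = letters[0]          # index 0 is the first entry
third = letters[2]          # index 2 is the THIRD entry
middle = letters[1:4]       # start is included, stop is excluded: indices 1, 2, 3
every_other = letters[::2]  # step of 2: indices 0, 2, 4

print(first, third)         # a c
print(middle)               # ['b' 'c' 'd']
print(every_other)          # ['a' 'c' 'e']
```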

In [None]:
# Grab the first entry in an array:
print(rand[0])

# Grab the last entry in an array:
print(rand[-1])

In [None]:
# grab the first four entries in an array:
print(rand[:4])

# grab the last four entries in an array:
print(rand[-4:])

Another great thing about Numpy arrays is that you can ask for the entries in an array that meet a certain condition! This is really handy when you have data that hasn't been quality controlled (QC'd) yet.
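Conditions can also be combined. The sketch below uses made-up temperatures (`temps`) and an assumed plausible range of -50 to 60; both are illustrative choices, not values from any real dataset. Use `&` for "and" and `|` for "or", and wrap each condition in its own parentheses.

```python
import numpy as np

# Illustrative temperatures with two out-of-range entries (-999 and 150)
temps = np.array([12.5, -999.0, 18.0, 150.0, 21.3])

# Each condition gets its own parentheses; & means both must be True
plausible = (temps > -50) & (temps < 60)

print(plausible)              # True/False for every entry
print(temps[plausible])       # only the entries inside the range
print(int(plausible.sum()))   # 3 (True counts as 1)
```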

In [None]:
# Here, we have an array with 999.0 data in it
data = np.asarray([4.,5.,3.2,999.0,6.,5.4,7.8,9.2,999.0,3.1,6.,8.])
print(data)
print('')

# this will really muddy up our statistics
print('max',np.max(data),'min',np.min(data),'mean',np.mean(data))

In [None]:
# The best way to take care of these 999 issues is to replace them with NaNs

# First, set a condition: we know that this dataset uses 999 as bad data. So our condition will be this:
condition = data==999.0
print(condition)


In [None]:
# Now, we can index the array with that "condition"
print(data[condition])

In [None]:
# We can assign NEW ENTRIES based on that condition, too! 
data[condition] = np.nan
print(data)
print('')

# now we can see the true statistics of the data
print('max',np.nanmax(data),'min',np.nanmin(data),'mean',np.nanmean(data))

In [None]:
# How about if we DON'T know the bad data code? Just use a reasonable condition
data = np.asarray([4.,5.,3.2,999.0,6.,5.4,7.8,9.2,999.0,3.1,6.,8.])
condition = data > 500

print(data[condition])

**Activity!!!**

Using the new array provided ("array"), REPLACE every entry that is less than 0 with NaNs (np.nan). 

In [None]:
array = np.random.randint(-10,20,size=30).astype(float)

In [None]:
# Code your solution here. 



# Pandas: A Time-Series Analysis Package

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## A Real Data Example: Opening a CSV into a Pandas Dataframe

The CSV file provided contains 2m temperature data and precipitation accumulation data at a 6-hourly resolution from 2019-01-01 to 2021-12-31. The data is from the ERA5 reanalysis, and is taken from the gridpoint closest to CIGLR. 
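If you want to see the reading step in isolation, here is a minimal sketch that builds a tiny stand-in CSV in memory. The column names `time`, `t2`, and `precip` follow this notebook, but the values are made up. It also uses `parse_dates=True`, which asks `read_csv` to convert the index to datetimes in the same call.

```python
import io
import pandas as pd

# A tiny stand-in for the real file; the values are invented for illustration
csv_text = """time,t2,precip
2019-01-01 00:00:00,270.1,0.0000
2019-01-01 06:00:00,271.4,0.0002
2019-01-01 12:00:00,273.0,0.0010
"""

# index_col picks the index column; parse_dates=True converts that index to datetimes
demo = pd.read_csv(io.StringIO(csv_text), index_col='time', parse_dates=True)

print(type(demo.index).__name__)   # DatetimeIndex
print(demo.columns.tolist())       # ['t2', 'precip']
```

Either approach (converting afterwards with `pd.to_datetime` or parsing during the read) should leave you with the same datetime index.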

In [None]:
# Read in CSV file of CIGLR temperature and precipitation data from ERA5
df = pd.read_csv('era5_ciglr_temp.nc',index_col='time')
df.index = pd.to_datetime(df.index) #<--this converts our time index into Pandas Datetime. It's very useful.
df

In [None]:
# right now, "t2" is temperature in Kelvin. Let's convert to Celsius 
# T(C) = T(K) - 273.15

df['t2'] = df['t2']-273.15
print(df.head()) # this just prints the first 5 entries



**ACTIVITY!!**

The precipitation column of this data is in units of "meters". Please convert to millimeters.

In [None]:
# code your solution here


## Indexing Pandas Dataframes

A dataframe with a Pandas Datetime index (i.e., the first column) is helpful, because we can index our time series based on a desired date, range of dates, year, time, etc. 

In [None]:
# Grab temp and precip data from one specific day/time
date = pd.to_datetime('2019-04-01 00:00:00')
print(df.loc[date])
print('')

# Now grab data from a range of day/times
date1 = pd.to_datetime('2019-04-01 00:00:00')
date2 = pd.to_datetime('2019-05-01 00:00:00')
print(df.loc[date1:date2])
print('')

We are rarely looking for one specific date/time. More often, we want data from one whole year, one whole month, one whole day, etc. Sometimes we are looking for data at the same time every day (e.g., 00Z every day). This is where the **pandas datetime index** comes in handy.
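One shortcut worth knowing is that a datetime index also accepts partial date strings in `.loc`. The sketch below uses a synthetic 6-hourly series (`ts`, with made-up values) rather than the CIGLR data.

```python
import numpy as np
import pandas as pd

# Synthetic 6-hourly series covering January and February 2020 (made-up values)
idx = pd.date_range('2020-01-01', '2020-02-29 18:00', freq=pd.Timedelta(hours=6))
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

jan = ts.loc['2020-01']                    # every timestamp in January 2020
one_day = ts.loc['2020-02-10']             # all four 6-hourly values on that day
span = ts.loc['2020-01-30':'2020-02-02']   # both endpoints are included

print(len(jan))       # 124
print(len(one_day))   # 4
print(len(span))      # 16
```

Partial strings are convenient for a single year, month, or day. Conditions on `index.year`, `index.day`, or `index.hour` are more flexible when you need something like "00Z on every day".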

In [None]:
# Grab data from the year 2020
print(df.loc[df.index.year==2020])

# How about data from the first of every month?
print(df.loc[df.index.day==1])

# How about data on 00Z on the first day of every month?
print(df.loc[(df.index.day==1)&(df.index.hour==0)])

In [None]:
# Don't forget you can grab only specific variables (columns)!

print(df['t2'].head())

# The code below gets temperature data from 00Z on the first of every month
df['t2'].loc[(df.index.day==1)&(df.index.hour==0)]

### Pandas Statistics

Much like with numpy arrays, you can find the statistics of a pandas time series using very simple commands
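Here is a minimal sketch on a made-up series (`temps_c`) showing several statistics in one call with `agg`. It also flags one difference worth knowing: pandas `std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` uses the population version (`ddof=0`), so the two can disagree.

```python
import numpy as np
import pandas as pd

# Made-up temperatures (deg C) for illustration
temps_c = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# Several statistics at once
summary = temps_c.agg(['min', 'max', 'mean', 'std'])
print(summary)

# Same data, different default definitions of standard deviation
pandas_std = temps_c.std()             # ddof=1 (sample)
numpy_std = np.std(temps_c.to_numpy()) # ddof=0 (population)
print(pandas_std, numpy_std)
```

Passing `ddof=0` to the pandas call (or `ddof=1` to the NumPy call) makes the two agree.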

In [None]:
# Find the max and min temperature during July of 2021.
tmax = df['t2'].loc[(df.index.month==7)&(df.index.year==2021)].max()
tmin = df['t2'].loc[(df.index.month==7)&(df.index.year==2021)].min()
print(tmax,tmin)

**Activity!!!**

Find the average difference between the temperature at 00Z and the temperature at 12Z from 2019-2021 at CIGLR.

In [None]:
# Code your solution here



## Visualizing with Pandas: Plotting a Time Series

In [None]:
# Let's plot our time series of temperature!

fig,ax = plt.subplots(1,1,figsize=(10,3))
df['t2'].plot(ylabel='2 m Temp (C)',title='Temperature at CIGLR')

Hmm...that's pretty noisy. Let's plot the daily mean temperature time series instead.

The pandas "resample" function is INCREDIBLY handy here
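To see exactly what resampling does, here is a minimal sketch on two days of made-up 6-hourly values (`t6`), small enough to check by hand. Each calendar day's four values collapse into one.

```python
import pandas as pd

# Two days of made-up 6-hourly temperatures
idx6 = pd.date_range('2021-07-01', periods=8, freq=pd.Timedelta(hours=6))
t6 = pd.Series([10.0, 12.0, 20.0, 14.0, 11.0, 13.0, 21.0, 15.0], index=idx6)

daily_mean = t6.resample('1D').mean()   # one value per calendar day
daily_max = t6.resample('1D').max()     # other aggregations can replace mean()

print(daily_mean.tolist())   # [14.0, 15.0]
print(daily_max.tolist())    # [20.0, 21.0]
```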

In [None]:
# "Resample" our time series at a frequency of 1 Day (or '1D') using "mean" as the operator
dailymean_df = df['t2'].resample('1D').mean()
print(dailymean_df.head())

fig,ax = plt.subplots(1,1,figsize=(10,3))
dailymean_df.plot(ylabel='2 m Temp (C)',title='Daily Mean Temperature at CIGLR')

Still too noisy? Let's plot our monthly mean temperatures, and put all three on the same plot.

In [None]:
monmean_df = df['t2'].resample('MS').mean() # 'MS' = month start, so each mean is labeled with the first of its month

fig,ax = plt.subplots(1,1,figsize=(10,3))

df['t2'].plot(ylabel='2 m Temp (C)',color='k',alpha=0.4,
              title='Different Frequencies of Temperature at CIGLR')
dailymean_df.plot(color='black')
monmean_df.plot(color='red',linewidth=2);

Notice...the monthly means are assigned to the first of every month. That's why the red line ends before the rest of the data.
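Where each monthly value lands on the time axis depends on the frequency alias passed to `resample`. As a minimal sketch on made-up daily data (`daily`), the `'MS'` (month start) alias labels each month with its first day:

```python
import pandas as pd

# Made-up daily values covering January and February 2021
days = pd.date_range('2021-01-01', '2021-02-28', freq='D')
daily = pd.Series(1.0, index=days)

# 'MS' = month start: each bin is labeled with the first day of its month
monthly = daily.resample('MS').sum()

print(monthly.index.strftime('%Y-%m-%d').tolist())  # ['2021-01-01', '2021-02-01']
print(monthly.tolist())                             # [31.0, 28.0]
```

Month-end aliases label the same bins with the last day of the month instead, which shifts where the points appear on a plot.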

In [None]:
# What about precipitation? Histograms are a great way to show precipitation data

# get daily precipitation totals!
daily_precip = df['precip'].resample('1D').sum()

fig,ax = plt.subplots(1,1,figsize=(10,4))
daily_precip.hist()

The thing about precipitation is that, of COURSE, there are going to be a lot of tiny accumulation amounts (think of how many days we don't get any precip at all). So let's rearrange the bins (the x-axis) to accommodate this by not including the 0 mm days.
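To see how custom bin edges drop the dry days, here is a minimal sketch using `np.histogram` on made-up daily totals (`totals` is invented; the edges mirror the ones used in this notebook). Anything below the first edge or above the last edge is simply not counted.

```python
import numpy as np

# Made-up daily totals in mm: three dry days and five wet days
totals = np.array([0.0, 0.0, 0.0, 0.15, 0.5, 0.7, 2.5, 7.0])
edges = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1, 1.5, 2, 3, 4, 5, 10]

# counts[i] is the number of values between edges[i] and edges[i+1]
counts, _ = np.histogram(totals, bins=edges)

print(len(counts))        # 12 bins from 13 edges
print(int(counts.sum()))  # 5: the three 0 mm days fall below the first edge
print(counts.tolist())
```

Keep in mind that any day wetter than the last edge (10 mm here) would be left out too, so it is worth checking the maximum before choosing the bins.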

In [None]:
bins = [0.1,0.2,0.3,0.4,0.6,0.8,1,1.5,2,3,4,5,10]

fig,ax = plt.subplots(1,1,figsize=(10,4))
daily_precip.plot(kind='hist',bins=bins,edgecolor='k',title='Daily Precipitation (mm)')

Honestly, it's not the prettiest thing in the world, but it gets the job done. Just for fun, let's plot the time series of precipitation.


In [None]:
fig,ax = plt.subplots(1,1,figsize=(10,3))
daily_precip.plot()

## More Pandas Statistics

Problem: How many days did CIGLR exceed the 99th percentile of precipitation for the location? By how much?

In [None]:
# First, find the 99th percentile (the 0.99 "quantile") of daily precipitation
ninenine_perc = daily_precip.quantile(0.99)
print(ninenine_perc)

# Keep only the days at or above that amount (the result is a pandas Series)
df_upper = daily_precip[daily_precip>=ninenine_perc]

# How many days exceed the 99th percentile of precip?
print(len(df_upper),'days')

# By how much did each day exceed the upper percentile?
print(df_upper-ninenine_perc)
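One detail to keep in mind: `>=` also keeps values exactly equal to the threshold, while `>` keeps only values strictly above it. A minimal sketch with made-up numbers (`demo_vals`, using the 75th percentile so the arithmetic is easy to check by eye):

```python
import pandas as pd

# Made-up values 0, 1, ..., 100 so the percentiles are easy to check
demo_vals = pd.Series(range(101), dtype=float)

p75 = demo_vals.quantile(0.75)               # linear interpolation by default
above = demo_vals[demo_vals > p75]           # strictly greater than the threshold
at_or_above = demo_vals[demo_vals >= p75]    # also keeps values equal to it

print(p75)               # 75.0
print(len(above))        # 25 (76 through 100)
print(len(at_or_above))  # 26 (75 through 100)
```

With real precipitation data an exact tie with the threshold is less likely, but it is worth deciding up front which of the two comparisons matches the question being asked.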
