## **ATOC4500 Data Science Lab: Final Project**
## **Using rapid ice loss events to predict when CESM1 ensemble members go ice free**
#### **Author: Daphne Quint, daqu2831@colorado.edu**
#### **Last updated: April 14, 2022**

---------------------------------------------------------------------------------------

### Import packages

In [1]:
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from random import randint
from sklearn import preprocessing
from sklearn.utils import resample
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, confusion_matrix
import seaborn as sns
from sklearn.linear_model import LogisticRegression

### Define functions

In [2]:
def find_year(data, member):
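    '''
    Finds the first year (counting from 2020) in which one member's extent
    is at or below 1 million square km. Returns None if the member never
    reaches that threshold.
    '''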
    
    data_ = data.sel(member=member)
    
    year = 2020
    for i in data_:
        if i>1:
            year += 1
        else:
            return year
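
As a minimal sketch of the logic behind `find_year` and the five-year running mean used in `create_df`, the pure-Python example below applies a trailing five-year mean to a made-up extent series and reports the first year the mean is at or below 1 million square km. The helper names and the extent values are hypothetical and are not taken from the CESM1 data.

```python
def trailing_mean(values, window=5):
    """Trailing running mean; positions without a full window are None."""
    means = []
    for i in range(len(values)):
        if i + 1 < window:
            means.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            means.append(sum(chunk) / window)
    return means


def first_ice_free_year(years, smoothed, threshold=1.0):
    """First year whose smoothed extent is at or below the threshold, else None."""
    for year, value in zip(years, smoothed):
        if value is not None and value <= threshold:
            return year
    return None


# Hypothetical extents (million square km), for illustration only
years = list(range(2040, 2050))
extent = [2.0, 1.8, 1.5, 1.2, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

smoothed = trailing_mean(extent)
ice_free = first_ice_free_year(years, smoothed)
print(ice_free)
```

In this toy series the single-year extent first drops below 1 in 2044, while the smoothed series crosses later, which is why the running mean gives a more conservative ice-free year.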

In [3]:
def create_df(month):
    #### find the ice free year for that month
    
    ### first find the 5 year running mean
    
    # select SIE for the requested month (members 1-40)
    SIE_sept = SIE['CESM1'].sel(time=SIE['time.month']==month).sel(member=np.arange(1, 41, 1))

    # find the 5 year running mean
    five_year_mean = SIE_sept*0

    for i in range(1, 41):
        five_year_mean[i-1] = SIE_sept.sel(member=i).rolling(time=5).mean()

    five_year_mean = five_year_mean.sel(time=slice('2020', '2100'))
    
    ### then find the ice free year
    ice_free_year = []
    for i in range(1, 41):
        ice_free_year.append(find_year(five_year_mean, i))
    ice_free_year = np.array(ice_free_year)
    
    #### find max amount of ice lost in the month
    # (the most negative indicator value is the largest loss; the sign is flipped below)
    
    ice_lost_max = []
    for i in range(1, 41):
        ice_lost_max.append(float(nb_ext_data['RILE Indicator'].sel(member=i).sel(month=month).min().values))
    ice_lost_max = np.array(ice_lost_max)
    
    #### find longest duration for the month
    
    length_max = []
    for i in range(1, 41):
        length_max.append(float(length_data['Length'].sel(member=i).sel(month=month).max().values))
    length_max = np.array(length_max)
    
    #### create dataframe
    
    # use *_df names so the month argument is not overwritten by a DataFrame
    member_df = pd.DataFrame(data=np.arange(1, 41), columns=['Member'])
    month_df = pd.DataFrame(data=(np.zeros(40)+month), columns=['Month'])
    ice_free_yr_df = pd.DataFrame(data=ice_free_year, columns=['Ice Free Year'])
    ice_lost_max_df = pd.DataFrame(data=ice_lost_max, columns=['Max Ice Lost'])*-1
    length_max_df = pd.DataFrame(data=length_max, columns=['Longest Duration'])
    
    this_month_df = pd.concat([member_df, month_df, ice_free_yr_df, ice_lost_max_df, length_max_df], axis=1)
    
    return this_month_df

In [9]:
def define_holdout_data(x, y, verbose):
    """Perform an 80/20 train-test split (80% of data is training, 20% is testing). Split is randomized with each call."""
    random_state = randint(0,1000)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=random_state)
    if verbose==True:
        print("Prior to scaling and rebalacing...")
        print("Shape of training predictors: "+str(np.shape(x_train)))
        print("Shape of testing predictors: "+str(np.shape(x_test)))
        print("Shape of training predictands: "+str(np.shape(y_train)))
        print("Shape of testing predictands: "+str(np.shape(y_test)))
        print(" ")
    return x_train, x_test, y_train, y_test
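
Because only 5 of the 40 members are labeled early (see the class counts in Step 3) and the split is not stratified, a random 8-member test set can end up with no early members at all. The sketch below estimates that chance with the hypergeometric formula. It assumes the test set is drawn uniformly at random without replacement, and the function name is hypothetical.

```python
from math import comb


def prob_no_positives(n_total, n_positive, n_test):
    """Chance that a random test set of size n_test holds zero positives."""
    return comb(n_total - n_positive, n_test) / comb(n_total, n_test)


p_empty = prob_no_positives(n_total=40, n_positive=5, n_test=8)
print(f"P(no early members in the test set) = {p_empty:.3f}")
```

When that happens, the test set has no early rows for `balance_data` to oversample, so passing `stratify=y` to `train_test_split` may be worth considering.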

In [10]:
def scale_data(x_train, x_test):
    """
    Scale training data so that model reaches optimized weights much faster. 
    
    *All data that enters the model should use the same scaling used to scale the training data.*
    Thus, we also perform scaling on testing data for validation later. 
    Additionally, we return the scaler used to scale any other future input data.
    """
    
    scaler = preprocessing.MinMaxScaler() # normalize 
    x_train_scaled = pd.DataFrame(data=scaler.fit_transform(x_train),index=x_train.index,columns=x_train.columns) 
    x_test_scaled = pd.DataFrame(data=scaler.transform(x_test),index=x_test.index,columns=x_test.columns)
    
    return scaler, x_train_scaled, x_test_scaled
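
To illustrate why the scaler is fit on the training data only, here is a minimal pure-Python sketch of min-max scaling. It is not the `MinMaxScaler` implementation; the helper names and the values are hypothetical, and it assumes the training column is not constant.

```python
def fit_min_max(train_column):
    """Learn the minimum and range from the training values only."""
    low = min(train_column)
    span = max(train_column) - low
    return low, span


def apply_min_max(column, low, span):
    """Scale values with previously learned parameters."""
    return [(value - low) / span for value in column]


# Hypothetical 'Longest Duration' values, for illustration only
train = [2.0, 4.0, 6.0, 10.0]
test = [1.0, 8.0, 12.0]

low, span = fit_min_max(train)
train_scaled = apply_min_max(train, low, span)
test_scaled = apply_min_max(test, low, span)
print(train_scaled)
print(test_scaled)
```

Training values land in [0, 1] by construction, while test values outside the training range fall below 0 or above 1. That is expected when the same scaling is reused.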

In [12]:
def balance_data(x,y,verbose):
    """Oversample the early class so the model is not biased towards the more common not-early outcome."""
    # Combine again into one dataframe to ensure both the predictors and the predictand are resampled from the same 
    # observations based on predictand outcomes. 
    dataset = pd.concat([x, y],axis=1)

    # Separating classes
    early = dataset[dataset['early_bin'] == 1]
    not_early = dataset[dataset['early_bin'] == 0]

    random_state = randint(0,1000)
    oversample = resample(early, 
                           replace=True, 
                           n_samples=len(not_early), #set the number of samples to equal the number of the majority class
                           random_state=random_state)

    # Returning to new training set
    oversample_dataset = pd.concat([not_early, oversample])

    # reseparate oversampled data into X and y sets
    x_bal = oversample_dataset.drop(['early_bin'], axis=1)
    y_bal = oversample_dataset['early_bin']

    if verbose==True:
        print("After scaling and rebalacing...")
        print("Shape of predictors: "+str(np.shape(x_bal)))
        print("Shape of predictands: "+str(np.shape(y_bal)))
        print(" ")
    
    return x_bal, y_bal
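
The following is a minimal sketch of random oversampling with replacement, written with the standard library rather than `sklearn.utils.resample`. The function name and seed are hypothetical. The class sizes (29 not early, 3 early) are consistent with the Step 3 training output, where a 32-row training split becomes 58 balanced rows.

```python
import random


def oversample_minority(rows, label_key, minority_label, seed=0):
    """Resample the minority class with replacement up to the majority size."""
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    rng = random.Random(seed)
    resampled = [rng.choice(minority) for _ in range(len(majority))]
    return majority + resampled


# 29 not-early rows and 3 early rows, matching a 32-row training split
rows = [{"early_bin": 0} for _ in range(29)] + [{"early_bin": 1} for _ in range(3)]
balanced = oversample_minority(rows, "early_bin", 1)
print(len(balanced))
```

With only 3 distinct early rows, the 29 resampled early rows are all duplicates of those 3, so balancing changes the class weights without adding new information.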

In [13]:
def dataprep_pipeline(x, y, verbose):
    """ Combines all the functions defined above so that the user only has to 
    call one function to do all data pre-processing. """
    # verbose=True prints the shapes of input & output data

    # split into training & testing data
    x_train, x_test, y_train, y_test = define_holdout_data(x, y, verbose) 

    # perform feature scaling
    scaler, x_train_scaled, x_test_scaled = scale_data(x_train, x_test)

    # rebalance according to outcomes (i.e., the number of early 
    # and not-early members should be equal)
    if verbose==True:
        print("for training data... ")
    x_train_bal, y_train_bal = balance_data(x_train_scaled, y_train, verbose)
    if verbose==True:
        print("for testing data... ")
    x_test_bal, y_test_bal = balance_data(x_test_scaled, y_test, verbose)
    
    return x_train_bal, y_train_bal, x_test_bal, y_test_bal

In [14]:
def bin_metrics(x, y):
    """Prints accuracy and recall metrics for evaluating 
    classification predictions."""
    
    accuracy = metrics.accuracy_score(x, y)
    recall = metrics.recall_score(x, y)

    print('Accuracy:', round(accuracy, 4))
    print('Recall:', round(recall, 4))
    
    return accuracy, recall

In [15]:
def plot_cm(x, y):
    """Plots the confusion matrix to visualize true 
    & false positives & negatives"""
    cm = confusion_matrix(x, y)
    df_cm = pd.DataFrame(cm, columns=np.unique(x), index = np.unique(x))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    sns.heatmap(df_cm, cmap="Blues", annot=True,annot_kws={"size": 25}, fmt='g')# font size
    plt.ylim([0, 2])
    plt.xticks([0.5, 1.5], ['Negatives','Positives'])
    plt.yticks([0.5, 1.5], ['Negatives','Positives'])

In [27]:
def rand_atmos_conditions_precip(index='rand'):
    """
    Function returns one member's conditions from sept_df as well as the scaled
    predictors for that member so that they can be passed to the model to 
    output a prediction.
    
    If no input is passed, the function will randomly generate an index to 
    choose from the observations in a fresh training split. 
    Otherwise, a valid integer position in that training split should be passed.
    """
    # First, perform a test-train split
    x_train, x_test, y_train, _ = define_holdout_data(x, y, verbose=False) 

    # perform feature scaling
    _, x_train_scaled, _ = scale_data(x_train, x_test)

    # this is what will go into the model to output a prediction
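    if index=='rand':
        # randint includes both endpoints, so subtract 1 to stay in range
        index = randint(0,len(y_train[y_train==1].index)-1) 
    # note: y_train is a one-column DataFrame, so this mask keeps every row
    # (non-matching entries become NaN) and the pick is not limited to early members
    precipindex = y_train[y_train==1].index.values[index]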
    testpredictor = x_train_scaled.loc[precipindex] 
    
    return sept_df.iloc[precipindex], testpredictor 

## Step 1: Read in Data

In [4]:
data_path = '/home/daphne/Documents/School/research/icefreeproject/Data/'

# Amount of sea ice lost and sea ice extent data for each rapid ice loss event (RILE)
nb_ext_data = xr.open_dataset(data_path+'RILE_nbext_CESM.nc')

# length data (number of consecutive years with a RILE in that month)
length_data = xr.open_dataset(data_path+'CESM_rile_length.nc')

# extent data - can be used to find ice free year for each member
SIE = xr.open_dataset(data_path+'CLIVAR_SIE_1850_2100_RCP85.nc')

## Step 2: Munge Data

In [5]:
sept_df = create_df(9)

In [6]:
#sept_df

## Step 3: Apply Data Science Method

#### Logistic Regression

In [7]:
# create a feature that indicates whether or not the member goes ice free in 2043 or earlier
sept_df['early_bin'] = np.array(sept_df['Ice Free Year']<=2043).astype(int)

In [19]:
# features that we will use to predict ice free year
x = sept_df.drop(['Month','Member', 'Ice Free Year', 'early_bin'],axis=1)

# what we are trying to predict: whether the member goes ice free early
y = sept_df.drop(['Month','Member', 'Longest Duration', 'Max Ice Lost', 'Ice Free Year'], axis=1)

In [11]:
y.value_counts()

early_bin
0            35
1             5
dtype: int64
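
These counts show a strong class imbalance. As a quick illustration of why accuracy alone can mislead on the unbalanced data, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class (not early). The function name is hypothetical.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most common class."""
    return max(class_counts.values()) / sum(class_counts.values())


counts = {0: 35, 1: 5}  # value counts printed above
baseline = majority_baseline_accuracy(counts)
print(baseline)
```

Such a classifier would never flag an early member, so its recall would be zero. This is one reason for tracking recall alongside accuracy.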

In [22]:
x_train_bal, y_train_bal, x_test_bal, y_test_bal = dataprep_pipeline(x, y, verbose=True)

Prior to scaling and rebalacing...
Shape of training predictors: (32, 2)
Shape of testing predictors: (8, 2)
Shape of training predictands: (32, 1)
Shape of testing predictands: (8, 1)
 
for training data... 
After scaling and rebalacing...
Shape of predictors: (58, 2)
Shape of predictands: (58,)
 
for testing data... 
After scaling and rebalacing...
Shape of predictors: (12, 2)
Shape of predictands: (12,)
 


In [23]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(x_train_bal, y_train_bal);

In [24]:
y_pred = lr.predict(x_test_bal)
lr_acc, lr_rec = bin_metrics(y_test_bal, y_pred)

Accuracy: 0.4167
Recall: 0.0


In [25]:
print("Training metrics:")
pred_train= lr.predict(x_train_bal) 
bin_metrics(y_train_bal,pred_train);

print(" ")
print("Testing metrics:")
bin_metrics(y_test_bal, y_pred);

Training metrics:
Accuracy: 0.6897
Recall: 0.6552
 
Testing metrics:
Accuracy: 0.4167
Recall: 0.0
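
These metrics can be unpacked by hand. `balance_data` makes both classes the same size, so the 12 balanced test rows are 6 early and 6 not early, and the 58 balanced training rows are 29 of each. A testing recall of 0.0 means none of the 6 early rows was predicted early, and a testing accuracy of 0.4167 (5/12) means 5 of the 6 not-early rows were classified correctly. The sketch below recomputes the printed values from those inferred counts; the function name is hypothetical.

```python
def accuracy_and_recall(tp, fn, fp, tn):
    """Accuracy and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Counts inferred from the printed metrics and the balanced class sizes
test_acc, test_rec = accuracy_and_recall(tp=0, fn=6, fp=1, tn=5)
train_acc, train_rec = accuracy_and_recall(tp=19, fn=10, fp=8, tn=21)

print("Testing:", round(test_acc, 4), round(test_rec, 4))
print("Training:", round(train_acc, 4), round(train_rec, 4))
```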


In [28]:
origvals, testpredictor = rand_atmos_conditions_precip()

In [30]:
lr_prediction = lr.predict_proba(np.array(testpredictor).reshape(1, -1))[0][1]*100 
print("The conditions are: ")
print(origvals)
print(" ")
print("There is a {0:.{digits}f}% chance a member will go ice free 2043 or before given those conditions.".format(lr_prediction, digits=2))

The conditions are: 
Member                30.000000
Month                  9.000000
Ice Free Year       2046.000000
Max Ice Lost           0.329655
Longest Duration      12.000000
early_bin              0.000000
Name: 29, dtype: float64
 
There is a 48.33% chance a member will go ice free 2043 or before given those conditions.
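
For a binary problem, the probability returned by `predict_proba` in logistic regression is the sigmoid of a linear score `z` (intercept plus weighted predictors): `p = 1 / (1 + exp(-z))`. The sketch below shows that mapping in plain Python and recovers the log-odds implied by the 48.33% printed above. The helper names are hypothetical and no fitted coefficients from `lr` are used.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of the sigmoid: the linear score that yields probability p."""
    return math.log(p / (1.0 - p))


z = log_odds(0.4833)  # probability printed above, as a fraction
print(round(z, 4))
print(round(sigmoid(z) * 100, 2))
```

A probability just under 50% corresponds to a score slightly below zero, so `lr.predict` would label this member as not early.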




## Step 4: Present results visually using 2-3 graphs

## Summary