# Homework 3

This homework explores linear regression and resampling techniques by analysing data from a database of glaciers. The database is *GlaThiDa*, the [*Glacier Thickness Database*](https://www.gtn-g.ch/data_catalogue_glathida/).

1. Data prep (5 points)
2. Mapping (10 points)
3. Correlations between parameters (5 points)
4. Linear regression and resampling techniques (10 points)

## 1. Data Prep (5 points total)

### a) Download data (1 point) 
The database is saved on a GitLab repository that you may clone: https://gitlab.com/wgms/glathida.git



In [85]:
# !git clone https://gitlab.com/wgms/glathida.git

In [86]:
band = 'glathida/data/band.csv'
glacier = 'glathida/data/glacier.csv'
point = 'glathida/data/point.csv'
survey = 'glathida/data/survey.csv'

### b) Import Python modules (1 point) 
Import pandas, geopandas, and numpy, along with the libraries for plotting, raster files, and netCDF.

In [87]:
# !pip install wget
# !pip install plotly

In [88]:
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

import netCDF4 as nc
import numpy as np
import pandas as pd
import rasterio
import wget
import requests, zipfile , os, io

from rasterio.mask import mask
from rasterio.plot import show

 
import plotly.express as px



### c) Read data (2 points)
Read the glacier data from the file ``glathida/data/glacier.csv`` into a pandas data frame, and briefly describe the dataframe content and its first few lines.

In [89]:
# solution
t_path = "glathida/data/glacier.csv"
glacier = pd.read_csv(t_path, index_col=0)

print("This data set has", glacier.shape, "rows and columns" )
print(glacier.describe())
# Many columns contain rows with NaN values.
# This dataset contains location, date, area, ID, thickness parameters, and mean slope data.
# It also contains additional information pertaining to the profiles that were surveyed and the methods used.


This data set has (1013, 20) rows and columns
         survey_id          lat          lon          area  mean_slope  \
count  1013.000000  1013.000000  1013.000000    985.000000  426.000000   
mean    130.707799    57.084700    14.987602    217.599979    9.091549   
std      75.321429    29.473871    51.341702   1602.939314    5.806169   
min       1.000000   -74.583300  -151.300000      0.026400    0.000000   
25%      69.000000    46.453610    10.700000      2.204000    6.000000   
50%     128.000000    62.039400    14.687100     10.959000    8.000000   
75%     203.000000    78.776000    22.280000     98.341000   12.000000   
max     256.000000    81.767200   170.320000  40000.000000   48.000000   

       mean_thickness  mean_thickness_uncertainty  max_thickness  \
count      498.000000                  127.000000     525.000000   
mean        70.375502                    7.125984     197.563810   
std         69.053311                    5.988073     199.968677   
min          4.

In [90]:
glacier.head()

id,survey_id,name,external_db,external_id,lat,lon,date,max_date,area,mean_slope,mean_thickness,mean_thickness_uncertainty,max_thickness,max_thickness_uncertainty,number_points,number_profiles,length_profiles,interpolation_method,flag,remarks
1,1,Isfallsglaciären,WGI,SE4B000E0006,67.915,18.568,1979-03-01,1979-03-31,1.3,,72.0,,220.0,,,,,,,
2,1,Rabots glaciär,WGI,SE4B000E1016,67.91,18.496,1979-03-01,1979-03-31,4.1,,84.0,,175.0,,,10.0,,,,
3,1,Storglaciären,WGI,SE4B000E0005,67.9,18.57,1979-03-01,1979-03-31,3.1,,99.0,,250.0,,,,,,,
4,2,South Cascade Glacier,WGI,US2M00264006,48.35698,-121.05735,1975-01-01,1975-12-31,2.0,,99.0,,195.0,,,,,,,
5,3,Athabasca Glacier,FOG,7,52.1754,-117.284,,,3.8,,150.0,,,,,,,,,


**Explore the data with visualization**

Before fitting any model to the data, we will start by exploring basic correlations among parameters by plotting. In particular, we will focus on the ``mean_thickness``, ``area``, and ``mean_slope`` parameters.

### d) Remove bad data (1 point)

The database may contain NaNs and other "bad" values (welcome to the data world!). First we will clean the data by removing rows with NaNs. We are mostly interested in the thickness, area, and slope.
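As a small illustration of this cleaning step, the sketch below counts missing values per column and then drops the incomplete rows. The toy DataFrame `demo` and its values are made up for the example; only the column names come from the glacier table.

```python
import numpy as np
import pandas as pd

# Toy table (hypothetical values) with the three columns of interest.
demo = pd.DataFrame({
    "mean_thickness": [72.0, np.nan, 99.0, 150.0],
    "area": [1.3, 4.1, np.nan, 3.8],
    "mean_slope": [10.0, 8.0, 12.0, np.nan],
})

cols = ["mean_thickness", "area", "mean_slope"]

# Number of missing entries in each column.
n_missing = demo[cols].isna().sum()
print(n_missing)

# Keep only the rows where all three columns are present.
demo_clean = demo.dropna(subset=cols)
print(len(demo), "rows before,", len(demo_clean), "rows after")
```

Counting the NaNs first shows how much data each column costs you before any rows are removed.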



In [91]:
#answer below 
glacier.dropna(subset=['mean_thickness', 'area', 'mean_slope'], inplace=True)

## 2. Mapping glaciers (10 points)

Make a global map of the glaciers. Use either of the tools we learned in class:
* Geopandas, DEMs from NetCDF files (see chapter 2.4)
* Pandas and Plotly (see chapter 2.2). You may need to transform some of the series into log-spaced values for better visualization.

### Option 1: Tif and matplotlib

You can use the ``elevation`` data from the DEM seen in class. Download the DEM file (https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1)

In [92]:
# This calls for and stores the data for use in this notebook
elevation = rasterio.open("https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1")


___Tips___: when plotting an image in ``matplotlib``, you need to add information about the physical dimensions of the image. You can calculate them from the ``bounds`` of the raster.

In [93]:
bounds = (elevation.bounds.left, elevation.bounds.right, \
          elevation.bounds.bottom, elevation.bounds.top)

In [94]:
elevation

<open DatasetReader name='https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1' mode='r'>

We will use ``matplotlib.pyplot`` to show the raster image in the background (tip: use ``imshow()``). ``imshow()`` expects a single array, not three separate (R, G, B) bands, so we will first stack the three bands together.

In [95]:
red = elevation.read(1)
green = elevation.read(2)
blue = elevation.read(3)
pix = np.dstack((red, green, blue))

ERROR 1: TIFFReadEncodedStrip:Read error at scanline 4294967295; got 0 bytes, expected 32400
ERROR 1: TIFFReadEncodedStrip() failed.
ERROR 1: /vsicurl/https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1, band 1: IReadBlock failed at X offset 0, Y offset 4805: TIFFReadEncodedStrip() failed.


RasterioIOError: Read or write failed. /vsicurl/https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1, band 1: IReadBlock failed at X offset 0, Y offset 4805: TIFFReadEncodedStrip() failed.
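The traceback above shows the read through `/vsicurl/` failing partway through the remote file. One possible workaround, offered as an assumption rather than a confirmed fix, is to download the file once and open the local copy. `fetch_once` is a hypothetical helper name, and the local file name is chosen only for illustration.

```python
import os
import urllib.request


def fetch_once(url, local_path):
    """Download `url` to `local_path` unless that file already exists."""
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path


# Usage in this notebook (commented out because it needs network access):
# dem_path = fetch_once(
#     "https://www.dropbox.com/s/j5lxhd8uxrtsxko/HYP_50M_SR.tif?dl=1",
#     "HYP_50M_SR.tif",
# )
# elevation = rasterio.open(dem_path)
```

Reading from a local file also avoids downloading the DEM again each time the cell is rerun.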

In [None]:
plt.imshow(pix, cmap=plt.cm.viridis,
                 extent=bounds)

### Option 2: Plotly

You may use Plotly. For improved visibility, transform some of the data to log scale. You may add these transformed Series to the DataFrame as new columns and use them as input to Plotly.

In [None]:

import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'vscode' # renders inline in VS Code
# pio.renderers.default = 'iframe' # writes each figure to a standalone HTML file shown in an iframe
# pio.renderers.default = 'png' # renders a static PNG image
# try notebook, jupyterlab, png, vscode, iframe

In [None]:
# solution


## 3. Correlations between data parameters (5 points total)

Make plots to visualize the correlation, or lack thereof, among the three parameters. Make at least three plots.

### a) Basic correlations using Matplotlib (2 points)

Make 3 plots using matplotlib to visualize ``mean_slope``, ``mean_thickness``, and ``area``. Use log scale to see the correlations.

__Tips__: 
* Use the function ``scatter`` to plot the values of mean thickness, mean slope, area, and latitude. 
* Use one of the dataframe columns as a color using the argument ``c``. You can also vary the colormap using the argument ``cmap``. Help on colormaps can be found here: https://matplotlib.org/stable/tutorials/colors/colormaps.html. Be mindful of color-vision-deficient readers and read *Crameri, F., Shephard, G.E. and Heron, P.J., 2020. The misuse of colour in science communication. Nature communications, 11(1), pp.1-10. https://doi.org/10.1038/s41467-020-19160-7* (find it on the class Gdrive). You can add a third variable by choosing a marker color that scales with another parameter. For instance, try coloring your markers with the ``lat`` column to look at systematic latitudinal trends from the equator to the poles.
* Do not forget to adjust the font size, figure size (at least 10,8), grid, and labels with the units of the features (example: km). You may also explore the *logarithmic* correlations by mapping the axis from linear to logarithmic scale with ``plt.xscale('log')``.

In [None]:
# Figure 1: Mean slope vs mean thickness
# solution
fig, axs = plt.subplots(figsize=(10,8))

slope_thickness = axs.scatter(data=glacier, x='mean_slope', y='mean_thickness', c='lat', cmap='Blues')
axs.set_xlabel('Mean Slope (º)', fontsize=12); axs.set_ylabel('Mean Thickness (m)', fontsize=12)

axs.set_ylim(0,125); #axs.set_xlim(0,100)
#easier to see correlation with log scale on x-axis

plt.colorbar(slope_thickness, label= 'Latitude (º)', orientation='horizontal')

plt.xscale('log')
plt.tight_layout(pad=1)
fig.suptitle('Preliminary Correlation Analysis of Glacier Parameters Mean Thickness and Mean Slope'
             , y=1.02, fontsize=15) # or plt.suptitle('Main title')

plt.show()

In [None]:
# Figure 2: area vs mean thickness
# solution
fig, axs = plt.subplots(figsize=(10,8))

area_thickness = axs.scatter(data=glacier, x='area', y='mean_thickness',  c='lat', cmap='Blues')

# x is area and y is mean thickness, so label the axes accordingly
axs.set_xlabel('Glacier Area (km^2)', fontsize=12); axs.set_ylabel('Mean Thickness (m)', fontsize=12)

plt.colorbar(area_thickness, label= 'Latitude (º)', orientation='horizontal')
#easier to see correlation with log scale on x-axis
plt.xscale('log')
plt.tight_layout(pad=1)
fig.suptitle('Preliminary Correlation Analysis of Glacier Parameters Area and Mean Thickness',
             y=1.01, fontsize=15) # or plt.suptitle('Main title')

plt.show()

In [None]:
# Figure 3: area vs mean slope
# solution
fig, axs = plt.subplots(figsize=(10,8))

area_slope = axs.scatter(data=glacier, x='area', y='mean_slope',  c='lat', cmap='Blues')
axs.set_ylabel('Mean Slope (º)', fontsize=12); axs.set_xlabel('Glacier Area (km^2)', fontsize=12)

plt.colorbar(area_slope, label= 'Latitude (º)', orientation='horizontal')
#easier to see correlation with log scale on x-axis
plt.xscale('log')
plt.tight_layout(pad=1)
fig.suptitle('Preliminary Correlation Analysis of Glacier Parameters Area and Mean Slope',
             y=1.01, fontsize=15) # or plt.suptitle('Main title')

plt.show()

### b) 3D Scatter plot using Plotly (1 point)

Use the Plotly ``scatter_3d`` plot. Make sure to transform the pandas Series to log scale where it improves visibility.

In [None]:
# solution

px.scatter_3d(glacier, x='area', y='mean_slope', z='mean_thickness',
              color='lat', log_x=True)




### c) Pandas Correlation function (1 point)

You may use Pandas functionalities to explore correlation between data. Use the function ``corr`` on the dataframe and the matplotlib function ``matshow`` to plot a heatmap of the correlations

In [None]:
#setting which parameters I want to look at
glacier_parameters = glacier[['mean_thickness','mean_slope','area']]

#getting the correlations of all these parameters
correlations = glacier_parameters.corr()

correlations

In [None]:
#creating figure
fig, ax = plt.subplots(figsize=(10,8))

correlation_plot = ax.matshow(correlations, cmap='Blues')

# label each tick with the parameter name, in the same order as the columns of `correlations`
labels = ['Mean Thickness', 'Mean Slope', 'Area']
ax.set_xticks(np.arange(len(labels)))
ax.set_yticks(np.arange(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)

plt.colorbar(correlation_plot, label= 'Correlation', orientation='vertical')

plt.show()


### d) Seaborn Plotting (1 point)

Seaborn is a great Python package for basic data analytics. See the documentation [here](https://seaborn.pydata.org/). You can visualize the data by plotting data features against each other and visually explore the correlations.

In [None]:
# !pip install seaborn

In [None]:
import seaborn as sns

In [None]:
fig, ax = plt.subplots(3,1,figsize=(10,8))

sns.scatterplot(data=glacier, x='mean_thickness', y='mean_slope', ax=ax[0])

sns.scatterplot(data=glacier, x='area', y='mean_thickness', ax=ax[1])

sns.scatterplot(data=glacier, x='area', y='mean_slope', ax=ax[2])



for i in range(0,3):
    ax[i].set(xscale="log")

plt.tight_layout(pad=1.05)
plt.suptitle('Preliminary Correlation Analysis with Seaborn on Glacier Parameters', y=1.01)
plt.show()

Discuss the basic correlations among the data. Do these correlations make sense when you think about the shapes of glaciers?

Mean thickness and mean slope have a moderate negative correlation of -0.412. This makes sense when thinking about the environment: a thicker glacier is generally flatter than a thinner one.

Area and mean thickness have a moderate positive correlation of 0.42, similar in magnitude but opposite in sign to the correlation between mean slope and mean thickness. In a glacial environment it makes sense that a glacier with a larger area is thicker, as larger glaciers generally tend to be thick as well.

Area and mean slope have the weakest correlation of the three pairs, with a value of -0.103678, which is not a significant correlation. This does not seem out of the ordinary when you take into account that a glacier can be large but very flat, or have many slopes, likely depending on the environment in which it developed and was sustained.

## 4. Linear Regression (10 points total counted in the next section)
You found from basic data visualization that the three parameters ``mean_slope``, ``mean_thickness``, and ``area`` are correlated. This makes physical sense because a *steep* glacier is likely to be in high mountain regions, hanging on the mountain walls, and thus be constrained; conversely, a flat glacier is either in its valley, at its ocean terminus, or on an ice sheet.

### a) Simple linear regression (2 points)
We will now perform a regression between the parameters (or their log!). Linear regressions are models that can be imported from scikit-learn. Log/exp functions are available in numpy as ``np.log()`` and ``np.exp()``.
Remember that a linear regression finds $a$ and $b$, knowing both $x$ and the data $y$, in $y = ax + b$. We want to predict ice thickness from a crude estimate of the glacier area.
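To make the relation $y = ax + b$ concrete, here is a minimal sketch that recovers $a$ and $b$ by ordinary least squares written out with NumPy, rather than with scikit-learn. The data are synthetic (made up for this example), and the names `G`, `a`, and `b` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = 2 x + 1 plus small Gaussian noise.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix with a column of x and a column of ones,
# so that G @ [a, b] equals a * x + b.
G = np.column_stack([x, np.ones_like(x)])

# Least-squares solution for the slope a and the intercept b.
(a, b), *_ = np.linalg.lstsq(G, y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")
```

For a single predictor, scikit-learn's ``LinearRegression`` solves this same least-squares problem, with `a` reported in ``coef_`` and `b` in ``intercept_``.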

__Tips__: 
a. Make sure that the dimensions are correct and that there are no NaNs or zeros.
b. Make sure to import the scikit-learn linear regression function and the error metrics.

In [None]:
# solution
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score



In [None]:
#any nans or zeros?
area = glacier['area']
slope = glacier['mean_slope']
thick = glacier['mean_thickness']

print('Area:', len(glacier[area.isna()]), len(glacier[area==0]),
      'Thickness:', len(glacier[thick.isna()]), len(glacier[thick==0]),
      'Slope:', len(glacier[slope.isna()]), len(glacier[slope==0])
                                             )
# remove zeros from slope
glacier = glacier[(glacier.mean_slope != 0)]

# check that no zero slopes remain in the filtered dataframe
print((glacier['mean_slope'] == 0).sum())

Make a plot of the data and the linear regression you performed.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
## thickness and log(area)
# converting the data into numpy arrays
y = np.asarray(glacier['mean_thickness']).reshape(-1, 1) # reshaping to a column, as LinearRegression requires 2D input
area_log = np.log(glacier['area']) # putting area into (natural) log space
x = np.asarray(area_log).reshape(-1, 1) # reshaping log(area) to a column
#tt = np.linspace(np.min(t),np.max(t),1000)

# performing the linear regression
regr = LinearRegression()
# fitting it
regr.fit(x,y)
# We will first predict the fit:
prediction=regr.predict(x) 

# The coefficients
print('The coefficient is', regr.coef_[0][0],
      'and intercept is', regr.intercept_)

plt.plot(x,prediction,color="red")
plt.grid(True)

plt.show()

In [None]:
# Figure 2: area vs mean thickness
# solution
fig, axs = plt.subplots(figsize=(10,8))
area_log = np.log(glacier['area'])
area_thickness = axs.scatter(data=glacier, x=area_log, y='mean_thickness',  c='lat', cmap='Blues')

# the x values are the natural log of the area, so label the axis accordingly
axs.set_xlabel('ln(Glacier Area in km^2)', fontsize=12); axs.set_ylabel('Mean Thickness (m)', fontsize=12)

plt.colorbar(area_thickness, label= 'Latitude (º)', orientation='horizontal')

# overlay the regression line; x is log(area) from the previous cell
plt.plot(x,prediction,color="red")


plt.tight_layout(pad=1)
fig.suptitle('Correlation Analysis of Glacier Parameters Area and Mean Thickness',
             y=1.01, fontsize=15) # or plt.suptitle('Main title')

plt.show()

**Briefly comment on the quality of your fit and a linear regression (1 point)**

The fit looks pretty good, except that there are outliers at the extreme low and high values of area.

### b) Leave One Out Cross Validation linear regression (1 point)


Perform the LOOCV on the ``area`` and ``thickness`` values. Predict the ``thickness`` value knowing an ``area`` value. Use material seen in class. Make a plot of your fit.
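As a reminder of the mechanics, the sketch below runs leave-one-out cross-validation on synthetic data (made up for this example): each sample is held out once, the model is fit on the rest, and the squared error on the held-out sample is stored. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

# Synthetic data (made up): y = 3 x plus Gaussian noise.
rng = np.random.default_rng(1)
X = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=20)

loo = LeaveOneOut()
sq_err = []
for train_idx, val_idx in loo.split(X):
    # Fit on all samples except the held-out one.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Squared prediction error on the single held-out sample.
    pred = model.predict(X[val_idx])
    sq_err.append((pred[0] - y[val_idx][0]) ** 2)

# The LOOCV error is the average of the held-out squared errors.
cv_error = np.mean(sq_err)
print(f"{len(sq_err)} folds, LOOCV error = {cv_error:.3f}")
```

With `n` samples there are exactly `n` folds, so the cost grows with the size of the dataset.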

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error, r2_score
# solution
loo = LeaveOneOut()

In [None]:
# convert the data into numpy arrays.
E = np.asarray(glacier['mean_thickness']).reshape(-1, 1)# reshaping was necessary to be an argument of Linear regress
t = np.asarray(glacier['area']).reshape(-1, 1)

In [None]:
vel = np.zeros(len(E)) # initialize a vector to store the regression slopes
mse_train = np.zeros(len(E))
mse_val = np.zeros(len(E))
i=0
for train_index, val_index in loo.split(E):    
    E_train, E_val = E[train_index], E[val_index]
    t_train, t_val = t[train_index], t[val_index]
    # now fit the data on the training set.
    regr = LinearRegression()
    # Fit on training data:
    regr.fit(t_train,E_train)
    # We will first predict the fit:
    Epred_train=regr.predict(t_train) 
    Epred_val=regr.predict(t_val) 

    # The coefficients
    vel[i]= regr.coef_[0][0]
    mse_train[i]= mean_squared_error(E_train, Epred_train)
    mse_val[i]= mean_squared_error(E_val, Epred_val)
    # R^2 is not computed here: it is undefined for a single held-out sample
    i+=1

# the data clearly shows a trend, so the slope estimates are close to each other:
print("mean estimate is %4.2f and the standard deviation %4.2f"%(np.mean(vel),np.std(vel)))
# the test error is the average of the mean-square-errors
print("CV = %4.2f"%(np.mean(mse_val)))

In [None]:
# we randomly select values and split the data between training and validation set.
from sklearn.model_selection import ShuffleSplit
# we split once the data between a training and a validating set 
n=1 # we do this selection once
v_size = 0.3 # 30% of the data will be randomly selected to be the validation set.

rs = ShuffleSplit(n_splits=n, test_size=v_size, random_state=0)
for train_index, val_index in rs.split(E):
    E_train, E_val = E[train_index], E[val_index]
    t_train, t_val = t[train_index], t[val_index]
plt.scatter(t_train,E_train,marker="o");plt.grid(True);plt.ylabel('Thickness')
plt.scatter(t_val,E_val,marker="o",s=6,c="red")
plt.xlabel('Area (km^2)')
plt.title('Thickness Predictions using Area')
plt.legend(['training set','validation set'])

In [None]:
# now fit the data on the training set.
regr = LinearRegression()
# Fit on training data:
regr.fit(t_train,E_train)
# We will first predict the fit:
Epred=regr.predict(t_train) 
Epred_val=regr.predict(t_val) 

# The coefficients
print('Training set: Coefficient (m/km^2): ', regr.coef_[0][0])

# scikit-learn metrics take the true values first, then the predictions
print('MSE (mean square error) on training set (m^2): %.2f'
      % mean_squared_error(E_train, Epred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination on training set: %.2f'
      % r2_score(E_train, Epred))

print('MSE on validation set (m^2): %.2f and coefficient of determination: %.2f' %(mean_squared_error(E_val, Epred_val), r2_score(E_val, Epred_val)))


plt.scatter(t,E);plt.grid(True);plt.ylabel('Glacier Thickness')
plt.plot(t_train,Epred,color="red",linewidth=4)
plt.plot(t_val,Epred_val,color="green")
plt.legend(['data','fit on training','fit on validation'])
plt.title('Random selection for data split')


### c) Bootstrapping (1 point)

Perform the same analysis but using a bootstrapping technique. Output the mean and standard deviation of the slope. An illustration with a histogram  may help.
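The idea can be sketched on synthetic data (made up for this example): resample the (x, y) pairs with replacement, refit the line each time, and summarise the spread of the fitted slopes. This version uses ``np.polyfit`` and NumPy's random generator instead of scikit-learn's ``resample``, purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (made up): slope 4, intercept 10, Gaussian noise.
x = rng.uniform(0.0, 5.0, size=200)
y = 4.0 * x + 10.0 + rng.normal(scale=2.0, size=x.size)

k = 500  # number of bootstrap replicates
slopes = np.empty(k)
for j in range(k):
    # Draw indices with replacement, keeping each (x, y) pair together.
    idx = rng.integers(0, x.size, size=x.size)
    # polyfit returns the highest-degree coefficient first, i.e. the slope.
    slopes[j] = np.polyfit(x[idx], y[idx], deg=1)[0]

# Percentile interval containing 95% of the bootstrap slopes.
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"slope: mean {slopes.mean():.3f}, std {slopes.std():.3f}, "
      f"95% interval [{lo:.3f}, {hi:.3f}]")
```

A histogram of `slopes` (for example ``plt.hist(slopes, 50)``) shows the same distribution graphically.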

In [None]:
from sklearn.utils import resample
# solution

# convert the data into numpy arrays.
E = np.asarray(glacier['mean_thickness']).reshape(-1, 1)# reshaping was necessary to be an argument of Linear regress
t = np.asarray(area_log).reshape(-1, 1) #TRANSFORM THIS INTO LOG SPACE

k=100

vel = np.zeros(k) # initalize a vector to store the regression values
mse = np.zeros(k)
r2s = np.zeros(k)
i=0
for iik in range(k):    
    ii = resample(np.arange(len(E)),replace=True,n_samples=len(E))# new indices
    E_b, t_b = E[ii], t[ii]
    # now fit the data on the training set.
    regr = LinearRegression()
    # Fit on training data:
    regr.fit(t_b,E_b)
    Epred_val=regr.predict(t) # test on the validation set.

    # The coefficients
    vel[i]= regr.coef_[0][0]
    i+=1

# the data clearly shows a trend, so the bootstrap slope estimates are close to each other:
print("mean slope estimate %4.2f and the standard deviation %4.2f"%(np.mean(vel),np.std(vel)))

plt.hist(vel,50);plt.title('Distribution of bootstrap slope estimates');plt.grid(True)
plt.show()




### d) Predict the thickness of a glacier (2 points)

Let us assume that you measure a glacier of area 10 km$^2$. Can you use your bootstrap regression framework to provide a distribution of possible values of the ice thickness? Output the mean and standard deviation of the predicted ice thickness.

In [None]:
# solution
k=100
t = np.asarray(area_log).reshape(-1, 1)
vel = np.zeros(k) # initalize a vector to store the regression values
mse = np.zeros(k)
r2s = np.zeros(k)
i=0
for iik in range(k):    
    ii = resample(np.arange(len(E)),replace=True,n_samples=len(E))# new indices
    E_b, t_b = E[ii], t[ii]
    # now fit the data on the training set.
    regr = LinearRegression()
    # Fit on training data:
    regr.fit(t_b,E_b)
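    Epred_val=regr.predict(t) # test on the validation set.

    # Predicted thickness (m) for a 10 km^2 glacier: the model takes log(area) as input
    # and returns thickness directly, so no exponential is applied to the output
    vel[i]= regr.predict(np.array([[np.log(10)]]))[0, 0]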
    i+=1

print('mean:',np.mean(vel),'STD:',np.std(vel))