In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import missingno as msno


## Imputation Part II

In [None]:
inter_data = pd.read_csv('intermediate_df.csv')
print(inter_data.head())

In [None]:
inter_data.info()

There are still some NaNs in the Salinity, SiO3, ChlorA, O2 and Phaeop columns. While the aggregate distributions for some of these quantities might look a little complicated, we know that each quantity is measured from consecutive bottle samples on a single cast, as a function of depth.

For some of this data, then, it should be possible to interpolate.
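As a rough sketch of what interpolating within a cast could look like, here is a toy frame with two casts. The `cast_id` column and the `Salnty_interp` name are hypothetical and only there for illustration; the real dataframe may identify casts differently.

```python
import numpy as np
import pandas as pd

# Toy data: two casts, each with bottle samples ordered by depth.
# `cast_id` is a hypothetical cast identifier, not a column from this notebook.
toy = pd.DataFrame({
    'cast_id': [1, 1, 1, 1, 2, 2, 2],
    'Depthm':  [0, 10, 20, 30, 0, 10, 20],
    'Salnty':  [33.0, np.nan, 33.4, 33.6, 34.0, np.nan, 34.4],
})

# Interpolate within each cast so values never leak across cast boundaries.
toy['Salnty_interp'] = (
    toy.groupby('cast_id')['Salnty']
       .transform(lambda s: s.interpolate(method='linear', limit=3))
)
print(toy)
```

Grouping by cast is a design choice: interpolating the whole column in row order can blend the bottom of one cast into the surface of the next.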

In [None]:
plt.scatter(inter_data['Depthm'], inter_data['Salnty'])
plt.show()

OK, any missing value deeper than 1000 m is going to be imputed with the mean of the salinity measured deeper than 1000 m.

In [None]:
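def consecutive_nan_visualiser(data, colname):
    # Each non-NaN value starts a new group; the NaNs that follow it share its group id,
    # so the per-group NaN count is the length of each run of consecutive NaNs.
    group_id = data[colname].notna().astype(int).cumsum()
    cumNaNs = data[colname].isna().astype(int).groupby(group_id).sum()
    plt.plot(cumNaNs)
    plt.xlabel('Series Index')
    plt.ylabel('# of consecutive NaNs')
    plt.title(colname + ' Consecutive NaN visualization')
    plt.show()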


In [None]:
consecutive_nan_visualiser(inter_data, 'Salnty')
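To make it clearer what the visualiser is counting, here is the same groupby/cumsum trick on a tiny made-up series (the values are arbitrary).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0])

# cumsum of notna() gives every valid value a new group id; NaNs inherit the id
# of the valid value before them, so summing isna() per group gives the length
# of the NaN run that follows each valid value.
run_lengths = s.isna().astype(int).groupby(s.notna().astype(int).cumsum()).sum()
print(run_lengths.tolist())  # [2, 0, 1, 0]
```

So each point on the plot is one non-NaN value, and its height is the number of NaNs that come right after it.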

Hmmm... there are some small NaN clumps and a few large ones. Simple interpolation is not going to work on the large clumps. However, let's interpolate over the small clumps, and then fill the remaining NaNs at large depths, where the series has leveled off, with the large-depth mean. The large-depth cutoff is deeplim; the NaN clump size limit for interpolation is nan_limit.

I know it's a little janky... I'll come up with something later that automatically estimates the depth at which the series levels off, so that deeplim is not a free parameter.
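One possible way to estimate the level-off depth, sketched here under stated assumptions: bin by depth, take the mean per bin, and find the shallowest bin beyond which the binned mean stops changing by more than some tolerance. The helper `estimate_level_off_depth`, its `bin_width` and `tol` parameters, and the synthetic profile are all hypothetical, not something this notebook already uses.

```python
import numpy as np
import pandas as pd

def estimate_level_off_depth(depth, values, bin_width=100, tol=0.01):
    """Hypothetical helper: return the shallowest bin edge beyond which the
    binned mean changes by less than `tol` from one depth bin to the next."""
    df = pd.DataFrame({'depth': np.asarray(depth), 'val': np.asarray(values)}).dropna()
    bin_edge = (df['depth'] // bin_width) * bin_width
    binned = df.groupby(bin_edge)['val'].mean().sort_index()
    change = binned.diff().abs().to_numpy()[1:]  # change[i] compares bin i+1 with bin i
    big = np.where(change >= tol)[0]
    if big.size == 0:
        return float(binned.index[0])            # flat everywhere
    # The last large step ends at bin big[-1] + 1; all deeper steps are within tol.
    return float(binned.index[big[-1] + 1])

# Synthetic, noise-free profile that levels off with depth (not real data).
depth = np.arange(0, 3000, 10)
values = 33.0 + 1.5 * np.tanh(depth / 300)
estimated_deeplim = estimate_level_off_depth(depth, values)
print(estimated_deeplim)
```

On noisy data a single noisy deep bin can push the estimate deeper, so `tol` and `bin_width` would need tuning per column. This swaps one free parameter for two, but they are arguably easier to reason about.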

In [None]:

def impute_by_depth(data, colname, deeplim=1500, nan_limit=3, depthname='Depthm'):
    # Work on a copy of just the depth and target columns.
    comb_df = data[[depthname, colname]].copy()

    print(comb_df[colname].isna().sum())  # NaN count before imputation

    # Linearly interpolate in row order, filling at most nan_limit consecutive NaNs.
    comb_df[colname] = comb_df[colname].interpolate(method='linear', limit=nan_limit)
    # Fill NaNs that remain at depths >= deeplim with the mean value in that deep region.
    deep = comb_df[depthname] >= deeplim
    lowdepthval = comb_df.loc[deep, colname].mean()
    comb_df.loc[deep & comb_df[colname].isna(), colname] = lowdepthval
    print(comb_df[colname].isna().sum())  # NaN count after imputation
    return comb_df[colname]

Alright, let's impute using the impute_by_depth function for chlorophyll, phaeopigments, and salinity. We will do no further imputation on these columns, as they do not have obvious single-valued relationships with other quantities, and their variance at shallow depths is really high. It's not obvious that imputing with a mean, even one conditioned on depth, is a good idea.
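One thing worth knowing about `interpolate(limit=...)`: as far as I understand the pandas behaviour, `limit` caps how many consecutive NaNs get filled in each gap, so a large clump is partially filled rather than skipped. If the intent is to touch only the small clumps, something like the sketch below could be used instead. The helper `interpolate_small_gaps` and its `max_gap` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(s, max_gap=3):
    """Hypothetical helper: linearly interpolate only NaN runs of length <= max_gap."""
    is_na = s.isna()
    # Label each run of consecutive NaN / non-NaN values, then measure its length.
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    run_len = run_id.map(run_id.value_counts())
    filled = s.interpolate(method='linear', limit_area='inside')
    # Keep interpolated values only where the gap is short enough.
    return filled.where(~is_na | (run_len <= max_gap), np.nan)

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0])
naive = s.interpolate(method='linear', limit=3)
careful = interpolate_small_gaps(s, max_gap=3)
print(naive.tolist())    # the 4-long gap is partially filled
print(careful.tolist())  # the 4-long gap is left untouched
```

Whether partial filling matters depends on how many large clumps sit shallower than deeplim, since those are the ones that would not be overwritten by the deep mean.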

In [None]:
cols_to_impute = ['Salnty', 'ChlorA', 'Phaeop']
imputedcols = pd.concat([impute_by_depth(inter_data, col, deeplim = 500, nan_limit = 3 ) for col in cols_to_impute], axis = 1)
inter_data[cols_to_impute] =  imputedcols

In [None]:
inter_data.info()

The final thing to impute is the silicate column. From our first notebook, we do see some kind of functional dependence of silicate on depth. We also have no NaNs in depth, so we can use this for a function-based imputation. Note: there are some obvious, interesting branch-offs from the main curve.

In [None]:
plt.scatter(inter_data['Depthm'], inter_data['SiO3uM'])
plt.xlabel('Depthm')
plt.ylabel('SiO3uM')
plt.show()

In [None]:
from scipy.optimize import curve_fit

def tanh_coeff(x, a, b, c):
    return a * np.tanh(b * x) + c

nonandat = inter_data[['Depthm', 'SiO3uM']].dropna()
xdata = nonandat['Depthm']
ydata =  nonandat['SiO3uM']
popt, pcov = curve_fit(tanh_coeff, xdata , ydata)

In [None]:
plt.scatter(inter_data['Depthm'], inter_data['SiO3uM'], label = 'data')
plt.scatter(xdata, tanh_coeff(xdata, *popt), c = 'r', label = 'fit')
plt.legend()
plt.show()

Good, we can use tanh_coeff to impute. Provided that most of the NaNs don't live on the offshoots of the main tanh curve, we should be pretty good.
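A note on robustness: when no `p0` is passed, `curve_fit` starts every parameter at 1, and with depth in meters `tanh(1 * x)` is saturated almost everywhere, which can make the fit fragile. Below is a sketch of passing a data-driven initial guess, run on synthetic data. The `true_params` values, the noise level, and the `p0` heuristic are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def tanh_coeff(x, a, b, c):
    return a * np.tanh(b * x) + c

# Synthetic silicate-like profile (not real data).
rng = np.random.default_rng(0)
depth = np.linspace(0, 3000, 300)
true_params = (150.0, 1 / 800, 2.0)
silicate = tanh_coeff(depth, *true_params) + rng.normal(0, 2.0, depth.size)

# Initial guess: amplitude ~ data range, scale ~ 1 / typical depth, offset ~ minimum.
p0 = [silicate.max() - silicate.min(), 1 / np.median(depth), silicate.min()]
popt, pcov = curve_fit(tanh_coeff, depth, silicate, p0=p0)
perr = np.sqrt(np.diag(pcov))  # one-sigma parameter uncertainties
print(popt, perr)
```

The diagonal of `pcov` gives a quick sanity check: very large uncertainties on `b` would suggest the fit did not really constrain the depth scale.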

In [None]:
SiO3vsdepth_impute = inter_data.loc[inter_data['SiO3uM'].isna(), ['Depthm', 'SiO3uM']]

In [None]:
imputed_SiO3 = tanh_coeff(SiO3vsdepth_impute['Depthm'], *popt)
print(imputed_SiO3)

In [None]:
inter_data.loc[inter_data['SiO3uM'].isna(), 'SiO3uM'] = imputed_SiO3

In [None]:
inter_data.info()

In [None]:
inter_data['SiO3uM'].head()

In [None]:
plt.scatter(inter_data['Depthm'], inter_data['SiO3uM'])
plt.xlabel('Depthm')
plt.ylabel('SiO3uM')
plt.show()

OK, we can see the imputed data in the high-depth region. Nice. This worked. We are now finished with data imputation. Any remaining NaNs in the dataframe will be dropped, as we have no hope of imputing them without distorting the data distributions too much.
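Before dropping, it can be worth checking how many rows the drop will cost and which columns are responsible. A minimal sketch on a toy frame follows; the `O2ml_L` column name and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; `O2ml_L` is a hypothetical column name.
toy = pd.DataFrame({
    'Depthm': [0, 10, 20, 30],
    'Salnty': [33.0, np.nan, 33.4, 33.6],
    'O2ml_L': [5.0, 5.1, np.nan, np.nan],
})

nan_counts = toy.isna().sum()              # remaining NaNs per column
rows_lost = len(toy) - len(toy.dropna())   # rows removed by dropna()
print(nan_counts)
print(f'{rows_lost} of {len(toy)} rows would be dropped')
```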

In [None]:
final_data = inter_data.dropna()
final_data.info()

Now we save this data to CSV.

In [None]:
final_data.to_csv('final_data.csv')
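A small aside on `to_csv`: by default it also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read with `pd.read_csv`. If the downstream notebook doesn't need the index, passing `index=False` avoids that. The sketch below uses an in-memory buffer and a made-up two-row frame rather than the real file.

```python
import io
import pandas as pd

# Made-up frame with a non-default index, as is typical after dropna().
df = pd.DataFrame({'Depthm': [0, 10], 'SiO3uM': [1.5, 2.5]}, index=[3, 7])

with_index = io.StringIO()
df.to_csv(with_index)                  # default: index is written as the first column
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

no_index = io.StringIO()
df.to_csv(no_index, index=False)       # index is not written
no_index.seek(0)
cols_without_index = pd.read_csv(no_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Depthm', 'SiO3uM']
print(cols_without_index)  # ['Depthm', 'SiO3uM']
```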