# Lithology classification

## Data

This notebook uses data from the [Upper Condamine Catchment](http://www.bom.gov.au/qld/flood/brochures/condamine_balonne/map_upper.shtml) in the state of Queensland. The data was obtained through personal communication, as a project output. It may be shared publicly and made downloadable from this sample notebook in the future.

![Upper Condamine catchment formations](img/Upper_Condamine_formations.png "Upper Condamine catchment formations")

(Figure from [this paper](https://www.researchgate.net/figure/Upper-Condamine-catchment-Queensland-Australia-The-Marburg-Subgroup-consists-of_fig1_283184727))

## Status

As of May 2019, this document is an output of exploratory work done during an internship by [Sudhir Gupta](https://github.com/Sudhir22).


## Purpose

This notebook compares the performance of two techniques for semi-automated classification. It also summarises work on using ontologies for classification in cases where we do not have reliable training sets.


## Importing python packages

In [None]:
import os
import re  # used explicitly further down to compile the lithology marker pattern
import sys
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show
import geopandas as gpd
import pickle

In [None]:
# Only True for co-dev of ela from this use case:
ela_from_source = False
ela_from_source = True

In [None]:
if ela_from_source:
    if ('ELA_SRC' in os.environ):
        root_src_dir = os.environ['ELA_SRC']
    elif sys.platform == 'win32':
        root_src_dir = r'C:\Users\SUD011\Documents\pyela-sudhir'
    else:
        username = os.environ['USER']
        root_src_dir = os.path.join('/home', username, 'src/github_jm/pyela')
    pkg_src_dir = root_src_dir
    sys.path.insert(0, pkg_src_dir)

from ela.textproc import *
from ela.utils import *
from ela.classification import *
from ela.visual import *

In [None]:
import striplog
from striplog import Lexicon

### Input Data

In [None]:
if ('ELA_DATA' in os.environ):
    data_path = os.environ['ELA_DATA']
elif sys.platform == 'win32':
    data_path = r'C:\data\Lithology'
else:
    username = os.environ['USER']
    data_path = os.path.join('/home', username, 'data', 'Lithology')

condamine_litho_dir = os.path.join(data_path,'Condamine')
condamine_litho_xl = os.path.join(condamine_litho_dir, 'MASTER_CONDAMINE_Interpretation_all_combined_Jan2017.xlsx')
condamine_litho_pkl = os.path.join(condamine_litho_dir, 'MASTER_CONDAMINE_Interpretation_all_combined_Jan2017.pkl')
lexicon_cleaned_pkl = os.path.join(condamine_litho_dir, 'lexicon_cleaned.pkl')

In [None]:
LITHO_DESC_COL='Lithology_original'
REGEX_LITHO_CLASS_COL='Regex_lithoclass'
# The column name with the simplified lithology as categorised by a hydrogeologist
LITHO_CLASS_COL = 'Simplified_lithology'

In [None]:
DL_LITHO_CLASS_COL = 'Simplified_Lithology'

# Initial data exploration

This first section deliberately covers some of the data discovery process for didactic purposes.

## Useful resources

Note that some of the terms used for classification are triangulated with [this description](https://www.bioregionalassessments.gov.au/assessments/21-22-data-analysis-clarence-moreton-bioregion/21211-lithological-and-stratigraphic-data)

We cache the data in a pickle as it is faster to re-read:


In [None]:
if not os.path.exists(condamine_litho_pkl):
    train_data=pd.read_excel(condamine_litho_xl)
    with open(condamine_litho_pkl, 'wb') as handle:
        pickle.dump(train_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(condamine_litho_pkl, 'rb') as handle:
        train_data = pickle.load(handle)

In [None]:
train_data.head()

In [None]:
# Keep an untouched copy of the raw data; it is reused in the geolocation section
train_data_unprocessed = train_data.copy()

In [None]:
df = train_data

In [None]:
set(train_data[LITHO_CLASS_COL].values)

We massage the column of simplified lithologies, which results from a manual classification.

It is not obvious from inspecting the pandas data frame, but there appear to be NaNs that cause headaches later on.
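
The issue can be reproduced on a toy column. The sketch below is illustrative only and uses made-up descriptions rather than the Condamine data: a missing entry is stored as a float NaN, so the column mixes `str` and `float` values until the gaps are filled.

```python
import numpy as np
import pandas as pd

# Toy descriptions (made up for illustration); the second entry is missing.
toy = pd.Series(["Brown CLAY", np.nan, "Weathered basalt"], dtype=object)

# Missing values are float NaN, so the column holds two Python types.
types_before = {type(v).__name__ for v in toy.values}

# Fill missing entries with an empty string, then lower-case everything.
cleaned = toy.fillna("").str.lower()
types_after = {type(v).__name__ for v in cleaned.values}

print(types_before)
print(types_after)
print(cleaned.tolist())
```

Filling with an empty string keeps the row in place, so the row count still matches the label column.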

In [None]:
set(type(x) for x in train_data[LITHO_DESC_COL].values)

In [None]:
# vv = [type(x) is not str for x in df[LITHO_DESC_COL].values]
# df.loc[np.array(vv)].head()
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].str.lower()
# Missing descriptions become empty strings so that the string operations below do not fail
train_data[LITHO_DESC_COL] = train_data[LITHO_DESC_COL].fillna('')
train_data[LITHO_DESC_COL] = train_data[LITHO_DESC_COL].str.lower()

In [None]:
%%time
# Not obvious upfront but noticed that there are terms where dots and slashes prevent tokenization and lithology term detection. 
train_data[LITHO_DESC_COL] = v_replace_punctuations(train_data[LITHO_DESC_COL].values)

In [None]:
train_data.head()

In [None]:
token_freq(train_data[LITHO_CLASS_COL].values, 50)

Given the low frequency of some of the classes, we remap them to one of the three main classes:

In [None]:
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('granite|granodiorite|diorite|basement','bedrock',regex=True)
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('gravel','alluvium',regex=False)
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('wrong_location|weathering_horizon|tertiary','unknown',regex=True)

In [None]:
token_freq(train_data[LITHO_CLASS_COL].values, 50)

In [None]:
descs = df[LITHO_DESC_COL]
descs = descs.reset_index()
descs = descs[LITHO_DESC_COL]
descs.head()

We apply a default lexical operation from striplog to expand abbreviations such as 'qtz.'
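
To give an idea of what abbreviation expansion does, here is a minimal standalone sketch. It is not striplog's implementation: the `ABBREVIATIONS` table and the `expand_abbreviations` function are hypothetical stand-ins written for this illustration.

```python
import re

# Hypothetical abbreviation table for illustration; striplog's default
# lexicon ships its own list.
ABBREVIATIONS = {"qtz": "quartz", "sst": "sandstone", "gvl": "gravel"}

# Match any abbreviation as a whole word, optionally followed by a dot.
_pattern = re.compile(
    r"\b(" + "|".join(sorted(ABBREVIATIONS, key=len, reverse=True)) + r")\b\.?"
)

def expand_abbreviations(text):
    """Replace each known abbreviation in `text` with its full term."""
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("grey sst with qtz. veins"))
```

The word boundaries keep the substitution from firing inside longer words.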

In [None]:
lex = Lexicon.default()

In [None]:
if not os.path.exists(lexicon_cleaned_pkl):
    expanded_descs = descs.apply(lex.expand_abbreviations) # takes 2-3 minutes
    y = expanded_descs.values
    train_data[LITHO_DESC_COL] = expanded_descs
    with open(lexicon_cleaned_pkl, 'wb') as handle:
        pickle.dump(train_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(lexicon_cleaned_pkl, 'rb') as handle:
        train_data = pickle.load(handle)
    y = train_data[LITHO_DESC_COL].values


We flatten the corpus of words to get the most frequent terms, as a guide for which lithology classes we can define.
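
The idea can be sketched with the standard library alone. The descriptions and the `tokenize` helper below are made up for illustration; the notebook itself relies on the `ela` helpers for this step.

```python
import re
from collections import Counter

# Made-up descriptions standing in for the lithology log entries.
descriptions = ["brown sandy clay", "clay with gravel", "weathered basalt", "basalt"]

def tokenize(text):
    """Split a lower-cased description into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Flatten the per-description token lists into a single list of words.
flat_tokens = [tok for desc in descriptions for tok in tokenize(desc)]

# Count occurrences and keep the most frequent terms.
most_common = Counter(flat_tokens).most_common(2)
print(most_common)
```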

In [None]:
%%time
flat = flat_list_tokens(y)
len(set(flat))

In [None]:
df_most_common = token_freq(flat, 50)

In [None]:
df_most_common

In [None]:
plot_freq(df_most_common)

In [None]:
condamine_litho_cleaned_pkl = os.path.join(condamine_litho_dir, 'condamine_litho_cleaned.pkl')

if not os.path.exists(condamine_litho_cleaned_pkl):
    with open(condamine_litho_cleaned_pkl, 'wb') as handle:
        pickle.dump(train_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(condamine_litho_cleaned_pkl, 'rb') as handle:
        train_data = pickle.load(handle)
        y = train_data[LITHO_DESC_COL]



## Defining lithology classes

Starting with the three terms (simplified lithology classes) we want to end up with, we can also add the most frequent terms observed in the corpus.


In [None]:
litho_class_names = ["basalt", "bedrock", "alluvium", "unknown"]

# 'gravel' and 'soil' were listed twice; duplicates would produce repeated rows in the summary tables
lithologies = ['alluvium', 'basalt', 'bedrock', 'clay', 'sandstone','sand','shale','soil','honeycomb','gravel','coal','silt','rock', 'limestone', 'metal']
# see https://www.bioregionalassessments.gov.au/assessments/21-22-data-analysis-clarence-moreton-bioregion/21211-lithological-and-stratigraphic-data
# for 'metal' or 'blue metal'
any_litho_markers_re = r'alluvium|sand|clay|ston|shale|basa|silt|soil|honey|coal|gravel|rock|mud|metal'

regex = re.compile(any_litho_markers_re)

lithologies_dict = dict([(x,x) for x in lithologies])
lithologies_dict['sands'] = 'sand'
lithologies_dict['basalts'] = 'basalt'
lithologies_dict['clays'] = 'clay'
lithologies_dict['shales'] = 'shale'
lithologies_dict['claystone'] = 'clay'
lithologies_dict['siltstone'] = 'silt'
lithologies_dict['mudstone'] = 'silt' # ??
lithologies_dict['capstone'] = 'limestone' # ??
lithologies_dict['ironstone'] = 'sandstone' # ??
lithologies_dict['topsoil'] = 'soil' # ??

lithologies_adjective_dict = {
    'sandy' :  'sand',
    'clayey' :  'clay',
    'clayish' :  'clay',
    'shaley' :  'shale',
    'silty' :  'silt',
    'gravelly' :  'gravel'
}

## Testing the regular expression model on the Condamine dataset

We make a first pass at classifying each description into one of the three target classes.
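
As a rough mental model of this pass, the sketch below finds tokens that contain a lithology marker and returns the lithology of the first one that is a known term. This is an assumption about the general approach, not the `ela` implementation; `primary_lithology` and the reduced dictionaries are hypothetical.

```python
import re

# Small subset of the marker pattern and term dictionary defined above,
# repeated here so that the sketch is self-contained.
marker_re = re.compile(r"sand|clay|basa|shale")
term_to_lithology = {"sand": "sand", "sands": "sand", "clay": "clay",
                     "basalt": "basalt", "shale": "shale"}

def primary_lithology(description):
    """Return the lithology of the first marker token that is a known term, or ''."""
    tokens = re.findall(r"[a-z]+", description.lower())
    markers = [t for t in tokens if marker_re.search(t)]
    for token in markers:
        if token in term_to_lithology:
            return term_to_lithology[token]
    return ""

print(primary_lithology("Sandy clay over weathered basalt"))
print(primary_lithology("no sample"))
```

In this sketch the adjective "sandy" is detected as a marker but is not a primary term, so the noun "clay" wins.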

In [None]:
y = train_data[LITHO_DESC_COL]

In [None]:
%%time
v_tokens = v_word_tokenize(y)

In [None]:
vt = v_find_litho_markers(v_tokens, regex=regex)
print(summary_regex_tokens(vt))

In [None]:
prim_litho = [find_primary_lithology(x, lithologies_dict) for x in vt]

In [None]:
n = len(set(prim_litho))
plot_freq(token_freq(prim_litho, n_most_common = n))

We need to define a map from lithology classes down to the simplified classes: alluvium, bedrock, basalt or unknown. We will first do a bit of "snooping" on the labelled data; this is admittedly dodgy in the context of a comparison of methods, but is in practice necessary to validate some classification assumptions.

Given the a priori classification of primary lithologies, what proportions of labels do we have for each primary lithology?
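
A minimal sketch of such a cross-tabulation, using only the standard library and made-up pairs of values (the `pairs` list and `label_proportions` function are illustrative, not part of the notebook):

```python
from collections import Counter, defaultdict

# Made-up (regex lithology, expert label) pairs for illustration only.
pairs = [("clay", "alluvium"), ("clay", "alluvium"), ("clay", "basalt"),
         ("sandstone", "bedrock"), ("sand", "alluvium"), ("clay", "alluvium")]

# Count the expert labels observed for each regex-derived lithology.
counts = defaultdict(Counter)
for lithology, label in pairs:
    counts[lithology][label] += 1

def label_proportions(lithology):
    """Fraction of each expert label among rows with the given lithology."""
    total = sum(counts[lithology].values())
    return {label: n / total for label, n in counts[lithology].items()}

print(label_proportions("clay"))
```

A lithology whose labels are spread over several classes, like "clay" here, is one for which any fixed many-to-one mapping will misclassify some rows.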

In [None]:
tmp_df = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: prim_litho, LITHO_DESC_COL: descs})

In [None]:
def class_freq_for_lithology(litho_name, df=tmp_df):
    # Filter on the data frame passed in, not on the global tmp_df
    matching = df.loc[df[REGEX_LITHO_CLASS_COL] == litho_name]
    return token_freq(matching[LITHO_CLASS_COL].values, 50)

In [None]:
# lithomap = pd.concat([class_freq_for_lithology(litho)['frequency'] for litho in lithologies], axis=1, names=lithologies)
# lithomap.columns = lithologies
# lithomap.rename(index=dict(zip(list(range(4)), litho_class_names)), inplace=True)
# lithomap

In [None]:
def horizontal_classfreq(litho_name, df=tmp_df):
    ff = class_freq_for_lithology(litho_name, df=df)
    # One-row data frame: one column per simplified class, indexed by the lithology name
    x = pd.DataFrame(dict(zip(ff['token'],np.array(ff['frequency']))), index=[litho_name])
    return x

def table_litho_mappings(litho_names, df=tmp_df):
    return pd.concat( [horizontal_classfreq(x, df=df) for x in litho_names], sort=True)

In [None]:
table_litho_mappings(lithologies)

We define a map from primary lithology terms to the simplified lithologies we have in the labelled data set.
Note that this involves some "snooping" on the labelled data, above, though we think it is justifiable in the context of using regular expressions.
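
The mapping and the accuracy check can be sketched as follows. The `simplified_map` excerpt, `to_simplified` and `accuracy` are hypothetical names for this illustration. Unlike a plain dictionary lookup, the sketch falls back to 'unknown' for terms missing from the map.

```python
# Hypothetical excerpt of a lithology-to-class map; unmapped terms fall
# back to 'unknown' rather than raising a KeyError.
simplified_map = {"clay": "alluvium", "sand": "alluvium",
                  "sandstone": "bedrock", "basalt": "basalt", "": "unknown"}

def to_simplified(primary_lithologies):
    """Map each primary lithology to its simplified class."""
    return [simplified_map.get(p, "unknown") for p in primary_lithologies]

def accuracy(modelled, expected):
    """Fraction of positions where the modelled class equals the expected one."""
    matches = sum(m == e for m, e in zip(modelled, expected))
    return matches / len(expected)

predicted = to_simplified(["clay", "basalt", "", "marble"])
print(predicted)
print(accuracy(predicted, ["alluvium", "basalt", "bedrock", "unknown"]))
```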

In [None]:
lithology_map={
    'alluvium' :  'alluvium',
    'bedrock' :  'bedrock',
    'basalt' :  'basalt',
    'metal' :  'basalt',
    'honeycomb' :  'basalt',
    'clay' :  'alluvium',
    'coal' :  'bedrock', 
    'sandstone' :  'bedrock',
    'sand' :  'alluvium',
    '' :  'unknown',
    'soil' :  'alluvium',
    'shale': 'bedrock',
    'gravel': 'alluvium',
    'silt' : 'bedrock',
    'rock' : 'bedrock',
    'limestone' : 'alluvium'
}

In [None]:
final_prim_litho=list()
for x in prim_litho:
    final_prim_litho.append(lithology_map[x])


In [None]:
token_freq(final_prim_litho)

In [None]:
simplified_lithology=train_data[LITHO_CLASS_COL].values

In [None]:
def get_accuracy(modelled, expected):
    matches = np.equal(modelled, expected)
    return np.count_nonzero(matches)/len(matches)

In [None]:
print("Accuracy of regex for classifying primary lithologies: ", get_accuracy(final_prim_litho, simplified_lithology))

Let's check which lithology descriptions led to an 'unknown' classification, to see whether we are missing something.

In [None]:
tmp_df = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: final_prim_litho, LITHO_DESC_COL: descs})

In [None]:
match_and_sample_df(tmp_df, 'unknown', colname=REGEX_LITHO_CLASS_COL, size=100)

In [None]:
df_unk = tmp_df.loc[ tmp_df[REGEX_LITHO_CLASS_COL] == 'unknown' ]
df_unk.head()

In [None]:
flat = flat_list_tokens(df_unk[LITHO_DESC_COL].values)

In [None]:
s = ' '.join(flat)

In [None]:
show_wordcloud(s, title = 'Unclassified via regexp')

In [None]:
plot_freq(token_freq(flat, n_most_common = 30))

We did not use 'loam' as a term. There are other terms that should be used, such as "metal" and "file", but it is unclear how they should be remapped.

In [None]:
lithologies.append('loam')
lithologies.append('granite')
lithologies.append('soapstone')

In [None]:
any_litho_markers_re = any_litho_markers_re + '|loam|granite|soap'
regex = re.compile(any_litho_markers_re)
lithologies_dict['loam'] = 'loam'
lithologies_dict['granite'] = 'granite'
lithologies_dict['soapstone'] = 'soapstone'

In [None]:
v_tokens = v_word_tokenize(y)
vt = v_find_litho_markers(v_tokens, regex=regex)

In [None]:
prim_litho = [find_primary_lithology(x, lithologies_dict) for x in vt]

In [None]:
n = len(set(prim_litho))
plot_freq(token_freq(prim_litho, n_most_common = n))

In [None]:
tmp_df = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: prim_litho, LITHO_DESC_COL: descs})

In [None]:
table_litho_mappings(lithologies, df=tmp_df)

In [None]:
lithology_map['loam'] = 'alluvium'
lithology_map['granite'] = 'bedrock'
lithology_map['soapstone'] = 'basalt'

In [None]:
final_prim_litho=list()
for x in prim_litho:
    final_prim_litho.append(lithology_map[x])


In [None]:
token_freq(final_prim_litho)

In [None]:
tmp_df = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: final_prim_litho, LITHO_DESC_COL: descs})

In [None]:
match_and_sample_df(tmp_df, 'unknown', colname=REGEX_LITHO_CLASS_COL, size=100)

In [None]:
print("Accuracy of regex for classifying primary lithologies: ", get_accuracy(final_prim_litho, simplified_lithology))

In [None]:
tmp_df = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: final_prim_litho, LITHO_DESC_COL: descs})

In [None]:
df_unk = tmp_df.loc[ tmp_df[REGEX_LITHO_CLASS_COL] == 'unknown' ]
df_unk.head()

In [None]:
flat = flat_list_tokens(df_unk[LITHO_DESC_COL].values)

In [None]:
s = ' '.join(flat)

In [None]:
show_wordcloud(s, title = 'Unclassified via regexp')

In [None]:
# find_regex_df( df_unk, '.*basalt.*', LITHO_DESC_COL)
# find_regex_df( df_unk, '.*file.*', LITHO_DESC_COL)

match_and_sample_df(tmp_df, 'unknown', colname=REGEX_LITHO_CLASS_COL, out_colname=None, size=50, seed=0)

In [None]:
df_regex = df.copy()
df_regex[REGEX_LITHO_CLASS_COL] = final_prim_litho

In [None]:
condamine_litho_regex = os.path.join(condamine_litho_dir, 'condamine_litho_regex.pkl')
with open(condamine_litho_regex, 'wb') as handle:
    pickle.dump(df_regex, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Testing the deep learning model on the same dataset

In [None]:
# train_data=train_data_unprocessed.copy()

In [None]:
# conda install gensim
# conda install tensorflow
# conda install keras
# pip install wordcloud

from ela.experiment.textproc import Model

In [None]:
train_data_dl = train_data.copy()
model=Model(train_data_dl,20)

In [None]:
predictions_dl = None

NOTE 2019-07-30: reloading cached results as the training takes a fair amount of time.

In [None]:
predictions_dl = pd.read_csv('prediction_file.csv')

In [None]:
if predictions_dl is None:
    model.initialise_model()

As of 2019-07-20: 

* Training Accuracy: 0.8620
* Testing Accuracy:  0.8562


In [None]:
x = train_data.copy()
x.head()

In [None]:
# TODO: fix the class to not be so hard coded... 
x['Description'] = x['Lithology_original']
x.columns

In [None]:

if predictions_dl is None: 
    model.predict(x)
    predictions_dl = pd.read_csv('prediction_file.csv')

In [None]:
predictions_dl.head()

In [None]:
print("Accuracy of DL for classifying primary lithologies: ", get_accuracy(predictions_dl[DL_LITHO_CLASS_COL], simplified_lithology))

## Explaining the predictive performance of regexp versus DL

* Confusion matrices for RE and DL
* Subset each case where DL gets it right and not RE, and get a wordcloud of terms
* Subset each case where RE gets it right and not DL, and get a wordcloud of terms
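
When reading the confusion matrices, the direction of normalisation matters. The sketch below uses a made-up two-class matrix, with rows as actual classes and columns as predicted classes (the convention of `sklearn.metrics.confusion_matrix`), to show the two options.

```python
import numpy as np

# Toy confusion matrix: rows are actual classes, columns are predicted classes.
cm = np.array([[8, 2],
               [1, 9]])

# Row-normalise: each row is divided by the number of samples of that
# actual class, so the diagonal holds the per-class recall.
row_fractions = cm / cm.sum(axis=1, keepdims=True)

# Column-normalise: each column is divided by the number of predictions of
# that class, so the diagonal holds the per-class precision.
col_fractions = cm / cm.sum(axis=0, keepdims=True)

print(row_fractions)
print(col_fractions)
```

Using `keepdims=True` lets the same expression broadcast correctly for either axis.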

In [None]:
##
condamine_litho_regex = os.path.join(condamine_litho_dir, 'condamine_litho_regex.pkl')
with open(condamine_litho_regex, 'rb') as handle:
    df_regex = pickle.load(handle)

In [None]:
from sklearn.metrics import confusion_matrix

df_regex.head()

In [None]:
def build_confusion_matrix(df_regex, colname_true, colname_predicted, labels = litho_class_names):
    y_true = df_regex[colname_true].values
    y_pred = df_regex[colname_predicted].values    
    m = confusion_matrix(y_true, y_pred, labels=labels)
    return m

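def normalise_confusion_matrix(cm, axis=1):
    # axis=1: divide each row by its total (fractions of each actual class);
    # axis=0: divide each column by its total (fractions of each predicted class).
    return cm.astype('float') / cm.sum(axis=axis, keepdims=True)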
#     fractions = m / m.astype(np.float).sum(axis=0)
#     return fractions

def plot_cf_matrix(m, litho_class_names, title='confusion matrix', figsize = (10,7), cmap='cool', center=None):
    """
    """
    df_cm = pd.DataFrame(m, index = [i for i in litho_class_names],
                      columns = [i for i in litho_class_names])
    plt.figure(figsize = figsize)
    sn.heatmap(df_cm, annot=True, cmap=cmap, center=center)
    plt.title(title)
    plt.xlabel('predicted')
    plt.ylabel('actual')
    plt.show()
    

In [None]:
import scikitplot as skplt

In [None]:
sn.set(font_scale=1.4)

In [None]:
m_regex = build_confusion_matrix(df_regex, colname_true=LITHO_CLASS_COL, colname_predicted=REGEX_LITHO_CLASS_COL)

In [None]:
m = m_regex
m

In [None]:
# Fractions of each predicted class: every column is divided by its own total
m.astype('float') / m.sum(axis=0)[np.newaxis, :]

In [None]:
m = np.array([[5,1],[0,7]])

In [None]:
y_true = [2,2,2,1,2,1,2,2,1,2,1,1,1,1,2,1]
y_pred = [1,2,2,1,2,1,1,2,1,2,1,1,1,1,2,1]

m = confusion_matrix(y_true, y_pred, labels=[1,2])
m

In [None]:
normalise_confusion_matrix(m, 1)

In [None]:
skplt.metrics.plot_confusion_matrix(y_true=y_true, y_pred=y_pred, normalize=False, labels=[1,2], figsize=(7,7))
plt.show()

In [None]:
skplt.metrics.plot_confusion_matrix(y_true=y_true, y_pred=y_pred, normalize=True, labels=[1,2], figsize=(7,7))
plt.show()

In [None]:
fractions_regex = normalise_confusion_matrix(m_regex)

m_dl = build_confusion_matrix(predictions_dl, colname_true=LITHO_CLASS_COL, colname_predicted=DL_LITHO_CLASS_COL)
fractions_dl = normalise_confusion_matrix(m_dl)

m_dl

In [None]:
diff_frac = fractions_dl - fractions_regex

In [None]:
from sklearn.metrics import classification_report
print(classification_report(df_regex[LITHO_CLASS_COL].values, df_regex[REGEX_LITHO_CLASS_COL].values, labels=litho_class_names))

In [None]:
print(classification_report(predictions_dl[LITHO_CLASS_COL].values, predictions_dl[DL_LITHO_CLASS_COL].values, labels=litho_class_names))

In [None]:
plot_cf_matrix(m_regex, litho_class_names, title='confusion matrix - Regex')

In [None]:
plot_cf_matrix(fractions_regex, litho_class_names, title='confusion matrix - Regex')

In [None]:
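# Keyword arguments make explicit which column holds the true labels and which the predictions
skplt.metrics.plot_confusion_matrix(y_true=df_regex[LITHO_CLASS_COL], y_pred=df_regex[REGEX_LITHO_CLASS_COL], normalize=True, labels=litho_class_names, figsize=(7,7))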
plt.show()

In [None]:
plot_cf_matrix(m_dl, litho_class_names, title='confusion matrix - DL')

In [None]:
plot_cf_matrix(fractions_dl, litho_class_names, title='confusion matrix - DL')

In [None]:
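# Keyword arguments make explicit which column holds the true labels and which the predictions
skplt.metrics.plot_confusion_matrix(y_true=predictions_dl[LITHO_CLASS_COL], y_pred=predictions_dl[DL_LITHO_CLASS_COL], normalize=True, labels=litho_class_names, figsize=(7,7))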
plt.show()

In [None]:
plot_cf_matrix(diff_frac, litho_class_names, title='', cmap="PiYG", center=0)

In [None]:
# Rows are actual classes and columns are predicted classes, matching the axis labels set by plot_cf_matrix
plot_cf_matrix(diff_frac * 100, litho_class_names, title='DL vs regexp confusion matrix comparison (%)', cmap='bwr', center=0)

In [None]:
# double-checking the normalisation procedure
#true_counts = m.astype(np.float).sum(axis=1) # true preds
#f = m / true_counts
#f
#f = m / m.astype(np.float).sum(axis=0)
#f
#f.sum(axis=0), f.sum(axis=1)

len(df_regex)

With DL there is a drop in the accuracy of predicting basalt as a simplified lithology. Can we get a hint as to why?
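
The subsetting logic can be illustrated on a handful of made-up labels. The arrays below are hypothetical and assume that the regex and DL predictions are aligned row by row with the actual labels.

```python
import numpy as np

# Made-up labels for five log entries (illustration only).
actual = np.array(["basalt", "basalt", "alluvium", "basalt", "bedrock"])
regex = np.array(["basalt", "basalt", "alluvium", "unknown", "basalt"])
dl = np.array(["basalt", "alluvium", "alluvium", "basalt", "basalt"])

is_basalt = actual == "basalt"
regex_hit = is_basalt & (regex == "basalt")   # regex correct on true basalt
dl_miss = is_basalt & (dl != "basalt")        # DL wrong on true basalt

# Entries where the regex succeeds and DL fails: candidates for inspection.
regex_only = np.flatnonzero(regex_hit & dl_miss)
# Entries DL labels basalt although the actual class differs (false positives).
dl_false_pos = np.flatnonzero(~is_basalt & (dl == "basalt"))

print(regex_only, dl_false_pos)
```

The selected row positions can then be used to pull out the matching descriptions for a wordcloud or a random sample.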

In [None]:
df_regex.head()

In [None]:
predictions_dl.head()

In [None]:
len(predictions_dl), len(df_regex)

In [None]:
successful_re = np.logical_and((df_regex.Regex_lithoclass == "basalt"), (df_regex.Simplified_lithology == "basalt"))
successful_dl = np.logical_and((df_regex.Simplified_lithology == "basalt"), (predictions_dl[DL_LITHO_CLASS_COL] == "basalt"))
unsuccessful_dl = np.logical_and((df_regex.Simplified_lithology == "basalt"), (predictions_dl[DL_LITHO_CLASS_COL] != "basalt"))

v = np.logical_and(successful_re, unsuccessful_dl)

In [None]:
basalt_dl_fail = predictions_dl.loc[v]

In [None]:
flat = flat_list_tokens(basalt_dl_fail[LITHO_DESC_COL].values)
s = ' '.join(flat)
show_wordcloud(s, title = 'DL fail on basalt')

In [None]:
basalt_dl_fail.sample(n=20, frac=None, replace=False, weights=None, random_state=1)

In [None]:
len(basalt_dl_fail), np.sum(successful_re), np.sum(unsuccessful_dl), np.sum(successful_dl)

Conversely, when did the NLP/DL method get a false positive on Basalt?

In [None]:
#successful_re = np.logical_and((df_regex.Regex_lithoclass == "basalt"), (df_regex.Simplified_lithology == "basalt"))
false_pos_basalt_dl = np.logical_and((df_regex.Simplified_lithology != "basalt"), (predictions_dl.Simplified_Lithology == "basalt"))

In [None]:
basalt_dl_false_pos = predictions_dl.loc[false_pos_basalt_dl]

In [None]:
flat = flat_list_tokens(basalt_dl_false_pos[LITHO_DESC_COL].values)
s = ' '.join(flat)
show_wordcloud(s, title = 'DL False positive on basalt')

In [None]:
basalt_dl_false_pos.sample(n=20, frac=None, replace=False, weights=None, random_state=1)

In [None]:
len(basalt_dl_false_pos)

# Checking the accuracy if geolocation and log descriptions are taken as input features 

We fit a logistic regression to classify lithologies based on geolocation, then combine the outputs of the logistic regression model and the deep learning model.
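
Combining the two models amounts to averaging their class probabilities and picking the most probable class (sometimes called soft voting). The sketch below uses made-up probabilities and assumes that both arrays list the classes in the same column order, which needs to be checked for the actual models.

```python
import numpy as np

# Made-up class probabilities for three samples and three classes.
# Both arrays must use the same class order along the columns.
proba_location = np.array([[0.6, 0.3, 0.1],
                           [0.2, 0.5, 0.3],
                           [0.4, 0.4, 0.2]])
proba_text = np.array([[0.2, 0.7, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.7, 0.2, 0.1]])

# Soft voting: average the two probability estimates, then pick the
# most probable class for each sample.
combined = np.mean([proba_location, proba_text], axis=0)
predicted_class = np.argmax(combined, axis=1)

print(combined)
print(predicted_class)
```

An unweighted mean gives both models the same say; a weighted mean would be one way to favour the stronger model.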

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [None]:
train_data = train_data_unprocessed.copy()
set(train_data['Simplified_lithology'].values)

In [None]:
train_data['Simplified_lithology']=train_data['Simplified_lithology'].fillna('Unknown')

In [None]:
train_data['Simplified_lithology']=train_data['Simplified_lithology'].str.lower()

In [None]:
set(train_data['Simplified_lithology'].values)

In [None]:
train_data['Simplified_lithology'],labels=pd.factorize(train_data['Simplified_lithology'])

In [None]:
n_train = len(model.train_X)
# Missing coordinates are replaced by 0.0 so that the classifier can be fitted
train_X=train_data[['EASTING','NORTHING']][0:n_train].fillna(0.0)
test_X=train_data[['EASTING','NORTHING']][n_train:].fillna(0.0)
train_y=train_data['Simplified_lithology'][0:n_train]
test_y=train_data['Simplified_lithology'][n_train:]
print(train_X.shape)
print(train_y.shape)


In [None]:
clf=LogisticRegression(C=0.01)

In [None]:
clf.fit(train_X,train_y)

In [None]:
y_pred=clf.predict_proba(test_X)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
train_data=train_data_unprocessed.copy()
new_df=pd.DataFrame(train_data[['Lithology_original']].values,columns=['Description'])

In [None]:
new_df.head()

In [None]:
new_df = new_df[len(model.train_X):].copy()

In [None]:
y_pred_dl=model.predict_certainity(new_df)

In [None]:
y_pred, y_pred_dl

In [None]:
print(y_pred.shape)
print(y_pred_dl.shape)

In [None]:
final_class_prob=np.mean(np.array([y_pred,y_pred_dl]), axis=0 )

In [None]:
print(final_class_prob.shape)

In [None]:
final_output_numerical=np.argmax(final_class_prob,axis=1)

In [None]:
print(accuracy_score(test_y,final_output_numerical))

In [None]:
simplified_lithology_categories=[]
for x in final_output_numerical:
    simplified_lithology_categories.append(labels[x])

Adding the geolocation does not increase the accuracy. We do not think the geolocation has an impact on the simplified lithologies.

# Ontology based learning

Python provides libraries such as RDFLib and Owlready2 for working with ontologies. Owlready2 is better suited to ontology-oriented programming since it offers a more pythonic way of managing and creating ontologies.
RDFLib, on the other hand, works well with .rdf files. There is a trade-off, and the choice depends on the needs of each project. <br>

Protege is an 'IDE' which can be used to manage and create ontologies. It was developed at Stanford and can be downloaded from the [Protege](https://protege.stanford.edu/) website.<br>
Protege can also be used to visualise the different relationships defined in an ontology. <br>

The ontology used here can be downloaded from [here](http://ontologydesignpatterns.org/wiki/Ontology:CGI_Simple_Lithology_201001).

In [None]:
import rdflib as rd

In [None]:
from rdflib.namespace import SKOS
from rdflib.namespace import RDFS

In [None]:
ontology=rd.Graph()
condamine_litho_ontology = os.path.join(condamine_litho_dir, 'SimpleLithology201001.rdf')
ontology.parse(condamine_litho_ontology)

In [None]:
lithology_dictionary=dict()
for x,y in ontology.subject_objects(SKOS.prefLabel):
    lithology_dictionary[x]=y
    

In [None]:
for x,y in ontology.subject_objects(SKOS.broader):
    if x in lithology_dictionary.keys() and y in lithology_dictionary.keys():
        print(" broader class of ",lithology_dictionary[x]," is ",lithology_dictionary[y])

<h5>These relationships can be converted into fuzzy if-then rules and fed to the machine learning model to help it make better decisions. <br>
The above shows only one kind of relationship. Others are available, such as narrower classes and the description of each label. Combined, they would make for a very "knowledgeable" machine learning model. </h5>
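
As a very simplified illustration of turning 'broader' links into rules, the sketch below walks a child-to-parent table and answers "is X a kind of Y?". The rule is crisp rather than fuzzy. The `broader` table, its concept names and both functions are hypothetical; they are not read from the CGI vocabulary file.

```python
# Hypothetical 'broader' links (child -> parent), standing in for the
# SKOS.broader triples read from the ontology; the concept names are
# illustrative and not taken from the CGI vocabulary file.
broader = {
    "basalt": "basic igneous rock",
    "basic igneous rock": "igneous rock",
    "granite": "igneous rock",
    "igneous rock": "rock",
    "clay": "sediment",
}

def ancestors(concept):
    """Return the chain of broader concepts from `concept` up to the root."""
    chain = []
    seen = {concept}
    while concept in broader:
        concept = broader[concept]
        if concept in seen:  # guard against cycles in malformed data
            break
        chain.append(concept)
        seen.add(concept)
    return chain

def is_a(concept, candidate):
    """If-then style rule: `concept` is a kind of `candidate`."""
    return candidate == concept or candidate in ancestors(concept)

print(ancestors("basalt"))
print(is_a("basalt", "igneous rock"), is_a("clay", "igneous rock"))
```

A fuzzy variant could, for instance, attach a weight to each link that decreases with the distance between the two concepts.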

### Comparing visually

Optional



## Observations and discussions

Overall, DL performs much better than regular expressions.<br>

Regular expressions use a set of keywords and look for matches in the descriptions. Sometimes, these keywords are not present in the descriptions.<br>

In addition, regular expressions map descriptions to a set of predefined categories (clay, sand, etc.), which are then grouped into broader categories (alluvium, basalt, bedrock). Geoscientists confirm that this mapping is not strictly one-to-one or many-to-one. For example, clay could be part of either alluvium or basalt. This cannot be captured by the regular expression model.

DL, on the other hand, learns the descriptions of the different lithology classes. Based on labelled data from geoscientists, the DL model learns the different kinds of descriptions that pertain to, for example, alluvium.




## Conclusion and future work