# Lithology classification

## Data

This notebook uses data from the [Upper Condamine Catchment](http://www.bom.gov.au/qld/flood/brochures/condamine_balonne/map_upper.shtml) in the state of Queensland. The data was obtained through personal communication as a project output. It may be shared publicly and made downloadable from this sample notebook in the future.

![Upper Condamine catchment formations](img/Upper_Condamine_formations.png "Upper Condamine catchment formations")

(Figure from [this paper](https://www.researchgate.net/figure/Upper-Condamine-catchment-Queensland-Australia-The-Marburg-Subgroup-consists-of_fig1_283184727))

## Status

As of May 2019, this document is an output of exploratory work done during an internship by [Sudhir Gupta](https://github.com/Sudhir22).


## Purpose

This notebook compares the performance of two techniques for semi-automated lithology classification: regular expressions and a deep learning model. It also summarises work on using ontologies for classification in cases where we do not have reliable training sets.


## Importing Python packages

In [None]:
import os
import re  # used further down to compile the lithology marker pattern
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show
import geopandas as gpd
import pickle

In [None]:
# Only True for co-dev of ela from this use case:
ela_from_source = False
ela_from_source = True

In [None]:
if ela_from_source:
    if ('ELA_SRC' in os.environ):
        root_src_dir = os.environ['ELA_SRC']
    elif sys.platform == 'win32':
        root_src_dir = r'C:\Users\SUD011\Documents\pyela-sudhir'
    else:
        username = os.environ['USER']
        root_src_dir = os.path.join('/home', username, 'src/github_jm/pyela')
    pkg_src_dir = root_src_dir
    sys.path.append(pkg_src_dir)

from ela.textproc import *
from ela.utils import *
from ela.classification import *
from ela.visual import *

In [None]:
import striplog
from striplog import Lexicon

In [None]:
if ('ELA_DATA' in os.environ):
    data_path = os.environ['ELA_DATA']
elif sys.platform == 'win32':
    data_path = r'C:\data\Lithology'
else:
    username = os.environ['USER']
    data_path = os.path.join('/home', username, 'data', 'Lithology')

condamine_litho_dir = os.path.join(data_path,'Condamine')
condamine_litho_xl = os.path.join(condamine_litho_dir, 'MASTER_CONDAMINE_Interpretation_all_combined_Jan2017.xlsx')
condamine_litho_pkl = os.path.join(condamine_litho_dir, 'MASTER_CONDAMINE_Interpretation_all_combined_Jan2017.pkl')

# Exploring the lithology

In [None]:
if not os.path.exists(condamine_litho_pkl):
    train_data=pd.read_excel(condamine_litho_xl)
    with open(condamine_litho_pkl, 'wb') as handle:
        pickle.dump(train_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
else:
    with open(condamine_litho_pkl, 'rb') as handle:
        train_data = pickle.load(handle)

In [None]:
train_data.columns

In [None]:
train_data.head()

In [None]:
train_data_unprocessed = train_data.copy()

In [None]:
LITHO_CLASS_COL = 'Simplified_lithology'
set(train_data[LITHO_CLASS_COL].values)

In [None]:
LITHO_CLASS_STRATA_COL = 'Simplified_lithology_stratigraphy'
set(train_data[LITHO_CLASS_STRATA_COL].values)

We clean up the column of simplified lithologies, which results from a manual classification: missing values become empty strings and all labels are lower-cased.

In [None]:
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace(np.nan,'',regex=True)
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].str.lower()

In [None]:
set(train_data[LITHO_CLASS_COL].values)

In [None]:
token_freq(train_data[LITHO_CLASS_COL].values, 50)

In [None]:
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('granite|granodiorite|diorite|basement','bedrock',regex=True)
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('gravel','alluvium',regex=False)
train_data[LITHO_CLASS_COL] = train_data[LITHO_CLASS_COL].replace('wrong_location|weathering_horizon|tertiary','unknown',regex=True)

In [None]:
token_freq(train_data[LITHO_CLASS_COL].values, 50)

In [None]:
df = train_data
LITHO_DESC_COL='Lithology_original'

In [None]:
descs = df[LITHO_DESC_COL]
descs = descs.reset_index()

In [None]:
# This is not obvious from inspecting the pandas data frame, but there appear to be NaNs that cause headaches later on.
vv = [x for x in df[LITHO_DESC_COL].values if not type(x) is str]

In [None]:
vv = [type(x) is not str for x in df[LITHO_DESC_COL].values]

In [None]:
df.loc[np.array(vv)].head()

In [None]:
descs[LITHO_DESC_COL] = descs[LITHO_DESC_COL].replace(np.nan,'',regex=True)
descs[LITHO_DESC_COL] = descs[LITHO_DESC_COL].str.lower()

descs = descs[LITHO_DESC_COL]
descs.head()

In [None]:
lex = Lexicon.default()

In [None]:
len(descs)

In [None]:
%%time
expanded_descs = descs.apply(lex.expand_abbreviations)
y = expanded_descs.values

In [None]:
y = v_lower(y)

In [None]:
y

In [None]:
%%time
flat = flat_list_tokens(y)
len(set(flat))

In [None]:
df_most_common= token_freq(flat, 50)

In [None]:
df_most_common

In [None]:
plot_freq(df_most_common)

# Defining lithology classes

In [None]:
lithologies = ['alluvium', 'basalt', 'bedrock', 'clay', 'sandstone', 'sand', 'shale', 'soil', 'honeycomb', 'gravel', 'coal', 'silt', 'rock', 'limestone']

any_litho_markers_re = r'alluvium|sand|clay|ston|shale|basa|silt|soil|honey|coal|gravel|rock|mud'
regex = re.compile(any_litho_markers_re)

lithologies_dict = dict([(x,x) for x in lithologies])
lithologies_dict['sands'] = 'sand'
lithologies_dict['clays'] = 'clay'
lithologies_dict['shales'] = 'shale'
lithologies_dict['claystone'] = 'clay'
lithologies_dict['siltstone'] = 'silt'
lithologies_dict['mudstone'] = 'silt' # ??
lithologies_dict['capstone'] = 'limestone' # ??
lithologies_dict['ironstone'] = 'sandstone' # ??
lithologies_dict['topsoil'] = 'soil' # ??

lithologies_adjective_dict = {
    'sandy' :  'sand',
    'clayey' :  'clay',
    'clayish' :  'clay',
    'shaley' :  'shale',
    'silty' :  'silt',
    'gravelly' :  'gravel'
}

In [None]:
t = y 

In [None]:
v_tokens = v_word_tokenize(t)
vt = v_find_litho_markers(v_tokens, regex=regex)

In [None]:
zero_mark = len([x for x in vt if len(x) == 0 ])
at_least_one_mark = len([x for x in vt if len(x) >= 1])
at_least_two_mark = len([x for x in vt if len(x) >= 2])
print('There are %s entries with no marker, %s entries with at least one, %s with at least two'%(zero_mark,at_least_one_mark,at_least_two_mark))
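The marker search above can be illustrated with a minimal, standard-library-only sketch. It is a simplified stand-in and not the `ela` implementation: the helper names `find_markers` and `first_known_class` are hypothetical, the dictionary is a small subset of `lithologies_dict`, and the "first known marker wins" rule is an assumption made for illustration.

```python
import re

# Same marker pattern as the first pass in this notebook (before 'loam' is added).
marker_re = re.compile(r'alluvium|sand|clay|ston|shale|basa|silt|soil|honey|coal|gravel|rock|mud')

# A small subset of the token-to-class dictionary defined above.
class_of = {'sand': 'sand', 'sands': 'sand', 'clay': 'clay', 'clays': 'clay',
            'sandstone': 'sandstone', 'claystone': 'clay', 'basalt': 'basalt'}

def find_markers(description):
    """Return the tokens of a description that contain a lithology marker."""
    tokens = re.findall(r'[a-z]+', description.lower())
    return [tok for tok in tokens if marker_re.search(tok)]

def first_known_class(markers):
    """Hypothetical rule: the first marker found in the dictionary wins."""
    for tok in markers:
        if tok in class_of:
            return class_of[tok]
    return ''  # no usable marker; the notebook later maps '' to 'unknown'

examples = ['BROWN CLAY AND PIPE CLAY', 'Sandy clays over sandstone', 'hard blue metal']
results = {d: (find_markers(d), first_known_class(find_markers(d))) for d in examples}
for description, (markers, litho) in results.items():
    print(description, '->', markers, '->', repr(litho))
```

Note that 'sandy' is picked up as a marker but is not a key of the dictionary, so it is skipped in this sketch; an adjective dictionary such as `lithologies_adjective_dict` would be needed to make use of it.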

# Testing the regular expression model on the Condamine dataset

In [None]:
prim_litho = [find_primary_lithology(x, lithologies_dict) for x in vt]

In [None]:
n = len(set(prim_litho))
plot_freq(token_freq(prim_litho, n_most_common = n))

In [None]:
lithology_map={
    'alluvium' :  'alluvium',
    'bedrock' :  'bedrock',
    'basalt' :  'basalt',
    'honeycomb' :  'basalt',
    'clay' :  'bedrock',
    'coal' :  'bedrock', 
    'sandstone' :  'bedrock',
    'sand' :  'alluvium',
    '' :  'unknown',
    'soil' :  'alluvium',
    'shale': 'bedrock',
    'gravel': 'alluvium',
    'silt' : 'bedrock',
    'rock' : 'bedrock',
    'limestone' : 'alluvium'
}


In [None]:
final_prim_litho=list()
for x in prim_litho:
    final_prim_litho.append(lithology_map[x])


In [None]:
token_freq(final_prim_litho)

In [None]:
simplified_lithology=train_data[LITHO_CLASS_COL].tolist()

def get_accuracy(final_prim_litho, actual):
    """Fraction of entries whose predicted class matches the manual class (case-insensitive)."""
    matches = sum(p.lower() == a.lower() for p, a in zip(final_prim_litho, actual))
    return matches / len(final_prim_litho)

print("Accuracy of regex for classifying primary lithologies: ", get_accuracy(final_prim_litho, simplified_lithology))
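A single accuracy figure hides which classes the regular expressions get wrong. The following is a minimal sketch of a per-class breakdown using only the standard library. The helper names `accuracy_by_class` and `confusion_counts` are hypothetical, the labels are toy values rather than Condamine results, and the labels are assumed to be lower-cased already.

```python
from collections import Counter

def accuracy_by_class(predicted, actual):
    """Return {actual_class: (n_correct, n_total)} for two equal-length label lists."""
    totals = Counter(actual)
    correct = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {label: (correct[label], totals[label]) for label in totals}

def confusion_counts(predicted, actual):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Toy labels for illustration only, not results from the Condamine data.
actual = ['alluvium', 'alluvium', 'bedrock', 'bedrock', 'basalt']
predicted = ['alluvium', 'bedrock', 'bedrock', 'bedrock', 'unknown']

per_class = accuracy_by_class(predicted, actual)
pairs = confusion_counts(predicted, actual)
for label, (n_correct, n_total) in per_class.items():
    print('%s: %d of %d correct' % (label, n_correct, n_total))
```

In the notebook, the same functions could be called with `final_prim_litho` and `simplified_lithology` to see, for instance, how many manually labelled alluvium entries end up as 'unknown'.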


In [None]:

REGEX_LITHO_CLASS_COL='Regex_lithoclass'

blah = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: final_prim_litho, LITHO_DESC_COL: descs})

In [None]:
quick_check_lithoclass(blah, 'unknown', colname=REGEX_LITHO_CLASS_COL, size=100)

In [None]:
from wordcloud import WordCloud,STOPWORDS

In [None]:
def show_wordcloud(text, title = None, max_words=200, max_font_size=40, seed=1, scale=3, figsize=(12, 12)):
    """Plot wordclouds from text

        Args:
            text (str or list of str): text to depict
    """
    if isinstance(text, list):
        text = ' '.join(text)
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=max_words,
        max_font_size=max_font_size, 
        scale=scale,
        random_state=seed
    ).generate(text)
    fig = plt.figure(1, figsize=figsize)
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

In [None]:
df_test = blah.loc[ blah[REGEX_LITHO_CLASS_COL] == 'unknown' ]
df_test.head()

In [None]:
flat = flat_list_tokens(df_test[LITHO_DESC_COL].values)

In [None]:
s = ' '.join(flat)

In [None]:
show_wordcloud(s, title = 'Unclassified via regexp')

In [None]:
plot_freq(token_freq(flat, n_most_common = 30))

In [None]:
lithologies.append('loam')

In [None]:
any_litho_markers_re = any_litho_markers_re + '|loam'
regex = re.compile(any_litho_markers_re)
lithologies_dict['loam'] = 'loam'

In [None]:
v_tokens = v_word_tokenize(t)
vt = v_find_litho_markers(v_tokens, regex=regex)

In [None]:
prim_litho = [find_primary_lithology(x, lithologies_dict) for x in vt]

In [None]:
n = len(set(prim_litho))
plot_freq(token_freq(prim_litho, n_most_common = n))

In [None]:

blah = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: prim_litho, LITHO_DESC_COL: descs})

In [None]:
quick_check_lithoclass(blah, 'loam', colname=REGEX_LITHO_CLASS_COL, size=100)

In [None]:
lithology_map['loam'] = 'alluvium'

In [None]:
final_prim_litho=list()
for x in prim_litho:
    final_prim_litho.append(lithology_map[x])


In [None]:
token_freq(final_prim_litho)

In [None]:
print("Accuracy of regex for classifying primary lithologies: ", get_accuracy(final_prim_litho, simplified_lithology))

In [None]:
blah = pd.DataFrame({ LITHO_CLASS_COL: train_data[LITHO_CLASS_COL], REGEX_LITHO_CLASS_COL: final_prim_litho, LITHO_DESC_COL: descs})

In [None]:
df_test = blah.loc[ blah[REGEX_LITHO_CLASS_COL] == 'unknown' ]
df_test

In [None]:
flat = flat_list_tokens(df_test[LITHO_DESC_COL].values)

In [None]:
s = ' '.join(flat)

In [None]:
show_wordcloud(s, title = 'Unclassified via regexp')

In [None]:
print("Accuracy of regex for classifying primary lithologies: ", get_accuracy(final_prim_litho, simplified_lithology))

# Testing the deep learning model on the same dataset

In [73]:
# train_data=train_data_unprocessed.copy()

In [74]:
# conda install gensim
# conda install tensorflow
# conda install keras
# pip install wordcloud

from ela.experiment.textproc import Model

Using TensorFlow backend.


In [75]:
model=Model(train_data,20)

In [76]:
model.initialise_model()

    RN                  Type   EASTING   NORTHING        Lithology_original  \
0  202  new_bedrock_addition  385411.0  7014147.0                BROWN CLAY   
1  202  new_bedrock_addition  385411.0  7014147.0  BROWN CLAY AND PIPE CLAY   
2  202  new_bedrock_addition  385411.0  7014147.0           WHITE PIPE CLAY   
3  202  new_bedrock_addition  385411.0  7014147.0             RED PIPE CLAY   
4  202  new_bedrock_addition  385411.0  7014147.0             CLAY AND SAND   

  Simplified_lithology Simplified_lithology_stratigraphy   From     To  
0              bedrock                           BEDROCK   0.00  10.67  
1              bedrock                           BEDROCK  10.67  24.38  
2              bedrock                           BEDROCK  24.38  26.82  
3              bedrock                           BEDROCK  26.82  31.70  
4              bedrock                           BEDROCK  31.70  50.90  

The deep learning model gives an accuracy of 87%.

# Checking the accuracy if geolocation and log descriptions are taken as input features

A logistic regression classifies lithologies based on geolocation alone. Its class probabilities are then combined with those of the deep learning model.

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [None]:
train_data = train_data_unprocessed.copy()
set(train_data['Simplified_lithology'].values)

In [None]:
train_data['Simplified_lithology']=train_data['Simplified_lithology'].replace(np.nan,'Unknown',regex=True)

In [None]:
train_data['Simplified_lithology']=train_data['Simplified_lithology'].str.lower()

In [None]:
set(train_data['Simplified_lithology'].values)

In [None]:
train_data['Simplified_lithology'],labels=pd.factorize(train_data['Simplified_lithology'])

In [None]:
n_train = len(model.train_X)
geo_cols = ['EASTING', 'NORTHING']
# fillna returns new frames, so NaNs are filled without modifying a slice of train_data in place
train_X = train_data[geo_cols][:n_train].fillna(0.0)
test_X = train_data[geo_cols][n_train:].fillna(0.0)
train_y = train_data['Simplified_lithology'][:n_train]
test_y = train_data['Simplified_lithology'][n_train:]
print(train_X.shape)
print(train_y.shape)


In [None]:
clf=LogisticRegression(C=0.01)

In [None]:
clf.fit(train_X,train_y)

In [None]:
y_pred=clf.predict_proba(test_X)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
train_data=train_data_unprocessed.copy()
new_df=pd.DataFrame(train_data[['Lithology_original']].values,columns=['Description'])

In [None]:
new_df.head()

In [None]:
new_df = new_df[len(model.train_X):].copy()

In [None]:
y_pred_dl=model.predict_certainity(new_df)

In [None]:
y_pred, y_pred_dl

In [None]:
print(y_pred.shape)
print(y_pred_dl.shape)

In [None]:
final_class_prob=np.mean(np.array([y_pred,y_pred_dl]), axis=0 )

In [None]:
print(final_class_prob.shape)

In [None]:
final_output_numerical=np.argmax(final_class_prob,axis=1)

In [None]:
print(accuracy_score(test_y,final_output_numerical))

In [None]:
simplified_lithology_categories=[]
for x in final_output_numerical:
    simplified_lithology_categories.append(labels[x])

Combining the two models does not increase the accuracy, which suggests that geolocation alone has little influence on the simplified lithology classes.
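Averaging two probability arrays is only meaningful if column `j` refers to the same class in both. For the logistic regression the column order follows `clf.classes_`; the column order returned by `model.predict_certainity` is not shown in this notebook, so it is worth checking. The sketch below shows one way to align columns by label before averaging. The helper names `align_columns` and `soft_vote` are hypothetical, and the probabilities are toy values.

```python
import numpy as np

def align_columns(proba, classes, target_classes):
    """Reorder probability columns from `classes` order to `target_classes` order.

    Classes that a model never saw get a column of zeros.
    """
    classes = list(classes)
    aligned = np.zeros((proba.shape[0], len(target_classes)))
    for j, label in enumerate(target_classes):
        if label in classes:
            aligned[:, j] = proba[:, classes.index(label)]
    return aligned

def soft_vote(proba_a, classes_a, proba_b, classes_b):
    """Average two models' class probabilities after aligning their columns."""
    target = sorted(set(classes_a) | set(classes_b))
    mean = (align_columns(proba_a, classes_a, target)
            + align_columns(proba_b, classes_b, target)) / 2.0
    labels = [target[i] for i in mean.argmax(axis=1)]
    return labels, mean

# Toy probabilities: the two models list their classes in a different order.
proba_a = np.array([[0.9, 0.1], [0.4, 0.6]])
classes_a = ['alluvium', 'bedrock']
proba_b = np.array([[0.2, 0.8], [0.7, 0.3]])
classes_b = ['bedrock', 'alluvium']

voted_labels, mean_proba = soft_vote(proba_a, classes_a, proba_b, classes_b)
print(voted_labels)
```

Without the alignment step, the second row would average 0.4 with 0.7 for 'alluvium' and the vote would flip, which is the kind of silent error this check is meant to catch.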

# Ontology based learning

Python provides libraries such as RDFLib and Owlready2 for working with ontologies. Owlready2 is better suited to ontology-oriented programming, since it offers a more Pythonic way of creating and managing OWL 2 ontologies.
RDFLib, on the other hand, works well with .rdf files. There is a trade-off, and the choice depends on the needs of the individual project. <br>

Protégé is an 'IDE' for creating and managing ontologies. It was developed at Stanford and can be downloaded from the [Protégé website](https://protege.stanford.edu/).<br>
Protégé can also be used to visualise the relationships defined in an ontology. <br>

The ontology that I am using can be downloaded from [here](http://ontologydesignpatterns.org/wiki/Ontology:CGI_Simple_Lithology_201001)

In [None]:
import rdflib as rd

In [None]:
from rdflib.namespace import SKOS
from rdflib.namespace import RDFS

In [None]:
ontology=rd.Graph()
condamine_litho_ontology = os.path.join(condamine_litho_dir, 'SimpleLithology201001.rdf')
ontology.parse(condamine_litho_ontology)

In [None]:
lithology_dictionary=dict()
for x,y in ontology.subject_objects(SKOS.prefLabel):
    lithology_dictionary[x]=y
    

In [None]:
for x,y in ontology.subject_objects(SKOS.broader):
    if x in lithology_dictionary.keys() and y in lithology_dictionary.keys():
        print(" broader class of ",lithology_dictionary[x]," is ",lithology_dictionary[y])

<h5>These relationships can be converted into fuzzy if-then rules and fed to the machine learning model to help it make better decisions. <br>
The above shows only one kind of relationship. Others are available, such as narrower concepts and the description of each label. Combined, these would make for a very "knowledgeable" machine learning model. </h5>
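As a first step towards such rules, the broader relationships can be followed from a term up to its most general concept. The sketch below uses a plain dictionary in place of the RDF graph. The pairs are illustrative only and are not taken from the ontology file, the helper names `ancestors` and `rules_for` are hypothetical, and each term is assumed to have a single broader concept. The generated rules are crisp strings; how to weight them as fuzzy rules is left open.

```python
def ancestors(term, broader):
    """Follow child -> parent links until a term has no broader concept."""
    chain = []
    seen = {term}
    while term in broader:
        term = broader[term]
        if term in seen:  # guard against accidental cycles in the input
            break
        chain.append(term)
        seen.add(term)
    return chain

def rules_for(term, broader):
    """Build one crisp if-then rule per ancestor of a term."""
    return ['IF description mentions "%s" THEN candidate class "%s"' % (term, a)
            for a in ancestors(term, broader)]

# Illustrative pairs only; in the notebook they could be collected from
# ontology.subject_objects(SKOS.broader) using the preferred labels.
broader = {
    'sandstone': 'clastic sedimentary rock',
    'clastic sedimentary rock': 'sedimentary rock',
    'sedimentary rock': 'rock',
    'basalt': 'igneous rock',
    'igneous rock': 'rock',
}

sandstone_chain = ancestors('sandstone', broader)
for rule in rules_for('basalt', broader):
    print(rule)
```

A SKOS concept may have several broader concepts, in which case the dictionary values would need to be lists and the walk would become a graph traversal.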

### Comparing visually

Optional



## Observations and discussions

The deep learning (DL) model performs much better than regular expressions.<br>

Regular expressions use a set of keywords and look for matches in the descriptions. Sometimes none of these keywords appear in a description, which then remains unclassified.<br>

In addition, the regular expression model maps descriptions to a set of predefined categories (clay, sand, etc.), which are then grouped into broader categories (alluvium, basalt, bedrock). Geoscientists confirm that this mapping is not strictly one-to-one or many-to-one. For example, clay could be part of either alluvium or basalt. The regular expression model cannot capture this.

The DL model, on the other hand, learns from the descriptions of the different lithology classes. Based on data labelled by geoscientists, it learns the kinds of descriptions that pertain to, for example, alluvium.




## Conclusion and future work