<a href="https://colab.research.google.com/github/Yolantele/ML-data-clasifier/blob/master/NL_SpaCy_ML_Classifier_for_Waste_Data_Augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Dutch Waste Data Classification and Augmentation** 

Performs text analysis with spaCy and builds a machine learning model with scikit-learn.


scikit-learn :
https://scikit-learn.org/stable/

spaCy Language Models:
https://spacy.io/usage/models

scikit-learn + spaCy:
https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

Eural Code reference: 
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:02000D0532-20150601


Entity Analysis: https://spacy.io/api/token#attributes


In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
# mount data from drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install -U spacy
!pip install pandas
!python -m spacy download nl_core_news_md

## **Loading the Language Model and Data Frames**

In [None]:
import spacy
import pandas as pd
from spacy.lang.nl import Dutch
import nl_core_news_md
spacy.prefer_gpu()

nlp = nl_core_news_md.load()  # medium-sized language model (same results as the large language model)

In [None]:
path = '/content/drive/My Drive/data/'

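# training data frame (path already ends with a slash):
all_data = pd.read_csv(path + 'nlDataRich.csv')


# test data frame:
materials_test = pd.read_csv(path + 'nlWithoutMaterialData.csv')

# working data frame:
df = all_data


print(df.shape)  # a bare expression is only displayed when it is the last line of a cell
df.info()
df.head()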


## **Tokenizing the Data With spaCy**

Create a custom tokenizer function using spaCy that automatically strips unnecessary information such as stop words and punctuation.

In [None]:
from spacy.lang.nl.stop_words import STOP_WORDS


parser = Dutch()
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
stopwords = STOP_WORDS
# stopwords = [] # in some cases keeping stop words helps (accuracy increases by 1%)

def spacy_tokenize(sentence):
    # Normalize the text, then tokenize it into a spaCy Doc
    sentence = sentence.strip().lower()
    mytokens = parser(sentence)

    # Lemmatize and lowercase each token; fall back to the token text when no lemma is available
    mytokens = [ (word.lemma_ or word.text).lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove empty strings, punctuation and stop words
    mytokens = [ word for word in mytokens if word and word not in punctuations and word not in stopwords ]

    # return preprocessed list of tokens
    return mytokens


### **Vectorization Feature Engineering, TF-IDF, Bag of Words and N-grams**

Classifying text leaves us with text snippets and their respective labels. A machine learning model, however, needs a numeric representation (vector coordinates), so the text has to be converted first.

- **TF-IDF (Term Frequency-Inverse Document Frequency)** - a way of normalizing our Bag of Words (BoW) by weighing each word’s frequency within a document against the number of documents it appears in.

- **N-grams** - combinations of adjacent words in a given text. For example, "who will win":
 1. when n = 1, becomes "who", "will", "win",
 2. when n = 2, becomes "who will", "will win", etc.
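
To make the two ideas concrete, here is a minimal standard-library sketch. The helper names `ngrams` and `tf_idf` and the toy corpus are hypothetical, and the TF-IDF variant shown (raw term count times `log(N / df)`) is the plain textbook form. scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Textbook TF-IDF: term count in a document times log(N / document frequency)."""
    n_docs = len(docs)
    # Count in how many documents each term appears (once per document)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

tokens = "who will win".split()
unigrams = ngrams(tokens, 1)   # ['who', 'will', 'win']
bigrams = ngrams(tokens, 2)    # ['who will', 'will win']

# Toy corpus (made up for illustration): 'afval' occurs in every document,
# so its weight drops to zero, while the rarer words keep a positive weight.
docs = [["gft", "afval"], ["bedding", "afval"], ["hout", "afval"]]
weights = tf_idf(docs)
```

A word that appears in every description carries no information for telling classes apart, which is why its weight collapses to zero in this variant.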

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin

bow_vector = CountVectorizer(tokenizer=spacy_tokenize, ngram_range=(1,1))

tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenize)

# print(bow_vector)
# print(tfidf_vector)

## **Splitting The Data into Training and Validation Sets**


### material classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['material'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['material'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)

0                          SLIB VAN WASSEN EN SCHOONMAKEN
1                                           GFT Afgekeurd
2                                         GFT Categorie 3
3                                                       2
4                                         200 Boomstobben
                              ...                        
5843                                     beddingmateriaal
5844                                        bedding afval
5845    voedings-en genotmiddelen ongeschikt voor cons...
5846                   000 rum ongeschikt voor consumptie
5847                         afval van dierlijke weefsels
Name: description, Length: 5848, dtype: object
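
The `train_test_split` call above holds out 30% of the rows (`test_size=0.3`). For intuition, here is a simplified standard-library sketch of the idea: shuffle the row indices, then slice. It is not scikit-learn's implementation, and `simple_split` and the toy data are hypothetical. A fixed `seed` makes the split reproducible, which `train_test_split` offers through its `random_state` parameter.

```python
import random

def simple_split(features, labels, test_size=0.3, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Features and labels are selected with the same indices, so they stay aligned
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data, made up for illustration
descriptions = [f"omschrijving {i}" for i in range(10)]
materials = ["grond" if i % 2 else "hout" for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(descriptions, materials)
```

Without a fixed seed, every rerun of a split cell produces a different partition, so accuracy figures are not directly comparable between runs.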


### mixedOrPure classification

In [None]:
from sklearn.model_selection import train_test_split


df = df[df['mixedOrPure'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['mixedOrPure'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### consistency classification

In [None]:
from sklearn.model_selection import train_test_split


df = df[df['consistency'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['consistency'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### cleanOrDirty classification

In [None]:
from sklearn.model_selection import train_test_split


df = df[df['cleanOrDirty'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['cleanOrDirty'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)

### directProduct classification

In [None]:
from sklearn.model_selection import train_test_split


df = df[df['directProduct'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['directProduct'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### indirectProduct classification

In [None]:
from sklearn.model_selection import train_test_split


df = df[df['indirectProduct'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['indirectProduct'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### cType classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['cType'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['cType'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### composite1 classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['composite1'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['composite1'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### state classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['state'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['state'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)
print(ylabels)

0                          SLIB VAN WASSEN EN SCHOONMAKEN
1                                           GFT Afgekeurd
2                                         GFT Categorie 3
3                                                       2
4                                         200 Boomstobben
                              ...                        
5843                                     beddingmateriaal
5844                                        bedding afval
5845    voedings-en genotmiddelen ongeschikt voor cons...
5846                   000 rum ongeschikt voor consumptie
5847                         afval van dierlijke weefsels
Name: description, Length: 5848, dtype: object
0       unclassified
1       unclassified
2       unclassified
3       unclassified
4       unclassified
            ...     
5843    unclassified
5844    unclassified
5845    unclassified
5846    unclassified
5847    unclassified
Name: state, Length: 5848, dtype: object


### mType classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['mType'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['mType'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


### otherCode classification

In [None]:
from sklearn.model_selection import train_test_split

df = df[df['otherCode'].notna()]

X = df['description'] # the features we want to analyze
ylabels = df['otherCode'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

print(X)


## **Creating a Pipeline and Generating the Model**

### We’ll create a pipeline with three components: a **cleaner, a vectorizer, and a classifier**.

- The cleaner uses our predictors class object to clean and preprocess the text.

- The vectorizer uses the CountVectorizer object (bow_vector) to create the bag-of-words matrix for our text.

- The classifier is an object that performs logistic regression to classify the waste descriptions.

In [None]:
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}


def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Creating Logistic Regression Classifier
classifier = LogisticRegression()

# Create pipeline using Bag of Words (BoW)
model = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
model.fit(X_train,y_train)

### **Model Predictions - Classifications**

In [None]:
from sklearn import metrics

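row = 20

# Predicting for the validation split
predicted = model.predict(X_test)
print('Using validation set:')
print( X_test.iloc[row], '--->', predicted[row])
# Predict from the description column only; iterating over a whole data frame yields its column names
print( materials_test.iloc[row].description,'--->', model.predict(materials_test['description'].astype(str))[row])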



Using validation set:
PP Zakken ---> unclassified
Bestrijdingsmiddelen in klein verpakking ---> unclassified


### **Model Accuracy Reports**

**Accuracy** refers to the percentage of the total predictions our model makes that are completely correct.

**Precision** describes the ratio of true positives to true positives plus false positives in our predictions.

**Recall** describes the ratio of true positives to true positives plus false negatives in our predictions.
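
As a worked example of these definitions, the following sketch computes the three scores by hand for a single class treated as "positive". The labels are made up for illustration and `binary_scores` is a hypothetical helper. The notebook itself relies on `sklearn.metrics`, which additionally combines the per-class scores (`average='weighted'`).

```python
def binary_scores(y_true, y_pred, positive):
    """Accuracy over all classes, plus precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels, made up for illustration
y_true = ["grond", "grond", "hout", "hout", "glas"]
y_pred = ["grond", "hout", "hout", "grond", "glas"]

accuracy, precision, recall = binary_scores(y_true, y_pred, positive="grond")
# 3 of 5 predictions are correct; for 'grond' there is 1 true positive,
# 1 false positive and 1 false negative.
```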


In [None]:
# Model Accuracy Reports

print("Accuracy:",metrics.accuracy_score(y_test, predicted ))
print("Precision:",metrics.precision_score(y_test, predicted, average='weighted'))
print("Recall:",metrics.recall_score(y_test, predicted, average='weighted'))
print( metrics.classification_report(y_test, predicted))

## NLP - Entity Analysis
Get all materials from the description


#### Unique materials list

In [None]:
# 1. unique existing materials list
MATERIALS = [
  'aardappel',
  'abs',
  'actieve koolstof',
  'aluminium',
  'anorganisch materiaal',
  'asbest',
  'avi-bodemas',
  'bentoniet',
  'bentoniet',
  'beryllium',
  'bier',
  'bitumen',
  'brandstof',
  'brons',
  'brood',
  'cacao',
  'cadmium',
  'cadmium',
  'cement',
  'chloriden',
  'chocolade',
  'chroom',
  'compost',
  'cyaniden',
  'deeg',
  'dioxinen',
  'epdm',
  'epp',
  'eps',
  'fipronil',
  'folie',
  'formaline',
  'gips',
  'glas',
  'glasvezel',
  'glaswol',
  'glycerine',
  'glycol',
  'gom',
  'gras',
  'grind',
  'grond',
  'hars',
  'hdpe',
  'hout',
  'hydrauliek-olie',
  'ijzer',
  'isolatiemateriaal',
  'jute',
  'kaas',
  'kalk',
  'kalksteen',
  'karton',
  'keramiek',
  'keramisch materiaal',
  'klei',
  'kool',
  'koolas',
  'koper',
  'kraftkarton',
  'kunststof',
  'kurk',
  'kwik',
  'kymene',
  'lavaliet',
  'lavasteen',
  'ldpe',
  'lithium',
  'lood',
  'loog',
  'magnesium',
  'material',
  'material2',
  'material3',
  'melamine',
  'melasse',
  'messing',
  'mest',
  'metalen',
  'metalen',
  'mineralen',
  'mo',
  'monochloorbenzeen',
  'natriumhypochloriet',
  'nikkel',
  'nikkel',
  'non-ferro',
  'nox',
  'nylon',
  'ocb',
  'olie',
  'oil',
  'oplosmiddelen',
  'opp',
  'organisch materiaal',
  'pa',
  'pak',
  'papier',
  'paraffine',
  'pc',
  'pcb',
  'pcb',
  'pak',
  'pcfic',
  'pe',
  'perliet',
  'pes',
  'pet',
  'pfos',
  'plastic',
  'pmma',
  'polyester',
  'pom',
  'porselein',
  'pp',
  'propyleenglycol',
  'ps',
  'pvc',
  'rubber',
  'rvs',
  'sbr',
  'schelpen',
  'schuimrubber',
  'sediment',
  'silex',
  'silicium',
  'sintels',
  'siroop',
  'slakken',
  'slakkenwol',
  'soja',
  'staal',
  'steen',
  'steenwol',
  'stenen',
  'stookolie',
  'styreen butadieen rubber',
  'suiker',
  'talksteen',
  'teer',
  'tin',
  'turf',
  'tyleen',
  'ureum',
  'vet',
  'vetzuren',
  'vinasse',
  'vocl',
  'water',
  'wei',
  'wijn',
  'zand',
  'zeoliet',
  'zetmeel',
  'zink',
  'zout',
  'zouten',
  'zuur'
]
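
The list above contains a few repeated entries (for example `'bentoniet'` and `'cadmium'`), and a membership test against a list scans it linearly. One possible refinement, sketched here with a short hypothetical subset `SAMPLE_MATERIALS` rather than the full list, is to store the names in a `set` and to check multi-word names such as `'actieve koolstof'` against word n-grams of the description. `find_materials` is an illustrative helper, not part of the notebook.

```python
# Hypothetical subset of MATERIALS, including one duplicate and one multi-word name
SAMPLE_MATERIALS = ['cacao', 'beton', 'actieve koolstof', 'cadmium', 'cadmium']

material_set = set(SAMPLE_MATERIALS)  # removes duplicates, gives fast lookups
max_words = max(len(name.split()) for name in material_set)

def find_materials(description):
    """Return known materials found in the description, shorter names first."""
    words = description.lower().split()
    found = []
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in material_set and candidate not in found:
                found.append(candidate)
    return found

matches = find_materials("Cacao massa slib met actieve koolstof")
```

This plain string matching does not lemmatize, so inflected forms still need the spaCy-based approach used in the next cells.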

#### Extract unique materials

In [None]:
 !python -m spacy download nl_core_news_md

In [None]:
import spacy
import re
import string
from spacy.lang.nl import Dutch
import nl_core_news_md
from spacy.lang.nl.stop_words import STOP_WORDS

spacy.prefer_gpu()

parser = nl_core_news_md.load()

In [None]:


def treat_sentence(sentence):
  # Normalize a cell value: cast to string, trim, lowercase and drop punctuation
  sentence = str(sentence)
  sentence = sentence.strip().lower()
  # sentence = re.sub(r'\d+', '', sentence) # removes numbers 
  sentence = re.sub(r'[^\w\s]','',sentence) # removes punctuation
  return sentence

def extract_material(doc):
  # Only nouns, proper nouns and interjections are considered as material candidates
  for token in doc:
    if token.pos_ in ('NOUN', 'PROPN', 'INTJ'):
      # print(token, '--', token.pos_, token.lemma_, token.left_edge, token.right_edge)
      # Compare strings against MATERIALS (token.left_edge is a Token, so use its text)
      if token.text in MATERIALS or token.lemma_ in MATERIALS or token.left_edge.text in MATERIALS:
        print('material ---------->',  token.lemma_,  '/', token.left_edge )


def get_described_materials(row):
  print('materials for row ->', row)
  sentence = treat_sentence(df.iloc[row]['description'])
  material = treat_sentence(df.iloc[row]['material'])
  material2 = treat_sentence(df.iloc[row]['material2'])
  doc = parser(sentence)
  extract_material(doc)
  print('sentence was ------>', sentence, '(material 1:', material, ', ' 'material 2:',material2, ')')
  print()

for row in (311, 321, 339, 349, 359, 363, 379, 389, 397):
  get_described_materials(row)



materials for row -> 311
material ----------> massa / massa
material ----------> slib / slib
sentence was ------> cacao massa slib (material 1: cacao , material 2: nan )

materials for row -> 321
material ----------> vot / vot
material ----------> sigarettjti / sigarettjti
material ----------> cjan / cjan
material ----------> de / de
material ----------> rijk / de
sentence was ------> vot sigarettjti cjan de rijk (material 1: unclassified , material 2: nan )

materials for row -> 339
material ----------> mengsel / mengsels
material ----------> beton / van
material ----------> steen / stenen
material ----------> tegel / stenen
material ----------> productenof / of
material ----------> fractie / afzonderlijke
material ----------> stof / die
sentence was ------> mengsels van beton stenen tegels of keramische productenof afzonderlijke fracties daarvan die gevaarl stoffen bevevatten (material 1: unclassified , material 2: nan )

materials for row -> 349
material ----------> mengsel / niet
m

### Dependency Parsing

Dependency parsing is a language processing technique that helps determine the meaning of a sentence by analyzing how it is constructed and how the individual words relate to each other.

In [None]:
# Visualize dependencies
from spacy import displacy

row = 339
description = str(df.iloc[row]['description'])
doc = parser(description)
print(doc)  # print only after doc has been created for this row

displacy.render(doc, style="dep", jupyter=True)


mengsels van beton, stenen, tegels of keramische producten,of afzonderlijke fracties daarvan, die gevaarl. stoffen bev.evatten


### **Plotting the Classification Outcomes (description and material dependencies)**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df = materials_test

#chosen column field option
chosen_material = 'slakken'
chosen_mixedOrPure = 'puur'
chosen_consistency = 'vast'
chosen_cleanOrDirty = 'vervuild'
chosen_indirectProduct = 'slib'

# set graph: 
outcome = chosen_indirectProduct
x_axis = 'indirectProduct'
x_data = df.indirectProduct

data = df.head(120).loc[x_data ==outcome]
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(1, 15)
sns.regplot(x=x_axis, y='description', data=data, ax=ax)


### **Save & Load the Trained Model**

In [None]:
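import pickle

# save the model to disk
name = 'nl_mType_classification_model.sav'
filename = path + name
# the context manager closes the file once the model has been written
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)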
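
Here is a sketch of the full save-and-load round trip using context managers, so that file handles are closed deterministically. The dictionary stands in for the fitted pipeline and the temporary directory stands in for the Drive `path`; both are placeholders for illustration, as are the helper names `save_pickle` and `load_pickle`.

```python
import os
import pickle
import tempfile

def save_pickle(obj, filename):
    """Serialize obj to filename and close the file afterwards."""
    with open(filename, 'wb') as handle:
        pickle.dump(obj, handle)

def load_pickle(filename):
    """Read a pickled object back from filename."""
    with open(filename, 'rb') as handle:
        return pickle.load(handle)

# Placeholder for a trained model (assumption: any picklable object behaves the same way)
stand_in_model = {'label': 'mType', 'classes': ['unclassified', 'grond']}

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, 'nl_mType_classification_model.sav')
    save_pickle(stand_in_model, filename)
    restored = load_pickle(filename)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.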
 

In [None]:
# load models
import pickle

def load_model(name):
    # Unpickle a saved model from the data folder and close the file afterwards
    with open(path + name, 'rb') as model_file:
        return pickle.load(model_file)

cType_name = 'nl_c_type_classification_model.sav'
cleanOrDirty_name = 'nl_clean_or_dirty_classification_model.sav'
consistency_name = 'nl_consistency_classification_model.sav'
composite1_name = 'nl_composite1_classification_model.sav'
directProduct_name = 'nl_direct_product_classification_model.sav'
indirectProduct_name = 'nl_indirect_product_classification_model.sav'
material_name = 'nl_material_classification_model.sav'
mixedOrPure_name = 'nl_mixed_or_pure_classification_model.sav'
mType_name = 'nl_mType_classification_model.sav'
state_name = 'nl_state_classification_model.sav'

state_model = load_model(state_name)
cType_model = load_model(cType_name)
cleanOrDirty_model = load_model(cleanOrDirty_name)
consistency_model = load_model(consistency_name)
composite1_model = load_model(composite1_name)
directProduct_model = load_model(directProduct_name)
indirectProduct_model = load_model(indirectProduct_name)
material_model = load_model(material_name)
mixedOrPure_model = load_model(mixedOrPure_name)
mType_model = load_model(mType_name)

# models' classifications
# overall_score = _model.score(X_test, y_test)
row = 7

def print_classification_results(row=0, with_accuracy=False):
  print('row', row)
  print( 'description is --->', X_test.iloc[row])
  print( 'cType ------------>', cType_model.predict(X_test)[row], '(low samples / mostly unclassified)' if with_accuracy else '')
  print( 'clean or dirty --->', cleanOrDirty_model.predict(X_test)[row], '(OMS: 0.79)' if with_accuracy else '')
  print( 'consistency ------>', consistency_model.predict(X_test)[row], '(OMS: 0.96)' if with_accuracy else '')
  print( 'composite 1 ------>', composite1_model.predict(X_test)[row], '(OMS: 0.78)' if with_accuracy else '')
  print( 'direct product --->', directProduct_model.predict(X_test)[row], '(OMS: 0.71)' if with_accuracy else '')
  print( 'indirect product ->', indirectProduct_model.predict(X_test)[row],'(low samples / mostly unclassified)' if with_accuracy else '')
  print( 'material --------->', material_model.predict(X_test)[row], '(OMS: 0.85)'if with_accuracy else '')
  print( 'state ------------>', state_model.predict(X_test)[row], '(OMS: 0.82)'if with_accuracy else '')
  print( 'mixed or pure ---->', mixedOrPure_model.predict(X_test)[row],'(OMS: 0.86)'if with_accuracy else '')
  print( 'm type ----------->', mType_model.predict(X_test)[row],'(low samples / mostly unclassified)'if with_accuracy else '')
  print()

print_classification_results(row+1, with_accuracy=True)
print_classification_results(row+2)
print_classification_results(row+3)
print_classification_results(row+4)
print_classification_results(row+5)
print_classification_results(row+6)

row 8
description is ---> Elektronika
cType ------------> unclassified (low samples / mostly unclassified)
clean or dirty ---> unclassified (OMS: 0.79)
consistency ------> vast (OMS: 0.96)
composite 1 ------> composiet (OMS: 0.78)
direct product ---> unclassified (OMS: 0.71)
indirect product -> composiet (low samples / mostly unclassified)
material ---------> unclassified (OMS: 0.85)
state ------------> unclassified (OMS: 0.82)
mixed or pure ----> gemengd OMS: 0.86)
m type -----------> unclassified (low samples / mostly unclassified)

row 9
description is ---> Grond zonder analyses
cType ------------> unclassified 
clean or dirty ---> unclassified 
consistency ------> vast 
composite 1 ------> unclassified 
direct product ---> unclassified 
indirect product -> grond 
material ---------> grond 
state ------------> unclassified 
mixed or pure ----> gemengd 
m type -----------> unclassified 

row 10
description is ---> Baggerspecie stort(asbesthoudend)
cType ------------> unclassified 
cl