# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a table column with Sherlock.
The steps are:
- Download the files needed for word embedding and paragraph vector feature extraction (they are downloaded only once) and initialize the feature extraction models.
- Extract features from table columns.
- Initialize Sherlock.
- Make a prediction for the feature representation of the column.

In [2]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings

AttributeError: module 'distutils' has no attribute 'version'

In [2]:
#%env PYTHONHASHSEED

In [3]:
from pprint import pprint
from sherlock.deploy import helpers

feature_cols = helpers.categorize_features()
pprint(feature_cols['word'])  # other categories: 'char', 'par', 'rest'



['word_embedding_avg_0',
 'word_embedding_avg_1',
 'word_embedding_avg_2',
 'word_embedding_avg_3',
 'word_embedding_avg_4',
 'word_embedding_avg_5',
 'word_embedding_avg_6',
 'word_embedding_avg_7',
 'word_embedding_avg_8',
 'word_embedding_avg_9',
 'word_embedding_avg_10',
 'word_embedding_avg_11',
 'word_embedding_avg_12',
 'word_embedding_avg_13',
 'word_embedding_avg_14',
 'word_embedding_avg_15',
 'word_embedding_avg_16',
 'word_embedding_avg_17',
 'word_embedding_avg_18',
 'word_embedding_avg_19',
 'word_embedding_avg_20',
 'word_embedding_avg_21',
 'word_embedding_avg_22',
 'word_embedding_avg_23',
 'word_embedding_avg_24',
 'word_embedding_avg_25',
 'word_embedding_avg_26',
 'word_embedding_avg_27',
 'word_embedding_avg_28',
 'word_embedding_avg_29',
 'word_embedding_avg_30',
 'word_embedding_avg_31',
 'word_embedding_avg_32',
 'word_embedding_avg_33',
 'word_embedding_avg_34',
 'word_embedding_avg_35',
 'word_embedding_avg_36',
 'word_embedding_avg_37',
 'word_embedding_avg_3

## Initialize feature extraction models

In [4]:
initialise_pretrained_model(400)

Initialise Doc2Vec Model, 400 dim, process took 0:00:00.033055 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)


In [5]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 4 files:
 ../sherlock/features/glove.6B.50d.txt,
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy,
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy, and
 ../sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy.
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:08.467993 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:00.007890 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.108341 seconds.


[nltk_data] Downloading package punkt to /home/sunny/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sunny/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Extract features

In [6]:
data = pd.Series(
    [
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
        ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"],
        ["1", "2", "3"],
    ],
    name="values"
)

In [7]:
data.shape

(4,)
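
The Series above holds one entry per table column, and each entry is the list of that column's values as strings. As a minimal sketch of how a row-oriented table could be brought into that layout, the hypothetical helper `rows_to_column_values` below (not part of Sherlock) transposes a list of row dictionaries using only the standard library. The sample rows and the column names `name`, `city`, and `id` are illustrative assumptions.

```python
def rows_to_column_values(rows):
    """Transpose row dictionaries into one list of string values per column.

    Hypothetical helper, not part of Sherlock. Assumes every row has the
    same keys as the first row; None values are skipped.
    """
    if not rows:
        return {}
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value is not None:
                # Cast every value to str, matching the string lists used in this notebook.
                columns[name].append(str(value))
    return columns


# Illustrative rows; the column names are assumptions made for this sketch.
rows = [
    {"name": "Jane Smith", "city": "Amsterdam", "id": 1},
    {"name": "Lute Ahorn", "city": "Haarlem", "id": 2},
    {"name": "Anna James", "city": "Zwolle", "id": 3},
]
column_values = rows_to_column_values(rows)
print(list(column_values.values()))
```

The resulting lists could then be wrapped as `pd.Series(list(column_values.values()), name="values")` to match the shape used in this notebook.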

In [8]:
data = pd.Series(
    [
        ["123213", "15457", "563"],
        ["Amsterdam", "Haarlem", "Zwolle"],
        ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"],
        ["0.0", "1.0", "2.0", "3.0"],
    ],
    name="values"
)

In [9]:
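# Write one feature vector per column to a temporary CSV file,
# then load the vectors back as float32 values.
extract_features(
    "../temporary.csv",
    data
)
feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)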

Extracting Features: 100%|██████████| 4/4 [00:00<00:00, 173.02it/s]

Exporting 1588 column features

In [10]:
feature_vectors

| | n_[0]-agg-any | n_[0]-agg-all | n_[0]-agg-mean | n_[0]-agg-var | n_[0]-agg-min | n_[0]-agg-max | n_[0]-agg-median | n_[0]-agg-sum | n_[0]-agg-kurtosis | n_[0]-agg-skewness | ... | par_vec_390 | par_vec_391 | par_vec_392 | par_vec_393 | par_vec_394 | par_vec_395 | par_vec_396 | par_vec_397 | par_vec_398 | par_vec_399 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | 0.000688 | 0.000144 | -0.000203 | 0.001054 | -0.000654 | -0.001018 | 0.001158 | 0.001177 | 0.000443 | 0.001005 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | -8.7e-05 | 0.00056 | 0.000859 | -0.000403 | -0.000711 | 0.000393 | 9.9e-05 | -0.000935 | -0.000803 | 0.000734 |
| 2 | 1.0 | 0.0 | 1.0 | 0.666667 | 0.0 | 2.0 | 1.0 | 3.0 | -1.5 | 0.0 | ... | -9.1e-05 | -0.001026 | -0.000603 | -0.001188 | 0.000262 | -0.001025 | 0.00027 | 0.000762 | 0.001139 | -0.000117 |
| 3 | 1.0 | 1.0 | 1.25 | 0.1875 | 1.0 | 2.0 | 1.0 | 5.0 | -0.666667 | 1.154701 | ... | -0.000787 | 0.000933 | 0.001049 | 0.00069 | 0.000376 | 0.000105 | 0.000888 | -0.000193 | -0.000621 | -0.000328 |
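
The feature names in these outputs carry recognizable prefixes (`n_[`, `word_embedding_`, `par_vec_`). As a rough sketch, the hypothetical helper `group_by_prefix` below sorts column names into buckets by prefix so that one feature family can be inspected at a time. The mapping of `n_[` to character-level features is an assumption, and this is not how `helpers.categorize_features()` is implemented.

```python
from collections import defaultdict

# Prefixes as they appear in this notebook's outputs. Treating "n_[" as the
# character-level family is an assumption; "rest" collects everything else.
PREFIXES = {
    "char": "n_[",
    "word": "word_embedding_",
    "par": "par_vec_",
}


def group_by_prefix(column_names):
    """Group feature column names by their prefix (hypothetical helper)."""
    groups = defaultdict(list)
    for name in column_names:
        for label, prefix in PREFIXES.items():
            if name.startswith(prefix):
                groups[label].append(name)
                break
        else:
            # No known prefix matched this column name.
            groups["rest"].append(name)
    return dict(groups)


# A handful of names taken from the outputs in this notebook, plus one
# made-up name ("some_other_feature") to exercise the fallback bucket.
columns = [
    "n_[0]-agg-any",
    "n_[0]-agg-all",
    "word_embedding_avg_0",
    "par_vec_390",
    "par_vec_399",
    "some_other_feature",
]
grouped = group_by_prefix(columns)
print({label: len(names) for label, names in grouped.items()})
```

With a DataFrame such as `feature_vectors`, the names in one bucket could be used for column selection, for example `feature_vectors[grouped["par"]]`.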


## Initialize Sherlock

In [11]:
model = SherlockModel()
model.initialize_model_from_json(with_weights=True, model_id="sherlock")

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


2022-12-06 12:02:11.939002: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-06 12:02:11.959479: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4007995000 Hz
2022-12-06 12:02:11.960343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d2d73eec70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-12-06 12:02:11.960384: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

## Predict semantic type for column

In [12]:
predicted_labels = model.predict(feature_vectors, "sherlock")

In [13]:
predicted_labels

array(['credit', 'city', 'address', 'elevation'], dtype=object)
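
The predictions come back in the same order as the input columns, so they can be paired with the original values for inspection. The sketch below uses a hypothetical helper, `pair_columns_with_labels`, which is not part of Sherlock. The column values and labels are copied from the cells above so that the example runs without the model.

```python
def pair_columns_with_labels(columns, labels, preview=2):
    """Pair each column's first few values with its predicted label.

    Hypothetical helper, not part of Sherlock. Assumes `labels` is ordered
    like `columns`, with one label per column.
    """
    if len(columns) != len(labels):
        raise ValueError("expected exactly one label per column")
    return [
        {"label": str(label), "sample": list(values[:preview])}
        for values, label in zip(columns, labels)
    ]


# Column values and predicted labels copied from the cells above.
columns = [
    ["123213", "15457", "563"],
    ["Amsterdam", "Haarlem", "Zwolle"],
    ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"],
    ["0.0", "1.0", "2.0", "3.0"],
]
labels = ["credit", "city", "address", "elevation"]

paired = pair_columns_with_labels(columns, labels)
for row in paired:
    print(f"{row['label']:<10} {row['sample']}")
```

In the notebook itself, `data` and `predicted_labels` could be passed in place of the copied lists.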