# Cross validation of the Web Of Science dataset
- Determine whether the Web Of Science dataset can be used to mimic what is seen in the representative dataset when running Performance Predictor
- Web Of Science has 3 datasets:
  - Web of Science Dataset WOS-11967
    - This dataset contains 11,967 documents with 35 categories, which include 7 parent categories.
  - Web of Science Dataset WOS-46985
    - This dataset contains 46,985 documents with 134 categories, which include 7 parent categories.
  - Web of Science Dataset WOS-5736
    - This dataset contains 5,736 documents with 11 categories, which include 3 parent categories.

- WOS-46985 has 134 intents (92 are needed to mimic the representative dataset)
- WOS-46985 has 1 dataset with 100 intents (20 super intents)
  - train (tn)
  - test (te)
- Run a cross validation on the combined (co) dataset to get the accuracies of a trained SVM classifier.
- Check the accuracies across the splits to see if they are consistent.
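
The notebook does not show how the 134 intents would be reduced to the 92 needed to mimic the representative dataset. One possible approach, shown here purely as an assumption, is to keep only the most frequent intents. The helper name `keep_top_k_labels` and the toy data are hypothetical and are not part of the original notebook.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def keep_top_k_labels(
    texts: Sequence[str], labels: Sequence[int], k: int
) -> Tuple[List[str], List[int]]:
    """Keep only the examples whose label is among the k most frequent labels.

    Hypothetical helper: selecting by frequency is an assumption, not the
    method used by the notebook.
    """
    counts = Counter(labels)
    keep = {label for label, _ in counts.most_common(k)}
    pairs = [(text, label) for text, label in zip(texts, labels) if label in keep]
    return [text for text, _ in pairs], [label for _, label in pairs]


# Toy data standing in for the notebook's x and y arrays.
toy_x = ['a', 'b', 'c', 'd', 'e', 'f']
toy_y = [0, 0, 1, 1, 1, 2]
sub_x, sub_y = keep_top_k_labels(toy_x, toy_y, k=2)
print(sub_x, sub_y)
```

With the real data, the call would look like `keep_top_k_labels(x, y, k=92)`, assuming `x` and `y` are loaded as in the cells that follow.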

In [1]:
import gzip
from IPython.display import display, HTML
import numpy as np
import os
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import time
from typing import List

pd.options.display.max_colwidth = 100

%load_ext autoreload
%autoreload 2

# Increase the width of the notebook so that it is the width of the browser 
# which allows larger size for the dashboard
display(HTML('<style>.container { width:100% !important; }</style>'))

#### Load workspace dataset

In [2]:
%%time
# X is the input data (text sequences)
# Y is the target value
# YL1 is the target value of level one (parent label)
# YL2 is the target value of level two (child label)
x_gzip_file = '../../../data/WebOfScience/WebOfScience/WOS46985/X.txt.gzip'
y_file = '../../../data/WebOfScience/WebOfScience/WOS46985/Y.txt'
yl1_file = '../../../data/WebOfScience/WebOfScience/WOS46985/YL1.txt'
yl2_file = '../../../data/WebOfScience/WebOfScience/WOS46985/YL2.txt'

with gzip.open(x_gzip_file, 'rt') as f:
    lines = f.readlines()
df_x = pd.DataFrame(lines, columns=['example'])
df_y = pd.read_csv(y_file, header=None, names=['intent'])
df_yl1 = pd.read_csv(yl1_file, header=None, names=['yl1'])
df_yl2 = pd.read_csv(yl2_file, header=None, names=['yl2'])
print(f'df_x.shape     = {df_x.shape}')
print(f'df_y.shape     = {df_y.shape}')
print(f'df_yl1.shape   = {df_yl1.shape}')
print(f'df_yl2.shape   = {df_yl2.shape}')
print()
print(f'n uniq y       = {len(np.unique(df_y["intent"]))}')
print(f'min n uniq y   = {min(np.unique(df_y["intent"], return_counts=True)[1])}')
print(f'max n uniq y   = {max(np.unique(df_y["intent"], return_counts=True)[1])}')
print(f'n uniq yl1     = {len(np.unique(df_yl1["yl1"]))}')
print(f'min n uniq yl1 = {min(np.unique(df_yl1["yl1"], return_counts=True)[1])}')
print(f'max n uniq yl1 = {max(np.unique(df_yl1["yl1"], return_counts=True)[1])}')
print(f'n uniq yl2     = {len(np.unique(df_yl2["yl2"]))}')

df_merge = pd.concat([df_x, df_y], axis=1, sort=False)
print(f'df_merge.shape = {df_merge.shape}')

x = df_merge['example'].to_numpy()
y = df_merge['intent'].to_numpy().ravel()
print(f'x.shape        = {x.shape}')
print(f'y.shape        = {y.shape}')

display(HTML(df_merge.head(4).to_html()))

df_x.shape     = (46985, 1)
df_y.shape     = (46985, 1)
df_yl1.shape   = (46985, 1)
df_yl2.shape   = (46985, 1)

n uniq y       = 134
min n uniq y   = 43
max n uniq y   = 750
n uniq yl1     = 7
min n uniq yl1 = 3297
max n uniq yl1 = 14625
n uniq yl2     = 53
df_merge.shape = (46985, 2)
x.shape        = (46985,)
y.shape        = (46985,)


Unnamed: 0,example,intent
0,"(2 + 1)-dimensional non-linear optical waves through the coherently excited resonant medium doped with the erbium atoms can be described by a (2 + 1)-dimensional non-linear Schrodinger equation coupled with the self-induced transparency equations. For such a system, via the Hirota method and symbolic computation, linear forms, one-, two-and N-soliton solutions are obtained. Asymptotic analysis is conducted and suggests that the interaction between the two solitons is elastic. Bright solitons are obtained for the fields E and P, while the dark ones for the field N, with E as the electric field, P as the polarization in the resonant medium induced by the electric field, and N as the population inversion profile of the dopant atoms. Head-on interaction between the bidirectional two solitons and overtaking interaction between the unidirectional two solitons are seen. Influence of the averaged natural frequency. on the solitons are studied: (1). can affect the velocities of all the solitons; (2) Amplitudes of the solitons for the fields P and N increase with. decreasing, and decrease with. increasing; (3) With. decreasing, for the fields P and N, one-peak one soliton turns into the two-peak one, as well as interaction type changes from the interaction between two one-peak ones to that between a one-peak one and a two-peak one; (4) For the field E, influence of. on the solitons cannot be found. The results of this paper might be of potential applications in the design of optical communication systems which can produce the bright and dark solitons simultaneously.\n",12
1,"(beta-amyloid (A beta) and tau pathology become increasingly prevalent with age, however, the spatial relationship between the two pathologies remains unknown. We examined local (same region) and non-local (different region) associations between these 2 aggregated proteins in 46 normal older adults using [F-18]AV-1451 (for tau) and [C-11]PiB (for A beta) positron emission tomography (PET) and 1.5 T magnetic resonance imaging (MRI) images. While local voxelwise analyses showed associations between PiB and AV-1451 tracer largely in the temporal lobes, k-means clustering revealed that some of these associations were driven by regions with low tracer retention. We followed this up with a whole-brain region-by-region (local and non-local) partial correlational analysis. We calculated each participant's mean AV-1451 and PiB uptake values within 87 regions of interest (ROI). Pairwise ROI analysis demonstrated many positive PiB AV-1451 associations. Importantly, strong positive partial correlations (controlling for age, sex, and global gray matter fraction, p <.01) were identified between PiB in multiple regions of association cortex and AV-1451 in temporal cortical ROIs. There were also less frequent and weaker positive associations of regional PiB with frontoparietal AV-1451 uptake. Particularly in temporal lobe ROIs, AV-1451 uptake was strongly predicted by NB across multiple ROI locations. These data indicate that A beta and tau pathology show significant local and non-local regional associations among cognitively normal elderly, with increased PiB uptake throughout the cortex correlating with increased temporal lobe AV-1451 uptake. The spatial relationship between A beta and tau accumulation does not appear to be specific to A beta location, suggesting a regional vulnerability of temporal brain regions to tau accumulation regardless of where AP accumulates.\n",74
2,"(D)ecreasing of energy consumption and environmentally friendly energy resources are the issues in the foreground nowadays. As the electric energy consumed for the illumination is high, long-lasting and low-consumption LED (light-emitting diode) technology gets prominent. There have been made much reseacrh regarding the use of photovoltaic sytems in meeting the energy demand in housing and industry. However, there is need for more research with regards to photovoltaic sytems' integration with energy efficiency sytems. In this study, for the environments which have different lighting levels due to daylight factor, there has been proposed a low-cost PV (photovoltaics) based and distributed sensor smart LED illuminating system and there has been acquired 72.075% more energy saving in comparison with conventional LED illuminating system. (C) 2017 Elsevier Inc. All rights reserved.\n",68
3,"(Hybrid) electric vehicles are assumed to play a major role in future mobility concepts. Although sales numbers are increasing, little emphasis has been laid on the recycling of some key components such as power electronics or electric motors. Permanent magnet synchronous motors contain considerable amounts of rare earth elements that cannot be recovered in conventional recycling routes. Although their recycling could have large economic, environmental, and strategic advantages, no industrial recycling for permanent magnets is available in western countries at the moment. Regarding the essential steps, dismantling of electric vehicles as well as the extraction of magnets from the rotors, little has been published before. This paper therefore presents and discusses different recycling approaches for the recycling of NdFeB magnets from (hybrid) electric vehicles. Many results stem from the German research project ""Recycling of components and strategic metals of electric drive motors."".\n",26


CPU times: user 320 ms, sys: 71.9 ms, total: 392 ms
Wall time: 418 ms


#### Encode with MiniLM sentence encoder

In [3]:
%%time
class MiniLMEmbedding:
    """Wrap the paraphrase-MiniLM-L6-v2 sentence transformer."""
    def __init__(self):
        self.transformer = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    def encode(self, input_sentences: List[str]) -> np.ndarray:
        # Lowercase every sentence, then embed them one at a time
        sentences = [sentence.lower() for sentence in input_sentences]
        embedded_sentences = [self.embed_sentence(s) for s in sentences]
        return np.array(embedded_sentences)
    def embed_sentence(self, sentence: str) -> np.ndarray:
        # Returns a 1-D embedding vector for a single sentence
        embedding = self.transformer.encode(sentence, show_progress_bar=False, convert_to_numpy=True)
        return embedding

encoded_file = '../../../data/WebOfScience/WebOfScience/WOS46985/X_encoded.csv'
if os.path.exists(encoded_file):
    df = pd.read_csv(encoded_file, header=None)
    x_encoded = df.to_numpy()
else:
    encoder = MiniLMEmbedding()
    x_encoded = encoder.encode(x)
    # Save to file
    df = pd.DataFrame(x_encoded)
    df.to_csv(encoded_file, header=False, index=False)

print(f'x_encoded.shape = {x_encoded.shape}')

x_encoded.shape = (46985, 384)
CPU times: user 8.58 s, sys: 401 ms, total: 8.98 s
Wall time: 10.6 s


#### Run a cross validation on SVM classifiers
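- Split the combined (x_encoded) dataset into 5 stratified splits
- The usual train/test roles are swapped so that the train stays small
  - Each train (x_trn) is about 9,397 (46,985 / 5)
  - Each test (x_tst) is about 37,588 (46,985 * 4/5)
- Score the accuracy of each cross split
  - Normally you'd test against the test of each split (x_tst)
  - But in this case test against each dataset
     - split train (x_trn)
     - split test (x_tst)
     - original x (x_encoded)
  - This is done to see if any of the datasets have problems, e.g. one has a very low score compared to the others.
  - split train (x_trn) has the highest accuracy, 85% to 86%
  - split test (x_tst) scores 51% and the original x scores 58% on every split
- An earlier 10 split run (train of about 4,699, test of about 42,287) scored 89% to 90% on split train, 48% on split test and 52% on the original x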
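
The train and test sizes for a given number of splits can be checked with a few lines of arithmetic. This is a minimal sketch using the plain K-fold sizing rule (the first `n % k` folds get one extra sample). `StratifiedKFold` balances each class separately, so its actual fold sizes may differ from these by a few samples. The helper name `fold_sizes` is hypothetical.

```python
from typing import List


def fold_sizes(n_samples: int, n_splits: int) -> List[int]:
    """Sizes of the held-out folds under the plain K-fold rule."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 46985
for n_splits in (5, 10):
    sizes = fold_sizes(n_samples, n_splits)
    small = sizes[0]           # the held-out fold, used here as the train
    large = n_samples - small  # the remaining folds, used here as the test
    print(f'n_splits={n_splits}: train={small}, test={large}')
```

Because the roles are swapped, fewer splits give a larger train: 5 splits roughly double the train size compared with 10 splits.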

In [4]:
%%time

skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

runs = []
run = 0
# Reverse the normal train/test split sizes: skf.split() yields
# (large, small) index arrays, so unpack them as (test, train).
# This keeps the train small and the test large, so the trains
# are similar in size to the representative dataset.
for tst_index, trn_index in skf.split(x_encoded, y):
    x_trn = x_encoded[trn_index]
    x_tst = x_encoded[tst_index]
    y_trn = y[trn_index]
    y_tst = y[tst_index]
    start = time.time()
    model = SVC(probability=True, random_state=42)
    model.fit(x_trn, y_trn)
    print(f'fit() dur={time.time() - start}')
    start = time.time()
    runs.append({
        'run':     run,
        'trn_acc': f'{model.score(x_trn, y_trn):.0%}',
        'tst_acc': f'{model.score(x_tst, y_tst):.0%}',
        'x_acc':  f'{model.score(x_encoded, y):.0%}',
    })
    print(f'score()s dur={time.time() - start}')
    display(HTML(pd.DataFrame(runs).to_html()))
    start = time.time()
    run += 1

fit() dur=424.9942283630371
score()s dur=974.2140498161316


Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%


fit() dur=188.7138111591339
score()s dur=897.379034280777


Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%
1,1,86%,51%,58%


fit() dur=152.15538573265076
score()s dur=906.5250158309937


Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%
1,1,86%,51%,58%
2,2,85%,51%,58%


fit() dur=170.8431088924408
score()s dur=792.7792429924011


Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%
1,1,86%,51%,58%
2,2,85%,51%,58%
3,3,85%,51%,58%


fit() dur=166.70527744293213
score()s dur=798.4407136440277


Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%
1,1,86%,51%,58%
2,2,85%,51%,58%
3,3,85%,51%,58%
4,4,85%,51%,58%


CPU times: user 1h 21min 47s, sys: 5.8 s, total: 1h 21min 53s
Wall time: 1h 31min 13s
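
Since `x_encoded` is exactly the union of `x_trn` and `x_tst`, `x_acc` is the size-weighted average of `trn_acc` and `tst_acc` and carries no independent information. The sketch below checks this against the reported figures. The inputs are the rounded percentages from the notebook, so the result is approximate, and the helper name `combined_accuracy` is hypothetical.

```python
def combined_accuracy(trn_acc: float, tst_acc: float, trn_fraction: float) -> float:
    """Accuracy on the union of train and test, weighted by their sizes."""
    return trn_fraction * trn_acc + (1.0 - trn_fraction) * tst_acc


# 5 splits: the train is 1/5 of the data (rounded accuracies from the run above).
five_split = combined_accuracy(0.86, 0.51, 1 / 5)
# 10 splits: the train is 1/10 of the data (figures recorded in the notebook's comments).
ten_split = combined_accuracy(0.90, 0.48, 1 / 10)

print(f'5 split  x_acc ~ {five_split:.0%}')
print(f'10 split x_acc ~ {ten_split:.0%}')
```

Both values agree with the reported `x_acc` of 58% and 52%, so the test accuracy (`tst_acc`) is the figure to compare against the representative dataset.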


In [5]:
df = pd.DataFrame(runs)
display(HTML(df.to_html()))
# 10 split cross validation (recorded from an earlier run)
# run trn_acc tst_acc x_acc
#   0     90%     48%   52%
#   1     89%     48%   52%
#   2     90%     48%   52%
#   3     90%     48%   52%
#   4     89%     48%   52%
#   5     89%     48%   52%
#   6     89%     48%   52%
#   7     89%     48%   52%
#   8     90%     48%   52%
#   9     89%     48%   52%

# 5 split cross validation
# run trn_acc tst_acc x_acc
#   0     86%     51%   58%
#   1     86%     51%   58%
#   2     85%     51%   58%
#   3     85%     51%   58%
#   4     85%     51%   58%

Unnamed: 0,run,trn_acc,tst_acc,x_acc
0,0,86%,51%,58%
1,1,86%,51%,58%
2,2,85%,51%,58%
3,3,85%,51%,58%
4,4,85%,51%,58%
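
To make the consistency check explicit, the per-split accuracies can be summarized by their mean and spread. This is a small sketch that re-enters the 5 split results shown above as a literal list. The names `to_points` and `summarize` are hypothetical helpers, and because the notebook stores accuracies as rounded strings, the spread is only accurate to one percentage point.

```python
from statistics import mean
from typing import Dict, List

# 5 split results copied from the table above.
runs = [
    {'run': 0, 'trn_acc': '86%', 'tst_acc': '51%', 'x_acc': '58%'},
    {'run': 1, 'trn_acc': '86%', 'tst_acc': '51%', 'x_acc': '58%'},
    {'run': 2, 'trn_acc': '85%', 'tst_acc': '51%', 'x_acc': '58%'},
    {'run': 3, 'trn_acc': '85%', 'tst_acc': '51%', 'x_acc': '58%'},
    {'run': 4, 'trn_acc': '85%', 'tst_acc': '51%', 'x_acc': '58%'},
]


def to_points(value: str) -> float:
    """Convert a string such as '86%' to percentage points (86.0)."""
    return float(value.rstrip('%'))


def summarize(rows: List[Dict], column: str) -> Dict[str, float]:
    """Mean, min, max and spread (max - min) of one accuracy column."""
    values = [to_points(row[column]) for row in rows]
    return {
        'mean': mean(values),
        'min': min(values),
        'max': max(values),
        'spread': max(values) - min(values),
    }


for column in ('trn_acc', 'tst_acc', 'x_acc'):
    print(column, summarize(runs, column))
```

A spread of at most one percentage point in every column indicates that no single split behaves differently from the others.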
