# Inspec dataset analysis

The purpose of this notebook is to perform a simple exploration of the `inspec` dataset. We want to split the dataset into two separate sets: test and train.

Those two sets have to meet arbitrarily chosen properties:
* **Separate sets of keywords** - docs in those two sets have to have disjoint sets of keywords.
* **One doc should be a part of multiple clusters**
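
The first property can be checked mechanically. Below is a minimal sketch, assuming each split is represented as a list of per-document keyword lists (the same shape that `read_keywords` returns later in this notebook). The helper names `collect_keywords` and `shared_keywords` and the toy data are hypothetical and only serve as illustration.

```python
from typing import Iterable, List, Set


def collect_keywords(keyword_lists: Iterable[List[str]]) -> Set[str]:
  """Flatten per-document keyword lists into one set of keywords."""
  return {keyword for keyword_list in keyword_lists for keyword in keyword_list}


def shared_keywords(train: Iterable[List[str]], test: Iterable[List[str]]) -> Set[str]:
  """Return keywords present in both splits; an empty set means the property holds."""
  return collect_keywords(train) & collect_keywords(test)


# Toy data, not taken from the dataset.
train_example = [["internet", "probability"], ["graph theory"]]
test_example = [["stability"], ["probability", "feedback"]]

overlap = shared_keywords(train_example, test_example)
print(overlap)  # {'probability'}
```

An empty result would mean that the two splits share no keyword at all.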

## Basic preparation

In [13]:
import pandas as pd
import numpy as np
import os
import re

from statistics import mean, median
from typing import List

In [3]:
DATASET_PATH = "../.dump/Inspec"

In [78]:
def clean_keyword(keyword: str) -> str:
  # Collapse every run of whitespace (tabs, newlines, ...) into a single space,
  # then trim the ends and lowercase.
  return re.sub(r'\s+', ' ', keyword).strip().lower()

def read_keywords() -> List[List[str]]:
  files = os.listdir(DATASET_PATH + "/keys")

  keywords: List[List[str]] = []

  for file in files:
    with open(DATASET_PATH + "/keys/" + file, "r") as f:
      keywords_list = f.read().splitlines()

    # Normalize, deduplicate, and sort the keywords of a single doc
    keywords_list = sorted({clean_keyword(keyword) for keyword in keywords_list})
    keywords.append(keywords_list)

  return keywords

In [5]:
def read_docs() -> List[str]:
  files = os.listdir(DATASET_PATH + "/docsutf8")

  docs: List[str] = []

  for file in files:
    with open(DATASET_PATH + "/docsutf8/" + file, "r") as f:
      docs.append(f.read())

  return docs

In [79]:
keywords = read_keywords()
docs = read_docs()

## Basic analysis

In [80]:
keyword_counts = {}

for keyword_list in keywords:
  for keyword in keyword_list:
    if keyword not in keyword_counts:
      keyword_counts[keyword] = 0
    keyword_counts[keyword] += 1

In [81]:
keyword_counts = {k: v for k, v in sorted(keyword_counts.items(), key=lambda item: item[1], reverse=True)}

keyword_counts

{'internet': 148,
 'information resources': 99,
 'probability': 65,
 'computational complexity': 64,
 'optimisation': 58,
 'medical image processing': 57,
 'matrix algebra': 51,
 'control system synthesis': 43,
 'nonlinear control systems': 41,
 'gender issues': 40,
 'statistical analysis': 40,
 'stability': 39,
 'neural nets': 38,
 'human factors': 38,
 'academic libraries': 37,
 'social aspects of automation': 36,
 'computer science education': 36,
 'library automation': 36,
 'learning (artificial intelligence)': 36,
 'psychology': 36,
 'polynomials': 35,
 'medical computing': 35,
 'robust control': 35,
 'iterative methods': 34,
 'electronic commerce': 34,
 'graph theory': 34,
 'differential equations': 34,
 'user interfaces': 33,
 'feedback': 33,
 'mobile computing': 32,
 'decision theory': 31,
 'business data processing': 31,
 'digital simulation': 30,
 'production control': 30,
 'computational geometry': 29,
 'electronic publishing': 28,
 'closed loop systems': 28,
 'image reconst

In [82]:
keyword_sets_counts = {}

for keyword_list in keywords:
  keyword_set = frozenset(keyword_list)
  if keyword_set not in keyword_sets_counts:
    keyword_sets_counts[keyword_set] = 0
  keyword_sets_counts[keyword_set] += 1

In [83]:
for keyword_set, count in keyword_sets_counts.items():
  if count > 1:
    print(keyword_set, count)

The cell above prints nothing, so no two docs share exactly the same set of keywords.

In [87]:
keyword_occurences = {}

for i in range(len(keywords)):
  for keyword in keywords[i]:
    if keyword in keyword_occurences:
      keyword_occurences[keyword].append(i)
    else:
      keyword_occurences[keyword] = [i]

In [85]:
def analyse_keywords_occurences(keyword_occurences: dict):
  # Keys are keywords (or tuples of keywords); values are lists of doc ids.
  for i in range(1, 10):
    _keyword_occurences = {k: v for k, v in keyword_occurences.items() if len(v) >= i}

    # let's check how many unique doc ids are covered by the remaining keys
    _unique_ids = set()
    for ids in _keyword_occurences.values():
      _unique_ids.update(ids)

    print(f"For at least {i} occurences:")
    print(f"Unique ids: {len(_unique_ids)}")
    print(f"Total: {len(_keyword_occurences)}")
    print("--------")

In [88]:
analyse_keywords_occurences(keyword_occurences)

For at least 1 occurences:
Unique ids: 2000
Total: 18410
--------
For at least 2 occurences:
Unique ids: 1992
Total: 2165
--------
For at least 3 occurences:
Unique ids: 1981
Total: 1182
--------
For at least 4 occurences:
Unique ids: 1967
Total: 835
--------
For at least 5 occurences:
Unique ids: 1943
Total: 625
--------
For at least 6 occurences:
Unique ids: 1917
Total: 505
--------
For at least 7 occurences:
Unique ids: 1879
Total: 416
--------
For at least 8 occurences:
Unique ids: 1849
Total: 345
--------
For at least 9 occurences:
Unique ids: 1810
Total: 282
--------


In [89]:
lengths = []

for key in keyword_occurences:
  lengths.append(len(keyword_occurences[key]))

In [90]:
print(f"Median of keyword occurences: {median(lengths)}")
print(f"Mean of keyword occurences: {mean(lengths)}")

Median of keyword occurences: 1.0
Mean of keyword occurences: 1.480391091797936


In [91]:
docs_neighbours = {}

for keyword_list in keywords:
  for keyword in keyword_list:
    for doc_id in keyword_occurences[keyword]:
      if doc_id not in docs_neighbours:
        docs_neighbours[doc_id] = set()
      docs_neighbours[doc_id].update(keyword_occurences[keyword])

In [92]:
lengths = []

for key in docs_neighbours:
  lengths.append(len(docs_neighbours[key]))

print(f"Median of document neighbours: {median(lengths)}")
print(f"Mean of document neighbours: {mean(lengths)}")

Median of document neighbours: 53.0
Mean of document neighbours: 67.048


In [93]:
vertices = set(range(len(docs)))
dfs_queue = []
graphs = []

while len(vertices) > 0:
  vertex = vertices.pop()
  dfs_queue.append(vertex)
  inner_graph = set()

  while len(dfs_queue) > 0:
    vertex = dfs_queue.pop()
    inner_graph.add(vertex)
    for neighbour in docs_neighbours[vertex]:
      if neighbour in vertices:
        dfs_queue.append(neighbour)
        vertices.remove(neighbour)
  
  graphs.append(inner_graph)

for graph in graphs:
  print(len(graph))

1988
1
1
1
1
2
1
1
1
1
2


## Conclusions

That means that our docs graph, with docs as vertices and shared keywords as edges, contains one big connected component covering 1988 of the 2000 docs. We could probably remove some bridge keywords to split it into more components, but I guess it doesn't make sense. Even with such a keyword removed, the docs it connected would remain similar and could still end up in separate sets (test and train).
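
The component sizes printed above can be cross-checked without building the neighbour sets. The following is a minimal sketch using a union-find structure instead of the traversal used in this notebook. It assumes a mapping shaped like `keyword_occurences` (keyword to list of doc ids); the function name `component_sizes` and the toy input are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def component_sizes(keyword_occurences: Dict[str, List[int]], n_docs: int) -> List[int]:
  """Return the sizes of connected components, largest first."""
  parent = list(range(n_docs))

  def find(x: int) -> int:
    # Follow parent links to the root, shortening the path on the way (path halving)
    while parent[x] != x:
      parent[x] = parent[parent[x]]
      x = parent[x]
    return x

  # Every keyword links all docs that contain it, so merge them with the first one
  for ids in keyword_occurences.values():
    for other in ids[1:]:
      root_a, root_b = find(ids[0]), find(other)
      if root_a != root_b:
        parent[root_b] = root_a

  return sorted(Counter(find(i) for i in range(n_docs)).values(), reverse=True)


# Toy input: docs 0-2 are chained together, 4 and 5 share a keyword, 3 and 6 are isolated.
toy_occurences = {"x": [0, 1], "y": [1, 2], "z": [3], "w": [4, 5]}
toy_sizes = component_sizes(toy_occurences, 7)
print(toy_sizes)  # [3, 2, 1, 1]
```

Called as `component_sizes(keyword_occurences, len(docs))`, it should report the same sizes as the traversal, only sorted.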

Presumably, we should just split the dataset into two separate sets, making sure that the same doc does not end up in both of them.
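
Such a split could be sketched as a seeded shuffle of doc ids. The function name `split_doc_ids`, the 20% test fraction, and the seed are assumptions made for illustration; the notebook does not fix any of them.

```python
import random
from typing import List, Tuple


def split_doc_ids(n_docs: int, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
  """Shuffle doc ids and cut them into disjoint train and test lists."""
  doc_ids = list(range(n_docs))
  # A dedicated Random instance keeps the split reproducible (seed is an assumed value)
  random.Random(seed).shuffle(doc_ids)
  n_test = int(n_docs * test_fraction)
  return sorted(doc_ids[n_test:]), sorted(doc_ids[:n_test])


train_ids, test_ids = split_doc_ids(2000)

# Every doc id lands in exactly one of the two sets
assert set(train_ids).isdisjoint(test_ids)
print(len(train_ids), len(test_ids))  # 1600 400
```

Because each id is drawn from a single shuffled list, no doc can appear in both sets.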

## Keywords choice

Let's verify how many pairs and triples of keywords occur in multiple docs.

In [96]:
pairs_ids = {}

for doc_id in range(len(keywords)):
  docs_keywords = keywords[doc_id]
  for i in range(len(docs_keywords)):
    for j in range(i + 1, len(docs_keywords)):
      pair = (docs_keywords[i], docs_keywords[j])
      if pair not in pairs_ids:
        pairs_ids[pair] = [doc_id]
      else:
        pairs_ids[pair].append(doc_id)

analyse_keywords_occurences(pairs_ids)

For at least 1 occurences:
Unique ids: 2000
Total: 205452
--------
For at least 2 occurences:
Unique ids: 1651
Total: 3831
--------
For at least 3 occurences:
Unique ids: 1206
Total: 970
--------
For at least 4 occurences:
Unique ids: 924
Total: 456
--------
For at least 5 occurences:
Unique ids: 714
Total: 252
--------
For at least 6 occurences:
Unique ids: 566
Total: 159
--------
For at least 7 occurences:
Unique ids: 479
Total: 95
--------
For at least 8 occurences:
Unique ids: 406
Total: 68
--------
For at least 9 occurences:
Unique ids: 351
Total: 48
--------


In [97]:
triples_ids = {}

for doc_id in range(len(keywords)):
  docs_keywords = keywords[doc_id]
  for i in range(len(docs_keywords)):
    for j in range(i + 1, len(docs_keywords)):
      for k in range(j + 1, len(docs_keywords)):
        triple = (docs_keywords[i], docs_keywords[j], docs_keywords[k])
        if triple not in triples_ids:
          triples_ids[triple] = [doc_id]
        else:
          triples_ids[triple].append(doc_id)

analyse_keywords_occurences(triples_ids)

For at least 1 occurences:
Unique ids: 1996
Total: 1229994
--------
For at least 2 occurences:
Unique ids: 779
Total: 2637
--------
For at least 3 occurences:
Unique ids: 349
Total: 324
--------
For at least 4 occurences:
Unique ids: 229
Total: 107
--------
For at least 5 occurences:
Unique ids: 145
Total: 57
--------
For at least 6 occurences:
Unique ids: 115
Total: 27
--------
For at least 7 occurences:
Unique ids: 77
Total: 12
--------
For at least 8 occurences:
Unique ids: 49
Total: 8
--------
For at least 9 occurences:
Unique ids: 35
Total: 5
--------


It seems that there are not many pairs or triples of keywords that occur in multiple docs: only 3831 of 205452 pairs and 2637 of 1229994 triples appear in at least two docs. That means that we should rely on single keywords.
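
The pair and triple loops above follow the same pattern, so they could be generalized to combinations of any size. This is a minimal sketch built on `itertools.combinations`; the function name `tuples_ids` and the toy keyword lists are hypothetical. It relies on each doc's keyword list being sorted, as `read_keywords` ensures, so that the same combination always yields the same tuple.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def tuples_ids(keywords: List[List[str]], size: int) -> Dict[Tuple[str, ...], List[int]]:
  """Map every `size`-element keyword combination to the ids of docs containing it."""
  result: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
  for doc_id, docs_keywords in enumerate(keywords):
    # combinations() preserves input order, so sorted lists give canonical tuples
    for combo in combinations(docs_keywords, size):
      result[combo].append(doc_id)
  return dict(result)


# Toy data, not taken from the dataset.
toy_keywords = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
toy_pairs = tuples_ids(toy_keywords, 2)
print(toy_pairs)  # {('a', 'b'): [0, 1], ('a', 'c'): [0], ('b', 'c'): [0, 2]}
```

The result has the same shape as `pairs_ids` and `triples_ids`, so it can be passed directly to `analyse_keywords_occurences`.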
