In [1]:
%pip install elasticsearch gdown
%load_ext autoreload

Note: you may need to restart the kernel to use updated packages.





## Setting up Elasticsearch

In [2]:
from datetime import datetime
from elasticsearch import Elasticsearch

ELASTIC_PASSWORD = "p2iFCHUbC7ze1QoIMVw"

es = Elasticsearch("http://localhost:9200",
                    basic_auth=("elastic", ELASTIC_PASSWORD))

es.info()

ObjectApiResponse({'name': 'es-node', 'cluster_name': 'tdt4117-ir-data-cluster', 'cluster_uuid': 'i6jKZRxDRJ212GIEPZpLWA', 'version': {'number': '8.4.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '89f8c6d8429db93b816403ee75e5c270b43a940a', 'build_date': '2022-09-14T16:26:04.382547801Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

## Setting up the documents

We got the documents from Blackboard, and also downloaded the regular Enron dataset from the web.

### Fix for Windows:
Windows doesn't like the filenames used by Enron. In the regular Enron dataset, the filenames end with a dot, which is illegal in Windows. Because of this, Windows doesn't let you rename the files: in its eyes they don't exist. So we need to use a Linux approach.

Use MSYS to change all filenames to have suffix '.txt' with the following command:

```bash
cd enron_folder
find . -type f -name '*.' -exec bash -c 'mv "$1" "${1%.*}.txt"' _ {} \;
```
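The same rename can be sketched in Python. This is only an illustration, assuming it runs on a filesystem that accepts trailing-dot names (Linux, for example); the function name `rename_trailing_dot_files` is ours and not part of the project scripts.

```python
import os

def rename_trailing_dot_files(root, suffix=".txt"):
    """Rename every file under `root` whose name ends with '.' so it ends with `suffix`.

    Returns the number of files renamed.
    """
    renamed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith("."):
                src = os.path.join(dirpath, name)
                # "1." -> "1.txt": drop the trailing dot, then append the suffix
                dst = os.path.join(dirpath, name[:-1] + suffix)
                os.rename(src, dst)
                renamed += 1
    return renamed

# Hypothetical usage:
# rename_trailing_dot_files("enron_folder")
```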

Additionally, in enron_short, the files are put in folders with the same name as the illegal files. We need to remove these folders as well. For this, we have created two scripts, `./rename_files.sh` and `./remove_dirs.sh`, which do this for us. They take a long time to run, however, even though they are multiprocessed, so it is better to use Linux instead.

Or there is a better way of doing it and we wasted a lot of time...

After fixing the data, we can read all the files recursively like this:

In [3]:
import os
def read_files_recursive(base): 
    """Recursive function to read all files in a directory and all subdirectories"""
    corpus = []
    paths = os.listdir(base)
    for path in paths:
        # If it is a directory, traverse by recursion:
        if os.path.isdir(base + path):
            new_corpus = read_files_recursive(base + path + "/")
            corpus.extend(new_corpus)
        # It is a file, READ IT:
        elif os.path.isfile(base + path):
            with open(base + path) as f:
                corpus.append({
                    'title': base + path,
                    'content': f.read()
                })
        else:
            raise FileNotFoundError(f"The file '{base + path}' was not recognized")
    return corpus

Or iteratively...

In [4]:
import os
from tqdm import tqdm
import codecs
def read_files(base_path):
    """
        An iterative version in case the recursive one doesn't work.
        It also shows a progress bar (tqdm), but seems to be much slower...
    """
    corpus = []
    paths = os.listdir(base_path)
    for path in tqdm(paths):
        # If it is a file, READ IT:
        if os.path.isfile(base_path + path):
            with open(base_path + path) as f:
                corpus.append({
                    'title': base_path + path,
                    'content': f.read()
                })
        # If it is a dir, list it and add to paths
        else:
            new_paths = os.listdir(base_path + path)
            new_paths = [path + "/" + new_path for new_path in new_paths]
            paths.extend(new_paths)
    return corpus

# Second iterative function, for good measure
import fnmatch
def find(pattern, path):
    """Even better: walk the tree with os.walk and read every file matching `pattern`."""
    result = []
    for root, _, files in os.walk(path):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                result.append(os.path.join(root, name))
    corpus = []
    for path in tqdm(result):
        with codecs.open(path) as f:
            corpus.append({
                'title': path,
                'content': f.read()
            })
    return corpus
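The readers above open files with the platform's default encoding, which may fail on emails containing bytes that are not valid in that encoding. A possible variant using `pathlib` is sketched below; `read_corpus` is a hypothetical name, and choosing UTF-8 with `errors="replace"` is an assumption, not something we verified against the dataset.

```python
from pathlib import Path

def read_corpus(base, pattern="*.txt", encoding="utf-8"):
    """Read every file under `base` matching `pattern` into {'title', 'content'} dicts."""
    corpus = []
    # sorted() makes the document order (and thus any index-based IDs) deterministic
    for path in sorted(Path(base).rglob(pattern)):
        if path.is_file():
            corpus.append({
                "title": str(path),
                # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
                "content": path.read_text(encoding=encoding, errors="replace"),
            })
    return corpus
```

Because of the sorting, document order can differ from what `os.walk` produces, so IDs would not line up with the ones shown in the outputs of this notebook.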

Now let's actually read all the files

In [5]:
base = "./enron_short/maildir/"
corpus = find("*.txt", base)
print(f"Collection size: {len(corpus)}")

100%|██████████| 191560/191560 [24:37<00:00, 129.67it/s]

Collection size: 191560





### Indexing the documents

In [6]:
# Import documents into Elasticsearch.
# autoreload lets us reload our indexer module without reading the whole dataset again.
from indexer import CustomIndexer
%autoreload 2
from elasticsearch import helpers
import time
index_name = "y"
if not es.indices.exists(index=index_name):
    print("Creating ElasticSearch Indexer:")
    es.indices.create(index=index_name)
    actions = []
    for i, doc in enumerate(tqdm(corpus)):
        actions.append({
                '_index': index_name,
                '_id': i,
                '_op_type': 'index',
                '_source': doc
            }
        )
    # Bulk load into elasticsearch
    helpers.bulk(es, actions)

# Initiate our own indexer:
print("Creating CustomIndexer:")
indexer = CustomIndexer([doc['content'] for doc in corpus])

Creating CustomIndexer:


100%|██████████| 191560/191560 [11:32<00:00, 276.70it/s] 
100%|██████████| 191560/191560 [00:30<00:00, 6339.60it/s] 
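The cell above builds the full list of bulk actions in memory before sending it. Since `helpers.bulk` accepts any iterable of actions, a generator is one way to avoid that extra list. The sketch below uses a hypothetical helper name, `generate_actions`, and produces the same action dictionaries as the loop above.

```python
def generate_actions(corpus, index_name):
    """Lazily yield one bulk 'index' action per document, using its position as the ID."""
    for i, doc in enumerate(corpus):
        yield {
            "_index": index_name,
            "_id": i,
            "_op_type": "index",
            "_source": doc,
        }

# Usage, assuming `es`, `helpers`, `corpus` and `index_name` from the cells above:
# helpers.bulk(es, generate_actions(corpus, index_name))
```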


In [7]:
def print_es(results):
    print("ElasticSearch:")
    print(f"Results ({len(results['hits']['hits'])}):")
    for hit in results['hits']['hits']:
        content = hit['_source']['content'][:100].replace("\n", "")
        print(f"ID: {hit['_id']}, Score: {hit['_score']}, Content: {content}")
def print_custom(results):
    print("Custom (ours):")
    if results:
        print(f"Results ({len(results)}):")
        for res in results:
            content = res['content'][:100].replace("\n", "")
            print(f"ID: {res['id']}, Score: {res['score']:.4f}, Content: {content}")
    else:
        print("No results...")
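The query cells in this notebook wrap each search in the same `time.time()` bookkeeping. A small context manager could take over that repetition. This is only a sketch: `timed` is a hypothetical helper, and it uses `time.perf_counter()`, which is generally better suited to measuring short intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="Elapsed"):
    """Print how long the enclosed block took, in the same format as the cells here."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} {time.perf_counter() - start}s\n")

# Usage, assuming `es`, `index_name`, `es_query`, `indexer` and `query` exist:
# with timed():
#     print_es(es.search(index=index_name, body=es_query))
# with timed():
#     print_custom(indexer.search(query))
```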

Boolean query for NTNU

In [8]:
# Query for "Norwegian AND University AND Science AND Technology"
query = "Norwegian AND University AND Science AND Technology"
es_query = {
    "query": {
        "bool" : {
            "must": [{
                "match": {
                "preference_1": "Norwegian"
                }
            }, {
                "match": {
                "preference_1": "University"
                }
            }, {
                "match": {
                "preference_1": "Science"
                }
            }, {
                "match": {
                "preference_1": "Technology"
                }
            }]
        }
    }
}
start_time = time.time()
print_es(es.search(index=index_name, body=es_query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

start_time = time.time()
print_custom(indexer.search(query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

ElasticSearch:
Results (0):
Elapsed 0.00699925422668457s

Custom (ours):
Results (1):
ID: 122725, Score: 96.2405, Content: Message-ID: <15111552.1075841109495.JavaMail.evans@thyme>Date: Wed, 30 Jan 2002 07:22:55 -0800 (PST
Elapsed 0.002999544143676758s



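Zero results from Elasticsearch. That does not necessarily mean nobody mentioned NTNU: the query matches on `preference_1`, but our documents were indexed with only `title` and `content` fields, so it would return no hits whatever the emails contain.

Our own indexer found one result, however. Let's try without the AND operators.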
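For reference, here is a sketch of how the AND query could be built against the `content` field, which is the field the documents were indexed with. The helper names are hypothetical, and we have not run these queries against the index, so no results are shown.

```python
def build_and_query(query, field="content"):
    """Turn 'A AND B AND C' into a bool query requiring every term in `field`."""
    terms = [term for term in query.split() if term != "AND"]
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}} for term in terms]
            }
        }
    }

def build_match_all_terms_query(text, field="content"):
    """Alternative: a single match query whose operator is switched from 'or' to 'and'."""
    return {
        "query": {
            "match": {field: {"query": text, "operator": "and"}}
        }
    }

# Usage, assuming `es` and `index_name` from the cells above:
# es_query = build_and_query("Norwegian AND University AND Science AND Technology")
# print_es(es.search(index=index_name, body=es_query))
```

The second form is shorter and relies on the `operator` parameter of the `match` query, which defaults to `or`.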

In [9]:
# Query for "Norwegian University Science Technology"
query = "Norwegian University Science Technology"
es_query = {
    "query": {
        "match": {
            "content": query
        }
    }
}
start_time = time.time()
print_es(es.search(index=index_name, body=es_query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

start_time = time.time()
print_custom(indexer.search(query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

ElasticSearch:
Results (10):
ID: 83815, Score: 20.404312, Content: Message-ID: <24164547.1075840774929.JavaMail.evans@thyme>Date: Fri, 21 Dec 2001 04:43:38 -0800 (PST
ID: 89895, Score: 20.404312, Content: Message-ID: <21996921.1075855468318.JavaMail.evans@thyme>Date: Fri, 21 Dec 2001 04:43:38 -0800 (PST
ID: 83809, Score: 19.514158, Content: Message-ID: <21179930.1075840774807.JavaMail.evans@thyme>Date: Sun, 23 Dec 2001 14:45:10 -0800 (PST
ID: 89900, Score: 19.514158, Content: Message-ID: <33145748.1075855468410.JavaMail.evans@thyme>Date: Sun, 23 Dec 2001 14:45:10 -0800 (PST
ID: 12286, Score: 18.755344, Content: Message-ID: <4007612.1075855889171.JavaMail.evans@thyme>Date: Mon, 24 Apr 2000 11:26:00 -0700 (PDT)
ID: 55651, Score: 18.755344, Content: Message-ID: <24069195.1075846942557.JavaMail.evans@thyme>Date: Mon, 24 Apr 2000 11:26:00 -0700 (PDT
ID: 64118, Score: 18.755344, Content: Message-ID: <9729632.1075847069332.JavaMail.evans@thyme>Date: Mon, 24 Apr 2000 11:26:00 -0700 (PDT)
ID: 8

Let's see what we are missing for the first doc...

In [10]:
[(term, term in corpus[83815]['content']) for term in query.split(" ")]

[('Norwegian', False),
 ('University', True),
 ('Science', True),
 ('Technology', True)]
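Note that the check above is a case-sensitive substring test, while Elasticsearch's default analyzer lowercases the text and splits it into tokens. A sketch of a check that roughly approximates that behaviour follows; `term_presence` is a hypothetical helper, and the `\w+` tokenization is a simplification of what the standard analyzer does.

```python
import re

def term_presence(text, query):
    """Report, per query term, whether it occurs as a whole word, ignoring case."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [(term, term.lower() in tokens) for term in query.split()]

# Usage, assuming `corpus` and `query` from the cells above:
# term_presence(corpus[83815]['content'], query)
```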

Ofc. 'Norwegian' is not mentioned, it is never mentioned...

Or is it?

In [11]:
# Query for "Norwegian"
query = "Norwegian"
es_query = {
    "query": {
        "match": {
            "content": query
        }
    }
}
start_time = time.time()
print_es(es.search(index=index_name, body=es_query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

start_time = time.time()
print_custom(indexer.search(query))
end_time = time.time()
print(f"Elapsed {end_time - start_time}s\n")

ElasticSearch:
Results (10):
ID: 6256, Score: 12.662199, Content: Message-ID: <22494771.1075840315764.JavaMail.evans@thyme>Date: Fri, 11 Jan 2002 10:27:01 -0800 (PST
ID: 56755, Score: 12.292053, Content: Message-ID: <29112767.1075846916113.JavaMail.evans@thyme>Date: Tue, 31 Aug 1999 03:30:00 -0700 (PDT
ID: 70199, Score: 12.292053, Content: Message-ID: <16800314.1075847157526.JavaMail.evans@thyme>Date: Tue, 31 Aug 1999 03:30:00 -0700 (PDT
ID: 72575, Score: 12.040761, Content: Message-ID: <15991008.1075858947229.JavaMail.evans@thyme>Date: Thu, 11 Oct 2001 08:42:35 -0700 (PDT
ID: 62500, Score: 11.921822, Content: Message-ID: <8151064.1075858954516.JavaMail.evans@thyme>Date: Wed, 3 Oct 2001 06:23:51 -0700 (PDT)
ID: 62644, Score: 11.886801, Content: Message-ID: <18287813.1075858957976.JavaMail.evans@thyme>Date: Thu, 11 Oct 2001 08:36:47 -0700 (PDT
ID: 6315, Score: 11.823177, Content: Message-ID: <28710248.1075840311804.JavaMail.evans@thyme>Date: Fri, 1 Feb 2002 12:51:52 -0800 (PST)
ID: 6327

We get some hits on 'Norwegian' (yay), but nothing about NTNU there...