# Elasticsearch mini tutorial

👨‍🎓 Elasticsearch (ES) is one of those technologies you never hear about in DS courses, but it’s very common in industry.

You'll find ES in places like GitHub, Uber, or Facebook.

Not familiar with ES? Don't worry, I wrote a quick tutorial to get you started with it.

## 📝 What’s Elasticsearch in 30 secs or less?

It's a distributed, fast, and easy-to-scale search engine capable of handling all types of data.

It's frequently used to power search in apps and websites, and for log analytics.

Ok, enough theory. Let's start writing some code!

## 0️⃣ Prerequisites

1. Install `docker`

2. Create a virtual environment and install `pandas`, `elasticsearch`, and `notebook`

```shell
$ pip install pandas notebook elasticsearch
```

## 1️⃣ Run an ES cluster

The easiest way to run Elasticsearch locally is with Docker.

Open a terminal and run this command to start a single-node ES cluster for local development (security is disabled, so keep it to local use):

```shell
$ docker run --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.3.3
```

## 2️⃣ Connect to your cluster

Create a new Jupyter notebook and run the following code to connect to your newly created ES cluster.

If everything went well, you should see an output similar to mine.

In [8]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info().body

{'name': '734a2a515292',
 'cluster_name': 'docker-cluster',
 'cluster_uuid': 'gtUFPOptRBiCZllYvFkLlA',
 'version': {'number': '8.3.3',
  'build_flavor': 'default',
  'build_type': 'docker',
  'build_hash': '801fed82df74dbe537f89b71b098ccaff88d2c56',
  'build_date': '2022-07-23T19:30:09.227964828Z',
  'build_snapshot': False,
  'lucene_version': '9.2.0',
  'minimum_wire_compatibility_version': '7.17.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'You Know, for Search'}

## 3️⃣ Read the dataset

You'll need some data to use in ES, so download this dataset: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots

Read it using pandas and extract a sample of 5,000 rows (if you don't take a sample, the next steps may take a long time to run):

In [9]:
import pandas as pd

df = (
    pd.read_csv("wiki_movie_plots_deduped.csv")
    .dropna()
    .sample(5000, random_state=42)
)

## 4️⃣ Create an index

An index is a collection of documents. ES stores it using a very efficient data structure called an inverted index, which maps each term to the documents that contain it.

This structure is what allows ES to perform very fast full-text searches.
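To build some intuition, here's a toy inverted index in plain Python. It's only a sketch: the `build_inverted_index` and `search` helpers and the three sample sentences are made up for illustration, and the real thing (with analyzers, scoring, and on-disk storage) is far more sophisticated.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Start from the documents matching the first token, then intersect.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical mini-corpus, not taken from the movie dataset.
docs = {
    1: "a detective investigates a murder",
    2: "a murder mystery on a train",
    3: "a romantic comedy on a train",
}
toy_index = build_inverted_index(docs)
print(search(toy_index, "murder train"))  # only document 2 contains both words
```

The lookup never scans the documents themselves. It only intersects small sets of ids, which is the basic idea behind fast full-text search.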

You can create a new index like this:

In [10]:
mappings = {
    "properties": {
        "title": {"type": "text", "analyzer": "english"},
        "ethnicity": {"type": "text", "analyzer": "standard"},
        "director": {"type": "text", "analyzer": "standard"},
        "cast": {"type": "text", "analyzer": "standard"},
        "genre": {"type": "text", "analyzer": "standard"},
        "plot": {"type": "text", "analyzer": "english"},
        "year": {"type": "integer"},
        "wiki_page": {"type": "keyword"},
    }
}

es.indices.create(index="movies", mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies'})

## 5️⃣ Add data to your index

You can use `.index()` or `.bulk()` to add data to an index.

`.index()` adds one document at a time, while `.bulk()` lets you add multiple documents in a single request.

You can use either method to add data to your index:

### Using `.index()`

In [None]:
for i, row in df.iterrows():
    doc = {
        "title": row["Title"],
        "ethnicity": row["Origin/Ethnicity"],
        "director": row["Director"],
        "cast": row["Cast"],
        "genre": row["Genre"],
        "plot": row["Plot"],
        "year": row["Release Year"],
        "wiki_page": row["Wiki Page"]
    }

    es.index(index="movies", id=i, document=doc)

### Using `.bulk()`

In [None]:
from elasticsearch.helpers import bulk

bulk_data = []
for i, row in df.iterrows():
    bulk_data.append(
        {
            "_index": "movies",
            "_id": i,
            "_source": {
                "title": row["Title"],
                "ethnicity": row["Origin/Ethnicity"],
                "director": row["Director"],
                "cast": row["Cast"],
                "genre": row["Genre"],
                "plot": row["Plot"],
                "year": row["Release Year"],
                "wiki_page": row["Wiki Page"],
            }
        }
    )
bulk(es, bulk_data)
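If you'd rather not build the whole list in memory first, one option is to produce the actions lazily with a generator. The sketch below is an assumption on my side rather than part of the original notebook: `generate_actions` and `FIELD_TO_COLUMN` are hypothetical names, and the sample row holds placeholder values (not real data) so the snippet runs without the dataset or a cluster.

```python
# Index field name -> column name in the movie plots CSV.
FIELD_TO_COLUMN = {
    "title": "Title",
    "ethnicity": "Origin/Ethnicity",
    "director": "Director",
    "cast": "Cast",
    "genre": "Genre",
    "plot": "Plot",
    "year": "Release Year",
    "wiki_page": "Wiki Page",
}


def generate_actions(rows, index_name="movies"):
    """Yield one bulk action per (doc_id, row) pair instead of building a list."""
    for doc_id, row in rows:
        yield {
            "_index": index_name,
            "_id": doc_id,
            "_source": {
                field: row[column] for field, column in FIELD_TO_COLUMN.items()
            },
        }


# Placeholder row with the dataset's column names (values are made up).
sample_rows = [
    (
        0,
        {
            "Title": "Example Title",
            "Origin/Ethnicity": "Example Origin",
            "Director": "Example Director",
            "Cast": "Example Actor",
            "Genre": "drama",
            "Plot": "Example plot text.",
            "Release Year": 2000,
            "Wiki Page": "https://example.com/wiki/Example_Title",
        },
    )
]
actions = list(generate_actions(sample_rows))
print(actions[0]["_source"]["title"])
```

In the notebook, you could pass the generator straight to the helper with `bulk(es, generate_actions(df.iterrows()))`, since `df.iterrows()` yields `(index, row)` pairs and `bulk` accepts any iterable of actions.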

### Check the number of documents indexed

In [None]:
es.indices.refresh(index="movies")
es.cat.count(index="movies", format="json")

## 6️⃣ Run searches on your ES index

Finally, you'll want to start running searches using your index.

ES has a powerful DSL that lets you build many types of queries.

Here's an example of a search that looks for movies starring Jack Nicholson, but whose director isn't Roman Polanski:

In [None]:
resp = es.search(
    index="movies",
    query={
        "bool": {
            "must": {
                "match": {
                    "cast": {"query": "jack nicholson"},
                }
            },
            "filter": {"bool": {"must_not": {"match": {"director": "roman polanski"}}}},
        },
    },
)
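The matching documents live under `resp["hits"]["hits"]`, each with a relevance `_score` and the original document in `_source`. Here's a small sketch for pulling out the fields you care about. `summarize_hits` is a hypothetical helper, and `fake_resp` is a hand-written stand-in with placeholder values that mimics the shape of a search response, so the snippet runs without a cluster.

```python
def summarize_hits(resp, fields=("title", "director", "year")):
    """Return a (score, {field: value}) tuple for each hit in a search response."""
    summary = []
    for hit in resp["hits"]["hits"]:
        source = hit["_source"]
        summary.append((hit["_score"], {f: source.get(f) for f in fields}))
    return summary


# Stand-in response with placeholder values (not real search results).
fake_resp = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "movies",
                "_id": "0",
                "_score": 1.0,
                "_source": {
                    "title": "Example Title",
                    "director": "Example Director",
                    "cast": "Example Actor",
                    "year": 2000,
                },
            }
        ],
    }
}

for score, fields in summarize_hits(fake_resp):
    print(score, fields)
```

In the notebook, call `summarize_hits(resp)` on the response from the search above.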