# Purpose of this Notebook
This notebook is the primary preprocessing pipeline that turns the **Movies Dataset** CSVs into input for our movie recommendation system. Its purpose is to ingest the raw movie data from Kaggle (several CSV files), clean it, engineer features, and output a set of clean, structured JSON documents ready for indexing into **Elasticsearch**.

## The steps executed in this notebook are:
1. Loading data: from the Movies Dataset on Kaggle we use `movies_metadata`, `credits`, and `keywords`. These files form the basis of the content-based recommendation system.
2. Data cleaning and merging.
3. Feature engineering: extracting key information such as cast members, directors, and a clean list of keywords.
4. Semantic embedding: a pre-trained SentenceTransformer (BERT-based) model turns a combined text field into a vector embedding that captures the movie's semantic meaning.
5. Final output: structuring the cleaned data and the generated embeddings as JSON documents, which are used to populate the search results and the recommendation engine.

In [1]:
import pandas as pd
import json
from sentence_transformers import SentenceTransformer, util
import ast
import numpy as np

from tqdm.notebook import tqdm
tqdm.pandas()

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import streaming_bulk

# Data Cleaning

In [2]:
movies_df = pd.read_csv('movies_metadata.csv')
keywords_df = pd.read_csv('keywords.csv')
credits_df = pd.read_csv('credits.csv')

movies_df.drop_duplicates(inplace=True)
keywords_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)



## Movies Metadata

In [3]:
movies_df.head().T

Unnamed: 0,0,1,2,3,4
adult,False,False,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",,"{'id': 96871, 'name': 'Father of the Bride Col..."
budget,30000000,65000000,0,16000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 35, 'name': 'Comedy'}]"
homepage,http://toystory.disney.com/toy-story,,,,
id,862,8844,15602,31357,11862
imdb_id,tt0114709,tt0113497,tt0113228,tt0114885,tt0113041
original_language,en,en,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...


In [4]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45453 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45453 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45453 non-null  object 
 3   genres                 45453 non-null  object 
 4   homepage               7780 non-null   object 
 5   id                     45453 non-null  object 
 6   imdb_id                45436 non-null  object 
 7   original_language      45442 non-null  object 
 8   original_title         45453 non-null  object 
 9   overview               44499 non-null  object 
 10  popularity             45448 non-null  object 
 11  poster_path            45067 non-null  object 
 12  production_companies   45450 non-null  object 
 13  production_countries   45450 non-null  object 
 14  release_date           45366 non-null  object 
 15  revenue

### 1. The ID should be an integer
In the head preview the IDs look like integers, but `info()` reports the `id` column as `object`. This suggests that some rows hold non-numeric IDs, so we drop the rows whose ID cannot be converted to an integer and cast the rest.
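Before dropping rows, it can be useful to look at the offending values. The following is a minimal sketch on toy data, not the notebook's actual rows: the malformed value `"1997-08-20"` is a hypothetical example, and `pd.to_numeric(..., errors="coerce")` is used here as an alternative to `str.isnumeric()`.

```python
import pandas as pd

# Toy stand-in for movies_df['id']; the date-like string is a hypothetical bad value.
ids = pd.Series(["862", "8844", "1997-08-20", "15602"], name="id")

# errors="coerce" turns anything that is not a number into NaN instead of raising.
numeric_ids = pd.to_numeric(ids, errors="coerce")

# Rows that failed the conversion, kept for inspection before they are dropped.
bad_rows = ids[numeric_ids.isna()]

# The remaining values can be cast to integers safely.
clean_ids = numeric_ids.dropna().astype(int)

print(bad_rows.tolist())
print(clean_ids.tolist())
```

Inspecting `bad_rows` first makes it easier to judge whether the rows are truly corrupt or only need a small repair.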

In [5]:
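# Keep only rows whose id is purely numeric; .copy() avoids a SettingWithCopyWarning on the next line
movies_df = movies_df[movies_df['id'].str.isnumeric()].copy()
movies_df['id'] = movies_df['id'].astype(int)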

In [6]:
movies_df['id'].info()

<class 'pandas.core.series.Series'>
Index: 45450 entries, 0 to 45465
Series name: id
Non-Null Count  Dtype
--------------  -----
45450 non-null  int32
dtypes: int32(1)
memory usage: 532.6 KB


The ID column is now clean: all 45450 remaining values are integers.

### 2. Unnecessary columns

In [7]:
movies_df.isna().sum()

adult                        0
belongs_to_collection    40959
budget                       0
genres                       0
homepage                 37673
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   3
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      3
runtime                    260
spoken_languages             3
status                      84
tagline                  25042
title                        3
video                        3
vote_average                 3
vote_count                   3
dtype: int64

Not every column is relevant, because the recommender system is content-based. We keep only these columns:
- ID: the identifier of the movie.
- Title: the title of the movie.
- Overview: the synopsis is a useful textual description of the movie's storyline and themes.
- Genres: likely a strong signal for similarity between movies.
- Original Language: some people prefer movies from a certain country or in their native language.

In [8]:
movies = movies_df[['id', 'title', 'overview', 'genres', 'original_language']].copy()
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45450 entries, 0 to 45465
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 45450 non-null  int32 
 1   title              45447 non-null  object
 2   overview           44496 non-null  object
 3   genres             45450 non-null  object
 4   original_language  45439 non-null  object
dtypes: int32(1), object(4)
memory usage: 1.9+ MB


There are some missing values, especially in `overview`. To avoid discarding too much data, we fill the missing values with empty strings instead of dropping the rows.

In [9]:
movies.fillna('', inplace=True)
display(movies.isna().sum())
display(movies.head())

id                   0
title                0
overview             0
genres               0
original_language    0
dtype: int64

Unnamed: 0,id,title,overview,genres,original_language
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",en

## Credits Data

In [10]:
credits_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [11]:
credits_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45439 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45439 non-null  object
 1   crew    45439 non-null  object
 2   id      45439 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


The `id` column is already an integer, which is a good sign: the dataset can be merged on `id` without any type conversion.

## Keywords data

In [12]:
keywords_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [13]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45432 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        45432 non-null  int64 
 1   keywords  45432 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


The keywords data also looks good: `id` is already an integer and there are no missing values.
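Matching dtypes are necessary for a merge on `id`, but they do not guarantee that the keys are unique or that every movie finds a match. The sketch below uses small hypothetical frames (not the real data) to show two checks that could be run before or after the merge: looking for repeated keys, and using `indicator=True` to find rows without a partner.

```python
import pandas as pd

# Hypothetical toy frames standing in for `movies` and `credits_df`.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"id": [1, 2, 2], "cast": ["x", "y", "y2"]})

# Keys that occur more than once on the right side multiply rows in a left merge.
dup_ids = credits.loc[credits["id"].duplicated(keep=False), "id"].unique().tolist()

# indicator=True adds a `_merge` column telling where each row came from.
merged = movies.merge(credits, on="id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

print(dup_ids, unmatched, len(merged))
```

In this toy case the movie with `id` 2 appears twice after the merge and the movie with `id` 3 has no credits, which are the two situations worth watching for.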

# Merging Data

In [14]:
df = pd.merge(movies, credits_df, on='id', how='left')
df = pd.merge(df, keywords_df, on='id', how='left')
df.drop_duplicates(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45440 entries, 0 to 45462
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 45440 non-null  int32 
 1   title              45440 non-null  object
 2   overview           45440 non-null  object
 3   genres             45440 non-null  object
 4   original_language  45440 non-null  object
 5   cast               45439 non-null  object
 6   crew               45439 non-null  object
 7   keywords           45439 non-null  object
dtypes: int32(1), object(7)
memory usage: 2.9+ MB


In [15]:
df.dropna(inplace=True)

In [16]:
df.head()

Unnamed: 0,id,title,overview,genres,original_language,cast,crew,keywords
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",en,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


# Parse JSON columns

In [17]:
def parse_json_col(data):
    """Safely parse a stringified Python literal (e.g. a list of dicts) from a dataframe column."""
    try:
        return ast.literal_eval(data)
    except (ValueError, SyntaxError):
        return []  # Return an empty list if parsing fails
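The reason for `ast.literal_eval` rather than `json.loads` is visible in the previews: the columns hold Python-style literals with single quotes, which are not valid JSON. A minimal sketch, using a value shaped like the `genres` column:

```python
import ast
import json

# A string shaped like the values in the `genres` column (single-quoted keys).
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads rejects single-quoted strings.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval parses Python literals without executing arbitrary code.
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]

print(json_ok, names)
```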

In [18]:
for col in ['genres', 'keywords', 'cast', 'crew']:
    df[col] = df[col].progress_apply(parse_json_col)

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

In [19]:
backup_df = df.copy()

In [20]:
df = backup_df.copy()

Next, I will extract the director's name from the crew column and the first three cast members from the cast column. This approach is based on the assumption that viewers often choose to watch movies directed by certain filmmakers or featuring specific actors. The top three cast members are selected because the most prominent or well-known actors are usually listed at the beginning of the cast list.

We also extract the keywords and genres. For these, all entries are kept rather than only the first few.

## Transform the JSON data into lists

In [21]:
# Extract the name of the director and the first 3 cast members
def get_3_cast(cast):
    return [i['name'] for i in cast[:3] if isinstance(i, dict) and 'name' in i]

def get_director(crew):
    for i in crew:
        if isinstance(i, dict) and i.get('job') == 'Director':
            return i.get('name', '')
    return ''

In [22]:
df['actor'] = df['cast'].progress_apply(get_3_cast)
df['director'] = df['crew'].progress_apply(get_director)
df['keywords_list'] = df['keywords'].progress_apply(lambda x: [i['name'] for i in x if isinstance(i, dict) and 'name' in i])
df['genres_list'] = df['genres'].progress_apply(lambda x: [i['name'] for i in x if isinstance(i, dict) and 'name' in i])

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

  0%|          | 0/45439 [00:00<?, ?it/s]

In [23]:
df.drop(columns=['cast', 'crew', 'keywords', 'genres'], inplace=True)

In [24]:
df.head()

Unnamed: 0,id,title,overview,original_language,actor,director,keywords_list,genres_list
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",en,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy, friendship, friends, riva...","[Animation, Comedy, Family]"
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,en,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,en,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger, o...","[Romance, Comedy]"
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",en,"[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[Comedy, Drama, Romance]"
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,en,"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence, aging, daug...",[Comedy]


Now that the data is clean, we can proceed to the next step: creating the text for **Embedding**.

# Create the Text for Embedding

This step is crucial, as it defines the input text that will be fed into the BERT model. By combining the movie's title, keywords, genres, top three cast members, director, original language, and overview into a single text representation, we aim to capture as much semantic context as possible. This enriched textual input helps BERT better understand the content and characteristics of each movie, which is essential for generating accurate and meaningful embeddings for recommendation.

Because the **overview** column contains significantly more words than the other features, it can drown out important features such as genres, keywords, cast, and director in the similarity computation. To address this, I **apply weights** to these features by repeating their text (three times for genres, keywords, and cast, four times for the director), so that they carry more weight in the embedding and the recommendations become more balanced and diverse.
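To make the effect of repetition-based weighting concrete, here is a small sketch that measures how much of the combined text each field contributes. The helpers `build_weighted_text` and `word_share` are hypothetical names introduced for illustration, and word counts are only a rough proxy for the tokens the model actually sees.

```python
def build_weighted_text(fields, weights):
    """Join field texts, repeating each one weights[name] times (default 1)."""
    parts = []
    for name, text in fields.items():
        if text:
            parts.extend([text] * weights.get(name, 1))
    return " ".join(parts)


def word_share(field_text, repeats, full_text):
    """Fraction of the words in full_text that come from field_text repeated `repeats` times."""
    return len(field_text.split()) * repeats / len(full_text.split())


fields = {
    "genres": "Animation Comedy Family",
    "director": "John Lasseter",
    "overview": "Led by Woody, Andy's toys live happily in his room until "
                "Andy's birthday brings Buzz Lightyear onto the scene.",
}

unweighted = build_weighted_text(fields, {})
weighted = build_weighted_text(fields, {"genres": 3, "director": 4})

share_before = word_share(fields["genres"], 1, unweighted)
share_after = word_share(fields["genres"], 3, weighted)
print(share_before, share_after)
```

One caveat worth checking for the chosen model: transformer encoders truncate input beyond their maximum sequence length, so heavy repetition placed before a long overview may push the end of the overview out of the window.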

In [25]:
def create_soup(row):
    title = (str(row['title']) + ' ') * 1 if row['title'] else ''
    genres = (' '.join(row['genres_list']) + ' ') * 3 if row['genres_list'] else '' 
    keywords = (' '.join(row['keywords_list']) + ' ') * 3 if row['keywords_list'] else ''
    director = (str(row['director']) + ' ') * 4 if row['director'] else ''
    cast = (' '.join(row['actor']) + ' ') * 3 if row['actor'] else ''
    original_language = (str(row['original_language']) + ' ') * 1 if row['original_language'] else ''
    overview = str(row['overview']) if row['overview'] else ''
    return f"{title} {genres} {keywords} {director} {cast} {original_language} {overview}"

df['embed_text'] = df.progress_apply(create_soup, axis=1)

  0%|          | 0/45439 [00:00<?, ?it/s]

In [26]:
display(df[['title', 'embed_text']].head())

Unnamed: 0,title,embed_text
0,Toy Story,Toy Story Animation Comedy Family Animation C...
1,Jumanji,Jumanji Adventure Fantasy Family Adventure Fa...
2,Grumpier Old Men,Grumpier Old Men Romance Comedy Romance Comed...
3,Waiting to Exhale,Waiting to Exhale Comedy Drama Romance Comedy...
4,Father of the Bride Part II,Father of the Bride Part II Comedy Comedy Com...


# Generate the Embeddings

In [29]:
model = SentenceTransformer('mixedbread-ai/mxbai-embed-large-v1')

The texts are encoded in batches of 64. Afterwards the embeddings are converted from NumPy arrays to plain lists, which makes them easier to store in dataframes and avoids serialization issues.
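The serialization issue can be reproduced with the standard `json` module, which does not know how to encode NumPy arrays. A minimal sketch with a toy three-dimensional vector standing in for a real embedding:

```python
import json
import numpy as np

# Toy vector standing in for one row of model.encode(...) output.
vec = np.array([0.25, -0.5, 1.0], dtype=np.float32)

# The standard json encoder raises TypeError for ndarray values.
try:
    json.dumps({"movie_embedding": vec})
    array_ok = True
except TypeError:
    array_ok = False

# .tolist() converts the array to a list of plain Python floats.
as_list = vec.tolist()
encoded = json.dumps({"movie_embedding": as_list})

print(array_ok, encoded)
```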

In [30]:
batch_size = 64
embed_texts = df['embed_text'].tolist()
embeddings = []
for i in tqdm(range(0, len(embed_texts), batch_size)):
    batch = embed_texts[i:i+batch_size]
    batch_embeds = model.encode(batch)
    embeddings.extend(batch_embeds)

  0%|          | 0/710 [00:00<?, ?it/s]

In [31]:
df['movie_embedding'] = embeddings

In [32]:
df['movie_embedding'] = df['movie_embedding'].apply(
    lambda x: x.tolist() if isinstance(x, np.ndarray) 
    else x)

In [33]:
final_df = df[['id', 'title', 'overview', 'genres_list', 'actor', 
            'director', 'keywords_list', 'movie_embedding', 'original_language']].copy()
final_df.head()

Unnamed: 0,id,title,overview,genres_list,actor,director,keywords_list,movie_embedding,original_language
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]","[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy, friendship, friends, riva...","[0.47636154294013977, -0.32746538519859314, 0....",en
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]","[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[0.7078766822814941, -0.29459282755851746, -0....",en
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[Romance, Comedy]","[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger, o...","[0.99278324842453, -0.03799596056342125, 0.207...",en
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[Comedy, Drama, Romance]","[Whitney Houston, Angela Bassett, Loretta Devine]",Forest Whitaker,"[based on novel, interracial relationship, sin...","[0.3308599889278412, -0.24533289670944214, -0....",en
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,[Comedy],"[Steve Martin, Diane Keaton, Martin Short]",Charles Shyer,"[baby, midlife crisis, confidence, aging, daug...","[0.40464338660240173, -0.33400699496269226, 0....",en


In [34]:
# Convert each row to a dictionary (one document per movie) so it is easy to store in Elasticsearch
movie_documents = final_df.to_dict(orient='records')

In [35]:
print("Example document for Elasticsearch:")
print(json.dumps(movie_documents[0], indent=2))

Example document for Elasticsearch:
{
  "id": 862,
  "title": "Toy Story",
  "overview": "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
  "genres_list": [
    "Animation",
    "Comedy",
    "Family"
  ],
  "actor": [
    "Tom Hanks",
    "Tim Allen",
    "Don Rickles"
  ],
  "director": "John Lasseter",
  "keywords_list": [
    "jealousy",
    "toy",
    "boy",
    "friendship",
    "friends",
    "rivalry",
    "boy next door",
    "new toy",
    "toy comes to life"
  ],
  "movie_embedding": [
    0.47636154294013977,
    -0.32746538519859314,
    0.9085360765457153,
    -0.42976704239845276,
    0.29712316393852234,
    -0.8626672029495239,
    -0.7377627491950989,
    -0.08446989953517914,
    0.4505971670150757,
    -0.1992226

In [36]:
final_df.to_parquet('movies_with_embeddings.parquet', index=False)

In [37]:
# Reload the data from the parquet file (lets us skip the embedding step in a later session)
final_df = pd.read_parquet('movies_with_embeddings.parquet')
# Uncomment the following line to rebuild the documents from the reloaded dataframe
# movie_documents = final_df.to_dict(orient='records')

In [38]:
final_df.columns

Index(['id', 'title', 'overview', 'genres_list', 'actor', 'director',
       'keywords_list', 'movie_embedding', 'original_language'],
      dtype='object')

# Uploading the Data to Elasticsearch

## Connect to Elasticsearch

In [39]:
# Make sure that Elasticsearch is running; in this case I use Docker
try:
    es_client = Elasticsearch("http://localhost:9200")
    # Verify the connection
    if not es_client.ping():
        raise ConnectionError("Could not connect to Elasticsearch!")
    print("Successfully connected to Elasticsearch!")
except Exception as e:
    print(e)
    raise  # Stop here if the connection failed

Successfully connected to Elasticsearch!


## Define Index and Mapping
Defining an index and mapping allows us to control how our data is structured and stored in Elasticsearch, ensuring it is imported in the desired format for efficient searching and analysis.

In [40]:
# Define the index and mapping for Elasticsearch
# index name = movies --> this is like a table name in an SQL database
# The mapping is the schema for that table.
index_name = "movies"
embedding_dim = len(final_df['movie_embedding'].iloc[0])  # Get the dimension of the embedding
print(f"embedding dim = {embedding_dim}")
mapping = {
    "properties": {
        "id": {"type": "integer"},
        "title": {"type": "text", "analyzer": "english"},
        "overview": {"type": "text", "analyzer": "english"},
        "genres_list": {"type": "keyword"},
        "actor": {"type": "text", "analyzer": "english"},
        "director": {"type": "keyword"},
        "keywords_list": {"type": "keyword"},
        "original_language": {"type": "keyword"},
        "movie_embedding": {
            "type": "dense_vector",
            "dims": embedding_dim, # dimension depends on the model used --> mxbai-embed-large-v1
            "index": True,
            "similarity": "cosine" # Explicitly set the similarity metric
        }
    }
}

embedding dim = 1024
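The mapping sets `similarity` to `cosine`. As a reminder of what that metric measures, here is a plain-Python sketch on toy three-dimensional vectors. It illustrates the mathematical definition only and says nothing about how Elasticsearch computes or rescales scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # v scaled by 2: length differs, direction does not
orthogonal = [3.0, 0.0, -1.0]      # dot product with v is zero
opposite = [-1.0, -2.0, -3.0]      # v reversed

sim_same = cosine_similarity(v, same_direction)
sim_orth = cosine_similarity(v, orthogonal)
sim_opp = cosine_similarity(v, opposite)
print(sim_same, sim_orth, sim_opp)
```

Because only the direction matters, two movies whose embeddings point the same way count as similar even if the vectors differ in length.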


## Create Index and formatting

In [41]:
# Delete the index if it already exists so that it can be recreated from scratch
if es_client.indices.exists(index=index_name):
    es_client.indices.delete(index=index_name)
    print(f"Deleted existing index '{index_name}'")

# Create the new index with our defined mapping
es_client.indices.create(index=index_name, mappings=mapping)
print(f"Created new index '{index_name}' with custom mapping.")

Deleted existing index 'movies'
Created new index 'movies' with custom mapping.


In [42]:
# We need to format our list of dictionaries into the format bulk() expects.
def generate_actions(documents):
    for doc in documents:
        yield {
            "_index": index_name,
            "_id": doc["id"],  # Use the movie's own ID as the document ID
            "_source": doc,
        }

## Uploading the Documents to Elasticsearch

In [43]:
print("\nUploading documents to Elasticsearch... this may take a moment.")
try:
    progress = tqdm(unit="docs", total=len(movie_documents))

    failed_ids = set()
    # streaming_bulk yields one (ok, result) pair per document, perfect for a progress bar.
    # raise_on_error=False reports failed documents as ok=False instead of raising.
    for ok, result in streaming_bulk(
        client=es_client, actions=generate_actions(movie_documents),
        chunk_size=500, raise_on_error=False
    ):
        if not ok:
            # result is {op_type: {...}}; keep the _id so the original document can be retried
            failed_ids.add(str(next(iter(result.values())).get("_id")))
        progress.update(1)
    progress.close()

    if failed_ids:
        print(f"\nRetrying {len(failed_ids)} failed docs")
        retry_docs = [doc for doc in movie_documents if str(doc["id"]) in failed_ids]
        for ok, result in streaming_bulk(
            client=es_client, actions=generate_actions(retry_docs),
            chunk_size=500, raise_on_error=False
        ):
            if not ok:
                print(f"Failed to index document: {result}")

    print("\nBulk upload complete!")

    # Refresh the index
    es_client.indices.refresh(index=index_name)
    print("Index refreshed.")

except Exception as e:
    print(f"\nAn error occurred during bulk upload: {e}")


Uploading documents to Elasticsearch... this may take a moment.


  0%|          | 0/45439 [00:00<?, ?docs/s]


Bulk upload complete!
Index refreshed.


## Review the upload result

In [44]:
# Check the number of documents in the index
try:
    count_response = es_client.count(index=index_name)
    doc_count = count_response['count']
    print(f"Found {doc_count} documents in the '{index_name}' index.")
except Exception as e:
    print(f"An error occurred while counting documents: {e}")

# Check example of a document
print("\nTo verify : Checking a document for 'The Dark Knight'")
try:
    search_response = es_client.search(
        index=index_name,
        query={
            "match": {
                "title": "The Dark Knight"
            }
        }
    )

    hits = search_response['hits']['hits']
    
    if hits:
        # 1st result
        first_hit = hits[0]['_source']
        
        # Print the data
        print(f"\nTitle: {first_hit['title']}")
        print(f"Director: {first_hit['director']}")
        print(f"Actors: {first_hit['actor']}")
        print(f"Overview: {first_hit['overview'][:100]} (and so on...)") #Print first 100 chars 
        print(f"Embedding: {first_hit['movie_embedding'][:5]}...") #Print first 5 numbers of embedding
    else:
        print("Could not find the movie we are looking for.")

except Exception as e:
    print(f"An error occurred during search: {e}")

Found 45432 documents in the 'movies' index.

To verify : Checking a document for 'The Dark Knight'

Title: The Dark Knight
Director: Christopher Nolan
Actors: ['Christian Bale', 'Michael Caine', 'Heath Ledger']
Overview: Batman raises the stakes in his war on crime. With the help of Lt. Jim Gordon and District Attorney  (and so on...)
Embedding: [0.6480031609535217, -0.1263490915298462, -0.06502999365329742, 0.13747072219848633, -0.8616973161697388]...
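The index reports 45432 documents although 45439 were sent. One plausible explanation, which is an assumption and not verified here, is that a few rows share the same `id`: since `_id` is set to the movie's `id`, a later row would overwrite an earlier one. The sketch below shows how such a check could look on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame standing in for final_df; id 2 appears twice.
docs = pd.DataFrame({"id": [1, 2, 2, 3], "title": ["A", "B", "B (alt)", "C"]})

n_rows = len(docs)
n_unique_ids = docs["id"].nunique()

# All rows that share their id with at least one other row.
dupes = docs[docs["id"].duplicated(keep=False)]

# One way to make the outcome explicit: keep the first row per id before uploading.
deduped = docs.drop_duplicates(subset="id", keep="first")

print(n_rows, n_unique_ids, len(dupes), len(deduped))
```

If the number of unique ids in the real dataframe equals the document count, the difference is explained by overwrites and not by failed uploads.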


In [45]:
# Check example of a document
print("To verify : Checking a document for 'Father of the Bride Part II'")
try:
    search_response = es_client.search(
        index=index_name,
        query={
            "match": {
                "title": "Father of the Bride Part II"
            }
        }
    )

    hits = search_response['hits']['hits']
    
    if hits:
        # 1st result
        first_hit = hits[0]['_source']
        
        # Print the data
        print(f"\nTitle: {first_hit['title']}")
        print(f"Director: {first_hit['director']}")
        print(f"Actors: {first_hit['actor']}")
        print(f"Overview: {first_hit['overview'][:105]}...") #Print first 105 chars 
        print(f"Embedding: {first_hit['movie_embedding'][:5]}...") #Print first 5 numbers of embedding
    else:
        print("Could not find the movie we are looking for.")

except Exception as e:
    print(f"An error occurred during search: {e}")

To verify : Checking a document for 'Father of the Bride Part II'

Title: Father of the Bride Part II
Director: Charles Shyer
Actors: ['Steve Martin', 'Diane Keaton', 'Martin Short']
Overview: Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnan...
Embedding: [0.40464338660240173, -0.33400699496269226, 0.2621404230594635, -0.11048413068056107, -0.9807517528533936]...
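With the index populated, a recommendation query could be built on the `movie_embedding` field. The following is a sketch under assumptions: it targets the `knn` option of the search API as available in recent Elasticsearch 8.x releases, and `build_knn_clause` is a hypothetical helper, not part of this notebook. The toy query vector has three dimensions only for readability; a real query vector must have the same 1024 dimensions as the indexed embeddings.

```python
def build_knn_clause(query_vector, k=10, num_candidates=100, exclude_id=None):
    """Build the `knn` clause for an approximate nearest-neighbour search."""
    clause = {
        "field": "movie_embedding",
        "query_vector": list(query_vector),
        "k": k,                            # number of neighbours to return
        "num_candidates": num_candidates,  # candidates considered per shard
    }
    if exclude_id is not None:
        # Keep the seed movie out of its own recommendations.
        clause["filter"] = {"bool": {"must_not": {"term": {"id": exclude_id}}}}
    return clause


knn_clause = build_knn_clause([0.1, 0.2, 0.3], k=5, num_candidates=50, exclude_id=862)
print(knn_clause)

# With a live cluster (assumption: client and server support the `knn` search option):
# response = es_client.search(index="movies", knn=knn_clause)
```

The query vector would typically be the stored embedding of a movie the user liked, or the embedding of a free-text query produced by the same model.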
