# Project Week 1: ActivityNet Video Data Preparation and Indexing

In this example we will use the ActivityNet dataset https://github.com/activitynet/ActivityNet. 

 - Select the 10 videos with the most moments.
 - Download these videos onto your computer.
 - Extract the frames for every video.
 - Read the textual descriptions of each video.
 - Index the video data in OpenSearch.

This week, you will index the video data and make it searchable with OpenSearch. Refer to the OpenSearch tutorial laboratory for guidance.

## Select videos
Download the `activity_net.v1-3.min.json` file, which contains the list of videos, from the ActivityNet GitHub repository.
Parse this file and select the 10 videos with the most moments.

In [None]:
import json
from pprint import pprint

# Each (video_id, video_info) item of data['database'] starts like:
# ('o1WPnnvs00I', {'duration': 229.86, 'subset': 'training', 'resolution': '640x480', ...

with open('activity_net.v1-3.min.json', 'r') as json_data:
    data = json.load(json_data)
    
    # 'database' maps each video_id (key) to its video_info (value)
    videos = data['database']
    
    # Sort the list by number of annotations (video moments)
    sorted_list = sorted(videos.items(), key=lambda x: len(x[1]['annotations']), reverse=True)

    # Select the top 10 videos 
    top_10_videos = sorted_list[:10]

    # Convert the list of tuples to a dictionary before dumping
    top_10_dict = {video_id: video_info for video_id, video_info in top_10_videos}

    # Check the video id and number of moments of the items in the list
    for video_id, video_info in top_10_videos:
        print(f"{video_id} - {len(video_info['annotations'])} moments")

    # print(top_10_dict.keys())  # Each key is a video id

with open('top10.json', 'w') as file:  # Use a full path if the notebook's working directory is not the project folder
    json.dump(top_10_dict, file, indent=2)





dict_keys(['duration', 'subset', 'resolution', 'url', 'annotations'])
o1WPnnvs00I - 23 moments
oGwn4NUeoy8 - 23 moments
VEDRmPt_-Ms - 20 moments
qF3EbR8y8go - 19 moments
DLJqhYP-C0k - 18 moments
t6f_O8a4sSg - 18 moments
6gyD-Mte2ZM - 18 moments
jBvGvVw3R-Q - 18 moments
PJ72Yl0B1rY - 17 moments
QHn9KyE-zZo - 17 moments
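
The output above contains ties (several videos with 18 or 17 moments). Because `sorted` is stable, which tied videos make the cut at position 10 depends on the order of the entries in the JSON file. The following is a minimal sketch, using a small made-up dictionary rather than the real file, of adding the video id as a secondary sort key so the selection is reproducible. The helper name `top_k_by_moments` is hypothetical.

```python
def top_k_by_moments(database, k):
    """Return the k (video_id, info) pairs with the most annotations.

    Ties are broken by video id (ascending), so the result does not depend
    on the order in which entries appear in the input dictionary.
    """
    return sorted(
        database.items(),
        key=lambda item: (-len(item[1]["annotations"]), item[0]),
    )[:k]


# Made-up sample with the same shape as data['database'] (not real ActivityNet entries).
sample = {
    "vid_c": {"annotations": [{}, {}]},
    "vid_a": {"annotations": [{}, {}]},
    "vid_b": {"annotations": [{}, {}, {}]},
    "vid_d": {"annotations": [{}]},
}

for video_id, info in top_k_by_moments(sample, 3):
    print(f"{video_id} - {len(info['annotations'])} moments")
```

Whether alphabetical order is a sensible tie-breaker is a design choice; the point is only that some explicit rule makes the top-10 list stable across runs.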


## Video frame extraction

PyAV provides Python bindings for the FFmpeg libraries, which power the `ffmpeg` command-line video processing tool. The example below extracts the keyframes from a sample video.

In [35]:
import av
import av.datasets

content = av.datasets.curated("pexels/time-lapse-video-of-night-sky-857195.mp4")
with av.open(content) as container:
    # Signal that we only want to look at keyframes.
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"

    for i, frame in enumerate(container.decode(stream)):
        print(frame)
        frame.to_image().save(f"night-sky.{i:04d}.jpg", quality=80)

<av.VideoFrame, pts=0 yuv420p 1280x720 at 0x7fe5544f1ea0>
<av.VideoFrame, pts=75 yuv420p 1280x720 at 0x7fe52a097100>
<av.VideoFrame, pts=150 yuv420p 1280x720 at 0x7fe52d401660>
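
The cell above works on a curated sample video. For the downloaded ActivityNet videos, the same loop could be wrapped in a function, roughly as sketched below. The folder layout (`frames/<video_id>/`) and the helper names `frame_path` and `extract_keyframes` are assumptions made for illustration; the PyAV calls are the ones already used above.

```python
from pathlib import Path


def frame_path(out_dir, video_id, index):
    """Build the output path for the index-th keyframe of a video."""
    return Path(out_dir) / video_id / f"{video_id}.{index:04d}.jpg"


def extract_keyframes(video_file, video_id, out_dir="frames"):
    """Save every keyframe of video_file as a JPEG and return how many were saved."""
    import av  # imported here so the path helper works without PyAV installed

    count = 0
    with av.open(str(video_file)) as container:
        stream = container.streams.video[0]
        # Same setting as in the cell above: decode keyframes only.
        stream.codec_context.skip_frame = "NONKEY"
        for i, frame in enumerate(container.decode(stream)):
            target = frame_path(out_dir, video_id, i)
            target.parent.mkdir(parents=True, exist_ok=True)
            frame.to_image().save(target, quality=80)
            count += 1
    return count


# Only the path helper is exercised here; extract_keyframes needs a real video file.
print(frame_path("frames", "o1WPnnvs00I", 3))
```

Keeping one sub-folder per video id makes it easy to map a frame back to its video when the frames are indexed later.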


## Video metadata

Process the video metadata provided in the `json` file and index the video data in OpenSearch.

In [21]:
## New Index Mappings for k-nn vectors and embeddings
## (embeddings are the means from the words extracted from the captions)

from opensearchpy import OpenSearch
import requests
from opensearchpy import helpers

host = 'api.novasearch.org'
port = 443

user = 'user13' 
password = 'rumoao+20' 
index_name = user # We can only have an index with the same name as our user name.

# Create the client with SSL/TLS enabled, but with certificate and hostname verification disabled.
client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = (user, password),
    use_ssl = True,
    url_prefix = 'opensearch_v2',
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False
)

# The mappings define the fields, how each one is analyzed and searched, and which similarity function scores it
index_body = {
   "settings":{
      "index":{
         "number_of_replicas":0,
         "number_of_shards":4,
         "refresh_interval":"-1",
         "knn":"true"
      }
   },
   "mappings":{
       "dynamic":      "strict", # Prevents accidental addition of new fields to the index. This way indexed documents must match the index mapping.
       "properties":{
         "video_id":{
            "type":"keyword"
         },
         "title":{
            "type":"text",
            "analyzer":"english",
            "similarity":"BM25"
         },
         "video_path":{
            "type":"text"
         },
         "duration":{
            "type":"float"
         },
         "description":{  # Text field holding the en_captions array (a list of strings) joined into a single string.
            "type":"text",
            "analyzer":"english",
            "similarity":"BM25"
         },
        "description_embedding":{
            "type":"knn_vector",
            "dimension": 768,
            "method":{
               "name":"hnsw",
               "space_type":"innerproduct", # on L2-normalized embeddings, innerproduct ranks results the same way as cosinesimil
               "engine":"faiss",
               "parameters":{
                  "ef_construction":256,
                  "m":48
               }
            }
        },
        "annotations": {
                "type": "nested",
                "properties": {
                    "segment": {"type": "float"},
                    "label": {"type": "text"},
                    "is_answer": {"type": "boolean"},
                    "confidence": {"type": "float"}
                }
        },
      }
   }
}

# Create the index with the specified mappings and settings
response = client.indices.create(index=index_name, body=index_body)

# Check if the index creation was successful
if response['acknowledged']:
    print(f"Index '{index_name}' created successfully!")
else:
    print(f"Failed to create index: {response}")

Index 'user13' created successfully!
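
The `description_embedding` field uses the `innerproduct` space. When the embeddings are L2-normalized, the inner product of two vectors equals their cosine similarity, so both give the same ranking. The sketch below checks this on toy 3-dimensional vectors; the vectors and helper names are made up for illustration, and it is an assumption that the real 768-dimensional embeddings are normalized before indexing.

```python
import math


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors; the index above expects 768 dimensions.
query = [1.0, 2.0, 2.0]
caption = [2.0, 0.0, 1.0]

inner = dot(l2_normalize(query), l2_normalize(caption))
print(math.isclose(inner, cosine(query, caption)))  # → True
```

If the embeddings are not normalized, the inner product also rewards long vectors, so the ranking can differ from cosine similarity.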


In [2]:
with open('C:/Git Repositories/MPDW-Project/top10.json', 'r') as file:
    data = json.load(file).items()

    # Print the first annotation and the duration of each selected video
    for video_id, video_info in data:
        print(f"{video_id} - {video_info['annotations'][0]}")
        print(video_info['duration'])

        # Creating the document to be indexed from the video in the dataset
        doc = {
            'video_id': video_id, # Document ID
            'title': video_info['annotations'][0]['label'], # Title (label of the first annotation)
            'video_path': video_info['url'], # Video path
            'description': "", # Filled in later from the captions dataset
            'duration': video_info['duration'],
            "annotations": video_info['annotations']
        }

o1WPnnvs00I - {'segment': [4.303033313169262, 13.626272158369328], 'label': 'Playing flauta'}
229.86
oGwn4NUeoy8 - {'segment': [37.01843986684637, 42.0338413971933], 'label': 'Playing congas'}
153.09
VEDRmPt_-Ms - {'segment': [15.568780241809671, 21.723879407176288], 'label': 'Tumbling'}
232.07999999999998
qF3EbR8y8go - {'segment': [2.865726384157407, 9.55242128052469], 'label': 'Painting'}
204.1
DLJqhYP-C0k - {'segment': [11.083851549980366, 16.62577732497055], 'label': 'Playing ten pins'}
186.968
t6f_O8a4sSg - {'segment': [14.999980897195073, 30.681779107899008], 'label': 'Skateboarding'}
218.52
6gyD-Mte2ZM - {'segment': [21.43810386973302, 32.59766478822418], 'label': 'Playing ten pins'}
188.245
jBvGvVw3R-Q - {'segment': [19.776733229329174, 23.868471138845557], 'label': 'Snatch'}
218.62
PJ72Yl0B1rY - {'segment': [10.941581903276132, 24.457653666146648], 'label': 'Beach soccer'}
206.332
QHn9KyE-zZo - {'segment': [0.01, 7.961349719294894], 'label': 'Slacklining'}
196.279
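
Because the index mapping sets `"dynamic": "strict"`, a document carrying a field that is not declared in the mapping is expected to be rejected at indexing time. A minimal sketch of catching this locally before sending anything to OpenSearch is shown below. The helper `undeclared_fields` is hypothetical, the mapping is a trimmed copy of the one defined earlier, and the sample document is abbreviated and deliberately contains two extra fields.

```python
def undeclared_fields(doc, properties, prefix=""):
    """Return dotted names of fields in doc that are not declared in properties."""
    missing = []
    for name, value in doc.items():
        path = f"{prefix}{name}"
        if name not in properties:
            missing.append(path)
            continue
        nested = properties[name].get("properties")
        if nested is None:
            continue
        # A nested field may hold one object or a list of objects.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for found in undeclared_fields(item, nested, prefix=f"{path}."):
                    if found not in missing:
                        missing.append(found)
    return missing


# Trimmed copy of the mapping, enough to illustrate the check.
properties = {
    "video_id": {"type": "keyword"},
    "duration": {"type": "float"},
    "annotations": {
        "type": "nested",
        "properties": {
            "segment": {"type": "float"},
            "label": {"type": "text"},
        },
    },
}

doc = {
    "video_id": "o1WPnnvs00I",
    "duration": 229.86,
    "subset": "training",  # not declared in the mapping
    "annotations": [{"segment": [4.3, 13.6], "label": "Playing flauta", "note": "x"}],
}

print(undeclared_fields(doc, properties))  # → ['subset', 'annotations.note']
```

An empty list means every field of the document has a declared counterpart in the mapping.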


In [None]:
# Importing the dataset and indexing it

from datasets import load_dataset

# Load the dataset, trust_remote_code=True is needed to load the dataset from the remote repository.
dataset = load_dataset('dataset-download.py', trust_remote_code=True) 

doc_list = []

index_number_id = 0 # Index number to use as document ID (0, 1, 2, ...)

with open('C:/Git Repositories/MPDW-Project/top10.json', 'r') as file:
    data = json.load(file).items()

    # Build one document per selected video
    for video_id, video_info in data:
        # Creating the document to be indexed from the video in the dataset
        doc = {
            'video_id': video_id, # Document ID
            'title': video_info['annotations'][0]['label'], # Title
            'video_path': video_info['url'], # Video path
            'description': "",
            'duration': video_info['duration'],
            "annotations": video_info['annotations']
        }

        doc_list.append(doc)
        #print(len(doc_list))

for split in ['train', 'test', 'validation']:
    for video in dataset[split]:
        # The captions dataset uses ids of the form v_<key>; strip only the leading prefix
        raw_id = video['video_id']
        caption_video_id = raw_id[2:] if raw_id.startswith("v_") else raw_id

        # Iterate through the documents in doc_list
        for doc in doc_list:
            if doc['video_id'] == caption_video_id:  # Check if the video_id matches
                print("doc found")

                # Combine all the captions into the description
                description = ""
                for caption in video['en_captions']:
                    description += f" {caption}"

                # Update the document's description
                doc['description'] = description

                # Debugging print to confirm update
                print(f"Updated description for video_id: {doc['video_id']}")

# Make sure all the fields are filled
#print(doc_list)
#print(len(doc_list)) 

for doc in doc_list:
    response = client.index(index=index_name, id=index_number_id, body=doc)

    print(response)

    index_number_id += 1


doc found
Updated description for video_id: DLJqhYP-C0k
doc found
Updated description for video_id: qF3EbR8y8go
doc found
Updated description for video_id: oGwn4NUeoy8
doc found
Updated description for video_id: VEDRmPt_-Ms
doc found
Updated description for video_id: o1WPnnvs00I
doc found
Updated description for video_id: jBvGvVw3R-Q
doc found
Updated description for video_id: t6f_O8a4sSg
doc found
Updated description for video_id: 6gyD-Mte2ZM
doc found
Updated description for video_id: QHn9KyE-zZo
doc found
Updated description for video_id: PJ72Yl0B1rY
doc found
Updated description for video_id: t6f_O8a4sSg
doc found
Updated description for video_id: 6gyD-Mte2ZM
doc found
Updated description for video_id: QHn9KyE-zZo
doc found
Updated description for video_id: PJ72Yl0B1rY
{'_index': 'user13', '_id': '0', '_version': 1, 'result': 'created', '_shards': {'total': 1, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': 'user13', '_id': '1', '_version': 1, 'result': 
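
The cell above sends one request per document, and the `helpers` module imported earlier is not used. A minimal sketch of preparing bulk actions instead is shown below. It assumes it is acceptable to use the `video_id` as the document `_id` (the cell above uses a running counter); `bulk_actions` is a hypothetical helper and the two documents are abbreviated stand-ins for `doc_list`.

```python
def bulk_actions(index_name, docs):
    """Yield one bulk 'index' action per document, keyed by its video_id."""
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": doc["video_id"],
            "_source": doc,
        }


# Two abbreviated documents standing in for doc_list.
sample_docs = [
    {"video_id": "o1WPnnvs00I", "title": "Playing flauta", "duration": 229.86},
    {"video_id": "oGwn4NUeoy8", "title": "Playing congas", "duration": 153.09},
]

actions = list(bulk_actions("user13", sample_docs))
print([action["_id"] for action in actions])
```

With a live client, these actions could be passed to `helpers.bulk(client, actions)`. Since the index was created with `"refresh_interval": "-1"`, automatic refreshes are disabled, so a call such as `client.indices.refresh(index=index_name)` is likely needed before the documents show up in search results.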

In [23]:
response = client.get(id = 0, index=index_name)
print(response)

{'_index': 'user13', '_id': '0', '_version': 1, '_seq_no': 0, '_primary_term': 1, 'found': True, '_source': {'video_id': 'o1WPnnvs00I', 'title': 'Playing flauta', 'video_path': 'https://www.youtube.com/watch?v=o1WPnnvs00I', 'description': ' A man is playing a flute in front of a microphone.  A few other men are shown playing guitars as they sit.  The group plays for the audience, occasionally zooming in on individuals.  One man is playing drums while the others are on flute and guitar.  The lights move fluidly as they crescendo, and they screen goes black.', 'duration': 229.86, 'annotations': [{'segment': [4.303033313169262, 13.626272158369328], 'label': 'Playing flauta'}, {'segment': [17.92930547153859, 24.025269212168485], 'label': 'Playing flauta'}, {'segment': [28.68688861154446, 35.50002465678627], 'label': 'Playing flauta'}, {'segment': [39.085885733229325, 40.52023016380655], 'label': 'Playing flauta'}, {'segment': [45.89902177847114, 48.767710639625584], 'label': 'Playing flaut

## Video captions

The ActivityNetCaptions dataset (https://cs.stanford.edu/people/ranjaykrishna/densevid/) provides textual descriptions of each video. Index the video captions in a text field of your OpenSearch index.
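
As a minimal sketch of preparing that text field, the helpers below strip the `v_` prefix from a caption record's id and join its captions into one string. The record shape (`video_id`, `en_captions`) follows the cell that loads the captions dataset above; the helper names are hypothetical and the record is abbreviated.

```python
def clean_video_id(raw_id):
    """Strip the leading 'v_' used by the captions dataset, if present."""
    return raw_id[2:] if raw_id.startswith("v_") else raw_id


def captions_to_description(captions):
    """Join caption sentences into one string separated by single spaces."""
    return " ".join(caption.strip() for caption in captions if caption.strip())


# Abbreviated record shaped like the caption entries used above.
record = {
    "video_id": "v_o1WPnnvs00I",
    "en_captions": [
        " A man is playing a flute in front of a microphone.",
        " A few other men are shown playing guitars as they sit.",
    ],
}

print(clean_video_id(record["video_id"]))
print(captions_to_description(record["en_captions"]))
```

Stripping each caption before joining avoids the leading and doubled spaces visible in the indexed `description` shown earlier.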