<a href="https://colab.research.google.com/github/bongjoonsiong/Machine-Learning-Models/blob/main/Vector_Based_Movie_Recommendation_System_Using_Qdrant_DB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. **Introduction**
    - Movie recommendation systems have evolved from traditional methods to advanced machine learning and vector databases.
    - Focus on using Qdrant DB for efficient storage and recommendation of thousands of video files.

2. **Traditional Recommender System**
    - Uses machine learning algorithms like SVMs to predict user preferences.
    - Three main types:
        1. Collaborative Filtering: Collects user preferences to predict interests.
        2. Content-Based Filtering: Recommends items based on attributes and past interactions.
        3. Hybrid Systems: Combine both approaches to improve effectiveness and address challenges like cold start and data sparsity.

3. **Challenges in Traditional Systems**
    - Cold start problem and data sparsity.
    - Ethical considerations and scalability.
    - Integration of contextual information.

4. **Entry of Vector Databases**
    - Useful for efficient similarity searches in recommendation systems.
    - Represent movies as vectors in high-dimensional space to find similar movies.
    - Use distance metrics like cosine similarity or Euclidean distance.

5. **Scalability of Vector Databases**
    - Designed to handle large-scale data with high query performance.
    - Essential for large streaming platforms with extensive libraries and user bases.

6. **Qdrant Database**
    - Uses fast approximate nearest neighbor search, specifically the HNSW algorithm with cosine similarity search.
    - Suitable for large-scale movie recommendation systems.

7. **Recommender System Architecture with Vector Databases**
    - **Candidate Generation**
        1. Initial filtering based on language or accent (heuristic filtering).
        2. Transcribe each video's audio track with a speech-to-text tool such as Whisper or SpeechRecognition, then convert the transcript into a text embedding.
        3. Store textual embeddings as vectors in Qdrant database.
        4. Retrieve similar videos using cosine similarity search in Qdrant.
    - **Re-Ranking**
        1. Arrange movies based on sentiments in textual information.
        2. Use large language models to obtain opinion scores.
        3. Re-rank movies for recommendations based on opinion scores.
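
The two stages outlined above can be sketched with plain NumPy before a database is involved. The vectors, titles, and opinion scores below are made-up toy values, and `top_k_cosine` is a hypothetical helper rather than a Qdrant function; the sketch only illustrates candidate generation by cosine similarity followed by re-ranking on a separate score.

```python
import numpy as np

def top_k_cosine(query, vectors, k):
    """Return (indices, scores) of the k rows most similar to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity of every row with the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Toy catalogue: four "movies" with 2-dimensional embeddings (illustrative only)
titles = ["A", "B", "C", "D"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.7, 0.7],
])
opinion = np.array([0.2, 0.9, 0.5, 0.6])  # hypothetical opinion scores

# Stage 1: candidate generation by cosine similarity
query = np.array([1.0, 0.0])
cand_idx, cand_scores = top_k_cosine(query, vectors, k=3)

# Stage 2: re-rank the candidates by opinion score, highest first
reranked = sorted(cand_idx.tolist(), key=lambda i: opinion[i], reverse=True)

print("candidates:", [titles[i] for i in cand_idx])
print("re-ranked: ", [titles[i] for i in reranked])
```

In the notebook, Qdrant takes over the first stage; the second stage stays a simple sort over whatever score is chosen.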

## Code Implementation

In [None]:
# Install the required libraries (textblob is used later for sentiment scoring)
!pip install -q torch
!pip install -q openai moviepy
!pip install -q SpeechRecognition textblob
!pip install -q transformers
!pip install -q datasets
!pip install -q qdrant_client

In [None]:
#Import all the necessary Packages:
import os
import moviepy.editor as mp
import glob
import speech_recognition as sr
import csv
import numpy as np
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http import models
from transformers import AutoModel, AutoTokenizer
import torch


In [None]:
#Create a directory to keep all audio transcriptions.

# specify your path
path = "/content/my_directory"

# create directory
os.makedirs(path, exist_ok=True)

## Code to convert the video to textual information

In [None]:
#directory containing video files
source_videos_file_path = r"/content/drive/MyDrive/qdrant_videos"

#directory for storing audio files
destination_audio_files_path = r"/content/my_directory/audios"

# CSV file for storing transcripts
csv_file_path = r"/content/my_directory/transcripts.csv"

# Create the destination directory if it doesn't exist
os.makedirs(destination_audio_files_path, exist_ok=True)

# Initialize recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Open the CSV file in write mode
with open(csv_file_path, 'w', newline='') as csvfile:
    # Create a CSV writer
    writer = csv.writer(csvfile)
    # Write the header row
    writer.writerow(["Video File", "Transcript"])

    # Process each .mp4 file in the source directory
    for video_file in glob.glob(os.path.join(source_videos_file_path, '*.mp4')):
        # Convert video to audio
        video_clip = mp.VideoFileClip(video_file)
        audio_file_path = os.path.join(destination_audio_files_path, os.path.basename(video_file).replace("'", "").replace(" ", "_") + '.wav')
        video_clip.audio.write_audiofile(audio_file_path)
        video_clip.close()  # release the file handle held by moviepy

        # Transcribe audio to text
        with sr.AudioFile(audio_file_path) as source:
            # read the entire audio file (listen() would stop at the first pause)
            audio_text = r.record(source)
            # convert speech to text
            try:
                transcript = r.recognize_google(audio_text)
            except sr.UnknownValueError:
                print("Google Speech Recognition could not understand audio")
                transcript = "Error: Could not understand audio"
            except sr.RequestError as e:
                print("Could not request results from Google Speech Recognition service; {0}".format(e))
                transcript = "Error: Could not request results from Google Speech Recognition service; {0}".format(e)

        # Write the transcript to the CSV file
        writer.writerow([video_file, transcript])
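
If a trailer turns out to be too long for a single recognition request to succeed, one possible workaround is to transcribe it in fixed windows. The sketch below only computes the (offset, duration) pairs; `chunk_windows` is a hypothetical helper, and the 30-second default is an arbitrary assumption, not a documented limit of any service.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Split a clip of `total_seconds` into consecutive (offset, duration) windows."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        # The last window is shortened so it does not run past the end of the clip
        duration = min(chunk_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows

# A 75-second clip yields two full windows and one 15-second remainder
print(chunk_windows(75))
```

Each pair could then be passed to `r.record(source, offset=offset, duration=duration)` on a freshly opened `sr.AudioFile`, and the partial transcripts joined afterwards.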

## Transcripts in dataframe format

In [None]:
data = pd.read_csv('/content/my_directory/transcripts.csv')
data.head()

### SpeechRecognition could not transcribe some of the videos, so we remove those rows from the dataframe.

In [None]:
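# Drop rows with a missing transcript, then rows whose transcript is an error message
data = data.dropna(subset=['Transcript'])
data = data[~data['Transcript'].str.startswith('Error')]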
data.head()

### Create a QdrantClient instance with an in-memory database

In [None]:
client = QdrantClient(":memory:")

### Create a collection to store the vector embeddings, with distances measured by cosine similarity

In [None]:
my_collection = "text_collection"
client.recreate_collection(
    collection_name=my_collection,
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
)
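
With `Distance.COSINE`, ranking depends only on the direction of the vectors, not on their length. The small NumPy check below uses random toy vectors (not data from this notebook) to illustrate two standard identities: cosine similarity equals the dot product of the unit-normalised vectors, and for unit vectors the squared Euclidean distance is `2 - 2 * cos`, so both metrics produce the same neighbour ordering once vectors are normalised.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)  # toy vectors with the same size as the collection
b = rng.normal(size=768)

# Cosine similarity from its definition
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The same quantity as a dot product of unit vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = a_unit @ b_unit

# Squared Euclidean distance between the unit vectors
sq_dist = np.sum((a_unit - b_unit) ** 2)

print(np.isclose(cos, dot_unit), np.isclose(sq_dist, 2 - 2 * cos))
```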

### Use a pre-trained model to turn each transcript into an embedding. We do this with the transformers library and the GPT-2 model, whose hidden size of 768 matches the vector size of the collection.

In [None]:
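device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# GPT-2 ships without a padding token; reuse the end-of-sequence token so padding=True works
tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained('gpt2')  # add .to(device) to run on GPU; inputs must then be moved to the same device
model.eval()  # inference mode: disables dropout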

### Extract the movie names into a new column so that each embedding can be traced back to its movie.

In [None]:
def extract_movie_name(file_path):
    file_name = file_path.split("/")[-1]  # Get the last part of the path
    movie_name = file_name.replace(".mp4", "").strip()
    return movie_name

# Apply the function to create the new column
data['Movie_Name'] = data['Video File'].apply(extract_movie_name)

# Display the DataFrame
data[['Video File', 'Movie_Name', 'Transcript']]
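
As an alternative to splitting the path by hand, the file name can be taken from the path with the standard library. This is a sketch, not the notebook's method: `extract_movie_name_pathlib` is a hypothetical name, the example paths are illustrative, and `PurePosixPath` assumes forward-slash paths like the ones used on Colab.

```python
from pathlib import PurePosixPath

def extract_movie_name_pathlib(file_path):
    """Return the file name without its directory or final extension."""
    # .stem drops the directory and the last suffix, whatever its case (.mp4, .MP4, ...)
    return PurePosixPath(file_path).stem.strip()

name = extract_movie_name_pathlib(
    "/content/drive/MyDrive/qdrant_videos/Star Wars_ The Last Jedi Trailer (Official).mp4"
)
print(name)
```

Unlike `replace(".mp4", "")`, this only removes the final extension and leaves any other occurrence of `.mp4` inside the name untouched.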

### Create a helper function that computes the embedding of each movie trailer transcript.

In [None]:
def get_embeddings(row):
    # Reuse the tokenizer and model loaded above instead of reloading them for every row.
    # A single transcript needs no padding, so no padding token is required here.
    inputs = tokenizer(row['Transcript'], truncation=True, max_length=128, return_tensors="pt")

    # Disable gradient computation for the following operations.
    with torch.no_grad():
      # Mean-pool the token embeddings into one vector of shape (1, 768)
      outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()

    # Return the computed embeddings.
    return outputs

### Apply the embedding function to the dataset, then save the embeddings so that they don't have to be computed again.

In [None]:
data['embeddings'] = data.apply(get_embeddings, axis=1)
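# Stack the per-row embeddings into one (n_movies, 768) float array before saving to vectors.npy
np.save("vectors", np.vstack(data['embeddings'].to_list()))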

### Now, create a payload with metadata for each movie transcript.

In [None]:
# Keep only JSON-friendly metadata in the payload; the vectors themselves are stored separately
payload = data[['Transcript', 'Movie_Name']].to_dict(orient="records")


### Create a helper function for mean pooling of token embeddings. Then loop through each transcript in the transcript column to create text embeddings.

In [None]:
# Set the expected size for the vector embeddings
expected_vector_size = 768

# Define a function for mean pooling of token embeddings
def mean_pooling(model_output, attention_mask):
    # Extract token embeddings from the model output
    token_embeddings = model_output[0]

    # Expand the attention mask to match the size of token embeddings
    input_mask_expanded = (attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float())

    # Calculate the sum of token embeddings, considering the attention mask
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)

    # Calculate the sum of the attention mask (clamped to avoid division by zero)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Return the mean-pooled embeddings
    return sum_embeddings / sum_mask

# Initialize a list to store text embeddings
text_embeddings = []

# Loop through each transcript in the 'Transcript' column of the 'data' variable
for transcript in data['Transcript']:
    # Tokenize the transcript with truncation and return PyTorch tensors (a single sequence needs no padding)
    inputs = tokenizer(transcript, truncation=True, max_length=128, return_tensors="pt")

    # Perform inference using the model with the tokenized inputs
    with torch.no_grad():
        embs = model(**inputs)

    # Calculate mean-pooled embeddings using the defined function
    embedding = mean_pooling(embs, inputs["attention_mask"])

    # Keep at most expected_vector_size dimensions (this only trims; GPT-2 already outputs 768)
    embedding = embedding[:, :expected_vector_size]

    # Append the resulting embedding to the list
    text_embeddings.append(embedding)
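
To see why the attention mask matters in mean pooling, here is a minimal NumPy sketch of the same computation on a toy batch. The numbers are made up, and `masked_mean` is a hypothetical stand-in for the PyTorch `mean_pooling` function above; the third token plays the role of padding.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# One sequence of three tokens with 2-dimensional embeddings; the last token is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean(tokens, mask)   # averages only the first two tokens
naive = tokens.mean(axis=1)          # also averages the padding token

print(pooled)
print(naive)
```

The masked result is the mean of the two real tokens, while the naive mean is pulled far away by the padding vector.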

### Assign each transcript an explicit ID within the Qdrant collection: create a list of IDs, then upsert the IDs, vectors, and payloads together.

In [None]:
ids = list(range(len(data)))

# Convert each (1, 768) PyTorch tensor to a flat list of floats
text_embeddings_list = [emb.squeeze(0)[:expected_vector_size].tolist() for emb in text_embeddings]

client.upsert(collection_name=my_collection,
              points=models.Batch(
                  ids=ids,
                  vectors=text_embeddings_list,
                  payloads=payload
                  )
              )
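
For a library of thousands of videos, sending everything in one request may not be practical. The sketch below shows one way to slice the three parallel lists into smaller batches; `iter_batches` is a hypothetical helper, not part of `qdrant_client`, and the batch size is an arbitrary assumption.

```python
def iter_batches(ids, vectors, payloads, batch_size):
    """Yield (ids, vectors, payloads) slices of at most `batch_size` items each."""
    if not (len(ids) == len(vectors) == len(payloads)):
        raise ValueError("ids, vectors and payloads must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], vectors[start:end], payloads[start:end]

# Toy data standing in for the real IDs, embeddings, and payloads
toy_ids = [0, 1, 2, 3, 4]
toy_vectors = [[0.1, 0.2]] * 5
toy_payloads = [{"Movie_Name": f"movie_{i}"} for i in toy_ids]

batches = list(iter_batches(toy_ids, toy_vectors, toy_payloads, batch_size=2))
print([len(batch_ids) for batch_ids, _, _ in batches])
```

Each yielded triple could be passed to `client.upsert(...)` through `models.Batch(ids=..., vectors=..., payloads=...)` in the same way as the single call above.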

### Use a **sentiment analysis model** to score each transcript. TextBlob returns a sentiment polarity between -1 and 1, where -1 indicates negative sentiment, 0 neutral sentiment, and 1 positive sentiment.

In [None]:
from textblob import TextBlob

def calculate_sentiment_score(text):
    # Create a TextBlob object
    blob = TextBlob(text)

    # Get the sentiment polarity (-1 to 1, where -1 is negative, 0 is neutral, and 1 is positive)
    sentiment_score = blob.sentiment.polarity

    return sentiment_score

# Example usage:
text_example = data['Transcript'].iloc[0]
sentiment_score_example = calculate_sentiment_score(text_example)
print(f"Sentiment Score: {sentiment_score_example}")

### For this example, the resulting sentiment score is 0.75. Now apply the sentiment helper function to every transcript in the `data` dataframe.

In [None]:
data['Sentiment Score'] = data['Transcript'].apply(calculate_sentiment_score)
data.head()

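### Average the vector embeddings of each movie transcript and combine the result with the sentiment score to get the final opinion score. Because the sentiment score is a single number, it is added to every component of the vector, so the opinion score is itself a 768-dimensional vector.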

In [None]:
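# Collapse each stored (1, 768) embedding to a single 768-dimensional vector
data['avg_embeddings'] = data['embeddings'].apply(lambda x: np.mean(x, axis=0))
# Weighted blend; the column created above is named 'Sentiment Score'
data['Opinion_Score'] = 0.7 * data['avg_embeddings'] + 0.3 * data['Sentiment Score']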

### In the above code, the embeddings get the larger weight because they capture the semantic content of each transcript, and therefore the similarity between movies. The "Sentiment Score" column captures the emotional tone of the transcript and gets the smaller weight, on the assumption that tone matters less than content for the overall opinion score. The 0.7 and 0.3 weights are arbitrary, much like the ratio chosen when splitting a dataset into train and test sets, and can be tuned.
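
To make the arithmetic concrete, here is a worked example with a toy 3-dimensional embedding standing in for the 768-dimensional one. The embedding values and sentiment scores are made up for illustration.

```python
import numpy as np

embedding = np.array([0.2, -0.4, 1.0])  # toy averaged embedding
positive_sentiment = 0.5                # toy sentiment polarity
negative_sentiment = -0.5

# The scalar sentiment is broadcast, so 0.3 * sentiment is added to every component
opinion_positive = 0.7 * embedding + 0.3 * positive_sentiment
opinion_negative = 0.7 * embedding + 0.3 * negative_sentiment

print(opinion_positive)  # components: 0.29, -0.13, 0.85
print(opinion_negative)  # components: -0.01, -0.43, 0.55
```

Two transcripts with the same embedding but different sentiment end up as vectors shifted by a constant offset, which is how sentiment influences the later similarity search.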

### Create a movie recommender function that takes a movie name and returns the desired number of recommended movies.

In [None]:
def get_recommendations(movie_name, limit=3):
    # Find the row corresponding to the given movie name
    query_row = data[data['Movie_Name'] == movie_name]

    if not query_row.empty:
      # Stack the per-movie 'Opinion_Score' vectors into an (n_movies, 768) array
      opinion_scores_array = np.vstack(data['Opinion_Score'].to_list())
      # Positional IDs, matching the IDs used for the earlier upsert
      opinion_scores_ids = list(range(len(data)))

      # Upsert the 'Opinion_Score' vectors to the Qdrant collection.
      # Note: this overwrites the transcript embeddings stored under the same IDs.
      client.upsert(
          collection_name=my_collection,
          points=models.Batch(
              ids=opinion_scores_ids,
              vectors=opinion_scores_array.tolist(),
              payloads=[{"Movie_Name": name} for name in data['Movie_Name']]
              )
          )
      # Use the query movie's own opinion-score vector as the query vector
      query_position = data['Movie_Name'].tolist().index(movie_name)
      query_opinion_score = opinion_scores_array[query_position]

      # Perform a similarity search; request one extra hit because the query movie matches itself
      search_results = client.search(
          collection_name=my_collection,
          query_vector=query_opinion_score.tolist(),
          limit=limit + 1)

      # Extract movie recommendations from search results, skipping the query movie
      recommended_movie_ids = [result.id for result in search_results if result.id != query_position][:limit]
      # IDs are row positions, so use iloc (the index has gaps after filtering)
      recommended_movies = data.iloc[recommended_movie_ids]

      # Display recommended movies
      print("Recommended Movies:")
      print(recommended_movies[['Movie_Name', 'Opinion_Score']])
    else:
      print(f"Movie '{movie_name}' not found in the dataset.")

# Example usage:
get_recommendations("Star Wars_ The Last Jedi Trailer (Official)")

### This is how the Qdrant database can be used to build a movie recommender system.
Vector databases have many use cases. Movie recommendation is one of them: cosine similarity search finds candidate movies with similar content, and language models supply the scores used to re-rank them.