# Dataset: embedded_movies_small

https://huggingface.co/datasets/acloudfan/embedded_movies_small

This dataset was created from the Hugging Face dataset **AIatMongoDB/embedded_movies**.

**Why was it needed?**

1. The original dataset is close to 25 GB, which is overkill for learning and experiments.
2. The data needs to be cleaned up, e.g., some features are None and require extra care.
3. Some of the embeddings are missing.

**How to use?**
* Sentiment analysis
* Text similarity (on the plot text)
* Embeddings: ready to use with vector databases and search libraries
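
As a rough illustration of the similarity use case, the sketch below ranks rows by cosine similarity between a query vector and each row's plot_embedding. It uses toy three-dimensional vectors and made-up titles rather than real rows (real embeddings are much longer), and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Return (score, title) pairs for the k rows most similar to query."""
    scored = [
        (cosine_similarity(query, row["plot_embedding"]), row["title"])
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-ins for dataset rows (hypothetical titles and vectors)
toy_rows = [
    {"title": "Movie A", "plot_embedding": [1.0, 0.0, 0.0]},
    {"title": "Movie B", "plot_embedding": [0.9, 0.1, 0.0]},
    {"title": "Movie C", "plot_embedding": [0.0, 1.0, 0.0]},
]

results = top_k([1.0, 0.0, 0.0], toy_rows, k=2)
for score, title in results:
    print(f"{title}: {score:.4f}")
```

For more than a few thousand rows, a vector database or a search library would normally replace this linear scan.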

**Details**

* Embeddings were generated on the fullplot column.
* This dataset contains details on movies with genres of Western, Action, or Fantasy.
* Each document contains a single movie and information such as its title, release year, and cast. In addition, documents in this collection include a plot_embedding field that contains embeddings created using OpenAI's text-embedding-ada-002.

## Setup environment

*Change the keys file location if you would like to create your own version.*
*Do not forget to change the Hugging Face dataset name in the last step.*

In [1]:
from dotenv import load_dotenv
import os

import warnings

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
# CHANGE THIS TO YOUR ENV FILE LOCATION
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

True
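
A return value of True from load_dotenv only indicates that at least one variable was set, not that a particular key is present. A minimal sketch of an explicit check follows; the helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and stand in for whatever keys your own .env file defines.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check the path passed to load_dotenv"
        )
    return value


# Demonstration only: set a placeholder so the sketch runs on its own.
# In the notebook, the value would come from the .env file instead.
os.environ["DEMO_API_KEY"] = "placeholder-value"

api_key = require_env("DEMO_API_KEY")
print("key loaded, length:", len(api_key))
```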

## 1. Load, Split, and Cleanup dataset

In [2]:
from datasets import load_dataset, Dataset, DatasetDict
import pandas as pd
import numpy as np

movie_db = "AIatMongoDB/embedded_movies"

# Downloads the dataset to local cache
docs = load_dataset(movie_db, split="train")  

# Split dataset
ds_split = docs.train_test_split(test_size=0.3)
ds_split

DatasetDict({
    train: Dataset({
        features: ['rated', 'writers', 'runtime', 'num_mflix_comments', 'title', 'cast', 'plot', 'directors', 'type', 'fullplot', 'languages', 'awards', 'imdb', 'plot_embedding', 'metacritic', 'countries', 'genres', 'poster'],
        num_rows: 1050
    })
    test: Dataset({
        features: ['rated', 'writers', 'runtime', 'num_mflix_comments', 'title', 'cast', 'plot', 'directors', 'type', 'fullplot', 'languages', 'awards', 'imdb', 'plot_embedding', 'metacritic', 'countries', 'genres', 'poster'],
        num_rows: 450
    })
})

In [3]:
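def convert_to_df_cleanup_dataset(ds):
    # Convert the split to a DataFrame and drop rows that are missing
    # the embedding, the full plot, or the genres
    df = pd.DataFrame(ds)
    df = df.dropna(subset=['plot_embedding', 'fullplot', 'genres'])
    return df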

## 2. Create the dataset

In [4]:
dataset_train_split = Dataset.from_pandas(convert_to_df_cleanup_dataset(ds_split['train']), split='train')
dataset_test_split = Dataset.from_pandas(convert_to_df_cleanup_dataset(ds_split['test']), split='test')

ds = DatasetDict()
ds['train'] = dataset_train_split
ds['test'] = dataset_test_split

ds

DatasetDict({
    train: Dataset({
        features: ['rated', 'writers', 'runtime', 'num_mflix_comments', 'title', 'cast', 'plot', 'directors', 'type', 'fullplot', 'languages', 'awards', 'imdb', 'plot_embedding', 'metacritic', 'countries', 'genres', 'poster', '__index_level_0__'],
        num_rows: 1017
    })
    test: Dataset({
        features: ['rated', 'writers', 'runtime', 'num_mflix_comments', 'title', 'cast', 'plot', 'directors', 'type', 'fullplot', 'languages', 'awards', 'imdb', 'plot_embedding', 'metacritic', 'countries', 'genres', 'poster', '__index_level_0__'],
        num_rows: 434
    })
})
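
The extra `__index_level_0__` feature in the output above appears because dropna leaves gaps in the DataFrame index, and Dataset.from_pandas stores a non-default index as a column. The sketch below shows the effect on the index using a toy DataFrame with made-up values; only the column names come from the dataset. Resetting the index before conversion (or passing `preserve_index=False` to `Dataset.from_pandas`) is one way to avoid the extra column, if you are building your own version.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 1 and 2 each miss a required field
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "fullplot": ["plot a", None, "plot c", "plot d"],
    "genres": [["Action"], ["Western"], None, ["Fantasy"]],
    "plot_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
})

# Same cleanup as in the notebook: drop rows with missing required fields
cleaned = df.dropna(subset=["plot_embedding", "fullplot", "genres"])
print(list(cleaned.index))

# Rebuild a contiguous 0..n-1 index so no index column is carried along
cleaned_reset = cleaned.reset_index(drop=True)
print(list(cleaned_reset.index))
```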

## 3. Upload to hub

*Change the name, if you would like to create your own version of the dataset*

In [5]:
ds_name='acloudfan/embedded_movies_small'

In [6]:
ds.push_to_hub(ds_name)

CommitInfo(commit_url='https://huggingface.co/datasets/acloudfan/embedded_movies_small/commit/a691b5f445035dc719d27c6fbe93b67b127447d1', commit_message='Upload dataset', commit_description='', oid='a691b5f445035dc719d27c6fbe93b67b127447d1', pr_url=None, pr_revision=None, pr_num=None)