# **Extract Unique Video IDs from S3 Files**
This Google Colab notebook lists the objects in an Amazon S3 bucket, derives the YouTube video ID from each object key, and saves the unique IDs to a new CSV file. The workflow includes the following steps:

*   **Connect to S3:** The notebook connects to the S3 bucket and pages through every object under the `chunks/` prefix.
*   **Extract YouTube Video IDs:** It takes the file name from each object key (for example `--6UTyrB6Yw|108|128.mp3`) and writes the names, with an object count for each, to `unique_video_ids.csv`.
*   **Remove Duplicates:** It keeps the part of each file name before the first `|`, which is the video ID, and collects the IDs in a set so that each appears only once.
*   **Save Cleaned Data:** The unique video IDs are sorted and saved to `clean_video_ids.csv`.
*   **Download the Cleaned CSV:** Once processing finishes, the cleaned CSV file is offered for download.
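The extraction step can be sketched as a small helper. This is a minimal sketch, assuming object keys look like `chunks/--6UTyrB6Yw|108|128.mp3`, as the sample output in this notebook suggests; the function name `video_id_from_key` is hypothetical and does not appear in the notebook.

```python
def video_id_from_key(key, prefix='chunks/'):
    """Return the bare video ID from an S3 object key, or None if it does not match."""
    if not key.startswith(prefix):
        return None
    # First path segment after the prefix, i.e. the chunk file name
    file_name = key[len(prefix):].split('/', 1)[0]
    if not file_name:
        return None
    # Assumed layout: the video ID is everything before the first '|'
    return file_name.split('|', 1)[0]


print(video_id_from_key('chunks/--6UTyrB6Yw|108|128.mp3'))
```

Doing the split at listing time would make the intermediate CSV hold one row per video rather than one row per chunk file.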

<br>

**Instructions to Run in Google Colab:**

*   **Set Bucket Names:** Specify the S3 bucket and key prefix to scan by editing `bucket_name` and `prefix` in the notebook, and replace the placeholder credentials with your own.
*   **Run the Cells:** Run each cell in order by pressing Shift + Enter or clicking the Run button. The notebook lists the objects in the S3 bucket, extracts the YouTube video IDs, removes duplicates, and saves the cleaned list to a new CSV file.
*   **Download Cleaned Data:** After processing, the cleaned CSV file (`clean_video_ids.csv`) is generated and the notebook starts a browser download of the file containing the unique video IDs.
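One way to avoid editing credentials directly into the notebook is to read the settings from environment variables. The sketch below is illustrative only: `load_s3_settings`, `S3_BUCKET`, and `S3_PREFIX` are hypothetical names introduced here, while `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names. The defaults mirror the values used in this notebook.

```python
import os


def load_s3_settings(env=os.environ):
    """Collect bucket settings and optional credentials from a mapping of variables."""
    settings = {
        'bucket_name': env.get('S3_BUCKET', 'yt-chunk-mp3'),  # hypothetical variable
        'prefix': env.get('S3_PREFIX', 'chunks/'),            # hypothetical variable
    }
    client_kwargs = {}
    access_key = env.get('AWS_ACCESS_KEY_ID')
    secret_key = env.get('AWS_SECRET_ACCESS_KEY')
    # Pass explicit keys only when both are present
    if access_key and secret_key:
        client_kwargs['aws_access_key_id'] = access_key
        client_kwargs['aws_secret_access_key'] = secret_key
    return settings, client_kwargs


settings, client_kwargs = load_s3_settings()
print(settings)
```

The returned `client_kwargs` could then be passed along as `boto3.client('s3', **client_kwargs)`; when it is empty, boto3 falls back to its default credential lookup.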

In [None]:
!pip install git+https://github.com/yt-dlp/yt-dlp.git -q
!pip install boto3 -q


In [None]:
import boto3
import os
import csv
from collections import defaultdict

# Replace the placeholder values with your own credentials; never commit real keys.
s3 = boto3.client(
    's3',
    aws_access_key_id='key',
    aws_secret_access_key='key'
)

bucket_name = 'yt-chunk-mp3'
prefix = 'chunks/'

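# list_objects_v2 returns at most 1,000 keys per call, so use a paginator.
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

# Number of objects seen for each second path segment of the key
video_id_counts = defaultdict(int)
total_objects = 0
unique_video_ids = set()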

print("Retrieving objects from S3...")

for page in pages:
    if 'Contents' in page:
        # Iterate over the objects in this page
        for obj in page['Contents']:
            total_objects += 1
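            file_path = obj['Key']

            # Keys look like 'chunks/--6UTyrB6Yw|108|128.mp3', so parts[1] is the
            # full chunk file name; the next cell reduces it to the bare video ID.
            parts = file_path.split('/')
            if len(parts) > 1:
                video_id = parts[1]
                video_id_counts[video_id] += 1
                unique_video_ids.add(video_id)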

print(f"Processed {total_objects} objects in S3")
print(f"Found {len(unique_video_ids)} unique video IDs")


csv_filename = 'unique_video_ids.csv'
with open(csv_filename, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)

    # Write header
    csv_writer.writerow(['video_id', 'file_count'])

    # Write one row per distinct name with its object count
    for video_id, count in video_id_counts.items():
        csv_writer.writerow([video_id, count])

print(f"CSV file '{csv_filename}' has been created with {len(unique_video_ids)} unique video IDs")

try:
    from google.colab import files
    files.download(csv_filename)
    print(f"Download initiated for {csv_filename}")
except ImportError:
    print(f"File saved at: {os.path.abspath(csv_filename)}")

Retrieving objects from S3...
Processed 2407044 objects in S3
Found 2407044 unique video IDs
CSV file 'unique_video_ids.csv' has been created with 2407044 unique video IDs


Download initiated for unique_video_ids.csv
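
The object count and the "unique" count above are equal because the second path segment of each key is a whole chunk file name, not a bare video ID. As a sketch of how chunks per video could be counted in a single pass instead, assuming the same `chunks/<video_id>|...` key layout seen in this notebook's output (the helper name `count_chunks_per_video` and the third sample key are hypothetical):

```python
from collections import Counter


def count_chunks_per_video(keys):
    """Count chunk files per bare video ID from an iterable of S3 object keys."""
    counts = Counter()
    for key in keys:
        parts = key.split('/')
        if len(parts) < 2 or not parts[1]:
            continue  # skip keys that have no file name under the prefix
        # Assumed layout: the video ID is everything before the first '|'
        video_id = parts[1].split('|', 1)[0]
        counts[video_id] += 1
    return counts


sample_keys = [
    'chunks/--6UTyrB6Yw|108|128.mp3',
    'chunks/--6UTyrB6Yw|26|47.mp3',
    'chunks/--7UpbYZ-4I|0|30.mp3',  # hypothetical key for illustration
]
print(count_chunks_per_video(sample_keys))
```

With this approach the `file_count` column would record how many chunk files belong to each video rather than always being 1.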


In [None]:
import csv
import os
input_file = 'unique_video_ids.csv'
output_file = 'clean_video_ids.csv'
clean_video_ids = set()
print(f"Processing file: {input_file}")

# Read the input CSV file
with open(input_file, 'r', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)

    header = next(csv_reader, None)

    # Raise instead of calling exit(), which can shut down the notebook kernel
    if header is None or 'video_id' not in header:
        raise ValueError(f"Input file '{input_file}' does not have the expected header with 'video_id' column")

    video_id_index = header.index('video_id')
    for row in csv_reader:
        if len(row) > video_id_index:
            # Keep the part before the first '|' (the whole value if there is none)
            full_id = row[video_id_index]
            clean_id = full_id.split('|')[0]
            clean_video_ids.add(clean_id)

print(f"Found {len(clean_video_ids)} unique clean video IDs")

with open(output_file, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)

    # Write header
    csv_writer.writerow(['video_id'])

    # Write each clean ID
    for clean_id in sorted(clean_video_ids):
        csv_writer.writerow([clean_id])

print(f"CSV file '{output_file}' has been created with {len(clean_video_ids)} clean video IDs")

try:
    from google.colab import files
    files.download(output_file)
    print(f"Download initiated for {output_file}")
except ImportError:
    print(f"File saved at: {os.path.abspath(output_file)}")

Processing file: unique_video_ids.csv
Found 85959 unique clean video IDs
CSV file 'clean_video_ids.csv' has been created with 85959 clean video IDs


Download initiated for clean_video_ids.csv


In [None]:
import pandas as pd
file_path = '/content/clean_video_ids.csv'
data = pd.read_csv(file_path)
data.head()


|   | video_id |
|---|---|
| 0 | --6UTyrB6Yw |
| 1 | --7UpbYZ-4I |
| 2 | --7fU5T_8H0 |
| 3 | --CFRdZtLwU |
| 4 | --CMkGANRMo |


In [None]:
data.shape

(85959, 1)

In [None]:
import pandas as pd
file_path = '/content/unique_video_ids.csv'
data = pd.read_csv(file_path)
data.head()


|   | video_id | file_count |
|---|---|---|
| 0 | --6UTyrB6Yw\|108\|128.mp3 | 1 |
| 1 | --6UTyrB6Yw\|26\|47.mp3 | 1 |
| 2 | --6UTyrB6Yw\|2\|24.mp3 | 1 |
| 3 | --6UTyrB6Yw\|50\|80.mp3 | 1 |
| 4 | --6UTyrB6Yw\|80\|84.mp3 | 1 |


In [None]:
data.shape

(2407044, 2)