Copyright 2024 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# YouTube Tech Services Asset Label Generator

The goal is to bring order to your asset catalog. [Asset labels](https://support.google.com/youtube/answer/6063635?hl=en) are a tool for organizing your YouTube assets. They let you search for and update a group of assets at once. They are also included in many reports, so you can analyze the performance of each asset group. For each asset, this script generates a label with the name of the channel that uploaded the asset's partner-uploaded video, plus an optional additional label based on the channel.

Requirements:


*   Google Drive
*   Google Cloud project with [YouTube Reporting API](https://console.cloud.google.com/apis/library/youtubereporting.googleapis.com?), [YouTube Content ID API](https://console.cloud.google.com/apis/library/youtubepartner.googleapis.com), and [Google Sheets API](https://console.cloud.google.com/apis/library/sheets.googleapis.com) enabled
*   Google Cloud Project Service Account with Key ([Help Center](https://cloud.google.com/iam/docs/service-accounts-create#iam-service-accounts-create-console))
*  YouTube CMS (External Content Owner ID)
*  The Service Account needs access to the CMS with at least the following rights: "Content Delivery", "Reporting and Analytics" and "Content ID Rights Management"

This script will:


*   Download two reports: content_owner_active_claims_a1 and content_owner_video_metadata_a3
*   Combine these reports so that video, channel and asset information is available
*   Use the channel name of the uploaded video as additional label for the asset (plus an additional label if configured)
*  [Optional] Apply an additional mapping based on the channel ID. The Google Sheet needs a header row with the columns "channel_id" and "additional_label". This lets you assign one label to several channels, e.g. Channel1 (LabelA), Channel2 (LabelA), Channel3 (LabelB), Channel4 (LabelB).
*  [Optional] Upload and process the generated asset labels via API

Please note: The generated asset labels are based on the latest version of the content_owner_active_claims_a1 and content_owner_video_metadata_a3. Usually, these reports are generated once per day, and therefore they might not include the latest asset labels. If you run this Colab multiple times based on the same report (e.g., run it multiple times per day), you may generate the same asset labels. The YouTube system can handle this, and it should not result in any issues.



# Configuration

Please adjust the settings below:

*  EXTERNAL_CONTENT_OWNER_ID: ID of your YouTube Content Owner
*  DRIVE_FOLDER: folder in your Google Drive to store the downloaded reports and the generated update csv files.
*  JSON_FILE: JSON file with the service account credentials. You need to upload this file to the configured Drive folder.
*  ADDITIONAL_MAPPING: True if you have an additional mapping based on channel IDs, otherwise False
*  ADDITIONAL_MAPPING_SPREADSHEET_ID: ID of the Google Spreadsheet with the additional mapping (when you open the spreadsheet, the ID is part of the URL). The spreadsheet needs two columns named 'channel_id' and 'additional_label'.
*  ADDITIONAL_MAPPING_RANGE: name of the tab in the Google Spreadsheet
*  YOUTUBE_CMS_AUTOMATICALLY_UPLOAD: True, if you want to directly upload and process the generated csv file
*  YOUTUBE_CMS_UPLOADER_NAME: the name of your YouTube uploader account (usually starts with 'web-yt-')
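
The dependencies between these settings can be checked before running the remaining cells. The following is a minimal sketch and not part of the original notebook: the helper name `check_config` and the dict-based interface are assumptions made for illustration, and the rules only encode what the list above states.

```python
def check_config(config):
  """Returns a list of problems found in `config`; an empty list means OK.

  Hypothetical helper: `config` is a plain dict keyed by the setting names
  used in this notebook.
  """
  problems = []
  for key in ('EXTERNAL_CONTENT_OWNER_ID', 'DRIVE_FOLDER', 'JSON_FILE'):
    if not config.get(key):
      problems.append(key + ' is empty')
  # The spreadsheet settings only matter when the mapping is enabled.
  if config.get('ADDITIONAL_MAPPING'):
    for key in (
        'ADDITIONAL_MAPPING_SPREADSHEET_ID',
        'ADDITIONAL_MAPPING_RANGE',
    ):
      if not config.get(key):
        problems.append(key + ' is required when ADDITIONAL_MAPPING is True')
  # The uploader name only matters when the automatic upload is enabled.
  if config.get('YOUTUBE_CMS_AUTOMATICALLY_UPLOAD'):
    if not config.get('YOUTUBE_CMS_UPLOADER_NAME'):
      problems.append(
          'YOUTUBE_CMS_UPLOADER_NAME is required when'
          ' YOUTUBE_CMS_AUTOMATICALLY_UPLOAD is True'
      )
  return problems
```

Calling it with the values from the configuration cell and printing the returned list gives a quick overview of missing settings.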



In [None]:
# add the YouTube Content Owner ID
EXTERNAL_CONTENT_OWNER_ID = ''  # @param {type:"string"}

# Google drive folder where the client_secret.json is stored as well as downloaded reports and the generated update csv files
DRIVE_FOLDER = 'AssetLabelGenerator'  # @param {type:"string"}

# Service account authentication file.
# The JSON key downloaded from your Google Cloud project has a different
# name, so either rename the file or adjust the name here. Upload the file
# to the configured Drive folder.
JSON_FILE = 'client_secret.json'  # @param {type:"string"}

# indicates if you need an additional mapping depending on the channel ID
ADDITIONAL_MAPPING = False  # @param {type:"boolean"}

# the Google Spreadsheet ID of the additional mapping
ADDITIONAL_MAPPING_SPREADSHEET_ID = '' # @param {type:"string"}

# the Google Spreadsheet tab name of the additional mapping
ADDITIONAL_MAPPING_RANGE = ''  # @param {type:"string"}

# if you want to automatically upload and process the generated csv update
YOUTUBE_CMS_AUTOMATICALLY_UPLOAD = False  # @param {type:"boolean"}

# the name of your YouTube uploader account (usually starts with 'web-yt-')
YOUTUBE_CMS_UPLOADER_NAME = ''  # @param {type:"string"}

# do not change arguments below
DRIVE_DIRECTORY_PATH = '/content/drive/My Drive/' + DRIVE_FOLDER + '/'
CLAIM_REPORT_TYPE_ID = 'content_owner_active_claims_a1'
CLAIM_OUTPUT_FILE = DRIVE_DIRECTORY_PATH + CLAIM_REPORT_TYPE_ID
VIDEO_REPORT_TYPE_ID = 'content_owner_video_metadata_a3'
VIDEO_OUTPUT_FILE = DRIVE_DIRECTORY_PATH + VIDEO_REPORT_TYPE_ID


# Connect to Drive

Mount the Google Drive destination and create a specific folder for the script.

**NOTE: If this is the first time you run the code, you need to upload your `client_secret.json` service account key file to the Drive folder after it has been created.**
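
Before moving on, it can help to confirm that the uploaded file is present and is a service account key. The sketch below is not part of the original notebook. The function name is hypothetical, and the check relies only on fields that a service account key file contains (`type` set to `service_account`, `client_email`, `private_key`).

```python
import json
from pathlib import Path


def looks_like_service_account_key(path):
  """Returns True if `path` is a JSON file resembling a service account key."""
  key_path = Path(path)
  if not key_path.is_file():
    return False
  try:
    data = json.loads(key_path.read_text())
  except (ValueError, OSError):
    # Not valid JSON, not decodable as text, or not readable.
    return False
  required = ('client_email', 'private_key')
  return (
      isinstance(data, dict)
      and data.get('type') == 'service_account'
      and all(field in data for field in required)
  )
```

In this notebook the path to check would be the configured Drive folder plus the configured file name.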

In [None]:
from pathlib import Path
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# create folder if necessary
Path(DRIVE_DIRECTORY_PATH).mkdir(parents=True, exist_ok=True)

# Authentication and Google Service Usage

List the Google services and methods to access them.

In [None]:
import json
from google.oauth2 import service_account
from googleapiclient import discovery

# Service name, version and scopes for the YouTube Reporting API
YOUTUBE_REPORT_API_SERVICE_NAME = 'youtubereporting'
YOUTUBE_REPORT_API_VERSION = 'v1'
YOUTUBE_REPORT_API_SCOPE = [
    'https://www.googleapis.com/auth/yt-analytics-monetary.readonly',
    'https://www.googleapis.com/auth/yt-analytics.readonly',
]

YOUTUBE_PARTNER_API_SERVICE_NAME = 'youtubePartner'
YOUTUBE_PARTNER_API_VERSION = 'v1'
YOUTUBE_PARTNER_API_SCOPE = [
    'https://www.googleapis.com/auth/youtube.force-ssl',
    'https://www.googleapis.com/auth/youtubepartner',
]

GOOGLE_SPREADSHEET_API_SERVICE_NAME = 'sheets'
GOOGLE_SPREADSHEET_API_VERSION = 'v4'
GOOGLE_SPREADSHEET_API_SCOPE = [
    'https://www.googleapis.com/auth/spreadsheets.readonly'
]


def get_credentials(scopes):
  # Load the service account key from the configured Drive folder; the
  # context manager closes the file handle after reading.
  with open(DRIVE_DIRECTORY_PATH + JSON_FILE) as f:
    service_account_json = json.load(f)
  credentials = service_account.Credentials.from_service_account_info(
      service_account_json, scopes=scopes
  )
  return credentials


def get_google_spreadsheet_api_service():
  return discovery.build(
      GOOGLE_SPREADSHEET_API_SERVICE_NAME,
      GOOGLE_SPREADSHEET_API_VERSION,
      static_discovery=False,
      credentials=get_credentials(GOOGLE_SPREADSHEET_API_SCOPE),
  )


def get_youtube_partner_api_service():
  return discovery.build(
      YOUTUBE_PARTNER_API_SERVICE_NAME,
      YOUTUBE_PARTNER_API_VERSION,
      static_discovery=False,
      credentials=get_credentials(YOUTUBE_PARTNER_API_SCOPE),
  )


def get_youtube_report_api_service():
  return discovery.build(
      YOUTUBE_REPORT_API_SERVICE_NAME,
      YOUTUBE_REPORT_API_VERSION,
      static_discovery=False,
      credentials=get_credentials(YOUTUBE_REPORT_API_SCOPE),
  )


# Download Reports

The authentication happens at the YouTube Content Owner level; therefore, the EXTERNAL_CONTENT_OWNER_ID needs to be configured (above). The script downloads the [content_owner_active_claims_a1](https://developers.google.com/youtube/reporting/v1/reports/system_managed/claims) report and the [content_owner_video_metadata_a3](https://developers.google.com/youtube/reporting/v1/reports/system_managed/videos).  The reports are stored in the configured Google Drive folder.
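
A job can have several reports, typically one per generated day. If the order of the returned list is in doubt, the most recent report can be selected explicitly. The following is a sketch under assumptions, not the notebook's method: it assumes each report entry carries the `createTime` field of the Reporting API's report resource as an RFC 3339 timestamp, and the helper name and sample data are made up for illustration.

```python
from datetime import datetime


def latest_report(reports):
  """Returns the report dict with the newest createTime, or None if empty."""
  if not reports:
    return None

  def created(report):
    # Compare at second precision: the first 19 characters hold
    # YYYY-MM-DDTHH:MM:SS, so fractional seconds and the trailing 'Z'
    # are ignored.
    return datetime.strptime(report['createTime'][:19], '%Y-%m-%dT%H:%M:%S')

  return max(reports, key=created)


# Hypothetical sample entries with mixed timestamp precision.
sample = [
    {'id': 'r1', 'createTime': '2024-05-01T08:15:30.123456Z'},
    {'id': 'r2', 'createTime': '2024-05-03T08:15:30Z'},
    {'id': 'r3', 'createTime': '2024-05-02T08:15:30.5Z'},
]
print(latest_report(sample)['id'])
```

Parsing the timestamps avoids relying on plain string comparison, which can misorder values that differ only in whether fractional seconds are present.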

In [None]:
from io import FileIO
from googleapiclient.http import MediaIoBaseDownload


def getReportJob(youtube_reporting, report_type_id):
  jobListResult = (
      youtube_reporting.jobs()
      .list(
          onBehalfOfContentOwner=EXTERNAL_CONTENT_OWNER_ID,
          includeSystemManaged=True,
      )
      .execute()
  )
  # The 'jobs' key may be missing when no job exists for this content owner.
  jobList = jobListResult.get('jobs', [])
  for job in jobList:
    if job['reportTypeId'] == report_type_id:
      return job
  return None


def getReportByJobId(youtube_reporting, id):
  reportListResult = (
      youtube_reporting.jobs()
      .reports()
      .list(onBehalfOfContentOwner=EXTERNAL_CONTENT_OWNER_ID, jobId=id)
      .execute()
  )
  reportList = reportListResult.get('reports', [])
  # An empty or missing list means no report has been generated yet.
  if not reportList:
    print('No reports available')
    return None
  return reportList[0]


def downloadReport(youtube_reporting, downloadUrl, output_path):
  request = youtube_reporting.media().download(resourceName=' ')
  request.uri = downloadUrl
  fh = FileIO(output_path, mode='wb')
  # Stream/download the report in a single request.
  downloader = MediaIoBaseDownload(fh, request, chunksize=-1)

  done = False
  while done is False:
    status, done = downloader.next_chunk()
    if status:
      print('Download %d%%.' % int(status.progress() * 100))
  print('Download Complete!')


if __name__ == '__main__':
  youtube_reporting = get_youtube_report_api_service()
  # Retrieve the reporting jobs for claim report
  reportJob = getReportJob(youtube_reporting, CLAIM_REPORT_TYPE_ID)
  print(reportJob)
  # Retrieve last Report for claim report
  report = getReportByJobId(youtube_reporting, reportJob['id'])
  print(report)
  # Download the claim report
  downloadReport(youtube_reporting, report['downloadUrl'], CLAIM_OUTPUT_FILE)

  # Retrieve the reporting job for the video report
  reportJob = getReportJob(youtube_reporting, VIDEO_REPORT_TYPE_ID)
  print(reportJob)
  # Retrieve the last report for the video report
  report = getReportByJobId(youtube_reporting, reportJob['id'])
  print(report)
  # Download the video report
  downloadReport(youtube_reporting, report['downloadUrl'], VIDEO_OUTPUT_FILE)


# Additional Mapping

If you have an additional mapping configured, the script accesses the Google spreadsheet and converts the data into a Pandas data frame.
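
The Sheets API returns the tab as a list of rows, where the first row is the header and trailing empty cells can be missing from a row. As an illustration of that structure, here is a minimal sketch that turns such rows into a plain dictionary. The helper name and the sample values are hypothetical, and the notebook itself builds a Pandas data frame instead.

```python
def mapping_from_sheet_values(values):
  """Builds {channel_id: additional_label} from Sheets-style row lists."""
  if not values:
    return {}
  header = values[0]
  id_col = header.index('channel_id')
  label_col = header.index('additional_label')
  mapping = {}
  for row in values[1:]:
    # Pad short rows, since trailing empty cells may be omitted.
    padded = list(row) + [''] * (len(header) - len(row))
    channel_id = padded[id_col].strip()
    if channel_id:
      mapping[channel_id] = padded[label_col].strip()
  return mapping


# Hypothetical sheet content.
values = [
    ['channel_id', 'additional_label'],
    ['UC_channel_1', 'LabelA'],
    ['UC_channel_2', 'LabelA'],
    ['UC_channel_3'],  # label cell left empty in the sheet
]
print(mapping_from_sheet_values(values))
```

Rows without a channel ID are skipped, and a missing label becomes an empty string.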

In [None]:
import pandas as pd


if ADDITIONAL_MAPPING:
  service = get_google_spreadsheet_api_service()

  worksheet = (
      service.spreadsheets()
      .values()
      .get(
          spreadsheetId=ADDITIONAL_MAPPING_SPREADSHEET_ID,
          range=ADDITIONAL_MAPPING_RANGE,
      )
      .execute()
  )

  values = worksheet.get('values', [])
  additional_mapping = pd.DataFrame(values[1:], columns=values[0])

else:
  print('No additional Mapping enabled.')


# Generate the update csv file to label assets

The script reads the two downloaded reports and converts each into a Pandas data frame. Then it joins the two reports on the video ID. The combined report contains claim_id, video_id, custom_id, claim_origin, asset_id, asset_labels, channel_id and channel_display_name.

The script removes all entries where the channel is unknown (channel_display_name is empty) or where the asset already has an asset label with the name of the channel.

Finally, the script generates an asset update csv file to label all remaining assets and stores that file in the configured Google Drive folder.
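
The join and the two filters can be illustrated on a toy data set. This is a simplified sketch with invented sample rows, using only a subset of the report columns. It is not a substitute for the cell below.

```python
import pandas as pd

# Invented sample data with a subset of the report columns.
claims = pd.DataFrame({
    'asset_id': ['A1', 'A2', 'A3'],
    'video_id': ['v1', 'v2', 'v3'],
    'asset_labels': [None, 'Channel Two', None],
})
videos = pd.DataFrame({
    'video_id': ['v1', 'v2'],
    'channel_display_name': ['Channel One', 'Channel Two'],
})

# Left join: every claim is kept, channel columns are filled where known.
joined = pd.merge(claims, videos, on='video_id', how='left')

# Drop claims whose channel is unknown (v3 has no row in the video data).
joined = joined[joined['channel_display_name'].notna()]


def has_channel_label(row):
  # An asset without labels cannot carry the channel label yet.
  if pd.isna(row['asset_labels']):
    return False
  return row['channel_display_name'] in row['asset_labels']


# Keep only assets that still need the channel label.
to_label = joined[~joined.apply(has_channel_label, axis=1)]
print(to_label[['asset_id', 'channel_display_name']])
```

Here A2 is dropped because it already carries its channel name, A3 is dropped because its channel is unknown, and only A1 remains to be labeled.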

In [None]:
import gzip
from pathlib import Path
import numpy as np
import pandas as pd


def already_labeled_condition(row):
  # no asset label
  if pd.isna(row['asset_labels']):
    return False
  # the script was not able to label this asset in a previous run
  if row['asset_labels'] in ['NO_CHANNEL_LABEL']:
    return True
  # check if the channel name is already in the asset labels
  return row['channel_display_name'] in row['asset_labels']


# read downloaded claim report from Google drive and put it in Pandas dataframe
with gzip.open(CLAIM_OUTPUT_FILE) as f:
  df_claims = pd.read_csv(
      f,
      usecols=[
          'claim_id',
          'video_id',
          'custom_id',
          'claim_origin',
          'asset_id',
          'asset_labels',
      ],
      dtype={
          'claim_id': 'string',
          'video_id': 'string',
          'custom_id': 'string',
          'claim_origin': 'string',
          'asset_id': 'string',
          'asset_labels': 'string',
      },
  )

# replace empty video_id with custom_id (video_id is stored in custom_id for
# deleted videos)
df_claims.loc[df_claims['video_id'].isnull(), 'video_id'] = df_claims[
    'custom_id'
]
# filter for partner uploaded claims only
df_claims_own = df_claims[
    df_claims['claim_origin'].isin(
        ['PARTNER_API', 'WEB_UPLOAD_BY_OWNER', 'CMS_UPLOAD']
    )
]

# read downloaded video report from Google drive and put it in Pandas dataframe
df_video = pd.read_csv(
    VIDEO_OUTPUT_FILE,
    usecols=['video_id', 'channel_id', 'channel_display_name'],
    dtype={
        'video_id': 'string',
        'channel_id': 'string',
        'channel_display_name': 'string',
    },
)

# join claims and videos on video_id
df_claim_video = pd.merge(df_claims_own, df_video, on='video_id', how='left')

# remove all rows with empty channel name
df_claim_video = df_claim_video[
    df_claim_video['channel_display_name'].notnull()
]
# remove characters from the channel name that are not allowed in asset labels
df_claim_video['channel_display_name'] = df_claim_video[
    'channel_display_name'
].str.replace(r'[,&<>:]+', '', regex=True)

df_claim_video['has_label_already'] = df_claim_video.apply(
    already_labeled_condition, axis=1
)

df_claim_video = df_claim_video[~df_claim_video['has_label_already']]

if ADDITIONAL_MAPPING:
  df_claim_video = pd.merge(
      df_claim_video, additional_mapping, on='channel_id', how='left'
  )
  df_claim_video['additional_label'] = df_claim_video[
      'additional_label'
  ].replace(np.nan, '')

# create a dataframe for the asset update
df_asset_update = pd.DataFrame()

# append columns to an empty DataFrame
df_asset_update['asset_id'] = df_claim_video['asset_id']
df_asset_update['custom_id'] = ''
df_asset_update['asset_type'] = ''
df_asset_update['title'] = ''
if ADDITIONAL_MAPPING:
  df_asset_update['add_asset_labels'] = (
      df_claim_video['channel_display_name']
      + '|'
      + df_claim_video['additional_label']
  )
else:
  df_asset_update['add_asset_labels'] = df_claim_video['channel_display_name']
df_asset_update['ownership'] = ''
df_asset_update['enable_content_id'] = ''
df_asset_update['reference_filename'] = ''
df_asset_update['reference_exclusions'] = ''
df_asset_update['match_policy'] = ''
df_asset_update['update_all_claims'] = ''
df_asset_update['clear_labels'] = ''

filepath = Path(DRIVE_DIRECTORY_PATH + 'asset_label_update.csv')
df_asset_update = df_asset_update.dropna(subset=['asset_id'])
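# assign the result instead of using inplace=True on a column selection,
# which newer pandas versions no longer apply to the original frame
df_asset_update['add_asset_labels'] = df_asset_update[
    'add_asset_labels'
].fillna('NO_CHANNEL_LABEL')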
df_asset_update = df_asset_update.drop_duplicates(subset=['asset_id'], keep='first')
df_asset_update.to_csv(filepath, index=False)


# Upload and process asset label update

If configured, the script uploads the generated update csv file to your YouTube CMS. Please note that the update is processed immediately. If the asset update csv contains more than 10k lines, it is split into chunks of at most 10k lines. If the upload fails here (which can happen), please upload the update csv file manually from your configured Google Drive folder.
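
The chunking idea can be sketched independently of the upload. The helper below is hypothetical and uses fixed-size slices, so every chunk except possibly the last holds exactly `max_rows` rows. This differs from the cell below, which uses `np.array_split` and produces chunks of roughly equal size.

```python
def split_into_chunks(rows, max_rows=10000):
  """Yields consecutive slices of `rows` with at most `max_rows` items each."""
  if max_rows < 1:
    raise ValueError('max_rows must be at least 1')
  for start in range(0, len(rows), max_rows):
    yield rows[start:start + max_rows]


rows = list(range(25000))
print([len(chunk) for chunk in split_into_chunks(rows)])
# prints [10000, 10000, 5000]
```

For a Pandas data frame, `df.iloc[start:start + max_rows]` is the equivalent positional slice. An empty input yields no chunks at all, so nothing would be uploaded in that case.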

In [None]:
from datetime import datetime
from googleapiclient.http import MediaFileUpload
import numpy as np
import pandas as pd

if YOUTUBE_CMS_AUTOMATICALLY_UPLOAD:
  youtube_partner = get_youtube_partner_api_service()
  df = pd.read_csv(DRIVE_DIRECTORY_PATH + 'asset_label_update.csv')

  # split the update file into chunks of at most 10k lines
  chunk_size = 10000
  df_chunks = [
      df.iloc[start : start + chunk_size]
      for start in range(0, len(df), chunk_size)
  ]

  for i, chunk in enumerate(df_chunks):
    filepath = Path(DRIVE_DIRECTORY_PATH + f'asset_label_update_{i}.csv')
    chunk.to_csv(filepath, index=False)

    with open(
        DRIVE_DIRECTORY_PATH + f'asset_label_update_{i}.csv', 'r'
    ) as file:
      csv_update = file.read()

    body = {
        'kind': 'youtubePartner#package',
        'uploaderName': YOUTUBE_CMS_UPLOADER_NAME,
        'name': (
            'PTM Colab Label Generator '
            + datetime.utcnow().strftime('%F %T.%f')[:-3]
        ),
        'content': csv_update,
    }

    result = (
        youtube_partner.package()
        .insert(onBehalfOfContentOwner=EXTERNAL_CONTENT_OWNER_ID, body=body)
        .execute()
    )

    print('Response: {}'.format(result))

  print(
      'If there is no error message in the log the packages are uploaded and'
      ' will be processed automatically.'
  )
else:
  print(
      'Automatic Upload to YouTube is disabled. Either enable it in the config'
      ' or upload the files manually from your Drive to the YouTube CMS.'
  )
