# Project Proposal: Predicting the Popularity of YouTube Videos Using Early Metrics

## 1. Problem Description and Motivation (2%)

We want to predict how popular YouTube videos will be using early engagement metrics. YouTube's recommendation system appears to weigh early stats like views, likes, comments, and shares. But figuring out *which specific factors* have the biggest impact on whether a video goes viral can offer important insights for creators, marketers, and viewers.

**Motivation**: With billions of videos on YouTube, knowing early on which ones will become popular can be really helpful for improving content strategy. This project will help people (and me! I have a small YouTube channel that nobody seems to notice) predict whether a video will be a hit by analyzing early data such as view count.

This is my channel. I hope this project will help it one day :(
https://www.youtube.com/channel/UC91T6l13DPKKHhLMF10Gi2g?sub_confirmation=1

We’re focusing on two key questions:
1. **Which early engagement metrics (views, likes, comments) have the biggest influence on a video’s future success?**
2. **Can we accurately predict the future view count of a video using data from its first 24 hours?**

This project taps into the growing role of video content in today’s digital world and could be a valuable tool for content creators looking to improve their strategies.

---

## 2. Data Source and Collection Plan (2%)

For this project, we’ll gather data using the **YouTube Data API**. Unfortunately, its daily quota only lets us collect about `1500` videos per day, so I have been collecting data on consecutive days over the past week. If you pick my proposal, we will start with at least `4500` videos. We’ll focus on videos uploaded in the last 30 days that match a specific search query (e.g., "Data Science Tutorial") to keep the data fresh and relevant.
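The daily limit comes from the API's quota system, which counts units per request rather than videos. As a rough sketch, assuming the default allocation of 10,000 units per day and costs of 100 units per `search.list` call and 1 unit per `videos.list` call (these figures are assumptions that should be checked against the project's quota page), the cost of one run with the settings used in the collection script (30 days, up to 150 videos per day) can be estimated like this:

```python
import math

# Assumed quota figures; verify them in the Google Cloud console for your project.
DAILY_QUOTA_UNITS = 10_000
SEARCH_LIST_COST = 100   # one search().list() call
VIDEOS_LIST_COST = 1     # one videos().list() call


def estimate_run_cost(days, max_total_results_per_day, page_size=50):
    """Estimate the quota units used by one collection run.

    Each page of search results needs one search call plus one
    statistics call, so a day costs pages * (100 + 1) units.
    """
    pages_per_day = math.ceil(max_total_results_per_day / page_size)
    calls = days * pages_per_day
    return calls * (SEARCH_LIST_COST + VIDEOS_LIST_COST)


cost = estimate_run_cost(days=30, max_total_results_per_day=150)
print(f"Estimated cost: {cost} units of {DAILY_QUOTA_UNITS}")
print("Fits in one day's quota:", cost <= DAILY_QUOTA_UNITS)
```

Under these assumptions a full 30-day run uses most of one day's allowance, so a second run on the same day would be expected to fail.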

**Features we’ll collect:**
- **Numeric Features**: View Count, Like Count, Comment Count
- **Categorical Feature**: Video Category (e.g., "Education," "Science & Technology")

**Data Collection**: 
We’ll use Python and the YouTube Data API to collect the data. Each API call will retrieve up to 50 videos, and we’ll paginate through the results if needed. The data will be focused on engagement metrics. We may later decide to collect data only from the first 24 hours after a video is released by changing the `start_date` and `end_date` window computed inside the `for day in range(days):` loop of the collection code. We’ll also map category IDs to their actual names to make the data easier to understand.
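The day-by-day date windows can be sketched on their own. This is a minimal illustration, assuming each window is later passed to the search request as a `publishedAfter`/`publishedBefore` pair; the helper name `daily_windows` is hypothetical and not part of the collection script.

```python
from datetime import datetime, timedelta, timezone

RFC3339 = "%Y-%m-%dT%H:%M:%SZ"


def daily_windows(now, days):
    """Return (start, end) RFC 3339 strings for each of the last `days` days.

    Window 0 covers the most recent 24 hours, window 1 the 24 hours
    before that, and so on. Consecutive windows share a boundary.
    """
    windows = []
    for day in range(days):
        start = now - timedelta(days=day + 1)
        end = now - timedelta(days=day)
        windows.append((start.strftime(RFC3339), end.strftime(RFC3339)))
    return windows


now = datetime(2024, 10, 3, 12, 0, 0, tzinfo=timezone.utc)
for start, end in daily_windows(now, days=3):
    print(start, "->", end)
```

Passing both ends of each window matters: with only a start timestamp, every request would also return videos from all later days.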

We’ll store the data in a CSV file with the following columns:
- `Video ID`: Unique identifier for each video
- `View Count`: Number of views the video has received
- `Like Count`: Number of likes
- `Comment Count`: Number of comments
- `Video Category`: The category the video belongs to
- `Video Published At`: The date and time the video was published

**I've pasted my code below for your reference.**

---

In [1]:
pip install google-api-python-client pandas


Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import googleapiclient.discovery
import pandas as pd
from datetime import datetime, timezone, timedelta

# Here's where we load the YouTube API key.
# It is read from an environment variable so the key never ends up in the notebook
# (the variable name YOUTUBE_API_KEY is just my choice; the key has a daily quota, so please don't share it).
API_KEY = os.environ["YOUTUBE_API_KEY"]

# Now we initialize the YouTube API client with the API key so we can make requests
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

# Next, let's write a function to fetch videos based on a search query
def get_videos(query, max_results=50, published_after=None, published_before=None, next_page_token=None):
    """
    This function will fetch videos from YouTube based on the query we pass in.
    
    Parameters:
        query (str): The search term to find relevant videos.
        max_results (int): The maximum number of videos to get in one request.
        published_after (str): Only fetch videos published after this timestamp (formatted in RFC 3339).
        published_before (str): Only fetch videos published before this timestamp (formatted in RFC 3339).
        next_page_token (str): Token used for pagination to fetch more videos, if available.
    
    Returns:
        response (dict): This will return the YouTube response with video data.
    """
    request = youtube.search().list(
        q=query,  # Here we set the search term
        part="snippet",  # Now we can get the basic info about each video (like title, channel, etc.)
        type="video",  # This will get only videos (not channels or playlists)
        maxResults=max_results,  # This controls how many videos we’re requesting in this call
        publishedAfter=published_after,  # Only fetch videos after a certain date
        publishedBefore=published_before,  # ...and before a certain date, so each day gets its own window
        order="relevance",  # This sorts the videos by relevance to our query
        pageToken=next_page_token  # If there are more pages, this will help us get the next set of results
    )
    response = request.execute()  # Now we execute the request to YouTube
    return response  # Here we return the fetched video data

# Now we can write a function to get the stats for each video (e.g., views, likes, comments)
def get_video_statistics(video_ids):
    """
    This function will grab all the stats for the videos we found, including category information.
    
    Parameters:
        video_ids (list): A list of video IDs for which to get statistics.
    
    Returns:
        response (dict): This returns the video statistics and metadata from YouTube.
    """
    request = youtube.videos().list(
        part="statistics,snippet",  # Now we can fetch both statistics (views, likes) and snippet (title, category)
        id=",".join(video_ids)  # Here we join the video IDs into a single string separated by commas
    )
    response = request.execute()  # Now we execute the request to get the stats
    return response  # Finally, we return the response containing video stats

# This is the main function that collects video data over a number of days
def collect_video_data(query, days=30, max_results_per_day=50, max_total_results_per_day=100):
    """
    Here we collect video data for a specific search query over several days.
    
    Parameters:
        query (str): The search query to find relevant videos.
        days (int): The number of days to collect video data.
        max_results_per_day (int): Maximum videos to fetch per request (per day).
        max_total_results_per_day (int): The total number of videos to fetch per day, including paginated results.
    
    Returns:
        pd.DataFrame: A DataFrame with all the collected video statistics and metadata.
    """
    video_data = []
    today = datetime.now(timezone.utc)  # Here we get the current date and time in UTC to work with timestamps
    
    for day in range(days):  # Now we loop through each day to collect data
        # We need to format the start and end dates correctly for YouTube's API
        start_date = (today - timedelta(days=day+1)).strftime("%Y-%m-%dT%H:%M:%SZ")
        end_date = (today - timedelta(days=day)).strftime("%Y-%m-%dT%H:%M:%SZ")
        
        next_page_token = None  # Initialize the page token for pagination
        total_results_fetched = 0  # Keep track of how many results we’ve fetched so far
        
        # Now we loop until we've fetched the total results for the day
        while total_results_fetched < max_total_results_per_day:
            # Fetch the videos published between the start and end of this day's window
            response = get_videos(query, max_results=max_results_per_day, published_after=start_date, published_before=end_date, next_page_token=next_page_token)
            
            video_ids = [item['id']['videoId'] for item in response.get('items', [])]  # Extract video IDs from the response
            
            if not video_ids:
                break  # If no videos are found, we stop for this day
            
            # Now we can get the stats for each video
            stats_response = get_video_statistics(video_ids)
            
            # Next, we loop through each video and collect the data
            for item in stats_response.get('items', []):
                video_id = item['id']
                statistics = item['statistics']  # Get the statistics like views, likes, etc.
                snippet = item['snippet']  # Grab the metadata like title and category
                
                # Now we add the collected data to our list
                video_data.append({
                    'Video ID': video_id,
                    'View Count': int(statistics.get('viewCount', 0)),  # Make sure view count is an integer, defaulting to 0 if missing
                    'Like Count': int(statistics.get('likeCount', 0)),  # Same for like count
                    'Comment Count': int(statistics.get('commentCount', 0)),  # And for comment count
                    'Favorite Count': int(statistics.get('favoriteCount', 0)),  # Handle favorite count similarly
                    'Video Category': snippet.get('categoryId', 'Unknown'),  # We store the category ID as a categorical feature
                    'Video Published At': snippet.get('publishedAt', start_date)  # Store the video's actual publish time (fall back to the window start if missing)
                })
            
            # Update the total number of results we've fetched
            total_results_fetched += len(video_ids)
            
            # Check if there is another page of results to fetch
            next_page_token = response.get('nextPageToken')
            if not next_page_token:
                break  # If there are no more pages, we're done here
    
    # Finally, convert the list of video data into a DataFrame
    return pd.DataFrame(video_data)

# Now we can use all of this to collect data and save it to a CSV
if __name__ == "__main__":
    query = "Data Science Tutorial"  # Set the search term to something relevant (e.g., "Data Science Tutorial")
    days_to_collect = 30  # Collect data for the last 30 days
    max_videos_per_day = 50  # Fetch up to 50 videos in one request
    max_total_results_per_day = 150  # Fetch up to 150 videos in total per day (with pagination)
    
    # Time to collect our video data
    df = collect_video_data(query, days=days_to_collect, max_results_per_day=max_videos_per_day, max_total_results_per_day=max_total_results_per_day)
    
    # Now we save the collected data to a CSV file
    df.to_csv('YP_Youtube1_3.csv', index=False)
    print("Data collection complete. Saved to YP_Youtube1_3.csv")

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/search?q=Data+Science+Tutorial&part=snippet&type=video&maxResults=50&publishedAfter=2024-10-02T23%3A54%3A17Z&order=relevance&key=REDACTED&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">
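Because the run above stopped with a `quotaExceeded` error before anything was written to disk, it may be worth keeping whatever was collected up to that point. The following is only a sketch: `QuotaExceeded` and `fake_fetch_day` are hypothetical stand-ins (with google-api-python-client the real exception would be `googleapiclient.errors.HttpError`), used here so the example runs without network access.

```python
class QuotaExceeded(Exception):
    """Hypothetical stand-in for the API's quota error."""


def collect_with_partial_save(fetch_day, days):
    """Call fetch_day(day) for each day, stopping early on a quota error.

    Returns (rows, completed_days) so the caller can save what was
    collected and resume from `completed_days` on the next run.
    """
    rows = []
    completed_days = 0
    for day in range(days):
        try:
            rows.extend(fetch_day(day))
        except QuotaExceeded:
            break  # keep what we have instead of losing the whole run
        completed_days += 1
    return rows, completed_days


# Simulated fetcher: succeeds for two days, then runs out of quota.
def fake_fetch_day(day):
    if day >= 2:
        raise QuotaExceeded("daily quota used up")
    return [{"Video ID": f"vid-{day}-{i}"} for i in range(3)]


rows, completed = collect_with_partial_save(fake_fetch_day, days=5)
print(f"Saved {len(rows)} rows from {completed} completed days")
```

In the real script, the returned rows could be written to the CSV right away, and the next day's run could start from the first day that was not completed.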

While gathering the data from YouTube, I found that the "Video Category" field is returned as a numerical ID. These numerical category IDs aren't very helpful for understanding what each category represents. The `videos.list` response I am using does not include category names; as far as I can tell, they are only available from the separate `videoCategories.list` endpoint.

To make the data more meaningful and readable, we need to map these numerical category IDs to their actual category names (e.g., '27' becomes 'Education', '28' becomes 'Science & Technology', and so on).

In the next step, I create a dictionary to map these numerical IDs to their corresponding names, then apply this mapping to the dataset so that the "Video Category" field reflects the actual category names.

In [3]:
# Now, we're going to create a dictionary to map those category IDs to actual category names. 
# This way, we can easily understand what each category represents.
category_mapping = {
    '1': 'Film & Animation',
    '2': 'Autos & Vehicles',
    '10': 'Music',
    '15': 'Pets & Animals',
    '17': 'Sports',
    '18': 'Short Movies',
    '19': 'Travel & Events',
    '20': 'Gaming',
    '21': 'Videoblogging',
    '22': 'People & Blogs',
    '23': 'Comedy',
    '24': 'Entertainment',
    '25': 'News & Politics',
    '26': 'Howto & Style',
    '27': 'Education',
    '28': 'Science & Technology',
    '29': 'Nonprofits & Activism'
    # More categories can be added here if needed, but this should cover the most common ones.
    # I haven't found a better list yet; if you do, feel free to add to it here.
}

# Now let's load the CSV file that contains our YouTube video data.
# This file has all the video details that we collected earlier.
df = pd.read_csv('YP_Youtube1_3.csv')

# Here's where the transformation happens. We're going to replace the category IDs with the actual names.
# First, we make sure the 'Video Category' column is in string format, then we map those IDs to the category names.
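# IDs that are missing from the dictionary would otherwise turn into NaN, so we label them 'Unknown'.
df['Video Category'] = df['Video Category'].astype(str).map(category_mapping).fillna('Unknown')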

# Now that we've updated the data, let's save it to a new CSV file.
# This way, the original data stays untouched, and we get a more readable version with category names.
df.to_csv('YP_Youtube1_4.csv', index=False)

# Finally, let's print a message to confirm that everything worked and the file was saved successfully.
print("Category mapping completed. Saved to YP_Youtube1_4.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'YP_Youtube1_3.csv'
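As an alternative to the hand-typed dictionary above, the mapping could in principle be built from a `videoCategories.list` response, which returns one item per category with its `id` and `snippet.title`. The sketch below assumes that response shape and uses a small hand-written sample instead of a live call; `build_category_mapping` is a hypothetical helper.

```python
def build_category_mapping(response):
    """Turn a videoCategories.list-style response into {id: title}."""
    return {
        item["id"]: item["snippet"]["title"]
        for item in response.get("items", [])
    }


# Hand-written sample in the assumed response shape. With a live client this
# would come from something like:
#   youtube.videoCategories().list(part="snippet", regionCode="US").execute()
sample_response = {
    "items": [
        {"id": "27", "snippet": {"title": "Education"}},
        {"id": "28", "snippet": {"title": "Science & Technology"}},
    ]
}

mapping = build_category_mapping(sample_response)
print(mapping.get("27", "Unknown"))
print(mapping.get("99", "Unknown"))  # IDs missing from the mapping fall back
```

A live call does cost a little quota, so the result could be cached in a small file and reused across runs.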

## 3. How the Data Will Be Used and Questions of Interest (1%)

Once we have the data, we’ll analyze how early engagement metrics relate to a video’s future popularity. Specifically, we’ll focus on:

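- **Predicting View Counts**: Using the early metrics (views, likes, comments), we’ll try to predict how many views a video will get in the future (for example, one week later). This means each video's metrics need to be recorded twice: once about 24 hours after it is published and again at the later date. By building a machine learning model, we’ll look for patterns that can predict future video performance.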
  
- **Identifying Key Metrics**: We’ll also try to figure out which engagement metric has the biggest influence on a video’s popularity. For example, is the number of comments more important than the number of likes?

The goal is to create a model that helps us understand and predict how successful a video will be based on its early performance.
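To make the two questions concrete, here is a minimal sketch of the analysis, assuming each video has been recorded twice (once about 24 hours after publication and once a week later). The numbers below are made up purely to show the mechanics, and the log-log least-squares fit is an illustrative baseline, not the final model.

```python
import math

# Made-up example rows, NOT real data:
# (early_views, early_likes, early_comments, views_after_week)
snapshots = [
    (120, 10, 2, 900),
    (300, 25, 4, 2500),
    (50, 3, 0, 260),
    (1000, 90, 15, 9800),
    (640, 40, 9, 5100),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# log1p tames the heavy right tail that view counts usually have.
target = [math.log1p(row[3]) for row in snapshots]

# Question 1: how strongly does each early metric track later views?
for name, col in [("views", 0), ("likes", 1), ("comments", 2)]:
    feature = [math.log1p(row[col]) for row in snapshots]
    print(f"early {name}: r = {pearson(feature, target):.3f}")

# Question 2: predict later views from early views alone.
early_views = [math.log1p(row[0]) for row in snapshots]
slope, intercept = fit_line(early_views, target)


def predict_views(early):
    """Predicted view count after a week for a given 24-hour view count."""
    return math.expm1(slope * math.log1p(early) + intercept)


print(f"predicted views for 500 early views: {predict_views(500):.0f}")
```

Correlation alone only hints at which metric matters most, since views, likes, and comments tend to move together. A model that uses all three features and is evaluated on held-out videos would be the natural next step.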

---

## Conclusion

This project dives into the growing trend of using early data to predict video success on YouTube. By collecting and analyzing video data over time, we hope to uncover useful insights for content creators and marketers. Plus, it’s a great way to apply machine learning techniques to real-world data from one of the biggest video platforms in the world.