# Project Proposal: Predicting the Popularity of YouTube Videos Using Early Metrics

## 1. Problem Description and Motivation (2%)

We want to predict how popular YouTube videos will be using early engagement metrics. YouTube's algorithm usually recommends videos based on early stats like views, likes, comments, and shares. But figuring out *which specific factors* have the biggest impact on whether a video goes viral can offer important insights for creators, marketers, and viewers.

**Motivation**: With billions of videos on YouTube, knowing early on which ones will become popular can help creators improve their content strategy. This project will help people (including me: I have a small YouTube channel that nobody seems to notice) predict whether a video will be a hit by analyzing early data such as view count.

This is my channel; I hope this project will help it one day:
https://www.youtube.com/channel/UC91T6l13DPKKHhLMF10Gi2g?sub_confirmation=1

We’re focusing on two key questions:
1. **Which early engagement metrics (views, likes, comments) have the biggest influence on a video’s future success?**
2. **Can we accurately predict the future view count of a video using data from its first 24 hours?**

This project taps into the growing role of video content in today’s digital world and could be a valuable tool for content creators looking to improve their strategies.

---

## 2. Data Source and Collection Plan (2%)

For this project, we’ll gather data using the **YouTube Data API**. Unfortunately, my quota only lets me collect about `1500` videos per day, so I have been collecting data on consecutive days over the past week. If you pick my proposal, we will start with at least `4500` videos. We’ll focus on videos uploaded in the last 30 days that match a specific search query (e.g., "Data Science Tutorial") to keep the data fresh and relevant.
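To plan the daily pull, a rough quota budget can be sketched as below. The costs are assumptions based on the public quota documentation (`search.list` at 100 units per call, `videos.list` at 1 unit per call, a default allowance of 10,000 units per day). The function name `videos_per_day` is hypothetical, and the real limit on a given API key may be lower.

```python
# Rough quota budget for the collection loop. All costs are assumptions;
# check them against the quota page of your own Google Cloud project.
SEARCH_COST = 100   # assumed units per search.list call
STATS_COST = 1      # assumed units per videos.list call
PAGE_SIZE = 50      # maximum results per page for both calls


def videos_per_day(daily_quota=10_000):
    """Estimate how many videos one day of quota can cover.

    Each page of 50 videos needs one search call plus one statistics call.
    """
    cost_per_page = SEARCH_COST + STATS_COST
    pages = daily_quota // cost_per_page
    return pages * PAGE_SIZE


print(videos_per_day())        # budget under the assumed default quota
print(videos_per_day(3_000))   # budget for a smaller, hypothetical quota
```

Because the search call dominates the cost, re-polling statistics for videos we already know (one `videos.list` call per 50 IDs) is cheap compared with discovering new videos.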

**Features we’ll collect:**
- **Numeric Features**: View Count, Like Count, Comment Count
- **Categorical Feature**: Video Category (e.g., "Education," "Science & Technology")

**Data Collection**: 
We’ll use Python and the YouTube Data API to collect the data. Each API call retrieves up to 50 videos, and we’ll paginate through the results if needed. The data will focus on engagement metrics. We may later decide to collect data only from the first 24 hours after a video is released; that would mean changing the date-window loop in the collection code (`for day in range(days):`), which builds each day's `start_date` as `today - timedelta(days=day+1)` and `end_date` as `today - timedelta(days=day)`, both formatted with `"%Y-%m-%dT%H:%M:%SZ"` as YouTube's API expects. We’ll map category IDs to their actual names to make the data easier to understand.
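The per-day date windows described above could be generated with a small helper like the one below. This is only a sketch: `daily_windows` and its `now` parameter are hypothetical names, and to use the end of each window the `get_videos` function would also need to pass the API's `publishedBefore` parameter, which the current code does not do.

```python
from datetime import datetime, timedelta, timezone

RFC3339 = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by publishedAfter/publishedBefore


def daily_windows(days, now=None):
    """Return (start, end) RFC 3339 strings for each of the last `days` days.

    Window 0 covers the most recent 24 hours, window 1 the 24 hours before
    that, and so on. `now` can be injected so the function is easy to test.
    """
    now = now or datetime.now(timezone.utc)
    windows = []
    for day in range(days):
        start = now - timedelta(days=day + 1)
        end = now - timedelta(days=day)
        windows.append((start.strftime(RFC3339), end.strftime(RFC3339)))
    return windows


# Fixed reference time so the printed windows are reproducible
fixed_now = datetime(2024, 10, 21, 12, 0, 0, tzinfo=timezone.utc)
for start, end in daily_windows(3, now=fixed_now):
    print(start, "->", end)
```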

We’ll store the data in a CSV file with the following columns:
- `Video ID`: Unique identifier for each video
- `View Count`: Number of views the video has received
- `Like Count`: Number of likes
- `Comment Count`: Number of comments
- `Video Category`: The category the video belongs to
- `Video Published At`: The date and time the video was published

**I've pasted my code below for your reference.**

---

In [1]:
pip install google-api-python-client pandas


Note: you may need to restart the kernel to use updated packages.


In [73]:
import os
import googleapiclient.discovery
import pandas as pd
from datetime import datetime, timezone, timedelta

# The YouTube API key is read from an environment variable so it is not stored in the notebook.
# (The variable name YOUTUBE_API_KEY is my own choice. Please don't share the key: it has a daily quota.)
API_KEY = os.environ["YOUTUBE_API_KEY"]

# Now we initialize the YouTube API client with the API key so we can make requests
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

# Next, let's write a function to fetch videos based on a search query
def get_videos(query, max_results=50, published_after=None, next_page_token=None):
    """
    This function will fetch videos from YouTube based on the query we pass in.
    
    Parameters:
        query (str): The search term to find relevant videos.
        max_results (int): The maximum number of videos to get in one request.
        published_after (str): Only fetch videos published after this timestamp (formatted in RFC 3339).
        next_page_token (str): Here we use pagination to fetch more videos, if available.
    
    Returns:
        response (dict): This will return the YouTube response with video data.
    """
    request = youtube.search().list(
        q=query,  # Here we set the search term
        part="snippet",  # Now we can get the basic info about each video (like title, channel, etc.)
        type="video",  # This will get only videos (not channels or playlists)
        maxResults=max_results,  # This controls how many videos we’re requesting in this call
        publishedAfter=published_after,  # Only fetch videos after a certain date
        order="date",  # Sort by the latest videos
        pageToken=next_page_token  # If there are more pages, this will help us get the next set of results
    )
    response = request.execute()  # Now we execute the request to YouTube
    return response  # Here we return the fetched video data

# Now we can write a function to get the stats for each video (e.g., views, likes, comments)
def get_video_statistics(video_ids):
    """
    This function will grab all the stats for the videos we found, including category information.
    
    Parameters:
        video_ids (list): A list of video IDs for which to get statistics.
    
    Returns:
        response (dict): This returns the video statistics and metadata from YouTube.
    """
    request = youtube.videos().list(
        part="statistics,snippet",  # Fetch both statistics (views, likes) and snippet (title, category)
        id=",".join(video_ids)  # Here we join the video IDs into a single string separated by commas
    )
    response = request.execute()  # Now we execute the request to get the stats
    return response  # Finally, we return the response containing video stats

# This is the main function that collects video data up to a maximum of 50 videos
def collect_video_data(query, max_results_total=50, max_results_per_page=50):
    """
    Here we collect video data for a specific search query, collecting up to max_results_total videos
    published within the last 24 hours.
    
    Parameters:
        query (str): The search query to find relevant videos.
        max_results_total (int): The total number of videos to collect.
        max_results_per_page (int): Maximum videos to fetch per request.
    
    Returns:
        pd.DataFrame: Now we’ll return a DataFrame with all the collected video statistics and metadata.
    """
    video_data = []
    total_videos_collected = 0  # Track the total number of videos collected
    next_page_token = None  # Initialize the page token for pagination
    
    # Get the current time and subtract 24 hours to get the time range for the last 24 hours
    current_time = datetime.now(timezone.utc)
    published_after = (current_time - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ")  # 24 hours ago in RFC 3339 format
    
    # Loop until we reach max_results_total videos
    while total_videos_collected < max_results_total:
        # Fetch videos published within the last 24 hours
        response = get_videos(query, max_results=max_results_per_page, published_after=published_after, next_page_token=next_page_token)
        
        video_ids = [item['id']['videoId'] for item in response.get('items', [])]  # Extract video IDs from the response
        
        if not video_ids:
            break  # If no more videos are found, we stop the loop
        
        # Get statistics for the fetched videos
        stats_response = get_video_statistics(video_ids)
        
        # Loop through each video and collect the data
        for item in stats_response.get('items', []):
            video_id = item['id']
            statistics = item['statistics']  # Get the statistics like views, likes, etc.
            snippet = item['snippet']  # Get the metadata like title, category, etc.
            
            # Append the collected data for each video
            video_data.append({
                'Video ID': video_id,
                'View Count': int(statistics.get('viewCount', 0)),  # Ensure view count is an integer, defaulting to 0 if missing
                'Like Count': int(statistics.get('likeCount', 0)),  # Same for like count
                'Comment Count': int(statistics.get('commentCount', 0)),  # Same for comment count
                'Favorite Count': int(statistics.get('favoriteCount', 0)),  # Handle favorite count similarly
                'Video Category': snippet.get('categoryId', 'Unknown'),  # Store the category ID as a feature
                'Video Published At': snippet.get('publishedAt', 'Unknown')  # Store the publication date
            })
        
        # Update the total number of videos collected
        total_videos_collected += len(video_ids)
        
        # Check if we have reached the max results or no more pages to fetch
        next_page_token = response.get('nextPageToken')
        if not next_page_token or total_videos_collected >= max_results_total:
            break
    
    # Finally, convert the list of video data into a DataFrame
    return pd.DataFrame(video_data[:max_results_total])

# Now we can use all of this to collect data and save it to a CSV
if __name__ == "__main__":
    query = "Youtube"  # Set the search term to something relevant (e.g., "Data Science Tutorial")
    max_videos = 50  # Collect up to 50 videos in total
    
    # Time to collect our video data
    df = collect_video_data(query, max_results_total=max_videos)
    
    # Now we save the collected data to a CSV file
    df.to_csv('YP_Youtube_50_Videos.csv', index=False)
    print("Data collection complete. Saved to YP_Youtube_50_Videos.csv")


Data collection complete. Saved to YP_Youtube_50_Videos.csv


      Video ID  View Count  Like Count  Comment Count  Favorite Count  \
0  _87rquuvcg4           0           0              0               0   
1  ibqkMQN4RyQ          20           2              2               0   
2  tya2UOzu6aM           1           0              0               0   
3  XtqSNNmpk7M          16           3              0               0   
4  29sjCAqOVLw           7           3              0               0   

   Video Category    Video Published At  
0              25  2024-10-21T04:21:53Z  
1              27  2024-10-20T16:01:15Z  
2              27  2024-10-20T12:45:06Z  
3              28  2024-10-20T11:30:18Z  
4              27  2024-10-20T10:40:10Z  


While gathering the data from YouTube, I ran into an issue: the "Video Category" field comes back as a numerical ID. These IDs aren't very helpful for understanding what each category represents, and the `videos` endpoint I am using returns only the ID, not the category name.

To make the data more meaningful and readable, we need to map these numerical category IDs to their actual category names (e.g., '27' becomes 'Education', '28' becomes 'Science & Technology', and so on).

In the next step, I create a dictionary to map these numerical IDs to their corresponding names, then apply this mapping to the dataset so that the "Video Category" field reflects the actual category names.
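One possible alternative to a hand-written dictionary: the Data API has a `videoCategories.list` endpoint that returns category titles for a region. The sketch below assumes the `youtube` client built earlier is passed in; the helper names and the `"US"` region code are assumptions for illustration, and only the parsing step is exercised here, on a hand-written response.

```python
def parse_category_response(response):
    """Turn a videoCategories.list response into an {id: title} dict."""
    return {
        item["id"]: item["snippet"]["title"]
        for item in response.get("items", [])
    }


def fetch_category_mapping(youtube, region_code="US"):
    """Ask the API for the category list of one region.

    `youtube` is the client built with googleapiclient.discovery.build;
    `region_code` is an assumption, since categories are region-specific.
    """
    request = youtube.videoCategories().list(part="snippet", regionCode=region_code)
    return parse_category_response(request.execute())


# Offline check with a hand-written response in the shape the API returns
sample = {
    "items": [
        {"id": "27", "snippet": {"title": "Education"}},
        {"id": "28", "snippet": {"title": "Science & Technology"}},
    ]
}
print(parse_category_response(sample))
```

The resulting dictionary has the same shape as `category_mapping`, so it could be passed to `.map(...)` in the same way.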

In [74]:
# Now, we're going to create a dictionary to map those category IDs to actual category names. 
# This way, we can easily understand what each category represents.
category_mapping = {
    '1': 'Film & Animation',
    '2': 'Autos & Vehicles',
    '10': 'Music',
    '15': 'Pets & Animals',
    '17': 'Sports',
    '18': 'Short Movies',
    '19': 'Travel & Events',
    '20': 'Gaming',
    '21': 'Videoblogging',
    '22': 'People & Blogs',
    '23': 'Comedy',
    '24': 'Entertainment',
    '25': 'News & Politics',
    '26': 'Howto & Style',
    '27': 'Education',
    '28': 'Science & Technology',
    '29': 'Nonprofits & Activism'
    # More categories can be added here if needed; this list covers the most common ones.
}

# Now let's load the CSV file that contains our YouTube video data.
# This file has all the video details that we collected earlier.
df = pd.read_csv('YP_Youtube_50_Videos.csv')

# Here's where the transformation happens. We're going to replace the category IDs with the actual names.
# First, we make sure the 'Video Category' column is in string format, then we map those IDs to the category names.
df['Video Category'] = df['Video Category'].astype(str).map(category_mapping)

# Now that we've updated the data, let's save it back to the same CSV file.
# Note that this overwrites the original file, so the numeric category IDs are replaced by names.
df.to_csv('YP_Youtube_50_Videos.csv', index=False)

# Finally, let's print a message to confirm that everything worked and the file was saved successfully.
print("Category mapping completed. Saved to YP_Youtube_50_Videos.csv")


Category mapping completed. Saved to YP_Youtube_50_Videos.csv


## 3. How the Data Will Be Used and Questions of Interest (1%)

Once we have the data, we’ll analyze how early engagement metrics relate to a video’s future popularity. Specifically, we’ll focus on:

- **Predicting View Counts**: Using the early metrics (views, likes, comments), we’ll try to predict how many views a video will get in the future (like a week later). By building a machine learning model, we’ll look for patterns that can predict future video performance.
  
- **Identifying Key Metrics**: We’ll also try to figure out which engagement metric has the biggest influence on a video’s popularity. For example, is the number of comments more important than the number of likes?

The goal is to create a model that helps us understand and predict how successful a video will be based on its early performance.
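As a first screen for the "key metrics" question, each early metric can be correlated with later views on a log scale. The sketch below is only an illustration under stated assumptions: the numbers are made up (not collected data), the names `views_24h`, `likes_24h`, and `comments_24h` are hypothetical, and a correlation ranking is a starting point rather than proof of influence.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(early, later_views):
    """Rank early metrics by |correlation| with later views, on a log scale.

    log1p tames the heavy right tail that view counts usually have.
    """
    target = [math.log1p(v) for v in later_views]
    scores = {
        name: pearson([math.log1p(v) for v in values], target)
        for name, values in early.items()
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up numbers for five hypothetical videos, NOT collected data
early = {
    "views_24h": [120, 900, 40, 3000, 15],
    "likes_24h": [10, 60, 2, 400, 1],
    "comments_24h": [0, 5, 1, 2, 0],
}
views_after_week = [800, 7000, 150, 40000, 60]

for name, r in rank_metrics(early, views_after_week):
    print(f"{name}: r = {r:.2f}")
```

A regression or tree-based model fitted later would give a better picture, since the early metrics are strongly correlated with each other.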

---

## Conclusion

This project dives into the growing trend of using early data to predict video success on YouTube. By collecting and analyzing video data over time, we hope to uncover useful insights for content creators and marketers. Plus, it’s a great way to apply machine learning techniques to real-world data from one of the biggest video platforms in the world.

In [75]:
# Load the CSV file into a pandas DataFrame
df = pd.read_csv('YP_Youtube_50_Videos.csv')

# Display up to the first 50 rows of the DataFrame
print(df.head(50))

       Video ID  View Count  Like Count  Comment Count  Favorite Count  \
0   id0-2uhkaU0       10763        1700             39               0   
1   M79F47l-qYc        6884         889             18               0   
2   uDVNRYxIMRE       67897         332              0               0   
3   Yfcs9OPrQz8       34864         340              0               0   
4   H_1-uFgQrrg      127473       10765             76               0   
5   OapndFASEvk       30654         502              9               0   
6   VET1qKCyq4c      259325       12911            107               0   
7   NUzGaZkGTu4       70507         381              0               0   
8   I1cPbwuMc3U      228437         622              2               0   
9   E98wLBO1v4I      107973        7895             17               0   
10  RzQpDaa2044     2134518       30997             21               0   
11  GN0r5_ynJ9A      287976       17923            124               0   
12  _D3_9kIFUMs        7357        294

In [83]:
import pandas as pd

# Define the function to extract video IDs from the CSV with an optional parameter to limit the number of rows
def get_video_ids_from_csv(file_path, n=None):
    """
    Extract the first n video IDs from a CSV file.
    
    Parameters:
        file_path (str): Path to the CSV file.
        n (int, optional): Number of rows to extract. If None, extract all rows.
    
    Returns:
        list: A list of video IDs.
    """
    # Load the CSV into a pandas DataFrame
    df = pd.read_csv(file_path)
    
    # If n is provided, extract only the first n rows, otherwise extract all rows
    if n is not None:
        video_ids = df['Video ID'].head(n).tolist()  # Extract first n rows
    else:
        video_ids = df['Video ID'].tolist()  # Extract all rows if n is not provided
    
    return video_ids

# Example usage with your file path
csv_file_path = '/Users/yipengandrewwang/DS3000-3/Project1/YP_Youtube_50_Videos.csv'

# Extract the first 50 video IDs
video_ids_list = get_video_ids_from_csv(csv_file_path, n=50)
print(video_ids_list)


['id0-2uhkaU0', 'M79F47l-qYc', 'uDVNRYxIMRE', 'Yfcs9OPrQz8', 'H_1-uFgQrrg', 'OapndFASEvk', 'VET1qKCyq4c', 'NUzGaZkGTu4', 'I1cPbwuMc3U', 'E98wLBO1v4I', 'RzQpDaa2044', 'GN0r5_ynJ9A', '_D3_9kIFUMs', 'ihtUyR9qjEs', '6iG4Zp8HLOA', 'p8SIBaDl9PE', 'IPwVtRypNbI', 'paTQ-p8K0BA', 'YhKkv5-4KAg', 'oUyle9aTwdI', 'LqIo3IwSWPY', 'Mq2M3Bz8zz0', 'T8GlYj2xDAs', 'dggwhiuSkqU', 'T5BERlOouTY', 'vSrhCetz1gA', 'fzcxFTd9o3I', 'uZUuwaJLM10', '_rwUsP3LF48', 's2rQtoyKb0c', 'NPh1XB-n7q4', '-oVJ7zALvD8', 'uoqCKKQLiQY', 'i5cehh3wkVg', 'IfYdsEwTBkI', 'VUIwbN2VTdE', 'HsQZE0iatg8', 'rtLGwJF9Jxc', 'OHd7DCBzCfE', 'N6Fo-gdaqTA', '7gaoy3uyfik', 'HBTCZ4gEetY', 'c_jI07Nw2d0', 'UCak6RsyF_o', 'o1Dn4hajLY8', 'EZXLSQDdeCk', 'hSE0tzHS684', '9hXbhuf-SzA', 'pD-ZHDaTm78', 'JEoLzhWp3mY']


In [None]:
import os
import time
import pandas as pd
from datetime import datetime, timezone, timedelta
import googleapiclient.discovery

# Initialize YouTube API client
# (the key is read from an environment variable; the name YOUTUBE_API_KEY is my own choice)
API_KEY = os.environ["YOUTUBE_API_KEY"]
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

# Function to fetch video statistics for a list of video IDs
def get_video_statistics(video_ids):
    """
    This function fetches video statistics like views, likes, comments for a list of video IDs.
    
    Parameters:
        video_ids (list): A list of video IDs for which to get statistics.
    
    Returns:
        response (dict): The YouTube response containing the video statistics.
    """
    request = youtube.videos().list(
        part="statistics,snippet",  # Get both statistics and metadata
        id=",".join(video_ids)  # Join the video IDs into a comma-separated string
    )
    response = request.execute()
    return response

# The main function for collecting data every 2 hours
def collect_consecutive_sessions_data(video_ids, sessions=10, output_file="YP_Youtube_Consecutive_Sessions.csv"):
    """
    Collects video statistics for the same set of videos every 2 hours over a specified number of sessions
    and saves the results to a CSV file after each session.
    
    Parameters:
        video_ids (list): List of video IDs for which to fetch statistics.
        sessions (int): The number of consecutive 2-hour sessions to collect data (default is 10 sessions).
        output_file (str): The name of the output CSV file to store the collected data.
    
    Returns:
        None: The data is saved to the output CSV file after each session.
    """
    
    # Create a list to store the data for all sessions
    all_sessions_data = []
    
    # Collect data for each video over consecutive sessions (every 2 hours)
    for session in range(1, sessions + 1):
        print(f"Collecting data for Session {session} (every 2 hours)...")
        session_data = []  # Store data for each video in this session
        
        # Fetch statistics for the same video IDs
        stats_response = get_video_statistics(video_ids)
        
        # Loop through each video and collect the stats for the current session
        for item in stats_response.get('items', []):
            video_id = item['id']
            statistics = item['statistics']
            snippet = item['snippet']
            
            # Store video stats for the current session
            session_data.append({
                'Video ID': video_id,
                'Session': session,
                'View Count': int(statistics.get('viewCount', 0)),
                'Like Count': int(statistics.get('likeCount', 0)),
                'Comment Count': int(statistics.get('commentCount', 0)),
                'Favorite Count': int(statistics.get('favoriteCount', 0)),
                'Video Category': snippet.get('categoryId', 'Unknown'),
                'Video Published At': snippet.get('publishedAt', 'Unknown'),
            })
        
        # Append the current session's data to the all_sessions_data list
        all_sessions_data.extend(session_data)
        
        # Convert the collected data into a pandas DataFrame
        df = pd.DataFrame(session_data)
        
        # Save the data to CSV file after each session (appending if it exists)
        if session == 1:
            # For the first session, write a new CSV file with the header
            df.to_csv(output_file, mode='w', header=True, index=False)
        else:
            # For subsequent sessions, append to the existing file without writing the header
            df.to_csv(output_file, mode='a', header=False, index=False)
        
        # Print summary of collected data for this session
        print(f"Session {session} data collected. Collected data for {len(session_data)} videos.")
        
        # Wait 2 hours before the next session; there is no need to wait after the last one
        if session < sessions:
            print("Waiting 2 hours before the next collection...")
            time.sleep(2 * 60 * 60)  # Sleep for 2 hours

# Example usage of the function
if __name__ == "__main__":
  
    sessions_to_collect = 18  # Collect data for 18 consecutive 2-hour sessions
    
    # Collect video data over consecutive sessions for the provided video IDs
    collect_consecutive_sessions_data(video_ids_list, sessions=sessions_to_collect, output_file="YP_Youtube_Consecutive_Sessions4.csv")
    print("Data collection for consecutive 2-hour sessions complete. Saved to YP_Youtube_Consecutive_Sessions4.csv")


Collecting data for Session 1 (every 3 hours)...
Session 1 data collected. Collected data for 50 videos.
Waiting 2 hours before the next collection...
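Once several sessions are in the CSV, the long table (one row per video per session) has to be turned into one row per video with an early value and a later value. The sketch below shows one way to do that with the standard library only; `first_and_last_views` is a hypothetical helper, and the rows are hand-made in the CSV's shape rather than real data.

```python
from collections import defaultdict


def first_and_last_views(rows):
    """Map each video ID to (views at its first session, views at its last).

    `rows` are dicts with the 'Video ID', 'Session' and 'View Count' columns
    written by the collection loop (e.g. read back with csv.DictReader).
    """
    by_video = defaultdict(list)
    for row in rows:
        by_video[row["Video ID"]].append((int(row["Session"]), int(row["View Count"])))

    result = {}
    for video_id, observations in by_video.items():
        observations.sort()  # order by session number
        result[video_id] = (observations[0][1], observations[-1][1])
    return result


# Hand-made rows in the CSV's shape (illustrative values, not real data)
rows = [
    {"Video ID": "a", "Session": "1", "View Count": "10"},
    {"Video ID": "b", "Session": "1", "View Count": "5"},
    {"Video ID": "a", "Session": "2", "View Count": "25"},
    {"Video ID": "b", "Session": "2", "View Count": "5"},
]
print(first_and_last_views(rows))
```

The first value can serve as the early feature and the second as the prediction target.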


In [None]:
print(f"Built-in len function: {len}")  # Check if len is still the built-in function


In [None]:
del len  # This will delete the overridden 'len' variable


In [21]:
from datetime import datetime

def convert_to_dt_obj(string):
    date_obj = datetime.strptime(string, '%Y-%m-%dT%H:%M:%SZ')
    return date_obj

str1 = '2024-10-17T15:36:09Z'
str2 = '2024-10-13T15:36:09Z'

date1 = convert_to_dt_obj(str1)
date2 = convert_to_dt_obj(str2)

print(date1>date2)




True
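For the "first 24 hours" question it may help to know how old each video was when its statistics were collected. The sketch below parses the API timestamp as timezone-aware UTC (the `strptime` call above yields a naive datetime) and computes the age in hours; `parse_utc`, `age_in_hours`, and the collection time used here are assumptions for illustration.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse an API timestamp such as '2024-10-17T15:36:09Z' as aware UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def age_in_hours(published_at, collected_at):
    """Hours elapsed between publication and the moment of collection."""
    return (collected_at - parse_utc(published_at)).total_seconds() / 3600


# Hypothetical collection time, exactly one day after publication
collected = datetime(2024, 10, 18, 15, 36, 9, tzinfo=timezone.utc)
print(age_in_hours("2024-10-17T15:36:09Z", collected))  # 24.0
```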


In [25]:
import pandas as pd

df = pd.read_csv('YP_Youtube1_4.csv',header=0)

df.head()

Unnamed: 0,Video ID,View Count,Like Count,Comment Count,Favorite Count,Video Category,Video Published At
0,c9-Q4DMC4Rs,1856,139,0,0,Education,2024-10-17T15:36:09Z
1,68bWRSO8PYc,628,70,1,0,Education,2024-10-17T15:36:09Z
2,yRIU2nzIQl8,4225,132,0,0,Education,2024-10-17T15:36:09Z
3,BA_ZcBYWCDY,308,44,0,0,Education,2024-10-17T15:36:09Z
4,6ZLLfvPKTao,451,29,3,0,Education,2024-10-17T15:36:09Z


In [26]:
len = df.shape[0]
for i in range(len):
    df.loc[i,'Video Published At'] = convert_to_dt_obj(df['Video Published At'][i])

df.head()


Unnamed: 0,Video ID,View Count,Like Count,Comment Count,Favorite Count,Video Category,Video Published At
0,c9-Q4DMC4Rs,1856,139,0,0,Education,2024-10-17 15:36:09
1,68bWRSO8PYc,628,70,1,0,Education,2024-10-17 15:36:09
2,yRIU2nzIQl8,4225,132,0,0,Education,2024-10-17 15:36:09
3,BA_ZcBYWCDY,308,44,0,0,Education,2024-10-17 15:36:09
4,6ZLLfvPKTao,451,29,3,0,Education,2024-10-17 15:36:09


In [27]:
print(type(df['Video Published At'][0]))

<class 'datetime.datetime'>


In [35]:
# Use a name other than `sorted` so the built-in function is not shadowed
sorted_df = df.sort_values(by='Video Published At', ascending=True)
sorted_df.head()

Unnamed: 0,Video ID,View Count,Like Count,Comment Count,Favorite Count,Video Category,Video Published At
4499,KQkA8Qk_xjg,168,16,0,0,Recreation,2024-09-18 15:36:09
4402,_gKU5DhYUn0,183,5,0,0,Recreation,2024-09-18 15:36:09
4401,Q8CeS_E_ewA,1832,130,0,0,Info,2024-09-18 15:36:09
4400,ic74OdsEM0A,1758,0,3,0,Info,2024-09-18 15:36:09
4399,7ZTWh6GGoPw,455,8,0,0,Info,2024-09-18 15:36:09


In [36]:
# category_mapping = {
#     '1': 'Film & Animation',
#     '2': 'Autos & Vehicles',
#     '10': 'Music',
#     '15': 'Pets & Animals',
#     '17': 'Sports',
#     '18': 'Short Movies',
#     '19': 'Travel & Events',
#     '20': 'Gaming',
#     '21': 'Videoblogging',
#     '22': 'People & Blogs',
#     '23': 'Comedy',
#     '24': 'Entertainment',
#     '25': 'News & Politics',
#     '26': 'Howto & Style',
#     '27': 'Education',
#     '28': 'Science & Technology',
#     '29': 'Nonprofits & Activism'
#     # More categories can be added here if needed, but this should cover the most common ones. I didn't find a better list/map yet if  you do you  can add here
# }

In [37]:
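# Note the comma between 'Gaming' and 'Videoblogging': without it Python joins them into one string
recreation = ['Film & Animation', 'Music', 'Pets & Animals', 'Sports', 'Short Movies', 'Travel & Events',
              'Gaming', 'Videoblogging', 'People & Blogs', 'Comedy', 'Entertainment']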

info = ['News & Politics','Howto & Style','Education','Science & Technology','Nonprofits & Activism']

In [38]:
def re_categorize(string,recreation_list):
    if string in recreation_list:
        return 'Recreation'
    else:
        return 'Info'



# Apply the re-categorization to every row at once (no row-count variable needed)
df['Video Category'] = df['Video Category'].apply(lambda category: re_categorize(category, recreation))

df.head()



Unnamed: 0,Video ID,View Count,Like Count,Comment Count,Favorite Count,Video Category,Video Published At
0,c9-Q4DMC4Rs,1856,139,0,0,Info,2024-10-17 15:36:09
1,68bWRSO8PYc,628,70,1,0,Info,2024-10-17 15:36:09
2,yRIU2nzIQl8,4225,132,0,0,Info,2024-10-17 15:36:09
3,BA_ZcBYWCDY,308,44,0,0,Info,2024-10-17 15:36:09
4,6ZLLfvPKTao,451,29,3,0,Info,2024-10-17 15:36:09


In [39]:
grouped_df = df.groupby(['Video Category'])

