# Trending YouTube Video


### Data Cleaning and Preprocessing

Imports pandas, a Python library for data analysis and manipulation.

In [57]:
import pandas as pd

Read the CSV file into a Pandas DataFrame:


In [58]:
df = pd.read_csv('USvideos.csv')

In [59]:
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int64 
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

Drops duplicate rows from the DataFrame.

In [61]:
# Remove duplicate rows
df = df.drop_duplicates()
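
Before dropping duplicates it can help to see how many rows are affected. The following is a minimal sketch on a hypothetical four-row frame (the values are made up, not taken from `USvideos.csv`); it relies on the fact that `drop_duplicates()` with no arguments only removes rows that repeat an earlier row in every column.

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only.
sample = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "b2"],
    "views": [100, 100, 250, 300],
})

# duplicated() flags rows that repeat an earlier row in every column.
n_duplicates = int(sample.duplicated().sum())

# Only the second ("a1", 100) row is removed; the two "b2" rows differ in views.
deduped = sample.drop_duplicates()

print(n_duplicates, len(sample), len(deduped))
```

Rows that share a `video_id` but differ in any other column are kept, which is why the same video can still appear many times after this step.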

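Fills missing values in the four numeric count columns (`views`, `likes`, `dislikes`, `comment_count`) with 0. The `df.info()` output above reports no nulls in these columns, so this step acts as a safeguard rather than changing the data.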

In [62]:
# Handle missing values
df['views'] = df['views'].fillna(0)
df['likes'] = df['likes'].fillna(0)
df['dislikes'] = df['dislikes'].fillna(0)
df['comment_count'] = df['comment_count'].fillna(0)
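
As a rough illustration of the same idea, the sketch below counts missing values per column and then fills all four count columns in one assignment. The frame is hypothetical and deliberately contains gaps; the list name `count_cols` is an assumption introduced for this example.

```python
import pandas as pd

# Columns treated as numeric counts (name of this list is illustrative).
count_cols = ["views", "likes", "dislikes", "comment_count"]

# Hypothetical rows with a few gaps; not taken from the real dataset.
sample = pd.DataFrame({
    "views": [10.0, None, 30.0],
    "likes": [1.0, 2.0, None],
    "dislikes": [0.0, 0.0, 0.0],
    "comment_count": [None, 5.0, 6.0],
})

# Count missing values per column before and after filling.
missing_before = sample[count_cols].isna().sum()
sample[count_cols] = sample[count_cols].fillna(0)
missing_after = sample[count_cols].isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Checking `isna().sum()` first makes it clear whether the fill actually changes anything.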

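Converts `category_id` to an integer type and parses `publish_time` into datetimes. According to the `df.info()` output above, `category_id` is already `int64`, so only the `publish_time` conversion changes a dtype here.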


In [63]:
# Convert data types
df['category_id'] = df['category_id'].astype('int')
df['publish_time'] = pd.to_datetime(df['publish_time'])
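
The `trending_date` column is left as a string by the cell above. The preview rows show `17.14.11` next to a `publish_time` of `2017-11-13`, which suggests a year.day.month layout. The sketch below parses both columns on that assumption, using two rows copied from the preview; the explicit `format` string is an inference from the sample, not something the dataset documents.

```python
import pandas as pd

# Two rows copied from the df.head() preview.
sample = pd.DataFrame({
    "trending_date": ["17.14.11", "17.14.11"],
    "publish_time": ["2017-11-13T17:13:01.000Z", "2017-11-13T07:30:00.000Z"],
})

# Assumed layout: two-digit year, then day, then month.
sample["trending_date"] = pd.to_datetime(sample["trending_date"], format="%y.%d.%m")

# ISO 8601 strings ending in "Z" are parsed as timezone-aware UTC timestamps.
sample["publish_time"] = pd.to_datetime(sample["publish_time"])

print(sample.dtypes)
```

Note that the parsed `publish_time` is timezone-aware while `trending_date` is not, so the two would need to be aligned before subtracting one from the other.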

Renames the `category_id` column to `category_id_numeric`, making it explicit that the column holds a numeric code rather than a category name.

In [64]:
# Standardize feature names
df = df.rename(columns={'category_id': 'category_id_numeric'})

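Adds a `video_category` feature by taking the first tag from the pipe-separated `tags` column. This is a rough proxy for the category, not YouTube's own category name.

In [65]:
# Add new features: first pipe-separated tag, with surrounding quotes removed
df['video_category'] = df['tags'].apply(lambda x: x.split('|')[0].strip('"'))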
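
Splitting a tag string on `"` and splitting it on `|` give different results when a video has several tags. The sketch below compares the two on strings shaped like those in the preview above (shortened for illustration); the helper name `first_tag` is hypothetical.

```python
def first_tag(tags: str) -> str:
    """Return the first pipe-separated tag with surrounding quotes removed."""
    return tags.split("|")[0].strip('"')


# Tag strings shaped like those in the preview (shortened, illustrative).
single = "SHANtell martin"
multi = 'racist superman|"rudy"|"mancuso"'

# Splitting on the double quote keeps the trailing pipe separator.
print(repr(multi.split('"')[0]))  # 'racist superman|'

# Splitting on the pipe yields a clean first tag in both cases.
print(repr(first_tag(multi)))     # 'racist superman'
print(repr(first_tag(single)))    # 'SHANtell martin'
```

Splitting on the pipe avoids first tags that differ only by a trailing `|`.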

Removes outliers from the DataFrame by dropping rows where `views` exceeds 10 million, or where `likes` or `dislikes` exceeds 1 million.

In [66]:
# Remove outliers
df = df.drop(df[(df['views'] > 10000000) | (df['likes'] > 1000000) | (df['dislikes'] > 1000000)].index)
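
The same filter can be written with a named boolean mask, which some readers find easier to follow. This is a sketch on a hypothetical four-row frame using the thresholds from the cell above; the name `is_outlier` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; values chosen so that two of them cross a threshold.
sample = pd.DataFrame({
    "views": [500_000, 12_000_000, 2_000_000, 9_000_000],
    "likes": [10_000, 300_000, 1_500_000, 50_000],
    "dislikes": [100, 2_000, 3_000, 400],
})

# True where any of the three thresholds is exceeded.
is_outlier = (
    (sample["views"] > 10_000_000)
    | (sample["likes"] > 1_000_000)
    | (sample["dislikes"] > 1_000_000)
)

# Keep only the rows that are not flagged.
filtered = sample[~is_outlier]

print(int(is_outlier.sum()), "rows flagged;", len(filtered), "rows kept")
```

Counting `is_outlier.sum()` before filtering shows how many rows the fixed thresholds remove.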

In [67]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

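# Keep the target column out of the feature matrix to avoid leaking it into the features
X = df.drop(columns=['views'])
y = df['views']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)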

A quarter of the rows (`test_size=0.25`) are held out as the test set, with `views` as the target. The training set will be used to train the machine learning model, and the test set will be used to evaluate its performance.
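
The cell above uses scikit-learn. For illustration only, the sketch below does a comparable 75/25 random split with plain pandas (it is not `train_test_split` itself and will not select the same rows) and shows the `views` target being kept out of the feature columns. The eight rows are hypothetical.

```python
import pandas as pd

# Hypothetical eight-row frame; values are illustrative only.
sample = pd.DataFrame({
    "likes": [10, 20, 30, 40, 50, 60, 70, 80],
    "comment_count": [1, 2, 3, 4, 5, 6, 7, 8],
    "views": [100, 200, 300, 400, 500, 600, 700, 800],
})

# Separate the features from the target so the target is not also a feature.
X = sample.drop(columns=["views"])
y = sample["views"]

# Draw 25% of the row labels at random for the test set; the rest form the training set.
test_idx = X.sample(frac=0.25, random_state=42).index
X_test, y_test = X.loc[test_idx], y.loc[test_idx]
X_train, y_train = X.drop(index=test_idx), y.drop(index=test_idx)

print(len(X_train), len(X_test))
```

Whichever tool performs the split, the training and test rows should not overlap and the target should not appear among the features.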

## Identify the top 10 most popular products in a video dataset

In [68]:
from collections import Counter

Converts the `video_id` column, a pandas Series whose values serve as the product IDs here, to a list.

In [69]:
# Extract the product IDs from the video data
product_ids = df['video_id'].tolist()

Counts the number of times each product ID appears in the list. `Counter` is a `dict` subclass: its keys are the product IDs and its values are the number of times each ID appears. A higher count means the video appears in more rows of the dataset.

In [70]:
# Count the number of times each product ID appears in the video data
product_counts = Counter(product_ids)
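
A small sketch of how `Counter` behaves, using hypothetical IDs rather than real `video_id` values:

```python
from collections import Counter

# Hypothetical IDs; "a1" appears three times, "b2" twice, "c3" once.
ids = ["a1", "b2", "a1", "c3", "a1", "b2"]
counts = Counter(ids)

print(counts["a1"])               # 3
print(counts["zz"])               # 0: missing keys count as zero instead of raising KeyError
print(isinstance(counts, dict))   # True: Counter is a dict subclass
```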


Sorts the product IDs by count in descending order and keeps the first 10.

In [71]:
# Sort the product IDs by count, in descending order
top_10_most_popular_product_ids = sorted(product_counts, key=lambda x: product_counts[x], reverse=True)[:10]

In [72]:
# Print the top 10 most popular product IDs
for product_id in top_10_most_popular_product_ids:
    print(product_id)

WIV3xNz8NoM
YI3tsmFsrOg
QBL8IRJ5yHU
t4pRQ0jn23Q
r-3iathMo7o
NBSAQenU2Bk
vjSohj-Iclc
2PH7dK6SLC8
0zZ0Y_UZRBw
pFc6I0rgmgY
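
The list above shows only the IDs. As an alternative sketch, `Counter.most_common(n)` returns the top `n` entries together with their counts. The example below uses hypothetical IDs and takes the top 2 to stay short.

```python
from collections import Counter

# Hypothetical IDs; "a1" appears three times, "b2" twice, "c3" once.
ids = ["a1", "b2", "a1", "c3", "a1", "b2"]
counts = Counter(ids)

# Approach used in the notebook: sort the keys by their count.
top_2_sorted = sorted(counts, key=lambda x: counts[x], reverse=True)[:2]

# Alternative: most_common returns (id, count) pairs, highest count first.
top_2_pairs = counts.most_common(2)
top_2_ids = [pid for pid, _ in top_2_pairs]

for pid, n in top_2_pairs:
    print(pid, n)
```

Printing the counts alongside the IDs makes it easier to judge how far apart the top entries are.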


## Authors

[Viktoriia Popovych](https://www.linkedin.com/in/viktoriia-popovych-4b478b262/)