# 1. Creating the Dataframe
### Loading the JSON file + Creating the CSV file

In [386]:
import json
import csv

In [387]:
with open('watch-history.json') as watch_history:
    watch_history_JSON = json.load(watch_history)

In [388]:
watch_history_CSV = open('watch-history.csv', 'w', newline='')
csv_writer = csv.writer(watch_history_CSV)

### Some necessary data cleaning

The original data contains a few entries that are Google ads, survey questions, or videos that are now unavailable or removed. These entries either don't represent my watch history or lack information I need later on, so I will filter them out before writing the CSV and reading it in.

A note on the unavailable/removed videos in this step: the following code won't eliminate all of them, but it will prevent bugs further on. Some videos that have been taken down end up in the CSV file with fewer columns than the other videos, which misaligns the rows. I will deal with the remaining ones later in the code; for now, this keeps the solution simple.

In [389]:
# filtering out Google ads
filtered_watch_history_JSON = [video for video in watch_history_JSON if (video["activityControls"] != ["Web \u0026 App Activity", "YouTube watch history", "YouTube search history"])]
filtered_watch_history_JSON = [item for item in filtered_watch_history_JSON if "details" not in item]

# filtering out survey questions
filtered_watch_history_JSON = [video for video in filtered_watch_history_JSON if (video['title'] != "Answered survey question")]

# filtering out publicly unavailable videos
filtered_watch_history_JSON = [video for video in filtered_watch_history_JSON if "titleUrl" in video]
filtered_watch_history_JSON = [video for video in filtered_watch_history_JSON if not (video["title"].startswith("Watched https://www.youtube.com"))]

### Creating the CSV file

In [390]:
is_header = True

for entry in filtered_watch_history_JSON:
    if is_header:
        header = entry.keys()
        csv_writer.writerow(header)
        is_header = False
    csv_writer.writerow(entry.values())

watch_history_CSV.close()
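The loop above takes the header from the first entry and writes every other entry's values in whatever order they come, so an entry with a missing key gives a shorter, misaligned row. Here is a minimal sketch of one way around that, assuming a fixed column list and using `csv.DictWriter` from the standard library. The two sample entries and the in-memory buffer are made up for illustration; they are not from my data.

```python
import csv
import io

# hypothetical entries: the second one has no "subtitles" key
entries = [
    {"title": "Watched A", "titleUrl": "https://www.youtube.com/watch?v=aaa", "subtitles": "x"},
    {"title": "Watched B", "titleUrl": "https://www.youtube.com/watch?v=bbb"},
]

# fixed column order, independent of any single entry
fieldnames = ["title", "titleUrl", "subtitles"]

buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(entries)  # missing keys are filled with restval

# read the text back to check that every row has the same number of fields
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

With this approach each row always has one field per column, and entries with extra keys are trimmed instead of raising an error.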

### Reading in the CSV file + Removing unused columns

In [391]:
import pandas as pd

video_df = pd.read_csv("watch-history.csv")

video_df = video_df.drop(labels = ['header', 'products', 'activityControls'], axis = 1)

display(video_df.iloc[0:5])

Unnamed: 0,title,titleUrl,subtitles,time
0,Watched OS V0006 Jani EN 16比9 14s,https://www.youtube.com/watch?v=UmKMxivCDmA,"[{'name': 'March 7th', 'url': 'https://www.you...",2024-02-08T00:56:14.512Z
1,Watched what does bbb stand for?,https://www.youtube.com/watch?v=2RkiJqrGPrc,"[{'name': 'YustShortz', 'url': 'https://www.yo...",2024-02-07T23:18:58.245Z
2,Watched My INSANE KRABER Play in a Pro Apex Lo...,https://www.youtube.com/watch?v=wSHCZ5sdaFs,"[{'name': 'iiTzTimmy', 'url': 'https://www.you...",2024-02-07T20:00:42.079Z
3,Watched she got bullied on CS:GO :(,https://www.youtube.com/watch?v=VuhVeU7fGNo,"[{'name': 'Fitz', 'url': 'https://www.youtube....",2024-02-07T16:47:24.189Z
4,Watched JUNGLE TEEMO LOCKED IN,https://www.youtube.com/watch?v=BYDrxt3vhdU,"[{'name': 'Pianta', 'url': 'https://www.youtub...",2024-02-07T16:36:49.137Z


# 2. Using the YouTube Data API v3
With this API, we will extract more information on each video, including the assigned category, its description, and its tags!

### Building the API

In [392]:
# imports a py file that includes my private API key
import config

api_key = config.api_key

In [393]:
from googleapiclient.discovery import build

service = build('youtube', 'v3', developerKey=api_key)

### Removing Publicly Unavailable Videos
As mentioned previously, we will be removing any videos that are now taken down, as it makes accessing information on them impossible. 

To accomplish this, we request each video URL in our dataframe and check whether a certain pattern (e.g. "Not Found" or "Unauthorized") appears in the response text. Along the way, we keep track of the dataframe index of each such video. Finally, we drop them all at once through the .drop method in pandas.

Now, in order to speed up this process -- as running the is_taken_down function for every data entry would take a long time -- I will be using parallel processing (more specifically the concurrent.futures package).

Do not ask me how this package works, I just looked up the documentation

In [394]:
index_urls = []
for index, row in video_df.iterrows():
    index_urls.append([index, row['titleUrl']]) 

In [395]:
import requests

# using parallel processing to make this code faster
import concurrent.futures

Not_Found = 'Not Found'
Unauthorized = "Unauthorized"
Forbidden = "Forbidden"

indexes_to_drop = []

def append_to_list(index_url):
    # index_url is an [index, url] pair; the video counts as unavailable when
    # the oEmbed response text contains one of the markers above
    text = requests.get(f"https://www.youtube.com/oembed?url={index_url[1]}", allow_redirects=False).text
    if Not_Found in text or Unauthorized in text or Forbidden in text:
        indexes_to_drop.append(index_url[0])

def is_taken_down(index_urls):
    # leaving the with-block waits for every submitted task to finish
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for index_url in index_urls:
            executor.submit(append_to_list, index_url)

# sequential (slow) equivalent, kept for reference
# def is_taken_down(index_urls):
#     for index_url in index_urls:
#         append_to_list(index_url)


In [None]:
is_taken_down(index_urls)

print(indexes_to_drop)

video_df = video_df.drop(index=indexes_to_drop)
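For anyone curious how the thread pool part works, here is a small sketch of the same idea using `executor.map`, which hands back one result per input in input order, so no shared list is needed. The names `find_unavailable` and `fake_checker` are made up for this example, and the checker is a stand-in so the sketch runs without network access; a real one would make the oEmbed request shown above.

```python
import concurrent.futures

def find_unavailable(index_urls, looks_taken_down, max_workers=8):
    """Return the indexes whose URL the checker flags, in input order."""
    urls = [url for _, url in index_urls]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        flags = list(executor.map(looks_taken_down, urls))
    return [index for (index, _), flagged in zip(index_urls, flags) if flagged]

# stand-in checker (hypothetical): flags URLs ending in "gone"
def fake_checker(url):
    return url.endswith("gone")

sample = [
    [0, "https://www.youtube.com/watch?v=ok"],
    [3, "https://www.youtube.com/watch?v=gone"],
]
print(find_unavailable(sample, fake_checker))
```

A side effect of `executor.map` is that an exception raised inside the checker shows up when the results are collected, instead of being silently dropped as it is with a bare `submit`.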

### Grabbing information on videos using API
First, we will fetch YouTube's list of video categories through the requests library. The API reports each video's category as an ID, so we need this mapping to see which category name each ID corresponds to.

In [398]:
# keep the key out of the URL: it is passed through params from config instead
category_url = "https://www.googleapis.com/youtube/v3/videoCategories"
params = {
    'key': api_key,
    'part': 'snippet',
    'regionCode': 'US',
}

response = requests.get(category_url,
                        params = params)

categories = {}

for item in response.json()['items']:
    categories[item['id']] = item['snippet']['title']

categories

{'1': 'Film & Animation',
 '2': 'Autos & Vehicles',
 '10': 'Music',
 '15': 'Pets & Animals',
 '17': 'Sports',
 '18': 'Short Movies',
 '19': 'Travel & Events',
 '20': 'Gaming',
 '21': 'Videoblogging',
 '22': 'People & Blogs',
 '23': 'Comedy',
 '24': 'Entertainment',
 '25': 'News & Politics',
 '26': 'Howto & Style',
 '27': 'Education',
 '28': 'Science & Technology',
 '29': 'Nonprofits & Activism',
 '30': 'Movies',
 '31': 'Anime/Animation',
 '32': 'Action/Adventure',
 '33': 'Classics',
 '34': 'Comedy',
 '35': 'Documentary',
 '36': 'Drama',
 '37': 'Family',
 '38': 'Foreign',
 '39': 'Horror',
 '40': 'Sci-Fi/Fantasy',
 '41': 'Thriller',
 '42': 'Shorts',
 '43': 'Shows',
 '44': 'Trailers'}

Now we can use the API to gather information on video categories, their descriptions, and their tags. However, there is just one issue: the YouTube Data API has a daily quota (10,000 units by default), and every request counts against it. Given that our dataframe is almost 9000 videos long, this doesn't leave much room for error. So, we will reduce the quota used by requesting videos in batches rather than one at a time.

In [399]:
# this will be a list of batches, with each batch containing video IDs 
videoID_batches = [[]]

# the batch size
batch_size = 25

# enumerate gives each row's position; the dataframe index has gaps after the
# earlier drop, so it can't be used to count rows
for position, curr_url in enumerate(video_df["titleUrl"]):
    
    # the video ID is characterized by the string of characters 
    # found after the "=" in the corresponding video URL.
    videoID = curr_url[curr_url.rfind("=")+1:]
    
    # append to the most recent batch
    videoID_batches[-1].append(videoID)
    
    # create a new batch every {batch_size} iterations
    if (position + 1) % batch_size == 0:
        videoID_batches.append([])

In [400]:
print(f"Number of batches of size {batch_size}: {len(videoID_batches)}")

Number of batches of size 25: 932
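Taking everything after the last "=" works as long as the video ID is the last query parameter in the URL. As a hedged alternative, here is a sketch that reads the `v` parameter explicitly with `urllib.parse`. The helper name `extract_video_id` and the second URL (with an extra `t` parameter) are assumptions for illustration; I haven't checked whether such URLs actually appear in the export.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the value of the "v" query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None

print(extract_video_id("https://www.youtube.com/watch?v=UmKMxivCDmA"))
# hypothetical URL with a trailing parameter; slicing after the last "=" would give "30s"
print(extract_video_id("https://www.youtube.com/watch?v=UmKMxivCDmA&t=30s"))
```

Both calls return the same ID, and a URL without a `v` parameter gives `None` instead of a wrong string.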


In [401]:
category_col = []
description_col = []
tags_col = []

def get_info(videoID_batch):
    request = service.videos().list(part="snippet",
                                    id=videoID_batch)
    response = request.execute()
    
    # for each videoID in the current batch, append the information to their corresponding column/list
    # print(len(response['items']))
    # print(len(videoID_batch))
    for i in range(len(videoID_batch)):
        information = response['items'][i]['snippet']
        category_col.append(categories[information['categoryId']])
        description_col.append(information['description'])
        
        # some videos don't have a tags section, so just leave it blank without skipping it
        if "tags" in information:
            tags_col.append(information['tags'])
        else:
            tags_col.append([''])

In [370]:
prefix = "http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v="
for error in videoID_batches[438]:
    print(f"{prefix}{error}")

czElEWEJwxI
http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=czElEWEJwxI
{
  "error": {
    "code": 403,
    "message": "SSL is required to perform this operation.",
    "status": "PERMISSION_DENIED"
  }
}

bJRXjaJgyIs
http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=bJRXjaJgyIs
{
  "error": {
    "code": 403,
    "message": "SSL is required to perform this operation.",
    "status": "PERMISSION_DENIED"
  }
}

wqWcwND98k0
http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=wqWcwND98k0
{
  "error": {
    "code": 403,
    "message": "SSL is required to perform this operation.",
    "status": "PERMISSION_DENIED"
  }
}

J5H-LCrN-5E
http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=J5H-LCrN-5E
{
  "error": {
    "code": 403,
    "message": "SSL is required to perform this operation.",
    "status": "PERMISSION_DENIED"
  }
}

aCuBBUdJIbk
http://www.youtube.com/oembed?url=https://www.youtube.com/watch?v=aCuBBUdJIbk
{
  "error": {

In [402]:
category_col = []
description_col = []
tags_col = []

count = 0
for videoID_batch in videoID_batches:
    print(count)
    get_info(videoID_batch)
    count += 1

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44


IndexError: list index out of range
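A likely cause of this IndexError is that `response['items']` holds fewer entries than the batch has IDs, which I assume happens when some videos in the batch can no longer be returned by the API. Indexing `response['items'][i]` for every ID in the batch then runs off the end of the list. Below is a minimal sketch of matching results to IDs by their `id` field instead of by position. The helper name `align_snippets` and the fake response items are made up for illustration.

```python
def align_snippets(video_ids, items):
    """Return one snippet (or None) per requested ID, in request order."""
    by_id = {item["id"]: item["snippet"] for item in items}
    return [by_id.get(video_id) for video_id in video_ids]

# made-up response: the second requested ID is missing from the items
requested = ["aaa", "bbb", "ccc"]
fake_items = [
    {"id": "aaa", "snippet": {"categoryId": "20", "description": "", "tags": ["x"]}},
    {"id": "ccc", "snippet": {"categoryId": "10", "description": "song"}},
]

for video_id, snippet in zip(requested, align_snippets(requested, fake_items)):
    if snippet is None:
        # keep a placeholder so the columns stay as long as the dataframe
        print(video_id, "-> no data returned")
    else:
        print(video_id, "->", snippet["categoryId"], snippet.get("tags", [""]))
```

Keeping a `None` placeholder for missing videos means the category, description, and tags lists stay the same length as the dataframe, so they can still be inserted as columns afterwards.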

In [365]:
video_df.insert(3,"category", category_col)
video_df.insert(5,"description", description_col)
video_df.insert(4,"tags", tags_col)

# 3. Final Cleaning Before Data Analysis

### Fixing title Column

In [366]:
video_df.iloc[:4]

Unnamed: 0,title,titleUrl,subtitles,category,tags,time,description
0,Watched OS V0006 Jani EN 16比9 14s,https://www.youtube.com/watch?v=UmKMxivCDmA,"[{'name': 'March 7th', 'url': 'https://www.you...",Gaming,[],2024-02-08T00:56:14.512Z,
1,Watched what does bbb stand for?,https://www.youtube.com/watch?v=2RkiJqrGPrc,"[{'name': 'YustShortz', 'url': 'https://www.yo...",Gaming,[],2024-02-07T23:18:58.245Z,#shorts\nReuploaded Video from either TikTok o...
2,Watched My INSANE KRABER Play in a Pro Apex Lo...,https://www.youtube.com/watch?v=wSHCZ5sdaFs,"[{'name': 'iiTzTimmy', 'url': 'https://www.you...",Gaming,"[apex legends, apex legends new season, apex l...",2024-02-07T20:00:42.079Z,"LIKE & SUBSCRIBE IF YOU ENJOYED, New videos da..."
3,Watched she got bullied on CS:GO :(,https://www.youtube.com/watch?v=VuhVeU7fGNo,"[{'name': 'Fitz', 'url': 'https://www.youtube....",Gaming,"[counter, strike, counter strike 2, funny, mom...",2024-02-07T16:47:24.189Z,it was kinda funny though...\n\nFOLLOW ME EVER...


As you can see, each video's title starts with "Watched ", which I remove with the following code.

In [164]:
# Not modular, but gets the job done: drops the first 8 characters ("Watched ")
video_df.title = video_df.title.str[8:]

In [165]:
video_df.iloc[:4]

Unnamed: 0,title,titleUrl,subtitles,category,tags,time,description
0,OS V0006 Jani EN 16比9 14s,https://www.youtube.com/watch?v=UmKMxivCDmA,"[{'name': 'March 7th', 'url': 'https://www.you...",Gaming,[],2024-02-08T00:56:14.512Z,
1,what does bbb stand for?,https://www.youtube.com/watch?v=2RkiJqrGPrc,"[{'name': 'YustShortz', 'url': 'https://www.yo...",Gaming,[],2024-02-07T23:18:58.245Z,#shorts\nReuploaded Video from either TikTok o...
2,My INSANE KRABER Play in a Pro Apex Lobby 😱,https://www.youtube.com/watch?v=wSHCZ5sdaFs,"[{'name': 'iiTzTimmy', 'url': 'https://www.you...",Gaming,"[apex legends, apex legends new season, apex l...",2024-02-07T20:00:42.079Z,"LIKE & SUBSCRIBE IF YOU ENJOYED, New videos da..."
3,she got bullied on CS:GO :(,https://www.youtube.com/watch?v=VuhVeU7fGNo,"[{'name': 'Fitz', 'url': 'https://www.youtube....",Gaming,"[counter, strike, counter strike 2, funny, mom...",2024-02-07T16:47:24.189Z,it was kinda funny though...\n\nFOLLOW ME EVER...


In [151]:
mask_1

0        False
1        False
2        False
3        False
4        False
         ...  
24659     True
24660     True
24662     True
24663     True
24664     True
Name: subtitles, Length: 23380, dtype: bool

### Fixing subtitles Column

In the subtitles column, you can find the corresponding Youtuber's name, which may be important information. But in the shown format, it's neither accessible nor pretty. Let's turn this column into a Youtuber name column.

First off, the values look like JSON, but they were written to the CSV as the string form of a Python list, so pandas reads them back as plain strings. Let's fix that with the ast package.
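To show what that conversion does on a single value, here is a small sketch. The string is a shortened, made-up cell in the same shape as the subtitles column (the channel URL is a placeholder). Because it uses single quotes, `json.loads` rejects it, while `ast.literal_eval` parses it as a Python literal.

```python
import ast
import json

# made-up cell value in the same shape as the subtitles column
raw = "[{'name': 'Fitz', 'url': 'https://www.youtube.com/channel/example'}]"

try:
    json.loads(raw)
except json.JSONDecodeError:
    # single-quoted strings are not valid JSON
    print("not valid JSON")

parsed = ast.literal_eval(raw)  # back to a real list of dicts
name = parsed[0]["name"]
print(type(parsed).__name__, name)
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and so on), so it is a safer choice than `eval` for text read from a file.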

In [173]:
sub_1 = "https"

mask_1 = video_df['title'].str.startswith(sub_1)

# .copy() gives an independent dataframe, so assigning to a column below
# does not raise pandas' SettingWithCopyWarning
video_df = video_df[~mask_1].copy()

In [174]:
import ast

video_df['subtitles'] = video_df['subtitles'].apply(ast.literal_eval)


In [175]:
subtitles = pd.json_normalize(video_df['subtitles'])

Now, let's turn this into a column of the dataframe

In [176]:
youtuber_name_col = []

def get_youtuber_name(entry):
    youtuber_name_col.append(entry[0]['name'])
    
video_df['subtitles'].apply(get_youtuber_name)

0        None
1        None
2        None
3        None
4        None
         ... 
24568    None
24569    None
24570    None
24571    None
24572    None
Name: subtitles, Length: 23294, dtype: object

In [177]:
video_df

Unnamed: 0,title,titleUrl,subtitles,category,tags,time,description
0,OS V0006 Jani EN 16比9 14s,https://www.youtube.com/watch?v=UmKMxivCDmA,"[{'name': 'March 7th', 'url': 'https://www.you...",Gaming,[],2024-02-08T00:56:14.512Z,
1,what does bbb stand for?,https://www.youtube.com/watch?v=2RkiJqrGPrc,"[{'name': 'YustShortz', 'url': 'https://www.yo...",Gaming,[],2024-02-07T23:18:58.245Z,#shorts\nReuploaded Video from either TikTok o...
2,My INSANE KRABER Play in a Pro Apex Lobby 😱,https://www.youtube.com/watch?v=wSHCZ5sdaFs,"[{'name': 'iiTzTimmy', 'url': 'https://www.you...",Gaming,"[apex legends, apex legends new season, apex l...",2024-02-07T20:00:42.079Z,"LIKE & SUBSCRIBE IF YOU ENJOYED, New videos da..."
3,she got bullied on CS:GO :(,https://www.youtube.com/watch?v=VuhVeU7fGNo,"[{'name': 'Fitz', 'url': 'https://www.youtube....",Gaming,"[counter, strike, counter strike 2, funny, mom...",2024-02-07T16:47:24.189Z,it was kinda funny though...\n\nFOLLOW ME EVER...
4,JUNGLE TEEMO LOCKED IN,https://www.youtube.com/watch?v=BYDrxt3vhdU,"[{'name': 'Pianta', 'url': 'https://www.youtub...",Gaming,"[League of Legends, Pianta, tenmo, tenmo playe...",2024-02-07T16:36:49.137Z,Tenmo to GM - in this series we are going for ...
...,...,...,...,...,...,...,...
24568,Jodi Talks How to Handle Sleeping With Someone...,https://www.youtube.com/watch?v=0shrXdj3OqQ,"[{'name': 'OTV Munchables! ', 'url': 'https://...",Entertainment,[],2023-06-04T16:40:55.615Z,Jodi's stream: https://www.twitch.tv/quarterja...
24569,Sakura's standing up for the group and opposin...,https://www.youtube.com/watch?v=o2gUBwIoh34,"[{'name': 'iie kiyeon', 'url': 'https://www.yo...",Gaming,[],2023-06-04T16:40:39.435Z,LE SSERAFIM will Comeback on October 17\n\n#le...
24570,aespa 에스파 'Spicy' Recording Behind The Scenes,https://www.youtube.com/watch?v=j8S8wmVKhr4,"[{'name': 'aespa', 'url': 'https://www.youtube...",Music,"[aespa, 에스파, 카리나, 윈터, 지젤, 닝닝, 마이월드, 마이, 수록곡, k...",2023-06-04T16:40:13.337Z,"aespa's 3rd Mini Album ""MY WORLD"" is out!\nLis..."
24571,The bagel effect,https://www.youtube.com/watch?v=D8KzunHi2EQ,"[{'name': 'GelNox', 'url': 'https://www.youtub...",Entertainment,"[the bagel effect, spider verse spot origin st...",2023-06-04T16:40:01.116Z,the bagel effect spider verse spot origin stor...


In [178]:
video_df = video_df.drop(labels = ['subtitles'], axis = 1)

In [179]:
video_df.insert(2, 'Youtuber Name', youtuber_name_col)

While we're at it, let's rename the columns to better fit a naming convention

In [180]:
video_df.columns = ['Video Title', 'Video URL', 'Youtuber Name', 'Category', 'Tags', 'Time', 'Description']

In [181]:
video_df

Unnamed: 0,Video Title,Video URL,Youtuber Name,Category,Tags,Time,Description
0,OS V0006 Jani EN 16比9 14s,https://www.youtube.com/watch?v=UmKMxivCDmA,March 7th,Gaming,[],2024-02-08T00:56:14.512Z,
1,what does bbb stand for?,https://www.youtube.com/watch?v=2RkiJqrGPrc,YustShortz,Gaming,[],2024-02-07T23:18:58.245Z,#shorts\nReuploaded Video from either TikTok o...
2,My INSANE KRABER Play in a Pro Apex Lobby 😱,https://www.youtube.com/watch?v=wSHCZ5sdaFs,iiTzTimmy,Gaming,"[apex legends, apex legends new season, apex l...",2024-02-07T20:00:42.079Z,"LIKE & SUBSCRIBE IF YOU ENJOYED, New videos da..."
3,she got bullied on CS:GO :(,https://www.youtube.com/watch?v=VuhVeU7fGNo,Fitz,Gaming,"[counter, strike, counter strike 2, funny, mom...",2024-02-07T16:47:24.189Z,it was kinda funny though...\n\nFOLLOW ME EVER...
4,JUNGLE TEEMO LOCKED IN,https://www.youtube.com/watch?v=BYDrxt3vhdU,Pianta,Gaming,"[League of Legends, Pianta, tenmo, tenmo playe...",2024-02-07T16:36:49.137Z,Tenmo to GM - in this series we are going for ...
...,...,...,...,...,...,...,...
24568,Jodi Talks How to Handle Sleeping With Someone...,https://www.youtube.com/watch?v=0shrXdj3OqQ,OTV Munchables!,Entertainment,[],2023-06-04T16:40:55.615Z,Jodi's stream: https://www.twitch.tv/quarterja...
24569,Sakura's standing up for the group and opposin...,https://www.youtube.com/watch?v=o2gUBwIoh34,iie kiyeon,Gaming,[],2023-06-04T16:40:39.435Z,LE SSERAFIM will Comeback on October 17\n\n#le...
24570,aespa 에스파 'Spicy' Recording Behind The Scenes,https://www.youtube.com/watch?v=j8S8wmVKhr4,aespa,Music,"[aespa, 에스파, 카리나, 윈터, 지젤, 닝닝, 마이월드, 마이, 수록곡, k...",2023-06-04T16:40:13.337Z,"aespa's 3rd Mini Album ""MY WORLD"" is out!\nLis..."
24571,The bagel effect,https://www.youtube.com/watch?v=D8KzunHi2EQ,GelNox,Entertainment,"[the bagel effect, spider verse spot origin st...",2023-06-04T16:40:01.116Z,the bagel effect spider verse spot origin stor...


# 4. Outputting into csv to Analyze in SQL

In [186]:
video_df.to_csv('watch-history.csv', index=False)

Tableau Public does not read this CSV file in properly: some columns contain commas, which Tableau treats as the start of a new column. So, for the Tableau version, we will drop those columns from the CSV.

In [187]:
import pandas as pd

video_df_tableau = pd.read_csv('watch-history.csv')
video_df_tableau = video_df_tableau.drop(labels=['Tags', 'Description'], axis = 1)
video_df_tableau.to_csv('watch-history-tableau.csv', index=True, index_label='id')
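Another option that might make dropping the columns unnecessary is to quote every field when writing the file. I haven't verified this against Tableau Public, so treat it as a sketch only: it shows, with the standard `csv` module and a made-up row, that fields containing commas and line breaks survive a round trip when they are quoted.

```python
import csv
import io

# made-up row whose last two fields contain commas and a line break
row = ["JUNGLE TEEMO LOCKED IN", "['League of Legends', 'Pianta']", "line one,\nline two"]

buffer = io.StringIO()  # stands in for the output file
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)

# reading the text back yields the same three fields
read_back = next(csv.reader(io.StringIO(buffer.getvalue())))
print(len(read_back), read_back == row)
```

If quoting turns out not to help, line breaks inside the descriptions would be my next guess for what confuses the import.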