## Filtering the datasets for size reduction

For obvious size reasons, we need to filter out the datasets to a reasonable size before using them.

We know already that we won't need all the videos from all categories. Indeed, we only need the "scientific" and "artistic" videos. So we can rule out categories that do not fall in these domains. Example of categories we can exclude are:
- Gaming
- People & Blogs
- Sports
- Autos & Vehicles
- News & Politics
- Pets & Animals
- Nonprofits & Activism
- Not categorized

Which leaves us with these categories **(TO DISCUSS)**:
- Music
- Entertainment
- Howto & Style
- Education
- Film and Animation
- Science & Technology
- Comedy
- Travel & Events

In [1]:
import pandas as pd
import sys

In [2]:
df_comments1 = pd.read_feather('comments1.feather')
print(df_comments1.shape)

(103068702, 5)


In [6]:
df_comments1.drop(['level_0', 'index'], axis=1, inplace=True)

In [7]:
df_comments1.head()

Unnamed: 0,author,video_id,likes,replies
0,13,jRzgdv1Z4m4,0,0
1,16,Q_zxBVOF0YI,0,0
2,231158,Q_zxBVOF0YI,0,0
3,16,FjmbFh40xp8,0,0
4,125641,FjmbFh40xp8,0,0


### Filtering the video dataset: yt_metadata_en.jsonl.gz (~16 Go)
Steps:
1. Download the dataset
2. Load the dataset by chunks
3. Keep only videos falling in the said categories
4. Save the preprocessed dataset using `pickle`

In [2]:
categories = [
    'Music',
    'Entertainment',
    'Howto & Style',
    'Education',
    'Film and Animation', 
    'Science & Technology',
    'Comedy',
    'Travel & Events'
]

In [3]:
videos_metadata_filepath = 'data/yt_metadata_en.jsonl.gz'

max_chunk_count = 0     # number of chunk to load (0 to load all the chunks)
chunk_size = 2000000     # number of rows to load each chunk

year = '2016'

dfs = []
chunk_num = 1
for df_json in pd.read_json(videos_metadata_filepath, compression="infer", chunksize=chunk_size, lines=True):
    # remove title and description
    print('with title and description: %d' % sys.getsizeof(df_json))
    df_json.drop(['description', 'title'], axis=1 , inplace=True)
    print('without title and description: %d' % sys.getsizeof(df_json))

    # filter to keep only needed categories
    print('before category filtering: %d' % sys.getsizeof(df_json))
    df_json = df_json[df_json.categories.isin(categories)]
    print('after category filtering: %d' % sys.getsizeof(df_json))

    # filter to keep only needed categories
    print('before year filtering: %d' % sys.getsizeof(df_json))
    df_json = df_json[df_json.upload_date.str.startswith(year)]
    print('after year filtering: %d' % sys.getsizeof(df_json))
    


    dfs.append(df_json)
    print('Chunk n°%d completed. Loaded %d/%d videos after filtering.' % (chunk_num, len(df_json), chunk_size))

    chunk_num += 1
    if chunk_num == max_chunk_count + 1:
        break

video_metadata = pd.concat(dfs).reset_index()
filename = 'filtered/video_metadata.feather'
print('Saving %d rows (%d Bytes) to %s' % (len(video_metadata), sys.getsizeof(video_metadata), filename))
video_metadata.to_feather(filename)

with title and description: 4316331632
without title and description: 1367646599
before category filtering: 1367646599
after category filtering: 648824351
before year filtering: 648824351
after year filtering: 69876685
Chunk n°1 completed. Loaded 105026/2000000 videos after filtering.
with title and description: 4120973806
without title and description: 1356637204
before category filtering: 1356637204
after category filtering: 636251962
before year filtering: 636251962
after year filtering: 77071125
Chunk n°2 completed. Loaded 110607/2000000 videos after filtering.
with title and description: 4141894851
without title and description: 1357848927
before category filtering: 1357848927
after category filtering: 612400736
before year filtering: 612400736
after year filtering: 76508366
Chunk n°3 completed. Loaded 111517/2000000 videos after filtering.
with title and description: 4192431665
without title and description: 1332780830
before category filtering: 1332780830
after category filterin

In [None]:
sys.getsizeof(video_metadata)

In [None]:
len(video_metadata)

In [4]:
# get memory usage

# These are the usual ipython objects, including this one you are creating
ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

# Get a sorted list of the objects and their sizes
memory_usage = sorted([(x, sys.getsizeof(globals().get(x))) for x in dir() if not x.startswith('_') and x not in sys.modules and x not in ipython_vars], key=lambda x: x[1], reverse=True)
# display the 3 most memory consuming variables
memory_usage[:3]

[('video_metadata', 2882460736), ('df_json', 29398746), ('dfs', 376)]

In [5]:
# clear memory for further analysis
%xdel video_metadata
%xdel df_json


### Filtering the comments dataset: youtube_comments.tsv.gz (~70 Go)
1. Download the dataset
2. Load the dataset by chunks
3. Only keep comments from videos present in the preprocessed dataset above
4. Save the preprocessed dataset using `pickle`