# Characterizing YouTube Popularity

Created by three former PayPal employees in 2005 and later acquired in 2006 by Google, YouTube has tranformed from a simple video hosting service into the world's largest entertainment platform. A community for creators everywhere, YouTube has proven able to support its creators financially as well through Google's targeted advertising program AdSense, and people are using that to quit their jobs and focus on their YouTube channels as a full-time career. In fact, in an age where anyone can pick up their phones and start a vlog, the question of what makes a YouTube channel successful is hotly debated. Today, we'll be looking at channels across several genres (lifestyle, food & travel, gaming, beauty & fashion) to get an idea of what being a successful YouTuber looks like.

#### Q: What does a successful YouTube channel look like?

To break this down even further, we're going to answer this question by carrying out the following steps:

1. [Dataset Assembly](#1.-Dataset-Assembly)
2. [Feature Engineering](#2.-Feature-Engineering)
3. [Model Training](#3.-Model-Training)
4. [Cluster Visualization](#4.-Cluster-Visualization)
5. [Future Directions](#5.-Future-Directions)

Without further ado, let's begin!

## 1. Dataset Assembly

To assemble our dataset, we're going to query the YouTube Data API's search function for our genres (lifestyle, food & travel, gaming, beauty & fashion). Before we do this, however, it would be helpful to define the 3 major types of channels we expect to find.

### Channel Types

#### Popular

A **popular** channel has popular videos (large view/video count), a large following (large subscriber count), and an active community (buzzing comment section) around it. Popular channels may not necessarily have the highest view count per video, but accumulate views through consistent content release and the community around them.

#### Growing

A **growing** channel is like a popular channel except it only started gaining traction recently so its community activity may not be as consistent. Many channels with growing view counts and subscriber counts exist on YouTube, but the key distinction for the growing channels we've defined is the consistency of its content release schedules since channels only become popular if they engage their audience consistently.

#### Small

A **small** channel is lesser known within the YouTube community. It generally has low view counts and subscriber counts relative to popular and growing channels.

### Sampling Bias

To understand the motivation behind asking our question, consider music as a genre on YouTube. Vevo is a music video hosting service that partners with huge records labels like the Warner Music Group, and as a part of their contracts, Vevo helps artists like Eminem and Rihanna manage their YouTube channels. As we've defined above, channels like EminemVEVO and RihannaVEVO would be considered **popular** channels, but their success is largely derived from the artists' success in the music industry and other platforms like Spotify. 

Because of this, we're not extremely interested in investigating what makes these YouTube channels successful because we suspect that a big part of their success is independent of the YouTube channels themselves. That's why we're not considering music as a genre when we're assembling this dataset and why we're not looking into other genres like late night talk shows because most of them present the same confounding variables that influence the success of their channels.

### YouTube Data API

For querying the YouTube database, we'll be using the YouTube Data API. Here is a link to its official documentation: https://developers.google.com/youtube/v3/.

More specifically, we're interested in using a method called **search: list**, which will allow us to query the database with parameters like keywords, location, etc. More information about this method can be found here: https://developers.google.com/youtube/v3/docs/search/list.

We start by instantiating an object that establishes an authenticated connection to the YouTube Data Api.

In [3]:
from httplib2 import Http
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import pandas as pd
import dateutil.parser
import numpy as np
from scipy import stats
import sklearn
from sklearn.decomposition import PCA

In [2]:
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
API_KEY = "AIzaSyDI8cZyqHiXp1uh9zr5qPRKe4-bhhaPYUw" # use your Google Developers Console API key

def get_authenticated_service():
    return build(API_SERVICE_NAME, API_VERSION, http=Http(), developerKey=API_KEY)

client = get_authenticated_service()

Now, we can use this object to make requests to the YouTube Data API. Below, for example, we have a code snippet that gets 25 channels about ukuleles.

In [2]:
params = {
    "part": "snippet",
    "maxResults": "25",
    "q": "ukulele",
    "type": "channel"
}

response = client.search().list(**params).execute()
print(response['items'][0]['snippet']['title'])
print(response['items'][0]['id']['channelId'])

The Ukulele Teacher
UC1HlihY-iNtOemAlYQq3GXQ


We can extend the demo code above to query for up to 50 (limit) channels related to each of our topics and add them to a pandas dataframe where we store our dataset. To ensure we obtain as much data for preprocessing as possible, we're considering 3 orderings: relevance, video count, and view count. This will give us something closer to 700 channels rather than 300.

In [3]:
def create_dataset(topics, n=50):
    dataset = {
        'Channel ID': [],
        'Channel Name': [],
        'Description': [],
        'Created': [],
        'Genre': []
    }
    
    params = {
        "part": "snippet",
        "maxResults": n,
        "relevanceLanguage": "en",
        "type": "channel"
    }

    channels = set()
    orders = ['relevance', 'viewCount', 'videoCount']
    
    for topic in topics:
        params['q'] = topic
        for order in orders:
            params['order'] = order
            response = client.search().list(**params).execute()

            for channel in response['items']:
                channel_id = channel['id']['channelId']
                channel_name = channel['snippet']['title']
                channel_description = channel['snippet']['description']
                channel_created = channel['snippet']['publishedAt']
                channel_created = channel_created[:channel_created.find('T')]

                if channel_id not in channels:
                    dataset['Channel ID'].append(channel_id)
                    dataset['Channel Name'].append(channel_name)
                    dataset['Description'].append(channel_description)
                    dataset['Created'].append(channel_created)
                    dataset['Genre'].append(topic)
                    
                    channels.add(channel_id)

    return pd.DataFrame.from_dict(dataset)

topics = ['Vlog', 'Food', 'Gaming', 'Beauty', 'Fashion', 'Fitness']
init_dataset = create_dataset(topics)
init_dataset.tail()

Unnamed: 0,Channel ID,Channel Name,Created,Description,Genre
689,UCXFazK2sRYkuNgZ_Xs7BKUg,Batman Fitness,2016-11-06,Salut à tous je suis Batman fitness de la vérité.,Fitness
690,UCtE1l7hJ1helcsyxoquNDvQ,NBO FITNESS,2012-06-29,"YOUTUBE FITNESS PERSONALITY, TRAINER, FAMILY M...",Fitness
691,UCrF6sYzdIgo64yi5GlcSsDw,Pain & Gain Fitness,2011-03-28,Management: Christian Torres Video Edit: Denis...,Fitness
692,UCYi4JJcjDYtNGLX0N6e7g5A,Men's Health & Fitness Tips,2015-07-15,"Men Health & Fitness and Sexual Tips , This Ch...",Fitness
693,UCLDw7ummSJnILbMZn2Azf2g,ON THE RADAR,2011-12-04,ON THE RADAR IS YOUR ULTIMATE RESOURCE FOR Hea...,Fitness


#### Feature Extension

Awesome! We have now assembled a dataset around channels in the topics that we care about, but the dataset is lacking in features. Ideally, we would have additional information about the channel so we can characterize them into the 3 types we defined earlier.

Below, we will extend our dataset to include the following features about each channel:
- view count
- subscriber count
- video count

We'll take these simple channel statistics and engineering some features like Views/Subscriber, which will give us an idea of how much each subscriber to the channel is contributing to the views on that channel.

In [73]:
def get_channel_stats(channel, params):
    params['id'] = channel
    res = client.channels().list(**params).execute()
    return res['items'][0]['statistics']

def extend_features(dataset):
    params = {'part': 'statistics'}
    extended = {
        'Channel ID': [],
        'View Count': [],
        'Subscriber Count': [],
        'Video Count': [],
        'Views/Subscriber': [],
        'Views/Video': [],
        'Subscriber/Video': []
    }
        
    for channel_id in dataset['Channel ID']:
        channel_stats = get_channel_stats(channel_id, params)
                
        viewCount = int(channel_stats['viewCount'])
        subscriberCount = int(channel_stats['subscriberCount'])
        videoCount = int(channel_stats['videoCount'])
        
        if videoCount == 0 or viewCount == 0 or subscriberCount == 0:
            continue
        
        extended['Channel ID'].append(channel_id)
        extended['View Count'].append(viewCount)
        extended['Subscriber Count'].append(subscriberCount)
        extended['Video Count'].append(videoCount)
        extended['Views/Subscriber'].append(viewCount/subscriberCount)
        extended['Views/Video'].append(viewCount/videoCount)
        extended['Subscriber/Video'].append(subscriberCount/videoCount)
    
    extended_dataset = pd.DataFrame.from_dict(extended).set_index('Channel ID')
    return dataset.join(extended_dataset, on="Channel ID", how="inner")
    
channel_features_dataset = extend_features(init_dataset)
channel_features_dataset.tail()

#### Other Features

#### 1. Community Activity

Earlier, we identified a channel's community as an important metric in measuring its popularity, and a channel's comment section best reflects this. For this we will query each channels last 50 videos and average across the comments for these videos.

#### 2. Content Consistency

Additionally, our sampling targets consistent uploaders because we assume that the uploader is active so we will use these videos to look for how far apart the dates between uploads are on average.

#### 3. Growth Rate

To also characterize whether a channel is growing or declining, we will use the channel's percentage change in view counts across these videos by fitting a linear regression line onto these datapoints.

#### 4. Favorability

Lastly, each video has a statistics on the number of likes and dislikes, which is the only venue through which users can give direct binary feedback on the content uploaders post.

In [13]:
sec_to_day = lambda x: np.round(x/86400, 1)

def get_video_stats(video, params):
    try:
        params['id'] = video
        res = client.videos().list(**params).execute()
        stats = res['items'][0]['statistics']
        stats['publishedAt'] = res['items'][0]['snippet']['publishedAt']
        return (True, stats)
    except:
        return (False, None)
    
def get_channel_videos(channel, n=50):
    channel_params = {
        'part': 'contentDetails',
        'id' : channel
    }
    res = client.channels().list(**channel_params).execute()
    upload_id = res['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    playlist_params = {
        'part': 'contentDetails',
        'playlistId' : upload_id,
        'maxResults' : n
    }
    uploads = client.playlistItems().list(**playlist_params).execute()
    return uploads['items']

def calc_channel_video_stats(dates, views, likes, dislikes, comments):
    video_stats = dict()
    
    views, likes, dislikes, comments = np.array(views), np.array(likes), np.array(dislikes), np.array(comments)
    dates = list(map(lambda x: dateutil.parser.parse(x), dates))
        
    # average views per upload
    video_stats['Views/Upload'] = np.mean(views)
    # average likes per upload
    video_stats['Likes/Upload'] = np.mean(likes)
    # average dislikes per upload
    video_stats['Dislikes/Upload'] = np.mean(dislikes)
    # average comments per upload
    video_stats['Comments/Upload'] = np.mean(comments)
    # like to views ratio
    video_stats['Likes/View'] = np.mean(likes/views)*100
    # dislikes to views ratio
    video_stats['Dislikes/View'] = np.mean(dislikes/views)*100
    # comments to views ratio
    video_stats['Comments/View'] = np.mean(comments/views)*100
    # upload frequency
    time_diff = (dates[-1] - dates[0]).total_seconds()
    video_stats['Days/Upload'] = sec_to_day(time_diff / len(dates))

    upload_days = list(map(lambda x: sec_to_day((x - dates[0]).total_seconds()), dates))
    m, b, r, p, err = stats.linregress(upload_days, views)
    # growth rate (views)
    video_stats['Growth Rate'] = (m / views[0]) * 100
    
    return video_stats
    
def get_channel_video_stats(channel):
    published_dates = []
    view_counts = []
    like_counts = []
    dislike_counts = []
    comment_counts = []
    
    video_params = { 'part': 'snippet,statistics' }
    
    channel_videos = get_channel_videos(channel)
    
    if len(channel_videos) == 0:
        return None
        
    for i in range(len(channel_videos)-1, -1, -1):
        video_info = channel_videos[i]
        video_id = video_info['contentDetails']['videoId']
        
        success, video_stats = get_video_stats(video_id, video_params)

        # publishedAt, viewCount, likeCount, dislikeCount, commentCount
        if not success:
            continue
            
        published_dates.append(video_stats['publishedAt'])
        
        view_counts.append(int(video_stats['viewCount']) if 'viewCount' in video_stats else 0)
        like_counts.append(int(video_stats['likeCount']) if 'likeCount' in video_stats else 0)
        dislike_counts.append(int(video_stats['dislikeCount']) if 'dislikeCount' in video_stats else 0)
        comment_counts.append(int(video_stats['commentCount']) if 'commentCount' in video_stats else 0)
                        
    return calc_channel_video_stats(published_dates, view_counts, like_counts, dislike_counts, comment_counts) 

In [72]:
# Channels with no videos returned are filtered out (ex: UCOpNcN46UbXVtpKMrmU4Abg)

def extend_video_features(data):
    extended_video = { 'Channel ID': [] }
        
    for channel_id in data['Channel ID']:            
        channel_video_stats = get_channel_video_stats(channel_id)
        
        if channel_video_stats == None:
            continue
                    
        extended_video['Channel ID'].append(channel_id)
        for video_stat in channel_video_stats:
            if video_stat not in extended_video:
                extended_video[video_stat] = []
            extended_video[video_stat].append(channel_video_stats[video_stat])
                                        
    extended_dataset = pd.DataFrame.from_dict(extended_video).set_index('Channel ID')
    
    return data.join(extended_dataset, on="Channel ID", how="inner")

dataset = extend_video_features(channel_features_dataset)
dataset.tail()

And that's it!

We now have 15 features on 700 channels on its most recent 50 videos that we'll store in a csv file "youtube-data.csv", and we're ready to extract the important features and train our model.

In [15]:
dataset.to_csv('youtube-data.csv')

## 2. Feature Engineering

Now that we have all our assembled dataset, we need to apply some normalization methods to ensure that our KMeans++ clustering can give us the most meaningful results possible. To do so, we will apply feature scaling to bring the variance of our clusters closer together and apply kernel PCA as well to increase the dimensionality of our data so PCA can identify non-linear principal components.

#### Feature Normalization
On of the most common ways to standardize data across a dimension is to reduce the dimension to zero mean and unit variance. Let's apply the formula below for each feature dimension and see how our data changes. 

<center>$X^{(m)} = \frac{X^{(m)} - \overline{X^{(m)}}}{\sigma}$</center>

In [21]:
orig_data = pd.read_csv('youtube-data.csv')

dataset = orig_data.drop(['Unnamed: 0'], axis=1)
dataset = dataset.replace([np.inf, -np.inf], np.nan)
dataset = dataset.dropna(axis=0, how="any")
dataset = dataset.reset_index(drop=True)

def standardize_features(X):
    return (X - np.mean(X)) / np.std(X)

def normalize_data(data):
    result = data.copy(deep=True)
    for column in data.columns:
        if data[column].dtype == object:
            continue
        result[column] = standardize_features(data[column])
    return result

norm_dataset = normalize_data(dataset)
norm_dataset.tail()

Unnamed: 0,Channel ID,Channel Name,Created,Description,Genre,Subscriber Count,Subscriber/Video,Video Count,View Count,Views/Subscriber,Views/Video,Comments/Upload,Comments/View,Days/Upload,Dislikes/Upload,Dislikes/View,Growth Rate,Likes/Upload,Likes/View,Views/Upload
604,UCjKr2Ro-5X0BM6UptvhWvEw,The Ultimate Fashion History,10/26/12,FASHION HISTORY LIKE YOU'VE NEVER LEARNED IT B...,Fashion,-0.431295,-0.341956,-0.448291,-0.311062,-0.629982,-0.311746,-0.272409,-0.563164,0.144017,-0.234135,-0.253131,-0.004679,-0.3267,0.398713,-0.236878
605,UCS-HyxHT3A_PLXC8zlX73Yw,Fashion Television,11/10/14,Fashion Television is considered the leading a...,Fashion,-0.435253,-0.363157,-0.14375,-0.306632,0.783178,-0.318565,-0.272409,-0.563164,-0.597007,-0.235434,-0.287314,0.001706,-0.331377,-0.922151,-0.239402
606,UCoc8tpGCY1wrp8pV7mI0scA,H&M,3/7/07,Welcome to H&M's official YouTube page. Explor...,Fashion,-0.29438,-0.270066,-0.286767,-0.043644,1.182617,0.005803,-0.272409,-0.563164,0.367722,-0.065151,-0.105284,0.475871,-0.289295,-1.015866,0.88016
607,UCkkVe_1wVVhT_w_BU6CKgXw,Fashion9tv,8/10/16,fashion9tv channel is reference channel for ge...,Fashion,-0.416403,-0.362802,1.041181,-0.286283,0.387297,-0.318919,-0.272409,-0.563164,-0.708859,-0.235288,0.016218,-0.010982,-0.331367,-1.258717,-0.239284
608,UCPlqUggohC1vldNi3biHkvA,Fitness Incentive,10/18/10,"Watch, learn and enjoy Fitness incentive instr...",Fitness,-0.439641,-0.364862,-0.204322,-0.3132,1.446982,-0.323333,-0.272409,-0.563164,-0.051725,-0.235467,-0.414546,0.002601,-0.331404,-1.356894,-0.239401


#### Kernel PCA

Next, we will apply a Gaussian RBF (Radial Basis Function) kernel to map our data into a higher dimensional space so we can identify a potentially non-linear lower dimensional subspace for our principal components. This is done by applying the function below to each pair of feature vectors.

<center>$K(x,y) = exp(-\frac{||x-y||^2}{2\sigma^2})$</center>

Additionally, we'll also apply a median distance trick and use that as the standard deviation across our samples.

In [22]:
from sklearn.decomposition import PCA

def calc_median_dist(X):
    dists = []
    n,m = X.shape
    for i in range(n):
        for j in range(i+1, n):
            dist = np.sqrt(np.sum((X[i,:]-X[j,:])**2))
            dists.append(dist)
    return np.median(dists)

def drop_non_numerical(data):
    drop_cols = []
    for column in data.columns:
        if data[column].dtype == object:
            drop_cols.append(column)
    return data.drop(drop_cols, axis=1)

def kernel_PCA(data, n=6):
    num_data = drop_non_numerical(data)
    num_data = num_data.as_matrix()
    
    median_dist = calc_median_dist(num_data)
    gamma = 1/(2*median_dist**2)
    rbf = np.exp(-gamma * (np.sum(num_data**2, axis=-1)[:,None] + np.sum(num_data**2, axis=-1)[None,:] - 2*np.dot(num_data, num_data.T)))
    pca = PCA(n_components=n)
    return pca.fit_transform(rbf)
    
kernel_data = kernel_PCA(norm_dataset)
print(kernel_data.shape)

(609, 6)


## 3. Model Training

Now that we have transformed our dataset into several potentially non-linear principal components, we will train a clustering model to see what major clusters these channels form.

We chose the KMeans++ model because we think that similar channels (popular, growing, declining) should share similar properties based on these identified metrics. We initially chose to train on $k=4$ clusters because we wanted an additional bucket for channels we cannot successfully label as one of our 3 types. 

In [23]:
# From dzq homework 5
import math

def distance_matrix(X):
    m,n = X.shape
    M = np.zeros((m,m))
    for i in range(m):
        for j in range(m):
            M[i,j] = np.sum((X[i]-X[j])**2)
    return M

class KMeans:
    def init_centers(self, X, k):
        centers = []
        m = X.shape[0]
        M = distance_matrix(X)
        pos = np.arange(len(X))
        for i in range(k):
            if i == 0:
                center = X[np.random.choice(pos)]
            else:
                # Calculate probabilities
                probs = []
                for x in range(m):
                    # Calculate each vector's probability
                    sub_probs = []
                    for y in range(m):
                        sub_prob = []
                        for center in centers:
                            dist_to_center = np.sum((X[y]-center)**2)
                            sub_prob.append(dist_to_center)
                        val = np.min(sub_prob)
                        sub_probs.append(val)
                    
                    num = sub_probs[x]
                    denom = np.sum(sub_probs)
                    
                    prob = num/denom
                    probs.append(prob)
                    
                center = X[np.random.choice(pos, p=probs)]
                
            centers.append(center)
            
        centers = np.array(centers)
        return centers
        
    def assign_clusters(self, X, centers):
        m,k = X.shape[0], len(centers) 
        clusters = np.zeros((m,k))
        for i in range(m):
            probs = []
            for j in range(k):
                dist = np.sum((X[i]-centers[j])**2)
                probs.append(dist)
            y = np.argmin(probs)
            clusters[i,y] = 1
        return clusters
    
    def compute_means(self, X, y):
        m, n, k = X.shape[0], X.shape[1], y.shape[1]
        centers = np.zeros((k,n))
        
        for j in range(k):
            cluster = np.zeros(n)
            num_cluster = 0
            for i in range(m):
                if y[i,j] == 1:
                    num_cluster += 1
                    cluster = cluster + X[i]
            center = cluster / num_cluster
            centers[j,:] = center
            
        return centers    
    
    def train(self, X, centers, niters=500):
        for i in range(niters):
            clusters = self.assign_clusters(X, centers)
            centers = self.compute_means(X, clusters)
                    
        return (clusters, centers)

After running KMeans++, we can see the size of each cluster.

In [24]:
k = 4
KM = KMeans()
mu = KM.init_centers(kernel_data, k)
(labels, centers) = KM.train(kernel_data, mu)

for i in range(k):
    print("Cluster %d: " % i, len(np.where(labels[:,i]==1)[0]))

Cluster 0:  353
Cluster 1:  68
Cluster 2:  89
Cluster 3:  99


Let's take a look at what channels are in our largest cluster, cluster 0.

In [45]:
dataset["Channel Name"][labels[:,0] == 1] # Cluster 0 (364 channels)

38                                   TheReportOfTheWeek
43                                          Family Fizz
51                                      julien solomita
54                                         Casey Holmes
56                                     eleventhgorgeous
58                                        VasseurBeauty
59                                            8-BitRyan
60                                    Strictly Dumpling
61                                      FastGoodCuisine
62                                           Mark Wiens
65                                         Neebs Gaming
66                                      RawBeautyKristi
67                                      HellthyJunkFood
68                                        Joey Graceffa
69                             Agnieszka Grzelak Beauty
74                                 The Life Of Us Vlogs
75                                     Christian Guzman
77                                          Bon 

### Tuning Hyperparameters

In this section, we will consider $k$ that will generate different cluster labels for each channel. Then, we will calculate an error using cross entropy on a curated test dataset and minimize this our objective funciton to obtain the optimal hyperparameter.

#### Test Data

We manually assembled this dataset by searching for popular YouTube channels online, and filtering for channels that are not already in our training set. 

In [25]:
test_channel_ids = {
    "Vlog": [
        "UC4-CH0epzZpD_ARhxCx6LaQ",
        "UC_gV70G_Y51LTa3qhu8KiEA",
        "UCcgVECVN4OKV6DH1jLkqmcA",
        "UCPOw2O3_uZ1doro9iR4x6vw",
        "UC0otZdGYsA9KqVKAcn2peQA"
    ], "Food": [
        "UCJQL1Fai-9GlVunsbP4x8Pg",
        "UCRIZtPl9nb9RiXc9btSTQNw",
        "UCNbngWUqL2eqRw12yAwcICg",
        "UC6S5a3MQtr_PSWZxysXkOCg",
        "UCffs63OaN2nh-6StR6hzfiQ"
    ], "Gaming": [
        "UCAW-NpUFkMyCNrvRSSGIvDQ",
        "UC1uvf8YdxSVzthF45kotmJQ",
        "UCbTVTephX30ZhQF5zwFppBg",
        "UCS5Oz6CHmeoF7vSad0qqXfw",
        "UCpGdL9Sn3Q5YWUH2DVUW1Ug"
    ], "Fashion": [
        "UC5zSySQab9SA6Wz569WDgqw",
        "UCgWfS_47YPVbKx5EK4FLm4A",
        "UCo5zIpjl2OQkYatd8R0bDaw",
        "UC-BaXc1TU9i0XSguq9mZwdg",
        "UC48DOiEvCDu3sThBijwkQ1A"
    ], "Fitness": [
        "UCXIJ2-RSIGn53HA-x9RDevA",
        "UCuA6Ht35K326kBTPTXaWj3g",
        "UCgBc9iNvvjWDInV6fBeTGXQ",
        "UCuY1W4AwhhgkB6rsJBtltUA",
        "UCMTXToEZ6VT5k9GOCFNYjWA"
    ]
}

def create_test_dataset(channel_ids):
    test_dataset = { "Channel ID": [] }
    for topic in channel_ids:
        for channel_id in test_channel_ids[topic]:
            test_dataset["Channel ID"].append(channel_id)
    
    test_data = pd.DataFrame.from_dict(test_dataset)
    init_test = extend_features(test_data)
    return extend_video_features(init_test)
            
test_data = create_test_dataset(test_channel_ids)
test_data.tail()

NameError: name 'extend_features' is not defined

Before we move on, let's save this into a spreadsheet for easy loading!

In [295]:
test_data.to_csv('youtube-test-data.csv')

And then we apply the same cleaning and feature space transformations onto our test data we can calculate distances to the training data cluster centers.

In [26]:
orig_test = pd.read_csv('youtube-test-data.csv')

test_data = orig_test.drop(['Unnamed: 0'], axis=1)
test_data = orig_test.replace([np.inf, -np.inf], np.nan)
test_data = test_data.dropna(axis=0, how="any")
test_data = test_data.reset_index(drop=True)

test_norm = normalize_data(test_data)
test_kernel = kernel_PCA(test_norm)
print(test_kernel.shape)

(24, 6)


#### Cross Entropy Loss

Since these channels are very similar and commonly recognized as popular channels on YouTube, we will asumme that they should be classified into similar clusters under our model. Thus, we can use the 

In [27]:
from collections import Counter

def cross_entropy(cluster_labels):
    freqs = Counter()
    for label in cluster_labels:
        freqs[label] += 1
        
    n = len(cluster_labels)
    entropy = 0
    for label in freqs:
        label_prob = freqs[label]/n
        entropy += label_prob*np.log(label_prob)
    return -entropy

Now we write a model testing function that returns the total entropy for a given model and another wrapper function that minimize cross entropy loss across models trained on different hyperparameter $k$.

In [28]:
def test_model(KM, centers, test_data):
    predicted_clusters = KM.assign_clusters(test_data, centers)    
    predicted_labels = list(map(lambda x: np.argmax(x), predicted_clusters))
    return cross_entropy(predicted_labels)

def optimize(train_data, test_data):
    min_error = None
    opt_k = None
    opt_model = None
    opt_centers = None
    opt_labels = None
    
    for k in range(3,8): # 3,4,5,6,7
        kmeans = KMeans()
        init_centers = kmeans.init_centers(train_data, k)
        labels, centers = kmeans.train(train_data, init_centers)
        
        error = test_model(kmeans, centers, test_data)
        
        print("k=%d  error=%0.4f" % (k,error))
        
        if min_error == None or error < min_error:
            min_error = error
            opt_k = k
            opt_model = kmeans
            opt_centers = centers
            opt_labels = labels
            
    return (opt_k, opt_model, opt_centers, opt_labels)
        
k, KM, opt_centers, opt_labels = optimize(kernel_data, test_kernel)

k=3  error=0.6897
k=4  error=0.7924
k=5  error=0.1732
k=6  error=0.6160
k=7  error=0.6520


Using this information about our optimal hyperparameter, we can train a new KMeans++ model with $k=5$ and use that to visualize the clusters.

## 4. Cluster Visualization

We use Plotly in our cluster visualization because it supports powerful interactive graphs, and we are visualizing the assigned labels on our training data in the first 3 principal components. The plot generated can be found below.

In [29]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_credentials_file(username="pdsfinal", api_key="uWfUSo3kfMnXM6gTxmUO")

def visualize_clusters_3D(data, labels, k):
    transformed = data[:, 0:3]
    
    plot_data = []
    for i in range(k):
        trace = go.Scatter3d(
            x=transformed[np.where(labels[:,i] == 1)[0]][:,0],
            y=transformed[np.where(labels[:,i] == 1)[0]][:,1],
            z=transformed[np.where(labels[:,i] == 1)[0]][:,2],
            mode="markers",
            marker=dict(
                size=12,
                line=dict(
                    color='rgba(217,217,217,0.14)',
                    width=0.5
                ),
                opacity=0.8
            )
        )
        plot_data.append(trace)

    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )
    return go.Figure(data=plot_data, layout=layout)
    
fig = visualize_clusters_3D(kernel_data, opt_labels, k)
py.iplot(fig, filename='cluster-3d-scatter')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~pdsfinal/0 or inside your plot.ly account where it is named 'cluster-3d-scatter'


In addition, now that we have labeled data, we can visualize the clusters with respect to our original features. We will explore clusters with regards to some of those metrics below.

First, we'll get the dataframe indices for each cluster.

In [30]:
def get_cluster_idxs(labels, k):
    cluster_idxs = dict()
    for i in range(k):
        cluster_idxs[i] = np.where(labels[:,i] == 1)[0]
    return cluster_idxs

cluster_idxs = get_cluster_idxs(opt_labels, k)

Since PCA transforms our data by projecting each feature to the principal components, the $x,y,z$ axes in the plot above don't necessarily correspond to our original features in the way we would expect it. However, we can use the clusters that our KMeans++ model has identified to visualize features we identified earlier.

Here's an example of that with subscriber count.

In [32]:
def visualize_cluster_stacked_histogram(data, feature, clusters, labels, k):      
    plot_data = []
    for i in range(k):
        trace = go.Histogram(
            x = dataset[feature][clusters[i]],
            name = "Cluster %d" % i
        )
        plot_data.append(trace)
    
    layout = go.Layout(
        xaxis=dict(title=feature),
        yaxis=dict(title="Count"),
        barmode="stack")
    
    return go.Figure(data=plot_data, layout=layout)
    
fig = visualize_cluster_stacked_histogram(dataset, "Subscriber Count", cluster_idxs, opt_labels, k)
py.iplot(fig, filename='cluster-subscriber-count-stacked-histogram')

Another one with upload frequency.

In [33]:
fig = visualize_cluster_stacked_histogram(dataset, "Days/Upload", cluster_idxs, opt_labels, k)
py.iplot(fig, filename='cluster-days-upload-stacked-histogram')

There are certain insights we've gained from visualizing the data. In the upload frequency plot above, we can see that channels from cluster 1 and 4 tend to be really frequent uploaders (videos within a day).

### Cluster Types

**Popular**: Cluster 3 (red) and Cluster 0 (blue) contain our most popular channels. These channels have lots of subscribers and their channels have consistently high engagement. Examples include: CaseyNeistat, Gigi Gorgeous, RomanAtwoodVlogs, and Epic Meal Time.

**Growing**: Cluster 4 (purple) and Cluster 1 (orange) contain our channels with average popularity. These channels have fewer subsrcibers than the two above, but still have good audience engagement. Examples include: AlishaMarieVlogs, iJustine, Vlogs by DK4L, and Bon Appetit.

**Small Channels**: Cluster 2 (green) contains our smallest channels. These channels have low engagement from their audience. Examples include: LevelCapGaming, Beauty Tricks, and Jamie Oliver.

## 5. Future Directions

1. Analyzing genres separately
2. Webscraping for smaller channels
3. Create time series database
4. Exploring other unsupervised clustering algorithms
5. Training sentiment analysis model

In this project we did not consider the differences achieving popularity across different genres. In the future we look to analyze each genre separately to reduce the effects of any confounding variables across genres.

When we queried the YouTube Data API, it was inherently biased towards larger channels because they sorted in descending order on the metrics we specified in our parameters. Therefore, it might be helpful to webscrape for smaller channels for more variety in channels.

The Youtube Data API does not record historical data, but we could assemble our own, which could useful to characterize growth, for example, of a channel. This is most likely what SocialBlade, a YouTube analytics platform, utilizes.

Although KMeans++ is very easy to implement, the accuracy of its clusters relies on spherically distributed clusters, similar variance across clusters, etc. Thus, it could be beneficial to examine the performance of other clustering algorithms like GMM and hierarchical clustering to see what yields the lowest error values.

Lastly, we were initially interested in training a sentiment analysis model to analyze video captions. We ultimately decided against it because we found that captions across even one channel's videos were highly variable. Thus, we avoided it for the purposes of this project but that could be something else to look into when expanding on our feature set.

Word Count: 1954