## Network Analysis

In this notebook, we represent two specific networks to have a better understanding of the data. We use the [Gephi](https://gephi.org/) software to visualize the networks.

1. The first network is an undirected graph with nodes as videos ans sponsors, and edges appear between them if a video is sponsored by a sponsor.

2. The second network is an weighted undirected graph with nodes as sponsors, and edges appear between them if they sponsor videos together. The weights describe the number of videos they sponsor together, normalized.

In [1]:
import findspark
findspark.init()

import os
import glob
import pandas as pd
from itertools import combinations

from pyspark.sql.functions import udf, explode, collect_list, count
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, LongType, StringType, DateType, ArrayType, BooleanType, FloatType

from pyspark.sql import SparkSession
import pyspark as ps
config = ps.SparkConf()
config.set('spark.executor.heartbeatInterval', '3600s')
config.set('spark.network.timeout', '7200s')
config.set('spark.driver.memory', '16g')
sc = ps.SparkContext('local[*]', '', conf=config) # write 'local' for single-threaded execution and 'local[*]' for multi-threaded execution
spark = SparkSession(sc)

### Load the Data 

In [2]:
PATH_METADATAS_CLASSIFIED_DOMAINS_SRC = '../data/domains_classification.csv'
PATH_METADATAS_DOMAINS_SRC = '../data/generated/yt_metadata_en_domains.parquet'

In [3]:
schema_top_domains = ps.sql.types.StructType([
    StructField('domain', StringType(), True),
    StructField('count', IntegerType(), True),
    StructField('median_sponsor_score', FloatType(), True),
    StructField('is_sponsored', IntegerType(), True),
    StructField('domain_category', StringType(), True),
])

schema_domain_metadatas = StructType([
    StructField("categories",    StringType(),            True),
    StructField("channel_id",    StringType(),            True),
    StructField("dislike_count", DoubleType(),            True), # This field must be specified as a double as it is represented as a floating point number
    StructField("display_id",    StringType(),            True),
    StructField("duration",      IntegerType(),           True),
    StructField("like_count",    DoubleType(),            True), # This field must be specified as a double as it is represented as a floating point number
    StructField("tags",          StringType(),            True),
    StructField("title",         StringType(),            True),
    StructField("upload_date",   DateType(),              True),
    StructField("view_count",    DoubleType(),            True),  # This field must be specified as a double as it is represented as a floating point number
    StructField("domains",       ArrayType(StringType()), True), 
    StructField("domains_count", IntegerType(),           True),
    StructField("has_domains",   BooleanType(),           True),
])

classified_domains = spark.read.csv(PATH_METADATAS_CLASSIFIED_DOMAINS_SRC, header=True, schema=schema_top_domains)
domain_metadatas = spark.read.parquet(PATH_METADATAS_DOMAINS_SRC, schema=schema_domain_metadatas)

domain_metadatas = domain_metadatas \
    .withColumn("dislike_count", domain_metadatas.dislike_count.cast(IntegerType())) \
    .withColumn("like_count", domain_metadatas.like_count.cast(IntegerType())) \
    .withColumn("view_count", domain_metadatas.view_count.cast(LongType()))

classified_domains = classified_domains \
    .withColumn("median_sponsor_score", classified_domains.is_sponsored.cast(IntegerType())) \
    .withColumn("is_sponsored", classified_domains.is_sponsored.cast(BooleanType()))

### Generate the Video-Sponsor Network

In [4]:
# Retrieve the sponsored domains
sp_domains = classified_domains.select('domain').where(classified_domains.is_sponsored).collect()
sp_domains = [domain.domain for domain in sp_domains]

In [5]:
# Retrieve the sponsored domains with their category
sp_domains_cat = classified_domains.select('domain', 'domain_category').where(classified_domains.is_sponsored).collect()
sp_domains_cat = [(domain.domain, domain.domain_category) for domain in sp_domains_cat]
# Create a dictionary with the sponsored domains as keys and their category as values
sp_domains_cat = dict(sp_domains_cat)

In [6]:
domain_metadatas.show(10)

+----------------+--------------------+-------------+-----------+--------+----------+--------------------+--------------------+-----------+----------+--------------------+-------------+-----------+
|      categories|          channel_id|dislike_count| display_id|duration|like_count|                tags|               title|upload_date|view_count|             domains|domains_count|has_domains|
+----------------+--------------------+-------------+-----------+--------+----------+--------------------+--------------------+-----------+----------+--------------------+-------------+-----------+
|                |UCD8GawxPXpJnql46...|         null|Gt_r6SrOxv8|    6183|      null|                    |Los Angeles to Sa...| 2019-09-28|     13695|  [tipeeestream.com]|            1|       true|
|Autos & Vehicles|UC-9IhqrFTkc53Dx1...|            6|xNoMrPUlBzw|     230|       134|new cars,new york...|First Look at the...| 2019-04-22|      2194|[creativecommons....|            1|       true|
|Autos & V

In [6]:
def is_in_top_domains(domain):
    return domain in sp_domains

In [7]:
# Explode the domains and keep only the rows with a sponsored domain
is_in_top_domains_udf = udf(is_in_top_domains, BooleanType())
domain_metadatas_vs = domain_metadatas \
    .withColumn('domain', explode(domain_metadatas.domains).alias('domain'))
domain_metadatas_vs = domain_metadatas_vs.select('display_id', 'domain') \
    .withColumn('is_in_top_domains', is_in_top_domains_udf(domain_metadatas_vs.domain))
domain_metadatas_vs = domain_metadatas_vs \
    .where(domain_metadatas_vs.is_in_top_domains) \
    .drop('is_in_top_domains') \
    .distinct()

In [9]:
domain_metadatas_vs.show(10)

+-----------+---------------+
| display_id|         domain|
+-----------+---------------+
|CcJtdkaf0Hk|     newegg.com|
|Hgzonw__mUI|play.google.com|
|3ZMPz2K5uJ8|play.google.com|
|gjg66FPuXME|     zazzle.com|
|6bQ17PH-iWk|play.google.com|
|z9Mqj9EXaBg|   spreaker.com|
|ZQ7OiOISmMU|   spreaker.com|
|I7ODh021stA|     artlist.io|
|zSLPIRY9srU|   testbook.com|
|b7T7ej5G84w|play.google.com|
+-----------+---------------+
only showing top 10 rows



In [8]:
# Add the domain category
def get_domain_category(domain):
    if domain in sp_domains_cat:
        return sp_domains_cat[domain]
    return None 

get_domain_category_udf = udf(get_domain_category, StringType())
domain_metadatas_vs = domain_metadatas_vs \
    .withColumn('domain_category', get_domain_category_udf(domain_metadatas_vs.domain))

In [11]:
domain_metadatas_vs.show(10)

+-----------+---------------+---------------+
| display_id|         domain|domain_category|
+-----------+---------------+---------------+
|CcJtdkaf0Hk|     newegg.com|     Technology|
|Hgzonw__mUI|play.google.com|    Application|
|3ZMPz2K5uJ8|play.google.com|    Application|
|gjg66FPuXME|     zazzle.com|           Shop|
|6bQ17PH-iWk|play.google.com|    Application|
|z9Mqj9EXaBg|   spreaker.com|         Agency|
|ZQ7OiOISmMU|   spreaker.com|         Agency|
|I7ODh021stA|     artlist.io|          Music|
|zSLPIRY9srU|   testbook.com|      Education|
|b7T7ej5G84w|play.google.com|    Application|
+-----------+---------------+---------------+
only showing top 10 rows



This final dataframe describes the edges of the network, from the video described as its `display_id` to the sponsor described as its `domain`. The `domain_category` is the category of the sponsor. We only take a sample of the data to avoid memory issues, while still having a good representation of the data.

In [9]:
SAMPLE_RATIO = 0.01
SAMPLE_SEED = 0
domain_metadatas_vs = domain_metadatas_vs.sample(False, SAMPLE_RATIO, seed=SAMPLE_SEED)

In [10]:
domain_metadatas_vs.write.csv('../data/generated/yt_network.csv', mode='overwrite')

In [11]:
NETWORK_PATH = "../data/generated/yt_network.csv/"

all_files = glob.glob(os.path.join(NETWORK_PATH, "part-*.csv"))
df_from_each_file = [pd.read_csv(f, sep=',', header=None) for f in all_files]
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(NETWORK_PATH + "merged.csv")

### Generate the Sponsor-Sponsor Network

In [14]:
# Get all distinct pairs of domains from the same video
combinations_udf = udf(lambda x: list(combinations(list(set(x)), 2)), "array<struct<domain1:string,domain2:string>>")

In [15]:
# Explode the domains and keep only the rows with a sponsored domain
is_in_top_domains_udf = udf(is_in_top_domains, BooleanType())
domain_metadatas_ss = domain_metadatas \
    .withColumn('domain', explode(domain_metadatas.domains).alias('domain'))
domain_metadatas_ss = domain_metadatas_ss.select('display_id', 'domain') \
    .withColumn('is_in_top_domains', is_in_top_domains_udf(domain_metadatas_ss.domain))
domain_metadatas_ss = domain_metadatas_ss \
    .where(domain_metadatas_ss.is_in_top_domains) \
    .drop('is_in_top_domains') \
    .distinct()

In [16]:
# List them back together
domain_metadatas_ss = domain_metadatas_ss \
    .groupBy('display_id') \
    .agg(collect_list('domain').alias('domains'))

In [16]:
domain_metadatas_ss.select('display_id', 'domains').show(10, False)

+-----------+-----------------------------------------------------------------------+
|display_id |domains                                                                |
+-----------+-----------------------------------------------------------------------+
|---jqfcks4Y|[gamewisp.com]                                                         |
|---rKGl6b6k|[play.google.com]                                                      |
|--1-YLWkQgc|[e.lga.to]                                                             |
|--1udHoGWFY|[fr.shopping.rakuten.com, sigma-beauty.7eer.net, rstyle.me, ebates.com]|
|--2qGzZS0cc|[wattpad.com]                                                          |
|--322IagBXo|[epidemicsound.com)]                                                   |
|--3gtM7gnCQ|[fiverr.com]                                                           |
|--4TsCinz9Y|[teespring.com, streamlabs.com]                                        |
|--4qhrXSuTs|[sellfy.com]                             

In [17]:
domain_metadatas_ss = domain_metadatas_ss.withColumn('domains', explode(combinations_udf(domain_metadatas_ss.domains)).alias('domains'))

In [14]:
domain_metadatas_ss.select('display_id', 'domains').show(10, False)

+-----------+------------------------------------------------+
|display_id |domains                                         |
+-----------+------------------------------------------------+
|--1udHoGWFY|{fr.shopping.rakuten.com, rstyle.me}            |
|--1udHoGWFY|{fr.shopping.rakuten.com, sigma-beauty.7eer.net}|
|--1udHoGWFY|{fr.shopping.rakuten.com, ebates.com}           |
|--1udHoGWFY|{rstyle.me, sigma-beauty.7eer.net}              |
|--1udHoGWFY|{rstyle.me, ebates.com}                         |
|--1udHoGWFY|{sigma-beauty.7eer.net, ebates.com}             |
|--4TsCinz9Y|{streamlabs.com, teespring.com}                 |
|--AT5_SIBBg|{hautelook.com, sigmabeauty.com}                |
|--AeWbCVaNA|{etsy.com, ebates.com}                          |
|--E5fYurbTk|{teespring.com, play.google.com}                |
+-----------+------------------------------------------------+
only showing top 10 rows



In [18]:
# Get the 2 domains in separate columns
domain_metadatas_ss = domain_metadatas_ss \
    .withColumn('domain1', domain_metadatas_ss.domains.domain1) \
    .withColumn('domain2', domain_metadatas_ss.domains.domain2) \
    .drop('domains') \
    .select('domain1', 'domain2')

# Get weights
domain_metadatas_ss = domain_metadatas_ss \
    .groupBy('domain1', 'domain2') \
    .agg(count('domain1').alias('weight'))

# Normalize the weights
max_weight = domain_metadatas_ss.agg({'weight': 'max'}).collect()[0][0]
domain_metadatas_ss = domain_metadatas_ss \
    .withColumn('weight', domain_metadatas_ss.weight / max_weight)

In [19]:
domain_metadatas_ss.show(10)

+------------------+--------------------+--------------------+
|           domain1|             domain2|              weight|
+------------------+--------------------+--------------------+
|        airbnb.com|      shareasale.com|0.003622562610790...|
|     tubebuddy.com|     erincondren.com|0.001918094632062179|
|           seph.me|            ulta.com|0.004986136993773...|
|noscopeglasses.com|          cdkeys.com|0.002249897731921...|
|      coinbase.com|          medium.com|0.004049815917458298|
|         rstyle.me|fr.shopping.rakut...|0.013667560565428845|
|     teespring.com|         audible.com| 0.00974955683832553|
|        ebates.com|    m.freemyapps.com|3.363483478023726E-4|
|     tubebuddy.com|             bstk.me|0.002268078723694...|
|    rover.ebay.com|            bhpho.to|0.001290850415890...|
+------------------+--------------------+--------------------+
only showing top 10 rows



The final dataframe describes the edges of the network, from a sponsor `domain1` to another sponsor `domain2`. The `weight` is the number of videos they sponsor together, normalized.

In [20]:
domain_metadatas_ss.write.csv('../data/generated/yt_network_ss.csv', mode='overwrite')

In [21]:
NETWORK_PATH = "../data/generated/yt_network_ss.csv/"

all_files = glob.glob(os.path.join(NETWORK_PATH, "part-*.csv"))
df_from_each_file = [pd.read_csv(f, sep=',', header=None) for f in all_files]
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(NETWORK_PATH + "merged.csv")

## Visualize the Networks

The first network has nodes representing both the videos and the domains. Every video has an edge to every sponsored domain it contains in its description. Using Yifan Hu's algorithm to layout the graph, we get the following visualization:

<img title="network_01_visualization" alt="Network 01 visualization" width="800" src="../generated/network/vs/01.png">
<img title="network_01_legend" alt="Network 01 legend" width="100" src="../generated/network/vs/legend.png">

This network shows us that the **videos rarely have more than one sponsor** in their description. Indeed, the vast majority of the sponsor nodes are pushed outwards since they only have their own video community. Also note that we took a sample of $\frac{1}{100}\text{th}$ of the full dataset. Taking only a subset of the data might be discriminatory towards the videos with many sponsors, since they are rarely present in the full dataset.

We still have some interesting findings comming out of this graph. Indeed, it seems that `play.google.com` and `apps.apple.com` share a non-negligible number of videos. This is probably due to the fact that they are both app stores, and videos link them together since many applications are available for both Android and iOS.

Also, clusters in the middle tend to be related to the same domain. For example, `sephora.com`, `ipsy.com` and some other cosmetic websites are close to each other. This is probably due to the fact that some beauty-related videos are sponsored by many of these types of websites. Here is a close-up of the specific cluster:

<img title="network_01_visualization" alt="Network 01 visualization" width="500" src="../generated/network/vs/02.png">

We now take a look at a second network. This time, we only have nodes representing the domains, and edges exists between two domains if they sponsor videos together. Using Force Atlas to layout the graph, we get the following visualization:

<img title="network_02_visualization" alt="Network 02 visualization" width="800" src="../generated/network/ss/01.png">
<img title="network_02_legend" alt="Network 02 legend" width="100" src="../generated/network/ss/legend.png">

We now do not take into account videos that only have one sponsor in their description, since no edge would be created. The graph is now much more dense, and clusters are more visible. We can still see that **`play.google.com` and `apps.apple.com` are very related to each other**, since they are close to each other. Also, being in the center of the graph is a good indicator that **these domains sponsor many diverse videos**! Here is a close-up of the region of interest:

<img title="network_02_visualization" alt="Network 02 visualization" width="500" src="../generated/network/ss/02.png">

The agencies seem to appear everywhere in the graph and there are a good number of them in the center. This tells us that **agencies are omnipresent** in the data, sponsoring and **targetting a broad range of videos**.

Interestingly enough, **there seems to be two main clusters**: a big one in the top left, and a smaller one in the bottom right. The bigger cluster groups together many sponsors related to **video games, technology and applications**. The smaller cluster is more related to **beauty and fashion**. This gives us reasonable insights about the data: fashion-related and tech-related videos tend to form different communities on YouTube, which can directly be seen with the relation between sponsors! Here is a close-up of the specific clusters:

<img title="network_02_visualization" alt="Network 02 visualization" width="500" src="../generated/network/ss/04.png">
<img title="network_02_visualization" alt="Network 02 visualization" width="350" src="../generated/network/ss/03.png">

Finally, **shops** are noticeably placed in the outer part of the graph. This could tell us that they might tend to target a broader audience, and not only a specific community. Having low weights between many different nodes could lead to such a configuration. Such examples are `aliexpress.com` or `bangood.com`, which are both e-commerce websites. Both of them sell a wide range of products, from electronics to fashion.

Here is a graph showing the clustering coefficient distribution of the nodes:

<img title="network_02_cluster_coeff" alt="Network 02 clustering coefficient" width="600" src="../generated/network/ss/clustering-coefficient.png">

Here is a graph of the closeness centrality distribution of the nodes:

<img title="network_02_cluster_coeff" alt="Network 02 clustering coefficient" width="600" src="../generated/network/ss/closeness.png">

The mean clustering coefficient is $\approx 0.65$. This means that **communities of sponsors are quite common** when taking into account only the interraction between them. Indeed for example, the domain `sephora.com` is often found alongside `ipsy.com` but also `ulta.com`, which together form a community of cosmetic-related sponsors.

The closeness centrality distribution also shows us that **the graph is well connected**. Indeed, the mean closeness centrality is $\approx 0.6$. Besides some outliers, this metric is quite high, which means that sponsors are often found in communities.

Since many shops and agencies tend to target vast audiences, they also may be the nodes connecting some more specific communities. This could explain why we see that the closeness centrality of the agencies is quite high.