# Regression analysis

In this part, we perform a regression analysis on sponsored videos. Since our current dataset only takes into account links in the description, our definition of a sponsored video may not yet reflect reality as well as we would like. Nevertheless, this analysis might give us better insights in the future.

In [1]:
# Standard library
import math
import re

# Scientific stack
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import diagnostic

# Spark
import pyspark as ps
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import FloatType

# Long timeouts for slow local jobs; the heartbeat interval stays below the network timeout.
config = ps.SparkConf().setAll([
    ('spark.network.timeout', '3601s'),
    ('spark.executor.heartbeatInterval', '3600s'),
])
sc = ps.SparkContext('local', '', conf=config)
spark = SparkSession(sc)


# Data preparation
-----------------------------------------------------------------------------------------------------

In [2]:
PATH_METADATA_W_URLS = 'data/yt_metadata_en_urls.parquet'

In [3]:
metadatas_urls_df = spark.read.parquet(PATH_METADATA_W_URLS)

                                                                                

In [4]:
metadatas_urls_df.head()

                                                                                

Row(categories='Howto & Style', channel_id='UCROB2-0bJEcwiP059oNil_Q', crawl_date=datetime.date(2019, 11, 17), dislike_count=None, display_id='iICAtB8ViFM', duration=1365, like_count=None, tags='makeup geek,kathleen lights,makeup,wannamakeup,sephora,by terry,ulta,korres,chanel,haul', title='Makeup Haul | Chanel, MAC, Makeup Geek, Sephora & more', upload_date=datetime.date(2016, 8, 22), view_count=2813, urls=['http://go.magik.ly/ml/1fqi/', 'http://go.magik.ly/ml/1fqk/', 'http://go.magik.ly/ml/1fqo/', 'http://go.magik.ly/ml/1fql/', 'http://go.magik.ly/ml/1fqm/', 'http://go.magik.ly/ml/1fqr/', 'http://go.magik.ly/ml/1fqp/', 'http://www.ebates.com/rf.do?referreri...', 'http://go.magik.ly/ml/1fqn/', 'http://go.magik.ly/ml/1fqj/', 'http://go.magik.ly/ml/1fqs/', 'http://go.magik.ly/ml/167n/', 'http://go.magik.ly/ml/1fqq/', 'https://www.octoly.com/creators?cref=hato9'], urls_count=14, has_urls='true')

In [4]:
# Missing like/dislike counts are treated as 0.
metadatas_urls_df = metadatas_urls_df.fillna(0, subset=['dislike_count', 'like_count'])

In [5]:
# Native column expression instead of a Python UDF; zero or missing views give a ratio of 0.
ratio = F.when(col('view_count') != 0, col('like_count') / col('view_count')).otherwise(0.0)
metadatas_urls_df = metadatas_urls_df.withColumn('like_per_view', ratio)

In [29]:
metadatas_urls_df.filter(col("like_per_view")==0).count()

                                                                                

7094044

In [21]:
print((metadatas_urls_df.count(), len(metadatas_urls_df.columns)))  # the zero-ratio videos represent less than 10% of our dataset

(72924794, 15)

There are 7,094,044 videos with a like-per-view ratio of 0, which is about 10% of the 72,924,794 videos in our dataset. We drop them from our analysis, since they might not qualify as sponsored videos.
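
The share quoted above can be checked directly from the two counts printed by the previous cells. A minimal sketch in plain Python; the variable names are only illustrative and not part of the notebook:

```python
# Counts taken from the notebook outputs above.
total_videos = 72_924_794       # rows in the full dataset
zero_ratio_videos = 7_094_044   # rows with like_per_view == 0

share = zero_ratio_videos / total_videos
remaining = total_videos - zero_ratio_videos  # rows left once these are dropped

print(f"share of zero-ratio videos: {share:.1%}")
print(f"videos left for the regression: {remaining}")
```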

In [7]:
metadatas_urls_df = metadatas_urls_df.where(metadatas_urls_df.like_per_view>0)

In [8]:
metadatas_urls_df = metadatas_urls_df.withColumn('dislike_per_view', metadatas_urls_df.dislike_count / metadatas_urls_df.view_count)
metadatas_urls_df = metadatas_urls_df.fillna(0, subset='dislike_per_view')

In [55]:
metadatas_urls_df.head()

Row(categories='Howto & Style', channel_id='UCROB2-0bJEcwiP059oNil_Q', crawl_date=datetime.date(2019, 11, 17), dislike_count=1, display_id='4e3A7ohrWZ0', duration=426, like_count=87, tags='urban decay,foundation,review,makeup,sephora,ulta,wannamakeup,demo,first impressions,all nighter,too faced,colourpop', title='Urban Decay All Nighter Foundation | Demo and Review', upload_date=datetime.date(2016, 8, 11), view_count=816, urls=['http://www.ebates.com/rf.do?referreri...', 'https://www.octoly.com/creators?cref=hato9'], urls_count=2, has_urls='true', like_per_view=0.10661764705882353, dislike_per_view=0.0012254901960784314)

# Model fitting
-----------------------------------------------------------------------------------------------------
Attempt at using the Spark regression library; not yet operational.

-------------------------------------------------------------------------------------------------------------------------------
Let's use the statsmodels approach for large datasets (https://www.statsmodels.org/stable/large_data.html)

In [None]:
# Run this cell to write the Parquet dataset regression_urls.parquet with the columns you want to fit on:
df2 = metadatas_urls_df.select(col('urls_count'), col('like_per_view'))
df2.write.parquet('regression_urls.parquet')

In [9]:
import pyarrow.parquet as pq
import statsmodels.formula.api as smf

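class DataSet(dict):
    """Dict-like view of a Parquet dataset that reads one column at a time."""

    def __init__(self, path):
        super().__init__()
        self.parquet = pq.ParquetDataset(path)

    def __getitem__(self, key):
        # Read only the requested column from disk and return it as a pandas Series.
        try:
            return self.parquet.read([key]).to_pandas()[key]
        except Exception as exc:
            raise KeyError(key) from exc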

df_urls = DataSet('regression_urls.parquet')
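
As the statsmodels large-data guide linked above describes it, the point of a dict-like wrapper is that a formula only looks up the columns it names, so only those columns need to be read into memory. The following sketch illustrates that lookup pattern in plain Python. `LazyColumns`, the in-memory `fake_storage` and the toy values are hypothetical stand-ins for the Parquet-backed `DataSet`; they are not part of the notebook.

```python
class LazyColumns(dict):
    """Dict-like object that loads a column only when it is requested."""

    def __init__(self, storage):
        super().__init__()
        self._storage = storage   # stand-in for the Parquet files on disk
        self.loaded = []          # records which columns were actually read

    def __getitem__(self, key):
        if key not in self._storage:
            raise KeyError(key)
        self.loaded.append(key)
        return list(self._storage[key])


# Toy data, made up for illustration only.
fake_storage = {
    "urls_count": [0, 2, 5],
    "like_per_view": [0.01, 0.03, 0.04],
    "title": ["a", "b", "c"],
}

data = LazyColumns(fake_storage)
x = data["urls_count"]
y = data["like_per_view"]
print(data.loaded)  # the "title" column was never read
```

In the same way, fitting `like_per_view ~ urls_count` should only ever request those two columns from `df_urls`.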

In [10]:
mod = smf.ols('like_per_view ~ urls_count', data=df_urls)
res = mod.fit()  # OLS is deterministic, so no random seed is needed
print(res.summary())

We can at least see that the coefficient for urls_count is positive. This seems consistent with the idea that popular videos (and thus videos with more likes) tend to have a more carefully written, more detailed description, which in turn contains more links.
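
For a regression with a single predictor, the sign of the fitted coefficient is the sign of the sample covariance between predictor and response, since the OLS slope equals cov(x, y) / var(x). Below is a minimal sketch of that closed form in plain Python. The helper `ols_simple` and the numbers are made up for illustration; the values are not taken from the dataset.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products; the common 1/n factor cancels in the ratio.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy values: number of links and likes per view.
urls_count = [0, 1, 2, 3, 4]
like_per_view = [0.010, 0.012, 0.015, 0.016, 0.020]

intercept, slope = ols_simple(urls_count, like_per_view)
print(f"intercept={intercept:.4f}, slope={slope:.4f}")
```

With `smf.ols` and the formula `like_per_view ~ urls_count`, the same slope is reported as the `urls_count` coefficient in the summary table.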