# CZ1016 Introduction to Data Science
## Group Project
### YouTube Statistics - What makes a popular YouTube video?


#### Members:
- Chua Zi Heng U1922370K
- Mun Kei Wuai U1921982B
- Tan Wen Xiu  U1921771H


In this project, we aim to find out which features are important in the classification of a popular YouTube video. We define popular as having a high (likes-dislikes)/views ratio. 

We will be using datasets from **United States of America** and **Great Britain**. The following project will be split into 2 parts - separate data cleaning and classifications will be carried out for each country. Thereafter, we will compare the results derived from both datasets and perform in-depth analysis.

We have came up with 8 features that we think might be useful in the classification.

Features (for each unique video): 
1. Number of views
2. Number of likes
3. Number of dislikes
4. Number of trending days
5. Number of comments
6. Average Sentiment Score
7. Number of positive comments 
8. Number of negative comments 

We have also built a univariate decision tree for each feature, a multivariate decsion tree and a random forest to find out which form of classification will give the highest classification accuracy.

In [8]:
# Basic Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set() # set the default Seaborn style for graphics
import re
import string
import nltk
from wordcloud import WordCloud
from nltk import pos_tag, word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz
import graphviz
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
import plotly.offline as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly import tools
# Activate inline plotting in notebook
py.init_notebook_mode(connected = False)

ImportError: cannot import name '_np_version_under1p18' from 'pandas.compat.numpy' (C:\Users\chuaz\Anaconda3\lib\site-packages\pandas\compat\numpy\__init__.py)

# 1. United States of America (US)
- Since the video statistics and comments statistics are in separate CSV files, we will import both files (USvideos & UScomments) for analysis.

In [None]:
#import dataset and remove bad lines
USvideos = pd.read_csv('USvideos.csv', error_bad_lines=False)
USvideos

In [None]:
#import dataset and remove bad lines
UScomments = pd.read_csv('UScomments.csv', error_bad_lines=False)
UScomments

# Data Cleaning

In [None]:
#del thumbnail column as it is redundant for classification
USvideos = USvideos.drop(columns='thumbnail_link')
USvideos.head()

In [None]:
#sort videos from most views to least views for visualisation 
USvideos.sort_values(by=['views'], inplace=True, ascending=False)
USvideos.head()

It can be seen from above that there are duplicates in the videos. For example, BTS (방탄소년단) 'DNA' Official MV appeared more than 1 time. Duplicates happen when the video is trending for more than 1 day. Thus, with this inference, we can use this data to derive the number of days a video has been trending for.

In [9]:
USvideos.describe()

NameError: name 'USvideos' is not defined

In [None]:
#add new DataFrame for number of duplicates to show the number of duplicates
duplicates = USvideos['video_id'].value_counts().rename_axis('video_id').reset_index(name='trending days')
duplicates

In [None]:
#remove duplicates
USvideos.drop_duplicates(subset ="video_id", keep = 'first', inplace = True)
USvideos.head()

In [None]:
#show that there are some videos with 0 views
print(len(USvideos[USvideos['views'] == 0]))

In [None]:
#show the percentage of videos with 0 views is very small so its is okay to remove
print((len(USvideos[USvideos['views']==0])) / len(USvideos*100), '%', sep='')

In [None]:
#remove rows with 0 views
indexNames = USvideos[USvideos['views'] == 0].index
USvideos.drop(indexNames , inplace=True)

In [None]:
#merging 2 dataframes together, adding number of trending days
USvideos = pd.merge(USvideos, duplicates, on = 'video_id')
USvideos.head()

In [None]:
# take a look at dataset again: 2634 videos change to 2632 videos after removal of 2 videos with 0 views
USvideos.describe()

In [None]:
# it can be seen that there are some null objects under comment_text
UScomments.info()

In [None]:
UScomments.describe()

In [None]:
# take a look at rows which comments are null
print(UScomments[UScomments['comment_text'].isnull()])

In [None]:
# number of rows with null comments
len(UScomments[UScomments['comment_text'].isnull()])

In [None]:
# get the row index which comment_text are null
indexNames = UScomments[UScomments['comment_text'].isnull()].index
indexNames

In [None]:
#drop the rows with null comments
UScomments.drop(indexNames, inplace=True)

In [None]:
# confirm that the 25 rows are dropped
UScomments.info()

In [None]:
# find the number of rows in USvideos whose video_id are NOT in UScomments
len(USvideos[~USvideos.video_id.isin(UScomments.video_id)])

In [None]:
# find the row index of rows in USvideos whose video_id are NOT in UScomments
rowIndex1 = USvideos[~USvideos.video_id.isin(UScomments.video_id)].index
print(rowIndex1)

In [None]:
# drop the rows in UScomments whose video_id are NOT in USvideos
USvideos.drop(rowIndex1, inplace=True)
USvideos

In [None]:
# take a look at dataset again after 97 videos were dropped
USvideos.describe()

In [None]:
#take a look at UScomments dataset
UScomments.describe()

In [None]:
# find the number of rows in UScomments whose video_id are NOT in USvideos
len(UScomments[~UScomments.video_id.isin(USvideos.video_id)])

In [None]:
# find the row index of rows in UScomments whose video_id are NOT in USvideos
rowIndex2 = UScomments[~UScomments.video_id.isin(USvideos.video_id)].index
print(rowIndex2)

In [None]:
# drop the rows in UScomments whose video_id are NOT in USvideos
UScomments.drop(rowIndex2, inplace=True)
UScomments

In [None]:
# take a look at dataset again after 888 videos were dropped
UScomments.describe()

# Spearman Rank Correlation
- uses ranking between likes, views and popularity, popularity2 
- choose a y-variable that is appropriate (ie. is normalised, which means a low spearman correlation coefficient)

In [None]:
#create column for ranking of views 
USvideos['views_rank'] = USvideos['views'].rank(ascending = False)
USvideos.head()

In [None]:
#create column for ranking of likes
USvideos['likes_rank'] = USvideos['views'].rank(ascending = False)
USvideos.head()

In [None]:
#set y as likes/views
#sort based on likes over views 
USvideos['popularity'] = USvideos.apply(lambda row: 1000*(row.likes/row.views), axis = 1)
USvideos.head()

In [None]:
#set y as (likes-dislikes)/views
#sort based on (likes-dislikes)/views
USvideos['popularity2'] = USvideos.apply(lambda row: 1000*(row.likes-row.dislikes)/row.views, axis = 1)

USvideos.head()

In [None]:
#create column for ranking of popularity 
USvideos['popularity_rank'] = USvideos['popularity'].rank(ascending = False)
USvideos.head()

In [10]:
#create column for ranking of popularity2
USvideos['popularity2_rank'] = USvideos['popularity2'].rank(ascending = False)
USvideos.head()

NameError: name 'USvideos' is not defined

In [None]:
from scipy.stats import spearmanr
coef, p = spearmanr(USvideos["likes_rank"], USvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

This means that the ranking of videos with the most number of views would correspond with that of the most number of likes. The rankings of these 2 features are essentially equivalent. Hence, the spearman correlation coefficient of "views" or "likes", with other features would be the same.

In [None]:
from scipy.stats import spearmanr
coef, p = spearmanr(USvideos["popularity_rank"], USvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

**popularity = likes/views**. This suggests a higher correlation between the ranks of popularity and views, rather than popularity2 and views as seen below, possibily because we did not take into account the number of dislikes of the video.

In [None]:
from scipy.stats import spearmanr
coef, p = spearmanr(USvideos["popularity2_rank"], USvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

**popularity2 = (likes-dislikes)/views**. Lower correlation shows that popularity2 is not highly associated to the number of views (or likes) that the video receives. Hence, the response is considered to be normalised.

In [None]:
#sort videos from most views to least views
USvideos.sort_values(by=['views'], inplace=True, ascending=False)
USvideos.head()

The above is sorted by highest to lowest views. Looking at popularity2_rank, it is noted that the highest views does not mean highest popularity2_rank. Hence, we will make use of popularity2 as the response variable for this project. 

# Basic Data Visualisation

In [None]:
US_pop = go.Box(x = USvideos['popularity2'], showlegend = False, name = "popularity2")
py.iplot([US_pop])

In [None]:
# look at overall dataset
USvideos.describe()

In [None]:
# 1 point is very far away from the dataset, we will choose to separate them from the classification
anomaly1 = USvideos[USvideos['popularity2']==USvideos['popularity2'].min()]
USvideos = USvideos.loc[USvideos['popularity2']!=USvideos['popularity2'].min()]
USvideos.head()

In [None]:
# look at overall dataset
USvideos.describe()

In [None]:
#boxplot for popularity2 after removing 1 anomaly
US_pop = go.Box(x = USvideos['popularity2'], showlegend = False, name = "popularity2")
py.iplot([US_pop])

In [None]:
#categorize the popularity2 of the videos into 4 categories according to their quartiles 
USvideos['quantile_popularity'] = pd.cut(USvideos['popularity2'], bins = 4, labels = ['low', 'very low', 'very high','high'])
USvideos.head()

In [None]:
USvideos['quantile_popularity'].value_counts()

In [None]:
USvideos.describe()

In [None]:
USvideos['popularity2'].describe()

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(USvideos[["views", "likes", "dislikes", "comment_total", "trending days"]]) 
# Plot the Raw Data on 2D grids
sb.pairplot(X)

In [None]:
# Draw the distribution of Response
f, axes = plt.subplots(1, 3, figsize=(24, 6))
sb.boxplot(USvideos['popularity2'], orient = "h", ax = axes[0], color = "g")
sb.distplot(USvideos['popularity2'], kde = False, ax = axes[1], color = "g")
sb.violinplot(USvideos['popularity2'], ax = axes[2], color = "g")

In [None]:
# interactive plot for popularity2
trace = go.Histogram(x = USvideos['popularity2'], histnorm = 'density')
layout = go.Layout(title = 'Popularity2 Distribution')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

In [None]:
# interactive plot for quantile_popularity
trace = go.Histogram(x = USvideos['quantile_popularity'], histnorm = 'density')
layout = go.Layout(title = 'Quantile Popularity Distribution')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

The plot above shows that there is an uneven distribution of videos in each of the categories, with 'very low' having the most number of videos and 'high' having the least. This might be a limitation as those in 'very low' can be better trained. We will analyse more of this in the Analysis part later on.

# Clustering for Further Data Visualisation
### Bi-variate clustering by Kmeans++ 
- views & popularity2 are used because views are seen as the most straightforward way of predicting popularity in general

In [None]:
# Import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

# Extract the Features from the Data
X = pd.DataFrame(USvideos[["views", "popularity2"]])

# Set the Initialization to KMeans++
init_algo = 'k-means++'

# Vary the Number of Clusters
min_clust = 1
max_clust = 40

# Compute Within Cluster Sum of Squares
within_ss = []
for num_clust in range(min_clust, max_clust+1):
    kmeans = KMeans(n_clusters = num_clust,        # number of clusters
                    init = init_algo,              # initialization algorithm
                    n_init = 5)                    # number of initializations
    kmeans.fit(X)
    within_ss.append(kmeans.inertia_)

# Plot Within SS vs Number of Clusters
f, axes = plt.subplots(1, 1, figsize=(16,4))
plt.plot(range(min_clust, max_clust+1), within_ss)
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.xticks(np.arange(min_clust, max_clust+1, 1.0))
plt.grid(which='major', axis='y')
plt.show()

From the elbow plot, it can be seen that the optimal number of clusters is 4.

In [None]:
# Set "optimal" Number of Clusters
num_clust = 4

# Set the Initialization to KMeans++
init_algo = 'k-means++'

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust, init = init_algo, n_init = 20)                 

# Fit the Clustering Model on the Data
kmeans.fit(X)

# Print the Cluster Centers
print("Features", "\tviews", "\tpopularity2")
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
print()

# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
print()

# Predict the Cluster Labels
labels = kmeans.predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels
sb.countplot(X_labeled["Cluster"])

In [None]:
# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "views", y = "popularity2", c = "Cluster", cmap = 'viridis', data = X_labeled)

In [None]:
# Boxplots for the Features against the Clusters
f, axes = plt.subplots(2, 1, figsize=(16,8))
sb.boxplot(x = 'views', y = 'Cluster', data = X_labeled, ax = axes[0])
sb.boxplot(x = 'popularity2', y = 'Cluster', data = X_labeled, ax = axes[1])

### Multi-variate Clustering by Kmeans++
- we decided that bi-variate clustering was not enough, thus we decided to use this method to get a better overview of the dataset
- categorical data (ie. Category ID) and non-numerical data (ie comment_text) are not used in clustering 

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(USvideos[["likes", "views", "comment_total", "dislikes"]]) 


# Plot the Raw Data on 2D grids
sb.pairplot(X)

In [None]:
# Vary the Number of Clusters
min_clust = 1
max_clust = 40
init_algo = 'k-means++'

# Compute Within Cluster Sum of Squares
within_ss = []
for num_clust in range(min_clust, max_clust+1):
    kmeans = KMeans(n_clusters = num_clust, init = init_algo, n_init = 5)
    kmeans.fit(X)
    within_ss.append(kmeans.inertia_)

# Angle Plot : Within SS vs Number of Clusters
f, axes = plt.subplots(1, 1, figsize=(16,4))
plt.plot(range(min_clust, max_clust+1), within_ss)
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.xticks(np.arange(min_clust, max_clust+1, 1.0))
plt.grid(which='major', axis='y')
plt.show()

In [None]:
# Import essential models from sklearn
from sklearn.cluster import KMeans

# Set "optimal" Clustering Parameters
num_clust = 4
init_algo = 'k-means++'

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust,         
               init = init_algo,
               n_init = 20)                 

# Fit the Clustering Model on the Data
kmeans.fit(X)

In [None]:
# Print the Cluster Centers
print("Features", "\tlikes", "\tviews", "\tcomment_total", "\tdislikes",)
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
print()

# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
print()

# Predict the Cluster Labels
labels = kmeans.predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels
sb.countplot(X_labeled["Cluster"])

In [None]:
# Plot the Clusters on 2D grids
sb.pairplot(X_labeled, vars = X.columns.values, hue = "Cluster")

In [None]:
# Boxplots for all Features against the Clusters
f, axes = plt.subplots(4, 1, figsize=(16,24))
sb.boxplot(x = 'likes', y = 'Cluster', data = X_labeled, ax = axes[0])
sb.boxplot(x = 'views', y = 'Cluster', data = X_labeled, ax = axes[1])
sb.boxplot(x = 'comment_total', y = 'Cluster', data = X_labeled, ax = axes[2])
sb.boxplot(x = 'dislikes', y = 'Cluster', data = X_labeled, ax = axes[3])

In [None]:
# Average Behaviour of each Cluster
cluster_data = pd.DataFrame(X_labeled.groupby(by = "Cluster").mean())
cluster_data.plot.bar(figsize = (16,6))

Next, we will be moving on to the 8 features mentioned at the start.

# Feature 1: Number of Views

In [11]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
views = pd.DataFrame(USvideos['views'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(views, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

NameError: name 'pd' is not defined

# Feature 2: Number of Likes

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
likes = pd.DataFrame(USvideos['likes'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(likes, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 3: Number of Dislikes

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
dislikes = pd.DataFrame(USvideos['dislikes'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(dislikes, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)


# Feature 4: Number of Trending Days

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
TD = pd.DataFrame(USvideos['trending days'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(TD, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 5: Total Number of Comments

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
CT = pd.DataFrame(USvideos['comment_total'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(CT, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)


# Feature 6: Average Sentiment Score
- convert emojis (eg 😂) to text
- convert emoticons (eg :D) to text
- perform sentiment analysis to get the compound, positive, negative and neutral sentiment scores of each comment
- sum up the compound sentiment scores of all the comments of each video
- get the number of comments for each video
- find the average sentiment score of each video

In [None]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
#Installing emot library
!pip install emot
#Importing libraries
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS


In [None]:
# Function for converting emojis into word
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, " ".join(UNICODE_EMO[emot].replace(",","").replace("_"," ").replace(":","").split()))
    return text

# Example
text1 = "Hilarious 😂. The feeling of making a sale 😎, The feeling of actually fulfilling orders 😒"
convert_emojis(text1)

In [None]:
UScomments['processed text'] = UScomments['comment_text'].apply(lambda x: convert_emojis(x))
UScomments

In [None]:
# Function for converting emoticons into word
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', " ".join(EMOTICONS[emot].replace(",","").replace("_"," ").split()), text)
    return text

# Example
text = "Hello :-) :-)"
convert_emoticons(text)

In [None]:
UScomments['processed text'] = UScomments['processed text'].apply(lambda x: convert_emoticons(x))

# look at row 74
UScomments.head(79)

In [None]:
#add column of sentiment scores for visualisation
UScomments['pos_score'] = UScomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['pos'])
UScomments['neg_score'] = UScomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neg'])
UScomments['neu_score'] = UScomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neu'])

# we will use the compound score for further analysis. here the compound score is renamed as sentiment score
UScomments['sentiment_scores'] = UScomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['compound'])
UScomments.head()

In [None]:
#create new dataframe for number of comments per video
number_comments = UScomments['video_id'].value_counts().rename_axis('video_id').reset_index(name='number_comments')
number_comments

In [None]:
#create new dataframe for total number of positive comments per video
total = UScomments.groupby(['video_id'], sort = False).sum()
total

In [None]:
#merge total and number_comments dataframes 
total = pd.merge(total, number_comments, on = 'video_id')
total.head()

In [None]:
#add column of average sentiment
total['average_sentiment'] = total['sentiment_scores'].div(total['number_comments'].values, axis=0)
total.head()

In [None]:
#add average sentiment column into USvideos dataframe
USvideos = pd.merge(USvideos, total, on = 'video_id')
USvideos.head(30)

In [None]:
# visualisation of the more commonly used words
all_words = ' '.join([text for text in UScomments['comment_text']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
AS = pd.DataFrame(USvideos['average_sentiment'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(AS, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 7: Percentage of Positive Comments over Total Number of Comments
###### for each comment: 
- if sentiment score > 0, it is a positive sentiment
- if sentiment score < 0, it is a negative sentiment
- if sentiment score > 0, it is a neutral sentiment

- then find the percentage of comments with positive sentiments


In [None]:
#Categorize positive, negative, neutral
UScomments['Sentiment'] = UScomments['sentiment_scores'].apply(lambda s : 'Positive' if s > 0 else ('Neutral' if s == 0 else 'Negative'))
UScomments.head(20)

In [None]:
#percentage of comments which are positive in all the videos
positive_percent = []
for i in range(0,UScomments.video_id.nunique()):
    a = UScomments[(UScomments.video_id == UScomments.video_id.unique()[i]) & (UScomments.Sentiment == 'Positive')].count()[0]
    b = UScomments[UScomments.video_id == UScomments.video_id.unique()[i]]['Sentiment'].value_counts().sum()
    Percentage = (a/b)*100
    positive_percent.append(round(Percentage,2))

positive_percent

In [None]:
#Creating dataframe for positive percentage
positive_percentage = pd.DataFrame(positive_percent,UScomments.video_id.unique()).reset_index()
positive_percentage.columns = ['video_id','Positive Percentage']
positive_percentage

In [None]:
#add positive percentage column into USvideos dataframe
USvideos = pd.merge(USvideos, positive_percentage, on = 'video_id')
USvideos.head()

In [None]:
all_words_posi = ' '.join([text for text in UScomments['comment_text'][UScomments.Sentiment == 'Positive']])

In [None]:
# visualisation of words in positive comments
wordcloud_posi = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words_posi)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_posi, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
PP = pd.DataFrame(USvideos['Positive Percentage'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(PP, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 8: Percentage of Negative Comments over Total Number of Comments

In [None]:
#percentage of comments which are negative in all the videos
negative_percent = []
for i in range(0,UScomments.video_id.nunique()):
    a = UScomments[(UScomments.video_id == UScomments.video_id.unique()[i]) & (UScomments.Sentiment == 'Negative')].count()[0]
    b = UScomments[UScomments.video_id == UScomments.video_id.unique()[i]]['Sentiment'].value_counts().sum()
    Percentage = (a/b)*100
    negative_percent.append(round(Percentage,2))

negative_percent

In [None]:
#Creating dataframe for negative percentage
negative_percentage = pd.DataFrame(negative_percent,UScomments.video_id.unique()).reset_index()
negative_percentage.columns = ['video_id','Negative Percentage']
negative_percentage

In [None]:
#add negative percentage column into USvideos dataframe
USvideos = pd.merge(USvideos, negative_percentage, on = 'video_id')
USvideos.head()

In [None]:
all_words_nega = ' '.join([text for text in UScomments['comment_text'][UScomments.Sentiment == 'Negative']])

In [None]:
# visualisation of words in negative comments
wordcloud_nega = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words_nega)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_nega, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
QP = pd.DataFrame(USvideos['quantile_popularity'])   # Response
NP = pd.DataFrame(USvideos['Negative Percentage'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(NP, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Multivariate Classification

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(USvideos["quantile_popularity"])
X = pd.DataFrame(USvideos[["likes", "views", "comment_total", "trending days", "dislikes", "average_sentiment", "Positive Percentage", "Negative Percentage"]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set  :", y_test.shape, X_test.shape)

In [None]:
# Draw the distributions of all Predictors
f, axes = plt.subplots(8, 3, figsize=(18, 16))

count = 0
for var in X_train:
    sb.boxplot(X_train[var], orient = "h", ax = axes[count,0])
    sb.distplot(X_train[var], ax = axes[count,1])
    sb.violinplot(X_train[var], ax = axes[count,2])
    count += 1

In [None]:
# Relationship between Response and the Predictors
trainDF = pd.concat([y_train, X_train.reindex(index=y_train.index)], sort = False, axis = 1)

f, axes = plt.subplots(8, 1, figsize=(18, 24))

count = 0
for var in X_train:
    sb.boxplot(x = var, y = "quantile_popularity", data = trainDF, orient = "h", ax = axes[count])
    count += 1

In [None]:
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Random Forest

In [12]:
# Extract Response and Predictors
y = pd.DataFrame(USvideos["quantile_popularity"])
X = pd.DataFrame(USvideos[["likes", "views", "comment_total", "trending days", "dislikes", "average_sentiment", "Positive Percentage", "Negative Percentage"]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set  :", y_test.shape, X_test.shape)

NameError: name 'pd' is not defined

In [None]:
# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 100,  # n_estimators denote number of trees
                                 max_depth = 7)       # set the maximum depth of each tree

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.values.ravel())

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict 0/1 values corresponding to message
y_train_pred = rforest.predict(X_train)
y_test_pred = rforest.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", rforest.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", rforest.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# 2. Great Britain (GB)
- Since the video statistics and comments statistics are in separate CSV files, we will import both files (GBvideos & GBcomments) for analysis.

In [None]:
#import dataset and remove bad lines
GBvideos = pd.read_csv('GBvideos.csv', error_bad_lines=False)
GBvideos

In [None]:
#import dataset and remove bad lines
GBcomments = pd.read_csv('GBcomments.csv', error_bad_lines=False)
GBcomments

# Data Cleaning

In [None]:
#del thumbnail column as it is redundant for classification
GBvideos = GBvideos.drop(columns='thumbnail_link')
GBvideos.head()

In [None]:
#sort videos from most views to least views for visualisation 
GBvideos.sort_values(by=['views'], inplace=True, ascending=False)
GBvideos.head()

In [None]:
GBvideos.describe()

In [None]:
#add new DataFrame for number of duplicates to show the numberof duplicates
duplicates_GB = GBvideos['video_id'].value_counts().rename_axis('video_id').reset_index(name='trending days')
duplicates_GB

In [None]:
#remove duplicates
GBvideos.drop_duplicates(subset ="video_id", keep = 'first', inplace = True)
GBvideos

In [None]:
#show that there are some videos with 0 views
print(len(GBvideos[GBvideos['views'] == 0]))

In [None]:
#show the percentage of videos with 0 views is very small so its is okay to remove
print((len(GBvideos[GBvideos['views']==0])) / len(GBvideos*100), '%', sep='')

In [None]:
#remove rows with 0 views
indexNames = GBvideos[GBvideos['views'] == 0].index
GBvideos.drop(indexNames , inplace=True)

In [None]:
#merging 2 dataframes together, adding number of trending days
GBvideos = pd.merge(GBvideos, duplicates_GB, on = 'video_id')
GBvideos.head()

In [None]:
# take a look at dataset again: 1736 videos change to 1734 videos after removal of 2 videos with 0 views
GBvideos.describe()

In [None]:
# it can be seen that there are some null objects under comment_text
GBcomments.info()

In [None]:
GBcomments.describe()

In [None]:
# take a look at rows which comments are null
print(GBcomments[GBcomments['comment_text'].isnull()])

In [None]:
# number of rows with null comments
len(GBcomments[GBcomments['comment_text'].isnull()])

In [None]:
# get the row index which comment_text are null
indexNames_GB = GBcomments[GBcomments['comment_text'].isnull()].index
indexNames_GB

In [None]:
#drop the rows with null comments
GBcomments.drop(indexNames_GB, inplace=True)

In [None]:
# confirm that the 28 rows are dropped
GBcomments.info()

In [None]:
# find the number of rows in GBvideos whose video_id are NOT in GBcomments
len(GBvideos[~GBvideos.video_id.isin(GBcomments.video_id)])

In [None]:
# find the row index of rows in GBvideos whose video_id are NOT in GBcomments
rowIndex3 = GBvideos[~GBvideos.video_id.isin(GBcomments.video_id)].index
print(rowIndex3)

In [None]:
# drop the rows in GBcomments whose video_id are NOT in GBvideos
GBvideos.drop(rowIndex3, inplace=True)
GBvideos

In [None]:
# take a look at dataset again after 42 videos were dropped
GBvideos.describe()

In [None]:
#take a look at GBcomments dataset
GBcomments.describe()

In [None]:
# find the number of rows in GBcomments whose video_id are NOT in GBvideos
len(GBcomments[~GBcomments.video_id.isin(GBvideos.video_id)])

In [None]:
#set y as likes/views
#sort based on likes over views 
GBvideos['popularity'] = GBvideos.apply(lambda row: 1000*(row.likes/row.views), axis = 1)
GBvideos.head()

In [None]:
#set y as (likes-dislikes)/views
#sort based on (likes-dislikes)/views
GBvideos['popularity2'] = GBvideos.apply(lambda row: 1000*(row.likes-row.dislikes)/row.views, axis = 1)

GBvideos.head()

# Spearman's Correlation Coefficient

In [None]:
#create column for ranking of views 
GBvideos['views_rank'] = GBvideos['views'].rank(ascending = False)
GBvideos.head()

In [None]:
#create column for ranking of likes 
GBvideos['likes_rank'] = GBvideos['likes'].rank(ascending = False)
GBvideos.head()

In [None]:
#create column for ranking of popularity
GBvideos['popularity_rank'] = GBvideos['popularity'].rank(ascending = False)
GBvideos.head()

In [None]:
#create column for ranking of popularity2
GBvideos['popoularity2_rank'] = GBvideos['popularity2'].rank(ascending = False)
GBvideos.head()

In [13]:
#Spearmans correlation coefficient for views and likes
from scipy.stats import spearmanr
coef, p = spearmanr(GBvideos["likes_rank"], GBvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

NameError: name 'GBvideos' is not defined

Since the Spearman's correlation coefficient is not 1.000, which is different from the USvideos, this means that the videos with the highest views may not necessarily be the videos with the highest likes. Hence, we will have to compare the different features separately. 

In [None]:
#Spearmans correlation coefficient for views and popularity
from scipy.stats import spearmanr
coef, p = spearmanr(GBvideos["popularity_rank"], GBvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

In [None]:
#Spearmans correlation coefficient for views and popularity2
from scipy.stats import spearmanr
coef, p = spearmanr(GBvideos["popularity2_rank"], GBvideos["views_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

In [None]:
#Spearmans correlation coefficient for popularity and likes
from scipy.stats import spearmanr
coef, p = spearmanr(GBvideos["likes_rank"], GBvideos["popularity_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

In [None]:
#Spearmans correlation coefficient for popularity2 and likes
from scipy.stats import spearmanr
coef, p = spearmanr(GBvideos["likes_rank"], GBvideos["popularity2_rank"])
print('Spearmans correlation coefficient: %.3f' % coef)

Since the correlation coefficient for the ranking of likes is higher than that of views, it can be said that the popularity of the videos is more reliant on the number of likes rather than views. However, a similarity is that the correlation coefficient for popularity2 is lower for both likes and views, suggesting that the number is normalized. Hence, we will be using popularity2 as the response variable. 

# Basic Data Visualisation

We will be using the same response variable (ie. popularity2 = (likes-dislikes)/views) as derived earlier.

In [None]:
GB_pop = go.Box(x = GBvideos['popularity2'], showlegend = False, name = "popularity2")
py.iplot([GB_pop])

In [None]:
# look at overall dataset
GBvideos.describe()

In [None]:
# 2 points are very far away from the dataset, we will choose to separate them from the classification
anomaly2 = GBvideos[GBvideos['popularity2']==GBvideos['popularity2'].min()]
anomaly3 = GBvideos[GBvideos['popularity2']==GBvideos['popularity2'].max()]
GBvideos = GBvideos.loc[GBvideos['popularity2']!=GBvideos['popularity2'].min()]
GBvideos = GBvideos.loc[GBvideos['popularity2']!=GBvideos['popularity2'].max()]
GBvideos.head()

In [None]:
#boxplot for popularity2 after removing 2 anomaly
GB_pop = go.Box(x = GBvideos['popularity2'], showlegend = False, name = "popularity2")
py.iplot([GB_pop])

#should we remove the maximum one also?

In [None]:
#categorize the popularity2 of the videos into 4 categories according to their quartiles 
GBvideos['quantile_popularity'] = pd.cut(GBvideos['popularity2'], bins = 4, labels = ['low', 'very low', 'very high','high'])
GBvideos.head()

In [None]:
GBvideos['quantile_popularity'].value_counts()

In [None]:
GBvideos.describe()

In [None]:
GBvideos['popularity2'].describe()

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(GBvideos[["views", "likes", "dislikes", "comment_total", "trending days"]]) 
# Plot the Raw Data on 2D grids
sb.pairplot(X)

In [None]:
# Draw the distribution of Response
f, axes = plt.subplots(1, 3, figsize=(24, 6))
sb.boxplot(GBvideos['popularity2'], orient = "h", ax = axes[0], color = "r")
sb.distplot(GBvideos['popularity2'], kde = False, ax = axes[1], color = "r")
sb.violinplot(GBvideos['popularity2'], ax = axes[2], color = "r")

In [None]:
# interactive plot for popularity2
trace = go.Histogram(x = GBvideos['popularity2'], histnorm = 'density')
layout = go.Layout(title = 'Popularity2 Distribution')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

In [None]:
# interactive plot for quantile_popularity
trace = go.Histogram(x = GBvideos['quantile_popularity'], histnorm = 'density')
layout = go.Layout(title = 'Quantile Popularity Distribution')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

# Clustering for Further Data Visualisation
### Bi-variate clustering by Kmeans++ 
- views & popularity2 are used because views are seen as the most straightforward way of predicting popularity in general

In [None]:
# Import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

# Extract the Features from the Data
X = pd.DataFrame(GBvideos[["views", "popularity2"]])

# Set the Initialization to KMeans++
init_algo = 'k-means++'

# Vary the Number of Clusters
min_clust = 1
max_clust = 40

# Compute Within Cluster Sum of Squares
within_ss = []
for num_clust in range(min_clust, max_clust+1):
    kmeans = KMeans(n_clusters = num_clust,        # number of clusters
                    init = init_algo,              # initialization algorithm
                    n_init = 5)                    # number of initializations
    kmeans.fit(X)
    within_ss.append(kmeans.inertia_)

# Plot Within SS vs Number of Clusters
f, axes = plt.subplots(1, 1, figsize=(16,4))
plt.plot(range(min_clust, max_clust+1), within_ss)
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.xticks(np.arange(min_clust, max_clust+1, 1.0))
plt.grid(which='major', axis='y')
plt.show()

In [None]:
# Set "optimal" Number of Clusters
num_clust = 4

# Set the Initialization to KMeans++
init_algo = 'k-means++'

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust, init = init_algo, n_init = 20)                 

# Fit the Clustering Model on the Data
kmeans.fit(X)

# Print the Cluster Centers
print("Features", "\tviews", "\tpopularity2")
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
print()

# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
print()

# Predict the Cluster Labels
labels = kmeans.predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels
sb.countplot(X_labeled["Cluster"])

In [None]:
# Visualize the Clusters in the Data
f, axes = plt.subplots(1, 1, figsize=(16,8))
plt.scatter(x = "views", y = "popularity2", c = "Cluster", cmap = 'viridis', data = X_labeled)

In [None]:
# Boxplots for the Features against the Clusters
f, axes = plt.subplots(2, 1, figsize=(16,8))
sb.boxplot(x = 'views', y = 'Cluster', data = X_labeled, ax = axes[0])
sb.boxplot(x = 'popularity2', y = 'Cluster', data = X_labeled, ax = axes[1])

### Multi-variate Clustering by Kmeans++
- we decided that bi-variate clustering was not enough, thus we decided to use this method to get a better overview of the dataset
- categorical data (ie. Category ID) and non-numerical data (ie comment_text) are not used in clustering 

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(GBvideos[["likes", "views", "comment_total", "dislikes"]]) 


# Plot the Raw Data on 2D grids
sb.pairplot(X)

In [None]:
# Vary the Number of Clusters
min_clust = 1
max_clust = 40
init_algo = 'k-means++'

# Compute Within Cluster Sum of Squares
within_ss = []
for num_clust in range(min_clust, max_clust+1):
    kmeans = KMeans(n_clusters = num_clust, init = init_algo, n_init = 5)
    kmeans.fit(X)
    within_ss.append(kmeans.inertia_)

# Angle Plot : Within SS vs Number of Clusters
f, axes = plt.subplots(1, 1, figsize=(16,4))
plt.plot(range(min_clust, max_clust+1), within_ss)
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.xticks(np.arange(min_clust, max_clust+1, 1.0))
plt.grid(which='major', axis='y')
plt.show()

In [None]:
# Import essential models from sklearn
from sklearn.cluster import KMeans

# Set "optimal" Clustering Parameters
num_clust = 4
init_algo = 'k-means++'

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust,         
               init = init_algo,
               n_init = 20)                 

# Fit the Clustering Model on the Data
kmeans.fit(X)

In [None]:
# Print the Cluster Centers
print("Features", "\tlikes", "\tviews", "\tcomment_total", "\tdislikes",)
print()

for i, center in enumerate(kmeans.cluster_centers_):
    print("Cluster", i, end=":\t")
    for coord in center:
        print(round(coord, 2), end="\t")
    print()
print()

# Print the Within Cluster Sum of Squares
print("Within Cluster Sum of Squares :", kmeans.inertia_)
print()

# Predict the Cluster Labels
labels = kmeans.predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Cluster"] = pd.Categorical(labels)

# Summary of the Cluster Labels
sb.countplot(X_labeled["Cluster"])

In [None]:
# Plot the Clusters on 2D grids
sb.pairplot(X_labeled, vars = X.columns.values, hue = "Cluster")

In [None]:
# Boxplots for all Features against the Clusters
f, axes = plt.subplots(4, 1, figsize=(16,24))
sb.boxplot(x = 'likes', y = 'Cluster', data = X_labeled, ax = axes[0])
sb.boxplot(x = 'views', y = 'Cluster', data = X_labeled, ax = axes[1])
sb.boxplot(x = 'comment_total', y = 'Cluster', data = X_labeled, ax = axes[2])
sb.boxplot(x = 'dislikes', y = 'Cluster', data = X_labeled, ax = axes[3])

In [None]:
# Average Behaviour of each Cluster
cluster_data = pd.DataFrame(X_labeled.groupby(by = "Cluster").mean())
cluster_data.plot.bar(figsize = (16,6))

# Feature 1: Number of Likes

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
likes = pd.DataFrame(GBvideos['likes'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(likes, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 3: Number of Dislikes

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
dislikes = pd.DataFrame(GBvideos['dislikes'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(dislikes, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)


# Feature 4: Number of Trending Days

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
TD = pd.DataFrame(GBvideos['trending days'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(TD, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 5: Total Number of Comments

In [None]:
# Recall the Legendary-Total Dataset
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
CT = pd.DataFrame(GBvideos['comment_total'])       # Predictor

# Split the Legendary-Total Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(CT, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Legendary values corresponding to Total
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)


# Feature 6: Average Sentiment Score

In [None]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
#Installing emot library
!pip install emot
#Importing libraries
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS


In [None]:
# Function for converting emojis into word
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, " ".join(UNICODE_EMO[emot].replace(",","").replace("_"," ").replace(":","").split()))
    return text

In [14]:
GBcomments['processed text'] = GBcomments['comment_text'].apply(lambda x: convert_emojis(x))
GBcomments

NameError: name 'GBcomments' is not defined

In [None]:
# Function for converting emoticons into word
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', " ".join(EMOTICONS[emot].replace(",","").replace("_"," ").split()), text)
    return text

In [None]:
GBcomments['processed text'] = GBcomments['processed text'].apply(lambda x: convert_emoticons(x))

In [None]:
GBcomments

In [None]:
#add column of sentiment scores for visualisation
GBcomments['pos_score'] = GBcomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['pos'])
GBcomments['neg_score'] = GBcomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neg'])
GBcomments['neu_score'] = GBcomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neu'])

# we will use the compound score for further analysis. here the compound score is renamed as sentiment score
GBcomments['sentiment_scores'] = GBcomments['comment_text'].apply(lambda x:sia.polarity_scores(x)['compound'])
GBcomments.head()

In [None]:
#create new dataframe for number of comments per video
number_comments = GBcomments['video_id'].value_counts().rename_axis('video_id').reset_index(name='number_comments')
number_comments

In [None]:
#create new dataframe for total number of positive comments per video
total_GB = GBcomments.groupby(['video_id'], sort = False).sum()
total_GB

In [None]:
#merge total and number_comments dataframes 
total_GB = pd.merge(total_GB, number_comments, on = 'video_id')
total_GB = total_GB.drop(columns='likes')
total_GB = total_GB.drop(columns='replies')
total_GB.head()

In [None]:
#add column of average sentiment
total_GB['average_sentiment'] = total_GB['sentiment_scores'].div(total_GB['number_comments'].values, axis=0)
total_GB.head()

In [None]:
#add average sentiment column into GBvideos dataframe
GBvideos = pd.merge(GBvideos, total_GB, on = 'video_id')
GBvideos.head(30)

In [None]:
all_words = ' '.join([text for text in GBcomments['comment_text']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
AS = pd.DataFrame(GBvideos['average_sentiment'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(AS, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Feature 7: Percentage of Positive Comments over Total Number of Comments

In [None]:
#Categorize positive, negative, neutral
GBcomments['Sentiment'] = GBcomments['sentiment_scores'].apply(lambda s : 'Positive' if s > 0 else ('Neutral' if s == 0 else 'Negative'))
GBcomments.head(20)

In [None]:
#percentage of comments which are positive in all the videos
positive_percent = []
for i in range(0,GBcomments.video_id.nunique()):
    a = GBcomments[(GBcomments.video_id == GBcomments.video_id.unique()[i]) & (GBcomments.Sentiment == 'Positive')].count()[0]
    b = GBcomments[GBcomments.video_id == GBcomments.video_id.unique()[i]]['Sentiment'].value_counts().sum()
    Percentage = (a/b)*100
    positive_percent.append(round(Percentage,2))

positive_percent

In [None]:
#Creating dataframe for positive percentage
positive_percentage = pd.DataFrame(positive_percent,GBcomments.video_id.unique()).reset_index()
positive_percentage.columns = ['video_id','Positive Percentage']
positive_percentage

In [None]:
#add positive percentage column into USvideos dataframe
GBvideos = pd.merge(GBvideos, positive_percentage, on = 'video_id')
GBvideos.head()

In [None]:
all_words_posi = ' '.join([text for text in GBcomments['comment_text'][GBcomments.Sentiment == 'Positive']])

In [None]:
# visulise words used in positive comments
wordcloud_posi = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words_posi)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_posi, interpolation="bilinear")
plt.axis('off')
plt.show()

In [15]:
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
PP = pd.DataFrame(GBvideos['Positive Percentage'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(PP, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

NameError: name 'pd' is not defined

# Feature 8: Percentage of Negative Comments over Total Number of Comments

In [None]:
#percentage of comments which are negative in all the videos
negative_percent = []
for i in range(0,GBcomments.video_id.nunique()):
    a = GBcomments[(GBcomments.video_id == GBcomments.video_id.unique()[i]) & (GBcomments.Sentiment == 'Negative')].count()[0]
    b = GBcomments[GBcomments.video_id == GBcomments.video_id.unique()[i]]['Sentiment'].value_counts().sum()
    Percentage = (a/b)*100
    negative_percent.append(round(Percentage,2))

negative_percent

In [None]:
#Creating dataframe for negative percentage
negative_percentage = pd.DataFrame(negative_percent,GBcomments.video_id.unique()).reset_index()
negative_percentage.columns = ['video_id','Negative Percentage']
negative_percentage

In [None]:
#add negative percentage column into USvideos dataframe
GBvideos = pd.merge(GBvideos, negative_percentage, on = 'video_id')
GBvideos.head()

In [None]:
all_words_nega = ' '.join([text for text in GBcomments['comment_text'][GBcomments.Sentiment == 'Negative']])

In [None]:
# visualise words used in negative comments
wordcloud_nega = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words_nega)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_nega, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
QP = pd.DataFrame(GBvideos['quantile_popularity'])   # Response
NP = pd.DataFrame(GBvideos['Negative Percentage'])       # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(NP, QP, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 7)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict response values corresponding to predictor
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Multivariate Classification

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(GBvideos["quantile_popularity"])
X = pd.DataFrame(GBvideos[["likes", "views", "comment_total", "dislikes", "average_sentiment", "trending days", "Positive Percentage", "Negative Percentage"]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set  :", y_test.shape, X_test.shape)

In [None]:
# Draw the distributions of all Predictors
f, axes = plt.subplots(8, 3, figsize=(18, 16))

count = 0
for var in X_train:
    sb.boxplot(X_train[var], orient = "h", ax = axes[count,0])
    sb.distplot(X_train[var], ax = axes[count,1])
    sb.violinplot(X_train[var], ax = axes[count,2])
    count += 1

In [None]:
# Relationship between Response and the Predictors
trainDF = pd.concat([y_train, X_train.reindex(index=y_train.index)], sort = False, axis = 1)

f, axes = plt.subplots(8, 1, figsize=(18, 24))

count = 0
for var in X_train:
    sb.boxplot(x = var, y = "quantile_popularity", data = trainDF, orient = "h", ax = axes[count])
    count += 1

In [None]:
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 10)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
treedot = export_graphviz(dectree,                                      # the model
                          feature_names = X_train.columns,              # the features 
                          out_file = None,                              # output file
                          filled = True,                                # node colors
                          rounded = True,                               # make pretty
                          special_characters = True)                    # postscript

graphviz.Source(treedot)

# Random Forest

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(GBvideos["quantile_popularity"])
X = pd.DataFrame(GBvideos[["likes", "views", "comment_total", "dislikes", "average_sentiment","trending days", "Positive Percentage", "Negative Percentage"]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set  :", y_test.shape, X_test.shape)

In [None]:
# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 100,  # n_estimators denote number of trees
                                 max_depth = 10)  # set the maximum depth of each tree

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.values.ravel())

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict 0/1 values corresponding to message
y_train_pred = rforest.predict(X_train)
y_test_pred = rforest.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", rforest.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", rforest.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Analysis

### 1. Comparing US and GB Datasets

In [None]:
print("Number of unique USvideos:", USvideos['video_id'].count())
print("Number of unique GBvideos:", GBvideos['video_id'].count())

In [None]:
x0 = USvideos['popularity2']
x1 = GBvideos['popularity2']

fig = go.Figure()
# Use x instead of y argument for horizontal plot
fig.add_trace(go.Box(x=x0, name = "USvideos"))
fig.add_trace(go.Box(x=x1, name = "GBvideos"))

fig.show()

In [None]:
x0 = USvideos['popularity2']
x1 = GBvideos['popularity2']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

### 2. Choice of Response

#### 2a) Definition

We defined popular as **(likes-dislikes)/views** (termed as **popularity2** in our dataset).

This was subjective as we based it on our instincts as to what might be the best response to use for this dataset. However, we felt that this was the most appropriate response after carrying out Spearman Correlation Ranking as we have normalised it.

#### 2b) Spearman’s rank correlation vs Pearson’s rank correlation? 
The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
Since we are trying to find the best response variable for our project, we want to use a response variable that is based on the relationships of the variables rather than the raw data since the raw data between the most views and next most views could fluctuate a lot.

### 3. Duplicates

By sorting the dataset according to views (highest at the top), we found out that there are duplicted videos. Since the dataset consists of videos that have been trending for 30 days, duplicated videos happen because some videos have been trending for more than 1 day. Hence, we took into account of this and removed all the duplicated videos, leaving only the latest one. 

### 4. Categorical Data for Response

We have also recognised that single, multivariate decision trees require the response (ie popularity2) to be discrete/categorised. However, popularity2 is a continuous variable.

Hence, we brought in **quantile_popularity** which splits popularity2 into 4 catogories by their quartile. Each quartile has the same range, but might not have the same number of videos. The categories are: "very low", "low", "high", "very high".

However, one limitation of quantile_popularity is that there is an **uneven distribution** of videos in the 4 categories for quantile_popularity. As can be seen below, the category "high", has only 16 videos. 

This meant that the trained data for this category might not be very reliable and accurate. Results might be more biased towards videos with "very low" quantile_popularity as there is a larger dataset for it to be trained and tested on.

In [None]:
# interactive plot for quantile_popularity
trace = go.Histogram(x = USvideos['quantile_popularity'], histnorm = 'density')
layout = go.Layout(title = 'Quantile Popularity Distribution')
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

### 5. Anomaly
#### 5a) Separate Analysis for anomaly1 in USvideos

In [None]:
anomaly1

From the value of popularity2 and the difference between likes and dislikes, we can infer that this video is not well received. Let's take a look at the comments below.

In [None]:
#Analysis on Anomaly1

anomaly1_comments = UScomments.loc[UScomments['video_id']== anomaly1.iloc[0]['video_id']]
anomaly1_comments

In [None]:
# Function to convert emoji into words
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, " ".join(UNICODE_EMO[emot].replace(",","").replace("_"," ").replace(":","").split()))
    return text

In [None]:
anomaly1_comments['processed text'] = anomaly1_comments['comment_text'].apply(lambda x: convert_emojis(x))
anomaly1_comments.head(30)

In [None]:
# Function for converting emoticons into word
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', " ".join(EMOTICONS[emot].replace(",","").replace("_"," ").split()), text)
    return text
anomaly1_comments['processed text'] = anomaly1_comments['processed text'].apply(lambda x: convert_emoticons(x))

In [None]:
#add column of sentiment scores for visualisation
anomaly1_comments['pos_score'] = anomaly1_comments['comment_text'].apply(lambda x:sia.polarity_scores(x)['pos'])
anomaly1_comments['neg_score'] = anomaly1_comments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neg'])
anomaly1_comments['neu_score'] = anomaly1_comments['comment_text'].apply(lambda x:sia.polarity_scores(x)['neu'])

# we will use the compound score for further analysis. here the compound score is renamed as sentiment score
anomaly1_comments['sentiment_scores'] = anomaly1_comments['comment_text'].apply(lambda x:sia.polarity_scores(x)['compound'])
anomaly1_comments.head()

In [None]:
#create new dataframe for number of comments per video
number_comments1 = anomaly1_comments['video_id'].value_counts().rename_axis('video_id').reset_index(name='number_comments1')
number_comments1

In [None]:
#create new dataframe for total number of positive comments per video
anomaly1_total = anomaly1_comments.groupby(['video_id'], sort = False).sum()
anomaly1_total

In [None]:
#merge anomaly1_total and number_comments dataframes 
anomaly1_total = pd.merge(anomaly1_total, number_comments1, on = 'video_id')
anomaly1_total.head()

In [None]:
#add column of average sentiment
anomaly1_total['average_sentiment'] = anomaly1_total['sentiment_scores'].div(anomaly1_total['number_comments1'].values, axis=0)
anomaly1_total.head()

In [None]:
# words used in this anomaly1 video
all_words = ' '.join([text for text in anomaly1_comments['comment_text']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

After analysing the statistics of this video in detail, it can be seen that this video is about the politics in US. It can be a prime example of a video that is not well received. However, we chose to exclude this video in our classification as we considered this video as an anomaly. The popularity2 value is pretty extreme and might skew our data. That being said, if this project was about finding out which features were important in classifying videos that are not well received, this video might be important in the classification.

#### 5b) Univariate Anomaly Detection

We decided to ultimately remove 1 anomaly via box-plot visualisation as explained above. However, observation might not be the best justification for anomalies. Thus, to be more sure that the point we removed is indeed an anomaly, one improvement that could have been done was to use univarite anomaly detection method shown below.

In [None]:
plt.scatter(range(USvideos.shape[0]), np.sort(USvideos['popularity2'].values))
plt.xlabel('index')
plt.ylabel('popularity2')
plt.title("Popularity2 Distribution")
sb.despine()

In [None]:
sb.distplot(USvideos['popularity2'])
plt.title("Distribution of Popularity2")
sb.despine()

In [None]:
#Using IsolationForest
from sklearn.ensemble import IsolationForest
isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(USvideos['popularity2'].values.reshape(-1, 1))
xx = np.linspace(USvideos['popularity2'].min(), USvideos['popularity2'].max(), len(USvideos)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('popularity2')
plt.show();

#### 5c) Multivariate Anomaly Detection

Another method is to set up a Multi-Variate Anomaly Detection problem on the USvideos Dataset.   
Features : **views, likes, dislikes, comment_total, sentiment_scores, trending days, positive percentage, negative percentage**  

In [None]:
USvideos.head()

In [None]:
# Extract the Features from the Data
X = pd.DataFrame(USvideos[["views", "likes", "dislikes", "comment_total", "trending days", "average_sentiment", "Positive Percentage", "Negative Percentage"]]) 

# Plot the Raw Data on 2D grids
sb.pairplot(X)

In [None]:
# Import LocalOutlierFactor from sklearn.neighbors
from sklearn.neighbors import LocalOutlierFactor

# Set the Parameters for Neighborhood
num_neighbors = 20      # Number of Neighbors
cont_fraction = 0.05    # Fraction of Anomalies

# Create Anomaly Detection Model using LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = num_neighbors, contamination = cont_fraction)

# Fit the Model on the Data and Predict Anomalies
lof.fit(X)

In [None]:
# Predict the Anomalies
labels = lof.fit_predict(X)

# Append Labels to the Data
X_labeled = X.copy()
X_labeled["Anomaly"] = pd.Categorical(labels)

# Summary of the Anomaly Labels
sb.countplot(X_labeled["Anomaly"])

In [None]:
# Visualize the Anomalies in the Data
sb.pairplot(X_labeled, vars = X.columns.values, hue = "Anomaly")

The same could be applied to GBvideos.

### 6. Number of Views

It might seem that high views would mean high popularity. However, as can be seen, the univariate classification accuracy for views is only around 63%. This shows that views alone is not a good indicator of popularity. A few other factors come together to determine popularity.

In [None]:
x0 = USvideos['views']
x1 = GBvideos['views']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

The classification accuracy for both USvideos and GBvideos are similar, at 0.64. This shows that views as a features performs equally in both datasets.

However, 'views' as a feature alone is not strong enough as the classification accuracy of 0.64 is pretty low. This reinforces high views does not mean high popularity.

### 7. Number of Likes and Dislikes

In [None]:
x0 = USvideos['likes']
x1 = GBvideos['likes']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

Taking note that there are fewer GBvideos and that the number of likes GBvideos get are moderately higher, this might play a part in the importance of the feature 'likes'. As derived above, the classification accuracy for the feature 'likes' for USvideos is around 0.752 and that of GBvideos is 0.676. 

This might be because the number of likes is moderately high for a fewer number of videos and hence 'likes' as a feature is not as significant for GBvideos. Hence, the lower accuracy.

In [None]:
x0 = USvideos['dislikes']
x1 = GBvideos['dislikes']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

However, 'dislikes' as a feature, on the other hand, is pretty similar across both datasets. Classification accuracy for USvideos is 0.646 and that if GBvideos is 0.630.

### 8. Trending Days

In [None]:
x0 = USvideos['trending days']
x1 = GBvideos['trending days']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

There are some videos that were trending for a few days. To improve on the classification, we could have performed a more indepth analysis of why these videos were trending for so long.

That being said, 'trending days' as a feature is not a strong feature as well. Classification accuracy is at 0.552 for USvideos and 0.619 for GBvideos. This might be because the youtube trending algorithm is different and based on different aspects of the video, hence the number of trending days is reliant on the popularity of the video, and not the other way round. 

### 9. Number of Comments

In [None]:
x0 = USvideos['comment_total']
x1 = GBvideos['comment_total']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

Classification accuracy for USvideos is 0.684 and that for GBvideos is 0.635. Many factors might contribute to the difference in accuracies. One might be the higher number of data in USvideos and this allows the dataset to be better trained.

### 10. Average Sentiment Score
##### 10a) Translation
TextBlob analyses other language comments but it is a paid service. It only analyses up to 500 words per day. Hence, we could not implement text translation into our project. However, if possible, the codes below can be added in for translation, so that our dataset can be more robust in terms of the sentimental analysis feature.

In [None]:
#take a look at how textblob translate languages

from textblob import TextBlob

blob = TextBlob("Ä°lk baÅŸta gÃ¶sterilen yerde tÃ¼rkÃ§e olarak bÃ¼yÃ¼k karakterlerle kongre merkezi yazÄ±lmÄ±ÅŸ olmasÄ± da ilginÃ§.")

In [None]:
blob.detect_language()

In [None]:
blob.translate(to= 'en')

In [None]:
#make a function that can translate non-english comments to english

def translate_comment(text):
    blob = TextBlob(text)
    if blob.detect_language() == 'en':
        return text
    else: 
        return blob.translate(to= 'en')

In [None]:
#try out the function

translate_comment("Ä°lk baÅŸta gÃ¶sterilen yerde tÃ¼rkÃ§e olarak bÃ¼yÃ¼k karakterlerle kongre merkezi yazÄ±lmÄ±ÅŸ olmasÄ± da ilginÃ§.")

In [None]:
# #apply to the whole dataset

# UScomments[‘processed text’] = US[‘processed text’].apply(lambda x: translate_comment(x))

However, as mentioned earlier, this code cannot be run on our dataset.

In [None]:
# check if sentiment analyser works on foreign languages
analyser = SentimentIntensityAnalyzer()
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))
print(sentiment_analyzer_scores('Ä°lk baÅŸta gÃ¶sterilen yerde tÃ¼rkÃ§e olarak bÃ¼yÃ¼k karakterlerle kongre merkezi yazÄ±lmÄ±ÅŸ olmasÄ± da ilginÃ§.'))

In [None]:
# check with the english translation
print(sentiment_analyzer_scores("It is also interesting that the congress center was written with great characters in the place shown in the first place."))

It is shown that setimental analysis do not work with foreign languages and hence those comments with foreign languages will have a sentiment score of 0. Thus, we have included 2 features, namely Percentage of Positive Comments and Percentage of Negative Comments to help us make our model more robust.

##### 10b) Spelling Check

We attempted to load spell check using text blob. However, dataset is too huge and alot of time will be taken. Hence, we could not carry out spelling check on our dataset. 

In [None]:
# spelling check by first removing letters that are repeated >2 times consecutively
# english words have a max of 2 letters that are repeeated one after another

def reduce_lengthening(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

#example
sentence = "she is amazzzinggg"
updated = reduce_lengthening(sentence)
print(updated)

In [None]:
UScomments['processed text'] = UScomments['processed text'].apply(lambda x: reduce_lengthening(str(x)))
UScomments

In [None]:
# use TextBlob to do spelling correction
from textblob import TextBlob

#example
print(str(TextBlob(updated).correct()))

In [None]:
#carry out spelling correction for first 5 rows
#can be seen that it is not very accurate
#slangs are not corrected correctly
UScomments['comment_text'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [None]:
# #do for whole dataset

# UScomments['spell checked text'] = UScomments['comment_text'].apply(lambda x: str(TextBlob(x).correct()))
# UScomments

However, that being said, after conducting spelling check for the first 5 rows for exploration, it can be seen that spelling check by TextBlob is not very accurate. Hence, spell check might not be of high significance to the classification.

#### 10c) Reason for not cleaning text data

We handpicked Vader Sentiment because it is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. We realised that Vader is a relatively sensitive tool that takes into account several characteristics of a text, therefore we decided not to proceed on with the text data cleaning.

##### i) Uppercase letters
Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words, increases the magnitude of the sentiment intensity

In [None]:
# #Changing the text to lower case
# UScomments['comment_text'] = UScomments['comment_text'].apply(lambda x:x.lower())

##### ii) Conjunctions
Use of conjunctions like “but” signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The food here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating. (Decided not to lemmatize or remove stopwords)

In [None]:
# #Lemmatization
# from nltk.stem import WordNetLemmatizer
# from nltk.corpus import stopwords

# wnl = WordNetLemmatizer()

# tokenized_UScomments.apply(lambda x: [wnl.lemmatize(i) for i in x if i not in set(stopwords.words('english'))]) 
# tokenized_UScomments.head()


##### iii) Punctuation, Special Characters, Numbers

We found out that these symbols are usually part of emoticons that is an essential feature of determining the sentiment of the comment. Emoticons are used extensively in social media texts (in this case 'youtube comments')


In [None]:
# #Removing Punctuations, Numbers and Special Characters

# UScomments['comment_text'] = UScomments['comment_text'].str.replace("[^a-zA-Z#]", " ")

Other challenges faced conducting sentiment analysis
- Context and polarity
- Irony and sarcasm
- Comparison
- Defining what is neutral


#### 10d) Use of Tags

We have found out that tags do not contribute much to the classification, as they have a very low correlation, as seen from the very low R-Squared value. Hence we are excluding tags.

In [None]:
#Statistical imports
import statsmodels.api as sm
from sklearn.preprocessing import MultiLabelBinarizer
from pprint import pprint
#Returns a sorted histogram dataframe (with top_n rows) for a given list.
def form_hist(given_list,top_n):

    item_set = set(given_list)
    items = []
    counts = []
    for nm in item_set:
        items.append(nm)
        counts.append(given_list.count(nm))
    return pd.DataFrame({'count':counts,'items':items}).sort_values(by='count',ascending=False).head(top_n)

In [None]:
def top20_UStags(videos, num, title):

    all_tags = videos['tags'].map(lambda k: k.lower().split('|')).values
    all_tags = [item for sublist in all_tags for item in sublist]

    counts = form_hist(all_tags,num)
    counts.columns = ['count','tags']
    plt.figure()
    sb.barplot(x = counts['tags'], y = counts['count'])
    plt.xticks(rotation=90)
    plt.ylabel('count')
    plt.title(title)

top20_UStags(USvideos,20,'Top 20 US Tags')

In [None]:
def tags_as_feature(videos, k):
    #Determine the top k tags
    videos = videos.copy()
    all_tags = videos['tags'].map(lambda k: k.lower().split('|'))
    all_tags = [item for sublist in all_tags for item in sublist]
    counts = form_hist(all_tags,k)
    top_tags = counts['items'].values[:k]

    def filter_f(x):
        x = x.lower().split('|')
        return [e for e in x if e in top_tags]

    #Reduce tags to only the most frequent ones
    videos['tags'] = videos['tags'].map(filter_f)

    #Convert our data into the design matrix
    mlb = MultiLabelBinarizer()
    design = mlb.fit_transform(videos['tags'])
    design = sm.add_constant(design)


    #Fit linear regression
    ols = sm.OLS(videos['views'].values, design)
    fitting = ols.fit()
    labels = ['intercept'] + list(mlb.classes_)
    return fitting.summary(), labels

top_n_tags = 20

USsummary, uslabels = tags_as_feature(USvideos, top_n_tags)
print("US VIDEOS")
pprint(USsummary)

#### 10e) Vader (Rule-Based) vs other Machine Learning-Automatic Approach like Naive Bayes, SVM etc
Advantage of Vader: 
- Doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon
- works exceedingly well on social media type text, yet readily generalizes to multiple domains
- fast enough to be used online with streaming data

Disadvantage of ML:
- depend on the training set to represent as many features as possible (which often, they do not – especially in the case of the short, sparse text of social media)


### 11. Positive Percentage and Negative Perentage

In [None]:
x0 = USvideos['Positive Percentage']
x1 = GBvideos['Positive Percentage']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

Classification accuracy is around 0.64 for USvideos and 0.72 for GBvideos. It is also noted that the classification accuracy for percentage of positive comments vary a lot between US and GB (65% vs 72%), but similar for percentage of negative comments (both 70%). This might be because words of negative sentiment might be more unique hence they might perform better with sentiment analysis.


In [None]:
x0 = USvideos['Negative Percentage']
x1 = GBvideos['Negative Percentage']

fig = go.Figure()
fig.add_trace(go.Histogram(x=x0 , name = "USvideos"))
fig.add_trace(go.Histogram(x=x1, name = "GBvideos"))
# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

Classification accuracy is similar at around 0.70.

### 12. Category ID
#### 12a) Visualisation

In [None]:
# show the categories of videos that are present in dataset
USvideos.category_id.value_counts()

In [None]:
len(USvideos.category_id.unique())

In [None]:
USvideos.category_id.value_counts().plot(kind='pie', autopct='%1.0f%%')

In [None]:
#import the old category id list
print("Category ID List used for this dataset:")
OCI = pd.read_csv('old_category_id.csv')
OCI

From the above category list, it can be seen that Category ID 24 which corresponds to **Entertainment** has the highest count. This might seem expected as people usually watch entertainment videos on YouTube, hence contributing to the high value count. 

However, on further inspection, it is also good to take note that Entertainment is actually a general category for all the other sub-categories like comedy, shows, music. Hence, video creators might have just chosen Entertainment as it is a broader cateogory. 

Thus, deriving Entertainment as the category of highest count might not be very useful in classification. It might be better if we are getting the genres (eg horror, comedy, sports) instead.

In the recent years, YouTube has also developed a new Cateogry ID List, hence this classification for Category ID might not be applicable for current videos. The new list can be seen below.

In [None]:
#import the updated category id list
print("Updated Category ID List used currently by YouTube (which this dataset does not use):")
UCI = pd.read_csv('updated_category_id.csv')
UCI

We attempted to use **one hot encoding** for the classification of Category IDs to ensure that the data is normalised.

#### 12b) One Hot Encoding

In [None]:
# importing one hot encoder 
from sklearn.preprocessing import OneHotEncoder
# creating one hot encoder object 
onehotencoder = OneHotEncoder()
#reshape the 1-D country array to 2-D as fit_transform expects 2-D and finally fit the object 
X = onehotencoder.fit_transform(USvideos.category_id.values.reshape(-1,1)).toarray()
#To add this back into the original dataframe 
dfOneHot = pd.DataFrame(X, columns = ["category_id_"+str(int(i)) for i in range(0,16)]) 
USvideos = pd.concat([USvideos, dfOneHot], axis=1)
#printing to verify 
USvideos.head()

However, we realised that although one hot encoding is a decent tool to normalise categorised variables, categorised variables itself is not a good feature of Decision Tree classifications. By one-hot encoding a categorical variable, we are inducing sparsity into the dataset and that is undesirable.

During the splitting algorithm, the answer will be a yes or no (ie binary). Thus, the tree will only grow in one direction. One hot encoding categorical variables with high cardinality can cause inefficiency in tree-based classifications. 

Continuous variables will be given more importance than the categorical variables by the algorithm. Categorical variables will obscure the order of feature importance resulting in poorer performance.

### 13. Choice of Model

When deciding on the most suitable ML Model for our project, we considered Logistic Regression as well as Support Vector Machine (SVM). However, both are intended for binary (two-class) classification problems, which does not fit our 4 quantile model.

# Conclusion

Overall, it is to be acknowledged that the dataset itself has **limitations**. Some of the features are somehow correlated with each other. Higher trending days count might be due to having high number of views and likes etc. Hence, we attempted to normalise the dataset by choosing the most appropriate response indicator (ie (likes-dislikes)/views)

Other improvements that could have been done for this project could have been to include the analysis of likes and replies for each comment. If this project could be brought further, it could help in the recommendation of YouTube videos to users in different countries.

Ranking of Features for **USvideos**:
    1. Number of Likes
    2. Percentage of Negative Comments over Total Number of Comments
    3. Average Sentiment
    4. Total Number of Comments
    5. Percentage of Positive Comments over Total Number of Comments
    6. Number of Dislikes
    7. Number of Views
    8. Trending Days
    

Ranking of Features for **GBvideos**:
    1. Percentage of Positive Comments over Total Number of Comments
    2. Average Sentiment
    3. Percentage of Negative Comments over Total Number of Comments
    4. Number of Likes
    5. Number of views
    6. Total Number of Comments
    7. Number of Dislikes
    8. Trending Days
    

The ranking above shows that the features might be of different importance in different countries. This might be due to the cultural factors. For example, the way people type the comments - the tone, emojis and slangs might play a part in sentimental analysis.

If possible, a different recommendation system can be used for US and GB as the types of videos that Americans prefer watching might be different from those that British prefer to watch, or the range of videos or genres that the people watch are different. 

Multi-variate Decision Tree is better than the single decision tree as it combines all the features, hence all features can be used at different levels of the tree to give a high classification accuracy.

Random forest is able to generalize much better to the testing data than the single decision tree or the multi-variate decision tree. The random forest has lower variance while maintaining the low bias of a decision tree. This is because the random forest is essentially a collection of decision trees.

A decision tree is built on an entire dataset, using all the features as mentioned earlier whereas a random forest randomly selects specific features to build multiple decision trees from and then averages the results.

Thus, we can conclude that the random forest will give the highest classification accuracy and that the 8 features listed above will serve to provide better predications in classifications. From the rankings above, it can be seen that average sentiment is a pretty good feature in classsifcation.

# Datasets
- https://www.kaggle.com/datasnaek/youtube#GBcomments.csv
- https://www.kaggle.com/datasnaek/youtube-new#GBvideos.csv 

# References

- https://techpostplus.com/2019/04/26/youtube-video-categories-list-faqs-and-solutions/
- https://gist.github.com/dgp/1b24bf2961521bd75d6c
- https://www.kaggle.com/minc33/k-means-clustering-vs-logistic-regression
- https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
- https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
- https://towardsdatascience.com/unsupervised-learning-clustering-algorithms-5b290967f746
- https://data-flair.training/blogs/k-means-clustering-tutorial/
- https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
- https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
- https://medium.com/towards-artificial-intelligence/emoticon-and-emoji-in-text-mining-7392c49f596a
- http://datameetsmedia.com/staging/3908/vader-sentiment-analysis-explained/
- https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/
- https://textblob.readthedocs.io/en/dev/quickstart.html#translation-and-language-detection
- https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/
