# ADA final exam (winter semester 2019/2020)

A friend of yours wants to start a YouTube channel and ideally earn some money via ads. However, there are so many channels and videos out there that your friend has no idea where to even start. Fortunately, they know that you have taken ADA and think you might help them out by analyzing the videos that are currently on YouTube.

The data you are provided with is a subset of YouTube videos, with videos from some of the giant channels in two categories: "Gaming" and "How-to & Style", which are the categories your friend is choosing between. The dataset contains a lot of videos, with data on those videos including their titles, their total number of views in 2019, their tags and descriptions, etc. The data is, in gzip-compressed format, contained in the `data/` folder, as the file `youtube.csv.gz`.

The three tasks A, B and C are **independent** of each other, and you can solve any combination of them. The exam is designed for more than 3 hours, so don't worry if you don't manage to solve everything; you can still score a 6.

You need to run the following two cells to read and prepare the dataset.

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt  # needed for the plt.xlabel / plt.ylabel calls in Task A

In [None]:
youtube = pd.read_csv('data/youtube.csv.gz', compression='gzip')
youtube.upload_date = pd.to_datetime(youtube.upload_date)

In [None]:
youtube.head(5)

In [None]:
youtube.info()

## Dataset description

Each row of the dataset corresponds to one video that was uploaded to YouTube. There are 11 columns:
`channel`, `upload_date`, `title`, `categories`, `tags`, `duration`, `view_count`, `average_rating`, `height`, `width`, `channel_cat`.
- `channel`: The channel (account) on which the video was uploaded.
- `upload_date`: The date on which the video was uploaded (Pandas Timestamp object).
- `title`: The title of the video.
- `tags`: A list of words that describe the video.
- `duration`: The duration of the video in seconds.
- `view_count`: The number of times the video was watched.
- `average_rating`: The average score with which the viewers rated the video (1-5).
- `height`: The height of the video in pixels.
- `width`: The width of the video in pixels.
- `channel_cat`: The category of the channel on which this video was uploaded. This dataset only contains videos from channels from the 'Gaming' and the 'Howto & Style' category.

# Task A: Welcome to the exam!

All of Task A refers to the videos that were published between and including 2010 and 2018.

## A1: A growing platform?

You would first like to know whether YouTube in general is the right platform to invest time into.

1. Using the appropriate plot type, plot the number of videos published per year between and including 2010 and 2018.

**Comment** Select only the videos uploaded between 2010 and 2018 (both years included).

In [None]:
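range_10_18 = range(2010,2019)
# Extract the upload year with the vectorized .dt accessor
youtube['year'] = youtube.upload_date.dt.year
# .copy() avoids a SettingWithCopyWarning if columns are added to the subset later
youtube_10_18 = youtube[youtube.year.isin(range_10_18)].copy()
youtube_10_18.head(3)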

In [None]:
youtube_10_18.groupby(by='year').agg("count").reset_index().plot.bar(x='year', y='channel', rot=0)

2. Now for each year, plot the number of channels that have been created between the beginning of 2010 and the end of that year. A channel is considered to be created at the time at which they upload their first video.

In [None]:
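# Keep only channels whose first upload is in 2010 or later (drop channels with any upload before 2010)
df = youtube[~youtube.channel.isin(youtube.channel[youtube.upload_date < np.datetime64('2010-01-01')].unique())]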
#youtube_cat = new_channels[(new_channels.channel_cat.isin(['Gaming', 'Howto & Style']))&(new_channels.year.isin(list(range_10_18)))][['channel','year']]
#youtube_cat.sort_values(by ='year').drop_duplicates(subset='channel', keep='first')

In [None]:
# Year of the first upload of each channel (df only contains channels created in 2010 or later)
first_year = df.groupby('channel').year.min()
# Count the channels created per year, fill years without new channels with 0, then accumulate
df_cleaned = first_year[first_year.between(2010, 2018)]\
                                    .value_counts()\
                                    .reindex(range(2010, 2018+1), fill_value=0)\
                                    .cumsum()\
                                    .rename('channel').rename_axis('year').to_frame()

df_cleaned.plot(kind='bar')
plt.xlabel('year')
plt.ylabel('channels')

3. Normalize the number of videos published each year by the number of channels that have been created between the beginning of 2010 and the end of that year, and plot these quantities. Do separate plots for gaming channels, how-to channels, and both together. Can you conclude from the plot that both gaming and how-to channels have been becoming less and less active recently? Why, or why not?

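**Comment** Divide the number of videos published each year by the cumulative number of channels created between 2010 and that year.

In [None]:
# Channels created per year: the year of each channel's first upload within 2010-2018
df = youtube_10_18.sort_values(by='upload_date', ascending=True)\
                .drop_duplicates(subset='channel', keep='first')\
                .groupby(by='year').agg(count_=pd.NamedAgg(column="channel", aggfunc="count")).reset_index()

In [None]:
# Cumulative number of channels created up to and including each year
df['cum_channels'] = df.count_.cumsum()
# Number of videos published in each year
df['videos'] = df.year.map(youtube_10_18.groupby('year').size())

In [None]:
df['norm'] = df.videos/df.cum_channels

In [None]:
df.plot.bar(x='year', y='norm', rot=0)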

## A2: The one thing we all love: cash money

Your friend is really keen on making money from their YouTube channel through ads and wants you to help them choose the most profitable channel category (Gaming or Howto & Style). The ad profit is directly proportional to the number of views of a video.

1. Since your friend wants to keep producing videos for several years to come, it might also be worth looking at the growth of the two categories.
  1. Compute the total number of views in each category per year for the years 2010-2018.
  2. Divide the yearly view count by the number of channels that posted a video in each category in each year. Plot these normalized counts.
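
A minimal sketch of how A2.1 could be approached, assuming the columns described in the dataset description (`upload_date`, `channel`, `channel_cat`, `view_count`). The helper name `views_per_active_channel` and the toy rows are illustrative only and not part of the exam data.

```python
import pandas as pd

def views_per_active_channel(videos: pd.DataFrame) -> pd.DataFrame:
    """Yearly views per category divided by the number of channels that uploaded in that year."""
    df = videos.assign(year=videos.upload_date.dt.year)
    df = df[df.year.between(2010, 2018)]
    grouped = df.groupby(['year', 'channel_cat'])
    total_views = grouped.view_count.sum()
    active_channels = grouped.channel.nunique()
    # One row per year, one column per category
    return (total_views / active_channels).unstack('channel_cat')

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'upload_date': pd.to_datetime(['2015-01-01', '2015-06-01', '2015-03-01', '2016-02-01']),
    'channel': ['a', 'b', 'c', 'a'],
    'channel_cat': ['Gaming', 'Gaming', 'Howto & Style', 'Gaming'],
    'view_count': [100.0, 300.0, 50.0, 80.0],
})
normalized = views_per_active_channel(toy)
print(normalized)
```

On the real data, `views_per_active_channel(youtube).plot()` would draw one line per category. Years in which a category has no upload show up as `NaN`.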




2. Your friend's channel will be brand new, so you decide to look more closely at newer channels. For this question and all the following questions in A2, only consider channels that uploaded their first video in  2016 or later. Compute the total number of views in each category and divide it by the number of channels in that category.
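
One possible way to restrict the analysis to new channels is sketched below. It assumes, as in Task A, that a channel is created on the date of its first upload. The function name `views_per_new_channel` and the toy rows are hypothetical.

```python
import pandas as pd

def views_per_new_channel(videos: pd.DataFrame, first_year: int = 2016) -> pd.Series:
    """Total views per category divided by the number of channels, for channels created in first_year or later."""
    # Date of the first upload of the channel, repeated on every row of that channel
    first_upload = videos.groupby('channel').upload_date.transform('min')
    new = videos[first_upload.dt.year >= first_year]
    per_cat = new.groupby('channel_cat')
    return per_cat.view_count.sum() / per_cat.channel.nunique()

# Toy data (hypothetical): 'old' started in 2014 and is excluded entirely
toy = pd.DataFrame({
    'upload_date': pd.to_datetime(['2014-05-01', '2017-01-01', '2016-03-01',
                                   '2018-07-01', '2017-02-01']),
    'channel': ['old', 'old', 'new1', 'new1', 'new2'],
    'channel_cat': ['Gaming', 'Gaming', 'Gaming', 'Gaming', 'Howto & Style'],
    'view_count': [1000.0, 500.0, 40.0, 60.0, 30.0],
})
result = views_per_new_channel(toy)
print(result)
```

Note that the 2017 video of the `old` channel is dropped as well, because the filter is applied to the channel's first upload and not to the individual video.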


3. The number of views might be very unevenly distributed over the different channels, and channels might upload different numbers of videos.
  1. Compute the mean number of views per video for each channel.
  2. Compute the mean of these means for each of the two categories. Print these values.
  3. Using bootstrapping, compute 95% confidence intervals for these two means. From this analysis, can you draw a recommendation for one of the two categories? Why, or why not?
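
The three steps above can be sketched as follows. This is a percentile bootstrap that resamples channels (not videos) within each category. The number of resamples, the seed and the helper names `bootstrap_ci` and `category_mean_cis` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    means = samples.mean(axis=1)
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def category_mean_cis(videos: pd.DataFrame) -> dict:
    """Map each category to (mean of channel means, CI lower bound, CI upper bound)."""
    # Step 1: mean views per video for each channel
    channel_means = videos.groupby(['channel_cat', 'channel']).view_count.mean()
    # Steps 2 and 3: mean of the channel means per category, with a bootstrap CI
    return {cat: (vals.mean(), *bootstrap_ci(vals.values))
            for cat, vals in channel_means.groupby(level='channel_cat')}

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'channel': ['a', 'a', 'b', 'c', 'd', 'd'],
    'channel_cat': ['Gaming', 'Gaming', 'Gaming',
                    'Howto & Style', 'Howto & Style', 'Howto & Style'],
    'view_count': [10.0, 30.0, 40.0, 5.0, 15.0, 25.0],
})
cis = category_mean_cis(toy)
print(cis)
```

If the two intervals overlap substantially, the difference between the category means could be due to chance, and the analysis alone would not justify recommending one category.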

# Task B: View forecasting (Machine Learning)

Your friend wants to figure out how they can optimize their videos for getting the maximum number of views (without using shocking thumbnails and clickbait titles). In this task, you will build a machine learning (ML) model for predicting the success of a video.

In [408]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LogisticRegression

## B1: Get those shovels out again

In [483]:
youtube = pd.read_csv('data/youtube.csv.gz', compression='gzip')
youtube.upload_date = pd.to_datetime(youtube.upload_date)

In [484]:
youtube

Unnamed: 0,channel,upload_date,title,tags,duration,view_count,average_rating,height,width,channel_cat
0,PewDiePie,2013-03-04,A NEW ADVENTURE! - Kingdom Hearts (1) w/ Pewds,"['lets', 'play', 'horror', 'game', 'walkthroug...",1126.0,2541550.0,4.886102,720.0,1280.0,Gaming
1,PewDiePie,2013-03-04,SAVING PRIVATE PEWDS - Conker's Bad Fur Day (15),"['lets', 'play', 'horror', 'game', 'walkthroug...",903.0,1727646.0,4.951531,720.0,1280.0,Gaming
2,PewDiePie,2013-03-04,THE WORST SCARE! - Amnesia: Rain (4),"['lets', 'play', 'horror', 'game', 'walkthroug...",806.0,1402747.0,4.962706,720.0,1280.0,Gaming
3,PewDiePie,2013-03-03,Nova / Sp00n / Cry / Pewds - Worms Revolution ...,"['lets', 'play', 'horror', 'game', 'walkthroug...",909.0,4348296.0,4.937665,720.0,1280.0,Gaming
4,PewDiePie,2013-03-03,SEXIEST HORROR EVER - Amnesia: Rain (3),"['lets', 'play', 'horror', 'game', 'walkthroug...",834.0,1410659.0,4.957545,720.0,1280.0,Gaming
...,...,...,...,...,...,...,...,...,...,...
139502,cutepolish,2010-02-23,Easy Bride Wedding Nails,"['easy', 'makeup', 'beauty', 'fashion']",201.0,284147.0,4.608439,480.0,640.0,Howto & Style
139503,cutepolish,2010-02-22,Purple Flower Nails,"['easy', 'makeup', 'beauty', 'fashion', 'tutor...",180.0,136278.0,4.638451,480.0,640.0,Howto & Style
139504,cutepolish,2010-02-21,Domo Kun Nails,"['easy', 'makeup', 'beauty', 'fashion']",277.0,228384.0,4.836411,480.0,640.0,Howto & Style
139505,cutepolish,2010-02-20,Easy Plaid Nails,"['easy', 'makeup', 'beauty', 'fashion']",174.0,247053.0,4.855700,480.0,640.0,Howto & Style


1. For the prediction model, use all rows of the dataset, but keep only the following columns: `view_count, channel, upload_date, duration, average_rating, height, width`.

In [410]:
mask_columns = ['view_count', 'channel', 'upload_date', 'duration', 'average_rating', 'height', 'width']
youtube = youtube[mask_columns]
youtube.head(5)

Unnamed: 0,view_count,channel,upload_date,duration,average_rating,height,width
0,2541550.0,PewDiePie,2013-03-04,1126.0,4.886102,720.0,1280.0
1,1727646.0,PewDiePie,2013-03-04,903.0,4.951531,720.0,1280.0
2,1402747.0,PewDiePie,2013-03-04,806.0,4.962706,720.0,1280.0
3,4348296.0,PewDiePie,2013-03-03,909.0,4.937665,720.0,1280.0
4,1410659.0,PewDiePie,2013-03-03,834.0,4.957545,720.0,1280.0


2. Extract the upload year and upload month from the `upload_date` column into the two columns `upload_year` and `upload_month`, and remove `upload_date`.

In [411]:
mask_columns_2 = ['view_count', 'channel', 'upload_year','upload_month', 'duration', 'average_rating', 'height', 'width']
# assign() returns a new dataframe, which avoids a SettingWithCopyWarning on the column subset
youtube = youtube.assign(upload_year=youtube.upload_date.dt.year,
                         upload_month=youtube.upload_date.dt.month)
youtube = youtube[mask_columns_2]
youtube.head(5)

Unnamed: 0,view_count,channel,upload_year,upload_month,duration,average_rating,height,width
0,2541550.0,PewDiePie,2013,3,1126.0,4.886102,720.0,1280.0
1,1727646.0,PewDiePie,2013,3,903.0,4.951531,720.0,1280.0
2,1402747.0,PewDiePie,2013,3,806.0,4.962706,720.0,1280.0
3,4348296.0,PewDiePie,2013,3,909.0,4.937665,720.0,1280.0
4,1410659.0,PewDiePie,2013,3,834.0,4.957545,720.0,1280.0


3. The entry in the channel column for a video indicates on which channel the video was uploaded. Encode this column via one-hot encoding.

In [412]:
# Number of distinct channels, i.e. number of new one-hot columns
len(youtube.channel.unique())

195

In [413]:
# Create the one-hot encoder
enc = OneHotEncoder(handle_unknown='ignore')
# One-hot encode the channel column (one new column per channel)
enc_df = pd.DataFrame(enc.fit_transform(youtube[['channel']]).toarray())
# Join the encoded columns to the main dataframe on the row index
youtube_ = youtube.join(enc_df)
youtube_ = youtube_.drop(labels='channel', axis=1)
youtube_.head(5)

Unnamed: 0,view_count,upload_year,upload_month,duration,average_rating,height,width,0,1,2,...,185,186,187,188,189,190,191,192,193,194
0,2541550.0,2013,3,1126.0,4.886102,720.0,1280.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1727646.0,2013,3,903.0,4.951531,720.0,1280.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1402747.0,2013,3,806.0,4.962706,720.0,1280.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4348296.0,2013,3,909.0,4.937665,720.0,1280.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1410659.0,2013,3,834.0,4.957545,720.0,1280.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


4. Split the data into a train (70%) and a test set (30%) with the appropriate function from sklearn, using 42 as the random seed.

In [466]:
train, test = train_test_split(youtube_, test_size=0.30, random_state=42)

## B2: Who is the most viewed of them all?

1. Train a ridge regression model (i.e., an L2-regularized linear regression model) on the train set that predicts the view count from the other features. Find and use the optimal regularization parameter $\alpha$ from the set {0.001, 0.01, 0.1} via 3-fold cross validation.

In [467]:
X_train = train.drop(labels='view_count', axis=1)
y_train = train.view_count
ridge = Ridge()
parameters = {'alpha':(0.001, 0.01, 0.1)}
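# Note: GridSearchCV defaults to 5-fold cross-validation; pass cv=3 for the 3-fold search the task asks for
clf = GridSearchCV(ridge, parameters)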
clf.fit(X_train, y_train)

GridSearchCV(estimator=Ridge(), param_grid={'alpha': (0.001, 0.01, 0.1)})

In [468]:
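# Note: mean_score_time is the time spent scoring each candidate, not its CV score;
# the scores are in cv_results_['mean_test_score'] and the selected alpha in best_params_
clf.cv_results_['mean_score_time']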

array([0.01563001, 0.01292539, 0.01266613])

In [469]:
clf.cv_results_['std_score_time']

array([0.00505938, 0.00026913, 0.0002349 ])
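
For reference, a minimal sketch of a 3-fold grid search that reports the selected parameter and the cross-validation scores. The synthetic data stands in for `X_train` and `y_train`, and the choice of `neg_mean_absolute_error` as the scoring function is an assumption (the default for `Ridge` would be $R^2$).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical, stands in for X_train / y_train)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.001, 0.01, 0.1]},
    cv=3,                               # 3-fold cross-validation
    scoring='neg_mean_absolute_error',  # assumption; higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)                     # selected alpha
print(search.cv_results_['mean_test_score'])   # mean CV score per candidate alpha
```

After fitting, `search.predict` uses the estimator refitted on the whole training set with the selected `alpha`.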

2. Report the mean absolute error that the model makes on the test set.

In [470]:
X_test = test.drop(labels='view_count', axis=1)
y_test = test.view_count

In [471]:
mean_absolute_error(y_test, clf.predict(X_test))

1444649.5039951713

## B3: Checking our ambitions

To improve performance, you want to make the task of the ML model easier and turn it into a classification task. Now it only has to predict whether a video has a high view count (defined as being larger than the median of the view counts in the training set) or a low view count (defined as being smaller or equal to the median of the view counts in the training set).

1. Train a logistic regression model for this classification task. Find and use the optimal regularization parameter C (as defined in scikit-learn's documentation) from the set {1, 10, 100} via 3-fold cross validation. Use the random seed 42. _Hint_: If you get a warning about the training algorithm failing to converge, increase the maximum number of training iterations.

In [472]:
threshold = y_train.median()
y_train_binary = (y_train > threshold).astype(int)
y_test_binary = (y_test > threshold).astype(int)

In [473]:
logit = LogisticRegression(random_state=42, max_iter=1000)
parameters = {'C':(1,10,100)}
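# Note: GridSearchCV defaults to 5-fold cross-validation; pass cv=3 for the 3-fold search the task asks for
clf = GridSearchCV(logit, parameters)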
clf.fit(X_train, y_train_binary)

GridSearchCV(estimator=LogisticRegression(max_iter=1000, random_state=42),
             param_grid={'C': (1, 10, 100)})

In [474]:
clf.cv_results_['mean_score_time']

array([0.03399649, 0.03553305, 0.035533  ])

2. Compute the accuracy of the logistic regression model on the test set.

In [475]:
clf.score(X_test, y_test_binary)

0.7480228418512412

## B4: ...something's not right.

You are satisfied with the model performance. In fact, you are a bit surprised at how good the model is given the relatively little amount of information about the videos. So you take a closer look at the features and realize that the (one-hot-encoded) channel feature does not make sense for the application that your friend has in mind.

1. Why does the channel feature not make sense?

**Comment** 

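Because the channel is an identifier rather than a property of the video. With this feature the model mostly memorizes which existing channels get many views. Your friend's channel is brand new, so it has no one-hot column of its own and the feature carries no information for it.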

2. Train another logistic regression model with all the features from B3 except the one-hot-encoded channel. Use again 42 as the seed for the train test split and perform the same hyperparameter optimization as in B3. How does the model performance change?

In [476]:
X_train = X_train[['upload_year','upload_month','duration', 'average_rating','height','width']]
X_test = X_test[['upload_year','upload_month','duration', 'average_rating','height','width']]

(41853, 6)

In [478]:
logit = LogisticRegression(random_state=42, max_iter=1000)
parameters = {'C':(1,10,100)}
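# Note: GridSearchCV defaults to 5-fold cross-validation; pass cv=3 for the 3-fold search the task asks for
clf = GridSearchCV(logit, parameters)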
clf.fit(X_train, y_train_binary)

GridSearchCV(estimator=LogisticRegression(max_iter=1000, random_state=42),
             param_grid={'C': (1, 10, 100)})

In [479]:
clf.cv_results_['mean_score_time']

array([0.002772  , 0.00269728, 0.00259695])

In [480]:
clf.score(X_test, y_test_binary)

0.6048311948964232

**Comment**

The performance is lower in this setup (test accuracy of about 0.60 instead of 0.75). Also, the selected regularization parameter is now C=1 rather than C=10.

## B5: "We kinda forgot about categories."

On second thought, there is actually one feature that you may use about the channel. Namely, the channel category. The reason this one makes sense might also help you answer B4.1.

1. Train and evaluate another logistic regression model (in the same way as in B4 regarding train/test split and hyperparameter) that additionally includes the one-hot-encoded channel category.

In [494]:
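# channel_cat and upload_date were dropped in B1, so reload the full dataset
youtube = pd.read_csv('data/youtube.csv.gz', compression='gzip', parse_dates=['upload_date'])
youtube_ml = youtube.assign(u_year=youtube.upload_date.dt.year, u_month=youtube.upload_date.dt.month)
youtube_ml = youtube_ml[['view_count', 'duration', 'average_rating', 'height', 'width', 'u_year', 'u_month']]
youtube_ml_gaming = youtube_ml[youtube['channel_cat'] == 'Gaming']
youtube_ml_howto = youtube_ml[youtube['channel_cat'] == 'Howto & Style']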

In [495]:
train_gaming, test_gaming = train_test_split(youtube_ml_gaming, test_size=0.3, random_state=1)
X_train_gaming = train_gaming.drop(columns=['view_count'])
y_train_gaming = train_gaming['view_count']
X_test_gaming = test_gaming.drop(columns=['view_count'])
y_test_gaming = test_gaming['view_count']

train_howto, test_howto = train_test_split(youtube_ml_howto, test_size=0.3, random_state=1)
X_train_howto = train_howto.drop(columns=['view_count'])
y_train_howto = train_howto['view_count']
X_test_howto = test_howto.drop(columns=['view_count'])
y_test_howto = test_howto['view_count']

In [496]:
y_train_binary_gaming = (y_train_gaming > y_train_gaming.median()).astype(int)
y_test_binary_gaming = (y_test_gaming > y_train_gaming.median()).astype(int)
y_train_binary_howto = (y_train_howto > y_train_howto.median()).astype(int)
y_test_binary_howto = (y_test_howto > y_train_howto.median()).astype(int)

In [498]:
clf.fit(X_train_gaming, y_train_binary_gaming)
clf.score(X_test_gaming, y_test_binary_gaming)

0.6309280826029535

In [500]:
clf.fit(X_train_howto, y_train_binary_howto)
clf.score(X_test_howto, y_test_binary_howto)

0.6413494604529824

2. The dynamics of the two categories might differ a lot, and the two communities might value different properties of a video differently. For instance, for one community, a long duration might be more important, for the other one, a large picture width. Thus, having only a single weight for, e.g., the duration of a video, might not give the best results. Is there something smarter that you can do than simply including the category as a single one-hot-encoded feature to improve the classification performance? Implement your idea and compare the accuracy on the test set with that of the first model (from task B5.1).
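
One option is to give each category its own weights by adding interaction terms (feature times category indicator), which is similar in spirit to training one model per category. The sketch below illustrates the idea on synthetic data where the effect of `duration` has opposite signs in the two categories. The data, the helper names `add_category_interactions` and `holdout_accuracy`, and the use of `StandardScaler` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def add_category_interactions(X: pd.DataFrame, is_gaming: pd.Series) -> pd.DataFrame:
    """Append the category indicator and one (feature x indicator) column per feature."""
    out = X.copy()
    out['is_gaming'] = is_gaming.astype(float)
    for col in X.columns:
        out[col + '_x_gaming'] = X[col] * out['is_gaming']
    return out

# Synthetic data (hypothetical): duration helps in one category and hurts in the other
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({'duration': rng.normal(size=n), 'width': rng.normal(size=n)})
is_gaming = pd.Series(rng.integers(0, 2, size=n))
sign = np.where(is_gaming == 1, 1.0, -1.0)
y = ((sign * X['duration'] + 0.1 * rng.normal(size=n)) > 0).astype(int)

def holdout_accuracy(features: pd.DataFrame) -> float:
    """Train on 70% of the rows and return the accuracy on the remaining 30%."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=42)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
    return model.fit(X_tr, y_tr).score(X_te, y_te)

# Single weight per feature plus the category indicator
acc_single = holdout_accuracy(X.assign(is_gaming=is_gaming.astype(float)))
# Category-specific weights through interaction terms
acc_interact = holdout_accuracy(add_category_interactions(X, is_gaming))
print(acc_single, acc_interact)
```

On the exam data the gain depends on how different the two categories really are, so the accuracies should be compared on the same test split as in B5.1.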

# Task C: A map of the channels (Graphs)

Your friend wants to map out the channels and represent their similarities. For this purpose, we have created two undirected and unweighted graphs for you, where in each graph, each channel has a node and similar channels have edges connecting them. In one graph, the similarity between two channels is based on how similar their video descriptions are, while in the other, the similarity is based on how similar their video tags are. We will call the former $G_{text}$ and the latter $G_{tags}$. You will be analyzing the two graphs loaded by running the cell below.

In [None]:
from networkx import from_numpy_array
import json
g_text_adj = np.loadtxt(open('data/g_text_adj.csv', 'r'), delimiter=',', skiprows=0)
g_tags_adj = np.loadtxt(open('data/g_tags_adj.csv', 'r'), delimiter=',', skiprows=0)
channel_to_index = json.load(open('data/channel_indices.json', 'r'))
g_text = from_numpy_array(g_text_adj)
g_tags = from_numpy_array(g_tags_adj)

## C1: Does YouTube have a content diversity problem?

1. For each graph, calculate its diameter (i.e., the largest shortest-path length, where the maximization is done over all node pairs). What difference do you see? _Hint_: Don't worry if you get an error, just read the error message carefully.

In [None]:
d_text = nx.diameter(g_text, e=None, usebounds=False)

In [None]:
#d_tags = nx.diameter(g_tags, e=None, usebounds=False)

**Comment** For $G_{text}$, the diameter is 2. For $G_{tags}$, no diameter can be computed: the graph is not connected, so its diameter would be infinite.

2. What does the diameter of $G_{text}$ say about the diversity of the channels’ contents? How about the diameter of $G_{tags}$?

**Comment $G_{text}$** 

Diameter = largest shortest-path length

For $G_{text}$, almost every pair of nodes is directly connected: with a diameter of 2, any channel can be reached from any other channel in at most two hops.

In [None]:
print('In G_text, there are %d edges and %d nodes.' % (g_text.number_of_edges(), g_text.number_of_nodes()))

A complete graph has n(n-1)/2 edges, where n is the number of nodes. For n = 195 this gives 18,915 edges, which is almost the number of edges in $G_{text}$.
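
The comparison with a complete graph can be made explicit as a ratio. The sketch below uses a toy graph. The helper name `completeness` is hypothetical, and for simple undirected graphs without self-loops it should coincide with `nx.density`.

```python
import networkx as nx

def completeness(G: nx.Graph) -> float:
    """Fraction of the n(n-1)/2 possible edges that are present in G."""
    n = G.number_of_nodes()
    return G.number_of_edges() / (n * (n - 1) / 2)

# Number of edges in a complete graph on 195 nodes
max_edges_195 = 195 * 194 // 2
print(max_edges_195)  # 18915

# Toy check: a complete graph on 5 nodes with one edge removed has 9 of 10 edges
toy = nx.complete_graph(5)
toy.remove_edge(0, 1)
print(completeness(toy), nx.density(toy))
```

Applied to the exam data, `completeness(g_text)` close to 1 would support the statement that the descriptions of almost all channel pairs are similar.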

**Comment $G_{tags}$**

When we tried to compute the diameter of this graph, networkx raised an error. The graph is not connected, so for some pairs of nodes there is no path between them.
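
When a graph is disconnected, one way to still get usable numbers is to look at each connected component separately. A small sketch on a toy graph follows. The helper name `component_diameters` is hypothetical.

```python
import networkx as nx

def component_diameters(G: nx.Graph) -> list:
    """Diameter of every connected component, largest component first."""
    components = sorted(nx.connected_components(G), key=len, reverse=True)
    return [nx.diameter(G.subgraph(nodes)) for nodes in components]

# Toy disconnected graph: a path on 4 nodes plus a separate edge
toy = nx.path_graph(4)
toy.add_edge(10, 11)

print(nx.is_connected(toy))        # False
print(component_diameters(toy))    # [3, 1]
```

The number and sizes of the components (`nx.number_connected_components`) are themselves informative: many small components mean that many groups of channels share no similar tags.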

3. Based on what you have calculated, which one has greater diversity: descriptions used by channels, or tags used by channels? Justify your answer.

**Comment** 

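$G_{text}$ indicates low diversity, since its diameter is small: every channel's descriptions are at most two steps away from every other channel's. $G_{tags}$ is disconnected, which means that some groups of channels share no similar tags at all and suggests greater diversity for tags, although its undefined diameter cannot quantify this.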

4. Imagine that you want to **compare** content diversity between two sets of channels (i.e., you want to see which set of channels has more diverse content), and you have calculated a tag-based graph for each set. Do you think the diameter is a good measure for doing the comparison? Justify your answer.

**Comment**

No. As we saw, tag-based graphs can be disconnected, so the diameter can end up being undefined for both sets. This implies that both are diverse but does not provide a comparison.

5. Back to our own two graphs. Based on $G_{text}$, for each category of channels, which channel is the one most representative of the contents of all channels in that category? In other words, for each category, if you needed to provide a summary of all channels in the category via one channel, which channel would you choose? Show us (us being the exam designers and your friend) the descriptions of this channel’s two most-viewed videos. What metric did you use for this purpose? Explain your choice.

In [None]:
centrality_node = nx.degree_centrality(g_text)
centrality_node = {k: v for k, v in sorted(centrality_node.items(), key=lambda item: -item[1])}

In [None]:
# First entry of the sorted dict: (node, degree centrality) of the most central node
list(centrality_node.items())[0]

In [None]:
channel_to_index['Desi Perkins']

In [None]:
youtube[youtube.channel == 'Desi Perkins']['channel_cat'].values[0]

In [None]:
youtube[youtube.channel == 'Desi Perkins']['view_count'].sum(), youtube[youtube.channel == 'Desi Perkins']['view_count'].mean()

**Comment**

Since the degree centrality of a node v is the fraction of nodes it is connected to, the node with the highest score in its category is the most representative one.

The node with the highest centrality score is node 1.

This node corresponds to the channel Desi Perkins, which belongs to the category Howto & Style.
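
To answer the question for each category separately, the centrality scores can be grouped by category. The sketch below uses a toy graph. The helper name `most_central_per_category` and the `node_category` mapping (node index to category) are assumptions. On the exam data such a mapping could be built from `channel_to_index` and the `channel_cat` column.

```python
import networkx as nx

def most_central_per_category(G: nx.Graph, node_category: dict) -> dict:
    """For each category, return the node with the highest degree centrality in G."""
    centrality = nx.degree_centrality(G)
    best = {}
    for node, cat in node_category.items():
        if cat not in best or centrality[node] > centrality[best[cat]]:
            best[cat] = node
    return best

# Toy graph (hypothetical): nodes 0-2 are 'Gaming', nodes 3-5 are 'Howto & Style'
toy = nx.Graph([(0, 1), (0, 2), (0, 3), (3, 4), (3, 5), (4, 5)])
toy_categories = {0: 'Gaming', 1: 'Gaming', 2: 'Gaming',
                  3: 'Howto & Style', 4: 'Howto & Style', 5: 'Howto & Style'}
best_nodes = most_central_per_category(toy, toy_categories)
print(best_nodes)  # {'Gaming': 0, 'Howto & Style': 3}
```

Here the centrality is computed on the whole graph. An alternative reading of the question would compute it on the subgraph induced by each category, so that only similarity to channels of the same category counts.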

## C2: Going back to categories again

1. We want to use the two graphs to cluster channels from the same category together, and we want to compare their effectiveness at doing so. Use Kernighan-Lin bisection in the networkx package (`networkx.algorithms.community.kernighan_lin_bisection`) to divide each graph into two communities. Use 42 as the random seed. For each graph, show how many members of each category fall into each of the two communities.

In [None]:
communities = nx.algorithms.community.kernighan_lin_bisection(g_text,seed=42)
# The bisection only returns two node sets; it does not know which category each set corresponds to
print('In G_text, there are %d nodes in community 0' % (len(communities[0])))
print('In G_text, there are %d nodes in community 1' % (len(communities[1])))

In [None]:
communities_ = nx.algorithms.community.kernighan_lin_bisection(g_tags,seed=42)
# The bisection only returns two node sets; it does not know which category each set corresponds to
print('In G_tags, there are %d nodes in community 0' % (len(communities_[0])))
print('In G_tags, there are %d nodes in community 1' % (len(communities_[1])))
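
To show how many members of each category fall into each community, the node sets can be cross-tabulated against the categories. A minimal sketch on toy inputs follows. The helper name `community_category_table` and the `node_category` mapping (node index to category) are assumptions.

```python
import pandas as pd

def community_category_table(communities, node_category: dict) -> pd.DataFrame:
    """Count how many nodes of each category fall into each community."""
    rows = [{'community': i, 'category': node_category[node]}
            for i, nodes in enumerate(communities) for node in nodes]
    df = pd.DataFrame(rows)
    return pd.crosstab(df['category'], df['community'])

# Toy inputs (hypothetical): two communities over six nodes
toy_communities = ({0, 1, 3}, {2, 4, 5})
toy_categories = {0: 'Gaming', 1: 'Gaming', 2: 'Gaming',
                  3: 'Howto & Style', 4: 'Howto & Style', 5: 'Howto & Style'}
table = community_category_table(toy_communities, toy_categories)
print(table)
```

Dividing each row of the table by its row sum gives $P(community|category)$ directly.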

2. If one of these graphs were ideal for this clustering task, what would the resulting communities look like? If it were the absolute worst possible graph for the task, what would the resulting communities look like?

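- Ideal graph: the two communities coincide with the two categories, i.e. each community contains channels of only one category.
- Worst graph: both communities contain the same mix of categories, i.e. half of each category's channels fall into each community, so the communities carry no information about the category.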

3. Calculate the probability $P(community|category)$ for each community and category within each graph. Design a metric, using the four $P(community|category)$ values in a graph, whose value would be 1 for the ideal graph and 0 for the worst graph. Calculate this metric for both graphs and compare the two. What do the results say about how representative tags and descriptions are regarding the channel categories? Are tags better suited, or descriptions?

In [None]:
# Work at the channel level: one row per channel with its category and node index
channels = youtube[['channel', 'channel_cat']].drop_duplicates().copy()
channels['index'] = channels.channel.apply(lambda x : channel_to_index[x])

hs_community_nodes = list(communities[0])
real_hs_community = channels[channels.channel_cat == "Howto & Style"]

gaming_community_nodes = list(communities[1])
real_gaming_community = channels[channels.channel_cat == "Gaming"]

# Category priors as fractions of channels (not of videos)
p_hs = real_hs_community.shape[0]/channels.shape[0]
p_gaming = real_gaming_community.shape[0]/channels.shape[0]

# Community priors as fractions of nodes
p_c0 = len(hs_community_nodes)/(len(hs_community_nodes)+len(gaming_community_nodes))
p_c1 = len(gaming_community_nodes)/(len(hs_community_nodes)+len(gaming_community_nodes))

In [None]:
p_hs_c0 = len(set(hs_community_nodes).intersection(set(real_hs_community['index'].values)))/len(hs_community_nodes)
p_gaming_c0 = len(set(hs_community_nodes).intersection(set(real_gaming_community['index'].values)))/len(hs_community_nodes)

p_hs_c1 = len(set(gaming_community_nodes).intersection(set(real_hs_community['index'].values)))/len(gaming_community_nodes)
p_gaming_c1 = len(set(gaming_community_nodes).intersection(set(real_gaming_community['index'].values)))/len(gaming_community_nodes)

In [None]:
p_c0_hs = (p_hs_c0*p_c0)/p_hs
p_c0_hs

In [None]:
p_c1_hs = (p_hs_c1*p_c1)/p_hs
p_c1_hs

In [None]:
p_c0_gaming = (p_gaming_c0*p_c0)/p_gaming
p_c0_gaming

In [None]:
p_c1_gaming = (p_gaming_c1*p_c1)/p_gaming
p_c1_gaming
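
One possible metric built from the four conditional probabilities is the average absolute difference between the two categories' community distributions. The function name `separation_score` and this particular definition are a suggestion, not the official solution.

```python
def separation_score(p_c0_hs, p_c1_hs, p_c0_gaming, p_c1_gaming):
    """1 when each category sits entirely in its own community,
    0 when both categories are split identically over the communities."""
    return 0.5 * (abs(p_c0_hs - p_c0_gaming) + abs(p_c1_hs - p_c1_gaming))

# Ideal graph: every Howto & Style channel in community 0, every Gaming channel in community 1
ideal = separation_score(1.0, 0.0, 0.0, 1.0)
# Worst graph: both categories split 50/50 over the two communities
worst = separation_score(0.5, 0.5, 0.5, 0.5)
# Something in between (hypothetical values)
partial = separation_score(0.8, 0.2, 0.3, 0.7)
print(ideal, worst, partial)
```

Because the absolute value is used, the score does not depend on which community is labelled 0 and which is labelled 1.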

4. The Kernighan-Lin bisection you used above performs a min-edge cut: It attempts to partition the nodes of the graph into two sets of almost-equal size by deleting as few edges as possible. It starts off by creating a random partition of the nodes of the graph into two sets A and B that are almost equal in size, and then iteratively and in a greedy fashion moves nodes between A and B to reduce the number of edges between A and B. Show at least one toy example of a graph where the initialization could also be the final result. (Hint: Think of how, as we explained, the bisection algorithm relies on a minimum edge cut with a random initialization; under what circumstances could the original A and B be the best partition given that graph?)
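
A possible toy example, sketched under the assumption that the random initialization happens to pick the two triangles of a graph made of two disjoint triangles. No edge crosses the partition, so no swap can reduce the cut. The initial partition is passed explicitly through the `partition` argument of `kernighan_lin_bisection`.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Toy graph: two triangles with no edge between them
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)])

# Suppose the random initialization happens to pick the two triangles
initial = ({0, 1, 2}, {3, 4, 5})
print(nx.cut_size(toy, *initial))  # 0 edges between A and B

final = kernighan_lin_bisection(toy, partition=initial, seed=42)

# Compare as unordered pairs of sets, since the order of the two sides is not meaningful
same = {frozenset(s) for s in final} == {frozenset(s) for s in initial}
print(same)
```

The same holds for any graph in which the initial partition already has the smallest possible cut, for instance a graph without any edges.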