## Top TfIdf words for channels

Methodology: similar to https://pudding.cool/2017/09/hip-hop-words/

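For each channel, merge the captions of all its videos per year, then see which words make that channel distinctive and whether this changes over time.

Method:
1. Import the cleaned captions
2. Group them by channel and year
3. Compute TF-IDF weights per channel-year and keep the top words
4. Compare channel-years by the overlap in their top words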

In [1]:
import pandas as pd
import datetime as dt
from spacy.lang.en import English
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import networkx as nx

In [2]:


captions = '/home/dim/Downloads/captions_right.csv'
videos = '/home/dim/Documents/projecten/extremisme/youtube/data/temp/bubble/right/videos_right.csv'

columns = ['video_id', 'text']

df1 = pd.read_csv(captions, names=columns, low_memory=False)
df2 = pd.read_csv(videos, low_memory=False)
df = pd.merge(df1, df2, on='video_id', how='left')

del df1, df2

In [3]:

df['video_published'] = pd.to_datetime(df['video_published'])
df['year'] = df['video_published'].dt.year

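# Join with a space so the last word of one caption does not fuse with the
# first word of the next; skip missing captions.
df = df.groupby(['video_channel_title', 'year'])['text'].apply(lambda x: ' '.join(x.dropna().astype(str)))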

In [9]:
df = df.reset_index()
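
As a sanity check on the grouping step, here is a minimal sketch on made-up data. The column names match the ones used above; the rows and the `toy` frame are invented for illustration. Joining with a space keeps the last word of one caption from fusing with the first word of the next.

```python
import pandas as pd

# Invented rows: two videos for channel A in 2017, one for channel B in 2018.
toy = pd.DataFrame({
    'video_channel_title': ['A', 'A', 'B'],
    'video_published': ['2017-01-05', '2017-06-01', '2018-03-02'],
    'text': ['hello world', 'second video', 'other channel'],
})
toy['year'] = pd.to_datetime(toy['video_published']).dt.year

# One row per (channel, year), with all captions joined into one document.
grouped = (toy.groupby(['video_channel_title', 'year'])['text']
              .apply(' '.join)
              .reset_index())
print(grouped)
```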

In [None]:
# Optional: lemmatize.
# Note: Defaults.create_tokenizer() is the spaCy 2.x API.

tokenizer = English().Defaults.create_tokenizer()

df.text = df.text.apply(lambda x: ' '.join([tok.lemma_ for tok in tokenizer(x)]))

## Tfidf values

### Parameter choices
Parameters follow the Pudding hip-hop article. A term has to appear in at least one in 50 channel-year documents (`min_df=.02`). This is lower than the Pudding's one in 10, because our set of channels is large and diverse, with topics that probably change a lot over time. Term frequency is scaled sublinearly (a term that occurs 10 times counts as 1 + log(10) ≈ 3.3, not 10), because otherwise stop words dominate the top lists.
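
To make the sublinear scaling concrete, the sketch below compares raw counts with their dampened values. With `sublinear_tf=True`, scikit-learn replaces a raw count tf with 1 + ln(tf); the helper name `sublinear_tf` and the example counts are illustrative only.

```python
import math

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1 + math.log(count) if count > 0 else 0.0

raw_counts = [1, 10, 100, 1000]
scaled = [round(sublinear_tf(c), 2) for c in raw_counts]

# A 1000-fold increase in raw count raises the scaled value less than 8-fold.
for raw, value in zip(raw_counts, scaled):
    print(raw, value)
```

Very frequent words therefore gain little from being repeated, which is why stop words drop out of the top lists.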

In [None]:
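# merged_df is not defined in the cells above. It is assumed to be the grouped
# frame with columns year, channel, channel_id and text, and a default 0..n-1 index.
vec = TfidfVectorizer(min_df=.02, sublinear_tf=True)
res = vec.fit_transform(merged_df.text)
# Invert vocabulary_ (term -> column number) into column number -> term.
vocab = {value: key for key, value in vec.vocabulary_.items()}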

In [None]:
results = []
for index in merged_df.index:
    row = res[index]
    # Sort the non-zero (weight, column) pairs by descending weight and keep the top 10 terms.
    top10words = [vocab[col] for weight, col in sorted(zip(row.data, row.indices), reverse=True)[:10]]
    if len(top10words) < 10:
        continue  # skip channel-years with fewer than 10 weighted terms
    meta = {'year': merged_df.year[index], 'channel': merged_df.channel[index], 'channel_id': merged_df.channel_id[index]}
    words = {'word{no}'.format(no=i+1): top10words[i] for i in range(10)}
    results.append({**meta, **words})
top10words_df = pd.DataFrame(results)
top10words_df = top10words_df[['year', 'channel', 'channel_id'] + ['word' + str(no) for no in range(1, 11)]]
top10words_df.to_csv('C:/hackathon/top10tfidf_per_channel.csv', index=False)
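
The top-10 lookup relies on each sparse row exposing its non-zero weights (`.data`) and their column numbers (`.indices`). The sketch below mimics that with plain lists so the sorting step is easy to follow; the terms and weights are made up.

```python
# Term -> column number, shaped like vec.vocabulary_ (invented terms).
vocabulary = {'border': 0, 'tax': 1, 'media': 2, 'vote': 3}
# Inverted to column number -> term, as in the notebook.
vocab = {col: term for term, col in vocabulary.items()}

# Stand-ins for one row's res[index].data and res[index].indices.
data = [0.2, 0.7, 0.5]
indices = [3, 1, 0]

# Sort (weight, column) pairs by descending weight and map columns back to terms.
top2 = [vocab[col] for weight, col in sorted(zip(data, indices), reverse=True)[:2]]
print(top2)
```

Terms with zero weight never appear in `.data`, which is why a row can yield fewer than 10 words and gets skipped.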

## Tfidf top 100 words (for similarity)

Same parameter choices as above, but written to JSON so that each word list keeps its list structure.

In [None]:
results = []
for index in merged_df.index:
    top100words = [vocab[j] for i,j in sorted(zip(res[index].data,res[index].indices),reverse=True)[:100]]
    if len(top100words) < 100:
        continue
    results.append({'year':merged_df.year[index],
            'channel':merged_df.channel[index],
            'channel_id':merged_df.channel_id[index], 
            'words':top100words})
top100words_df = pd.DataFrame(results)
top100words_df = top100words_df[['year','channel','channel_id','words']]
top100words_df.to_json('C:/hackathon/top100tfidf.json')
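
Why JSON rather than CSV here: a CSV cell can only hold text, so a list is written as its string representation, whereas JSON has a native array type. A minimal sketch with the standard library (the record is made up; the same distinction applies to `to_csv` versus `to_json`):

```python
import csv
import io
import json

record = {'channel': 'a', 'words': ['x', 'y']}

# JSON round trip: the list comes back as a list.
json_back = json.loads(json.dumps(record))

# CSV round trip: the list comes back as the string "['x', 'y']".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['channel', 'words'])
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_back = next(csv.DictReader(buf))

print(type(json_back['words']).__name__, type(csv_back['words']).__name__)
```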

## 'Overlap' matrix tfidf

In [None]:

channel_id = {i:{'year':top100words_df.year[i],
                 'channel':top100words_df.channel[i],
                 'channel_id':top100words_df.channel_id[i]} for i in top100words_df.index}
top100words_df.words = top100words_df.words.apply(set)
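# Despite the name, this matrix holds similarities: the share of top-100 words
# that two channel-years have in common. The diagonal stays at 1.
distance_matrix = np.ones((len(channel_id),len(channel_id)))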

for i in range(len(channel_id)):
    for j in range(len(channel_id)):
        if i == j:
            continue
        elif i > j:
            distance_matrix[i,j] = distance_matrix[j,i]
        else:
            distance_matrix[i,j] = len(top100words_df.words[i] & top100words_df.words[j])/100

# Drop weak links: pairs sharing fewer than 5 of their top 100 words.
distance_matrix[distance_matrix < .05] = 0
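
The overlap score is simply the number of shared top words divided by the list length. A minimal sketch with made-up word sets of size 4 instead of 100, and an illustrative cut-off of 0.3 instead of 0.05 (the helper `overlap` and all names below are hypothetical):

```python
def overlap(a, b, k):
    """Share of the k top words that two sets have in common."""
    return len(a & b) / k

k = 4
top_words = {
    'a': {'tax', 'border', 'media', 'vote'},
    'b': {'tax', 'media', 'game', 'stream'},
    'c': {'recipe', 'oven', 'flour', 'vote'},
}

pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]
scores = {pair: overlap(top_words[pair[0]], top_words[pair[1]], k) for pair in pairs}

# Scores below the cut-off are set to 0, so those pairs get no edge in the graph.
threshold = 0.3
kept = {pair: (score if score >= threshold else 0.0) for pair, score in scores.items()}
print(scores)
print(kept)
```

Higher values mean more similar vocabularies, so the resulting matrix works as edge weights rather than as distances.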

In [None]:

# from_numpy_matrix and G.node were removed in newer networkx releases;
# from_numpy_array and G.nodes are the current equivalents.
G = nx.from_numpy_array(distance_matrix)

for i in range(len(channel_id)):
    G.nodes[i].update(channel_id[i])

In [None]:
nx.write_gexf(G,'C:/hackathon/tfidf_graph.gexf')

In [None]:
merged_df.to_csv('C:/hackathon/merged_right.csv',index = False)