In [1]:
import json
import numpy as np
import pandas as pd
import time

# Very simple neural network

## Target:

For each row of the training dataset `df_audiences`, we compute a score that will be the target of our neural network:

    0.55*progress + 0.15*comment + 0.2*share + 0.1*like
    
(the coefficients should be tuned to reflect Keakr's strategy)
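
As an illustration, the score could be computed from the `df_audiences` columns roughly as follows. This is only a sketch: it assumes that `progress` is a fraction between 0 and 1 (if it turns out to be a percentage, it has to be divided by 100 first) and that `liked`, `commented` and `shared` are booleans. The names `WEIGHTS` and `compute_target` are made up for this example.

```python
import pandas as pd

# Coefficients from the formula above; meant to be tuned later.
WEIGHTS = {"progress": 0.55, "commented": 0.15, "shared": 0.20, "liked": 0.10}

def compute_target(audiences, weights=WEIGHTS):
    """Return the target score for each row of an audiences-like DataFrame."""
    # Assumption: progress is a fraction in [0, 1]; clip to guard against outliers.
    progress = audiences["progress"].astype(float).clip(0.0, 1.0)
    score = weights["progress"] * progress
    # Boolean interaction flags contribute their full weight when True.
    for col in ("commented", "shared", "liked"):
        score = score + weights[col] * audiences[col].astype(float)
    return score.rename("score")
```

Since the weights sum to 1, the score stays between 0 and 1, which matches a single sigmoid output neuron.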


## First step: dumb neural network

A dense neural network takes two vectors as input, one representing the user and one representing the video, and tries to predict the score defined above.
The output is a single neuron that produces this score, between 0 and 1.
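
To make the shape of this network concrete, here is a minimal NumPy sketch of the forward pass only (no training). The hidden layer sizes, the ReLU activations and the helper names `init_mlp` and `predict_score` are assumptions made for the example; in practice the model would be written and trained in a deep learning framework.

```python
import numpy as np

def init_mlp(n_user, n_video, hidden=(32, 16), seed=0):
    """Random weights for a dense network: concat(user, video) -> hidden -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [n_user + n_video, *hidden, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
         np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

def predict_score(params, user_vec, video_vec):
    """Forward pass: ReLU hidden layers, then one sigmoid neuron in (0, 1)."""
    h = np.concatenate([user_vec, video_vec], axis=-1)
    for weight, bias in params[:-1]:
        h = np.maximum(h @ weight + bias, 0.0)   # ReLU
    weight, bias = params[-1]
    logit = h @ weight + bias
    return (1.0 / (1.0 + np.exp(-logit)))[..., 0]  # sigmoid, drop last axis
```

For a batch of 4 (user, video) pairs with 12 user features and, say, 8 video features, `predict_score(init_mlp(12, 8), users, videos)` returns 4 scores.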

### User embedding

A very naive embedding that is unlikely to work well, built from the following features (all normalized):

- createdAt: time difference between createdAt and time of viewing
- lastConnection: time difference between lastConnection and time of viewing
- friendCount
- listenedGenres: number of listened genres
- likeCount
- likeGivenCount
- mutualFollowCount
- sessionCount
- viewCount
- shareCount
- isBeatmaker
- isSinger

The same should be done with videos (cf. Leonardo)
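
The user features listed above could be assembled along these lines. This is a sketch under several assumptions: the input has one row per viewing event with the user columns already joined in and a `timestamp` column holding the time of viewing; missing counts are treated as 0; and "normalized" is interpreted as min-max scaling, which the notes above do not specify. `build_user_features`, `min_max` and the constant names are hypothetical.

```python
import pandas as pd

USER_FEATURES = ["createdAt", "lastConnection", "friendCount", "listenedGenres",
                 "likeCount", "likeGivenCount", "mutualFollowCount", "sessionCount",
                 "viewCount", "shareCount", "isBeatmaker", "isSinger"]
COUNT_FEATURES = ["friendCount", "likeCount", "likeGivenCount", "mutualFollowCount",
                  "sessionCount", "viewCount", "shareCount"]

def min_max(col):
    """Scale a column to [0, 1]; a constant column becomes all zeros."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else col * 0.0

def build_user_features(views):
    """Build the naive user vector for each viewing event in `views`."""
    seen_at = pd.to_datetime(views["timestamp"], utc=True)
    feats = pd.DataFrame(index=views.index)
    # Elapsed time, in seconds, between each date and the time of viewing.
    for col in ("createdAt", "lastConnection"):
        delta = seen_at - pd.to_datetime(views[col], utc=True)
        feats[col] = delta.dt.total_seconds()
    feats["listenedGenres"] = views["listenedGenres"].apply(len)
    for col in COUNT_FEATURES:
        feats[col] = views[col].fillna(0).astype(float)  # assumption: NaN means 0
    feats = feats.apply(min_max)
    # Boolean flags are already in {0, 1}.
    for col in ("isBeatmaker", "isSinger"):
        feats[col] = views[col].astype(float)
    return feats[USER_FEATURES]
```

Count features on this kind of platform are often heavy-tailed, so a log transform before scaling may be worth trying as well.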

## Second step: finding a better representation of users

Because the representation above captures only aggregate statistics rather than anything specific to the user, it would be better to find a latent representation of the user.

Using the social graph, we can use graph neural networks to learn that user embedding. With PyTorch Geometric ( https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8 ), a graph can be the input of a neural network. The graph that would be used 

First, the graph has to be featurized (for example with the features already defined above). Then we feed it into a neural network that produces a vector for each node. After that, we extract the representation of the node that corresponds to our user and feed it into a dense neural network trained on the target defined above.
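
Before any of this, the follower table has to be turned into a graph structure. The sketch below maps user IDs to contiguous node indices and builds a `2 x num_edges` integer array, the edge-list layout that PyTorch Geometric calls `edge_index`. It assumes a DataFrame with the `follower` and `person followed` columns of `df_followers`, with edges pointing from follower to followed; `build_edge_index` is a made-up helper name.

```python
import numpy as np
import pandas as pd

def build_edge_index(followers):
    """Return (edge_index, node_ids) for a follower table.

    edge_index has shape (2, num_edges): row 0 is the follower's node index,
    row 1 the followed user's node index. node_ids[i] is the original ID
    of node i.
    """
    src = followers["follower"]
    dst = followers["person followed"]
    # Factorize both columns together so a user gets the same index on either side.
    codes, node_ids = pd.factorize(pd.concat([src, dst], ignore_index=True))
    n_edges = len(followers)
    edge_index = np.stack([codes[:n_edges], codes[n_edges:]])
    return edge_index, node_ids
```

The position of a given user in `node_ids` is then the row to read from the GNN output in order to get that user's embedding.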

### What kind of neural network?

For the GNN, there are a few options that I have already used:
- a very simple yet efficient approach is the Graph Convolutional Network ( https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GCNConv ): it works quite well with "fixed" graphs, and the graph we will have at prediction time should be very similar to the one used in training.
- GraphConv can also work pretty well for this kind of graph: https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GraphConv
- GATConv could make it easier to discover "new" videos that a user's closest neighbours don't listen to: https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.GATConv (its "attention" mechanism learns a weight for each neighbour, so with several stacked layers a relevant node can still be taken into account even if it is not among the closest ones)

This page seems to sum up the GNN architectures pretty well: https://theaisummer.com/gnn-architectures/
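
To give an idea of what the simplest option in the list above does, here is a NumPy sketch of one graph convolution step using the propagation rule usually associated with GCNs: add self-loops, normalise the adjacency matrix symmetrically by node degree, then mix neighbour features through a weight matrix. It is an illustrative re-implementation on a dense adjacency matrix and not PyTorch Geometric code; it treats edges as undirected, which is an assumption, and a graph with millions of users would need a sparse implementation. `gcn_layer` is a hypothetical name.

```python
import numpy as np

def gcn_layer(x, edge_index, weight):
    """One GCN-style layer: relu(D^-1/2 (A + I) D^-1/2 @ x @ weight).

    x: (num_nodes, in_dim) node features
    edge_index: (2, num_edges) integer array of node index pairs
    weight: (in_dim, out_dim) weight matrix (learned in a real model)
    """
    n = x.shape[0]
    adj = np.zeros((n, n))
    adj[edge_index[0], edge_index[1]] = 1.0
    adj = np.maximum(adj, adj.T)        # assumption: treat follows as undirected
    np.fill_diagonal(adj, 1.0)          # self-loops, so a node keeps its own features
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
    return np.maximum(norm_adj @ x @ weight, 0.0)   # ReLU
```

Stacking two such layers lets each node see its neighbours' neighbours; the row of the output at a user's node index is that user's embedding.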



In [2]:
def from_jsonl_to_df(path):
    """Load a JSON Lines file (one JSON object per line) into a DataFrame."""
    t_start = time.time()
    with open(path, 'r') as json_file:
        # Parse line by line; skip blank lines, which json.loads would reject.
        records = [json.loads(line) for line in json_file if line.strip()]
    df = pd.DataFrame(records)

    print("DataFrame of {} rows loaded in {:.2f} sec".format(len(df), time.time() - t_start))
    return df

def from_csv_to_df(path):
    """Load a CSV file into a DataFrame."""
    t_start = time.time()
    if "followers" in path:
        # The followers file has no header row, so column names are given here.
        df = pd.read_csv(path, names=["follower", "person followed", "timestamp"], header=None)
    else:
        df = pd.read_csv(path)
    print("DataFrame of {} rows loaded in {:.2f} sec".format(len(df), time.time() - t_start))
    return df

In [3]:
df_keaks = from_jsonl_to_df("data/full/keaks.jsonl")

DataFrame of 274472 rows loaded in 3.69 sec


In [4]:
df_beats = from_jsonl_to_df("data/full/beats.jsonl")

DataFrame of 59761 rows loaded in 1.38 sec


In [5]:
df_users = from_jsonl_to_df("data/full/users.jsonl")

DataFrame of 1162572 rows loaded in 31.78 sec


In [6]:
df_followers = from_csv_to_df("data/full/followers.csv")

DataFrame of 10506442 rows loaded in 11.55 sec


In [7]:
df_audiences = from_csv_to_df("data/full/audiences.csv")

DataFrame of 28952275 rows loaded in 32.43 sec


In [8]:
df_keaks.head(2)

Unnamed: 0,keakId,createdAt,likeCount,commentCount,viewCount,averageViewProgress,duration,hashtags,contentType,hasSmallThumbnail,lien
0,17301813623064450175,2019-10-10T21:20:59.2970322Z,4,2,42,7.8,107.0,"[91, 1, Rap, Pen, freestyle2019, Trap2K19]",freestyle,True,https://www.keakr.com/fr/keak/mon-mec-rap-1
1,6202649352,2018-04-02T15:06:34.9851924Z,2,2,53,0.0,68.0,[],freestyle,True,https://www.keakr.com/fr/keak/petit-salaire


In [9]:
df_audiences.head(2)

Unnamed: 0,userId,contentId,timestamp,progress,liked,commented,shared
0,users/6512051967,keaks/17301813623657913783,2020-01-01T00:00:00.4081666,0,True,True,True
1,users/17301813623701852659,keaks/17301813623464753700,2020-01-01T00:00:00.8245932,0,False,False,False


In [10]:
df_followers.head(2)

Unnamed: 0,follower,person followed,timestamp
0,users/17301813624860662494,users/17301813625195354624,2020-06-13T16:53:19.7077441Z
1,users/11928492392,users/12680019689,2019-04-11T22:23:51.9010202Z
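
The previews above suggest that `df_audiences` and `df_followers` store IDs with a `users/` or `keaks/` prefix, while `df_keaks` shows bare numbers, so the keys probably need to be aligned before any join. Some IDs (for example 17301813623657913783) are also larger than the maximum signed 64-bit integer, so keeping them as strings looks like the safer choice. A minimal sketch, where `strip_id_prefix` is a hypothetical helper:

```python
import pandas as pd

def strip_id_prefix(ids):
    """'users/6512051967' -> '6512051967'; IDs without a prefix are unchanged.

    IDs are returned as strings because some exceed the int64 range.
    """
    return ids.astype(str).str.rsplit("/", n=1).str[-1]
```

Applying the same helper to both sides of a merge (for example `contentId` in the audiences and `keakId` in the keaks) gives keys of the same type and format.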


In [11]:
df_beats.head(2)

Unnamed: 0,beatId,genres,moods,nbKeaks,nbLikes,beatmakerId,duration,bpm,createdAt,updatedAt,link,licenceType
0,17301813628927249101,"[{'id': '9920897543', 'name': 'Trap'}]","[{'id': '17301813622132424287', 'name': 'Dark'...",7,4,17301813625134492069,121.0,102.0,2021-09-20T18:16:47.1420645Z,2021-09-20T18:17:00.2684235Z,https://keakr.com/fr/beat/turquoiz,[free]
1,17301813627622625982,"[{'id': '9920897543', 'name': 'Trap'}]",[],1,5,17301813627622569830,163.0,125.0,2021-03-31T15:20:10.726762Z,2021-03-31T15:20:26.5760601Z,https://keakr.com/fr/beat/moula-i,[free]


In [20]:
df_users.head(10)

Unnamed: 0,userId,createdAt,lastConnection,usedGenres,listenedGenres,battleCreatedCount,battleLostCount,battleRespondedCount,battleWonCount,friendCount,...,mutualFollowCount,overallBeatUsage,PlaylistCount,prizeMoneyParticipationCount,prizeMoneyWinner,sessionCount,shareCount,viewCount,isBeatmaker,isSinger
0,12354401148,2019-03-02T10:23:33.0903001Z,2019-03-02T10:23:44.3283191Z,[],[],0,0,0,0,0,...,0.0,0.0,,0,False,2.0,,0,False,False
1,12354411487,2019-03-02T10:26:23.9016307Z,2020-09-24T06:16:46.8631984Z,[],[],0,0,0,0,0,...,,0.0,,0,False,8.0,,0,False,False
2,12354483185,2019-03-02T10:28:49.5056384Z,2019-03-04T17:39:14.6525393Z,[],[],0,0,0,0,0,...,,0.0,,0,False,3.0,,0,False,False
3,12354483249,2019-03-02T10:28:51.4548726Z,2019-03-03T12:52:52.7709452Z,[],[],0,0,0,0,0,...,0.0,0.0,,0,False,4.0,,0,False,False
4,12354656710,2019-03-02T10:35:40.0733875Z,2020-05-05T09:33:54.9408989Z,[],[],0,0,0,0,0,...,0.0,0.0,,0,False,20.0,,0,False,False
5,12354719690,2019-03-02T10:38:49.0256284Z,2019-03-15T17:02:08.5038717Z,[],[],0,0,0,0,0,...,0.0,0.0,,0,False,3.0,,0,False,False
6,12354776855,2019-03-02T10:41:49.8917913Z,2019-03-02T10:41:49.8917913Z,[],[],0,0,0,0,0,...,0.0,0.0,,0,False,7.0,,0,False,False
7,12354798692,2019-03-02T10:44:13.0838926Z,2019-03-02T10:44:33.1834778Z,[],[],0,0,0,0,0,...,,0.0,,0,False,2.0,,0,False,False
8,12354818744,2019-03-02T10:44:46.6556384Z,2019-03-02T10:45:07.4571885Z,[],[],0,0,0,0,0,...,,0.0,,0,False,2.0,,0,False,False
9,12354842348,2019-03-02T10:47:49.2815579Z,2019-05-20T13:32:52.3574973Z,[],[],0,0,0,0,0,...,,0.0,,0,False,16.0,,0,False,False


In [13]:
list(df_users.columns)

['userId',
 'createdAt',
 'lastConnection',
 'usedGenres',
 'listenedGenres',
 'battleCreatedCount',
 'battleLostCount',
 'battleRespondedCount',
 'battleWonCount',
 'friendCount',
 'keakCount',
 'keakrCoinGiven',
 'keakCoinReceived',
 'likeCount',
 'likeGivenCount',
 'mutualFollowCount',
 'overallBeatUsage',
 'PlaylistCount',
 'prizeMoneyParticipationCount',
 'prizeMoneyWinner',
 'sessionCount',
 'shareCount',
 'viewCount',
 'isBeatmaker',
 'isSinger']

In [18]:
df_keaks.loc[:, "contentType"].unique()

array(['freestyle', 'live', 'audioFreestyle', 'importedVideo', None,
       'dance'], dtype=object)