**This notebook is the executable version of lab note 3.
It answers the following four questions:**

    1. Are successful creators more connected to high-outdegree users than unsuccessful creators are?
    2. Are mavens more connected to successful creators than to unsuccessful creators?
    3. Do successful creators send more non-follow actions towards mavens than to zombies?
    4. Do successful creators send more non-follow actions towards mavens than to stars?

    

In [1]:
#Run parameters
#used to control every run. Can be used to perform sensitivity checks
path_dir = r"/Users/../Volumes/Raw/"

low_success = 0.5 #below the median: unsuccessful
high_success = 0.9 #top 10% of creators by follower count are deemed successful

low_user_outdegree = 0.25 
high_user_outdegree = 0.75
low_user_activity = 0.25 
high_user_activity = 0.75 

activity_filter = 0
days_delta = 7

In [2]:
import sys  
import pickle
sys.path.insert(0, '/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/')
import numpy as np
import src.utils
from collections import Counter
from src.utils import import_dta, import_tracks_dta,\
gen_active_relations, get_fan_interactions_per_week, calculate_avg_monthly_valence,\
gen_active_relations_prob, get_fan_interactions_per_week_prob, stripplot_prob,\
reaction_probability, follower_list, filter_quantile, sample_creators_music,\
gen_outbound_creators
import datetime
import pandas as pd
from tqdm import tqdm
import dask.dataframe as dd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy
import os
from statsmodels.stats.proportion import proportions_ztest

In [3]:
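def process_date(date):
    '''Convert a week label like '2013-w09' to the Monday of that week (2013-03-04).

    Note: the label is parsed with %W (weeks start on Monday), whereas `week_yr`
    is produced with %U (weeks start on Sunday), so dates that fall on a Sunday
    can be mapped to the following Monday.
    '''
    year = date[0:4]
    week = date[6:]
    # Append weekday 1 (Monday) so that strptime can resolve a concrete day
    date = "{}-{}-1".format(year, week)
    dt = datetime.datetime.strptime(date, "%Y-%W-%w")
    return dt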
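
The round trip between a date, its week label and the parsed week start can be checked on a single date. The sketch below is only an illustration: `week_label_to_monday` is a hypothetical stand-in that repeats the parsing done in `process_date`, and the date is chosen by hand. It suggests that a label built with `%U` and one built with `%W` can differ for a Sunday, which changes the week start that is recovered.

```python
import datetime


def week_label_to_monday(label):
    """Hypothetical stand-in for process_date: parse 'YYYY-wWW' with %W."""
    year, week = label[0:4], label[6:]
    return datetime.datetime.strptime("{}-{}-1".format(year, week), "%Y-%W-%w")


sunday = datetime.datetime(2013, 3, 3)   # a Sunday
label_u = sunday.strftime('%Y-w%U')      # Sunday-based week number
label_w = sunday.strftime('%Y-w%W')      # Monday-based week number

parsed_from_u = week_label_to_monday(label_u)  # lands after the original date
parsed_from_w = week_label_to_monday(label_w)  # Monday of the week containing it
print(label_u, parsed_from_u.date())
print(label_w, parsed_from_w.date())
```

If week starts matter for a downstream analysis, using the same directive on both sides (either `%U` or `%W`) avoids the shift.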

# Data Imports

We start by importing the raw data. `follows_sent`, `comments_sent`, `shares_sent`, `likes_sent` and `messages_sent` contain data on the promotional activities that the 35k users tracked in the dataset directed at other users. Each includes the `user_id`, the `fan_id` and the `date_sent`, which identifies the date when the promotional activity was sent. `user_info_1st` shows the type of user (creator or non-creator, the latter identified by a blank) and the date the user entered the platform, for every user that sent promotional activities to, or received them from, any of the 35k tracked users. `user_info` contains the same information for the 35k users themselves.

`follows_received` contains information on the follows received by the 35k users and will be used to generate the successful/unsuccessful groups of content creators.

In [4]:
#affiliations :follows
#favoritings :likes

#used in filtering:
path_dir = r"/Users/../Volumes/Raw/"
tracks = import_tracks_dta(path_dir, "12sample_tracks.dta");

#these are the actions sent by the 35k tracked users
follows_sent = import_dta(path_dir, "12sample_affiliations_sent.dta");
comments_sent = import_dta(path_dir, "12sample_comments_made.dta");
shares_sent = import_dta(path_dir, "12sample_reposts_made.dta");
likes_sent = import_dta(path_dir, "12sample_favoritings_made.dta");
messages_sent = import_dta(path_dir, "12sample_messages_sent.dta");

#Used to track information on the 1st degree connections
user_info_1st = import_dta(path_dir, "12sample_1st_deg_user_infos.dta");
user_info_1st.columns = ['user_id', 'type', 'entered_platform'];
user_info = import_dta(path_dir, "12sample_user_infos.dta");

#Used to compute creator's success measure
follows_received = import_dta(path_dir, "12sample_affiliations_received.dta");

%%%%%%%%%% 12sample_tracks.dta %%%%%%%%%%
(56262, 7)
%%%%%%%%%% 12sample_affiliations_sent.dta %%%%%%%%%%
(800913, 3)
%%%%%%%%%% 12sample_comments_made.dta %%%%%%%%%%
(29258, 4)
%%%%%%%%%% 12sample_reposts_made.dta %%%%%%%%%%
(179329, 4)
%%%%%%%%%% 12sample_favoritings_made.dta %%%%%%%%%%
(527701, 4)
%%%%%%%%%% 12sample_messages_sent.dta %%%%%%%%%%
(11091, 3)
%%%%%%%%%% 12sample_1st_deg_user_infos.dta %%%%%%%%%%
(670746, 3)
%%%%%%%%%% 12sample_user_infos.dta %%%%%%%%%%
(35000, 3)
%%%%%%%%%% 12sample_affiliations_received.dta %%%%%%%%%%
(432503, 3)


Indegree and outdegree information.

The function below imports the outdegree dataset. Because the raw versions of those datasets are too large to be processed in memory, we preprocessed them in a separate script.

In [5]:
# Aggregates preprocessed outdegree of 1st degree users
def import_outdegree(path='/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/'):
    frames = []
    for i in range(6):
        df = pd.read_pickle(os.path.join(path, '{}.pkl'.format(i)))
        # Truncate timestamps to the day so that counts are per sender and day
        df['created_at'] = pd.to_datetime(df['created_at']).dt.floor('d')
        frames.append(df.groupby(['sender_id', 'created_at'], as_index = False).size())

    data_outdegree = pd.concat(frames)
    return data_outdegree

In [6]:
data_outdegree = import_outdegree()
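
To make the aggregation step concrete, here is a minimal sketch on made-up rows. The column names `sender_id` and `created_at` follow the function above; the values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for one preprocessed chunk (values are invented)
raw = pd.DataFrame({
    'sender_id': [7, 7, 7, 8],
    'created_at': ['2013-01-01 09:00', '2013-01-01 18:30',
                   '2013-01-02 10:00', '2013-01-01 12:00'],
})

# Truncate timestamps to the day, then count rows per sender and day
raw['created_at'] = pd.to_datetime(raw['created_at']).dt.floor('d')
daily = raw.groupby(['sender_id', 'created_at'], as_index=False).size()
print(daily)
```

Each row of `daily` holds one sender, one day and the number of events recorded for that pair in the `size` column.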

In [7]:
#data_outdegree = data_outdegree.groupby(['sender_id','created_at'], as_index = False).size()

# Preprocessing

## Creator ids, successful and unsuccessful creators

Next, we define three lists of ids: one with the ids of the content creators, one with the ids of successful creators and one with the ids of the unsuccessful ones.

Let's start with the list of creator ids. Creators are identified here as users with at least one available, public track in `tracks`; the earlier definition based on the `user_info` table is kept commented out. We also create a dataset containing information on creators only.

In [8]:
mask = (tracks.track_available == 1) & (tracks.public == 't')
creator_ids = tracks[mask].user_id.unique()

creators = tracks[mask]

#mask = user_info.type == 'creator'
#creator_ids = user_info[mask].user_id.unique()

#creators = user_info[user_info.type == 'creator']

## Putting together a dataset with the promotional activities made by content creators.

The function `gen_actions_sent_df` creates a dataframe with all the promotional activities that content creators sent to users.

In [9]:
def gen_actions_sent_df(follows_sent, shares_sent, likes_sent, comments_sent, messages_sent, creator_ids = creator_ids):
    '''
    Creates dataframe containing the actions that content creators send to users.
        Arguments:
                    follows_sent:  dataframe with the follows sent by the 35k users.
                    shares_sent:   dataframe with the shares sent by the 35k users.
                    likes_sent:    dataframe with the likes sent by the 35k users.
                    comments_sent: dataframe with the comments sent by the 35k users.
                    messages_sent: dataframe with the messages sent by the 35k users.
                    creator_ids:   numpy array with content creator ids. If provided, it is used to
                                   filter out activities from non-creators.
        Note: some of the input dataframes are modified in place (columns are dropped and renamed).
    '''
    
    follows_sent['outbound_activity'] = 'follow'
    follows_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    # Keep only sender, receiver and timestamp (this also discards 'song_id' if present);
    # copy so that the new column is not written to a slice of the input
    shares_sent = shares_sent[['reposter_id', "owner_id", 'created_at']].copy()
    shares_sent['outbound_activity'] = 'share'
    shares_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    if 'track_id' in likes_sent.columns:
        likes_sent.drop(columns=["track_id"], inplace=True)
    likes_sent['outbound_activity'] = 'like'
    likes_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    if 'track_id' in comments_sent.columns:
        comments_sent.drop(columns=["track_id"], inplace=True)
    comments_sent['outbound_activity'] = 'comment'
    comments_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    messages_sent["outbound_activity"] = 'message'
    messages_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']
    df = pd.concat([follows_sent, shares_sent, likes_sent, comments_sent, messages_sent])


    if isinstance(creator_ids, np.ndarray):
        df = df[df['user_id'].isin(creator_ids)]
        
    df['week_yr'] = df.date_sent.dt.strftime('%Y-w%U')
    df = df.loc[df['user_id'] != df['fan_id'],:]

    return df

In [10]:
actions_sent = gen_actions_sent_df(follows_sent, shares_sent, likes_sent, comments_sent,
                                     messages_sent, creator_ids = None)
actions_sent = actions_sent.loc[actions_sent.user_id.isin(creators.user_id.unique())]
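
The core of `gen_actions_sent_df` is to bring each activity table to a common schema, tag it and stack the results. A minimal sketch of that idea follows; `standardise` is a hypothetical helper and the toy column names are invented. Unlike the function above, it works on copies, so the inputs keep their original columns.

```python
import pandas as pd


def standardise(df, activity):
    """Hypothetical helper: copy df, rename its three columns, tag the activity."""
    out = df.copy()
    out.columns = ['user_id', 'fan_id', 'date_sent']
    out['outbound_activity'] = activity
    return out


# Toy inputs with invented column names and values
follows = pd.DataFrame({'src': [1, 2], 'dst': [5, 2],
                        'ts': pd.to_datetime(['2013-01-01', '2013-01-02'])})
likes = pd.DataFrame({'liker': [1], 'owner': [6],
                      'ts': pd.to_datetime(['2013-01-03'])})

combined = pd.concat([standardise(follows, 'follow'), standardise(likes, 'like')],
                     ignore_index=True)
# Drop self-directed actions, as the function above does
combined = combined[combined.user_id != combined.fan_id]
print(combined)
```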

In [11]:
active_users_ids = actions_sent.groupby('user_id', as_index = False).size()
mask = active_users_ids['size']>= activity_filter
active_users_ids = active_users_ids[mask].user_id.unique()

In [12]:
def successful_creators_followers(follows_received, base_date = datetime.datetime(2016, 5, 30, 0, 0), perc1 = None, perc2 = None, subset_creators = None):
    '''Classifies content creators as successful or unsuccessful.
        Arguments:
                    follows_received: dataframe containing the follows received by content creators
                    base_date:        datetime.datetime at which the number of followers per creator
                                      is calculated; only follows received before it are counted.
                    perc1:            the quantile used to classify unsuccessful content creators. Creators whose
                                      total followers at the base date are at or below this quantile
                                      are classified as unsuccessful.
                    perc2:            the quantile used to classify successful content creators. Creators whose
                                      total followers at the base date are at or above this quantile
                                      are classified as successful.
                    subset_creators:  a pd.DataFrame containing the creators. If provided, it is used to
                                      filter out non-creators and to make sure creators with 0 followers are part of
                                      the resulting dataset.
        Returns:
                    (unsuccessful_creator_ids, successful_creator_ids) as arrays of user ids.
        Note: only creators listed in the global `active_users_ids` are considered.
    '''
    print(base_date)

    if 'inbound_activity' not in follows_received.columns:
        follows_received.columns = ['fan_id', 'user_id', 'date_sent']

    mask = (follows_received['date_sent'] < base_date)

    df = follows_received[mask].groupby('user_id', as_index=False).agg({'fan_id': pd.Series.nunique})
    df.columns = ['user_id', 'followers']

    
    if isinstance(subset_creators, pd.DataFrame):
        print('subsetting...')
        df.set_index('user_id', inplace = True)
        df = df.reindex(subset_creators.user_id.unique())
        df.fillna(0, inplace = True)
        df.reset_index(inplace = True)
        df.columns = ['user_id', 'followers']
        
    mask = df.user_id.isin(active_users_ids)
    df = df[mask]

    low = np.quantile(df.followers, perc1)
    high = np.quantile(df.followers, perc2)

    print("High influencer boundary: {}".format(high))
    print("Low influencer boundary: {}".format(low))

    
    unsuccessful_creator_ids = df.loc[df["followers"] <= low].user_id.unique()
    successful_creator_ids = df.loc[df["followers"] >= high].user_id.unique()

    return unsuccessful_creator_ids, successful_creator_ids
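
The classification rule can be traced on a handful of invented follower counts. The sketch below assumes the same two steps as the function above: creators without any follow are added with a count of 0, then the `perc1` and `perc2` quantiles of the counts define the two groups. All ids and counts are made up.

```python
import numpy as np
import pandas as pd

# Invented follower counts for four creators; creator 5 received no follows
followers = pd.Series({1: 3, 2: 50, 3: 120, 4: 9}, name='followers')
all_creator_ids = [1, 2, 3, 4, 5]

# Reindex so that creators without follows appear with 0 followers
counts = followers.reindex(all_creator_ids).fillna(0)

low = np.quantile(counts, 0.5)    # plays the role of perc1 = low_success
high = np.quantile(counts, 0.9)   # plays the role of perc2 = high_success

unsuccessful = counts[counts <= low].index.tolist()
successful = counts[counts >= high].index.tolist()
print(low, high, unsuccessful, successful)
```

Creators whose count falls strictly between the two boundaries belong to neither group, which is why the two lists do not add up to the number of creators.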


In [13]:
unsuccessful_ids, successful_ids = successful_creators_followers(follows_received, 
                                                        perc1 = low_success, perc2 = high_success, subset_creators = creators)


2016-05-30 00:00:00
subsetting...
High influencer boundary: 81.0
Low influencer boundary: 13.0


In [14]:
print(len(unsuccessful_ids))
print(len(successful_ids))

1959
389


In [15]:
creators.user_id.nunique()

4604

## Filter only actions that were sent to non-fans

We merge the `actions_sent` dataset with a table containing the date each fan started following the creator.

In [16]:
follows_received.columns = ['fan_id', 'user_id', 'date_sent']
followers = follows_received[["fan_id", "user_id", "date_sent"]]
followers.columns = ["fan_id", "user_id", "follower_since"]

actions_sent = actions_sent.merge(followers, right_on = ['user_id', 'fan_id'],
                                      left_on = ['user_id', 'fan_id'], how = 'left')

Since we are interested in acquisition campaigns, we need a dataset that excludes actions targeting fans. We filter on the date of the action and the date on which the user became a fan of the content creator, keeping only actions that happened before the user followed the creator, plus actions aimed at users who never followed. The resulting dataframe is named `actions_sent_non_fans`.

In [17]:
mask = (actions_sent.date_sent < actions_sent.follower_since) | (actions_sent.follower_since.isnull())
actions_sent_non_fans = actions_sent[mask].copy()  # copy so the new column is not written to a slice
actions_sent_non_fans['week_yr_date'] = actions_sent_non_fans.week_yr.apply(process_date)
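
A minimal sketch of the non-fan filter on invented rows may help: a left merge attaches `follower_since` where a follow exists, and the mask keeps actions sent before that date or to users who never followed. The ids and dates are made up.

```python
import pandas as pd

actions = pd.DataFrame({
    'user_id': [1, 1, 1],
    'fan_id': [10, 10, 20],
    'date_sent': pd.to_datetime(['2013-01-01', '2013-02-01', '2013-01-15']),
})
followers = pd.DataFrame({
    'user_id': [1],
    'fan_id': [10],
    'follower_since': pd.to_datetime(['2013-01-20']),
})

# Users who never followed get a missing follower_since after the left merge
merged = actions.merge(followers, on=['user_id', 'fan_id'], how='left')
mask = (merged.date_sent < merged.follower_since) | merged.follower_since.isnull()
non_fans = merged[mask].copy()
print(non_fans)
```

The second action is dropped because it was sent after user 10 had already become a follower.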



## Non-follow Actions level

The activity level is defined as the number of actions performed by users. It is important to note that we only observe actions targeting the 35k users that joined in March 2012. We consider this measure a proxy for the real activity level.

Let's begin by creating a dataset with all actions received by those 35k users.

In [19]:
comments_received_35k = import_dta(path_dir, "12sample_comments_received.dta");
shares_received_35k = import_dta(path_dir, "12sample_reposts_received.dta");
likes_received_35k = import_dta(path_dir, "12sample_favoritings_received.dta");
messages_received_35k = import_dta(path_dir, "12sample_messages_received.dta");

# Keep only reposter, owner and timestamp (this also discards 'song_id' if present);
# copy so that the new column is not written to a slice
shares_received_35k = shares_received_35k[['reposter_id', "owner_id", 'created_at']].copy()
shares_received_35k['outbound_activity'] = 'share'
shares_received_35k.columns = ['fan_id', 'user_id', 'date_sent', 'outbound_activity']

if 'track_id' in likes_received_35k:
        likes_received_35k = likes_received_35k.drop(columns=["track_id"])
likes_received_35k['outbound_activity'] = 'like'
likes_received_35k.columns = ['fan_id', 'user_id', 'date_sent', 'outbound_activity']

if 'track_id' in comments_received_35k:
        comments_received_35k = comments_received_35k.drop(columns=["track_id"])
comments_received_35k['outbound_activity'] = 'comment'
comments_received_35k.columns = ['fan_id', 'user_id', 'date_sent', 'outbound_activity']

messages_received_35k["outbound_activity"] = 'message'
messages_received_35k.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

user_activity_data_35k = pd.concat([shares_received_35k, likes_received_35k, comments_received_35k, messages_received_35k])

%%%%%%%%%% 12sample_comments_received.dta %%%%%%%%%%
(21386, 4)
%%%%%%%%%% 12sample_reposts_received.dta %%%%%%%%%%
(83013, 4)
%%%%%%%%%% 12sample_favoritings_received.dta %%%%%%%%%%
(286903, 4)
%%%%%%%%%% 12sample_messages_received.dta %%%%%%%%%%
(17364, 3)


In [20]:
path_dir_2 = r'/Users/../Volumes/Alter_outbound_activities/'

comments_received_c = import_dta(path_dir_2, "12sample_1st_deg_comments_made.dta");
shares_received_c = import_dta(path_dir_2, "12sample_1st_deg_reposts_made.dta");
likes_received_c = import_dta(path_dir_2, "12sample_1st_deg_favoritings_made.dta");
messages_received_c = import_dta(path_dir_2, "12sample_1st_deg_messages_sent.dta");

%%%%%%%%%% 12sample_1st_deg_comments_made.dta %%%%%%%%%%
(21463011, 4)
%%%%%%%%%% 12sample_1st_deg_reposts_made.dta %%%%%%%%%%
(18953640, 4)
%%%%%%%%%% 12sample_1st_deg_favoritings_made.dta %%%%%%%%%%
(86793370, 4)
%%%%%%%%%% 12sample_1st_deg_messages_sent.dta %%%%%%%%%%
(16824074, 3)


In [21]:
# Keep only reposter, owner and timestamp (this also discards 'song_id' if present);
# copy so that the new column is not written to a slice
shares_received_c = shares_received_c[['reposter_id', "owner_id", 'created_at']].copy()
shares_received_c['inbound_activity'] = 'share'
shares_received_c.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

if 'track_id' in likes_received_c:
        likes_received_c = likes_received_c.drop(columns=["track_id"])
likes_received_c['inbound_activity'] = 'like'
likes_received_c.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

if 'track_id' in comments_received_c:
        comments_received_c = comments_received_c.drop(columns=["track_id"])
comments_received_c['inbound_activity'] = 'comment'
comments_received_c.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

messages_received_c["inbound_activity"] = 'message'
messages_received_c.columns = ['user_id', 'fan_id', 'date_sent', 'inbound_activity']



In [22]:
user_activity_data_c = pd.concat([shares_received_c, likes_received_c, comments_received_c, messages_received_c])

Once more we create an object containing the unique ids of users in the resulting dataset. This will be used in a flow-chart, as explained.

# A priori response probabilities

In [24]:
table_data = pd.read_csv('user_types_ids.csv')

In [25]:
hermit_ids = table_data.loc[table_data.Type =='Hermit'].user_id.unique()
w_a_ids = table_data.loc[table_data.Type =='w_a'].user_id.unique()
f_a_ids = table_data.loc[table_data.Type =='f_a'].user_id.unique()
observer_ids = table_data.loc[table_data.Type =='Observer'].user_id.unique()

In [26]:
actions_sent_non_fans['user_type'] = actions_sent_non_fans.fan_id.apply(lambda x: 'f_a' if x in f_a_ids else 
                          ('Hermit' if x in hermit_ids else
                          ('Observer' if x in observer_ids else
                          ('w_a' if x in w_a_ids else 'other'))))

## classify content creators
actions_sent_non_fans['creator_type'] = actions_sent_non_fans.user_id.apply(
                               lambda x: 'successful' if x in successful_ids else 
                               ('unsuccessful' if x in unsuccessful_ids else 'other'))



In [27]:
actions_sent['user_type'] = actions_sent.fan_id.apply(lambda x: 'f_a' if x in f_a_ids else 
                          ('Hermit' if x in hermit_ids else
                          ('Observer' if x in observer_ids else
                          ('w_a' if x in w_a_ids else 'other'))))

## classify content creators
actions_sent['creator_type'] = actions_sent.user_id.apply(
                               lambda x: 'successful' if x in successful_ids else 
                               ('unsuccessful' if x in unsuccessful_ids else 'other'))
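
The nested `x in ids` tests above scan an array once per row, which can be slow on large frames. One possible alternative, sketched here under the assumption that the id arrays are plain NumPy arrays, is `Series.isin` combined with `np.select`. `np.select` takes the first matching condition, so the precedence (f_a, Hermit, Observer, w_a, other) is the same as in the nested conditional. All ids below are invented.

```python
import numpy as np
import pandas as pd

# Invented id arrays standing in for f_a_ids, hermit_ids, observer_ids, w_a_ids
f_a_ids = np.array([1, 2])
hermit_ids = np.array([3])
observer_ids = np.array([4])
w_a_ids = np.array([5])

fans = pd.Series([1, 3, 4, 5, 99, 2])

conditions = [
    fans.isin(f_a_ids).to_numpy(),
    fans.isin(hermit_ids).to_numpy(),
    fans.isin(observer_ids).to_numpy(),
    fans.isin(w_a_ids).to_numpy(),
]
labels = ['f_a', 'Hermit', 'Observer', 'w_a']

# First matching condition wins; unmatched ids fall back to 'other'
user_type = np.select(conditions, labels, default='other')
print(user_type)
```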

In [28]:
import datetime
# Target Creation
delta = datetime.timedelta(days = days_delta)
mask = (actions_sent_non_fans['follower_since'] <= (actions_sent_non_fans['date_sent'] + delta))

response_df = actions_sent_non_fans.copy()
response_df.loc[mask, 'reward'] = 1
mask = response_df['reward'].isnull()
response_df.loc[mask, 'reward'] = 0
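
The reward rule above marks an action as successful when the target started following within `days_delta` days of the action. A compact sketch of the same rule on invented dates: comparisons against a missing `follower_since` evaluate to False, so users who never followed get a reward of 0 without a separate fill step.

```python
import datetime
import pandas as pd

days_delta = 7  # same meaning as the run parameter
df = pd.DataFrame({
    'date_sent': pd.to_datetime(['2013-01-01', '2013-01-01', '2013-01-01']),
    'follower_since': pd.to_datetime(['2013-01-05', '2013-01-20', None]),
})

delta = datetime.timedelta(days=days_delta)
# 1 if the follow arrived within the window, 0 otherwise (including never)
df['reward'] = (df['follower_since'] <= df['date_sent'] + delta).astype(int)
print(df)
```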

In [29]:
attempts = response_df.groupby(['creator_type', 'user_type'], as_index = False).size()
rewards = response_df.groupby(['creator_type', 'user_type'], as_index = False).agg({'reward' : 'sum'})

priori_prob_df = rewards.merge(attempts)
priori_prob_df['P_resp_prob'] = priori_prob_df['reward']/priori_prob_df['size']
priori_prob_df 

| | creator_type | user_type | reward | size | P_resp_prob |
|---|---|---|---|---|---|
| 0 | other | Hermit | 160.0 | 14025 | 0.011408 |
| 1 | other | Observer | 12.0 | 303 | 0.039604 |
| 2 | other | f_a | 19.0 | 14778 | 0.001286 |
| 3 | other | other | 3224.0 | 95966 | 0.033595 |
| 4 | other | w_a | 1003.0 | 23615 | 0.042473 |
| 5 | successful | Hermit | 84.0 | 7258 | 0.011573 |
| 6 | successful | Observer | 60.0 | 691 | 0.086831 |
| 7 | successful | f_a | 45.0 | 13164 | 0.003418 |
| 8 | successful | other | 6237.0 | 104129 | 0.059897 |
| 9 | successful | w_a | 6418.0 | 51416 | 0.124825 |
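
Each probability in the table above is the number of rewarded actions divided by the number of attempts in that (creator type, user type) cell. As a sketch on invented rows, the two group-bys and the merge can also be written as a single named aggregation:

```python
import pandas as pd

# Invented responses: one row per action, reward is 0 or 1
response = pd.DataFrame({
    'creator_type': ['successful'] * 4 + ['other'] * 2,
    'user_type': ['w_a', 'w_a', 'w_a', 'Hermit', 'w_a', 'w_a'],
    'reward': [1, 0, 1, 0, 0, 1],
})

summary = (response
           .groupby(['creator_type', 'user_type'], as_index=False)
           .agg(reward=('reward', 'sum'), size=('reward', 'size')))
summary['P_resp_prob'] = summary['reward'] / summary['size']
print(summary)
```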


# A priori repost probabilities

In [32]:
user_activity_data_35k['day_yr_date'] = user_activity_data_35k.date_sent.dt.normalize()

reposts_df = user_activity_data_35k.loc[user_activity_data_35k.outbound_activity == 'share']

In [34]:
mask = (actions_sent.date_sent >= actions_sent.follower_since)
actions_sent_to_fans = actions_sent[mask].copy()  # copy so new columns are not written to a slice
actions_sent_to_fans['week_yr_date'] = actions_sent_to_fans.week_yr.apply(process_date)

repost_prob_df = actions_sent_to_fans # filter for follower only
repost_prob_df.sort_values(['user_id', 'fan_id'], inplace = True)
repost_prob_df['reward_repost'] = np.nan
window = datetime.timedelta(days=days_delta)
for user_id in tqdm(repost_prob_df.user_id.unique()):
    user_rows = repost_prob_df.user_id == user_id
    for fan_id in repost_prob_df.loc[user_rows].fan_id.unique():
        pair_rows = user_rows & (repost_prob_df.fan_id == fan_id)
        # Days on which this fan reposted content of this creator
        pair_reposts = reposts_df.loc[(reposts_df.user_id == user_id) &
                                      (reposts_df.fan_id == fan_id), 'day_yr_date']
        # reward = 1 if at least one repost falls in (date_sent, date_sent + window]
        repost_prob_df.loc[pair_rows, 'reward_repost'] = \
            repost_prob_df.loc[pair_rows, 'date_sent'].apply(
                lambda x: int(((pair_reposts > x) & (pair_reposts <= x + window)).any()))
                                          
repost_prob_df.sort_values(by='date_sent', inplace = True)
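
The nested loop above visits every (creator, fan) pair in turn. A possible vectorized alternative, shown here only as a sketch on invented rows, joins actions to reposts on the pair and checks the time window on the joined rows. The `action_id` column is a hypothetical helper key, and the join can become large when a pair has many reposts, so this is a trade of memory for speed rather than a drop-in replacement.

```python
import datetime
import pandas as pd

days_delta = 7
actions = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'fan_id': [10, 10, 20, 30],
    'date_sent': pd.to_datetime(['2013-01-01', '2013-03-01',
                                 '2013-01-01', '2013-01-01']),
})
reposts = pd.DataFrame({
    'user_id': [1, 2],
    'fan_id': [10, 20],
    'day_yr_date': pd.to_datetime(['2013-01-04', '2013-02-01']),
})

delta = datetime.timedelta(days=days_delta)
# Give every action a key so joined rows can be folded back to one row per action
actions = actions.reset_index().rename(columns={'index': 'action_id'})
pairs = actions.merge(reposts, on=['user_id', 'fan_id'], how='left')

# A repost counts if it falls in (date_sent, date_sent + delta]
in_window = ((pairs.day_yr_date > pairs.date_sent) &
             (pairs.day_yr_date <= pairs.date_sent + delta))
reward = in_window.groupby(pairs.action_id).any().astype(int)
actions['reward_repost'] = reward.reindex(actions.action_id).to_numpy()
print(actions)
```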


In [35]:
repost_prob_df['user_type'] = repost_prob_df.fan_id.apply(lambda x: 'f_a' if x in f_a_ids else 
                          ('Hermit' if x in hermit_ids else
                          ('Observer' if x in observer_ids else
                          ('w_a' if x in w_a_ids else 'other'))))

## classify content creators
repost_prob_df['creator_type'] = repost_prob_df.user_id.apply(
                               lambda x: 'successful' if x in successful_ids else 
                               ('unsuccessful' if x in unsuccessful_ids else 'other'))



In [38]:
attempts = repost_prob_df.groupby(['creator_type', 'user_type'], as_index = False).size()
rewards = repost_prob_df.groupby(['creator_type', 'user_type'], as_index = False).agg({'reward_repost' : 'sum'})

priori_rep_df = rewards.merge(attempts)
priori_rep_df['P_rep_prob'] = priori_rep_df['reward_repost']/priori_rep_df['size']
priori_rep_df 

| | creator_type | user_type | reward_repost | size | P_rep_prob |
|---|---|---|---|---|---|
| 0 | other | Hermit | 1.0 | 624 | 0.001603 |
| 1 | other | Observer | 0.0 | 173 | 0.0 |
| 2 | other | f_a | 0.0 | 73 | 0.0 |
| 3 | other | other | 77.0 | 12259 | 0.006281 |
| 4 | other | w_a | 17.0 | 2728 | 0.006232 |
| 5 | successful | Hermit | 0.0 | 335 | 0.0 |
| 6 | successful | Observer | 0.0 | 252 | 0.0 |
| 7 | successful | f_a | 13.0 | 194 | 0.06701 |
| 8 | successful | other | 100.0 | 13856 | 0.007217 |
| 9 | successful | w_a | 372.0 | 11533 | 0.032255 |


Use the following cell if you need the number of reposts that follow an action, rather than a 0/1 indicator.

In [40]:
# repost_avg_df = actions_sent.copy()
# repost_avg_df.sort_values(['user_id', 'fan_id'], inplace = True)
# repost_avg_df['reward_repost'] = np.nan
# for user_id in tqdm(repost_avg_df.user_id.unique()):
#     for fan_id in repost_avg_df.loc[repost_avg_df.user_id == user_id].fan_id.unique():
#         repost_avg_df.loc[(repost_avg_df.user_id == user_id)&
#                               (repost_avg_df.fan_id == fan_id),'reward_repost'] =\
#         repost_avg_df.loc[(repost_avg_df.user_id == user_id)&
#                               (repost_avg_df.fan_id == fan_id)].date_sent.apply(
#         lambda x : 
#     (reposts_df.loc[
#     (reposts_df.user_id == user_id)&
#     (reposts_df.fan_id == fan_id)&
#     (reposts_df.day_yr_date > x)&   
#     (reposts_df.day_yr_date <= x + datetime.timedelta(days=days_delta))]).shape[0])
                                          
# repost_avg_df.sort_values(by='date_sent', inplace = True)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3857/3857 [11:14<00:00,  5.72it/s]


# Expected (Indirect) Returns Associated with a Song-Repost

In [41]:
import pickle as pkl
share_follower_lists = pd.read_pickle('/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/Data/alter_follower_lists.pkl')
action_follower_lists = pd.read_pickle('/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/Data/alter_follower_lists_2.pkl')

In [42]:
# follower_lists = share_follower_lists
# expec_followers = user_activity_data_35k.loc[user_activity_data_35k.outbound_activity == 'share']
# expec_followers = expec_followers.loc[expec_followers.user_id.isin(creators.user_id)]

# expec_followers = expec_followers.loc[expec_followers.fan_id.isin(follower_lists.index.unique())]
# expec_followers['expec_followers'] = np.nan
# expec_followers.sort_values(['user_id', 'fan_id'], inplace = True)

# follows_received.date_sent = follows_received.date_sent.dt.normalize()

# for user_id in tqdm(expec_followers.user_id.unique()):
#     follows_received_j = follows_received.loc[follows_received.user_id == user_id]
#     for fan_id in expec_followers.loc[expec_followers.user_id == user_id].fan_id.unique():
#         fan_follows = follower_lists.loc[follower_lists.index == fan_id].values[0]
#         expec_followers.loc[(expec_followers.user_id == user_id)&
#                                   (expec_followers.fan_id == fan_id),'expec_followers'] =\
#         expec_followers.loc[(expec_followers.user_id == user_id)&
#                                   (expec_followers.fan_id == fan_id)].date_sent.apply(
#             lambda x : 
#             np.sum(np.in1d(
#                            follows_received_j.loc[(x <= follows_received_j.date_sent) & 
#                                                   (follows_received_j.date_sent<= x + 
#                                                    datetime.timedelta(days=days_delta))
#                                                  ].values,
#                 fan_follows)
#         ))                                        

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1250/1250 [01:26<00:00, 14.53it/s]


In [43]:
# expec_followers['user_type'] = expec_followers.fan_id.apply(lambda x: 'f_a' if x in f_a_ids else 
#                           ('Hermit' if x in hermit_ids else
#                           ('Observer' if x in observer_ids else
#                           ('w_a' if x in w_a_ids else 'other'))))

# ## classify content creators

# expec_followers['creator_type'] = expec_followers.user_id.apply(
#                                             lambda x: 'successful' if x in successful_ids else 
#                                             ('unsuccessful' if x in unsuccessful_ids else 'other'))
# attempts = expec_followers.groupby(['creator_type', 'user_type'], as_index = False).size()
# desc = expec_followers.groupby(['creator_type', 'user_type'], as_index = False).agg(expected_followers=('expec_followers', 'mean'),
#                                std_expected_followers=('expec_followers', 'std'),
#                                max_followers=('expec_followers', 'max'))

# expect_return = attempts.merge(desc)

# All activities

In [46]:
# #now using all activities
# user_activity_data_35k['day_yr_date'] = user_activity_data_35k.date_sent.dt.normalize()

# activities_df = user_activity_data_35k.loc[user_activity_data_35k.outbound_activity.isin(['share', 'comment', 'like'])]

# act_prob_df = actions_sent.copy()
# act_prob_df.sort_values(['user_id', 'fan_id'], inplace = True)
# act_prob_df['reward_repost'] = np.nan
# for user_id in tqdm(act_prob_df.user_id.unique()):
#     for fan_id in act_prob_df.loc[act_prob_df.user_id == user_id].fan_id.unique():
#         act_prob_df.loc[(act_prob_df.user_id == user_id)&
#                               (act_prob_df.fan_id == fan_id),'reward_repost'] =\
#         act_prob_df.loc[(act_prob_df.user_id == user_id)&
#                               (act_prob_df.fan_id == fan_id)].date_sent.apply(
#         lambda x : 1 if 
#     (activities_df.loc[
#     (activities_df.user_id == user_id)&
#     (activities_df.fan_id == fan_id)&
#     (activities_df.day_yr_date > x)&   
#     (activities_df.day_yr_date <= x + datetime.timedelta(days=days_delta))]).shape[0]>0 else 0)
                                          
# act_prob_df.sort_values(by='date_sent', inplace = True)

# attempts = act_prob_df.groupby(['creator_type', 'user_type'], as_index = False).size()
# rewards = act_prob_df.groupby(['creator_type', 'user_type'], as_index = False).agg({'reward_repost' : 'sum'})
 
# priori_act_df = rewards.merge(attempts)
# priori_act_df['P_rep_prob'] = priori_act_df['reward_repost']/priori_act_df['size']
# priori_act_df 

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3857/3857 [19:06<00:00,  3.37it/s]


| | creator_type | user_type | reward_repost | size | P_rep_prob |
|---|---|---|---|---|---|
| 0 | other | Hermit | 6.0 | 14649 | 0.00041 |
| 1 | other | Observer | 0.0 | 476 | 0.0 |
| 2 | other | f_a | 8.0 | 14851 | 0.000539 |
| 3 | other | other | 781.0 | 108225 | 0.007216 |
| 4 | other | w_a | 181.0 | 26343 | 0.006871 |
| 5 | successful | Hermit | 3.0 | 7593 | 0.000395 |
| 6 | successful | Observer | 0.0 | 943 | 0.0 |
| 7 | successful | f_a | 71.0 | 13358 | 0.005315 |
| 8 | successful | other | 822.0 | 117985 | 0.006967 |
| 9 | successful | w_a | 1897.0 | 62949 | 0.030136 |


In [47]:
# #Expected number of followers given an activity

# expec_followers2 = user_activity_data_35k.loc[user_activity_data_35k.outbound_activity.isin(['share', 'comment', 'like'])]
# expec_followers2 = expec_followers2.loc[expec_followers2.user_id.isin(creators.user_id)]
# follower_lists = action_follower_lists

# expec_followers2 = expec_followers2.loc[expec_followers2.fan_id.isin(follower_lists.index.unique())]
# expec_followers2['expec_followers'] = np.nan
# expec_followers2.sort_values(['user_id', 'fan_id'], inplace = True)

# for user_id in tqdm(expec_followers2.user_id.unique()):
#     follows_received_j = follows_received.loc[follows_received.user_id == user_id]
#     for fan_id in expec_followers2.loc[expec_followers2.user_id == user_id].fan_id.unique():
#         fan_follows = follower_lists.loc[follower_lists.index == fan_id].values[0]
#         expec_followers2.loc[(expec_followers2.user_id == user_id)&
#                                   (expec_followers2.fan_id == fan_id),'expec_followers'] =\
#         expec_followers2.loc[(expec_followers2.user_id == user_id)&
#                                   (expec_followers2.fan_id == fan_id)].date_sent.apply(
#             lambda x : 
#             np.sum(np.isin(
#                            follows_received_j.loc[(x <= follows_received_j.date_sent) & 
#                                                   (follows_received_j.date_sent<= x + 
#                                                    datetime.timedelta(days=days_delta))
#                                                  ].values,
#                 fan_follows)
#         ))                                        



In [48]:
# expec_followers2['user_type'] = expec_followers2.fan_id.apply(lambda x: 'f_a' if x in f_a_ids else 
#                           ('Hermit' if x in hermit_ids else
#                           ('Observer' if x in observer_ids else
#                           ('w_a' if x in w_a_ids else 'other'))))

# ## classify content creators

# expec_followers2['creator_type'] = expec_followers2.user_id.apply(
#                                             lambda x: 'successful' if x in successful_ids else 
#                                             ('unsuccessful' if x in unsuccessful_ids else 'other'))
# attempts = expec_followers2.groupby(['creator_type', 'user_type'], as_index = False).size()
# desc = expec_followers2.groupby(['creator_type', 'user_type'], as_index = False).agg(expected_followers=('expec_followers', 'mean'),
#                                std_expected_followers=('expec_followers', 'std'),
#                                max_followers=('expec_followers', 'max'))

# expect_return2 = attempts.merge(desc)
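
The nested conditional lambdas above assign each fan to the first matching group. The same first-match-wins rule can be sketched with `Series.isin` and `np.select`, which avoids a Python-level call per row. `classify_fans` is a hypothetical helper, and the id collections are assumed to be sets or lists of fan ids, as the `x in f_a_ids` tests suggest.

```python
import numpy as np
import pandas as pd

def classify_fans(fan_ids, f_a_ids, hermit_ids, observer_ids, w_a_ids):
    """Label each fan id; earlier conditions take priority, as in the
    nested lambda. Hypothetical helper, not part of src.utils."""
    conditions = [
        fan_ids.isin(f_a_ids).to_numpy(),
        fan_ids.isin(hermit_ids).to_numpy(),
        fan_ids.isin(observer_ids).to_numpy(),
        fan_ids.isin(w_a_ids).to_numpy(),
    ]
    labels = ['f_a', 'Hermit', 'Observer', 'w_a']
    # np.select picks the label of the first condition that is True.
    out = np.select(conditions, labels, default='other')
    return pd.Series(out, index=fan_ids.index)
```

A call such as `classify_fans(expec_followers2.fan_id, f_a_ids, hermit_ids, observer_ids, w_a_ids)` would then stand in for the `apply` on `fan_id`.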

In [49]:
# expect_return2

,creator_type,user_type,size,expected_followers,std_expected_followers,max_followers
0,other,Hermit,14,0.857143,1.460092,4.0
1,other,f_a,297,0.329966,0.95093,7.0
2,other,other,5248,0.84737,1.888988,31.0
3,other,w_a,1221,0.774775,1.939888,20.0
4,successful,Hermit,33,0.060606,0.348155,2.0
5,successful,Observer,5,0.0,0.0,0.0
6,successful,f_a,1779,3.659921,18.56323,211.0
7,successful,other,18583,1.835172,9.976682,275.0
8,successful,w_a,25411,19.033175,42.507668,527.0
9,unsuccessful,Hermit,10,0.0,0.0,0.0
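
The `expec_followers` values summarized above count, for one outbound action, how many of the creator's new followers in the following `days_delta` days also appear in the targeted fan's follower list. A minimal sketch of that count for a single action follows. `count_followers_via_fan` is a hypothetical name. Selecting the `fan_id` column explicitly is an assumption about the intent, since the loop passes every column through `.values`.

```python
import numpy as np
import pandas as pd

def count_followers_via_fan(follows_received_j, fan_followers,
                            date_sent, days_delta=7):
    """Count follows received in [date_sent, date_sent + days_delta]
    whose sender is one of `fan_followers`. Hypothetical helper."""
    window_end = date_sent + pd.Timedelta(days=days_delta)
    in_window = (follows_received_j['date_sent'] >= date_sent) & \
                (follows_received_j['date_sent'] <= window_end)
    # Assumption: only the ids of the new followers should be compared.
    new_followers = follows_received_j.loc[in_window, 'fan_id'].to_numpy()
    return int(np.isin(new_followers, fan_followers).sum())
```

Both window bounds are inclusive here, matching the `<=` comparisons in the loop.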


# Analysis

In [50]:
# returns_table = priori_prob_df[['creator_type', 'user_type', 'P_resp_prob']].\
# merge(priori_rep_df[['creator_type', 'user_type', 'P_rep_prob']]).\
# merge(expect_return[['creator_type', 'user_type', 'expected_followers']])

In [51]:
# returns_table['Expected_direct_return'] = returns_table['P_resp_prob']
# returns_table['Expected_indirect_return'] = returns_table['P_rep_prob']*returns_table['expected_followers']
# returns_table['Total_expected_return'] = returns_table['Expected_direct_return'] + returns_table['Expected_indirect_return']

In [52]:
# returns_table[['creator_type', 'user_type', 'P_resp_prob', 'P_rep_prob',
#        'expected_followers',
#        'Expected_direct_return', 'Expected_indirect_return',
#        'Total_expected_return']].loc[(returns_table.creator_type == 'successful')] #& (returns_table.user_type.isin(['Hermit', 'f_a']))]

,creator_type,user_type,P_resp_prob,P_rep_prob,expected_followers,Expected_direct_return,Expected_indirect_return,Total_expected_return
4,successful,Hermit,0.011573,0.0,0.086957,0.011573,0.0,0.011573
5,successful,Observer,0.086831,0.0,0.0,0.086831,0.0,0.086831
6,successful,f_a,0.003418,0.06701,4.779641,0.003418,0.320285,0.323704
7,successful,other,0.059897,0.007217,1.570914,0.059897,0.011337,0.071234
8,successful,w_a,0.124825,0.032255,11.357512,0.124825,0.36634,0.491165
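
The decomposition behind the table above is: direct return = `P_resp_prob`, indirect return = `P_rep_prob` × `expected_followers`, total = direct + indirect. A small sketch reproducing the `w_a` row by hand is shown below. `expected_returns` is a hypothetical helper, and the inputs are the rounded values printed in the table.

```python
def expected_returns(p_resp, p_rep, expected_followers):
    """Return (direct, indirect, total) expected return per action.
    Hypothetical helper mirroring the returns_table columns."""
    direct = p_resp                        # follow-back from the targeted user
    indirect = p_rep * expected_followers  # followers gained through a repost
    return direct, indirect, direct + indirect

# Successful creators targeting w_a users, values copied from the table.
direct, indirect, total = expected_returns(0.124825, 0.032255, 11.357512)
print(round(direct, 4), round(indirect, 4), round(total, 4))
```

The results agree with the table to about four decimals. Small differences in later digits are expected because the table was computed from unrounded inputs.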


In [53]:
# #avg outbound activities from successful creators to the same user
# mask = actions_sent.creator_type == 'successful'
# size = actions_sent.loc[mask].groupby(['user_id', 'fan_id','user_type'], as_index = False).size()

In [54]:
table_data.Type.value_counts()

other       469324
w_a          94314
Hermit       90383
Observer      8829
f_a           7896
Name: Type, dtype: int64

In [55]:
table_data.shape

(670746, 7)

In [57]:
#Response (Follow-Back) Probabilities
df = priori_prob_df

pivoted_df = round(df.pivot(index='creator_type', columns='user_type', values=['P_resp_prob']),3)

# Display the pivoted DataFrame
pivoted_df

P_resp_prob (columns: user_type)
creator_type,Hermit,Observer,f_a,other,w_a
other,0.011,0.04,0.001,0.034,0.042
successful,0.012,0.087,0.003,0.06,0.125
unsuccessful,0.014,0.033,0.002,0.018,0.023


In [58]:
#Number of attempts (group sizes) behind the follow-back probabilities
df = priori_prob_df

pivoted_df = round(df.pivot(index='creator_type', columns='user_type', values=['size']),3)

# Display the pivoted DataFrame
pivoted_df

size (columns: user_type)
creator_type,Hermit,Observer,f_a,other,w_a
other,14025,303,14778,95966,23615
successful,7258,691,13164,104129,51416
unsuccessful,5259,61,5400,40169,6009


In [59]:
#Repost Probabilities (in percent)
df = priori_rep_df

pivoted_df = round(df.pivot(index='creator_type', columns='user_type', values=['P_rep_prob'])*100,2)

# Display the pivoted DataFrame
pivoted_df

P_rep_prob, in percent (columns: user_type)
creator_type,Hermit,Observer,f_a,other,w_a
other,0.16,0.0,0.0,0.63,0.62
successful,0.0,0.0,6.7,0.72,3.23
unsuccessful,0.0,0.0,0.0,0.57,0.54
