**This notebook is the executable version of lab note 3.
It answers the following questions:**

    1. Are successful creators more connected to high-outdegree users than unsuccessful creators are?
    2. Are mavens more connected to successful creators than to unsuccessful creators?
    3. Do successful creators send more non-follow actions towards mavens than to zombies?
    4. Do successful creators send more non-follow actions towards mavens than to stars?


In [1]:
#Run parameters
#used to control every run. Can be used to perform sensitivity checks
path_dir = r"/Users/../Volumes/Raw/"

low_success = 0.5 #below the median: unsuccessful
high_success = 0.9 #top 10% creators with more followers are deemed successful

low_user_outdegree = 0.1 
high_user_outdegree = 0.9 

low_user_activity = 0.1 
high_user_activity = 0.9 

In [2]:
import sys
import os
import pickle
import datetime
from collections import Counter

sys.path.insert(0, '/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/')

import numpy
import numpy as np
import pandas as pd
import dask.dataframe as dd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from statsmodels.stats.proportion import proportions_ztest

import src.utils
from src.utils import import_dta, import_tracks_dta, successful_creators_followers,\
gen_active_relations, get_fan_interactions_per_week, calculate_avg_monthly_valence,\
gen_active_relations_prob, get_fan_interactions_per_week_prob, stripplot_prob,\
reaction_probability, follower_list, filter_quantile, sample_creators_music,\
gen_outbound_creators

In [3]:
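def process_date(date):
    '''Convert a week label like '2013-w09' to '2013-03-04', i.e. the Monday of that week.

    Note: `%W` counts weeks starting on Monday, whereas the labels built in
    `gen_actions_sent_df` use `%U` (weeks starting on Sunday), so the two
    conventions can disagree, e.g. for actions sent on a Sunday.
    '''
    year = date[0:4]
    week = date[6:]
    # the trailing "-1" selects weekday 1 (Monday) of the given week
    date = "{}-{}-1".format(year, week)
    dt = datetime.datetime.strptime(date, "%Y-%W-%w")
    return dt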
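
As a quick sanity check of the week parsing above, the sketch below uses only the standard library. It assumes week labels of the form `YYYY-wWW`; the sample Sunday is picked purely for illustration. It also shows that `%U` (Sunday-based weeks, used for `week_yr` in `gen_actions_sent_df`) and `%W` (Monday-based weeks, used by `process_date`) can place the same day in different weeks.

```python
import datetime

# Parse the label '2013-w09' the same way process_date does:
# year, Monday-based week number (%W), weekday 1 (Monday).
label = "2013-w09"
parsed = datetime.datetime.strptime(
    "{}-{}-1".format(label[0:4], label[6:]), "%Y-%W-%w"
)
print(parsed.date())  # 2013-03-04

# A Sunday illustrates the difference between the two week conventions.
sunday = datetime.datetime(2013, 3, 3)
week_sunday_based = sunday.strftime("%U")  # weeks start on Sunday
week_monday_based = sunday.strftime("%W")  # weeks start on Monday
print(week_sunday_based, week_monday_based)  # 09 08
```

If the mismatch matters for an analysis, one option is to build the labels with the same directive that is later used to parse them.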

# Data Imports

We start by importing the raw data. `follows_sent`, `comments_sent`, `shares_sent`, `likes_sent` and `messages_sent` contain data on the promotional activities that the 35k users tracked in the dataset directed to other users. Each includes the `user_id`, the `fan_id` and the `date_sent`, which identifies the date when the promotional activity was sent. `user_info_1st` shows the type of user (creator or non-creator, the latter identified by a blank) and the date the user entered the platform, for every user that sent or received promotional activities from any of the 35k tracked users, while `user_info` contains the same information for the 35k users themselves.

`follows_received` contains information on the follows received by the 35k users and will be used to generate the successful/unsuccessful groups of content creators.

In [4]:
#affiliations :follows
#favoritings :likes

#used in filtering:
path_dir = r"/Users/../Volumes/Raw/"
tracks = import_tracks_dta(path_dir, "12sample_tracks.dta");

#these are the actions sent by the 35k tracked users
follows_sent = import_dta(path_dir, "12sample_affiliations_sent.dta");
comments_sent = import_dta(path_dir, "12sample_comments_made.dta");
shares_sent = import_dta(path_dir, "12sample_reposts_made.dta");
likes_sent = import_dta(path_dir, "12sample_favoritings_made.dta");
messages_sent = import_dta(path_dir, "12sample_messages_sent.dta");

#Used to track information on the 1st degree connections
user_info_1st = import_dta(path_dir, "12sample_1st_deg_user_infos.dta");
user_info_1st.columns = ['user_id', 'type', 'entered_platform'];
user_info = import_dta(path_dir, "12sample_user_infos.dta");

#Used to compute creator's success measure
follows_received = import_dta(path_dir, "12sample_affiliations_received.dta");

%%%%%%%%%% 12sample_tracks.dta %%%%%%%%%%
(56262, 7)
%%%%%%%%%% 12sample_affiliations_sent.dta %%%%%%%%%%
(800913, 3)
%%%%%%%%%% 12sample_comments_made.dta %%%%%%%%%%
(29258, 4)
%%%%%%%%%% 12sample_reposts_made.dta %%%%%%%%%%
(179329, 4)
%%%%%%%%%% 12sample_favoritings_made.dta %%%%%%%%%%
(527701, 4)
%%%%%%%%%% 12sample_messages_sent.dta %%%%%%%%%%
(11091, 3)
%%%%%%%%%% 12sample_1st_deg_user_infos.dta %%%%%%%%%%
(670746, 3)
%%%%%%%%%% 12sample_user_infos.dta %%%%%%%%%%
(35000, 3)
%%%%%%%%%% 12sample_affiliations_received.dta %%%%%%%%%%
(432503, 3)


Indegree and outdegree information.

The function below imports the outdegree dataset. Because the raw versions of those datasets are too large to be processed in memory, we preprocessed them in a separate script.

In [6]:
# Aggregates preprocessed outdegree of 1st degree users
def import_outdegree(path='/Users/caiorego/Desktop/BDS/RA/Seeding-Bandits/'):
    parts = []
    for i in range(6):
        part = pd.read_pickle(os.path.join(path, '{}.pkl'.format(i)))
        # truncate timestamps to the day
        part['created_at'] = pd.to_datetime(part['created_at']).dt.floor('d')
        # count the follows sent per sender and day
        parts.append(part.groupby(['sender_id', 'created_at'], as_index = False).size())

    data_outdegree = pd.concat(parts)
    #data_outdegree.set_index('created_at', inplace = True)
    return data_outdegree

In [7]:
data_outdegree = import_outdegree()

In [8]:
#data_outdegree = data_outdegree.groupby(['sender_id','created_at'], as_index = False).size()

# Preprocessing

## Creator ids, successful and unsuccessful creators

Next, we define three lists of ids: one with the ids of the content creators, one with the ids of successful creators and one with the ids of the unsuccessful ones.

Let's start with the list of creator ids, taken from the users with at least one public, available track in `tracks`. We also create a dataset containing information on creators only.

In [9]:
mask = (tracks.track_available == 1) & (tracks.public == 't')
creator_ids = tracks[mask].user_id.unique()

creators = tracks[mask]

In [10]:
def successful_creators_followers(follows_received, base_date = datetime.datetime(2016, 5, 30, 0, 0), perc1 = None, perc2 = None, subset_creators = None):
    '''Classifies content creators as successful or unsuccessful
        Arguments:
                    follows_received: dataframe containing the follows received by content creators
                    base_date:        date, in datetime.datetime(YYYY, M, DD, H, M) format, at which the number 
                                      of followers per creator is calculated.
                    perc1:            the quantile used to classify unsuccessful content creators. Creators whose 
                                      total followers at the base date are at or below this quantile
                                      are classified as unsuccessful 
                    perc2:            the quantile used to classify successful content creators. Creators whose 
                                      total followers at the base date are at or above this quantile
                                      are classified as successful
                    subset_creators:  a pd.DataFrame containing the creators. If it is available, it is meant to 
                                      filter out non creators and to make sure creators with 0 followers are part of
                                      the resulting dataset.
        
    '''
    print(base_date)

    if 'inbound_activity' not in follows_received.columns:
        follows_received.columns = ['fan_id', 'user_id', 'date_sent']

    mask = (follows_received['date_sent'] < base_date)

    df = follows_received[mask].groupby('user_id', as_index=False).agg({'fan_id': pd.Series.nunique})
    df.columns = ['user_id', 'followers']

    
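    # NOTE: `reindex` returns a new dataframe and its result is discarded here, so this
    # block currently has no effect: creators with 0 followers are not added to `df`.
    if type(subset_creators) == pd.DataFrame:
        df.reindex(subset_creators.user_id.unique())
        df.fillna(0, inplace = True)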

    low = np.quantile(df.followers, perc1)
    high = np.quantile(df.followers, perc2)

    print("High influencer boundary: {}".format(high))
    print("Low influencer boundary: {}".format(low))

    mask = (df["followers"] <= low) | (df["followers"] >= high)
    
    unsuccessful_creator_ids = df.loc[df["followers"] <= low].user_id.unique()
    successful_creator_ids = df.loc[df["followers"] >= high].user_id.unique()

    return unsuccessful_creator_ids, successful_creator_ids

In [85]:
len(unsuccessful_ids)

10857

In [11]:
unsuccessful_ids, successful_ids = successful_creators_followers(follows_received, 
                                                        perc1 = low_success, perc2 = high_success, subset_creators = creators)


2016-05-30 00:00:00
High influencer boundary: 40.0
Low influencer boundary: 8.0


## Putting together a dataset with the promotional activities made by content creators.

The function `gen_actions_sent_df` creates a dataframe with all the promotional activities that content creators sent to users.

In [13]:
def gen_actions_sent_df(follows_sent, shares_sent, likes_sent, comments_sent, messages_sent, creator_ids = creator_ids):
    '''
    Creates dataframe containing the actions that content creators send to users.
        Attributes:
                    follows_sent:  dataframe with the follows sent by the 35k users.
                    shares_sent:   dataframe with the shares sent by the 35k users.
                    likes_sent:    dataframe with the likes sent by the 35k users.
                    comments_sent: dataframe with the comments sent by the 35k users.
                    messages_sent: dataframe with the messages sent by the 35k users.
                    creator_ids:   list with content creator ids. If not none, is used to
                                   filter out activities from non creators.
    '''
    
    follows_sent['outbound_activity'] = 'follow'
    follows_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    # selecting the three columns drops song_id if present; .copy() avoids SettingWithCopyWarning
    shares_sent = shares_sent[['reposter_id', "owner_id", 'created_at']].copy()
    shares_sent['outbound_activity'] = 'share'
    shares_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    if 'track_id' in likes_sent.columns:
        likes_sent.drop(columns=["track_id"], inplace=True)
    likes_sent['outbound_activity'] = 'like'
    likes_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    if 'track_id' in comments_sent.columns:
        comments_sent.drop(columns=["track_id"], inplace=True)
    comments_sent['outbound_activity'] = 'comment'
    comments_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']

    messages_sent["outbound_activity"] = 'message'
    messages_sent.columns = ['user_id', 'fan_id', 'date_sent', 'outbound_activity']
    df = pd.concat([follows_sent, shares_sent, likes_sent, comments_sent, messages_sent])


    if type(creator_ids) == numpy.ndarray:
        df = df[df['user_id'].isin(creator_ids)]
        
    df['week_yr'] = df.date_sent.dt.strftime('%Y-w%U')
    df = df.loc[df['user_id'] != df['fan_id'],:]

    return df

In [14]:
actions_sent = gen_actions_sent_df(follows_sent, shares_sent, likes_sent, comments_sent,
                                     messages_sent, creator_ids = None)

## Filter only actions that were sent to non-fans

We merge the `actions_sent` dataset with a table containing the date each fan started following the creator.

In [15]:
follows_received.columns = ['fan_id', 'user_id', 'date_sent']
followers = follows_received[["fan_id", "user_id", "date_sent"]]
followers.columns = ["fan_id", "user_id", "follower_since"]

actions_sent = actions_sent.merge(followers, right_on = ['user_id', 'fan_id'],
                                      left_on = ['user_id', 'fan_id'], how = 'left')

Since we are interested in acquisition campaigns, we need a dataset that excludes actions targeting fans.
We build it with a filter on the date of the action and the date on which the user became a fan of the content creator: we keep only actions sent before the user started following the creator, plus actions sent to users who never followed. The resulting dataframe is named `actions_sent_non_fans`.
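
The date filter described above can be illustrated on toy data. This is a minimal sketch with made-up ids and dates, not the notebook's data; it mirrors the left merge and the mask used to build `actions_sent_non_fans`.

```python
import pandas as pd

# Hypothetical actions sent by creator 1 to three different users.
actions = pd.DataFrame({
    "user_id": [1, 1, 1],
    "fan_id": [10, 11, 12],
    "date_sent": pd.to_datetime(["2013-01-05", "2013-02-01", "2013-03-01"]),
})
# Users 10 and 11 eventually followed creator 1; user 12 never did.
followers = pd.DataFrame({
    "user_id": [1, 1],
    "fan_id": [10, 11],
    "follower_since": pd.to_datetime(["2013-01-10", "2013-01-15"]),
})

merged = actions.merge(followers, on=["user_id", "fan_id"], how="left")
# Keep actions sent before the follow date, or sent to users who never followed.
mask = (merged.date_sent < merged.follower_since) | merged.follower_since.isnull()
non_fans = merged[mask].copy()
print(non_fans.fan_id.tolist())  # [10, 12]
```

User 11 is dropped because the action was sent after the follow date, so it targeted an existing fan.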

In [16]:
mask = (actions_sent.date_sent < actions_sent.follower_since) | (actions_sent.follower_since.isnull())
# .copy() creates an independent dataframe, so adding a column does not raise SettingWithCopyWarning
actions_sent_non_fans = actions_sent[mask].copy()
actions_sent_non_fans['week_yr_date'] = actions_sent_non_fans.week_yr.apply(process_date)


## Outdegree level

Originally, we only have outdegree information on users that follow at least one user. The function below imputes an outdegree of 0 for users that are not following anyone.
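
The idea can be sketched on a toy example: merge the full user list with the outdegree counts and fill the gaps with 0. The frames below are made up for illustration, and the sketch uses a left merge on a plain pandas dataframe for simplicity, which is an assumption rather than the exact call used in the notebook.

```python
import pandas as pd

# Hypothetical user list and outdegree counts; user 2 follows nobody.
users = pd.DataFrame({"user_id": [1, 2, 3]})
outdegree = pd.DataFrame({"sender_id": [1, 3], "size": [5, 2]})

merged = users.merge(outdegree, left_on="user_id", right_on="sender_id", how="left")
# Users missing from the outdegree table get an outdegree of 0.
merged["size"] = merged["size"].fillna(0)
print(merged[["user_id", "size"]])
```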

In [17]:
def outdegree_level(data, date, user_info = user_info_1st):
    
    '''
    This function returns the outdegree table as of `date`. Every user that interacted with the 35k tracked 
    users and entered the platform on or before `date` is present in the table, even if it has outdegree 0.
    arguments:
              data:           the outdegree dataset (a dask dataframe with `sender_id`, `created_at` and `size`).
              date:           the cutoff date; follows sent after it are ignored.
              user_info:      the dataset containing all the users that interacted with the 35k users tracked.
    '''
    
    data = data[data.created_at.dt.floor('d')<=date]
    data = data.groupby('sender_id').agg({'size':'sum'}).compute()
    
    #merge with user info to obtain users that are not following anyone at the current date
    data = user_info.merge(data, left_on = 'user_id', right_on = 'sender_id', how= 'outer')
    data.loc[data['size'].isnull(), 'size'] = 0
    data = data[['user_id', 'size', 'entered_platform']].set_index('user_id')
    
    #filter out users that didn't exist at the current date
    mask = data['entered_platform'].dt.floor('d') <= date
    data = data.loc[mask]
    
    return data

In [18]:
#dask is a python library with objects optimized for big data (uses directed acyclic graphs). 
dask_outdegree = dd.from_pandas(data_outdegree, npartitions = 3)

In [19]:
last_day =  max(actions_sent.date_sent.dt.floor('d').unique())
user_outdegree = outdegree_level(dask_outdegree, last_day, user_info = user_info_1st)

Now we classify users into low- and high-outdegree groups, according to the thresholds defined at the beginning of this notebook.

In [20]:
def user_outdegree_level(outdegree_data, perc1 = None, perc2 = None):
    '''Classifies users as low- or high-outdegree
        Arguments:
                    outdegree_data:   dataframe indexed by user id whose first column holds each user's outdegree
                    perc1:            the quantile used to classify low-outdegree users. Users whose outdegree is 
                                      at or below this quantile are classified as low
                    perc2:            the quantile used to classify high-outdegree users. Users whose outdegree is 
                                      at or above this quantile are classified as high
    '''

    df = outdegree_data.reset_index().iloc[:,:2]
    #the outdegree column keeps the name 'followers' because later cells refer to it
    df.columns = ['user_id', 'followers']

    low = np.quantile(df.followers, perc1)
    high = np.quantile(df.followers, perc2)

    print("High outdegree boundary: {}".format(high))
    print("Low outdegree boundary: {}".format(low))
    
    low_outdegree_ids = df.loc[df["followers"] <= low].user_id.unique()
    high_outdegree_ids = df.loc[df["followers"] >= high].user_id.unique()
    
    #sets make the membership tests below constant-time per row
    low_set, high_set = set(low_outdegree_ids), set(high_outdegree_ids)
    df['outdegree_level'] = df.user_id.apply(
        lambda x: 'high' if x in high_set else ('low' if x in low_set else None))

    
    return df, low_outdegree_ids, high_outdegree_ids

In [21]:
user_outdegree_df, low_outdegree_ids, high_outdegree_ids = user_outdegree_level(user_outdegree,
                perc1 = low_user_outdegree, perc2 = high_user_outdegree)

High outdegree boundary: 772.0
Low outdegree boundary: 1.0


In the cell below, we create a list with the unique ids of users that appear in the outdegree level table. This will later be used to construct a flow-chart indicating how we lose data based on filters and operations.

In [22]:
receiver_ids = user_outdegree.index.unique()

## Non-follow Actions level

The activity level is defined as the number of actions performed by users. It is important to notice that we only observe actions targeting the 35k users that joined in March 2012. We consider this measure a proxy for the real activity level.

Let's begin by creating a dataset with all actions received by those 35k users.

In [23]:
comments_received = import_dta(path_dir, "12sample_comments_received.dta");
shares_received = import_dta(path_dir, "12sample_reposts_received.dta");
likes_received = import_dta(path_dir, "12sample_favoritings_received.dta");
messages_received = import_dta(path_dir, "12sample_messages_received.dta");

%%%%%%%%%% 12sample_comments_received.dta %%%%%%%%%%
(21386, 4)
%%%%%%%%%% 12sample_reposts_received.dta %%%%%%%%%%
(83013, 4)
%%%%%%%%%% 12sample_favoritings_received.dta %%%%%%%%%%
(286903, 4)
%%%%%%%%%% 12sample_messages_received.dta %%%%%%%%%%
(17364, 3)


In [24]:
# selecting the three columns drops song_id if present; .copy() avoids SettingWithCopyWarning
shares_received = shares_received[['reposter_id', "owner_id", 'created_at']].copy()
shares_received['inbound_activity'] = 'share'
shares_received.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

if 'track_id' in likes_received:
    likes_received = likes_received.drop(columns=["track_id"])
likes_received['inbound_activity'] = 'like'
likes_received.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

if 'track_id' in comments_received:
    comments_received = comments_received.drop(columns=["track_id"])
comments_received['inbound_activity'] = 'comment'
comments_received.columns = ['fan_id', 'user_id', 'date_sent', 'inbound_activity']

messages_received["inbound_activity"] = 'message'
# note: unlike the three frames above, the first column is labelled user_id here
messages_received.columns = ['user_id', 'fan_id', 'date_sent', 'inbound_activity']

In [25]:
user_activity_data = pd.concat([shares_received, likes_received, comments_received, messages_received])

In [26]:
def user_activity_level(user_activity_data, receiver_ids, perc1 = None, perc2 = None):
    '''Classifies users based on the number of non-follow activities they performed
        Arguments:
                    user_activity_data:   dataframe containing the user activities, one row per action, 
                                           with the acting user in `fan_id`
                    receiver_ids:          ids of all the users to classify. Users without recorded 
                                           activities are assigned 0 activities
                    perc1:                 the quantile used to classify low-activity users. Users whose number of 
                                           activities is at or below this quantile are classified as low 
                    perc2:                 the quantile used to classify high-activity users. Users whose number of 
                                           activities is strictly above this quantile are classified as high
    '''

    df = user_activity_data.groupby('fan_id', as_index = True).size()
    df = df.reindex(receiver_ids)
    print(df.shape)
    df = df.reset_index()
    df.columns = ['user_id', 'activities_performed']
    
    df.loc[df.activities_performed.isna(), 'activities_performed'] = 0
    #classification should happen after this

    low = np.quantile(df.activities_performed, perc1)
    high = np.quantile(df.activities_performed, perc2)

    print("High activity boundary: {}".format(high))
    print("Low activity boundary: {}".format(low))
    
    low_activity_ids = df.loc[df["activities_performed"] <= low].user_id.unique()
    high_activity_ids = df.loc[df["activities_performed"] > high].user_id.unique()
    
    df['activity_level'] = df.user_id.apply(
    lambda x: 'high' if x in high_activity_ids else ('low' if x in low_activity_ids else None))

    return df, low_activity_ids, high_activity_ids

In [27]:
activity_level, low_activity_ids, high_activity_ids = user_activity_level(user_activity_data, receiver_ids, 
                perc1 = low_user_activity, perc2 = high_user_activity)

(670746,)
High activity boundary: 0.0
Low activity boundary: 0.0
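
Both boundaries printed above are 0.0, which means that at least 90% of the users have no recorded non-follow action. The sketch below, using made-up counts, illustrates why `user_activity_level` uses a strict `>` for the high group in that situation: with `>=`, every inactive user would also be labelled high.

```python
import numpy as np

# Hypothetical activity counts: 19 of 20 users performed no action.
counts = np.array([0] * 19 + [7])
low = np.quantile(counts, 0.1)
high = np.quantile(counts, 0.9)
print(low, high)  # 0.0 0.0

is_low = counts <= low
is_high_strict = counts > high      # comparison used by user_activity_level
is_high_inclusive = counts >= high  # would also label every inactive user as high
print(is_low.sum(), is_high_strict.sum(), is_high_inclusive.sum())  # 19 1 20
```

With such a zero-heavy distribution the split is effectively "no activity" versus "at least one activity", regardless of the exact quantiles chosen.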


Once more we create an object containing the unique ids of users in the resulting dataset. This will be used in a flow-chart, as explained.

In [28]:
activity_ids = np.append(low_activity_ids, high_activity_ids)

In [29]:
len(set(activity_ids).intersection(set(receiver_ids)))
#only 35493 from 240292 that follow at least one of the 35k, performed at least one non-follow action

670746

# Analysis

Now we merge the datasets with the outdegree and the non-follow activity information.

In [30]:
table_data = user_outdegree_df.merge(activity_level, left_on = 'user_id', right_on = 'user_id', how = 'inner')

In [31]:
table_data

Unnamed: 0,user_id,followers,outdegree_level,activities_performed,activity_level
0,706,549.0,,0.0,low
1,845,43.0,,0.0,low
2,1255,429.0,,0.0,low
3,1506,455.0,,0.0,low
4,2667,408.0,,0.0,low
...,...,...,...,...,...
670741,160366917,62.0,,0.0,low
670742,160556475,50.0,,0.0,low
670743,160592014,50.0,,0.0,low
670744,160615335,50.0,,0.0,low


In [32]:
table_data.groupby(['outdegree_level', 'activity_level']).size()

outdegree_level  activity_level
high             high               5232
                 low               61877
low              high               1304
                 low               70202
dtype: int64

And, finally, we create the 4 user groups that we are interested in: *Maven*, *Zombie*, *Star* and *Stalker*. Everyone else is classified as *other*.
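
The mapping from the two levels to the four groups can be written as a small lookup table. This is a sketch of the same rule in plain Python; the names `TYPE_BY_LEVELS` and `classify_user` are hypothetical and do not appear in the notebook.

```python
# (outdegree_level, activity_level) -> user type
TYPE_BY_LEVELS = {
    ("low", "high"): "Maven",
    ("low", "low"): "Zombie",
    ("high", "low"): "Stalker",
    ("high", "high"): "Star",
}

def classify_user(outdegree_level, activity_level):
    """Return the user type, or 'other' when either level is unclassified."""
    return TYPE_BY_LEVELS.get((outdegree_level, activity_level), "other")

print(classify_user("low", "high"))   # Maven
print(classify_user(None, "low"))     # other
```

Users whose outdegree falls between the two quantiles have no outdegree level, which is why most of them end up in *other*.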

In [33]:
table_data['Type'] = table_data.apply(lambda x: 
    'Maven' if (x.outdegree_level == 'low') & (x.activity_level == 'high')
     else ('Zombie' if (x.outdegree_level == 'low') & (x.activity_level == 'low')
     else ('Stalker' if (x.outdegree_level == 'high') & (x.activity_level == 'low')
     else ('Star' if (x.outdegree_level == 'high') & (x.activity_level == 'high')
     else 'other'))), axis=1)

Let's keep track of each user type's ids, as it gives us a compact way of identifying each user.

In [34]:
zombie_ids = table_data.loc[table_data.Type =='Zombie'].user_id.unique()
star_ids = table_data.loc[table_data.Type =='Star'].user_id.unique()
maven_ids = table_data.loc[table_data.Type =='Maven'].user_id.unique()
stalker_ids = table_data.loc[table_data.Type =='Stalker'].user_id.unique()

# work on an independent copy so that adding columns does not raise SettingWithCopyWarning
actions_sent_non_fans = actions_sent_non_fans.copy()

actions_sent_non_fans['user_type'] = actions_sent_non_fans.fan_id.apply(lambda x: 'Maven' if x in maven_ids else 
                          ('Zombie' if x in zombie_ids else
                          ('Stalker' if x in stalker_ids else
                          ('Star' if x in star_ids else 'other'))))

## classify content creators
actions_sent_non_fans['creator_type'] = actions_sent_non_fans.user_id.apply(
                                            lambda x: 'successful' if x in successful_ids else 
                                            ('unsuccessful' if x in unsuccessful_ids else 'other'))


In [35]:
table_data.Type.value_counts()

other      532131
Zombie      70202
Stalker     61877
Star         5232
Maven        1304
Name: Type, dtype: int64

## Table 2

In [36]:
#incoming follows
follows_sent.merge(table_data, left_on = 'fan_id', right_on = 'user_id', how = 'inner')\
.groupby('Type').size()

Type
Maven         670
Stalker     58158
Star         4701
Zombie     192605
other      531330
dtype: int64

In [37]:
#outgoing follows
follows_received.merge(table_data, left_on = 'fan_id', right_on = 'user_id', how = 'inner')\
.groupby('Type').size()

Type
Maven         858
Stalker     59523
Star         5653
Zombie       1890
other      351453
dtype: int64

In [38]:
#incoming nonfollow actions
mask = actions_sent.outbound_activity != 'follow'
incoming_non_follow = actions_sent[mask]
incoming_non_follow.merge(table_data, left_on = 'fan_id', right_on = 'user_id', how = 'inner')\
.groupby('Type').size()

Type
Maven         228
Stalker     34738
Star         6562
Zombie      97201
other      426118
dtype: int64

In [39]:
#outgoing nonfollow actions
user_activity_data.merge(table_data, left_on = 'fan_id', right_on = 'user_id', how = 'inner')\
.groupby('Type').size()

Type
Maven     3512
Star     27806
other    83094
dtype: int64

# Testing

Finally, we answer 4 items:

    1. Are successful creators more connected to high-outdegree users than unsuccessful creators are?
    2. Are mavens more connected to successful creators than to unsuccessful creators?
    3. Do successful creators send more non-follow actions towards mavens than to zombies?
    4. Do successful creators send more non-follow actions towards mavens than to stars?

## Desc. Statistics

Let's start with the descriptive statistics, since they provide the figures we are going to use to answer the other items:

### 3. Descriptive statistics of the four groups.

In [40]:
table_data.groupby('Type', as_index = False).activities_performed.describe().unstack(1)

Unnamed: 0,0,1,2,3,4
Type,Maven,Stalker,Star,Zombie,other
count,1304.0,61877.0,5232.0,70202.0,532131.0
mean,2.693252,0.0,5.314602,0.0,0.156153
std,4.248181,0.0,57.741198,0.0,1.778666
min,1.0,0.0,1.0,0.0,0.0
25%,1.0,0.0,1.0,0.0,0.0
50%,1.0,0.0,2.0,0.0,0.0
75%,3.0,0.0,3.0,0.0,0.0
max,45.0,0.0,4025.0,0.0,867.0


### 3.1. Are successful creators more connected to high-outdegree users than unsuccessful creators are?


We want to measure whether successful creators are skilled at identifying *mavens* and at connecting to high-outdegree users and mavens. Unfortunately, we cannot use a raw probability for the comparison, since it is sharply affected by differences in group size. One measure that sidesteps this issue is the odds ratio, a statistic better suited to comparing groups of different sizes because it takes the number of events and non-events in each group into account.

To calculate the odds ratio for a given user type, say *maven*, we need the probability that a *maven* user is targeted by a successful creator. It is obtained from the following ratio: `p_maven_users` = `total unique users of type maven targeted by successful creators`/`total unique users of type maven`. 

We then use it to calculate: `odds_maven_user` = `p_maven_users`/(1-`p_maven_users`)

Finally, let's say we want to compare `maven` and `zombie` users. We would then use the metric `odds_ratio` = `odds_maven_user`/`odds_zombie_user`

We will also report the 95% confidence intervals for the odds ratio. The function below automates that calculation.
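
To make the arithmetic concrete, here is a minimal standard-library sketch of the odds ratio and a Wald-type 95% interval on the log scale. The function name `odds_ratio_with_ci` and its argument layout are assumptions made for this illustration; the counts are the ones used for the maven comparison in section 3.2 (516 and 62 targeted users out of 1304 mavens).

```python
import math

def odds_ratio_with_ci(events_1, size_1, events_2, size_2, z=1.959964):
    """Odds ratio of group 1 vs. group 2 with a Wald confidence interval.

    events_*: number of targeted users in each group
    size_*:   total number of users in each group
    z:        normal quantile (1.959964 gives a 95% interval)
    """
    non_events_1 = size_1 - events_1
    non_events_2 = size_2 - events_2
    odds_1 = events_1 / non_events_1
    odds_2 = events_2 / non_events_2
    ratio = odds_1 / odds_2
    # Standard error of log(odds ratio) from the four cell counts.
    se = math.sqrt(1 / events_1 + 1 / non_events_1 + 1 / events_2 + 1 / non_events_2)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper

ratio, lower, upper = odds_ratio_with_ci(516, 1304, 62, 1304)
print(round(ratio, 2), round(lower, 2), round(upper, 2))
```

The interval is symmetric on the log scale, so it is not symmetric around the odds ratio itself. An interval that excludes 1 indicates that the two odds differ at the 5% level.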

In [41]:
import numpy as np
import scipy.stats as stats

def calculate_odds_ratio_ci(exposed_cases, exposed_controls, nonexposed_cases, nonexposed_controls):
    # Calculate the odds ratio (OR)
    odds_ratio = (exposed_cases / exposed_controls) / (nonexposed_cases / nonexposed_controls)

    # Calculate the 95% confidence interval (CI) for the OR
    log_odds_ratio = np.log(odds_ratio)
    std_error = np.sqrt(1/exposed_cases + 1/exposed_controls + 1/nonexposed_cases + 1/nonexposed_controls)
    z_score = stats.norm.ppf(0.975)
    lower_ci = np.exp(log_odds_ratio - z_score*std_error)
    upper_ci = np.exp(log_odds_ratio + z_score*std_error)

    return round(odds_ratio, 2), (round(lower_ci, 2), round(upper_ci, 2))

In [42]:
from scipy.stats import chi2_contingency

In [43]:
table_data.columns = ['fan_id', 'followers', 'outdegree_level', 'activities_performed', 'activity_level', 'Type']
table = follows_received.merge(table_data, left_on = 'fan_id', right_on = 'fan_id', how = 'inner') #suffixes

table['creator_type'] = table.user_id.apply(
                                            lambda x: 'successful' if x in successful_ids else 
                                            ('unsuccessful' if x in unsuccessful_ids else 'other'))

In [44]:
table.groupby(['creator_type','outdegree_level']).size()

creator_type  outdegree_level
other         high               14117
              low                  776
successful    high               46441
              low                 1677
unsuccessful  high                4618
              low                  295
dtype: int64

In [45]:
high_outdegree_ids = table[table.outdegree_level == 'high'].fan_id.unique()

In [46]:
g1 = 32557
g2 = 806
group_ids1= high_outdegree_ids
group_ids2= high_outdegree_ids

tab = np.array([[g1, g2], [len(group_ids1)-g1, len(group_ids2)-g2]])

# calculate the odds ratio by taking the ratio of the odds of the event occurring in the two groups
odds_ratio = (g1/(len(group_ids1)-g1)/(g2/(len(group_ids2)-g2)))
print("Odds Ratio:", round(odds_ratio,4))

# perform a chi-square test to determine whether the observed odds ratio is statistically significant
chi2, p_value, _, _ = chi2_contingency(tab)
print("Chi-Square Statistic:", round(chi2,4))
print("P-Value:", round(p_value,4))

Odds Ratio: 113.958
Chi-Square Statistic: 45346.7404
P-Value: 0.0


In [47]:
calculate_odds_ratio_ci(g1, g2, len(group_ids1)-g1, len(group_ids2)-g2)

(113.96, (106.04, 122.46))

### 3.2. Are mavens more connected to successful creators than to unsuccessful creators?


In [48]:
#follows received data + maven classification + successful/unsuccessful

In [49]:
table.groupby(['creator_type','Type']).size()

creator_type  Type   
other         Maven         182
              Stalker     13422
              Star          695
              Zombie        594
              other      140386
successful    Maven         623
              Stalker     41661
              Star         4780
              Zombie       1054
              other      179558
unsuccessful  Maven          53
              Stalker      4440
              Star          178
              Zombie        242
              other       31509
dtype: int64

In [50]:
len(group_ids2)-g2

49185

In [51]:
g1 = 516
g2 = 62
group_ids1 = maven_ids
group_ids2 = maven_ids

tab = np.array([[g1, g2], [len(group_ids1)-g1, len(group_ids2)-g2]])

# calculate the odds ratio by taking the ratio of the odds of the event occurring in the two groups
odds_ratio = (g1/(len(group_ids1)-g1)/(g2/(len(group_ids2)-g2)))
print("Odds Ratio:", round(odds_ratio,4))

# perform a chi-square test to determine whether the observed odds ratio is statistically significant
chi2, p_value, _, _ = chi2_contingency(tab)
print("Chi-Square Statistic:", round(chi2,4))
print("P-Value:", round(p_value,4))

Odds Ratio: 13.1176
Chi-Square Statistic: 456.121
P-Value: 0.0


In [52]:
calculate_odds_ratio_ci(g1, g2, len(group_ids1)-g1, len(group_ids2)-g2)

(13.12, (9.93, 17.32))

### 3.3. Do successful creators send more non-follow actions towards mavens than to zombies?

In [53]:
#actions sent data + maven and zombie classification + successful/unsuccessful

In [54]:
distribution_target_user_type = actions_sent_non_fans.groupby(['creator_type', 'user_type']).fan_id.nunique()
dist_target_user = distribution_target_user_type.to_frame().reset_index()
dist_target_user.columns = ['creator_type', 'user_type', 'non_follow_actions']

In [55]:
dist_target_user

Unnamed: 0,creator_type,user_type,non_follow_actions
0,other,Maven,79
1,other,Stalker,6029
2,other,Star,447
3,other,Zombie,40344
4,other,other,153104
5,successful,Maven,385
6,successful,Stalker,24658
7,successful,Star,2150
8,successful,Zombie,22165
9,successful,other,161633


In [83]:
g1 = 340
g2 = 7570
group_ids1 = maven_ids
group_ids2 = zombie_ids

# named `tab` (as in the cells above) so the `table` dataframe is not overwritten
tab = np.array([[g1, g2], [len(group_ids1)-g1, len(group_ids2)-g2]])

# calculate the odds ratio by taking the ratio of the odds of the event occurring in the two groups
odds_ratio = (g1/(len(group_ids1)-g1)/(g2/(len(group_ids2)-g2)))
print("Odds Ratio:", round(odds_ratio,4))

# perform a chi-square test to determine whether the observed odds ratio is statistically significant
chi2, p_value, _, _ = chi2_contingency(tab)
print("Chi-Square Statistic:", round(chi2,4))
print("P-Value:", round(p_value,4))

Odds Ratio: 2.9181
Chi-Square Statistic: 302.6791
P-Value: 0.0


In [68]:
calculate_odds_ratio_ci(g1, g2, len(group_ids1)-g1, len(group_ids2)-g2)

(0.91, (0.81, 1.02))

### 3.4. Do successful creators send more non-follow actions towards mavens than to stars?

In [81]:
g1 = 340
g2 = 1820
group_ids1 = maven_ids
group_ids2 = star_ids

# named `tab` (as in the cells above) so the `table` dataframe is not overwritten
tab = np.array([[g1, g2], [len(group_ids1)-g1, len(group_ids2)-g2]])

# calculate the odds ratio by taking the ratio of the odds of the event occurring in the two groups
odds_ratio = (g1/(len(group_ids1)-g1)/(g2/(len(group_ids2)-g2)))
print("Odds Ratio:", round(odds_ratio,4))

# perform a chi-square test to determine whether the observed odds ratio is statistically significant
chi2, p_value, _, _ = chi2_contingency(tab)
print("Chi-Square Statistic:", round(chi2,4))
print("P-Value:", round(p_value,4))

Odds Ratio: 0.6005
Chi-Square Statistic: 58.3551
P-Value: 0.0


In [82]:
calculate_odds_ratio_ci(g1, g2, len(group_ids1)-g1, len(group_ids2)-g2)

(0.6, (0.53, 0.68))

## 5 Are creators better off (more likely to succeed) if they reach out/target mavens than stars?

To answer question 5, we will fit a logistic regression. The goal is to predict whether a creator becomes successful based on the number of non-follow activities and follows sent, both in aggregate and per user type.

Let's create the dataset to be used in the model.
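
One possible way to build the per-user-type counts is a cross-tabulation of creators against target types. The sketch below runs on a made-up frame with the same column names as `actions_sent_non_fans`; the feature names (`non_follow_to_*`, `follows_sent`) are hypothetical and this is not necessarily the feature set used in the final model.

```python
import pandas as pd

# Hypothetical actions: creator id, action kind and type of the targeted user.
actions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "outbound_activity": ["follow", "like", "comment", "follow", "share"],
    "user_type": ["Maven", "Maven", "Star", "Zombie", "Star"],
})

# Non-follow actions per creator and target type, one column per type.
non_follow = actions[actions.outbound_activity != "follow"]
features = pd.crosstab(non_follow.user_id, non_follow.user_type).add_prefix("non_follow_to_")

# Total follows sent per creator, added as an extra column.
follows = (actions[actions.outbound_activity == "follow"]
           .groupby("user_id").size().rename("follows_sent"))
features = features.join(follows).fillna(0)
print(features)
```

A frame like this has one row per creator, which is the shape a logistic regression on creator success would need.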

In [61]:
#Create target:
#Per creator:
#1 non-follow activities sent
#2 follow activities sent
#3 non-follow activities sent per user type
#4 follow activities sent per user type

In [62]:
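model_df = pd.DataFrame(creator_ids, columns = ['creator_id'])
# the target is membership in successful_ids (testing against creator_ids would always give 1)
model_df['successful'] = model_df.creator_id.isin(successful_ids).astype(int)
mask = actions_sent.outbound_activity != 'follow'
# count non-follow actions per creator once, then map the counts onto the creators
non_follow_counts = actions_sent.loc[mask, 'user_id'].value_counts()
model_df['total_non_follow_activities'] = model_df.creator_id.map(non_follow_counts).fillna(0).astype(int)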

In [63]:
creator_ids

array([37887492, 37847842, 37831822, ..., 38012037, 38362049, 38319852],
      dtype=int32)