<h1> Introductory Python Notebook to Semantic Shift Analysis Guide </h1>

The purpose of this guide is to explain how parts of the Semantic Shift API work.

<h3> Guidelines of this notebook </h3>

<p><b> Module functions </b></p>
The goal of this notebook is to introduce the user to the module functions, so the descriptions focus on how they work and how they should be used.

<p><b> Notebook functions </b></p>
These are helper functions implemented <i>specifically</i> for this notebook to illustrate certain examples cleanly. The user need not pay close attention to them, and should study them in detail only if the functionality being demonstrated or described is unclear.

In [2]:
import pandas as pd
import numpy as np
import gzip
import json
import os, glob, sys
import multiprocessing as mp

<h3> Module Imports </h3>
We append module_path to sys.path so that the in-house modules can be imported.<br>

In [6]:
module_path = '/home/ndg/users/abhand1/subreddit_data/module/'
if module_path not in sys.path:
    sys.path.append(module_path)

Load the relevant modules for Slang Analysis.

<ul>
    <li>Slang Analyser - Computes affinity terms for subreddits</li>
    <li>Subreddinfo - Contains helper functions such as pickle_dump and pickle_load</li>
    <li>Config - Loads path information. Path information is prefixed in config. </li>
    <li>Original Subreddit Metrics - The engine that computes the subreddit metrics (see generate_subreddit_metrics_df later in this notebook)</li>
</ul>

In [137]:
import slang_analyser
from subreddinfo import pickle_dump, pickle_load
import config

import original_subreddit_metrics

<h3> Loading Subreddits </h3> <br>
Below, we load the subreddits for the interval November 2014 to June 2015 into a list.<br>
The function get_subnames reads a file that lists one subreddit per line and returns the names as a list.

In [15]:
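def get_subnames(fname):
    """Gets subnames from a local file that lists one subreddit per line."""
    with open(fname) as f:
        # splitlines() works with or without a trailing newline,
        # so the last subreddit is never dropped; blank lines are skipped.
        subnames = [line for line in f.read().splitlines() if line]
    return subnames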

all_subs_path = 'all_subreddits_above_10000_comments.txt'
all_subs_full_path = os.path.join(module_path, all_subs_path)
subnames = get_subnames(all_subs_full_path)

In this introductory notebook guide, we use the first 10 subreddits in subnames to demonstrate the features and capabilities of the modules. These subreddits are:

<p>['motogp',
 'italy',
 'Shave_Bazaar',
 'poker',
 'ableton',
 'batman',
 'havoc_bot',
 'ftlgame',
 'lootcrate',
 'lacrosse']</p>

In [140]:
demo_subs = subnames[:10]

<h3> Calculating Affinity Values </h3> <br>
The first feature demonstrated is the extraction of affinity terms for a subreddit. A prerequisite is a word-frequency dictionary for each subreddit that maps each word to its frequency {word -> freq}. This step has already been done, and the dictionaries are stored in config.SUBDIR_ANALYSIS_LOAD_PATH/subname/{subname + "_filtered_lemma.pkl"}
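
As a rough illustration of the storage layout described above, the sketch below builds the pickle path for one subreddit and round-trips a toy dictionary through it. The helper word_freq_path, the temporary directory standing in for config.SUBDIR_ANALYSIS_LOAD_PATH, and the toy counts are all assumptions made for this example. The standard pickle module is used here in place of the in-house pickle_load helper.

```python
import os
import pickle
import tempfile


def word_freq_path(base_dir, subname):
    """Hypothetical helper: path of the {word -> freq} pickle for one subreddit."""
    return os.path.join(base_dir, subname, subname + "_filtered_lemma.pkl")


# A temporary directory stands in for config.SUBDIR_ANALYSIS_LOAD_PATH (assumption).
with tempfile.TemporaryDirectory() as base_dir:
    os.makedirs(os.path.join(base_dir, "motogp"))
    path = word_freq_path(base_dir, "motogp")

    # Write a toy word-frequency dictionary, then read it back.
    with open(path, "wb") as f:
        pickle.dump({"lap": 120, "honda": 45}, f)
    with open(path, "rb") as f:
        word_freq = pickle.load(f)

print(word_freq["lap"])
```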



The affinity terms are computed by passing the list of subnames to slang_analyser.run_slang_analyser which internally calls a function called affinity_analysis. <br>
affinity_analysis computes affinity terms for each subreddit using the affinity analysis formula. <br>

The affinity value of each term depends on the number and type of subreddits passed to the function, because the values are computed relative to the pool of data available.
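
The affinity analysis formula itself is not reproduced in this notebook. The sketch below uses a stand-in definition (the share of a word's pooled frequency that comes from one subreddit) purely to show why the pool matters. The function toy_affinity and the toy counts are assumptions for illustration, not the formula used by slang_analyser.

```python
from collections import Counter


def toy_affinity(word_freqs_by_sub):
    """Stand-in affinity (assumption): a word's frequency in one subreddit
    divided by its total frequency across every subreddit in the pool."""
    pooled = Counter()
    for freqs in word_freqs_by_sub.values():
        pooled.update(freqs)
    return {
        sub: {word: count / pooled[word] for word, count in freqs.items()}
        for sub, freqs in word_freqs_by_sub.items()
    }


small_pool = {
    "motogp": {"lap": 90, "the": 500},
    "poker": {"lap": 10, "the": 500},
}
# Same data plus one more subreddit that also uses "lap" heavily.
large_pool = dict(small_pool, batman={"lap": 100, "the": 500})

print(toy_affinity(small_pool)["motogp"]["lap"])  # 90 / 100
print(toy_affinity(large_pool)["motogp"]["lap"])  # 90 / 200
```

Under this stand-in definition the score of "lap" for motogp halves when a third subreddit is added, even though motogp's own data did not change.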

In [143]:
demo_affinity_terms = slang_analyser.run_slang_analyser(demo_subs)

In [199]:
def extract_some_affinity_terms(list_affinity_terms, subs, sub_index, head=10, desc=True, show_aff_score=True):
    """A print function that displays a head number of affinity terms for a subreddit.
    
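    Args:
        list_affinity_terms (list): a list of affinity-term dictionaries, one per
            subreddit. Each maps aff -> aff score.
        subs (list): subreddit names, in the same order as list_affinity_terms.
        sub_index (int): index of the subreddit to display.
        head (int): number of terms to display.
        desc (bool): if True, sort from highest to lowest affinity score.
        show_aff_score (bool): if True, print (term, score) pairs.
        
    Returns:
        None. The terms are printed rather than returned.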
    
    """
    print("Printing {} affinity terms of {}".format(head, subs[sub_index]))
    target_affinity_dic = list_affinity_terms[sub_index]
    aff_head = sorted(target_affinity_dic, 
                      key=lambda x: target_affinity_dic[x], 
                      reverse=desc)[:head]
    
    if show_aff_score:
        print(*list((aff, target_affinity_dic[aff]) for aff in aff_head), sep='\n')


In [198]:
extract_some_affinity_terms(demo_affinity_terms, demo_subs, 0)

Printing 10 affinity terms of motogp
('lap', 0.9991460290350128)
('marc', 0.9989561586638831)
('marquez', 0.9987266079763581)
('motogp', 0.9983514955863898)
('rid', 0.9983221476510067)
('honda', 0.9980065686539601)
('pedrosa', 0.997907949790795)
('dovi', 0.9977426636568849)
('cal', 0.9975728155339806)
('ducati', 0.997542539388946)


<h3> Graphing Imports </h3>
These imports are useful for the purposes of graphing

In [800]:
# Standard plotly imports
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

"""
For linear regression
"""
from sklearn import linear_model

<h2> Helper Functions for Numerical Analysis and Plotting </h2> 

In [793]:
home_dir = '/home/ndg/users/abhand1/'
all_subs_df = pickle_load(home_dir + 'all_subreddits_df.pkl')
all_subs_df_30 = all_subs_df[all_subs_df['aff_wc_in_models'] > 30]

For accuracy, although the dataset covers 2500+ subreddits, we only accept subreddits for which more than 30 of the top 100 affinity terms are present in the word embeddings of all four time intervals (many terms are missing from at least one of them). With 30 as the threshold, we end up with 1400+ subreddits.

<b>Note:</b> The folder <i>slang_semantic_shift_df</i> now stores dataframes built from n affinity terms that are present in all four word embeddings. all_subs_df does not reflect this data; it contains the results for the top 100 affinity terms by absolute rank, regardless of whether they are present in all four embeddings.

Also: <br>
aff_wc_in_models -> the number of words present in all four word embeddings (out of the top n affinity terms) <br>
wc -> word count
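
The definition of aff_wc_in_models above can be sketched as a set intersection. The function name count_terms_in_all_models and the toy vocabularies are assumptions for illustration. The actual column is computed inside the in-house modules, which may do it differently.

```python
def count_terms_in_all_models(top_terms, vocabularies):
    """Count how many of top_terms appear in every vocabulary.

    vocabularies is a non-empty list with one collection of words per
    word embedding (four in this notebook).
    """
    shared_vocab = set.intersection(*(set(vocab) for vocab in vocabularies))
    return sum(1 for term in top_terms if term in shared_vocab)


# Toy data: four vocabularies, one per time interval (hypothetical contents).
vocabularies = [
    {"lap", "honda", "ducati", "marquez"},
    {"lap", "honda", "ducati"},
    {"lap", "honda", "marquez"},
    {"lap", "honda", "ducati", "pedrosa"},
]
top_terms = ["lap", "marquez", "honda", "pedrosa"]

aff_wc_in_models = count_terms_in_all_models(top_terms, vocabularies)
print(aff_wc_in_models)  # only "lap" and "honda" are in all four
```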

In [796]:
# Data that is important
top_100_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/top_100_aff.csv', index_col=0)
top_50_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/top_50_aff.csv', index_col=0)
bot_50_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/bottom_50_aff.csv', index_col=0)

<h3> Explanation of columns </h3>
Only the important columns are explained. <br>
Ignore these columns: mean, std, slang_to_user_wc, user_percent, average_affinity_value <br>
half_load_average: Average affinity values of 

In [797]:
top_50_aff_df.head()

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
motogp,0.288,1.020322,169,0.13079,0.106694,0.152,0.206,0.008,0.046,50,0.702799,0.076748,0.173479,2202,24799
italy,0.096,0.585478,86,0.019293,0.997124,0.514,0.544,-0.002,0.032,50,0.999672,0.227693,0.275121,4976,183884
Shave_Bazaar,0.34,1.276088,200,0.241993,0.050463,0.234,0.13,-0.098,-0.006,50,0.19697,0.05694,0.218505,1405,14866
poker,2.041,6.611302,462,0.238101,0.129063,0.402,0.404,-0.018,0.02,50,0.986001,0.135908,0.217452,8572,138814
ableton,0.416,1.65074,199,0.153392,0.034866,0.19,0.198,0.002,0.006,50,0.121612,0.044617,0.09292,2712,11703


The following function smooths the data by averaging consecutive bins of rows (a simple form of moving-average smoothing).

More information: 
http://bestmaths.net/online/index.php/year-levels/year-12/year-12-topic-list/smoothing-techniques/

In [370]:
def get_smoothened_intervaled_frames(temp, sorting_index, n=10, asc=False):
    """
    Purpose: 
    Receives a dataframe, treats it as temporary, and sorts its rows by the column named in
    sorting_index. For instance, if sorting_index is 'loyalty', the rows are sorted in ascending
    order of loyalty. The function then smooths the dataset by binning it into boxes of size n
    (default is 10). In other words, each run of n consecutive rows is averaged into a single row,
    and the collection of averaged rows is returned as a new dataframe.
    Note: the final bin is dropped from the result, and the asc argument is currently unused.
    """
    temp = temp.sort_values(by=sorting_index)
    i = 0
    df_val = 0
    new_temp = pd.DataFrame(columns=temp.columns)
    temp_summer = 0
    for index, row in temp.iterrows():
        if i % n == 0 and i != 0:
            temp_summer = temp_summer/n
            new_temp.loc[df_val] = temp_summer
            df_val += 1
            temp_summer = 0
            i = 0
        if type(temp_summer) == int and temp_summer == 0:
            temp_summer = row
        else:
            temp_summer += row
        i += 1
    temp_summer = temp_summer/i
    new_temp.loc[df_val] = temp_summer
    return new_temp.iloc[:-1]
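
The same binning can be sketched more compactly with a groupby on integer bin ids. This is an alternative written for illustration, not the function used in the rest of the notebook. The name bin_average and the toy dataframe are assumptions, and all columns are assumed to be numeric.

```python
import numpy as np
import pandas as pd


def bin_average(df, sort_col, n=10):
    """Sort by sort_col, average each run of n consecutive rows,
    and drop the final bin, as the loop-based function above does."""
    ordered = df.sort_values(by=sort_col).reset_index(drop=True)
    bin_ids = np.arange(len(ordered)) // n  # 0,0,...,1,1,...
    binned = ordered.groupby(bin_ids).mean()
    return binned.iloc[:-1]


toy = pd.DataFrame(
    {
        "loyalty": [0.5, 0.1, 0.3, 0.2, 0.4],
        "shift": [5.0, 1.0, 3.0, 2.0, 4.0],
    }
)
smoothed = bin_average(toy, "loyalty", n=2)
print(smoothed)
```

With five rows and n=2, the sorted rows fall into bins of sizes 2, 2 and 1. The last bin is dropped, leaving two averaged rows.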
    

The following helper functions plot the dataframe data.
They produce an interactive plotly graph.

In [792]:
def data_to_plotly(x):
    """Flattens an (n, 1) array of predictions into a plain list of n values."""
    k = []
    
    for i in range(0, len(x)):
        k.append(x[i][0])
        
    return k


def create_iplot_trends(df, x_index, y_indexes, markers=None, title='', secondary_y=''):
    """Plots each column in y_indexes against x_index, together with its linear regression line."""
    if markers is None:
        markers = {'reg': 'lines', 'index': 'markers'}
    temp_df = df.copy()
    df_markers = {}
    keys = []
    for y_index in y_indexes:
        # Fit y = a * x + b; sklearn expects column vectors of shape (n, 1).
        regr = linear_model.LinearRegression()
        x_val = np.reshape(list(df[x_index]), (-1, 1))
        y_val = np.reshape(list(df[y_index]), (-1, 1))
        df_markers[y_index] = markers['index']
        keys.append(y_index)
        regr.fit(x_val, y_val)
        
        # Store the fitted values in a '<column>_reg' column, drawn as a line.
        y_reg = regr.predict(x_val)
        y_reg_index = y_index + '_reg'
        temp_df[y_reg_index] = data_to_plotly(y_reg)
        df_markers[y_reg_index] = markers['reg']
        keys.append(y_reg_index)
    # Index by the x column so that iplot uses it as the x axis.
    temp_df_sorted = temp_df.set_index(x_index).sort_index()
    temp_df_sorted.iplot(keys=keys, mode=df_markers, xTitle=x_index, title=title, secondary_y=secondary_y)
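
To see what the '_reg' columns contain without plotting, the sketch below computes a least-squares trend line for toy data. It uses np.polyfit as a swapped-in substitute for sklearn's LinearRegression, so that the example needs only numpy. The function name linear_trend and the data are assumptions for illustration.

```python
import numpy as np


def linear_trend(x, y):
    """Return the fitted values of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope * x + intercept


# Toy data: roughly y = 2x, with a little scatter.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
trend = linear_trend(x, y)
print(trend)
```

The returned array plays the role of the regression line that is drawn with mode 'lines', while the original y values are drawn with mode 'markers'.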
        

<h2> Numerical Analysis</h2>

In the following examples of numerical analysis, we run the analysis on smoothened datasets. In other words, each group of n consecutive points (default n = 10) is averaged into a single point. At times all points are used instead of the smoothed averages, to check whether the trend also exists in the unsmoothed data.

For instance, loyalty and semantic shift show no clear relationship when the analysis is run on all points. Hence the loss of information from smoothing may lead one to believe there is a trend where there may be none.

<b>Sorting Reason:</b> For each primary variable whose relationship with other features is being examined, we create a smoothened dataframe from the complete dataframe SORTED on the primary variable. This allows us to see whether linear patterns are present.

<b>ALSO NOTE </b>: "Semantic Shift" and "Net Semantic Shift" are <b>not</b> the same.
Net semantic shift is the net amount of shift that words have undergone.
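
The caution about smoothing can be made concrete with a small synthetic example. The data below are constructed, not taken from the subreddit dataframes: a weak slope plus alternating noise that cancels exactly inside each bin of two points. This is a deliberately extreme case, chosen to show how binning can turn a weak correlation into a near-perfect one.

```python
import numpy as np

x = np.arange(20, dtype=float)
noise = np.where(x % 2 == 0, 1.0, -1.0)  # alternating +1 / -1
y = 0.1 * x + noise

# Pearson correlation on the raw points.
r_raw = np.corrcoef(x, y)[0, 1]

# Average consecutive pairs (bins of n = 2); the noise cancels inside each bin.
x_binned = x.reshape(-1, 2).mean(axis=1)
y_binned = y.reshape(-1, 2).mean(axis=1)
r_binned = np.corrcoef(x_binned, y_binned)[0, 1]

print(round(r_raw, 2), round(r_binned, 2))  # roughly 0.44 and 1.0
```

This is why the notebook plots both the smoothened and the unsmoothened data before reading anything into a trend line.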

<h3> Analysis of relationship of Loyalty with Existing Features </h3>
Explore the relationship between loyalty and semantic shift, net semantic shift, no._of_comments and no._of_users

In [479]:
"""
We smoothen the dataframe by binning n points into 1,
since the dataframe contains 1400+/2500+ points.

default n = 10
"""

loyalty_smoothened_df = get_smoothened_intervaled_frames(ss, 'loyalty')

# Note: despite the _10 suffix, this dataframe is binned with n=200.
loyalty_smoothened_df_10 = get_smoothened_intervaled_frames(ss, 'loyalty', n=200)

In [714]:
# Plotting all 474 points. 
create_iplot_trends(df=filter_stuff[filter_stuff['net_backward_shift']  >  0.1], 
                       x_index='half_load_affinity_average', 
                       y_indexes=['no._of_comments', 'net_backward_shift', 'loyalty'], 
                       title='Loyalty Vs. Net Semantic Shift', secondary_y='net_backward_shift')

In [763]:
new_df.columns

Index(['mean', 'std', 'slang_to_user_wc', 'user_percent',
       'average_affinity_value', 'forward_shift', 'backward_shift',
       'net_forward_shift', 'net_backward_shift', 'aff_wc_in_models',
       'half_load_affinity_average', 'loyalty', 'dedication', 'no._of_users',
       'no._of_comments', 'net_forward_shift_neutral',
       'net_backward_shift_neutral', 'backward_shift_diff'],
      dtype='object')

In [777]:
ss[ss['backward_shift'] == 0]

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
announcements,1.495,4.895506,365,0.091465,0.040756,0.0,0.0,0.0,0.0,0,0.0,6.1e-05,0.036035,16345,46339
DWMA,0.018,0.132951,21,0.333333,0.049139,0.0,0.0,0.0,0.0,0,0.0,0.333333,0.666667,54,23049
empirepowers,0.046,0.240591,50,0.511111,0.094016,0.0,0.0,0.0,0.0,0,0.0,0.044444,0.711111,90,10815


In [772]:
os.listdir('slang_semantic_shift_df/')

['top_100_aff.csv', 'bottom_50_aff.csv', 'top_50_aff.csv']

In [804]:
# Plotting all 474 points. 
create_iplot_trends(df=ss, 
                       x_index='net_backward_shift', 
                       y_indexes=['half_load_affinity_average'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [767]:
# Plotting all 474 points. 
create_iplot_trends(df=new_df, 
                       x_index='loyalty', 
                       y_indexes=['backward_shift_diff'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [790]:
# Plotting all 474 points. 
create_iplot_trends(df=new_df, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_backward_shift_neutral'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [716]:
# Plotting all 474 points. 
create_iplot_trends(df=ss2, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [573]:
filter_stuff[filter_stuff['net_backward_shift'] > 0.2]

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
DFO,0.767,2.509325,280,0.143741,0.114808,0.129,0.546,-0.013,0.43,100,0.081979,0.259183,0.276424,5336,108351
fantasybaseball,1.018,3.036392,328,0.134443,0.093264,0.141,0.444,-0.018,0.321,100,0.225849,0.216984,0.314448,7572,164300
MaddenMobileForums,0.672,2.135045,295,0.243478,0.0642,0.161,0.392,-0.009,0.24,100,0.07437,0.288043,0.356159,2760,91356
amiibo,7.114,17.329368,693,0.218288,0.034992,0.403,0.589,-0.078,0.264,100,0.207293,0.270758,0.356459,32590,1222527
CitiesSkylines,3.22,10.445363,465,0.098024,0.04038,0.157,0.5,-0.018,0.361,100,0.048154,0.040488,0.141283,32849,238672
ContestOfChampions,0.425,1.945347,205,0.22703,0.036912,0.213,0.444,-0.033,0.264,100,0.039451,0.245192,0.260684,1872,33121
Nationals,0.398,1.384773,221,0.178796,0.108194,0.119,0.321,-0.006,0.208,100,0.246216,0.124888,0.2646,2226,97248
bloodborne,4.765,12.53187,590,0.157427,0.08697,0.146,0.576,-0.026,0.456,100,0.014268,0.11603,0.243987,30268,445601
Muse,0.852,2.86986,317,0.181663,0.067352,0.203,0.457,0.014,0.24,100,0.229544,0.169936,0.24371,4690,66067
StarWarsLeaks,0.54,1.99359,224,0.214626,0.07488,0.198,0.396,-0.023,0.221,100,0.171248,0.089825,0.200318,2516,27002


In [717]:
# Plotting all 474 points. 
create_iplot_trends(df=filter_stuff, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [None]:
#ratio of loyalty vs subreddit size -> 

In [718]:
# Plotting all 474 points. 
create_iplot_trends(df=loyalty_smoothened_df_10, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [721]:
# CLEARLY No Distinguishable relationship!! Plotting 47 points.
# Plotting all 474 points. 
create_iplot_trends(df=ss, 
                       x_index='loyalty', 
                       y_indexes=['backward_shift', 'forward_shift'], 
                       title='Loyalty Vs. Semantic Shift')

In [743]:
jj = filter_stuff.copy()
jj['smth'] = filter_stuff['backward_shift'] - filter_stuff['forward_shift']

In [744]:
lk = get_smoothened_intervaled_frames(jj, 'smth')

In [748]:
len(jj)

1744

In [749]:
len(jj[jj['smth'] > 0])

1094

In [751]:
# CLEARLY No Distinguishable relationship!!
create_iplot_trends(df=jj, 
                       x_index='loyalty', 
                       y_indexes=['smth'], 
                       title='Loyalty Vs. Semantic Shift')

In [722]:
# CLEARLY No Distinguishable relationship!!
create_iplot_trends(df=loyalty_smoothened_df_10, 
                       x_index='loyalty', 
                       y_indexes=['backward_shift', 'forward_shift'], 
                       title='Loyalty Vs. Semantic Shift')

In [469]:

all_subs_df[['no._of_users', 'no._of_comments', 'loyalty']].set_index('loyalty').sort_index().iplot(subplots=True, subplot_titles=True, mode='markers', xTitle='loyalty')



<h3> Analysis of relationship of No. Of Comments with Existing Features </h3>
Explore the relationship between no. of comments and semantic shift, net semantic shift, and loyalty

In [486]:
len(ss[ss['no._of_comments'] > 20000])

1794

In [487]:
create_iplot_trends(df=ss[ss['no._of_comments'] > 20000], 
                       x_index='no._of_comments', 
                       y_indexes=['backward_shift', 'forward_shift'], 
                       title='No. of Comments Vs. Semantic Value')


In [482]:
"""
We smoothen the dataframe by binning n points into 1,
since the dataframe contains 1400+/2500+ points.

default n = 10
"""

comments_smoothened_df = get_smoothened_intervaled_frames(ss, 'no._of_comments')

# Note: despite the _10 suffix, this dataframe is binned with n=200.
comments_smoothened_df_10 = get_smoothened_intervaled_frames(ss, 'no._of_comments', n=200)

In [492]:
create_iplot_trends(df=ss[ss['no._of_comments'] > 20000], 
                       x_index='no._of_comments', 
                       y_indexes=['net_backward_shift'], 
                       title='No. of Comments Vs. Semantic Value')


In [757]:
create_iplot_trends(df=comments_smoothened_df_10, 
                       x_index='no._of_comments', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='No. of Comments Vs. Semantic Value')


In [756]:
create_iplot_trends(df=ss, 
                       x_index='no._of_comments', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='No. of Comments Vs. Semantic Value')


In [607]:
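# Boolean masks: non-zero forward shift, more than 20000 comments,
# and all 50 affinity terms present in the models.
bytea = ss2['forward_shift'] != 0 
byteb = ss2['no._of_comments'] > 20000
bytec = ss2['aff_wc_in_models'] == 50
filter_stuff2 = ss2[bytea & byteb & bytec]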

In [525]:
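# Boolean masks: non-zero forward shift, more than 20000 comments,
# and all 100 affinity terms present in the models.
bytea = ss['forward_shift'] != 0 
byteb = ss['no._of_comments'] > 20000
bytec = ss['aff_wc_in_models'] == 100
filter_stuff = ss[bytea & byteb & bytec]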
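
The filtering pattern used in these cells can be tried on a toy dataframe. The column names are the ones used in this notebook. The subreddit names and values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "forward_shift": [0.2, 0.0, 0.3, 0.1],
        "no._of_comments": [25000, 30000, 15000, 40000],
        "aff_wc_in_models": [100, 100, 100, 76],
    },
    index=["sub_a", "sub_b", "sub_c", "sub_d"],
)

# Each comparison yields a boolean Series aligned with the rows of toy.
has_shift = toy["forward_shift"] != 0
is_large = toy["no._of_comments"] > 20000
full_coverage = toy["aff_wc_in_models"] == 100

# & combines the masks element-wise; only rows passing all three are kept.
filtered = toy[has_shift & is_large & full_coverage]
print(list(filtered.index))  # ['sub_a']
```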

In [None]:
ss[ss['no._of_comments'] > 20000]

In [521]:
filter_stuff.sort_values(by="half_load_affinity_average")['loyalty']

gif                     0.002571
answers                 0.001825
rant                    0.008303
promos                  0.004662
self                    0.005579
Damnthatsinteresting    0.001043
TrueAskReddit           0.002017
makemychoice            0.014096
AMA                     0.014140
geek                    0.000482
NoStupidQuestions       0.009902
instant_regret          0.000608
offmychest              0.019908
wallpapers              0.005051
firstworldproblems      0.001526
inthenews               0.003026
AskTrollX               0.024873
Foodforthought          0.003272
Frozenfriends           0.182065
interestingasfuck       0.002827
me_irl                  0.009705
explainlikeimfive       0.006096
entertainment           0.002716
blog                    0.000403
TrueReddit              0.007566
CasualConversation      0.064340
firstimpression         0.018227
fffffffuuuuuuuuuuuu     0.012427
Showerthoughts          0.007832
Advice                  0.023918
          

In [531]:
smth = get_smoothened_intervaled_frames(ss, 'loyalty')

In [566]:
jj = ss.rolling(window=50)

In [564]:
jj.mean().iloc[4:]

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
ableton,,,,,,,,,,,,,,,
batman,,,,,,,,,,,,,,,
havoc_bot,,,,,,,,,,,,,,,
ftlgame,,,,,,,,,,,,,,,
lootcrate,,,,,,,,,,,,,,,
lacrosse,0.8601,3.086869,241.9,0.175130,0.210417,0.273200,0.286500,-2.950000e-02,0.042800,100.0,0.413147,0.080692,0.144552,4722.6,62029.4
aliens,0.8498,3.049549,239.5,0.171251,0.203377,0.269200,0.283100,-2.930000e-02,0.043200,100.0,0.377505,0.076946,0.139686,4703.5,60918.0
ruby,0.8677,3.119585,244.6,0.178650,0.113547,0.237500,0.250600,-2.920000e-02,0.042300,100.0,0.300959,0.060859,0.121807,4500.7,44006.2
Everton,0.8594,3.092224,239.2,0.172460,0.115758,0.237600,0.257300,-2.410000e-02,0.043800,100.0,0.323319,0.070512,0.126726,4502.9,45045.8
StarWars,1.4978,4.769341,256.9,0.165205,0.109999,0.243000,0.265900,-2.180000e-02,0.044700,100.0,0.275827,0.058947,0.116181,8734.8,65818.2
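
For reference, a rolling mean leaves the first window - 1 rows empty, because a full window is not yet available there. The sketch below shows this on a toy series with made-up values. Note that the output above first shows values at the tenth row, which is what a window of 10 would produce, so it was possibly generated before the window=50 cell was run.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each entry is the mean of the current value and the two before it.
rolled = values.rolling(window=3).mean()
leading_nans = int(rolled.isna().sum())

print(rolled.tolist())  # the first two entries are NaN
print(leading_nans)
```

Unlike the binning function used earlier, a rolling mean keeps one output row per input row, so neighbouring rows share most of their window.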


In [571]:
create_iplot_trends(df=ss, 
                       x_index='half_load_affinity_average', 
                       y_indexes=['no._of_users'], 
                       title='No. of Comments Vs. Semantic Value')


In [649]:
from importlib import reload
slang_analyser = reload(slang_analyser)

In [650]:
from importlib import reload
original_subreddit_metrics = reload(original_subreddit_metrics)

In [645]:
os.listdir(os.path.join(config.SUBDIR_ANALYSIS_LOAD_PATH, 'nba'))

['nba_filtered_lemma.pkl',
 'nba_usernames.pkl',
 'nba_neutral_1000.pkl',
 'nba_u2w.pkl',
 'nba-2015-05-06.model',
 'nba_lemma_dic.pkl',
 'nba-2015-03-04.model',
 'nba_w2u.pkl',
 'nba-2015-01-02.model',
 'nba-2014-11-12.model',
 'nba_affinity_terms.pkl',
 'nba_aff_1000.pkl']

In [781]:
mm = sorted(stuff, key=lambda x: stuff[x], reverse=False)

In [785]:
mm = [m for m in mm if stuff[m] != 0]

In [787]:
mm[-100:]

['sheed',
 'delly',
 'iguodala',
 'peja',
 'yao',
 'boozer',
 'speights',
 'ginobli',
 'garnett',
 'coty',
 'grantland',
 'westbrook',
 'bibby',
 'kareem',
 'dwade',
 'cavs',
 'wilt',
 'rasheed',
 'kerr',
 'dray',
 'lebrons',
 'bynum',
 'bogut',
 'sacre',
 'blatt',
 'flagrant',
 'klay',
 'mutombo',
 'linsanity',
 'speight',
 'windhorst',
 'roty',
 'kyrie',
 'kobe',
 'ezeli',
 'ewing',
 'vogel',
 'telfair',
 'dunker',
 'halfcourt',
 'sabonis',
 'hakeem',
 'lebron',
 'freethrow',
 'divac',
 'mjax',
 'shaq',
 'mosgov',
 'kwame',
 'dellavedova',
 'rodman',
 'iverson',
 'hardens',
 'woj',
 'scottie',
 'dpoy',
 'olajuwon',
 'ecf',
 'battier',
 'varejao',
 'igoudala',
 'artest',
 'lbj',
 'ppg',
 'stepback',
 'beno',
 'horry',
 'lac',
 'delonte',
 'hubie',
 'shaqtin',
 'thabeet',
 'mcgrady',
 'gsw',
 'fta',
 'pippen',
 'lal',
 'bbiq',
 'nop',
 'fga',
 'wcf',
 'wnba',
 'fiba',
 'jvg',
 'drtg',
 'aau',
 'zbo',
 'fts',
 'mle',
 'joerger',
 'ortg',
 'fmvp',
 'probasketballtalk',
 'mip',
 'moy',
 '

In [786]:
mm[:100]

['map',
 'automoderator',
 'moderator',
 'cat',
 'compose',
 'bot',
 'recommend',
 'government',
 'mail',
 'download',
 'store',
 'email',
 'steam',
 'message',
 'service',
 'album',
 'driver',
 'automatically',
 'artist',
 'technology',
 'advice',
 'card',
 'comic',
 'enemy',
 'bottle',
 'her',
 'character',
 'author',
 'helpful',
 'online',
 'computer',
 'app',
 'photo',
 'shop',
 'car',
 'industry',
 'mission',
 'episode',
 'band',
 'damage',
 'meat',
 'update',
 'policy',
 'image',
 'filter',
 'craft',
 'gender',
 'location',
 'truck',
 'input',
 'search',
 'program',
 'society',
 'she',
 'station',
 'weapon',
 'information',
 'remove',
 'chat',
 'subject',
 'partner',
 'access',
 'method',
 'wheel',
 'copy',
 'function',
 'sexual',
 'user',
 'water',
 'student',
 'private',
 'request',
 'mine',
 'boot',
 'subreddit',
 'sale',
 'food',
 'community',
 'safety',
 'religion',
 'porn',
 'audio',
 'company',
 'info',
 'plant',
 'cell',
 'political',
 'female',
 'button',
 'iron',
 'sexy

In [651]:
%%time
from time import time

start = time()
ss3 = original_subreddit_metrics.generate_subreddit_metrics_df(subnames, n=1000, j=50, load_style="half_load")
end = time()


Mean of empty slice.


invalid value encountered in double_scalars



CPU times: user 47.3 s, sys: 24.6 s, total: 1min 11s
Wall time: 7min 11s


In [457]:
ss2

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
motogp,0.288,1.020322,169,0.130790,0.106694,0.170000,0.226000,-5.551115e-19,5.600000e-02,100,0.381149,0.076748,0.173479,2202,24799
italy,0.096,0.585478,86,0.019293,0.997124,0.474000,0.508000,-4.000000e-03,3.800000e-02,100,0.999536,0.227693,0.275121,4976,183884
Shave_Bazaar,0.340,1.276088,200,0.241993,0.050463,0.225000,0.131000,-9.000000e-02,-4.000000e-03,100,0.104096,0.056940,0.218505,1405,14866
poker,2.041,6.611302,462,0.238101,0.129063,0.330000,0.366000,1.000000e-03,3.500000e-02,100,0.917092,0.135908,0.217452,8572,138814
ableton,0.416,1.650740,199,0.153392,0.034866,0.194000,0.217000,-3.000000e-03,2.600000e-02,100,0.068420,0.044617,0.092920,2712,11703
batman,3.498,10.232106,550,0.196429,0.028485,0.255000,0.300000,-5.000000e-03,5.000000e-02,100,0.164084,0.019149,0.095968,17808,90970
havoc_bot,0.036,0.500704,28,0.135338,0.612850,0.286000,0.376000,-7.600000e-02,1.660000e-01,100,0.996219,0.011278,0.022556,266,93350
ftlgame,0.610,2.256081,246,0.157257,0.067638,0.336000,0.318000,-1.600000e-02,-2.000000e-03,100,0.315914,0.036092,0.140242,3879,33955
lootcrate,0.620,1.737700,260,0.207636,0.028049,0.263000,0.207000,-8.100000e-02,2.500000e-02,100,0.031688,0.062960,0.078701,2986,13414
lacrosse,0.656,4.998166,219,0.271074,0.048936,0.199000,0.216000,-2.100000e-02,3.800000e-02,100,0.153271,0.135537,0.130579,2420,14539


In [458]:
ss[ss['aff_wc_in_models'] != 100]

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
yorku,0.126000,1.141106,70,0.105439,0.052525,0.114894,0.156383,5.319149e-03,0.036170,94,0.045888,0.108787,0.196653,1195,14631
fantasyhockey,0.554000,1.872721,242,0.172103,0.105057,0.384211,0.110526,-2.776316e-01,0.003947,76,0.122320,0.193538,0.286114,3219,61799
SiliconValleyHBO,0.599000,2.062086,246,0.083788,0.067316,0.380000,0.170000,-2.800000e-01,0.070000,10,0.024497,0.005735,0.067002,7149,26374
fresh_funny,0.782609,2.261009,44,0.240000,0.005900,0.761905,0.666667,-1.190476e-01,0.023810,21,0.025148,0.004444,0.017778,450,11619
Dirtybomb,0.389000,1.294480,196,0.116363,0.081090,0.281818,0.136364,-1.636364e-01,0.018182,11,0.015170,0.046964,0.167514,3343,27173
Fallout4,0.442000,1.438275,215,0.164129,0.013672,0.310448,0.117910,-1.940299e-01,0.001493,67,0.009740,0.007055,0.067954,2693,10144
announcements,1.495000,4.895506,365,0.091465,0.040756,0.000000,0.000000,0.000000e+00,0.000000,0,0.000000,0.000061,0.036035,16345,46339
mistyfront,0.192000,1.592211,81,0.235583,0.154625,0.639130,0.467391,-1.782609e-01,0.006522,46,0.029860,0.003681,0.023313,815,34040
Astros,0.363000,1.217058,195,0.225046,0.073278,0.162687,0.161194,-5.970149e-02,0.058209,67,0.138402,0.098574,0.265964,1613,49493
bourbon,4.435345,15.946980,153,0.219309,0.022348,0.373684,0.305263,-1.157895e-01,0.047368,19,0.028038,0.091645,0.168372,4692,62768
