<h1> Introductory Python Notebook to Semantic Shift Analysis Guide </h1>

The purpose of this guide is to explain how parts of the Semantic Shift API work.

<h3> Guidelines of this notebook </h3>

<p><b> Module functions </b></p>
The goal of this notebook is to introduce the user to the module functions, so the descriptions focus on how they work and how they should be used.

<p><b> Notebook functions </b></p>
These are helper functions implemented <i>specifically</i> for this notebook to illustrate certain examples cleanly. The user need not pay close attention to them, and should study them in detail only if the functionality being demonstrated or described is unclear.

In [2]:
import pandas as pd
import numpy as np
import gzip
import json
import os, glob, sys
import multiprocessing as mp

<h3> Module Imports </h3>
We append module_path to sys.path to be able to import any in-house module.<br>

In [6]:
module_path = '/home/ndg/users/abhand1/subreddit_data/module/'
if module_path not in sys.path:
    sys.path.append(module_path)

Load relevant modules for Slang Analysis. 

<ul>
    <li>Slang Analyser - Computes Affinity terms for subreddits</li>
    <li>Subreddinfo - Contains helper functions such as pickle_dump and pickle_load</li>
    <li>Config - Loads path information. Path information is prefixed in config. </li>
    <li>Original Subreddit Metrics - The engine that computes </li>
</ul>

In [137]:
import slang_analyser
from subreddinfo import pickle_dump, pickle_load
import config

import original_subreddit_metrics

<h3> Loading Subreddits </h3> <br>
Below, we calculate and load the subreddits in a list for the interval of 2014 November to 2015 June.<br>
The function get_subnames loads and parses subreddits from a file that contains subreddits line by line. 

In [15]:
def get_subnames(fname):
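    """Read subreddit names from a local file, one name per line."""
    with open(fname) as f:
        # Skip blank lines so a missing trailing newline does not drop the last name.
        subnames = [line.strip() for line in f if line.strip()]
    return subnames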

all_subs_path = 'all_subreddits_above_10000_comments.txt'
all_subs_full_path = os.path.join(module_path, all_subs_path)
subnames = get_subnames(all_subs_full_path)

In this introductory notebook guide, we will use the first 10 subreddits in the subnames list to demonstrate the features and capacity of the modules that we have. These subreddits are:

<p>['motogp',
 'italy',
 'Shave_Bazaar',
 'poker',
 'ableton',
 'batman',
 'havoc_bot',
 'ftlgame',
 'lootcrate',
 'lacrosse']</p>

In [140]:
demo_subs = subnames[:10]

<h3> Calculating Affinity Values </h3> <br>
The first feature demonstrated is how to extract affinity terms for a subreddit. A prerequisite is to compute, for each subreddit, a word-frequency dictionary that maps each word to its frequency {word -> freq}. This step has already been done, and these dictionaries are stored in config.SUBDIR_ANALYSIS_LOAD_PATH/subname/{subname + "\_filtered_lemma.pkl"} 



The affinity terms are computed by passing the list of subnames to slang_analyser.run_slang_analyser which internally calls a function called affinity_analysis. <br>
affinity_analysis computes affinity terms for each subreddit using the affinity analysis formula. <br>

The affinity value for each term will be dependent on the number of subreddits and type of subreddits that are passed to the function, because the values are dependent on the pool of data available.
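
The exact affinity formula is not shown in this notebook. As an illustration only, the sketch below assumes a simple score: the share of a word's total occurrences across the pool of subreddits that fall in one subreddit. The function name `toy_affinity` and the toy counts are hypothetical and are not part of `slang_analyser`; the point is only that the same word gets a different score when the pool changes.

```python
from collections import Counter

def toy_affinity(word_freqs_by_sub):
    """Hypothetical affinity: freq of a word in one sub / freq of the word in the whole pool.

    This is an assumed formula for illustration, not the one used by slang_analyser.
    """
    pool_totals = Counter()
    for freqs in word_freqs_by_sub.values():
        pool_totals.update(freqs)  # adds the counts of each word to the pool total
    return {
        sub: {word: count / pool_totals[word] for word, count in freqs.items()}
        for sub, freqs in word_freqs_by_sub.items()
    }

# Toy {word -> freq} dictionaries (made-up numbers).
motogp = {"lap": 90, "the": 500}
poker = {"fold": 80, "the": 500}
batman = {"lap": 10, "the": 500}

small_pool = toy_affinity({"motogp": motogp, "poker": poker})
large_pool = toy_affinity({"motogp": motogp, "poker": poker, "batman": batman})

print(small_pool["motogp"]["lap"])  # 'lap' occurs only in motogp within this pool
print(large_pool["motogp"]["lap"])  # adding batman lowers the score for motogp
```

Under this assumed formula, a word shared evenly by every subreddit (such as "the" here) scores low everywhere, while a word concentrated in one subreddit scores close to 1 there.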

In [143]:
demo_affinity_terms = slang_analyser.run_slang_analyser(demo_subs)

In [199]:
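def extract_some_affinity_terms(list_affinity_terms, subs, sub_index, head=10, desc=True, show_aff_score=True):
    """Print the first `head` affinity terms of one subreddit, sorted by affinity score.
    
    Args:
        list_affinity_terms (list): one dictionary per subreddit, mapping aff -> aff score.
        subs (list): subreddit names, in the same order as list_affinity_terms.
        sub_index (int): index of the subreddit to display.
        head (int): number of terms to print.
        desc (bool): sort by descending score if True, ascending otherwise.
        show_aff_score (bool): print (term, score) pairs if True, terms only otherwise.
        
    Returns:
        None. The terms are printed.
    """
    print("Printing {} affinity terms of {}".format(head, subs[sub_index]))
    target_affinity_dic = list_affinity_terms[sub_index]
    aff_head = sorted(target_affinity_dic, 
                      key=lambda x: target_affinity_dic[x], 
                      reverse=desc)[:head]
    
    if show_aff_score:
        print(*((aff, target_affinity_dic[aff]) for aff in aff_head), sep='\n')
    else:
        print(*aff_head, sep='\n')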


In [198]:
extract_some_affinity_terms(demo_affinity_terms, demo_subs, 0)

Printing 10 affinity terms of motogp
('lap', 0.9991460290350128)
('marc', 0.9989561586638831)
('marquez', 0.9987266079763581)
('motogp', 0.9983514955863898)
('rid', 0.9983221476510067)
('honda', 0.9980065686539601)
('pedrosa', 0.997907949790795)
('dovi', 0.9977426636568849)
('cal', 0.9975728155339806)
('ducati', 0.997542539388946)


<h3> Graphing Imports </h3>
These imports are useful for the purposes of graphing

In [1527]:
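# Standard plotly imports
# Note: plotly.plotly was moved to the chart_studio package in plotly 4, so this import needs plotly < 4.
import plotly.plotly as py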
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

# For linear regression
from sklearn import linear_model

<h2> Helper Functions for Numerical Analysis and Plotting </h2> 

In [793]:
home_dir = '/home/ndg/users/abhand1/'
all_subs_df = pickle_load(home_dir + 'all_subreddits_df.pkl')
all_subs_df_30 = all_subs_df[all_subs_df['aff_wc_in_models'] > 30]

For the purposes of accuracy, while we have datasets of about 2500+ subreddits, we only accept subreddits where, 
of the 100 affinity terms, over 30 were present in the word embeddings of all time intervals (many terms are not present in all 4). With 30 as the threshold, we end up with about 1400+ subreddits. 

<b>Note:</b> The folder <i>slang_semantic_shift_df</i> now stores dataframes that use n affinity terms present in all four word embeddings. all_subs_df does not reflect this data; instead, it contains the results for the top 100 absolute affinity terms (regardless of whether they are present in all 4 embeddings).

Also: <br>
aff_wc_in_models -> the number of words present in all four word embeddings (out of the top n affinity terms) <br>
wc -> word count
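
The aff_wc_in_models count described above can be sketched as a set intersection. The names below (`count_terms_in_all_models`, the toy vocabularies) are hypothetical and only assume that each word embedding exposes its vocabulary as a collection of words; the module's actual implementation may differ.

```python
def count_terms_in_all_models(affinity_terms, vocabularies):
    """Count affinity terms that appear in every vocabulary (one per time interval)."""
    shared_vocab = set.intersection(*map(set, vocabularies))
    return sum(1 for term in affinity_terms if term in shared_vocab)

# Toy data: four vocabularies standing in for the four word embeddings.
vocabularies = [
    {"lap", "marquez", "honda", "ducati"},
    {"lap", "marquez", "honda"},
    {"lap", "honda", "ducati"},
    {"lap", "honda", "pedrosa"},
]
top_terms = ["lap", "marquez", "honda", "pedrosa"]

aff_wc_in_models = count_terms_in_all_models(top_terms, vocabularies)
print(aff_wc_in_models)  # 2 ('lap' and 'honda' are in all four vocabularies)

# A subreddit is kept only if this count exceeds the threshold (30 in this notebook).
THRESHOLD = 30
print(aff_wc_in_models > THRESHOLD)
```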

In [796]:
# Data that is important
top_100_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/top_100_aff.csv', index_col=0)
top_50_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/top_50_aff.csv', index_col=0)
bot_50_aff_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/bottom_50_aff.csv', index_col=0)

In [1525]:
user_rate_df = pd.read_csv(home_dir + 'slang_semantic_shift_df/top_50_aff_user_rate.csv', index_col=0)

<h3> Explanation of columns </h3>
Only the important columns are explained. <br>
Ignore these columns: mean, std, slang_to_user_wc, user_percent, average_affinity_value <br>
half_load_affinity_average: average affinity score of the top affinity terms that are present in all the word embeddings

In [797]:
top_50_aff_df.head()

Unnamed: 0,mean,std,slang_to_user_wc,user_percent,average_affinity_value,forward_shift,backward_shift,net_forward_shift,net_backward_shift,aff_wc_in_models,half_load_affinity_average,loyalty,dedication,no._of_users,no._of_comments
motogp,0.288,1.020322,169,0.13079,0.106694,0.152,0.206,0.008,0.046,50,0.702799,0.076748,0.173479,2202,24799
italy,0.096,0.585478,86,0.019293,0.997124,0.514,0.544,-0.002,0.032,50,0.999672,0.227693,0.275121,4976,183884
Shave_Bazaar,0.34,1.276088,200,0.241993,0.050463,0.234,0.13,-0.098,-0.006,50,0.19697,0.05694,0.218505,1405,14866
poker,2.041,6.611302,462,0.238101,0.129063,0.402,0.404,-0.018,0.02,50,0.986001,0.135908,0.217452,8572,138814
ableton,0.416,1.65074,199,0.153392,0.034866,0.19,0.198,0.002,0.006,50,0.121612,0.044617,0.09292,2712,11703


<h3> Filter Datasets </h3>
The above dataframes can all be filtered to extract richer datasets. Below, we keep the more comment-heavy subreddits (as they are likely to contribute richer conversations) and remove subreddits that had no forward/backward shift due to configuration issues (recorded as 0). <br>

We use <i>top_50_aff_df</i> for the filtering example.

In [808]:
bytea = top_50_aff_df['forward_shift'] != 0 
byteb = top_50_aff_df['no._of_comments'] > 20000
bytec = top_50_aff_df['aff_wc_in_models'] == 50
# .copy() so that adding columns later does not trigger a SettingWithCopyWarning
filtered_top_50_aff_df = top_50_aff_df[bytea & byteb & bytec].copy()

Also for the filtered subreddits we add a new column, which contains the difference in net semantic shift between high affinity terms and neutral terms. 

In [814]:
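shift_cols = ['net_backward_shift', 'net_forward_shift']
diff_cols = ['net_backward_diff', 'net_forward_diff']
# Neutral (bottom 50) values for the same subreddits, matched on the index
neutral_shifts = bot_50_aff_df.loc[filtered_top_50_aff_df.index, shift_cols]
filtered_top_50_aff_df[diff_cols] = filtered_top_50_aff_df[shift_cols] - neutral_shifts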
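
The subtraction above relies on pandas matching rows by index label. A minimal sketch with made-up numbers (the toy frames `top` and `bottom` are hypothetical) shows that the difference is computed per subreddit even when the two frames list the subreddits in a different order:

```python
import pandas as pd

top = pd.DataFrame({"net_backward_shift": [0.046, 0.032]}, index=["motogp", "italy"])
bottom = pd.DataFrame({"net_backward_shift": [0.010, 0.020]}, index=["italy", "motogp"])

# Rows are matched on the index label, not on position.
diff = top["net_backward_shift"] - bottom.loc[top.index, "net_backward_shift"]
print(diff)
```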


<h3> Helper functions </h3>

The following functions sort the dataframe by a sorting index (one of the columns in the dataframe) and then work on consecutive groups of n rows: the first computes the Pearson coefficient within each group, and the second averages each group (for smoothing purposes).

More information: 
http://bestmaths.net/online/index.php/year-levels/year-12/year-12-topic-list/smoothing-techniques/

In [1015]:
from scipy.stats import pearsonr

def calculate_pearson_by_interval(df, sorting_index, y_index, n=10):
    """Pearson r between sorting_index and y_index within consecutive bins of n sorted rows."""
    df = df.sort_values(by=sorting_index)
    pearson_vals = []
    index_vals = []
    for lower_lim in range(0, df.shape[0], n):
        # Each bin covers rows [lower_lim, lower_lim + n)
        temp_df = df.iloc[lower_lim:lower_lim + n]
        if len(temp_df) < 2:
            continue  # pearsonr needs at least two points
        pearson_vals.append(pearsonr(temp_df[sorting_index], temp_df[y_index])[0])
        index_vals.append(np.mean(temp_df[sorting_index]))
    return pd.DataFrame(data=[index_vals, pearson_vals], index=[sorting_index, 'pearson_r']).T

In [370]:
def get_smoothened_intervaled_frames(temp, sorting_index, n=10, pearson=True, asc=False):
    """
    Purpose: 
    Sorts a copy of the dataframe by the column given in sorting_index (for instance,
    'loyalty' sorts the rows in order of loyalty), then smoothens the data by binning
    it in boxes of size n (default is 10): every n consecutive rows are averaged into
    one new row. The collection of smoothened rows is returned as a new dataframe.
    The final bin, which may hold fewer than n rows, is dropped.
    The pearson and asc arguments are currently unused.
    """
    temp = temp.sort_values(by=sorting_index)
    # Rows 0..n-1 get bin id 0, rows n..2n-1 get bin id 1, and so on
    bin_ids = np.arange(len(temp)) // n
    new_temp = temp.groupby(bin_ids).mean()
    return new_temp.iloc[:-1]
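
To make the binning concrete, here is a minimal sketch of the same idea on a toy dataframe, grouping consecutive row positions after sorting. The helper name `bin_average` and the toy values are hypothetical; unlike the function above, this sketch keeps the final bin.

```python
import numpy as np
import pandas as pd

def bin_average(df, sort_col, n=10):
    """Sort by sort_col, then average every n consecutive rows into one row."""
    ordered = df.sort_values(by=sort_col)
    bin_ids = np.arange(len(ordered)) // n  # 0,0,...,1,1,... one id per bin of n rows
    return ordered.groupby(bin_ids).mean()

toy = pd.DataFrame({
    "loyalty": [0.4, 0.1, 0.3, 0.2, 0.6, 0.5],
    "net_backward_shift": [4.0, 1.0, 3.0, 2.0, 6.0, 5.0],
})
smoothed = bin_average(toy, "loyalty", n=2)
print(smoothed)
```

With n=2, the six rows collapse into three: the two lowest-loyalty rows form the first point, the next two the second, and so on.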
    

The following helper functions help with plotting the dataframe data.
They produce an interactive plotly graph.

In [933]:
def data_to_plotly(x):
    """Flatten a column vector (e.g. the (n, 1) output of regr.predict) into a plain list."""
    return [row[0] for row in x]


def create_iplot_trends(df, x_index, y_indexes, 
                        markers={'reg': 'lines', 'index': 'markers'}, 
                        title='', 
                        secondary_y='', 
                        size_of_markers=3):
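    """Plot each column in y_indexes against x_index as markers, with a fitted regression line.

    For every y column, a linear regression on x_index is fitted and its
    predictions are added as a '<column>_reg' line trace.
    """
    temp_df = df.copy()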
    df_markers = {}
    keys = []
    for y_index in y_indexes:
        regr = linear_model.LinearRegression()
        x_val = np.reshape(list(df[x_index]), (-1, 1))
        y_val = np.reshape(list(df[y_index]), (-1, 1))
        df_markers[y_index] = markers['index']
        keys.append(y_index)
        regr.fit(x_val, y_val)
        
        y_reg = regr.predict(x_val)
        y_reg_index = y_index + '_reg'
        temp_df[y_reg_index] = data_to_plotly(y_reg)
        df_markers[y_reg_index] = markers['reg']
        keys.append(y_reg_index)
    temp_df_sorted = temp_df.set_index(x_index).sort_index()
    temp_df_sorted.iplot(keys=keys, 
                         mode=df_markers, 
                         xTitle=x_index, 
                         title=title, 
                         secondary_y=secondary_y, 
                         size=size_of_markers)
        

<h2> Numerical Analysis</h2>

In the following examples of numerical analysis, we run the analysis on smoothened datasets. In other words, each group of n consecutive points is averaged into one point (default n = 10). At times, all points are used instead of the smoothened averages, to illustrate whether the trend also exists in the non-smoothened data. 

For instance, loyalty and semantic shift do not have a clear relationship when the analysis is run on the raw data. The loss of information caused by smoothing may nevertheless lead one to believe that there is a trend.
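
This caution can be checked on synthetic data. The sketch below (made-up data, NumPy only) generates a weak linear trend buried in noise, then compares the Pearson coefficient on the raw points with the coefficient on bin averages. Averaging removes much of the noise in y, so the smoothed correlation comes out considerably higher than the raw one even though the underlying relationship is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bin_size = 5000, 50

x = np.sort(rng.uniform(0.0, 1.0, n_points))
y = 0.5 * x + rng.normal(0.0, 1.0, n_points)  # weak trend, strong noise

# Average every bin_size consecutive points (x is already sorted).
x_binned = x.reshape(-1, bin_size).mean(axis=1)
y_binned = y.reshape(-1, bin_size).mean(axis=1)

r_raw = np.corrcoef(x, y)[0, 1]
r_smoothed = np.corrcoef(x_binned, y_binned)[0, 1]
print("raw r:", round(r_raw, 3), "smoothed r:", round(r_smoothed, 3))
```

This is why the notebook plots both the raw and the smoothened versions of the same relationship.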

<b>Sorting Reason:</b> For each primary variable whose relationship is being compared against other features, we create a smoothened dataframe from the complete dataframe SORTED on the primary variable. This allows us to see if linear patterns are present.

<b>ALSO NOTE </b>: "Semantic Shift" and "Net Semantic Shift" are <b>not</b> the same.
Net semantic shift defines the amount of net shift words have had

<h3> Analysis of relationship of Loyalty with Existing Features </h3>
Explore the relationship between loyalty and semantic shift, net semantic shift, no._of_comments and no._of_users

In [806]:
# Smoothen the dataframe by averaging every n consecutive rows into one point,
# since the dataframe contains 1400+/2500+ points (default n = 10).

loyalty_smoothened_df = get_smoothened_intervaled_frames(top_50_aff_df, 'loyalty')

loyalty_smoothened_df_10 = get_smoothened_intervaled_frames(top_50_aff_df, 'loyalty', n=200)

In [1358]:
create_iplot_trends(df=top_50_aff_df, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [1357]:
create_iplot_trends(df=loyalty_smoothened_df, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='Loyalty Vs. Net Semantic Shift')

In [1529]:
# Plotting all 474 points. 
create_iplot_trends(df=filtered_top_50_aff_df, 
                       x_index='loyalty', 
                       y_indexes=['net_backward_diff'], 
                       title='Loyalty Vs. Net Backward Diff')

In [1533]:
create_iplot_trends(df=top_50_aff_df, 
                       x_index='loyalty', 
                       y_indexes=['half_load_affinity_average'], 
                       title='Loyalty Vs. Average Affinity Value of High Affinity Terms')

In [1535]:
copy_df = top_50_aff_df.copy()
copy_df['log_loyalty'] = np.log(copy_df['loyalty'])
create_iplot_trends(df=copy_df, 
                       x_index='log_loyalty', 
                       y_indexes=['half_load_affinity_average'], 
                       title='Log Loyalty Vs. Average Affinity Value of High Affinity Terms')

In [1538]:
# Plotting the smoothened version of the 474 filtered points. 
smoothened_filtered_top_50_aff_df = get_smoothened_intervaled_frames(filtered_top_50_aff_df, 'loyalty')
create_iplot_trends(df=smoothened_filtered_top_50_aff_df, 
                       x_index='loyalty', 
                       y_indexes=['half_load_affinity_average'], 
                       title='Loyalty Vs. Average Affinity Value of High Affinity Terms (Smoothened)')

In [1536]:
create_iplot_trends(df=filtered_top_50_aff_df, 
                       x_index='net_backward_shift', 
                       y_indexes=['net_backward_shift_neutral'], 
                       title='High Affinity Semantic Shift Vs. Neutral Semantic Shift')

In [1351]:

top_50_aff_df[['no._of_users', 'no._of_comments', 'loyalty']].set_index('loyalty').sort_index().iplot(subplots=True, subplot_titles=True, mode='markers', xTitle='loyalty', size=3)



<h3> Analysis of relationship of No. Of Comments with Existing Features </h3>
Explore the relationship between no. of comments and semantic shift, net semantic shift, and loyalty

In [482]:
# Smoothen the dataframe by averaging every n consecutive rows into one point,
# since the dataframe contains 1400+/2500+ points (default n = 10).

comments_smoothened_df = get_smoothened_intervaled_frames(top_50_aff_df, 'no._of_comments')

comments_smoothened_df_10 = get_smoothened_intervaled_frames(top_50_aff_df, 'no._of_comments', n=200)

In [1350]:
create_iplot_trends(df=top_50_aff_df, 
                       x_index='no._of_comments', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='No. of Comments Vs. Net Semantic Shift')


In [1349]:
create_iplot_trends(df=comments_smoothened_df, 
                       x_index='no._of_comments', 
                       y_indexes=['net_backward_shift', 'net_forward_shift'], 
                       title='No. of Comments Vs. Net Semantic Shift')


In [1539]:
create_iplot_trends(df=top_50_aff_df, 
                       x_index='no._of_comments', 
                       y_indexes=['backward_shift', 'forward_shift'], 
                       title='No. of Comments Vs. Semantic Value')

In [1346]:
create_iplot_trends(df=comments_smoothened_df, 
                       x_index='no._of_comments', 
                       y_indexes=['backward_shift', 'forward_shift'], 
                       title='No. of Comments Vs. Semantic Value')

In [1540]:
create_iplot_trends(df=top_50_aff_df, 
                       x_index='no._of_comments', 
                       y_indexes=['half_load_affinity_average'], 
                       title='No. of Comments Vs. Average Affinity Value of High Affinity Terms')

<h2> Statistical Analysis of Variables </h2>

We compute the Pearson coefficient and the linear regression r-squared of the variables with respect to: <br>
<ul> 
    <li> Half Load Affinity Average
    </li> 
    <li> Net Semantic Shift 
    </li>
    <li> Net Semantic Diff
    </li>
    <li> Something
    </li>
</ul>

For simplicity, we are using the filtered_top_50_aff_df because this dataframe contains the column net_backward_diff.

In [1343]:
from scipy.stats import pearsonr

In [1544]:
def calculate_regression_coefficients(df, x_index, y_index, log=False, decimal_places=3):
    if len(x_index) == 1:
        X = df[x_index].values.reshape(-1, 1)
    else:
        X = df[x_index].values
    
    if log:
        X = np.log(X)
    y = df[y_index].values.reshape(-1, 1)
    lm = linear_model.LinearRegression()
    model = lm.fit(X,y)

    r_squared = lm.score(X,y)
    if len(x_index) > 1:
        return {"r_squared": round(r_squared, decimal_places)}
    # pearsonr expects 1-D inputs, so flatten the (n, 1) column vectors
    pearson_r, p_value = pearsonr(X.ravel(), y.ravel())
    result_values = {
            "pearson_r": round(pearson_r, decimal_places),
            "p_value": round(p_value, decimal_places),
            "r_squared": round(r_squared, decimal_places)
        }
    return result_values
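
For a single predictor, the r_squared of an ordinary least-squares fit equals the square of the Pearson coefficient, which is why the two values in the outputs below move together (for example, 0.56 squared is about 0.314). A NumPy-only sketch on made-up data, using `np.polyfit` in place of `LinearRegression`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Least-squares line and its coefficient of determination.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / y.var()

pearson_r = np.corrcoef(x, y)[0, 1]
print(round(r_squared, 6), round(pearson_r ** 2, 6))  # the two values agree
```

With several predictors there is no single Pearson coefficient, which is why the function above returns only r_squared in that case.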

In [1561]:
def conduct_statistical_calculations(df, x_indexes, y_indexes, log_val=0):
    if log_val == 0:
        log_val = [False]
    else:
        log_val = [True, False]
    for log in log_val:
        print("\nLog Value is: ", log)
        for y_index in y_indexes:
            print(y_index)
            for x_index in x_indexes:
                if type(x_index) == list:
                    print('\t', x_index, calculate_regression_coefficients(df, x_index, y_index, log))
                else:
                    print('\t', x_index, calculate_regression_coefficients(df, [x_index], y_index, log))

There is a difference between half_load_affinity_average and average_affinity_value in the dataframe columns.
half_load_affinity_average is the average affinity score of those terms among the top 100 affinity terms that are present in all the word embeddings. <br>
Ignore average_affinity_value for now. <br>

The log value (True or False) indicates whether a logarithm is applied to the x values. Applying the logarithm spreads out the datapoints clustered towards the lower tail of the x values and gives a stronger linear relationship. This is demonstrated by the improvement in the Pearson coefficient between loyalty and the other Y variables when the log is applied, compared to when it is not. 
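
A small synthetic check of this effect (made-up data, NumPy only): x is drawn from a right-skewed distribution and y is constructed to depend on log(x). That dependence is an assumption of this sketch, not a property established for the subreddit data.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)  # most points cluster at small x
y = np.log(x) + rng.normal(0.0, 1.0, size=2000)

r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(np.log(x), y)[0, 1]
print("r with x:", round(r_raw, 3), "r with log(x):", round(r_log, 3))
```

The log only helps when the relationship is closer to linear in log(x) than in x, so it is worth comparing both settings, as `conduct_statistical_calculations` does when log_val is non-zero.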

In [1555]:
y_indexes = ['half_load_affinity_average', 'net_backward_shift', 'net_backward_shift_neutral', 'backward_shift', 'net_backward_diff']
x_indexes = ['loyalty', 'dedication', 'no._of_comments', 'no._of_users']
log_val = 1
conduct_statistical_calculations(filtered_top_50_aff_df, x_indexes, y_indexes, log_val)


Log Value is:  True
half_load_affinity_average
	 loyalty {'pearson_r': 0.56, 'r_squared': 0.314, 'p_value': 0.0}
	 dedication {'pearson_r': 0.454, 'r_squared': 0.206, 'p_value': 0.0}
	 no._of_comments {'pearson_r': 0.197, 'r_squared': 0.039, 'p_value': 0.0}
	 no._of_users {'pearson_r': -0.137, 'r_squared': 0.019, 'p_value': 0.0}
net_backward_shift
	 loyalty {'pearson_r': 0.233, 'r_squared': 0.054, 'p_value': 0.0}
	 dedication {'pearson_r': 0.2, 'r_squared': 0.04, 'p_value': 0.0}
	 no._of_comments {'pearson_r': 0.15, 'r_squared': 0.023, 'p_value': 0.0}
	 no._of_users {'pearson_r': -0.015, 'r_squared': 0.0, 'p_value': 0.528}
net_backward_shift_neutral
	 loyalty {'pearson_r': 0.125, 'r_squared': 0.016, 'p_value': 0.0}
	 dedication {'pearson_r': 0.144, 'r_squared': 0.021, 'p_value': 0.0}
	 no._of_comments {'pearson_r': 0.165, 'r_squared': 0.027, 'p_value': 0.0}
	 no._of_users {'pearson_r': 0.042, 'r_squared': 0.002, 'p_value': 0.074}
backward_shift
	 loyalty {'pearson_r': 0.22, 'r_squared

The Pearson coefficients between the X variables and the Y variables are all relatively low; however, one should note that <b>loyalty</b> has a far stronger correlation than the other three variables. <br>
For instance, loyalty has a 0.21 correlation to net_backward_shift, whereas number of comments and number of users mark less than 1. Furthermore, loyalty has a tremendously strong relationship to 

These values also suggest that there <b>is</b> a statistical difference between loyalty and dedication when it comes to influencing the evolution of language.

Number of comments has a strong relationship to semantic shift (as seen in backward_shift), with a Pearson correlation of 0.5+.

In [1564]:
y_indexes = ['half_load_affinity_average', 'net_backward_shift', 'backward_shift']
x_indexes = ['loyalty', 'dedication', 'no._of_comments', 'no._of_users']
log_val = 1

conduct_statistical_calculations(top_50_aff_df, x_indexes, y_indexes, log_val)



Log Value is:  True
half_load_affinity_average
	 loyalty {'pearson_r': 0.518, 'r_squared': 0.269, 'p_value': 0.0}
	 dedication {'pearson_r': 0.444, 'r_squared': 0.197, 'p_value': 0.0}
	 no._of_comments {'pearson_r': 0.326, 'r_squared': 0.107, 'p_value': 0.0}
	 no._of_users {'pearson_r': 0.018, 'r_squared': 0.0, 'p_value': 0.346}
net_backward_shift
	 loyalty {'pearson_r': 0.177, 'r_squared': 0.031, 'p_value': 0.0}
	 dedication {'pearson_r': 0.167, 'r_squared': 0.028, 'p_value': 0.0}
	 no._of_comments {'pearson_r': 0.181, 'r_squared': 0.033, 'p_value': 0.0}
	 no._of_users {'pearson_r': 0.042, 'r_squared': 0.002, 'p_value': 0.031}
backward_shift
	 loyalty {'pearson_r': 0.165, 'r_squared': 0.027, 'p_value': 0.0}
	 dedication {'pearson_r': 0.065, 'r_squared': 0.004, 'p_value': 0.001}
	 no._of_comments {'pearson_r': 0.523, 'r_squared': 0.274, 'p_value': 0.0}
	 no._of_users {'pearson_r': 0.425, 'r_squared': 0.18, 'p_value': 0.0}

Log Value is:  False
half_load_affinity_average
	 loyalty {'pe

In [1557]:
y_indexes = ['backward_shift', 'net_backward_diff']
x_indexes = ['net_backward_shift', 'net_backward_shift_neutral']
log_val = 0

conduct_statistical_calculations(filtered_top_50_aff_df, x_indexes, y_indexes, log_val)


Log Value is:  False
backward_shift
	 net_backward_shift {'pearson_r': 0.461, 'r_squared': 0.213, 'p_value': 0.0}
	 net_backward_shift_neutral {'pearson_r': 0.308, 'r_squared': 0.095, 'p_value': 0.0}
net_backward_diff
	 net_backward_shift {'pearson_r': 0.885, 'r_squared': 0.783, 'p_value': 0.0}
	 net_backward_shift_neutral {'pearson_r': 0.211, 'r_squared': 0.044, 'p_value': 0.0}


<h3> Statistical Analysis of Language Barrier to Entry with Subreddits </h3>

User Rate is a metric that measures barriers to entry to a subreddit. This is measured by summing the total rate of new user involvement in a subreddit, for all the intervals. For instance

One major cause of slow growth in a subreddit is the inability of other users to relate to it, or to be coherent with its underlying culture. Hence we measure the barriers to entry as, over each interval, the number of <i>new</i> users that visit a subreddit that visit the 

In [1563]:
y_indexes = ["user_rate"]
x_indexes = ["loyalty", "dedication", "half_load_affinity_average", "no._of_comments", "no._of_users"]
log_val = 0

conduct_statistical_calculations(user_rate_df, x_indexes, y_indexes, log_val)


Log Value is:  False
user_rate
	 loyalty {'pearson_r': -0.583, 'r_squared': 0.34, 'p_value': 0.0}
	 dedication {'pearson_r': -0.707, 'r_squared': 0.5, 'p_value': 0.0}
	 half_load_affinity_average {'pearson_r': -0.443, 'r_squared': 0.196, 'p_value': 0.0}
	 no._of_comments {'pearson_r': -0.161, 'r_squared': 0.026, 'p_value': 0.0}
	 no._of_users {'pearson_r': 0.056, 'r_squared': 0.003, 'p_value': 0.004}


Now that we observe that loyalty, dedication and half_load_affinity_average have relatively high correlations with user_rate, we combine pairs of these variables and calculate a <b>multivariate</b> regression.

What we see is that when loyalty and dedication are paired, there is very little increase in predictive power (r_squared of 0.503, against 0.5 for dedication alone). However, when half_load_affinity_average is combined with dedication or loyalty, r_squared increases by about 0.04 (from 0.5 to 0.539 with dedication, and from 0.34 to 0.377 with loyalty). 

In [1566]:
y_indexes = ["user_rate"]
x_indexes = [["loyalty", "dedication"], 
             ["dedication", "half_load_affinity_average"], 
             ["half_load_affinity_average", "loyalty"]
            ]
log_val = 0

conduct_statistical_calculations(user_rate_df, x_indexes, y_indexes, log_val)


Log Value is:  False
user_rate
	 ['loyalty', 'dedication'] {'r_squared': 0.503}
	 ['dedication', 'half_load_affinity_average'] {'r_squared': 0.539}
	 ['half_load_affinity_average', 'loyalty'] {'r_squared': 0.377}
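
The pattern above, where adding a second predictor that is strongly correlated with the first barely changes r_squared, can be reproduced on synthetic data. The sketch below uses `np.linalg.lstsq` instead of `LinearRegression`, and all variable names and numbers are made up for illustration.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(3)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)   # nearly a copy of a
c = rng.normal(size=1000)             # independent of a
y = a + 0.5 * c + rng.normal(size=1000)

r_a = r_squared(a, y)
r_ab = r_squared(np.column_stack([a, b]), y)
r_ac = r_squared(np.column_stack([a, c]), y)
print("a only:", round(r_a, 3))
print("a + b :", round(r_ab, 3))
print("a + c :", round(r_ac, 3))
```

In this toy setting, `b` carries almost no information beyond `a`, so the gain from adding it is negligible, while the independent predictor `c` adds a visible amount.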


In [1550]:
create_iplot_trends(user_rate_df, 'loyalty', ['user_rate'])

In [1551]:
create_iplot_trends(user_rate_df, 'dedication', ['user_rate'])

In [1552]:
create_iplot_trends(user_rate_df, 'half_load_affinity_average', ['user_rate'])

The correlation between net_backward_shift and net_backward_diff, the difference between the semantic shift of high affinity terms and that of low affinity terms, is strikingly high (a Pearson coefficient of 0.885).

In [1341]:
from scipy.stats import ttest_ind

t_test_aff_terms = ttest_ind(top_50_aff_df['net_backward_shift'], top_100_aff_df[ 'net_backward_shift'])
t_test_aff_vs_neutral = ttest_ind(top_50_aff_df['net_backward_shift'], bot_50_aff_df[ 'net_backward_shift'])
print("Printing the P-values of the t-tests: ")
print(t_test_aff_terms[1], t_test_aff_vs_neutral[1])

Printing the P-values of the t-tests: 
0.34767647822593806 3.6857045218584887e-85


In [651]:
%%time
from time import time
start = time()
ss3 = original_subreddit_metrics.generate_subreddit_metrics_df(subnames, n=1000, j=50, load_style="half_load")
end = time()


Mean of empty slice.


invalid value encountered in double_scalars



CPU times: user 47.3 s, sys: 24.6 s, total: 1min 11s
Wall time: 7min 11s
