# Finding sentiment scores for every company

In this notebook, I will find 10 sentiment scores for each company: one for each combination of the 5 Glassdoor categories (Culture & Values, Work/Life Balance, Senior Management, Comp & Benefits, Career Opportunities) and PRO/CON. For each company, these scores measure how its employees feel about it on average. As this notebook is a bit long (and important), I'll start by outlining its contents.

## Outline


<b> Transforming and training models on hand labeled reviews </b>

I will start by working with my collection of about 4,000 hand labeled PRO and CON sentences. I will train two Word2Vec models on these sentences, one for PRO and one for CON, and then do some feature engineering by adding binary variables. I will then train 10 Logistic Regression models, one for each category-PRO/CON combination. For each PRO sentence, the 5 PRO models predict whether the sentence belongs to each category; the 5 CON models do the same for each CON sentence.

<b> Transforming and predicting labels of all reviews </b>

Next, I'll work with the collection of Glassdoor reviews. After some filtering, I'll end up analyzing a collection of ~1.2 million Glassdoor reviews and ~5 million sentences. I'll apply the previously trained Word2Vec models to each of the Glassdoor sentences and do the same feature engineering. I'll then apply the already trained Logistic Regression models to all of the sentences.

<b> Find companies' average sentiments </b>

I'll conclude by performing VADER sentiment analysis on each sentence, and then find the average sentiment scores grouped by company.

As a last note, I think precision matters more than recall, i.e. having a lot of false positives is worse than having a lot of false negatives. I want the reviews assigned to each category to represent how that company behaves in that category. Missing a few reviews for a category is better than counting reviews that don't belong there. Still, I want both high precision and high recall if possible.
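To make the precision/recall trade-off concrete, here is a minimal sketch that computes both from scratch. The labels are made up for illustration (they are not taken from my dataset), and `precision_recall` is a hypothetical helper rather than something used later in the notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # precision: of the sentences I put in the category, how many belong there
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: of the sentences that belong in the category, how many I found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# toy labels, invented for this example
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here one false positive costs precision, while the two missed sentences only cost recall, which is the kind of error I am more willing to accept.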

In [1]:
import pandas as pd
pd.set_option('display.max_columns',200)
import numpy as np
from numpy.testing import assert_array_equal

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#set default seaborn plotting style
sns.set_style('white')

import time

import gensim
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

## Transforming and training models on hand labeled reviews

### Import hand labeled sentences

I will import my hand labeled collection of 1993 sentences from PROs reviews and 2000 sentences from CONs reviews.

In [2]:
pros_sentences = pd.read_csv('pros_sentences_labeled.csv', index_col='Unnamed: 0')

cons_sentences = pd.read_csv('cons_sentences_labeled.csv', index_col='Unnamed: 0')

#5 categories for PROs and CONs to consider
pros_categories = ['Culture & Values', 'Work/Life Balance', 'Senior Management',
       'Comp & Benefits', 'Career Opportunities']

cons_categories = ['Culture & Values', 'Work/Life Balance', 'Senior Management',
       'Comp & Benefits', 'Career Opportunities']


### Word tokenize sentences

In [3]:
tokenizer = RegexpTokenizer(r'\w+')

stop_words = set(stopwords.words('english'))

In [7]:
def tokenize_a_sentence_worklife(sentence, tokenizer_input):
    '''
    Remove stopwords, make lowercase and tokenize. 
    Also replace variants of "Work/Life" with "worklife".
    
    Args:
        sentence: sentence to edit
        tokenizer_input: word tokenizer to apply to sentence
        
    Returns:
        list of word tokens. For example:
        
            "The Work/Life is incredible." --> ["worklife", "incredible"]      
    '''
    
    sentence = sentence.lower()
    
    #replace variants of worklife with worklife
    for variant in ['work-life', 'work/life', 'work/life.', 'work life']:
        sentence = sentence.replace(variant, 'worklife')
        
    sentence = ' '.join([word for word in sentence.split() if word not in stop_words])
    
    return tokenizer_input.tokenize(sentence)
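As a quick sanity check of what this function does, here is a dependency-free sketch of the same steps. It is only an approximation: `toy_stop_words` is a tiny stand-in I made up for NLTK's English stop-word list, and `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`.

```python
import re

# tiny stand-in for NLTK's English stop-word list (assumption for illustration)
toy_stop_words = {'the', 'is', 'a', 'and'}

def tokenize_sketch(sentence, stop_words=toy_stop_words):
    """Lowercase, normalize work/life variants, drop stop words, tokenize."""
    sentence = sentence.lower()
    for variant in ['work-life', 'work/life', 'work life']:
        sentence = sentence.replace(variant, 'worklife')
    # stop words are removed before punctuation is stripped
    kept = ' '.join(word for word in sentence.split() if word not in stop_words)
    return re.findall(r'\w+', kept)

tokens = tokenize_sketch('The Work/Life is incredible.')
print(tokens)
```

Note that "the" and "is" are dropped as stop words, so only the content words survive.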

In [9]:
#tokenize labeled PROs and CONs sentences
pros_sentences.loc[:,'tokens'] = pros_sentences.loc[:,'PROs_sentence']\
    .apply(lambda sentence: tokenize_a_sentence_worklife(sentence, tokenizer))
cons_sentences.loc[:,'tokens'] = cons_sentences.loc[:,'CONs_sentence']\
    .apply(lambda sentence: tokenize_a_sentence_worklife(sentence, tokenizer))

### Transform sentences with Word2Vec

In [11]:
def w2v_sentences(sentences_series, dim):
    '''
    Word2Vec vectorize series of lists of word tokens.
    
    Args:
        sentences_series: series of lists of tokens
        dim: dimension for Word2Vec transformed words
    
    Returns:
        DataFrame: each row is Word2Vec transformed sentence
    '''
    
    #train Word2Vec model on lists of tokens
    #(gensim 3.x API; in gensim >= 4.0 the `size` argument is named `vector_size`)
    model = gensim.models.Word2Vec(list(sentences_series), size=dim)
    
    #organize transformed sentences in numpy array
    sentences_vectors_array = np.zeros((sentences_series.shape[0], dim))

    for idx in range(sentences_series.shape[0]):
        #average the vectors of in-vocabulary tokens; all-zero vector if none are known
        sentences_vectors_array[idx] = \
            np.mean([model.wv[w] for w in sentences_series.iloc[idx] if w in model.wv] or [np.zeros(dim)], axis=0)
            
    return pd.DataFrame(sentences_vectors_array)
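The sentence vector is just the average of the vectors of the words the model knows, with an all-zero vector as a fallback when none of the tokens are in the vocabulary. Here is a minimal pure-Python sketch of that averaging step. The 2-dimensional `toy_vectors` are made up for illustration and `mean_vector` is a hypothetical helper, not part of gensim.

```python
def mean_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[w] for w in tokens if w in word_vectors]
    if not known:
        # same role as `or [np.zeros(dim)]` above: an empty list is falsy
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# made-up 2-dimensional "embeddings" for illustration
toy_vectors = {'pay': [1.0, 0.0], 'great': [0.0, 1.0]}

avg = mean_vector(['pay', 'great', 'unseenword'], toy_vectors, dim=2)
empty = mean_vector(['unseenword'], toy_vectors, dim=2)
print(avg, empty)
```

Unknown words are simply skipped, so a sentence made entirely of rare words ends up at the origin.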

In [12]:
#transform labeled PROs and CONs sentences
pros_sentences_vectors_df = w2v_sentences(pros_sentences.loc[:,'tokens'], 100)
cons_sentences_vectors_df = w2v_sentences(cons_sentences.loc[:,'tokens'], 100)



In [13]:
#first few rows of transformed vectors
pros_sentences_vectors_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.028717,0.018607,0.042368,0.029196,0.004087,0.042658,0.033049,-0.024671,0.012006,0.013547,...,0.017426,0.012471,0.018717,-0.029935,-0.079078,-0.011828,-0.077091,-0.056645,-0.025024,0.027811
1,0.028335,0.01788,0.039535,0.029969,0.005222,0.045316,0.032858,-0.02212,0.013805,0.009161,...,0.017021,0.011434,0.020428,-0.029728,-0.078625,-0.015569,-0.077349,-0.05897,-0.023497,0.027329
2,0.0287,0.015215,0.039859,0.026309,0.00137,0.039939,0.027561,-0.021461,0.011045,0.005851,...,0.013301,0.01473,0.016651,-0.02751,-0.074676,-0.018004,-0.074316,-0.055614,-0.022512,0.024521
3,0.016954,0.006026,0.021865,0.01571,0.001347,0.020781,0.018039,-0.00937,0.001633,0.006057,...,0.009844,0.003911,0.008421,-0.012273,-0.040665,-0.007833,-0.036294,-0.029661,-0.007776,0.012293
4,0.024748,0.015159,0.036285,0.029135,0.003534,0.036435,0.024946,-0.023188,0.009417,0.008782,...,0.016462,0.011729,0.017616,-0.022991,-0.068454,-0.011889,-0.067225,-0.050462,-0.017924,0.024987


### Feature engineering

I will add extra binary variables on top of my Word2Vec features. These binary variables indicate whether certain words (ones that commonly co-occur with certain categories) are present in each sentence. For example, one binary variable might ask if "pay" or any of its variants are present in the sentence.

I originally hoped to use a Porter Stemmer to handle the variants of different words, but it was too slow to work on my dataset.
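Here is a small sketch of the idea, assuming each line of the common-words file lists a word followed by its variants separated by `/`. The two lines in `toy_lines` are invented for illustration, and both helper functions are hypothetical names.

```python
def parse_common_words(lines):
    """Map each common word (first entry on a line) to the set of its variants."""
    common = {}
    for line in lines:
        variants = line.strip().split('/')
        common[variants[0]] = set(variants)
    return common

def keyword_flags(tokens, common):
    """Return 1 for each common word with at least one variant among the tokens."""
    token_set = set(tokens)
    return {key: int(bool(token_set & variants)) for key, variants in common.items()}

# invented example lines in the assumed "word/variant/variant" format
toy_lines = ['pay/paid/pays\n', 'manager/managers/management\n']
toy_common = parse_common_words(toy_lines)

flags = keyword_flags(['paid', 'well', 'friendly', 'people'], toy_common)
print(flags)
```

Because the variants are listed explicitly, the lookup is a cheap set intersection rather than a stemming pass over every token.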

In [19]:
#store common words with variants as dictionary

#key: common word
#values: set of variants
common_words_dict = {}

with open('common_words.txt','r') as fin:
    for common_words in fin:
        common_words_list = common_words.strip().split('/')
        common_words_dict[common_words_list[0]] = set(common_words_list)

In [20]:
#check if each sentence contains each of the common words
for common in common_words_dict:
    pros_sentences_vectors_df.loc[:,common] = \
        pros_sentences.loc[:,'tokens'].apply(lambda tokens: 1 
                                             if bool(set(tokens).intersection(common_words_dict[common])) 
                                             else 0) 
    cons_sentences_vectors_df.loc[:,common] = \
        cons_sentences.loc[:,'tokens'].apply(lambda tokens: 1 
                                             if bool(set(tokens).intersection(common_words_dict[common]))
                                             else 0) 
    

In [157]:
#DataFrame containing Word2Vec features then binary variables
pros_sentences_vectors_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,nice,environment,people,respect,supportive,culture,value,diverse,non-profit,energetic,loyal,large,waste,consumer,expectation,competition,fear,vacation,sick,balance,hours,load,flexible,location,travel,breaks,schedule,college,stress,overtime,evening,maternity,hard,goal,manager,upper,supervisor,lead,ceo,issue,organization,incompetent,communicate,pay,benefits,compensation,bonus,discount,package,401k,medical,free,facilities,retirement,competitive,stock,commission,advance,mobility,experience,salaries,wage,paycheck,perk,pto,low,raise,condition,rate,opportunity,promote,outsource,nepotism,less,talent,career,training,professional,development,potential,intern,skill,grow,resume,entry,transfer,network,connect,lack,move,lay,replace
0,0.028717,0.018607,0.042368,0.029196,0.004087,0.042658,0.033049,-0.024671,0.012006,0.013547,0.072913,-0.063807,0.039665,0.091402,0.022818,0.027448,0.008403,-0.00406,0.037416,-0.038077,0.041624,-0.028316,-0.021265,0.038053,0.035647,0.042492,0.127383,-0.041069,-0.001068,0.027081,0.002828,0.007056,-0.074864,0.058468,-0.045809,0.020147,-0.016577,0.019034,0.038301,-0.029702,0.020276,0.029405,-0.021028,-0.010516,0.004749,-0.020345,0.026555,-0.031175,-0.030817,-0.026716,0.023185,0.03664,0.038079,-0.042803,-0.014746,0.015267,0.011067,-0.027206,0.019749,0.040141,0.026534,-0.004126,0.046764,0.045557,0.040063,0.036996,-0.000259,0.048781,0.018582,0.04561,0.046938,0.023005,-0.006763,0.00597,-0.049633,0.047543,0.015793,-0.041671,-0.050694,0.075461,-0.012935,0.039903,0.032736,-0.028358,-0.065732,-0.072535,-0.025208,0.059317,0.035036,-0.04912,0.017426,0.012471,0.018717,-0.029935,-0.079078,-0.011828,-0.077091,-0.056645,-0.025024,0.027811,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0.028335,0.01788,0.039535,0.029969,0.005222,0.045316,0.032858,-0.02212,0.013805,0.009161,0.070963,-0.06463,0.039942,0.090488,0.024804,0.026959,0.010269,-0.001861,0.038128,-0.035108,0.041954,-0.029353,-0.023968,0.040909,0.034029,0.042276,0.129878,-0.040325,0.001975,0.025173,0.004855,0.007882,-0.07612,0.057087,-0.045958,0.017575,-0.017178,0.013714,0.03974,-0.029066,0.022165,0.030521,-0.024599,-0.010596,0.004802,-0.019069,0.025735,-0.02923,-0.031903,-0.030275,0.021598,0.034302,0.036423,-0.04078,-0.016031,0.01738,0.010462,-0.027143,0.017036,0.037587,0.025418,-0.002916,0.046743,0.045283,0.039431,0.036283,0.002424,0.046983,0.014852,0.04431,0.050914,0.023112,-0.004795,0.001741,-0.049687,0.048474,0.014419,-0.043254,-0.050149,0.074736,-0.012452,0.039784,0.033233,-0.02943,-0.068892,-0.073431,-0.020871,0.060252,0.033497,-0.047351,0.017021,0.011434,0.020428,-0.029728,-0.078625,-0.015569,-0.077349,-0.05897,-0.023497,0.027329,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0.0287,0.015215,0.039859,0.026309,0.00137,0.039939,0.027561,-0.021461,0.011045,0.005851,0.069384,-0.059044,0.039678,0.08724,0.020474,0.02374,0.012681,-0.008434,0.034966,-0.038007,0.043472,-0.021618,-0.020986,0.037517,0.03052,0.04474,0.116759,-0.038289,-0.005609,0.022835,-3e-05,0.009447,-0.071845,0.048912,-0.043033,0.015212,-0.020794,0.015586,0.039882,-0.0238,0.01792,0.025334,-0.017881,-0.007178,0.000703,-0.023421,0.025046,-0.023391,-0.031483,-0.029446,0.019224,0.03331,0.029601,-0.038058,-0.015826,0.016886,0.010153,-0.023521,0.020811,0.031253,0.022894,-0.006458,0.043191,0.043751,0.036104,0.033925,0.000986,0.04655,0.015903,0.042235,0.047757,0.01979,-0.005161,0.001456,-0.045978,0.042398,0.014814,-0.039177,-0.045607,0.073608,-0.014691,0.032049,0.030255,-0.025443,-0.060622,-0.067435,-0.023467,0.058085,0.027908,-0.047496,0.013301,0.01473,0.016651,-0.02751,-0.074676,-0.018004,-0.074316,-0.055614,-0.022512,0.024521,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0.016954,0.006026,0.021865,0.01571,0.001347,0.020781,0.018039,-0.00937,0.001633,0.006057,0.035409,-0.029367,0.018824,0.03745,0.01492,0.011587,0.001555,-0.006172,0.0141,-0.016756,0.015482,-0.012966,-0.013162,0.016427,0.01476,0.023342,0.060445,-0.020232,0.004041,0.014958,-0.000164,0.004863,-0.03737,0.022818,-0.019286,0.005228,-0.009311,0.007732,0.021579,-0.015392,0.009398,0.013238,-0.011145,-0.005105,-0.000168,-0.012872,0.013279,-0.015754,-0.015796,-0.012478,0.013863,0.015401,0.017772,-0.020888,-0.005722,0.008791,0.009091,-0.016417,0.008993,0.014017,0.013652,-0.003022,0.025065,0.023248,0.018947,0.016261,0.000776,0.021314,0.009399,0.023046,0.025151,0.010214,-0.001453,0.000365,-0.021134,0.022999,0.007881,-0.018056,-0.02117,0.035159,-0.008689,0.017353,0.017648,-0.010039,-0.027538,-0.033321,-0.010145,0.028266,0.017991,-0.024119,0.009844,0.003911,0.008421,-0.012273,-0.040665,-0.007833,-0.036294,-0.029661,-0.007776,0.012293,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0.024748,0.015159,0.036285,0.029135,0.003534,0.036435,0.024946,-0.023188,0.009417,0.008782,0.060429,-0.056196,0.03293,0.076855,0.019188,0.023356,0.00627,-0.001993,0.034387,-0.031912,0.036459,-0.021899,-0.016811,0.033423,0.026205,0.039099,0.109787,-0.029746,-0.00028,0.020474,0.003363,0.005032,-0.065364,0.047949,-0.040684,0.013714,-0.016564,0.013792,0.032043,-0.025439,0.018015,0.023436,-0.018329,-0.009262,0.00489,-0.018455,0.02101,-0.027384,-0.023024,-0.024082,0.017292,0.031369,0.028912,-0.035157,-0.013815,0.009731,0.00647,-0.025618,0.014249,0.031164,0.021004,-0.002158,0.042599,0.037042,0.032358,0.031303,-3e-06,0.044516,0.013126,0.038909,0.043617,0.021848,-0.003564,0.003113,-0.038511,0.041935,0.011694,-0.036698,-0.041817,0.064226,-0.009532,0.033145,0.028253,-0.024996,-0.055601,-0.060457,-0.02172,0.052203,0.028655,-0.040371,0.016462,0.011729,0.017616,-0.022991,-0.068454,-0.011889,-0.067225,-0.050462,-0.017924,0.024987,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Logistic Regression on hand labeled PROs sentences

For each category, I standard scale the features and train a Logistic Regression model on a stratified 75% training split of the hand labeled PROs sentences, holding out the remaining 25% for testing. I use Grid Search cross-validation to find the best value of `C` for each model.

In [101]:
lrp = Pipeline([('ss', StandardScaler()),
               ('lr', LogisticRegression(random_state=23))])

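c_vals = np.logspace(-3,3,13) #10^-3 ~3*10^-3 10^-2 ... 10^3

#cross-validation splitter passed to GridSearchCV below as `skf`
#NOTE: its definition is not shown in this notebook; 5 stratified folds is an assumption
skf = StratifiedKFold(n_splits=5)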
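For reference, `np.logspace(-3, 3, 13)` gives 13 values whose exponents go from -3 to 3 in steps of 0.5. The sketch below rebuilds the same grid in plain Python so the "Best C param" values printed in the output are easier to recognize. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
import math

# exponents -3.0, -2.5, ..., 3.0 (13 values), mirroring np.logspace(-3, 3, 13)
c_grid = [10 ** (-3 + 0.5 * i) for i in range(13)]

for c in c_grid:
    print('{:.6g}'.format(c))
```

The values 0.00316... and 0.0316... that show up as best parameters are 10^-2.5 and 10^-1.5, i.e. the half-step points of the grid.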

In [102]:
#keys: category
#values: trained Logistic Regression pipeline with optimized C value
models_pros = {}


for cat in pros_categories:
    models_pros[cat] = Pipeline([('ss', StandardScaler()),
               ('lr', LogisticRegression(random_state=23))])

for cat in pros_categories:
     
    start_time = time.time()

    print('-'*50)
    print(cat)
    
    #train/test split keeping 25% for testing
    x_train, x_test, y_train, y_test = \
    train_test_split(pros_sentences_vectors_df,
                     pros_sentences.loc[:,cat],
                     test_size=0.25,
                     stratify=pros_sentences.loc[:,cat],
                     random_state=23)
    
    #Grid Search on C value with highest accuracy
    params = dict(lr__C=c_vals, lr__penalty=['l2'])

    gse = GridSearchCV(estimator=lrp, param_grid=params, cv=skf)

    gse.fit(x_train, y_train)


    print('Best C param = {}'.format(gse.best_estimator_.get_params()['lr__C']))
    print('Best penalty = {}'.format(gse.best_estimator_.get_params()['lr__penalty']))
    print('Best CV score = {}'.format(gse.best_score_))

    #train Logistic Regression with optimal C value
    models_pros[cat].set_params(lr__C = gse.best_estimator_.get_params()['lr__C'], 
                   lr__penalty = gse.best_estimator_.get_params()['lr__penalty'])

    models_pros[cat].fit(x_train, y_train)
    y_pred = models_pros[cat].predict(x_test)
    
    print(classification_report(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))

    print('Took {} seconds.'.format(time.time()-start_time))
    
    start_time = time.time()

--------------------------------------------------
Culture & Values
Best C param = 0.0031622776601683794
Best penalty = l2
Best CV score = 0.7831325301204819
             precision    recall  f1-score   support

          0       0.80      0.89      0.84       317
          1       0.76      0.62      0.68       182

avg / total       0.79      0.79      0.78       499

0.7895791583166333
Took 13.041311979293823 seconds.
--------------------------------------------------
Work/Life Balance
Best C param = 0.001
Best penalty = l2
Best CV score = 0.9457831325301205
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       434
          1       0.88      0.75      0.81        65

avg / total       0.95      0.95      0.95       499

0.9539078156312625
Took 15.361080884933472 seconds.
--------------------------------------------------
Senior Management
Best C param = 0.03162277660168379
Best penalty = l2
Best CV score = 0.964524765729585
         

### Logistic Regression on hand labeled CONs sentences

I now do the same for the CONs sentences, training a Logistic Regression model with standard scaling preprocessing for each category.

In [103]:
#keys: category
#values: trained Logistic Regression pipeline with optimized C value
models_cons = {}

for cat in cons_categories:
    models_cons[cat] = Pipeline([('ss', StandardScaler()),
               ('lr', LogisticRegression(random_state=23))])

for cat in cons_categories:
     
    start_time = time.time()

    print('-'*50)
    print(cat)
    
    #train/test split keeping 25% for testing
    x_train, x_test, y_train, y_test = \
    train_test_split(cons_sentences_vectors_df,
                     cons_sentences.loc[:,cat],
                     test_size=0.25,
                     stratify=cons_sentences.loc[:,cat],
                     random_state=23)
    
    #Grid Search on C value with highest accuracy
    params = dict(lr__C=c_vals, lr__penalty=['l2'])

    gse = GridSearchCV(estimator=lrp, param_grid=params, cv=skf)

    gse.fit(x_train, y_train)


    print('Best C param = {}'.format(gse.best_estimator_.get_params()['lr__C']))
    print('Best penalty = {}'.format(gse.best_estimator_.get_params()['lr__penalty']))

    print('Best CV score = {}'.format(gse.best_score_))

    
    #train Logistic Regression with optimal C value
    models_cons[cat].set_params(lr__C = gse.best_estimator_.get_params()['lr__C'], 
                   lr__penalty = gse.best_estimator_.get_params()['lr__penalty'])

    models_cons[cat].fit(x_train, y_train)
    y_pred = models_cons[cat].predict(x_test)
    
    print(classification_report(y_test, y_pred))

    print('Accuracy: {}'.format(accuracy_score(y_test,y_pred)))
    print('Took {} seconds.'.format(time.time()-start_time))
    
    start_time = time.time()

--------------------------------------------------
Culture & Values
Best C param = 0.03162277660168379
Best penalty = l2
Best CV score = 0.7026666666666667
             precision    recall  f1-score   support

          0       0.74      0.86      0.79       328
          1       0.61      0.42      0.50       172

avg / total       0.69      0.71      0.69       500

Accuracy: 0.708
Took 12.454102039337158 seconds.
--------------------------------------------------
Work/Life Balance
Best C param = 0.0031622776601683794
Best penalty = l2
Best CV score = 0.9233333333333333
             precision    recall  f1-score   support

          0       0.94      0.97      0.96       441
          1       0.72      0.58      0.64        59

avg / total       0.92      0.92      0.92       500

Accuracy: 0.924
Took 9.872137069702148 seconds.
--------------------------------------------------
Senior Management
Best C param = 0.03162277660168379
Best penalty = l2
Best CV score = 0.9393333333333334
 

## Transforming and predicting labels of all reviews

### Importing and cleaning data

We begin by importing the Glassdoor reviews dataset, the jobs ratings dataset, and the companies dataset. We will need to do some cleaning of the data. In particular, we will only consider companies that have been reviewed at least 10 times.

In [27]:
#import and clean Glassdoor reviews dataset


start_time = time.time()

reviews = pd.read_csv('glassdoor_reviews_2.csv')

original_reviews = reviews.copy()

#each review's "Author Title" should be of format "Employee Status - Job Title"
# for example, "Current Employee - Senior Engineer"

#determine how many parts each review's "Author Title" has (should be 2)
reviews.loc[:,'title_length'] = reviews.loc[:,'Author Title'].apply(lambda x: len(x.split(' - ')))

#only consider reviews of proper format "Employee Status - Job Title"
reviews = reviews[reviews['title_length'] == 2]
#could be omitting some job titles with 'dash' in name,
#but decreasing number of reviews from 2631927 to 2615691 (<1% change, so don't care)

#every remaining 'Author Title' now has exactly 2 parts, so the helper column can go
reviews = reviews.drop('title_length', axis=1)

#break up "Author Title" into two columns: "Employee Status" and "Job Title"
reviews.loc[:,'Employee Status'] = reviews.loc[:,'Author Title'].apply(lambda x: x.split(' - ')[0])
reviews.loc[:,'Job Title'] = reviews.loc[:,'Author Title'].apply(lambda x: x.split(' - ')[1])

#remove 10 reviews that have an incorrect "Employee Status" 
#("Employee Status" not like "Current Employee", "Former Intern", etc.)
reviews = reviews[reviews['Employee Status'] != 'module.emp-review.current-'] #remove 4 reviews
reviews = reviews[reviews['Employee Status'] != 'module.emp-review.former-'] #remove 6 reviews

#add extra column that states if employee is current or former employee
reviews.loc[:,'current_or_former'] = reviews.loc[:,'Employee Status'].apply(lambda x: x.split(' ')[0])

print('Took ' + str(time.time()-start_time) + ' seconds.')



Took 163.01474404335022 seconds.
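To illustrate the "Author Title" cleaning on a single value, here is a minimal sketch. `split_author_title` is a hypothetical helper, and the second example title is made up to show what gets dropped.

```python
def split_author_title(author_title):
    """Split 'Employee Status - Job Title'; return None if the format is off."""
    parts = author_title.split(' - ')
    if len(parts) != 2:
        # e.g. a job title that itself contains ' - ' is discarded
        return None
    status, job_title = parts
    return {
        'Employee Status': status,
        'Job Title': job_title,
        'current_or_former': status.split(' ')[0],
    }

ok = split_author_title('Current Employee - Senior Engineer')
bad = split_author_title('Former Employee - Engineer - Level 2')  # invented title
print(ok)
print(bad)
```

The second title has three parts, so it falls into the under-1% of reviews that I drop.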


In [43]:
start_time = time.time()

cleaned_reviews = reviews.copy()

reviews = cleaned_reviews.copy()

print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 10.997055292129517 seconds.


In [30]:
#ratings of jobs based on required skills

all_jobs_ratings = pd.read_csv('all_jobs_ratings_transpose.csv')

In [31]:
all_jobs = all_jobs_ratings.loc[:,'Job Title']

#only consider reviews for jobs that have been reviewed at least 3x at some company
reviews = reviews[reviews.loc[:,'Job Title'].isin(all_jobs)]

In [41]:
print('Before choosing only certain jobs: {}'.format(original_reviews.shape))
print('After cleaning: {}'.format(cleaned_reviews.shape))
print('After choosing only certain jobs: {}'.format(reviews.shape))


Before choosing only certain jobs: (2631927, 35)
After cleaning: (2615681, 38)
After choosing only certain jobs: (1220699, 38)


In [38]:
companies = pd.read_csv('reviewed_companies.csv', index_col='Unnamed: 0')

#only consider companies with at least this many reviews
min_reviews = 10

#company ID's of companies with at least 10 reviews
company_ids_at_least_min_reviews = companies[companies['count'] >= min_reviews].loc[:,'Company Id']

#only consider reviews of companies with at least 10 reviews
reviews = reviews[reviews['Company Id'].isin(company_ids_at_least_min_reviews)]
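The company filter boils down to "keep a review only if its company has at least `min_reviews` reviews". Here is a plain-Python sketch of that logic on invented data. The company IDs, counts, and review texts below are all made up for illustration.

```python
# invented review counts per company ID
company_counts = {101: 25, 102: 3, 103: 10}
min_reviews = 10

# IDs of companies that meet the threshold (the threshold itself is included)
kept_ids = {cid for cid, count in company_counts.items() if count >= min_reviews}

# invented reviews
toy_reviews = [
    {'Company Id': 101, 'PROs': 'Good pay.'},
    {'Company Id': 102, 'PROs': 'Nice people.'},
    {'Company Id': 103, 'PROs': 'Flexible hours.'},
]

kept_reviews = [r for r in toy_reviews if r['Company Id'] in kept_ids]
print(kept_ids, len(kept_reviews))
```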

In [40]:
print('With company minimum of {} reviews: {}'.format(min_reviews, 
                                                      reviews.shape[0]))
#lost only about 4000 reviews

With company minimum of 10 reviews: 1220699


In [42]:
print('Original number of companies: {}'.format(original_reviews.loc[:,'Company Id'].nunique()))

print('Number after requiring at least 10 reviews: {}'.format(reviews.loc[:,'Company Id'].nunique()))

Original number of companies: 5833
Number after requiring at least 10 reviews: 4059


### Splitting PROs and CONs into sentences

As with the hand labeled dataset, we will break up each PRO and CON review into sentences. This results in very long DataFrames (about 2.5 million rows for the PROs alone).

In [46]:
#PROs and CONs of review (still in free-form text format)

reviews_pros_sentences = reviews.loc[:,['PROs','Company Id']]
reviews_cons_sentences = reviews.loc[:,['CONs','Company Id']]

In [47]:
def sent_tokenize_replace_period(a_string):
    '''
    Makes sure there is a space after each period, treats '+' as a
    sentence break, and sentence tokenizes.
    
    Args:
        a_string: free-form review text
        
    Returns:
        List of cleaned sentences. For example,
        
        'I like pay.Also manager is good.' --> ['I like pay.', 'Also manager is good.']
    '''
    a_string = a_string.replace('.', '. ')
    a_string = a_string.replace('+', '. ')
    
    return sent_tokenize(a_string)
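Here is a rough, dependency-free sketch of what this preprocessing does. To keep it self-contained I swap NLTK's `sent_tokenize` for a naive regex splitter, so the results only approximate what the real tokenizer returns; `split_sentences_sketch` is a hypothetical name.

```python
import re

def split_sentences_sketch(text):
    """Add a space after periods, turn '+' into a sentence break, then split."""
    text = text.replace('.', '. ').replace('+', '. ')
    # naive stand-in for sent_tokenize: split after '.', '!' or '?' plus whitespace
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [part.strip() for part in parts if part.strip()]

sentences = split_sentences_sketch('I like pay.Also manager is good.')
bullets = split_sentences_sketch('Good pay+nice people')
print(sentences)
print(bullets)
```

The `+` replacement is there because some reviewers write their PROs and CONs as `+`-separated bullet points instead of sentences.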

In [48]:
#Make new column and split up PROs into separate sentences
start_time = time.time()

reviews_pros_sentences.loc[:,'PROs sentence'] = \
    reviews_pros_sentences.loc[:,'PROs'].apply(lambda x: sent_tokenize_replace_period(x))

print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 128.7796607017517 seconds.


In [51]:
#Make new column and split up CONs into separate sentences
start_time = time.time()

reviews_cons_sentences.loc[:,'CONs sentence'] =  reviews_cons_sentences.loc[:,'CONs'].apply(lambda x: sent_tokenize_replace_period(str(x)))

print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 166.60184502601624 seconds.


In [53]:
#make sure 'Company Id's are type <int>
reviews_pros_sentences.loc[:,'Company Id'] = \
    reviews_pros_sentences.loc[:,'Company Id'].apply(lambda x:int(x))

In [55]:
#make new column of pairs (company ID, PRO sentence about company ID)
start_time = time.time()

reviews_pros_sentences.loc[:,'combined'] = \
    reviews_pros_sentences.apply(lambda row: [(row['Company Id'], sent) 
                                              for sent in row['PROs sentence']], axis=1)

print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 59.264694929122925 seconds.


In [57]:
#list of all pairs (company ID, sentence about company ID)
company_ids_pros_sentences_flattened = \
    [comp_id_sent 
     for pair in reviews_pros_sentences.loc[:,'combined']
     for comp_id_sent in pair]


In [58]:
#make into 2490815 x 2  DataFrame of company ID and sentence about company ID
#for example, one row might look like:
# index    Company Id    PROs sentence
#  0         4           Steady 40 hours a week.


comp_id_pro_sentence_df = \
    pd.DataFrame.from_dict({'Company Id':[pair[0] 
                                          for pair in company_ids_pros_sentences_flattened], 
                            'PROs sentence':[pair[1] 
                                             for pair in company_ids_pros_sentences_flattened]})
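The flattening above turns one row per review (with a list of sentences) into one row per sentence, repeating the company ID. Here is a plain-Python sketch on toy rows. Apart from the "Steady 40 hours a week." example, the sentences and IDs are made up.

```python
# one entry per review: a company ID and the review's list of sentences
toy_rows = [
    {'Company Id': 4, 'PROs sentence': ['Steady 40 hours a week.', 'Good pay.']},
    {'Company Id': 7, 'PROs sentence': ['Nice people.']},
]

# one (company ID, sentence) pair per sentence
pairs = [(row['Company Id'], sent)
         for row in toy_rows
         for sent in row['PROs sentence']]

company_ids = [pair[0] for pair in pairs]
sentences = [pair[1] for pair in pairs]
print(pairs)
```

In pandas 0.25 or newer, `DataFrame.explode` on the sentence-list column should give the same long format in one step; I have not rerun the notebook that way, so treat it as a suggestion.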

In [61]:
start_time = time.time()

#word tokenize each sentence of each review
comp_id_pro_sentence_df.loc[:,'tokens'] = \
    comp_id_pro_sentence_df.loc[:,'PROs sentence'].\
        apply(lambda sentence: tokenize_a_sentence_worklife(sentence, tokenizer))

print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 41.012876987457275 seconds.


Now let's do the same for CONs reviews.

In [63]:
start_time = time.time()

all_cons_sentences_list = \
    [sent for review in list(reviews_cons_sentences.loc[:,'CONs sentence']) for sent in review]

#make sure 'Company Id's are type <int>
reviews_cons_sentences.loc[:,'Company Id'] = \
    reviews_cons_sentences.loc[:,'Company Id'].apply(lambda x:int(x))


#make new column of pairs (company ID, CON sentence about company ID)
reviews_cons_sentences.loc[:,'combined'] = \
    reviews_cons_sentences.apply(lambda row: [(row['Company Id'], sent)
                                              for sent in row['CONs sentence']], axis=1)

#list of all pairs (company ID, CON sentence about company)
company_ids_cons_sentences_flattened = [comp_id_sent 
                                        for pair in reviews_cons_sentences.loc[:,'combined'] 
                                        for comp_id_sent in pair]

comp_id_con_sentence_df = \
    pd.DataFrame.from_dict({'Company Id':[pair[0] 
                                          for pair in company_ids_cons_sentences_flattened], 
                            'CONs sentence':[pair[1]   
                                             for pair in company_ids_cons_sentences_flattened]})

#word tokenize each sentence of each review
comp_id_con_sentence_df.loc[:,'tokens'] = \
    comp_id_con_sentence_df.loc[:,'CONs sentence'].\
    apply(lambda sentence: tokenize_a_sentence_worklife(sentence, tokenizer))


print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 205.30730891227722 seconds.


### Transform sentences using previously trained Word2Vec models

In [73]:
def w2v_convert_sentences(original_series, to_convert_series, dim):
    '''
    Train a Word2Vec model on an original series of lists of tokens.
    Then use this model to vectorize a series of lists of tokens.

    Args:
        original_series: Series of lists of tokens to train on
        to_convert_series: Series of lists of tokens to then transform
        dim: dimension of Word2Vec model
    
    Returns:
        DataFrame: each row is vectorization of tokens of sentence
    '''
    
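    #train Word2Vec model on original_series
    #(this retrains from scratch rather than reusing the model object fitted earlier;
    # gensim 3.x API, in gensim >= 4.0 the `size` argument is named `vector_size`)
    model = gensim.models.Word2Vec(list(original_series), size=dim)

    #Numpy array stores Word2Vec transformed series
    to_convert_vectors_array = np.zeros((to_convert_series.shape[0], dim))

    for idx in range(to_convert_series.shape[0]):
        #average the vectors of in-vocabulary tokens; all-zero vector if none are known
        to_convert_vectors_array[idx] = \
            np.mean([model.wv[w] 
                     for w in to_convert_series.iloc[idx] 
                     if w in model.wv] or [np.zeros(dim)], axis=0)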
            
    return pd.DataFrame(to_convert_vectors_array)

In [75]:
dim = 100 #dimension of word2Vec

In [76]:
#transform PROs sentences using a Word2Vec model trained on the hand labeled PROs sentences

start_time = time.time()

reviews_pros_w2v = w2v_convert_sentences(pros_sentences.loc[:,'tokens'], 
                     comp_id_pro_sentence_df.loc[:,'tokens'],
                     dim=dim)

print('Took ' + str(time.time() - start_time) + ' seconds.')



Took 371.3734540939331 seconds.


In [77]:
#transform CONs sentences using a Word2Vec model trained on the hand labeled CONs sentences

start_time = time.time()

reviews_cons_w2v = w2v_convert_sentences(cons_sentences.loc[:,'tokens'], 
                     comp_id_con_sentence_df.loc[:,'tokens'],
                     dim=dim)

print('Took ' + str(time.time() - start_time) + ' seconds.')



Took 506.96224212646484 seconds.


In [79]:
print(reviews_pros_w2v.shape)
reviews_pros_w2v.head()

(2490815, 100)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.021073,0.015421,0.034878,0.02523,0.000503,0.033706,0.027564,-0.021196,0.008943,0.004582,0.064365,-0.049386,0.035511,0.074967,0.018548,0.018629,0.007315,-0.006095,0.034711,-0.035538,0.038316,-0.024083,-0.017932,0.034244,0.026115,0.03813,0.112936,-0.037382,8.9e-05,0.024936,0.000177,0.009346,-0.065599,0.04786,-0.035476,0.014463,-0.017142,0.010796,0.037608,-0.021935,0.018509,0.023596,-0.016953,-0.008246,0.000534,-0.01901,0.02183,-0.024113,-0.028104,-0.023688,0.020823,0.032518,0.030071,-0.03612,-0.01246,0.015397,0.00663,-0.022416,0.01155,0.036023,0.021052,-0.002979,0.039616,0.040573,0.032435,0.031113,-0.001467,0.042052,0.016206,0.035128,0.042158,0.016506,-0.004142,0.005444,-0.042419,0.039087,0.01075,-0.035864,-0.042669,0.064458,-0.011078,0.036195,0.029376,-0.022815,-0.056329,-0.059094,-0.018204,0.052708,0.026169,-0.038972,0.014054,0.012258,0.01962,-0.024914,-0.06677,-0.013657,-0.069117,-0.051804,-0.016766,0.02732
1,0.029788,0.025641,0.045741,0.029401,0.003033,0.054544,0.041118,-0.030071,0.016403,0.015008,0.084803,-0.075816,0.044683,0.106761,0.026246,0.028741,0.007452,-0.002802,0.041618,-0.039921,0.05035,-0.032263,-0.0255,0.047848,0.039556,0.047991,0.149339,-0.040978,-0.001822,0.034312,0.001876,0.011029,-0.091462,0.066622,-0.05293,0.017736,-0.024503,0.015219,0.04215,-0.028454,0.027815,0.028082,-0.029046,-0.014715,0.002232,-0.020672,0.028433,-0.036173,-0.0328,-0.02993,0.022307,0.045179,0.039254,-0.046527,-0.01835,0.01272,0.00877,-0.037371,0.024375,0.047582,0.028246,-0.002726,0.057367,0.045701,0.042199,0.041063,-0.001126,0.054985,0.022323,0.048682,0.056929,0.028535,-0.009399,0.000659,-0.059708,0.058511,0.019837,-0.050135,-0.057097,0.090647,-0.009512,0.04008,0.032916,-0.035386,-0.077826,-0.078651,-0.024523,0.067138,0.041372,-0.059167,0.014983,0.01545,0.019708,-0.03573,-0.093911,-0.020424,-0.08755,-0.070687,-0.026153,0.032906
2,0.018757,0.011648,0.027166,0.022895,0.000604,0.029699,0.023677,-0.014887,0.006413,0.007304,0.051785,-0.04481,0.026833,0.064396,0.017788,0.017767,0.006075,-0.001159,0.026479,-0.024913,0.027882,-0.021912,-0.017381,0.028166,0.023989,0.031767,0.090735,-0.027445,0.000229,0.019913,0.001419,0.004285,-0.053252,0.039196,-0.031003,0.01367,-0.014454,0.011099,0.027592,-0.020324,0.015048,0.017202,-0.017224,-0.010376,0.000109,-0.013471,0.0192,-0.021854,-0.024168,-0.019274,0.017809,0.024575,0.022609,-0.029755,-0.009814,0.012638,0.00653,-0.021462,0.013533,0.025643,0.020481,-0.001519,0.031801,0.028206,0.028812,0.023645,-0.001706,0.035103,0.010381,0.03237,0.03615,0.018776,-0.003068,0.001148,-0.034957,0.033148,0.012083,-0.029117,-0.037315,0.052787,-0.006647,0.026437,0.023441,-0.019134,-0.047604,-0.047212,-0.017952,0.041779,0.023289,-0.036248,0.012357,0.010446,0.012681,-0.019009,-0.055709,-0.008813,-0.052282,-0.041993,-0.015034,0.019928
3,0.015842,0.008818,0.023828,0.015648,-0.000454,0.017769,0.016922,-0.015988,0.003413,0.004033,0.038046,-0.034047,0.021969,0.049689,0.013364,0.010805,0.005434,-0.00059,0.017299,-0.017083,0.025337,-0.014447,-0.01151,0.018975,0.013568,0.0225,0.064662,-0.020148,-0.001712,0.012554,0.001538,0.004872,-0.036688,0.028479,-0.023916,0.009471,-0.011624,0.011774,0.02258,-0.017628,0.010367,0.014993,-0.01137,-0.002231,0.003618,-0.007072,0.009651,-0.012524,-0.015734,-0.01672,0.014109,0.020306,0.018037,-0.018633,-0.008056,0.009061,0.005844,-0.013576,0.01228,0.021196,0.012153,-0.00111,0.025754,0.022078,0.023348,0.019121,-0.001271,0.022667,0.009501,0.025676,0.026464,0.011755,-0.004804,0.004066,-0.025424,0.025554,0.008332,-0.021688,-0.026876,0.03802,-0.004387,0.02169,0.014407,-0.012085,-0.034715,-0.036194,-0.014289,0.031214,0.018053,-0.026729,0.009027,0.004829,0.009737,-0.016241,-0.041766,-0.008908,-0.03934,-0.027273,-0.010472,0.012611
4,0.022013,0.016211,0.030834,0.02323,0.000859,0.031623,0.024077,-0.014584,0.009601,0.007884,0.052794,-0.048794,0.033036,0.069429,0.018872,0.02239,0.010735,-0.002555,0.029382,-0.02695,0.032179,-0.021561,-0.015137,0.027287,0.027739,0.035005,0.099766,-0.034175,0.000243,0.019136,0.00039,0.004374,-0.06001,0.046878,-0.0341,0.013236,-0.013887,0.008844,0.027069,-0.022356,0.01617,0.024713,-0.015369,-0.008075,-0.000481,-0.013191,0.018716,-0.02607,-0.021073,-0.023616,0.015152,0.028513,0.027791,-0.029767,-0.011763,0.012417,0.008355,-0.020934,0.010961,0.031537,0.016907,-0.004899,0.034821,0.033242,0.029163,0.025687,0.001769,0.036589,0.012638,0.036061,0.038691,0.019728,-0.004919,0.00024,-0.037963,0.036397,0.008791,-0.032172,-0.035829,0.057408,-0.009615,0.028291,0.02541,-0.024042,-0.052538,-0.05141,-0.014895,0.043691,0.022639,-0.03634,0.0122,0.008444,0.015709,-0.023833,-0.059758,-0.0112,-0.060405,-0.043724,-0.018833,0.020915


In [80]:
print(reviews_cons_w2v.shape)
reviews_cons_w2v.head()

(3349713, 100)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.002795,-0.009266,0.001714,0.008198,0.010403,0.022358,0.002729,-0.023121,-0.003336,-0.009279,0.015054,-0.005551,0.011097,0.015526,0.002889,0.017839,-0.007214,-0.00583,0.010326,0.003575,0.010772,-0.011119,-0.00714,-0.003418,-0.004485,0.004444,0.039374,-0.011647,-0.011223,0.005787,0.001907,-0.008061,-0.007708,0.005959,-0.00209,0.009226,-0.014231,0.013425,0.01425,-0.01107,-0.004438,0.005467,-0.001209,0.002518,0.014984,-0.00056,0.007955,-0.009062,0.003594,-0.00895,0.013221,0.001724,0.009254,-0.009443,-0.010591,0.020579,0.010173,-0.003276,0.003143,0.020369,-0.006967,0.00392,0.011808,0.017144,-0.000234,0.019031,0.006772,0.011634,-0.000505,0.02027,0.004493,-0.001425,0.009285,-0.0086,-0.009939,-0.005251,0.012948,-0.003095,-0.006189,0.017592,-0.003313,0.013176,0.005436,-0.006969,-0.019107,-0.019807,-0.013777,0.03272,-0.005396,0.00017,0.016154,-0.007153,0.001333,-0.014156,-0.024159,-0.002575,-0.024156,-0.021444,-0.010763,0.005763
1,0.003016,-0.005019,-0.00145,0.005102,0.005247,0.017406,0.003293,-0.01306,-0.002312,-0.006276,0.009763,-0.003934,0.007297,0.0134,0.004924,0.012349,-0.002236,-0.002685,0.00405,0.004261,0.007832,-0.009664,-0.002707,-0.002381,-0.003324,0.002574,0.024611,-0.008271,-0.003606,0.005537,0.002895,-0.004512,-0.006196,0.005626,-0.004157,0.008535,-0.009348,0.009091,0.00882,-0.004072,7.2e-05,0.004775,0.000163,0.002366,0.00799,0.000792,0.005337,-0.005806,0.003523,-0.007723,0.009027,0.003407,0.007831,-0.007438,-0.004343,0.013408,0.004519,-0.003911,-0.001269,0.018127,-0.006791,0.00047,0.009138,0.012413,0.00263,0.015021,0.004302,0.005071,-0.002415,0.013814,0.003883,-0.000744,0.004547,-0.007615,-0.009779,-0.002613,0.004926,-0.002545,-0.004185,0.013014,-0.007266,0.010546,0.002719,-0.004889,-0.009432,-0.013602,-0.008833,0.020537,-0.004852,-0.001054,0.01079,-0.007204,0.002155,-0.009235,-0.017176,-0.004394,-0.015778,-0.016808,-0.00856,0.004839
2,-0.001351,-0.005689,0.000407,0.005885,0.004849,0.014367,0.00154,-0.015549,-0.00191,-0.004574,0.011094,-0.001984,0.009075,0.011009,0.001613,0.012324,-0.002097,-0.005525,0.006346,0.003755,0.01014,-0.007636,-0.008144,0.001833,-0.002947,0.005374,0.025404,-0.005225,-0.002142,0.002787,-0.002006,-0.004393,-0.003539,0.000592,-0.000301,0.009007,-0.010356,0.004465,0.009824,-0.004362,-0.003253,0.000726,0.002982,0.001192,0.008281,0.000276,0.003779,-0.00372,0.003178,-0.008674,0.009458,-0.00402,0.006266,-0.004956,-0.004559,0.012239,0.006015,-0.000522,1.2e-05,0.011348,-0.003251,0.002311,0.009699,0.007543,0.000702,0.012042,0.003653,0.008445,-0.001393,0.010415,0.003158,-0.003046,0.003478,-0.005591,-0.005638,-0.005758,0.005707,-3.6e-05,-0.005386,0.012857,-0.001107,0.005108,0.00073,-0.003479,-0.01066,-0.012694,-0.011585,0.020653,-0.006037,-0.002218,0.010691,-0.005362,0.002558,-0.011996,-0.011932,-0.001912,-0.010523,-0.014966,-0.008932,0.008451
3,0.004972,-0.010495,0.001732,0.007733,0.008092,0.027976,0.00524,-0.025121,-0.003775,-0.007835,0.020082,-0.00295,0.012459,0.020307,0.001926,0.021812,-0.005168,-0.00859,0.007587,0.007878,0.012812,-0.016185,-0.007023,-0.004586,-0.004506,0.004116,0.042258,-0.008662,-0.011597,0.009621,0.002362,-0.006925,-0.006607,0.004076,-0.001638,0.013009,-0.014086,0.015399,0.015902,-0.008163,-0.009907,0.004091,0.003886,0.002823,0.013852,0.002441,0.007388,-0.010766,0.005017,-0.015515,0.014012,0.001514,0.007915,-0.00712,-0.005257,0.022684,0.01029,-0.00426,0.004545,0.023138,-0.007926,0.0041,0.010433,0.021362,-0.000309,0.019416,0.006761,0.014521,-0.00045,0.01966,0.008144,-0.001106,0.012186,-0.013879,-0.017125,-0.006446,0.018083,-0.001742,-0.010479,0.01939,-0.006532,0.016741,0.009613,-0.006629,-0.020122,-0.024278,-0.012508,0.03607,-0.001309,-0.001357,0.015036,-0.005377,0.001222,-0.017501,-0.026647,-0.007224,-0.021281,-0.023249,-0.016846,0.010493
4,0.002847,0.000701,-0.001646,0.005628,0.001497,0.001682,0.005067,-0.004481,0.002172,0.002934,0.005222,0.003686,0.004694,0.00485,0.002154,0.000298,0.003504,-0.001238,0.002061,0.005097,-0.002044,-0.005353,-0.004694,-0.001124,0.00341,0.002433,0.004595,0.00128,0.001655,-0.001844,0.000343,-0.00436,-0.002322,-0.001103,5.1e-05,0.006471,0.000503,0.002779,0.006621,0.003207,-0.004944,0.004762,0.002008,-0.002775,0.000673,-0.003589,0.004806,-0.0029,0.005774,-0.006556,0.006091,-0.002329,0.00591,-0.002274,-0.002658,0.005471,0.006442,0.001084,-0.003278,0.004416,-0.000154,0.001028,0.006006,0.00197,-0.002013,0.002967,-0.001663,0.003782,0.003677,-0.00103,0.003281,0.003503,0.000127,-0.00257,-0.00554,-0.003211,0.005442,-0.002892,-0.002409,0.001548,0.001416,0.003568,0.005021,-0.000521,-0.002633,-0.005539,-0.001992,0.00852,-0.004533,0.002075,0.006065,0.002758,-0.004593,-0.000904,-0.001871,-0.004514,-0.005072,-0.002895,0.000783,0.000427


## Feature engineering

We will add the same features for the reviews as we did for the hand labeled reviews data. This will take almost 4 hours to run.
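
As a quick illustration of what these binary features look like, here is a small sketch on a single toy sentence. The dictionary `toy_common_words` and the helper `keyword_indicators` are hypothetical stand-ins; I am assuming the real `common_words_dict` maps each feature name to a collection of related words, which is how it is used in the loop below.

```python
# Hypothetical keyword groups; the real common_words_dict is defined earlier in the notebook
toy_common_words = {
    'pay': {'pay', 'paid', 'salary'},
    'hours': {'hours', 'hour'},
}

def keyword_indicators(tokens, groups):
    '''Return {feature name: 1 if any word of the group appears in tokens, else 0}.'''
    token_set = set(tokens)  # build the set once per sentence, not once per feature
    return {name: int(not token_set.isdisjoint(words))
            for name, words in groups.items()}

features = keyword_indicators(['the', 'salary', 'is', 'good'], toy_common_words)
print(features)
```

Building the token set once per sentence, rather than once per (sentence, feature) pair, might also cut down the long runtime, though I have not measured that here.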

In [83]:
start_time = time.time()

#add binary variables to all reviews
for common in common_words_dict:
    reviews_pros_w2v.loc[:,common] = \
        comp_id_pro_sentence_df.loc[:,'tokens'].\
            apply(lambda tokens:
                   1 if bool(set(tokens).intersection(common_words_dict[common]))
                   else 0) 

    
    reviews_cons_w2v.loc[:,common] = \
        comp_id_con_sentence_df.loc[:,'tokens'].\
            apply(lambda tokens:
                  1 if bool(set(tokens).intersection(common_words_dict[common]))
                  else 0) 

    
print('Took ' + str(time.time()-start_time) + ' seconds.') #almost 4 hours to run!

Took 13858.070756912231 seconds.


In [85]:
reviews_pros_w2v.to_csv('pros_w2v_Sept26.csv') #5.5 GB
reviews_cons_w2v.to_csv('cons_w2v_Sept26.csv') #7.2 GB

In [86]:
reviews_pros_w2v.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,nice,environment,people,respect,supportive,culture,value,diverse,non-profit,energetic,loyal,large,waste,consumer,expectation,competition,fear,vacation,sick,balance,hours,load,flexible,location,travel,breaks,schedule,college,stress,overtime,evening,maternity,hard,goal,manager,upper,supervisor,lead,ceo,issue,organization,incompetent,communicate,pay,benefits,compensation,bonus,discount,package,401k,medical,free,facilities,retirement,competitive,stock,commission,advance,mobility,experience,salaries,wage,paycheck,perk,pto,low,raise,condition,rate,opportunity,promote,outsource,nepotism,less,talent,career,training,professional,development,potential,intern,skill,grow,resume,entry,transfer,network,connect,lack,move,lay,replace
0,0.021073,0.015421,0.034878,0.02523,0.000503,0.033706,0.027564,-0.021196,0.008943,0.004582,0.064365,-0.049386,0.035511,0.074967,0.018548,0.018629,0.007315,-0.006095,0.034711,-0.035538,0.038316,-0.024083,-0.017932,0.034244,0.026115,0.03813,0.112936,-0.037382,8.9e-05,0.024936,0.000177,0.009346,-0.065599,0.04786,-0.035476,0.014463,-0.017142,0.010796,0.037608,-0.021935,0.018509,0.023596,-0.016953,-0.008246,0.000534,-0.01901,0.02183,-0.024113,-0.028104,-0.023688,0.020823,0.032518,0.030071,-0.03612,-0.01246,0.015397,0.00663,-0.022416,0.01155,0.036023,0.021052,-0.002979,0.039616,0.040573,0.032435,0.031113,-0.001467,0.042052,0.016206,0.035128,0.042158,0.016506,-0.004142,0.005444,-0.042419,0.039087,0.01075,-0.035864,-0.042669,0.064458,-0.011078,0.036195,0.029376,-0.022815,-0.056329,-0.059094,-0.018204,0.052708,0.026169,-0.038972,0.014054,0.012258,0.01962,-0.024914,-0.06677,-0.013657,-0.069117,-0.051804,-0.016766,0.02732,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0.029788,0.025641,0.045741,0.029401,0.003033,0.054544,0.041118,-0.030071,0.016403,0.015008,0.084803,-0.075816,0.044683,0.106761,0.026246,0.028741,0.007452,-0.002802,0.041618,-0.039921,0.05035,-0.032263,-0.0255,0.047848,0.039556,0.047991,0.149339,-0.040978,-0.001822,0.034312,0.001876,0.011029,-0.091462,0.066622,-0.05293,0.017736,-0.024503,0.015219,0.04215,-0.028454,0.027815,0.028082,-0.029046,-0.014715,0.002232,-0.020672,0.028433,-0.036173,-0.0328,-0.02993,0.022307,0.045179,0.039254,-0.046527,-0.01835,0.01272,0.00877,-0.037371,0.024375,0.047582,0.028246,-0.002726,0.057367,0.045701,0.042199,0.041063,-0.001126,0.054985,0.022323,0.048682,0.056929,0.028535,-0.009399,0.000659,-0.059708,0.058511,0.019837,-0.050135,-0.057097,0.090647,-0.009512,0.04008,0.032916,-0.035386,-0.077826,-0.078651,-0.024523,0.067138,0.041372,-0.059167,0.014983,0.01545,0.019708,-0.03573,-0.093911,-0.020424,-0.08755,-0.070687,-0.026153,0.032906,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0.018757,0.011648,0.027166,0.022895,0.000604,0.029699,0.023677,-0.014887,0.006413,0.007304,0.051785,-0.04481,0.026833,0.064396,0.017788,0.017767,0.006075,-0.001159,0.026479,-0.024913,0.027882,-0.021912,-0.017381,0.028166,0.023989,0.031767,0.090735,-0.027445,0.000229,0.019913,0.001419,0.004285,-0.053252,0.039196,-0.031003,0.01367,-0.014454,0.011099,0.027592,-0.020324,0.015048,0.017202,-0.017224,-0.010376,0.000109,-0.013471,0.0192,-0.021854,-0.024168,-0.019274,0.017809,0.024575,0.022609,-0.029755,-0.009814,0.012638,0.00653,-0.021462,0.013533,0.025643,0.020481,-0.001519,0.031801,0.028206,0.028812,0.023645,-0.001706,0.035103,0.010381,0.03237,0.03615,0.018776,-0.003068,0.001148,-0.034957,0.033148,0.012083,-0.029117,-0.037315,0.052787,-0.006647,0.026437,0.023441,-0.019134,-0.047604,-0.047212,-0.017952,0.041779,0.023289,-0.036248,0.012357,0.010446,0.012681,-0.019009,-0.055709,-0.008813,-0.052282,-0.041993,-0.015034,0.019928,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0.015842,0.008818,0.023828,0.015648,-0.000454,0.017769,0.016922,-0.015988,0.003413,0.004033,0.038046,-0.034047,0.021969,0.049689,0.013364,0.010805,0.005434,-0.00059,0.017299,-0.017083,0.025337,-0.014447,-0.01151,0.018975,0.013568,0.0225,0.064662,-0.020148,-0.001712,0.012554,0.001538,0.004872,-0.036688,0.028479,-0.023916,0.009471,-0.011624,0.011774,0.02258,-0.017628,0.010367,0.014993,-0.01137,-0.002231,0.003618,-0.007072,0.009651,-0.012524,-0.015734,-0.01672,0.014109,0.020306,0.018037,-0.018633,-0.008056,0.009061,0.005844,-0.013576,0.01228,0.021196,0.012153,-0.00111,0.025754,0.022078,0.023348,0.019121,-0.001271,0.022667,0.009501,0.025676,0.026464,0.011755,-0.004804,0.004066,-0.025424,0.025554,0.008332,-0.021688,-0.026876,0.03802,-0.004387,0.02169,0.014407,-0.012085,-0.034715,-0.036194,-0.014289,0.031214,0.018053,-0.026729,0.009027,0.004829,0.009737,-0.016241,-0.041766,-0.008908,-0.03934,-0.027273,-0.010472,0.012611,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0.022013,0.016211,0.030834,0.02323,0.000859,0.031623,0.024077,-0.014584,0.009601,0.007884,0.052794,-0.048794,0.033036,0.069429,0.018872,0.02239,0.010735,-0.002555,0.029382,-0.02695,0.032179,-0.021561,-0.015137,0.027287,0.027739,0.035005,0.099766,-0.034175,0.000243,0.019136,0.00039,0.004374,-0.06001,0.046878,-0.0341,0.013236,-0.013887,0.008844,0.027069,-0.022356,0.01617,0.024713,-0.015369,-0.008075,-0.000481,-0.013191,0.018716,-0.02607,-0.021073,-0.023616,0.015152,0.028513,0.027791,-0.029767,-0.011763,0.012417,0.008355,-0.020934,0.010961,0.031537,0.016907,-0.004899,0.034821,0.033242,0.029163,0.025687,0.001769,0.036589,0.012638,0.036061,0.038691,0.019728,-0.004919,0.00024,-0.037963,0.036397,0.008791,-0.032172,-0.035829,0.057408,-0.009615,0.028291,0.02541,-0.024042,-0.052538,-0.05141,-0.014895,0.043691,0.022639,-0.03634,0.0122,0.008444,0.015709,-0.023833,-0.059758,-0.0112,-0.060405,-0.043724,-0.018833,0.020915,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Predicting categories of review sentences

We will now predict the labels of over 5 million sentences using our previously trained Logistic Regression models.
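
The prediction step can be sketched on toy data as follows. Everything here (the two-feature matrix, the labels, and the names `toy_labels` and `toy_models`) is made up for illustration and assumes scikit-learn's `LogisticRegression`; the real models were trained earlier on the hand labeled sentences with the full Word2Vec and binary features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: 6 "sentences" with 2 features each, and hypothetical labels
X_train = np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.8],
                    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2]])
toy_labels = {
    'Comp & Benefits':   np.array([1, 1, 1, 0, 0, 0]),
    'Work/Life Balance': np.array([0, 0, 0, 1, 1, 1]),
}

# One independent binary classifier per category
toy_models = {cat: LogisticRegression().fit(X_train, y)
              for cat, y in toy_labels.items()}

# Every new sentence gets a 0/1 prediction from every model
X_new = np.array([[0.05, 0.95], [0.95, 0.05]])
predictions = {cat: model.predict(X_new) for cat, model in toy_models.items()}
print(predictions)
```

Because each category has its own binary model, a sentence can be assigned to several categories at once, or to none of them.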

In [104]:
start_time = time.time()

for cat in pros_categories: #PROs and CONs have same categories
    comp_id_pro_sentence_df.loc[:,cat] = models_pros[cat].predict(reviews_pros_w2v)
    comp_id_con_sentence_df.loc[:,cat] = models_cons[cat].predict(reviews_cons_w2v)
    print('{} took {} seconds.'.format(cat,time.time() - start_time))
    start_time = time.time()

Culture & Values took 155.8564019203186 seconds.
Work/Life Balance took 200.39866709709167 seconds.
Senior Management took 181.68292212486267 seconds.
Comp & Benefits took 160.2242980003357 seconds.
Career Opportunities took 183.1466989517212 seconds.


In [113]:
#see how many sentences predicted into the different categories for PROs

print('PROs counts and percentages')

for cat in pros_categories:
    print(cat)
    #number labeled in category
    print(comp_id_pro_sentence_df.loc[:,cat].value_counts())
    #percentage labeled in category
    print(comp_id_pro_sentence_df.loc[:,cat].value_counts() * 100 / \
              comp_id_pro_sentence_df.loc[:,cat].shape[0])

PROs counts and percentages
Culture & Values
0    1807218
1     683597
Name: Culture & Values, dtype: int64
0    72.555288
1    27.444712
Name: Culture & Values, dtype: float64
Work/Life Balance
0    2210517
1     280298
Name: Work/Life Balance, dtype: int64
0    88.746736
1    11.253264
Name: Work/Life Balance, dtype: float64
Senior Management
0    2289623
1     201192
Name: Senior Management, dtype: int64
0    91.922644
1     8.077356
Name: Senior Management, dtype: float64
Comp & Benefits
0    1916217
1     574598
Name: Comp & Benefits, dtype: int64
0    76.931326
1    23.068674
Name: Comp & Benefits, dtype: float64
Career Opportunities
0    2110302
1     380513
Name: Career Opportunities, dtype: int64
0    84.723354
1    15.276646
Name: Career Opportunities, dtype: float64


In [114]:
#see how many sentences predicted into the different categories for CONs

print('Cons counts and percentages')

for cat in cons_categories:
    print(cat)
    
    #number labeled in category
    print(comp_id_con_sentence_df.loc[:,cat].value_counts())
    #percentage labeled in category
    print(comp_id_con_sentence_df.loc[:,cat].value_counts() * 100 / \
              comp_id_con_sentence_df.loc[:,cat].shape[0])

Cons counts and percentages
Culture & Values
0    2497268
1     852445
Name: Culture & Values, dtype: int64
0    74.5517
1    25.4483
Name: Culture & Values, dtype: float64
Work/Life Balance
0    3037333
1     312380
Name: Work/Life Balance, dtype: int64
0    90.674425
1     9.325575
Name: Work/Life Balance, dtype: float64
Senior Management
0    2846485
1     503228
Name: Senior Management, dtype: int64
0    84.976982
1    15.023018
Name: Senior Management, dtype: float64
Comp & Benefits
0    2960827
1     388886
Name: Comp & Benefits, dtype: int64
0    88.390468
1    11.609532
Name: Comp & Benefits, dtype: float64
Career Opportunities
0    2973921
1     375792
Name: Career Opportunities, dtype: int64
0    88.781367
1    11.218633
Name: Career Opportunities, dtype: float64


### Find companies' average sentiments

The last major computational step is performing sentiment analysis on the sentences. We will apply VADER to every sentence in our (cleaned) reviews dataset.

In [116]:
analyser = SentimentIntensityAnalyzer()

In [118]:
#find sentiment of each PRO sentence

start_time = time.time()

#the compound score is VADER's overall sentiment for the sentence
comp_id_pro_sentence_df.loc[:,'sentiment'] = \
    comp_id_pro_sentence_df.loc[:,'PROs sentence'].apply(lambda sentence: 
                                                         analyser.polarity_scores(sentence)['compound'])


print('Time to sentiment analyse {} sentences: {} seconds'.\
      format(comp_id_pro_sentence_df.shape[0], time.time()-start_time))


Time to sentiment analyse 2490815 sentences: 629.6292858123779 seconds


In [119]:
#find sentiment of each CON sentence

start_time = time.time()

#the compound score is VADER's overall sentiment for the sentence
comp_id_con_sentence_df.loc[:,'sentiment'] = \
    comp_id_con_sentence_df.loc[:,'CONs sentence'].apply(lambda sentence: 
                                                         analyser.polarity_scores(sentence)['compound'])

print('Time to sentiment analyse {} sentences: {} seconds'.format(comp_id_con_sentence_df.shape[0], 
                                                                  time.time()-start_time))


Time to sentiment analyse 3349713 sentences: 983.3232078552246 seconds


### Find company sentiments

We group the sentences by company in order to determine how each company's employees feel in each of the 10 bins (5 categories, each for PRO and CON). This will enable us to represent each company by a 10-vector of average sentiments.
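
Here is a minimal sketch of the grouping step on made-up data for a single category. The DataFrame `toy_sentences` and its values are hypothetical; only the column names `Company Id` and `sentiment` match the ones used below.

```python
import pandas as pd

# Hypothetical sentence-level data for one category
toy_sentences = pd.DataFrame({
    'Company Id': [4, 4, 7],
    'sentiment':  [0.5, 0.3, -0.2],
})

# Mean sentiment and number of sentences per company
toy_summary = (toy_sentences
               .groupby('Company Id')['sentiment']
               .agg(['mean', 'count'])
               .reset_index())
print(toy_summary)
```

Keeping the count next to the mean makes it possible to tell later how many sentences each average is based on.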

In [124]:
pros_df_by_category = {}

In [125]:
#split PROs sentiment DataFrame into 5 DataFrames (one per category)

start_time = time.time()

for cat in pros_categories:
    #all sentences predicted to be in category
    pros_df_by_category[cat] = comp_id_pro_sentence_df[comp_id_pro_sentence_df[cat] == 1]
    
print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 16.442720890045166 seconds.


In [126]:
#split CONs sentiment DataFrame into 5 DataFrames (one per category)

cons_df_by_category = {}

start_time = time.time()

for cat in cons_categories:
    cons_df_by_category[cat] = comp_id_con_sentence_df[comp_id_con_sentence_df[cat] == 1]
    
print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 9.429229974746704 seconds.


In [133]:
#group sentiment DataFrame by company and find average company sentiments

start_time = time.time()

#keys: category
#values: DataFrame grouped by company about category

#For example:
# Company Id    PROs Career Opportunities mean    PROs Career Opportunities count
#     4                  0.433677                           31

pros_df_by_category_company = {}

for cat in pros_categories:
    pros_df_by_category_company[cat] = pros_df_by_category[cat].groupby('Company Id')['sentiment'].agg(['mean','count']).reset_index()
    pros_df_by_category_company[cat].columns = ['Company Id', 'PROs ' + cat + ' mean', 'PROs ' + cat + ' count']
    
print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 0.15370917320251465 seconds.


In [134]:
#group sentiment DataFrame by company and find average company sentiments

start_time = time.time()


#keys: category
#values: DataFrame grouped by company about category

#For example:
# Company Id    CONs Career Opportunities mean    CONs Career Opportunities count
#     4                  0.01118                            35


cons_df_by_category_company = {}

for cat in cons_categories: 
    #find average sentiment and number of company reviews in category
    cons_df_by_category_company[cat] = cons_df_by_category[cat].groupby('Company Id')['sentiment'].agg(['mean','count']).reset_index()
    cons_df_by_category_company[cat].columns = ['Company Id', 'CONs ' + cat + ' mean', 'CONs ' + cat + ' count']
    
print('Took ' + str(time.time() - start_time) + ' seconds.')

Took 0.15651321411132812 seconds.


In [136]:
#merge all company sentiment DataFrames on company ID

pros_cons_ratings_by_company = pros_df_by_category_company['Culture & Values']

for idx in range(1,len(pros_categories)): #don't need to attach Culture & Values (initial df)
    pros_cons_ratings_by_company = pd.merge(pros_cons_ratings_by_company,
                                            pros_df_by_category_company[pros_categories[idx]],
                                            how = 'outer', 
                                            on = 'Company Id')
    
#then join CONs DataFrames to PROs DataFrames
for cat in cons_categories:
    pros_cons_ratings_by_company = pd.merge(pros_cons_ratings_by_company,
                                           cons_df_by_category_company[cat],
                                           how = 'outer',
                                           on = 'Company Id')

In [137]:
pros_cons_ratings_by_company.head()

Unnamed: 0,Company Id,PROs Culture & Values mean,PROs Culture & Values count,PROs Work/Life Balance mean,PROs Work/Life Balance count,PROs Senior Management mean,PROs Senior Management count,PROs Comp & Benefits mean,PROs Comp & Benefits count,PROs Career Opportunities mean,PROs Career Opportunities count,CONs Culture & Values mean,CONs Culture & Values count,CONs Work/Life Balance mean,CONs Work/Life Balance count,CONs Senior Management mean,CONs Senior Management count,CONs Comp & Benefits mean,CONs Comp & Benefits count,CONs Career Opportunities mean,CONs Career Opportunities count
0,4,0.473771,52.0,0.39866,25.0,0.36068,15.0,0.526512,43.0,0.433677,31.0,-0.056455,91.0,-0.189957,23.0,-0.206288,67.0,-0.110425,40.0,0.01118,35.0
1,7,0.452461,31.0,0.452033,15.0,0.53295,4.0,0.5419,22.0,0.550964,11.0,-0.074654,52.0,-0.117675,4.0,-0.227596,24.0,0.018573,15.0,0.164988,17.0
2,8,0.532774,450.0,0.519078,281.0,0.469322,85.0,0.555796,789.0,0.495784,204.0,-0.069795,613.0,-0.010618,320.0,-0.098412,395.0,-0.056911,471.0,0.078058,205.0
3,12,0.547925,310.0,0.536132,92.0,0.495026,78.0,0.606038,343.0,0.528553,237.0,-0.01746,395.0,-0.028166,67.0,-0.034812,234.0,0.037819,107.0,0.034389,228.0
4,14,0.216718,11.0,0.40062,10.0,0.243343,14.0,0.29834,20.0,0.448986,7.0,-0.049645,33.0,-0.19055,6.0,-0.041236,22.0,0.271979,14.0,0.094465,20.0


In [138]:
#see 754 companies with NaN in any column
pros_cons_ratings_by_company[pros_cons_ratings_by_company.isnull().any(axis=1)]

Unnamed: 0,Company Id,PROs Culture & Values mean,PROs Culture & Values count,PROs Work/Life Balance mean,PROs Work/Life Balance count,PROs Senior Management mean,PROs Senior Management count,PROs Comp & Benefits mean,PROs Comp & Benefits count,PROs Career Opportunities mean,PROs Career Opportunities count,CONs Culture & Values mean,CONs Culture & Values count,CONs Work/Life Balance mean,CONs Work/Life Balance count,CONs Senior Management mean,CONs Senior Management count,CONs Comp & Benefits mean,CONs Comp & Benefits count,CONs Career Opportunities mean,CONs Career Opportunities count
46,81,0.613433,6.0,0.550100,2.0,,,0.678886,7.0,0.722500,4.0,-0.012243,7.0,0.599400,1.0,0.013533,3.0,0.423440,5.0,-0.021733,3.0
57,100,0.554075,8.0,,,,,0.805233,3.0,0.713733,3.0,0.150000,16.0,0.000000,1.0,0.163400,2.0,,,0.190100,3.0
87,157,0.494238,13.0,0.918600,1.0,0.464300,2.0,0.725350,14.0,0.555500,5.0,-0.053600,7.0,,,-0.214743,7.0,0.010300,2.0,-0.023100,4.0
144,258,0.452717,6.0,0.000000,1.0,,,0.540600,4.0,0.668600,2.0,0.023857,30.0,-0.263350,2.0,-0.152240,15.0,-0.134480,5.0,-0.222400,5.0
145,260,0.333733,3.0,0.395300,2.0,,,0.808925,4.0,0.153100,1.0,0.177067,6.0,-0.680800,1.0,-0.476700,1.0,0.000000,1.0,0.381800,1.0
149,265,0.549482,11.0,0.263533,3.0,0.440350,2.0,0.567967,6.0,0.919233,3.0,0.030512,8.0,0.381800,1.0,0.166283,6.0,0.294700,7.0,,
151,269,0.649263,8.0,0.447717,6.0,0.318367,3.0,0.447370,10.0,,,-0.027037,16.0,-0.069900,2.0,0.095700,6.0,0.292417,6.0,-0.062167,3.0
153,271,0.610797,30.0,0.440129,7.0,0.425486,7.0,0.588490,21.0,0.648180,5.0,-0.068845,22.0,,,-0.155500,19.0,-0.089612,8.0,0.087843,14.0
175,321,0.000000,1.0,0.440400,1.0,0.822500,1.0,0.420967,3.0,,,0.220200,2.0,,,0.000000,1.0,,,0.000000,1.0
183,344,0.751017,6.0,0.615500,2.0,0.593582,11.0,0.517575,12.0,0.590014,7.0,0.021635,23.0,-0.202650,6.0,-0.240380,20.0,,,-0.171071,7.0


We find that 754 companies (about $19\%$ of the 4,059 companies) have no predicted sentences in at least one (category, PRO/CON) pair.

No matter. We will just replace these null entries with a zero. I think this makes sense because there was no sentiment (i.e., 0 sentiment) for the category-PRO/CON.
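
To show where these null entries come from, here is a small sketch with two hypothetical per-category summaries (`toy_pros` and `toy_cons`, with shortened column names). Company 7 has no CON sentences in this made-up category, so the outer merge leaves NaN for it.

```python
import pandas as pd

# Hypothetical per-category summaries; company 7 has no CON sentences here
toy_pros = pd.DataFrame({'Company Id': [4, 7],
                         'PROs mean': [0.4, 0.5],
                         'PROs count': [2, 1]})
toy_cons = pd.DataFrame({'Company Id': [4],
                         'CONs mean': [-0.1],
                         'CONs count': [3]})

# Outer merge keeps every company and leaves NaN where one side is missing
toy_merged = pd.merge(toy_pros, toy_cons, how='outer', on='Company Id')
n_missing = toy_merged.isnull().any(axis=1).sum()

# Replace missing means and counts with 0
toy_filled = toy_merged.fillna(0)
print(n_missing)
print(toy_filled)
```

After filling, a count of 0 still marks the pairs that had no sentences, so they can be told apart from a company whose sentences really do average out to a sentiment of 0. The NaN values are also why the count columns are displayed as floats (e.g. `52.0`) rather than integers.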

In [139]:
pros_cons_ratings_by_company = pros_cons_ratings_by_company.fillna(0)

In [140]:
#check that no more null entries
pros_cons_ratings_by_company[pros_cons_ratings_by_company.isnull().any(axis=1)]

Unnamed: 0,Company Id,PROs Culture & Values mean,PROs Culture & Values count,PROs Work/Life Balance mean,PROs Work/Life Balance count,PROs Senior Management mean,PROs Senior Management count,PROs Comp & Benefits mean,PROs Comp & Benefits count,PROs Career Opportunities mean,PROs Career Opportunities count,CONs Culture & Values mean,CONs Culture & Values count,CONs Work/Life Balance mean,CONs Work/Life Balance count,CONs Senior Management mean,CONs Senior Management count,CONs Comp & Benefits mean,CONs Comp & Benefits count,CONs Career Opportunities mean,CONs Career Opportunities count


In [153]:
#rearrange columns to have means first and counts last

pros_cons_new_column_order = ['Company Id',
                              'PROs Culture & Values mean',
                              'PROs Work/Life Balance mean',
                              'PROs Senior Management mean',
                              'PROs Comp & Benefits mean',
                              'PROs Career Opportunities mean',
                              'CONs Culture & Values mean',
                              'CONs Work/Life Balance mean',
                              'CONs Senior Management mean', 
                              'CONs Comp & Benefits mean', 
                              'CONs Career Opportunities mean',
                              'PROs Culture & Values count',                        
                              'PROs Work/Life Balance count',                     
                              'PROs Senior Management count',                                
                              'PROs Comp & Benefits count', 
                              'PROs Career Opportunities count',                                
                              'CONs Culture & Values count',                                
                              'CONs Work/Life Balance count',                               
                              'CONs Senior Management count',                               
                              'CONs Comp & Benefits count',                               
                              'CONs Career Opportunities count']

pros_cons_ratings_by_company = pros_cons_ratings_by_company.loc[:,pros_cons_new_column_order]

In [156]:
print(pros_cons_ratings_by_company.shape)
pros_cons_ratings_by_company.head()

(4059, 21)


Unnamed: 0,Company Id,PROs Culture & Values mean,PROs Work/Life Balance mean,PROs Senior Management mean,PROs Comp & Benefits mean,PROs Career Opportunities mean,CONs Culture & Values mean,CONs Work/Life Balance mean,CONs Senior Management mean,CONs Comp & Benefits mean,CONs Career Opportunities mean,PROs Culture & Values count,PROs Work/Life Balance count,PROs Senior Management count,PROs Comp & Benefits count,PROs Career Opportunities count,CONs Culture & Values count,CONs Work/Life Balance count,CONs Senior Management count,CONs Comp & Benefits count,CONs Career Opportunities count
0,4,0.473771,0.39866,0.36068,0.526512,0.433677,-0.056455,-0.189957,-0.206288,-0.110425,0.01118,52.0,25.0,15.0,43.0,31.0,91.0,23.0,67.0,40.0,35.0
1,7,0.452461,0.452033,0.53295,0.5419,0.550964,-0.074654,-0.117675,-0.227596,0.018573,0.164988,31.0,15.0,4.0,22.0,11.0,52.0,4.0,24.0,15.0,17.0
2,8,0.532774,0.519078,0.469322,0.555796,0.495784,-0.069795,-0.010618,-0.098412,-0.056911,0.078058,450.0,281.0,85.0,789.0,204.0,613.0,320.0,395.0,471.0,205.0
3,12,0.547925,0.536132,0.495026,0.606038,0.528553,-0.01746,-0.028166,-0.034812,0.037819,0.034389,310.0,92.0,78.0,343.0,237.0,395.0,67.0,234.0,107.0,228.0
4,14,0.216718,0.40062,0.243343,0.29834,0.448986,-0.049645,-0.19055,-0.041236,0.271979,0.094465,11.0,10.0,14.0,20.0,7.0,33.0,6.0,22.0,14.0,20.0


In [155]:
#save to compare companies
pros_cons_ratings_by_company.to_csv('companies_pros_cons_ratings.csv')