# Hand labeling review sentences - Part 1

The overarching goal is to determine how each company's employees feel about the 5 Glassdoor categories (Culture & Values, Work/Life Balance, Senior Management, Comp & Benefits, and Career Opportunities). This is difficult because the PROs and CONs reviews are written as free-form text, so it is hard to tell a priori how an employee feels about each category.

My strategy is to tokenize the text of each review into sentences and study each sentence separately. I will then figure out which categories each sentence speaks to. My thinking is that most sentences speak to just one category or, at the very least, express a similar sentiment about every category they mention.

The hard part of this strategy is predicting which categories each sentence concerns. There are 5 million sentences in the Glassdoor reviews dataset. It would be impossible for me to hand-label all 5 million sentences (it would take about 10,000 hours, or roughly 5 years of full-time work weeks). Instead, I'll hand-label a subset of the sentences and later use machine learning to predict the categories of the remaining sentences.
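The time estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the rate of 500 sentences per hour, the 40-hour week, and the 50 working weeks per year are assumptions made for illustration.

```python
# Rough labeling-time estimate; all rates below are illustrative assumptions.
total_sentences = 5_000_000      # size of the sentence pool stated above
sentences_per_hour = 500         # assumed labeling pace
hours_per_week = 40              # assumed full-time work week
weeks_per_year = 50              # assumed working weeks per year

hours_needed = total_sentences / sentences_per_hour
years_needed = hours_needed / (hours_per_week * weeks_per_year)

print(f'{hours_needed:,.0f} hours, or about {years_needed:.0f} years of full-time work')
```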

I chose to label 1000 reviews. These reviews contain about 4000 sentences in total, and I can label about 400-600 sentences per hour, so this was all I could hope to label and still do the rest of my project.

In this notebook, I will start by labeling the sentences from 1000 PROs reviews. I will also label 1000 sentences from CONs reviews. Note that each PROs and CONs review is about 2 sentences long on average.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)

import numpy as np
import time

%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt

import nltk
import nltk.data
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

import re



## Import and clean reviews and companies data

In [2]:
start_time = time.time()

reviews = pd.read_csv('glassdoor_reviews_2.csv')

print('Took ' + str(time.time()-start_time) + ' seconds.')



Took 83.8594799041748 seconds.


In [5]:
#save initial version of reviews
reviews_original = reviews.copy()

In [6]:
#clean reviews, dropping reviews of different format

start_time = time.time()

reviews = reviews_original.copy()

#each review's "Author Title" should be of format "Employee Status - Job Title"
# for example, "Current Employee - Senior Engineer"

#determine how many parts each review's "Author Title" has (should be 2)
reviews.loc[:,'title_length'] = reviews.loc[:,'Author Title'].apply(lambda x: len(x.split(' - ')))

#only consider reviews of proper format "Employee Status - Job Title"
reviews = reviews[reviews['title_length'] == 2]
#could be omitting some job titles with 'dash' in name,
#but decreasing number of reviews from 2631927 to 2615691 (<1% change, so don't care)

#'title_length' of all remaining reviews is now 2, so drop the helper column
reviews = reviews.drop('title_length', axis=1)

#break up "Author Title" into two columns: "Employee Status" and "Job Title"
reviews.loc[:,'Employee Status'] = reviews.loc[:,'Author Title'].apply(lambda x: x.split(' - ')[0])
reviews.loc[:,'Job Title'] = reviews.loc[:,'Author Title'].apply(lambda x: x.split(' - ')[1])

#remove 10 reviews that have an incorrect "Employee Status"
#("Employee Status" not like "Current Employee", "Former Intern", etc.)
reviews = reviews[reviews['Employee Status'] != 'module.emp-review.current-'] #remove 4 reviews
reviews = reviews[reviews['Employee Status'] != 'module.emp-review.former-'] #remove 6 reviews

#add extra column that states if employee is a current or former employee
reviews.loc[:,'current_or_former'] = reviews.loc[:,'Employee Status'].apply(lambda x: x.split(' ')[0])

print('Took ' + str(time.time()-start_time) + ' seconds.')

Took 46.16398096084595 seconds.
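The cleaning cell above splits "Author Title" with three separate `apply` calls. As a possible alternative, the same steps can be written with pandas string methods. This is a minimal sketch on hypothetical toy data, not the code that was run on the full dataset.

```python
import pandas as pd

# Toy data (hypothetical); the last title has an extra ' - ' and should be dropped.
toy = pd.DataFrame({'Author Title': ['Current Employee - Senior Engineer',
                                     'Former Intern - Analyst',
                                     'Current Employee - Co-op - Engineer']})

# Keep only titles that split into exactly two parts.
n_parts = toy['Author Title'].str.split(' - ').str.len()
toy = toy[n_parts == 2].copy()

# Split once into two new columns, then take the first word of the status.
toy[['Employee Status', 'Job Title']] = toy['Author Title'].str.split(' - ', expand=True)
toy['current_or_former'] = toy['Employee Status'].str.split(' ').str[0]

print(toy[['Employee Status', 'Job Title', 'current_or_former']])
```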


In [7]:
reviews.shape

(2615681, 38)

In [8]:
#companies and number of reviews of company
companies = pd.read_csv('reviewed_companies.csv')

In [9]:
print(companies.shape)

companies.head()

(5832, 8)


Unnamed: 0.1,Unnamed: 0,Ticker Symbol,Ticker Sector,Ticker Industry,Company Id,Company URL,company_name,count
0,0,vtx:rog,Health Care,Pharmaceuticals & Biotechnology,274,https://www.glassdoor.com/Overview/Working-at-...,Genentech,609
1,1,bcs:falabella,,,10976,https://www.glassdoor.com/Overview/Working-at-...,Falabella,9
2,2,asx:wow,Consumer Services,Food & Drug Retailers,473193,https://www.glassdoor.com/Overview/Working-at-...,Big W,70
3,3,asx:wor,,,35193,https://www.glassdoor.com/Overview/Working-at-...,WorleyParsons,379
4,4,nyse:xom,Oil & Gas,Oil & Gas Producers,237,https://www.glassdoor.com/Overview/Working-at-...,ExxonMobil,845


In [10]:
#only consider companies with at least 100 reviews
#about 25% of companies have at least 100 reviews

minimum_reviews_to_consider = 100

#companies with at least 100 reviews
companies_at_least_min_reviews = companies[companies['count'] >= minimum_reviews_to_consider]

In [11]:
#only consider reviews from companies with >= 100 reviews
reviews_at_least_min_reviews = reviews[reviews['Company Id'].isin(companies_at_least_min_reviews.loc[:,'Company Id'])]

In [12]:
reviews_at_least_min_reviews.shape

(2339087, 38)

In [13]:
#filter to jobs in USA
reviews_at_least_min_reviews_usa = reviews_at_least_min_reviews[reviews_at_least_min_reviews['Author Country']=='USA']

print(reviews_at_least_min_reviews_usa.shape[0])

reviews_at_least_min_reviews_usa.head()

1007287


Unnamed: 0,Ticker Symbol,Entity Name,Dataset,CUSIP,ISIN,Unique ID,As Of Date,Review Url,Logo,Author Title,Author Location,Author Country,Summary,Description,PROs,CONs,Recommends Value,Recommends Description,Outlook Value,Outlook Description,CEO Review Value,CEO Review Description,Helpful Count,Rating: Overall,Rating: Work/Life Balance,Rating: Culture & Values,Rating: Career Opportunities,Rating: Comp & Benefits,Rating: Senior Management,Company Id,Company URL,Date Added,Date Updated,Ticker Sector,Ticker Industry,Employee Status,Job Title,current_or_former
4,nyse:xom,https://www.glassdoor.com?employer_id=237,2331755,30231G102,US30231G1022,3065906,2018-09-09 04:00:00+00,https://www.glassdoor.com/Reviews/Employee-Rev...,https://media.glassdoor.com/sqls/237/exxonmobi...,Former Employee - I T Analyst,"Houston, TX",USA,"""I.T. Analyst - Global Services Company""",I worked at ExxonMobil full-time (More than 10...,"Great benefits, smart co-workers, state of the...","Forced ranking, slow to adopt new technology, ...",1.0,Recommends,1.0,Positive Outlook,1.0,Approves of CEO,,4,3.0,5.0,3.0,5.0,4.0,237,https://www.glassdoor.com/Overview/Working-at-...,2018-09-10 07:53:45.26833+00,2018-09-10 07:53:45.268361+00,Oil & Gas,Oil & Gas Producers,Former Employee,I T Analyst,Former
5,nyse:cat,https://www.glassdoor.com?employer_id=137,2330013,149123101,US1491231015,3065701,2018-09-09 04:00:00+00,https://www.glassdoor.com/Reviews/Employee-Rev...,https://media.glassdoor.com/sqls/137/caterpill...,Current Employee - Welder/Fabricator,"Aurora, IL",USA,"""Decent job with decent pay.""",I have been working at Caterpillar full-time (...,Steady 40 hours a week. Union job.,Company is more worried about the all might do...,1.0,Recommends,0.0,Neutral Outlook,-1.0,Disapproves of CEO,,3,4.0,3.0,4.0,3.0,3.0,137,https://www.glassdoor.com/Overview/Working-at-...,2018-09-10 05:22:30.202947+00,2018-09-10 05:22:30.202988+00,Industrials,Industrial Engineering,Current Employee,Welder/Fabricator,Current
6,nyse:wrk,https://www.glassdoor.com?employer_id=1033056,2341899,96145D105,US96145D1054,3065903,2018-09-09 04:00:00+00,https://www.glassdoor.com/Reviews/Employee-Rev...,https://media.glassdoor.com/sqls/1033056/westr...,Current Employee - Quality,"Mebane, NC",USA,"""Quality""",I have been working at WestRock full-time (Mor...,Excellent health care and stock. 401k needs im...,No raise or pay adjustment in 2 years. Nothing...,-1.0,Doesn't Recommend,-1.0,Negative Outlook,-1.0,Disapproves of CEO,,2,1.0,1.0,1.0,4.0,1.0,1033056,https://www.glassdoor.com/Overview/Working-at-...,2018-09-10 07:52:04.797509+00,2018-09-10 07:52:04.797545+00,Industrials,General Industrials,Current Employee,Quality,Current
10,sto:secu-b,https://www.glassdoor.com?employer_id=16559,2333979,,,3066089,2018-09-09 04:00:00+00,https://www.glassdoor.com/Reviews/Employee-Rev...,https://media.glassdoor.com/sqls/16559/securit...,Current Employee - Security Officer,"Hillsboro, OR",USA,"""I enjoy the people here""",I have been working at Securitas Security Serv...,"It is a very large, stable company that has be...",I've been there a month now and have no cons.,1.0,Recommends,1.0,Positive Outlook,1.0,Approves of CEO,,5,4.0,5.0,5.0,4.0,5.0,16559,https://www.glassdoor.com/Overview/Working-at-...,2018-09-10 11:03:30.511047+00,2018-09-10 11:03:30.511088+00,,,Current Employee,Security Officer,Current
11,nyse:gs,https://www.glassdoor.com?employer_id=2800,2330564,38141G104,US38141G1040,3065765,2018-09-09 04:00:00+00,https://www.glassdoor.com/Reviews/Employee-Rev...,https://media.glassdoor.com/sqls/2800/goldman-...,Former Employee - Anonymous Employee,"New York, NY",USA,"""Great Company to Work for""",I worked at Goldman Sachs full-time (Less than...,Lots of learningLots of opportunities,Work PressureNeed to handle multiple task at s...,,,,,,,,5,1.0,5.0,5.0,5.0,4.0,2800,https://www.glassdoor.com/Overview/Working-at-...,2018-09-10 06:11:41.41912+00,2018-09-10 06:11:41.419162+00,Financials,Financial Services,Former Employee,Anonymous Employee,Former


## Extract sample of 500,000 reviews from 100+ review companies

In [14]:
#number of reviews to extract from reviews
size_of_sample = 500000

#extract size_of_sample reviews from reviews 
#(with at least 100 reviews)
#set random state for reproducibility
reviews_sample = reviews_at_least_min_reviews_usa.sample(n=size_of_sample, 
                                                     random_state=21).reset_index()

# Labeling pros

We will take a random sample of 1000 PROs reviews. We will then sentence tokenize each review and label each sentence with the categories it concerns. Note that some sentences were labeled with multiple categories.
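Since a sentence can carry several categories, a label such as 'CV, SM' is stored as one comma-separated string. One possible way to turn such strings into one indicator column per category is sketched below; the toy labels are hypothetical, and the sketch assumes the codes are always separated by a comma followed by a space.

```python
import pandas as pd

# Toy labels (hypothetical), using the category codes from the labeling prompt.
labels = pd.Series(['CB', 'CV, SM', 'WLB', 'O'], name='categories')

# One 0/1 column per category code; assumes codes are separated by ', '.
indicators = labels.str.get_dummies(sep=', ')

print(indicators)
```

Labels typed with inconsistent spacing (for example 'CV,SM') would need to be normalized first for this to work.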

In [17]:
size_of_small_sample = 1000

#PROs and CONs for 1000 reviews
reviews_small_sample = reviews_sample.loc[:size_of_small_sample-1,['index','PROs','CONs']].copy().reset_index(drop=True)

#copy so that later column assignments do not act on a slice of reviews_sample
reviews_small_sample_pros = reviews_sample.loc[:size_of_small_sample-1,['index','PROs']].copy()


In [None]:
#convert PROs to type string
reviews_small_sample_pros.loc[:,'PROs'] = reviews_small_sample_pros.loc[:,'PROs'].apply(lambda pros: str(pros))

In [19]:
#sentence tokenize each PRO and include as new column
reviews_small_sample_pros.loc[:,'PROs_sentences'] = \
    reviews_small_sample_pros.loc[:,'PROs'].apply(lambda pros: sent_tokenize(pros))

In [22]:
#preview the sentence-tokenized PROs
reviews_small_sample_pros.head()

Unnamed: 0,index,PROs,PROs_sentences
0,1791426,"Nice people,always there to help and we have fun","[Nice people,always there to help and we have ..."
1,1749854,"Good pay, especially for retail. Opportunity t...","[Good pay, especially for retail., Opportunity..."
2,2253474,"People, Pay, Work, Perks and Location","[People, Pay, Work, Perks and Location]"
3,1009831,"Good benefits, great work life balance","[Good benefits, great work life balance]"
4,2586930,"Large company, maybe a good place to start a c...","[Large company, maybe a good place to start a ..."


In [26]:
def pros_to_df(series):
    '''
    Breaks up a review series into a DataFrame, with a row for every sentence in PROs.
    
    Args:
    Series (index of review, PROs, PROs_sentences)
    
    Returns:
    DataFrame ((number of sentences in PROs) x 4)
    
        Example return:
        index     PROs                       sent_number   PROs_sentence
        525143    Great pay! I liked the managers.    0    Great pay!
        525143    Great pay! I liked the managers.    1    I liked the managers.

    '''
    
    pros_df = pd.DataFrame.from_dict({'index':series['index'],
                                      'PROs':series['PROs'],
                                      'PROs_sentence':series['PROs_sentences'],
                                      'sent_number':range(len(series['PROs_sentences']))})
    
    return pros_df.loc[:,['index','PROs','sent_number','PROs_sentence']]

In [28]:
start_time = time.time()
pros_sentences_df = pd.concat([pros_to_df(reviews_small_sample_pros.loc[idx,:]) 
                               for idx in range(reviews_small_sample_pros.shape[0])],
                             ignore_index=True)

#concatenates the sentences from all 1000 reviews
print('Time to concatenate 1000 reviews: {} seconds.'.format(time.time()-start_time))

Time to concatenate 1000 reviews: 3.352073907852173 seconds.
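The per-review loop with `pd.concat` above could also be expressed with `DataFrame.explode` (available in pandas 0.25 and later). The following is a sketch on hypothetical toy data, shown as an alternative rather than the method used in this notebook.

```python
import pandas as pd

# Toy reviews (hypothetical) with already-tokenized sentences.
toy = pd.DataFrame({
    'index': [525143, 7],
    'PROs': ['Great pay! I liked the managers.', 'Good benefits.'],
    'PROs_sentences': [['Great pay!', 'I liked the managers.'], ['Good benefits.']],
})

# explode() repeats each review row once per sentence in its list.
sentences = toy.explode('PROs_sentences').rename(columns={'PROs_sentences': 'PROs_sentence'})

# Number the sentences within each review (the review's row label repeats after explode).
sentences['sent_number'] = sentences.groupby(level=0).cumcount()

sentences = sentences.reset_index(drop=True)[['index', 'PROs', 'sent_number', 'PROs_sentence']]
print(sentences)
```

One difference to keep in mind: `explode` keeps a review whose sentence list is empty as a single row with a missing sentence, so such rows may need to be dropped afterwards.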


In [29]:
#label category of sentences into 6 categories: 5 original categories + 'Other'
#   add extra column for labels
pros_sentences_df.loc[:,'categories'] = pd.Series(['ToBeFilledIn']*pros_sentences_df.shape[0])

In [32]:
pros_sentences_df.loc[:,'PROs_sentence'] = pros_sentences_df.loc[:,'PROs_sentence'].apply(lambda x: str(x))

## Labeling PROs sentences

We are now set up to label PROs sentences. To keep the reviews data confidential, I will only show the code that enabled me to input labels, not the sentences themselves.

We label in chunks of typically 200 sentences at a time.

In [43]:
pros_sentences_df_0_199 = pros_sentences_df.loc[0:199,:].copy()

In [None]:
#read from and write to the chunk itself, so the labels end up in the DataFrame that is saved below
for idx in range(pros_sentences_df_0_199.shape[0]):
    print('\n')
    print(idx)
    category = input('\n' + pros_sentences_df_0_199.loc[idx,'PROs_sentence'] + '\n\n Category CV, WLB, SM, CB, CO, or O (or "break"): ')
    if category == 'break':
        print('Last index checked: {}'.format(idx-1))
        break
    else:
        pros_sentences_df_0_199.loc[idx,'categories'] = category

In [54]:
pros_sentences_df_0_199.to_csv('pros_sentences_df_0_199.csv')

In [55]:
pros_sentences_df_200_399 = pros_sentences_df.loc[200:399,:].copy()

In [62]:
def input_pros_categories(df):
    '''
    Enables user to classify pros sentences into different categories.
    
    Args:
        DataFrame of review sentences.
        
    Outputs:
        DataFrame of review sentences with categories inputted.
        
        Example return:
        index     PROs                       sent_number   PROs_sentence           categories
        525143    Great pay! I liked the managers.    0    Great pay!              CB
        525143    Great pay! I liked the managers.    1    I liked the managers.   SM
    '''
    
    for idx in df.index:
        print('\n')
        print(idx)
        category = input('\n' + df.loc[idx,'PROs_sentence'] + '\n\n Category CV, WLB, SM, CB, CO, or O (or "break"): ')
        if category == 'break':
            print('Last index checked: {}'.format(idx-1))
            break
        else:
            df.loc[idx,'categories'] = category
            
    return df
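The labeling function above stores whatever is typed, so a typo such as "CX" would end up in the labels. A small validation step could catch this before the value is saved. The helper below is hypothetical (it is not part of the original notebook) and assumes the six codes shown in the prompt are the only valid ones.

```python
VALID_CODES = {'CV', 'WLB', 'SM', 'CB', 'CO', 'O'}

def parse_categories(raw):
    '''
    Hypothetical helper: normalizes typed input such as "cv,sm" to "CV, SM".
    Returns None if any code is not one of the six allowed codes.
    '''
    codes = [code.strip().upper() for code in raw.split(',')]
    if not codes or any(code not in VALID_CODES for code in codes):
        return None
    return ', '.join(codes)

print(parse_categories('cv,sm'))
print(parse_categories('WLB'))
print(parse_categories('CX'))
```

Inside the labeling loop, a `None` result could be used to ask for the same sentence again instead of saving the input.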
        

In [None]:
#label sentences 200-399
pros_sentences_df_200_399 = input_pros_categories(pros_sentences_df_200_399)

In [64]:
pros_sentences_df_200_399.to_csv('pros_sentences_df_200_399.csv')

In [66]:
pros_sentences_df_400_599 = pros_sentences_df.loc[400:599,:].copy()

In [None]:
#label sentences 400-599
pros_sentences_df_400_599 = input_pros_categories(pros_sentences_df_400_599)

In [69]:
pros_sentences_df_400_599.to_csv('pros_sentences_df_400_599.csv')

In [70]:
pros_sentences_df_600_999 = pros_sentences_df.loc[600:999,:].copy()

In [None]:
#label sentences 600-999
pros_sentences_df_600_999 = input_pros_categories(pros_sentences_df_600_999)

In [73]:
pros_sentences_df_600_999.to_csv('pros_sentences_df_600_999.csv')

In [76]:
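def replace_dash_period(a_string):
    '''
    Turns '.', '-', and '+' into '. ' to help the sentence tokenizer
    split run-together sentences and dash- or plus-separated lists.
    '''
    a_string = a_string.replace('.', '. ')
    a_string = a_string.replace('-','. ')
    a_string = a_string.replace('+', '. ')
    
    return a_string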

In [80]:
#already have labeled sentences from reviews 0-543
    #label sentences from the rest of the reviews, 544-999
reviews_pros_544_on = reviews_small_sample.loc[544:,['index','PROs']]

reviews_pros_544_on.loc[:,'PROs'] = reviews_pros_544_on.loc[:,'PROs'].apply(lambda text: replace_dash_period(text))

reviews_pros_544_on.loc[:,'PROs_sentences'] = reviews_pros_544_on.loc[:,'PROs'].apply(lambda text: sent_tokenize(text))

In [84]:
#split up sentences from reviews 544-999 into separate rows
pros_reviews_544_on_df = pd.concat([pros_to_df(reviews_pros_544_on.loc[idx,:])
                                   for idx in reviews_pros_544_on.index],
                                  ignore_index=True)

In [88]:
pros_reviews_544_on_df.shape

(993, 4)

In [90]:
pros_reviews_544_on_df.loc[:,'categories'] = pd.Series(['ToBeFilledIn'
                                                       for idx in range(pros_reviews_544_on_df.shape[0])])

In [None]:
#label reviews 544-999
pros_reviews_544_on_df = input_pros_categories(pros_reviews_544_on_df)

In [109]:
pros_reviews_544_on_df.to_csv('pros_reviews_544_on_df.csv')
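The labeled chunks are saved as separate CSV files, so they need to be read back and stacked before they can be used for modeling. A minimal sketch of that step follows; it writes hypothetical toy chunks to a temporary folder instead of using the real files.

```python
import os
import tempfile

import pandas as pd

# Toy chunks (hypothetical) standing in for the labeled CSV files.
chunk_a = pd.DataFrame({'PROs_sentence': ['Great pay!'], 'categories': ['CB']})
chunk_b = pd.DataFrame({'PROs_sentence': ['I liked the managers.'], 'categories': ['SM']})

with tempfile.TemporaryDirectory() as folder:
    paths = [os.path.join(folder, name) for name in ('chunk_a.csv', 'chunk_b.csv')]
    chunk_a.to_csv(paths[0])
    chunk_b.to_csv(paths[1])

    # index_col=0 reads the saved row labels back as the index
    # instead of creating an extra 'Unnamed: 0' column.
    labeled = pd.concat([pd.read_csv(path, index_col=0) for path in paths],
                        ignore_index=True)

print(labeled)
```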

## Labeling negative reviews

After labeling the sentences from the 1000 PROs reviews, we now need to do the same for CONs. We will start that here by labeling 1000 CONs sentences. We will label the other 1000 in another notebook in this 'hand labeling' subdirectory.

In [111]:
#copy so that later column assignments do not act on a slice of reviews_sample
reviews_small_sample_cons = reviews_sample.loc[:size_of_small_sample-1,['index','CONs']].copy()


In [118]:
#make sure CONs are type string
reviews_small_sample_cons.loc[:,'CONs'] = reviews_small_sample_cons.loc[:,'CONs'].apply(lambda cons: str(cons))

def replace_period(a_string):
    '''
    Turns '.' and '+' into '. ' in sentences to help sentence tokenizer work right.
    '''
    a_string = a_string.replace('.', '. ')
    a_string = a_string.replace('+', '. ')
    
    return a_string

#fix TEXT1.TEXT2 by adding space after periods
reviews_small_sample_cons.loc[:,'CONs'] = reviews_small_sample_cons.loc[:,'CONs'].apply(lambda cons: replace_period(cons))

#tokenize sentences
reviews_small_sample_cons.loc[:,'CONs_sentences'] = reviews_small_sample_cons.loc[:,'CONs'].apply(lambda cons: sent_tokenize(cons))
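The blanket replacement above also changes periods that do not end a sentence: for example, '1.5' becomes '1. 5'. If that ever causes trouble, one possible refinement is a regular expression that only adds a space where a period sits directly between a lowercase and an uppercase letter. The function below is a hypothetical sketch and was not used to produce the labels in this notebook.

```python
import re

def add_space_after_period(text):
    '''
    Hypothetical alternative to a blanket replace: adds a space only where a
    period sits between a lowercase letter and an uppercase letter.
    '''
    return re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)

print(add_space_after_period('Low pay.Long hours'))
print(add_space_after_period('Raises of 1.5% at best'))
```

This leaves decimals and abbreviations such as 'e.g.' untouched, but it will miss run-together sentences whose second sentence starts with a lowercase letter.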

In [132]:
def cons_to_df(series):
    '''
    Breaks up a review series into a DataFrame, with a row for every sentence in CONs.
    
    Args:
    Series (index of review, CONs, CONs_sentences)
    
    Returns:
    DataFrame ((number of sentences in CONs) x 4)
    
        Example return:
        index     CONs                       sent_number   CONs_sentence
        525143    Bad pay! I hated the managers.    0      Bad pay!
        525143    Bad pay! I hated the managers.    1      I hated the managers.
    '''
    
    cons_df = pd.DataFrame.from_dict({'index':series['index'],
                                      'CONs':series['CONs'],
                                      'CONs_sentence':series['CONs_sentences'],
                                      'sent_number':range(len(series['CONs_sentences']))})
    
    return cons_df.loc[:,['index','CONs','sent_number','CONs_sentence']]

In [133]:
#split up reviews into a row for each sentence
cons_sentences_df = pd.concat([cons_to_df(reviews_small_sample_cons.loc[idx,:])
                              for idx in range(reviews_small_sample_cons.shape[0])],
                             ignore_index=True)

In [135]:
def input_cons_categories(df):
    '''
    Enables user to classify cons as belonging into different categories.
    
    Args:
        DataFrame with sentences from reviews.
    '''
    
    for idx in df.index:
        print('\n')
        print(idx)
        category = input('\n' + df.loc[idx,'CONs_sentence'] + '\n\n Category CV, WLB, SM, CB, CO, or O (or "break"): ')
        if category == 'break':
            print('Last index checked: {}'.format(idx-1))
            break
        else:
            df.loc[idx,'categories'] = category
            
    return df

In [None]:
#label first 100 CONs sentences
cons_sentences_df_0_99 = input_cons_categories(cons_sentences_df.loc[0:99,:].copy())

In [137]:
cons_sentences_df_0_99.loc[17,'categories'] = 'CV, SM'



In [139]:
cons_sentences_df_0_99.to_csv('cons_sentences_df_0_99.csv')

In [140]:
cons_sentences_df_100_999 = cons_sentences_df.loc[100:999,:].copy()

#assign a scalar: a new Series would be indexed 0-899 and would not align with rows 900-999
cons_sentences_df_100_999.loc[:,'categories'] = 'ToBeFilledIn'

In [None]:
#label sentences 100-999
cons_sentences_df_100_999 = input_cons_categories(cons_sentences_df_100_999)

In [145]:
cons_sentences_df_100_999.to_csv('cons_sentences_df_100_999.csv')
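Because the labeling functions allow stopping early with "break", it can be handy to find where to pick up again in a later session. The helper below is a hypothetical sketch; it assumes unlabeled rows still hold the 'ToBeFilledIn' placeholder used above, and the toy data is made up.

```python
import pandas as pd

def first_unlabeled(df, placeholder='ToBeFilledIn'):
    '''
    Hypothetical helper: returns the index label of the first row whose
    'categories' value is still the placeholder, or None if all are labeled.
    '''
    remaining = df.index[df['categories'] == placeholder]
    return remaining[0] if len(remaining) > 0 else None

# Toy chunk (hypothetical) in which only the first sentence has been labeled.
toy = pd.DataFrame({'CONs_sentence': ['Low pay.', 'Long hours.', 'No growth.'],
                    'categories': ['CB', 'ToBeFilledIn', 'ToBeFilledIn']},
                   index=[100, 101, 102])

start = first_unlabeled(toy)
print(start)
```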