# Collecting Employer Review Data from Indeed

Business value:
- To employers: Identify one or two key pain points to address
- To job seekers: Understand at a glance what it's like to work within that organization
- To competitors: Learn from peers' successes and misses

Who already does something similar: Glassdoor - simple summaries and 1-to-1 comparisons of individual organizations

## Next actions
- Correct formatting error that bumps data values off the dataframe
- Ensure that the loop will not disregard the last few reviews
- Create summary graphics
    - Distribution of ratings by year
    - Median monthly ratings
    - Roles' review counts
- Try LDA groupings
    - On the roles
    - On the review text
- Classify
    - Roles
    - Reviews
- Aspect-level sentiment
    - Extract key concepts/points from text
    - Assign each concept/point sentiment (positive, neutral, negative)
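
For the item about not disregarding the last few reviews, one option is to derive the page offsets from the review count with a ceiling division, so a partially filled final page is still requested. This is a minimal sketch, not the notebook's method: `page_offsets` and `REVIEWS_PER_PAGE` are hypothetical names, the page size of 20 is assumed from the `i*20` step used for the `start` parameter, and 3284 is only the commented-out `REVIEW_COUNT` value used as an illustrative input.

```python
import math

REVIEWS_PER_PAGE = 20  # assumption: matches the i*20 step used for the `start` parameter


def page_offsets(review_count, per_page=REVIEWS_PER_PAGE):
    """Return every `start` offset needed to cover review_count reviews."""
    # Round up so that a partially filled last page is still included
    pages = math.ceil(review_count / per_page)
    return [page * per_page for page in range(pages)]


offsets = page_offsets(3284)  # illustrative input taken from the commented-out REVIEW_COUNT
print(len(offsets), offsets[-1])
```

The scraping loop could then iterate over the returned offsets instead of a hard-coded `range(...)`.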

In [4]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

BASE_URL = r'https://www.indeed.com/cmp/Pnc-Financial-Services-Group/reviews?start='
# URL_START = r'https://www.indeed.com/cmp/Truist-Financial/reviews?start='
# URL_END = r'&lang=en'
df = pd.DataFrame({'review_title': [], 'review_verbatim': [], 'role': [], 'status': [], 'location': [], 'date': [], 'rating': []})

Here's the original from Towards Data Science (Yasser Elsedawy)

In [5]:
ROLE_REGEX = re.compile(r"[\w\s]+\(") # matches the author role (the text up to the opening parenthesis)
STATUS_REGEX = re.compile(r"\(\w+ \w+\)") # matches whether current or former employee
BRACE_REPLACE = re.compile(r"[()]") # strips the () from the matches above

# REVIEW_COUNT = 3284
PAGES = 120

# Browser-like User-Agent sent with every request
HEADER = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0"}
rows = [] # one dict per review; a plain list avoids DataFrame.append, which pandas 2.0 removed

for i in range(7): # 352 would get them all as of June 2022
    url = f'{BASE_URL}{i*20}' # the start offset advances 20 reviews per page
    page = requests.get(url, headers=HEADER)
    soup = BeautifulSoup(page.content, 'lxml')
    results = soup.find("div", {"class":"cmp-ReviewsList"})
    elems = results.find_all(attrs={"data-tn-section":"reviews"})
    for elem in elems:
        title = elem.find(attrs = {'data-testid': 'title'})
        review = elem.find('div', {'data-tn-component': 'reviewDescription'})
        author = elem.find(attrs = {'itemprop': 'author'})
        # NOTE: a hyphen inside the role or location also splits here and shifts the later fields
        author_details = author.text.split('-')
        rating = elem.find(attrs = {'class': "css-1c33izo e1wnkr790"})
        # rating = elem.find(attrs = {'aria-label': re.compile(r"\d.\d out of 5 stars.")})

        # fine-tuning: role and status both come from the first segment
        temp = ROLE_REGEX.search(author_details[0])
        author_role = BRACE_REPLACE.sub("", temp.group(0)) if temp else "None"
        temp = STATUS_REGEX.search(author_details[0])
        author_status = BRACE_REPLACE.sub("", temp.group(0)) if temp else "None"

        rows.append({'review_title': title.text, 'review_verbatim': review.text, 'role': author_role.strip(), 'status': author_status.strip(), 'location': author_details[1].strip(), 'date': author_details[2].strip(), 'rating': rating.text})

# Build the dataframe once, keeping the column order defined earlier
df = pd.DataFrame(rows, columns=df.columns)

AttributeError: 'NoneType' object has no attribute 'find_all'
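
The `AttributeError` above means `soup.find` returned `None`: the response did not contain a `div` with class `cmp-ReviewsList`, possibly because the request was blocked or the page markup has changed. A minimal sketch of a guard follows. The helper name `extract_review_elems` and both HTML snippets are hypothetical, and it uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_review_elems(html):
    """Return the review elements on a page, or [] when the container is absent."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser; the notebook uses lxml
    results = soup.find("div", {"class": "cmp-ReviewsList"})
    if results is None:
        # Container missing: return nothing instead of calling find_all on None
        return []
    return results.find_all(attrs={"data-tn-section": "reviews"})


# Hypothetical minimal pages, for illustration only
blocked_page = "<html><body><p>Request blocked</p></body></html>"
review_page = (
    '<div class="cmp-ReviewsList">'
    '<div data-tn-section="reviews">First</div>'
    '<div data-tn-section="reviews">Second</div>'
    "</div>"
)

print(len(extract_review_elems(blocked_page)))
print(len(extract_review_elems(review_page)))
```

Inside the scraping loop, an empty result could be used to log the URL and status code and stop paging, which makes a blocked request visible instead of raising midway.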

Moment of truth:
- Does anything get fetched back?
- Does the hidden text stay hidden?

In [6]:
df.head()

Unnamed: 0,review_title,review_verbatim,role,status,location,date,rating


In [101]:
import csv
df.to_csv('PNC_indeed_062022.csv', index=False, quoting=csv.QUOTE_ALL)
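
`csv.QUOTE_ALL` wraps every field in quotes, which keeps commas and line breaks inside review text from splitting a row when the file is read back. The standard-library sketch below illustrates this with a made-up row. The row contents are hypothetical, and it writes to an in-memory buffer, not to the CSV file above.

```python
import csv
import io

# Hypothetical row: the review text contains a comma and a line break
row = ["Good benefits", "Great team, long hours.\nWould recommend.", "Teller", "5.0"]

# Write the row with every field quoted
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)

# Read it back and confirm that the fields survive unchanged
buffer.seek(0)
restored = next(csv.reader(buffer))
print(restored == row)
```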