# Gathering Glassdoor data manually
-------------------

> <i>Description: In this notebook, we gather the review data that was extracted manually from Glassdoor. The extraction is manual because automated collection runs into CAPTCHA challenges. We therefore copied the data by hand, pasted it into one txt file per page, and combine those files here.</i>


Input Files: 
1) txt files in the `glassdoor pages` folder


Output:
1) Glassdoor_reviews_gathered.csv

In [None]:
import os
import pandas as pd
import numpy as np
import json

In [None]:
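# Create empty text files, one per page and language, to be filled in manually

directory = "glassdoor pages"

# exist_ok avoids an error when the folder is already there
os.makedirs(directory, exist_ok=True)

# Create files p1-italy.txt to p4-italy.txt, skipping any that already exist
# so that re-running this cell does not erase data pasted in earlier
for i in range(1, 5):
    file_path = os.path.join(directory, f"p{i}-italy.txt")
    if not os.path.exists(file_path):
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write("")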


# The manual way: fields kept from each pasted review
"""
reviewId
advice
cons
employer.shortName
employmentStatus
isCurrentJob
jobTitle.text
location.name
pros
ratingBusinessOutlook
ratingCareerOpportunities
ratingCeo
ratingCompensationAndBenefits
ratingCultureAndValues
ratingDiversityAndInclusion
ratingOverall
ratingRecommendToFriend
ratingSeniorLeadership
ratingWorkLifeBalance
reviewDateTime
summary

"""

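Because the files are filled in by hand, it can help to check that a pasted page has the shape the parsing cell below reads: a JSON list whose entries hold the reviews under `data` → `employerReviews` → `reviews`. The following is a minimal sketch under that assumption. The helper name `count_reviews` and the sample payload are hypothetical and are not part of the original notebook.

```python
import json

# Hypothetical minimal payload, shaped like what the parsing cell expects
sample_text = """
[
  {"data": {"employerReviews": {"reviews": [
      {"reviewId": 1, "pros": "Example pro", "cons": "Example con"},
      {"reviewId": 2, "pros": "Another pro", "cons": "Another con"}
  ]}}},
  {"data": {}}
]
"""

def count_reviews(payload):
    """Return how many review records a parsed page payload contains."""
    total = 0
    for entry in payload:
        # Entries without the expected keys simply contribute zero reviews
        reviews = entry.get("data", {}).get("employerReviews", {}).get("reviews", [])
        total += len(reviews)
    return total

payload = json.loads(sample_text)
n_reviews = count_reviews(payload)
print(n_reviews)
```

A count of zero for a file that should contain reviews would suggest the paste was incomplete or had a different structure.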
In [104]:
# Path to the directory 
directory_path = "glassdoor pages"
rows = []
# Iterate over all files in the directory
for filename in os.listdir(directory_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory_path, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)

            for entry in data:
                # Check if 'data' key exists and contains the target information
                if "data" in entry and "employerReviews" in entry["data"]:
                    reviews = entry["data"]["employerReviews"].get("reviews", [])
                    for review in reviews:
                        row = {
                            "reviewId": review.get("reviewId", None),
                            "advice": review.get("advice", None),
                            "cons": review.get("cons", None),
                            # `or {}` also covers nested fields that are present but null
                            "employer.shortName": (review.get("employer") or {}).get("shortName"),
                            "employmentStatus": review.get("employmentStatus", None),
                            "isCurrentJob": review.get("isCurrentJob", None),
                            "jobTitle.text": (review.get("jobTitle") or {}).get("text"),
                            "location.name": (review.get("location") or {}).get("name"),
                            "pros": review.get("pros", None),
                            "ratingBusinessOutlook": review.get("ratingBusinessOutlook", None),
                            "ratingCareerOpportunities": review.get("ratingCareerOpportunities", None),
                            "ratingCeo": review.get("ratingCeo", None),
                            "ratingCompensationAndBenefits": review.get("ratingCompensationAndBenefits", None),
                            "ratingCultureAndValues": review.get("ratingCultureAndValues", None),
                            "ratingDiversityAndInclusion": review.get("ratingDiversityAndInclusion", None),
                            "ratingOverall": review.get("ratingOverall", None),
                            "ratingRecommendToFriend": review.get("ratingRecommendToFriend", None),
                            "ratingSeniorLeadership": review.get("ratingSeniorLeadership", None),
                            "ratingWorkLifeBalance": review.get("ratingWorkLifeBalance", None),
                            "reviewDateTime": review.get("reviewDateTime", None),
                            "summary": review.get("summary", None)
                        }
                        rows.append(row)
                    

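The column names built above (`employer.shortName`, `jobTitle.text`, `location.name`) match the dotted names that `pandas.json_normalize` produces for nested dictionaries. As a sketch only, and not the approach this notebook uses, the same flattening could be done as shown below. The sample records and the shortened `FIELDS` list are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the fields gathered in this notebook
FIELDS = ["reviewId", "employer.shortName", "jobTitle.text", "location.name", "ratingOverall"]

# Hypothetical review records; the second has null nested fields
sample_reviews = [
    {"reviewId": 1, "employer": {"shortName": "ExampleCo"},
     "jobTitle": {"text": "Intern"}, "location": {"name": "Example City"},
     "ratingOverall": 5},
    {"reviewId": 2, "employer": {"shortName": "ExampleCo"},
     "jobTitle": None, "location": None, "ratingOverall": 4},
]

# json_normalize flattens nested dicts into dotted column names;
# reindex keeps only the wanted columns and fills absent ones with NaN
flat = pd.json_normalize(sample_reviews).reindex(columns=FIELDS)
print(flat)
```

With this variant, missing nested values appear as `NaN` rather than `None`, which matters if later steps test for `None` explicitly.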

In [105]:
# Create DataFrame from collected rows
df = pd.DataFrame(rows)
print(len(df))
df.head(3)

1852


Unnamed: 0,reviewId,advice,cons,employer.shortName,employmentStatus,isCurrentJob,jobTitle.text,location.name,pros,ratingBusinessOutlook,...,ratingCeo,ratingCompensationAndBenefits,ratingCultureAndValues,ratingDiversityAndInclusion,ratingOverall,ratingRecommendToFriend,ratingSeniorLeadership,ratingWorkLifeBalance,reviewDateTime,summary
0,90627425,,Have not found any particular,HUGO BOSS,REGULAR,True,Intern,"Metzingen, Baden-Württemberg",Great collaboration and support from colleagues,POSITIVE,...,NO_OPINION,5.0,5,5,5,POSITIVE,5.0,5.0,2024-09-02T03:03:38.860,Great
1,90606395,Clarify Opportunities for Advancement: Offer c...,Während ein Praktikum bei HUGO BOSS AG viele V...,HUGO BOSS,INTERN,True,Internship,"Metzingen, Baden-Württemberg",HUGO BOSS is a world-renowned fashion brand an...,POSITIVE,...,APPROVE,5.0,5,5,5,POSITIVE,5.0,5.0,2024-09-01T00:11:02.493,Smooth Application Process with Excellent Comm...
2,90566658,,Zu viele Aufgaben und zu wenig Wertschätzung,HUGO BOSS,REGULAR,True,Expert IT Consultant,,Tolle Teams und spannende Aufgaben,NEUTRAL,...,NO_OPINION,3.0,4,5,4,POSITIVE,3.0,3.0,2024-08-30T04:33:43.887,Zufrieden


In [106]:
df.to_csv('Glassdoor_reviews_gathered.csv')

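As written, `to_csv` also saves the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference that `index=False` makes, using a hypothetical two-row frame and an in-memory buffer instead of the real output file.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the gathered reviews
demo = pd.DataFrame({"reviewId": [1, 2], "ratingOverall": [5, 4]})

# Default behaviour: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False leaves the index out, so the columns round-trip unchanged
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'reviewId', 'ratingOverall']
print(list(without_index.columns))  # ['reviewId', 'ratingOverall']
```

Whether to keep the index depends on how the downstream notebooks read the file, so this is a suggestion to consider and not a required change.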
## End of Notebook