A project for big data processing and analysis, focusing on Glassdoor reviews. Extracts reviews and company branding information from www.glassdoor.com. Performs analysis using Python and/or MapReduce.
- Analysis was done in Python to compare performance differences between Python and MapReduce.
- All analysis was performed using MapReduce. Results from the Hadoop analysis were then processed in Python.
- Data was visualized using the Python Flask framework.
- pip install selenium
- pip install beautifulsoup4
- pip install numpy
- Ensure the Facebook accounts in the accounts.json file are not banned and have completed Glassdoor account setup.
- The current implementation uses Facebook login, as it is not blocked for scraping.
- New Facebook accounts have a 60-minute delay before they can be used to create a Glassdoor account.
- After a Glassdoor account is registered to a Facebook account, follow Glassdoor's prompts to set up the account. Skipping the setup process will prevent scraping.
- Replace the company code, company name and account_number in main.py.
- Company codes can be obtained from the reviews page.
- https://www.glassdoor.sg/Overview/Working-at-Accenture-EI_IE4138.11,20.htm
- company code = 4138, company name = Accenture
- Run main.py. It should open the Glassdoor website and attempt to log in using the account information provided in the 'scraper/accounts.json' file.
- The Chrome browser should open automatically.
- Login via Facebook is automatic. Please solve the captcha if it appears.
- On login success, the terminal should display login successful. The scraper will then automatically retrieve the list of URLs to scrape and begin scraping.
- You will see the Chrome browser traversing each page and storing reviews.
- Once 100 pages (~300-800 seconds) have been crawled, a JSON file containing the reviews will be saved to the 'data/company_name' folder. Each JSON file will contain 1000 reviews.
- /miscellanous/Progress.md is used to track companies that were scraped.
- Use GlassDoorCompanyInformationWorker.py to retrieve company information for previously scraped datasets.
- Specify a JSON file that contains company codes and company names, following the same format used for multiple scrapes. (You can reuse the JSON files used for start_multiple_scrapes.)
- Run GlassDoorCompanyInformationWorker.py. Company information will be written to ../data/Company Information/company_information.json.
- Future work will integrate this step into the company review scrape.
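
Both main.py and the company list JSON need the numeric company code, which appears in the overview URL after `EI_IE`. A minimal sketch of pulling the code and name out of such a URL is shown here. The helper name `parse_overview_url` is hypothetical and not part of the scraper, and the URL shape is assumed from the single Accenture example in this README, so other Glassdoor URLs may need a different pattern.

```python
import re

# Assumed URL shape, based only on the Accenture example in this README:
#   .../Overview/Working-at-<Company-Name>-EI_IE<code>.<start>,<end>.htm
OVERVIEW_URL_PATTERN = re.compile(r"Working-at-(?P<name>.+)-EI_IE(?P<code>\d+)\.")


def parse_overview_url(url):
    """Return (company_code, company_name) parsed from an overview URL.

    Hypothetical helper for illustration; not part of the scraper itself.
    """
    match = OVERVIEW_URL_PATTERN.search(url)
    if match is None:
        raise ValueError(f"Unrecognised overview URL: {url}")
    # Assumption: multi-word company names are hyphen-separated in the URL.
    company_name = match.group("name").replace("-", " ")
    return int(match.group("code")), company_name


if __name__ == "__main__":
    url = "https://www.glassdoor.sg/Overview/Working-at-Accenture-EI_IE4138.11,20.htm"
    print(parse_overview_url(url))  # (4138, 'Accenture')
```

The returned values correspond to the company code and company name that main.py expects.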
To scrape multiple companies in parallel, run multiple instances of main.py. Each instance should use a different Facebook account. Scraping with multiple Chrome browsers increases the likelihood of captchas and invalid sessions. You can safely run 3-4 Chrome browsers without captchas occurring frequently (this depends on the Facebook account used).
To resume a scrape from the last successful batch, invoke resume_scrape() instead of start_one_scrape().
Create a JSON file and populate it with company_codes and company_names (see '../scraper/Industry/Financials' for an example). Run start_multiple_scrapes() instead of start_one_scrape(). You can also adopt the same approach with multiple Chrome sessions, scraping multiple companies in each session. Each instance of main.py will use a different company_list and a different Facebook account.
- Note that you may need to solve the captchas that appear on each login. Fresh Glassdoor accounts are less prone to captcha verification on login.
- Modify the FILE_PATH variable in main.
View the README.md file in the hadoop_analysis folder to see how the datasets can be analyzed.
Copy the entire folder into Eclipse in the VM. Before you export, ensure that the input and output folders are correct. Change the paths according to your own needs; otherwise, leave them as they are and export as a .jar file. Once the jar file has been compiled, scp it to the big Hadoop cluster.
- Command: scp -r <user@IPaddress:/home/user/>
There are 3 .jar files inside the JAR folder. To use them, send them to the big Hadoop cluster with the command specified above.
Ensure that the company that you want to analyze is in a folder called "Materials". Note that the analysis can only be done on 1 company at a time, so ensure there is only 1 company folder inside the "Materials" folder. Here is the command to scp the folder from your main machine to the big Hadoop cluster.
- Command: scp -r <user@IPaddress:/home/user/>
Next, put the folder into the Hadoop file system.
- Command: hadoop fs -put Materials /user/username/
Check the code for the input and output file paths. Below are the 3 different sets of directories you will need.
To run word count analysis
- hadoop fs -mkdir /user/username/wordcount/
- hadoop fs -mkdir /user/username/wordcount/output
To run rating count analysis
- hadoop fs -mkdir /user/username/ratingcount/
- hadoop fs -mkdir /user/username/ratingcount/output
To run sentiment count analysis
- hadoop fs -mkdir /user/username/sentimentcount/
- hadoop fs -mkdir /user/username/sentimentcount/output
Once all the set up above has been done, use the commands below to run the jar files.
To run Word count .jar
- hadoop jar WordCountAnalysis.jar WordCountAnalysis
To run rating count .jar
- hadoop jar RatingCountAnalysis.jar RatingCountAnalysis
To run sentiment count .jar
- hadoop jar SentimentCountAnalysis.jar SentimentCountAnalysis
Once the MapReduce process has finished, run one of the 3 commands below to view the output.
Word Count output
- command: hadoop fs -cat /user/username/wordcount/output/part-r-00000
Rating Count output
- command: hadoop fs -cat /user/username/ratingcount/output/part-r-00000
Sentiment Count output
- command: hadoop fs -cat /user/username/sentimentcount/output/part-r-00000
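
Since the Hadoop results are processed further in Python, the following is a minimal sketch of loading one of these outputs into a `Counter`. It assumes the jobs write Hadoop's default text output, with one `key<TAB>count` pair per line and integer counts; the function name `parse_mapreduce_output` and the sample rows are illustrative only and do not come from the project.

```python
from collections import Counter


def parse_mapreduce_output(lines):
    """Parse 'key<TAB>count' lines into a Counter.

    Assumes tab-separated keys and integer counts (Hadoop's default
    text output); adjust if the jobs use a different separator.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so that keys containing tabs stay intact.
        key, sep, value = line.rpartition("\t")
        if not sep:
            continue  # skip blank or malformed lines that have no tab
        counts[key] += int(value)
    return counts


if __name__ == "__main__":
    # Made-up sample rows standing in for a local copy of part-r-00000.
    sample = ["good\t120\n", "management\t85\n", "culture\t97\n"]
    for word, count in parse_mapreduce_output(sample).most_common(2):
        print(word, count)
```

To use it on real results, the output of the `hadoop fs -cat` commands above could be redirected to a local file and the open file object passed to the function.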
Follow all the steps above to carry out the different processes on a single company file.
- Note: If you want to run MapReduce for other companies, insert the company folder into the Materials folder, scp the folder to the big Hadoop cluster, and run the entire process again.
- pip install flask
- pip install wordcloud
To run the website, run visualization.py under the 2107Hadoopflask folder.