MonsterCrawler

Description
Crawler for job description from job search engine monster.de. The job descriptions then were analyzed using Rapidminer: We built a Document-Term-Matrix for the whole corpus of real job offfers and for fictional job descriptions of our "dreamjobs". Then we used Cosine-Similarity to find the Job offers most similar to our dream-jobs.

Context
Master's programme Data Science & Business Analytics
Lecture: Introduction to Data Science
University of Media, Stuttgart (DE)

Goal / Task
Come up with a use case for clustering, classification, or text analysis and implement a proof of concept using a self-service analytics tool like RapidMiner.

Authors
Sanna and me (dynobo)

Timeline
Mar. 2017 - Apr. 2017

Repo
Web crawler implemented with Scrapy in Python; RapidMiner workflows for data cleaning, preparation, and similarity search.


Web-Scraper

Install

Install dependencies:

  • scrapy
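
Scrapy can be installed with pip, for example:

pip install scrapy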

Configure

In /spiders/jobs_spider.py, adjust the url variable in the linkSpider class to get results for the desired region and keywords.
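
For illustration, the relevant part of linkSpider might look roughly like this (a minimal sketch, not the exact code in the repo; the URL and the CSS selector are placeholders):

import scrapy

class linkSpider(scrapy.Spider):
    name = "links"

    # Adjust this search-results URL to the desired region and keywords
    url = "https://www.monster.de/jobs/suche/?q=Data-Scientist&where=Stuttgart"

    def start_requests(self):
        yield scrapy.Request(self.url, callback=self.parse)

    def parse(self, response):
        # Collect the links to the individual job offers on the results page
        for href in response.css("a.job-title::attr(href)").getall():
            yield {"link": response.urljoin(href)}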

Run

First, extract the links to the job offers from the search results:

scrapy crawl links -a search=datascience -o datascience.json
scrapy crawl links -a search=itinstuttgart -o itinstuttgart.json
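
Here -a search=... passes a spider argument (Scrapy exposes it as self.search on the spider instance), and -o writes the scraped items via Scrapy's feed export. Assuming the spider yields items as sketched above, the JSON output would look roughly like:

[
    {"link": "https://www.monster.de/job-angebote/..."},
    {"link": "https://www.monster.de/job-angebote/..."}
]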

Second, extract the job descriptions:

scrapy crawl jobs -a search=datascience -o datascience.xml
scrapy crawl jobs -a search=itinstuttgart -o itinstuttgart.xml

Third, combine the XML files:

xml_grep --pretty_print indented --wrap items --descr '' --cond "item" *.xml > jobs.xml
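
xml_grep is part of the Perl XML::Twig distribution; on Debian-based systems, for example, it ships in the xml-twig-tools package:

sudo apt-get install xml-twig-tools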
