Well, the purpose of this analysis is simply to satisfy my curiosity: how tough is the job market for CS students? 😮
😎 After spending a weekend on it, I scraped the data and put this analysis together.
😎 Let's look at the results first, and then I'll show you how to reproduce them.
Interpretation
1. The job word cloud: the bigger the word, the higher the demand.
- Disappointingly, it showed a stark absence of CS-related roles.
- Look, the only things we saw at a glance were positions like assistant and the like.
2. The job density map: the denser the area, the more jobs.
- I was surprised to find two job centers: St. John's and Labrador. I thought there would be only St. John's.
- Areas near St. John's, like Mount Pearl and Paradise, also showed significant opportunities.
3. HDBSCAN: these clusters illustrated the geographic spread of job opportunities.
- It highlighted the concentration of jobs in St. John's, marked by vibrant red and orange dots on the map.
- Black dots, likely representing outliers, indicated unique jobs in more remote locations.
4. LDA for job requirement analysis: to understand the nature of these jobs (a minimal sketch follows this list).
- Topic 1 seems to revolve around positions that require a degree of responsibility and coordination, likely in an administrative or managerial capacity 😎.
- The presence of words like "assistant," "program," "manager," and "officer" suggests roles that involve oversight, planning, and execution of various programs or projects.
- The terms "health" and "safety" point to roles in health and safety management, while "training" and "services" might imply roles in the service industry or corporate training.
- "Valid license" suggests jobs that require certification or specific qualifications.
Method
- Scrape the job post data
- I already built a way, using Selenium, to scrape the data from a popular job posting website. >data_scraping.py
- You can certainly use it as a reference for scraping data from other job websites.
- I have to say, Selenium works far better than BeautifulSoup here, especially for this kind of dynamically updated website that also has an anti-scraping mechanism. A minimal sketch of the idea is below.
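Here is a minimal Selenium sketch of that approach: let the browser render the page, wait for the job cards to appear, then read them. The URL and CSS selectors are placeholders, not the actual site; the real scraper is >data_scraping.py.

```python
# Minimal Selenium sketch: render the page, wait for the dynamic content, then read it.
# The URL and CSS selectors below are placeholders, not the actual site.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example-job-board.com/search?q=computer+science")  # placeholder URL

# Wait for the dynamically rendered job cards instead of parsing static HTML.
cards = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".job-card"))  # placeholder selector
)

jobs = [
    {
        "title": card.find_element(By.CSS_SELECTOR, ".job-title").text,        # placeholder selector
        "location": card.find_element(By.CSS_SELECTOR, ".job-location").text,  # placeholder selector
    }
    for card in cards
]

driver.quit()
print(f"Scraped {len(jobs)} postings")
```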
- Clean the scraped data
- This is an essential step, as the scraped data contains noise and impurities that will definitely affect the results.
- In this case, I spotted errors and areas that needed adjusting, like moving job type values out of the salary column into the correct place (sketched below).
- Try >data_clean.py
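As an illustration of that particular fix, here is a small pandas sketch that moves job-type strings out of the salary column. The file and column names are assumptions; see >data_clean.py for the actual cleaning logic.

```python
# Sketch of one cleaning fix: job-type strings that ended up in the salary column.
# "jobs_raw.csv", "salary", and "job_type" are assumed names for illustration.
import pandas as pd

df = pd.read_csv("jobs_raw.csv")

job_types = ["Full-time", "Part-time", "Contract", "Temporary", "Seasonal"]
misplaced = df["salary"].isin(job_types)

# Move the misplaced values into job_type and blank out the salary cell.
df.loc[misplaced, "job_type"] = df.loc[misplaced, "salary"]
df.loc[misplaced, "salary"] = pd.NA

df.to_csv("jobs_clean.csv", index=False)
```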
- Take a quick visualization
- In this step, >data_rough_viz.py gives you a general visualization by generating the job word cloud and the job density map.
- A little tip: when processing the data, consider techniques that improve efficiency.
- For example, when converting cities to longitude and latitude, server limitations meant I could only process one record every 10 seconds.
- Converting records one by one, 1,000 of them would take nearly 3 hours, which is far too long.
- But by caching each city and its coordinates the first time it is resolved, later records can be looked up locally instead of going back to the server (see the sketch below).
- With that change, the whole run takes less than a minute.
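Here is a sketch of that caching idea: geocode each distinct city once, store the result, and reuse it for every later row. The geocoder (geopy's Nominatim), the file names, and the 10-second pause are assumptions for illustration.

```python
# Sketch of the geocoding cache: each distinct city hits the server only once.
# geopy's Nominatim, the cache file name, and the 10 s pause are assumptions.
import json
import os
import time

from geopy.geocoders import Nominatim

CACHE_FILE = "geocode_cache.json"
cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}
geocoder = Nominatim(user_agent="nl-job-analysis")

def to_coords(city):
    """Return [lat, lon] for a city, consulting the cache before the server."""
    if city not in cache:
        loc = geocoder.geocode(f"{city}, Newfoundland and Labrador, Canada")
        cache[city] = [loc.latitude, loc.longitude] if loc else None
        time.sleep(10)  # only cache misses pay the ~10 s server limit
    return cache[city]

print(to_coords("St. John's"))
print(to_coords("St. John's"))  # second call is served from the cache, no wait

# Persist the cache so later runs finish in well under a minute.
with open(CACHE_FILE, "w") as f:
    json.dump(cache, f)
```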
- Data mining and more visualization
- I have already implemented k-means, DBSCAN, HDBSCAN, and topic modeling in >data_mining_viz.py
- In this case, I used HDBSCAN and topic modeling, since I found these two more suitable; a minimal HDBSCAN sketch follows.
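For reference, here is a minimal HDBSCAN sketch over the job coordinates, using the hdbscan package with a haversine metric so clusters follow geographic distance. The file and column names are assumptions; the full k-means / DBSCAN / HDBSCAN / topic modeling code is in >data_mining_viz.py.

```python
# Minimal HDBSCAN sketch over job coordinates.
# "jobs_clean.csv", "latitude", and "longitude" are assumed names for illustration.
import hdbscan
import numpy as np
import pandas as pd

df = pd.read_csv("jobs_clean.csv").dropna(subset=["latitude", "longitude"])

# Haversine distance expects radians, so clusters reflect real geographic distance.
coords = np.radians(df[["latitude", "longitude"]].to_numpy())

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric="haversine")
df["cluster"] = clusterer.fit_predict(coords)

# Label -1 marks noise points: the isolated postings in remote locations.
print(df["cluster"].value_counts())
```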