## Text Mining CA
#### This text mining assignment focuses on job market mining and follows these general steps:
- Data extraction with a web crawler.
- Text preprocessing.
- Apply basic text mining methods and list the conclusions.
- Apply advanced methods (word2vec) to obtain further findings.
 
In this part, I mainly wrote the web crawler for data extraction and also refined the raw text data.

From https://www.mycareersfuture.sg/, about 220 job results (the website is d) for the query "machine learning" are collected in this sample, which contains 12 columns.

<u>BeautifulSoup</u> and <u>selenium</u> are used to crawl the data from the web pages, and the JSON data is fetched through API requests.

This notebook serves as a guideline that shows the basic steps.

**Attention: download chromedriver for Method 1, and replace the file path in google_chrome_driver_path in Method 1.**

In [2]:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import math
import requests
import pandas as pd

### Method 1: Open the root URL "https://www.mycareersfuture.sg" and type the query into the search bar to get the results

In [3]:
# Selenium needs a browser driver; see the two links below for download help.
# https://selenium-python.readthedocs.io/installation.html#drivers
# https://sites.google.com/a/chromium.org/chromedriver/downloads
google_chrome_driver_path = './chromedriver'
root_url = 'https://www.mycareersfuture.sg'

# Note: this cell uses the Selenium 3 API; Selenium 4 removed the find_element_by_* helpers
# in favor of find_element(By.NAME, ...) and find_element(By.ID, ...).
driver = webdriver.Chrome(google_chrome_driver_path)
driver.get(root_url)
# wait for the page to finish loading
time.sleep(2)
# find the search bar and type "machine learning"
driver.find_element_by_name('search-text').send_keys('machine learning')
time.sleep(1)
# find and click the search button
driver.find_element_by_id('search-button').click()
time.sleep(1)
# now get the HTML of the search result page
html = driver.page_source

### Method 2: Put the query content directly into the URL
See a URL such as https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=0
- Set *search=* to the content you want to query, such as 'machine%20learning'. "%20" stands for a space under the URL (percent) encoding rule; refer to https://zh.wikipedia.org/wiki/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81
- Set *page=* to the result page you want to jump to. Watch out for pages with **no result**.
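
The two bullet points above can be sketched with the standard library instead of typing "%20" by hand. This is a minimal sketch: `build_search_url` is a hypothetical helper that is not part of the notebook, and the parameter names are simply copied from the example URL.

```python
from urllib.parse import quote, urlencode

# Base search URL taken from the example URL above.
SEARCH_URL = 'https://www.mycareersfuture.sg/search'


def build_search_url(query, page, sort_by='new_posting_date'):
    """Build one search-result URL (hypothetical helper for illustration)."""
    # quote_via=quote encodes a space as %20; the default (quote_plus) would emit '+'.
    params = urlencode(
        {'search': query, 'sortBy': sort_by, 'page': page},
        quote_via=quote,
    )
    return '{}?{}'.format(SEARCH_URL, params)


first_page = build_search_url('machine learning', 0)
print(first_page)
```

Building the query string this way also keeps other special characters in the search text (such as "&" or "+") from breaking the URL.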

In [2]:
basic_url = 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page={}'
urls = []
# construct 20 pages
for page in range(0, 20):
    query_url = basic_url.format(page)
    urls.append(query_url)
# driver.get(urls[0])
# html = driver.page_source


### Start to extract the contents

#### 1. Define the fetch function

In [4]:
def fetch_data(url, head, payload):
    response = requests.get(url, headers=head, params=payload)
    if response.status_code == 200:
        return response.json()
    else:
        return {'info': 'error', 'error_code': response.status_code}

#### 2. Get the number of query results

In [5]:
google_chrome_driver_path = '/Users/alexjzy/Desktop/Py-Projects/text_mining/chromedriver'
driver = webdriver.Chrome(google_chrome_driver_path)
query_url = 'https://api.mycareersfuture.sg/jobs?search=machine%20learning&sortBy=new_posting_date'
response = fetch_data(query_url, {}, {})
# 'count' is the total number of matches; dividing by 20 assumes 20 results per page
result_num = math.ceil(response['count'] / 20)
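
The page count computed in this cell is a ceiling division. A small sketch of the arithmetic follows; `page_count` and `RESULTS_PER_PAGE` are hypothetical names, and the page size of 20 is an assumption inferred from the division by 20 in the cell.

```python
import math

# Assumption: 20 results per page, inferred from the division by 20 above.
RESULTS_PER_PAGE = 20


def page_count(total_results, per_page=RESULTS_PER_PAGE):
    """Number of result pages needed to list total_results items."""
    return math.ceil(total_results / per_page)


print(page_count(220))  # 11
print(page_count(221))  # 12: one extra item needs one extra page
print(page_count(0))    # 0: no results, no pages
```

With about 220 results this gives 11 pages, which is consistent with the output printed further down in the notebook.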

#### 3. Since the results are paginated, use the total number of pages to build the URL of every result page


In [6]:
basic_url = 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page={}'
urls = []
# construct one URL per result page
for page in range(0, result_num):
    query_url = basic_url.format(page)
    urls.append(query_url)
print(result_num)
print("urls shape: ", len(urls))

11
urls shape:  11


In [7]:
query_url = 'https://api.mycareersfuture.sg/jobs?search=machine%20learning&sortBy=new_posting_date'
response = fetch_data(query_url, {}, {})
result_num = math.ceil(response['count']/20)

In [8]:
urls # all the result in pages.

['https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=0',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=1',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=2',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=3',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=4',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=5',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=6',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=7',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=8',
 'https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=9',
 'https://

#### 4. Use the uuid in every card to get the detail of the job description

In [9]:
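def get_job_description(uuid):
    # fetch the job detail JSON for one posting, identified by its uuid
    api_basic = 'https://api.mycareersfuture.sg/job/{}'
    api_jd_url = api_basic.format(uuid)
    job = fetch_data(api_jd_url, {}, {})  # parsed JSON response as a dict
    # strip the HTML tags; the parser is named explicitly to avoid the bs4 warning
    jd = BeautifulSoup(job['job_description'], 'html.parser').get_text(strip=True)
    jr = BeautifulSoup(str(job['other_requirements']), 'html.parser').get_text(strip=True)
    # keep only the skill names
    jsk = [item['skill'] for item in job['skills']]
    sal_max = job['max_monthly_salary']
    sal_min = job['min_monthly_salary']
    return jd, jr, jsk, sal_max, sal_min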
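
Because fetch_data returns an error dict instead of raising, the key lookups in get_job_description would fail with a KeyError on a failed request. The sketch below shows one possible guard. `extract_job_fields` is a hypothetical helper, the key names mirror those used above, and the payloads are made up for illustration; it also assumes that a successful response never carries `'info': 'error'`.

```python
def extract_job_fields(payload):
    """Pull skills and salary bounds out of a job-detail payload.

    Returns None when the payload is the error dict produced by fetch_data.
    """
    if payload.get('info') == 'error':
        return None
    skills = [item['skill'] for item in payload.get('skills', [])]
    return {
        'job_skills': skills,
        'salary_min': payload.get('min_monthly_salary'),
        'salary_max': payload.get('max_monthly_salary'),
    }


# Made-up payloads for illustration only; they are not real API responses.
ok_payload = {
    'skills': [{'skill': 'Python'}, {'skill': 'Machine Learning'}],
    'min_monthly_salary': 5000,
    'max_monthly_salary': 8000,
}
error_payload = {'info': 'error', 'error_code': 404}

print(extract_job_fields(ok_payload))
print(extract_job_fields(error_payload))
```

Returning None lets the caller skip a posting whose detail request failed rather than stopping the whole crawl.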
    

#### 5. Wrap the fields extracted from each card into a dict and return it

In [10]:
def get_detail(card):
    def text_or_none(tag, name):
        # text of the first matching element, or None if the card lacks it
        node = card.find(tag, {"name": name})
        return node.get_text() if node is not None else None

    company = card.find("p", {"name": "company"}).get_text()
    job_title = card.find("h1", {"name": "job_title"}).get_text()

    # extract the optional fields
    location = text_or_none("p", "location")
    employment_type = text_or_none("p", "employment_type")
    seniority = text_or_none("p", "seniority")
    category = text_or_none("p", "category")

    # the uuid is the last '-'-separated part of the card link; use it to fetch
    # the job description and requirements, which are the raw text
    job_uuid = card.find("a", href=True)['href'].split('-')[-1]
    job_description, job_requirement, job_skills, salary_max, salary_min = get_job_description(job_uuid)
    return {
        "company": company,
        "job_title": job_title,
        "location": location,
        "employment_type": employment_type,
        "seniority": seniority,
        "category": category,
        "job_description": job_description,
        "job_requirement": job_requirement,
        "job_skills": job_skills,
        "job_uuid": job_uuid,
        "salary_min": salary_min,
        "salary_max": salary_max
    }
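
The uuid lookup in get_detail relies on the job link ending with the uuid after its last hyphen. A minimal sketch of that string operation follows; `uuid_from_href` is a hypothetical helper and the sample link is invented, so the real path layout on the site may differ.

```python
def uuid_from_href(href):
    """Return the text after the last '-' in a job link, as get_detail does."""
    return href.split('-')[-1]


# Hypothetical link for illustration only.
sample_href = '/job/machine-learning-engineer-example-company-0123456789abcdef'
print(uuid_from_href(sample_href))
```

This only works while the uuid itself contains no hyphen and nothing (such as a query string) follows it in the link.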

#### 6. Iterate over the cards in the card list

In [11]:
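def get_card_info(page_url, res):
    driver.get(page_url)
    time.sleep(2)  # wait for the page to finish rendering
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    card_jobs = soup.find("div", {"class": "card-list"})
    if card_jobs is None:
        # no card list on this page (for example a "no result" page), so skip it
        return
    cards = card_jobs.find_all("div", {"class": "card relative"})
    for card in cards:
        res.append(get_detail(card))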

#### 7. Get the results and convert them to a DataFrame

In [None]:
result = []
for url in urls:
    get_card_info(url, result)
career_res = pd.DataFrame.from_dict(result)


### Start to refine and modify the dataframe

In [None]:
career_res['job_skills'] = career_res.job_skills.apply(lambda x: ', '.join(x))

In [None]:
# strip any remaining HTML tags from the raw text columns
career_res["job_description"] = \
career_res["job_description"].apply(lambda jd: BeautifulSoup(jd, 'html.parser').get_text(strip=True))

career_res["job_requirement"] = \
career_res["job_requirement"].apply(lambda x: BeautifulSoup(str(x), 'html.parser').get_text(strip=True))

In [None]:
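# category may be None when the card has no category, so only split real strings
career_res["category"] = career_res.category.apply(lambda x: ','.join(x.split('/ ')) if isinstance(x, str) else x)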

In [None]:
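# drop the trailing '...' from the text; leave missing values untouched
career_res["employment_type"] = career_res.employment_type.apply(lambda x: x.split('...')[0] if isinstance(x, str) else x)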
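
The two string clean-ups above can also be written as small named functions, which makes them easy to check on sample values before applying them to the DataFrame. This is a sketch under assumptions: both helper names are hypothetical, the sample values are invented, and `normalize_category` is a variant that additionally trims the whitespace around each part.

```python
def normalize_category(value):
    """Turn 'A / B' into 'A,B'; leave missing (non-string) values untouched."""
    if not isinstance(value, str):
        return value
    return ','.join(part.strip() for part in value.split('/'))


def trim_employment_type(value):
    """Keep only the text before a trailing '...'."""
    if not isinstance(value, str):
        return value
    return value.split('...')[0]


# Invented sample values for illustration only.
print(normalize_category('Engineering / Information Technology'))
print(trim_employment_type('Full Time...'))
print(normalize_category(None))
```

Either function can then be passed directly to `apply`, for example `career_res.category.apply(normalize_category)`.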

In [None]:
career_res.head()

#### Save the data to CSV

In [None]:
career_res.to_csv('mycareersfuture.csv')