# PSET4 - Web Scraping

This week you will be scraping job postings from one of Mongolia's top job boards: Zangia.mn (formerly BizNetwork). We will collect the following features from the site:

- Job title
- Job description
- Job sector
- Salary range

If you go to https://www.zangia.mn/job/list you will see all the job listings. There are several pages of listings. The recommended process is:

1. Make a list of job post results.
2. Scrape the listing URLs from the job list pages (1 through n).
3. Use the resulting URL list to scrape the features to a dataframe.

The final dataframe should include each of the features above. Not every job post will have each feature, but the resulting data should be a clean dataframe with the data in the right location.
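
One way to keep every feature in its own column when a posting lacks a field is to normalize each scraped record to a fixed set of keys before building the dataframe. The sketch below is only an illustration: `COLUMNS`, `normalize_record`, and the sample records are hypothetical names and placeholder values, not part of the assignment.

```python
# Hypothetical column names matching the four features listed above
COLUMNS = ["Job Title", "Job Description", "Job Sector", "Salary range"]


def normalize_record(record):
    """Return a dict with every expected column, using None for missing features."""
    return {column: record.get(column) for column in COLUMNS}


# Placeholder records: the second one is missing two features
raw_records = [
    {"Job Title": "Title A", "Job Description": "Description A",
     "Job Sector": "Sector A", "Salary range": "1,000,000 - 2,000,000"},
    {"Job Title": "Title B", "Job Sector": "Sector B"},
]

rows = [normalize_record(record) for record in raw_records]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows, columns=COLUMNS)` should then give one row per posting, with missing features appearing as None/NaN in the correct column rather than shifting other values.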

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import statistics

### This PSET was extremely hard, to say the least. I learned a lot, and it was very practical and useful.
    I first took the URL of every 'ad' in 'ads' and saved them in a list called job_urls.
    With that list, I then wrote a loop that scrapes the needed information from every URL in job_urls.
    Getting the 'li' element and the <span> element in the second <div> was very hard, but I managed it.
    Then I cleaned the df by splitting the salary and averaging it, and by dropping NaN values and 'None' values.
    I also wrote some extra code out of curiosity.

In [2]:
n_pages = 2  # Change this to the desired number of pages

base_url = "https://www.zangia.mn/"  # Base URL of the website

job_urls = []  # List to store job URLs

for page_num in range(1, n_pages + 1):
    # Construct the URL for the current page (base_url already ends with '/')
    url = f"{base_url}job/list/pg.{page_num}"

    # Send an HTTP request to the page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all 'div' elements with class 'ad' and extract the first link in each
    ad_divs = soup.find_all('div', class_='ad')
    job_urls += [f"{base_url}{ad.find('a', href=True)['href']}" for ad in ad_divs]
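
Building URLs with f-strings is sensitive to leading and trailing slashes: when the base ends in `/`, prefixing the path with another `/` yields a double slash. A minimal sketch using the standard library's `urllib.parse.urljoin`, which resolves relative and root-relative links alike. The `href` values are made-up placeholders, since the exact form of the links on the site is not shown here.

```python
from urllib.parse import urljoin

base_url = "https://www.zangia.mn/"

# Plain string formatting keeps both slashes
naive_url = f"{base_url}/job/list/pg.1"
print(naive_url)  # https://www.zangia.mn//job/list/pg.1

# urljoin resolves the path against the base instead
page_url = urljoin(base_url, "job/list/pg.1")
print(page_url)  # https://www.zangia.mn/job/list/pg.1

# Hypothetical hrefs: relative, root-relative, and absolute
hrefs = ["job/example-id", "/job/example-id", "https://www.zangia.mn/job/example-id"]
resolved = [urljoin(base_url, href) for href in hrefs]
print(resolved)
```

All three placeholder links resolve to the same absolute URL, so the scraping loop would not need to know which form the site uses.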

In [3]:
job_data = []  # List to store scraped data

for job_url in job_urls:
    # Send an HTTP request to the job URL
    response = requests.get(job_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    job_title = soup.find('h3').get_text()

    salary_div = soup.find('div', class_='salary')
    salary_text = salary_div.get_text() if salary_div else ''
    
    # Check for the presence of "Тохиролцоно" ("Negotiable") in the salary text
    if "Тохиролцоно" in salary_text:
        continue  # Skip this job: a negotiable salary has no numeric range to average

    # Extract the text from the first <li> element within a <div class='section'>
    first_li_text = None
    section_divs = soup.find_all('div', class_='section')
    for section_div in section_divs:
        li_element = section_div.find('li')
        if li_element:
            first_li_text = li_element.get_text()
            break

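    # Extract the <span> text from the second <div> inside the first 'details' element,
    # leaving the sector as None when any part of that structure is missing
    second_div_text = None
    details_div = soup.find('div', class_='details')
    if details_div is not None:
        inner_divs = details_div.find_all('div')
        span = inner_divs[1].find('span') if len(inner_divs) > 1 else None
        second_div_text = span.get_text() if span is not None else None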

    # Append the extracted data to the list
    job_data.append({
        'Job Title': job_title,
        'Salary': salary_text,
        'Job Description': first_li_text,
        'Job Sector': second_div_text
    })

# Create a DataFrame from the collected data
df = pd.DataFrame(job_data)
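
Calls such as `soup.find(...)` return `None` when the element is absent, so chaining `.get_text()` directly raises an `AttributeError` on postings that lack a field. One possible guard is a small helper. `text_or_none` is a hypothetical name, and the `FakeTag` class only stands in for a BeautifulSoup tag so that the sketch runs without network access.

```python
def text_or_none(tag):
    """Return the stripped text of a tag-like object, or None if it is missing."""
    if tag is None:
        return None
    return tag.get_text(strip=True)


class FakeTag:
    """Stand-in for a BeautifulSoup Tag, exposing only get_text()."""

    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text


print(text_or_none(FakeTag("  Example title  ")))  # Example title
print(text_or_none(None))                          # None
```

In the scraping loop this could be used as, for example, `job_title = text_or_none(soup.find('h3'))`, so a missing element becomes a `None` entry instead of stopping the run.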

In [4]:
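def calculate_average_salary(salary_range):
    # Collect the leading number on each side of a range, e.g. "1,500,000 - 2,000,000"
    values = []
    for part in salary_range.split('-'):
        tokens = part.replace(',', '').split()
        if tokens and tokens[0].isdigit():
            values.append(int(tokens[0]))
    # Average what was found (a single figure is kept as-is); NaN marks an unparseable salary
    return statistics.mean(values) if values else float('nan')

# Apply the function to the "Salary" column and convert to float
df['Salary'] = df['Salary'].apply(calculate_average_salary).astype(float)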

df = df.dropna()
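
To make the averaging step concrete, here is a worked sketch that extracts the numbers with a regular expression instead of splitting on the dash. The sample strings are assumed formats, not values copied from the site, and `parse_average_salary` is a hypothetical helper rather than the function used in this notebook.

```python
import re


def parse_average_salary(salary_text):
    """Average all comma-grouped integers found in the text; None if there are none."""
    numbers = [int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", salary_text)]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


print(parse_average_salary("1,500,000 - 2,100,000"))  # 1800000.0
print(parse_average_salary("2,300,000"))              # 2300000.0
print(parse_average_salary(""))                       # None
```

Dividing by the number of values found, rather than by a fixed 2, means a posting that lists a single figure is not halved.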

In [5]:
df

|  | Job Title | Salary | Job Description | Job Sector |
|---|---|---|---|---|
| 0 | Авто - Цахилгаанчин | 2300000.0 | Автын механик буюу автын цахилгаанчин мэргэжил... | Уул уурхай |
| 1 | Мэдээлэл технологийн мэргэжилтэн | 1950000.0 | Сервер, хяналын камер болон компьютер техник х... | Мэдээлэл технологи, Харилцаа холбоо |
| 2 | Харилцагчийн дэмжлэг үйлчилгээний ажилтан (Ня-бо) | 1650000.0 | Харилцагчтай холбогдон зөвлөгөө өгөх, асуудлыг... | Банк, санхүү, нягтлан бодох бүртгэл |
| 4 | По-ны жолооч /Агуулах/ | 1950000.0 | Бэлэн бүтээгдэхүүн, авдар шилийг эвдрэл гэмтэл... | Үйлдвэрлэл, Дэд бүтэц |
| 8 | Засвар төлөвлөлтийн туслах инженер /хотод ажил... | 1650000.0 | Техникийн бэлэн байдлыг сайжруулах талаар эвдр... | Уул уурхай |
| ... | ... | ... | ... | ... |
| 133 | Дампын оператор | 3500000.0 | Уул уурхайн хүнд машин механизмаар хөрс зөөвөр... | Уул уурхай |
| 134 | Аюулгүй байдлын ажилтан ХОТОД | 1100000.0 | Обьектын аюулгүй байдлыг хангаж ажиллана | Үйлчилгээ |
| 135 | Ээлжийн дарга, оператор | 1350000.0 | 24/48 цагийн хувиараар ажиллана | Үйлчилгээ |
| 136 | Шуурхай эргүүлийн бүлгийн аюулгүй байдлын ажилтан | 1350000.0 | 24/48 цагийн хувиараар ажиллана | Харуул хамгаалалт |


### If my df is clean, it should contain no NaN values and no "None" strings

In [6]:
nan_values = df.isna()

if nan_values.any().any():
    print("There are NaN values in the DataFrame.")
else:
    print("There are no NaN values in the DataFrame.")

There are no NaN values in the DataFrame.


In [7]:
# astype(str) converts every cell to text without relying on the deprecated DataFrame.applymap
contains_none = df.astype(str).apply(lambda col: col.str.contains('None')).any().any()

if contains_none:
    print("The word 'None' is present in the DataFrame.")
else:
    print("The word 'None' is not present in the DataFrame.")

The word 'None' is not present in the DataFrame.
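
Note that a substring search for 'None' would also flag legitimate text that merely contains those letters. A stricter check can be run on the scraped records before the DataFrame is built. This sketch assumes records shaped like the dictionaries in `job_data`, with made-up placeholder values, and `has_missing_value` is a hypothetical helper.

```python
def has_missing_value(records):
    """True if any field is None, empty, or exactly the string 'None'."""
    return any(
        value is None or str(value).strip() in ("", "None")
        for record in records
        for value in record.values()
    )


# Placeholder records: the first contains 'None' only as part of a longer word
clean = [{"Job Title": "Nonesuch Ltd. driver", "Salary": 1500000.0}]
dirty = [{"Job Title": "Example title", "Salary": None}]

print(has_missing_value(clean))  # False
print(has_missing_value(dirty))  # True
```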


#### This code is just for fun: I tried scraping information from the listing pages only, instead of visiting each job URL

In [8]:
n_pages = 10  # Change this to the desired number of pages

job_titles = []
job_companies = [] 
job_salaries = []
job_locations = []

for page_num in range(1, n_pages + 1):
    # Construct the URL for the current page
    url = f"https://www.zangia.mn/job/list/pg.{page_num}"

    # Send an HTTP request to the page
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    companies = soup.find_all("div", {"class": "company"})
    ads = soup.find_all("div", {"class": "ad"})
    # Iterate over ads and companies in parallel (assumes both lists have the same length and order)
    for ad, company in zip(ads, companies):
        job_title = ad.find('b').get_text()
        job_titles.append(job_title)

        company_name = company.find('b').get_text()
        job_companies.append(company_name)

        job_salary = ad.find('span', {"class": "fsal"}).get_text()
        
        # Check if the text "Тохиролцоно" ("Negotiable") is in the salary, and if not, process it
        if "Тохиролцоно" not in job_salary:
            # Remove unwanted characters and spaces
            job_salary = job_salary.replace('MNT', '').replace(',', '').strip()
            
            # Split salary range and calculate the average
            salary_parts = job_salary.split('-')
            if len(salary_parts) == 2:
                min_salary = int(salary_parts[0])
                max_salary = int(salary_parts[1])
                avg_salary = (min_salary + max_salary) / 2
            else:
                avg_salary = int(job_salary)
            job_salaries.append(avg_salary)
        else:
            job_salaries.append(0)

        # Find the job location with error handling
        job_location = None
        floc_element = ad.find('span', {"class": "floc"})
        if floc_element:
            job_location = floc_element.get_text(strip=True).split('"')[-1].strip()
        job_locations.append(job_location if job_location else "N/A")  # This ensures that there are no "None" in the job_location column

# Create a DataFrame
df = pd.DataFrame({"job_titles": job_titles, "job_salary": job_salaries, "job_company": job_companies, "job_location": job_locations})

# Drop rows with salary value 0
df = df[df["job_salary"] != 0]

# Reset the DataFrame index
df.reset_index(drop=True, inplace=True)
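
The loop in this cell pairs `ads` with `companies` using `zip`, which silently stops at the shorter list. If a page ever returned a different number of the two elements, titles and companies could be misaligned without any error. A minimal sketch of that behaviour with placeholder lists; `pair_checked` is a hypothetical helper, not part of the notebook.

```python
def pair_checked(left, right):
    """Pair two lists element by element, refusing to truncate on a length mismatch."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} vs {len(right)}")
    return list(zip(left, right))


ads = ["ad 1", "ad 2", "ad 3"]
companies = ["company 1", "company 2"]  # one element short

# Plain zip drops the unmatched ad without warning
print(list(zip(ads, companies)))  # [('ad 1', 'company 1'), ('ad 2', 'company 2')]

# The checked version reports the problem instead
try:
    pair_checked(ads, companies)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)  # length mismatch: 3 vs 2
```

On Python 3.10 or later, `zip(ads, companies, strict=True)` gives a similar guarantee without a helper.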

In [9]:
df

|  | job_titles | job_salary | job_company | job_location |
|---|---|---|---|---|
| 0 | Худалдааны зөвлөх | 1650000.0 | Кашмер Холдинг ХХК | Улаанбаатар хот |
| 1 | Авто - Цахилгаанчин | 2300000.0 | Жи Си Ар Монголиа | Өмнөговь аймаг |
| 2 | Мэдээлэл технологийн мэргэжилтэн | 1950000.0 | Цахиур ХХК | Сонгинохайрхан дүүрэг |
| 3 | Харилцагчийн дэмжлэг үйлчилгээний ажилтан (Ня-бо) | 1650000.0 | Эрин системс ХХК | Улаанбаатар хот, Хан-Уул дүүрэг |
| 4 | Автын механик | 2300000.0 | UB Auto Center | Улаанбаатар хот, Баянзүрх дүүрэг |
| ... | ... | ... | ... | ... |
| 501 | Зээл төлүүлэлтийн мэргэжилтэн | 2300000.0 | Таван Богд Финанс ХХК | Улаанбаатар хот |
| 502 | Харилцааны менежер- УБ хот | 2300000.0 | Таван Богд Финанс ХХК | Улаанбаатар хот |
| 503 | Жолооч /Гэрээт/ | 1350000.0 | Таван Богд Финанс ХХК | Улаанбаатар хот, Хан-Уул дүүрэг |
| 504 | Тусгай активын мэргэжилтэн | 1350000.0 | Омни Капитал ББСБ | Улаанбаатар хот, Сүхбаатар дүүрэг |
