![wuzzuf logo](https://images.wuzzuf-data.net/files/training_programs/providers/wuzzuf-v1.png)

# Wuzzuf website data scraping
**Note: All the links were collected on 20-12-2022**

In this notebook, we will scrape the data for all the job postings on the Wuzzuf website.

In [None]:
# Main Libraries
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import threading
import csv

In [None]:
# read the CSV file of collected job links; each link is scraped below
df = pd.read_csv('wuzzuf_job_links_20-12-2022.csv', encoding='utf-16')
errors_urls = [] # collects the URLs that fail to scrape

In [None]:
# extract all the fields I want from a job details page
def get_data(title, link, soup):
    # Get Job Types
    job_types = []
    try:
        for j_type in soup.find('div', class_='css-11rcwxl').find_all('a'):
            job_types.append(j_type.text)
        job_types = ';'.join(job_types)
    except Exception:
        job_types = None


    # Get Job Categories
    try:
        job_categories = [] 
        for category in soup.find('ul', class_='css-h5dsne').find_all('li'):
            job_categories.append(category.text)
        job_categories = ';'.join(job_categories)
    except Exception:
        job_categories = None


    # Get Job Description
    try:
        job_description = soup.find('div', class_='css-1uobp1k').text
    except Exception:
        job_description = None


    # Get Job Requirements
    try:
        job_requirements = soup.find('div', class_='css-1t5f0fr').text
    except Exception:
        job_requirements = None


    # Get Job Location
    try:
        job_location = soup.find('strong', class_='css-9geu3q').find(text=True, recursive=False)
    except Exception:
        job_location = None


    # Get Company Name
    try:
        company_name = soup.find('strong', class_='css-9geu3q').find(text=True)
    except Exception:
        company_name = None


    # Get Company Location
    try:
        company_location = soup.find('section', class_='css-1rhgoyg').find('span', class_='css-nhiaul').text.split('•')[0]
    except Exception:
        company_location = None


    # Get Company Size
    try:
        company_size = soup.find('section', class_='css-1rhgoyg').find('span', class_='css-nhiaul').text.split('•')[1]
    except Exception:
        company_size = None


    # Get Company Field
    try:
        company_field = soup.find('section', class_='css-1rhgoyg').find('span', class_='css-xilyze').text
    except Exception:
        company_field = None


    # Get Job Details (Experience Needed - Career Level - Education Level - Gender - salary)
    experience_needed = None
    career_level = None
    education_level = None
    salary = None
    gender = 'both'
    try:
        job_details_section = soup.find('section', class_='css-3kx5e2').find_all('div', class_='css-rcl8e5')
        for detail in job_details_section:
            try:
                detail_title = detail.find('span', class_='css-wn0avc').text
                if 'Experience' in detail_title:
                    experience_needed = detail.find('span', class_='css-4xky9y').text
                elif 'Career' in detail_title:
                    career_level = detail.find('span', class_='css-4xky9y').text
                elif 'Education' in detail_title:
                    education_level = detail.find('span', class_='css-4xky9y').text
                elif 'Gender' in detail_title:
                    gender = detail.find('span', class_='css-4xky9y').text
                elif 'Salary' in detail_title:
                    salary = detail.find('span', class_='css-4xky9y').text
            except Exception:
                pass
    except Exception:
        pass


    # Get Skills and Tools
    try:
        skills_and_tools = []
        for skill in soup.find('div', class_='css-s2o0yh').find_all('span', class_='css-158icaa'):
            skills_and_tools.append(skill.text)
        skills_and_tools = ';'.join(skills_and_tools)
    except Exception:
        skills_and_tools = None

    data = {
        'job_title': title,
        'job_type': job_types,
        'job_categories': job_categories,
        'job_description': job_description,
        'job_requirements': job_requirements,
        'job_location': job_location,
        'company_name': company_name,
        'company_location': company_location,
        'company_field': company_field,
        'company_size': company_size,
        'experience_needed': experience_needed,
        'career_level': career_level,
        'education_level': education_level,
        'gender': gender,
        'salary': salary,
        'skills_and_tools': skills_and_tools,
        'link': link
    }
    return data


### The Thread Function
* I am using Selenium because Wuzzuf loads the job data with JavaScript after the initial page load, so the data has to finish loading before I can scrape it. The 5-second wait I gave the driver was chosen after a series of trials to find the value that worked best.
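
One way to make this wait explicit is to poll the page until the expected content shows up instead of relying on a fixed delay. Note that `driver.implicitly_wait(5)` only sets how long element lookups may block, so on its own it does not delay reading `page_source`. The sketch below is a minimal, library-agnostic illustration of the idea: `wait_for_content`, `get_html`, and `is_ready` are hypothetical names, not part of Selenium or of this notebook, and a canned list of pages stands in for the real driver.

```python
import time


def wait_for_content(get_html, is_ready, timeout=5.0, poll_interval=0.25):
    """Poll get_html() until is_ready(html) is true or the timeout elapses.

    Returns the last HTML seen and a flag saying whether it was ready.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_ready(html):
            return html, True
        if time.monotonic() >= deadline:
            return html, False
        time.sleep(poll_interval)


# Stand-in for driver.page_source: the categories list appears on the third read.
pages = iter(['<html></html>', '<html></html>', '<ul class="css-h5dsne"></ul>'])
html, ready = wait_for_content(
    get_html=lambda: next(pages),
    is_ready=lambda h: 'css-h5dsne' in h,
    timeout=2.0,
    poll_interval=0.01,
)
print(ready)
```

With a real driver, `get_html` would be something like `lambda: driver.page_source`. Selenium also ships `WebDriverWait` for this purpose, which may be a better fit than a hand-rolled loop.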

In [None]:
# the task that each thread runs concurrently
def start_driver_thread(mini_df, thread_number):
    # start a headless Chrome web driver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=chrome_options)
    
    # create one CSV file per thread to hold its data (newline='' is what the csv module expects)
    csv_file = open(f'data/wuzzuf_job_data_20-12-2022_{thread_number}.csv', 'a+', encoding="utf-16", newline='')
    writer = csv.DictWriter(csv_file, fieldnames=['job_title', 'job_type', 'job_categories', 'job_description', 'job_requirements', 'job_location', 'company_name', 'company_location', 'company_field', 'company_size', 'experience_needed', 'career_level', 'education_level', 'gender', 'salary', 'skills_and_tools', 'link'], delimiter ='~')
    writer.writeheader()
    
    # the next two variables are for progress reporting only: the loop position and the number of skipped URLs
    loop_index = 1
    skipped_urls = 0
    
    # start collecting the data
    for index, row in mini_df.iterrows():
        # this loop keeps the script on the same URL until the JavaScript-loaded data is present
        while True:
            try:
                print(f'Thread #{thread_number}: Dataframe Index: {index} - Loop Index {loop_index}')

                # load the page in the driver and parse its HTML into a soup object
                driver.get(row['link'])
                driver.implicitly_wait(5)
                html = driver.page_source
                soup = BeautifulSoup(html, 'lxml')

                # collect the page data using get_data function
                page_data = get_data(title=row['title'], link=row['link'], soup=soup)

                # sometimes the data takes more than 5 seconds to load; if the categories are missing, load the same URL again
                if not page_data['job_categories']:
                    print(f'Thread #{thread_number}: Retrying (no category) - Dataframe Index: {index} - Loop Index {loop_index} - URL: {row["link"]}')
                    continue
                
                # add data to the csv file
                writer.writerow(page_data)
                
                loop_index += 1
                break
            except Exception as e:
                skipped_urls += 1
                print(f'Thread #{thread_number}: Skip Number {skipped_urls} - Dataframe Index: {index} - Loop Index: {loop_index} - URL: {row["link"]} \n {e}')
                errors_urls.append({
                    'link': row['link'],
                    'title': row['title']
                })
                break
    csv_file.close() # flush and close this thread's CSV file
    driver.quit() # shut down the webdriver after finishing
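
The retry loop above reloads a URL for as long as its categories are missing, so a page that never shows them would keep a thread busy forever. A possible safeguard is to cap the number of attempts. The following is only a sketch under that assumption: `scrape_with_retries`, `fetch`, `is_complete`, and `max_attempts` are hypothetical names, and a canned list of results stands in for `get_data`.

```python
def scrape_with_retries(fetch, is_complete, max_attempts=3):
    """Call fetch() up to max_attempts times until is_complete(result) holds.

    Returns (result, attempts_used); result is None if every attempt was incomplete.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_complete(result):
            return result, attempt
    return None, max_attempts


# Stand-in for get_data(): the categories are missing on the first two loads.
responses = iter([
    {'job_categories': None},
    {'job_categories': ''},
    {'job_categories': 'IT/Software Development'},
])
data, attempts = scrape_with_retries(
    fetch=lambda: next(responses),
    is_complete=lambda d: bool(d['job_categories']),
)
print(attempts, data)
```

When the result comes back as `None`, the URL could be appended to `errors_urls` so that it can be retried in a later run.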

### As you can see, there are a lot of pages to scrape, and the number could grow, so to save time I used multithreading to split the data and scrape the parts concurrently
#### How multithreading helped:
According to the links CSV file, I have approximately 25k pages to scrape. Because Wuzzuf loads its data using JavaScript, each page has to load fully first. If we say it takes on average 3 seconds for a page to load and another 5 seconds for the data to load afterwards (neglecting the time Python needs to process the page), that is approximately 8 seconds per page. By doing simple math, **25000 * 8 = 200000 seconds ≈ 3333 minutes ≈ 56 hours ≈ 2.3 days** to scrape the data.

The solution was to use multithreading to divide the work into multiple tasks that run at the same time. In my case, I used 16 threads _(which may change depending on the ability of your PC and your Python version)_. Doing the same calculation but dividing by 16 (the number of threads), it takes only approximately **208 minutes ≈ 3.5 hours**, assuming the threads do not slow each other down.
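
The estimate can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes the work divides evenly and the threads run fully in parallel, so the multithreaded figure is a best case; `estimate_hours` is a hypothetical helper, not part of the scraper.

```python
PAGES = 25_000            # approximate number of links in the CSV
SECONDS_PER_PAGE = 3 + 5  # page load + wait for the JavaScript data
THREADS = 16


def estimate_hours(pages, seconds_per_page, threads=1):
    """Estimated wall-clock hours, assuming the work splits evenly across threads."""
    return pages * seconds_per_page / threads / 3600


sequential = estimate_hours(PAGES, SECONDS_PER_PAGE)
parallel = estimate_hours(PAGES, SECONDS_PER_PAGE, THREADS)
print(f'sequential: {sequential:.1f} h ({sequential / 24:.1f} days)')
print(f'{THREADS} threads: {parallel * 60:.0f} min ({parallel:.1f} h)')
# prints:
# sequential: 55.6 h (2.3 days)
# 16 threads: 208 min (3.5 h)
```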

In [None]:
dfs = np.array_split(df, 16) # split the main data frame into 16 data frames (one per thread)

# start the threads and give each one the data frame it will scrape
threads = []
for i, split_df in enumerate(dfs):
    thread = threading.Thread(target=start_driver_thread, kwargs={'mini_df': split_df, 'thread_number': str(i + 1)})
    threads.append(thread)
    thread.start()
    
for thread in threads:
    thread.join()