# Web Scraping


## Objective

Extract Data Engineer job postings from [Wuzzuf](https://wuzzuf.net/search/jobs/?q=data+engineer&a=hpb), using the `BeautifulSoup` library to pull information out of web pages.

## Importing libraries

Import the libraries we will need.


In [1]:
from bs4 import BeautifulSoup
import csv
import requests
from itertools import zip_longest
import pandas as pd

### Webpage Contents

Request the search page with the `requests` library and assign the response to the variable `link`, then store the raw content of the page in `src`.


In [2]:
link = requests.get("https://wuzzuf.net/search/jobs/?q=data+engineer&a=hpb")
src = link.content
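The request above assumes the server answers successfully. As a minimal sketch, the fetch could be wrapped in a small helper that fails loudly on HTTP errors and does not hang forever. The name `fetch_html`, its `get` parameter, and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the raw bytes of `url`, raising an exception on HTTP errors.

    `get` is a hypothetical injection point (defaults to requests.get)
    so the helper can be exercised without touching the network.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.content


# Possible usage in the notebook (performs a real network request):
# src = fetch_html("https://wuzzuf.net/search/jobs/?q=data+engineer&a=hpb")
```

Checking the status before parsing makes it easier to tell a blocked or failed request apart from a page that simply contains no matching tags.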

### Scraping the Data

Use `BeautifulSoup` to parse the contents of the webpage into the `soup` object


In [3]:
soup = BeautifulSoup(src, 'lxml')

Find the information we need with the `soup` object, using the tag name and the `class` value that identifies each piece of information


In [4]:
job_titles = soup.find_all("h2" , {"class" : "css-m604qf"})
company_names = soup.find_all("a" , {"class" : "css-17s97q8"})
locations = soup.find_all("span" , {"class" : "css-5wys0k"})
job_skills = soup.find_all("div" , {"class" : "css-y4udm8"})
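To see what `find_all` returns for a tag name plus a class filter, here is a small sketch on an inline HTML fragment. The fragment is made up for illustration (the real page markup may differ, and the generated class names can change over time), and it uses the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example.
sample_html = """
<div>
  <h2 class="css-m604qf"><a href="/jobs/p/example-1">Data Engineer</a></h2>
  <h2 class="css-m604qf"><a href="/jobs/p/example-2">ETL Developer</a></h2>
  <h2 class="other">Not a job title</h2>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only <h2> tags whose class matches are returned, in document order.
sample_titles = sample_soup.find_all("h2", {"class": "css-m604qf"})
title_texts = [tag.text for tag in sample_titles]
print(title_texts)
```

The result is a list of tag objects, not strings, which is why a later step is needed to pull out the text.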

Create empty lists to store the extracted information

In [5]:
job_title = []
company_name = []
location = []
skills = []
links = []
full_links = []

Looping through the data to extract the specific `text` we need as the final result

In [6]:
for i in range(len(job_titles)):
    job_title.append(job_titles[i].text)
    company_name.append(company_names[i].text)
    location.append(locations[i].text)
    skills.append(job_skills[i].text)
    links.append(job_titles[i].find("a").attrs['href'])
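The loop above assumes that every title tag contains an `<a>` element. As a hedged sketch, the text and the `href` could be extracted defensively, so that a missing anchor yields `None` instead of an exception. The helper name `title_and_href` and the HTML fragment are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second heading has no link on purpose.
demo_html = (
    '<h2 class="css-m604qf"><a href="/jobs/p/example-1"> Data Engineer </a></h2>'
    '<h2 class="css-m604qf">No link here</h2>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")


def title_and_href(h2_tag):
    """Return (title text, href or None) for one heading tag."""
    anchor = h2_tag.find("a")
    href = anchor.get("href") if anchor is not None else None
    # get_text(strip=True) trims surrounding whitespace from the text.
    return h2_tag.get_text(strip=True), href


demo_rows = [title_and_href(tag) for tag in demo_soup.find_all("h2")]
print(demo_rows)
```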

The extracted links appear to be missing the first part of the URL, `https://wuzzuf.net`

So, we will loop through `links` to add the missing part of the link, and store it in `full_links`

In [7]:
for href in links:
    # Prepend the site root; `href` avoids overwriting the `link` response object
    full_links.append("https://wuzzuf.net" + href)
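String concatenation works when every link is relative. A swapped-in alternative is `urljoin` from the standard library, which also leaves already absolute URLs unchanged. The paths below are invented examples, not real postings.

```python
from urllib.parse import urljoin

base_url = "https://wuzzuf.net"

# Hypothetical links: one relative, one already absolute.
relative_links = [
    "/jobs/p/example-1",
    "https://wuzzuf.net/jobs/p/example-2",
]

absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
```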


### Loading the Data



Each list holds one column of the final table, but the CSV writer expects rows. We will use `zip_longest` (from the `itertools` module) to combine the lists element by element into row tuples; if the lists differ in length, the missing values are filled with `None`

In [8]:
file_list = [job_title , company_name, location, skills, full_links]
exported = zip_longest(*file_list)
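A small sketch of what `zip_longest` does with lists of unequal length. The sample values are made up, and `fillvalue=""` is an optional choice shown here for illustration (the notebook relies on the default, `None`).

```python
from itertools import zip_longest

# Hypothetical columns; the second one is deliberately shorter.
demo_titles = ["Data Engineer", "ETL Developer"]
demo_companies = ["Example Co"]

# Each tuple is one row; gaps in shorter columns get the fill value.
demo_rows_zipped = list(zip_longest(demo_titles, demo_companies, fillvalue=""))
print(demo_rows_zipped)
```

Note that `zip_longest` returns a one-shot iterator: once it has been consumed (for example by `writerows`), it is empty and must be rebuilt to be used again.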

## Saving the data

Use the `csv` library to store the data in a CSV file

In [9]:
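# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("/resources/labs/PY0221EN/Webscraping/Data_Engineering_Jobs.csv", "w", newline="", encoding="utf-8") as myfile:
    wr = csv.writer(myfile)
    # Header row first, then one row per job posting
    wr.writerow(["Job Title", "Company", "Location", "Skills", "Links"])
    wr.writerows(exported)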
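To check the write step in isolation, the sketch below writes a couple of made-up rows to a temporary file and reads them back. The file name `demo_jobs.csv` and the sample rows are assumptions for illustration only.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows standing in for the scraped data.
sample_rows = [("Data Engineer", "Example Co"), ("ETL Developer", "")]

# Write to a throwaway directory so no real output file is touched.
out_path = Path(tempfile.mkdtemp()) / "demo_jobs.csv"

with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Job Title", "Company"])
    writer.writerows(sample_rows)

# Read the file back to confirm the header and rows round-trip.
with open(out_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```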

## Display the final scraped data

In [10]:
final_result = pd.read_csv("/resources/labs/PY0221EN/Webscraping/Data_Engineering_Jobs.csv")
final_result.head()

|   | Job Title | Company | Location | Skills | Links |
| --- | --- | --- | --- | --- | --- |
| 0 | Data Security Engineer | Misr International Systems - | Giza, Egypt | Full TimeExperienced · 3+ Yrs of Exp · IT/Soft... | https://wuzzuf.net/jobs/p/GC4dKxWYuvyF-Data-Se... |
| 1 | Data Center Engineer | Perfect Presentation - | 6th of October, Giza, Egypt | Full TimeExperienced · 5+ Yrs of Exp · IT/Soft... | https://wuzzuf.net/jobs/p/7zgMraWuoxvn-Data-Ce... |
| 2 | ETL Developer (Data Engineer) | siParadigm Egypt - | Sheraton, Cairo, Egypt | Full TimeEntry Level · 1 - 3 Yrs of Exp · IT/S... | https://wuzzuf.net/jobs/p/45nIdFIB6hGN-ETL-Dev... |
| 3 | Data Engineer | Centro CDX - | Maadi, Cairo, Egypt | Full TimeExperienced · 3+ Yrs of Exp · IT/Soft... | https://wuzzuf.net/jobs/p/YgnlNerNUQml-Data-En... |
| 4 | ETL Developer (Data Engineer) | Confidential - | New Cairo, Cairo, Egypt | Full TimeEntry Level · 1 - 3 Yrs of Exp · IT/S... | https://wuzzuf.net/jobs/p/lh2XUX458sAE-ETL-Dev... |


| Made by | E-Mail | LinkedIn        | Github                 |
| ----------------- | ------- | ----------------- | ---------------------------------- |
| Mohamed Essam        | mohamed.esam3105@gmail.com     | [Profile](https://www.linkedin.com/in/esamtronics) | [Repositories](https://github.com/esamtronics?tab=repositories) |
