# Web Scraping Project

The goal of this project is to practice basic web scraping without going too in-depth.

I found a fake job site (https://realpython.github.io/fake-jobs/) built for scraping practice, which lets me scrape job posting data from a basic website.

In [1]:
# Import packages to use for web scraping as well as data exploration.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
# Request the page's HTML from the website
response = requests.get("https://realpython.github.io/fake-jobs/")
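
Before parsing, it can help to confirm that the request actually succeeded. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, raising if the request fails.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (commented out so the sketch itself makes no network request):
# html = fetch_html("https://realpython.github.io/fake-jobs/")
```

Failing early on a bad status code avoids parsing an error page as if it were job data.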

In [3]:
# Parse the response text into a BeautifulSoup object (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(response.text, "html.parser")

In [4]:
print(soup.get_text())





Fake Python






        Fake Python
      

        Fake Jobs for Your Web Scraping Journey
      













Senior Python Developer
Payne, Roberts and Davis




        Stewartbury, AA
      

2021-04-08



Learn
Apply














Energy engineer
Vasquez-Davidson




        Christopherville, AA
      

2021-04-08



Learn
Apply














Legal executive
Jackson, Chambers and Levy




        Port Ericaburgh, AA
      

2021-04-08



Learn
Apply














Fitness centre manager
Savage-Bradley




        East Seanview, AP
      

2021-04-08



Learn
Apply














Product manager
Ramirez Inc




        North Jamieview, AP
      

2021-04-08



Learn
Apply














Medical technical officer
Rogers-Yates




        Davidville, AP
      

2021-04-08



Learn
Apply














Physiological scientist
Kramer-Klein




        South Christopher, AE
      

2021-04-08



Learn
Apply














Textile designer
Meyers-Johnson




        Port Jonathan, AE
 

In [5]:
# Parse the raw response bytes with Python's built-in HTML parser
content = response.content
soup2 = BeautifulSoup(content,"html.parser")

In [6]:
soup2

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
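
The markup above shows that each posting keeps its title in an `<h2 class="title is-5">` tag and its company in an `<h3 class="subtitle is-6 company">` tag, both inside a `media-content` div. A minimal offline sketch of how those tags could be located, using a fragment copied from the page (the `sample_html` and `sample_soup` names are just for illustration):

```python
from bs4 import BeautifulSoup

# A fragment copied from the page output above, so no network request is needed
sample_html = """
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
card = sample_soup.find("div", class_="media-content")

# class_ matches any one of an element's CSS classes
title = card.find("h2", class_="title").text
company = card.find("h3", class_="company").text

print(title)    # Senior Python Developer
print(company)  # Payne, Roberts and Davis
```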

In [7]:
# Pull the job titles from the HTML and save them to the list 'job_titles'
job_titles = []
for i in soup.find_all("div", {"class": "media-content"}):
    job_titles.append(i.find("h2", {"class": "title is-5"}).text)

In [8]:
# Count how many job titles were scraped (duplicates included)
len(job_titles)

100

In [9]:
# Pull the company names from the HTML and save them to the list 'company_names'
company_names = []
for i in soup.find_all("div", {"class": "media-content"}):
    company_names.append(i.find("h3", {"class": "subtitle is-6 company"}).text)

In [10]:
# Make sure the list of company names has the same number of entries as the job titles list
len(company_names)

100

In [11]:
# Pull the job locations from the HTML and save them to the list 'job_locations'
job_locations = []
for i in soup.find_all("div", {"class": "content"}):
    job_locations.append(i.find("p", {"class": "location"}).text)

In [12]:
# Display the job locations list
job_locations

['\n        Stewartbury, AA\n      ',
 '\n        Christopherville, AA\n      ',
 '\n        Port Ericaburgh, AA\n      ',
 '\n        East Seanview, AP\n      ',
 '\n        North Jamieview, AP\n      ',
 '\n        Davidville, AP\n      ',
 '\n        South Christopher, AE\n      ',
 '\n        Port Jonathan, AE\n      ',
 '\n        Osbornetown, AE\n      ',
 '\n        Scotttown, AP\n      ',
 '\n        Ericberg, AE\n      ',
 '\n        Ramireztown, AE\n      ',
 '\n        Figueroaview, AA\n      ',
 '\n        Kelseystad, AA\n      ',
 '\n        Williamsburgh, AE\n      ',
 '\n        Mitchellburgh, AE\n      ',
 '\n        West Jessicabury, AA\n      ',
 '\n        Maloneshire, AE\n      ',
 '\n        Johnsonton, AA\n      ',
 '\n        South Davidtown, AP\n      ',
 '\n        Port Sara, AE\n      ',
 '\n        Marktown, AA\n      ',
 '\n        Laurenland, AE\n      ',
 '\n        Lauraton, AP\n      ',
 '\n        South Tammyberg, AP\n      ',
 '\n        North Brandonv

In [13]:
# Fix the job locations list by removing the '\n' characters from each entry
rep = []
for x in job_locations:
    rep.append(x.replace("\n", ""))

In [14]:
# Continue fixing the job locations list by stripping the whitespace around each location
job_locs = []
for x in rep:
    job_locs.append(x.strip())
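
Because `str.strip()` with no arguments removes leading and trailing whitespace, newlines included, the two cleaning steps above could also be collapsed into one list comprehension. A small sketch using two of the raw values shown earlier (`raw_locations` and `cleaned` are illustrative names):

```python
raw_locations = [
    "\n        Stewartbury, AA\n      ",
    "\n        Christopherville, AA\n      ",
]

# strip() removes spaces, tabs, and newlines from both ends of each string
cleaned = [loc.strip() for loc in raw_locations]

print(cleaned)  # ['Stewartbury, AA', 'Christopherville, AA']
```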

In [15]:
# Check that job_locs has the same number of entries as the job titles and company names lists
len(job_locs)

100
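
The three lists stay aligned only as long as every posting contains all three elements. One way to guard against misalignment is to loop over the postings once and collect one record per posting. The sketch below assumes each posting sits in its own `<div class="card-content">` that also holds the location paragraph; the page output above shows the card but not the location's nesting, so that part is an assumption. The `sample_html` fragment and the `jobs` name are illustrative:

```python
from bs4 import BeautifulSoup

# Hand-built fragment; nesting the location inside card-content is an assumption
sample_html = """
<div class="card-content">
  <div class="media-content">
    <h2 class="title is-5">Senior Python Developer</h2>
    <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  </div>
  <div class="content">
    <p class="location">
        Stewartbury, AA
      </p>
  </div>
</div>
<div class="card-content">
  <div class="media-content">
    <h2 class="title is-5">Energy engineer</h2>
    <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  </div>
  <div class="content">
    <p class="location">
        Christopherville, AA
      </p>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

jobs = []
for card in sample_soup.find_all("div", class_="card-content"):
    # get_text(strip=True) returns the tag's text with surrounding whitespace removed
    jobs.append({
        "Job Title": card.find("h2", class_="title").get_text(strip=True),
        "Company": card.find("h3", class_="company").get_text(strip=True),
        "Location": card.find("p", class_="location").get_text(strip=True),
    })

print(jobs[0])
```

Since each record is built from a single card, a posting with a missing element fails loudly at that card instead of silently shifting the later rows.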

In [16]:
# Create a list of lists containing the job titles, company names, and job locations
lst_of_lsts = [job_titles, company_names, job_locs]
# Create a data frame from the list of lists (each list becomes one row)
df_cols_diff_jobs = pd.DataFrame(lst_of_lsts)

In [17]:
# Display the data frame
df_cols_diff_jobs

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Senior Python Developer | Energy engineer | Legal executive | Fitness centre manager | Product manager | Medical technical officer | Physiological scientist | Textile designer | Television floor manager | Waste management officer | ... | Software Developer (Python) | Surveyor, land/geomatics | Legal executive | Librarian, academic | Barrister | Museum/gallery exhibitions officer | Radiographer, diagnostic | Database administrator | Furniture designer | Ship broker |
| 1 | Payne, Roberts and Davis | Vasquez-Davidson | Jackson, Chambers and Levy | Savage-Bradley | Ramirez Inc | Rogers-Yates | Kramer-Klein | Meyers-Johnson | Hughes-Williams | Jones, Williams and Villa | ... | Moreno-Rodriguez | Brown-Ortiz | Hartman PLC | Brooks Inc | Washington-Castillo | Nguyen, Yoder and Petty | Holder LLC | Yates-Ferguson | Ortega-Lawrence | Fuentes, Walls and Castro |
| 2 | Stewartbury, AA | Christopherville, AA | Port Ericaburgh, AA | East Seanview, AP | North Jamieview, AP | Davidville, AP | South Christopher, AE | Port Jonathan, AE | Osbornetown, AE | Scotttown, AP | ... | Martinezburgh, AE | Joshuatown, AE | West Ericstad, AA | Tuckertown, AE | Perezton, AE | Lake Abigail, AE | Jacobshire, AP | Port Susan, AE | North Tiffany, AA | Michelleville, AP |


In [18]:
# Transpose the data frame so that each row, rather than each column, holds one job
df_rows_diff_jobs = df_cols_diff_jobs.transpose()
# Name each column of the transposed data frame after the list it came from
df_rows_diff_jobs.columns = ['Job Title', 'Company', 'Location']
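
Building the frame from a dictionary of lists gives the same row-per-job layout without the transpose and rename steps. A short sketch with three of the scraped rows standing in for the full lists (`df_jobs` is an illustrative name):

```python
import pandas as pd

# First three entries of each list, copied from the output above
job_titles = ["Senior Python Developer", "Energy engineer", "Legal executive"]
company_names = ["Payne, Roberts and Davis", "Vasquez-Davidson", "Jackson, Chambers and Levy"]
job_locs = ["Stewartbury, AA", "Christopherville, AA", "Port Ericaburgh, AA"]

# Each dictionary key becomes a column name; each list becomes that column's values
df_jobs = pd.DataFrame({
    "Job Title": job_titles,
    "Company": company_names,
    "Location": job_locs,
})

print(df_jobs.shape)  # (3, 3)
```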

In [19]:
# Display the data frame with each job opportunity (job title, company, and location) as an individual row
df_rows_diff_jobs

| | Job Title | Company | Location |
|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP |
| ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | Lake Abigail, AE |
| 96 | Radiographer, diagnostic | Holder LLC | Jacobshire, AP |
| 97 | Database administrator | Yates-Ferguson | Port Susan, AE |
| 98 | Furniture designer | Ortega-Lawrence | North Tiffany, AA |


This was a straightforward project for learning how to scrape a basic website and load the results into a data frame that can be displayed and used later for data exploration. For example, we could count how many jobs fall into different fields (e.g., computing, sales, healthcare) or how many jobs are located in each city.
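
As a rough sketch of that kind of follow-up, the snippet below counts postings per region code (the two letters after the comma in `Location`) and picks out titles that mention Python. It uses a handful of rows copied from the output above rather than the full frame, and the `Region` column name is an assumption made for illustration:

```python
import pandas as pd

# Five rows copied from the scraped output, standing in for the full data frame
df = pd.DataFrame({
    "Job Title": [
        "Senior Python Developer",
        "Energy engineer",
        "Legal executive",
        "Fitness centre manager",
        "Product manager",
    ],
    "Location": [
        "Stewartbury, AA",
        "Christopherville, AA",
        "Port Ericaburgh, AA",
        "East Seanview, AP",
        "North Jamieview, AP",
    ],
})

# Take the text after the last comma as the region code
df["Region"] = df["Location"].str.rsplit(",", n=1).str[-1].str.strip()

# Count postings per region and filter titles containing "Python" (case-insensitive)
region_counts = df["Region"].value_counts()
python_jobs = df[df["Job Title"].str.contains("Python", case=False)]

print(region_counts)
print(python_jobs["Job Title"].tolist())  # ['Senior Python Developer']
```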

We could explore this data further by combining it with average salary per job title, scraped from a site such as Glassdoor that shows salary information for specific jobs.
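
One possible shape for that combination is a left merge on the job title. In the sketch below, the `salaries` table, its `Average Salary` column, and the numbers in it are made-up placeholders for illustration only, not real salary data:

```python
import pandas as pd

# Three rows copied from the scraped output
jobs = pd.DataFrame({
    "Job Title": ["Senior Python Developer", "Energy engineer", "Legal executive"],
    "Company": ["Payne, Roberts and Davis", "Vasquez-Davidson", "Jackson, Chambers and Levy"],
})

# Hypothetical salary table; the values are placeholders, not real figures
salaries = pd.DataFrame({
    "Job Title": ["Senior Python Developer", "Energy engineer"],
    "Average Salary": [1000, 2000],
})

# A left merge keeps every job posting and leaves NaN where no salary was found
jobs_with_salary = jobs.merge(salaries, on="Job Title", how="left")

print(jobs_with_salary)
```

Using `how="left"` means postings without a matching salary row are kept rather than dropped, which makes the gaps easy to spot.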