#### WebScraping Example by @danilotuosto

###### **STEP 1.** Let's import the needed packages

In [23]:
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as bs

###### **STEP 2.** Now we're going to request the desired URL with the *get* function from the *requests* package and assign the response to a variable named 'link'. Next, we're going to check that the request worked with *status_code* (200 means success)

In [24]:
link = rq.get('https://realpython.github.io/fake-jobs/')
link.status_code

200
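
As a side note on what the 200 above means: the sketch below uses only Python's standard library to turn a numeric status code into its standard phrase and to test whether it falls in the 2xx success range. The helper names `describe_status` and `is_success` are made up for this illustration and are not part of *requests*.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code followed by its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code: int) -> bool:
    """True for any 2xx status code."""
    return 200 <= code < 300


print(describe_status(200))
print(is_success(200))
print(describe_status(404))
print(is_success(404))
```

With *requests*, calling `link.raise_for_status()` is another way to stop early: it raises an exception when the response has a 4xx or 5xx status code.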

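###### **STEP 3.** Now we're going to parse the page, collect the title, company and location of each job, and save them to a csv file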

In [25]:
# We're going to create a BeautifulSoup object that parses the HTML content of the response stored in link.
soup = bs(link.content, "html.parser")

# Since soup is a BeautifulSoup object, we can use its find method to look for the element
# in the HTML whose id is ResultsContainer (a div we found by inspecting the original web page).
results = soup.find(id = 'ResultsContainer')

# Now we'll assign to a list called job_elements every div inside results
# whose class is card-content.
job_elements = results.find_all('div', class_ = 'card-content')

# We create a list for each field of the job.
title_list, sub_list, loc_list = [], [], []

# Through a for loop, we scan each element in the job_elements list.
for job_element in job_elements:
    # In each element, we use the find method to get the title, subtitle and location of the job,
    # convert it to text, and then add it to the matching list with append.
    title = job_element.find('h2', class_ = 'title').text
    title_list.append(title)
    subtitle = job_element.find('h3', class_ = 'subtitle').text
    sub_list.append(subtitle)
    loc = job_element.find('p', class_ = 'location').text
    loc_list.append(loc)

# We're creating a dictionary to hold these three lists.
tab = {
    'title': title_list,
    'company': sub_list,
    'location': loc_list
}

# Check the lists' lengths
print(len(title_list))
print(len(sub_list))
print(len(loc_list))

# pandas can't build a DataFrame from lists of different lengths. So, as a safeguard, we pad
# loc_list with None as many times as the difference between the lists' lengths.
# The += extends the list in place, so the tab dictionary sees the padding too.
loc_list += (len(title_list)-len(loc_list)) * [None]

# We assign to df the DataFrame containing the tab dictionary.
df = pd.DataFrame(tab)

# Save it into a csv file
df.to_csv('data_scrape', index=False)

100
100
100
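
One thing to keep in mind about the cell above: `find` returns `None` when nothing matches, so calling `.text` on a missing field would raise an `AttributeError` before the padding step is ever reached. Below is a minimal sketch of one way to deal with that, assuming a small hand-written HTML snippet instead of the real page. The helper `text_or_none` and the sample data are hypothetical.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical data): the second card has no location.
html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Job A</h2>
    <h3 class="subtitle">Company A</h3>
    <p class="location">City A</p>
  </div>
  <div class="card-content">
    <h2 class="title">Job B</h2>
    <h3 class="subtitle">Company B</h3>
  </div>
</div>
"""


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="ResultsContainer")

# One dictionary per job, so every field always has exactly one value per row.
rows = []
for card in results.find_all("div", class_="card-content"):
    rows.append({
        "title": text_or_none(card, "h2", "title"),
        "company": text_or_none(card, "h3", "subtitle"),
        "location": text_or_none(card, "p", "location"),
    })

print(rows)
```

Since every job gets exactly one value per field, a list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` without any padding.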


Using pandas, we're going to remove \n (newline) from all the columns

In [26]:
df = df.replace(r'\n', '', regex=True)
df

Unnamed: 0,title,company,location
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA"
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA"
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA"
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP"
4,Product manager,Ramirez Inc,"North Jamieview, AP"
...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE"
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP"
97,Database administrator,Yates-Ferguson,"Port Susan, AE"
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA"


In [27]:
df.to_csv('trimmed_data_scrape', index=False)
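
A small sketch of how `replace` behaves, using a hypothetical one-row DataFrame named `sample`: the call returns a new DataFrame and leaves the original untouched, so the cleaned result has to be assigned to a name before it is saved.

```python
import pandas as pd

# Hypothetical sample with the kind of newlines that .text can leave behind.
sample = pd.DataFrame({"title": ["\nJob A\n"], "location": ["\nCity A\n"]})

# replace returns a new DataFrame; sample itself is not modified.
cleaned = sample.replace(r"\n", "", regex=True)

print(repr(sample.loc[0, "title"]))
print(repr(cleaned.loc[0, "title"]))
```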