# Survey On Resume Mistakes (2018) - Data Scraper

### About this file
This file scrapes data from a 2018 survey conducted by The Harris Poll on behalf of CareerBuilder about the most common resume mistakes that lead to instant rejection.

The scraped data is stored in a pandas DataFrame and saved as a CSV file. The dataset contains the following fields:
- `mistakes`: The type of mistake observed in the resumes (string)
- `percent`: The percentage of survey participants who cited the mistake (integer)

### Import necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Scrape Data
Let's store the long URL in the variable `url`.

In [2]:
url = "https://press.careerbuilder.com/2018-08-24-Employers-Share-Their-Most-Outrageous-Resume-Mistakes-and-Instant-Deal-Breakers-in-a-New-CareerBuilder-Study"

It is good practice to verify that the page was fetched successfully before extracting any data from it.

In [3]:
# Get the page request
page = requests.get(url)


# Check the status code
if page.status_code == 200:
    soup = BeautifulSoup(page.content, "html.parser")
    print("Page successfully fetched")
else:
    print(f"Error fetching page. Status code: {page.status_code}")

Page successfully fetched
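
In the cell above, a failed request only prints a message, so `soup` stays undefined and later cells fail with a `NameError`. One possible alternative, shown as a minimal sketch, is to raise on failure with `raise_for_status()`. The names `fetch_soup`, `timeout`, and `get` are hypothetical and not part of the original notebook; `get` exists only so the function can be exercised without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, get=requests.get):
    """Return parsed HTML for `url`, raising instead of printing on failure.

    `fetch_soup`, `timeout`, and `get` are hypothetical names for this sketch.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so the caller never continues without a parsed page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `soup = fetch_soup(url)` either returns a parsed page or stops the notebook at the point of failure.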




The page has been successfully fetched, so I can now proceed to the data extraction step. The survey data is presented on the CareerBuilder website both in short paragraphs and in bulleted lists. For my visualization, I am only interested in the data presented as a list. The code below locates that data inside `<div class="wd_body wd_news_body">`, in the second `<ul>` element of that container (index `1`).

In [4]:
body = soup.find("div", class_="wd_body wd_news_body")
mistakes = body.find_all("ul")[1]
mistakes

<ul type="disc">
<li>Typos or bad grammar: 77 percent
</li><li>Unprofessional email address: 35 percent
</li><li>Resume without quantifiable results: 34 percent
</li><li>Resume with long paragraphs of text: 25 percent
</li><li>Resume is generic, not customized to company: 18 percent
</li><li>Resume is more than two pages: 17 percent
</li><li>No cover letter with resume: 10 percent </li></ul>
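
Selecting the list with `find_all("ul")[1]` depends on its position on the page. As a hedged alternative, the sketch below picks the first `<ul>` whose items all end in "NN percent". The helper `find_percent_list`, the pattern `ITEM_PATTERN`, and the stand-in HTML are assumptions made for illustration; the real page contains more content.

```python
import re
from bs4 import BeautifulSoup

# Matches item text such as "Typos or bad grammar: 77 percent".
ITEM_PATTERN = re.compile(r":\s*\d+\s*percent\s*$")


def find_percent_list(container):
    """Return the first <ul> in `container` whose items all end in "NN percent"."""
    for ul in container.find_all("ul"):
        items = ul.find_all("li")
        if items and all(ITEM_PATTERN.search(li.get_text().strip()) for li in items):
            return ul
    return None


# Tiny stand-in document, not the actual page markup.
sample_html = """
<div class="wd_body wd_news_body">
  <ul><li>An unrelated bullet point</li></ul>
  <ul type="disc">
    <li>Typos or bad grammar: 77 percent</li>
    <li>Unprofessional email address: 35 percent</li>
  </ul>
</div>
"""
sample_body = BeautifulSoup(sample_html, "html.parser").find(
    "div", class_="wd_body wd_news_body"
)
sample_list = find_percent_list(sample_body)
print(len(sample_list.find_all("li")))
```

The function returns `None` when no list matches, which makes a layout change on the page easier to notice than an `IndexError` or a silently wrong list.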

In [5]:
# Extract data

data = []
for item in mistakes.select("li"):
    # Strip surrounding whitespace, then split "label: NN percent" at the last colon
    label, value = item.text.strip().rsplit(":", 1)

    mistake_type = label.capitalize()                     # Standardize capitalization
    mistake_percent = int(value.replace(" percent", ""))  # Remove "percent" and convert to integer
    data.append([mistake_type, mistake_percent])

    
# Store data into a pandas DataFrame
df = pd.DataFrame(data, columns = ["mistakes", "percent"])
df

|   | mistakes | percent |
|---|----------|---------|
| 0 | Typos or bad grammar | 77 |
| 1 | Unprofessional email address | 35 |
| 2 | Resume without quantifiable results | 34 |
| 3 | Resume with long paragraphs of text | 25 |
| 4 | Resume is generic, not customized to company | 18 |
| 5 | Resume is more than two pages | 17 |
| 6 | No cover letter with resume | 10 |
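
The per-item parsing can also be written as a small standalone function that fails loudly when an item does not have the expected shape. This is a minimal sketch using only the standard library; `parse_item` and `ITEM_RE` are hypothetical names, and unlike the cell above it leaves the label's capitalization untouched.

```python
import re

# A label, a colon, then an integer followed by the word "percent".
ITEM_RE = re.compile(r"^(?P<label>.+):\s*(?P<percent>\d+)\s*percent$")


def parse_item(text):
    """Split "Label: NN percent" into (label, percent) or raise ValueError."""
    match = ITEM_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unexpected list item format: {text!r}")
    return match.group("label").strip(), int(match.group("percent"))


print(parse_item("No cover letter with resume: 10 percent "))
# prints ('No cover letter with resume', 10)
```

Because the label group is greedy, a label that itself contains a colon is still split at the last colon before the number.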


### Save into a CSV file

The final step is to save the dataframe into a CSV file named `data_survey.csv`.

In [6]:
df.to_csv("data_survey.csv", encoding='utf-8', index=False)
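
To check that the saved file reads back as expected, a round trip through `to_csv` and `read_csv` can be sketched as follows. The example writes to an in-memory buffer rather than `data_survey.csv`, and `sample_df` is a two-row stand-in for the real DataFrame; both are assumptions made for illustration.

```python
import io

import pandas as pd

# Two-row stand-in for the scraped data, including a label that contains a comma.
sample_df = pd.DataFrame(
    [
        ["Typos or bad grammar", 77],
        ["Resume is generic, not customized to company", 18],
    ],
    columns=["mistakes", "percent"],
)

# Write without the index, as in the cell above, then read the CSV text back.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored["mistakes"].tolist())
print(restored["percent"].tolist())
```

Passing `index=False` keeps the row index out of the file; without it, reading the CSV back would produce an extra `Unnamed: 0` column. The label containing a comma is quoted automatically by `to_csv`, so it survives the round trip as a single field.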