<a href="https://colab.research.google.com/github/anaribeiros/ndstudentjobs/blob/main/NDStudentJobsCollector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Notre Dame Student Jobs Collector**

**Goal:** Scrape all student jobs from ND's Job Board, organizing them and their compensations concisely in one Google Sheet.

**Why?**
  * As a broke college student, I find it annoying to click through multiple pages just to find out which job fits me best (and pays the most!).
  * So I wanted a script that stores everything in one spreadsheet, which is much faster to skim, especially when 20+ jobs are listed at the start of each semester.

**What I've used**
*   Beautiful Soup: to scrape data out of ND's Job Board
*   Pandas: to create/manipulate a dataframe w/ job information
*   Gspread: to access and edit a Google Sheet

[Link to spreadsheet!](https://docs.google.com/spreadsheets/d/1zT53P82LngsnbjDC02dx4G0BHBjGfUiQMZEA0ve9f5c/edit?usp=sharing)




#**For up-to-date ND job info on the spreadsheet => run me!**

##**1. Scraping data from ND's website**

In [1]:
# importing requests and beautiful soup
import requests
from bs4 import BeautifulSoup

In [2]:
# requesting job board's url and converting it to a soup

index_url = "https://studentjobs.nd.edu"
page = requests.get(index_url)

soup = BeautifulSoup(page.content, "html.parser")

In [3]:
# getting all job urls and job types
jobs = soup.find_all('li',class_ = 'card-body')
urls = []
job_type = []

for job in jobs:
  job_type.append(job['data-category'])
  a = job.find_all(class_ = 'card-link')[0]
  urls.append(index_url + a['href'])

print(urls)
print(job_type)

['https://studentjobs.nd.edu/jobs/academic-year-driving-to-from-adams-high-school-/', 'https://studentjobs.nd.edu/jobs/after-school-care-for-three-children-/', 'https://studentjobs.nd.edu/jobs/monday-and-friday-fall-semester-childcare-for-3-yo-/', 'https://studentjobs.nd.edu/jobs/spirituality-student-worker-alumni-association/']
['off-campus', 'off-campus', 'off-campus', 'on-campus']
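
The loop above builds each job URL by concatenating `index_url` with the link's `href`, which works as long as every `href` is a root-relative path such as `/jobs/...`. As a hedged alternative, the standard library's `urljoin` resolves both relative and absolute links against the base URL. The `build_job_url` helper name is hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin

index_url = "https://studentjobs.nd.edu"

def build_job_url(base, href):
  # hypothetical helper: resolves href against the base URL,
  # whether href is root-relative or already absolute
  return urljoin(base, href)

relative = build_job_url(index_url, "/jobs/spirituality-student-worker-alumni-association/")
absolute = build_job_url(index_url, "https://studentjobs.nd.edu/jobs/spirituality-student-worker-alumni-association/")

print(relative)
print(absolute)
```

Both calls produce the same URL, so the scraper would keep working if the site ever switched to absolute links.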


Notre Dame divides its jobs into 3 categories: On Campus, Off Campus, and Community Service. We'll create a dataframe to store the data from the job postings and to record which category each job belongs to.
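
The `data-category` values scraped earlier are slugs like `off-campus`, not the category names listed here. If readable names are preferred in the spreadsheet, a small lookup could translate them. This is only a sketch: `off-campus` and `on-campus` come from the output above, while the `community-service` slug and the `label_for` helper are assumptions.

```python
# 'off-campus' and 'on-campus' appear in the scraped output;
# 'community-service' is an assumed slug for the third category
CATEGORY_LABELS = {
  'on-campus': 'On Campus',
  'off-campus': 'Off Campus',
  'community-service': 'Community Service',
}

def label_for(slug):
  # hypothetical helper: falls back to the raw slug when it is not in the lookup
  return CATEGORY_LABELS.get(slug, slug)

job_type = ['off-campus', 'off-campus', 'off-campus', 'on-campus']
labels = [label_for(slug) for slug in job_type]
print(labels)
```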

In [4]:
data_cols = ['Job Type', 'Title', 'Department', 'Hours', 'Rate', 'URL']
data_jobs = []

# visiting each job page and pulling its title and details
for i in range(len(urls)):
  page = requests.get(urls[i])
  soup = BeautifulSoup(page.content, "html.parser")
  title = soup.find('h1', class_ = 'page-title').text
  qualifications = soup.find('div', class_ = 'job-details__meta')

  # each detail is a <dt> label followed by a <dd> value;
  # if a label is missing, find() returns None and .find_next raises AttributeError
  try:
    department = qualifications.find('dt', string='Department').find_next('dd').text
  except AttributeError:
    department = 'N/A'

  try:
    hours = qualifications.find('dt', string='Hours').find_next('dd').text
  except AttributeError:
    hours = 'Not informed'

  try:
    rate = qualifications.find('dt', string='Rate').find_next('dd').text
  except AttributeError:
    rate = 'Not informed'

  data = [job_type[i], title, department, hours, rate, urls[i]]
  data_jobs.append(data)
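
The three try/except blocks above repeat the same lookup: find a `<dt>` label, then read the `<dd>` that follows it. A minimal sketch of how that could be folded into one helper is below. The `get_detail` name and the inline sample HTML are hypothetical; the real job pages may be structured differently.

```python
from bs4 import BeautifulSoup

def get_detail(container, label, default='Not informed'):
  # hypothetical helper: returns the <dd> text that follows the <dt> with this label
  dt = container.find('dt', string=label)
  if dt is None:
    return default
  dd = dt.find_next('dd')
  return dd.get_text(strip=True) if dd is not None else default

# made-up markup that mimics the label/value layout assumed by the scraper
sample_html = """
<div class="job-details__meta">
  <dl>
    <dt>Hours</dt><dd>6</dd>
    <dt>Rate</dt><dd>$15.00</dd>
  </dl>
</div>
"""

meta = BeautifulSoup(sample_html, 'html.parser').find('div', class_='job-details__meta')
hours = get_detail(meta, 'Hours')
rate = get_detail(meta, 'Rate')
department = get_detail(meta, 'Department', default='N/A')
print(hours, rate, department)
```

Checking for `None` explicitly avoids relying on exceptions for the common case of a missing field, and keeps the per-field defaults in one place.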


In [5]:
# converting list into dataframe
import pandas as pd

df_jobs = pd.DataFrame(data_jobs, columns=data_cols)
df_jobs.head(30)

| | Job Type | Title | Department | Hours | Rate | URL |
|---|---|---|---|---|---|---|
| 0 | off-campus | Academic Year Driving To/From Adams High School | | 7-830am, 330-4pm | $20.00 | https://studentjobs.nd.edu/jobs/academic-year-... |
| 1 | off-campus | After-school care for three children | | 4 | $15.00 | https://studentjobs.nd.edu/jobs/after-school-c... |
| 2 | off-campus | Monday and Friday Fall semester childcare for ... | | 8; 11:45am - 3:45pm | $16.00 | https://studentjobs.nd.edu/jobs/monday-and-fri... |
| 3 | on-campus | Spirituality Student Worker | Alumni Association | 6 | $15.00 | https://studentjobs.nd.edu/jobs/spirituality-s... |
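
Since the point of the spreadsheet is to compare pay at a glance, one possible next step is turning the `Rate` strings into numbers and sorting by them. This is a sketch under the assumption that rates look like `$15.00`, as in the output above. `parse_rate` is a hypothetical helper, and the last sample row is made up to show how an unreadable rate is handled.

```python
import re

def parse_rate(rate):
  # hypothetical helper: pulls the first number out of a rate string,
  # returning None when there is no number (e.g. 'Not informed')
  match = re.search(r'\d+(?:\.\d+)?', rate.replace(',', ''))
  return float(match.group()) if match else None

rows = [
  ('After-school care for three children', '$15.00'),
  ('Academic Year Driving To/From Adams High School', '$20.00'),
  ('Spirituality Student Worker', '$15.00'),
  ('Hypothetical listing without a rate', 'Not informed'),
]

parsed = [(title, parse_rate(rate)) for title, rate in rows]

# highest pay first; rows without a readable rate go last
ranked = sorted(parsed, key=lambda row: (row[1] is None, -(row[1] or 0)))

for title, rate in ranked:
  print(rate, title)
```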


##**2. Inserting all that data on a Google Spreadsheet**

In [6]:
!pip install gspread --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gspread
  Downloading gspread-5.9.0-py3-none-any.whl (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.0/41.0 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: gspread
  Attempting uninstall: gspread
    Found existing installation: gspread 3.4.2
    Uninstalling gspread-3.4.2:
      Successfully uninstalled gspread-3.4.2
Successfully installed gspread-5.9.0


In [7]:
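# authenticating this Colab session with a Google account
from google.colab import auth
auth.authenticate_user()

# authorizing gspread with the session's default credentials
import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# opening the first worksheet of the 'ND Job Board Data' spreadsheet
worksheet = gc.open('ND Job Board Data').sheet1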

In [8]:
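# converting the dataframe's header and rows into plain lists
cols_list = df_jobs.columns.tolist()
data_list = df_jobs.values.tolist()

# wiping the old data, then writing the header row followed by all job rows
worksheet.clear()
worksheet.append_row(cols_list)
worksheet.append_rows(data_list)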


{'spreadsheetId': '1zT53P82LngsnbjDC02dx4G0BHBjGfUiQMZEA0ve9f5c',
 'tableRange': 'Sheet1!A1:F1',
 'updates': {'spreadsheetId': '1zT53P82LngsnbjDC02dx4G0BHBjGfUiQMZEA0ve9f5c',
  'updatedRange': 'Sheet1!A2:F5',
  'updatedRows': 4,
  'updatedColumns': 6,
  'updatedCells': 24}}