# RemoteOK Job Scraping, Cleaning & Visualization

End-to-end pipeline notebook.


## 📌 Project Overview

This notebook implements an **end-to-end data pipeline**:
- Ethical web scraping from RemoteOK  
- Data cleaning & preprocessing  
- Exploratory Data Analysis (EDA)  
- Visual insights for job market trends  

**Objective:**  
Understand in-demand roles, skills, locations, and hiring patterns in remote jobs.


## 1. Imports

In [None]:

import os
import re
import time
import random
from datetime import datetime, timedelta
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup


## 2. Configuration

In [None]:

BASE_URL = "https://remoteok.com/remote-{}-jobs"
CATEGORIES = ["engineer", "management", "design", "financial", "marketing"]

TOTAL_JOB_LIMIT = 500
PAGE_LOAD_DELAY = 5
SCROLL_DELAY = 3
CATEGORY_DELAY_RANGE = (2, 3)
MAX_SCROLLS_PER_CATEGORY = 5
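
The delays above space out requests. A related courtesy, sketched here under stated assumptions, is to check a site's `robots.txt` rules before fetching a URL. The helper `is_allowed` and the `SAMPLE_RULES` list are hypothetical illustrations and are not RemoteOK's actual rules. The sketch uses the standard-library `urllib.robotparser` and parses rules from a list of lines so that it runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not taken from any real site).
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(SAMPLE_RULES, "https://example.com/remote-engineer-jobs"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/data"))
```

For a live check, `RobotFileParser` can instead be pointed at the real file with `set_url(...)` followed by `read()`.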


## 3. Selenium Driver Setup

In [None]:

def create_driver():
    options = Options()
    options.add_argument("--start-maximized")
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


## 4. Scraping RemoteOK

In [None]:

def scrape_remoteok():
    driver = create_driver()
    all_jobs = []

    for category in CATEGORIES:
        if len(all_jobs) >= TOTAL_JOB_LIMIT:
            break

        url = BASE_URL.format(category)
        print(f"Opening category: {category}")
        driver.get(url)
        time.sleep(PAGE_LOAD_DELAY)

        for _ in range(MAX_SCROLLS_PER_CATEGORY):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(SCROLL_DELAY)

        soup = BeautifulSoup(driver.page_source, "html.parser")
        job_rows = soup.find_all("tr", class_="job")

        for job in job_rows:
            if len(all_jobs) >= TOTAL_JOB_LIMIT:
                break

            title = job.get("data-position")
            company = job.get("data-company")

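            # data-location is a plain string attribute; only the fallback is a tag
            location = job.get("data-location")
            if not location:
                loc_tag = job.find("div", class_="location")
                location = loc_tag.get_text(strip=True) if loc_tag else "Remote"
            location = re.sub(r"[^\w\s,]", "", location).strip() or "Remote"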

            skill_tags = job.find_all(["a", "span"], class_=["tag", "skill", "badge"])
            skills = ", ".join([s.get_text(strip=True) for s in skill_tags])

            time_tag = job.find("time")
            raw_date = time_tag.get_text(strip=True) if time_tag else ""
            # Require digits followed by "d"; a bare "d" check would also hit "today"
            age_match = re.search(r"(\d+)\s*d", raw_date)
            if age_match:
                days_ago = int(age_match.group(1))
                date_posted = datetime.today() - timedelta(days=days_ago)
            else:
                date_posted = datetime.today()

            link_tag = job.find("a", class_="preventLink")
            job_url = "https://remoteok.com" + link_tag["href"] if link_tag else "N/A"

            if title and company:
                all_jobs.append({
                    "Job Title": title,
                    "Company Name": company,
                    "Skills": skills,
                    "Location": location,
                    "Date Posted": date_posted,
                    "Job URL": job_url
                })

        time.sleep(random.uniform(*CATEGORY_DELAY_RANGE))

    driver.quit()
    return pd.DataFrame(all_jobs)
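
The scraper above converts relative age labels such as `3d` into dates. The sketch below isolates that step into a small, testable helper. The function name `parse_relative_age`, the `UNIT_TO_HOURS` table, and the set of unit suffixes (`h`, `d`, `w`, `mo`, `yr`) are assumptions for illustration; the exact label format used by the site may differ.

```python
import re
from datetime import datetime, timedelta

# Assumed unit suffixes, expressed in hours (months and years are approximations).
UNIT_TO_HOURS = {"h": 1, "d": 24, "w": 24 * 7, "mo": 24 * 30, "yr": 24 * 365}


def parse_relative_age(text, now):
    """Convert a label like '3d' or '12h' into a datetime relative to `now`.

    Returns None when the label does not match the assumed format.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(h|d|w|mo|yr)\s*", text or "")
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(hours=amount * UNIT_TO_HOURS[unit])


reference = datetime(2024, 1, 10, 12, 0)
print(parse_relative_age("3d", reference))
print(parse_relative_age("today", reference))
```

Passing `now` explicitly keeps the helper deterministic, which makes it easier to test than calling `datetime.today()` inside the loop.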


## 5. Run Scraper

In [None]:

df_raw = scrape_remoteok()
df_raw.head()


## 6. Data Cleaning

In [None]:

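# .copy() gives an independent frame, so the assignment below does not raise SettingWithCopyWarning
df = df_raw.drop_duplicates(subset=["Job Title", "Company Name", "Job URL"]).copy()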
df["Date Posted"] = pd.to_datetime(df["Date Posted"], errors="coerce")
df.to_csv("remoteok_jobs_cleaned.csv", index=False)
df.head()


## 7. Visualization

In [None]:

os.makedirs("visuals", exist_ok=True)

# Top job titles
top_titles = df["Job Title"].value_counts().head(10)
top_titles.plot(kind="barh", title="Top 10 Job Titles")
plt.tight_layout()
plt.savefig("visuals/top_job_titles.png")
plt.show()

# Jobs over time
jobs_per_day = df.groupby(df["Date Posted"].dt.date).size()
jobs_per_day.plot(marker="o", title="Jobs Posted Over Time")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("visuals/jobs_over_time.png")
plt.show()


## 8. Skill Demand Analysis

In [None]:

COMMON_SKILLS = [
    "python","java","sql","aws","docker","react","javascript",
    "node","c++","c#","ruby","php","html","css","django",
    "machine learning","data","ai","devops"
]

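# Counter is already imported in section 1.
# Match whole terms only, so "java" does not count "javascript" and "ai" does not count "maintainer".
skill_patterns = {
    skill: re.compile(rf"(?<!\w){re.escape(skill)}(?!\w)") for skill in COMMON_SKILLS
}

skills_found = []
for title in df["Job Title"].astype(str).str.lower():
    for skill, pattern in skill_patterns.items():
        if pattern.search(title):
            skills_found.append(skill)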

skill_counts = Counter(skills_found)

if skill_counts:
    skills, counts = zip(*skill_counts.most_common(10))
    plt.figure(figsize=(10,6))
    plt.bar(skills, counts)
    plt.xticks(rotation=45, ha="right")
    plt.title("Top 10 In-Demand Skills (Inferred from Job Titles)")
    plt.xlabel("Skill")
    plt.ylabel("Number of Jobs")
    plt.tight_layout()
    plt.savefig("visuals/top_10_skills.png")
    plt.show()
else:
    print("No skills detected.")
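
The chart above infers skills from job titles. The scraper also stores each listing's tags in the `Skills` column as a comma-separated string, so a complementary count can be taken from those tags directly. The helper name `count_skill_tags` and the sample rows are hypothetical; this is a minimal sketch, not part of the original analysis.

```python
from collections import Counter


def count_skill_tags(skill_strings):
    """Count how many listings mention each tag (case-insensitive).

    Each element is a comma-separated string such as "Python, AWS".
    A tag is counted at most once per listing.
    """
    counts = Counter()
    for raw in skill_strings:
        if not isinstance(raw, str):
            continue  # skip missing values such as None or NaN
        tags = {tag.strip().lower() for tag in raw.split(",") if tag.strip()}
        counts.update(tags)
    return counts


# Hypothetical sample rows for illustration.
sample = ["Python, AWS, python", "aws, Docker", "", None]
print(count_skill_tags(sample).most_common(3))
```

On the notebook's data this could be called as `count_skill_tags(df["Skills"])`, and the result plotted the same way as `skill_counts`.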


## 9. Job Type Distribution

In [None]:

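# The scraper in section 4 does not collect a "Job Type" column,
# so check for it first instead of failing with a KeyError.
if "Job Type" in df.columns:
    job_type_counts = df["Job Type"].value_counts()

    plt.figure(figsize=(6,6))
    plt.pie(
        job_type_counts,
        labels=job_type_counts.index,
        autopct="%1.1f%%",
        startangle=140
    )
    plt.title("Job Type Distribution")
    plt.axis("equal")
    plt.tight_layout()
    plt.savefig("visuals/job_type_distribution.png")
    plt.show()
else:
    print('No "Job Type" column found; skipping this chart.')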
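
The scraper in section 4 does not collect a `Job Type` column, so this chart needs one to be supplied. One possible approach, sketched below as an assumption and not as the notebook's method, is to approximate the job type by keyword matching on the title or tag text. The function `infer_job_type`, its keyword patterns, and the category labels are all hypothetical.

```python
import re

# Hypothetical keyword patterns, checked in order; the first match wins.
JOB_TYPE_PATTERNS = [
    ("Part-time", re.compile(r"\bpart[\s-]?time\b", re.IGNORECASE)),
    ("Full-time", re.compile(r"\bfull[\s-]?time\b", re.IGNORECASE)),
    ("Contract", re.compile(r"\b(contract|contractor|freelance)\b", re.IGNORECASE)),
    ("Internship", re.compile(r"\bintern(ship)?\b", re.IGNORECASE)),
]


def infer_job_type(text):
    """Return a coarse job-type label, or "Unknown" if no keyword matches."""
    for label, pattern in JOB_TYPE_PATTERNS:
        if pattern.search(text or ""):
            return label
    return "Unknown"


print(infer_job_type("Senior Engineer (Full-Time)"))
print(infer_job_type("International Sales Lead"))
```

A column could then be derived with something like `df["Job Type"] = (df["Job Title"] + " " + df["Skills"]).map(infer_job_type)`. Listings labeled "Unknown" should be reported separately and not assumed to be full-time.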


## 10. Top Hiring Locations

In [None]:

top_locations = df["Location"].value_counts().head(10)

plt.figure(figsize=(10,6))
plt.barh(top_locations.index[::-1], top_locations.values[::-1])
plt.title("Top 10 Job Locations")
plt.xlabel("Number of Jobs")
plt.ylabel("Location")
plt.tight_layout()
plt.savefig("visuals/top_10_locations.png")
plt.show()



## ✅ Conclusion

**Key Insights:**
- Engineering and software roles dominate remote hiring  
- Python, JavaScript, and cloud-related skills are in high demand  
- Majority of roles are full-time (this relies on job-type data, which the scraper in section 4 does not collect)  
- Remote-first hiring is globally distributed  

**Future Scope:**
- Salary trend analysis  
- Company-wise hiring patterns  
- NLP on job descriptions  

📌 *This notebook is suitable for academic submission and project demonstrations.*
