
Steps for scraping the [The Muse](https://www.themuse.com/) site.\
We rely as much as possible on the JSON returned by the site's API calls in order to get structured data.

### There are 3 main steps.

1.  Query for a job type, for example "data engineer".
    - method: GET
    - url: https://www.themuse.com/api/search-renderer/jobs?
    - params: ctsEnabled=false&query=Data+Engineer&preference=krcbqorfvz&limit=20&timeout=5000
    - retrieve the JSON that lists the job offers. Fields kept:
        - job_title
        - company.short_name
        - short_title
        - posted_at
        - cursor (the last cursor is needed for pagination: start_after = last cursor)
        - has_more (needed for pagination)

2.  For each job in the JSON received, send a request to get that job's HTML page
    - method: GET
    - url: https://www.themuse.com/jobs/
    - params (path segments): [hit.company.short_name]/[hit.short_title]

3.  In the HTML, extract the JSON
    - from the `<script id="__NEXT_DATA__" type="application/json"></script>` tag

### Data kept

For now we keep the following data:


In [None]:
import aiohttp
from string import Template
from typing import List

site_url: str = "https://www.themuse.com"
search_url: str = "/api/search-renderer/jobs"
job_url: Template = Template("/jobs/$company/$job_name")
user_agents: List[str] = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Windows; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36",
]
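
The `job_url` template can be tried in isolation to see how a job page URL is assembled. A small sketch, reusing the template and base URL from the cell above and a company/job pair that appears in this notebook's output:

```python
from string import Template

# Same template and base URL as in the cell above.
site_url = "https://www.themuse.com"
job_url = Template("/jobs/$company/$job_name")

# Values taken from one of the search hits listed in this notebook's output.
path = job_url.safe_substitute(company="arcadia", job_name="data-engineer-950ad7")
full_url = f"{site_url}{path}"

# safe_substitute leaves unknown placeholders untouched instead of raising KeyError.
partial = job_url.safe_substitute(company="arcadia")
```

Because `safe_substitute` never raises, a missing field yields a URL that still contains `$job_name`; `substitute` may be preferable if such a case should fail loudly.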


In [9]:
from typing import Dict, List
import re
import random
from itertools import cycle


user_agent_cycle = cycle(user_agents)

async def requests_jobs_summaries(query: str, cursor: str | None = None) -> Dict | None:
    """Send a request to list the jobs.
        When paginating, start_after=<cursor of the last job> is added to the params.

    Args:
        query (str): search query; runs of whitespace are replaced with +
        cursor (str | None, optional): cursor to start from when paginating. Defaults to None.

    Returns:
        Dict | None: JSON payload returned by the API, or None if the request failed
    """
    query = re.sub(r"\s+", "+", query)
    search_params: Dict = {
        "ctsEnabled": "false",
        "query": query,
        "preference": "krcbqorfvz",
        "limit": 20,
        "timeout": 5000,
    }
    if cursor:
        search_params.update({"start_after": cursor})

    # proxy: str = random.choice(proxies)
    user_agent = next(user_agent_cycle)
    headers = {"User-Agent": user_agent, "Accept": "application/json"}
    url: str = f"{site_url}{search_url}"

    async with aiohttp.ClientSession() as session:
        try:
            async with session.get(
                url=url, params=search_params, headers=headers
            ) as resp:
                if resp.status == 200:
                    return await resp.json()

        except Exception as exc:
            print(exc)


In [10]:
# extract the data to keep from the jobs
# glom can be used to extract fields from the JSON
from typing import Tuple
from glom import glom
from dataclasses import dataclass, field


# create the entities that structure the data and allow it to be stored in a database
@dataclass
class JobUrl:
    company: str
    job_name: str
    url: str = field(init=False)
    processed: bool = False

    def __post_init__(self):
        self.url = job_url.safe_substitute(company=self.company, job_name=self.job_name)


async def extract_summaries_infos(jobs_summaries: Dict) -> Dict:  # type: ignore
    """Extract the jobs from the response of the API that lists the jobs

    Args:
        jobs_summaries (Dict): dictionary returned by the API

    Returns:
        Dict: dictionary holding the information resulting from the scrape
    """
    job_specs = {
        "company": "hit.company.short_name",
        "job_name": "hit.short_title",
        "score": "score",
        "cursor": "cursor",
    }
    job_list: List[Dict] = [
        glom(job, job_specs) for job in jobs_summaries.get("hits") or []
    ]
    job_list = sorted(job_list, key=lambda k: k["score"], reverse=True)
    job_urls: List[JobUrl] = [
        JobUrl(company=job.get("company"), job_name=job.get("job_name"))
        for job in job_list
    ]

    # the last cursor anchors the next page; None when the page has no hits
    cursor: str | None = job_list[-1].get("cursor") if job_list else None
    has_more: bool = jobs_summaries.get("has_more", False)

    return dict(has_more=has_more, cursor=cursor, job_urls_ls=job_urls)
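
To make explicit what a spec such as `"hit.company.short_name"` does, here is a plain-Python sketch of dotted-path extraction. `get_path` is a hypothetical helper, not part of glom, and `sample_hit` is an invented record that only reuses the field names and values seen in this notebook.

```python
from typing import Any


def get_path(data: dict, dotted_path: str, default: Any = None) -> Any:
    """Follow a dotted path through nested dicts; return default if a key is missing."""
    current: Any = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented record shaped like one entry of the "hits" list.
sample_hit = {
    "hit": {
        "company": {"short_name": "arcadia"},
        "short_title": "data-engineer-950ad7",
    },
    "score": 85.663956,
    "cursor": "85.663956,1730507033000,4484a577-86ad-49cd-af5a-4d3d2fc7ec54",
}

job_specs = {
    "company": "hit.company.short_name",
    "job_name": "hit.short_title",
    "score": "score",
    "cursor": "cursor",
}

# Flatten the nested record into the keys named by the spec.
flat = {name: get_path(sample_hit, path) for name, path in job_specs.items()}
```

One difference in behaviour: this helper returns a default for a missing key, whereas glom raises an error by default, which can be the safer choice when a missing field should stop the scrape.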


In [11]:
# call the method a first time
from pprint import pprint

job_urls_ls: List[JobUrl] = []
count: int = 0
cursor: str | None = None
query: str = "Data engineer"
has_more: bool = True

while has_more:
    # while count < 10:

    job_api_reponse: Dict | None = await requests_jobs_summaries(
        query=query, cursor=cursor
    )
    if job_api_reponse is None:
        # the request failed or returned a non-200 status: stop paginating
        break
    scrap_result: Dict = await extract_summaries_infos(job_api_reponse)

    has_more = scrap_result.get("has_more")
    cursor = scrap_result.get("cursor")
    job_urls_ls.extend(scrap_result.get("job_urls_ls"))

    print(has_more)
    print(cursor)

    # simulate has_more=False after 3 iterations
    count += 1
    if count > 2:
        has_more = False

print(len(job_urls_ls))
job_urls_ls

True
85.663956,1730507033000,4484a577-86ad-49cd-af5a-4d3d2fc7ec54
True
65.35175,1690931340000,2262703c-b2b6-4ade-9bc3-7022f659b9e0
True
64.37459,1731630288000,69611e20-18fb-46fa-9724-822e95f01266
60


[JobUrl(company='arcadia', job_name='data-engineer-950ad7', url='/jobs/arcadia/data-engineer-950ad7', processed=False),
 JobUrl(company='kyndryl', job_name='data-engineer-33d845', url='/jobs/kyndryl/data-engineer-33d845', processed=False),
 JobUrl(company='coinbase', job_name='data-engineer-0b1092', url='/jobs/coinbase/data-engineer-0b1092', processed=False),
 JobUrl(company='kyndryl', job_name='data-engineer-f6b05a', url='/jobs/kyndryl/data-engineer-f6b05a', processed=False),
 JobUrl(company='healthfirst', job_name='data-engineer-287e98', url='/jobs/healthfirst/data-engineer-287e98', processed=False),
 JobUrl(company='atlassian', job_name='data-engineer-4a4723', url='/jobs/atlassian/data-engineer-4a4723', processed=False),
 JobUrl(company='wealthfront', job_name='data-engineer-931218', url='/jobs/wealthfront/data-engineer-931218', processed=False),
 JobUrl(company='arcadia', job_name='senior-data-engineer', url='/jobs/arcadia/senior-data-engineer', processed=False),
 JobUrl(company='a

In [12]:
import asyncio
import aiohttp
from aiohttp import ClientSession


async def fetch_url(sem, session, url):
    """
    Fetch a URL, using a semaphore to limit the number of simultaneous requests.
    """
    async with sem:  # limit the number of simultaneous connections
        try:
            async with session.get(url) as response:
                data = await response.text()
                print(f"Fetched {url} with status {response.status}")
                return data
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None


async def fetch_all(urls: List[str], max_concurrent_requests: int = 5):
    """
    Fetch several URLs concurrently, with a cap on simultaneous requests.
    """
    # Initialise the semaphore
    sem = asyncio.Semaphore(max_concurrent_requests)

    # Create a shared HTTP session
    async with ClientSession() as session:
        # Create one coroutine per URL
        tasks = [fetch_url(sem, session, url) for url in urls]

        # Run the tasks concurrently
        results = await asyncio.gather(*tasks)
        return results
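
The effect of the semaphore can be checked without sending any request. In this sketch, `run_limited` is a hypothetical helper in which `asyncio.sleep` stands in for the HTTP call; it records the highest number of jobs that were running at the same time.

```python
import asyncio


async def run_limited(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks dummy jobs and return the peak number running at once."""
    sem = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def job() -> None:
        nonlocal running, peak
        async with sem:  # at most max_concurrent jobs get past this line
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for the HTTP request
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak
```

In a notebook cell this can be called with `await run_limited(20, 5)`; in a plain script, `asyncio.run(run_limited(20, 5))` does the same. The returned peak never exceeds `max_concurrent`, even though all 20 coroutines are scheduled at once.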


In [13]:

urls = [f"{site_url}{job.url}" for job in job_urls_ls]

results = await fetch_all(urls)
results

Fetched https://www.themuse.com/jobs/coinbase/data-engineer-0b1092 with status 200
Fetched https://www.themuse.com/jobs/healthfirst/data-engineer-287e98 with status 200
Fetched https://www.themuse.com/jobs/kyndryl/data-engineer-33d845 with status 200
Fetched https://www.themuse.com/jobs/arcadia/data-engineer-950ad7 with status 200
Fetched https://www.themuse.com/jobs/kyndryl/data-engineer-f6b05a with status 200
Fetched https://www.themuse.com/jobs/atlassian/data-engineer-4a4723 with status 200
Fetched https://www.themuse.com/jobs/wealthfront/data-engineer-931218 with status 200
Fetched https://www.themuse.com/jobs/arcadia/senior-data-engineer with status 200
Fetched https://www.themuse.com/jobs/nationwideinsurance/data-engineer-specialist with status 200
Fetched https://www.themuse.com/jobs/atlassian/principal-data-engineer-bfa9cb with status 200
Fetched https://www.themuse.com/jobs/kyndryl/cloud-data-engineer with status 200
Fetched https://www.themuse.com/jobs/kyndryl/azure-data-engi

['<!DOCTYPE html><html lang="en"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:site" content="@TheMuse"/><meta property="og:type" content="website"/><meta property="og:image" content="https://www.themuse.com/static/images/muse-og-image.png"/><meta property="og:image:alt" content="The Muse Logo"/><meta property="og:image:width" content="400"/><meta property="og:image:height" content="400"/><meta property="og:locale" content="en_US"/><meta property="og:site_name" content="The Muse"/><title>Data Engineer at Arcadia | The Muse</title><meta name="robots" content="index,follow"/><meta name="description" content="Find our Data Engineer job description for Arcadia,  as well as other career opportunities that the company is hiring for."/><meta property="og:title" content="Data Engineer at Arcadia | The Muse"/><meta property="og:description" content="Find our Data Engi

In [None]:
from selectorlib import Extractor
import re

extractor = Extractor.from_yaml_file('rules_muse.yaml')

def scrape_page(html: str):
    try:
        # Apply the rules defined in the YAML file
        data = extractor.extract(html)
        return data
    except Exception as e:
        print(f"Error: {e}")
        return None

@dataclass
class JobMuse:
    company_size: str
    company_domain: List[str]  # received as a raw string, split in __post_init__
    contenu: str

    def __post_init__(self):
        # strip the "Size:" and "Industry:" labels from the scraped text
        self.company_size = re.sub(r"Size:\s*", "", self.company_size)
        industries: str = re.sub(r"Industry:\s*", "", self.company_domain)
        self.company_domain = industries.split(",")
    

ModuleNotFoundError: No module named 'selecctorlib'

In [None]:
job_text_ls: List[JobMuse] = []
for page in results[:5]:
    scraped = scrape_page(page)
    if scraped is None:
        continue
    job_text_ls.append(JobMuse(**scraped))

pprint(job_text_ls)
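
The label clean-up done in `JobMuse.__post_init__` can be tried on its own. A sketch under assumptions: `strip_label` and `split_industries` are hypothetical helpers, and the two input strings are invented examples of what the scraped fields might look like. Unlike a bare `split(',')`, this version also trims the spaces around each item.

```python
import re


def strip_label(text: str, label: str) -> str:
    """Remove a leading 'Label:' prefix and surrounding whitespace."""
    return re.sub(rf"^\s*{re.escape(label)}\s*:\s*", "", text).strip()


def split_industries(text: str) -> list[str]:
    """Drop the 'Industry:' label, then split on commas and trim each item."""
    cleaned = strip_label(text, "Industry")
    return [part.strip() for part in cleaned.split(",") if part.strip()]


# Invented inputs, for illustration only.
size = strip_label("Size: 1001 - 5000 employees", "Size")
industries = split_industries("Industry: Technology, Financial Services")
```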