[Website](https://jiji.ng/cars)

### Importing Necessary Libraries
- requests for sending HTTP requests, such as GET and POST, to interact with web APIs or retrieve web pages.
- pandas for data handling, cleaning, manipulation and analysis.
- BeautifulSoup for parsing HTML and XML documents into a tree structure from which specific elements can be extracted.
- selenium for automating web browsers and user interactions such as clicking buttons or waiting for items to load dynamically. It is also used for scraping sites that require JavaScript rendering.
    - Service manages the ChromeDriver process that Selenium uses to control the Chrome browser.
    - By provides the locator strategies used to find elements on a webpage (e.g., by ID, name, class name, XPath).
    - WebDriverWait explicitly waits for a specific condition to be met before proceeding with browser actions.
    - expected_conditions provides a collection of pre-built conditions for WebDriverWait (e.g., element visibility, clickability).
- time for time-related functions such as adding delays (e.g., time.sleep()), working with timestamps, or measuring execution time.
- tqdm for visualizing the progress of loops in data processing or web scraping.
- json for serializing Python objects into JSON format.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from tqdm import tqdm
import json

### Scraping Car Data

This page uses infinite scrolling, so the logic is:

- scroll as far as possible, then
- loop through all loaded items (cars) to extract their links (href), and finally
- visit each link one by one to extract the car data needed.

Writing the CSV after each iteration is intentional: if an exception (e.g. a page-load timeout) interrupts the run, the data extracted so far is already saved instead of being lost.
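
The scrolling and link-collection steps can also be written so that scrolling stops once the page no longer grows, rather than after a fixed number of iterations. The following is a minimal sketch, not the method used in this notebook: the helper names `scroll_until_stable` and `unique_hrefs` and the `patience` parameter are assumptions made for illustration. It only relies on the `execute_script` method that a Selenium driver already provides.

```python
import time


def scroll_until_stable(driver, pause=0.2, max_scrolls=1500, patience=25):
    """Scroll to the bottom until the page height stops growing.

    Stops early once the height has stayed the same for `patience`
    consecutive scrolls (hypothetical parameter), or after `max_scrolls`
    scrolls at the latest. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    unchanged = 0
    scrolls = 0
    while scrolls < max_scrolls and unchanged < patience:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            unchanged += 1      # nothing new loaded during this pause
        else:
            unchanged = 0       # new items arrived, reset the counter
            last_height = new_height
    return scrolls


def unique_hrefs(hrefs):
    """Drop empty values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if h))
```

With a short pause, a single unchanged height does not prove the list is exhausted, which is why the sketch waits for several unchanged scrolls in a row before stopping. The right `patience` value depends on how quickly the site loads new items.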

In [None]:
url = "https://jiji.ng/cars"

# Local path to the ChromeDriver executable; adjust for your machine
path = "C:/Users/HP/Downloads/chromedriver-win64/chromedriver.exe"

service = Service(path)
driver = webdriver.Chrome(service=service)

driver.get(url)

SCROLL_PAUSE_TIME = 0.2

# Scroll to the bottom repeatedly so the infinite scroller keeps loading more cars
for i in tqdm(range(1500), desc='Scroller'):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE_TIME)

cars = []

# Collect the link of every car loaded so far, dropping empty and duplicate hrefs
links = driver.find_elements(By.XPATH, "//a[@href][ancestor::*[contains(@class, 'b-list-advert__gallery__item')]]")
hrefs = [link.get_attribute('href') for link in links]
all_cars_hrefs = list(dict.fromkeys(href for href in hrefs if href))

with tqdm(total=len(all_cars_hrefs), desc="Scraping Cars' Data") as pbar:
    for href in all_cars_hrefs:
        # Handle errors per car so one failed page does not stop the whole run
        try:
            driver.get(href)
            time.sleep(5)

            # Click the "show more" button, if present, before reading the page
            try:
                show_more = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[@class='fw-button qa-fw-button fw-button--type-primary-link-like fw-button--size-small']"))
                )
                show_more.click()
                time.sleep(1)
            except Exception:
                pass

            # Parse the page source only after the click so the soup reflects the updated page
            soup = BeautifulSoup(driver.page_source, 'html.parser')

            # Attribute names (keys) and their cleaned values
            caption = soup.find_all('div', class_='b-advert-attribute__key')
            header = [cap.text.strip() for cap in caption]

            value = soup.find_all('div', class_='b-advert-attribute__value')
            body = [val.text.strip() for val in value]
            item = [b.replace('\n', '').replace('     ', '').strip() for b in body]

            specs = dict(zip(header, item))

            price_element = driver.find_element(By.XPATH, '//div[@itemprop="price"]')
            specs['Price'] = float(price_element.get_attribute("content"))

            # Keep only the first part of the region text (before the first comma)
            region = soup.find('div', class_='b-advert-info-statistics b-advert-info-statistics--region')
            specs['Location'] = region.text.strip().split(',')[0] if region else 'Unknown'

            cars.append(specs)

            # Rewrite the CSV after every car so a later failure cannot lose earlier rows
            df = pd.DataFrame(cars)
            df.to_csv('vroomvroom.csv')
        except Exception as e:
            print(f"Error processing {href}: {e}")
        finally:
            pbar.update(1)

driver.quit()

print(f"Successfully scraped {len(cars)} cars across the entire iteration.")

Scroller: 100%|██████████| 1500/1500 [07:50<00:00,  3.19it/s]
Scraping Cars' Data:  76%|███████▌  | 19/25 [06:13<01:50, 18.48s/it]
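
If a run stops partway through, the CSV written after each iteration can serve as a checkpoint for resuming. The sketch below shows one way this could look using only the standard library. It assumes each saved row also stores the advert's link under a `URL` key, which the scraper above does not record, so that key and the helper names `save_checkpoint`, `load_checkpoint` and `pending_hrefs` are hypothetical.

```python
import csv
import os


def save_checkpoint(rows, path):
    """Write all rows to a CSV, using the union of their keys as columns."""
    # Cars expose different attributes, so collect every key seen so far
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)


def load_checkpoint(path):
    """Return previously saved rows, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def pending_hrefs(all_hrefs, saved_rows, url_key="URL"):
    """Return the links that have no saved row yet (url_key is hypothetical)."""
    done = {row.get(url_key) for row in saved_rows}
    return [h for h in all_hrefs if h not in done]
```

Rows read back with `csv.DictReader` contain only strings, so a numeric column such as `Price` would need converting again before analysis.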

Store car data in JSON format

In [14]:
j_path = "vroomvroom.json"

with open(j_path, 'w') as file:
    json.dump(cars, file, indent=4)

print(f"Data successfully saved to {j_path}")

Data successfully saved to vroomvroom.json
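
To confirm the file can be read back, the saved JSON can be loaded again and compared with the in-memory list. This is a small sketch under assumptions: the helper names `save_cars` and `load_cars` are made up for illustration, and `ensure_ascii=False` is an optional choice that keeps any non-ASCII characters readable in the file.

```python
import json


def save_cars(cars, path):
    """Write the list of car dicts as indented, UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(cars, file, indent=4, ensure_ascii=False)


def load_cars(path):
    """Read the list of car dicts back from a JSON file."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)
```

Unlike a CSV read back as plain text, JSON preserves the distinction between numbers and strings, so a float `Price` is still a float after loading.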


Take a quick glance at the data

In [15]:
df.head()

| Unnamed: 0 | Second Condition | Make | Model | Year of Manufacture | Trim | Body | Drivetrain | Engine Size | Number of Cylinders | Horse Power | ... | Interior Color | Seats | Registered Car | Exchange Possible | Price | Location | VIN Chassis number | Registered city | Selling Condition | Bought Condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No faults | Ford | Edge | 2013 | Limited 4dr AWD (3.5L 6cyl 6A) | SUV | All Wheel | 3500 cc | 6.0 | 285 hp | ... | Brown | 5.0 | No | No | 12000000.0 | Lagos | | | | |
| 1 | No faults, Unpainted, Original parts | Nissan | Micra | 2005 | | | | | | | ... | Other | | No | Yes | 4000000.0 | Oyo | | | | |
| 2 | First owner, First registration | Lexus | RX | 2002 | | | | | | | ... | Beige | | Yes | | 4200000.0 | Rivers | | | | |
| 3 | | BMW | 7 Series | 2009 | | | | | | | ... | | | | | 12500000.0 | Kaduna | | | | |
| 4 | No faults | Lexus | GX | 2017 | 460 Luxury | SUV | 4x4 | 4600 cc | 8.0 | 301 hp | ... | Other | 7.0 | No | | 58000000.0 | Lagos | `JTJBM7FXXH5******` | | | |


### Data Visualization

...