# Car Web Scraping Data Project

## Web Scraping Using Python

# Description:

In this data science project, I leveraged Python and its powerful libraries, including Beautiful Soup and Requests, to scrape valuable information from car-related websites. The goal of the project was to gather data about various car models, their specifications, prices, and other relevant details.

** Scope: The project covered identifying and scraping data from a car listing website, where each listing represents a specific car model or category. The scraped data included text descriptions and numeric specifications. I cleaned and structured the scraped data so it could be used for further analysis.

** Challenges: While scraping the data, I encountered challenges related to website structures, varying layouts, and potential rate-limiting or IP-blocking from the websites due to excessive requests. To mitigate these challenges, I implemented measures such as using user-agents and incorporating time delays between requests.

** Outcome: The project resulted in a well-organized dataset containing a comprehensive collection of car information. This dataset serves as a valuable resource for conducting analyses such as price trends, feature comparisons, and sentiment analysis based on customer reviews. The scraped data can also be integrated into machine learning models for predictive analytics related to car pricing and customer preferences.

** Key Skills: Web scraping, HTML parsing, data cleaning, data structuring, Python (Beautiful Soup, Requests), data analysis, data visualization.

** Future Directions: In the future, this project can be expanded to include data from more websites, encompassing a broader range of car models and brands. Additionally, advanced techniques such as implementing automated scraping scripts, using proxies, and handling dynamic web content (JavaScript-rendered pages) could enhance the project's scope and capabilities.

# Import Libraries

In [62]:
from bs4 import BeautifulSoup
import requests
import urllib3
import certifi


import pandas as pd

In [63]:
# car data
car_dict = {
    'car_id': [],
    'description': [],
    'amount': [],
    'region': [],
    'make': [],
    'model': [],
    'year_of_man': [],
    'color': [],
    'condition': [],
    'mileage': [],
    'engine_size': [],
    'selling_cond': [],
    'bought_cond': [],
    'trim': [],
    'drive_train': [],
    'reg_city': [],
    'seat': [],
    'num_cylinder': [],
    'horse_power': []
}

In [64]:
# Getting other data

# Maps each output field to the label text shown on the car detail page
DETAIL_LABELS = {
    'make': 'Make',
    'model': 'Model',
    'year_of_man': 'Year of manufacture',
    'color': 'Colour',
    'condition': 'Condition',
    'mileage': 'Mileage',
    'engine_size': 'Engine Size',
    'selling_cond': 'Selling Condition',
    'bought_cond': 'Bought Condition',
    'trim': 'Trim',
    'drive_train': 'Drivetrain',
    'reg_city': 'Registered city',
    'seat': 'Seats',
    'num_cylinder': 'Number of Cylinders',
    'horse_power': 'Horse Power'
}

def get_details(car_id):
    """Fetch a car's detail page and return its specification fields."""
    res = requests.get(f"https://www.cars45.com/{car_id}")

    soup = BeautifulSoup(res.content, "html.parser")

    car_details = {}
    for field, label in DETAIL_LABELS.items():
        # Find the <span> tag with the label text
        label_span = soup.find('span', string=label)
        # If the <span> is found, its value is in the preceding <p> sibling
        value_tag = label_span.find_previous_sibling('p') if label_span else None
        # Fall back to an empty string when the label or its value is missing
        car_details[field] = value_tag.get_text().strip() if value_tag else ''

    return car_details
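
The lookup in `get_details()` relies on each value sitting in a `<p>` tag that comes immediately before its `<span>` label. The sketch below shows that pattern on an inline snippet, so it can be tried without a network connection. The markup and the `read_field` helper are hypothetical, inferred from the lookups above rather than copied from the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inferred from the lookups in get_details():
# each value sits in a <p> that comes right before its <span> label.
SAMPLE_HTML = """
<div>
  <div><p>Toyota</p><span>Make</span></div>
  <div><p>Camry</p><span>Model</span></div>
  <div><p> 85,000 km </p><span>Mileage</span></div>
</div>
"""


def read_field(soup, label):
    """Return the text of the <p> preceding the <span> whose text is `label`, or ''."""
    label_span = soup.find("span", string=label)
    if label_span is None:
        return ""
    value_tag = label_span.find_previous_sibling("p")
    return value_tag.get_text().strip() if value_tag else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# "Trim" is absent from the sample, so it falls back to an empty string
sample_details = {
    label: read_field(soup, label)
    for label in ("Make", "Model", "Mileage", "Trim")
}
print(sample_details)
```

If the site changes its layout, for example by moving the value after the label, this is the one place that would need to change.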


In [65]:
# function to append every car listing to car_dict
def get_info(car_listings):
    # loop through all list
    for car in car_listings:
        car_id = car['href'].replace('/','')
        description = car.find('p', class_="car-feature__name").get_text().strip()
        amount = car.find('p', class_="car-feature__amount").get_text().strip()
        region = car.find('p', class_="car-feature__region").get_text().strip()
        
        car_details = get_details(car_id)
        
        car_dict['car_id'].append(car_id)
        car_dict['description'].append(description)
        car_dict['amount'].append(amount)
        car_dict['region'].append(region)
        
        # detail keys match the car_dict column names, so append them in one loop
        for key, value in car_details.items():
            car_dict[key].append(value)

In [66]:
def main():
    # loop through the pages
    
#     http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    for page in range(1,201):
        res = requests.get(f'https://www.cars45.com/listing?page={page}')
    
        soup = BeautifulSoup(res.content, 'html.parser')
        car_listings = soup.find_all('a', class_='car-feature car-feature--wide-mobile')
            
        # invoke the get_info()
        get_info(car_listings)
    
    return car_dict
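
The Challenges note mentions user-agents and time delays, but the requests in this notebook are sent with default headers and no pause. One possible way to add both is sketched below. The `fetch_html` and `make_session` helpers, the User-Agent string, the one-second delay, and the ten-second timeout are all assumptions for illustration, not values taken from the project.

```python
import time

import requests

# Assumed values for illustration; adjust them to the site's terms of use.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; car-scraper-demo/0.1)"}


def make_session():
    """Create a requests.Session so connections are reused across pages."""
    return requests.Session()


def fetch_html(url, session, delay_seconds=1.0, timeout=10):
    """Wait `delay_seconds`, then GET `url` through `session` and return the body bytes."""
    time.sleep(delay_seconds)  # pause before every request to avoid hammering the server
    res = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return res.content
```

Inside `main()` and `get_details()`, `requests.get(url).content` could then be swapped for `fetch_html(url, session)` with a session created once via `make_session()`.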

In [67]:
car_data = main()

ConnectionError: HTTPSConnectionPool(host='www.cars45.com', port=443): Max retries exceeded with url: /listing?page=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002C27BC028F0>: Failed to establish a new connection: [Errno 11002] getaddrinfo failed'))
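
The `getaddrinfo failed` message in the traceback means the hostname could not be resolved when this cell ran, so `main()` raised before returning anything and the later cells fail with `NameError`. A minimal sketch of one way to keep whatever was collected before a network failure follows. `scrape_pages` and its `fetch_page` callback are hypothetical names, not part of the notebook.

```python
import requests


def scrape_pages(page_numbers, fetch_page):
    """Call `fetch_page(page)` for each page and stop at the first network failure.

    Returns (results, failed_page); `failed_page` is None when every page succeeded.
    """
    results = []
    for page in page_numbers:
        try:
            results.append(fetch_page(page))
        except requests.exceptions.RequestException as exc:
            # ConnectionError, Timeout and HTTPError all derive from RequestException
            print(f"Stopping at page {page}: {exc}")
            return results, page
    return results, None
```

Returning the failed page number makes it possible to resume from that page later instead of restarting from page 1.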

# Convert Data to DataFrame

In [47]:
car_df = pd.DataFrame(car_data)
car_df

NameError: name 'car_data' is not defined

# Data Inspection and Wrangling

In [68]:
car_df.info()

NameError: name 'car_df' is not defined

In [69]:
# check number of duplicate
car_df.duplicated().sum()

NameError: name 'car_df' is not defined

In [70]:
# remove duplicate
car_df.drop_duplicates(inplace=True)

NameError: name 'car_df' is not defined

In [71]:
car_df.head()

NameError: name 'car_df' is not defined

In [None]:
# replace '/' with ''
car_df['car_id'] = car_df['car_id'].str.replace('/','')

In [None]:
# replace '₦' with ''
car_df['amount'] = car_df['amount'].str.replace('₦','')

In [None]:
# replace ',' with ''
car_df['amount'] = car_df['amount'].str.replace(',','')

In [None]:
# replace 'km' with ''
car_df['mileage'] = car_df['mileage'].str.replace('km','')
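
After the symbols are stripped, `amount` and `mileage` are still strings, so numeric operations such as averages will not work on them yet. The sketch below shows one way to convert them with `pd.to_numeric`, using made-up sample rows. The `to_number` helper is hypothetical, and the thousands separator in the mileage values is an assumption about the raw format.

```python
import pandas as pd

# Made-up rows in the raw string format the scraper is assumed to collect
sample_df = pd.DataFrame({
    "amount": ["₦4,500,000", "₦12,750,000", ""],
    "mileage": ["85,000 km", "120,450 km", ""],
})


def to_number(series, unwanted):
    """Strip each substring in `unwanted`, then convert to numbers (NaN where empty)."""
    cleaned = series
    for token in unwanted:
        cleaned = cleaned.str.replace(token, "", regex=False)
    return pd.to_numeric(cleaned.str.strip(), errors="coerce")


sample_df["amount"] = to_number(sample_df["amount"], ["₦", ","])
sample_df["mileage"] = to_number(sample_df["mileage"], ["km", ","])
print(sample_df.dtypes)
```

With `errors="coerce"`, the empty strings that `get_details()` returns for missing fields become `NaN` rather than raising an error.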

In [None]:
car_df.head()

In [None]:
car_df

# Export into CSV

In [None]:
# index=False leaves out the row index, which has gaps after drop_duplicates()
car_df.to_csv("car45_data.csv", index=False)