# Part 1: Scraping

## 1.1: Retrieving All Data

**Here, I use `undetected_chromedriver` to partially bypass the Cloudflare CAPTCHA that appears when browsing "sahibinden.com".**

Vanilla Selenium is normally enough for basic web scraping, but the presence of a human-verification step complicates things.
That is because vanilla Selenium carries flags and tags in its user agent, along with some JavaScript markers, that tell the website being visited that the browser is automated. Many modern websites check for automation software because it is the first line of defence against bot attacks, even though the check itself is trivial.

`undetected_chromedriver` is a Selenium WebDriver that has been modified to present a human-like user agent and that includes some other masking features. This makes it possible to get past the CAPTCHA.
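To make the idea of automation markers concrete, here is a toy sketch of the kind of check a website could run. It is not how Cloudflare or any real service works; the function name `looks_automated` and the hint list are hypothetical. The two signals it uses are real, though: the `navigator.webdriver` JavaScript property is set for WebDriver-controlled browsers, and headless Chrome reports `HeadlessChrome` in its user agent.

```python
# Toy illustration only: real bot-detection services combine many more signals.
# The hint list and function name are assumptions made for this sketch.
AUTOMATION_UA_HINTS = ("HeadlessChrome", "PhantomJS", "Selenium")


def looks_automated(user_agent: str, navigator_webdriver: bool) -> bool:
    """Return True if the client shows an obvious automation marker."""
    # navigator_webdriver stands for the value of `navigator.webdriver`
    # as read by JavaScript running in the page.
    if navigator_webdriver:
        return True
    ua = user_agent.lower()
    return any(hint.lower() in ua for hint in AUTOMATION_UA_HINTS)


plain = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
)
headless = plain.replace("Chrome/", "HeadlessChrome/")

print(looks_automated(headless, navigator_webdriver=True))
print(looks_automated(plain, navigator_webdriver=False))
```

A masked driver aims to look like the second case: an ordinary user agent and no `navigator.webdriver` flag.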

**The data is retrieved and processed into the structure defined in `helpers.py`.**

- For each car brand, the program visits up to 20 search result pages and collects the listing data from each page.
- The search result table also contains ad rows that were getting mixed in with the results, so they are skipped.
- The table is parsed and the listing details for each car are extracted.
- These listing details are converted to a usable `CarListing` object, as specified in `helpers.py`.
- Each `CarListing` object is then appended to a list of `CarListing` objects.
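The steps above can be sketched roughly as follows. `helpers.py` is not shown here, so this is only an approximation: the field names are taken from the printed `CarListing` output below, while `parse_number`, `row_to_listing`, the ad-row test, the cell text formats (`260.750`, `177.500 TL`), and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CarListing:
    # Field names mirror the printed output; the real class lives in helpers.py.
    brand: str
    series: str
    model: str
    ad_title: str
    year: int
    mileage: int
    color: str
    price: float
    ad_date: str
    city: str
    district: str
    url: str


def parse_number(text: str) -> float:
    """Parse Turkish-formatted numbers such as '177.500 TL' or '260.750'."""
    cleaned = text.replace("TL", "").replace(".", "").replace(",", ".").strip()
    return float(cleaned)


def row_to_listing(cells: List[str], brand: str, url: str) -> Optional[CarListing]:
    """Convert one table row to a CarListing; return None for ad rows."""
    # Assumption: ad rows do not have the full set of ten data columns.
    if len(cells) != 10:
        return None
    series, model, title, year, km, color, price, date, city, district = cells
    return CarListing(
        brand=brand, series=series, model=model, ad_title=title,
        year=int(year), mileage=int(parse_number(km)), color=color,
        price=parse_number(price), ad_date=date, city=city,
        district=district, url=url,
    )


rows = [
    ["145", "1.4 TS STD", "Alfa Romeo 145 1.4 TEMİZ VE BAKIMLI", "1998",
     "260.750", "Bej", "177.500 TL", "20 Mayıs 2023", "Bursa", "Nilüfer"],
    ["Sponsored ad"],  # stands in for an ad row mixed into the table
]

listings = []
for cells in rows:
    listing = row_to_listing(cells, brand="Alfa Romeo", url="/ilan/example/detay")
    if listing is not None:  # skip ad rows
        listings.append(listing)

print(listings)
```

Returning `None` for rows that do not match the expected shape keeps the skipping logic in one place, so the calling loop only has to filter.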


In [1]:

# Import webdriver

from selenium import webdriver as wd
import undetected_chromedriver as uc


from helpers import scrape_cars
from helpers import brand_items

root_url = "https://www.sahibinden.com/kategori-vitrin?viewType=Classic&pagingSize=50&sorting=price_asc"

# driver = wd.Chrome()
driver = uc.Chrome()

max_pages = 20

all_cars = []

brands = brand_items


# A brand item looks like this:
# {'name': 'Alfa Romeo',
#                 'href': '/kategori-vitrin?viewType=Classic&pagingSize=50&category=3545&sorting=price_asc',
#                 'count': 24,
#                 'category': 3545}

for brand_item in brands:
    brand_name = brand_item['name']

    # print current brand

    print(f"Scraping {brand_name}...")

    # this is the category id for the brand, we will use it in the url
    categoryId = brand_item['category']

    # 50 listings per page: round up to get this brand's page count, capped at max_pages
    num_pages = min(max_pages, (brand_item['count'] + 49) // 50)

    for page in range(num_pages):
        try:
            print(f"Scraping page {page} for {brand_name}...")
            url = f"{root_url}&pagingOffset={page * 50}&category={categoryId}"
            cars = scrape_cars(url, driver, by='brand', brand_name=brand_name)
            all_cars.extend(cars)
        except Exception as e:
            # print which url we are working on
            print(f"Error on page {page}")
            # url:
            print(url)
            print(e)
            continue
        print()

# url = root_url
# cars = scrape_cars(url, driver)


Scraping Alfa Romeo...
Scraping page 0 for Alfa Romeo...
CarListing(brand='Alfa Romeo', series='145', model='1.4 TS STD', ad_title='Alfa Romeo 145 1.4 TEMİZ VE BAKIMLI', year=1998, mileage=260750, color='Bej', price=177500.0, ad_date='20 Mayıs 2023', city='Bursa', district='Nilüfer', url='/ilan/vasita-otomobil-alfa-romeo-alfa-romeo-145-1.4-temiz-ve-bakimli-1100582023/detay')
CarListing(brand='Alfa Romeo', series='156', model='2.0 TS', ad_title='SAHİBİNDEN UYGUN FİYATA 156', year=1998, mileage=295000, color='Gümüş Gri', price=195000.0, ad_date='18 Mayıs 2023', city='İstanbul', district='Pendik', url='/ilan/vasita-otomobil-alfa-romeo-sahibinden-uygun-fiyata-156-1100167703/detay')
CarListing(brand='Alfa Romeo', series='156', model='2.0 TS', ad_title='Hatasız tertemiz Alfa', year=1998, mileage=344000, color='Füme', price=220000.0, ad_date='24 Mayıs 2023', city='Tekirdağ', district='Süleymanpaşa', url='/ilan/vasita-otomobil-alfa-romeo-hatasiz-tertemiz-alfa-1085480895/detay')
CarListing(bran

In [3]:
all_cars

[CarListing(brand='Skoda', series='Favorit', model='1.3 LX', ad_title="EMSALSİZ 1993 Skoda 1.3 Sadece 2-3 Parça boyalı LPG'li", year=1993, mileage=239000, color='Beyaz', price=105000.0, ad_date='18 Mayıs', city='Adana', district='Sarıçam', url='/ilan/vasita-otomobil-skoda-emsalsiz-1993-skoda-1.3-sadece-2-3-parca-boyali-lpg-li-1100104039/detay'),
 CarListing(brand='Tofaş', series='Şahin', model='S', ad_title='1998 model Şahin s', year=1998, mileage=200000, color='Beyaz', price=105000.0, ad_date='18 Mayıs', city='Ankara', district='Mamak', url='/ilan/vasita-otomobil-tofas-1998-model-sahin-s-1100137408/detay'),
 CarListing(brand='Renault', series='R 9', model='1.4 Spring', ad_title='DEDE YADİGARI MASRAFSIZ', year=1992, mileage=100000, color='Beyaz', price=106000.0, ad_date='22 Mayıs', city='Bursa', district='Mustafakemalpaşa', url='/ilan/vasita-otomobil-renault-dede-yadigari-masrafsiz-1100983843/detay'),
 CarListing(brand='Fiat', series='Tempra', model='1.6 SX AK', ad_title="2024'E Vize T

## 1.2: Storing Retrieved Data

**The data is stored in several formats.**

I stored the scraped car listings in several formats, partly because it is possible and partly because I was curious about which file types exist for data storage. To export, I first converted the `all_cars` list to a pandas DataFrame.
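To show what two of these formats actually hold, here is a small sketch that uses only the standard library instead of pandas. It is not the export code used in this notebook. The `Listing` class is a hypothetical, trimmed-down stand-in for `CarListing`, and the table name `listings` is likewise made up for the example.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class Listing:  # hypothetical, trimmed-down stand-in for CarListing
    brand: str
    year: int
    price: float


all_listings = [
    Listing("Alfa Romeo", 1998, 177500.0),
    Listing("Skoda", 1993, 105000.0),
]

# Record-oriented form: one dict per listing. This is the same shape that
# JSON written with orient='records' has.
records = [asdict(item) for item in all_listings]
json_text = json.dumps(records, ensure_ascii=False)

# The same records as rows of a SQLite table (in memory, nothing hits the disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (brand TEXT, year INTEGER, price REAL)")
conn.executemany("INSERT INTO listings VALUES (:brand, :year, :price)", records)
rows = conn.execute(
    "SELECT brand, year, price FROM listings ORDER BY price"
).fetchall()
conn.close()

print(json.loads(json_text) == records)
print(rows)
```

Text formats such as JSON and CSV are easy to inspect by hand, while a SQLite file can be queried and sorted without loading everything into memory.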




In [2]:
import json
import pandas as pd


filename = "data/v2"

# Make data into pandas dataframe
df = pd.DataFrame(all_cars)

# Save dataframe as csv
df.to_csv( filename + '.csv', index=False)

# Save dataframe as excel
df.to_excel( filename + '.xlsx', index=False)

# Save dataframe as pickle
df.to_pickle( filename + '.pkl')

# Save dataframe as sql (SQLite); the table is named after `filename`
import sqlite3
conn = sqlite3.connect( filename + '.db')
df.to_sql( filename, conn, if_exists='replace', index=False)
conn.close()

# Save dataframe as json
df.to_json( filename + '.json', orient='records')

# Save dataframe as html
df.to_html( filename + '.html', index=False)

# Save dataframe as latex
df.to_latex( filename + '.tex', index=False)

# Save dataframe as markdown
df.to_markdown( filename + '.md', index=False)

# Save dataframe as clipboard
df.to_clipboard(index=False)


### The following did not work, for various reasons:

# Save dataframe as stata
# df.to_stata( filename + '.dta', write_index=False)

# Save dataframe as feather
# df.to_feather( filename + '.feather')

# Save dataframe as parquet
# df.to_parquet( filename + '.parquet')

# Save dataframe as hdf
# df.to_hdf( filename + '.hdf', key='data', mode='w')



