# Spider


### What is a "class"?

A class is a programming construct that serves as a blueprint for creating objects. Objects are instances of a class and can have attributes (characteristics) and methods (functions) associated with them. Classes enable the organization and encapsulation of code, allowing for the creation of reusable and structured code in object-oriented programming paradigms.
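
As a minimal illustration of the definition above, the sketch below defines a hypothetical `Listing` class (it is not part of the scraper) with two attributes and one method, then creates two independent objects from it.

```python
class Listing:
    """Blueprint for a real estate listing (hypothetical example class)."""

    def __init__(self, title, price):
        # Attributes: data stored on each individual object
        self.title = title
        self.price = price

    def describe(self):
        # Method: a function that works with the object's own attributes
        return f"{self.title}: {self.price} TL"


# Two objects (instances) created from the same blueprint
flat = Listing("2+1 flat", 2995000)
villa = Listing("Villa", 35500000)

print(flat.describe())
print(villa.describe())
```

Each object carries its own `title` and `price`, while the `describe` method is defined once on the class and reused by both.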

# Real Estate Web Scraping Summary

This Python script uses the Scrapy framework to scrape real estate listings from www.emlakjet.com for several districts of Istanbul. The spider class `Realestate_Spider` (spider name "Real estate_Spider") extracts the title, price, link, and building age of each property listed in the selected districts.

## Spider Functionality

1. **Initialization:**
    - The spider class `Realestate_Spider` is defined, including class variables for district names, titles, prices, links, and building ages.

2. **Request Generation:**
    - The `start_requests` method generates requests for each district's real estate listings on www.emlakjet.com, creating URLs based on the district names.

3. **Parsing Listings:**
    - The `parse1` method extracts property titles with an XPath `text()` query, so no HTML tags remain to be cleaned, and appends them to the `titles` list.
    - It also extracts property prices and appends them to the `prices` list.
    - Property links are extracted and appended to the `links` list.
    - For each property link, a new request is followed to the `parse2` method.

4. **Additional Property Details:**
    - The `parse2` method extracts building ages from individual property pages and appends them to the `age_list`.
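
The URL handling in steps 2 and 3 can be sketched with the standard library alone. The helper names `district_url` and `listing_url` are hypothetical and only illustrate the idea. Using `urljoin` instead of plain string concatenation is an assumption about how one might avoid a doubled slash when the extracted `href` already starts with `/`.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.emlakjet.com/"


def district_url(district):
    # Listing page for one Istanbul district, built the same way as in start_requests
    return BASE_URL + "satilik-konut/istanbul-" + district


def listing_url(href):
    # urljoin resolves a site-relative href against the base URL
    return urljoin(BASE_URL, href)


for district in ["uskudar", "fatih"]:
    print(district_url(district))

# Plain concatenation doubles the slash; urljoin does not
href = "/ilan/example-listing-123"  # made-up href for illustration
print(BASE_URL + href)
print(listing_url(href))
```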

## CrawlerProcess Initialization

- A `CrawlerProcess` object is created to manage the scraping process.

## Start Crawling

- The spider class `Realestate_Spider` is passed to the `process.crawl` method to initiate the scraping.
- The scraping process is started with `process.start()`.

## Result Storage

- The extracted data (titles, prices, links, and building ages) is stored within the spider class as class variables.
- Users can access the scraped data after the crawling process is complete by referencing the class variables.
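
The data can be read from the class itself because lists defined in the class body are shared by the class and every instance of it. A minimal sketch, with a hypothetical `Collector` class standing in for the spider:

```python
class Collector:
    # Class variable: one list shared by the class and all of its instances
    titles = []

    def add(self, title):
        # self.titles resolves to the shared class-level list
        self.titles.append(title)


# Scrapy creates the spider instance internally; here it is created by hand
worker = Collector()
worker.add("Listing A")
worker.add("Listing B")

# The data is reachable through the class, without a reference to the instance
print(Collector.titles)
```

A side effect of this design is that the lists keep growing if the class is reused without being redefined, so they should be emptied before another run.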

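Note: Run this code in an environment where Scrapy is installed. Because the Twisted reactor behind `CrawlerProcess` cannot be restarted, `process.start()` can be called only once per Python session, so restart the notebook kernel before running another spider cell.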






### Basic structure of spider class

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

# Module-level list that collects the scraped headings
titles = []

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"

    def start_requests(self):
        # Request the home page and hand the response to parse1
        yield scrapy.Request(url='https://www.emlakjet.com', callback=self.parse1)

    def parse1(self, response):
        # Collect the text of every <h4> element on the page
        titles_in = response.xpath('//h4/text()').extract()
        for i in titles_in:
            titles.append(i)

process = CrawlerProcess()
process.crawl(Realestate_Spider)
process.start()

titles

2024-11-26 11:11:15 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2024-11-26 11:11:15 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.0.12 24 Oct 2023), cryptography 41.0.3, Platform Windows-10-10.0.22621-SP0
2024-11-26 11:11:15 [scrapy.crawler] INFO: Overridden settings:
{}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-11-26 11:11:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-11-26 11:11:15 [scrapy.extensions.telnet] INFO: Telnet Password: 12da08863e15d7d0
2024-11-26 11:11:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',


['En Yeni İlanlar', 'Öne Çıkan Projeler', 'Emlak Haberleri']

### Our spider class

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"
    
    district_list = ["uskudar", "fatih", "umraniye", "kadikoy","sariyer","zeytinburnu","beylikduzu","maltepe"]        
    titles = []
    prices = []
    links = []
    age_list = []

    def start_requests(self):
        for district in self.district_list:
            link1 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district        
            yield scrapy.Request(url=link1, callback=self.parse1)

    def parse1(self, response):        
        titles_in = response.xpath('//*[@id="listing-search-wrapper"]/div/a/div[3]/div/div[1]/h3/text()').extract()           
        for i in titles_in:
            self.titles.append(i)        
            
        prices_in = response.xpath("//*[@id='listing-search-wrapper']/div/a/div[3]/div/div[3]/div/p/span/span/text()").extract()
        for i in prices_in:
            self.prices.append(i)

        links_in = response.xpath("//*[@id='listing-search-wrapper']/div[@class='_3qUI9q']/a/@href").extract()

        for i in links_in:
            first_part_url = 'https://www.emlakjet.com/'
            link2 = first_part_url + i
            self.links.append(link2)
            
            yield response.follow(url = link2,  callback = self.parse2)
            
    def parse2(self, response):
        building_age = response.xpath("//*[@id='bilgiler']/div/div[2]/div/div[1]/div[2]/div[5]/div[2]/text()").extract()
        for i in building_age:
            self.age_list.append(i)
            

# Create a CrawlerProcess
process = CrawlerProcess()

# Schedule the spider by passing the spider class, not an instance
process.crawl(Realestate_Spider)

# Start the process
process.start()


In [2]:
# Access the scraped lists after the crawling process is complete
titles = Realestate_Spider.titles
prices = Realestate_Spider.prices
links = Realestate_Spider.links
age_list = Realestate_Spider.age_list

print(len(titles),len(prices),len(links),len(age_list))

240 240 240 240


In [3]:
import pandas as pd
dict_real_estate = {'title':titles, 'price':prices, 'link': links, 'age': age_list}
data_real_estate = pd.DataFrame(dict_real_estate)
data_real_estate



| | title | price | link | age |
|---|---|---|---|---|
| 0 | Beylerbeyi Antteras'ta Manzaralı Çatı Dubleksi | 60.000.000 | https://www.emlakjet.com//ilan/beylerbeyi-antt... | 3 |
| 1 | Çengelköy'ün En Güzel Noktasında Boğaz Manzara... | 30.600.000 | https://www.emlakjet.com//ilan/cengelkoy-un-en... | 0 (Yeni) |
| 2 | Üsküdar Tunusbağında Ana Caddeye Yakın 1+1 70m... | 1.700.000 | https://www.emlakjet.com//ilan/uskudar-tunusba... | 21 Ve Üzeri |
| 3 | Ashill'den Çengelköy Mah. Lüks Sitede Satılık ... | 35.500.000 | https://www.emlakjet.com//ilan/ashill-den-ceng... | 5-10 |
| 4 | Üsküdar Merkezi Konumda 2+1 Yüksek Giriş | 2.995.000 | https://www.emlakjet.com//ilan/uskudar-merkezi... | 0 (Yeni) |
| ... | ... | ... | ... | ... |
| 235 | Bahçeköy Merkezde Belgrada Sınır Satılık Giriş... | 4.500.000 | https://www.emlakjet.com//ilan/bahcekoy-merkez... | 21 Ve Üzeri |
| 236 | Sarıyer Maden Tünele 2dk Orman Manzaralı 5+1 Ç... | 23.500.000 | https://www.emlakjet.com//ilan/sariyer-maden-t... | 0 (Yeni) |
| 237 | Sarıyer Maden Tünele 2dk Butik Site İçerisinde... | 14.750.000 | https://www.emlakjet.com//ilan/sariyer-maden-t... | 0 (Yeni) |
| 238 | Sarıyer Sahil Yalı Dairesi Deniz Manzaralı Öze... | 16.750.000 | https://www.emlakjet.com//ilan/sariyer-sahil-y... | 21 Ve Üzeri |
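
The `price` column holds strings such as `60.000.000`, where the dot is a thousands separator. If numeric prices are needed for analysis, one possible conversion is sketched below. It is not part of the original script, and `parse_price` is a hypothetical helper.

```python
import re


def parse_price(text):
    # Keep only the digits, dropping thousands separators and any currency text
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


for raw in ["60.000.000", "1.700.000", "2.995.000"]:
    print(raw, "->", parse_price(raw))
```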


In [None]:
data_real_estate.to_excel("data_real_estate.xlsx", index=False)

### Including more than one page per district

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"
    
    district_list = ["uskudar", "fatih"]        
    titles = []
    prices = []
    links = []
    age_list = []

    def start_requests(self):
        for district in self.district_list:
            link1 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district
            # Pass the district along so that parse0 paginates only this district
            yield scrapy.Request(url=link1, callback=self.parse0, cb_kwargs={'district': district})
                
    def parse0(self, response, district):
        # Last page number, read from the pagination bar of the district's first page
        number_of_pages = int(response.xpath('//*[@id="listing-search-wrapper"]/div/div[1]/ul/li[7]/div/a/text()').extract()[0])
        for i in range(1, number_of_pages + 1):
            link1_5 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district + "/" + str(i)
            yield response.follow(url=link1_5, callback=self.parse1)

    def parse1(self, response):        
        titles_in = response.xpath('//*[@id="listing-search-wrapper"]/div/a/div[3]/div/div[1]/h3/text()').extract()           
        for i in titles_in:
            self.titles.append(i)        
            
        prices_in = response.xpath("//*[@id='listing-search-wrapper']/div/a/div[3]/div/div[3]/div/p/span/span/text()").extract()
        for i in prices_in:
            self.prices.append(i)

        links_in = response.xpath("//*[@id='listing-search-wrapper']/div[@class='_3qUI9q']/a/@href").extract()

        for i in links_in:
            first_part_url = 'https://www.emlakjet.com/'
            link2 = first_part_url + i
            self.links.append(link2)
            
            yield response.follow(url = link2,  callback = self.parse2)
            
    def parse2(self, response):
        building_age = response.xpath("//*[@id='bilgiler']/div/div[2]/div/div[1]/div[2]/div[5]/div[2]/text()").extract()
        for i in building_age:
            self.age_list.append(i)
            

# Create a CrawlerProcess
process = CrawlerProcess()

# Schedule the spider by passing the spider class, not an instance
process.crawl(Realestate_Spider)

# Start the process
process.start()


In [2]:
# Access the scraped lists after the crawling process is complete
titles = Realestate_Spider.titles
prices = Realestate_Spider.prices
links = Realestate_Spider.links
age_list = Realestate_Spider.age_list

print(len(titles),len(prices),len(links),len(age_list))

1724 1724 1724 1722
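
The last count differs from the others (1722 ages against 1724 links). Ages are also appended in whatever order the detail pages respond, so the four lists cannot be assumed to line up row by row. One way to keep each listing's fields together, sketched here in plain Python with made-up sample values, is to build one record per link and fill in the age when it arrives. The function names `add_listing` and `add_age` are hypothetical.

```python
records = {}  # link -> one dict per listing


def add_listing(title, price, link):
    # Called where parse1 currently appends to three separate lists
    records[link] = {"title": title, "price": price, "link": link, "age": None}


def add_age(link, age):
    # Called where parse2 currently appends to age_list
    if link in records:
        records[link]["age"] = age


add_listing("Flat A", "1.700.000", "https://example.com/ilan/a")
add_listing("Flat B", "2.995.000", "https://example.com/ilan/b")

# Detail pages may finish in a different order, and some may yield no age
add_age("https://example.com/ilan/b", "5-10")

rows = list(records.values())
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed to `pd.DataFrame` directly, and a listing without an age then shows up as a missing value instead of shifting the column.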
