# Spider


### What is a "class"?

A class is a programming construct that serves as a blueprint for creating objects. Objects are instances of a class and can have attributes (characteristics) and methods (functions) associated with them. Classes enable the organization and encapsulation of code, allowing for the creation of reusable and structured code in object-oriented programming paradigms.
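
As a small illustration of the blueprint idea, the sketch below defines a hypothetical `Listing` class (it is not part of the scraper; the name and sample values are made up). It shows the difference between a class attribute, which is shared, and instance attributes, which belong to each object. The spider later in this document relies on class attributes to collect its results.

```python
class Listing:
    """Hypothetical blueprint for one real estate listing."""

    # Class attribute: a single list shared by the class and all its instances
    all_titles = []

    def __init__(self, title, price):
        # Instance attributes: every object gets its own values
        self.title = title
        self.price = price
        Listing.all_titles.append(title)

    def describe(self):
        # Method: a function that works with the object's own data
        return f"{self.title} ({self.price})"


first = Listing("Flat A", 1000000)
second = Listing("Flat B", 2000000)

print(first.describe())
print(Listing.all_titles)
```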

# Real Estate Web Scraping Summary

This Python script uses the Scrapy framework to scrape real estate listings from www.emlakjet.com for several districts of Istanbul. The spider class `Realestate_Spider` (registered under the name "Real estate_Spider") extracts each listing's title, price, link, and building age.

## Spider Functionality

1. **Initialization:**
    - The spider class `Realestate_Spider` is defined, including class variables for district names, titles, prices, links, and building ages.

2. **Request Generation:**
    - The `start_requests` method generates requests for each district's real estate listings on www.emlakjet.com, creating URLs based on the district names.

3. **Parsing Listings:**
    - The `parse1` method extracts property titles as plain text (the XPath ends in `text()`, so the results contain no HTML tags) and appends them to the `titles` list.
    - It also extracts property prices and appends them to the `prices` list.
    - Property links are extracted and appended to the `links` list.
    - For each property link, a follow-up request is issued with `parse2` as its callback.

4. **Additional Property Details:**
    - The `parse2` method extracts building ages from individual property pages and appends them to the `age_list`.
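
The request-generation step above is plain string concatenation, so it can be tried without Scrapy or a network connection. The sketch below mirrors the URL pattern used by `start_requests`. The helper names `district_urls` and `strip_tags` are hypothetical; `strip_tags` is an assumption about how markup could be cleaned with `re` if a selector ever returned HTML instead of plain text, and a regex like this is only a rough approximation of real HTML parsing.

```python
import re

BASE_URL = "https://www.emlakjet.com/satilik-konut/istanbul-"


def district_urls(districts):
    """Return one listing URL per district, mirroring start_requests."""
    return [BASE_URL + district for district in districts]


def strip_tags(fragment):
    """Hypothetical helper: drop anything that looks like an HTML tag, then trim."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


urls = district_urls(["uskudar", "fatih"])
print(urls)
print(strip_tags("<h3> <b>Sample title</b> </h3>"))
```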

## CrawlerProcess Initialization

- A `CrawlerProcess` object is created to manage the scraping process.
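
`CrawlerProcess` can also be given a settings dictionary. The sketch below is an assumption about how that might look here; the original script creates the process with default settings. The setting names are standard Scrapy settings, but the values are illustrative, and `make_process` is a hypothetical helper.

```python
# Illustrative values only; the original script does not set any of these
CRAWL_SETTINGS = {
    "LOG_LEVEL": "WARNING",    # print less log output in the notebook
    "DOWNLOAD_DELAY": 1.0,     # wait between requests to the same site
    "ROBOTSTXT_OBEY": True,    # respect the site's robots.txt rules
}


def make_process(settings):
    """Hypothetical helper: build a CrawlerProcess from a settings dict."""
    # Imported here so the settings can be inspected without Scrapy installed
    from scrapy.crawler import CrawlerProcess
    return CrawlerProcess(settings=settings)
```

With Scrapy installed, `process = make_process(CRAWL_SETTINGS)` would take the place of the bare `CrawlerProcess()` call.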

## Start Crawling

- The spider class `Realestate_Spider` is passed to the `process.crawl` method to initiate the scraping.
- The scraping process is started with `process.start()`.

## Result Storage

- The extracted data (titles, prices, links, and building ages) is stored within the spider class as class variables.
- Users can access the scraped data after the crawling process is complete by referencing the class variables.
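
One caveat with four separate lists: Scrapy handles requests asynchronously, so detail-page responses can arrive in any order, and the building ages are not guaranteed to line up with the titles, prices, and links. A minimal sketch of one alternative, using plain dictionaries and no Scrapy, keeps a single record per listing keyed by its link. The names `records`, `add_listing`, and `add_age` are hypothetical, and the sample values are made up.

```python
records = {}  # link -> {"title": ..., "price": ..., "link": ..., "age": ...}


def add_listing(title, price, link):
    """What parse1 would record for one listing (age not yet known)."""
    records[link] = {"title": title, "price": price, "link": link, "age": None}


def add_age(link, age):
    """What parse2 would record once the detail page for `link` arrives."""
    records[link]["age"] = age


# Listing pages are parsed first ...
add_listing("Flat A", "100", "https://example.com/a")
add_listing("Flat B", "200", "https://example.com/b")

# ... and detail pages may come back in a different order
add_age("https://example.com/b", "5")
add_age("https://example.com/a", "12")

rows = list(records.values())
print(rows)
```

Passing `rows` to `pd.DataFrame` would then give one aligned row per listing. In the spider itself, the link could be handed to `parse2` through the request, for example with Scrapy's `cb_kwargs` argument.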

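Note: Run this code in an environment where Scrapy is installed (for example, via `pip install scrapy`). In a notebook, restart the kernel before re-running a cell that calls `process.start()`, because the underlying Twisted reactor cannot be restarted once it has stopped.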






### Basic structure of spider class

In [None]:
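import scrapy
from scrapy.crawler import CrawlerProcess

# Module-level list that the spider fills while it crawls
titles = []

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"

    def start_requests(self):
        # Request the home page and hand the response to parse1
        yield scrapy.Request(url='https://www.emlakjet.com', callback=self.parse1)

    def parse1(self, response):
        # Collect the text of every <h4> element on the page
        titles_in = response.xpath('//h4/text()').extract()
        for i in titles_in:
            titles.append(i)

# Schedule the spider, then run it; process.start() blocks until the crawl finishes
process = CrawlerProcess()
process.crawl(Realestate_Spider)
process.start()

titles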

### Our spider class

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"
    
    district_list = ["uskudar", "fatih", "umraniye", "kadikoy","sariyer","zeytinburnu","beylikduzu","maltepe"]        
    titles = []
    prices = []
    links = []
    age_list = []

    def start_requests(self):
        for district in self.district_list:
            link1 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district        
            yield scrapy.Request(url=link1, callback=self.parse1)

    def parse1(self, response):        
        titles_in = response.xpath('//*[@id="listing-search-wrapper"]/div/a/div[3]/div/div[1]/h3/text()').extract()           
        for i in titles_in:
            self.titles.append(i)        
            
        prices_in = response.xpath("//*[@id='listing-search-wrapper']/div/a/div[3]/div/div[3]/div/p/span/span/text()").extract()
        for i in prices_in:
            self.prices.append(i)

        links_in = response.xpath("//*[@id='listing-search-wrapper']/div[@class='_3qUI9q']/a/@href").extract()

        for i in links_in:
            # urljoin resolves the href against the page URL, which avoids
            # a doubled slash when the href already starts with "/"
            link2 = response.urljoin(i)
            self.links.append(link2)

            yield response.follow(url=link2, callback=self.parse2)
            
    def parse2(self, response):
        building_age = response.xpath("//*[@id='bilgiler']/div/div[2]/div/div[1]/div[2]/div[5]/div[2]/text()").extract()
        for i in building_age:
            self.age_list.append(i)
            

# Create a CrawlerProcess
process = CrawlerProcess()

# Schedule the spider by passing the spider class, not an instance
process.crawl(Realestate_Spider)

# Start the crawl; this call blocks until the crawl has finished
process.start()

In [None]:
# Access the scraped lists after the crawling process is complete
titles = Realestate_Spider.titles
prices = Realestate_Spider.prices
links = Realestate_Spider.links
age_list = Realestate_Spider.age_list

print(len(titles),len(prices),len(links),len(age_list))

In [None]:
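import pandas as pd

# Wrapping each list in a Series lets pandas pad shorter columns with NaN
# instead of raising an error when the lists differ in length.
# Note: the lists are filled independently, so rows are not guaranteed to line up.
dict_real_estate = {'title': titles, 'price': prices, 'link': links, 'age': age_list}
data_real_estate = pd.DataFrame({key: pd.Series(values) for key, values in dict_real_estate.items()})
data_real_estate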

In [None]:
# Writing .xlsx files with pandas requires the openpyxl package
data_real_estate.to_excel("data_real_estate.xlsx", index=False)

### Including more than one page per district

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
import re

class Realestate_Spider(scrapy.Spider):
    name = "Real estate_Spider"
    
    district_list = ["uskudar"]        
    titles = []
    prices = []
    links = []
    age_list = []

    def start_requests(self):
        for district in self.district_list:
            link1 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district
            # Pass the district along so parse0 paginates only this district
            yield scrapy.Request(url=link1, callback=self.parse0, cb_kwargs={'district': district})

    def parse0(self, response, district):
        # Read the last page number from the pagination bar of this district's first page
        number_of_pages = int(response.xpath('//*[@id="listing-search-wrapper"]/div/div[1]/ul/li[7]/div/a/text()').extract()[0])
        for i in range(1, number_of_pages + 1):
            link1_5 = 'https://www.emlakjet.com/satilik-konut/istanbul-' + district + "/" + str(i)
            yield response.follow(url=link1_5, callback=self.parse1)

    def parse1(self, response):        
        titles_in = response.xpath('//*[@id="listing-search-wrapper"]/div/a/div[3]/div/div[1]/h3/text()').extract()           
        for i in titles_in:
            self.titles.append(i)        
            
        prices_in = response.xpath("//*[@id='listing-search-wrapper']/div/a/div[3]/div/div[3]/div/p/span/span/text()").extract()
        for i in prices_in:
            self.prices.append(i)

        links_in = response.xpath("//*[@id='listing-search-wrapper']/div[@class='_3qUI9q']/a/@href").extract()

        for i in links_in:
            # urljoin resolves the href against the page URL, which avoids
            # a doubled slash when the href already starts with "/"
            link2 = response.urljoin(i)
            self.links.append(link2)

            yield response.follow(url=link2, callback=self.parse2)
            
    def parse2(self, response):
        building_age = response.xpath("//*[@id='bilgiler']/div/div[2]/div/div[1]/div[2]/div[5]/div[2]/text()").extract()
        for i in building_age:
            self.age_list.append(i)
            

# Create a CrawlerProcess
process = CrawlerProcess()

# Schedule the spider by passing the spider class, not an instance
process.crawl(Realestate_Spider)

# Start the crawl; this call blocks until the crawl has finished
process.start()

In [None]:
# Access the scraped lists after the crawling process is complete
titles = Realestate_Spider.titles
prices = Realestate_Spider.prices
links = Realestate_Spider.links
age_list = Realestate_Spider.age_list

print(len(titles),len(prices),len(links),len(age_list))
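
The pagination logic in `parse0` can be tried out without a network connection. The sketch below is a minimal version under stated assumptions: `parse_page_count` and `page_urls` are hypothetical helpers, and the fallback to a single page when no page number is found is an added assumption (the spider above would raise an `IndexError` in that case).

```python
def parse_page_count(texts, default=1):
    """Return the first entry of `texts` that is a whole number, else `default`."""
    for text in texts:
        text = text.strip()
        if text.isdigit():
            return int(text)
    return default


def page_urls(base_url, number_of_pages):
    """Build base_url/1 ... base_url/N, as parse0 does for one district."""
    base = base_url.rstrip("/")
    return [f"{base}/{i}" for i in range(1, number_of_pages + 1)]


base = "https://www.emlakjet.com/satilik-konut/istanbul-uskudar"
count = parse_page_count([" 3 "])  # stands in for the extracted pagination text
pages = page_urls(base, count)
print(pages)
```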