# Assignment 2 - Data Scraping

In this project, I scrape NGO data from the General Directorate of Civil Society Relations (Sivil Toplumla İlişkiler Genel Müdürlüğü). I will use this data to obtain the geolocations of NGOs from their addresses with Google's Geocoding API, which returns latitude and longitude values for a given address.
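As a rough illustration of the geocoding step, the sketch below builds a Geocoding API request URL and reads the coordinates out of a response. The helper names `build_geocode_url` and `extract_lat_lng` are hypothetical, the sample response is hand-written with made-up coordinates, and the API key is a placeholder; no request is actually sent.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Return the request URL for one address lookup (hypothetical helper)."""
    return f"{GEOCODE_URL}?{urlencode({'address': address, 'key': api_key})}"


def extract_lat_lng(response_json):
    """Return (lat, lng) from a geocoding response, or None if nothing matched."""
    if response_json.get("status") != "OK" or not response_json.get("results"):
        return None
    location = response_json["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample shaped like a successful response (made-up coordinates)
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 39.92, "lng": 32.85}}}],
}

print(build_geocode_url("Ankara", "YOUR_API_KEY"))
print(extract_lat_lng(sample))
print(extract_lat_lng({"status": "ZERO_RESULTS", "results": []}))
```

Keeping the parsing separate from the network call makes it possible to check the extraction logic on a saved response before spending API quota.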

In this assignment, I only used requests and pandas (plus math from the standard library). I could have used Scrapy, but it would have made the process more complicated without adding anything. Instead, I used Insomnia (a Postman-like API testing tool) to separate the payload and headers of the request.
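The same separation can be reproduced in Python. The sketch below takes a request body in the form used later in this notebook and splits it into its individual fields; the province value `6` is chosen only for illustration.

```python
import json
from urllib.parse import parse_qs

# Request body in the form used by this notebook (province 6 is illustrative)
body = (
    "sort=&group=&filter=&page=1&pageSize=10&serializedData="
    '{"secilenTeskilatPk":"6","secilenIlcePk":"999999999","neviler":"0","altneviler":"0"}'
)

# keep_blank_values=True preserves the empty sort/group/filter fields
parsed = parse_qs(body, keep_blank_values=True)
fields = {key: values[0] for key, values in parsed.items()}

# serializedData is itself a JSON document nested inside the form body
filters = json.loads(fields["serializedData"])

print(fields["page"], fields["pageSize"])
print(filters)
```

This shows that only `page`, `pageSize` and `secilenTeskilatPk` need to vary between calls; the remaining fields can stay fixed.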


In [1]:
# Import the libraries
import requests
import pandas as pd
import math

In [2]:
"""
Define a function that sends a request with the relevant parameters:
il_plakası, page and page_size.
il_plakası is the provincial licence plate number (1 to 81).
page_size is the number of records returned per page.
page is the number of the page to return.
For example, page=1 and page_size=10 returns the first 10 records,
while page=2 and page_size=10 returns records 11 to 20.
"""

def make_request(il_plakası, page, page_size):
    url = "https://derbis.dernekler.gov.tr/IstatistikDerneklerWeb/GetIlFaaliyetDernek"
    
    serializedData = f'{{"secilenTeskilatPk":"{il_plakası}","secilenIlcePk":"999999999","neviler":"0","altneviler":"0"}}'
    payload = f"sort=&group=&filter=&page={page}&pageSize={page_size}&serializedData={serializedData}"

    headers = {
        "cookie": "ASP.NET_SessionId=fdle2s30xtn1wnux0y3dkklw",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "https://derbis.dernekler.gov.tr",
        "Referer": "https://derbis.dernekler.gov.tr/IstatistikDerneklerWeb/IlFaaliyetAlaniDernekler",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "sec-ch-ua": "\"Google Chrome\";v=\"119\", \"Chromium\";v=\"119\", \"Not?A_Brand\";v=\"24\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\""
    }
    
    response = requests.post(url, data=payload, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        return {"error": "Request failed", "status_code": response.status_code}
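The relationship between `page`, `page_size` and the records returned can be checked with a small calculation. This is a minimal sketch; `record_range` is a hypothetical helper and assumes pages are numbered from 1, as in the example above.

```python
def record_range(page, page_size):
    """Return the 1-based (first, last) record positions covered by a page."""
    first = (page - 1) * page_size + 1
    last = page * page_size
    return first, last


print(record_range(1, 10))  # first page of 10 records
print(record_range(2, 10))  # second page of 10 records
```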

In [3]:
# First, loop through each province and get its number of NGOs.
# Only the "Total" field of the response is needed here, so page_size is set to 0.
results_dict = {}


for il_plakası in range(1, 82):
    result = make_request(il_plakası, 1, 0)
    # "Total" is missing when the request fails, in which case None is stored
    total_value = result.get("Total")
    results_dict[il_plakası] = total_value
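Because `make_request` returns an error dictionary without a `"Total"` key when a call fails, some values in `results_dict` may be `None`. A minimal sketch of separating usable counts from failed provinces follows; `split_totals` is a hypothetical helper and the sample counts are made up.

```python
def split_totals(results):
    """Separate usable record counts from provinces whose count is missing."""
    valid = {plate: total for plate, total in results.items() if isinstance(total, int)}
    failed = [plate for plate in results if plate not in valid]
    return valid, failed


# Made-up counts: province 2 stands in for a failed request
sample_totals = {1: 1250, 2: None, 3: 480}
valid, failed = split_totals(sample_totals)

print(valid)
print(failed)
```

The `failed` list can then be retried before moving on, so that no province is silently dropped.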



In [4]:
# Collect one dataframe per response; they are combined into the main dataframe at the end
frames = []

# Loop through the results dictionary, which maps each provincial licence plate number to its number of NGOs.
for il_plakası, total_records in results_dict.items():
    # Skip provinces whose count is missing (failed request) or zero
    if not total_records:
        print(f"No record count for il_plakası={il_plakası}, skipping")
        continue

    # I request at most 1000 records per call, so provinces with more than 1000 NGOs need multiple calls.
    page_size = min(total_records, 1000)
    num_pages = math.ceil(total_records / page_size)
    for page_number in range(1, num_pages + 1):
        result = make_request(il_plakası, page_number, page_size)

        # "Data" is the main key of the API output; check that it exists
        if "Data" in result:
            # Convert the JSON records to a dataframe
            frames.append(pd.json_normalize(result["Data"]))
        else:
            print(f"No 'Data' key in the response for il_plakası={il_plakası}, page={page_number}")

# This is the main dataframe
combined_dataframe = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Print the combined dataframe
print(combined_dataframe)

          Ad  Sayi  Oran  Plaka  \
0       None     0     0      1   
1       None     0     0      1   
2       None     0     0      1   
3       None     0     0      1   
4       None     0     0      1   
...      ...   ...   ...    ...   
101009  None     0     0     81   
101010  None     0     0     81   
101011  None     0     0     81   
101012  None     0     0     81   
101013  None     0     0     81   

                                                AnaNevisi       telNo  \
0                    YAŞLI ve ÇOCUKLARA YÖNELİK DERNEKLER               
1                           EĞİTİM ARAŞTIRMA  DERNEKLERİ                
2                         MESLEKİ ve DAYANIŞMA DERNEKLERİ  5061525466   
3       DİNİ HİZMETLERİN GERÇEKLEŞTİRİLMESİNE YÖNELİK ...  5322446120   
4       DİNİ HİZMETLERİN GERÇEKLEŞTİRİLMESİNE YÖNELİK ...               
...                                                   ...         ...   
101009  DİNİ HİZMETLERİN GERÇEKLEŞTİRİLMESİNE YÖNELİK ...  535738388
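The number of calls needed to collect all records can be worked out before any request is sent. The sketch below is an illustration only: `build_call_plan` is a hypothetical helper, the sample counts are made up, and the limit of 1000 records per call is the one used in this notebook, not a documented server limit.

```python
import math

MAX_PAGE_SIZE = 1000  # per-call limit used in this notebook


def build_call_plan(totals, max_page_size=MAX_PAGE_SIZE):
    """Return the (il_plakası, page, page_size) triples needed to fetch every record."""
    plan = []
    for plate, total in totals.items():
        if not total:
            continue  # nothing to fetch for missing or zero counts
        page_size = min(total, max_page_size)
        for page in range(1, math.ceil(total / page_size) + 1):
            plan.append((plate, page, page_size))
    return plan


# Made-up counts for three provinces
sample_counts = {1: 250, 6: 2500, 34: 0}
plan = build_call_plan(sample_counts)

for call in plan:
    print(call)
```

With these sample counts, the province with 250 records needs one call and the province with 2500 records needs three.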

In [5]:
# Export the main dataframe
combined_dataframe.to_excel("./Boran_Göksel_Assignment2_Scrapped_Data.xlsx")