#### What is Crunchbase?
In simple terms, it is a platform that aggregates information about companies all over the world, including revenue, investors, number of employees, contact information, and more.
For my use case, I need to pull data on space startups from the Crunchbase API. See [Crunchbase Data](https://data.crunchbase.com/docs/getting-started) for a complete list of available data points.

#### Step 1: Get Crunchbase API key and request URL

For a single-organization lookup, the API URL has the following format:

`https://api.crunchbase.com/api/v4/entities/organizations/crunchbase?user_key=INSERT_YOUR_API_KEY_HERE`

Since I am searching for organizations in aerospace rather than looking up a single company, I will use the search endpoint, `POST /searches/organizations`.
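
To make the difference between the two request styles concrete, here is a minimal sketch that only builds the URLs and sends nothing. The helper names `entity_lookup_url` and `search_url` are hypothetical and exist only for illustration; the paths are the ones used in this notebook.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.crunchbase.com/api/v4"


def entity_lookup_url(permalink, user_key):
    """URL for a single-organization lookup, with the key as a query parameter."""
    return f"{API_ROOT}/entities/organizations/{permalink}?{urlencode({'user_key': user_key})}"


def search_url(collection="organizations"):
    """URL for a search over a collection; the filters travel in the POST body."""
    return f"{API_ROOT}/searches/{collection}"


print(entity_lookup_url("crunchbase", "INSERT_YOUR_API_KEY_HERE"))
print(search_url())
```

The lookup URL identifies one company by its permalink, while the search URL stays the same for every search and the filters go into the JSON body.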

Here is the full list of filters I used:
- **description keywords**: `space deterrence`, `space force`, `space-based sensors`, `orbital`, `counterspace`, `missile defense`, `space`, `space-based`, `anti-satellite`
- **headquarters location**: `United States`
- **industry**: `Manufacturing`, `Robotics`, `Artificial Intelligence (AI)`, `Information Technology`, `National Security`, `Satellite Communications`, `Aerospace`
- **founded**: `custom date range 2015-2024`

Filtering produced 908 results.
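
Each filter above becomes one "predicate" object in the search request body. As a rough sketch, a small helper can remove the repetition when writing them out; `predicate` is a hypothetical name, and the field and operator identifiers are the ones used in the query defined in Step 2.

```python
def predicate(field_id, operator_id, values):
    """Build one search predicate in the shape used by the query body."""
    return {
        "type": "predicate",
        "field_id": field_id,
        "operator_id": operator_id,
        "values": list(values),
    }


# Two of the filters from the list above, expressed as predicates
filters = [
    predicate("short_description", "contains", ["orbital", "counterspace"]),
    predicate("founded_on", "between", ["2015-01-01", "2024-12-31"]),
]

for f in filters:
    print(f)
```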

#### Step 2: Request data using Python

Import all necessary packages

In [None]:
# we will use the requests module to send API requests to Crunchbase
import requests
import json
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated
from dotenv import load_dotenv
import os

Define API user key

In [None]:
# I use a .env file to keep my key safe; load_dotenv() reads it into the environment
load_dotenv()

API_KEY = os.getenv('CRUNCHBASE_API_KEY')

# API endpoint
BASE_URL = 'https://api.crunchbase.com/api/v4/searches/organizations'
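
`os.getenv` returns `None` when the variable is missing, and the first sign of trouble is then an authentication error from the API. A small guard can fail earlier with a clearer message. This is only a sketch: `require_env` is a hypothetical helper, and it takes the environment as a parameter so it can be tried without a real key.

```python
import os


def require_env(name, env=os.environ):
    """Return the variable's value, or raise a clear error if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell"
        )
    return value


# Example with an explicit mapping instead of the real environment
print(require_env("CRUNCHBASE_API_KEY", {"CRUNCHBASE_API_KEY": "dummy-key"}))
```

In the notebook this would be called as `API_KEY = require_env('CRUNCHBASE_API_KEY')` after `load_dotenv()`.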

Define search query parameters

In [None]:
query = {
    "field_ids": [
        "name",
        "founded_on",
        "categories",
        "location_identifiers",
        "postal_code",
        "short_description",
        "operating_status",
        "rank_org",
        "funding_stage",
        "last_funding_date",
        "last_funding_type",
        "acquired_by",
        "acquisition_price",
        "ipo_status",
        "ipo_valuation",
        "ipo_date",
        "estimated_revenue_range",
        "num_employees_enum",
        "num_funding_rounds",
        "funding_total",
        "investor_types",
        "last_equity_funding_type",
        "num_lead_investors"
    ],
    "limit": 1000,
    "query": [
        {
            "type": "predicate",
            "field_id": "location_identifiers",
            "operator_id": "includes",
            "values": [
                "502465b7609bc908c96be2b362a676b1"  # UUID for United States
            ]
        },
        {
            "type": "predicate",
            "field_id": "categories",
            "operator_id": "includes",
            "values": [
                "manufacturing",
                "robotics",
                "artificial intelligence",
                "information technology",
                "national security",
                "satellite communications",
                "aerospace"
            ]
        },
        {
            "type": "predicate",
            "field_id": "founded_on",
            "operator_id": "between",
            "values": [
                "2015-01-01",
                "2024-12-31"
            ]
        },
        {
            "type": "predicate",
            "field_id": "short_description",
            "operator_id": "contains",
            "values": [
                "space deterrence",
                "space force",
                "space-based sensors",
                "orbital",
                "counterspace",
                "missile defense",
                "space",
                "space-based",
                "anti-satellite"
            ]
        },
        {
            "type": "predicate",
            "field_id": "facet_ids",
            "operator_id": "includes",
            "values": [
                "company"
            ]
        }
    ]
}

**To get the UUID of an entity** (a location, a category, and so on):

Go to SwaggerHub -> GET /autocompletes -> click “Try it out” -> type your search term in the query box -> Execute -> copy the UUID from the response body.
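
The same lookup can be scripted instead of clicked through. The sketch below rests on assumptions that should be checked against the API reference: that the endpoint is `/autocompletes`, that it accepts `query` and `collection_ids` parameters, and that each result carries an `identifier` object with `value` and `uuid`. The helper names and the sample response are hypothetical; only the United States UUID comes from the query above.

```python
from urllib.parse import urlencode


def autocomplete_url(term, collection_ids="locations"):
    """Build the autocomplete URL for a search term (endpoint and parameters are assumptions)."""
    params = urlencode({"query": term, "collection_ids": collection_ids})
    return f"https://api.crunchbase.com/api/v4/autocompletes?{params}"


def matches(response):
    """Return (name, uuid) pairs from an autocomplete response body."""
    return [
        (entity["identifier"]["value"], entity["identifier"]["uuid"])
        for entity in response.get("entities", [])
    ]


# Hypothetical response body, trimmed to the fields used here
sample = {
    "entities": [
        {"identifier": {"value": "United States", "uuid": "502465b7609bc908c96be2b362a676b1"}}
    ]
}

print(autocomplete_url("United States"))
print(matches(sample))
```

To run it against the live API, the URL could be fetched with `requests.get(autocomplete_url("United States"), headers={"X-cb-user-key": API_KEY}).json()` and the result passed to `matches`.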



#### Step 3: Create functions that return the matching companies and extract their data
The results are collected in a pandas DataFrame.

In [None]:
# request headers shared by both functions
HEADERS = {
    "X-cb-user-key": API_KEY,
    "Content-Type": "application/json"
}

# functions
def company_count(query):
    """Return the total number of organizations matching the query."""
    r = requests.post(BASE_URL, headers=HEADERS, json=query)
    r.raise_for_status()  # fail loudly on HTTP errors such as an invalid key
    return r.json().get("count", 0)

def url_extraction(query):
    """Fetch every page of results and return them as one flattened DataFrame."""
    body = dict(query)  # shallow copy so the caller's query is not modified
    frames = []
    fetched = 0

    while True:
        r = requests.post(BASE_URL, headers=HEADERS, json=body)
        r.raise_for_status()
        result = r.json()
        entities = result.get("entities", [])
        if not entities:
            break

        frames.append(json_normalize(entities))
        fetched += len(entities)

        # stop once every matching row has been collected
        if fetched >= result.get("count", 0):
            break

        # API v4 uses keyset pagination: ask for the rows after the last UUID seen
        body["after_id"] = entities[-1]["uuid"]

    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
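
`json_normalize` turns each nested entity into one flat row, naming columns with dot-separated paths, which is why the resulting DataFrame has columns such as `properties.name`. The plain-Python sketch below mimics that flattening for nested dictionaries so the column names are easier to anticipate. The `flatten` helper and the sample entity are hypothetical, and the shape of `funding_total` is an assumption.

```python
def flatten(record, parent=""):
    """Flatten nested dicts into one level with dot-separated keys; other values are kept as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical entity, trimmed; real responses carry many more properties
entity = {
    "uuid": "example-uuid",
    "properties": {
        "name": "Example Space Co",
        "funding_total": {"value_usd": 1000000, "currency": "USD"},
    },
}

for column, value in flatten(entity).items():
    print(column, "=", value)
```

List-valued fields are left as lists inside a single column, so they usually need separate handling during cleaning.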

In [None]:
# getting the total count of companies
total = company_count(query)
print(f"Total number of companies: {total}")
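
The total count also tells us how many requests the extraction will need. A quick check, using the 908 results from Step 1 and the `limit` of 1000 set in the query (`requests_needed` is a hypothetical helper):

```python
import math


def requests_needed(total, limit):
    """Number of search requests needed to fetch `total` rows at `limit` rows per request."""
    return math.ceil(total / limit) if total > 0 else 0


print(requests_needed(908, 1000))   # 1
print(requests_needed(2500, 1000))  # 3
```

With 908 matches and a limit of 1000, a single request returns everything; pagination only matters once the count exceeds the limit.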

In [None]:
# extracting data for all companies
data = url_extraction(query)
print(f"Number of companies extracted: {len(data)}")

In [None]:
# saving the data to a CSV file
data.to_csv("space_startups.csv", index=False)
print("Data saved to space_startups.csv")

This notebook contains only the steps taken to extract the data and save it to CSV, without cell outputs. For the data cleaning steps, see `cleaning_data.ipynb`.