# 📊 LinkedIn Datasets API

Access Bright Data's pre-collected LinkedIn datasets:
- **LinkedIn People Profiles**: 620M+ profiles with 42 fields
- **LinkedIn Company Profiles**: 58.5M+ companies with 36 fields

Unlike web scrapers, which collect data on demand, datasets give you access to pre-collected, structured data that you filter by your own criteria.

---

## Setup

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

API_TOKEN = os.getenv("BRIGHTDATA_API_TOKEN")
if not API_TOKEN:
    raise ValueError("Set BRIGHTDATA_API_TOKEN in .env file")

print(f"API Token: {API_TOKEN[:10]}...{API_TOKEN[-4:]}")
print("Setup complete!")

API Token: 7011787d-2...3336
Setup complete!


## Initialize Client

In [2]:
from brightdata import BrightDataClient

client = BrightDataClient(token=API_TOKEN)

print("Client initialized")
print("Available datasets: linkedin_profiles, linkedin_companies, amazon_products, crunchbase_companies")

Client initialized
Available datasets: linkedin_profiles, linkedin_companies, amazon_products, crunchbase_companies


---
## Test 1: List Available Datasets

List all datasets available in your account.

In [3]:
print("Fetching available datasets...\n")

async with client:
    datasets = await client.datasets.list()

print(f"Found {len(datasets)} datasets:\n")
for ds in datasets[:10]:  # Show first 10
    size_m = ds.size / 1_000_000 if ds.size else 0
    print(f"  - {ds.name}")
    print(f"    ID: {ds.id}")
    print(f"    Size: {size_m:.1f}M records")
    print()

Fetching available datasets...

Found 174 datasets:

  - Crunchbase companies information
    ID: gd_l1vijqt9jfj7olije
    Size: 2.3M records

  - Instagram - Profiles
    ID: gd_l1vikfch901nx3by4
    Size: 620.0M records

  - Manta businesses 
    ID: gd_l1vil1d81g0u8763b2
    Size: 5.6M records

  - US lawyers directory
    ID: gd_l1vil5n11okchcbvax
    Size: 1.4M records

  - LinkedIn company information
    ID: gd_l1vikfnt1wgvvqz95w
    Size: 55.0M records

  - LinkedIn people profiles
    ID: gd_l1viktl72bvl7bjuj0
    Size: 115.0M records

  - TikTok - Profiles
    ID: gd_l1villgoiiidt09ci
    Size: 152.0M records

  - Slintel 6sense company information
    ID: gd_l1vilg5a1decoahvgq
    Size: 10.9M records

  - Owler companies information
    ID: gd_l1vilaxi10wutoage7
    Size: 6.1M records

  - VentureRadar company information
    ID: gd_l1vilsfd1xpsndbtpr
    Size: 0.3M records
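
With 174 datasets in the account, scanning the list by eye gets tedious. The sketch below shows one way to search the list by name. It is a minimal example, not part of the SDK: `DatasetInfo` is a hypothetical stand-in that assumes the objects returned by `client.datasets.list()` expose `id`, `name`, and `size` (as the loop above suggests), `find_datasets` is an illustrative helper, and the sizes are rounded from the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetInfo:
    """Hypothetical stand-in for an item returned by client.datasets.list()."""
    id: str
    name: str
    size: Optional[int] = None


def find_datasets(datasets, keyword):
    """Return the datasets whose name contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [ds for ds in datasets if needle in ds.name.casefold()]


# A few entries copied from the listing above (sizes are rounded).
sample = [
    DatasetInfo("gd_l1vijqt9jfj7olije", "Crunchbase companies information", 2_300_000),
    DatasetInfo("gd_l1vikfnt1wgvvqz95w", "LinkedIn company information", 55_000_000),
    DatasetInfo("gd_l1viktl72bvl7bjuj0", "LinkedIn people profiles", 115_000_000),
    DatasetInfo("gd_l1villgoiiidt09ci", "TikTok - Profiles", 152_000_000),
]

for ds in find_datasets(sample, "linkedin"):
    print(f"{ds.name}: {ds.id}")
```

In the notebook, the same helper could be called on the real `datasets` list instead of `sample`, provided the returned objects have a `name` attribute.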



---
## Test 2: Explore LinkedIn Profiles Fields

Before filtering, explore available fields using the class metadata.

In [4]:
from brightdata.datasets import LinkedInPeopleProfiles

print("=== LinkedIn People Profiles Dataset ===")
print(f"Dataset ID: {LinkedInPeopleProfiles.DATASET_ID}")
print(f"Total fields: {len(LinkedInPeopleProfiles.FIELDS)}")

# Get high fill rate fields (more reliable for filtering)
high_fill = LinkedInPeopleProfiles.get_high_fill_rate_fields(min_rate=70.0)
print(f"\nHigh fill rate fields (>70%): {len(high_fill)}")
for field in high_fill:
    info = LinkedInPeopleProfiles.FIELDS[field]
    print(f"  - {field}: {info['type']} ({info['fill_rate']}%)")
    print(f"    {info['description']}")

=== LinkedIn People Profiles Dataset ===
Dataset ID: gd_l1viktl72bvl7bjuj0
Total fields: 42

High fill rate fields (>70%): 19
  - id: text (100.0%)
    A unique identifier for the person's LinkedIn profile
  - name: text (97.54%)
    Profile name
  - first_name: text (95.1%)
    First name of the user
  - last_name: text (94.8%)
    Last name of the user
  - city: text (96.3%)
    Geographical location of the user
  - country_code: text (97.11%)
    Geographical location of the user
  - position: text (91.23%)
    The current job title or position of the profile
  - url: url (100.0%)
    URL that links directly to the LinkedIn profile
  - input_url: url (100.0%)
    The URL that was entered when starting the scraping process
  - linkedin_id: text (100.0%)
    LinkedIn profile identifier
  - linkedin_num_id: text (100.0%)
    Numeric LinkedIn profile ID
  - avatar: url (96.28%)
    URL that links to the profile picture of the LinkedIn user
  - banner_image: url (96.28%)
    Banner image

In [5]:
# Show all available field names
print("\n=== All Available Fields ===")
for name, info in LinkedInPeopleProfiles.FIELDS.items():
    print(f"  {name}: {info['type']} - {info.get('fill_rate', 'N/A')}%")


=== All Available Fields ===
  id: text - 100.0%
  name: text - 97.54%
  first_name: text - 95.1%
  last_name: text - 94.8%
  city: text - 96.3%
  country_code: text - 97.11%
  location: text - 61.93%
  position: text - 91.23%
  about: text - 18.9%
  url: url - 100.0%
  input_url: url - 100.0%
  linkedin_id: text - 100.0%
  linkedin_num_id: text - 100.0%
  avatar: url - 96.28%
  banner_image: url - 96.28%
  default_avatar: boolean - 95.73%
  followers: number - 71.39%
  connections: number - 70.33%
  recommendations_count: number - 3.65%
  influencer: boolean - 46.06%
  memorialized_account: boolean - 99.44%
  current_company_name: text - 69.6%
  current_company_company_id: text - 38.94%
  current_company: object - 100.0%
  experience: array - 71.49%
  education: array - 41.97%
  educations_details: text - 42.08%
  posts: array - 1.27%
  activity: array - 32.95%
  certifications: array - 8.35%
  courses: array - 2.55%
  languages: array - 9.19%
  publications: array - 1.23%
  patents:
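
To make the fill-rate idea concrete, here is a rough sketch of threshold selection over a plain dictionary. It only mirrors the shape of `LinkedInPeopleProfiles.FIELDS` seen above (a mapping from field name to a dict with `type` and `fill_rate`) and uses a subset of the values printed in the output. `fields_by_fill_rate` is a hypothetical helper, and whether the SDK's `get_high_fill_rate_fields` treats the threshold as inclusive is an assumption here.

```python
# Subset of the field metadata printed above.
FIELDS = {
    "name": {"type": "text", "fill_rate": 97.54},
    "location": {"type": "text", "fill_rate": 61.93},
    "about": {"type": "text", "fill_rate": 18.9},
    "followers": {"type": "number", "fill_rate": 71.39},
    "connections": {"type": "number", "fill_rate": 70.33},
    "current_company_name": {"type": "text", "fill_rate": 69.6},
    "posts": {"type": "array", "fill_rate": 1.27},
}


def fields_by_fill_rate(fields, min_rate=70.0):
    """Return field names with fill_rate >= min_rate, highest fill rate first.

    The inclusive comparison is an assumption, not confirmed SDK behavior.
    """
    selected = [
        (name, info.get("fill_rate", 0.0))
        for name, info in fields.items()
        if info.get("fill_rate", 0.0) >= min_rate
    ]
    selected.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in selected]


reliable = fields_by_fill_rate(FIELDS)
print(reliable)
```

Filtering on a sparsely filled field such as `about` (18.9%) or `posts` (1.27%) excludes most of the dataset before the condition is even evaluated, which is why the high fill rate fields are the safer choice for filter criteria.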

---
## Test 3: Get Dataset Metadata from API

Fetch live metadata from the API to see current field schema.

In [6]:
print("Fetching LinkedIn Profiles metadata from API...\n")

async with client:
    metadata = await client.datasets.linkedin_profiles.get_metadata()

print(f"Dataset ID: {metadata.id}")
print(f"Total fields from API: {len(metadata.fields)}")

print("\n=== Sample Fields ===")
for i, (name, field) in enumerate(list(metadata.fields.items())[:10]):
    print(f"  {name}:")
    print(f"    type: {field.type}")
    print(f"    active: {field.active}")
    print(f"    description: {field.description or 'N/A'}")

Fetching LinkedIn Profiles metadata from API...

Dataset ID: gd_l1viktl72bvl7bjuj0
Total fields from API: 45

=== Sample Fields ===
  id:
    type: text
    active: True
    description: A unique identifier for the person's LinkedIn profile
  name:
    type: text
    active: True
    description: Profile name
  city:
    type: text
    active: True
    description: Geographical location of the user
  country_code:
    type: text
    active: True
    description: Geographical location of the user
  position:
    type: text
    active: True
    description: The current job title or position of the profile
  about:
    type: text
    active: True
    description: A concise profile summary. In some cases, only a truncated version with "…" is displayed on the website, and this is the version we capture
  posts:
    type: array
    active: True
    description: Contains information related to the user's last LinkedIn posts. It typically includes the post title, created date, URL link to th

---
## Test 4: Filter Dataset (Simple Filter)

Filter profiles by a single criterion. `filter()` returns a `snapshot_id` that you use later to check the status and download the results.

In [7]:
# Simple filter: profiles with 10,000+ followers
FILTER = {
    "name": "followers",
    "operator": ">",
    "value": 10000
}
LIMIT = 2  # Only get 2 records for demo

print(f"Filter: {FILTER}")
print(f"Records limit: {LIMIT}\n")

async with client:
    snapshot_id = await client.datasets.linkedin_profiles.filter(
        filter=FILTER,
        records_limit=LIMIT
    )

print(f"Snapshot created: {snapshot_id}")
print("\nNote: filter() returns immediately with a snapshot_id.")
print("The snapshot is built asynchronously - use get_status() or download() next.")

Filter: {'name': 'followers', 'operator': '>', 'value': 10000}
Records limit: 2

Snapshot created: snap_mlev60jlf03ta3ev

Note: filter() returns immediately with a snapshot_id.
The snapshot is built asynchronously - use get_status() or download() next.
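
A typo in a hand-written filter dict only surfaces after a round trip to the API. The sketch below builds the same dict through a small helper that checks the operator first. `make_filter` is a hypothetical helper, not an SDK function, and the operator sets are taken from the Filter Operators table in the summary of this notebook.

```python
# Operators listed in the Filter Operators table of this notebook.
COMPARISON_OPERATORS = {"=", "!=", ">", "<", ">=", "<=", "in", "includes"}
NULL_OPERATORS = {"is_null", "is_not_null"}


def make_filter(name, operator, value=None):
    """Build a single-condition filter dict, rejecting malformed input early."""
    if operator in NULL_OPERATORS:
        if value is not None:
            raise ValueError(f"{operator!r} does not take a value")
        return {"name": name, "operator": operator}
    if operator not in COMPARISON_OPERATORS:
        raise ValueError(f"Unknown operator: {operator!r}")
    if value is None:
        raise ValueError(f"{operator!r} requires a value")
    if operator == "in":
        if not isinstance(value, (list, tuple)):
            raise ValueError("'in' requires a list of values")
        value = list(value)
    return {"name": name, "operator": operator, "value": value}


print(make_filter("followers", ">", 10000))
print(make_filter("about", "is_not_null"))
```

The first call produces the same dict as `FILTER` above, so its result could be passed to `filter()` unchanged.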


---
## Test 5: Check Snapshot Status

Check the status of a snapshot before downloading.

In [11]:
print(f"Checking status for snapshot: {snapshot_id}\n")

async with client:
    status = await client.datasets.linkedin_profiles.get_status(snapshot_id)

print("=== Snapshot Status ===")
print(f"ID: {status.id}")
print(f"Status: {status.status}")
print(f"Dataset ID: {status.dataset_id}")
print(f"Records: {status.dataset_size}")
print(f"File size: {status.file_size} bytes")
print(f"Cost: ${status.cost}")

if status.error:
    print(f"Error: {status.error}")

Checking status for snapshot: snap_mlev60jlf03ta3ev

=== Snapshot Status ===
ID: snap_mlev60jlf03ta3ev
Status: ready
Dataset ID: gd_l1viktl72bvl7bjuj0
Records: 2
File size: 21733 bytes
Cost: $0
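
If the snapshot is not `ready` yet, the status check has to be repeated. The following is a generic poll-with-timeout sketch, not the SDK's implementation. It assumes a status callable that returns a plain string, whereas `get_status()` above returns an object with a `.status` attribute. Only `ready` appears in the output above; the `building` and `failed` state names and the `make_fake_status` stand-in are assumptions made for illustration.

```python
import asyncio
import time


async def wait_until_ready(get_status, snapshot_id, timeout=300.0, poll_interval=5.0):
    """Poll get_status(snapshot_id) until it reports "ready" or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        state = await get_status(snapshot_id)
        if state == "ready":
            return state
        if state == "failed":  # assumed name for a failure state
            raise RuntimeError(f"Snapshot {snapshot_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Snapshot {snapshot_id} not ready after {timeout}s")
        await asyncio.sleep(poll_interval)


def make_fake_status(states):
    """Hypothetical stand-in for the API: yields each state once, then repeats the last."""
    remaining = iter(states)

    async def get_status(snapshot_id):
        return next(remaining, states[-1])

    return get_status


fake = make_fake_status(["building", "building", "ready"])
print(asyncio.run(wait_until_ready(fake, "snap_demo", timeout=5, poll_interval=0.01)))
```

`asyncio.run()` is for standalone scripts. Inside a notebook cell, where an event loop is already running, the call would be `await wait_until_ready(...)` instead.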


---
## Test 6: Download Snapshot Data

Download the filtered data. `download()` polls until the snapshot is ready, then returns the records.

In [3]:
# Snapshot ID from Test 4, redefined so this cell can run on its own
snapshot_id = "snap_mlev60jlf03ta3ev"

In [4]:
print(f"Downloading snapshot: {snapshot_id}")
print("(This will poll until ready...)\n")

async with client:
    data = await client.datasets.linkedin_profiles.download(
        snapshot_id,
        format="jsonl",
        timeout=300,  # 5 minutes
        poll_interval=5  # Check every 5 seconds
    )

print(f"Downloaded {len(data)} profiles\n")

# Display first few profiles
for i, profile in enumerate(data[:3]):
    print(f"=== Profile {i+1} ===")
    print(f"  Name: {profile.get('name', 'N/A')}")
    print(f"  Position: {profile.get('position', 'N/A')}")
    print(f"  City: {profile.get('city', 'N/A')}")
    print(f"  Country: {profile.get('country_code', 'N/A')}")
    print(f"  Followers: {profile.get('followers', 'N/A')}")
    print(f"  Connections: {profile.get('connections', 'N/A')}")
    print(f"  URL: {profile.get('url', 'N/A')}")
    print()

Downloading snapshot: snap_mlev60jlf03ta3ev
(This will poll until ready...)

Downloaded 2 profiles

=== Profile 1 ===
  Name: Jacques Wakefield
  Position: Affiliate Marketer
  City: Jackson, Tennessee, United States
  Country: US
  Followers: 15700
  Connections: 500
  URL: https://linkedin.com/in/jacqueswakefield

=== Profile 2 ===
  Name: Ajay Anand
  Position: Ajay Anand, EY Global Vice Chair, Global Delivery Services |Innovator | Technologist | Board Advisor
  City: San Francisco Bay Area
  Country: US
  Followers: 10649
  Connections: 500
  URL: https://ae.linkedin.com/in/ajay-anand-1912512



---
## Test 7: Combined Filter (AND/OR)

Filter with multiple conditions using AND/OR operators.

In [None]:
# Step 1: Create filter
COMBINED_FILTER = {
    "operator": "and",
    "filters": [
        {"name": "country_code", "operator": "=", "value": "US"},
        {"name": "followers", "operator": ">", "value": 5000}
    ]
}

print("Filter: US-based profiles with more than 5,000 followers")
print("Records limit: 5\n")

async with client:
    snapshot_id = await client.datasets.linkedin_profiles.filter(
        filter=COMBINED_FILTER,
        records_limit=5
    )

print(f"Snapshot created: {snapshot_id}")

In [None]:
# Step 2: Download data
print(f"Downloading snapshot: {snapshot_id}\n")

async with client:
    data = await client.datasets.linkedin_profiles.download(snapshot_id)

print(f"Downloaded {len(data)} profiles:")
for profile in data:
    print(f"  - {profile.get('name', 'N/A')} ({profile.get('country_code', 'N/A')}) - {profile.get('followers', 0)} followers")
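
After downloading, it can be useful to confirm that the records really satisfy the filter. The sketch below evaluates a filter dict locally against a record. It is an illustration of the filter structure, not the server's matching logic: `matches` is a hypothetical helper, treating `includes` as case-insensitive is an assumption, and records with a missing field are treated as non-matching for comparisons.

```python
import operator as op

_COMPARATORS = {"=": op.eq, "!=": op.ne, ">": op.gt, "<": op.lt, ">=": op.ge, "<=": op.le}


def matches(record, flt):
    """Return True if record satisfies flt (a single or combined filter dict)."""
    if "filters" in flt:  # combined filter: recurse into each sub-filter
        results = (matches(record, sub) for sub in flt["filters"])
        if flt["operator"] == "and":
            return all(results)
        if flt["operator"] == "or":
            return any(results)
        raise ValueError(f"Unknown combinator: {flt['operator']!r}")

    value = record.get(flt["name"])
    oper = flt["operator"]
    if oper == "is_null":
        return value is None
    if oper == "is_not_null":
        return value is not None
    if value is None:  # a missing field cannot satisfy a comparison
        return False
    if oper == "in":
        return value in flt["value"]
    if oper == "includes":  # case-insensitive matching is an assumption
        return str(flt["value"]).lower() in str(value).lower()
    if oper in _COMPARATORS:
        return _COMPARATORS[oper](value, flt["value"])
    raise ValueError(f"Unknown operator: {oper!r}")


COMBINED_FILTER = {
    "operator": "and",
    "filters": [
        {"name": "country_code", "operator": "=", "value": "US"},
        {"name": "followers", "operator": ">", "value": 5000},
    ],
}

# Values taken from the two profiles downloaded in Test 6.
records = [
    {"name": "Jacques Wakefield", "country_code": "US", "followers": 15700},
    {"name": "Ajay Anand", "country_code": "US", "followers": 10649},
]
print(all(matches(r, COMBINED_FILTER) for r in records))
```

Running `matches` over `data` from the download step gives a quick sanity check before the records feed into further analysis.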

---
## Test 8: LinkedIn Company Profiles

Access the LinkedIn Company Profiles dataset.

In [None]:
# Step 1: Create filter
COMPANY_FILTER = {
    "name": "company_size",
    "operator": "=",
    "value": "1001-5000 employees"
}

print(f"Filter: {COMPANY_FILTER}")
print(f"Records limit: 5\n")

async with client:
    snapshot_id = await client.datasets.linkedin_companies.filter(
        filter=COMPANY_FILTER,
        records_limit=5
    )

print(f"Snapshot created: {snapshot_id}")

In [None]:
# Step 2: Download data
print(f"Downloading snapshot: {snapshot_id}\n")

async with client:
    data = await client.datasets.linkedin_companies.download(snapshot_id)

print(f"Downloaded {len(data)} companies:")
for company in data:
    print(f"\n=== {company.get('name', 'N/A')} ===")
    print(f"  Industry: {company.get('industries', 'N/A')}")
    print(f"  Size: {company.get('company_size', 'N/A')}")
    print(f"  HQ: {company.get('headquarters', 'N/A')}")
    print(f"  Website: {company.get('website', 'N/A')}")
    print(f"  Followers: {company.get('followers', 'N/A')}")

In [None]:
from brightdata.datasets import export_json, export_csv, export

# Export to JSON
json_file = export_json(data, "linkedin_results.json")
print(f"Exported to: {json_file}")

# Export to CSV
csv_file = export_csv(data, "linkedin_results.csv")
print(f"Exported to: {csv_file}")

# Or use auto-detect based on extension
# export(data, "results.json")
# export(data, "results.csv")

print(f"\nRecords: {len(data)}")

---
## Test 9: Export Results to JSON

In [None]:
import json
from pathlib import Path

if data:
    output_file = Path.cwd() / "linkedin_dataset_results.json"
    
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, default=str)
    
    print(f"Exported to: {output_file}")
    print(f"Records: {len(data)}")
else:
    print("No data to export")
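
JSON handles nested fields such as `experience` (an array) or `current_company` (an object) directly, but a CSV row has no place for them. One possible approach, sketched below with the standard library only, is to serialize nested values as JSON strings before writing. This does not describe how the SDK's `export_csv` behaves; `flatten_record` and `records_to_csv` are hypothetical helpers, and the nested sample values are made up for illustration.

```python
import csv
import io
import json


def flatten_record(record):
    """Serialize dict/list values as JSON strings so each record fits one CSV row."""
    return {
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }


def records_to_csv(records):
    """Return CSV text whose columns are the union of all record keys."""
    rows = [flatten_record(record) for record in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records with one nested field each.
sample = [
    {"name": "A", "followers": 15700, "experience": [{"title": "Marketer"}]},
    {"name": "B", "followers": 10649, "current_company": {"name": "Example Co"}},
]
print(records_to_csv(sample))
```

Reading the file back then takes a `json.loads` call on the nested columns, which keeps the round trip lossless.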

---
## Summary

### Datasets vs Web Scrapers

| Feature | Datasets | Web Scrapers |
|---------|----------|-------------|
| Data source | Pre-collected database | Live scraping |
| Speed | Instant filtering | Real-time collection |
| Use case | Bulk data, analytics | Specific URLs, fresh data |
| Pricing | Per record filtered | Per request |

### Available LinkedIn Datasets

| Dataset | Records | Fields | Access |
|---------|---------|--------|--------|
| LinkedIn People Profiles | 620M+ | 42 | `client.datasets.linkedin_profiles` |
| LinkedIn Company Profiles | 58.5M+ | 36 | `client.datasets.linkedin_companies` |

### Dataset Methods

| Method | Description |
|--------|-------------|
| `get_metadata()` | Get field schema from API |
| `filter(filter, records_limit)` | Create filtered snapshot (returns snapshot_id) |
| `get_status(snapshot_id)` | Check snapshot status |
| `download(snapshot_id)` | Poll and download data |

### Filter Operators

| Operator | Description | Example |
|----------|-------------|---------|
| `=` | Equal to | `{"name": "country_code", "operator": "=", "value": "US"}` |
| `!=` | Not equal | `{"name": "country_code", "operator": "!=", "value": "CN"}` |
| `>`, `<`, `>=`, `<=` | Numeric comparison | `{"name": "followers", "operator": ">", "value": 10000}` |
| `in` | Value in list | `{"name": "country_code", "operator": "in", "value": ["US", "GB"]}` |
| `includes` | Text contains | `{"name": "position", "operator": "includes", "value": "Engineer"}` |
| `is_null` | Field is null | `{"name": "about", "operator": "is_null"}` |
| `is_not_null` | Field is not null | `{"name": "about", "operator": "is_not_null"}` |

### Combined Filters

```python
# AND condition
{
    "operator": "and",
    "filters": [
        {"name": "country_code", "operator": "=", "value": "US"},
        {"name": "followers", "operator": ">", "value": 5000}
    ]
}

# OR condition
{
    "operator": "or",
    "filters": [
        {"name": "country_code", "operator": "=", "value": "US"},
        {"name": "country_code", "operator": "=", "value": "GB"}
    ]
}
```

### Class Helper Methods

| Method | Description |
|--------|-------------|
| `get_field_names()` | List all field names |
| `get_high_fill_rate_fields(min_rate)` | Fields with fill rate above threshold |