# Example 01: Basic Queries

Retrieving data from Socrata databases using sodapy


## Setup


In [1]:
import os
import pandas as pd

from sodapy import Sodapy

## Find some data

The first step is to find a dataset on an open data portal that uses Socrata. The following search options can help you find some great datasets for getting started:

- Limit to Datasets (pre-analyzed stuff is great, but if you're using sodapy you probably want the raw numbers!)
- Sort by "Most Accessed"

Here are [New York City's](https://data.cityofnewyork.us/browse?sortBy=most_accessed&limitTo=datasets), as an example.

Let's look at [NYC's 311 requests](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). Note that the URL has the following format:

https://<**data.cityofnewyork.us**>/Social-Services/311-Service-Requests-from-2010-to-Present/<**erm2-nwe9**>

That domain and identifier will be used below.

When you go to the `Export` menu, you will see an `API` option. Note that the domain and identifier are used there as well.

https://<**data.cityofnewyork.us**>/resource/<**erm2-nwe9**>.json

The identifier is also known as the dataset's "four-four".
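
If you would rather not copy the two pieces out by hand, they can be pulled from either URL form. The following is a minimal sketch using only the standard library; `split_socrata_url` is a hypothetical helper (not part of sodapy), and the pattern assumes a four-four is always four lowercase alphanumeric characters, a hyphen, and four more.

```python
import re
from urllib.parse import urlparse

# Assumption: a four-four is 4 lowercase alphanumerics, a hyphen, then 4 more.
FOUR_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def split_socrata_url(url):
    """Return (domain, four_four) for a dataset page URL or an API endpoint URL."""
    parsed = urlparse(url)
    # Walk the path from the end, since the identifier is usually near the tail.
    for segment in reversed(parsed.path.split("/")):
        candidate = segment.split(".")[0]  # drop an extension such as ".json"
        if FOUR_FOUR.fullmatch(candidate):
            return parsed.netloc, candidate
    raise ValueError(f"no four-four identifier found in {url!r}")


page_url = (
    "https://data.cityofnewyork.us/Social-Services/"
    "311-Service-Requests-from-2010-to-Present/erm2-nwe9"
)
api_url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"

print(split_socrata_url(page_url))
print(split_socrata_url(api_url))
```

Both calls yield the same domain and identifier, which is the pair the client needs.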


![Socrata Interface](socrata_interface.png)


In [2]:
# Enter the information from those sections here
nyc_domain = "data.cityofnewyork.us"
nyc_dataset_identifier = "fhrw-4uyv"


# App Tokens can be generated by creating an account at https://opendata.socrata.com/signup
# Tokens are optional (pass `None` instead), but requests made without one are rate limited.
#
# If you choose to use a token, run the following command in the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

## Get all the data


In [3]:
nyc_client = Sodapy(nyc_domain, socrata_token)
nyc_results = nyc_client.get(nyc_dataset_identifier)
nyc_df = pd.DataFrame(nyc_results)
nyc_df.head()



|  | unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | ... | vehicle_type | resolution_action_updated_date | bridge_highway_name | bridge_highway_direction | road_ramp | bridge_highway_segment | resolution_description | closed_date | taxi_pick_up_location | facility_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 64837352 | 2025-05-04T01:51:11.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10036 | 9 AVENUE | 9 AVENUE | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 64838382 | 2025-05-04T01:51:01.000 | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Music | Street/Sidewalk | 10035 | 345 EAST 118 STREET | EAST 118 STREET | ... | Other |  |  |  |  |  |  |  |  |  |
| 2 | 64836551 | 2025-05-04T01:50:47.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10307 | 224 ELLIS STREET | ELLIS STREET | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 64839521 | 2025-05-04T01:50:46.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 11211 | 130 MARCY AVENUE | MARCY AVENUE | ... |  | 2025-05-04T02:48:55.000 |  |  |  |  |  |  |  |  |
| 4 | 64841506 | 2025-05-04T01:50:40.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11368 | 41-30 JUNCTION BOULEVARD | JUNCTION BOULEVARD | ... |  | 2025-05-04T02:48:20.000 |  |  |  |  |  |  |  |  |


Success! Let's do some minimal analysis.


In [4]:
nyc_df.groupby("complaint_type").size().sort_values(ascending=False)

complaint_type
Noise - Residential               353
Noise - Street/Sidewalk           247
Noise - Commercial                114
Illegal Parking                   110
Noise - Vehicle                    39
Blocked Driveway                   31
Homeless Person Assistance         11
Noise - Park                       10
Drinking                            9
Traffic Signal Condition            6
Abandoned Vehicle                   6
Damaged Tree                        5
Non-Emergency Police Matter         4
Encampment                          4
Dirty Condition                     4
Street Condition                    3
Food Establishment                  3
Street Light Condition              3
Residential Disposal Complaint      3
Noise                               2
Overgrown Tree/Branches             2
PLUMBING                            2
Lost Property                       2
Illegal Fireworks                   2
HEAT/HOT WATER                      2
Dead/Dying Tree                    

## Multiple Data Sources

That was much less annoying than downloading a CSV, and you can still save the dataframe to a CSV if you'd like. Where sodapy really shines, though, is in grabbing data from different sources and mashing them together.
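
Saving a dataframe is a single `to_csv` call. Here is a minimal sketch; the two records and the file name `requests_sample.csv` are made up for illustration and stand in for real query results.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for query results (a list of dicts); not real 311 data.
records = [
    {"unique_key": "1", "complaint_type": "Noise - Residential"},
    {"unique_key": "2", "complaint_type": "Illegal Parking"},
]
df = pd.DataFrame(records)

# Write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "requests_sample.csv"
    # index=False leaves out the row-number column.
    df.to_csv(csv_path, index=False)
    # dtype=str keeps identifiers as text instead of letting pandas infer integers.
    restored = pd.read_csv(csv_path, dtype=str)

print(restored)
```

Leaving out `index=False` writes the row numbers as an extra unnamed column, which then comes back as `Unnamed: 0` when the file is read again.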

For example, let's compare NYC's 311 calls to [Chattanooga, TN](https://www.chattadata.org/dataset/311-Service-Requests/8qb9-5fja/about_data). Socrata makes it so easy, you'd be crazy _not_ to do it!


In [5]:
chatt_domain = "www.chattadata.org"
chatt_dataset_identifier = "8qb9-5fja"
chatt_client = Sodapy(chatt_domain, socrata_token)
chatt_results = chatt_client.get(chatt_dataset_identifier)
chatt_df = pd.DataFrame(chatt_results)
chatt_df.head()



|  | service_request_key | created_date | due_at | completed_at | on_time_indicator | department | request_type | request_type_code | status_code | intake_form | actual_days_to_complete_working | sla_fy_2019 | ispublic | citydst | publiclocation | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9862668 | 2021-04-11T18:37:16.000 | 2021-04-19T18:37:16.000 | 2021-04-16T14:03:34.000 | Yes | PW - Solid Waste | Bagged Yard Waste | URGENT | O-CLOSED | Android | 5 | 4 | yes | 4 | {'type': 'Point', 'coordinates': [-85.12258379... |  |
| 1 | 9862719 | 2021-04-11T18:53:28.000 | 2021-04-16T18:53:28.000 | 2021-04-20T12:09:41.000 | No | PW - Solid Waste | Brush Collection | URGENT | O-CLOSED | iOS | 7 | 4 | yes | 7 | {'type': 'Point', 'coordinates': [-85.330092, ... | On Virginia Ave between 51st and 52nd |
| 2 | 9862582 | 2021-04-11T18:05:49.000 | 2021-04-16T18:05:49.000 | 2021-04-17T13:44:05.000 | No | PW - Solid Waste | Brush Collection | URGENT | O-CLOSED | Android | 5 | 4 | yes | 6 | {'type': 'Point', 'coordinates': [-85.24198942... |  |
| 3 | 9863888 | 2021-04-12T08:41:53.000 | 2021-04-16T08:41:53.000 | 2021-04-15T13:04:38.000 | Yes | PW - Solid Waste | Bulk Trash | URGENT | O-CLOSED | iOS | 3 | 4 | yes | 3 | {'type': 'Point', 'coordinates': [-85.243424, ... |  |
| 4 | 9862879 | 2021-04-11T19:51:47.000 | 2021-04-16T19:51:47.000 | 2021-04-19T15:27:02.000 | No | PW - Solid Waste | Brush Collection | URGENT | O-CLOSED | Iframe | 6 | 4 | yes | 6 | {'type': 'Point', 'coordinates': [-85.17743692... | brush |


In [6]:
# extract tree-related complaints
tree_related = pd.concat(
    [
        nyc_df["complaint_type"].str.contains(r"[Tt]ree").value_counts(),
        chatt_df["description"].str.contains(r"[Tt]ree").value_counts(),
    ],
    axis=1,
    keys=["nyc", "chatt"],
)
tree_related.div(tree_related.sum()).round(2)

|  | nyc | chatt |
| --- | --- | --- |
| False | 0.74 | 0.86 |
| True | 0.26 | 0.14 |


It looks like tree-related requests make up a larger share of NYC's complaints (26%) than of Chattanooga's (14%).
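
One thing to watch when computing shares from a free-text column: rows with no text at all. The `na` argument of `str.contains` controls how those rows are treated, and that changes the denominator. The sketch below uses made-up descriptions and counts a missing value as "not tree-related"; that is one possible choice, not the approach used in the cell above, so it will not reproduce the table.

```python
import pandas as pd

# Illustrative values only; None stands in for a request with no description.
descriptions = pd.Series(["Tree down on road", None, "pothole", "dead tree limb"])

# na=False counts a missing description as "not tree-related" instead of leaving it missing.
is_tree = descriptions.str.contains(r"[Tt]ree", na=False)

# normalize=True returns shares of the total rather than raw counts.
shares = is_tree.value_counts(normalize=True).to_dict()

print(shares)
```

Here two of the four rows match, so each outcome gets a share of 0.5 with the missing row included in the total.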

Note that we can only compare percentages, not totals, since each query result was truncated to 1,000 rows.
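
A quick way to notice this kind of truncation is to compare the number of rows returned with the page size. This is a rough sketch: `looks_truncated` is a hypothetical helper, and the 1,000-row page size is taken from the truncation described above rather than looked up from the server.

```python
# Assumption: results are capped at 1,000 rows per request, as observed above.
DEFAULT_PAGE_SIZE = 1000


def looks_truncated(records, page_size=DEFAULT_PAGE_SIZE):
    """Return True when the result is exactly one full page, a hint that more rows may exist."""
    return len(records) == page_size


full_page = [{"unique_key": str(i)} for i in range(1000)]
short_page = full_page[:353]

print(looks_truncated(full_page))
print(looks_truncated(short_page))
```

A full page is only a hint: a dataset with exactly 1,000 rows would trigger it too.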

What if we want to be smarter about what we ask for, so that we can get 100% of the subset of data
we're most interested in? That's the subject of a future example, so stay tuned!
