# Citi Bike Commuter Pass Workshop

Welcome to the hands-on segment of our AI workshop. We will use a LangChain SQL agent to explore Citi Bike trip data and assess a hypothesis: that a commuter pass aimed at current "Customer" riders is worth launching. The idea is to offer a discounted, route-specific pass that nudges commuters into becoming "Subscribers".

Our descriptive analysis path will be:
- start by comparing the share of Subscribers vs. Customers across all routes to understand the baseline mix;
- narrow the focus to commuter windows (7-10 AM) to see how that mix shifts during peak rides;
- inspect where morning trips originate and how the Customer share varies at those times to identify promising origin-destination pairs.
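
The three steps above can be sketched directly in SQL before any agent is involved. The block below is only an illustration on a toy in-memory table: the `trip` table and its `subscription_type` column appear later in this notebook, but `start_date`, `start_station_name`, `end_station_name`, the ISO timestamp format, and the six sample rows are assumptions made for the example, not the workshop data.

```python
import sqlite3

# Toy stand-in for the workshop database. Only `trip` and `subscription_type`
# come from the notebook; the other columns and all rows are assumptions.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE trip (start_date TEXT, start_station_name TEXT, "
    "end_station_name TEXT, subscription_type TEXT)"
)
demo.executemany(
    "INSERT INTO trip VALUES (?, ?, ?, ?)",
    [
        ("2024-05-01 08:15:00", "A", "B", "Customer"),
        ("2024-05-01 08:40:00", "A", "B", "Subscriber"),
        ("2024-05-01 09:05:00", "A", "B", "Customer"),
        ("2024-05-01 07:30:00", "C", "D", "Subscriber"),
        ("2024-05-01 14:00:00", "A", "B", "Customer"),
        ("2024-05-01 18:20:00", "C", "D", "Subscriber"),
    ],
)

# Step 1: baseline mix across all trips.
baseline = demo.execute(
    "SELECT subscription_type, COUNT(*) FROM trip "
    "GROUP BY subscription_type ORDER BY subscription_type"
).fetchall()

# Step 2: the same mix restricted to the 7-10 AM window (start hours 7, 8, 9).
MORNING = "CAST(strftime('%H', start_date) AS INTEGER) BETWEEN 7 AND 9"
morning = demo.execute(
    f"SELECT subscription_type, COUNT(*) FROM trip WHERE {MORNING} "
    "GROUP BY subscription_type ORDER BY subscription_type"
).fetchall()

# Step 3: Customer share per origin-destination pair in the morning window.
pairs = demo.execute(
    f"""
    SELECT start_station_name, end_station_name,
           COUNT(*) AS trips,
           ROUND(100.0 * SUM(subscription_type = 'Customer') / COUNT(*), 1) AS customer_pct
    FROM trip
    WHERE {MORNING}
    GROUP BY start_station_name, end_station_name
    ORDER BY customer_pct DESC
    """
).fetchall()
```

Pairs with many morning trips and a high `customer_pct` are the ones a route-specific pass would target first. If the real timestamps are not stored in ISO format, the `strftime` filter would need to be adapted.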

Each section below explains the code we run so you can narrate the workflow clearly to participants.


## Align the Python environment

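The first cell downloads two helper modules (`db_loader.py` and `llm_connector.py`) and a `requirements.txt` file from the workshop repository. The second cell installs the LangChain, Groq, and supporting libraries listed in that file so everyone in the workshop has a consistent toolkit. Keeping dependencies aligned avoids distracting package mismatch issues when we run the SQL agent live.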


In [None]:
!wget -q https://raw.githubusercontent.com/alikarami76/genai-workshop/refs/heads/main/db_loader.py
!wget -q https://raw.githubusercontent.com/alikarami76/genai-workshop/refs/heads/main/llm_connector.py
!wget -q https://raw.githubusercontent.com/alikarami76/genai-workshop/refs/heads/main/requirements.txt

In [None]:
%pip install -r requirements.txt

## Connect to our Groq-hosted LLM

We import a helper that authenticates with the Groq API and returns a LangChain-compatible LLM. This model is the reasoning engine that will interpret our prompts, follow the SQL agent instructions, and summarize findings for the group.


In [None]:
from llm_connector import langchain_groq_llm_connector

# Replace the placeholder with your own Groq API key before running this cell.
llm = langchain_groq_llm_connector("Input API Key here", "meta-llama/llama-4-maverick-17b-128e-instruct")
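
Pasting a key into a shared notebook makes it easy to leak. One possible alternative, sketched here, is to read the key from an environment variable and prompt for it only when it is missing. The function name `resolve_api_key` is hypothetical, and `GROQ_API_KEY` is assumed as the variable name; neither comes from the workshop helpers.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    return key
```

The result could then be passed as the first argument to `langchain_groq_llm_connector` in place of the placeholder string.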


## Stage the Citi Bike database

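The helper `prepare_citibike_database` sets up the SQLite file we prepared for the workshop. Called with no arguments, it points directly to the sample database so we can run fast, local queries during the session. It returns three values: the database path (`db_path`), a connection (`conn`), and a `run_query` helper that we use later for a manual check. This dataset contains the trip records we will mine for subscriber vs. customer behavior.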


In [None]:
from db_loader import prepare_citibike_database

db_path, conn, run_query = prepare_citibike_database()
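
Before asking the agent anything, it can help to see which tables and columns exist. The sketch below lists them with SQLite's own catalog; `describe_tables` is a hypothetical helper, and the demo table is a made-up stand-in rather than the workshop schema.

```python
import sqlite3


def describe_tables(conn: sqlite3.Connection) -> dict:
    """Map each user table to the list of its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
    return schema


# Demo on an in-memory database with an assumed, simplified table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trip (id INTEGER, duration INTEGER, subscription_type TEXT)")
schema = describe_tables(demo)
```

In the notebook, `describe_tables(conn)` should work the same way, assuming the `conn` returned by the helper is a standard `sqlite3` connection.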

## Import LangChain agent utilities

These imports pull in the SQL agent factory, toolkit, and display helpers we'll use later to format outputs inside the notebook. Having them ready keeps the rest of the notebook focused on the analytic story rather than setup boilerplate.


In [None]:
import os
from IPython.display import Markdown, HTML, display
from langchain.agents import AgentType, create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase

## Author the system prompt (`prefix`)

`MSSQL_AGENT_PREFIX` is the agent's system prompt. Despite the `MSSQL` in its name, nothing in it is specific to SQL Server: the `{dialect}` placeholder is filled with the dialect of the connected database, which is SQLite here. The prompt sets the standing rules the LLM must follow whenever it reasons about our SQL data: how to construct queries, limit result counts, and return answers. System prompts apply globally across turns, whereas user prompts (like the question we define later) express the specific task we want answered in that moment. By tailoring this prefix, we teach the model to behave like a careful data analyst for the entire workshop.
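
The `{dialect}` and `{top_k}` placeholders in the prefix are template fields that get filled in before the prompt reaches the model. A minimal sketch of that substitution with plain `str.format`, using a shortened stand-in template; we assume the agent factory performs an equivalent step internally.

```python
# Shortened stand-in for the real prefix, keeping only the two placeholders.
TEMPLATE = (
    "Create a syntactically correct {dialect} query. "
    "Limit your query to at most {top_k} results."
)

# The values mirror this workshop: a SQLite database and top_k=30.
rendered = TEMPLATE.format(dialect="sqlite", top_k=30)
print(rendered)
```

Because the text is treated as a format template, any literal curly brace added to the prefix would need to be doubled (`{{` and `}}`) to survive the substitution.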


In [None]:
MSSQL_AGENT_PREFIX = """

You are an agent designed to interact with a SQL database.
## Instructions:
- Given an input question, create a syntactically correct {dialect} query
to run, then look at the results of the query and return the answer.
- Unless the user specifies a specific number of examples they wish to
obtain, **ALWAYS** limit your query to at most {top_k} results.
- You can order the results by a relevant column to return the most
interesting examples in the database.
- Never query for all the columns from a specific table, only ask for
the relevant columns given the question.
- You have access to tools for interacting with the database.
- You MUST double check your query before executing it. If you get an error
while executing a query, rewrite the query and try again.
- DO NOT make any DML or DDL statements (INSERT, UPDATE, DELETE, DROP etc.)
to the database.
- DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE, ONLY USE THE RESULTS
OF THE CALCULATIONS YOU HAVE DONE.
- Your response should be in Markdown. However, **when running a SQL query
in "Action Input", do not include the markdown backticks**.
Those are only for formatting the response, not for executing the command.
- ALWAYS, as part of your final answer, explain how you got to the answer
in a section that starts with: "Explanation:". Include the SQL query as
part of the explanation section.
- If the question does not seem related to the database, just return
"I don\'t know" as the answer.
- Only use the below tools. Only use the information returned by the
below tools to construct your query and final answer.
- Do not make up table names, only use the tables returned by any of the
tools below.

## Tools:

"""

## Describe the reasoning format

`MSSQL_AGENT_FORMAT_INSTRUCTIONS` tells the agent exactly how to structure its internal Thought/Action/Observation loop using the ReAct pattern. Unlike the system prompt content, these format instructions focus on how the model should explain its reasoning steps, not what domain guidance to follow. Keeping the structure explicit makes the live demo easier to narrate because participants can see each hop the agent takes.
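
To make the loop concrete, here is a rough sketch of how a single model turn in this format can be split into either a tool call or a final answer. It is a simplified illustration, not LangChain's actual output parser; `parse_react_step` is a hypothetical name and the sample texts are made up.

```python
import re


def parse_react_step(text: str) -> dict:
    """Classify one model turn as a tool call or a final answer."""
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return {"type": "final", "answer": final.group(1).strip()}
    action = re.search(r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", text, re.DOTALL)
    if action:
        return {
            "type": "action",
            "tool": action.group(1).strip(),
            "input": action.group(2).strip(),
        }
    raise ValueError("Output does not follow the Thought/Action format")


# A tool-call turn: the tool name here is used only for illustration.
step = parse_react_step(
    "Thought: I should list the tables first.\n"
    "Action: sql_db_list_tables\n"
    "Action Input: "
)

# A closing turn that carries the final answer.
done = parse_react_step("Thought: I now know the final answer.\nFinal Answer: 42 trips")
```

Each `action` result corresponds to one hop that participants see in the verbose trace, followed by an Observation supplied by the tool.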


In [None]:
MSSQL_AGENT_FORMAT_INSTRUCTIONS = """

## Use the following format:

Question: the input question you must answer.
Thought: you should always think about what to do.
Action: the action to take, should be one of [{tool_names}].
Action Input: the input to the action.
Observation: the result of the action.
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer.
Final Answer: the final answer to the original input question.

Example of Final Answer:
<=== Beginning of example

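Question: For each given station, on average how many bikes were available vs docks? And which station had the least bicycles and most docks available on average (or was the busiest station)?
Thought: I should average bikes and docks per station by joining status and station.
Action: sql_db_query
Action Input: SELECT ROUND(AVG(status.bikes_available),2) AS avg_available_bikes, ROUND(AVG(status.docks_available),2) AS free_dock_count, station.dock_count AS max_dock_capacity, station.name FROM status INNER JOIN station ON status.station_id = station.id GROUP BY name ORDER BY 1, 2 DESC LIMIT 10;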

Observation:
	avg_available_bikes	free_dock_count	max_dock_capacity	name
0	4.73	6.26	11	Castro Street and El Camino Real
1	5.27	5.72	11	Cowper at University
2	5.29	5.69	11	Santa Clara at Almaden
3	5.67	9.32	15	Commercial at Montgomery
4	6.09	8.89	15	Broadway St at Battery St
5	6.13	12.81	19	2nd at Folsom
6	6.13	8.84	15	Clay at Battery
7	6.22	4.76	11	University and Emerson
8	6.25	8.73	15	Embarcadero at Vallejo
9	6.36	8.63	15	San Jose City Hall
Thought: I now know the final answer.
Final Answer: The station with the least bicycles and most docks available on average (or the busiest station) is "2nd at Folsom" with an average of 6.13 bikes available and 12.81 free docks.

Explanation:
I calculated the average number of available bikes and free docks for each station using the provided data. By comparing these averages, I identified "2nd at Folsom" as the station with the highest average of free docks, indicating it is likely the busiest station.

```sql
SELECT ROUND(AVG(status.bikes_available),2) AS avg_available_bikes, 
            ROUND(AVG(status.docks_available),2) AS free_dock_count,
            station.dock_count AS max_dock_capacity,
            station.name
     FROM status
     INNER JOIN station
     ON status.station_id = station.id
     GROUP BY name
     ORDER BY 1, 2 DESC
     LIMIT 10;
```
===> End of Example

"""

## Build the SQL RAG toolkit

`SQLDatabase.from_uri` wraps our SQLite file so LangChain can issue SQL queries on demand. The `SQLDatabaseToolkit` then exposes that database as a tool the agent can call. Together they implement Retrieval-Augmented Generation (RAG) over structured data: the agent decides which SQL query will retrieve the relevant facts (retrieval), runs it through the toolkit, and then the LLM generates a natural-language explanation grounded in those query results (generation). This is how we let the model extract knowledge directly from the Citi Bike tables.


In [None]:
db = SQLDatabase.from_uri(f'sqlite:///{db_path}')
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
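
The retrieve-then-generate split can be illustrated without any LLM at all. In the sketch below, `retrieve` runs a query and `generate` is a deliberately trivial stand-in for the model that only restates the retrieved rows; both function names, the toy table, and its rows are assumptions made for the example.

```python
import sqlite3


def retrieve(conn: sqlite3.Connection, sql: str) -> list:
    """Retrieval step: run the SQL the model proposed and return the rows."""
    return conn.execute(sql).fetchall()


def generate(question: str, rows: list) -> str:
    """Generation step: a stand-in for the LLM that only restates retrieved rows."""
    facts = "; ".join(f"{label}: {count}" for label, count in rows)
    return f"{question} -> {facts}"


# Toy data: three Subscriber trips and one Customer trip.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE trip (subscription_type TEXT)")
demo.executemany(
    "INSERT INTO trip VALUES (?)",
    [("Subscriber",), ("Subscriber",), ("Customer",), ("Subscriber",)],
)

rows = retrieve(
    demo,
    "SELECT subscription_type, COUNT(*) FROM trip "
    "GROUP BY subscription_type ORDER BY subscription_type",
)
answer = generate("Trips by rider type", rows)
```

The point of the pattern is that the generated text contains only facts that came back from the query, which is the same grounding rule the system prompt imposes on the real agent.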

## Craft the user prompt

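`QUESTION` is the user prompt we hand to the agent. User prompts are turn-by-turn instructions; in this case, a descriptive request for the share of Subscriber versus Customer trips out of all trips. We can iteratively swap in commuter-pass-specific questions (e.g., filtering to 7-10 AM windows) during the workshop to walk through the case study narrative while the system prompt keeps the agent disciplined. The same cell builds the agent with `create_sql_agent`, passing in the system prompt, format instructions, LLM, and toolkit; `top_k=30` supplies the value for the `{top_k}` limit in the system prompt.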


In [None]:
# Alternative question to test the agent with:
# "Give me the total number of trips and the average duration based on each subscription type."
QUESTION = "Give me the share of subscribers vs customers from total trips."

agent_executor_SQL = create_sql_agent(
    prefix=MSSQL_AGENT_PREFIX,
    format_instructions = MSSQL_AGENT_FORMAT_INSTRUCTIONS,
    llm=llm,
    toolkit=toolkit,
    top_k=30,
    verbose=True,
)
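
When swapping in commuter-window questions, a small template keeps the wording consistent from one run to the next. A possible sketch; `commuter_question` is a hypothetical helper, and the phrasing of the time filter is just one option.

```python
def commuter_question(start_hour: int, end_hour: int) -> str:
    """Build a share-of-trips question restricted to a start-time window."""
    if not 0 <= start_hour < end_hour <= 24:
        raise ValueError("Expected 0 <= start_hour < end_hour <= 24")
    return (
        "Give me the share of subscribers vs customers from total trips "
        f"that started between {start_hour}:00 and {end_hour}:00."
    )


# The 7-10 AM commuter window used in the case study.
MORNING_QUESTION = commuter_question(7, 10)
```

`MORNING_QUESTION` can then be passed to the agent in place of `QUESTION`.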

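## Compute the answer manually

Before trusting the agent, we calculate the same shares directly in SQL so we have a reference to fact-check its answer against. The query counts all trips, buckets `subscription_type` into Subscriber, Customer, or Other/Unknown, and reports the two main classes as a percentage of all trips.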

In [None]:
manual_results = run_query("""WITH tot AS (
  SELECT COUNT(*) AS n
  FROM trip
),
dist AS (
  SELECT
    CASE
      WHEN UPPER(subscription_type) LIKE 'SUBSCRIBER%' THEN 'Subscriber'
      WHEN UPPER(subscription_type) LIKE 'CUSTOMER%'   THEN 'Customer'
      ELSE 'Other/Unknown'
    END AS user_type,
    COUNT(*) AS trips
  FROM trip
  GROUP BY 1
)
SELECT
  user_type,
  trips,
  ROUND(trips * 100.0 / (SELECT n FROM tot), 2) AS pct_of_all_trips
FROM dist
WHERE user_type IN ('Subscriber','Customer')  -- keep only the two classes
ORDER BY trips DESC;
""")
manual_results

## Run the agent and review the output

`agent_executor_SQL.invoke(QUESTION)` kicks off the full loop: the LLM reads the system prompt, follows the format instructions, retrieves data through SQL, and returns an answer. Pause after this cell when teaching to interpret the SQL it ran and link the findings back to the commuter pass hypothesis.


In [None]:
agent_executor_SQL.invoke(QUESTION)
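
To compare the agent's prose answer against the manual figures, one option is to pull the percentages out of the text and check them within a tolerance. The sketch below uses made-up numbers purely for illustration; `extract_percentages` and `matches_manual` are hypothetical helpers.

```python
import re


def extract_percentages(text: str) -> list:
    """Pull every number followed by a percent sign out of free text."""
    return [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)]


def matches_manual(answer: str, expected: list, tol: float = 0.1) -> bool:
    """True when each expected percentage appears in the answer within tol."""
    found = extract_percentages(answer)
    return all(any(abs(f - e) <= tol for f in found) for e in expected)


# Illustrative text and values only; these are not results from the workshop data.
sample_answer = "Subscribers account for 84.5% of trips and Customers for 15.5%."
ok = matches_manual(sample_answer, [84.5, 15.5])
```

In the notebook, the text to check would be the agent's final answer, assuming the value returned by `invoke` is a dictionary with an `output` key, and the expected values would come from `manual_results`.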

## Optional exploration space

Use this empty cell to prototype follow-up questions live—such as drilling into commuter routes or testing the impact of different origin-destination pairs—without modifying the main workflow above.
