Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to count the number of occurrences of an item in a a list across all rows? #4169

Closed
vedantroy opened this issue Jul 20, 2022 · 9 comments

Comments

@vedantroy
Copy link

Hi! Not sure if this is the best place to ask questions, but I'd figured I'd give it a shot.

Let's say I have 2 rows (right now the second column is a string, but it could easily be a list):

(1, "foo bar baz bux qux")
(2, "foo baz bux qux bax")

And I want to write a query that outputs something like:

{
    foo: 2,
    baz: 2,
    bux: 1,
    qux: 2,
    bax: 1
}

what might be a good query to do this / can someone provide a bit of guidance?
I've been investigating the LIST and MAP datatypes, which seem useful for this. But I'm pretty unused to the declarative style of SQL.

If this was JS I'd do:

let tagToOcc = {}
for (const row of rows) {
    const tags = row[1].split(" ")
   for (const tag of tags) {
         tagtoOcc[tag] += 1
  }
}

but I have no idea how to make loops in SQL / no sure if that's even the thing I should be doing

@Alex-Monahan
Copy link
Contributor

Howdy and welcome to SQL-Land! :-)

Two functions will do the trick for you! String_split, and unnest. Unnest is a special function that takes lists and creates a separate row for each element.

With lists_split_into_rows as (
select 
  col1,
  unnest(string_split(col2, ' ')) as new_column
from my_table
)
Select 
  new_column,
  count(*) as my_count
from lists_split_into_rows
Group by
  New_column

https://duckdb.org/docs/sql/query_syntax/unnest

@vedantroy
Copy link
Author

@Alex-Monahan This worked! Do you have an idea of how I can query this result from the Python API?
I don't think using fetchall is the correct API here, and execute does not return a result.

@vedantroy
Copy link
Author

@Alex-Monahan This worked! Do you have an idea of how I can query this result from the Python API? I don't think using fetchall is the correct API here, and execute does not return a result.

Nevermind, got it.

    conn = duckdb.connect(duck_db, read_only=True)
    conn.execute(query)
    print(conn.fetchall())

@Alex-Monahan
Copy link
Contributor

If you are using Pandas, I recommend .df() instead! For large result sets, it should be much faster (since it's passing numpy arrays around instead of individual Python objects)

@vedantroy
Copy link
Author

@Alex-Monahan

Thanks for the tip! I ended up storing the string as a VARCHAR[] inside of the database.
If I want to to retrieve the VARCHAR[] as a single string that is composed of white-space separated tokens, what would be the best thing to do?

I tried doing:

SELECT id, concat_ws(' ', tags) as text FROM items

But it seems like that returns [a, b, c, def, asd] (presumably the VARCHAR[] is cast to a string before being passed to concat_ws).

@Alex-Monahan
Copy link
Contributor

Howdy! I think the function you are looking for is list_string_agg!
https://duckdb.org/docs/sql/functions/nested#list-aggregates

@vedantroy
Copy link
Author

vedantroy commented Jul 21, 2022

Howdy! I think the function you are looking for is list_string_agg! https://duckdb.org/docs/sql/functions/nested#list-aggregates

This function is pretty close, but it joins with a comma (I cannot control the separator, which is important here).
I'll see if string_agg does the trick.

Edit: That's not the right function either.

It's not a big deal though, I can just do the relevant processing in Python

@Alex-Monahan
Copy link
Contributor

Ah, sorry! I'm not at a desktop to validate...
Let's try:

select 
  id, 
  string_agg(unnest(tags),' ') as text
from items
group by
  id

@vedantroy
Copy link
Author

Thanks for the solution! I haven't had the chance to try it yet, because I found a different workaround, but when I revisit the code, I'll see if it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants