
[ENH] Count the cumulative number of unique elements in a column #515

Closed
ericmjl opened this issue Jul 30, 2019 · 22 comments · Fixed by #563 or #575
Labels
available for hacking · enhancement · good first issue

Comments

@ericmjl
Member

ericmjl commented Jul 30, 2019

Brief Description

I would like to propose a function that counts the cumulative number of unique items in a column.

Example implementation

def cumulative_unique(df, column_name, new_column_name):
    """
    Compute the cumulative number of unique items in a column that have been seen.
    
    This function can be used to limit the number of source plates
    that we use to seed colonies.
    """
    df = df.copy()
    unique_elements = set()
    cumulative_unique = []
    for val in df[column_name].values:
        unique_elements.add(val)
        cumulative_unique.append(len(unique_elements))
    df[new_column_name] = cumulative_unique
    return df

I'm happy for anybody else to run with the implementation above and improve upon it.

ericmjl added the available for hacking, enhancement, and good first issue labels on Jul 30, 2019
@jk3587
Collaborator

jk3587 commented Aug 10, 2019

Would this be the same as pandas' built-in nunique?

def cumulative_unique(df, column_name, new_column_name):
    """
    Compute the cumulative number of unique items in a column that have been seen.
    
    This function can be used to limit the number of source plates
    that we use to seed colonies.
    """
    df = df.copy()
    df[new_column_name] = df[column_name].nunique()
    return df

@ericmjl
Member Author

ericmjl commented Aug 10, 2019

@jk3587 not exactly the same, I think.

nunique returns a single number, whereas I was interested in looking for a function that, as it goes down the series, keeps track of the total number of unique elements already seen.
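
Something like this sketch captures the behaviour I'm after (using duplicated() as one possible route, not a proposed final API):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 1, 1, 3, 4]})
# ~duplicated() flags the first occurrence of each value;
# the cumulative sum of those flags is the running unique count.
df['cum_unique_x'] = (~df['x'].duplicated()).cumsum()
# -> 1, 2, 3, 3, 3, 3, 4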

@rahosbach
Contributor

@jk3587 and @ericmjl:
I got some inspiration from the accepted answer to this SO question. In this case we don't actually need groupby(), which is nice (although a grouped variant could be a nice feature to add if the user wants a cumulative count of unique values within a group column they specify; see the sketch after the example output below). In the simplest case, we only need a dummy variable.

Here's a possible implementation:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 1, 1, 3, 4, 5, 3, 2, 6]})
df['cum_count_x'] = (
    df.drop_duplicates()      # keep the first occurrence of each value
      .assign(dummy=1)
      .dummy
      .cumsum()               # running count of unique values at those rows
      .reindex(df.index)      # expand back to the full index
      .ffill()
      .astype(int)
)

The result is:

    x            cum_count_x
0   1            1
1   2            2
2   3            3
3   1            3
4   1            3
5   3            3
6   4            4
7   5            5
8   3            5
9   2            5
10  6            6
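
And here's a quick sketch of the grouped variant mentioned above (column names are hypothetical):

df = pd.DataFrame({
    'g': ['a', 'a', 'b', 'b', 'a'],
    'x': [1, 1, 1, 2, 2],
})
# Running count of unique x values seen so far within each group g.
df['cum_count_x_by_g'] = df.groupby('g')['x'].transform(
    lambda s: (~s.duplicated()).cumsum()
)
# g == 'a' rows -> 1, 1, 2;  g == 'b' rows -> 1, 2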

@Ram-N
Contributor

Ram-N commented Aug 13, 2019

@rahosbach The idea of dropping duplicates and then ffill() is clever! I wonder about the execution time for large df's though.

@rahosbach
Contributor

@Ram-N Can you think of anything faster?

I just ran a test on my laptop: the above method completes in under 75 ms on a column of 1 million random integers between 0 and 99, and in under 100 ms on 1 million random lowercase ASCII letters (a single letter per row). That seems fast enough to me.

Maybe we can implement as I suggested; then, if someone comes up with a more efficient solution, they can add it as a performance enhancement PR.
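
For reference, my timing setup looked roughly like this (a sketch; cum_count stands for the drop_duplicates()/cumsum() snippet above wrapped in a function, and the exact data generation is an assumption):

import string

import numpy as np
import pandas as pd

# 1 million random integers between 0 and 99
df_ints = pd.DataFrame({'x': np.random.randint(0, 100, size=1_000_000)})

# 1 million random lowercase ASCII letters, one per row
df_chars = pd.DataFrame({'x': np.random.choice(list(string.ascii_lowercase),
                                               size=1_000_000)})

# In IPython/Jupyter:
# %timeit cum_count(df_ints, 'x', 'cum_count_x')   # reported: < 75 ms
# %timeit cum_count(df_chars, 'x', 'cum_count_x')  # reported: < 100 ms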

@ericmjl
Member Author

ericmjl commented Aug 14, 2019

@rahosbach neat stuff! Does this work with a column of strings as the thing we want to do a cumulative_unique count on?

@rahosbach
Contributor

@ericmjl Yes, it should work for strings as well. It's worth noting that the strings will be compared case-sensitively (i.e., 'a' != 'A').

Users could specify a boolean case_sensitive parameter for non-numeric columns. If case_sensitive is False, the column would just need to be lower-cased (e.g., with Series.str.lower()) before running the cumulative unique count.
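
A minimal sketch of that option (the case_sensitive parameter name is just a suggestion):

import pandas as pd

def cumulative_unique(df, column_name, new_column_name, case_sensitive=True):
    df = df.copy()
    series = df[column_name]
    if not case_sensitive:
        # Fold case before counting; only meaningful for string columns.
        series = series.astype(str).str.lower()
    df[new_column_name] = (~series.duplicated()).cumsum()
    return df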

@szuckerman
Collaborator

So, you might be able to increase the speed in the following way.

This part of the code:

    for val in df[column_name].values:
        unique_elements.add(val)
        cumulative_unique.append(len(unique_elements))

Do we really need to make a list of all the values that then get copied? Say there are only two unique values in a DataFrame with 1MM rows; it seems unnecessary to build a list of 1MM elements.

I would think we could do something like this for efficiency:

import pandas as pd

df = pd.DataFrame({'a': [9, 8, 8, 9, 8, 8, 9, 9, 9, 10]})
# df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8]})
column_name = 'a'
n_unique = 1
# Each entry is a pair [running_unique_count, start_index_of_that_count].
cumulative_unique_list = [[n_unique, 1]]
unique_vals = {df[column_name].values[0]}

for idx, val in enumerate(df[column_name].values[1:], 1):
    if val not in unique_vals:
        n_unique += 1
        cumulative_unique_list[-1][1] = idx - 1
        cumulative_unique_list.append([n_unique, idx])
        unique_vals.add(val)

# https://stackoverflow.com/questions/4029436/subtracting-the-current-and-previous-item-in-a-list
# Difference consecutive start indices to get run lengths.
new_vals = [cumulative_unique_list[0][1]] + [
    y[1] - x[1] for x, y in zip(cumulative_unique_list, cumulative_unique_list[1:])
]

if new_vals[0] == 0:
    new_vals[0] = 1

cumulative_unique_list_vals = [val[0] for val in cumulative_unique_list]

final_list = list(zip(cumulative_unique_list_vals, new_vals))

def return_vals():
    for unique_items in final_list:
        yield [unique_items[0]] * unique_items[1]

def flat_list():
    for sublist in return_vals():
        for item in sublist:
            yield item

df['unique_vals'] = list(flat_list())

The main difference is that it stores the values as pairs (n_unique_vals, num_times_to_repeat), so the size of the list depends only on the number of "changes"; you don't have to keep appending to a gigantic list held in memory.
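
Incidentally, the final expansion of those pairs could also be done with np.repeat (a sketch, assuming final_list as built above):

import numpy as np

# final_list holds (running_unique_count, run_length) pairs,
# e.g. [(1, 3), (2, 2)] expands to [1, 1, 1, 2, 2].
counts, repeats = zip(*final_list)
df['unique_vals'] = np.repeat(counts, repeats)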

@rahosbach
Contributor

@szuckerman That's an interesting method, for sure. I turned your code into a stand-alone method and tried to do some profiling on it.

I haven't dug into your code enough to know exactly why this is happening, but sometimes when I run the method (not all the time) the length of list(flat_list()) is not the expected number of rows of df. Therefore, a ValueError is thrown:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-94c9b13a2391> in <module>
----> 1 get_ipython().run_line_magic('timeit', "cum_count2(df, 'x', 'new_x')")

~\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
   2285                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2286             with self.builtin_trap:
-> 2287                 result = fn(*args,**kwargs)
   2288             return result
   2289 

<decorator-gen-62> in timeit(self, line, cell, local_ns)

~\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in timeit(self, line, cell, local_ns)
   1129             for index in range(0, 10):
   1130                 number = 10 ** index
-> 1131                 time_number = timer.timeit(number)
   1132                 if time_number >= 0.2:
   1133                     break

~\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in timeit(self, number)
    158         gc.disable()
    159         try:
--> 160             timing = self.inner(it, self.timer)
    161         finally:
    162             if gcold:

<magic-timeit> in inner(_it, _timer)

<ipython-input-14-0edcf8b68942> in cum_count2(df, column_name, new_column_name)
     40             for item in sublist:
     41                 yield item
---> 42     df[new_column_name] = list(flat_list())
     43     return df

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3368         else:
   3369             # set column
-> 3370             self._set_item(key, value)
   3371 
   3372     def _setitem_slice(self, key, value):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3443 
   3444         self._ensure_valid_index(value)
-> 3445         value = self._sanitize_column(key, value)
   3446         NDFrame._set_item(self, key, value)
   3447 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3628 
   3629             # turn me into an ndarray
-> 3630             value = sanitize_index(value, self.index, copy=False)
   3631             if not isinstance(value, (np.ndarray, Index)):
   3632                 if isinstance(value, list) and len(value) > 0:

~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in sanitize_index(data, index, copy)
    517 
    518     if len(data) != len(index):
--> 519         raise ValueError('Length of values does not match length of index')
    520 
    521     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index

@szuckerman
Collaborator

@rahosbach I'd have to see the DataFrame you're running it on, but what first comes to mind is whether there's an issue with the first two or last two values being distinct.

I had to put in some dummy code:

if new_vals[0] == 0:
    new_vals[0] = 1

because the run length of the first element would otherwise be set to 0, and we want it to be at least 1.

@rahosbach
Contributor

@szuckerman With or without the dummy code, the error persists. I generated a single-column DataFrame of 100 randomly-generated integers. The first two and last two values are all distinct.

x
--
36
79
75
97
82
67
78
2
49
48
...
30
5
41
84
2
3
5
89
2
51

100 rows × 1 columns

@szuckerman
Collaborator

Hmm, might be an Anaconda thing. The following works for me on Python 3.4:

import pandas as pd 
import timeit

df = pd.DataFrame({'a':[36, 79, 75, 97, 82, 67, 78, 2, 49, 48, 30, 5, 41, 84, 2, 3, 5, 89, 2, 51]})

def count_func(df, column_name, new_col):
    n_unique = 1
    cumulative_unique_list = [[n_unique, 1]]
    unique_vals = {df[column_name].values[0]}
    for idx, val in enumerate(df[column_name].values[1:], 1):
        if val not in unique_vals:
            n_unique += 1
            cumulative_unique_list[-1][1] = idx - 1
            cumulative_unique_list.append([n_unique, idx])
            unique_vals.add(val)
    # https://stackoverflow.com/questions/4029436/subtracting-the-current-and-previous-item-in-a-list
    new_vals = [cumulative_unique_list[0][1]] + [
        y[1] - x[1] for x, y in zip(cumulative_unique_list, cumulative_unique_list[1:])
    ]
    if new_vals[0] == 0:
        new_vals[0] = 1
    cumulative_unique_list_vals = [val[0] for val in cumulative_unique_list]
    final_list = list(zip(cumulative_unique_list_vals, new_vals))
    def return_vals():
        for unique_items in final_list:
            yield [unique_items[0]] * unique_items[1]
    def flat_list():
        for sublist in return_vals():
            for item in sublist:
                yield item
    df[new_col] = list(flat_list())

count_func(df, 'a', 'unique_vals')

num=10000
timed=timeit.timeit("count_func(df, 'a', 'unique_vals')", number=num, setup="from __main__ import count_func, df")
print('Ran in %s seconds, %s seconds per interval' %(timed, timed/num))
>>> Ran in 1.6408557668328285 seconds, 0.00016408557668328284 seconds per interval

@rahosbach
Contributor

Yeah, this is really interesting. I am using the IPython console and JupyterLab to run my tests, so it could be an Anaconda thing (I'm running Python 3.6.5). What's weird is that I can get your latest method to work with the dataframe you supplied. However, I can't run it on my randomly-generated dataframe of 100 values.

In terms of timing, your method with your dataframe runs in 329 µs on my machine, compared to my method which runs in 5.09 ms. So, for a small series your method is definitely faster. I am unable to test it on a larger dataframe, though. Could you generate a large dataframe and test your method against this:

def cum_count(df, column_name, new_column_name):
    df[new_column_name] = (
        df[[column_name]]
        .drop_duplicates()
        .assign(dummy=1)
        .dummy
        .cumsum()
        .reindex(df.index)
        .ffill()
        .astype(int)
    )
    return df

@szuckerman
Collaborator

Oh, I think I found the bug in mine... it "stops" once it has hit the last unique value.

When looking at 1,000,000 values drawn from 1-50, the last new unique value shows up by index 188, so I return a list of length 188. We want a list of length 1,000,000.

@rahosbach
Contributor

@szuckerman Did you have a chance to update your method yet?

I added a very simple if construct at the end that appends the last number as many times as necessary to reach the required array length:

list_vals = list(flat_list())
if len(list_vals) < df.shape[0]:
    df[new_col] = list_vals + [list_vals[-1]] * (df.shape[0] - len(list_vals))
else:
    df[new_col] = list_vals

It works, but I'm sure you have a better (more efficient) thought in mind about how to achieve this.
Anyway, with that addition I find that (on my laptop) your method runs faster than mine up to about 8,300 random numbers, at which point my version starts running faster (btw, I'm just using the %timeit magic to profile).

@szuckerman
Collaborator

I need to clean up the code a bit, but the initial tests show that your code is much faster. I believe the bottleneck is that last step, where two huge lists get concatenated. I'm going to try to make it a generator, which might help with the time/space complexity.
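
Something like this is what I have in mind (a sketch using itertools, with final_list as built above):

from itertools import chain, repeat

def expanded(pairs):
    # Lazily expand (value, run_length) pairs without building intermediate lists.
    return chain.from_iterable(repeat(val, n) for val, n in pairs)

df['unique_vals'] = list(expanded(final_list))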

@ericmjl
Member Author

ericmjl commented Aug 16, 2019

Wow, you guys, I'm enjoying the conversation here! 🎉 I'm glad this idea kicked off some thoughts. Let me know if there's any way I can facilitate this into a PR!

@szuckerman
Collaborator

So, we're going with the original one from @rahosbach. Apparently his runs in effectively constant time at these sizes, whereas mine is O(n). nrows is the number of rows in the randomly generated DataFrame.

The code is below if you'd like to try this yourself:

[Plot: time_per_iteration vs. nrows for count_func and cum_count]

import timeit

import numpy as np
import pandas as pd
from plotnine import *


def count_func(df, column_name, new_col):
    n_unique = 1
    cumulative_unique_list = [[n_unique, 1]]
    unique_vals = {df[column_name].values[0]}
    for idx, val in enumerate(df[column_name].values[1:], 1):
        if val not in unique_vals:
            n_unique += 1
            cumulative_unique_list[-1][1] = idx - 1
            cumulative_unique_list.append([n_unique, idx])
            unique_vals.add(val)
    # https://stackoverflow.com/questions/4029436/subtracting-the-current-and-previous-item-in-a-list
    new_vals = [cumulative_unique_list[0][1]] + [
        y[1] - x[1] for x, y in zip(cumulative_unique_list, cumulative_unique_list[1:])
    ]
    if new_vals[0] == 0:
        new_vals[0] = 1
    cumulative_unique_list_vals = [val[0] for val in cumulative_unique_list]
    final_list = list(zip(cumulative_unique_list_vals, new_vals))
    def return_vals():
        for unique_items in final_list:
            yield [unique_items[0]] * unique_items[1]
    def flat_list():
        for sublist in return_vals():
            for item in sublist:
                yield item
    final_flat_list = list(flat_list())
    # Pad with the final running count if the run-length expansion stops early.
    if len(cumulative_unique_list_vals) == len(unique_vals) and df.shape[0] != len(final_flat_list):
        df[new_col] = final_flat_list + [final_flat_list[-1]] * (df.shape[0] - len(final_flat_list))
    else:
        df[new_col] = final_flat_list


def cum_count(df, column_name, new_column_name):
    df[new_column_name] = (
        df[[column_name]]
        .drop_duplicates()
        .assign(dummy=1)
        .dummy
        .cumsum()
        .reindex(df.index)
        .ffill()
        .astype(int)
    )
    return df


sizes = [10, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

count_func_list = []
cum_count_list = []

for size in sizes:
    df = pd.DataFrame({'a': np.random.randint(50, size=size)})
    num = 10
    timed = timeit.timeit("count_func(df, 'a', 'unique_vals')", number=num,
                          setup="from __main__ import count_func, df")
    print('Ran in %s seconds, %s seconds per interval' % (timed, timed / num))
    count_func_list.append(timed / num)
    # Regenerate a fresh DataFrame of the same size for the second benchmark.
    df = pd.DataFrame({'a': np.random.randint(50, size=size)})
    timed = timeit.timeit("cum_count(df, 'a', 'unique_vals')", number=num,
                          setup="from __main__ import cum_count, df")
    print('Ran in %s seconds, %s seconds per interval' % (timed, timed / num))
    cum_count_list.append(timed / num)

dat = pd.DataFrame({'nrows': sizes, 'count_func': count_func_list, 'cum_count': cum_count_list})

dat_melted = dat.melt(id_vars=['nrows'], value_vars=['count_func', 'cum_count'],
                      var_name='function_type', value_name='time_per_iteration')

ggplot(dat_melted, aes(x='nrows', y='time_per_iteration', color='function_type', group='function_type')) + geom_line()

@rahosbach
Contributor

@szuckerman @ericmjl
Apologies for my delayed response here; I was out on vacation. However, it sounds like we'll push forward with creating a PR using the drop_duplicates() and cumsum() approach I had specified before (at least as a first pass at adding this functionality).

@ericmjl: I was thinking of naming the method count_cumulative_unique(). Does that work? And where would you like the method added to the codebase?

@ericmjl
Member Author

ericmjl commented Aug 31, 2019

@rahosbach no worries! I hope you had a great vacation 😄

For the function name, might there be a two-word name that you can think of? (I was thinking cumulative_unique might be explicit enough, but if you think we should prefix the count_, then I'm good with that too!)

As for where it should go, I think this would be best added in the functions.py module.

Don't forget docs and tests! We can talk more about it on the PR 😄.

As always, I'm very appreciative of both of your time and effort here, @szuckerman and @rahosbach 😸. Thanks for your contributions!

@rahosbach
Contributor

@ericmjl That sounds good. However, I thought that part of the pyjanitor philosophy was to have verb-based method names to call on data frames (which is why I started my suggested method name with count)... Am I wrong about this?

@ericmjl
Member Author

ericmjl commented Aug 31, 2019

@rahosbach you're not wrong at all! Indeed, there's a bit of tension between "short function names" and "verb-based" sometimes, and in my mind I'm still not 100% clear on whether it's better to be consistent with shortness or consistent with verb-ness. Actually, now that you've reminded me about the original design goals outlined in the README and in my talk, let's go with your suggestion. count_cumulative_unique it is! 😄
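
For reference, here's a minimal sketch of how the final method might look in functions.py, following pyjanitor's pandas_flavor registration pattern (the exact signature is an assumption, to be settled in the PR):

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def count_cumulative_unique(df, column_name, dest_column_name, case_sensitive=True):
    """Generate a running count of the unique values seen in column_name.

    case_sensitive=False (the hypothetical option discussed above) folds
    string case before counting.
    """
    series = df[column_name]
    if not case_sensitive:
        series = series.astype(str).str.lower()
    df = df.copy()
    df[dest_column_name] = (
        series.to_frame()
        .drop_duplicates()
        .assign(dummy=1)
        .dummy
        .cumsum()
        .reindex(df.index)
        .ffill()
        .astype(int)
    )
    return df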
