# advanced dataframe manipulations

today we will be expanding on our work with pandas dataframes. 

there are times where we need to filter and run operations on dataframes across multiple columns and rows simultaneously. this can get tricky, but luckily for us, ludovica just went through such an exercise and lived to tell about it. 

the following is a toy example based on her work on the location project. she found herself faced with a formidable dataframe in which she needed to count occurrences of a certain value across multiple columns, aggregated by the values of a categorical column. specifically, she is interested in counting how many location codes occur in the filters of any of the various dashboards for each client. she also needs to count how many times a given location code is filtered on, both per client and overall. 

this kind of counting across multiple columns is a common enough operation to warrant special attention. in discussing this with her, i thought this might offer an opportunity to talk about more topics related to dataframe manipulation. 

we will start with an introduction to some useful dataframe functions: `rank`, `mask`, `transpose`, `name` and then talk about how to reshape dataframes for fun and profit. 

# this week's exercise

given a toy dataframe `A` with columns `client_id`, `dash_a_filter`, `dash_b_filter`, `dash_c_filter`, we want to answer questions like: how many times does each client filter, in any of its dashboards, by each location code? how many dashboards does each location code occur in? how many times overall?

consider that we have dataframes that have an id column (say, `client_id`) and at least one categorical column of values we want to count, but possibly many more such columns. note that it is likely that some (many?) of the values in these columns are `None`s. 

imagine that we do not know the name of any of the columns in advance, nor do we know their number but we want to:

- write a function that accepts: 
    + such a dataframe, 
    + the name of the id column, 
    + and a list of the (categorical) column names.  
    then, the function returns a series whose indices are all the distinct values that occur at least once in *any* of the categorical columns, and whose corresponding values are the number of distinct ids in the dataframe that the index value occurs with.
- write another function that accepts the same kind of input, but this time returns a series indexed on the ids (one row per id), whose values are the number of distinct values occurring for that id across all the categorical columns. 

## template
you might start with this template, if you'd like:

```
import pandas as pd

def count_id_occurances_per_value(df, id_col, list_of_cat_cols):
    """
    counts, for each distinct value occurring in any of the categorical columns
    in `list_of_cat_cols`, the number of distinct ids from the id column (`id_col`)
    that the value occurs with in the `df` dataframe.
    """
    distinct_values = set()
    for acolumn in list_of_cat_cols:
        # get the distinct values occurring in column `acolumn`
        ...
        # update the set of distinct values
        ...
    # pass a list as the index: a set has no defined order
    value_series_counts = pd.Series(index=list(distinct_values))
    ...

    return value_series_counts

def count_distinct_value_occurances_per_id(df, id_col, list_of_cat_cols):
    """
    counts, for each id in `id_col`, the number of occurrences of each distinct value
    occurring in any of the categorical columns listed in `list_of_cat_cols`.
    """
    # get distinct ids:
    ...
    id_series_counts = pd.Series(index=df[id_col].unique())
    ...
    return id_series_counts

```

In [1]:
import numpy as np
import pandas as pd
import random

In [2]:
# we start by creating a dataframe to play with. this is how many rows it shall have:
NUM_ROWS = 12
# these are the possible ids that it has:
all_distinct_ids  = [10,11,13,14,15,16,17,18,19,25] # note that 12 is missing!
# and these are the possible values that each of the categorical column can have:
all_distinct_values = [None, 'cat', 'dog', 'dogs', 'hamster', 'parrot', None]

# remember that a pandas dataframe is comprised of a set of equally long pandas series, 
# each series being a single column. let us construct the series 
id_list = sorted([random.choice(all_distinct_ids) for _ in range(NUM_ROWS)])
column_id = pd.Series(
    id_list, 
    name='person_id')

column_values_a = pd.Series(
    [random.choice(all_distinct_values[:3]) for _ in range(NUM_ROWS)], name='col_a')

column_values_b = pd.Series(
    [random.choice(all_distinct_values[3:]) for _ in range(NUM_ROWS)], 
    name='col_b')

value_list = [random.choice(all_distinct_values) for _ in range(NUM_ROWS)]
column_values_c = pd.Series(
    value_list, 
    name='col_c')

value_list = [random.choice(['cut','cot', 'cat']) for _ in range(NUM_ROWS)]
column_values_bull = pd.Series(
    value_list, 
    name='another_column')

# now we assemble the series into a dataframe:
all_columns = [column_id, column_values_a, column_values_b, column_values_c, column_values_bull]
df = pd.concat(all_columns, axis=1)
print(df)

    person_id col_a    col_b    col_c another_column
0          10  None  hamster     None            cut
1          11   dog     dogs      dog            cot
2          11   cat     dogs     None            cut
3          14   dog     None     dogs            cot
4          14   cat  hamster     None            cot
5          14   dog     dogs     dogs            cut
6          16  None  hamster  hamster            cut
7          17  None     dogs     dogs            cat
8          18   dog     dogs     None            cat
9          19   cat   parrot      dog            cat
10         25   dog     dogs   parrot            cot
11         25   dog   parrot      dog            cat
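
since the cell above draws its values with `random.choice`, re-running it gives a different dataframe every time. to follow along with the exact numbers shown in this notebook, one option is to rebuild the same dataframe from literal values. this is only a sketch: the values below are copied by hand from the printed output above.

```python
import pandas as pd

# values copied from the printed dataframe above, one list per column
df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
    'another_column': ['cut', 'cot', 'cut', 'cot', 'cot', 'cut',
                       'cut', 'cat', 'cat', 'cat', 'cot', 'cat'],
})
print(df.shape)
```

alternatively, calling `random.seed(...)` with a fixed number before building the lists makes the random draws repeatable, although the values will then differ from the ones printed here.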


In [3]:
# note the names of the categorical columns:
column_names = ['col_a','col_b','col_c',]
# and the name of the id column
id_column = "person_id"

In [4]:
# how many distinct people (ids) do we have in the set?
distinct_person_ids = df[id_column].unique()
print('there are', len(distinct_person_ids), 'people in the dataframe. these are:')
print(distinct_person_ids)

there are 8 people in the dataframe. these are:
[10 11 14 16 17 18 19 25]


In [5]:
# how many distinct values occur in each column of the dataframe?
print('there are', len(df.col_a.unique()), 'distinct values in column a')
print('there are', len(df.col_b.unique()), 'distinct values in column b')
print('there are', len(df.col_c.unique()), 'distinct values in column c')
df.col_c.unique()

there are 3 distinct values in column a
there are 4 distinct values in column b
there are 5 distinct values in column c


array([None, 'dog', 'dogs', 'hamster', 'parrot'], dtype=object)

In [6]:
# review: we can filter the dataframe on values we are interested in:
df[df.col_a == 'dog']

    person_id col_a   col_b   col_c another_column
1          11   dog    dogs     dog            cot
3          14   dog    None    dogs            cot
5          14   dog    dogs    dogs            cut
8          18   dog    dogs    None            cat
10         25   dog    dogs  parrot            cot
11         25   dog  parrot     dog            cat


In [7]:
# we can filter on any number of criteria simultaneously:
df[(df.col_a == 'dog') | (df.col_b == 'dogs')] # | means logical or

    person_id col_a   col_b   col_c another_column
1          11   dog    dogs     dog            cot
2          11   cat    dogs    None            cut
3          14   dog    None    dogs            cot
5          14   dog    dogs    dogs            cut
7          17  None    dogs    dogs            cat
8          18   dog    dogs    None            cat
10         25   dog    dogs  parrot            cot
11         25   dog  parrot     dog            cat


In [8]:
# handling cases where there are two ways to code the semantically same value:
df[df.col_c.isin(['dog','dogs'])]

    person_id col_a   col_b col_c another_column
1          11   dog    dogs   dog            cot
3          14   dog    None  dogs            cot
5          14   dog    dogs  dogs            cut
7          17  None    dogs  dogs            cat
9          19   cat  parrot   dog            cat
11         25   dog  parrot   dog            cat
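
the filters above name each column explicitly. since the exercise assumes we do not know the column names in advance, here is a minimal sketch of the same idea applied to a whole list of columns at once: `isin` on a dataframe returns a boolean dataframe, and `any(axis=1)` collapses it to one boolean per row. the dataframe is rebuilt from the values printed earlier so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
column_names = ['col_a', 'col_b', 'col_c']

# True for every row that has 'dog' or 'dogs' in at least one of the listed columns
has_dog = df[column_names].isin(['dog', 'dogs']).any(axis=1)
dog_rows = df[has_dog]
print(dog_rows)

# the rows that never mention a dog in any of the listed columns
print('rows without any dog:', list(df.index[~has_dog]))
```

replacing `any` with `all` would instead keep only the rows where every listed column matches.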


In [9]:
# this is slightly different for columns with None value:
df[df.col_a.isna()]

   person_id col_a    col_b    col_c another_column
0         10  None  hamster     None            cut
6         16  None  hamster  hamster            cut
7         17  None     dogs     dogs            cat
7,17,,dogs,dogs,cat


In [10]:
# which users (if any) have no animal at all
df[(df.col_a.isna()) & (df.col_b.isna()) & (df.col_c.isna()) ]

Empty DataFrame
Columns: [person_id, col_a, col_b, col_c, another_column]
Index: []
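
chaining one `isna()` per column works when we know the columns, but it does not scale to an unknown list of them. a sketch of the same check for any list of columns, using `isna()` on the sub-dataframe followed by `all(axis=1)` (the data is again rebuilt from the values printed earlier):

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
column_names = ['col_a', 'col_b', 'col_c']

# one boolean per row: True only if every listed column is missing in that row
all_missing = df[column_names].isna().all(axis=1)
no_animal = df[all_missing]
print(no_animal)

# how many of the listed columns are missing in each row
missing_per_row = df[column_names].isna().sum(axis=1)
print(missing_per_row)
```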


In [11]:
# review: we can count the occurrences of each value in a series (column)
df.col_a.value_counts(dropna=False) # outputs a series indexed on the col_a values and ordered by counts

dog    6
cat    3
NaN    3
Name: col_a, dtype: int64

In [12]:
# this is almost the same as the groupby below, except that groupby drops the missing values and orders by the index:
df.groupby('col_a').agg({'col_a':'count'}) # outputs a dataframe with col_a values as the index.

       col_a
col_a
cat        3
dog        6
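
the two results differ in how they treat missing values: `value_counts(dropna=False)` keeps a row for them, while `groupby` drops that group by default. a small sketch of the difference, assuming pandas 1.1 or later (where `groupby` accepts a `dropna` argument); `size()` is used here to count the rows in each group:

```python
import pandas as pd

# col_a copied from the dataframe printed earlier
df = pd.DataFrame({
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
})

# a series that includes a row for the missing values
with_missing = df['col_a'].value_counts(dropna=False)

# groupby silently drops the missing group by default ...
grouped_default = df.groupby('col_a').size()

# ... unless we ask it to keep that group
grouped_all = df.groupby('col_a', dropna=False).size()

print(with_missing)
print(grouped_default)
print(grouped_all)
```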


In [13]:
df.col_c.value_counts(dropna=False) # outputs a series indexed on the col_c values and ordered by counts

NaN        4
dogs       3
dog        3
hamster    1
parrot     1
Name: col_c, dtype: int64

In [14]:
# let us collect all the values that occur across all three columns. 
# (do not assume any one column has all of them).
# there are multiple ways to get there:
#  i use a loop + set operations:
all_distinct_values = set() # start with an empty set
# 
for name in column_names:
    values_in_col = df[name].unique()
    all_distinct_values.update(values_in_col)
print(all_distinct_values)

{None, 'dogs', 'dog', 'parrot', 'cat', 'hamster'}


In [15]:
all_distinct_values = sorted([value for value in all_distinct_values if value is not None])
print(all_distinct_values)

['cat', 'dog', 'dogs', 'hamster', 'parrot']


In [16]:
# let us count the number of occurrences of each value across all columns. 
# start by creating a series of zeros, indexed on the distinct values, to collect the counts in
occurance_counts_per_value = pd.Series(0, index=all_distinct_values)
occurance_counts_per_value

cat        0
dog        0
dogs       0
hamster    0
parrot     0
dtype: int64

In [17]:
# then, for each row and each categorical column, add one to the count of the value found there:
for index, row in df.iterrows():
    for name in column_names:
        current_value = row[name]
        # skip missing values (`None`/`NaN`)
        if pd.notna(current_value):
            occurance_counts_per_value[current_value] += 1
print(occurance_counts_per_value)

cat        3
dog        9
dogs       9
hamster    4
parrot     3
dtype: int64
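
`iterrows` is easy to read but slow on large dataframes. a sketch of one loop-free alternative: stack the categorical columns end to end into a single long series with `pd.concat`, then let `value_counts` do the counting (it ignores missing values by default). the data is rebuilt from the values printed earlier.

```python
import pandas as pd

df = pd.DataFrame({
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
column_names = ['col_a', 'col_b', 'col_c']

# one long series holding the values of all the categorical columns
all_values = pd.concat([df[name] for name in column_names], ignore_index=True)

# count each distinct value once per occurrence, then order alphabetically
total_counts = all_values.value_counts().sort_index()
print(total_counts)
```

the index of `total_counts` is also the list of distinct values, so this covers the set-collecting step as well.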


In [18]:
# note that the number of rows in df varies from id to id:
df.groupby(id_column).agg({id_column:'count'})

           person_id
person_id
10                 1
11                 2
14                 3
16                 1
17                 1
18                 1
19                 1
25                 2


In [19]:
# now, for each of the distinct values, count how many ids actually have that value in any of their columns
id_count_per_value = pd.Series(data=[0 for _ in range(len(all_distinct_values))], index=all_distinct_values)

for value in all_distinct_values:
    set_of_ids = set() # start with an empty set of ids (`{}` would create an empty dict instead)
    # do not assume the names of categorical columns
    for column in column_names:
        subset_df = df[df[column] == value]
        set_of_ids.update(subset_df[id_column])
    id_count_per_value[value] = len(set_of_ids)
    
print(id_count_per_value)

cat        3
dog        5
dogs       5
hamster    3
parrot     2
dtype: int64
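
the inner loop over columns can also be replaced by a single boolean filter. a sketch, not the only way to do it: `eq(value).any(axis=1)` marks the rows where the value occurs in any listed column, and `nunique()` then counts the distinct ids (not the rows) among them. the data is rebuilt from the values printed earlier.

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
id_column = 'person_id'
column_names = ['col_a', 'col_b', 'col_c']
all_distinct_values = ['cat', 'dog', 'dogs', 'hamster', 'parrot']

# for each value: select the rows where it occurs in any column,
# then count the distinct ids among those rows
id_count_per_value = pd.Series({
    value: df.loc[df[column_names].eq(value).any(axis=1), id_column].nunique()
    for value in all_distinct_values
})
print(id_count_per_value)
```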


next, we count how many times each value occurs for each id, again using a `for` loop:

In [20]:
# start by initialising the counter we want to populate:
occurance_counts_per_id = pd.Series(data=[{} for _ in range(len(all_distinct_ids))], index=all_distinct_ids)
# for each id, we will populate a dict, where each key is an animal that occurs in any of the columns 
# for that id, and the dict value for each key is that animal's count of occurrences
occurance_counts_per_id

10    {}
11    {}
13    {}
14    {}
15    {}
16    {}
17    {}
18    {}
19    {}
25    {}
dtype: object

In [21]:
for person in all_distinct_ids:
    # only retain the rows relevant to this person
    count_dict = {}
    df_subset = df[df[id_column]==person]
    for column in column_names:
        single_col_count = df_subset[column].value_counts()
        for animal, count in single_col_count.items():
            old_value = count_dict.get(animal,0)
            new_value = old_value + count
            count_dict[animal] = new_value
    occurance_counts_per_id[person] = count_dict
print(occurance_counts_per_id)

10                                   {'hamster': 1}
11                  {'dog': 2, 'cat': 1, 'dogs': 2}
13                                               {}
14    {'dog': 2, 'cat': 1, 'dogs': 3, 'hamster': 1}
15                                               {}
16                                   {'hamster': 2}
17                                      {'dogs': 2}
18                            {'dog': 1, 'dogs': 1}
19                {'cat': 1, 'parrot': 1, 'dog': 1}
25               {'dog': 3, 'dogs': 1, 'parrot': 2}
dtype: object
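
the dict bookkeeping above (`get`, add, store) is what `collections.Counter` from the standard library does for us. a sketch of the same per-id count that iterates over the groups of a `groupby` instead of filtering the dataframe once per id. note one difference: ids that never occur in the dataframe (13 and 15 above) simply do not appear in the result. the data is rebuilt from the values printed earlier.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
id_column = 'person_id'
column_names = ['col_a', 'col_b', 'col_c']

counts_per_id = {}
for person, group in df.groupby(id_column):
    # flatten this person's categorical cells into one sequence of values
    values = group[column_names].to_numpy().ravel()
    # count every non-missing value
    counts_per_id[person] = Counter(v for v in values if pd.notna(v))

for person, counts in counts_per_id.items():
    print(person, dict(counts))
```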


# the `melt` function
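now we turn to a different way of counting the values for each id in the id column. there are many ways to do this. one is to use the `melt` function, which 'melts' dataframe columns into rows: the categorical columns are stacked into a single value column, with a second column recording which source column each value came from. 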

In [22]:
df

   person_id col_a    col_b    col_c another_column
0         10  None  hamster     None            cut
1         11   dog     dogs      dog            cot
2         11   cat     dogs     None            cut
3         14   dog     None     dogs            cot
4         14   cat  hamster     None            cot
5         14   dog     dogs     dogs            cut
6         16  None  hamster  hamster            cut
7         17  None     dogs     dogs            cat
8         18   dog     dogs     None            cat
9         19   cat   parrot      dog            cat


In [23]:
# id_vars stay as they are, value_vars are melted into rows;
# var_name and value_name label the two new columns
melted = df.melt(id_vars=id_column, value_vars=column_names, var_name='source', value_name='animal')
melted

   person_id source animal
0         10  col_a   None
1         11  col_a    dog
2         11  col_a    cat
3         14  col_a    dog
4         14  col_a    cat
5         14  col_a    dog
6         16  col_a   None
7         17  col_a   None
8         18  col_a    dog
9         19  col_a    cat


this melted dataframe is easier to count:

In [28]:
melted.groupby([id_column,'animal'])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x110243c18>

In [24]:
melted.groupby([id_column,'animal']).agg({'animal':'count'})

                   animal
person_id animal
10        hamster       1
11        cat           1
          dog           2
          dogs          2
14        cat           1
          dog           2
          dogs          3
          hamster       1
16        hamster       2
17        dogs          2
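
once the dataframe is melted, both functions from this week's exercise reduce to a `groupby` followed by `nunique`. the sketch below is one possible approach rather than the reference solution; the function names `count_ids_per_value` and `count_distinct_values_per_id` are made up for this illustration, and the data is rebuilt from the values printed earlier.

```python
import pandas as pd

def count_ids_per_value(df, id_col, cat_cols):
    """for each distinct value: the number of distinct ids it occurs with."""
    melted = df.melt(id_vars=id_col, value_vars=cat_cols, value_name='value')
    # drop the missing values so they do not become a group of their own
    return melted.dropna(subset=['value']).groupby('value')[id_col].nunique()

def count_distinct_values_per_id(df, id_col, cat_cols):
    """for each id: the number of distinct values across all the columns."""
    melted = df.melt(id_vars=id_col, value_vars=cat_cols, value_name='value')
    # nunique ignores missing values, so an id with only Nones gets 0
    return melted.groupby(id_col)['value'].nunique()

df = pd.DataFrame({
    'person_id': [10, 11, 11, 14, 14, 14, 16, 17, 18, 19, 25, 25],
    'col_a': [None, 'dog', 'cat', 'dog', 'cat', 'dog', None, None, 'dog', 'cat', 'dog', 'dog'],
    'col_b': ['hamster', 'dogs', 'dogs', None, 'hamster', 'dogs',
              'hamster', 'dogs', 'dogs', 'parrot', 'dogs', 'parrot'],
    'col_c': [None, 'dog', None, 'dogs', None, 'dogs',
              'hamster', 'dogs', None, 'dog', 'parrot', 'dog'],
})
cat_cols = ['col_a', 'col_b', 'col_c']

ids_per_value = count_ids_per_value(df, 'person_id', cat_cols)
values_per_id = count_distinct_values_per_id(df, 'person_id', cat_cols)
print(ids_per_value)
print(values_per_id)

# reshaping back: one row per id, one column per value, cells are occurrence counts
melted = df.melt(id_vars='person_id', value_vars=cat_cols, value_name='value')
wide = melted.groupby(['person_id', 'value']).size().unstack(fill_value=0)
print(wide)
```

the last three lines show the reverse reshaping: `unstack` moves the inner index level (the values) into columns, giving a compact id-by-value table of counts.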


as a reminder, a series can be iterated over as (index, value) pairs with `items()`:

In [25]:
for index, value in occurance_counts_per_value.items():
    print(index, value)

cat 3
dog 9
dogs 9
hamster 4
parrot 3


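def count_id_occurances_per_value(df, id_col, list_of_cat_cols):
    """
    counts, for each distinct value occurring in any of the categorical columns
    in `list_of_cat_cols`, the number of distinct ids from the id column (`id_col`)
    that the value occurs with in the `df` dataframe.
    """
    # one possible completion of the template, following the loops used earlier
    distinct_values = set()
    for acolumn in list_of_cat_cols:
        # get the distinct (non-missing) values occurring in column `acolumn`
        # and update the set of distinct values
        distinct_values.update(df[acolumn].dropna().unique())
    value_series_counts = pd.Series(0, index=sorted(distinct_values))
    for value in value_series_counts.index:
        # collect the ids that have `value` in any of the categorical columns
        ids_with_value = set()
        for acolumn in list_of_cat_cols:
            ids_with_value.update(df.loc[df[acolumn] == value, id_col])
        value_series_counts[value] = len(ids_with_value)

    return value_series_counts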



In [27]:
# `rank` gives each value its position in sorted order; tied values share the average of their ranks
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print(s)
print(s.rank())

a    0.102629
b   -2.659927
c    0.626794
d   -2.659927
e    0.018209
dtype: float64
a    4.0
b    1.5
c    5.0
d    1.5
e    3.0
dtype: float64
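
the introduction also mentions `mask`, `transpose` and `name`. here is a minimal sketch of all four on small made-up data (the numbers and the names `score`, `r1`, `r2`, `x`, `y` are invented for this illustration), including the `method` argument of `rank`, which controls how ties are handled:

```python
import pandas as pd

# fixed values, with a deliberate tie between 'b' and 'd'
s = pd.Series([0.1, -2.7, 0.6, -2.7, 0.0], index=list('abcde'), name='score')

# rank: ties share the average rank by default, or the lowest rank with method='min'
average_ranks = s.rank()
min_ranks = s.rank(method='min')

# mask: replace the entries where the condition is True (by NaN unless told otherwise)
masked = s.mask(s < 0)
clipped = s.mask(s < 0, 0.0)

# name: a series carries a name, which becomes the column name in a dataframe
frame = s.to_frame()
renamed = s.rename('points')

# transpose: swap rows and columns (`.T` is shorthand for `.transpose()`)
small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
flipped = small.T

print(average_ranks)
print(min_ranks)
print(masked)
print(flipped)
```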
