# Weekly Challenge 12

*Original URL* https://community.alteryx.com/t5/Weekly-Challenge/Challenge-12-Creating-an-HR-Hierarchy/td-p/36740 and [**My Alteryx Approach**](https://github.com/dsmdavid/Alteryx-Weekly-Challenge/tree/master/submitted/sub_Challenge%2312)

## Brief

### Creating an HR Hierarchy:

For this challenge, let’s look at creating a multi-level hierarchy from employee-manager data. As always, there are several ways to solve it. I have designated it as an advanced challenge because there is an elegant solution using iterative macros. The advantage of the iterative macro solution is that it is dynamic: hard-coded solutions would get you to the answer with this data, but if the depth of the hierarchy changed, you would have to modify the workflow to support the change. It is a great example of how iterative macros can make a workflow dynamic.
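The iterative idea can also be sketched outside Alteryx. The following is a minimal pandas sketch, not the Alteryx macro itself: each pass of the loop climbs one level by merging the current pairs against the employee-manager lookup, and the loop stops when no pair can climb further, so nothing depends on the depth of the hierarchy. The names `employees`, `build_levels`, `manager_id` and `level` are chosen for illustration, and the sketch assumes the reporting lines contain no cycles (otherwise the loop would not terminate).

```python
import pandas as pd

# Same five rows as the challenge input; the CEO has no manager (NaN).
employees = pd.DataFrame({
    "employee": ["Analyst", "Manager", "Director", "Vice President", "CEO"],
    "id": [3, 2, 1, 4, 5],
    "man_id": [2, 1, 4, 5, None],
})

def build_levels(employees):
    """Return one row per (employee id, manager id, level) pair."""
    # Lookup of direct reports only; rows without a manager are dropped.
    parent = employees[["id", "man_id"]].dropna().astype(int)
    # Level 1: each employee paired with their direct manager.
    current = parent.rename(columns={"man_id": "manager_id"}).assign(level=1)
    collected = []
    while not current.empty:
        collected.append(current)
        # Climb one level: find the manager of the current manager.
        step = current.merge(parent, left_on="manager_id", right_on="id",
                             suffixes=("", "_up"))
        current = (step[["id", "man_id", "level"]]
                   .rename(columns={"man_id": "manager_id"})
                   .assign(level=lambda d: d["level"] + 1))
    return pd.concat(collected, ignore_index=True)

levels = build_levels(employees)
```

Adding a level to the organization only adds one more pass through the loop; the code itself stays the same.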

#### The use case:

An HR department wants to use Alteryx to quickly understand the reporting structure for employees across their organization.

The input source contains 5 employees, each with an identifier that uniquely identifies the individual and the identifier of the manager they report to.
The goal is to create a hierarchy field identifying each relationship between employee and manager(s). For example, a Director reports directly to the Vice President, who is 1 level up. The Director is then 2 levels away from the CEO (in this data set). As a result, the hierarchy identifier represents how many levels removed the employee is from each member of the management team they report into.
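To make the Director example concrete, here is a small plain-Python sketch that counts levels by walking up the reporting chain. The dictionaries `manager_of` and `names` are typed in by hand from the sample data, and `levels_above` is a hypothetical helper name, not part of the challenge.

```python
# Manager lookup taken from the sample data: employee id -> manager id.
manager_of = {3: 2, 2: 1, 1: 4, 4: 5}   # the CEO (id 5) has no entry
names = {3: "Analyst", 2: "Manager", 1: "Director", 4: "Vice President", 5: "CEO"}

def levels_above(emp_id):
    """Map each manager above emp_id to how many levels away they are."""
    result = {}
    level = 0
    current = emp_id
    # Stop once we reach someone without a manager.
    while current in manager_of:
        current = manager_of[current]
        level += 1
        result[names[current]] = level
    return result

director_levels = levels_above(1)
print(director_levels)  # {'Vice President': 1, 'CEO': 2}
```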

In [1]:
import pandas as pd

## Approach I want to follow:
1. Read the data.
1. For each employee, follow `man_id` up the chain until no manager is left.
1. Return the chain of managers for each employee.
1. Summarize the results.

In [2]:
# Load the data
df = pd.read_csv("./12_files/input.csv", encoding="latin")     
df.head()

Unnamed: 0,employee,id,man_id
0,Analyst,3,2.0
1,Manager,2,1.0
2,Director,1,4.0
3,Vice President,4,5.0
4,CEO,5,


In [5]:
df.set_index('id', inplace=True)

In [8]:
# The CEO has no manager: use 0 as a "no manager" marker and store ids as integers
df['man_id'] = df['man_id'].fillna(0).astype(int)
df

Unnamed: 0_level_0,employee,man_id
id,Unnamed: 1_level_1,Unnamed: 2_level_1
3,Analyst,2
2,Manager,1
1,Director,4
4,Vice President,5
5,CEO,0
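Filling the missing manager with 0 works here only because 0 is not a real id. As an alternative sketch, pandas' nullable integer dtype keeps the ids as integers while leaving the missing value missing. The frame `people` below is a hand-typed two-row excerpt of the sample data, used only for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "employee": ["Vice President", "CEO"],
    "id": [4, 5],
    "man_id": [5.0, None],   # a missing value forces float64 on load
})

# "Int64" (capital I) is pandas' nullable integer dtype: ids stay integers
# and the missing manager stays missing (<NA>) instead of becoming 0.
people["man_id"] = people["man_id"].astype("Int64")

# True where the employee has a manager, False for the top of the hierarchy.
has_manager = people["man_id"].notna()
```

With this dtype, a loop that walks up the chain would test for a missing value with `pd.isna` rather than compare against 0.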


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
employee    5 non-null object
id          5 non-null int64
man_id      4 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes


In [93]:
hierarchy = {}

In [21]:
hierarchy.get('2') is not None

False

In [92]:
def findHierarchy(someone_id):
    # Walk up the manager chain for one employee id.
    # 'previous' collects the ids visited; 'next' is the id still to visit (0 = no manager).
    entry = {'previous': [someone_id], 'next': df.loc[someone_id, 'man_id']}
    hierarchy[someone_id] = entry
    while entry['next'] != 0:
        exists = hierarchy.get(entry['next'])
        if exists is not None:
            # Reuse a chain that was already computed for this manager.
            entry['previous'].extend(exists['previous'])
            entry['next'] = 0
            break

        entry['previous'].append(entry['next'])
        entry['next'] = df.loc[entry['next'], 'man_id']

In [28]:
findHierarchy(3)

In [91]:
df[df['employee'] == df.loc[3,'employee']]['employee'].values[0]

'Analyst'

In [94]:
for member in df.index:
    findHierarchy(member)

In [33]:
df_h = pd.DataFrame.from_dict(hierarchy, orient='index')

In [36]:
# 'previous' holds lists, not strings, so expand each list into one column per level
df_h2 = pd.DataFrame(df_h['previous'].tolist(), index=df_h.index)

In [56]:
list_df = []
for key in hierarchy.keys():
    print(key)
    list_df.append(pd.DataFrame(hierarchy[key]['previous'], columns=[key]))

3
2
1
4
5


In [52]:
pd.DataFrame(data= hierarchy[3]['previous'], columns=[3])

Unnamed: 0,3
0,3
1,2
2,1
3,4
4,5


In [57]:
list_df

[   3
 0  3
 1  2
 2  1
 3  4
 4  5,    2
 0  2
 1  1
 2  4
 3  5,    1
 0  1
 1  4
 2  5,    4
 0  4
 1  5,    5
 0  5]

In [59]:
pd.concat(list_df, axis=1).unstack()

3  0    3.0
   1    2.0
   2    1.0
   3    4.0
   4    5.0
2  0    2.0
   1    1.0
   2    4.0
   3    5.0
   4    NaN
1  0    1.0
   1    4.0
   2    5.0
   3    NaN
   4    NaN
4  0    4.0
   1    5.0
   2    NaN
   3    NaN
   4    NaN
5  0    5.0
   1    NaN
   2    NaN
   3    NaN
   4    NaN
dtype: float64
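One way to turn these chains into the summary the brief asks for is to emit one row per employee-manager pair, with the position in the chain as the level. This is a sketch under assumptions: `chains` and `names` are typed in by hand from the outputs shown above, and the column names `employee`, `manager` and `level` are chosen for illustration.

```python
import pandas as pd

# Chains as printed above: each id followed by its managers, bottom to top.
chains = {3: [3, 2, 1, 4, 5], 2: [2, 1, 4, 5], 1: [1, 4, 5], 4: [4, 5], 5: [5]}
names = {3: "Analyst", 2: "Manager", 1: "Director", 4: "Vice President", 5: "CEO"}

rows = [
    {"employee": names[emp], "manager": names[mgr], "level": level}
    for emp, chain in chains.items()
    # Position 0 is the employee; position n is the manager n levels up.
    for level, mgr in enumerate(chain[1:], start=1)
]
summary = pd.DataFrame(rows)
```

The CEO has no manager and therefore contributes no rows, which avoids the NaN padding produced by the `concat` and `unstack` route.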

In [6]:
# What are the terms to search / buckets?
df_buckets = pd.read_csv("./11_files/buckets.csv", encoding="latin" )
df_buckets.head()

Unnamed: 0,Search,Bucket
0,beep,Tones
1,screen,Screen
2,trigger,Trigger


In [7]:
# Better to sort them once here:
df_bucketsOrdered = df_buckets.sort_values(by = 'Bucket')
df_bucketsOrdered.set_index('Search', inplace=True)
df_bucketsOrdered

Unnamed: 0_level_0,Bucket
Search,Unnamed: 1_level_1
screen,Screen
beep,Tones
trigger,Trigger


In [8]:
def find_buckets(field_6):
    '''
    Receives a text field and finds the "search" terms in the buckets list provided.
    Returns a string with all the buckets found displayed in alphabetical order
    '''
    matched = []
    try:
        text_to_find = field_6.lower()
    except AttributeError:
        # Not a string (e.g. NaN): report it and return no buckets
        print("Couldn't convert to lowercase:", field_6)
        return ""
    for i in df_bucketsOrdered.index:
        if text_to_find.find(i) != -1:
            matched.append(df_bucketsOrdered.loc[i,'Bucket'])
        
    return ",".join(matched)
    

In [9]:
df['Bucket'] = df['Field_6'].apply(find_buckets)

In [10]:
#Values present in the bucket:
df['Bucket'].unique()

array(['', 'Tones', 'Trigger', 'Screen', 'Screen,Tones', 'Tones,Trigger',
       'Screen,Trigger', 'Screen,Tones,Trigger'], dtype=object)

In [11]:
#Summarize
df.groupby(by='Bucket').count().sort_values(by='Field_6', ascending=False)

Unnamed: 0_level_0,Field_6
Bucket,Unnamed: 1_level_1
,175251
Screen,2250
Trigger,1086
Tones,213
"Screen,Tones",64
"Screen,Trigger",14
"Tones,Trigger",9
"Screen,Tones,Trigger",2


## Condensed approach:

In [12]:
import time
t1 = time.time()
import pandas as pd

#Input data
df = pd.read_csv("./11_files/input.csv", encoding="latin")
df.fillna(value = "", inplace=True)
df_buckets = pd.read_csv("./11_files/buckets.csv", encoding="latin" )
df_buckets.sort_values(by = 'Bucket', inplace=True)
df_buckets.set_index('Search', inplace=True)

#Create function
def find_buckets(field_6):
    '''
    Receives a text field and finds the "search" terms in the buckets list provided.
    Returns a string with all the buckets found displayed in alphabetical order
    '''
    matched = []
    try:
        text_to_find = field_6.lower()
    except AttributeError:
        # Not a string (e.g. a number): report it and return no buckets
        print("Couldn't convert to lowercase:", field_6)
        return ""
    # df_buckets is already sorted by 'Bucket' and indexed by 'Search'
    for i in df_buckets.index:
        if text_to_find.find(i) != -1:
            matched.append(df_buckets.loc[i,'Bucket'])

    return ",".join(matched)

#Assign Buckets
df['Bucket'] = df['Field_6'].apply(find_buckets)

#Summarize
print(df.groupby(by='Bucket').count().sort_values(by='Field_6', ascending=False))
t2 = time.time()
t2-t1

                      Field_6
Bucket                       
                       175251
Screen                   2250
Trigger                  1086
Tones                     213
Screen,Tones               64
Screen,Trigger             14
Tones,Trigger               9
Screen,Tones,Trigger        2


8.651309967041016