# 2021: Week 5 - Dealing with Duplication

February 03, 2021

Challenge by: Jenny Martin

Have you ever been working with a dataset in Tableau Desktop and noticed some duplication occurring? Of course, this is something you can fix with some potentially tricky LODs or Table Calc filters, but wouldn't it be nicer for your dataset to be viz ready before heading into Desktop? 

If you attended the Tableau Fringe Festival last year, this concept may feel familiar, as I did a quick demo explaining why I, personally, would prefer to use Prep to solve my duplication issues. You can find the video [here](https://www.youtube.com/watch?v=oNV_S-td4SU&list=PLlwjCfjqIxWlEZskASdjsN61DLOfmCk-Y&index=4) if you like.

## Input
The dataset for this challenge follows the same theme as the Fringe Festival demo. It records which of our Clients are attending our training sessions, along with which Account Managers look after which Clients. However, it also holds historical Account Ownership information, so an attendee can appear once for each Account Manager their Client has had, and that is what causes the duplication. So how can we fix it?

<img src='https://1.bp.blogspot.com/-d0bHWDavGPk/X-HeazG_L2I/AAAAAAAAAp8/OwmFLFlx8zc6TOKgk6vDLSMTEtsys9JtwCLcBGAsYHQ/w640-h173/Joined%2BDataset.png'>

## Requirements
If you're new to deduplicating data, check out [this blog post](https://preppindata.blogspot.com/2020/03/how-to-deduplicate.html) for some helpful thoughts on how to approach this challenge.
- [Input the data](https://drive.google.com/file/d/1syLzqzqIqsHJbJzxV3GVqcPdTOjCA-A1/view?usp=sharing) 
- For each Client, work out who the most recent Account Manager is
- Filter the data so that only the most recent Account Manager remains
    - Be careful not to lose any attendees from the training sessions!
- In some instances, the Client ID has changed along with the Account Manager. Ensure only the most recent Client ID remains
- [Output the data](https://drive.google.com/file/d/1mTsq_0puEnSeypvIB8z2jnvWNpIR4OfH/view?usp=sharing)
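
The requirements above come down to two steps: pick the latest ownership record for each Client, then attach it to every attendee row. Here is a minimal, tool-agnostic sketch in plain Python. The sample rows and the helper names `latest_ownership` and `attach_current_owner` are made up for illustration and are not part of the challenge data.

```python
from datetime import date

# Hypothetical ownership history: (client, client_id, account_manager, from_date)
ownership = [
    ("Acme Ltd", 101, "Manager A", date(2018, 1, 1)),
    ("Acme Ltd", 202, "Manager B", date(2020, 6, 30)),
    ("Bolt Co", 303, "Manager C", date(2019, 5, 1)),
]

# Hypothetical attendees: (training, contact_email, client)
attendees = [
    ("Prep 101", "ann@acme.example", "Acme Ltd"),
    ("Prep 101", "bob@bolt.example", "Bolt Co"),
    ("Prep 201", "ann@acme.example", "Acme Ltd"),
]


def latest_ownership(records):
    """Return {client: record}, keeping the record with the latest from_date."""
    latest = {}
    for record in records:
        client, from_date = record[0], record[3]
        if client not in latest or from_date > latest[client][3]:
            latest[client] = record
    return latest


def attach_current_owner(attendee_rows, latest):
    """Append the current client ID, manager and from_date to every attendee row."""
    return [row + latest[row[2]][1:] for row in attendee_rows]


current = latest_ownership(ownership)
result = attach_current_owner(attendees, current)

# Every attendee survives, and each carries only the newest ownership details.
for row in result:
    print(row)
```

Keying on the Client name rather than the Client ID is deliberate in this sketch: the ID itself can change with the Account Manager, so the name is the stable key. A Client with no ownership record would raise a `KeyError` here, which makes a lost attendee hard to miss.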

## Output

<img src='https://1.bp.blogspot.com/-Kr_f9TYhdhg/X-HgeNwEzPI/AAAAAAAAAqQ/CDMGEy5J64Yml_B_JXOfX4sB95xEtuccQCLcBGAsYHQ/w640-h212/Output%2B2020W5.png'>

- 7 fields
    - Training
    - Contact Email
    - Contact Name
    - Client
    - Client ID
    - Account Manager
    - From Date
- 13,528 rows (13,529 including headers)
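
One way to check a result against this specification is a small helper that compares column names and row count. The sketch below assumes pandas; `check_output` is a hypothetical helper, and the two-row frame is made-up data standing in for the real output.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "Training", "Contact Email", "Contact Name",
    "Client", "Client ID", "Account Manager", "From Date",
]


def check_output(frame, expected_rows):
    """Return a list of problems; an empty list means the frame matches the spec."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    extra = [c for c in frame.columns if c not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    return problems


# Hypothetical two-row frame standing in for the real output.
sample = pd.DataFrame(
    [["Prep 101", "ann@acme.example", "Ann", "Acme Ltd", 202, "Manager B", "2020-06-30"]] * 2,
    columns=EXPECTED_COLUMNS,
)
print(check_output(sample, expected_rows=2))      # []
print(check_output(sample, expected_rows=13528))  # ['expected 13528 rows, found 2']
```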

In [54]:
import pandas as pd

In [55]:
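# Input the data
input_path = 'Joined Dataset.csv'  # avoid shadowing the built-in input()
df = pd.read_csv(input_path)

# Parse 'From Date' as a datetime; the file stores dates as day/month/year
df['From Date'] = pd.to_datetime(df['From Date'], format='%d/%m/%Y')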
print(df.head(5))
print(df.info())



                Training                                 Contact Email  \
0  Prep 101 - 2020-10-01                 abagael.matresse@brauninc.com   
1  Prep 101 - 2020-10-01               abagail.macconnell@lakinllc.com   
2  Prep 101 - 2020-10-01                  abagail.moodey@raynorinc.com   
3  Prep 101 - 2020-10-01                    abby.eager@paucekgroup.com   
4  Prep 101 - 2020-10-01  abelard.mechell@lehner.swiftanddickinson.com   

         Contact Name                       Client  Client ID Account Manager  \
0    Abagael Matresse                    Braun Inc       1200     Xiaoxuan Ma   
1  Abagail MacConnell                    Lakin LLC        924  Lucy Stevenson   
2      Abagail Moodey                   Raynor Inc        444     Nancy Smith   
3          Abby Eager                 Paucek Group        893     Nancy Smith   
4     Abelard Mechell  Lehner, Swift and Dickinson       1323     Xiaoxuan Ma   

   From Date  
0 2019-12-31  
1 2019-01-01  
2 2015-07-01  
3 2018-0
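
The explicit `format` in the cell above matters because day-first strings such as `03/02/2021` are ambiguous without it. The sketch below uses made-up date strings, not values from the challenge file, to show the parse and one way to spot unparseable values.

```python
import pandas as pd

# Made-up day/month/year strings, not taken from the challenge file.
raw = pd.Series(["31/12/2019", "01/07/2015", "03/02/2021"])

# An explicit format removes any month-first guessing.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2019-12-31', '2015-07-01', '2021-02-03']

# errors="coerce" turns unparseable values into NaT instead of raising,
# which makes bad rows easy to count.
dirty = pd.Series(["31/12/2019", "not a date"])
coerced = pd.to_datetime(dirty, format="%d/%m/%Y", errors="coerce")
print(coerced.isna().sum())  # 1
```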

In [56]:
# For each Client, work out who the most recent Account Manager is
# Summarize the client information
account_managers_info = df[['Client', 'Client ID', 'Account Manager', 'From Date']]
account_managers_info = account_managers_info.drop_duplicates()

# Get the most recent date for each client
max_from_date_df = account_managers_info.groupby('Client').agg(max_date=('From Date', 'max')).reset_index()

# Keep only the most recent information for each client
filter_to_current_AM = pd.merge(
    left=account_managers_info,
    right=max_from_date_df,
    left_on=['Client', 'From Date'],
    right_on=['Client', 'max_date'],
    how='inner',
)
filter_to_current_AM.drop(columns='max_date', inplace=True)

print(filter_to_current_AM.head(5))
print(filter_to_current_AM.info())





                        Client  Client ID Account Manager  From Date
0                    Lakin LLC        924  Lucy Stevenson 2019-01-01
1                   Raynor Inc        444     Nancy Smith 2015-07-01
2                 Paucek Group        893     Nancy Smith 2018-09-20
3  Lehner, Swift and Dickinson       1323     Xiaoxuan Ma 2019-12-31
4    Vandervort, Will and Wiza       1137    Louisa James 2018-09-01
<class 'pandas.core.frame.DataFrame'>
Int64Index: 527 entries, 0 to 526
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Client           527 non-null    object        
 1   Client ID        527 non-null    int64         
 2   Account Manager  527 non-null    object        
 3   From Date        527 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 20.6+ KB
None
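
The groupby-and-merge step above could also be written as a sort followed by `drop_duplicates`. This is an alternative sketch rather than the approach used in this notebook, and the small `history` frame is hypothetical data for illustration only.

```python
import pandas as pd

# Hypothetical ownership history; names and IDs are illustrative only.
history = pd.DataFrame({
    "Client": ["Acme Ltd", "Acme Ltd", "Bolt Co"],
    "Client ID": [101, 202, 303],
    "Account Manager": ["Manager A", "Manager B", "Manager C"],
    "From Date": pd.to_datetime(["2018-01-01", "2020-06-30", "2019-05-01"]),
})

# Sort oldest to newest, then keep the last (newest) row for each Client.
latest = (
    history.sort_values("From Date")
    .drop_duplicates(subset="Client", keep="last")
    .sort_values("Client")
    .reset_index(drop=True)
)
print(latest)
```

The two approaches differ when a Client has two records sharing the same latest date: the merge on the maximum date keeps every tied row, while `drop_duplicates` keeps exactly one.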


In [57]:
# Filter the data so that only the most recent Account Manager remains
# Be careful not to lose any attendees from the training sessions!
# In some instances, the Client ID has changed along with the Account Manager. Ensure only the most recent Client ID remains
training_info = df[['Training', 'Contact Email', 'Contact Name', 'Client']]
training_info = training_info.drop_duplicates()

output = pd.merge(left=training_info, right=filter_to_current_AM, on='Client')
print(output.head(5))
print(output.info())


                Training                     Contact Email  \
0  Prep 101 - 2020-10-01     abagael.matresse@brauninc.com   
1  Prep 101 - 2020-10-01        amory.sinnatt@brauninc.com   
2  Prep 101 - 2020-10-01          aurel.arter@brauninc.com   
3  Prep 101 - 2020-10-01  briggs.sleightholme@brauninc.com   
4  Prep 101 - 2020-10-01          hodge.letch@brauninc.com   

          Contact Name     Client  Client ID Account Manager  From Date  
0     Abagael Matresse  Braun Inc       2460     Oscar Adams 2020-06-30  
1        Amory Sinnatt  Braun Inc       2460     Oscar Adams 2020-06-30  
2          Aurel Arter  Braun Inc       2460     Oscar Adams 2020-06-30  
3  Briggs Sleightholme  Braun Inc       2460     Oscar Adams 2020-06-30  
4          Hodge Letch  Braun Inc       2460     Oscar Adams 2020-06-30  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13528 entries, 0 to 13527
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           