# Overview - [Preppin' Data Challenge 2023: Week 4 - New Customers](https://preppindata.blogspot.com/2023/01/2023-week-4-new-customers.html)

In this project, we practice cleaning and preparing data for analysis in Python.

**Challenge Level: Easy**

We use the dataset linked in the title, from the **Preppin' Data** blog, and aim to satisfy the following requirements from the challenge instructions:

### Requirements
- Input the data
- We want to stack the tables on top of one another, since they have the same fields in each sheet. We can do this one of 2 ways (help):
    - Drag each table into the canvas and use a union step to stack them on top of one another
    - Use a wildcard union in the input step of one of the tables
- Some of the fields aren't matching up as we'd expect, due to differences in spelling. Merge these fields together
- Make a Joining Date field based on the Joining Day, Table Names and the year 2023
- Now we want to reshape our data so we have a field for each demographic, for each new customer (help)
- Make sure all the data types are correct for each field
- Remove duplicates (help)
    - If a customer appears multiple times take their earliest joining date
- Output the data

# Project Code

## Import Necessary Packages for Project

We import the following package:
- **Pandas:** allows us to create/format/clean our dataset for easy analysis

In [1]:
import pandas as pd

## Load/Combine Month Data into Single DataFrame

In [2]:
#store the file path in a variable so it is easy to replace if necessary
file_path = r"D:\Work\Professional\Side_Projects\Data Cleaning Challenges\PreppinDataChallenge2023_Week4-NewCustomers\New Customers.xlsx"

#convert backslashes to forward slashes before passing the path to .read_excel
formatted_path = file_path.replace("\\", "/")

#import all tabs within the excel file with "sheet_name=None" --> returns a dict of {sheet name: DataFrame}
df = pd.read_excel(formatted_path, sheet_name=None)
df

{'January':          ID  Joining Day    Demographic      Value
 0    490910            3      Ethnicity      White
 1    490910            3  Date of Birth  5/23/1981
 2    490910            3   Account Type      Basic
 3    369221           18      Ethnicity      Black
 4    369221           18  Date of Birth   3/4/2019
 ..      ...          ...            ...        ...
 244  840464            6  Date of Birth  7/18/1968
 245  840464            6   Account Type       Gold
 246  674869           13      Ethnicity      Asian
 247  674869           13  Date of Birth   6/6/1991
 248  674869           13   Account Type   Platinum
 
 [249 rows x 4 columns],
 'February':          ID  Joining Day    Demographic      Value
 0    473692           20      Ethnicity      White
 1    473692           20  Date of Birth  3/23/2012
 2    473692           20   Account Type      Basic
 3    150853            5      Ethnicity      Other
 4    150853            5  Date of Birth   7/9/1991
 ..      ...  

In [3]:
#iterate through month headers to check for consistency between dataset headers
for month in df:
    print(month, "\n", df[month].columns)

January 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
February 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
March 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
April 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
May 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
June 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
July 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
August 
 Index(['ID', 'Joining Day', 'Demographiic', 'Value'], dtype='object')
September 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
October 
 Index(['ID', 'Joining Day', 'Demagraphic', 'Value'], dtype='object')
November 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
December 
 Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
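The visual check above can also be done programmatically. The following is a minimal sketch, assuming a dict of DataFrames shaped like the one returned by `pd.read_excel(..., sheet_name=None)`; the `sheets`, `expected`, and `mismatches` names are illustrative and the toy frames only mimic the headers printed above.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "January": pd.DataFrame(columns=["ID", "Joining Day", "Demographic", "Value"]),
    "August": pd.DataFrame(columns=["ID", "Joining Day", "Demographiic", "Value"]),
    "October": pd.DataFrame(columns=["ID", "Joining Day", "Demagraphic", "Value"]),
}

# Headers every sheet is expected to have
expected = ["ID", "Joining Day", "Demographic", "Value"]

# For each sheet whose headers differ, collect the unexpected column names
mismatches = {
    name: [col for col in frame.columns if col not in expected]
    for name, frame in sheets.items()
    if list(frame.columns) != expected
}

print(mismatches)
```

Only the sheets with misspelled headers end up in `mismatches`, which makes the problem columns easier to spot than scanning twelve printed indexes.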


In [4]:
#change dataset headers so that they are all consistent --> no problems when concatenating datasets

#create list with proper column names
column_names = ["ID", "Joining Day", "Demographic", "Value"]

for month in df:
    #rename columns according to list
    df[month].columns = column_names
    #print col names to verify corrected cols
    print(df[month].columns)

Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
Index(['ID', 'Joining Day', 'Demographic', 'Value'], dtype='object')
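Overwriting `columns` positionally works here because every sheet has the same four fields in the same order. If that were not guaranteed, a name-based rename might be safer. The sketch below assumes the two misspellings seen above are the only ones; `header_fixes`, `sheets`, and `cleaned` are illustrative names and the rows are toy data.

```python
import pandas as pd

# Map each known misspelling to the correct header (assumed to be the full list)
header_fixes = {"Demographiic": "Demographic", "Demagraphic": "Demographic"}

# Toy stand-in for two of the month sheets
sheets = {
    "August": pd.DataFrame(
        {"ID": [1], "Joining Day": [2], "Demographiic": ["Ethnicity"], "Value": ["Other"]}
    ),
    "September": pd.DataFrame(
        {"ID": [2], "Joining Day": [5], "Demographic": ["Ethnicity"], "Value": ["Asian"]}
    ),
}

# rename() only touches columns named in the mapping and leaves the rest alone
cleaned = {name: frame.rename(columns=header_fixes) for name, frame in sheets.items()}

for name, frame in cleaned.items():
    print(name, list(frame.columns))
```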


In [5]:
#create empty dataset with the same column names to seed the concat loop (matching column names keep the stacked data aligned in the same four fields)
total_df = pd.DataFrame(columns = column_names)

for month in df:
    month_df = df[month]
    # record the sheet name in a "Month" column so the source month is preserved after the sheets are stacked
    month_df["Month"] = month
    # concat month datasets together into one master dataset
    total_df = pd.concat([total_df, month_df])
    
total_df

Unnamed: 0,ID,Joining Day,Demographic,Value,Month
0,490910,3,Ethnicity,White,January
1,490910,3,Date of Birth,5/23/1981,January
2,490910,3,Account Type,Basic,January
3,369221,18,Ethnicity,Black,January
4,369221,18,Date of Birth,3/4/2019,January
...,...,...,...,...,...
268,174699,2,Date of Birth,3/13/1989,December
269,174699,2,Account Type,Gold,December
270,514598,28,Ethnicity,Other,December
271,514598,28,Date of Birth,10/10/1971,December
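As an alternative to growing `total_df` inside a loop, the sheets can be stacked with a single `pd.concat` call. This is a minimal sketch on toy data, not the notebook's method; `sheets` and `stacked` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the dict of month sheets
sheets = {
    "January": pd.DataFrame(
        {"ID": [1], "Joining Day": [3], "Demographic": ["Ethnicity"], "Value": ["White"]}
    ),
    "February": pd.DataFrame(
        {"ID": [2], "Joining Day": [20], "Demographic": ["Ethnicity"], "Value": ["Other"]}
    ),
}

# assign() returns a copy with the extra "Month" column, so the inputs are not mutated;
# ignore_index=True renumbers the rows 0..n-1 in the same step
stacked = pd.concat(
    [frame.assign(Month=name) for name, frame in sheets.items()],
    ignore_index=True,
)

print(stacked)
```

With `ignore_index=True` the separate `reset_index` step is not needed, and no empty seed DataFrame is required.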


In [6]:
#reset index so that row indexes run consecutively across the combined DataFrame
total_df.reset_index(drop=True, inplace=True)
total_df

Unnamed: 0,ID,Joining Day,Demographic,Value,Month
0,490910,3,Ethnicity,White,January
1,490910,3,Date of Birth,5/23/1981,January
2,490910,3,Account Type,Basic,January
3,369221,18,Ethnicity,Black,January
4,369221,18,Date of Birth,3/4/2019,January
...,...,...,...,...,...
2965,174699,2,Date of Birth,3/13/1989,December
2966,174699,2,Account Type,Gold,December
2967,514598,28,Ethnicity,Other,December
2968,514598,28,Date of Birth,10/10/1971,December


## Clean and Reformat Data

### Create "Account Type", "Ethnicity", "Date of Birth" columns from "Demographic" column

In [7]:
#turn demographics into their own columns in the dataset

total_df = total_df.pivot(index=["ID", "Joining Day", "Month"], columns="Demographic", values="Value").reset_index()
total_df


Demographic,ID,Joining Day,Month,Account Type,Date of Birth,Ethnicity
0,100185,20,May,Basic,7/29/1952,Asian
1,101515,14,April,Gold,8/11/1974,Black
2,101744,29,August,Basic,1/21/1945,Asian
3,102704,23,January,Basic,3/9/2000,Black
4,103488,28,August,Basic,9/26/1957,Other
...,...,...,...,...,...,...
985,994016,29,March,Platinum,3/29/1955,Other
986,994289,16,June,Gold,5/9/1990,White
987,994611,10,January,Basic,6/19/1994,Black
988,995456,17,February,Basic,3/5/1975,Other


In [8]:
# remove the "Demographic" label left on the column axis by the pivot
total_df = total_df.rename_axis(None, axis=1)
total_df

Unnamed: 0,ID,Joining Day,Month,Account Type,Date of Birth,Ethnicity
0,100185,20,May,Basic,7/29/1952,Asian
1,101515,14,April,Gold,8/11/1974,Black
2,101744,29,August,Basic,1/21/1945,Asian
3,102704,23,January,Basic,3/9/2000,Black
4,103488,28,August,Basic,9/26/1957,Other
...,...,...,...,...,...,...
985,994016,29,March,Platinum,3/29/1955,Other
986,994289,16,June,Gold,5/9/1990,White
987,994611,10,January,Basic,6/19/1994,Black
988,995456,17,February,Basic,3/5/1975,Other
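The reshape in the two cells above can be seen end to end on a small example. This is a sketch on made-up rows (the IDs and values are not from the dataset), chaining `pivot`, `reset_index`, and `rename_axis` in one expression.

```python
import pandas as pd

# Long format: one row per customer per demographic (toy values)
long_df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 2],
        "Joining Day": [3, 3, 3, 18, 18, 18],
        "Month": ["January"] * 6,
        "Demographic": ["Ethnicity", "Date of Birth", "Account Type"] * 2,
        "Value": ["White", "5/23/1981", "Basic", "Black", "3/4/1990", "Gold"],
    }
)

# Wide format: one row per customer, one column per demographic
wide_df = (
    long_df.pivot(index=["ID", "Joining Day", "Month"], columns="Demographic", values="Value")
    .reset_index()              # turn the index levels back into ordinary columns
    .rename_axis(None, axis=1)  # drop the leftover "Demographic" axis label
)

print(wide_df)
```

Note that `pivot` raises a `ValueError` if the same combination of index values and demographic appears more than once, so exact duplicate rows would need to be dropped before this step.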


### Create "Joining Date" column from "Joining Day", "Month" columns

In [9]:
#check dtypes of columns being used to create "Joining Date"
total_df.dtypes

ID                int64
Joining Day       int64
Month            object
Account Type     object
Date of Birth    object
Ethnicity        object
dtype: object

In [10]:
#change type to string so that it can be concatenated with other values
total_df["Joining Day"] = total_df["Joining Day"].astype(str)
#add leading zero padding for single digit days 
total_df["Joining Day"] = total_df["Joining Day"].str.zfill(2)

#combine day, month, year into single column
total_df["Joining Date"] = total_df["Month"] + " " + total_df["Joining Day"] + ", " + "2023" 
total_df["Joining Date"]

0            May 20, 2023
1          April 14, 2023
2         August 29, 2023
3        January 23, 2023
4         August 28, 2023
              ...        
985        March 29, 2023
986         June 16, 2023
987      January 10, 2023
988     February 17, 2023
989    September 19, 2023
Name: Joining Date, Length: 990, dtype: object

In [11]:
#check dtype of "Joining Date" column
total_df.dtypes

ID                int64
Joining Day      object
Month            object
Account Type     object
Date of Birth    object
Ethnicity        object
Joining Date     object
dtype: object

In [12]:
#convert "Joining Date" to datetime
total_df["Joining Date"] = pd.to_datetime(total_df["Joining Date"], format='%B %d, %Y')

#check "Joining Date" column datatype
print(total_df, "\n")
print(total_df["Joining Date"].dtype)

         ID Joining Day      Month Account Type Date of Birth Ethnicity  \
0    100185          20        May        Basic     7/29/1952     Asian   
1    101515          14      April         Gold     8/11/1974     Black   
2    101744          29     August        Basic     1/21/1945     Asian   
3    102704          23    January        Basic      3/9/2000     Black   
4    103488          28     August        Basic     9/26/1957     Other   
..      ...         ...        ...          ...           ...       ...   
985  994016          29      March     Platinum     3/29/1955     Other   
986  994289          16       June         Gold      5/9/1990     White   
987  994611          10    January        Basic     6/19/1994     Black   
988  995456          17   February        Basic      3/5/1975     Other   
989  997703          19  September     Platinum      1/7/1998     Other   

    Joining Date  
0     2023-05-20  
1     2023-04-14  
2     2023-08-29  
3     2023-01-23  
4   

In [13]:
#test "Joining Date" datatype by sorting values
total_df["Joining Date"].sort_values()

871   2023-01-01
837   2023-01-01
847   2023-01-02
725   2023-01-02
393   2023-01-02
         ...    
660   2023-12-28
627   2023-12-28
447   2023-12-28
560   2023-12-30
524   2023-12-30
Name: Joining Date, Length: 990, dtype: datetime64[ns]
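The string-building and parsing steps above can be condensed. The sketch below uses toy rows and assumes English month names; as far as I know the `%d` directive accepts one-digit days, so the zero padding is optional rather than required.

```python
import pandas as pd

# Toy rows with an unpadded single-digit day
df = pd.DataFrame(
    {"Joining Day": [3, 18, 20], "Month": ["January", "January", "February"]}
)

# Build text such as "January 3, 2023" and parse it with an explicit format
date_text = df["Month"] + " " + df["Joining Day"].astype(str) + ", 2023"
df["Joining Date"] = pd.to_datetime(date_text, format="%B %d, %Y")

print(df)
```

Building the text in a temporary variable also leaves `Joining Day` as an integer column, which avoids changing its dtype as a side effect.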

In [14]:
#drop columns that were used to create joining date (Joining Day, Month)
total_df.drop(columns=["Joining Day", "Month"], inplace=True)
total_df

Unnamed: 0,ID,Account Type,Date of Birth,Ethnicity,Joining Date
0,100185,Basic,7/29/1952,Asian,2023-05-20
1,101515,Gold,8/11/1974,Black,2023-04-14
2,101744,Basic,1/21/1945,Asian,2023-08-29
3,102704,Basic,3/9/2000,Black,2023-01-23
4,103488,Basic,9/26/1957,Other,2023-08-28
...,...,...,...,...,...
985,994016,Platinum,3/29/1955,Other,2023-03-29
986,994289,Gold,5/9/1990,White,2023-06-16
987,994611,Basic,6/19/1994,Black,2023-01-10
988,995456,Basic,3/5/1975,Other,2023-02-17


### Format "Date of Birth" column

In [15]:
#check "Date of Birth" dtype
total_df.dtypes

ID                        int64
Account Type             object
Date of Birth            object
Ethnicity                object
Joining Date     datetime64[ns]
dtype: object

In [16]:
#change datatype of date of birth column to datetime for analysis
total_df["Date of Birth"] = pd.to_datetime(total_df["Date of Birth"])

#check "Date of Birth" datatype
total_df.dtypes

ID                        int64
Account Type             object
Date of Birth    datetime64[ns]
Ethnicity                object
Joining Date     datetime64[ns]
dtype: object
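The conversion above lets pandas infer the date format. Since the values shown follow a month/day/year pattern, an explicit format is one way to rule out day/month ambiguity. This is a sketch on toy values; the `errors="coerce"` choice is an assumption about how bad values should be handled, not something the challenge specifies.

```python
import pandas as pd

# Toy values in the same month/day/year style, plus one unparseable entry
dob = pd.Series(["5/23/1981", "3/4/2019", "not a date"])

# Explicit format; unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(dob, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```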

### Reorder Columns

In [17]:
#reorder columns 

#grab column labels and reorder with lists --> reference and reassign with specified list order
cols = total_df.columns.tolist()
cols = cols[:1] + cols[-1:] + cols[1:-1]

total_df = total_df[cols]
total_df

Unnamed: 0,ID,Joining Date,Account Type,Date of Birth,Ethnicity
0,100185,2023-05-20,Basic,1952-07-29,Asian
1,101515,2023-04-14,Gold,1974-08-11,Black
2,101744,2023-08-29,Basic,1945-01-21,Asian
3,102704,2023-01-23,Basic,2000-03-09,Black
4,103488,2023-08-28,Basic,1957-09-26,Other
...,...,...,...,...,...
985,994016,2023-03-29,Platinum,1955-03-29,Other
986,994289,2023-06-16,Gold,1990-05-09,White
987,994611,2023-01-10,Basic,1994-06-19,Black
988,995456,2023-02-17,Basic,1975-03-05,Other
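The requirements also ask that a customer who appears more than once keeps only their earliest joining date. Because the pivot index includes the joining day and month, a repeated ID with different dates would survive as separate rows. One possible way to handle this is sketched below on made-up rows; `customers` and `deduped` are illustrative names.

```python
import pandas as pd

# Toy data: customer 1 appears twice with different joining dates
customers = pd.DataFrame(
    {
        "ID": [1, 2, 1],
        "Joining Date": pd.to_datetime(["2023-05-20", "2023-04-14", "2023-02-03"]),
        "Account Type": ["Basic", "Gold", "Basic"],
    }
)

# Sort so the earliest date comes first, then keep the first row per ID
deduped = (
    customers.sort_values("Joining Date")
    .drop_duplicates(subset="ID", keep="first")
    .sort_values("ID")
    .reset_index(drop=True)
)

print(deduped)
```

Comparing `total_df["ID"].nunique()` with `len(total_df)` is a quick way to check whether any repeated IDs remain.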


## Export Data

In [18]:
#create file name variable to easily set output file name and excel file type
file_name = "PreppinData_2023Week4-NewCustomers_Output.xlsx"

#store the folder path in a variable so it is easy to replace if necessary
folder_path = r"D:\Work\Professional\Side_Projects\Data Cleaning Challenges\PreppinDataChallenge2023_Week4-NewCustomers"

#join the folder path and file name, with forward slashes, for .to_excel
formatted_path = "{parent_path}/{file}".format(parent_path = folder_path, file=file_name).replace("\\", "/")

#output
total_df.to_excel(formatted_path, sheet_name="output")
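The export above also writes the row index as the first column of the sheet. If that column is not wanted, `index=False` leaves it out. The sketch below is one possible arrangement, using `pathlib` to join the folder and file name; `build_output_path` and `export_customers` are hypothetical helper names, and writing `.xlsx` files assumes an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path

import pandas as pd


def build_output_path(folder, file_name):
    """Join a folder and a file name into a single path object."""
    return Path(folder) / file_name


def export_customers(frame, folder, file_name):
    """Write the DataFrame to an Excel sheet named "output" without the row index."""
    path = build_output_path(folder, file_name)
    frame.to_excel(path, sheet_name="output", index=False)
    return path


# Example call (hypothetical folder name):
# export_customers(total_df, "output_folder", "PreppinData_2023Week4-NewCustomers_Output.xlsx")
print(build_output_path("output_folder", "customers.xlsx"))
```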