# Data Cleaning with the Pandas Library

### In the vast realm of data science and analysis, before any sophisticated analysis or machine learning models come into play, there's a foundational step that is crucial to the success of any project: data cleaning. Data, in its raw form, often comes with various imperfections—missing values, inconsistencies, or anomalies—that can affect the outcome and reliability of analytics processes. Cleaning such data is imperative to ensure the integrity of downstream operations.

### Enter Pandas—a powerful Python library tailor-made for data analysis. Born from the necessity to handle large datasets with ease and flexibility, Pandas has become the de facto tool for data manipulation in Python. Its data structures provide an intuitive interface to slice and dice data, reshape datasets, and perform aggregations, among other tasks.

### In this project, I was given a dataset of potential clients on a call list that will later be used by our sales team for contacting these individuals. Management would like this dataset to be cleaned so that the sales team can more easily and efficiently make use of the data.

# First Steps - Import the Library and Create a DataFrame

In [1]:
# Import Library
import pandas as pd

In [88]:
# Pandas will read the Excel file and assign it to the variable 'df'
df = pd.read_excel(r"C:\Users\unbou\Desktop\PresentationProjects\DataCleaning\PandasCleaning\Customer Call List.xlsx")

In [89]:
# Printing the dataframe for a quick overview
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


# Initial Cleaning

In [90]:
# The first step in cleaning most datasets is to remove duplicates
# drop_duplicates() returns a new dataframe with the duplicate rows removed
# so we must assign the result back to 'df' for the change to stick
df = df.drop_duplicates()
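
As a small illustration of why the reassignment matters, the sketch below runs `drop_duplicates()` on a made-up three-row frame. The frame `demo` and the name `deduped` are assumptions for the example only and are not part of the call list.

```python
import pandas as pd

# Hypothetical frame for illustration; the last two rows are identical.
demo = pd.DataFrame(
    {
        "CustomerID": [1001, 1002, 1002],
        "First_Name": ["Frodo", "Abed", "Abed"],
    }
)

# drop_duplicates() returns a new DataFrame and leaves `demo` untouched.
# By default it keeps the first occurrence of each duplicated row.
deduped = demo.drop_duplicates()

print(len(demo), len(deduped))
```

The original frame still has all of its rows afterwards, so skipping the assignment would silently discard the cleaned result.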

In [91]:
# Next it would be wise to remove data not useful to our sales team
# We can utilize the .drop() function to select columns in our dataset
# for removal
df = df.drop(columns = "Not_Useful_Column")
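
One practical note for notebooks: running the cell above a second time raises a `KeyError`, because the column no longer exists. A minimal sketch of one way around that, using a made-up frame named `demo`, is to pass `errors="ignore"`:

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
demo = pd.DataFrame(
    {"CustomerID": [1001, 1002], "Not_Useful_Column": [True, False]}
)

# The first call removes the column.
demo = demo.drop(columns="Not_Useful_Column")

# A plain second call would raise KeyError since the column is gone;
# errors="ignore" makes it a no-op so the cell can be re-run safely.
demo = demo.drop(columns="Not_Useful_Column", errors="ignore")

print(list(demo.columns))
```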

# Advanced Cleaning

### Last Name

In [92]:
# Looking at the dataset, it appears the column "Last_Name" has some issues
df["Last_Name"]

0         Baggins
1           Nadir
2          /White
3         Schrute
4            Snow
5         Swanson
6          Winger
7          Holmes
8             NaN
9          Parker
10         Gamgee
11      ...Potter
12         Draper
13          Knope
14    Flenderson_
15        Weasley
16          Scott
17           Kent
18         Braton
19      Skywalker
Name: Last_Name, dtype: object

In [93]:
# We can use the .str.lstrip() and .str.rstrip() methods to remove these stray characters
# An important deviation from the previous workflow is that we assign the
# result back to the column only; assigning it to 'df' would replace the
# entire dataframe with this single column
df["Last_Name"] = df["Last_Name"].str.lstrip("...")
df["Last_Name"] = df["Last_Name"].str.lstrip("/")
df["Last_Name"] = df["Last_Name"].str.rstrip("_")
df["Last_Name"]

0        Baggins
1          Nadir
2          White
3        Schrute
4           Snow
5        Swanson
6         Winger
7         Holmes
8            NaN
9         Parker
10        Gamgee
11        Potter
12        Draper
13         Knope
14    Flenderson
15       Weasley
16         Scott
17          Kent
18        Braton
19     Skywalker
Name: Last_Name, dtype: object
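
It may help to know that the argument to the strip methods is a set of characters rather than a literal prefix or suffix, so `lstrip("...")` behaves the same as `lstrip(".")`. Under that reading, the three calls above could be collapsed into one `.str.strip()` call. The sketch below uses a small made-up Series named `names` holding the problem values from the column.

```python
import pandas as pd

# Problem values taken from the Last_Name column, plus one missing entry.
names = pd.Series(["/White", "...Potter", "Flenderson_", float("nan"), "Baggins"])

# strip() removes any of the listed characters from both ends of each string.
# Missing values are passed through as missing.
cleaned = names.str.strip("./_")

print(cleaned.dropna().tolist())
```

Note that this version also strips those characters from the opposite end of a name, which is slightly broader than the separate `lstrip`/`rstrip` calls.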

### Phone Numbers

In [94]:
# We want each phone number to follow the same format
# To do so we can remove any non-numeric symbols so that each phone number
# is simply a string of digits we can later format
# regex=True is required for the pattern to be treated as a regular expression
# Note: the .str accessor returns NaN for entries that are not strings
df["Phone_Number"] = df["Phone_Number"].str.replace(r"\D", "", regex=True)

In [95]:
# We must first convert each number into a string
df["Phone_Number"] = df["Phone_Number"].apply(lambda x : str(x))

In [96]:
# We can now use a lambda to format each number into the same format
df["Phone_Number"] = df["Phone_Number"].apply(lambda x: x[0:3] + "-" + x[3:6] + "-" + x[6:10])

In [97]:
# Our final step is to blank out the 'nan--' and '--' placeholders that the
# formatting step produced for missing or non-numeric entries
df["Phone_Number"] = df["Phone_Number"].str.replace('nan--', '')
df["Phone_Number"] = df["Phone_Number"].str.replace('--', '')

In [98]:
# The 'Phone_Number' column is now clean!
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes
2,1003,Walter,White,,298 Drugs Driveway,N,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes
6,1007,Jeff,Winger,,1209 South Street,No,No
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No
8,1009,Gandalf,,,123 Middle Earth,Yes,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No
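
The four phone number cells above could also be folded into a single helper. The sketch below is one possible version, not the approach used in this notebook: `format_phone` is a hypothetical name, and it converts each value to a string before stripping non-digits, so a number stored as an integer is kept rather than blanked. It also assumes that anything without exactly 10 digits should become an empty string.

```python
import re

import pandas as pd


def format_phone(value):
    """Return 'XXX-XXX-XXXX' for values with exactly 10 digits, else ''."""
    # str() first, so integers and NaN are handled without errors.
    digits = re.sub(r"\D", "", str(value))
    if len(digits) != 10:
        return ""
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"


# Sample values in the formats seen in the Phone_Number column.
phones = pd.Series(
    ["123-545-5421", "123/643/9775", 7066950392, "876|678|3469", float("nan"), "N/a"]
)

formatted = phones.apply(format_phone)
print(formatted.tolist())
```

Because the digit count is checked up front, there is no need for a later pass that removes `'nan--'` or `'--'`.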


## Address

In [99]:
# Sales would also like the address information split into multiple columns
# to get a better picture of where each client lives at a glance
# We can split each address on its commas with .str.split()
df[["Street", "State", "Area Code"]] = df["Address"].str.split(',',n=2, expand=True)
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,State,Area Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,No,Yes,93 West Main Street,,
2,1003,Walter,White,,298 Drugs Driveway,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,No,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,768 City Parkway,,
6,1007,Jeff,Winger,,1209 South Street,No,No,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,No,98 Clue Drive,,
8,1009,Gandalf,,,123 Middle Earth,Yes,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,25th Main Street,New York,
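
To show what `str.split` with `expand=True` does on its own, here is a minimal sketch on three addresses from the list. The names `addresses` and `parts` are assumptions for the example. It also strips the space that remains after each comma, which the cell above does not do.

```python
import pandas as pd

# Three addresses from the call list with zero, one, and two commas.
addresses = pd.Series(
    [
        "93 West Main Street",
        "123 Shire Lane, Shire",
        "980 Paper Avenue, Pennsylvania, 18503",
    ]
)

# n=2 allows at most two splits, so at most three columns are produced.
# Rows with fewer pieces are padded with missing values.
parts = addresses.str.split(",", n=2, expand=True)

# Remove the whitespace left behind after each comma.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Street", "State", "Area Code"]

print(parts)
```

One caveat: the number of columns returned depends on the data. If no address contained two commas, the split would yield fewer than three columns and the three-name assignment would fail.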


## Paying Customer & Do Not Contact

In [100]:
# For consistency, the 'Paying Customer' and 'Do_Not_Contact' columns should
# be converted so that all values follow the same 'Y' / 'N' format
df["Paying Customer"] = df["Paying Customer"].str.replace("Yes", "Y")
df["Paying Customer"] = df["Paying Customer"].str.replace("No", "N")

df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace("Yes", "Y")
df["Do_Not_Contact"] = df["Do_Not_Contact"].str.replace("No", "N")

df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,State,Area Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Y,N,123 Shire Lane,Shire,
1,1002,Abed,Nadir,123-643-9775,93 West Main Street,N,Y,93 West Main Street,,
2,1003,Walter,White,,298 Drugs Driveway,N,,298 Drugs Driveway,,
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Y,Y,980 Paper Avenue,Pennsylvania,18503.0
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,N,123 Dragons Road,,
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Y,Y,768 City Parkway,,
6,1007,Jeff,Winger,,1209 South Street,N,N,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,N,98 Clue Drive,,
8,1009,Gandalf,,,123 Middle Earth,Y,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Y,N,25th Main Street,New York,
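
`str.replace` substitutes substrings, so it works here only because the columns contain nothing but `Yes`, `No`, `Y`, and `N`. A hedged alternative is `Series.replace` with a dictionary, which matches whole values only. The Series `flags` below is made up for illustration.

```python
import pandas as pd

# Sample values in the formats seen in the two columns, plus a missing entry.
flags = pd.Series(["Yes", "No", "N", "Y", None])

# Series.replace with a dict swaps whole cell values and leaves the rest alone.
normalized = flags.replace({"Yes": "Y", "No": "N"})
print(normalized.dropna().tolist())

# For contrast, str.replace rewrites any matching substring.
partial = pd.Series(["Not sure"]).str.replace("No", "N")
print(partial.tolist())
```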


# Filtering the Data

### The data should be filtered so that the list handed to the sales team contains only clients who can be contacted and who have not asked to be left off the call list.

## Replace Empty Values

In [101]:
# Replaces missing values (NaN) with empty strings; no rows are removed here
df = df.fillna('')
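
To make the distinction concrete, the sketch below compares `fillna('')` with `dropna()` on a made-up two-row frame named `demo`. Only `dropna()` actually removes rows.

```python
import pandas as pd

# Hypothetical frame in which each row has one missing value.
demo = pd.DataFrame(
    {
        "Phone_Number": ["123-545-5421", None],
        "Do_Not_Contact": [None, "N"],
    }
)

# fillna("") keeps every row and swaps missing values for empty strings.
filled = demo.fillna("")

# dropna() discards any row that contains a missing value.
dropped = demo.dropna()

print(len(filled), len(dropped))
```

Filling with empty strings suits this notebook because the later filters compare against `''` and `'Y'` directly.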

## Remove Clients that are DNC

In [102]:
# Create a for loop that identifies clients who do not want to be contacted
# and removes them from the list.
for x in df.index:
    if df.loc[x, "Do_Not_Contact"] == 'Y':
        df.drop(x, inplace = True)
        
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,State,Area Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Y,N,123 Shire Lane,Shire,
2,1003,Walter,White,,298 Drugs Driveway,N,,298 Drugs Driveway,,
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,N,123 Dragons Road,,
6,1007,Jeff,Winger,,1209 South Street,N,N,1209 South Street,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,N,98 Clue Drive,,
8,1009,Gandalf,,,123 Middle Earth,Y,,123 Middle Earth,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Y,N,25th Main Street,New York,
10,1011,Samwise,Gamgee,,"612 Shire Lane, Shire",Y,N,612 Shire Lane,Shire,
11,1012,Harry,Potter,,2394 Hogwarts Avenue,Y,,2394 Hogwarts Avenue,,
12,1013,Don,Draper,123-543-2345,2039 Main Street,Y,N,2039 Main Street,,


In [103]:
# Remove clients that do not have a number listed
for x in df.index:
    if df.loc[x, "Phone_Number"] == '':
        df.drop(x, inplace = True)
        
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,State,Area Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Y,N,123 Shire Lane,Shire,
4,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,N,123 Dragons Road,,
7,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,N,98 Clue Drive,,
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Y,N,25th Main Street,New York,
12,1013,Don,Draper,123-543-2345,2039 Main Street,Y,N,2039 Main Street,,
13,1014,Leslie,Knope,876-678-3469,343 City Parkway,Y,N,343 City Parkway,,
14,1015,Toby,Flenderson,304-762-2467,214 HR Avenue,N,N,214 HR Avenue,,
15,1016,Ron,Weasley,123-545-5421,2395 Hogwarts Avenue,N,N,2395 Hogwarts Avenue,,
16,1017,Michael,Scott,123-643-9775,"121 Paper Avenue, Pennsylvania",Y,N,121 Paper Avenue,Pennsylvania,
19,1020,Anakin,Skywalker,876-678-3469,"910 Tatooine Road, Tatooine",Y,N,910 Tatooine Road,Tatooine,


In [111]:
# Reset the index to get a better idea of the length of the list
# and make the list more visually appealing
df = df.reset_index(drop=True)
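
The two loops and the index reset above could also be written with a boolean mask, which avoids dropping rows one at a time. This is a sketch of that alternative on four rows from the call list, not the approach used in this notebook. The names `demo`, `keep`, and `callable_clients` are assumptions for the example.

```python
import pandas as pd

# Four rows from the call list after cleaning and fillna('').
demo = pd.DataFrame(
    {
        "First_Name": ["Frodo", "Abed", "Jeff", "Walter"],
        "Phone_Number": ["123-545-5421", "123-643-9775", "", ""],
        "Do_Not_Contact": ["N", "Y", "N", ""],
    }
)

# True for rows that are not flagged do-not-contact and that have a number.
keep = (demo["Do_Not_Contact"] != "Y") & (demo["Phone_Number"] != "")

# Select the matching rows and renumber the index from zero.
callable_clients = demo[keep].reset_index(drop=True)

print(callable_clients)
```

Both conditions are evaluated for the whole column at once, so the filter stays a single expression however many rows the list has.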

# Final Product

In [107]:
# Here is a complete list of potential clients for the sales team to contact
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Street,State,Area Code
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Y,N,123 Shire Lane,Shire,
1,1005,Jon,Snow,876-678-3469,123 Dragons Road,Y,N,123 Dragons Road,,
2,1008,Sherlock,Holmes,876-678-3469,98 Clue Drive,N,N,98 Clue Drive,,
3,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Y,N,25th Main Street,New York,
4,1013,Don,Draper,123-543-2345,2039 Main Street,Y,N,2039 Main Street,,
5,1014,Leslie,Knope,876-678-3469,343 City Parkway,Y,N,343 City Parkway,,
6,1015,Toby,Flenderson,304-762-2467,214 HR Avenue,N,N,214 HR Avenue,,
7,1016,Ron,Weasley,123-545-5421,2395 Hogwarts Avenue,N,N,2395 Hogwarts Avenue,,
8,1017,Michael,Scott,123-643-9775,"121 Paper Avenue, Pennsylvania",Y,N,121 Paper Avenue,Pennsylvania,
9,1020,Anakin,Skywalker,876-678-3469,"910 Tatooine Road, Tatooine",Y,N,910 Tatooine Road,Tatooine,
