Author: Charles Pugh
Google-certified Data Analyst
Email: charlespughtech@gmail.com
LinkedIn: https://www.linkedin.com/in/charlespughtech/
Date: June 08, 2025
- Project Overview
- Dataset
- Requirements
- Project Structure
- Data Cleaning
- Usage
- Important Note on File Paths
- Contact
This project involves cleaning a Customer Call List dataset using Pandas to prepare it for a contact center. The dataset, sourced from a CSV file, requires several cleaning steps to address issues such as duplicates, inconsistent formatting, missing values, and irrelevant data. The final cleaned dataset is saved as an Excel file for further use.
Raw Dataset URL: https://github.com/AlexTheAnalyst/PandasYouTubeSeries/blob/main/Customer%20Call%20List.xlsx
- Python 3.8 or higher
- Jupyter Notebook
- Pandas library
data_cleaning_pandas_python/
├── README.md
├── data_cleaning_pandas.ipynb
└── data_cleaning_pandas.html
The data cleaning process involves the following steps:
Duplicates in the dataset are identified and removed to ensure each record is unique.
df = df.drop_duplicates()
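As a minimal sketch of this step, the example below runs drop_duplicates() on a small made-up DataFrame. The column names and rows are illustrative assumptions, not taken from the real dataset. By default only rows that match in every column are treated as duplicates, and the first occurrence is kept.

```python
import pandas as pd

# Illustrative rows only (hypothetical columns); rows 1 and 2 are exact duplicates.
df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1002],
    "First_Name": ["Ann", "Bob", "Bob"],
})

# Keep the first occurrence of each fully identical row.
deduped = df.drop_duplicates()
print(len(df), "rows before,", len(deduped), "rows after")
```

Note that the surviving rows keep their original index labels, which is one reason the index is reset at the end of the cleaning process.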
The column 'Not_Useful_Column' is removed as it does not provide relevant information for the contact center.
df = df.drop(columns='Not_Useful_Column')
Stray leading and trailing characters ('.', '/', '_') are stripped from the 'Last_Name' column to standardize the data.
df['Last_Name'] = df['Last_Name'].str.strip('./_')
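The short sketch below, using made-up names rather than rows from the dataset, shows what str.strip('./_') does and does not do: it removes any of the listed characters from the start and end of each value, but leaves characters in the middle untouched.

```python
import pandas as pd

# Made-up sample values for illustration.
last_names = pd.Series(["/White", "Schrute.", "...Potter", "O_Neil_"])

# Strip '.', '/', and '_' from both ends of every value.
stripped = last_names.str.strip("./_")
print(stripped.tolist())
```

The underscore inside "O_Neil" survives because only the ends of each string are trimmed.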
Phone numbers are cleaned by removing non-alphanumeric characters and then reformatted to a standard format (e.g., 123-456-7890). Leftover placeholder values such as 'nan--' and 'Na--' are removed afterwards.
df['Phone_Number'] = df['Phone_Number'].str.replace('[^a-zA-Z0-9]', '', regex=True)
df['Phone_Number'] = df['Phone_Number'].apply(lambda x: str(x))
df['Phone_Number'] = df['Phone_Number'].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
df['Phone_Number'] = df['Phone_Number'].str.replace('nan--', '').str.replace('Na--', '')
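An alternative way to express the same idea is a single helper function. This is a sketch under stated assumptions, not the notebook's method: format_phone is a hypothetical name, the sample values are made up, and the rule of returning an empty string for anything that does not contain exactly 10 digits is an assumption that may be stricter than the steps above.

```python
import re

import pandas as pd


def format_phone(value):
    """Return 'XXX-XXX-XXXX' if value holds exactly 10 digits, else ''."""
    if pd.isna(value):
        return ""
    # Keep digits only, whatever separators were used.
    digits = re.sub(r"\D", "", str(value))
    if len(digits) != 10:
        return ""
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"


# Made-up sample values covering strings, an integer, and missing data.
phones = pd.Series(["123-545-5421", "123/643/9775", 7066950392, "N/a", None])
formatted = phones.apply(format_phone)
print(formatted.tolist())
```

Handling missing and malformed values inside the function avoids the follow-up cleanup of strings like 'nan--'.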
The 'Address' column is split into three separate columns: 'Street_Address', 'State', and 'Zip_Code'.
df[['Street_Address', 'State', 'Zip_Code']] = df['Address'].str.split(',', n=2, expand=True)
df = df.drop(columns='Address')
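The sketch below shows how the split behaves on made-up addresses with one, two, and three comma-separated parts. The whitespace trimming is an extra step added here for illustration and is not part of the code above.

```python
import pandas as pd

# Made-up addresses for illustration.
df = pd.DataFrame({"Address": [
    "123 Shire Lane, Shire",
    "93 West Main Street",
    "980 Paper Avenue, Pennsylvania, 18503",
]})

# Split on the first two commas; shorter addresses are padded with missing values.
parts = df["Address"].str.split(",", n=2, expand=True)

# Assumed extra step: trim the space left behind after each comma.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Street_Address", "State", "Zip_Code"]

df = pd.concat([df.drop(columns="Address"), parts], axis=1)
print(df)
```

With expand=True the number of resulting columns depends on the data, so the three-column layout relies on at least one address containing all three parts.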
Values in the 'Paying_Customer' and 'Do_Not_Contact' columns are standardized to 'Y' for Yes and 'N' for No.
df['Paying_Customer'] = df['Paying_Customer'].str.replace('Yes', 'Y').str.replace('No', 'N')
df['Do_Not_Contact'] = df['Do_Not_Contact'].str.replace('Yes', 'Y').str.replace('No', 'N')
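Because str.replace matches substrings, an exact-match mapping can be a slightly safer way to standardize these flags. The following is a sketch of that alternative using Series.replace with a dictionary; the sample values are made up, and this is not the approach used in the notebook.

```python
import pandas as pd

# Exact-match mapping: only whole values equal to 'Yes' or 'No' change.
yes_no = {"Yes": "Y", "No": "N"}

# Made-up sample values for illustration.
flags = pd.Series(["Yes", "No", "Y", "N", "N/a", None])
standardized = flags.replace(yes_no)
print(standardized.tolist())
```

Values that are already 'Y' or 'N', placeholders such as 'N/a', and missing entries pass through unchanged and can be handled in the missing-value step.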
Missing values (NaN, N/a) are replaced with empty strings to clean the dataset.
df = df.replace({'NaN': '', 'N/a': ''}).fillna('')
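To make the two kinds of "missing" explicit, the sketch below applies the same line to a made-up DataFrame that contains both placeholder strings ('NaN', 'N/a') and real missing values (float NaN and None). The column contents are illustrative assumptions.

```python
import pandas as pd

# Made-up data mixing placeholder strings with genuinely missing values.
df = pd.DataFrame({
    "State": ["Texas", "N/a", float("nan")],
    "Zip_Code": ["75001", None, "NaN"],
})

# replace() handles the literal strings; fillna() handles real missing values.
cleaned = df.replace({"NaN": "", "N/a": ""}).fillna("")
print(cleaned)
```

Both calls are needed: replace() alone would leave real NaN values in place, and fillna() alone would leave the text 'N/a' untouched.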
Rows where 'Do_Not_Contact' is 'Y' or 'Phone_Number' is empty are removed, as these records are not useful for the contact center.
for x in df.index:
    if df.loc[x, 'Do_Not_Contact'] == 'Y':
        df.drop(x, inplace=True)
for x in df.index:
    if df.loc[x, 'Phone_Number'] == '':
        df.drop(x, inplace=True)
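The two loops above can also be written as a single boolean mask, which is the more common pandas idiom. This is a sketch on made-up rows, assuming missing values have already been replaced with empty strings as in the previous step.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Phone_Number": ["123-545-5421", "", "706-695-0392", "304-762-2467"],
    "Do_Not_Contact": ["N", "N", "Y", ""],
})

# Keep rows that may be contacted and that have a phone number.
keep = (df["Do_Not_Contact"] != "Y") & (df["Phone_Number"] != "")
contactable = df[keep]
print(contactable)
```

The filtered frame keeps its original index labels (0 and 3 here), so the gaps remain until the index is reset.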
The index is reset to provide a clean, sequential index for the final dataset.
df = df.reset_index(drop=True)
To replicate this project:
- Clone the repository:
  git clone https://github.com/charlespughtech/data_cleaning_pandas_python.git
  cd data_cleaning_pandas_python
- Install the required libraries (df.to_excel() needs an Excel engine such as openpyxl to write .xlsx files):
  pip install pandas openpyxl
- Open the Jupyter Notebook:
  jupyter notebook data_cleaning_pandas.ipynb
- Update the file paths in the notebook to match the location of the files on your computer (see Important Note on File Paths).
- Run the cells sequentially to perform the data cleaning steps.
- The cleaned data will be saved in the specified location on your computer.
The file paths used in the code are specific to the author's computer. You will need to modify these file locations to match the file locations on your own PC for the following parts of the code:
- Reading the input file with pd.read_csv(): Replace the file path in the code with the actual location of the CSV file on your PC.
  - For example, if your input CSV file is located at D:\data\project\Customer Call List.csv, update the line to:
    df = pd.read_csv(r"D:\data\project\Customer Call List.csv")
- Saving the output file with df.to_excel(): Replace the file path with the desired location and filename for the cleaned Excel file on your PC.
  - For example, to save the output to D:\data\project\cleaned\customer_call_list_clean.xlsx, use:
    df.to_excel(r"D:\data\project\cleaned\customer_call_list_clean.xlsx", index=False)
Make sure the file names and extensions in these paths match your own files.
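One way to keep these paths in a single place is to define them once with pathlib. The sketch below is an assumption-laden example, not the notebook's code: the "data" folder, the constant names, and the load_call_list helper are all hypothetical. Since the raw dataset link above ends in .xlsx while the notebook reads a CSV, the helper picks the reader from the file extension.

```python
from pathlib import Path

import pandas as pd

# Hypothetical locations; change these to match your own machine.
DATA_DIR = Path("data")
INPUT_FILE = DATA_DIR / "Customer Call List.csv"
OUTPUT_FILE = DATA_DIR / "cleaned" / "customer_call_list_clean.xlsx"


def load_call_list(path):
    """Read the call list with the pandas reader that matches the extension."""
    if path.suffix.lower() in {".xlsx", ".xls"}:
        return pd.read_excel(path)  # .xlsx files need an engine such as openpyxl
    return pd.read_csv(path)


# Only run when the input file is actually present.
if INPUT_FILE.exists():
    df = load_call_list(INPUT_FILE)
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    df.to_excel(OUTPUT_FILE, index=False)
```

Path objects build the correct separators for the operating system, so the same code works with Windows paths like the ones shown above and with paths on other systems.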
For any inquiries or further information, please contact:
Charles Pugh
Google-certified Data Analyst
Email: charlespughtech@gmail.com
LinkedIn: https://www.linkedin.com/in/charlespughtech/
This README.md was generated on June 08, 2025, at 10:41 PM BST.