# Python Phone Number and Address Data Cleaning

This notebook performs basic data cleaning in Python using the pandas and NumPy libraries. It focuses on one way to clean up and standardize phone numbers, addresses, and dates. I am using a fictional company sales data set, already summarized by customer.

The data is imported, manipulated, and displayed using the DataFrame data structure from the pandas library.

My Python IDE of choice is Jupyter Notebook because I love how easy it is to have text and markup, images, Python code and visualizations all in one document.

In [234]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np

In [235]:
df = pd.read_csv(r'C:\Users\user\Downloads\customers.csv')
df.head()

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,+1-800-040-3135x6208,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,221.945.4191x8872,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,388-142-4883x5370,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,625.626.9133x374,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,1783534757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True


In [236]:
df = df.drop_duplicates()

## Clean Up and Standardize Phone Numbers

In [237]:
# Start by removing the extensions from phone numbers that have them. In this data set, that means removing the "x" and everything after it.
df["phone"] = df["phone"].str.split('x').str[0]
df.head()

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,+1-800-040-3135,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,221.945.4191,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,388-142-4883,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,625.626.9133,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,1783534757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True
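
To see what the split is doing in isolation, here is a minimal sketch on a small Series. The sample values are copied from the rows shown above; the names `phones` and `without_ext` are only for illustration and are not part of the notebook.

```python
import pandas as pd

# Sample values taken from the first rows of the data set shown above.
phones = pd.Series(["+1-800-040-3135x6208", "221.945.4191x8872", "1783534757"])

# Split on the literal "x" and keep the first piece.
# A value without an "x" splits into a one-item list, so it comes back unchanged.
without_ext = phones.str.split("x").str[0]

print(without_ext.tolist())
# ['+1-800-040-3135', '221.945.4191', '1783534757']
```

This only works because "x" never appears anywhere else in these phone numbers; a data set that writes extensions as "ext." would need a different separator.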


In [238]:
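# Remove US country codes. Raw strings plus regex=True make it explicit that these are regex patterns
# (pandas 2.0 and later default to regex=False, which would treat them as literal text).
df["phone"] = df["phone"].str.replace(r'\+1-', '', regex=True)
df["phone"] = df["phone"].str.replace(r'001-', '', regex=True)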
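
One thing to watch with the country code step is that an unanchored pattern can also match in the middle of a number. The sketch below is a possible variation, not what the notebook runs: it combines both prefixes into one pattern anchored to the start of the string. The last two sample values are made up to show the difference.

```python
import pandas as pd

# The first value comes from the data set; the others are hypothetical.
phones = pd.Series(["+1-800-040-3135", "001-208-220-1519", "555-001-2345"])

# ^ anchors the match to the start, so only a leading "+1-" or "001-" is removed.
anchored = phones.str.replace(r"^(?:\+1|001)-", "", regex=True)

# Without the anchor, "001-" is also removed from the middle of the last value.
unanchored = phones.str.replace(r"001-", "", regex=True)

print(anchored.tolist())    # ['800-040-3135', '208-220-1519', '555-001-2345']
print(unanchored.tolist())  # ['+1-800-040-3135', '208-220-1519', '555-2345']
```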

In [239]:
df.head(20)

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,800-040-3135,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,221.945.4191,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,388-142-4883,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,625.626.9133,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,1783534757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True
5,David,Davis,daviddavis980@slingacademy.com,067.435.8553,"021 Katherine Mall\nJameston, DC 24685",male,59,1/29/2022,9,970.96,Doctor,Playing sports,False
6,Jason,Montgomery,jasonmontgomery889@slingacademy.com,208-220-1519,"14657 Scott Loop Apt. 735\nPort Ashley, NH 34470",male,58,6/28/2021,12,676.2,Waitress,Collecting,False
7,Kent,Weaver,kentweaver695@slingacademy.com,(000)101-6979,"6644 Mitchell Burg\nVictorhaven, KS 66356",male,61,1/24/2023,1,674.37,Clerk,Hiking,False
8,Jacqueline,Wang,jacquelinewang322@slingacademy.com,691-557-6502,"16963 Stewart Curve Suite 279\nSouth Cameron, ...",female,22,9/23/2022,12,962.47,Housewife,Collecting,False
9,Jodi,Gonzalez,jodigonzalez185@slingacademy.com,137.205.7066,"378 Johnson Oval\nSouth Stacie, RI 76332",female,69,2/7/2021,12,68.67,Baker,Photography,False


In [240]:
# Remove everything else that is not a letter or number (e.g. +, -, (, ), and .)
df["phone"] = df["phone"].str.replace(r'[^a-zA-Z0-9]', '', regex=True)
df.head(20)

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,8000403135,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,2219454191,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,3881424883,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,6256269133,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,1783534757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True
5,David,Davis,daviddavis980@slingacademy.com,0674358553,"021 Katherine Mall\nJameston, DC 24685",male,59,1/29/2022,9,970.96,Doctor,Playing sports,False
6,Jason,Montgomery,jasonmontgomery889@slingacademy.com,2082201519,"14657 Scott Loop Apt. 735\nPort Ashley, NH 34470",male,58,6/28/2021,12,676.2,Waitress,Collecting,False
7,Kent,Weaver,kentweaver695@slingacademy.com,0001016979,"6644 Mitchell Burg\nVictorhaven, KS 66356",male,61,1/24/2023,1,674.37,Clerk,Hiking,False
8,Jacqueline,Wang,jacquelinewang322@slingacademy.com,6915576502,"16963 Stewart Curve Suite 279\nSouth Cameron, ...",female,22,9/23/2022,12,962.47,Housewife,Collecting,False
9,Jodi,Gonzalez,jodigonzalez185@slingacademy.com,1372057066,"378 Johnson Oval\nSouth Stacie, RI 76332",female,69,2/7/2021,12,68.67,Baker,Photography,False


In [209]:
# Format the phone number using a lambda
df["phone"] = df["phone"].apply(lambda x: x[0:3] + '-' + x[3:6] + '-' + x[6:10])
df.head(20)

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,800-040-3135,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,221-945-4191,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,388-142-4883,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,625-626-9133,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,178-353-4757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True
5,David,Davis,daviddavis980@slingacademy.com,067-435-8553,"021 Katherine Mall\nJameston, DC 24685",male,59,1/29/2022,9,970.96,Doctor,Playing sports,False
6,Jason,Montgomery,jasonmontgomery889@slingacademy.com,208-220-1519,"14657 Scott Loop Apt. 735\nPort Ashley, NH 34470",male,58,6/28/2021,12,676.2,Waitress,Collecting,False
7,Kent,Weaver,kentweaver695@slingacademy.com,000-101-6979,"6644 Mitchell Burg\nVictorhaven, KS 66356",male,61,1/24/2023,1,674.37,Clerk,Hiking,False
8,Jacqueline,Wang,jacquelinewang322@slingacademy.com,691-557-6502,"16963 Stewart Curve Suite 279\nSouth Cameron, ...",female,22,9/23/2022,12,962.47,Housewife,Collecting,False
9,Jodi,Gonzalez,jodigonzalez185@slingacademy.com,137-205-7066,"378 Johnson Oval\nSouth Stacie, RI 76332",female,69,2/7/2021,12,68.67,Baker,Photography,False
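
The lambda formats every value, including ones that are too short. As an alternative sketch (not the approach used in this notebook), a regex with capture groups can do the formatting in one vectorized call and leave anything that is not exactly 10 digits untouched. The second sample value is hypothetical.

```python
import pandas as pd

# First value is from the data set after cleanup; the second is a made-up short number.
digits = pd.Series(["8000403135", "1016979"])

# Three capture groups (3, 3, and 4 digits) are rejoined with dashes.
# The ^ and $ anchors mean only values of exactly 10 digits are rewritten.
formatted = digits.str.replace(r"^(\d{3})(\d{3})(\d{4})$", r"\1-\2-\3", regex=True)

print(formatted.tolist())
# ['800-040-3135', '1016979']
```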


In [210]:
# Blank out any remaining phone numbers that are not in the XXX-XXX-XXXX format. The invalid ones will now have
# fewer than 12 characters. Rather than correcting them, we will reach out to those customers via e-mail instead.
# Phone numbers are where we want them now!
df['phone'] = df['phone'].apply(lambda x: x if len(x) == 12 else '')
df.head(20)

Unnamed: 0,first_name,last_name,email,phone,address,gender,age,registered,orders,spent,job,hobbies,is_married
0,Joseph,Rice,josephrice131@slingacademy.com,800-040-3135,"91773 Miller Shoal\nDiaztown, FL 38841",male,43,12/18/2018,7,568.29,Artist,Playing sports,False
1,Gary,Moore,garymoore386@slingacademy.com,221-945-4191,"6450 John Lodge\nTerriton, KY 95945",male,71,1/3/2020,11,568.9,Artist,Swimming,True
2,John,Walker,johnwalker944@slingacademy.com,388-142-4883,"27265 Murray Island\nKevinfort, PA 63231",male,44,11/18/2019,11,497.12,Clerk,Painting,False
3,William,Jackson,williamjackson427@slingacademy.com,625-626-9133,"170 Jackson Loaf\nKristenland, AS 48876",male,58,2/13/2022,14,151.59,Engineer,Reading,False
4,Nicole,Jones,nicolejones228@slingacademy.com,178-353-4757,"14354 Baker Harbor Apt. 017\nEricville, HI 11192",female,33,3/22/2021,19,33.17,Unemployed,Running,True
5,David,Davis,daviddavis980@slingacademy.com,067-435-8553,"021 Katherine Mall\nJameston, DC 24685",male,59,1/29/2022,9,970.96,Doctor,Playing sports,False
6,Jason,Montgomery,jasonmontgomery889@slingacademy.com,208-220-1519,"14657 Scott Loop Apt. 735\nPort Ashley, NH 34470",male,58,6/28/2021,12,676.2,Waitress,Collecting,False
7,Kent,Weaver,kentweaver695@slingacademy.com,000-101-6979,"6644 Mitchell Burg\nVictorhaven, KS 66356",male,61,1/24/2023,1,674.37,Clerk,Hiking,False
8,Jacqueline,Wang,jacquelinewang322@slingacademy.com,691-557-6502,"16963 Stewart Curve Suite 279\nSouth Cameron, ...",female,22,9/23/2022,12,962.47,Housewife,Collecting,False
9,Jodi,Gonzalez,jodigonzalez185@slingacademy.com,137-205-7066,"378 Johnson Oval\nSouth Stacie, RI 76332",female,69,2/7/2021,12,68.67,Baker,Photography,False
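
The phone steps above can also be collected into one helper, which is handy if the same cleanup has to be repeated on another file. This is a rough sketch using only the standard library; `clean_phone` is a name I made up. It differs slightly from the cells above in that it keeps digits only (the notebook keeps letters too) and checks the length before formatting.

```python
import re

def clean_phone(raw: str) -> str:
    """Return the number as XXX-XXX-XXXX, or '' if it does not reduce to 10 digits."""
    number = raw.split("x")[0]                     # drop the extension, if any
    number = re.sub(r"^(?:\+1|001)-", "", number)  # drop a leading US country code
    digits = re.sub(r"\D", "", number)             # keep digits only
    if len(digits) != 10:
        return ""                                  # invalid: blank it out
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(clean_phone("+1-800-040-3135x6208"))  # 800-040-3135
print(clean_phone("(000)101-6979"))         # 000-101-6979
print(repr(clean_phone("555-1234")))        # ''
```

On the DataFrame this could be applied with `df["phone"].apply(clean_phone)`, assuming the column holds strings with no missing values.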


## Clean Up and Standardize Addresses

In [241]:
# We need to ultimately split the "Address" up into four separate columns: Address, City, State, Zip
df.loc[:,['address']].head(20)

Unnamed: 0,address
0,"91773 Miller Shoal\nDiaztown, FL 38841"
1,"6450 John Lodge\nTerriton, KY 95945"
2,"27265 Murray Island\nKevinfort, PA 63231"
3,"170 Jackson Loaf\nKristenland, AS 48876"
4,"14354 Baker Harbor Apt. 017\nEricville, HI 11192"
5,"021 Katherine Mall\nJameston, DC 24685"
6,"14657 Scott Loop Apt. 735\nPort Ashley, NH 34470"
7,"6644 Mitchell Burg\nVictorhaven, KS 66356"
8,"16963 Stewart Curve Suite 279\nSouth Cameron, ..."
9,"378 Johnson Oval\nSouth Stacie, RI 76332"


In [242]:
# First we will replace the \n newline character with a comma so the address parts can be split on commas.
df["address"] = df["address"].str.replace('\n',',')

In [243]:
# Now it's time to split out the street address, city, and state/zip. The new columns are appended to the end of the
# DataFrame where they are hard to see, so use .loc to show just the address columns and check how we are doing.
# Note: n must be passed by keyword; pandas 2.0 and later no longer accept it positionally.
df[["street_address","city","state_zip"]] = df["address"].str.split(',', n=2, expand=True)
df.loc[:,['address', 'street_address','city','state_zip']].head(20)

Unnamed: 0,address,street_address,city,state_zip
0,"91773 Miller Shoal,Diaztown, FL 38841",91773 Miller Shoal,Diaztown,FL 38841
1,"6450 John Lodge,Territon, KY 95945",6450 John Lodge,Territon,KY 95945
2,"27265 Murray Island,Kevinfort, PA 63231",27265 Murray Island,Kevinfort,PA 63231
3,"170 Jackson Loaf,Kristenland, AS 48876",170 Jackson Loaf,Kristenland,AS 48876
4,"14354 Baker Harbor Apt. 017,Ericville, HI 11192",14354 Baker Harbor Apt. 017,Ericville,HI 11192
5,"021 Katherine Mall,Jameston, DC 24685",021 Katherine Mall,Jameston,DC 24685
6,"14657 Scott Loop Apt. 735,Port Ashley, NH 34470",14657 Scott Loop Apt. 735,Port Ashley,NH 34470
7,"6644 Mitchell Burg,Victorhaven, KS 66356",6644 Mitchell Burg,Victorhaven,KS 66356
8,"16963 Stewart Curve Suite 279,South Cameron, N...",16963 Stewart Curve Suite 279,South Cameron,ND 25136
9,"378 Johnson Oval,South Stacie, RI 76332",378 Johnson Oval,South Stacie,RI 76332


In [244]:
# Now we will split out the state and zip. Drop the temporary state_zip column, but keep the original address column.
df["state_zip"] = df["state_zip"].str.strip()
df[["state","zip"]] = df["state_zip"].str.split(" ", n=1, expand=True)
df = df.drop('state_zip', axis=1)

In [148]:
# Address is good to go!
df.loc[:,['address', 'street_address','city','state', 'zip']].head(20)

Unnamed: 0,address,street_address,city,state,zip
0,"91773 Miller Shoal,Diaztown, FL 38841",91773 Miller Shoal,Diaztown,FL,38841
1,"6450 John Lodge,Territon, KY 95945",6450 John Lodge,Territon,KY,95945
2,"27265 Murray Island,Kevinfort, PA 63231",27265 Murray Island,Kevinfort,PA,63231
3,"170 Jackson Loaf,Kristenland, AS 48876",170 Jackson Loaf,Kristenland,AS,48876
4,"14354 Baker Harbor Apt. 017,Ericville, HI 11192",14354 Baker Harbor Apt. 017,Ericville,HI,11192
5,"021 Katherine Mall,Jameston, DC 24685",021 Katherine Mall,Jameston,DC,24685
6,"14657 Scott Loop Apt. 735,Port Ashley, NH 34470",14657 Scott Loop Apt. 735,Port Ashley,NH,34470
7,"6644 Mitchell Burg,Victorhaven, KS 66356",6644 Mitchell Burg,Victorhaven,KS,66356
8,"16963 Stewart Curve Suite 279,South Cameron, N...",16963 Stewart Curve Suite 279,South Cameron,ND,25136
9,"378 Johnson Oval,South Stacie, RI 76332",378 Johnson Oval,South Stacie,RI,76332
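
The replace-and-split steps above assume every address has exactly the same layout. Another way to get the same four columns, sketched here as an alternative rather than what the notebook does, is a single regex with named groups passed to `str.extract`. It would run on the original address values, before the newline is replaced. The third sample value is hypothetical and shows that rows that do not fit the pattern come back as NaN instead of being split incorrectly.

```python
import pandas as pd

# Two addresses from the data set (with a real newline), plus one made-up malformed value.
addresses = pd.Series([
    "91773 Miller Shoal\nDiaztown, FL 38841",
    "14354 Baker Harbor Apt. 017\nEricville, HI 11192",
    "no comma or zip here",
])

# Named groups become the column names of the resulting DataFrame.
pattern = (
    r"^(?P<street_address>[^\n]+)\n"   # everything up to the newline
    r"(?P<city>[^,]+),\s*"             # city, up to the comma
    r"(?P<state>[A-Z]{2})\s+"          # two-letter state code
    r"(?P<zip>\d{5})$"                 # five-digit zip
)
parts = addresses.str.extract(pattern)

print(parts)
```

The zip comes back as a string, which preserves any leading zeros.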


## Modify Dates

In [245]:
# Modifying dates
df.loc[:,['registered','spent']].head(5)

Unnamed: 0,registered,spent
0,12/18/2018,568.29
1,1/3/2020,568.9
2,11/18/2019,497.12
3,2/13/2022,151.59
4,3/22/2021,33.17


In [246]:
# Check the data type of "registered" first. Oops, it's an object (string) column.
df.dtypes

first_name         object
last_name          object
email              object
phone              object
address            object
gender             object
age                 int64
registered         object
orders              int64
spent             float64
job                object
hobbies            object
is_married           bool
street_address     object
city               object
state              object
zip                object
dtype: object

In [247]:
# Convert "registered" to date/time

df['registered'] = pd.to_datetime(df.registered)
df.loc[:,['registered','spent']].head(5)

Unnamed: 0,registered,spent
0,2018-12-18,568.29
1,2020-01-03,568.9
2,2019-11-18,497.12
3,2022-02-13,151.59
4,2021-03-22,33.17
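
Letting `pd.to_datetime` guess the format works for this data set. A slightly more defensive sketch is shown below, assuming all dates are month/day/year: it passes the format explicitly and uses `errors="coerce"` so that a bad value becomes `NaT` instead of raising an error. The third sample value is made up.

```python
import pandas as pd

# Two dates from the data set plus one hypothetical bad value.
registered = pd.Series(["12/18/2018", "1/3/2020", "not a date"])

# format pins the expected layout; errors="coerce" turns unparseable values into NaT.
parsed = pd.to_datetime(registered, format="%m/%d/%Y", errors="coerce")

print(parsed)
```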


In [248]:
# Much better!
df.dtypes

first_name                object
last_name                 object
email                     object
phone                     object
address                   object
gender                    object
age                        int64
registered        datetime64[ns]
orders                     int64
spent                    float64
job                       object
hobbies                   object
is_married                  bool
street_address            object
city                      object
state                     object
zip                       object
dtype: object

In [249]:
# We can now use date/time functions like strftime to format our date/time however we desire.
# Note that strftime returns strings, so after this step the column is no longer datetime64.
df['registered'] = df['registered'].dt.strftime('%m/%d/%Y')
df.loc[:,['registered','spent']].head(5)

Unnamed: 0,registered,spent
0,12/18/2018,568.29
1,01/03/2020,568.9
2,11/18/2019,497.12
3,02/13/2022,151.59
4,03/22/2021,33.17
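
Because `strftime` returns text, overwriting the column gives up the date/time functions again. If both are needed, one option is to keep the datetime column and write the formatted text to a second column. This is only a sketch; `df_demo` and `registered_display` are hypothetical names.

```python
import pandas as pd

# Small stand-in DataFrame using two dates from the data set.
df_demo = pd.DataFrame({"registered": pd.to_datetime(["2018-12-18", "2020-01-03"])})

# Keep the datetime column and store the formatted text separately.
df_demo["registered_display"] = df_demo["registered"].dt.strftime("%m/%d/%Y")

print(df_demo["registered_display"].tolist())  # ['12/18/2018', '01/03/2020']
print(df_demo["registered"].dt.year.tolist())  # [2018, 2020]
```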
