# Data Analyst Technical Assessment

## Scenario
Movido Media Verlag GmbH is a local marketing firm offering a SaaS solution that enhances businesses' online visibility across platforms like Google Maps, Apple Maps, and various business directories. Our mission is to ensure our clients' business information is accurate and easily discoverable by potential customers. Our data-driven sales process involves processing large volumes of lead data, which needs to be cleaned, normalised, and enhanced before being fed into downstream systems. We're also tasked with developing a robust system for storing, retrieving, and tracking lead information to measure future performance.

### Objective

As a Data Analyst, my role covers the following:
- Optimising these data processes
- Improving lead quality
- Contributing to our clients' success in the digital landscape

## Data Cleaning

The **data cleanup assignment.xlsx** file contains multiple sheets/datasets. To ensure data consistency and data quality, I will clean the datasets individually before merging them into one table.


### Importing Libraries


In [1040]:
import pandas as pd
import numpy as np

In [1042]:
import re

#### Read all datasets into separate DataFrames

##### Germany Dataset

In [1700]:
de_data = pd.read_excel('data cleanup assignment.xlsx', sheet_name='DE')

In [1702]:
de_data.head()

Unnamed: 0,firma,street,plz,city,old_ctry,telefon
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,0049177-8754883
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,172.2086056
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,+491734636476
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,+49_x001D_17_x0011_86221169
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,0049/160466 6050


##### Austria Dataset

In [1704]:
at_data = pd.read_excel('data cleanup assignment.xlsx', sheet_name='AT')

In [1706]:
at_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,+43 664 0020108
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,/664/3019220
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,Hotline: 676-4429785 (+43)
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,+436998521286
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,+436763066695


##### Switzerland Dataset

In [1708]:
ch_data = pd.read_excel('data cleanup assignment.xlsx', sheet_name='CH')

In [1710]:
ch_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,300 CALL 3378 / 78
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,+41abc5837-1_x001A_54
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,Telefon: 00 037 CALL 9506 / 79
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,0_x0008_0_x001F_4178abc8579-115
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,0041760352498


##### Mixed Dataset

In [1712]:
mixed_data = pd.read_excel('data cleanup assignment.xlsx', sheet_name='mixed')

In [1716]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefonnr,anrede,vorname,nachname
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49.0,56128730000.0,,,
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49.0,16095580000.0,,,
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49.0,22116930000.0,,,
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49.0,1746529000.0,,,
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49.0,15510140000.0,,,


#### Preliminary Data Exploration and Observation

Upon reviewing the datasets, I observed several irregularities, such as inconsistent phone number formats, encoding errors, missing values, and differing data structures and column names across the sheets/datasets.

The next steps will focus on cleaning these inconsistencies to ensure the datasets are standardized before being merged.

#### Understanding Data Structure and Data Types

In this phase, I examine the data types and structure of each dataset. This makes sure that each column is in the correct format (e.g. strings, integers, floats) which is crucial for efficient data manipulation.
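
As a compact alternative to reading four separate `info()` outputs, the column names and dtypes of all sheets can be placed side by side. The sketch below is only an illustration: the `frames` dictionary holds small toy DataFrames (hypothetical stand-ins for the real sheets, reusing column names seen in the outputs above), and `schema` is a name introduced here.

```python
import pandas as pd

# Toy frames standing in for the real sheets (hypothetical data, real column names).
frames = {
    "DE": pd.DataFrame({"firma": ["A"], "plz": ["52152"], "old_ctry": ["DE"], "telefon": ["+49 1"]}),
    "CH": pd.DataFrame({"firma": ["B"], "plz": [4938], "country": ["CH"], "telefon": ["+41 1"]}),
    "mixed": pd.DataFrame({"firma": ["C"], "plz": ["34117"], "landesvorwahl": [49.0], "telefonnr": [5.6e10]}),
}

# One row per column name, one column per sheet.
# NaN marks a column that a sheet does not have.
schema = pd.DataFrame({name: df.dtypes.astype(str) for name, df in frames.items()})
print(schema)
```

Read row by row, such a table makes naming differences (`old_ctry` vs `country`, `telefonnr` vs `telefon`) and type differences (`plz` as text in one sheet, integer in another) visible at a glance.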


In [83]:
# plz should be numeric instead of a string, as these are postal codes
# telefon stays a string because it contains special characters; in 'mixed_data' it is converted from float to string (object)
# landesvorwahl is converted from float to integer so that it does NOT appear as 49.0

In [1718]:
de_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865 entries, 0 to 864
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   firma     865 non-null    object
 1   street    865 non-null    object
 2   plz       865 non-null    object
 3   city      865 non-null    object
 4   old_ctry  865 non-null    object
 5   telefon   865 non-null    object
dtypes: object(6)
memory usage: 40.7+ KB


In [1720]:
at_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   firma    200 non-null    object
 1   street   200 non-null    object
 2   plz      200 non-null    object
 3   city     200 non-null    object
 4   country  200 non-null    object
 5   telefon  189 non-null    object
dtypes: object(6)
memory usage: 9.5+ KB


In [1722]:
ch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   firma    92 non-null     object
 1   street   92 non-null     object
 2   plz      92 non-null     int64 
 3   city     92 non-null     object
 4   country  92 non-null     object
 5   telefon  85 non-null     object
dtypes: int64(1), object(5)
memory usage: 4.4+ KB


In [1724]:
mixed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3618 entries, 0 to 3617
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   firma          3072 non-null   object 
 1   street         3618 non-null   object 
 2   plz            3062 non-null   object 
 3   city           3026 non-null   object 
 4   landesvorwahl  3617 non-null   float64
 5   telefonnr      3581 non-null   float64
 6   anrede         354 non-null    object 
 7   vorname        468 non-null    object 
 8   nachname       395 non-null    object 
dtypes: float64(2), object(7)
memory usage: 254.5+ KB


#### Standardising the Data Across Datasets

In [1726]:
# Rename old_ctry to country
de_data = de_data.rename(columns={'old_ctry': 'country'})

In [1728]:
de_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 865 entries, 0 to 864
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   firma    865 non-null    object
 1   street   865 non-null    object
 2   plz      865 non-null    object
 3   city     865 non-null    object
 4   country  865 non-null    object
 5   telefon  865 non-null    object
dtypes: object(6)
memory usage: 40.7+ KB


In [1730]:
# Rename telefonnr to telefon to match the other datasets
mixed_data = mixed_data.rename(columns={'telefonnr': 'telefon'})

In [1732]:

mixed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3618 entries, 0 to 3617
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   firma          3072 non-null   object 
 1   street         3618 non-null   object 
 2   plz            3062 non-null   object 
 3   city           3026 non-null   object 
 4   landesvorwahl  3617 non-null   float64
 5   telefon        3581 non-null   float64
 6   anrede         354 non-null    object 
 7   vorname        468 non-null    object 
 8   nachname       395 non-null    object 
dtypes: float64(2), object(7)
memory usage: 254.5+ KB


In [1734]:
# Convert 'telefon' to string to properly handle phone numbers
mixed_data['telefon'] = mixed_data['telefon'].astype(str)


In [1736]:
mixed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3618 entries, 0 to 3617
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   firma          3072 non-null   object 
 1   street         3618 non-null   object 
 2   plz            3062 non-null   object 
 3   city           3026 non-null   object 
 4   landesvorwahl  3617 non-null   float64
 5   telefon        3618 non-null   object 
 6   anrede         354 non-null    object 
 7   vorname        468 non-null    object 
 8   nachname       395 non-null    object 
dtypes: float64(1), object(8)
memory usage: 254.5+ KB


In [1738]:
# Convert 'landesvorwahl' to integer; missing values are filled with 0 first
mixed_data['landesvorwahl'] = mixed_data['landesvorwahl'].fillna(0).astype(int)

In [1740]:
mixed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3618 entries, 0 to 3617
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   firma          3072 non-null   object
 1   street         3618 non-null   object
 2   plz            3062 non-null   object
 3   city           3026 non-null   object
 4   landesvorwahl  3618 non-null   int64 
 5   telefon        3618 non-null   object
 6   anrede         354 non-null    object
 7   vorname        468 non-null    object
 8   nachname       395 non-null    object
dtypes: int64(1), object(8)
memory usage: 254.5+ KB
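
Filling missing dialling codes with 0 removes the `.0`, but it also turns an unknown code into the value 0, which later looks like a real entry. One possible alternative, sketched here on a toy Series rather than the real sheet, is pandas' nullable `Int64` dtype, which stores whole numbers and keeps missing entries as `<NA>`. The names `codes` and `codes_int` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the landesvorwahl column, including one missing value.
codes = pd.Series([49.0, 43.0, np.nan, 41.0], name="landesvorwahl")

# Nullable integer dtype: 49 instead of 49.0, and the gap stays <NA> instead of 0.
codes_int = codes.astype("Int64")
print(codes_int)
```

Whether this is preferable depends on the downstream steps; comparisons such as `codes_int == 49` still work, and the missing row remains detectable with `isna()`.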


#### Handling Missing Data

In this step, I identify the columns with missing data in each dataset. Knowing which columns have incomplete records lets me decide whether to fill in the missing values or remove the affected rows, depending on how much data is missing.
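
To judge the extent of missing data, counts are easier to compare when paired with a percentage. The helper below is a minimal sketch; `missing_summary` and the toy DataFrame are hypothetical names and data introduced for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })

# Toy data: one missing company name, two missing phone numbers.
toy = pd.DataFrame({
    "firma": ["A", "B", None, "D"],
    "telefon": ["+43 1", np.nan, np.nan, "+43 2"],
})
summary = missing_summary(toy)
print(summary)
```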

In [1742]:
# finds the columns that have NaN values
de_data.isna().any()

firma      False
street     False
plz        False
city       False
country    False
telefon    False
dtype: bool

In [1744]:
at_data.isna().any()

firma      False
street     False
plz        False
city       False
country    False
telefon     True
dtype: bool

In [1746]:
ch_data.isna().any()

firma      False
street     False
plz        False
city       False
country    False
telefon     True
dtype: bool

In [1143]:
#mixed_data.isna().any()

firma             True
street           False
plz               True
city              True
landesvorwahl     True
telefon          False
anrede            True
vorname           True
nachname          True
dtype: bool

At Movido Media Verlag GmbH, every lead is crucial for potential business opportunities. I will therefore not delete or 'drop' rows with missing data, particularly missing **phone numbers**, as these will later be enriched using automated data retrieval methods in the **Data Enrichment** phase.

In the meantime, **NaN** serves as a placeholder that flags missing fields for future enhancement, ensuring no valuable leads are lost.

In [1748]:
# Missing values are already NaN after read_excel, so this call leaves the data unchanged
at_data.fillna(np.nan, inplace=True)

In [1750]:
# displays the count/number of NaN values 
at_data.isnull().sum()

firma       0
street      0
plz         0
city        0
country     0
telefon    11
dtype: int64

In [1752]:
# Check if NaN has been populated into the missing fields
at_data[at_data.isna().any(axis=1)].head()

Unnamed: 0,firma,street,plz,city,country,telefon
72,Aktiv & Vital Forum Isabella Wulz e. U.,Gewerbestraße 8,9330,Althofen,AT,
73,Mütterstudio Tulln,Karl Metz-Gasse 22,3430,Tulln an der Donau,AT,
80,Orthopädie Schuhtechnik Nindl GmbH,Kirchenstraße 36,5733,Bramberg am Wildkogel,AT,
81,A.Novak GmbH,Wimmergasse 7,1050,Wien,AT,
183,Buschenschank Luttenberger,Seibersdorf 20,8423,St. Veit in der Südsteiermark,AT,


In [1754]:
# As for at_data: missing values are already NaN, so this call leaves the data unchanged
ch_data.fillna(np.nan, inplace=True)

In [1756]:
ch_data.isnull().sum()

firma      0
street     0
plz        0
city       0
country    0
telefon    7
dtype: int64

In [1758]:
ch_data[ch_data.isnull().any(axis=1)].head()

Unnamed: 0,firma,street,plz,city,country,telefon
10,Simon's Steakhouse Grill & Restaurant & Bar,Niederdorfstrasse 13,8001,Zürich,CH,
86,Café Henrici,Niederdorfstrasse 1,8001,Zürich,CH,
87,Babu's,Löwenstrasse 1,8001,Zürich,CH,
88,Grande Café & Bar,Limmatquai 118,8001,Zürich,CH,
89,Café des Amis,Nordstrasse 88,8037,Zürich,CH,


In [1095]:
#mixed_data.fillna(np.nan, inplace=True)

In [1760]:
# Replace empty strings or strings with 'nan' (created during conversion) with actual NaN
mixed_data['telefon'] = mixed_data['telefon'].replace(['', ' ', 'nan'], pd.NA)

In [1762]:
mixed_data.isna().any()

firma             True
street           False
plz               True
city              True
landesvorwahl    False
telefon           True
anrede            True
vorname           True
nachname          True
dtype: bool

In [1764]:
mixed_data.isnull().sum()

firma             546
street              0
plz               556
city              592
landesvorwahl       0
telefon            37
anrede           3264
vorname          3150
nachname         3223
dtype: int64

In [1766]:
mixed_data[mixed_data.isnull().any(axis=1)].head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49,56128726696.0,,,
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49,16095579280.0,,,
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49,22116933906.0,,,
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49,1746528901.0,,,
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49,15510139521.0,,,


#### Handling Duplicate Data

This step will identify and remove any duplicate entries across the datasets. Duplicate data can skew analysis and reduce the efficiency of Movido's lead tracking system, so it is important to ensure that each lead is unique.
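
The difference between inspecting duplicates and removing them can be sketched on a small example. The DataFrame below is a toy stand-in (its values are borrowed from rows that appear later in this notebook); `all_copies` and `deduped` are illustrative names.

```python
import pandas as pd

leads = pd.DataFrame({
    "firma": ["aquarium-spezialanfertigung", "aquarium-spezialanfertigung", "Miss Döner"],
    "street": ["Hauptstraße 38", "Hauptstraße 38", "Carl-von-Ossietzky-Platz 1"],
    "telefon": ["6041962396", "6041962396", "1746528901"],
})

# keep=False flags every member of a duplicate group, which is useful for inspection.
all_copies = leads[leads.duplicated(keep=False)]

# keep='first' retains exactly one row per duplicate group.
deduped = leads.drop_duplicates(keep="first").reset_index(drop=True)

print(len(all_copies), len(deduped))
```

Note that `duplicated()` without `keep=False` counts only the surplus copies, so the number it reports equals the number of rows that `drop_duplicates` removes.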


In [1768]:
# identify duplicates in the DataFrames
de_data.duplicated().sum()

0

In [1770]:
at_data.duplicated().sum()

0

In [1772]:
ch_data.duplicated().sum()

0

In [1774]:
mixed_data.duplicated().sum()

26

In [1776]:
# Check the exact duplicates in the mixed_data dataset
duplicate_rows = mixed_data[mixed_data.duplicated(keep=False)]

In [1778]:
duplicate_rows.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname
22,,,,,49,,,,
156,,,,,41,787969693.0,,,
168,aquarium-spezialanfertigung,Hauptstraße 38,63691.0,Ranstadt,49,6041962396.0,,,
169,aquarium-spezialanfertigung,Hauptstraße 38,63691.0,Ranstadt,49,6041962396.0,,,
204,,,,,49,512169023.0,,,


In [1780]:
# Some of the rows listed above may look unrelated, because keep=False lists every row that has an exact copy elsewhere
# I do not want to remove any data that could be considered a lead,
# therefore I will remove only exact duplicates and keep one instance of each lead

mixed_data.drop_duplicates(keep='first', inplace=True)

In [1782]:
# check duplicates
mixed_data.duplicated().sum()


0

### Data Standardising and Data Cleaning Process

The process involves removing special characters, handling missing or incorrect data, and extracting the country name, country code, and cleaned local phone number. 

The final output includes three columns: country name, country code, and local number (without leading zeros), ensuring consistency and usability of the phone number data.

#### 1. Remove Special Characters

- The first step is to remove special characters from the phone numbers
- I will do this by keeping only the digits and discarding everything else
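
One caveat with a digits-only filter: sample values such as `+49_x001D_17_x0011_86221169` contain sequences of the form `_xHHHH_`, which appear to be Excel's escaped control characters. Their hex digits (`001`, `0011`) survive a plain `[^\d]` replacement and end up inside the number. The sketch below strips those escapes first; `XML_ESCAPE` and `digits_only` are hypothetical names, and the assumption that every `_xHHHH_` sequence is debris should be checked against the real data.

```python
import re
import pandas as pd

# Assumption: _xHHHH_ (four hex digits) is an escaped control character, not part of the number.
XML_ESCAPE = re.compile(r"_x[0-9A-Fa-f]{4}_")

def digits_only(value):
    """Drop _xHHHH_ escapes first, then everything that is not a digit."""
    if pd.isna(value):
        return value  # leave missing values untouched
    text = XML_ESCAPE.sub("", str(value))
    return re.sub(r"\D", "", text)

raw = pd.Series(["+49_x001D_17_x0011_86221169", "0049/160466 6050", None])
cleaned = raw.map(digits_only)
print(cleaned.tolist())
```

With this ordering the first value reduces to `491786221169` rather than a string padded with the escape's own digits.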


##### Germany

In [1784]:
# Remove special characters and spaces from the 'telefon' column
de_data['telefon_cleaned'] = de_data['telefon'].str.replace(r'[^\d]', '', regex=True)


In [1788]:
# special characters have been removed including '+'
de_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,0049177-8754883,491778754883
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,172.2086056,1722086056
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,+491734636476,491734636476
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,+49_x001D_17_x0011_86221169,4900117001186221169
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,0049/160466 6050,491604666050


##### Austria

In [1790]:
# Remove special characters and spaces from the 'telefon' column
at_data['telefon_cleaned'] = at_data['telefon'].str.replace(r'[^\d]', '', regex=True)

In [1792]:
at_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,+43 664 0020108,436640020108
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,/664/3019220,6643019220
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,Hotline: 676-4429785 (+43),676442978543
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,+436998521286,436998521286
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,+436763066695,436763066695


##### Switzerland

In [1794]:
# Remove special characters and spaces from the 'telefon' column
ch_data['telefon_cleaned'] = ch_data['telefon'].str.replace(r'[^\d]', '', regex=True)

In [1796]:
ch_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,300 CALL 3378 / 78,300337878
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,+41abc5837-1_x001A_54,415837100154
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,Telefon: 00 037 CALL 9506 / 79,37950679
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,0_x0008_0_x001F_4178abc8579-115,8000141788579115
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,0041760352498,41760352498


##### Mixed 

In [1798]:
# Remove special characters and spaces from the 'telefon' column
mixed_data['telefon_cleaned'] = mixed_data['telefon'].str.replace(r'[^\d]', '', regex=True)

In [1800]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname,telefon_cleaned
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49,56128726696.0,,,,561287266960
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49,16095579280.0,,,,160955792800
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49,22116933906.0,,,,221169339060
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49,1746528901.0,,,,17465289010
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49,15510139521.0,,,,155101395210
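
In the output above, every `telefon_cleaned` value in the mixed dataset carries an extra trailing `0`: the column was read as float, so `astype(str)` produces `56128726696.0`, and removing the dot leaves `561287266960`. A possible way around this, sketched on a toy Series, is to go through the nullable `Int64` dtype before converting to text. The variable names are illustrative, and any leading zeros lost when the sheet stored the number as a float cannot be recovered this way.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the float-typed telefon column, including one missing value.
telefon = pd.Series([56128726696.0, 1746528901.0, np.nan])

# Int64 drops the fractional part and keeps missing values as <NA>;
# the string dtype then renders "56128726696" rather than "56128726696.0".
telefon_str = telefon.astype("Int64").astype("string")
print(telefon_str.tolist())
```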


#### 2. Extract Country Code and Create Local Number Column

- I extract the local number by removing the country prefix ("0049", "0043" or "0041", or the same code without the leading "00") from the start of the phone number
- This leaves just the local part of the number
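
The prefix removal can be expressed as one reusable function instead of a separate regular expression per country. This is a minimal sketch under stated assumptions: `split_local_number` is a hypothetical helper, the input is expected to be digits only, and a national number whose own digits happen to begin with the country code (for example a German number starting with `49`) would be shortened by mistake, so results should be spot-checked.

```python
import re

def split_local_number(digits: str, country_code: str) -> str:
    """Return the number without country prefix and without leading zeros.

    digits: phone number already reduced to digits only.
    country_code: dialling code without zeros, e.g. "49", "43", "41".
    """
    # Optional international prefix "00", then the country code, only at the start.
    local = re.sub(rf"^(?:00)?{country_code}", "", digits, count=1)
    # National notation often carries a leading trunk "0" (e.g. 0177...); drop it too.
    return local.lstrip("0")

print(split_local_number("00491778754883", "49"))
print(split_local_number("41760352498", "41"))
```

The function also covers the later step of removing leading zeros, so numbers written in national notation (`0172...`) and international notation (`0049172...`, `49172...`) converge on the same local number.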

##### Germany

In [1686]:
# Remove the country code from the cleaned phone number to get the local number
# This first attempt does not cover numbers written as +49..., so it is kept only for reference
#de_data['local_number'] = de_data['telefon_cleaned'].str.replace(r'^0049', '', regex=True)


In [1802]:
# Remove both '0049' and '49' from the start of the cleaned phone numbers
de_data['local_number'] = de_data['telefon_cleaned'].str.replace(r'^(0049|49)', '', regex=True)

In [1804]:
de_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,0049177-8754883,491778754883,1778754883
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,172.2086056,1722086056,1722086056
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,+491734636476,491734636476,1734636476
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,+49_x001D_17_x0011_86221169,4900117001186221169,117001186221169
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,0049/160466 6050,491604666050,1604666050


##### Austria

In [1806]:
# Remove both '0043' and '43' from the start of the cleaned phone numbers
at_data['local_number'] = at_data['telefon_cleaned'].str.replace(r'^(0043|43)', '', regex=True)

In [1808]:
at_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,+43 664 0020108,436640020108,6640020108
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,/664/3019220,6643019220,6643019220
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,Hotline: 676-4429785 (+43),676442978543,676442978543
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,+436998521286,436998521286,6998521286
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,+436763066695,436763066695,6763066695


##### Switzerland

In [1811]:
# Remove both '0041' and '41' from the start of the cleaned phone numbers
ch_data['local_number'] = ch_data['telefon_cleaned'].str.replace(r'^(0041|41)', '', regex=True)

In [1813]:
ch_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,300 CALL 3378 / 78,300337878,300337878
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,+41abc5837-1_x001A_54,415837100154,5837100154
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,Telefon: 00 037 CALL 9506 / 79,37950679,37950679
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,0_x0008_0_x001F_4178abc8579-115,8000141788579115,8000141788579115
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,0041760352498,41760352498,760352498


##### Mixed

In [1820]:
# Remove country code for Germany (49)
mixed_data.loc[mixed_data['landesvorwahl'] == 49, 'local_number'] = mixed_data['telefon_cleaned'].str.replace(r'^(0049|49)', '', regex=True)

In [1822]:
# Remove country code for Austria (43)
mixed_data.loc[mixed_data['landesvorwahl'] == 43, 'local_number'] = mixed_data['telefon_cleaned'].str.replace(r'^(0043|43)', '', regex=True)

In [1824]:
# Remove country code for Switzerland (41)
mixed_data.loc[mixed_data['landesvorwahl'] == 41, 'local_number'] = mixed_data['telefon_cleaned'].str.replace(r'^(0041|41)', '', regex=True)

In [1834]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname,telefon_cleaned,local_number
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49,56128726696.0,,,,561287266960,561287266960
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49,16095579280.0,,,,160955792800,160955792800
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49,22116933906.0,,,,221169339060,221169339060
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49,1746528901.0,,,,17465289010,17465289010
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49,15510139521.0,,,,155101395210,155101395210


#### 3. Create a Country Code Column

- Here I assign the country code ("0049", "0043" or "0041") to every row, as specified in the task
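
For the mixed dataset, where the code depends on `landesvorwahl`, a single lookup table can drive both the country code and the country abbreviation, and it makes unmapped values easy to spot. The sketch below uses a toy Series; `LOOKUP` and `unmapped` are hypothetical names, and the value `0` stands for a dialling code that was missing and filled with 0.

```python
import pandas as pd

# Dialling code -> (country_code, country), using the values from this notebook.
LOOKUP = {49: ("0049", "DE"), 43: ("0043", "AT"), 41: ("0041", "CH")}

landesvorwahl = pd.Series([49, 41, 0, 43])

country_code = landesvorwahl.map({k: v[0] for k, v in LOOKUP.items()})
country = landesvorwahl.map({k: v[1] for k, v in LOOKUP.items()})

# Values without an entry in LOOKUP map to NaN and can be reviewed separately.
unmapped = landesvorwahl[country_code.isna()]
print(unmapped.tolist())
```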

##### Germany

In [1843]:
# Assign the country code "0049" to all rows
de_data['country_code'] = '0049'

In [1845]:
de_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number,country_code
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,0049177-8754883,491778754883,1778754883,49
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,172.2086056,1722086056,1722086056,49
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,+491734636476,491734636476,1734636476,49
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,+49_x001D_17_x0011_86221169,4900117001186221169,117001186221169,49
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,0049/160466 6050,491604666050,1604666050,49


##### Austria

In [1847]:
# Assign the country code "0043" to all rows
at_data['country_code'] = '0043'

In [1849]:
at_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number,country_code
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,+43 664 0020108,436640020108,6640020108,43
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,/664/3019220,6643019220,6643019220,43
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,Hotline: 676-4429785 (+43),676442978543,676442978543,43
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,+436998521286,436998521286,6998521286,43
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,+436763066695,436763066695,6763066695,43


##### Switzerland

In [1852]:
# Assign the country code "0041" to all rows
ch_data['country_code'] = '0041'

In [1854]:
ch_data.head()

Unnamed: 0,firma,street,plz,city,country,telefon,telefon_cleaned,local_number,country_code
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,300 CALL 3378 / 78,300337878,300337878,41
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,+41abc5837-1_x001A_54,415837100154,5837100154,41
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,Telefon: 00 037 CALL 9506 / 79,37950679,37950679,41
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,0_x0008_0_x001F_4178abc8579-115,8000141788579115,8000141788579115,41
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,0041760352498,41760352498,760352498,41


##### Mixed

In [1857]:
# Create the country code column based on the landesvorwahl to match each country
mixed_data['country_code'] = mixed_data['landesvorwahl'].map({49: '0049', 43: '0043', 41: '0041'})

In [1863]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname,telefon_cleaned,local_number,country_code
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49,56128726696.0,,,,561287266960,561287266960,49
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49,16095579280.0,,,,160955792800,160955792800,49
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49,22116933906.0,,,,221169339060,221169339060,49
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49,1746528901.0,,,,17465289010,17465289010,49
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49,15510139521.0,,,,155101395210,155101395210,49


#### 4. Get cleaned dataframe with the columns in the correct order

- I no longer need the 'telefon' and 'telefon_cleaned' columns, since I have already extracted the data I need from them
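
When a column subset is selected and later modified, pandas may emit a `SettingWithCopyWarning` because it cannot tell whether the subset is still tied to the original DataFrame. A common way to avoid that, shown here as a sketch on toy data with illustrative names, is to take an explicit `.copy()` at selection time.

```python
import pandas as pd

df = pd.DataFrame({
    "firma": ["A", "B"],
    "telefon": ["0049 0177", "0041 076"],
    "local_number": ["0177", "076"],
})

# .copy() makes the subset an independent DataFrame.
subset = df[["firma", "local_number"]].copy()

# Later edits affect only the subset, not the original frame.
subset["local_number"] = subset["local_number"].str.lstrip("0")
print(subset)
```

The original `df` keeps its values, which also preserves the raw columns in case a cleaning step needs to be revisited.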

##### Germany

In [1865]:
# Reorder the columns and remove 'telefon' and 'telefon_cleaned'
reordered_data_de = de_data[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number']]

In [1869]:
reordered_data_de.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,49,1778754883
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,49,1722086056
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,49,1734636476
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,49,117001186221169
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,49,1604666050


##### Austria

In [1873]:
# Reorder the columns and remove 'telefon' and 'telefon_cleaned'
reordered_data_at = at_data[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number']]

In [1875]:
reordered_data_at.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,43,6640020108
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,43,6643019220
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,43,676442978543
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,43,6998521286
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,43,6763066695


##### Switzerland

In [1878]:
# Reorder the columns and remove 'telefon' and 'telefon_cleaned'
reordered_data_ch = ch_data[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number']]

In [1880]:
reordered_data_ch.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,41,300337878
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,41,5837100154
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,41,37950679
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,41,8000141788579115
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,41,760352498


##### Mixed

In [None]:
# Adding additional columns from the Mixed dataset 
#reordered_mixed = mixed_data[['firma', 'street', 'plz', 'city', 'landesvorwahl', 'country_code', 'local_number', 'anrede', 'vorname', 'nachname']]


In [1884]:
# In order to not have conflicts during merging, I am creating a country column 
mixed_data['country'] = mixed_data['landesvorwahl'].map({49:'DE', 43:'AT', 41:'CH'})

In [1886]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,landesvorwahl,telefon,anrede,vorname,nachname,telefon_cleaned,local_number,country_code,country
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,49,56128726696.0,,,,561287266960,561287266960,49,DE
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,49,16095579280.0,,,,160955792800,160955792800,49,DE
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,49,22116933906.0,,,,221169339060,221169339060,49,DE
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,49,1746528901.0,,,,17465289010,17465289010,49,DE
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,49,15510139521.0,,,,155101395210,155101395210,49,DE


In [1888]:
# drop the landesvorwahl, it is not needed anymore
mixed_data.drop('landesvorwahl', axis=1, inplace=True)

In [1890]:
mixed_data.head()

Unnamed: 0,firma,street,plz,city,telefon,anrede,vorname,nachname,telefon_cleaned,local_number,country_code,country
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,56128726696.0,,,,561287266960,561287266960,49,DE
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,16095579280.0,,,,160955792800,160955792800,49,DE
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,22116933906.0,,,,221169339060,221169339060,49,DE
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,1746528901.0,,,,17465289010,17465289010,49,DE
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,15510139521.0,,,,155101395210,155101395210,49,DE


In [1892]:
# Now I rearrange the columns to match the order used in the other datasets
reordered_data_mixed = mixed_data[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number', 'anrede', 'vorname', 'nachname']]


In [1894]:
reordered_data_mixed.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,anrede,vorname,nachname
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,DE,49,561287266960,,,
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,DE,49,160955792800,,,
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,DE,49,221169339060,,,
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,DE,49,17465289010,,,
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,DE,49,155101395210,,,


#### 5. Remove any leading zeros in the local number

##### Germany

In [1898]:
# Remove leading zeros from the local number using .loc to avoid the warning
reordered_data_de.loc[:, 'local_number'] = reordered_data_de['local_number'].str.lstrip('0')


In [1902]:
reordered_data_de.sample(100)

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
8,Autohaus Siegmar GmbH,Anton-Erhardt-Straße 5,9117,Chemnitz,DE,0049,100044917900089703167
511,"Naturkosmetikfachgeschäft, Natural Beauty",Rödergasse 5,91541,Rothenburg ob der Tauber,DE,0049,1752485665
248,Käskoung Stub’n,Poppenreuther Straße 6,90419,Nürnberg,DE,0049,1786551593
313,VSB Versicherungsservice Bantel,Furtbergstr. 106,71665,Vaihingen,DE,0049,571779066239
479,Agentur für Saisonprodukte Best Season GmbH,Gutenbergstr.1,31157,Sarstedt,DE,0049,1766933301
...,...,...,...,...,...,...,...
284,"Ausbildungsstall Jochen Leicht, Inhaber: Joche...",Eisendorferstraße 21,85567,Grafing,DE,0049,1524588703
770,Exalack GmbH,Industriestrasse 52,8112,Otelfingen,DE,0049,941526342474
686,Verputzer- und Malermeisterbetrieb Walter Dietz,Zehntstrasse 16,97618,Strahlungen,DE,0049,1746690463
236,Bohlen & Tammling GmbH & Co KG,Heisfelder Str. 161,26789,Leer,DE,0049,1602104553


##### Austria

In [1904]:
# Remove leading zeros from the local number using .loc to avoid the warning
reordered_data_at.loc[:, 'local_number'] = reordered_data_at['local_number'].str.lstrip('0')

In [1906]:
reordered_data_at.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,43,6640020108
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,43,6643019220
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,43,676442978543
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,43,6998521286
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,43,6763066695


##### Switzerland

In [1909]:
# Remove leading zeros from the local number using .loc to avoid the warning
reordered_data_ch.loc[:, 'local_number'] = reordered_data_ch['local_number'].str.lstrip('0')

In [1911]:
reordered_data_ch.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,41,300337878
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,41,5837100154
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,41,37950679
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,41,8000141788579115
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,41,760352498


##### Mixed

In [1914]:
# Remove leading zeros from the local number using .loc to avoid the warning
reordered_data_mixed.loc[:, 'local_number'] = reordered_data_mixed['local_number'].str.lstrip('0')

In [1916]:
reordered_data_mixed.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,anrede,vorname,nachname
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,DE,49,561287266960,,,
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,DE,49,160955792800,,,
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,DE,49,221169339060,,,
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,DE,49,17465289010,,,
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,DE,49,155101395210,,,
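
The same stripping step is repeated for each of the four DataFrames above. As a minimal sketch, it could be wrapped in a small helper and applied in one pass. The helper name `strip_leading_zeros` and the miniature frames are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

def strip_leading_zeros(df: pd.DataFrame, column: str = "local_number") -> pd.DataFrame:
    """Return a copy of df with leading zeros removed from a text column."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    # Cast to the string dtype so the .str accessor is available and
    # missing values stay missing instead of becoming the text "None"
    out[column] = out[column].astype("string").str.lstrip("0")
    return out

# Hypothetical miniature frames standing in for two of the sheets
frames = {
    "DE": pd.DataFrame({"local_number": ["01778754883", "1722086056"]}),
    "AT": pd.DataFrame({"local_number": ["06640020108", None]}),
}

# Apply the same cleaning step to every frame
cleaned = {name: strip_leading_zeros(df) for name, df in frames.items()}
```

Returning a copy rather than assigning into a slice is one way to sidestep chained-assignment warnings altogether.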


#### 6. Flagging Long Phone Numbers for Manual Review

- Local numbers longer than 11 digits (the threshold applied in the code below) exceed the typical length of a German local number and are likely incorrect
- Since these numbers cannot be corrected automatically, I flag them for manual review to preserve data integrity and avoid losing potential business leads
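
As a rough sketch of the flagging rule, the length check can be written once as a function with the threshold as a parameter. `flag_long_numbers` and `MAX_LOCAL_DIGITS` are hypothetical names; the value 11 is taken from the cells in this section and is an assumption about what counts as "too long".

```python
import pandas as pd

MAX_LOCAL_DIGITS = 11  # assumption: same threshold as the flagging cells in this section

def flag_long_numbers(numbers: pd.Series, max_digits: int = MAX_LOCAL_DIGITS) -> pd.Series:
    """Return 'Yes' where a local number has more than max_digits characters, else 'No'."""
    lengths = numbers.astype("string").str.len()
    # Missing numbers have no length; treat them as not flagged
    too_long = (lengths > max_digits).fillna(False).astype(bool)
    return too_long.map({True: "Yes", False: "No"})

# Two values taken from the DE output in this section, plus a missing value
sample = pd.Series(["1778754883", "117001186221169", None])
flags = flag_long_numbers(sample)
```

Keeping the threshold in one named constant makes it easier to use a different limit per country if that turns out to be necessary.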

##### Germany

In [2102]:
# Creates a new column to flag numbers that are longer than 11 digits
reordered_data_de.loc[:, 'long_number_flag'] = (reordered_data_de['local_number'].str.len() > 11).astype(str).replace({'True': 'Yes', 'False': 'No'})


In [2104]:
reordered_data_de.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,49,1778754883,No
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,49,1722086056,No
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,49,1734636476,No
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,49,117001186221169,Yes
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,49,1604666050,No


##### Austria

In [2106]:
# Creates a new column to flag numbers that are longer than 11 digits
reordered_data_at.loc[:, 'long_number_flag'] = (reordered_data_at['local_number'].str.len() > 11).astype(str).replace({'True': 'Yes', 'False': 'No'})

In [2108]:
reordered_data_at.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
0,Abschleppdienst und Reparatur Graber Hans,Brennerstraße 5,6150,Steinach am Brenner,AT,43,6640020108,No
1,Anhänger Steininger & Partner GmbH,Windhager Straße 22,3931,Schweiggers,AT,43,6643019220,No
2,Täubl Sonnenschutz,Otto-Scharmitzerstr. 24,3464,Goldgeben,AT,43,676442978543,Yes
3,Schönhacker Auto- und Fahrradzubehör,Ernest-Thum-Straße 1,3542,Gföhl,AT,43,6998521286,No
4,Wuppinger Karosseriebau GmbH,Breitwies 6,5303,Thalgau,AT,43,6763066695,No


##### Switzerland

In [2110]:
# Creates a new column to flag numbers that are longer than 11 digits
reordered_data_ch.loc[:, 'long_number_flag'] = (reordered_data_ch['local_number'].str.len() > 11).astype(str).replace({'True': 'Yes', 'False': 'No'})

In [2112]:
reordered_data_ch.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
0,Fankhauser AG Huttwil,Walke 1,4938,Rohrbach,CH,41,300337878,No
1,TS-Velos GmbH,Jurastrasse 2,4554,Etziken,CH,41,5837100154,No
2,Nocera & Strub AG,Hirzenstrasse 1,9244,Niederuzwil,CH,41,37950679,No
3,Druckerei Lutz AG,Hauptstrasse 18,9042,Speicher,CH,41,8000141788579115,Yes
4,Hch. Borer Kartenverlag AG,Ilbachstrasse 39,4228,Erschwil,CH,41,760352498,No


##### Mixed

In [2114]:
# Creates a new column to flag numbers that are longer than 11 digits
reordered_data_mixed.loc[:, 'long_number_flag'] = (reordered_data_mixed['local_number'].str.len() > 11).astype(str).replace({'True': 'Yes', 'False': 'No'})

In [2116]:
reordered_data_mixed.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,anrede,vorname,nachname,long_number_flag
0,Hammurabi Restaurant,Untere Königsstraße,34117,Kassel,DE,49,561287266960,,,,Yes
1,M&M Fahrzeugpflege,Friedrich-Ebert-Straße 9,32339,Espelkamp,DE,49,160955792800,,,,Yes
2,Chic Änderungsschneiderei,Venloer Straße 503,50825,Köln,DE,49,221169339060,,,,Yes
3,Miss Döner,Carl-von-Ossietzky-Platz 1,20099,Hamburg,DE,49,17465289010,,,,No
4,Kumpir Haus Ehrenfeld,Venloer Straße 378,50825,Köln,DE,49,155101395210,,,,Yes


#### 7. Check for duplicates and Missing Values

- Duplicate rows and missing values help identify bad data and unrecoverable records

In [1950]:
# Germany
reordered_data_de.duplicated().sum()

0

In [1645]:
missing_values = reordered_data_de[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number']].isnull().sum()

In [1647]:
missing_values

firma           0
street          0
plz             0
city            0
country         0
country_code    0
local_number    0
dtype: int64

In [1952]:
# Austria
reordered_data_at.duplicated().sum()

0

In [1954]:
# Switzerland
reordered_data_ch.duplicated().sum()

0

In [1956]:
# Mixed
reordered_data_mixed.duplicated().sum()

2

In [1970]:
# Check where the duplicates are in the Mixed Dataset
duplicate_rows = reordered_data_mixed[reordered_data_mixed.duplicated(keep=False)]

In [1972]:
duplicate_rows

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,anrede,vorname,nachname,long_number_flag
261,,,,,,,,,,,No
724,,,,,,,,,,,No
871,,,,,,,,,,,No


#### 8. Removing Duplicates

- Checking for duplicates shows that the Mixed dataset contains rows that are completely empty
- Rows that are NA in every lead column will be deleted

In [2008]:
# Drop every row that has an exact duplicate. keep=False also removes the first
# occurrence, which is intended here because the only duplicates are the empty rows
reordered_data_mixed = reordered_data_mixed.drop_duplicates(keep=False)

In [2014]:
reordered_data_mixed.duplicated().sum()

0
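
Because `drop_duplicates(keep=False)` would also delete both copies of a genuinely duplicated lead, a more targeted alternative is to remove empty rows explicitly and then keep one copy of any real duplicate. The following is a sketch on hypothetical data: the duplicated "Miss Döner" row is invented for illustration, and `lead_columns` is an assumed subset of columns.

```python
import pandas as pd

# Columns that must contain at least one value for a row to count as a lead
lead_columns = ["firma", "street", "local_number"]

df = pd.DataFrame({
    "firma": ["Miss Döner", None, None, "Miss Döner"],
    "street": ["Carl-von-Ossietzky-Platz 1", None, None, "Carl-von-Ossietzky-Platz 1"],
    "local_number": ["17465289010", None, None, "17465289010"],
    "long_number_flag": ["No", "No", "No", "No"],
})

# Step 1: drop rows in which every lead column is missing
non_empty = df.dropna(how="all", subset=lead_columns)

# Step 2: collapse exact duplicates but keep the first copy of each lead
deduplicated = non_empty.drop_duplicates(keep="first")
```

With this order, the empty rows disappear because they are empty, not because they happen to be identical to each other.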

#### 9. Merging Datasets

- I use `concat` instead of `merge` because the DataFrames share the same columns
- The goal is one large dataset in which the individual datasets are stacked row by row
- I exclude the 'anrede', 'vorname' and 'nachname' columns from the merge to avoid conflicts and a large number of NA values
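
A small sketch of how `pd.concat` aligns columns may help here: any column that is absent from one input is filled with NaN for that input's rows. The miniature frames below are hypothetical. Leaving `long_number_flag` out of the Mixed subset would be one possible source of missing flags in the merged result.

```python
import pandas as pd

de = pd.DataFrame({
    "firma": ["AAS-Fink GmbH"],
    "local_number": ["1722086056"],
    "long_number_flag": ["No"],
})
mixed = pd.DataFrame({
    "firma": ["Miss Döner"],
    "local_number": ["17465289010"],
    "long_number_flag": ["No"],
    "vorname": [None],
})

shared = ["firma", "local_number"]

# The flag column is not selected from `mixed`, so its rows get NaN in that column
without_flag = pd.concat([de, mixed[shared]], ignore_index=True)

# Selecting the flag column as well keeps the information for every row
with_flag = pd.concat([de, mixed[shared + ["long_number_flag"]]], ignore_index=True)
```

Checking `isna().sum()` on the result right after concatenation shows quickly whether a column was dropped from one of the inputs by accident.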

In [2028]:
# merging all 4 datasets
merged_dataset = pd.concat([reordered_data_de, 
                            reordered_data_at,
                            reordered_data_ch, 
                            reordered_data_mixed[['firma', 'street', 'plz', 'city', 'country', 'country_code', 'local_number']]],
                           ignore_index=True)

In [2037]:
merged_dataset.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
0,Abschleppdienst Arnolds,Völlesbruchstrasse 19,52152,Simmerath,DE,49,1778754883,No
1,AAS-Fink GmbH,Morsbach 39,42857,Remscheid,DE,49,1722086056,No
2,Allfolia Deutschland GmbH,Morsbach 39,42857,Remscheid,DE,49,1734636476,No
3,Autohaus Hentschel GmbH,Vahrenwalder Str. 141,30165,Hannover,DE,49,117001186221169,Yes
4,Autohaus Schmohl GmbH,Potsdamer Str. 175,14469,Potsdam,DE,49,1604666050,No


In [2055]:
merged_dataset.shape[0]

4746

In [2075]:
# check for missing data
missing_values = merged_dataset.isnull().sum()


In [2077]:
missing_values

firma                533
street                 0
plz                  543
city                 579
country                2
country_code           2
local_number          53
long_number_flag    3589
dtype: int64

#### 10. Handling Missing Data in the Merged Dataset

- I address missing values directly in the merged dataset to ensure data completeness and quality
- By focusing on the merged dataset, I streamline the cleaning process and avoid duplicating efforts across multiple individual datasets
- Many rows in the merged dataset lack company names and addresses, with only country codes and local numbers, many of which are invalid
- It's impractical to recover businesses using incomplete or inaccurate data, so I will drop these rows to maintain data quality and streamline the dataset.
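
The two filtering steps that follow (missing company name and the placeholder 'NN') can also be expressed as a single boolean mask. This is a sketch on hypothetical rows. `PLACEHOLDER_NAMES` is an assumed name, and treating whitespace-only names as missing is an extra assumption that goes beyond the original cells.

```python
import pandas as pd

PLACEHOLDER_NAMES = {"NN"}  # assumption: values that stand in for "no company name"

df = pd.DataFrame({
    "firma": ["Exalack GmbH", None, "NN", "   "],
    "local_number": ["941526342474", "1766933301", None, "1602104553"],
})

# Normalise names: missing values become empty text, surrounding spaces are removed
names = df["firma"].fillna("").str.strip()

# A row is kept only if it has a non-empty name that is not a placeholder
has_name = (names != "") & ~names.isin(PLACEHOLDER_NAMES)

kept = df[has_name]
dropped_count = int((~has_name).sum())
```

Recording `dropped_count` alongside the filter makes it easy to report how many leads were removed and why.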

In [2083]:
# Drop rows where 'firma' (company name) is missing
merged_dataset_cleaned = merged_dataset.dropna(subset=['firma'])

In [2139]:
# dropping rows where 'firma' appears as 'NN'
merged_dataset_cleaned = merged_dataset_cleaned[merged_dataset_cleaned['firma'] != 'NN']

In [2141]:
merged_dataset_cleaned.shape[0]

4126

#### 11. Saving Cleaned Datasets to Excel File & CSV File

- I saved the cleaned datasets as an Excel file with multiple sheets to preserve the original structure and keep the datasets easily accessible in one file for future use.

In [2124]:
# Write each cleaned dataset to its own sheet in one workbook
with pd.ExcelWriter("cleaned_datasets.xlsx") as writer:
    reordered_data_de.to_excel(writer, sheet_name="DE", index=False)
    reordered_data_at.to_excel(writer, sheet_name="AT", index=False)
    reordered_data_ch.to_excel(writer, sheet_name="CH", index=False)
    reordered_data_mixed.to_excel(writer, sheet_name="Mixed", index=False)

In [2118]:
# Save the cleaned dataset to a CSV file so it can be read in the second notebook
merged_dataset_cleaned.to_csv('final_lead_dataset.csv', index=False)
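
One point to keep in mind when the CSV is read back: without explicit dtypes, pandas infers digit-only columns as numbers, which drops leading zeros in postal codes and turns phone numbers into integers. The sketch below uses an in-memory buffer instead of a file, and the sample row is hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "plz": ["09117"],
    "country_code": ["49"],
    "local_number": ["1778754883"],
})

# Write to an in-memory buffer; a real file path behaves the same way
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: digit-only columns are inferred as integers
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Explicit dtypes keep the identifier-like columns as text
buffer.seek(0)
text_columns = {"plz": str, "country_code": str, "local_number": str}
as_text = pd.read_csv(buffer, dtype=text_columns)
```

Passing the same `dtype` mapping in the second notebook would keep these columns comparable with the cleaned DataFrames produced here.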

In [2146]:
# check missing values
missing_phone = merged_dataset_cleaned['local_number'].isna().sum()

In [2148]:
missing_phone

50

In [2150]:
missing_street = merged_dataset_cleaned['street'].isna().sum()

In [2152]:
missing_street

0

### Summary of Data Quality Before and After Cleaning

### Data Quality Before Cleaning:

The datasets contained a variety of issues, including inconsistent phone number formats (some with special characters, varying country codes like +49 and 0049), missing values in critical fields like phone numbers and addresses, and duplicate entries.
Some datasets lacked consistent column structures (e.g., the mixed dataset had different column names such as landesvorwahl).
The presence of incomplete or empty rows reduced the overall reliability of the data, which could have impacted downstream analysis and lead tracking.


### Data Quality After Cleaning:

All phone numbers were standardised by removing special characters, extracting country codes, and stripping leading zeros from local numbers; local numbers longer than 11 digits were flagged for manual review.
Rows without a company name (missing or the placeholder 'NN') were dropped as unrecoverable, which reduced the merged dataset from 4746 to 4126 rows. Other gaps, such as the 50 missing phone numbers, were left in place so those leads can be enriched later.
Duplicate rows, especially those with completely missing information, were identified and removed to improve data consistency.
All datasets were cleaned, structured uniformly, and merged into a single, consolidated dataset with consistent columns across all regions (DE, AT, CH, Mixed). This improved data quality and ensured the data was ready for downstream processes, such as uploading to the database.
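
To back a before/after comparison with numbers, a few simple metrics can be computed on both versions of a dataset. The following is a minimal sketch on hypothetical data. `quality_report` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, name: str) -> pd.Series:
    """Summarise row count, exact duplicate rows, and rows with any missing value."""
    return pd.Series(
        {
            "rows": len(df),
            "duplicate_rows": int(df.duplicated().sum()),
            "rows_with_missing": int(df.isna().any(axis=1).sum()),
        },
        name=name,
    )

# Hypothetical data: two empty rows and one placeholder company name
before = pd.DataFrame({
    "firma": ["A", None, None, "NN"],
    "local_number": ["1", None, None, "2"],
})

# Apply the same two filters used in the cleaning step
after = before.dropna(subset=["firma"])
after = after[after["firma"] != "NN"]

# One column per stage, one row per metric
report = pd.concat(
    [quality_report(before, "before"), quality_report(after, "after")],
    axis=1,
)
```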