# Data Enhancement

## Objective

Some records in the merged dataset are incomplete: many are missing phone numbers.
In this phase, I try to retrieve the missing phone numbers, first through web scraping and then through the Google Places API.

### Importing Libraries

In [7]:
import pandas as pd

In [82]:
# Load the cleaned dataset from the CSV file in order to access it in this Notebook
merged_dataset = pd.read_csv('final_lead_dataset.csv')

#### 1. Filter the missing phone numbers

- I identify which businesses in the dataset have missing phone numbers.
- This lets me focus only on the incomplete records that need to be enriched with valid phone numbers, which streamlines the data enhancement process.

In [84]:
# Filter rows where 'local_number' is missing
missing_phone_numbers = merged_dataset[merged_dataset['local_number'].isna()]

In [86]:
missing_phone_numbers.head()

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
937,Aktiv & Vital Forum Isabella Wulz e. U.,Gewerbestraße 8,9330,Althofen,AT,43.0,,No
938,Mütterstudio Tulln,Karl Metz-Gasse 22,3430,Tulln an der Donau,AT,43.0,,No
945,Orthopädie Schuhtechnik Nindl GmbH,Kirchenstraße 36,5733,Bramberg am Wildkogel,AT,43.0,,No
946,A.Novak GmbH,Wimmergasse 7,1050,Wien,AT,43.0,,No
1048,Buschenschank Luttenberger,Seibersdorf 20,8423,St. Veit in der Südsteiermark,AT,43.0,,No


In [88]:
number_of_missing_phone = merged_dataset['local_number'].isna().sum()

In [90]:
number_of_missing_phone

50

### 2. Select 26 Businesses Randomly to Scrape

- From the **50** filtered records with missing phone numbers, I select a random sample of 26 businesses.
- This smaller subset helps test the scraping process and check whether the missing data can be retrieved.
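Because `sample(26)` draws a different subset on every run, the sampled businesses change whenever the notebook is re-executed. The following is a minimal sketch, assuming a repeatable draw is wanted: passing `random_state` fixes the sample. The seed value `42` and the toy frame `toy_missing` are arbitrary placeholders, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for missing_phone_numbers (50 rows, hypothetical names)
toy_missing = pd.DataFrame({"firma": [f"Business {i}" for i in range(50)]})

# A fixed seed makes the 26-row sample identical on every run
sample_a = toy_missing.sample(26, random_state=42)
sample_b = toy_missing.sample(26, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```

In the notebook itself, the equivalent call would be `missing_phone_numbers.sample(26, random_state=42)`.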

In [96]:
businesses_to_scrape = missing_phone_numbers.sample(26)

In [98]:
businesses_to_scrape

Unnamed: 0,firma,street,plz,city,country,country_code,local_number,long_number_flag
1063,Lugeck,Lugeck 4,1010,"Wien,",AT,43.0,,No
1054,MALAT Weingut GmbH & Co KG,Hafnerstraße 12,3511,Palt,AT,43.0,,No
1557,BLUME2000 im Famila Bad Segeberg,Eutiner Straße 45,23795,Klein Rönnau,,,,
4101,Aitech Engineering GmbH,Sindelfinger Str. 108,71069,Sindelfingen,DE,49.0,,
1155,Milchbar - das Gaumenfreudenhaus,Kappelergasse 16,8001,Zürich,CH,41.0,,No
1152,Babu's,Löwenstrasse 1,8001,Zürich,CH,41.0,,No
4208,dck media GmbH,Von-Steuben-Str. 18,48143,"Münster, Westfalen",DE,49.0,,
4098,Studio Hamburg Enterprises GmbH,Freigrafenweg 2,44357,Dortmund,DE,49.0,,
1156,Kafi Freud,Schaffhauserstrasse 118,8057,Zürich,CH,41.0,,No
1484,César,Baumgartenstraße 3,26122,Oldenburg,DE,49.0,,


#### 3. Install Web Scraping Libraries

Web scraping requires Python libraries for fetching and parsing web pages; here these are `requests` and `beautifulsoup4`.

In [31]:
!pip install requests beautifulsoup4


### Import Libraries for Web Scraping

In [34]:
import requests
from bs4 import BeautifulSoup

### 4. Scraping the Das Örtliche Website

- The goal here was to scrape phone numbers from the Das Örtliche website.
- However, I encountered an error: `AttributeError: 'NoneType' object has no attribute 'find'`.
- According to my research, this occurs because the expected HTML element was not found on the page, likely due to JavaScript rendering or an incorrect element selector. `soup.find(...)` then returns `None`, and the chained `.find("span")` call fails.
- A browser automation tool such as Selenium would be needed to render the page first, so the code in this section is left commented out.
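The `AttributeError` itself can be avoided by checking each lookup before chaining the next one. The sketch below is only an illustration of that guard: the class name `phone block` is taken from the commented-out cell, while the helper name `extract_phone` and the two HTML snippets are hypothetical. It does not solve the JavaScript rendering issue; it only turns a crash into a `None` result.

```python
from bs4 import BeautifulSoup


def extract_phone(html):
    """Return the phone text, or None when the expected elements are absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Class name copied from the commented-out cell; the real page may differ
    phone_block = soup.find("div", {"class": "phone block"})
    if phone_block is None:
        return None
    span = phone_block.find("span")
    if span is None:
        return None
    return span.get_text(strip=True)


# Hypothetical HTML snippets for illustration only
with_phone = '<div class="phone block"><span> 01234 56789 </span></div>'
without_phone = '<div class="result"></div>'

print(extract_phone(with_phone))
print(extract_phone(without_phone))
```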


In [None]:
# Initialize a list to store company names and phone numbers for review
#phone_numbers_for_review = []

# Loop through the 26 businesses to scrape phone numbers
#for row_index, row in businesses_to_scrape.iterrows():
    #company_name = row['firma']  # Assuming 'firma' is the column for business names
    #city = row['city']  # Assuming 'city' is available

    # Construct the search URL
    #search_url = f"https://www.dasoertliche.de/?kw={company_name}&ci={city}"  # Add city for better accuracy
    
    # Send the HTTP request
    #response = requests.get(search_url)
    
    # Check if the request was successful
    #if response.status_code == 200:
        # Parse the HTML content
        #soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the phone number in the HTML (adjusted to your specified structure)
        #phone_number_tag = soup.find("div", {"class": "phone block"}).find("span")
        
        # If a phone number is found, add it to the list for review
        #if phone_number_tag:
            #phone_number = phone_number_tag.get_text(strip=True)
            #phone_numbers_for_review.append((company_name, phone_number))  # Store company name and phone number
    #else:
        #print(f"Failed to retrieve data for {company_name}")

# Check the retrieved phone numbers for review
#for company, phone in phone_numbers_for_review:
    #print(f"Company: {company}, Phone: {phone}")


### 5. Using Google Places API to Retrieve Business Information

- Since web scraping ran into these limitations, I changed course and used the **Google Places API**, a reliable alternative for gathering business information such as phone numbers.
- This allows me to retrieve structured data directly from Google's database.

In [40]:
import requests

#### 6. Retrieve the place_id for the Business Using the Find Place from Text API

- I use the Find Place from Text API to retrieve a unique place_id for each business, based on its name and location (postcode and country).
- The **place_id** is an identifier that helps query Google for detailed information about a specific business.
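Business names in this dataset can contain characters such as `&` (for example "MALAT Weingut GmbH & Co KG"), which act as parameter separators when placed unescaped into a query string. A minimal sketch, assuming the same endpoint and fields as the cells in this section: the hypothetical helper `build_find_place_url` percent-encodes every parameter with the standard library. The location value `"Austria"` is an illustrative assumption.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

FIND_PLACE_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"


def build_find_place_url(business_name, plz, location, api_key):
    """Build the Find Place request URL with every parameter percent-encoded."""
    params = {
        "input": f"{business_name} {plz} {location}",
        "inputtype": "textquery",
        "fields": "place_id,name,formatted_address",
        "key": api_key,
    }
    return f"{FIND_PLACE_URL}?{urlencode(params)}"


url = build_find_place_url("MALAT Weingut GmbH & Co KG", "3511", "Austria", "API-Key")

# Decoding the query string shows the full name survives as a single value
print(parse_qs(urlsplit(url).query)["input"][0])
```

With `requests`, passing the same dictionary as `requests.get(FIND_PLACE_URL, params=params)` applies equivalent encoding without building the URL by hand.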

In [50]:
api_key = "API-Key"
business_name = "A. Koch GmbH"
location = "Germany"
plz = "09224"

# URL to find place ID for the business
url = f"https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input={business_name} {plz} {location}&inputtype=textquery&fields=place_id,name,formatted_address&key={api_key}"


response = requests.get(url)
data = response.json()

# Print the retrieved data, including place_id
print(data)

{'candidates': [{'formatted_address': 'Insterburger Str. 4, 37574 Einbeck, Deutschland', 'name': 'A. Koch GmbH, Straßen- und Tiefbau', 'place_id': 'ChIJzwYm1VDOukcRCPGXCh3McLU'}], 'status': 'OK'}


#### 7. Use the Place Details API to Retrieve Phone Numbers

- With the place_id obtained from the previous step, I query the **Place Details API** to get detailed business information, including phone numbers. 
- This step completes the data enhancement process by filling in the missing phone numbers in the dataset/database
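The response handling can also be isolated in a small function, which makes it testable without calling the API. This is a sketch under assumptions: `extract_phone_number` is a hypothetical helper, and the three dictionaries are hand-written examples modelled on the `status` and `result` fields that the cell in this section reads. Returning `None` instead of a placeholder string keeps missing values detectable with `isna()` later on.

```python
def extract_phone_number(details):
    """Return the formatted phone number, or None if the response has none."""
    if details.get("status") != "OK":
        return None
    return details.get("result", {}).get("formatted_phone_number")


# Hand-written example responses (not real API output)
ok_response = {"status": "OK", "result": {"formatted_phone_number": "05561 94940"}}
no_phone = {"status": "OK", "result": {}}
failed = {"status": "NOT_FOUND"}

print(extract_phone_number(ok_response))
print(extract_phone_number(no_phone))
print(extract_phone_number(failed))
```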

In [53]:
place_id = 'ChIJzwYm1VDOukcRCPGXCh3McLU'  # The place_id retrieved earlier
api_key = 'API-Key'

# Place Details API endpoint
place_details_url = f"https://maps.googleapis.com/maps/api/place/details/json?place_id={place_id}&fields=formatted_phone_number&key={api_key}"

response = requests.get(place_details_url)
details = response.json()

# Check the response
if details['status'] == 'OK':
    phone_number = details['result'].get('formatted_phone_number', 'No phone number found')
    print(f"Phone number: {phone_number}")
else:
    print(f"Error: {details['status']}")

Phone number: 05561 94940


#### 8. Automated Retrieval of Missing Phone Numbers and Addresses Using Google Places API

- I first tested the Find Place from Text API and the Place Details API on a business that already had complete information in the database, as demonstrated above, to ensure the APIs worked as expected.

- After confirming this, I loop through the business names to automate the retrieval of missing phone numbers and addresses for the 50 businesses in the dataset.
- This process reduces manual work and improves data quality by filling in the missing values efficiently.

In [100]:
# List of businesses with missing phone numbers
businesses_to_scrape = merged_dataset[merged_dataset['local_number'].isna()]

# list to store the results
api_results = []

# Get the 'firma' and 'city' columns from the dataset directly
company_names = businesses_to_scrape['firma'].tolist()
cities = businesses_to_scrape['city'].tolist()

# Loop through all businesses with a missing phone number (50 here)
for company_name, city in zip(company_names, cities):
    
    # First API: Find Place from Text (to get place_id)
    search_url = f"https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input={company_name}%20{city}&inputtype=textquery&fields=place_id,name,formatted_address&key={api_key}"
    search_response = requests.get(search_url)
    search_data = search_response.json()
    
    if search_data['status'] == 'OK':
        place_id = search_data['candidates'][0]['place_id']
        address = search_data['candidates'][0]['formatted_address']
        
        # Second API: Place Details (to get the phone number)
        details_url = f"https://maps.googleapis.com/maps/api/place/details/json?place_id={place_id}&fields=formatted_phone_number&key={api_key}"
        details_response = requests.get(details_url)
        details_data = details_response.json()
        
        if details_data['status'] == 'OK':
            phone_number = details_data['result'].get('formatted_phone_number', 'Phone number not found')
        else:
            phone_number = 'Phone number not found'
        
        # Store the results
        api_results.append({
            'company_name': company_name,
            'address': address,
            'phone_number': phone_number
        })
    else:
        print(f"No results found for {company_name}")


No results found for Brigid Innovation Gmbh


In [101]:
api_results

[{'company_name': 'Aktiv & Vital Forum Isabella Wulz e. U.',
  'address': 'Klosterstraße 112, 40211 Düsseldorf, Deutschland',
  'phone_number': '0211 30154854'},
 {'company_name': 'Mütterstudio Tulln',
  'address': 'Karl-Metz-Gasse 22, 3430 Tulln an der Donau, Österreich',
  'phone_number': '0676 7553692'},
 {'company_name': 'Orthopädie Schuhtechnik Nindl GmbH',
  'address': 'Kirchenstraße 36, 5733 Bramberg am Wildkogel, Österreich',
  'phone_number': '06566 7493'},
 {'company_name': 'A.Novak GmbH',
  'address': 'Wimmergasse 7, 1050 Wien, Österreich',
  'phone_number': '01 5444988'},
 {'company_name': 'Buschenschank Luttenberger',
  'address': 'Seibersdorf b. St. Veit 20, 8423 Seibersdorf bei St. Veit, Österreich',
  'phone_number': '0664 1457049'},
 {'company_name': 'Weinbau Hermine Kovacs',
  'address': 'Elisabethgasse 41, 7301 Deutschkreutz, Österreich',
  'phone_number': '0664 3151569'},
 {'company_name': 'MALAT Weingut GmbH & Co KG',
  'address': 'Palt AT, Hafnerstraße 12, 3511 Fu

In [104]:
# Out of the 50 businesses with missing phone numbers, 49 returned a place match
len(api_results)

49
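A place match is not necessarily the right business. The first entry in `api_results` lists an address in Düsseldorf, while the dataset records that company at Gewerbestraße 8, 9330 Althofen. One possible sanity check, sketched here as an assumption and not as part of the original workflow, is to confirm that the postcode from the `plz` column appears in the returned address before accepting the phone number. The helper name `address_contains_plz` is hypothetical.

```python
import re


def address_contains_plz(address, plz):
    """True if the postcode appears as a standalone number in the address."""
    return re.search(rf"\b{re.escape(str(plz))}\b", address) is not None


# Addresses copied from api_results, postcodes from the dataset
print(address_contains_plz("Klosterstraße 112, 40211 Düsseldorf, Deutschland", 9330))
print(address_contains_plz("Karl-Metz-Gasse 22, 3430 Tulln an der Donau, Österreich", 3430))
```

Postcodes with a leading zero would need to be loaded as strings for this comparison to work, since an integer column drops the zero.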

##### Handling Company Data that is NOT Found

- One business (Brigid Innovation Gmbh) was not found during the API calls, so it is left unchanged in the dataset for manual checking and updating later.


#### Updating Dataset with Retrieved Phone Numbers

- The final step is to update the dataset with the business phone numbers retrieved by the APIs.
- There are no missing addresses in my dataset, since most incomplete records were deleted earlier because they lacked a business name, city or postcode.
- Only the phone numbers will be updated.



In [111]:
# Loop through the api_results list and update only the missing phone numbers in the 'local_number' column
for result in api_results:
    # Skip results without a phone number so the placeholder text is not written into the data
    if result['phone_number'] == 'Phone number not found':
        continue

    # Find rows where the company name matches
    mask = (merged_dataset['firma'] == result['company_name'])

    # Update 'local_number' only if missing
    merged_dataset.loc[mask & merged_dataset['local_number'].isna(), 'local_number'] = result['phone_number']


#### Verify the Changes by Filtering for Updated Phone Numbers
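One way to verify the update is to compare the `local_number` column before and after, then count the values that are still missing. The sketch below runs on toy data: the frames `before` and `after` and their values are hypothetical stand-ins. In the notebook, this would assume a copy of `merged_dataset` was taken before the update loop.

```python
import pandas as pd

# Toy stand-ins for the dataset before and after the update (hypothetical values)
before = pd.DataFrame({
    "firma": ["A", "B", "C"],
    "local_number": [None, "01 5444988", None],
})
after = before.copy()
after.loc[after["firma"] == "A", "local_number"] = "0676 7553692"

# Rows that were missing before and are filled now
updated = after[before["local_number"].isna() & after["local_number"].notna()]

# Rows that still need manual checking
still_missing = after["local_number"].isna().sum()

print(updated["firma"].tolist())
print(still_missing)
```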

#### Save the Updated Dataset

In [116]:
# Save the updated dataset to a new file
merged_dataset.to_csv('final_lead_dataset_updated.csv', index=False)