Last Update: 6th October 2023

Status: No pending updates, ready to use

---

#### 1. Import Libraries (install any library that raises a "module not found" error)

In [1]:
from googlesearch import search
import pandas as pd
import tldextract
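
If one of the imports above fails, a quick check like the following can list what is missing before you install anything. This is a minimal sketch: the `missing_modules` helper is hypothetical, and the pip package names in the comment are assumptions (the `search(..., stop=1)` call used later matches the `googlesearch` module shipped in the `google` package on PyPI, but verify this against your own environment).

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Modules imported by this notebook
required = ["googlesearch", "pandas", "tldextract"]
missing = missing_modules(required)
if missing:
    # Assumed pip package names: google (provides googlesearch), pandas, tldextract
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Saving to Excel in the last step may additionally need an Excel writer engine such as `openpyxl`, depending on your pandas installation.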

#### 2. Read CSV (converting the file to CSV UTF-8 first is recommended)

<u>Read Me Before You Start Scraping</u>
1. To avoid file path issues, keep this "ipynb" file in the same folder as the Excel file that you wish to read.
2. This Python script only supports CSV files, so convert the Excel file to CSV UTF-8.
3. Ensure that the list of companies does not contain any duplicates. **[REMOVE DUPLICATES]**
4. Check whether any company with an empty website already has a record elsewhere in the dataset. If so, fill it in with the existing data.
    
    Note that each company should have only one website. Technical issues or human error can leave one company with multiple websites in the dataset. **[CLEAN THE DATASET FIRST BEFORE SCRAPING]**

5. Only include companies with no website in the scraping run, to save time. **[INCLUDE ONLY COMPANIES WITH BLANK WEBSITE]**
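
The cleaning steps above can also be done in pandas before exporting the list. The following is a rough sketch rather than part of the original workflow: the `Website` column name and the sample rows are assumptions made for illustration, so adjust them to match your dataset.

```python
import pandas as pd

# Hypothetical sample; 'Website' is an assumed column name
raw = pd.DataFrame({
    "Company_Name": ["Company A", "Company A", "Company B", "Company C"],
    "Website": [None, "https://www.example.com/", None, None],
})

# Fill a blank website from another record of the same company, if one exists
raw["Website"] = raw.groupby("Company_Name")["Website"].transform("first")

# Keep one row per company
cleaned = raw.drop_duplicates(subset="Company_Name")

# Keep only companies that still have no website, with a fresh 0..n-1 index
to_scrape = cleaned[cleaned["Website"].isna()].reset_index(drop=True)
print(to_scrape["Company_Name"].tolist())
```

Resetting the index matters here because the scraping loop further down looks companies up by position (`df[i]`).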

In [2]:
# Update the filename (without the ".csv" extension)
excel_file = "Test_Google_Domain_Finder"

df = pd.read_csv(excel_file + ".csv")
df = df['Company_Name']
df

0         The TJX Companies, Inc.
1       BJ's Wholesale Club, Inc.
2                      Fred Meyer
3           ADT Security Services
4                  Pier 1 Imports
5                            LCBO
6                    Michael Kors
7           ToysÃ¢â‚¬Å“RÃ¢â‚¬ÂUs
8             Lululemon Athletica
9                     Lucky Brand
10                         Fossil
11          ACC Facility Services
12                      Barcoding
13                         Conn's
14                   ProShip, Inc
15                 Shamrock Foods
16                        L'oreal
17                      Cosy Robo
18                       Coborn's
19                        ALANTRA
20                  buchheits.com
21       Alimentation Couche-Tard
22               American Freight
23                 channeladvisor
24                  Anthropologie
25                  Champs Sports
26                  Walmart, Inc.
27                  Neiman Marcus
28                     Tory Burch
29            

#### 3. Get URL

In [8]:
import time

domains = []
start = 0  # index of the next company to search

while start < len(df):
    try:
        for i in range(start, len(df)):
            # Take the first search result, or '-' if the search returns nothing
            url = next(iter(search(df[i], stop=1)), '-')
            domains.append(url)
            start = i + 1  # resume after this company if a later search fails
            print(f'Exporting Company no {i+1}: {df[i]}')
    except Exception:  # for example a network error or a blocked request
        print(f'Fail to export company no {i+1}: {df[i]}. Try Again')
        time.sleep(60)  # pause before retrying; the 60 s value is an assumption

print(f'Scraping complete, {len(domains)} out of {df.shape[0]} domains extracted. {"OK" if len(domains) == df.shape[0] else "NOT OK"}')

Exporting Company no 1: The TJX Companies, Inc.
Exporting Company no 2: BJ's Wholesale Club, Inc.
Exporting Company no 3: Fred Meyer
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
Fail to export company no 4: ADT Security Services. Try Again
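
Repeated failures like those in the output above often mean the search requests are being throttled or the connection is down, in which case immediate retries are unlikely to help. One common mitigation is to wait longer after each failure and give up after a fixed number of attempts. The sketch below illustrates that idea under stated assumptions and is not the script's actual behaviour: `fetch_with_backoff`, `max_retries`, and `base_delay` are hypothetical names, and `fetch` stands in for whatever function performs the search.

```python
import time

def fetch_with_backoff(fetch, query, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(query), doubling the wait after each failed attempt.

    Returns the result, or '-' once max_retries attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return fetch(query)
        except Exception as err:
            print(f"Attempt {attempt + 1} failed for {query!r}: {err}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    return "-"  # placeholder keeps the results aligned with the company list

# Demo with a stand-in fetch function that fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch(query):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "https://example.com/"

result = fetch_with_backoff(flaky_fetch, "Example Co", base_delay=0.01)
print(result)
```

In this notebook, `fetch` could be something like `lambda name: next(iter(search(name, stop=1)), '-')`, with a much larger `base_delay` than the one used in the demo.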

#### Previous Script (Backup)

In [4]:
# domains = []

# for i in range(0, len(df)):
#     for url in search(df[i], stop=1):
#         domains.append(url)
#         if not len(domains) == i+1:
#             domains.append('-')
#         print(f'Exporting Company no {i+1}: {df[i]}')

# print(f'Scraping complete, {i+1} out of {df.shape[0]} domain being extracted. {"OK" if i+1 == df.shape[0] else "NOT OK"}')

#### 4. Save results as Dataframe

In [12]:
# Combine the company names and the scraped URLs into one DataFrame
data = pd.DataFrame({'Company Name': df, 'URL': domains})
data

| | Company Name | URL |
|---|---|---|
| 0 | The TJX Companies, Inc. | https://www.tjx.com/ |
| 1 | BJ's Wholesale Club, Inc. | https://www.bjs.com/ |
| 2 | Fred Meyer | https://www.fredmeyer.com/ |
| 3 | ADT Security Services | https://www.adt.com/ |
| 4 | Pier 1 Imports | https://www.pier1.com/ |
| 5 | LCBO | https://www.lcbo.com/ |
| 6 | Michael Kors | https://www.michaelkors.com/ |
| 7 | ToysÃ¢â‚¬Å“RÃ¢â‚¬ÂUs | https://stackoverflow.com/questions/31671906/h... |
| 8 | Lululemon Athletica | https://shop.lululemon.com/ |
| 9 | Lucky Brand | https://www.luckybrand.com/ |


#### 5. Extract company domain

Reference: https://www.askpython.com/python/examples/extract-domain-name-from-url

In [13]:
# Parse each URL once, then pull out the three parts
parts = data["URL"].apply(tldextract.extract)
data["Subdomain"] = parts.apply(lambda p: p.subdomain)
data["Domain"] = parts.apply(lambda p: p.domain)
data["Suffix"] = parts.apply(lambda p: p.suffix)

#### 6. Label invalid domains (e.g. stackoverflow, facebook, twitter, linkedin, wikipedia)

In [14]:
invalid_domain = ["stackoverflow", "facebook", "twitter", "linkedin", "wikipedia"]

# 1 if the domain is in the invalid list, otherwise 0
data["Common_Error"] = data["Domain"].isin(invalid_domain).astype(int)
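
Once rows are flagged, it can help to pull them out for manual review. The following sketch uses a hypothetical three-row sample in place of the real `data` frame. It also treats a `-` placeholder in the `Domain` column as needing review, which is an assumption about how companies with no search result end up being represented.

```python
import pandas as pd

# Hypothetical sample standing in for the real `data` frame
sample = pd.DataFrame({
    "Company Name": ["Company A", "Company B", "Company C"],
    "Domain": ["example", "stackoverflow", "-"],
})
invalid_domain = ["stackoverflow", "facebook", "twitter", "linkedin", "wikipedia"]

sample["Common_Error"] = sample["Domain"].isin(invalid_domain).astype(int)

# Rows to check by hand: flagged domains plus companies with no result ('-')
needs_review = sample[(sample["Common_Error"] == 1) | (sample["Domain"] == "-")]
print(needs_review["Company Name"].tolist())
```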

#### 7. Save to Excel

In [15]:
# Update the output filename if needed
new_File = excel_file + "_Output"

# Write the results to an Excel workbook without the index column
data.to_excel(new_File + ".xlsx", index=False)