## Generic Web Scraper - Core Offerings (Products & Solutions)

__Summary__: This is a generic web scraper to extract details related to Products, Solutions, Services, Platforms & Clients of companies

__The scraper works as follows__:
- Input file containing the URLs of companies is read
- Selenium opens each URL in a headless Chrome browser and extracts the page source
- The source html data is parsed using BeautifulSoup
- Required data is then extracted, filtered and processed
  - Identify Stop words and Stop phrases
- Final output is saved in an output file 
   - Errors are also saved in an error file

__Next steps__: 
- The current scraper uses hardcoded keyword rules to extract products, solutions and services. Your task is to modify the scraper so that it uses AI to extract the products, solutions and services instead
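
One possible shape for that AI-based step is sketched below. It is only an illustration: `build_extraction_prompt` and `parse_extraction` are hypothetical helper names, the prompt wording is an assumption, and the call to an actual model is left out because the notebook does not say which model or API should be used.

```python
import json

CATEGORIES = ["products", "solutions", "services"]

def build_extraction_prompt(company_name, page_text, max_chars=4000):
    """Build an instruction asking a language model for a JSON answer (hypothetical prompt)."""
    snippet = page_text[:max_chars]  # keep the prompt within a bounded size
    return (
        f"From the website text of {company_name}, list the company's "
        f"{', '.join(CATEGORIES)}.\n"
        "Answer with a JSON object whose keys are exactly "
        f"{CATEGORIES} and whose values are lists of short names.\n\n"
        f"Website text:\n{snippet}"
    )

def parse_extraction(model_reply):
    """Parse the model's reply, tolerating invalid JSON and missing keys."""
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for key in CATEGORIES:
        value = data.get(key, [])
        if not isinstance(value, list):
            value = []
        # Keep only string entries so the output columns can be joined later
        result[key] = [v.strip() for v in value if isinstance(v, str)]
    return result
```

The parsed lists could then replace the keyword-matched `products`, `solutions` and `services` lists before the output row is written.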

In [1]:
!pip install selenium
!pip install webdriver_manager

In [2]:
# Import necessary libraries

import re
import string
import time

import bs4
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


%run Linkedinscraping.ipynb



 - Function to download the browser and parse the html page source

In [3]:
from selenium.webdriver.chrome.service import Service

def download_browser(my_url):
    """Open my_url in headless Chrome and return (browser, parsed page source)."""
    options = webdriver.ChromeOptions()

    # Run Chrome in incognito mode without opening a browser window (headless)
    # and ignore certificate errors
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')

    # Selenium 4 takes the driver path through a Service object; passing the
    # path positionally is deprecated
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    browser.get(my_url)
    # Pass the page source to Beautiful Soup for parsing
    page_soup = BeautifulSoup(browser.page_source, 'lxml')
    return (browser, page_soup)



In [4]:
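# Summarization model inputs

import os
import requests

API_URL = "https://api-inference.huggingface.co/models/Capstone/autotrain-summarization-capstone-mode-2151969392"

# Read the Hugging Face token from the environment instead of hardcoding it in the notebook.
# HF_API_TOKEN is an assumed variable name; set it before running this cell.
headers_summarization = {"Authorization": f"Bearer {os.environ.get('HF_API_TOKEN', '')}"}

def query(payload):
    """POST the payload to the summarization endpoint and return the decoded JSON reply."""
    response = requests.post(API_URL, headers=headers_summarization, json=payload)
    return response.json()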
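
A hosted endpoint may answer with an error payload instead of a summary, for example while the model is still loading. The wrapper below is a minimal sketch of retrying in that case. It assumes an error reply is a dict with an `error` key and an optional `estimated_time` in seconds, and that a successful reply is a list; `query_with_retry` and its parameters are hypothetical names, not part of the notebook.

```python
import time

def query_with_retry(query_fn, payload, max_tries=3, default_wait=5.0, sleep=time.sleep):
    """Call query_fn(payload) until it returns a list of results or tries run out.

    Assumption: an error reply is a dict with an "error" key and, optionally,
    an "estimated_time" in seconds; a successful reply is a list.
    """
    reply = None
    for attempt in range(max_tries):
        reply = query_fn(payload)
        if isinstance(reply, list):
            return reply
        if attempt < max_tries - 1:
            wait = default_wait
            if isinstance(reply, dict):
                wait = float(reply.get("estimated_time", default_wait))
            sleep(wait)  # wait before the next attempt
    return reply  # last reply, still an error payload
```

With the `query` function from the cell above, a call would look like `query_with_retry(query, {"inputs": text})`.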

- Function to initialize the lists. Will be called for every URL

In [5]:
def init_lists():
    products = []
    solutions = []
    services = []
    platforms = []
    clients = []
    page_text = []
    overviewpage_link = []
    linkedinpage_link = []
    overview_text = []
    
    return (products, solutions, services, platforms, clients, page_text, overviewpage_link, linkedinpage_link, overview_text)

- Specify Input / Output files
- Read the Input file and create a list of Company name and URL

In [6]:
in_file = r'companies data.xlsx'
out_file = r'webscraping output.csv'
out_file_err = r'error file.csv'

header = ['Company Name', 'Link', 'Products', 'Solutions', 'Services', 'Platforms', 'Clients', 'Page_Text', 'overviewpage_Link', 'linkedinpage_Link','overview_Text','Overview_linkedIn','Industry_linlkedIn','Specialities_linkedIn','Company_Overview_Summary']
header_err = ['Company Name', 'Link', 'Error']

# pandas opens and closes the Excel file itself, so no separate file handle is needed
df = pd.read_excel(in_file)

input_list = df[['Company Name', 'Link']].values.tolist()

In [7]:
df

Unnamed: 0,Company Name,Link
0,"Simplify Healthcare, Inc.",https://simplifyhealthcare.com
1,Medecision,https://www.medecision.com/
2,Ingenious Med Inc.,https://ingeniousmed.com/


 - Build the output row for each URL and append it to the output list

In [8]:
def write_file():
    """Append one output row for the current URL, joining each list with ' | '."""
    columns = [products_clean, solutions_clean, services_clean, platforms_clean, clients_clean,
               page_text_clean, overviewpage_link_clean, linkedinpage_link_clean,
               overview_text_clean, overviewlk1, industrylk1, specialitieslk1,
               Company_overview_summary]
    out.append([comp_name, link] + [' | '.join(column) for column in columns])
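
The row layout can also be expressed as a pure function that receives its inputs as arguments instead of reading notebook globals. This is a small sketch of that alternative; `build_row` is a hypothetical name and is not used elsewhere in the notebook.

```python
def build_row(comp_name, link, columns, sep=' | '):
    """Return one output row: name, link, then each column's values joined by sep."""
    return [comp_name, link] + [sep.join(values) for values in columns]
```

An empty list becomes an empty string, which is how empty cells end up in the output CSV.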

 - Write errors encountered during runtime for any input URL into error file

In [9]:
def write_error_file():
    """Append one error row (company name, link, exception) for the current URL."""
    out_err.append([comp_name, link, e])

 - Define the Keywords for output file columns, Stop Words and Stop Phrases

In [10]:
products_key = ['product', 'software', 'consult']

solutions_key = ['solution', 'care', 'value', 'industry', 'offer', 'practice', 'capabili', 'virtual', 
                 'monitor', 'real', 'clinical', 'surveillance', 'what', 'how', 'help', 'systems']

services_key = ['service']

platforms_key = ['platform', 'oprx']

clients_key = ['client', 'partner', 'serve']

overviewlinks_key = ['about', 'company']

linkedin_key = 'linkedin.com/company'

# Add stop words in lower case
stop_words = ['read report', 'careers', 'industry report', 'read the case study', 'read the report', 'join our team', 
              'overview', 'partners', 'customers', 'success stories', 'blogs', 'menu', 'previous', 'next', 
              'events & webinars', 'education & training', 'white paper', 'research paper', 'resources', 'terms of use', 
              'news', 'podcasts', 'request a demo', 'get a demo', 'our manifesto', 'schedule a demo', 'request demo',
              'cookie policy', 'blog', 'events', 'webinars', 'downloads', 'about us', 'about', 'advisors', 'contact', 
              'privacy', 'schedule a free demo', 'login', 'read now', 'press releases', 'flip through our magazine', 
              'brochures', 'case studies', 'case study', 'ebooks', 'infographics', 'magazines', 'videos', 'white papers', 
              'discover all our resources', 'learn more about who we are', 'executive team', 'board of directors', 
              'regulatory and compliance', 'see more demos', 'watch the video', 'listen to the podcast', 'news & events', 
              'brand', 'developer network', 'view webinar', 'read story', 'view event', 'log in', 'log out',
              'start the conversation now', 'code of ethics', 'privacy policy', 'terms and conditions', 'skip to content', 
              'home', 'request your demo today', 'read more', 'faq & help center', 'sign in', 'sign out', 'learn more', 
              'get in touch', 'view testimonials', 'terms of service', 'contact us', 'new registration click here', 
              'sites', 'sumo', 'faqs', 'faq', 'terms & conditions', 'site map', 'sitemap', 'customer stories', 
              'support', 'back to top', 'help', 'connect', 'news & blog', 'log in / sign up', 'scroll to top', 'follow', 
              'legal & privacy', 'close menu', 'terms of use and privacy policy', 'email', 'call', 'terms', 'company', 
              'jobs', 'manifesto', 'awards', 'news / events', 'join us', 'meet the team', 'rss', 'get started', 'tour', 
              'read article', 'read more reviews', 'accessibility', 'i understand', 'poland', 'switzerland', 
              'asia pacific', 'czechia', 'denmark', 'germany', 'norway', 'middle east & africa', 'ireland', 'italy', 
              'netherlands', 'france', 'brazil', 'united states', 'austria', 'turkey', 'sweden', 'canada', 'india', 
              'belgium', 'south africa', 'benelux', 'russia', 'united kingdom', 'romania', 'spain', 'slovakia', 
              'australia', 'argentina', 'chile', 'colombia', 'ecuador', 'mexico', 'peru', 'panama', 'portugal', 
              'new zealand', 'online courses', 'to the top', 'to top', 'book an appointment', 'our story', 'view demo', 
              'tutorials', 'demo', 'watch now', 'shop online', 'live demo', 'view demo', 'schedule demo', 'open positions',
              'livechat', 'brochure', 'book a demo', 'call us', 'find us', 'deutsch', 'french', 'japanese', 'portuguese', 
              'english', 'chinese', 'nederlands', 'request a quote', 'accept', 'decline','requestdemotoday', 'history',
              'read the study', 'read the post', 'my account', 'customer portal', 'share your experience', 'back', 
              'shop', 'back to dashboard', 'demos', 'log-in', 'uk & europe', 'asia-pacific', 'register', 'reach us', 
              'term of use', 'help center', 'chat', 'more', 'costa rica', 'croatia', 'czech republic', 'egypt', 'finland', 
              'hong kong', 'hungary', 'indonesia', 'israel', 'japan', 'korea', 'lithuania', 'luxembourg', 'malaysia',
              'morocco', 'nigeria', 'norway', 'philippines', 'qatar', 'saudi arabia', 'singapore', 'thailand', 'taiwan',
              'tunisia', 'united arab emirates', 'vietnam', 'singapore', 'skype', 'whatsapp', 'glossary', 'search'] 

stop_strings = ['download', 'webinar', 'case stud', 'explore', 'whitepaper', 'reviews', 'disclosure', 'register now',
                'press', 'click', 'login', 'blog', 'about', 'scroll', 'council', 'news', 'sign up', 'contact us', 'china',  
                'questions', 'feedback', 'free', 'read story', 'conversation', 'our story', 'terms of use', 'privacy', 
                'cookie', 'view all', 'navigation', 'legal', "let's talk", "let's connect", 'watch video', 'awards', 'faq',
                'leadership','get started', 'more info', 'more details', 'contact support', 'twitter', 'facebook', 
                'youtube', 'linkedin', 'wechat', 'pinterest', 'instagram', 'subscribe', 'try now', 'get a quote', 'http',  
                'javascript', 'my personal information', 'hiring', 'skip to','infographic', 'technical support', 'email us',
                ' more', 'follow us', 'terms of service', 'conditions', 'user agreement', 'not yet registered', 'show me',
                'forgot password', 'careers', 'jobs', 'live chat', 'tour today', 'sign in', 'sign out', 'help guide', 
                'question', 'testimonials', 'for a demo', 'schedule a call', 'google', 'instant demo', 'english', 
                'how it works', 'reference', 'vacanc', 'apply','demo today', 'open position', 'video', 'book demo',
                'item', 'software demo', 'engaging', 'tweet', 'online review', 'learn more', 'e-mail', 'contact form', 
                'buy now', 'open menu', 'full team', 'article', 'social media', 'polic', 'client portal', 'speaker', 
                'customer support', 'code of conduct', 'a demo', 'out more', 'north america', 'podcast', 'help center', 
                'create account', 'chat with us', 'request pricing', 'personal demo', 'acknowledge', 'github', 'android',
                'ios', 'window', 'messenger', 'continue', 'disclaimer', 'meeting', 'go to', 'your demo', 'talk to', 
                'send message', 'we are', 'a message', 'help desk', 'sign-in', 'sign-out', 'contact sales', 'to top',
                '@', r'\.com', r'www\.', '__', '{', '}', r'\..', r'\[#', r'\.js']

# Note: In reg-ex for using special characters like '.' or '[' or '#' as search string, always use escape character '\'
#       before the actual search string
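
For stop strings that are meant as plain text, `re.escape` can do the escaping instead of hand-written backslashes. The sketch below covers only literal strings (a deliberate pattern such as `\..` still has to be written as a regular expression); `literal_stops` and `has_stop_string` are illustrative names.

```python
import re

# Plain-text stop strings, written without manual escaping
literal_stops = ['.com', 'www.', '[#', '.js']

# re.escape turns each string into a pattern that matches it literally
pattern = re.compile('|'.join(re.escape(s) for s in literal_stops))

def has_stop_string(text):
    """Return True when any literal stop string occurs in the text."""
    return pattern.search(text.lower()) is not None
```

Without escaping, the pattern `.com` also matches inside a word such as "telecom", because `.` stands for any character.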

Main processing section of code:
- For each URL in the input file:
    - Download the HTML page source in a browser driven by Selenium
    - Parse the page source using BeautifulSoup
    - Eliminate non-printable characters and multiple spaces
    - Loop through the keywords, stop words and stop phrases, and write the matching text into the respective columns
    - Remove duplicate entries
    - Write an entry into the output list or the error list
    - Close the browser
- Load the output list and error list into DataFrames and create the respective output files
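
The cleaning, filtering and keyword-matching steps listed above can be sketched as three small functions. This is a simplified illustration under assumptions, not the notebook's exact code: the function names are hypothetical, and `clean_text` filters characters with a set rather than with the regular expression used in the main loop.

```python
import re
import string

def clean_text(raw):
    """Drop non-printable characters and collapse runs of whitespace."""
    printable = set(string.printable)
    kept = ''.join(ch for ch in raw if ch in printable)
    return ' '.join(kept.split())

def is_wanted(text, stop_words, stop_strings):
    """Check length, stop words, stop strings, digit count and word count."""
    lowered = text.lower()
    if len(text) <= 2 or lowered in stop_words:
        return False
    if any(re.search(s.lower(), lowered) for s in stop_strings):
        return False
    digit_count = len(re.sub('[^0-9]', '', text))
    # At most 9 digits (skips phone numbers) and fewer than 10 words
    return digit_count <= 9 and len(text.split()) < 10

def categorize(href, keys_by_column):
    """Return the columns whose keywords occur in the link target."""
    href = href.lower()
    return [col for col, keys in keys_by_column.items()
            if any(re.search(k, href) for k in keys)]
```

Note that the keywords are matched against the link target (`href`), while the anchor text is what gets stored in the column.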

In [11]:
out = []
out_err = []
i = 0

for row in input_list:
    
    comp_name = row[0].strip()      # Capture company name
    link = row[1].strip()           # Capture company url
    
    # Repair links whose scheme was entered twice
    if "https://https://" in link:
        link = link.replace("https://https://", "https://")
    
    products, solutions, services, platforms, clients, page_text, overviewpage_link, linkedinpage_link, overview_text = init_lists() # Initialize lists
        
    try: 
        browser, page_soup = download_browser(link)                             # Download browser and parse page source 
        
        # Sleep for 3 seconds for the page to respond
        time.sleep(3)
        
        # Capture all tags that have a non-empty href attribute
        u_lists = page_soup.find_all(href = re.compile(r'\S'))
        
        
        #u_lists = page_soup.find_all(class_ = re.compile(r'(drop)|(nav)|(menu)|(item)|(sub)'), href = re.compile('\S'))
        
        for u_list in u_lists:
            
            # Eliminate all non printable characters in text
            txt = re.sub('[^{}]'.format(string.printable), '', u_list.text.strip())
            
            
            # Eliminate multiple spaces in text
            txt_list = txt.split()                              
            txt = (' ').join(txt_list)
                     
            if len(txt) > 2 and txt.lower() not in stop_words:                  # Eliminate stop words & unwanted text          
                skip_text = False
                
                for stop_string in stop_strings:                                # Eliminate all stop strings
                    if re.search(stop_string.lower(), txt.lower()):
                        skip_text = True
                        break                        
                
                # Keep only text with at most 9 digits (skips phone numbers) and fewer than 10 words
                if not skip_text and len(re.sub('[^0-9]', '', txt)) <= 9 and len(txt_list) < 10:
                    href = u_list['href'].strip()
                    
                    # Match on respective key words and write into respective columns 
                    for product in products_key:
                        if re.search(product.lower(), href.lower()):
                            products.append(txt)

                    for solution in solutions_key:
                        if re.search(solution.lower(), href.lower()):
                            solutions.append(txt)

                    for service in services_key:
                        if re.search(service.lower(), href.lower()):
                            services.append(txt)

                    for platform in platforms_key:
                        if re.search(platform.lower(), href.lower()):
                            platforms.append(txt)

                    for client in clients_key:
                        if re.search(client.lower(), href.lower()):
                            clients.append(txt)

                    page_text.append(txt)                                       # Capture all extracted & filtered text
                    
        # Link discovery runs once per company, after the anchor loop above has finished
        links = [pagelink.get('href') for pagelink in page_soup.find_all('a')]
        linkedinpage_link = []
        links_inter = []

        for all_link in links:
            if all_link is None:
                continue

            # LinkedIn company link: point it at the "about" tab
            if re.search(linkedin_key, all_link):
                if all_link.endswith('/'):
                    linkedinpage_link.append(all_link + 'about/')
                else:
                    linkedinpage_link.append(all_link + '/about/')

            # Positions of every '/' in the link
            slash_pos = [pos for pos, char in enumerate(all_link) if char == '/']

            # Try the prefixes ending at the 4th to 7th slash until an overview
            # ("about" / "company") page under the company's own URL is found
            p = 3
            try:
                while len(links_inter) == 0 and p <= 6:
                    candidate = all_link[:slash_pos[p] + 1]
                    for key in overviewlinks_key:
                        if re.search(key, candidate) and re.search(re.escape(link), candidate):
                            links_inter.append(candidate)
                    p = p + 1
            except IndexError:
                pass                                                            # Link has fewer slashes than required

        # Download each distinct overview page and keep its paragraph text
        for lin_inter in set(links_inter):
            browser2 = None
            try:
                browser2, page_soup2 = download_browser(lin_inter)
                for para in page_soup2.find_all('p'):
                    overview_text.append(re.sub('[^{}]'.format(string.printable), '', para.text.strip()))
            except Exception:
                pass
            finally:
                if browser2 is not None:
                    browser2.quit()                                             # Close the overview-page browser
        # Remove duplicates for respective column entries
        products_clean = list(set(products))
        solutions_clean = list(set(solutions))
        services_clean = list(set(services))
        platforms_clean = list(set(platforms))
        clients_clean = list(set(clients))
        page_text_clean = list(set(page_text))
        overviewpage_link_clean = list(set(links_inter))
        linkedinpage_link_clean = list(set(linkedinpage_link))
        browser.quit() 
        overviewlk = []
        industrylk = []
        specialitieslk = []
        
        overviewlk1 = []
        industrylk1 = []
        specialitieslk1 = []
        
        try:
            for linkedinpage_link_clean1 in linkedinpage_link_clean:
                overviewlk, industrylk, specialitieslk = linkedinscrap(linkedinpage_link_clean1)
                overviewlk1.append(" ".join(str(e) for e in overviewlk))
                industrylk1.append(" ".join(str(e) for e in industrylk))
                specialitieslk1.append(" ".join(str(e) for e in specialitieslk))
        except Exception:
            pass                                                                # Continue without LinkedIn details if scraping fails
                      
        # Remove duplicate overview paragraphs while preserving their order
        overview_text_clean = list(dict.fromkeys(overview_text))

        # Summarize the overview text with the hosted model; the summary stays empty on failure
        Company_overview_summary = []
        try:
            if len(overview_text_clean) > 0:
                Company_overview_summary = [query({"inputs": str(overview_text_clean)})[0]['summary_text']]
        except Exception:
            pass
        
        

        write_file()                                                            # Write output record
    
    except Exception as e:                                                      # Write error record for run time errors
        print('url# {}: {} Error: {}'.format(i, link, e))
        write_error_file()
    finally: 
        try:
            browser.quit()                                                      # Quit the browser for every URL
        except Exception:
            pass                                                                # Browser was never started or is already closed
        
        i = i + 1                                                               # Counter tracking # of records processed 
        if i % 25 == 0:                                                         # Print after every 25 records
            print('# of URLs processed:', i)
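
The link handling in the loop above can be isolated into two helpers, which makes the slash-counting logic easier to follow and to test. This is a sketch under assumptions: `linkedin_about_url` and `overview_candidate` are hypothetical names, and plain substring checks stand in for the regular-expression searches used in the loop.

```python
def linkedin_about_url(href):
    """Append the 'about/' segment to a LinkedIn company link."""
    return href + 'about/' if href.endswith('/') else href + '/about/'

def overview_candidate(href, site_url, keys=('about', 'company'), first=3, last=6):
    """Return the shortest prefix of href, cut after the 4th to 7th '/',
    that contains site_url and one of the keys, or None if there is none."""
    slash_pos = [i for i, ch in enumerate(href) if ch == '/']
    for p in range(first, last + 1):
        if p >= len(slash_pos):
            break  # the link has no further slashes to cut at
        prefix = href[:slash_pos[p] + 1]
        if site_url in prefix and any(k in prefix for k in keys):
            return prefix
    return None
```

A link without a trailing slash after the overview segment, such as `https://www.medecision.com/about-us`, yields no candidate because it contains only three slashes.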




In [12]:
Company_overview_summary

["It looks like nothing was found at this location. Maybe try a search? Maybe try to find something else? No, it's not that easy. It's just that you can't find anything when you try to look for it. You have to go to a different location and do a search."]
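
The summary above was generated from a "not found" page rather than from a real company overview. A simple guard could skip summarization for such pages. The sketch below is a heuristic under assumptions: `NOT_FOUND_MARKERS` and `looks_like_error_page` are hypothetical names, and the marker phrases are examples that would need extending for other sites.

```python
# Hypothetical marker phrases; extend as new error pages are encountered
NOT_FOUND_MARKERS = (
    "nothing was found at this location",
    "page not found",
)

def looks_like_error_page(paragraphs):
    """Return True when the scraped paragraphs read like a 'not found' page."""
    text = ' '.join(paragraphs).lower()
    return any(marker in text for marker in NOT_FOUND_MARKERS)
```

In the main loop, the summarization call could then be made only when `looks_like_error_page(overview_text_clean)` is false.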

 - Load the output list into a DataFrame and export to output CSV file

In [13]:
df_out = pd.DataFrame(out, columns = header)
df_out.to_csv(out_file, index = False, encoding='utf-8')

df_out

Unnamed: 0,Company Name,Link,Products,Solutions,Services,Platforms,Clients,Page_Text,overviewpage_Link,linkedinpage_Link,overview_Text,Overview_linkedIn,Industry_linlkedIn,Specialities_linkedIn,Company_Overview_Summary
0,"Simplify Healthcare, Inc.",https://simplifyhealthcare.com,,Provider1 | Medicaid Benefit Plan Management |...,Service1 | Service1 Digital Payer Platform,,,Provider1 | Medicaid Benefit Plan Management |...,https://simplifyhealthcare.com/about-us/,https://www.linkedin.com/company/simplifyhealt...,"Home About Us Our Story | Founded in 2008, w...",Simplify Healthcare is one of the leading digi...,IT Services and IT Consulting,"Healthcare IT, Payer Solutions, Benefit System...",Healthcare technology solutions providers offe...
1,Medecision,https://www.medecision.com/,Aveus Consulting Services,Digital Care Management | Care Coordination | ...,Aveus Consulting Services,,,Digital Care Management | Care Coordination | ...,https://www.medecision.com/about-us/,https://www.linkedin.com/company/medecision/ab...,Medecision is a digital care management compan...,Medecision® is a digital care management compa...,Software Development,"Population Health Management, Integrated Healt...",Medecision is a digital care management compan...
2,Ingenious Med Inc.,https://ingeniousmed.com/,,National health systems | Data intelligence | ...,,,,Compliance Hotline website | National health s...,https://ingeniousmed.com/about-us/,https://www.linkedin.com/company/ingenious-med...,It looks like nothing was found at this locati...,Ingenious Med offers usable solutions that sim...,Hospitals and Health Care,"technology, healthcare, sales, marketing, exec...",It looks like nothing was found at this locati...


 - Load the output error list into a DataFrame and export to output error CSV file

In [14]:
df_out_err = pd.DataFrame(out_err, columns = header_err)
df_out_err.to_csv(out_file_err, index = False, encoding='utf-8')

df_out_err

Unnamed: 0,Company Name,Link,Error
