# Ethical Web Scraping - Using Google API
### By Guang Yi Chua, Marketing Analyst, Construct Digital

This code was written quickly alongside my blog post to demonstrate how web scraping can surface useful, relevant information about your customers. With that information you can segment them more efficiently, which should translate into a higher ROI.

Remember, always make sure you identify yourself and respect the robots.txt file of every website.

Happy scraping!
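
As a rough illustration of what respecting robots.txt can look like in practice, the sketch below uses the standard library's robots.txt parser to decide whether a URL may be fetched. The helper name `is_allowed`, the bot name `MyCompanyBot`, and the sample rules are assumptions made up for this example; a real check would read the target site's actual robots.txt.

```python
try:
    from urllib import robotparser  # Python 3
except ImportError:
    import robotparser  # Python 2.7


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for this demonstration
sample_robots = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_robots, "MyCompanyBot", "https://www.example.com/private/page"))
print(is_allowed(sample_robots, "MyCompanyBot", "https://www.example.com/public/page"))
```

For a live site, the same parser can load the file itself through `set_url()` followed by `read()` before `can_fetch()` is called.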

In [1]:
# Last run on June 7th, 2018 with Python Version 2.7.15
# Import the necessary packages for CustomSearch to work
# These packages can be installed through 'conda install' or 'pip install'

from googleapiclient.discovery import build
import pprint
import pandas as pd

In [None]:
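# This is just sample code for how to identify yourself if no API is available
# The following are most of the packages you would use if this were the case

import requests
import urllib2  # Python 2 only; in Python 3 use urllib.request
from bs4 import BeautifulSoup

# Placeholder target; replace it with the page you intend to fetch
url = "https://www.example.com"

# Adjacent string literals are joined, so the header carries no stray whitespace
identity = {'user-agent' : 'Safari/11.1.1 (Macintosh; Intel Mac OS X 10_13_5); '
                           'Your Name Here/Your Company/email@yourcompany.com'}

html = requests.get(url, headers=identity)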

To use Google's API, you need:

1. A Google Cloud account
2. The Google Cloud SDK (along with Python) installed
3. An application created in the Console
4. A Custom Search Engine set up

More detailed instructions can be found at: https://stackoverflow.com/questions/37083058/programmatically-searching-google-in-python-using-custom-search
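
Once the API key and the search engine ID exist, one option is to keep them out of the notebook and read them from environment variables. The sketch below is only one possible arrangement: the variable names `GOOGLE_API_KEY` and `GOOGLE_CSE_ID` and the helper `load_credentials` are assumptions for this example, not names required by Google.

```python
import os


def load_credentials(environ=os.environ):
    """Return (api_key, cse_id) from the environment, or raise if one is missing."""
    try:
        return environ["GOOGLE_API_KEY"], environ["GOOGLE_CSE_ID"]
    except KeyError as missing:
        raise RuntimeError("Set the %s environment variable first" % missing)


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"GOOGLE_API_KEY": "demo-key", "GOOGLE_CSE_ID": "demo-cse"}
my_api_key, my_cse_id = load_credentials(demo_env)
print(my_cse_id)
```

Calling `load_credentials()` with no argument reads the real environment, so the notebook can be shared without exposing the key.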


#### People Search

In [2]:
# These can be found on your Cloud Console and Custom Search Engine dashboard, respectively
my_api_key = "Your API Key here"
my_cse_id = "Your Custom Search Engine ID here"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is absent when a query has no hits, so fall back to an empty list
    return res.get('items', [])

# Search for names here. Instead of a hard-coded list, you can read the
# names from an Excel or CSV file with Pandas
search_parameters = ["Guang Yi Chua", "Edwin Tam"]
results = []

# The free version of Google API limits you to 100 queries per day
# If you require more, it is $5 per additional 1,000 queries
for name in search_parameters:
    results.append(google_search(name, my_api_key, my_cse_id, num=10))
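
Because every name costs one query against the daily quota, it can help to load the names from a file and drop duplicates before searching. The comment above suggests Pandas for this; the sketch below uses the standard library's `csv` module instead so that it runs without extra packages, and it targets Python 3. The column name `name`, the helper `read_names`, and the in-memory sample file are assumptions for illustration.

```python
import csv
import io


def read_names(handle, column="name"):
    """Return the non-empty, de-duplicated values of one column, in file order."""
    seen = set()
    names = []
    for row in csv.DictReader(handle):
        name = (row.get(column) or "").strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# In-memory stand-in for a real file such as open("names.csv", newline="")
sample_csv = io.StringIO("name\nGuang Yi Chua\nEdwin Tam\nEdwin Tam\n")
search_parameters = read_names(sample_csv)
print(search_parameters)
```

With Pandas, `pd.read_csv("names.csv")["name"].drop_duplicates().tolist()` should give a comparable list, assuming the file has a `name` column.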

In [3]:
# results is a list of lists: each search returns a list of hits,
# and each hit is a dictionary parsed from the API's JSON response.
# The first index selects the name that you searched
# The second index selects an individual hit from Google
results[0][0].keys()

[u'kind',
 u'title',
 u'displayLink',
 u'htmlTitle',
 u'formattedUrl',
 u'htmlFormattedUrl',
 u'pagemap',
 u'snippet',
 u'htmlSnippet',
 u'link']

In [4]:
# Results are returned in JSON format 

for result in results:
    pprint.pprint(result)

[{u'displayLink': u'www.linkedin.com',
  u'formattedUrl': u'https://www.linkedin.com/in/guangyic',
  u'htmlFormattedUrl': u'https://www.linkedin.com/in/<b>guangyi</b>c',
  u'htmlSnippet': u'Se <b>Guang Yi</b> Chuas profil p\xe5 LinkedIn \u2013 verdens st\xf8rste faglige netv\xe6rk. Guang <br>\nYis erfaring inkluderer General Assembly, Music and Ballet Company og United&nbsp;...',
  u'htmlTitle': u'<b>Guang Yi Chua</b> | Faglig profil',
  u'kind': u'customsearch#result',
  u'link': u'https://www.linkedin.com/in/guangyic',
  u'pagemap': {u'cse_image': [{u'src': u'https://media.licdn.com/mpr/mpr/shrink_100_100/p/8/005/092/188/23e56c2.jpg'}],
               u'cse_thumbnail': [{u'height': u'100',
                                   u'src': u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQMYNdJLSf_-hvXd9gx8x__rL5kxU9w15jYFVmMScuIizM65MKfe1_g',
                                   u'width': u'100'}],
               u'hcard': [{u'fn': u'Guang Yi Chua',
                           u'photo':

In [5]:
# Prepare a list to pass the relevant hits to a Pandas dataframe
for_df = []

# Keep only LinkedIn results
for hits in results:
    for hit in hits:
        if "linkedin.com" in hit["link"]:
            for_df.append(hit)

df = pd.DataFrame(for_df)
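
The substring test above also matches any URL that merely mentions "linkedin.com" somewhere in its path. A slightly stricter variant, sketched here as an alternative rather than as the method used in this notebook, compares the hostname instead. The helper `is_linkedin` and the third sample link are hypothetical.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_linkedin(link):
    """Return True when the link's hostname is linkedin.com or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    return host == "linkedin.com" or host.endswith(".linkedin.com")


sample_hits = [
    {"link": "https://www.linkedin.com/in/guangyic"},
    {"link": "https://sg.linkedin.com/in/edwintam"},
    # Hypothetical non-LinkedIn page that a substring test would let through
    {"link": "https://www.example.com/articles/linkedin.com-tips"},
]

linkedin_hits = [hit for hit in sample_hits if is_linkedin(hit["link"])]
print(len(linkedin_hits))
```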

In [6]:
# Only displayLink, link, htmlSnippet, htmlTitle and snippet are useful for export, so keep just those columns
df_export = df.loc[:, ["displayLink", "link", "htmlSnippet", "htmlTitle", "snippet"]]
df_export

Unnamed: 0,displayLink,link,htmlSnippet,htmlTitle,snippet
0,www.linkedin.com,https://www.linkedin.com/in/guangyic,Se <b>Guang Yi</b> Chuas profil på LinkedIn – ...,<b>Guang Yi Chua</b> | Faglig profil,Se Guang Yi Chuas profil på LinkedIn – verdens...
1,www.linkedin.com,https://www.linkedin.com/in/edwin-tam-6b653a2,View <b>Edwin Tam&#39;s</b> professional profi...,<b>Edwin Tam</b> | LinkedIn,View Edwin Tam's professional profile on Linke...
2,sg.linkedin.com,https://sg.linkedin.com/in/edwintam,View <b>Edwin Tam&#39;s</b> profile on LinkedI...,<b>Edwin Tam</b> - Digital Strategy Director -...,"View Edwin Tam's profile on LinkedIn, the worl..."
3,ca.linkedin.com,https://ca.linkedin.com/in/edwin-tam-ba03436,View <b>Edwin Tam&#39;s</b> profile on LinkedI...,"<b>Edwin Tam</b> - Assistant Dean, Student Aff...","View Edwin Tam's profile on LinkedIn, the worl..."
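
The `htmlTitle` and `htmlSnippet` columns still carry markup such as `<b>` tags and entities such as `&#39;`. If plain text is preferred, a small clean-up step along these lines may help. This is a minimal sketch with a simple regular expression, which is adequate for the short fragments shown here but is not a general HTML parser; the helper name `clean_html` is an assumption.

```python
import re

try:
    from html import unescape  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape

TAG_RE = re.compile(r"<[^>]+>")


def clean_html(text):
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    without_tags = TAG_RE.sub("", text)
    return " ".join(unescape(without_tags).split())


print(clean_html(u"<b>Edwin Tam</b> | LinkedIn"))
print(clean_html(u"View <b>Edwin Tam&#39;s</b> professional profile"))
```

On the dataframe itself, something like `df_export["htmlTitle"].apply(clean_html)` would apply the same step to a whole column.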


#### Companies

Search is not limited to people. For B2B work, looking up company information is just as useful.

In [7]:
company_search = ["Construct Digital", "Google"]
results2 = []

# Append "LinkedIn" to each query so that LinkedIn pages rank among the hits
for company in company_search:
    search_string = company + " LinkedIn"
    results2.append(google_search(search_string, my_api_key, my_cse_id, num=10))

In [8]:
for_df2 = []

# Keep only LinkedIn links that also contain "company" (company pages)
for hits in results2:
    for hit in hits:
        if "linkedin.com" in hit["link"] and "company" in hit["link"]:
            for_df2.append(hit)

df2 = pd.DataFrame(for_df2)

In [9]:
df2_export = df2.loc[:, ["displayLink", "link", "htmlSnippet", "htmlTitle", "snippet"]]
df2_export

Unnamed: 0,displayLink,link,htmlSnippet,htmlTitle,snippet
0,www.linkedin.com,https://www.linkedin.com/company/construct-dig...,Learn about working at <b>Construct Digital</b...,<b>Construct Digital</b> | <b>LinkedIn</b>,Learn about working at Construct Digital. Join...
1,www.linkedin.com,https://www.linkedin.com/company/google,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b> | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
2,sg.linkedin.com,https://sg.linkedin.com/company/google,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b> | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
3,www.linkedin.com,https://www.linkedin.com/company/google/jobs,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b>: Jobs | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
4,au.linkedin.com,https://au.linkedin.com/company/google,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b> | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
5,ca.linkedin.com,https://ca.linkedin.com/company/google,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b> | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
6,il.linkedin.com,https://il.linkedin.com/company/google,Learn about working at <b>Google</b>. Join <b>...,<b>Google</b> | <b>LinkedIn</b>,Learn about working at Google. Join LinkedIn t...
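
The table above lists the same Google page several times under regional subdomains (`sg.`, `au.`, `ca.`, `il.`) plus a `/jobs` sub-page. One way to collapse these is sketched below. It assumes that regional subdomains point to the same company page, which is an assumption of this example rather than something the API guarantees, and the helper `canonical_company_url` is hypothetical.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def canonical_company_url(link):
    """Map any /company/<slug>/... link to a single www.linkedin.com URL, else None."""
    segments = [s for s in urlparse(link).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "company":
        return "https://www.linkedin.com/company/" + segments[1]
    return None


links = [
    "https://www.linkedin.com/company/google",
    "https://sg.linkedin.com/company/google",
    "https://www.linkedin.com/company/google/jobs",
    "https://au.linkedin.com/company/google",
]

unique = []
for link in links:
    canonical = canonical_company_url(link)
    if canonical and canonical not in unique:
        unique.append(canonical)

print(unique)
```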


When finished, export in your favorite format.
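
As one example of an export, the sketch below writes the five columns of interest to CSV with the standard library, targeting Python 3. It works on a list of hit dictionaries like `for_df` or `for_df2`; the sample hit is abbreviated from the output above, and the helper name `write_hits_csv` is an assumption. An in-memory buffer stands in for a real file.

```python
import csv
import io

EXPORT_FIELDS = ["displayLink", "link", "htmlSnippet", "htmlTitle", "snippet"]


def write_hits_csv(hits, handle, fields=EXPORT_FIELDS):
    """Write the selected fields of each hit as CSV rows; other keys are ignored."""
    writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit)


# Abbreviated sample hit; a real hit carries more keys, such as 'pagemap'
sample_hits = [{
    "kind": "customsearch#result",
    "displayLink": "www.linkedin.com",
    "link": "https://www.linkedin.com/company/google",
    "htmlSnippet": "Learn about working at <b>Google</b>.",
    "htmlTitle": "<b>Google</b> | <b>LinkedIn</b>",
    "snippet": "Learn about working at Google.",
}]

buffer = io.StringIO()  # stand-in for open("companies.csv", "w", newline="")
write_hits_csv(sample_hits, buffer)
print(buffer.getvalue())
```

If you are already holding the dataframe, `df_export.to_csv("people.csv", index=False, encoding="utf-8")` is the more direct Pandas route; the file name there is only a placeholder.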