# Crawling and Scraping for Creating an Influencer Database

You just landed a big new long-term account. You need to understand the industry, how it is structured, and who its leaders are, so you can help your client build strong relationships and establish thought leadership in the industry.

One way to find those people is through industry publications that feature leaders who contribute content on strategies and industry best practices. 

In this tutorial we will use crawling and scraping to create the nucleus of such a database. 

To be specific, the industry is online marketing and advertising, the publication is the SEMrush blog, and the crawler is the open-source [advertools](https://github.com/eliasdabbas/advertools) crawler. 

At the time of this writing, this blog has 393 authors, each with a profile page. You can manually copy and paste all their profiles and links, but your time is much more valuable than that. 

# Preparation (pages and data elements to extract)

The full list of bloggers with a link to each blogger's profile page can be found on a few pages with this template: 

`https://www.semrush.com/blog/authors/all/?page={n}`
Where `n` is a number ranging from one to fourteen.

We first start by generating a list of those pages. The crawler will be instructed to start here.

In [80]:
author_pages = [f'https://www.semrush.com/blog/authors/all/?page={n}' for n in range(1, 15)]
author_pages

['https://www.semrush.com/blog/authors/all/?page=1',
 'https://www.semrush.com/blog/authors/all/?page=2',
 'https://www.semrush.com/blog/authors/all/?page=3',
 'https://www.semrush.com/blog/authors/all/?page=4',
 'https://www.semrush.com/blog/authors/all/?page=5',
 'https://www.semrush.com/blog/authors/all/?page=6',
 'https://www.semrush.com/blog/authors/all/?page=7',
 'https://www.semrush.com/blog/authors/all/?page=8',
 'https://www.semrush.com/blog/authors/all/?page=9',
 'https://www.semrush.com/blog/authors/all/?page=10',
 'https://www.semrush.com/blog/authors/all/?page=11',
 'https://www.semrush.com/blog/authors/all/?page=12',
 'https://www.semrush.com/blog/authors/all/?page=13',
 'https://www.semrush.com/blog/authors/all/?page=14']

From these pages, the crawler should follow the links and extract from each profile page the elements we are interested in, using CSS selectors. If you are not familiar with CSS (or XPath) selectors, they are a way to specify parts of a page in a language that is more explicit and precise than, say, "list items", while still corresponding to how we think about and view pages. 

You most likely don't want all the links from a page. You typically want something like "all the links in the top part of the page that have social media icons".

I use a nice browser tool called [SelectorGadget](https://selectorgadget.com/) to help me find the names of the selectors. Once you activate it, any click you make on a part of the page gets highlighted in green, together with all similar elements. If you want to be more specific, you can click on other elements to deselect them. 

In the example below, I first clicked on the LinkedIn icon, which also selected several other links on the page. Then I clicked on the Home icon, which deselected it and the other unwanted elements (that's why it is now in red), and I was given the selector that corresponds to this specific element. 
At the bottom of the page you can see `.b-social-links__item_ico_linkedin`. You can also click on the XPath button to get the equivalent pattern if you want. This is how we tell the crawler which elements we want.  
Meet A.J., our all-time #1 champion!

![](css_selector_screenshot.png)

I did the same for other elements, and they are named below in this {key: value} mapping (Python dictionary). The keys can be named whatever you want, and they will become the column names in the crawl output file. The values are what the crawler will extract. 
Note that the selectors end with `::text` or `::attr(href)`.   
If you leave that out, the links are still extracted correctly, but you get the whole link element, for example `<a href="https://example.com">Link Text</a>`. The suffix specifies whether we want the `href` attribute or the element's text.
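
To make the difference concrete, here is a rough analogy using only Python's standard library. It is not the crawler's selector engine: it uses `xml.etree.ElementTree` with a simple XPath expression on a made-up, simplified snippet (the markup and the LinkedIn URL are assumptions for illustration). The point is only to show "whole element" versus "just the href" versus "just the text".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real profile pages are more complex.
snippet = (
    '<div class="b-social-links">'
    '<a class="b-social-links__item_ico_linkedin" '
    'href="https://www.linkedin.com/in/example/">LinkedIn</a>'
    '</div>'
)
root = ET.fromstring(snippet)

# Select the element itself (roughly what a selector without a suffix returns).
link = root.find(".//a[@class='b-social-links__item_ico_linkedin']")
whole_element = ET.tostring(link, encoding='unicode')

# Roughly what ::attr(href) and ::text narrow the result down to.
href_only = link.get('href')
text_only = link.text

print(whole_element)
print(href_only)
print(text_only)
```

In the crawl output we want the two narrowed-down values, not the raw markup, which is why every selector in the dictionary ends with one of the two suffixes.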

In [3]:
selectors = {
    'twitter':   '.b-social-links__item_ico_tw::attr(href)',
    'linkedin':  '.b-social-links__item_ico_linkedin::attr(href)',
    'facebook':  '.b-social-links__item_ico_fb::attr(href)',
    'instagram': '.b-social-links__item_ico_instagram::attr(href)',
    'website':   '.b-social-links__item_ico_web-site::attr(href)',
    'job_title': '.b-profile-top__info-occupation::text',
    'summary': '.b-profile-top__description::text , .b-profile-top__description a::text',
    'alltime_rank': 'a[class="js-profile-ga-event"]::text',
    'rank_name': '.b-profile-top__status::text'
}

So now we have the start pages ready, and we have the elements that we want to extract.   
To crawl, we use the command below:

In [4]:
import advertools as adv
    
adv.crawl(url_list=author_pages,
          output_file='semrush_crawl_2020-06-17.jl',
          follow_links=True, 
          custom_settings={'DEPTH_LIMIT': 1},
          css_selectors=selectors) 

Let me explain: 
* `import advertools as adv`: import the advertools package and use the alias `adv` as shorthand to refer to it. 
* `url_list=author_pages`: This is where the crawler will start crawling. `author_pages` is the name we gave to the list of fourteen URLs that contain links to the bloggers' profiles. 
* `output_file`: This is where we want the crawl data to be saved. It is always good to provide a descriptive name, together with the date. ".jl" is for "jsonlines", which is a flexible way of storing the data, where each URL's data will be saved in an independent line in the file. We will import it as a DataFrame, which can then be saved to CSV format for easier sharing.
* `follow_links=True`: If set to False, the crawler would only crawl the specified pages, which is also known as "list mode". In this case we want the crawler to follow links, which means that for every page crawled, all links will be followed. We don't want to crawl the whole website, though, so we use the next setting to limit the crawl. 
* `'DEPTH_LIMIT': 1`: Yes, do follow links you find on the initial pages, and crawl the pages you find, but only one level after the initial fourteen.
* `selectors`: Refers to the dictionary we created to specify the data we want extracted.
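
To illustrate how following links combines with a depth limit, here is a toy sketch that walks a small in-memory link graph. It is not how the crawler is implemented; the page names, the `links` mapping, and the `crawl_order` helper are all hypothetical.

```python
from collections import deque

# Hypothetical site structure: page -> links found on that page.
links = {
    'authors?page=1': ['profile/a', 'profile/b'],
    'profile/a': ['post/1', 'post/2'],
    'profile/b': ['post/3'],
    'post/1': ['profile/a'],
}

def crawl_order(start_pages, links, depth_limit):
    """Return the pages visited when following links up to depth_limit levels."""
    seen = set(start_pages)
    queue = deque((page, 0) for page in start_pages)
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth >= depth_limit:
            continue  # links found on this page are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited

visited = crawl_order(['authors?page=1'], links, depth_limit=1)
print(visited)
```

With a limit of one, the profile pages are visited but the posts linked from them are not, which is the behaviour we want for the author listing pages.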

There are many different options for crawling, and you can [check the documentation](https://advertools.readthedocs.io/en/master/advertools.spider.html) if you are interested in more details.  
The crawl takes a few minutes. Once it finishes, we can open the file using the pandas function `read_json`, specifying `lines=True` (because it's jsonlines).

In [5]:
import pandas as pd

semrush = pd.read_json('semrush_crawl_2020-06-17.jl',
                       lines=True)
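
If the jsonlines format is new to you, the sketch below shows the idea with the standard library only. The two records are made up. Each line is an independent JSON object, and lines do not need to share the same keys; when pandas reads such a file, keys missing from a line show up as missing values in the corresponding column.

```python
import json

# Two made-up crawl records, one JSON object per line, with different keys.
jsonlines_text = (
    '{"url": "https://example.com/a", "title": "Page A", "og:title": "A"}\n'
    '{"url": "https://example.com/b", "title": "Page B"}\n'
)

# Parse each non-empty line on its own.
records = [json.loads(line) for line in jsonlines_text.splitlines() if line.strip()]

# The union of keys across all lines is what ends up as the set of columns.
columns = sorted({key for record in records for key in record})
print(columns)

# The second record has no OpenGraph title at all.
print(records[1].get('og:title'))
```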

Now we have defined the variable `semrush` to refer to the crawl DataFrame. Let's first take a look at the columns it contains.  
As you can see below, there are eighty-three columns. Most are fixed (like "title", "h1", "meta_desc", etc.), and some are dynamic. For example, OpenGraph data may or may not exist on a page, so some columns appear only if the corresponding elements were found.  
The columns named after social networks, along with the other names we chose above, appear only because we explicitly requested them through the selectors.

In [6]:
n = 1
for col1, col2, col3 in zip(semrush.columns[:28],
                            semrush.columns[28:56],
                            semrush.columns[56:].tolist() + ['', '']):
    print(f'{n:<2}: {col1:<30}', f'{n+28}: {col2:<40}', f'{n+28+28}: {col3}')
    n += 1

1 : url                            29: links_fragment                           57: request_headers_user-agent
2 : url_redirected_to              30: links_nofollow                           58: request_headers_accept-encoding
3 : title                          31: img_src                                  59: request_headers_cookie
4 : meta_desc                      32: img_alt                                  60: alt_hreflang
5 : canonical                      33: ip_address                               61: og:title
6 : alt_href                       34: crawl_time                               62: og:type
7 : og:image                       35: resp_headers_date                        63: og:description
8 : h1                             36: resp_headers_content-type                64: og:site_name
9 : h2                             37: resp_headers_cf-ray                      65: og:url
10: h3                             38: resp_headers_cache-control               66: request_heade

Since this is not an SEO audit, we are only interested in the extracted data related to the authors, so we will now create a subset of the data with the following code.  
It says that we want the subset of `semrush` where the cells of the column `alltime_rank` are not empty, keeping only the columns `h1`, `url`, and the keys that we specified for extraction.

In [95]:
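# Keep only rows that have a rank (the author profile pages) and the columns we need.
authors = semrush[semrush['alltime_rank'].ne('')][['h1', 'url'] + list(selectors.keys())].copy()
# Preview 10 random rows, 20 characters per cell (pandas 2.0+: use .style.hide(axis='index')).
authors.apply(lambda s: s.str[:20]).sample(10).style.hide_index()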

h1,url,twitter,linkedin,facebook,instagram,website,job_title,summary,alltime_rank,rank_name
Evan Facinger,https://www.semrush.,https://twitter.com/,https://www.linkedin,,https://www.instagra,https://foremostmedi,Director Sales and,Evan is a digital ma,#139 @@ #154,Pro
Moss Clement,https://www.semrush.,https://twitter.com/,https://www.linkedin,https://www.facebook,https://www.instagra,https://writersperho,Content Manager at,Moss Clement is a fr,#201 @@ #17,Pro
Dave Rohrer,https://www.semrush.,https://twitter.com/,,,https://www.instagra,https://www.northsid,Founder at NorthSid,As an in-house and a,#322 @@ #244,Expert
Oleg Yemchuk,https://www.semrush.,,https://www.linkedin,,,,Marketing Manager a,Oleg Yemchuk is the,#220 @@ #265,Expert
Brandon Weaver,https://www.semrush.,,,,,,Content Marketing M,"Born in Idaho, grew",#271 @@ #2515,Helper
Cooper Hollmaier,https://www.semrush.,https://twitter.com/,https://www.linkedin,,,https://www.visiture,Technical SEO Manag,Cooper Hollmaier is,#216 @@ #86,Pro
Vahe Arabian,https://www.semrush.,https://twitter.com/,https://www.linkedin,,,http://www.stateofdi,Founder & Editor in,Vahe Arabian is the,#84 @@ #898,Pro
Alexandra Tachalova,https://www.semrush.,https://twitter.com/,https://www.linkedin,,,http://alextachalova,Founder at Digital,Alexandra Tachalova,#81 @@ #148,Pro
Alexander Porter,https://www.semrush.,https://twitter.com/,https://www.linkedin,https://www.facebook,https://www.instagra,https://paperclipdig,Head of Copy and SE,Alexander is Head of,#147 @@ #80,Pro
Nadya Khoja,https://www.semrush.,https://twitter.com/,https://www.linkedin,https://www.facebook,,https://drinkwithnad,Head of Marketing a,Upon realizing that,#200,Expert


We are almost done!  
Some cleaning of the data is needed. 

You might have noticed the additional characters in the `alltime_rank` column, as well as the fact that it contains two values; one for the rank in posts, and another for the rank in comments. The following code splits the two values, and removes the noisy characters.

In [8]:
ranks = authors['alltime_rank'].str.replace(r'\n|#|\s|№', '', regex=True).str.split('@@')
ranks

13      [90, 2503]
15      [209, 126]
16           [208]
17            [89]
18           [206]
          ...     
408     [156, 645]
409          [152]
410      [162, 23]
412     [154, 393]
413    [151, 1749]
Name: alltime_rank, Length: 388, dtype: object

Now we can create two separate columns, one for each rank, and make sure they are integers so we can sort by those columns. In some cases where the blogger doesn't have a rank for comments, we give them a rank of zero.

In [9]:
authors['alltime_rank_posts'] = [int(rank[0]) for rank in ranks]
authors['alltime_rank_comments'] = [0 if len(rank) == 1 else int(rank[1]) for rank in ranks]
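
For readers who prefer to see the logic on a single value, the two cleaning steps above can be sketched in plain Python. The `parse_ranks` helper is hypothetical, and the sample strings follow the shape of the values shown in the preview table.

```python
import re

def parse_ranks(raw):
    """Split a raw rank string into (posts_rank, comments_rank) integers."""
    # Remove newlines, '#', whitespace and '№', leaving digits and the '@@' delimiter.
    cleaned = re.sub(r'\n|#|\s|№', '', raw)
    parts = cleaned.split('@@')
    posts = int(parts[0])
    # Authors without a comments rank get zero, as in the tutorial.
    comments = int(parts[1]) if len(parts) > 1 else 0
    return posts, comments

print(parse_ranks('#139 @@ #154'))
print(parse_ranks('\n #200 \n'))
```

Keep in mind that a rank of zero sorts ahead of rank one, so when sorting by comments rank you may want to filter out the zeros first.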

One final step.  
The columns `job_title` and `rank_name` contain whitespace at the beginning and end, so we strip it. We also remove the `@@` delimiter from the `summary` column: many summaries contain links, so their text is extracted as three or four separate elements, which the crawler joins with `@@`. Finally, we rename `h1` to `name` and `url` to `semrush_profile`, sort by `alltime_rank_posts`, and leave the column `alltime_rank` out of the preview.  
And we are done!

In [73]:
authors['job_title'] = authors['job_title'].str.strip()
authors['rank_name'] = authors['rank_name'].str.strip()
authors['summary'] = authors['summary'].str.replace('@@', '')
authors = authors.rename(columns={'h1': 'name',
                                  'url': 'semrush_profile'}).sort_values('alltime_rank_posts').reset_index(drop=True)
authors.drop('alltime_rank', axis=1).iloc[:10, [0, 2, 3, 4, 5, 6, 7]].apply(lambda s: s.str[:31] )

Unnamed: 0,name,twitter,linkedin,facebook,instagram,website,job_title
0,A.J. Ghergich,https://twitter.com/SEO/,https://www.linkedin.com/in/ajg,,,https://brado.net,Founder at Brado
1,Bill Widmer,https://twitter.com/TheBillWidm,https://www.linkedin.com/in/Bil,https://www.facebook.com/TheBil,,https://billwidmer.com,Content Marketing Expert
2,James Brockbank,https://twitter.com/brockbankja,https://www.linkedin.com/in/jam,,,https://digitaloft.co.uk,Managing Director at Digitaloft
3,Judith Lewis,https://twitter.com/judithlewis,,,https://www.instagram.com/decab,http://www.decabbit.com,Founder at Decabbit Consultancy
4,Julia McCoy,https://twitter.com/JuliaEMcCoy,https://www.linkedin.com/in/jul,https://www.facebook.com/JuliaE,https://www.instagram.com/femen,https://www.expresswriters.com,"CEO, Author at Express Writers,"
5,Tim Capper,https://twitter.com/GuideTwit/,https://www.linkedin.com/in/tca,,,,
6,Deepak Shukla,https://twitter.com/deepakpshuk,https://www.linkedin.com/in/dee,https://www.facebook.com/deepak,https://www.instagram.com/deepa,http://www.pearllemon.com/,SEO Director at Pearl Lemon at
7,Guy Sheetrit,,,,,,CEO at Over The Top SEO
8,Uzair Kharawala,https://twitter.com/sfdigital/,https://www.linkedin.com/in/uza,https://www.facebook.com/sfdigi,https://www.instagram.com/sfdig,https://www.sfdigital.co.uk,Partner at SF Digital Studios
9,Georgi Todorov,https://twitter.com/GeorgiTodor,https://www.linkedin.com/in/geo,https://www.facebook.com/Georgi,,http://digitalnovas.com/,Digital Marketer


Let's quickly check if the work seems to be correct. Let's see how many rows and columns we have in `authors`:

In [51]:
authors.shape

(388, 13)

388 rows and 13 columns. Weren't there 393 authors?  
That's true. It seems there are five bloggers who have special profile pages that have the same data but in a different design (different CSS selectors). These are the "Columnists" of the blog. I have manually extracted their data and added them to the table, which you can see in the final `semrush_blog_authors.csv` file. 

>  **A very optimistic person once found a horseshoe on the floor and immediately thought, "Oh, now I only need three more horseshoes and a horse!"**

This list is only one horseshoe. The horse is the process of actually talking to those people, building real relationships, and finding a meaningful way to contribute to the network. 

Now that you have a list of all the Twitter accounts, you might want to create a list to keep track of the people you find interesting. You might only be interested in a subset, so you can filter by the title or summary for profiles containing "content", "SEO", "paid", or whatever you are interested in. The blog is published in several languages, so you could do the same for another language. With all the LinkedIn profiles you might consider creating or joining a specialist group, and inviting people to it. 
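
As a starting point for that kind of filtering, here is a minimal sketch with pandas. The rows in `authors_sample` are made up; only the column names match the real `authors` table, and the search pattern is just an example.

```python
import pandas as pd

# Stand-in rows with made-up values; the real table uses the same column names.
authors_sample = pd.DataFrame({
    'name': ['Author A', 'Author B', 'Author C', 'Author D'],
    'job_title': ['Technical SEO Manager', 'Content Marketing Lead', None, 'Founder'],
    'summary': ['Works on site audits.', 'Writes about email.',
                'Paid search and SEO consultant.', 'Runs paid social campaigns.'],
})

pattern = 'seo|content'  # regular expression: either word

# case=False ignores capitalization; na=False treats missing cells as non-matches.
mask = (authors_sample['job_title'].str.contains(pattern, case=False, na=False)
        | authors_sample['summary'].str.contains(pattern, case=False, na=False))
subset = authors_sample[mask]
print(subset['name'].tolist())
```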

Good luck! 


In [11]:
d = dict.fromkeys(authors.columns)
d

{'name': None,
 'semrush_profile': None,
 'twitter': None,
 'linkedin': None,
 'facebook': None,
 'instagram': None,
 'website': None,
 'job_title': None,
 'summary': None,
 'alltime_rank': None,
 'rank_name': None,
 'alltime_rank_posts': None,
 'alltime_rank_comments': None}

In [45]:
d['name']= ['Jason Barnard', 'Kevin Indig', 'Jason Brown', 'Ross Tavendale', 'Marina Brocca']
d['twitter'] = ['https://twitter.com/jasonmbarnard/',
                'https://twitter.com/kevin_indig/',
                'https://twitter.com/keyserholiday/',
                'https://twitter.com/rtavs/',
                'https://twitter.com/marinabrocca/']

d['linkedin'] = ['https://www.linkedin.com/in/jasonmbarnard/',
                 'https://www.linkedin.com/in/kevinindig/',
                 'https://www.linkedin.com/in/jason-brown-keyserholiday/',
                 'https://www.linkedin.com/in/rosstavendale/',
                 'https://www.linkedin.com/in/marinabrocca/']

d['website'] = ['https://jasonbarnard.com/',
                'https://www.kevin-indig.com/',
                'http://reviewfraud.org/',
                'https://typeamedia.net/',
                'https://marinabrocca.com/']

d['job_title'] = ['The Brand SERP Guy at Kalicube.pro',
                  'VP SEO and Content at G2',
                  'Spam Hunter at Sterling Sky',
                  'Managing Director at Type A Media', '']

d['summary'] = ["""Jason Barnard is an author, speaker and consultant on all things digital marketing. His specialist subject is Brand SERPs (what appears when someone Googles your name). He teaches Brand SERP optimisation to students at Kalicube.pro. He also hosts a marketing podcast, where the smartest people in marketing talk to Jason about subjects they know inside out. The conversations are always interesting, always intelligent and always fun!

Over 2 decades of experience in digital marketing: he started promoting his first website in the year Google was incorporated and built it up to become one of the top 10,000 most visited sites in the world (60 million visits in 2007).

The Brand SERP Guy

Why “The Brand SERP Guy”? Because Jason has been studying, tracking and analysing Brand SERPs (what appears when someone Googles your name) since 2013...

Conclusion: Brand SERPs are your new business card, a reflection of your brand’s digital ecosystem and an honest critique of your online marketing strategy. That could well be enough to pique the interest of any marketer and any brand... in any industry :)

News: Jason has released a series of online courses that teach brand owners and marketers to optimise their brand SERPs : https://kalicube.pro/courses/

You might want to check out Jason's digital marketing podcast. The conversations are always intelligent interesting and fun. Guests include Rand Fishkin, Joost de Valk, Jono Alderson, Bill Slawski, John Mueller...""",
                'Data > Information > Knowledge > Wisdom',
                """After more than a decade working in the field, I am currently the Spam Hunt for Sterling Sky. Although I have focused on local SEO for nationwide multi-location franchises since 2015, my resume includes combating and reporting fake online reviews, which includes several local and national news appearances. My comments can be found regularly in the GMB forum and I'm in the process of becoming a Top Contributor. You can find me on Twitter talking SEO and music; say hi sometime.""",
                """Ross is the Managing Director at Type A Media, an independent search agency that work with FTSE250 companies and mid-sized brands to help them find the optimal way to talk to more people online.
When not obsessing over his clients rankings, he hosts the Canonical Chronicle, a weekly web show watched by 100k people every month.
If you want to ask him a direct question you can find him @rtavs on Twitter.""",
                'Marina: Consultora especializada en marketing legal y protección de datos, acreditada en área jurídica. Especialista en RGPD. Autora del blog marinabrocca.com -Ponente y formadora.']

d['facebook'] = ['https://twitter.com/jasonmbarnard/',
                 'https://www.facebook.com/kevin.indig/', 
                 'https://www.facebook.com/keyserholiday/',
                 'https://www.facebook.com/ross.tavendale/', '']

d['rank_name'] = ['Columnist', 'Columnist', 'Columnist', 'Columnist', 'Columnist']

d['semrush_profile'] = ['https://www.semrush.com/user/jasonbarnard/',
                        'https://www.semrush.com/user/kevin-indig/',
                        'https://www.semrush.com/user/keyserholiday/',
                        'https://www.semrush.com/user/159015569/',
                        'https://www.semrush.com/user/145633457/']

In [78]:
# authors.append(pd.DataFrame(d)).drop('alltime_rank', axis=1).to_csv('semrush_blog_authors.csv', index=False)
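
The commented-out line above relies on `DataFrame.append`, which was removed in pandas 2.0. On recent versions, `pd.concat` does the same job. The sketch below uses small stand-in frames with made-up rows rather than the real `authors` and `d`, so treat it as an illustration of the pattern only.

```python
import pandas as pd

# Stand-in for the crawled table (made-up rows).
authors_sample = pd.DataFrame({
    'name': ['Author A', 'Author B'],
    'twitter': ['https://twitter.com/a/', 'https://twitter.com/b/'],
    'alltime_rank': ['#1', '#2'],
    'rank_name': ['Pro', 'Expert'],
})

# Stand-in for the manually collected rows; columns that are absent are filled with NaN.
manual = pd.DataFrame({
    'name': ['Columnist X'],
    'twitter': ['https://twitter.com/x/'],
    'rank_name': ['Columnist'],
})

# Stack the two tables, renumber the index, and drop the raw rank column.
combined = (pd.concat([authors_sample, manual], ignore_index=True)
              .drop(columns='alltime_rank'))
print(combined.shape)
```

Adding `.to_csv('semrush_blog_authors.csv', index=False)` to the real frames would then write the final file.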