# 11. Data Wrangling With Screaming Frog

------------------------------------------------

## Learning Outcomes

- To learn how to automate Screaming Frog's command line commands with Python
- To learn how to wrangle 5 .csv files from Screaming Frog with Pandas
- To learn how to push the data into a BigQuery table
- To learn how to connect your BigQuery table to Google Data Studio

------------------------------------------------------

In the last tutorial, you learned how to [easily automate Screaming Frog on the command line for either Mac or Windows](https://sempioneer.com/python-for-seo/screaming-frog-automation/).

In this section we'll be focusing on automating the previous terminal commands with Python.

Then we'll wrangle the .csv data with Pandas, push it into [BigQuery](https://cloud.google.com/bigquery) and finally view it in [Google Data Studio](https://datastudio.google.com/u/0/navigation/reporting).

---------------------------------------------------------------

## Module Imports

References:
    
- https://docs.python.org/3/library/subprocess.html
- https://www.jstorimer.com/blogs/workingwithcode/7766119-when-to-use-stderr-instead-of-stdout
- https://pandas.pydata.org/
- https://www.vervesearch.com/blog/screaming-frog-google-compute-cloud-automatically-crawl-an-entire-industry-fast/

------------------------------------------------------------

<strong> Scripts To Refactor:</strong>
- https://www.vervesearch.com/screaming-frog-files/scream.py
- https://www.vervesearch.com/screaming-frog-files/auto-ssh.py
- https://raw.githubusercontent.com/skywind3000/terminal/master/terminal.py
- https://www.vervesearch.com/blog/compare-screaming-frog-crawl-files/
- https://github.com/skywind3000/terminal/

-----------------------------------------------------------------------------------------------

In [1]:
!pip install pandas google-cloud-bigquery google-auth



In [230]:
import os
import subprocess
import pandas as pd
import re
from datetime import datetime
from sys import platform

# Google Libraries:
from google.oauth2 import service_account
from google.cloud import bigquery

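ImportError: cannot import name 'collections_abc' from 'six.moves' (unknown location)

If you see this ImportError, the installed version of the six package is probably too old for the Google libraries. Upgrading it with !pip install --upgrade six and restarting the kernel usually resolves it.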

----------------------------------------------------------------------------------------

## How To Run The Command Line In Python

In this section we'll be using Linux commands:

In [3]:
process = subprocess.run("ls", shell=True, check=True, capture_output=True)
print(process)

CompletedProcess(args='ls', returncode=0, stdout=b'data-wrangling-screaming-frog.ipynb\n', stderr=b'')


In [4]:
print(f"This is the return code of the subprocess: {process.returncode}")

This is the return code of the subprocess: 0


- Typically a <strong> returncode of 0 means that the command ran successfully. </strong>
- Also notice how the output of the command is captured in stdout (standard output), while any error messages are captured in stderr (standard error).
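To see a non-zero return code and a populated stderr, here is a minimal sketch. It only uses the standard library, and the child command is a deliberately failing Python one-liner chosen purely for illustration (it is not a Screaming Frog command).

```python
import subprocess
import sys

# Illustrative child process: writes a message to stderr and exits with code 3.
failing_command = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('something went wrong'); sys.exit(3)",
]

# Without check=True, subprocess.run returns a CompletedProcess instead of raising.
# text=True decodes stdout and stderr into strings for us.
result = subprocess.run(failing_command, capture_output=True, text=True)

if result.returncode != 0:
    print(f"Command failed with return code {result.returncode}")
    print(f"stderr: {result.stderr}")
else:
    print(f"stdout: {result.stdout}")
```

If you would rather have Python raise an exception on failure, pass check=True and catch subprocess.CalledProcessError.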

----------------------------------------------------------------------

## How To Run The Screaming Frog Command Line Scripts With Python

Let's extract our username, website, output location and Screaming Frog application and put them into variables:

In [5]:
!pwd

/Users/jamesaphoenix/Desktop/Imran_And_James/Python_For_SEO/11_data_wrangling_screaming_frog


In [25]:
username = 'jamesaphoenix'
website = 'https://phoenixandpartners.co.uk/'
output_location = '/users/jamesaphoenix/desktop'
# Raw string so that the backslash-escaped spaces reach the shell unchanged
screaming_frog_app = r'/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher'

In [7]:
print("Username:", username, "\nWebsite:", website, "\nOutputLocation:", output_location,
      "\nScreamingFrogLocation:", screaming_frog_app)

Username: jamesaphoenix 
Website: https://phoenixandpartners.co.uk/ 
OutputLocation: /users/jamesaphoenix/desktop 
ScreamingFrogLocation: /Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher


------------------------------------------

Now let's create a couple of Screaming Frog command strings that we'll pass to subprocess:

In [292]:
screaming_frog_open = "/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher"
screaming_frog_crawl=f'{screaming_frog_app} --headless --save-crawl --output-folder {output_location} --timestamped-output --crawl phoenixandpartners.co.uk'

In [294]:
screaming_frog_crawl

'/Applications/Screaming\\ Frog\\ SEO\\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk'

---------------------------------------------------------------

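Also notice how we've used an f-string for the screaming_frog_crawl variable, which means:

- The value of screaming_frog_app is substituted into the string in place of {screaming_frog_app}.
- The value of output_location (/users/jamesaphoenix/desktop) is substituted in place of {output_location}.

The domain to crawl, phoenixandpartners.co.uk, is typed directly into the string rather than taken from the website variable.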

In [323]:
print(screaming_frog_crawl)

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk
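An alternative to an f-string plus shell=True is to build the command as a list of arguments, so that spaces in the application path need no manual escaping. The sketch below assumes the same flags used above; build_crawl_command is a hypothetical helper name, not part of Screaming Frog or of the scripts in this module.

```python
import shlex


def build_crawl_command(app_path, website, output_location):
    """Return the crawl command as a list of arguments (hypothetical helper)."""
    return [
        app_path,
        "--headless",
        "--save-crawl",
        "--output-folder", output_location,
        "--timestamped-output",
        "--crawl", website,
    ]


command = build_crawl_command(
    # No backslashes needed: each list item is passed as a single argument.
    "/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher",
    "https://phoenixandpartners.co.uk/",
    "/users/jamesaphoenix/desktop",
)

# shlex.join quotes each argument, which is handy for logging or copy-pasting.
print(shlex.join(command))
```

A list like this can be handed to subprocess.run(command, capture_output=True) without shell=True.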


------------------------------------------------------------------------------------------------

Let's now run them one by one:

In [295]:
open_sf = subprocess.run(screaming_frog_open)

This command should open Screaming Frog. The subprocess keeps running until we close the window, so close Screaming Frog before continuing.

In [296]:
print(open_sf)

CompletedProcess(args='/Applications/Screaming Frog SEO Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher', returncode=0)


We can see that the return code was 0!

------------------------------------------------------------------------------

In [297]:
screaming_frog=subprocess.run(screaming_frog_crawl, 
               shell=True, 
               capture_output=True)

In [298]:
screaming_frog_crawl

'/Applications/Screaming\\ Frog\\ SEO\\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --headless --save-crawl --output-folder /users/jamesaphoenix/desktop --timestamped-output --crawl phoenixandpartners.co.uk'

![](https://sempioneer.com/wp-content/uploads/2020/06/screaming-frog-1.png)

------------------------------------

### How To Find The Outputted Folder Name

As well as saving the crawl, we can parse the standard output pipe (stdout) and obtain the name of the timestamped folder:

In [43]:
dir(screaming_frog)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'args',
 'check_returncode',
 'returncode',
 'stderr',
 'stdout']

We can decode the stdout which will convert all of the console messages into a string:

In [126]:
text = screaming_frog.stdout.decode('utf-8')
print(text[0:10])

2020-06-25


Now let's search for the timestamped folder path, which sits:

- after this: <strong> Output directory: </strong>
- before this: <strong> \n </strong> (the newline character)

![](https://sempioneer.com/wp-content/uploads/2020/06/output-directory.png)

In [391]:
# text is already a string, so it can be passed to re.findall directly
timestamp = re.findall(r'(?<=Output directory:)(.*?)(?=\n)', text)

correct_folder = timestamp[0].strip()
print(f"This is the timestamped output folder: {correct_folder}")

This is the timestamped output folder: /users/jamesaphoenix/desktop/2020.06.25.15.01.49
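The lookup above raises an IndexError when the marker is missing from the console text (for example, if the crawl failed early). A slightly more defensive sketch follows. The function name extract_output_folder and the sample log lines are made up for illustration; only the "Output directory:" marker comes from the real output.

```python
import re


def extract_output_folder(console_text):
    """Return the path after 'Output directory:' or None if the marker is missing."""
    # (.+) stops at the end of the line because '.' does not match a newline.
    match = re.search(r"Output directory:[ \t]*(.+)", console_text)
    return match.group(1).strip() if match else None


# Made-up console text used only to exercise the function.
sample_text = (
    "2020-06-25 15:01:49 INFO Starting crawl\n"
    "2020-06-25 15:01:49 INFO Output directory: /users/jamesaphoenix/desktop/2020.06.25.15.01.49\n"
    "2020-06-25 15:05:10 INFO Crawl finished\n"
)

print(extract_output_folder(sample_text))
print(extract_output_folder("no folder mentioned here"))
```

Returning None lets the calling code decide whether to retry the crawl, fall back to another lookup, or stop.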


------------------------------------------------------------------------

We can also check what folders are in our current working directory with:

~~~

os.listdir()

~~~

In [139]:
os.chdir('/Users/jamesaphoenix/Desktop') # This changes the directory into the desktop
os.listdir()

['Music',
 'Screaming Frog - Data Manipulation.ipynb',
 'ngrok',
 '2020.06.25.15.01.49',
 '.DS_Store',
 '.localized',
 'config',
 'Coding_Marketing_Projects',
 'Google_cloud-sdk',
 'screaming-frog-remotedesktop-image.vmdk',
 'Screenshot 2020-06-25 at 15.07.17.png',
 'Extracting Schema At Bulk.ipynb',
 'Scripts_and_keys',
 'YouTube SEO.jpg',
 'Data_Science_Resources',
 'Screenshot 2020-06-25 at 15.07.17 (2).png',
 'Marketing',
 'Sort Through These',
 'Atom.app',
 'Math Textbooks',
 '.ipynb_checkpoints',
 'Client_Projects',
 'Imran_And_James',
 'General_Assembly',
 'layered_architecture.png',
 'Postman.app',
 'message.png']

![](https://sempioneer.com/wp-content/uploads/2020/06/output-of-directory.png)

-----------------------------------------------------------------------------

Another way to capture the relevant folder would be to:

1. Get today's date.
2. Only return folders whose names include today's date.

In [153]:
now = datetime.now()
todays_date = now.strftime("%Y.%m.%d")
print(todays_date)

2020.06.25


In [157]:
screaming_frog_folders = [file for file in os.listdir() if todays_date in file]
print(screaming_frog_folders)

['2020.06.25.15.01.49']
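The list comprehension above also matches ordinary files whose names happen to contain today's date, and it depends on the current working directory. A possible refinement, sketched with pathlib, is shown below. The helper name todays_crawl_folders is hypothetical, and the demonstration runs in a temporary directory so nothing on your disk is touched.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def todays_crawl_folders(base_dir, today=None):
    """Return today's timestamped crawl folders, oldest first (hypothetical helper)."""
    prefix = (today or datetime.now()).strftime("%Y.%m.%d")
    folders = [
        path for path in Path(base_dir).iterdir()
        if path.is_dir() and path.name.startswith(prefix)
    ]
    # Names like 2020.06.25.15.01.49 sort chronologically as plain strings.
    return sorted(folders, key=lambda path: path.name)


# Demonstration with throwaway folders and one file that should be ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2020.06.25.15.01.49", "2020.06.25.09.30.00", "2020.06.24.11.00.00", "Music"]:
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "2020.06.25 notes.txt").write_text("not a crawl folder")

    found = todays_crawl_folders(tmp, today=datetime(2020, 6, 25))
    folder_names = [path.name for path in found]
    print(folder_names)
```

Because the result is sorted, the last item is the most recent crawl of the day.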


------------------------------------------

## Enhancing Our Screaming Frog CLI Automation With Classes

Running the command string through subprocess is already an improvement over having to load up the terminal and manually enter the commands.

But let's take it a step further. I've created several classes such as 🐸 <strong> ScreamingFrogAnalyser, CSVParser and BigQueryAutomation</strong> 🐸

These will allow you to:

- Run a chosen number of website crawls.
- Collect all of the outputted folders and files.
- Merge the CSV data across multiple domains, with the domain name added as an extra column.
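To make the last point concrete, here is a rough Pandas sketch of merging per-domain CSV data and tagging each row with its domain. It is not the code inside the CSVParser class. The in-memory CSV content and the column names are assumptions for illustration, and real code would read the exported files from each crawl folder instead.

```python
import io

import pandas as pd

# Stand-ins for one exported CSV per domain (assumed columns, made-up rows).
exports = {
    "phoenixandpartners.co.uk": io.StringIO(
        "Address,Status Code\n"
        "https://phoenixandpartners.co.uk/,200\n"
    ),
    "sempioneer.com": io.StringIO(
        "Address,Status Code\n"
        "https://sempioneer.com/,200\n"
        "https://sempioneer.com/old,404\n"
    ),
}

frames = [
    pd.read_csv(csv_file).assign(domain=domain)  # tag every row with its domain
    for domain, csv_file in exports.items()
]

# Stack the per-domain frames into one table with a fresh 0..n-1 index.
merged = pd.concat(frames, ignore_index=True)
print(merged)
```

Keeping the domain as its own column means a single table can later be filtered or grouped by website.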

----



You can find all of these scripts inside of the src folder of this module. 

If you would like to dive deeper into the code you can view it:
- Here
- Here
- Here

------------------------------------------------------

### How To Set Up execute_crawl.py

However, if you'd just like to use the classes without reading through the code, I've created a function wrapper for them in a file called execute_crawl.py.

In this Python file, you can find a function called run.

You will first need to gather all of the relevant information and populate it below so that the script can run correctly:

- <strong> GOOGLE_CLOUD_PROJECT_ID: </strong> This is your Google Cloud Project ID.
- <strong> OUTPUTFOLDER: </strong> This is your desired output folder for the Screaming Frog crawls. For example, my output folder is: <strong> /Users/jamesaphoenix/Desktop </strong>
- <strong> SERVICE_ACCOUNT_KEY_LOCATION: </strong> This is the location of your service account key, which the script needs in order to create BigQuery tables and upload data to them automatically.
- <strong> website_urls: </strong> This is a Python list of the website URLs that you'd like to crawl, such as:

~~~
['https://phoenixandpartners.co.uk/', 'https://sempioneer.com/']

~~~
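Before starting a long crawl, it can help to sanity-check these four settings. The sketch below is one way to do that. The function find_config_problems is hypothetical and is not part of execute_crawl.py, whose run function is not shown here, and the example values are placeholders.

```python
import os


def find_config_problems(project_id, output_folder, key_location, website_urls):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not project_id:
        problems.append("GOOGLE_CLOUD_PROJECT_ID is empty")
    if not os.path.isdir(output_folder):
        problems.append(f"OUTPUTFOLDER does not exist: {output_folder}")
    if not os.path.isfile(key_location):
        problems.append(f"SERVICE_ACCOUNT_KEY_LOCATION not found: {key_location}")
    # Each URL should carry a scheme, as in the example list above.
    bad_urls = [url for url in website_urls if not url.startswith(("http://", "https://"))]
    if bad_urls:
        problems.append(f"website_urls entries missing a scheme: {bad_urls}")
    return problems


problems = find_config_problems(
    project_id="my-project-id",                     # placeholder value
    output_folder=os.getcwd(),                      # any folder that exists
    key_location="/path/that/does/not/exist.json",  # placeholder path
    website_urls=["https://phoenixandpartners.co.uk/", "sempioneer.com"],
)
for problem in problems:
    print(problem)
```

With the placeholder values above, the check reports the missing key file and the URL without a scheme.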

------------------------------------------------------------------------------------------------------------------------

## BigQuery Setup

- API Creation
- Table Creation
- Service Account Key Creation

TBC

------------------------------------------

## Pushing The Data To BigQuery



TBC

------------------------------------------

## Connecting The BigQuery Table to Google Data Studio



------------------------------------------------------------------------------------------