# Find differences in primary page elements for mobile audits

There are several important checks you should make when getting ready for the mobile-first index, and they have been explicitly touched on again and again. [Here's Gary tweeting about it](https://twitter.com/methode/status/904579213616918528).

This script will help you find important differences in:
- Text
- Structured data 

It is worth reading the blog post that accompanies this workbook. You can find that here.

**What will you need:**
- A crawl which covers both the desktop and mobile versions of your site. Each desktop page needs a mobile equivalent, or the comparison for that page won't happen.
- An extraction from that crawl of the full HTML for each of those pages.

In the example below I've used Screaming Frog.

**How to use this:**

Any time you see **Instructions**, you'll need to take an action in the cell below. Any time you see **Notes**, you just need to run the cells; if you're looking to modify the workbook, the notes may help.

In [2]:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
import re
import copy
import json
from deepdiff import DeepDiff
import pandas as pd
import pprint
import numpy as np
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time
from tqdm import tqdm, tqdm_pandas
from tqdm import tnrange, tqdm_notebook

In [2]:
pd.set_option('mode.chained_assignment', None)

**Notes:** First we load in the dataframe of pages to be compared. Because I'm using a Screaming Frog export in this example, we skip the first row.

You could quite happily write some extra Python to crawl all the pages here, but why build something when a great crawler already exists?

In [3]:
# Windows users: you'll need to escape the backslashes in your path, i.e. users\\myuser\\documents\\file

df = pd.read_csv("Your_file_goes_here", 
                 skiprows=1)
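
To make the effect of `skiprows=1` concrete, here is a minimal sketch. The miniature export below is made up for illustration (a title row followed by the real header row), so treat the exact layout as an assumption about your own crawler's output.

```python
import io

import pandas as pd

# Hypothetical miniature export: a title row, then the real header row.
raw_csv = io.StringIO(
    '"Internal - HTML"\n'
    '"Address","Status Code"\n'
    '"https://www.example.com/","200"\n'
    '"https://m.example.com/","200"\n'
)

# skiprows=1 drops the title row so the second line becomes the header.
df_example = pd.read_csv(raw_csv, skiprows=1)

print(list(df_example.columns))
print(len(df_example))
```

If your export has no title row, drop the `skiprows` argument, otherwise the real header gets skipped.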

**Instructions:** Here we need to set the correct column names for each crucial piece of information.
- url_column_name: The column which contains the URL crawled
- html_source_column_name: The column which contains the full extracted HTML of the page.
- alternate_columns: A list of all the columns which contain alternate tags (`<link rel="alternate">`).
- status_code_column_name: The column which contains the status code.
- mobile_identifier: A snippet to identify the correct alternate tag. Typically `://m.` will work in the vast majority of cases.

In [5]:
url_column_name = "Address"
html_source_column_name = "Extractor 1 1"
alternate_columns = ["Extractor 2 1", "Extractor 2 2"]
status_code_column_name = "Status Code"
mobile_identifier = "://m."
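
As a rough illustration of what `mobile_identifier` is used for, the sketch below picks the mobile URL out of a couple of extracted alternate tags. The tags and the helper name `first_mobile_href` are made up for this example; the notebook's own version of this logic lives in `get_alternate_url`.

```python
import re

# Hypothetical extracted alternate tags for a single desktop page.
alternate_tags = [
    '<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page">',
    '<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.example.com/page">',
]
mobile_identifier = "://m."


def first_mobile_href(tags, identifier):
    """Return the href of the first tag containing the identifier, or ''."""
    for tag in tags:
        if identifier in tag:
            match = re.search(r"""href=["']([^"']*)["']""", tag)
            if match:
                return match.group(1)
    return ""


mobile_url = first_mobile_href(alternate_tags, mobile_identifier)
print(mobile_url)
```

If your mobile site lives somewhere other than an `m.` subdomain, pick a snippet that only appears in the mobile alternate tag.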

**Notes:** Set up the functions that we're going to use.

In [29]:
def get_visible_text(string):
    '''
    Takes the HTML source of a page and returns the visible text strings
    that are longer than 50 characters.
    
    Returns: List
    '''
    if pd.isnull(string):
        return []

    soup = BeautifulSoup(string, 'lxml')
    texts = soup.find_all(string=True)
    visible_text = filter(filter_visible_text, texts)

    # Only keep strings longer than 50 characters
    return [text for text in visible_text if len(text) > 50]


def filter_visible_text(element):
    '''
    Takes a Beautiful Soup string element and returns True if it is visible and not
    only whitespace.
    
    Returns: Boolean
    '''

    if re.match(r"^\s*$", element):
        return False
        
    return tag_visible(element)


def tag_visible(element):
    '''
    Takes a BS4 navigable string and returns True if it would be visible on the page,
    i.e. it is not a comment and does not sit inside a non-visible tag.
    
    Returns: Boolean
    '''
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def pivot_crawl_md_columns(df, how_match, url_column_name, html_source_column_name):
    '''
    This function takes a dataframe with mobile and desktop URLs and then pivots it
    so that mobile and desktop URLs are in columns. It can match either on path or on
    alternate URLs. When matching on alternate URLs it reads the notebook-level
    alternate_columns and mobile_identifier variables.
    
    Returns: A dataframe.
    '''
    base_columns = [url_column_name, html_source_column_name, "path"]
    df_mobile = df[df["m_or_d"] == "mobile"][base_columns]

    if how_match == "path":
        df_desktop = df[df["m_or_d"] == "desktop"][base_columns]
        df_pivot = pd.merge(df_desktop, df_mobile, how="left", on="path")
        df_pivot = df_pivot.rename(columns={
            url_column_name + '_x': 'Desktop URL',
            url_column_name + '_y': 'Mobile URL',
            'path': 'Shared Path'})
    else:
        # The alternate URL column has to exist before the desktop rows are selected
        df = df.copy()
        df["alternate_url"] = df.apply(lambda x: get_alternate_url(x, alternate_columns, mobile_identifier), axis=1)
        df_desktop = df[df["m_or_d"] == "desktop"][base_columns + ["alternate_url"]]
        df_mobile = df_mobile.rename(columns={url_column_name: 'alternate_url'})

        df_pivot = pd.merge(df_desktop, df_mobile, how="left", on="alternate_url")
        df_pivot = df_pivot.drop(["path_y"], axis=1)
        df_pivot = df_pivot.rename(columns={
            url_column_name: 'Desktop URL',
            'alternate_url': 'Mobile URL',
            'path_x': 'Desktop Path'})
    return df_pivot


def get_alternate_url(row, column_list, mobile_identifier):
    '''
    Takes a row from a dataframe and a list of columns. It returns the href from
    the first column whose tag contains the mobile_identifier, or an empty string
    if there isn't one.
    
    Returns: A string
    '''
    
    for column in column_list:
        alt_tag = row[column]
        if pd.notnull(alt_tag) and mobile_identifier in alt_tag:
            match = re.search(r"""href=["']([^"']*)["']""", alt_tag)
            if match:
                return match.group(1)
    
    return ""


def get_list_diff(row, column1, column2):
    '''
    Takes a row from a dataframe and two columns which contain lists, and returns the
    elements of the first list that are missing from the second.
    
    Returns: A list
    '''
    diff = set(row[column1]).symmetric_difference(row[column2])
        
    missing_from_mobile = [elem for elem in diff if elem in row[column1]]
    
    if len(missing_from_mobile) == 0:
        missing_from_mobile = ["nothing"]
    
    return missing_from_mobile


def delete_keys_from_dict(dict_del, lst_keys):
    '''
    This function takes a nested python data structure of dictionaries and lists
    and removes all keys in the provided list.
    
    Returns: A dictionary.
    '''

    if isinstance(dict_del, dict):
        for k in lst_keys:
            try:
                del dict_del[k]
            except KeyError:
                pass
        for k,v in dict_del.items():
            if isinstance(v, dict):
                delete_keys_from_dict(v, lst_keys)
            elif isinstance(v, list):
                delete_keys_from_dict(v, lst_keys)
    elif isinstance(dict_del, list):
        for v in dict_del:
            delete_keys_from_dict(v, lst_keys)

    return dict_del


def set_all_dict_values_to_empty(dict_del):
    '''
    Sets all values in a dictionary to empty.
    
    Returns: A dictionary.
    '''

    if isinstance(dict_del, dict):
        for k,v in dict_del.items():
            if isinstance(v, dict):
                set_all_dict_values_to_empty(v)
            elif isinstance(v, list):
                set_all_dict_values_to_empty(v)
            else:
                if k == 'value':
                    dict_del[k] = ''
    elif isinstance(dict_del, list):
        for v in dict_del:
            set_all_dict_values_to_empty(v)

    return dict_del


def clean_empty(d):
    '''
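    This function takes a nested structure of dictionaries and lists and recursively
    removes every key or list item whose value is empty (empty list, empty dict,
    empty string, None, 0 or False).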
    
    Returns: A dictionary.
    '''
    if not isinstance(d, (dict, list)):
        return d
    if isinstance(d, list):
        return [v for v in (clean_empty(v) for v in d) if v]
    return {k: v for k, v in ((k, clean_empty(v)) for k, v in d.items()) if v}

            
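def fetch_structured_data(string, cookie_header):
    '''
    Sends a URL to Google's Structured Data Testing Tool and returns the structured
    data entities it finds, with the validation metadata stripped out.
    
    Returns: A list
    '''
    if pd.isnull(string):
        return []
    
    headers = {'cookie': cookie_header}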
    form_payload = {
        'url':string
    }
    r = requests_retry_session().post('https://search.google.com/structured-data/testing-tool/u/0/validate', 
                      headers=headers, 
                      data=form_payload, 
                      files={'set_multipart':'true'})
    
    # Be respectful of Structured Data Testing Tool
    time.sleep(4)
    
    if r.status_code != 200:
        print("Request has failed, more information below.")
        print(r.status_code)
        print(r.text)
        return []

    # Extract the JSON payload, skipping the 5-character prefix in front of it
    data = json.loads(r.text[5:])

    # The structured data entities live under the 'tripleGroups' key
    entities = data['tripleGroups']

    delete_keys = [
        'end', 
        'errors', 
        'begin', 
        'errorID', 
        'numErrors',
        'numWarnings',
        'richCardPreviewState',
        'richCardVerticalHints',
        'ownerSet',
        'errorsByOwner',
        'types',
        'warningsByOwner',
        'numNodesWithError',
        'numNodesWithWarning'
    ]
    
    processed_sd = [clean_empty(delete_keys_from_dict(entity, delete_keys)) for entity in entities if 'nodes' in entity]
    
    return processed_sd


def key_mapping(key):
    '''
    This takes a string and returns another string. Used to map Google SD dicts to microdata for ease of reading.
    '''
    if key == 'pred':
        return 'itemprop'
    elif key == 'typeGroup':
        return 'itemtype'
    elif key == 'idProperty':
        return '@id'
    else:
        return key


def change_keys(obj, convert):
    '''
    Recursively goes through the dictionary obj and replaces keys with the convert function.
    Taken from: 
    https://stackoverflow.com/questions/11700705/python-recursively-replace-character-in-keys-of-nested-dictionary
    '''
    if isinstance(obj, (str, int, float)):
        return obj
    if isinstance(obj, dict):
        new = obj.__class__()
        for k, v in obj.items():
            new[convert(k)] = change_keys(v, convert)
    elif isinstance(obj, (list, set, tuple)):
        new = obj.__class__(change_keys(v, convert) for v in obj)
    else:
        return obj
    return new


def get_json_list_diff(row, column1, column2, value_change=True):
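    '''
    Takes a row from a dataframe and two columns which contain lists of structured data
    objects, and returns the objects in the first column that have no identical match in
    the second. The value_change argument is currently unused.
    
    Returns: A list, which always starts with the placeholder 'nothing'
    '''
    structured_data_1 = row[column1]
    structured_data_2 = row[column2]
    
    # Placeholder first element; see https://github.com/pandas-dev/pandas/issues/14217
    diff_sd = ['nothing']
    for dict1 in structured_data_1:
        is_identical = False
        for dict2 in structured_data_2:
            # An empty DeepDiff means the two objects match (ignoring list order)
            if not DeepDiff(dict1, dict2, ignore_order=True):
                is_identical = True
                break

        if not is_identical:
            diff_sd.append(dict1)

    return diff_sd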


def requests_retry_session(
    retries=3,
    backoff_factor=5,
    status_forcelist=(500, 502, 504),
    session=None,
):
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

**Notes:** Before we can do any analysis we need to get our data in a usable form. 

We need to change the format of our data so that each mobile page is aligned with its desktop counterpart. We're also going to filter out any pages which aren't 200s.

In [7]:
# Keep the excluded rows before filtering, otherwise df_excluded is always empty
df_excluded = df[df[status_code_column_name] != 200]
df = df[df[status_code_column_name] == 200]
df['path'] = df[url_column_name].apply(lambda x: urllib.parse.urlparse(x).path)
df['m_or_d'] = df[url_column_name].apply(lambda x: "mobile" if mobile_identifier in x else "desktop")
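
Here is a small sketch of what those two derived columns contain, using made-up URLs. Note that `urlparse(...).path` leaves out the query string, so a desktop URL with parameters still shares a path with its parameter-free mobile twin.

```python
from urllib.parse import urlparse

mobile_identifier = "://m."

# Hypothetical URLs: one desktop (with a query string) and one mobile.
urls = [
    "https://www.example.com/shoes/red?size=9",
    "https://m.example.com/shoes/red",
]

labelled = [
    (urlparse(url).path, "mobile" if mobile_identifier in url else "desktop")
    for url in urls
]

for path, label in labelled:
    print(path, label)
```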

**Notes:** Here we pivot the mobile and desktop columns so they're in the same row. 

Then we get any text from elements which would appear by default on the page. We ignore non-visible elements like `<script>` and `<title>`.

Finally, once we've extracted the text from both versions, we diff the two.

In [8]:
df_pivot = pivot_crawl_md_columns(df, "path", url_column_name, html_source_column_name)
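
The path-based pivot boils down to a left merge of desktop rows onto mobile rows. The sketch below shows that idea on a made-up three-row crawl, and also how you might spot desktop pages that found no mobile partner. The `suffixes` and the `unmatched` check are additions for illustration, not part of the workbook.

```python
import pandas as pd

# Hypothetical crawl: /a exists on both sites, /b only on desktop.
crawl = pd.DataFrame({
    "Address": [
        "https://www.example.com/a",
        "https://m.example.com/a",
        "https://www.example.com/b",
    ],
    "path": ["/a", "/a", "/b"],
    "m_or_d": ["desktop", "mobile", "desktop"],
})

desktop = crawl[crawl["m_or_d"] == "desktop"][["Address", "path"]]
mobile = crawl[crawl["m_or_d"] == "mobile"][["Address", "path"]]

# A left merge keeps every desktop page, even without a mobile match.
paired = pd.merge(desktop, mobile, how="left", on="path",
                  suffixes=("_desktop", "_mobile"))

# Desktop pages with no mobile equivalent have a null mobile address.
unmatched = paired[paired["Address_mobile"].isnull()]
print(unmatched["path"].tolist())
```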

In [9]:
df_pivot["desktop_text_list"] = df_pivot[html_source_column_name+"_x"].apply(lambda x: get_visible_text(x))
df_pivot["mobile_text_list"] = df_pivot[html_source_column_name+"_y"].apply(lambda x: get_visible_text(x))
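
To see what counts as visible text, here is a condensed sketch of the same filtering on a tiny made-up page. It uses the built-in `html.parser` rather than `lxml` so it runs without extra dependencies, and it skips the 50-character minimum that `get_visible_text` applies, so treat it as an illustration rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical page with a title, a script, a comment, a paragraph and a style block.
html = """<html><head><title>Title text</title><script>var a = 1;</script></head>
<body><!-- a comment --><p>Visible paragraph</p><style>p {color: red;}</style></body></html>"""

soup = BeautifulSoup(html, "html.parser")
hidden_parents = {"style", "script", "head", "title", "meta", "[document]"}

visible = [
    str(node).strip()
    for node in soup.find_all(string=True)
    if node.parent.name not in hidden_parents   # skip text inside non-visible tags
    and not isinstance(node, Comment)           # skip HTML comments
    and node.strip()                            # skip whitespace-only strings
]

print(visible)
```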

In [10]:
# Can't use apply because of https://github.com/pandas-dev/pandas/issues/14217
missing_from_mobile = []
for index, row in df_pivot.iterrows():
    diff = set(row["desktop_text_list"]).symmetric_difference(row["mobile_text_list"])
    
    missing_from_mobile.append([elem for elem in diff if elem in row["desktop_text_list"]])

df_pivot["text_diff"] = missing_from_mobile
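
The diff in the cell above keeps only the strings that are on desktop but not on mobile. A quick sketch with made-up strings shows that this is the same thing as a plain set difference:

```python
# Hypothetical visible text extracted from a desktop page and its mobile twin.
desktop_text = [
    "Free delivery on all orders over fifty pounds.",
    "Read our full returns policy.",
]
mobile_text = [
    "Read our full returns policy.",
    "Download the app for offers.",
]

# Symmetric difference: strings that appear on exactly one of the two pages.
diff = set(desktop_text).symmetric_difference(mobile_text)

# Keep only the ones that came from desktop, i.e. missing from mobile.
missing_from_mobile = [text for text in diff if text in desktop_text]

# Equivalent shortcut: a one-directional set difference.
same_result = sorted(set(desktop_text) - set(mobile_text))

print(missing_from_mobile)
```

Text that only exists on mobile is deliberately ignored here, since the audit is about what mobile is missing.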

**Instructions:** Next we need to get the structured data. We're going to use Google's Structured data testing tool because it does a huge amount of grunt work and makes some sensible design decisions.

In order to use the Structured data testing tool API, you have to get an "API" key, which in this case is the cookie it uses for authentication. Head to the [tool](https://search.google.com/structured-data/testing-tool/u/0/) and make a successful request with Chrome dev tools open.

Select the request called _validate_ and copy the "Cookie" request header and set the value in the cell below:

In [15]:
# The cookie value should start something like: 'CONSENT=YES+GB.en+V7; SID=WEsdf9808sdflklLMPUBEPMNlk23l23a1BmNBkMpssuxM6YFHF50nxlXa1mgFw.; HSID=Asrnsdf897j;...'
cookie_value = 'CONSENT=YES+GB.en+V7; SID=cQsdfa2fSZ....'

**Notes:** Now we go out and get the structured data, desktop and then mobile. Because this can take rather a long time, we initialise tqdm, a progress meter for loops and other long-running operations, which now supports pandas.

Important note: this structured data diff is only capable of telling you that the objects are different. It currently won't show you what is different in them.

If you want to diff with values as well, then you'll need to remove the call to set_all_dict_values_to_empty from both cells below.

In [16]:
tqdm_notebook().pandas()

In [17]:
df_pivot["desktop_sd_list"] = df_pivot["Desktop URL"].progress_apply(lambda x: set_all_dict_values_to_empty(change_keys(fetch_structured_data(x, cookie_value),key_mapping)))

In [18]:
df_pivot["mobile_sd_list"] = df_pivot["Mobile URL"].progress_apply(lambda x: set_all_dict_values_to_empty(change_keys(fetch_structured_data(x, cookie_value),key_mapping)))
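
To show why the values are blanked before diffing, here is a simplified stand-in for `set_all_dict_values_to_empty` run on two made-up objects. The field names (`itemtype`, `properties`, `itemprop`) are assumptions for illustration and won't exactly match what the testing tool returns.

```python
import copy


def blank_values(node):
    """Recursively set every 'value' entry to '' (simplified stand-in)."""
    if isinstance(node, dict):
        for key, child in node.items():
            if isinstance(child, (dict, list)):
                blank_values(child)
            elif key == "value":
                node[key] = ""
    elif isinstance(node, list):
        for child in node:
            blank_values(child)
    return node


# Hypothetical objects: same shape, different product name.
desktop_obj = {"itemtype": "Product",
               "properties": [{"itemprop": "name", "value": "Red shoe"}]}
mobile_obj = {"itemtype": "Product",
              "properties": [{"itemprop": "name", "value": "Red shoe (mobile)"}]}

differs_with_values = desktop_obj != mobile_obj
differs_shape_only = (blank_values(copy.deepcopy(desktop_obj))
                      != blank_values(copy.deepcopy(mobile_obj)))

print(differs_with_values, differs_shape_only)
```

With the values blanked, only a missing or extra property makes two objects differ, which is the behaviour described in the note above.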

**Notes:** Then we diff the structured data we have pulled.

In [20]:
df_pivot['sd_diff'] = df_pivot.apply(lambda x: get_json_list_diff(x, "desktop_sd_list", "mobile_sd_list", value_change=False), axis=1)





**Notes:** For ease of output, we now want our data stacked, so that each row is one missing text string or one missing structured data object. We also leave the HTML columns out of the output here.

In [22]:
df_pivot["num_diff_sd_objs"] = df_pivot['sd_diff'].apply(lambda x: len(x))
df_pivot["num_diff_text_strings"] = df_pivot['text_diff'].apply(lambda x: len(x))

In [23]:
df_pivot_text = df_pivot[['Desktop URL', 'Mobile URL', 'desktop_text_list', 'mobile_text_list', 'text_diff','num_diff_text_strings']]
df_pivot_sd = df_pivot[['Desktop URL', 'Mobile URL', 'desktop_sd_list', 'mobile_sd_list', 'sd_diff','num_diff_sd_objs']]

In [24]:
stacked_text_diff = (df_pivot_text.text_diff.apply(pd.Series).stack()
              .reset_index(level=1, drop=True)
              .to_frame('text_diff'))
df_pivot_text = df_pivot_text.drop(['text_diff'], axis=1).join(stacked_text_diff, how="left").reset_index().drop(["index"], axis=1)
df_pivot_text['category'] = 'text'
df_pivot_text['length'] = df_pivot_text['text_diff'].apply(lambda x: len(str(x)))

In [25]:
stacked_sd_diff = (df_pivot_sd.sd_diff.apply(pd.Series).stack()
              .reset_index(level=1, drop=True)
              .to_frame('sd_diff'))
df_pivot_sd = df_pivot_sd.drop(['sd_diff'], axis=1).join(stacked_sd_diff, how="left").reset_index().drop(["index"], axis=1)
df_pivot_sd['category'] = 'structured_data'
df_pivot_sd['length'] = df_pivot_sd['sd_diff'].apply(lambda x: len(str(x)))
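
If the stack-and-join pattern above is hard to follow, the sketch below shows the same "one row per list element" reshaping on a made-up dataframe. It uses `DataFrame.explode` (pandas 0.25 or later), which is an alternative I'm swapping in for illustration and not what the cells above use.

```python
import pandas as pd

# Hypothetical pages: the first is missing two strings, the second none.
pages = pd.DataFrame({
    "Desktop URL": ["https://www.example.com/a", "https://www.example.com/b"],
    "text_diff": [["first missing string", "second missing string"], []],
})

# One row per list element; an empty list becomes a single NaN row.
stacked = pages.explode("text_diff").reset_index(drop=True)

print(stacked)
```

As with the left join in the cells above, pages with nothing missing still get a row, with an empty diff value.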

**Notes:** Finally we join the stacked dataframes with a label.

In [26]:
output = pd.concat([df_pivot_sd,df_pivot_text])
output.drop(["desktop_sd_list", "mobile_sd_list", "desktop_text_list", "mobile_text_list"], axis=1, inplace=True)

In [28]:
output.to_csv("structured_data_schema_output.csv", sep=',', quoting=1)
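
The `quoting=1` argument is the numeric value of `csv.QUOTE_ALL`, which wraps every field in quotes so that commas inside text or structured data don't break the columns. A small sketch on a made-up dataframe, written to an in-memory buffer rather than a file:

```python
import csv
import io

import pandas as pd

# Hypothetical one-row output.
out = pd.DataFrame({"Desktop URL": ["https://www.example.com/a"], "length": [42]})

buffer = io.StringIO()
out.to_csv(buffer, sep=",", quoting=csv.QUOTE_ALL, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv.QUOTE_ALL)
print(csv_lines)
```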