Let's grab our data and take a look at it. (I did some basic cleaning and removed irrelevant rows in Google Sheets before importing, but that isn't strictly necessary for any of this.)

In [2]:
import pandas as pd
import numpy as np

# Import the SEMRush export and skip the extraneous rows it has at the top 
df = pd.read_csv('skill-ranks.csv', skiprows=7)
# rename the columns and drop the useless one
df.columns = ['keyword', 'rank', 'dtype', 'url', 'difference', 'tags', 'cpc', 'volume']
df = df[['keyword', 'rank', 'url', 'tags', 'cpc', 'volume']]

# double-checking it worked right by looking at the output
df.head(2)

Unnamed: 0,keyword,rank,url,tags,cpc,volume
0,learn data science,1,https://www.dataquest.io/blog/learn-data-science/,career-builder|q3top10|tier1,7.37,880.0
1,python api tutorial,1,https://www.dataquest.io/blog/python-api-tutor...,api|python|skill-python-apis-scraping|tier3,1.97,390.0
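If you want to see what `skiprows` and the column renaming are doing without the real export on hand, here's a minimal sketch on a made-up, in-memory CSV. The junk header lines, column names, and example.com URLs are all invented for illustration; the real SEMRush export has more columns and seven rows to skip.

```python
import io
import pandas as pd

# Hypothetical stand-in for the export: two junk lines, then a header row and data.
raw = io.StringIO(
    "Report: Position Tracking\n"
    "Generated: (date)\n"
    "Keyword,Position,Type,URL\n"
    "learn data science,1,organic,https://example.com/a/\n"
    "python api tutorial,1,organic,https://example.com/b/\n"
)

# skiprows=2 drops the two junk lines, so the third line is read as the header
toy = pd.read_csv(raw, skiprows=2)

# rename every column, then keep only the ones we care about
toy.columns = ['keyword', 'rank', 'dtype', 'url']
toy = toy[['keyword', 'rank', 'url']]

print(toy.shape)
print(list(toy.columns))
```

Note that assigning to `df.columns` requires exactly one name per column, so the list has to match whatever the export contains.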


The way SEMRush handles tags in exports is kind of annoying, but we can live with it. We've got a bunch of different skill paths, and we could use str.contains to split each one into its own smaller dataframe, like so:

In [3]:
#commented out because we don't need to run this, it's just some info on an alternate method.

# python_stats = df[df['tags'].str.contains('skill-python-stats')]
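One wrinkle with this approach: if any keyword has no tags at all, `str.contains` can return a missing value for that row, and depending on your pandas version, using that as a filter may raise an error. Here's a small sketch on toy data (the rows are invented) that passes `na=False` so untagged rows simply count as "no match":

```python
import numpy as np
import pandas as pd

# Toy data: the middle row has no tags at all
toy = pd.DataFrame({
    'keyword': ['a', 'b', 'c'],
    'tags': ['skill-python-stats|tier1', np.nan, 'skill-r-stats|tier2'],
})

# na=False treats rows with missing tags as False instead of NaN
mask = toy['tags'].str.contains('skill-python-stats', na=False)
python_stats_toy = toy[mask]

print(python_stats_toy['keyword'].tolist())
```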

But it may be easier in the future if we keep this all in one dataframe, so instead, let's set up a list of conditions and corresponding values, and then create a new column that lists _only_ the skill path tag.

In [4]:
conditions = [
    # na=False makes rows with missing tags evaluate to False
    df['tags'].str.contains('skill-python-stats', na=False),
    df['tags'].str.contains('skill-r-stats', na=False),
    df['tags'].str.contains('skill-python-ml-intermediate', na=False),
    df['tags'].str.contains('skill-python-ml-intro', na=False),
    df['tags'].str.contains('skills-sql', na=False),
    df['tags'].str.contains('skill-r-basics', na=False),
    df['tags'].str.contains('skill-python-basics', na=False),
    df['tags'].str.contains('skill-python-da-dv', na=False),
    df['tags'].str.contains('skill-python-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-dv', na=False),
    df['tags'].str.contains('career-builder', na=False),
    ]

values = ['skill-python-stats', 'skill-r-stats', 'skill-python-ml-intermediate', 'skill-python-ml-intro',
         'skills-sql', 'skill-r-basics', 'skill-python-basics', 'skill-python-da-dv', 'skill-python-apis-scraping',
          'skill-r-apis-scraping', 'skill-r-dv', 'career-builder'
         ]

df['skillpath'] = np.select(conditions, values)

df.to_csv('SEMRush-data-exported.csv')
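Two things about `np.select` are worth keeping in mind here. It checks the conditions in order, and the first one that matches wins, so a keyword tagged with two skill paths only gets the earlier one in the list. And any row that matches none of the conditions gets the `default` value, which is `0` unless you say otherwise. The sketch below shows both on toy data; the `'untagged'` label is a hypothetical placeholder, not something from the export.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 matches both conditions, row 1 matches neither
toy = pd.DataFrame({
    'tags': ['skill-r-stats|career-builder', 'tier3', 'career-builder'],
})

conditions = [
    toy['tags'].str.contains('skill-r-stats', na=False),
    toy['tags'].str.contains('career-builder', na=False),
]
values = ['skill-r-stats', 'career-builder']

# 'untagged' is a made-up placeholder for rows that match no condition
toy['skillpath'] = np.select(conditions, values, default='untagged')

print(toy['skillpath'].tolist())
```

Passing a string default keeps the new column purely text, which makes it easier to filter out the leftovers later.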

## Dataquest users: stop here and work with the exported CSV, as per the [directions here](https://www.notion.so/dataquest/Skill-Path-Keyword-Worksheet-Documentation-65280a7f34764a7f8dc10da0835bad00).

Or move on to the next section if you want to update conversion rates.

## GA RegEx converter script

This script takes the dataset we've built and prints an easy-to-copy regex string for each skill path, which you can then paste into [this GA report](https://analytics.google.com/analytics/web/?authuser=0#/report/content-landing-pages/a41411988w90270749p93874080/_u.date00=20210101&_u.date01=20210312/) (see [documentation here](https://www.notion.so/dataquest/Skill-Path-Keyword-Worksheet-Documentation-65280a7f34764a7f8dc10da0835bad00)).
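To make the target format concrete, here's a toy sketch of the kind of string we're after: URLs with the `https://` stripped, joined with `|`. It also adds two optional refinements that the script below doesn't do: dropping duplicate URLs, and escaping dots so they match a literal `.` rather than any character. The `build_ga_regex` helper and the example.com URLs are made up for illustration, and I'm assuming GA accepts `\.` as an escaped dot; Python's `re` is only used here as a rough stand-in for checking the pattern.

```python
import re
import pandas as pd

def build_ga_regex(urls):
    """Join URLs into a single alternation string (hypothetical helper)."""
    cleaned = (
        pd.Series(urls, dtype=object)
        .dropna()                                    # skip keywords with no ranking page
        .astype(str)
        .str.replace('https://', '', regex=False)    # strip the scheme
        .str.replace('.', r'\.', regex=False)        # escape literal dots
        .drop_duplicates()                           # keep the first copy of each URL
    )
    return '|'.join(cleaned)

toy_urls = [
    'https://www.example.com/blog/a/',
    float('nan'),
    'https://www.example.com/blog/a/',
    'https://www.example.com/course/b/',
]

pattern = build_ga_regex(toy_urls)
print(pattern)

# rough sanity check with Python's regex engine
print(bool(re.search(pattern, 'www.example.com/course/b/')))
```

Dropping duplicates doesn't change what the pattern matches; it just keeps the string shorter, which may help if the report's filter field has a length limit.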

In [21]:
# I wrote this as a function, although in retrospect it doesn't need to be.
def df_maker(df):
    # get unique skillpath names
    paths = df['skillpath'].unique()

    # for each skill path:
    for path in paths:
        skillpath_df = df[df['skillpath'] == path]           # create a unique df for that skillpath alone
        skillpath_name = skillpath_df['skillpath'].unique()  # assign the path's name to skillpath_name

        # drop NaN urls (keywords where we don't have a page ranking) up front,
        # so they can't leave empty alternatives or a trailing | in the regex
        urls = skillpath_df['url'].dropna().astype(str)

        # remove the scheme to format the string for GA regexp requirements
        urls = urls.str.replace('https://', '', regex=False)

        # join the remaining urls into one string, separated by |
        output = '|'.join(urls)

        # print the output for copy-paste purposes
        print(skillpath_name)
        print(output)
        print('\n')

df_maker(df)

['career-builder']
www.dataquest.io/blog/learn-data-science/|www.dataquest.io/blog/data-science-certificate/|www.dataquest.io/blog/data-analyst-skills/|www.dataquest.io/blog/how-to-become-a-data-scientist/|www.dataquest.io/blog/data-analyst-skills/|www.dataquest.io/path/data-engineer/|www.dataquest.io/path/data-scientist/|www.dataquest.io/path/data-analyst/


['skill-python-apis-scraping']
www.dataquest.io/blog/python-api-tutorial/|www.dataquest.io/blog/web-scraping-tutorial-python/|www.dataquest.io/blog/python-api-tutorial/|www.dataquest.io/blog/web-scraping-tutorial-python/|www.dataquest.io/blog/python-api-tutorial/|www.dataquest.io/blog/web-scraping-tutorial-python/|www.dataquest.io/course/apis-and-scraping/|www.dataquest.io/blog/web-scraping-tutorial-python/


['skill-r-apis-scraping']
www.dataquest.io/blog/r-api-tutorial/|www.dataquest.io/blog/web-scraping-in-r-rvest/|www.dataquest.io/blog/web-scraping-in-r-rvest/|www.dataquest.io/blog/r-api-tutorial/|www.dataquest.io/blog/r-api-t