Let's grab our data and take a look at it. (I did some basic cleaning and removed irrelevant rows in Google Sheets before importing, but that isn't strictly necessary for any of this.)

In [5]:
import pandas as pd
import numpy as np

# Import the SEMRush export, skipping the extraneous rows at the top
df = pd.read_csv('skill-ranks.csv', skiprows=7)
# Rename the columns, then keep only the ones we need (drops 'dtype' and 'difference')
df.columns = ['keyword', 'rank', 'dtype', 'url', 'difference', 'tags', 'cpc', 'volume']
df = df[['keyword', 'rank', 'url', 'tags', 'cpc', 'volume']]

# Double-check that it worked by looking at the first two rows
df.head(2)

Unnamed: 0,keyword,rank,url,tags,cpc,volume
0,learn data science,1,https://www.dataquest.io/blog/learn-data-science/,career-builder|q3top10|tier1,7.37,880.0
1,python api tutorial,1,https://www.dataquest.io/blog/python-api-tutor...,api|python|skill-python-apis-scraping|tier3,1.97,390.0


The way SEMRush handles tags in exports is kind of annoying (every tag for a keyword is packed into one pipe-delimited string), but we can live with it. We've got a bunch of different skill paths, and we could use `str.contains` to split each of them into its own smaller dataframe, like so:

In [8]:
# Commented out because we don't need to run this; it's just an example of the alternate method.

# python_stats = df[df['tags'].str.contains('skill-python-stats')]
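
Two details are worth knowing if you do go this route: by default `str.contains` treats its pattern as a regular expression, and it returns NaN for rows with no tags at all. Here is a minimal sketch on made-up rows (the `toy` dataframe is not part of the real export) that matches the tag literally and treats missing tags as "no match":

```python
import pandas as pd

# Toy rows (made up for illustration) in the same pipe-delimited tag format
toy = pd.DataFrame({
    'keyword': ['python stats course', 'learn r', 'untagged keyword'],
    'tags': ['skill-python-stats|tier2', 'skill-r-basics', None],
})

# regex=False matches the tag as a literal substring;
# na=False treats rows with no tags as "no match" instead of NaN
mask = toy['tags'].str.contains('skill-python-stats', regex=False, na=False)
python_stats_toy = toy[mask]
print(python_stats_toy['keyword'].tolist())
```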

But it may be easier in the future if we keep this all in one dataframe. So instead, let's set up a list of conditions and a list of corresponding values, and then create a new column that lists _only_ the skill path tag.

In [7]:
# na=False counts rows with missing tags as "no match" (same result as comparing to True)
conditions = [
    df['tags'].str.contains('skill-python-stats', na=False),
    df['tags'].str.contains('skill-r-stats', na=False),
    df['tags'].str.contains('skill-python-ml-intermediate', na=False),
    df['tags'].str.contains('skill-python-ml-intro', na=False),
    df['tags'].str.contains('skills-sql', na=False),
    df['tags'].str.contains('skill-r-basics', na=False),
    df['tags'].str.contains('skill-python-basics', na=False),
    df['tags'].str.contains('skill-python-da-dv', na=False),
    df['tags'].str.contains('skill-python-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-dv', na=False),
    df['tags'].str.contains('career-builder', na=False),
    ]

values = ['skill-python-stats', 'skill-r-stats', 'skill-python-ml-intermediate', 'skill-python-ml-intro',
         'skills-sql', 'skill-r-basics', 'skill-python-basics', 'skill-python-da-dv', 'skill-python-apis-scraping',
          'skill-r-apis-scraping', 'skill-r-dv', 'career-builder'
         ]

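# default='' labels rows that match no skill path tag; without it np.select falls back
# to the integer 0, which recent NumPy versions may refuse to mix with string values
df['skillpath'] = np.select(conditions, values, default='')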

df.to_csv('SEMRush-data-exported.csv')
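
One thing to keep in mind with the conditions-and-values approach: `np.select` takes the first condition that is true for each row, so a keyword tagged with two skill paths gets whichever one is listed first. The sketch below shows this on made-up rows. It also builds the conditions from a single list of tags, which is one possible way (not what the cell above does) to keep the two lists from drifting apart. The names `toy`, `skillpath_tags`, and `toy_conditions` are just for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the second row carries two skill path tags
toy = pd.DataFrame({
    'tags': ['skills-sql|tier1', 'skill-r-basics|skill-r-dv', 'tier3', None],
})

# One list drives both the conditions and the values
skillpath_tags = ['skills-sql', 'skill-r-basics', 'skill-r-dv']
toy_conditions = [
    toy['tags'].str.contains(tag, regex=False, na=False) for tag in skillpath_tags
]

# np.select uses the first condition that is True for each row;
# default='' marks rows that match none of the tags
toy['skillpath'] = np.select(toy_conditions, skillpath_tags, default='')
print(toy['skillpath'].tolist())
```

The second row ends up as `skill-r-basics` because that tag comes before `skill-r-dv` in the list, and the last two rows are left empty.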

## Dataquest users: stop here and work with the exported CSV, following the [directions here](https://www.notion.so/dataquest/Skill-Path-Keyword-Worksheet-Documentation-65280a7f34764a7f8dc10da0835bad00).

### What follows is just for my own practice



Beautiful! Now, let's set a conversion rate for the content in each of these skill paths, and add those rates as a new column we can use as a multiplier. To make this simple for anyone to work with, each conversion rate gets its own easy-to-change variable.

Note: these conversion numbers are made-up nonsense for now. We can substitute in the real numbers later!

In [4]:
# Same conditions as before; na=False counts rows with missing tags as "no match"
conditions = [
    df['tags'].str.contains('skill-python-stats', na=False),
    df['tags'].str.contains('skill-r-stats', na=False),
    df['tags'].str.contains('skill-python-ml-intermediate', na=False),
    df['tags'].str.contains('skill-python-ml-intro', na=False),
    df['tags'].str.contains('skills-sql', na=False),
    df['tags'].str.contains('skill-r-basics', na=False),
    df['tags'].str.contains('skill-python-basics', na=False),
    df['tags'].str.contains('skill-python-da-dv', na=False),
    df['tags'].str.contains('skill-python-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-apis-scraping', na=False),
    df['tags'].str.contains('skill-r-dv', na=False),
    df['tags'].str.contains('career-builder', na=False),
    ]

python_stats_conv = 0.01
r_stats_conv = 0.02
python_ml_intermediate_conv = 0.01
python_ml_intro_conv = 0.02
sql_conv = 0.006
r_basics_conv = 0.01
python_basics_conv = 0.01
python_da_dv_conv = 0.02
python_apis_scraping_conv = 0.01
r_apis_scraping_conv = 0.02
r_dv_conv = 0.005
career_builder_conv = 0.01

values = [python_stats_conv, r_stats_conv, python_ml_intermediate_conv, python_ml_intro_conv,
         sql_conv, r_basics_conv, python_basics_conv, python_da_dv_conv, python_apis_scraping_conv,
          r_apis_scraping_conv, r_dv_conv, career_builder_conv
         ]

df['signup_conv'] = np.select(conditions, values)
df.sample(2)

Unnamed: 0,keyword,rank_mar_10,url,tags,cpc,volume,skillpath,signup_conv
104,r programming basics,19,https://www.dataquest.io/course/introduction-t...,skill-r-basics,2.68,170.0,skill-r-basics,0.01
140,learn data analysis in r,1,https://www.dataquest.io/path/data-analyst-r/,skill-r-basics,,,skill-r-basics,0.01
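
Since the `skillpath` column already holds a single tag per row, another option is to keep the rates in a dictionary and look them up with `map`, rather than repeating the conditions. This is only a sketch on made-up rows; `conv_rates` and `toy` are hypothetical names, and the rates are the same kind of placeholder numbers as above.

```python
import pandas as pd

# Placeholder rates (made up, like the ones above), keyed by skill path tag
conv_rates = {
    'skills-sql': 0.006,
    'skill-r-basics': 0.01,
    'skill-r-dv': 0.005,
}

# Made-up rows; '' stands for a keyword with no skill path tag
toy = pd.DataFrame({'skillpath': ['skills-sql', 'skill-r-dv', '']})

# map() looks each skill path up in the dict; rows with no entry become NaN,
# and fillna(0) gives them the same 0 that np.select uses as its default
toy['signup_conv'] = toy['skillpath'].map(conv_rates).fillna(0)
print(toy)
```

With this layout, changing a rate means editing one dictionary entry.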


Now we're set up to do some more advanced stuff. For now, let's take a quick look at the potential conversions for the SQL skill path (not including long-tail keywords and variations), assuming we captured all of the search traffic. We obviously can't capture all of it; this is just a proof of concept.

Below is a hacky (but it works!) way to get at this:

In [5]:
# Filter to the SQL skill path once, then multiply total volume by the average conversion rate
sql = df[df['skillpath'] == 'skills-sql']
sql['volume'].sum() * sql['signup_conv'].mean()

1046.7000000000003
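
The same calculation can be run for every skill path at once with `groupby`. The sketch below uses made-up rows (not the real export) and multiplies per keyword before summing; that gives the same answer as sum-times-mean as long as the conversion rate is constant within a skill path, which it is here.

```python
import pandas as pd

# Made-up rows for illustration; one volume is missing, as in the real export
toy = pd.DataFrame({
    'skillpath': ['skills-sql', 'skills-sql', 'skill-r-basics', 'skill-r-basics'],
    'volume': [1000.0, 500.0, 200.0, None],
    'signup_conv': [0.006, 0.006, 0.01, 0.01],
})

# Multiply per keyword, then sum per skill path; sum() skips the missing volume
toy['potential_signups'] = toy['volume'] * toy['signup_conv']
by_path = toy.groupby('skillpath')['potential_signups'].sum()
print(by_path)
```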

That'll do for now. Next steps:
* Create a column for current estimated traffic, based on current rank and keyword volume
* Create a column for estimated traffic if we ranked 2nd for every keyword
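
As a rough idea of how those two columns might look, here is a sketch on made-up rows. The click-through rates in `ctr_by_rank` and `DEFAULT_CTR` are invented purely for illustration (just like the conversion rates above) and would need to be replaced with real figures; the column names are hypothetical too.

```python
import pandas as pd

# HYPOTHETICAL click-through rates by rank, invented for this sketch;
# swap in real CTR figures before trusting any output
ctr_by_rank = {1: 0.30, 2: 0.15, 3: 0.10}
DEFAULT_CTR = 0.01  # assumed CTR for any rank not listed above

# Made-up keywords, ranks, and volumes
toy = pd.DataFrame({
    'keyword': ['learn sql', 'sql tutorial', 'sql joins'],
    'rank': [1, 3, 19],
    'volume': [1000.0, 500.0, 200.0],
})

# Estimated traffic at the current rank
toy['est_traffic'] = toy['rank'].map(ctr_by_rank).fillna(DEFAULT_CTR) * toy['volume']

# Estimated traffic if every keyword ranked 2nd
toy['est_traffic_rank_2'] = ctr_by_rank[2] * toy['volume']
print(toy)
```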