# Keyword Analysis for SEO

Search Engine Optimization (SEO) keywords are the words and phrases in your web content that make it possible for people to find your site via search engines. A website that is well optimized for search engines "speaks the same language" as its potential visitors, using keywords that connect searchers to the site. Keywords are one of the main elements of SEO.
Finding and using the right keywords is one of the main challenges in SEO. The keyword analysis below is based on keywords obtained from competitors' websites; the model identifies useful keywords that could help a website's search engine optimization. Businesses can then use these keywords to optimize their own websites.

The model uses keywords exported from SEO applications such as SEMRUSH or Rank Tracker.
The analysis below runs on keywords obtained from several competitors' websites using **SEMRUSH's Keyword Gap Tool.**

### Libraries

In [1]:
import os
import glob
import pandas as pd
import numpy as np
import statistics
import spacy
import scattertext
import en_core_web_sm
from nltk.corpus import stopwords

### Loading the Keywords data

In [2]:
# If there are multiple keyword files, combine them into a single CSV first:

#files_to_combine = [i for i in glob.glob('*.{}'.format('csv'))]
#data = pd.concat((pd.read_csv(file, header = 0) for file in files_to_combine))
#data.to_csv("combined_Keyword_results.csv", index=False, encoding='utf-8-sig')

In [3]:
data=pd.read_csv('combined_Keyword_results.csv')

In [4]:
# New column identifying the first competitor that ranks for each keyword

competition = []
for i in range(len(data.index)):
    for k in range(1, 6):      # Adjust to the column positions of the competitor names
        if data.iloc[i, k] != 0:
            competition.append(data.columns[k])
            break
    else:
        competition.append(None)   # No competitor ranks for this keyword; keeps the list aligned with the rows
data['Companies'] = competition

data.head()

| | Keyword | healthpalace.ca | phytopurproducts.com | rockymountainsoap.com | massageessentials.ca | well.ca | Search Volume | Keyword Difficulty | CPC | Competition | Results | healthpalace.ca (pages) | phytopurproducts.com (pages) | rockymountainsoap.com (pages) | massageessentials.ca (pages) | well.ca (pages) | Companies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lung cleanse | 83 | 0 | 0 | 0 | 0 | 5400 | 86.06 | 0.6 | 1.0 | 3000000 | https://www.healthpalace.ca/lungs-respiratory-... | | | | | healthpalace.ca |
| 1 | nutiva coconut oil | 84 | 0 | 0 | 0 | 0 | 3600 | 85.72 | 0.53 | 1.0 | 597000 | https://www.healthpalace.ca/nutiva-organic-ref... | | | | | healthpalace.ca |
| 2 | douglas labs | 73 | 0 | 0 | 0 | 0 | 2900 | 64.47 | 1.28 | 0.53 | 22400000 | https://www.healthpalace.ca/douglas-laboratori... | | | | | healthpalace.ca |
| 3 | how to lower blood sugar naturally | 75 | 0 | 0 | 0 | 0 | 2400 | 79.87 | 1.09 | 1.0 | 50400000 | https://www.healthpalace.ca/blog/15-natural-wa... | | | | | healthpalace.ca |
| 4 | zembrin | 95 | 0 | 0 | 0 | 0 | 2400 | 72.64 | 0.21 | 1.0 | 40500 | https://www.healthpalace.ca/innovite-eliteneur... | | | | | healthpalace.ca |
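
The row-by-row loop above can also be expressed with vectorized pandas operations. The following is a minimal sketch on toy data: the `site-*.example` column names, the ranks, and the `competitor_cols` list are hypothetical and only stand in for the competitor columns of the real export.

```python
import pandas as pd

# Toy data; domain names and ranks are made up for illustration.
toy = pd.DataFrame({
    "Keyword": ["lung cleanse", "bar soap", "massage oil", "unranked term"],
    "site-a.example": [83, 0, 0, 0],
    "site-b.example": [0, 12, 0, 0],
    "site-c.example": [0, 40, 7, 0],
})
competitor_cols = ["site-a.example", "site-b.example", "site-c.example"]

# True wherever a competitor has a non-zero rank for the keyword.
ranked = toy[competitor_cols].ne(0)

# idxmax returns the first column (left to right) holding the maximum, i.e. the first True.
toy["Companies"] = ranked.idxmax(axis=1)

# Rows where no competitor ranks would otherwise get the first column name; blank them out.
toy.loc[~ranked.any(axis=1), "Companies"] = None

print(toy[["Keyword", "Companies"]])
```

Like the loop, this picks the leftmost competitor with a non-zero value, so the column order decides ties when several competitors rank for the same keyword.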


### Classifying Keywords based on impact (Search Volume)

In [5]:
# Threshold separating highly searched keywords from less searched ones:
# the overall average search volume
mean_vol = data['Search Volume'].mean()

# Classify keywords as High or Low relative to the threshold above
data['high_vol'] = data['Search Volume'].apply(lambda x: 'High' if x > mean_vol else 'Low')
df = data[['Keyword', 'high_vol', 'Companies']].copy()   # copy to avoid SettingWithCopyWarning
df['Keyword'] = df['Keyword'].astype(str)
df['Companies'] = df['Companies'].astype(str)
df.head()


| | Keyword | high_vol | Companies |
|---|---|---|---|
| 0 | lung cleanse | High | healthpalace.ca |
| 1 | nutiva coconut oil | High | healthpalace.ca |
| 2 | douglas labs | High | healthpalace.ca |
| 3 | how to lower blood sugar naturally | High | healthpalace.ca |
| 4 | zembrin | High | healthpalace.ca |
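
Because search volumes tend to be skewed by a few very popular keywords, the choice of threshold affects how many keywords end up in each class. The sketch below compares the mean threshold used here with a median threshold; the `volumes` list is hypothetical and is not taken from the SEMRUSH export, and the median is shown only as a possible alternative, not as part of the original analysis.

```python
import statistics

# Hypothetical search volumes: a few popular keywords and a long tail.
volumes = [5400, 3600, 2900, 2400, 90, 70, 50, 40, 30, 20]

mean_vol = statistics.mean(volumes)      # pulled upward by the few large values
median_vol = statistics.median(volumes)  # middle of the sorted values

# Same High/Low rule as in the notebook, applied with each threshold.
by_mean = ["High" if v > mean_vol else "Low" for v in volumes]
by_median = ["High" if v > median_vol else "Low" for v in volumes]

print("mean threshold:", mean_vol, "-> High count:", by_mean.count("High"))
print("median threshold:", median_vol, "-> High count:", by_median.count("High"))
```

With the mean as threshold, the "High" class is typically the smaller one, which matches the much larger "Low freq" counts seen later in the term-frequency table.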


### Text Processing and building corpus

In [6]:
# Add NLTK's English stop words to spaCy's default stop-word set
stopWords = set(stopwords.words('english'))
nlp = en_core_web_sm.load() 
nlp.Defaults.stop_words |= stopWords
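
To illustrate what a stop-word set does to a keyword, here is a minimal sketch in plain Python. The small `custom_stop_words` set is a hypothetical stand-in for the merged NLTK and spaCy list, and whitespace splitting stands in for spaCy's tokenizer.

```python
# Hypothetical, tiny stop-word set; the real merged set is much larger.
custom_stop_words = {"the", "to", "how", "for", "a"}

keyword = "how to lower blood sugar naturally"

# Keep only the tokens that are not stop words.
kept_terms = [token for token in keyword.split() if token not in custom_stop_words]

print(kept_terms)
```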

In [7]:
# Building a function using Scattertext to evaluate term frequency and calculate the scaled F-score for each class

def term_freq(df):
    corpus = (scattertext.CorpusFromPandas(df,
                                           category_col='high_vol', 
                                           text_col='Keyword',
                                           nlp=nlp)
              .build()
              .remove_terms(nlp.Defaults.stop_words, ignore_absences=True)
              )
    df = corpus.get_term_freq_df()
    df['High_Rating_Score'] = corpus.get_scaled_f_scores('High')
    df['Low_Rating_Score'] = corpus.get_scaled_f_scores('Low')
    df['High_Rating_Score'] = round(df['High_Rating_Score'], 2)
    df['Low_Rating_Score'] = round(df['Low_Rating_Score'], 2)
    
    df_high = df.sort_values(by='High freq', 
                             ascending = False).reset_index()
    df_low = df.sort_values(by='Low freq', 
                            ascending=False).reset_index()
    return df_high, df_low,corpus



In [8]:
# Using the function to build the corpus and rank terms by scaled F-score and frequency

df_high, df_low,corpus = term_freq(df)
df_high.head(10)

| | term | High freq | Low freq | High_Rating_Score | Low_Rating_Score |
|---|---|---|---|---|---|
| 0 | ordinary | 968 | 1132 | 0.98 | 0.02 |
| 1 | the ordinary | 808 | 856 | 0.98 | 0.02 |
| 2 | soap | 756 | 6128 | 0.83 | 0.17 |
| 3 | shampoo | 656 | 2472 | 0.9 | 0.1 |
| 4 | cream | 572 | 2128 | 0.9 | 0.1 |
| 5 | oil | 544 | 3772 | 0.84 | 0.16 |
| 6 | hair | 496 | 1828 | 0.9 | 0.1 |
| 7 | natural | 472 | 3140 | 0.85 | 0.15 |
| 8 | reviews | 468 | 2340 | 0.87 | 0.13 |
| 9 | baby | 464 | 1704 | 0.9 | 0.1 |
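
For intuition about the score columns: the Scattertext documentation describes the scaled F-score roughly as a harmonic mean of a term's precision (how exclusive it is to a class) and its frequency within that class, each first passed through a normal CDF. The sketch below is a simplified illustration of that idea in plain Python, not Scattertext's implementation; the helper names `normcdf_scale` and `simple_scaled_f` are hypothetical, and with only three terms the values will not reproduce the scores in the table above.

```python
from statistics import NormalDist, mean, pstdev

def normcdf_scale(values):
    """Map values into (0, 1) through a normal CDF fitted to their mean and std."""
    dist = NormalDist(mean(values), pstdev(values) or 1.0)  # avoid sigma == 0
    return [dist.cdf(v) for v in values]

def simple_scaled_f(cat_counts, other_counts, beta=1.0):
    """Harmonic mean of CDF-scaled precision and CDF-scaled in-class frequency."""
    total = sum(cat_counts)
    precision = [c / (c + o) if c + o else 0.0
                 for c, o in zip(cat_counts, other_counts)]
    frequency = [c / total for c in cat_counts]
    p_scaled, f_scaled = normcdf_scale(precision), normcdf_scale(frequency)
    b2 = beta ** 2
    scores = []
    for p, f in zip(p_scaled, f_scaled):
        denom = b2 * p + f
        scores.append((1 + b2) * p * f / denom if denom else 0.0)
    return scores

# Counts for three terms taken from the table above: (High freq, Low freq).
terms = ["ordinary", "soap", "shampoo"]
high_counts = [968, 756, 656]
low_counts = [1132, 6128, 2472]

scores = simple_scaled_f(high_counts, low_counts)
for term, score in zip(terms, scores):
    print(term, round(score, 2))
```

Even in this reduced form, "ordinary" scores above "soap" for the High class: both are frequent among highly searched keywords, but "soap" appears far more often among the less searched ones, which lowers its precision.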


### Visualization 

In [9]:
html = scattertext.produce_scattertext_explorer(corpus,category='Low',
                   category_name='Low Searched Keywords',
                   not_category_name='High Searched Keywords',
                   width_in_pixels=1000,
                   metadata=corpus.get_df()['Companies'])
html_file_name = "Keywords Analysis-Plot.html"
with open(html_file_name, 'wb') as f:   # the context manager closes the file after writing
    f.write(html.encode('utf-8'))

References:

 - https://github.com/JasonKessler/scattertext
 - https://www.semrush.com/

---