# Phrase scoring

## Assignment

For each scrape, find the 10 most prominent phrases.
Consider phrases of up to 4 words.

### Dataset

[scrapes.csv](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/610644d8-ce87-408f-8c25-84268b3ca151/scrapes.csv)

### Dataset description

This dataset contains data from 14 different scrapes. Each scrape has multiple crawled pages. Each crawled page has multiple words.

- `scrape` - collection of crawled pages that were returned in the Google SERP
- `scrape_keyword` - keyword used in the given Google search
- `words` - vector of words extracted from the given crawled page
- `scores` - vector of integer word prominence scores. Each word should have a corresponding score.

## Solution

Let's load the data from the working directory:

In [1]:
import pandas as pd
scrapes=pd.read_csv('scrapes.csv')
scrapes.head()

Unnamed: 0,scrape_id,scrape_keyword,crawled_page_id,crawled_page_url,words,scores
0,244568,seo course 2020,3242198,https://www.quicksprout.com/best-seo-courses-a...,"[""our"", ""content"", ""is"", ""reader"", ""supported""...","[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ..."
1,244568,seo course 2020,3242200,https://digitaldefynd.com/best-seo-courses-tra...,"[""skip"", ""to"", ""content"", ""trending"", ""10"", ""b...","[3, 3, 3, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,244568,seo course 2020,3242201,https://www.trumplearning.com/best-seo-course-...,"[""toggle"", ""navigation"", ""contact"", ""us"", ""hom...","[6, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,244568,seo course 2020,3242199,https://ippei.com/best-seo-course/,"[""currently"", ""set"", ""to"", ""index"", ""currently...","[0, 0, 0, 0, 0, 0, 0, 0, 10, 10, 10, 7, 7, 7, ..."
4,244568,seo course 2020,3242195,https://www.searchenginejournal.com/best-free-...,"[""seo"", ""all"", ""seo"", ""ask"", ""an"", ""seo"", ""beg...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


Constants used in the program below:

In [2]:
window=4 #4-words phrases
top=10 #top 10 phrases
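
The assignment asks for phrases of up to 4 words, while the solution in this notebook scores only phrases of exactly `window` words. As a minimal sketch of one possible extension, the hypothetical helper `score_phrases` below (not part of the original solution) enumerates every phrase of 1 to `max_len` consecutive words and sums the word scores. The toy words and scores are made up for illustration.

```python
def score_phrases(words, scores, max_len=4):
    """Return (phrase, total_score) pairs for every run of 1..max_len consecutive words."""
    scored = []
    for n in range(1, max_len + 1):                # phrase length in words
        for start in range(len(words) - n + 1):    # every start position that fits
            phrase = tuple(words[start:start + n])
            scored.append((phrase, sum(scores[start:start + n])))
    return scored

# Toy data, invented for this example
toy_words = ["best", "seo", "course", "online"]
toy_scores = [1, 5, 5, 0]

# Highest total score first; sorted() is stable, so ties keep enumeration order
ranked = sorted(score_phrases(toy_words, toy_scores), key=lambda p: p[1], reverse=True)
print(ranked[:3])
```

Assuming scores are never negative, a plain sum can only grow as a phrase gets longer, so this ranking still favours long phrases. Comparing phrases of different lengths fairly would need some normalisation, such as the mean score per word.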

Dropping rows with no words and resetting the index to avoid gaps:

In [3]:
scrapes=scrapes[scrapes.words!='[]']
scrapes=scrapes.reset_index(drop=True)

Removing the surrounding brackets and quotes so that we can convert the data to other data types:

In [4]:
scrapes['words']=scrapes['words'].str.strip('["]')
scrapes['scores']=scrapes['scores'].str.strip('[]')
scrapes.head()

Unnamed: 0,scrape_id,scrape_keyword,crawled_page_id,crawled_page_url,words,scores
0,244568,seo course 2020,3242198,https://www.quicksprout.com/best-seo-courses-a...,"our"", ""content"", ""is"", ""reader"", ""supported"", ...","4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4..."
1,244568,seo course 2020,3242200,https://digitaldefynd.com/best-seo-courses-tra...,"skip"", ""to"", ""content"", ""trending"", ""10"", ""bes...","3, 3, 3, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1..."
2,244568,seo course 2020,3242201,https://www.trumplearning.com/best-seo-course-...,"toggle"", ""navigation"", ""contact"", ""us"", ""home""...","6, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."
3,244568,seo course 2020,3242199,https://ippei.com/best-seo-course/,"currently"", ""set"", ""to"", ""index"", ""currently"",...","0, 0, 0, 0, 0, 0, 0, 0, 10, 10, 10, 7, 7, 7, 9..."
4,244568,seo course 2020,3242195,https://www.searchenginejournal.com/best-free-...,"seo"", ""all"", ""seo"", ""ask"", ""an"", ""seo"", ""begin...","0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."


Functions to convert the words to tuples and the scores to lists:

In [5]:
def Convert_words(string): 
    li = tuple(string.split('", "')) 
    return li 
def Convert_scores(string): 
    li = list(string.split(', ')) 
    return li 
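
As an alternative to stripping and splitting by hand, the cells could be parsed with the standard `json` module. This is only a sketch: it assumes every `words` and `scores` cell is a valid JSON array, as the preview above suggests, which has not been checked against the full dataset. The helper name `parse_cell` is hypothetical.

```python
import json

def parse_cell(cell):
    """Parse a JSON-array string such as '["a", "b"]' or '[1, 2]' into a Python list."""
    return json.loads(cell)

# Toy cells in the same shape as the preview rows
raw_words = '["our", "content", "is"]'
raw_scores = '[4, 4, 4]'

parsed_words = tuple(parse_cell(raw_words))   # tuple of str
parsed_scores = parse_cell(raw_scores)        # list of int, no extra int() conversion needed

# Each word should have a corresponding score
assert len(parsed_words) == len(parsed_scores)
print(parsed_words, parsed_scores)
```

Parsing this way would also cope with words that themselves contain commas or escaped quotes, which a split on `'", "'` cannot distinguish from separators.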

Let's create 2 separate dictionaries for all words and scores:

In [6]:
words={}
scores={}
for i in range(len(scrapes.index)): #every row, including the last one
    words[i]=Convert_words(scrapes['words'][i])
    scores[i]=list(map(int,Convert_scores(scrapes['scores'][i])))

Creating a new DataFrame with phrases:

In [7]:
column_names = ["Phrase", "Score","Scrape","Page","Keyword","URL"] #column names for the new dataframe
results = [] #per-page dataframes, concatenated once after the loop
for i in words: #looping through the parsed rows of the 'scrapes' dataframe
    scores_series=pd.Series(scores[i]) #series of scores for row i of the 'scrapes' dataframe
    #sums over the rolling 4-word window; the first window-1 positions are NaN and dropped; sorted descending, top 10 kept
    scores_series=scores_series.rolling(window).sum().dropna().sort_values(ascending=False).head(top)
    phrases={} #phrases keyed by the index of their last word
    for a in scores_series.index: #Series.iteritems was removed in pandas 2.0; only the index is needed here
        phrases[a]=words[i][a-(window-1):a+1] #the 4 words that end at index a
    phrases_series=pd.Series(phrases) #Series of phrases, aligned with scores_series by index
    scrapes_results = { 'Phrase': phrases_series, 'Score': scores_series,'Scrape':scrapes['scrape_id'][i],'Page':str(scrapes['crawled_page_id'][i]),'Keyword':scrapes['scrape_keyword'][i],'URL':scrapes['crawled_page_url'][i] }
    result = pd.DataFrame(scrapes_results) #dataframe with all needed data for row i
    results.append(result) #collecting the per-page results
df = pd.concat(results)[column_names] #DataFrame.append was removed in pandas 2.0, so we concatenate once
df.head()

Unnamed: 0,Phrase,Score,Scrape,Page,Keyword,URL
134,"(seo, courses, and, guides)",80.0,244568,3242198,seo course 2020,https://www.quicksprout.com/best-seo-courses-a...
137,"(guides, on, the, internet)",80.0,244568,3242198,seo course 2020,https://www.quicksprout.com/best-seo-courses-a...
133,"(best, seo, courses, and)",80.0,244568,3242198,seo course 2020,https://www.quicksprout.com/best-seo-courses-a...
135,"(courses, and, guides, on)",80.0,244568,3242198,seo course 2020,https://www.quicksprout.com/best-seo-courses-a...
132,"(the, best, seo, courses)",80.0,244568,3242198,seo course 2020,https://www.quicksprout.com/best-seo-courses-a...
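
To make the index arithmetic in the cell above easier to follow, here is a small sketch on made-up data. It shows that the rolling sum at position `a` covers the words from `a-(window-1)` to `a`, which is why the phrase is sliced as `words[a-(window-1):a+1]`. The toy words and scores are assumptions for illustration only.

```python
import pandas as pd

toy_words = ("learn", "seo", "with", "our", "free", "course")
toy_scores = pd.Series([0, 2, 1, 1, 5, 5])
window = 4

# The sum at index a covers scores[a-3 .. a]; the first window-1 entries are NaN
rolling_sums = toy_scores.rolling(window).sum().dropna()

best_end = rolling_sums.idxmax()                                 # index of the LAST word of the best phrase
best_phrase = toy_words[best_end - (window - 1): best_end + 1]   # slice back to the first word
print(best_end, rolling_sums[best_end], best_phrase)
```

Because the sum is a float, the `Score` column in the output above shows values such as `80.0` rather than integers.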


Now that we have phrases and their scores, let's combine identical phrases that occur on different pages within a scrape:

In [8]:
df=df.groupby(['Phrase','Scrape','Keyword','Score']).agg({'URL':', '.join,'Page':', '.join})
df=df.reset_index()
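
The effect of this grouping can be sketched on a toy frame. The rows below are invented, and phrases are written as plain strings for brevity, whereas the notebook stores tuples. Since `Score` is part of the grouping key, the same phrase is merged across pages only when its score is identical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Phrase": ["best seo course", "best seo course", "best seo course"],
    "Scrape": [1, 1, 1],
    "Score": [30.0, 30.0, 12.0],
    "URL": ["https://a.example", "https://b.example", "https://c.example"],
    "Page": ["101", "102", "103"],   # strings, so that ', '.join works
})

# Rows sharing Phrase, Scrape and Score collapse into one; URLs and pages are joined
combined = (toy.groupby(["Phrase", "Scrape", "Score"])
               .agg({"URL": ", ".join, "Page": ", ".join})
               .reset_index())
print(combined)
```

Here the two rows with score 30.0 are merged into one row listing both pages, while the row with score 12.0 stays separate.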

## Result - top 10 phrases for each scrape

In [19]:
df=df.sort_values(by=['Scrape','Score'],ascending=False).groupby(by=["Scrape"]).head(top)
df=df.reset_index(drop=True)
pd.set_option('display.max_rows',None)
df

Unnamed: 0,Phrase,Scrape,Keyword,Score,URL,Page
0,"(and, certifications, free, and)",244568,seo course 2020,80.0,https://www.mobidea.com/academy/seo-training-c...,3242192
1,"(and, guides, on, the)",244568,seo course 2020,80.0,https://www.quicksprout.com/best-seo-courses-a...,3242198
2,"(best, seo, courses, and)",244568,seo course 2020,80.0,https://www.quicksprout.com/best-seo-courses-a...,3242198
3,"(best, seo, training, courses)",244568,seo course 2020,80.0,https://www.mobidea.com/academy/seo-training-c...,3242192
4,"(certifications, free, and, paid)",244568,seo course 2020,80.0,https://www.mobidea.com/academy/seo-training-c...,3242192
5,"(courses, and, certifications, free)",244568,seo course 2020,80.0,https://www.mobidea.com/academy/seo-training-c...,3242192
6,"(courses, and, guides, on)",244568,seo course 2020,80.0,https://www.quicksprout.com/best-seo-courses-a...,3242198
7,"(guides, on, the, internet)",244568,seo course 2020,80.0,https://www.quicksprout.com/best-seo-courses-a...,3242198
8,"(hour, guide, to, seo)",244568,seo course 2020,80.0,https://moz.com/learn/seo,3242193
9,"(one, hour, guide, to)",244568,seo course 2020,80.0,https://moz.com/learn/seo,3242193
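
The top-N-per-group pattern used for the result can be sketched on toy data (values invented for illustration): sort by group and score in descending order, then keep the first `top` rows of each group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Scrape": [1, 1, 1, 2, 2],
    "Phrase": ["a b", "c d", "e f", "g h", "i j"],
    "Score": [5.0, 9.0, 7.0, 3.0, 8.0],
})
top = 2

# After sorting, head(top) keeps the highest-scoring rows within each scrape
top_per_scrape = (toy.sort_values(by=["Scrape", "Score"], ascending=False)
                     .groupby("Scrape")
                     .head(top)
                     .reset_index(drop=True))
print(top_per_scrape)
```

When many phrases tie on the top score, as in the result above where all ten rows score 80.0, which of the tied rows survive depends on the sort order rather than on any further measure of prominence.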


Generating a csv file with the results:

In [10]:
df.to_csv('top_10_phrases.csv')