# Simplified Page Rank 

This project implements a simplified version of the Google PageRank algorithm. The simplified version consists of three steps:

## Rank pages based on the number of referrals to the page

Build a graph in which every node is a URL and every node stores the set of links pointing to it. The pages are then ranked by their number of incoming links.
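
As a minimal sketch of this step, assume the crawl has already produced a dictionary mapping each URL to the URLs it links to. The names `outgoing`, `incoming_links`, and `rank_pages` are illustrative and not part of the project code, and ignoring self-links is an assumption of this sketch:

```python
from collections import defaultdict


def incoming_links(outgoing):
    """Invert {page: [pages it links to]} into {page: set of referring pages}."""
    incoming = defaultdict(set)
    for source, targets in outgoing.items():
        incoming[source]  # make sure pages without referrals still appear
        for target in targets:
            if target != source:  # assumption: a page does not vote for itself
                incoming[target].add(source)
    return dict(incoming)


def rank_pages(incoming):
    """Order pages by number of distinct referring pages, most referred first."""
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(incoming, key=lambda page: (-len(incoming[page]), page))


# Hypothetical three-page site, used only for illustration.
outgoing = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
incoming = incoming_links(outgoing)
ranking = rank_pages(incoming)
```

Here `c.html` is linked from two pages and the other two from one page each, so `c.html` comes first in `ranking`.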

## Create an index of all words to pages

Traverse all web pages in the graph and build an index that maps each keyword to the web pages containing it.
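
One possible shape for such an index is a dictionary from keyword to a set of pages. The sketch below assumes the keywords of each page are already available as a list; `build_index` and `page_keywords` are hypothetical names, and lower-casing the keywords is an assumption rather than something the project prescribes:

```python
from collections import defaultdict


def build_index(page_keywords):
    """Map each keyword to the set of pages on which it appears."""
    index = defaultdict(set)
    for page, keywords in page_keywords.items():
        for word in keywords:
            # Assumption: the index is case-insensitive.
            index[word.lower()].add(page)
    return dict(index)


# Illustrative data only.
page_keywords = {
    "a.html": ["Python", "search"],
    "b.html": ["search", "index"],
}
index = build_index(page_keywords)
```

Using sets means a keyword that occurs several times on one page still lists that page once.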

## Combine rank and index to deliver a search

When a search keyword is provided, first look it up in the index to get the list of matching web pages. Next, order those pages by rank and return the result to the user.
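
The lookup-then-sort flow might be sketched as follows. The function name `search` and the sample `index` and `ranking` values are hypothetical; the sketch assumes the index maps lower-cased keywords to sets of pages and the ranking is a list with the best page first:

```python
def search(keyword, index, ranking):
    """Return the pages containing `keyword`, best-ranked first."""
    matches = index.get(keyword.lower(), set())
    position = {page: i for i, page in enumerate(ranking)}
    # Pages that are missing from the ranking sort after all ranked pages.
    return sorted(matches, key=lambda page: (position.get(page, len(ranking)), page))


# Illustrative data only.
index = {"search": {"a.html", "b.html"}, "python": {"a.html"}}
ranking = ["c.html", "b.html", "a.html"]

results = search("Search", index, ranking)
no_results = search("missing", index, ranking)
```

An unknown keyword yields an empty list rather than an error, which keeps the calling code simple.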

### Get NLTK stopwords

In [37]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tj225qr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Get URLs from a web page

In [18]:
import urllib.request

In [27]:
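# Requires Beautiful Soup and the lxml parser. To install them:
# pip install bs4 lxml
from bs4 import BeautifulSoup
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request


def getURLsList(url):
    """Return the target of every <a href=...> link on the page at `url`."""
    resp = urllib.request.urlopen(url)
    charset = resp.headers.get_content_charset()
    # Name the parser explicitly; otherwise BeautifulSoup warns and picks one itself.
    soup = BeautifulSoup(resp, "lxml", from_encoding=charset)

    pages = []
    for link in soup.find_all('a', href=True):
        page = link['href']

        # Naive handling of relative links: prefix them with the page URL.
        # An href that already starts with "/" therefore yields a double slash.
        if not page.startswith("http"):
            page = url + "/" + page

        pages.append(page)

    return pages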
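
Relative links are one place where string concatenation is fragile: an `href` of `/us` joined to the page URL with an extra `/` produces `http://www.cnn.com//us`. A sketch of an alternative uses `urllib.parse.urljoin` from the standard library. The helper name `resolve_link` is hypothetical and not part of the project:

```python
from urllib.parse import urljoin


def resolve_link(base_url, href):
    """Resolve an href found on the page at `base_url` into an absolute URL."""
    return urljoin(base_url, href)


# A root-relative link is attached to the host without a doubled slash.
absolute = resolve_link("http://www.cnn.com", "/us")

# A link that is already absolute is returned as-is.
unchanged = resolve_link("http://www.cnn.com", "http://money.cnn.com")

# A relative link is resolved against the directory of the base URL.
nested = resolve_link("http://www.cnn.com/world/", "africa")
```

Because `urljoin` also handles hrefs that are already absolute, the `http` prefix check becomes unnecessary when this helper is used.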

### Get keywords from a web page

In [31]:
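import re
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def getKeyWords(url):
    """Return the visible words on the page at `url`, minus English stopwords."""
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, 'lxml')

    # Drop elements whose text is never shown to the reader.
    for s in soup(['style', 'script', '[document]', 'head', 'title']):
        s.extract()
    visible_text = soup.getText()

    # Capture runs of word characters that are followed by a space.
    regex = r'(\w*) '
    words = [w for w in re.findall(regex, visible_text) if w != '']

    # Load the stopword list once instead of once per word.
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in words if w not in stop_words]

    # Return the list so callers can index it; the notebook still displays it.
    return filtered_words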

In [32]:
getURLsList("http://www.cnn.com")




['http://www.cnn.com//',
 'http://www.cnn.com//',
 'http://www.cnn.com//us',
 'http://www.cnn.com//world',
 'http://www.cnn.com//politics',
 'http://money.cnn.com',
 'http://www.cnn.com//opinions',
 'http://www.cnn.com//health',
 'http://www.cnn.com//entertainment',
 'http://money.cnn.com/technology',
 'http://www.cnn.com//style',
 'http://www.cnn.com//travel',
 'http://bleacherreport.com',
 'http://www.cnn.com//videos',
 'http://www.cnn.com//vr',
 'http://cnn.it/go2',
 'http://www.cnn.com//',
 'http://www.cnn.com//us',
 'http://www.cnn.com//specials/us/crime-and-justice',
 'http://www.cnn.com//specials/us/energy-and-environment',
 'http://www.cnn.com//specials/us/extreme-weather',
 'http://www.cnn.com//specials/space-science',
 'http://www.cnn.com//world',
 'http://www.cnn.com//africa',
 'http://www.cnn.com//americas',
 'http://www.cnn.com//asia',
 'http://www.cnn.com//australia',
 'http://www.cnn.com//europe',
 'http://www.cnn.com//middle-east',
 'http://www.cnn.com//uk',
 'http://ww

### Get a list of all web pages and their corresponding keywords

In [36]:
getKeyWords("http://www.cnn.com")

['Breaking', 'WorldPoliticsMoneyOpinionHealthEntertainmentTechStyleTravelSportsVideoVRLive', 'TV', 'Search', 'InternationalArabicEspañolSet', 'edition', 'InternationalArabicEspañolSet', 'edition', 'Crime', 'JusticeEnergy', 'EnvironmentExtreme', 'WeatherSpace', 'ScienceWorldAfricaAmericasAsiaAustraliaEuropeMiddle', 'EastUK45CongressSupreme', 'Court2018Key', 'RacesPrimary', 'ResultsMarketsTechMediaPersonal', 'FinanceLuxuryOpinionPolitical', 'EdsSocial', 'CommentaryFoodFitnessWellnessParentingVital', 'SignsStarsScreenBingeCultureMediaBusinessCultureGadgetsFutureStartupsArtsDesignFashionArchitectureLuxuryAutosVideoDestinationsFood', 'DrinkPlayStayVideosPro', 'FootballCollege', 'FootballBasketballBaseballSoccerOlympicsVideoLive', 'TV', 'Digital', 'StudiosCNN', 'FilmsHLNTV', 'ScheduleTV', 'Shows', 'ZCNNVRShopCNN', 'LifestyleCNN', 'StoreHow', 'To', 'Watch', 'PhotosLongformInvestigationsCNN', 'profilesCNN', 'LeadershipCNN', 'NewslettersWork', 'InternationalEspañolArabicSet', 'edition', 'Intern
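
To go from a single page to all pages and their keywords, the two helpers can be driven by a small breadth-first crawl. The following is only a sketch under stated assumptions: `crawl`, `get_links`, `get_keywords`, and `max_pages` are hypothetical names, the two callables stand in for `getURLsList` and `getKeyWords`, and toy dictionaries replace real network access so the example runs offline:

```python
from collections import deque


def crawl(seed, get_links, get_keywords, max_pages=50):
    """Breadth-first crawl returning ({page: links}, {page: keywords})."""
    outgoing, keywords = {}, {}
    queue = deque([seed])
    while queue and len(outgoing) < max_pages:
        page = queue.popleft()
        if page in outgoing:
            continue  # already visited
        links = get_links(page)
        outgoing[page] = links
        keywords[page] = get_keywords(page)
        # Only queue pages that have not been visited yet.
        queue.extend(link for link in links if link not in outgoing)
    return outgoing, keywords


# Toy site standing in for the web (illustrative data only).
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_words = {"a": ["news"], "b": ["sports"], "c": ["news", "travel"]}

outgoing, keywords = crawl(
    "a",
    get_links=lambda page: toy_links.get(page, []),
    get_keywords=lambda page: toy_words.get(page, []),
)
```

With the real helpers the call would presumably look like `crawl("http://www.cnn.com", getURLsList, getKeyWords)`, provided the keyword helper returns its word list rather than printing it. The `max_pages` cap keeps the crawl bounded, and a real crawl would also need to handle network errors, which this sketch leaves out.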
