#### Quick Python script that creates a realistic website CSV dataset

The goal for this script is to create a realistic graph with real websites. We scrape a set of websites, and for each website, we find all the hyperlinks. These are the node connections. Then we build the CSV dataset from there. This graph dataset will then be fed into our algorithm in `pagerank.ipynb`.

The resulting CSV dataset is very sparse, unlike the densely connected graph that `generator.py` creates artificially.

Note that this section is written in Python rather than Julia, since Python's scraping libraries (`requests` and `BeautifulSoup`) are better suited to this task.

In [1]:
# What you want to search
query = "how to fix a flat tire"
# How many results do you want
num_nodes = 200

In [2]:
from googlesearch import search
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import time

Collect all the websites that will be used as our nodes by running a quick Google search.

Keep in mind that Google sometimes blocks this script when it queries too rapidly and is detected as a bot, so we add a cooldown (`sleep_interval=5`) between requests.

An interesting benchmark for our PageRank algorithm would be how close our results are to a real Google search.
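
One simple way to make this comparison concrete is a top-k overlap: the fraction of Google's first k results that also appear in the first k results of our ranking. The sketch below is only an illustration; `top_k_overlap` and the sample domains are hypothetical and not part of the notebook.

```python
def top_k_overlap(reference, candidate, k):
    """Fraction of the first k reference items that also appear in the first k candidate items."""
    if k <= 0:
        raise ValueError("k must be positive")
    ref_top = set(reference[:k])
    cand_top = set(candidate[:k])
    return len(ref_top & cand_top) / len(ref_top) if ref_top else 0.0

# Hypothetical data: Google's order versus an order produced by PageRank
google_order = ["a.com", "b.com", "c.com", "d.com"]
pagerank_order = ["b.com", "a.com", "d.com", "c.com"]

print(top_k_overlap(google_order, pagerank_order, k=2))  # both lists share the same top two
print(top_k_overlap(google_order, pagerank_order, k=3))  # two of the top three match
```

A score of 1.0 means both rankings agree on which sites belong in the top k, regardless of their order within it.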

In [3]:
results = []
links_path = "links.txt"

with open(links_path, 'w') as file:
    for i in search(query, sleep_interval=5, num_results=num_nodes):
        file.write(i + '\n')
        results.append(i)
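
If the search is still blocked despite the cooldown, one option is to retry with a growing delay. The helper below is a minimal sketch under that assumption; `retry_with_backoff` and `flaky_search` are hypothetical names, and the fake search stands in for the real call so the example runs without network access.

```python
import time

def retry_with_backoff(fn, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2 * base_delay, ... and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for a search call that is blocked twice before succeeding
calls = {"count": 0}

def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("blocked")
    return ["https://example.com/"]

# Record the waits instead of actually sleeping, to keep the demo fast
waits = []
links = retry_with_backoff(flaky_search, attempts=4, base_delay=5.0, sleep=waits.append)
print(links, waits)
```

In the notebook, `fn` could be something like `lambda: list(search(query, sleep_interval=5, num_results=num_nodes))`, where `list(...)` forces the whole search to run inside the retried call.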

In [4]:
print(results)
keys = [urlparse(f).netloc for f in results]
print(keys)

['https://www.wikihow.com/Fix-a-Flat-Tire', 'https://www.quora.com/How-can-you-fix-a-flat-tire-on-your-car-without-taking-it-off-and-having-it-towed-somewhere', 'https://techtirerepairs.com/flat-tire-how-to-safely-fix/', 'https://www.quora.com/What-can-I-do-to-fix-a-flat-tire-in-the-middle-of-nowhere', 'https://www.mach1services.com/fix-a-flat-tire-at-home/', 'https://germaniainsurance.com/blogs/post/germania-insurance-blog/2021/01/29/how-to-fix-a-flat-tire-what-to-do-if-you-have-a-flat-and-no-spare', 'https://www.amfam.com/resources/articles/on-the-road/11-steps-to-fix-a-flat-tire', 'https://www.progressive.com/lifelanes/on-the-road/how-to-fix-a-flat-tire/', 'https://www.sullivantire.com/blog/tires/proper-tire-repair', 'https://www.bicycling.com/repair/a20013517/bike-repair-how-to-fix-a-flat-tire/', 'https://www.tires-easy.com/blog/how-to-fix-a-flat-tire/', 'https://blog.napacanada.com/en/how-to-fix-a-flat-tire/', 'https://www.bridgestonetire.com/learn/maintenance/how-to-change-a-flat

#### Helper function that gets all the hyperlinks from a web page

The popularity of a website is determined by how many other websites reference it (how many incoming connections that node has). This function scrapes the website for any URLs it references.

One useful thing we do is ignore every hyperlink that points back to the page's own website. For example, the webpage

`https://www.reliancedigital.in/solutionbox/how-to-diagnose-laptop-problems-and-fix-them/`

has the following hyperlinks inside its own text:
```
https://www.reliancedigital.in/solutionbox/category/product-reviews/
https://www.reliancedigital.in/solutionbox/category/product-reviews/mobiles-tablets-reviews/
https://www.reliancedigital.in/solutionbox/category/product-reviews/computers-laptops-product-review/
https://www.reliancedigital.in/solutionbox/category/product-reviews/tv-audio-product-reviews/

...

https://www.reliancedigital.in/solutionbox/category/buying-guides/home-appliances-buying-guides/
https://www.reliancedigital.in/solutionbox/category/buying-guides/health-personalcare/
https://www.reliancedigital.in/solutionbox/category/buying-guides/batteries-juice-packs/
https://www.reliancedigital.in/solutionbox/category/buying-guides/gaming-buying-guides/ 
```

and many, many more that come from the same domain, `reliancedigital.in`. We do not want these self-references counted as connections, because a website that keeps linking to itself would inflate its own popularity in the graph. This would skew our PageRank results: the algorithm would treat the website as popular because it is referenced so often, when in reality it is only referencing itself (almost like cheating).

So we filter out all the hyperlinks that come from the same domain. That way we keep only the hyperlinks that point to other sources, which adds variety to the graph and gives a more representative picture of popularity.

In [5]:
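def get_all_hyperlinks(url):
    # A timeout keeps one unresponsive site from stalling the whole crawl
    reqs = requests.get(url, timeout=10)
    soup = BeautifulSoup(reqs.text, 'html.parser')

    urls = []
    for link in soup.find_all('a'):
        hyperlink = link.get('href')

        # Anchors without an href attribute have nothing to parse
        if hyperlink is None:
            continue

        # Malformed hrefs can make urlparse raise a ValueError
        # So keep on running instead of throwing an Exception
        try:
            domain_name = urlparse(hyperlink).netloc
        except ValueError:
            continue # Onto the next link

        # Filter out all the hyperlinks that reference themselves
        # (relative links have an empty netloc and are skipped too)
        if domain_name and domain_name not in url:
            urls.append(domain_name)

    return urls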
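
The filter above is a substring test against the page URL, so it treats subdomains as separate websites. A stricter variant could compare base domains instead. The sketch below assumes a naive definition of base domain (the last two labels of the hostname), which is wrong for suffixes such as `.co.uk`; `base_domain` and `is_external` are hypothetical helpers, not part of the notebook.

```python
from urllib.parse import urlparse

def base_domain(host):
    """Naive base domain: the last two dot-separated labels (wrong for suffixes like .co.uk)."""
    return ".".join(host.lower().split(".")[-2:])

def is_external(hyperlink, page_url):
    """True if hyperlink points at a different base domain than page_url."""
    link_host = urlparse(hyperlink).hostname
    if not link_host:  # relative link such as "/Main-Page"
        return False
    page_host = urlparse(page_url).hostname or ""
    return base_domain(link_host) != base_domain(page_host)

page = "https://www.wikihow.com/Fix-a-Flat-Tire"
print(is_external("https://fr.wikihow.com/", page))        # same base domain
print(is_external("https://www.youtube.com/watch", page))  # different website
print(is_external("/Main-Page", page))                     # relative link
```

Whether subdomains should count as the same website is a modeling choice; this variant simply makes that choice explicit.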

#### Helper function that makes a temp CSV file to store data so far

We store the data in a file as an intermediary instead of in a variable.

In [26]:
def make_nodes(path):
    with open("links.txt", 'r') as fp:
        lines = fp.readlines()

    with open(path, 'w') as fp:
        # NOTE: the start index 141 is hard-coded; use 0 to process every link
        for i in range(141, len(lines)):
            try:
                print(i)

                # Fetch the links before writing anything, so that a failed
                # request cannot leave a half-written row in the file
                hlinks = get_all_hyperlinks(lines[i].rstrip())

                fp.write(keys[i].rstrip())
                for l in hlinks:
                    fp.write(", " + l)
                fp.write("\n")

                time.sleep(0.1)

            # This is usually bad practice, but if we encounter a connection error
            # or a block from the website, we want to continue as if nothing
            # happened in order to build our dataset
            except Exception:
                continue


In [27]:
make_nodes("temp.csv")

141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201


#### Create a dictionary from the temp CSV file

Each key is a node, and each value is a list of all the websites that node has hyperlinks to. We then remove duplicate hyperlinks, because a connection only needs to be recorded once.

In [28]:
def csv_to_dict(path):
    node_dict = {}

    with open(path, 'r') as file:
        for line in file:
            # Split the line by comma and strip whitespace
            entries = [entry.strip() for entry in line.split(',')]

            # Use the first entry as the key and the rest as values.
            # split() always returns at least one entry, so test the key
            # itself to skip blank lines
            if entries[0]:
                # A domain that appears on several rows keeps only its last row
                node_dict[entries[0]] = entries[1:]

    # Remove duplicates (a set does not preserve the original link order)
    for k in node_dict:
        node_dict[k] = list(set(node_dict[k]))

    return node_dict

node_dict = csv_to_dict("temp.csv")
print(node_dict)

{'www.wikihow.com': ['www.facebook.com', 'fr.wikihow.com', 'www.pinterest.com', 'www.carsdirect.com', 'www.wikihow.it', 'www.youtube.com', 'knowhow.napaonline.com', 'www.gonift.com', 'twitter.com', 'ar.wikihow.com', 'autorepair.about.com', 'ru.wikihow.com', 'www.instagram.com', 'www.tiktok.com'], 'www.quora.com': [], 'techtirerepairs.com': [], 'www.mach1services.com': [], 'germaniainsurance.com': ['classic.germaniaconnect.com', 'www.facebook.com', 'www.autoguide.com', 'germania-ciam.okta.com', 'www.instagram.com', 'policyholders.germaniaconnect.com', 'twitter.com', 'www.thedrive.com', 'www.brandtackle.com', 'roadsumo.com', 'germaniacreditunion.com', 'www.linkedin.com'], 'www.amfam.com': ['www.facebook.com', 'play.google.com', 'www.ghsa.org', 'instagram.com', 'www.pinterest.com', 'newsroom.amfam.com', 'injuryfacts.nsc.org', 'www.twitter.com', 'www.youtube.com', 'apps.apple.com', 'www.digicert.com', 'b2b.amfam.com', 'www.iii.org', 'www.ncsl.org', 'www.linkedin.com', 'chat-ui.amfam.com'],
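
Because `set` does not preserve order, the connection lists can come out in a different order from one run to the next. If a stable order matters, `dict.fromkeys` removes duplicates while keeping the first occurrence of each entry. This is a small illustration with made-up data, not a change to the function above.

```python
# Hypothetical list of scraped connections containing repeats
connections = [
    "twitter.com",
    "www.youtube.com",
    "twitter.com",
    "www.facebook.com",
    "www.youtube.com",
]

# Dictionary keys are unique and keep insertion order
unique_in_order = list(dict.fromkeys(connections))
print(unique_in_order)
```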

#### Main function that creates the CSV graph

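Parses the dictionary and writes one row per node, listing the 1-based indices of the nodes it links to. Links to websites outside our node set are dropped, which contributes to the sparsity of the graph.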

In [29]:
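def create_csv_dataset(path):
    # Number the nodes by their row position in the output file. Looking
    # indices up in `keys` instead would misalign rows and indices, because
    # `keys` can contain duplicate domains and nodes that failed to scrape
    nodes = list(node_dict)
    # Recall that Julia is 1-indexed, not 0-indexed like Python
    node_index = {n: i + 1 for i, n in enumerate(nodes)}

    with open(path, 'w') as file:
        for n in nodes:
            indexed_connections = []

            for c in node_dict[n]:
                # Links to websites outside our node set are dropped
                if c in node_index:
                    indexed_connections.append(str(node_index[c]))

            file.write(", ".join(indexed_connections))
            file.write("\n")

create_csv_dataset("test.csv")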
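
To make the output format concrete, here is a tiny in-memory version of the same idea on a three-node graph. `adjacency_rows` and the sample domains are hypothetical; the sketch assumes that row i of the file describes node i.

```python
def adjacency_rows(nodes, node_dict):
    """Return one string per node listing the 1-based indices of the nodes it links to."""
    position = {name: i + 1 for i, name in enumerate(nodes)}  # 1-based for Julia
    rows = []
    for name in nodes:
        targets = sorted(position[c] for c in node_dict.get(name, []) if c in position)
        rows.append(", ".join(str(t) for t in targets))
    return rows

nodes = ["a.com", "b.com", "c.com"]
node_dict = {
    "a.com": ["b.com", "c.com", "www.facebook.com"],  # facebook is not a node, so it is dropped
    "b.com": [],
    "c.com": ["a.com"],
}

for row in adjacency_rows(nodes, node_dict):
    print(repr(row))
```

The first row reads `2, 3` because `a.com` links to the second and third nodes, and the empty second row means `b.com` has no links to any other node in the set.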