#### Quick script that creates a realistic website dataset

The goal of this script is to build a realistic graph from real websites. We collect the result URLs for a specific search query and paste them into `links.txt`. Each URL becomes a node, and the hyperlinks found on that page become its outgoing connections. From these we build the CSV dataset, which is then fed into our algorithm in `pagerank.ipynb`.

The resulting graph is very sparse, unlike the densely connected graph that `generator.ipynb` creates artificially.

By creating a realistic graph, albeit far smaller than the one Google uses, we hope to capture many of the real-life obstacles PageRank faces: sinks, important sources, popular social media websites that may be irrelevant, links to ads, and more. A randomly generated graph would not exhibit these.

In [1]:
# What you want to search
query = "how to fix a flat tire"
# How many results do you want
num_nodes = 200

200

In [None]:
using HTTP
using Gumbo
using Cascadia

Collect all the websites that will be used as our nodes. A quick Google search for the query gives us the list, which we save one URL per line in `links.txt` and load below.

Keep in mind that Google sometimes blocks this script when we query too rapidly and it detects that we are a bot, so there is a cooldown period before we can try again.

An interesting benchmark for our PageRank algorithm would be how close our results are to a real Google search.
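
One simple way to quantify that closeness is the fraction of domains shared by the top `k` entries of both rankings. The sketch below is only an illustration: the function name `topk_overlap` and the choice of metric are assumptions, not part of this notebook.

```julia
# Fraction of the top-k entries that two rankings have in common.
# `ours` and `reference` are vectors ordered from best to worst.
# Hypothetical helper, not defined elsewhere in this notebook.
function topk_overlap(ours, reference, k)
    # Never look past the end of the shorter ranking
    k = min(k, length(ours), length(reference))
    k == 0 && return 0.0
    shared = intersect(ours[1:k], reference[1:k])
    return length(shared) / k
end
```

A value of `1.0` means both rankings agree on which domains make the top `k` (ignoring their order inside it), and `0.0` means they share none.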

In [3]:
results = String[]

# Open the file and read line by line
open("links.txt", "r") do file
    for line in eachline(file)
        # Strip the line to remove any leading/trailing whitespace
        clean_line = strip(line)

        # Append the cleaned line to the links array
        push!(results, clean_line)
    end
end

LoadError: SystemError: opening file "links.txt": No such file or directory

In [4]:
print(results)
keys = [extract_domain(f) for f in results]
print(keys)

String[]Union{}[]
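
The cell above calls `extract_domain`, which is not defined anywhere in this notebook. A minimal sketch of what it might look like follows, assuming it should return the host part of an absolute URL. The regular expression is a simplification (it ignores user info and IPv6 literals); `HTTP.URIs.URI(url).host` would be an alternative.

```julia
# Assumed behaviour: return the host of an absolute URL, or `nothing`
# when the string has no scheme and host (for example a relative path).
function extract_domain(url::AbstractString)
    # scheme "://" host, where the host stops at '/', '?', '#' or ':'
    m = match(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://([^/?#:]+)", strip(url))
    return m === nothing ? nothing : lowercase(String(m.captures[1]))
end
```

With this definition, `extract_domain("https://www.wikihow.com/Fix-a-Flat-Tire")` gives `"www.wikihow.com"`.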

#### Helper function that gets all the hyperlinks from a web page

The popularity of a website is determined by how many other websites reference it (how many incoming connections that node has). This function scrapes a page for every URL it references.

One useful thing we do is ignore every hyperlink that points back to the page's own website. For example, the webpage

`https://www.reliancedigital.in/solutionbox/how-to-diagnose-laptop-problems-and-fix-them/`

has the following hyperlinks in its own text:
```
https://www.reliancedigital.in/solutionbox/category/product-reviews/
https://www.reliancedigital.in/solutionbox/category/product-reviews/mobiles-tablets-reviews/
https://www.reliancedigital.in/solutionbox/category/product-reviews/computers-laptops-product-review/
https://www.reliancedigital.in/solutionbox/category/product-reviews/tv-audio-product-reviews/

...

https://www.reliancedigital.in/solutionbox/category/buying-guides/home-appliances-buying-guides/
https://www.reliancedigital.in/solutionbox/category/buying-guides/health-personalcare/
https://www.reliancedigital.in/solutionbox/category/buying-guides/batteries-juice-packs/
https://www.reliancedigital.in/solutionbox/category/buying-guides/gaming-buying-guides/ 
```

and many more that come from the same domain, `reliancedigital.in`. If we counted these links, the site would inflate its own popularity in the graph simply by referencing itself. That would skew our PageRank findings: the algorithm would treat the website as popular because it is referenced so often, when in reality it is only referencing itself (almost like cheating).

So we filter out every hyperlink whose domain matches the page's own. What remains are links to genuinely external sources, which adds variety to the graph and gives a more representative picture of popularity.
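
As a toy illustration of this rule (not the notebook's actual helper), the sketch below filters a hand-written list of link domains. The name `external_domains` is hypothetical, and it compares domains by exact match, which is stricter than the substring check used in the real function.

```julia
# Keep only the domains that differ from the page's own domain,
# listing each remaining domain once.
# Hypothetical helper for illustration only.
function external_domains(page_domain, link_domains)
    return unique(d for d in link_domains if d != page_domain)
end

links = ["www.reliancedigital.in", "www.facebook.com",
         "www.reliancedigital.in", "twitter.com", "www.facebook.com"]
external = external_domains("www.reliancedigital.in", links)
```

Here `external` holds only `www.facebook.com` and `twitter.com`: both self-references are gone and the repeated Facebook link is counted once.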

In [5]:
function get_all_hyperlinks(url::String)
    # Perform the HTTP request and parse the HTML
    response = HTTP.get(url)
    soup = parsehtml(String(response.body))

    # Extract the domain of every hyperlink on the page
    urls = String[]
    for link in eachmatch(Selector("a"), soup.root)
        # Anchors without an href attribute yield nothing and are skipped
        hyperlink = get(link.attributes, "href", nothing)
        isnothing(hyperlink) && continue

        # Extract the domain name; relative links (e.g. "/about") have an empty host
        domain_name = HTTP.URIs.URI(hyperlink).host

        # Filter out relative and self-referencing hyperlinks. `occursin` is
        # needed here: `in`/`∉` between two strings throws an error in Julia.
        if !isempty(domain_name) && !occursin(domain_name, url)
            push!(urls, domain_name)
        end
    end

    return urls
end

#### Helper function that makes a temp CSV file to store data so far

We store the data in an intermediary file instead of in a variable, writing one line per node: the node's domain followed by the domains it links to.

In [26]:
function make_nodes(path)
    # Read lines from the links.txt file
    lines = readlines("links.txt")

    # Open the output file for writing
    open(path, "w") do fp
        for i in eachindex(lines)
            # Print the index as a progress indicator
            println(i)

            # Get all hyperlinks from the page. Swallowing every error is usually
            # bad practice, but if we encounter a connection error or a blockage
            # from the website, we keep the node with no outgoing links and
            # continue building the dataset.
            hlinks = try
                # `strip` returns a SubString, so convert it to the String
                # that get_all_hyperlinks expects
                get_all_hyperlinks(String(strip(lines[i])))
            catch
                String[]
            end

            # Write the key and its hyperlinks as one complete line, so a failed
            # request can never leave a half-written row behind.
            # Assumes the global `keys` array is available.
            println(fp, join([chomp(keys[i]); hlinks], ", "))

            # Sleep for a short time between requests
            sleep(0.1)
        end
    end
end
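
Since requests can fail temporarily or be rate-limited, one option is to retry a few times with a growing delay before giving up on a page. The sketch below is an assumption about how that could look, not something this notebook does; `with_retries`, `attempts`, and `base_delay` are hypothetical names.

```julia
# Call `f()` up to `attempts` times, doubling the wait after each failure.
# The last failure is rethrown so the caller can still decide what to do.
# Hypothetical helper, not part of the original notebook.
function with_retries(f; attempts::Int=3, base_delay::Real=1.0)
    for attempt in 1:attempts
        try
            return f()
        catch
            attempt == attempts && rethrow()
            sleep(base_delay * 2^(attempt - 1))
        end
    end
end
```

Inside `make_nodes`, the scraping call could then be wrapped as `with_retries(() -> get_all_hyperlinks(url))`, where `url` stands for the cleaned line.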

In [27]:
make_nodes("temp.csv")

141
142
143
...
199
200
201


#### Create a dictionary from the temp CSV file

Each key is a node, and its value is an array of all the websites that node has hyperlinks to. We then remove duplicate hyperlinks because only one is needed.

In [28]:
function csv_to_dict(path)
    node_dict = Dict{String, Vector{String}}()

    # Open the file and read line by line
    open(path, "r") do file
        for line in eachline(file)
            # Skip blank lines
            isempty(strip(line)) && continue

            # Split on commas and strip the whitespace around every entry,
            # since the file is written with ", " as the separator
            entries = String.(strip.(split(line, ',')))
            key = entries[1]
            values = filter(!isempty, entries[2:end])

            # Add to the dictionary; `union` drops duplicate hyperlinks
            node_dict[key] = union(get(node_dict, key, String[]), values)
        end
    end

    return node_dict
end

node_dict = csv_to_dict("temp.csv")
println(node_dict)

{'www.wikihow.com': ['www.facebook.com', 'fr.wikihow.com', 'www.pinterest.com', 'www.carsdirect.com', 'www.wikihow.it', 'www.youtube.com', 'knowhow.napaonline.com', 'www.gonift.com', 'twitter.com', 'ar.wikihow.com', 'autorepair.about.com', 'ru.wikihow.com', 'www.instagram.com', 'www.tiktok.com'], 'www.quora.com': [], 'techtirerepairs.com': [], 'www.mach1services.com': [], 'germaniainsurance.com': ['classic.germaniaconnect.com', 'www.facebook.com', 'www.autoguide.com', 'germania-ciam.okta.com', 'www.instagram.com', 'policyholders.germaniaconnect.com', 'twitter.com', 'www.thedrive.com', 'www.brandtackle.com', 'roadsumo.com', 'germaniacreditunion.com', 'www.linkedin.com'], 'www.amfam.com': ['www.facebook.com', 'play.google.com', 'www.ghsa.org', 'instagram.com', 'www.pinterest.com', 'newsroom.amfam.com', 'injuryfacts.nsc.org', 'www.twitter.com', 'www.youtube.com', 'apps.apple.com', 'www.digicert.com', 'b2b.amfam.com', 'www.iii.org', 'www.ncsl.org', 'www.linkedin.com', 'chat-ui.amfam.com'],
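
To make the file format concrete: each line of `temp.csv` holds a node followed by the domains it links to, separated by `", "`. The sketch below parses one such line in isolation. The helper name `parse_node_line` is hypothetical and the sample line is hand-written for illustration.

```julia
# Turn one line of the temp CSV into `node => unique linked domains`.
# Returns `nothing` for a blank line. Hypothetical helper.
function parse_node_line(line::AbstractString)
    # Split on commas, trim each field, and drop empty fields
    fields = String.(strip.(split(line, ',')))
    filter!(!isempty, fields)
    isempty(fields) && return nothing
    return fields[1] => unique(fields[2:end])
end

entry = parse_node_line("www.wikihow.com, www.facebook.com, twitter.com, www.facebook.com")
```

The repeated `www.facebook.com` appears only once in `entry.second`, and none of the values carries the leading space left over from the separator.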

#### Main function that creates the CSV graph

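Parses the dictionary and, for each node, looks up the index of every connection that is itself a node, writing those indices as one row of the output file. Connections to websites outside our node list are dropped.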

In [29]:
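function create_csv_dataset(path)
    # Open the file for writing
    open(path, "w") do file
        # Iterate over `keys`, not over `node_dict`: a Dict has no guaranteed
        # order, and row i of the file must describe node i
        for n in keys
            indexed_connections = String[]

            for c in get(node_dict, n, String[])
                # Keep only the connections that are themselves nodes
                idx = findfirst(==(c), keys)
                if idx !== nothing
                    push!(indexed_connections, string(idx))
                end
            end

            # Write to file
            write(file, join(indexed_connections, ", "))
            write(file, "\n")
        end
    end
end

create_csv_dataset("test.csv")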
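
To show what the resulting rows look like, here is a toy version of the same index mapping on three made-up nodes. The function `adjacency_rows` and the sample data are hypothetical; the indices are 1-based, following Julia's convention.

```julia
# Build one text row per node, listing the 1-based indices of the nodes
# it links to. Links to domains outside `nodes` are dropped.
# Hypothetical helper with made-up data, for illustration only.
function adjacency_rows(nodes, links)
    index_of = Dict(name => i for (i, name) in enumerate(nodes))
    return [join([index_of[t] for t in get(links, n, String[])
                  if haskey(index_of, t)], ", ")
            for n in nodes]
end

nodes = ["a.com", "b.com", "c.com"]
links = Dict("a.com" => ["b.com", "x.com"],
             "b.com" => ["a.com", "c.com"],
             "c.com" => String[])
rows = adjacency_rows(nodes, links)
```

Row 1 is `"2"` because `a.com` links to `b.com` (`x.com` is not a node and is dropped), row 2 is `"1, 3"`, and row 3 is empty because `c.com` is a sink.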