<P> <img src="https://i.ibb.co/gyNf19D/nhslogo.png" alt="nhslogo" border="0" width="100" align="right"><font size="6"><b> CS4132 Data Analytics</b> </font>

In [1]:
import numpy as np
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup
s = requests.Session()
s.headers.update({'User-Agent': 'Chrome/114.0.0.0'})

# Research Task

# by Dominic and Aik Lok

## Introduction

<div class="alert alert-block alert-warning">
a)	A brief introduction to web scraping and its related ethical and legal concerns, including some interesting legal case studies.
</div>

### What is webscraping?
Webscraping, as the name implies, is the automated extraction of data from websites using software or code. The data can be anything on the page: the raw html, text, images or tables. The extracted data can then be saved locally, for example as a json, csv or excel file.

Web pages are built using text-based mark-up languages (HTML and XHTML) and contain a lot of useful information for data analytics. The problem is that web pages are built for end-users: a human looking at a website sees a well-organised table, but a machine cannot simply read it off the screen. Instead, we can use specialised tools such as Scrapy and Beautiful Soup to read the mark-up and store its data.

Some uses of webscraping include:
- Data collection for projects such as this one
- Obtaining information on other competitors in the market
- Looking through social media for specific comments or pieces of information, as well as statistics related to it for sentiment analysis or other uses
- Prices for things such as stock or property/real estate

### Why is it a concern?
Ethics:
- Some information is meant to be private, and obtaining it without the users' consent is unethical. It can also cause legal issues if the scraped data is personal data (refer to clause 13 of the PDPA)

- Scraping takes up a lot of the scraped website's server resources and can put a strain on the servers, which can degrade the experience of the site's other users. This can be seen as a form of DDoS attack. (refer to singaporelegaladvice)

- Some websites contain content that was created/collated by individuals and organisations, so taking it without crediting them is akin to plagiarism. It can also cause legal issues in the form of copyright infringement if the scraped data is copyrighted and is used without consent. (refer to singaporelegaladvice once again, as well as copyright act 2021)

Legality:
- Excessive scraping can amount to a form of DDoS attack, which is illegal (refer to clause 7c of the CMA 1993)

- Some websites have terms and conditions of use, which may state that webscraping is not allowed (refer to singaporelegaladvice once again)
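
One way to reduce the strain described above is to check a site's robots.txt and pause between requests. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the example.com urls and the helper names `allowed` and `polite_delay` are made up for illustration, and following robots.txt does not by itself settle the legal questions raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if robots.txt permits this user agent to fetch the url."""
    return parser.can_fetch(user_agent, url)


def polite_delay(user_agent="*", default=1.0):
    """Seconds to wait between requests: the Crawl-delay if given, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


for url in ["https://example.com/catalogue/page-1.html",
            "https://example.com/private/data.html"]:
    print(url, "allowed" if allowed(url) else "disallowed")
print("wait", polite_delay(), "seconds between requests")
```

In a real scraper, `time.sleep(polite_delay())` would go between consecutive requests, and disallowed urls would simply be skipped.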

### Examples/Case Studies:
- In 2017, PropertyGuru sued 99co for reposting its property listings on 99co's website, arguing that the photos shown were copyrighted and that this was also stated in its terms of use. It was later found that PropertyGuru did not legally own the photos, since it had not altered them significantly. Although the official sites of both parties claim victory, the general consensus was that 99co won. The listings were copied manually rather than web scraped, but since the concern is copyright, using a webscraper instead would not have protected 99co from being sued. (refer to sources 7 to 10)

- In 2017, LinkedIn sent hiQ a cease-and-desist letter for scraping publicly available LinkedIn profiles, claiming that hiQ had violated the CFAA and DMCA. The courts initially sided with hiQ, but a later ruling found that hiQ had breached LinkedIn's User Agreement, and both companies then reached a proposed settlement. This raised the issue of whether publicly available information can or should be scraped, and whether website owners should be able to limit the scraping of such data. (refer to sources 11 to 13)

## Overview and Setup

<div class="alert alert-block alert-warning">
b)	A brief overview of the assigned library, and detailed installation instructions.
</div>

### What is BeautifulSoup?
It is an open-source Python module used for web scraping. It is installed from a terminal with pip or conda (we used bash in a Linux subsystem). It allows you to parse the html or xml of a website and store the scraped data in many different formats (depending on what was scraped). Some of its functions include:
- parse html from a website
- grab and store the tables on a specific site
- search and filter specific elements or tags in the html
- navigate through the html tree
- modify the html tree (when it is stored locally in a .html file) 
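
The functions listed above can be sketched on a small, made-up html snippet, so no website is contacted. The snippet and the variable names are illustrative only; `find_all`, `find`, `.parent`, `find_next_sibling` and `.string` are standard BeautifulSoup features.

```python
from bs4 import BeautifulSoup

# A made-up page, inlined so the example runs offline.
html = """
<html><body>
  <h1 id="title">Book list</h1>
  <ul>
    <li class="book"><a href="a.html">First book</a></li>
    <li class="book"><a href="b.html">Second book</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # parse the html

# Search and filter: every <a> tag, then its text and href attribute.
titles = [a.get_text(strip=True) for a in soup.find_all("a")]
hrefs = [a["href"] for a in soup.find_all("a")]

# Navigate the tree: from the first <li> up to its parent and across to its sibling.
first_item = soup.find("li", class_="book")
parent_name = first_item.parent.name
next_item = first_item.find_next_sibling("li")

# Modify the tree: change the heading text in the parsed copy.
soup.h1.string = "Updated list"

print(titles, hrefs, parent_name, next_item.a["href"], soup.h1.get_text())
```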

### Installation Guide
For the installation, you will need to have python and a Linux subsystem, both of which I already had installed. Since we are doing this in vscode, we will be giving instructions on how to install BeautifulSoup in vscode:
1. create a new terminal in vscode (using ctrl + shift + p and typing in create terminal) and type in the "bash" command in order to change it to a linux terminal
2. navigate (cd ./file/path) to where your .vscode file is (or where you installed vscode)
3. enter the command "sudo pip install beautifulsoup4", which will require you to enter the password for the linux subsystem administrator account. (sudo may not be necessary, but is what I used in case it requires permissions)
4. if you do not have html5lib installed, run "sudo pip install html5lib" in the terminal

If you are doing this in Jupyter notebook, you would do this instead:
1. open anaconda prompt
2. enter "conda install -c anaconda beautifulsoup4"
3. restart your jupyter notebook
4. if you do not have html5lib installed, run "pip install html5lib" in any notebook cell

That is the entire installation process. To start using it, just import it using "from bs4 import BeautifulSoup" and "import html5lib" at the start of your code.
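
To confirm that the installation worked, a quick check like the following can be run in a notebook cell. It only uses the standard library; the helper name `is_installed` is our own, and lxml is included in the list because it is used later in this tutorial.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of the beautifulsoup4 package.
for name in ("bs4", "html5lib", "lxml"):
    print(name, "is installed" if is_installed(name) else "is MISSING")
```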

## Scraping bookstoscrape

<div class="alert alert-block alert-warning">
c)	Clear and executable code snippets illustrating the basics of the library and its usage for web scraping using http://books.toscrape.com/ as an example.
</div>

First, we need to import all of the required libraries

In [2]:
import numpy as np
import requests
import pandas as pd
from bs4 import BeautifulSoup

Next, we use BeautifulSoup to scrape all the available html links of each book.

Each catalogue page contains the links to its books, but with one catch: the links are spread out over 50 pages, where each page's url has the format "http://books.toscrape.com/catalogue/page-X.html" and X is the page number. Since we know there are 50 pages in total and the page index starts from 1, we can use a for loop to iterate through all 50 pages. On each page, we find all hyperlink tags ("\<a\>" tags), take their "href" attribute, and store it, concatenated with the base url, in the "references" list. This list then contains every url that can be reached by clicking a link on any of the 50 pages.

In [3]:
url = "https://books.toscrape.com/catalogue/"
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
references = []
for i in range(1, 51):  # For pages 1 to 50
    response = requests.get(base_url.format(i))
    soup = BeautifulSoup(response.content, 'html.parser')
    links = soup.find_all('a') #store all hyperlink tags
    for link in links: #get all the href attribute
        href = link.get('href')
        newRef = url + href #append to url format
        references.append(newRef)
references

['https://books.toscrape.com/catalogue/../index.html',
 'https://books.toscrape.com/catalogue/../index.html',
 'https://books.toscrape.com/catalogue/category/books_1/index.html',
 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'https://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'https://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'https://books.tos
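
Notice that plain string concatenation leaves relative parts such as "../" inside the links. As an alternative sketch (not what the code above does), the standard library's `urllib.parse.urljoin` resolves a relative href against the url of the page it was found on:

```python
from urllib.parse import urljoin

# The page the links were found on.
page_url = "https://books.toscrape.com/catalogue/page-1.html"

# A link back to the home page and a link to a book, as they appear in href attributes.
home_link = urljoin(page_url, "../index.html")
book_link = urljoin(page_url, "its-only-the-himalayas_981/index.html")

print(home_link)
print(book_link)
```

The "../" is resolved away for the home page link, while the book link ends up in the same form as the one produced by concatenation.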

We then notice that most of these links don't lead to book information. The url format for books is https://books.toscrape.com/catalogue/book_name_some-numbers/index.html (for example https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html), while some of the other links have "category" in their name. Hence, we first remove any links with "category" in them. Next, we remove anything else that isn't in the format of a book url, namely the links back to the home page and the links to other catalogue pages. Lastly, we remove all duplicate links to finally get a list of 1000 links, which correspond to the 1000 books on the website.

In [4]:
references = np.array(references)
mask = ["category" not in link for link in references] #drop category links
references = references[mask]
mask2 = [link != 'https://books.toscrape.com/catalogue/../index.html' for link in references] #drop links back to the home page
references = references[mask2]
mask3 = ["page" not in link for link in references] #drop links to other catalogue pages
references = references[mask3]
references = np.unique(references).tolist() #remove duplicates (np.unique also sorts the links)
references 

['https://books.toscrape.com/catalogue/10-day-green-smoothie-cleanse-lose-up-to-15-pounds-in-10-days_581/index.html',
 'https://books.toscrape.com/catalogue/10-happier-how-i-tamed-the-voice-in-my-head-reduced-stress-without-losing-my-edge-and-found-self-help-that-actually-works_582/index.html',
 'https://books.toscrape.com/catalogue/1000-places-to-see-before-you-die_1/index.html',
 'https://books.toscrape.com/catalogue/112263_583/index.html',
 'https://books.toscrape.com/catalogue/13-hours-the-inside-account-of-what-really-happened-in-benghazi_645/index.html',
 'https://books.toscrape.com/catalogue/1491-new-revelations-of-the-americas-before-columbus_650/index.html',
 'https://books.toscrape.com/catalogue/1st-to-die-womens-murder-club-1_2/index.html',
 'https://books.toscrape.com/catalogue/23-degrees-south-a-tropical-tale-of-changing-whether_556/index.html',
 'https://books.toscrape.com/catalogue/32-yolks_510/index.html',
 'https://books.toscrape.com/catalogue/8-keys-to-mental-health-t
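
The three masks above could also be expressed as a single pattern match. The following is a sketch of that alternative, assuming every book url has the form described earlier (a name, an underscore, some digits, then /index.html). The pattern and the sample list are our own and were only checked against the handful of links shown here.

```python
import re

# Book pages look like .../catalogue/<name>_<digits>/index.html
BOOK_URL = re.compile(r"^https://books\.toscrape\.com/catalogue/[^/]+_\d+/index\.html$")

# A small sample of the kinds of links collected earlier.
sample_links = [
    "https://books.toscrape.com/catalogue/../index.html",
    "https://books.toscrape.com/catalogue/category/books_1/index.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html",
    "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html",
    "https://books.toscrape.com/catalogue/112263_583/index.html",
]

# Keep only book urls, drop duplicates with a set, and sort for a stable order.
book_links = sorted({link for link in sample_links if BOOK_URL.match(link)})
print(book_links)
```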

Now that we have cleaned our links, we can use the Pandas library to read the tabular data on each book's page, and concatenate all of it into a single dataframe. We store the first book in order to have a dataframe to append the rest of the books to. Then, we parse the html of each of the remaining books and append it to the original dataframe.

In [6]:
from io import StringIO

bookLink = references[0]
url = requests.get(bookLink)
url.encoding = 'utf-8' #the page is UTF-8; without this the pound sign is decoded as Â£
dataframe = pd.read_html(StringIO(url.text)) #stores all tables in a list
dataframe = dataframe[0] #since there is only 1 table we get the 0th one
dataframe = dataframe.T
dataframe.columns = dataframe.iloc[0] #the top row is used as column names
dataframe = dataframe.iloc[1:,:] #bottom row is the data we want

for i in range(1,len(references)): #repeat for all other books
    link = references[i]
    url = requests.get(link)
    url.encoding = 'utf-8'
    df = pd.read_html(StringIO(url.text))
    df = df[0]
    df = df.T
    df.columns = df.iloc[0]
    df = df.iloc[1:,:]
    dataframe = pd.concat([dataframe, df], ignore_index=True)
dataframe.head()

| | UPC | Product Type | Price (excl. tax) | Price (incl. tax) | Tax | Availability | Number of reviews |
|---|---|---|---|---|---|---|---|
| 0 | 96aa539bfd4c07e2 | Books | £49.71 | £49.71 | £0.00 | In stock (10 available) | 0 |
| 1 | 34669b2e9d407d3a | Books | £24.57 | £24.57 | £0.00 | In stock (10 available) | 0 |
| 2 | 228ba5e7577e1d49 | Books | £26.08 | £26.08 | £0.00 | In stock (1 available) | 0 |
| 3 | a9d7b75461084a26 | Books | £48.48 | £48.48 | £0.00 | In stock (11 available) | 0 |
| 4 | 4be9d1910f8a4e80 | Books | £27.06 | £27.06 | £0.00 | In stock (13 available) | 0 |
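
Calling pd.concat inside the loop copies the growing dataframe on every iteration. A possible alternative, sketched below on made-up two-column tables instead of the real pages, is to turn each table into a one-row dataframe, collect the rows in a list and concatenate once at the end. The helper name `table_to_row` and the sample values are hypothetical.

```python
import pandas as pd


def table_to_row(table):
    """Turn a two-column (field, value) table into a one-row dataframe."""
    row = table.set_index(0).T      # the field names become the column names
    row.columns.name = None         # drop the leftover axis label
    return row.reset_index(drop=True)


# Stand-ins for what pd.read_html returns for two book pages.
raw_tables = [
    pd.DataFrame([["UPC", "aaa111"], ["Product Type", "Books"], ["Number of reviews", "0"]]),
    pd.DataFrame([["UPC", "bbb222"], ["Product Type", "Books"], ["Number of reviews", "3"]]),
]

rows = [table_to_row(table) for table in raw_tables]
books = pd.concat(rows, ignore_index=True)  # a single concatenation at the end
print(books)
```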


Finally, we can store the data in an excel file for further reference.

In [None]:
dataframe.to_excel("Books.xlsx")

## Case Study

<div class="alert alert-block alert-warning">
d)	Select a suitable case study website of your choice to showcase additional web scraping features. Include relevant code examples and explanations.
</div>

For this section, we will be using the lxml parser instead of html.parser, since it is generally considered to be faster.

In [8]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


For this section, we will be scraping a website called ctftime.org. It is a suitable site to scrape since it has many pages and filter options, as well as tables on most of its webpages. From this website, we would like to answer the question: "How many (archived) ctf events have there been in each year?"

For some background context, ctftime.org is a website for ctf players. It allows you to:
- post writeups for finished ctf competitions
- look at upcoming and past ctf competitions 
- create a team for ctf competitions which can:
    - earn rating by competing in and winning ctf competitions
    - climb the leaderboard for teams
    - recruit members (through link in description to other social media sites)
    - be more recognised 

Now, we would also like to manually parse our html (as a proof of concept) instead of using the pandas function. Hence, we will make a new function called parseTable which will help to parse any extracted tables from each site and output it as a (python) list instead.

In [3]:
def parseTable(htmltable):
    out = []
    tr = htmltable.find_all('tr') #find all table rows
    headerrow = [x.get_text(strip = True) for x in tr[0].find_all('th')] #text of the header cells in the first row
    if headerrow: #only true if the first row actually contains header cells
        out.append(headerrow)
        tr = tr[1:] #remove header row from list of table rows
    for row in tr: #iterate through every other row
        out.append([x.get_text(strip = True) for x in row.find_all('td')]) #collect the text of each cell into a list, then append that list to the output
    return out
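
To see what this function returns, here is a small offline check on a made-up table. The function body is repeated (as `parse_table`) only so that the snippet runs on its own; the sample html is not taken from ctftime.

```python
from bs4 import BeautifulSoup


def parse_table(htmltable):
    """Same logic as parseTable above, repeated so this sketch is self-contained."""
    rows = htmltable.find_all("tr")
    out = []
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    if header:
        out.append(header)
        rows = rows[1:]
    for row in rows:
        out.append([td.get_text(strip=True) for td in row.find_all("td")])
    return out


# A made-up table with one header row and two data rows.
sample_html = """
<table>
  <tr><th>Name</th><th>Weight</th></tr>
  <tr><td> Example CTF </td><td>25.0</td></tr>
  <tr><td>Another CTF</td><td>0.0</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")
parsed = parse_table(table)
print(parsed)
```

The result is a list of lists with the header first, and `strip=True` removes the spaces around " Example CTF ".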

We would also like to make another change. Unlike the previous example, we will use requests.Session(), which reuses the same underlying connection across requests instead of opening a new one every time, allowing our code to run faster.

When we first attempted to scrape ctftime, we were thrown a 403 error no matter which part of the site we tried to scrape. After reading up, we realised that we had to change our User-Agent so that we would not be blocked. Since the robots.txt of ctftime does not mention anything about allowed user agents, we had to find one by opening the network tab of the browser's inspect element tool and looking at the User-Agent section of any resource. From this, we found that 'Chrome/114.0.0.0' was among the allowed User-Agents.

In [4]:
s = requests.Session()
s.headers.update({'User-Agent': 'Chrome/114.0.0.0'})
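
To check that the header is really attached to every request made through the session, the request can be prepared without being sent. This is only a sketch for inspection; the variable names are our own and nothing is fetched from ctftime here.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Chrome/114.0.0.0"})

# Build the request and merge in the session's headers, but do not send it.
request = requests.Request("GET", "https://ctftime.org/event/list/")
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])
```

When the request is actually sent with `session.get(...)`, checking `response.status_code` (or calling `response.raise_for_status()`) shows straight away whether the site still answers with a 403.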

As mentioned before, we would like to scrape ctftime.org, or more specifically, all the archived events over the past. Visiting the webpage, we can gather some information:

Firstly, after playing around with the filtering options, we find that the url of each page of ctftime has the format "https://ctftime.org/event/list/?year=YEARNUMBER&online=NUMBER&format=NUMBER&restrictions=NUMBER". From this, we can gather that the query string includes the year, as well as numbers that denote the other filters available on the website when searching for archived events. However, when we attempt to use this format for archived events in 2023, it returns upcoming events. We then realised that the query string for archived 2023 events includes archive=BOOLEAN, making the final format: "https://ctftime.org/event/list/?year=YEARNUMBER&online=NUMBER&format=NUMBER&restrictions=NUMBER&archive=BOOLEAN". 

Example: https://ctftime.org/event/list/?year=2023&online=-1&format=0&restrictions=-1&archive=true 
(Some minor information: -1 refers to no filter for online/offline and restrictions, but format uses 0 to denote no filters)

Finally, we have our final url format of: 'https://ctftime.org/event/list/?year=YEARNUMBER&online=-1&format=0&restrictions=-1&archive=true'
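
Instead of assembling this url by hand, the query string can be built from a dictionary with the standard library's `urllib.parse.urlencode`. This is a sketch of an alternative to string concatenation; the helper name `archive_url` is our own.

```python
from urllib.parse import urlencode

BASE = "https://ctftime.org/event/list/"


def archive_url(year):
    """Build the url of the archived events page for one year, with no other filters."""
    query = {
        "year": year,
        "online": -1,        # -1: no online/offline filter
        "format": 0,         # 0: no format filter
        "restrictions": -1,  # -1: no restrictions filter
        "archive": "true",
    }
    return BASE + "?" + urlencode(query)


print(archive_url(2023))
```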

Secondly, ctftime has archived ctf competitions dating back to 2011. 

Lastly, if we attempt to store the table on the website in a numpy array, we find that there are 6 columns in the header row but 7 in every subsequent row. This is because there is an extra empty-string column between the weight and the notes columns.
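
The mismatch can be reproduced on made-up rows in plain Python. The sketch below drops a column by its position, with the position (5) taken from the description above. The helper name `drop_column` and the shortened sample rows are hypothetical.

```python
def drop_column(rows, index):
    """Return a copy of the rows with the cell at the given position removed."""
    return [[cell for i, cell in enumerate(row) if i != index] for row in rows]


header = ["Name", "Date", "Format", "Location", "Weight", "Notes"]

# Data rows carry one extra, always empty, cell between the weight and the notes.
data_rows = [
    ["PHD CTF Quals 2011", "10 Dec. 2011", "Jeopardy", "On-line", "40.0", "", ""],
    ["Power of XX 2011", "03 Nov. 2011", "Jeopardy", "On-site, Seoul, Korea", "0.0", "", "missing the scoreboard"],
]

print(len(header), [len(row) for row in data_rows])  # header is one cell shorter

cleaned = drop_column(data_rows, 5)
print([len(row) for row in cleaned])  # now every row matches the header
```

Dropping by position is safer here than dropping every empty column, since the notes column is also empty for most events.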

Finally, we can start coding. 
Just like the first example, we store the first table (for 2011) in order to have a dataframe to append the rest of the tables to. Then, we parse the html for the rest of the years and append them to the original dataframe. While doing so, we get rid of the empty-string column of each table so that there are no concatenation issues. We also insert an extra year column for easier reference.

In [5]:
response = s.get("https://ctftime.org/event/list/?year=2011&online=-1&format=0&restrictions=-1&archive=true")
soup = BeautifulSoup(response.content, 'lxml') #using lxml instead
tables = soup.find("table") #there is only one table on each webpage
list_table = parseTable(tables)
columnnames, list_table = list_table[0], np.array(list_table[1:])[:, [0, 1, 2, 3, 4, 6]] #find column names and get rid of empty string columns, which is at index 5
out = pd.DataFrame(list_table, columns = columnnames) #this will be the final dataframe
out.insert(2, "Year", '2011') #adding a year column just for easier reference

for year in range(2012, 2023 + 1): #repeat for years 2012-2023
    url = "https://ctftime.org/event/list/?year=" + str(year) + "&online=-1&format=0&restrictions=-1&archive=true"
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    tables = soup.find("table")
    list_table = parseTable(tables)
    columnnames, list_table = list_table[0], np.array(list_table[1:])[:, [0, 1, 2, 3, 4, 6]]
    append = pd.DataFrame(list_table, columns = columnnames)
    append.insert(2, "Year", str(year))
    out = pd.concat((out, append))
out = out.reset_index(drop=True) #renumber the rows from 0 without keeping the old index
out.head()

| | Name | Date | Year | Format | Location | Weight | Notes |
|---|---|---|---|---|---|---|---|
| 0 | PHD CTF Quals 2011 | 10 Dec., 10:00 UTC — 11 Dec. 2011, 10:00 UTC | 2011 | Jeopardy | On-line | 40.0 | |
| 1 | UCSB iCTF 2011 | 02 Dec., 04:00 UTC — 03 Dec. 2011, 01:00 UTC | 2011 | Attack-Defense | On-line | 80.0 | |
| 2 | RuCTF 2011 | 19 Nov., 15:00 UTC — 20 Nov. 2011, 00:00 UTC | 2011 | Attack-Defense | On-line | 70.0 | |
| 3 | CSAW CTF Final Round 2011 | 10 Nov., 08:00 UTC — 11 Nov. 2011, 08:00 UTC | 2011 | Attack-Defense | | 20.0 | |
| 4 | Power of XX 2011 | 03 Nov., 01:00 UTC — 03 Nov. 2011, 11:00 UTC | 2011 | Jeopardy | On-site, Seoul, Korea | 0.0 | missing the scoreboard |


Finally, we can store the data in an excel file for potential future use.

In [166]:
out.to_excel("CTFTime.xlsx")

Now that we have all the data required, let's recall the problem statement, that being "How many (archived) ctf events have there been in each year?". Before we answer this question, we must check for duplicate events.

In [10]:
out.duplicated()[out.duplicated() == True]

712    True
dtype: bool

From the above, we see that there is one exact copy. Checking the website, we can verify that it was indeed recorded twice on the website. Therefore, we will be removing the duplicate and finally finding the number of (recorded) ctf events in each year.
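
By default, duplicated() only flags the second copy of a repeated row. To look at both copies side by side before deciding to drop one, `keep=False` can be passed, as in this sketch on made-up events (the names and years are not from ctftime):

```python
import pandas as pd

# Made-up events in which the last two rows are exact copies of each other.
events = pd.DataFrame({
    "Name": ["Alpha CTF", "Beta CTF", "Beta CTF"],
    "Year": ["2020", "2021", "2021"],
})

both_copies = events[events.duplicated(keep=False)]  # every row that has a twin
deduplicated = events.drop_duplicates()              # keeps the first copy only

print(both_copies)
print(len(events), "->", len(deduplicated))
```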

In [16]:
out = out.drop_duplicates()
out.Year.value_counts()

2022    269
2021    240
2020    229
2019    197
2023    176
2018    154
2017    141
2016    109
2015     81
2014     59
2013     55
2012     35
2011     19
Name: Year, dtype: int64
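
value_counts() orders its result by count, which is why the years above appear out of order. If a chronological view is preferred, the result can be sorted by its index, as in this sketch on a made-up Year column:

```python
import pandas as pd

# A made-up Year column, stored as strings like in the dataframe above.
years = pd.Series(["2012", "2011", "2012", "2013", "2012", "2011"], name="Year")

by_count = years.value_counts()              # most frequent year first
by_year = years.value_counts().sort_index()  # earliest year first

print(by_count)
print(by_year)
```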

## Conclusion

<div class="alert alert-block alert-warning"> 
e)	Finally, a short conclusion of what was covered in the tutorial, and pros and cons of the library assigned.
</div>

In this tutorial, we have covered what webscraping is, the concerns regarding webscraping, and case studies concerning the ethics and legality of webscraping. We also covered in general what beautifulsoup is and what it can do, as well as how to install it. Lastly, we have shown examples of how to use beautifulsoup to crawl through a website, navigate through the html tree in order to reach other pages on the site, and obtain the tables on the website.

#### Pros of this library:
- there is newer documentation made by other users that looks cleaner and neater, and is more easily digestible
- it is simpler than its counterpart, scrapy, in the sense that there is no need to look through several files and everything (at least for this assignment) can be done through code
- easier to install than its counterpart
- sufficient documentation to achieve what is needed (at least in this assignment)
- more beginner-friendly

#### Cons of this library:
- the official documentation for the library is very old (since it was initially released in 2004) and messy
- unlike its counterpart, it is very simple, but this also means that it has fewer functionalities and requires other libraries in order to achieve the same result as scrapy (in some cases)
- slower than its counterpart, especially for larger projects
- cannot extract large amounts of data from a website without getting IP banned/blocked

## References

<div class="alert alert-block alert-warning"> 
A reference section citing any additional references / links you may have used for the project.
</div>

1. Clause 13 of the [PDPA 2012](https://sso.agc.gov.sg/Act/PDPA2012?ProvIds=P14-P21-#P14-P21-) 
1. Issues caused by webscraping section of the [article on webscraping](https://singaporelegaladvice.com/law-articles/legal-scrape-crawl-websites-data-singapore/) by singaporelegaladvice
1. Clause 7c of the [CMA 1993](https://sso.agc.gov.sg/Act/CMA1993)
1. Copyright infringement subsection of the [article on webscraping](https://singaporelegaladvice.com/law-articles/legal-scrape-crawl-websites-data-singapore/) by singaporelegaladvice
1. Singapore [Copyright Act 2021](https://sso.agc.gov.sg/Acts-Supp/22-2021/Published/?ProvIds=P12-#pr7-)
1. Breach of website terms of use subsection of the [article on webscraping](https://singaporelegaladvice.com/law-articles/legal-scrape-crawl-websites-data-singapore/) by singaporelegaladvice
1. [99co's article](https://www.99.co/singapore/insider/victory-for-the-internet/) claiming they won the case
1. [PropertyGuru's article](https://www.propertyguru.com.sg/property-management-news/2018/3/170013/propertyguru-group-wins-legal-case-against-99-co) claiming they won the case
1. [99co "won" the case](https://mothership.sg/2018/03/propertyguru-copyright-infringement-case-against-99co/), overview of the case by Mothership
1. "Case study: is there copyright infringement when a website’s content is copied?" subsection of [article on webscraping](https://singaporelegaladvice.com/law-articles/legal-scrape-crawl-websites-data-singapore/) by singaporelegaladvice, general overview about the court case between 99co and PropertyGuru
1. General [overview](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn#cite_note-1) of the hiQ vs LinkedIn court case
1. hiQ's [initial victory](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data) over the case
1. Reruling of the case between hiQ and LinkedIn and reaching a [proposed settlement](https://www.natlawreview.com/article/hiq-and-linkedin-reach-proposed-settlement-landmark-scraping-case)


## Appendix

<div class="alert alert-block alert-warning"> 
An Appendix in the report containing a list of prompts used in your interactions with ChatGPT.
</div>

### Student 1 Prompts

1. what is web scraping
2. ethical and legal concerns of web scraping
3. real life cases of illegal webscraping
4. what is beautifulsoup
5. how to navigate html tree using beautifulsoup
6. use cases for using beautifulsoup for webscraping
7. examples of sites to webscrape for cs project without being ip blocked

### Student 2 Prompts

Used prompts

1. what is web scraping?
2. why would web scraping be unethical? can you provide some case studies?
3. how do i know if a website permits web scraping?
4. how to i scrape websites with beautiful soup 4
5. how do i follow links inside a html file
6. trying to do href = url + href gives me the error  can only concatenate str (not "Tag") to str

Other related but unused prompts in chat transcript

1. what are some examples of using scrapy for web scraping?
2. why does 'scrapy crawl quotes' yield Unknown command: crawl
3. how to convert html to pandas dataframe
4. html5lib not found, please install it 
5. i have .html file on my computer stored locally. How do i use pandas to retrieve data in the form of a dataframe
6. im still gettign this error ImportError: html5lib not found, please install it
7. i got this warning from trying to run it afterwords /Users/karen/anaconda3/lib/python3.10/site-packages/bs4/__init__.py:435: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.

### Sharing link of entire chat transcript

1. https://chat.openai.com/share/378c8870-dd37-4784-874c-9e8f3c301e59
2. https://chat.openai.com/share/5297d8f7-b6aa-4889-bb86-c7f33201bf49

<hr>
© NUS High School of Math & Science