# most-popular-hacker-news🗞

> This web-scraping project scrapes **https://news.ycombinator.com/** for the most popular articles, ranked by the number of up-votes received.

## DISCLAIMER📢:
> **This project is for `educational purposes ONLY` & does not encourage malpractice of web scraping whatsoever. `ALL` scraping done in this project will `NOT` exceed the limitations listed in `https://news.ycombinator.com/robots.txt` (shown below).**

![img](https://i.imgur.com/ohRZTDr.png)

> **NOTE: `ALWAYS` check https://{website of interest}.com/robots.txt `BEFORE` you attempt to scrape anything from the site. Whatever is listed as `disallowed` should not be scraped: doing so goes against the site's stated wishes and may violate its terms of service.**

### Disallowed Pages & Crawl Delay Limit Explained🤔:
> The screen-shot of `https://news.ycombinator.com/robots.txt` (shown above) lists the pages we cannot scrape in the form of `relative links`. An example of a relative link would be `/robots.txt`. This relative link is appended to the end of `https://news.ycombinator.com` to create an `absolute link` to the `robots.txt` page of the site. To sum it up, the following pages are not allowed to be scraped:

1. https://news.ycombinator.com/x?
2. https://news.ycombinator.com/r?
3. https://news.ycombinator.com/vote?
4. https://news.ycombinator.com/reply?
5. https://news.ycombinator.com/submitted?
6. https://news.ycombinator.com/submitlink?
7. https://news.ycombinator.com/threads?

> `Crawl-Delay: 30` simply means that after each request to the site we have to wait 30 seconds before we make the next one. This is set in place to keep scrapers from overloading the server.
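
> The rules above can also be checked in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The `ROBOTS_TXT` string is an assumption reconstructed from the disallowed pages & crawl delay listed above (it is not fetched from the site), & the two example URLs are only for illustration.

```python
from urllib import robotparser

# Assumed robots.txt contents, reconstructed from the rules listed above
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /r?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

BASE = "https://news.ycombinator.com"

# parse the rules from the string instead of downloading them
rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

can_scrape_news = rp.can_fetch("*", f"{BASE}/news?p=2")  # a listing page
can_scrape_vote = rp.can_fetch("*", f"{BASE}/vote?id=1")  # a disallowed page
delay = rp.crawl_delay("*")  # seconds to wait between requests

print(can_scrape_news, can_scrape_vote, delay)
```

> In a real run you would likely point the parser at the live file with `set_url()` & `read()` instead of a hard-coded string.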

### Objective📋

> Scrape hacker news for the most popular news articles of the day based on votes & sort the list of articles by most up-voted to least up-voted. 
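
> As a rough sketch of the sorting step, assume each scraped article ends up as a dictionary with a `votes` integer. The dictionary shape, the placeholder titles & the `sort_by_votes` name are all hypothetical at this point.

```python
# hypothetical article records; the real ones will come from the scraper
articles = [
    {"title": "Article A", "votes": 12},
    {"title": "Article B", "votes": 512},
    {"title": "Article C", "votes": 83},
]


def sort_by_votes(articles):
    """Return a new list ordered from most to least up-voted."""
    return sorted(articles, key=lambda article: article["votes"], reverse=True)


ranked = sort_by_votes(articles)
print([article["title"] for article in ranked])
```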

### Expansion & Functionality🧩

#### Keyword Feature⚙
> Provide keyword functionality that will scrape based on whether or not any keyword in a list of given `keywords exist in the titles`. This will be something to keep in mind while building out our scraper so that we leave the door open for scalability and expansion.
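
> One possible shape for this feature is sketched below: a case-insensitive substring check against each title. The function name `matches_keywords` is hypothetical, & plain substring matching is only an assumption about how "keyword exists in the title" could be interpreted.

```python
def matches_keywords(title, keywords):
    """Return True if any keyword appears in the title (case-insensitive)."""
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


titles = [
    "Show HN: Test your shape rotation skills",
    "The fastest GIF does not exist",
    "x86 Is an Octal Machine (1992)",
]

# keep only the titles that mention at least one keyword
wanted = [title for title in titles if matches_keywords(title, ["gif", "x86"])]
print(wanted)
```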

#### Pagination Feature⚙
> Provide page number functionality that will allow a user to enter the `number of pages` they want to be included. This is also something that will be kept in mind throughout the building process. It would be a nice feature to add.

#### Minimum Number of Votes Feature⚙
> Provide functionality that allows a user to input a `minimum number of votes` that the articles must have. E.g., an input of 200 votes would likely filter out less popular articles completely & shorten the overall list of articles to choose from. I think this is a great feature to have.
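
> A minimal sketch of the filter, assuming the vote counts have already been converted to integers. The `filter_by_min_votes` name & the record layout are hypothetical.

```python
def filter_by_min_votes(articles, min_votes):
    """Keep only the articles with at least min_votes up-votes."""
    return [article for article in articles if article["votes"] >= min_votes]


# hypothetical records with made-up titles
articles = [
    {"title": "Article A", "votes": 390},
    {"title": "Article B", "votes": 12},
    {"title": "Article C", "votes": 200},
]

popular = filter_by_min_votes(articles, 200)
print([article["title"] for article in popular])
```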

### Scraping Method🤖

> I favor the asynchronous method of scraping due to it being extremely fast. (We'll cover what this means later on).
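
> To give a feel for what "asynchronous" buys us, here is a small sketch that needs no network access. `fake_fetch` is a hypothetical stand-in for a real request & simply sleeps; `asyncio.gather` runs the three calls concurrently, so the total time should be close to one delay rather than three.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.1):
    """Pretend to download a page; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"html for page {page}"


async def main():
    start = time.perf_counter()
    # schedule all three "requests" at once & wait for every result
    pages = await asyncio.gather(*(fake_fetch(n) for n in range(1, 4)))
    elapsed = time.perf_counter() - start
    return pages, elapsed


pages, elapsed = asyncio.run(main())
print(pages, f"{elapsed:.2f}s")
```

> Inside a Jupyter notebook an event loop is already running, so you would write `await main()` instead of `asyncio.run(main())`. Also keep in mind that real requests to Hacker News still have to respect the crawl delay described earlier.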

### Libraries We Will Be Using📚

1. `asyncio` - this standard-library module allows us to write asynchronous functions in Python
2. `aiohttp` - this is what you can call an asynchronous version of the requests library
3. `lxml` - an XML parsing library that also has an HTML parser. Beautiful Soup is a nice library, but it does not support xpath & **xpath is the best path** when web scraping. It allows for high-level precision & it is very robust.


## Plan of Attack

Now that we know what our objective is & what tools we will be using, we can begin to state exactly what relevant data we want to scrape.

### Data Points of Interest

#### `Article Data Points`
1. **The news article title**
2. **The news article link**
3. **The number of up-votes the news article has received**
4. **The author of the article**
5. **The time the article was published**

#### `Navigation Data Points`
1. **The `more` (next page) link** 


## Surveying The Field 🔍
> Before we start to write code, it is essential to inspect the website we're scraping to determine where exactly our desired data points reside within the html. You can do this in a few quick & easy steps.

1. **Hover over the target data point with your mouse. e.g., the `article title`** 
2. **Right click** 
3. **Select inspect from the drop down menu.**

![img](https://i.imgur.com/wfJIzLQ.png)

### Using Developer Tools ⚒
> Once you click inspect, a small window will open up either on the right side or the bottom of the page showing the `html code of the element`. In our case it will show the html for the title of the article that we were hovering over with our mouse.



![img](https://i.imgur.com/tbyVhOb.png)

> This gives us the ability to see where exactly we will be able to find the data we are looking for. The title of the article is located within an `anchor tag` with the `class titlelink`. The link to the article is also located within the same anchor tag's `href attribute`. That's two desired data points within one tag. Perfect!

#### NOTE: 
> You can also see `a.titlelink` above the element on the left side of the screen. In CSS styling the `.` is used to express a class name. This signifies the highlighted element as being an `anchor tag` with the `class titlelink` as mentioned before.

## XPATH is the BEST Path
> There are two ways to locate the desired data in an html document.
1. **CSS Selectors**
2. **Xpath Selectors**

> In my humble opinion, Xpath is the better choice because it allows us to be precise & it is a necessary skill to have when it comes to web scraping. CSS Selectors are also worth learning & are less complex than Xpath, but that simplicity comes at the cost of some of Xpath's expressiveness. You can use whichever method you prefer. In this project we will be using Xpath.

#### Here are some great resources to learn both methods

**Xpath Selectors:**

1. A great video that will make you really sound with xpath created by `Automation Zone` can be found here: https://www.youtube.com/watch?v=NhG__BL8zFo. 
2. An awesome cheat sheet provided by `Dev Hints` can be found here: https://devhints.io/xpath.

**CSS Selectors:**

1. A video created by `Web Dev Simplified` can be found here: https://www.youtube.com/watch?v=l1mER1bV0N0
2. `W3Schools` has a great reference that can be found here: https://www.w3schools.com/cssref/css_selectors.asp

## Locating Article Data Points

> If you press `Ctrl-F (Cmd-F on Mac)` within the developer tools window, a blank text box will appear under the html code. Inside this text box we can type in an xpath (or css selector) to check that it is correct & to see how many elements on the page match it.

![img](https://i.imgur.com/Q7QxxDB.png)

> We can see that the xpath `//a[@class="titlelink"]` gives us all the `anchor tags` with the `class titlelink` on the page. On the far right of the text field we see that it says `1 of 30`. That means there are 30 elements matching this xpath on this page. We can click on the `up & down arrows` to look at each one. Keep in mind that this is where two of our article data points reside.

1. **The title of the article**
2. **The link to the article**

### Article Title
> The title of the article is the text within the anchor tag. We can extract the text of the anchor tag by adding `/text()` to the end of the xpath like so: `//a[@class="titlelink"]/text()`.

![img](https://i.imgur.com/9wingSl.png)

**Take note of the fact that the count on the far right is still showing 30 elements. With the above xpath we will be extracting text for all 30. This will be recognized by Python as a list of strings.**

### Article Link
> The link to the article is located within the anchor tag's href attribute. To extract the href attribute we can simply add `/@href` to the end of our xpath like so `//a[@class="titlelink"]/@href`.

![img](https://i.imgur.com/9wingSl.png)

**Take note of the fact that the count on the far right is still showing 30 elements. With the above xpath we will be extracting the link for all 30. This will also be recognized by Python as a list of strings.**

### Number of Up-Votes
> Using the same steps mentioned in the `Using Developer Tools` section, we can see that `the number of votes` is located within a `span` element with the `class score`. We extract the text from this element the same way we extracted the text from the title of the article, like so: `//span[@class="score"]/text()`.

![img](https://i.imgur.com/3u6CzcR.png)

> There are two problems with this data point. Can you pinpoint them? These two problems will provide an answer to the question you may or may not have in your head. **When are we finally going to code?!** This is a prime example of why you must take the time to plan out your project before you rush to start writing code.

#### Houston, we have TWO problems! 
> 1. **There are `30 articles` on the page BUT `only 29 of them have been up-voted`! The 8th article in the screen-shot above was posted only an hour ago and has 0 up-votes (points).**

> 2. **The up-votes are represented by `points` causing the number of votes to be a `string containing the word "points" at the end`.**

**We will have to handle these two problems within our code later on & we'll be able to do that easily. The important thing is that we know what to expect when it comes to grabbing the up-votes data.** Let's go ahead and state what we will do to handle this.

### Handling Up-Vote Data 

#### Edge Case #1: The article has no up-votes
> In the event that the article has no up-votes we will provide a `default value` that will be set to `0`. 

#### Edge Case #2: The article has up-votes & " points" needs to be removed
> In the event that the article has up-votes we will `remove " points"` from the string in order to be able to `convert the number of up-votes(points) into an integer`. This will allow us to `sort the articles by number of votes`.
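
> Both edge cases can be sketched in one small helper. The name `parse_votes` is hypothetical, & taking the first whitespace-separated token is just one way to drop the trailing " points" (it also copes with a singular "1 point").

```python
def parse_votes(raw=None):
    """Convert a score string such as '390 points' to an int.

    A missing or empty score falls back to the default value 0.
    """
    if not raw:
        return 0
    # the number is the first token; the rest is the word "points"/"point"
    return int(raw.split()[0])


print(parse_votes("390 points"), parse_votes("1 point"), parse_votes(None))
```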


### Article Author
> The author is located within an `anchor tag` with the `class hnuser`. We will extract the text of this element to get the name of the author with the following xpath expression: `'//a[@class="hnuser"]/text()'`

![img](https://i.imgur.com/LLzTN5E.png)

> There are `30 articles` and `only 29 of them have the author data point`! We will have to handle this in our code. Hopefully you are starting to see the value in planning out your project before you start coding.

### Handling Article Author Data 
#### Edge Case #1: The article doesn't have an author

> In the event that an article doesn't have the author data point we will provide a `default value` set to `'N/A'`
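
> One way to keep a missing score or author attached to the right article is to look for them inside each article's own block rather than in page-wide lists. The sketch below illustrates that idea with the standard library's `xml.etree.ElementTree` on a tiny hand-written snippet so that it runs without extra dependencies. The markup & the `item` class are hypothetical simplifications; the real Hacker News html is laid out differently, & with `lxml` the same per-article lookup would use relative xpath expressions.

```python
import xml.etree.ElementTree as ET

# hypothetical, simplified markup: the second article has no score & no author
SAMPLE = """
<table>
  <tr class="item">
    <td><a class="titlelink" href="https://example.com/a">First story</a></td>
    <td><span class="score">42 points</span> <a class="hnuser">alice</a></td>
  </tr>
  <tr class="item">
    <td><a class="titlelink" href="https://example.com/b">Second story</a></td>
    <td></td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SAMPLE)
rows = []
for item in root.findall(".//tr[@class='item']"):
    title = item.find(".//a[@class='titlelink']")
    score = item.find(".//span[@class='score']")  # None when missing
    author = item.find(".//a[@class='hnuser']")   # None when missing
    rows.append({
        "title": title.text,
        "link": title.get("href"),
        "votes": int(score.text.split()[0]) if score is not None else 0,
        "author": author.text if author is not None else "N/A",
    })

for row in rows:
    print(row)
```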

### Article Publication Time
> The published time data point is located in a `span element` with the `class age`. In this case the actual date & time information is within the `title attribute`. There is also text that simply states how many hours ago the article was published. We want the date & time information. We can extract this data point with the following xpath expression: `'//span[@class="age"]/@title'` 

![img](https://i.imgur.com/0xbhuIp.png)

> We can see that this data point is present in all 30 of the articles on the page.
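
> Assuming the `title attribute` holds an ISO 8601 timestamp (for example `2022-02-20T19:19:09`), the strings can be turned into `datetime` objects with the standard library, which makes them easy to compare or sort. The sample values below are only illustrative.

```python
from datetime import datetime

# illustrative timestamps in the assumed ISO 8601 format
published_times = [
    "2022-02-20T19:19:09",
    "2022-02-21T01:41:03",
    "2022-02-20T18:54:36",
]

parsed_times = [datetime.fromisoformat(stamp) for stamp in published_times]
newest = max(parsed_times)  # datetime objects compare chronologically

print(newest.isoformat())
```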

## Locating Navigation Data Points
> At the very bottom of the page there is a `more` link that takes you to the next page of articles. Let's inspect this element & see how we can go about `navigating the site` with our scraper.

![img](https://i.imgur.com/dtVLcuW.png)

> We can see that the more link is located in an `anchor tag` with the `class morelink`. We can extract the `href attribute` from the `anchor tag` to get the link using the following xpath expression: `'//a[@class="morelink"]/@href'`

![img](https://i.imgur.com/dtVLcuW.png)

> There is only one element that matches this xpath expression, and rightfully so, since there is only one more link on the page. If you look very closely at the actual `href attribute` value you will see that it is a `relative link` that reads: **"news?p=2"**. The `p=2` means that the link leads us to the `page number` that is equal to `2`. The fact that it is a relative link means that we can simply add it to the main url to form the absolute link in the following manner: `"https://news.ycombinator.com/news?p=2"`. What do you think will happen if we simply change the 2 to a 1?

![img](https://i.imgur.com/igxerG2.png)

> Changing the 2 to a 1 brings us to the first page! This will be very useful in our pagination feature later on. We will be able to simply change the number at the end of the link to go to different pages. For now we will state how to handle this data point below.
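
> Joining a relative link to the main url can be done with the standard library's `urljoin`, which takes care of the slash between the two parts. This is a small sketch; `BASE_URL` & `relative_link` are just illustrative names.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com"
relative_link = "news?p=2"  # the href value of the more link

# urljoin inserts the missing "/" between the host & the relative path
next_page_url = urljoin(BASE_URL, relative_link)
print(next_page_url)
```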

### Handling Navigation🧭

#### Edge Case #1
> In the event that the next page is unreachable, we will not attempt to scrape it. The way we will handle this will be explained in the implementation phase later on.

## It's Time To Code🧑‍💻!

> Now that we have a detailed plan we can start to import the necessary libraries to execute it.

In [6]:
# asyncio ships with Python's standard library, so it does not need to be installed with pip

In [7]:
!pip install aiohttp --upgrade --quiet

In [9]:
!pip install lxml --upgrade --quiet

In [10]:
# import libraries
import asyncio
import aiohttp
from lxml import etree

In [11]:
# store the Hacker News main page in a variable called url
url = "https://news.ycombinator.com"

# make a request to Hacker News using a ClientSession context manager
# returns the html text on a 200 response, otherwise None (implicitly)
async def get_html(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                html = await response.text()
                return html

# pass the html response to the html parser from lxml's etree module
# (top-level await works here because Jupyter already runs an event loop)
parser = etree.HTMLParser()
tree = etree.fromstring(await get_html(url), parser)

# test out our article titles xpath
article_titles = tree.xpath('//a[@class="titlelink"]/text()')

article_titles[:6]

['Show HN: Test your shape rotation skills',
 'Google Tag Manager, the new anti-adblock weapon',
 'Be anonymous',
 'The fastest GIF does not exist',
 'x86 Is an Octal Machine (1992)',
 'About adding a static route to my DOCSIS modem']

In [12]:
# check the number of titles. it should be 30
len(article_titles)

30

> So far so good. This is just the beginning. Later we will implement each part of the process into functions to make our code cleaner. Before we do that we will do a few sanity checks to test our assumptions/logic and explore our data points. Let's check the rest of our data points.

In [14]:
# test out our news article links xpath
article_links = tree.xpath('//a[@class="titlelink"]/@href')

article_links[:6]

['https://0xf00ff00f.github.io/rotator/',
 'https://chromium.woolyss.com/f/HTML-Google-Tag-Manager-the-new-anti-adblock-weapon.html',
 'https://kg.dev/thoughts/be-anonymous',
 'https://www.biphelps.com/blog/The-Fastest-GIF-Does-Not-Exist',
 'https://gist.github.com/seanjensengrey/f971c20d05d4d0efc0781f2f3c0353da',
 'https://blog.danman.eu/about-adding-a-static-route-to-my-docsis-modem/']

In [15]:
# verify the length of the article links list is 30
len(article_links)

30

> Great! Our first two data points are working as expected. We have realistic expectations according to the detailed plan that we laid out in advance. This helps us to focus more on coding and spend less time fixing unexpected bugs. Let's move along to the next data point.

In [16]:
# test out our votes xpath
num_of_votes = tree.xpath('//span[@class="score"]/text()')

num_of_votes[:6]

['390 points',
 '12 points',
 '315 points',
 '512 points',
 '83 points',
 '37 points']

In [17]:
# verify the length (expected to be 29)
len(num_of_votes)

29

> Imagine if you didn't plan beforehand. This would be unexpected behavior and would force you to do further digging in the middle of coding. Our planning has paid off! We will handle this in the near future when we start to structure our program into functions. Next!

In [18]:
# test out our authors xpath
article_authors = tree.xpath('//a[@class="hnuser"]/text()')

article_authors[:6]

['0xf00ff00f', 'thyrox', 'kashnote', 'todsacerdoti', 'a1a106ed5', 'azalemeth']

In [19]:
# verify the length of authors list (expected to be 29)
len(article_authors)

29

> Again, due to our extensive planning we already expect this result and already have determined how we will handle it. Let's continue.

In [21]:
# test out our published times xpath
published_times = tree.xpath('//span[@class="age"]/@title')

published_times[:6]

['2022-02-20T19:19:09',
 '2022-02-21T01:41:03',
 '2022-02-20T18:54:36',
 '2022-02-20T14:11:14',
 '2022-02-20T20:41:48',
 '2022-02-20T23:15:07']

In [22]:
# verify the length of published times(expected to be 30)
len(published_times)

30

> Nice! All five of our main data points are going according to plan as far as the xpaths are concerned. Last but certainly not least, let's check out the pagination data point.

In [23]:
# test out the more link xpath(we'll call it next_page)
next_page = tree.xpath('//a[@class="morelink"]/@href')[0]

next_page

'news?p=2'

#### Pagination Functionality⚙
> Since all we have to do is change the number from 2 to the desired page number, we will make things easier on ourselves by using `string formatting` to `generate the next page url`. This will be the first feature that we create! 

- Name: `page_generator`
- Type: `generator`
- Task: `yield next page url one at a time within a given range starting from 1`

##### Input
**url**: `str` representing the `main page of the website`

**page_count**: `int` representing the `num of pages to generate`

##### Output
**next_page**: `str` representing the `next_page in the specified page_count starting from 1`

(e.g., if url is "https://news.ycombinator.com" & page_count is 5, the first value yielded will be "https://news.ycombinator.com/news?p=1")

In [24]:
def page_generator(url, page_count):
    for n in range(1, page_count + 1):
        yield f"{url}/news?p={n}"

> For the sake of readability, let's call our page_generator with the arguments url (defined earlier) & 5 (page_count), then store the result in a variable called news_feed. With generators you have to call the built-in `next()` function provided by Python to get `the next yielded result`. Since we have to call next(), it is easier on the eyes to say `next(news_feed)` as opposed to next(next_page) or next(page_generator(url, 5)). The latter would also create a brand-new generator on every call, so it would always yield page 1.

In [29]:
news_feed = page_generator(url, 5)

In [30]:
next(news_feed)

'https://news.ycombinator.com/news?p=1'

> What do you think will happen if we call `next(news_feed)` again?
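
> To explore that question without touching the network, here is a self-contained sketch that repeats the `page_generator` definition from above with a `page_count` of 2. Each `next()` call resumes the generator where it left off, & once the range is used up Python raises `StopIteration`.

```python
def page_generator(url, page_count):
    for n in range(1, page_count + 1):
        yield f"{url}/news?p={n}"


news_feed = page_generator("https://news.ycombinator.com", 2)

first = next(news_feed)   # page 1
second = next(news_feed)  # page 2

# a third call has nothing left to yield
try:
    next(news_feed)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```

> A `for` loop over the generator handles `StopIteration` for you, which is usually the more convenient way to walk through every page.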
