Analysis of Hacker News

The objective of this project was to create my own dataset of stories and articles from Hacker News, and to find insightful correlations between the popular discourse in the tech community and real-world events.

Setup instructions

Some steps are required to get started with this project.

External dependencies

This project has external dependencies that need to be set up in order to successfully run the code. All services are free to use (as of April 2024).

Zilliz - vector DB

This project setup heavily depends on integration with an external data source, a cloud vector database called Zilliz. Successfully running many of the scripts requires setting up Zilliz or reconfiguring for a different data storage method.
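Zilliz is a managed Milvus service, so one way to check that your setup works is to connect with the pymilvus client. The following is a minimal sketch, not the project's actual connection code: the environment variable names ZILLIZ_URI and ZILLIZ_TOKEN and both helper functions are assumptions made for illustration.

```python
import os


def read_zilliz_config(env=None):
    """Collect connection settings from environment variables.

    ZILLIZ_URI and ZILLIZ_TOKEN are hypothetical names chosen for this
    sketch; use whatever names your own configuration defines.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("ZILLIZ_URI", "ZILLIZ_TOKEN") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"uri": env["ZILLIZ_URI"], "token": env["ZILLIZ_TOKEN"]}


def connect(env=None):
    """Open a client against the cluster described by the environment."""
    # Imported lazily so the config helper works without pymilvus installed.
    from pymilvus import MilvusClient

    return MilvusClient(**read_zilliz_config(env))
```

Calling connect().list_collections() should return the collections in your cluster if the credentials are valid.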

Voyage - embeddings

Voyage AI was, at the time of my research, the industry leader in large text embeddings. I chose to integrate their API as part of my data preprocessing workflow. Setup instructions here.
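As a rough illustration of the embedding step, the sketch below sends texts to an embedding client in batches. It assumes a client object shaped like the one from the voyageai package, with an embed method that returns an object carrying an embeddings list. The function name embed_documents, the model name "voyage-2", and the batch size are assumptions, not values taken from this project.

```python
def embed_documents(texts, client, model="voyage-2", batch_size=64):
    """Embed texts in batches and return one vector per input text.

    `client` is expected to behave like voyageai.Client(). The model name
    and batch size are illustrative defaults only.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # input_type="document" marks the texts as content to be indexed.
        result = client.embed(batch, model=model, input_type="document")
        vectors.extend(result.embeddings)
    return vectors
```

With the voyageai package installed and the VOYAGE_API_KEY environment variable set, embed_documents(["some text"], voyageai.Client()) would be one way to call it.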

Llama 2 - text synthesis

To clean and synthesize text coming from third-party websites, I found Llama 2-7B, a locally running LLM, to work best. Get started by installing Ollama (instructions), then follow this guide to set up Llama to run on a local port.
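Once Ollama is running, a local model can be queried over HTTP. The sketch below shows one way to do that with only the standard library. It assumes Ollama's default port 11434 and the model tag "llama2"; the prompt wording and the function names build_request and synthesize are made up for this example and may differ from what process.py does.

```python
import json
from urllib.request import Request, urlopen

# Ollama's default local endpoint; change it if you configured another port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(raw_text, model="llama2"):
    """Build the HTTP request asking the local model to clean up raw_text."""
    # Hypothetical prompt, for illustration only.
    prompt = "Clean up and summarize the following web page text:\n\n" + raw_text
    # stream=False asks for a single JSON response instead of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(raw_text, model="llama2", timeout=120):
    """Send raw_text to the local model and return the generated text."""
    with urlopen(build_request(raw_text, model), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

If the server is not running, synthesize raises a connection error, which is a quick way to confirm whether the local setup is reachable.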

Internal dependencies

It is recommended to set up a virtual environment.

To install project dependencies:

pip install -r requirements.txt

Running the code

Get the source data

Pulling data from Hacker News itself takes only a simple script, which can be run without setting up the external dependencies. Get started by allowing the bash script to run on your machine.

In the source folder, run the following terminal command to grant execution rights:

chmod +x top_hn/fetch_best.sh

You can now run the bash script to pull articles from Hacker News:

./top_hn/fetch_best.sh

This will create a JSON file in your ./results folder, containing the 200 top trending articles at the time you run it. The file lists each article by its ID.
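For readers who prefer Python to bash, the same idea can be sketched as follows. This is an assumption about what the script does, not a port of it: the public Hacker News API exposes a beststories endpoint that returns a list of story IDs, and the output file name best_ids.json is hypothetical.

```python
import json
from pathlib import Path
from urllib.request import urlopen

# Public Hacker News API endpoint listing the current "best" story IDs.
# Assumption: fetch_best.sh queries this or a similar endpoint.
BEST_STORIES_URL = "https://hacker-news.firebaseio.com/v0/beststories.json"


def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def save_best_story_ids(out_dir="results", fetch=fetch_json):
    """Fetch the story IDs and write them to <out_dir>/best_ids.json.

    The file name is hypothetical; the real script may use another one.
    `fetch` is injectable so the function can be tested offline.
    """
    story_ids = fetch(BEST_STORIES_URL)
    out_path = Path(out_dir) / "best_ids.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(story_ids))
    return out_path
```

Calling save_best_story_ids() from the source folder writes the ID list to ./results. Each ID can then be resolved to a full story through the API's item endpoint.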

Process the articles

Processing the articles requires having set up or reconfigured the external dependencies.

The source script for processing the articles is in ./process.py.

From the source folder, run:

python process.py

View findings

My preliminary work with this data is deployed on Streamlit.

The source code is in ./streamlit.py, and the deployed app is at https://hackernews-demo.streamlit.app.

