> A project to analyze Hacker News stories using nbdbt

## Analysis

In [1]:
#| echo: false
%reload_ext nbdbt.dbt_cellmagic

In [2]:
#| echo: false
%dbtconfig -p ../hn_whos_hiring -n notebooks/analysis.ipynb

### Raw HN Source

> This is the raw data for all Hacker News posts

It is sourced from Google's BigQuery Public Datasets
and is accessible as
```
bigquery-public-data.hacker_news.full
```

It seems to be updated on a regular basis *(see timestamp of latest post)*.

In [3]:
%%dbt -a raw_sources analyses/raw_hn_source.sql
select *
from {{ source('public_datasets', 'full_stories') }}
order by timestamp desc


In [4]:
df = raw_sources.ref(10); df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,POSIX sh parameter expansion cheat sheet:<p><a...,,jwilk,,1656580359,2022-06-30 09:12:39+00:00,comment,31930208,31928736.0,,,
1,,,That also ate a lot of small healthy business ...,,Existenceblinks,,1656580348,2022-06-30 09:12:28+00:00,comment,31930207,31929941.0,,,
2,,,Does anyone actually like using JIRA? Or Confl...,,gaff33,,1656580322,2022-06-30 09:12:02+00:00,comment,31930206,31929941.0,,,
3,,,Not at all is the issue. IIRC svn checkout is ...,,masklinn,,1656580321,2022-06-30 09:12:01+00:00,comment,31930205,31929148.0,,,
4,,,"That&#x27;s because, if the chip uses 20% less...",,tintedfireglass,,1656580316,2022-06-30 09:11:56+00:00,comment,31930204,31925613.0,,,


The column names are fairly descriptive, but they might need to be standardized.

In [5]:
df.columns.values

array(['title', 'url', 'text', 'dead', 'by', 'score', 'time', 'timestamp',
       'type', 'id', 'parent', 'descendants', 'ranking', 'deleted'],
      dtype=object)

### Exploratory Data Analysis 
> based on a sample of the 10 latest posts

* Check whether `time` and `timestamp` contain the same information, in which case one of the two columns can be eliminated

In [6]:
from datetime import datetime
import pandas as pd

In [7]:
df['newtimestamp'] = pd.to_datetime(df['time'].astype(float), unit='s', origin='unix', utc=True)

In [8]:
# True only if every row's converted epoch value equals its timestamp
(df['newtimestamp'] == df['timestamp']).all()

True

Looks like they are one and the same.
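
The comparison can be reproduced without a warehouse connection. The sketch below copies two rows by hand from the sample output above and requires every row to match. The names `sample`, `converted`, and `columns_match` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two rows copied by hand from the sample output above (not the live table).
sample = pd.DataFrame(
    {
        "time": [1656580359, 1656580348],
        "timestamp": pd.to_datetime(
            ["2022-06-30 09:12:39+00:00", "2022-06-30 09:12:28+00:00"], utc=True
        ),
    }
)

# Convert epoch seconds to timezone-aware UTC datetimes.
converted = pd.to_datetime(sample["time"], unit="s", utc=True)

# True only if every row matches, not just some of them.
columns_match = bool((converted == sample["timestamp"]).all())
print(columns_match)
```

A row-wise comparison reduced with `.all()` makes the check fail as soon as a single row disagrees, which is the property needed before dropping one of the columns.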

### EDA Planned Questions

* Check whether any of `id`, `by`, `time`, `timestamp`, or `type` are null

* Check for unique values of `dead`

* Check for unique values of `type`  
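
The planned checks could be sketched in pandas roughly as follows. This is a minimal sketch that assumes a small local frame: `sample` holds three rows copied from the sample output above, and the names `required` and `null_counts` are illustrative.

```python
import pandas as pd

# Three rows copied by hand from the sample output above (not the live table).
epochs = [1656580359, 1656580348, 1656580322]
sample = pd.DataFrame(
    {
        "id": [31930208, 31930207, 31930206],
        "by": ["jwilk", "Existenceblinks", "gaff33"],
        "time": epochs,
        "timestamp": pd.to_datetime(epochs, unit="s", utc=True),
        "type": ["comment", "comment", "comment"],
        "dead": [None, None, None],
    }
)

required = ["id", "by", "time", "timestamp", "type"]

# Count nulls per required column; all zeros means nothing is missing.
null_counts = sample[required].isna().sum()
print(null_counts)

# List distinct values, keeping nulls visible with dropna=False.
print(sample["dead"].value_counts(dropna=False))
print(sample["type"].value_counts(dropna=False))
```

Passing `dropna=False` matters for `dead`, since a column that is mostly null would otherwise look empty.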

## Standardization
> Standardize column names and types so downstream transformations don't have to handle inconsistencies

In [9]:
%%dbt -a hn_posts -n notebooks/analysis.ipynb models/hn_posts.sql
select 
  id as post_id, 
  title,
  url,
  text,
  `by` as author_id,
  time as pub_date
   
from {{ source('public_datasets', 'full_stories') }}
order by timestamp desc
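
The identifier renames in this model can be prototyped on a local sample before running dbt. The sketch below is an assumption-laden stand-in: the frame `raw` and its `text` values are hypothetical, and only the two renamed identifier columns from the model (`id` to `post_id`, `by` to `author_id`) are covered.

```python
import pandas as pd

# Hypothetical local sample using the raw column names.
raw = pd.DataFrame(
    {
        "id": [31930208, 31930207],
        "by": ["jwilk", "Existenceblinks"],
        "text": ["first comment", "second comment"],
    }
)

# Same identifier renames as the model: id -> post_id, by -> author_id.
renames = {"id": "post_id", "by": "author_id"}
standardized = raw.rename(columns=renames)

print(list(standardized.columns))
```

Keeping the mapping in a single dictionary makes it easy to compare against the `select` list in the model.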
