> A project to analyze Hacker News stories using nbdbt

## Analysis

In [1]:
#| echo: false
%reload_ext nbdbt.dbt_cellmagic

In [2]:
#| echo: false
%dbtconfig -p ../hn_whos_hiring -n notebooks/analysis.ipynb

### Raw HN Source

> This is the raw data for all Hacker News posts

It is sourced from Google's BigQuery Public Datasets
and is accessible as 
```
bigquery-public-data.hacker_news.full
```

It seems to be updated on a regular basis *(see timestamp of latest post)*.

In [3]:
%%dbt -a raw_sources analyses/raw_hn_source.sql
select *
from {{ source('public_datasets', 'full_stories') }}
order by timestamp desc


In [4]:
df = raw_sources.ref(10); df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,"Not sure where you got that from, but that&#x2...",,seanhunter,,1658567561,2022-07-23 09:12:41+00:00,comment,32201677,32199341,,,
1,,,It&#x27;s a level of &quot;professionalism&quo...,,foverzar,,1658567556,2022-07-23 09:12:36+00:00,comment,32201676,32201587,,,
2,,,"<a href=""https:&#x2F;&#x2F;github.com&#x2F;cyp...",,DistrictFun7572,,1658567547,2022-07-23 09:12:27+00:00,comment,32201675,32199828,,,
3,,,"Interesting, why isn&#x27;t the moisture on wa...",,badpun,,1658567537,2022-07-23 09:12:17+00:00,comment,32201674,32201620,,,
4,,,Why are comments that are critical to the inve...,,lizardactivist,,1658567512,2022-07-23 09:11:52+00:00,comment,32201673,32200371,,,


The column names are mostly self-descriptive, but they may need to be standardized.

In [5]:
df.columns.values

array(['title', 'url', 'text', 'dead', 'by', 'score', 'time', 'timestamp',
       'type', 'id', 'parent', 'descendants', 'ranking', 'deleted'],
      dtype=object)

### Exploratory Data Analysis 
> Based on a sample of the 10 latest posts

* Check whether `time` and `timestamp` contain the same information, in which case one column can be dropped

In [6]:
from datetime import datetime
import pandas as pd

In [7]:
df['newtimestamp'] = pd.to_datetime(df['time'].astype(float), unit='s',origin='unix', utc=True)

In [8]:
(df['newtimestamp'] == df['timestamp']).all()

True

Looks like they are one and the same.
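The same comparison can be reproduced outside the notebook on the sampled rows. The sketch below uses only the standard library; the `sample` pairs are copied from the output above, and `epoch_matches_timestamp` is a hypothetical helper name, not part of nbdbt or pandas.

```python
from datetime import datetime, timezone

# (time, timestamp) pairs copied from the sampled rows shown earlier
sample = [
    (1658567561, "2022-07-23 09:12:41+00:00"),
    (1658567556, "2022-07-23 09:12:36+00:00"),
    (1658567547, "2022-07-23 09:12:27+00:00"),
    (1658567537, "2022-07-23 09:12:17+00:00"),
    (1658567512, "2022-07-23 09:11:52+00:00"),
]


def epoch_matches_timestamp(epoch_seconds, timestamp_text):
    """Return True when the Unix epoch value and the UTC timestamp string agree."""
    from_epoch = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    from_text = datetime.fromisoformat(timestamp_text)
    return from_epoch == from_text


# True only if every sampled row agrees, not just some of them
all_match = all(epoch_matches_timestamp(t, ts) for t, ts in sample)
print(all_match)
```

Requiring every row to match is the stricter test; a check that passes when only some rows match would not justify dropping a column.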

### EDA Questions

* Check total record count

In [9]:
%%dbt -a tot_rec analyses/count_hn_source.sql
select count(*) as rec_count 
from {{ source('public_datasets', 'full_stories') }}


In [10]:
tot_rec_df = tot_rec.ref()

In [11]:
total_records = tot_rec_df.iloc[0].rec_count

In [12]:
total_records

32201676

* Check if any of `id`, `by`, `time`, `timestamp`, `type`, `dead` are null 

In [13]:
%%dbt -a null_field_counts analyses/null_field_counts.sql
with hn_posts
as (
select
     `by` as author,
     * except(`by`)
from {{ source('public_datasets', 'full_stories') }}
)
select 
   'id' as field, 
    count(*) as null_count,
from hn_posts
where id is null
union all
select 
   'author' as field, 
    count(*) as null_count,
from hn_posts
where author is null
union all
select 
   'time' as field, 
    count(*) as null_count,
from hn_posts
where time is null
union all
select 
   'timestamp' as field, 
    count(*) as null_count,
from hn_posts
where timestamp is null
union all
select 
   'type' as field, 
    count(*) as null_count,
from hn_posts
where type is null
union all
select 
   'dead' as field, 
    count(*) as null_count,
from hn_posts
where dead is null



In [14]:
null_df = null_field_counts.ref()

In [15]:
null_df['pct'] = null_df['null_count']/total_records

In [16]:
null_df

Unnamed: 0,field,null_count,pct
0,id,0,0.0
1,dead,30822943,0.957184
2,type,0,0.0
3,time,26818,0.000833
4,timestamp,26818,0.000833
5,author,955051,0.029658


| So `type` and `id` always have values, 
| but there are entries with no `time` or `timestamp` (very few, less than 0.1 percent),
| there are entries with no `author` (around 3 percent),
| and 96 percent have null values for the `dead` field.
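The percentages quoted above follow directly from the null counts and the total record count. As a small sanity check, the sketch below recomputes them from the numbers copied out of the query results; `null_share` is a hypothetical helper name.

```python
# Values copied from the query outputs above
total_records = 32201676
null_counts = {
    "id": 0,
    "dead": 30822943,
    "type": 0,
    "time": 26818,
    "timestamp": 26818,
    "author": 955051,
}


def null_share(null_count, total):
    """Fraction of rows in which a field is null."""
    return null_count / total


for field, count in null_counts.items():
    print(f"{field:<10} {null_share(count, total_records):.4%}")
```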

* Check for unique values of `dead`

In [17]:
%%dbt -a dead_type_counts analyses/dead_types_counts.sql
select dead as dead_type, count(*) as dead_count
from {{ source('public_datasets', 'full_stories') }}
group by dead 
order by dead_count desc

In [18]:
dead_types_df = dead_type_counts.ref()
dead_types_df['pct'] = dead_types_df['dead_count']/total_records

In [19]:
dead_types_df

Unnamed: 0,dead_type,dead_count,pct
0,,30822943,0.957184
1,True,1378733,0.042816


* Sample dead values

In [20]:
%%dbt -a dead_rows analyses/dead_rows.sql
select * 
from {{ source('public_datasets','full_stories') }}
where dead is not null
limit 10

In [21]:
dead_rows_df = dead_rows.ref()

In [22]:
dead_rows_df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,"The square roots rule is very handy, and comes...",True,pfh,,1453809675,2016-01-26 12:01:15+00:00,comment,10972974,10972482,,,
1,,,All these things he listed (losing touch with ...,True,ithought,,1487297280,2017-02-17 02:08:00+00:00,comment,13665377,13665032,,,
2,,,Stop BSing us ALL previous climate disaster pr...,True,andred14,,1634252135,2021-10-14 22:55:35+00:00,comment,28871313,28865033,,,
3,,,We need a lot of customization in the output a...,True,Lower456,,1520529475,2018-03-08 17:17:55+00:00,comment,16545848,16545574,,,
4,,,The failing @nytimes.,True,monochromatic,,1520529511,2018-03-08 17:18:31+00:00,comment,16545849,16545685,,,


In [23]:
%%dbt -a not_dead_rows analyses/not_dead_rows.sql
select * 
from {{ source('public_datasets','full_stories') }}
where dead is null
limit 10

In [24]:
not_dead_rows_df = not_dead_rows.ref()

In [25]:
not_dead_rows_df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,Let&#x27;s say the string contains 100 0s and ...,,dlubarov,,1376497604,2013-08-14 16:26:44+00:00,comment,6212429,6211216,,,
1,,,Eric Schmidt would feel right at home in priso...,,logn,,1376497595,2013-08-14 16:26:35+00:00,comment,6212428,6210198,,,
2,,,What is the point of submitting a story behind...,,Quequau,,1420273930,2015-01-03 08:32:10+00:00,comment,8830251,8830214,,,
3,,,But that can be the case for small hatchbacks ...,,freehunter,,1376497523,2013-08-14 16:25:23+00:00,comment,6212421,6212022,,,
4,,,"Just to be clear, the BSD license did not exis...",,throwaway2048,,1376497501,2013-08-14 16:25:01+00:00,comment,6212420,6212325,,,


* Check for unique values of `type`  

In [26]:
%%dbt -a type_counts analyses/types_counts.sql
select type as type, count(*) as type_count
from {{ source('public_datasets', 'full_stories') }}
group by type 
order by type_count desc

In [27]:
types_df = type_counts.ref()
types_df['pct'] = types_df['type_count']/total_records

In [28]:
types_df

Unnamed: 0,type,type_count,pct
0,comment,27599990,0.857098
1,story,4570471,0.141933
2,job,15567,0.000483
3,pollopt,13668,0.000424
4,poll,1980,0.000061
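Because `type` is never null, the per-type counts should add up to the total record count. The sketch below checks that and derives each type's share from its own `type_count`; all numbers are copied from the outputs above, and the variable names are illustrative only.

```python
# Values copied from the query outputs above
total_records = 32201676
type_counts = {
    "comment": 27599990,
    "story": 4570471,
    "job": 15567,
    "pollopt": 13668,
    "poll": 1980,
}

# With no null types, the counts should partition the whole table
counts_add_up = sum(type_counts.values()) == total_records
print(counts_add_up)

# Share of each post type, computed from that type's own count
type_share = {name: count / total_records for name, count in type_counts.items()}
for name, share in type_share.items():
    print(f"{name:<8} {share:.6f}")
```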


## Standardization
> Standardize column names and types so downstream transformations don't have to deal with that

In [30]:
project_dir = '../hn_whos_hiring'
profiles_dir = '~/.dbt'

In [32]:
from fal import FalDbt

In [33]:
faldbt = FalDbt(project_dir,profiles_dir)

In [35]:
source = faldbt.sources[0]

In [37]:
source.schema

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type,is_generated,generation_expression,is_stored,is_hidden,is_updatable,is_system_defined,is_partitioning_column,clustering_ordinal_position,collation_name
0,bigquery-public-data,hacker_news,full,title,1,YES,STRING,NEVER,,,NO,,NO,NO,,
1,bigquery-public-data,hacker_news,full,url,2,YES,STRING,NEVER,,,NO,,NO,NO,,
2,bigquery-public-data,hacker_news,full,text,3,YES,STRING,NEVER,,,NO,,NO,NO,,
3,bigquery-public-data,hacker_news,full,dead,4,YES,BOOL,NEVER,,,NO,,NO,NO,,
4,bigquery-public-data,hacker_news,full,by,5,YES,STRING,NEVER,,,NO,,NO,NO,,
5,bigquery-public-data,hacker_news,full,score,6,YES,INT64,NEVER,,,NO,,NO,NO,,
6,bigquery-public-data,hacker_news,full,time,7,YES,INT64,NEVER,,,NO,,NO,NO,,
7,bigquery-public-data,hacker_news,full,timestamp,8,YES,TIMESTAMP,NEVER,,,NO,,NO,NO,,
8,bigquery-public-data,hacker_news,full,type,9,YES,STRING,NEVER,,,NO,,NO,NO,,
9,bigquery-public-data,hacker_news,full,id,10,YES,INT64,NEVER,,,NO,,NO,NO,,


In [38]:
%%dbt -a hn_posts models/hn_posts.sql
with stories as (
  select
    * except (`by`),
    `by` as submitter_id,
  from {{ source('public_datasets', 'full_stories') }}
),
latest_stories as (
  select 
     id as post_id, 
     title,
     url,
     submitter_id,
     text as content,
     timestamp as submit_timestamp, -- no need for time since timestamp == time
     ifnull(dead,false) as dead,  
     score as post_score,
     cast(parent as int64) as parent_id,
     type as post_type,
     ranking,
     deleted,
     descendants
   from stories
   order by submit_timestamp desc
)
select *
from latest_stories


### Notes on columns 
* post_id - unique identifier
* title - title of the post (can be null for comments?)
* url - link to the story
* submitter_id - user id of the submitter
* content - body of the post
* submit_timestamp - date/time submitted
* dead - meaning unclear; only 4.28 percent of posts are dead, the rest are null in the source (false after `ifnull`)
* post_score - rating?
* parent_id - link to the parent if the post is a response to an article/comment?
* descendants - count of descendants?
* ranking - ?
* deleted - ?
* post_type - one of 'story', 'comment', 'job', 'pollopt', 'poll'

### More questions
* meaning of dead
* what is ranking
* what is deleted
* is parent_id the link for a graph of article/comments/responses to comment?
* what is descendants
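If `parent_id` does turn out to link each comment to the post it replies to (still an open question here), the thread structure could be explored along these lines. This is a sketch under that assumption: the `(post_id, parent_id)` pairs are made up for illustration, and `build_children_index` and `count_descendants` are hypothetical helpers.

```python
from collections import defaultdict


def build_children_index(posts):
    """Map each parent id to the list of post ids that reference it."""
    children = defaultdict(list)
    for post_id, parent_id in posts:
        if parent_id is not None:
            children[parent_id].append(post_id)
    return children


def count_descendants(children, post_id):
    """Count all direct and indirect replies below a post (iterative walk)."""
    total = 0
    stack = list(children.get(post_id, []))
    while stack:
        current = stack.pop()
        total += 1
        stack.extend(children.get(current, []))
    return total


# Hypothetical ids: post 1 has replies 2 and 3, and 4 replies to 2
posts = [(1, None), (2, 1), (3, 1), (4, 2), (5, None)]
children = build_children_index(posts)
print(count_descendants(children, 1))
```

Comparing such a count against the `descendants` column for a few stories would be one way to test both the `parent_id` and the `descendants` guesses.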

### Related to who's hiring

* how to filter who's hiring posts
* create a pipeline for text analysis
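One possible starting point for both items, sketched in Python against the standardized column names. It assumes the monthly threads are stories whose title starts with "Ask HN: Who is hiring?" and that job postings are comments whose `parent_id` is the thread's `post_id`; neither assumption has been verified against the data yet. The rows below are invented for illustration, and `hiring_thread_ids` and `hiring_comments` are hypothetical helpers.

```python
import html
import re

# Assumed title pattern for the monthly hiring threads
HIRING_TITLE = re.compile(r"^Ask HN: Who is hiring\?", re.IGNORECASE)


def hiring_thread_ids(posts):
    """Collect post ids of stories whose title matches the hiring pattern."""
    return {
        p["post_id"]
        for p in posts
        if p["post_type"] == "story"
        and p.get("title")
        and HIRING_TITLE.match(p["title"])
    }


def hiring_comments(posts):
    """Return top-level comments on hiring threads with HTML entities decoded."""
    thread_ids = hiring_thread_ids(posts)
    return [
        dict(p, content=html.unescape(p["content"] or ""))
        for p in posts
        if p["post_type"] == "comment" and p.get("parent_id") in thread_ids
    ]


# Invented rows, shaped like the standardized hn_posts model
posts = [
    {"post_id": 1, "post_type": "story", "parent_id": None,
     "title": "Ask HN: Who is hiring? (July 2022)", "content": None},
    {"post_id": 2, "post_type": "comment", "parent_id": 1,
     "title": None, "content": "Acme &#x2F; Remote &#x2F; Data engineer"},
    {"post_id": 3, "post_type": "story", "parent_id": None,
     "title": "Show HN: something else", "content": None},
    {"post_id": 4, "post_type": "comment", "parent_id": 3,
     "title": None, "content": "It&#x27;s unrelated"},
]
for comment in hiring_comments(posts):
    print(comment["post_id"], comment["content"])
```

Decoding entities such as `&#x27;` and `&#x2F;` first matters because the raw `text` values in the source contain them, as the sampled rows show.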

In [41]:
model.schema

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type,is_generated,generation_expression,is_stored,is_hidden,is_updatable,is_system_defined,is_partitioning_column,clustering_ordinal_position,collation_name
0,hn-whos-hiring,00dev,hn_posts,post_id,1,YES,INT64,NEVER,,,NO,,NO,NO,,
1,hn-whos-hiring,00dev,hn_posts,title,2,YES,STRING,NEVER,,,NO,,NO,NO,,
2,hn-whos-hiring,00dev,hn_posts,url,3,YES,STRING,NEVER,,,NO,,NO,NO,,
3,hn-whos-hiring,00dev,hn_posts,submitter_id,4,YES,STRING,NEVER,,,NO,,NO,NO,,
4,hn-whos-hiring,00dev,hn_posts,content,5,YES,STRING,NEVER,,,NO,,NO,NO,,
5,hn-whos-hiring,00dev,hn_posts,submit_timestamp,6,YES,TIMESTAMP,NEVER,,,NO,,NO,NO,,
6,hn-whos-hiring,00dev,hn_posts,dead,7,YES,BOOL,NEVER,,,NO,,NO,NO,,
7,hn-whos-hiring,00dev,hn_posts,post_score,8,YES,INT64,NEVER,,,NO,,NO,NO,,
8,hn-whos-hiring,00dev,hn_posts,parent_id,9,YES,INT64,NEVER,,,NO,,NO,NO,,
9,hn-whos-hiring,00dev,hn_posts,post_type,10,YES,STRING,NEVER,,,NO,,NO,NO,,


In [40]:
model = faldbt.list_models()[0]

In [39]:
# %cd ../hn_whos_hiring
# !dbt run -s models/hn_posts.sql
# %cd ../notebooks

/home/butch2/play/experiments/hn_whos_hiring/notebooks
14:14:39  Running with dbt=1.1.1
14:14:39  Found 1 model, 0 tests, 0 snapshots, 7 analyses, 191 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
14:14:39  
14:14:41  Concurrency: 1 threads (target='dev')
14:14:41  
14:14:41  1 of 1 START view model 00dev.hn_posts ......................................... [RUN]
14:14:43  1 of 1 OK created view model 00dev.hn_posts .................................... [OK in 1.72s]
14:14:43  
14:14:43  Finished running 1 view model in 3.55s.
14:14:43  
14:14:43  Completed successfully
14:14:43  
14:14:43  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
/home/butch2/play/experiments/hn_whos_hiring/notebooks
