> A project to analyze Hacker News stories using nbdbt

## Analysis

In [1]:
#| echo: false
%reload_ext nbdbt.dbt_cellmagic

In [2]:
#| echo: false
%dbtconfig -p ../hn_whos_hiring -n notebooks/analysis.ipynb

### Raw HN Source

> This is the raw data for all Hacker News posts

It is sourced from Google's BigQuery Public Datasets
and is accessible as
```
bigquery-public-data.hacker_news.full
```

It seems to be updated on a regular basis *(see timestamp of latest post)*.

In [3]:
%%dbt -a raw_sources analyses/raw_hn_source.sql
select *
from {{ source('public_datasets', 'full_stories') }}
order by timestamp desc


In [4]:
df = raw_sources.ref(10); df.head()

Unnamed: 0,title,url,text,dead,by,score,time,timestamp,type,id,parent,descendants,ranking,deleted
0,,,POSIX sh parameter expansion cheat sheet:<p><a...,,jwilk,,1656580359,2022-06-30 09:12:39+00:00,comment,31930208,31928736.0,,,
1,,,That also ate a lot of small healthy business ...,,Existenceblinks,,1656580348,2022-06-30 09:12:28+00:00,comment,31930207,31929941.0,,,
2,,,Does anyone actually like using JIRA? Or Confl...,,gaff33,,1656580322,2022-06-30 09:12:02+00:00,comment,31930206,31929941.0,,,
3,,,Not at all is the issue. IIRC svn checkout is ...,,masklinn,,1656580321,2022-06-30 09:12:01+00:00,comment,31930205,31929148.0,,,
4,,,"That&#x27;s because, if the chip uses 20% less...",,tintedfireglass,,1656580316,2022-06-30 09:11:56+00:00,comment,31930204,31925613.0,,,


The column names are fairly descriptive, but might need to be standardized.

In [5]:
df.columns.values

array(['title', 'url', 'text', 'dead', 'by', 'score', 'time', 'timestamp',
       'type', 'id', 'parent', 'descendants', 'ranking', 'deleted'],
      dtype=object)
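A minimal sketch of how such a renaming could be previewed in pandas before it is written into a dbt model. The mapping is an assumption: it reuses the two target names (`post_id`, `author`) that appear in the Standardization model later in this notebook, and the empty `preview` frame is only a stand-in for `df`.

```python
import pandas as pd

# Column names as returned by the raw source.
columns = ['title', 'url', 'text', 'dead', 'by', 'score', 'time', 'timestamp',
           'type', 'id', 'parent', 'descendants', 'ranking', 'deleted']

# Empty stand-in frame; in the notebook this would be `df`.
preview = pd.DataFrame(columns=columns)

# Assumed mapping: only the two renames used in the Standardization model.
renamed = preview.rename(columns={"id": "post_id", "by": "author"})

print(list(renamed.columns))
```

Renaming `by` also avoids having to backtick-quote a reserved word in every downstream BigQuery query.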

### Exploratory Data Analysis 
> based on a sample of the 10 latest posts

* Check whether `time` and `timestamp` contain the same information, so that one column can be eliminated

In [6]:
from datetime import datetime
import pandas as pd

In [7]:
df['newtimestamp'] = pd.to_datetime(df['time'].astype(float), unit='s',origin='unix', utc=True)

In [8]:
# True only if every sampled row has matching values in both columns
(df['newtimestamp'] == df['timestamp']).all()

True

Looks like they are one and the same.
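As a spot check on a single value, the conversion can also be done with the standard library alone. This sketch takes the `time` and `timestamp` values of the first sample row shown above; the variable names are illustrative.

```python
from datetime import datetime, timezone

# `time` value of the first sample row (seconds since the Unix epoch).
epoch_seconds = 1656580359

# Convert to a timezone-aware UTC datetime.
converted = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# `timestamp` value of the same row, as printed in the sample output.
expected = datetime.fromisoformat("2022-06-30 09:12:39+00:00")

matches = converted == expected
print(converted.isoformat(), matches)
```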

### EDA Questions

* Check total record count

In [9]:
%%dbt -a tot_rec analyses/count_hn_source.sql
select count(*) as rec_count 
from {{ source('public_datasets', 'full_stories') }}


In [10]:
tot_rec_df = tot_rec.ref()

In [11]:
total_records = tot_rec_df.iloc[0].rec_count

* Check whether any of `id`, `by`, `time`, `timestamp`, `type`, or `dead` are null

In [12]:
%%dbt -a null_field_counts analyses/null_field_counts.sql
with hn_posts
as (
select
     `by` as author,
     * except(`by`)
from {{ source('public_datasets', 'full_stories') }}
)
select 
   'id' as field, 
    count(*) as null_count,
from hn_posts
where id is null
union all
select 
   'author' as field, 
    count(*) as null_count,
from hn_posts
where author is null
union all
select 
   'time' as field, 
    count(*) as null_count,
from hn_posts
where time is null
union all
select 
   'timestamp' as field, 
    count(*) as null_count,
from hn_posts
where timestamp is null
union all
select 
   'type' as field, 
    count(*) as null_count,
from hn_posts
where type is null
union all
select 
   'dead' as field, 
    count(*) as null_count,
from hn_posts
where dead is null



In [13]:
null_df = null_field_counts.ref()

In [14]:
null_df['pct'] = null_df['null_count']/total_records

In [15]:
null_df

Unnamed: 0,field,null_count,pct
0,id,0,0.0
1,type,0,0.0
2,dead,30560332,0.957098
3,time,26818,0.00084
4,timestamp,26818,0.00084
5,author,947682,0.02968


| So `id` and `type` always have values,
| but there are entries with no `time` or `timestamp` (very few, less than 0.1 percent),
| entries with no `author` (around 3 percent),
| and about 96 percent of entries have a null `dead` field.

* Check for unique values of `dead`

In [16]:
%%dbt -a dead_type_counts analyses/dead_types_counts.sql
select dead as dead_type, count(*) as dead_count
from {{ source('public_datasets', 'full_stories') }}
group by dead 
order by dead_count desc

In [17]:
dead_types_df = dead_type_counts.ref()
dead_types_df['pct'] = dead_types_df['dead_count']/total_records

In [18]:
dead_types_df

Unnamed: 0,dead_type,dead_count,pct
0,,30560332,0.957098
1,True,1369875,0.042902


* Check for unique values of `type`  

In [19]:
%%dbt -a type_counts analyses/types_counts.sql
select type as type, count(*) as type_count
from {{ source('public_datasets', 'full_stories') }}
group by type 
order by type_count desc

In [20]:
types_df = type_counts.ref()
types_df['pct'] = types_df['type_count']/total_records

In [21]:
types_df

Unnamed: 0,type,type_count,pct
0,comment,27351834,0.856613
1,story,4547262,0.142413
2,job,15502,0.000485
3,pollopt,13633,0.000427
4,poll,1976,0.000062
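A quick consistency check, sketched with the counts reported in the two outputs above: the `dead` breakdown and the `type` breakdown both partition the same table, so their totals should agree. The Series are typed in by hand here rather than read from the dbt results.

```python
import pandas as pd

# Counts copied from the `type` and `dead` outputs above.
type_counts = pd.Series(
    {"comment": 27351834, "story": 4547262, "job": 15502,
     "pollopt": 13633, "poll": 1976},
    name="type_count",
)
dead_counts = pd.Series({"null": 30560332, "True": 1369875}, name="dead_count")

# Both breakdowns cover every row, so the totals should be identical.
total_from_types = int(type_counts.sum())
total_from_dead = int(dead_counts.sum())
totals_agree = total_from_types == total_from_dead

# Share of each type, computed from its own count column.
type_share = type_counts / total_from_types

print(total_from_types, totals_agree)
print(type_share)
```

Dividing each count by a total derived from the same column keeps the percentages tied to the right frame, which guards against mixing up columns from different results.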


## Standardization
> Standardize column names and types so downstream transformations don't have to deal with that

In [22]:
%%dbt -a hn_posts -n notebooks/analysis.ipynb models/hn_posts.sql
select 
  id as post_id, 
  title,
  url,
  text,
  `by` as author,
  time as a
   
from {{ source('public_datasets', 'full_stories') }}
order by timestamp desc
