# record-count-estimate

A short, ad hoc script that estimates the total number of Reddit posts across all files from a small sample, using the central limit theorem (CLT).

In [1]:
# Estimate the number of reddit posts in all files
# from a small sample

import numpy as np

## Preparation

This loads the file containing the post count of one monthly file per year, then builds an array of those counts for the estimate.

In [2]:
# This file has the post counts from one file for each year
# It was generated with grep

reddit_file = 'Reddit_Posts/zipped/reddit-post-counts.txt'

# Split on newlines so that each entry is one 'name:count' record

with open(reddit_file) as f:
    contents = f.read().split('\n')

In [3]:
contents

['RC_2006-12:61018',
 'RC_2007-11:372983',
 'RC_2008-10:789874',
 'RC_2009-09:2032276',
 'RC_2010-08:4247982',
 'RC_2011-07:10557466',
 'RC_2012-06:21897913',
 'RC_2013-05:33126225',
 'RC_2014-04:42440735',
 'RC_2015-04:55005780']
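The `name:count` lines above can also be parsed into a mapping, so that each count stays tied to the file it came from. This is a minimal sketch, assuming every non-blank line has exactly the `name:count` shape shown; `parse_counts` is a hypothetical helper and is not part of the original script.

```python
def parse_counts(lines):
    """Map each file name to its integer post count.

    Blank lines (for example, from a trailing newline) are skipped.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # partition splits on the first ':' only
        name, _, count = line.partition(':')
        counts[name] = int(count)
    return counts


# Sample lines copied from the output above, plus a trailing blank entry
sample_lines = ['RC_2006-12:61018', 'RC_2007-11:372983', '']
parsed = parse_counts(sample_lines)
print(parsed)
```

Skipping blank entries matters if the counts file ends with a newline, since splitting on `'\n'` would then leave an empty string at the end of the list.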

In [4]:
# Make an array with just the actual counts
# You can change num_years to see how that affects the sample estimate

num_years = 10

counts = [line.split(':')[1] for line in contents[0:num_years]]
count_array = np.array(counts).astype(float)
count_array

array([   61018.,   372983.,   789874.,  2032276.,  4247982., 10557466.,
       21897913., 33126225., 42440735., 55005780.])
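The comment about changing `num_years` can be explored in one pass rather than by re-running the cell. The sketch below recomputes the projected total for each prefix of the sample, using the same rule as the estimation step (sample mean times the number of monthly files). The helper name `projected_total` is an assumption made for illustration.

```python
import numpy as np

# Counts copied from the array above (one monthly file per year, 2006-2015)
count_array = np.array([
    61018, 372983, 789874, 2032276, 4247982,
    10557466, 21897913, 33126225, 42440735, 55005780,
], dtype=float)

months = 12


def projected_total(counts, months=12):
    """Projected total: sample mean times the number of monthly files."""
    n_files = len(counts) * months
    return np.mean(counts) * n_files


# Projected total when only the first n years of the sample are used
totals = {
    n: projected_total(count_array[:n], months)
    for n in range(1, len(count_array) + 1)
}

for n, total in totals.items():
    print(f"num_years={n:2d}  projected total = {total:0.3}")
```

Because the yearly counts grow steeply, the projection depends heavily on how many of the later years are included.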

## Estimation

Estimate the expected total number of posts and its standard error, then compute the upper bound of a 95% confidence interval and report the results.
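In symbols, the cell that follows computes the quantities below. This is a sketch of the reasoning, assuming the per-file counts can be treated as independent draws from one distribution; here $\bar{x}$ and $s$ are the mean and standard deviation of the sampled counts and $N$ is the total number of monthly files (years times 12).

$$
\hat{T} = N\,\bar{x}, \qquad
\mathrm{SE}(\hat{T}) = s\,\sqrt{N}, \qquad
\text{upper bound} = \hat{T} + 1.96\,\mathrm{SE}(\hat{T})
$$

The middle term is the standard deviation of a sum of $N$ independent counts that each have standard deviation $s$, and 1.96 is the standard normal critical value for a two-sided 95% interval.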

In [5]:
# Estimate the number of posts in all files using the CLT

# Total number of files = months * years (stored in sample_size)
# alpha holds the z critical value (1.96) for a two-sided 95% confidence interval

months = 12
sample_size = len(count_array) * months
alpha = 1.96

# Estimate the expected total, its standard error,
# and the upper bound of the confidence interval.
# Note: np.std uses the population formula (ddof=0) by default.

exp_value = np.mean(count_array) * sample_size
std_error = np.std(count_array) * np.sqrt(sample_size)
upper_ci = exp_value + (alpha * std_error)

print(f"Expected number of posts for all reddit files {exp_value:0.3}")
print(f"Expected standard error {std_error:0.3}")
print(f"The upper CI is {upper_ci:0.3}")

Expected number of posts for all reddit files 2.05e+09
Expected standard error 2.09e+08
The upper CI is 2.46e+09
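The same calculation can be wrapped in a function to compare the default population standard deviation (`ddof=0`, which is what `np.std` uses unless told otherwise) with the sample standard deviation (`ddof=1`). This is a sketch only; `estimate_total` and its parameter names are assumptions introduced here, not part of the original script.

```python
import numpy as np


def estimate_total(counts, n_files, z=1.96, ddof=0):
    """Return (expected total, standard error, upper bound) for n_files files.

    ddof=0 matches np.std's default (population formula);
    ddof=1 uses the sample standard deviation instead.
    """
    counts = np.asarray(counts, dtype=float)
    expected = counts.mean() * n_files
    std_error = counts.std(ddof=ddof) * np.sqrt(n_files)
    upper = expected + z * std_error
    return expected, std_error, upper


# Counts copied from the sample above (one monthly file per year)
counts = [
    61018, 372983, 789874, 2032276, 4247982,
    10557466, 21897913, 33126225, 42440735, 55005780,
]
n_files = len(counts) * 12

for ddof in (0, 1):
    expected, std_error, upper = estimate_total(counts, n_files, ddof=ddof)
    print(f"ddof={ddof}: expected={expected:0.3}, "
          f"std_error={std_error:0.3}, upper={upper:0.3}")
```

With `ddof=0` this reproduces the figures printed above; with `ddof=1` the standard error, and therefore the upper bound, comes out somewhat larger because the sample has only 10 values.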


## Results

So, using one monthly file from each of the 10 years from 2006 to 2015, the expected total is 2.05B posts with a standard error of 209M. The upper bound of the 95% confidence interval is 2.46B, so the program should be sized to process roughly 2.46B records.