# Data Sampling

According to the [UserGuide2024.pdf](../references/UserGuide2024.pdf), the text file contains 3,638,436 rows. That is too large for reasonable processing times and for GitHub's file size limits, so we need to sample the file and work with a smaller set.

Need to:
1. Check whether the file is ordered by date, i.e. whether sampling the first n rows would only cover early 2024
2. Take a sample of the file

In [1]:
# import standard packages
import numpy as np
import json
import pandas as pd
import plotly.express as px

In [2]:
path = "../data/raw/Nat2024PublicUS.c20250512.r20250708.txt"
sample_size = 100000

## Date ordering

In [3]:
# data is stored in fixed-width format; load the schema defining where to split each field
# (the with block closes the file automatically)
with open("../references/schema.json") as f:
    schema = json.load(f)

# the schema lists 1-based, inclusive positions as in the documentation;
# pandas expects 0-based, half-open intervals, so shift each start by one
for k, v in schema.items():
    v = list(v.split(","))
    v[0] = int(v[0]) - 1
    v[1] = int(v[1])
    schema[k] = tuple(v)
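
The position conversion can be illustrated on a toy record. This is a minimal sketch: the two schema entries, the `to_colspec` helper, and the sample record are illustrative assumptions, not taken from the real `schema.json`.

```python
# Hypothetical schema entries in the same "start,end" string format;
# the field names and positions here are illustrative assumptions.
example_schema = {"DOB_YY": "9,12", "DOB_MM": "13,14"}


def to_colspec(position):
    """Convert a 1-based inclusive "start,end" string to a 0-based half-open tuple."""
    start, end = (int(p) for p in position.split(","))
    return (start - 1, end)


colspecs = {name: to_colspec(pos) for name, pos in example_schema.items()}

# A made-up fixed-width record: 8 filler characters, then a year and a month
record = "x" * 8 + "2024" + "07"

# Python slicing uses the same half-open convention as the converted colspecs
fields = {name: record[start:end] for name, (start, end) in colspecs.items()}
print(colspecs)
print(fields)
```

Only the start position shifts: an inclusive 1-based end and an exclusive 0-based end are the same number.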

In [4]:
# read in the data
df = pd.read_fwf(
    path,
    colspecs = list(schema.values()),
    nrows = sample_size,
    header = None,
    names = list(schema.keys()),
    usecols = ["DOB_MM"] # for speed only read the month column to see distribution
)

In [5]:
# check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   DOB_MM  100000 non-null  int64
dtypes: int64(1)
memory usage: 781.4 KB


In [6]:
# what is the distribution by month?
df.groupby("DOB_MM").size()

DOB_MM
1     8642
2     7836
3     7992
4     7903
5     8223
6     7835
7     8852
8     9044
9     8522
10    8608
11    7829
12    8714
dtype: int64

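All twelve months appear in the first 100,000 rows in roughly equal proportions (about 7.8% to 9.0% each), so the file does not appear to be ordered by birth month. Taking the first 100,000 rows is therefore a reasonable sample, so let's proceed with it.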
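
One rough way to quantify "relatively even" is to compare each month's share of rows with its share of days in 2024. The sketch below uses the counts printed above and assumes, as a simplification, that births are spread uniformly over the days of the year. The variable names are illustrative.

```python
import calendar

# Row counts per birth month, copied from the groupby output above
counts = {
    1: 8642, 2: 7836, 3: 7992, 4: 7903, 5: 8223, 6: 7835,
    7: 8852, 8: 9044, 9: 8522, 10: 8608, 11: 7829, 12: 8714,
}

total = sum(counts.values())
# 2024 is a leap year, so this sums to 366
days_in_year = sum(calendar.monthrange(2024, m)[1] for m in counts)

deviations = {}
for month, n in counts.items():
    observed = n / total
    # Assumption: births are spread evenly over days (a simplification)
    expected = calendar.monthrange(2024, month)[1] / days_in_year
    deviations[month] = observed - expected

max_abs_deviation = max(abs(d) for d in deviations.values())
print(f"largest gap between observed and expected share: {max_abs_deviation:.4f}")
```

Every month's observed share is within one percentage point of its day-based share, which is consistent with the rows not being sorted by month.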

## Extract sample

In [7]:
# read the first sample_size lines; the with block closes the file automatically
with open(path) as input_file:
    head = [next(input_file) for _ in range(sample_size)]

In [None]:
# by convention sampled data should be stored in /data/interim as semi-processed
output_path = path.replace("raw", "interim")
output_path = output_path.replace(".txt", "_sample.txt")

# open in write mode ("w") so rerunning the cell overwrites the sample
# instead of appending a duplicate copy
with open(output_path, "w") as f:
    f.writelines(head)
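
As an alternative to holding all sampled lines in a list, the read and write steps could be combined into one streaming copy. This is a sketch only: `copy_head` is a hypothetical helper, not part of the notebook, and it relies on `itertools.islice` to stop after `n_rows` lines without raising an error if the file is shorter.

```python
from itertools import islice


def copy_head(src_path, dst_path, n_rows):
    """Copy at most n_rows lines from src_path to dst_path; return the number written."""
    written = 0
    # "w" mode overwrites any existing file, so reruns do not duplicate rows
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in islice(src, n_rows):
            dst.write(line)
            written += 1
    return written
```

In this notebook the call would look like `copy_head(path, output_path, sample_size)`, and the returned count can be compared against `sample_size` as a quick sanity check.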