# Gathering Stats about Latvian Periodicals

This notebook explores zip archives that store plaintext files of Latvian periodicals and gathers some statistics about the periodicals and their content.

## Years published

Our first task is to extract publication years from the periodicals' file names using regular expressions.

The file names are read from the zip archives with the `zipfile` module, which can list an archive's entries without unpacking it.
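As a minimal sketch of the idea, the snippet below builds a tiny zip archive in memory, lists its entry names, and pulls the first 4-digit run out of each name. The archive contents and the `abc…` file names are made up for illustration; only `zipfile.ZipFile.namelist` and `re.search` are the real calls the notebook relies on.

```python
import io
import re
import zipfile

# Build a tiny zip archive in memory so the example needs no files on disk.
# The file names below are hypothetical, not taken from the real archives.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("demo/abc1925n01_001_plaintext_s01.txt", "text")
    z.writestr("demo/abc1931n07_002_plaintext_s01.txt", "text")

# List the entry names without extracting anything.
with zipfile.ZipFile(buffer) as z:
    names = z.namelist()

# Take the first run of 4 digits in each name as the year.
years = []
for name in names:
    match = re.search(r"\d{4}", name)
    if match:
        years.append(int(match.group()))

print(names)
print(years)
```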

In [36]:
# imports used throughout the notebook
import platform
import re
import sys
import zipfile
from datetime import datetime
from pathlib import Path

import pandas as pd
import tqdm

# report the environment: Python version, time, CPU, working directory, pandas version
print(f"Python version: {sys.version}")
print(f"Current date and time: {datetime.now()}")
print(f"Computer processor: {platform.processor()}")
print(f"Current working directory: {Path.cwd()}")
print(f"Pandas version: {pd.__version__}")


Python version: 3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]
Current date and time: 2024-06-11 11:53:29.629200
Computer processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
Current working directory: c:\Users\vsaules\Github\lnb_transports\notebooks
Pandas version: 2.2.1


In [8]:
src_folder = Path("I:/zips")
assert src_folder.exists(), f"Source folder {src_folder} does not exist"
print(f"Source folder: {src_folder}")
# list the zip files whose names contain the word "articles"
files = list(src_folder.glob("*articles*.zip"))
print(f"Number of files: {len(files)}")


Source folder: I:\zips
Number of files: 118


In [33]:
# extract the file names (without folders) from a single zip file
def get_file_names(zip_file: Path, skip=1) -> list:
    with zipfile.ZipFile(zip_file, "r") as z:
        # skip the leading entries; this assumes the first entry is the folder itself
        file_paths = z.namelist()[skip:]
        # keep only the base name of each path
        file_names = [Path(file_path).name for file_path in file_paths]
        return file_names
z_file = files[0]
print(f"File: {z_file}")
file_names = get_file_names(z_file)
print(f"Number of files in zip: {len(file_names)}")

File: I:\zips\adelaides_latviesu_zinotajs_articles.zip
Number of files in zip: 725
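Skipping the first entry only works when the archive lists exactly one folder entry first. A possible alternative, sketched here under the assumption that directory entries should simply be ignored wherever they appear, filters them with `ZipInfo.is_dir()`. The function name `get_file_names_no_dirs` and the demo archive are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def get_file_names_no_dirs(zip_source) -> list:
    """Return the base names of all non-directory entries in a zip archive."""
    with zipfile.ZipFile(zip_source, "r") as z:
        return [
            PurePosixPath(info.filename).name  # zip entries always use "/"
            for info in z.infolist()
            if not info.is_dir()  # True for entries whose name ends with "/"
        ]


# Hypothetical demo archive: a file listed before the folder entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("a.txt", "x")
    z.writestr("folder/", "")
    z.writestr("folder/b.txt", "y")

demo_names = get_file_names_no_dirs(buffer)
print(demo_names)
```

With this variant no `skip` argument is needed, at the cost of inspecting every entry's `ZipInfo` rather than slicing the name list.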


In [22]:
# first 3 filenames
print(file_names[:3])

['xalz1962n114_001_plaintext_s01.txt', 'xalz1962n114_001_plaintext_s02.txt', 'xalz1962n114_002_plaintext_s03.txt']


In [26]:
# take a single file name and extract two pieces of information:
# title - any text before the year
# year - the first 4-digit number in the file name
# a regular expression finds the year
def get_title_and_year(file_name: str) -> dict:
    # find the first run of 4 digits
    match = re.search(r"\d{4}", file_name)
    # no 4-digit number in the name: return an empty dict
    if match is None:
        return {}
    year = match.group()
    # title is any text before the year
    title = file_name[:match.start()].strip()
    return {"title": title, "year": int(year)}
# let's test this function on first file name
d = get_title_and_year(file_names[0])
print(f"Title: {d['title']}, Year: {d['year']}")


Title: xalz, Year: 1962
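The pattern `\d{4}` accepts any four digits, so a title that itself contains digits could be mistaken for a year. One way to tighten this, sketched below, is to restrict the year to a plausible range and use named groups. The range 1500 to 2099 and the names `YEAR_PATTERN` and `parse_title_and_year` are assumptions made for illustration, not part of the notebook.

```python
import re

# Assumed plausible range for publication years: 1500-2099.
# The lazy "title" group captures everything before the first such year.
YEAR_PATTERN = re.compile(r"^(?P<title>.*?)(?P<year>1[5-9]\d{2}|20\d{2})")


def parse_title_and_year(file_name: str) -> dict:
    """Return {'title', 'year'} or an empty dict when no plausible year is found."""
    match = YEAR_PATTERN.search(file_name)
    if match is None:
        return {}
    return {"title": match.group("title"), "year": int(match.group("year"))}


print(parse_title_and_year("xalz1962n114_001_plaintext_s01.txt"))
print(parse_title_and_year("ab0012x1935n1.txt"))  # hypothetical name with digits in the title
print(parse_title_and_year("readme.txt"))
```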


In [28]:
# given a list of file names, return a list of dictionaries,
# each with two keys: title and year
def get_titles_and_years(file_names: list) -> list:
    results = [get_title_and_year(file_name) for file_name in file_names]
    # skip names for which no title and year could be extracted
    return [result for result in results if result]
# let's test this function
titles_and_years = get_titles_and_years(file_names)
# first 3 titles and years
print(titles_and_years[:3])
# last 3 titles and years
print(titles_and_years[-3:])

[{'title': 'xalz', 'year': 1962}, {'title': 'xalz', 'year': 1962}, {'title': 'xalz', 'year': 1962}]
[{'title': 'xalz', 'year': 2003}, {'title': 'xalz', 'year': 2003}, {'title': 'xalz', 'year': 2003}]


In [32]:
# take the list of title/year dictionaries and summarize it
# optional arguments: start_year (default 1920) and end_year (default 1940)
# the list is converted into a pandas DataFrame, from which we extract:
# min_year, max_year, total count of files, count of files between start_year and end_year (inclusive)
def get_summary(titles_and_years: list, start_year=1920, end_year=1940) -> dict:
    df = pd.DataFrame(titles_and_years)
    min_year = df["year"].min()
    max_year = df["year"].max()
    total_count = df.shape[0]
    count_between = df.query(f"{start_year} <= year <= {end_year}").shape[0]
    return {"min_year": min_year, "max_year": max_year, "total_count": total_count, f"count_{start_year}_{end_year}": count_between}
# let's test this function
summary = get_summary(titles_and_years)
print(summary)

{'min_year': 1962, 'max_year': 2003, 'total_count': 725, 'count_1920_1940': 0}
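The inclusive range count can also be written with `Series.between`, which includes both endpoints by default, and the same mask gives a per-year breakdown through `value_counts`. The records below are made-up data for illustration only.

```python
import pandas as pd

# Hypothetical records in the same shape as titles_and_years.
records = [
    {"title": "abc", "year": 1919},
    {"title": "abc", "year": 1920},
    {"title": "abc", "year": 1931},
    {"title": "abc", "year": 1940},
    {"title": "abc", "year": 1941},
]
demo_df = pd.DataFrame(records)

# Boolean mask for years in [1920, 1940]; both endpoints are included by default.
in_range = demo_df["year"].between(1920, 1940)
count_in_range = int(in_range.sum())

# Number of files per year inside the range, ordered by year.
per_year = demo_df.loc[in_range, "year"].value_counts().sort_index()

print(count_in_range)
print(per_year)
```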


In [40]:
# given a zip file, return a summary of all files in that zip
# the dictionary also gets a "publication" key holding the zip file stem
def get_zip_summary(zip_file: Path, rstrip="_articles") -> dict:
    file_names = get_file_names(zip_file)
    titles_and_years = get_titles_and_years(file_names)
    summary = get_summary(titles_and_years)
    # drop the trailing suffix from the stem; str.rstrip would strip a set of
    # characters rather than the suffix and also eat the final "s" of "zinotajs"
    summary["publication"] = zip_file.stem.removesuffix(rstrip)
    # also store the full zip file path as a string with forward slashes
    summary["zip_file"] = zip_file.as_posix()
    return summary
# let's test this function
zip_summary = get_zip_summary(z_file)
print(zip_summary)

{'min_year': 1962, 'max_year': 2003, 'total_count': 725, 'count_1920_1940': 0, 'publication': 'adelaides_latviesu_zinotajs', 'zip_file': 'I:/zips/adelaides_latviesu_zinotajs_articles.zip'}
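A note on trimming the `_articles` suffix: `str.rstrip` treats its argument as a set of characters to strip from the right, not as a suffix, so it can remove more than intended. `str.removesuffix` (Python 3.9+) removes the exact suffix only. The short comparison below uses the stem of the zip file shown above.

```python
stem = "adelaides_latviesu_zinotajs_articles"

# rstrip removes any trailing characters found in the set {_, a, r, t, i, c, l, e, s},
# so the final "s" of "zinotajs" is stripped as well.
stripped = stem.rstrip("_articles")

# removesuffix removes the exact trailing string and nothing else.
trimmed = stem.removesuffix("_articles")

print(stripped)
print(trimmed)
```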
