# Zuck Metadata Adder
*Eddie Chapman*

Adds metadata headings to text files for the Zuckerberg Files project. It needs a folder of text files (without metadata) and a CSV file of metadata matching the text files (each .txt filename must equal the CSV `record_id` field). Metadata fields are formatted as `##metadata field##`. A `##content##` header is left at the end of the metadata header to mark where the content begins. Once metadata is added, the `Zuck PDF Cover Creator`, `Zuck PDF Body Creator` and `Zuck PDF Merger` can be used to turn the text files into PDFs.

### Set-up

Format the metadata CSV with the following headings:

`record_id` | `participants` | `record_type` | `record_format` | `date` | `source` | `title` | `url` | `description`

It's OK to have more records in the CSV than there are .txt files; rows without a matching text file are ignored.
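
As a rough illustration of the layout described above, the sketch below writes a one-row CSV with these headings using `csv.DictWriter`. The file name `sample-metadata.csv`, the helper `write_sample_csv` and every value in the row are made up for the example; they are not real project records.

```python
import csv
import os
import tempfile

FIELDNAMES = ['record_id', 'participants', 'record_type', 'record_format',
              'date', 'source', 'title', 'url', 'description']

# Hypothetical record, for illustration only
sample_row = {
    'record_id': 'example-001',
    'participants': 'Speaker A; Speaker B',
    'record_type': 'Interview',
    'record_format': 'Transcript',
    'date': '2010-01-01',
    'source': 'Example Source',
    'title': 'An example title',
    'url': 'https://example.com/record',
    'description': 'A made-up description.',
}

def write_sample_csv(path):
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writeheader()          # heading row
        writer.writerow(sample_row)   # one data row

# Write into a throwaway temp folder so nothing real is overwritten
sample_path = os.path.join(tempfile.mkdtemp(), 'sample-metadata.csv')
write_sample_csv(sample_path)
```

A text file named `example-001.txt` would be the one matched to this row.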

### Libraries

- `csv` to manipulate the metadata CSV
- `os` to set local working directory
- `chardet` to provide encoding detection support

In [19]:
import csv
import os
from chardet.universaldetector import UniversalDetector

Set your current working directory to the folder where your text files and metadata CSV are located.

Remember to escape each backslash in a Windows path by doubling it ('`C:\Users\Username\...`' becomes '`C:\\Users\\Username\\...`').

In [20]:
# Set directory to the inside of folder full of text files that need editing
os.chdir('C:\\Users\\chapman4\\Downloads\\zuck-text-ready-test\\text\\text-new')
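
If doubling every backslash gets tedious, a raw string or a path-joining call produces the same path. The sketch below compares the three spellings. The folder names are placeholders, not a real location, and `ntpath` is used only so the demo behaves the same on any operating system (on Windows, `os.path` is the same module as `ntpath`).

```python
import ntpath  # Windows path rules, importable on every platform

# Three ways to spell the same (placeholder) Windows path
escaped = 'C:\\Users\\Username\\texts'                       # doubled backslashes
raw = r'C:\Users\Username\texts'                             # raw string, no doubling
joined = ntpath.join('C:\\', 'Users', 'Username', 'texts')   # built from parts

print(escaped == raw == joined)  # True
```

Any of the three could be passed to `os.chdir`.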

### Create Metadata

This reads the CSV file. It goes row by row, and each row's values are sent to a dictionary using the `fieldnames` list. That row's dictionary is added to the `metadata` list of rows and the loop goes on to the next record. 

In [21]:
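def create_metadata():
    """Read zuck-metadata.csv and return its rows as a list of dicts."""
    fieldnames = ['record_id', 'participants', 'record_type', 'record_format',
                  'date', 'source', 'title', 'url', 'description']
    metadata = []
    # newline='' is the csv module's recommended way to open CSV files
    with open('zuck-metadata.csv', encoding="UTF-8", newline='') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames)
        for row in reader:
            # With explicit fieldnames the heading row is read as data, so skip it
            if row['record_id'] == 'record_id':
                continue
            metadata.append({name: row[name] for name in fieldnames})
    return metadata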
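
One detail worth knowing about this approach: when `csv.DictReader` is given an explicit `fieldnames` list, it does not treat the first line of the file as a header, so a heading line is read like any other row. The sketch below shows the difference on a small in-memory CSV with only two columns, a cut-down, made-up stand-in for the real file.

```python
import csv
import io

# Two-column stand-in for the metadata CSV (made-up values)
sample = "record_id,title\nexample-001,An example title\n"

# Explicit fieldnames: the heading line comes back as the first row
with_names = list(csv.DictReader(io.StringIO(sample), ['record_id', 'title']))

# No fieldnames: the first line is consumed as the header
from_header = list(csv.DictReader(io.StringIO(sample)))

print(len(with_names), with_names[0]['record_id'])    # 2 record_id
print(len(from_header), from_header[0]['record_id'])  # 1 example-001
```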

### List filenames
This figures out which files we're going to modify and which rows of the metadata CSV hold the matching information (text filename = CSV `record_id`).

In [22]:
# Creating list of filenames in folder specified when setting directory above
def list_filenames():
    # splitext drops only the final ".txt", so names containing other dots stay intact
    filenames = [os.path.splitext(name)[0] for name in os.listdir(".") if name.endswith(".txt")]
    return filenames
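
File names that contain extra dots are the case to watch when stripping the extension. As a small sketch, `os.path.splitext` removes only the final extension, while splitting on the first dot cuts the name short. The file names are invented for the example.

```python
import os

# Invented file names; only the .txt ones are kept
names = ['example-001.txt', 'example.v2.txt', 'notes.md']

stems = [os.path.splitext(n)[0] for n in names if n.endswith('.txt')]
print(stems)                           # ['example-001', 'example.v2']

# Splitting on the first dot loses part of the name
print('example.v2.txt'.split('.')[0])  # example
```

Whichever form is used, the result has to equal the `record_id` value exactly for the file to be matched.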

### Detect Encoding
A lot of the original text files were created on different operating systems and ended up in competing encodings. This step tries to determine each file's encoding so that it can be opened correctly. Detection is a best guess, so it will not always be right.

In [23]:
def detect_encoding(filename):
    # Detector object for encoding detection
    detector = UniversalDetector()

    # Feed the file to the detector line by line and return its best guess at the encoding
    with open(filename + '.txt', 'rb') as infile:
        detector.reset()
        for line in infile:
            detector.feed(line)
            if detector.done:   # detection process ends automatically when confidence is high enough
                break
        detector.close()
        return detector.result['encoding']
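
Detection can come back empty-handed: `detector.result['encoding']` may be `None` when chardet cannot decide. As a hedged sketch of one possible fallback, which is not part of the original notebook, the helper below tries a short list of candidate encodings in order and returns the first one that decodes the bytes without error. The function name `guess_encoding` and the candidate list are assumptions for illustration. It uses only the standard library and is a much cruder technique than chardet's statistical detection.

```python
def guess_encoding(raw_bytes, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes raw_bytes, else None."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
        return name
    return None

print(guess_encoding('café'.encode('utf-8')))   # utf-8
print(guess_encoding('café'.encode('cp1252')))  # cp1252
```

Because `latin-1` accepts every byte sequence, it works only as a last resort: a successful decode there does not mean the text is actually Latin-1.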

### Grab Text
This pulls the whole text from a .txt file named by `list_filenames()` above, opening it with the encoding detected for that file.

In [24]:
def grab_text(filename, encoding):
    # Read-only mode is enough here; the file is rewritten later by write_metadata()
    with open(filename + '.txt', 'r', encoding=encoding) as infile:
        text = infile.read()
        return text

### Write Metadata
The text we just grabbed is put to use here, along with the CSV metadata row whose `record_id` matches the name of the file being modified. The file is rewritten as UTF-8 with the metadata header first and the original text after it.

In [25]:
def write_metadata(filename, row, text): 
    with open(filename + '.txt', 'w', encoding = "utf8") as outfile:
        # All this mess could probably be simplified with str.format()
        outfile.write('##title##\n\n')    
        outfile.write(row['title'] + '\n\n')
        outfile.write('##date##\n\n')
        outfile.write(row['date'] + '\n\n')
        outfile.write('##id##\n\n')
        outfile.write(row['record_id'] + '\n\n') 
        outfile.write('##description##\n\n')
        outfile.write(row['description'] + '\n\n')
        outfile.write('##source##\n\n')
        outfile.write(row['source'] + '\n\n')
        outfile.write('##type##\n\n')
        outfile.write(row['record_type'] + '\n\n')
        outfile.write('##participants##\n\n')
        outfile.write(row['participants'] + '\n\n')
        outfile.write('##format##\n\n')
        outfile.write(row['record_format'] + '\n\n')
        outfile.write('##url##\n\n')
        outfile.write(row['url'] + '\n\n\n')
        outfile.write('##content##\n\n')
        outfile.write(text)
        # No explicit close() needed: the with block closes the file
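
As the code comment suggests, the repeated `write` calls could be collapsed. One possible sketch, assuming the same heading order and spacing as above, keeps the (heading, CSV field) pairs in a list and builds the header string in a loop. `HEADER_FIELDS` and `build_header` are names invented for this example, and the sample row is made up.

```python
# (heading shown in the file, CSV field it comes from), in output order
HEADER_FIELDS = [
    ('title', 'title'), ('date', 'date'), ('id', 'record_id'),
    ('description', 'description'), ('source', 'source'),
    ('type', 'record_type'), ('participants', 'participants'),
    ('format', 'record_format'), ('url', 'url'),
]

def build_header(row):
    """Return the metadata header for one CSV row as a single string."""
    parts = ['##{}##\n\n{}\n\n'.format(label, row[field])
             for label, field in HEADER_FIELDS]
    # One extra newline follows the url value, then the content marker
    return ''.join(parts) + '\n##content##\n\n'

# Made-up row, for illustration only
sample_row = {
    'record_id': 'example-001', 'participants': 'Speaker A',
    'record_type': 'Interview', 'record_format': 'Transcript',
    'date': '2010-01-01', 'source': 'Example Source',
    'title': 'An example title', 'url': 'https://example.com/record',
    'description': 'A made-up description.',
}
header = build_header(sample_row)
print(header)
```

With a helper like this, the writing step would reduce to `outfile.write(build_header(row) + text)`.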

## Kick-off

In [26]:
def main():
    filenames = list_filenames()
    
    for row in create_metadata():
        if row['record_id'] in filenames:
            encoding = detect_encoding(row['record_id'])
            text = grab_text(row['record_id'], encoding)
            write_metadata(row['record_id'], row, text)

In [27]:
main()
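
`main()` only touches files that have a matching `record_id`, so a text file with no metadata row is skipped without any message. A small sketch for spotting those files uses set arithmetic; `find_unmatched` is a hypothetical helper, not part of the original notebook, and the ids below are made up.

```python
def find_unmatched(filenames, metadata):
    """Return the file stems that have no row with the same record_id."""
    record_ids = {row['record_id'] for row in metadata}
    return sorted(set(filenames) - record_ids)

# Made-up inputs shaped like list_filenames() and create_metadata() output
example_files = ['example-001', 'example-002', 'example-003']
example_rows = [{'record_id': 'example-001'}, {'record_id': 'example-003'}]

print(find_unmatched(example_files, example_rows))  # ['example-002']
```

In the notebook this could be called as `find_unmatched(list_filenames(), create_metadata())` before running `main()`.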