# ZF Full Text Search

*Eddie Chapman*

Tokenizes text from multiple text files and stores the results in a CSV file.

Part of The Zuckerberg Files project. 

Stop word lists: https://alir3z4.github.io/stop-words/

### Input: text files
- Multiple, located in the same folder
- Filenames: the file's corresponding Zuck Files record ID
- Contain metadata at the top of the file and transcript contents below
- Metadata fields, main content, and individual speakers are designated with double hashtags (`##title##`)
- We want to skip the metadata and tokenize everything after the `##content##` tag, except for speaker names

### Output: CSV
- Each row corresponds to a single tokenized text file
- `record_id` field holds the record ID, which is also the `.txt` filename without its extension
- `full_text` field contains the text file's contents, tokenized
- tokenized means:
    + Starting *below* the `##content##` tag in the text file
    + Skipping all info in hashtags, such as speaker names
    + Skipping all bracketed text
    + Removing stop words
    + Removing punctuation (except for symbols located within words such as apostrophes)
    + Joining all remaining words into a single string, separated by spaces
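
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the notebook's own code: the exact tag spelling (`##content##`), the sample transcript, the simplified token pattern, and the names `tokenize_transcript`, `TAG_RE`, and `BRACKET_RE` are all hypothetical.

```python
import re

CONTENT_TAG = "##content##"            # assumption: exact spelling of the content tag
TAG_RE = re.compile(r"##.*?##")        # anything wrapped in double hashtags (speaker names, etc.)
BRACKET_RE = re.compile(r"\[.*?\]")    # bracketed text such as [laughter]
TOKEN_RE = re.compile(r"[\w'\-]+")     # simplified: words, keeping inner apostrophes and hyphens


def tokenize_transcript(raw, stopwords):
    """Return the transcript body as a space-separated string of tokens."""
    # Keep only what follows the content tag; if the tag is missing, keep everything.
    _, found, body = raw.partition(CONTENT_TAG)
    if not found:
        body = raw
    body = TAG_RE.sub(" ", body)        # drop speaker names and other tags
    body = BRACKET_RE.sub(" ", body)    # drop bracketed text
    tokens = [t.lower() for t in TOKEN_RE.findall(body)]
    return " ".join(t for t in tokens if t not in stopwords)


# Made-up sample file contents, for illustration only.
sample = (
    "##title##\nSome Interview\n"
    "##content##\n"
    "##Mark Zuckerberg##\nWe're building it. [laughter] It's the future."
)
print(tokenize_transcript(sample, {"the", "it"}))
```

Removing tags and bracketed text before tokenizing keeps the token pattern simple, because it never has to recognize `#` or `[` itself.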



### Requirements

In [21]:
import csv
import os
import nltk
import re
from chardet.universaldetector import UniversalDetector
import codecs

### Where are your text files?

Modify the path below. Remember, Windows uses single backslashes to separate directories, but in a regular Python string each backslash has to be doubled.

In [22]:
os.chdir('C:\\Users\\chapman4\\Downloads\\zuck-text-ready-test\\text\\text-new')
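
As a small aside on path spelling, the sketch below shows three ways of writing the same Windows path in Python. The folder `C:\Users\example\texts` is a made-up placeholder, not a path from this project.

```python
from pathlib import PureWindowsPath

# Hypothetical folder, used only to compare spellings.
escaped = 'C:\\Users\\example\\texts'   # doubled backslashes in a regular string
raw = r'C:\Users\example\texts'         # raw string: backslashes are taken literally
forward = 'C:/Users/example/texts'      # forward slashes are also accepted on Windows

print(escaped == raw)                                      # same characters either way
print(PureWindowsPath(forward) == PureWindowsPath(raw))    # same path once parsed
```

A raw string is often the least error-prone choice when pasting a path copied from Windows Explorer.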

### Let's make a list of all of the text files.

In [23]:
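def list_filenames():
    # Record IDs are the .txt filenames without the extension.
    # stopwords.txt is read from this same working directory but is not a transcript, so skip it.
    filenames = [os.path.splitext(name)[0] for name in os.listdir(".")
                 if name.endswith(".txt") and name != "stopwords.txt"]
    return filenames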

### What's the output CSV going to be called?

In [24]:
FILE_FULL_TEXT = 'zuck_full_text.csv'

### Here's a list of stop words to ignore

In [25]:
def list_stopwords():
    # One stop word per line; a set keeps the membership checks in grab_text fast.
    with open('stopwords.txt', 'r') as infile:
        stopwords = {line.strip() for line in infile if line.strip()}
    return stopwords
    

### How are those text files encoded?

The text files come in several encodings, so this tries to detect each file's encoding before reading it. I'm not sure how much it really helps.

In [26]:
# Detect a file's encoding with chardet's incremental detector
def detect_encoding(filename):
    detector = UniversalDetector()
    with open(filename + '.txt', 'rb') as infile:
        for line in infile:
            detector.feed(line)
            if detector.done:   # detection stops early once confidence is high enough
                break
    detector.close()
    # chardet reports None when it cannot guess; fall back to UTF-8 in that case
    return detector.result['encoding'] or 'utf-8'
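
If detection turns out to be unreliable, one alternative is to skip detection and try a short list of likely encodings in order. The sketch below is that swapped-in technique, not what the notebook does; the function name `decode_with_fallback` and the candidate list (`utf-8`, then `cp1252`) are assumptions.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first candidate encoding that works.

    Returns the decoded text together with the encoding that was used.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return data.decode("latin-1"), "latin-1"


text_utf8, used_utf8 = decode_with_fallback("café".encode("utf-8"))
text_cp, used_cp = decode_with_fallback(b"caf\xe9")   # not valid UTF-8
print(used_utf8, used_cp)
```

The order of the candidates matters: UTF-8 goes first because it rejects most byte sequences written in other encodings, whereas the single-byte encodings accept nearly anything.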

In [27]:
def grab_text(filenames, stopwords): 
    rows = []
    for filename in filenames:
        encoding = detect_encoding(filename)
        with codecs.open(filename + '.txt', 'r', encoding) as infile: 
            row = {}
            text = infile.read()
            # Raw string so the regex escapes are not treated as string escapes
            tokens_nonnormal = re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", text)
            tokens_normal = [t.lower() for t in tokens_nonnormal]
            tokens_minus_stop = [t for t in tokens_normal if t not in stopwords]
            tokens = ' '.join(tokens_minus_stop)
            row['record_id'] = filename
            row['full_text'] = tokens
            rows.append(row)
    return rows
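
To see what the token pattern in `grab_text` does on its own, here is a small standalone sketch. The pattern is the one used above; the helper name `tokenize`, the sample sentence, and the tiny stop word set are made up for illustration.

```python
import re

# Same pattern as in grab_text, written as a raw string.
TOKEN_PATTERN = r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+"


def tokenize(text, stopwords):
    """Find tokens, lowercase them, and drop stop words."""
    tokens = [t.lower() for t in re.findall(TOKEN_PATTERN, text)]
    return [t for t in tokens if t not in stopwords]


print(tokenize("Mark's well-known FAQ is the best", {"is", "the"}))
print(tokenize("FaceBook", set()))   # a CamelCase word is split at the inner capital
```

Apostrophes and hyphens inside words are kept, while surrounding punctuation is dropped because it never matches the pattern. As written, `grab_text` applies the pattern to the entire file text.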

In [28]:
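# Write the rows to a CSV file
def write_csv(filename, rows):
    # newline='' stops the csv module from writing blank lines between rows on Windows
    with open(filename, 'w', encoding='utf-8', newline='') as csv_file:
        col_names = ['record_id', 'full_text']
        writer = csv.DictWriter(csv_file, col_names)
        writer.writeheader()
        writer.writerows(rows)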
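
For a quick look at the resulting CSV layout, the sketch below writes one row to an in-memory buffer and reads it back. The record ID `0001` and its text are invented sample values, and the buffer stands in for the real output file.

```python
import csv
import io

# Invented sample row with the same two fields the notebook writes.
rows = [{"record_id": "0001", "full_text": "we're building future"}]

buffer = io.StringIO(newline="")   # in-memory stand-in for the output file
writer = csv.DictWriter(buffer, fieldnames=["record_id", "full_text"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["record_id"], "->", read_back[0]["full_text"])
```

Reading the file back with `csv.DictReader` is also a handy check that every text file ended up as exactly one row.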

In [29]:
def main():
    filenames = list_filenames()
    stopwords = list_stopwords()
    rows = grab_text(filenames, stopwords)
    write_csv(FILE_FULL_TEXT, rows)

In [30]:
main()