# Parse Factiva metadata
This notebook parses RTF files downloaded from Factiva in batches of up to 100 articles at a time, and creates a single CSV file with all of the metadata and full text from those articles.

### Download RTF files from Factiva
1. To begin, search for your content on Factiva. 
2. In the Factiva search results, select the checkbox next to "Headlines" to choose the first 100 articles from your results. (You can also select individual articles of interest, up to 100.)

![select all of the search results](fact_headlines.png)

3. Choose the RTF button, and then "Article Format" to save those 100 articles into a single RTF file. (This works best if you create a new folder for all of the RTF files you are going to download. That folder should not contain any other content.)

![article format option from RTF](fact_art_format.png)

4. Select the "Next 100" link to view the next 100 search results in Factiva.
5. Select "Clear" to unselect the previous list of 100, and repeat steps 2-3 to save the current page of results. Continue for every page of results you want to capture.

### Convert RTF files to .txt
Once you've downloaded all of the articles you want, open a shell (Terminal on a Mac) and, from the directory that contains your RTF downloads, run the following command to create a .txt copy of each .rtf file. Note that `textutil` is a macOS utility, so this step assumes you are working on a Mac:

```
textutil -convert txt *.rtf
```

You can also run the command from another working directory by giving the path to your RTF downloads folder. For example:

```
textutil -convert txt Desktop/Factiva/*.rtf
```
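
If you would rather stay inside the notebook, the same conversion can be sketched in Python with `subprocess`. This is a minimal sketch, not part of the original workflow: it assumes macOS (where `textutil` is available), and the helper names `build_textutil_command` and `convert_rtf_folder` are made up for illustration. Because no shell is involved, the `*.rtf` wildcard is expanded with `glob` instead.

```python
import glob
import os
import subprocess


def build_textutil_command(folder):
    """Return the textutil argument list for every .rtf file in folder (hypothetical helper)."""
    rtf_files = sorted(glob.glob(os.path.join(folder, "*.rtf")))
    return ["textutil", "-convert", "txt"] + rtf_files


def convert_rtf_folder(folder):
    """Run textutil on the folder and return the number of files passed to it.

    Assumes macOS; on other systems the textutil call will fail.
    """
    command = build_textutil_command(folder)
    if len(command) == 3:
        # No .rtf files were found, so there is nothing to convert
        return 0
    subprocess.run(command, check=True)
    return len(command) - 3
```

Calling `convert_rtf_folder('Desktop/Factiva')` would then play the same role as the second shell command above.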

### Parse .txt files

Now you should have a single folder that contains both the .rtf files and a .txt copy of each one. The next section includes Python code that will parse all of those .txt files into a single .csv file that includes the complete metadata and full text from the articles. Each row of the .csv will represent a single article.
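
Before parsing, it can help to confirm that the conversion covered every download. The following is a small sketch of such a check; the function name `missing_txt_copies` is an assumption made for this example and is not used elsewhere in the notebook.

```python
import glob
import os


def missing_txt_copies(folder):
    """List the .rtf files in folder that have no .txt copy next to them."""
    missing = []
    for rtf_path in sorted(glob.glob(os.path.join(folder, "*.rtf"))):
        # Swap the .rtf extension for .txt and check that the file exists
        txt_path = os.path.splitext(rtf_path)[0] + ".txt"
        if not os.path.exists(txt_path):
            missing.append(os.path.basename(rtf_path))
    return missing
```

An empty list means every .rtf file has a matching .txt file; any names returned are files that still need to be converted.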

In [None]:
## Step 1: import Python libraries required to run code below
import re
import csv
import glob 

In [None]:
#sep is a variable we will use to separate each article.
#In the .rtf downloads there is a form feed character (hexadecimal 0x0C) between each article, which skips to the next page.
#We can represent that character here with the escape sequence \f
sep = '\f'

#choose which fields you would like to work with (see the full list below)
#fieldnames = ['SE', 'HD', 'WC', 'PD', 'SN', 'SC', 'LA', 'CY', 'LP', 'CO', 'TD', 'NS', 'RE', 'PUB', 'AN']
fieldnames = ['SE', 'HD', 'WC', 'PD', 'SN', 'SC', 'LA', 'CY', 'NS', 'CO', 'RE', 'PUB', 'AN', 'LP', 'TD' ]
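
To see what the separator does, here is a minimal sketch that splits a made-up string on the form feed character. The sample text is invented for illustration and is much simpler than a real converted Factiva file.

```python
sep = '\f'

# Made-up text standing in for a converted .txt file that holds three articles
sample_text = "HD\nFirst headline\fHD\nSecond headline\f\nHD\nThird headline\n"

# Split on the form feed, trim whitespace, and drop any empty pieces
articles = [piece.strip() for piece in sample_text.split(sep) if piece.strip()]

print(len(articles))
print(articles[1])
```

Each element of `articles` is the text of one article, which is the unit the parsing function further down works on.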

#### Factiva field codes:

- AN = Accession Number
- ART = Captions, description of graphics
- BY = Author
- CLM = Column
- CO = Dow Jones Ticker Symbol/Company Code Name
- CR = Credit
- CT = Contact 
- CX = Correction
- DLN = Dateline
- ED = Edition
- HD = Headline
- IN = Industry Code:Descriptor
- LA = Language
- LP = Lead Paragraph
- NS = Subject Code:Descriptor
- PG = Page
- PUB = Publisher Name
- RE = Region Code:Descriptor
- RF = Reference
- RST = Source Restrictor Code
- SC = Source Code
- SE = Section
- SN = Source Name
- TD = Text following lead paragraphs
- VOL = Volume
- WC = Word Count
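
If you prefer descriptive column names over two or three letter codes, the list above can be turned into a lookup table. This is a sketch only: `FIELD_LABELS` and `readable_headers` are names invented for this example, and the table covers just the codes described in the list above. Codes that are not in the table (such as PD and CY, which appear in `fieldnames` but are not described in the list) are returned unchanged.

```python
# Subset of the field codes listed above, mapped to their descriptions
FIELD_LABELS = {
    'AN': 'Accession Number',
    'CO': 'Dow Jones Ticker Symbol/Company Code Name',
    'HD': 'Headline',
    'LA': 'Language',
    'LP': 'Lead Paragraph',
    'NS': 'Subject Code:Descriptor',
    'PUB': 'Publisher Name',
    'RE': 'Region Code:Descriptor',
    'SC': 'Source Code',
    'SE': 'Section',
    'SN': 'Source Name',
    'TD': 'Text following lead paragraphs',
    'WC': 'Word Count',
}


def readable_headers(codes):
    """Replace each known field code with its description; keep unknown codes as they are."""
    return [FIELD_LABELS.get(code, code) for code in codes]


print(readable_headers(['HD', 'PD', 'AN']))
```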

In [None]:
#this function processes the files and takes the path to a folder full of .txt files you want to process as its argument
def factiva_to_csv(path):
    #create a new csv
    with open('factiva_metadata.csv', 'w', newline='', encoding='utf-8') as csvfile:
        
        # write the fieldnames defined above as the headers of your csv 
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        
        #cycle through every text file in the path given in the function's argument
        files_all = glob.iglob(path + "*.txt")
        for k, filename in enumerate(files_all):

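            #remove the path, whitespace, and '.txt' from filename to later use when printing output
            #(slice by length: str.strip(path) would remove any leading or trailing characters found in path, not the path prefix itself)
            file_id = filename[len(path):-4].strip()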
            
            #open the files
            with open(filename, 'r', encoding='utf-8') as in_file:
                # text var for string of all docs
                text = in_file.read()
                
                #remove the "Search Summary" from the end of each document
                search_sum = '\nSearch Summary\n'
                drop_search_sum = re.split(search_sum, text)
                text = drop_search_sum[0]
                
                # split string by separator into single articles
                docs = re.split(sep, text)

                # loop through every doc to collect metadata and full text
                for i, doc in enumerate(docs):

                    # remove white space from beginning and end of each article
                    doc = doc.strip()

                    # skip any empty docs
                    if doc=="":
                        continue

                    #create an empty dictionary that will later contain metadata keys (fieldnames) and the content for each metadata field
                    metadata_dict = {}
                    
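                    #this regular expression looks for the 2 or 3 character field codes in the document:
                    #three whitespace characters followed by 2-3 capital letters (a raw string keeps \s from being read as a string escape)
                    regex = r'(\s\s\s[A-Z]{2,3})'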
                    
                    #split up each document based on the 2-3 char metadata field codes
                    splits = re.split(regex, doc)

                    #create variables to hold metadata and content
                    key, value = '', ''
                    
                    #cycle through each metadata element
                    for split in splits:
                       
                        #check for the SE field, which doesn't follow the same syntax as other fields
                        if re.match('^SE', split):
                            key = 'SE'
                            #print("SE")
                            #slice off the leading 'SE' code; str.strip('SE\n') would also remove any S or E at either end of the section name
                            value = split[2:].strip()
                        
                        #if we match a 2-3 char code, assign it to key
                        elif re.match(regex, split):
                            key = split.strip()
                            #print('key=', key)
                        
                        #if we don't match a 2-3 char code assign it to value
                        else:
                            value = split.strip()
                            #print('value=',value)
                        
                        #only add keys and values to our dictionary if they match the fieldnames we chose above
        
                        if key in fieldnames:
                            metadata_dict[key] = value
                    
                    #write each row to the csv containing all of the values that match existing fieldnames/keys
                    writer.writerow(metadata_dict)
                    
                #output to let us know the .txt files that are being processed
                print("Writing", file_id)                 
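
The core of the function is the split on field codes. The sketch below isolates that step for a single article so it can be tried on its own. It is not a replacement for the function above: the name `parse_article` and the sample text are invented for this example, and the real layout of a converted Factiva file may differ from the sample. It relies on the fact that `re.split` with a capturing group returns alternating pieces of text and matched codes.

```python
import re

# Three whitespace characters followed by a 2-3 letter uppercase field code
FIELD_CODE = re.compile(r'(\s\s\s[A-Z]{2,3})')


def parse_article(doc, wanted):
    """Return {field code: text} for one article, keeping only the codes in wanted."""
    parts = FIELD_CODE.split(doc.strip())
    metadata = {}
    # The first piece comes before any matched code; in this sample it holds the SE field
    first = parts[0]
    if first.startswith('SE') and 'SE' in wanted:
        metadata['SE'] = first[2:].strip()
    # After the first piece, the list alternates: code, text, code, text, ...
    for code, value in zip(parts[1::2], parts[2::2]):
        code = code.strip()
        if code in wanted:
            metadata[code] = value.strip()
    return metadata


# Made-up article text for illustration only
sample = ("SE\nSports\n\n\nHD\nLocal team wins\n\n\nWC\n250 words"
          "\n\n\nLP\nThe home side won on Saturday.")
print(parse_article(sample, ['SE', 'HD', 'WC', 'LP']))
```

Pairing each code with the text that follows it means a value can never be stored under the wrong key, at the cost of treating the SE field as a special case up front.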

In [None]:
#run the function by pointing to the path to a folder full of .txt files you want to process
#(keep the trailing slash: the function appends '*.txt' directly to this path)
factiva_to_csv('rtf/')

### Note: CSV files and full text fields
The full text from Factiva is divided into two fields:
* LP = Lead Paragraph
* TD = Text following lead paragraphs
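
If you need the article body as one piece of text, the two fields can be joined after reading the CSV back in. The following is a minimal sketch under the assumption that both `LP` and `TD` were included in `fieldnames`; the helper name `full_text` and the sample CSV content are invented for this example.

```python
import csv
import io


def full_text(row):
    """Join the lead paragraph (LP) and the remaining text (TD) of one CSV row."""
    parts = [row.get('LP') or '', row.get('TD') or '']
    # Skip a field that is empty or missing so no stray blank lines are added
    return '\n\n'.join(part.strip() for part in parts if part.strip())


# Made-up CSV content standing in for factiva_metadata.csv
sample_csv = (
    'HD,LP,TD\r\n'
    'Local team wins,The home side won.,Fans celebrated downtown.\r\n'
    'Short item,Only a lead paragraph.,\r\n'
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
texts = [full_text(row) for row in rows]
print(texts[0])
```

To use it on the real output, open `factiva_metadata.csv` with `newline=''` and `encoding='utf-8'` and pass the file object to `csv.DictReader` in place of the `io.StringIO` sample.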

*Important:* At times character encoding anomalies will cause a full-text field to break up into new rows when importing CSV files into Microsoft Excel. You can import the CSV 
