## Parse ProQuest Metadata
This notebook includes a Python function that parses newspaper articles downloaded from ProQuest Global Newsstream into a single CSV file containing metadata and full text (when full text is available).

#### Download PQ files to use as input
The script below takes as input .txt file downloads available via ProQuest Global Newsstream. These are available from ProQuest in batches of up to 100 articles per file. To save those files from your ProQuest search results:
1. Select each article you want to save (or select all results on page) using result checkboxes.
2. From the *...* button dropdown, select "TXT - Text Only".

    ![Screenshot of saving results from ProQuest](imgs/pq_save_feb20.png "Save results from PQ")
3. Accept all defaults and continue to save .txt file of bundled article downloads.
4. Save the downloaded .txt file (or files) to a folder in the same directory as this notebook. In the example below, that folder is called "txt_input", but you can use any folder name as long as you pass the same path to the function when you run it.
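
Before parsing, it can help to confirm that the folder really contains the downloads. The snippet below is a minimal sketch: `find_downloads` is a hypothetical helper (not part of this notebook), and it builds a throwaway `txt_input` folder only so the example runs on its own.

```python
import glob
import os
import tempfile

def find_downloads(folder):
    """Return the sorted paths of the .txt files directly inside `folder`."""
    return sorted(glob.glob(os.path.join(folder, "*.txt")))

# Demonstration on a temporary folder standing in for "txt_input"
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "txt_input")
    os.makedirs(folder)
    for name in ("1.txt", "2.txt", "notes.md"):
        # create empty placeholder files; only the .txt ones should be found
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in find_downloads(folder)]

print(found)
```

With your own data, calling `find_downloads("txt_input")` should list one path per file you saved from ProQuest.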

In [4]:
## Step 1: import libraries required to run Python code
import os
import re
import sys
import csv
import glob

### Collect metadata
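We need to tell the script which fields to collect from the .txt files. You can inspect the text files yourself to look for field names at the beginning of new lines, such as `Title: ` or `Publication year: `, and then add them to the list variable called `fieldnames` below. The names must match the labels in your downloaded files exactly, including language and accents: the cell below uses French labels (for example `Titre` rather than `Title`) because the files it was run on carry French labels. Here is a list of the English field names available in the ProQuest text downloads as of July 2019: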

`'Title', 'Publication title', 'Publication year', 'Document URL', 'Full text', 'Links', 'Section', 'Publication subject', 'ISSN', 'Copyright', 'Abstract', 'Publication info', 'Last updated', 'Place of publication', 'Location', 'Author', 'Publisher', 'Identifier / keyword', 'Source type', 'ProQuest document ID', 'Country of publication', 'Language of publication', 'Publication date', 'Subject', 'Database', 'Document type'`
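
Rather than reading the files by eye, you could list the labels that actually occur at the start of lines. This is a rough sketch, not part of the original notebook: `list_field_names` and `FIELD_PATTERN` are hypothetical names, the 60-character cap on label length is an arbitrary assumption, and the sample text is invented for illustration.

```python
import re
from collections import Counter

# A label is the text before the first ": " at the start of a line,
# e.g. "Publication year: 2019". The length cap (60) is an assumption
# meant to skip ordinary sentences that happen to contain a colon.
FIELD_PATTERN = re.compile(r'^([^:\n]{1,60}): ', flags=re.MULTILINE)

def list_field_names(text):
    """Count candidate field labels found at the start of lines in `text`."""
    return Counter(match.group(1).strip() for match in FIELD_PATTERN.finditer(text))

# Invented sample imitating the layout of a ProQuest text download
sample = (
    "Title: Example headline\n\n"
    "Publication year: 2019\n\n"
    "Full text: Body of the article: with a colon inside.\n\n"
    "Title: Another headline\n"
)

field_counts = list_field_names(sample)
print(field_counts.most_common())
```

Labels that appear about once per article are good candidates for `fieldnames`; labels that appear only once or twice in a whole file are more likely sentences from the article body.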

In [5]:
fieldnames = ['Titre','Année de publication','Texte intégral','Publication', 'Lieu de publication', 'Auteur', 'Éditeur', 'Date de publication', 'Sujet', 'Société / organisation']

### Defining the function
The function below:
1. Accepts a path to a directory of .txt files you want to process as its first argument and the path of the CSV file to write as its second (e.g., `parsePQ("txt_input/", "pq_metadata.csv")`).
2. Creates the CSV file at that output path and writes `fieldnames` as its header row.
3. Cycles through every text file in the directory you identified in step 1.
4. Splits each file into individual articles using the `sep` separator.
5. Splits each article into blocks wherever there is a blank line.
6. For each block, checks whether it begins with one of the `fieldnames` followed by a colon and a space.
7. Saves the fieldname and following content (values) to a dictionary, `metadata_dict`.
8. Writes the fieldnames (keys) and content (values) of the `metadata_dict` for each article to its own row in the CSV.
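
The splitting and matching steps can be tried on a small in-memory string before running the full function. The sketch below is a simplified stand-in, not the notebook's function: `parse_articles` is a hypothetical name, `SEP` assumes a separator of 60 underscores, and the sample text is invented.

```python
SEP = "_" * 60  # assumed separator: a long run of underscores between articles

def parse_articles(text, wanted_fields):
    """Return one dict of selected metadata per article found in `text`."""
    rows = []
    # the chunk before the first separator and the chunk after the last one
    # are not articles, so they are dropped
    for doc in text.split(SEP)[1:-1]:
        doc = doc.strip()
        if not doc:
            continue
        row = {}
        # blocks are separated by blank lines; the first block is the bare title
        for block in doc.split("\n\n")[1:]:
            key, found, value = block.partition(": ")
            if found and key in wanted_fields:
                row[key] = value
        rows.append(row)
    return rows

# Invented sample with two articles and a trailing notice
sample = (
    "\n" + SEP + "\n\n"
    "Example headline\n\n"
    "Author: Doe, Jane\n\n"
    "Publication year: 2019\n\n"
    "Full text: First paragraph: with a colon.\n\n"
    + SEP + "\n\n"
    "Second headline\n\n"
    "Publication year: 2020\n\n"
    + SEP + "\n\nContact info and copyright notice\n"
)

rows = parse_articles(sample, ["Author", "Publication year", "Full text"])
for row in rows:
    print(row)
```

As in the function below, each field keeps only the text up to the next blank line, and a missing field simply leaves that cell empty in the CSV.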

In [6]:
sep = "____________________________________________________________"

## function that takes a directory of .txt files from ProQuest as input
def parsePQ(path, output_path, file_output='csv'):
    '''Parse text file downloads from ProQuest into metadata and full text.

    Reads every .txt file in `path` (downloads from ProQuest Global Newsstream) and
    writes a CSV file to `output_path` with the selected metadata and the full text
    of each article as a separate row. Relies on the notebook-level variables
    `fieldnames` (columns to keep) and `sep` (article separator).
    Optional: set file_output='txt' to also save individual articles as .txt files in output/

    Parameters
    ----------
    path : str
        path to the directory of .txt files that will be parsed (e.g., 'txt_input/')
    output_path : str
        path of the CSV file to create (e.g., 'pq_metadata.csv')
    file_output : str (default = 'csv')
        change to file_output='txt' to return individual articles as text files AND a CSV file for all;
        default behavior only outputs the CSV file
    '''
    # open a csv file to write metadata to
    with open(output_path, 'w', newline='', encoding='utf-8-sig') as csvfile:
        # add fieldnames as header
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        #cycle through every text file in the directory given as an argument
        files_all = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(".txt")]
        
        print(list(files_all))

        for filename in files_all:

            #keep only the base name without the directory or '.txt' to use when printing output
            #(str.strip(path) would remove matching characters, not the path prefix)
            file_id = os.path.splitext(os.path.basename(filename))[0]

            with open(filename, 'r', encoding='utf-8-sig') as in_file:
                # text var for string of all docs
                text = in_file.read()

                # split string by separator into single articles
                docs = re.split(sep, text)

                # remove first and last items from docs list: first item is empty string; last is copyright info
                docs = docs[1:-1]

                # loop through every doc to collect metadata and full text
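                for i, doc in enumerate(docs):
                    # remove white space from beginning and end of each article
                    doc = doc.strip()

                    # skip any empty docs (checked first so no empty file is left open)
                    if doc == "":
                        continue

                    # optionally save the article as its own text file in output/
                    if file_output == 'txt':
                        # the underscore keeps e.g. file "1", article 12 apart from file "11", article 2
                        new_file = os.path.join('output', file_id + '_' + str(i) + '.txt')
                        with open(new_file, 'w', encoding='utf-8') as txt_file:
                            txt_file.write(doc)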

                    # split doc into blocks on every blank line
                    metadata_lines = doc.split('\n\n')

                    #remove first "line" from article which is the article title without any field title
                    metadata_lines = metadata_lines[1:]

                    #declare a new dictionary
                    metadata_dict = {}

                    #for each element add the fieldname/key and following value to a dictionary
                    for line in metadata_lines:

                        #ignore lines that do not have a field beginning "Xxxxxx:" (e.g. "Publication title: ")
                        if not re.match(r'^[^:]+: ', line):
                            continue
                        #looks for beginning of new line following structure of "Publication year: " splitting on the colon space
                        (key,value) = line.split(sep=': ', maxsplit=1)

                        #only add to dictionary if the key is in fieldnames
                        if key in fieldnames:
                            metadata_dict[key] = value

                    #write the dictionary values to new row in csv
                    writer.writerow(metadata_dict)
            print("Writing", file_id)

### Running the script
Running the cell above loads the function, but doesn't do anything yet.
To *execute* the function, run the cell below, replacing the input folder and the output CSV path with your own.

Add a third argument, `file_output='txt'`, if you would like to return individual articles as text files as well as the CSV of the full parsed data. In this case make sure there is an ```output/``` directory in the notebook's working directory to store the text files.

```parsePQ("txt_input/", "pq_metadata.csv", file_output='txt')```
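
The function writes the individual articles into `output/` but does not create that folder itself. One way to prepare it from Python is sketched below; `ensure_output_dir` is a hypothetical helper, and the temporary directory is used only so the example does not touch your files.

```python
import os
import tempfile

def ensure_output_dir(base_dir=".", name="output"):
    """Create `base_dir/name` if it is missing and return its path."""
    out_dir = os.path.join(base_dir, name)
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    return out_dir

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_output_dir(tmp)
    created_exists = os.path.isdir(created)

print(created_exists)
```

In the notebook itself, calling `ensure_output_dir()` with no arguments before `parsePQ(..., file_output='txt')` would create `output/` in the current working directory.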

In [14]:
parti = "GPC"

In [15]:
#run the script
parsePQ(f"/home/beetho/Downloads/test_scrap_results/Canada Policy/{parti}_old", f"/home/beetho/Downloads/test_scrap_results/{parti}.csv")

['/home/beetho/Downloads/test_scrap_results/Canada Policy/GPC_old/2.txt', '/home/beetho/Downloads/test_scrap_results/Canada Policy/GPC_old/1.txt']
Writing 2
Writing 1


### Et voila!
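You should now see the CSV file at the output path you passed to `parsePQ` (in the run above, `GPC.csv`).

The remaining cells load the CSV files produced for each party, label each with a `Parti` column, and merge them into one de-duplicated file.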

In [16]:
import pandas as pd

In [17]:
bq = pd.read_csv("/home/beetho/Downloads/test_scrap_results/BQ.csv")

In [18]:
len(bq)

41

In [19]:
pcc = pd.read_csv("/home/beetho/Downloads/test_scrap_results/PCC.csv")
ndp = pd.read_csv("/home/beetho/Downloads/test_scrap_results/NDP.csv")
lpc = pd.read_csv("/home/beetho/Downloads/test_scrap_results/LPC.csv")
gpc = pd.read_csv("/home/beetho/Downloads/test_scrap_results/GPC.csv")



In [20]:
bq["Parti"] = "BQ"
pcc["Parti"] = "PCC"
lpc["Parti"] = "LPC"
ndp["Parti"] = "NDP"
gpc["Parti"] = "GPC"

In [22]:
all_parties = pd.concat([bq, pcc, lpc, ndp, gpc], ignore_index=True)

In [24]:
all_parties.drop_duplicates(inplace=True)

In [25]:
all_parties.columns

Index(['Titre', 'Année de publication', 'Texte intégral', 'Publication',
       'Lieu de publication', 'Auteur', 'Éditeur', 'Date de publication',
       'Sujet', 'Société / organisation', 'Parti'],
      dtype='object')

In [37]:
all_parties.drop_duplicates(subset=["Titre","Texte intégral","Date de publication","Parti"], inplace=True)

In [38]:
# Reset index after dropping duplicates
all_parties.reset_index(drop=True, inplace=True)

In [39]:
len(all_parties)

4462

In [40]:
all_parties.to_csv("/home/beetho/Downloads/test_scrap_results/all_parties.csv", index=False, encoding='utf-8-sig')
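
The tagging, concatenation, and de-duplication cells above could also be folded into one helper. This is only a sketch under stated assumptions: `combine_parties` and `DEDUP_COLUMNS` are hypothetical names, and the small in-memory DataFrames stand in for the per-party CSV files.

```python
import pandas as pd

# Columns used to decide whether two rows are the same article for the same party
DEDUP_COLUMNS = ["Titre", "Texte intégral", "Date de publication", "Parti"]

def combine_parties(frames):
    """Tag each DataFrame with its party code, stack them, and drop repeats."""
    tagged = [df.assign(Parti=party) for party, df in frames.items()]
    combined = pd.concat(tagged, ignore_index=True)
    combined = combined.drop_duplicates(subset=DEDUP_COLUMNS)
    return combined.reset_index(drop=True)

# Invented rows for illustration
article = {"Titre": "A", "Texte intégral": "text", "Date de publication": "2021-01-01"}
frames = {
    "BQ": pd.DataFrame([article, article]),  # same article exported twice
    "GPC": pd.DataFrame([article]),          # same article found for another party
}

combined = combine_parties(frames)
print(len(combined))
```

Because `Parti` is part of the subset, an article that appears in the results for two parties is kept once per party, while repeats within one party are dropped.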