**For INFM 603 students:** This notebook illustrates string processing operations in Python.

**Summary:** We'll first load a file of State Department cables (i.e., telegram messages) from the 1970s that the National Archives and Records Administration makes available.  We'll then use Python to find some interesting things in that collection.


**Task:** Build a travel record for Secretary of State Henry Kissinger for the messages in at least 20 of the files (each of which contains the messages for a month).  To do this, first **find the messages sent by Kissinger**. For each such message, **print the date and the address of the sending organization to a file** named Kissinger_Travel.txt, in **increasing date order** (i.e., oldest messages first).  Examining this file, you should be able to see where Kissinger was visiting each time he sent a message (although the abbreviations used in the sender's addresses might be a bit cryptic).

The messages in the CFPF.PU files are (intentionally) missing the msgtext field, but the messages in the CFPF.TEL files do have that field.  So I suggest focusing on the files that start with CFPF.TEL.


A useful heuristic is that **the first line that starts with "FM " is the from address.**  Note, however, that the message format sometimes contains errors (for example, I suppose there might be a space before FM in some cases).  It should be easy to check whether every message has exactly one line that starts with "FM ". If so, that line is very likely the from address; if not, inspecting the message will likely indicate what went wrong in that case.
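As a minimal sketch of that check, the snippet below collects every line that starts with "FM " (tolerating leading spaces) and reports whether there is exactly one. The helper name `from_lines` and the sample message are hypothetical, made up for illustration; real cables may be laid out differently.

```python
def from_lines(msgtext):
    """Return every line that starts with 'FM ', ignoring leading spaces."""
    return [line.strip() for line in msgtext.splitlines()
            if line.lstrip().upper().startswith("FM ")]

# Hypothetical sample message, loosely modeled on a cable header
sample = "\n".join([
    "CONFIDENTIAL",
    "FM SECSTATE WASHDC",
    "TO AMEMBASSY LONDON",
    "SUBJECT: EXAMPLE",
])

candidates = from_lines(sample)
if len(candidates) == 1:
    # Drop the leading "FM " to keep only the address
    print("From address:", candidates[0][3:])
else:
    print("Inspect this message:", len(candidates), "candidate lines")
```

Running the same check over every message and counting how many land in the `else` branch gives a quick sense of how often the heuristic needs manual inspection.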

A few people have asked (directly or indirectly) **how to "find the messages sent by Kissinger"**.  Here are two approaches people have tried, followed by a note on dates:

1. When the message was sent on Kissinger's behalf (Kissinger didn't actually type these -- someone else did) then the message (in the msgtext field) will say, near the end "Kissinger".  In my own code I just looked for any message that had Kissinger in the last 10 tokens, and that seemed to work fine (I printed out many such cases, and they all looked like Kissinger used as a signature, not as someone mentioned in the message).

2. In the PU files there is a <from> tag that sometimes indicates that Kissinger was the author of the message, but in the TEL files that trick does not seem to work (because the <from> tag always seems to indicate the sending organization in the TEL files).  If you do want to use the metadata rather than the text of the messages, just be sure you look at enough examples to understand how those metadata fields were populated.

3. According to Wikipedia, Kissinger became Secretary of State on September 22, 1973.  You might expect to find few (or perhaps even no) messages signed by him from before that date (he was previously the National Security Advisor, so he may have sent some messages before that date).  This should not be an issue for automatic processing, but it may be something you will want to know if you are choosing which messages to examine.
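The signature check from the first approach might look roughly like the sketch below. The function name `signed_by`, the punctuation set, and the sample text are assumptions made for illustration; only the "last 10 tokens" window comes from the description above.

```python
def signed_by(msgtext, name="kissinger", window=10):
    """Return True if `name` appears among the last `window` tokens."""
    # Lowercase each token and trim surrounding punctuation
    tokens = [t.casefold().strip(".,:;*'()-") for t in msgtext.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    return name in tokens[-window:]

# Hypothetical message bodies
signed = "FM SECSTATE WASHDC\nPLEASE ADVISE.\nKISSINGER\nCONFIDENTIAL\nNNN"
mentioned = "KISSINGER WILL ARRIVE " + "WORD " * 20

print(signed_by(signed))     # signature near the end
print(signed_by(mentioned))  # name only appears early in the text
```

Printing a sample of the matching messages, as suggested above, is still the best way to confirm that the matches really are signatures.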

First we need to download the collection.  We'll do this by using the Unix wget command to download a zip file, and then the Unix unzip command.  To use a Unix command in a Colab notebook, just precede it with a !

In [None]:
!wget https://users.umiacs.umd.edu/~oard/cables.zip
print('starting unzip')
!unzip -u -q cables.zip
print('unzip complete, files stored in cables/')

--2022-09-13 22:06:12--  https://users.umiacs.umd.edu/~oard/cables.zip
Resolving users.umiacs.umd.edu (users.umiacs.umd.edu)... 128.8.120.33
Connecting to users.umiacs.umd.edu (users.umiacs.umd.edu)|128.8.120.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 509664391 (486M) [application/zip]
Saving to: ‘cables.zip.2’


2022-09-13 22:06:30 (27.2 MB/s) - ‘cables.zip.2’ saved [509664391/509664391]

starting unzip
unzip complete, files stored in cables/


Now let's open one file from the collection. Each file contains multiple messages; each message is a document enclosed in `<sasdoc></sasdoc>` tags, and inside each document are the `<msgtext>` and `<subject>` tags. Look at some of the subject lines to get a sense for what's in the collection.
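To get a feel for that structure before parsing a full file, here is a small sketch that parses a made-up XML string with `xml.etree.ElementTree` and lists the subject lines. The sample document, including the `<cables>` root element, is hypothetical; a real file would be loaded with `ET.parse(...)` rather than `ET.fromstring(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two documents; the second has no <msgtext>
sample_xml = """<cables>
  <sasdoc>
    <date>07 DEC 1974</date>
    <subject>EXAMPLE SUBJECT ONE</subject>
    <msgtext>FM SECSTATE WASHDC
TO AMEMBASSY LONDON
KISSINGER</msgtext>
  </sasdoc>
  <sasdoc>
    <date>08 DEC 1974</date>
    <subject>EXAMPLE SUBJECT TWO</subject>
  </sasdoc>
</cables>"""

root = ET.fromstring(sample_xml)
subjects = []
for doc in root.iter("sasdoc"):
    # findtext returns the default when the tag is absent
    subject = doc.findtext(".//subject", default="(no subject)")
    has_text = doc.find(".//msgtext") is not None
    subjects.append((subject, has_text))
    print(subject, "| has msgtext:", has_text)
```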

In [None]:
import xml.etree.ElementTree as ET
import random


We'll make a list of lists.  The outer list will have one entry per message; the inner list will have one entry per token. 

Here, we split each message into lines and each line into tokens. Each line becomes a list of tokens, each message becomes a list of those lines, and each message is then added to the final list of all messages. We also read the date from the metadata and keep track of both the tokens and the dates.
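The nesting is easier to see on a tiny input. The sketch below tokenizes one made-up message into a list of lines, each of which is a list of lowercase tokens. The helper name `tokenize` and the sample text are hypothetical, and unlike the cell below, this version also drops tokens that become empty once punctuation is stripped.

```python
def tokenize(msgtext, strip_chars=".,:*'()-"):
    """Split a message into lines, and each line into cleaned tokens."""
    lines = []
    for raw_line in msgtext.split("\n"):
        # Lowercase each word and trim surrounding punctuation
        words = [w.casefold().strip(strip_chars) for w in raw_line.split(" ") if w]
        words = [w for w in words if w]  # drop tokens that were only punctuation
        if words:
            lines.append(words)
    return lines

# Hypothetical message text with a blank line and a double space
sample = "FM SECSTATE WASHDC\n\nTO  AMEMBASSY LONDON."
print(tokenize(sample))
```

Blank lines and repeated spaces disappear, so later code can rely on every line having at least one token.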

##### ***This takes a few minutes, please be patient!***

In [None]:
import os

class DataHolder:
    def __init__(self,date,token):
        self.date = date
        self.token = token

data = []
i=0
files = 0
count = 0
for file in os.listdir('cables'):
    if file.startswith('CFPF.TEL') and file.endswith('PU') and files < 20:
        tree=ET.parse('cables/' + file)
        root = tree.getroot()
        for doc in list(root.iter('sasdoc')): 
            messages = list(doc.iter('msgtext'))
            if (len(messages) == 0):
                continue
            message = messages[0]
            dates = list(doc.iter('date'))
            if (len(dates) == 0):
                continue
            date = dates[0]
            # if count < 5 and "fm secstate" in message.text.lower():
            #     count += 1
            #     print(date.text)
            #     print(message.text)
            # Skip documents whose <msgtext> element is empty
            if message.text is None:
                continue
            tok = message.text.split("\n")
            token = [x.split(" ") for x in tok]
            finalToken = []
            for line in token:
                finalLine = []
                for word in line:
                    if len(word) > 0:
                        finalLine.append(word.casefold().strip(".,:*'())-"))
                if len(finalLine) > 0:
                    finalToken.append(finalLine)
            if len(finalToken) > 0:
                data.append(DataHolder(date.text,finalToken))
        files += 1
print("Finished!")
# for i in range(5):
#     print(data[i].token)

Finished!


Now we can look for specific complete words in the full text of a message.  

In [None]:
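# `data` holds one DataHolder per message; its `token` field is a list of lines
for i, point in enumerate(data):
    for line in point.token:
        # Check only the last three tokens of each line
        lineSlice = line[-3:]
        for word in lineSlice:
            if word == "kissinger":
                print('War criminal detected: ', i)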


Creating the entry. Find the sender (where Kissinger was), and convert the date into a datetime object. 
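As an aside on the date conversion, `datetime.strptime` can parse this kind of string directly. The sketch below assumes the dates look like `07 DEC 1974` (day, three-letter month, year), which is what the cell below expects, and that the interpreter runs under an English locale so that `%b` recognizes the month abbreviations. The helper name `to_date` and the sample dates are made up for illustration.

```python
from datetime import datetime

def to_date(text):
    """Parse a 'DD MON YYYY' string into a datetime.date."""
    return datetime.strptime(text.strip(), "%d %b %Y").date()

# Hypothetical date strings in the assumed format
dates = ["07 DEC 1974", "22 SEP 1973", "01 JAN 1974"]

# date objects compare chronologically, so sorting gives oldest first
ordered = sorted(to_date(d) for d in dates)
for d in ordered:
    print(d.isoformat())
```

The cell below uses an explicit month table instead, which avoids the locale dependence.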

In [None]:
import datetime

class Entry:
    def __init__(self,sender,receiver,date):
        self.sender = sender
        self.receiver = receiver
        self.date = date

    def generateEntryFromDataPoint(point,ignoreDC = False):
        token = point.token
        if Entry.isFromWarCriminal(token):
            sender = Entry.findSender(token)
            # Keep the entry unless ignoreDC is set and the sender is Washington
            if sender and (not ignoreDC or "secstate washdc" not in sender):
                # Expand common address abbreviations into readable names
                expansions = [("amembassy", "American Embassy in"), ("amconsul", "American Consulate in"),
                              ("uslo", "United States Liaison Office in"), ("usint", "United States Interests Section in"),
                              ("usmission", "United States Mission to"), ("usdel", "United States Delegation to"),
                              ("usun", "the United Nations")]
                for abbreviation, expansion in expansions:
                    sender = sender.replace(abbreviation, expansion)
                receiver = Entry.findReceiver(token)
                date = Entry.toDateTime(point.date)
                return Entry(sender, receiver, date)
        return False

    def render(self):
        return "ADDRESS: " + self.sender + "\nDATE: " + str(self.date) + '\n\n'

    def findSender(token):
        for line in token:
            word = line[0]
            if word == "fm":
                return Entry.listToString(line[1:])
        return False

    def findReceiver(token):
        for line in token:
            word = line[0]
            if word == "to":
                return Entry.listToString(line[1:]) 
        return False

    def isFromWarCriminal(token):
        lastLines = token[-4:]
        for line in lastLines:
            lastWord = line[-1]
            if lastWord == "kissinger":
                return True
        return False

    def toDateTime(date):
        months = {
            "JAN": 1,
            "FEB": 2,
            "MAR": 3,
            "APR": 4,
            "MAY": 5,
            "JUN": 6,
            "JUL": 7,
            "AUG": 8,
            "SEP": 9,
            "OCT": 10,
            "NOV": 11,
            "DEC": 12
        }
        day, month, year = date.split(" ")
        day = int(day)
        month = months[month]
        year = int(year)
        thisDay = datetime.date(year,month,day)
        return thisDay
    
    def listToString(line):
        out = ""
        for word in line:
           out += word + " "
        return out

entries = []
for point in data:
    entry = Entry.generateEntryFromDataPoint(point)
    if entry:
        entries.append(entry)
entries.sort(key=lambda x: x.date)  # oldest messages first
print(len(entries), "entries found")
for entry in entries:
    print(entry.render())

Streaming output truncated to the last 5000 lines.
ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 1974-12-07


ADDRESS: secstate washdc 
DATE: 197

Print to file, and then to console.

In [None]:
with open('./Kissinger_Travel.txt', 'w') as f:
    for entry in entries:
        f.write(entry.render())
        print(entry.render())
