<b>Apollo 13 Transcript Data Cleaning</b>

The goal is to examine the transcript from the Apollo 13 mission to see whether any interesting insights can be gleaned. Apollo 13 was the famous mission to the Moon that almost ended in disaster when an oxygen tank on board exploded.

The data will be imported and cleaned from a text file available <a href="https://www.hq.nasa.gov/alsj/a13/AS13_TEC.txt">here</a>. The transcription from PDF was provided by Heiko Küffen.

In [7]:
import pandas as pd

Import the data into a pandas DataFrame, one line of the text file per row.

In [4]:
df = pd.read_table('apollo13transcript.txt', header = None)

Trim the header material from the top so that only the transcript itself is preserved.

In [6]:
df = df.iloc[138:]
df.head()

Unnamed: 0,0
276,000:09:57 CC
277,Roger. Staging.
278,000:10:00 CDR
279,"And S-IV ignition, Houston."
280,000:10:04 CC


In [110]:
df.tail()

Unnamed: 0,0
21885,Roger.
21886,05 22 54 56 P-l
21887,Photo-1. Splashdown at this time. The three ch...
21888,05 22 55 12 R-l
21889,"... Recovery, I have a clock - -"


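The last three timestamp rows store the time in a different format: days, hours, minutes and seconds separated by spaces (for example 05 22 54 56) instead of the hhh:mm:ss format used elsewhere. We'll correct these manually. Five days and 22 hours is 5 × 24 + 22 = 142 hours, so 05 22 54 56 becomes 142:54:56.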

In [111]:
df.loc[21884] = "142:54:49 S-l"
df.loc[21886] = "142:54:56 P-l"
df.loc[21888] = "142:55:12 R-l"
df.tail()

Unnamed: 0,0
21885,Roger.
21886,142:54:56 P-l
21887,Photo-1. Splashdown at this time. The three ch...
21888,142:55:12 R-l
21889,"... Recovery, I have a clock - -"
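If more rows had needed fixing, the conversion could be scripted rather than typed by hand. The sketch below assumes the odd format is days, hours, minutes and seconds separated by single spaces, which is consistent with the values shown above. The helper name dhms_to_get is made up for illustration and is not part of the original notebook.

```python
import re


def dhms_to_get(stamp):
    """Convert 'DD HH MM SS' into 'HHH:MM:SS' (hypothetical helper)."""
    match = re.fullmatch(r"(\d{2}) (\d{2}) (\d{2}) (\d{2})", stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    # Fold the day count into the hour field so it matches the other rows.
    return f"{days * 24 + hours:03d}:{minutes:02d}:{seconds:02d}"


print(dhms_to_get("05 22 54 56"))  # 142:54:56
```

The speaker label would still need to be appended afterwards, as in the manual assignments above.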


Notice that the even rows contain the timestamp and the speaker acronym, and the odd rows contain the message. By splitting the even and odd rows into separate dataframes, we can later combine them so that all the necessary information is on one row.

In [112]:
timeSpeaker = df.iloc[::2]
text = df.iloc[1::2]
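This slicing only works if timestamp rows and message rows strictly alternate. A quick sanity check, sketched here on a few rows copied from the transcript, is to test every row against a timestamp pattern. The regular expression is an assumption about the format based on the rows shown in this notebook, and the variable names are illustrative.

```python
import pandas as pd

# A few rows copied from the transcript, in their original order.
sample = pd.DataFrame({0: [
    "000:00:02 CDR",
    "The clock is running.",
    "000:00:03 CMP",
    "Okay. P11, Jim.",
]})

# Assumed shape of a timestamp row: hhh:mm:ss, a space, then a speaker label.
stamp_pattern = r"^\d{3}:\d{2}:\d{2} \S+$"
is_stamp = sample[0].str.match(stamp_pattern)

even_ok = bool(is_stamp.iloc[::2].all())      # every even row is a timestamp
odd_ok = not bool(is_stamp.iloc[1::2].any())  # no odd row looks like one
print(even_ok, odd_ok)
```

If either check came back False on the full dataframe, the offending rows could be listed with is_stamp before deciding how to repair them.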

The timeSpeaker dataframe can be further split into two columns: one containing the time and one containing the speaker acronym. We'll examine it using the .head() method below. Note that we'll have to reset the index later.

In [113]:
timeSpeaker.head()

Unnamed: 0,0
138,000:00:02 CDR
140,000:00:03 CMP
142,000:00:05 CDR
144,000:00:12 CMP
146,000:00:14 CDR


We can split this into two columns by using the code below:

In [114]:
timeSpeaker = timeSpeaker[0].str.split(expand = True)
timeSpeaker.head()

Unnamed: 0,0,1
138,000:00:02,CDR
140,000:00:03,CMP
142,000:00:05,CDR
144,000:00:12,CMP
146,000:00:14,CDR
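Calling split with no arguments breaks the string at every run of whitespace, which is fine as long as no speaker label contains a space. As a hedged alternative, passing n=1 splits only at the first whitespace, so the result always has exactly two columns. The second label in the sample below is hypothetical and is only there to show the difference.

```python
import pandas as pd

# The first row is from the transcript; the second label is made up.
stamps = pd.Series(["000:00:02 CDR", "000:00:03 CMP X"])

# n=1 limits the split to the first whitespace, giving two columns.
two_cols = stamps.str.split(n=1, expand=True)
print(two_cols)
```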


Things are starting to get a bit messy, so let's separate the data into three series that we can concatenate into a final dataframe. We'll also reset the index in all of them, as the index in our series has gotten wonky due to the splitting and slicing.

In [115]:
time = timeSpeaker[0].reset_index(drop=True)
speaker = timeSpeaker[1].reset_index(drop=True)
text = text[0].reset_index(drop=True)

cleanedDf = pd.concat([time, speaker, text], axis = 1)
cleanedDf.columns = ['time', 'speaker', 'text']

cleanedDf.head()

Unnamed: 0,time,speaker,text
0,000:00:02,CDR,The clock is running.
1,000:00:03,CMP,"Okay. P11, Jim."
2,000:00:05,CDR,Yaw program.
3,000:00:12,CMP,Clear the tower.
4,000:00:14,CDR,Yaw complete. Roll program.
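The time column is still a string at this point. For later analysis it may help to have the elapsed time as a number as well. The sketch below assumes every value follows the hhh:mm:ss format and adds a seconds column to a small sample. The column name seconds is an illustrative choice, not something from the original notebook.

```python
import pandas as pd

# Two timestamps in the cleaned format: one from the start, one from the end.
sample = pd.DataFrame({"time": ["000:00:02", "142:54:56"]})

# Split into hours, minutes and seconds, then convert each part to an integer.
parts = sample["time"].str.split(":", expand=True).astype(int)
sample["seconds"] = parts[0] * 3600 + parts[1] * 60 + parts[2]
print(sample)
```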


Write this data to a tab-separated values file.

In [116]:
cleanedDf.to_csv('cleanedData.txt', sep = '\t', index = False)
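To confirm that the file reads back the way it was written, a round trip can be checked. The sketch below uses an in-memory buffer instead of a file on disk so it runs on its own, and passes dtype=str so that every column is kept as text. The sample rows are copied from the output above.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "time": ["000:00:02", "000:00:03"],
    "speaker": ["CDR", "CMP"],
    "text": ["The clock is running.", "Okay. P11, Jim."],
})

# Write to an in-memory buffer with the same options used for the file.
buffer = io.StringIO()
sample.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)

# Read it back, keeping every column as a string.
restored = pd.read_csv(buffer, sep="\t", dtype=str)
print(restored.values.tolist() == sample.values.tolist())
```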