# Lesson 7: Dicts and misc. topics

In this lesson, we use a dict as yet another data structure, this time to store the metadata of a transcript. A dict makes sense here because it maps a key to a value, so we do not have to remember the position of each item as we would with a list or a tuple. That is, if we want to know the age of the child, we would ideally do something like this:

```python
metadata['age of child']
```

instead of

```python
metadata[2]
```
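To make the difference concrete, here is a minimal sketch that stores the same three pieces of metadata once as a tuple and once as a dict. The values are taken from the Adam transcript header shown in the next section; the names `metadata_tuple` and `metadata_dict` are made up for this illustration.

```python
# the same metadata, stored in two different ways (illustrative values)
metadata_tuple = ('eng', 'Brown', '2;03.04')
metadata_dict = {'language': 'eng', 'corpus': 'Brown', 'age of child': '2;03.04'}

# with a tuple, we have to remember that the age sits at position 2
print(metadata_tuple[2])

# with a dict, the key tells us what we are asking for
print(metadata_dict['age of child'])

# asking for a missing key with [] raises a KeyError;
# .get() returns a default value instead
print(metadata_dict.get('date', 'unknown'))
```

Both lookups return the same string, but only the dict version still makes sense when read a few weeks later.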

## Where to start?

Since we want to extract data from transcripts, let's look at the beginning of a transcript, which is where the metadata is stored.

```
@UTF8
@PID:	11312/c-00015632-1
@Begin
@Languages:	eng
@Participants:	CHI Adam Target_Child , MOT Mother , URS Ursula_Bellugi Investigator , RIC Richard_Cromer Investigator , COL Colin_Fraser Investigator
@ID:	eng|Brown|CHI|2;03.04|male|typical|MC|Target_Child|||
@ID:	eng|Brown|MOT||female|||Mother|||
@ID:	eng|Brown|URS|||||Investigator|||
@ID:	eng|Brown|RIC|||||Investigator|||
@ID:	eng|Brown|COL|||||Investigator|||
@Date:	08-OCT-1962
@Comment:	Birth of CHI is 4-JUL-1960
@Time Duration:	10:00-11:00
*CHI:	play checkers .
%mor:	n|play n|checker-PL .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
%xpho:	<1> pe
*CHI:	big drum .
%mor:	adj|big n|drum .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
...
```

So, by looking at this, we can make out a few things about the format of the metadata, e.g.:

1. A metadata line starts with an `@`. This distinguishes it from the other line types, which start with `*` or `%`.
1. After the `@`, there is a keyword stating which kind of metadata is found in that particular line, so the beginning of the line tells us what is coming.
1. After this keyword, we see the sequence `:\t` (a colon and a tab), so we know where to split the keyword from the actual content.
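These three observations can be sketched as a small helper. The function name `split_metadata_line` is hypothetical and not part of the lesson's code. The sketch uses `str.partition`, which splits at the first occurrence of the separator only. It also handles header lines such as `@Begin`, which have no `:\t` and therefore no value.

```python
def split_metadata_line(line):
    """Return (keyword, value) for a metadata line such as '@Languages:\teng'."""
    # partition splits at the first ':\t' only; the rest of the line stays intact
    keyword, separator, value = line.partition(':\t')
    if not separator:
        # header lines such as '@UTF8' or '@Begin' carry no value
        return line.lstrip('@'), None
    return keyword.lstrip('@'), value


print(split_metadata_line('@Languages:\teng'))              # ('Languages', 'eng')
print(split_metadata_line('@Time Duration:\t10:00-11:00'))  # ('Time Duration', '10:00-11:00')
print(split_metadata_line('@Begin'))                        # ('Begin', None)
```

Note that the colons inside `10:00-11:00` do not cause trouble, because they are not followed by a tab.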

As usual, we start out by loading a bunch of transcripts into raw strings from a folder. Let's try to do it for Eve instead of Adam this time.

```python
from os import chdir as cd  # function to change working directory
import glob  # introduces a function to retrieve lists of files

# point to the data folder and go there
pathin = '/home/kasper/Downloads/Brown/Eve'
cd(pathin)

# loop over all files and store it as a list for now
file_contents = []
for filename in sorted(glob.glob('*.cha')):
    with open(filename, encoding='utf-8') as f:
        raw = f.read()
        file_contents.append(raw)
```

Next, we loop over these file contents and extract the metadata lines. We can follow this course of action:

```python
metadata_for_transcripts = []  # a list to store metadata dicts

for raw in file_contents:
    # prepare dict for metadata in the current transcript
    metadata = {}
    
    # get all metadata lines (we know they start with @)
    metadata_lines = [line for line in raw.split('\n') if line.startswith('@')]
    
    # go over each line and extract whatever's in them based on the keyword
    # the metadata value is retrieved with line.split('\t')[1]
    for line in metadata_lines:
        
        # language line
        if line.startswith('@Languages'):
            ...
```

### Your turn!

```python
metadata_for_transcripts = []  # a list to store metadata dicts

for raw in file_contents:
    # prepare dict for metadata in the current transcript
    metadata = {}

    # get all metadata lines (we know they start with @)
    metadata_lines = [line for line in raw.split('\n') if line.startswith('@')]

    # since we have more than one participant, we want to store participant data
    # every time we come across an @ID line; hence, this variable to store it in
    participants = []

    # go over each line and extract whatever is in it based on the keyword
    # the metadata value is retrieved with line.split('\t')[1]
    for line in metadata_lines:

        # language line
        if line.startswith('@Languages'):
            lang = line.split('\t')[1]
            metadata['language'] = lang  # add it to the dict

        # participants ID lines (to be stored in var participants)
        elif line.startswith('@ID'):
            participant_raw = line.split('\t')[1]
            info = participant_raw.split('|')  # "sub-info" is stored between |'s
            abbr = info[2]  # the abbreviation used for the participant in the transcript
            role = info[7]  # their role, e.g. Target_Child or Investigator
            participant = (abbr, role)  # sum up the info in a tuple
            participants.append(participant)  # add it to the prepared list

            # we get age of child from the ID line with the target child
            # so, when we have that particular line, execute this block
            if role == 'Target_Child':
                # the age is given as YEARS;MONTHS.DAYS, e.g. 2;03.08
                # get this as a raw string first. Convert the values to int along the way
                age_raw = info[3]
                years_monthsdays = age_raw.split(';')  # [YEARS, MONTHS+DAYS]
                years = int(years_monthsdays[0])  # first item is years
                months_days = years_monthsdays[1].split('.')  # [MONTHS, DAYS]
                months = int(months_days[0])  # first item is months
                try:  # sometimes, days are not given; hence, we prepare for an exception
                    days = int(months_days[1])  # second item is days
                except (IndexError, ValueError):  # days missing or empty: just write 0
                    days = 0
                # calculate the total number of days (month = 30 days, for simplicity)
                days_total = years * 365 + months * 30 + days
                # then calculate years based on the number of days
                normalized_years = days_total / 365
                metadata['age of child'] = normalized_years  # add it to the dict

            # the corpus is also given in ID lines, so get that as well
            metadata['corpus'] = info[1]  # overwritten each time; but no problem

        # date line
        elif line.startswith('@Date'):
            date_raw = line.split('\t')[1]
            day, month, year = date_raw.split('-')  # nifty syntax (Heinold p. 90)
            # map the month string to the number of the month instead
            # (named month_numbers so it does not clash with months in the age block)
            month_numbers = {'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
                             'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12}
            month = month_numbers[month]
            # convert the things to numbers and store it as a tuple
            date = (int(day), month, int(year))
            metadata['date'] = date  # add it to the dict

        # duration line
        elif line.startswith('@Time Duration'):
            duration_raw = line.split('\t')[1]
            # calculate the start time as a floating-point number of hours
            start = duration_raw.split('-')[0]
            start_hours = int(start.split(':')[0])
            start_mins = int(start.split(':')[1])
            start_time = start_hours + start_mins / 60
            # same for end time
            end = duration_raw.split('-')[1]
            end_hours = int(end.split(':')[0])
            end_mins = int(end.split(':')[1])
            end_time = end_hours + end_mins / 60
            # calculate the difference = duration
            duration = end_time - start_time
            metadata['duration'] = duration  # add it to the dict

    # remember to add the list of participants
    metadata['participants'] = participants

    # add the now finished dict to the list for metadata
    metadata_for_transcripts.append(metadata)

metadata_for_transcripts
```

```
[{'age of child': 1.4931506849315068,
  'corpus': 'Brown',
  'date': (17, 10, 1962),
  'duration': 0.5,
  'language': 'eng',
  'participants': [('CHI', 'Target_Child'),
   ('MOT', 'Mother'),
   ('COL', 'Investigator'),
   ('RIC', 'Investigator')]},
 {'age of child': 1.4931506849315068,
  'corpus': 'Brown',
  'date': (31, 10, 1962),
  'duration': 1.0,
  'language': 'eng',
  'participants': [('CHI', 'Target_Child'),
   ('MOT', 'Mother'),
   ('FAT', 'Father'),
   ('RIC', 'Investigator'),
   ('COL', 'Investigator')]},
 {'age of child': 1.5753424657534247,
  'corpus': 'Brown',
  'date': (12, 11, 1962),
  'duration': 1.0,
  'language': 'eng',
  'participants': [('CHI', 'Target_Child'),
   ('FAT', 'Father'),
   ('MOT', 'Mother'),
   ('RIC', 'Investigator'),
   ('COL', 'Investigator')]},
 {'age of child': 1.5753424657534247,
  'corpus': 'Brown',
  'date': (28, 11, 1962),
  'duration': 1.0,
  'language': 'eng',
  'participants': [('CHI', 'Target_Child'),
   ('MOT', 'Mother'),
   ('FAT', 'Father
...
```
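The age calculation in the loop above can also be pulled out into a function of its own, which makes it easy to try on single values. The following is a sketch, not the lesson's official solution: the name `age_in_years` is made up, and it keeps the same simplification as above (a month counts as 30 days, a year as 365). As a sanity check, it compares the approximation with a real calendar calculation, using the date and the birth date from the Adam header shown earlier.

```python
from datetime import date


def age_in_years(age_raw):
    """Convert an age string like '2;03.04' (YEARS;MONTHS.DAYS) to years."""
    years, _, rest = age_raw.partition(';')
    months, _, days = rest.partition('.')
    # months or days may be missing, e.g. '1;06.' or '2;'; count them as 0
    total_days = int(years) * 365 + int(months or 0) * 30 + int(days or 0)
    return total_days / 365


# Adam's age in the sample header: 2 years, 3 months, 4 days
print(age_in_years('2;03.04'))  # 824 days / 365

# an age without days
print(age_in_years('1;06.'))    # 545 days / 365

# cross-check with calendar dates from the same header
recorded = date(1962, 10, 8)    # @Date: 08-OCT-1962
born = date(1960, 7, 4)         # @Comment: Birth of CHI is 4-JUL-1960
print((recorded - born).days)   # 826 days, versus 824 with the approximation
```

The approximation is off by two days here, which is negligible for most purposes. Incidentally, 545 / 365 is the value 1.4931506849315068 that shows up in the first rows of the output above, which suggests that the age in those transcripts is given as one year and six months.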

```python
import pandas as pd

dataframe = pd.DataFrame(metadata_for_transcripts)

dataframe
```

| | language | age of child | corpus | date | duration | participants |
|---|---|---|---|---|---|---|
| 0 | eng | 1.493151 | Brown | (17, 10, 1962) | 0.5 | [(CHI, Target_Child), (MOT, Mother), (COL, Inv... |
| 1 | eng | 1.493151 | Brown | (31, 10, 1962) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 2 | eng | 1.575342 | Brown | (12, 11, 1962) | 1.0 | [(CHI, Target_Child), (FAT, Father), (MOT, Mot... |
| 3 | eng | 1.575342 | Brown | (28, 11, 1962) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 4 | eng | 1.657534 | Brown | (12, 12, 1962) | 0.5 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 5 | eng | 1.739726 | Brown | (2, 1, 1963) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 6 | eng | 1.739726 | Brown | (16, 1, 1963) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 7 | eng | 1.739726 | Brown | (28, 1, 1963) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 8 | eng | 1.821918 | Brown | (13, 2, 1963) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
| 9 | eng | 1.821918 | Brown | (27, 2, 1963) | 1.0 | [(CHI, Target_Child), (MOT, Mother), (FAT, Fat... |
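Once the metadata sits in a DataFrame, the dict keys serve as column names, and whole columns can be processed at once. The sketch below assumes nothing beyond pandas itself. To stay self-contained, it rebuilds a tiny DataFrame from two rows copied from the table above (reduced to three columns) rather than reusing `dataframe`. The names `rows` and `df` are made up for the example.

```python
import pandas as pd

# two rows copied from the table above, reduced to three columns
rows = [
    {'age of child': 1.493151, 'date': (17, 10, 1962), 'duration': 0.5},
    {'age of child': 1.575342, 'date': (12, 11, 1962), 'duration': 1.0},
]
df = pd.DataFrame(rows)

# the dict keys became the column names
print(list(df.columns))

# a whole column can be summed in one call: total recording time in hours
print(df['duration'].sum())

# rows can be filtered with a condition on a column
print(df[df['age of child'] > 1.5])
```

The same calls work on the full `dataframe`, for instance to get the total recording time across all transcripts.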
