# Extracting MLU

MLU (mean length of utterance) is one of the most important measures of child language development because it combines the amount of speech per utterance (which could otherwise be measured by the number of tokens) with a measure of syntactic complexity (morphemes). Extracting MLU from the CHILDES transcripts does have its challenges, and the main one is deciding what to count. In the CHAT transcription convention, every root morpheme on a %mor line is marked with a vertical pipe character "|", so counting the |'s on each %mor line gives the number of root morphemes. To these we add the bound morphemes, which requires a list of all the bound-morpheme codes that we can expect in the transcription. Some of these codes seem to vary from transcript to transcript, so we need to be somewhat familiar with our data. Finally, we may be interested in one particular part of speech, say nouns or prepositions, in which case the script has to be adapted to catch the items of interest for our research. The script below counts root morphemes plus a standard list of bound morphemes. Not all of these will be relevant for the Brown transcripts.
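
As a small illustration of the counting rule, separate from the full script that follows, the sketch below scores a single %mor line. The line itself is invented for illustration and is not taken from the Brown transcripts. The helper names `count_morphemes` and `count_pos` are hypothetical. The marker list is copied from the full script, so the same caveat applies: it may need adjusting for a given transcript.

```python
# Hypothetical %mor line, invented for illustration (not from the Brown transcripts).
mor_line = 'det|the n|dog-PL aux|be&3S part|run-PROG .'

# Same bound-morpheme markers as in the full script.
BOUND_MARKERS = ['-PL', 'POSS', '&PAST', '-PROG', '&PERF', '&3S',
                 '-CP', '&CP', '-SP', '&SP', '-ADVR', 'LY', '-AGT']


def count_morphemes(line, markers=BOUND_MARKERS):
    """Count root morphemes (one per '|') plus every listed bound-morpheme marker."""
    roots = line.count('|')
    bound = sum(line.count(marker) for marker in markers)
    return roots + bound


def count_pos(line, pos):
    """Count words whose part-of-speech tag (the text before '|') is exactly `pos`."""
    return sum(1 for word in line.split() if word.split('|')[0] == pos)


print(count_morphemes(mor_line))  # 4 roots + -PL + &3S + -PROG
print(count_pos(mor_line, 'n'))   # nouns on this line
```

Note that `count_pos` compares the whole tag, so a subcategorised tag such as `n:prop` would need its own call or a prefix match, depending on what the research question counts as a noun.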

In [1]:
from os import chdir as cd

child = 'Adam'
pathin = '/Users/ethan/Desktop/Brown/'+child
pathout = '/Users/ethan/Desktop'

cd(pathin)
file = child + '01.cha'

removelist = ['\t', '\r']

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
       
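    # Build a list containing only the %mor lines for the child's utterances.
    # For each line of the child's speech (*CHI), take the line that follows it (s+1),
    # which is assumed to be its %mor line, and collect it in the variable utt.
    utt = []
    for s, val in enumerate(text):
        # the s+1 check avoids an IndexError if a *CHI line is the last line of the file
        if val.startswith('*CHI') and s + 1 < len(text):
            m = text[s+1]
            m = m[5:]  # drop the leading '%mor:' label (the tab was removed above)
            utt.append(m)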
    
    # Set up empty lists to collect the morpheme counts.
    # In the current version of this script, only "morphs", which collects all morphemes
    # regardless of type, is used. That could be modified, e.g. if we wanted to track the
    # number of possessives (mPOSS) or plural nouns (mPL) specifically.
    morphs = []
    mPL = []
    mPOSS = []
    mPAST = []
    mPROG = []
    mPERF = []
    mTHIRDSING = []
    mCOMP1 = []
    mCOMP2 = []
    mSUPER1 = []
    mSUPER2 = []
    mADVR1 = []
    mADVR2 = []
    mAGT = []
    
    for s in utt:
        base = s.count('|')
        PL = s.count('-PL')
        POSS = s.count('POSS')
        PAST = s.count('&PAST')
        PROG = s.count('-PROG')
        PERF = s.count('&PERF')
        THIRDSING = s.count('&3S')
        COMP1 = s.count('-CP')
        COMP2 = s.count('&CP')
        SUPER1 = s.count('-SP')
        SUPER2 = s.count('&SP')
        ADVR1 = s.count('-ADVR')
        ADVR2 = s.count('LY')
        AGT = s.count('-AGT')
        

        m = base + PL + POSS + PAST + PROG + PERF + THIRDSING + COMP1 + COMP2 + SUPER1 + SUPER2 + ADVR1 + ADVR2 + AGT
        morphs.append(m)
        mPL.append(PL)
        mPOSS.append(POSS)
        mPAST.append(PAST)
        mPROG.append(PROG)
        mPERF.append(PERF)
        mTHIRDSING.append(THIRDSING)
        mCOMP1.append(COMP1)
        mCOMP2.append(COMP2)
        mSUPER1.append(SUPER1)
        mSUPER2.append(SUPER2)
        mADVR1.append(ADVR1)
        mADVR2.append(ADVR2)
        mAGT.append(AGT)


    
av = sum(morphs)/len(morphs)
print(av)

2.166403785488959
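
The script above takes the line directly after each *CHI line to be its %mor line. That can go wrong when an utterance has no %mor tier, when another dependent tier comes first, or when a long line wraps onto a continuation line. A more defensive pairing might look like the sketch below. It assumes that continuation lines begin with a tab, and it expects the raw lines of the file, before tabs are stripped. The function name `child_mor_tiers` and the sample lines are hypothetical and are not taken from the Brown transcripts. Using it in place of the `s+1` lookup may change the MLU value printed above.

```python
def child_mor_tiers(lines, speaker='*CHI'):
    """Return the %mor tier belonging to each utterance of `speaker`."""
    # Join continuation lines (assumed to start with a tab) onto the previous line.
    joined = []
    for line in lines:
        if line.startswith('\t') and joined:
            joined[-1] += ' ' + line.strip()
        else:
            joined.append(line.rstrip('\r\n'))

    tiers = []
    in_speaker_turn = False
    for line in joined:
        if line.startswith('*'):
            # A new main line starts; remember whether it belongs to the speaker.
            in_speaker_turn = line.startswith(speaker)
        elif line.startswith('@'):
            # Header lines end any open turn.
            in_speaker_turn = False
        elif in_speaker_turn and line.startswith('%mor:'):
            tiers.append(line[len('%mor:'):].strip())
            in_speaker_turn = False  # take at most one %mor tier per utterance
    return tiers


# Invented sample: one utterance without a %mor tier, one with an
# intervening %act tier and a wrapped %mor line.
sample = [
    '@Begin',
    '*CHI:\tmore cookie .',
    '%mor:\tqn|more n|cookie .',
    '*MOT:\tyou want more ?',
    '%mor:\tpro|you v|want qn|more ?',
    '*CHI:\txxx .',
    '*CHI:\tdoggie go .',
    '%act:\tpoints',
    '%mor:\tn|doggie',
    '\tv|go .',
    '@End',
]
print(child_mor_tiers(sample))
```

Utterances without a %mor tier are simply left out of the result, so they do not contribute to the average. Whether they should count as zero-morpheme utterances instead is a decision that depends on the research question.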
