# Word frequency, "if" statements, and some other stuff

By now you can do lots of things with Python. You can:
1. read text into Python
1. do basic manipulations with that text (putting everything in lowercase, removing punctuation, etc.)
1. work with strings of text including:
    1. splitting text strings into lists of items, using different delimiters
    1. converting integer variables into text
    1. combining pieces of text strings using "+"
1. find the number of items in a list 
1. find the number of unique items in a list
1. use "for" loops to repeat actions such as:
    1. doing the same thing to all the files in a folder
    1. doing the same thing with every item in a string, in order

That's a lot! If you are not sure you know how to do all of these things, take some time to go back and look at the previous notebooks, because we will continue building on these skills.

In this lesson, we will start using actual child language data, downloaded from CHILDES. We will use the "if" statement to look at word frequency. Along the way, we will pick up some other new tools.



## Download the Brown corpus

We will be using data from Roger Brown's pioneering study on Adam, Eve, and Sarah. You can download the data [here] [brown].

[brown]: http://childes.psy.cmu.edu/access/Eng-NA/Brown.html

## Look at the data

CHILDES data is stored in the ".cha" format, but this is just a text file, with a ".cha" suffix instead of a ".txt" suffix. Python should have no trouble reading the data in as a string.

We'll start by just looking at the first transcript of Adam's speech. Because it is quite long, we can just look at the first 1000 characters.



In [1]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'


with open(file,'r') as f:
    text = f.read()
    print(text[:1000])


@UTF8
@PID:	11312/c-00015632-1
@Begin
@Languages:	eng
@Participants:	CHI Adam Target_Child , MOT Mother , URS Ursula_Bellugi Investigator , RIC Richard_Cromer Investigator , COL Colin_Fraser Investigator
@ID:	eng|Brown|CHI|2;3.04|male|typical|MC|Target_Child|||
@ID:	eng|Brown|MOT|||||Mother|||
@ID:	eng|Brown|URS|||||Investigator|||
@ID:	eng|Brown|RIC|||||Investigator|||
@ID:	eng|Brown|COL|||||Investigator|||
@Date:	08-OCT-1962
@Comment:	Birth of CHI is 4-JUL-1960
@Time Duration:	10:00-11:00
*CHI:	play checkers .
%mor:	n|play n|checker-PL .
%gra:	1|2|MOD 2|0|ROOT 3|2|PUNCT
%xpho:	<1> pe
*CHI:	big drum .
%mor:	adj|big n|drum .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
*MOT:	big drum ?
%mor:	adj|big n|drum ?
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
*CHI:	big drum .
%mor:	adj|big n|drum .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
%spa:	$IMIT
*CHI:	big drum .
%mor:	adj|big n|drum .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
%spa:	$IMIT
*CHI:	big drum .
%mor:	adj|big n|drum .
%gra:	1|2|MOD 2|0|INCROOT 3|2|PUNCT
%sp

## Getting just the data we want

There is a lot of information here. Let's say that we are interested in finding out what Adam's most common words are, and how that develops with time. In this case, we don't need to know what his mother or anyone else says, and we don't need the grammatical or part-of-speech data either. We just want to know what Adam has said. We can use an "if" statement for this. Here are some quick examples of how "if" works:

In [2]:
# Test if strings are the same

a = 'It was the best of times'
b = 'It was the worst of times'
c = 'It was the best of times'
if c == a:
    print(b)
    
# "==" : is the same as

It was the worst of times


In [3]:
# Test if strings are different

a = 'It was the best of times'
b = 'It was the worst of times'
c = 'It was the best of times'

if c != b:
    print(a)
    
# "!=" : is not the same as

It was the best of times


In [4]:
# Compare numbers

a = 42
b = 43

if a > b:
    print('true')
else:
    print('false')

false


In [5]:
# Compare length of strings

a = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'

b = 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.'

c = 'But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.'

if len(a)<len(b):
    print('a is shorter than b')
elif len(a)<len(c):
    print('a is shorter than c')
else:
    print('a is the longest string')
    
# Notice that the second condition is also true, but Python stops after the first one is met.

a is shorter than b


In [6]:
# Look for specific strings within a larger string, or look for specific items in a list

a = 'Jellicle cats come one come all'
b = a.split()
print(a + '\n')
print(b); print('')


if 'Jellicle' in a:
    print('Jellicle found in a'  + '\n')
    
if 'Jellicle' in b:
    print('Jellicle found in b'  + '\n')
    
if 'Pollicles' not in b: 
    print('No Pollicles found in b' + '\n')
else:
    print('Pollicles found in b' + '\n')

Jellicle cats come one come all

['Jellicle', 'cats', 'come', 'one', 'come', 'all']

Jellicle found in a

Jellicle found in b

No Pollicles found in b



## Splitting the text into lines

There are many more examples of things we can do with "if", but these give you the basic idea. Now let's use "if" to select only Adam's data. First, let's split the text. Because each utterance is on a separate line, let's use the "new line" code '\n' to split our data.

In [7]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

with open(file,'r') as f:
    text = f.read()
    text = text.split('\n')
    print(text[100:150])

['%mor:\tn|nickel .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tnickel .', '%mor:\tn|nickel .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tnickel .', '%mor:\tn|nickel .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tnickel .', '%mor:\tn|nickel .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tshadow .', '%mor:\tn|shadow .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tshadow .', '%mor:\tn|shadow .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tshadow .', '%mor:\tn|shadow .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tshadow .', '%mor:\tn|shadow .', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*MOT:\tshadow ?', '%mor:\tn|shadow ?', '%gra:\t1|0|INCROOT 2|1|PUNCT', '*CHI:\tyeah (.) shadow (.) yeah .', '%mor:\tco|yeah n|shadow co|yeah .', '%gra:\t1|2|COM 2|0|ROOT 3|2|COM 4|2|PUNCT', '%com:\tShadow is a book', '%spa:\t$RES', '*CHI:\tshadow (.) funny .', '%mor:\tn|shadow adj|fun&dn-Y .', '%gra:\t1|0|INCROOT 2|1|POSTMOD 3|1|PUNCT', '%com:\tShadow is a book', '*CHI:\tput dirt up .', '%mor:\tv|put&ZERO n|dirt adv|up .', '%gra:\t1|0|RO

### Problems!
Now the text is split into lines, but we still have all the mother's speech, plus the lines starting with %mor, %gra, etc. Also, everything that everybody says has a "\t" in front of it. Just like "\n" means "new line", "\t" means "tab". Some of you may also find "\r" characters (carriage returns), so let's get rid of those too. We may turn up other things we want to get rid of, so let's set up a list with all these nuisances, and use a "for" loop to go through them until they are gone.

In [8]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    print(text[100:150])
    

['%mor:n|nickel .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:nickel .', '%mor:n|nickel .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:nickel .', '%mor:n|nickel .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:nickel .', '%mor:n|nickel .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:shadow .', '%mor:n|shadow .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:shadow .', '%mor:n|shadow .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:shadow .', '%mor:n|shadow .', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:shadow .', '%mor:n|shadow .', '%gra:1|0|INCROOT 2|1|PUNCT', '*MOT:shadow ?', '%mor:n|shadow ?', '%gra:1|0|INCROOT 2|1|PUNCT', '*CHI:yeah (.) shadow (.) yeah .', '%mor:co|yeah n|shadow co|yeah .', '%gra:1|2|COM 2|0|ROOT 3|2|COM 4|2|PUNCT', '%com:Shadow is a book', '%spa:$RES', '*CHI:shadow (.) funny .', '%mor:n|shadow adj|fun&dn-Y .', '%gra:1|0|INCROOT 2|1|POSTMOD 3|1|PUNCT', '%com:Shadow is a book', '*CHI:put dirt up .', '%mor:v|put&ZERO n|dirt adv|up .', '%gra:1|0|ROOT 2|1|OBJ 3|1|JCT 4|1|PUNCT', '%com:Shadow is a book', '*CHI:put dirt up .'
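A side note on "\r": the sketch below assumes a file saved with Windows-style line endings ("\r\n"), and the sample string is made up for illustration. Python's built-in `splitlines()` treats "\n", "\r\n" and "\r" (plus a few rarer line-break characters) as line boundaries, so it can take care of the line endings in one step. Tabs would still need to be removed separately.

```python
# A made-up two-utterance sample with Windows-style line endings
sample = '*CHI:\tbig drum .\r\n*MOT:\tbig drum ?\r\n'

# Splitting on '\n' alone leaves a stray '\r' at the end of each line,
# plus an empty string after the final line break
by_newline = sample.split('\n')
print(by_newline)

# splitlines() treats '\r\n' as one line break and adds no empty final item
by_lines = sample.splitlines()
print(by_lines)
```

Either approach works for this lesson; the replace-then-split version above has the advantage of making each cleaning step visible.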

### Using "if" to select data

Now we are ready to select only the lines that have speech in them. Of course, we only want Adam's speech, but first let's see how we can use "if" together with "or" to get both Adam's and his mother's speech.

In [9]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    
    turns = []
    for line in text:
        if line.startswith('*CHI') or line.startswith('*MOT'):
            turns.append(line)
            
    turns = '\n'.join(turns)
    print(turns[:300])



*CHI:play checkers .
*CHI:big drum .
*MOT:big drum ?
*CHI:big drum .
*CHI:big drum .
*CHI:big drum .
*CHI:horse .
*MOT:horse .
*CHI:who (th)at ?
*MOT:who is that ?
*MOT:those are checkers .
*CHI:two check .
*MOT:two checkers (.) yes .
*CHI:rig horn .
*CHI:p(l)ay check(ers) .
*MOT:play checkers ?
*CH
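As an aside, here is a small sketch of another way to write the same test. `startswith()` also accepts a tuple of prefixes, which can be easier to read than a chain of "or" conditions once there are several speakers. The lines below are made up, in the same shape as the cleaned transcript; the variable name `speakers` is just for illustration.

```python
# A few made-up lines in the same shape as the cleaned transcript
lines = ['*CHI:big drum .', '%mor:adj|big n|drum .', '*MOT:big drum ?', '*URS:what is that ?']

# startswith() accepts a tuple of prefixes and is True if any one of them matches
speakers = ('*CHI', '*MOT')

turns = []
for line in lines:
    if line.startswith(speakers):
        turns.append(line)

print(turns)
```

To select a different set of speakers, only the tuple needs to change.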


## Only Adam

This is starting to look better. From now on, though, we will drop the

    or line.startswith('*MOT')

and just select Adam's speech.

In [10]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    
    turns = []
    for line in text:
        if line.startswith('*CHI'):
            turns.append(line)
            
    turns = '\n'.join(turns)
    print(turns[:300])

*CHI:play checkers .
*CHI:big drum .
*CHI:big drum .
*CHI:big drum .
*CHI:big drum .
*CHI:horse .
*CHI:who (th)at ?
*CHI:two check .
*CHI:rig horn .
*CHI:p(l)ay check(ers) .
*CHI:yep .
*CHI:big horn .
*CHI:alright look tv .
*CHI:part .
*CHI:part .
*CHI:part .
*CHI:part .
*CHI:part .
*CHI:ge(t) over 


### Cleaning up Adam's data

Right now, we're only interested in Adam's words, so let's remove punctuation. You already know how to do this, but notice in the code below that when we import something, we can give it a new temporary name. People often do this if they will be writing the name many times. If I will be using "punctuation" a lot, and don't want to write it out every time, I can write:

    from string import punctuation as pnc
    
and from then on, I only need to write "pnc" and not "punctuation". Also, notice that previously we have imported all of string, like this:

    import string
    punct = set(string.punctuation)

And now we are doing:

    from string import punctuation as pnc
    punct = set(pnc)
    
These will give exactly the same result. We could also say:

    from string import punctuation
    punct = set(punctuation)
    
In all cases, the code does the same thing. I put these details here for those of you who are interested. If you don't care, don't worry about it!
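If you would like to check this for yourself, the short sketch below builds the set all three ways and compares the results. The variable names `punct_a`, `punct_b` and `punct_c` are only for this illustration.

```python
import string
from string import punctuation
from string import punctuation as pnc

# Three ways of reaching the same string of punctuation characters
punct_a = set(string.punctuation)
punct_b = set(punctuation)
punct_c = set(pnc)

# All three sets contain exactly the same characters
print(punct_a == punct_b == punct_c)
print('!' in punct_c)
```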

In [11]:
import os

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']

from string import punctuation as pnc
punct = set(pnc)

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    
    turns = []
    for line in text:
        if line.startswith('*CHI'):
            line = line.replace('*CHI:', '')
            line = ''.join(x for x in line if x not in punct) 
            line = line.strip()
            turns.append(line)

    turns = '\n'.join(turns)
    print(turns[:300])



play checkers
big drum
big drum
big drum
big drum
horse
who that
two check
rig horn
play checkers
yep
big horn
alright look tv
part
part
part
part
part
get over  Mommy
nickel
nickel
nickel
nickel
shadow
shadow
shadow
shadow
yeah  shadow  yeah
shadow  funny
put dirt up
put dirt up
put dirt up
sit der


### Getting the word frequency

Now we have a long string with nothing but the words Adam has said in this transcript (formatted with line breaks, to look nice).

You already know how to find how many types and tokens he uses. Now, let's count how many times he uses each word.

To do this, we will first need to split our nice string into a list again. Of course, we could just avoid putting it into a string to begin with, since "turns" was already a list, but I'm doing this for illustrative purposes.

Then, we can use a class called "Counter" from the module "collections" to count word frequency.

Also, notice that below I have moved all my "import" statements to the top of the script. This is a stylistic convention; the code will work either way, but it's cleaner code if you have your imports at the top.

In [12]:
import os
from collections import Counter
from string import punctuation as pnc

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']

punct = set(pnc)

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    
    turns = []
    for line in text:
        if line.startswith('*CHI'):
            line = line.replace('*CHI:', '')
            line = ''.join(x for x in line if x not in punct) 
            line = line.strip()
            turns.append(line)
            
    allturns = ' '.join(turns)
    words = allturns.split()
    
    freq = Counter(words).most_common(50)
    print(freq)




[('that', 158), ('who', 111), ('dat', 95), ('my', 81), ('Mommy', 75), ('no', 74), ('okay', 66), ('Adam', 65), ('go', 59), ('get', 49), ('kitty', 48), ('Daddy', 45), ('paper', 40), ('write', 40), ('Bozo', 39), ('there', 31), ('I', 30), ('put', 30), ('yeah', 30), ('ball', 29), ('xxx', 27), ('it', 26), ('bulldozer', 26), ('see', 24), ('Shadow', 24), ('truck', 21), ('man', 21), ('read', 20), ('look', 20), ('record', 19), ('like', 18), ('two', 18), ('part', 17), ('tractor', 17), ('in', 17), ('pencil', 16), ('up', 16), ('meow', 16), ('you', 15), ('bunnyrabbit', 15), ('mine', 14), ('horn', 14), ('name', 14), ('hit', 14), ('sit', 14), ('hm', 14), ('find', 14), ('a', 13), ('doggie', 13), ('John', 13)]
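To see what Counter is doing on a smaller scale, here is a minimal sketch with a made-up list of six words standing in for Adam's data. It also shows how the counts relate to types and tokens: the number of different entries is the number of types, and the counts added together give the number of tokens.

```python
from collections import Counter

# A made-up list of words, standing in for the "words" list above
words = ['big', 'drum', 'big', 'drum', 'big', 'horse']

counts = Counter(words)

# The two most frequent words, each paired with its count
print(counts.most_common(2))

types = len(counts)             # number of different words
tokens = sum(counts.values())   # total number of words
print(types, tokens)
```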


### Cool, right?

Now we have a list of Adam's 50 most common words in this transcript, in descending order. It would be nice to have a complete list, though, wouldn't it? Actually, we can use 

    Counter.most_common()
    
for this too. If you think about it, when we wrote:

    freq = Counter(words).most_common(50)
    
we asked for the 50 most common words. But why not just ask for all of them? We don't know how many words there are, but we know how to find that information:

    freq = Counter(words).most_common(len(words))
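There is also a shortcut worth knowing: when `most_common()` is called with no number at all, it returns every word. The sketch below uses a made-up word list to show that both versions give the same result.

```python
from collections import Counter

# A made-up sample standing in for Adam's words
words = ['who', 'that', 'that', 'who', 'that', 'kitty']

# Asking for as many items as there are tokens returns everything...
all_by_length = Counter(words).most_common(len(words))

# ...and so does leaving the number out
all_by_default = Counter(words).most_common()

print(all_by_default)
print(all_by_length == all_by_default)
```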
    


### Why stop with Adam's first transcript?

This is nice, but we'd like to know about how Adam's vocabulary develops. Let's add a "for" loop to cycle through *all* of his transcripts. In the Jane Austen demo, I used a list with the titles of all the novels to access all the items in a folder. Some of you used os.walk to achieve the same thing. Here I want to show you another useful module, "glob". We can use glob to give us every file in a folder that meets certain criteria. Here, I ask for every file that ends with ".cha"

In [13]:
import os
import glob
from collections import Counter
from string import punctuation as pnc

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)

removelist = ['\t', '\r']

punct = set(pnc)

for file in glob.glob('*.cha'):
    with open(file,'r') as f:
        text = f.read()

        for item in removelist:
            text = text.replace(item, '')
        text = text.split('\n')

        turns = []
        for line in text:
            if line.startswith('*CHI'):
                line = line.replace('*CHI:', '')
                line = ''.join(x for x in line if x not in punct) 
                line = line.strip()
                turns.append(line)

        allturns = ' '.join(turns)
        words = allturns.split()

        
        freq = Counter(words).most_common(len(words))
        print(file)
        print(freq)
        print('')



adam01.cha
[('that', 158), ('who', 111), ('dat', 95), ('my', 81), ('Mommy', 75), ('no', 74), ('okay', 66), ('Adam', 65), ('go', 59), ('get', 49), ('kitty', 48), ('Daddy', 45), ('paper', 40), ('write', 40), ('Bozo', 39), ('there', 31), ('I', 30), ('put', 30), ('yeah', 30), ('ball', 29), ('xxx', 27), ('it', 26), ('bulldozer', 26), ('see', 24), ('Shadow', 24), ('truck', 21), ('man', 21), ('read', 20), ('look', 20), ('record', 19), ('like', 18), ('two', 18), ('part', 17), ('tractor', 17), ('in', 17), ('pencil', 16), ('up', 16), ('meow', 16), ('you', 15), ('bunnyrabbit', 15), ('mine', 14), ('horn', 14), ('name', 14), ('hit', 14), ('sit', 14), ('hm', 14), ('find', 14), ('a', 13), ('doggie', 13), ('John', 13), ('here', 13), ('fall', 13), ('yep', 12), ('tatoo', 12), ('right', 12), ('Tuffy', 12), ('dere', 12), ('ride', 11), ('big', 11), ('pull', 11), ('light', 11), ('Judy', 11), ('fine', 11), ('Catherine', 10), ('Buzz', 10), ('open', 10), ('camping', 10), ('move', 10), ('shoe', 10), ('play', 10
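One thing to be aware of: glob does not promise to return the files in any particular order, so the order you get may depend on your system. The sketch below is a self-contained illustration (it creates a temporary folder with made-up file names, so it does not need the Brown corpus) of how wrapping the call in `sorted()` gives a predictable order.

```python
import glob
import os
import tempfile

# Build a throwaway folder with made-up file names, so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name in ['adam02.cha', 'adam01.cha', 'notes.txt', 'adam10.cha']:
        with open(os.path.join(folder, name), 'w') as f:
            f.write('@Begin\n')

    # '*.cha' matches any file name ending in '.cha'; sorted() fixes the order
    pattern = os.path.join(folder, '*.cha')
    found = sorted(glob.glob(pattern))
    names = [os.path.basename(path) for path in found]

print(names)
```

In the loop above, this would amount to writing `for file in sorted(glob.glob('*.cha')):`. Alphabetical order matches chronological order here only because the file names are zero-padded (adam01, adam02, and so on).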

### Increased use of "I"?

Scrolling through the data, one thing that I notice is that the word "I" seems to move up the ranks as a frequently uttered word as Adam gets older. Let's trace the change in frequency of the word "I". We could use the file names to track Adam's increasing age, but it would be nice to have his actual age. In this case, the ages are available on the website, but the data are also in the transcript, so let's extract them with an "if" statement.

In [14]:
import os
import glob
from collections import Counter
from string import punctuation as pnc

datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)
file = 'adam01.cha'

removelist = ['\t', '\r']


punct = set(pnc)

with open(file,'r') as f:
    text = f.read()
    for item in removelist:
        text = text.replace(item, '')
    text = text.split('\n')
    for item in text:
        if '|CHI|' in item:
            a = item
            
    print(a)
    a = a.split('|')
    print('')
    print(a)
    print('')
    age = a[3]
    print('age: ' + age)
    print('')


@ID:eng|Brown|CHI|2;3.04|male|typical|MC|Target_Child|||

['@ID:eng', 'Brown', 'CHI', '2;3.04', 'male', 'typical', 'MC', 'Target_Child', '', '', '']

age: 2;3.04
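The age comes out as a string. If you later want to sort or plot by age, a number is easier to work with. The sketch below assumes the age field follows the years;months.days pattern seen above (so "2;3.04" is 2 years, 3 months, 4 days) and converts it to whole months. The function name `age_to_months` is made up for this illustration.

```python
def age_to_months(age):
    """Convert an age string such as '2;3.04' to whole months (days are ignored)."""
    years, rest = age.split(';')      # '2' and '3.04'
    months = rest.split('.')[0]       # '3'
    return int(years) * 12 + int(months)

# The @ID line for Adam, as printed above
id_line = '@ID:eng|Brown|CHI|2;3.04|male|typical|MC|Target_Child|||'
age = id_line.split('|')[3]

print(age, age_to_months(age))
```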



### "I" through the ages

Now that we can get Adam's age from the transcript, let's use a "for" loop to cycle through all the Adam transcripts and find the frequency of the word "I" and track that by age. However, simply using the raw count may be misleading, because as Adam grows older he is probably saying more (and we could easily use Python to check this, right?). To take that into account, let's find the proportion of total tokens that are "I":

    icounter = 0  # start from zero in case "I" never occurs
    for item in freq:
        if item[0] == 'I':
            icounter = item[1]
    prop_freq = str(icounter/tokens)
    print('Frequency of "I" at age ' + str(age) + ': ' + prop_freq)
    
Finally, since we don't need the proportion to infinite decimal places, let's lop off everything after three decimal places. We can do that by only showing the first 5 characters in the string "prop_freq" (the leading "0", the decimal point, and three digits):

    prop_freq = str(icounter/tokens)
    print('Frequency of "I" at age ' + str(age) + ': ' + prop_freq[:5])
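Slicing the string works well for proportions like these. As a hedged alternative, the sketch below uses string formatting, which rounds to three decimal places instead of cutting the string off. It also shows one case where slicing gives a surprising result: very small numbers are printed in scientific notation. The counts used here are made up.

```python
icounter = 30      # made-up count of "I"
tokens = 2900      # made-up total number of tokens
proportion = icounter / tokens

sliced = str(proportion)[:5]               # keeps the first five characters
formatted = '{:.3f}'.format(proportion)    # rounds to three decimal places
print(sliced, formatted)

# A very small proportion is written in scientific notation by str(),
# so slicing no longer gives a number with three decimal places
tiny = 1 / 20000
tiny_sliced = str(tiny)[:5]
tiny_formatted = '{:.3f}'.format(tiny)
print(tiny_sliced, tiny_formatted)
```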


In [15]:
import os
import glob
from collections import Counter
from string import punctuation as pnc


datapath = '/Users/ethan/Desktop/Brown/Adam/'
os.chdir(datapath)

removelist = ['\t', '\r']

punct = set(pnc)

for file in glob.glob('*.cha'):
    with open(file,'r') as f:
        text = f.read()

        for item in removelist:
            text = text.replace(item, '')
        text = text.split('\n')
        
        for item in text:
            if '|CHI|' in item:
                a = item
            
        a = a.split('|')
        age = a[3]

        turns = []
        for line in text:
            if line.startswith('*CHI'):
                line = line.replace('*CHI:', '')
                line = ''.join(x for x in line if x not in punct) 
                line = line.strip()
                turns.append(line)

        allturns = ' '.join(turns)
        words = allturns.split()
        tokens = len(words)

        
        freq = Counter(words).most_common(len(words))

        icounter = 0  # reset for each transcript, in case "I" never occurs in this file
        for item in freq:
            if item[0] == 'I':
                icounter = item[1]
        prop_freq = str(icounter/tokens)
        print('Frequency of "I" at age ' + str(age) + ': ' + prop_freq[:5])

Frequency of "I" at age 2;3.04: 0.010
Frequency of "I" at age 2;3.18: 0.005
Frequency of "I" at age 2;4.03: 0.004
Frequency of "I" at age 2;4.15: 0.005
Frequency of "I" at age 2;4.30: 0.013
Frequency of "I" at age 2;5.12: 0.009
Frequency of "I" at age 2;6.03: 0.015
Frequency of "I" at age 2;6.17: 0.027
Frequency of "I" at age 2;7.01: 0.010
Frequency of "I" at age 2;7.14: 0.013
Frequency of "I" at age 2;8.01: 0.006
Frequency of "I" at age 2;8.16: 0.013
Frequency of "I" at age 2;9.04: 0.000
Frequency of "I" at age 2;9.18: 0.004
Frequency of "I" at age 2;10.02: 0.032
Frequency of "I" at age 2;10.16: 0.031
Frequency of "I" at age 2;10.30: 0.031
Frequency of "I" at age 2;11.13: 0.032
Frequency of "I" at age 2;11.28: 0.042
Frequency of "I" at age 3;0.11: 0.042
Frequency of "I" at age 3;0.25: 0.041
Frequency of "I" at age 3;1.09: 0.048
Frequency of "I" at age 3;1.26: 0.058
Frequency of "I" at age 3;2.09: 0.053
Frequency of "I" at age 3;2.21: 0.044
Frequency of "I" at age 3;3.04: 0.068
Frequen
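Looping through "freq" to find one word works, but a Counter can also be asked about a word directly, a bit like looking up an item by name. Here is a minimal sketch with made-up words. A useful detail is that a word which never occurs simply gets a count of 0 instead of causing an error.

```python
from collections import Counter

# A made-up sample standing in for the words of one transcript
words = ['big', 'drum', 'I', 'see', 'I', 'go']
counts = Counter(words)

i_count = counts['I']            # how many times "I" occurs
missing = counts['bulldozer']    # a word that never occurs counts as 0

print(i_count, missing)
print(i_count / len(words))      # proportion of tokens that are "I"
```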

### One more thing...

It would be nice to be able to save our hard-won data. We can do that too.

Let's make a [.csv] [csv] file so we can put our data in a spreadsheet.


[csv]: https://en.wikipedia.org/wiki/Comma-separated_values

In [16]:
from os import chdir as cd
pathout = '/Users/ethan/Desktop/'
newfile = 'Adam_words.csv'

cd(pathout)

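header = 'Frequency of Adam\'s words \n'
# Use a separate name for the file object so the file name string is not overwritten
with open(newfile, 'a+') as nf:
    nf.write(header)
# No close() is needed: the "with" block closes the file automatically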

This makes a new .csv file and puts it on the desktop. The csv format uses commas to separate data columns, so by combining that with '\n' to make new rows, we can build a spreadsheet from scratch. First, delete the new csv file from your desktop. We'll make it again below.
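Before doing that, here is a small sketch of what the comma-and-newline layout looks like, built with Python's own "csv" module as an alternative to joining strings by hand. The rows are made up, and `io.StringIO` is used as a stand-in for a real file so that nothing is written to your desktop.

```python
import csv
import io

# Made-up rows: a header followed by (age, proportion) pairs
rows = [['age', 'freq of I'], ['2;3.04', '0.010'], ['2;3.18', '0.005']]

# io.StringIO stands in for a real file so the example leaves nothing on disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(rows)

# Each row becomes one line, with commas between the columns
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module becomes more useful when a value itself contains a comma, because it adds the necessary quotation marks for you. For this lesson, building the lines by hand is enough.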

In [17]:
import glob
from os import chdir as cd
from collections import Counter
from string import punctuation as pnc

pathout = '/Users/ethan/Desktop/Adam_words.csv'
datapath = '/Users/ethan/Desktop/Brown/Adam/'

header = 'age,freq of I\n'

with open(pathout, 'a+') as nf:
    nf.write(header)

cd(datapath)

removelist = ['\t', '\r']

punct = set(pnc)

for file in glob.glob('*.cha'):
    with open(file,'r') as f:
        text = f.read()

        for item in removelist:
            text = text.replace(item, '')
        text = text.split('\n')
        
        for item in text:
            if '|CHI|' in item:
                a = item
            
        a = a.split('|')
        age = a[3]

        turns = []
        for line in text:
            if line.startswith('*CHI'):
                line = line.replace('*CHI:', '')
                line = ''.join(x for x in line if x not in punct) 
                line = line.strip()
                turns.append(line)

        allturns = ' '.join(turns)
        words = allturns.split()
        tokens = len(words)

        
        freq = Counter(words).most_common(len(words))

        icounter = 0  # reset for each transcript, in case "I" never occurs in this file
        for item in freq:
            if item[0] == 'I':
                icounter = item[1]
        prop_freq = str(icounter/tokens)
        
        newline = age + ',' + prop_freq[:5] + '\n'
    
        with open(pathout, 'a+') as nf:
            nf.write(newline)

print('All done!')

All done!


### Isn't that neato?

You should now have a .csv file on your desktop which can be imported by Excel or other spreadsheet programs.

# Quiz for everybody

We covered a lot in this notebook. Please make sure you understand what happened here! Ask questions if you don't!

1. Find some other words that you would like to trace in Adam's speech.
1. Make a spreadsheet with these word frequencies and his age, just as we did here for "I". Add columns for every word you choose to track.
1. Make a spreadsheet that tracks the frequency of the mother's use of a word of your choice.

## Bonus question, for those who can't get enough:

1. Write a script that loops through all three children and makes three different spreadsheets tracking frequency of "I" (or some other word) for all three children.

## Extra bonus question, for those who truly have nothing else to do:

1. Write a script that makes a single spreadsheet that tracks the frequency of any word for all three children. Make one column per child.

### Hints to the extra bonus question
1. You will probably need to abandon using the child's age; just use visit number instead.
1. The children have different numbers of visits. You will have to find a way to deal with that in your script.