Source: https://childes.talkbank.org/access/Eng-NA/Gathercole.html
Languages: eng
Ages: 2;9 to 6;6
Situation: At the lunch table at school

In [1]:
from nltk import word_tokenize, pos_tag
from nltk.corpus.reader import CHILDESCorpusReader
import pandas as pd


# Helper used to build a dictionary of the form {filename: transcript string}
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

class ChildesToPandaDataFrame:
    
    def __init__(self, corpus_root, format):
        
        self.corpusReader = CHILDESCorpusReader(corpus_root, format)
        
        self.data_dict = {}
        for filename in self.corpusReader.fileids():
            # Collect every word token of this transcript file
            self.data_dict[filename] = list(self.corpusReader.words(filename))
        
        data_combined = {key: [combine_text(value)] for (key, value) in self.data_dict.items()}
        
        pd.set_option('display.max_colwidth', 150)
        self.data_df = pd.DataFrame.from_dict(data_combined).transpose()
        self.data_df.columns = ['transcript']
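
The class above joins the words of every speaker into one string per file. For child-language work it can be useful to keep only one participant. The sketch below assumes the `speaker` keyword of `CHILDESCorpusReader.words()` (for example `speaker=['CHI']`). The helper name `transcripts_for_speaker` and the `StubReader` class are hypothetical; the stub only stands in for a real reader so that the snippet runs without the corpus files.

```python
def transcripts_for_speaker(reader, speaker_codes):
    """Build {fileid: transcript string} using only the given speaker codes."""
    return {
        fileid: ' '.join(reader.words(fileid, speaker=speaker_codes))
        for fileid in reader.fileids()
    }


class StubReader:
    """Hypothetical stand-in that mimics the two reader methods used above."""

    def __init__(self, utterances):
        # utterances: {fileid: [(speaker_code, word), ...]}
        self._utterances = utterances

    def fileids(self):
        return list(self._utterances)

    def words(self, fileid, speaker='ALL'):
        return [
            word
            for code, word in self._utterances[fileid]
            if speaker == 'ALL' or code in speaker
        ]


stub = StubReader({
    'x.xml': [('MOT', 'look'), ('CHI', 'a'), ('CHI', 'plane')],
})
child_only = transcripts_for_speaker(stub, ['CHI'])
everyone = transcripts_for_speaker(stub, 'ALL')
```

With a real corpus, the same helper would be called with the `CHILDESCorpusReader` instance in place of `stub`.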
    

In [2]:
to = ChildesToPandaDataFrame('Tommerdahl/', '.*.xml')

In [3]:
to.data_df

Unnamed: 0,transcript
AJC1.xml,so you know it should be fine I could leave him in here and go away for an hour or two no no no no we want you to talk to him and that's fine you ...
AJC2.xml,drove to c if you do needta go out don't bother asking you can just okay thank_you I've had a_lot_of coffee too this morning I have a are you maki...
AVW1.xml,let's see what we can see a bit lefti lighter it's a bit what lefti lighter it's a bit lighter what does that mean it means that it's lighter insi...
AVW2.xml,when did you play with them was it yesterday let's put was it yesterday when you played with these toys cold milk in yours and hot milk in mind oh...
BDO1.xml,look a doctor chair mum is that a doctor chair yeah take my shoes off can you see this door whose door is this like ours who's just had a door lik...
BDO2.xml,is he in there again in his little house yeah I put him in there oh did you yeah I wonder if he stayed there from where you put him in there yeah ...
CDH1.xml,yeah there we are thank_you okay yes have fun thank_you very much see you in a bit okay this is a teddy bear would you like some water Charlie no ...
CMC1.xml,it really doesn't matter I can't stress that enough it's helpful at some point if you can ask him about something that he's done in the past right...
CMC2.xml,pardon yeah why it doesn't work battery's gone just pretend yeah just pretend hello hey you guys when I ever to say meow say it then it a white me...
ECB1.xml,oh look at this oh look there's a fire station there look they'd be able to cut you out of the car if your foot got stuck what who's that next doo...


In [4]:
# Let's pickle it for later use
to.data_df.to_pickle("df_to.pkl")

In [5]:
# Let's take a look at the transcript for AJC1.xml
to.data_df.transcript.loc['AJC1.xml']

"so you know it should be fine I could leave him in here and go away for an hour or two no no no no we want you to talk to him and that's fine you know it doesn't matter if he tips something out is it alright if I take his power rangers out as_well yes that's fine that's fine yeah okay then all ready guys mommy yeah okay well we're all set okay let's get these shoes off do you wanna take your shoes off in here no yeah no no you wanna keep your shoes on yeah or take them off take them off take ooh there's one I can take them off well done okay we'll leave them over here yeah ooh what do we have what do we got in the box the heli helicopter hm plane is being rescued it's being rescued yeah I've got the fire engine can you see where's where's the fire engine where's the fire station no can you see a fire station in here umm there ooh I think that's a hotel what about that one it's got a bit of fire can we put him over there okay mom I decided on a helicop a helicopter umm I can what's tha

# Clean the data
## To analyze our transcripts properly, we first need to clean the data.
Common text data cleaning steps are:
 - Make text all lower case
 - Remove punctuation
 - Remove numerical values
 - Remove common nonsensical text (\n)
 - Tokenize text
 - Remove stop words
 - More...

More data cleaning steps after tokenization:
 - Stemming / lemmatization
 - Parts of speech tagging
 - Create bi-grams or tri-grams
 - Deal with typos
 - And more...
 
ref: [https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb]
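
Three of the steps listed above (tokenizing, removing stop words, building bi-grams) can be sketched with the standard library alone. This is a minimal illustration, not the method used later in this notebook: the `STOP_WORDS` set is a deliberately tiny, hypothetical list, and tokenization is a plain whitespace split.

```python
# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "the", "is", "it", "you", "to", "in", "and"}


def tokenize(text):
    """Split already cleaned, lowercased text on whitespace."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


sample = "is that a doctor chair"
tokens = tokenize(sample)
content_words = remove_stop_words(tokens)
pairs = bigrams(content_words)
```

Here `content_words` keeps `that`, `doctor` and `chair`, and `pairs` holds the two adjacent pairs built from them.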

In [6]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as (invalid) string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1
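
To see what each substitution in the first cleaning round contributes, the sketch below applies the same three patterns one at a time and records the intermediate strings. The names `STEPS` and `trace_cleaning` and the sample sentence are hypothetical and exist only for this illustration.

```python
import re
import string

# The same three patterns as the first cleaning round, applied one by one.
STEPS = [
    ("brackets", re.compile(r'\[.*?\]')),
    ("punctuation", re.compile('[%s]' % re.escape(string.punctuation))),
    ("digit words", re.compile(r'\w*\d\w*')),
]


def trace_cleaning(text):
    """Return a list of (step name, text after that step)."""
    text = text.lower()
    history = [("lowercase", text)]
    for name, pattern in STEPS:
        text = pattern.sub('', text)
        history.append((name, text))
    return history


sample = "Okay [laughs] it's 2 o'clock!"
history = trace_cleaning(sample)
final_text = history[-1][1]

# Deleted spans leave doubled spaces behind; splitting and rejoining collapses them.
normalized = ' '.join(final_text.split())
```

The trace shows that removed spans leave extra spaces in the text. That does not matter for a whitespace-insensitive tokenizer, but the last line shows one way to normalize it if needed.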

In [8]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(to.data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
AJC1.xml,so you know it should be fine i could leave him in here and go away for an hour or two no no no no we want you to talk to him and thats fine you k...
AJC2.xml,drove to c if you do needta go out dont bother asking you can just okay thankyou ive had alotof coffee too this morning i have a are you making te...
AVW1.xml,lets see what we can see a bit lefti lighter its a bit what lefti lighter its a bit lighter what does that mean it means that its lighter inside a...
AVW2.xml,when did you play with them was it yesterday lets put was it yesterday when you played with these toys cold milk in yours and hot milk in mind oh ...
BDO1.xml,look a doctor chair mum is that a doctor chair yeah take my shoes off can you see this door whose door is this like ours whos just had a door like...
BDO2.xml,is he in there again in his little house yeah i put him in there oh did you yeah i wonder if he stayed there from where you put him in there yeah ...
CDH1.xml,yeah there we are thankyou okay yes have fun thankyou very much see you in a bit okay this is a teddy bear would you like some water charlie no yo...
CMC1.xml,it really doesnt matter i cant stress that enough its helpful at some point if you can ask him about something that hes done in the past right whi...
CMC2.xml,pardon yeah why it doesnt work batterys gone just pretend yeah just pretend hello hey you guys when i ever to say meow say it then it a white meow...
ECB1.xml,oh look at this oh look theres a fire station there look theyd be able to cut you out of the car if your foot got stuck what whos that next door w...
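
The table above shows that underscore-joined compounds in the transcripts, such as `thank_you` and `a_lot_of`, collapse into `thankyou` and `alotof`, because the underscore is part of `string.punctuation`. If separate words are preferred, one option is to turn underscores into spaces before stripping punctuation. This is a sketch under that assumption; `clean_text_keep_compounds` is a hypothetical variant, not part of the original notebook.

```python
import re
import string


def clean_text_keep_compounds(text):
    """Same steps as the first cleaning round, but split underscore compounds first."""
    text = text.lower()
    text = text.replace('_', ' ')  # thank_you -> thank you
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


sample = "Thank_you I've had a_lot_of coffee"
cleaned = clean_text_keep_compounds(sample)
```

Whether to split or merge these compounds depends on the analysis: merging keeps fixed expressions as single vocabulary items, while splitting lets their parts count toward ordinary word frequencies.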


In [9]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [10]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
AJC1.xml,so you know it should be fine i could leave him in here and go away for an hour or two no no no no we want you to talk to him and thats fine you k...
AJC2.xml,drove to c if you do needta go out dont bother asking you can just okay thankyou ive had alotof coffee too this morning i have a are you making te...
AVW1.xml,lets see what we can see a bit lefti lighter its a bit what lefti lighter its a bit lighter what does that mean it means that its lighter inside a...
AVW2.xml,when did you play with them was it yesterday lets put was it yesterday when you played with these toys cold milk in yours and hot milk in mind oh ...
BDO1.xml,look a doctor chair mum is that a doctor chair yeah take my shoes off can you see this door whose door is this like ours whos just had a door like...
BDO2.xml,is he in there again in his little house yeah i put him in there oh did you yeah i wonder if he stayed there from where you put him in there yeah ...
CDH1.xml,yeah there we are thankyou okay yes have fun thankyou very much see you in a bit okay this is a teddy bear would you like some water charlie no yo...
CMC1.xml,it really doesnt matter i cant stress that enough its helpful at some point if you can ask him about something that hes done in the past right whi...
CMC2.xml,pardon yeah why it doesnt work batterys gone just pretend yeah just pretend hello hey you guys when i ever to say meow say it then it a white meow...
ECB1.xml,oh look at this oh look theres a fire station there look theyd be able to cut you out of the car if your foot got stuck what whos that next door w...


## Make Document-Term Matrix (DTM)

In [11]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
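# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())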
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,able,abracadabra,accident,ache,aches,acorn,acrobat,actually,adam,add,...,yum,yummilicious,yummy,zebra,zoo,zoom,zooming,zoos,zzz,æta
AJC1.xml,0,0,0,0,0,0,0,1,2,0,...,6,0,2,0,0,0,1,0,0,0
AJC2.xml,1,0,0,0,0,0,0,0,1,0,...,4,0,0,0,4,0,0,0,0,0
AVW1.xml,0,0,0,0,0,0,1,1,0,0,...,2,0,0,0,0,0,0,0,0,0
AVW2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BDO1.xml,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
BDO2.xml,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
CDH1.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CMC1.xml,0,0,0,17,1,0,0,0,0,1,...,0,0,1,0,4,0,0,0,0,0
CMC2.xml,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,2,0,0,0,0,0
ECB1.xml,2,0,0,0,0,0,0,0,0,0,...,8,0,0,0,0,0,0,0,0,0
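
To make the layout of the table above concrete, here is a standard-library sketch of a document-term matrix: one row per document, one column per vocabulary term, and each cell holding a count. It is an illustration only and not how `CountVectorizer` is implemented: it splits on whitespace and removes no stop words. The function name `build_dtm` and the toy transcripts are hypothetical.

```python
from collections import Counter


def build_dtm(documents):
    """Return (vocabulary, rows) with one count row per document.

    Columns follow the alphabetically sorted vocabulary.
    """
    counts = {name: Counter(text.split()) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {
        name: [term_counts[term] for term in vocabulary]
        for name, term_counts in counts.items()
    }
    return vocabulary, rows


# Hypothetical toy transcripts, not taken from the corpus
docs = {
    "a.xml": "yum yum zoo",
    "b.xml": "zoo zebra",
}
vocabulary, rows = build_dtm(docs)
```

Each list in `rows` lines up with `vocabulary`, just as each row of the DataFrame lines up with its column headers.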


In [12]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm_to.pkl")

In [38]:
import pickle

# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean_to.pkl')
# Use a context manager so the file is closed once the object is written
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)
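
A later notebook can read these files back with `pickle.load` (or `pd.read_pickle` for the DataFrames). The sketch below shows the save and load round trip with the standard library only. The helpers `save_object` and `load_object` are hypothetical, and a plain dictionary stands in for the fitted vectorizer so that the snippet runs on its own in a temporary directory.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle obj to path, closing the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for the fitted CountVectorizer here.
stand_in = {"vocabulary": ["yum", "zoo"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cv.pkl")
    save_object(stand_in, path)
    restored = load_object(path)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.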