Source: https://childes.talkbank.org/access/Eng-NA/Gathercole.html
Languages: eng
Aged 2;9 to 6;6 
Situation: At the lunch table at school

In [22]:
from nltk import word_tokenize, pos_tag
from nltk.corpus.reader import CHILDESCorpusReader
import pandas as pd


# Helper for building a dictionary of the form {filename: transcript string}
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

class ChildesToPandaDataFrame:
    '''Reads a CHILDES XML corpus and stores one transcript string per file in a DataFrame.'''

    def __init__(self, corpus_root, fileids):
        # fileids is a regular expression matching the transcript files to load
        self.corpusReader = CHILDESCorpusReader(corpus_root, fileids)

        # Map each filename to its list of word tokens
        self.data_dict = {}
        for filename in self.corpusReader.fileids():
            self.data_dict[filename] = list(self.corpusReader.words(filename))

        # Join each token list into a single transcript string
        data_combined = {key: [combine_text(value)] for (key, value) in self.data_dict.items()}

        pd.set_option('display.max_colwidth', 150)
        self.data_df = pd.DataFrame.from_dict(data_combined).transpose()
        self.data_df.columns = ['transcript']
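
The dictionary-building step in the class above can be sketched without the corpus itself. This is a minimal illustration under stated assumptions: the token lists in `tokens_by_file` are made-up stand-ins for what `corpusReader.words(filename)` returns, not data read from the Gathercole files.

```python
# Stand-in token lists (hypothetical); in the notebook these come from
# corpusReader.words(filename) for each file in the corpus.
tokens_by_file = {
    "01.xml": ["can", "I", "sit", "and", "watch", "you", "eat", "today"],
    "02.xml": ["I", "got", "a", "low", "chair"],
}

# Join each token list into one string. The string is wrapped in a
# one-element list so that DataFrame.from_dict(...).transpose() would
# produce one row per file with a single transcript column.
data_combined = {
    name: [" ".join(tokens)] for name, tokens in tokens_by_file.items()
}

for name, (transcript,) in data_combined.items():
    print(name, "->", transcript)
```

Each value is a one-element list, which is why the resulting DataFrame needs the `transpose()` call before the column is named `transcript`.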
    

In [23]:
ga = ChildesToPandaDataFrame('Gathercole/', r'.*\.xml')

In [24]:
ga.data_df

file,transcript
01.xml,xxx Rach you wanted me to see who Jeffy is yeah and he's gonna sit right there right that's my Mommy can I sit and watch you eat today my my Daddy...
02.xml,I got a low chair www I got mine lower chair get me get me I'm a lower chair okay I'll do it to you give me lower chair okay will you give me lowe...
03.xml,this is yyy whoa be careful Sarah my shoe fell off here you need help my shoe needs tying I sit by you today right you're not kiddin these hard sh...
04.xml,hi how are you Nicole how are you Brian fine hi this chair is deep it's deep yeah I thought it was up here yeah you know th th they said you did u...
05.xml,macaroni macaroni Bryan macaroni to many know what I eat know what I eat you know what what I decided eat somebody xxx what I decided eat some th...
06.xml,give that to Matthew after Rach no how are you Luke fine is that warm in there where your arms are hunm cold no cold yeah God I'm cold next time ...
07.xml,hi Saasha hi how are you fine I wonder Nicole's takin off her stuff so she will be kinda late her clothes and stuff mhm Rachel oh gosh Rachel I kn...
08.xml,hi how are you fine oh they made mess real mess in here I don't like that mess no that they made with the beans you mean Lily yeah this is Sarah_L...
09.xml,great these are pancakes dy dyou do you know what what what Matthew they keep well Lleana and Bobbi only serve people things two times a day is th...
10.xml,this is good my favorite mm this is good my favorite mm you like spaghetti mhm so do so does Nate that's my does he Sarah uhhuh uhhuh my my brothe...


In [26]:
# Let's pickle it for later use
ga.data_df.to_pickle("df_ga.pkl")

In [28]:
# Let's take a look at the transcript for 08.xml
ga.data_df.transcript.loc['08.xml']

"hi how are you fine oh they made mess real mess in here I don't like that mess no that they made with the beans you mean Lily yeah this is Sarah_Lastname hi and Megan_Lastname hi I sit at this table today in my class right I sit in your class too we both sit in your class look at that mess oh isn't that a dumb mess with those beans isn't it a dumb mess mhm we get to sit over in my class right soohoh do I yeah how come you have that oh just for fun oh how come you have that just for fun wha how do you have that you're silly aren't you no no no crazy I just got pink undershirt yyy that's my favorite color Mommy see if you can see me when I put this in my mailbox did you pick it out no okay your Mom did ahhah I just got it I just got a I just um I just got it from K_mart I hadta scoot a little I hadta scoot a little that's what I'm saying right Megan you don't you don't have your lunch yet what you don't have your lunch yet jello that's hot that's peaches right mhm peaches and jello righ

# Clean the data
To analyze the transcripts properly, we first need to clean the data.
Common text data cleaning steps are:
 - Make text all lower case
 - Remove punctuation
 - Remove numerical values
 - Remove common nonsensical text (e.g. \n)
 - Tokenize text
 - Remove stop words
 - More...

More data cleaning steps after tokenization:
 - Stemming / lemmatization
 - Parts of speech tagging
 - Create bi-grams or tri-grams
 - Deal with typos
 - And more...
 
ref: [https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/1-Data-Cleaning.ipynb]
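
Several of the steps listed above (tokenizing, removing stop words, building bi-grams) can be sketched with the standard library alone. This is an illustrative sketch rather than the method used later in this notebook: `STOP_WORDS` is a deliberately tiny, hypothetical list, and the helper names `tokenize`, `remove_stop_words` and `bigrams` are made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"i", "a", "the", "you", "to", "and", "is", "my", "that"}


def tokenize(text):
    """Lowercase the text and split it into tokens of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


sample = "I don't like that mess"
tokens = tokenize(sample)
content = remove_stop_words(tokens)
pairs = bigrams(content)

print(tokens)   # ['i', "don't", 'like', 'that', 'mess']
print(content)  # ["don't", 'like', 'mess']
print(pairs)    # [("don't", 'like'), ('like', 'mess')]
```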

In [29]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1
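
Because this first round deletes punctuation outright, tokens joined by an underscore are fused: `Sarah_Lastname` ends up as `sarahlastname`, and `don't` becomes `dont`. If the word boundary matters for later analysis, one possible variant is sketched below. The function name `clean_text_keep_boundaries` is hypothetical, and the choice to delete apostrophes while replacing all other punctuation with a space is an assumption about what is wanted downstream, not part of the original tutorial.

```python
import re
import string


def clean_text_keep_boundaries(text):
    """Hypothetical variant of the first cleaning round.

    Lowercases, removes bracketed text, deletes apostrophes, replaces all
    other punctuation with a space, and removes tokens containing digits.
    """
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    # Delete apostrophes so that "don't" still becomes "dont".
    text = text.replace("'", "")
    # Replace the remaining punctuation (including "_") with a space
    # so that joined tokens are split instead of fused.
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    text = re.sub(r"\w*\d\w*", "", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


cleaned = clean_text_keep_boundaries("This is Sarah_Lastname, I don't like that mess!")
print(cleaned)  # this is sarah lastname i dont like that mess
```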

In [30]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(ga.data_df.transcript.apply(round1))
data_clean

file,transcript
01.xml,xxx rach you wanted me to see who jeffy is yeah and hes gonna sit right there right thats my mommy can i sit and watch you eat today my my daddy j...
02.xml,i got a low chair www i got mine lower chair get me get me im a lower chair okay ill do it to you give me lower chair okay will you give me lower ...
03.xml,this is yyy whoa be careful sarah my shoe fell off here you need help my shoe needs tying i sit by you today right youre not kiddin these hard sho...
04.xml,hi how are you nicole how are you brian fine hi this chair is deep its deep yeah i thought it was up here yeah you know th th they said you did um...
05.xml,macaroni macaroni bryan macaroni to many know what i eat know what i eat you know what what i decided eat somebody xxx what i decided eat some th...
06.xml,give that to matthew after rach no how are you luke fine is that warm in there where your arms are hunm cold no cold yeah god im cold next time i...
07.xml,hi saasha hi how are you fine i wonder nicoles takin off her stuff so she will be kinda late her clothes and stuff mhm rachel oh gosh rachel i kno...
08.xml,hi how are you fine oh they made mess real mess in here i dont like that mess no that they made with the beans you mean lily yeah this is sarahlas...
09.xml,great these are pancakes dy dyou do you know what what what matthew they keep well lleana and bobbi only serve people things two times a day is th...
10.xml,this is good my favorite mm this is good my favorite mm you like spaghetti mhm so do so does nate thats my does he sarah uhhuh uhhuh my my brother...


In [31]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    # Curly quotes and ellipses are not part of string.punctuation
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    return text

round2 = clean_text_round2

In [32]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

file,transcript
01.xml,xxx rach you wanted me to see who jeffy is yeah and hes gonna sit right there right thats my mommy can i sit and watch you eat today my my daddy j...
02.xml,i got a low chair www i got mine lower chair get me get me im a lower chair okay ill do it to you give me lower chair okay will you give me lower ...
03.xml,this is yyy whoa be careful sarah my shoe fell off here you need help my shoe needs tying i sit by you today right youre not kiddin these hard sho...
04.xml,hi how are you nicole how are you brian fine hi this chair is deep its deep yeah i thought it was up here yeah you know th th they said you did um...
05.xml,macaroni macaroni bryan macaroni to many know what i eat know what i eat you know what what i decided eat somebody xxx what i decided eat some th...
06.xml,give that to matthew after rach no how are you luke fine is that warm in there where your arms are hunm cold no cold yeah god im cold next time i...
07.xml,hi saasha hi how are you fine i wonder nicoles takin off her stuff so she will be kinda late her clothes and stuff mhm rachel oh gosh rachel i kno...
08.xml,hi how are you fine oh they made mess real mess in here i dont like that mess no that they made with the beans you mean lily yeah this is sarahlas...
09.xml,great these are pancakes dy dyou do you know what what what matthew they keep well lleana and bobbi only serve people things two times a day is th...
10.xml,this is good my favorite mm this is good my favorite mm you like spaghetti mhm so do so does nate thats my does he sarah uhhuh uhhuh my my brother...


## Make Document-Term Matrix (DTM)

In [39]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
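# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names() instead
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())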
data_dtm.index = data_clean.index
data_dtm

file,accident,actually,add,afraid,agh,agreeable,ah,ahead,ahhah,air,...,youre,youve,yuck,yucky,yum,yummy,yyy,yyys,zero,zoo
01.xml,0,0,0,0,0,0,0,1,0,0,...,11,1,0,0,0,0,23,1,0,1
02.xml,0,0,0,0,0,0,1,0,0,0,...,19,0,1,2,0,3,13,0,0,0
03.xml,1,0,0,0,1,0,0,0,1,0,...,5,0,0,0,7,0,14,0,0,0
04.xml,0,1,2,0,3,0,0,1,3,2,...,16,0,3,1,0,3,28,0,0,0
05.xml,0,0,0,0,3,0,0,0,2,0,...,4,0,6,2,0,0,30,0,0,0
06.xml,0,0,0,0,1,0,0,0,3,0,...,12,0,0,0,0,0,28,0,0,0
07.xml,0,0,0,0,0,0,1,2,5,0,...,7,1,1,1,0,1,14,0,2,0
08.xml,0,0,0,0,0,0,1,0,3,0,...,13,2,0,0,3,0,21,0,3,0
09.xml,0,0,0,0,0,0,0,0,1,1,...,4,0,1,0,0,4,18,0,0,0
10.xml,0,0,0,0,0,0,2,0,1,0,...,5,0,2,4,6,2,17,0,0,0
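
The `yyy` column above has high counts in every file. In CHAT transcripts, `xxx`, `yyy` and `www` are placeholder codes for unintelligible or untranscribed material rather than words, so it may be worth excluding them. The sketch below shows the idea of a document-term matrix with such a filter, using only the standard library. It is not the `CountVectorizer` pipeline from the cell above, the two documents are made up, and the name `PLACEHOLDERS` is hypothetical.

```python
from collections import Counter

# CHAT placeholder codes to exclude (assumed unwanted for word counts).
PLACEHOLDERS = {"xxx", "yyy", "www"}

# Made-up stand-ins for two cleaned transcripts.
docs = {
    "a.xml": "xxx this is good my favorite yyy",
    "b.xml": "this is yummy yyy yummy",
}

# Count the words in each document, skipping the placeholder codes.
counts = {
    name: Counter(w for w in text.split() if w not in PLACEHOLDERS)
    for name, text in docs.items()
}

# The vocabulary is the sorted union of all words seen in any document.
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term.
matrix = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

With `CountVectorizer`, a comparable effect could be obtained by passing a custom stop-word list that includes these codes.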


In [40]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm_ga.pkl")

In [38]:
import pickle

# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean_ga.pkl')
# Use a context manager so the file is closed once the object is written
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)
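
Loading a pickled object back follows the same pattern in reverse. The round trip below is a minimal sketch: `obj` is a plain dictionary standing in for the fitted `CountVectorizer`, and the file name `cv_demo.pkl` in a temporary directory is hypothetical so that the example does not overwrite the files saved above.

```python
import os
import pickle
import tempfile

# Stand-in object (hypothetical); in the notebook this would be cv.
obj = {"vocabulary": ["mess", "beans"], "stop_words": "english"}

# Write to a temporary directory so no real notebook file is touched.
path = os.path.join(tempfile.mkdtemp(), "cv_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(obj, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == obj)  # True
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.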