# Analyzing syntax in corpora

In the reading by Reitter and Moore, the authors analyzed the Map Task corpus to see if shared syntactic structures between the two participants contributed to their success at the task (that is, accurately recreating the route on the map).  In this assignment, you will look at a somewhat simpler kind of syntactic structure, namely questions.  We will try to extract from this corpus cases where one person asked a question, and then look at what kind of response the other person made.  The corpus does not transcribe punctuation, so we can't rely on question marks!  :-)

This assignment is not broken down so finely into separate steps.  Instead, it is divided only into a few conceptual sections, and within each section you can structure your code as you think best, using one or more code cells.  Whatever you do, remember to display a reasonably-sized sample of your intermediate results along the way so that the reader can see what you're up to.

Also, instead of just having discussion questions at the end, the assignment has some questions with each section that you will answer in a markdown cell.  These questions are aimed at getting you to think about each step in the process and explain your thinking about it.  There will still be a few discussion questions at the end.

Doing things in this way will help you prepare for your final project.  In your project, you will want to "narrate" your exploration of your topic by interleaving code and text, explaining at each stage why you chose to do what you did.

Remember that when working on the assignment, you may need to glance at the actual data files from time to time, or print out some of your results, in order to check that your code is working right, or to get an idea on how to write your code.  It can also be a good idea to read over the whole assignment first to get a sense of the overall plan.

## Loading in the data

To begin with, we want to load in the Map Task corpus (which we already worked with in an exercise in class).  We are interested in cases where one speaker asked a question and the other answered it.  Therefore, we will begin by looking for sequences of two lines spoken by different speakers.  (Not all of these are questions, but we will try to handle that later.)  In addition, we want to extract the "result size" which is in one of the header lines in each file.  (This number represents the total number of lines spoken in that conversation.)  So your task is to do the following for all the files in the corpus:

* Extract the "result size" from the second line of the file, convert it to an integer, and store it in a dictionary where the keys are the file labels (like "q2ec4") and the values are these line counts.  So your dictionary should look something like `{"q2ec4": 112, "q3nc1": 86, ...}`
* Separate the speaker label of each line from the actual dialogue
* Extract all pairs of consecutive lines spoken by two different speakers (i.e., all cases where one person responded to the other, rather than a single person continuing to speak).
* Store all these pairs into some kind of data structure that records, for each line pair:
    * The file label (e.g., "q2ec4")
    * Which speaker spoke first in the pair of lines (either "g" or "f")
    * The first line of dialogue
    * The second line of dialogue
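The pairing step described above can be sketched on a toy example. This is only a minimal sketch, assuming each dialogue line consists of a speaker label, a tab, and the utterance; the helper name `consecutive_pairs` and the sample lines are made up for illustration and are not taken from the corpus.

```python
def consecutive_pairs(lines):
    """Return (first_speaker, first_line, second_line) for each speaker change."""
    # Split each raw line into (speaker, utterance); assumes a tab separator.
    parsed = [tuple(part.strip() for part in line.split("\t", 1)) for line in lines]
    pairs = []
    # zip(parsed, parsed[1:]) walks over every adjacent pair of lines.
    for (spk1, utt1), (spk2, utt2) in zip(parsed, parsed[1:]):
        if spk1 != spk2:
            pairs.append((spk1, utt1, utt2))
    return pairs


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "g\tdo you have a lighthouse",
    "f\tyes",
    "f\tit's near the top",
    "g\tokay",
]
pairs = consecutive_pairs(sample)
print(pairs)
```

The middle pair (two lines by "f") is skipped because the speaker does not change. Note that pairing with `zip` also considers the very first dialogue line as a possible first member of a pair.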

In [1]:
import os
import re
import pandas as pd

In [2]:
directory = 'maptask_data/Transcripts/'
len(os.listdir(directory))

129

In [3]:
rsize_list = []
table = []
for filename in os.listdir(directory):
    # skip Jupyter's checkpoint folder
    if filename != '.ipynb_checkpoints':
        # read the whole transcript; the with-block closes the file afterwards
        with open(directory + filename) as infile:
            f = infile.read()

        # get result size from the header ("Result size: <number>;")
        match = re.search(r'Result size: (\d+);', f)
        rsize = int(match.group(1))
        rsize_list.append((filename[:-4], rsize))

        lines = f.splitlines()
        speaker = None
        prevline = lines[3]
        # get data for file label, first speaker, first and second line of dialogue
        for line in lines[4:]:
            if speaker != line[:1] and speaker in ['g', 'f']:
                row = (filename, speaker, prevline[2:].strip(), line[2:].strip())
                table.append(row)
            prevline = line
            speaker = line[:1]

rsize_dict = dict(rsize_list)

In [4]:
df = pd.DataFrame(table, columns=["file","speaker","first","second"])
df.tail(100)

Unnamed: 0,file,speaker,first,second
18595,q2nc6.txt,f,yeah,you've to go below the noose about two inches ...
18596,q2nc6.txt,g,you've to go below the noose about two inches ...,okay
18597,q2nc6.txt,f,and then across,and then across
18598,q2nc6.txt,g,do you have that,no i don't
18599,q2nc6.txt,f,no i don't,okay
...,...,...,...,...
18690,q3ec5.txt,g,just underneath the top corner of it,oh the top corner on the right
18691,q3ec5.txt,f,oh the top corner on the right,on the left
18692,q3ec5.txt,g,on the left,on the left
18693,q3ec5.txt,f,so it's the top corner,just so if you just stop you know anywhere


**🧐 Questions:**

1. What kind of data structure did you use to store the lines of dialogue?  Why?
2. We know that not every sequence of two lines spoken by two different people is a question-answer pair.  Is the converse true?  If a question is asked by one speaker and answered by the other, is it always the case that the question and answer will appear in two consecutive lines of dialogue?
3. What other complications might we want to think about at this stage of data processing?

1. I used a dataframe because it was the first suitable thing I thought of and I know how to make it. It has different columns to store the data, I can have many observations, and there are many existing operations to work on dataframes.
2. No, the converse probably isn't true either. A speaker might ask a question, then add more context or qualification in their next consecutive line. Or the answerer might say a line that doesn't answer the question, like "mhm" or this bit from q1ec4.txt lines 91-94 (towards the end):
'''
f	is that how it finish 	
g	take it to the lighthouse 	
g	uh-huh and that's it finished
'''
3. Maybe because of the issues from 2. we should have collapsed the lines from the same speaker into "speaker turns" like in the Discourse analysis notebook. There'd be plenty of non-question and non-answer content though, even worse than now.
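Collapsing lines into speaker turns could look roughly like the following. This is a sketch, not what the notebook above actually does; the function name `collapse_turns` is made up, and the input is assumed to be a list of (speaker, utterance) tuples. The sample lines are the q1ec4 lines quoted in answer 2.

```python
from itertools import groupby


def collapse_turns(parsed_lines):
    """Merge consecutive lines by the same speaker into one turn."""
    turns = []
    # groupby groups *adjacent* items with the same key, which is what a turn is.
    for speaker, group in groupby(parsed_lines, key=lambda pair: pair[0]):
        text = " ".join(utterance for _, utterance in group)
        turns.append((speaker, text))
    return turns


lines = [
    ("f", "is that how it finish"),
    ("g", "take it to the lighthouse"),
    ("g", "uh-huh and that's it finished"),
]
turns = collapse_turns(lines)
print(turns)
```

With turns instead of lines, the follower's question is paired with everything the giver said in reply, including the "uh-huh" that only arrives in the giver's second line.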

## Finding questions

Now we will use the `spacy` library to see if we can identify questions within the data.

In English, questions are often (but not always!) characterized by what is known as "subject-auxiliary inversion".  This is where the auxiliary verb and the subject switch places from their order in a declarative sentence (and the auxiliary "do" appears if no other is present).  So for instance a declarative sentence might be like "Corpus linguistics has changed his life", where "corpus linguistics" is the subject and "has" is the auxiliary; the subject comes first, as we are familiar with.  But in question form this becomes "Has corpus linguistics changed his life?"  Now the auxiliary is first and the subject comes after.  If the sentence had no auxiliary, like "Corpus linguistics rocks your socks", then some form of the auxiliary "do" would appear, as in "Does corpus linguistics rock your socks?"

In both cases, the crucial fact that we will try to exploit is that there is an auxiliary *before* the subject of the same clause.

For each question we find, we will want to store the following information:

* The label of the file in which it occurred (e.g., "q1nc3")
* Whether the first speaker (i.e., the one asking the question) was the direction-giver ("g") or the follower ("f")
* The number of shared *lemmas* between the two lines *which are members of open word classes*.  We'll assume that the open word classes are noun, verb, adjective, and adverb.
* Whether the second line of dialogue includes any words that mean something like "yes".
* Whether the second line of dialogue includes any words that mean something like "no".

You will have to decide for yourself exactly what words to count for "yes" and "no".  If you like, instead of just recording a true/false value for whether any such word occurred, you could count the number of yes-words and the number of no-words in the second line.
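As a rough illustration of the last three measurements, here is a minimal sketch that works on plain `(lemma, pos)` pairs standing in for the `.lemma_` and `.pos_` attributes of spacy tokens. The function names, the word lists, and the hand-assigned tags are assumptions made for this example; a real spacy parse may tag the words differently.

```python
OPEN_CLASSES = {"NOUN", "VERB", "ADJ", "ADV"}
YES_WORDS = {"yes", "yeah", "yep", "mmhmm", "uh-huh"}  # illustrative subset
NO_WORDS = {"no", "nah", "nope"}


def shared_open_class_lemmas(first, second):
    """Count lemmas of open-class words that occur in both lines."""
    first_lemmas = {lemma for lemma, pos in first if pos in OPEN_CLASSES}
    second_lemmas = {lemma for lemma, pos in second if pos in OPEN_CLASSES}
    return len(first_lemmas & second_lemmas)


def count_yes_no(line):
    """Return (number of yes-words, number of no-words) in a raw line."""
    words = line.split()
    return sum(w in YES_WORDS for w in words), sum(w in NO_WORDS for w in words)


# Hand-tagged example; the tags are assumed, not produced by spacy.
question = [("have", "AUX"), ("you", "PRON"), ("get", "VERB"),
            ("a", "DET"), ("footbridge", "NOUN")]
answer = [("yeah", "INTJ"), ("I", "PRON"), ("have", "VERB"),
          ("a", "DET"), ("footbridge", "NOUN")]

n_shared = shared_open_class_lemmas(question, answer)
yes_no = count_yes_no("yeah mmhmm")
print(n_shared, yes_no)
```

Only "footbridge" counts as shared here: "a" appears in both lines but is a determiner, and "have" is tagged as an auxiliary in the question. With real spacy output, the pairs could be built as `[(t.lemma_, t.pos_) for t in nlp(line)]`.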

So you need to loop over all the pairs you gathered in the previous step, and for each one:

* Apply your spacy `nlp` function/object thingamajig to the first line of the pair.
* Look at each word in the resulting spacy "document" and see if it has a "dependency relation" (`.dep_`) indicating that it is the subject of a clause.  (You may have to look up what the name of this relation is in spacy's terminology; or you can try applying spacy to some sample sentences to figure out what it calls the subject.)  Note that there may be more than one subject in a line of dialogue (since there could be multiple clauses).
* For each subject. . .
    * Get its "head"
    * Look at all the "children" of this head, and see if any of them have a part of speech (`.pos_`) of "AUX"
    * If you find one, check its index (`.i`) and see if it is less than the index of the subject word.  If it is, then we'll assume this is a question, and we'll grab the information mentioned above and store it in our list of results.

Finally convert your list of results to a DataFrame.  It should have five columns corresponding to the five pieces of information listed above.
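The subject/head/children check described in the steps above can be sketched without loading a spacy model. The `Tok` class and `build` helper below are hypothetical stand-ins that mimic only the token attributes used here (`.i`, `.dep_`, `.pos_`, `.head`, `.children`), and the parses are hand-annotated assumptions rather than real spacy output.

```python
class Tok:
    """Hypothetical stand-in for a spacy token (only the attributes used below)."""

    def __init__(self, i, text, dep_, pos_):
        self.i, self.text, self.dep_, self.pos_ = i, text, dep_, pos_
        self.head = self      # as in spacy, the root is its own head
        self.children = []


def build(words, pos, deps, heads):
    """Create linked tokens from parallel lists; heads are absolute indices."""
    toks = [Tok(i, w, d, p) for i, (w, p, d) in enumerate(zip(words, pos, deps))]
    for tok, h in zip(toks, heads):
        tok.head = toks[h]
        if h != tok.i:
            toks[h].children.append(tok)
    return toks


def has_inversion(doc):
    """True if some subject has an auxiliary under the same head that precedes it."""
    for tok in doc:
        if tok.dep_ != "nsubj":
            continue
        if any(c.pos_ == "AUX" and c.i < tok.i for c in tok.head.children):
            return True
    return False


question = build(["have", "you", "got", "that"],
                 ["AUX", "PRON", "VERB", "PRON"],
                 ["aux", "nsubj", "ROOT", "dobj"],
                 [2, 2, 2, 2])
statement = build(["you", "have", "got", "that"],
                  ["PRON", "AUX", "VERB", "PRON"],
                  ["nsubj", "aux", "ROOT", "dobj"],
                  [2, 2, 2, 2])
print(has_inversion(question), has_inversion(statement))
```

Because `has_inversion` returns a single True/False per line, a line with a repeated auxiliary or subject would be recorded once rather than once per match. The same function should work on a real spacy `Doc`, since it only reads the attributes named above.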

In [5]:
import spacy

In [6]:
nlp = spacy.load('en_core_web_md')

In [7]:
yeswords = ['yes','right','mmhmm','uh-huh','yeah','yep','yup','okay','mm','aye']
nowords = ['no','nah','nope']

open_word_classes = ['NOUN','VERB','ADJ','ADV']

In [8]:
table2 = []
for n, line in enumerate(df['first']):
    doc = nlp(line)
    for word in doc:
        # only consider tokens that are the subject of a clause
        if word.dep_ != "nsubj":
            continue
        for child in word.head.children:
            # an auxiliary before the subject suggests subject-auxiliary inversion
            if child.pos_ == "AUX" and child.i < word.i:
                # lemmas of both lines (note: not restricted to open_word_classes)
                first_lemmas = {tok.lemma_ for tok in doc}
                second_lemmas = {tok.lemma_ for tok in nlp(df['second'][n])}
                shared_lemmas = len(first_lemmas & second_lemmas)

                # count yeses and nos in second line
                second_words = df['second'][n].split()
                yeses = sum(w in yeswords for w in second_words)
                nos = sum(w in nowords for w in second_words)

                table2.append((df['file'][n][:-4], df['speaker'][n], shared_lemmas, yeses, nos))

In [9]:
# this one was purely for testing purposes but it lays out the questions conveniently

# only look at rows 301 through 500 of the pairs table
for n in range(301, 501):
    line = df['first'][n]
    for word in nlp(line):
        if word.dep_ != "nsubj":
            continue
        for child in word.head.children:
            if child.pos_ == "AUX" and child.i < word.i:
                print("LINE: ", line)
                print("\t", df['second'][n])

LINE:  so where do you want me to go when i come down this mountain
	 ehm
LINE:  how far along to the right do you want me to go
	 about two inches
LINE:  have you got that
	 nope
LINE:  have you got a rift valley
	 yeah
LINE:  below the rift valley have you got any rocks
	 mmhmm
LINE:  um have you got any white water
	 nope
LINE:  have you got anything in the middle of the page
	 rapids
LINE:  have you got anything underneath the white water
	 a manned fort
LINE:  um have you got a stone creek underneath that as well
	 yeah mmhmm
LINE:  um have you got an outlaws' hideout
	 no
LINE:  have you got anything bel-- beneath the rocks
	 oh yeah


In [10]:
df2 = pd.DataFrame(table2,columns=['file','speaker','shared_lemmas','yeses','nos'])
print("nonzero number of yeses, nos, shared lemmas")
print(len(df2[df2['yeses']>0]),len(df2[df2['nos']>0]),len(df2[df2['shared_lemmas']>0]))

nonzero number of yeses, nos, shared lemmas
796 345 589


In [11]:
df2.head(10)

Unnamed: 0,file,speaker,shared_lemmas,yeses,nos
0,q7nc7,g,0,2,0
1,q7nc7,g,0,0,1
2,q7nc7,g,1,0,1
3,q7nc7,g,3,0,0
4,q7nc7,f,1,1,0
5,q7nc7,g,0,0,1
6,q7nc7,g,0,1,0
7,q7nc7,f,0,0,0
8,q7nc7,f,0,1,0
9,q7nc7,g,4,0,0


**🧐 Questions:**

1. Will every sentence including this kind of subject-auxiliary inversion be a question?  What kinds of spurious results might be included?
2. Will every question display this subject-auxiliary inversion?  What kinds of questions might be missed by our analysis?
3. Our procedure for finding subject-auxiliary inversion was to find auxiliaries which are "children" of the thing that is the head of the subject.  As best you can, explain in linguistic/syntactic terms what this means and why this is (or isn't) a reasonable way to operationalize the idea of subject-auxiliary inversion.
4. What "yes" and "no" words did you decide to include?  Why?
5. Why did we decide to look for shared lemmas rather than shared word forms?
6. Why did we decide to look only for shared open-class lemmas, instead of all shared lemmas?  What are the advantages and disadvantages of this approach?

1. I didn't see any real non-questions, though there probably are spurious results. An interesting thing I saw was that sometimes the direction-giver was giving a command but phrasing it as a question (like "could you move sort of a up sort of a wee bit diagonally to your to your right just a wee bit diagonally" from q3ec8), like a polite request. They're not really trying to get an answer about the direction-follower's ability to move (even if that would be a normal answer); they're just telling them how to move. I did see something like "have have you got a start point" and "and have you you got a footbridge" get counted twice, probably because of the double auxiliary/subject, or the stutter (?) is messing with how spacy is working, or something of the sort. Those are questions though; they just got double-counted.
2. No, we'd be missing the questions that are indicated by tone and context only, rather than word order. There aren't question marks in this corpus, after all. Hypothetical example: "you have carved stones?" There's also this exchange from q7nc2 lines 37-39:
g	east about two inches
f	east 	
g	mm
The way g answered seems like f's "east" could've been a question, seeking clarification about the direction to go. It's possible f was randomly repeating, but if it were a question we wouldn't have caught it.
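We're also missing wh-questions that don't have an auxiliary before the subject, like "what plant thing" (q7nc7 line 172). It looks like we're mostly capturing "have you", "can you", "do you" types of questions (auxiliary plus subject), including wh-questions built that way, such as "how far along to the right do you want me to go".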
3. The head is the most important word in a phrase, with all other words in the phrase depending on it. The children of a word are grammatically related/relevant to the word. For a noun, its children might be its adjectives and possessives (in "my red apple is wormy," "my" and "red" for the noun "apple"), while its head would be the verb it does or is acted on by (in "my red apple is wormy," "is" is the head of "apple"). For a verb, its children might be the subject, auxiliary, adverbs, and any objects (in "I am quickly running," "I" "am" "quickly" for the verb "running"), while its head is itself if it is the main verb (the root) of the sentence. Then for a subject, its head would be the verb that it "does," for lack of a better word, while the head's children would be words grammatically relevant to that verb, including the auxiliary, if it exists. With non-questions the auxiliary is usually after the subject, and with questions this is usually flipped, so I think it's a reasonable way to operationalize subject-auxiliary inversion. It seems to work fairly well; everything I found was phrased as a question, though due to inherent limitations with subject-auxiliary inversion, we didn't get all questions.
4. I included these: yeswords = ['yes','right','mmhmm','uh-huh','yeah','yep','yup','okay','mm','aye']; nowords = ['no','nah','nope']. I just chose some off the top of my head, then I actually looked at the text files a bit and added some more, like "okay" and "aye" and "nope." I will note that spacy splits "uh-huh" into "uh" "-" "huh", so it wouldn't be captured if I matched against spacy tokens; since the counting code splits the second line on whitespace instead, it should still be matched. The code can't capture multi-word yeses/nos, which is why I didn't add things like "I don't" or "I have."
5. Someone might conjugate a word differently and we'd still want to know, which is why we looked at lemmas. Example: "can you go there?" "okay I went there". "Go" and "went" have different word forms but the same lemma, the second person is basically using the same word meaning, just in the past tense.
6. Open classes generally carry more of the important meaning, I would say. That's why innovative new words for new important meanings are added to those classes. Words from closed classes, like articles and auxiliaries, are used all the time and might not mean that much, like if two speakers both used the word "the." This is imperfect and we might want to count meanings from some closed classes, maybe prepositions, especially given the map navigation context (words like above, below, around, next (to), etc.). Here's an analogy: to build a sentence structure, the open classes are like colorful blocks you can switch in and out, while the closed classes are usually the syntactic glue that holds all the meaning together. The glue is always there and necessary no matter the blocks.

## Adding in the metadata

Read the `maptask_trial_data.csv` file into a DataFrame.  Remember that in the first part of the assignment, you created a dictionary that mapped each file label to the "result size".  Use this to add a new column to the "trial data" DataFrame that holds the result size (i.e., number of lines in that dialogue).

In [12]:
df3 = pd.read_csv("maptask_data/maptask_trial_data.csv")

In [13]:
df3['rsize'] = df3['Label'].map(rsize_dict)
df3

Unnamed: 0,Label,Deviation,EyeContact,Familiar,rsize
0,q1nc1,78,False,False,686
1,q1nc2,204,False,False,397
2,q1nc3,40,False,True,404
3,q1nc4,53,False,True,226
4,q1nc5,35,False,False,189
...,...,...,...,...,...
123,q8ec4,54,True,False,289
124,q8ec5,25,True,True,228
125,q8ec6,47,True,True,367
126,q8ec7,43,True,False,145


**🧐 Question:**

1. Why might the number of lines of dialogue be relevant?

Maybe more lines of dialogue implies the two participants took way longer to finish the task and that they had difficulty, and it would be interesting to look at their questions: how many there were, how often there were nos in response, things like that. Maybe the ratio of nos to yeses in comparison to other files.

## Combining and correlating

Now what you want to do is take your DataFrame of questions from above, and group it by the file label.  You want to group it in such a way that, for each file, you can compute the following measurements:

* the number of questions you found in that file
* the number of shared open-class lemmas per question
* the number of yes-answers
* the number of no-answers

(Or you can do number of yes-words and no-words, if that's how you decided to do it above.)  It is possible to do this with a single groupby, or you can do a separate groupby for each variable you want to compute.
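A single groupby that produces all four measurements might look like the sketch below. It uses pandas named aggregation on a toy table; the values are made up for illustration, and the column names are assumptions chosen to match the description above.

```python
import pandas as pd

# Toy stand-in for the per-question table (values made up for illustration).
questions = pd.DataFrame({
    "file": ["q1nc1", "q1nc1", "q1nc1", "q1nc2", "q1nc2"],
    "speaker": ["g", "g", "f", "g", "f"],
    "shared_lemmas": [0, 2, 1, 0, 1],
    "yeses": [1, 0, 1, 0, 2],
    "nos": [0, 1, 0, 1, 0],
})

# One row per file; each keyword names an output column as (input column, function).
per_file = questions.groupby("file").agg(
    number_of_questions=("speaker", "size"),
    shared_lemmas_per_question=("shared_lemmas", "mean"),
    yeses=("yeses", "sum"),
    nos=("nos", "sum"),
)
print(per_file)
```

Taking the mean of `shared_lemmas` gives the per-question figure directly, so no separate division by the question count is needed.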

For each of these variables, you then want to map your data with the "trial data" CSV, so that you have a single DataFrame that looks something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Deviation</th>
      <th>EyeContact</th>
      <th>Familiar</th>
      <th>NLines</th>
      <th>NShared</th>
    </tr>
    <tr>
      <th>Label</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>q1nc6</th>
      <td>34</td>
      <td>False</td>
      <td>False</td>
      <td>166</td>
      <td>0.500000</td>
    </tr>
    <tr>
      <th>q2nc3</th>
      <td>44</td>
      <td>False</td>
      <td>True</td>
      <td>319</td>
      <td>0.222222</td>
    </tr>
    <tr>
      <th>q3nc3</th>
      <td>56</td>
      <td>False</td>
      <td>True</td>
      <td>155</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>q4ec1</th>
      <td>204</td>
      <td>True</td>
      <td>False</td>
      <td>72</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>q6ec5</th>
      <td>60</td>
      <td>True</td>
      <td>True</td>
      <td>194</td>
      <td>0.000000</td>
    </tr>
  </tbody>
</table>

This table just has one column for NShared, which is the number of shared lemmas.  You might make several tables like this, one for shared lemmas, one for number of yeses, etc.; or you could make one table that has one column for shared lemmas, another for number of yeses, and so on.  It's up to you.

Once you have that table, call `.corr()` on it.  It should give you a table that looks something like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Deviation</th>
      <th>Yeses</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Deviation</th>
      <td>1.000000</td>
      <td>-0.105862</td>
    </tr>
    <tr>
      <th>Yeses</th>
      <td>-0.105862</td>
      <td>1.000000</td>
    </tr>
  </tbody>
</table>

(Again, your table may have more columns depending on how you did things.)  This is telling you the correlations between each pair of columns.  Without going deep into stats, we'll just assume we can interpret this as follows: positive numbers mean a positive correlation (i.e., when one variable is high, the other is also high); negative numbers mean the opposite (when one variable is high, the other is low); numbers near zero mean little correlation (i.e., not much relationship between the variables).
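To make the three cases concrete, here is a small sketch that computes the Pearson correlation coefficient by hand (Pearson is the default method of `.corr()` in pandas). The function name `pearson` and the toy numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])     # y rises with x
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])     # y falls as x rises
r_none = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # no linear relationship
print(r_pos, r_neg, r_none)
```

The first pair is perfectly positively correlated, the second perfectly negatively correlated, and the third has a coefficient of zero even though the values are clearly not random, which is a reminder that this measure only captures linear relationships.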

You want to look at the correlations between your variables and the "trial data" variables of Deviation, Familiarity, Eye Contact, and "result size" (the one you added to the table yourself).

In [14]:
questions_grouped = df2.groupby('file').sum()

# summing concatenates the one-letter speaker labels within each file,
# so the length of that string is the number of questions found in the file
questions_grouped['number_of_questions'] = questions_grouped['speaker'].str.len()

In [15]:
questions_grouped

Unnamed: 0_level_0,speaker,shared_lemmas,yeses,nos,number_of_questions
file,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
q1ec2,fgggg,1,4,0,5
q1ec3,ggggfgggggggfggggg,17,12,5,18
q1ec4,ff,1,0,2,2
q1ec5,gggggggfgg,2,5,2,10
q1ec6,fg,2,3,0,2
...,...,...,...,...,...
q8nc4,gfgggfgfgggg,0,7,5,12
q8nc5,ggffggfgggfggfgggg,6,12,4,18
q8nc6,ggggggggg,0,4,5,9
q8nc7,ggggfggfgggfgfg,5,11,2,15


In [16]:
df3['shared_lemmas'] = df3['Label'].map(dict(questions_grouped['shared_lemmas']))
df3['yeses'] = df3['Label'].map(dict(questions_grouped['yeses']))
df3['nos'] = df3['Label'].map(dict(questions_grouped['nos']))
df3['number_of_questions'] = df3['Label'].map(dict(questions_grouped['number_of_questions']))
df3

Unnamed: 0,Label,Deviation,EyeContact,Familiar,rsize,shared_lemmas,yeses,nos,number_of_questions
0,q1nc1,78,False,False,686,77.0,23.0,7.0,64.0
1,q1nc2,204,False,False,397,8.0,7.0,2.0,18.0
2,q1nc3,40,False,True,404,15.0,14.0,8.0,32.0
3,q1nc4,53,False,True,226,4.0,8.0,3.0,11.0
4,q1nc5,35,False,False,189,13.0,3.0,5.0,16.0
...,...,...,...,...,...,...,...,...,...
123,q8ec4,54,True,False,289,8.0,8.0,3.0,18.0
124,q8ec5,25,True,True,228,3.0,4.0,1.0,3.0
125,q8ec6,47,True,True,367,16.0,12.0,8.0,29.0
126,q8ec7,43,True,False,145,3.0,7.0,3.0,12.0


In [17]:
df3[['Deviation','yeses']].corr()

Unnamed: 0,Deviation,yeses
Deviation,1.0,-0.128815
yeses,-0.128815,1.0


In [18]:
df3[['Deviation','nos']].corr()

Unnamed: 0,Deviation,nos
Deviation,1.0,-0.214476
nos,-0.214476,1.0


In [19]:
df3[['Deviation','shared_lemmas']].corr()

Unnamed: 0,Deviation,shared_lemmas
Deviation,1.0,-0.046694
shared_lemmas,-0.046694,1.0


In [20]:
df3[['Deviation','number_of_questions']].corr()

Unnamed: 0,Deviation,number_of_questions
Deviation,1.0,-0.112863
number_of_questions,-0.112863,1.0


In [21]:
df3[['Familiar','shared_lemmas']].corr()

Unnamed: 0,Familiar,shared_lemmas
Familiar,1.0,0.037082
shared_lemmas,0.037082,1.0


In [22]:
df3[['EyeContact','shared_lemmas']].corr()

Unnamed: 0,EyeContact,shared_lemmas
EyeContact,1.0,-0.060997
shared_lemmas,-0.060997,1.0


In [23]:
df3[['EyeContact','number_of_questions']].corr()

Unnamed: 0,EyeContact,number_of_questions
EyeContact,1.0,-0.095746
number_of_questions,-0.095746,1.0


In [24]:
df3[['rsize','shared_lemmas']].corr()

Unnamed: 0,rsize,shared_lemmas
rsize,1.0,0.633912
shared_lemmas,0.633912,1.0


In [25]:
df3[['rsize','number_of_questions']].corr()

Unnamed: 0,rsize,number_of_questions
rsize,1.0,0.714152
number_of_questions,0.714152,1.0


In [26]:
df3[['rsize','yeses']].corr()

Unnamed: 0,rsize,yeses
rsize,1.0,0.532162
yeses,0.532162,1.0


In [27]:
df3[['rsize','nos']].corr()

Unnamed: 0,rsize,nos
rsize,1.0,0.466298
nos,0.466298,1.0


So nothing too interesting here: the correlations with Deviation, Familiar, and EyeContact are all weak. The only sizable correlations are with result size (0.71 with number of questions, 0.63 with shared lemmas, 0.53 with yeses, 0.47 with nos), and it makes sense that there are more questions, shared lemmas, yeses, and nos when there are more lines.
Given that the example dataframe above had NShared (number of shared lemmas) as what looks like a proportion, I'm going to assume that we're meant to be looking at proportions, that is, to normalize the numbers of yeses/nos/shared lemmas by the result size or the number of questions asked. I'll get on with that.
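One thing to watch when normalizing by division is a zero denominator: a file with some yeses but no nos (q1ec2 in the grouped table above has 4 yeses and 0 nos) gives an infinite yes/no ratio, which may distort a correlation. A possible way to guard against that, sketched here on a made-up toy table, is to turn zero denominators into missing values first.

```python
import pandas as pd

# Toy values, made up for illustration.
toy = pd.DataFrame({
    "yeses": [4, 3, 6],
    "nos": [0, 1, 2],
    "number_of_questions": [5, 4, 10],
})

# Replace zero denominators with NaN so the ratio is missing rather than infinite.
safe_nos = toy["nos"].replace(0, float("nan"))
toy["yes_no_ratio"] = toy["yeses"] / safe_nos
toy["yeses_per_question"] = toy["yeses"] / toy["number_of_questions"]
print(toy)
```

Rows with a missing ratio are then left out when `.corr()` is computed, at the cost of dropping those files from that particular comparison.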

In [28]:
df3['questions_per_line'] = df3['number_of_questions'] / df3['rsize']
df3['shared_lemmas_per_question'] = df3['shared_lemmas'] / df3['number_of_questions']
df3['yes_no_ratio'] = df3['yeses'] / df3['nos']
df3['yeses_per_question'] = df3['yeses'] / df3['number_of_questions']
df3['nos_per_question'] = df3['nos'] / df3['number_of_questions']
df3

Unnamed: 0,Label,Deviation,EyeContact,Familiar,rsize,shared_lemmas,yeses,nos,number_of_questions,questions_per_line,shared_lemmas_per_question,yes_no_ratio,yeses_per_question,nos_per_question
0,q1nc1,78,False,False,686,77.0,23.0,7.0,64.0,0.093294,1.203125,3.285714,0.359375,0.109375
1,q1nc2,204,False,False,397,8.0,7.0,2.0,18.0,0.045340,0.444444,3.500000,0.388889,0.111111
2,q1nc3,40,False,True,404,15.0,14.0,8.0,32.0,0.079208,0.468750,1.750000,0.437500,0.250000
3,q1nc4,53,False,True,226,4.0,8.0,3.0,11.0,0.048673,0.363636,2.666667,0.727273,0.272727
4,q1nc5,35,False,False,189,13.0,3.0,5.0,16.0,0.084656,0.812500,0.600000,0.187500,0.312500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123,q8ec4,54,True,False,289,8.0,8.0,3.0,18.0,0.062284,0.444444,2.666667,0.444444,0.166667
124,q8ec5,25,True,True,228,3.0,4.0,1.0,3.0,0.013158,1.000000,4.000000,1.333333,0.333333
125,q8ec6,47,True,True,367,16.0,12.0,8.0,29.0,0.079019,0.551724,1.500000,0.413793,0.275862
126,q8ec7,43,True,False,145,3.0,7.0,3.0,12.0,0.082759,0.250000,2.333333,0.583333,0.250000


In [29]:
df3[['Deviation','yes_no_ratio']].corr()

Unnamed: 0,Deviation,yes_no_ratio
Deviation,1.0,-0.007552
yes_no_ratio,-0.007552,1.0


In [30]:
df3[['Deviation','yeses_per_question']].corr()

Unnamed: 0,Deviation,yeses_per_question
Deviation,1.0,-0.12276
yeses_per_question,-0.12276,1.0


In [31]:
df3[['Deviation','nos_per_question']].corr()

Unnamed: 0,Deviation,nos_per_question
Deviation,1.0,-0.173698
nos_per_question,-0.173698,1.0


In [32]:
df3[['Deviation','questions_per_line']].corr()

Unnamed: 0,Deviation,questions_per_line
Deviation,1.0,-0.0484
questions_per_line,-0.0484,1.0


In [33]:
df3[['Deviation','shared_lemmas_per_question']].corr()

Unnamed: 0,Deviation,shared_lemmas_per_question
Deviation,1.0,0.003777
shared_lemmas_per_question,0.003777,1.0


In [34]:
df3[['EyeContact','yes_no_ratio']].corr()

Unnamed: 0,EyeContact,yes_no_ratio
EyeContact,1.0,0.12109
yes_no_ratio,0.12109,1.0


In [35]:
df3[['EyeContact','yeses_per_question']].corr()

Unnamed: 0,EyeContact,yeses_per_question
EyeContact,1.0,0.040785
yeses_per_question,0.040785,1.0


In [36]:
df3[['EyeContact','nos_per_question']].corr()

Unnamed: 0,EyeContact,nos_per_question
EyeContact,1.0,-0.034449
nos_per_question,-0.034449,1.0


In [37]:
df3[['EyeContact','questions_per_line']].corr()

Unnamed: 0,EyeContact,questions_per_line
EyeContact,1.0,0.044624
questions_per_line,0.044624,1.0


In [38]:
df3[['EyeContact','shared_lemmas_per_question']].corr()

Unnamed: 0,EyeContact,shared_lemmas_per_question
EyeContact,1.0,0.054966
shared_lemmas_per_question,0.054966,1.0


In [39]:
df3[['Familiar','yes_no_ratio']].corr()

Unnamed: 0,Familiar,yes_no_ratio
Familiar,1.0,-0.053353
yes_no_ratio,-0.053353,1.0


In [40]:
df3[['Familiar','yeses_per_question']].corr()

Unnamed: 0,Familiar,yeses_per_question
Familiar,1.0,-0.187542
yeses_per_question,-0.187542,1.0


In [41]:
df3[['Familiar','nos_per_question']].corr()

Unnamed: 0,Familiar,nos_per_question
Familiar,1.0,0.123654
nos_per_question,0.123654,1.0


In [42]:
df3[['Familiar','questions_per_line']].corr()

Unnamed: 0,Familiar,questions_per_line
Familiar,1.0,-0.172421
questions_per_line,-0.172421,1.0


In [43]:
df3[['Familiar','shared_lemmas_per_question']].corr()

Unnamed: 0,Familiar,shared_lemmas_per_question
Familiar,1.0,-0.060937
shared_lemmas_per_question,-0.060937,1.0


In [44]:
df3[['yeses','nos']].corr()

Unnamed: 0,yeses,nos
yeses,1.0,0.49079
nos,0.49079,1.0


**🧐 Questions:**

1. Remember that the "Deviation" measures how different the follower's map path wound up being from the giver's (high numbers mean more difference, i.e., the players were not as successful in replicating the original path).  Which of your variables has the most meaningful correlation with Deviation?  What is your interpretation of this result?
2. Which of your variables has the most meaningful correlation with the number of lines in the dialogue (aka "result size")?  What is your interpretation of this result?
3. You may have heard that correlation does not imply causation.  That is certainly relevant here.  Consider the most likely *causal* interpretation of a correlation between "number of questions asked" and Deviation; then consider the most likely *causal* interpretation of a correlation between "number of questions asked" and "eye contact".  What is the difference in how we're likely to infer causality in the two cases?
4. What do you find about the correlations with your "Yes" and "No" variables?  Are any of them meaningful?  How do you interpret these results?
5. We did not make use of the "speaker" variable indicating whether the person asking the question was the direction-giver or the follower.  How might this be relevant?  What would be your hypothesis about the importance of questions asked by the giver vs. the receiver?
6. Suppose someone asked you the following: "When you're having a conversation with someone and trying to communicate something rather specific to them, is it a good or a bad sign if you (the person giving them the information) ask a lot of questions?  Is it a good or a bad sign if they (the person receiving the information) ask a lot of questions?"  How would our results be relevant to these questions?  What kinds of things might be relevant to identifying "good" or "successful" communication that we did not address here?

1. Deviation and nos had the most meaningful correlation of -0.214476. I took this to mean that the more often someone said no to a question, the better the clarity of communication, which translates to less deviation in route.
2. Result size had the greatest correlation with number_of_questions at 0.714152, which makes sense; it also had a correlation of 0.633912 with shared_lemmas. The more lines of dialogue, the more likely there will be questions and shared lemmas between the question and answer. The correlation with yeses is 0.532162. Correlation with nos is 0.466298. We might infer that some pairs just didn't say no as much and it's less related to how long the conversation went on. However long the dialogue goes, they'll usually ask proportionately many questions, but not as proportionately many replies of no.
3. Correlation of deviation vs number of questions: -0.112863. Likely causal interpretation: didn't ask enough questions to clarify/understand/communicate -> greater deviation.
Correlation of number of questions vs eye contact: -0.095746. Likely causal interpretation: more eye contact means greater understanding -> people don't feel the need to ask as many questions.
In the first case we assumed the number of questions was a cause and in the second case we assumed it was the effect.
4. The correlation of yeses and nos was 0.49079. This is fairly high. Didn't get many noticeable correlations with other variables and proportions of yeses/nos. The strongest is maybe Familiar vs yeses_per_question at -0.187542. If unfamiliar with someone, you say yes in response to their questions more? Because you want to show agreement and are less comfortable disagreeing? Second strongest correlation is -0.173698 for Deviation and nos_per_question. Actually, Deviation and the plain number of nos has a stronger correlation of -0.214476. As I interpreted in question 1, the more often someone said no to a question, the better the clarity of communication, which translates to less deviation in route.
5. I would assume that the direction-giver is asking most of the questions, since I saw a lot of "do you have"s that make more sense for the direction-giver to be saying. I'd hypothesize that the follower's questions are more about clarifications of instructions received, while the giver's questions are more about existence of landmarks on the follower's map, "can you go this way," that type of thing.
6. That's a good point... If we could look at the number of questions asked by just the direction-giver and compare it to the deviations, we might be able to figure out whether it is a good or bad sign if the communicator is asking many questions. The same goes for the direction-follower. Good or successful communication might be identified through tone (uncertain tones might result in poorer communication?), or use of some fillers like "ehm" and "well" (unsureness? taking time to think things through? communicating to the other person that you're thinking or you're feeling a certain way about what they said?), or body-language mirroring (being "on the same wavelength" or something), or length of time to complete the task (fast=good?).
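Splitting the question counts by speaker, as answers 5 and 6 suggest, could be sketched like this. The toy table and the output column names `giver_questions` and `follower_questions` are made up for illustration; this was not run on the real data.

```python
import pandas as pd

# Toy stand-in for the per-question table (values made up for illustration).
toy_questions = pd.DataFrame({
    "file": ["q1nc1", "q1nc1", "q1nc1", "q1nc2", "q1nc2"],
    "speaker": ["g", "g", "f", "g", "g"],
})

# Count questions per (file, speaker), then spread the speakers into columns.
by_speaker = (
    toy_questions.groupby(["file", "speaker"])
    .size()
    .unstack(fill_value=0)
    .rename(columns={"g": "giver_questions", "f": "follower_questions"})
)
print(by_speaker)
```

The two resulting columns could then be mapped onto the trial data by file label and correlated with Deviation separately, which is one way to check whether giver questions and follower questions relate to task success differently.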