---
title : "Gesture-Speech Physics in Fluent Speech and Rhythmic Upper Limb Movements"
shorttitle : "Gesture-Speech Physics in Fluent Speech"
author:
- name : "Wim Pouw"
affiliation : "1,2,3"
corresponding : yes # Define only one corresponding author
address : "Donders Institute for Brain, Cognition and Behaviour, Heyendaalseweg 135, 6525 AJ Nijmegen"
email : "w.pouw@psych.ru.nl"
- name : "Lisette de Jonge-Hoekstra"
affiliation : "1,5"
- name : "Steven J. Harrison"
affiliation : "1,6"
- name : "Alexandra Paxton"
affiliation : "1,7"
- name : "James A. Dixon"
affiliation : "1,7"
affiliation:
- id : "1"
institution : "Center for the Ecological Study of Perception and Action, University of Connecticut"
- id : "2"
institution : "Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen"
- id : "3"
institution : "Institute for Psycholinguistics, Max Planck Nijmegen"
- id : "4"
institution : "Faculty of Behavioral and Social Sciences, University of Groningen"
- id : "5"
institution : "Royal Dutch Kentalis, Sint-Michielsgestel"
- id : "6"
institution : "Department of Kinesiology, University of Connecticut"
- id : "7"
institution : "Department of Psychological Sciences, University of Connecticut"
authornote: |
All anonymised data and analysis code are available at the Open Science Framework (https://osf.io/tgbmw/). This manuscript has been written with Rmarkdown - for the code-embedded reproducible version of this manuscript please see the Rmarkdown (.Rmd) file available at the OSF page.
  This research has been funded by The Netherlands Organisation for Scientific Research (NWO; Rubicon grant “Acting on Enacted Kinematics”, Grant Nr. 446-16-012; PI Wim Pouw).
  Some sections of this paper have been submitted verbatim as a 2-page abstract to GESPIN2020.
Acknowledgement: We would like to thank Jenny Michlich for pointing us to relevant bioacoustic literature. We thank Susanne Fuchs for valuable comments on this work.
abstract: |
  A common understanding is that hand gesture and speech coordination in humans is culturally and cognitively acquired, rather than having a biological basis. Recently, however, the biomechanical physical coupling of arm movements to speech vocalization has been studied in steady-state vocalization and mono-syllabic utterances, where forces produced during gesturing are transferred onto the tensioned body, leading to changes in respiratory-related activity and thereby affecting vocalization F0 and intensity. In the current experiment (N = 37), we extend this previous line of work to show that gesture-speech physics impacts fluent speech, too. Compared with not moving, participants producing fluent self-formulated speech while rhythmically moving their limbs showed heightened F0 and amplitude envelope, and these effects were more pronounced for higher-impulse arm movements than for lower-impulse wrist movements. We replicate the finding that acoustic peaks arise especially during moments of peak impulse (i.e., the beat) of the movement, namely around its deceleration phases. Finally, higher deceleration rates of higher-mass arm movements were related to higher peaks in acoustics. These results confirm that the physical impulses of gesture affect the speech system. We discuss the implications of gesture-speech physics for understanding the emergence of communicative gesture, both ontogenetically and phylogenetically.
keywords : "hand gesture, speech production, speech acoustics, biomechanics, entrainment"
wordcount : "X"
bibliography : ["mybib.bib", "r-references.bib"]
fig_caption : no
floatsintext : yes
figurelist : no
tablelist : no
footnotelist : no
linenumbers : yes
mask : no
draft : no
documentclass : "apa6"
classoption : "man, noextraspace"
output : papaja::apa6_docx
---
```{r setup, include = FALSE}
library("papaja") #papaja::apa6_pdf
knitr::opts_chunk$set(fig.cap = "")
knitr::opts_chunk$set(dpi=600)
```
```{r analysis-preferences_packages_functions_etc, warning = FALSE}
# Seed for random number generation
set.seed(42)
#knitr::opts_chunk$set(cache.extra = knitr::rand_seed)
#load libraries
library(dplyr) #data formatting
library(ggplot2) #plotting 3d density plots
library(ggbeeswarm) #plotting of density jitter distributions
library(gridExtra) #plotting multiple panels
library(nlme) #mixed regression
library(gam) #generalized additive models
library(mgcv) #fitting generalized additive models (bam)
library(itsadug) #plotting generalized additive models
library(scales) #for rescaling variables
```
```{r functions_themes, echo = FALSE, message = FALSE}
#save blue theme for plotting later on
bluetheme <- theme(
panel.background = element_rect(fill = "white", colour = "grey",
size = 2, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "grey"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "grey"),
strip.background =element_rect(fill="white"))
```
```{r load_in_dataetc, echo = FALSE, message=FALSE, warning=FALSE}
#set directories, load main data, and order factors
parentfolder <- "D:/Research_Projects/Respiration, Gesture, Speech/MainStudy_fluidspeech/MS_markdown"
basefolder <- dirname(parentfolder)
meta <- read.csv(paste0(basefolder, "/DATA/META/META.csv"))
fd <- read.csv(paste0(basefolder, "/DATA/Main/MERGED_DATA/fd.csv"))
fd$condition <- factor(fd$condition, levels = c("PASSIVE", "WRIST", "ARM"))
cartoon_duration_s <- c(50, 33, 39, 75, 23, 38, 77, 21, 89, 120, 100, 48) #seconds for each video clip
#save some general info task
#####average time per trial
ch <- ave(fd$time_ms_rec, fd$unique_trial,FUN=function(x){max(x)-min(x)} )
ch <- ch[!duplicated(fd$unique_trial)]
trialtime <- mean(ch, na.rm=TRUE)/1000
sdtrialtime <- sd(ch, na.rm=TRUE)/1000
```
Communicative hand gestures are ubiquitous across human cultures. Gestures aid communication by seamlessly interweaving relevant pragmatic, iconic and symbolic expressions of the hands together with speech [@hollerMultimodalLanguageProcessing2019; @streeckDepictingGesture2008; @feyereisenCognitivePsychologySpeechRelated2017]. For such multi-articulatory utterances to do their communicative work, gesture and speech must be tightly temporally coordinated to form a sensible speech-gesture whole. In fact, gestures' salient moments are often timed with emphatic stress made in speech, no matter what the hands depict [@wagnerGestureSpeechInteraction2014; @shattuck-hufnagelDimensionalizingCospeechGestures2019]. For such gesture-speech coordination to get off the ground, the system must functionally constrain its degrees of freedom [@turveyCoordination1990]; in doing so it will have to utilize (or otherwise account for) intrinsic dynamics arising from the bio-physics of speaking and moving at the same time. Here we provide evidence that movement of the upper limbs constrains fluent self-generated speech acoustics through biomechanics.
## The gesture-speech prosody link
The tight coordination of prosodic aspects of speech with the kinematics of gesture has long been appreciated and is classically referred to as the beat-like quality of co-speech gesture [@mcneillHandMindWhat1992]. As obtained from video analysis, gesture apices are often found to align with *pitch accents*—accents which are acoustically predominantly defined by positive excursions in the fundamental frequency (F0), lowering of the second formant, longer vowel duration, and increased intensity [@loehrTemporalStructuralPragmatic2012; @mendoza-dentonSemioticLayeringGesture2011; @mcclavePitchManualGestures1998]. Pitch accents can be perceptually differentiated by sudden lowering of F0 as well, but gestures do not seem to align with those events quite as much [@imProbabilisticRelationCospeech2020].
In more recent motion-tracking studies, gesture-speech prosody correlations have been obtained as well. For example, gestures’ peak velocity often co-occurs near peaks in F0, even when such gestures are depicting something [@dannerQuantitativeAnalysisMultimodal2018; @leonardTemporalRelationBeat2011; @krivokapicGesturalCoordinationProsodic2014; @pouwEntrainmentModulationGesture2019; @pouwQuantifyingGesturespeechSynchrony2019]. In pointing gestures, stressed syllables align neatly with the maximum extension of the pointing movement, such that the hand movement terminates at the first syllable utterance in strong-weak stressed “PApa” and terminates later during the second syllable utterance in the weak-strong “paPA” [@esteve-gibertProsodicStructureShapes2013; @rochet-capellanSpeechFocusPosition2008]. During finger-tapping and mono-syllabic utterances, when participants are instructed to alternate prominence in their utterances (“pa, PA, pa, PA”), the tapping action spontaneously aligns with the syllable pattern, such that larger movements are made during stressed syllables [@parrellSpatiotemporalCouplingSpeech2014]. Conversely, if participants are instructed to alternate stress in finger tapping (strong-weak-strong-weak force production), speech will follow, with larger oral-labial apertures for stressed vs. unstressed tapping movements.
Even when people do not intend to change the stress patterning of an uttered sentence, gesturing concurrently affects speech acoustics in a way that makes it seem intentionally stressed, inducing an increase in vocalization duration and a lowering of the second formant of co-occurrent speech [@krahmerEffectsVisualBeats2007]. Further, gesture and speech cycle rates seem to be attracted towards particular (polyrhythmic) stabilities: In-phase speech-tapping is preferred over anti-phase coordination, and 2:1 speech-to-tapping ratios are preferred over more complex integer ratios such as 5:2 [@stoltmannSyllablepointingGestureCoordination2017;@zelicArticulatoryConstraintsSpontaneous2015; @kelsoConvergingEvidenceSupport1984; @treffnerIntentionalAttentionalDynamics2002]. These previous results indicate that gesture and speech naturally couple their activity, like many living and non-living oscillatory systems [@pikovskySynchronizationUniversalConcept2001], requiring further study on the exact nature of this coupling.
## Gesture-speech physics
The current mainstream understanding of the gesture-prosody link is that it is not “biologically mandated” [p.69 ; @mcclavePitchManualGestures1998; @shattuck-hufnagelProsodicCharacteristicsNonreferential2018], requiring neural-cognitive timing mechanisms [@ruiterProductionGestureSpeech2000; @deruiterAsymmetricRedundancyGesture2017] that appear only after about 16 months of age [@iversonHandMouthBrain2005; see also @esteve-gibertProsodyAuditoryVisual2018]. Recently, however, a potential physical coupling of arm movements with speech via myofascial tissue biomechanics was investigated, where it was found that hand gesturing physically impacts steady-state vocalizations and mono-syllabic consonant-vowel utterances [@pouwGesturespeechPhysicsBiomechanical2019; @pouwEnergyFlowsGesturespeechinpress; @pouwAcousticInformationUpper2020; @pouwAcousticSpecificationUpper2019]. Specifically, hand and arm movements can transfer a force (a physical impulse) onto the musculoskeletal system, thereby modulating respiration-related muscle activity, leading to changes in vocalization intensity. If vocal fold adjustments do not accommodate for gesture-induced impulses, the fundamental frequency (F0) of vocalizations is affected as well. Higher-impulse arm movements or two-handed movements will induce more pronounced effects on F0 and intensity than lower-impulse wrist movements or one-handed movements. This is because the mass of the “object” in motion is higher for arm versus wrist movements, thereby changing the momentum of the effector (everything else—such as effector speed—being equal, as effector momentum = effector mass x effector velocity). The change in momentum is the physical impulse, and the physical impulse is highest when the change in velocity (i.e., acceleration) is highest (everything else—such as effector mass—being constant).
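
In simplified point-mass terms (a schematic idealization; the actual force transfer through the body is more complex, as discussed below), this reasoning amounts to:

$$
p = m_{\text{effector}} \cdot v_{\text{effector}}, \qquad J = \Delta p = m_{\text{effector}} \cdot \Delta v,
$$

so that, with movement speed profiles held equal, a heavier effector (arm vs. wrist) produces a larger impulse, and, with mass held equal, a larger change in velocity over a given interval (i.e., a higher acceleration or deceleration) produces a larger impulse.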
How physical impulses are absorbed by the respiratory system is likely complex and not a simple linear function [@levinTensegrityNewBiomechanics2006]. However, a complete understanding will involve an appreciation of the body as a pre-stressed system [@bernsteinCoordinationRegulationsMovements1967; @profetaBernsteinLevelsMovement2018], forming an interconnected tensioned network of compressive (e.g., bones) and tensile elements (e.g., fascia, muscles) through which forces may reverberate nonlinearly [@turveyMediumHapticPerception2014; @silvaSteadystateStressOne2007]. Specifically, upper limb movements are controlled by stabilizing musculoskeletal actions of the scapula and shoulder joint, which directly implicate accessory expiratory muscles that also stabilize scapula and shoulder joint actions [e.g., the serratus anterior inferior; see @pouwEnergyFlowsGesturespeechinpress for an overview].
Peripheral actions also play a role, as performing an upper limb movement recruits a whole kinetic chain of muscle activity around the trunk (e.g., the rectus abdominis) to maintain posture [@hodgesFeedforwardContractionTransversus1997]. Indeed, when people are standing vs. sitting, for example, the effects of the peak physical impulse of gestures on vocalization acoustics are more pronounced [@pouwGesturespeechPhysicsBiomechanical2019]. We reasoned that this is because standing involves more forceful anticipatory postural counter-adjustments [@cordoPropertiesPosturalAdjustments1982], which reach the respiratory system via accessory expiratory muscles that are also implicated in maintaining postural integrity. Recently, more direct evidence has been found for the gesture-respiration-speech link: Respiratory-related activity (measured with a respiratory belt) was enhanced during moments of peak impetus of the gesture as opposed to other phases of the gesture movement, and respiratory-related activity itself was predictive of the gesture-related intensity modulations of mono-syllabic utterances [@pouwEnergyFlowsGesturespeechinpress].
The evidence reviewed so far has been based on experiments with continuous vocalizations or monosyllabic utterances. Such results cannot directly generalize to fluent, self-generated, full-sentenced speech. However, recent work suggests that gesture-speech physics does generalize to fluent speech. For example, @cravottaEffectsEncouragingUse2019 found that encouraging participants to gesture during cartoon narration, versus giving no instructions, led to a 22-Hz increase in observed maximum F0 and to a greater F0 range and intensity of speech. Furthermore, computational modelers have reported interesting successes in synthesizing gesture kinematics based on speech acoustics alone [@ginosarLearningIndividualStyles2019; @kucherenkoAnalyzingInputOutput2019], indicating that information about body movements inhabits the speech signal. Such results do not necessitate a role for biomechanics but only suggest a strong connection between gesture and speech.
## Current experiment
The current experiment was conducted as a simple test of the constraints of upper limb movement on fluent speech acoustics. Participants were asked to retell a cartoon scene that they had just watched, while either not moving, vertically moving their wrist, or vertically moving their arm at a tempo of 80 beats per minute (1.33 Hz). Participants were asked to give a stress or beat in the downward motion, with a sudden stop at maximum extension (i.e., a sudden deceleration). Participants were asked not to allow the movements to affect their speaking performance in any way. Similar to previous experiments [e.g., @pouwEnergyFlowsGesturespeechinpress], we assessed the following to determine whether gesture-speech physics is present:
* 1) Does rhythmic co-speech movement change acoustic markers of prosody (i.e., F0 and amplitude envelope)?
* 2) At what moments of co-speech movement is change in acoustics observed?
* 3) Does degree of physical impulse (as measured by effector mass or changes in speed) predict acoustic variation?
# Method
## Participants & Design
A total of `r printnum(length(meta$PPN))` undergraduate students at the University of Connecticut were recruited as participants (*M* age = `r printnum(mean(meta$AGE))`, *SD* age = `r printnum(sd(meta$AGE))`, %cis-gender female = `r printnum( round(sum(meta$GENDER == "FEMALE")/length(meta$PPN)*100, 2) )`, %cis-gender male = `r printnum( round(sum(meta$GENDER == "MALE")/length(meta$PPN)*100, 2) )`, %right-handed = `r printnum( round(sum(meta$Handedness == "R")/length(meta$PPN)*100, 2) )`).
The current design was fully-within subject, with a three-level movement manipulation (passive vs. wrist-movement vs. arm-movement condition). Movement condition was randomly assigned per trial. Taken together, participants performed `r printnum(length(unique(fd$unique_trial)))` trials, each lasting about 40 seconds. The study design was approved by the IRB committee of the University of Connecticut (#H18-227).
## Material & Equipment
### Cartoon vignettes
Twelve cartoon vignettes were created from the “Canary Row” and “Snow Business” Tweety and Sylvester cartoons (*M* vignette duration = `r printnum(mean(cartoon_duration_s))` seconds, *SD* = `r printnum(sd(cartoon_duration_s))` seconds). These cartoons are often used in gesture research [@mcneillHandMindWhat1992]. The videos can be accessed here: https://osf.io/rfj5x/.
### Audio and Motion Tracking
A MicroMic C520 cardioid condenser microphone headset (AKG, Inc.) was used to record audio at 44.1 kHz. The microphone was plugged into a computer that handled the recording via a C++ script. Also plugged into this computer was a Polhemus Liberty motion tracking system (Polhemus, Inc.), which tracked the position of the index finger of the participant’s dominant hand, sampling with one 6D sensor at 240 Hz. We applied a first-order Butterworth filter at 30 Hz to the vertical position (z) traces and their derivatives.
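
As an illustration, a minimal sketch of this filtering step (assuming the 'signal' package; the object names and the zero-phase filtfilt call are illustrative rather than the exact acquisition code):

```{r butterworth_sketch, eval = FALSE}
library(signal) # for butter() and filtfilt()

fs <- 240                       # motion-tracking sampling rate (Hz)
fc <- 30                        # low-pass cutoff (Hz)
bf <- butter(1, fc / (fs / 2))  # first-order Butterworth, cutoff as fraction of Nyquist

# z_raw: hypothetical vector of vertical (z) position samples at 240 Hz
z_raw  <- sin(seq(0, 2 * pi, length.out = fs)) + rnorm(fs, sd = 0.05)
z_filt <- filtfilt(bf, z_raw)   # zero-phase low-pass filtering of the position trace

# velocity (and, by repetition, acceleration) can then be taken from the filtered trace
v_filt <- c(0, diff(z_filt)) * fs
```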
## Procedure
Upon arrival, participants were briefed that this 30-minute experiment entailed retelling cartoon scenes while standing and performing upper-limb movements. A motion sensor was attached to the tip of the index finger of their dominant hand, and a microphone headset was put on. Participants were asked to stand upright and were introduced to three movement conditions (see Figure 1). In the passive condition, participants did not move and kept their arm resting alongside the body. In the wrist-movement condition, participants were asked to continuously move the hand vertically at the wrist joint while keeping the elbow joint at 90 degrees. In the arm-movement condition, participants moved their arm vertically at the elbow joint, without wrist movement. Similar to previous studies [e.g., @pouwAcousticInformationUpper2020], participants were asked to give emphasis in the downward motion of the movement with a sudden halt—in other words, a beat—at the maximum extension of their movement.
After the introduction of the movements, participants were told that they were to move at a particular tempo, indicated by a visual feedback system. The feedback system consisted of a horizontal bar that continually updated to report the participant’s movement speed in the previous movement cycle. The participant was to keep the horizontal bar between the lower and upper boundaries (a 20% region, 72-88 BPM) around the 1.33-Hz target tempo (i.e., 80 BPM). Participants briefly practiced moving at the target rate before starting the experiment.
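
In other words, the feedback window corresponded to a band of plus or minus 10% around the target rate; a small illustrative calculation (not part of the experiment code):

```{r tempo_bounds_sketch, eval = FALSE}
target_bpm <- 80                         # i.e., 1.33 Hz
bounds_bpm <- target_bpm * c(0.9, 1.1)   # 20% region: 72-88 BPM
bounds_hz  <- bounds_bpm / 60            # roughly 1.20-1.47 Hz
```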
Critically, the participants were not exposed to an external rhythmic signal, like a visual metronome. Subsequently, participants were instructed that they would watch and then retell cartoon clips while making one of the instructed movements (or making no movements). Participants were asked to keep their speech as normal as possible while making the movements (or no movement). In the conditions requiring movement, participants were to keep their movement tempo within the target range.
Twelve cartoon vignettes were readied to be shown before each trial. The experiment ended when the participant saw and retold all 12 vignettes or when the total experiment time reached 30 minutes. To ensure that all movement conditions would be performed at least once within that time, we set the maximum time per trial at 1 minute. In other words, when participants were still retelling the same scene after 60 seconds, the experimenter would terminate the trial and move to the next trial. Mean retelling time was, however, well below 1 minute (*M* = `r printnum(round(trialtime))` seconds, *SD* = `r printnum(round(sdtrialtime, 2))`).
Figure 1. Graphical overview of movement conditions
```{r method_stance_pic, echo = FALSE,warning=FALSE, fig.align = 'center', fig.height= 4}
library(raster)
mypng <- stack(paste0(parentfolder, "/images/FigureStanceMethod.png"))
plotRGB(mypng,maxpixels=1e300)
```
*Note*. Movement conditions are shown. Each participant performed all conditions (i.e., within-subjects). To ensure that movement tempo remained relatively constant, participants were shown a moving green bar, which indicated whether they moved too fast or too slow relative to a 20% target region around 1.33 Hz. Participants were instructed to place emphasis on the downbeat, with an abrupt stop (i.e., a beat) at maximum extension.
## Preprocessing
### Speech acoustics
The fundamental frequency (F0) was extracted with sex-appropriate preset ranges (male = 50-400 Hz; female = 80-640 Hz). We used a previously written R script [https://osf.io/m43qy/; @pouwMaterialsTutorialGespin20192019] utilizing the R package ‘wrassp’ [@winkelmannWrasspInterfaceASSP2018], which applies the K. Schaefer-Vincent algorithm. It should be noted that F0 tracking is always susceptible to noisy estimation. For several participants we checked whether there were gross mistrackings by the F0 algorithm (e.g., sudden jumps to higher harmonics), and we did not find any. However, given the current sample size we did not hand-check the F0 track for all the data, and we must therefore accept a certain amount of noise that is common to F0 tracking.
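
For illustration only, a minimal sketch of this F0 extraction step (the file name is hypothetical and the argument values show the female preset; argument and track names follow our reading of the 'wrassp' documentation for ksvF0, and the exact settings are those in the OSF script):

```{r ksvf0_sketch, eval = FALSE}
library(wrassp)

wav <- "participant01_trial04.wav"   # hypothetical mono wav file of one trial

# K. Schaefer-Vincent F0 tracking with a sex-appropriate search range (female preset shown)
f0_obj <- ksvF0(wav, toFile = FALSE, minF = 80, maxF = 640)

f0 <- as.numeric(f0_obj$F0)  # F0 track in Hz; 0 where no voicing was detected
f0[f0 == 0] <- NA            # treat unvoiced frames as missing
```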
We also extracted a smoothed (5-Hz Hann window) amplitude envelope using a custom-written R script [https://osf.io/uvkj6/, which reimplements in R a procedure from @heAmplitudeEnvelopeKinematics2017a]. The amplitude envelope was calculated by applying a Hilbert transformation to the sound waveform, yielding a complex-valued analytic signal from which we take the complex modulus. After smoothing and downsampling to 240 Hz, this gives a one-dimensional time series, referred to as the amplitude envelope, tracing the extrema (i.e., the envelope) of the sound waveform, as shown in Figure 2.
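
A minimal base-R sketch of this envelope computation (simplified relative to the actual script linked above; the smoothing-window choice is an approximation of the 5-Hz Hann smoothing):

```{r hilbert_env_sketch, eval = FALSE}
# x: sound waveform (numeric vector); fs: its sampling rate in Hz
amplitude_envelope <- function(x, fs, target_fs = 240, smooth_hz = 5) {
  n <- length(x)
  # analytic signal via an FFT-based Hilbert transform
  h <- rep(0, n)
  if (n %% 2 == 0) { h[c(1, n / 2 + 1)] <- 1; h[2:(n / 2)] <- 2
  } else           { h[1] <- 1;               h[2:((n + 1) / 2)] <- 2 }
  analytic <- fft(fft(x) * h, inverse = TRUE) / n
  env <- Mod(analytic)                              # complex modulus = raw envelope
  # smooth with a Hann window sized to roughly a 5-Hz passband (approximation)
  win  <- floor(fs / smooth_hz)
  hann <- 0.5 - 0.5 * cos(2 * pi * seq(0, win - 1) / (win - 1))
  env_smooth <- stats::filter(env, hann / sum(hann), sides = 2)
  # crude decimation to the motion-tracking rate (240 Hz)
  as.numeric(env_smooth[seq(1, n, by = round(fs / target_fs))])
}
```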
### Data and Exclusions
We collected `r printnum( round( nrow(fd)*(1/240)/60,2) )` minutes of continuous data (passive condition = `r printnum( round( nrow(fd[fd$condition == "PASSIVE",])*(1/240)/60,2) )`, wrist-movement condition = `r printnum( round( nrow(fd[fd$condition == "WRIST",])*(1/240)/60,2) )`, arm-movement condition = `r printnum( round( nrow(fd[fd$condition == "ARM",])*(1/240)/60,2) )`). However, a C++ memory allocation error resulted in the loss of precise timing information for the motion-tracker samples after a certain period (1 million milliseconds, i.e., 16 minutes and 40 seconds), and thus for a subset of the experimental data. Complete data were therefore obtained for the first 16 minutes and 40 seconds of each participant's session. We limit our analyses to this complete data set, which consists of `r printnum( round( nrow(fd[is.na(fd$EXCLUDE),])*(1/240)/60,2) )` minutes of continuous speech and movement data (passive condition = `r printnum( round( nrow(fd[is.na(fd$EXCLUDE) & fd$condition == "PASSIVE",])*(1/240)/60,2) )`, wrist-movement condition = `r printnum( round( nrow(fd[is.na(fd$EXCLUDE) & fd$condition == "WRIST",])*(1/240)/60,2) )`, arm-movement condition = `r printnum( round( nrow(fd[is.na(fd$EXCLUDE) & fd$condition == "ARM",])*(1/240)/60,2) )`).
## Baseline
For the gesture-speech analyses we also created a surrogate condition. We randomly paired the speech from the passive-condition trials of a given participant with motion-tracking data from the movement conditions of that same participant (without scrambling the order of the speech and motion time series within these falsely paired trials). This surrogate, randomly paired condition allowed us to exclude the possibility that any effects of movement were due to chance correlations inherent to the structure of speech and movement, rather than to correlations arising from the coupling of speech and movement. We only use this surrogate control condition as a contrast when analyzing the temporal relation between speech and movement.
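
A conceptual sketch of this random pairing (data-frame and column names are hypothetical; the actual pairing code is part of the OSF materials):

```{r surrogate_pairing_sketch, eval = FALSE}
set.seed(42)

# speech_passive: list of per-trial data frames with passive-condition acoustics (hypothetical)
# motion_moving:  list of per-trial data frames with wrist/arm-condition motion traces (hypothetical)
# Both lists come from the same participant; the time order within each series is left intact.
make_surrogate <- function(speech_passive, motion_moving) {
  paired_idx <- sample(seq_along(motion_moving), length(speech_passive), replace = TRUE)
  Map(function(sp, mi) {
    mo <- motion_moving[[mi]]
    n  <- min(nrow(sp), nrow(mo))                  # align on the shorter series
    cbind(sp[1:n, ], z_mov_surrogate = mo$z[1:n])  # falsely paired movement trace
  }, speech_passive, paired_idx)
}
```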
We computed the following measures to check whether our movement manipulation was successful and whether speech rates were comparable across conditions. Figure 3 shows a summary of the results for key manipulation check measures.
```{r exclude_data, echo=FALSE,warning = FALSE, message = FALSE, output = FALSE, results = 'hide'}
fd <- subset(fd, is.na(EXCLUDE)) #exclude data where we have motion tracking timing loss
```
## Manipulation Checks
### Movement Frequency
To ascertain whether participants moved their limbs within the target 1.33-Hz range, we performed a wavelet-based analysis [using R package 'WaveletComp'; @roschWaveletCompGuidedTour2014]. Wrist movements were performed at slightly faster rates (*M* = `r printnum( mean( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "WRIST"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "WRIST"], na.rm = TRUE))`) than arm movements (*M* = `r printnum( mean( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "ARM"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "ARM"], na.rm = TRUE))`), but in both cases the movements were distributed over the target range. This confirms that our movement manipulation was successful. For our surrogate control condition, the mean frequency of the artificially paired movement time series fell between both arm- and wrist-movement condition frequency distributions (*M* = `r printnum( mean( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "PASSIVE"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$dom_hz_mov[is.na(fd$EXCLUDE) & fd$condition == "PASSIVE"], na.rm = TRUE))`).
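
As an illustration, a minimal sketch of how a time-varying dominant movement frequency can be read off a 'WaveletComp' analysis (the column name z and the tuning parameters are illustrative and not necessarily the settings used for the reported analysis):

```{r wavelet_freq_sketch, eval = FALSE}
library(WaveletComp)

# df: data frame with a column z holding a vertical position trace sampled at 240 Hz
df <- data.frame(z = sin(2 * pi * 1.33 * seq(0, 20, by = 1 / 240)) + rnorm(4801, sd = 0.1))

wt <- analyze.wavelet(df, "z",
                      dt = 1 / 240,        # sampling interval in seconds
                      dj = 1 / 50,         # scale resolution
                      lowerPeriod = 1 / 4, # shortest period considered (4 Hz)
                      upperPeriod = 4,     # longest period considered (0.25 Hz)
                      make.pval = FALSE, verbose = FALSE)

# dominant movement frequency at each sample = 1 / period with maximal wavelet power
dom_hz <- 1 / wt$Period[apply(wt$Power, 2, which.max)]
```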
```{r speech info, echo=FALSE, message=FALSE, warning=FALSE, results='hide', output = FALSE}
#get vocalization cycles
time_p <- NA #initialize a temporary variable
time_p <- ave(fd$ENV, fd$unique_vocalization, FUN= function(x) max(x, na.rm = TRUE))#extract the highest amplitude observation during a vocalization
time_p <- ifelse(is.na(fd$unique_vocalization), NA, time_p) #if the vocalization is NA, the envelope max should be ignored
fd$time_peak <- ifelse(time_p!=fd$ENV, NA, fd$time_ms_rec) #instead of the amplitude, fill in the time of that max amplitude
fd$time_peak[!is.na(fd$time_peak)] <- ave(fd$time_peak[!is.na(fd$time_peak)],
fd$unique_trial[!is.na(fd$time_peak)],
FUN = function(x) c(0, diff(x)) ) #now for each trial get the difference of these timings
fd$time_peak <- ifelse(fd$time_peak == 0, NA, fd$time_peak) #only consider differences that are nonzero (this ignores the first difference observations)
fd$time_peak <- ifelse(is.infinite(fd$time_peak), NA, fd$time_peak) #ignore infinites that are produced for our missing timing data
fd$time_peak <- 1000/fd$time_peak #convert the interval to Hz by dividing 1 second (i.e., 1000 milliseconds) by it
#get average vocalization duration
time_p <- NA #re-initialize a temporary variable
fd$time_voc <- ave(fd$time_ms_rec, fd$unique_vocalization, FUN= function(x) max(x, na.rm = TRUE)-min(x, na.rm = TRUE)) #get for each unique vocalization its begin and end time, and subtract to obtain the duration of the vocalization
fd$time_voc <- ifelse(is.na(fd$unique_vocalization), NA, fd$time_voc) #only keep vocalization time for when vocalization != NA
fd$time_voc <- ifelse(is.infinite(fd$time_voc), NA, fd$time_voc) #ignore infinites that are produced for our missing timing data
fd$time_voc <- 1000/fd$time_voc #convert the duration into Hz
fd$time_voc <- ifelse(is.infinite(fd$time_voc), NA, fd$time_voc) #ignore infinites that are produced for our missing timing data
rm(time_p) #remove temporary variable
```
### Speech Rate
To describe speech rate, we calculated two measures, namely vocalization duration and vocalization interval (see Fig. 2 for examples); these measures are derived from information in the F0 track, together with the amplitude envelope for the interval calculation. The vocalization duration was defined as the length in milliseconds of an uninterrupted run of F0 observations. The vocalization interval was determined by identifying two consecutive runs of F0 observations (i.e., vocalization events) and determining the peak amplitude envelope of each of those vocalization events, so as to compare the relative timing between those peaks. This way we have a single time point for each vocalization event that can be compared with the next vocalization event's time point (i.e., the vocalization interval).
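
For illustration, a small sketch of how uninterrupted runs of F0 observations can be converted into vocalization events and durations (the helper function and demo vector are illustrative, not the exact implementation used in the analysis chunks):

```{r vocalization_runs_sketch, eval = FALSE}
# f0: an F0 track sampled at 240 Hz, with NA (or 0) where no voicing was detected
vocalization_durations_ms <- function(f0, fs = 240) {
  voiced <- !is.na(f0) & f0 > 0
  runs   <- rle(voiced)                   # uninterrupted runs of voiced/unvoiced samples
  runs$lengths[runs$values] * 1000 / fs   # voiced run lengths, converted to milliseconds
}

# demo: two vocalization events of 50 ms and 75 ms separated by silence
f0_demo <- c(rep(NA, 24), rep(200, 12), rep(NA, 48), rep(180, 18), rep(NA, 24))
vocalization_durations_ms(f0_demo)
```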
Figure 3 shows relatively uniform distributions for these speech measures. No clear 1:1 frequency coupling of movement with vocalization duration or vocalization interval, nor any other clear sign of polyrhythmic coupling of movement and speech, is observed [see e.g., @zelicArticulatoryConstraintsSpontaneous2015; @stoltmannSyllablepointingGestureCoordination2017]. Note, though, that there are other possible (acoustically defined) units of speech that might entrain to movement, which we do not pursue further here [@linHowHitThat2020a]. For the current report, we restrict ourselves to speech vocalization acoustics rather than speech-movement cycle dynamics, as the former is the confirmatory research topic of the current study.
To compare vocalization rates to movement, we computed the average vocalization duration and interval for each trial by tracking the time of uninterrupted runs of F0 observations and then converting the time in milliseconds to Hz. For the passive condition, the average vocalization duration was *M* = `r printnum( mean( fd$time_voc[fd$condition == "PASSIVE"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_voc[fd$condition == "PASSIVE"], na.rm = TRUE))`, and the vocalization interval was *M* = `r printnum( mean( fd$time_peak[fd$condition == "PASSIVE"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_peak[fd$condition == "PASSIVE"], na.rm = TRUE))`. For the wrist-movement condition, the vocalization duration was *M* = `r printnum( mean( fd$time_voc[fd$condition == "WRIST"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_voc[fd$condition == "WRIST"], na.rm = TRUE))`, and the vocalization interval was *M* = `r printnum( mean( fd$time_peak[fd$condition == "WRIST"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_peak[fd$condition == "WRIST"], na.rm = TRUE))`. For the arm-movement condition, the vocalization duration was *M* = `r printnum( mean( fd$time_voc[fd$condition == "ARM"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_voc[fd$condition == "ARM"], na.rm = TRUE))`, and the vocalization interval was *M* = `r printnum( mean( fd$time_peak[fd$condition == "ARM"], na.rm = TRUE))` Hz, *SD* = `r printnum( sd( fd$time_peak[fd$condition == "ARM"], na.rm = TRUE))`.
```{r example_time_series_code, echo=FALSE,warning = FALSE, message = FALSE, output = FALSE, results = 'hide'}
#make an example time series with acoustic and motion data
sample <- fd[fd$ppn == "31" & fd$trial == 4,] #pick a trial
sample$time <- sample$time_ms_rec-min(sample$time_ms_rec) #start the trial at 0 time
sample <- sample[sample$time > 15000 & sample$time < 22000,] #collect this bit of the data as corresponding to the waveform
sample$F0 <- ifelse(sample$F0 == 0, NA, sample$F0) #for plotting F0;s should be given NA instead of 0's when vocalization is absent
sample$time <- sample$time-min(sample$time) #start this sample of the trial at 0 time
#make plots, combine them with grid.arrange (which was later exported for some extra editing)
a <- ggplot(sample, aes(x= time)) + geom_line(aes(y = ENV), color = "purple", size= 1.3) + geom_line(aes(y = rescale(z_mov, c(0.8, 1.3)) )) + bluetheme + ylim( -0.3, 1.35)+ theme(axis.text.y = element_text(face = "bold", color="purple"))+ theme(axis.title.x=element_blank())
b <- ggplot(sample, aes(x= time)) + geom_line(aes(y = F0), color = "red", size = 0.8)+ geom_line(aes(y = rescale(z_mov, c(160, 220)))) + bluetheme + ylim(50, 250) + theme(axis.text.y = element_text(face = "bold", color="red"))+ theme(axis.title.x=element_blank())
c <- ggplot(sample, aes(x= time)) + geom_line(aes(y = dom_hz_mov), color = "cyan3", size = 1.3)+ geom_line(aes(y = rescale(z_mov, c(1.30, 1.33)))) + bluetheme + ylab("wavelet estimate frequency (Hz) ") + theme(axis.text.y = element_text(face = "bold", color="cyan3"))+ xlab("time in milliseconds")
#grid.arrange(a,b,c, nrow =3) # this is the figure that we further edited and is called next
```
Figure 2. Example movement, amplitude envelope, and F0 time series, with time-dependent movement-frequency estimates
```{r plot_example_time_series, fig.height= 5}
#load in the finally edited time series example
mypng <- stack(paste0(parentfolder, "/images/FigureTimeSeriesExample.png"))
plotRGB(mypng,maxpixels=1e500000)
```
*Note Figure 2*. A sample of about 7 seconds is shown. With the participant’s permission the speech sample is available at https://osf.io/2qbc6/. The smoothed amplitude envelope in purple traces the waveform maxima. The F0 traces show the concomitant vocalizations in Hz, with an example of a vocalization interval and a vocalization duration (which were calculated for all vocalizations). The bottom panel shows the continuously estimated movement frequency in cyan, which hovers around 1.33 Hz. In all these panels, the co-occurring movement is plotted in arbitrary units (a.u.) to show the temporal relation between movement phases and the amplitude envelope, F0, and the movement frequency estimate. In our analysis, we refer to the maximum extension and deceleration phases as relevant moments for speech modulations. In this example, a particularly dramatic acoustic excursion occurs during a moment of deceleration of the arm movement, possibly an example of gesture-speech physics.
Figure 3. Summary of movement-frequency, vocalization duration and vocalization interval
```{r manipulation_checkplot, echo=FALSE, message=FALSE, warning=FALSE, results='hide', fig.width=5, fig.height=8}
#get average vocalization cycle
dd <- fd
dd$condition2 <- ifelse(dd$condition == "PASSIVE", "PASSIVE (Random Pairing)",
as.character(dd$condition)) #rename the passive condition to "PASSIVE (Random Pairing)" for assessing vocal-motor coupling
dd$condition2 <- ordered(dd$condition2, levels = c("PASSIVE (Random Pairing)", "WRIST", "ARM")) #reorder the levels
dd$condition <- ordered(dd$condition, levels = c("PASSIVE", "WRIST", "ARM")) #reorder the levels
#plot relevant summary data for the vocalization cycles (time_peak), and vocalizaiton duration (time_voc)
a <- ggplot(dd, aes(x = dom_hz_mov)) + geom_density()+bluetheme + facet_grid(.~condition2)+ xlim(0.1,3)+ geom_vline(xintercept = 1.33, color = "red", size= 0.4) + xlab("movement frequency (Hz)")+ theme(strip.text.x = element_text(size = 7))
b <- ggplot(dd, aes(x = time_peak)) + geom_density()+bluetheme + facet_grid(.~condition)+ geom_vline(xintercept = 1.33, color = "red", size= 0.4) +xlim(0.1, 6) + xlab("vocalization interval (Hz)")+ theme(strip.text.x = element_text(size = 7))
c <- ggplot(dd[!duplicated(dd$unique_vocalization),], aes(x = time_voc)) + geom_density()+bluetheme + facet_grid(.~condition)+ geom_vline(xintercept = 1.33, color = "red", size= 0.4) +xlim(0.1, 6) + xlab("vocalization duration (Hz)") + theme(strip.text.x = element_text(size = 7))
#grid.arrange(a,b,c,nrow = 3)
#load in the image
mypng <- stack(paste0(parentfolder, "/images/Figure3.png"))
plotRGB(mypng,maxpixels=1e500000)
```
*Note Figure 3*. Density distributions of movement frequency, vocalization interval, and vocalization duration are shown. There was no movement in the passive condition; instead, we display the frequency information of the randomly paired movement time series used in the surrogate baseline pairing. The red vertical line indicates the target movement frequency of 1.33 Hz.
# Results
## Overview of analyses
We report three main analyses to show that gesture-speech physics is present in fluent speech. Firstly, we assess overall effects of movement condition on vocalization acoustics (F0 and the amplitude envelope); these would support our hypothesis that upper limb movement—and, especially, high-impulse movement—constrains fluent speech acoustics. Secondly, we assess whether vocalization acoustic modulations are observed at particular phases of the movement cycle, which gesture-speech physics holds should occur at moments of peaks in deceleration. Thirdly, we assess whether a continuous estimate of upper limb physical impulse through deceleration rate predicts vocalization acoustic peaks, which would support the gesture-speech physics hypothesis that physical impulses are transferred onto the vocalization system.
The following generally applies to all analyses. For hypothesis testing, we fitted linear mixed-effects regression models [using R-package ‘nlme’; @pinheiroNlmeLinearNonlinear2019] and non-linear generalized additive models (GAM) [using R-package 'gam'; @hastieGamGeneralizedAdditive2019], with a random intercept for participants by default.
## Acoustic correlates of movement condition
Figure 4 shows the average F0 and amplitude envelope (z-scaled for participants) per trial per condition. The passive condition had generally lower levels of F0 and amplitude envelope as compared to the arm- and wrist-movement conditions. Furthermore, the higher-impulse arm-movement condition generally had higher levels of F0 and amplitude envelope as compared to lower-impulse wrist-movement condition.
Table 1 shows the results of the linear mixed-effects regression analysis. For the amplitude envelope, the passive condition had a lower average amplitude envelope than both the wrist-movement condition and the arm-movement condition. After accounting for sex differences in F0 (males had an approximately 73-Hz-lower F0), the wrist-movement condition showed an increase of about 1.6 Hz in average F0 compared to the passive condition, but this was not statistically reliable. Further, the arm-movement condition increased in average F0 by 3.5 Hz over the passive condition.
Figure 4. Average F0 and amplitude envelope (ENV) per trial per condition
```{r plot_avF0_avENV, echo=FALSE, message = FALSE, warning = FALSE, fig.width=5}
#average acoustics plots
#average F0 per trial
fd$av_f0 <- ave(fd$F0z, fd$unique_trial, FUN = function(x) mean(x, na.rm = TRUE)) #get the average F0
#plot F0
a <- ggplot(fd[!duplicated(fd$av_f0),], aes(x = condition, y = av_f0)) + geom_violin(color = "red", alpha = 0.6) +
geom_boxplot(alpha = 0.2) + geom_beeswarm(priority='density',cex=1.0, size= 0.4) + ylab("average F0 per trial (z-scaled per participant)") + xlab("condition") + ggtitle("Vocalization F0") + bluetheme
#average Amplitude vocalization per trial
#note here that we take the amplitude during a vocalization (i.e., when F0 is observed), thereby ignoring voiceless consonants
fd$av_ENV[!is.na(fd$F0z)] <- ave(fd$ENVz[!is.na(fd$F0z)], fd$unique_trial[!is.na(fd$F0z)], FUN = function(x) mean(x, na.rm = TRUE))
#plot ENV
b <- ggplot(fd[!duplicated(fd$av_ENV),], aes(x = condition, y = av_ENV)) + geom_violin(color= "purple", alpha = 0.6) +
geom_boxplot(alpha = 0.2) + geom_beeswarm(priority='density',cex=1.0, size= 0.4) + ylab("average ENV per trial (z-scaled per participant)") + xlab("condition") + ggtitle("Vocalization ENV") + bluetheme
#grid.arrange(b,a, nrow=1)
#load in the image
mypng <- stack(paste0(parentfolder, "/images/Figure4.png"))
plotRGB(mypng,maxpixels=1e500000)
```
*Note Figure 4*. Violin and box plots are shown for the average F0 and amplitude envelope (both *z*-scaled per participant) per trial (jittered points show individual trial observations).
```{r model_differences_absolute, output= FALSE, message = FALSE, warning = FALSE, results='hide'}
#compute statistics for unscaled F0 within participants
#effect condition F0
dd <- fd #convert to data to a temporary dataset for this analysis
dd$F0 <- ifelse(dd$F0 == 0, NA, dd$F0) #set F0 to NA when it is not observed (i.e., 0, non-vocal) so that it does not affect our analysis
fd$av_f0 <- ave(dd$F0, dd$unique_trial, FUN = function(x) mean(x, na.rm = TRUE)) #get the mean F0 per trial
lmdat <- fd[!duplicated(fd$unique_trial),] #keep one row per unique trial, thereby keeping one mean F0 observation per trial; then run the models
model1f <- lme(av_f0~gender+condition, data = lmdat, random = ~1|ppn, method = "ML", na.action = na.exclude)
#random slopes for condition did not converge
#model1fran <- lme(av_f0~gender+condition, data = lmdat, random = ~condition|ppn, method = "ML", na.action = na.exclude)
cmod1f <- coef(summary(model1f)) #collect summary information of the f model
#effect condition ENV
dd <- fd #convert to data to a temporary dataset for this analysis
fd$av_ENV[!is.na(dd$F0z)] <- ave(fd$ENVz[!is.na(dd$F0z)], fd$unique_trial[!is.na(dd$F0z)], FUN = function(x) mean(x, na.rm = TRUE))
lmdat <- dd[!duplicated(dd$av_ENV),]
model1e <- lme(av_ENV~condition, data = lmdat, random = ~1|ppn, method = "ML", na.action = na.exclude)
#random slopes for condition did not converge
#model1eran <- lme(av_ENV~gender+condition, data = lmdat, random = ~condition|ppn, method = "ML", na.action = na.exclude)
cmod1e <- coef(summary(model1e)) #collect summary information of the e model
```
\pagebreak
Table 1. Linear mixed effects for effects of condition on F0 and amplitude envelope (ENV)
```{r tablesmean1, warning = FALSE, echo = FALSE,fig.width=6}
#make tables
tm <- cbind(contrast = c("intercept", "Wrist vs. Passive", "Arm vs. Passive"," ",
"intercept", "Male vs. Female", "Wrist vs. Passive", "Arm vs. Passive"), b = c(round(cmod1e[,"Value"],3),"",round(cmod1f[,"Value"],3)),
SE = c( round(cmod1e[,"Std.Error"],3),"", round(cmod1f[,"Std.Error"],3)),
df = c(round(cmod1e[,"DF"],0)," ", round(cmod1f[,"DF"],0)),
p = c(round(cmod1e[,"p-value"],4),"", round(cmod1f[,"p-value"],4)))
rownames(tm) <- c("ENV (z-scaled)", "", ""," ",
"F0 (Hz)", "", "", "")
tm[,5] <- ifelse(tm[,5] == 0, "< .0001", tm[,5])
# at most 4 decimal places
knitr::kable(tm, digits = 3, align = "c", booktabs = T)
```
## Coupling of vocalization acoustics and movement
Having ascertained in the previous analysis that acoustics were modulated for movement versus no movement, we further need to confirm that such modulations occur at particular moments in the movement cycle. Figure 5 shows the main results for all data, where we model over time the acoustic patterning in vocalizations around the maximum extension of the movement cycle, for all movement cycles that occurred. If there are particular moments in the movement cycle where vocalization is affected—for example, the moment when the hand starts decelerating (estimated from the data as shown in Figure 5)—we would expect acoustic modulations (peaks) at such moments of the movement cycle.
Just before the moment of maximum extension, the observed amplitude envelope shows a clear peak, most dramatically for the arm-movement condition but also for the wrist-movement condition. For speech in the passive condition that was randomly paired with movement, this was not the case; this provides evidence that the results observed in the arm- and wrist-movement conditions are not due to mere chance. For F0, the pattern is somewhat less clear, but positive peaks still occur just before the maximum extension. These findings replicate our earlier work on steady-state vocalization and mono-syllabic utterances, showing that moments of peak deceleration also show peaks in acoustics [@pouwEnergyFlowsGesturespeechinpress; @pouwGesturespeechPhysicsBiomechanical2019].
To test whether the trajectories are indeed non-linear and reliably different from the passive condition, we performed generalized additive modeling (GAM), a type of non-linear mixed-effects procedure. GAM is a popular time-series analysis method in phonetics that allows the automatic modeling of more (and less) complex non-linear patterns by combining a set of smooth basis functions. Furthermore, GAM allows for testing whether those non-linear trajectories are modulated depending on some grouping of the data [see e.g., @wielingAnalyzingDynamicPhonetic2018a]. We assessed with GAM the trajectory of acoustics within 800 milliseconds around the maximum extension of the movement. We chose 800 milliseconds (-400 to +400 ms) as this is about the duration of a 1.33-Hz cycle (1000/1.33 Hz = 752 ms) with an added margin of error of about 50 ms. The model results, with random slopes and intercepts for participants, are shown in Table 2.
Firstly, for all models, tests for non-linearity of the trajectories were statistically reliable (*p*'s < .0001), meaning that there were peaks or valleys in acoustics over the movement cycle rather than a flat linear trend (Figure 6). As shown in Table 2, our results replicate the general finding that the wrist-movement condition led to reliably different non-linear peaks in acoustics as compared to the passive condition (*p* < .001). Moreover, this effect—relative to the passive condition—is even more extreme for the arm-movement condition (*p* < .001). Figure 6 provides the fitted trajectories for the GAM models.
For readers interested in individual differences in trajectories, we have created interactive graphs for each participant’s average amplitude envelope trajectories (https://osf.io/a423h/) and F0 trajectories (https://osf.io/fdzwj/).
\pagebreak
Figure 5. Average observed vocalization acoustics relative to the moment of maximum extension
```{r movementplot_avF0_avENV, echo=FALSE, message=FALSE, warning=FALSE, fig.height=8}
dd <- fd[is.na(fd$EXCLUDE),] #make a temporary dataset for this analysis
dd$dist <- -(dd$filled_z_min-dd$time_ms_rec) #get the time relative to the maximum extension (i.e., z_min)
dd$ENVz <- ifelse(is.na(dd$F0z), NA, dd$ENVz) #only keep amplitude reading when there is a vocalization
#The next repeated procedure computes, for each measure (F0z, ENVz, etc.), the average reading at each distance from maximum extension, for each condition and each participant separately
dd$averageF0 <- ave(dd$F0z, dd$dist, dd$condition, dd$ppn, FUN = function(x) mean(x, na.rm = TRUE))
dd$averageENV <- ave(dd$ENVz, dd$dist, dd$condition, dd$ppn, FUN = function(x) mean(x, na.rm = TRUE))
dd$acc <- ave(dd$z_mov, dd$unique_trial, FUN = function(x) -1*scale(c(0,0,diff(diff(x))))) #second difference of vertical position (acceleration), sign-flipped and z-scaled per trial
dd$averagedecc <- scale(ave(dd$acc, dd$dist, dd$condition, dd$ppn, FUN = function(x) mean(x, na.rm = TRUE)) )
dd$averagv <- scale(ave(dd$v, dd$dist, dd$condition, dd$ppn, FUN = function(x) mean(x, na.rm = TRUE)))
dd$averagez <- scale(ave(dd$z_mov, dd$dist, dd$condition, dd$ppn, FUN = function(x) mean(x, na.rm = TRUE)))
dn <- dd[!duplicated(paste0(dd$condition, dd$dist, dd$ppn)),] #keep only unique averaged trajectories, by keeping one row per condition, distance from z_min, and participant
dn1 <- dn #make another temporary dataset for plotting results
dn1$condition <- ifelse(dn1$condition == "PASSIVE", "PASSIVE (Random Pairing)", as.character(dn1$condition))
dn1$condition <- factor(dn1$condition, levels = c("PASSIVE (Random Pairing)", "WRIST", "ARM"))
#plot the average Envelope per distance
a <- ggplot(dn1, aes(x = dist, y = averageENV)) + geom_smooth(color = "purple") +
facet_grid(.~condition) + xlim(-400, 400) + theme_bw() + ggtitle("Vocalization Amplitude") + ylab("time from maximum extension")+
geom_vline(xintercept = 0, linetype = "dashed") + bluetheme + theme(axis.text.x = element_text(face="bold",
angle=45))+ geom_vline(xintercept =-100, color= "blue", linetype = "dashed")+ labs(x=NULL) +ylab("z-scaled ENV")
#plot the average F0 per distance
b <- ggplot(dn1, aes(x = dist, y = averageF0)) + geom_smooth(color = "red") +
facet_grid(.~condition) + xlim(-400, 400) + theme_bw() + ggtitle("Vocalization F0") + ylab("time from maximum extension")+
geom_vline(xintercept = 0, linetype = "dashed") + bluetheme+ theme(axis.text.x = element_text(face="bold",
angle=45))+ geom_vline(xintercept =-100, color= "blue", linetype = "dashed") + labs(x=NULL)+ylab("z-scaled F0")
#plot the average acceleration per distance
c <- ggplot(dn[dn$condition != "PASSIVE",], aes(x = dist)) +
geom_smooth(aes(y = averagedecc), color = "blue", size = 2) +
geom_smooth(aes(y = averagez), color = "black", size = 2) +
facet_grid(.~condition) + xlim(-400, 400) + theme_bw() + ggtitle("Vertical position and acceleration") + ylab("time from maximum extension")+ geom_vline(xintercept = 0, linetype = "dashed") + bluetheme + theme(axis.text.x = element_text(face="bold",
angle=45)) + geom_vline(xintercept =-100, color= "blue", linetype = "dashed")+
xlab("time relative to maximum extension")+ylab("movement (acceleration)")
#grid.arrange(a,b,c)
#load in the image
mypng <- stack(paste0(parentfolder, "/images/Figure5.png"))
plotRGB(mypng,maxpixels=1e500000)
#these plots were generated to provide information about individual variation and were put online on the OSF
extra <- ggplot(dn1, aes(x = dist, y = averageENV, color = as.factor(ppn))) + geom_smooth(size = 0.3, alpha= 0.3) +
facet_grid(.~condition) + xlim(-400, 400) + theme_bw() + ggtitle("Vocalization Amplitude") + ylab("time from maximum extension")+
geom_vline(xintercept = 0, linetype = "dashed") + bluetheme + theme(axis.text.x = element_text(face="bold",
angle=45))+ geom_vline(xintercept =-100, color= "blue", linetype = "dashed")+ labs(x=NULL) +ylab("z-scaled ENV")+ ggtitle("individual variation of amplitude envelope trajectories around maximum extension")
extra2 <- ggplot(dn1, aes(x = dist, y = averageF0, color = as.factor(ppn))) + geom_smooth(size = 0.3, alpha= 0.3) +
facet_grid(.~condition) + xlim(-400, 400) + theme_bw() + ggtitle("Vocalization Amplitude") + ylab("time from maximum extension")+
geom_vline(xintercept = 0, linetype = "dashed") + bluetheme + theme(axis.text.x = element_text(face="bold",
angle=45))+ geom_vline(xintercept =-100, color= "blue", linetype = "dashed")+ labs(x=NULL) +ylab("z-scaled F0")+ ggtitle("individual variation of F0 trajectories around maximum extension")
```
*Note Figure 5*. For the upper two panels the average acoustic trajectory is shown around the moment of maximum extension (t = 0, dashed black line). In the lower panel, we have plotted the *z*-scaled average vertical displacement of the hand and the *z*-scaled acceleration trace. The blue dashed vertical line marks the moment where the deceleration phase starts, which aligns with peaks in acoustics.
Figure 6. Fitted trajectories GAM
```{r table_anddiff, echo=FALSE, message=FALSE, warning=FALSE, results = 'hide'}
#model differences with respect to temporal distance from z_min by using generalized additive modeling (GAM)
#Keep only data that are roughly within one 1.33-Hz cycle; otherwise estimates are poor as there are too few data points at larger intervals
CC <- dn[abs(dn$dist) < 400,] #make a temporary subdataset where the absolute temporal distance from max extension is 400 ms
#PERFORM GAM
#m1 <- bam(averageENV~ condition + s(dist, by=as.factor(condition)) + s(ppn,bs="re"),data=CC) #we used random slopes as it converged
m1r <- bam(averageENV~ condition + s(dist, by=as.factor(condition)) + s(ppn, bs="re") + s(ppn, condition, bs = "re"),data=CC) #random intercepts and random slopes for participants
mod1 <- summary(m1r) #collect GAM data for average Envelope
#m2 <- bam(averageF0~ gender+condition + s(dist, by=as.factor(condition)) + s(ppn,bs="re"),data=CC) #we used random slopes as it converged
m2r <- bam(averageF0~ gender+condition + s(dist, by=as.factor(condition)) + s(ppn, bs="re") + s(ppn, condition, bs = "re"),data=CC) #random intercepts and random slopes for participants
mod2 <- summary(m2r) #collect GAM data for average F0
#plot the fitted values per condition
#par(mfrow=c(1,2))
#plot_smooth(m1r,view="dist", plot_all="condition",ylab = 'ENV')
#plot_smooth(m2r,view="dist", plot_all="condition",ylab = 'F0')
#load in the image
mypng <- stack(paste0(parentfolder, "/images/Figure6.png"))
plotRGB(mypng,maxpixels=1e500000)
```
Table 2. Model results for GAM analysis
```{r tables_GAM, echo=FALSE, output = FALSE, message = FALSE}
tm2 <- cbind(contrast = c("intercept", "Wrist vs. Passive", "Arm vs. Passive"," ",
"intercept", "Male vs. Female", "Wrist vs. Passive", "Arm vs. Passive"),
b = c(round(mod1$p.coeff,3),"",round(mod2$p.coeff,3)),
SE = c( round(mod1$se[1:3], 3),"", round(mod2$se[1:4],3)),
t = c(round(mod1$p.t,3),"", round(mod2$p.t,3)),
p = c(round(mod1$p.pv,4),"", round(mod2$p.pv,4)))
rownames(tm2) <- c("ENV (z-scaled)", "", ""," ",
"F0 (Hz)", "", "", "")
tm2[,5] <- ifelse(tm2[,5] == 0, "< .0001", tm2[,5])
knitr::kable(tm2, digits = 3, align = "c", booktabs = T)
```
*Note*. Model results are shown for the amplitude envelope (ENV; *z*-scaled) and F0 (Hz). For F0, we accounted for sex differences when estimating independent effects of condition.
## Degree of physical impetus and acoustic peaks
We have confirmed that speech acoustics are modulated around the deceleration phase, about 0-200 ms before the maximum extension. The effect of gesture-speech physics can be further examined by assessing how the forces produced by the upper limb movement predict acoustic peaks. Therefore, for all vocalizations that occurred between 200 and 0 ms before the maximum extension, we assessed whether the acoustic peak (i.e., maximum F0 or maximum amplitude envelope) was predicted by the maximum deceleration value (i.e., the minimum acceleration observation) observed in that 200-ms window. In previous research, we found that higher deceleration was related to higher amplitude envelope observations but not to F0 [@pouwEnergyFlowsGesturespeechinpress].
Figure 7 shows the general pattern of the results for the wrist- and arm-movement conditions. Here we averaged, per trial, the maximum deceleration values and the maximum F0 and maximum ENV of each vocalization event. Table 3 shows the results of linear mixed-effects models with random intercepts and slopes for participants, in which we regressed the trial-averaged maximum observed deceleration against the co-occurring trial-averaged vocalization acoustic peaks for the amplitude envelope and F0 (separately). Higher deceleration indeed predicted a higher amplitude envelope. This was also the case for F0, but only for arm movements (as opposed to wrist movements), as indicated by a statistically reliable interaction between condition and the max deceleration effect (*p*'s < .05). Together, these results demonstrate the roles of both acceleration and effector mass in producing physical impulses.
Figure 7. Relation between maximum deceleration and acoustic peak height
```{r impact analysis, echo=FALSE, warning = FALSE, message = FALSE}
#for this impact analysis we only do the movement conditions
dd <- subset(fd, condition != "PASSIVE")
dd$dist <- -(dd$filled_z_min-dd$time_ms_rec) #calculate the temporal distance from maximum extension again
dn <- subset(dd, dist < 0 & dist > -200) #subset the region around peak deceleration where higher acoustic modulations were found
dn$peak_env <- ave(dn$ENVz, dn$unique_vocalization, FUN = function(x) max(x, na.rm = TRUE) ) #get the peak amplitude
dn$peak_F0 <- ave(dn$F0z, dn$unique_vocalization, FUN = function(x) max(x, na.rm = TRUE) ) #get the peak in F0
dn$peak_acc <- ave(dn$acc, dn$unique_vocalization, FUN = function(x) max(abs(x), na.rm = TRUE) ) #get the peak deceleration (maximum absolute acceleration in this pre-extension window)
dn$peak_acc <- ifelse(dn$peak_acc < 0.1, NA, dn$peak_acc) #consider only cases when there was some acceleration/deceleration
dn$condition <- factor(as.character(dn$condition), levels = c("WRIST", "ARM"))
#plot deceleration and acoustic peaks averaged per trial
dn$peak_env_av <- ave(dn$peak_env, dn$unique_trial, FUN = function(x) mean(x, na.rm = TRUE))
dn$peak_F0_av <- ave(dn$peak_F0, dn$unique_trial, FUN = function(x) mean(x, na.rm = TRUE))
dn$peak_acc_av <- ave(dn$peak_acc, dn$unique_trial, FUN = function(x) mean(x, na.rm = TRUE))
a <- ggplot(dn[!duplicated(dn$unique_trial),],
            aes(x = peak_acc_av, y = peak_env_av, color = as.factor(ppn))) +
  geom_point(size = 2) + facet_grid(. ~ condition) + bluetheme +
  xlab("max deceleration") + ylab("max ENV (z-scaled)") +
  theme(legend.position = "none") +
  geom_smooth(method = "lm", alpha = 0, color = "purple", size = 1.5)
b <- ggplot(dn[!duplicated(dn$unique_trial),],
            aes(x = peak_acc_av, y = peak_F0_av, color = as.factor(ppn))) +
  geom_point(size = 2) + facet_grid(. ~ condition) + bluetheme +
  xlab("max deceleration") + ylab("max F0 (z-scaled)") +
  theme(legend.position = "none") +
  geom_smooth(method = "lm", alpha = 0, color = "red", size = 1.5)
#grid.arrange(b, a, nrow =2)
#load in the image
mypng <- stack(paste0(parentfolder, "/images/Figure7.png"))
plotRGB(mypng, maxpixels = 1e8) #large cap so the figure is not downsampled
```
*Note Figure 7*. The x-axis shows the average maximum deceleration per trial (the absolute value of the negative acceleration), where 0 indicates no deceleration and higher positive values indicate higher deceleration rates in cm per second squared. Each point represents trial-averaged values. Deceleration rates are more extreme in the arm condition than in the wrist condition. The y-axis shows the average maximum observed amplitude envelope (lower panel) and F0 (upper panel) for those moments of deceleration. Higher decelerations co-occur with higher acoustic peaks for arm movements (but not, or less so, for wrist movements).
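For readers who prefer tidyverse-style code, the trial-level aggregation used for Figure 7 and for the models in Table 3 can be sketched as follows; this assumes `dplyr` is available and simply mirrors the `ave()` calls in the chunk above (it is not the code that generated the reported results).

```{r trial_average_sketch, eval = FALSE, echo = TRUE}
# Sketch: trial-level means of the per-vocalization peaks, mirroring the ave()
# calls above (grouping by unique_trial); shown for readability, not evaluated.
library(dplyr)
trial_level <- dn %>%
  group_by(ppn, condition, unique_trial) %>%
  summarise(peak_env_av = mean(peak_env, na.rm = TRUE),
            peak_F0_av  = mean(peak_F0,  na.rm = TRUE),
            peak_acc_av = mean(peak_acc, na.rm = TRUE),
            .groups = "drop")
```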
\newpage
Table 3. Linear mixed-effects models of deceleration and acoustic peaks
```{r model_impact, echo = FALSE, warning = FALSE, message = FALSE}
#effect condition and peak deceleration on vocalization peak
lmdat <- dn[!duplicated(dn$unique_trial),] #only keep 1 observation per unique vocalization
model1e <- lme(peak_env_av~peak_acc_av, data = lmdat, random = ~peak_acc_av|ppn, method = "ML", na.action = na.exclude)
mod1e <- coef(summary(model1e))
model1f <- lme(peak_F0_av~condition*peak_acc_av, data = lmdat, random = ~peak_acc_av|ppn, method = "ML", na.action = na.exclude)
mod1f <- coef(summary(model1f))
tm3 <- cbind(contrast = c("Intercept", "Max Deceleration", " ",
                          "Intercept", "Arm vs. Wrist", "Max Deceleration", "Arm x Max Deceleration"),
             b  = c(round(mod1e[,"Value"], 3), "", round(mod1f[,"Value"], 3)),
             SE = c(round(mod1e[,"Std.Error"], 3), "", round(mod1f[,"Std.Error"], 3)),
             df = c(round(mod1e[,"DF"], 0), "", round(mod1f[,"DF"], 0)),
             p  = c(round(mod1e[,"p-value"], 4), "", round(mod1f[,"p-value"], 4)))
rownames(tm3) <- c("ENV (z-scaled)", "", " ",
"F0 (z-scaled)", "", "", "")
tm3[,5] <- ifelse(tm3[,5] == 0, "< .0001", tm3[,5])
knitr::kable(tm3, digits = 3, align = "c", booktabs = T)
```
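To unpack the condition by deceleration interaction for F0 reported above, one could additionally estimate the deceleration slope within each movement condition. The sketch below is a follow-up we did not report; it assumes `nlme` (which provides `lme` above) is loaded and reuses the trial-level data `lmdat`, and the subset models may need a simplified random-effects structure to converge.

```{r simple_slopes_sketch, eval = FALSE, echo = TRUE}
# Sketch: deceleration slope for F0 estimated separately per condition;
# not evaluated and not part of the reported models.
library(nlme)
slope_wrist <- lme(peak_F0_av ~ peak_acc_av,
                   data = subset(lmdat, condition == "WRIST"),
                   random = ~ peak_acc_av | ppn, method = "ML",
                   na.action = na.exclude)
slope_arm   <- lme(peak_F0_av ~ peak_acc_av,
                   data = subset(lmdat, condition == "ARM"),
                   random = ~ peak_acc_av | ppn, method = "ML",
                   na.action = na.exclude)
round(summary(slope_wrist)$tTable, 3)
round(summary(slope_arm)$tTable, 3)
```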
\pagebreak
# Discussion
In the current study, we demonstrated biomechanical effects of flexion-extension upper limb movements on speech by replicating, in fluent speech, effects previously obtained for steady-state vocalization and mono-syllabic utterances. We showed that rhythmically moving the wrist or arm affects vocalization acoustics by heightening the F0 and amplitude envelope of speech vocalizations, as compared to a passive control condition and to statistical control conditions. Finally, we showed that higher deceleration rates observed within 200 milliseconds before the moment of maximum extension of the movement materialize into more extreme acoustic peaks, demonstrating a role for acceleration and effector mass in gesture's effect on speech (i.e., an effect of physical impulse). Indeed, in all analyses, we observed that higher-mass arm movements affect speech more clearly than wrist movements.
Thus, stabilities in speaking may arise out of gesture-speech biomechanics in fluent speech as well as in more simplified speech sounds. This does not mean that speech prosody necessarily requires gesture for reaching prosodic targets. Indeed, other sensorimotor solutions are available for modulating F0 and intensity [e.g., vocal-fold tensioning, respiratory actions; @perrierMotorEquivalenceSpeech2015]. Furthermore, F0 is across the board less affected, if at all, in line with our previous work [@pouwEnergyFlowsGesturespeechinpress] and with work on the variable and often negligible role of respiratory action in F0 modulation [@petroneRelationsSubglottalPressure2017]. However, on the basis of the present work we argue that the biomechanical coupling of gesture and speech provides a 'smart' mechanism for 'timing' acoustic and movement expressions, and offers a way into the phylogenetic origin of gesture.
One may still wonder whether the current effects of upper limb movement could be produced by the attentional guidance required to move (in the sense of “I must stop my wrist here and move up”), rather than by the physical impulses produced by moving. In previous studies, we provided additional evidence, using a respiration belt, that tensioning around the trunk is involved in gesture-induced effects on vocal acoustics [@pouwEnergyFlowsGesturespeechinpress], and that postural stability moderates these effects [@pouwGesturespeechPhysicsBiomechanical2019]. The additional evidential strength of these previous studies for gesture-speech physics lies in part in the fact that a cognitive control account (a) does not readily predict that trunk tensioning is involved in synchronizing upper limb movement and speech, and (b) equally does not predict that standing versus sitting matters for synchronizing the speech and gesture trajectories. One could, in principle, explain the trunk-tensioning and postural-control effects with some new cognitive control account, but it does not seem parsimonious to do so in light of the gesture-speech physics alternative.
This reasoning from parsimony extends to the basic kinematic-acoustic analysis of the current study, too. We should therefore ask: Does a cognitive control account predict that arm motion versus wrist motion should lead to heightened acoustic effects? Does it predict that acoustic peaks arise around the deceleration phase rather than at the maximum extension? Does it predict that the degree to which a limb in motion decelerates scales with the acoustic peak that ensues? It is entirely possible that a particular cognitive control account could still accommodate all of these effects, or, more likely, a subset of them. But to do so, one needs to invoke some new hypothesis about how this cognitive control system produces these observables. This comes at the cost of parsimony, as new unobservable mechanisms are being invoked, especially when a more parsimonious explanation of these effects is available.
To be clear, this does not mean that we can fully exclude cognitive control, neither in principle nor, more forcefully, in degree. In fact, complex interactions likely arise between, on the one hand, the biophysical constraints of moving the upper limb while vocalizing and, on the other, a speech system organizing meaningful speech in the context of those constraints. It is likely, then, in contrast to previous studies with non-meaningful speech, that there are interactions, amplifying or counteracting gesture-speech physics, that are bidirectional with lexical, syntactic, and prosodic speech organization. For example, one might speed up the occurrence of a physical impulse so that it falls on a part of speech carrying lexical stress. Or one might laryngeally counteract the F0 effect of a physical impulse, because its acoustic consequence would constitute an inappropriate acoustic marker in the syntactic context of the sentence. These potential interactions between gesture-speech physics and meaningful speech organization must be studied in controlled experiments, but we think they are likely to be present in the current context.
Future research may thus consist of controlled experiments on syntactic, lexical, and prosodic interactions with biophysical constraints. On the other hand, more research is also needed on truly spontaneous speech. In the current study, participants were retelling a cartoon, which is a very different speaking context than, say, a conversation, in part because of the cognitive load of having to retell something accurately from recent memory, which can itself have contextualizing effects.
## Wider implications
Gesture-speech physics holds promise for revising our understanding of the emergence of communicative gesture in anatomically modern humans, both ontogenetically and phylogenetically.
It is well known that infants produce concurrent vocal-motor babbling. Furthermore, increased rhythmicity or frequency of motor babbling predicts speech-like maturation of vocalization [@ejiriRelationshipRhythmicBehavior1998; @ejiriCooccurencesPreverbalVocal2001]. Rather than a primarily neural development instantiating gesture-speech synchrony [@iversonHandMouthBrain2005], we suggest that gesture-speech physics is discovered during such vocal-motor babbling; this could provide the basis for infants to develop novel stable sensorimotor solutions for communication, such as a pointing gesture synchronized with a vocalization. Such sensorimotor solutions are, of course, likely solicited and practiced with the support of caretakers, yet without this biomorphological scaffolding, gesture-speech synchrony would not get off the ground ontogenetically.
Phylogenetic accounts have been central in discussions of the drivers of the depictive and referential functions of gesture [@tomaselloOriginsHumanCommunication2008; @kendonReflectionsGesturefirstHypothesis2017; @frohlichMultimodalCommunicationLanguage2019]. However, the current work supports the view that peripheral body movements may have served as a control parameter of an under-evolved vocal system. Previous work has proposed that the vocal system may have been evolutionarily exapted from rhythmic abilities in the locomotor domain [@ravignaniRhythmSpeechAnimal2019; @larssonBipedalStepsDevelopment2019], and viewing upper limb movements as constraints on the vocal system’s evolution fits neatly with such views.
When our species became bipedal, the respiratory system was liberated from upper-limb locomotor perturbations. We know that breathing (and vocalization) cycles often couple rigidly 1:1 with locomotion cycles in quadrupeds [@carrierEnergeticParadoxHuman1984], limiting what can be done (or communicated) in one breath. For example, the vocalization acoustics of flying bats are synchronized with their wing beats [@lancasterRespiratoryMuscleActivity1995]. Bipedalism, however, did not only free respiration from locomotion; it freed the upper limbs, too, allowing these highly skilled articulators to modulate a possibly less skilled respiratory-vocal system. Gestures, then, may have played a role in the complexification of the respiratory system in our species, which has been argued to have occurred in the service of speech evolution [@maclarnonEvolutionHumanSpeech1999].
Thus, gesture-speech physics is not culturally specific, as other animals can do it, too [e.g., bats; @lancasterRespiratoryMuscleActivity1995]. It can further be related to other species, such as orangutans, which deepen their vocalizations by cupping their hands in front of their mouth [@hardusToolUseWild2009]. Other animals have also been found to be sensitive to body-related information in sound, in that body size and strength can be detected from vocalizations alone [@pisanskiVoiceModulationWindow2016; @ghazanfarVocaltractResonancesIndexical2007]; humans are able to do this with some accuracy as well [@pisanskiReturnOzVoice2014], even when blind from birth [@pisanskiCanBlindPersons2016]. In a recent experiment, we found that listeners are exquisitely sensitive to gesture-modulated acoustics: Listeners can synchronize their own upper limb movements with those of a vocalizer simply by listening to that person producing a steady-state vocalization while rhythmically moving their wrist or arm [@pouwAcousticInformationUpper2020; @pouwAcousticSpecificationUpper2019]. Thus, bodily dynamics can imprint on the (human) voice, and this can be informative for listeners.
To conclude, gesture-speech physics opens up the possibility that gesture may have evolved as a control parameter on vocal actions. This ecological revision [@kuglerInformationNaturalLaw1987; @turveyMediumHapticPerception2014] of gesture-speech coupling provides a solid phylogenetic basis for the evolution of multimodal behavior, whereby peripheral bodily tensioning naturally formed coalitions with sound-producing organs that were still very much under development.
\newpage
# References
```{r create_references}
r_refs(file = "r-references.bib")
r_refs(file = "mybib.bib")
```
\begingroup
\setlength{\parindent}{-0.5in}
\setlength{\leftskip}{0.5in}
<div id = "refs"></div>
\endgroup