# Week 4: Psychoacoustics

This week we will explore the field of *[psychoacoustics](https://en.wikipedia.org/wiki/Psychoacoustics)*, the study of how humans perceive and interpret sound, bridging the gap between physical sound properties and subjective auditory experiences. This will form the foundation for more detailed investigations of "vertical" and "horizontal" sound perception in the coming weeks.

## Introduction to Psychoacoustics

Psychoacoustics is a branch of science that examines the psychological and physiological responses associated with sound, including how we detect, differentiate, and interpret auditory stimuli. It seeks to answer questions such as: Why do some sounds seem louder than others, even if they have the same physical intensity? How do we distinguish between different musical instruments playing the same note? What makes certain sounds pleasant or unpleasant?

Psychoacoustics explores the relationship between the measurable, physical properties of sound (such as frequency, amplitude, and harmonic relationships) and the way these sounds are perceived by the human ear and brain (such as pitch, loudness, and timbre). Insights from psychoacoustics are applied in audio engineering, music production, hearing aid design, and the development of audio codecs (such as MP3), which exploit perceptual limitations to compress audio data efficiently.

The roots of psychoacoustics can be traced back to the 19th century, with foundational work by scientists such as [Hermann von Helmholtz](https://en.wikipedia.org/wiki/Hermann_von_Helmholtz) (1821–1894), who investigated the sensations of tone and the physical basis of music. Later, [S. S. Stevens](https://en.wikipedia.org/wiki/Stanley_Smith_Stevens) (1906–1973) contributed to the development of psychophysical scaling, introducing methods to quantify the relationship between stimulus and perception (e.g., Stevens's power law for loudness perception). Over time, psychoacoustics has evolved into a multidisciplinary field, integrating insights from physics, psychology, neuroscience, and engineering.

## Anatomy of the Ear

The human ear is a remarkably complex organ that enables us to detect sounds and maintain our sense of balance. It is divided into three main sections: the outer ear, middle ear, and inner ear. Each part plays a distinct and crucial role in the auditory process.

### The outer ear

The *[outer ear](https://en.wikipedia.org/wiki/Outer_ear)* consists of the visible part called the *pinna* and the *ear canal*. Its primary function is to collect sound waves from the environment and funnel them toward the *eardrum* (sometimes called the *tympanic membrane*). The unique shape of the pinna helps us localize the direction of sounds. When sound waves reach the eardrum, they cause it to vibrate.

![Outer Ear](https://upload.wikimedia.org/wikipedia/commons/4/40/Ear-anatomy-text-small-en.svg)  
*Image Source: [Wikipedia - Outer Ear](https://en.wikipedia.org/wiki/Outer_ear)*

The eardrum in the human ear functions much like the membrane in a microphone. Both act as sensitive barriers that vibrate in response to incoming sound waves. When sound waves enter the ear canal, they strike the eardrum, causing it to vibrate. These vibrations are then transmitted through the ossicles to the inner ear, where they are converted into electrical signals for the brain to interpret. In a microphone, sound waves hit a thin, flexible membrane (diaphragm), causing it to vibrate. These vibrations are converted into electrical signals, which can then be amplified, recorded, or transmitted. As such, both the eardrum and the microphone membrane convert air pressure variations (sound waves) into mechanical vibrations. Both also serve as the first step in transforming acoustic energy into a form that can be further processed: biologically in the ear, electronically in the microphone.

### The middle ear

Beyond the eardrum lies the *[middle ear](https://en.wikipedia.org/wiki/Middle_ear)*, which contains three tiny bones known as the *ossicles*: the malleus, incus, and stapes. These bones act as a mechanical lever system, amplifying the vibrations from the eardrum and transmitting them to the oval window, a membrane-covered opening to the inner ear. This amplification is essential for efficiently transferring sound energy from air to the fluid-filled inner ear.

![Middle Ear](https://upload.wikimedia.org/wikipedia/commons/8/81/Blausen_0330_EarAnatomy_MiddleEar.png)  
*Image Source: [Wikipedia - Middle Ear](https://en.wikipedia.org/wiki/Middle_ear)*

This process is analogous to how a microphone connected to a preamplifier works. The vibrations from a microphone membrane are weak and are typically run through a *preamplifier* to boost the signal to a level suitable for further processing or recording. Similarly, the ossicles amplify the mechanical vibrations from the eardrum, ensuring that the signal is strong enough to be effectively transmitted into the inner ear for further processing by the auditory system.
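
The amount of amplification can be estimated with a back-of-the-envelope calculation. The sketch below assumes commonly quoted textbook figures that do not come from this text: an effective eardrum area of about 55 mm², an oval window area of about 3.2 mm², and an ossicular lever ratio of about 1.3. Real ears vary, so the result should be read as an order of magnitude only.

```python
import math

# Assumed textbook-style values (not taken from this text):
EARDRUM_AREA_MM2 = 55.0      # effective vibrating area of the eardrum
OVAL_WINDOW_AREA_MM2 = 3.2   # area of the oval window
LEVER_RATIO = 1.3            # mechanical advantage of the ossicles


def middle_ear_gain_db(eardrum_area=EARDRUM_AREA_MM2,
                       window_area=OVAL_WINDOW_AREA_MM2,
                       lever_ratio=LEVER_RATIO):
    """Estimate the pressure gain of the middle ear in decibels."""
    # Concentrating the same force on a smaller area raises the pressure;
    # the lever action of the ossicles multiplies it further.
    pressure_gain = (eardrum_area / window_area) * lever_ratio
    return 20 * math.log10(pressure_gain)


print(f"Estimated middle ear gain: {middle_ear_gain_db():.1f} dB")
```

Under these assumptions the gain comes out at roughly 27 dB, which is in the same range as a modest microphone preamplifier setting.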

### The inner ear

The *[inner ear](https://en.wikipedia.org/wiki/Inner_ear)* is where the mechanical vibrations are transformed into electrical signals that the brain can interpret as sound. The main structure responsible for this is the *[cochlea](https://en.wikipedia.org/wiki/Cochlea)*, a spiral-shaped, fluid-filled organ lined with thousands of tiny hair cells. As vibrations travel through the cochlear fluid, they cause the hair cells to move, generating nerve impulses that are sent to the brain via the auditory nerve.

![Inner Ear](https://upload.wikimedia.org/wikipedia/commons/1/14/Blausen_0329_EarAnatomy_InternalEar.png)  
*Image Source: [Wikipedia - Inner Ear](https://en.wikipedia.org/wiki/Inner_ear)*

The cochlea in the inner ear functions similarly to an analog-to-digital converter (ADC) in audio technology. Just as an ADC transforms continuous analog sound waves into discrete digital signals that can be processed by computers, the cochlea converts mechanical vibrations from sound into electrical nerve impulses that the brain can interpret. Inside the cochlea, thousands of hair cells respond to specific frequencies, effectively performing a biological form of frequency analysis and encoding the intensity and timing of sounds. This process is analogous to how an ADC samples and quantizes audio signals, enabling the transmission and interpretation of complex auditory information in a form the brain can understand.

The inner ear also contains the [vestibular system](https://en.wikipedia.org/wiki/Vestibular_system), which is crucial for maintaining balance and spatial orientation.


### Human auditory perception vs machine perception

Although they are different in many ways, it can be helpful to think about the human auditory system as analogous to an analog-to-digital converter (ADC). Both systems transform analog signals (sound waves) into a format that can be interpreted (heard or processed) by the brain or digital devices.

```mermaid
graph TD
    A[Sound Waves] --> B[Outer Ear]
    B --> B1[Pinna]
    B --> B2[Ear Canal]
    B --> C[Middle Ear]
    C --> C1["Eardrum (Tympanic Membrane)"]
    C --> C2["Ossicles (Malleus, Incus, Stapes)"]
    C --> D[Inner Ear]
    D --> D1[Cochlea]
    D1 --> D2[Hair Cells]
    D1 --> D3[Basilar Membrane]
    D --> E[Auditory Nerve]
    E --> F[Brain]
    F --> F1["Auditory Cortex (Sound Processing)"]
```

In comparison, capturing sound digitally uses the following signal chain:

```mermaid
graph TD
    A[Sound Waves] --> B[Microphone]
    B --> C[Pre-Amplifier]
    C --> D[Sampling Process]
    D --> E[Quantization]
    E --> F[Digital Output]
```
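
The sampling and quantization stages of this chain can be sketched in a few lines of NumPy. The function name `sample_and_quantize` and its default parameters (an 8 kHz sample rate and 4 bits) are illustrative assumptions chosen to make the quantization steps easy to see; real converters use much higher values.

```python
import numpy as np


def sample_and_quantize(freq_hz=440.0, sr=8000, duration=0.01, bits=4):
    """Sample a sine wave and round it to 2**bits evenly spaced levels."""
    # Sampling: evaluate the signal only at discrete time points.
    t = np.arange(int(sr * duration)) / sr
    x = np.sin(2 * np.pi * freq_hz * t)  # stand-in for the analog signal

    # Quantization: map [-1, 1] to integer codes 0..levels-1 and back.
    levels = 2 ** bits
    codes = np.round((x + 1) / 2 * (levels - 1)).astype(int)
    x_q = codes / (levels - 1) * 2 - 1
    return t, x, x_q


t, x, x_q = sample_and_quantize()
print("number of samples:", len(t))
print("max quantization error:", np.max(np.abs(x - x_q)))
```

Increasing `bits` shrinks the quantization error, and increasing `sr` captures faster changes in the signal.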


## Loudness

### Equal Loudness Contours

As we discussed last week, there is not a direct relationship between [sound pressure level (SPL)](https://en.wikipedia.org/wiki/Sound_pressure#Sound_pressure_level) and [loudness](https://en.wikipedia.org/wiki/Loudness). SPL is a physical measurement of the amplitude of a sound wave, typically expressed in decibels (dB), while loudness is a subjective perception: how loud a sound seems to a human listener. Although they are related, the same SPL can be perceived as having a different loudness depending on the frequency of the sound and the listener's hearing sensitivity.
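
As a reminder of the physical side of this relationship, SPL is computed from sound pressure relative to a reference of 20 µPa. The helper name `spl_db` below is an illustrative choice; the formula itself is the standard definition.

```python
import math

P_REF = 20e-6  # reference sound pressure in pascals (20 µPa)


def spl_db(pressure_pa):
    """Convert an RMS sound pressure in pascals to dB SPL."""
    return 20 * math.log10(pressure_pa / P_REF)


print(spl_db(20e-6))  # the reference pressure itself
print(spl_db(1.0))    # a sound pressure of 1 Pa
print(spl_db(2.0) - spl_db(1.0))  # effect of doubling the pressure
```

Doubling the pressure adds about 6 dB regardless of the starting level, but, as the rest of this section shows, it does not correspond to a fixed change in perceived loudness.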

This relationship was first systematically studied by Harvey Fletcher and Wilden A. Munson in the 1930s. Their experiments led to the creation of the *Fletcher-Munson curves*, also known as *[equal-loudness contours](https://en.wikipedia.org/wiki/Equal-loudness_contour)*. These curves show that the human ear is not equally sensitive to all frequencies: we are most sensitive to frequencies between roughly 2,000 and 5,000 Hz, and less sensitive to very low or very high frequencies. As a result, a low-frequency sound must be played at a higher SPL than a mid-frequency sound to be perceived as equally loud.

![Equal-loudness contour](https://upload.wikimedia.org/wikipedia/commons/4/47/Lindos1.svg)
*Image Source: [Wikipedia - Equal-loudness contour](https://en.wikipedia.org/wiki/Equal-loudness_contour)*

There is evidence suggesting that our heightened sensitivity to frequencies between 2,000 and 5,000 Hz has evolutionary roots. This frequency range corresponds closely to the spectral content of human speech, particularly the consonant sounds that are crucial for understanding language. Being able to detect subtle differences in these frequencies would have provided a survival advantage by improving communication, social bonding, and the ability to detect important environmental sounds such as a baby's cry or warning calls. Over time, natural selection may have favored individuals whose hearing was optimized for these frequencies, shaping the contours of human auditory perception.

Equal-loudness contours are crucial in audio engineering, music production, and hearing science. They explain why music or speech can sound different at low versus high volumes, and why audio equipment often includes "loudness" compensation to adjust for these perceptual differences. Understanding these contours helps us design better audio systems and create more accurate sound reproductions that align with human hearing.
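
One widely used engineering approximation of this frequency dependence is the A-weighting curve, which was loosely derived from an equal-loudness contour at moderate levels. The sketch below implements the standard analytic A-weighting formula in NumPy; it is a simplification and not a substitute for the measured contours. The function name `a_weighting_db` is an illustrative choice.

```python
import numpy as np


def a_weighting_db(freq_hz):
    """Return the A-weighting gain in dB for one or more frequencies."""
    f2 = np.asarray(freq_hz, dtype=float) ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.00 dB offset normalizes the curve to about 0 dB at 1 kHz.
    return 20 * np.log10(numerator / denominator) + 2.00


for f in [100, 500, 1000, 3000, 10000]:
    print(f"{f:>6} Hz: {float(a_weighting_db(f)):+.1f} dB")
```

The strongly negative values at low frequencies mirror the observation that low-frequency sounds need a higher SPL to be perceived as equally loud.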

```{exercise} Sine tone loudness
Try playing a sine tone at different frequencies, for example 100, 500, 1000, 3000, 5000, and 10,000 Hz. What do you notice about how loud each one sounds?
```
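
One possible way to generate the tones for this exercise is sketched below. The function name `sine_tone` and the chosen amplitude are assumptions; the key point is that every tone has the same physical amplitude, so any loudness difference you hear comes from your hearing (and your playback equipment), not from the signal.

```python
import numpy as np


def sine_tone(freq_hz, duration=1.0, sr=44100, amplitude=0.2):
    """Return a sine tone as a NumPy array with a fixed peak amplitude."""
    t = np.arange(int(sr * duration)) / sr
    return amplitude * np.sin(2 * np.pi * freq_hz * t)


frequencies = [100, 500, 1000, 3000, 5000, 10000]
tones = {f: sine_tone(f) for f in frequencies}

# In a Jupyter notebook, a tone can be played with, for example:
#   from IPython.display import Audio
#   Audio(tones[1000], rate=44100)
```

Keep the playback volume low, especially on headphones, and do not change it between tones.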

### Threshold of hearing


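- **Threshold of Hearing**: The quietest sound a listener can detect in a silent environment. It varies with frequency and is lowest (most sensitive) at around 2,000 to 5,000 Hz. [Learn more](https://en.wikipedia.org/wiki/Absolute_threshold_of_hearing)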

![Threshold of Hearing](https://upload.wikimedia.org/wikipedia/commons/d/da/Average_click-evoked_waveforms_and_Average_hearing_thresholds_for_younger_and_older_adults.jpg)
*Image Source: [Wikipedia - Hearing Thresholds](https://en.wikipedia.org/wiki/Absolute_threshold_of_hearing)*
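
The frequency dependence of the threshold can be approximated with a closed-form curve. The sketch below uses Terhardt's approximation, which is common in perceptual audio coding; it models an average young listener and is an assumption brought in for illustration, not data from the figure above. The function name `threshold_in_quiet_db` is hypothetical.

```python
import numpy as np


def threshold_in_quiet_db(freq_hz):
    """Approximate the absolute threshold of hearing in dB SPL."""
    khz = np.asarray(freq_hz, dtype=float) / 1000.0
    return (
        3.64 * khz ** -0.8                        # rise toward low frequencies
        - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)   # dip of best sensitivity near 3.3 kHz
        + 1e-3 * khz ** 4                         # rise toward high frequencies
    )


for f in [100, 1000, 3300, 10000]:
    print(f"{f:>6} Hz: {float(threshold_in_quiet_db(f)):6.1f} dB SPL")
```

The curve dips below 0 dB SPL around 3 kHz, which matches the region of highest sensitivity described under equal-loudness contours.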

### Hysteresis

Hysteresis refers to the phenomenon where the response of a system depends not only on its current state but also on its past states. In psychoacoustics, this can manifest in auditory perception, where the perception of a sound may be influenced by previously heard sounds or stimuli. This concept is crucial in understanding how the auditory system adapts and reacts over time to varying acoustic environments.

![Hysteresis](https://upload.wikimedia.org/wikipedia/en/e/e2/Hysteresis.png)
*Image Source: [Wikipedia - Hysteresis](https://en.wikipedia.org/wiki/Hysteresis)*

- **Threshold of Pain**: The level at which sound becomes physically painful, typically around 120 dB SPL and above. [Learn more](https://en.wikipedia.org/wiki/Sound_pressure)

## Masking


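- **Critical Bands**: Frequency ranges within which the ear integrates sound energy. Two sounds that fall inside the same critical band mask each other most strongly. [Learn more](https://en.wikipedia.org/wiki/Critical_band)
- **Temporal Masking**: One sound obscures another that occurs shortly before or after it, as opposed to simultaneous masking, where both sounds occur at the same time. [Learn more](https://en.wikipedia.org/wiki/Masking_(audio))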

![Audio Masking Graph](https://upload.wikimedia.org/wikipedia/commons/e/eb/Audio_Mask_Graph.png)
*Image Source: [Wikipedia - Audio Masking](https://en.wikipedia.org/wiki/Masking_(audio))*
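
Critical bands are often described on the Bark scale. The sketch below uses the commonly cited approximations by Zwicker and Terhardt for the Bark value and the critical bandwidth at a given frequency. These formulas are an assumption added for illustration, and the function names are hypothetical.

```python
import numpy as np


def hz_to_bark(freq_hz):
    """Approximate critical-band rate (in Bark) for a frequency in Hz."""
    f = np.asarray(freq_hz, dtype=float)
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


def critical_bandwidth_hz(freq_hz):
    """Approximate critical bandwidth (in Hz) around a center frequency."""
    khz = np.asarray(freq_hz, dtype=float) / 1000.0
    return 25 + 75 * (1 + 1.4 * khz ** 2) ** 0.69


for f in [100, 1000, 4000]:
    print(f"{f:>5} Hz: {float(hz_to_bark(f)):5.2f} Bark, "
          f"bandwidth about {float(critical_bandwidth_hz(f)):6.1f} Hz")
```

Note how the bandwidth grows with frequency: two tones 100 Hz apart fall within one critical band around 4 kHz but are close to a full band apart at 100 Hz, which affects how strongly they mask each other.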



## Pitch

Pitch is the perceptual attribute that allows us to order sounds on a scale from low to high. While pitch is closely related to the physical property of frequency (the rate at which a sound wave vibrates), it is ultimately a subjective experience shaped by the human auditory system.
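
One common model of the mapping from frequency to perceived pitch is the mel scale. The sketch below uses the HTK-style formula, which is only one of several variants in use (librosa, for example, defaults to a different one), so the exact numbers are an assumption of this example rather than a property of hearing.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Convert mels back to frequency in Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


for f in [100, 200, 1000, 2000, 8000, 16000]:
    print(f"{f:>6} Hz -> {float(hz_to_mel(f)):7.1f} mel")
```

Doubling the frequency from 100 to 200 Hz changes the mel value by a much larger factor than doubling it from 8,000 to 16,000 Hz, which reflects the compressed resolution at high frequencies.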

### Psychoacoustics of Pitch Perception

- **Frequency and Pitch**: Generally, higher frequencies are perceived as higher pitches and lower frequencies as lower pitches. However, the relationship is not strictly linear—our perception of pitch compresses at very high and very low frequencies.
- **Pitch Discrimination**: The human ear is remarkably sensitive to small changes in frequency, especially in the mid-frequency range (around 500–4000 Hz). The smallest detectable difference in pitch is called the *just noticeable difference* (JND), which can be as little as 0.3% of the frequency for trained listeners.
- **Missing Fundamental**: Even if the fundamental frequency of a complex tone is absent, the ear can infer its pitch from the pattern of overtones (harmonics). This phenomenon is known as the *missing fundamental* or *virtual pitch* effect.
- **Pitch of Complex Tones**: Most musical sounds are complex, containing multiple harmonics. The perceived pitch usually corresponds to the fundamental frequency, but the presence and strength of overtones can influence timbre and pitch clarity.
- **Pitch and Loudness**: Perceived pitch can shift slightly with changes in loudness, especially at very high or low frequencies.
- **Cultural and Musical Context**: Musical training and cultural background can affect pitch perception, such as the ability to recognize microtones or perceive pitch in non-Western musical scales.
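
The missing fundamental effect from the list above can be illustrated numerically. The sketch below builds a tone from harmonics 2 to 5 of 200 Hz, so the spectrum contains no energy at 200 Hz, and then estimates the repetition rate with a simple autocorrelation. Autocorrelation is used here only as a rough stand-in for periodicity-based pitch models; it is not a claim about how the auditory system works. All names and parameter values are illustrative.

```python
import numpy as np

sr = 8000                           # sample rate in Hz
f0 = 200.0                          # the fundamental that will be left out
t = np.arange(int(0.1 * sr)) / sr   # 100 ms of signal

# Sum of harmonics 2..5 (400, 600, 800, 1000 Hz); no component at 200 Hz.
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6))


def autocorr_pitch(signal, sr, fmin=50.0, fmax=1000.0):
    """Estimate the repetition rate of a signal from its autocorrelation peak."""
    signal = signal - np.mean(signal)
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sr / fmax)   # shortest period considered
    lag_max = int(sr / fmin)   # longest period considered
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / best_lag


spectrum = np.abs(np.fft.rfft(x))
bin_200 = int(round(f0 * len(x) / sr))   # FFT bin corresponding to 200 Hz
print("energy at 200 Hz:", spectrum[bin_200])
print("estimated pitch:", autocorr_pitch(x, sr), "Hz")
```

Even though the 200 Hz component is absent, the waveform still repeats every 5 ms, and the estimate lands on the missing fundamental.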

### Frequency
- **Infrasound**: Below 20 Hz. [Learn more](https://en.wikipedia.org/wiki/Infrasound)
- **Audible Sound**: 20 Hz to 20,000 Hz. [Learn more](https://en.wikipedia.org/wiki/Hearing_range)
- **Ultrasound**: Above 20,000 Hz. [Learn more](https://en.wikipedia.org/wiki/Ultrasound)

### Instruments
- **Piano**: About 27.5 Hz (A0) to 4186 Hz (C8) on a standard 88-key instrument. [Learn more](https://en.wikipedia.org/wiki/Piano)
- **Saxophone**: Known for its rich tone. [Learn more](https://en.wikipedia.org/wiki/Saxophone)
- **Just Noticeable Differences (JND)**: The smallest detectable frequency change. [Learn more](https://en.wikipedia.org/wiki/Just-noticeable_difference)
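
To get a feel for the size of a JND, the relative figure of 0.3% mentioned earlier can be converted to hertz and to cents (hundredths of a semitone). The helper names below are illustrative, and 0.3% is treated as a fixed assumption even though the actual JND depends on frequency, level, and listener.

```python
import math


def jnd_hz(freq_hz, relative_jnd=0.003):
    """Smallest detectable frequency change in Hz for a given relative JND."""
    return freq_hz * relative_jnd


def ratio_to_cents(ratio):
    """Convert a frequency ratio to cents (1200 cents per octave)."""
    return 1200 * math.log2(ratio)


for f in [500, 1000, 4000]:
    print(f"{f} Hz: JND of about {jnd_hz(f):.1f} Hz")

print(f"0.3% corresponds to about {ratio_to_cents(1.003):.1f} cents")
```

A 0.3% change is only about a twentieth of a semitone, far smaller than the steps of the equal-tempered scale.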

#### Further Reading
- [Pitch perception (Wikipedia)](https://en.wikipedia.org/wiki/Pitch_(music))
- [Missing fundamental (Wikipedia)](https://en.wikipedia.org/wiki/Missing_fundamental)
- [Psychoacoustics (Wikipedia)](https://en.wikipedia.org/wiki/Psychoacoustics)


## Auditory illusions

- **Masking effects**: Simultaneous and temporal masking.
- **Critical bands** and frequency resolution.
- **Auditory illusions**: Shepard tones, missing fundamental, binaural beats.

Activities:

- **Listening tests**: Compare sine waves, complex tones, and illusions.
- **Small experiments**: Measure the threshold of hearing or masking using software (e.g., Pure Data, Audacity).
- **Discussion**: How psychoacoustics informs music production and hearing aids.
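
As a starting point for the listening tests, a binaural beat stimulus can be generated with NumPy. The sketch below assumes a 440 Hz tone in the left channel and a 446 Hz tone in the right; the function name and parameter values are illustrative choices.

```python
import numpy as np


def binaural_beat(left_hz=440.0, right_hz=446.0, duration=5.0,
                  sr=44100, amplitude=0.2):
    """Return a stereo signal with a slightly different tone in each ear."""
    t = np.arange(int(sr * duration)) / sr
    left = amplitude * np.sin(2 * np.pi * left_hz * t)
    right = amplitude * np.sin(2 * np.pi * right_hz * t)
    return np.column_stack([left, right])  # shape: (n_samples, 2)


stereo = binaural_beat()
print(stereo.shape)
```

Neither channel contains a 6 Hz fluctuation on its own, so the illusion only works over headphones, where each ear receives a single tone. Played over loudspeakers, the two tones mix in the air and produce an ordinary acoustic beat instead.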


## Spatial perception

- **Spatial Hearing**: Localizing sound sources. [Learn more](https://en.wikipedia.org/wiki/Sound_localization)
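
One of the cues used for localization is the interaural time difference (ITD): a sound from the side reaches the nearer ear slightly earlier. A minimal sketch using Woodworth's spherical-head approximation is shown below. The head radius of 8.75 cm and the speed of sound of 343 m/s are assumed typical values, and the function name is hypothetical.

```python
import math


def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference for a distant source.

    azimuth_deg: 0 is straight ahead, 90 is directly to one side.
    """
    theta = math.radians(azimuth_deg)
    return head_radius_m / speed_of_sound * (theta + math.sin(theta))


for azimuth in [0, 30, 60, 90]:
    print(f"{azimuth:>2} degrees: {itd_seconds(azimuth) * 1e6:6.0f} microseconds")
```

Even for a source directly to one side, the difference is well under a millisecond, which gives a sense of the timing precision of the auditory system.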


## Psychoacoustics in technology

### Audio Visualizations

- **Waveform**: A visual representation of amplitude over time. [Learn more](https://en.wikipedia.org/wiki/Waveform)
- **Spectrogram**: Displays frequency content over time. [Learn more](https://en.wikipedia.org/wiki/Spectrogram)
- **Log Mel Spectrogram**: Mimics human hearing by applying the Short-Time Fourier Transform (STFT), Mel band-pass filters, and a logarithmic transformation to represent audio on a decibel scale. Widely used in audio-related tasks for its perceptual relevance. [Learn more](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
- **MFCCs (Mel-Frequency Cepstral Coefficients)**: Extracts and compresses audio features by applying the Discrete Cosine Transform (DCT) to the Log Mel spectrum. Commonly used in speech processing, music classification, and music information retrieval. [Learn more](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)
- **CQT (Constant-Q Transform)**: Uses a logarithmic frequency scale with exponentially spaced center frequencies and varying filter bandwidths. Ideal for musical note frequency extraction and analysis. [Learn more](https://en.wikipedia.org/wiki/Constant-Q_transform)


```python
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

print(np.__version__)  # NumPy is imported under the alias `np`

# Generate a test audio signal (sine wave + harmonics)
sr = 22050  # sample rate
duration = 2.0  # seconds
t = np.linspace(0, duration, int(sr * duration), endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t) + 0.2 * np.sin(2 * np.pi * 880 * t)

fig, axs = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle('Audio Representations', fontsize=16)

# Waveform: amplitude over time
librosa.display.waveshow(audio, sr=sr, ax=axs[0, 0])
axs[0, 0].set_title('Waveform')
axs[0, 0].set_xlabel('')
axs[0, 0].set_ylabel('Amplitude')

# Spectrogram: STFT magnitude on a linear frequency axis, in dB
S = np.abs(librosa.stft(audio, n_fft=1024, hop_length=256))
librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='hz', ax=axs[0, 1])
axs[0, 1].set_title('Spectrogram (dB)')

# Mel spectrogram: STFT power passed through 64 mel filters, in dB
S_mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
librosa.display.specshow(librosa.power_to_db(S_mel, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='mel', ax=axs[1, 0])
axs[1, 0].set_title('Mel Spectrogram (dB)')

# MFCC: first 13 cepstral coefficients of the log mel spectrum
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)
librosa.display.specshow(mfccs, x_axis='time', ax=axs[1, 1])
axs[1, 1].set_title('MFCC')

# CQT: 60 logarithmically spaced frequency bins
C = np.abs(librosa.cqt(audio, sr=sr, hop_length=256, n_bins=60))
librosa.display.specshow(librosa.amplitude_to_db(C, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='cqt_note', ax=axs[2, 0])
axs[2, 0].set_title('CQT (dB)')

# Hide the last empty subplot
axs[2, 1].axis('off')

plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
```

### Symbolic Representations

#### MIDI
MIDI (Musical Instrument Digital Interface) is a standard protocol for communicating musical performance data between electronic instruments and computers. It encodes information such as note pitch, velocity, duration, and control changes. [Learn more](https://en.wikipedia.org/wiki/MIDI)
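
MIDI stores pitch as an integer note number rather than a frequency. Assuming twelve-tone equal temperament and the usual convention that note 69 is A4 at 440 Hz, the conversion to hertz can be sketched as follows. The function name `midi_to_hz` is an illustrative choice.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Convert a MIDI note number to frequency in Hz (equal temperament)."""
    # Each semitone multiplies the frequency by 2**(1/12);
    # note 69 is A4 by convention.
    return a4_hz * 2 ** ((note - 69) / 12)


for name, note in [("A0", 21), ("C4", 60), ("A4", 69), ("C8", 108)]:
    print(f"{name} (MIDI {note}): {midi_to_hz(note):.2f} Hz")
```

Note numbers 21 and 108 correspond to the lowest and highest keys of a standard 88-key piano.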

#### ABC Notation
ABC Notation is a text-based music notation system that uses ASCII characters to represent musical scores. It is widely used for folk and traditional music due to its simplicity and compatibility with text-based tools. [Learn more](https://en.wikipedia.org/wiki/ABC_notation)

#### REMI
REMI (REvamped MIDI-derived events) is an enhanced representation of MIDI data designed to better capture musical rhythm and structure. It introduces features like Note Duration events, Bar and Position tokens, and Tempo events, making it suitable for music generation tasks. [Learn more](https://arxiv.org/abs/2002.00212)

#### MusicXML
MusicXML is an XML-based format for representing Western music notation. It encodes detailed musical elements such as notes, rests, articulations, and dynamics, making it ideal for sharing and analyzing sheet music. [Learn more](https://en.wikipedia.org/wiki/MusicXML)

#### Piano Roll
The Piano Roll is a visual representation of music, where time is displayed on the horizontal axis and pitch on the vertical axis. Notes are represented as rectangles, with their length indicating duration. It is commonly used in digital audio workstations (DAWs) for music editing and analysis. [Learn more](https://en.wikipedia.org/wiki/Piano_roll)
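
In code, a piano roll is often stored as a binary matrix with one row per pitch and one column per time step. The sketch below builds such a matrix from a short list of notes. The note format `(midi_pitch, start_beat, duration_beats)` and the resolution of four steps per beat are assumptions made for this example.

```python
import numpy as np


def piano_roll(notes, steps_per_beat=4, n_pitches=128):
    """Build a binary piano roll of shape (n_pitches, n_steps).

    notes: list of (midi_pitch, start_beat, duration_beats) tuples.
    """
    total_beats = max(start + dur for _, start, dur in notes)
    n_steps = int(np.ceil(total_beats * steps_per_beat))
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    for pitch, start, dur in notes:
        begin = int(round(start * steps_per_beat))
        end = int(round((start + dur) * steps_per_beat))
        roll[pitch, begin:end] = 1  # mark the note as active in these steps
    return roll


# A broken C major chord: C4 and E4 for one beat each, then G4 for two beats.
notes = [(60, 0, 1), (64, 1, 1), (67, 2, 2)]
roll = piano_roll(notes)
print(roll.shape)
print(roll[[60, 64, 67]])
```

This representation discards velocity and cannot distinguish one long note from two repeated notes without an additional onset marker, which is one reason event-based formats such as MIDI are often preferred for storage.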

#### Note Graph
A Note Graph is a graph-based representation of musical scores, where nodes represent notes and edges capture relationships such as sequence, onset, and sustain. This approach provides a structured way to analyze and model complex musical relationships. [Learn more](https://arxiv.org/abs/2006.05417)

- **[Ideas Roadshow: Believing Your Ears - Auditory Illusions](https://ideasroadshow.com/items/believing-your-ears-auditory-illusions)**  
    *Diana Deutsch in conversation with Howard Burton* (2015), Open Agenda Publishing.
