About audio sampling #3

Closed
17Skye17 opened this issue Dec 13, 2022 · 1 comment

Comments

@17Skye17

Hi, I noticed that in this `_get_audio` function, a separate spectrogram is generated for each sampled frame by `signal.spectrogram(samples, samplerate, nperseg=512, noverlap=353)`. For efficiency, would it be possible to generate one large spectrogram for the whole video and then sample from it? That would save some preprocessing cost, provided memory is not a problem.

Thanks for releasing this nice work!

```python
import glob

import numpy as np
import soundfile as sf
import torch
from scipy import signal


def _get_audio(self, idx, s, e):
    # torch.zeros needs a torch dtype (the original passed np.long)
    audio_mask = torch.zeros((1, self.opt.max_audio_frames), dtype=torch.long)
    audio = torch.zeros((self.opt.max_audio_frames, 1024, 128), dtype=torch.double)

    audio_folder = self.video_dict[idx].split('.')[:-1][0].replace('frames', self.opt.audio_pt)
    audio_folder_bk = audio_folder.replace('audio_raw', 'VGGSound_Audio_features_10s_aligned')
    self.save_path = audio_folder_bk
    # audio_folder = audio_folder.replace('playpen-iop','playpen-storage')

    total_num_wav = len(glob.glob(audio_folder + '/*.wav'))
    total_num_pt = len(glob.glob(audio_folder + '/*.pt'))

    total_fbank = []
    # if self.opt.max_audio_frames < total_num_wav: # for frame-wise fusion

    if self.my_len == 4816 or True:  # always True, so the precomputed .pt branch is always taken
        # load precomputed features at evenly spaced indices
        sample_indx = np.linspace(0, total_num_pt - 1, num=self.opt.max_audio_frames, dtype=int)
        for tmp_idx in sample_indx:
            fbank = torch.load(audio_folder + '/' + '{:04d}'.format(tmp_idx) + '.pt',
                               map_location=torch.device('cpu'))
            total_fbank.append(fbank)
    else:
        for tmp_idx in range(self.opt.max_audio_frames):  # total_num_wav self.opt.max_audio_frames
            ### loader for VGGSound
            try:
                # the whole clip is re-read on every iteration
                samples, samplerate = sf.read(audio_folder + '/' + '0000.wav')

                if samples.shape[0] > 16000 * (self.opt.yb_audio_length + 0.1):
                    # cut the tmp_idx-th of max_audio_frames evenly spaced segments
                    sample_indx = np.linspace(0, samples.shape[0] - 16000 * (self.opt.yb_audio_length + 0.1),
                                              num=self.opt.max_audio_frames, dtype=int)
                    samples = samples[sample_indx[tmp_idx]:sample_indx[tmp_idx] + int(16000 * self.opt.yb_audio_length)]
                else:
                    # repeat in case audio is too short
                    samples = np.tile(samples, int(self.opt.yb_audio_length))[:int(16000 * self.opt.yb_audio_length)]

                samples = np.clip(samples, -1.0, 1.0)

                # one spectrogram per sampled segment
                frequencies, times, spectrogram = signal.spectrogram(samples, samplerate, nperseg=512, noverlap=353)
                spectrogram = np.log(spectrogram + 1e-7)

                # per-segment standardization
                mean = np.mean(spectrogram)
                std = np.std(spectrogram)
                spectrogram = np.divide(spectrogram - mean, std + 1e-9)

                total_fbank.append(torch.tensor(spectrogram).unsqueeze(0).float())
            except Exception:
                print('Too short: ' + audio_folder + '/' + '{:04d}'.format(tmp_idx) + '.wav')
                continue

    # audio[:total_fbank.size(0)] = total_fbank
    # audio_mask[0, :total_fbank.size(0)] = 1
    # return audio, audio_mask
    total_fbank = torch.vstack(total_fbank)
    # audio_mask is still all zeros here, since the lines that fill it are commented out
    return total_fbank, audio_mask
```
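A minimal sketch of the proposal, assuming mono 16 kHz audio already loaded as a NumPy array: compute the log spectrogram once for the whole clip with the same `nperseg`/`noverlap` as the snippet, then cut `num_segments` evenly spaced windows of columns from it. The helper names (`log_spec`, `sample_from_full_spectrogram`) are hypothetical. Segment starts are snapped to multiples of the 159-sample hop, so they can differ slightly from the sample-level `np.linspace` offsets in `_get_audio`. With that alignment, each slice should match the spectrogram of the corresponding raw-audio segment, because `signal.spectrogram` detrends and windows every frame independently.

```python
import numpy as np
from scipy import signal

NPERSEG, NOVERLAP = 512, 353  # STFT settings taken from _get_audio
HOP = NPERSEG - NOVERLAP      # 159 samples between consecutive frames


def log_spec(samples, samplerate):
    """Log power spectrogram with the same STFT settings as _get_audio."""
    _, _, spec = signal.spectrogram(samples, samplerate, nperseg=NPERSEG, noverlap=NOVERLAP)
    return np.log(spec + 1e-7)


def sample_from_full_spectrogram(samples, samplerate, seg_len, num_segments):
    """Hypothetical helper: one spectrogram for the whole clip, then sliced per segment.

    seg_len is the segment length in samples; starts are snapped to the HOP grid.
    """
    full = log_spec(samples, samplerate)          # (freq_bins, total_frames)
    frames_per_seg = (seg_len - NOVERLAP) // HOP  # frames in one seg_len-sample segment
    last_start = full.shape[1] - frames_per_seg   # latest start frame that still fits
    if last_start < 0:
        raise ValueError("clip is shorter than one segment")
    starts = np.linspace(0, last_start, num=num_segments, dtype=int)
    out = []
    for s in starts:
        seg = full[:, s:s + frames_per_seg]
        # per-segment standardization, as in the original loop
        out.append((seg - seg.mean()) / (seg.std() + 1e-9))
    return np.stack(out), starts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 16000
    clip = rng.uniform(-1, 1, size=10 * sr)  # 10 s of synthetic noise
    segs, starts = sample_from_full_spectrogram(clip, sr, seg_len=sr, num_segments=8)
    print(segs.shape, starts)
```

The trade-off is memory: the full spectrogram holds 257 frequency bins × 1,004 frames for a 10 s clip at 16 kHz, whereas a single 1 s segment has 98 frames.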
@GenjiB
Owner

GenjiB commented Dec 18, 2022

Actually, computing a spectrogram over a long audio signal takes more time. That is why my code computes spectrograms only for short segments (e.g., 5 s, 10 s, 30 s, whatever you want).
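To put numbers on this: each column of the `signal.spectrogram` output is one 512-point FFT, so the STFT work grows roughly in proportion to the number of frames. With `nperseg=512` and `noverlap=353` the hop is 159 samples, and with `spectrogram`'s defaults (no boundary extension or padding) an N-sample input yields (N - 353) // 159 frames. The small sketch below uses hypothetical helper names and assumes 16 kHz audio. It compares a full clip against a set of short segments and ignores file I/O and the log/normalization steps.

```python
NPERSEG = 512             # window length used in _get_audio
NOVERLAP = 353            # overlap used in _get_audio
HOP = NPERSEG - NOVERLAP  # 159 samples between consecutive frames


def num_stft_frames(n_samples: int, nperseg: int = NPERSEG, noverlap: int = NOVERLAP) -> int:
    """Time columns scipy.signal.spectrogram returns for a 1-D input (no padding)."""
    if n_samples < nperseg:
        raise ValueError("input shorter than one STFT window")
    return (n_samples - noverlap) // (nperseg - noverlap)


def frames_for_segments(seg_samples: int, num_segments: int) -> int:
    """Total frames when num_segments separate spectrograms of seg_samples each are computed."""
    return num_segments * num_stft_frames(seg_samples)


if __name__ == "__main__":
    sr = 16000  # assumed sample rate
    print("full 10 s clip:", num_stft_frames(10 * sr))
    print("4 x 1 s segments:", frames_for_segments(sr, 4))
```

For example, a 10 s clip gives 1,004 frames, while four 1 s segments give 4 × 98 = 392. Many or long segments can exceed the full-clip count, so which approach is cheaper depends on `max_audio_frames` and the segment length.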

@GenjiB GenjiB closed this as completed Dec 18, 2022