# Preprocessing 

In this script, I would like to do the following:
1. Loop through all subdirectories in the music folder (not on GitHub) and extract features for each song
2. Features to look at:  
    a) Zero-crossings (possibly as a rate?)  
    b) Spectral centroids  
    c) Spectral rolloff  
    d) Mel-frequency cepstral coefficients (multiple columns)  
    e) Chroma frequencies (multiple columns)  
    f) Tempograms  
3. Compile a CSV file that stores all this data for each song
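
On the count-versus-rate question for zero-crossings, here is a minimal sketch of the difference, assuming a synthetic sine tone instead of a real song. The helper `zero_crossing_stats` is a hypothetical name, and it works on the whole signal, whereas `librosa.feature.zero_crossing_rate` reports a rate per frame.

```python
import numpy as np

def zero_crossing_stats(signal):
    """Return (count, rate) of sign changes in a 1-D signal (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    # A crossing occurs wherever two consecutive samples differ in sign
    signs = np.signbit(signal)
    crossings = int(np.count_nonzero(signs[1:] != signs[:-1]))
    # The rate normalizes by the number of consecutive sample pairs
    rate = crossings / (len(signal) - 1)
    return crossings, rate

# Synthetic example: 1 second of a 5 Hz sine sampled at 1000 Hz.
# The small phase offset keeps every sample away from exactly zero.
sr = 1000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 5 * t + 0.1)
count, rate = zero_crossing_stats(tone)
print(count, rate)
```

A raw count grows with song length, while the rate does not, so the rate is probably the better choice when songs differ in duration.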

Also, I found a really good website with more information on feature extraction: https://musicinformationretrieval.com

In [1]:
# Libraries

## Music
import librosa

## Data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## File handling
import os
import pathlib
import csv

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

Let's start off by preparing a CSV file that contains only the header row.

In [14]:
# Column names: six summary features, 20 MFCC means, and the label
header = 'chroma_stft spectral_centroid spectral_bandwidth rolloff zero_crossing_rate tempo'
for i in range(1, 21):
    header += f' mfcc{i}'
header += ' year'
header = header.split()

# Create the CSV file (overwriting any existing one) and write the header row
with open("../data/new_raw_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(header)
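
The same header can also be built as a list directly, which avoids the build-a-string-then-split step. This is only a sketch: `build_header` and its `n_mfcc` and `label` parameters are hypothetical names, and the row is written to an in-memory buffer rather than to the real file under `../data`.

```python
import csv
import io

def build_header(n_mfcc=20, label="year"):
    """Return the CSV column names as a list (hypothetical helper)."""
    base = [
        "chroma_stft", "spectral_centroid", "spectral_bandwidth",
        "rolloff", "zero_crossing_rate", "tempo",
    ]
    # One column per MFCC coefficient, numbered from 1, then the label column
    return base + [f"mfcc{i}" for i in range(1, n_mfcc + 1)] + [label]

header = build_header()

# Write to an in-memory buffer so the sketch does not touch the disk
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(header)
print(buffer.getvalue())
```

Passing a different `label`, for example `build_header(label="genre")`, would give a header for a dataset labelled by genre instead of decade.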

Now we systematically extract features from every song, store each song's data in one row, and append that row to the CSV file.

Note: the following code block takes a few hours to run.

In [15]:
years = ['1970', '1980', '1990', '2000', '2010']

for year in years:
    subdir = f'../music/english/{year}s_audio'
    for filename in os.listdir(subdir):
        songname = os.path.join(subdir, filename)
        x, sr = librosa.load(songname)

        # Extracting features. Note: We can possibly look at separating the harmonic and percussive parts of the song
        # The audio is passed as y=x because newer librosa releases only accept it as a keyword argument
        chroma_stft = librosa.feature.chroma_stft(y=x, sr=sr)
        spec_cent = librosa.feature.spectral_centroid(y=x, sr=sr)
        spec_bw = librosa.feature.spectral_bandwidth(y=x, sr=sr)
        rolloff = librosa.feature.spectral_rolloff(y=x, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y=x)
        mfcc = librosa.feature.mfcc(y=x, sr=sr)
        tempo = librosa.beat.tempo(y=x, sr=sr)

        # One row per song: the mean of each frame-wise feature, then the tempo.
        # tempo is a one-element array, so index it to keep brackets out of the CSV
        to_add = f'{np.mean(chroma_stft)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)} {tempo[0]}'
        # Mean of each of the 20 MFCC coefficients over all frames
        for i in mfcc:
            to_add += f' {np.mean(i)}'

        # The decade is the label
        to_add += f' {year}'

        # Putting this into a file
        with open("../data/new_raw_data.csv", 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(to_add.split())
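
The row-building step in the loop above reduces every frame-wise feature matrix to a single mean, and the MFCC matrix to one mean per coefficient. Here is a small sketch of that reduction on random stand-in arrays, assuming the usual `(n_features, n_frames)` layout. `summarize_features` is a hypothetical helper, and only two frame-wise features are used to keep it short.

```python
import numpy as np

def summarize_features(frame_features, mfcc, tempo, label):
    """Collapse frame-wise features into one flat row (hypothetical helper)."""
    # One overall mean per frame-wise feature matrix
    row = [float(np.mean(f)) for f in frame_features]
    # Tempo may arrive as a scalar or a one-element array; store a plain float
    row.append(float(np.atleast_1d(tempo)[0]))
    # One mean per MFCC coefficient, taken across frames (axis 1)
    row.extend(float(m) for m in np.mean(mfcc, axis=1))
    row.append(label)
    return row

# Random stand-ins for real features, shaped (n_features, n_frames)
rng = np.random.default_rng(0)
n_frames = 50
chroma = rng.random((12, n_frames))
centroid = rng.random((1, n_frames))
mfcc = rng.random((20, n_frames))

row = summarize_features([chroma, centroid], mfcc, tempo=np.array([120.0]), label="1970")
print(len(row))
```

Building the row as a list also means it can go straight to `csv.writer(...).writerow(row)` without joining and splitting a string.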

In [2]:
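# Doing the same thing for Indian music
header = 'chroma_stft spectral_centroid spectral_bandwidth rolloff zero_crossing_rate tempo'
for i in range(1, 21):
    header += f' mfcc{i}'
# The label column keeps the name 'year' from the English CSV, although it stores the genre here
header += ' year'
header = header.split()

# Create the CSV file (overwriting any existing one) and write the header row
with open("../data/indian_raw_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(header)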

In [4]:
genres = ['bollywood', 'malayalam']

for genre in genres:
    subdir = f'../music/indian/{genre}_audio'
    for filename in os.listdir(subdir):
        songname = os.path.join(subdir, filename)
        x, sr = librosa.load(songname)

        # Extracting features. Note: We can possibly look at separating the harmonic and percussive parts of the song
        # The audio is passed as y=x because newer librosa releases only accept it as a keyword argument
        chroma_stft = librosa.feature.chroma_stft(y=x, sr=sr)
        spec_cent = librosa.feature.spectral_centroid(y=x, sr=sr)
        spec_bw = librosa.feature.spectral_bandwidth(y=x, sr=sr)
        rolloff = librosa.feature.spectral_rolloff(y=x, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y=x)
        mfcc = librosa.feature.mfcc(y=x, sr=sr)
        tempo = librosa.beat.tempo(y=x, sr=sr)

        # One row per song: the mean of each frame-wise feature, then the tempo.
        # tempo is a one-element array, so index it to keep brackets out of the CSV
        to_add = f'{np.mean(chroma_stft)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)} {tempo[0]}'
        # Mean of each of the 20 MFCC coefficients over all frames
        for i in mfcc:
            to_add += f' {np.mean(i)}'

        # The genre is the label
        to_add += f' {genre}'

        # Putting this into a file
        with open("../data/indian_raw_data.csv", 'a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(to_add.split())
