Author: Emine Darı

## About The Challenge
- In this challenge, the goal is to classify audio recordings of professors of Turkish Academy using machine learning methods. Each recording is approximately 5 seconds long. The dataset used to train and test the models is available from the in-class [Kaggle competition](https://www.kaggle.com/c/turkishacademyvoicechallenge/overview).

## Dependencies

In [1]:
from scipy.io import wavfile
import numpy as np
import pandas as pd
import librosa
import csv
import os
from sklearn.linear_model import LogisticRegression

## Step 1
* In this step, we first create a function that initializes a CSV file to hold the features extracted from the train and test sets. Saving the features to a file instead of keeping them in variables saves time: once the features are settled, we can try different models on the saved files without repeating the extraction.

In [10]:
def create_file(filename, header, train=True):
    #Add the label column only for the training stage.
    #A new list is built so the caller's header is not modified and can be reused for the test file.
    columns = header + ["label"] if train else list(header)
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(columns)
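
Because the same column list is later passed to both the train and the test call, it is worth checking that writing a header with a label does not alter that list. The following is a minimal sketch, assuming a hypothetical helper `build_header`, a shortened column list, and an in-memory `io.StringIO` buffer in place of a real file:

```python
import csv
import io

def build_header(columns, train=True):
    # Hypothetical helper: return a new list and leave `columns` untouched.
    return columns + ["label"] if train else list(columns)

columns = ["filename", "rms", "zcr"]  # shortened column list for illustration

train_buffer = io.StringIO()
csv.writer(train_buffer).writerow(build_header(columns, train=True))

test_buffer = io.StringIO()
csv.writer(test_buffer).writerow(build_header(columns, train=False))

train_header = train_buffer.getvalue().strip()
test_header = test_buffer.getvalue().strip()
print(train_header)
print(test_header)
```

Only the train header ends with `label`, and `columns` keeps its original three entries afterwards.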

- Now we create a function that extracts features using the [Librosa](https://librosa.org/doc/latest/feature.html#feature-extraction) library. For each audio file, the function appends one row of features to the end of the output file.


In [3]:
def extract_and_save_features(audio_name, audio_path, file_to_save, train=True, label=0):
    #Load the audio file as mono (librosa resamples to 22050 Hz by default)
    y, sr = librosa.load(audio_path, mono=True)

    #Extract frame-level features
    rms = librosa.feature.rms(y=y)
    chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
    spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)

    #Collect the mean of each feature over time, in the same order as the header columns.
    #A list is used instead of a space-separated string so file names containing spaces stay intact.
    row = [audio_name, np.mean(rms), np.mean(chroma_stft), np.mean(spec_cent),
           np.mean(spec_bw), np.mean(rolloff), np.mean(zcr)]
    #One mean per MFCC coefficient (20 coefficients by default)
    row.extend(np.mean(coeff) for coeff in mfcc)

    #Append the label of the audio, if in train mode
    if train:
        row.append(label)

    #Save the features in the file, in append mode
    with open(file_to_save, 'a', newline='') as file:
        csv.writer(file).writerow(row)
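
Each librosa feature call returns a 2-D array with one column per analysis frame; `librosa.feature.mfcc`, for instance, returns `n_mfcc` rows (20 by default). The function above reduces every feature to its mean over time. Here is a minimal sketch of that reduction, using a random matrix as a stand-in for real MFCC output (the array contents and the frame count of 216 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for librosa.feature.mfcc output: 20 coefficients x 216 frames.
fake_mfcc = rng.normal(size=(20, 216))

# Looping over the rows, as the notebook does, yields one mean per coefficient.
means_loop = [np.mean(coeff) for coeff in fake_mfcc]

# The same result in a single vectorized call.
means_vectorized = fake_mfcc.mean(axis=1)

print(means_vectorized.shape)
```

Averaging discards how a feature changes over the recording; other summaries such as the standard deviation could be added in the same way, at the cost of more columns.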

- The last step of preparation is to write a function that applies min-max normalization to the extracted features, scaling each column to the range [0, 1].

In [4]:
def normalize(dataset):
    data_normalized = ((dataset-dataset.min())/(dataset.max()-dataset.min()))
    return data_normalized
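
One caveat: `normalize` scales each column by the minimum and maximum of whatever table it receives, so calling it separately on the train and test sets applies two different scalings. A common alternative is to compute the ranges on the train set and reuse them for the test set. The following is a minimal sketch with NumPy arrays, where `fit_min_max` and `apply_min_max` are hypothetical helper names and the numbers are made up:

```python
import numpy as np

def fit_min_max(train):
    # Column-wise minimum and range, computed on the training data only.
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    # Scale any data set with the statistics learned from the train set.
    return (data - col_min) / col_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 25.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(test_scaled)
```

With this approach, test values outside the range seen in training fall outside [0, 1], which is expected.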

## Step 2
- Next, we process the train set, which is structured as: _**train/train/class_id/**sample.wav_
- The dataset is assumed to be saved in the same directory as the notebook/code.
- **Warning:** _This step might take a **long time.**_

In [8]:
class_count = 2

#Create a list with names of the features to be extracted
features = ["filename","rms", "chroma_stft","spec_cent", "spec_bw", "rolloff", "zcr"]
for i in range(20):
        features.append("mfcc_" + str(i+1))

In [5]:
#Call the function to create the file to save the train set's features                 
create_file('train_processed.csv',features)

#Extract features for each audio in the train set    
    
for i in range(class_count):
    for file in os.listdir("train/train/" + str(i)):
        file_path = "train/train/" + str(i) + "/" + file
        extract_and_save_features(file, file_path,'train_processed.csv',True,i)
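
`os.listdir` returns every entry in a directory, in arbitrary order, so stray non-audio files would also be passed to the extractor. The following is a minimal sketch of a stricter directory walk, assuming a hypothetical helper `list_wav_files` and a throwaway directory tree built with `tempfile` to stand in for _train/train/class_id/_:

```python
import tempfile
from pathlib import Path

def list_wav_files(root, class_count):
    # Hypothetical helper: yield (path, label) pairs for .wav files only, in sorted order.
    for label in range(class_count):
        class_dir = Path(root) / str(label)
        for path in sorted(class_dir.glob("*.wav")):
            yield path, label

# Build a tiny stand-in for train/train/<class_id>/ with empty files.
with tempfile.TemporaryDirectory() as root:
    for label, names in enumerate([["a.wav", "notes.txt"], ["c.wav", "b.wav"]]):
        class_dir = Path(root) / str(label)
        class_dir.mkdir()
        for name in names:
            (class_dir / name).touch()

    found = [(path.name, label) for path, label in list_wav_files(root, class_count=2)]

print(found)
```

The text file is skipped and the `.wav` files come back in a reproducible order within each class.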
            

## Step 3
- Now we process the test set, which is structured as: _**test/test/**sample.wav_

In [12]:
#Call the function to create the file to save the test set's features 
create_file('test_processed.csv',features,train=False)

#Extract features for each audio in the test set   
for file in os.listdir("test/test"):
        file_path = "test/test/" + file
        extract_and_save_features(file, file_path,'test_processed.csv',train=False)

## Step 4
- In the final step we pick a model and train it with the features we extracted. Then we predict the labels for the test set and write them into a file in submission format.

In [13]:
#read the processed train data
train_set = pd.read_csv("train_processed.csv")

#split x and y data, normalize x data
x_train = normalize(train_set.drop(["filename","label"],axis=1))
y_train = train_set.label.values

#choose a model and train it with the labels
model = LogisticRegression()
model.fit(x_train,y_train)

#read the processed test data and drop the filename column, then normalize the dataset
test_set = pd.read_csv("test_processed.csv")
x_test_norm = normalize( test_set.drop(["filename"],axis=1))

#predict the test data with the model    
predicted_data = model.predict(x_test_norm)


#Build the submission with the columns FileName,Class.
#File names are taken from the processed test data so each name stays aligned with its prediction.
prediction = pd.DataFrame({"FileName": test_set["filename"], "Class": predicted_data})
prediction.to_csv("submission.csv", index = False, header=True)

print("Prediction complete. Please check submission.csv file")

Prediction complete. Please check submission.csv file
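
The submission file pairs each file name with its predicted class by position, so the two sequences must be in the same order. Here is a minimal sketch of that pairing with the standard `csv` module, using made-up file names and predictions and an in-memory buffer instead of _submission.csv_:

```python
import csv
import io

# Made-up values standing in for the test file names and model predictions.
file_names = ["sample_1.wav", "sample_2.wav", "sample_3.wav"]
predictions = [0, 1, 1]

# zip stops at the shorter sequence, so a length mismatch would silently drop rows.
if len(file_names) != len(predictions):
    raise ValueError("file names and predictions differ in length")

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["FileName", "Class"])
writer.writerows(zip(file_names, predictions))

submission_text = buffer.getvalue()
print(submission_text)
```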
