In [None]:
"""
Domain
    Media

Focus
    optimize the selection process

Business challenge/requirement
    Motion Studios is the largest radio production house in Europe, with total revenue
    of more than $1B. The company has launched a new reality show, "The Star RJ". The show
    is about finding a new radio jockey who will be the star presenter on upcoming shows.
    In the first round, participants upload a voice clip online, and experts evaluate
    the clip for selection into the next round. Separate teams evaluate male and
    female voices in the first round.
    The response to the show is unprecedented, and the company is flooded with voice clips.
    As an ML expert, you have to classify each voice as male or female so that the first
    level of filtration is quicker.

Key issues
    Voice samples span multiple accents

Considerations
    The output from the pre-processed WAV files was saved into the CSV file

Data volume
    - Approx 3000 records – file voice-classification.csv 

Fields in Data
    • meanfreq: mean frequency (in kHz)
    • sd: standard deviation of the frequency
    • median: median frequency (in kHz)
    • Q25: first quartile (in kHz)
    • Q75: third quartile (in kHz)
    • IQR: interquartile range (in kHz)
    • skew: skewness (see note in specprop description)
    • kurt: kurtosis (see note in specprop description)
    • sp.ent: spectral entropy
    • sfm: spectral flatness
    • mode: mode frequency
    • centroid: frequency centroid (see specprop)
    • peakf: peak frequency (frequency with the highest energy)
    • meanfun: average of fundamental frequency measured across the acoustic 
    signal
    • minfun: minimum fundamental frequency measured across the acoustic signal
    • maxfun: maximum fundamental frequency measured across the acoustic signal
    • meandom: average of dominant frequency measured across the acoustic signal
    • mindom: minimum of dominant frequency measured across the acoustic signal
    • maxdom: maximum of dominant frequency measured across the acoustic 
    signal
    • dfrange: range of dominant frequency measured across the acoustic signal
    • modindx: modulation index. Calculated as the accumulated absolute difference 
    between adjacent measurements of fundamental frequencies divided by the 
    frequency range
    • label: male or female

Additional information
    - NA

Business benefits
    Since "The Star RJ" is a reality show, the time available to select candidates is very
    short. The success of the show, and hence the profits, depends on quick and smooth
    execution.
"""

In [11]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
CSV_PATH = r'D:\CourseWork\data-science-python-certification-course\Assignments\09 Supervised Learning - II\Case Study I\resources\voice-classification.csv'
df = pd.read_csv(CSV_PATH)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   meanfreq  3168 non-null   float64
 1   sd        3168 non-null   float64
 2   median    3168 non-null   float64
 3   Q25       3168 non-null   float64
 4   Q75       3168 non-null   float64
 5   IQR       3168 non-null   float64
 6   skew      3168 non-null   float64
 7   kurt      3168 non-null   float64
 8   sp.ent    3168 non-null   float64
 9   sfm       3168 non-null   float64
 10  mode      3168 non-null   float64
 11  centroid  3168 non-null   float64
 12  meanfun   3168 non-null   float64
 13  minfun    3168 non-null   float64
 14  maxfun    3168 non-null   float64
 15  meandom   3168 non-null   float64
 16  mindom    3168 non-null   float64
 17  maxdom    3168 non-null   float64
 18  dfrange   3168 non-null   float64
 19  modindx   3168 non-null   float64
 20  label     3168 non-null   object

In [5]:
# Encoding label
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
df.head(5)

,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,1
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,1
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,1
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,1
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,1
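
The `label` column now holds integers instead of strings. A minimal sketch of how `LabelEncoder` assigns those integers: it sorts the distinct class values, so with the `male`/`female` labels described in the problem statement, `female` should map to 0 and `male` to 1. The toy list and the names `toy_labels`, `toy_le`, and `label_mapping` are made up for illustration and are not taken from the CSV.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration (not rows from voice-classification.csv).
toy_labels = ["male", "female", "female", "male"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the sorted distinct labels; the position of each is its code.
label_mapping = {str(cls): int(code) for code, cls in enumerate(toy_le.classes_)}
print(label_mapping)

# inverse_transform turns integer codes back into the original strings.
decoded = toy_le.inverse_transform(toy_encoded)
print(list(decoded))
```

On the notebook's own encoder, the same check would be `le.classes_`, and `le.inverse_transform(y_pred)` would turn predictions back into readable labels.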


In [7]:
# This problem already appeared in a previous assignment (Assignment 7, Case Study 1),
# so the relevant features are known.

x = df[['sd','Q25','IQR','sfm','mode','meanfun']]
y = df['label']

In [9]:
# Split the dataset into train-test with 20% of the data kept aside for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=157)
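
To make the effect of `test_size=0.2` concrete, the sketch below runs the same call on placeholder arrays with the 3168 rows reported by `df.info()`. The arrays `dummy_x` and `dummy_y` are invented stand-ins, not the real features. The `stratify=` argument is not used in this notebook; it is shown here only as an option that keeps the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3168  # row count reported by df.info()
dummy_x = np.arange(n_rows).reshape(-1, 1)  # placeholder feature column
dummy_y = np.arange(n_rows) % 2             # placeholder labels, alternating 0/1

# Same test_size and random_state as the notebook; stratify is an added assumption.
tr_x, te_x, tr_y, te_y = train_test_split(
    dummy_x, dummy_y, test_size=0.2, random_state=157, stratify=dummy_y
)

n_train, n_test = len(tr_x), len(te_x)
print(n_train, n_test)

# With stratification, the test part keeps the 50/50 ratio of the placeholder labels.
print(int(te_y.sum()), n_test - int(te_y.sum()))
```

With 3168 rows this leaves 2534 rows for training and 634 for testing, because the test count is rounded up from 633.6.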

In [12]:
# Fit a logistic regression model and measure the accuracy of the test set.
lrm = LogisticRegression()
lrm.fit(x_train, y_train)
y_pred = lrm.predict(x_test)
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred); the accuracy value is the same either way

0.88801261829653
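
A single accuracy figure does not show which class is misclassified more often, and that matters here because male and female clips go to separate evaluation teams. As a hedged illustration, the sketch below builds a confusion matrix from toy labels and recovers the accuracy from it. The arrays `toy_true` and `toy_pred` are invented and do not reproduce the model's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true/predicted labels, invented for illustration (0 = female, 1 = male).
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

toy_acc = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
tn, fp, fn, tp = toy_cm.ravel()

# Accuracy is the sum of the diagonal divided by the total count.
acc_from_cm = (tn + tp) / toy_cm.sum()
print(toy_cm)
print(toy_acc, acc_from_cm)
```

On the notebook's data, the equivalent call would be `confusion_matrix(y_test, y_pred)`.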