# Machine Learning project: gender recognition from speech

The following acoustic properties of each voice are measured:
- **duration**: length of signal <span style="color:red">**not used**</span>
- **meanfreq**: mean frequency (in kHz)
- **sd**: standard deviation of frequency
- **median**: median frequency (in kHz)
- **Q25**: first quartile (in kHz)
- **Q75**: third quartile (in kHz)
- **IQR**: interquartile range (in kHz)
- **skew**: skewness (see note in specprop description)
- **kurt**: kurtosis (see note in specprop description)
- **sp.ent**: spectral entropy
- **sfm**: spectral flatness
- **mode**: mode frequency
- **centroid**: frequency centroid (see specprop)
- **peakf**: peak frequency (frequency with highest energy) <span style="color:red">**not used**</span>
- **meanfun**: average of fundamental frequency measured across acoustic signal
- **minfun**: minimum fundamental frequency measured across acoustic signal
- **maxfun**: maximum fundamental frequency measured across acoustic signal
- **meandom**: average of dominant frequency measured across acoustic signal
- **mindom**: minimum of dominant frequency measured across acoustic signal
- **maxdom**: maximum of dominant frequency measured across acoustic signal
- **dfrange**: range of dominant frequency measured across acoustic signal
- **modindx**: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
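
The modulation index definition above can be read literally as the sketch below. It is only an illustration: the helper name `modulation_index` and the sample values are hypothetical, "frequency range" is assumed to mean the maximum minus the minimum of the same measurements, and a flat contour is assumed to give 0. The tool that produced the data set may compute it differently.

```python
def modulation_index(freqs):
    """Accumulated absolute change between adjacent values, divided by their range."""
    if len(freqs) < 2:
        return 0.0
    freq_range = max(freqs) - min(freqs)
    if freq_range == 0:
        # Assumption: a flat contour has no modulation, so return 0
        return 0.0
    # Sum of absolute differences between neighbouring measurements
    total_change = sum(abs(b - a) for a, b in zip(freqs, freqs[1:]))
    return total_change / freq_range


# Hypothetical fundamental-frequency track, in kHz
f0_track = [0.10, 0.12, 0.11, 0.15]
example_modindx = modulation_index(f0_track)
print(round(example_modindx, 3))  # 0.07 / 0.05, i.e. about 1.4
```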

In [35]:
# imports from pandas
import pandas as pd
pd.options.display.max_columns = None

# imports from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [36]:
# load data (read_csv already returns a data frame)
df = pd.read_csv('dataSet.csv')

In [37]:
# encode the label column as integers; categories are sorted
# alphabetically, so:
#     - female: 0
#     - male:   1
def encode_label(labels):
    return labels.astype('category').cat.codes
df['label'] = encode_label(df['label'])
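
To see why the encoding above yields female = 0 and male = 1, the following sketch runs the same `astype('category').cat.codes` call on a small made-up series and prints the code-to-label mapping. The variable names and toy labels are illustrative only. An explicit `map` is shown as an alternative that does not depend on alphabetical ordering.

```python
import pandas as pd

# Toy labels, not taken from the data set
labels = pd.Series(['male', 'female', 'female', 'male'])

# Categories are sorted, so 'female' gets code 0 and 'male' gets code 1
as_category = labels.astype('category')
codes = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)

# Alternative: spell the mapping out explicitly
explicit = labels.map({'female': 0, 'male': 1})
print(explicit.tolist() == codes.tolist())
```

The explicit mapping is a little more verbose, but it keeps working as intended if a label with a different spelling is added later.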

In [42]:
# display data frame info (info() prints its report itself)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
meanfreq    3168 non-null float64
sd          3168 non-null float64
median      3168 non-null float64
Q25         3168 non-null float64
Q75         3168 non-null float64
IQR         3168 non-null float64
skew        3168 non-null float64
kurt        3168 non-null float64
sp.ent      3168 non-null float64
sfm         3168 non-null float64
mode        3168 non-null float64
centroid    3168 non-null float64
meanfun     3168 non-null float64
minfun      3168 non-null float64
maxfun      3168 non-null float64
meandom     3168 non-null float64
mindom      3168 non-null float64
maxdom      3168 non-null float64
dfrange     3168 non-null float64
modindx     3168 non-null float64
label       3168 non-null int8
dtypes: float64(20), int8(1)
memory usage: 498.2 KB


In [43]:
# display data frame
df.head()

| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | mode | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.032027 | 0.015071 | 0.090193 | 0.075122 | 12.863462 | 274.402906 | 0.893369 | 0.491918 | 0.0 | 0.059781 | 0.084279 | 0.015702 | 0.275862 | 0.007812 | 0.007812 | 0.007812 | 0.0 | 0.0 | 1 |
| 1 | 0.066009 | 0.06731 | 0.040229 | 0.019414 | 0.092666 | 0.073252 | 22.423285 | 634.613855 | 0.892193 | 0.513724 | 0.0 | 0.066009 | 0.107937 | 0.015826 | 0.25 | 0.009014 | 0.007812 | 0.054688 | 0.046875 | 0.052632 | 1 |
| 2 | 0.077316 | 0.083829 | 0.036718 | 0.008701 | 0.131908 | 0.123207 | 30.757155 | 1024.927705 | 0.846389 | 0.478905 | 0.0 | 0.077316 | 0.098706 | 0.015656 | 0.271186 | 0.00799 | 0.007812 | 0.015625 | 0.007812 | 0.046512 | 1 |
| 3 | 0.151228 | 0.072111 | 0.158011 | 0.096582 | 0.207955 | 0.111374 | 1.232831 | 4.177296 | 0.963322 | 0.727232 | 0.083878 | 0.151228 | 0.088965 | 0.017798 | 0.25 | 0.201497 | 0.007812 | 0.5625 | 0.554688 | 0.247119 | 1 |
| 4 | 0.13512 | 0.079146 | 0.124656 | 0.07872 | 0.206045 | 0.127325 | 1.101174 | 4.333713 | 0.971955 | 0.783568 | 0.104261 | 0.13512 | 0.106398 | 0.016931 | 0.266667 | 0.712812 | 0.007812 | 5.484375 | 5.476562 | 0.208274 | 1 |


In [19]:
# prepare training and test values (33% of the rows are held out for testing):
#     - x: features (the acoustic measurements)
#     - y: target (the encoded gender label)
x = df.drop('label', axis=1)
y = df['label']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
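
The split above is random, so the reported score can change from run to run. As a minimal sketch, assuming reproducibility and balanced classes in both parts are wanted, `train_test_split` also accepts `random_state` and `stratify`. The toy frame and the `xs_*`/`ys_*` names below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, one feature, perfectly balanced labels
toy_x = pd.DataFrame({'meanfun': [i / 1000 for i in range(100)]})
toy_y = pd.Series([0, 1] * 50, name='label')

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
xs_train, xs_test, ys_train, ys_test = train_test_split(
    toy_x, toy_y, test_size=0.33, random_state=0, stratify=toy_y)

print(len(xs_train), len(xs_test))
print(ys_test.value_counts().to_dict())
```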

In [None]:
# create random forest classifier
rfc = RandomForestClassifier(n_estimators=100)

In [31]:
# train rfc
rfc.fit(x_train, y_train)

# print mean accuracy on the held-out test set
score = rfc.score(x_test, y_test)
print('{}%'.format(round(score*100, 2)))

97.51%
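
A single train/test split gives one accuracy figure that depends on which rows ended up in the test set. The notebook imports `cross_val_score` but does not call it. The sketch below shows how it could be used to get a score per fold. It runs on synthetic data from `make_classification` rather than on `dataSet.csv`, so the printed numbers say nothing about the voice data, and all `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 20 acoustic features and the binary label
demo_x, demo_y = make_classification(n_samples=200, n_features=20, random_state=0)

demo_rfc = RandomForestClassifier(n_estimators=20, random_state=0)

# 5-fold cross-validation: the classifier is refit on each training fold
cv_scores = cross_val_score(demo_rfc, demo_x, demo_y, cv=5)
print('{:.2f}% +/- {:.2f}%'.format(cv_scores.mean() * 100, cv_scores.std() * 100))
```

On the real data the same call would presumably be `cross_val_score(rfc, x, y, cv=5)`, using the full `x` and `y` rather than the split parts.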


In [32]:
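# predict the gender of a single sample; the 20 values are given
# in the same column order as x (meanfreq ... modindx)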
prediction = rfc.predict([[
    0.2022728,
    0.04060666,
    0.2129694,
    0.1821243,
    0.227241,
    0.04511674,
    3.040879,
    17.07277,
    0.8827420,
    0.2635666,
    0.1200658,
    0.2022728,
    0.1497998,
    0.04319295,
    0.2791139,
    0.3374789,
    0,
    1.593457,
    1.593457,
    0.11383929
]])

print("Female" if prediction[0]==0 else "Male")

Female
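
Passing a bare list to `predict` relies on the values being in exactly the training column order, and recent scikit-learn versions warn when a model fitted on a data frame is given unnamed features. A hedged alternative is to wrap the sample in a data frame with the training columns and to look at `predict_proba` as well. The sketch below uses a tiny made-up data set with two of the feature names. The values and the `toy_*` names are illustrative only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: label 0 = female, 1 = male
toy = pd.DataFrame({
    'meanfun': [0.09, 0.10, 0.11, 0.17, 0.18, 0.19],
    'IQR': [0.11, 0.12, 0.10, 0.05, 0.04, 0.05],
    'label': [1, 1, 1, 0, 0, 0],
})
features = toy.drop('label', axis=1)

toy_rfc = RandomForestClassifier(n_estimators=10, random_state=0)
toy_rfc.fit(features, toy['label'])

# Name the columns so the values cannot be silently misaligned
sample = pd.DataFrame([[0.15, 0.045]], columns=features.columns)

predicted = toy_rfc.predict(sample)[0]
proba = toy_rfc.predict_proba(sample)[0]  # ordered like toy_rfc.classes_
print('Female' if predicted == 0 else 'Male', proba)
```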
