# Machine Learning project: gender recognition from speech

The following acoustic properties of each voice are measured:
- **duration**: length of signal <span style="color:red">**NOT USED!!**</span>
- **meanfreq**: mean frequency (in kHz)
- **sd**: standard deviation of frequency
- **median**: median frequency (in kHz)
- **Q25**: first quartile (in kHz)
- **Q75**: third quartile (in kHz)
- **IQR**: interquartile range, Q75 minus Q25 (in kHz)
- **skew**: skewness (see note in specprop description)
- **kurt**: kurtosis (see note in specprop description)
- **sp.ent**: spectral entropy
- **sfm**: spectral flatness
- **mode**: mode frequency
- **centroid**: frequency centroid (see specprop)
- **peakf**: peak frequency (frequency with highest energy) <span style="color:red">**NOT USED!!**</span>
- **meanfun**: average of fundamental frequency measured across acoustic signal
- **minfun**: minimum fundamental frequency measured across acoustic signal
- **maxfun**: maximum fundamental frequency measured across acoustic signal
- **meandom**: average of dominant frequency measured across acoustic signal
- **mindom**: minimum of dominant frequency measured across acoustic signal
- **maxdom**: maximum of dominant frequency measured across acoustic signal
- **dfrange**: range of dominant frequency measured across acoustic signal
- **modindx**: modulation index. Calculated as the accumulated absolute difference between adjacent measurements of fundamental frequencies divided by the frequency range
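
The modulation index definition above can be sketched in a few lines. This is only an illustration, assuming that "frequency range" means the maximum minus the minimum of the same fundamental-frequency measurements; the function name `modulation_index` and the sample values are hypothetical and are not taken from the data set.

```python
def modulation_index(fundamental_freqs):
    """Accumulated absolute difference between adjacent values, divided by their range."""
    if len(fundamental_freqs) < 2:
        return 0.0
    freq_range = max(fundamental_freqs) - min(fundamental_freqs)
    if freq_range == 0:
        # Assumption: a perfectly flat contour is treated as having no modulation.
        return 0.0
    total_change = sum(
        abs(b - a) for a, b in zip(fundamental_freqs, fundamental_freqs[1:])
    )
    return total_change / freq_range


# Made-up fundamental-frequency measurements in kHz.
f0 = [0.10, 0.12, 0.11, 0.15]
print(modulation_index(f0))  # roughly 1.4: (0.02 + 0.01 + 0.04) / 0.05
```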

In [None]:
# pandas
import pandas as pd
pd.options.display.max_columns = None

# sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# pyplot
import matplotlib.pyplot as plt

# seaborn
import seaborn as sns

In [None]:
# load data into a data frame (read_csv already returns a DataFrame)
df = pd.read_csv('dataSet.csv')

In [None]:
# encode the 'label' column as numbers (only this column is touched):
#     - male:   0
#     - female: 1
# note: run this cell once; mapping already-encoded values would yield NaN
df['label'] = df['label'].map({'male': 0, 'female': 1})

In [None]:
# display data frame info
df.info()  # info() prints its summary itself and returns None

In [None]:
# display data frame
df.head()

In [None]:
# prepare training values:
#     - x: what we know
#     - y: what we want to know
x = df.drop('label', axis=1)
y = df['label']
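# fixed random_state makes the split reproducible; stratify keeps the class balance
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42, stratify=y)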

In [None]:
# create random forest classifier
rfc = RandomForestClassifier(n_estimators=100)

In [None]:
# train rfc
rfc.fit(x_train, y_train)

# print score
score = rfc.score(x_test, y_test)
print('{}%'.format(round(score*100, 2)))
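
A single train/test split gives one accuracy figure that depends on how the rows happened to be divided. The imported `cross_val_score` can give a steadier estimate by averaging over several folds. The sketch below is self-contained and uses synthetic data from `make_classification` as a stand-in for the voice features; the names `x_demo`, `y_demo`, `rfc_demo` and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the voice data: 20 numeric features, binary label.
x_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

rfc_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: fit on 4 folds, score accuracy on the remaining one.
cv_scores = cross_val_score(rfc_demo, x_demo, y_demo, cv=5)

print('fold accuracies:', cv_scores.round(3))
print('mean: {:.2f}%  std: {:.2f}%'.format(cv_scores.mean() * 100,
                                           cv_scores.std() * 100))
```

In this notebook the same call would presumably take the real data, for example `cross_val_score(rfc, x, y, cv=5)`.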

In [None]:
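# make a prediction for one sample; values must follow the column order of x
# (a DataFrame with the training column names avoids sklearn's feature-name warning)
sample = pd.DataFrame([[
    0.2022728,
    0.04060666,
    0.2129694,
    0.1821243,
    0.227241,
    0.04511674,
    3.040879,
    17.07277,
    0.8827420,
    0.2635666,
    0.1200658,
    0.2022728,
    0.1497998,
    0.04319295,
    0.2791139,
    0.3374789,
    0,
    1.593457,
    1.593457,
    0.11383929
]], columns=x.columns)
prediction = rfc.predict(sample)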

print("M" if prediction[0]==0 else "F")

In [None]:
# plot the pairwise correlations between all features and the label
plt.figure(figsize=(14,12))
plt.title('Correlation Matrix')
sns.heatmap(df.corr(), linewidths=0.1, annot=True)
plt.show()