# Classification in PySpark's MLlib Project Solution

### Genre classification
Now it's time to apply what we learned in the lectures to a REAL classification project! Have you ever wondered what makes us humans able to tell apart two songs of different genres? How do we inherently know the difference between a pop song and a heavy metal song? This type of classification may seem easy for us, but it's a very difficult challenge for a computer. So the question is, could an automatic genre classification model be possible?

For this project we will be classifying songs based on a number of characteristics into a set of 23 electronic genres. This technology could be used by an application like Pandora to recommend songs to users or just create meaningful channels. Super fun!

### Dataset
*beatsdataset.csv*
Each row is an electronic music song. The dataset contains 100 songs for each of 23 electronic music genres; they were the top 100 songs of their genres in November 2016. The 71 columns are audio features extracted from a random two-minute sample of each audio file. These features were extracted using pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis).

### Your task
Create an algorithm that classifies songs into the 23 genres provided. Test out several different models and select the highest performing one. Also play around with feature selection methods and finally try to make a recommendation to a user.  

For the feature selection aspect of this project, you may need to get a bit creative if you want to select features from a non-tree algorithm. I intentionally did not cover this aspect of PySpark in the previous lectures, to give you a chance to get used to researching the PySpark documentation. Here is the link to the Feature Selectors section of the documentation that just might come in handy: https://spark.apache.org/docs/latest/ml-features.html#feature-selectors
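One simple idea that works for any model exposing a per-feature score (tree importances, coefficient magnitudes, or a univariate statistic) is to rank the scores and keep the top k. The sketch below shows that idea in plain Python rather than with a PySpark selector. The helper `select_top_k` and the example scores are hypothetical and are not results from this dataset.

```python
def select_top_k(feature_names, scores, k):
    """Return the k feature names with the largest absolute score."""
    if len(feature_names) != len(scores):
        raise ValueError("feature_names and scores must have the same length")
    # Rank by magnitude so strongly negative scores are not discarded
    ranked = sorted(zip(feature_names, scores), key=lambda pair: abs(pair[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical scores for illustration only
names = ["1-ZCRm", "2-Energym", "69-BPM", "70-BPMconf"]
scores = [0.05, -0.40, 0.30, 0.10]

top_two = select_top_k(names, scores, k=2)
print(top_two)
```

The selected names could then be passed as the input columns when the feature vector is assembled, so that only those features reach the classifier.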

Good luck! Have fun :)

### Source
https://www.kaggle.com/caparrini/beatsdataset

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Genre').getOrCreate()

# Note: this counts the executors registered with the driver, not CPU cores (local mode reports 1)
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print('You are working with', cores, 'core(s)')
spark

You are working with 1 core(s)


In [3]:
path = 'Dataset/'

df = spark.read.csv(path+'beatsdataset.csv', inferSchema=True, header=True)

In [4]:
df.limit(5).toPandas()

Unnamed: 0,_c0,1-ZCRm,2-Energym,3-EnergyEntropym,4-SpectralCentroidm,5-SpectralSpreadm,6-SpectralEntropym,7-SpectralFluxm,8-SpectralRolloffm,9-MFCCs1m,...,63-ChromaVector8std,64-ChromaVector9std,65-ChromaVector10std,66-ChromaVector11std,67-ChromaVector12std,68-ChromaDeviationstd,69-BPM,70-BPMconf,71-BPMessentia,class
0,0,0.13644,0.088861,3.201201,0.262825,0.249212,1.114423,0.007003,0.256682,-22.723259,...,0.003431,0.004981,0.010818,0.024001,0.005201,0.015056,133.333333,0.132792,128.0,BigRoom
1,1,0.117039,0.108389,3.194001,0.247657,0.250288,1.065668,0.005387,0.199821,-21.775871,...,0.004461,0.006441,0.007469,0.015499,0.005589,0.019339,120.0,0.112767,126.0,BigRoom
2,2,0.085308,0.128525,3.123837,0.217205,0.228652,0.789647,0.008247,0.156822,-22.472722,...,0.001529,0.004556,0.007723,0.017482,0.002901,0.022201,133.333333,0.123373,129.0,BigRoom
3,3,0.10305,0.167042,3.15083,0.233593,0.245032,0.967082,0.006571,0.168083,-21.470751,...,0.001591,0.003514,0.009477,0.023162,0.004165,0.015379,133.333333,0.158876,129.0,BigRoom
4,4,0.15173,0.148405,3.194498,0.29373,0.267231,1.353005,0.003872,0.292055,-21.371157,...,0.003945,0.004131,0.01133,0.028188,0.002639,0.019079,133.333333,0.190708,129.0,BigRoom


In [5]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- 1-ZCRm: double (nullable = true)
 |-- 2-Energym: double (nullable = true)
 |-- 3-EnergyEntropym: double (nullable = true)
 |-- 4-SpectralCentroidm: double (nullable = true)
 |-- 5-SpectralSpreadm: double (nullable = true)
 |-- 6-SpectralEntropym: double (nullable = true)
 |-- 7-SpectralFluxm: double (nullable = true)
 |-- 8-SpectralRolloffm: double (nullable = true)
 |-- 9-MFCCs1m: double (nullable = true)
 |-- 10-MFCCs2m: double (nullable = true)
 |-- 11-MFCCs3m: double (nullable = true)
 |-- 12-MFCCs4m: double (nullable = true)
 |-- 13-MFCCs5m: double (nullable = true)
 |-- 14-MFCCs6m: double (nullable = true)
 |-- 15-MFCCs7m: double (nullable = true)
 |-- 16-MFCCs8m: double (nullable = true)
 |-- 17-MFCCs9m: double (nullable = true)
 |-- 18-MFCCs10m: double (nullable = true)
 |-- 19-MFCCs11m: double (nullable = true)
 |-- 20-MFCCs12m: double (nullable = true)
 |-- 21-MFCCs13m: double (nullable = true)
 |-- 22-ChromaVector1m: double (null

### How many classes do we have?

In [6]:
df.groupBy('class').count().show(100)

+--------------------+-----+
|               class|count|
+--------------------+-----+
|           PsyTrance|  100|
|           HardDance|  100|
|              Breaks|  100|
|  HardcoreHardTechno|  100|
|   IndieDanceNuDisco|  100|
|              Trance|  100|
|           DeepHouse|  100|
|ElectronicaDowntempo|  100|
|           ReggaeDub|  100|
|             Minimal|  100|
|         DrumAndBass|  100|
|             Dubstep|  100|
|             BigRoom|  100|
|              Techno|  100|
|               House|  100|
|         FutureHouse|  100|
|        ElectroHouse|  100|
|           GlitchHop|  100|
|           TechHouse|  100|
|              HipHop|  100|
|           FunkRAndB|  100|
|               Dance|  100|
|    ProgressiveHouse|  100|
+--------------------+-----+
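Because the 23 classes are perfectly balanced, it helps to know what accuracy random guessing would reach before judging any model. A quick arithmetic sketch, using only the counts shown in the table above:

```python
n_classes = 23
songs_per_class = 100

total_songs = n_classes * songs_per_class   # rows in the dataset
chance_accuracy = 100.0 / n_classes         # percent correct when guessing uniformly

print(total_songs)
print(round(chance_accuracy, 2))
```

Any classifier scoring well above this chance level is picking up real signal from the audio features.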



In [7]:
df.columns

['_c0',
 '1-ZCRm',
 '2-Energym',
 '3-EnergyEntropym',
 '4-SpectralCentroidm',
 '5-SpectralSpreadm',
 '6-SpectralEntropym',
 '7-SpectralFluxm',
 '8-SpectralRolloffm',
 '9-MFCCs1m',
 '10-MFCCs2m',
 '11-MFCCs3m',
 '12-MFCCs4m',
 '13-MFCCs5m',
 '14-MFCCs6m',
 '15-MFCCs7m',
 '16-MFCCs8m',
 '17-MFCCs9m',
 '18-MFCCs10m',
 '19-MFCCs11m',
 '20-MFCCs12m',
 '21-MFCCs13m',
 '22-ChromaVector1m',
 '23-ChromaVector2m',
 '24-ChromaVector3m',
 '25-ChromaVector4m',
 '26-ChromaVector5m',
 '27-ChromaVector6m',
 '28-ChromaVector7m',
 '29-ChromaVector8m',
 '30-ChromaVector9m',
 '31-ChromaVector10m',
 '32-ChromaVector11m',
 '33-ChromaVector12m',
 '34-ChromaDeviationm',
 '35-ZCRstd',
 '36-Energystd',
 '37-EnergyEntropystd',
 '38-SpectralCentroidstd',
 '39-SpectralSpreadstd',
 '40-SpectralEntropystd',
 '41-SpectralFluxstd',
 '42-SpectralRolloffstd',
 '43-MFCCs1std',
 '44-MFCCs2std',
 '45-MFCCs3std',
 '46-MFCCs4std',
 '47-MFCCs5std',
 '48-MFCCs6std',
 '49-MFCCs7std',
 '50-MFCCs8std',
 '51-MFCCs9std',
 '52-MFCCs

In [8]:
# Data Prep function
def MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True):
    
    # Cast the label (class variable) to string type to prep for re-indexing.
    # PySpark expects a zero-indexed numeric label column, so in case our data
    # is not in that format we treat it with the built-in StringIndexer.
    renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Copy the label as a string column
    indexer = StringIndexer(inputCol="label_str", outputCol="label") # PySpark expects this naming convention
    indexed = indexer.fit(renamed).transform(renamed)

    # Convert all string type data in the input column list to numeric
    # Otherwise the Algorithm will not be able to process it
    numeric_inputs = []
    string_inputs = []
    for column in input_columns:
        # Check the type directly; the text of str(dataType) varies between Spark versions
        if isinstance(indexed.schema[column].dataType, StringType):
            indexer = StringIndexer(inputCol=column, outputCol=column+"_num") 
            indexed = indexer.fit(indexed).transform(indexed)
            new_col_name = column+"_num"
            string_inputs.append(new_col_name)
        else:
            numeric_inputs.append(column)
            
    if treat_outliers == True:
        print("We are correcting for non normality now!")
        # Dictionary of approximate [1st, 99th] percentile bounds per numeric column
        d = {}
        for col in numeric_inputs: 
            # The last argument is relativeError: 0 gives exact quantiles, larger values run faster but are coarser
            d[col] = indexed.approxQuantile(col,[0.01,0.99],0.25)
        # Now fill in the values
        for col in numeric_inputs:
            skew = indexed.agg(skewness(indexed[col])).collect() # check for skewness
            skew = skew[0][0]
            # Floor and cap at those bounds, then apply log(x + 1) (the +1 guards against 0 values)
            if skew > 1:
                indexed = indexed.withColumn(col, \
                log(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] ) +1).alias(col))
                print(col+" has been treated for positive (right) skewness. (skew =)",skew,")")
            elif skew < -1:
                # Same floor and cap, followed by exp(x) for left-skewed columns
                indexed = indexed.withColumn(col, \
                exp(when(indexed[col] < d[col][0],d[col][0])\
                .when(indexed[col] > d[col][1], d[col][1])\
                .otherwise(indexed[col] )).alias(col))
                print(col+" has been treated for negative (left) skewness. (skew =",skew,")")

            
    # Warn if the dataframe contains negative values, because Naive Bayes cannot be used with them.
    # Note: only the numeric inputs need checking, since indexed columns never hold negative values.
    minimums = df.select([min(c).alias(c) for c in df.columns if c in numeric_inputs]) # Minimum of each numeric input column
    min_array = minimums.select(array(numeric_inputs).alias("mins")) # Gather the per-column minimums into one array column
    df_minimum = min_array.select(array_min(min_array.mins)).collect() # Collect the global minimum as a Python object
    df_minimum = df_minimum[0][0] # Slice to get the number itself

    features_list = numeric_inputs + string_inputs
    assembler = VectorAssembler(inputCols=features_list,outputCol='features')
    output = assembler.transform(indexed).select('features','label')

#     final_data = output.select('features','label') #drop everything else
    
    # Now check for negative values and ask user if they want to correct that? 
    if df_minimum < 0:
        print(" ")
        print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
        print(" ")
    
    if treat_neg_values == True:
        print("You have opted to correct that by rescaling all your features to a range of 0 to 1")
        print(" ")
        print("We are rescaling your dataframe....")
        scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

        # Compute summary statistics and generate MinMaxScalerModel
        scalerModel = scaler.fit(output)

        # Rescale each feature to the scaler's [min, max] range, which defaults to [0, 1].
        scaled_data = scalerModel.transform(output)
        final_data = scaled_data.select('label','scaledFeatures')
        final_data = final_data.withColumnRenamed('scaledFeatures','features')
        print("Done!")

    else:
        print("You have opted not to correct that therefore you will not be able to use to Naive Bayes classifier")
        print("We will return the dataframe unscaled.")
        final_data = output
    
    return final_data
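To make the skewness treatment inside `MLClassifierDFPrep` easier to follow, here is a minimal plain-Python sketch of the same idea: measure skewness, clip the column to a lower and upper bound, then apply log(x + 1). It assumes the population skewness formula (third central moment divided by variance to the power 1.5). The helper names, the bounds, and the sample column are hypothetical and only for illustration.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def floor_cap_log(values, low, high):
    """Clip each value to [low, high], then apply log(x + 1)."""
    return [math.log(min(max(v, low), high) + 1) for v in values]


# Hypothetical right-skewed column: one large value drags the tail to the right
column = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 5.0]

before = skewness(column)
treated = floor_cap_log(column, low=0.1, high=1.0)
after = skewness(treated)

print(before > 1)      # the raw column would trigger the treatment
print(after < before)  # the treated column is less skewed
```

In the function above, the bounds come from `approxQuantile` rather than being chosen by hand as they are here.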

In [20]:
# Read in dependencies
from pyspark.ml.feature import VectorAssembler, StringIndexer, MinMaxScaler
from pyspark.sql.types import * 

from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [10]:
input_columns = df.columns
input_columns = input_columns[1:-1]

In [12]:
dependent_var = 'class'


In [13]:
final_data = MLClassifierDFPrep(df,input_columns,dependent_var)
final_data.limit(5).toPandas()

We are correcting for non normality now!
7-SpectralFluxm has been treated for positive (right) skewness. (skew =) 1.6396138160129063 )
22-ChromaVector1m has been treated for positive (right) skewness. (skew =) 2.4162415204309258 )
23-ChromaVector2m has been treated for positive (right) skewness. (skew =) 4.154796693680583 )
24-ChromaVector3m has been treated for positive (right) skewness. (skew =) 1.1974019617504328 )
25-ChromaVector4m has been treated for positive (right) skewness. (skew =) 2.446635863594906 )
26-ChromaVector5m has been treated for positive (right) skewness. (skew =) 2.154482876187508 )
27-ChromaVector6m has been treated for positive (right) skewness. (skew =) 2.01234064472543 )
28-ChromaVector7m has been treated for positive (right) skewness. (skew =) 1.1829228989215521 )
29-ChromaVector8m has been treated for positive (right) skewness. (skew =) 3.7372643733999955 )
30-ChromaVector9m has been treated for positive (right) skewness. (skew =) 2.4117416421548645 )
31-Chr

Unnamed: 0,label,features
0,0.0,"[0.5198182667002392, 0.30338998025826874, 0.89..."
1,0.0,"[0.4352954639925659, 0.37399300906442695, 0.88..."
2,0.0,"[0.2970571291217422, 0.44679648775095127, 0.74..."
3,0.0,"[0.3743526477343687, 0.5860529526571694, 0.796..."
4,0.0,"[0.5864323374662597, 0.5186704525553854, 0.882..."
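The feature values above fall between 0 and 1 because `MLClassifierDFPrep` rescaled them with `MinMaxScaler`. A minimal plain-Python sketch of that per-column rescaling follows, assuming the default target range of 0 to 1. The function name and the sample column are hypothetical; the constant-column rule mirrors the one described in the Spark documentation.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale a column so its smallest value maps to new_min and its largest to new_max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread, so map it to the middle of the target range
        return [0.5 * (new_min + new_max) for _ in column]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in column]


# Hypothetical column containing negative values
column = [-22.7, -21.8, -22.5, -21.5, -21.4]
scaled = min_max_scale(column)

print(scaled)
```

After rescaling there are no negative values left, which is what makes the data usable with Naive Bayes.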


## Set up our Training and Evaluation Function


In [17]:
def ClassTrainEval(classifier,features,classes,folds,train,test):
    
    def FindMtype(classifier):
        # The classifier arrives already instantiated
        M = classifier
        # Look up its class name
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,folds,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("GBTClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("LinearSVC"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=folds) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,folds,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            global OVR_BestModel
            OVR_BestModel = BestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept)
                print('\033[1m' + 'Top 20 Coefficients:'+ '\033[0m')
                coeff_array = model.coefficients.toArray()
                coeff_scores = []
                for x in coeff_array:
                    coeff_scores.append(float(x))
                # Then zip with input_columns list and create a df
                result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
                result.orderBy(result["coeff"].desc()).show(truncate=False) # show() prints the table and returns None


        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype + '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")
            global MLPC_BestModel
            MLPC_BestModel = fitModel

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature's importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Top 20 Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            featureImportances = BestModel.featureImportances.toArray()
            # Convert from numpy array to list
            imp_scores = []
            for x in featureImportances:
                imp_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,imp_scores), schema=['feature','score'])
            result.orderBy(result["score"].desc()).show(truncate=False) # show() prints the table and returns None
            
            # Save the feature importance values and the models
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        # Print the coefficients
        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.interceptVector))
            print('\033[1m' + " Top 20 Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
            # coefficientMatrix has one row per class; only the first row (class index 0) is ranked here
            coeff_array = BestModel.coefficientMatrix.toArray()
            coeff_scores = []
            for x in coeff_array[0]:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            result.orderBy(result["coeff"].desc()).show(truncate=False) # show() prints the table and returns None
            # Save the coefficient values and the models
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        # Print the Coefficients
        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            print("Intercept: " + str(BestModel.intercept))
            print('\033[1m' + "Top 20 Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
#             print("Coefficients: \n" + str(BestModel.coefficients))
            coeff_array = BestModel.coefficients.toArray()
            coeff_scores = []
            for x in coeff_array:
                coeff_scores.append(float(x))
            # Then zip with input_columns list and create a df
            result = spark.createDataFrame(zip(input_columns,coeff_scores), schema=['feature','coeff'])
            result.orderBy(result["coeff"].desc()).show(truncate=False) # show() prints the table and returns None
            # Save the coefficient values and the models
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction" is the default
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] # convert to a string and wrap in a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    #Also returns the fit model important scores or p values
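`ClassTrainEval` relies on `CrossValidator` to try each value in the parameter grid and keep the best one. The plain-Python sketch below illustrates that selection loop: split the rows into k folds, score every candidate on each held-out fold, and keep the candidate with the highest mean score. The helper names and the toy scoring function are hypothetical, and the interleaved split is a simplification of the random split Spark performs.

```python
def k_fold_indices(n_rows, k):
    """Split range(n_rows) into k (train, validation) index pairs."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    pairs = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, validation))
    return pairs


def pick_best(param_values, score_fn, n_rows, k):
    """Return the parameter value with the highest mean validation score."""
    best_value, best_score = None, float("-inf")
    for value in param_values:
        scores = [score_fn(value, train, val) for train, val in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_value, best_score = value, mean_score
    return best_value


# Hypothetical scoring function that pretends maxDepth=5 generalizes best
def toy_score(max_depth, train_idx, val_idx):
    return 1.0 - abs(max_depth - 5) / 10


print(pick_best([2, 5, 10], toy_score, n_rows=10, k=2))
```

One detail worth keeping in mind: as far as I know, `MulticlassClassificationEvaluator()` with no arguments scores by F1, so the cross-validation above selects parameters by F1 while the final results table reports accuracy.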

In [18]:
# Count the distinct classes (binary vs. multiclass) and collect the result into a Python object
class_count = df.select(countDistinct("class")).collect()
classes = class_count[0][0]
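The class count also sets the size of the output layer for the `MultilayerPerceptronClassifier` branch of `ClassTrainEval`. The sketch below rebuilds that layer list for this dataset (71 features, 23 classes) and estimates how many weights the network holds. The weight count assumes every layer is fully connected with one bias per output unit, which is my assumption rather than something the notebook confirms; the helper names are hypothetical.

```python
def mlp_layers(n_features, n_classes):
    """Same layout as in ClassTrainEval: input, two hidden layers, output."""
    return [n_features, n_features + 1, n_features, n_classes]


def count_weights(layers):
    """Weights plus biases, assuming fully connected layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers = mlp_layers(n_features=71, n_classes=23)
print(layers)
print(count_weights(layers))
```

With only 2,300 songs, a network of this size has several times more parameters than training rows, which is worth remembering when comparing its accuracy with the simpler models.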

### Test 1: Without outlier, skew or negative value treatment


In [21]:
# Call on data prep, train and evaluate functions
test1_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=False,treat_neg_values=False)
test1_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
#                ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test1_data.randomSplit([0.7,0.3])
features = test1_data.select(['features']).collect()
folds = 2 # because we have limited data

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

 
 
You have opted not to correct that therefore you will not be able to use to Naive Bayes classifier
We will return the dataframe unscaled.
 
LogisticRegression
Intercept: [-0.03918683269121573,0.10629061147324848,-0.1065864936313428,-0.02530160676507121,0.034496261097686005,-0.01568447275965721,0.07148430706527434,-0.006469104558045878,-0.010080660578500456,0.0016804568433472636,-0.11804690649997898,-0.013000090552532753,0.001675831314454198,-0.015787443010099402,-0.012887699013786152,-0.022158003746879394,-0.0807630717469547,0.114346977176459,0.013704565767346572,-0.011610621446589229,0.03483436898985101,-0.0037299932606389913,0.102779620533626]
 Top 20 Coefficients
You should compares these relative to eachother
+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|63-ChromaVector8std |35.32696925654321 |
|29-ChromaVector8m   |23.263952303533493|
|30-ChromaVector9m   |22.10130022377164 |
|33-C

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|22-ChromaVector1m   |56.54990122028918 |
|63-ChromaVector8std |38.939443490622544|
|31-ChromaVector10m  |38.18284547567682 |
|30-ChromaVector9m   |27.191371637259262|
|25-ChromaVector4m   |20.846847133169184|
|62-ChromaVector7std |9.127174965961409 |
|2-Energym           |7.41882551980643  |
|36-Energystd        |4.103048241809875 |
|51-MFCCs9std        |3.6096115148087446|
|5-SpectralSpreadm   |3.207837718641112 |
|52-MFCCs10std       |2.8612807287314315|
|58-ChromaVector3std |2.8087855674602213|
|3-EnergyEntropym    |2.691632344291727 |
|64-ChromaVector9std |2.6027197331907277|
|49-MFCCs7std        |2.1113764487252364|
|67-ChromaVector12std|1.7756302050980415|
|53-MFCCs11std       |1.6393705389621531|
|12-MFCCs4m          |1.3430149748156903|
|19-MFCCs11m         |1.1834743298076722|
|43-MFCCs1std        |1.1460654052729713|
+--------------------+------------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|57-ChromaVector2std   |88.45951076885791 |
|64-ChromaVector9std   |88.12634122498228 |
|63-ChromaVector8std   |81.03257298210963 |
|25-ChromaVector4m     |25.59835442804075 |
|60-ChromaVector5std   |24.85765475467467 |
|31-ChromaVector10m    |15.339847846141383|
|59-ChromaVector4std   |13.363520807866339|
|38-SpectralCentroidstd|12.78350684352877 |
|35-ZCRstd             |10.781010546038294|
|55-MFCCs13std         |10.192538297233593|
|46-MFCCs4std          |5.019583674405849 |
|53-MFCCs11std         |3.7466225732574787|
|42-SpectralRolloffstd |3.7171815623258286|
|54-MFCCs12std         |2.609268844243488 |
|52-MFCCs10std         |2.259081861812972 |
|20-MFCCs12m           |1.7022572456917724|
|17-MFCCs9m            |1.5853020112492613|
|40-SpectralEntropystd |1.2406507267987814|
|36-Energystd          |1.233561534362306 |
|70-BPMconf            |1.135256

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|33-ChromaVector12m  |42.29052436388119 |
|31-ChromaVector10m  |36.353306950985576|
|25-ChromaVector4m   |35.017454352271685|
|65-ChromaVector10std|31.318766944997332|
|23-ChromaVector2m   |30.88737895613705 |
|56-ChromaVector1std |29.83344862880314 |
|41-SpectralFluxstd  |28.054594284115538|
|27-ChromaVector6m   |23.602070656948165|
|59-ChromaVector4std |22.711938285616917|
|66-ChromaVector11std|21.771249700053282|
|30-ChromaVector9m   |17.95273603257814 |
|60-ChromaVector5std |17.58174313765497 |
|7-SpectralFluxm     |17.043783673055888|
|5-SpectralSpreadm   |13.085392294504953|
|22-ChromaVector1m   |13.006435923418188|
|62-ChromaVector7std |11.574522189428174|
|61-ChromaVector6std |6.657719534859626 |
|3-EnergyEntropym    |4.15265243222154  |
|19-MFCCs11m         |3.772079528727229 |
|4-SpectralCentroidm |3.3834101565998997|
+--------------------+------------
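For `OneVsRest`, `ClassTrainEval` prints one coefficient table per binary model, because the classifier trains a separate binary model for each genre. A minimal plain-Python sketch of how such a set of models could turn into a single prediction is shown below: score the song with every binary model and pick the genre whose model is most confident. The linear scoring, the helper names, and the two-feature models are hypothetical simplifications for illustration.

```python
def linear_score(coefficients, intercept, features):
    """Raw score of one binary model: dot(coefficients, features) + intercept."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


def one_vs_rest_predict(models, features):
    """models maps label -> (coefficients, intercept) of that label's binary model."""
    scores = {label: linear_score(coef, b, features) for label, (coef, b) in models.items()}
    return max(scores, key=scores.get)


# Hypothetical two-feature models for three genres
models = {
    "Techno": ([1.0, -0.5], 0.0),
    "Trance": ([-0.2, 0.8], 0.1),
    "House": ([0.3, 0.3], -0.2),
}

print(one_vs_rest_predict(models, [0.9, 0.1]))
print(one_vs_rest_predict(models, [0.1, 0.9]))
```

This is also why each table should be read on its own: a large coefficient only says that the feature pushes a song toward that one genre and away from all the others combined.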

### Test 2: Test treatments

Train and evaluate models on treated data (outliers, skewness and negative values) and compare them to the Test 1 baseline.

In [22]:
# Call on data prep, train and evaluate functions
test2_data = MLClassifierDFPrep(df,input_columns,dependent_var,treat_outliers=True,treat_neg_values=True)
test2_data.limit(5).toPandas()

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
               ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

train,test = test2_data.randomSplit([0.7,0.3])
features = test2_data.select(['features']).collect()
folds = 2

#set up your results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
print("!!!!!Final Results!!!!!!!!")
results.show(100,False)

We are correcting for non normality now!
7-SpectralFluxm has been treated for positive (right) skewness. (skew =) 1.6396138160129063 )
22-ChromaVector1m has been treated for positive (right) skewness. (skew =) 2.4162415204309258 )
23-ChromaVector2m has been treated for positive (right) skewness. (skew =) 4.154796693680583 )
24-ChromaVector3m has been treated for positive (right) skewness. (skew =) 1.1974019617504328 )
25-ChromaVector4m has been treated for positive (right) skewness. (skew =) 2.446635863594906 )
26-ChromaVector5m has been treated for positive (right) skewness. (skew =) 2.154482876187508 )
27-ChromaVector6m has been treated for positive (right) skewness. (skew =) 2.01234064472543 )
28-ChromaVector7m has been treated for positive (right) skewness. (skew =) 1.1829228989215521 )
29-ChromaVector8m has been treated for positive (right) skewness. (skew =) 3.7372643733999955 )
30-ChromaVector9m has been treated for positive (right) skewness. (skew =) 2.4117416421548645 )
31-Chr

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|5-SpectralSpreadm    |4.247542035489987 |
|2-Energym            |4.2059527534896   |
|43-MFCCs1std         |3.3631103271914093|
|59-ChromaVector4std  |2.685556847414041 |
|69-BPM               |2.4946539170166364|
|42-SpectralRolloffstd|2.247954954500442 |
|4-SpectralCentroidm  |2.0784493732354457|
|20-MFCCs12m          |1.6804679329064915|
|10-MFCCs2m           |1.6212541597857189|
|23-ChromaVector2m    |1.5978640725672464|
|40-SpectralEntropystd|1.449733902556846 |
|60-ChromaVector5std  |1.2487501671041492|
|14-MFCCs6m           |1.244784595878367 |
|24-ChromaVector3m    |1.2125358965684725|
|11-MFCCs3m           |1.0323934184037424|
|48-MFCCs6std         |0.99213063989287  |
|9-MFCCs1m            |0.9770663580386122|
|44-MFCCs2std         |0.9417175856731128|
|34-ChromaDeviationm  |0.7679083991003852|
|66-ChromaVector11std |0.733482155342711 |
+----------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|13-MFCCs5m            |4.0069630480003005|
|2-Energym             |3.3445698780660673|
|70-BPMconf            |2.7434894729986206|
|7-SpectralFluxm       |2.260784474835555 |
|30-ChromaVector9m     |2.1026429944181526|
|45-MFCCs3std          |1.8886786862519005|
|38-SpectralCentroidstd|1.5669056464380067|
|1-ZCRm                |1.5169388990710713|
|28-ChromaVector7m     |1.3860182589121266|
|15-MFCCs7m            |1.2479784500724926|
|21-MFCCs13m           |0.9205156085617573|
|56-ChromaVector1std   |0.8516080455463587|
|71-BPMessentia        |0.7801203204526904|
|44-MFCCs2std          |0.7386269391763199|
|66-ChromaVector11std  |0.6942485025399656|
|4-SpectralCentroidm   |0.587954573915287 |
|12-MFCCs4m            |0.5284243601247357|
|46-MFCCs4std          |0.4704050774857298|
|35-ZCRstd             |0.2766753508349693|
|29-ChromaVector8m     |0.271018

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|23-ChromaVector2m     |2.7700457628214687|
|67-ChromaVector12std  |2.747997887974174 |
|43-MFCCs1std          |2.6072293163240396|
|11-MFCCs3m            |2.370849184509424 |
|71-BPMessentia        |2.1010127550954323|
|41-SpectralFluxstd    |1.9993676756200485|
|35-ZCRstd             |1.9968280930120483|
|15-MFCCs7m            |1.5224799710159278|
|42-SpectralRolloffstd |1.329448911061085 |
|47-MFCCs5std          |1.322576752399254 |
|38-SpectralCentroidstd|1.2852656942718343|
|13-MFCCs5m            |1.2732529329496687|
|12-MFCCs4m            |1.150378236555362 |
|44-MFCCs2std          |1.0508915820393483|
|37-EnergyEntropystd   |1.009428843433784 |
|51-MFCCs9std          |0.9290499002090249|
|45-MFCCs3std          |0.9185116760593931|
|56-ChromaVector1std   |0.8779292229348162|
|33-ChromaVector12m    |0.7703049260143897|
|60-ChromaVector5std   |0.613243

## Test 3: Feature Selection
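
The cell below keeps the top n features and re-evaluates the model for each n. As a minimal sketch of the ranking step used in its commented-out tree-based branch, the plain-Python snippet here picks the indices of the n largest importance scores. The scores and the `top_n_indices` name are made up for illustration; in the notebook the indices would come from a fitted model's feature importances and be passed to `VectorSlicer`.

```python
def top_n_indices(importances, n):
    """Return the indices of the n largest scores, highest first."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return ranked[:n]

# Hypothetical importance scores for five features
importances = [0.05, 0.30, 0.10, 0.40, 0.15]
best = top_n_indices(importances, 3)
print(best)  # [3, 1, 4]
```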


In [23]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

# Select the top n features and view the results for each n
maximum = len(input_columns)
for n in range(10,maximum,10):
    print("Testing top n = ",n," features")
    
    # For Tree classifiers
#     best_n_features = RF_featureimportances.argsort()[-n:][::-1]
#     best_n_features= best_n_features.tolist() # convert to a list
#     vs = VectorSlicer(inputCol="features", outputCol="best_features", indices=best_n_features)
#     bestFeaturesDf = vs.transform(test2_data)

    # For Logistic regression or One vs Rest
    selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
    bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
    bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
    bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

    # Collect features
    features = bestFeaturesDf.select(['features']).collect()

    # Split
    train,test = bestFeaturesDf.randomSplit([0.7,0.3])
    
    # Specify folds
    folds = 2

    # Set up the results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    results.show(100,False)

Testing top n =  10  features
 
OneVsRest
Intercept: -4.487672346776169
Top 20 Coefficients:
+-------------------+-------------------+
|feature            |coeff              |
+-------------------+-------------------+
|1-ZCRm             |3.683439265649151  |
|10-MFCCs2m         |1.9337872121180415 |
|8-SpectralRolloffm |1.2557424320998343 |
|2-Energym          |0.9840711570584602 |
|7-SpectralFluxm    |0.46893029689142207|
|4-SpectralCentroidm|-0.900565781031107 |
|3-EnergyEntropym   |-1.765073953687721 |
|5-SpectralSpreadm  |-3.24020794863662  |
|9-MFCCs1m          |-3.6455050272759015|
|6-SpectralEntropym |-6.988124806794963 |
+-------------------+-------------------+

Intercept: -7.031825249250782
Top 20 Coefficients:
+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|9-MFCCs1m          |4.702880903787596   |
|10-MFCCs2m         |2.95679190776706    |
|7-Sp

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|9-MFCCs1m          |4.17797043262139    |
|1-ZCRm             |3.8256779202480016  |
|8-SpectralRolloffm |3.7879377888071666  |
|7-SpectralFluxm    |3.156980987732192   |
|4-SpectralCentroidm|1.6397506608593382  |
|2-Energym          |-0.06401856605565351|
|5-SpectralSpreadm  |-0.35908511605182053|
|3-EnergyEntropym   |-0.6820398688810168 |
|10-MFCCs2m         |-0.8677431284590189 |
|6-SpectralEntropym |-2.2417979395549485 |
+-------------------+--------------------+

Intercept: -3.4699125629221403
Top 20 Coefficients:
+-------------------+-------------------+
|feature            |coeff              |
+-------------------+-------------------+
|5-SpectralSpreadm  |2.4503170037961546 |
|7-SpectralFluxm    |1.8828852108287308 |
|10-MFCCs2m         |1.6933112612654766 |
|1-ZCRm             |0.8442087309570834 |
|8-SpectralRolloffm |0.334028

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|12-MFCCs4m         |3.151490993891898   |
|13-MFCCs5m         |2.758421619990745   |
|1-ZCRm             |2.655238273498576   |
|9-MFCCs1m          |0.855882395565747   |
|6-SpectralEntropym |0.588865086299153   |
|5-SpectralSpreadm  |0.5374418263544116  |
|16-MFCCs8m         |0.48025859649235575 |
|17-MFCCs9m         |0.46263024976986067 |
|10-MFCCs2m         |0.23058280687062715 |
|8-SpectralRolloffm |0.04696431630541617 |
|14-MFCCs6m         |-0.0621942775269257 |
|3-EnergyEntropym   |-0.35307243161780305|
|15-MFCCs7m         |-0.4111061306214449 |
|11-MFCCs3m         |-0.6327672540263087 |
|2-Energym          |-0.7346608998362579 |
|18-MFCCs10m        |-0.9094347900850832 |
|20-MFCCs12m        |-0.9407006271764123 |
|4-SpectralCentroidm|-1.5885921438631496 |
|7-SpectralFluxm    |-2.0284987050971104 |
|19-MFCCs11m        |-4.572827406358226  |
+----------

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|14-MFCCs6m         |4.436710527127969   |
|12-MFCCs4m         |3.0622330194725556  |
|19-MFCCs11m        |2.7542068429666933  |
|3-EnergyEntropym   |2.118857360805834   |
|11-MFCCs3m         |1.6273092949809325  |
|4-SpectralCentroidm|1.3271180759358856  |
|8-SpectralRolloffm |0.951589839733944   |
|15-MFCCs7m         |0.9494218790287848  |
|18-MFCCs10m        |0.8483829679282048  |
|13-MFCCs5m         |0.47793892748974226 |
|9-MFCCs1m          |0.20462317123224935 |
|7-SpectralFluxm    |-0.23019037305181872|
|5-SpectralSpreadm  |-0.38878451747490794|
|16-MFCCs8m         |-0.43578138411762973|
|6-SpectralEntropym |-0.6269151748270196 |
|10-MFCCs2m         |-0.6947101950407887 |
|1-ZCRm             |-0.7657905504279519 |
|17-MFCCs9m         |-1.5613590306102996 |
|2-Energym          |-2.220725810094586  |
|20-MFCCs12m        |-3.9183775333098203 |
+----------

+-------------------+--------------------+
|feature            |coeff               |
+-------------------+--------------------+
|20-MFCCs12m        |8.517748317383187   |
|8-SpectralRolloffm |2.915952307406558   |
|19-MFCCs11m        |2.86684666132009    |
|18-MFCCs10m        |2.8296069978149383  |
|2-Energym          |2.019190020274551   |
|6-SpectralEntropym |1.942886423708473   |
|17-MFCCs9m         |0.9564837308599003  |
|7-SpectralFluxm    |0.1900573066164808  |
|1-ZCRm             |-0.25471674411423406|
|11-MFCCs3m         |-1.0221452196445455 |
|14-MFCCs6m         |-1.1162693864051938 |
|4-SpectralCentroidm|-1.1589305504794665 |
|3-EnergyEntropym   |-1.775068729258134  |
|5-SpectralSpreadm  |-1.878647903697903  |
|16-MFCCs8m         |-1.9569004138867925 |
|9-MFCCs1m          |-2.117010392000153  |
|10-MFCCs2m         |-2.3263530655641063 |
|15-MFCCs7m         |-2.745076603230554  |
|12-MFCCs4m         |-3.255318865751899  |
|13-MFCCs5m         |-3.7046130687864336 |
+----------

+-------------------+---------------------+
|feature            |coeff                |
+-------------------+---------------------+
|3-EnergyEntropym   |5.4582749576587055   |
|12-MFCCs4m         |2.514494854753556    |
|23-ChromaVector2m  |2.1108439458688992   |
|22-ChromaVector1m  |1.7340286648257024   |
|19-MFCCs11m        |1.6988929042753091   |
|9-MFCCs1m          |1.594508409342367    |
|10-MFCCs2m         |1.2327908747845118   |
|30-ChromaVector9m  |0.939481133958861    |
|20-MFCCs12m        |0.6035653683334747   |
|18-MFCCs10m        |0.21373228098198446  |
|4-SpectralCentroidm|0.17926932338453275  |
|25-ChromaVector4m  |0.014157223737505603 |
|13-MFCCs5m         |-0.024128754354001695|
|11-MFCCs3m         |-0.06839085645492034 |
|27-ChromaVector6m  |-0.09270376378584695 |
|17-MFCCs9m         |-0.21073759143331228 |
|28-ChromaVector7m  |-0.24698822304050527 |
|16-MFCCs8m         |-0.3882454341567158  |
|21-MFCCs13m        |-0.42538578557485857 |
|14-MFCCs6m         |-0.60515503

+------------------+--------------------+
|feature           |coeff               |
+------------------+--------------------+
|1-ZCRm            |3.4770643692369285  |
|28-ChromaVector7m |3.0985985520867634  |
|10-MFCCs2m        |2.523520732504476   |
|6-SpectralEntropym|2.4482189063677753  |
|30-ChromaVector9m |2.3119824099950774  |
|8-SpectralRolloffm|2.293468663935132   |
|16-MFCCs8m        |2.228216319953584   |
|5-SpectralSpreadm |1.1834040694676606  |
|29-ChromaVector8m |1.013864230482076   |
|18-MFCCs10m       |0.44236722618755675 |
|15-MFCCs7m        |0.32005673946961805 |
|26-ChromaVector5m |0.22070185086145638 |
|14-MFCCs6m        |-0.12306868191782713|
|24-ChromaVector3m |-0.14324909458544555|
|23-ChromaVector2m |-0.5064023552958379 |
|22-ChromaVector1m |-0.5223028935402153 |
|3-EnergyEntropym  |-0.7178629888299176 |
|7-SpectralFluxm   |-1.1357789491862869 |
|17-MFCCs9m        |-1.2176724986987373 |
|20-MFCCs12m       |-1.3072800342880755 |
+------------------+--------------

+------------------+--------------------+
|feature           |coeff               |
+------------------+--------------------+
|19-MFCCs11m       |4.796840455231579   |
|27-ChromaVector6m |2.8983272202619728  |
|8-SpectralRolloffm|2.687751573775403   |
|17-MFCCs9m        |2.2464284059188415  |
|6-SpectralEntropym|2.2288647648196163  |
|29-ChromaVector8m |2.174600283007778   |
|16-MFCCs8m        |2.0775182332783237  |
|18-MFCCs10m       |1.852066110018495   |
|11-MFCCs3m        |1.5447753537878377  |
|9-MFCCs1m         |1.0515908453375307  |
|12-MFCCs4m        |0.9603299710023947  |
|15-MFCCs7m        |0.8372325750798303  |
|28-ChromaVector7m |0.8016502127859662  |
|22-ChromaVector1m |0.09327558264532215 |
|7-SpectralFluxm   |0.08747766054521276 |
|24-ChromaVector3m |0.07088856647218339 |
|3-EnergyEntropym  |-0.15280602452074832|
|5-SpectralSpreadm |-0.40135033813505877|
|21-MFCCs13m       |-0.4577210204096481 |
|2-Energym         |-0.7573458111577445 |
+------------------+--------------

+----------------------+-------------------+
|feature               |coeff              |
+----------------------+-------------------+
|20-MFCCs12m           |3.58359027536417   |
|19-MFCCs11m           |3.4788955007058173 |
|39-SpectralSpreadstd  |3.320194083222578  |
|4-SpectralCentroidm   |2.9451829397997327 |
|30-ChromaVector9m     |2.189698049386325  |
|15-MFCCs7m            |2.076331444435211  |
|3-EnergyEntropym      |1.6810549551271172 |
|37-EnergyEntropystd   |1.474349041671123  |
|22-ChromaVector1m     |0.8258144204708656 |
|31-ChromaVector10m    |0.69547604473737   |
|10-MFCCs2m            |0.5765915452845913 |
|1-ZCRm                |0.35672630390241694|
|38-SpectralCentroidstd|0.3543769321965663 |
|7-SpectralFluxm       |0.21623727038507326|
|5-SpectralSpreadm     |0.19534620043049197|
|28-ChromaVector7m     |0.14404396795667587|
|8-SpectralRolloffm    |0.13285913524473866|
|32-ChromaVector11m    |0.13089535283898296|
|34-ChromaDeviationm   |0.06802432889863999|
|27-Chroma

+----------------------+-------------------+
|feature               |coeff              |
+----------------------+-------------------+
|34-ChromaDeviationm   |3.723723865307236  |
|40-SpectralEntropystd |2.5254719951662907 |
|1-ZCRm                |2.2393570765787407 |
|38-SpectralCentroidstd|1.67279244878641   |
|6-SpectralEntropym    |1.4695300607682398 |
|22-ChromaVector1m     |1.3398173317388078 |
|10-MFCCs2m            |1.2189683717938196 |
|39-SpectralSpreadstd  |1.1475849608212774 |
|21-MFCCs13m           |1.0085010011765125 |
|26-ChromaVector5m     |0.9282872778353995 |
|23-ChromaVector2m     |0.7636580637339203 |
|5-SpectralSpreadm     |0.7157862165111797 |
|33-ChromaVector12m    |0.6150658982724828 |
|31-ChromaVector10m    |0.6094904248435089 |
|25-ChromaVector4m     |0.4774709394458692 |
|19-MFCCs11m           |0.2867824964990269 |
|32-ChromaVector11m    |0.14254650002607427|
|35-ZCRstd             |0.06308438197753613|
|28-ChromaVector7m     |-0.4634834991404609|
|3-EnergyE

+----------------------+-------------------+
|feature               |coeff              |
+----------------------+-------------------+
|27-ChromaVector6m     |4.042981146068799  |
|14-MFCCs6m            |3.044012354616658  |
|25-ChromaVector4m     |2.4310251287063656 |
|10-MFCCs2m            |2.395695852728696  |
|7-SpectralFluxm       |2.2232322133937448 |
|34-ChromaDeviationm   |2.1767191181701073 |
|37-EnergyEntropystd   |2.020314511339178  |
|6-SpectralEntropym    |1.9832475361041302 |
|38-SpectralCentroidstd|1.839659400741074  |
|24-ChromaVector3m     |1.6819769266414781 |
|16-MFCCs8m            |1.6092325812121353 |
|30-ChromaVector9m     |1.4011069670670504 |
|23-ChromaVector2m     |1.3136600296204142 |
|26-ChromaVector5m     |1.2872026558399383 |
|21-MFCCs13m           |1.0470292030718913 |
|17-MFCCs9m            |0.9825759616740269 |
|22-ChromaVector1m     |0.9042810573931751 |
|33-ChromaVector12m    |0.3822638678486861 |
|11-MFCCs3m            |0.36955729446773694|
|2-Energym

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|16-MFCCs8m            |4.700325924241248 |
|6-SpectralEntropym    |4.361969153097247 |
|22-ChromaVector1m     |2.300200365388174 |
|7-SpectralFluxm       |2.2105315045567795|
|14-MFCCs6m            |1.92688766322267  |
|47-MFCCs5std          |1.8998437871275025|
|35-ZCRstd             |1.4970724654265737|
|50-MFCCs8std          |1.4715941125799938|
|25-ChromaVector4m     |1.2033641087172502|
|15-MFCCs7m            |1.2031662229588467|
|40-SpectralEntropystd |1.165160051236462 |
|48-MFCCs6std          |0.9972994322786066|
|37-EnergyEntropystd   |0.9794032454997343|
|13-MFCCs5m            |0.9730753455744994|
|31-ChromaVector10m    |0.9456076152125082|
|23-ChromaVector2m     |0.8037983295991604|
|38-SpectralCentroidstd|0.7873806948464034|
|20-MFCCs12m           |0.7575359681745282|
|24-ChromaVector3m     |0.6711638854274983|
|33-ChromaVector12m    |0.577785

+---------------------+-------------------+
|feature              |coeff              |
+---------------------+-------------------+
|47-MFCCs5std         |3.890296978148169  |
|43-MFCCs1std         |3.3741525109865695 |
|48-MFCCs6std         |2.713041034454776  |
|3-EnergyEntropym     |2.6016457885176374 |
|50-MFCCs8std         |2.500228177225589  |
|10-MFCCs2m           |2.2989520007478643 |
|9-MFCCs1m            |2.045755436896216  |
|29-ChromaVector8m    |1.9841324707478345 |
|2-Energym            |1.9585148859185433 |
|49-MFCCs7std         |1.8207166825934755 |
|1-ZCRm               |1.4965275735244665 |
|8-SpectralRolloffm   |1.3994581345061954 |
|14-MFCCs6m           |1.3573276235366742 |
|4-SpectralCentroidm  |1.1264171825154994 |
|44-MFCCs2std         |0.9762963790863245 |
|33-ChromaVector12m   |0.9104701410109772 |
|28-ChromaVector7m    |0.8353579182553228 |
|18-MFCCs10m          |0.6147669817578113 |
|30-ChromaVector9m    |0.4780585927775791 |
|42-SpectralRolloffstd|0.4266844

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|35-ZCRstd           |2.714719950474263 |
|21-MFCCs13m         |2.510210739569409 |
|48-MFCCs6std        |2.4269513713155972|
|39-SpectralSpreadstd|2.168980116615324 |
|46-MFCCs4std        |2.0480565978537846|
|43-MFCCs1std        |1.919730684943463 |
|19-MFCCs11m         |1.8482635709055046|
|31-ChromaVector10m  |1.719324151670775 |
|34-ChromaDeviationm |1.6225199981858605|
|49-MFCCs7std        |1.4939217135349991|
|28-ChromaVector7m   |1.394103397264566 |
|30-ChromaVector9m   |1.348496562200444 |
|22-ChromaVector1m   |1.2602576342231941|
|33-ChromaVector12m  |1.2417261167954297|
|32-ChromaVector11m  |1.1090551357320027|
|11-MFCCs3m          |1.080374114026032 |
|9-MFCCs1m           |1.0792314608745077|
|15-MFCCs7m          |1.0639703187942873|
|14-MFCCs6m          |0.7222319183567779|
|29-ChromaVector8m   |0.7021412566829875|
+--------------------+------------

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|5-SpectralSpreadm   |4.358192125811377 |
|2-Energym           |4.126581925176677 |
|59-ChromaVector4std |2.658429010252928 |
|49-MFCCs7std        |2.4209210871333595|
|9-MFCCs1m           |2.164065002647313 |
|34-ChromaDeviationm |2.088119398915176 |
|35-ZCRstd           |1.854465523174072 |
|4-SpectralCentroidm |1.8440768879268015|
|27-ChromaVector6m   |1.8112235222055595|
|36-Energystd        |1.6911366577095484|
|54-MFCCs12std       |1.6847091624674007|
|32-ChromaVector11m  |1.2763070549286086|
|50-MFCCs8std        |1.0137092253680153|
|8-SpectralRolloffm  |0.9638899629800909|
|16-MFCCs8m          |0.9388116239210671|
|56-ChromaVector1std |0.9340144048451848|
|39-SpectralSpreadstd|0.8132852514433997|
|13-MFCCs5m          |0.7026946118542147|
|26-ChromaVector5m   |0.5601289896709495|
|10-MFCCs2m          |0.5520954791840272|
+--------------------+------------

+-------------------+---------------------+
|feature            |coeff                |
+-------------------+---------------------+
|2-Energym          |4.396090763459249    |
|7-SpectralFluxm    |2.606290454208094    |
|12-MFCCs4m         |2.5128738702309055   |
|1-ZCRm             |2.4689070696672597   |
|21-MFCCs13m        |1.6667689417506772   |
|30-ChromaVector9m  |1.652160625590636    |
|46-MFCCs4std       |1.4294204464299929   |
|22-ChromaVector1m  |1.1374358242645273   |
|60-ChromaVector5std|0.9789554986707054   |
|36-Energystd       |0.9253200155644565   |
|29-ChromaVector8m  |0.6801479158921383   |
|14-MFCCs6m         |0.6575380283100476   |
|4-SpectralCentroidm|0.6236943323588249   |
|17-MFCCs9m         |0.5932651090191409   |
|6-SpectralEntropym |0.34306808485674345  |
|23-ChromaVector2m  |0.19525564338330526  |
|52-MFCCs10std      |0.15198656388335827  |
|56-ChromaVector1std|0.06264727128216566  |
|51-MFCCs9std       |0.012866249493885946 |
|31-ChromaVector10m |-0.00980559

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|10-MFCCs2m          |2.6406836290952724|
|30-ChromaVector9m   |2.2814396575985723|
|19-MFCCs11m         |1.9421185264772458|
|60-ChromaVector5std |1.771117759447315 |
|57-ChromaVector2std |1.7626039665023834|
|35-ZCRstd           |1.6756746645756642|
|33-ChromaVector12m  |1.6631863637343336|
|34-ChromaDeviationm |1.6306177955872039|
|29-ChromaVector8m   |1.2866729795278182|
|22-ChromaVector1m   |1.1469537104738332|
|39-SpectralSpreadstd|1.1073541607793191|
|14-MFCCs6m          |0.9956502391015851|
|50-MFCCs8std        |0.9186418125222228|
|21-MFCCs13m         |0.8435541224208544|
|9-MFCCs1m           |0.7995838989459605|
|46-MFCCs4std        |0.7896893706906678|
|44-MFCCs2std        |0.7850538799572215|
|27-ChromaVector6m   |0.780275976366032 |
|55-MFCCs13std       |0.7077222358567006|
|12-MFCCs4m          |0.7069087598962623|
+--------------------+------------

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|13-MFCCs5m           |4.243049082969794 |
|69-BPM               |3.570544665528942 |
|34-ChromaDeviationm  |2.9702896303197273|
|24-ChromaVector3m    |2.845967264872324 |
|47-MFCCs5std         |1.8308541737636268|
|45-MFCCs3std         |1.778853508887632 |
|36-Energystd         |1.6355650428430573|
|8-SpectralRolloffm   |1.4836782760394656|
|3-EnergyEntropym     |1.398815448155346 |
|4-SpectralCentroidm  |1.2254738146510835|
|68-ChromaDeviationstd|1.2155622577956466|
|44-MFCCs2std         |1.1438824838010349|
|53-MFCCs11std        |1.0987632050100458|
|65-ChromaVector10std |1.0949959720391935|
|29-ChromaVector8m    |1.040398152442052 |
|21-MFCCs13m          |0.9760885990721134|
|59-ChromaVector4std  |0.894994423991329 |
|2-Energym            |0.7626546848511305|
|25-ChromaVector4m    |0.7483879259802303|
|50-MFCCs8std         |0.7196207862914347|
+----------

+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|69-BPM              |4.08434296260916  |
|44-MFCCs2std        |3.6167445392935216|
|36-Energystd        |3.0940739934262695|
|55-MFCCs13std       |2.8164321888464388|
|27-ChromaVector6m   |2.3242666336491897|
|54-MFCCs12std       |2.1399815912561406|
|3-EnergyEntropym    |1.8656786191742627|
|18-MFCCs10m         |1.6493350993422802|
|34-ChromaDeviationm |1.4524656648622754|
|49-MFCCs7std        |1.3056733220590488|
|14-MFCCs6m          |1.2579648085885162|
|2-Energym           |1.0542095869010712|
|6-SpectralEntropym  |0.9864685598231963|
|53-MFCCs11std       |0.9393249132638196|
|29-ChromaVector8m   |0.9236420859279658|
|63-ChromaVector8std |0.8357882860720817|
|24-ChromaVector3m   |0.7301586512414406|
|48-MFCCs6std        |0.7214464738654626|
|52-MFCCs10std       |0.6337754935459532|
|66-ChromaVector11std|0.5290626314037385|
+--------------------+------------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|18-MFCCs10m           |2.5301911343234464|
|65-ChromaVector10std  |2.381319943770135 |
|36-Energystd          |1.800380710957646 |
|22-ChromaVector1m     |1.6827322682635455|
|44-MFCCs2std          |1.650627734303081 |
|33-ChromaVector12m    |1.6361208148398512|
|70-BPMconf            |1.6097452130398373|
|28-ChromaVector7m     |1.5795554233401214|
|41-SpectralFluxstd    |1.4135740138510324|
|19-MFCCs11m           |1.2970584381320704|
|25-ChromaVector4m     |1.2248249429638887|
|42-SpectralRolloffstd |1.2155451810145763|
|29-ChromaVector8m     |1.2145439973052508|
|55-MFCCs13std         |1.147571883316345 |
|59-ChromaVector4std   |1.1460517760222884|
|38-SpectralCentroidstd|1.127313815355592 |
|14-MFCCs6m            |0.9092171063398097|
|34-ChromaDeviationm   |0.8444129154812419|
|61-ChromaVector6std   |0.8343090036759988|
|32-ChromaVector11m    |0.746013

## Train final model


In [24]:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

classifiers = [OneVsRest()] 

# Number of top features to keep (71 keeps every feature in the dataset)
n = 71

# For Logistic regression or One vs Rest
selector = ChiSqSelector(numTopFeatures=n, featuresCol="features",
                     outputCol="selectedFeatures", labelCol="label")
bestFeaturesDf = selector.fit(test2_data).transform(test2_data)
bestFeaturesDf = bestFeaturesDf.select("label","selectedFeatures")
bestFeaturesDf = bestFeaturesDf.withColumnRenamed("selectedFeatures","features")

# Collect features
features = bestFeaturesDf.select(['features']).collect()

# Split
train,test = bestFeaturesDf.randomSplit([0.7,0.3])

# Specify folds
folds = 2

# Set up the results table
columns = ['Classifier', 'Result']
vals = [("Place Holder","N/A")]
results = spark.createDataFrame(vals, columns)

for classifier in classifiers:
    new_result = ClassTrainEval(classifier,features,classes,folds,train,test)
    results = results.union(new_result)
results = results.where("Classifier!='Place Holder'")
results.show(100,False)

 
OneVsRest
Intercept: -6.580458247144103
Top 20 Coefficients:
+--------------------+------------------+
|feature             |coeff             |
+--------------------+------------------+
|9-MFCCs1m           |4.002770960508903 |
|71-BPMessentia      |2.361225791764684 |
|5-SpectralSpreadm   |2.1711779325755125|
|18-MFCCs10m         |2.00270061605298  |
|63-ChromaVector8std |1.7988253632428668|
|3-EnergyEntropym    |1.6453831224196154|
|11-MFCCs3m          |1.630045983043709 |
|52-MFCCs10std       |1.5234453053307606|
|24-ChromaVector3m   |1.4841467112591167|
|67-ChromaVector12std|1.348375021269884 |
|43-MFCCs1std        |1.2333461521940112|
|2-Energym           |1.1864363735051946|
|58-ChromaVector3std |0.9937927049682765|
|36-Energystd        |0.913939346115755 |
|50-MFCCs8std        |0.8677429106209348|
|54-MFCCs12std       |0.852871082344563 |
|12-MFCCs4m          |0.8004373673127889|
|22-ChromaVector1m   |0.5836641228481052|
|64-ChromaVector9std |0.572544

+---------------------+------------------+
|feature              |coeff             |
+---------------------+------------------+
|11-MFCCs3m           |3.31386984998968  |
|63-ChromaVector8std  |2.8731970013437618|
|64-ChromaVector9std  |2.281112364507269 |
|40-SpectralEntropystd|2.241755694569634 |
|55-MFCCs13std        |2.1849262087638044|
|19-MFCCs11m          |1.9544263067968628|
|62-ChromaVector7std  |1.7360939677226548|
|65-ChromaVector10std |1.734621147848315 |
|17-MFCCs9m           |1.6523843763546828|
|68-ChromaDeviationstd|1.6143801426269682|
|36-Energystd         |1.5140366187292145|
|22-ChromaVector1m    |1.4767672468111124|
|32-ChromaVector11m   |1.4283850694922808|
|26-ChromaVector5m    |1.4104092122246885|
|10-MFCCs2m           |1.2961097393828782|
|12-MFCCs4m           |1.193467931116323 |
|49-MFCCs7std         |1.1821746485804263|
|18-MFCCs10m          |1.1782275209227497|
|50-MFCCs8std         |1.144852893002729 |
|43-MFCCs1std         |1.1427998758084612|
+----------

+----------------------+------------------+
|feature               |coeff             |
+----------------------+------------------+
|43-MFCCs1std          |4.629663304787441 |
|41-SpectralFluxstd    |4.368284564258287 |
|7-SpectralFluxm       |3.723725712040427 |
|29-ChromaVector8m     |2.7114629226491456|
|21-MFCCs13m           |2.6865501937062555|
|39-SpectralSpreadstd  |2.5455025208635016|
|71-BPMessentia        |2.2813598498706678|
|23-ChromaVector2m     |1.9692372108321492|
|59-ChromaVector4std   |1.766808355182687 |
|45-MFCCs3std          |1.6689417376678417|
|20-MFCCs12m           |1.4368121259571334|
|33-ChromaVector12m    |1.3052309740876942|
|38-SpectralCentroidstd|1.3051827188123588|
|25-ChromaVector4m     |1.2798909202519333|
|22-ChromaVector1m     |1.1760796984731785|
|28-ChromaVector7m     |1.1308264245583173|
|18-MFCCs10m           |1.1153300999493718|
|44-MFCCs2std          |0.9733524674901951|
|42-SpectralRolloffstd |0.962888308219006 |
|34-ChromaDeviationm   |0.951495

## Make a recommendation to a user!


In [25]:
predictions = OVR_BestModel.transform(test)

In [26]:
# Earlier output showed that BigRoom was re-indexed to label 21.0.
# Find songs from other genres that the model classified as BigRoom;
# these are candidates to recommend to a BigRoom listener.
bigroom_like = predictions.filter("label != 21.0 AND prediction == 21.0")
print(bigroom_like.count())
bigroom_like.show()
# predictions.show()

8
+-----+--------------------+--------------------+----------+
|label|            features|       rawPrediction|prediction|
+-----+--------------------+--------------------+----------+
|  6.0|[0.36117260545393...|[-2.1603912117929...|      21.0|
|  7.0|[0.27419559694221...|[-5.4492064734061...|      21.0|
| 17.0|[0.22307168460494...|[-6.6030087222751...|      21.0|
| 17.0|[0.24690950657943...|[-5.3862152222757...|      21.0|
| 17.0|[0.29653807690588...|[-4.1185608720185...|      21.0|
| 20.0|[0.31510440214688...|[-4.7468092161822...|      21.0|
| 20.0|[0.33728432608828...|[-4.3475349317382...|      21.0|
| 20.0|[0.34565765181533...|[-5.9645097496643...|      21.0|
+-----+--------------------+--------------------+----------+
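
The filter above can be read as a simple recommendation rule: songs from other genres that the model assigns to the target genre are plausibly similar to it, so they may appeal to a listener of that genre. A minimal plain-Python sketch of that rule follows; the rows and the `recommend_for_genre` name are hypothetical and stand in for the Spark `predictions` DataFrame.

```python
def recommend_for_genre(rows, target_label):
    """Songs labelled as another genre but predicted as the target genre."""
    return [r for r in rows
            if r["label"] != target_label and r["prediction"] == target_label]

# Hypothetical prediction rows; 21.0 plays the role of the BigRoom label
rows = [
    {"song": "A", "label": 21.0, "prediction": 21.0},
    {"song": "B", "label": 6.0, "prediction": 21.0},
    {"song": "C", "label": 17.0, "prediction": 17.0},
    {"song": "D", "label": 20.0, "prediction": 21.0},
]
candidates = recommend_for_genre(rows, 21.0)
print([r["song"] for r in candidates])  # ['B', 'D']
```

Song A is excluded because it already belongs to the target genre, and song C because the model does not place it there.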

