# Advanced Machine Learning (CS4662). Cal State Univ. LA, CS Dept.
### Instructor: Dr. Mohammad Porhomayoun
---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Machine Learning and Data Science in Python

#### This is a review of data science libraries/packages in Python. Feel free to refer to the suggested resources and documentation for more details.

---------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------


# Review: Scikit-Learn Library (sklearn):
Scikit-learn (sklearn) is a widely used Python machine learning library. It includes efficient implementations of various classification, regression, and clustering algorithms. It also provides many functions for data preprocessing and processing, along with a number of built-in datasets to work with.


## The Main Steps to build (train) and use (test/predict) a predictive model in sklearn:

### Step1: Importing the sklearn class (machine learning algorithm) that you would like to use for modeling:

In [1]:
# The following line imports the DecisionTreeClassifier class.
# DecisionTreeClassifier is the name of the sklearn class that performs decision tree classification.

from sklearn.tree import DecisionTreeClassifier

In [2]:
# Importing the required packages and libraries
# we will need numpy and pandas later
import numpy as np
import pandas as pd


### Step2: Set up the Feature Matrix and Label Vector:

## We use a new dataset to diagnose Parkinson's disease:


In [3]:
# Read a CSV file directly from the web and store it in a pandas DataFrame.
# "read_csv" is a pandas function that reads CSV files from a URL or a local path:

parkinson_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data')


## Source of Data:
'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)

Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering.

and
UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Parkinsons

#### Attribute Information:
Matrix column entries (attributes):
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject (one) - Parkinson's, (zero) - healthy
RPDE,D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

In [4]:
# checking the dataset by printing every 10th row:

parkinson_df[0::10]

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
10,phon_R01_S02_5,88.333,112.24,84.072,0.00505,6e-05,0.00254,0.0033,0.00763,0.02143,...,0.03237,0.01166,21.118,1,0.611137,0.776156,-5.24977,0.391002,2.407313,0.24974
20,phon_R01_S05_3,153.848,165.738,65.782,0.0084,5e-05,0.00428,0.0045,0.01285,0.0381,...,0.05,0.03871,17.536,1,0.660125,0.704087,-4.095442,0.262564,2.73971,0.365391
30,phon_R01_S07_1,197.076,206.896,192.055,0.00289,1e-05,0.00166,0.00168,0.00498,0.01098,...,0.01689,0.00339,26.775,0,0.422229,0.741367,-7.3483,0.177551,1.743867,0.085569
40,phon_R01_S08_5,186.163,197.724,177.584,0.00298,2e-05,0.00165,0.00175,0.00496,0.01495,...,0.02321,0.00231,26.822,1,0.32648,0.765623,-6.647379,0.201095,2.374073,0.130554
50,phon_R01_S13_3,124.445,135.069,117.495,0.00431,3e-05,0.00141,0.00167,0.00422,0.02184,...,0.03724,0.00479,25.135,0,0.553134,0.775933,-6.650471,0.254498,1.840198,0.103561
60,phon_R01_S17_1,209.144,237.494,109.379,0.00282,1e-05,0.00147,0.00152,0.00442,0.01861,...,0.02925,0.00871,25.554,0,0.341788,0.678874,-7.040508,0.066994,2.460791,0.101516
70,phon_R01_S18_5,142.729,162.408,65.476,0.00831,6e-05,0.00469,0.00419,0.01407,0.03485,...,0.05605,0.02599,20.264,1,0.489345,0.730387,-5.720868,0.15883,2.277927,0.180828
80,phon_R01_S20_3,96.106,108.664,84.51,0.00694,7e-05,0.00389,0.00415,0.01168,0.04024,...,0.06799,0.01823,19.055,1,0.544805,0.770466,-4.441519,0.155097,2.645959,0.327978
90,phon_R01_S21_7,166.605,206.008,78.032,0.00742,4e-05,0.00387,0.00453,0.01161,0.0664,...,0.10949,0.08725,11.744,1,0.65341,0.733165,-4.508984,0.389232,3.317586,0.301952


In [5]:
print(parkinson_df.columns)

Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')


In [6]:
# Creating the Feature Matrix for the dataset:

# create a Python list of the feature names that we would like to pick from the dataset
# (every column except 'name' and the label column 'status'):

feature_cols = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE']


# use the above list to select the features from the original DataFrame
X = parkinson_df[feature_cols]

X

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,0.02971,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,0.04368,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,0.03590,0.08270,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,0.03772,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.04465,0.10470,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,0.02745,0.07008,0.02764,19.517,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,0.01879,0.04812,0.01810,19.147,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,0.01667,0.03804,0.10715,17.883,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,0.01588,0.03794,0.07223,19.020,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


In [7]:
# select the label column ('status') from the DataFrame as a pandas Series
y = parkinson_df['status']
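
An equivalent way to build the feature matrix is to drop the non-feature columns instead of listing every feature by hand. The sketch below illustrates the idea on a tiny made-up DataFrame; the name `toy_df`, its column names `feat_a`/`feat_b`, and all values are hypothetical and are not taken from the Parkinson's data.

```python
import pandas as pd

# Hypothetical toy data with the same layout idea: an ID column,
# two feature columns, and a label column called 'status'.
toy_df = pd.DataFrame({
    'name':   ['s1', 's2', 's3', 's4'],
    'feat_a': [0.1, 0.4, 0.3, 0.9],
    'feat_b': [10.0, 12.5, 11.0, 9.5],
    'status': [1, 0, 1, 0],
})

# Feature matrix: everything except the ID column and the label column.
X_toy = toy_df.drop(columns=['name', 'status'])

# Label vector: the 'status' column as a pandas Series.
y_toy = toy_df['status']

print(X_toy.shape)  # (4, 2)
print(y_toy.shape)  # (4,)
```

Dropping columns is convenient when almost every column is a feature, but an explicit list makes the chosen features easier to audit.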



### Step3: Defining (instantiating) an "object" from the sklearn class:

In [10]:
# In the following line, "my_decisiontree" is instantiated as an "object" of DecisionTreeClassifier "class". 

my_decisiontree = DecisionTreeClassifier(random_state=1)


### Step4: Training Stage: Training a predictive model using the training dataset:
#### The training stage is called "fitting" in sklearn
#### Method "fit" is used for many sklearn classes

In [11]:
# We can use the method "fit" of the "object my_decisiontree" along with training dataset and labels to train the model.

my_decisiontree.fit(X, y)

DecisionTreeClassifier(random_state=1)
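
The cell above fits the tree on every available row. One caveat worth keeping in mind: a decision tree with no depth limit can usually memorise its training data, so accuracy measured on the same rows it was fitted on says little about new observations. The sketch below shows this on synthetic stand-in data generated with `make_classification` (an assumption for illustration; it is not the Parkinson's dataset).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (an assumption; not the Parkinson's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same kind of unconstrained tree as in the notebook.
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_demo, y_demo)

# Predict on the very same rows the tree was trained on.
train_accuracy = accuracy_score(y_demo, tree.predict(X_demo))
print(train_accuracy)
```

A perfect or near-perfect training accuracy here reflects memorisation rather than predictive quality, which is why evaluation needs data that the model has not seen.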

### Step5: Testing (Prediction) Stage: Making prediction on new observations (Testing Data) using the trained model:
##### Now, suppose that we have a new observation (a new data sample) with known features and an unknown label. What would be our prediction for the label of this new observation?
#### Method "predict" is used for many sklearn classes

In [12]:
# We can use the method "predict" of the *trained* object my_decisiontree on one or more testing data samples to perform prediction:

X_Testing1 = [161,197,75,0.00602,0.00003,0.00290,0.00253,0.00941,0.01791,0.16800,0.00793,0.01057,0.01799,0.02380,0.01170,25.67800,0.427785,0.723797,-6.635729,0.209866,1.957961,0.135242]
X_Testing2 = [196,208,194,0.00189,0.00001,0.0015,0.00168,0.00298,0.01098,0.09700,0.00563,0.00680,0.00802,0.01689,0.00339,26.77500,0.422229,0.741367,-7.348300,0.177551,1.743867,0.085569]
X_Testing = [X_Testing1,X_Testing2]

y_predict = my_decisiontree.predict(X_Testing)

print(y_predict)

[1 0]
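
Recent scikit-learn versions may warn about missing feature names when a model fitted on a DataFrame is asked to predict on a plain Python list. A minimal sketch of one way to avoid that, assuming a small made-up training set (`X_toy`, `y_toy`, and the column names are hypothetical): wrap the new observations in a DataFrame that has the same columns in the same order.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data with named feature columns.
X_toy = pd.DataFrame({'feat_a': [0.1, 0.2, 0.8, 0.9],
                      'feat_b': [1.0, 1.1, 3.0, 3.2]})
y_toy = pd.Series([0, 0, 1, 1])

toy_tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

# New observations wrapped in a DataFrame with the same columns, in the same order.
new_rows = [[0.15, 1.05], [0.85, 3.10]]
X_new = pd.DataFrame(new_rows, columns=X_toy.columns)

pred = toy_tree.predict(X_new)
print(pred)  # [0 1]
```

The predictions are the same as with a plain list; the DataFrame only makes the mapping between values and feature names explicit.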


# Evaluating the accuracy of our classifier:

#### 1- Let's split the dataset RANDOMLY into two new datasets: Training Set (e.g. 70% of the dataset) and Testing Set (30% of the dataset).
#### 2- Let's pretend that we do NOT know the label of the Testing Set!
#### 3- Let's Train the model on only Training Set, and then Predict on the Testing Set!
#### 4- After prediction, we can compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

#### We will learn more about model and accuracy evaluation in future tutorials!

In [13]:
# Randomly splitting the original dataset into training set and testing set
# The function "train_test_split" from "sklearn.model_selection" performs random splitting.
# "test_size=0.3" means that 30% of the data samples are picked for the testing set, and the rest (70%) for the training set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [14]:
# print the size of the training set:
print(X_train.shape)
print(y_train.shape)


(136, 22)
(136,)


In [15]:
# print the size of the testing set:
print(X_test.shape)
print(y_test.shape)


(59, 22)
(59,)
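
The shapes printed above (136 training rows and 59 testing rows) follow from a 70/30 split of 195 rows. Below is a minimal sketch of such a split done by hand with NumPy, assuming the test size is rounded up. It is an illustration only: the seed is arbitrary, and it will not select the same rows as `train_test_split`.

```python
import math
import numpy as np

n_samples = 195          # total rows: 136 training + 59 testing
test_fraction = 0.3

# Shuffle the row positions reproducibly (the seed value is arbitrary).
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

# Round the test size up; the remaining rows form the training set.
n_test = math.ceil(test_fraction * n_samples)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))  # 136 59
```

Because the two index arrays come from one permutation, no row can end up in both the training set and the testing set.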


In [16]:
print(X_test)
print('\n')
print(y_test)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
83        98.804       102.305        87.804         0.00432   
12       136.926       159.866       131.276         0.00293   
33       202.266       211.604       197.079         0.00180   
113      210.141       232.706       185.258         0.00534   
171      112.547       133.374       105.715         0.00355   
134      106.516       112.777        93.105         0.00589   
163      112.150       131.669        97.527         0.00519   
124      156.239       195.107        79.820         0.00694   
74       110.793       128.101       107.316         0.00494   
18       153.046       175.829        68.623         0.00742   
7        107.332       113.840       104.315         0.00290   
5        120.552       131.162       113.787         0.00968   
125      145.174       198.109        80.637         0.00733   
161      115.322       135.738       107.802         0.00619   
170      244.990       272.210       239

### Training ONLY on the training set:

In [17]:
# Training ONLY on the training set:

my_decisiontree.fit(X_train, y_train)


DecisionTreeClassifier(random_state=1)

### Testing on the testing set:

In [18]:
# Testing on the testing set:

y_predict = my_decisiontree.predict(X_test)

print(y_predict)

[1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1]


# Accuracy Evaluation:
#### After prediction, we can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy of our Decision Tree Classifier!

In [19]:
# Function "accuracy_score" from "sklearn.metrics" performs an element-by-element comparison and returns the
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

# Example:
y_pred    = [0, 2, 1, 1]
y_actual  = [0, 1, 2, 1]

score = accuracy_score(y_actual, y_pred)

print(score)

0.5
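
The same number can be computed by hand, which makes clear what accuracy measures here. This is a minimal sketch with NumPy using the example labels above; it illustrates the definition and is not how `accuracy_score` is implemented internally.

```python
import numpy as np

y_pred   = np.array([0, 2, 1, 1])
y_actual = np.array([0, 1, 2, 1])

# Element-wise comparison gives True where the prediction is correct.
correct = (y_pred == y_actual)

# The mean of a boolean array is the fraction of True values.
manual_score = correct.mean()
print(manual_score)  # 0.5
```

Two of the four predictions match (the first and the last), so the accuracy is 2/4 = 0.5.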


In [20]:
# We can now compare the "predicted labels" for the Testing Set with its "actual labels" to evaluate the accuracy.
# Function "accuracy_score" from "sklearn.metrics" performs the element-by-element comparison and returns the
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, y_predict)

print(score)

0.9152542372881356
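
To make this score concrete, it can be converted back into counts. The short sketch below uses only the two numbers reported in this notebook, the testing set size (59) and the accuracy above.

```python
n_test = 59                    # size of the testing set
score = 0.9152542372881356     # accuracy reported above

# Accuracy is (correct predictions) / (all predictions), so invert it.
n_correct = round(score * n_test)
n_wrong = n_test - n_correct

print(n_correct, n_wrong)  # 54 5
```

In other words, the tree labelled 54 of the 59 held-out samples correctly and misclassified 5.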


### checking the results:

In [21]:
results = pd.DataFrame()

results['actual'] = y_test 
results['prediction'] = y_predict 
print(results,'\n\n\n')

print(results[results['actual']!=results['prediction']])

     actual  prediction
83        1           1
12        1           1
33        0           0
113       1           1
171       0           0
134       1           1
163       1           0
124       1           1
74        1           1
18        1           1
7         1           0
5         1           1
125       1           1
161       1           1
170       0           0
181       1           1
123       1           1
60        0           0
44        0           0
141       1           0
56        1           1
173       0           0
136       1           1
89        1           1
63        0           0
55        1           1
110       1           1
166       0           0
175       0           0
45        0           0
22        1           1
155       1           1
66        1           1
37        1           1
4         1           1
80        1           1
178       1           1
106       1           1
160       1           1
26        1           1
139       1     
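
Beyond listing the misclassified rows, a cross-tabulation shows which kind of mistake was made. Here is a minimal sketch with made-up labels (the `toy_results` values are hypothetical and are not the notebook's results), using `pandas.crosstab`:

```python
import pandas as pd

# Hypothetical labels for illustration only.
toy_results = pd.DataFrame({
    'actual':     [1, 1, 0, 1, 0, 1, 0, 1],
    'prediction': [1, 0, 0, 1, 0, 1, 1, 1],
})

# Rows: actual class, columns: predicted class, cells: number of samples.
table = pd.crosstab(toy_results['actual'], toy_results['prediction'])
print(table)
```

The diagonal cells count correct predictions, and the off-diagonal cells separate the two error types: a case with the disease predicted as healthy, and the reverse.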

In [22]:
# How about using only two features rather than all 22 for classification?
# Try this:
# feature_cols = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)']


# Cross-Validation

## Three main steps for K-fold cross-validation
1. Split the dataset Randomly into K equal, non-overlapping sections.
2. Use one of the sections as **testing set** at a time and the union of the other (K-1) sections as the **training set**. Perform training stage, testing stage, and compute the accuracy based on the split each time. Repeat this procedure K times, so that each one of the K sections is used as **testing set** one time, and as a part of **training set** (K-1) times.
3. Calculate the average of the K accuracies as the final result.

Note: Using K=10 (10-fold cross-validation) is very common and recommended in machine learning.
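
The steps above can be sketched by hand with NumPy. This is an illustration of the procedure only (the seed is arbitrary and no model is trained). Note also that `cross_val_score`, when given an integer `cv` and a classifier, uses stratified folds and does not shuffle by default, so its folds will differ from the ones built here.

```python
import numpy as np

n_samples = 195   # number of rows in the dataset
K = 10

# Step 1: shuffle the row positions and cut them into K nearly equal folds.
rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility
folds = np.array_split(rng.permutation(n_samples), K)

# Step 2: each fold serves as the testing set exactly once.
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # A model would be trained on train_idx and scored on test_idx here.
    print(k, len(train_idx), len(test_idx))

fold_sizes = [len(f) for f in folds]
print(fold_sizes)  # [20, 20, 20, 20, 20, 19, 19, 19, 19, 19]
```

Since 195 is not divisible by 10, the folds are only approximately equal: five folds hold 20 rows and five hold 19.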

In [23]:
# importing the function:
from sklearn.model_selection import cross_val_score

### Applying 10-fold Cross Validation for "Decision Tree" classifier:


In [31]:
# Applying 10-fold cross validation with DT classifier:

my_decisiontree = DecisionTreeClassifier(random_state=1)


# CV:
accuracy_list = cross_val_score(my_decisiontree, X, y, cv=10, scoring='accuracy')

print('\n\n','accuracy: ',accuracy_list)



 accuracy:  [0.95       0.85       0.8        0.9        0.95       0.78947368
 0.68421053 0.52631579 0.84210526 0.78947368]


#### Each element in "accuracy_list" above is the accuracy value in one of the K rounds of cross validation. We will use the average of them as the final accuracy for our model.

#### As we saw, the function "cross_val_score" takes care of everything, including splitting the data, forming Training and Testing sets (K times), Training and Testing the model (K times), and evaluating and reporting the accuracy for each round!

#### Now, we only need to calculate the average of the accuracies from K rounds!

In [32]:
# use average of accuracy values as final result
accuracy_cv = accuracy_list.mean()

print(accuracy_cv)

0.8081578947368421
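
The average hides how much the accuracy varies from fold to fold (from about 0.53 to 0.95 in the output above). A minimal sketch that reports the spread alongside the mean, using the per-fold values copied from that output:

```python
import numpy as np

# Per-fold accuracies copied from the output above (rounded to 8 decimals).
accuracy_list = np.array([0.95, 0.85, 0.8, 0.9, 0.95, 0.78947368,
                          0.68421053, 0.52631579, 0.84210526, 0.78947368])

mean_acc = accuracy_list.mean()
std_acc = accuracy_list.std()    # spread across the 10 folds
min_acc = accuracy_list.min()
max_acc = accuracy_list.max()

print(f"mean={mean_acc:.3f}  std={std_acc:.3f}  min={min_acc:.3f}  max={max_acc:.3f}")
```

Reporting the standard deviation or the range next to the mean gives a more honest picture when individual folds are small, as they are here with 19 or 20 samples each.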
