# Lab 3 - Capstone: Clustering

Team: Frank Sclafani, Jan Shook, and Leticia Valadez

Our team spent considerable time selecting a dataset for the laboratories leading up to, and including, this capstone project, and we have learned a great deal about the dataset and its attributes. Given clustering, association rule mining, or collaborative filtering as options to round out our final lab, clustering is the approach best suited to this dataset. This report follows the CRISP-DM (Cross-industry Standard Process for Data Mining) approach, which is described on Wikipedia (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).

Because our final project focuses on clustering, the new work associated with clustering starts here: <a href="#Section 4.3: Clustering">Section 4.3: Clustering</a>.

## TV News Channel Commercial Detection

Our team selected this dataset for two reasons: 1) it has a large number of instances (129,685, which exceeds the requirement of at least 30,000) and enough attributes (14, which exceeds the requirement of at least 10), and 2) it looks like an interesting dataset (detecting commercials). Initial questions of interest are: How can commercials be detected from this data? Can a model be trained to detect and skip (or remove) commercials? If so, would this solution be robust enough for commercial products like TiVo?

This dataset is from the UCI Machine Learning website (https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset). It consists of popular audio-visual features of video shots extracted from 150 hours of TV news broadcast of 3 Indian and 2 international news channels (30 Hours each). In the readme accompanying the data, the authors describe the potential benefits of this data as follows:

> Automatic identification of commercial blocks in news videos finds a lot of applications in the domain of television broadcast analysis and monitoring. Commercials occupy almost 40-60% of total air time. Manual segmentation of commercials from thousands of TV news channels is time consuming, and economically infeasible hence prompts the need for machine learning based Method. Classifying TV News commercials is a semantic video classification problem. TV News commercials on particular news channel are combinations of video shots uniquely characterized by audio-visual presentation. Hence various audio visual features extracted from video shots are widely used for TV commercial classification. Indian News channels do not follow any particular news presentation format, have large variability and dynamic nature presenting a challenging machine learning problem. Features from 150 Hours of broadcast news videos from 5 different (3 Indian and 2 International News channels) news channels. Viz. CNNIBN, NDTV 24X7, TIMESNOW, BBC and CNN are presented in this dataset. Videos are recorded at resolution of 720 X 576 at 25 fps using a DVR and set top box. 3 Indian channels are recorded concurrently while 2 International are recorded together. Feature file preserves the order of occurrence of shots.

### Objective: Classify Video Attributes as Commercial or Non-commercial

This dataset has already been classified as commercial (+1) or non-commercial (-1) in the Dimension Index attribute. Hence, in subsequent analysis, we will be able to train our models and compare their output against this existing target variable to determine the effectiveness of each model.

### Techniques Applied in this Project

#### Data Preparation

> The data, persisted as sparse matrix arrays in the SVM Light format, was loaded into a Pandas dataframe

> The X and Y axes from the SVM Light files were combined into a two-dimensional Pandas dataframe

> Columns that have little merit to the initial analysis were deleted

> Pandas columns with empty values (i.e., all zeroes) were deleted

> Different types of row and / or columns were separated into different dataframes to analyze the data differently

#### Data Visualization

> The Hexagon Bin Plot was used to visualize the complete dataset, and it appears a linear correlation exists among attributes

> Individual scatter plots were created for each attribute (non-bin related)

## About this Notebook

This Jupyter (v4.3.0) notebook was developed on Windows 10 Pro (64 bit) using Anaconda v4.4.7 and Python v3.6.3.

Packages associated with Anaconda were installed as follows:

> conda install -c anaconda pandas

> conda install -c anaconda numpy 

## Table of Contents

* <a href="#Section 1: Data Understanding">Section 1: Data Understanding</a>  
> <a href="#Section 1.1: About this Dataset (Summary)">Section 1.1: About this Dataset (Summary)</a>  
> <a href="#Section 1.2: Description of the Attributes">Section 1.2: Description of the Attributes</a>  
> <a href="#Section 1.3: Potentially Useful Attributes">Section 1.3: Potentially Useful Attributes</a>  
> <a href="#Section 1.4: Columns and Data Types">Section 1.4: Columns and Data Types</a>  

* <a href="#Section 2: Data Preparation">Section 2: Data Preparation</a>  
> <a href="#Section 2.1: Download Files">Section 2.1: Download Files</a>  
> <a href="#Section 2.2: Pivot the Y-axis">Section 2.2: Pivot the Y-axis</a>  
> <a href="#Section 2.3: Convert Sparse Matrix Array to an Array">Section 2.3: Convert Sparse Matrix Array to an Array</a>  
> <a href="#Section 2.4: Concatenate the Y-axis before the X-axis">Section 2.4: Concatenate the Y-axis before the X-axis</a>  
> <a href="#Section 2.5: Convert the Arrays into Pandas Dataframes">Section 2.5: Convert the Arrays into Pandas Dataframes</a>  
> <a href="#Section 2.5.1: Convert the First Set of Dataframes (no BoWs)">Section 2.5.1: Convert the First Set of Dataframes (no BoWs)</a>  
> <a href="#Section 2.5.2: Convert the Second Set of Dataframes (BoWs)">Section 2.5.2: Convert the Second Set of Dataframes (BoWs)</a>  
> <a href="#Section 2.6: Rename Columns from Integers to Labels">Section 2.6: Rename Columns from Integers to Labels</a>  
> <a href="#Section 2.6.1: Rename the First Set of Dataframes (no BoWs)">Section 2.6.1: Rename the First Set of Dataframes (no BoWs)</a>  
> <a href="#Section 2.6.2: Rename the Second Set of Dataframes (BoWs)">Section 2.6.2: Rename the Second Set of Dataframes (BoWs)</a>  
> <a href="#Section 2.7: Inspect Missing Values">Section 2.7: Inspect Missing Values</a>  
> <a href="#Section 2.7.1: Display Table of Missing Values (no BoWs)">Section 2.7.1: Display Table of Missing Values (no BoWs)</a>  
> <a href="#Section 2.7.2: View Zero Values via a 40% Threshold (no BoWs)">Section 2.7.2: View Zero Values via a 40% Threshold (no BoWs)</a>  
> <a href="#Section 2.7.3: Display Table of Zero Values (BoWs)">Section 2.7.3: Display Table of Zero Values (BoWs)</a>  
> <a href="#Section 2.8: Concatenate the Five Pandas Dataframes">Section 2.8: Concatenate the Five Pandas Dataframes</a>  
> <a href="#Section 2.8.1: Concatenate the First Set of Dataframes (no BoWs)">Section 2.8.1: Concatenate the First Set of Dataframes (no BoWs)</a>  
> <a href="#Section 2.8.2: Concatenate the Second Set of Dataframes (BoWs)">Section 2.8.2: Concatenate the Second Set of Dataframes (BoWs)</a>  
> <a href="#Section 2.8.3: Drop Columns with All Zeroes (BoWs)">Section 2.8.3: Drop Columns with All Zeroes (BoWs)</a>

* <a href="#Section 3: Visualizing the Data">Section 3: Visualizing the Data</a>  
> <a href="#Section 3.1: Attributes: Box Plots">Section 3.1: Attributes: Box Plots</a>  
> <a href="#Section 3.2: Attributes: Hexbin Plots">Section 3.2: Attributes: Hexbin Plots</a>  
> <a href="#Section 3.3: Principal Component Analysis (PCA)">Section 3.3: Principal Component Analysis (PCA)</a>  
> <a href="#Section 3.4: The Final Datasets">Section 3.4: The Final Datasets</a>  

* <a href="#Section 4: Modeling">Section 4: Modeling</a>  
> <a href="#Section 4.1: Baseline Cross-validation and Classification">Section 4.1: Baseline Cross-validation and Classification</a>  
> <a href="#Section 4.1.1: Choosing a Cross-validation Approach">Section 4.1.1: Choosing a Cross-validation Approach</a>  
> <a href="#Section 4.1.2: Baseline using a Random Forest">Section 4.1.2: Baseline using a Random Forest</a>  
> <a href="#Section 4.1.3: Baseline using Logistic Regression">Section 4.1.3: Baseline using Logistic Regression</a>  
> <a href="#Section 4.1.4: Cross-validation Selection">Section 4.1.4: Cross-validation Selection</a>  
> <a href="#Section 4.2: Classification">Section 4.2: Classification</a>  
> <a href="#Section 4.2.1: K-Nearest Neighbors (KNN)">Section 4.2.1: K-Nearest Neighbors (KNN)</a>  
> <a href="#Section 4.2.2: Multinomial Naive Bayes (MNB)">Section 4.2.2: Multinomial Naive Bayes (MNB)</a>  
> <a href="#Section 4.2.2.1: Multinomial Naive Bayes (no BoWs)">Section 4.2.2.1: Multinomial Naive Bayes (no BoWs)</a>  
> <a href="#Section 4.2.2.2: Multinomial Naive Bayes (BoWs)">Section 4.2.2.2: Multinomial Naive Bayes (BoWs)</a>  
> <a href="#Section 4.2.3: Logistical Regression">Section 4.2.3: Logistical Regression</a>  
> <a href="#Section 4.2.3.1: Attributes Weights">Section 4.2.3.1: Attributes Weights</a>  
> <a href="#Section 4.2.3.2: Attribute Weights (with scaling)">Section 4.2.3.2: Attribute Weights (with scaling)</a>  
> <a href="#Section 4.2.3.3: Plot the Weights (with scaling)">Section 4.2.3.3: Plot the Weights (with scaling)</a>  
> <a href="#Section 4.2.3.4: Interpreting the Weights">Section 4.2.3.4: Interpreting the Weights</a>  
> <a href="#Section 4.2.3.5: Adjusting Parameters (the C value)">Section 4.2.3.5: Adjusting Parameters (the C value)</a>  
> <a href="#Section 4.2.3.6: Adjusting Parameters (the solvers)">Section 4.2.3.6: Adjusting Parameters (the solvers)</a>  
> <a href="#Section 4.2.4: Neural Network">Section 4.2.4: Neural Network</a>  
> <a href="#Section 4.3: Clustering">Section 4.3: Clustering</a>  
> <a href="#Section 4.3.1: Baseline (Random Forest)">Section 4.3.1: Baseline (Random Forest)</a>  
> <a href="#Section 4.3.2: DBSCAN Clustering">Section 4.3.2: DBSCAN Clustering</a>  
> <a href="#Section 4.3.2.1: DBSCAN Model">Section 4.3.2.1: DBSCAN Model</a>  
> <a href="#Section 4.3.2.2: DBSCAN Visualization">Section 4.3.2.2: DBSCAN Visualization</a>  
> <a href="#Section 4.3.3: K-Means Clustering">Section 4.3.3: K-Means Clustering</a>  
> <a href="#Section 4.3.3.1: K-Means Model">Section 4.3.3.1: K-Means Model</a>  
> <a href="#Section 4.3.3.2: K-Means Visualization">Section 4.3.3.2: K-Means Visualization</a>  

* <a href="#Section 5: Evaluation">Section 5: Evaluation</a>  
> <a href="#Section 5.1: Classification">Section 5.1: Classification</a>  
> <a href="#Section 5.2: Clustering">Section 5.2: Clustering</a>  

* <a href="#Section 6: Deployment">Section 6: Deployment</a>  
> <a href="#Section 6.1: Value Proposition">Section 6.1: Value Proposition</a>  
> <a href="#Section 6.2: Potential Usefulness">Section 6.2: Potential Usefulness</a>  

In [None]:
%%time

# Runtime Expectation: The following cell runs about 30 seconds on the first execution of this notebook, and a second or two after that.

import pandas as pd
import numpy as np

pd.show_versions()

import warnings
warnings.filterwarnings('ignore')

<a id="Section 1: Data Understanding"></a>

# Section 1: Data Understanding

<a id="Section 1.1: About this Dataset (Summary)"></a>

## Section 1.1: About this Dataset (Summary)

This project comprises five datasets (bbc.txt, cnn.txt, cnnibn.txt, ndtv.txt, and timesnow.txt), all found on the UCI Machine Learning website at https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset. Combined, these five datasets have 129,685 instances (rows) and 14 attributes. As shown in the example record below, most of these attributes have multiple data points (often hundreds), and almost all of these values are floating point.

> 1  1:123 2:1.316440 3:1.516003 4:5.605905 5:5.346760 6:0.013233 7:0.010729 8:0.091743 9:0.050768 10:3808.067871 11:702.992493 12:7533.133301 13:1390.499268 14:971.098511 15:1894.978027 16:114.965019 17:45.018257 18:0.635224 19:0.095226 20:0.063398 21:0.061210 22:0.038319 23:0.018285 24:0.011113 25:0.007736 26:0.004864 27:0.004220 28:0.003273 29:0.002699 30:0.002553 31:0.002323 32:0.002108 33:0.002036 34:0.001792 35:0.001553 36:0.001250 37:0.001317 38:0.001084 39:0.000818 40:0.000624 41:0.000586 42:0.000529 43:0.000426 44:0.000359 45:0.000446 46:0.000268 47:0.000221 48:0.000154 49:0.000217 50:0.000193 51:0.000163 52:0.000165 53:0.000210 54:0.000114 55:0.000130 56:0.000055 57:0.000013 58:0.733037 59:0.133122 60:0.041263 61:0.019699 62:0.010962 63:0.006927 64:0.004525 65:0.003128 66:0.002314 67:0.001762 68:0.001361 69:0.001065 70:0.000914 71:0.000777 72:0.000667 73:0.000565 74:0.000520 75:0.000467 76:0.000469 77:0.000486 78:0.000417 79:0.000427 80:0.000349 81:0.000258 82:0.000262 83:0.000344 84:0.000168 85:0.000163 86:0.001058 90:0.020584 91:0.185038 92:0.148316 93:0.047098 94:0.169797 95:0.061318 96:0.002200 97:0.010440 98:0.004463 100:0.010558 101:0.002067 102:0.338970 103:0.470364 104:0.189997 105:0.018296 106:0.126517 107:0.047620 108:0.045863 109:0.184865 110:0.095976 111:0.015295 112:0.056323 113:0.024587 115:0.037647 116:0.006015 117:0.160327 118:0.251688 119:0.176144 123:0.006356 219:0.002119 276:0.002119 296:0.341102 448:0.099576 491:0.069915 572:0.141949 573:0.103814 601:0.002119 623:0.050847 726:0.038136 762:0.036017 816:0.036017 871:0.016949 924:0.008475 959:0.036017 1002:0.006356 1016:0.008475 1048:0.002119 4124:0.422333825949 4125:0.663917631952

All five datasets are formatted in the svmlight / libsvm format. This is a text-based format with one sample per line. It is a light format, meaning it does not store zero-valued features; every feature that is "missing" has a value of zero. The first element of each line stores the target variable, which in this case is the Dimension Index described in the attributes below.

Hence, the file simply contains more records like the one shown above. While there are only 14 attributes in each dataset, most attributes can have more than one column of data. 
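To make the index:value layout concrete, the following is a minimal sketch of how one line could be expanded into a dense row. The helper `parse_svmlight_line` and the shortened sample line are hypothetical illustrations, and the one-based indexing is an assumption; the notebook itself relies on scikit-learn's `load_svmlight_file` rather than this code.

```python
import numpy as np

def parse_svmlight_line(line, n_features):
    """Expand one svmlight-style line into (label, dense row).

    Hypothetical helper for illustration only. Indices in the file are
    assumed to be one-based, so index 1 maps to position 0.
    """
    tokens = line.split()
    label = float(tokens[0])            # first element is the target variable
    row = np.zeros(n_features)          # features not listed stay at zero
    for token in tokens[1:]:
        index, value = token.split(':')
        row[int(index) - 1] = float(value)
    return label, row

# A shortened, made-up line: features 3 and 5 are omitted, so they are zero.
label, row = parse_svmlight_line('1 1:123 2:1.316440 4:5.605905 6:0.013233', n_features=6)
print(label, row)
```

Features that never appear on a line (3 and 5 here) come back as zeros, which is why the dense form of this dataset is so much larger than the files on disk.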

<a id="Section 1.2: Description of the Attributes"></a>

## Section 1.2: Description of the Attributes

The following sections describe this dataset using the Readme.txt file, examination of the data, and definition of the terms.

### Dimension Index (Dependent Variable)

This is the dependent variable of Commercial (+1) or Non-Commercial (-1) (i.e., the classification).

### Shot Length

Commercial video shots are usually short in length, with fast visual transitions and peculiar placement of overlaid text bands. Video Shot Length is used directly as one of the features.

### Short time energy

Short term energy (STE) can be used for voiced, unvoiced and silence classification of speech. The relation for finding the short term energy can be derived from the total energy relation defined in signal processing. STE is defined as sum of squares of samples in an audio frame. To attract user’s attention commercials generally have higher audio amplitude leading to higher STE.
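As a small illustration of the definition above (sum of squares of the samples in a frame), the sketch below computes STE over non-overlapping frames. The function name, the frame length, and the toy signal are assumptions made for this example; the dataset authors' actual framing parameters are not given in this section.

```python
import numpy as np

def short_time_energy(signal, frame_length):
    """Sum of squared samples in each non-overlapping frame (illustrative)."""
    n_frames = len(signal) // frame_length
    samples = np.asarray(signal[:n_frames * frame_length], dtype=float)
    frames = samples.reshape(n_frames, frame_length)
    return np.sum(frames ** 2, axis=1)

quiet = [0.1, -0.1, 0.1, -0.1]   # low-amplitude frame
loud = [0.5, -0.5, 0.5, -0.5]    # higher-amplitude frame, as expected in commercials
ste = short_time_energy(quiet + loud, frame_length=4)
print(ste)
```

The louder frame yields the larger energy, which is the property the feature relies on.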

### ZCR

Zero Crossing Rate (ZCR) is the rate of sign changes along a signal. It is used in both speech recognition and music information retrieval as a feature for classifying sounds. That is precisely its use in this dataset: it is one of the attributes that help differentiate commercials from the news program. The Zero Crossing Rate measures how rapidly an audio signal changes. ZCR varies significantly for non-pure speech (High ZCR), music (Moderate ZCR) and speech (Low ZCR). Commercials usually have background music along with speech, hence the use of ZCR as a feature. Audio signals associated with commercials generally have high music content and a faster rate of signal change compared to that of non-commercials.
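The idea of counting sign changes can be sketched in a few lines. This is a minimal illustration under the assumption that ZCR is expressed as the fraction of adjacent sample pairs whose signs differ; the function name and the toy frames are hypothetical.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (illustrative)."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1                      # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

slow = [1, 1, 1, -1, -1, -1, 1, 1]     # sign changes twice in 7 pairs
fast = [1, -1, 1, -1, 1, -1, 1, -1]    # sign changes at every step

zcr_slow = zero_crossing_rate(slow)
zcr_fast = zero_crossing_rate(fast)
print(zcr_slow, zcr_fast)
```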

### Spectral Centroid

Spectral Centroid is a measure of the “center of gravity” using the Fourier transform's frequency and magnitude information. It is commonly used in digital signal processing to help characterize a spectrum. In this dataset, a higher Spectral Centroid signifies higher frequencies (music).
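One common way to compute a spectral "center of gravity" is the magnitude-weighted mean of the FFT bin frequencies. The sketch below follows that convention as an assumption; the function name, sample rate, and test tones are hypothetical and are not taken from the dataset's readme.

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame (illustrative)."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

sample_rate = 8000                       # assumed sample rate in Hz
t = np.arange(1024) / sample_rate
low = np.sin(2 * np.pi * 250 * t)        # low-frequency tone
high = np.sin(2 * np.pi * 2000 * t)      # high-frequency tone

centroid_low = spectral_centroid(low, sample_rate)
centroid_high = spectral_centroid(high, sample_rate)
print(centroid_low, centroid_high)
```

The higher-pitched tone produces the higher centroid, matching the interpretation given above.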

### Spectral Roll off

Spectral Roll off Point is a measure of the amount of the right-skewedness of the power spectrum. This feature discriminates between speech, music and non-pure speech.

### Spectral Flux

Spectral flux is a measure of how quickly the power spectrum of a signal changes. It is calculated by comparing the power spectrum for one frame against the power spectrum from the previous frame.
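The frame-to-frame comparison can be sketched as follows. The text does not say which distance is used, so the sum of squared differences between unnormalized power spectra is an assumption here, as are the function name and the test tones.

```python
import numpy as np

def spectral_flux(prev_frame, frame):
    """Sum of squared differences between two frames' power spectra (illustrative)."""
    prev_power = np.abs(np.fft.rfft(prev_frame)) ** 2
    power = np.abs(np.fft.rfft(frame)) ** 2
    return np.sum((power - prev_power) ** 2)

t = np.arange(256) / 8000                # assumed 8 kHz sample rate
tone_a = np.sin(2 * np.pi * 500 * t)
tone_b = np.sin(2 * np.pi * 1500 * t)

flux_same = spectral_flux(tone_a, tone_a)      # spectrum unchanged
flux_change = spectral_flux(tone_a, tone_b)    # spectrum shifts between frames
print(flux_same, flux_change)
```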

### Fundamental Frequency

The fundamental frequency is the lowest frequency of a waveform. In music, the fundamental is the musical pitch of a note that is perceived as the lowest partial present. This feature is used because non-commercials (dominated by pure speech) will produce lower fundamental frequencies compared to those of commercials (dominated by music).

### Motion Distribution

Motion Distribution is obtained by first computing dense optical flow (Horn-Schunck formulation), followed by construction of a distribution of flow magnitudes over the entire shot with 40 uniformly divided bins in the range [0, 40]. Motion Distribution is a significant feature, as many previous works have indicated that commercial shots mostly have high motion content because they try to convey maximum information in the minimum possible time.

### Frame Difference Distribution

The Frame Difference Distribution is the measure of the difference between the current frame and a reference frame, often called "background image", or "background model". This will assist in measuring the perceived speed at which the frames appear to differentiate. Sudden changes in pixel intensities are grasped by Frame Difference Distribution. Such changes are not registered by optical flow. Thus, Frame Difference Distribution is also computed along with flow magnitude distributions. The researchers obtain the frame difference by averaging absolute frame difference in each of 3 color channels and the distribution is constructed with 32 bins in the range of [0, 255].
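A rough sketch of the procedure described above is shown below: absolute differences are averaged over the three color channels and binned into 32 bins over [0, 255]. Using consecutive frames as the reference, normalizing the histogram, the function name, and the random toy frames are all assumptions for illustration.

```python
import numpy as np

def frame_difference_histogram(frames, n_bins=32):
    """Normalized histogram of per-pixel frame differences (illustrative)."""
    frames = np.asarray(frames, dtype=float)      # shape: (n_frames, height, width, 3)
    abs_diff = np.abs(np.diff(frames, axis=0))    # differences between consecutive frames
    per_pixel = abs_diff.mean(axis=-1)            # average over the 3 color channels
    counts, _ = np.histogram(per_pixel, bins=n_bins, range=(0, 255))
    return counts / counts.sum()

rng = np.random.default_rng(0)
toy_frames = rng.integers(0, 256, size=(5, 4, 4, 3))   # 5 tiny random RGB frames
hist = frame_difference_histogram(toy_frames)
print(hist.shape)
```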

### Text area distribution

The text area distribution is like the frame difference distribution in that it is a measure of the difference between the current text on screen and a reference amount of text. The text distribution feature is obtained by averaging the fraction of text area present in a grid block over all frames of the shot.

### Bag of Audio Words (4000 bins)

The MFCC Bag of Audio Words has been successfully used in several existing speech / audio processing applications. MFCC coefficients along with Delta and Delta-Delta Cepstrum are computed from 150 hours of audio tracks. These coefficients are clustered into 4,000 groups which form the Audio words. Each shot is then represented as a 4,000-dimensional Bag of Audio Words by forming the normalized histograms of the MFCCs extracted from 20 ms windows with an overlap of 10 ms in the shots. This attribute is removed from the first set of dataframes (no BoWs) to reduce the sparseness of the dataset.
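The normalized-histogram step can be illustrated with a toy example. The sketch assumes each audio window has already been assigned the integer ID of its nearest audio word; the function name and the sample IDs are hypothetical.

```python
import numpy as np

def bag_of_audio_words(word_ids, vocabulary_size=4000):
    """Normalized histogram of audio-word IDs for one shot (illustrative)."""
    counts = np.bincount(word_ids, minlength=vocabulary_size)
    return counts / counts.sum()

# Made-up assignments for a shot with five audio windows.
word_ids = np.array([7, 7, 42, 3999, 7])
bow = bag_of_audio_words(word_ids)
print(bow.shape, np.count_nonzero(bow))
```

Only three of the 4,000 entries are non-zero in this toy shot, which hints at why the attribute makes the dataset so sparse.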

### Edge change Ratio

Edge Change Ratio captures the motion of edges between consecutive frames and is defined as the ratio of displaced edge pixels to the total number of edge pixels in a frame. The researchers calculated the mean and variance of the ECR over the entire shot.

<a id="Section 1.3: Potentially Useful Attributes"></a>

## Section 1.3: Potentially Useful Attributes

* A broadcast company code and/or name (there are five broadcast companies in this dataset)
* The volume of the audio (commercials tend to be louder in volume than the show)

<a id="Section 1.4: Columns and Data Types"></a>

## Section 1.4: Columns and Data Types

The table below shows the attributes and their data types in tabular format for quick review.

NOTE: There are inconsistencies in the column indexing per the readme.txt file, all relating to binning. The Motion Distribution attribute (18-58) should be columns 18-57, leaving column 58 as a 'filler' with an unknown value. Likewise, the Frame Difference Distribution attribute (59-91) should be columns 59-90, leaving column 91 as a 'filler' with an unknown value. The Text Area Distribution attribute (92-122) should be columns 92-121, leaving column 122 as a 'filler' with an unknown value. One hint that the indexing is off is the binning attributes ending in an even number rather than an odd number. While the filler values are unknown, they are still included in the dataframes and, therefore, the models. So while their labels may be a bit unclear, the actual values in those columns are still being used as input into our analysis.
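The column arithmetic behind this note can be checked quickly. The snippet below only counts the columns in the corrected inclusive ranges and compares them with the bin counts stated in the attribute descriptions (40, 32, and 15 + 15); the dictionary names are ours.

```python
# Corrected inclusive column ranges, as used in this notebook's attribute table.
ranges = {
    'Motion Distribution': (18, 57),
    'Frame Difference Distribution': (59, 90),
    'Text area distribution': (92, 121),
}

# Number of columns in each inclusive range.
widths = {name: last - first + 1 for name, (first, last) in ranges.items()}
print(widths)
```

With the corrected ranges, the widths match the documented bin counts, which leaves columns 58, 91, and 122 unaccounted for.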

In [None]:
# We are using a Pandas dataframe to tabulate the data (and provide a simple introduction to Pandas)

df_attributes = pd.DataFrame(
  data=[
    ('Dimension Index','0','integer','Categorical','Target variable'),
    ('Shot Length','1','integer','Continuous',''),
    ('Motion Distribution','2-3','float','Continuous','Mean and Variance'),
    ('Frame Difference Distribution','4-5','float','Continuous','Mean and Variance'),
    ('Short time energy','6-7','float','Continuous','Mean and Variance'),
    ('ZCR','8-9','float','Continuous','Mean and Variance'),
    ('Spectral Centroid','10-11','float','Continuous','Mean and Variance'),
    ('Spectral Roll off','12-13','float','Continuous','Mean and Variance'),
    ('Spectral Flux','14-15','float','Continuous','Mean and Variance'),
    ('Fundamental Frequency','16-17','float','Continuous','Mean and Variance'),
    ('Motion Distribution','18-57','float','Continuous','40 bins'),
    ('Filler','58', 'float','Continuous','Unknown value'),
    ('Frame Difference Distribution','59-90','float','Continuous','32 bins'),
    ('Filler','91', 'float','Continuous','Unknown value'),
    ('Text area distribution','92-121','float','Continuous','15 bins Mean and 15 bins for variance'),
    ('Filler','122', 'float','Continuous','Unknown value'),
    ('Bag of Audio Words','123-4123','float','Continuous','4,000 bins'), 
    ('Edge change Ratio','4124-4125','float','Continuous','Mean and Variance')
  ],
  columns=[
    'Attribute Name','Columns','Data Types', 'Type', 'Notes'
  ],
)

from tabulate import tabulate

print(tabulate(df_attributes, showindex=True, headers=df_attributes.columns))

<a id="Section 2: Data Preparation"></a>

# Section 2: Data Preparation

This section covers the activities needed to construct the dataset that will be fed into the models. The files for this project (bbc.txt, cnn.txt, cnnibn.txt, ndtv.txt, and timesnow.txt) can be found at https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset as a single ZIP file. To eliminate manual work and streamline file processing, these five files were extracted and put on a team member's website (http://www.shookfamily.org) as follows:

http://www.shookfamily.org/data/BBC.txt (17,720 lines)

http://www.shookfamily.org/data/CNN.txt (22,545 lines)

http://www.shookfamily.org/data/CNNIBN.txt (33,117 lines)

http://www.shookfamily.org/data/NDTV.txt (17,051 lines)

http://www.shookfamily.org/data/TIMESNOW.txt (39,252 lines)

As shown in the cells below, it takes several steps to download the files and process them into the final dataset.

The overall goal is to download the files from the internet and load them into an in-memory object. Because these files are stored in the SVM Light format, they are first loaded into a scipy.sparse matrix array object. These sparse matrix arrays are then inspected to eliminate as many columns as possible, and, consequently, reduce the sparseness of the matrix. Once that is accomplished, the scipy.sparse matrix arrays are converted to Pandas DataFrames for faster data processing and input into the accompanying data models.

<a id="Section 2.1: Download Files"></a>

## Section 2.1: Download Files

The first step in this process is to download the five files from the internet. The data is serialized as plain text in the SVM Light format. Each feature is stored as an Index:Value pair, where the index identifies an element in a sparse matrix array and the value is the content of that element. For example, a partial record like the following:

> 1 1:123 2:1.316440 3:1.516003 ...

represents the Y-axis label followed by the X-axis values, where the first, second, and third elements of the sparse matrix array have the values 123, 1.316440, and 1.516003 (that is, array[0] == 123, array[1] == 1.316440, and array[2] == 1.516003). The code below downloads each SVM Light file from the internet and loads it into two objects: X, a scipy.sparse matrix holding the X axis, and y, a numpy array holding the Y axis.

**Runtime Expectation:** It takes about 30 to 60 seconds to download and convert these files.

In [None]:
%%time

import urllib.request
import tempfile

from sklearn.datasets import load_svmlight_file

################################################################################
################################################################################

url_bbc      = 'http://www.shookfamily.org/data/BBC.txt'
url_cnn      = 'http://www.shookfamily.org/data/CNN.txt'
url_cnnibn   = 'http://www.shookfamily.org/data/CNNIBN.txt'
url_ndtv     = 'http://www.shookfamily.org/data/NDTV.txt'
url_timesnow = 'http://www.shookfamily.org/data/TIMESNOW.txt'

################################################################################
# Download file to a temporary file. Load that file into a scipy.sparse matrix
# array, and then return that object to the caller.
################################################################################

def get_pickled_file(url):
    response = urllib.request.urlopen(url)
    data = response.read()      # a `bytes` object
    text = data.decode('utf-8') # a `str`; this step can't be used if data is binary

    with tempfile.NamedTemporaryFile(delete=False, mode='w') as file_handle:
        assert text is not None
        file_handle.write(text)
        filename = file_handle.name

    # Load only after the `with` block has flushed and closed the temporary file
    return load_svmlight_file(filename)   # Returns the X axis and Y axis

################################################################################
# Download files as scipy.sparse matrix arrays
################################################################################

print('Downloading datasets from the internet ...\n')
print('Downloading (as scipy.sparse matrix) ...', url_bbc)

%time X1, y1 = get_pickled_file(url_bbc)
%time X2, y2 = get_pickled_file(url_cnn)
%time X3, y3 = get_pickled_file(url_cnnibn)
%time X4, y4 = get_pickled_file(url_ndtv)
%time X5, y5 = get_pickled_file(url_timesnow)

print('\nAll files have been downloaded')

<a id="Section 2.2: Pivot the Y-axis"></a>

## Section 2.2: Pivot the Y-axis

The Y-axis variables (y1, y2, y3, y4, y5) are returned from the cell above as one-dimensional arrays:

> array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

The code below pivots each of those arrays into a two-dimensional array with a single column (one row per instance):

> array(  
&nbsp;&nbsp;[  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.],  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.],  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.],  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.],  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.],  
&nbsp;&nbsp;&nbsp;&nbsp;[ 1.]  
&nbsp;&nbsp;]  
)

**Runtime Expectation:** It takes less than a second to run the following cell.

In [None]:
%%time

Y1 = y1[:, None]   # bbc
Y2 = y2[:, None]   # cnn
Y3 = y3[:, None]   # cnnibn
Y4 = y4[:, None]   # ndtv
Y5 = y5[:, None]   # timesnow
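The effect of indexing with `[:, None]` can be confirmed on a small array. This is a toy check with made-up labels, not part of the notebook's pipeline.

```python
import numpy as np

y = np.array([1., -1., 1.])        # one-dimensional labels, shape (3,)
Y = y[:, None]                     # adds a second axis, shape (3, 1)

print(y.shape, Y.shape)
same_as_reshape = np.array_equal(Y, y.reshape(-1, 1))
```

Indexing with `None` is equivalent to `np.newaxis`, and here it gives the same result as `y.reshape(-1, 1)`.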

<a id="Section 2.3: Convert Sparse Matrix Array to an Array"></a>

## Section 2.3: Convert Sparse Matrix Array to an Array

The first five cells display some information about each sparse matrix array. The last cell converts those sparse matrix arrays into dense arrays.

**Runtime Expectation:** The following cell runs in about a second.

In [None]:
%time X1  # bbc

In [None]:
%time X2  # cnn

In [None]:
%time X3  # cnnibn

In [None]:
%time X4  # ndtv

In [None]:
%time X5  # timesnow

In [None]:
%%time

X_dense1 = X1.toarray()  # bbc
X_dense2 = X2.toarray()  # cnn
X_dense3 = X3.toarray()  # cnnibn
X_dense4 = X4.toarray()  # ndtv
X_dense5 = X5.toarray()  # timesnow
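Converting to dense arrays materializes every zero that the sparse format omitted. As a rough, back-of-the-envelope estimate, the memory needed for all five dense arrays combined can be worked out from the line counts listed in Section 2. The 4,125 feature columns per row and the 8-byte float64 values are assumptions, since neither is stated explicitly in this section.

```python
# Row counts taken from the file listing in Section 2.
rows = {'bbc': 17720, 'cnn': 22545, 'cnnibn': 33117, 'ndtv': 17051, 'timesnow': 39252}

n_columns = 4125        # assumption: highest feature index seen in the sample record
bytes_per_value = 8     # assumption: values are stored as float64

total_rows = sum(rows.values())
dense_bytes = total_rows * n_columns * bytes_per_value
print(total_rows, round(dense_bytes / 1e9, 2), 'GB')
```

Under these assumptions the dense arrays need on the order of 4 GB, which is one motivation for dropping columns before further processing.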

<a id="Section 2.4: Concatenate the Y-axis before the X-axis"></a>

## Section 2.4: Concatenate the Y-axis before the X-axis

Now that the Y-axis has been pivoted into a single column, we can concatenate the two arrays so the Y-axis is inserted before the X-axis. This places the Dependent Variable in the first column followed by the Independent Variables.

**Runtime Expectation:** The following cell runs in about 10 to 15 seconds.

In [None]:
%%time

concat1 = np.hstack((Y1, X_dense1))  # bbc
concat2 = np.hstack((Y2, X_dense2))  # cnn
concat3 = np.hstack((Y3, X_dense3))  # cnnibn
concat4 = np.hstack((Y4, X_dense4))  # ndtv
concat5 = np.hstack((Y5, X_dense5))  # timesnow
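A toy example shows what the horizontal stack does to the column positions. The arrays below are made up; only the `np.hstack` call mirrors the cell above.

```python
import numpy as np

Y = np.array([[1.], [-1.]])                      # labels as a single column
X_dense = np.array([[123., 1.3], [45., 0.0]])    # two features per row

concat = np.hstack((Y, X_dense))
print(concat)
```

Because the label now occupies column 0, every feature shifts one position to the right: the first feature in the X array (position 0) lands in column 1 of the combined array, which matches the column numbers used in the attribute table in Section 1.4.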

<a id="Section 2.5: Convert the Arrays into Pandas Dataframes"></a>

## Section 2.5: Convert the Arrays into Pandas Dataframes

The following code converts the concatenated dense arrays into Pandas dataframes (to get them into the Pandas ecosystem).

<a id="Section 2.5.1: Convert the First Set of Dataframes (no BoWs)"></a>

### Section 2.5.1: Convert the First Set of Dataframes (no BoWs)

The first set of dataframes will be used to model without the Bag of Words.

This set of dataframes is consistent with the data preparation, visualization, and modeling in Lab 1 and the MiniLab (where we had deleted the Bag of Words to simplify those projects).

**Runtime Expectation:** The following cell runs in a second or two.

In [None]:
%%time

df_bbc      = pd.DataFrame(concat1)
df_cnn      = pd.DataFrame(concat2)
df_cnnibn   = pd.DataFrame(concat3)
df_ndtv     = pd.DataFrame(concat4)
df_timesnow = pd.DataFrame(concat5)

print(len(df_bbc.index), len(df_cnn.index), len(df_cnnibn.index), len(df_ndtv.index), len(df_timesnow.index),
    len(df_bbc.index) + len(df_cnn.index) + len(df_cnnibn.index) + len(df_ndtv.index) + len(df_timesnow.index))

drop_cols = np.arange(123, 4124)

df_bbc      = df_bbc.drop(drop_cols, axis=1)       # axis=1 drops columns
df_cnn      = df_cnn.drop(drop_cols, axis=1)
df_cnnibn   = df_cnnibn.drop(drop_cols, axis=1)
df_ndtv     = df_ndtv.drop(drop_cols, axis=1)
df_timesnow = df_timesnow.drop(drop_cols, axis=1)

df_bbc.info()
df_cnn.info()
df_cnnibn.info()
df_ndtv.info()
df_timesnow.info()

<a id="Section 2.5.2: Convert the Second Set of Dataframes (BoWs)"></a>

### Section 2.5.2: Convert the Second Set of Dataframes (BoWs)

The second set of dataframes will be used to model with the Bag of Words (*_bow).

**Runtime Expectation:** The following cell runs in about 10 to 20 seconds.

In [None]:
%%time

df_bbc_bow      = pd.DataFrame(concat1)   # df_bbc_bow (*_bag_of_words)
df_cnn_bow      = pd.DataFrame(concat2)
df_cnnibn_bow   = pd.DataFrame(concat3)
df_ndtv_bow     = pd.DataFrame(concat4)
df_timesnow_bow = pd.DataFrame(concat5)

print(len(df_bbc_bow.index), len(df_cnn_bow.index), len(df_cnnibn_bow.index), len(df_ndtv_bow.index),
    len(df_timesnow_bow.index), len(df_bbc_bow.index) + len(df_cnn_bow.index) + len(df_cnnibn_bow.index) +
    len(df_ndtv_bow.index) + len(df_timesnow_bow.index))

drop_cols = np.append(np.arange(1, 123), np.arange(4124, 4126))

df_bbc_bow      = df_bbc_bow.drop(drop_cols, axis=1)       # axis=1 drops columns
df_cnn_bow      = df_cnn_bow.drop(drop_cols, axis=1)
df_cnnibn_bow   = df_cnnibn_bow.drop(drop_cols, axis=1)
df_ndtv_bow     = df_ndtv_bow.drop(drop_cols, axis=1)
df_timesnow_bow = df_timesnow_bow.drop(drop_cols, axis=1)

df_bbc_bow.info()
df_cnn_bow.info()
df_cnnibn_bow.info()
df_ndtv_bow.info()
df_timesnow_bow.info()
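The two column splits above follow the same pattern, which is easier to see on a small frame. In this sketch the 8-column toy dataframe and the choice of columns 3-6 as a stand-in for the bag of words are hypothetical; only the use of `DataFrame.drop` mirrors the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(16).reshape(2, 8))   # columns 0..7; column 0 plays the label
bow_cols = np.arange(3, 7)                        # pretend columns 3-6 are the bag of words

# First set: everything except the bag-of-words columns.
no_bow = toy.drop(columns=bow_cols)

# Second set: the label plus the bag-of-words columns only.
only_bow = toy.drop(columns=np.append(np.arange(1, 3), [7]))

print(list(no_bow.columns))
print(list(only_bow.columns))
```

Both frames keep column 0, so each set retains the target variable alongside its own features.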

<a id="Section 2.6: Rename Columns from Integers to Labels"></a>

## Section 2.6: Rename Columns from Integers to Labels

<a id="Section 2.6.1: Rename the First Set of Dataframes (no BoWs)"></a>

### Section 2.6.1: Rename the First Set of Dataframes (no BoWs)

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

ren_cols = np.array([
    'Dimension Index',
    'Shot Length',
    'Motion Distribution-Mean', 'Motion Distribution-Variance',
    'Frame Difference Distribution-Mean', 'Frame Difference Distribution-Variance',
    'Short time energy-Mean', 'Short time energy-Variance',
    'ZCR-Mean', 'ZCR-Variance',
    'Spectral Centroid-Mean', 'Spectral Centroid-Variance',
    'Spectral Roll off-Mean', 'Spectral Roll off-Variance',
    'Spectral Flux-Mean', 'Spectral Flux-Variance',
    'Fundamental Frequency-Mean', 'Fundamental Frequency-Variance',
    'Motion Distribution-Bin 1', 'Motion Distribution-Bin 2', 'Motion Distribution-Bin 3', 'Motion Distribution-Bin 4',
    'Motion Distribution-Bin 5', 'Motion Distribution-Bin 6', 'Motion Distribution-Bin 7', 'Motion Distribution-Bin 8',
    'Motion Distribution-Bin 9', 'Motion Distribution-Bin 10', 'Motion Distribution-Bin 11', 'Motion Distribution-Bin 12',
    'Motion Distribution-Bin 13', 'Motion Distribution-Bin 14', 'Motion Distribution-Bin 15', 'Motion Distribution-Bin 16',
    'Motion Distribution-Bin 17', 'Motion Distribution-Bin 18', 'Motion Distribution-Bin 19', 'Motion Distribution-Bin 20',
    'Motion Distribution-Bin 21', 'Motion Distribution-Bin 22', 'Motion Distribution-Bin 23', 'Motion Distribution-Bin 24',
    'Motion Distribution-Bin 25', 'Motion Distribution-Bin 26', 'Motion Distribution-Bin 27', 'Motion Distribution-Bin 28',
    'Motion Distribution-Bin 29', 'Motion Distribution-Bin 30', 'Motion Distribution-Bin 31', 'Motion Distribution-Bin 32',
    'Motion Distribution-Bin 33', 'Motion Distribution-Bin 34', 'Motion Distribution-Bin 35', 'Motion Distribution-Bin 36',
    'Motion Distribution-Bin 37', 'Motion Distribution-Bin 38', 'Motion Distribution-Bin 39', 'Motion Distribution-Bin 40',
    'Filler 1',
    'Frame Difference Distribution-Bin 1', 'Frame Difference Distribution-Bin 2',
    'Frame Difference Distribution-Bin 3', 'Frame Difference Distribution-Bin 4',
    'Frame Difference Distribution-Bin 5', 'Frame Difference Distribution-Bin 6',
    'Frame Difference Distribution-Bin 7', 'Frame Difference Distribution-Bin 8',
    'Frame Difference Distribution-Bin 9', 'Frame Difference Distribution-Bin 10',
    'Frame Difference Distribution-Bin 11', 'Frame Difference Distribution-Bin 12',
    'Frame Difference Distribution-Bin 13', 'Frame Difference Distribution-Bin 14',
    'Frame Difference Distribution-Bin 15', 'Frame Difference Distribution-Bin 16',
    'Frame Difference Distribution-Bin 17', 'Frame Difference Distribution-Bin 18',
    'Frame Difference Distribution-Bin 19', 'Frame Difference Distribution-Bin 20',
    'Frame Difference Distribution-Bin 21', 'Frame Difference Distribution-Bin 22',
    'Frame Difference Distribution-Bin 23', 'Frame Difference Distribution-Bin 24',
    'Frame Difference Distribution-Bin 25', 'Frame Difference Distribution-Bin 26',
    'Frame Difference Distribution-Bin 27', 'Frame Difference Distribution-Bin 28',
    'Frame Difference Distribution-Bin 29', 'Frame Difference Distribution-Bin 30',
    'Frame Difference Distribution-Bin 31', 'Frame Difference Distribution-Bin 32',
    'Filler 2',
    'Text area distribution-Bin 1-Mean', 'Text area distribution-Bin 2-Mean',
    'Text area distribution-Bin 3-Mean', 'Text area distribution-Bin 4-Mean',
    'Text area distribution-Bin 5-Mean', 'Text area distribution-Bin 6-Mean',
    'Text area distribution-Bin 7-Mean', 'Text area distribution-Bin 8-Mean',
    'Text area distribution-Bin 9-Mean', 'Text area distribution-Bin 10-Mean',
    'Text area distribution-Bin 11-Mean', 'Text area distribution-Bin 12-Mean',
    'Text area distribution-Bin 13-Mean', 'Text area distribution-Bin 14-Mean',
    'Text area distribution-Bin 15-Mean',
    'Text area distribution-Bin 1-Variance', 'Text area distribution-Bin 2-Variance',
    'Text area distribution-Bin 3-Variance', 'Text area distribution-Bin 4-Variance',
    'Text area distribution-Bin 5-Variance', 'Text area distribution-Bin 6-Variance',
    'Text area distribution-Bin 7-Variance', 'Text area distribution-Bin 8-Variance',
    'Text area distribution-Bin 9-Variance', 'Text area distribution-Bin 10-Variance',
    'Text area distribution-Bin 11-Variance', 'Text area distribution-Bin 12-Variance',
    'Text area distribution-Bin 13-Variance', 'Text area distribution-Bin 14-Variance',
    'Text area distribution-Bin 15-Variance', 'Attribute 122 should be Bin 15-Variance',
    'Edge change Ratio-Mean', 'Edge change Ratio-Variance'
])
    
df_bbc.columns = ren_cols
df_cnn.columns = ren_cols
df_cnnibn.columns = ren_cols
df_ndtv.columns = ren_cols
df_timesnow.columns = ren_cols

print(df_bbc.iloc[0:1])   # Show the first row only

<a id="Section 2.6.2: Rename the Second Set of Dataframes (BoWs)"></a>

### Section 2.6.2: Rename the Second Set of Dataframes (BoWs)

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

ren_cols = np.array(['Dimension Index'])

print(ren_cols.size)

for i in np.arange(1, 4002):
    ren_cols = np.append(ren_cols, 'BoW ' + str(i))

print(ren_cols.size)
print(ren_cols)

df_bbc_bow.columns = ren_cols
df_cnn_bow.columns = ren_cols
df_cnnibn_bow.columns = ren_cols
df_ndtv_bow.columns = ren_cols
df_timesnow_bow.columns = ren_cols

print(df_bbc_bow.iloc[0:1])   # Show the first row only
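
As a possible alternative to growing the array with repeated np.append calls, the same labels could be built in one pass with a list comprehension. This is only a sketch; the name `bow_cols` is hypothetical and not used elsewhere in the notebook.

```python
# Build the label column plus 'BoW 1' ... 'BoW 4001' in a single pass.
bow_cols = ['Dimension Index'] + ['BoW ' + str(i) for i in range(1, 4002)]

print(len(bow_cols))
print(bow_cols[:3], bow_cols[-1])
```

A plain Python list can be assigned to a dataframe's columns attribute in the same way as the NumPy array above.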

<a id="Section 2.7: Inspect Missing Values"></a>

## Section 2.7: Inspect Missing Values

As shown in the output above, 120 columns are left in the dataframe. 4,005 columns were deleted after eliminating the Bag of Words (4,000 columns) and the five columns (88, 89, 120, 121, 123) with all zero values.

<a id="Section 2.7.1: Display Table of Missing Values (no BoWs)"></a>

### Section 2.7.1: Display Table of Zero Values (no BoWs)

The code below displays columns with SOME zero values (versus ALL zero values).

**Runtime Expectation:** The following cell runs in a few seconds.

In [None]:
%%time

def percentage_of_zeros_table(df):
    # Count the non-zero entries per column, then derive the zero counts and percentages
    number_of_nonzeros = df.astype(bool).sum(axis=0)
    number_of_zeros = df.count() - number_of_nonzeros
    percent_of_zeros = number_of_zeros / df.count() * 100
    table1 = pd.concat([number_of_zeros, percent_of_zeros], axis=1)
    table2 = table1.rename(columns={0 : 'Zero Values', 1 : '% of Total Values'})
    return table2

df_missing = percentage_of_zeros_table(df_bbc)

print(df_missing)
df_missing.info()   # Call the method; printing df_missing.info only shows the bound method

<a id="Section 2.7.2: View Zero Values via a 40% Threshold (no BoWs)"></a>

### Section 2.7.2: View Zero Values via a 40% Threshold (no BoWs)

The code below displays the columns in which more than 40% of the values are zero.

**Runtime Expectation:** The following cell runs in a few seconds.

In [None]:
%%time

df_missing = df_missing[(df_missing['% of Total Values'] > 40)]

print(df_missing)
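
The zero-count table and the 40% filter can be checked by hand on a tiny frame. The sketch below uses made-up values and assumes there are no missing entries, so it divides by the row count rather than by df.count(); the names `toy`, `zeros_table`, and `mostly_zero` are hypothetical.

```python
import pandas as pd

# Made-up data: 'a' is half zeros, 'b' has none, 'c' is three-quarters zeros.
toy = pd.DataFrame({'a': [0, 0, 3, 4], 'b': [1, 2, 3, 4], 'c': [0, 0, 0, 5]})

zero_counts = (toy == 0).sum()            # Zeros per column
zero_pct = zero_counts / len(toy) * 100   # Share of rows that are zero

zeros_table = pd.DataFrame({'Zero Values': zero_counts, '% of Total Values': zero_pct})
print(zeros_table)

# Keep only the columns where more than 40% of the values are zero.
mostly_zero = zeros_table[zeros_table['% of Total Values'] > 40]
print(mostly_zero)
```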

<a id="Section 2.7.3: Display Table of Zero Values (BoWs)"></a>

### Section 2.7.3: Display Table of Zero Values (BoWs)

The code below displays columns with SOME zero values (versus ALL zero values).

**Runtime Expectation:** The following cell runs in about 5 seconds.

In [None]:
%%time

def percentage_of_zeros_table(df):
    # Count the non-zero entries per column, then derive the zero counts and percentages
    number_of_nonzeros = df.astype(bool).sum(axis=0)
    number_of_zeros = df.count() - number_of_nonzeros
    percent_of_zeros = number_of_zeros / df.count() * 100
    table1 = pd.concat([number_of_zeros, percent_of_zeros], axis=1)
    table2 = table1.rename(columns={0 : 'Zero Values', 1 : '% of Total Values'})
    return table2

df_missing = percentage_of_zeros_table(df_bbc_bow)

print(df_missing)
df_missing.info()   # Call the method; printing df_missing.info only shows the bound method

<a id="Section 2.8: Concatenate the Five Pandas Dataframes"></a>

## Section 2.8: Concatenate the Five Pandas Dataframes

This step concatenates the five Pandas dataframes into the final dataframe.

<a id="Section 2.8.1: Concatenate the First Set of Dataframes (no BoWs)"></a>

### Section 2.8.1: Concatenate the First Set of Dataframes (no BoWs)

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

df_final = pd.concat([df_bbc, df_cnn, df_cnnibn, df_ndtv, df_timesnow])

df_final.name = 'TV News Channel Commercial Detection'

cols = list(df_final)  # List of the column names as strings (for easy indexing)

df_final.info()

<a id="Section 2.8.2: Concatenate the Second Set of Dataframes (BoWs)"></a>

### Section 2.8.2: Concatenate the Second Set of Dataframes (BoWs)

**Runtime Expectation:** The following cell runs in about 10 to 20 seconds.

In [None]:
%%time

df_final_bow = pd.concat([df_bbc_bow, df_cnn_bow, df_cnnibn_bow, df_ndtv_bow, df_timesnow_bow])

df_final_bow.name = 'TV News Channel Commercial Detection'

df_final_bow.info()

<a id="Section 2.8.3: Drop Columns with All Zeroes (BoWs)"></a>

### Section 2.8.3: Drop Columns with All Zeroes (BoWs)

**Runtime Expectation:** The following cell runs in about 40 to 50 seconds.

In [None]:
%%time

df_final_bow = df_final_bow.loc[:, (df_final_bow != 0).any(axis=0)]

cols_bow = list(df_final_bow)  # List of the column names as strings (for easy indexing)

df_final_bow.info()
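
The boolean mask used above keeps a column when at least one of its values is non-zero. A minimal sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame: 'BoW 1' is all zeros, 'BoW 2' has a single non-zero value.
toy = pd.DataFrame({'label': [1, -1, 1], 'BoW 1': [0, 0, 0], 'BoW 2': [0, 2, 0]})

# (toy != 0) is True where a value is non-zero; any(axis=0) reduces per column.
keep_mask = (toy != 0).any(axis=0)
kept = toy.loc[:, keep_mask]

print(list(kept.columns))
```

Only the all-zero column is removed; a column with even one non-zero entry survives.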

<a id="Section 3: Visualizing the Data"></a>

# Section 3: Visualizing the Data 

<a id="Section 3.1: Attributes: Box Plots"></a>

## Section 3.1: Attributes: Box Plots

The code below creates a box plot for each of the non-binned attributes (columns 1-17 and 4124-4125), grouped by the Dimension Index (column 0).

**Runtime Expectation:** The following three cells run in about 5 to 10 seconds.

In [None]:
%%time

import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
import seaborn as sns

# Box Plot: Attribute 1 - Shot Length

fig, ax = plt.subplots(1, 1, figsize=(6.7, 3))

axes = df_final.boxplot(column=cols[1:2], by='Dimension Index', patch_artist=True, ax=ax)

axes.set_xlabel('Non-commercial vs. Commercial')   # Non-commercial == -1, Commercial == +1

plt.subplots_adjust(top=1.5)
plt.suptitle('')
plt.show()

In [None]:
%%time

# Box Plot: Attributes 2-18 - Motion Distribution-Mean to Fundamental Frequency-Variance

fig, ax = plt.subplots(8, 2, figsize=(15, 26))

axes = df_final.boxplot(column=cols[2:18], by='Dimension Index', patch_artist=True, ax=ax)

for i in axes:
    i.set_xlabel('Non-commercial vs. Commercial')   # Non-commercial == -1, Commercial == +1

plt.subplots_adjust(top=1.5)
plt.suptitle('')
plt.show()

In [None]:
%%time

# Box Plot: Attributes 4124-4125 - Edge change Ratio-Mean to Edge change Ratio-Variance

fig, ax = plt.subplots(1, 2, figsize=(15, 3))

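# Select the two columns by name; cols[122:124] would return 'Attribute 122 should be Bin 15-Variance'
# and 'Edge change Ratio-Mean' instead of the two Edge change Ratio columns.
axes = df_final.boxplot(column=['Edge change Ratio-Mean', 'Edge change Ratio-Variance'], by='Dimension Index',
    patch_artist=True, ax=ax)

for i in axes:
    i.set_xlabel('Non-commercial vs. Commercial')   # Non-commercial == -1, Commercial == +1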

plt.subplots_adjust(top=1.5)
plt.suptitle("")
plt.show()

<a id="Section 3.2: Attributes: Hexbin Plots"></a>

## Section 3.2: Attributes: Hexbin Plots

The hexbin plots below compare Spectral Centroid-Mean against Spectral Roll off-Mean for all five networks combined and for each news network individually. The charts visualize the roughly linear relationship between these two means across all of the news networks. They also help identify outliers.

**Runtime Expectation:** The following cell runs in about 5 seconds.

In [None]:
%%time

fig, ax = plt.subplots(2, 3, figsize=(20,12))

# Plot all five datasets / broadcast

df_final.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='All Five Networks',ax=ax[0,0])

# Plot each dataset / broadcast

df_bbc.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='BBC',ax=ax[0,1])
df_cnn.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='CNN',ax=ax[0,2])
df_cnnibn.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='CNNIBN',ax=ax[1,0])
df_ndtv.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='NDTV',ax=ax[1,1])
df_timesnow.plot('Spectral Centroid-Mean','Spectral Roll off-Mean',kind='hexbin',gridsize=30,title='TIMESNOW',ax=ax[1,2])

plt.show()

<a id="Section 3.3: Principal Component Analysis (PCA)"></a>

## Section 3.3: Principal Component Analysis (PCA)

The code below creates an X-array from the first 18 attribute columns (cols[1:19], that is, the 17 non-binned attributes plus Motion Distribution-Bin 1) and a Y-array of the target (Dimension Index: Commercial (+1) or Non-commercial (-1)). The X-array is then scaled and the PCA algorithm is executed against that scaled array. The components array is then concatenated with the target array and converted into a Pandas dataframe for further manipulation.

**Runtime Expectation:** The following cell runs in a few seconds.

In [None]:
%%time

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

x = df_final.loc[:, cols[1:19]].values
y = df_final.loc[:,['Dimension Index']].values

x = StandardScaler().fit_transform(x)

pca = PCA(n_components=18)

components = pca.fit_transform(x)

col_names = ['Dimension Index','PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10','PC11','PC12','PC13','PC14',
    'PC15','PC16','PC17','PC18']

df_pca = pd.DataFrame(np.hstack((y, components)), columns=col_names)

print(df_pca.head())   # A bare expression is only displayed when it is the last line of a cell

print(pca.explained_variance_)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
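
The scale-then-project sequence can be sketched on synthetic data, which makes it easier to see what the explained variance ratios report. The three toy columns, the random seed, and the variable names below are assumptions chosen for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated columns on very different scales
# plus one independent noise column.
base = rng.normal(size=(200, 1))
toy_x = np.hstack([
    base * 1000.0,
    base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize first so the large-scale column does not dominate the projection.
scaled = StandardScaler().fit_transform(toy_x)

pca_toy = PCA(n_components=3)
components = pca_toy.fit_transform(scaled)

ratios = pca_toy.explained_variance_ratio_
print(ratios)            # Share of variance captured by each component
print(ratios.cumsum())   # Cumulative share; reaches 1.0 when all components are kept
```

Because every component is retained, the ratios sum to 1. A cumulative view like this is one way to judge how many components would be enough if fewer were kept.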

In [None]:
%%time

import seaborn as sb
from IPython.display import Image
from IPython.core.display import HTML 
from pylab import rcParams

import sklearn
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn import datasets

%matplotlib inline

sb.heatmap(df_pca)

<a id="Section 3.4: The Final Datasets"></a>

## Section 3.4: The Final Datasets

As the image below shows, there are many dataframes available for modeling. There are two sets of dataframes (with or without the Bag of Words), and each set contains one dataframe per broadcast company plus a combined dataframe.

> **No Bag of Words:** This set contains five dataframes (without the Bag of Words) that allow each broadcast company to be modeled independently, plus a combined dataframe:

> * df_bbc
* df_cnn
* df_cnnibn
* df_ndtv
* df_timesnow
* df_final (combined)

> **Bag of Words:** This set contains five dataframes (with the Bag of Words) that allow each broadcast company to be modeled independently, plus a combined dataframe:

> * df_bbc_bow
* df_cnn_bow
* df_cnnibn_bow
* df_ndtv_bow
* df_timesnow_bow
* df_final_bow (combined)

In the dataframes associated with no Bag of Words (df_bbc, df_cnn, df_cnnibn, df_ndtv, df_timesnow, and df_final), all the attributes were left intact after inspecting for missing values or columns with all zeroes. Essentially, the dataframes are pretty clean.

In the dataframes associated with the Bag of Words (df_bbc_bow, df_cnn_bow, df_cnnibn_bow, df_ndtv_bow, df_timesnow_bow, and df_final_bow), over 90% of the columns had all zeroes and were dropped. This resulted in only 113 columns with ANY values (other than zero) and a reduction of almost 3,900 columns.

In [None]:
from IPython.display import Image

Image(url='http://www.shookfamily.org/data/final_dataframes.jpg')

<a id="Section 4: Modeling"></a>

# Section 4: Modeling

<a id="Section 4.1: Baseline Cross-validation and Classification"></a>

## Section 4.1: Baseline Cross-validation and Classification

This section measures the average accuracy of classifying whether a video shot is Commercial or Non-commercial.

<a id="Section 4.1.1: Choosing a Cross-validation Approach"></a>

### Section 4.1.1: Choosing a Cross-validation Approach

In the MiniLab, we used a ShuffleSplit as the cross-validation (cv) object. That choice was based on examples rather than a comprehensive exploration of cross-validation. Hence, in this lab the following cell creates five different cross-validation objects to be used in developing a baseline performance. (See: <a href="http://scikit-learn.org/stable/modules/cross_validation.html">http://scikit-learn.org/stable/modules/cross_validation.html</a>)

> **cv1 (ShuffleSplit):** Generates a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.

> **cv2 (StratifiedShuffleSplit):** A variation of ShuffleSplit that returns stratified splits, which preserve the same percentage of each target class as in the complete set.

> **cv3 (KFold):** Divides all the samples in k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for test.

> **cv4 (StratifiedKFold):** A variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

> **cv5 (RepeatedKFold):** Repeats K-Fold n times, producing different splits in each repetition.

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

from sklearn.model_selection import StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit, KFold, RepeatedKFold, cross_val_score

df_test = df_final.copy(deep=True)   # All five datasets (no Bag of Words)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

cv_splits = 2   # The minimum number of splits for K-fold

cv1 = ShuffleSplit(n_splits=cv_splits, test_size=0.2)
cv2 = StratifiedShuffleSplit(n_splits=cv_splits, test_size=0.2)
cv3 = KFold(n_splits=cv_splits)
cv4 = StratifiedKFold(n_splits=cv_splits)
cv5 = RepeatedKFold(n_splits=cv_splits, n_repeats=2, random_state=12883823)   # random_state from scikit example

print(cv1)
print(cv2)
print(cv3)
print(cv4)
print(cv5)

print(cv1.get_n_splits(X), cv2.get_n_splits(X), cv3.get_n_splits(X), cv4.get_n_splits(X), cv5.get_n_splits(X))

print(cv1.split)
print(cv2.split)
print(cv3.split)
print(cv4.split)
print(cv5.split)
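
To illustrate what stratification changes, the sketch below compares the share of the minority class in the test sets produced by a plain and a stratified shuffle split. The 80/20 label mix, the random_state values, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical imbalanced target: 80 non-commercial (-1) and 20 commercial (+1) shots.
toy_y = np.array([-1] * 80 + [1] * 20)
toy_X = np.zeros((100, 1))   # Feature values are irrelevant to how the rows are split

splitters = {
    'ShuffleSplit': ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
    'StratifiedShuffleSplit': StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
}

# Record the share of commercial shots in each test set.
test_shares = {}
for name, splitter in splitters.items():
    test_shares[name] = [float((toy_y[test_idx] == 1).mean())
                         for _, test_idx in splitter.split(toy_X, toy_y)]
    print(name, test_shares[name])
```

The stratified splitter holds the commercial share at the overall 20% in every test set, while the plain splitter is free to drift from it.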

<a id="Section 4.1.2: Baseline using a Random Forest"></a>

### Section 4.1.2: Baseline using a Random Forest

The following cell uses a RandomForestClassifier with the five cross-validators on the BBC dataframe (df_bbc) to get accuracy, standard deviation, and wall clock time.

**Runtime Expectation:** The following cell runs in about 2 minutes.

In [None]:
%%time

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

df_test = df_bbc.copy(deep=True)   # <<--- BBC (no BoWs)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

rf_clf = RandomForestClassifier(n_estimators=150, random_state=1)

def do_rf(cv_object):
    cv = cv_object

    print(cv)

    acc = cross_val_score(rf_clf, X, y=y, cv=cv)

    print ("Average accuracy = ", acc.mean() * 100, "+-", acc.std() * 100, "\n")

#############################################################################################
#############################################################################################

#cv1 == ShuffleSplit
#cv2 == StratifiedShuffleSplit
#cv3 == KFold
#cv4 == StratifiedKFold
#cv5 == RepeatedKFold

cv_objects = [cv1, cv2, cv3, cv4, cv5]

for cv_object in cv_objects:
    %time do_rf(cv_object)

Based on the output above, all of the cross-validation objects have an accuracy exceeding 80%. However, the shuffle splits outperform the k-fold cross-validators with a smaller standard deviation and faster execution time. While the RepeatedKFold cross-validator did well against this dataset, it ran the slowest and had a higher standard deviation than the shuffle splits. The performance in the cell above had 2 splits (the minimum number of splits for K-fold).

| Cross-validator | Accuracy (%) | Std Dev | Time |
| ------------- | ------------- | ------------- | ------------- |
| ShuffleSplit | 87.8527088036 | ±0.268058690745 | 30.8 s |
| StratifiedShuffleSplit | 87.8950338600 | ±0.338600451467 | 38.3 s |
| KFold | 81.6534988713 | ±3.075620767490 | 22.0 s |
| StratifiedKFold | 86.8030474041 | ±0.466754100535 | 19.6 s |
| RepeatedKFold | 86.8030474041 | ±0.466754100535 | 44.4 s |

**2 Splits**

With a higher number of splits, accuracy improves for KFold and RepeatedKFold, although it drops slightly for StratifiedKFold (as shown in the table below).

| Cross-validator | Accuracy (%) | Std Dev | Time |
| ------------- | ------------- | ------------- | ------------- |
| ShuffleSplit | 87.7483069977 | ±0.509747082745 | 2min 25s |
| StratifiedShuffleSplit | 87.8724604966 | ±0.686634815993 | 2min 41s |
| KFold | 87.1331828442 | ±7.24996905315 | 2min 54s |
| StratifiedKFold | 85.7075759452 | ±8.70263625731 | 3min 11s |
| RepeatedKFold | 87.9006772009 | ±0.755322753007 | 6min 15s |

**10 Splits**

Even with 10 splits, however, KFold and StratifiedKFold do not exceed the accuracy of the shuffle split cross-validators, and they have a much higher standard deviation.

**Note:** The 10-split results were produced by changing the variable cv_splits from 2 to 10 in <a href="#Section 4.1.1: Choosing a Cross-validation Approach">Section 4.1.1: Choosing a Cross-validation Approach</a> and then executing the code in <a href="#Section 4.1.2: Baseline using a Random Forest">Section 4.1.2: Baseline using a Random Forest</a>. This was done once (as it takes a very long time for all 10 splits to execute), and then cv_splits was set back to 2.

<a id="Section 4.1.3: Baseline using Logistic Regression"></a>

### Section 4.1.3: Baseline using Logistic Regression

The following cell runs a logistic regression using the default parameters. Unlike the Support Vector Models (discussed below), the Logistic Regression models use all the attributes and all the instances in the concatenated dataset (129,685 rows).

**Runtime Expectation:** The following cell runs in about 4 minutes.

In [None]:
%%time

from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

df_test = df_final.copy(deep=True)   # <-- All 129,685 rows

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

lr_clf = LogisticRegression()

def do_lr(cv_object):
    cv = cv_object
    
    print(cv)

    for train_indices, test_indices in cv.split(X, y):
        X_train = X[train_indices]
        y_train = y[train_indices]
    
        X_test = X[test_indices]
        y_test = y[test_indices]
    
        lr_clf.fit(X_train, y_train)  # train object

        y_hat = lr_clf.predict(X_test) # get test set predictions
    
        acc = mt.accuracy_score(y_test, y_hat)
        conf = mt.confusion_matrix(y_test, y_hat)

        print("Accuracy:", acc )
        print("Confusion matrix:\n", conf)
    
#############################################################################################
#############################################################################################

#cv1 == ShuffleSplit
#cv2 == StratifiedShuffleSplit
#cv3 == KFold
#cv4 == StratifiedKFold
#cv5 == RepeatedKFold

cv_objects = [cv1, cv2, cv3, cv4, cv5]

for cv_object in cv_objects:
    %time do_lr(cv_object)

Based on the output above, all of the cross-validation objects for Logistic Regression also have accuracy exceeding 80%. In a sample execution, these results can be seen in the table below:

| Cross-validator | Accuracy (Split 1) | Accuracy (Split 2) | Accuracy (Repeat 2, Split 1) | Accuracy (Repeat 2, Split 2) | Time |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| ShuffleSplit | 0.846474148899 | 0.848903111385 | N/A | N/A | 45.1 s |
| StratifiedShuffleSplit | 0.848247677064 | 0.835678760072 | N/A | N/A | 41.8 s |
| KFold | 0.810696605647 | 0.850359334999 |  N/A | N/A | 27.9 s |
| StratifiedKFold | 0.821106364604 | 0.846951050245 | N/A | N/A | 27.2 s |
| RepeatedKFold | 0.844084943633 | 0.849850405601 | 0.850562126984 | 0.850806575985 | 53.7 s |

As seen in this table, the shuffle splits did well against the k-fold cross-validators. However, k-fold accuracy improved on the second split. Again, RepeatedKFold did well against the shuffle splits but executed more slowly. This suggests that the k-folds may benefit from additional splits (the default for a StratifiedKFold, for example, was 3 in older scikit-learn releases and is 5 from version 0.22 onward, and 10 is a commonly suggested number of splits).

<a id="Section 4.1.4: Cross-validation Selection"></a>

### Section 4.1.4: Cross-validation Selection

Based on the cross-validation analysis above, we will be using the StratifiedShuffleSplit cross-validator for the remaining models. Hence, the cell below sets the cross-validator to a StratifiedShuffleSplit with 1 split (the minimum) and with an 80% / 20% training / test size.

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

cv = StratifiedShuffleSplit(n_splits=1, test_size=0.2)   # Use 1 split (the minimum for this cv)

print(cv)
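
The cross-validator above is created without a random_state, so each call to cv.split draws a new random partition. If reproducible partitions were wanted, a fixed seed is one option. The sketch below uses toy labels, and the seed value 42 and the name `cv_fixed` are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels and features, used only to demonstrate the splitter.
toy_y = np.array([-1] * 30 + [1] * 20)
toy_X = np.arange(50).reshape(-1, 1)

# Same settings as the notebook's cv, plus a fixed (arbitrary) seed.
cv_fixed = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

first_train, first_test = next(cv_fixed.split(toy_X, toy_y))
second_train, second_test = next(cv_fixed.split(toy_X, toy_y))

# With an integer random_state, repeated calls return the same partition.
print(np.array_equal(first_test, second_test))
```

A fixed partition makes accuracy figures from different cells directly comparable, at the cost of tying every result to one particular split.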

<a id="Section 4.2: Classification"></a>

## Section 4.2: Classification

<a id="Section 4.2.1: K-Nearest Neighbors (KNN)"></a>

### Section 4.2.1: K-Nearest Neighbors (KNN)

The following cell uses the KNN classifier with Euclidean distance.

**Runtime Expectation:** The following cell runs in about 2 minutes.

In [None]:
%%time

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df_test = df_final.copy(deep=True)   # All five datasets (no Bag of Words)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

acc_l = []

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights='uniform', metric='euclidean')

        knn_clf.fit(X_train, y_train)

        y_hat = knn_clf.predict(X_test)

        acc = accuracy_score(y_test, y_hat)
        
        print('Accuracy of {}-NN classifier is: {}' .format(k, acc))
        
        acc_l.append([k, acc])

acc_l = sorted(acc_l, key=lambda l: l[1], reverse=True)   # Keep the sorted list (highest accuracy first)

print(acc_l)

df_acc = pd.DataFrame(acc_l)

print("\nThis model's most accurate classifier uses K ==", df_acc.iloc[0, 0], "with an accuracy of", df_acc.iloc[0,1])

<a id="Section 4.2.2: Multinomial Naive Bayes (MNB)"></a>

### Section 4.2.2: Multinomial Naive Bayes (MNB)

<a id="Section 4.2.2.1: Multinomial Naive Bayes (no BoWs)"></a>

#### Section 4.2.2.1: Multinomial Naive Bayes (no BoWs)

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

from sklearn.naive_bayes import MultinomialNB

df_test = df_final.copy(deep=True)   # <-- All 129,685 rows

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

clf_mnb = MultinomialNB(alpha=0.001)

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]

    clf_mnb.fit(X_train, y_train)

    y_hat = clf_mnb.predict(X_test)

    print('Accuracy of the Multinomial Naive Bayes classifier is:', accuracy_score(y_test, y_hat))

<a id="Section 4.2.2.2: Multinomial Naive Bayes (BoWs)"></a>

#### Section 4.2.2.2: Multinomial Naive Bayes (BoWs)

**Runtime Expectation:** The following cell runs in less than a second.

In [None]:
%%time

from sklearn.naive_bayes import MultinomialNB

df_test = df_final_bow.copy(deep=True)   # <-- All 129,685 rows

y = df_test['Dimension Index'].values
X = df_test[cols_bow[1:]].values   # <-- ALL attributes (except the target variable)

clf_mnb = MultinomialNB(alpha=0.001)

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]

    clf_mnb.fit(X_train, y_train)

    y_hat = clf_mnb.predict(X_test)

    print('Accuracy of the Multinomial Naive Bayes classifier is:', accuracy_score(y_test, y_hat))

As shown in the output above, the accuracy of the Multinomial Naive Bayes classifier against the Bag of Words is rather poor (compared to the previous models against the non-Bag of Words).

<a id="Section 4.2.3: Logistical Regression"></a>

### Section 4.2.3: Logistic Regression

**Runtime Expectation:** The following cell runs in about 30 seconds.

In [None]:
%%time

df_test = df_final.copy(deep=True)   # <-- All 129,685 rows

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

lr_clf = LogisticRegression()   # Default parms

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    lr_clf.fit(X_train, y_train)  # train object

    y_hat = lr_clf.predict(X_test) # get test set predictions
    
    acc = mt.accuracy_score(y_test, y_hat)
    conf = mt.confusion_matrix(y_test, y_hat)

    print("Accuracy:", acc )
    print("Confusion matrix:\n", conf)

As shown above, the accuracy score for this Logistic Regression is about 84% with default parameters. Accuracy can be calculated from the confusion matrix as (True Positives + True Negatives) / (the sum of all four values: True Positives, True Negatives, False Positives, and False Negatives), for example (1,310 + 2,673) / (1,310 + 2,673 + 343 + 183).
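
The accuracy formula can be worked through with the four counts quoted above. In the sketch below, the arrangement of the counts in the matrix is an assumption (only the diagonal versus off-diagonal placement matters for accuracy), and the names `conf_example` and `accuracy` are hypothetical. Note that these particular counts evaluate to roughly 0.88, so they illustrate the formula rather than reproduce the 84% figure.

```python
import numpy as np

# Correct predictions sit on the diagonal; errors sit off the diagonal.
# Which diagonal entry is "positive" is an assumption and does not affect accuracy.
conf_example = np.array([[1310, 343],
                         [183, 2673]])

correct = np.trace(conf_example)   # 1310 + 2673
total = conf_example.sum()         # All four cells

accuracy = correct / total
print(correct, total, round(accuracy, 4))
```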

<a id="Section 4.2.3.1: Attributes Weights"></a>

#### Section 4.2.3.1: Attributes Weights

The cell below prints out the weight for each attribute in the Logistic Regression.

**Runtime expectation:** The cell below runs in less than a second.

In [None]:
%%time

weights = lr_clf.coef_.T   # Transpose to make a column vector

variable_names = df_final.columns[1:]   # Skip the target column so the names align with the weights

for coef, name in zip(weights, variable_names):
    print(name, 'has weight of', coef[0])

<a id="Section 4.2.3.2: Attribute Weights (with scaling)"></a>

#### Section 4.2.3.2: Attribute Weights (with scaling)

The cell below prints out the weight for each attribute in the Logistic Regression (after scaling). Note that accuracy improves slightly after scaling (about 0.88 vs. about 0.85).

**Runtime expectation:** The cell below runs in about 10 to 15 seconds.

In [None]:
%%time

scl_obj = StandardScaler()
scl_obj.fit(X_train)

X_train_scaled = scl_obj.transform(X_train)
X_test_scaled = scl_obj.transform(X_test)

lr_clf = LogisticRegression(penalty='l2', C=0.05)

lr_clf.fit(X_train_scaled, y_train)

y_hat = lr_clf.predict(X_test_scaled)

acc = mt.accuracy_score(y_test, y_hat)
conf = mt.confusion_matrix(y_test, y_hat)

print('Accuracy:', acc )
print(conf )

zip_vars = zip(lr_clf.coef_.T, df_final.columns[1:])   # Skip the target column so the names align with the weights
zip_vars = sorted(zip_vars)

for coef, name in zip_vars:
    print(name, 'has weight of', coef[0])
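
The scale-then-fit sequence above can also be expressed as a single scikit-learn pipeline, which fits the scaler on the training rows only and reuses those statistics for the test rows. This is a sketch on synthetic data, not the notebook's dataset; the data, the seed, and the names `scaled_lr`, `X_tr`, and `X_te` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale than the rest.
toy_X, toy_y = make_classification(n_samples=500, n_features=8, random_state=0)
toy_X[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

# The pipeline standardizes with training statistics, then fits the classifier.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(C=0.05))
scaled_lr.fit(X_tr, y_tr)

score = scaled_lr.score(X_te, y_te)
toy_weights = scaled_lr.named_steps['logisticregression'].coef_[0]

print('Accuracy:', score)
print(toy_weights)
```

Because the features are standardized before fitting, the magnitudes of the weights can be compared with one another, which is what makes the sorted list in this section meaningful.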

<a id="Section 4.2.3.3: Plot the Weights (with scaling)"></a>

#### Section 4.2.3.3: Plot the Weights (with scaling)

With over 100 attributes, the following bar plot is not very pretty. (The labels, for example, are unreadable.) It does, however, give a visual clue into the attributes and their weights.

**Runtime expectation:** The cell below runs in just a few seconds.

In [None]:
%%time

%matplotlib inline

plt.style.use('ggplot')

weights = pd.Series(lr_clf.coef_[0], index=df_final.columns[1:])

weights.plot(kind='bar')

plt.show()

<a id="Section 4.2.3.4: Interpreting the Weights"></a>

#### Section 4.2.3.4: Interpreting the Weights

The first 10 attributes, sorted by weight after scaling (most negative first), are shown below:

> Motion Distribution-Variance has weight of -2.51974347148  
> Short time energy-Mean has weight of -1.39512309456  
> ZCR-Variance has weight of -1.33520317615  
> Motion Distribution-Bin 13 has weight of -1.20928601292  
> Attribute 58 should be Bin 40 has weight of -0.818619501619  
> Text area distribution-Bin 7-Mean has weight of -0.712396779843  
> Text area distribution-Bin 12-Variance has weight of -0.690287908028  
> Spectral Flux-Variance has weight of -0.678467583285  
> Spectral Roll off-Variance has weight of -0.572899154928  
> Fundamental Frequency-Variance has weight of -0.391339499589  

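Surprisingly, the attribute 'Shot Length' is not among these 10 attributes. Because the list is sorted by signed weight, it shows the 10 most negative weights rather than the 10 largest weights by magnitude.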
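One way to rank attributes by importance regardless of sign is to sort by absolute weight. The sketch below shows the idea on a small, made-up set of weights. The feature names and values are illustrative only (they are not results from this notebook), and `rank_by_magnitude` is a hypothetical helper.

```python
import pandas as pd

def rank_by_magnitude(weights, top_n=10):
    """Return the top_n weights with the largest absolute value (sign preserved)."""
    order = weights.abs().sort_values(ascending=False).index
    return weights.loc[order].head(top_n)

# Illustrative values only (assumption); in the notebook this would be something like
# pd.Series(lr_clf.coef_[0], index=df_final.columns[1:])
example_weights = pd.Series(
    {'feature_a': -2.5, 'feature_b': 0.4, 'feature_c': 1.9, 'feature_d': -0.1}
)

top = rank_by_magnitude(example_weights, top_n=3)
print(top)
```

Sorting this way keeps large positive weights in view, which a signed ascending sort pushes to the bottom of the list.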

<a id="Section 4.2.3.5: Adjusting Parameters (the C value)"></a>

#### Section 4.2.3.5: Adjusting Parameters (the C value)

The Logistic Regression model above used default parameters in the classifier. The cell below steps the C value from 0.05 to 1.50 in increments of 0.05 to find the value with the highest accuracy in that range.

**Runtime expectation:** The following cell runs in about 20 to 25 minutes.

In [None]:
%%time

def do_lr(c):
    lr_clf = LogisticRegression(C=c)   # Default parms, except the C value

    for train_indices, test_indices in cv.split(X, y):
        X_train = X[train_indices]
        y_train = y[train_indices]
    
        X_test = X[test_indices]
        y_test = y[test_indices]

        lr_clf.fit(X_train, y_train)

        y_hat = lr_clf.predict(X_test)

        acc = mt.accuracy_score(y_test, y_hat)
        conf = mt.confusion_matrix(y_test, y_hat)

        # Note: returning inside the loop means only the first CV split is scored for each C value
        return acc, conf

############################################################################################################
############################################################################################################

c_parms = list(range(5, 155, 5))   # 5, 10, ..., 150 (C values in hundredths: 0.05 to 1.50)

highest_acc = 0.0
highest_c = 0.0
highest_conf = [[]]

for i in np.arange(0, 30):
    acc, conf = do_lr(c_parms[i] / 100)

    if acc > highest_acc:
        highest_acc = acc
        highest_c = c_parms[i]
        highest_conf = conf
        
print('Highest accuracy:', highest_acc, ', C:', highest_c, ', Confusion matrix:', highest_conf)

The cell above captures the C value with the highest accuracy, which is generally close to the default. Note that the printed C is expressed in hundredths (for example, 100 corresponds to C=1.0).
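As a hedged alternative, the same search can average the accuracy over every cross-validation split for each C value, rather than scoring a single split. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`. The splitter settings (`n_splits`, `test_size`) and the `max_iter` value are assumptions, and `best_c` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in data (assumption); the notebook would use its own X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

# Assumed splitter settings; the notebook defines its own cv object elsewhere
cv_demo = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=1)

def best_c(X, y, c_values, cv):
    """Return (best C, mean accuracy), where accuracy is averaged over all CV splits."""
    best = (None, -np.inf)
    for c in c_values:
        clf = LogisticRegression(C=c, max_iter=1000)
        mean_acc = cross_val_score(clf, X, y, cv=cv, scoring='accuracy').mean()
        if mean_acc > best[1]:
            best = (c, mean_acc)
    return best

c_values = np.arange(5, 155, 5) / 100   # 0.05, 0.10, ..., 1.50
c_star, acc_star = best_c(X_demo, y_demo, c_values, cv_demo)
print('Best C:', c_star, 'mean accuracy:', acc_star)
```

Averaging over splits makes the comparison between C values less sensitive to one lucky or unlucky split, at the cost of a longer runtime.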

<a id="Section 4.2.3.6: Adjusting Parameters (the solvers)"></a>

#### Section 4.2.3.6: Adjusting Parameters (the solvers)

The cell below uses default parameters except for the solver='sag' (Stochastic Average Gradient descent solver). 

**Runtime Expectation:** The following cell runs in about 1 minute.

In [None]:
%%time

lr_clf = LogisticRegression(solver='sag')    # Default parms: penalty='l2', C=1.0, class_weight=None, 

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    lr_clf.fit(X_train, y_train)  # train object

    y_hat = lr_clf.predict(X_test) # get test set predictions

    acc = mt.accuracy_score(y_test, y_hat)
    conf = mt.confusion_matrix(y_test, y_hat)

    print("Accuracy:", acc )
    print("Confusion matrix:\n", conf)

As shown in the cell above, Logistic Regression with solver='sag' has an accuracy of about 0.78, which is much lower than the default solver='liblinear' with an accuracy generally greater than 0.88.

The cell below uses default parameters except for the solver='saga' (another Stochastic Average Gradient descent solver).

**Runtime Expectation:** The following cell runs in about 30 to 40 seconds.

In [None]:
%%time

lr_clf = LogisticRegression(solver='saga')   # Default parms: penalty='l2', C=1.0, class_weight=None,

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    
    lr_clf.fit(X_train, y_train)  # train object

    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    
    acc = mt.accuracy_score(y_test, y_hat)
    conf = mt.confusion_matrix(y_test, y_hat)

    print("Accuracy:", acc)
    print("Confusion matrix:\n", conf)

As shown in the cell above, solver='saga' has an accuracy of about 0.77, which is also much lower than the default solver='liblinear' with an accuracy generally around 0.85.
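One possible contributor to the lower 'sag' and 'saga' scores (an assumption, not something verified in this notebook) is that the cells above fit on unscaled features. The scikit-learn documentation notes that fast convergence for these two solvers is only guaranteed when features are on approximately the same scale. A minimal sketch of combining scaling with 'saga' in a pipeline follows, using synthetic stand-in data. The splitter settings and `max_iter` value are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the notebook would use its own X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
cv_demo = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=1)

# Inside cross_val_score, the scaler is fit on each training split only,
# then applied to the matching test split
saga_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000, random_state=1),
)

scores = cross_val_score(saga_pipe, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
print('Mean accuracy with scaling + saga:', scores.mean())
```

Wrapping the scaler in the pipeline also avoids fitting it on test rows, which a single `StandardScaler().fit(X)` call before splitting would do.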

<a id="Section 4.2.4: Neural Network"></a>

### Section 4.2.4: Neural Network

**Runtime Expectation:** The following cell runs in about 10 to 15 seconds.

In [None]:
%%time

from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier()

df_test = df_final.copy(deep=True)   # All five datasets (no Bag of Words)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable)

for train_indices, test_indices in cv.split(X, y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]

    mlp_clf.fit(X_train, y_train)

    y_hat = mlp_clf.predict(X_test)

    print('Accuracy of Neural Network classifier is: {}'.format(mt.accuracy_score(y_test, y_hat)))

<a id="Section 4.3: Clustering"></a>

## Section 4.3: Clustering

<a id="Section 4.3.1: Baseline (Random Forest)"></a>

### Section 4.3.1: Baseline (Random Forest)

The code below runs a Random Forest classification to create a baseline. Because each of the following clustering solutions uses a Random Forest classifier, this score can be used to validate whether the clustering solution is improving classification (or not).

Note that this baseline is a bit different from the baseline in <a href="#Section 4.1.2: Baseline using a Random Forest">Section 4.1.2: Baseline using a Random Forest</a>, as this code uses the final dataset (all five broadcast datasets) rather than a subset (BBC). Using the final dataset produces an accuracy score that is quite a bit higher (~95% vs. ~87%).

**Runtime Expectation:** The following cell runs in about 3 to 4 minutes.

In [None]:
%%time

df_test = df_final.copy(deep=True)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable: Dimension Index)

rf_clf = RandomForestClassifier(n_estimators=150, random_state=1)

acc = cross_val_score(rf_clf, X, y=y, cv=cv)

print("Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)

The following (handy) routine was found at this web site: <a href="http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html">Comparing Python Clustering Algorithms</a>.

In [None]:
import seaborn as sns
import sklearn.cluster as cluster
#import time

%matplotlib inline
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.25, 's' : 80, 'linewidths':0}

def plot_clusters(data, algorithm, args, kwds):
    labels = algorithm(*args, **kwds).fit_predict(data)
    palette = sns.color_palette('deep', np.unique(labels).max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
    plt.scatter(data.T[0], data.T[1], c=colors, **plot_kwds)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
    plt.title('Clusters found by {}'.format(str(algorithm.__name__)), fontsize=24)

<a id="Section 4.3.2: DBSCAN Clustering"></a>

### Section 4.3.2: DBSCAN Clustering

<a id="Section 4.3.2.1: DBSCAN Model"></a>

#### Section 4.3.2.1: DBSCAN Model

The cell below computes a DBSCAN clustering model. The output (i.e., the labels) from that model is then used as a new feature in a Random Forest classifier to score the impact of clustering on classification (commercial or non-commercial).

**Runtime Expectation:** The following cell runs in about 5 to 6 minutes.

In [None]:
%%time

from sklearn.cluster import DBSCAN

df_test = df_final.copy(deep=True)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable: Dimension Index)

# Compute DBSCAN

dbscan_model = DBSCAN(eps=0.15, min_samples=10).fit(X)

dbscan_labels = dbscan_model.labels_

print(dbscan_labels)

# Number of clusters in labels, ignoring noise if present

n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)   # -1 is noise, so ignore

print('Estimated number of clusters: %d' % n_clusters)

# Add the labels produced from the clustering model as new attributes

X1 = np.column_stack((pd.get_dummies(dbscan_labels), X))

acc = cross_val_score(rf_clf, X1, y=y, cv=cv)   # rf_clf == RandomForestClassifier; cv=StratifiedShuffleSplit

print("Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)

As shown above, DBSCAN produced two clusters, which matches the number of classes in the data: Commercial (+1) and Non-Commercial (-1). Adding the DBSCAN labels as a feature did not materially change the Random Forest accuracy (both scores are about 97%).
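Finding two clusters does not by itself show that the clusters line up with the Commercial and Non-Commercial labels. One hedged way to check is to cross-tabulate the cluster labels against the target and compute the adjusted Rand index. The arrays below are toy values for illustration only. In the notebook, the inputs would presumably be `dbscan_labels` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Toy class labels (assumption): +1 = commercial, -1 = non-commercial
y_demo = np.array([1, 1, 1, -1, -1, -1, 1, -1])

# Toy cluster assignments; for DBSCAN, a label of -1 would mean "noise"
cluster_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Rows are clusters, columns are true classes
table = pd.crosstab(cluster_demo, y_demo, rownames=['cluster'], colnames=['class'])
print(table)

# 1.0 means the clusters reproduce the classes exactly;
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(y_demo, cluster_demo)
print('Adjusted Rand index:', ari)
```

If the clusters carried the same information as the target, the table would be close to diagonal. A table with mixed rows suggests the clusters capture some other structure in the data.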

<a id="Section 4.3.2.2: DBSCAN Visualization"></a>

#### Section 4.3.2.2: DBSCAN Visualization

**Runtime Expectation:** The following cell runs in about 30 to 40 seconds.

In [None]:
%%time

data = np.column_stack((dbscan_labels, X))

plot_clusters(data, cluster.DBSCAN, (), {'eps':0.15})

<a id="Section 4.3.3: K-Means Clustering"></a>

### Section 4.3.3: K-Means Clustering

<a id="Section 4.3.3.1: K-Means Model"></a>

#### Section 4.3.3.1: K-Means Model

The cell below computes a K-Means clustering model. The output (i.e., the labels) from that model is then used as a new feature in a Random Forest classifier to score the impact of clustering on classification (commercial or non-commercial).

K-Means requires the number of clusters as input into the model. We set n_clusters=2 (i.e., commercial or non-commercial).

**Runtime Expectation:** The following cell runs in about 5 to 6 minutes.

In [None]:
%%time

from sklearn.cluster import KMeans

df_test = df_final.copy(deep=True)

y = df_test['Dimension Index'].values
X = df_test[cols[1:]].values   # <-- ALL attributes (except the target variable: Dimension Index)

model = KMeans(n_clusters=2, init='k-means++', random_state=1).fit(X)

km_labels = model.labels_

print(km_labels)

n_clusters = len(set(km_labels)) - (1 if -1 in km_labels else 0)

print('Estimated number of clusters: %d' % n_clusters)

# Add the labels produced from the clustering model as new attributes

X1 = np.column_stack((pd.get_dummies(km_labels), X))   # use the K-Means labels (not the DBSCAN labels)

acc = cross_val_score(rf_clf, X1, y=y, cv=cv)   # rf_clf == RandomForestClassifier; cv=StratifiedShuffleSplit

print("Average accuracy = ", acc.mean()*100, "+-", acc.std()*100)
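K-Means needs the number of clusters up front, and the cell above sets it to 2 to match the two classes. A hedged way to check whether another value fits the data better is to compare silhouette scores across several values of k. The sketch below does this on synthetic blobs, not on the notebook's data. The centers, sample count, and range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups (assumption)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=1,
)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=1).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('k with the highest silhouette score:', best_k)
```

On a dataset as large as this one, computing the silhouette over all rows is expensive. The `sample_size` parameter of `silhouette_score` can be used to score a random subset instead.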

<a id="Section 4.3.3.2: K-Means Visualization"></a>

#### Section 4.3.3.2: K-Means Visualization

**Runtime Expectation:** The following cell runs in about 30 to 40 seconds.

In [None]:
%%time

data = np.column_stack((km_labels, X))
                     
plot_clusters(data, cluster.KMeans, (), {'n_clusters':2})

<a id="Section 5: Evaluation"></a>

## Section 5: Evaluation

<a id="Section 5.1: Classification"></a>

### Section 5.1: Classification

The target variable for this dataset is Dimension Index, which has a binary classification of Commercial (+1) or Non-Commercial (-1). As a result, all our models were based on classification algorithms.

In the baseline analysis, we used Random Forest and Logistic Regression algorithms to perform baseline classifications. Then we used K-Nearest Neighbors, Multinomial Naive Bayes (MNB), and a deep dive into Logistic Regression to perform additional classifications.

Here are some observations on model performance (no Bag of Words):

* The baseline Random Forest accuracy was about 88% (87.8950338600 in one execution).  
* The baseline Logistic Regression accuracy was about 84% (0.835678760072 in one execution).  
* The K-Nearest Neighbor accuracy was about 87% (0.874040945368 in one execution).  
* The Multinomial Naive Bayes (MNB) accuracy was about 71% (0.704052126306 in one execution).  
* The Logistic Regression accuracy was about 85% (0.850608392579 in the best execution).  
* The Neural Network accuracy was about 82% (0.8275051085322127 in one execution).

Here are some observations on model performance (Bag of Words):

* The Multinomial Naive Bayes (MNB) accuracy was about 71% (0.70983537032 in one execution).  

We hypothesized that the Bag of Words would have much better accuracy and that Multinomial Naive Bayes would be a good classifier for that use case. At 71% accuracy both with and without the Bag of Words, however, modeling against the Bag of Words did not add much value, and the Multinomial Naive Bayes classifier does not appear to be a good algorithm for this dataset.

In general, the baseline Random Forest model performed better than all the other algorithms. The K-Nearest Neighbor and Logistic Regression models performed well too. The only poor performing algorithm was the Multinomial Naive Bayes (probably due to its inability to automatically learn feature interactions).

<a id="Section 5.2: Clustering"></a>

In [None]:
Image(url='http://www.shookfamily.org/data/algo_compare.png')

### Section 5.2: Clustering

We used DBSCAN and K-Means clustering models to explore if they could enhance classification via feature engineering. We also tried HDBSCAN and Affinity Propagation; however, we could not execute these clustering models due to memory problems.

* DBSCAN produced two clusters, which is the expected output. Using these cluster labels for feature engineering, however, did not materially improve Random Forest classification.
* K-Means also produced two clusters, but it also failed to improve Random Forest classification.

Random Forest classification with DBSCAN and K-Means labels had about a 95% accuracy (the same as the Random Forest baseline without clustering). Hence, clustering had no material impact on the accuracy of Random Forest classification.

While clustering did not have a material impact on accuracy, it was interesting to explore how clustering can be used as feature engineering.

<a id="Section 6: Deployment"></a>

## Section 6: Deployment

<a id="Section 6.1: Value Proposition"></a>

### Section 6.1: Value Proposition

When using the final dataset with all five broadcasting companies, the Random Forest classifier was about 95% accurate.

That is a very high accuracy score and suggests that the model is potentially very useful.

<a id="Section 6.2: Potential Usefulness"></a>

### Section 6.2: Potential Usefulness

When considering how useful the models in this notebook are to interested parties, it is first important to consider which companies or organizations would be interested in this analysis.

The ability to accurately detect and recognize commercials in a live (or slightly delayed) television feed is something that many consumers would be interested in. Services like TiVo, Hulu, and Digital Video Recorders (DVRs) allow the average consumer to collect and watch television content in one place. Many of them provide the option to fast forward in an attempt to skip commercials.

Although advertisers may not like a future where pre-recorded or even delayed TV can be viewed devoid of advertisements, a model similar to ours can help make that possible with little or no additional work from content providers.

Although our analysis is currently performed on a static dataset, it does have a potentially useful business application. If a company like TiVo or another DVR provider implemented this to identify and delete commercials from the videos their customers record to their devices, it would be a very useful feature to many consumers.

The next step would be to take the model and implement it into a stream analysis. Streaming analytics would provide an organization with the technology to enable the action of automated commercial removal based on the analysis of a series of video events that have just happened (so a delay would be required from a live feed to allow for analysis and processing time).

A streaming analytics model like this would benefit from data from numerous TV networks and commercial sources such as local and regional TV shows and commercials. This model is currently supervised and would require updating and input to maintain its accuracy.