NAME: __FULLNAME__

# Homework 1

### Objectives
* Basic numpy operations to access data
* Basic plotting of subsets of data
* Simple descriptive statistics
* Do not save work within the mlp_2020 folder
  + create a folder in your home directory for assignments, and copy the templates there  

### General References
* [Sci-kit Learn Iris Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
  + [Plot](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html)
  + [Boxplots](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html)
  + [Histograms](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist)
  + [Scatter plots](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter)
  + [Colormap Plots](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.imshow.html)

### Hand-In Procedure
* Execute all cells so they are showing correct results
* Notebook:
  + Submit this file (.ipynb) to the Canvas HW1 dropbox
* PDF:
  + File/Export Notebook As/PDF -> Produces a copy of the notebook in PDF format
  + Submit the PDF file to the Gradescope HW1 dropbox


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython import get_ipython
from sklearn.datasets import load_iris

# LOAD IRIS DATA SET

In [None]:
"""
Load the dataset into the iris_dataset variable, by calling the 
load_iris() function imported from sklearn.datasets.
Then display the iris_dataset object's list of keys. iris_dataset
is a dictionary object.
"""
iris_dataset = # TODO
iris_dataset.keys()

### Dataset Details
The `iris_dataset` variable is a dictionary with multiple fields:
* `data` : m by n numpy array of the n observed feature values, for each of the m samples  
* `target` : m by 1 numpy array of samples' classification as iris-setosa (i.e. 0), iris-versicolor (i.e. 1), or iris-virginica (i.e. 2)
* `target_names` : 3 by 1 numpy array of the possible iris classifications  
* `DESCR` : string containing a detailed description of the dataset  
* `feature_names` : n by 1 numpy array of the names of the feature variables  
* `filename` : string containing the absolute path to where the file containing all the data information is located on the local system  

In [None]:
"""
Print out the description of the data, by accessing the 
'DESCR' field of the iris data set
"""
# TODO

## SETUP USEFUL VARIABLES

In [None]:
"""
Store the names of the features and the names 
of the target classes, into the variables
feature_names and target_names respectively.
"""
feature_names =  # TODO
target_names = # TODO

"""
Print the list of feature names and target names
"""


In [None]:
""" 
Create variables for the feature and target data 
The X variable is a numpy array containing the data measured 
for each feature for each sample. Each column of X is a 
different feature for all the samples. Each row of X is a 
different sample with all its features.
The y variable is a numpy array containing the classification 
for each sample. A sample iris is either setosa, versicolor, or virginica.
""" 
X = # TODO
y = # TODO

"""
Print the dimensions of the X and y variables respectively
"""


In [None]:
""" 
Store the number of samples and the number of features, by
accessing the values from the shape of X
"""
nsamples = # TODO
nfeatures = # TODO

"""
Print the number of samples and the number of features respectively
"""


## SELECT SUBSET OF FEATURES
Not all available data is necessary or useful for making predictions and classifying observations. Numerous feature selection algorithms exist, and they will be discussed in more detail later in the course. For now, we arbitrarily select sepal length, sepal width, and petal length as our predictor variables. We will not perform any predictions in this assignment; the term "predictors" is simply a convenient way to distinguish this subset of features from the full set of features.
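
As a rough illustration of column selection, the sketch below uses a small made-up matrix rather than the iris data. The names `toy_X`, `toy_names`, and `keep` are hypothetical and exist only for this example. Indexing with a list of column indices returns just those columns, and a list comprehension collects the matching names.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) by 3 features (columns)
toy_X = np.array([[1.0, 10.0, 100.0],
                  [2.0, 20.0, 200.0],
                  [3.0, 30.0, 300.0],
                  [4.0, 40.0, 400.0]])
toy_names = ['a', 'b', 'c']

# Column indices of the features to keep
keep = [0, 2]

# All rows, only the selected columns
toy_subset = toy_X[:, keep]

# Names of the selected columns, in the same order
kept_names = [toy_names[i] for i in keep]

print(toy_subset.shape)
print(kept_names)
```

The row count is unchanged by the selection; only the number of columns shrinks to `len(keep)`.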

In [None]:
""" PROVIDED
Feature Column Indices
The values observed for each feature reside within a particular 
column of the feature matrix, X. For example, column 0 contains the 
values of the sepal length for each observation, the column at index 
3 contains the values for the petal width, and so on.
"""
sepal_length_idx = 0
sepal_width_idx = 1
petal_length_idx = 2

"""
Create a list of the select subset of features
"""
predictors = [sepal_length_idx, sepal_width_idx, petal_length_idx]

"""
Create a variable, npredictors, storing the number of predictors
"""
# TODO

"""
Create a list, pred_names, of the corresponding names for the selected 
set of features. This is conveniently done using a list comprehension
"""
# TODO

"""
Print the list of predictor names
"""
# TODO


## BASIC HISTOGRAMS OF FEATURES

In [None]:
""" TODO
HISTOGRAMS OF THE CHOSEN PREDICTOR FEATURES
Please plot histograms in their own subplot of 
the same figure.
"""
plt.figure(figsize=(20,4))
plt.subplots_adjust(wspace=.3)
for i, fidx in enumerate(predictors, 1):
    plt.subplot(1, 3, i)
    # TODO: Plot the histogram 


In [None]:
""" TODO
Create a histogram or barplot for the counts
for each target class
"""


## BASIC BOXPLOTS OF FEATURES
Boxplots or box-and-whisker plots are used to obtain a perspective of the distribution of the data.
The box within the figure displays the 25th percentile (Q1), the median, and the 75th percentile (Q3) of the data. The range between the 75th percentile value and the 25th percentile value is the interquartile range (IQR = Q3 - Q1). The bottom line (whisker) reaches down toward Q1 - 1.5 * IQR, and the top line reaches up toward Q3 + 1.5 * IQR; with matplotlib's default settings, each whisker ends at the most extreme data point that still lies within these limits. Points beyond the whiskers, drawn as circles, are suggested outliers.  
<center><img src="boxplot_diagram.png" style="width:30%;height:30%"></center>

One can use `boxplot(data_values, labels=[name])` to generate a boxplot. Here `data_values` is the set of observed values for a particular feature, and `labels` is a list containing the name of the feature in place of `name`.
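
To make the quantities in a boxplot concrete, the following sketch computes them by hand with numpy on a small made-up sample (the array `values` is hypothetical, not iris data). It assumes the 1.5 * IQR rule described above and numpy's default linear interpolation for percentiles.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points are flagged as suggested outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is a suggested outlier
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For this sample the only value outside the limits is 50, so it would be drawn as a circle above the top whisker.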

In [None]:
""" TODO
BOXPLOTS OF THE CHOSEN PREDICTOR FEATURES
Please place the boxplots within their own 
subplot of the same figure 
"""


## DESCRIPTIVE STATISTICS

In [None]:
# Simply run this cell
""" 
Create a separate variable of the data from the 
predictors
"""
Xpreds = X[:, predictors]

"""
Check if any values are NaN (not a number)
"""
np.any(np.isnan(Xpreds))

In [None]:
""" TODO
Compute the following descriptive statistics of the 
features ignoring NaN values, using numpy:
mean, median, standard deviation, min, and max

Make sure to compute the statistics of the columns
of X (i.e. of each feature). You can specify this 
by setting axis=0 for each of the functions

Compute and print the results
"""


## FEATURE CORRELATIONS
It's useful to know the correlation between various features, as well as between each feature and the predicted label. Feature correlation is useful for feature selection and for understanding the relationship between multiple variables within a dataset. Correlation is either positive, negative, or zero. When two features tend to increase together, they are positively correlated. When one feature increases while the other decreases, the features are negatively correlated. Zero correlation means there is no linear relationship between the features. Correlation lies in the range from -1 (perfect negative correlation) to 1 (perfect positive correlation).  
We can construct scatter plots of one feature versus another to observe linear or nonlinear relationships.
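
As a small worked example of the correlation range, the sketch below computes the Pearson correlation from its definition on made-up vectors and compares one result with `np.corrcoef`. The helper `pearson` and the arrays `x` and `z` are hypothetical names used only for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()
    db = b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# An increasing linear function of x: perfect positive correlation
r_pos = pearson(x, 2 * x + 1)

# A decreasing linear function of x: perfect negative correlation
r_neg = pearson(x, -3 * x)

# A noisier relationship: positive, but weaker than 1
z = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
r_mid = pearson(x, z)

# np.corrcoef returns a 2 by 2 matrix here; the off-diagonal entry is r
r_numpy = np.corrcoef(x, z)[0, 1]

print(r_pos, r_neg, r_mid, r_numpy)
```

The hand-computed value for `x` and `z` should agree with the off-diagonal entry returned by `np.corrcoef`, up to floating-point rounding.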

Complete the following set of scatter plots:
<center><img src="scatterplots.png"  style="width:50%;height:50%"></center>

In [None]:
"""
Using the scatter plot function, construct plots depicting the
correlation between all pairings of the selected predictor features
and between all predictors and the determined target.
The figure will contain r by r subplots, where r = npredictors + 1.
Subplot (i, j) is a scatter plot of feature i versus feature j.
When i == j, plot the histogram of feature i instead of a scatter plot.
We are also interested in the correlation between each of the features 
and the target classification, thus we will combine the predictors matrix
and the target vector into one large matrix for convenience.
"""
# Append the y to the end of the matrix of predictors
Xycombo = np.append(Xpreds, y.reshape(-1, 1), axis=1)
# Append the name 'target' to the end of the list of predictor names
Xycolnames = pred_names + ['target']

# Create the scatter plots
fig, axs = plt.subplots(npredictors+1, npredictors+1, figsize=(15, 15))
fig.subplots_adjust(wspace=.35)
for f1 in range(npredictors+1):
    for f2 in range(npredictors+1): 
        if f1 == f2:
            # TODO: plot the histogram of feature f1
        else:
            # TODO: plot the scatter plot between features

        # include labels only when necessary
        if f1 == npredictors:
            axs[f1, f2].set_xlabel(Xycolnames[f2])
        if f2 == 0:
            axs[f1, f2].set_ylabel(Xycolnames[f1])

## IMAGES AND COLORMAPS
Create a colormap plot of the correlations between 
all the predictors and the target

In [None]:
""" PROVIDED
Generate a figure that plots a correlation matrix
as a colormap.
PARAMS:
    corrs: matrix of correlations between the features
    varnames: list of the names of each of the features 
              (e.g. the column names)
"""
def correlationmap(corrs, varnames):
    nvars = corrs.shape[0]
    
    # create the figure and plot the correlation matrix
    fig, ax = plt.subplots()
    im = ax.imshow(corrs, cmap='RdBu', vmin=-1, vmax=1)
    cbar = ax.figure.colorbar(im, ax=ax)
    cbar.ax.set_ylabel("Pearson Correlation", rotation=-90, va="bottom")
    
    # Specify the row and column ticks and labels for the figure
    ax.set_xticks(range(nvars))
    ax.set_yticks(range(nvars))
    ax.set_xticklabels(varnames)
    ax.set_yticklabels(varnames)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(nvars):
        for j in range(nvars):
            text = ax.text(j, i, "%.3f" % corrs[i, j],
                           ha="center", va="center", color="k")
# END DEF correlationmap
            

""" TODO
Compute the Pearson correlation between the columns of Xycombo using
the numpy function corrcoef(). By default, the corrcoef() function 
computes the pairwise correlations between the rows of a matrix, thus 
you will need to transpose the input.
""" 
Xycorrs = # TODO

""" TODO
Call the function defined above, correlationmap(), to generate a 
colormap plot of the correlations between columns of the Xycombo matrix
"""
