# Cluster Analysis: Partitioning (Segmentation)

Create a set of functions that can be used together to segment satellite images into similar regions using k-means clustering, and then create and apply a color mask to areas of water. Specifically:

1. Determine which image pre-processing method does the best job of smoothing similar colors
    - sub-regions in satellite images tend to have a lot of color variation / texture that can negatively impact the performance of segmentation when using something like k-means clustering
    - use scikit-image for this
2. Create a pre_process function that returns a pre-processed version of the image and takes the following parameters
    - img : the input image
    - p : whatever parameter belongs to the method you chose from (1)
    - multichannel : Bool for whether or not the image has multiple channels (e.g. RGB)
        + only if necessary
3. Create one or more functions that together are used for segmenting an image using k-means clustering
4. Create a function to help automate the selection of parameters to use in the method from (1) and for k-means 
    - it should iterate over a set of 6 possible test parameter combinations
        + each combination is (pre-processing parameter, n_clusters for kmeans)
    - each iteration should segment the provided satellite image using the given combination
    - return a single image that displays the segmented versions in a 3x2 grid
    - visually inspect the 6 versions and decide on the best combination to use
5. Use the parameters determined above to create a version of the original image that has a single-colored mask wherever water appears in the image
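
Steps 3 to 5 could be sketched roughly as follows. This is only one possible layout, not a reference solution: the function names `segment_kmeans`, `segmentation_grid`, and `mask_cluster` are hypothetical, the pre-processing function from step 2 is assumed to be passed in as a callable `pre_process(img, p)`, the grid is assumed to be 3 rows by 2 columns, and `mask_cluster` assumes an RGB image.

```python
import numpy as np
from sklearn.cluster import KMeans


def segment_kmeans(img, n_clusters, random_state=0):
    """Cluster pixel colors with k-means.

    Returns the per-pixel label map, the image recolored with its
    cluster centers, and the cluster centers themselves.
    """
    arr = np.asarray(img, dtype=float)
    h, w = arr.shape[:2]
    pixels = arr.reshape(h * w, -1)  # one row per pixel, one column per channel
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(pixels)
    labels = km.labels_.reshape(h, w)
    segmented = km.cluster_centers_[km.labels_].reshape(arr.shape)
    return labels, segmented, km.cluster_centers_


def segmentation_grid(img, combos, pre_process):
    """Tile the segmented results of 6 (p, n_clusters) pairs as 3 rows x 2 columns.

    `pre_process` is assumed to be a callable with signature pre_process(img, p).
    """
    if len(combos) != 6:
        raise ValueError("expected exactly 6 (p, n_clusters) combinations")
    tiles = []
    for p, n_clusters in combos:
        smoothed = pre_process(img, p)
        _, segmented, _ = segment_kmeans(smoothed, n_clusters)
        tiles.append(segmented)
    rows = [np.hstack(tiles[i:i + 2]) for i in range(0, 6, 2)]  # 2 tiles per row
    return np.vstack(rows)  # 3 rows stacked vertically


def mask_cluster(img, labels, cluster_id, color=(255, 0, 0)):
    """Return a copy of an RGB image with one cluster painted a single color."""
    out = np.array(img, copy=True)
    out[labels == cluster_id] = color
    return out
```

The water cluster still has to be identified, for example by inspecting the returned cluster centers or the segmented image, before its id is passed to `mask_cluster`.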

# Cluster Analysis: Target Study (Young People Survey) using agglomerative clustering

The dataset provided can be summarized as follows:

In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey.

* The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
* For convenience, the original variable names were shortened in the data file. See the columns.csv file if you want to match the data with the original names.
* The data contain missing values.
* The survey was presented to participants in both electronic and written form.
* The original questionnaire was in Slovak language and was later translated into English.
* All participants were of Slovakian nationality, aged between 15 and 30.
* Numerical columns are primarily in the range [1, 5], except for values like weight and height.

Your task is to use agglomerative (hierarchical) cluster analysis to analyze this data. Specifically:

1. Modify the original data
    - Remove any categorical variables from the data
        - categorical doesn't work with this clustering method
        - we aren't concerned with creating dummy vars
    - Add a new column for gender, but make it binary
    - Remove any rows with null values
2. Use scipy to cluster and create a dendrogram of the cluster hierarchy for the data
    - use the ward method
    - exclude gender for now
    - plot the dendrogram
    - determine a good cutoff value
    - re-plot the dendrogram with a line for the cutoff and determine the number of clusters this gives
3. Create a new dendrogram that truncates using the determined number of clusters and show number of points per cluster
    - hint: check the truncate_mode options
4. Use the scipy `hierarchy.fcluster` method to get cluster labels for 16 clusters from the data
    - hint: threshold
    - create a column for the labels in the data
    - compare the class distributions between genders
    - reset the threshold and create a new set of labels that will give only two classes
        - add this as a second column for labels
    - compare the distributions of these two classes between genders
        - does it seem that the two top-level clusters are gender-specific?
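
A minimal sketch of the steps above, under several assumptions: the gender column is assumed to be named `Gender` with values `female` and `male`, and the helper names `prepare`, `threshold_for_k`, `ward_labels`, and `truncated_leaf_labels` are hypothetical. The distance threshold for a given number of clusters is taken halfway between two consecutive merge heights in the linkage matrix, which is one way to follow the threshold hint.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy


def prepare(df, gender_col="Gender", female_value="female", male_value="male"):
    """Keep numeric columns, add a binary gender column, drop rows with nulls.

    The column name and category values are assumptions about the data file.
    """
    numeric = df.select_dtypes(include="number").copy()
    numeric["gender_bin"] = df[gender_col].map({female_value: 1, male_value: 0})
    return numeric.dropna().astype({"gender_bin": int})


def threshold_for_k(Z, k):
    """Distance cutoff halfway between the merges that leave k and k - 1 clusters."""
    if not 2 <= k <= Z.shape[0]:
        raise ValueError("k must be between 2 and the number of merges")
    return (Z[-k, 2] + Z[-(k - 1), 2]) / 2.0


def ward_labels(features, ks=(16, 2)):
    """Ward linkage plus flat cluster labels for each requested cluster count."""
    Z = hierarchy.linkage(features, method="ward")
    labels = {
        k: hierarchy.fcluster(Z, t=threshold_for_k(Z, k), criterion="distance")
        for k in ks
    }
    return Z, labels


def truncated_leaf_labels(Z, p):
    """Leaf labels of a dendrogram truncated to the last p merged clusters.

    Non-singleton leaves are labelled with their point count, e.g. "(12)".
    """
    tree = hierarchy.dendrogram(
        Z, truncate_mode="lastp", p=p, show_leaf_counts=True, no_plot=True
    )
    return tree["ivl"]
```

With `data = prepare(raw)` and `Z, labels = ward_labels(data.drop(columns="gender_bin").to_numpy())`, the labels can be stored as new columns and compared with something like `pd.crosstab(data["cluster_2"], data["gender_bin"])`. Dropping `no_plot=True` draws the dendrogram with matplotlib, where a horizontal line at the chosen cutoff can be added.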

# Data Summarization / Color Quantization

Given a set of images, use k-means to perform color quantization to reduce the number of distinct colors in each image. Specifically:

1. Create a quantize function that takes an image and the desired number of colors as parameters and
    - performs color quantization using k-means
        - reduce the number of colors in the image
    - return the color-reduced image
    - make sure it can handle both greyscale and multichannel images
    - run this function on the image titled bw.jpg and view the result
2. Create a batch_reduce function that takes a list of file names and number of desired colors as parameters and
    - imports/opens each image in the list of file names
    - uses the quantize function to reduce the number of colors in each image
    - saves the original and reduced images in separate folders
3. Run the batch_reduce on everything in the provided images folder and then compare original and reduced file sizes
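
As an illustration of step 1 only, a `quantize` function might look like the sketch below. It assumes the image is already loaded as a NumPy array (2-D for greyscale, 3-D with channels last otherwise); the `random_state` parameter is an addition for reproducibility and is not part of the task description. File handling for `batch_reduce` is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans


def quantize(img, n_colors, random_state=0):
    """Reduce an image to at most n_colors distinct colors using k-means."""
    arr = np.asarray(img)
    # Greyscale images have no channel axis, so treat them as one channel.
    n_channels = 1 if arr.ndim == 2 else arr.shape[-1]
    pixels = arr.reshape(-1, n_channels).astype(float)

    km = KMeans(n_clusters=n_colors, n_init=10, random_state=random_state).fit(pixels)
    palette = km.cluster_centers_

    # For integer images, round the palette and keep it inside the dtype's range.
    if np.issubdtype(arr.dtype, np.integer):
        info = np.iinfo(arr.dtype)
        palette = np.clip(np.rint(palette), info.min, info.max)

    # Replace every pixel by the center of the cluster it belongs to.
    return palette[km.labels_].reshape(arr.shape).astype(arr.dtype)
```

Each pixel is replaced by its cluster center, so the output has the same shape and dtype as the input but no more than `n_colors` distinct colors.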

# Clustering with PCA and Machine Learning

Create a simple custom pipeline that does the following with the survey data:

- train/test split using 67% of the data for training
- perform PCA and select the number of components that retain at least 90% of the variance
- perform k-means clustering on the PCA training set (scikit-learn) with 16 clusters
- apply the cluster labels to the training set
- fit the labeled training set using a Random Forest Classifier
- make predictions on the test set using both the k-means and rfc models
- compare the two model predictions using class-wise precision, recall, and f1
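
One way these steps might fit together is sketched below. The function name `kmeans_rfc_pipeline` and the returned dictionary keys are hypothetical, the input is assumed to be a numeric array with no missing values, and no feature scaling is applied before PCA because the task does not mention it. The comparison treats the k-means predictions as the reference labels, which is an assumption about what "compare" means here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def kmeans_rfc_pipeline(X, n_clusters=16, random_state=0):
    """Split, reduce with PCA, cluster with k-means, then imitate with a random forest."""
    # 67% of the rows go to training, the rest to testing.
    X_train, X_test = train_test_split(X, test_size=0.33, random_state=random_state)

    # A float n_components keeps enough components to retain at least 90% of the variance.
    pca = PCA(n_components=0.90, svd_solver="full").fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

    # Cluster the reduced training data; the cluster ids become the training labels.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(Z_train)
    rfc = RandomForestClassifier(random_state=random_state).fit(Z_train, km.labels_)

    km_pred = km.predict(Z_test)
    rfc_pred = rfc.predict(Z_test)

    # Class-wise precision, recall, and f1, with k-means as the reference.
    report = classification_report(km_pred, rfc_pred, zero_division=0)
    return {"pca": pca, "km_pred": km_pred, "rfc_pred": rfc_pred, "report": report}
```

Printing `result["report"]` shows how closely the classifier reproduces the k-means assignment for each cluster on the held-out rows.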