In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

We will be using a dataset containing measurements for the diagnosis of inflammatory diseases of the urinary system. [Read the info here](https://archive.ics.uci.edu/ml/datasets/Acute+Inflammations).
Load `acute.data` (tab-delimited file) into a pandas DataFrame, using the following column names:

'Temperature', 'nausea', 'Lumbar pain', 'Urine pushing', 'Micturition pains', 'Burning', 'UBI', 'Neph'
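
A minimal sketch of the loading step, assuming the file has no header row. The two sample rows are made up for illustration and are read from an in-memory buffer so the snippet runs without the file; for the real data, pass the path `acute.data` to `read_csv` instead.

```python
import io

import pandas as pd

columns = ['Temperature', 'nausea', 'Lumbar pain', 'Urine pushing',
           'Micturition pains', 'Burning', 'UBI', 'Neph']

# Hypothetical sample rows standing in for acute.data (tab-delimited, no header)
sample = (
    "35,5\tno\tyes\tno\tno\tno\tno\tno\n"
    "37,9\tno\tno\tyes\tyes\tyes\tyes\tno\n"
)

# header=None stops pandas from treating the first data row as column names
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)
print(df.dtypes)
```

If the loaded values contain stray spaces or odd characters, the file's encoding may differ from the pandas default. The `encoding` argument of `read_csv` is the place to experiment, though which encoding applies is something to verify against the file itself.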

The data needs some cleaning before we can actually do something with it.

- First, fix the Temperature column by removing the spaces, replacing the comma (,) with a period (.) and changing the data type from string to float.
- Next, replace all the ' y e s ' values with 1 and the ' n o ' values with 0
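
The two cleaning steps above could look roughly like this. The small `raw` frame is a hypothetical stand-in with only two columns; on the real data the same loop would run over every yes/no column.

```python
import pandas as pd

# Hypothetical stand-in for the freshly loaded data
raw = pd.DataFrame({
    "Temperature": [" 3 5 , 5 ", " 4 0 , 0 "],
    "nausea": [" n o ", " y e s "],
})

# Remove spaces, swap the decimal comma for a period, then cast to float
raw["Temperature"] = (
    raw["Temperature"]
    .str.replace(" ", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)

# Strip the spaces from the yes/no columns and map them to 1/0
flag_columns = ["nausea"]  # on the real data: every column except Temperature
for col in flag_columns:
    raw[col] = raw[col].str.replace(" ", "", regex=False).map({"yes": 1, "no": 0})

print(raw)
```

Removing the spaces before mapping means the mapping keys stay readable (`"yes"`, `"no"`) and do not depend on the exact spacing in the file.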

Next, we would like to do some unsupervised clustering of the features and compare it to the labels. 

Split the data into X and y as follows:
- X should contain all the features (exclude UBI and Neph columns)
- y should be a new list, where each value equals 'UBI', 'Neph', 'Both', or 'Neither' according to the values in the UBI and Neph columns
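
One possible way to build the combined label, sketched on a four-row hypothetical frame that covers every combination. The helper name `diagnose` is made up for this example.

```python
import pandas as pd

# Hypothetical cleaned label columns (1 = yes, 0 = no)
labels = pd.DataFrame({"UBI": [1, 0, 1, 0], "Neph": [0, 1, 1, 0]})


def diagnose(ubi, neph):
    """Map the two binary labels to a single diagnosis string."""
    if ubi and neph:
        return "Both"
    if ubi:
        return "UBI"
    if neph:
        return "Neph"
    return "Neither"


y = [diagnose(u, n) for u, n in zip(labels["UBI"], labels["Neph"])]
print(y)
```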

- Visualize the value counts for each diagnosis using a pie chart.
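
A small sketch of the pie chart, assuming `y` is a list of diagnosis strings. The values here are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical diagnosis list; on the real data this is the y built earlier
y = ["UBI", "UBI", "Neph", "Both", "Neither", "UBI"]

# value_counts gives one count per distinct diagnosis
counts = pd.Series(y).value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(counts.values, labels=counts.index, autopct="%1.0f%%")
ax.set_title("Diagnosis counts")
```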

- Use the KMeans estimator from the scikit-learn cluster module to find 4 clusters in the data using all the features in X.
- Can you figure out which cluster is which?
- Are value counts enough to make a determination?
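
The clustering step might be sketched as follows. Synthetic blobs from `make_blobs` stand in for X here, so the numbers say nothing about the real dataset. The `random_state` and `n_init` values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X: 120 samples, 6 features, 4 groups
X_demo, _ = make_blobs(n_samples=120, centers=4, n_features=6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_demo)

# Cluster ids are arbitrary integers; their order carries no meaning
ids, sizes = np.unique(cluster_ids, return_counts=True)
print(dict(zip(ids.tolist(), sizes.tolist())))
```

Because the ids are arbitrary, matching clusters to diagnoses by size alone only works when every group has a distinct size, and even then it does not show that the same samples ended up together.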

To find out, let's perform dimensionality reduction:
- use the sklearn PCA estimator to fit and transform X onto 2 dimensions.
- Visualize the transformed data as a scatter plot (tip: create a new pandas DataFrame with the transformed data, and add a column with the diagnosis. Then use seaborn's pointplot function to visualize. Set join=False to avoid connecting the dots; newer seaborn releases deprecate `join` in favor of `linestyle='none'`)
- Visualize the same but using the predicted clusters as the label. 
- Can you figure out which label matches which cluster now? Try using the sklearn confusion_matrix to verify your conclusion. **Bonus**: use imshow to visualize the confusion matrix.
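
A sketch of the projection and the label-versus-cluster comparison, again on synthetic blobs rather than the real data. One detail worth noting: `confusion_matrix` expects both inputs to share a label space, so the string diagnoses are encoded as integers first (here with `LabelEncoder`, which is one option among several).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-ins for X and y
X_demo, blob_ids = make_blobs(n_samples=120, centers=4, n_features=6, random_state=0)
y_demo = np.array(["Both", "Neither", "Neph", "UBI"])[blob_ids]

# Project onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X_demo)
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# Frame for plotting: one row per sample, with both kinds of label
plot_df = pd.DataFrame(X_2d, columns=["pc1", "pc2"])
plot_df["diagnosis"] = y_demo
plot_df["cluster"] = cluster_ids

# Rows are encoded diagnoses, columns are cluster ids
encoder = LabelEncoder()
y_codes = encoder.fit_transform(y_demo)
cm = confusion_matrix(y_codes, cluster_ids)
print(pd.DataFrame(cm, index=encoder.classes_))
```

A row dominated by a single column suggests that the diagnosis maps mostly to that cluster. For the bonus, the `cm` array can be passed to `plt.imshow`.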

Use the `pca_results` function defined below to visualize the PCA. Do you understand the result?

In [43]:
def pca_results(df, pca):
    '''
    Create a DataFrame of the PCA results
    Includes dimension feature weights and explained variance
    Visualizes the PCA results
    '''

    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = df.keys())
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (14,8))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar')
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)


    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n          %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)
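
To help read the output of `pca_results`, the sketch below inspects the two attributes it relies on, using a small synthetic frame with made-up feature names. Each row of `components_` is a unit-length direction in feature space, and `explained_variance_ratio_` gives the share of total variance captured along that direction.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data with hypothetical feature names; f2 is built to track f1
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["f1", "f2", "f3", "f4"])
demo["f2"] = 2 * demo["f1"] + rng.normal(scale=0.1, size=50)

pca = PCA(n_components=2).fit(demo)

# One row per component, one column per original feature
weights = pd.DataFrame(
    pca.components_, columns=demo.columns, index=["Dimension 1", "Dimension 2"]
)
explained = pca.explained_variance_ratio_

print(weights.round(3))
print(explained.round(3))
```

Features with large absolute weights in a row are the ones that component mostly summarizes; the sign only indicates direction along the axis.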


Supervised classification:
- Use the KNeighborsClassifier estimator from sklearn to fit a model to the data:
    - Perform a train-test split
    - Instantiate a classifier and fit it on the training data & labels (You can train with both labels at once with `y = data[['UBI', 'Neph']]` as the label)
    - Check the model performance using the score method. What does that score mean?
    - Verify the validity of your score with a cross-validation score using stratified 5-fold cross-validation.
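
The supervised steps could be sketched like this, once more on synthetic data with a made-up two-column target. With a two-column binary target, `score` reports the fraction of samples where both labels are predicted correctly. `StratifiedKFold` stratifies on a single label vector, so this sketch stratifies on a combined label and passes the resulting splits to `cross_val_score`; that workaround is an assumption about how to meet the exercise's requirement, not the only way.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features; the blob id (0..3) plays the role of the combined diagnosis
X_demo, combined = make_blobs(n_samples=120, centers=4, n_features=6, random_state=0)

# Hypothetical two-column target standing in for data[['UBI', 'Neph']]
Y_demo = np.column_stack([combined % 2, combined // 2])

X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.25, stratify=combined, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
test_score = knn.score(X_test, Y_test)  # exact-match accuracy over both labels

# Stratify the folds on the combined label, then reuse the splits for scoring
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X_demo, combined))
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X_demo, Y_demo, cv=splits
)

print(test_score, cv_scores.mean())
```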