# Math 425 Fall 2025 Project 1  
**Due:** 5 PM on Friday, October 10  

## Project Directions
- Include a report on every group member’s contribution.  
- Submit the group’s well-commented code used for the project with instructions on how to compile and run.  
- Make a 15 to 20 minute video presentation of your results.  

The project consists of 3 problems.

You are given part of the **Wisconsin Diagnostic Breast Cancer (WDBC) dataset**. For each patient, a vector $\mathbf{a}$ holds features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The features describe characteristics of the cell nuclei present in the image. The goal is to decide whether the cells are malignant or benign.  

### Feature Computation
Ten real-valued quantities are computed for each cell nucleus:
- **radius** (mean of distances from center to points on the perimeter)  
- **texture** (standard deviation of gray-scale values)  
- **perimeter**  
- **area**  
- **smoothness** (local variation in radius lengths)  
- **compactness** $\;=\; \dfrac{\text{perimeter}^2}{\text{area}} - 1.0$  
- **concavity** (severity of concave portions of the contour)  
- **concave points** (number of concave portions of the contour)  
- **symmetry**  
- **fractal dimension** (“coastline approximation” - 1)  

The **mean**, **standard error (stderr)**, and **largest (worst)** value (the mean of the three largest values) of each of these ten quantities were computed for each image.  
Thus, each specimen is represented by a vector $\mathbf{a}$ with $10 \times 3 =$ **30 entries**.  

The domain $D$ consists of thirty strings identifying these features, e.g.  
- `radius (mean)`  
- `radius (stderr)`  
- `radius (worst)`  
- `area (mean)`  

### Provided Files
- `train.txt`: data for 300 patients  
- `train_values.txt`: indicator for malignant specimen (+1) or benign specimen (−1)  
- `validate.txt`: data for 260 patients  
- `validate_values.txt`: indicator for malignant specimen (+1) or benign specimen (−1) 

## Problem 1
**(a)** Apply $k$-means clustering with $k=2$ to the training data.  
Then use the validation data to assess clustering accuracy. You will need a scheme to determine whether a patient in the validation set has a malignant or benign tumor based on clustering.  
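One possible labeling scheme, sketched here under stated assumptions: run Lloyd's algorithm on the training rows, give each cluster the majority label of its training members, and classify a validation patient by the label of the nearest centroid. The helper names `sq_dists`, `kmeans`, `label_clusters`, and `predict` are illustrative and not part of the assignment, and the two synthetic groups only stand in for `train.txt` and `validate.txt`.

```python
import numpy as np

def sq_dists(X, centroids):
    """Squared Euclidean distance from every row of X to every centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = sq_dists(X, centroids).argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, sq_dists(X, centroids).argmin(axis=1)

def label_clusters(labels, b, k):
    """Majority vote per cluster over training labels in {+1, -1}; ties go to +1."""
    return np.array([1 if b[labels == j].sum() >= 0 else -1 for j in range(k)])

def predict(X, centroids, cluster_class):
    """Classify each row by the class of its nearest centroid."""
    return cluster_class[sq_dists(X, centroids).argmin(axis=1)]

# Synthetic stand-in data (assumption): two well-separated groups in 30 dimensions
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (150, 30)), rng.normal(5, 1, (150, 30))])
b_train = np.concatenate([-np.ones(150), np.ones(150)])
X_val = np.vstack([rng.normal(0, 1, (130, 30)), rng.normal(5, 1, (130, 30))])
b_val = np.concatenate([-np.ones(130), np.ones(130)])

centroids, labels = kmeans(X_train, k=2)
cluster_class = label_clusters(labels, b_train, k=2)
accuracy = np.mean(predict(X_val, centroids, cluster_class) == b_val)
print(f"validation accuracy: {accuracy:.3f}")
```

On the real data the features have very different scales (area is in the hundreds, smoothness near 0.1), so whether to standardize the columns before clustering is a design choice worth stating in the report.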

**(b)** Embed the data in dimensions $d \in \{5, 10, 20\}$ using **Gaussian matrix embedding**, then rerun $k$-means.  
- What is the clustering accuracy for each $d$?  
- What is the computational time averaged over 500 independent runs? 
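A Gaussian embedding can be sketched as multiplication by a random $30 \times d$ matrix. The scaling below (i.i.d. entries with variance $1/d$, which preserves squared norms in expectation) is one common convention and an assumption here, as are the helper names `gaussian_embed` and `average_time` and the random stand-in data.

```python
import time
import numpy as np

def gaussian_embed(X, d, rng):
    """Project the rows of X to d dimensions with a matrix of N(0, 1/d) entries."""
    G = rng.normal(0.0, 1.0 / np.sqrt(d), size=(X.shape[1], d))
    return X @ G, G

def average_time(fn, runs):
    """Average wall-clock seconds of fn() over the given number of runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))  # stand-in for the 300 x 30 training matrix

for d in (5, 10, 20):
    Y, G = gaussian_embed(X, d, rng)
    # Ratio of embedded to original squared norms, averaged over the rows
    ratio = np.mean((Y ** 2).sum(axis=1) / (X ** 2).sum(axis=1))
    # Each timed run draws a fresh matrix, so the runs are independent
    secs = average_time(lambda: gaussian_embed(X, d, rng), runs=500)
    print(f"d={d:2d}  shape={Y.shape}  norm ratio={ratio:.2f}  avg time={secs:.2e} s")
```

Here only the embedding is timed; for the project the timed function would presumably also include the $k$-means run. Validation rows should be multiplied by the same matrix `G` that was used for training, otherwise the two sets live in different coordinate systems.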

**(c)** Repeat part (b) but use **sparse random projection** instead of Gaussian embedding.  
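One widely used sparse random projection is the Achlioptas construction, in which each entry is $\sqrt{s/d}$ times $+1$, $0$, or $-1$ with probabilities $1/(2s)$, $1 - 1/s$, $1/(2s)$. The sketch below assumes that construction with $s = 3$; the assignment may intend a different variant, and `sparse_projection_matrix` is a hypothetical helper name.

```python
import numpy as np

def sparse_projection_matrix(n_features, d, rng, s=3):
    """Achlioptas-style matrix: entries sqrt(s/d) * {+1, 0, -1}, mostly zero."""
    values = np.array([1.0, 0.0, -1.0])
    probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
    signs = rng.choice(values, size=(n_features, d), p=probs)
    return np.sqrt(s / d) * signs  # each entry has variance 1/d

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))  # stand-in for the 300 x 30 training matrix

for d in (5, 10, 20):
    S = sparse_projection_matrix(X.shape[1], d, rng)
    Y = X @ S
    ratio = np.mean((Y ** 2).sum(axis=1) / (X ** 2).sum(axis=1))
    zero_fraction = np.mean(S == 0)
    print(f"d={d:2d}  zeros={zero_fraction:.2f}  norm ratio={ratio:.2f}")
```

With $s = 3$ about two thirds of the entries are zero, so a sparse matrix format could reduce the multiplication cost; the dense product above is kept only for brevity.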

## Problem 2
- Read the data in `train.txt` into a matrix $A$ (rows = patients, columns = 30 features).  

In [3]:
import pandas as pd

url = "https://raw.githubusercontent.com/ddangman/Math425_Project1/refs/heads/main/Files/Wisconsin_Breast_Cancer_Data/train.txt"

matrixA = pd.read_csv(url, sep=",", header=None)
print(matrixA.shape)
matrixA.head()

(300, 30)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


- Read the data in `train_values.txt` into a vector $b$ where  

  $$
  b_i = \begin{cases} 
  +1 & \text{if malignant} \\ 
  -1 & \text{if benign} 
  \end{cases}
  $$

In [4]:
url = "https://raw.githubusercontent.com/ddangman/Math425_Project1/refs/heads/main/Files/Wisconsin_Breast_Cancer_Data/train_values.txt"

vector_b = pd.read_csv(url, sep=",", header=None)
print(vector_b.shape)
vector_b.head()

(300, 1)


| | 0 |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |

**(a)** Use the **QR algorithm** to find the least-squares linear model for the data. 
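Assuming "QR algorithm" here means solving the least-squares problem through a QR factorization $A = QR$, the minimizer of $\|Ax - b\|_2$ satisfies $Rx = Q^T b$ when $A$ has full column rank. The sketch below uses random stand-in data and a hypothetical helper `lstsq_qr`; whether to append a column of ones for an intercept is a modeling choice the problem statement leaves open.

```python
import numpy as np

def lstsq_qr(A, b):
    """Solve min ||A x - b||_2 via the reduced QR factorization A = Q R."""
    Q, R = np.linalg.qr(A)  # Q: m x n with orthonormal columns, R: n x n upper triangular
    # np.linalg.solve is a general solver; scipy.linalg.solve_triangular
    # would exploit the triangular structure of R if SciPy is available.
    return np.linalg.solve(R, Q.T @ b)  # assumes A has full column rank

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 30))           # stand-in for the training matrix
b = np.where(rng.normal(size=300) >= 0, 1.0, -1.0)  # stand-in +1/-1 labels

x = lstsq_qr(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]  # reference solution for comparison
print("max difference from np.linalg.lstsq:", np.abs(x - x_ref).max())
```

If the labels are loaded as a one-column DataFrame, as `vector_b` is above, converting with `vector_b.to_numpy().ravel()` gives the flat vector this sketch expects.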

**(b)** Apply the linear model to `validate.txt` and predict malignancy. Define a classifier:

$$
C(y) = \begin{cases} 
+1 & \text{if } y \geq 0 \\ 
-1 & \text{otherwise}
\end{cases}
$$
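The classifier $C$ translates directly into an elementwise threshold. In this minimal sketch `classify` is an illustrative name and `scores` holds hypothetical model outputs $y_i = \mathbf{a}_i^T x$.

```python
import numpy as np

def classify(y):
    """C(y): +1 where y >= 0, otherwise -1, applied elementwise."""
    return np.where(np.asarray(y) >= 0, 1, -1)

scores = np.array([0.7, -0.2, 0.0, -1.3])  # hypothetical values of a_i^T x
predictions = classify(scores)
print(predictions)  # values: 1, -1, 1, -1
```

Note that a score of exactly zero is classified as malignant, matching the $y \geq 0$ case in the definition.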

In [6]:
url = "https://raw.githubusercontent.com/ddangman/Math425_Project1/refs/heads/main/Files/Wisconsin_Breast_Cancer_Data/validate.txt"

validateA = pd.read_csv(url, sep=",", header=None)
print(validateA.shape)
validateA.head()

(260, 30)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.53 | 18.9 | 129.5 | 1217.0 | 0.115 | 0.1642 | 0.2197 | 0.1062 | 0.1792 | 0.06552 | ... | 25.93 | 26.24 | 171.1 | 2053.0 | 0.1495 | 0.4116 | 0.6121 | 0.198 | 0.2968 | 0.09929 |
| 1 | 12.46 | 19.89 | 80.43 | 471.3 | 0.08451 | 0.1014 | 0.0683 | 0.03099 | 0.1781 | 0.06249 | ... | 13.46 | 23.07 | 88.13 | 551.3 | 0.105 | 0.2158 | 0.1904 | 0.07625 | 0.2685 | 0.07764 |
| 2 | 20.09 | 23.86 | 134.7 | 1247.0 | 0.108 | 0.1838 | 0.2283 | 0.128 | 0.2249 | 0.07469 | ... | 23.68 | 29.43 | 158.8 | 1696.0 | 0.1347 | 0.3391 | 0.4932 | 0.1923 | 0.3294 | 0.09469 |
| 3 | 10.49 | 18.61 | 66.86 | 334.3 | 0.1068 | 0.06678 | 0.02297 | 0.0178 | 0.1482 | 0.066 | ... | 11.06 | 24.54 | 70.76 | 375.4 | 0.1413 | 0.1044 | 0.08423 | 0.06528 | 0.2213 | 0.07842 |
| 4 | 11.46 | 18.16 | 73.59 | 403.1 | 0.08853 | 0.07694 | 0.03344 | 0.01502 | 0.1411 | 0.06243 | ... | 12.68 | 21.61 | 82.69 | 489.8 | 0.1144 | 0.1789 | 0.1226 | 0.05509 | 0.2208 | 0.07638 |


In [7]:
url = "https://raw.githubusercontent.com/ddangman/Math425_Project1/refs/heads/main/Files/Wisconsin_Breast_Cancer_Data/validate_values.txt"

validate_b = pd.read_csv(url, sep=",", header=None)
print(validate_b.shape)
validate_b.head()

(260, 1)


| | 0 |
|---|---|
| 0 | 1 |
| 1 | -1 |
| 2 | 1 |
| 3 | -1 |
| 4 | -1 |


**(c)** What is the percentage of incorrectly classified samples? Compare with the success rate on the training data.  
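The misclassification percentage is the fraction of predictions that disagree with the true labels, times 100. The sketch below uses a hypothetical helper `error_percentage` and made-up labels. It flattens both inputs first, because comparing a length-$n$ array with an $n \times 1$ column (the shape `validate_b` has above) would broadcast to an $n \times n$ comparison and give a misleading rate.

```python
import numpy as np

def error_percentage(predicted, actual):
    """Percentage of entries where the predicted label differs from the actual one."""
    predicted = np.asarray(predicted).ravel()  # flatten (n, 1) columns to (n,)
    actual = np.asarray(actual).ravel()
    return 100.0 * np.mean(predicted != actual)

predicted = np.array([1, -1, 1, -1])          # hypothetical classifier output
actual = np.array([[1], [-1], [-1], [-1]])    # hypothetical labels as an (n, 1) column
rate = error_percentage(predicted, actual)
print(f"{rate:.1f}% misclassified")  # one of four labels differs: 25.0%
```

Calling the same function on the training predictions and `vector_b` gives the training error, and 100 minus that value is the success rate to compare against.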

**(d)** Embed the data in $d \in \{5, 10, 20\}$ using **Gaussian matrix embedding** and repeat (a), (b), and (c). Report average computational time over 500 runs.  
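A possible shape for this experiment, under stated assumptions: each run draws a fresh Gaussian matrix, fits the least-squares model in the embedded space, and scores the validation set embedded with the same matrix. The helper `run_once`, the $N(0, 1/d)$ scaling, and the synthetic data are illustrative choices, not part of the assignment.

```python
import time
import numpy as np

def run_once(A, b, A_val, b_val, d, rng):
    """One trial: embed, fit by QR least squares, return validation error in percent."""
    G = rng.normal(0.0, 1.0 / np.sqrt(d), size=(A.shape[1], d))
    Q, R = np.linalg.qr(A @ G)
    x = np.linalg.solve(R, Q.T @ b)
    pred = np.where((A_val @ G) @ x >= 0, 1, -1)  # same G as used for training
    return 100.0 * np.mean(pred != b_val)

# Synthetic stand-in data (assumption): labels come from a hidden linear rule
rng = np.random.default_rng(0)
w = rng.normal(size=30)
A = rng.normal(size=(300, 30))
b = np.where(A @ w >= 0, 1, -1)
A_val = rng.normal(size=(260, 30))
b_val = np.where(A_val @ w >= 0, 1, -1)

runs = 500
results = {}
for d in (5, 10, 20):
    start = time.perf_counter()
    errors = [run_once(A, b, A_val, b_val, d, rng) for _ in range(runs)]
    avg_time = (time.perf_counter() - start) / runs
    results[d] = (np.mean(errors), avg_time)
    print(f"d={d:2d}  mean error={results[d][0]:.1f}%  avg time={avg_time:.2e} s")
```

Because the embedding is random, the error varies from run to run, so reporting the mean (and possibly the spread) over the 500 runs alongside the average time gives a fairer picture than a single trial.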

**(e)** Repeat part (d) but use **sparse random projection** instead.  

## Problem 3
Apply $k$-means to the **class music data** `songList.xlsx` and use **Class Roster** to group the class into **8 distinct music clusters**.


In [8]:
url = "https://github.com/ddangman/Math425_Project1/raw/main/Files/MATH425_songList.xlsx"
song_df = pd.read_excel(url)

print(song_df.head())


                      Song        Artist    1    2    3    4    5    6    7  \
0               Take On Me          A-Ha  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1            Thunderstruck         Ac/Dc  0.0  0.0  0.0  0.0  3.0  0.0  0.0   
2                    Hello         Adele  2.0  0.0  0.0  0.0  3.0  1.0  0.0   
3      Rolling In The Deep         Adele  0.0  0.0  0.0  0.0  4.0  1.0  0.0   
4  Scars To Your Beautiful  Alessia Cara  0.0  0.0  0.0  0.0  3.0  0.0  0.0   

     8  ...   23   24   25   26   27   28   29   30   31   32  
0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
3  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
4  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 34 columns]
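One way to set up the clustering, assuming the numbered columns `1` to `32` correspond to students and each entry is that student's rating of the song: transpose the ratings so each student becomes a row, then run $k$-means with $k = 8$. The toy table below only mimics the layout of `song_df`, and `kmeans_labels` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def kmeans_labels(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns the cluster index of each row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

# Toy table with the same layout as song_df (assumption): Song, Artist, then 1..32
rng = np.random.default_rng(0)
n_songs, n_students = 40, 32
toy_df = pd.DataFrame(
    rng.integers(0, 5, size=(n_songs, n_students)).astype(float),
    columns=range(1, n_students + 1),
)
toy_df.insert(0, "Artist", [f"artist {i}" for i in range(n_songs)])
toy_df.insert(0, "Song", [f"song {i}" for i in range(n_songs)])

ratings = toy_df.drop(columns=["Song", "Artist"]).fillna(0)
student_ids = list(ratings.columns)
X = ratings.to_numpy(dtype=float).T  # rows = students, columns = songs

labels = kmeans_labels(X, k=8)
groups = {j: [s for s, lab in zip(student_ids, labels) if lab == j] for j in range(8)}
for j, members in groups.items():
    print(f"cluster {j}: students {members}")
```

The student numbers in each group could then be matched against the Class Roster to report names. Plain Lloyd's algorithm can leave a cluster empty, so checking that all 8 groups are non-empty (and rerunning with another seed if not) is a reasonable safeguard.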
