# DX 601 Final Project

## Introduction

In this project, you will practice all the skills that you have learned throughout this module.
You will pick a data set to analyze from the list provided, and then perform a variety of analyses.
Most of the problems and questions are more open-ended than your previous homeworks, and you will be asked to explain your choices.
Most of them will have a particular type of solution implied, but it is up to you to figure out the details based on what you have learned in this module.

## Instructions

Each problem asks you to perform some analysis of the data, and usually answer some questions about the results.
Make sure that your question answers are well supported by your analysis and explanations; simply stating an answer without support will earn minimal points.

Notebook cells for code and text have been added for your convenience, but feel free to add additional cells.

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx500-examples
* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Submission

This project will be entirely manually graded.
However, we may rerun some or all of your code to confirm that it works as described.

### Late Policy

The normal homework late policy for OMDS does not apply to this project.
Boston University requires final grades to be submitted within 72 hours of class instruction ending, so we cannot accommodate 5 days of late submissions.

However, the due date for this project is substantially later than its scope requires.
You have more days to submit for full credit than the homework late policy would have given you to submit for partial credit.
Finally, the deadlines for DX 601 and DX 602 were coordinated to be a week apart while giving ample time for both of their projects.

## Shared Imports

For this project, you are forbidden to use modules that were not loaded in this template.
While other modules are handy in practice, modules that trivialize these problems interfere with our assessment of your own knowledge and skills.

If you believe a module covered in the course material (not live sessions) is missing, please check with your learning facilitator.

In [1]:
import math
import sys

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.linear_model

from sklearn.decomposition import PCA

## Problems

### Problem 1 (5 points)

Pick one of the following data sets to analyze in this project.
Load the data set, and show a random sample of 10 rows.

* [Iris data set](https://archive.ics.uci.edu/dataset/53/iris) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/iris))
* [Breast Cancer Wisconsin](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) ([PMLB copy](https://github.com/EpistasisLab/pmlb/tree/master/datasets/_deprecated_breast_cancer_wisconsin))
* [Wine Quality](https://archive.ics.uci.edu/dataset/186/wine+quality) ([PMLB - white subset only](https://github.com/EpistasisLab/pmlb/tree/master/datasets/wine_quality_white))


The PMLB copies of the data are generally cleaner and recommended for this project, but the other links are provided to give you more context.
To load the data from the PMLB GitHub repository, navigate to the `.tsv.gz` file in GitHub and copy the link from the "Raw" button.

If the data set you choose has more than ten columns, you may limit later analysis that is requested per column to just the first ten columns.

In [None]:
breastCancer = pd.read_csv("https://github.com/EpistasisLab/pmlb/raw/refs/heads/master/datasets/_deprecated_breast_cancer_wisconsin/_deprecated_breast_cancer_wisconsin.tsv.gz", sep='\t')
breastCancer.sample(10)

| | target | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 1 | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 | 0.074 | 0.1794 | ... | 22.88 | 27.66 | 153.2 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 |
| 78 | 1 | 20.18 | 23.97 | 143.7 | 1245.0 | 0.1286 | 0.3454 | 0.3754 | 0.1604 | 0.2906 | ... | 23.37 | 31.72 | 170.3 | 1623.0 | 0.1639 | 0.6164 | 0.7681 | 0.2508 | 0.544 | 0.09964 |
| 99 | 1 | 14.42 | 19.77 | 94.48 | 642.5 | 0.09752 | 0.1141 | 0.09388 | 0.05839 | 0.1879 | ... | 16.33 | 30.86 | 109.5 | 826.4 | 0.1431 | 0.3026 | 0.3194 | 0.1565 | 0.2718 | 0.09353 |
| 54 | 1 | 15.1 | 22.02 | 97.26 | 712.8 | 0.09056 | 0.07081 | 0.05253 | 0.03334 | 0.1616 | ... | 18.1 | 31.69 | 117.7 | 1030.0 | 0.1389 | 0.2057 | 0.2712 | 0.153 | 0.2675 | 0.07873 |
| 227 | 0 | 15.0 | 15.51 | 97.45 | 684.5 | 0.08371 | 0.1096 | 0.06505 | 0.0378 | 0.1881 | ... | 16.41 | 19.31 | 114.2 | 808.2 | 0.1136 | 0.3627 | 0.3402 | 0.1379 | 0.2954 | 0.08362 |
| 252 | 1 | 19.73 | 19.82 | 130.7 | 1206.0 | 0.1062 | 0.1849 | 0.2417 | 0.0974 | 0.1733 | ... | 25.28 | 25.59 | 159.8 | 1933.0 | 0.171 | 0.5955 | 0.8489 | 0.2507 | 0.2749 | 0.1297 |
| 157 | 0 | 16.84 | 19.46 | 108.4 | 880.2 | 0.07445 | 0.07223 | 0.0515 | 0.02771 | 0.1844 | ... | 18.22 | 28.07 | 120.3 | 1032.0 | 0.08774 | 0.171 | 0.1882 | 0.08436 | 0.2527 | 0.05972 |
| 218 | 1 | 19.8 | 21.56 | 129.7 | 1230.0 | 0.09383 | 0.1306 | 0.1272 | 0.08691 | 0.2094 | ... | 25.73 | 28.64 | 170.3 | 2009.0 | 0.1353 | 0.3235 | 0.3617 | 0.182 | 0.307 | 0.08255 |
| 339 | 1 | 23.51 | 24.27 | 155.1 | 1747.0 | 0.1069 | 0.1283 | 0.2308 | 0.141 | 0.1797 | ... | 30.67 | 30.73 | 202.4 | 2906.0 | 0.1515 | 0.2678 | 0.4819 | 0.2089 | 0.2593 | 0.07738 |
| 170 | 0 | 12.32 | 12.39 | 78.85 | 464.1 | 0.1028 | 0.06981 | 0.03987 | 0.037 | 0.1959 | ... | 13.5 | 15.64 | 86.97 | 549.1 | 0.1385 | 0.1266 | 0.1242 | 0.09391 | 0.2827 | 0.06771 |


YOUR ANSWERS HERE

### Problem 2 (10 points)

List all the columns in the data set, and describe each of them in your own words.
You may have to search to learn about the data set columns, but make sure that the descriptions are your own words.

In [None]:
breastCancer.columns

Index(['target', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
       '25', '26', '27', '28', '29', '30', '31'],
      dtype='object')

## Breast Mass Diagnosis Features

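**`target`**: The **Diagnosis** of the breast mass, which is the value the other features are used to predict. In the sample above, the rows with the largest radius and area values have target 1, which is consistent with 0 = *Benign*, 1 = *Malignant*.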

---

### Size Features

**2. `radius1`**: The mean radius of the nuclei in the sample, measured from the center to points on the perimeter. 

**4. `perimeter1`**: The mean perimeter of the nuclei in the sample.

**5. `area1`**: The mean area of the nuclei in the sample.

**12. `radius2`**: The standard error of the radius of the nuclei in the sample, measured from the center to points on the perimeter.

**14. `perimeter2`**: The standard error of the perimeter of the nuclei in the sample.

**15. `area2`**: The standard error of the area of the nuclei in the sample.

**22. `radius3`**: The **"worst"** (mean of 3 largest values) radius of the nuclei in the sample, measured from the center to points on the perimeter.

**24. `perimeter3`**: The **"worst"** (mean of 3 largest values) perimeter of the nuclei in the sample.

**25. `area3`**: The **"worst"** (mean of 3 largest values) area of the nuclei in the sample.

---

### Shape Features

**6. `smoothness1`**: The mean local variation in radius lengths of the nuclei in the sample. The higher the number, the **"rougher"** the contour.

**7. `compactness1`** ($\text{perimeter}^2 / \text{area} - 1.0$): Measures the mean compactness of the nuclei in the sample. The higher the number, the more **irregular** the contour.

**10. `symmetry1`**: The mean symmetry of the nuclei shapes in the sample.

**16. `smoothness2`**: The standard error of the local variation in radius lengths of the nuclei in the sample. The higher the number, the **"rougher"** the contour.

**17. `compactness2`** ($\text{perimeter}^2 / \text{area} - 1.0$): The standard error of the compactness of the nuclei in the sample. The higher the number, the more **irregular** the contour.

**20. `symmetry2`**: The standard error of the symmetry of the nuclei shapes in the sample.

**26. `smoothness3`**: The **"worst"** (mean of 3 largest values) local variation in radius lengths of the nuclei in the sample. The higher the number, the **"rougher"** the contour.

**27. `compactness3`** ($\text{perimeter}^2 / \text{area} - 1.0$): The **"worst"** (mean of 3 largest values) compactness of the nuclei in the sample. The higher the number, the more **irregular** the contour.

**30. `symmetry3`**: The **"worst"** (mean of 3 largest values) symmetry of the nuclei shapes in the sample.
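
To make the compactness formula concrete, here is a small sketch that evaluates $\text{perimeter}^2 / \text{area} - 1.0$ for two idealized shapes. The `compactness` helper and the shapes are illustrative assumptions, not part of the data set. A circle gives the smallest value any shape can have, so larger values indicate a contour that departs further from a circle.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Hypothetical helper: perimeter^2 / area - 1.0, as in the feature description."""
    return perimeter ** 2 / area - 1.0


radius = 1.0
circle = compactness(2 * math.pi * radius, math.pi * radius ** 2)  # equals 4*pi - 1

side = 1.0
square = compactness(4 * side, side ** 2)  # equals 16 - 1

print(f"circle: {circle:.3f}, square: {square:.3f}")
```

Because the perimeter is squared and divided by the area, the value does not change when a shape is scaled up or down, only when its outline changes.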

---

### Contour (Edge) Irregularity Features

**8. `concavity1`**: The mean severity of concave portions of the contour. This measures the indentations of the nuclei in the sample.

**9. `concave_points1`**: The mean number of concave portions of the contour in the sample. This measures the number of "dips" or indentations in the nuclei contour.

**11. `fractal_dimension1`**: The mean irregularity of the nuclei contour using **"coastline approximation" - 1**. A high fractal dimension means the nuclei have a more complex, jagged contour. A low fractal dimension means the nuclei have a smoother, simpler contour.

**18. `concavity2`**: The standard error of the severity of concave portions of the contour. This measures the indentations of the nuclei in the sample.

**19. `concave_points2`**: The standard error of the number of concave portions of the contour in the sample. This measures the number of "dips" or indentations in the nuclei contour.

**21. `fractal_dimension2`**: The standard error of the irregularity of the nuclei contour using **"coastline approximation" - 1**. A high fractal dimension means the nuclei have a more complex, jagged contour. A low fractal dimension means the nuclei have a smoother, simpler contour.

**28. `concavity3`**: The **"worst"** (mean of 3 largest values) severity of concave portions of the contour. This measures the indentations of the nuclei in the sample.

**29. `concave_points3`**: The **"worst"** (mean of 3 largest values) number of concave portions of the contour in the sample. This measures the number of "dips" or indentations in the nuclei contour.

**31. `fractal_dimension3`**: The **"worst"** (mean of 3 largest values) irregularity of the nuclei contour using **"coastline approximation" - 1**. A high fractal dimension means the nuclei have a more complex, jagged contour. A low fractal dimension means the nuclei have a smoother, simpler contour.

---

### Texture Features

**3. `texture1`**: The mean standard deviation of gray-scale values of the nuclei in the sample. This essentially measures the irregularity of the surface texture.

**13. `texture2`**: The standard error of the standard deviation of gray-scale values of the nuclei in the sample. This essentially measures the irregularity of the surface texture.

**23. `texture3`**: The **"worst"** (mean of 3 largest values) standard deviation of gray-scale values of the nuclei in the sample. This essentially measures the irregularity of the surface texture.

---

#### Measurements

* Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

* Standard Error: $\frac{\sigma}{\sqrt{n}}$

* Worst: mean of 3 largest values
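
The three summaries above can be sketched on a handful of made-up measurements. The `radii` values are hypothetical, and the standard error here uses the sample standard deviation (`ddof=1`) as an estimate of $\sigma$, which is an assumption about how the original features were computed.

```python
import numpy as np

# Hypothetical radius measurements for the nuclei in one image.
radii = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 13.0])

# Mean: sum of the values divided by their count.
mean_radius = radii.mean()

# Standard error: standard deviation divided by sqrt(n).
standard_error = radii.std(ddof=1) / np.sqrt(len(radii))

# "Worst": mean of the three largest values.
worst_radius = np.sort(radii)[-3:].mean()

print(mean_radius, standard_error, worst_radius)
```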

---

#### Sources

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

### Problem 3 (15 points)

Plot histograms of each column.
For each column, state the distribution covered in this module that you think best matches that column.
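
As a rough sketch of the plotting mechanics only, the following draws one histogram per column of a synthetic table. The `demo` frame and its column names are made up for illustration. They stand in for the real data set, and the distributions used to generate them are not a claim about any real column.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data set (hypothetical column names).
demo = pd.DataFrame({
    "bell_shaped": rng.normal(loc=10.0, scale=2.0, size=500),
    "right_skewed": rng.exponential(scale=1.5, size=500),
    "binary": rng.integers(0, 2, size=500),
})

# One panel per column, each with its own horizontal axis.
fig, axes = plt.subplots(1, len(demo.columns), figsize=(12, 3))
for ax, column in zip(axes, demo.columns):
    ax.hist(demo[column], bins=30)
    ax.set_title(column)
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

Looking at the shape of each panel (symmetric, skewed, or concentrated on a few values) is one way to start narrowing down a candidate distribution.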

In [5]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 4 (20 points)

Plot each pair of an input column and the output column.
Classify each pair of input column and the output column as being independent or not.
Describe in words why you think that was the case.
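
One possible way to approach this for a binary output is sketched below on synthetic data: plot each input against the output, then compare the input's mean within each output class. The `demo` frame, its column names, and the use of a class-mean gap as the signal are all illustrative assumptions, not the required method.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=400)

# Synthetic stand-in data (hypothetical column names).
demo = pd.DataFrame({
    "target": target,
    # Shifted upward when target == 1, so it depends on the target.
    "dependent_input": rng.normal(size=400) + 2.0 * target,
    # Generated without looking at the target.
    "independent_input": rng.normal(size=400),
})

input_columns = ["dependent_input", "independent_input"]
fig, axes = plt.subplots(1, len(input_columns), figsize=(8, 3))
for ax, column in zip(axes, input_columns):
    ax.scatter(demo[column], demo["target"], alpha=0.3)
    ax.set_xlabel(column)
    ax.set_ylabel("target")
fig.tight_layout()

# If an input were independent of the target, its mean should be
# roughly the same in both classes.
class_means = demo.groupby("target").mean()
gap = (class_means.loc[1] - class_means.loc[0]).abs()
print(gap)
```

A small gap in means does not prove independence, since the classes could differ in spread or shape instead, so the plots still need to be read alongside the numbers.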

In [6]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 5 (20 points)

Build an ordinary least squares regression for the target using all the input columns.
Report the mean squared error of the model over the whole data set.
Plot the actual values vs the predicted outputs to compare them. 
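
The fitting and scoring steps could look roughly like this on synthetic data. The arrays `X` and `y` are made up. With the real data set, they would be the input columns and the target column.

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model

rng = np.random.default_rng(2)

# Synthetic inputs and a target that is linear in them plus noise.
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Ordinary least squares fit on all input columns.
model = sklearn.linear_model.LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# Mean squared error over the whole data set.
mse = np.mean((y - predictions) ** 2)
print(f"MSE: {mse:.4f}")

# Actual vs predicted; points on the diagonal are perfect predictions.
fig, ax = plt.subplots()
ax.scatter(y, predictions, s=10)
ax.plot([y.min(), y.max()], [y.min(), y.max()], color="black")
ax.set_xlabel("actual")
ax.set_ylabel("predicted")
```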

In [7]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 6 (20 points)

Which input column gives the best linear model of the target on its own?
How does that model compare to the model in Problem 5?
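
One way to search for the best single column is to fit a separate one-column regression per input and compare their mean squared errors. The sketch below does this on synthetic data. The `inputs` frame, its column names, and the use of training MSE as the ranking criterion are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model

rng = np.random.default_rng(3)

# Synthetic inputs; the target depends mostly on column "b".
inputs = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
target = 3.0 * inputs["b"] + 0.5 * inputs["a"] + rng.normal(scale=0.2, size=300)

mse_by_column = {}
for column in inputs.columns:
    model = sklearn.linear_model.LinearRegression()
    # Double brackets keep a 2-D frame with a single column.
    model.fit(inputs[[column]], target)
    residuals = target - model.predict(inputs[[column]])
    mse_by_column[column] = float(np.mean(residuals ** 2))

best_column = min(mse_by_column, key=mse_by_column.get)
print(best_column, mse_by_column)
```

When both models are fit by ordinary least squares on the same rows, a single-column model cannot have a lower training MSE than the model using all columns, because it is a special case of that model with the other coefficients fixed at zero.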


In [8]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 7 (20 points)

Pick and plot a pair of input columns with a visible dependency.
Identify a split of the values of one column illustrating the dependency and plot histograms of the other variable on both sides of the split.
That is, pick a threshold $t$ for one column $x$ and make two histograms, one where $x < t$ and one where $x \geq t$.

These histograms should look significantly different to make the dependency clear.
There should be enough data in both histograms so that these differences are unlikely to be noise.
Also make sure that the horizontal axis is the same in both histograms for clarity.
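
The split-and-compare step might be sketched as follows on synthetic data. The variables `x` and `y` are made up, and choosing the median of `x` as the threshold $t$ is just one convenient option that keeps both sides equally populated.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)

# Synthetic pair of columns where y depends on x.
x = rng.uniform(0.0, 10.0, size=1000)
y = 2.0 * x + rng.normal(scale=2.0, size=1000)

# Threshold t on x; the median puts half the rows on each side.
threshold = np.median(x)
below = y[x < threshold]
above = y[x >= threshold]

# Shared bin edges and a shared x-axis make the two histograms comparable.
bins = np.linspace(y.min(), y.max(), 31)
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(6, 6))
axes[0].hist(below, bins=bins)
axes[0].set_title(f"y where x < {threshold:.2f}")
axes[1].hist(above, bins=bins)
axes[1].set_title(f"y where x >= {threshold:.2f}")
axes[1].set_xlabel("y")
fig.tight_layout()
```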

In [9]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 8 (40 points)

Perform principal components analysis of the input columns.
Compute how much of the data variation is explained by the first half of the principal components.
Build a linear regression using coordinates computed from the first half of the principal components.
Compare the mean squared error of this model to the previous model.
Plot actual targets vs predictions again. 

This problem depends on material from week 13.
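
A minimal sketch of the workflow on synthetic data follows. The arrays `X` and `y` and the mixing matrix are made up for illustration. Note that `PCA` centers the columns but does not rescale them, so whether to standardize the inputs first is a choice you would need to make and justify for the real data.

```python
import numpy as np
import sklearn.linear_model
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Four correlated input columns built from two latent factors plus noise.
latent = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.5, 0.0, 1.0],
                   [0.0, 1.0, 1.0, -1.0]])
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 4))
y = latent[:, 0] - 2.0 * latent[:, 1] + rng.normal(scale=0.1, size=300)

# Keep the first half of the principal components.
n_keep = X.shape[1] // 2
pca = PCA(n_components=n_keep)
coordinates = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"variance explained by first {n_keep} components: {explained:.4f}")

# Regression on the reduced coordinates vs regression on all columns.
half_model = sklearn.linear_model.LinearRegression().fit(coordinates, y)
mse_half = np.mean((y - half_model.predict(coordinates)) ** 2)
full_model = sklearn.linear_model.LinearRegression().fit(X, y)
mse_full = np.mean((y - full_model.predict(X)) ** 2)
print(f"MSE (half of components): {mse_half:.4f}, MSE (all columns): {mse_full:.4f}")
```

Since the kept components are linear combinations of the original columns, the reduced model's training MSE can match but never beat the all-columns model. The interesting question is how small the gap is.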

In [10]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 9 (20 points)

What pair of input columns has the highest correlation?
How is that correlation reflected in the principal components?
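
The sketch below shows one way to find the most correlated pair and inspect the first principal component, using synthetic data. The `demo` frame and its column names are hypothetical, and standardizing the columns by hand before PCA is an assumption made so that no column dominates because of its scale.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
base = rng.normal(size=500)

# col_a and col_b are nearly proportional; col_c is unrelated.
demo = pd.DataFrame({
    "col_a": base,
    "col_b": 2.0 * np.pi * base + rng.normal(scale=0.1, size=500),
    "col_c": rng.normal(size=500),
})

# Absolute correlations; hide the diagonal and the duplicated lower triangle.
corr = demo.corr().abs()
values = corr.to_numpy().copy()
values[np.tril_indices_from(values)] = -1.0
i, j = np.unravel_index(np.argmax(values), values.shape)
best_pair = (corr.index[i], corr.columns[j])
print(best_pair, corr.loc[best_pair])

# Weights of the first principal component on standardized columns.
standardized = (demo - demo.mean()) / demo.std()
pc1 = PCA(n_components=1).fit(standardized).components_[0]
print(dict(zip(demo.columns, pc1.round(3))))
```

In this synthetic case the two correlated columns receive weights of similar magnitude in the first component, while the unrelated column receives a weight near zero.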

In [11]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Problem 10 (30 points)

Identify an outlier row in the data set.
You may use any criterion discussed in this module, and you must explain the criterion and how it led to picking this row.
Give a visualization showing how much this row sticks out compared to the other data based on your criterion.
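
As one illustration, the sketch below uses the largest absolute z-score in each row as the criterion. This is an assumption: confirm that the criterion you choose was covered in this module. The `demo` frame and the planted outlier in row 42 are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic data with one planted outlier (hypothetical columns).
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo.loc[42, "b"] = 12.0

# z-score of every cell, then the most extreme value in each row.
z_scores = (demo - demo.mean()) / demo.std()
max_abs_z = z_scores.abs().max(axis=1)
outlier_row = max_abs_z.idxmax()
print(outlier_row, max_abs_z.loc[outlier_row])

# Every row's score, with the chosen row highlighted.
fig, ax = plt.subplots()
ax.scatter(max_abs_z.index, max_abs_z, s=10)
ax.scatter([outlier_row], [max_abs_z.loc[outlier_row]], color="red",
           label=f"row {outlier_row}")
ax.set_xlabel("row index")
ax.set_ylabel("largest |z| in row")
ax.legend()
```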

In [12]:
# YOUR CODE HERE

YOUR ANSWERS HERE

### Generative AI Usage

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the [generative AI policy](https://www.bu.edu/cds-faculty/culture-community/gaia-policy/).
If you did not use any generative AI tools, simply write NONE below.

YOUR ANSWER HERE