
# Breast Cancer prediction, SVM



**Description:**

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional feature space is described in:

*K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.*

### Attribute Information:

1. **ID number**
2. **Diagnosis**:
   - **M** = malignant
   - **B** = benign
3. **Features (3-32)**:
   Ten real-valued features are computed for each cell nucleus:
   
   - **Radius**: Mean of distances from center to points on the perimeter
   - **Texture**: Standard deviation of gray-scale values
   - **Perimeter**
   - **Area**
   - **Smoothness**: Local variation in radius lengths
   - **Compactness**: \( \text{perimeter}^2 / \text{area} - 1.0 \)
   - **Concavity**: Severity of concave portions of the contour
   - **Concave points**: Number of concave portions of the contour
   - **Symmetry**
   - **Fractal dimension**: "Coastline approximation" (fractal dimension - 1)

For each image, the **mean**, **standard error**, and **worst** (that is, largest: the mean of the three largest values) of these features were computed, resulting in **30 features**. For example:
- Field 3: Mean Radius
- Field 13: Radius SE
- Field 23: Worst Radius

All feature values are recorded with four significant digits.

### Additional Information:
- **Missing attribute values**: None
- **Class distribution**:
   - 357 benign
   - 212 malignant



In [None]:
!git clone https://github.com/cesarlegendre/credit_scoring_7904_Q4_2024


In [None]:
# Import libraries
# data loading and manipulation
import pandas as pd
import numpy as np

#visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# data splitting
from sklearn.model_selection import train_test_split

# data modeling
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Model performance
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report
from sklearn import metrics


#warnings
import warnings
warnings.simplefilter(action='ignore')


file = 'credit_scoring_7904_Q4_2024/data_sets/breast_cancer/data.csv'
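
The steps below refer to a DataFrame named `df`, which the cell above does not create. A minimal sketch of the loading step follows. It assumes the CSV uses the column names mentioned in the later steps (`id`, `diagnosis`, `Unnamed: 32`) plus feature names such as `radius_mean`, which are an assumption here. The in-memory excerpt and its values are hypothetical and exist only so the snippet runs without the cloned repository.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for data.csv (illustrative values).
# The trailing comma on each row mimics an all-empty last column.
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean,Unnamed: 32\n"
    "1,M,17.99,10.38,\n"
    "2,B,13.54,14.36,\n"
    "3,B,12.45,15.70,\n"
)

# With the repository cloned, pass the `file` path instead of `sample_csv`.
df = pd.read_csv(sample_csv)

print(df.shape)
print(df.head())
```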


### 1. **View basic statistical details of the training data**
   - **Instruction**: View summary statistics of the `df` DataFrame.
   - **Prompt**: "Generate basic statistical summaries for the DataFrame."
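
One possible way to answer this prompt is `DataFrame.describe()`, sketched here on a small hypothetical frame (the column names and values are illustrative, not taken from the real file).

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.0, 13.0, 20.0],
    "texture_mean": [10.0, 15.0, 14.0, 17.0],
})

# describe() reports count, mean, std, min, quartiles and max
# for the numerical columns; non-numerical columns are skipped by default.
summary = df.describe()
print(summary)
```

Passing `include="all"` would also summarize non-numerical columns such as `diagnosis`.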


### 2. **Check for duplicated entries**
   - **Instruction**: Check if there are any duplicate rows in the `df` DataFrame.
   - **Prompt**: "Check if the DataFrame has any duplicated rows and sum them."
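
A short sketch of the duplicate check, using a hypothetical frame in which one row is deliberately repeated:

```python
import pandas as pd

# Toy stand-in for `df`; the second and third rows are identical on purpose.
df = pd.DataFrame({
    "radius_mean": [18.0, 12.0, 12.0, 20.0],
    "texture_mean": [10.0, 15.0, 15.0, 17.0],
})

# duplicated() flags every row that repeats an earlier row;
# summing the boolean flags counts them.
n_duplicates = df.duplicated().sum()
print(n_duplicates)
```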


### 3. **Get information about the DataFrame**
   - **Instruction**: Retrieve detailed information about the structure of the DataFrame, such as column types and null values.
   - **Prompt**: "Get DataFrame information, including column data types and non-null counts."
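
The structural overview can be obtained with `DataFrame.info()`. The sketch below uses a hypothetical frame and writes the report to a buffer so the text can be reused; in a notebook, calling `df.info()` directly is enough.

```python
import io

import pandas as pd

# Toy stand-in for `df`, including an all-null column like 'Unnamed: 32'.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 12.0, 13.0],
    "Unnamed: 32": [float("nan")] * 3,
})

# info() prints to stdout by default; `buf` redirects the report.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```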


### 4. **Count unique values per column**
   - **Instruction**: Find the number of unique values in each column of the `df` DataFrame.
   - **Prompt**: "Count the unique values for each column in the DataFrame."
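
A minimal sketch with `DataFrame.nunique()`, again on hypothetical data:

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.0, 13.0, 18.0],
})

# nunique() returns one count of distinct values per column.
unique_counts = df.nunique()
print(unique_counts)
```

A target column with exactly two unique values, as here, is consistent with a binary classification task.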


### 5. **Find missing values in the DataFrame**
   - **Instruction**: Find the total number of missing values in each column.
   - **Prompt**: "Identify the number of missing values in each column."
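
Missing values per column can be counted by chaining `isnull()` and `sum()`. The frame below is a hypothetical stand-in that includes an empty `Unnamed: 32` column.

```python
import pandas as pd

# Toy stand-in for `df`; 'Unnamed: 32' is entirely null on purpose.
df = pd.DataFrame({
    "radius_mean": [18.0, 12.0, 13.0],
    "Unnamed: 32": [float("nan")] * 3,
})

# isnull() marks missing cells; summing per column counts them.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```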


### 6. **Drop the empty column and the identifier column**
   - **Instruction**: Drop the `Unnamed: 32` column (which contains only null values) and the `id` column (a record identifier, not a measurement).
   - **Prompt**: "Drop the 'Unnamed: 32' and 'id' columns from the DataFrame."
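
A sketch of the drop on a hypothetical frame that has the two columns named in the instruction:

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 12.0, 13.0],
    "Unnamed: 32": [float("nan")] * 3,
})

# drop() returns a new frame; reassigning keeps the change.
df = df.drop(columns=["Unnamed: 32", "id"])
print(df.columns.tolist())
```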


### 7. **Plot the mean values heatmap**
   - **Instruction**: Create a heatmap to display the mean values of the numerical columns.
   - **Prompt**: "Generate a heatmap of mean values from the DataFrame."
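
The instruction leaves the exact layout open. One plausible reading, sketched here on hypothetical data, is a single-column heatmap with one row per numerical feature and its mean as the cell value.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.0, 13.0, 20.0],
    "texture_mean": [10.0, 15.0, 14.0, 17.0],
    "area_mean": [1000.0, 450.0, 520.0, 1200.0],
})

# One-column frame: index = feature name, single column = mean value.
means = df.select_dtypes(include="number").mean().to_frame(name="mean")

fig, ax = plt.subplots(figsize=(4, 6))
sns.heatmap(means, annot=True, fmt=".2f", cmap="viridis", ax=ax)
ax.set_title("Mean value per numerical feature")
fig.tight_layout()
```

Because the features live on very different scales, the color scale is dominated by the largest ones (area in this toy example); the annotations carry most of the information.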


### 8. **Identify categorical and numerical features**
   - **Instruction**: Identify and print the categorical and numerical features in the DataFrame.
   - **Prompt**: "Separate the features into categorical and numerical columns."
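
A possible split by dtype, shown on a hypothetical frame. Treating every non-numerical column as categorical is an assumption that fits this dataset, where `diagnosis` is the only text column.

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.0, 13.0, 20.0],
    "texture_mean": [10.0, 15.0, 14.0, 17.0],
})

# Numerical columns are selected by dtype; everything else is treated as categorical.
numerical_features = df.select_dtypes(include="number").columns.tolist()
categorical_features = df.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_features)
print("Numerical:", numerical_features)
```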


### 9. **Transform categorical values in the 'diagnosis' column**
   - **Instruction**: Map categorical values of `diagnosis` ('M' and 'B') to 1 and 0.
   - **Prompt**: "Map the 'diagnosis' column to binary values, where 'M' is 1 and 'B' is 0."
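
The mapping can be sketched with `Series.map()` and an explicit dictionary (toy data below):

```python
import pandas as pd

# Toy stand-in for `df`; values are illustrative only.
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Any label missing from the dictionary would become NaN,
# which makes unexpected values easy to spot.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
print(df["diagnosis"].tolist())
```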


### 10. **Plot distribution of all features**
   - **Instruction**: Plot the distribution of all numerical features using histograms.
   - **Prompt**: "Plot distribution histograms for all numerical features in the DataFrame."
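
A sketch using `DataFrame.hist()`, which draws one histogram per numerical column. The random data is a hypothetical stand-in for `df`, and the bin count is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `df`; distributions are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "radius_mean": rng.normal(14, 3, 100),
    "texture_mean": rng.normal(19, 4, 100),
    "area_mean": rng.normal(650, 150, 100),
})

# Returns a 2-D array of Axes, one subplot per numerical column.
axes = df.hist(bins=20, figsize=(10, 6))
plt.tight_layout()
```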


### 11. **Create a pairplot for mean features**
   - **Instruction**: Generate a pairplot to visualize relationships between all mean features and color by the `diagnosis` column.
   - **Prompt**: "Generate a pairplot of all the mean features, with the points colored by 'diagnosis'."


### 12. **Show diagnosis value counts as pie chart**
   - **Instruction**: Display a pie chart showing the counts of benign and malignant diagnoses.
   - **Prompt**: "Plot a pie chart to show the distribution of the 'diagnosis' values."
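
A sketch of the pie chart. Instead of reading the file, it rebuilds the `diagnosis` column from the class distribution given in the description (357 benign, 212 malignant).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Class counts taken from the dataset description: 357 benign, 212 malignant.
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")
counts = diagnosis.value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%", startangle=90)
ax.set_title("Diagnosis distribution")
```

The classes are moderately imbalanced, which is one reason to look at F1 and the confusion matrix later rather than accuracy alone.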


### 13. **Boxplots for numerical features by diagnosis**
   - **Instruction**: Create boxplots for each numerical feature based on the `diagnosis`.
   - **Prompt**: "Generate boxplots for all numerical features grouped by 'diagnosis'."
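
One way to lay this out is a row of subplots, one boxplot per feature, sketched here on synthetic data that stands in for `df`:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for `df`; values are illustrative only.
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "diagnosis": ["M"] * n + ["B"] * n,
    "radius_mean": np.concatenate([rng.normal(17, 2, n), rng.normal(12, 2, n)]),
    "texture_mean": np.concatenate([rng.normal(21, 3, n), rng.normal(18, 3, n)]),
})

numerical_features = df.select_dtypes(include="number").columns.tolist()

# squeeze=False keeps `axes` two-dimensional even with a single feature.
fig, axes = plt.subplots(
    1, len(numerical_features),
    figsize=(5 * len(numerical_features), 4),
    squeeze=False,
)
for ax, feature in zip(axes.ravel(), numerical_features):
    sns.boxplot(data=df, x="diagnosis", y=feature, ax=ax)
    ax.set_title(feature)
fig.tight_layout()
```

For the full set of 30 features, a grid with several rows would be easier to read than a single row.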


### 14. **Correlation heatmap**
   - **Instruction**: Plot a correlation heatmap of all features in the DataFrame.
   - **Prompt**: "Plot a heatmap showing the correlation between all features."
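
A sketch of the correlation heatmap on a hypothetical frame. The perimeter column is built as 2πr from the radius, so the two are perfectly correlated by construction, mirroring the strong radius/perimeter/area relationships expected in this kind of data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for `df`; values are illustrative only.
radius = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
df = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius,  # exact linear function of radius
    "smoothness_mean": [0.10, 0.08, 0.11, 0.09, 0.10],
})

# Pearson correlation between every pair of numerical columns.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlation")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale centered on zero, so positive and negative correlations are comparable at a glance.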


### 15. **Train a Support Vector Machine (SVM) model**
   - **Instruction**: Train an SVM classifier to predict `diagnosis`.
   - **Prompt**: "Train a linear SVM model using the 'diagnosis' column as the target."
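
A minimal sketch of the training step, under several assumptions that are not part of this notebook: the copy of the dataset bundled with scikit-learn stands in for the prepared `df`, features are standardized before fitting, and the split size and random seed are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the prepared `df`: scikit-learn ships a copy of this dataset.
data = load_breast_cancer(as_frame=True)
X = data.data
# scikit-learn encodes malignant as 0; flip so that 1 = malignant, as in step 9.
y = 1 - data.target

# Stratify to keep the benign/malignant ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: scale the features first, since SVMs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
```

With the notebook's own data, `X` would be `df.drop(columns="diagnosis")` and `y` would be `df["diagnosis"]` after the mapping in step 9.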

