# Module 8 Exercise 1 - Feature selection for breast cancer survival using association rule mining

## Overview
In this exercise, you will use association rule mining to find rules and features in a breast cancer registry dataset that might be useful for prediction.  The goal is to predict the survival months of breast cancer patients from the other relevant fields in the file.


## Data Format
The data in this exercise is a breast cancer dataset following the coding rules of the SEER cancer registry file.  The patients in this file are from a single state, Louisiana.

Fields:
* Patient ID - sequential identifier for the patient in this file
* Marital - marital status, see supplemental documentation
* Race - see supplemental documentation
* SEX
    * 1 - Male
    * 2 - Female
* AGE - age in years at diagnosis
    * 000-130 - Actual age in years
    * 999 - Unknown
* [Primary Site](https://staging.seer.cancer.gov/cs/input/02.05.50/breast/site/?breadcrumbs=(~schema_list~),(~view_schema~,~breast~)) - see supplemental documentation
* [Tumor Size](https://staging.seer.cancer.gov/cs/input/02.05.50/breast/size/?breadcrumbs=(~schema_list~),(~view_schema~,~breast~)) - largest dimension of the primary tumor in millimeters.  
    * 000 - No mass; no tumor found; no Paget’s disease
    * 001 - Microscopic focus or foci only
    * 002 - Mammography/xerography diagnosis only with no size given (tumor not clinically palpable)
    * 003 - <= 3 mm
    * 004-989 - size in millimeters
    * 990 - Microinvasion - Microscopic focus or foci only and no size given - Described as "less than 1 mm" - Stated as T1mi  with no other information on tumor size
    * 991 - Described as "less than 1 centimeter (cm)" - Stated as T1b  with no other information on tumor size
    * 992 - Described as "less than 2 cm," or "greater than 1 cm," or "between 1 cm and 2 cm" - Stated as T1 [NOS] or T1c [NOS] with no other information on tumor size
    * 993 - Described as "less than 3 cm," or "greater than 2 cm," or "between 2 cm and 3 cm"
    * 994 - Described as "less than 4 cm," or "greater than 3 cm," or "between 3 cm and 4 cm"
    * 995 - Described as "less than 5 cm," or "greater than 4 cm," or "between 4 cm and 5 cm" - Stated as T2 with no other information on tumor size
    * 996 - Mammographic/xerographic diagnosis only, no size given; clinically not palpable
    * 997 - Paget’s Disease of nipple with no demonstrable tumor
    * 998 - Diffuse; widespread: 3/4’s or more of breast; inflammatory carcinoma
    * 999 - Unknown
* FIPS - State and county FIPS code
* Survival month - months of survival
    * 000-9998 - Actual survival in months
    * 9999 - Unknown
* Vital Status
    * 1 - alive
    * 4 - dead
    
[Supplemental documentation of codes](../resources/dictionary.pdf)

## Required Output
The output from this exercise will be submitted on Canvas in the form of a written response to the assignment submission, as a PDF.  In the response, you will include the following 6 sections:

1. Describe your itemization process for each field.  How did you choose the number of bins for the continuous data?   Did you re-itemize already categorical data, and if so, how and why?
1. Describe the features that you dropped from the dataset, and why.
1. Discuss the number of itemized columns you ended up with after your one-hot encoding.
1. List the number of frequent itemsets and association rules you ended up with _before_ finding the "interesting" rules.
1. List the "interesting" rules you discovered.  Discuss them in context of these metrics:
    * Support
    * Lift
    * Conviction
1. List the features from your "interesting" rules that might be useful for predictions or could be used in further modeling.
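
As a reference for the metrics named in the list above, here is a minimal sketch of how support, confidence, lift, leverage, and conviction relate to each other for a rule A -> C. The function name `rule_metrics` and the input values are illustrative assumptions, not part of the assignment; the formulas are the standard association rule definitions.

```python
import math

def rule_metrics(support_a: float, support_c: float, support_ac: float) -> dict:
    """Compute standard metrics for a rule A -> C from three support values.

    support_a  : fraction of rows containing the antecedent A
    support_c  : fraction of rows containing the consequent C
    support_ac : fraction of rows containing both A and C
    """
    confidence = support_ac / support_a
    lift = confidence / support_c                 # 1.0 means A and C look independent
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence of 1).
    conviction = math.inf if confidence >= 1 else (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Illustrative support values for a single rule.
m = rule_metrics(support_a=0.278292, support_c=0.133096, support_ac=0.076157)
print(m)
```

A lift near 1 suggests the antecedent tells you little about the consequent, while values well above or below 1 indicate a positive or negative association.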

## Grading
You will be graded on the completeness of your report.  You must answer all of the above questions and include any necessary data to explain your answers.

In addition, you will be graded on the completeness of your code. Your Jupyter notebook must run to completion from a restarted kernel and produce the proper output, using only the data provided or produced in these exercises. In other words, you cannot do work outside of the Jupyter notebook, save it to a file, and then use that file in this project.



### Rubric

The report is worth 24 points, 4 points for each section.  Any missing section will result in the loss of 3 points.  Any incomplete section will result in the loss of 2 points.

The Jupyter notebook is worth 25 points.  Failure to submit a functioning notebook will result in a deduction of 25 points. You will be graded on having the proper outputs in these sections:
* Find interesting rules (10 or fewer)
* List the unique antecedent features

Each section is worth 5 points.  Partial points will be applied if there is an issue with your output.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import sys
!{sys.executable} -m pip install mlxtend
import mlxtend

Collecting mlxtend
  Downloading https://files.pythonhosted.org/packages/2a/4f/11a257bc17f675691080219c6fe3525e49c7077535c3d64c0c2afc79cfc9/mlxtend-0.19.0-py2.py3-none-any.whl (1.3MB)
Collecting joblib>=0.13.2 (from mlxtend)
  Downloading https://files.pythonhosted.org/packages/3e/d5/0163eb0cfa0b673aa4fe1cd3ea9d8a81ea0f32e50807b0c295871e4aab2e/joblib-1.1.0-py2.py3-none-any.whl (306kB)
Installing collected packages: joblib, mlxtend
Successfully installed joblib-1.1.0 mlxtend-0.19.0


In [2]:
data = pd.read_csv('../resources/breastcancer.csv')
display(data.head())

Unnamed: 0,Patient ID,Marital,Race,SEX,AGE,Primary Site,Tumor Size,FIPS,Survival month,Vital Status
0,1,5,1,2,77,C508,11,22051,112,1
1,2,1,2,2,42,C508,20,22071,102,4
2,3,2,2,2,74,C504,6,22089,107,4
3,4,5,1,2,81,C504,6,22117,91,4
4,5,5,2,2,55,C508,25,22071,108,1


In [3]:
data

Unnamed: 0,Patient ID,Marital,Race,SEX,AGE,Primary Site,Tumor Size,FIPS,Survival month,Vital Status
0,1,5,1,2,77,C508,11,22051,112,1
1,2,1,2,2,42,C508,20,22071,102,4
2,3,2,2,2,74,C504,6,22089,107,4
3,4,5,1,2,81,C504,6,22117,91,4
4,5,5,2,2,55,C508,25,22071,108,1
...,...,...,...,...,...,...,...,...,...,...
1400,1401,2,2,2,75,C504,15,22119,45,4
1401,1402,1,2,2,48,C504,70,22017,39,4
1402,1403,1,2,2,40,C509,999,22071,87,4
1403,1404,5,1,2,77,C509,6,22103,110,1


## Itemize the data
The data should be itemized in preparation for ARM.  Apply appropriate itemization techniques here to categorize continuous or labeled data using techniques from lab 3, and/or convert already categorical data into smaller sets of categories using `apply`.
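
One way to re-itemize a coded field with `apply` is to map each code to a coarse label. The sketch below does this for Tumor Size using the codes listed in the Data Format section. The function name `itemize_tumor_size`, the label names, and the 20 mm and 50 mm cut points are illustrative assumptions, not requirements of the exercise.

```python
import pandas as pd

def itemize_tumor_size(code: int) -> str:
    """Map a Tumor Size code to a coarse label (cut points are illustrative)."""
    if code == 999:
        return "unknown"
    # Codes that describe a finding but carry no usable size.
    if code in (0, 2, 996, 997, 998):
        return "no_measured_size"
    # 990-992 are descriptive codes for sizes under 2 cm.
    if code in (990, 991, 992) or 1 <= code <= 20:
        return "le_20mm"
    # 993-995 are descriptive codes for sizes between 2 cm and 5 cm.
    if code in (993, 994, 995) or 21 <= code <= 50:
        return "21_50mm"
    return "gt_50mm"

# A few sample codes, not the full dataset.
sizes = pd.Series([11, 20, 6, 25, 999, 70, 995])
labels = sizes.apply(itemize_tumor_size)
print(labels.tolist())
```

Labeling the special codes explicitly keeps values such as 999 from being treated as very large tumors when the column is later binned or one-hot encoded.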

In [4]:
# your code here

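# Primary Site holds text codes such as 'C508'; map each distinct code to a number.
text_columns = ['Primary Site']
encoder = preprocessing.OrdinalEncoder()
new_data = pd.DataFrame(encoder.fit_transform(data[text_columns]), columns=text_columns) # itemizing the categorical data
# The joined column clashes with the original name, so it is stored as 'Primary Site Enc'.
data = data.join(new_data, rsuffix=' Enc')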
display(data.head(30))

Unnamed: 0,Patient ID,Marital,Race,SEX,AGE,Primary Site,Tumor Size,FIPS,Survival month,Vital Status,Primary Site Enc
0,1,5,1,2,77,C508,11,22051,112,1,7.0
1,2,1,2,2,42,C508,20,22071,102,4,7.0
2,3,2,2,2,74,C504,6,22089,107,4,4.0
3,4,5,1,2,81,C504,6,22117,91,4,4.0
4,5,5,2,2,55,C508,25,22071,108,1,7.0
5,6,5,1,2,97,C509,999,22117,13,4,8.0
6,7,3,2,2,52,C509,999,22109,81,4,8.0
7,8,5,2,2,92,C508,999,22071,23,4,7.0
8,9,5,1,2,77,C508,999,22087,113,1,7.0
9,10,2,1,2,65,C504,6,22105,113,1,4.0


In [5]:
# continuous = ['AGE', 'FIPS', 'Tumor Size', 'Survival month']
# Each column is split into equal-width bins (strategy='uniform') and stored as an ordinal bin index.
# NOTE: the 'unknown' sentinel codes (999 for Tumor Size, 9999 for Survival month) are not
# removed first, so they stretch the bin range and most known values fall into bin 0.
discrete1 = preprocessing.KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform').fit_transform(data['AGE'].values.reshape(-1,1))

data['AGE Enc'] = discrete1

# FIPS is a county code rather than a measurement; the bins only group numerically close codes.
discrete2 = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform').fit_transform(data['FIPS'].values.reshape(-1,1))

data['FIPS Enc'] = discrete2

discrete3 = preprocessing.KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform').fit_transform(data['Tumor Size'].values.reshape(-1,1))

data['Tumor Size Enc'] = discrete3

discrete4 = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform').fit_transform(data['Survival month'].values.reshape(-1,1))

data['Survival month Enc'] = discrete4


display(data.head(30))

Unnamed: 0,Patient ID,Marital,Race,SEX,AGE,Primary Site,Tumor Size,FIPS,Survival month,Vital Status,Primary Site Enc,AGE Enc,FIPS Enc,Tumor Size Enc,Survival month Enc
0,1,5,1,2,77,C508,11,22051,112,1,7.0,2.0,1.0,0.0,0.0
1,2,1,2,2,42,C508,20,22071,102,4,7.0,0.0,2.0,0.0,0.0
2,3,2,2,2,74,C504,6,22089,107,4,4.0,2.0,3.0,0.0,0.0
3,4,5,1,2,81,C504,6,22117,91,4,4.0,3.0,4.0,0.0,0.0
4,5,5,2,2,55,C508,25,22071,108,1,7.0,1.0,2.0,0.0,0.0
5,6,5,1,2,97,C509,999,22117,13,4,8.0,3.0,4.0,9.0,0.0
6,7,3,2,2,52,C509,999,22109,81,4,8.0,1.0,4.0,9.0,0.0
7,8,5,2,2,92,C508,999,22071,23,4,7.0,3.0,2.0,9.0,0.0
8,9,5,1,2,77,C508,999,22087,113,1,7.0,2.0,3.0,9.0,0.0
9,10,2,1,2,65,C504,6,22105,113,1,4.0,2.0,4.0,0.0,0.0
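
In the output above, every row shown lands in `Survival month Enc` bin 0.0, which is what equal-width binning produces when the 9999 "unknown" code is left in the column. A possible alternative, shown here as a sketch on a small hand-made sample rather than the real file, is to mask the sentinel first and bin only the known values by quantile. The three bins and their labels are assumptions chosen for illustration.

```python
import pandas as pd

# Small illustrative sample; 9999 is the documented "unknown" code.
months = pd.Series([112, 102, 107, 91, 108, 13, 81, 23, 9999, 45])

# Replace the sentinel with NaN so it does not distort the bin edges.
known = months.where(months != 9999)

# Quantile bins put roughly the same number of known values in each bin.
binned = pd.qcut(known, q=3, labels=["short", "medium", "long"])

# Keep unknown values as their own item instead of dropping the rows.
items = binned.cat.add_categories(["unknown"]).fillna("unknown")
print(items.value_counts())
```

The same pattern could be applied to Tumor Size with its 999 code, although that field has several other special codes that may deserve their own labels.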


## Find the frequent itemsets
Get the data into the correct format.  Also, you should consider whether all of the features are useful for creating association rules and drop those that are not.

Use a min_support of 0.005.  If you have a lot of columns, this might take a few minutes to run.
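
To make the threshold concrete: the support of an itemset is the fraction of rows in which all of its items are present, and only itemsets at or above `min_support` are kept. The sketch below computes this by hand with pandas on a tiny made-up one-hot table; `onehot_demo` and the helper `support` are hypothetical names used only for illustration.

```python
import pandas as pd

# Tiny made-up one-hot table (8 rows), not the real dataset.
onehot_demo = pd.DataFrame({
    "SEX_2":     [1, 1, 1, 1, 0, 1, 1, 1],
    "Marital_1": [1, 0, 0, 1, 0, 0, 1, 0],
    "Race_2":    [1, 0, 1, 1, 0, 0, 0, 0],
}).astype(bool)

def support(onehot: pd.DataFrame, items: list) -> float:
    """Fraction of rows in which every item in `items` is present."""
    return onehot[list(items)].all(axis=1).mean()

min_support = 0.005
s_pair = support(onehot_demo, ["Marital_1", "Race_2"])   # 2 of 8 rows
s_single = support(onehot_demo, ["SEX_2"])               # 7 of 8 rows
print(s_pair, s_pair >= min_support)
print(s_single, s_single >= min_support)
```

A low threshold such as 0.005 admits itemsets that occur in only a handful of rows, which is why the number of frequent itemsets can grow quickly as columns are added.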

In [6]:
# your code here

# Drop the patient identifier and the raw columns that now have encoded ('Enc') counterparts.
data = data.drop(columns=['Patient ID', 'AGE', 'Primary Site', 'Tumor Size', 'FIPS', 'Survival month'])
data

Unnamed: 0,Marital,Race,SEX,Vital Status,Primary Site Enc,AGE Enc,FIPS Enc,Tumor Size Enc,Survival month Enc
0,5,1,2,1,7.0,2.0,1.0,0.0,0.0
1,1,2,2,4,7.0,0.0,2.0,0.0,0.0
2,2,2,2,4,4.0,2.0,3.0,0.0,0.0
3,5,1,2,4,4.0,3.0,4.0,0.0,0.0
4,5,2,2,1,7.0,1.0,2.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
1400,2,2,2,4,4.0,2.0,4.0,0.0,0.0
1401,1,2,2,4,4.0,1.0,0.0,0.0,0.0
1402,1,2,2,4,8.0,0.0,2.0,9.0,0.0
1403,5,1,2,1,8.0,2.0,4.0,0.0,0.0


In [7]:
onehot = pd.get_dummies(data, columns = data.columns)
display(onehot.head())

Unnamed: 0,Marital_1,Marital_2,Marital_3,Marital_4,Marital_5,Marital_9,Race_1,Race_2,Race_3,Race_4,...,FIPS Enc_2.0,FIPS Enc_3.0,FIPS Enc_4.0,Tumor Size Enc_0.0,Tumor Size Enc_1.0,Tumor Size Enc_2.0,Tumor Size Enc_6.0,Tumor Size Enc_9.0,Survival month Enc_0.0,Survival month Enc_4.0
0,0,0,0,0,1,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,1,0,0,0,0,0,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0
2,0,1,0,0,0,0,0,1,0,0,...,0,1,0,1,0,0,0,0,1,0
3,0,0,0,0,1,0,1,0,0,0,...,0,0,1,1,0,0,0,0,1,0
4,0,0,0,0,1,0,0,1,0,0,...,1,0,0,1,0,0,0,0,1,0


In [8]:
from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(onehot, min_support=0.005, use_colnames=True)

# Show only the itemsets that contain at least 8 items.
mask = frequent_itemsets['itemsets'].apply(lambda x: len(x) >= 8)
frequent_itemsets[mask]

Unnamed: 0,support,itemsets
16407,0.007829,"(SEX_2, AGE Enc_1.0, Marital_1, Tumor Size Enc..."
16408,0.007829,"(SEX_2, AGE Enc_1.0, Marital_1, Tumor Size Enc..."
16409,0.007117,"(SEX_2, AGE Enc_1.0, Marital_1, Tumor Size Enc..."
16410,0.005694,"(FIPS Enc_0.0, SEX_2, AGE Enc_1.0, Marital_1, ..."
16411,0.009253,"(Marital_2, SEX_2, AGE Enc_1.0, Race_1, Primar..."
...,...,...
16644,0.005694,"(Marital_2, SEX_2, AGE Enc_1.0, Race_1, Primar..."
16645,0.005694,"(Marital_2, FIPS Enc_0.0, SEX_2, Race_1, Prima..."
16646,0.007117,"(Marital_2, SEX_2, FIPS Enc_1.0, Race_1, Prima..."
16647,0.005694,"(Marital_2, FIPS Enc_0.0, SEX_2, Primary Site ..."


## Find the association rules
Use confidence as the metric with a minimum threshold of 0.25.  If you have a lot of frequent itemsets, this might take a few minutes to run.

In [9]:
# your code here

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.25)
display(rules.head(50))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Marital_1),(Race_1),0.133096,0.713879,0.054804,0.411765,0.576799,-0.04021,0.486406
1,(Race_2),(Marital_1),0.278292,0.133096,0.076157,0.273657,2.056088,0.039117,1.193519
2,(Marital_1),(Race_2),0.133096,0.278292,0.076157,0.572193,2.056088,0.039117,1.686993
3,(Marital_1),(SEX_2),0.133096,0.986477,0.130961,0.983957,0.997446,-0.000335,0.842942
4,(Marital_1),(Vital Status_1),0.133096,0.626335,0.074733,0.561497,0.896482,-0.00863,0.85214
5,(Marital_1),(Vital Status_4),0.133096,0.373665,0.058363,0.438503,1.173517,0.00863,1.115472
6,(Marital_1),(Primary Site Enc_4.0),0.133096,0.303203,0.041993,0.315508,1.040584,0.001638,1.017977
7,(Marital_1),(Primary Site Enc_8.0),0.133096,0.211388,0.034164,0.256684,1.214282,0.006029,1.060939
8,(AGE Enc_0.0),(Marital_1),0.081851,0.133096,0.021352,0.26087,1.960009,0.010458,1.17287
9,(Marital_1),(AGE Enc_1.0),0.133096,0.417082,0.07331,0.550802,1.320609,0.017798,1.297687


## Find rules that predict survival
Using the techniques from the labs, find rules whose consequents could be used to predict survival months.

In [10]:
# your code here

survival_names = [x for x in onehot.columns if 'Survival month' in x]
# Keep rules whose consequent is exactly one survival item.
mask = [len(c) == 1 and not c.isdisjoint(survival_names) for c in rules.consequents.values]
survival_rules = rules[mask]
display(survival_rules)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
12,(Marital_1),(Survival month Enc_0.0),0.133096,0.989324,0.131673,0.989305,0.999981,-0.000003,0.998221
45,(Marital_2),(Survival month Enc_0.0),0.506050,0.989324,0.503203,0.994374,1.005105,0.002556,1.897687
51,(Marital_3),(Survival month Enc_0.0),0.010676,0.989324,0.007829,0.733333,0.741247,-0.002733,0.040036
63,(Marital_4),(Survival month Enc_0.0),0.093950,0.989324,0.092527,0.984848,0.995476,-0.000420,0.704626
78,(Marital_5),(Survival month Enc_0.0),0.211388,0.989324,0.209253,0.989899,1.000581,0.000122,1.056940
...,...,...,...,...,...,...,...,...,...
202181,"(Marital_2, SEX_2, AGE Enc_1.0, Race_1, Primar...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.000000,1.010791,0.000061,inf
202296,"(Marital_2, SEX_2, FIPS Enc_0.0, Race_1, Prima...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.000000,1.010791,0.000061,inf
202411,"(Marital_2, SEX_2, FIPS Enc_1.0, Race_1, Prima...",(Survival month Enc_0.0),0.007117,0.989324,0.007117,1.000000,1.010791,0.000076,inf
202542,"(Primary Site Enc_8.0, Marital_2, SEX_2, FIPS ...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.000000,1.010791,0.000061,inf


## Find interesting rules
Using the techniques from the labs and practice exercise, find and print the rules from the previous selection (those that could be used to predict survival) that are interesting and useful for predictions.

Reduce the total number of interesting rules you find to 10 or fewer.
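
One possible way to narrow a rule table down, sketched here on a small made-up table, is to require a minimum support and then rank rules by how far their lift is from 1. The thresholds `min_support` and `min_lift_gap`, the column `lift_gap`, and the rows of `rules_demo` are all hypothetical; they are not values prescribed by the labs.

```python
import pandas as pd

# Made-up rules table with the same column names mlxtend produces.
rules_demo = pd.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"B"}),
                    frozenset({"A", "B"}), frozenset({"C"})],
    "consequents": [frozenset({"Y"})] * 4,
    "support":     [0.30, 0.006, 0.05, 0.20],
    "confidence":  [0.99, 0.60, 0.65, 1.00],
    "lift":        [1.00, 0.61, 0.66, 1.01],
})

min_support = 0.01     # hypothetical: ignore rules backed by very few rows
min_lift_gap = 0.25    # hypothetical: require lift to differ clearly from 1

# Distance of lift from 1 treats strong negative and positive associations alike.
gap = (rules_demo["lift"] - 1).abs()
keep = (rules_demo["support"] >= min_support) & (gap >= min_lift_gap)

interesting = (
    rules_demo[keep]
    .assign(lift_gap=gap)
    .sort_values("lift_gap", ascending=False)
    .head(10)
)
print(interesting)
```

Sorting on lift alone can surface rules with infinite conviction but lift close to 1, so combining more than one metric tends to give a more defensible short list.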

In [14]:
# your code here

rules1 = survival_rules.sort_values(by=['lift'], ascending=False)
rules1 = rules1.head(10)
rules1

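# less interesting: every rule here has a lift of about 1.01, because the consequent
# (Survival month Enc_0.0) already covers about 98.9% of the rows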

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
69428,"(AGE Enc_3.0, Primary Site Enc_8.0, SEX_2, FIP...",(Survival month Enc_0.0),0.007117,0.989324,0.007117,1.0,1.010791,7.6e-05,inf
102009,"(AGE Enc_1.0, Race_1, Tumor Size Enc_0.0, Vita...",(Survival month Enc_0.0),0.029181,0.989324,0.029181,1.0,1.010791,0.000312,inf
102254,"(AGE Enc_1.0, Race_1, Primary Site Enc_7.0, Tu...",(Survival month Enc_0.0),0.007829,0.989324,0.007829,1.0,1.010791,8.4e-05,inf
102235,"(Race_1, Tumor Size Enc_0.0, Primary Site Enc_...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.0,1.010791,6.1e-05,inf
102215,"(FIPS Enc_0.0, Race_1, Tumor Size Enc_0.0, Pri...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.0,1.010791,6.1e-05,inf
102198,"(Race_1, Tumor Size Enc_0.0, AGE Enc_2.0, Prim...",(Survival month Enc_0.0),0.006406,0.989324,0.006406,1.0,1.010791,6.8e-05,inf
102179,"(AGE Enc_1.0, Race_1, Tumor Size Enc_0.0, Prim...",(Survival month Enc_0.0),0.010676,0.989324,0.010676,1.0,1.010791,0.000114,inf
102161,"(Race_1, Primary Site Enc_2.0, Tumor Size Enc_...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.0,1.010791,6.1e-05,inf
102138,"(Race_1, Tumor Size Enc_0.0, AGE Enc_2.0, Vita...",(Survival month Enc_0.0),0.010676,0.989324,0.010676,1.0,1.010791,0.000114,inf
102123,"(Race_1, Tumor Size Enc_0.0, FIPS Enc_3.0, Vit...",(Survival month Enc_0.0),0.005694,0.989324,0.005694,1.0,1.010791,6.1e-05,inf


In [19]:
rules2 = survival_rules.sort_values(by=['lift'], ascending=True)
rules2 = rules2.head(10)
rules2

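# more interesting: lift is well below 1, so these antecedents make
# Survival month Enc_0.0 less likely than its 98.9% base rate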

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
179033,"(Primary Site Enc_8.0, SEX_2, Tumor Size Enc_9...",(Survival month Enc_0.0),0.009964,0.989324,0.005694,0.571429,0.577595,-0.004164,0.024911
123240,"(Primary Site Enc_8.0, SEX_2, Tumor Size Enc_9...",(Survival month Enc_0.0),0.010676,0.989324,0.006406,0.6,0.606475,-0.004156,0.02669
126864,"(Primary Site Enc_8.0, Tumor Size Enc_9.0, Rac...",(Survival month Enc_0.0),0.010676,0.989324,0.006406,0.6,0.606475,-0.004156,0.02669
136507,"(Primary Site Enc_8.0, SEX_2, Tumor Size Enc_9...",(Survival month Enc_0.0),0.012811,0.989324,0.007829,0.611111,0.617706,-0.004845,0.027453
71890,"(Primary Site Enc_8.0, Vital Status_4, Tumor S...",(Survival month Enc_0.0),0.012811,0.989324,0.007829,0.611111,0.617706,-0.004845,0.027453
59394,"(AGE Enc_3.0, Primary Site Enc_8.0, Tumor Size...",(Survival month Enc_0.0),0.011388,0.989324,0.007117,0.625,0.631745,-0.004149,0.02847
71851,"(AGE Enc_3.0, Primary Site Enc_8.0, Vital Stat...",(Survival month Enc_0.0),0.013523,0.989324,0.008541,0.631579,0.638395,-0.004838,0.028978
69457,"(AGE Enc_3.0, Primary Site Enc_8.0, SEX_2, Tum...",(Survival month Enc_0.0),0.013523,0.989324,0.008541,0.631579,0.638395,-0.004838,0.028978
136588,"(Primary Site Enc_8.0, SEX_2, Tumor Size Enc_9...",(Survival month Enc_0.0),0.0121,0.989324,0.007829,0.647059,0.654041,-0.004141,0.030249
23451,"(AGE Enc_3.0, Primary Site Enc_8.0, Tumor Size...",(Survival month Enc_0.0),0.014235,0.989324,0.009253,0.65,0.657014,-0.00483,0.030503


## List the unique antecedent features
From the list of rules you found above, find and print the unique set of antecedent features.

In [20]:
# your code here

frozenset.union(*rules2['antecedents'])

frozenset({'AGE Enc_3.0',
           'FIPS Enc_2.0',
           'Primary Site Enc_8.0',
           'Race_1',
           'SEX_2',
           'Tumor Size Enc_9.0',
           'Vital Status_4'})
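
The items above still carry their category or bin suffix (for example `_3.0`). If the goal is a list of source columns for later modeling, one option is to strip everything after the last underscore. This is a minimal sketch that assumes the `column_value` naming produced by `pd.get_dummies`; the variable names are illustrative.

```python
# The unique antecedent items printed above.
antecedent_items = frozenset({
    'AGE Enc_3.0', 'FIPS Enc_2.0', 'Primary Site Enc_8.0',
    'Race_1', 'SEX_2', 'Tumor Size Enc_9.0', 'Vital Status_4',
})

# Split on the last underscore only, so spaces in column names are preserved.
features = sorted({item.rsplit("_", 1)[0] for item in antecedent_items})
print(features)
```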

## Discuss your findings
Discuss your rules and selected features in your report.