# Introduction

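Perform univariate feature selection on the cleaned cancer dataset, scoring each feature against the `malignant` target with chi-squared and ANOVA F-tests.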

### Imports
Import libraries and configure plotting, logging, and display settings.

In [2]:
import matplotlib 
# Specify renderer
# matplotlib.use('Agg')

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Boiler-plate settings for producing pub-quality figures
# 1 point = 1/72 inch
from cycler import cycler
matplotlib.rcParams['axes.prop_cycle'] = cycler(color='bgrcmyk')
matplotlib.rcParams.update({'figure.figsize': (8, 5)    # inches
                            , 'font.size': 22      # points
                            , 'legend.fontsize': 16      # points
                            , 'lines.linewidth': 1.5       # points
                            , 'axes.linewidth': 1.5       # points
                            , 'text.usetex': True    # Use LaTeX to layout text
                            , 'font.family': "serif"  # Use serifed fonts
                            , 'xtick.major.size': 10     # length, points
                            , 'xtick.major.width': 1.5     # points
                            , 'xtick.minor.size': 6     # length, points
                            , 'xtick.minor.width': 1     # points
                            , 'ytick.major.size': 10     # length, points
                            , 'ytick.major.width': 1.5     # points
                            , 'ytick.minor.size': 6     # length, points
                            , "xtick.minor.visible": True
                            , "ytick.minor.visible": True
                            , 'font.weight': 'bold'
                            , 'ytick.minor.width': 1     # points
                            , 'font.serif': ("Times", "Palatino", "Computer Modern Roman", "New Century Schoolbook", "Bookman")
                            , 'font.sans-serif': ("Helvetica", "Avant Garde", "Computer Modern Sans serif")
                            , 'font.monospace': ("Courier", "Computer Modern Typewriter")
                            , 'font.cursive': "Zapf Chancery"
                            })


import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
from sklearn.preprocessing import MinMaxScaler

from sklearn.feature_selection import SelectFromModel, VarianceThreshold, RFE
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, average_precision_score, precision_score, recall_score, confusion_matrix, precision_recall_curve, roc_curve, roc_auc_score

In [4]:
# Show the result of every expression in a cell, not just the last one. This lets us condense several outputs into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [5]:
import seaborn as sns
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("whitegrid")

In [6]:
import sys
import logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)

%load_ext autoreload
%autoreload 2

import numpy as np
import scipy as sp

# ML
import sklearn

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

#Set the display format to be scientific for ease of analysis
# pd.options.display.float_format = '{:,.2g}'.format

In [7]:
%load_ext watermark
%watermark -v -h -n -g -m -p jupyerlab,numpy,scipy,sklearn,pandas,matplotlib

Fri Jun 26 2020 

CPython 3.7.6
IPython 7.13.0

jupyerlab not installed
numpy 1.18.1
scipy 1.4.1
sklearn 0.22.1
pandas 1.0.3
matplotlib 3.1.3

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.5.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
host name  : C02X61QTJHD5
Git hash   : 76974f1c37147a66bf15e5a686b2b424675dd250


In [8]:
df = pd.read_csv('../data/cleaned/cancer.csv')
df.head()

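| Unnamed: 0 | id | clumpthickness | uniformityofcellsize | uniformityofcellshape | marginaladhesion | singleepithelialcellsize | barenuclei | blandchromatin | normalnucleoli | mitoses | malignant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1241035 | 7 | 8.0 | 3.0 | 7.0 | 4.0 | 5.0 | 7.0 | 8.0 | 2.0 | 0 |
| 1 | 1107684 | 6 | 10.0 | 5.0 | 5.0 | 4.0 | 10.0 | 6.0 | 10.0 | 1.0 | 0 |
| 2 | 691628 | 8 | 6.0 | 4.0 | 10.0 | 10.0 | 1.0 | 3.0 | 5.0 | 1.0 | 0 |
| 3 | 1226612 | 7 | 5.0 | 6.0 | 3.0 | 3.0 | 8.0 | 7.0 | 4.0 | 1.0 | 0 |
| 4 | 1142706 | 5 | 10.0 | 10.0 | 10.0 | 6.0 | 10.0 | 6.0 | 5.0 | 2.0 | 0 |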


# Analysis/Modeling
Univariate feature selection picks the best features using univariate statistical tests. Each feature is compared with the target variable on its own to check whether there is a statistically significant relationship between them. The other features are ignored during that comparison, which is why the method is called ‘univariate’. Every feature therefore gets its own test score. For a classification target, a common choice of test is the analysis of variance (ANOVA) F-test.

https://towardsdatascience.com/feature-selection-using-python-for-classification-problem-b5f00a1c7028

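In regression, the (ANOVA) F-test compares the fits of different linear models. Unlike a t-test, which can assess only one coefficient at a time, an F-test can assess multiple coefficients simultaneously. For classification, scikit-learn's `f_classif` computes the ANOVA F-value of each feature across the target classes.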

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
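To make the link between the F-test and per-feature scoring concrete, the sketch below checks that `f_classif` returns the same statistic as a one-way ANOVA (`scipy.stats.f_oneway`) run on each feature separately. The data is synthetic, and the names `X_demo` and `y_demo` are illustrative only; they are not part of this notebook's dataset.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Synthetic two-class data (an assumption for illustration only)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    rng.normal(loc=2.0 * y_demo, scale=1.0),   # mean shifts with the class
    rng.normal(loc=0.0, scale=1.0, size=200),  # unrelated to the class
])

# Scores from scikit-learn: one F-value and one p-value per feature
f_scores, f_pvalues = f_classif(X_demo, y_demo)

# The same test done by hand: a one-way ANOVA per feature,
# comparing that feature's values in class 0 with those in class 1
manual_f = np.array([
    f_oneway(X_demo[y_demo == 0, j], X_demo[y_demo == 1, j]).statistic
    for j in range(X_demo.shape[1])
])

print(f_scores)
print(manual_f)
```

The feature whose mean depends on the class should score far higher than the unrelated one, and each score is computed without looking at the other column.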

In [9]:

# Features: every column except the target and the identifier column
X = df.drop(columns=['malignant', 'id'])
# Target: the malignant label
Y = df['malignant']
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0, test_size=0.3)

## Chi-2
Recall that the chi-square test measures dependence between stochastic variables, so scoring with `chi2` “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification. The test requires non-negative feature values, which the features here satisfy.
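As a rough sketch of what the chi-squared score measures, the example below rebuilds it by hand: for each feature, the per-class sums of the feature (observed) are compared with the sums expected if the feature were independent of the class. The data and the names `X_demo`, `y_demo`, and `Y_onehot` are assumptions made for illustration; the result is checked against `sklearn.feature_selection.chi2`.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import chi2

# Synthetic non-negative, count-like features (illustrative only)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    rng.poisson(lam=2 + 4 * y_demo),   # larger counts in class 1
    rng.poisson(lam=3, size=200),      # same distribution in both classes
])

# One-hot encode the classes: one column per class
Y_onehot = np.column_stack([1 - y_demo, y_demo])

# Observed: sum of each feature within each class, shape (classes, features)
observed = Y_onehot.T @ X_demo
# Expected under independence: class frequency times feature total
expected = np.outer(Y_onehot.mean(axis=0), X_demo.sum(axis=0))

# Chi-squared statistic per feature, with (n_classes - 1) degrees of freedom
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
manual_pvalues = stats.chi2.sf(manual_scores, df=Y_onehot.shape[1] - 1)

sk_scores, sk_pvalues = chi2(X_demo, y_demo)
print(manual_scores, sk_scores)
```

A feature whose total is split across the classes in proportion to the class sizes gets a score near zero, which is the sense in which the test flags features that look independent of the class.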

In [12]:
from sklearn.feature_selection import f_classif, SelectKBest, chi2, SelectFpr

sel_chi2 = SelectKBest(chi2, k='all')    # k='all' keeps every feature; we only want the scores
X_train_chi2 = sel_chi2.fit_transform(X_train, y_train)
print(sel_chi2.get_support())
sel_chi2.scores_, sel_chi2.pvalues_

feature_scores = [(item, score) for item, score in zip(X_train.columns, sel_chi2.scores_)]
sorted(feature_scores, key=lambda x: -x[1])[:10]

[ True  True  True  True  True  True  True  True  True]


(array([ 426.5132758 ,  875.70963027,  850.15811637,  723.76181294,
         349.43491738, 1141.27872364,  448.14697943,  747.49224266,
         187.91219398]),
 array([9.32782461e-095, 1.87212733e-192, 6.71725014e-187, 2.03546411e-159,
        5.62593397e-078, 3.52645628e-250, 1.82551732e-099, 1.40824922e-164,
        9.07960444e-043]))

[('barenuclei', 1141.2787236357278),
 ('uniformityofcellsize', 875.7096302712066),
 ('uniformityofcellshape', 850.1581163670111),
 ('normalnucleoli', 747.4922426570436),
 ('marginaladhesion', 723.7618129430464),
 ('blandchromatin', 448.1469794337213),
 ('clumpthickness', 426.51327580365785),
 ('singleepithelialcellsize', 349.43491738285917),
 ('mitoses', 187.9121939836225)]
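With `k='all'` the selector keeps every column, which is why the support mask above is all `True`. As a hedged illustration of what an integer `k` does instead, the sketch below fits `SelectKBest` with `k=2` on synthetic count data; the dataset and the names `X_demo`, `y_demo`, `top2`, and `keep_all` are hypothetical and not taken from this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative features: two informative, two pure noise
rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 150)
X_demo = np.column_stack([
    rng.poisson(lam=1 + 5 * y_demo),   # strongly class-dependent
    rng.poisson(lam=1 + 2 * y_demo),   # weakly class-dependent
    rng.poisson(lam=3, size=300),      # noise
    rng.poisson(lam=3, size=300),      # noise
])

# k=2 keeps only the two highest-scoring features
top2 = SelectKBest(chi2, k=2).fit(X_demo, y_demo)
# k='all' scores every feature but drops none
keep_all = SelectKBest(chi2, k="all").fit(X_demo, y_demo)

print(top2.get_support())       # boolean mask with exactly two True entries
print(keep_all.get_support())   # all True
print(top2.transform(X_demo).shape)
```

Both selectors compute identical `scores_`; only the mask returned by `get_support()` and the width of the transformed matrix differ.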

## f test

In [13]:

sel_f = SelectKBest(f_classif, k='all')
X_train_f = sel_f.fit_transform(X_train, y_train)
print(sel_f.get_support())
sel_f.scores_, sel_f.pvalues_


feature_scores = [(item, score) for item, score in zip(X_train.columns, sel_f.scores_)]
sorted(feature_scores, key=lambda x: -x[1])[:10]

[ True  True  True  True  True  True  True  True  True]


(array([491.97126082, 902.13622772, 987.16463411, 532.17906165,
        413.50838088, 943.92945152, 647.77823184, 465.74286219,
        121.21268502]),
 array([4.21096757e-074, 2.21221992e-109, 2.49555352e-115, 3.48124558e-078,
        1.32602748e-065, 2.37660845e-112, 4.76364380e-089, 2.41642800e-071,
        4.10389780e-025]))

[('uniformityofcellshape', 987.1646341083457),
 ('barenuclei', 943.929451522722),
 ('uniformityofcellsize', 902.1362277221252),
 ('blandchromatin', 647.778231839517),
 ('marginaladhesion', 532.179061648525),
 ('clumpthickness', 491.97126081999863),
 ('normalnucleoli', 465.7428621863493),
 ('singleepithelialcellsize', 413.5083808804715),
 ('mitoses', 121.21268502223457)]
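The chi-squared and F-test rankings above agree on the weakest feature but order the top features differently. One possible way to compare the two tests side by side is sketched below. The helper `score_table` is hypothetical (it is not part of scikit-learn or this notebook), and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, f_classif


def score_table(X, y, score_funcs):
    """Return one column of scores per test, indexed by feature name."""
    table = {}
    for name, func in score_funcs.items():
        selector = SelectKBest(func, k="all").fit(X, y)
        table[name] = pd.Series(selector.scores_, index=X.columns)
    return pd.DataFrame(table)


# Synthetic positive integer features (illustrative only)
rng = np.random.default_rng(2)
y_demo = pd.Series(np.repeat([0, 1], 100), name="target")
X_demo = pd.DataFrame({
    "strong": rng.integers(1, 4, size=200) + 6 * y_demo.to_numpy(),
    "weak": rng.integers(1, 6, size=200) + 2 * y_demo.to_numpy(),
    "noise": rng.integers(1, 11, size=200),
})

scores = score_table(X_demo, y_demo, {"chi2": chi2, "f_classif": f_classif})
# Rank 1 is the highest score within each test
ranks = scores.rank(ascending=False).astype(int)

print(scores.round(1))
print(ranks)
```

In this notebook the same helper could be called with `X_train` and `y_train`; comparing ranks rather than raw scores is the safer reading, since the two statistics are on different scales.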

2. Which features drive the false positive rate of the model derived above, and which features drive its false negative rate?

In [23]:
X_new = SelectFpr(chi2, alpha=0.01).fit_transform(X_train, y_train)
X_new.shape

(450, 9)

In [27]:
X_new = SelectFpr(f_classif, alpha=0.01).fit_transform(X_train, y_train)
X_new.shape

(450, 9)
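Both `SelectFpr` calls return all nine columns, which is consistent with the p-values printed earlier: every one is far below `alpha=0.01`. Note that the false positive rate controlled by `SelectFpr` is that of the per-feature tests, not the false positive rate of a classifier. The sketch below, on synthetic data with hypothetical names, shows that the selector keeps the features whose p-value is below `alpha`.

```python
import numpy as np
from sklearn.feature_selection import SelectFpr, f_classif

# Synthetic data: one informative feature followed by five noise features
rng = np.random.default_rng(3)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    rng.normal(loc=1.5 * y_demo, scale=1.0),  # mean depends on the class
    rng.normal(size=(200, 5)),                # independent of the class
])

alpha = 0.01
fpr = SelectFpr(f_classif, alpha=alpha).fit(X_demo, y_demo)

# Equivalent selection rule written out by hand
manual_mask = fpr.pvalues_ < alpha

print(fpr.get_support())
print(manual_mask)
print(fpr.pvalues_.round(4))
```

Because each noise feature passes the test with probability of roughly `alpha`, a few irrelevant features can still slip through when many are tested.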

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Imports</a></span></li></ul></li></ul></li><li><span><a href="#Analysis/Modeling" data-toc-modified-id="Analysis/Modeling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Analysis/Modeling</a></span><ul class="toc-item"><li><span><a href="#Chi-2" data-toc-modified-id="Chi-2-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Chi-2</a></span></li><li><span><a href="#f-test" data-toc-modified-id="f-test-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>f test</a></span></li></ul></li></ul></div>