# Let's start working

**I will apply data processing techniques one at a time, with the aim of improving the results.**

**The main objective is to apply feature selection techniques and measure their impact on the results. I may therefore stop as soon as the accuracy changes.**

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

**Next I will apply the feature selection techniques that I published on my LinkedIn account yesterday.
I also intend to apply a genetic algorithm, compare the results of each technique, and identify the best case.**

In [3]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
data.head()

In [4]:
data.shape

In [5]:
data.columns

**Before turning to feature selection, I will first clean the data and fix a few problems in it. The feature selection work begins in the second part.**

In [6]:
data.isnull().sum()

**No missing values appear in the feature columns. The next cell drops `Unnamed: 32` and `id`, which are not used as features.**

In [7]:
data = data.drop(["Unnamed: 32"],axis=1)
data = data.drop(["id"],axis=1)
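The check and the two drops above can be illustrated on a toy frame. This is a minimal sketch, assuming `Unnamed: 32` is an entirely empty column (as it usually is in this CSV export); the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics the layout: an identifier, one feature, one empty column
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [14.2, 20.1, 11.8],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count missing values per column
print(toy.isnull().sum())

# Drop columns in which every value is missing, then the identifier column
cleaned = toy.dropna(axis=1, how="all").drop(columns=["id"])
print(list(cleaned.columns))
```

Using `dropna(axis=1, how="all")` avoids hard-coding the name of the empty column, which may be convenient if the export changes.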

# Outlier management

In [8]:
data.describe()

In [9]:
data.info()

In [10]:
# Capping the outlier rows with percentiles
upper_lim = data['fractal_dimension_worst'].quantile(.99)
lower_lim = data['fractal_dimension_worst'].quantile(.01)

In [11]:
data.loc[(data['fractal_dimension_worst'] > upper_lim), 'fractal_dimension_worst'] = upper_lim
data.loc[(data['fractal_dimension_worst'] < lower_lim), 'fractal_dimension_worst'] = lower_lim
data.head()
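The same percentile capping can also be written with `Series.clip`. The sketch below is only an illustration on a small made-up series; `cap_to_percentiles` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

def cap_to_percentiles(series, lower=0.01, upper=0.99):
    """Clip values outside the [lower, upper] quantile range to the quantile bounds."""
    lower_lim = series.quantile(lower)
    upper_lim = series.quantile(upper)
    return series.clip(lower=lower_lim, upper=upper_lim)

# Made-up data with one extreme value at each end
toy = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
capped = cap_to_percentiles(toy, lower=0.10, upper=0.90)
print(capped.min(), capped.max())
```

The helper returns a new series instead of modifying the frame in place, so it could be applied to several columns in a loop if more than one feature needs capping.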

# One-hot encoding
**for the target column**

**One-hot encoding is a widely used feature engineering technique in machine learning. It converts categorical features into numerical features.**
**The target here has only two categories, so a single numeric column is enough. Following one of the tips, I focus on the malignant tumor more than the benign one. The data contains more benign samples than malignant ones, so I assign -1 to the benign tumor and 1 to the malignant tumor, which gives the two classes symmetric labels. This changes only the labels, not the number of samples in each class.**

In [12]:
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': -1})
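Strictly speaking, the `map` call above is a label mapping into one numeric column, while one-hot encoding produces one indicator column per category. The sketch below compares the two on made-up labels; it is an illustration only, not a change to the notebook's pipeline.

```python
import pandas as pd

# Made-up labels for illustration
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Label mapping, as used in this notebook: a single numeric column
mapped = toy["diagnosis"].map({"M": 1, "B": -1})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["diagnosis"], prefix="diagnosis")

print(mapped.tolist())
print(list(one_hot.columns))
```

For a binary target the single mapped column carries the same information as the two indicator columns, which is why the mapping is sufficient here.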

**Here I separate the target column from the rest of the features.**

In [13]:
# independent columns
x = data.drop(["diagnosis"],axis=1)
# target column
y = data["diagnosis"]

In [14]:
x.shape

In [15]:
y.head()

In [16]:
y.value_counts()

**The difference between the class counts in the target column is not large, so I leave it as it is for now unless it turns out to be a problem.**

In [17]:
X_train, X_valid, y_train, y_valid = train_test_split(x, y, test_size = 0.20, random_state = 44)
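If the class ratio should be identical in the training and validation sets, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up labels (60 benign, 40 malignant) to show the effect; the split above does not use it, so this is an optional variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60 benign (-1) and 40 malignant (1)
y_toy = np.array([-1] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=44, stratify=y_toy
)

# With stratify, both splits keep the 60/40 class ratio
print((y_tr == 1).mean(), (y_va == 1).mean())
```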

In [18]:
print('X_train after :',X_train.shape)
print('X_valid after :' , X_valid.shape)
print('y_train after :',y_train.shape)
print('y_valid after :' , y_valid.shape)

In [19]:
from sklearn.ensemble import ExtraTreesClassifier

In [20]:
model = ExtraTreesClassifier()
model.fit(X_train,y_train)

In [21]:
#Calculating Prediction
y_predict_model = model.predict(X_valid)
y_predict_model

In [22]:
#Calculating Details
print('model Train Score is : ' , model.score(X_train, y_train))
print('model Test Score is : ' , model.score(X_valid, y_valid))

In [23]:
#Calculating Confusion Matrix
from sklearn.metrics import confusion_matrix, classification_report
# Keep the result in `cm` so the imported function stays usable
cm = confusion_matrix(y_valid, y_predict_model)
cm
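Since the focus is on the malignant class, it can help to read the individual cells of the confusion matrix. The sketch below uses made-up predictions with the notebook's encoding (-1 for benign, 1 for malignant); passing `labels=[-1, 1]` fixes the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels using the notebook's encoding: -1 = benign, 1 = malignant
y_true = [-1, -1, -1, 1, 1, 1, 1, -1]
y_pred = [-1, -1, 1, 1, 1, -1, 1, -1]

# Rows are actual classes, columns are predicted classes, in the order given by `labels`
cm_toy = confusion_matrix(y_true, y_pred, labels=[-1, 1])
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print("recall for malignant:", tp / (tp + fn))
```

A false negative here is a malignant tumor predicted as benign, so the recall of the malignant class is usually the number to watch.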

In [24]:
print(model.feature_importances_)

In [25]:
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances

In [26]:
feat_importances = feat_importances.sort_values()
feat_importances

In [27]:
plt.figure(figsize=(10,10))
feat_importances.nlargest(30).plot(kind='barh')
plt.show()

**I'm going to drop the 14 columns that contribute least to the results and retrain the model to see what happens.**

In [28]:
# The 14 least important features, dropped from both splits
low_importance_cols = ['fractal_dimension_se', 'symmetry_se' , 'smoothness_se' , 'symmetry_mean', 'fractal_dimension_mean','texture_se','compactness_se','concavity_se','concave points_se','smoothness_mean','symmetry_worst','fractal_dimension_worst','area_se','radius_se']
X_train_2 = X_train.drop(columns=low_importance_cols)
X_valid_2 = X_valid.drop(columns=low_importance_cols)
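The list of columns to drop does not have to be typed by hand; it can be read from the importance series. This is a minimal sketch with made-up importance values, where `toy_importances` stands in for `feat_importances`.

```python
import pandas as pd

# Made-up importances; in the notebook this role is played by `feat_importances`
toy_importances = pd.Series(
    {"radius_mean": 0.30, "area_mean": 0.25, "texture_se": 0.02,
     "symmetry_se": 0.01, "concavity_mean": 0.42}
)

# Names of the k least important features
k = 2
to_drop = toy_importances.nsmallest(k).index.tolist()

# Drop the same list from any frame that has these columns
toy_X = pd.DataFrame([[1, 2, 3, 4, 5]], columns=toy_importances.index)
toy_X_reduced = toy_X.drop(columns=to_drop)

print(to_drop)
print(list(toy_X_reduced.columns))
```

Because `ExtraTreesClassifier` is randomized, the ranking can differ between runs unless `random_state` is fixed, so a derived list may not match a hand-typed one exactly.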

In [29]:
model_2 = ExtraTreesClassifier()
model_2.fit(X_train_2,y_train)

#Calculating Prediction
y_predict_model_2 = model_2.predict(X_valid_2)
print(y_predict_model_2)
print("-*-"*40)
#Calculating Details
print('model_2 Train Score is : ' , model_2.score(X_train_2, y_train))
print('model_2 Test Score is : ' , model_2.score(X_valid_2, y_valid))

**After this step, the results have not changed, but the model runs faster.**

In [30]:
#Calculating Confusion Matrix
from sklearn.metrics import confusion_matrix, classification_report
cm_2 = confusion_matrix(y_valid, y_predict_model_2)
print(cm_2, "\n")
print("-*-"*40 , "\n")
# The importances below come from the first model, trained on all features
print(model.feature_importances_ ,"\n")
plt.figure(figsize=(10,8))
feat_importances.nlargest(14).plot(kind='barh')
plt.show()

# Feature selection
**I will do this part in several ways, if necessary.**

# 1. Forward selection method

**Forward Selection: The forward selection method is an iterative process that
starts by having no features in the dataset. During each iteration, features
are added with the intent of improving the performance of the model. If
performance is improved, the features are kept. Features that do not improve
the results are discarded. The process continues until improvement of the
model stalls.**
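The description above can be sketched as a small greedy loop. This is a hand-written illustration on synthetic data, not the mlxtend implementation: `forward_select` is a hypothetical helper, and it stops when no candidate improves the cross-validated accuracy. As far as I know, mlxtend's selector with a fixed integer `k_features` keeps adding features until that count is reached instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def forward_select(estimator, X, y, max_features, cv=3):
    """Greedy forward selection: add the feature that most improves CV accuracy."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current subset
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv,
                               scoring="accuracy").mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no candidate improves the score: stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score

# Synthetic data (made up): 6 features, of which 2 are informative
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
clf_toy = ExtraTreesClassifier(n_estimators=20, random_state=0)

selected, score = forward_select(clf_toy, X_toy, y_toy, max_features=3)
print(selected, round(score, 3))
```

Each round refits the model once per remaining feature and fold, so the cost grows quickly with the number of features; this is one reason to reduce the feature set before running the selector.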

In [31]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
# Build the ExtraTrees classifier used for feature selection

clf = ExtraTreesClassifier(n_estimators=60, n_jobs=-1,random_state=42)

# Build step forward feature selection
sfs1 = sfs(clf,
           k_features=10,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=9)

# Perform sequential forward selection (SFS, since floating=False)
sfs1 = sfs1.fit(X_train_2, y_train)

In [32]:
help(sfs)

In [33]:
# Which features?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)
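The printed indices are column positions in the frame passed to `fit`. A minimal sketch of turning such positions into column names and keeping only those columns is shown below; `selected_idx` is a made-up stand-in for `sfs1.k_feature_idx_`, and the toy frame is for illustration only.

```python
import pandas as pd

toy_X = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    columns=["radius_mean", "texture_mean", "area_mean", "concavity_mean"],
)

# Made-up stand-in for `sfs1.k_feature_idx_` (a tuple of column positions)
selected_idx = (0, 3)

# Translate positions into names, then keep only the selected columns
selected_names = toy_X.columns[list(selected_idx)].tolist()
toy_X_selected = toy_X[selected_names]

print(selected_names)
```

Selecting by name keeps the training and validation frames aligned even if their column order differs.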

**The columns at these indexes are the selected ones. Now I can use those features to build a full model with the training and validation sets.**

In [34]:
# Columns removed before building the final model, dropped from both splits
unselected_cols = ['area_mean', 'compactness_mean' , 'concavity_mean' , 'perimeter_se', 'compactness_worst','concavity_worst','concave points_worst']
X_train_3 = X_train_2.drop(columns=unselected_cols)
X_valid_3 = X_valid_2.drop(columns=unselected_cols)

In [35]:
from sklearn.metrics import accuracy_score as acc
# Build full model with selected features
clf = ExtraTreesClassifier(random_state=42)

clf.fit(X_train_3, y_train)

In [36]:
#Calculating Prediction for train data 
y_predict_train_model_3 = clf.predict(X_train_3)
#Calculating Prediction for test data 
y_predict_test_model_3 = clf.predict(X_valid_3)

In [37]:
from sklearn.metrics import accuracy_score
print("accuracy of train data ",accuracy_score(y_train, y_predict_train_model_3))
print("accuracy of test data ",accuracy_score(y_valid, y_predict_test_model_3))

**After using the forward selection method, the accuracy level has increased significantly.**

In [40]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
#Generate the confusion matrix
cf_matrix = confusion_matrix(y_valid,y_predict_test_model_3)

In [41]:
group_names = ['True Neg','False Pos','False Neg','True Pos']

group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]

labels = np.asarray(labels).reshape(2,2)

ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

ax.set_title('Seaborn Confusion Matrix with labels\n\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');

## Tick labels - List must be in alphabetical order
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()

**The results have already changed, so I'll stop here for now.**

**I will come back with another approach that improves the results.
Next time I may use a genetic algorithm in the feature selection process.**