<h2 style="color:blue;">In this notebook, we compare predictions before and after applying bagging and boosting, and measure the impact by comparing accuracy.</h2>

<h1 style="color:red;">BAGGING</h1>

Here we implement bagging on the 'Breast Cancer Dataset', a classic and fairly easy binary classification dataset. It is convenient to use because it ships with sklearn and can be loaded directly.

### Importing libraries

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import BaggingClassifier
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Importing dataset

In [2]:
cancer = datasets.load_breast_cancer()
x=cancer.data
y=cancer.target
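To make "binary classification dataset" concrete, the following minimal sketch inspects the object returned by `load_breast_cancer`. It is only an illustration; the variable names `n_samples`, `n_features` and `class_counts` are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Feature matrix: one row per tumour sample, one column per measurement
n_samples, n_features = cancer.data.shape
print("samples:", n_samples, "features:", n_features)

# Target vector: integer codes that index into target_names
print("classes:", list(cancer.target_names))

# Number of samples per class, in the same order as target_names
class_counts = np.bincount(cancer.target)
print("samples per class:", class_counts.tolist())
```

The two classes are not perfectly balanced, which is one reason a stratified train/test split can be worth considering.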

### Splitting Training and Testing Data

In [3]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

### Standard Scaling and Model Building

In [4]:
se= StandardScaler()
le = LogisticRegression()
pipeline = make_pipeline(se,le)

<h2 style="color:blue;"> Prediction without using Bagging</h2>

### Logistic Regression Model Fitting and Prediction

In [5]:
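# Note: this fits the bare LogisticRegression on unscaled features; the pipeline built above is not used here
le.fit(x_train,y_train)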
y_pred=le.predict(x_test)
print("training score without bagging",le.score(x_train,y_train))
print("testing score without bagging",le.score(x_test,y_test))

training score without bagging 0.9447236180904522
testing score without bagging 0.9590643274853801
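The baseline above fits the logistic regression without the scaler, whereas the bagged model later uses the scaled pipeline. As a hedged sketch of a like-for-like baseline, the cell below fits the full pipeline instead. The fixed `random_state`, the `stratify` argument and `max_iter=1000` are assumptions added for reproducibility and convergence, so the scores will not match the ones printed above.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

# Fixed seed and stratification are assumptions, not part of the original split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the features, then fit logistic regression, as a single estimator
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

train_acc = scaled_lr.score(X_train, y_train)
test_acc = scaled_lr.score(X_test, y_test)
print("training score (scaled baseline)", train_acc)
print("testing score (scaled baseline)", test_acc)
```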


<h2 style="color:blue;"> Prediction using Bagging</h2>

### Bagging Model Fitting and Prediction

In [6]:
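# The pipeline is passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
bagging = BaggingClassifier(pipeline,n_estimators=100,max_features=10,max_samples=100,n_jobs=5)
bagging.fit(x_train,y_train)
y_pred=bagging.predict(x_test)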
print("training score",bagging.score(x_train,y_train))
print("testing score",bagging.score(x_test,y_test))

training score 0.9673366834170855
testing score 0.9707602339181286


<h2 style="color:blue;">Comparing Accuracy Score</h2>

<b>BEFORE BAGGING</b><br/>
Training score percentage: <b>94%</b><br/>
Testing score percentage: <b>96%</b><br/>
<b>AFTER BAGGING</b><br/>
Training score percentage: <b>97%</b><br/>
Testing score percentage: <b>97%</b><br/>
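As we can see, on this split accuracy improves by about <b>2</b> percentage points in training and about <b>1</b> percentage point in testing. Note that the baseline was fit on unscaled features while the bagged pipeline includes standard scaling, so part of the gain may come from scaling rather than from bagging itself.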
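A difference of one or two points on a single random split can easily be noise. One way to check it, sketched here under assumptions, is to score both models with 5-fold cross-validation. The fold count, `random_state=0` and `max_iter=1000` are choices made for this illustration, not settings from the original notebook.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

# Same kind of base model as in the notebook: scaler followed by logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Bagged version with the notebook's settings; the estimator is passed positionally
bagging = BaggingClassifier(
    pipeline, n_estimators=100, max_features=10, max_samples=100, random_state=0
)

# Accuracy on each of 5 folds for both models
single_scores = cross_val_score(pipeline, X, y, cv=5)
bagged_scores = cross_val_score(bagging, X, y, cv=5)

print("single pipeline: mean %.3f, std %.3f" % (single_scores.mean(), single_scores.std()))
print("bagged pipeline: mean %.3f, std %.3f" % (bagged_scores.mean(), bagged_scores.std()))
```

If the gap between the two means is smaller than the spread across folds, the improvement seen on one split should be read with caution.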

<h1 style="color:red;">BOOSTING</h1>

Here we implement boosting on the 'Mushroom Dataset'. This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one.

### Importing libraries

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Importing dataset

In [8]:
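# ?raw=true asks GitHub for the raw CSV; the plain /blob/ URL returns an HTML page that read_csv cannot parse
mushroom=pd.read_csv("https://github.com/ishreyakumari/winter-of-contributing/blob/Datascience_With_Python/Datascience_With_Python/DS%20Datasets/Machine%20Learning/mushrooms.csv?raw=true")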
df=mushroom.copy()

### Analyzing dataset

In [9]:
df.head()

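| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |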


### Changing categorical values to numerical values

In [10]:
le=LabelEncoder()
for i in df.columns:
    df[i]=le.fit_transform(df[i])
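To show what the loop above does to each column, here is a small sketch on a made-up frame. The frame `toy` and the dictionary `encoders` are hypothetical names used only for this illustration. `LabelEncoder` sorts the distinct values of a column and replaces each one with its position in that sorted list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame using the same single-letter codes as the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
})

# Keep one fitted encoder per column so the mapping can be inspected or reversed
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)
print("class mapping:", list(encoders["class"].classes_))  # index in this list = integer code
print("odor mapping:", list(encoders["odor"].classes_))

# The integer codes can be turned back into the original letters
restored = list(encoders["odor"].inverse_transform(toy["odor"]))
print("restored odor:", restored)
```

Because the codes follow alphabetical order, `e` (edible) becomes 0 and `p` (poisonous) becomes 1 in the `class` column. The integers carry no real ordering, which tree-based models tolerate better than distance-based or linear models do.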

### Information about dataset

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   class                     8124 non-null   int32
 1   cap-shape                 8124 non-null   int32
 2   cap-surface               8124 non-null   int32
 3   cap-color                 8124 non-null   int32
 4   bruises                   8124 non-null   int32
 5   odor                      8124 non-null   int32
 6   gill-attachment           8124 non-null   int32
 7   gill-spacing              8124 non-null   int32
 8   gill-size                 8124 non-null   int32
 9   gill-color                8124 non-null   int32
 10  stalk-shape               8124 non-null   int32
 11  stalk-root                8124 non-null   int32
 12  stalk-surface-above-ring  8124 non-null   int32
 13  stalk-surface-below-ring  8124 non-null   int32
 14  stalk-color-above-ring    8124 non-null 

In [12]:
df.describe()

| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | ... | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 | 8124.0 |
| mean | 0.482029 | 3.348104 | 1.827671 | 4.504677 | 0.415559 | 4.144756 | 0.974151 | 0.161497 | 0.309207 | 4.810684 | ... | 1.603644 | 5.816347 | 5.794682 | 0.0 | 1.965534 | 1.069424 | 2.291974 | 3.59675 | 3.644018 | 1.508616 |
| std | 0.499708 | 1.604329 | 1.229873 | 2.545821 | 0.492848 | 2.103729 | 0.158695 | 0.368011 | 0.462195 | 3.540359 | ... | 0.675974 | 1.901747 | 1.907291 | 0.0 | 0.242669 | 0.271064 | 1.801672 | 2.382663 | 1.252082 | 1.719975 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 2.0 | ... | 1.0 | 6.0 | 6.0 | 0.0 | 2.0 | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 |
| 50% | 0.0 | 3.0 | 2.0 | 4.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0.0 | 5.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 2.0 | 3.0 | 4.0 | 1.0 |
| 75% | 1.0 | 5.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.0 | 1.0 | 7.0 | ... | 2.0 | 7.0 | 7.0 | 0.0 | 2.0 | 1.0 | 4.0 | 7.0 | 4.0 | 2.0 |
| max | 1.0 | 5.0 | 3.0 | 9.0 | 1.0 | 8.0 | 1.0 | 1.0 | 1.0 | 11.0 | ... | 3.0 | 8.0 | 8.0 | 0.0 | 3.0 | 2.0 | 4.0 | 8.0 | 5.0 | 6.0 |


### Checking for null values

In [13]:
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

### Separating feature columns and target column

In [14]:
x=df.drop('class',axis=1)
y=df['class']

<h2 style="color:blue;"> Prediction without using Boosting</h2>

### Model building and finding accuracy of model (DecisionTreeClassifier)

In [15]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
model = DecisionTreeClassifier(criterion="entropy",max_depth=1)
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)  # argument order is (y_true, y_pred)
print("accuracy without boosting =",accuracy*100)

accuracy without boosting = 71.90319934372437


<h2 style="color:blue;">Prediction using Boosting</h2>

### Model building and finding accuracy of model (AdaBoost)

In [16]:
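# The stump is passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
AdaBoost = AdaBoostClassifier(model, n_estimators=400, learning_rate=1)
adaModel = AdaBoost.fit(x_train,y_train)
y_pred = adaModel.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)  # argument order is (y_true, y_pred)
print("accuracy with boosting =",accuracy*100)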

accuracy with boosting = 100.0


<h2 style="color:blue;">Comparing Accuracy Score</h2>

<b>BEFORE BOOSTING</b><br/>
Accuracy percentage: <b>72%</b><br/>
<b>AFTER BOOSTING</b><br/>
Accuracy percentage: <b>100%</b><br/>
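As we can see, accuracy improves by <b>28</b> percentage points. The baseline here is a single decision tree of depth 1, and AdaBoost combines up to 400 such trees.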
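To see how accuracy builds up as more stumps are added, `staged_score` reports the test accuracy after each boosting round. The sketch below is an illustration under assumptions: it uses the breast cancer dataset from sklearn rather than the mushroom CSV so that it runs without a download, and it uses 50 estimators and a fixed `random_state`, so its numbers are not comparable to the ones above.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption): the breast cancer data bundled with sklearn
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A single depth-1 tree, the same kind of weak learner used in the notebook
stump = DecisionTreeClassifier(criterion="entropy", max_depth=1)
stump_acc = stump.fit(X_train, y_train).score(X_test, y_test)
print("single stump accuracy: %.3f" % stump_acc)

# Boost the stump; the estimator is passed positionally and cloned internally
booster = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
booster.fit(X_train, y_train)

# staged_score yields the test accuracy after 1, 2, 3, ... boosting rounds
staged = list(booster.staged_score(X_test, y_test))
for n in sorted({1, 5, 10, 25, len(staged)}):
    if n <= len(staged):
        print("accuracy after %d rounds: %.3f" % (n, staged[n - 1]))
```

Plotting `staged` against the round number is a quick way to judge whether 400 estimators are needed or whether accuracy levels off much earlier.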