## Understanding the Business Problem
TalkingData is a Chinese big data company, and one of their areas of expertise is mobile advertisements.

In mobile advertisements, click fraud is a major source of losses. Click fraud is the practice of repeatedly clicking on an advertisement hosted on a website with the intention of generating revenue for the host website or draining revenue from the advertiser.

In this case, TalkingData serves the advertisers (their clients). TalkingData covers roughly 70% of the active mobile devices in China, and about 90% of the clicks it handles are potentially fraudulent (i.e. the user is not actually going to download the app after clicking).

You can imagine the amount of money they can help clients save if they are able to predict whether a given click is fraudulent (or equivalently, whether a given click will result in a download).

Their current approach is a blacklist of IP addresses: IPs that produce lots of clicks but never install any apps. Now they want to try more advanced techniques to predict the probability that a click is genuine or fraudulent.
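A minimal sketch of how such a blacklist could be derived from a click log with pandas. The toy data and the `MIN_CLICKS` threshold are assumptions made for illustration; the rule TalkingData actually uses is not described here.

```python
import pandas as pd

# Toy click log; column names follow the data dictionary, values are made up.
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "is_attributed": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Per-IP click count and download count.
per_ip = clicks.groupby("ip")["is_attributed"].agg(n_clicks="count", n_downloads="sum")

# Hypothetical rule: at least MIN_CLICKS clicks and zero downloads.
MIN_CLICKS = 4
mask = (per_ip["n_clicks"] >= MIN_CLICKS) & (per_ip["n_downloads"] == 0)
blacklist = per_ip[mask].index.tolist()
print(blacklist)
```

A fixed rule like this only looks at one feature (the IP), which is the limitation the probability model is meant to address.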

In this problem, we will use the features associated with clicks, such as IP address, operating system, device type, time of click etc. to predict the probability of a click being fraudulent.



# Project on Bagging and Boosting ensemble model:


**The data contains observations of about 240 million clicks, and whether a given click resulted in a download or not (1/0):**

The detailed data dictionary is mentioned here:
- ```ip```: ip address of click.
- ```app```: app id for marketing.
- ```device```: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- ```os```: os version id of user mobile phone
- ```channel```: channel id of mobile ad publisher
- ```click_time```: timestamp of click (UTC)
- ```attributed_time```: if the user downloads the app after clicking an ad, this is the time of the app download
- ```is_attributed```: the target that is to be predicted, indicating the app was downloaded

Let's try finding some useful trends in the data.

**1. Explore the dataset for anomalies and missing values and take corrective actions if necessary.**

**2. Which column has the maximum number of unique values among all the available columns?**

**3. Use an appropriate technique to get rid of all the apps that are very rare (say, those that account for less than 20% of clicks) and plot the rest.**

**4. Using Pandas, derive new features such as 'day_of_week', 'day_of_year', 'month', and 'hour' as float/int datatypes from the 'click_time' column. Add the newly derived columns to the original dataset.**

**5. Divide the data into training and testing subsets in an 80:20 ratio (Train_data = 80%, Testing_data = 20%) and check the average download rates ('is_attributed') for train and test data; the scores should be comparable.**

**6. Apply XGBoostClassifier with default parameters on the training data and make the first 10 predictions for the test data. NOTE: Use y_pred = model.predict_proba(X_test) since we need probabilities to compute AUC.**

**7. On evaluating the predictions made by the model, what is the AUC/ROC score with default hyperparameters?**

**8. Compute the feature importance scores and name the top 5 features/columns.**

**9. Apply BaggingClassifier with base_estimator LogisticRegression and compute the AUC/ROC score.**

**10. On the basis of the AUC/ROC score, which one will you choose between BaggingClassifier and XGBoostClassifier, and why? What does the AUC/ROC score signify?**

**11. What is the accuracy for BaggingClassifier and XGBoostClassifier?**

### All the Best!!!

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
from scipy.stats import zscore

In [2]:
pwd

'/Users/emani/Downloads/Bagging_Boosting_Project'

In [3]:
B_Boost = pd.read_csv('/Users/emani/Downloads/Projects/DT_RF_proj/SF_Crimes.csv')
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
miss_val=B_Boost.isna().sum()
miss_val

Dates         0
Category      0
Descript      0
DayOfWeek     0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
dtype: int64

In [5]:
print("Column Names:",B_Boost.columns.values)

Column Names: ['Dates' 'Category' 'Descript' 'DayOfWeek' 'PdDistrict' 'Resolution'
 'Address' 'X' 'Y']


In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
n=LabelEncoder()
n=n.fit(B_Boost['DayOfWeek'])

In [8]:
n.transform(B_Boost['DayOfWeek'])

array([6, 6, 6, ..., 1, 1, 1])

In [9]:
B_Boost['DayOfWeek']= n.transform(B_Boost['DayOfWeek'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,6,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,6,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,6,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,6,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,6,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [10]:
n=LabelEncoder()
n=n.fit(B_Boost['Category'])

In [11]:
n.transform(B_Boost['Category'])

array([37, 21, 21, ..., 16, 35, 12])

In [12]:
B_Boost['Category']= n.transform(B_Boost['Category'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,WARRANT ARREST,6,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,21,TRAFFIC VIOLATION ARREST,6,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,21,TRAFFIC VIOLATION ARREST,6,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [13]:
n=LabelEncoder()
n=n.fit(B_Boost['PdDistrict'])

In [14]:
n.transform(B_Boost['PdDistrict'])

array([4, 4, 4, ..., 7, 7, 0])

In [15]:
B_Boost['PdDistrict']= n.transform(B_Boost['PdDistrict'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,WARRANT ARREST,6,4,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,21,TRAFFIC VIOLATION ARREST,6,4,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,21,TRAFFIC VIOLATION ARREST,6,4,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,4,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,5,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [16]:
n=LabelEncoder()
n=n.fit(B_Boost['Resolution'])

In [17]:
n.transform(B_Boost['Resolution'])

array([ 0,  0,  0, ..., 11, 11, 11])

In [18]:
B_Boost['Resolution']= n.transform(B_Boost['Resolution'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,WARRANT ARREST,6,4,0,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,21,TRAFFIC VIOLATION ARREST,6,4,0,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,21,TRAFFIC VIOLATION ARREST,6,4,0,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,4,11,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,16,GRAND THEFT FROM LOCKED AUTO,6,5,11,100 Block of BRODERICK ST,-122.438738,37.771541


In [20]:
n=LabelEncoder()
n=n.fit(B_Boost['Descript'])

In [21]:
n.transform(B_Boost['Descript'])

array([866, 810, 810, ..., 404, 496, 204])

In [22]:
B_Boost['Descript']= n.transform(B_Boost['Descript'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,866,6,4,0,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,21,810,6,4,0,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,21,810,6,4,0,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,16,404,6,4,11,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,16,404,6,5,11,100 Block of BRODERICK ST,-122.438738,37.771541


In [24]:
n=LabelEncoder()
n=n.fit(B_Boost['Address'])

In [25]:
n.transform(B_Boost['Address'])

array([19790, 19790, 22697, ..., 11315, 22308,  5128])

In [26]:
B_Boost['Address']= n.transform(B_Boost['Address'])
B_Boost.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,866,6,4,0,19790,-122.425892,37.774599
1,2015-05-13 23:53:00,21,810,6,4,0,19790,-122.425892,37.774599
2,2015-05-13 23:33:00,21,810,6,4,0,22697,-122.424363,37.800414
3,2015-05-13 23:30:00,16,404,6,4,11,4266,-122.426995,37.800873
4,2015-05-13 23:30:00,16,404,6,5,11,1843,-122.438738,37.771541


In [27]:
B_Boost['PdDistrict'].value_counts()

7    157182
3    119908
4    105296
0     89431
1     85460
9     81809
2     78845
8     65596
5     49313
6     45209
Name: PdDistrict, dtype: int64

Use an appropriate technique to get rid of all the apps that are very rare (say, those that account for less than 20% of clicks) and plot the rest

In [28]:
B_Boost.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Dates       878049 non-null  object 
 1   Category    878049 non-null  int64  
 2   Descript    878049 non-null  int64  
 3   DayOfWeek   878049 non-null  int64  
 4   PdDistrict  878049 non-null  int64  
 5   Resolution  878049 non-null  int64  
 6   Address     878049 non-null  int64  
 7   X           878049 non-null  float64
 8   Y           878049 non-null  float64
dtypes: float64(2), int64(6), object(1)
memory usage: 60.3+ MB


In [29]:
B_Boost[B_Boost.notna()]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,37,866,6,4,0,19790,-122.425892,37.774599
1,2015-05-13 23:53:00,21,810,6,4,0,19790,-122.425892,37.774599
2,2015-05-13 23:33:00,21,810,6,4,0,22697,-122.424363,37.800414
3,2015-05-13 23:30:00,16,404,6,4,11,4266,-122.426995,37.800873
4,2015-05-13 23:30:00,16,404,6,5,11,1843,-122.438738,37.771541
...,...,...,...,...,...,...,...,...,...
878044,2003-01-06 00:15:00,25,661,1,8,11,15816,-122.459033,37.714056
878045,2003-01-06 00:01:00,16,404,1,2,11,11491,-122.447364,37.731948
878046,2003-01-06 00:01:00,16,404,1,7,11,11315,-122.403390,37.780266
878047,2003-01-06 00:01:00,35,496,1,7,11,22308,-122.390531,37.780607


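Dropping the `Dates` column: it has no missing values (see the `isna().sum()` output above), but it is still an unparsed string (`object`) column that the models cannot use directly.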

In [30]:
B_Boost=B_Boost.drop('Dates',axis=1)
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,37,866,6,4,0,19790,-122.425892,37.774599
1,21,810,6,4,0,19790,-122.425892,37.774599
2,21,810,6,4,0,22697,-122.424363,37.800414
3,16,404,6,4,11,4266,-122.426995,37.800873
4,16,404,6,5,11,1843,-122.438738,37.771541


In [31]:
(B_Boost['PdDistrict'].value_counts()/100000)

7    1.57182
3    1.19908
4    1.05296
0    0.89431
1    0.85460
9    0.81809
2    0.78845
8    0.65596
5    0.49313
6    0.45209
Name: PdDistrict, dtype: float64

In [32]:
Bagging=(B_Boost['PdDistrict'].value_counts()/100000)*100

In [33]:
Bagging

7    157.182
3    119.908
4    105.296
0     89.431
1     85.460
9     81.809
2     78.845
8     65.596
5     49.313
6     45.209
Name: PdDistrict, dtype: float64

In [34]:
Bagging=(B_Boost['PdDistrict'].value_counts()/100000)

In [35]:
Bagging[Bagging<0.2]

Series([], Name: PdDistrict, dtype: float64)
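Dividing the counts by a fixed 100000 does not give each value's share of the rows, so comparing the result with 0.2 is not a 20% test. A minimal sketch of the share-based filter using `value_counts(normalize=True)`; the `apps` series is made-up data standing in for a categorical column such as `app`.

```python
import pandas as pd

# Illustrative data: one dominant id, one medium id, one rare id.
apps = pd.Series([3] * 6 + [12] * 3 + [7] * 1, name="app")

# normalize=True returns each value's fraction of all rows (fractions sum to 1.0).
share = apps.value_counts(normalize=True)

THRESHOLD = 0.20  # the 20% cut-off named in the task
frequent = share[share >= THRESHOLD]

# Keep only the rows whose id passed the threshold.
filtered = apps[apps.isin(frequent.index)]
print(frequent)
print(len(filtered), "of", len(apps), "rows kept")
```

With matplotlib available, `frequent.plot(kind="bar")` would draw the remaining categories.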

Which column has the maximum number of unique values among all the available columns?

In [36]:
for col in B_Boost.columns:
    print(col, len(B_Boost[col].unique()))

Category 39
Descript 879
DayOfWeek 7
PdDistrict 10
Resolution 17
Address 23228
X 34243
Y 34243


The columns with the most unique values are `X` and `Y` (34,243 each), followed by `Address` (23,228).
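The per-column loop above can also be expressed with `DataFrame.nunique()`, which returns the counts as a Series that can be sorted directly. A small sketch on made-up data:

```python
import pandas as pd

# Illustrative frame; the column names are placeholders.
df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": ["x", "y", "z", "w"],
    "c": [0.5, 0.5, 0.5, 0.7],
})

# Number of distinct values per column, largest first.
counts = df.nunique().sort_values(ascending=False)
print(counts)
print("Most unique values:", counts.idxmax())
```

When two columns tie for the maximum, `idxmax()` reports only the first one, so it is worth printing the full sorted Series as well.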

Using Pandas, derive new features such as 'day_of_week', 'day_of_year', 'month', and 'hour' as float/int datatypes from the 'click_time' column. Add the newly derived columns to the original dataset

In [37]:
from datetime import datetime, timedelta

In [38]:
B_Boost['PdDistrict']=pd.to_datetime(B_Boost['PdDistrict'])

In [39]:
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,37,866,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599
1,21,810,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599
2,21,810,6,1970-01-01 00:00:00.000000004,0,22697,-122.424363,37.800414
3,16,404,6,1970-01-01 00:00:00.000000004,11,4266,-122.426995,37.800873
4,16,404,6,1970-01-01 00:00:00.000000005,11,1843,-122.438738,37.771541


In [40]:
dt = '2017-11-07 09:30:38'

In [41]:
B_Boost['PdDistrict'] = pd.to_datetime(B_Boost['PdDistrict'])

In [42]:
B_Boost.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Category    878049 non-null  int64         
 1   Descript    878049 non-null  int64         
 2   DayOfWeek   878049 non-null  int64         
 3   PdDistrict  878049 non-null  datetime64[ns]
 4   Resolution  878049 non-null  int64         
 5   Address     878049 non-null  int64         
 6   X           878049 non-null  float64       
 7   Y           878049 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(5)
memory usage: 53.6 MB


In [43]:
B_Boost['day_of_the_week'] = B_Boost['PdDistrict'].dt.weekday
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,day_of_the_week
0,37,866,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3
1,21,810,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3
2,21,810,6,1970-01-01 00:00:00.000000004,0,22697,-122.424363,37.800414,3
3,16,404,6,1970-01-01 00:00:00.000000004,11,4266,-122.426995,37.800873,3
4,16,404,6,1970-01-01 00:00:00.000000005,11,1843,-122.438738,37.771541,3


In [44]:
B_Boost['day_of_the_year'] = B_Boost['PdDistrict'].dt.dayofyear
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,day_of_the_week,day_of_the_year
0,37,866,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1
1,21,810,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1
2,21,810,6,1970-01-01 00:00:00.000000004,0,22697,-122.424363,37.800414,3,1
3,16,404,6,1970-01-01 00:00:00.000000004,11,4266,-122.426995,37.800873,3,1
4,16,404,6,1970-01-01 00:00:00.000000005,11,1843,-122.438738,37.771541,3,1


In [45]:
B_Boost['month'] = B_Boost['PdDistrict'].dt.month
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,day_of_the_week,day_of_the_year,month
0,37,866,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1,1
1,21,810,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1,1
2,21,810,6,1970-01-01 00:00:00.000000004,0,22697,-122.424363,37.800414,3,1,1
3,16,404,6,1970-01-01 00:00:00.000000004,11,4266,-122.426995,37.800873,3,1,1
4,16,404,6,1970-01-01 00:00:00.000000005,11,1843,-122.438738,37.771541,3,1,1


In [46]:
B_Boost['hour'] = B_Boost['PdDistrict'].dt.hour
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,day_of_the_week,day_of_the_year,month,hour
0,37,866,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1,1,0
1,21,810,6,1970-01-01 00:00:00.000000004,0,19790,-122.425892,37.774599,3,1,1,0
2,21,810,6,1970-01-01 00:00:00.000000004,0,22697,-122.424363,37.800414,3,1,1,0
3,16,404,6,1970-01-01 00:00:00.000000004,11,4266,-122.426995,37.800873,3,1,1,0
4,16,404,6,1970-01-01 00:00:00.000000005,11,1843,-122.438738,37.771541,3,1,1,0


In [47]:
B_Boost=B_Boost.drop('PdDistrict',axis=1)
B_Boost.head()

Unnamed: 0,Category,Descript,DayOfWeek,Resolution,Address,X,Y,day_of_the_week,day_of_the_year,month,hour
0,37,866,6,0,19790,-122.425892,37.774599,3,1,1,0
1,21,810,6,0,19790,-122.425892,37.774599,3,1,1,0
2,21,810,6,0,22697,-122.424363,37.800414,3,1,1,0
3,16,404,6,11,4266,-122.426995,37.800873,3,1,1,0
4,16,404,6,11,1843,-122.438738,37.771541,3,1,1,0
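The task asks for features derived from a timestamp column such as `click_time`. Passing integer codes to `pd.to_datetime` interprets them as nanoseconds since 1970-01-01, which is why every derived value above is identical. A minimal sketch of the intended derivation, assuming a string timestamp column (the two sample rows are made up):

```python
import pandas as pd

# Illustrative frame with timestamps stored as strings.
df = pd.DataFrame({"click_time": ["2017-11-07 09:30:38", "2017-11-08 23:05:00"]})

# Parse once, then read the components through the .dt accessor.
ts = pd.to_datetime(df["click_time"])
df["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
df["day_of_year"] = ts.dt.dayofyear
df["month"] = ts.dt.month
df["hour"] = ts.dt.hour
print(df)
```

The `.dt` accessor works on the whole column at once, so no per-row `apply` is needed.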


Divide the data into training and testing subsets in an 80:20 ratio (Train_data = 80%, Testing_data = 20%) and check the average download rates ('is_attributed') for train and test data; the scores should be comparable

In [48]:
from sklearn.ensemble import BaggingClassifier

In [49]:
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn import preprocessing

In [51]:
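B_Boost_x = B_Boost.drop(['Address','month'], axis = 1)
# 'month' was derived from the label-encoded 'PdDistrict' codes (0-9 ns after the epoch),
# so it is 1 for every row and any classifier will score 1.0 on this target.
B_Boost_y = B_Boost['month']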

In [52]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(B_Boost_x, B_Boost_y, test_size=0.20)

In [53]:
print('Dimension of B_Boost_x :',B_Boost_x.shape)
print('Dimension of B_Boost_y :',B_Boost_y.shape)

Dimension of B_Boost_x : (878049, 9)
Dimension of B_Boost_y : (878049,)
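The task also asks to compare the average target rate in the two subsets. A sketch of that check on synthetic data, assuming a binary target named `is_attributed`; `stratify=y` is one way to keep the rates close, and the feature names `f1` and `f2` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced binary target standing in for 'is_attributed'.
X = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y = pd.Series((rng.random(1000) < 0.1).astype(int), name="is_attributed")

# stratify=y keeps the class proportions nearly equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# For a 0/1 target, the mean is the positive (download) rate.
print("train rate:", y_train.mean())
print("test rate: ", y_test.mean())
```

Fixing `random_state` makes the split reproducible between runs.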


In [54]:
B_Boost.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Category         878049 non-null  int64  
 1   Descript         878049 non-null  int64  
 2   DayOfWeek        878049 non-null  int64  
 3   Resolution       878049 non-null  int64  
 4   Address          878049 non-null  int64  
 5   X                878049 non-null  float64
 6   Y                878049 non-null  float64
 7   day_of_the_week  878049 non-null  int64  
 8   day_of_the_year  878049 non-null  int64  
 9   month            878049 non-null  int64  
 10  hour             878049 non-null  int64  
dtypes: float64(2), int64(9)
memory usage: 73.7 MB


Apply XGBoostClassifier with default parameters on training data and make first 10 prediction for Test data. NOTE: Use y_pred = model.predict_proba(X_test) since we need probabilities to compute AUC
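A sketch of the boosting workflow this task describes, on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in booster so that the example depends only on scikit-learn; `xgboost.XGBClassifier` follows the same `fit` / `predict_proba` interface and could be swapped in. The data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the click records.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Stand-in booster with default hyperparameters.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(class == 1).
y_proba = model.predict_proba(X_test)[:, 1]
print("first 10 probabilities:", y_proba[:10])

# AUC is computed from the probabilities, not from hard 0/1 predictions.
auc = roc_auc_score(y_test, y_proba)
print("AUC:", auc)

# Five largest importance scores as (feature index, score) pairs.
ranked = sorted(enumerate(model.feature_importances_), key=lambda p: p[1], reverse=True)
top5 = ranked[:5]
print(top5)
```

AUC measures how well the predicted probabilities rank positives above negatives, which is why it needs `predict_proba` rather than `predict`.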

In [57]:
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier()
model = model.fit(X_train, y_train)

In [61]:
from sklearn.ensemble import BaggingClassifier
Bagging_model=BaggingClassifier()
Bagging_model=Bagging_model.fit(X_train,y_train)

In [63]:
test_pred=Bagging_model.predict(X_test)

In [64]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,test_pred))

1.0


In [67]:
from sklearn.ensemble import BaggingClassifier
Bagging_model = BaggingClassifier()
Bagging_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [68]:
Bagging_pred=Bagging_model.predict(X_test)

In [69]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,Bagging_pred))

1.0


In [70]:
from sklearn.ensemble import BaggingClassifier
Bagging_model = BaggingClassifier(n_estimators=20)
Bagging_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=20,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [71]:
Bagging_pred=Bagging_model.predict(X_test)

In [72]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,Bagging_pred))

1.0


In [74]:
from sklearn.ensemble import BaggingClassifier
Bagging_model = BaggingClassifier(n_estimators=20)
Bagging_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=20,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [75]:
Bagging_pred=Bagging_model.predict(X_test)

In [76]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,Bagging_pred))

1.0


In [77]:
from sklearn.ensemble import BaggingClassifier
Bagging_model = BaggingClassifier(n_estimators=5)
Bagging_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=5,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [78]:
Bagging_pred=Bagging_model.predict(X_test)

In [79]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,Bagging_pred))

1.0


In [81]:
from sklearn.ensemble import BaggingClassifier
Bagging_model = BaggingClassifier(n_estimators=20)
Bagging_model.fit(X_train, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=20,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [82]:
Bagging_pred=Bagging_model.predict(X_test)

In [83]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,Bagging_pred))

1.0


In [84]:
Bagging_pred=Bagging_model.predict(X_train)

In [85]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_train,Bagging_pred))

1.0
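The task list also asks for a `BaggingClassifier` built on `LogisticRegression`, scored by AUC as well as accuracy. A minimal sketch on synthetic, imbalanced data (an assumption standing in for the click records) that reports both metrics next to a majority-class baseline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with roughly 10% positives.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in newer scikit-learn releases.
bag = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10, random_state=0)
bag.fit(X_train, y_train)

# AUC uses the predicted probability of class 1; accuracy uses hard labels.
proba = bag.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
acc = accuracy_score(y_test, bag.predict(X_test))

# Accuracy obtained by always predicting the majority class.
baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"AUC={auc:.3f}  accuracy={acc:.3f}  majority baseline={baseline:.3f}")
```

On imbalanced data the majority baseline is already high, so accuracy alone says little about a model; AUC is usually the more informative number when comparing the bagging and boosting models.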
