In [None]:
'''For this lab and the next lessons we will use the 'Healthcare For All' dataset
to build a model that predicts who will donate (TARGET_B) and how much they will give (TARGET_D, used in Friday's lab).
You will be using the `files_for_lab/learningSet.csv` file, which you have already downloaded in class.

### Scenario

You are revisiting the Healthcare for All Case Study. 
You are provided with historical data about donors and how much they donated. 
Your task is to build a machine learning model that helps the company identify people who are more likely to donate
and then tries to predict the donation amount.

### Instructions

In this lab, we will first take a look at the degree of imbalance in the data
and correct it using the techniques we learned in the class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read the data into Python and call the dataframe `donors`.
- Check the datatypes of all the columns in the data. 
- Check for null values in the dataframe. Replace the null values using the methods learned in class.
- Split the data into numerical and categorical.  Decide if any columns need their dtype changed.
- Concatenate numerical and categorical back together again for your X dataframe.  Designate the Target as y.
  
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using normalizer or a standard scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on the train data, then transform both train and test
      - again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

### Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.'''
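
A minimal sketch of the unbalanced baseline steps listed above, run on a small synthetic frame rather than `learningSet.csv`. The column names (`age`, `gift`, `state`, `target`) and the data are made up for illustration; only the order of operations (split, fit scaler and encoder on train, transform both, fit the model) follows the instructions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the donors dataframe.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "gift": rng.normal(15, 5, n),
    "state": rng.choice(["CA", "TX", "FL"], n),
    "target": (np.arange(n) % 10 == 0).astype(int),  # 10% positives
})

X = toy.drop("target", axis=1)
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

num_cols = ["age", "gift"]
cat_cols = ["state"]

# Fit the scaler and the encoder on the training split only.
scaler = StandardScaler().fit(X_train[num_cols])
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train[cat_cols])

def transform(frame):
    """Scale numerical columns, one-hot encode categoricals, and re-concatenate."""
    num_part = scaler.transform(frame[num_cols])
    cat_part = encoder.transform(frame[cat_cols]).toarray()
    return np.hstack([num_part, cat_part])

X_train_t = transform(X_train)
X_test_t = transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_t, y_train)
accuracy = model.score(X_test_t, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

With a target this imbalanced, a high accuracy here says little by itself, which is why the lab goes on to resampling.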

In [26]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

In [37]:
donors = pd.read_csv(r"C:\Users\filip\OneDrive\Desktop\IRONHACK\Week13\learningSet.txt")
donors = donors.drop("TARGET_D", axis = 1)
donors.dtypes

ODATEDW       int64
OSOURCE      object
TCODE         int64
STATE        object
ZIP          object
             ...   
MDMAUD_R     object
MDMAUD_F     object
MDMAUD_A     object
CLUSTER2    float64
GEOCODE2     object
Length: 480, dtype: object

In [28]:
donors.isna().sum()

ODATEDW       0
OSOURCE       0
TCODE         0
STATE         0
ZIP           0
           ... 
MDMAUD_R      0
MDMAUD_F      0
MDMAUD_A      0
CLUSTER2    132
GEOCODE2    132
Length: 481, dtype: int64
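
The null counts above only cover true NaN values; the unique-value printout further down shows that some categorical columns (for example `GEOCODE2`) also contain blank strings. A hedged sketch of one common way to handle both, on a made-up frame: blanks become NaN, numeric gaps get the column median, and categorical gaps get the mode. The helper name `fill_missing` and the median/mode choice are assumptions, not necessarily the methods taught in class.

```python
import numpy as np
import pandas as pd

def fill_missing(frame):
    """Return a copy with blank strings and NaN values imputed."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numerical column: fill gaps with the median.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Treat empty or whitespace-only strings as missing,
            # then fill with the most frequent value.
            cleaned = out[col].replace(r"^\s*$", np.nan, regex=True)
            out[col] = cleaned.fillna(cleaned.mode().iloc[0])
    return out

# Hypothetical miniature of two columns from the donors data.
toy = pd.DataFrame({
    "CLUSTER2": [1.0, np.nan, 3.0, 5.0],
    "GEOCODE2": ["A", " ", np.nan, "A"],
})
filled = fill_missing(toy)
print(filled)
```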

In [38]:
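# Categorical columns: everything that is not an integer or a float.
cat = donors.select_dtypes(exclude=['integer', 'float'])
# Numerical columns: everything that is not an object.
num = donors.select_dtypes(exclude=['object'])
# Drop numerical columns that contain any null value (instead of imputing them).
num = num.dropna(axis = 1)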

In [39]:
for column in cat.columns:
    unique_values = cat[column].unique()
    print(f"Unique values in {column}: {unique_values}")

Unique values in OSOURCE: ['GRI' 'BOA' 'AMH' 'BRY' ' ' 'CWR' 'DRK' 'NWN' 'LIS' 'MSD' 'AGR' 'CSM'
 'ENQ' 'HCC' 'USB' 'FRC' 'RKB' 'PCH' 'AMB' 'L15' 'BBK' 'L21' 'SYN' 'L01'
 'MOP' 'UCA' 'ESN' 'IMP' 'AVN' 'RMG' 'DNA' 'L04' 'AML' 'AIR' 'DUR' 'LHJ'
 'WKB' 'STL' 'DCD' 'IMA' 'ACS' 'ALZ' 'NEX' 'HAR' 'SGI' 'MBC' 'BSH' 'NAD'
 'HOS' 'HHL' 'GRT' 'L02' 'APP' 'DAC' 'BHG' 'NSH' 'NPT' 'L16' 'PV3' 'LOG'
 'ASC' 'AGS' 'ARG' 'DON' 'VIK' 'ARB' 'HHH' 'ANT' 'WRG' 'PBL' 'OMH' 'CRG'
 'UBA' 'ASH' 'COK' 'RPH' 'STV' 'NAS' 'SSS' 'LEO' 'KNG' 'KIP' 'ASS' 'GDA'
 'STR' 'CAW' 'GET' 'HAN' 'DEL' 'FLD' 'L25' 'MER' 'SYC' 'HAM' 'PSM' 'HIL'
 'SPN' 'DNB' 'GPS' 'ASP' 'INN' 'ABW' 'CFI' 'JFY' 'LAK' 'LVT' 'RED' 'TIM'
 'MON' 'MM3' 'FAR' 'MTR' 'HOW' 'FOR' 'LKE' 'DBL' 'K3M' 'PGR' 'ADD' 'IML'
 'SMZ' 'CNT' 'SUN' 'MCC' 'BEL' 'TVF' 'TRN' 'PCL' 'HRB' 'OVS' 'WFD' 'TX2'
 'NWF' 'KEN' 'NSN' 'NEW' 'CJR' 'NHB' 'FCR' 'BSM' 'SIG' 'CHT' 'CAP' 'TVG'
 'SUF' 'PRV' 'TRO' 'GUR' 'WIG' 'MAT' 'D02' 'GLP' 'HEA' 'BLI' 'EAS' 'SFH'
 'PBK' 'TOR' 'HJR' 'MCO' 'I

Unique values in RFA_9: ['S4E' 'A1E' 'S4F' 'A2F' 'S3E' 'A3E' 'S2F' 'A1F' ' ' 'S2C' 'S2G' 'A3F'
 'L1E' 'U1G' 'A1D' 'N2F' 'S2D' 'A1G' 'N2E' 'L1G' 'A2E' 'A4D' 'F1F' 'N2G'
 'A3G' 'N1E' 'F1G' 'S3G' 'A4F' 'S2E' 'N1F' 'S3D' 'N3E' 'N4E' 'S4D' 'F1E'
 'N1G' 'A2G' 'P1G' 'L2F' 'N4D' 'L3E' 'A3C' 'A3D' 'F1D' 'L3G' 'S3F' 'L2G'
 'A4E' 'S4G' 'L1D' 'A2D' 'I1G' 'N1D' 'N3D' 'N2D' 'L1F' 'P1F' 'I2E' 'S3C'
 'L2E' 'L4E' 'A4C' 'L3F' 'S3B' 'N3F' 'N4G' 'I2F' 'N3G' 'L2D' 'L3D' 'N4F'
 'A4G' 'A1A' 'A1C' 'I4E' 'I3G' 'S4C' 'P1E' 'L4G' 'U1F' 'N2A' 'L4F' 'L4D'
 'A1B' 'N2C' 'I3E' 'I2G' 'F1C' 'N1C' 'A2C' 'I4G' 'N3C' 'I3F' 'P1D' 'P1C'
 'P1A' 'I4F' 'I1F' 'N4C' 'S4B' 'A4B' 'S2B' 'A2B' 'A3B' 'U1D' 'I1E']
Unique values in RFA_10: ['S4E' 'A1E' ' ' 'L1D' 'A2F' 'S3E' 'A2D' 'A1F' 'S2G' 'A3F' 'L1E' 'A1D'
 'N1F' 'S2D' 'S4F' 'N2E' 'A2E' 'F1F' 'N2G' 'A3G' 'A1G' 'A4G' 'F1E' 'S2E'
 'N3E' 'A4F' 'A3E' 'N4E' 'S4G' 'N1G' 'A2G' 'N4D' 'S3F' 'A4C' 'S3G' 'S4D'
 'N1E' 'S2F' 'F1G' 'A3D' 'F1D' 'A4D' 'N1D' 'N2F' 'N3D' 'A4E' 'N2D' 'S3D'
 'N3F' 'N4G

Unique values in RFA_24: ['S4E' 'F1E' 'S3D' ' ' 'A3D' 'A3E' 'F1D' 'A1F' 'A1D' 'S3F' 'A2D' 'A1G'
 'S2E' 'L4F' 'A2F' 'S2F' 'A1E' 'A3G' 'S4D' 'L2F' 'N3D' 'N3E' 'S2G' 'A2E'
 'S2D' 'N2E' 'L3D' 'S4F' 'F1C' 'L4E' 'F1G' 'A4F' 'S3E' 'N2F' 'L1F' 'P1C'
 'L3E' 'L1G' 'A3F' 'A2G' 'F1F' 'S3G' 'L3F' 'S4G' 'A4D' 'S4C' 'N1G' 'S2C'
 'A3C' 'P1A' 'L4D' 'S4B' 'N1F' 'N3F' 'N2D' 'L2G' 'A4E' 'P1F' 'P1D' 'N2C'
 'N1E' 'A2C' 'A4C' 'N2G' 'S3C' 'P1E' 'N4E' 'N4F' 'L3G' 'A4G' 'N3C' 'U1E'
 'A3B' 'P1G' 'U1D' 'L4G' 'U1F' 'S2B' 'N4D' 'N4G' 'L2E' 'L1E' 'A4B' 'N3G'
 'L1D' 'S3B' 'I4F' 'N1D' 'P1B' 'U1G' 'N1C' 'N4C' 'I4G' 'U1C' 'L2D' 'A1C'
 'L4C']
Unique values in RFA_2R: ['L']
Unique values in RFA_2A: ['E' 'G' 'F' 'D']
Unique values in MDMAUD_R: ['X' 'C' 'D' 'L' 'I']
Unique values in MDMAUD_F: ['X' '1' '2' '5']
Unique values in MDMAUD_A: ['X' 'C' 'M' 'L' 'T']
Unique values in GEOCODE2: ['C' 'A' 'D' 'B' ' ' nan]


In [40]:
for column in num.columns:
    unique_values = num[column].unique()
    print(f"Unique values in {column}: {unique_values}")

Unique values in ODATEDW: [8901 9401 9001 8701 8601 8801 9601 9201 9301 9501 9101 9701 8804 9302
 9509 8810 9511 9111 9009 9309 8910 9510 9212 9506 8608 9410 9209 9103
 9102 8501 8711 9310 8611 8912 9010 8704 9512 8702 8609 9205 8401 9303
 9312 8604 8612 9109 9011 8909 9003 9202 9402 8707 9012 8306]
Unique values in TCODE: [    0     1     2    28     3  1002    42     4    18   980    14 28028
    72    22 13002    23    45    24  4002    30    13   202   136 72002
    96   116   100     6  4004 39002    61    47    36   228 14002  6400
    40    25    21    94    12 58002   134 18002    38     9    76    50
    27    93    17     7    44 24002 22002]
Unique values in DOB: [3712 5202    0 2801 2001 6001 3211 2301 2603 2709 5401 5201 3601 1601
 2311 4307 5601 1401 4809 2601 2904 2901 1002 1311 6801 5310 4611 3110
  908 3706 3001 1411 5210 3703 5801 5001 6401 1801 6201 4801 5701 1805
 2608 4403 2512 2807 3605 5101 4401 3201 6501 3501 3401 3812 2101 2410
 1101 4707  809 1408 2909 3901 20

Unique values in ETH5: [11  6  2 32  1  0  5 12  4 37 16  3 13 27 23 30  8 14 25 18 60 28  9 10
  7 17 15 19 45 98 50 20 34 77 31 43 24 49 59 21 44 29 70 38 89 58 41 46
 26 56 22 74 83 53 92 33 52 35 36 40 75 88 73 64 84 57 91 82 48 90 54 68
 67 47 71 63 95 80 51 87 81 96 86 61 66 85 72 93 42 94 62 65 69 78 39 55
 76 79 97 99]
Unique values in ETH6: [ 0  4  6  1  3  2  9  7  8  5 10 22 15 20 14 11 12 16 13 17 18]
Unique values in ETH7: [ 0  2  3  1 15  5 23 10  4  6  7 57 46 16 32 12 25 59  9 30 33 14  8 31
 22 36 39 20 37 55 53 35 11 26 38 13 17 24 50 27 56 34 51 19 47 29 44 58
 40 28 60 41 18 45 43 49 21 48 67 42 72]
Unique values in ETH8: [ 0  6  1  2  8  3  5 11  4 25  7 15 10  9 34 21 30 17 13 38 19 22 28 18
 14 12 29 23 26 36 16 33 20 31 61 32 40 45 35 39 27 24 50 42 52 37 41 98
 44 47 70 59 72 46 64 62 55 75 99 58]
Unique values in ETH9: [ 0  4  1  2 27  8  3  5  6  7  9 10 11 14 56 13 22 25 23 15 58 43 34 24
 12 19 17 20 42 30 16 48 29 40 31 35 37 26 64 33 32 21 49 50 52 18 36 

Unique values in HHP2: [276 360 254 283 323 265 263 289 250 287 246 145 296 303 374 252 243 272
 258 267 274 257 295 310 268 318 248 334 247 231 181   0 278 233 235 130
 221 317 300 240 253 245 280 286 384 223 322 356 269 242 301 190 262 260
 215 230 302 167 213 332 256 216 164 195 212 207 144 209 358 249 255 178
 238 328 277 184 290 196 285 279 271 305 294 298 362 194 259 251 306 219
 188 339 325 273 309 307 237 275 211 304 393 330 266 284 210 241 177 386
 299 239 270 264 169 183 320 312 333 329 389 350 201 281 205 369 180 166
 203 288 292 244 311 375 313 225 226 293 346 193 224 187 315 234 282 319
 222 340 206 232 229 170 151 197 261 327 204 308 345 200 173 208 176 353
 228 291 202 343 366 321 163 220 160 297 218 361 158 402 236 227 387 324
 379 314 165 341 199 148 405 348 331 191 400 352 174 214 179 326 390 316
 385 121 421 189 175 192 185 349 338 403 363 217 342 152 351 414 153 182
 344 336 116 142 149 337 157 186 420 143 439 127 365 372 370 364 383 154
 347 359 161 377 382 139 198

Unique values in HVP6: [ 0 94 10  1 91  2 32 11 84  4 21 95 80  7 18 71  6 99 92 23  8 22  3 33
  5 24 65 76 13 19 57 38 27 26 51 14  9 60 42 12 82 15 35 90 72 98 62 28
 43 66 25 52 20 67 50 68 83 86 45 49 93 59 70 17 30 37 61 73 89 46 39 16
 44 64 56 29 74 85 88 81 31 87 40 75 47 54 78 69 96 79 41 77 58 63 53 34
 55 97 48 36]
Unique values in HUR1: [ 1  0  4 15  2  5  7  8 17  6  3 14 11 20 12 21  9 74 50 25 10 28 13 22
 26 16 23 47 31 55 41 58 18 33 27 46 37 34 38 39 59 45 29 24 60 48 19 67
 30 44 68 40 32 57 95 42 99 35 70 52 49 66 36 51 62 43 53 71 89 76 78 54
 56 73 65 63 69 81 92 61 83 77 97 90 79 88 80 91 85 75 86 72 82 96 93 64]
Unique values in HUR2: [61 83 36 42 45 51 48 46 63 44 16 54 75 65 56 57 64 50 60 80 34 86 67 84
 43 91 31  7  0 59 40  6 47 92 85 70 35 39 29 30 25 33 53 28 13 20 95 32
 10 97 26  2 17 38 15 23  4 19 18 37 98  3 66 87 79 41 55 49 58 94 88 76
 71 73 21 78 77 68 81 93 14 72 69  9  5 27 62 89 22 52  8 99 90 74  1 11
 24 82 96 12]
Unique values in RHP1: [58

Unique values in TPE5: [ 0 15  9  1  7  2  4 17  6 12  5  3  8 24 21 16 20 13 10 19 11 14 25 18
 23 22 29 50 36 44 32 42 31 33 26 27 38 45 46 37 28 34 55 43 35 39 40 30
 71 41]
Unique values in TPE6: [ 0  3  1  2  4  5 27 14  7 21  8 11 29  9  6 26 30 13 17 10 12 16 15 47
 22 23 28]
Unique values in TPE7: [ 0  1  2  5  3  4  7  6 11 12 22  8 16 10  9 25 20 13]
Unique values in TPE8: [ 0  2  6  3  1 11  7  5  8 12  4 18 53 14 45 13 31  9 10 17 24 36 21 27
 16 15 22 38 32 99 30 28 46 57 44 40 19 23 25 29 20 62 58 34 39 49 33 26
 37 42 35 50 43 59 68 41 48 69 65 47 51 55 72 64 67 52 79 56 78 71 54 70
 91 63 61 60 73 90 74 66 96 76 85]
Unique values in TPE9: [ 4  6  2  0  1 13  3  5  7 16 10 12 11 17  8 20  9 21 22 15 25 18 14 19
 34 26 24 29 30 55 23 35 27 40 36 31 33 32 53 45 39 28 58 99 37 49 54 46
 42 43 59 52 41 38 44 56]
Unique values in PEC1: [ 1  0  2  3  5 19  4  8 15 34  7  6  9 36 10 24 28 14 46 17 22 27 33 12
 13 11 68 52 53 43 58 37 18 30 21 16 66 55 29 72 51 32 25 75 26 35 38

Unique values in OEDC3: [ 1  3  2  6  0  4 29  5  7  8 17 12 13 26  9 11 10 44 37 22 18 32 15 27
 24 21 30 19 16 20 14 28 23 34 33 25 39 35 41 36 49 31 50 40 46 38 51 45
 43 56 60 99 42 64]
Unique values in OEDC4: [ 7 16  8  6  5  4  2 24 10  9 17 12  0  1  3 11 20 23 13 37 14 21 18 25
 15 26 30 22 29 19 33 28 39 36 27 31 34 38 32 44 41 42 35 99 40 43 46 53
 73 51 78 54 55 56 50 57 45 65 48 58 61 64 67 52 47 71]
Unique values in OEDC5: [78 69 74 87 49 82 58 70 72 54 61 59 79 62 80 86 77 48 73 57 81 76  0 53
 91 68 71 65 63 66 84 88 75 67 83 52 64 89 90 60 37 56 47 55 51 85 39 93
 38 44 95 45 50 92 43 31 35 46 42 26 99 41 40 97 24 96 30 36 33 34 28 27
 94 19 32 29 21 98 23 25 16 12 15 18 22 20 11 14 17  9 13]
Unique values in OEDC6: [ 2  5  3  0 12  6  4 16  7 10 14  8 19 26 13  9 24  1 11 15 20 25 17 23
 35 31 18 38 27 21 44 30 28 46 32 22 29 39 74 58 99 55 33 37 36 84 43 41
 34 57 52 48 45 50 81 65 40 53 51 61 47 42 82]
Unique values in OEDC7: [ 0  2  1  4  3  5  8  6  7 11 49  9 10 1

Unique values in HC3: [ 5  2  3  0 12  1  6  4 43  9  8 14 10  7 29 13 11 28 18 20 42 25 26 44
 21 23 19 98 16 45 36 15 76 38 24 22 35 17 32 46 82 99 54 50 30 53 48 47
 65 33 27 49 34 39 58 41 37 31 51 81 59 57 78 52 91 80 68 66 84 74 79 62
 40 64 61 67 71 55 60 56 93 94 96]
Unique values in HC4: [14 26 12 10  1  5  2  0  4  6  3 13 17 99 11 23 21 20 57 32  8 58  7 40
 25 22 37 15 16  9 31 30 45 73 29 28 93 64 75 70 33 18 19 55 38 56 80 46
 90 68 71 92 54 91 27 49 41 34 35 24 60 53 42 85 51 48 77 47 44 61 65 36
 98 97 88 43 82 78 72 74 39 59 52 50 69 86 62 89 81 63 84 79 76 66 67 83
 96 94 87 95]
Unique values in HC5: [14 56 23 19  3 43 31  4  2  8  0 12 32 30 40 26 35 13 99 18 27 42 33 86
  6 85 44 16 68  7 49 39  1 11  5 28 29 51 54 24 53 45 52 55 73 50 46 98
 62 88 93 41 22 94 20 17 37 10 25 67 64 34 87 15 47 60 36  9 59 78 71 95
 96 38 48 21 58 80 79 97 76 74 63 65 57 69 90 61 83 72 91 89 82 92 81 66
 75 77 84 70]
Unique values in HC6: [31 97 50 39  6 75 30 77 16 23  0 37 59 53 20 
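
Before resampling, the instructions ask to check the degree of imbalance. A short sketch of that check, assuming a binary target like `TARGET_B`; the toy series is invented, and the point is only that the majority-class share equals the accuracy a model reaches by always predicting the majority class.

```python
import pandas as pd

# Hypothetical target: 95 non-donors and 5 donors.
target = pd.Series([0] * 95 + [1] * 5, name="TARGET_B")

counts = target.value_counts()                 # absolute class counts
shares = target.value_counts(normalize=True)   # class proportions
majority_baseline = shares.max()               # accuracy of always predicting the majority class

print(counts)
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the real dataframe the same two `value_counts` calls on `num["TARGET_B"]` would give the numbers to compare any model's accuracy against.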

In [None]:
# Oversampling

In [43]:
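from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler


# Generate synthetic minority-class rows until both classes are the same size.
smote = SMOTE()
X = num.drop("TARGET_B", axis = 1)
y = num["TARGET_B"]
x_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()  # not displayed: a cell only shows its last expression

# Standardize the resampled features (the scaler is fit on the full resampled data, before any split).
scaler = StandardScaler()
stand_data_over = scaler.fit_transform(x_sm)
stand_data_over = pd.DataFrame(stand_data_over, columns = x_sm.columns)

# Note: this rebinds `smote` from the SMOTE object to the resampled dataframe.
smote = pd.concat([stand_data_over, y_sm], axis = 1)
smote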

Unnamed: 0,ODATEDW,TCODE,DOB,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,...,MAXRAMNT,MAXRDATE,LASTGIFT,LASTDATE,FISTDATE,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,TARGET_B
0,-0.673768,-0.061637,0.491866,-0.387636,-0.209833,0.801232,0.327631,-0.886931,0.847781,-0.498063,...,-0.334752,-0.185470,-0.486615,-0.846907,-0.676447,-0.482082,-0.031302,-0.802034,2.107737,0
1,0.889002,-0.060483,1.226118,1.480468,-0.209833,-1.453724,1.828025,-1.309579,-0.144359,-0.498063,...,0.279323,0.482881,0.674418,-0.846907,0.663196,0.300259,0.904197,-0.802034,0.099995,0
2,-0.361214,-0.060483,-1.337356,-0.154123,-0.209833,-0.983941,-0.029605,0.018742,-0.144359,0.761746,...,-0.145806,-1.370273,-0.873626,-0.846907,-0.374272,-0.507794,-1.450554,1.246830,2.107737,0
3,-1.298875,-0.061637,0.042938,-0.154123,-0.209833,-0.702072,-1.101315,-0.102014,-0.888463,-0.917999,...,-0.381988,-0.130787,-0.486615,-0.846907,-1.378165,-0.573837,1.328030,1.246830,2.107737,0
4,-1.611429,-0.061637,-0.351291,6.617753,0.018430,-0.232289,-1.458552,1.226307,4.816340,-0.288094,...,-0.193042,1.023637,-0.099604,1.047194,-4.060809,-0.568668,-1.591108,1.246830,0.099995,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181133,-1.055083,-0.061637,-0.403033,0.079390,0.246693,-0.044376,0.684868,-1.007688,0.103676,-0.708031,...,-0.546232,-1.667993,-0.856939,-0.399984,-1.106208,-0.694044,1.037658,-0.802034,1.103866,1
181134,-0.670642,-0.060483,-0.045270,-0.387636,-0.209833,0.707275,-1.529999,0.743281,-0.640429,0.131842,...,0.659660,0.458577,1.297642,-0.506394,-0.686520,0.047846,-1.461441,-0.802034,-0.903876,1
181135,-1.305126,-0.061637,-0.382829,0.079390,-0.209833,-1.265811,-0.386842,-1.007688,-1.136498,-0.708031,...,-0.572552,-0.543949,-0.850918,0.430015,-1.136425,-0.798141,0.369082,-0.802034,1.103866,1
181136,0.663963,-0.061637,-0.436543,0.079390,-0.209833,-2.393289,-1.458552,-1.671849,-1.384533,-0.917999,...,-0.586552,-0.282685,-0.744413,1.047194,0.498678,-0.751020,0.671787,-0.802034,1.103866,1


In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


X = smote.drop("TARGET_B",axis = 1)
y = smote["TARGET_B"]

# The split happens after SMOTE, so the test set also contains synthetic rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.2)
LR = LogisticRegression()
LR.fit(X_train, y_train)
pred = LR.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     18054
           1       1.00      0.94      0.96     18174

    accuracy                           0.97     36228
   macro avg       0.97      0.97      0.97     36228
weighted avg       0.97      0.97      0.97     36228
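
The scores above come from a test set that was drawn after SMOTE and scaling were applied to the whole dataset, so it contains synthetic rows and is balanced by construction; it probably overstates how the model would do on real, imbalanced data. A hedged sketch of the alternative order (split first, then resample and fit the scaler on the training part only), using synthetic data and plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE. The feature names `f1`..`f3` are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the numerical donors data (10% positives).
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame(rng.normal(size=(n, 3)), columns=["f1", "f2", "f3"])
toy["TARGET_B"] = (np.arange(n) % 10 == 0).astype(int)

# Split first, keeping the original class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["TARGET_B"]
)

# Upsample the minority class in the training set only.
majority = train[train["TARGET_B"] == 0]
minority = train[train["TARGET_B"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_bal = pd.concat([majority, minority_up], axis=0)

X_train = train_bal.drop("TARGET_B", axis=1)
y_train = train_bal["TARGET_B"]
X_test = test.drop("TARGET_B", axis=1)
y_test = test["TARGET_B"]

scaler = StandardScaler().fit(X_train)  # fit on training data only
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)
pred = model.predict(scaler.transform(X_test))
print(classification_report(y_test, pred, zero_division=0))
```

Because the test set keeps the original class ratio, the per-class recall and precision in this report are the figures worth comparing between runs, more than overall accuracy.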



In [None]:
# Undersampling

In [49]:
category_0 = num[num['TARGET_B'] == 0]
category_1 = num[num['TARGET_B'] == 1]

In [50]:
category_0 = category_0.sample(len(category_1))
category_0

Unnamed: 0,ODATEDW,TCODE,DOB,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,...,MAXRAMNT,MAXRDATE,LASTGIFT,LASTDATE,FISTDATE,AVGGIFT,CONTROLN,TARGET_B,HPHONE_D,RFA_2F
23614,8901,0,1701,1,0,39,29,30,7,2,...,11.0,9303,11.0,9509,8901,7.250000,59337,0,1,1
55289,9501,28,0,0,0,19,21,36,4,0,...,25.0,9506,25.0,9506,9506,25.000000,146557,0,0,1
45994,8601,0,4103,0,0,37,31,46,7,7,...,10.0,8812,7.0,9603,8609,5.285714,98832,0,1,4
8777,9201,1,4410,0,0,0,0,0,0,26,...,30.0,9501,18.0,9701,9201,15.000000,11340,0,1,1
37032,9201,28,2701,5,0,33,26,33,6,1,...,10.0,9512,10.0,9512,9202,6.142857,96711,0,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37707,9101,2,5103,18,0,20,17,35,4,22,...,25.0,9512,25.0,9512,9109,11.833333,111610,0,1,1
93471,9201,0,0,0,0,27,18,27,6,2,...,15.0,9401,5.0,9611,9209,10.611111,30350,0,0,1
52710,9401,1,1401,12,0,0,0,0,0,0,...,20.0,9603,20.0,9603,9401,14.000000,169629,0,1,1
10774,8601,0,1211,2,0,34,40,8,0,7,...,15.0,8905,10.0,9509,8608,8.684211,99170,0,1,1


In [51]:
data = pd.concat([category_0, category_1], axis = 0)
data

Unnamed: 0,ODATEDW,TCODE,DOB,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,...,MAXRAMNT,MAXRDATE,LASTGIFT,LASTDATE,FISTDATE,AVGGIFT,CONTROLN,TARGET_B,HPHONE_D,RFA_2F
23614,8901,0,1701,1,0,39,29,30,7,2,...,11.0,9303,11.0,9509,8901,7.250000,59337,0,1,1
55289,9501,28,0,0,0,19,21,36,4,0,...,25.0,9506,25.0,9506,9506,25.000000,146557,0,0,1
45994,8601,0,4103,0,0,37,31,46,7,7,...,10.0,8812,7.0,9603,8609,5.285714,98832,0,1,4
8777,9201,1,4410,0,0,0,0,0,0,26,...,30.0,9501,18.0,9701,9201,15.000000,11340,0,1,1
37032,9201,28,2701,5,0,33,26,33,6,1,...,10.0,9512,10.0,9512,9202,6.142857,96711,0,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95298,8601,2,5304,0,0,45,28,37,9,2,...,17.0,9601,17.0,9601,8608,7.935667,154544,1,0,1
95309,9401,0,4701,1,1,32,43,24,7,5,...,15.0,9402,15.0,9512,9310,11.666667,171302,1,1,1
95398,8601,0,1110,0,1,32,21,26,9,1,...,25.0,9511,20.0,9602,8711,14.400000,78831,1,0,3
95403,9001,0,4001,0,0,24,46,20,6,1,...,20.0,9312,20.0,9601,9003,11.583333,84678,1,0,1


In [52]:
print(data['TARGET_B'].value_counts())

0    4843
1    4843
Name: TARGET_B, dtype: int64
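
The `sample` call above has no seed, so it draws a different subset of non-donors on every run. A small sketch of a seeded alternative; the helper name `downsample_majority` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def downsample_majority(frame, target, random_state=42):
    """Randomly shrink every class to the size of the smallest one."""
    smallest = frame[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Hypothetical frame with 16 non-donors and 4 donors.
toy = pd.DataFrame({
    "AVGGIFT": range(20),
    "TARGET_B": [0] * 16 + [1] * 4,
})
balanced = downsample_majority(toy, "TARGET_B")
print(balanced["TARGET_B"].value_counts())
```

Applying such a helper to the training split only, after `train_test_split`, would keep the test set at the original class ratio.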


In [53]:
X = data.drop("TARGET_B",axis = 1)
y = data["TARGET_B"]

# Unlike the SMOTE run, the features are not scaled before fitting here.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.2)
LR = LogisticRegression()
LR.fit(X_train, y_train)
pred = LR.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.54      0.59      0.56       985
           1       0.53      0.47      0.50       953

    accuracy                           0.53      1938
   macro avg       0.53      0.53      0.53      1938
weighted avg       0.53      0.53      0.53      1938

