![Python_logo](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png)


# **Cortex Game: Round 2 - Probability of Giving**

> Before playing the game, you need to connect to SAS through SASPy.
>
>> If this is your first time, please follow steps 0 to 5 below!

***
## **Connect to SASPy**

**0- Connect to your Google Drive folder**

In [None]:
my_folder = "/content/drive/MyDrive/retoo"

from google.colab import drive
drive.mount('/content/drive')

# Change the following code to set your Drive folder
import os
os.chdir(my_folder)
!pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/retoo


**1- Make sure that your Python version is 3.3 or higher and your Java version is 1.8.0_162 or higher**

In [None]:
!echo "Python is at" $(which python)
!python --version

Python is at /usr/local/bin/python
Python 3.8.15


In [None]:
!echo "Java is at" $(which java)
!/usr/bin/java -version

Java is at /usr/bin/java
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing)


**2- Install SASPy**

In [None]:
!pip install saspy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**3- Create the configuration file "sascfg_personal.py"**

Please check that your Home Region is correct. You can look it up at [ODA-SAS](https://welcome.oda.sas.com/home).

In [None]:
%%writefile sascfg_personal.py
SAS_config_names=['oda']
oda = {'java' : '/usr/bin/java',
#US Home Region 1
'iomhost' : ['odaws01-usw2.oda.sas.com','odaws02-usw2.oda.sas.com','odaws03-usw2.oda.sas.com','odaws04-usw2.oda.sas.com'],
#US Home Region 2
#'iomhost' : ['odaws01-usw2-2.oda.sas.com','odaws02-usw2-2.oda.sas.com'],
#European Home Region 1
#'iomhost' : ['odaws01-euw1.oda.sas.com','odaws02-euw1.oda.sas.com'],
#Asia Pacific Home Region 1
#'iomhost' : ['odaws01-apse1.oda.sas.com','odaws02-apse1.oda.sas.com'],
#Asia Pacific Home Region 2
#'iomhost' : ['odaws01-apse1-2.oda.sas.com','odaws02-apse1-2.oda.sas.com'],
'iomport' : 8591,
'authkey' : 'oda',
'encoding' : 'utf-8'
}

Overwriting sascfg_personal.py


**4- Create your .authinfo**

If there is no .authinfo file yet, you can create one by uncommenting the cell below and replacing USR and PSW with your own credentials.

In [None]:
#%%writefile .authinfo
#oda user USR password PSW

Copy this file to your home directory

In [None]:
!cp .authinfo ~/.authinfo

cp: cannot stat '.authinfo': No such file or directory
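
Step 4 can also be done from Python. The sketch below writes a single line in the `oda user USR password PSW` format shown above and then restricts the file to its owner. The helper name `write_authinfo` is hypothetical, `USR` and `PSW` are placeholders, and the file is written to a temporary directory rather than the real home directory so that the example is safe to run as is. Restricting permissions is a common precaution for files that hold passwords, not something this notebook requires.

```python
import os
import tempfile

def write_authinfo(path, key, user, password):
    """Write one authinfo line (hypothetical helper) and restrict the file to its owner."""
    line = f"{key} user {user} password {password}\n"
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(line)
    # Owner read/write only; on non-POSIX systems this may have limited effect.
    os.chmod(path, 0o600)
    return line

# Demo target: a temporary directory instead of ~/.authinfo.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".authinfo")
written = write_authinfo(demo_path, "oda", "USR", "PSW")
print(written, end="")
```

To use it for real, point `path` at `os.path.expanduser("~/.authinfo")` and pass your own ODA credentials.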


**5- Establish Connection (you need to do this step each time you use SASPy)**

In [None]:
import saspy
sas_session = saspy.SASsession(cfgfile=os.path.join(
    my_folder,"sascfg_personal.py"))
sas_session

Using SAS Config named: oda
Error trying to read authinfo file:/root/.authinfo
[Errno 2] No such file or directory: '/root/.authinfo'
Did not find key oda in authinfo file:/root/.authinfo

Please enter the OMR user id: A00828688@tec.mx
Please enter the password for OMR user : ··········
SAS Connection established. Subprocess id is 1204



Access Method         = IOM
SAS Config name       = oda
SAS Config file       = /content/drive/MyDrive/retoo/sascfg_personal.py
WORK Path             = /saswork/SAS_workE13600011571_odaws04-usw2.oda.sas.com/SAS_workB2E100011571_odaws04-usw2.oda.sas.com/
SAS Version           = 9.04.01M6P11072018
SASPy Version         = 4.4.1
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf-8
SAS process Pid value = 71025


***
## Connect to Cortex Data Sets

Load Cortex datasets from SAS Studio

In [None]:
ps = sas_session.submit("""
    libname cortex '~/my_shared_file_links/u39842936/Cortex Data Sets';
    """)
print(ps["LOG"])


5                                                          The SAS System                      Friday, December  2, 2022 06:52:00 AM

24         ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
24       ! ods graphics on / outputfmt=png;
25         
26         
27             libname cortex '~/my_shared_file_links/u39842936/Cortex Data Sets';
28         
29         
30         
31         ods html5 (id=saspy_internal) close;ods listing;
32         

6                                                          The SAS System                      Friday, December  2, 2022 06:52:00 AM

33         


For local Jupyter, use the following cell instead:

In [None]:
#%%SAS sas_session
#libname cortex '~/my_shared_file_links/u39842936/Cortex Data Sets';

### Transform cloud SAS dataset to Python dataframe (pandas)

For reference:

1. [Pandas library](https://pandas.pydata.org/docs/user_guide/index.html)
2. [sklearn.model_selection for data partition](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
import pandas as pd

data1 = sas_session.sasdata2dataframe(
table='hist',
libref='cortex'
)

data2 = sas_session.sasdata2dataframe(
table='target_rd2',
libref='cortex'
)

## Merge the Data

In [None]:
data_merge = pd.merge(data1, data2, on=["ID"],how="right")
#data_merge.head()
data_merge.sample(2)

Unnamed: 0,ID,LastName,FirstName,Woman,Age,Salary,Education,City,SeniorList,NbActivities,...,Frequency,Seniority,TotalGift,MinGift,MaxGift,GaveLastYear,AmtLastYear,Contact,GaveThisYear,AmtThisYear
618699,2618700.0,DEVANEY,HARVEY,0.0,28.0,16400.0,High School,Downtown,1.0,0.0,...,1.0,1.0,100.0,100.0,100.0,0.0,0.0,0.0,0.0,0.0
278815,2278816.0,SAYER,CHARLES,0.0,58.0,18500.0,University / College,Rural,5.0,0.0,...,,,,,,0.0,0.0,0.0,0.0,0.0


In [None]:
data_merge.rename(columns={"City": "Location"}, inplace = True)


## Treating Missing Values

Please be aware that deleting all missing values can induce a selection bias. 
Some missing values are very informative. For example, when MinGift is missing, it means that the donor never gave in the past 10 years (up to but excluding last year). Instead of deleting this information, replacing it with 0 is more appropriate!

A good understanding of the business case and the data can help you come up with more appropriate strategies to deal with missing values.
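
As a minimal sketch of this idea, the example below fills missing `MinGift` values with 0 and also keeps a flag recording which rows were missing. The toy values and the `NeverGave` column name are assumptions made for illustration; they are not part of the Cortex data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): NaN in MinGift means "no gift in the past 10 years".
donors = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "MinGift": [20.0, np.nan, 10.0, np.nan],
})

# Record which rows were missing before filling, so the information is not lost.
donors["NeverGave"] = donors["MinGift"].isna().astype(int)

# Replace the missing amounts with 0 instead of dropping the rows.
donors["MinGift"] = donors["MinGift"].fillna(0)

print(donors)
```

No rows are dropped, so the sample keeps donors who never gave, which is what avoids the selection bias described above.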

In [None]:
# Here, missing amounts (Salary, Referrals, TotalGift, MaxGift, MinGift, AmtLastYear) are replaced with 0
# and missing Recency with 100. Adjust these choices to what you think is reasonable for each variable.
data_mergeN = pd.merge(data1, data2, on=["ID"],how="right")

import numpy as np

data_mergeN[['Salary']] = data_mergeN[['Salary']].fillna(value=0)  

data_mergeN[['Referrals']] = data_mergeN[['Referrals']].fillna(value=0)  

data_mergeN[['TotalGift']] = data_mergeN[['TotalGift']].fillna(value=0)  

data_mergeN[['MaxGift']] = data_mergeN[['MaxGift']].fillna(value=0)  

data_mergeN[['MinGift']] = data_mergeN[['MinGift']].fillna(value=0)  

data_mergeN[['AmtLastYear']] = data_mergeN[['AmtLastYear']].fillna(value=0)  
data_mergeN[['HistoricDonor']] = data_merge[['Frequency']].notna().astype(int)

data_mergeN[['Recency']] = data_mergeN[['Recency']].fillna(value=100)  

def dummies(df, column):
  df = pd.concat([df, pd.get_dummies(df[column])], axis = 1)
  return df.drop(column, axis=1)

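# Note: dummies() is applied to data_merge here, not data_mergeN, so the fillna() results and the
# HistoricDonor column created above are not carried forward (hence the NaN values in the output below).
data_mergeN = dummies(data_merge, "Education")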
data_mergeN = dummies(data_mergeN, "Location")

data_mergeN['logSalary']= np.log(data_mergeN['Salary'])
data_mergeN['logReferrals']= np.log(data_mergeN['Referrals'])
data_mergeN['logTotalGift']= np.log(data_mergeN['TotalGift'])
data_mergeN['logMaxGift']= np.log(data_mergeN['MaxGift'])
data_mergeN['logMinGift']= np.log(data_mergeN['MinGift'])
data_mergeN['logAmtLastYear']= np.log(data_mergeN['AmtLastYear'])

data_mergeN_sorted = data_mergeN.sort_values('GaveThisYear',ascending=False)
data_mergeN_sorted.head(10)




Unnamed: 0,ID,LastName,FirstName,Woman,Age,Salary,SeniorList,NbActivities,Referrals,Recency,...,City,Downtown,Rural,Suburban,logSalary,logReferrals,logTotalGift,logMaxGift,logMinGift,logAmtLastYear
540449,2540450.0,THORKILDSEN,MICHAEL,0.0,16.0,67700.0,4.0,0.0,0.0,,...,0,0,1,0,11.122841,-inf,,,,-inf
853818,2853819.0,SPORER,STEVEN,0.0,27.0,203100.0,2.0,0.0,0.0,,...,1,0,0,0,12.221454,-inf,,,,-inf
646295,2646296.0,COSTALES,DEBORAH,1.0,35.0,49100.0,8.0,3.0,5.0,4.0,...,1,0,0,0,10.801614,1.609438,2.995732,2.995732,2.995732,-inf
853812,2853813.0,LLOYD,DONALD,0.0,48.0,17200.0,10.0,0.0,0.0,1.0,...,0,0,0,1,9.752665,-inf,5.105945,4.317488,2.302585,4.787492
172014,2172015.0,RYAN,SANDY,1.0,59.0,241700.0,2.0,0.0,0.0,,...,0,0,1,0,12.395453,-inf,,,,-inf
172015,2172016.0,SCHWARTZ,EDWARD,0.0,22.0,2600.0,2.0,0.0,2.0,2.0,...,0,0,0,1,7.863267,0.693147,2.995732,2.995732,2.995732,-inf
172017,2172018.0,STROUD,DEBRA,1.0,30.0,35400.0,6.0,0.0,0.0,,...,0,0,0,1,10.474467,-inf,,,,-inf
853808,2853809.0,BONTRAGER,PATRICIA,1.0,50.0,41900.0,7.0,0.0,3.0,,...,0,0,0,1,10.643041,1.098612,,,,-inf
172022,2172023.0,CARBARY,BENJAMIN,0.0,49.0,40800.0,7.0,3.0,5.0,1.0,...,0,0,1,0,10.616437,1.609438,4.174387,3.218876,2.302585,3.688879
172024,2172025.0,DODSON,NICHOLAS,0.0,19.0,31500.0,7.0,0.0,6.0,1.0,...,1,0,0,0,10.357743,1.791759,3.912023,3.218876,3.218876,2.302585
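
The `-inf` entries in the table above come from taking the logarithm of 0. One common alternative, shown here only as a sketch on toy values, is `np.log1p`, which computes log(1 + x) and therefore maps 0 to 0. This is not the transform used in this notebook, and switching to it would change the model inputs.

```python
import numpy as np
import pandas as pd

# Toy amounts (hypothetical values), including a zero.
amounts = pd.Series([0.0, 20.0, 100.0], name="AmtLastYear")

# Plain log: log(0) evaluates to -inf (the warning is silenced for this demo).
with np.errstate(divide="ignore"):
    plain_log = np.log(amounts)

# log1p computes log(1 + x), so a zero amount stays at 0 and everything is finite.
shifted_log = np.log1p(amounts)

print(pd.DataFrame({"amount": amounts, "log": plain_log, "log1p": shifted_log}))
```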


In [None]:
data_mergeN_sorted.loc[data_mergeN_sorted['logSalary'] < 1, 'logSalary'] = 0
data_mergeN_sorted.loc[data_mergeN_sorted['logReferrals'] < 1, 'logReferrals'] = 0
data_mergeN_sorted.loc[data_mergeN_sorted['logTotalGift'] < 1, 'logTotalGift'] = 0
data_mergeN_sorted.loc[data_mergeN_sorted['logMaxGift'] < 1, 'logMaxGift'] = 0
data_mergeN_sorted.loc[data_mergeN_sorted['logMinGift'] < 1, 'logMinGift'] = 0
data_mergeN_sorted.loc[data_mergeN_sorted['logAmtLastYear'] < 1, 'logAmtLastYear'] = 0
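# Note: at this point "City" is the dummy column for the City category of Location,
# so this renames that dummy column (see the output below).
data_mergeN_sorted.rename(columns={"City": "Location"}, inplace = True)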



data_mergeN_sorted

Unnamed: 0,ID,LastName,FirstName,Woman,Age,Salary,SeniorList,NbActivities,Referrals,Recency,...,Location,Downtown,Rural,Suburban,logSalary,logReferrals,logTotalGift,logMaxGift,logMinGift,logAmtLastYear
540449,2540450.0,THORKILDSEN,MICHAEL,0.0,16.0,67700.0,4.0,0.0,0.0,,...,0,0,1,0,11.122841,0.000000,,,,0.000000
853818,2853819.0,SPORER,STEVEN,0.0,27.0,203100.0,2.0,0.0,0.0,,...,1,0,0,0,12.221454,0.000000,,,,0.000000
646295,2646296.0,COSTALES,DEBORAH,1.0,35.0,49100.0,8.0,3.0,5.0,4.0,...,1,0,0,0,10.801614,1.609438,2.995732,2.995732,2.995732,0.000000
853812,2853813.0,LLOYD,DONALD,0.0,48.0,17200.0,10.0,0.0,0.0,1.0,...,0,0,0,1,9.752665,0.000000,5.105945,4.317488,2.302585,4.787492
172014,2172015.0,RYAN,SANDY,1.0,59.0,241700.0,2.0,0.0,0.0,,...,0,0,1,0,12.395453,0.000000,,,,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360331,2360332.0,YANG,JAMIE,1.0,21.0,17700.0,1.0,0.0,0.0,,...,0,0,1,0,9.781320,0.000000,,,,0.000000
360332,2360333.0,JIMENEZ,MARVIN,0.0,53.0,45900.0,8.0,0.0,4.0,1.0,...,1,0,0,0,10.734220,1.386294,2.302585,2.302585,2.302585,0.000000
360335,2360336.0,JURCZYK,MILTON,0.0,74.0,208000.0,0.0,0.0,0.0,,...,0,1,0,0,12.245293,0.000000,,,,0.000000
360336,2360337.0,PIZZI,MICHAEL,0.0,33.0,43500.0,9.0,1.0,2.0,2.0,...,0,0,0,1,10.680516,0.000000,4.499810,3.912023,2.708050,0.000000


## Data Partition

In [None]:
# The code below is an illustration on how to sample data on train and validation samples.
# You could use another library or a built-in function to perform sampling.

from sklearn.model_selection import train_test_split
train, validation = train_test_split(data_mergeN_sorted, test_size=0.4, random_state=44444) 
train.sample(2)

Unnamed: 0,ID,LastName,FirstName,Woman,Age,Salary,SeniorList,NbActivities,Referrals,Recency,...,Location,Downtown,Rural,Suburban,logSalary,logReferrals,logTotalGift,logMaxGift,logMinGift,logAmtLastYear
711287,2711288.0,DELACRUZ,SIDNEY,0.0,44.0,12500.0,7.0,1.0,2.0,,...,1,0,0,0,9.433484,0.0,,,,4.60517
126050,2126051.0,ALVAREZ,CHANCE,0.0,73.0,107000.0,10.0,3.0,3.0,4.0,...,1,0,0,0,11.580584,1.098612,3.912023,3.401197,2.995732,0.0
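
Because givers are a minority (roughly 15% of the validation rows in this notebook), it can help to keep the class proportions equal in both samples. The sketch below uses the `stratify` argument of `train_test_split` on a toy frame; the toy data and the names `toy`, `train_part` and `valid_part` are assumptions for illustration, and the notebook's own split does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values): 5 givers out of 20 members, i.e. 25%.
toy = pd.DataFrame({
    "Age": range(20, 40),
    "GaveThisYear": [1] * 5 + [0] * 15,
})

# stratify keeps the share of givers the same in the train and validation samples.
train_part, valid_part = train_test_split(
    toy, test_size=0.4, random_state=0, stratify=toy["GaveThisYear"]
)

print(train_part["GaveThisYear"].mean(), valid_part["GaveThisYear"].mean())
```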


In [None]:
# v8

X_train = train[['logTotalGift', 'logSalary','logAmtLastYear','Age','Referrals','logMinGift','logMaxGift','Woman','Recency','Contact','GaveLastYear','University / College','Elementary','Downtown', 'Rural', 'Suburban']] 
Y_train = train['GaveThisYear']
X_valid = validation[['logTotalGift', 'logSalary','logAmtLastYear','Age','Referrals','logMinGift','logMaxGift','Woman','Recency','Contact','GaveLastYear','University / College','Elementary','Downtown', 'Rural', 'Suburban']] 
Y_valid = validation['GaveThisYear']

# Prebuilt Models

The scikit-learn library offers many prebuilt models.

scikit-learn library: https://scikit-learn.org/stable/index.html

## Logistic Regression Model

In [None]:
#from sklearn.linear_model import LogisticRegression
#regr = LogisticRegression()
#regr.fit(X_train,Y_train)
#regr_predict=regr.predict(X_valid)

In [None]:
#you can change the criteria

#from sklearn.metrics import confusion_matrix

#regr_cm = confusion_matrix(Y_valid, regr_predict)
#print(regr_cm)

In [None]:
#from sklearn.metrics import classification_report
#print(classification_report(Y_valid, regr_predict))

## Decision Tree Model

In [None]:
#from sklearn.tree import DecisionTreeClassifier
#DT_model = DecisionTreeClassifier(max_depth=5,criterion="entropy").fit(X_train,Y_train)
#DT_predict_proba = DT_model.predict_proba(X_valid) #Predicted probabilities on the validation data
#DT_predict = DT_model.predict(X_valid) #Predicted classes on the validation data
# Probabilities for each class
#DT_probs = DT_model.predict_proba(X_valid)[:, 1]
#print(DT_probs)

In [None]:
#you can change the criteria

#from sklearn.metrics import confusion_matrix

#DT_cm = confusion_matrix(Y_valid, DT_predict)
#print(DT_cm)


In [None]:
#from sklearn.metrics import classification_report
#print(classification_report(Y_valid, DT_predict))

### *Other models may also be helpful for this game*

Reference: https://scikit-learn.org/stable/supervised_learning.html
    

## XGBoost Model

In [None]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier().fit(X_train, Y_train)
XGB_predict = xgb_model.predict(X_valid)


In [None]:
#you can change the criteria

from sklearn.metrics import confusion_matrix

# Use a separate name for the result so the imported function is not overwritten
xgb_cm = confusion_matrix(Y_valid, XGB_predict)
print(xgb_cm)

[[336796   3483]
 [ 53640   6081]]


In [None]:
from sklearn.metrics import classification_report
print(classification_report(Y_valid, XGB_predict))

              precision    recall  f1-score   support

         0.0       0.86      0.99      0.92    340279
         1.0       0.64      0.10      0.18     59721

    accuracy                           0.86    400000
   macro avg       0.75      0.55      0.55    400000
weighted avg       0.83      0.86      0.81    400000



## LightGBM Model

In [None]:
import lightgbm as ltb
lgb_model = ltb.LGBMClassifier().fit(X_train, Y_train)

lgb_predict = lgb_model.predict(X_valid)

In [None]:
#you can change the criteria

from sklearn.metrics import confusion_matrix

# Use a separate name for the result so the imported function is not overwritten
lgb_cm = confusion_matrix(Y_valid, lgb_predict)
print(lgb_cm)

[[336889   3390]
 [ 53384   6337]]


In [None]:
from sklearn.metrics import classification_report
print(classification_report(Y_valid, lgb_predict))

              precision    recall  f1-score   support

         0.0       0.86      0.99      0.92    340279
         1.0       0.65      0.11      0.18     59721

    accuracy                           0.86    400000
   macro avg       0.76      0.55      0.55    400000
weighted avg       0.83      0.86      0.81    400000
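
The reports above use the default 0.5 cut-off, which finds only about 10% of the givers. The comment "you can change the criteria" can be read as changing that cut-off. The sketch below, on made-up probabilities, shows how lowering the threshold trades precision for recall. The arrays and the helper `recall_precision_at` are hypothetical and are not taken from the models above.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities of giving (illustration only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
p_give = np.array([0.05, 0.10, 0.15, 0.20, 0.35, 0.45, 0.30, 0.40, 0.55, 0.70])

def recall_precision_at(threshold):
    """Turn probabilities into labels at the given threshold and score them."""
    labels = (p_give >= threshold).astype(int)
    return recall_score(y_true, labels), precision_score(y_true, labels)

r_default, p_default = recall_precision_at(0.5)  # default cut-off
r_lower, p_lower = recall_precision_at(0.3)      # more permissive cut-off

print("threshold 0.5: recall", r_default, "precision", p_default)
print("threshold 0.3: recall", r_lower, "precision", p_lower)
```

With a fitted model, `p_give` would come from `predict_proba(X_valid)[:, 1]`, as in the scoring cells later in this notebook.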



## Scoring New Data

### Prepare data for scoring

In [None]:
data3 = sas_session.sasdata2dataframe(
table='score',
libref='cortex'
)
data4 = sas_session.sasdata2dataframe(
table='score_rd2_contact',
libref='cortex'
)
data5 = sas_session.sasdata2dataframe(
table='SCORE_RD2_NOCONTACT',
libref='cortex'
)

### Score new data based on your champion model

Pick your champion model from the previous steps and use it to predict next year's donations.

In this notebook, the LightGBM model (`lgb_model`) is used as the champion for scoring: on the validation set it identified slightly more givers than the XGBoost model (6,337 versus 6,081 true positives).
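
The AUC criterion is mentioned here but not computed anywhere in this notebook. A minimal sketch of how two candidates could be compared by validation AUC is shown below. It runs on synthetic data from `make_classification`, not on the Cortex tables, and the names `candidates`, `auc_scores` and `champion` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the donor data, with about 15% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

auc_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # AUC is computed from the predicted probability of class 1, not from hard labels.
    auc_scores[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

# The champion is the candidate with the highest validation AUC.
champion = max(auc_scores, key=auc_scores.get)
print(auc_scores, champion)
```

The same loop would work with `X_train`, `Y_train`, `X_valid` and `Y_valid` from this notebook, provided each candidate can handle the missing values present in those features.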

### Predict 'probability of giving' for members who were contacted

In [None]:
scoring_data_contact = pd.merge(data3, data4, on=["ID"],how="right")

# Handle missing values in the scoring dataset:
# amount-type columns are filled with 0 and Recency with 100.
scoring_data_contact.rename(columns={"City": "Location"}, inplace = True)

import numpy as np
scoring_data_contact[['Salary']] = scoring_data_contact[['Salary']].fillna(value=0)  
scoring_data_contact[['Referrals']] = scoring_data_contact[['Referrals']].fillna(value=0)  
scoring_data_contact[['TotalGift']] = scoring_data_contact[['TotalGift']].fillna(value=0)  
scoring_data_contact[['MaxGift']] = scoring_data_contact[['MaxGift']].fillna(value=0)  
scoring_data_contact[['MinGift']] = scoring_data_contact[['MinGift']].fillna(value=0)  
scoring_data_contact[['AmtLastYear']] = scoring_data_contact[['AmtLastYear']].fillna(value=0)  
scoring_data_contact[['Recency']] = scoring_data_contact[['Recency']].fillna(value=100)  
scoring_data_contact[['HistoricDonor']] = scoring_data_contact[['Frequency']].notna().astype(int)


scoring_data_contact = dummies(scoring_data_contact, "Education")
scoring_data_contact = dummies(scoring_data_contact, "Location")

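# Note: these transforms add 1 to the log, whereas the training features above used np.log(x) without the offset.
scoring_data_contact['logSalary']= np.log(scoring_data_contact['Salary'])+1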
scoring_data_contact['logReferrals']= np.log(scoring_data_contact['Referrals'])+1
scoring_data_contact['logTotalGift']= np.log(scoring_data_contact['TotalGift'])+1
scoring_data_contact['logMaxGift']= np.log(scoring_data_contact['MaxGift'])+1
scoring_data_contact['logMinGift']= np.log(scoring_data_contact['MinGift'])+1
scoring_data_contact['logAmtLastYear']= np.log(scoring_data_contact['AmtLastYear'])+1

scoring_data_contact.head(10)

scoring_data_contact.loc[scoring_data_contact['logSalary'] < 1, 'logSalary'] = 0
scoring_data_contact.loc[scoring_data_contact['logReferrals'] < 1, 'logReferrals'] = 0
scoring_data_contact.loc[scoring_data_contact['logTotalGift'] < 1, 'logTotalGift'] = 0
scoring_data_contact.loc[scoring_data_contact['logMaxGift'] < 1, 'logMaxGift'] = 0
scoring_data_contact.loc[scoring_data_contact['logMinGift'] < 1, 'logMinGift'] = 0
scoring_data_contact.loc[scoring_data_contact['logAmtLastYear'] < 1, 'logAmtLastYear'] = 0

X = scoring_data_contact[['logTotalGift', 'logSalary','logAmtLastYear','Age','Referrals','logMinGift','logMaxGift','Woman','Recency','Contact','GaveLastYear','University / College','Elementary','Downtown', 'Rural', 'Suburban']] 
# Probability of giving (class 1) from the LightGBM model
lgb_predict_contact = lgb_model.predict_proba(X)[:,1]
scoring_data_contact['Prediction_prob'] = lgb_predict_contact

scoring_data_contact= scoring_data_contact[['ID','Prediction_prob']]
scoring_data_contact = scoring_data_contact.rename({'Prediction_prob': 'ProbContact'}, axis=1) 
scoring_data_contact.head()



Unnamed: 0,ID,ProbContact
0,2000001.0,0.18694
1,2000002.0,0.589875
2,2000003.0,0.602817
3,2000004.0,0.320543
4,2000005.0,0.604425


### Predict 'probability of giving' for members who were not contacted

In [None]:
scoring_data_nocontact = pd.merge(data3, data5, on=["ID"],how="right")

# Handle missing values in the scoring dataset:
# amount-type columns are filled with 0 and Recency with 100.

scoring_data_nocontact.rename(columns={"City": "Location"}, inplace = True)

import numpy as np
scoring_data_nocontact[['Salary']] = scoring_data_nocontact[['Salary']].fillna(value=0)  
scoring_data_nocontact[['Referrals']] = scoring_data_nocontact[['Referrals']].fillna(value=0)  
scoring_data_nocontact[['TotalGift']] = scoring_data_nocontact[['TotalGift']].fillna(value=0)  
scoring_data_nocontact[['MaxGift']] = scoring_data_nocontact[['MaxGift']].fillna(value=0)  
scoring_data_nocontact[['MinGift']] = scoring_data_nocontact[['MinGift']].fillna(value=0)  
scoring_data_nocontact[['AmtLastYear']] = scoring_data_nocontact[['AmtLastYear']].fillna(value=0)  
scoring_data_nocontact[['Recency']] = scoring_data_nocontact[['Recency']].fillna(value=100)  
scoring_data_nocontact[['HistoricDonor']] = scoring_data_nocontact[['Frequency']].notna().astype(int)

scoring_data_nocontact = dummies(scoring_data_nocontact, "Education")
scoring_data_nocontact = dummies(scoring_data_nocontact, "Location")

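# Note: these transforms add 1 to the log, whereas the training features above used np.log(x) without the offset.
scoring_data_nocontact['logSalary']= np.log(scoring_data_nocontact['Salary'])+1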
scoring_data_nocontact['logReferrals']= np.log(scoring_data_nocontact['Referrals'])+1
scoring_data_nocontact['logTotalGift']= np.log(scoring_data_nocontact['TotalGift'])+1
scoring_data_nocontact['logMaxGift']= np.log(scoring_data_nocontact['MaxGift'])+1
scoring_data_nocontact['logMinGift']= np.log(scoring_data_nocontact['MinGift'])+1
scoring_data_nocontact['logAmtLastYear']= np.log(scoring_data_nocontact['AmtLastYear'])+1

scoring_data_nocontact.head(10)

scoring_data_nocontact.loc[scoring_data_nocontact['logSalary'] < 1, 'logSalary'] = 0
scoring_data_nocontact.loc[scoring_data_nocontact['logReferrals'] < 1, 'logReferrals'] = 0
scoring_data_nocontact.loc[scoring_data_nocontact['logTotalGift'] < 1, 'logTotalGift'] = 0
scoring_data_nocontact.loc[scoring_data_nocontact['logMaxGift'] < 1, 'logMaxGift'] = 0
scoring_data_nocontact.loc[scoring_data_nocontact['logMinGift'] < 1, 'logMinGift'] = 0
scoring_data_nocontact.loc[scoring_data_nocontact['logAmtLastYear'] < 1, 'logAmtLastYear'] = 0

X = scoring_data_nocontact[['logTotalGift', 'logSalary','logAmtLastYear','Age','Referrals','logMinGift','logMaxGift','Woman','Recency','Contact','GaveLastYear','University / College','Elementary','Downtown', 'Rural', 'Suburban']] 
# Probability of giving (class 1) from the LightGBM model
lgb_predict_nocontact = lgb_model.predict_proba(X)[:,1]
scoring_data_nocontact['Prediction_prob'] = lgb_predict_nocontact

scoring_data_nocontact= scoring_data_nocontact[['ID','Prediction_prob']]
scoring_data_nocontact = scoring_data_nocontact.rename({'Prediction_prob': 'ProbNoContact'}, axis=1) 
scoring_data_nocontact.head()



Unnamed: 0,ID,ProbNoContact
0,2000001.0,0.185747
1,2000002.0,0.07418
2,2000003.0,0.051489
3,2000004.0,0.054524
4,2000005.0,0.051818


In [None]:
result_Prob = pd.merge(scoring_data_contact, scoring_data_nocontact, on=["ID"],how="right")
result_Prob.sort_values(by=['ID'], inplace=True)
result_Prob.sample(10)

Unnamed: 0,ID,ProbContact,ProbNoContact
706075,2706076.0,0.346823,0.179104
842390,2842391.0,0.684224,0.121051
796923,2796924.0,0.590735,0.122103
198419,2198420.0,0.299109,0.091016
576339,2576340.0,0.240129,0.289761
983706,2983707.0,0.097268,0.340302
919284,2919285.0,0.209563,0.163385
226208,2226209.0,0.108761,0.119006
345880,2345881.0,0.358111,0.142997
632807,2632808.0,0.723291,0.274086


## Exporting Results to a CSV File

In [None]:
result_Prob.to_csv('Round2_Output_prob.csv', index=False)

In [None]:
import pandas as pd

probabilities = pd.read_csv("Round2_Output_prob.csv")
amounts = pd.read_csv("Round2_Output_amt.csv")

def Calc_Uplift(raw_data):
    return ((raw_data['AmtContact']*raw_data['ProbContact']) - (raw_data['AmtNoContact']*raw_data['ProbNoContact']))

raw_submission = pd.merge(probabilities, amounts, on=["ID"], how="right")
# Calc_Uplift works column-wise, so it can be applied to the whole DataFrame at once
raw_submission["Uplift"] = Calc_Uplift(raw_submission)

# Sorting data by descending Uplift value
raw_submission.sort_values(by=['Uplift'], ascending=False, inplace=True)
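
To make the uplift calculation concrete, here is a small worked sketch on three made-up members. Uplift is the expected donation if contacted (`AmtContact * ProbContact`) minus the expected donation if not contacted (`AmtNoContact * ProbNoContact`). All values are hypothetical, and keeping only members with positive uplift is an assumption made for illustration; the notebook itself keeps a fixed number of top-ranked members.

```python
import pandas as pd

# Made-up probabilities and amounts for three members (illustration only).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "ProbContact":   [0.60, 0.30, 0.20],
    "ProbNoContact": [0.10, 0.25, 0.30],
    "AmtContact":    [50.0, 40.0, 100.0],
    "AmtNoContact":  [50.0, 40.0, 100.0],
})

# Expected gain from contacting each member, computed column-wise.
toy["Uplift"] = (toy["AmtContact"] * toy["ProbContact"]
                 - toy["AmtNoContact"] * toy["ProbNoContact"])

# Rank by uplift, then keep only members for whom contacting is expected to help.
ranked = toy.sort_values("Uplift", ascending=False)
to_contact = ranked[ranked["Uplift"] > 0]

print(ranked[["ID", "Uplift"]])
print(to_contact["ID"].tolist())
```

Member 3 is more likely to give when left alone, so the uplift is negative and contacting that member would be expected to lose money.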


In [None]:
# Export the final csv file
NB = 180000

submission = raw_submission.head(NB)

submission["ID"].astype(int).to_csv('Round2 Output final.csv', index=False)

In [None]:
# Congratulations! You are now done with Round 2. You are ready to prepare your solution to upload it to the leaderboard.

In [None]:
# Reminder: Please note that you need only one column (the list of donors' IDs) to submit to the leaderboard.

In [None]:
!head Round2\ Output\ final.csv

ID
2328928
2170154
2093263
2694007
2395865
2058535
2245814
2393713
2890170
