<style>* {font-family: "SaxMono", "Consolas", monospace}</style>

# KDD Cup 2009

Author: Felipe Camargo de Pauli  
Date: 10/23

<img src="KDD_CUP.png" width="300px" alt="KDD Cup Image">

## 1. Problem Definition and Strategy
- 1.1 Clearly define the problem and objective.  
- 1.2 Understand what the data represents and its characteristics.
- 1.3 Propose an initial solution.

### 1.1 Clearly define the problem and objective.  
Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

### 1.2 Understand what the data represents and its characteristics.
The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable selection, regularization, and model averaging. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contributed most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.
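To make the notion of a score concrete, here is a minimal sketch on synthetic data. It is not the Orange platform and not the model used later in this notebook: the data is randomly generated, `LogisticRegression` is only a simple stand-in, and the AUC metric is assumed here because the notebook imports `roc_auc_score`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in data (not the Orange data); labels coded -1/1
X = rng.normal(size=(500, 3))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=500) > 1.0, 1, -1)

# Stratify so both splits keep roughly the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba corresponds to model.classes_[1], i.e. label 1
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f}")
```

Each entry of `scores` is one customer-level score in the sense described above: a number that ranks instances by how likely they are to belong to the positive class.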

https://www.kdd.org/kdd-cup/view/kdd-cup-2009


### 1.3 Propose an initial solution.
Run the full data science workflow and write a final report with the results.
#### Deliverables
- Notebook with the data analysis
- Report with the results

## 2. Gather the Data
### 2.1 Seek out the data (datasets)
The data is available on the competition's [page](https://www.kdd.org/kdd-cup/view/kdd-cup-2009). We will use the small version of the dataset (`orange_small`).

### 2.2 Take a first look at the data
- It has 230 features
- `Var1` to `Var190` are numerical features; `Var191` to `Var230` are categorical features
- The data is messy (many values are missing), but the mess is structured
- The columns are separated by tabs
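The numerical/categorical split can be expressed by column name. This is a small sketch, assuming the columns are named `Var1` to `Var230` as in the output shown later in this notebook; the `demo` frame and the helper names are hypothetical and only illustrate the selection.

```python
import pandas as pd

# Assumption: Var1-Var190 are numerical and Var191-Var230 are categorical
N_NUMERICAL = 190
N_FEATURES = 230

numerical_cols = [f"Var{i}" for i in range(1, N_NUMERICAL + 1)]
categorical_cols = [f"Var{i}" for i in range(N_NUMERICAL + 1, N_FEATURES + 1)]

# Tiny stand-in frame with three of the 230 columns
demo = pd.DataFrame(
    {"Var6": [1526.0, 525.0], "Var7": [7, 0], "Var221": ["oslk", "Al6ZaUT"]}
)

# Keep only the names that are actually present in the frame
demo_numerical = demo[[c for c in numerical_cols if c in demo.columns]]
demo_categorical = demo[[c for c in categorical_cols if c in demo.columns]]

print(list(demo_numerical.columns))
print(list(demo_categorical.columns))
```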

### 2.3 Prepare them for import into notebooks
No preparation is needed. The files can be loaded directly for analysis.


## 3. Data Loading, Initial Visualization, and Transformation
### 3.1 Load the data and visualize the first few rows.

In [6]:
# Basic imports
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [7]:
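# Tab-separated features; na_filter=False keeps empty fields as empty strings instead of NaN
train_data  = pd.read_csv('./data/orange_small_train.data', sep = '\t', na_filter=False)
# One churn label per row; the labels file has no header line
target      = pd.read_csv('./data/orange_small_train_churn.labels', header=None)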
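
The choice of `na_filter=False` in the cell above affects how columns are typed. The sketch below illustrates the difference on a hypothetical in-memory sample (not the real file): with the filter disabled, a numeric column containing empty fields is read as text, while the default setting yields floats with `NaN`.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with one empty field in each column
sample = "Var1\tVar2\n1.5\tabc\n\txyz\n2.0\t\n"

# Empty fields stay as empty strings, so Var1 cannot be parsed as numbers
raw = pd.read_csv(io.StringIO(sample), sep="\t", na_filter=False)

# Default behaviour: empty fields become NaN and Var1 is parsed as float
parsed = pd.read_csv(io.StringIO(sample), sep="\t")

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isna().sum())
```

Whether to keep the filter off is a design choice: it avoids treating strings such as `NA` as missing, but missing-value tools like `dropna` and `isna` will not see the empty strings.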

In [8]:
df = train_data.copy()

In [12]:
print(df.shape)
print(target.shape)

(50000, 230)
(50000, 1)


In [10]:
df.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,


In [31]:
df.iloc[:, 180:200].dropna(subset=["Var190"])

Unnamed: 0,Var181,Var182,Var183,Var184,Var185,Var186,Var187,Var188,Var189,Var190,Var191,Var192,Var193,Var194,Var195,Var196,Var197,Var198,Var199,Var200
0,0,,,,,,,,462,,,bZkvyxLkBI,RO12,,taul,1K8T,lK27,ka_ns41,nQUveAzAF7,
1,0,,,,,,,,,,,CEat0G8rTN,RO12,,taul,1K8T,2Ix5,qEdASpP,y2LIM01bE1,
2,0,,,,,,,,,,,eOQt0GoOh3,AERks4l,SEuy,taul,1K8T,ffXs,NldASpP,y4g9XoZ,vynJTq9
3,0,,,,,,,,,,,jg69tYsGvO,RO12,,taul,1K8T,ssAy,_ybO0dd,4hMlgkf58mhwh,
4,0,,,,,,,,,,,IXSgUHShse,RO12,SEuy,taul,1K8T,uNkU,EKR938I,ThrHXVS,0v21jmy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,,,,,,,,,,,xOXr4RXktW,RO12,,taul,1K8T,ZNsX,7nPy3El,h3WsUQk,
49996,0,,,,,,,,396,,,S8dr4RQxul,2Knk1KF,SEuy,I9xt3GBDKUbd8,1K8T,JLbT,kJ1JA2C,7aPrx0x,tkF1jmy
49997,0,,,,,,,,,,,uUdt0G8EIb,2Knk1KF,,taul,1K8T,0Xwj,LK5nVRA,k10MzgT,_VHQRHe
49998,,0,,,,,,,276,,r__I,FoxgUHSK8h,RO12,,taul,1K8T,AHgj,VcW4jEC,LH0kFz12FM,


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 230 entries, Var1 to Var230
dtypes: float64(2), int64(1), object(227)
memory usage: 87.7+ MB


In [13]:
np.unique(target)

array([-1,  1])

Everything looks as expected: we have 50,000 rows, 230 features, and a binary target with labels -1 and 1.
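
Since the challenge description mentions unbalanced class distributions, it may be worth checking the share of each label. The following is a sketch on hypothetical labels with the same -1/1 coding; the variable names are illustrative, and treating 1 as the churn class is an assumption.

```python
import pandas as pd

# Hypothetical labels with the same -1/1 coding as the labels file
labels = pd.DataFrame({0: [-1, -1, -1, 1, -1, -1, 1, -1, -1, -1]})

y = labels[0]  # the single label column as a Series

# Fraction of rows per class, ordered by label value
class_share = y.value_counts(normalize=True).sort_index()

# Optional 0/1 recoding (assumption: 1 marks the positive class)
y_binary = (y == 1).astype(int)

print(class_share)
```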

### 3.2 Perform individual transformations.


### 3.3 Identify and handle null values.

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
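
One possible way to identify and handle null values is to compute the share of missing entries per column and drop the columns above a threshold. This is a sketch on a hypothetical miniature frame; the 0.7 threshold and the names `MAX_MISSING`, `missing_share`, and `reduced` are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; NaN/None mark missing values
demo = pd.DataFrame({
    "Var1": [np.nan, np.nan, np.nan, np.nan],
    "Var6": [1526.0, 525.0, 5236.0, np.nan],
    "Var7": [7.0, 0.0, 7.0, 0.0],
    "Var225": [None, None, "kG3k", None],
})

# If the data was loaded with na_filter=False, empty strings stand in for NaN
demo = demo.replace("", np.nan)

# Fraction of missing values per column, most incomplete first
missing_share = demo.isna().mean().sort_values(ascending=False)

MAX_MISSING = 0.7  # arbitrary illustrative threshold
keep_cols = missing_share[missing_share <= MAX_MISSING].index.tolist()
reduced = demo[keep_cols]

print(missing_share)
print(keep_cols)
```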
