# Extract-Transform-Load and Exploratory Data Analysis
This notebook contains all code for the preliminary analysis of the KDD Cup 98 datasets.

## Loading the tidy datasets
The class `TidyDataset` bundles all transformation steps needed to create a valid dataset for later use. It transforms:
- boolean fields that are coded in various ways in the original dataset -> 0/1
- date strings in yymm format -> separate year and month columns, named with the original field name as prefix
- multi-value (bytewise) categories -> individual categorical fields
- ZIP codes: trailing dashes are stripped and the field is converted to a category
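
The four transformations listed above can be sketched on a toy frame. This is not the code inside `TidyDataset`, only an illustration under stated assumptions: the sample values, the set `true_codes`, and the choice to take the first three bytes of `MDMAUD` are invented for the example.

```python
import pandas as pd

# Toy stand-in for a few raw fields; all values are invented for illustration.
raw = pd.DataFrame({
    "NOEXCH": ["0", "1", "X", " "],                # boolean coded in several ways
    "ODATEDW": ["8901", "9401", "9001", "8701"],   # yymm date strings
    "MDMAUD": ["XXXX", "C1CM", "L2LM", "XXXX"],    # one byte per sub-field
    "ZIP": ["61081", "91326-", "27017", "95953-"],
})

tidy = pd.DataFrame(index=raw.index)

# 1. Map every "true" coding to 1 and everything else to 0.
true_codes = {"1", "X", "Y"}  # assumed set of truthy codes
tidy["NOEXCH"] = raw["NOEXCH"].isin(true_codes).astype(int)

# 2. Split yymm into two columns prefixed with the original field name.
tidy["ODATEDW_year"] = raw["ODATEDW"].str[:2]
tidy["ODATEDW_month"] = raw["ODATEDW"].str[2:]

# 3. Spread a bytewise field into one categorical column per byte.
for pos, suffix in enumerate(["R", "F", "A"]):
    tidy[f"MDMAUD_{suffix}"] = raw["MDMAUD"].str[pos].astype("category")

# 4. Strip the trailing dash from ZIP codes and treat them as categories.
tidy["ZIP"] = raw["ZIP"].str.rstrip("-").astype("category")

print(tidy)
```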

In [None]:
%load_ext autoreload
%autoreload 2
import eda.tidy_dataset as tds

In [1]:
lrn = tds.TidyDataset("cup98LRN.txt")  # pass pull_stored=False to force reprocessing
learning = lrn.get_tidy_data()

In [None]:
val = tds.TidyDataset("cup98VAL.txt")
validation = val.get_tidy_data()

## A first look at the learning datasets

Check that the learning and validation sets are really disjoint:

In [None]:
set(learning.index.values) & set(validation.index.values)

Basic information on the dimensions and data types of the learning set. The category and object features need further treatment before modelling.

In [2]:
learning.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95412 entries, 95515 to 185114
Columns: 554 entries, OSOURCE to RFA_24_Amount
dtypes: bool(26), category(92), float64(74), int64(305), object(57)
memory usage: 329.7+ MB


Let's have a look at the object columns:

In [3]:
object_cols = learning.select_dtypes(include="object")

In [4]:
object_cols.describe()

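| | OSOURCE | RFA_2R | RFA_2A | MDMAUD_R | MDMAUD_F | MDMAUD_A | GEOCODE2 | ODATEDW_year | ODATEDW_month | DOB_year | ... | ADATE_20_year | ADATE_20_month | ADATE_21_year | ADATE_21_month | ADATE_22_year | ADATE_22_month | ADATE_23_year | ADATE_23_month | ADATE_24_year | ADATE_24_month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95280 | 95412 | 95412 | 68912 | ... | 45212 | 45212 | 60200 | 60200 | 69764 | 69764 | 39142 | 39142 | 58439 | 58439 |
| unique | 896 | 1 | 4 | 5 | 4 | 5 | 5 | 15 | 12 | 86 | ... | 1 | 2 | 1 | 2 | 2 | 5 | 2 | 3 | 1 | 2 |
| top | MBC | L | F | X | X | X | A | 95 | 1 | 48 | ... | 94 | 11 | 94 | 10 | 94 | 9 | 94 | 7 | 94 | 6 |
| freq | 4539 | 95412 | 46964 | 95118 | 95118 | 95118 | 34484 | 15369 | 95351 | 1912 | ... | 45212 | 45198 | 60200 | 55865 | 68963 | 63908 | 39120 | 38877 | 58439 | 58161 |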


The boolean columns:

In [6]:
bool_cols = learning.select_dtypes(include="bool")

In [7]:
bool_cols.describe()

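| | NOEXCH | RECINHSE | RECP3 | RECPGVG | RECSWEEP | HOMEOWNR | MAJOR | COLLECT1 | VETERANS | BIBLE | ... | PHOTO | CRAFTS | FISHER | GARDENIN | BOATS | WALKER | KIDSTUFF | CARDS | PLATES | PEPSTRFL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | ... | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 |
| unique | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| top | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| freq | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | ... | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 95412 | 50143 |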


All displayed boolean columns except PEPSTRFL contain a single value (False). Is this a processing problem? Check the raw data:

In [None]:
raw = lrn.get_raw_data()

In [None]:
raw.info()

In [None]:
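# Boolean columns that hold a single value only (assumed meaning of empty_cols)
empty_cols = [col for col in bool_cols.columns if bool_cols[col].nunique() == 1]
raw[empty_cols]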

In [None]:
set(raw["RECPGVG"])

In [None]:
set(learning["NOEXCH"])

In [None]:
# HOMEOWNR is listed among the boolean columns above, not the object columns
set(bool_cols.HOMEOWNR)

The title code (`TCODE`) is a categorical field. It may be helpful to translate the codes into the actual titles. However, as shown below, several codes present in the data are missing from the dictionary. These could only be set to empty, which would lose information, so the numerical levels are kept.

In [None]:
learning.info()

In [None]:
tcode_categories = {
        0: "_",
        1: "MR.",
        1001: "MESSRS.",
        1002: "MR. & MRS.",
        2: "MRS.",
        2002: "MESDAMES",
        3: "MISS",
        3003: "MISSES",
        4: "DR.",
        4002: "DR. & MRS.",
        4004: "DOCTORS",
        5: "MADAME",
        6: "SERGEANT",
        9: "RABBI",
        10: "PROFESSOR",
        10002: "PROFESSOR & MRS.",
        10010: "PROFESSORS",
        11: "ADMIRAL",
        11002: "ADMIRAL & MRS.",
        12: "GENERAL",
        12002: "GENERAL & MRS.",
        13: "COLONEL",
        13002: "COLONEL & MRS.",
        14: "CAPTAIN",
        14002: "CAPTAIN & MRS.",
        15: "COMMANDER",
        15002: "COMMANDER & MRS.",
        16: "DEAN",
        17: "JUDGE",
        17002: "JUDGE & MRS.",
        18: "MAJOR",
        18002: "MAJOR & MRS.",
        19: "SENATOR",
        20: "GOVERNOR",
        21002: "SERGEANT & MRS.",
        22002: "COLNEL & MRS.",
        24: "LIEUTENANT",
        26: "MONSIGNOR",
        27: "REVEREND",
        28: "MS.",
        28028: "MSS.",
        29: "BISHOP",
        31: "AMBASSADOR",
        31002: "AMBASSADOR & MRS.",
        33: "CANTOR",
        36: "BROTHER",
        37: "SIR",
        38: "COMMODORE",
        40: "FATHER",
        42: "SISTER",
        43: "PRESIDENT",
        44: "MASTER",
        46: "MOTHER",
        47: "CHAPLAIN",
        48: "CORPORAL",
        50: "ELDER",
        56: "MAYOR",
        59002: "LIEUTENANT & MRS.",
        62: "LORD",
        63: "CARDINAL",
        64: "FRIEND",
        65: "FRIENDS",
        68: "ARCHDEACON",
        69: "CANON",
        70: "BISHOP",
        72002: "REVEREND & MRS.",
        73: "PASTOR",
        75: "ARCHBISHOP",
        85: "SPECIALIST",
        87: "PRIVATE",
        89: "SEAMAN",
        90: "AIRMAN",
        91: "JUSTICE",
        92: "MR. JUSTICE",
        100: "M.",
        103: "MLLE.",
        104: "CHANCELLOR",
        106: "REPRESENTATIVE",
        107: "SECRETARY",
        108: "LT. GOVERNOR",
        109: "LIC.",
        111: "SA.",
        114: "DA.",
        116: "SR.",
        117: "SRA.",
        118: "SRTA.",
        120: "YOUR MAJESTY",
        122: "HIS HIGHNESS",
        123: "HER HIGHNESS",
        124: "COUNT",
        125: "LADY",
        126: "PRINCE",
        127: "PRINCESS",
        128: "CHIEF",
        129: "BARON",
        130: "SHEIK",
        131: "PRINCE AND PRINCESS",
        132: "YOUR IMPERIAL MAJEST",
        135: "M. ET MME.",
        210: "PROF."}

In [None]:
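# Use string keys and values so the mapping has a single, consistent type
new_cats = {str(k): str(v) for k, v in tcode_categories.items()}


def set_new_tcode(old):
    """Return the title for a code; unknown codes fall back to the empty title '_'."""
    return new_cats.get(old, new_cats["0"])


# With a dict, rename_categories leaves categories without a mapping unchanged
temp = learning.TCODE.cat.rename_categories(new_categories=new_cats)
temp.cat.categories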
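
One way to check which codes lack a dictionary entry is a set difference between the categories in the data and the dictionary keys. The sketch below uses a short excerpt of the dictionary and a made-up `TCODE` column; the code 980 is invented to stand for an unmapped value and is not taken from the dataset.

```python
import pandas as pd

# Short excerpt of the title dictionary; the notebook's full mapping is tcode_categories.
title_dict = {0: "_", 1: "MR.", 2: "MRS.", 28: "MS."}

# Hypothetical TCODE column; 980 is an invented code without a dictionary entry.
tcode = pd.Series([1, 2, 28, 980, 0, 1], dtype="category")

# Codes that occur in the data but are missing from the dictionary.
unmapped = sorted(int(c) for c in tcode.cat.categories if c not in title_dict)

# Share of rows that would lose their title if unmapped codes were blanked.
affected = tcode.isin(unmapped).mean()

print(unmapped, round(affected, 3))
```

If the share of affected rows turned out to be negligible, blanking the unmapped codes might be acceptable; otherwise keeping the numerical levels, as done here, avoids the loss.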

# Feature Extraction
All explanatory fields have to be numerical for the subsequent operations with scikit-learn. Here, the necessary feature extractions are performed.

See [scikit-learn: feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)

In [None]:
learning.dtypes
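
As a rough illustration of what "all fields numerical" could mean in practice, the sketch below converts booleans to 0/1 and expands nominal columns into indicator columns with `pandas.get_dummies`. The toy values are invented, and one-hot encoding is only one possible choice; it is not necessarily the extraction this project ends up using.

```python
import pandas as pd

# Toy frame with mixed dtypes; column names echo the dataset, values are invented.
df = pd.DataFrame({
    "RFA_2A": pd.Categorical(["F", "G", "E", "F"]),
    "GEOCODE2": ["A", "B", None, "A"],      # nominal column with a missing value
    "PEPSTRFL": [True, False, False, True],
    "AGE": [60.0, 46.0, None, 70.0],
})

# Booleans become 0/1 integers.
bool_names = df.select_dtypes(include="bool").columns
df = df.astype({name: int for name in bool_names})

# Everything that is still non-numeric is treated as nominal.
nominal = df.select_dtypes(exclude="number").columns

# One indicator column per level; dummy_na=True keeps missing values as their own level.
encoded = pd.get_dummies(df, columns=list(nominal), dummy_na=True, dtype=int)

# All columns are numeric now; the NaN in AGE would still need imputation.
print(encoded.dtypes)
```

With 92 category and 57 object columns in the learning set, full one-hot encoding can inflate the column count considerably, which is one reason to follow it with feature selection.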

# Feature Selection
Feature selection reduces dimensionality by keeping only the features that are 'interesting enough', in order to speed up computation and improve the accuracy of the estimator. Candidate methods:
- By variance threshold
- Recursive Feature Elimination by Cross-Validation
- L1-based feature selection (Logistic Regression, Lasso, SVM)
- Tree-based feature selection

See [scikit-learn: feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection)
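
The four approaches listed above might look as follows on synthetic data. This is a sketch only: the data come from `make_classification`, and the estimators and parameters (`LinearSVC` with `C=0.05`, 50 trees, 3-fold CV) are arbitrary choices for illustration, not tuned settings for the KDD Cup 98 data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV, SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in: 20 features (4 informative) plus one constant column.
X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# 1. Variance threshold: removes the constant column.
X_vt = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2. Recursive feature elimination, number of features chosen by cross-validation.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3)
rfecv.fit(X_vt, y)

# 3. L1-based selection: features with (near-)zero coefficients are dropped.
l1_selector = SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.05))
X_l1 = l1_selector.fit_transform(X_vt, y)

# 4. Tree-based selection: keeps features with above-average importance.
tree_selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0))
X_tree = tree_selector.fit_transform(X_vt, y)

print("all:", X.shape[1], "variance:", X_vt.shape[1],
      "RFECV:", rfecv.n_features_, "L1:", X_l1.shape[1],
      "trees:", X_tree.shape[1])
```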


# PCA

A first look at important features

In [None]:
import pandas as pd
from sklearn import decomposition

In [None]:
X = learning.drop(["TARGET_B","TARGET_D"],axis=1)

In [None]:
n_comp = 3
pca = decomposition.PCA(n_components = n_comp)
pca.fit(X)
result = pd.DataFrame(pca.transform(X), columns=["PCA%i" % i for i in range(n_comp)], index=X.index)
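
`PCA` expects a purely numeric matrix without missing values, so the non-numeric columns have to be encoded or dropped first. The following sketch, on invented toy data, restricts the input to numeric columns, standardises them, and inspects the explained variance and loadings to see which features drive each component. The scaling step is an assumption about what would be sensible here, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy frame: two strongly correlated columns, one independent, one non-numeric.
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "label": ["x", "y"] * 100,
})

# Keep numeric columns only and drop rows with missing values.
X_num = df.select_dtypes(include="number").dropna()

# Standardise so that no column dominates through its scale alone.
X_scaled = StandardScaler().fit_transform(X_num)

n_comp = 2
pca = PCA(n_components=n_comp)
scores = pd.DataFrame(pca.fit_transform(X_scaled),
                      columns=["PCA%i" % i for i in range(n_comp)],
                      index=X_num.index)

# Loadings: weight of each original column in each component.
loadings = pd.DataFrame(pca.components_.T, index=X_num.columns,
                        columns=scores.columns)

print(pca.explained_variance_ratio_)
print(loadings)
```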

In [None]:
import cProfile
domain_spreader = tds.SymbolicFieldToDummies(learning,"RFA_24",["Recency", "Frequency", "Amount"])
cProfile.run('domain_spreader.spread()', sort='time')

In [None]:
learning.head()

In [None]:
import os
import sys
import numpy as np

# Add the project root (parent of the current working directory) to the import path
proj_dir = os.path.split(os.getcwd())[0]
if proj_dir not in sys.path:
    sys.path.append(proj_dir)

In [None]:
import eda.tidy_dataset as tds
tidy = tds.TidyDataset("cup98LRN.txt")

In [None]:
raw = tidy.get_raw_data()

In [None]:
spreader = tds.SymbolicFieldToDummies(
    raw, "RFA_24", ["Recency", "Frequency", "Amount"])
spreader.spread()
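
What `spread()` does internally is not shown in this notebook. Judging from the column name `RFA_24_Amount` in the `learning.info()` output, a plausible reading is that each character of the code becomes its own column named `<field>_<part>`. The sketch below illustrates that idea with invented codes; it is an assumption about the behaviour, not the implementation of `SymbolicFieldToDummies`.

```python
import pandas as pd

# Invented RFA-style codes: one character each for recency, frequency, amount.
raw = pd.DataFrame({"RFA_24": ["S4E", "A2G", None, "S4E"]})
parts = ["Recency", "Frequency", "Amount"]

# One new categorical column per character position, named <field>_<part>.
for pos, part in enumerate(parts):
    raw[f"RFA_24_{part}"] = raw["RFA_24"].str[pos].astype("category")

# Optional second step: expand the categorical columns into 0/1 indicators.
dummies = pd.get_dummies(raw[[f"RFA_24_{p}" for p in parts]], dtype=int)

print(raw)
print(dummies.columns.tolist())
```

The vectorised `.str` accessor avoids a Python-level loop over rows, which may matter if the profiling above shows that the row-wise work dominates the runtime.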