## Analytics Vidhya's Knocktober 2016 Machine Learning Competition
### Author: Caio Taniguchi
### Date: 2016-10-21

### 1. Understanding the Problem

#### 1.1. Problem Statement
MedCamp is a non-profit organization dedicated to improving the health of working professionals. To that end, MedCamp organizes events that offer free health checks and stalls that hand out health information.

One of the main challenges of these events is stocking the right amount of inventory, because the number of registrations differs widely from the number of people who actually attend.

To better predict inventory, the goal is to predict the probability that a registered person participates in the event. For some events, participation is defined as receiving a health check; for others, it is measured by the number of stalls visited.

#### 1.2. Training / Test Set Split
- Training set: Camps conducted before 31st March 2006
- Test set: Camps conducted after 1st April 2006. Public/Private leaderboard split is 50-50, by date.
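Because the official split is by camp date, a local validation split can follow the same rule. The block below is a minimal sketch on made-up data: the camp dates and the `cutoff` value are assumptions chosen for illustration, not values from the competition.

```python
import pandas as pd

# Toy camp table; the dates are invented for illustration
camps = pd.DataFrame({
    'Health_Camp_ID': [1, 2, 3, 4],
    'Camp_Start_Date': pd.to_datetime(
        ['2005-06-01', '2005-12-15', '2006-02-10', '2006-03-20']),
})

# Hypothetical cutoff: fit on earlier camps, validate on later ones
cutoff = pd.Timestamp('2006-01-01')
fit_part = camps[camps['Camp_Start_Date'] < cutoff]
valid_part = camps[camps['Camp_Start_Date'] >= cutoff]
```

Splitting on the camp date rather than at random keeps the validation camps strictly later than the fitting camps, which mirrors how the test set relates to the training set.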

#### 1.3. Evaluation Metric
- AUC ROC
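AUC can be read as the probability that a randomly chosen attendee receives a higher score than a randomly chosen non-attendee. The function below is a small pure-Python sketch of that rank-based definition; `roc_auc` is a name introduced here for illustration, and in practice a library routine would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """Rank-based AUC for binary labels (0/1); tied scores share an average rank."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal scores starting at position i
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    # Mann-Whitney U statistic normalised by the number of positive/negative pairs
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 0.75
```

Only the ordering of the scores matters, so the predicted probabilities do not need to be calibrated to score well on this metric.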
 
#### 1.4. Other Things to Note
- Some data is missing due to hardware failure
- There are 3 types of camps: 2 perform health checks, while the third consists of awareness stalls

#### 1.5. Candidate Features
- Profile data
- Overall attendance ratio
- Attendance ratio by type of camp
- Attendance ratio by camp
- Attendance ratio by camp date (year/quarter/month/week/day/weekday)
- Attendance ratio by camp trend using lag features
- Attendance ratio by registration date (year/quarter/month/week/day/weekday)
- Attendance ratio by registration trend using lag features
- Attendance ratio by participant profile
- Attendance ratio by participant
- Number of registrations by camp
- Number of registrations for a participant
- Hardware failure (missing data) ratio by event
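Most of the ratio and count features listed above reduce to a group-by aggregation over a binary attendance flag. The sketch below uses toy data: the column names mirror the competition files, the values are made up, and the `Outcome` flag and the leave-one-out variant are assumptions added here to illustrate one way of limiting target leakage.

```python
import pandas as pd

# Toy registration history; 1 means the patient attended
history = pd.DataFrame({
    'Patient_ID':     [1, 1, 2, 2, 3],
    'Health_Camp_ID': [10, 11, 10, 11, 10],
    'Outcome':        [1, 0, 1, 1, 0],
})

# Attendance ratio and number of registrations by camp
camp_stats = (history.groupby('Health_Camp_ID')['Outcome']
              .agg(['mean', 'count'])
              .rename(columns={'mean': 'camp_attendance_ratio',
                               'count': 'camp_registrations'})
              .reset_index())

# Leave-one-out attendance ratio by participant: each row's own outcome
# is excluded, so the feature never contains the label it is meant to predict
grouped = history.groupby('Patient_ID')['Outcome']
total = grouped.transform('sum')
count = grouped.transform('count')
history['patient_attendance_ratio_loo'] = (total - history['Outcome']) / (count - 1)
```

Patients with a single registration get a missing value for the leave-one-out ratio, because no other rows are available to estimate it.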

### 2. Preliminary Preprocessing

#### 2.1. Importing and Merging Datasets

In [1]:
# Import the datasets and merge them to perform the first preprocessing steps
import pandas as pd
import numpy as np

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

first_health_camp_attended = pd.read_csv('data/first_health_camp_attended.csv')
first_health_camp_attended.drop('Unnamed: 4', axis=1, inplace=True)
second_health_camp_attended = pd.read_csv('data/second_health_camp_attended.csv')
third_health_camp_attended = pd.read_csv('data/third_health_camp_attended.csv')
health_camp_detail = pd.read_csv('data/health_camp_detail.csv')
patient_profile = pd.read_csv('data/patient_profile.csv')

train['is_test'] = np.zeros(train.shape[0])
test['is_test'] = np.ones(test.shape[0])
df = pd.concat([train, test])

df = pd.merge(df, first_health_camp_attended, on=['Patient_ID', 'Health_Camp_ID'], how='left')
df = pd.merge(df, second_health_camp_attended, on=['Patient_ID', 'Health_Camp_ID'], how='left')
df = pd.merge(df, third_health_camp_attended, on=['Patient_ID', 'Health_Camp_ID'], how='left')
df = pd.merge(df, health_camp_detail, on='Health_Camp_ID', how='left')
df = pd.merge(df, patient_profile, on='Patient_ID', how='left')
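A left merge silently duplicates rows when the right-hand table contains repeated keys. The sketch below, on toy data, shows one way to guard against that with the `validate` argument of `pd.merge` and a row-count check. The frames `registrations`, `camp_detail`, and `bad_detail` are hypothetical stand-ins, not the competition files.

```python
import pandas as pd

registrations = pd.DataFrame({'Patient_ID': [1, 2, 3],
                              'Health_Camp_ID': [10, 10, 11]})
camp_detail = pd.DataFrame({'Health_Camp_ID': [10, 11],
                            'Category1': ['First', 'Second']})

# validate='many_to_one' requires the keys of the right table to be unique
n_rows = len(registrations)
merged = pd.merge(registrations, camp_detail, on='Health_Camp_ID',
                  how='left', validate='many_to_one')
assert len(merged) == n_rows  # a left merge on unique keys keeps the row count

# A lookup table with a repeated key is rejected instead of duplicating rows
bad_detail = pd.concat([camp_detail, camp_detail.iloc[[0]]], ignore_index=True)
try:
    pd.merge(registrations, bad_detail, on='Health_Camp_ID',
             how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```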

In [2]:
# List every column of the merged frame together with its dtype
columns = df.dtypes.astype(str).rename_axis('feature').reset_index(name='type')
columns

| | feature | type |
|---|---|---|
| 0 | Patient_ID | int64 |
| 1 | Health_Camp_ID | int64 |
| 2 | Registration_Date | object |
| 3 | Var1 | int64 |
| 4 | Var2 | int64 |
| 5 | Var3 | int64 |
| 6 | Var4 | int64 |
| 7 | Var5 | int64 |
| 8 | is_test | float64 |
| 9 | Donation | float64 |


In [3]:
# Convert date feature from object to datetime
for column in ['Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction']:
    df[column] = pd.to_datetime(df[column], format="%d-%b-%y")
    
# Convert CategoryX features to int and drop Category3 due to too low variance
df['Category1'] = pd.Categorical(df['Category1']).codes
df['Category2'] = pd.Categorical(df['Category2']).codes
df.drop('Category3', axis=1, inplace=True)

# Replace 'None' fields with NA
# Convert 'Income', 'Education_Score', 'Age' to float
to_replace_none = ['Income', 'Education_Score', 'Age']

for column in to_replace_none:
    df.loc[df[column] == 'None', column] = np.nan

df[to_replace_none] = df[to_replace_none].astype(float)

# Convert City_Type and Employer_Category to integer codes, one column at a time
# (pd.Categorical expects 1-D input; missing values are encoded as -1)
for column in ['City_Type', 'Employer_Category']:
    df[column] = pd.Categorical(df[column]).codes
    
# Generate the Outcome feature
outcomes = df.loc[:, ['Health_Score', 'Health Score', 'Number_of_stall_visited']]
df['Outcome'] = (outcomes.notnull().sum(axis=1) > 0).astype(int)

# Separate the datasets
train = df[df['is_test'] == 0].copy()
test = df[df['is_test'] == 1].copy()

test.drop(['Donation', 'Health_Score', 'Health Score', 'Number_of_stall_visited', 'Last_Stall_Visited_Number',
          'is_test', 'Outcome'], axis=1, inplace=True)
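The cleaning steps in the cell above have two side effects worth checking: the string `'None'` becomes a true missing value, and `pd.Categorical(...).codes` encodes missing categories as `-1`. The block below is a small sketch on a made-up `profile` frame, with values invented for illustration, that shows both.

```python
import numpy as np
import pandas as pd

# Toy profile data; 'None' is a string placeholder, np.nan a true missing value
profile = pd.DataFrame({
    'Income':    ['1', 'None', '3'],
    'City_Type': ['A', np.nan, 'B'],
})

# Replace the placeholder with NaN, then convert the column to float
profile.loc[profile['Income'] == 'None', 'Income'] = np.nan
profile['Income'] = profile['Income'].astype(float)

# Integer codes follow the sorted categories; missing values become -1
profile['City_Type_code'] = pd.Categorical(profile['City_Type']).codes
# City_Type_code is now [0, -1, 1]
```

Whether `-1` is an acceptable encoding for missing categories depends on the model: tree-based models can usually split on it, while linear models would treat it as an ordinary numeric value.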


In [11]:
train['Camp_Start_Date'].describe()

count                   35249
unique                     18
top       2006-11-09 00:00:00
freq                     4214
first     2006-04-02 00:00:00
last      2007-01-30 00:00:00
Name: Camp_Start_Date, dtype: object