# Descriptive Data Analysis
  # Variable Identification
   * Identify ID, Input and Target features
   * Identify categorical and numerical features
   * Identify columns with missing values
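
The three identification steps above can be sketched with pandas as follows. This is only an illustration on a made-up frame: the column names (`cust_id`, `age`, `gender`, `income`, `defaulted`) are hypothetical, and it assumes the dtype is a reasonable first guess at the variable type. Integer-coded categories would still have to be listed by hand.

```python
import numpy as np
import pandas as pd

# Toy data; every column name here is hypothetical.
df = pd.DataFrame({
    "cust_id": [101, 102, 103, 104],
    "age": [24.0, np.nan, 34.0, 41.0],
    "gender": ["F", "M", None, "F"],
    "income": [20000.0, 120000.0, 90000.0, 50000.0],
    "defaulted": [1, 0, 0, 1],
})

id_cols = ["cust_id"]          # identifier column
target_cols = ["defaulted"]    # what we want to predict
input_cols = [c for c in df.columns if c not in id_cols + target_cols]

# First guess at the variable type from the dtype of each input column.
num_cols = df[input_cols].select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in input_cols if c not in num_cols]

# Columns that contain at least one missing value.
missing_cols = df.columns[df.isnull().any()].tolist()

print(num_cols, cat_cols, missing_cols)
```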
        
# Univariate Analysis (explore variables one by one)
   * **Continuous Variables**: For continuous variables, we need to understand the central tendency, spread and shape of the distribution.
       * Central Tendency: mean, median, mode
       * Measure of Dispersion: min, max, range, quartiles, IQR, variance, standard deviation
       * Shape: skewness, kurtosis
       * Visualization: Box Plot, Histogram
   * **Categorical Variables**: frequency table (count and percentage of observations in each category) and bar chart
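
A minimal sketch of these univariate summaries with pandas, on a made-up sample (the `age` and `education` columns are hypothetical). Plots are left out so that the block needs nothing beyond pandas.

```python
import pandas as pd

# Made-up sample; "age" and "education" are hypothetical columns.
df = pd.DataFrame({
    "age": [24, 26, 34, 37, 57, 29, 23, 28],
    "education": ["grad", "univ", "univ", "univ", "school", "grad", "univ", "grad"],
})

age = df["age"]
q1, q3 = age.quantile(0.25), age.quantile(0.75)
continuous_summary = pd.Series({
    "mean": age.mean(),
    "median": age.median(),
    "min": age.min(),
    "max": age.max(),
    "range": age.max() - age.min(),
    "IQR": q3 - q1,
    "var": age.var(),        # sample variance (ddof=1)
    "std": age.std(),        # sample standard deviation (ddof=1)
    "skewness": age.skew(),
    "kurtosis": age.kurt(),  # excess kurtosis (0 for a normal distribution)
})
print(continuous_summary)

# Frequency table for a categorical variable: count and % of rows per category.
counts = df["education"].value_counts()
freq_table = pd.DataFrame({"count": counts, "pct": 100 * counts / len(df)})
print(freq_table)
```

If matplotlib is installed, `age.plot.box()`, `age.plot.hist()` and `counts.plot.bar()` would draw the corresponding charts.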
       
       
# Bi-variate Analysis (relationship between two variables)
   * **Continuous & Continuous**:
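    * **Scatter plot**: shows the form of the relationship between the two variables (linear, non-linear or none)
    * **Correlation check**: the correlation coefficient ranges from -1 to +1 and measures the strength and direction of the linear relationship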
   * **Categorical & Categorical**
    * **Two-way table**: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. Each cell shows the count or count% of observations in that combination of row and column categories.
    * **Stacked Column Chart**: A visual form of the two-way table.
    * **Chi-Square Test**: Used to test the statistical significance of the relationship between the variables. It is based on the difference between the expected and observed frequencies in the cells of the two-way table. The hypotheses are **H0: the variables are independent vs. H1: the variables are dependent**. A p-value below 0.05 indicates that the relationship is significant at the 95% confidence level. **Cramér's V for nominal categorical variables and the Mantel-Haenszel Chi-Square for ordinal categorical variables** measure the strength of the association.
   * **Categorical & Continuous**: To explore the relationship between a categorical and a continuous variable, we can draw **box plots for each level of the categorical variable**. Box plots show how the levels differ but not whether the difference is statistically significant, especially when there are few observations per level. To check statistical significance we can perform a **Z-test, T-test or ANOVA**.
    * **Z-Test / T-Test**: Either test assesses whether the means of two groups are statistically different from each other. The smaller the p-value, the more significant the difference between the two means. The T-test is very similar to the Z-test, but it is used when the number of observations in each group is less than 30.
    * **ANOVA**: It assesses whether the means of more than two groups are statistically different.
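
The bi-variate checks above can be sketched with pandas and NumPy alone. The data and column names below are made up, the chi-square statistic is computed without a continuity correction, and the two-sample statistic is the unequal-variance (Welch) form of the t statistic. The block stops at the test statistics: turning them into p-values needs a distribution function, for example from `scipy.stats` (`chi2_contingency`, `ttest_ind`, `f_oneway`).

```python
import numpy as np
import pandas as pd

# Made-up data; column names are hypothetical.
df = pd.DataFrame({
    "gender":    ["F", "F", "F", "F", "M", "M", "M", "M", "F", "M"],
    "defaulted": ["no", "no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes"],
    "limit":     [50, 60, 55, 20, 15, 25, 70, 10, 65, 30],
    "bill":      [20, 26, 22, 15, 12, 18, 30, 9, 27, 19],
})

# Continuous & continuous: Pearson correlation coefficient.
r = df["limit"].corr(df["bill"])

# Categorical & categorical: two-way table of counts and of row percentages.
observed = pd.crosstab(df["gender"], df["defaulted"])
row_pct = pd.crosstab(df["gender"], df["defaulted"], normalize="index") * 100

# Chi-square statistic: sum of (observed - expected)^2 / expected, where the
# expected counts are what independence of the two variables would give.
obs = observed.to_numpy().astype(float)
n = obs.sum()
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

# Cramér's V rescales chi-square to [0, 1] as a measure of association strength.
cramers_v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))

# Categorical & continuous: compare the group means with a two-sample t statistic.
groups = df.groupby("defaulted")["limit"]
means, variances, sizes = groups.mean(), groups.var(), groups.size()
t_stat = (means["no"] - means["yes"]) / np.sqrt(
    variances["no"] / sizes["no"] + variances["yes"] / sizes["yes"]
)

print(observed)
print(round(chi2, 3), dof, round(cramers_v, 3), round(t_stat, 3))
```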
           
          
# Missing values treatment
   * Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model, because the behavior of a variable and its relationship with other variables are not analysed correctly. This can lead to wrong predictions or classifications.
   * **Deletion**: It is of two types, List Wise Deletion and Pair Wise Deletion
       * **List Wise Deletion**: Delete observations where any of the variables is missing. It is very simple but can reduce the data size significantly.
       * **Pair Wise Deletion**: Perform each analysis with all cases in which the variables of interest are present. It keeps more data for analysis, but each analysis can be based on a different number of observations.
   * **Mean/ Mode/ Median Imputation**: The most commonly used technique. We replace the missing values of a variable with its mean, median or mode. It can be done in two ways.
       * **Generalised Imputation**: Say age is missing for some people in our data. We compute the mean or median of the non-missing ages and use it to replace every missing age.
       * **Similar Case Imputation**: We compute the statistic separately for males and females, then replace missing ages of males with the male statistic and missing ages of females with the female statistic. Any other grouping logic can be used in the same way.
   * **Prediction Model**: Build a model using the variable with missing values as the target and the remaining variables as explanatory. It has the following drawbacks: 1) the model-estimated values are usually more well-behaved than the true values, and 2) if the attribute with missing values has no relationship with the other attributes in the data set, the model will not estimate the missing values precisely.
   * **KNN Imputation**: The missing values of an observation are imputed using the k observations that are most similar to it on the other attributes. The similarity of two observations is determined using a distance function. The method has the following advantages and disadvantages.
       * **Advantages**:
           * k-nearest neighbour can predict both qualitative & quantitative attributes
           * Creation of predictive model for each attribute with missing data is not required
           * Attributes with multiple missing values can be easily treated
           * Correlation structure of the data is taken into consideration
                        
       * **Disadvantages**:
           * The KNN algorithm is very time-consuming on large data sets, because it searches through the whole data set looking for the most similar instances
           * The choice of k is critical. A higher k pulls in neighbours that are significantly different from the observation being imputed, whereas a lower k can miss relevant neighbours.
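
A minimal sketch of list-wise deletion, generalised imputation and similar case imputation with pandas. The `gender` and `age` columns and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data; "gender" and "age" are hypothetical columns.
df = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "age":    [30.0, np.nan, 40.0, 20.0, 24.0, np.nan],
})

# List-wise deletion: drop every row that has any missing value.
listwise = df.dropna()

# Generalised imputation: one statistic computed from all observed values.
generalised = df["age"].fillna(df["age"].mean())

# Similar case imputation: the statistic is computed within each gender.
similar_case = df["age"].fillna(df.groupby("gender")["age"].transform("mean"))

print(len(listwise), generalised.tolist(), similar_case.tolist())
```

For KNN imputation, recent scikit-learn versions provide `sklearn.impute.KNNImputer`; it is not used here so that the sketch depends only on pandas and NumPy.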
                    
# Outlier treatment
   * An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample. It can result in wildly wrong estimates. Outliers can be of two types: 1) univariate and 2) multivariate
       * **Detection using Box plot and Scatter Plot**
       * **Treatment**: 1) Delete 2) Transform 3) Binning 4) Impute like missing values 5) Treat separately (build multiple models)
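
A box plot conventionally flags points that lie more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule to a made-up series and shows three of the treatments listed above; the 1.5 multiplier is the usual convention and can be changed.

```python
import numpy as np
import pandas as pd

# Made-up sample with one obviously extreme value.
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype=float, name="bill")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box-plot whisker limits

is_outlier = (s < lower) | (s > upper)

deleted = s[~is_outlier]                                # 1) delete
transformed = np.log(s)                                 # 2) transform (log pulls in large values)
imputed = s.mask(is_outlier, s[~is_outlier].median())   # 4) impute like a missing value

print(s[is_outlier].tolist(), len(deleted), imputed.iloc[-1])
```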
       
       
# Variable transformation: making existing data more useful
   * **Scaling**: Converting each variable to a similar scale so that variables are comparable. It does not change the shape of the variable's distribution.
   * **Transform complex non-linear relationships into linear relationships**: A scatter plot shows the shape of the relationship and suggests a suitable transformation. A linear relationship is easier to model than a non-linear one.
   * **Symmetric distribution is preferred over skewed distribution**: It is easier to interpret and to draw inferences from. So, whenever we have a skewed distribution, we can use transformations that reduce skewness. For a right-skewed distribution, we take the square root, cube root or logarithm of the variable, and for a left-skewed distribution, we take the square, cube or exponential of the variable.
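
A small check of the two claims above on a made-up right-skewed sample: scaling leaves the skewness unchanged, while the square root and logarithm reduce it.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample (a few very large values).
x = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

# Scaling: standardisation and min-max scaling keep the shape (skewness) unchanged.
standardised = (x - x.mean()) / x.std()
min_max = (x - x.min()) / (x.max() - x.min())

# Skewness-reducing transforms for a right-skewed variable.
sqrt_x = np.sqrt(x)    # requires non-negative values
log_x = np.log(x)      # requires strictly positive values

print(x.skew(), standardised.skew(), sqrt_x.skew(), log_x.skew())
```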
       
       
# Variable creation: generating new variables/features from existing variable(s)
   * **Dummy variables**: convert a categorical variable into a set of 0/1 indicator variables, one per category
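
A minimal sketch of dummy-variable creation with `pandas.get_dummies`, on a made-up frame (the `marriage` and `age` columns are hypothetical).

```python
import pandas as pd

# Made-up data; "marriage" and "age" are hypothetical columns.
df = pd.DataFrame({
    "marriage": ["single", "married", "other", "married"],
    "age": [24, 26, 34, 37],
})

# One 0/1 column per category; the original column is replaced.
dummies = pd.get_dummies(df, columns=["marriage"], dtype=int)
print(dummies.columns.tolist())
```

Passing `drop_first=True` would drop one category per variable, which avoids perfectly collinear columns in linear models.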


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [6]:
fullData= pd.read_csv('Credit_Card_Default.csv', index_col = 0, skiprows = 1)
fullData.head(3)

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0


In [17]:
print(list(fullData.columns))  # show all the column names
# fullData.head(10)  # show the first 10 records of the dataframe
Summary = fullData.describe()  # summary statistics of the numerical fields
# Summary.to_csv('Summary_CCDP.csv', sep=',')  # save the summary locally
Summary.T

['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month']


,count,mean,std,min,25%,50%,75%,max
LIMIT_BAL,30000.0,167484.322667,129747.661567,10000.0,50000.0,140000.0,240000.0,1000000.0
SEX,30000.0,1.603733,0.489129,1.0,1.0,2.0,2.0,2.0
EDUCATION,30000.0,1.853133,0.790349,0.0,1.0,2.0,2.0,6.0
MARRIAGE,30000.0,1.551867,0.52197,0.0,1.0,2.0,2.0,3.0
AGE,30000.0,35.4855,9.217904,21.0,28.0,34.0,41.0,79.0
PAY_0,30000.0,-0.0167,1.123802,-2.0,-1.0,0.0,0.0,8.0
PAY_2,30000.0,-0.133767,1.197186,-2.0,-1.0,0.0,0.0,8.0
PAY_3,30000.0,-0.1662,1.196868,-2.0,-1.0,0.0,0.0,8.0
PAY_4,30000.0,-0.220667,1.169139,-2.0,-1.0,0.0,0.0,8.0
PAY_5,30000.0,-0.2662,1.133187,-2.0,-1.0,0.0,0.0,8.0


In [16]:
list(fullData.columns)[1:11]

['SEX',
 'EDUCATION',
 'MARRIAGE',
 'AGE',
 'PAY_0',
 'PAY_2',
 'PAY_3',
 'PAY_4',
 'PAY_5',
 'PAY_6']

In [19]:
# identifying different types of variables
# ID_col = ['REF_NO']
target_col = ["default payment next month"]
# cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
cat_cols = list(fullData.columns)[1:11]  # columns treated as categorical (SEX ... PAY_6)
# everything that is neither categorical nor the target is numerical
num_cols = list(set(fullData.columns) - set(cat_cols) - set(target_col))
num_cols

['PAY_AMT6',
 'PAY_AMT5',
 'PAY_AMT4',
 'PAY_AMT3',
 'PAY_AMT2',
 'PAY_AMT1',
 'BILL_AMT1',
 'BILL_AMT5',
 'BILL_AMT4',
 'BILL_AMT6',
 'LIMIT_BAL',
 'BILL_AMT3',
 'BILL_AMT2']

**Identifying columns with missing data**

In [20]:
fullData.isnull().any()  # True for every column that has at least one missing value, else False

LIMIT_BAL                     False
SEX                           False
EDUCATION                     False
MARRIAGE                      False
AGE                           False
PAY_0                         False
PAY_2                         False
PAY_3                         False
PAY_4                         False
PAY_5                         False
PAY_6                         False
BILL_AMT1                     False
BILL_AMT2                     False
BILL_AMT3                     False
BILL_AMT4                     False
BILL_AMT5                     False
BILL_AMT6                     False
PAY_AMT1                      False
PAY_AMT2                      False
PAY_AMT3                      False
PAY_AMT4                      False
PAY_AMT5                      False
PAY_AMT6                      False
default payment next month    False
dtype: bool

In [None]:
# Impute numerical missing values with the column mean
# (not required here as there are no missing values)
# fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())

In [None]:
#Impute categorical missing values with -9999
# not required here as no missing values
#fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)

In [24]:
# create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

# The target variable is also categorical, so encode it with its own encoder
target_encoder = LabelEncoder()
fullData["default payment next month"] = target_encoder.fit_transform(fullData["default payment next month"].astype('str'))

#train=fullData[fullData['Type']=='Train']
#test=fullData[fullData['Type']=='Test']

#train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
#Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
print(fullData.head(3))

    LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
ID                                                                         
1       20000    1          2         1    8      7      7      0      0   
2      120000    1          2         2   35      0      7      5      5   
3       90000    1          2         2   40      5      5      5      5   

    PAY_5             ...              BILL_AMT4  BILL_AMT5  BILL_AMT6  \
ID                    ...                                                
1       1             ...                      0          0          0   
2       2             ...                   3272       3455       3261   
3       2             ...                  14331      14948      15549   

    PAY_AMT1  PAY_AMT2  PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  \
ID                                                               
1          0       689         0         0         0         0   
2          0      1000      1000      1000        

In [27]:
features=list(set(list(fullData.columns))-set(target_col))

In [31]:
# use the entire data set as both the training set and the test set
x_train = fullData[list(features)].values
y_train = fullData["default payment next month"].values
#x_validate = Validate[list(features)].values
#y_validate = Validate["Account.Status"].values
#x_test=test[list(features)].values
#x_train

In [32]:
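# random forest model fitting
# Note: random.seed() only seeds Python's random module; scikit-learn does not use it.
# For a reproducible forest, pass random_state to RandomForestClassifier instead.
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)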

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [42]:
from sklearn.metrics import roc_curve, auc
# predicted class probabilities; column 1 is the probability of class 1
status = rf.predict_proba(x_train)
fpr, tpr, _ = roc_curve(y_train, status[:, 1])
roc_auc = auc(fpr, tpr)
# AUC on the training data itself, so it overstates performance on unseen data
print(roc_auc)
# status[0:2]
# final_status = rf.predict_proba(x_train)
# test["Account.Status"] = final_status[:, 1]
pd.DataFrame(status).to_csv('RF_model_output.csv')

0.999993482474
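
This AUC is measured on the same rows the forest was fitted on, so it says little about performance on new data. A hold-out split in the spirit of the commented-out `is_train` lines earlier could look like the sketch below. The frame `data` and its columns are made-up stand-ins for `fullData`, and the 75/25 ratio is only the one suggested by those commented lines.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(100)   # fixed seed so the split is repeatable

# Stand-in for fullData; the columns are made up.
data = pd.DataFrame({
    "x1": rng.normal(size=1000),
    "x2": rng.normal(size=1000),
    "target": rng.randint(0, 2, size=1000),
})

# Roughly 75% of the rows go to training, the rest to validation.
is_train = rng.uniform(0, 1, len(data)) <= 0.75
train, validate = data[is_train], data[~is_train]

features = ["x1", "x2"]
x_train, y_train = train[features].values, train["target"].values
x_validate, y_validate = validate[features].values, validate["target"].values

print(len(train), len(validate))
```

The model would then be fitted on `x_train` only, and the ROC curve computed from `predict_proba(x_validate)` against `y_validate`.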


In [43]:
fpr   # false positive rate

array([ 0.        ,  0.        ,  0.        , ...,  0.99820236,
        0.999786  ,  1.        ])