### Modeling Process

The process is as follows:

* Problem Definition: Understand and clearly describe the problem that is being solved.
* Analyze Data: Understand the information available that will be used to develop a model.
* Prepare Data: Discover and expose the structure in the dataset.
* Evaluate Algorithms: Develop a robust test harness and baseline accuracy from which to improve and spot check algorithms.
* Improve Results: Leverage results to develop more accurate models.
* Present Results: Describe the problem and solution so that it can be understood by third parties.

### 1. Problem Definition

#### Identify Data Needs:
* Start with the business question.
* Determine the data needed to deliver the desired outcome.

#### Data Mapping
* Before preparing a data request, become as familiar as possible with the data sources, and their content, that might be available to address the business question.

* Data mapping has three major components:
    * Interview clients
    * Obtain and study data layouts
    * Obtain and evaluate data samples

### 2. Analyze Data

A comprehensive data dictionary should be maintained and updated whenever new information is gathered.

THINGS TO INCLUDE IN THE DATA DICTIONARY:
* Meaning of all Potential Predictors:
    * Maintain labels of as many variables as possible
    * If possible, one should also try to capture the business sense of these variables
* Dependent Variable Definition and Meaning
* Variable Classification: If not already provided, classify the variables into groups such as:
    * Demographic variables, e.g. age, gender
    * Performance variables, e.g. spend, number of transactions
    * Credit Attributes, e.g. total credit line, FICO score
    * Census level, e.g. population, location attributes such as income levels
* EDA (exploratory data analysis):
    * Univariate (single variable): mean, standard deviation, histograms, box plots
    * Bivariate (two variables): correlation, scatter plot
    * Multivariate (more than two variables): VIF (variance inflation factor), scatter plots, box plots by category
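
The univariate and bivariate summaries listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `age` and `spend` values are made-up toy data, and `pearson` is a helper written for this example rather than part of any library.

```python
import math
import statistics

# Hypothetical toy data: two numeric variables for five customers.
age = [23, 35, 41, 52, 60]
spend = [120.0, 180.0, 210.0, 260.0, 310.0]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Univariate: centre and spread of a single variable.
age_mean = statistics.mean(age)
age_std = statistics.stdev(age)  # sample standard deviation

# Bivariate: strength of the linear relationship between two variables.
r = pearson(age, spend)

print(f"age mean={age_mean:.1f}, std={age_std:.1f}, corr(age, spend)={r:.3f}")
```

Plots (histograms, box plots, scatter plots) and VIF are left out here because they need a plotting or modeling library.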
    

### 3. Prepare Data

#### I. Test and Train Split

#### Why?
* To start the modeling process, we need to create modeling and validation datasets.
* The validation dataset is used to check the performance of the model built on the modeling dataset.
* Poor performance on the validation dataset implies that the model is not robust.

#### Steps:
* ***Step 1***: Before we start the modeling process, we need to define and create the modeling population. Starting from the data shared by the client and the scope of the analysis, assess the data requirements (a certain amount of history, a certain length of future for prediction, data quality, etc.) and list the criteria that define the eligible population.

* ***Step 2***: Split the final eligible population into two parts: a modeling dataset (also called the training dataset) and a validation dataset. This can be done using:
    * a random assignment (e.g. a 60:40 or 80:20 split); or
    * a specific splitting criterion (based on time or segments)


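A 70:30 random split in R with dplyr, using the built-in `mtcars` data:

    library(dplyr)
    set.seed(42)                                   # make the split reproducible
    cars_id <- mutate(mtcars, id = row_number())   # add a numeric row id
    train <- sample_frac(cars_id, 0.7)             # 70% random sample for training
    test  <- anti_join(cars_id, train, by = "id")  # remaining 30% for validation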
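
The second option in Step 2, splitting on a time criterion, might look like the following Python sketch. The records, the field layout and the cut-off date are all hypothetical; in practice the cut-off comes from the scope of the analysis.

```python
from datetime import date

# Hypothetical records: (customer_id, observation_date).
records = [
    (1, date(2022, 11, 3)),
    (2, date(2022, 12, 18)),
    (3, date(2023, 1, 9)),
    (4, date(2023, 2, 27)),
    (5, date(2022, 10, 30)),
]

cutoff = date(2023, 1, 1)  # assumed cut-off between modeling and validation

# Older observations build the model; newer ones validate it out of time.
train = [rec for rec in records if rec[1] < cutoff]
test = [rec for rec in records if rec[1] >= cutoff]

print(len(train), "modeling rows,", len(test), "validation rows")
```

Unlike a random split, this keeps every validation observation later than every modeling observation, which is closer to how the model would be used after deployment.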


#### II. Identifying Non Usable Variables

Even at this early stage, certain variables can be identified as ‘non-usable for modeling purposes’, which reduces the dimension of the dataset. Some rules that can be applied are as follows:

* Variables with a single unique value throughout the dataset: By definition, such variables have zero explanatory power and hence are irrelevant for any analysis. These variables are usually flags such as merge indicators.
* ID variables: Such variables may be needed in the dataset for observation tagging. However, they should NOT be used as predictors in the model.
* Variables with very low fill rates:
    * Case I: The variable is defined over a specific segment only. This segment may be used to subset the modeling dataset for developing segment-specific models. In such a case, the same variable is usable for one segment but non-usable for the other.
    * Case II: A missing value may itself signify something, and may be associated with a meaningful value.
    * Case III: The variable's fill rate is below 50%, but there is a strong business case for its inclusion. In this case, an appropriate missing value imputation technique should be applied.
    * Case IV: If none of the above cases holds, a minimum fill rate cut-off may be applied to reduce the dimension of the dataset. A common convention is to exclude any variable with a fill rate below 50%. This cut-off can be set higher or lower depending on how well populated the received data is.

* Variables that cannot be used because of implementation issues should be dropped.
* Certain variables, such as gender or ethnicity, that cannot be used due to regulatory issues (depending on the business problem in context) should also be dropped.
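
The mechanical checks above (single unique value, ID variables, fill-rate cut-off) can be sketched as follows. This is an illustration only: the rows, the column names, the `ID_COLUMNS` set and the `non_usable_reason` helper are all hypothetical, and Cases I to III still need a manual review before a flagged variable is dropped.

```python
# Hypothetical dataset: a list of rows, with None marking a missing value.
rows = [
    {"cust_id": 101, "merge_flag": 1, "age": 34, "fax_number": None},
    {"cust_id": 102, "merge_flag": 1, "age": 51, "fax_number": None},
    {"cust_id": 103, "merge_flag": 1, "age": None, "fax_number": "555-0100"},
    {"cust_id": 104, "merge_flag": 1, "age": 29, "fax_number": None},
]

ID_COLUMNS = {"cust_id"}  # assumed to be known from the data dictionary
MIN_FILL_RATE = 0.5       # the 50% cut-off from Case IV; adjust as needed


def non_usable_reason(column):
    """Return why a column is flagged as non-usable, or None if it passes."""
    values = [row[column] for row in rows]
    filled = [v for v in values if v is not None]
    if column in ID_COLUMNS:
        return "ID variable"
    if len(filled) / len(values) < MIN_FILL_RATE:
        return "low fill rate"
    if len(set(filled)) <= 1:
        return "single unique value"
    return None


report = {column: non_usable_reason(column) for column in rows[0]}
print(report)
```

Variables excluded for implementation or regulatory reasons cannot be detected from the data itself, so they would have to be listed by hand in the same way as `ID_COLUMNS`.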

#### III. Reformatting Variables

Categorical and continuous variables are treated differently in most analyses. Hence, it’s always advisable to separate possible categorical variables from continuous ones.

A few points to remember:
* Look at the data description to check the variable format.
* Check the number of unique values. Numerical variables taking only 10-15 unique values may be treated as categorical. It’s a subjective call, depending on the variable and its expected use in the model.
* Apply business sense before treating variables as continuous or categorical.
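
The unique-value check can be turned into a first-pass suggestion, sketched below in Python. The threshold of 15 is taken from the upper end of the 10-15 range mentioned above, and the column names and the `suggest_type` helper are hypothetical. The output is only a starting point; business sense should override it where needed.

```python
MAX_CATEGORICAL_LEVELS = 15  # upper end of the 10-15 range; a judgment call


def suggest_type(values):
    """Suggest 'categorical' or 'continuous' for one column's non-missing values."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not is_numeric:
        return "categorical"  # text or boolean values
    if len(set(values)) <= MAX_CATEGORICAL_LEVELS:
        return "categorical"  # numeric, but with only a few distinct levels
    return "continuous"


# Hypothetical columns.
columns = {
    "region": ["north", "south", "south", "east"],
    "num_cards": [1, 2, 2, 3, 1, 4],
    "spend": [round(100 + 7.3 * i, 1) for i in range(40)],
}

suggested = {name: suggest_type(vals) for name, vals in columns.items()}
print(suggested)
```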

#### IV. Outlier Treatment

#### V. Missing Value Treatment

### 4. Evaluate Algorithms

#### Classification Metrics

* Classification Accuracy.
    * Classification accuracy is the number of correct predictions made as a ratio of all predictions made.

* Logarithmic Loss.
    * Logarithmic loss (or log loss) is a performance metric for evaluating predicted probabilities of membership in a given class.

    * The predicted probability, a scalar between 0 and 1, can be seen as a measure of the algorithm's confidence in a prediction. Correct and incorrect predictions are rewarded or penalized in proportion to that confidence. Lower values are better, and a perfect model has a log loss of 0.

* Area Under ROC Curve.
    * Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.

    * The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is no better than random guessing.
![](ROC.png)
* Confusion Matrix.
![](mat.png)
* Classification Report.
    * scikit-learn's `classification_report()` function displays the precision, recall, F1-score and support for each class.
    * Positive predictive value or precision: the proportion of predicted positive cases that are actually positive.
    * Sensitivity or recall: the proportion of actual positive cases that are correctly identified.
    * Specificity: the proportion of actual negative cases that are correctly identified.
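
To make the definitions above concrete, here is a minimal Python sketch that computes them by hand for a binary problem. The labels, the probabilities and the 0.5 decision threshold are made up for illustration; AUC is computed from its pairwise-ranking interpretation rather than by tracing the ROC curve.

```python
import math

# Hypothetical labels (1 = positive) and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

# Confusion-matrix cells.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # predicted positives that are truly positive
recall = tp / (tp + fn)       # actual positives that were found
specificity = tn / (tn + fp)  # actual negatives that were found

# Log loss: average negative log-likelihood of the true class.
log_loss = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, y_prob)
) / len(y_true)

# AUC: chance that a random positive is scored above a random negative
# (ties count as half).
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
auc = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} log_loss={log_loss:.3f} auc={auc:.3f}")
```

In a real project these would normally come from `sklearn.metrics` (`accuracy_score`, `log_loss`, `roc_auc_score`, `confusion_matrix`); the hand-written version is only meant to show what each number measures.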



#### Regression metrics


* **R² (Coefficient of Determination)** - This metric gives the proportion of variance in the dependent variable that is explained by the covariates in the model. For a linear model with an intercept, evaluated on the data it was fitted to, it ranges between 0 and 1. Higher values are usually desirable, but what counts as acceptable depends on data quality and domain: with noisy data, a model with a low R² may still be acceptable. It is good practice to consider adjusted R² rather than R² when judging model fit.
* **Adjusted R²** - The problem with R² is that it never decreases as you add variables, regardless of whether the new variable actually adds information to the model. Adjusted R² corrects for this by penalizing the number of predictors, so it rises only when a newly added variable improves the fit by more than would be expected by chance.

* **RMSE / MSE / MAE** - These error metrics are the key evaluation numbers to check. Since they all measure error, the lower the number, the better the model. Let's look at them one by one:
    * MSE - Mean squared error. Squaring amplifies the impact of large errors, so outliers weigh heavily on it. For example, if the actual y is 10 and the predicted y is 30, the squared error is (30-10)² = 400.
    * MAE - Mean absolute error. It is less sensitive to outliers than MSE. Using the previous example, the absolute error is |30-10| = 20.
    * RMSE - Root mean squared error, the square root of MSE. It indicates how far, on average, the residuals are from zero, and is expressed in the same units as the data. Here, RMSE = √((30-10)²) = 20. Don't be surprised that MAE and RMSE are equal in this example: it contains a single observation. In practice these metrics are averaged over all (actual - predicted) differences in the data, and RMSE is then at least as large as MAE.
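
The regression metrics above can be computed together, as in the following Python sketch. The `regression_metrics` helper and the actual and predicted values are hypothetical; adjusted R² uses the usual formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return MSE, MAE, RMSE, R² and adjusted R² for paired actual/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)             # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse), "r2": r2, "adj_r2": adj_r2}


# Hypothetical actual and predicted values from a model with 2 predictors.
actual = [10, 12, 15, 20, 22, 30]
predicted = [11, 12, 14, 21, 22, 24]

m = regression_metrics(actual, predicted, n_predictors=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The last prediction misses by 6 while the others miss by at most 1, so RMSE comes out noticeably larger than MAE here, which illustrates how squaring amplifies a single large error.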