# Predict Click-Through-Rate

## Problem To Be Solved:
The project is to predict whether a mobile ad will be clicked or not. Click-Through-Rate (CTR) is a standard metric for evaluating ad performance, and CTR prediction systems are widely used across the internet economy. The problem offers a good learning opportunity because it has a typical real-world setup with lots of data and lots of features.

## Project Client:
The project is a completed [Kaggle Competition](https://www.kaggle.com/c/avazu-ctr-prediction). A solution is already available for the project and the reward has been awarded to the winning team.

The project client is myself: the goal is to see whether I can apply the advanced Machine Learning technologies and Data Science methodologies learned in this workshop to arrive at a solution that matches up to the top teams in the competition. Since a solution is available, there is a known target that one should arrive at.
As mentioned before, the main aim is to apply the learnings of this course to a real-life problem to 1) refine the learnings further, 2) demonstrate the learnings, and 3) be ready to apply the learnings to the next projects.

## Data Set:
The data-set is available [here](https://www.kaggle.com/c/avazu-ctr-prediction/data). It contains training and test data in CSV format. The zipped training data is about 1 GB in size and the test data-set is about 118 MB. The data-set is well defined, with some of the categorical features anonymized.

## Data Fields:
* id: ad identifier
* click: 0/1 for non-click/click
* hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
* C1 -- anonymized categorical variable
* banner_pos
* site_id
* site_domain
* site_category
* app_id
* app_domain
* app_category
* device_id
* device_ip
* device_model
* device_type
* device_conn_type
* C14-C21 -- anonymized categorical variables
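
The `hour` field packs date and hour into one number. A minimal sketch of decoding it with the standard library is shown below; the helper name `parse_hour` is hypothetical and not part of the project code.

```python
from datetime import datetime, timezone


def parse_hour(value):
    """Parse a YYMMDDHH value such as 14091123 into a UTC datetime."""
    # Zero-pad in case the value was read as an integer and lost a leading zero.
    text = str(value).zfill(8)
    return datetime.strptime(text, "%y%m%d%H").replace(tzinfo=timezone.utc)


ts = parse_hour(14091123)
print(ts.isoformat())          # 2014-09-11T23:00:00+00:00
print(ts.weekday(), ts.hour)   # 3 23  (Monday is 0, so 3 is Thursday)
```

Splitting the value this way makes it possible to derive features such as day of week or hour of day, if they turn out to be useful.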

## Anonymized Categorical Variables:
Most of the fields are self-explanatory in the context of a mobile ad being clicked or not. However, some categorical variables are anonymized. It would have been good to get an idea about these variables. Without knowing the name or significance of these fields, no data engineering could be applied to them, and they have to be taken as given.

## Data Story:
Please see [here](../../data_story/data_story.html) for the data exploration of the data-set for this project. The following are the main points from the data story.

1. In the data-set, the Click-Through-Rate is about 16.98%. The Click-Through-Rate is the share of entries in which the ad was clicked. As described in the data fields, an ad is clicked when the click value is 1.
2. The training data-set contains 40+ million entries with 22 features affecting the prediction.
3. Only about 17% of the entries belong to the positive class; the rest belong to the negative class. The negative class is about 5 times larger than the positive class, so the data-set presents a class imbalance problem.
4. In the exploration, day and time at times do not seem to have much impact. However, the volume of data on a particular day may have an impact. It is hard to tell whether there is a real impact without running various models.
5. As discussed above, all features shall be considered for modelling.
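
The imbalance figures above follow directly from the reported CTR. The short calculation below is only an illustration; the variable names are made up for this sketch.

```python
ctr = 0.1698  # overall click-through rate reported in the data story

# Number of non-click entries for every click entry.
neg_to_pos = (1 - ctr) / ctr

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(ctr, 1 - ctr)

print(f"negatives per positive: {neg_to_pos:.2f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline is worth keeping in mind: on data with this class ratio, a model that never predicts a click already scores about 83% accuracy.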

## Predicting CTR:

### Glossary:
* **LR:** Logistic Regression
* **SGDC:** SGD Classifier
* **RFC:** Random Forest Classifier
* **LR50:** Logistic Regression trained with data-set containing balanced 50-50 positive-negative classes.
* **LR33:** Logistic Regression trained with data-set containing 33% positive class entries.
* **LR20:** Logistic Regression trained with data-set containing 20% positive class entries.
* **SGDC50:** SGD Classifier trained with data-set containing balanced 50-50 positive-negative classes.
* **SGDC33:** SGD Classifier trained with data-set containing 33% positive class entries.
* **SGDC20:** SGD Classifier trained with data-set containing 20% positive class entries.
* **RFC50:** Random Forest Classifier trained with data-set containing balanced 50-50 positive-negative classes.

### Project Utils and Model Utils:

As the modelling process started, it became clear that a common interface is required to try different classifiers on different data-sets. For this purpose project_utils and model_utils have been developed. These utility modules were developed iteratively, after the initial modelling experiments. They can be found at the following links.

* **[project_utils](./project_utils.py)**: This module provides the following interfaces.
  * Reading large CSV files
  * Generating training samples with on-demand pos-neg balance ratio
  * Generating test samples
  * Plotting distribution plot of actual data or predicted data
  * Calibration curves
  * Calculating CTR for TEST DATA

* **[model_utils](./model_utils.py)**: This module specifies the do_classify interface, which finds the best parameters for a classifier using the **GridSearchCV** method.
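
The actual do_classify code lives in model_utils. As a rough idea of what such a helper can look like, here is a minimal sketch built on scikit-learn's `GridSearchCV`. The function name `do_classify_sketch`, its arguments, and the synthetic data are assumptions made for illustration, not the project's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split


def do_classify_sketch(clf, param_grid, X, y, n_folds=5, test_size=0.2):
    """Hypothetical stand-in for do_classify: grid-search, then report accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    # Try every parameter combination with K-Fold cross validation.
    search = GridSearchCV(clf, param_grid, cv=n_folds, scoring="accuracy")
    search.fit(X_train, y_train)
    best = search.best_estimator_  # refit on the full training split
    print("BEST PARAMS", search.best_params_)
    print("Accuracy on training data: %0.2f" % best.score(X_train, y_train))
    print("Accuracy on test data:     %0.2f" % best.score(X_test, y_test))
    return best, search.best_params_


# Synthetic data stands in for the encoded ad features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
best_clf, best_params = do_classify_sketch(
    LogisticRegression(max_iter=1000), {"C": [0.001, 0.1, 1, 10, 100]}, X, y
)
```

Keeping the classifier and the parameter grid as arguments is what lets the same helper be reused for Logistic Regression, SGD Classifier and Random Forest.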

### Predicting CTR with Smaller Sample:
As mentioned before, the training data-set is large for consumer-grade computers. It may not qualify as Big Data, but the available local computer resources cannot process the whole data-set. This is a recurring issue in many real-life problems.

The idea here is to apply statistical principles to draw a reasonable, workable sample that can be used on local resources to try different models. Once the best classifier is found on local resources, the same approach is applied to a bigger sample or the whole data-set to get the final model.

#### [Samples](project_sample_data.ipynb):

**[Convert Data Into Sample](./convert_data.ipynb):** The code available [here](./convert_data.ipynb) converts the big training CSV into small CSV files. Once the big file is divided into small files, the sampling function can go through each file and draw an equal number of samples.
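
The conversion notebook holds the real code. The sketch below only illustrates the general idea of cutting a large CSV into small files without loading it into memory; `split_csv`, the file names and the tiny demo file are hypothetical.

```python
import csv
import os
import tempfile


def split_csv(path, out_dir, rows_per_file):
    """Split a CSV into numbered part files, repeating the header in each."""
    out_paths = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer = None, None
        for i, row in enumerate(reader):
            if i % rows_per_file == 0:
                # Start a new part file every rows_per_file rows.
                if out is not None:
                    out.close()
                part = os.path.join(out_dir, "part_%04d.csv" % len(out_paths))
                out = open(part, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                out_paths.append(part)
            writer.writerow(row)
        if out is not None:
            out.close()
    return out_paths


# Tiny demonstration file: 10 rows split into files of at most 4 rows.
work_dir = tempfile.mkdtemp()
demo = os.path.join(work_dir, "train_demo.csv")
with open(demo, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "click"])
    w.writerows([[i, i % 2] for i in range(10)])

parts = split_csv(demo, work_dir, rows_per_file=4)
print(len(parts))  # 3 files holding 4, 4 and 2 rows
```

Because rows are streamed one at a time, memory use stays flat regardless of the size of the input file.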

[This notebook](project_sample_data.ipynb) explores the samples used for the classifiers. It uses the **project_utils.sample_data** function to generate the required sample. This function draws the number of samples per bin, specified as an argument, from each bin (small CSV file) to create the required sample. While creating the sample, it ensures that the ratio of positive to negative classes matches the value specified by the argument.
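
The behaviour described above can be sketched in a few lines of plain Python. This is not the code of project_utils.sample_data; the function names, the `(features, click)` row layout and the synthetic bins are assumptions for illustration.

```python
import random


def sample_bin(rows, n_samples, pos_ratio, rng):
    """Draw n_samples rows from one bin with the requested share of clicks.

    rows is a list of (features, click) pairs where click is 0 or 1.
    """
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    n_pos = int(round(n_samples * pos_ratio))
    n_neg = n_samples - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("bin is too small for the requested sample")
    picked = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(picked)
    return picked


def sample_bins(bins, samples_per_bin, pos_ratio, seed=0):
    """Draw the same number of rows from every bin and concatenate them."""
    rng = random.Random(seed)
    sample = []
    for rows in bins:
        sample.extend(sample_bin(rows, samples_per_bin, pos_ratio, rng))
    return sample


# Three synthetic bins in which every sixth row is a click (about 17%).
bins = [[(i, 1 if i % 6 == 0 else 0) for i in range(1000)] for _ in range(3)]
balanced = sample_bins(bins, samples_per_bin=100, pos_ratio=0.5)
print(len(balanced), sum(click for _, click in balanced))  # 300 150
```

Calling the same function with `pos_ratio=0.33` or `pos_ratio=0.2` would give samples analogous to the 33% and 20% data-sets used later.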

#### [LogisticRegression](project_lr.ipynb):
The [LogisticRegression notebook](project_lr.ipynb) has the following layout. One can jump to a specific section of [the notebook](project_lr.ipynb) by clicking its link. Logistic Regression was used to build the library, so the purpose here is to capture a snapshot of the journey of the project.

* [Simple Logistic Regression:](project_lr.ipynb#Simple-Logistic-Regression:)<br/>
This code uses LogisticRegression available in the scikit-learn library.

* [LR with KFold = 5:](project_lr.ipynb#LR-with-KFold-=-5:)<br/>
This code uses the K-Fold approach on the LogisticRegression classifier used in the link above.

* [LR - Find Best C value:](project_lr.ipynb#LR---Find-Best-C-value:)<br/>
This code tries the K-Fold approach on classifiers with different C values and picks the classifier with the best accuracy score.

* [LR with GridSearchCV:](project_lr.ipynb#LR-with-GridSearchCV:)<br/>
**GridSearchCV**, available in the scikit-learn module, applies the K-Fold approach over different parameters and returns the best classifier.

* [LR with do_classify:](project_lr.ipynb#LR-with-do_classify:)<br/>
The **do_classify** function was introduced in the LogisticRegression Mini Project and has been adapted for this project. It is used for all classifiers in this project. It can be used on different training data-sets and produces different results for each. For this project the following C values have been used.
  * **C values**: [0.001, 0.1, 1, 10, 100]<P/>
* [LR Accuracy Score:](project_lr.ipynb#LR-Accuracy-Score:)<br/>
  * **LR50 Accuracy Score**:
    <pre>
        BEST PARAMS {'C': 0.1}
        Accuracy on training data: 0.55
        Accuracy on test data:     0.56
    </pre>
  * **LR33 Accuracy Score**:
    <pre>
        BEST PARAMS {'C': 0.01}
        Accuracy on training data: 0.67
        Accuracy on test data:     0.67
    </pre>
  * **LR20 Accuracy Score**:
    <pre>
        BEST PARAMS {'C': 0.01}
        Accuracy on training data: 0.80
        Accuracy on test data:     0.80
    </pre>
    
  From the above results, the classifier seems to do better as the class ratio becomes more imbalanced. That may be because the classifier is overfitting the data; note also that the accuracy closely tracks the share of the negative class in each sample (about 67% and 80%). The same classifier may not be good enough with other data. The accuracy score alone is therefore not sufficient, and it is time to look at other performance metrics.

* [LR Confusion Matrix:](project_lr.ipynb#LR-Confusion-Matrix:)
  * **LR50 + Xtest_lr_50 Confusion Matrix**:
    <pre>
        [[3238 4851]
         [2345 5738]]
    </pre>
    
   <P/> This seems like a good mix of results, though False Positives and False Negatives are quite high. <P/>
  * **LR33 + Xtest_lr_33 Confusion Matrix**:
    <pre>
        [[10859    39]
         [ 5257    17]]
    </pre>  
  * **LR20 + Xtest_lr_20 Confusion Matrix**:
    <pre>
        [[12923     0]
         [ 3249     0]]
    </pre>
  * **LR33 + Xtest_lr_50 Confusion Matrix**:
    <pre>
        [[8064   25]
         [8060   23]]
    </pre>
  * **LR20 + Xtest_lr_50 Confusion Matrix**:
    <pre>
        [[8089    0]
         [8083    0]]
    </pre>
    
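  <p/>The confusion matrices show that all is not well with the classifiers trained on imbalanced data-sets. Assuming the usual scikit-learn layout (rows are actual classes, columns are predicted classes), LR33 and LR20 predict almost every entry as a non-click, so the predictions have a very high rate of False Negatives. <p/>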

* [LR Data Projection:](project_lr.ipynb#LR-Data-Projection:)
  * [LR50 Data Projection](project_lr.ipynb#LR50-Data-Projection:)
  The test data projection of actuals and predicted with LR50 classifier.
  * [LR33 Data Projection](project_lr.ipynb#LR33-Data-Projection:)
  The test data projection of actuals and predicted with LR33 classifier.
  * [LR20 Data Projection](project_lr.ipynb#LR20-Data-Projection:)
  The test data projection of actuals and predicted with LR20 classifier.
  
    Similar to the confusion matrix results, the plots for the LR33 and LR20 classifiers show that they suffer from heavy misclassification of the positive class. The LR50 classifier also has some misclassification, and the problem requires something more than simple Logistic Regression. <p/>

* [LR Calibration Curve:](project_lr.ipynb#LR-Calibration-Curve:)
  * [LR50 Calibration Curve:](project_lr.ipynb#LR50-Calibration-Curve:)
  The calibration curve of all classifiers with X50 data set.
  * [LR33 Calibration Curve:](project_lr.ipynb#LR33-Calibration-Curve:)
  The calibration curve of all classifiers with X33 data set.
  * [LR20 Calibration Curve:](project_lr.ipynb#LR20-Calibration-Curve:)
  The calibration curve of all classifiers with X20 data set.
  
  The calibration curves show that all classifiers perform very poorly.
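
To make the LR confusion matrices above easier to compare, accuracy, precision and recall can be derived from each of them. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper `summarize` is hypothetical and the matrices are copied from the results listed above.

```python
def summarize(matrix):
    """Return accuracy, precision and recall for a 2x2 confusion matrix.

    Assumes scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when no positive is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


matrices = {
    "LR50 + Xtest_lr_50": [[3238, 4851], [2345, 5738]],
    "LR33 + Xtest_lr_33": [[10859, 39], [5257, 17]],
    "LR20 + Xtest_lr_20": [[12923, 0], [3249, 0]],
}
for name, m in matrices.items():
    acc, prec, rec = summarize(m)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Under that layout, LR20 reaches about 80% accuracy with a recall of zero, which is why accuracy alone is misleading on the imbalanced samples.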

#### [SGD Classifier](project_sgdc.ipynb):

* [Trying Linear SVM:](project_sgdc.ipynb#Trying-Linear-SVM:)<br/>
SVM was tried first, but it took quite a long time. SGD Classifier with the loss parameter set to *hinge* behaves as a Linear SVM.
* [SGDC with do_classify:](project_sgdc.ipynb#SGDC-with-do_classify:)<br/>
SGD Classifier has been trained with the following different parameters:
  * **alpha: [0.01, 0.1, 1, 10, 100]** The alpha parameter plays a role similar to the C parameter in LogisticRegression: alpha is the regularization strength, whereas C is its inverse.
  * **n_iter: [50, 80, 100, 120, 150]** n_iter is the number of iterations used for training the classifier. As the number of iterations increases, training becomes slower, so it is good to find the optimal number of iterations. The experiments started with the default number of iterations. However, given the high number of features, it became clear that the number of iterations had to be raised. With a higher number of iterations, the gap between training and cross validation results decreases.
<P/>
* [SGDC Accuracy Score:](project_sgdc.ipynb#SGDC-Accuracy-Score:)
  * **SGDC50 Accuracy Score**:
    <pre>
        BEST PARAMS {'alpha': 1, 'n_iter': 120}
        Accuracy on training data: 0.50
        Accuracy on test data:     0.50
    </pre>
  * **SGDC33 Accuracy Score**:
    <pre>
        BEST PARAMS {'alpha': 1, 'n_iter': 120}
        Accuracy on training data: 0.46
        Accuracy on test data:     0.46
    </pre>
  * **SGDC20 Accuracy Score**:
    <pre>
        BEST PARAMS {'alpha': 10, 'n_iter': 200}
        Accuracy on training data: 0.80
        Accuracy on test data:     0.80
    </pre>
  * **Accuracy score on Xtest Data**:
    * **SGDC50**: Accuracy Score ytest: 0.492852
    * **SGDC33**: Accuracy Score ytest: 0.492110
    * **SGDC20**: Accuracy Score ytest: 0.498837

  The accuracy scores on the Xtest data look strikingly similar. Next, let's look at the confusion matrices. <P/>

* [SGDC Confusion Matrix:](project_sgdc.ipynb#SGDC-Confusion-Matrix:)
  * **SGDC50 + Xtest_sgd_50 Confusion Matrix**:
    <pre>
        [[7193  969]
         [7075  935]]
    </pre>
  * **SGDC33 + Xtest_sgd_33 Confusion Matrix**:
    <pre>
        [[4235 6572]
         [2132 3233]]
    </pre>  
  * **SGDC20 + Xtest_sgd_20 Confusion Matrix**:
    <pre>
        [[12952     0]
         [ 3220     0]]
    </pre>
  * **SGDC33 + Xtest_sgd_50 Confusion Matrix**:
    <pre>
        [[3232 4930]
         [3148 4862]]
    </pre>
  * **SGDC20 + Xtest_sgd_50 Confusion Matrix**:
    <pre>
        [[8162    0]
         [8010    0]]
    </pre>
  * **SGDC50 + Xtest Confusion Matrix**:
    <pre>
        [[8860 1224]
         [9028 1103]]
    </pre>
  * **SGDC33 + Xtest Confusion Matrix**:
    <pre>
        [[3884 6200]
         [4067 6064]]
    </pre>
  * **SGDC20 + Xtest Confusion Matrix**:
    <pre>
        [[10084     0]
         [10131     0]]
    </pre>
  
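  From the above results it can be confirmed that the SGD classifiers suffer from heavy misclassification. Assuming the usual scikit-learn layout (rows are actual classes, columns are predicted classes), SGDC50 and SGDC20 miss most or all of the actual clicks (False Negatives), while SGDC33 produces a large number of False Positives. SGDC50 and SGDC33 have very close accuracy scores, but their confusion matrices are very different. <P/>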
  
* [SGDC Data Projection:](project_sgdc.ipynb#SGDC-Data-Projection:)
  * [SGDC50 Data Projection](project_sgdc.ipynb#SGDC50-Data-Projection:)
  The test data projection of actuals and predicted with SGDC50 classifier.
  * [SGDC33 Data Projection](project_sgdc.ipynb#SGDC33-Data-Projection:)
  The test data projection of actuals and predicted with SGDC33 classifier.
  * [SGDC20 Data Projection](project_sgdc.ipynb#SGDC20-Data-Projection:)
  The test data projection of actuals and predicted with SGDC20 classifier.
  
The data projection shows that the SGDC50 classifier performs reasonably well.

* [SGDC Calibration Curve:](project_sgdc.ipynb#SGDC-Calibration-Curve:)
  * [SGDC50 Calibration Curve:](project_sgdc.ipynb#SGDC50-Calibration-Curve:)
  The calibration curve of all SGD classifiers with X50 data set.
  * [SGDC33 Calibration Curve:](project_sgdc.ipynb#SGDC33-Calibration-Curve:)
  The calibration curve of all SGD classifiers with X33 data set.
  * [SGDC20 Calibration Curve:](project_sgdc.ipynb#SGDC20-Calibration-Curve:)
  The calibration curve of all SGD classifiers with X20 data set.

The calibration result shows a horizontal line for all classifiers. This suggests that these classifiers suffer from misclassification issues. Also, for the balanced data-set the accuracy hovers around 50%.

From the above plots, confusion matrices and accuracy scores, it can be inferred that SGD Classifier does not perform well.
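
A calibration curve groups predictions into bins by predicted score and plots the fraction of actual positives in each bin. The sketch below shows the binning step in plain Python and why uninformative scores give a horizontal line. It is not the plotting code from project_utils; `calibration_points` and the toy data are assumptions, and the scores are assumed to lie in the range 0 to 1.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted score, fraction of positives) per non-empty bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # score sum, positives, count
    for label, prob in zip(y_true, y_prob):
        idx = min(int(prob * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin
        sums[idx][0] += prob
        sums[idx][1] += label
        sums[idx][2] += 1
    return [(s[0] / s[2], s[1] / s[2]) for s in sums if s[2]]


# Toy example: the labels ignore the score, so every bin has the same
# fraction of positives and the curve is flat at the base rate.
y_true = [0, 1] * 50
y_prob = [(i + 0.5) / 100 for i in range(100)]
points = calibration_points(y_true, y_prob)
for mean_prob, frac_pos in points:
    print(f"{mean_prob:.2f} -> {frac_pos:.2f}")
```

A well-calibrated classifier would instead produce points close to the diagonal, where the fraction of positives matches the mean predicted score in each bin.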

#### [Random Forest Classifier](project_rfc.ipynb):
Logistic Regression, SGD Classifier and the initial experiments showed that classifiers trained on imbalanced classes do not perform well. They tend to overfit the data, and prediction on test data is really poor. For this reason, only the 50-50 balanced data-set will be used.

Theory suggests that Random Forest should be a good fit for data with lots of features and lots of entries.

Since the sample is reduced drastically, for this classifier the **OOB** parameter (`oob_score` in scikit-learn) is set to **True** so that the out-of-bag error can be used for validation.
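
As a minimal sketch of the out-of-bag setting, assuming scikit-learn's `RandomForestClassifier` and synthetic data in place of the ad sample, the OOB estimate can be read from the fitted model as follows. The parameter values are taken from the grid used in this project; everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, roughly balanced data stands in for the 50-50 sample.
X, y = make_classification(
    n_samples=600, n_features=22, n_informative=8, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=20, oob_score=True, random_state=0
)
clf.fit(X, y)

# Each tree is scored on the rows left out of its bootstrap sample,
# which gives a validation estimate without a separate hold-out set.
print("OOB accuracy:      %0.2f" % clf.oob_score_)
print("Training accuracy: %0.2f" % clf.score(X, y))
```

A large gap between the two numbers is a sign of overfitting, which is the effect that raising min_samples_leaf is meant to reduce.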

* [RFC50 with do_classify:](project_rfc.ipynb#RFC50-with-do_classify:) <br/>
RandomForest Classifier has been trained with the following different parameters:
  * **n_estimators: [80, 100, 120, 150]**<br/>
    For the initial experiments, the RandomForest Classifier was used with default parameters. Some of the errors reported by the model made it clear that n_estimators must be raised because the number of features is much higher. Eventually, an ideal range was found where this classifier performed the best. <br/>
    
  * **min_samples_leaf: [20, 50, 80]**<br/>
    With n_estimators set to a higher value and min_samples_leaf left at its default value (which is 1), the model started overfitting the training data: there was a huge gap between training accuracy and cross validation accuracy. Over different experiments, different values (starting from the default) were tried for this parameter, and the training accuracy quickly started matching the cross validation accuracy.
    
* [RFC50 Accuracy Score:](project_rfc.ipynb#RFC50-Accuracy-Score:)
  * **RFC50 + X and Xtest Accuracy Score**:
    <pre>
        BEST PARAMS {'min_samples_leaf': 20, 'n_estimators': 100}
        Accuracy on training data: 0.70
        Accuracy on test data:     0.66
    </pre>
  * **Accuracy score on different XData**:
    * Accuracy Score Xtest: 0.696859
    * Accuracy Score X33: 0.647967
    * Accuracy Score X20: 0.635885  
  <P/>
 The above accuracy scores suggest that the model holds up well against different kinds of data-sets. The gap between training and cross validation accuracy is minimal.
  <P/>
* [RFC50 Confusion Matrix:](project_rfc.ipynb#RFC50-Confusion-Matrix:)<br/>
  * **RFC50 + Xtest_rfc_50 Confusion Matrix**:
  <pre>
        [[4972 3051]
         [2387 5762]]
  </pre>
  * **RFC50 + X33 Confusion Matrix**:
  <pre>
        [[33613 20562]
         [ 7903 18781]]
  </pre>
  * **RFC50 + X20 Confusion Matrix**:
  <pre>
        [[39955 24732]
         [ 4710 11462]]
  </pre>

  The above results show that the classifier seems to be holding well with different data-sets. <P/>
  
* [RFC50 Data Projection:](project_rfc.ipynb#RFC50-Data-Projection:)<br/>
<P/>The [different data projection](project_rfc.ipynb#RFC50-Data-Projection:) of actual and predicted data with RFC50 show that the classifier is holding well. <P/>
* [RFC50 Calibration Curve:](project_rfc.ipynb#RFC50-Calibration-Curve:)<br/>
<P/>The [Calibration Curves](project_rfc.ipynb#RFC50-Calibration-Curve:) show that for a balanced set, the calibration line almost aligns with the perfectly calibrated line. For imbalanced data-sets, the line dips slightly in the middle, but the performance of the classifier is still much better than that of the LR and SGD Classifiers.<P/>
* [RFC50 Feature Importances:](project_rfc.ipynb#RFC50-Feature-Importances:)<br/>
<P/>This shows the importance / weights of different features calculated by this training data-set.
   <pre>
        array([ 0.03440021,  0.00904071,  0.01600058,  0.08828306,  0.08419701,
                0.03522298,  0.0662777 ,  0.03661395,  0.04734966,  0.02005243,
                0.06662916,  0.06358917,  0.00886233,  0.02095102,  0.0750727 ,
                0.01903772,  0.04649285,  0.05405036,  0.03040371,  0.04324246,
                0.04842395,  0.08580628])
  </pre>

* [TEST DATA With RFC50 Classifier:](project_rfc.ipynb#TEST-DATA-With-RFC50-Classifier:)<P/>
  * [TEST DATA CTR:](project_rfc.ipynb#TEST-DATA-CTR:)<P/>
    The predicted CTR for TEST DATA is **35.11%**.<P/>
  * [TEST DATA Projection:](project_rfc.ipynb#TEST-DATA-Projection:)<P/>
    See the Test DATA Projection.<P/>
  * [TEST DATA Calibration Curve:](project_rfc.ipynb#TEST-DATA-Calibaration-Curve:)<P/>
    See the Test DATA Calibration Curve.<P/>
  * [Next Step:](project_rfc.ipynb#Next-Step:)<P/>
    The sample is only **0.2%** of the whole data-set. The next step is to take a bigger sample on a server and see how this classifier compares with a classifier trained on the larger sample.
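
The predicted CTR reported above is the share of test entries that the classifier labels as clicks. A minimal sketch of that calculation, with a hypothetical helper `predicted_ctr` and made-up labels rather than the project's predictions:

```python
def predicted_ctr(predictions):
    """Share of entries predicted as clicks (label 1)."""
    if not predictions:
        raise ValueError("no predictions given")
    return sum(predictions) / len(predictions)


preds = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # made-up labels, 3 clicks out of 10
print(f"{predicted_ctr(preds):.2%}")    # 30.00%
```

For comparison, the data story reports a CTR of about 17% on the training data, so the predicted value is best read together with the confusion matrices above.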

### Predicting CTR with Larger Sample:

A few experiments were carried out before finalizing the project solution with the large data. The first instinct was to use all data entries of the positive class and randomly select the same number of negative-class entries from the data-set to create a balanced sample, then use this balanced sample for training a classifier.

However, this approach proved to be difficult. Such a data-set is more than 2 GB in size. A server with 32 GB RAM could not handle the RandomForest Classifier on this big data-set. Without a distributed computing solution, it seems impossible to use the entire training set.

With 25% of the balanced sample generated as described above, a classifier could be trained. However, the machine consumed almost 90% of its 32 GB RAM most of the time. With this approach only one classifier could be tried at a time, and it would have been impossible to use GridSearchCV to find the best parameters with this data-set on a 32 GB RAM machine.

For the next experiment, a server with 64 GB RAM was used and the sample size was reduced to 10% of the total balanced sample. For a single run of classification, the server consumed about 12% of its RAM. Even 10% of the total balanced sample was 17 times larger than the sample used on local resources.

The RAM usage was encouraging, and it suggested that with this setup GridSearchCV could be used to vary two parameters over a total of 12 combinations. Since it was a very big server, n_jobs was set to 8 for parallel processing. The server returned the best classifier in about one and a half hours.
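
The count of 12 comes from the two parameter lists used for the large-sample run (four n_estimators values and three min_samples_leaf values). The enumeration below is only an illustration of how a grid expands into combinations; it does not train anything.

```python
from itertools import product

param_grid = {
    "n_estimators": [120, 150, 180, 200],
    "min_samples_leaf": [20, 50, 80],
}

# Every pairing of one value per parameter is one candidate classifier.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos))   # 12
print(combos[0])     # {'n_estimators': 120, 'min_samples_leaf': 20}
```

Each extra value in either list adds a full row or column of candidates, which is why the grid was kept small on a memory-constrained server.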

As with the small sample, **OOB** was set to True for the large sample.

#### [Random Forest Classifier](project_server.ipynb):
* [RFC50 with do_classify:](project_server.ipynb#RFC50-with-do_classify:)<P/>
RandomForest Classifier has been trained with the following different parameters:
  * **n_estimators: [120, 150, 180, 200]**<br/>
    These values are slightly higher than the values used with the smaller sample. This could be because of the sheer amount of additional training data.<br/>
  * **min_samples_leaf: [20, 50, 80]**<br/>
    Even for the bigger data-set, the min_samples_leaf values remain the same as for the smaller sample. The RandomForest Classifier worked even with min_samples_leaf set to 1, but smaller values of min_samples_leaf tend to overfit. With the value 20, the gap between training set accuracy and cross validation accuracy is minimal.

* [RFC50 Accuracy Score:](project_server.ipynb#RFC50-Accuracy-Score:)
  * **RFC50 + X and Xtest Accuracy Score**:
    <pre>
        BEST PARAMS {'n_estimators': 150, 'min_samples_leaf': 20}
        Accuracy on training data: 0.70
        Accuracy on test data:     0.68
    </pre>
  * **Accuracy score on different XData**:
    * Accuracy Score Xtest_rfc_50: 0.683692
    * Accuracy Score X50: 0.694862
  <P/>
* [RFC50 Confusion Matrix:](project_server.ipynb#RFC50-Confusion-Matrix:)<P/>
  * **RFC50 + Xtest_rfc_50 Confusion Matrix**:
    <pre>
       [[ 83630  53030]
        [ 33829 104114]]
    </pre>
  * **RFC50 + X50 Confusion Matrix**:
    <pre>
        [[427149 259041]
         [159917 526906]]
    </pre>
    
 The confusion matrices show that the classifier produces a substantial number of False Positives and False Negatives, and that is where the roughly 30% accuracy loss occurs.<P/>
 
* [RFC50 Data Projection:](project_server.ipynb#RFC50-Data-Projection:)<P/>
See the [different data projection](project_server.ipynb#RFC50-Data-Projection:) of actual and predicted data with RFC50.<P/>
* [RFC50 Calibration Curve:](project_server.ipynb#RFC50-Calibration-Curve:)<P/>
The [Calibration Curves](project_rfc.ipynb#RFC50-Calibration-Curve:) show that the calibration line aligns closely with the perfectly calibrated line. <P/>
* [RFC50 Feature Importances:](project_server.ipynb#RFC50-Feature-Importances:)<P/>
This shows the importance / weights of different features calculated by this training data-set.
  <pre>
    array([0.        ,  0.01111992,  0.01738841,  0.09691916,  0.08738399,
           0.04054367,  0.07020832,  0.03947884,  0.05462069,  0.0195703 ,
           0.05671768,  0.05611883,  0.01111614,  0.02457952,  0.07627469,
           0.0286369 ,  0.04700204,  0.05679609,  0.03451965,  0.04139404,
           0.04609788,  0.08351324])
  </pre>

* [TEST DATA With RFC50 Classifier:](project_server.ipynb#TEST-DATA-With-RFC50-Classifier:)<P/>
  * [TEST DATA CTR:](project_server.ipynb#TEST-DATA-With-CTR:)<P/>
  The predicted CTR for TEST DATA is **35.01%**.<P/>
  * [TEST DATA Projection:](project_server.ipynb#TEST-DATA-Projection:)<P/>
  See the Test DATA Projection.<P/>
  * [TEST DATA Calibration Curve:](project_server.ipynb#TEST-DATA-Calibaration-Curve:)<P/>
  See the Test DATA Calibration Curve.<P/>


### Comparing RFC50 Small Sample and Large Sample Results:

The following table shows the results of the **RFC50** Classifier trained on the small and the large sample. The large sample is 17 times larger than the small sample.

| Sample Type | Sample Size         | Train Accuracy| Test Accuracy | Predicted CTR 
|-------------|---------------------|---------------|---------------|---------------
|Small Sample |0.2% of training data|70%            |66%            |35.11%         
|Large Sample |3.4% of training data|70%            |68%            |35.01%         

### Conclusion:
* **RandomForest Classifier** trained with a 50-50 balanced sample performs the best among LogisticRegression, SGDClassifier and RandomForest Classifier.
* Predicted **CTR** of the **TEST DATA** is **35%**.
* The results from the small sample are comparable with those from the large sample.

### Key Learnings:

The following is a list of key learnings from this project.

* Work with a complex data-set.
* Use different classifiers from the sklearn library.
* Find an appropriate classifier for the problem at hand.
* Tune the model so that training accuracy and cross validation accuracy match.
* Find an acceptable solution that can be used on limited resources.
* Use servers in cloud compute solutions like Amazon AWS for modelling with large data-sets.

### Future Work:

RandomForest Classifier does a reasonable job. However, the accuracy of the classifier could be improved. A few things could be tried to improve the accuracy for the given project: 1) reducing the number of features used for prediction, 2) increasing the weights of certain features, and 3) using a different ensemble classifier.