# Revolut

- <b>Author:</b> Anish Mahapatra
- <b>Email:</b> anishmahapatra01@gmail.com

### Operations Task

The task is to help improve KYC. The KYC report is expected to contain the following:
- Improve KYC / Presentation of findings
- Catching fraud report
- Supporting Materials

All customers that would like an account would need to be validated. There are 2 main checks to be performed:
- Document check
- Facial Similarity Check

Getting a KYC 'clear' implies that the customer has cleared both the Document check and the Facial Similarity check.

<b>Document (clear) + Facial Similarity (clear) = KYC (clear) </b>
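
The rule above can be sketched in pandas. This is a minimal sketch on made-up rows, assuming both reports share an `attempt_id` key and a `result` column; the facial similarity schema is not shown in this section, so its column names are an assumption.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from the two report files.
doc = pd.DataFrame({
    "attempt_id": ["a1", "a2", "a3"],
    "result": ["clear", "consider", "clear"],
})
# Assumption: the facial similarity report has the same two columns.
face = pd.DataFrame({
    "attempt_id": ["a1", "a2", "a3"],
    "result": ["clear", "clear", "consider"],
})

# Join the two checks on the attempt they belong to.
kyc = doc.merge(face, on="attempt_id", suffixes=("_doc", "_face"))

# KYC is clear only when both checks are clear.
kyc["kyc_clear"] = (kyc["result_doc"] == "clear") & (kyc["result_face"] == "clear")

# Share of attempts that pass KYC.
kyc_pass_rate = kyc["kyc_clear"].mean()
```

An inner join keeps only attempts present in both reports; whether that is the right choice depends on how unmatched attempts should be counted.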

Each customer has two attempts. The pass rate has decreased substantially in the recent period. (Why?)
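
One way to see when the pass rate dropped is to compute it per day from the `created_at` timestamp. The sketch below uses made-up rows and assumes `result == 'clear'` is what counts as a pass.

```python
import pandas as pd

# Toy rows for illustration only.
attempts = pd.DataFrame({
    "created_at": [
        "2017-06-19T10:00:00Z", "2017-06-19T11:00:00Z",
        "2017-06-20T09:30:00Z", "2017-06-20T12:00:00Z",
    ],
    "result": ["clear", "clear", "clear", "consider"],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
attempts["created_at"] = pd.to_datetime(attempts["created_at"], utc=True)

# 1 for a pass, 0 otherwise, so that the mean is the pass rate.
attempts["is_clear"] = (attempts["result"] == "clear").astype(int)

# Pass rate per calendar day (UTC).
daily_pass_rate = attempts.groupby(attempts["created_at"].dt.date)["is_clear"].mean()
```

Plotting `daily_pass_rate` (or a weekly aggregate) would show the period in which the decrease starts.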



#### The questions we are trying to answer:
1. What is the percentage of people who are passing the document check?
2. What is the percentage of people who are passing the facial similarity?
    a. What are the % break-downs of the other results for facial similarity?
3. What is the percentage of people who are passing KYC (document check + facial similarity)?
4. Why has the pass rate decreased in the recent period?
5. What is the percentage of people with attempts more than one?
6. How many people are failing due to the maximum number of attempts?
7. What is the root cause?
8. What are the possible solutions to the decrease in pass rate?
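
Several of these questions reduce to simple aggregations. The following is a sketch on made-up rows, not the analysis itself; it assumes that each row is one attempt and that a user "fails due to maximum attempts" when they used both attempts without ever getting a `clear` result.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the document check data.
checks = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "attempt_id": ["a1", "a2", "a3", "a4", "a5"],
    "result": ["consider", "clear", "clear", "consider", "consider"],
})

# Questions 1-2: share of each result value, in percent.
result_share = checks["result"].value_counts(normalize=True) * 100

# Question 5: percentage of users with more than one attempt.
attempts_per_user = checks.groupby("user_id")["attempt_id"].nunique()
pct_multi_attempt = (attempts_per_user > 1).mean() * 100

# Question 6: users who used all attempts without ever getting a clear result.
MAX_ATTEMPTS = 2  # from the task description: each customer has two attempts
ever_clear = checks["result"].eq("clear").groupby(checks["user_id"]).any()
exhausted = (attempts_per_user >= MAX_ATTEMPTS) & ~ever_clear
n_exhausted = int(exhausted.sum())
```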


<a id='TOC'></a>

# Table Of Contents

The following steps are followed to perform the analysis:

- [#1 Data load, importing libraries & Sense Check of Data](#1)
    - [#1.1 Analysis of Document Check Data](#1.1)
    - [#1.2 Analysis of Facial Similarity Data](#1.2)
- [#2 Data Cleaning, Missing Value Treatment](#2)
- [#3 Exploratory Data Analysis (EDA)](#3)
    - [#3.1 Univariate and Bivariate Analysis of Columns](#3.1)
    - [#3.2 Outlier Analysis of the Data](#3.2)
- [#4 Test-Train Split, Feature Scaling & Handling Class Imbalance in the data - SMOTE ](#4)
- [#5 Modelling - Part 1: Obtaining best churn classification](#5)

<a id='1'></a>
## #1 Data load, importing libraries & Sense Check of Data
Back to [Table of Contents](#TOC)

In [3]:
# Importing the required packages

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os

# Hide warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
# Raising the maximum number of displayed columns and rows to 500
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [5]:
# Reading the document check reports
documentData = pd.read_csv("Data/doc_reports.csv")

In [6]:
# Reading the facial similarity reports
facialSimilarityData = pd.read_csv("Data/facial_similarity_reports.csv")
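
CSV files written by pandas often carry the old index as an unnamed first column, which is the likely origin of the `Unnamed: 0` column in the summary of this dataset. Assuming that is the case here, `index_col=0` avoids loading it as data. The sketch uses an in-memory stand-in rather than the real files.

```python
import io

import pandas as pd

# Stand-in for a CSV file whose first column is a saved index (toy content).
csv_text = ",user_id,result\n0,u1,clear\n1,u2,consider\n"

# index_col=0 uses that first column as the index instead of a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
```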

<a id='1.1'></a>
### #1.1 Analysis of Document Check Data
Back to [Table of Contents](#TOC)

In [7]:
df = documentData.copy(deep=True)

In [8]:
df.shape

(176404, 19)

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,user_id,result,visual_authenticity_result,image_integrity_result,face_detection_result,image_quality_result,created_at,supported_document_result,conclusive_document_quality_result,colour_picture_result,data_validation_result,data_consistency_result,data_comparison_result,attempt_id,police_record_result,compromised_document_result,properties,sub_result
0,0,ab23fae164e34af0a1ad1423ce9fd9f0,consider,consider,clear,clear,clear,2017-06-20T23:12:57Z,clear,,,clear,clear,,050a0596de424fab83c433eaa18b3f8d,clear,,"{'gender': 'Male', 'nationality': 'IRL', 'docu...",caution
1,1,15a84e8951254011b47412fa4e8f65b8,clear,clear,clear,clear,clear,2017-06-20T23:16:04Z,clear,,,clear,,,f69c1e5f45a64e50a26740b9bfb978b7,clear,,"{'gender': 'Female', 'document_type': 'driving...",clear
2,2,ffb82fda52b041e4b9af9cb4ef298c85,clear,clear,clear,clear,clear,2017-06-20T17:59:49Z,clear,,,clear,clear,,f9f84f3055714d8e8f7419dc984d1769,clear,,"{'gender': 'Male', 'nationality': 'ITA', 'docu...",clear
3,3,bd4a8b3e3601427e88aa1d9eab9f4290,clear,clear,clear,clear,clear,2017-06-20T17:59:38Z,clear,,,clear,clear,,10a54a1ecf794404be959e030f11fef6,clear,,"{'gender': 'Male', 'issuing_date': '2007-08', ...",clear
4,4,f52ad1c7e69543a9940c3e7f8ed28a39,clear,clear,clear,clear,clear,2017-06-20T18:08:09Z,clear,,,clear,clear,,1f320d1d07de493292b7e0d5ebfb1cb9,clear,,"{'gender': 'Male', 'nationality': 'POL', 'docu...",clear
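
The `properties` column appears to hold a Python dict written out as text. Assuming it is stored as a string, it can be expanded into regular columns. This is a sketch on made-up rows that use only the `gender` and `nationality` keys visible above.

```python
import ast

import pandas as pd

# Toy rows for illustration; the real values hold more keys than shown here.
sample = pd.DataFrame({
    "properties": [
        "{'gender': 'Male', 'nationality': 'IRL'}",
        "{'gender': 'Female'}",
    ]
})

# literal_eval parses Python literals without executing arbitrary code.
parsed = sample["properties"].apply(ast.literal_eval)

# Expand the dicts into columns; keys missing from a row become NaN.
props = pd.DataFrame(parsed.tolist())
```

`ast.literal_eval` is used instead of `json.loads` because the text uses single quotes, which is not valid JSON.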


In [10]:
# Summary of the dataset (df.info() prints directly and returns None)
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176404 entries, 0 to 176403
Data columns (total 19 columns):
Unnamed: 0                            176404 non-null int64
user_id                               176404 non-null object
result                                176404 non-null object
visual_authenticity_result            150290 non-null object
image_integrity_result                176403 non-null object
face_detection_result                 150261 non-null object
image_quality_result                  176403 non-null object
created_at                            176404 non-null object
supported_document_result             175900 non-null object
conclusive_document_quality_result    95217 non-null object
colour_picture_result                 95222 non-null object
data_validation_result                142974 non-null object
data_consistency_result               92229 non-null object
data_comparison_result                2548 non-null object
attempt_id                            176

In [11]:
# Listing all the columns with at least one null value
df.columns[df.isna().any()].tolist()

['visual_authenticity_result',
 'image_integrity_result',
 'face_detection_result',
 'image_quality_result',
 'supported_document_result',
 'conclusive_document_quality_result',
 'colour_picture_result',
 'data_validation_result',
 'data_consistency_result',
 'data_comparison_result',
 'police_record_result',
 'compromised_document_result']
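
Knowing which columns contain nulls is more useful together with how many. For example, `data_comparison_result` has only 2548 non-null values out of 176404 rows. A minimal sketch of the per-column null percentage, on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
sample = pd.DataFrame({
    "result": ["clear", "consider", "clear", "clear"],
    "data_comparison_result": [np.nan, np.nan, np.nan, "clear"],
    "face_detection_result": ["clear", np.nan, "clear", "clear"],
})

# isna().mean() gives the fraction of nulls per column; scale to percent.
null_pct = sample.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values.
null_pct = null_pct[null_pct > 0]
```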

In [16]:
import sweetviz as sv

In [17]:
my_report = sv.analyze(df)

ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

<a id='1.2'></a>
### #1.2 Analysis of Facial Similarity Data
Back to [Table of Contents](#TOC)