# License Status Prediction - Multiclass Classification
<hr style="border:1px solid gray"> </hr>

## Contents

    1. [x] Problem Description
    2. [x] Data Exploration
    3. [ ] Feature Selection
    4. [ ] KNN Algorithm
    5. [ ] Naive Bayes
    6. [ ] Logistic Regression
    7. [ ] Decision Tree
    
    
--------------------------------------------------------------------------------------------------------------------------------


### Problem Description

This notebook uses a real-world business license dataset, which holds various information about each license, to predict the license status of a given business. Let's start by exploring the data.
________________________________________________________________________________________________________________________________

In [124]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# will make your plot outputs appear and be stored within the notebook.
%matplotlib inline
import os
import category_encoders as ce

In [125]:
df = pd.read_csv("../dataset/License_dataset.csv")
df.head()



Unnamed: 0,ID,LICENSE ID,ACCOUNT NUMBER,SITE NUMBER,LEGAL NAME,DOING BUSINESS AS NAME,ADDRESS,CITY,STATE,ZIP CODE,...,LICENSE TERM START DATE,LICENSE TERM EXPIRATION DATE,LICENSE APPROVED FOR ISSUANCE,DATE ISSUED,LICENSE STATUS CHANGE DATE,SSA,LATITUDE,LONGITUDE,LOCATION,LICENSE STATUS
0,35342-20020816,1256593,32811,1,CARMEN CAHUE,CLAUDIA'S BRIDAL SHOP,2625 S CENTRAL PARK AVE 1,CHICAGO,IL,60623.0,...,2002-08-16T00:00:00,2003-08-15T00:00:00,2002-08-21T00:00:00,2006-04-11T00:00:00,,25.0,41.843613,-87.714618,"{'latitude': '41.843612879431845', 'longitude'...",AAI
1,1358463-20051116,1639294,262311,29,"ISLA TROPICAL, INC.",ISLA TROPICAL,2825 W MONTROSE AVE,CHICAGO,IL,60618.0,...,2005-11-16T00:00:00,2006-11-15T00:00:00,2006-04-05T00:00:00,2006-06-12T00:00:00,2006-06-15T00:00:00,60.0,41.961132,-87.699626,"{'latitude': '41.96113244107215', 'longitude':...",AAC
2,1980233-20090722,1980233,345008,1,DJS REMODELING,"DJS REMODELING, INC.",1605 CLAVEY RD 1,HIGHLAND,IL,60035.0,...,2009-07-22T00:00:00,2011-07-15T00:00:00,2009-07-22T00:00:00,2009-07-22T00:00:00,,,,,,AAI
3,1476582-20040211,1476582,273121,1,ALL-BRY CONSTRUCTION CO.,ALL-BRY CONSTRUCTION CO.,8 NORTH TRAIL,LEMONT,IL,60439.0,...,2004-02-11T00:00:00,2005-02-15T00:00:00,2004-02-10T00:00:00,2004-02-11T00:00:00,,,,,,AAI
4,1141408-20080516,1896750,213785,1,MCDONOUGH MECHANICAL SERVICE,MCDONOUGH MECHANICAL SERVICE,4081 JOSEPH DR,WAUKEGAN,IL,60087.0,...,2008-05-16T00:00:00,2010-05-15T00:00:00,2008-06-04T00:00:00,2008-06-05T00:00:00,,,,,,AAI


In [126]:
# Normalize the column names: lowercase, with spaces replaced by underscores
new_col_name = [col.replace(" ", "_").lower() for col in df.columns]
df.columns   = new_col_name

## Variable Description

**Dependent Variable (to predict): license status** <br>
1. AAI - License is issued <br>
2. AAC - License is cancelled <br>
3. REV - License is revoked <br>
4. REA - License is revoked and appealed <br>
5. INQ - License is under enquiry <br>

**Independent Variables (predictors):** <br>
* Timeline of the application status <br>
* Type of business <br>
* Location details of the business <br>
* Payment details <br>
_______________________________________________________________________________________________________________________________

In [127]:
df.shape

(85895, 32)

## Data Cleansing


![title](../images/data-cleasing.png)

1. Missing values - several variables have a large share of missing data
2. Unique data - identifier columns such as `id` carry no predictive meaning
3. Data leakage - only AAI rows have no license status change date (those licenses were never revoked or cancelled), so that column reveals the target
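
The first two checks can be sketched on a toy frame. This is only an illustration under assumptions: the helper name `profile_columns` and the example values are hypothetical and not part of the notebook, and any cut-off for "too many" missing values is left to the reader.

```python
import pandas as pd

def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the share of missing values and the share of distinct values per column."""
    return pd.DataFrame({
        "missing_ratio": frame.isnull().mean(),
        "unique_ratio": frame.nunique(dropna=True) / len(frame),
    })

# Hypothetical toy data: an identifier, a mostly empty column, a low-cardinality column.
toy = pd.DataFrame({
    "id": ["a-1", "b-2", "c-3", "d-4"],
    "ward": [25.0, None, None, None],
    "state": ["IL", "IL", "IL", "IN"],
})

profile = profile_columns(toy)
print(profile)
```

A column with a `unique_ratio` of 1.0 behaves like an identifier, and a high `missing_ratio` marks a candidate for dropping.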

In [128]:
df.isnull().sum()

id                                       0
license_id                               0
account_number                           0
site_number                              0
legal_name                               0
doing_business_as_name                   1
address                                  0
city                                     0
state                                    0
zip_code                                31
ward                                 49701
precinct                             56701
ward_precinct                        49700
police_district                      54012
license_code                             0
license_description                      0
license_number                           1
application_type                         0
application_created_date             64660
application_requirements_complete      214
payment_date                          1289
conditional_approval                     0
license_term_start_date                228
license_ter

In [129]:
# Drop columns that are not relevant for the prediction or have too many missing values
drop_col_list = ["id","license_id","ssa","location","application_created_date","account_number","address"]
df = df.drop(drop_col_list, axis=1)

In [130]:
# Flag rows with no status change date: 1 if license_status_change_date is null, else 0
df["license_status_change"] = np.where(df.license_status_change_date.isnull(),1,0)

In [131]:
pd.crosstab(df.license_status_change, df.license_status)

Rows: license_status_change; columns: license_status

| license_status_change | AAC | AAI | INQ | REA | REV |
|---|---|---|---|---|---|
| 0 | 30200 | 0 | 2 | 3 | 290 |
| 1 | 0 | 55400 | 0 | 0 | 0 |
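
The crosstab shows why the status change date leaks the target: the flag is 1 for every AAI row and 0 for every other row. One rough way to quantify this, sketched below, is to compare a baseline that always predicts the most frequent class with a rule that predicts the most frequent class per flag value. The counts are copied from the crosstab; the function name `majority_rule_accuracy` is hypothetical.

```python
import pandas as pd

# Counts copied from the crosstab (rows: flag value, columns: license status).
counts = pd.DataFrame(
    {"AAC": [30200, 0], "AAI": [0, 55400], "INQ": [2, 0], "REA": [3, 0], "REV": [290, 0]},
    index=pd.Index([0, 1], name="license_status_change"),
)

def majority_rule_accuracy(table: pd.DataFrame) -> float:
    """Accuracy of predicting, for each feature value, its most frequent class."""
    return table.max(axis=1).sum() / table.to_numpy().sum()

total = counts.to_numpy().sum()
baseline = counts.sum(axis=0).max() / total   # always predict the largest class
with_flag = majority_rule_accuracy(counts)    # use only the flag

print(f"baseline: {baseline:.4f}, flag only: {with_flag:.4f}")
```

A single feature that lifts a trivial rule this close to perfect accuracy is a strong hint that it encodes the outcome rather than predicting it.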


## Data Transformation
1. Timeline creation
2. Encoding
3. New Feature

In [132]:
strings_cols =  ["application_requirements_complete", 
                 "payment_date", 
                 "license_term_start_date",
                 "license_term_expiration_date",
                 "license_approved_for_issuance",
                 "date_issued"] 

def str_to_date_format(columns: list):
    """Convert string date columns of the global df to pandas datetime, in place.
    e.g., 
    df[columns[i]][k]: '2002-06-28T00:00:00' to Timestamp('2002-06-28 00:00:00')

    Args:
        columns (list): df columns which contain dates in str format
    """
    for col in columns:
        df[col] = pd.to_datetime(df[col])

str_to_date_format(strings_cols)


In [133]:
type(df['license_status'])

pandas.core.series.Series

In [134]:
type(df[['license_status']])

pandas.core.frame.DataFrame

In [135]:
# Number of days between the different application status dates
df["completion_to_start"]   = (df.license_term_start_date - df.application_requirements_complete).dt.days
df["start_to_expiry"]       = (df.license_term_expiration_date - df.license_term_start_date).dt.days
df["approval_to_issuance"]  = (df.date_issued - df.license_approved_for_issuance).dt.days
df["completion_to_payment"] = (df.payment_date - df.application_requirements_complete).dt.days

# 0 if any of the four enquiry-related columns is missing, 1 if all are present
df["presence_of_enquiry_details"] = np.where(df.ward.isnull() | df.ward_precinct.isnull() | df.police_district.isnull() | df.precinct.isnull(), 0, 1)

## Target Encoding

In [136]:
# Label-encode the target: map each license status to an integer code (this is not one-hot encoding)
df["target"] = df[['license_status']].apply(lambda col:pd.Categorical(col).codes)

In [154]:
def target_enconding(df, cols_to_transform):
  """Target-encode the given columns once per target class and return the dataframe.

  The target is one-hot encoded; for each resulting class column, every column in
  cols_to_transform is target-encoded against it and appended as <column>_<class>.

  Args:
      df (pandas dataframe): dataframe to be transformed; must contain a `target` column
      cols_to_transform (list): list of columns to be target-encoded

  Returns:
      df (pandas dataframe): input dataframe with the encoded columns appended
  """
  enc = ce.OneHotEncoder().fit(df.target.astype(str))
  y_onehot = enc.transform(df.target.astype(str))
  class_names = y_onehot.columns
  for class_ in class_names:
    enc = ce.TargetEncoder(smoothing=0)
    temp = enc.fit_transform(df[cols_to_transform], y_onehot[class_])
    temp.columns = [str(x) + "_" + str(class_) for x in temp.columns]
    df = pd.concat([df, temp], axis = 1)
  return df

cols_to_transform = ["license_description", "state", "city"]

df = target_enconding(df, cols_to_transform)
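
To make the idea behind per-class target encoding concrete, here is a plain-pandas sketch on toy data. It is an approximation of the concept, not the exact computation of `ce.TargetEncoder` (whose smoothing and prior handling may differ); the helper `per_class_target_means`, the toy values, and the resulting column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and the class label.
toy = pd.DataFrame({
    "city":   ["CHICAGO", "CHICAGO", "CHICAGO", "LEMONT", "LEMONT"],
    "status": ["AAI", "AAC", "AAI", "AAI", "AAI"],
})

def per_class_target_means(frame, col, target):
    """For each class, map every category of `col` to the share of its rows in that class."""
    indicators = pd.get_dummies(frame[target]).astype(float)   # one 0/1 column per class
    means = indicators.groupby(frame[col]).transform("mean")   # unsmoothed category means
    means.columns = [f"{col}_{cls}" for cls in means.columns]
    return means

encoded = per_class_target_means(toy, "city", "status")
print(pd.concat([toy, encoded], axis=1))
```

Each category is replaced by one number per class, so a high-cardinality column such as `city` adds only as many columns as there are classes.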



## Imbalanced Classification

As the class shares below show, the five classes are far from equally represented. <br> The distribution is heavily skewed: two classes account for more than 99% of the rows. **Resampling is required.**
<br> A machine learning algorithm is only as good as its data, and a classifier trained on imbalanced data tends to favor the majority classes and perform poorly on the rare ones.

In [156]:
df.license_status.value_counts(normalize=True).mul(100).round(3).astype(str) + "%"

AAI    64.497%
AAC    35.159%
REV     0.338%
REA     0.003%
INQ     0.002%
Name: license_status, dtype: object

## Over/Under Sampling

In [157]:
# Recall the integer codes of the target classes
np.sort(df.target.unique()).tolist()

[0, 1, 2, 3, 4]

In [158]:
# Undersample the two majority classes (0 = AAC, 1 = AAI) to 30% of their rows
df_0 = df[df.target==0].sample(frac=0.3,replace=False)
df_1 = df[df.target==1].sample(frac=0.3,replace=False)

# Oversample the three minority classes (2 = INQ, 3 = REA, 4 = REV) with replacement
df_2 = df[df.target==2].sample(frac=200,replace=True)
df_3 = df[df.target==3].sample(frac=100,replace=True)
df_4 = df[df.target==4].sample(frac=2,replace=True)

sampled_df = pd.concat([df_0,df_1,df_2,df_3,df_4])

sampled_df.target.value_counts()
print(sampled_df.shape)

(26960, 62)
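
The sampling fractions above are chosen by hand for each class. As a possible alternative, the sketch below resamples every class to one common size and fixes the random seed so the result is reproducible. The helper `resample_to_size`, its parameters, and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def resample_to_size(frame, target_col, n_per_class, seed=0):
    """Draw n_per_class rows from every class: without replacement when the class
    is large enough (undersampling), with replacement otherwise (oversampling)."""
    parts = []
    for _, group in frame.groupby(target_col):
        parts.append(group.sample(n=n_per_class,
                                  replace=len(group) < n_per_class,
                                  random_state=seed))
    return pd.concat(parts)

# Hypothetical toy data with one rare class.
toy = pd.DataFrame({"target": [0] * 50 + [1] * 30 + [2] * 2, "x": range(82)})

balanced = resample_to_size(toy, "target", n_per_class=10)
print(balanced.target.value_counts().sort_index())
```

Oversampled rows are exact duplicates, so resampling is usually applied to the training split only; otherwise copies of the same row can end up in both the training and the test data.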
