# 2. Data Understanding
> In this stage, the study shall seek to understand the attributes of the data.

The Consumer Complaint Database is a collection of complaints about consumer financial products and services that the Consumer Financial Protection Bureau (CFPB) sent to companies for response. Complaints are published after the company responds and confirms a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the database.

## 2.1 Data Description

## 2.2 Load the Data
> In this stage we loaded the data.

In [1]:
# Loading dependencies
import pandas as pd
import numpy as np

In [3]:
# Loading the data
data = pd.read_csv('data/complaints.csv')

# previewing the data
data.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2022-12-04,Checking or savings account,Other banking product or service,Problem with a lender or other company chargin...,Transaction was not authorized,,Company has responded to the consumer and the ...,"BANK OF AMERICA, NATIONAL ASSOCIATION",RI,2832,,Other,Web,2022-12-04,Closed with explanation,Yes,,6277013
1,2022-12-09,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,,,"EQUIFAX, INC.",TX,79928,,,Web,2022-12-09,In progress,Yes,,6300442
2,2022-12-09,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Reporting company used your report improperly,,,"EQUIFAX, INC.",GA,30103,,,Web,2022-12-09,In progress,Yes,,6300444
3,2022-12-09,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Problem with personal statement of dispute,,,"EQUIFAX, INC.",PA,19104,,,Web,2022-12-09,In progress,Yes,,6300449
4,2022-12-04,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Reporting company used your report improperly,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,FL,33167,,Other,Web,2022-12-04,Closed with explanation,Yes,,6274454


In [4]:
# Previewing the tail
data.tail()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
3147565,2017-02-09,Debt collection,I do not know,Cont'd attempts collect debt not owed,Debt resulted from identity theft,I have disputed my debts several times with no...,,Bonneville Billing and Collections,UT,84054,Servicemember,Consent provided,Web,2017-02-09,Closed with explanation,Yes,No,2334969
3147566,2015-04-29,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,My father died in XX/XX/XXXX. Left me his only...,,"CITIBANK, N.A.",OK,74066,,Consent provided,Web,2015-04-29,Closed with explanation,Yes,No,1352738
3147567,2017-03-31,Credit reporting,,Credit reporting company's investigation,No notice of investigation status/result,cfbp i would Like to file a complaint on Exper...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,MN,55379,,Consent provided,Web,2017-03-31,Closed with non-monetary relief,Yes,Yes,2412926
3147568,2017-01-16,Credit reporting,,Incorrect information on credit report,Account status,My husband and I are in the middle of an FHA S...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",GA,30215,,Consent provided,Web,2017-01-16,Closed with explanation,Yes,No,2292586
3147569,2018-03-07,Mortgage,Other type of mortgage,Trouble during payment process,,,Company has responded to the consumer and the ...,WELLS FARGO & COMPANY,CA,91304,,,Referral,2018-03-08,Closed with explanation,Yes,,2837068


## 2.3 Preliminary Data Inspection

In [37]:
## Checking the data summary

# Checking the dimensions of the data
print(f"This data has {data.shape[0]} rows and {data.shape[1]} columns")
print()
# summary
data.info()

This data has 3147570 rows and 18 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3147570 entries, 0 to 3147569
Data columns (total 18 columns):
 #   Column                        Dtype 
---  ------                        ----- 
 0   Date received                 object
 1   Product                       object
 2   Sub-product                   object
 3   Issue                         object
 4   Sub-issue                     object
 5   Consumer complaint narrative  object
 6   Company public response       object
 7   Company                       object
 8   State                         object
 9   ZIP code                      object
 10  Tags                          object
 11  Consumer consent provided?    object
 12  Submitted via                 object
 13  Date sent to company          object
 14  Company response to consumer  object
 15  Timely response?              object
 16  Consumer disputed?            object
 17  Complaint ID                  int64 
dtyp

### 2.3.1 Observations:
* The columns `Date received` and `Date sent to company` are stored as `object` (the pandas dtype used for strings), but they hold dates and should be converted to datetime.
* The data has 3,147,570 rows and 18 columns.
* There is only one numeric column, `Complaint ID`. It does not need to be an integer, because it is a unique identifier of the complaint rather than a quantity.
* All other columns are stored as `object`, which suits their text and categorical values.
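The type changes suggested by these observations could be sketched as follows. This is a minimal illustration on a two-row frame named `sample` (a hypothetical stand-in for `data`, with values taken from the preview), not the conversion actually used later in the study.

```python
import pandas as pd

# Hypothetical two-row stand-in for `data`
sample = pd.DataFrame({
    "Date received": ["2022-12-04", "2022-12-09"],
    "Date sent to company": ["2022-12-04", "2022-12-09"],
    "Complaint ID": [6277013, 6300442],
})

# Parse the two date columns into datetime values
for col in ["Date received", "Date sent to company"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

# Treat the identifier as a label rather than a quantity
sample["Complaint ID"] = sample["Complaint ID"].astype(str)

print(sample.dtypes)
```

Passing an explicit `format` avoids per-row format inference, which can matter on a frame with over three million rows.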

In [24]:
# Creating a helper class for missing values, duplicates and value counts
class understanding:
    """Helper functions for inspecting a DataFrame."""

    @staticmethod
    def miss_no(df):
        """Count the missing values per column."""
        return df.isna().sum()

    @staticmethod
    def percent_missing(df):
        """Return the fraction (0 to 1) of each column that is missing."""
        return df.isna().sum() / len(df)

    @staticmethod
    def check_dup(df):
        """Count the fully duplicated rows in the data."""
        return df.duplicated().sum()

    @staticmethod
    def counts(df, col):
        """Return the value counts of a column."""
        return df[col].value_counts()

    @staticmethod
    def num_unique(df, col):
        """Return the number of unique elements in a column."""
        return df[col].nunique()

    @staticmethod
    def get_unique(df, col):
        """Return the unique values in a column."""
        return df[col].unique()

In [26]:
# missing values
understand = understanding

# The number of missing values per column
understand.miss_no(data)

Date received                         0
Product                               0
Sub-product                      235293
Issue                                 0
Sub-issue                        685381
Consumer complaint narrative    2015226
Company public response         1774590
Company                               0
State                             40190
ZIP code                          40631
Tags                            2792441
Consumer consent provided?       826205
Submitted via                         0
Date sent to company                  0
Company response to consumer          4
Timely response?                      0
Consumer disputed?              2379130
Complaint ID                          0
dtype: int64

In [27]:
# Percentage of missing values per column
understand.percent_missing(data)

Date received                   0.000000
Product                         0.000000
Sub-product                     0.074754
Issue                           0.000000
Sub-issue                       0.217749
Consumer complaint narrative    0.640248
Company public response         0.563797
Company                         0.000000
State                           0.012769
ZIP code                        0.012909
Tags                            0.887174
Consumer consent provided?      0.262490
Submitted via                   0.000000
Date sent to company            0.000000
Company response to consumer    0.000001
Timely response?                0.000000
Consumer disputed?              0.755862
Complaint ID                    0.000000
dtype: float64

Several columns contain missing values. The output above is expressed as a fraction of rows; the columns with more than half of their values missing are:
* `Tags` - 89%.
* `Consumer disputed?` - 76%.
* `Consumer complaint narrative` - 64%.
* `Company public response` - 56%.
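A list like the one above can be produced directly rather than read off the output by eye. The sketch below uses a small made-up frame named `sample` and an assumed cut-off of 50%; both are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for `data`
sample = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Mortgage", "Credit reporting"],
    "Tags": [np.nan, "Servicemember", np.nan, np.nan],
    "State": ["CA", "UT", np.nan, "GA"],
})

threshold = 0.5  # assumed cut-off for "high" missingness

# The mean of a boolean mask is the fraction of missing values per column
missing_share = sample.isna().mean()

# Keep only the columns above the cut-off, most incomplete first
high_missing = missing_share[missing_share > threshold].sort_values(ascending=False)

print((high_missing * 100).round(1))
```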

In [28]:
# The number of duplicated rows in the data
understand.check_dup(data)

0

There are no fully duplicated rows in the data.
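Note that `DataFrame.duplicated()` only flags rows that match in every column. A complementary check, sketched here on a made-up frame named `sample`, is whether the identifier column alone repeats; the rows below are invented to show the difference and do not reflect the real data.

```python
import pandas as pd

# Made-up rows: the last two share an ID but differ in another column
sample = pd.DataFrame({
    "Complaint ID": [6277013, 6300442, 6300442],
    "Product": ["Mortgage", "Credit reporting", "Debt collection"],
})

# Rows identical across every column
full_row_dups = sample.duplicated().sum()

# Rows whose identifier has already appeared
id_dups = sample.duplicated(subset=["Complaint ID"]).sum()

print(full_row_dups, id_dups)
```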

### 2.3.2 Checking the Prospective Target Variables

#### 2.3.2.1 `Consumer disputed?`

In [31]:
# Checking the value counts of the column 'Consumer disputed?'
understand.counts(data, 'Consumer disputed?')

No     620062
Yes    148378
Name: Consumer disputed?, dtype: int64

This is a binary variable.

The classes are imbalanced: of the 768,440 rows with a value, about 81% are `No` and 19% are `Yes`. The remaining 76% of rows have no value in this column at all.
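Because `value_counts()` drops missing values by default, class shares for this column can be read in two ways. The sketch below uses a short made-up series named `disputed` in place of `data['Consumer disputed?']` to show both.

```python
import numpy as np
import pandas as pd

# Made-up column standing in for data['Consumer disputed?']
disputed = pd.Series(["No", "No", "No", "Yes", np.nan, np.nan, np.nan, np.nan])

# Shares among rows that have a value (missing values are dropped by default)
answered = disputed.value_counts(normalize=True)

# Shares over all rows, with missing values counted as their own category
overall = disputed.value_counts(normalize=True, dropna=False)

print(answered)
print(overall)
```

The first view describes the balance a classifier would see after dropping unlabeled rows; the second shows how much of the data would be lost by doing so.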

#### 2.3.2.2 `Timely response?`

In [33]:
# Checking the value counts of the column 'Timely response?'
understand.counts(data, 'Timely response?')

Yes    3097148
No       50422
Name: Timely response?, dtype: int64

This is a binary variable with no missing values.

The classes are heavily imbalanced: about 98.4% of complaints received a timely response and only 1.6% did not.

#### 2.3.2.3 `Company response to consumer`

In [34]:
# Checking the value counts of the column 'Company response to consumer'
understand.counts(data, 'Company response to consumer')

Closed with explanation            2429039
Closed with non-monetary relief     474084
Closed with monetary relief         123817
In progress                          70517
Closed without relief                17868
Closed                               17611
Untimely response                     9326
Closed with relief                    5304
Name: Company response to consumer, dtype: int64

This is a multi-class column with eight classes.

The classes are imbalanced: `Closed with explanation` accounts for about 77% of the rows, while `Closed with relief` accounts for less than 0.2%.

In [35]:
# Checking the value counts of the 'Submitted via'
understand.counts(data, 'Submitted via')

Web             2649097
Referral         237715
Phone            147848
Postal mail       86339
Fax               25660
Web Referral        487
Email               424
Name: Submitted via, dtype: int64

`Submitted via` is a multi-class column with seven classes.

The classes are imbalanced: `Web` accounts for about 84% of the rows, while `Web Referral` and `Email` each have fewer than 500.

# 3. Data Preparation
> In this phase, the data is prepared based on the findings from the data understanding.