<div align="center">
    <img src="images/logo.png" width="250" height="200">
</div>

<div align="center">
    Download the dataset used in this notebook <a href="https://files.consumerfinance.gov/ccdb/complaints.csv.zip" target="_blank">here</a>, provided by the <a href="https://www.consumerfinance.gov/data-research/consumer-complaints/" target="_blank">CFPB</a>.
</div>

# Cleaning data

In [1]:
# import packages
import numpy as np
import pandas as pd

# set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)

## Initial Dataset
The original dataset has about 2 million entries (2,050,540 rows in the copy loaded here). Each entry represents a single complaint.

|**Field name**| Description|
|---|---|
|**Date received**|The date the CFPB received the complaint.|
|**Product**|The type of product the consumer identified in the complaint.|
|**Sub-product**|The type of sub-product the consumer identified in the complaint.|
|**Issue**|The issue the consumer identified in the complaint.|
|**Sub-issue**|The sub-issue the consumer identified in the complaint.|
|**Consumer complaint narrative***|Consumer complaint narrative is the consumer-submitted description of “what happened” from the complaint.|
|**Company public response****|The company’s optional, public-facing response to a consumer’s complaint.|
|**Company**|The complaint is about this company.|
|**State**|The state of the mailing address provided by the consumer.|
|**ZIP code*****|The mailing ZIP code provided by the consumer.|
|**Tags**|Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers.|
|**Consumer consent provided?**|Identifies whether the consumer opted in to publish their complaint narrative.|
|**Submitted via**|How the complaint was submitted to the CFPB.|
|**Date sent to company**|The date the CFPB sent the complaint to the company.|
|**Company response to consumer**|This is how the company responded.|
|**Timely response?**|Whether the company gave a timely response.|
|**Consumer disputed?**|Whether the consumer disputed the company’s response.|
|**Complaint ID**|The unique identification number for a complaint.|

<details>

*Consumers must opt-in to share their narrative. We will not publish the narrative unless the consumer consents, and consumers can opt-out at any time. The CFPB takes reasonable steps to scrub personal information from each complaint that could be used to identify the consumer.    
    
**Companies can choose to select a response from a pre-set list of options that will be posted on the public database. For example, “Company believes complaint is the result of an isolated error.”    
    
***This field may: i) include the first five digits of a ZIP code; ii) include the first three digits of a ZIP code (if the consumer consented to publication of their complaint narrative); or iii) be blank (if ZIP codes have been submitted with non-numeric values, if there are fewer than 20,000 people in a given ZIP code, or if the complaint has an address outside of the United States).
    
</details>
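
The three ZIP code forms described in the note above can be told apart with a small helper. This is a minimal sketch, not part of the original notebook: `classify_zip` is a hypothetical name, and the `XX` suffix for three-digit ZIP codes is an assumption based on values such as `335XX` that appear in the data preview.

```python
import pandas as pd


def classify_zip(value):
    """Label a ZIP code value as 'full', 'masked', 'blank', or 'other' (hypothetical helper)."""
    if pd.isna(value) or str(value).strip() == "":
        return "blank"
    text = str(value).strip()
    if len(text) == 5 and text.isascii() and text.isdigit():
        return "full"  # all five digits published
    if len(text) == 5 and text[:3].isdigit() and text[3:] == "XX":
        return "masked"  # assumed masking style: three digits followed by XX
    return "other"


# Values taken from the data preview, plus one missing entry
sample = pd.Series(["15206", "335XX", None, "275XX"])
labels = sample.map(classify_zip)
print(labels.tolist())
```

Counting the labels on the real column (for example with `value_counts()`) would show how many complaints fall into each form before any rows are filtered out.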


In [3]:
# read the dataset
df = pd.read_csv('complaints.csv')
print("Size: ", df.shape)
df.head(3)

Size:  (2050540, 18)


,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392
1,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
2,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198


In [4]:
# change the index to Complaint ID
df.set_index('Complaint ID', inplace=True)
df.head(3)

Complaint ID,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?
3384392,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,
3379500,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,
3433198,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,


In [5]:
# check column data types
print("Shape: ", df.shape)
df.dtypes

Shape:  (2050540, 17)


Date received                   object
Product                         object
Sub-product                     object
Issue                           object
Sub-issue                       object
Consumer complaint narrative    object
Company public response         object
Company                         object
State                           object
ZIP code                        object
Tags                            object
Consumer consent provided?      object
Submitted via                   object
Date sent to company            object
Company response to consumer    object
Timely response?                object
Consumer disputed?              object
dtype: object
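
Every column loads as `object`. One common reason to convert repetitive text columns such as `State` or `Product` to `category` is memory use. The comparison below is a rough sketch on synthetic data (the repeated state codes are made up for illustration, not drawn from the complaints file), so the exact numbers will differ on the real dataset.

```python
import pandas as pd

# Synthetic column: four state codes repeated to 100,000 rows
states = pd.Series(["FL", "PA", "NC", "CA"] * 25_000)

# deep=True counts the Python string objects, not just the pointers
as_object = states.astype(object).memory_usage(deep=True)
as_category = states.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

A categorical column stores each distinct value once plus a small integer code per row, which is why it tends to be much smaller when values repeat often.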

In [6]:
# convert "Date received" to a datetime type
df['Date received'] = pd.to_datetime(df['Date received'])
# convert "Date sent to company" to a datetime type
df['Date sent to company'] = pd.to_datetime(df['Date sent to company'])
# convert repetitive text columns to the categorical data type
categorical_cols = ['Product', 'Sub-product', 'Issue', 'Sub-issue', 'Company', 'State']
df[categorical_cols] = df[categorical_cols].astype('category')

print("Shape: ", df.shape)
df.dtypes

Shape:  (2050540, 17)


Date received                   datetime64[ns]
Product                               category
Sub-product                           category
Issue                                 category
Sub-issue                             category
Consumer complaint narrative            object
Company public response                 object
Company                               category
State                                 category
ZIP code                                object
Tags                                    object
Consumer consent provided?              object
Submitted via                           object
Date sent to company            datetime64[ns]
Company response to consumer            object
Timely response?                        object
Consumer disputed?                      object
dtype: object
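
As an alternative, the index and most of these types could be requested while reading the file. The sketch below is not what the notebook does: it parses a two-row, six-column excerpt built from the preview above through `io.StringIO` so that it runs on its own. On the real file the same arguments would be passed to `pd.read_csv('complaints.csv', ...)`.

```python
import io

import pandas as pd

# Two rows and a subset of columns copied from the preview above
csv_text = """Date received,Product,State,ZIP code,Date sent to company,Complaint ID
2019-09-24,Debt collection,FL,335XX,2019-09-24,3384392
2019-09-19,Credit reporting,PA,15206,2019-09-20,3379500
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Complaint ID",  # same effect as set_index afterwards
    parse_dates=["Date received", "Date sent to company"],
    dtype={"Product": "category", "State": "category", "ZIP code": "string"},
)
print(sample.dtypes)
```

Declaring types up front avoids holding an all-`object` copy of the data in memory first, which may matter for a file with about 2 million rows.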

In [7]:
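# keep only complete five-digit ZIP codes; masked (e.g. 335XX) or missing values become NaN
df['ZIP code'] = df['ZIP code'].str.extract(r'(^\d{5}$)', expand=False)
# drop rows without a complete ZIP code (copy to avoid chained-assignment warnings later)
df = df[df['ZIP code'].notna()].copy()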

print("Shape: ", df.shape)
df.head()

Shape:  (1057135, 17)


Complaint ID,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?
3379500,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,
3255455,2019-05-23,Checking or savings account,Checking account,Managing an account,Deposits and withdrawals,,Company has responded to the consumer and the ...,MIDFIRST BANK,AZ,85254,,,Referral,2019-05-28,Closed with explanation,Yes,
4267123,2021-04-02,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",PA,19403,,,Web,2021-04-02,Closed with explanation,Yes,
3446074,2019-11-20,Credit card or prepaid card,General-purpose credit card or charge card,Closing your account,Company closed your account,,Company has responded to the consumer and the ...,PENTAGON FEDERAL CREDIT UNION,VA,22304,,,Referral,2019-11-21,Closed with explanation,Yes,
4239229,2021-03-23,"Money transfer, virtual currency, or money ser...",Foreign currency exchange,Fraud or scam,,,,Square Inc.,CA,91387,,,Web,2021-03-23,Closed with explanation,Yes,
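
The ZIP code filter removes a large share of the data. The sketch below works out that share from the two shapes printed in this notebook, then repeats the same regular expression on a small made-up Series to show which kinds of values survive. The toy values are illustrative only.

```python
import pandas as pd

rows_before = 2_050_540  # shape printed after loading
rows_after = 1_057_135   # shape printed after the ZIP code filter

kept_share = rows_after / rows_before
dropped = rows_before - rows_after
print(f"kept {kept_share:.1%} of rows, dropped {dropped:,}")

# Toy values: full ZIP, masked ZIP, missing, too short, full ZIP
zips = pd.Series(["15206", "335XX", None, "1520", "85254"])
full = zips.str.extract(r"(^\d{5}$)", expand=False)
print(full.notna().tolist())
```

Roughly half of the complaints are kept, so any later analysis describes only complaints with a complete five-digit ZIP code.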


In [8]:
# drop the tags column (empty)
df.drop(['Tags'], axis=1, inplace=True)

# convert the remaining columns to the best available dtypes
df = df.convert_dtypes()

print("Shape: ", df.shape)
df.dtypes

Shape:  (1057135, 16)


Date received                   datetime64[ns]
Product                               category
Sub-product                           category
Issue                                 category
Sub-issue                             category
Consumer complaint narrative            string
Company public response                 string
Company                               category
State                                 category
ZIP code                                string
Consumer consent provided?              string
Submitted via                           string
Date sent to company            datetime64[ns]
Company response to consumer            string
Timely response?                        string
Consumer disputed?                      string
dtype: object
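
The effect of `convert_dtypes()` is easier to see on a small frame. This is a minimal sketch with made-up values: an `object` column of text becomes a nullable string column, while a column that is already categorical is left alone, which matches the dtypes listed above.

```python
import pandas as pd

# Toy frame: one text column with a missing value, one categorical column
toy = pd.DataFrame({
    "Timely response?": ["Yes", "No", None],
    "State": pd.Categorical(["FL", "PA", "NC"]),
})

converted = toy.convert_dtypes()
print(converted.dtypes)
print(converted["Timely response?"].isna().tolist())
```

Missing values stay missing after the conversion; they are simply represented as `pd.NA` in the string column.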

In [10]:
# save the dataset (pickle keeps the dtypes, CSV is portable)
df.to_pickle('clean_complaints.pkl')
df.to_csv('clean_complaints.csv')
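
The two output files behave differently when read back. The sketch below round-trips a tiny made-up frame through a temporary directory (the file names `toy.pkl` and `toy.csv` and all values are hypothetical) to show that the pickle preserves dtypes, while a plain `read_csv` infers them again.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a categorical column and a ZIP code that starts with zero
toy = pd.DataFrame(
    {"State": pd.Categorical(["FL", "PA"]), "ZIP code": ["02134", "15206"]},
    index=pd.Index([1, 2], name="Complaint ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path, index_col="Complaint ID")

print(from_pickle.dtypes)
print(from_csv.dtypes)
print(from_csv["ZIP code"].tolist())
```

When loading `clean_complaints.csv` later, passing `dtype={'ZIP code': 'string'}` to `pd.read_csv` would keep ZIP codes as text, so leading zeros are not lost.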