Data processing notebook for the data from the CFPB

# Data

1) Consumer Financial Protection Bureau (CFPB) - https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data  
(raw) 5.03 GB CSV file (1.05 GB compressed), so the size will likely need to be reduced significantly  
(raw) 7,848,317 records  
(processed) 1.06 GB CSV file  
(processed) 7,638,765 records  
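
Since the raw file is about 5 GB, one option for keeping memory in check while loading is to read the low-cardinality columns as `category`. The snippet below is a minimal sketch on a tiny in-memory stand-in for `complaints.csv`; the rows are made up, and the choice of which columns to treat as categorical is an assumption, not something settled in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for data/complaints.csv (two made-up rows, four of the real columns).
sample_csv = io.StringIO(
    "Date received,Product,State,ZIP code\n"
    "2025-01-29,Mortgage,NY,12543\n"
    "2025-01-30,Mortgage,NV,01234\n"
)

# Assumption: Product and State repeat a small set of values, so 'category' is
# cheaper than 'object'. ZIP code is read as text so leading zeros are preserved.
dtypes = {"Product": "category", "State": "category", "ZIP code": str}

sample = pd.read_csv(sample_csv, dtype=dtypes)
print(sample.dtypes)
print(sample.memory_usage(deep=True))
```

On the real file the same `dtype` argument would be passed to `pd.read_csv('data/complaints.csv', ...)`; how much memory it saves depends on how many distinct values each column holds.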

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('data/complaints.csv')


In [4]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2025-01-29,Credit reporting or other personal consumer re...,Credit reporting,Incorrect information on your report,Account status incorrect,,,Experian Information Solutions Inc.,NY,12543,,,Web,2025-01-29,In progress,Yes,,11825440
1,2025-01-30,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,Experian Information Solutions Inc.,NV,89148,,,Web,2025-01-30,In progress,Yes,,11844247
2,2025-01-29,"Payday loan, title loan, personal loan, or adv...",Installment loan,Problem when making payments,,,,TD BANK US HOLDING COMPANY,FL,33055,,,Phone,2025-02-04,In progress,Yes,,11824206
3,2025-01-27,Credit reporting or other personal consumer re...,Credit reporting,Problem with a company's investigation into an...,Their investigation did not fix an error on yo...,,,Resurgent Capital Services L.P.,NC,27455,,,Postal mail,2025-01-27,In progress,Yes,,11799137
4,2025-01-28,Credit reporting or other personal consumer re...,Credit reporting,Improper use of your report,Reporting company used your report improperly,,,Experian Information Solutions Inc.,IL,60609,,,Web,2025-01-28,In progress,Yes,,11810502


In [13]:
len(df)

7848317

In [5]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [12]:
df['Product'].value_counts(dropna=False)

Product
Credit reporting or other personal consumer reports                             3255674
Credit reporting, credit repair services, or other personal consumer reports    2163864
Debt collection                                                                  711906
Mortgage                                                                         414704
Checking or savings account                                                      269384
Credit card or prepaid card                                                      206372
Credit card                                                                      197794
Credit reporting                                                                 140429
Money transfer, virtual currency, or money service                               131002
Student loan                                                                     100101
Bank account or service                                                           86205
Vehicle loan or lease   

In [11]:
df['Consumer disputed?'].value_counts(dropna=False)

Consumer disputed?
NaN    7080011
No      619928
Yes     148378
Name: count, dtype: int64

In [9]:
df['Complaint ID'].value_counts()

Complaint ID
1110751     1
11825440    1
11844247    1
11824206    1
11799137    1
           ..
2918917     1
8383712     1
11636354    1
11482617    1
11482671    1
Name: count, Length: 7848317, dtype: int64
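
The `Length` in the output above equals the row count, so every `Complaint ID` appears exactly once. A shorter way to confirm the same thing is `Series.is_unique`; the sketch below uses three made-up-sized samples of IDs rather than the full column.

```python
import pandas as pd

# A few IDs standing in for df['Complaint ID'].
ids = pd.Series([11825440, 11844247, 11824206], name="Complaint ID")

# True when no value repeats; equivalent to comparing nunique() with len().
print(ids.is_unique)

# Number of rows that repeat an earlier ID (0 when the column is unique).
print(ids.duplicated().sum())
```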

### Add year and month columns

No idea if I'll need either (maybe for dynamic slicing?), but just in case.

In [14]:
df['Date received'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 7848317 entries, 0 to 7848316
Series name: Date received
Non-Null Count    Dtype 
--------------    ----- 
7848317 non-null  object
dtypes: object(1)
memory usage: 59.9+ MB


In [3]:
# dates are formatted like 2025-01-28
pd.to_datetime(df['Date received']).dt.strftime("%b").value_counts()

Date received
Jan    937041
Dec    722838
Oct    701879
Nov    672005
Aug    667081
Sep    653543
Jul    617997
Feb    613440
May    579762
Jun    575696
Apr    558695
Mar    548340
Name: count, dtype: int64

In [4]:
df['Month received'] = pd.to_datetime(df['Date received']).dt.strftime("%b")

In [5]:
df['Month received'].value_counts()

Month received
Jan    937041
Dec    722838
Oct    701879
Nov    672005
Aug    667081
Sep    653543
Jul    617997
Feb    613440
May    579762
Jun    575696
Apr    558695
Mar    548340
Name: count, dtype: int64
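
`value_counts()` orders by count, and a plain sort on month abbreviations would be alphabetical. If calendar order is wanted later (for example on a dashboard axis), one option is an ordered categorical. This is a sketch on made-up dates, not a step this notebook performs.

```python
import pandas as pd

# Made-up dates standing in for df['Date received'].
dates = pd.Series(["2025-01-29", "2024-03-02", "2024-12-15", "2025-01-30"])

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Abbreviated month names stored as an ordered categorical, so sorting follows the calendar.
months = pd.Categorical(
    pd.to_datetime(dates).dt.strftime("%b"),
    categories=month_order,
    ordered=True,
)

# sort_index() on a categorical index uses the category order, not the alphabet.
counts = pd.Series(months).value_counts().sort_index()
print(counts[counts > 0])
```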

In [6]:
pd.to_datetime(df['Date received']).dt.year.value_counts()

Date received
2024    2734508
2023    1292139
2022     800349
2025     607722
2021     495993
2020     444287
2019     277294
2018     257203
2017     242850
2016     191412
2015     168433
2014     153004
2013     108215
2012      72372
2011       2536
Name: count, dtype: int64

In [7]:
df['Year received'] = pd.to_datetime(df['Date received']).dt.year

In [8]:
df['Year received'].value_counts()

Year received
2024    2734508
2023    1292139
2022     800349
2025     607722
2021     495993
2020     444287
2019     277294
2018     257203
2017     242850
2016     191412
2015     168433
2014     153004
2013     108215
2012      72372
2011       2536
Name: count, dtype: int64

In [9]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID',
       'Month received', 'Year received'],
      dtype='object')

In [10]:
df = df[['Date received', 'Month received', 'Year received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID']]

In [11]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative',
       'Company public response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

### Sorting through and standardizing product/sub-product/issue/sub-issue

From my quick look earlier, there appear to be multiple different names for the same categories (at least for `Product`, and likely for the other three columns here). So let's check for nulls, then standardize the categories a bit.

In [88]:
df['Product'].value_counts(dropna=False)

Product
Credit reporting or other personal consumer reports                             3255674
Credit reporting, credit repair services, or other personal consumer reports    2163864
Debt collection                                                                  711906
Mortgage                                                                         414704
Checking or savings account                                                      269384
Credit card or prepaid card                                                      206372
Credit card                                                                      197794
Credit reporting                                                                 140429
Money transfer, virtual currency, or money service                               131002
Student loan                                                                     100101
Bank account or service                                                           86205
Vehicle loan or lease   

In [12]:
product_categories = df['Product'].value_counts(dropna=False).index.tolist()
product_categories = sorted(product_categories)
product_categories

['Bank account or service',
 'Checking or savings account',
 'Consumer Loan',
 'Credit card',
 'Credit card or prepaid card',
 'Credit reporting',
 'Credit reporting or other personal consumer reports',
 'Credit reporting, credit repair services, or other personal consumer reports',
 'Debt collection',
 'Debt or credit management',
 'Money transfer, virtual currency, or money service',
 'Money transfers',
 'Mortgage',
 'Other financial service',
 'Payday loan',
 'Payday loan, title loan, or personal loan',
 'Payday loan, title loan, personal loan, or advance loan',
 'Prepaid card',
 'Student loan',
 'Vehicle loan or lease',
 'Virtual currency']

Here are the groupings and the category name each one will be mapped to:

['Bank account or service', 'Checking or savings account',] -> 'Bank accounts'  
['Consumer Loan',] -> 'Consumer loans'  
['Credit card', 'Credit card or prepaid card', 'Prepaid card',] -> 'Credit cards'  
['Credit reporting', 'Credit reporting or other personal consumer reports', 'Credit reporting, credit repair services, or other personal consumer reports',] -> 'Credit reporting'  
['Debt collection', 'Debt or credit management',] -> 'Debt collection'  
['Money transfer, virtual currency, or money service', 'Money transfers', 'Virtual currency'] -> 'Money services'  
['Mortgage',] -> 'Mortgages'  
['Other financial service',] -> 'Other'  
['Payday loan', 'Payday loan, title loan, or personal loan', 'Payday loan, title loan, personal loan, or advance loan',] -> 'Short-term loans'  
['Student loan',] -> 'Student loans'  
['Vehicle loan or lease',] -> 'Vehicle loans'   

In [13]:
def product_grouping_and_renaming(s):
    if s in ['Bank account or service', 'Checking or savings account',]:
        return 'Bank accounts' 
    elif s in ['Consumer Loan',]:
        return 'Consumer loans'
    elif s in ['Credit card', 'Credit card or prepaid card', 'Prepaid card',]:
        return 'Credit cards'
    elif s in ['Credit reporting', 'Credit reporting or other personal consumer reports', 'Credit reporting, credit repair services, or other personal consumer reports',]:
        return 'Credit reporting'
    elif s in ['Debt collection', 'Debt or credit management',]:
        return 'Debt collection'
    elif s in ['Money transfer, virtual currency, or money service', 'Money transfers', 'Virtual currency']:
        return 'Money services'
    elif s in ['Mortgage',]:
        return 'Mortgages'
    elif s in ['Payday loan', 'Payday loan, title loan, or personal loan', 'Payday loan, title loan, personal loan, or advance loan',]:
        return 'Short-term loans'
    elif s in ['Student loan',]:
        return 'Student loans'
    elif s in ['Vehicle loan or lease',]:
        return 'Vehicle loans'
    else: # including ['Other financial service',]
        return 'Other'
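
The `if`/`elif` chain above can also be expressed as a lookup table, which keeps the grouping in one data structure. A minimal sketch, abbreviated to three of the groupings listed earlier; `PRODUCT_GROUPS`, `raw_to_group`, and the sample values are illustrative names, and anything not in the table falls back to 'Other', as in `product_grouping_and_renaming`.

```python
import pandas as pd

# Abbreviated lookup: new category -> raw Product values (a subset of the groupings).
PRODUCT_GROUPS = {
    "Bank accounts": ["Bank account or service", "Checking or savings account"],
    "Credit cards": ["Credit card", "Credit card or prepaid card", "Prepaid card"],
    "Mortgages": ["Mortgage"],
}

# Invert to raw value -> new category so Series.map can use it directly.
raw_to_group = {raw: group for group, raws in PRODUCT_GROUPS.items() for raw in raws}

products = pd.Series(["Mortgage", "Prepaid card", "Other financial service", None])

# Values missing from the lookup (including nulls) map to NaN, then become 'Other'.
grouped = products.map(raw_to_group).fillna("Other")
print(grouped.tolist())
```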

In [14]:
df['Product'] = df['Product'].apply(product_grouping_and_renaming)
# this reduced the size from 5.08gb to 

In [15]:
df['Product'].value_counts()

Product
Credit reporting    5559967
Debt collection      715615
Credit cards         417439
Mortgages            414704
Bank accounts        355589
Money services       136374
Student loans        100101
Vehicle loans         66898
Short-term loans      48998
Consumer loans        31574
Other                  1058
Name: count, dtype: int64

For a more granular look at the data, the `Sub-product`, `Issue`, and `Sub-issue` columns may be worth exploring. For the purposes of this dashboard, I may drop most of them.

In [52]:
df['Sub-product'].value_counts(dropna=False)

Sub-product
Credit reporting                              5382512
Checking account                               275517
General-purpose credit card or charge card     256039
NaN                                            235295
I do not know                                  204425
                                               ...   
Student prepaid card                               43
Student loan debt relief                           41
Transit card                                       37
Tax refund anticipation loan or check              22
Electronic Benefit Transfer / EBT card             12
Name: count, Length: 87, dtype: int64

In [54]:
print(df['Sub-product'].value_counts(dropna=False).index.tolist())

['Credit reporting', 'Checking account', 'General-purpose credit card or charge card', nan, 'I do not know', 'Other debt', 'Credit card debt', 'Conventional home mortgage', 'Other mortgage', 'Conventional fixed mortgage', 'Domestic (US) money transfer', 'Medical debt', 'Federal student loan servicing', 'Loan', 'FHA mortgage', 'Other (i.e. phone, health club, etc.)', 'Store credit card', 'Mobile or digital wallet', 'Other personal consumer report', 'Installment loan', 'Credit card', 'Savings account', 'Other banking product or service', 'Conventional adjustable mortgage (ARM)', 'Non-federal student loan', 'Auto debt', 'Medical', 'Private student loan', 'VA mortgage', 'Other bank product/service', 'Vehicle loan', 'Payday loan', 'Other type of mortgage', 'General-purpose prepaid card', 'Virtual currency', 'International money transfer', 'Telecommunications debt', 'Payday loan debt', 'Personal line of credit', 'Home equity loan or line of credit', 'Home equity loan or line of credit (HELOC

In [55]:
df['Issue'].value_counts(dropna=False)

Issue
Incorrect information on your report                                                2684928
Improper use of your report                                                         1456795
Problem with a company's investigation into an existing problem                      663010
Problem with a credit reporting company's investigation into an existing problem     589336
Attempts to collect debt not owed                                                    275937
                                                                                     ...   
Lost or stolen refund                                                                     7
Lender sold the property                                                                  7
Property was damaged or destroyed property                                                7
NaN                                                                                       6
Lender damaged or destroyed property                                      

In [57]:
df['Sub-issue'].value_counts(dropna=False)

Sub-issue
Information belongs to someone else                                                 1715049
Reporting company used your report improperly                                        961150
NaN                                                                                  815662
Their investigation did not fix an error on your report                              722975
Credit inquiries on your report that you don't recognize                             485642
                                                                                     ...   
Insurance terms                                                                           6
Problem with a credit reporting company's investigation into an existing problem          5
Improper use of your report                                                               1
Problem with fraud alerts or security freezes                                             1
Credit monitoring or identity theft protection services               

In [51]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

### Dealing with the large text columns  

Both `Consumer complaint narrative` and `Company public response` will be compressed in some way (as I explore, more columns may need to be treated like these).  

`Consumer complaint narrative` -> into a column 'Has consumer complaint narrative': has text -> True, else -> False  
`Company public response` -> into 5 categories: 'No Public Response', 'Company Acted Appropriately', 'Dispute or Unverifiable Facts', 'External or Isolated Issue', 'Opportunity for Improvement'  



In [87]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative',
       'Company public response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [21]:
df['Consumer complaint narrative'].isna().sum()

np.int64(5332868)

In [22]:
len(df)

7848317

In [23]:
current = 0
col_to_examine = 'Consumer complaint narrative'

In [25]:
count = 0
count_end = 5  # number of narratives to print per run
for i in range(current, len(df)):  # resume where the previous run stopped
    if pd.isna(df.loc[i, col_to_examine]):
        continue
    print(i, df.loc[i, col_to_examine])
    print('=================================')
    count += 1
    if count >= count_end:
        current = i + 1  # start after the last printed row on the next run
        break

20 XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX De XXXX ] [ XXXX ] [ XXXX ] Consumer Financial Protection Bureau 1700 G Street NW XXXX XXXX XXXX  Subject : Formal Dispute of Inaccurate Credit Reporting Request for Deletion Under FCRA Dear CFPB, I am writing to dispute inaccurate information reported by TransUnion, XXXX, and XXXX regarding the account XXXX Under the Fair Credit Reporting Act ( FCRA ), I am entitled to accurate, complete, and consistent information on my credit reports. After reviewing my reports, I found several errors : Inconsistent " High Balance '' : TransUnion : {$520.00}, XXXX : {$0.00}, XXXX : {$320.00}. 

Violates 623 ( a ) ( 1 ) ( A ). 

Conflicting " Date of Last Activity '' ( DLA ) : TransUnion : XX/XX/XXXX, XXXX : XX/XX/XXXX, XXXX : XX/XX/XXXX. 

Violates 611 ( a ) ( 1 ) ( A ). 

Omission of " Closed Date '' : TransUnion lists XX/XX/XXXX, but XXXX and XXXX do not report this information.

Violates 609 ( a ) ( 1 ).

Discrepancy in Creditor Remar

In [35]:
def has_text(s):
    if pd.isnull(s):
        return False
    if len(s) == 0:
        return False
    return True

In [36]:
df['Has consumer complaint narrative'] = df['Consumer complaint narrative'].apply(has_text)
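
`apply` calls `has_text` once per row; on roughly 7.8 million rows a vectorized expression may be faster. A sketch of an equivalent check on made-up values (the variable names are illustrative):

```python
import pandas as pd

# Made-up narratives: real text, a missing value, and an empty string.
narratives = pd.Series(["I am writing to dispute...", None, ""], dtype="object")

# Missing values become '' first, so the flag is True only for non-empty text.
has_narrative = narratives.fillna("").str.len() > 0
print(has_narrative.tolist())
```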

In [40]:
df.drop('Consumer complaint narrative', axis=1, inplace=True)
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company public response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [19]:
df['Company public response'].value_counts(dropna=False)

Company public response
NaN                                                                                                                        3933594
Company has responded to the consumer and the CFPB and chooses not to provide a public response                            3629886
Company believes it acted appropriately as authorized by contract or law                                                    175548
Company chooses not to provide a public response                                                                             52473
Company believes the complaint is the result of a misunderstanding                                                           13637
Company disputes the facts presented in the complaint                                                                        12898
Company believes complaint caused principally by actions of third party outside the control or direction of the company       8200
Company believes complaint is the result of an isolated err

In [46]:
df['Company public response'].value_counts(dropna=False).index.values.tolist()

[nan,
 'Company has responded to the consumer and the CFPB and chooses not to provide a public response',
 'Company believes it acted appropriately as authorized by contract or law',
 'Company chooses not to provide a public response',
 'Company believes the complaint is the result of a misunderstanding',
 'Company disputes the facts presented in the complaint',
 'Company believes complaint caused principally by actions of third party outside the control or direction of the company',
 'Company believes complaint is the result of an isolated error',
 'Company believes complaint represents an opportunity for improvement to better serve consumers',
 "Company can't verify or dispute the facts in the complaint",
 "Company believes the complaint provided an opportunity to answer consumer's questions",
 'Company believes complaint relates to a discontinued policy or procedure']

Grouping the values from the dataset (listed above) into 5 categories:  

No Public Response  
[nan, 'Company has responded to the consumer and the CFPB and chooses not to provide a public response', 'Company chooses not to provide a public response']  

Company Acted Appropriately  
['Company believes it acted appropriately as authorized by contract or law', 'Company believes the complaint is the result of a misunderstanding']  

Dispute or Unverifiable Facts  
['Company disputes the facts presented in the complaint', "Company can't verify or dispute the facts in the complaint"]  

External or Isolated Issue  
['Company believes complaint caused principally by actions of third party outside the control or direction of the company', 'Company believes complaint is the result of an isolated error', 'Company believes complaint relates to a discontinued policy or procedure']  

Opportunity for Improvement  
['Company believes complaint represents an opportunity for improvement to better serve consumers', "Company believes the complaint provided an opportunity to answer consumer's questions"]  



In [47]:
def company_response_categories(s):
    if pd.isnull(s) or s in ['Company has responded to the consumer and the CFPB and chooses not to provide a public response', 
                             'Company chooses not to provide a public response', ]:
        return 'No Public Response'
    elif s in ['Company believes it acted appropriately as authorized by contract or law', 
               'Company believes the complaint is the result of a misunderstanding', ]:
        return 'Company Acted Appropriately'
    elif s in ['Company disputes the facts presented in the complaint', 
               "Company can't verify or dispute the facts in the complaint", ]:
        return 'Dispute or Unverifiable Facts'
    elif s in ['Company believes complaint caused principally by actions of third party outside the control or direction of the company', 
               'Company believes complaint is the result of an isolated error', 
               'Company believes complaint relates to a discontinued policy or procedure']:
        return 'External or Isolated Issue'
    else: # s in ['Company believes complaint represents an opportunity for improvement to better serve consumers', 
          # "Company believes the complaint provided an opportunity to answer consumer's questions", ]
        return 'Opportunity for Improvement'
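
Because the final `else` catches everything, any response string not listed above (for example, one added in a later data release) would be counted as 'Opportunity for Improvement' without notice. A hedged sketch of a guard that lists unmapped values before the function is applied; `KNOWN_RESPONSES` is a hypothetical name, abbreviated here to three of the strings from the dataset.

```python
import pandas as pd

# Abbreviated set of strings the mapping function knows about (assumption: the full
# set would contain all eleven non-null values listed above).
KNOWN_RESPONSES = {
    "Company chooses not to provide a public response",
    "Company disputes the facts presented in the complaint",
    "Company believes complaint is the result of an isolated error",
}

responses = pd.Series([
    "Company disputes the facts presented in the complaint",
    None,
    "Some new response wording",  # made-up value to trigger the check
])

# Non-null values that the mapping would silently send to the else branch.
unmapped = sorted(set(responses.dropna()) - KNOWN_RESPONSES)
print(unmapped)
```

An empty list means every observed value is handled explicitly; anything else is worth a look before trusting the category counts.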


In [48]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company public response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [49]:
df['Company public response'] = df['Company public response'].apply(company_response_categories)
df.rename(columns={'Company public response': 'Company response'}, inplace=True)

In [50]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Tags',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

`Company` seems like a pretty clean column so I'll leave it as is and use it as an interactive filter

In [62]:
df['Company'].value_counts().index.tolist()[:20]

['EQUIFAX, INC.',
 'TRANSUNION INTERMEDIATE HOLDINGS, INC.',
 'Experian Information Solutions Inc.',
 'BANK OF AMERICA, NATIONAL ASSOCIATION',
 'WELLS FARGO & COMPANY',
 'JPMORGAN CHASE & CO.',
 'CAPITAL ONE FINANCIAL CORPORATION',
 'CITIBANK, N.A.',
 'SYNCHRONY FINANCIAL',
 'Block, Inc.',
 'Navient Solutions, LLC.',
 'AMERICAN EXPRESS COMPANY',
 'U.S. BANCORP',
 'Onity Group Inc.',
 'PORTFOLIO RECOVERY ASSOCIATES INC',
 'DISCOVER BANK',
 'ENCORE CAPITAL GROUP INC.',
 'Resurgent Capital Services L.P.',
 'NAVY FEDERAL CREDIT UNION',
 'Bread Financial Holdings, Inc.']

In [69]:
[x for x in df['Company'].value_counts().index.tolist() if 'chase' in x.lower()]

['JPMORGAN CHASE & CO.']

In [70]:
df['Company'].nunique()

7539

Geographic features (`State` and `ZIP code`): I'll need to add a geospatial file of US ZIP codes to map these in Tableau, ideally using the ZIP boundaries themselves, otherwise a centroid for each ZIP code.

Apologies to any members of the Armed Forces abroad and to citizens living in the territories, but for the sake of simplification I'll remove any entry not in the 50 states + DC (partially for ZIP code concerns as well).
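
One way the ZIP-to-map step could look is a left join against a lookup table of ZIP centroids. The sketch below is only illustrative: `zip_centroids`, its column names, and the coordinates are made up, and the real geospatial file may be structured differently. ZIP codes are kept as strings so leading zeros survive.

```python
import pandas as pd

complaints = pd.DataFrame({"ZIP code": ["12543", "01696", "99999"]})

# Hypothetical centroid lookup; coordinates are placeholders, not real centroids.
zip_centroids = pd.DataFrame({
    "ZIP code": ["12543", "01696"],
    "lat": [41.5, 42.4],
    "lon": [-74.2, -71.7],
})

# Left join keeps every complaint; ZIPs missing from the lookup get NaN coordinates.
# validate= raises if the lookup unexpectedly contains duplicate ZIP codes.
located = complaints.merge(
    zip_centroids, on="ZIP code", how="left", validate="many_to_one"
)
print(located)
```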

In [72]:
df['State'].value_counts(dropna=False)

State
FL    1005190
TX     912451
CA     846771
GA     557229
NY     508269
       ...   
AA         96
AS         71
MP         68
MH         33
PW         13
Name: count, Length: 64, dtype: int64

In [81]:
states = df['State'].value_counts().index.tolist()

In [79]:
# from this github gist : https://gist.github.com/JeffPaine/3083347
states_dc = ["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "IA",
    "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO",
    "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
    "WV", "WY",
    "DC",]

In [80]:
len(states_dc)

51

In [82]:
for state in states:
    if state not in states_dc:
        print(state, end=' ')

PR VI AE AP UNITED STATES MINOR OUTLYING ISLANDS GU FM AA AS MP MH PW 

In [83]:
len(df)

7848317

In [85]:
df = df[df['State'].isin(states_dc)]

In [87]:
len(df)

7767184

In [90]:
df['ZIP code'].nunique()

34105

In [91]:
df['ZIP code'].value_counts(dropna=False)

ZIP code
XXXXX    128419
30349     15125
33025     12152
19143     11593
77449     10631
          ...  
35575         1
43626         1
61617         1
96858         1
01696         1
Name: count, Length: 34106, dtype: int64

In [95]:
df = df[df['ZIP code'] != 'XXXXX']
len(df)

7638765
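
Note that the two filters treat missing values differently: `isin` drops rows where `State` is NaN, while `!= 'XXXXX'` keeps rows where `ZIP code` is NaN. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame({
    "State": ["NY", "PR", np.nan, "TX"],
    "ZIP code": ["12543", "00901", "30349", np.nan],
})

# isin() is False for NaN, so the row with a missing State is removed along with 'PR'.
kept_states = rows[rows["State"].isin(["NY", "TX"])]

# NaN != 'XXXXX' evaluates to True, so the row with a missing ZIP code is kept.
kept_zips = kept_states[kept_states["ZIP code"] != "XXXXX"]
print(kept_zips)
```

If rows without a ZIP code should also go, adding `.dropna(subset=['ZIP code'])` would be one way to do it.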

`Tags` is interesting data, but I don't think I'll use it, so I'll drop it for now

In [96]:
df['Tags'].value_counts()

Tags
Servicemember                    339493
Older American                   167801
Older American, Servicemember     41873
Name: count, dtype: int64

In [98]:
df.drop('Tags', axis=1, inplace=True)
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code',
       'Consumer consent provided?', 'Submitted via', 'Date sent to company',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

Likewise with consumer consent (`Consumer consent provided?`) information

In [99]:
df['Consumer consent provided?'].value_counts()

Consumer consent provided?
Consent not provided    3558645
Consent provided        2386493
Other                    299573
Consent withdrawn         11754
Name: count, dtype: int64

In [100]:
df.drop('Consumer consent provided?', axis=1, inplace=True)
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Submitted via',
       'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

Probably won't use `Submitted via` but will keep for now

In [102]:
df['Submitted via'].value_counts()

Submitted via
Web             7091239
Referral         249593
Phone            173813
Postal mail       97852
Fax               24674
Web Referral       1239
Email               355
Name: count, dtype: int64

`Date sent to company` doesn't seem too interesting as a column, since most complaints appear to be sent to the company the same day they are received, so I'll remove it.

In [109]:
pd.to_datetime(df['Date sent to company']) - pd.to_datetime(df['Date received'])

0          0 days
1          0 days
2          6 days
3          0 days
4          0 days
            ...  
7848312    0 days
7848313    0 days
7848314    0 days
7848315    1 days
7848316   15 days
Length: 7638765, dtype: timedelta64[ns]

In [106]:
(pd.to_datetime(df['Date sent to company']) - pd.to_datetime(df['Date received'])).mean()

Timedelta('0 days 20:48:18.968092355')
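
The mean alone can hide a long tail. A sketch of how the share of same-day forwards and the median delay could be checked, shown here on made-up dates:

```python
import pandas as pd

sample = pd.DataFrame({
    "Date received": ["2025-01-29", "2025-01-29", "2025-01-27", "2025-01-01"],
    "Date sent to company": ["2025-01-29", "2025-02-04", "2025-01-27", "2025-01-16"],
})

# Delay in whole days between receipt and forwarding to the company.
delay_days = (
    pd.to_datetime(sample["Date sent to company"])
    - pd.to_datetime(sample["Date received"])
).dt.days

# Fraction of complaints forwarded on the day they were received, and the median delay.
same_day_share = (delay_days == 0).mean()
print(same_day_share, delay_days.median())
```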

In [110]:
df.drop('Date sent to company', axis=1, inplace=True)
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Submitted via',
       'Company response to consumer', 'Timely response?',
       'Consumer disputed?', 'Complaint ID'],
      dtype='object')

In [111]:
df['Company response to consumer'].value_counts(dropna=False)

Company response to consumer
Closed with explanation            4509541
Closed with non-monetary relief    2449017
In progress                         464718
Closed with monetary relief         163937
Closed without relief                17724
Closed                               17210
Untimely response                    11354
Closed with relief                    5244
NaN                                     20
Name: count, dtype: int64

In [112]:
464718/7638765

0.06083679757133516
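
The manual division above gives the share of complaints still in progress (about 6%). As a hedged alternative, `value_counts(normalize=True)` returns the share of every category at once; the series below is a made-up stand-in, not the real column.

```python
import pandas as pd

# Made-up stand-in for 'Company response to consumer' (one missing value).
toy = pd.Series(['Closed with explanation', 'In progress',
                 'Closed with explanation', None])

# normalize=True returns proportions instead of counts;
# dropna=False keeps the missing values in the denominator.
shares = toy.value_counts(normalize=True, dropna=False)
in_progress_share = shares['In progress']
```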

In [115]:
df['Company response to consumer'].unique().tolist()

['In progress',
 'Closed with non-monetary relief',
 'Closed with explanation',
 'Untimely response',
 'Closed without relief',
 'Closed with monetary relief',
 'Closed with relief',
 'Closed',
 nan]

Grouping these up (new category, followed by the original values it absorbs):  
'Closed (relief unknown)': ['Closed', nan]  
'Closed with relief': ['Closed with non-monetary relief', 'Closed with monetary relief', 'Closed with relief']  
'Closed without relief': ['Closed without relief', 'Closed with explanation']  
'Untimely response': ['Untimely response']  
'In progress': ['In progress']  

In [126]:
def company_response_to_consumer_categories(s):
    if pd.isnull(s) or s in ['Closed',]:
        return 'Closed (relief unknown)'
    elif s in ['Closed with non-monetary relief', 
               'Closed with monetary relief', 
               'Closed with relief',]:
        return 'Closed with relief'
    elif s in ['Closed without relief', 
               'Closed with explanation',]:
        return 'Closed without relief'
    elif s in ['Untimely response',]:
        return 'Untimely response'
    else: # ['In progress',]
        return 'In progress'

In [127]:
df['Company response to consumer'] = (
    df['Company response to consumer']
    .apply(company_response_to_consumer_categories)
)

In [128]:
df['Company response to consumer'].value_counts()

Company response to consumer
Closed without relief      4527265
Closed with relief         2618198
In progress                 464718
Closed (relief unknown)      17230
Untimely response            11354
Name: count, dtype: int64
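
The same grouping could also be written as a lookup table and applied with `Series.map`, which avoids a Python-level function call per row. This is a sketch rather than what the notebook ran: `RESPONSE_GROUPS` and `group_responses` are hypothetical names, and unlike the `if`/`elif` function above it raises on an unexpected label instead of silently filing it under 'In progress'.

```python
import numpy as np
import pandas as pd

# Original label -> grouped label (hypothetical constant mirroring the grouping above).
RESPONSE_GROUPS = {
    'Closed': 'Closed (relief unknown)',
    'Closed with non-monetary relief': 'Closed with relief',
    'Closed with monetary relief': 'Closed with relief',
    'Closed with relief': 'Closed with relief',
    'Closed without relief': 'Closed without relief',
    'Closed with explanation': 'Closed without relief',
    'Untimely response': 'Untimely response',
    'In progress': 'In progress',
}

def group_responses(responses: pd.Series) -> pd.Series:
    """Map raw response labels to the grouped categories."""
    # Fail loudly on labels the table does not know about.
    unknown = set(responses.dropna().unique()) - set(RESPONSE_GROUPS)
    if unknown:
        raise ValueError(f'Unmapped response labels: {sorted(unknown)}')
    # Missing values are not in the table, so map() leaves them as NaN;
    # they are then filed under the "relief unknown" bucket.
    return responses.map(RESPONSE_GROUPS).fillna('Closed (relief unknown)')

toy = pd.Series(['In progress', 'Closed with explanation', np.nan, 'Closed'])
grouped = group_responses(toy)
```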

Will leave `Timely response?` as is

In [131]:
df['Timely response?'].value_counts(dropna=False)

Timely response?
Yes    7576280
No       62485
Name: count, dtype: int64

In [133]:
df[df['Company response to consumer']=='Untimely response']['Timely response?'].value_counts()

Timely response?
No    11354
Name: count, dtype: int64

In [134]:
df[df['Timely response?']=='No']['Company response to consumer'].value_counts()

Company response to consumer
Closed without relief      43770
Untimely response          11354
Closed with relief          5658
Closed (relief unknown)     1703
Name: count, dtype: int64
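
The two filtered counts above show that every 'Untimely response' row has `Timely response?` equal to 'No', but not the reverse. A single `pd.crosstab` gives both views in one table; the frame below is a made-up example, not the real data.

```python
import pandas as pd

# Made-up rows standing in for the two columns being compared.
toy = pd.DataFrame({
    'Company response to consumer': ['Untimely response', 'Closed without relief',
                                     'Closed without relief', 'In progress'],
    'Timely response?': ['No', 'No', 'Yes', 'Yes'],
})

# Rows: grouped response; columns: timely flag; cells: record counts (0 where absent).
table = pd.crosstab(toy['Company response to consumer'], toy['Timely response?'])
```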

I will remove `Consumer disputed?` as it sits a few steps into the process (complaint -> response -> dispute) and the vast majority of records are missing a value.

In [135]:
df['Consumer disputed?'].value_counts(dropna=False)

Consumer disputed?
NaN    6886841
No      607164
Yes     144760
Name: count, dtype: int64
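
To put a number on "vast majority": the counts above work out to roughly 90% missing. The sketch below redoes that arithmetic and shows the `isna().mean()` idiom on a made-up series (`toy` is illustrative only).

```python
import numpy as np
import pandas as pd

# Share of missing values, using the counts printed above.
real_share = 6886841 / 7638765  # about 0.90

# Same idea on a small made-up series: isna() gives booleans, mean() their share.
toy = pd.Series(['No', np.nan, np.nan, 'Yes', np.nan], name='Consumer disputed?')
missing_share = toy.isna().mean()
```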

In [136]:
df.drop('Consumer disputed?', axis=1, inplace=True)

In [138]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Submitted via',
       'Company response to consumer', 'Timely response?', 'Complaint ID'],
      dtype='object')

Will also drop `Complaint ID`.

In [140]:
df.drop('Complaint ID', axis=1, inplace=True)

In [141]:
df.columns

Index(['Date received', 'Month received', 'Year received', 'Product',
       'Sub-product', 'Issue', 'Sub-issue', 'Has consumer complaint narrative',
       'Company response', 'Company', 'State', 'ZIP code', 'Submitted via',
       'Company response to consumer', 'Timely response?'],
      dtype='object')
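
The columns in this section were dropped one at a time as each was inspected. When re-running the cleanup, the drops could be collected into one call; a minimal sketch on a one-row made-up frame, where `to_drop` and `slim` are illustrative names:

```python
import pandas as pd

# One-row made-up frame with a kept column and the columns dropped in this section.
toy = pd.DataFrame({
    'Product': ['Credit reporting'],
    'Consumer consent provided?': ['Other'],
    'Date sent to company': ['2025-01-29'],
    'Consumer disputed?': [None],
    'Complaint ID': [11825440],
})

to_drop = ['Consumer consent provided?', 'Date sent to company',
           'Consumer disputed?', 'Complaint ID']

# drop(columns=...) returns a new frame and leaves the original untouched.
slim = toy.drop(columns=to_drop)
```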

In [143]:
df.to_csv('data/cfpb_complaints_processed.csv', index=False)

In [144]:
len(df)

7638765

Here's a smaller subset (just the years 2017 and 2018) to upload to GitHub. The full dataset should be available from the CFPB website listed at the top of this notebook.

In [3]:
df = pd.read_csv('data/cfpb_complaints_processed.csv')

In [4]:
len(df)

7638765

In [19]:
years = [2017, 2018]

In [20]:
len(df[df['Year received'].isin(years)])

464762

In [21]:
df[df['Year received'].isin(years)].to_csv('cfpb_sample.csv', index=False)
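
A quick round-trip check can confirm that the exported subset holds only the intended years. The sketch below writes to an in-memory buffer instead of a file so it runs anywhere; the frame and its values are made up.

```python
import io
import pandas as pd

# Made-up frame with years inside and outside the target range.
toy = pd.DataFrame({
    'Year received': [2016, 2017, 2018, 2019, 2017],
    'Product': ['a', 'b', 'c', 'd', 'e'],
})
years = [2017, 2018]

# Keep only rows whose year is in the list, as in the cell above.
subset = toy[toy['Year received'].isin(years)]

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded data should match the subset in size and contain only the target years.
assert len(reloaded) == len(subset)
assert set(reloaded['Year received']) <= set(years)
```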