# Exploratory Data Analysis on US Consumer Finance Complaints

## Introduction
This notebook performs exploratory data analysis (EDA) on the US Consumer Finance Complaints dataset. The goal is to understand the nature of the complaints and identify any patterns or trends in the data.

## Import Libraries


In [184]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.cluster
from scripts.utility import set_multiple_columns_datatype

ImportError: cannot import name 'set_multiple_columns_datatype' from 'scripts.utility' (C:\Users\Ari Castillo\Documents\Programas\Data\projects\EDA_US_Consumer_Finance_Complaints\scripts\utility.py)

## Load Data

In [None]:
# Load the dataset
data_path = "../data/consumer_complaints.csv"
df = pd.read_csv(data_path, dtype={'column_5': str, 'column_11': str}, low_memory=False)
df = df.drop(columns=['complaint_id'])

# Sample 5% of the rows
df = df.sample(frac=0.05)

# Display the first few rows of the dataframe
df.head()
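
Because `sample` draws a different random subset on every run, the row counts and indices in the outputs below will change between executions. One option, sketched here on a small made-up frame rather than the real dataset, is to pass a fixed `random_state`; the seed value 42 is an arbitrary choice, not something the original notebook uses.

```python
import pandas as pd

# Made-up stand-in for the complaints data, only to illustrate the call.
full = pd.DataFrame({"product": ["Mortgage", "Credit card"] * 50})

# With the same seed, the same rows are drawn every time.
sample_a = full.sample(frac=0.05, random_state=42)
sample_b = full.sample(frac=0.05, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```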

In [185]:
# Inspect the dtype and non-null count of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27798 entries, 24977 to 195708
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   date_received                 27798 non-null  object
 1   product                       27798 non-null  object
 2   sub_product                   19703 non-null  object
 3   issue                         27798 non-null  object
 4   sub_issue                     10654 non-null  object
 5   consumer_complaint_narrative  3237 non-null   object
 6   company_public_response       4179 non-null   object
 7   company                       27798 non-null  object
 8   state                         27562 non-null  object
 9   zipcode                       27574 non-null  object
 10  tags                          3802 non-null   object
 11  consumer_consent_provided     6040 non-null   object
 12  submitted_via                 27798 non-null  object
 13  date_sent_to_com
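
The non-null counts above already show that several columns are mostly empty; for example, only 3237 of the 27798 sampled rows have a `consumer_complaint_narrative`. A quick way to see this as a fraction per column might look like the sketch below, which uses a small made-up frame (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up rows, only to show the pattern.
toy = pd.DataFrame({
    "product": ["Mortgage", "Credit card", "Student loan", "Payday loan"],
    "sub_issue": ["Debt is not mine", None, None, None],
    "tags": [None, "Older American", None, None],
})

# isna() marks missing cells; the column mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the notebook's `df`, the same expression would be `df.isna().mean().sort_values(ascending=False)`.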

In [186]:
# Too many columns have dtype object; inspect their unique values to decide which types to cast them to
for column in df.columns:
    print(f"Column: {column}, Number of unique Values: {len(df[column].unique())}\nValues:\n{df[column].unique()}")

Column: date_received, Number of unique Values: 1590
Values:
['12/05/2013' '03/21/2014' '03/20/2014' ... '05/26/2013' '02/23/2013'
 '04/15/2012']
Column: product, Number of unique Values: 11
Values:
['Debt collection' 'Credit reporting' 'Mortgage' 'Consumer Loan'
 'Bank account or service' 'Student loan' 'Credit card' 'Money transfers'
 'Prepaid card' 'Payday loan' 'Other financial service']
Column: sub_product, Number of unique Values: 44
Values:
['I do not know' 'Credit card' nan 'FHA mortgage' 'Vehicle loan'
 'Conventional fixed mortgage' 'Other (i.e. phone, health club, etc.)'
 'Other bank product/service' 'Other mortgage' 'Non-federal student loan'
 'Payday loan' 'International money transfer'
 'Home equity loan or line of credit' 'Medical' 'Checking account'
 'Conventional adjustable mortgage (ARM)' 'Reverse mortgage'
 'Second mortgage' 'Installment loan' 'General purpose card'
 'Savings account' 'Personal line of credit' 'Auto' 'VA mortgage'
 'Federal student loan' 'Vehicle leas
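
Printing every unique value gets long quickly. When only the number of distinct values per column matters, a more compact summary could be built with `DataFrame.nunique`, as in this sketch on a made-up frame. Passing `dropna=False` counts missing values as one distinct value, which matches what `len(df[column].unique())` reports in the loop above.

```python
import pandas as pd

# Made-up rows, only to show the pattern.
toy = pd.DataFrame({
    "product": ["Mortgage", "Mortgage", "Credit card"],
    "sub_product": ["FHA mortgage", None, "VA mortgage"],
})

# Number of distinct values per column, counting NaN as a value.
cardinality = toy.nunique(dropna=False).sort_values()
print(cardinality)
```

Columns with few distinct values relative to the row count are natural candidates for the `category` dtype.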

In [187]:
# Cast columns to more specific data types to improve the analysis
column_types = {
    'company': 'category',
    'company_public_response': 'category',
    'company_response_to_consumer': 'category',
    'consumer_complaint_narrative': 'string',
    'consumer_consent_provided': 'category',
    'consumer_disputed?': 'category',
    'date_received': 'datetime',
    'date_sent_to_company': 'datetime',
    'issue': 'category',
    'product': 'category',
    'state': 'category',
    'sub_issue': 'category',
    'sub_product': 'category',
    'submitted_via': 'category',
    'tags': 'category',
    'timely_response': 'category',
    'zipcode': 'string'
}

df = set_multiple_columns_datatype(df, column_types)

NameError: name 'set_multiple_columns_datatype' is not defined
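
The `NameError` follows from the failed import in the first cell: `scripts/utility.py` does not expose `set_multiple_columns_datatype`, so the name never gets defined. The real helper is not shown in this notebook. The sketch below is a hypothetical stand-in and rests on two assumptions: that the label `'datetime'` means "parse with `pd.to_datetime`", and that dates use the month/day/year format seen in the `date_received` values above. Any other label is passed straight to `Series.astype`.

```python
import pandas as pd


def set_multiple_columns_datatype(df, column_types):
    """Hypothetical stand-in for the missing helper in scripts/utility.py.

    Returns a copy of df with each listed column cast to the requested type.
    Columns that are not present in df are skipped.
    """
    out = df.copy()
    for column, dtype in column_types.items():
        if column not in out.columns:
            continue
        if dtype == "datetime":
            # Assumed format: dates such as '12/05/2013' (month/day/year).
            # Values that do not parse become NaT instead of raising.
            out[column] = pd.to_datetime(
                out[column], format="%m/%d/%Y", errors="coerce"
            )
        else:
            out[column] = out[column].astype(dtype)
    return out


# Made-up rows, only to exercise the function.
demo = pd.DataFrame({
    "date_received": ["12/05/2013", "03/21/2014"],
    "product": ["Mortgage", "Credit card"],
    "zipcode": ["30301", None],
})
demo = set_multiple_columns_datatype(
    demo,
    {"date_received": "datetime", "product": "category", "zipcode": "string"},
)
print(demo.dtypes)
```

Defining a function like this in `scripts/utility.py`, or directly in a notebook cell, would let the cast cell above run. Whether it matches the author's intended behavior is not known.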

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
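# Plot only the 20 most frequent companies; a bar for every company would be unreadable
top_companies = df['company'].value_counts().nlargest(20).index
sns.countplot(data=df, y='company', order=top_companies)
plt.show()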
