# Exploratory Data Analysis - Part 1

### Loading required libraries

In [2]:
library(data.table)
library(readr)
library(dplyr)
library(ggplot2)
library(quanteda)
library(tm)
library(RColorBrewer)
library(wordcloud)
library(openxlsx)
library(scales)

### Loading the data set

In [8]:
data <- read_csv("cc.csv", na = c("NA", "N/A", ""), progress = TRUE)
glimpse(data)

Parsed with column specification:
cols(
  complaint_id = col_integer(),
  date_received = col_datetime(format = ""),
  date_sent_to_company = col_datetime(format = ""),
  company = col_character(),
  product = col_character(),
  sub_product = col_character(),
  issue = col_character(),
  submitted_via = col_character(),
  company_public_response = col_character(),
  state = col_character(),
  zip_code = col_character(),
  company_response = col_character(),
  consumer_disputed = col_character(),
  sub_issue = col_character(),
  complaint_what_happened = col_character(),
  consumer_consent_provided = col_character(),
  timely = col_character(),
  tags = col_character()
)


Observations: 831,000
Variables: 18
$ complaint_id              <int> 759217, 2141773, 2163100, 885638, 1027760...
$ date_received             <dttm> 2014-03-12, 2016-10-01, 2016-10-17, 2014...
$ date_sent_to_company      <dttm> 2014-03-17, 2016-10-05, 2016-10-20, 2014...
$ company                   <chr> "M&T BANK CORPORATION", "TRANSUNION INTER...
$ product                   <chr> "Mortgage", "Credit reporting", "Consumer...
$ sub_product               <chr> "Other mortgage", NA, "Vehicle loan", NA,...
$ issue                     <chr> "Loan modification,collection,foreclosure...
$ submitted_via             <chr> "Referral", "Web", "Web", "Web", "Web", "...
$ company_public_response   <chr> NA, "Company has responded to the consume...
$ state                     <chr> "MI", "AL", "PA", "ID", "VA", "MN", "CA",...
$ zip_code                  <chr> "48382", "352XX", "177XX", "83854", "2323...
$ company_response          <chr> "Closed with explanation", "Closed with e...
$ consumer_dispu
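
The na argument used above tells read_csv which strings to treat as missing. A minimal sketch of that behaviour follows, assuming a small hypothetical file in place of cc.csv; the rows and the explicit col_types are illustrative and are not taken from the real data set.

```r
library(readr)

# Hypothetical toy file standing in for cc.csv (two columns only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "complaint_id,sub_product",
  "1,Other mortgage",
  "2,N/A",
  "3,NA",
  "4,"
), tmp)

# Treat "NA", "N/A", and empty fields as missing, and state the column
# types explicitly instead of letting readr guess them.
toy <- read_csv(
  tmp,
  na = c("NA", "N/A", ""),
  col_types = cols(
    complaint_id = col_integer(),
    sub_product  = col_character()
  )
)

# Number of values in sub_product that were parsed as missing
sum(is.na(toy$sub_product))
```

Passing col_types also suppresses the "Parsed with column specification" message, since nothing is left for readr to guess.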

### Cleaning and Munging data

Several columns hold categorical values but were read in as character vectors, so we convert them to factors.
We also need the number of missing values in each column. The sapply function applies a function to every column of the data set; here it returns the count of NA values per column.

In [6]:
# Count the missing values in each column
data.frame(Missing.Values = sapply(data, function(x) sum(is.na(x))))

# Convert categorical columns from character to factor
data$company <- as.factor(data$company)
data$product <- as.factor(data$product)
data$sub_product <- as.factor(data$sub_product)
data$issue <- as.factor(data$issue)
data$company_public_response <- as.factor(data$company_public_response)
data$state <- as.factor(data$state)
data$company_response <- as.factor(data$company_response)
data$consumer_disputed <- as.factor(data$consumer_disputed)

| Column | Missing.Values |
|---|---|
| complaint_id | 0 |
| date_received | 0 |
| date_sent_to_company | 0 |
| company | 0 |
| product | 0 |
| sub_product | 232556 |
| issue | 0 |
| submitted_via | 0 |
| company_public_response | 597946 |
| state | 7233 |
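
The repeated as.factor calls can also be written as one step over a vector of column names. The following is a sketch on a hypothetical toy data frame, not the notebook's own code: toy and factor_cols are illustrative names, and colSums(is.na(...)) is swapped in as an alternative to the sapply call for counting missing values.

```r
# Hypothetical toy data frame with the same kind of columns as the real data
toy <- data.frame(
  complaint_id = 1:3,
  company = c("A", "B", "A"),
  product = c("Mortgage", NA, "Mortgage"),
  stringsAsFactors = FALSE
)

# Columns to convert, collected in one place (illustrative selection)
factor_cols <- c("company", "product")

# Convert all listed columns from character to factor in one step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Count missing values per column; is.na() on a data frame gives a
# logical matrix, and colSums() adds up the TRUE entries per column
missing <- data.frame(Missing.Values = colSums(is.na(toy)))
missing
```

Keeping the column names in one vector makes it easier to add other categorical columns later without repeating the assignment line.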
