# Classification of Consumer Complaints

The Consumer Financial Protection Bureau publishes the Consumer Complaint Database, a collection of complaints about consumer financial products and services that were sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. 

You have been provided with a dataset of over 350,000 such complaints for 5 common issue types. Your goal is to train a text classification model to identify the issue type based on the consumer complaint narrative. The data can be downloaded from https://drive.google.com/file/d/1Hz1gnCCr-SDGjnKgcPbg7Nd3NztOLdxw/view?usp=share_link 

At the end of the project, your team should prepare a short presentation where you talk about the following:
* What steps did you take to preprocess the data?
* How did a model using unigrams compare to one using bigrams or trigrams?
* How did a count vectorizer compare to a tfidf vectorizer?
* What models did you try and how successful were they? Where did they struggle? Were there issues that the models commonly mixed up?
* What words or phrases were most influential on your models' predictions?

**Bonus:** A larger dataset containing 20 additional categories can be downloaded from https://drive.google.com/file/d/1gW6LScUL-Z7mH6gUZn-1aNzm4p4CvtpL/view?usp=share_link. How well do your models work with these additional categories?

To start with, I thought I would see if I could do a simple sentiment analysis on the data.

##### VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is fully open-sourced under the MIT license.
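
To make "lexicon and rule-based" concrete, here is a toy sketch of the idea. It is not VADER's implementation: the mini-lexicon, its valence values, the negation list and the function name `toy_compound` are all hypothetical. The one part borrowed from VADER is the final step, which squashes the summed valence into the range (-1, 1) as `x / sqrt(x^2 + alpha)` with `alpha = 15`.

```python
import math

# Hypothetical mini-lexicon (word -> valence); values are illustrative only
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
# Hypothetical, very short negation list
NEGATIONS = {"not", "never", "no"}


def toy_compound(text, alpha=15.0):
    """Sum word valences, flip after a negation, then normalise to (-1, 1)."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.strip(".,!?"), 0.0)
        # Rule: a negation word directly before a scored word flips and dampens it
        if valence and i > 0 and tokens[i - 1] in NEGATIONS:
            valence *= -0.74
        total += valence
    # Normalisation step: maps any real-valued sum into (-1, 1)
    return total / math.sqrt(total * total + alpha)


print(toy_compound("great service"))
print(toy_compound("not great"))
print(toy_compound("the account"))
```

The real analyser applies many more rules (punctuation, capitalisation, degree modifiers and so on), so this only shows the general shape of the approach.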

In [2]:
import numpy as np
import pandas as pd
from nltk import download
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [3]:
# Download the VADER lexicon: a scored list of words and jargon that the
# sentiment analyser references when performing sentiment analysis
download("vader_lexicon", quiet=True)

True

In [4]:
complaints = pd.read_csv('../data/complaints.csv')

In [6]:
complaints=complaints.rename(columns={'Consumer complaint narrative':'narrative', 'Issue':'issue'})

In [5]:
analyser = SentimentIntensityAnalyzer()

In [7]:
complaints["review_sentiment"] = complaints["narrative"].apply(lambda x: analyser.polarity_scores(text=str(x))["compound"])

In [8]:
complaints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353432 entries, 0 to 353431
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   narrative         353432 non-null  object 
 1   issue             353432 non-null  object 
 2   review_sentiment  353432 non-null  float64
dtypes: float64(1), object(2)
memory usage: 8.1+ MB


In [9]:
complaints_grouped = complaints.groupby(by="issue")["review_sentiment"].agg(func=["mean","count"]).sort_values(by="count", ascending=False)

In [10]:
complaints_grouped

                                           mean   count
issue
Incorrect information on your report    0.07408  229305
Attempts to collect debt not owed     -0.175104   73163
Communication tactics                 -0.410674   21243
Struggling to pay mortgage            -0.060379   17374
Fraud or scam                         -0.245331   12347


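Unsurprisingly, we can see that the overall sentiment is negative in most categories: four of the five have a negative mean compound score, with "Communication tactics" the most negative (-0.41). The exception is "Incorrect information on your report", by far the largest category, whose mean is slightly positive (0.07).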
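
A mean can hide how the scores are spread within a category. As a minimal sketch, each compound score could be bucketed into a coarse label using the ±0.05 cutoffs that the VADER documentation suggests as a typical convention. The function name `label_sentiment` and the sample scores below are hypothetical and are not taken from the complaints data.

```python
from collections import Counter


def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores for illustration only
sample_scores = [0.62, -0.41, 0.0, -0.03, -0.88]
label_counts = Counter(label_sentiment(score) for score in sample_scores)
print(label_counts)
```

On the real data, something like `complaints["review_sentiment"].apply(label_sentiment)` followed by a count per `issue` would show what share of each category is actually negative.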

In [11]:
# index=False avoids writing the RangeIndex as an extra unnamed column
complaints.to_csv('../data/complaints_sentimentscore.csv', index=False)